Paolo Ferragina Clustering


(1)

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(2)

Objectives of Cluster Analysis

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Inter-cluster distances are maximized

Intra-cluster distances are minimized

Competing objectives

The commonest form of unsupervised learning

(3)

 Ideal: semantic similarity

 Practical: term-statistical similarity

 Docs as vectors

 Similarity = cosine similarity, LSH,…
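As a minimal sketch of docs-as-vectors with cosine similarity (the term-weight vectors below are invented purely for illustration; a real system would use tf-idf weights over the vocabulary):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Toy term-weight vectors over a 4-term vocabulary (values made up).
doc1 = np.array([0.5, 0.8, 0.0, 0.3])
doc2 = np.array([0.4, 0.9, 0.1, 0.0])
doc3 = np.array([0.0, 0.1, 0.9, 0.7])

print(cosine_similarity(doc1, doc2))  # high: similar term statistics
print(cosine_similarity(doc1, doc3))  # low: different term statistics
```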

(4)

Clustering Algorithms

Flat algorithms

Create a set of clusters

Usually start with a random (partial) partitioning

Refine it iteratively

K-means clustering

Hierarchical algorithms

Create a hierarchy of clusters (dendrogram)

Bottom-up, agglomerative

Top-down, divisive

(5)

Hard clustering: Each document belongs to exactly one cluster

More common and easier to do

Soft clustering: Each document can belong to more than one cluster.

Makes more sense for applications like creating browsable hierarchies

News is a typical example

Search results is another example

(6)

Flat & Partitioning Algorithms

Given: a set of n documents and the number K

Find: a partition into K clusters that optimizes the chosen partitioning criterion

 Globally optimal

Would require exhaustively enumerating all partitions; intractable for many objective functions

 Locally optimal

Effective heuristic methods: K-means and K-medoids algorithms

(7)

K-Means

 Assumes documents are real-valued vectors.

 Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

 Reassignment of instances to clusters is based on distance to the current cluster centroids.

$$\vec{\mu}(c) \;=\; \frac{1}{|c|}\sum_{\vec{x}\in c}\vec{x}$$

(8)

K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seed centroids.

Until clustering converges (or another stopping criterion is met):

For each doc di: assign di to the cluster cr such that dist(di, sr) is minimal.

For each cluster cj: recompute its centroid sj = μ(cj).

Sec. 16.4
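A minimal NumPy sketch of the algorithm above, stopping when the doc partition no longer changes; variable names and the stopping rule are illustrative choices, not taken from the slides:

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """K-means sketch. X: (N, M) array of document vectors.
    Returns cluster assignments and centroids."""
    rng = np.random.default_rng(seed)
    # Pick K random documents as seed centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = None
    for _ in range(max_iters):
        # Reassign: each document goes to its nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # doc partition unchanged -> converged
        assign = new_assign
        # Recompute each centroid as the mean of its assigned documents.
        for j in range(K):
            members = X[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assign, centroids

# Toy usage with random data (illustrative only).
X = np.random.default_rng(1).normal(size=(20, 5))
labels, centers = k_means(X, K=2)
```

Each iteration performs exactly the two steps costed on the Time Complexity slide: reassignment and centroid recomputation.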

(9)

K Means Example (K=2)

[Figure: iterations of K-means with K=2. Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters: converged.]

(10)

Termination conditions

 Several possibilities, e.g.,

 A fixed number of iterations.

 Doc partition unchanged.

 Centroid positions don’t change.

Sec. 16.4

(11)

Time Complexity

 There are K centroids.

 Each doc/centroid consists of M dimensions.

Computing the distance between two vectors takes O(M) time.

Reassigning clusters: Each doc compared with all centroids, O(KNM) time.

Computing centroids: Each doc gets added once to some centroid, O(NM) time.

Assume these two steps are each done once for I iterations: O(IKNM).
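Combining the two per-iteration costs above gives the stated bound:

$$\underbrace{O(KNM)}_{\text{reassignment}} \;+\; \underbrace{O(NM)}_{\text{centroid update}} \;\text{per iteration} \quad\Longrightarrow\quad O(IKNM) \;\text{over } I \text{ iterations}$$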

(12)

How Many Clusters?

Number of clusters K is given

Partition n docs into a predetermined number of clusters

Finding the “right” number of clusters is part of the problem

 Can usually take an algorithm for one flavor and convert to the other.

(13)

Bisecting K-means

A variant of K-means that can produce either a partitional or a hierarchical clustering.

SSE = Sum of Squared Errors

(14)

Bisecting K-means Example
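Since the example on this slide is a figure, here is a minimal sketch of bisecting K-means, reusing the k_means sketch from slide (8); the SSE-based choice of which cluster to split and all helper names are illustrative assumptions:

```python
import numpy as np

def sse(points, centroid):
    """Sum of squared errors of points around their centroid."""
    return float(((points - centroid) ** 2).sum())

def bisecting_k_means(X, K):
    """Repeatedly split the cluster with the largest SSE using 2-means,
    until K clusters are obtained. Returns a list of index arrays."""
    clusters = [np.arange(len(X))]   # start with one cluster holding all docs
    while len(clusters) < K:
        # Among splittable clusters (>= 2 docs), pick the one with largest SSE.
        candidates = [i for i, idx in enumerate(clusters) if len(idx) >= 2]
        i = max(candidates,
                key=lambda j: sse(X[clusters[j]], X[clusters[j]].mean(axis=0)))
        worst = clusters.pop(i)
        labels, _ = k_means(X[worst], K=2)   # bisect with 2-means
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters
```

A fuller version would run several trial bisections of the chosen cluster and keep the split with the lowest total SSE.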

(15)

Pros

Simple

Fast for low dimensional data

Good performance on globular data

Cons

K-Means is restricted to data which has the notion of a center (centroid)

K-Means cannot handle non-globular data, nor clusters of very different sizes and densities

K-Means will not identify outliers

(16)

Hierarchical Clustering

 Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents

 One approach: recursive application of a partitional clustering algorithm

[Dendrogram example: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]

Ch. 17

(17)

 No assumption of any particular number of clusters

Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

(18)

Hierarchical Agglomerative Clustering (HAC)

Starts with each doc in a separate cluster

Then repeatedly joins the closest pair of clusters, until there is only one cluster.

 The history of mergings forms a binary tree or hierarchy.

 The closest pair drives the merges: how is it defined?

Sec. 17.1

(19)

Closest pair of clusters

Single-link

Similarity of the closest points, the most cosine-similar

Complete-link

Similarity of the farthest points, the least cosine-similar

Centroid

Clusters whose centroids are the closest (or most cosine- similar)

Average-link

Clusters whose average distance/cosine between pairs of elements is the smallest
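As a sketch of these four criteria in practice, assuming SciPy is available, scipy.cluster.hierarchy offers one linkage method per criterion (Euclidean distance is used here for simplicity, whereas the slides also mention cosine); cutting the resulting dendrogram into a desired number of clusters, as on slide (17), is done with fcluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors (values are illustrative only).
X = np.random.default_rng(0).normal(size=(10, 4))

# One linkage matrix per merging criterion described above.
for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)                    # bottom-up merges -> dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut to obtain 3 clusters
    print(method, labels)
```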

(20)

How to Define Inter-Cluster Similarity

[Figure: five points p1…p5 and their proximity matrix; inter-cluster similarity can be defined via single link (MIN), complete link (MAX), centroids, or group average.]

(21)-(24)

[The same proximity-matrix figure is repeated on these slides, highlighting in turn the MIN, MAX, centroid, and group-average criteria.]
