Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Chap 16 and 17
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Inter-cluster distances are maximized; intra-cluster distances are minimized.
These are competing objectives.
The commonest form of unsupervised learning
An effective news-presentation metaphor (e.g., grouping articles about the same story)
Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs
Therefore, to improve search recall:
Cluster docs in corpus a priori
When a query matches a doc D, also return other docs in the cluster containing D
The hope: the query “car” will also return docs containing “automobile”
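As a rough sketch of this expansion step (hypothetical data structures, not from the lecture): assume cluster_of maps each doc id to its cluster id and members_of maps each cluster id to its doc ids.

```python
def expand_results(matching_docs, cluster_of, members_of):
    # Return the matches plus all docs sharing a cluster with some match
    # (cluster_of and members_of are assumed precomputed offline).
    expanded = set(matching_docs)
    for d in matching_docs:
        expanded.update(members_of[cluster_of[d]])
    return expanded
```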
But also for speeding up the search operation
For better visualization/navigation of search results
Issues for clustering
Representation for clustering
Document representation
Vector space? Normalization?
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Ideal: semantic similarity
Practical: term-statistical similarity
Docs as vectors
We will use cosine similarity.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
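For instance, in Python with numpy (a minimal illustration, not part of the lecture's material):

```python
import numpy as np

def cosine_sim(u, v):
    # Cosine of the angle between two doc vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_dist(u, v):
    # A common way to turn the similarity into a distance:
    # 0 for identical directions, up to 2 for opposite ones.
    return 1.0 - cosine_sim(u, v)
```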
Flat algorithms
Create a set of clusters
Usually start with a random (partial) partitioning
Refine it iteratively
K-means clustering
Hierarchical algorithms
Create a hierarchy of clusters (dendrogram)
Bottom-up, agglomerative
Top-down, divisive
Hard clustering: Each document belongs to exactly one cluster
More common and easier to do
Soft clustering: Each document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
News is a good example
Search results is another example
Given: a set of n documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Globally optimal: intractable for many objective functions, since it would require exhaustively enumerating all partitions.
Locally optimal: effective heuristic methods such as the K-means and K-medoids algorithms.
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
Reassignment of instances to clusters is based on distance to the current cluster centroids.
$$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$$
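In numpy terms (an illustration: X is assumed to be the N x M doc-term matrix and members the row indices of cluster c):

```python
import numpy as np

def centroid(X, members):
    # mu(c) = (1/|c|) * sum of the vectors x in cluster c
    return X[members].mean(axis=0)
```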
K-Means Algorithm
Select K random docs {s_1, s_2, …, s_K} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc d_i:
    Assign d_i to the cluster c_r such that dist(d_i, s_r) is minimal.
  For each cluster c_j:
    s_j = μ(c_j)
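A minimal Python sketch of this loop (illustrative assumptions: Euclidean distance, K random rows as seeds; this is not the lecture's reference code):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K random docs as seeds.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each doc goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # doc partition unchanged: converged
        assign = new_assign
        # Update step: recompute each centroid as the mean of its cluster.
        for j in range(K):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids
```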
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
Termination conditions
Several possibilities:
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Convergence
Why should the K-means algorithm ever reach a fixed point?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm
EM is known to converge
Number of iterations could be large
But in practice usually isn’t
Convergence of K-Means
Define goodness measure of cluster c as sum of squared distances from cluster centroid:
$$G(c, s) = \sum_{d_j \in c} (d_j - s_c)^2 \quad \text{(sum over all docs } d_j \text{ in cluster } c\text{)}$$
$$G(C, s) = \sum_{c \in C} G(c, s)$$
Reassignment monotonically decreases G
It is a coordinate descent algorithm (optimize one component at a time)
At any step we have some value for G(C, s).
1) Fix s, optimize C: assign each doc to its closest centroid, so G(C′, s) ≤ G(C, s).
2) Fix C′, optimize s: take the new centroids, so G(C′, s′) ≤ G(C′, s) ≤ G(C, s).
The new cost is never larger than the previous one, so the procedure reaches a local minimum.
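G is a direct transcription in Python (assuming the X/assign/centroids conventions of the earlier K-means sketch); evaluating it after every iteration would confirm the monotone decrease:

```python
import numpy as np

def G(X, assign, centroids):
    # G(C, s) = sum over clusters c of sum_{d_j in c} (d_j - s_c)^2
    return sum(((X[assign == j] - centroids[j]) ** 2).sum()
               for j in range(len(centroids)))
```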
Time Complexity
There are K centroids.
Each doc/centroid consists of M dimensions.
Computing the distance between two vectors takes O(M) time.
Reassigning clusters: Each doc compared with all centroids, O(N x K x M) time.
Computing centroids: Each doc gets added once to some centroid, O(NM) time.
Assume these two steps are each done once for I iterations: O(IKNM) time overall.
Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic:
pick each new seed as the doc least similar to any existing centroid, or
pick seeds according to a probability distribution proportional to the (squared) distance from the current centroids (as in k-means++).
Example showing sensitivity to seeds. [Figure: six points A–F]
If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
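A sketch of the distance-based seeding (assumption: each new seed is drawn with probability proportional to its squared distance from the nearest seed so far, the standard k-means++ rule; the slide's exact rule may differ):

```python
import numpy as np

def pick_seeds(X, K, seed=0):
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]  # first seed: uniform at random
    for _ in range(K - 1):
        # Squared distance from each doc to its nearest current seed.
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)
```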
Number of clusters K is given: partition the n docs into a predetermined number of clusters.
Otherwise, finding the “right” number of clusters is part of the problem.
Can usually take an algorithm for one flavor and convert to the other.
A variant of K-means (bisecting K-means) can produce either a partitional or a hierarchical clustering.
SSE = G(C, s) is called the Sum of Squared Error.
Pros
Simple
Fast for low-dimensional data
It can find pure sub-clusters if a large number of clusters is specified (at the risk of over-partitioning)
Cons
K-means has trouble with non-globular clusters and with clusters of different sizes and densities
K-Means will not identify outliers
K-means is restricted to data for which there is a notion of a center (centroid)
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
Possibility: recursive application of a partitional clustering algorithm
[Figure: dendrogram of the animal kingdom: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]
No assumption of any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
They may correspond to meaningful taxonomies
Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Agglomerative Clustering (HAC)
Starts with each doc in a separate cluster
Then repeatedly join the most similar pair of clusters, until there is only one cluster.
Careful: at every step we must compute the similarity among all pairs of clusters, and we merge the closest (most similar) pair
The history of merging forms a binary tree or hierarchy.
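A naive Python sketch of this loop (illustrative only; inter-cluster similarity is taken here as single link, i.e. the max over cross-cluster pairs, anticipating the definitions below):

```python
import numpy as np

def naive_hac(S):
    # S: N x N doc-similarity matrix; returns the merge history (the dendrogram).
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(S[i][j] for i in clusters[a] for j in clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        merges.append((clusters[a], clusters[b], best))
        clusters[a] = clusters[a] + clusters[b]  # join the most similar pair
        del clusters[b]
    return merges
```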
Similarity of pair of clusters
Single-link
Similarity of the closest points
Complete-link
Similarity of the farthest points
Centroid
Similarity among centroids
Average-link
Similarity = average similarity over all pairs of items, one from each cluster
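Three of the four notions can be sketched directly on a pairwise doc-similarity matrix S (the centroid variant needs the doc vectors themselves, so it is omitted here; an illustration, not the lecture's code):

```python
import numpy as np

def cluster_sim(S, ci, cj, method="single"):
    # ci, cj: index lists of the two clusters.
    cross = S[np.ix_(ci, cj)]  # similarities of all cross-cluster pairs
    if method == "single":
        return cross.max()     # closest pair
    if method == "complete":
        return cross.min()     # farthest pair
    if method == "average":
        return cross.mean()    # average over all pairs
    raise ValueError(method)
```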
Start with clusters of individual points and a proximity matrix
[Figure: points p1…p5 and the initial proximity matrix]
After some merging steps, we have some clusters
[Figure: clusters C1…C5 and their proximity matrix]
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
The question is “How do we update the proximity matrix?”
[Figure: proximity matrix after the merge, with the row and column for C2 ∪ C5 marked “?”]
Single link (MIN)
Similarity of two clusters is based on the two most similar (closest) points in the different clusters
Determined by one pair of points, i.e., by one link in the proximity graph.
Similarity matrix:
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
[Figure: single-link dendrogram over items 1–5]
Strength: can handle non-elliptical shapes. [Figure: original points vs. the two clusters found]
Limitation: sensitive to noise and outliers. [Figure: original points vs. the two clusters found]
Complete link (MAX)
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
Determined by all pairs of points in the two clusters
Same similarity matrix as above. [Figure: complete-link dendrogram over items 1–5]
Strength: less susceptible to noise and outliers. [Figure: original points vs. the two clusters found]
Limitations: tends to break large clusters; biased towards globular clusters. [Figure: original points vs. the two clusters found]
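To see the different merge orders concretely, one can feed the slides' similarity matrix to scipy (a worked check, not from the lecture; distances are taken as 1 - similarity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity matrix for I1..I5 from the slides, turned into distances.
S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = squareform(1.0 - S)                 # condensed distance matrix
print(linkage(D, method="single"))      # merge order under single link
print(linkage(D, method="complete"))    # merge order under complete link
```

Under both schemes the first merge is I1 with I2 (distance 0.10), but the later merges differ.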
Group average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:
$$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \cdot |\text{Cluster}_j|}$$
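This formula is a one-liner in numpy (P is assumed to be the full pairwise proximity matrix, ci and cj the index lists of the two clusters):

```python
import numpy as np

def group_average(P, ci, cj):
    # Sum of all cross-cluster proximities, divided by |Ci| * |Cj|.
    return P[np.ix_(ci, cj)].sum() / (len(ci) * len(cj))
```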
[Figure: dendrograms produced by MIN (single link), MAX (complete link), and group average on the same set of six points]