Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Objectives of Cluster Analysis
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Competing objectives
The most common form of unsupervised learning
Ideal: semantic similarity
Practical: term-statistical similarity
Docs as vectors
Similarity = cosine similarity, LSH,…
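As a concrete illustration of term-statistical similarity, here is a minimal cosine-similarity sketch over toy term-weight vectors (the function name and the weights are hypothetical, chosen just for this example):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two toy 3-term document vectors (hypothetical tf weights).
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]
print(cosine_similarity(d1, d2))  # parallel vectors, cosine = 1.0 up to rounding
```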
Clustering Algorithms
Flat algorithms
Create a set of clusters
Usually start with a random (partial) partitioning
Refine it iteratively
K-means clustering
Hierarchical algorithms
Create a hierarchy of clusters (dendrogram)
Bottom-up, agglomerative
Top-down, divisive
Hard clustering: Each document belongs to exactly one cluster
More common and easier to do
Soft clustering: Each document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
News articles are a good example
Search results are another example
Flat & Partitioning Algorithms
Given: a set of n documents and the number K
Find: a partition in K clusters that optimizes the chosen partitioning criterion
Globally optimal
Would require exhaustively enumerating all partitions
Intractable for many objective functions
Locally optimal
Effective heuristic methods: K-means and K-medoids algorithms
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
Reassignment of instances to clusters is based on distance to the current cluster centroids.
μ(c) = (1/|c|) · Σ_{x∈c} x
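The centroid definition can be checked numerically with a small sketch (the function name and the toy cluster are hypothetical):

```python
def centroid(points):
    """mu(c) = (1/|c|) * sum of the vectors x in cluster c."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

# Toy cluster of three 2-D points.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(centroid(cluster))  # [3.0, 2.0]
```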
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seed centroids.
Until clustering converges (or other stopping criterion):
For each doc di:
Assign di to the cluster cr such that dist(di, sr) is minimal.
For each cluster cj: sj = μ(cj)
Sec. 16.4
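The algorithm above (Lloyd's iteration) can be sketched as follows; the function name, the Euclidean distance choice, and the random seeding are assumptions of this sketch, not part of the slides:

```python
import random

def kmeans(docs, k, max_iter=100, seed=0):
    """K-means: pick K random docs as seed centroids, then alternate
    reassignment and centroid recomputation until the partition is stable."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [None] * len(docs)
    for _ in range(max_iter):
        # Assign each doc to the cluster whose centroid is nearest (squared Euclidean).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(d, centroids[j])))
            for d in docs
        ]
        if new_assignment == assignment:  # doc partition unchanged -> converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its cluster's members.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids

# Two well-separated toy blobs.
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, cents = kmeans(docs, k=2)
```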
K-Means Example (K=2)
[Figure: pick two seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged]
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Time Complexity
There are K centroids
Each doc/centroid consists of M dimensions
Computing distance btw vectors is O(M) time.
Reassigning clusters: Each doc compared with all centroids, O(KNM) time.
Computing centroids: Each doc gets added once to some centroid, O(NM) time.
Assume these two steps are each done once for I iterations: O(IKNM).
How Many Clusters?
Number of clusters K is given
Partition n docs into predetermined number of clusters
Finding the “right” number of clusters is part of the problem
Can usually take an algorithm for one flavor and convert to the other.
Bisecting K-means: a variant that can produce a partitional or a hierarchical clustering
SSE = Sum of Squared Errors; the cluster with the largest SSE is picked for the next split
Bisecting K-means Example
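A minimal sketch of bisecting K-means, under some assumptions not in the slides: the helper names are hypothetical, the 2-means step is seeded deterministically (first point and the point farthest from it) instead of randomly, and distances are Euclidean.

```python
def mean(points):
    return [sum(xs) / len(points) for xs in zip(*points)]

def sse(points):
    """Sum of squared errors of a cluster around its centroid."""
    c = mean(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def two_means(points, max_iter=20):
    """Split one cluster in two with K-means (K=2), deterministic seeding."""
    dist2 = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q))
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))  # farthest point from c0
    a, b = points, []
    for _ in range(max_iter):
        a = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        b = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not a or not b:
            break
        nc0, nc1 = mean(a), mean(b)
        if nc0 == c0 and nc1 == c1:  # centroids unchanged -> converged
            break
        c0, c1 = nc0, nc1
    return a, b

def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE until K clusters."""
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        a, b = two_means(worst)
        clusters += [a, b]
    return clusters

docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
clusters = bisecting_kmeans(docs, 2)
```

Stopping after K splits gives a flat partition; keeping the whole split history instead gives a top-down hierarchy, which is how the same variant yields both flavors.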
Pros
Simple
Fast for low dimensional data
Good performance on globular data
Cons
K-Means is restricted to data for which a notion of center (centroid) exists
K-Means cannot handle non-globular data, or clusters of different sizes and densities
K-Means will not identify outliers
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
One approach: recursive application of a partitional clustering algorithm
[Figure: example dendrogram — animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]
Ch. 17
No assumption of any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
Hierarchical Agglomerative Clustering (HAC)
Starts with each doc in a separate cluster
Then repeatedly joins the closest pair of clusters, until there is only one cluster.
The history of mergings forms a binary tree or hierarchy.
The closest pair drives the mergings: how is it defined?
Sec. 17.1
Closest pair of clusters
Single-link
Similarity of the closest points, the most cosine-similar
Complete-link
Similarity of the farthest points, the least cosine-similar
Centroid
Clusters whose centroids are the closest (or most cosine-similar)
Average-link
Clusters whose average distance/cosine between pairs of elements is the smallest
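The four criteria above can be plugged into one naive O(n³) HAC sketch; the function names are hypothetical, and Euclidean distance stands in for cosine similarity here (centroid linkage is omitted to keep the sketch short):

```python
def hac(points, linkage="single"):
    """Naive agglomerative clustering: start with singleton clusters,
    repeatedly merge the closest pair under the chosen linkage,
    and record the merge history (the dendrogram)."""
    dist = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    def cluster_dist(a, b):
        pair_dists = [dist(p, q) for p in a for q in b]
        if linkage == "single":    # closest points of the two clusters
            return min(pair_dists)
        if linkage == "complete":  # farthest points of the two clusters
            return max(pair_dists)
        return sum(pair_dists) / len(pair_dists)  # average-link

    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Three 1-D points: the two nearest merge first, then the outlier joins.
history = hac([[0.0], [1.0], [10.0]], linkage="single")
```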
How to Define Inter-Cluster Similarity
[Figure: points p1…p5 and their proximity matrix; candidate definitions are single link (MIN), complete link (MAX), centroids, and the average of pairwise proximities]