Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Objectives of Cluster Analysis
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Competing objectives
The most common form of unsupervised learning
Ideal: semantic similarity
Practical: term-statistical similarity
Docs as vectors
Similarity = cosine similarity, LSH,…
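As a concrete illustration of term-statistical similarity, here is a minimal cosine-similarity sketch over toy term-weight vectors (the function name and the weights are hypothetical, chosen just for this example):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two toy 3-term document vectors (hypothetical tf weights).
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]
print(cosine_similarity(d1, d2))  # parallel vectors, cosine = 1.0 up to rounding
```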
Clustering Algorithms
Flat algorithms
Create a set of clusters
Usually start with a random (partial) partitioning
Refine it iteratively
K-means clustering
Hierarchical algorithms
Create a hierarchy of clusters (dendrogram)
Bottom-up, agglomerative
Top-down, divisive
Hard clustering: Each document belongs to exactly one cluster
More common and easier to do
Soft clustering: Each document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
News articles are a good example
Search results are another example
Flat & Partitioning Algorithms
Given: a set of n documents and the number K
Find: a partition in K clusters that optimizes the chosen partitioning criterion
Globally optimal
Would require exhaustively enumerating all partitions
Intractable for many objective functions
Locally optimal
Effective heuristic methods: K-means and K-medoids algorithms
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
Reassignment of instances to clusters is based on distance to the current cluster centroids.
μ(c) = (1/|c|) · Σ_{x∈c} x
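The centroid definition can be checked numerically with a small sketch (the function name and the toy cluster are hypothetical):

```python
def centroid(points):
    """mu(c) = (1/|c|) * sum of the vectors x in cluster c."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

# Toy cluster of three 2-D points.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(centroid(cluster))  # [3.0, 2.0]
```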
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seed centroids.
Until clustering converges (or other stopping criterion):
For each doc di:
Assign di to the cluster cr such that dist(di, sr) is minimal.
For each cluster cj: sj = μ(cj)
Sec. 16.4
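The algorithm above (Lloyd's iteration) can be sketched as follows; the function name, the Euclidean distance choice, and the random seeding are assumptions of this sketch, not part of the slides:

```python
import random

def kmeans(docs, k, max_iter=100, seed=0):
    """K-means: pick K random docs as seed centroids, then alternate
    reassignment and centroid recomputation until the partition is stable."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [None] * len(docs)
    for _ in range(max_iter):
        # Assign each doc to the cluster whose centroid is nearest (squared Euclidean).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(d, centroids[j])))
            for d in docs
        ]
        if new_assignment == assignment:  # doc partition unchanged -> converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its cluster's members.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids

# Two well-separated toy blobs.
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, cents = kmeans(docs, k=2)
```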
K-Means Example (K=2)
[Figure: pick two seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged]
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Time Complexity
There are K centroids
Each doc/centroid consists of M dimensions
Computing distance btw vectors is O(M) time.
Reassigning clusters: Each doc compared with all centroids, O(KNM) time.
Computing centroids: Each doc gets added once to some centroid, O(NM) time.
Assume these two steps are each done once for I iterations: O(IKNM).
How Many Clusters?
Number of clusters K is given
Partition n docs into predetermined number of clusters
Finding the “right” number of clusters is part of the problem
Can usually take an algorithm for one flavor and convert to the other.
Bisecting K-means: a variant that can produce a partitional or a hierarchical clustering
SSE = Sum of Squared Errors; the cluster with the largest SSE is picked for the next split
Bisecting K-means Example
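A minimal sketch of bisecting K-means, under some assumptions not in the slides: the helper names are hypothetical, the 2-means step is seeded deterministically (first point and the point farthest from it) instead of randomly, and distances are Euclidean.

```python
def mean(points):
    return [sum(xs) / len(points) for xs in zip(*points)]

def sse(points):
    """Sum of squared errors of a cluster around its centroid."""
    c = mean(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def two_means(points, max_iter=20):
    """Split one cluster in two with K-means (K=2), deterministic seeding."""
    dist2 = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q))
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))  # farthest point from c0
    a, b = points, []
    for _ in range(max_iter):
        a = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        b = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not a or not b:
            break
        nc0, nc1 = mean(a), mean(b)
        if nc0 == c0 and nc1 == c1:  # centroids unchanged -> converged
            break
        c0, c1 = nc0, nc1
    return a, b

def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE until K clusters."""
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        a, b = two_means(worst)
        clusters += [a, b]
    return clusters

docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
clusters = bisecting_kmeans(docs, 2)
```

Stopping after K splits gives a flat partition; keeping the whole split history instead gives a top-down hierarchy, which is how the same variant yields both flavors.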
Pros
Simple
Fast for low dimensional data
Good performance on globular data
Cons
K-Means is restricted to data for which a notion of center (centroid) exists
K-Means cannot handle non-globular data, or clusters of different sizes and densities
K-Means will not identify outliers
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
One approach: recursive application of a partitional clustering algorithm
[Figure: example dendrogram — animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]
Ch. 17
No assumption of any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
Hierarchical Agglomerative Clustering (HAC)
Starts with each doc in a separate cluster
Then repeatedly joins the closest pair of clusters, until there is only one cluster.
The history of mergings forms a binary tree or hierarchy.
The closest pair drives the mergings: how is it defined?
Sec. 17.1
Closest pair of clusters
Single-link
Similarity of the closest points, the most cosine-similar
Complete-link
Similarity of the farthest points, the least cosine-similar
Centroid
Clusters whose centroids are the closest (or most cosine-similar)
Average-link
Clusters whose average distance/cosine between pairs of elements is the smallest
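The four criteria above can be plugged into one naive O(n³) HAC sketch; the function names are hypothetical, and Euclidean distance stands in for cosine similarity here (centroid linkage is omitted to keep the sketch short):

```python
def hac(points, linkage="single"):
    """Naive agglomerative clustering: start with singleton clusters,
    repeatedly merge the closest pair under the chosen linkage,
    and record the merge history (the dendrogram)."""
    dist = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    def cluster_dist(a, b):
        pair_dists = [dist(p, q) for p in a for q in b]
        if linkage == "single":    # closest points of the two clusters
            return min(pair_dists)
        if linkage == "complete":  # farthest points of the two clusters
            return max(pair_dists)
        return sum(pair_dists) / len(pair_dists)  # average-link

    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Three 1-D points: the two nearest merge first, then the outlier joins.
history = hac([[0.0], [1.0], [10.0]], linkage="single")
```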
How to Define Inter-Cluster Similarity
[Figure: points p1…p5 and their proximity matrix; candidate definitions are single link (MIN), complete link (MAX), centroids, and the average of pairwise proximities]