
(1)

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Chapters 16 and 17

(2)

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Two competing objectives:

• Intra-cluster distances are minimized
• Inter-cluster distances are maximized

Clustering is the most common form of unsupervised learning.

(3)

Clustering provides an effective news presentation metaphor (e.g., grouping articles that cover the same story).

(4)

Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.

Therefore, to improve search recall:

• Cluster the docs in the corpus a priori
• When a query matches a doc D, also return the other docs in the cluster containing D

Hope if we do this: the query “car” will also return docs containing “automobile”.

Clustering can also be used to speed up the search operation.

(5)

For better visualization/navigation of search results

(6)

Issues for clustering

• Representation for clustering
  • Document representation: vector space? Normalization?
  • Need a notion of similarity/distance
• How many clusters?
  • Fixed a priori?
  • Completely data driven?

(7)

• Ideal: semantic similarity
• Practical: term-statistical similarity
  • Docs as vectors
  • We will use cosine similarity
  • For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs; see the sketch below.
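The slides contain no code; as an illustration only, here is a minimal Python sketch of cosine similarity and the distance derived from it, assuming docs are already given as (e.g., tf-idf) weight vectors. The function names and the vectors `doc_a` and `doc_b` are hypothetical.

```python
# Minimal sketch: cosine similarity between two docs represented as
# term-weight vectors, and the derived cosine distance.
import numpy as np

def cosine_similarity(d1: np.ndarray, d2: np.ndarray) -> float:
    """Cosine of the angle between two document vectors."""
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2 / denom) if denom > 0 else 0.0

def cosine_distance(d1: np.ndarray, d2: np.ndarray) -> float:
    """1 - cosine similarity; lies in [0, 1] for non-negative weight vectors."""
    return 1.0 - cosine_similarity(d1, d2)

# Hypothetical example: two tiny tf-idf-like vectors over a 4-term vocabulary.
doc_a = np.array([0.5, 0.0, 1.2, 0.3])
doc_b = np.array([0.4, 0.1, 1.0, 0.0])
print(cosine_similarity(doc_a, doc_b), cosine_distance(doc_a, doc_b))
```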

(8)

Flat algorithms

• Create a set of clusters
• Usually start with a random (partial) partitioning, then refine it iteratively
• Example: K-means clustering

Hierarchical algorithms

• Create a hierarchy of clusters (dendrogram)
  • Bottom-up (agglomerative)
  • Top-down (divisive)

(9)

Hard clustering: each document belongs to exactly one cluster. More common and easier to do.

Soft clustering: each document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies; news and search results are proper examples.

(10)

Given: a set of n documents and the number K

Find: a partition into K clusters that optimizes the chosen partitioning criterion

• Globally optimal: intractable for many objective functions, since it would require exhaustively enumerating all partitions
• Locally optimal: effective heuristic methods, such as the K-means and K-medoids algorithms

(11)

K-Means

• Assumes documents are real-valued vectors.
• Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:

  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$

• Reassignment of instances to clusters is based on the distance to the current cluster centroids.

(12)

K-Means Algorithm

Select K random docs {s_1, s_2, …, s_K} as seeds.
Until clustering converges (or another stopping criterion holds):
  For each doc d_i:
    Assign d_i to the cluster c_r such that dist(d_i, s_r) is minimal.
  For each cluster c_j:
    s_j = μ(c_j)

(A runnable sketch of this loop is given below.)

(13)

K-means example (K = 2)

[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.]

(14)

Termination conditions

Several possibilities:

• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.

(15)

Convergence

Why should the K-means algorithm ever reach a fixed point?

K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm

• EM is known to converge
• The number of iterations could be large, but in practice it usually isn’t

(16)

Convergence of K-Means

Define the goodness measure of a cluster c as the sum of squared distances from the cluster centroid:

$G(c, s) = \sum_{d_j \in c} (d_j - s_c)^2$  (sum over all docs $d_j$ in cluster $c$)

$G(C, s) = \sum_{c} G(c, s)$

Reassignment monotonically decreases G. K-means is a coordinate descent algorithm (it optimizes one component at a time). At any step we have some value for G(C, s):

1) Fix s, optimize C: assign each doc to its closest centroid, so G(C′, s) ≤ G(C, s)
2) Fix C′, optimize s: take the new centroids, so G(C′, s′) ≤ G(C′, s) ≤ G(C, s)

The new cost is no larger than the previous one, so the algorithm converges to a local minimum. (A small numeric check of this monotone decrease is sketched below.)
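For illustration only (not from the slides): a small self-contained NumPy check that the SSE G(C, s) never increases across K-means steps. The data, K, and iteration count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # hypothetical docs as vectors
K = 4
centroids = X[rng.choice(len(X), K, replace=False)].copy()

def sse(X, assign, centroids):
    """G(C, s): sum of squared distances of each doc from its cluster centroid."""
    return float(((X - centroids[assign]) ** 2).sum())

history = []
for _ in range(20):
    # 1) Fix s, optimize C: assign each doc to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    history.append(sse(X, assign, centroids))   # G(C', s)
    # 2) Fix C', optimize s: recompute centroids as cluster means.
    for j in range(K):
        if (assign == j).any():
            centroids[j] = X[assign == j].mean(axis=0)
    history.append(sse(X, assign, centroids))   # G(C', s')

# Every recorded step is non-increasing, as claimed on the slide.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
print(history[:6])
```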

(17)

Time Complexity

• There are K centroids
• Each doc/centroid consists of M dimensions, so computing the distance between two vectors takes O(M) time
• Reassigning clusters: each doc is compared with all centroids, O(NKM) time
• Computing centroids: each doc gets added once to some centroid, O(NM) time
• Assume these two steps are each done once for I iterations: O(IKNM) total (a worked example follows)
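To make the bound concrete, a worked example with hypothetical numbers (not from the slides): assume $N = 10^5$ docs, $M = 10^4$ dimensions, $K = 100$ clusters, and $I = 10$ iterations. Then

$I \cdot K \cdot N \cdot M \;=\; 10 \cdot 100 \cdot 10^5 \cdot 10^4 \;=\; 10^{12}$

distance-coordinate operations for the reassignment steps alone.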

(18)

Seed Choice

• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
• Select good seeds using a heuristic:
  • pick the doc least similar to any existing centroid, or
  • pick according to a probability distribution that depends on the distance from the current centroids (the farther a doc is from them, the more likely it is to be picked); a sketch of this seeding is given below.

In the slide’s figure (six points A–F): if you start with B and E as centroids you converge to {A, B, C} and {D, E, F}; if you start with D and F you converge to {A, B, D, E} and {C, F}. This example shows the sensitivity to seeds.
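For illustration only: a minimal sketch of the distance-based seeding heuristic described above, choosing each new seed with probability proportional to its squared distance from the nearest already-chosen seed (the k-means++ rule; the slide does not name it, so treat the exact weighting as my assumption).

```python
import numpy as np

def pick_seeds(docs: np.ndarray, K: int, rng=None) -> np.ndarray:
    """Choose K seeds: the first uniformly at random, the rest with probability
    proportional to the squared distance from the closest seed chosen so far."""
    rng = rng or np.random.default_rng()
    seeds = [docs[rng.integers(len(docs))]]
    for _ in range(K - 1):
        # Squared distance of every doc from its nearest current seed.
        d2 = np.min(
            np.linalg.norm(docs[:, None, :] - np.array(seeds)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum() if d2.sum() > 0 else np.full(len(docs), 1.0 / len(docs))
        seeds.append(docs[rng.choice(len(docs), p=probs)])
    return np.array(seeds)

# Hypothetical usage.
X = np.random.default_rng(2).normal(size=(100, 3))
print(pick_seeds(X, K=4).shape)   # (4, 3)
```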

(19)

• The number of clusters K is given: partition the n docs into the predetermined number of clusters
• Alternatively, finding the “right” number of clusters is part of the problem
• One can usually take an algorithm for one flavor and convert it to the other

(20)

A variant of K-means can produce either a partitional or a hierarchical clustering.

SSE = G(C, s) is called the Sum of Squared Error. (A hedged sketch of one such variant is given below.)
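The slide does not name the variant; one well-known algorithm matching this description is bisecting K-means, which repeatedly splits the cluster with the largest SSE using 2-means. The sketch below is my own illustration under that assumption, not necessarily the exact variant intended in the lecture.

```python
import numpy as np

def two_means(X, iters=20, rng=None):
    """Split one cluster into two with plain 2-means; returns a boolean mask."""
    rng = rng or np.random.default_rng()
    c = X[rng.choice(len(X), 2, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        a = d.argmin(axis=1)
        for j in (0, 1):
            if (a == j).any():
                c[j] = X[a == j].mean(axis=0)
    return a == 0

def sse(X):
    """Sum of squared distances from the cluster mean."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(X, K, rng=None):
    """Start from one cluster; repeatedly bisect the cluster with the largest SSE."""
    rng = rng or np.random.default_rng()
    clusters = [np.arange(len(X))]              # each cluster = array of doc indices
    while len(clusters) < K:
        candidates = [i for i in range(len(clusters)) if len(clusters[i]) >= 2]
        worst = max(candidates, key=lambda i: sse(X[clusters[i]]))
        idx = clusters.pop(worst)
        mask = two_means(X[idx], rng=rng)
        if mask.all() or not mask.any():        # degenerate split: fall back (sketch only)
            mask = np.arange(len(idx)) < len(idx) // 2
        clusters.extend([idx[mask], idx[~mask]])
    return clusters

# Hypothetical usage.
X = np.random.default_rng(3).normal(size=(120, 2))
parts = bisecting_kmeans(X, K=4)
print([len(p) for p in parts])
```

Recording the sequence of splits yields a hierarchy (a top-down, divisive clustering); stopping at K clusters yields a partitional one.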

(21)
(22)

Pros

• Simple
• Fast for low-dimensional data
• It can find pure sub-clusters if a large number of clusters is specified (but this over-partitions)

Cons

• K-means cannot handle non-globular data, or clusters of different sizes and densities
• K-means will not identify outliers
• K-means is restricted to data for which there is a notion of a center (centroid)

(23)

Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents
• One possibility: recursive application of a partitional clustering algorithm

[Figure: example dendrogram. "animal" splits into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).]

(24)

• No assumption of any particular number of clusters: any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies, e.g. in the biological sciences (animal kingdom, phylogeny reconstruction, …)

(25)

Hierarchical Agglomerative Clustering (HAC)

• Start with each doc in a separate cluster
• Then repeatedly merge the most similar pair of clusters, until there is only one cluster
• Keep attention: at every step we merge the pair of clusters with the maximum similarity (equivalently, the minimum distance) among all pairs of clusters
• The history of merging forms a binary tree, or hierarchy

(A sketch of this loop, with a pluggable cluster-similarity function, is given below.)
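For illustration only (not from the slides): a deliberately naive sketch of the HAC loop, parameterized by a cluster-similarity function. The function names and the cosine-based doc similarity are my assumptions.

```python
import numpy as np

def hac(docs: np.ndarray, cluster_sim) -> tuple:
    """Naive HAC: start with singleton clusters, repeatedly merge the most
    similar pair, and return the merge history (a binary tree of nested tuples)."""
    # Pairwise doc similarities (cosine, assuming non-zero vectors).
    norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    S = norm @ norm.T
    clusters = [[i] for i in range(len(docs))]      # each cluster = list of doc ids
    trees = list(range(len(docs)))                  # merge history per cluster
    while len(clusters) > 1:
        # Find the pair of clusters with MAXIMUM similarity.
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(S, clusters[a], clusters[b])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a] += clusters[b]                  # merge b into a
        trees[a] = (trees[a], trees[b])
        del clusters[b], trees[b]
    return trees[0]

def single_link(S, A, B):
    """Similarity of the closest pair of docs across the two clusters."""
    return S[np.ix_(A, B)].max()

# Hypothetical usage on random non-negative doc vectors.
X = np.abs(np.random.default_rng(4).normal(size=(8, 5)))
print(hac(X, single_link))
```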

(26)

Similarity of a pair of clusters (sketched as functions below):

• Single-link: similarity of the closest points
• Complete-link: similarity of the farthest points
• Centroid: similarity between the centroids
• Average-link: average similarity over all pairs of items

Keep attention: whatever definition is chosen, at every step HAC merges the pair of clusters with the maximum of these similarities over all pairs of clusters.
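For illustration only: the four definitions written as functions over a doc-doc similarity matrix S and two clusters A, B given as lists of doc ids. They plug into the `hac` sketch shown after the previous slide; all names are my assumptions.

```python
import numpy as np

def single_link(S, A, B):
    return S[np.ix_(A, B)].max()        # closest pair across the clusters

def complete_link(S, A, B):
    return S[np.ix_(A, B)].min()        # farthest pair across the clusters

def average_link(S, A, B):
    return S[np.ix_(A, B)].mean()       # average over all cross-cluster pairs

def centroid_sim(docs, A, B):
    # Centroid option: cosine similarity between the two cluster centroids
    # (needs the doc vectors themselves, not just S).
    ca, cb = docs[A].mean(axis=0), docs[B].mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# Tiny hypothetical check on a 4-doc similarity matrix.
S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
A, B = [0, 1], [2, 3]
print(single_link(S, A, B), complete_link(S, A, B), average_link(S, A, B))
```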

(27)–(31)

[Figure, repeated across these slides: five points p1…p5 and their proximity matrix, posing the question of how to define inter-cluster similarity. The options considered in turn are MIN (single link), MAX (complete link), centroids, and group average.]

(32)

• Start with clusters of individual points and a proximity matrix

[Figure: points p1…p5 and their proximity matrix.]

(33)

After some merging steps, we have some clusters.

[Figure: current clusters C1…C5 and the corresponding cluster-level proximity matrix.]

(34)

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1…C5; C2 and C5 are about to be merged into C2 ∪ C5 in the proximity matrix.]

(35)

The question is: “How do we update the proximity matrix?”

[Figure: the proximity matrix after the merge; the row and column for the new cluster C2 ∪ C5 are marked with “?”.]

(36)

• The similarity of two clusters is based on the two most similar (closest) points in the different clusters
• It is determined by one pair of points, i.e., by one link in the proximity graph.

Example similarity matrix:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

[Figure: dendrogram over points 1–5 built with this criterion.]

(37)

[Figure: original points vs. the two clusters found.]

• Can handle non-elliptical shapes

(38)

[Figure: original points vs. the two clusters found.]

• Sensitive to noise and outliers

(39)

• The similarity of two clusters is based on the two least similar (most distant) points in the different clusters
• It is determined by all pairs of points in the two clusters

Example similarity matrix:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

[Figure: dendrogram over points 1–5 built with this criterion.]

(40)

[Figure: original points vs. the two clusters found.]

• Less susceptible to noise and outliers

(41)

[Figure: original points vs. the two clusters found.]

• Tends to break large clusters
• Biased towards globular clusters

(42)

The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \dfrac{\sum_{p_i \in \mathrm{Cluster}_i} \sum_{p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \cdot |\mathrm{Cluster}_j|}$

[Figure: dendrogram over points 1–5 built with this criterion.]

(43)

[Figure: the dendrograms obtained on the same six points using MIN (single link), MAX (complete link), and group average; the three criteria merge the points in different orders.]

(44)

How do we evaluate clustering quality?

One option assesses a clustering with respect to ground truth, which requires labeled data. Producing the gold standard is costly! (A sketch of one such external measure is given below.)
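For illustration only: the slide text does not name a specific measure, so as an assumption here is a sketch of purity, one common external criterion that compares a clustering against ground-truth labels.

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: list of lists of doc ids; labels: sequence mapping doc id -> gold class.
    Purity = (1/N) * sum over clusters of the size of the majority class in that cluster."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[d] for d in c).most_common(1)[0][1] for c in clusters)
    return majority / n

# Hypothetical example: 6 docs with gold classes 'sport'/'politics'.
gold = ['sport', 'sport', 'politics', 'politics', 'sport', 'politics']
clustering = [[0, 1, 4], [2, 3, 5]]
print(purity(clustering, gold))   # 1.0: each cluster is pure
```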
