Cluster analysis

(1)

Cluster analysis

Eva Riccomagno, Maria Piera Rogantin

DIMA – Universit`a di Genova

riccomagno@dima.unige.it rogantin@dima.unige.it

http://www.dima.unige.it/~rogantin/IIT/

(2)

Aim From Wikipedia:

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

https://en.wikipedia.org/wiki/Cluster_analysis

The units are considered as points in a multidimensional space:

“similar” units ≡ “close” points

(3)

Distances between two points Let p the number of the variables.

Let x and y be two units (two points in dimension p):

x = (x

₁

, x

₂

, . . . , x

_p

) (y

₁

, y

₂

, . . . , y

_p

)

• For quantitative variables: Euclidean distance d

₂

(x, y) =

v u u t

p X

k=1

(x

_k

− y

_k

)

²

(sometime: the square of Euclidean distance)

• For binary or ordinal variables: “Manhattan” distance d

₁

(x, y) =

p X

k=1

|x

_k

− y

_k

| Ex:

x = (0, 1, 1, 0), y = (0, 0, 1, 1), d

₁

(x, y) = 2

number of differences between x and y x = (0, 1, 2, 0), y = (0, 2, 0, 1), d

₁

(x, y) = 4

sum of the distances between the levels

(4)

Different methods

• Hierarchical cluster

At each step: the two closest units or classes are gather together (smallest distance or weighted distance)

It is possible to draw a graphical representation of the aggre- gation process: dendogram

The clusters are determined at the end of the process Very useful when the number of units is small (< 100)

• Centroid-based cluster

First step: choice of the initial centroids

At each step: a unit is assigned to the closest centroid

Two units can be in the same cluster at one step and in two different clusters in another step

Very sensitive to the choice of the initial centroids

More convenient when the number of units is large

(5)

• Distribution-based cluster, density-based cluster (inferential methods)

• Cluster for high-dimensional data

many of the traditional methods fail (distance functions prob- lematic in high-dimensional spaces)

• ...

(6)

Hierarchical clustering: a small example Percentage composition of the milk of 16 animals

> anim_milk=read.table("milk.txt",sep="\t",header=T,row.names=1);anim_milk Water Proteins Fats Lactose

Horse 90.1 2.6 1.0 6.9

Donkey 90.3 1.7 1.4 6.2

Mule 90.0 2.0 1.8 5.5

Camel 87.7 3.5 3.4 4.8

Llama 86.5 3.9 3.2 5.6

Zebra 86.2 3.0 4.8 5.3

Sheep 82.0 5.6 6.4 4.7

Buffalo 82.1 5.9 7.9 4.7

Guinea_Pig 81.9 7.4 7.2 2.7

Fox 81.6 6.6 5.9 4.9

Pig 82.8 7.1 5.1 3.7

Rabbit 71.3 12.3 13.1 1.9

Rat 72.5 9.2 12.6 3.3

Deer 65.9 10.4 19.7 2.6

Reindeer 64.8 10.7 20.3 2.5

Whale 64.8 11.1 21.2 1.6

(7)

657075808590

Water

24681012

Proteins

5101520

Fats

234567

Lactose

par(mfrow=c(1,4))

for (i in 1:dim(anim_milk)[2])

boxplot(anim_milk[,i],main=colnames(anim_milk)[i],cex.axis=2,cex.main=2,lw=2) par(mfrow=c(1,1))

Pay attention to the different dispersion

The variables should be standardized so as not to give predomi- nance to those with larger dispersion

anim_milk_sc=scale(anim_milk)

(8)

Distance between points (after standardization)

> d=dist(anim_milk_sc,method="euclidean");round(d,1)

Horse Donkey Mule Camel Llama Zebra Sheep Buffalo G_Pig Fox Pig Rabbit Rat Deer Reind Donkey 0.5

Mule 0.9 0.5

Camel 1.4 1.1 0.7

Llama 1.0 0.9 0.7 0.5

Zebra 1.2 0.9 0.7 0.4 0.4

Sheep 2.0 1.9 1.6 1.0 1.0 1.0

Buffalo 2.1 2.0 1.7 1.1 1.2 1.1 0.2

Guinea_Pig 3.2 3.0 2.6 1.9 2.2 2.1 1.4 1.3

Fox 2.1 2.0 1.7 1.2 1.1 1.2 0.3 0.4 1.4

Pig 2.6 2.4 2.1 1.4 1.6 1.6 0.8 0.8 0.7 0.8

Rabbit 5.0 4.9 4.5 3.8 4.0 4.0 3.0 2.9 2.1 2.9 2.5

Rat 3.9 3.8 3.5 2.8 2.9 2.9 1.9 1.8 1.4 1.9 1.7 1.3

Deer 5.2 5.0 4.8 4.1 4.2 4.1 3.2 3.1 2.7 3.2 3.0 1.3 1.4

Reindeer 5.3 5.2 4.9 4.3 4.4 4.3 3.4 3.3 2.8 3.4 3.2 1.4 1.5 0.2

Whale 5.8 5.6 5.3 4.7 4.8 4.7 3.8 3.6 3.0 3.8 3.5 1.4 1.9 0.7 0.6

First step:

Deer and Reindeer are put together (min. distance): group 1

actually: d2(Deer,Reindeer) = 0.18 d2(Buffalo,Sheep) = 0.23

Second step:

A distance between group 1 and the other points is calculated (different methods) and the two points or groups with minimum distance are put together:

group 2 ....

(9)

Popular distances between two groups

• minimum (single linkage)

• maximum (complete linkage)

• mean (average linkage)

• median (median linkage)

of all the distances between points

• distance between centroids (centroid linkage)

• weighted distance between centroids

(Ward linkage)

_l

l

(10)

Dendogram

Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig

051015

Dendogram − Euclidean distance − Ward method

How many clusters?

(11)

Two

051015

Three

051015

(12)

R coding

aggr = hclust(d, method="ward.D")

plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="", cex=1.5,hang = -0.1,frame.plot = TRUE)

# draw dendogram num_clus= 3

groups3 = cutree(aggr, k=num_clus)

# cut dendogram in <num_clus> clusters rect.hclust(aggr, k=num_clus, border="red")

# draw red rectangles around clusters plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE) num_clus= 2

groups2 = cutree(aggr, k=num_clus)

rect.hclust(aggr, k=num_clus, border="blue")

(13)

Meaning of the clusters

• Means in each cluster – Original variables

n_units Water Proteins Fats Lactose

1 6 88.47 2.78 2.60 5.72

2 5 82.08 6.52 6.50 4.14

3 5 67.86 10.74 17.38 2.38

– Standardized variables

n_units Water Protein Fat Lactose 1 6 0.92 -1.04 -0.85 0.96 2 5 0.22 0.02 -0.28 -0.03 3 5 -1.33 1.23 1.30 -1.13

In cluster 1 milk contains more water and lactose In cluster 3 milk contains more proteins and fats

In cluster 2 milk is “medium” (near to the total centroid) Other analysis when more units:

e.g. boxplot of the variables by cluster

(14)

Which animals in the clusters?

Horse Donkey Mule Camel Llama Zebra

1 1 1 1 1 1

Sheep Buffalo Guinea_Pig Fox Pig

2 2 2 2 2

Rabbit Rat Deer Reindeer Whale

3 3 3 3 3

where do they live?

(15)

R coding

B=matrix(rep(0,num_clus*dim(anim_milk)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)

{ B[i,]=colMeans(anim_milk[groups3==i,]) } C=cbind(table(groups3),round(B,2))

colnames(C)=c("n_units",colnames(anim_milk)) C

B=matrix(rep(0,num_clus*dim(anim_milk_sc)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)

{ B[i,]=colMeans(anim_milk_sc[groups3==i,]) } C=cbind(table(groups3),round(B,2))

colnames(C)=c("n_units",colnames(anim_milk_sc)) C

groups3

(16)

• Dispersion of the clusters: within and between As in the univariate case

^(?)

:

total variance = within variance + between variance

good clustering ≡ small within and large between variance

(?)σ_tot² = PK

k=1 f_k σ_k² +PK

k=1f_k(x_k − x_tot)²

(17)

Useful within group indices:

1. mean of the point distances from the group centroid

2. maximum of the point distances from the group centroid 3. within variance (mean of the square of the distances of

the points from its centroid)

(difference between 1. and 2. allows us to detect outliers)

mean_within_dist max_within_dist within_var

1 0.52 0.80 0.30

2 0.54 0.94 0.33

3 0.77 1.12 0.65

Cluster 3 has the largest dispersion

4. percentage of total variance explained by between variance (the larger the better)

58.18 quite good clustering [cf. 1-way ANOVA]

(18)

R coding

mean_within_dist=c(1:num_clus) max_within_dist=c(1:num_clus) within_var=c(1:num_clus)

within_var2=c(1:num_clus) for (i in 1:num_clus)

{a=rowSums(scale(anim_milk_sc[groups3==i,],scale=F)^2) mean_within_dist[i]=mean(sqrt(a))

max_within_dist[i]=max(sqrt(a)) within_var[i]=mean(a)

within_var2[i]=sum(a) }

round(cbind(mean_within_dist,max_within_dist,within_var),2) perc_between=round(sum(within_var2)*100/16,2);perc_between

(19)

Clustering variables

Same methods transposing variables with observations

From

variables

observations

to

variables

observations

The square of the Euclidean distance between two point-variables X and Y is proportional to

the complementary to 1 of the correlation between X and Y

:

d

²₂

(X, Y ) = 2 n (1 − ρ(X, Y ))

(20)

Animal Milk Correlation matrix

> round(cor(anim_milk),2)

Water Proteins Fats Lactose Water 1.00 -0.94 -0.99 0.89 Proteins -0.94 1.00 0.90 -0.94 Fats -0.99 0.90 1.00 -0.87 Lactose 0.89 -0.94 -0.87 1.00

ρ(Water, Proteins) = −0.94 ρ(Water, Fats) = −0.99 ρ(Water, Lactose) = 0.89

ρ(Proteins, Fats) = 0.90 ρ(. . . , . . . ) = . . . ρ(Water, Water) = 1

Distance matrix

> d_var=1-cor(anim_milk);round(d_var,3)

Water Proteins Fats Lactose Water 0.000 1.943 1.989 0.113 Proteins 1.943 0.000 0.103 1.938 Fats 1.989 0.103 0.000 1.865 Lactose 0.113 1.938 1.865 0.000

Proteins Fats Water Lactose

0123

1−rho − Ward linkage

aggr_var = hclust(as.dist(d_var), method="ward.D") ### pay attention to as.dist plot(aggr_var,main = "1-rho - Ward linkage",

ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE)

(21)

References

• Quick R. http://www.statmethods.net/advstats/cluster.html

• Chapter 4.8 in Data Mining: Practical Machine Learning Tools and Techniques (Third Edition) 2011 Elsevier Inc. by Ian H. Witten, Eibe Frank and Mark A. Hall