Cluster analysis
Eva Riccomagno, Maria Piera Rogantin
DIMA – Universit`a di Genova
riccomagno@dima.unige.it rogantin@dima.unige.it
http://www.dima.unige.it/~rogantin/IIT/
Aim From Wikipedia:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
https://en.wikipedia.org/wiki/Cluster_analysis
The units are considered as points in a multidimensional space:
“similar” units ≡ “close” points
Distances between two points Let p the number of the variables.
Let x and y be two units (two points in dimension p):
x = (x
1, x
2, . . . , x
p) (y
1, y
2, . . . , y
p)
• For quantitative variables: Euclidean distance d
2(x, y) =
v u u t
p X
k=1
(x
k− y
k)
2(sometime: the square of Euclidean distance)
• For binary or ordinal variables: “Manhattan” distance d
1(x, y) =
p X
k=1
|x
k− y
k| Ex:
x = (0, 1, 1, 0), y = (0, 0, 1, 1), d
1(x, y) = 2
number of differences between x and y x = (0, 1, 2, 0), y = (0, 2, 0, 1), d
1(x, y) = 4
sum of the distances between the levels
Different methods
• Hierarchical cluster
At each step: the two closest units or classes are gather together (smallest distance or weighted distance)
It is possible to draw a graphical representation of the aggre- gation process: dendogram
The clusters are determined at the end of the process Very useful when the number of units is small (< 100)
• Centroid-based cluster
First step: choice of the initial centroids
At each step: a unit is assigned to the closest centroid
Two units can be in the same cluster at one step and in two different clusters in another step
Very sensitive to the choice of the initial centroids
More convenient when the number of units is large
• Distribution-based cluster, density-based cluster (inferential methods)
• Cluster for high-dimensional data
many of the traditional methods fail (distance functions prob- lematic in high-dimensional spaces)
• ...
Hierarchical clustering: a small example Percentage composition of the milk of 16 animals
> anim_milk=read.table("milk.txt",sep="\t",header=T,row.names=1);anim_milk Water Proteins Fats Lactose
Horse 90.1 2.6 1.0 6.9
Donkey 90.3 1.7 1.4 6.2
Mule 90.0 2.0 1.8 5.5
Camel 87.7 3.5 3.4 4.8
Llama 86.5 3.9 3.2 5.6
Zebra 86.2 3.0 4.8 5.3
Sheep 82.0 5.6 6.4 4.7
Buffalo 82.1 5.9 7.9 4.7
Guinea_Pig 81.9 7.4 7.2 2.7
Fox 81.6 6.6 5.9 4.9
Pig 82.8 7.1 5.1 3.7
Rabbit 71.3 12.3 13.1 1.9
Rat 72.5 9.2 12.6 3.3
Deer 65.9 10.4 19.7 2.6
Reindeer 64.8 10.7 20.3 2.5
Whale 64.8 11.1 21.2 1.6
657075808590
Water
24681012
Proteins
5101520
Fats
234567
Lactose
par(mfrow=c(1,4))
for (i in 1:dim(anim_milk)[2])
boxplot(anim_milk[,i],main=colnames(anim_milk)[i],cex.axis=2,cex.main=2,lw=2) par(mfrow=c(1,1))
Pay attention to the different dispersion
The variables should be standardized so as not to give predomi- nance to those with larger dispersion
anim_milk_sc=scale(anim_milk)
Distance between points (after standardization)
> d=dist(anim_milk_sc,method="euclidean");round(d,1)
Horse Donkey Mule Camel Llama Zebra Sheep Buffalo G_Pig Fox Pig Rabbit Rat Deer Reind Donkey 0.5
Mule 0.9 0.5
Camel 1.4 1.1 0.7
Llama 1.0 0.9 0.7 0.5
Zebra 1.2 0.9 0.7 0.4 0.4
Sheep 2.0 1.9 1.6 1.0 1.0 1.0
Buffalo 2.1 2.0 1.7 1.1 1.2 1.1 0.2
Guinea_Pig 3.2 3.0 2.6 1.9 2.2 2.1 1.4 1.3
Fox 2.1 2.0 1.7 1.2 1.1 1.2 0.3 0.4 1.4
Pig 2.6 2.4 2.1 1.4 1.6 1.6 0.8 0.8 0.7 0.8
Rabbit 5.0 4.9 4.5 3.8 4.0 4.0 3.0 2.9 2.1 2.9 2.5
Rat 3.9 3.8 3.5 2.8 2.9 2.9 1.9 1.8 1.4 1.9 1.7 1.3
Deer 5.2 5.0 4.8 4.1 4.2 4.1 3.2 3.1 2.7 3.2 3.0 1.3 1.4
Reindeer 5.3 5.2 4.9 4.3 4.4 4.3 3.4 3.3 2.8 3.4 3.2 1.4 1.5 0.2
Whale 5.8 5.6 5.3 4.7 4.8 4.7 3.8 3.6 3.0 3.8 3.5 1.4 1.9 0.7 0.6
First step:
Deer and Reindeer are put together (min. distance): group 1
actually: d2(Deer,Reindeer) = 0.18 d2(Buffalo,Sheep) = 0.23
Second step:
A distance between group 1 and the other points is calculated (different meth- ods) and the two points or groups with minimum distance are put together:
group 2 ....
Popular distances between two groups
• minimum (single linkage)
• maximum (complete linkage)
• mean (average linkage)
• median (median linkage)
of all the distances between points
• distance between centroids (centroid linkage)
• weighted distance between centroids
(Ward linkage)
ll
l
Dendogram
Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig
051015
Dendogram − Euclidean distance − Ward method
How many clusters?
Two
Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig
051015
Dendogram − Euclidean distance − Ward method
Three
Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig
051015
Dendogram − Euclidean distance − Ward method
R coding
aggr = hclust(d, method="ward.D")
plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="", cex=1.5,hang = -0.1,frame.plot = TRUE)
# draw dendogram num_clus= 3
groups3 = cutree(aggr, k=num_clus)
# cut dendogram in <num_clus> clusters rect.hclust(aggr, k=num_clus, border="red")
# draw red rectangles around clusters plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE) num_clus= 2
groups2 = cutree(aggr, k=num_clus)
rect.hclust(aggr, k=num_clus, border="blue")
Meaning of the clusters
• Means in each cluster – Original variables
n_units Water Proteins Fats Lactose
1 6 88.47 2.78 2.60 5.72
2 5 82.08 6.52 6.50 4.14
3 5 67.86 10.74 17.38 2.38
– Standardized variables
n_units Water Protein Fat Lactose 1 6 0.92 -1.04 -0.85 0.96 2 5 0.22 0.02 -0.28 -0.03 3 5 -1.33 1.23 1.30 -1.13
In cluster 1 milk contains more water and lactose In cluster 3 milk contains more proteins and fats
In cluster 2 milk is “medium” (near to the total centroid) Other analysis when more units:
e.g. boxplot of the variables by cluster
Which animals in the clusters?
Horse Donkey Mule Camel Llama Zebra
1 1 1 1 1 1
Sheep Buffalo Guinea_Pig Fox Pig
2 2 2 2 2
Rabbit Rat Deer Reindeer Whale
3 3 3 3 3
where do they live?
R coding
B=matrix(rep(0,num_clus*dim(anim_milk)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)
{ B[i,]=colMeans(anim_milk[groups3==i,]) } C=cbind(table(groups3),round(B,2))
colnames(C)=c("n_units",colnames(anim_milk)) C
B=matrix(rep(0,num_clus*dim(anim_milk_sc)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)
{ B[i,]=colMeans(anim_milk_sc[groups3==i,]) } C=cbind(table(groups3),round(B,2))
colnames(C)=c("n_units",colnames(anim_milk_sc)) C
groups3
• Dispersion of the clusters: within and between As in the univariate case
(?):
total variance = within variance + between variance
good clustering ≡ small within and large between variance
(?)σtot2 = PK
k=1 fk σk2 +PK
k=1fk(xk − xtot)2
Useful within group indices:
1. mean of the point distances from the group centroid
2. maximum of the point distances from the group centroid 3. within variance (mean of the square of the distances of
the points from its centroid)
(difference between 1. and 2. allows us to detect outliers)
mean_within_dist max_within_dist within_var
1 0.52 0.80 0.30
2 0.54 0.94 0.33
3 0.77 1.12 0.65
Cluster 3 has the largest dispersion
4. percentage of total variance explained by between variance (the larger the better)
58.18
quite good clustering [cf. 1-way ANOVA]
R coding
mean_within_dist=c(1:num_clus) max_within_dist=c(1:num_clus) within_var=c(1:num_clus)
within_var2=c(1:num_clus) for (i in 1:num_clus)
{a=rowSums(scale(anim_milk_sc[groups3==i,],scale=F)^2) mean_within_dist[i]=mean(sqrt(a))
max_within_dist[i]=max(sqrt(a)) within_var[i]=mean(a)
within_var2[i]=sum(a) }
round(cbind(mean_within_dist,max_within_dist,within_var),2) perc_between=round(sum(within_var2)*100/16,2);perc_between
Clustering variables
Same methods transposing variables with observations
From
variables
observations
to
variables
observations
The square of the Euclidean distance between two point-variables X and Y is proportional to
the complementary to 1 of the correlation between X and Y
:
d
22(X, Y ) = 2 n (1 − ρ(X, Y ))
Animal Milk Correlation matrix
> round(cor(anim_milk),2)
Water Proteins Fats Lactose Water 1.00 -0.94 -0.99 0.89 Proteins -0.94 1.00 0.90 -0.94 Fats -0.99 0.90 1.00 -0.87 Lactose 0.89 -0.94 -0.87 1.00
ρ(Water, Proteins) = −0.94 ρ(Water, Fats) = −0.99 ρ(Water, Lactose) = 0.89
ρ(Proteins, Fats) = 0.90 ρ(. . . , . . . ) = . . . ρ(Water, Water) = 1
Distance matrix
> d_var=1-cor(anim_milk);round(d_var,3)
Water Proteins Fats Lactose Water 0.000 1.943 1.989 0.113 Proteins 1.943 0.000 0.103 1.938 Fats 1.989 0.103 0.000 1.865 Lactose 0.113 1.938 1.865 0.000
Proteins Fats Water Lactose
0123
1−rho − Ward linkage
aggr_var = hclust(as.dist(d_var), method="ward.D") ### pay attention to as.dist plot(aggr_var,main = "1-rho - Ward linkage",
ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE)