• Non ci sono risultati.

Cluster analysis

N/A
N/A
Protected

Academic year: 2021

Condividi "Cluster analysis"

Copied!
21
0
0

Testo completo

(1)

Cluster analysis

Eva Riccomagno, Maria Piera Rogantin

DIMA – Universit`a di Genova

riccomagno@dima.unige.it rogantin@dima.unige.it

http://www.dima.unige.it/~rogantin/IIT/

(2)

Aim From Wikipedia:

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

https://en.wikipedia.org/wiki/Cluster_analysis

The units are considered as points in a multidimensional space:

“similar” units ≡ “close” points

(3)

Distances between two points Let p the number of the variables.

Let x and y be two units (two points in dimension p):

x = (x

1

, x

2

, . . . , x

p

) (y

1

, y

2

, . . . , y

p

)

• For quantitative variables: Euclidean distance d

2

(x, y) =

v u u t

p X

k=1

(x

k

− y

k

)

2

(sometime: the square of Euclidean distance)

• For binary or ordinal variables: “Manhattan” distance d

1

(x, y) =

p X

k=1

|x

k

− y

k

| Ex:

x = (0, 1, 1, 0), y = (0, 0, 1, 1), d

1

(x, y) = 2

number of differences between x and y x = (0, 1, 2, 0), y = (0, 2, 0, 1), d

1

(x, y) = 4

sum of the distances between the levels

(4)

Different methods

• Hierarchical cluster

At each step: the two closest units or classes are gather together (smallest distance or weighted distance)

It is possible to draw a graphical representation of the aggre- gation process: dendogram

The clusters are determined at the end of the process Very useful when the number of units is small (< 100)

• Centroid-based cluster

First step: choice of the initial centroids

At each step: a unit is assigned to the closest centroid

Two units can be in the same cluster at one step and in two different clusters in another step

Very sensitive to the choice of the initial centroids

More convenient when the number of units is large

(5)

• Distribution-based cluster, density-based cluster (inferential methods)

• Cluster for high-dimensional data

many of the traditional methods fail (distance functions prob- lematic in high-dimensional spaces)

• ...

(6)

Hierarchical clustering: a small example Percentage composition of the milk of 16 animals

> anim_milk=read.table("milk.txt",sep="\t",header=T,row.names=1);anim_milk Water Proteins Fats Lactose

Horse 90.1 2.6 1.0 6.9

Donkey 90.3 1.7 1.4 6.2

Mule 90.0 2.0 1.8 5.5

Camel 87.7 3.5 3.4 4.8

Llama 86.5 3.9 3.2 5.6

Zebra 86.2 3.0 4.8 5.3

Sheep 82.0 5.6 6.4 4.7

Buffalo 82.1 5.9 7.9 4.7

Guinea_Pig 81.9 7.4 7.2 2.7

Fox 81.6 6.6 5.9 4.9

Pig 82.8 7.1 5.1 3.7

Rabbit 71.3 12.3 13.1 1.9

Rat 72.5 9.2 12.6 3.3

Deer 65.9 10.4 19.7 2.6

Reindeer 64.8 10.7 20.3 2.5

Whale 64.8 11.1 21.2 1.6

(7)

657075808590

Water

24681012

Proteins

5101520

Fats

234567

Lactose

par(mfrow=c(1,4))

for (i in 1:dim(anim_milk)[2])

boxplot(anim_milk[,i],main=colnames(anim_milk)[i],cex.axis=2,cex.main=2,lw=2) par(mfrow=c(1,1))

Pay attention to the different dispersion

The variables should be standardized so as not to give predomi- nance to those with larger dispersion

anim_milk_sc=scale(anim_milk)

(8)

Distance between points (after standardization)

> d=dist(anim_milk_sc,method="euclidean");round(d,1)

Horse Donkey Mule Camel Llama Zebra Sheep Buffalo G_Pig Fox Pig Rabbit Rat Deer Reind Donkey 0.5

Mule 0.9 0.5

Camel 1.4 1.1 0.7

Llama 1.0 0.9 0.7 0.5

Zebra 1.2 0.9 0.7 0.4 0.4

Sheep 2.0 1.9 1.6 1.0 1.0 1.0

Buffalo 2.1 2.0 1.7 1.1 1.2 1.1 0.2

Guinea_Pig 3.2 3.0 2.6 1.9 2.2 2.1 1.4 1.3

Fox 2.1 2.0 1.7 1.2 1.1 1.2 0.3 0.4 1.4

Pig 2.6 2.4 2.1 1.4 1.6 1.6 0.8 0.8 0.7 0.8

Rabbit 5.0 4.9 4.5 3.8 4.0 4.0 3.0 2.9 2.1 2.9 2.5

Rat 3.9 3.8 3.5 2.8 2.9 2.9 1.9 1.8 1.4 1.9 1.7 1.3

Deer 5.2 5.0 4.8 4.1 4.2 4.1 3.2 3.1 2.7 3.2 3.0 1.3 1.4

Reindeer 5.3 5.2 4.9 4.3 4.4 4.3 3.4 3.3 2.8 3.4 3.2 1.4 1.5 0.2

Whale 5.8 5.6 5.3 4.7 4.8 4.7 3.8 3.6 3.0 3.8 3.5 1.4 1.9 0.7 0.6

First step:

Deer and Reindeer are put together (min. distance): group 1

actually: d2(Deer,Reindeer) = 0.18 d2(Buffalo,Sheep) = 0.23

Second step:

A distance between group 1 and the other points is calculated (different meth- ods) and the two points or groups with minimum distance are put together:

group 2 ....

(9)

Popular distances between two groups

• minimum (single linkage)

• maximum (complete linkage)

• mean (average linkage)

• median (median linkage)

of all the distances between points

• distance between centroids (centroid linkage)

• weighted distance between centroids

(Ward linkage)

l

l

l

(10)

Dendogram

Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig

051015

Dendogram − Euclidean distance − Ward method

How many clusters?

(11)

Two

Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig

051015

Dendogram − Euclidean distance − Ward method

Three

Whale Deer Reindeer Rabbit Rat Camel Llama Zebra Horse Donkey Mule Fox Sheep Buffalo Guinea_Pig Pig

051015

Dendogram − Euclidean distance − Ward method

(12)

R coding

aggr = hclust(d, method="ward.D")

plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="", cex=1.5,hang = -0.1,frame.plot = TRUE)

# draw dendogram num_clus= 3

groups3 = cutree(aggr, k=num_clus)

# cut dendogram in <num_clus> clusters rect.hclust(aggr, k=num_clus, border="red")

# draw red rectangles around clusters plot(aggr,main = "Dendogram - Euclidean distance - Ward method", ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE) num_clus= 2

groups2 = cutree(aggr, k=num_clus)

rect.hclust(aggr, k=num_clus, border="blue")

(13)

Meaning of the clusters

• Means in each cluster – Original variables

n_units Water Proteins Fats Lactose

1 6 88.47 2.78 2.60 5.72

2 5 82.08 6.52 6.50 4.14

3 5 67.86 10.74 17.38 2.38

– Standardized variables

n_units Water Protein Fat Lactose 1 6 0.92 -1.04 -0.85 0.96 2 5 0.22 0.02 -0.28 -0.03 3 5 -1.33 1.23 1.30 -1.13

In cluster 1 milk contains more water and lactose In cluster 3 milk contains more proteins and fats

In cluster 2 milk is “medium” (near to the total centroid) Other analysis when more units:

e.g. boxplot of the variables by cluster

(14)

Which animals in the clusters?

Horse Donkey Mule Camel Llama Zebra

1 1 1 1 1 1

Sheep Buffalo Guinea_Pig Fox Pig

2 2 2 2 2

Rabbit Rat Deer Reindeer Whale

3 3 3 3 3

where do they live?

(15)

R coding

B=matrix(rep(0,num_clus*dim(anim_milk)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)

{ B[i,]=colMeans(anim_milk[groups3==i,]) } C=cbind(table(groups3),round(B,2))

colnames(C)=c("n_units",colnames(anim_milk)) C

B=matrix(rep(0,num_clus*dim(anim_milk_sc)[2]),byrow=T,nrow=num_clus) for (i in 1:num_clus)

{ B[i,]=colMeans(anim_milk_sc[groups3==i,]) } C=cbind(table(groups3),round(B,2))

colnames(C)=c("n_units",colnames(anim_milk_sc)) C

groups3

(16)

• Dispersion of the clusters: within and between As in the univariate case

(?)

:

total variance = within variance + between variance

good clustering ≡ small within and large between variance

(?)σtot2 = PK

k=1 fk σk2 +PK

k=1fk(xk − xtot)2

(17)

Useful within group indices:

1. mean of the point distances from the group centroid

2. maximum of the point distances from the group centroid 3. within variance (mean of the square of the distances of

the points from its centroid)

(difference between 1. and 2. allows us to detect outliers)

mean_within_dist max_within_dist within_var

1 0.52 0.80 0.30

2 0.54 0.94 0.33

3 0.77 1.12 0.65

Cluster 3 has the largest dispersion

4. percentage of total variance explained by between variance (the larger the better)

58.18

quite good clustering [cf. 1-way ANOVA]

(18)

R coding

mean_within_dist=c(1:num_clus) max_within_dist=c(1:num_clus) within_var=c(1:num_clus)

within_var2=c(1:num_clus) for (i in 1:num_clus)

{a=rowSums(scale(anim_milk_sc[groups3==i,],scale=F)^2) mean_within_dist[i]=mean(sqrt(a))

max_within_dist[i]=max(sqrt(a)) within_var[i]=mean(a)

within_var2[i]=sum(a) }

round(cbind(mean_within_dist,max_within_dist,within_var),2) perc_between=round(sum(within_var2)*100/16,2);perc_between

(19)

Clustering variables

Same methods transposing variables with observations

From

variables

observations

to

variables

observations

The square of the Euclidean distance between two point-variables X and Y is proportional to

the complementary to 1 of the correlation between X and Y

:

d

22

(X, Y ) = 2 n (1 − ρ(X, Y ))

(20)

Animal Milk Correlation matrix

> round(cor(anim_milk),2)

Water Proteins Fats Lactose Water 1.00 -0.94 -0.99 0.89 Proteins -0.94 1.00 0.90 -0.94 Fats -0.99 0.90 1.00 -0.87 Lactose 0.89 -0.94 -0.87 1.00

ρ(Water, Proteins) = −0.94 ρ(Water, Fats) = −0.99 ρ(Water, Lactose) = 0.89

ρ(Proteins, Fats) = 0.90 ρ(. . . , . . . ) = . . . ρ(Water, Water) = 1

Distance matrix

> d_var=1-cor(anim_milk);round(d_var,3)

Water Proteins Fats Lactose Water 0.000 1.943 1.989 0.113 Proteins 1.943 0.000 0.103 1.938 Fats 1.989 0.103 0.000 1.865 Lactose 0.113 1.938 1.865 0.000

Proteins Fats Water Lactose

0123

1−rho − Ward linkage

aggr_var = hclust(as.dist(d_var), method="ward.D") ### pay attention to as.dist plot(aggr_var,main = "1-rho - Ward linkage",

ylab = " ",sub="",xlab="",cex=1.5,hang = -0.1,frame.plot = TRUE)

(21)

References

• Quick R. http://www.statmethods.net/advstats/cluster.html

• Chapter 4.8 in Data Mining: Practical Machine Learning Tools and Techniques (Third Edition) 2011 Elsevier Inc. by Ian H. Witten, Eibe Frank and Mark A. Hall

• Chapter 16 in IBM SPSS Statistics

http://www.norusis.com/pdf/SPC_v13.pdf

Riferimenti

Documenti correlati

Program Description: Spontaneous cerebrospinal fluid (CSF) fistula is an important emerging clinical entity that may be related to excess production or inadequate resorption of

I2 andamento dell’epidemia di rabbia silvestre nella provincia è stato desunto dall’analisi delle volpi pervenute al Servizio veterinario provinciale nel periodo

Yeats applica il simbolismo anche alla sua autobiografia per trovare un significato filosofico nella sua vita: “By applying the power of his symbolic

La sua vicinanza a 1 è aumentata, come dev’essere: nel primo modello la sua posizione a 0 “trascinava” il modello in una direzione che rendeva Denmark più vicina a 0 (perché

modellano: tempi di trasferta, controllo rx o ricarica batterie veicolo, intervalli di arrivo clienti ai check-in, MTBF, ... !  usi: throughput sistema (bagagli/h), analisi di code

Focusing on the effects of structural change on the grand strategy of the United States and especially on US relations with France and Germany, this paper reviews the implicit

Sayan Turkic is the technical term used in Turcology to refer to the South-Siberian branch of the Turkic language family, which includes standard Tuvan, Tofan and

Alla fine della seconda guerra mondiale, tra l’inverno 1945 e l’estate 1946, egli ha soggiornato a Casarano in qualità di studente-militare dell’Istituto superiore ad