Exploratory Data Analysis part 4
Eva Riccomagno, Maria Piera Rogantin
DIMA – Università di Genova
http://www.dima.unige.it/~rogantin/UnigeStat/
Distribution and graphic representation of two quantitative variables
X and Y measured on the same population
The set of points $(x_i, y_i)$, together with their frequencies, is the joint distribution of X and Y
The scatter-plot
Example: height and weight of the students in a graduate class
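# read the data: height (altezza), weight (peso), gender (genere)
peso_alt <- read.table(file = "C:/DATA/stid98.txt", header = FALSE,
                       col.names = c("altezza", "peso", "genere"))
attach(peso_alt)  # make the columns visible by name
plot(altezza, peso, pch = 16, col = "red")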
[Figure: scatter plot of peso (weight) against altezza (height)]
Indices for two quantitative variables
Covariance
$X^c$ and $Y^c$ centred variables:
$$X^c = X - \bar{x}, \qquad Y^c = Y - \bar{y}$$
$$\Gamma(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^c \, y_i^c = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
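As a minimal sketch of the definition above, the covariance with divisor n can be computed directly in R. The helper name `gamma_xy` and the small data vectors are illustrative assumptions, not part of the course data. Note that R's built-in `cov()` divides by n − 1 rather than n, so it is rescaled here for comparison.

```r
# Covariance with divisor n, as in the definition of Gamma(X, Y).
# gamma_xy is a hypothetical helper name; the data are made up.
gamma_xy <- function(x, y) {
  xc <- x - mean(x)   # centred X
  yc <- y - mean(y)   # centred Y
  mean(xc * yc)       # (1/n) * sum of the products
}

x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)
n <- length(x)

g <- gamma_xy(x, y)
# cov() uses divisor n - 1, so rescale it to divisor n
g_from_cov <- cov(x, y) * (n - 1) / n
print(c(g, g_from_cov))
```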
Interpretation
Place the origin at the centroid $(\bar{x}, \bar{y})$.
The points in quadrants I and III give a positive contribution to $\sum_{i=1}^{n} x_i^c \, y_i^c$, while the points in quadrants II and IV give a negative contribution.
[Figure: scatter plot of y against x with the origin at the centroid; the four quadrants are marked with the sign (+ or −) of their contribution]
Hence the covariance is
- positive if, on average, high values of X correspond to high values of Y (and low values of X to low values of Y)
- negative if, on average, high values of X correspond to low values of Y (and low values of X to high values of Y)
- about zero otherwise
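The quadrant argument can be checked numerically. The sketch below, on made-up data, splits the sum of products of centred values into its positive part (quadrants I and III) and its negative part (quadrants II and IV). The variable names are illustrative assumptions.

```r
# Split the sum of products of centred values by sign (made-up data).
x <- c(10, 12, 14, 16, 18, 20)
y <- c(11, 17, 12, 18, 14, 21)
xc <- x - mean(x)
yc <- y - mean(y)
prod_c <- xc * yc

pos_part <- sum(prod_c[prod_c > 0])  # points in quadrants I and III
neg_part <- sum(prod_c[prod_c < 0])  # points in quadrants II and IV
total <- sum(prod_c)                 # equals n * Gamma(X, Y)
print(c(pos_part, neg_part, total))
```

Here the positive part dominates, so the covariance is positive.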
[Figure: three scatter plots: y1 against x1, covariance positive (+); y2 against x2, covariance about zero (~0); y3 against x3, covariance negative (−)]
Correlation index (or linear correlation index)
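$X^s$ and $Y^s$ standardized variables:
$$X^s = \frac{X - \bar{x}}{\sigma_X} \quad \text{and} \quad Y^s = \frac{Y - \bar{y}}{\sigma_Y}$$
$$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^s \, y_i^s = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sqrt{\sum_{j} (x_j - \bar{x})^2}} \, \frac{(y_i - \bar{y})}{\sqrt{\sum_{j} (y_j - \bar{y})^2}} = \frac{\Gamma(X, Y)}{\sigma_X \, \sigma_Y}$$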
Some properties:
- ρ(X, Y) does not depend on the variability of the variables
- ρ(X, Y) is a pure number
- ρ(X, Y ) ∈ [−1, 1]
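A minimal sketch of the correlation index as the mean product of standardized variables, on made-up data. The helper names `sd_pop` and `rho_xy` are assumptions introduced for illustration. Because the divisor cancels in the ratio, the result agrees with R's built-in `cor()`.

```r
# Correlation as the mean product of standardized variables (divisor n).
# sd_pop and rho_xy are hypothetical helper names; the data are made up.
sd_pop <- function(v) sqrt(mean((v - mean(v))^2))

rho_xy <- function(x, y) {
  xs <- (x - mean(x)) / sd_pop(x)   # standardized X
  ys <- (y - mean(y)) / sd_pop(y)   # standardized Y
  mean(xs * ys)
}

x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)
r <- rho_xy(x, y)
# the divisor (n or n - 1) cancels in the ratio, so cor() gives the same value
print(c(r, cor(x, y)))
```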
Remark 1: positive correlation does not imply causality
Remark 2: the correlation index detects only linear dependence. In the figure below ρ(X, Y) = 0.02
[Figure: scatter plot illustrating Remark 2]
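Remark 2 can be illustrated with a small constructed example. The data below are made up and are not the data of the figure: Y is an exact quadratic function of X, yet the linear correlation is zero.

```r
# Exact quadratic dependence with zero linear correlation (made-up data).
x <- -5:5
y <- x^2            # Y is a deterministic function of X
r <- cor(x, y)
print(r)
```

The dependence is perfect but not linear, so the correlation index does not detect it.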
Covariance and correlation indices in a population and its sub-groups
$$\Gamma_{tot}(X, Y) = \sum_{k=1}^{K} f_k \, \Gamma_k(X, Y) + \sum_{k=1}^{K} f_k \, (\bar{x}_k - \bar{x})(\bar{y}_k - \bar{y})$$
(cf. the formula for the variance)
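The decomposition can be verified numerically. The sketch below assumes that $f_k$ is the relative frequency $n_k / n$ of sub-group k, that $\Gamma_k$ is the covariance within sub-group k (divisor $n_k$), and that $(\bar{x}_k, \bar{y}_k)$ is the centroid of sub-group k. The data and the helper name `gamma_xy` are made up for illustration.

```r
# Check the decomposition of the total covariance on made-up data.
# f_k is taken to be the relative frequency n_k / n of sub-group k (an assumption).
gamma_xy <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

x <- c(1, 2, 3, 6, 7, 8, 9)
y <- c(2, 1, 4, 5, 9, 7, 8)
group <- c("A", "A", "A", "B", "B", "B", "B")

f <- as.numeric(table(group)) / length(x)       # relative frequencies f_k
idx <- split(seq_along(x), group)               # indices of each sub-group
gamma_k <- sapply(idx, function(i) gamma_xy(x[i], y[i]))
xbar_k <- sapply(idx, function(i) mean(x[i]))
ybar_k <- sapply(idx, function(i) mean(y[i]))

within <- sum(f * gamma_k)                                   # first sum
between <- sum(f * (xbar_k - mean(x)) * (ybar_k - mean(y)))  # second sum
print(c(total = gamma_xy(x, y), within_plus_between = within + between))
```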
Some paradoxes
left: $\rho_{tot} < 0$, $\rho_1 > 0$, $\rho_2 > 0$, $\rho_3 > 0$
right: $\rho_{tot} \simeq 0$, $\rho_1 > 0$, $\rho_2 < 0$
[Figure: two scatter plots; left: b against a, right: y against x]
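A paradox of the first kind can be reproduced with a small constructed data set. The data below are made up and are not those of the figure: each of the three sub-groups has positive correlation, but the group centroids lie on a decreasing line, so the overall correlation is negative.

```r
# Three sub-groups, each with positive correlation, negative overall correlation.
# Made-up data: the group centroids lie on a decreasing line.
x <- c(-1, 0, 1, 9, 10, 11, 19, 20, 21)
y <- c(19, 20, 21, 9, 10, 11, -1, 0, 1)
group <- rep(1:3, each = 3)

rho_tot <- cor(x, y)
rho_k <- sapply(split(seq_along(x), group), function(i) cor(x[i], y[i]))
print(rho_tot)
print(rho_k)
```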
Example: height and weight of the students in a graduate class (continued)
$\rho_{tot} = 0.61$, $\rho_M = 0.26$, $\rho_F = 0.78$
g <- as.factor(genere)  # index by factor codes (1, 2), whatever the raw coding of genere
plot(altezza, peso, pch = c(16, 17)[g], col = c("red", "blue")[g])
[Figure: scatter plot of peso against altezza, with symbol and colour by genere; title: tot 0.61, M: 0.26, F: 0.78]
Linear transformations of variables:
how the indices change
Univariate case
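X variable with mean $\bar{x}$, variance $\sigma_X^2$ and quantile $q_\alpha^X$
• change of position: R = X + a
$\bar{r} = \bar{x} + a$, $\sigma_R^2 = \sigma_X^2$, $q_\alpha^R = q_\alpha^X + a$
• change of dispersion/scale: T = b X
$\bar{t} = b \, \bar{x}$, $\sigma_T^2 = b^2 \sigma_X^2$, $q_\alpha^T = b \, q_\alpha^X$
• change of both position and scale: Z = b X + a
$\bar{z} = b \, \bar{x} + a$, $\sigma_Z^2 = b^2 \sigma_X^2$, $q_\alpha^Z = b \, q_\alpha^X + a$
The quantile formulas hold for b > 0; for b < 0 the order of the values is reversed.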
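The rules for Z = b X + a can be checked on a small made-up sample. This is a sketch under stated assumptions: the variance uses divisor n (the helper `var_pop` is a hypothetical name), the quantile is computed with R's default `quantile()` settings, and b is positive.

```r
# Effect of Z = b * X + a on mean, variance and quantiles (made-up data, b > 0).
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
a <- 3
b <- 2
z <- b * x + a

var_pop <- function(v) mean((v - mean(v))^2)   # variance with divisor n

mean_ok <- isTRUE(all.equal(mean(z), b * mean(x) + a))
var_ok  <- isTRUE(all.equal(var_pop(z), b^2 * var_pop(x)))
# quantile() with its default settings; the rule shown holds for b > 0
q_ok    <- isTRUE(all.equal(quantile(z, 0.25), b * quantile(x, 0.25) + a))
print(c(mean_ok, var_ok, q_ok))
```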
[Figure: graphical comparison of X, R = X + a and Z = 2X + a]
Pay attention to the ticks on the axes
Two special transformations:
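• centered variable:
$X^c = X - \bar{x}$, with $\bar{x}^c = 0$ and $\sigma_{X^c}^2 = \sigma_X^2$
• standardized variable:
$X^s = \dfrac{X - \bar{x}}{\sigma_X}$, with $\bar{x}^s = 0$ and $\sigma_{X^s}^2 = 1$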
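A short sketch of the two special transformations on made-up data, with the variance computed with divisor n (the helper `var_pop` is a hypothetical name). As a point of comparison, R's `scale()` standardizes with `sd()`, which uses divisor n − 1, so its output has unit variance in the n − 1 sense.

```r
# Centring and standardizing with divisor n (made-up data).
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
var_pop <- function(v) mean((v - mean(v))^2)

xc <- x - mean(x)                 # centred variable
xs <- xc / sqrt(var_pop(x))       # standardized variable

print(c(mean(xc), var_pop(xc), var_pop(x)))   # mean 0, variance unchanged
print(c(mean(xs), var_pop(xs)))               # mean 0, variance 1
# scale() divides by sd(), i.e. uses divisor n - 1
print(var(as.vector(scale(x))))
```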
Bivariate case
X with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$
Y with $\bar{y}$, $\sigma_Y^2$ and $q_\alpha^Y$
• change of position: R = X + a, M = Y + c
$\Gamma(R, M) = \Gamma(X, Y)$, $\rho(R, M) = \rho(X, Y)$
• change of dispersion: T = b X, N = d Y
$\Gamma(T, N) = b \, d \, \Gamma(X, Y)$, $\rho(T, N) = \mathrm{sign}(b \, d) \, \rho(X, Y)$
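The bivariate rules can be checked in the same way. The sketch below applies a shift and a scale change to both variables at once, on made-up data, with b d < 0 so that the sign of the correlation flips. The helper name `gamma_xy` and the variable names are assumptions introduced for illustration.

```r
# Effect of linear transformations on covariance and correlation (made-up data).
gamma_xy <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)
a <- 5; shift_y <- -2   # shifts (shift_y plays the role of c in the text)
b <- -3; d <- 2         # scale factors with b * d < 0

t_var <- b * x + a
n_var <- d * y + shift_y

# shifts leave the covariance unchanged, scale factors multiply it by b * d
print(c(gamma_xy(t_var, n_var), b * d * gamma_xy(x, y)))
# the correlation changes at most in sign
print(c(cor(t_var, n_var), sign(b * d) * cor(x, y)))
```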
In the bivariate and multivariate cases it is often useful to use standardized variables
[Figure: two scatter plots: "original variables" (y against x) and "scaled variables" (y_sc against x_sc)]
x <- rnorm(100, 3, 1); y <- rnorm(100, 3, 3)
x_sc <- scale(x); y_sc <- scale(y)
par(mfrow = c(1, 2))  # two plots side by side
plot(x, y, pch = 19, asp = 1, main = "original variables")
plot(x_sc, y_sc, pch = 19, asp = 1, main = "scaled variables")
## asp = 1 gives the same scale on the horizontal and vertical axes