Exploratory Data Analysis part 4

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/UnigeStat/


Distribution and graphic representation of two quantitative variables

X and Y measured on the same population

The set of points (x_i, y_i), together with their frequencies, is the joint distribution of X and Y

The scatter-plot

Example: height and weight of the students in a graduate class

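# altezza = height, peso = weight, genere = gender
peso_alt = read.table(file = "C:/DATA/stid98.txt", header = FALSE,
                      col.names = c("altezza", "peso", "genere"))
attach(peso_alt)

# scatter plot of weight against height
plot(altezza, peso, pch = 16, col = "red")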

[Figure: scatter plot of peso (weight) against altezza (height)]


Indices for two quantitative variables

Covariance

$X^c$ and $Y^c$ are the centred variables:

$$X^c = X - \bar{x}, \qquad Y^c = Y - \bar{y}$$

$$\Gamma(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^c \, y_i^c = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
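As a rough check of the definition, the sketch below computes Γ(X, Y) on a small made-up data set. The vectors x and y are illustrative assumptions, not the class data. Note that R's built-in cov() divides by n − 1, whereas the definition above divides by n, so the two differ by the factor (n − 1)/n.

```r
# Illustrative data (made up for this sketch)
x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)
n <- length(x)

# Centred variables
xc <- x - mean(x)
yc <- y - mean(y)

# Covariance with divisor n, as in the definition of Gamma(X, Y)
gamma_xy <- sum(xc * yc) / n

# cov() uses divisor n - 1, so rescale it before comparing
gamma_from_cov <- cov(x, y) * (n - 1) / n

print(gamma_xy)
print(all.equal(gamma_xy, gamma_from_cov))
```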

Interpretation

Take the origin at the centroid $(\bar{x}, \bar{y})$.

The points in quadrants I and III give a positive contribution to $\sum_{i=1}^{n} x_i^c \, y_i^c$, while the points in quadrants II and IV give a negative contribution.

[Figure: scatter plot of y against x, with the quadrants around the centroid marked + or − according to the sign of their contribution]


Hence the covariance is

- positive if, on average, high values of X correspond to high values of Y (and low values of X correspond to low values of Y)

- negative if, on average, high values of X correspond to low values of Y (and low values of X correspond to high values of Y)

- about zero otherwise
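The three cases can be reproduced with simulated data. This is only a sketch: the seed, the sample size and the generating models are arbitrary assumptions chosen for illustration, and gamma() is a hypothetical helper implementing the divisor-n covariance.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 1000
x <- runif(n, 10, 20)

y_pos  <- x + rnorm(n, sd = 2)          # high x goes with high y
y_neg  <- 30 - x + rnorm(n, sd = 2)     # high x goes with low y
y_zero <- rnorm(n, mean = 15, sd = 3)   # generated independently of x

# Covariance with divisor n (hypothetical helper)
gamma <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

print(c(positive  = gamma(x, y_pos),
        negative  = gamma(x, y_neg),
        near_zero = gamma(x, y_zero)))
```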

[Figure: three scatter plots: y1 against x1 (covariance +), y2 against x2 (covariance ~0), y3 against x3 (covariance −)]


Correlation index (or linear correlation index)

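$X^s$ and $Y^s$ are the standardized variables:

$$X^s = \frac{X - \bar{x}}{\sigma_X} \quad \text{and} \quad Y^s = \frac{Y - \bar{y}}{\sigma_Y}$$

$$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^s \, y_i^s = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sqrt{\sum_j (x_j - \bar{x})^2}} \, \frac{(y_i - \bar{y})}{\sqrt{\sum_j (y_j - \bar{y})^2}} = \frac{\Gamma(X, Y)}{\sigma_X \, \sigma_Y}$$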

Some properties:

- ρ(X, Y) does not depend on the variability of the variables

- ρ(X, Y) is a pure number

- ρ(X, Y) ∈ [−1, 1]
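A minimal sketch of these properties, on made-up data (the vectors and the unit changes are assumptions for illustration). The helper sd_n() is hypothetical and uses divisor n, as in the slides; cor() returns the same value because the divisors cancel in the ratio.

```r
# Illustrative data (made up for this sketch)
x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)

# Standard deviation with divisor n (hypothetical helper)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))

# Standardized variables
xs <- (x - mean(x)) / sd_n(x)
ys <- (y - mean(y)) / sd_n(y)

# Correlation as the mean product of the standardized variables
rho <- mean(xs * ys)
print(all.equal(rho, cor(x, y)))

# Changing the units of measurement leaves rho unchanged
rho_units <- cor(x / 100, y * 1000)
print(all.equal(rho, rho_units))

# rho lies in [-1, 1]
print(abs(rho) <= 1)
```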


Remark 1: positive correlation does not imply causality

Remark 2: the correlation index detects only linear dependence. In the example below, ρ(X, Y) = 0.02

[Figure: scatter plot of two variables that are dependent but not linearly related]
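Remark 2 can be illustrated with an extreme case: a variable that is an exact function of the other and yet has correlation zero. The data below are an assumption made for this sketch, not the data of the figure.

```r
# x is symmetric around 30; y is completely determined by x
x <- seq(20, 40, by = 1)
y <- (x - 30)^2

# The linear correlation index does not detect this dependence
rho <- cor(x, y)
print(round(rho, 10))
```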


Covariance and correlation indices in a population and its sub-groups

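$$\Gamma_{tot}(X, Y) = \sum_{k=1}^{K} f_k \, \Gamma_k(X, Y) + \sum_{k=1}^{K} f_k \, (\bar{x}_k - \bar{x})(\bar{y}_k - \bar{y})$$

where $f_k$ is the relative frequency of sub-group $k$, $\Gamma_k$ the covariance within it and $(\bar{x}_k, \bar{y}_k)$ its centroid

(cf. the decomposition formula for the variance)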
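The decomposition can be verified numerically. The sketch below uses made-up data with two sub-groups of different size; the group labels and the helper gamma() (covariance with divisor n) are assumptions for illustration.

```r
# Illustrative data: two sub-groups, A (4 units) and B (6 units)
x <- c(1, 2, 3, 4, 10, 11, 12, 13, 14, 15)
y <- c(2, 1, 4, 3, 9, 12, 10, 13, 15, 14)
g <- rep(c("A", "B"), times = c(4, 6))

# Covariance with divisor n (hypothetical helper)
gamma <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

f     <- as.numeric(table(g)) / length(g)   # relative frequencies f_k
gam_k <- sapply(split(seq_along(x), g),
                function(i) gamma(x[i], y[i]))   # covariance within each group
xk    <- tapply(x, g, mean)                 # sub-group means of x
yk    <- tapply(y, g, mean)                 # sub-group means of y

within  <- sum(f * gam_k)
between <- sum(f * (xk - mean(x)) * (yk - mean(y)))

print(c(total = gamma(x, y), within_plus_between = within + between))
```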


Some paradoxes

left: ρ_tot < 0, ρ₁ > 0, ρ₂ > 0, ρ₃ > 0

right: ρ_tot ≈ 0, ρ₁ > 0, ρ₂ < 0

[Figure: two scatter plots, b against a (left) and y against x (right)]
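A situation like the left-hand case can be built by hand. In this sketch (made-up data, not the data of the figure) the two variables rise together within each of three sub-groups, while the group centres fall in the second variable as they rise in the first, so the between-group term dominates.

```r
# Three sub-groups whose centres go down in y as they go up in x
d  <- -2:2
cx <- c(10, 25, 40)
cy <- c(90, 70, 50)

# Within each group x and y rise together
x <- as.vector(outer(d, cx, "+"))
y <- as.vector(outer(d, cy, "+"))
g <- rep(1:3, each = length(d))

rho_tot <- cor(x, y)
rho_k   <- sapply(split(seq_along(x), g), function(i) cor(x[i], y[i]))

print(rho_tot)   # negative overall
print(rho_k)     # positive in every sub-group
```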

Example: height and weight of the students in a graduate class (continued)

ρ_tot = 0.61, ρ_M = 0.26, ρ_F = 0.78

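# as.factor() makes the indexing work whether genere is stored as codes or as text
plot(altezza, peso, pch = c(16, 17)[as.factor(genere)],
     col = c("red", "blue")[as.factor(genere)])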

[Figure: scatter plot of peso against altezza, with symbol and colour given by genere; tot: 0.61, M: 0.26, F: 0.78]


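Linear transformations of variables: how the indices change

Univariate case

X is a variable with mean $\bar{x}$, variance $\sigma_X^2$ and quantile $q_\alpha^X$

• change of position: $R = X + a$

$\bar{r} = \bar{x} + a$, $\quad \sigma_R^2 = \sigma_X^2$, $\quad q_\alpha^R = q_\alpha^X + a$

• change of dispersion/scale: $T = b\,X$

$\bar{t} = b\,\bar{x}$, $\quad \sigma_T^2 = b^2 \sigma_X^2$, $\quad q_\alpha^T = b\,q_\alpha^X$

• change of both position and scale: $Z = b\,X + a$

$\bar{z} = b\,\bar{x} + a$, $\quad \sigma_Z^2 = b^2 \sigma_X^2$, $\quad q_\alpha^Z = b\,q_\alpha^X + a$

The quantile relations hold for b > 0.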
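These rules are easy to check numerically. In the sketch below the data and the constants a and b are arbitrary assumptions (with b > 0), and var_n() is a hypothetical helper for the variance with divisor n.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # illustrative data
a <- 3                             # arbitrary shift
b <- 2                             # arbitrary scale factor, b > 0

# Variance with divisor n (hypothetical helper)
var_n <- function(v) mean((v - mean(v))^2)

z <- b * x + a

# Mean, variance and first quartile of Z against the rules for X
print(all.equal(mean(z), b * mean(x) + a))
print(all.equal(var_n(z), b^2 * var_n(x)))
print(all.equal(quantile(z, 0.25), b * quantile(x, 0.25) + a))
```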


[Figure: plots of the distributions of X, R = X + a and Z = 2X + a]

Pay attention to the ticks on the axes


Two special transformations:

• centred variable:

$X^c = X - \bar{x}$, $\quad \bar{x}^c = 0$, $\quad \sigma_{X^c}^2 = \sigma_X^2$

• standardized variable:

$X^s = \dfrac{X - \bar{x}}{\sigma_X}$, $\quad \bar{x}^s = 0$, $\quad \sigma_{X^s}^2 = 1$
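A short sketch of the two transformations on made-up data. The helper var_n() (variance with divisor n) is an assumption for illustration. R's scale() standardizes with sd(), which uses divisor n − 1, so its output has variance 1 under that convention rather than under the divisor-n one.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # illustrative data

# Variance with divisor n (hypothetical helper)
var_n <- function(v) mean((v - mean(v))^2)

xc <- x - mean(x)                       # centred variable
xs <- (x - mean(x)) / sqrt(var_n(x))    # standardized with the divisor-n sigma

print(c(mean_xc = mean(xc), var_xc = var_n(xc), var_x = var_n(x)))
print(c(mean_xs = mean(xs), var_xs = var_n(xs)))

# scale() centres and divides by sd(), i.e. divisor n - 1
xs_r <- as.vector(scale(x))
print(var(xs_r))
```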


Bivariate case

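X with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$; Y with $\bar{y}$, $\sigma_Y^2$ and $q_\alpha^Y$

• change of position: $R = X + a$, $M = Y + c$

$\Gamma(R, M) = \Gamma(X, Y)$, $\quad \rho(R, M) = \rho(X, Y)$

• change of dispersion: $T = b\,X$, $N = d\,Y$ (with $b, d \neq 0$)

$\Gamma(T, N) = b\,d\,\Gamma(X, Y)$, $\quad \rho(T, N) = \mathrm{sign}(b\,d)\,\rho(X, Y)$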
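The bivariate rules can be checked in the same way. The data and constants below are assumptions for illustration; the scale factors are chosen with b d < 0 so that the sign change of ρ is visible. The name c0 stands for the constant c, to avoid clashing with R's c() function.

```r
# Illustrative data (made up for this sketch)
x <- c(160, 165, 170, 175, 180)
y <- c(55, 60, 62, 70, 78)

a <- 5;     c0 <- -2     # shifts
b <- -0.01; d  <- 1000   # scale factors, with b * d < 0

# Covariance with divisor n (hypothetical helper)
gamma <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

# Change of position: covariance and correlation are unchanged
print(all.equal(gamma(x + a, y + c0), gamma(x, y)))
print(all.equal(cor(x + a, y + c0), cor(x, y)))

# Change of scale: covariance is multiplied by b * d,
# correlation only by the sign of b * d
print(all.equal(gamma(b * x, d * y), b * d * gamma(x, y)))
print(all.equal(cor(b * x, d * y), sign(b * d) * cor(x, y)))
```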


In the bivariate and multivariate cases it is often useful to use standardized variables

[Figure: two scatter plots, "original variables" (y against x) and "scaled variables" (y_sc against x_sc)]

x = rnorm(100, 3, 1); y = rnorm(100, 3, 3)
x_sc = scale(x); y_sc = scale(y)

par(mfrow = c(1, 2))

plot(x, y, pch = 19, asp = 1, main = "original variables")

plot(x_sc, y_sc, pch = 19, asp = 1, main = "scaled variables")

## asp=1 gives the same scale on the horizontal and the vertical axis