Exploratory Data Analysis

(1)


Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/IIT/

(2)

Running example

Students in a statistics class measured their pulse rates; then each of them tossed a coin: if heads, they ran for one minute; if tails, they sat. Then they measured their pulse rates again.

Variables

PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: head, ran – 2: tail, sat)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inch)
WEIGHT    Weight (lb)
ACTIVITY  Frequency (0: nothing – 1: low – 2: moderate – 3: high)

OBS  PULSE1  PULSE2  RAN  SMOKES  SEX  HEIGHT  WEIGHT  ACTIVITY
1    64      88      1    2       1    66.00   140     2
2    58      70      1    2       1    72.00   145     2
3    62      76      1    1       1    73.50   160     3

(3)

Quantitative and qualitative variables

Quantitative variables represent measurable quantities

weight, height, number of times that a phenomenon occurs, . . .

Qualitative (or categorical) variables take values that are names or labels

- ordinal variables (the observed values admit a natural order)

response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, . . .

- nominal variables

blood groups, colours, gender, smoking habit, . . .

(4)

Distribution of a categorical variable X and contingency tables

Notation

- E = {1, . . . , k, . . . , K} coding of the levels (or other choices)

- n number of units

- n_k observed count of level k; $\sum_{k=1}^{K} n_k = n$

- f_k = n_k/n (relative) frequency of the level k; $\sum_{k=1}^{K} f_k = 1$

Example – Activity

Codings 0: nothing – 1: low – 2: moderate – 3: high

level      k    n_k   f_k      f_k (%)
nothing    0     1    0.0109     1.09
low        1     9    0.0978     9.78
moderate   2    61    0.6630    66.30
high       3    21    0.2283    22.83
total           92    1        100

tab=table(ACTIVITY)      ## counts n_k
freq=prop.table(tab)     ## relative frequencies f_k
cbind(tab,round(freq,4),round(freq*100,2))

(5)

Barplot

[Figure: two bar plots of ACTIVITY (levels 0, 1, 2, 3): counts on the left, relative frequencies on the right]

par(mfrow=c(1,2))
barplot(table(ACTIVITY),cex.names=2,cex.axis=2)              ## counts
barplot(prop.table(table(ACTIVITY)),cex.names=2,cex.axis=2)  ## frequencies
par(mfrow=c(1,1))

Note the similarity: the two plots have the same shape and differ only in the scale of the vertical axis.

(6)

Joint distribution of two categorical variables X and Y

Example – Sex and Activity

Counts

                 ACTIVITY (Y)
                 0     1     2     3    total
SEX (X)   1      1     5    35    16     57
          2      0     4    26     5     35
      total      1     9    61    21     92

Relative frequencies (%)

                 ACTIVITY (Y)
                 0      1      2      3     total
SEX (X)   1     1.09   5.43  38.04  17.39   61.95
          2     0.00   4.35  28.26   5.43   38.04
      total     1.09   9.78  66.30  22.83  100

> SA=table(SEX,ACTIVITY);SA
> round(prop.table(SA)*100,2)
> margin.table(SA,1);margin.table(SA,2)

[Figure: four bar plots of the SEX and ACTIVITY counts: stacked and side-by-side bars grouped by SEX (1, 2), then stacked and side-by-side bars grouped by ACTIVITY (0, 1, 2, 3)]

AS=table(ACTIVITY,SEX)
par(mfcol=c(1,4))
barplot(AS,beside=F,cex.names=2);barplot(AS,beside=T,cex.names=2) ## note beside=T/F
barplot(SA,beside=F,cex.names=2);barplot(SA,beside=T,cex.names=2) ## note the first variable
par(mfcol=c(1,1))

(7)

Row profiles:

distribution of Y in the sub-groups identified by X (Y | X)

> row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sum is 100

     ACTIVITY
SEX      0      1      2      3
  1   1.75   8.77  61.40  28.07
  2   0.00  11.43  74.29  14.29

[Figure: bar plots of the two row profiles over the levels no, low, medium, high: males (left) and females (right), vertical axis from 0 to 1]

act=c("no","low","medium","high")
par(mfrow=c(1,2))
barplot(row_pr[1,],ylim=c(0,1),main="males",names.arg=act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
barplot(row_pr[2,],ylim=c(0,1),main="females",names.arg=act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
par(mfrow=c(1,1))

NOTE: useful when there are many levels

(8)

Column profiles:

distribution of X in the sub-groups identified by Y (X | Y)

> col_pr=prop.table(SA,2);round(col_pr*100,2) # each column sum is 100

     ACTIVITY
SEX       0      1      2      3
  1  100.00  55.56  57.38  76.19
  2    0.00  44.44  42.62  23.81

[Figure: four bar plots of the column profiles (M, F), one for each activity level: no, low, medium, high; vertical axis from 0 to 1]

gen=c("M","F")
par(mfrow=c(1,4))
for (j in 1:dim(SA)[2]) {
  barplot(col_pr[,j],ylim=c(0,1),main=act[j],names.arg=gen,cex.names=2,cex.axis=2,cex.main=2)
  abline(h=0)
}
par(mfrow=c(1,1))

Similar profiles in all the sub-groups ≡ “independence” of X and Y
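The comparison of profiles can also be sketched numerically. The lines below are only an illustration: they rebuild the SEX by ACTIVITY counts of the example by hand (in the slides SA comes from table(SEX,ACTIVITY)), and the names SA_counts, row_pr_demo, marg_Y and expected are hypothetical. Under exact independence every row profile would coincide with the marginal distribution of ACTIVITY, and each count would equal (row total × column total)/n.

```r
# Counts copied from the SEX x ACTIVITY table of the example
SA_counts <- matrix(c(1, 5, 35, 16,
                      0, 4, 26,  5),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(SEX = c("1", "2"),
                                    ACTIVITY = c("0", "1", "2", "3")))
n <- sum(SA_counts)

# Row profiles (Y | X) and marginal distribution of Y
row_pr_demo <- prop.table(SA_counts, 1)
marg_Y <- colSums(SA_counts) / n

# Counts expected if X and Y were exactly independent
expected <- outer(rowSums(SA_counts), colSums(SA_counts)) / n

# Profiles (in %) next to the marginal distribution, then expected counts
round(rbind(row_pr_demo, marginal = marg_Y) * 100, 2)
round(expected, 2)
```

The closer the observed counts are to the expected ones, the closer the profiles are to each other.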

(9)

Distribution of a quantitative variable X on a finite population

Example – First pulse rates

> PULSE1

 [1]  64  58  62  66  64  74  84  68  62  76  90  80  92  68  60  62  66  70  68
[20]  72  70  74  66  70  96  62  78  82 100  68  96  78  88  62  80  62  60  72
[39]  62  76  68  54  74  74  68  72  68  82  64  58  54  70  62  48  76  88  70
[58]  90  78  70  90  92  60  72  68  84  74  68  84  61  64  94  60  72  58  88
[77]  66  84  62  66  80  78  68  72  82  76  87  90  78  68  86  76

> sort(PULSE1)

 [1]  48  54  54  58  58  58  60  60  60  60  61  62  62  62  62  62  62  62  62
[20]  62  64  64  64  64  66  66  66  66  66  68  68  68  68  68  68  68  68  68
[39]  68  68  70  70  70  70  70  70  72  72  72  72  72  72  74  74  74  74  74
[58]  76  76  76  76  76  78  78  78  78  78  80  80  80  82  82  82  84  84  84
[77]  84  86  87  88  88  88  90  90  90  90  92  92  94  96  96 100

> round(prop.table(table(PULSE1)),2)

48 54 58 60 61 62 64 66 68 70 72 74 76 78 80

0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03

82 84 86 87 88 90 92 94 96 100

0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01

x_k observed value of X

f_k its (relative) frequency, with f_k ∈ (0, 1) and $\sum_{k=1}^{K} f_k = 1$

Distribution of X: (x₁, f₁), . . . , (x_k, f_k), . . . , (x_K, f_K)

(cf. qualitative variables; here there are, in general, many more distinct observed values)

(10)

Dot-plot

Useful when the observed values are “sparse”

[Figure: stacked dot-plot of PULSE1; axis from 50 to 100]

stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)

(11)

Cumulative distribution function of X, F_X (or F )

F_X(x) is the (relative) frequency of units with value less than or equal to x:

$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n}$ for $x \in \mathbb{R}$

- if x is observed, x = x_k: $F_X(x_k) = \sum_{i=1}^{k} f_i$

- if x is between two observed values, $x \in [x_k, x_{k+1})$, then F_X(x) = F_X(x_k)

- if x < min X, then F_X(x) = 0; if x ≥ max X, then F_X(x) = 1

F_X : R → [0, 1]

F_X is a step function
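The definition can be checked with a few lines of R. This is a minimal sketch, assuming the female heights (in cm) of the example that follows are typed in by hand from their counts; the names heights_cm, F_hand, F_R and pts are hypothetical.

```r
# Female heights in cm, rebuilt from the counts n_k of the example
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(  1,   5,   1,   4,   2,   4,   1,   4,   3,   6,   3,   1)
heights_cm <- rep(x_k, times = n_k)

# F(x) = #{obs <= x} / n, written directly from the definition
F_hand <- function(x, data) sapply(x, function(v) mean(data <= v))

# R's empirical cumulative distribution function, for comparison
F_R <- ecdf(heights_cm)

# Points below the minimum, on observed values, between them, above the maximum
pts <- c(150, 155, 157, 158, 178, 200)
cbind(x = pts, by_hand = F_hand(pts, heights_cm), ecdf = F_R(pts))
```

The value at 158 equals the value at 157, which shows the step behaviour between two observed values.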

(12)

Example – Female height (in cm)

• x_k sorted (and non repeated) observed values

• n_k count of x_k

• N_k cumulative counts

• f_k frequency of x_k

• F_k cumulative frequencies

x_k   n_k   N_k   f_k (%)   F_k (%)
155    1     1     2.86       2.86
157    5     6    14.29      17.14
159    1     7     2.86      20.00
160    4    11    11.43      31.43
163    2    13     5.71      37.14
165    4    17    11.43      48.57
166    1    18     2.86      51.43
168    4    22    11.43      62.86
170    3    25     8.57      71.43
173    6    31    17.14      88.57
175    3    34     8.57      97.14
178    1    35     2.86     100.00

[Figure: plot of ecdf(altezza_f); horizontal axis “altezza femmine (X)” (female height) from 155 to 180, vertical axis F_X (x) from 0 to 1]

(13)

R code

altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inch to cm) acts on all elements
## round(A) is equivalent to round(A,0)

table(altezza_f)          ## counts of the variable
cumsum(table(altezza_f))  ## cumulative counts

round(prop.table(table(altezza_f))*100,2)
round(cumsum(prop.table(table(altezza_f)))*100,2)
## percentage frequencies and percentage cumulative frequencies

plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
     xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf       empirical cumulative distribution function
## lwd        width of the lines
## cex.axis   size of the axis annotation
## xlab ylab  labels of the axes

(14)

Quantile and percentile function (reverse problem)

Ex: the central value, the value that delimits the first quarter, last quarter of the sorted data

Determine x such that F_X(x) is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)

Example – Female height

rank r_i 1 2 3 4 5 6 7 8 9 10

value x_(i) 155 157 157 157 157 157 159 160 160 160

rank r_i 11 12 13 14 15 16 17 18 19 20

value x_(i) 160 163 163 165 165 165 165 166 168 168

rank r_i 21 22 23 24 25 26 27 28 29 30

value x_(i) 168 168 170 170 170 173 173 173 173 173

rank r_i 31 32 33 34 35

value x_(i) 173 175 175 175 178

- central value: 18th value (0.50 × 35 = 17.5)

- first quartile is between the 8th and 9th values (0.25 × 35 = 8.75)

- third quartile is between the 26th and 27th values (0.75 × 35 = 26.25)

Here the quartiles are uniquely determined (the 8th and 9th values are both 160, the 26th and 27th are both 173), but in general they are NOT.

Statistical software offers several definitions (the R function quantile has 9 types).

The k-th percentile of a set of values divides them so that about k% lie below and (100 − k)% lie above

(15)

Inverting a non-invertible function (step function)

For each frequency α, the quantile q_α is the minimum real number such that F(q_α) ≥ α:

$q_\alpha = \widetilde{F}^{-1}(\alpha) = \min\{x \in \mathbb{R} \ \text{s.t.}\ F(x) \ge \alpha\}, \quad \alpha \in (0, 1)$

It is the i-th sorted value, where i is the ceiling of nα: $q_\alpha = x_{(\lceil n\alpha \rceil)}$

If α is expressed as a percentage, q_α is called a percentile.

In general (with all definitions)

$q_\alpha \in \left[ x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)} \right]$ where $\lfloor n\alpha \rfloor$ is the integer part of nα

> quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178

Pay attention: type=1 gives “our” definition (the default is type=7). See help(quantile)
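The rule “i-th sorted value with i the ceiling of nα” can be sketched as follows. The female heights are again typed in by hand from their counts, and the names heights_cm and q_hand are hypothetical; only quantile() and its type argument come from R itself.

```r
# Female heights in cm, rebuilt from the counts of the example
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(  1,   5,   1,   4,   2,   4,   1,   4,   3,   6,   3,   1)
heights_cm <- rep(x_k, times = n_k)

# q_alpha = i-th sorted value, with i = ceiling(n * alpha)
q_hand <- function(data, alpha) {
  s <- sort(data)
  s[ceiling(length(s) * alpha)]
}

alpha <- c(0.25, 0.50, 0.75)
q_hand(heights_cm, alpha)               # quartiles from the definition
quantile(heights_cm, alpha, type = 1)   # same definition in R
quantile(heights_cm, alpha)             # default type = 7, may differ in general
```

With n = 35 the indices are 9, 18 and 27, in agreement with the ranks listed in the example.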

(16)

Tukey’s five-number summary

minimum, Q1, Q2, Q3, maximum

> fivenum(altezza_f)
[1] 155 160 166 173 178

Interquartile range (IQR): Q3 − Q1

The interval [Q1, Q3] contains about 50% of the data, the “central” ones

(17)

The box-plot

[Figure: schematic box-plot showing Q1, Q2 and Q3, the distances 1.5 IQR and 3 IQR on each side of the box, and an observation marked *]

- the left and right sides of the box are Q1 and Q3

- the vertical line inside the box is the median

- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box

- outliers (beyond the end of the whisker) are denoted by o (at times observations further than 3 IQR from the box are denoted by *)

(18)

Example - Weight

[Figure: dot-plot and box-plot of peso (weight in kg) on the same horizontal axis, from 50 to 90]

dot-plot and box-plot

peso=round(WEIGHT*0.45359,1)
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)
par(new=T)
boxplot(peso, horizontal=T)

(19)

> peso=round(WEIGHT*0.45359,1)

> sort(peso)

 [1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5

> fivenum(peso)

[1] 43.10 56.70 65.80 70.75 97.50

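- IQR = 70.75 − 56.70 = 14.05

- 1.5 IQR = 21.075

- 3 IQR = 42.15

- Q1 − 1.5 IQR = 56.70 − 21.075 = 35.625. No observation is less than 35.6

- Q3 + 1.5 IQR = 70.75 + 21.075 = 91.825. Only 97.5 is greater than 91.8: it is the only outlier

- Q3 + 3 IQR = 70.75 + 42.15 = 112.9. No observation is greater than 112.9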
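The arithmetic above can be reproduced in R. The sketch below starts from the five numbers printed by fivenum(peso) rather than from the data, so that it is self-contained; the names five, iqr and fences are hypothetical.

```r
# Five-number summary of peso, copied from the fivenum() output above
five <- c(min = 43.10, Q1 = 56.70, Q2 = 65.80, Q3 = 70.75, max = 97.50)

iqr <- unname(five["Q3"] - five["Q1"])

# Fences at 1.5 IQR (inner) and 3 IQR (outer) from the box
fences <- c(lower_outer = unname(five["Q1"]) - 3.0 * iqr,
            lower_inner = unname(five["Q1"]) - 1.5 * iqr,
            upper_inner = unname(five["Q3"]) + 1.5 * iqr,
            upper_outer = unname(five["Q3"]) + 3.0 * iqr)
iqr
fences

# Is the largest observation beyond the upper inner fence?
unname(five["max"] > fences["upper_inner"])
```

When the vector peso is available, boxplot.stats(peso) returns the whisker ends and the outliers directly, using the same 1.5 IQR rule by default.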

(20)

Example - Second pulse rates

Are they true outliers?

Some students ran, some sat.

[Figure: box-plot of PULSE2 with some observations beyond the whisker; axis from 60 to 140]

(21)

Sub-groups: graphic comparisons

Example - Second pulse rates

Box plot

Each Tukey number for the students who ran is larger than for the students who sat.

Different dispersion

Different “form” (more or less symmetrical w.r.t. the median)

[Figure: box-plots of PULSE2 for the two groups, labelled corsa (ran) and fermo (sat); axis from 60 to 140]

boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))

(22)

Cumulative distribution function

The CDF of the students who sat is steeper than the CDF of the students who ran:

steep slope ≡ low dispersion

Each quantile of “ran” is greater than the corresponding quantile of “sat”

plot(ecdf(PULSE2[RAN==1]),pch=19,col="blue",
     cex.axis=1.2, xlim=c(50,140),main="")
par(new=T)
# the following plot is drawn on the same graphic window
plot(ecdf(PULSE2[RAN==2]),pch=17,
     cex.axis=1.2, xlim=c(50,140),col="red",main="")
# pch=19 circle pch=17 triangle

[Figure: the two empirical CDFs on the same axes; horizontal axis x from 60 to 140, vertical axis Fn(x) from 0 to 1]

red triangle: sat; blue circle: ran

(23)

Example - Male and female height

[Figure: box-plots of height for M and F (axis from 155 to 190) and the two empirical CDFs; horizontal axis x from 155 to 190, vertical axis Fn(x) from 0 to 1]

Same “form”, slope and dispersion; different values

(24)

Indices for quantitative variables

Position indices (central tendency)

- median: Q2 = min{x | F(x) ≥ 0.50}

- mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

- trimmed mean: mean of the central 90% of the data

- mode: value with maximal frequency

Properties

1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

2. $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - a)^2 \quad \forall a \in \mathbb{R}$

   the mean is the centroid of the distribution (equilibrium point)

3. $\sum_{i=1}^{n} |x_i - Q_2| \le \sum_{i=1}^{n} |x_i - a| \quad \forall a \in \mathbb{R}$

The mean is affected by outliers; the median is not.

[Figure: stacked dot-plot; axis from 60 to 140]
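The three properties and the remark on outliers can be checked numerically. This is only a sketch on made-up data: the vector x, the grid a_grid and the helper names ss and sa are hypothetical, and the median is computed with type=1 to agree with the definition Q2 = min{x | F(x) ≥ 0.50}.

```r
x <- c(117, 118, 121, 122, 126, 128, 129, 133, 150, 156)  # hypothetical data

xbar <- mean(x)
Q2   <- quantile(x, 0.5, type = 1, names = FALSE)  # median as min{x : F(x) >= 0.5}

sum(x - xbar)                        # property 1: zero up to rounding error

ss <- function(a) sum((x - a)^2)     # sum of squared deviations from a
sa <- function(a) sum(abs(x - a))    # sum of absolute deviations from a

# Compare the mean and the median with a grid of alternative values a
a_grid <- seq(min(x), max(x), by = 0.5)
all(ss(xbar) <= sapply(a_grid, ss))  # property 2
all(sa(Q2)   <= sapply(a_grid, sa))  # property 3

# Replace the largest value by an outlier: the mean moves, the median does not
x_out <- replace(x, which.max(x), 400)
c(mean = mean(x_out), median = quantile(x_out, 0.5, type = 1, names = FALSE))
```

A grid search is of course not a proof; it only illustrates that no value on the grid does better than the mean (for squares) or the median (for absolute values).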

(25)

Variability indices

• ranges

- range (R): max − min

- interquartile range (IQR): Q3 − Q1

• indices based on the deviations from a central value

- variance and standard deviation (V(X) or σ_X² or σ² – std(X) or σ_X or σ)

$V(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \mathrm{std}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

(Variance and standard deviation can also be defined with n − 1 in place of n)

- mean absolute deviations (from mean and median)

$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \qquad \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$

• variability w.r.t. a central value

- coefficient of variation: $CV(X) = \frac{\mathrm{std}(X)}{\bar{x}}$ (if $\bar{x} \ne 0$)

Properties: $\sigma \le \frac{R}{2}$ and $|\bar{x} - Q_2| \le \sigma$
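A short sketch of these indices in R, using the first ten PULSE1 values listed earlier. One point deserves attention: the R functions var() and sd() use the divisor n − 1, so the variance with divisor n has to be computed by hand or obtained by rescaling. The names pop_var, pop_sd, mad_mean, mad_median and cv are hypothetical.

```r
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)  # first ten PULSE1 values

n <- length(x)
pop_var <- mean((x - mean(x))^2)   # V(X) with divisor n
pop_sd  <- sqrt(pop_var)

# R's var() uses the divisor n - 1; rescaling recovers the divisor-n version
c(pop_var = pop_var, var_R = var(x), rescaled = var(x) * (n - 1) / n)

mad_mean   <- mean(abs(x - mean(x)))      # mean absolute deviation from the mean
mad_median <- mean(abs(x - median(x)))    # mean absolute deviation from the median
cv         <- pop_sd / mean(x)            # coefficient of variation
c(mad_mean = mad_mean, mad_median = mad_median, cv = cv)

# The two properties: sigma <= R/2 and |mean - median| <= sigma
pop_sd <= diff(range(x)) / 2
abs(mean(x) - median(x)) <= pop_sd
```

Here median() is R's default median, which averages the two central values when n is even; the inequality holds for either definition.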

(26)

Mean and variance in a population and its sub-groups

Two sub-groups: A and B

n_A and n_B units, f_A and f_B frequencies, $\bar{x}_A$ and $\bar{x}_B$ means, σ_A² and σ_B² variances.

[Figure: two histograms (Frequency on the vertical axis, values from 10 to 50)]

n_A = 100, $\bar{x}_A$ = 9.9, σ_A² = 6.5; n_B = 300, $\bar{x}_B$ = 30.0, σ_B² = 4.4

Mean

$\bar{x}_{tot} = f_A \bar{x}_A + f_B \bar{x}_B$ (weighted mean)

$\bar{x}_{tot} = \frac{100}{400}\, 9.9 + \frac{300}{400}\, 30.0 = 24.98$

(27)

Variance

[Figure: histograms of the sub-groups in two settings (Frequency on the vertical axis, values from 10 to 50): above with $\bar{x}_B$ = 30.0, below with $\bar{x}_B$ = 40.0]

$\sigma_{tot}^2 = f_A \sigma_A^2 + f_B \sigma_B^2 + f_A (\bar{x}_A - \bar{x}_{tot})^2 + f_B (\bar{x}_B - \bar{x}_{tot})^2$

weighted variance plus weighted “variance of the means”

(28)

In the example:

above: σ_tot² = 80.68

below: σ_tot² = 179.53

In general

$\bar{x}_{tot} = \sum_{k=1}^{K} f_k \bar{x}_k$

$\sigma_{tot}^2 = \sum_{k=1}^{K} f_k \sigma_k^2 + \sum_{k=1}^{K} f_k (\bar{x}_k - \bar{x}_{tot})^2$

total variance = within classes variance + between classes variance
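The decomposition can be verified with the numbers of the example (n_A = 100, mean 9.9, variance 6.5; n_B = 300, mean 30.0, variance 4.4). The second part of the sketch checks the identity on raw data; the two small groups g1 and g2 are made up, and all object names are hypothetical.

```r
# Part 1: summary values of the example with mean of B equal to 30.0
f  <- c(A = 100, B = 300) / 400       # sub-group frequencies
m  <- c(A = 9.9, B = 30.0)            # sub-group means
s2 <- c(A = 6.5, B = 4.4)             # sub-group variances

m_tot   <- sum(f * m)                 # weighted mean
within  <- sum(f * s2)                # within classes variance
between <- sum(f * (m - m_tot)^2)     # between classes variance
s2_tot  <- within + between
round(c(mean = m_tot, within = within, between = between, total = s2_tot), 2)

# Part 2: the same identity on raw data (hypothetical values), divisor n
pvar <- function(v) mean((v - mean(v))^2)
g1 <- c(8, 10, 12, 9, 11)
g2 <- c(29, 31, 30, 28, 32, 30)
x_all <- c(g1, g2)
fk <- c(length(g1), length(g2)) / length(x_all)

lhs <- pvar(x_all)
rhs <- sum(fk * c(pvar(g1), pvar(g2))) +
       sum(fk * (c(mean(g1), mean(g2)) - mean(x_all))^2)
c(total = lhs, within_plus_between = rhs)
```

The identity holds exactly only when all variances use the divisor n; with var(), which divides by n − 1, the two sides differ.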

(29)

The histogram

Pay attention to the different “form” of the two histograms of the same data with different break points.

Left: asymmetrical distribution. Right: symmetrical distribution.

[Figure: two histograms of dati_ist (Frequency on the vertical axis from 0 to 15, values from 110 to 160)]

dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135,
           136,137,138,139,141,142,144,148,150,155,156)
par(mfrow=c(1,2))
hist(dati_ist,breaks=c(115,130, 145,160),xlim=c(110,160),freq=T,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127, 142,157),xlim=c(110,160),freq=T,ylim=c(0,16))
par(mfrow=c(1,1))

Use a histogram only if different break points and class widths produce the same “form”.
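The change of “form” can also be read from the class counts, without drawing anything. A minimal sketch with the same data and break points; plot=FALSE makes hist() return the counts only, and the names h_left and h_right are hypothetical.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# Same data, two sets of break points (classes are closed on the right)
h_left  <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h_right <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

# Counts per class: decreasing on the left, peaked in the middle on the right
rbind(left = h_left$counts, right = h_right$counts)
```

Shifting the break points by 3 units is enough to move six observations from the first class to the second.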

(30)

Distribution and graphic representation of two quantitative variables

X and Y measured on the same population

The set of the points (x_i, y_i) and of the corresponding frequencies is the joint distribution of X and Y .

The scatter-plot

Example: height and weight of the students of a graduate class

peso_alt= read.table(file="C:/DATA/stid98.txt",header=F,
                     col.names=c("altezza","peso","genere"))
attach(peso_alt)
plot(altezza,peso,pch=16,col="red")

[Figure: scatter-plot of peso (weight, from 50 to 90) against altezza (height, from 155 to 190)]

(31)

Indices for two quantitative variables

Covariance

X^c and Y^c centred variables: $X^c = X - \bar{x}$, $Y^c = Y - \bar{y}$

$\Gamma(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^c\, y_i^c = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Interpretation

The origin is in the centroid $(\bar{x}, \bar{y})$.

The points in the I and III quadrants give a positive contribution to $\sum_{i=1}^{n} x_i^c\, y_i^c$, while the points in the II and IV quadrants give a negative contribution.

[Figure: scatter-plot with the quadrants around the centroid marked + and −]

(32)

Hence the covariance is

- positive if, on average, high values of X correspond to high values of Y (and low values of X to low values of Y)

- negative if, on average, high values of X correspond to low values of Y (and low values of X to high values of Y)

- about zero otherwise

[Figure: three scatter-plots: y1 against x1 with positive covariance (+), y2 against x2 with covariance about zero (~0), y3 against x3 with negative covariance (−)]

(33)

Correlation index (or linear correlation index)

X^s and Y^s standardized variables:

$X^s = \frac{X - \bar{x}}{\sigma_X}$ and $Y^s = \frac{Y - \bar{y}}{\sigma_Y}$

$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^s\, y_i^s = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sqrt{\sum_{j} (x_j - \bar{x})^2}}\, \frac{(y_i - \bar{y})}{\sqrt{\sum_{j} (y_j - \bar{y})^2}} = \frac{\Gamma(X, Y)}{\sigma_X\, \sigma_Y}$

Some properties:

- it does not depend on the variability of the variables

- it is a pure number

- ρ(X, Y) ∈ [−1, 1]
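The chain of equalities can be followed step by step in R. The data below are made up, and the helper names pcov and psd (covariance and standard deviation with divisor n) are hypothetical. Note that cov() in R divides by n − 1, while cor() gives the same value with either divisor because the factor cancels.

```r
x <- c(66, 72, 73.5, 73, 69, 73, 72, 74)         # hypothetical heights
y <- c(140, 145, 160, 190, 155, 165, 150, 190)   # hypothetical weights
n <- length(x)

pcov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))  # divisor n
psd  <- function(u) sqrt(pcov(u, u))

gamma_xy <- pcov(x, y)

# Standardized variables
xs <- (x - mean(x)) / psd(x)
ys <- (y - mean(y)) / psd(y)

rho_std   <- mean(xs * ys)                     # mean of products of standardized values
rho_ratio <- gamma_xy / (psd(x) * psd(y))      # covariance over product of std

c(rho_std = rho_std, rho_ratio = rho_ratio, cor_R = cor(x, y))
c(cov_n = gamma_xy, cov_R = cov(x, y), rescaled = cov(x, y) * (n - 1) / n)
```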

(34)

Remark 1: positive correlation does not imply causality

Remark 2: the correlation index detects only linear dependence. In the plotted example ρ(X, Y) = 0.02.

[Figure: scatter-plot; horizontal axis from 20 to 40, vertical axis from 0 to 100]
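Remark 2 can be illustrated with a made-up example (not the data of the plot): Y is an exact function of X, yet the correlation index is zero because the dependence is not linear.

```r
x <- -10:10
y <- x^2          # Y is completely determined by X, but not linearly

cor(x, y)         # zero up to rounding error
cor(x, 3 * x + 1) # an exact increasing linear dependence, for comparison
```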

(35)

Covariance and correlation indices in a population and its sub-groups

$\Gamma_{tot}(X, Y) = \sum_{k=1}^{K} f_k\, \Gamma_k(X, Y) + \sum_{k=1}^{K} f_k\, (\bar{x}_k - \bar{x})(\bar{y}_k - \bar{y})$

(cf. formula of the variance)

(36)

Some paradoxes

left: ρ_tot < 0, ρ₁ > 0, ρ₂ > 0, ρ₃ > 0

right: ρ_tot ≃ 0, ρ₁ > 0, ρ₂ < 0

[Figure: two scatter-plots with sub-groups: left, b against a; right, y against x]

Example: height and weight of the students of a graduate class (continued)

ρ_tot = 0.61, ρ_M = 0.26, ρ_F = 0.78.

plot(altezza,peso,pch=c(16,17)[genere],
     col=c("red", "blue")[genere])

[Figure: scatter-plot of peso (weight) against altezza (height), with the two genders drawn with different symbols and colours; tot: 0.61, M: 0.26, F: 0.78]

(37)

Linear transformations of variables:

how the indices change

Univariate case

X variable with mean $\bar{x}$, variance σ_X² and quantiles q_α^X

• change of position: R = X + a

$\bar{r} = \bar{x} + a$    σ_R² = σ_X²    q_α^R = q_α^X + a

• change of dispersion/scale: T = b X

$\bar{t} = b\, \bar{x}$    σ_T² = b² σ_X²    q_α^T = b q_α^X (for b > 0)

• change of both indices: Z = b X + a

$\bar{z} = b\, \bar{x} + a$    σ_Z² = b² σ_X²    q_α^Z = b q_α^X + a (for b > 0)
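These rules are easy to check numerically. The sketch below uses the first ten PULSE1 values and two arbitrary constants a and b with b > 0 (for b < 0 the order of the data is reversed and the rule for the quantiles no longer applies as written). The helper names pvar and q1 are hypothetical.

```r
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)  # first ten PULSE1 values
a <- 8; b <- 2                                   # arbitrary constants, b > 0
z <- b * x + a

pvar <- function(v) mean((v - mean(v))^2)        # variance with divisor n
q1   <- function(v, alpha) quantile(v, alpha, type = 1, names = FALSE)

alpha <- c(0.25, 0.50, 0.75)

# Each index computed on z directly and through the transformation rule
rbind(mean     = c(direct = mean(z), formula = b * mean(x) + a),
      variance = c(direct = pvar(z), formula = b^2 * pvar(x)))
rbind(direct = q1(z, alpha), formula = b * q1(x, alpha) + a)
```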

(38)

[Figure: plots of X and of the transformed variables R = X + 8 and Z = 2X + 8 (horizontal axes from 0 to 20); box-plots of X, R = X + 3 and Z = 2X + 3 (axes from 0 to 20)]

Pay attention to the ticks on the axes

(39)

Two special transformations:

• centered variable:

$X^c = X - \bar{x}$    $\bar{x}^c = 0$    $\sigma_{X^c}^2 = \sigma_X^2$

• standardized variable:

$X^s = \frac{X - \bar{x}}{\sigma_X}$    $\bar{x}^s = 0$    $\sigma_{X^s}^2 = 1$

(40)

Bivariate case

X with $\bar{x}$, σ_X² and q_α^X; Y with $\bar{y}$, σ_Y² and q_α^Y

• change of position: R = X + a, M = Y + c

Γ(R, M) = Γ(X, Y)    ρ(R, M) = ρ(X, Y)

• change of dispersion: T = b X, N = d Y

Γ(T, N) = b d Γ(X, Y)    ρ(T, N) = sign(bd) ρ(X, Y)
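A numerical check of the bivariate rules, on made-up data and with arbitrary constants (b is taken negative on purpose, to show the change of sign of ρ). The helper pcov (covariance with divisor n) and the constant name cc are hypothetical.

```r
x <- c(66, 72, 73.5, 73, 69, 73, 72, 74)         # hypothetical data
y <- c(140, 145, 160, 190, 155, 165, 150, 190)
a <- 5; cc <- 10                                 # shifts
b <- -2; d <- 3                                  # scale factors, b * d < 0

pcov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))  # divisor n

# Change of position: covariance and correlation are unchanged
c(shifted = pcov(x + a, y + cc), original = pcov(x, y))
c(shifted = cor(x + a, y + cc),  original = cor(x, y))

# Change of dispersion: covariance is multiplied by b*d, correlation by sign(b*d)
c(scaled = pcov(b * x, d * y), formula = b * d * pcov(x, y))
c(scaled = cor(b * x, d * y),  formula = sign(b * d) * cor(x, y))
```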