Exploratory Data Analysis part 2


Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/UnigeStat/


Distribution of a quantitative variable X on a finite population

Example – First pulse rates

PULSE1

 [1] 64 58 62 66 64 74 84 68 62 76 90 80 92 68 60 62 66 70 68
[20] 72 70 74 66 70 96 62 78 82 100 68 96 78 88 62 80 62 60 72
[39] 62 76 68 54 74 74 68 72 68 82 64 58 54 70 62 48 76 88 70
[58] ...

sort(PULSE1)

 [1] 48 54 54 58 58 58 60 60 60 60 61 62 62 62 62 62 62 62 62
[20] 62 64 64 64 64 66 66 66 66 66 68 68 68 68 68 68 68 68 68
[39] 68 68 70 70 70 70 70 70 72 72 72 72 72 72 74 74 74 74 74
[58] 76 76 76 76 76 78 78 78 78 78 80 80 80 82 82 82 84 84 84
[77] 84 86 87 88 88 88 90 90 90 90 92 92 94 96 96 100

round(prop.table(table(PULSE1)),2)

  48   54   58   60   61   62   64   66   68   70   72   74   76   78   80
0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03
  82   84   86   87   88   90   92   94   96  100
0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01

sum(round(prop.table(table(PULSE1)),2)) ## beware of rounding: the rounded frequencies do not sum to 1
[1] 0.96
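A minimal sketch of the rounding effect noted above. The full PULSE1 vector is not listed here, so the data are rebuilt from the sorted output; the names `values`, `counts` and `pulse1` are introduced only for this illustration.

```r
# Rebuild the 92 pulse rates from the sorted output above
# (assumption: values and counts transcribed from that output).
values <- c(48, 54, 58, 60, 61, 62, 64, 66, 68, 70, 72, 74, 76,
            78, 80, 82, 84, 86, 87, 88, 90, 92, 94, 96, 100)
counts <- c(1, 2, 3, 4, 1, 9, 4, 5, 11, 6, 6, 5, 5,
            5, 3, 3, 4, 1, 1, 3, 4, 2, 1, 2, 1)
pulse1 <- rep(values, counts)

freq <- prop.table(table(pulse1))    # exact relative frequencies
sum_exact   <- sum(freq)             # 1, up to floating-point error
sum_rounded <- sum(round(freq, 2))   # rounding errors accumulate

print(c(exact = sum_exact, rounded = sum_rounded))
```

Rounding is best applied only when displaying the frequencies, not before further computations.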

$x_k$: observed value of $X$

$f_k$: its (relative) frequency, with $f_k \in (0, 1)$ and $\sum_{k=1}^{K} f_k = 1$

Distribution of $X$: $(x_1, f_1), \ldots, (x_k, f_k), \ldots, (x_K, f_K)$

(a quantitative variable usually takes many more distinct observed values than a qualitative one)


Dot-plot

When the observed values are “not too close”

[Figure: stacked dot-plot of PULSE1; horizontal axis from 50 to 100]

stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)


The histogram

Note the different “form” of the two histograms of the same data, drawn with different break points

Left: asymmetrical distribution. Right: symmetrical distribution.

[Figure: two histograms of dati_ist side by side; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 15]

dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135, 136,137,138,139,141,142,144,148,150,155,156)

par(mfrow=c(1,2))

hist(dati_ist,breaks=c(115,130,145,160),xlim=c(110,160),freq=T,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127,142,157),xlim=c(110,160),freq=T,ylim=c(0,16))
par(mfrow=c(1,1))

Use a histogram only if different break points and class widths produce the same “form”
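The change of “form” can also be checked numerically, without drawing anything. The sketch below uses `hist(..., plot = FALSE)` to return the class counts for the two sets of break points; the names `h1` and `h2` are introduced only for this illustration.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# plot = FALSE returns the histogram object instead of drawing it
h1 <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h2 <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

print(h1$counts)   # 15 11  4: decreasing, asymmetrical
print(h2$counts)   #  9 16  5: peak in the middle class
```

Both sets of classes have width 15; only the starting point moves, and that is enough to change the apparent form.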


Use a large number of classes

[Figure: histogram of dati_ist with many narrow classes; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 3]

hist(dati_ist,breaks=30,xlim=c(110,160),freq=T)

## breaks=30 asks for about 30 classes; R treats the number as a suggestion and chooses "pretty" break points


Cumulative distribution function of $X$, $F_X$ (or $F$)

$F_X(x)$ is the (relative) frequency of units with value less than or equal to $x$:

$$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n} \quad \text{for } x \in \mathbb{R}$$

- if $x$ is observed, $x = x_k$: $F_X(x_k) = \sum_{i=1}^{k} f_i$

- if $x$ is between two observed values, $x \in [x_k, x_{k+1})$, then $F_X(x) = F_X(x_k)$

- if $x < \min X$ then $F_X(x) = 0$; if $x \ge \max X$ then $F_X(x) = 1$

$F_X : \mathbb{R} \to [0, 1]$

$F_X$ is a step-function
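The three cases above can be tried on a toy sample. This is a minimal sketch with hypothetical data (the vector `x` is invented for illustration), using R's `ecdf`, which returns $F_X$ as a function.

```r
x <- c(2, 3, 3, 5)          # hypothetical toy data, n = 4
F_X <- ecdf(x)              # empirical cumulative distribution function

pts <- c(1, 2, 2.5, 3, 4.9, 5, 6)
vals <- F_X(pts)
print(rbind(x = pts, F_X = vals))
# below the minimum: 0; at 2 and on [2, 3): 1/4;
# at 3 and on [3, 5): 3/4; from the maximum onwards: 1

# the definition #{obs <= x} / n gives the same numbers
by_hand <- sapply(pts, function(p) mean(x <= p))
```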


Example – Female height (in cm)

• $x_k$: sorted (and non-repeated) observed values

• $n_k$: count of $x_k$

• $N_k$: cumulative counts

• $f_k$: frequency of $x_k$ (in %)

• $F_k$: cumulative frequencies (in %)

x_k   n_k   N_k   f_k (%)   F_k (%)
155     1     1     2.86      2.86
157     5     6    14.29     17.14
159     1     7     2.86     20.00
160     4    11    11.43     31.43
163     2    13     5.71     37.14
165     4    17    11.43     48.57
166     1    18     2.86     51.43
168     4    22    11.43     62.86
170     3    25     8.57     71.43
173     6    31    17.14     88.57
175     3    34     8.57     97.14
178     1    35     2.86    100.00

[Figure: plot of ecdf(altezza_f); x-axis "altezza femmine (X)" (female height) from 155 to 180, y-axis F_X (x) from 0 to 1]


R code

altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inches to cm) acts on all elements
## round(A) is equivalent to round(A,0)

count_alt_f=table(altezza_f)          ## counts of the variable
cum_count_alt_f=cumsum(count_alt_f)   ## cumulative counts

freq_alt_f=prop.table(count_alt_f)
round(freq_alt_f*100,2)
round(cumsum(freq_alt_f)*100,2)
## percentage frequencies and percentage cumulative frequencies

plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
     xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf: empirical cumulative distribution function
## lwd: line width
## cex.axis: magnification of the axis annotation
## xlab, ylab: axis labels
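The objects HEIGHT and SEX are not defined in these slides, so the code above cannot be run as it stands. The sketch below is a self-contained variant: it assumes the female heights can be rebuilt from the frequency table of the example (values `x_k`, counts `n_k`) and collects the five columns in a data frame called `tab`, a name introduced only here.

```r
# Assumption: heights rebuilt from the table of the example (n = 35)
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1)
altezza_f <- rep(x_k, n_k)

count_alt_f <- table(altezza_f)          # counts
freq_alt_f  <- prop.table(count_alt_f)   # relative frequencies

tab <- data.frame(
  x_k = as.numeric(names(count_alt_f)),
  n_k = as.vector(count_alt_f),
  N_k = as.vector(cumsum(count_alt_f)),                  # cumulative counts
  f_k = round(as.vector(freq_alt_f) * 100, 2),           # percentages
  F_k = round(as.vector(cumsum(freq_alt_f)) * 100, 2)    # cumulative percentages
)
print(tab)
```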


Quantile and percentile function (reverse problem)

E.g. the central value, or the values that delimit the first quarter and the last quarter of the sorted data

Determine x such that F_X(x) is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)

Example – Female height

rank r_i       1   2   3   4   5   6   7   8   9  10
value x_(i)  155 157 157 157 157 157 159 160 160 160

rank r_i      11  12  13  14  15  16  17  18  19  20
value x_(i)  160 163 163 165 165 165 165 166 168 168

rank r_i      21  22  23  24  25  26  27  28  29  30
value x_(i)  168 168 170 170 170 173 173 173 173 173

rank r_i      31  32  33  34  35
value x_(i)  173 175 175 175 178

- central value: 18th value (0.50 × 35 = 17.5)

- first quartile is between the 8th and 9th values (0.25 × 35 = 8.75)

- third quartile is between the 26th and 27th values (0.75 × 35 = 26.25)

Here the height values are uniquely determined, because the 8th and 9th values coincide and so do the 26th and 27th; in general they are NOT.

Statistical software offers many definitions (R has 9).

The k-th percentile of a set of real numbers partitions it so that (about) k% of its values are below the k-th percentile and (100 − k)% are above it
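The positions quoted above can be read off directly in R. A minimal sketch, assuming the 35 heights are rebuilt from the frequency table of the example (the names `x_k`, `n_k` and `x_sorted` are used only for this illustration):

```r
# Assumption: heights rebuilt from the frequency table (n = 35)
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1)
x_sorted <- sort(rep(x_k, n_k))
n <- length(x_sorted)

positions <- n * c(0.25, 0.50, 0.75)
print(positions)              # 8.75 17.50 26.25

print(x_sorted[c(8, 9)])      # neighbours of the first quartile: 160 160
print(x_sorted[18])           # central value of 35 sorted data: 166
print(x_sorted[c(26, 27)])    # neighbours of the third quartile: 173 173
```

Because the two neighbours coincide in both cases, every reasonable definition of the quartiles returns the same heights for these data.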


Invert a non-invertible function (step function): a mathematical definition

For each frequency $\alpha \in (0, 1)$, the quantile $q_\alpha$ is the minimum real number such that $F(q_\alpha) \ge \alpha$:

$$q_\alpha = F^{-}(\alpha) = \min\{x \in \mathbb{R} \ \text{s.t.}\ F(x) \ge \alpha\}, \quad \alpha \in (0, 1)$$

If $\alpha$ is expressed as a percentage, $\alpha \in (0, 100)$, $q_\alpha$ is called a percentile

In general (with all definitions; R has 9 definitions)

$$q_\alpha \in \left[\, x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)} \,\right]$$

where $\lfloor n\alpha \rfloor$ is the integer part of $n\alpha$

quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178

Note: type=1 gives “our” definition (the default is type=7)

See help(quantile)
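A sketch comparing the definition $q_\alpha = \min\{x : F(x) \ge \alpha\}$ with `quantile(..., type=1)`. The helper `q_min` is hypothetical (written only for this illustration), the heights are assumed to be rebuilt from the frequency table of the example, and the vector `toy` is invented to show a case where type=1 and type=7 disagree.

```r
# Assumption: heights rebuilt from the frequency table (n = 35)
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1)
altezza_f <- rep(x_k, n_k)

# Hypothetical helper: smallest observed value x with F(x) >= alpha
q_min <- function(x, alpha) {
  Fn <- ecdf(x)
  candidates <- sort(unique(x))
  min(candidates[Fn(candidates) >= alpha])
}

alphas <- c(0.25, 0.50, 0.75)
q_def <- sapply(alphas, function(a) q_min(altezza_f, a))
q_t1  <- unname(quantile(altezza_f, alphas, type = 1))
print(rbind(definition = q_def, type1 = q_t1))

# Toy data where the two types differ
toy <- c(1, 2, 3, 4)
toy_t1 <- unname(quantile(toy, 0.25, type = 1))   # an observed value
toy_t7 <- unname(quantile(toy, 0.25, type = 7))   # linear interpolation
print(c(type1 = toy_t1, type7 = toy_t7))
```

With type=1 the result is always one of the observed values; the default type=7 interpolates between neighbouring order statistics.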


Tukey’s five-number summary

minimum, Q1, Q2, Q3, maximum

fivenum(altezza_f)

[1] 155 160 166 173 178

Interquartile range (IQR): Q3 − Q1

The interval (Q1, Q3) contains (about) the “central” 50% of the data
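A short sketch of the five-number summary and the IQR for the female heights, again assuming the data are rebuilt from the frequency table of the example. The variable `inside` (a name introduced here) measures the share of data in the closed interval from Q1 to Q3.

```r
# Assumption: heights rebuilt from the frequency table (n = 35)
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1)
altezza_f <- rep(x_k, n_k)

five <- fivenum(altezza_f)     # minimum, Q1, Q2, Q3, maximum
iqr  <- five[4] - five[2]      # Q3 - Q1
print(five)
print(iqr)

# Share of data with Q1 <= x <= Q3: with many tied values it can
# differ noticeably from 50%
inside <- mean(altezza_f >= five[2] & altezza_f <= five[4])
print(inside)
```

For these data many observations are tied at Q1 and Q3, which is why the 50% figure should be read as approximate.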


The box-plot

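[Figure: schematic horizontal box-plot; Q1, Q2 and Q3 marked on the box, with a length of 1.5 IQR marked on each side of the box]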

- the left and right sides of the box are Q1 and Q3

- the vertical line inside the box is the median

- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box

- outliers (beyond the end of the whisker) are denoted by o


Example - Weight

[Figure: dot-plot and box-plot of peso (weight in kg) on the same horizontal axis, from 50 to 90]

peso=round(WEIGHT*0.45359,1)   ## from pounds to kg
boxplot(peso, horizontal=T)
par(new=T)
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)


sort(peso)

 [1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5

fivenum(peso)

[1] 43.10 56.70 65.80 70.75 97.50

- IQR = 70.75 − 56.70 = 14.05

- 1.5 IQR = 21.075

- 3 IQR = 42.15

- Q1 − 1.5 IQR = 56.70 − 21.075 = 35.625. No data point is less than 35.6

- Q3 + 1.5 IQR = 70.75 + 21.075 = 91.825. Only 97.5 is greater than 91.8
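The arithmetic above can be reproduced in R. In this sketch the vector `peso` is transcribed from the sorted output shown earlier (WEIGHT itself is not available here), and the names `lower_fence`, `upper_fence`, `outliers` and `whiskers` are introduced only for illustration.

```r
# Assumption: peso transcribed from the sorted output above (n = 92)
peso <- c(43.1, 46.3, 49.0, 49.0, 49.9, 49.9, 50.8, 52.2, 52.2, 52.6, 52.6,
          53.5, 53.5, 54.4, 54.4, 54.4, 54.9, 55.3, 55.8, rep(56.7, 5),
          rep(59.0, 5), 59.4, 60.3, 61.2, 61.2, 61.2, 61.7, 62.6, 62.6,
          rep(63.5, 4), 64.4, rep(65.8, 5), 67.1, rep(68.0, 10), 69.4,
          rep(70.3, 10), 71.2, rep(72.6, 4), 74.4, 74.8, rep(77.1, 4),
          79.4, 79.4, rep(81.6, 3), 83.9, rep(86.2, 4), 88.5, 97.5)

five <- fivenum(peso)
iqr  <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Observations beyond the fences are drawn as outliers
outliers <- peso[peso < lower_fence | peso > upper_fence]
# The whiskers reach the most extreme observations inside the fences
whiskers <- range(peso[peso >= lower_fence & peso <= upper_fence])

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
print(whiskers)
```

The same outliers are returned by `boxplot.stats(peso)$out`, which is what `boxplot` uses internally.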


Example - Second pulse rates

Are they true outliers?

Some students ran, some sat.

boxplot(PULSE2)

### horizontal=F is the default

[Figure: vertical box-plot of PULSE2, with some points flagged as outliers; axis from 60 to 140]


Sub-groups: graphic comparisons

Example - Second pulse rates

Box plot

Each Tukey number for the students who ran is larger than for the students who sat.

Different dispersion

Different “form” (more or less symmetrical w.r.t. the median)

[Figure: side-by-side box-plots of PULSE2 for the groups "corsa" (ran) and "fermo" (sat); axis from 60 to 140]

boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))


Cumulative distribution function

The CDF of the students who sat is steeper than the CDF of the students who ran:

steep slope ≡ low dispersion

Each quantile of “ran” is greater than the corresponding quantile of “sat”

plot(ecdf(PULSE2[RAN==1]),pch=19,
     cex.axis=1.2, xlim=c(50,140),main="")
par(new=T)
## the following plot is drawn on the same graphic window
plot(ecdf(PULSE2[RAN==2]),pch=17,
     cex.axis=1.2, xlim=c(50,140),col="red",main="")
## pch=19 circle, pch=17 triangle

[Figure: the two empirical CDFs overlaid; x-axis x from 60 to 140, y-axis Fn(x) from 0 to 1. Red triangles: sat; black circles: ran]
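The PULSE2 and RAN vectors are not listed in these slides, so the sketch below uses invented values (the vectors `ran` and `sat` are hypothetical and serve only as an illustration). It overlays the two empirical CDFs with `add = TRUE`, an alternative to `par(new=T)` that avoids drawing the axes twice, and checks the two statements above numerically.

```r
# Hypothetical pulse rates, invented only for this illustration
ran <- c(70, 84, 88, 96, 104, 118, 132, 140)
sat <- c(58, 62, 66, 68, 70, 72, 76, 80)

F_ran <- ecdf(ran)
F_sat <- ecdf(sat)

# Overlay the two step functions on the same axes
plot(F_ran, pch = 19, xlim = c(50, 140), main = "")
plot(F_sat, pch = 17, col = "red", add = TRUE)

# "ran" lies to the right of "sat": its CDF is never above the other one
grid_x <- seq(50, 140, by = 1)
ran_below_sat <- all(F_ran(grid_x) <= F_sat(grid_x))

# Quartiles with the definition used in these slides (type = 1)
q_ran <- unname(quantile(ran, c(0.25, 0.50, 0.75), type = 1))
q_sat <- unname(quantile(sat, c(0.25, 0.50, 0.75), type = 1))
print(rbind(ran = q_ran, sat = q_sat))
```

A CDF that rises from 0 to 1 over a short interval, as `F_sat` does here, corresponds to a small IQR and hence to low dispersion.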


Example - Male and female height

[Figure: box-plots of altezza for the groups M and F, axis from 155 to 190; and the two empirical CDFs overlaid, x-axis x from 155 to 190, y-axis Fn(x) from 0 to 1]

Same “form”, slope and dispersion; different values

altezza=HEIGHT*2.54

boxplot(altezza~SEX,cex.axis=1.2,lwd=2,names=c("M","F"))
plot(ecdf(altezza[SEX==1]),pch=19, cex.axis=1.2,
     xlim=c(min(altezza),max(altezza)),main="")   ## pay attention to xlim
par(new=T)   ## the following plot is drawn on the same graphic window
plot(ecdf(altezza[SEX==2]),pch=17,cex.axis=1.2,
     xlim=c(min(altezza),max(altezza)),col="red",main="")   ## pay attention to xlim