Exploratory Data Analysis part 2
Eva Riccomagno, Maria Piera Rogantin
DIMA – Università di Genova
http://www.dima.unige.it/~rogantin/UnigeStat/
Distribution of a quantitative variable X on a finite population
Example – First pulse rates
PULSE1
 [1] 64 58 62 66 64 74 84 68 62 76 90 80 92 68 60 62 66 70 68
[20] 72 70 74 66 70 96 62 78 82 100 68 96 78 88 62 80 62 60 72
[39] 62 76 68 54 74 74 68 72 68 82 64 58 54 70 62 48 76 88 70
[58] ...
sort(PULSE1)
 [1] 48 54 54 58 58 58 60 60 60 60 61 62 62 62 62 62 62 62 62
[20] 62 64 64 64 64 66 66 66 66 66 68 68 68 68 68 68 68 68 68
[39] 68 68 70 70 70 70 70 70 72 72 72 72 72 72 74 74 74 74 74
[58] 76 76 76 76 76 78 78 78 78 78 80 80 80 82 82 82 84 84 84
[77] 84 86 87 88 88 88 90 90 90 90 92 92 94 96 96 100
round(prop.table(table(PULSE1)),2)
  48   54   58   60   61   62   64   66   68   70   72   74   76   78   80
0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03
  82   84   86   87   88   90   92   94   96  100
0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01
sum(round(prop.table(table(PULSE1)),2)) ## pay attention to round
[1] 0.96
$x_k$: observed value of $X$
$f_k$: its (relative) frequency, with $f_k \in (0, 1)$ and $\sum_{k=1}^{K} f_k = 1$
Distribution of $X$: $(x_1, f_1), \ldots, (x_k, f_k), \ldots, (x_K, f_K)$
(there are many more observed values in a quantitative variable than in a qualitative one)
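The following sketch rebuilds the distribution of the first pulse rates and shows why the rounded frequencies do not add up to 1. The original data file is not part of these notes, so the values and counts are tallied from the sort(PULSE1) output above; the names values, counts, pulse1 and distribution are illustrative assumptions.

```r
# Observed values x_k and their counts, tallied from the sort(PULSE1) output
values <- c(48, 54, 58, 60, 61, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80,
            82, 84, 86, 87, 88, 90, 92, 94, 96, 100)
counts <- c(1, 2, 3, 4, 1, 9, 4, 5, 11, 6, 6, 5, 5, 5, 3,
            3, 4, 1, 1, 3, 4, 2, 1, 2, 1)
pulse1 <- rep(values, counts)   # hypothetical stand-in for PULSE1

# Distribution of X as the pairs (x_k, f_k)
freq <- prop.table(table(pulse1))
distribution <- data.frame(x_k = as.numeric(names(freq)),
                           f_k = as.vector(freq))

sum_exact   <- sum(distribution$f_k)             # exact frequencies
sum_rounded <- sum(round(distribution$f_k, 2))   # frequencies rounded first
print(c(exact = sum_exact, rounded = sum_rounded))
```

Rounding each of the 25 frequencies to two decimals discards a little mass each time, so it is safer to sum first and round only for display.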
Dot-plot
Useful when the observed values are “not too close” to each other
[Figure: dot-plot of PULSE1; horizontal axis from 50 to 100]
stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)
The histogram
Pay attention: two histograms of the same data can have a different “form” when the break points differ
Left: asymmetrical distribution. Right: symmetrical distribution
[Figure: two histograms of dati_ist side by side; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 15]
dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135,
           136,137,138,139,141,142,144,148,150,155,156)
par(mfrow=c(1,2))   ## two plots side by side
hist(dati_ist,breaks=c(115,130,145,160),xlim=c(110,160),freq=TRUE,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127,142,157),xlim=c(110,160),freq=TRUE,ylim=c(0,16))
par(mfrow=c(1,1))   ## back to a single plot
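To see numerically why the two pictures differ, one can ask hist for the class counts without drawing anything. This is a small sketch using the same data and break points; the names h_left and h_right are illustrative.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# plot = FALSE returns the histogram object instead of drawing it
h_left  <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h_right <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

print(h_left$counts)    # 15 11 4: decreasing, asymmetrical
print(h_right$counts)   # 9 16 5: a central peak, roughly symmetrical
```

Shifting every break point by 3 units moves six observations (the 128s and 129s) from the first class to the second, which is enough to change the apparent shape.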
Rely on a histogram only if different break points and class widths produce the same “form”
Use a large number of classes
[Figure: histogram of dati_ist with many narrow classes; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 3]
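hist(dati_ist,breaks=30,xlim=c(110,160),freq=TRUE)
## breaks=30 asks for about 30 classes; a single number is only a suggestion
## and R chooses "pretty" break points, so the actual number of classes can differ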
Cumulative distribution function of $X$, $F_X$ (or $F$)
$F_X(x)$ is the (relative) frequency of units with value less than or equal to $x$:
$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n}$ for $x \in \mathbb{R}$
- if $x$ is observed, $x = x_k$: $F_X(x_k) = \sum_{i=1}^{k} f_i$
- if $x$ lies between two observed values, $x \in [x_k, x_{k+1})$, then $F_X(x) = F_X(x_k)$
- if $x < \min X$ then $F_X(x) = 0$; if $x \ge \max X$ then $F_X(x) = 1$
$F_X : \mathbb{R} \to [0, 1]$
$F_X$ is a step function
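The definition can be checked directly in R. The sketch below counts the observations not larger than x by hand and compares the result with the built-in ecdf; it reuses the dati_ist values from the histogram example, and F_manual and pts are illustrative names.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)
n <- length(dati_ist)

# F_X(x) = #{obs <= x} / n, written exactly as in the definition
F_manual  <- function(x) sum(dati_ist <= x) / n
F_builtin <- ecdf(dati_ist)   # ecdf() returns a function

# below the minimum, at observed values, between two values, at and above the maximum
pts <- c(100, 117, 128, 128.5, 156, 200)
manual  <- sapply(pts, F_manual)
builtin <- F_builtin(pts)
print(rbind(x = pts, manual = manual, builtin = builtin))
```

The value at 128.5 equals the value at 128, which is the step-function behaviour: the function only jumps at observed values.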
Example – Female height (in cm)
• $x_k$: sorted (and non-repeated) observed values
• $n_k$: count of $x_k$
• $N_k$: cumulative counts
• $f_k$: frequency of $x_k$ (in % in the table)
• $F_k$: cumulative frequencies (in % in the table)
x_k   n_k   N_k   f_k (%)   F_k (%)
155     1     1      2.86      2.86
157     5     6     14.29     17.14
159     1     7      2.86     20.00
160     4    11     11.43     31.43
163     2    13      5.71     37.14
165     4    17     11.43     48.57
166     1    18      2.86     51.43
168     4    22     11.43     62.86
170     3    25      8.57     71.43
173     6    31     17.14     88.57
175     3    34      8.57     97.14
178     1    35      2.86    100.00
[Figure: ecdf(altezza_f), a step function; x-axis "altezza femmine (X)" from 155 to 180, y-axis "F_X (x)" from 0 to 1]
R code
altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inches to cm) acts on all elements
## round(A) is equivalent to round(A,0)
count_alt_f=table(altezza_f)          ## counts of the variable
cum_count_alt_f=cumsum(count_alt_f)   ## cumulative counts
freq_alt_f=prop.table(count_alt_f)
round(freq_alt_f*100,2)
round(cumsum(freq_alt_f)*100,2)
## percentage frequencies and percentage cumulative frequencies
plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
     xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf: empirical cumulative distribution function
## lwd: line width
## cex.axis: size of the axis annotation
## xlab, ylab: axis labels ("altezza femmine" = female height)
Quantile and percentile function (reverse problem)
Ex: the central value, and the values that delimit the first and the last quarter of the sorted data
Determine x such that $F_X(x)$ is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)
Example – Female height
rank r_i      1   2   3   4   5   6   7   8   9  10
value x_(i) 155 157 157 157 157 157 159 160 160 160
rank r_i     11  12  13  14  15  16  17  18  19  20
value x_(i) 160 163 163 165 165 165 165 166 168 168
rank r_i     21  22  23  24  25  26  27  28  29  30
value x_(i) 168 168 170 170 170 173 173 173 173 173
rank r_i     31  32  33  34  35
value x_(i) 173 175 175 175 178
- central value: the 18th value (0.50 × 35 = 17.5, and 18 = (35 + 1)/2), that is 166
- the first quartile is between the 8th and 9th values (0.25 × 35 = 8.75), both equal to 160
- the third quartile is between the 26th and 27th values (0.75 × 35 = 26.25), both equal to 173
Here the height values are uniquely determined, because the neighbouring values coincide, but in general they are NOT.
Statistical software offers several definitions (R has 9).
The k-th percentile of a set of real numbers partitions it so that k% of its values are below the k-th percentile and (100 − k)% are above it
Inverting a non-invertible function (a step function):
a mathematical definition
For each frequency $\alpha \in (0, 1)$, the quantile $q_\alpha$ is the minimum real number such that $F(q_\alpha) \ge \alpha$
$q_\alpha = F^{-}(\alpha) = \min\{x \in \mathbb{R} \text{ s.t. } F(x) \ge \alpha\}$, $\alpha \in (0, 1)$
If $\alpha$, $\alpha \in (0, 100)$, is expressed as a percentage, $q_\alpha$ is called percentile
In general (with all definitions; R has 9 of them)
$q_\alpha \in \left[ x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)} \right]$ where $\lfloor n\alpha \rfloor$ is the integer part of $n\alpha$
quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178
Pay attention: type=1 gives “our” definition (the default is type=7)
See help(quantile)
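As a check, the definition $q_\alpha = \min\{x : F(x) \ge \alpha\}$ can be coded directly and compared with quantile(..., type=1). In this sketch altezza_f is rebuilt from the frequency table of the female heights, since the original data file is not included here; q_def is an illustrative helper, not an R function.

```r
# Female heights rebuilt from the table of values x_k and counts n_k
altezza_f <- rep(c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178),
                 c(  1,   5,   1,   4,   2,   4,   1,   4,   3,   6,   3,   1))

# Smallest observed value whose cumulative frequency reaches alpha
q_def <- function(x, alpha) {
  xs <- sort(unique(x))    # candidate values x_k
  Fx <- ecdf(x)(xs)        # F(x_k) for each candidate
  xs[which(Fx >= alpha)[1]]
}

alphas <- c(0.25, 0.50, 0.75)
by_def <- sapply(alphas, function(a) q_def(altezza_f, a))
type1  <- unname(quantile(altezza_f, alphas, type = 1))
print(rbind(alpha = alphas, by_def = by_def, type1 = type1))
```

On data with fewer ties the default type=7, which interpolates between neighbouring sorted values, can return numbers that are not observed values at all.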
Tukey’s five-number summary
minimum, Q1, Q2, Q3, maximum
fivenum(altezza_f)
[1] 155 160 166 173 178
Interquartile range (IQR): Q3 − Q1
The interval (Q1, Q3) contains about 50% of the data, the “central” half
The box-plot
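[Figure: schematic horizontal box-plot; the box is marked Q1, Q2, Q3 and a distance of 1.5 IQR is marked on each side of the box]
- the left and right sides of the box are Q1 and Q3
- the vertical line inside the box is the median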
- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box
- outliers (beyond the end of the whisker) are denoted by o
Example - Weight
[Figure: dot-plot and box-plot of peso on the same horizontal axis, from 50 to 90]
peso=round(WEIGHT*0.45359,1)
boxplot(peso, horizontal=TRUE)
par(new=TRUE)   ## draw the next plot on the same graphics window
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)
sort(peso)
 [1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5
fivenum(peso)
[1] 43.10 56.70 65.80 70.75 97.50
- IQR = 70.75 − 56.70 = 14.05
- 1.5 IQR = 21.075
- 3 IQR = 42.15
- Q1 − 1.5 IQR = 56.70 − 21.075 = 35.625. No observation is less than 35.6
- Q3 + 1.5 IQR = 70.75 + 21.075 = 91.825. Only 97.5 is larger, so it is the single outlier
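The arithmetic above can be reproduced in a few lines. This sketch types in the sorted weights shown in the sort(peso) output (the original WEIGHT variable is not available here) and computes the fences, the whisker ends and the outliers; the names five, iqr, lower_fence, upper_fence, whiskers and outliers are illustrative.

```r
# Sorted weights in kg, copied from the sort(peso) output
peso <- c(43.1, 46.3, 49.0, 49.0, 49.9, 49.9, 50.8, 52.2, 52.2, 52.6, 52.6, 53.5,
          53.5, 54.4, 54.4, 54.4, 54.9, 55.3, 55.8, 56.7, 56.7, 56.7, 56.7, 56.7,
          59.0, 59.0, 59.0, 59.0, 59.0, 59.4, 60.3, 61.2, 61.2, 61.2, 61.7, 62.6,
          62.6, 63.5, 63.5, 63.5, 63.5, 64.4, 65.8, 65.8, 65.8, 65.8, 65.8, 67.1,
          68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 69.4, 70.3,
          70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 71.2, 72.6, 72.6,
          72.6, 72.6, 74.4, 74.8, 77.1, 77.1, 77.1, 77.1, 79.4, 79.4, 81.6, 81.6,
          81.6, 83.9, 86.2, 86.2, 86.2, 86.2, 88.5, 97.5)

five <- fivenum(peso)          # minimum, Q1, Q2, Q3, maximum
iqr  <- five[4] - five[2]      # Q3 - Q1

lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside   <- peso >= lower_fence & peso <= upper_fence
whiskers <- range(peso[inside])
outliers <- peso[!inside]

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(whiskers)
print(outliers)
```

The same outliers are reported by boxplot.stats(peso)$out, which is the function boxplot uses to decide which points to draw individually.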
Example - Second pulse rates
Are they true outliers?
Some students ran, some sat.
boxplot(PULSE2)
### horizontal=F is the default
[Figure: vertical box-plot of PULSE2 with several outliers above the upper whisker; vertical axis from 60 to 140]
Sub-groups: graphic comparisons
Example - Second pulse rates
Box plot
Each Tukey number for the students who ran is larger than for the students who sat.
Different dispersion
Different “form” (more or less symmetrical w.r.t. the median)
[Figure: side-by-side box-plots of PULSE2 for the groups "corsa" (ran) and "fermo" (sat); vertical axis from 60 to 140]
boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))
Cumulative distribution function
The CDF of the students who sat rises more steeply than the CDF of the students who ran:
steep slope ≡ low dispersion
Each quantile of “ran” is greater than the corresponding quantile of “sat”
plot(ecdf(PULSE2[RAN==1]),pch=19,
     cex.axis=1.2, xlim=c(50,140),main="")
par(new=T)
# the following plot is drawn on the same graphics window
plot(ecdf(PULSE2[RAN==2]),pch=17,
     cex.axis=1.2, xlim=c(50,140),col="red",main="")
# pch=19 circle, pch=17 triangle
[Figure: the two empirical CDFs of PULSE2 on the same axes; x from 60 to 140, Fn(x) from 0 to 1]
red triangle: sat
black circle: ran
Example - Male and female height
[Figure: side-by-side box-plots of height for M and F (vertical axis from 155 to 190) and the two empirical CDFs on the same axes (x from 155 to 190, Fn(x) from 0 to 1)]
Same “form”, slope and dispersion; different values
altezza=HEIGHT*2.54
boxplot(altezza~SEX,cex.axis=1.2,lwd=2,names=c("M","F"))
plot(ecdf(altezza[SEX==1]),pch=19, cex.axis=1.2,
     xlim=c(min(altezza),max(altezza)),main="")   ## pay attention to xlim
par(new=T)   # the following plot is drawn on the same graphics window
plot(ecdf(altezza[SEX==2]),pch=17,cex.axis=1.2,   ## pay attention to xlim
     xlim=c(min(altezza),max(altezza)),col="red",main="")