Exploratory Data Analysis part 3
Eva Riccomagno, Maria Piera Rogantin
DIMA – Università di Genova
http://www.dima.unige.it/~rogantin/UnigeStat/
Indices for quantitative variables
Position indices – central tendencies
- median: $Q_2 = \min\{x \mid F(x) \ge 0.50\}$
- mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- trimmed mean: mean of the central 90% of the data
- mode: value with maximal frequency
1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
2. $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - a)^2$ for all $a \in \mathbb{R}$
- the mean is the centroid of the distribution (equilibrium point)
- the mean is affected by outliers, the median is not
[Figure: dot plot of a sample; horizontal axis from 60 to 140]
3. $\sum_{i=1}^{n} |x_i - Q_2| \le \sum_{i=1}^{n} |x_i - a|$ for all $a \in \mathbb{R}$
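The position indices and the three properties above can be checked numerically. The following R sketch uses a made-up sample `x` and a hypothetical helper `stat_mode` (neither comes from the slides). It assumes that `quantile(..., type = 1)`, the inverse of the empirical distribution function, is the appropriate match for the definition of $Q_2$ used here, and that `mean(x, trim = 0.05)`, which drops 5% of the observations at each end, corresponds to the mean of the central 90%.

```r
# Made-up sample with one large value (an assumption, not the slides' data)
x <- c(61, 68, 72, 75, 75, 78, 80, 82, 84, 85,
       88, 90, 90, 90, 93, 95, 98, 102, 110, 140)

# Hypothetical helper: value(s) with maximal frequency
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}

q2   <- unname(quantile(x, 0.5, type = 1))  # Q2 = min{x : F(x) >= 0.50}
xbar <- mean(x)                             # arithmetic mean
tm   <- mean(x, trim = 0.05)                # drops 5% at each end (1 value per side here)
mo   <- stat_mode(x)                        # mode
c(median = q2, mean = xbar, trimmed = tm, mode = mo)

# Outlier sensitivity: inflate the largest value and recompute mean and median
x_out <- replace(x, which.max(x), 1400)
c(mean = mean(x_out), median = unname(quantile(x_out, 0.5, type = 1)))

# Property 1: deviations from the mean sum to zero (up to rounding error)
sum(x - xbar)

# Properties 2 and 3: evaluate both criteria on a grid of candidate values a
a_grid <- seq(min(x), max(x), by = 0.1)
ss  <- sapply(a_grid, function(a) sum((x - a)^2))   # sums of squared deviations
sad <- sapply(a_grid, function(a) sum(abs(x - a)))  # sums of absolute deviations

a_grid[which.min(ss)]                # grid point closest to the mean
all(sad >= sum(abs(x - q2)) - 1e-9)  # no candidate on the grid does better than Q2
```

With an even number of observations the sum of absolute deviations is constant between the two central values, so the last line checks the inequality rather than looking for a unique minimiser.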
Variability indices
• ranges
- range (R): max − min
- interquartile range (IQR): $Q_3 - Q_1$
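• indices based on the deviations from a central value
- variance and standard deviation ($V(X)$ or $\sigma_X^2$ or $\sigma^2$; $\mathrm{std}(X)$ or $\sigma_X$ or $\sigma$)
$$V(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \mathrm{std}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
(Variance and standard deviation can also be defined with $n - 1$ in place of $n$)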
- mean absolute deviations (from mean and median)
$$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \qquad \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$$
• variability w.r.t. central value
- coefficient of variation: $\mathrm{CV}(X) = \frac{\mathrm{std}(X)}{\bar{x}}$ (if $\bar{x} \ne 0$)
Properties: $\sigma \le \frac{R}{2}$ and $|\bar{x} - Q_2| \le \sigma$
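As a rough illustration, the variability indices above can be computed in R as follows. The data vector is made up, and the variable names are arbitrary. Two points are worth keeping in mind: R's `var()` and `sd()` use the divisor $n - 1$, so the divisor-$n$ versions are computed by hand, and quartiles are taken with `type = 1` on the assumption that this matches the definition of the quantiles used in these slides (the default of `quantile()` and `IQR()` is type 7).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data
n <- length(x)
xbar <- mean(x)
q <- unname(quantile(x, c(0.25, 0.5, 0.75), type = 1))  # Q1, Q2, Q3

rng <- max(x) - min(x)  # range R
iqr <- q[3] - q[1]      # interquartile range

v_n   <- mean((x - xbar)^2)  # variance with divisor n
sigma <- sqrt(v_n)           # standard deviation with divisor n
v_n1  <- var(x)              # R's built-in variance, divisor n - 1
all.equal(v_n, v_n1 * (n - 1) / n)  # the two differ only by the factor (n - 1) / n

mad_mean <- mean(abs(x - xbar))  # mean absolute deviation from the mean
mad_med  <- mean(abs(x - q[2]))  # mean absolute deviation from the median

cv <- sigma / xbar  # coefficient of variation (xbar is not 0 here)

# The two properties stated above
sigma <= rng / 2
abs(xbar - q[2]) <= sigma
```

Note that R's `mad()` is a different quantity (a scaled median absolute deviation), which is why the mean absolute deviations are written out explicitly.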
Mean and variance in a population and its sub-groups
Two subgroups: A and B
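$n_A$ and $n_B$ units, $f_A = n_A / n$ and $f_B = n_B / n$ relative frequencies (with $n = n_A + n_B$)
$\bar{x}_A$ and $\bar{x}_B$ means and $\sigma_A^2$ and $\sigma_B^2$ variances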
[Figure: histograms of subgroups A and B; horizontal axis from 10 to 50, vertical axis Frequency from 0 to 40]
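$n_A = 100$, $\bar{x}_A = 9.9$, $\sigma_A^2 = 6.5$; $n_B = 300$, $\bar{x}_B = 30.0$, $\sigma_B^2 = 4.4$
Mean
$\bar{x}_{tot} = f_A \bar{x}_A + f_B \bar{x}_B$ (weighted mean)
$\bar{x}_{tot} = \frac{100}{400} \cdot 9.9 + \frac{300}{400} \cdot 30.0 = 24.98$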
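The weighted mean of the example can be reproduced in a couple of lines of R. This is only a sketch using the summary values quoted above; the vector names `x_k`, `n_k` and `f_k` are arbitrary.

```r
x_k <- c(A = 9.9, B = 30.0)  # subgroup means from the example
n_k <- c(A = 100, B = 300)   # subgroup sizes from the example
f_k <- n_k / sum(n_k)        # relative frequencies: 0.25 and 0.75

sum(f_k * x_k)               # weighted mean written out explicitly
weighted.mean(x_k, w = n_k)  # same value via the built-in function
```

Both lines give 24.975, which rounds to the 24.98 reported above.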
Variance
[Figure: two panels of histograms of subgroups A and B; horizontal axis from 10 to 50, vertical axis Frequency from 0 to 40. Above: $\bar{x}_B = 30.0$; below: $\bar{x}_B = 40.0$]
$\sigma_{tot}^2 = f_A \sigma_A^2 + f_B \sigma_B^2 + f_A (\bar{x}_A - \bar{x}_{tot})^2 + f_B (\bar{x}_B - \bar{x}_{tot})^2$
weighted variance plus weighted “variance of the means”
In the example:
above: $\sigma_{tot}^2 = 80.68$
below: $\sigma_{tot}^2 = 174.80$
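The decomposition for two subgroups can be evaluated directly from the summary values of the example. The sketch below wraps the formula in a hypothetical helper, `total_stats` (not part of the slides), so that the same computation can be repeated for $\bar{x}_B = 30.0$ and $\bar{x}_B = 40.0$ while everything else stays fixed.

```r
# Summary values quoted in the example
nA <- 100; xA <- 9.9; vA <- 6.5
nB <- 300; vB <- 4.4
fA <- nA / (nA + nB)
fB <- nB / (nA + nB)

# Hypothetical helper: total mean and variance as a function of the mean of B
total_stats <- function(xB) {
  x_tot   <- fA * xA + fB * xB                            # weighted mean
  within  <- fA * vA + fB * vB                            # weighted variance
  between <- fA * (xA - x_tot)^2 + fB * (xB - x_tot)^2    # weighted "variance of the means"
  c(mean = x_tot, within = within, between = between, total = within + between)
}

total_stats(30.0)
total_stats(40.0)
```

The within component is the same in both calls; only the between component grows when the two subgroup means move further apart.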
In general
$$\bar{x}_{tot} = \sum_{k=1}^{K} f_k \bar{x}_k$$
$$\sigma_{tot}^2 = \sum_{k=1}^{K} f_k \sigma_k^2 + \sum_{k=1}^{K} f_k (\bar{x}_k - \bar{x}_{tot})^2$$
total variance = within-classes variance + between-classes variance
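The general identity can be verified on simulated data with $K = 3$ groups. In this sketch the third group, the seed, and the helper `pvar` (variance with divisor $n$) are assumptions made for illustration. The identity holds exactly only with the divisor-$n$ variance, which is why `var()` is not used.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible

# Three simulated groups; the third one is an illustrative addition
x <- c(rnorm(100, 10, 2.4), rnorm(300, 30, 2), rnorm(200, 20, 3))
g <- rep(c("A", "B", "C"), times = c(100, 300, 200))

pvar <- function(v) mean((v - mean(v))^2)  # variance with divisor n

f_k    <- as.numeric(table(g)) / length(x)  # relative frequencies f_k
mean_k <- tapply(x, g, mean)                # group means
var_k  <- tapply(x, g, pvar)                # group variances

x_tot   <- sum(f_k * mean_k)
within  <- sum(f_k * var_k)                 # within-classes variance
between <- sum(f_k * (mean_k - x_tot)^2)    # between-classes variance

all.equal(x_tot, mean(x))             # weighted mean of group means = overall mean
all.equal(within + between, pvar(x))  # within + between = total variance
```

Both `table()` and `tapply()` order the groups by factor level, so the elements of `f_k`, `mean_k` and `var_k` line up.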
R code for white, red and blue histograms
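# Simulated subgroups (no seed is set, so values change at every run)
a <- rnorm(100, 10, 2.4); mean(a); var(a)  # var() uses the divisor n - 1
b <- rnorm(300, 30, 2); mean(b); var(b)
b10 <- b + 10                              # subgroup B shifted by 10
br <- seq(1, 50, 0.5)                      # common breaks for all histograms
# First plot: A (white) with B (red) overlaid
hist(a, breaks = br, main = " ", xlab = " ", xlim = c(5, 50), ylim = c(0, 40))
par(new = TRUE)
hist(b, breaks = br, main = " ", xlab = " ", xlim = c(5, 50), ylim = c(0, 40), col = "red")
par(new = FALSE)
# Second plot: A (white) with shifted B (blue) overlaid
hist(a, breaks = br, main = " ", xlab = " ", xlim = c(5, 50), ylim = c(0, 40))
par(new = TRUE)
hist(b10, breaks = br, main = " ", xlab = " ", xlim = c(5, 50), ylim = c(0, 40), col = "blue")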