Exploratory Data Analysis
Eva Riccomagno, Maria Piera Rogantin
DIMA – Università di Genova
http://www.dima.unige.it/~rogantin/IIT/
Running example
Students in a statistics class measured their pulse rates, then each tossed a coin: those who got heads ran for one minute, those who got tails sat. They then measured their pulse rates again.
Variables
PULSE1 First pulse measurement (rate per minute)
PULSE2 Second pulse measurement (rate per minute)
RAN Whether the student ran or sat (1: head, run – 2: tail, sit)
SMOKES Regular smoker? (1: yes – 2: no)
SEX (1: M – 2: F)
HEIGHT Height (inch)
WEIGHT Weight (lb)
ACTIVITY Frequency (0: nothing – 1: low – 2: moderate – 3: high)

OBS PULSE1 PULSE2 RAN SMOKES SEX HEIGHT WEIGHT ACTIVITY
  1     64     88   1      2   1  66.00    140        2
  2     58     70   1      2   1  72.00    145        2
  3     62     76   1      1   1  73.50    160        3
Quantitative and qualitative variables
Quantitative variables represent measurable quantities
weight, height, number of times that a phenomenon occurs, . . .
Qualitative (or categorical) variables take values that are names or labels
- ordinal variables (the observed values admit a natural order)
response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, . . .
- nominal variables
blood groups, colours, gender, smoking habit,. . .
Distribution of a categorical variable X and contingency tables
Notation
- $E = \{1, \dots, k, \dots, K\}$ coding of the levels (or other choices)
- $n$ number of units
- $n_k$ observed count of level $k$; $\sum_{k=1}^{K} n_k = n$
- $f_k = n_k/n$ (relative) frequency of level $k$; $\sum_{k=1}^{K} f_k = 1$
Example – Activity
Codings 0: nothing – 1: low – 2: moderate – 3: high
level      k   nk   fk       fk (%)
nothing    0    1   0.0109     1.09
low        1    9   0.0978     9.78
moderate   2   61   0.6630    66.30
high       3   21   0.2283    22.83
total          92   1        100
tab=table(ACTIVITY) ## counts
freq=prop.table(tab) ## relative frequencies
cbind(tab,round(freq,4),round(freq*100,2))
Barplot
[Figure: bar plots of ACTIVITY (levels 0, 1, 2, 3); left: counts, right: relative frequencies]
par(mfrow=c(1,2))
barplot(table(ACTIVITY),cex.names=2,cex.axis=2) ## counts
barplot(prop.table(table(ACTIVITY)),cex.names=2,cex.axis=2) ## frequencies
par(mfrow=c(1,1))
Note the similarity
Joint distribution of two categorical variables X and Y
Example – Sex and Activity
Counts and relative frequencies

Counts
                 ACTIVITY (Y)
                 0      1      2      3    total
SEX (X)    1     1      5     35     16       57
           2     0      4     26      5       35
       total     1      9     61     21       92

Relative frequencies (%)
                 ACTIVITY (Y)
                 0      1      2      3    total
SEX (X)    1   1.09   5.43  38.04  17.39   61.96
           2   0.00   4.35  28.26   5.43   38.04
       total   1.09   9.78  66.30  22.83  100
> SA=table(SEX,ACTIVITY);SA
> round(prop.table(SA)*100,2)
> margin.table(SA,1);margin.table(SA,2)
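[Figure: four bar plots of the SEX and ACTIVITY counts; the first two are grouped by SEX (1, 2), the last two by ACTIVITY (0, 1, 2, 3); in each pair, stacked bars then side-by-side bars]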
AS=table(ACTIVITY,SEX)
par(mfcol=c(1,4))
barplot(AS,beside=F,cex.names=2);barplot(AS,beside=T,cex.names=2) ## note beside=T/F
barplot(SA,beside=F,cex.names=2);barplot(SA,beside=T,cex.names=2) ## note the first variable
par(mfcol=c(1,1))
Row profiles:
distribution of Y in the sub-groups identified by X (Y | X)
> row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sum is 100
     ACTIVITY
SEX     0     1     2     3
  1  1.75  8.77 61.40 28.07
  2  0.00 11.43 74.29 14.29
[Figure: bar plots of the row profiles over the levels no, low, medium, high; left: males, right: females; vertical axis from 0 to 1]
act=c("no","low","medium","high")
par(mfrow=c(1,2))
barplot(row_pr[1,],ylim=c(0,1),main="males",names= act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
barplot(row_pr[2,],ylim=c(0,1),main="females",names= act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
par(mfrow=c(1,1))
NOTE: useful when there are many levels
Column profiles:
distribution of X in the sub-groups identified by Y (X | Y)
> col_pr=prop.table(SA,2);round(col_pr*100,2) #each column sum is 100
     ACTIVITY
SEX       0     1     2     3
  1  100.00 55.56 57.38 76.19
  2    0.00 44.44 42.62 23.81
[Figure: four bar plots of the column profiles (M, F), one for each activity level: no, low, medium, high; vertical axis from 0 to 1]
par(mfrow=c(1,1))
gen=c("M","F")
par(mfrow=c(1,4))
for (j in 1:dim(SA)[2]) {
barplot(col_pr[,j],ylim=c(0,1),main=act[j],names=gen,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
}
par(mfrow=c(1,1))
similar profiles ≡ “independence”
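The comparison of profiles can also be checked numerically. The following is a minimal sketch, not part of the original slides: it rebuilds the SEX by ACTIVITY counts from the table above as a matrix (the names `SA_counts` and `expected` are chosen here for illustration) and computes the counts one would observe if all row profiles were identical.

```r
# Counts copied from the SEX x ACTIVITY table above
SA_counts <- matrix(c(1, 5, 35, 16,
                      0, 4, 26,  5),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(SEX = c("1", "2"),
                                    ACTIVITY = c("0", "1", "2", "3")))
n <- sum(SA_counts)

row_profiles <- prop.table(SA_counts, 1)   # Y | X, each row sums to 1
col_profiles <- prop.table(SA_counts, 2)   # X | Y, each column sums to 1

# If all row profiles were identical they would equal the marginal
# distribution of ACTIVITY, and the counts would be
# (row total * column total) / n
expected <- outer(rowSums(SA_counts), colSums(SA_counts)) / n

round(row_profiles * 100, 2)
round(expected, 1)
```

Large differences between `SA_counts` and `expected` indicate profiles that are far from similar.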
Distribution of a quantitative variable X on a finite population
Example – First pulse rates
> PULSE1
[1] 64 58 62 66 64 74 84 68 62 76 90 80 92 68 60 62 66 70 68
[20] 72 70 74 66 70 96 62 78 82 100 68 96 78 88 62 80 62 60 72
[39] 62 76 68 54 74 74 68 72 68 82 64 58 54 70 62 48 76 88 70
[58] 90 78 70 90 92 60 72 68 84 74 68 84 61 64 94 60 72 58 88
[77] 66 84 62 66 80 78 68 72 82 76 87 90 78 68 86 76
> sort(PULSE1)
[1] 48 54 54 58 58 58 60 60 60 60 61 62 62 62 62 62 62 62 62
[20] 62 64 64 64 64 66 66 66 66 66 68 68 68 68 68 68 68 68 68
[39] 68 68 70 70 70 70 70 70 72 72 72 72 72 72 74 74 74 74 74
[58] 76 76 76 76 76 78 78 78 78 78 80 80 80 82 82 82 84 84 84
[77] 84 86 87 88 88 88 90 90 90 90 92 92 94 96 96 100
> round(prop.table(table(PULSE1)),2)
48 54 58 60 61 62 64 66 68 70 72 74 76 78 80
0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03
82 84 86 87 88 90 92 94 96 100
0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01
$x_k$ observed value of X
$f_k$ its (relative) frequency, with $f_k \in (0, 1)$ and $\sum_{k=1}^{K} f_k = 1$
Distribution of X: $(x_1, f_1), \dots, (x_k, f_k), \dots, (x_K, f_K)$
(cf. qualitative variables; here, in general, there are many more distinct observed values)
Dot-plot
When the observed values are “sparse”
[Figure: stacked dot-plot of PULSE1, axis from 50 to 100]
stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)
Cumulative distribution function of X, FX (or F )
$F_X(x)$ is the (relative) frequency of units with value less than or equal to $x$:
$$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n} \quad \text{for } x \in \mathbb{R}$$
- if $x$ is observed, $x = x_k$: $F_X(x_k) = \sum_{i=1}^{k} f_i$
- if $x$ is between two observed values, $x \in [x_k, x_{k+1})$, then $F_X(x) = F_X(x_k)$
- if $x < \min X$ then $F_X(x) = 0$; if $x \ge \max X$ then $F_X(x) = 1$
$F_X : \mathbb{R} \to [0, 1]$
$F_X$ is a step function
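The definition can be written out directly and compared with R's `ecdf`. This is a small sketch under stated assumptions: `x` holds only the first ten PULSE1 values from the listing above, and `F_manual` is a name introduced here.

```r
# First ten values of PULSE1 from the listing above (illustration only)
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)
n <- length(x)

# F_X(t) = #{obs <= t} / n, written out directly
F_manual <- function(t) sapply(t, function(u) sum(x <= u) / n)

Fn <- ecdf(x)   # R's empirical cumulative distribution function

# Evaluate below the minimum, at observed values, between them, above the maximum
t_values <- c(50, 58, 63, 64, 84, 90)
cbind(t = t_values, manual = F_manual(t_values), ecdf = Fn(t_values))
```

The two columns agree, and the value stays constant between consecutive observed values, as expected for a step function.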
Example – Female height (in cm)
• xk sorted (and non repeated) observed values
• nk count of xk
• Nk cumulative counts
• fk frequency of xk
• Fk cumulative frequencies
xk    nk   Nk   fk (%)   Fk (%)
155    1    1    2.86      2.86
157    5    6   14.29     17.14
159    1    7    2.86     20.00
160    4   11   11.43     31.43
163    2   13    5.71     37.14
165    4   17   11.43     48.57
166    1   18    2.86     51.43
168    4   22   11.43     62.86
170    3   25    8.57     71.43
173    6   31   17.14     88.57
175    3   34    8.57     97.14
178    1   35    2.86    100.00
[Figure: ecdf(altezza_f); x-axis: altezza femmine (X), female height, from 155 to 180; y-axis: F_X (x), from 0 to 1]
R code
altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inch to cm) acts on all elements
## round(A) is equivalent to round(A,0)
table(altezza_f) ## counts of the variable
cumsum(table(altezza_f)) ## cumulative counts
round(prop.table(table(altezza_f))*100,2)
round(cumsum(prop.table(table(altezza_f)))*100,2)
## percentage frequencies and percentage cumulative frequencies
plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf empirical cumulative distribution function
## lwd width of the lines
## cex.axis size of the axis annotation
## xlab ylab labels of the axes
Quantile and percentile function (reverse problem)
Ex: the central value, the value that delimits the first quarter, last quarter of the sorted data
Determine x such that FX(x) is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)
Example – Female height
rank ri 1 2 3 4 5 6 7 8 9 10
value x(i) 155 157 157 157 157 157 159 160 160 160
rank ri 11 12 13 14 15 16 17 18 19 20
value x(i) 160 163 163 165 165 165 165 166 168 168
rank ri 21 22 23 24 25 26 27 28 29 30
value x(i) 168 168 170 170 170 173 173 173 173 173
rank ri 31 32 33 34 35
value x(i) 173 175 175 175 178
- central value: 18th value (17 values lie below it and 17 above)
- the first quartile is between the 8th and 9th values (0.25 × 35 = 8.75)
- the third quartile is between the 26th and 27th values (0.75 × 35 = 26.25)
Here the height values are uniquely determined (the 8th and 9th values coincide, as do the 26th and 27th), but in general they are not.
Statistical software offers several definitions of quantile (R implements 9).
The k-th percentile of a set of values divides them so that k% lie below and (100 − k)% lie above
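Inverting a non-invertible function (a step function)
For each frequency $\alpha$, the quantile $q_\alpha$ is the minimum real number such that $F(q_\alpha) \ge \alpha$
$$q_\alpha = \tilde{F}^{-1}(\alpha) = \min\{x \in \mathbb{R} \text{ s.t. } F(x) \ge \alpha\}, \quad \alpha \in (0, 1)$$
It is the $i$-th sorted value, where $i$ is the ceiling of $n\alpha$
If $\alpha$ is expressed as a percentage, $q_\alpha$ is called a percentile
In general (with all definitions)
$$q_\alpha \in \left[x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)}\right]$$
where $\lfloor n\alpha \rfloor$ is the integer part of $n\alpha$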
> quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178
Pay attention: type=1 gives “our” definition (the default is type=7). See help(quantile)
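The rule "the i-th sorted value with i the ceiling of nα" can be checked against `quantile(..., type=1)`. In this sketch the female heights are rebuilt from the sorted values listed above, and `q_def` is a name introduced here for illustration.

```r
# Female heights in cm, rebuilt from the sorted values listed above
altezza_f <- rep(c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178),
                 times = c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1))
n <- length(altezza_f)   # 35 units

# q_alpha as the ceiling(n * alpha)-th sorted value
q_def <- function(alpha) sort(altezza_f)[ceiling(n * alpha)]

alpha <- c(0.25, 0.50, 0.75)
q_def(alpha)
quantile(altezza_f, alpha, type = 1)   # the definition used in these notes
quantile(altezza_f, alpha, type = 7)   # R's default, which interpolates between sorted values
```

On other data the default type may return values that are not observed, which is why the notes fix `type=1`.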
Tukey’s five-number summary
minimum, Q1, Q2, Q3, maximum
> fivenum(altezza_f)
[1] 155 160 166 173 178
Inter Quartile Range (IQR): Q3 − Q1
The interval (Q1, Q3) contains the central 50% of the data
The box-plot
[Figure: schematic box-plot marking Q1, Q2, Q3 and the distances 1.5 IQR and 3 IQR on each side of the box]
- the left and right sides of the box are Q1 and Q3
- the vertical line inside the box is the median
- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box
- outliers (beyond the end of the whiskers) are denoted by o (at times observations further than 3 IQR from the box are denoted by *)
Example - Weight
[Figure: dot-plot and box-plot of weight in kg, axis from 50 to 90]
peso=round(WEIGHT*0.45359,1)
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)
par(new=T)
boxplot(peso, horizontal=T)
> peso=round(WEIGHT*0.45359,1)
> sort(peso)
[1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5
> fivenum(peso)
[1] 43.10 56.70 65.80 70.75 97.50
- IQR = 70.75 − 56.70 = 14.05
- 1.5 IQR = 21.075
- 3 IQR = 42.15
- Q1−1.5 IQR = 56.70 − 21.075 = 35.625. No data is less than 35.6
- Q3+1.5 IQR = 70.75 + 21.075 = 91.825
- Q3+3 IQR = 70.75 + 42.15 = 112.9.
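The arithmetic above can be reproduced in a few lines. This is a sketch that uses only the quartiles from the `fivenum(peso)` output and the extreme values of the sorted listing; the variable names are chosen here for illustration.

```r
# Values taken from the fivenum(peso) output and the sorted listing above
Q1 <- 56.70
Q3 <- 70.75
largest  <- c(86.2, 88.5, 97.5)   # the three largest distinct weights
smallest <- 43.1                  # the minimum

IQR_peso <- Q3 - Q1
inner_fence <- c(lower = Q1 - 1.5 * IQR_peso, upper = Q3 + 1.5 * IQR_peso)
outer_fence <- c(lower = Q1 - 3 * IQR_peso,   upper = Q3 + 3 * IQR_peso)

# The minimum lies inside the lower fence, so the lower whisker reaches it
smallest >= inner_fence["lower"]

# Upper whisker: largest observation not beyond the inner fence
upper_whisker <- max(largest[largest <= inner_fence["upper"]])
# Observations beyond the inner fence are drawn as individual points
flagged <- largest[largest > inner_fence["upper"]]

inner_fence
outer_fence
upper_whisker
flagged
```

With these values the upper whisker ends at 88.5 and only 97.5 is flagged; it lies within 3 IQR of the box.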
Example - Second pulse rates
Are they true outliers?
Some students ran, some sat.
[Figure: box-plot of PULSE2, axis from 60 to 140, with several points flagged as outliers]
Sub-groups: graphic comparisons
Example - Second pulse rates
Box plot
Each Tukey number for the students who ran is larger than for the students who sat.
Different dispersion
Different “form” (more or less symmetrical w.r.t. the median)
[Figure: box-plots of PULSE2 for the two groups, labelled corsa (ran) and fermo (sat); axis from 60 to 140]
boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))
Cumulative distribution function
The CDF of the students who sat is steeper than the CDF of the students who ran:
steep slope ≡ low dispersion
Each quantile of “ran” is greater than the corresponding quantile of “sat”
plot(ecdf(PULSE2[RAN==1]),pch=19,
cex.axis=1.2, xlim=c(50,140),col="blue",main="")
par(new=T)
# the following plot is drawn on the same graphic window
plot(ecdf(PULSE2[RAN==2]),pch=17,
cex.axis=1.2, xlim=c(50,140),col="red",main="")
# pch=19 circle pch=17 triangle
[Figure: empirical CDFs of PULSE2 for the two groups; x from 60 to 140, Fn(x) from 0 to 1; red triangle: sat, blue circle: ran]
Example - Male and female height
[Figure: box-plots of height for M and F (axis from 155 to 190) and the two empirical CDFs (x from 155 to 190, Fn(x) from 0 to 1)]
same “form”, slope and dispersion; different values
Indices for quantitative variables
Position indices – central tendencies
- median: $Q_2 = \min\{x \mid F(x) \ge 0.50\}$
- mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- trimmed mean: mean of the central 90% of the data
- mode: value with maximal frequency
1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
2. $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - a)^2 \quad \forall\, a \in \mathbb{R}$
the mean is the centroid of the distribution (equilibrium point)
the mean is affected by outliers, the median is not
[Figure: stacked dot-plot, axis from 60 to 140]
3. $\sum_{i=1}^{n} |x_i - Q_2| \le \sum_{i=1}^{n} |x_i - a| \quad \forall\, a \in \mathbb{R}$
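The three properties can be checked numerically. This is a sketch on the first ten PULSE1 values listed earlier; the grid `a_grid` and the extra value 140 are illustrative choices. Note that for an even number of units R's `median` averages the two central values, while the definition above takes the lower one; both minimise the sum in property 3.

```r
# First ten values of PULSE1 from the listing earlier in these notes
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)

m  <- mean(x)
q2 <- median(x)

# Property 1: the deviations from the mean sum to zero
sum(x - m)

# Properties 2 and 3, checked on a grid of candidate values a
a_grid <- seq(50, 90, by = 0.5)
ss <- sapply(a_grid, function(a) sum((x - a)^2))
sa <- sapply(a_grid, function(a) sum(abs(x - a)))
all(ss >= sum((x - m)^2) - 1e-9)
all(sa >= sum(abs(x - q2)) - 1e-9)

# Effect of one extreme value on mean and median
x_out <- c(x, 140)
c(mean = mean(x), mean_out = mean(x_out),
  median = median(x), median_out = median(x_out))

# Trimmed mean: trim = 0.05 drops floor(0.05 * n) values at each end
# (none when n = 10, so here it equals the mean)
mean(x, trim = 0.05)
```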
Variability indices
• ranges
- range (R): max − min
- interquartile range (IQR): $Q_3 - Q_1$
• indices based on the deviations from a central value
- variance and standard deviation ($V(X)$ or $\sigma_X^2$ or $\sigma^2$; $\mathrm{std}(X)$ or $\sigma_X$ or $\sigma$)
$$V(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \mathrm{std}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
(Variance and standard deviation can also be defined with $n - 1$ in place of $n$)
- mean absolute deviations (from mean and median)
$$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \qquad \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$$
• variability w.r.t. central value
- coefficient of variation $CV(X) = \dfrac{\mathrm{std}(X)}{\bar{x}}$ (if $\bar{x} \ne 0$)
Properties: $\sigma \le \dfrac{R}{2}$ and $|\bar{x} - Q_2| \le \sigma$
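In R, `var` and `sd` use the divisor n − 1, so the indices defined above with divisor n need a small adjustment. The sketch below again uses the first ten PULSE1 values; all variable names are chosen here.

```r
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)   # first ten PULSE1 values
n <- length(x)

var_n <- mean((x - mean(x))^2)   # V(X) with divisor n, as defined above
sd_n  <- sqrt(var_n)

# R's var() uses divisor n - 1; rescaling recovers the divisor-n version
c(var_n = var_n, rescaled = var(x) * (n - 1) / n)

range_x    <- diff(range(x))                 # R = max - min
mad_mean   <- mean(abs(x - mean(x)))         # mean absolute deviation from the mean
mad_median <- mean(abs(x - median(x)))       # mean absolute deviation from the median
cv         <- sd_n / mean(x)                 # coefficient of variation

c(range = range_x, mad_mean = mad_mean, mad_median = mad_median, cv = cv)

# The two properties stated above
sd_n <= range_x / 2
abs(mean(x) - median(x)) <= sd_n
```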
Mean and variance in a population and its sub-groups
Two subgroups: A and B
$n_A$ and $n_B$ units, $f_A$ and $f_B$ frequencies,
$\bar{x}_A$ and $\bar{x}_B$ means and $\sigma_A^2$ and $\sigma_B^2$ variances.
[Figure: two histograms, x-axis from 10 to 50, y-axis Frequency from 0 to 40]
$n_A = 100$, $\bar{x}_A = 9.9$, $\sigma_A^2 = 6.5$; $n_B = 300$, $\bar{x}_B = 30.0$, $\sigma_B^2 = 4.4$
Mean
$\bar{x}_{tot} = f_A\, \bar{x}_A + f_B\, \bar{x}_B$ (weighted mean)
$\bar{x}_{tot} = \frac{100}{400}\, 9.9 + \frac{300}{400}\, 30.0 = 24.98$
Variance
[Figure: two pairs of histograms, x-axis from 10 to 50, y-axis Frequency from 0 to 40; above: $\bar{x}_B = 30.0$, below: $\bar{x}_B = 40.0$]
$$\sigma_{tot}^2 = f_A\, \sigma_A^2 + f_B\, \sigma_B^2 + f_A\, (\bar{x}_A - \bar{x}_{tot})^2 + f_B\, (\bar{x}_B - \bar{x}_{tot})^2$$
weighted variance plus weighted “variance of the means”
In the example:
above: $\sigma_{tot}^2 = 80.68$
below: $\sigma_{tot}^2 = 179.53$
In general
$$\bar{x}_{tot} = \sum_{k=1}^{K} f_k\, \bar{x}_k$$
$$\sigma_{tot}^2 = \sum_{k=1}^{K} f_k\, \sigma_k^2 + \sum_{k=1}^{K} f_k\, (\bar{x}_k - \bar{x}_{tot})^2$$
total variance = within classes variance + between classes variance
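The decomposition can be reproduced from the summary values of the first example (sub-group B with mean 30.0). The second part of this sketch checks the identity on made-up raw data; `pvar` is a helper defined here that computes the variance with divisor n.

```r
# Summary values of the two sub-groups from the example above
n_k    <- c(A = 100, B = 300)
mean_k <- c(A = 9.9, B = 30.0)
var_k  <- c(A = 6.5, B = 4.4)

f_k <- n_k / sum(n_k)

mean_tot <- sum(f_k * mean_k)                   # weighted mean
within   <- sum(f_k * var_k)                    # within classes variance
between  <- sum(f_k * (mean_k - mean_tot)^2)    # between classes variance
var_tot  <- within + between

round(c(mean_tot = mean_tot, within = within, between = between, var_tot = var_tot), 2)

# Check of the identity on raw data (made-up values)
pvar <- function(v) mean((v - mean(v))^2)       # variance with divisor n
a <- c(8, 10, 11, 9, 12)
b <- c(29, 31, 30, 28, 33, 32, 27)
all_x <- c(a, b)
f <- c(length(a), length(b)) / length(all_x)
all.equal(pvar(all_x),
          sum(f * c(pvar(a), pvar(b))) +
          sum(f * (c(mean(a), mean(b)) - mean(all_x))^2))
```

The rounded results agree with the values 24.98 and 80.68 given above.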
The histogram
Note the different “form” of the two histograms of the same data, drawn with different break points
Left: asymmetrical distribution
Right: symmetrical distribution
[Figure: two histograms of dati_ist, x-axis from 110 to 160, y-axis Frequency from 0 to 15]
dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135,
136,137,138,139,141,142,144,148,150,155,156)
par(mfrow=c(1,2))
hist(dati_ist,breaks=c(115,130, 145,160),xlim=c(110,160),freq=T,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127, 142,157),xlim=c(110,160),freq=T,ylim=c(0,16))
par(mfrow=c(1,1))
Use a histogram only if different break points and class widths produce the same “form”
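The change of “form” can be seen in the class counts alone. A short sketch with the same data and break points, using `hist(..., plot=FALSE)` to return the counts without drawing; `h_left` and `h_right` are names chosen here.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# plot = FALSE returns the class counts without drawing the histogram
h_left  <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h_right <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

rbind(left = h_left$counts, right = h_right$counts)
```

The counts are 15, 11, 4 with the first set of breaks and 9, 16, 5 with the second, although the class width is 15 in both cases.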
Distribution and graphic representation of two quantitative variables
X and Y measured on the same population
The set of the points (xi, yi) and of the corresponding frequencies is the joint distribution of X and Y .
The scatter-plot
Example: height and weight of the students of a graduate class
peso_alt= read.table(file=
"C:/DATA/stid98.txt",header=F,
col.names=c("altezza","peso","genere"))
attach(peso_alt)
plot(altezza,peso,pch=16,col="red")
[Figure: scatter-plot of peso (weight, from 50 to 90) against altezza (height, from 155 to 190)]
Indices for two quantitative variables
Covariance
$X^c$ and $Y^c$ centred variables: $X^c = X - \bar{x}$, $Y^c = Y - \bar{y}$
$$\Gamma(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^c\, y_i^c = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})\,(y_i - \bar{y})$$
Interpretation
The origin is in the centroid $(\bar{x}, \bar{y})$
The points in quadrants I and III give a positive contribution to $\sum_{i=1}^{n} x_i^c\, y_i^c$, while the points in quadrants II and IV give a negative contribution.
[Figure: scatter-plot with the axes centred at the centroid; quadrants I and III are marked +, quadrants II and IV are marked −]
Hence the covariance is
- positive if, on average, high values of X correspond to high values of Y (and low values of X to low values of Y)
- negative if, on average, high values of X correspond to low values of Y (and low values of X to high values of Y)
- about zero otherwise
[Figure: three scatter-plots; (x1, y1): positive covariance (+); (x2, y2): covariance about zero (~0); (x3, y3): negative covariance (−)]
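The covariance and the sign of each point's contribution can be computed directly. The data in this sketch are made up for illustration, and the names `xc`, `yc` and `gamma_xy` are chosen here. As for the variance, R's `cov` uses the divisor n − 1.

```r
# Made-up data for illustration
x <- c(10, 12, 13, 15, 16, 18, 19, 20)
y <- c(11, 10, 14, 15, 13, 18, 17, 20)
n <- length(x)

xc <- x - mean(x)   # centred variables
yc <- y - mean(y)

gamma_xy <- mean(xc * yc)   # covariance with divisor n, as defined above
gamma_xy
cov(x, y) * (n - 1) / n     # R's cov() rescaled to divisor n

# Sign of each point's contribution:
# +1 in quadrants I and III, -1 in quadrants II and IV
table(sign(xc * yc))
```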
Correlation index (or linear correlation index)
$X^s$ and $Y^s$ standardized variables:
$$X^s = \frac{X - \bar{x}}{\sigma_X} \quad \text{and} \quad Y^s = \frac{Y - \bar{y}}{\sigma_Y}$$
$$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^s\, y_i^s = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sqrt{\sum_{j} (x_j - \bar{x})^2}}\, \frac{(y_i - \bar{y})}{\sqrt{\sum_{j} (y_j - \bar{y})^2}} = \frac{\Gamma(X, Y)}{\sigma_X\, \sigma_Y}$$
Some properties:
- it does not depend on the variability of the variables
- it is a pure number
- $\rho(X, Y) \in [-1, 1]$
Remark 1: positive correlation does not imply causality
Remark 2: the correlation index detects only linear dependence. In the example below, $\rho(X, Y) = 0.02$.
[Figure: scatter-plot illustrating Remark 2; x-axis from 20 to 40, y-axis from 0 to 100]
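Both the formula and Remark 2 can be illustrated in a few lines. The data below are made up; `sd_n` is a helper defined here for the standard deviation with divisor n. The second example is a hypothetical exact quadratic dependence, not the data of the figure.

```r
x <- c(10, 12, 13, 15, 16, 18, 19, 20)   # made-up data
y <- c(11, 10, 14, 15, 13, 18, 17, 20)

sd_n <- function(v) sqrt(mean((v - mean(v))^2))   # standard deviation, divisor n
xs <- (x - mean(x)) / sd_n(x)                     # standardized variables
ys <- (y - mean(y)) / sd_n(y)

rho <- mean(xs * ys)
c(rho = rho, cor = cor(x, y))   # the divisor cancels, so cor() gives the same value

# Exact but non-linear dependence: v is a function of u, yet the
# correlation index is zero
u <- -10:10
v <- u^2
cor(u, v)
```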
Covariance and correlation indices in a population and its sub-groups
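$$\Gamma_{tot}(X, Y) = \sum_{k=1}^{K} f_k\, \Gamma_k(X, Y) + \sum_{k=1}^{K} f_k\, (\bar{x}_k - \bar{x})\,(\bar{y}_k - \bar{y})$$
(cf. formula of the variance)
Some paradoxes
left: $\rho_{tot} < 0$, $\rho_1 > 0$, $\rho_2 > 0$, $\rho_3 > 0$
right: $\rho_{tot} \simeq 0$, $\rho_1 > 0$, $\rho_2 < 0$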
[Figure: two scatter-plots; left: b (from 50 to 90) against a (from 0 to 50); right: y (from 40 to 100) against x (from 10 to 20)]
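A paradox of the first kind is easy to build. In this sketch the three groups are made up: within each group y increases with x, while the group means of y decrease as those of x increase. The helper `pcov` (covariance with divisor n) and all other names are chosen here; the last lines check the decomposition formula above.

```r
# Three made-up groups of five points each
g <- rep(1:3, each = 5)
x <- c(0:4, 10:14, 20:24)
y <- c(20:24, 10:14, 0:4)

pcov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))   # covariance, divisor n

idx     <- split(seq_along(x), g)                 # unit indices of each group
rho_tot <- cor(x, y)
rho_k   <- sapply(idx, function(i) cor(x[i], y[i]))

f_k   <- as.numeric(table(g)) / length(g)         # group frequencies
cov_k <- sapply(idx, function(i) pcov(x[i], y[i]))
mx_k  <- sapply(idx, function(i) mean(x[i]))
my_k  <- sapply(idx, function(i) mean(y[i]))

within  <- sum(f_k * cov_k)
between <- sum(f_k * (mx_k - mean(x)) * (my_k - mean(y)))

c(rho_tot = rho_tot, rho_k)
c(total = pcov(x, y), within_plus_between = within + between)
```

The negative between-groups term dominates the positive within-groups term, so the total correlation is negative although every group has a positive one.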
Example:
height and weight of the students of a graduate class (continued)
$\rho_{tot} = 0.61$,
$\rho_M = 0.26$, $\rho_F = 0.78$.
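## factor() lets the indexing work whether genere is coded as numbers or as labels
plot(altezza,peso,pch=c(16,17)[factor(genere)],
col=c("red", "blue")[factor(genere)])
[Figure: scatter-plot of peso (from 50 to 90) against altezza (from 155 to 190), points marked by genere; tot: 0.61, M: 0.26, F: 0.78]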
Linear transformations of variables:
how the indices change
Univariate case
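$X$ variable with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$
• change of position: $R = X + a$
$\bar{r} = \bar{x} + a$, $\sigma_R^2 = \sigma_X^2$, $q_\alpha^R = q_\alpha^X + a$
• change of dispersion/scale: $T = b\,X$
$\bar{t} = b\,\bar{x}$, $\sigma_T^2 = b^2\, \sigma_X^2$, $q_\alpha^T = b\, q_\alpha^X$ (for $b > 0$)
• change of both indices: $Z = b\,X + a$
$\bar{z} = b\,\bar{x} + a$, $\sigma_Z^2 = b^2\, \sigma_X^2$, $q_\alpha^Z = b\, q_\alpha^X + a$ (for $b > 0$)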
[Figure: histograms with panels R=X+8 and Z=2X+8 (x-axis from 0 to 20, y-axis from 0 to 8) and box-plots of X, R=X+3 and Z=2X+3 (axis from 0 to 20)]
Pay attention to the ticks on the axes
Two special transformations:
• centered variable:
$X^c = X - \bar{x}$, $\bar{x}^c = 0$, $\sigma_{X^c}^2 = \sigma_X^2$
• standardized variable:
$X^s = \dfrac{X - \bar{x}}{\sigma_X}$, $\bar{x}^s = 0$, $\sigma_{X^s}^2 = 1$
Bivariate case
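$X$ with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$; $Y$ with $\bar{y}$, $\sigma_Y^2$ and $q_\alpha^Y$
• change of position: $R = X + a$, $M = Y + c$
$\Gamma(R, M) = \Gamma(X, Y)$, $\rho(R, M) = \rho(X, Y)$
• change of dispersion: $T = b\,X$, $N = d\,Y$
$\Gamma(T, N) = b\,d\,\Gamma(X, Y)$, $\rho(T, N) = \mathrm{sign}(b\,d)\,\rho(X, Y)$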
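These rules can be verified numerically. This is a sketch under stated assumptions: `x` holds the first ten PULSE1 values, `y` is made up, the constants `a`, `b`, `c0`, `d` are arbitrary, and `pvar` and `pcov` are helpers defined here with divisor n.

```r
x <- c(64, 58, 62, 66, 64, 74, 84, 68, 62, 76)    # first ten PULSE1 values
y <- c(88, 70, 76, 80, 75, 90, 102, 79, 71, 95)   # made-up second variable

pvar <- function(v) mean((v - mean(v))^2)                     # variance, divisor n
pcov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))    # covariance, divisor n

a <- 8; b <- 2; c0 <- -3; d <- -0.5   # arbitrary constants

# Univariate case: Z = b X + a
z <- b * x + a
c(mean(z), b * mean(x) + a)
c(pvar(z), b^2 * pvar(x))
c(quantile(z, 0.25, type = 1), b * quantile(x, 0.25, type = 1) + a)   # needs b > 0

# Standardized variable: mean 0 and variance 1 (up to rounding)
xs <- (x - mean(x)) / sqrt(pvar(x))
c(mean(xs), pvar(xs))

# Bivariate case: shifts leave the covariance unchanged, scalings multiply it by b d
c(pcov(x + a, y + c0), pcov(x, y))
c(pcov(b * x, d * y), b * d * pcov(x, y))
c(cor(b * x, d * y), sign(b * d) * cor(x, y))
```

Since `b * d` is negative here, the correlation changes sign but keeps its absolute value.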