
Exploratory Data Analysis


Academic year: 2021


(1)

Exploratory Data Analysis

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/IIT/

(2)

Running example

Students in a statistics class measured their pulse rates, then each tossed a coin: those who got heads ran for one minute, those who got tails sat. Then they measured their pulse rates again.

Variables

PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: head, run – 2: tail, sit)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inch)
WEIGHT    Weight (lb)
ACTIVITY  Frequency (0: nothing – 1: low – 2: moderate – 3: high)

OBS  PULSE1  PULSE2  RAN  SMOKES  SEX  HEIGHT  WEIGHT  ACTIVITY
1    64      88      1    2       1    66.00   140     2
2    58      70      1    2       1    72.00   145     2
3    62      76      1    1       1    73.50   160     3

(3)

Quantitative and qualitative variables

Quantitative variables represent measurable quantities

weight, height, number of times that a phenomenon occurs, . . .

Qualitative (or categorical) variables take values that are names or labels

- ordinal variables (the observed values admit a natural order)

response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, . . .

- nominal variables

blood groups, colours, gender, smoking habit, . . .

(4)

Distribution of a categorical variable X and contingency tables

Notation

- E = {1, . . . , k, . . . , K} coding of the levels (or other choices)

- n number of units

- $n_k$ observed count of level $k$; $\sum_{k=1}^{K} n_k = n$

- $f_k = n_k/n$ (relative) frequency of level $k$; $\sum_{k=1}^{K} f_k = 1$

Example – Activity

Codings 0: nothing – 1: low – 2: moderate – 3: high

level      k   nk   fk       fk (%)
nothing    0    1   0.0109     1.09
low        1    9   0.0978     9.78
moderate   2   61   0.6630    66.30
high       3   21   0.2283    22.83
total          92   1        100

tab=table(ACTIVITY)
freq=prop.table(tab)
cbind(tab,round(freq,4),round(freq*100,2))

(5)

Barplot

[Figure: barplots of ACTIVITY (levels 0, 1, 2, 3); left: counts, right: relative frequencies]

par(mfrow=c(1,2))
barplot(table(ACTIVITY),cex.names=2,cex.axis=2) ## counts
barplot(prop.table(table(ACTIVITY)),cex.names=2,cex.axis=2) ## frequencies
par(mfrow=c(1,1))

Note the similarity: the two barplots have the same shape and differ only in the scale of the vertical axis.

(6)

Joint distribution of two categorical variables X and Y

Example – Sex and Activity

Counts and relative frequencies

Counts:

                ACTIVITY (Y)
               0     1     2     3   total
SEX (X)  1     1     5    35    16     57
         2     0     4    26     5     35
     total     1     9    61    21     92

Relative frequencies (%):

                ACTIVITY (Y)
               0      1      2      3    total
SEX (X)  1   1.09   5.43  38.04  17.39   61.96
         2   0.00   4.35  28.26   5.43   38.04
     total   1.09   9.78  66.30  22.83  100

> SA=table(SEX,ACTIVITY);SA
> round(prop.table(SA)*100,2)
> margin.table(SA,1);margin.table(SA,2)

[Figure: four barplots of the joint counts; the first two with SEX (1, 2) on the horizontal axis, the last two with ACTIVITY (0, 1, 2, 3) on the horizontal axis; each pair shows stacked bars and side-by-side bars]

AS=table(ACTIVITY,SEX)
par(mfcol=c(1,4))
barplot(AS,beside=F,cex.names=2);barplot(AS,beside=T,cex.names=2) ## note beside=T/F
barplot(SA,beside=F,cex.names=2);barplot(SA,beside=T,cex.names=2) ## note the first variable
par(mfcol=c(1,1))

(7)

Row profiles:

distribution of Y in the sub-groups identified by X (Y | X)

> row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sum is 100

        ACTIVITY
SEX      0      1      2      3
  1    1.75   8.77  61.40  28.07
  2    0.00  11.43  74.29  14.29

[Figure: barplots of the two row profiles over the levels no, low, medium, high; left panel: males, right panel: females; vertical axis from 0 to 1]

act=c("no","low","medium","high")
par(mfrow=c(1,2))
barplot(row_pr[1,],ylim=c(0,1),main="males",names= act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
barplot(row_pr[2,],ylim=c(0,1),main="females",names= act,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
par(mfrow=c(1,1))

NOTE: useful when there are many levels

(8)

Column profiles:

distribution of X in the sub-groups identified by Y (X | Y)

> col_pr=prop.table(SA,2);round(col_pr*100,2) # each column sum is 100

        ACTIVITY
SEX       0       1      2      3
  1    100.00  55.56  57.38  76.19
  2      0.00  44.44  42.62  23.81

[Figure: barplots of the four column profiles over M and F; panels: no, low, medium, high; vertical axis from 0 to 1]

par(mfrow=c(1,1))
gen=c("M","F")
par(mfrow=c(1,4))
for (j in 1:dim(SA)[2]) {
barplot(col_pr[,j],ylim=c(0,1),main=act[j],names=gen,cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
}
par(mfrow=c(1,1))

similar profiles ≡ “independence”
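The link between similar profiles and “independence” can be made concrete by computing the counts that would be observed if all row profiles were identical. The sketch below rebuilds the SEX by ACTIVITY table from the counts shown earlier; the names `SA_counts`, `row_profiles`, `marginal_Y` and `expected` are illustrative and do not appear in the original code.

```r
## Counts of the SEX x ACTIVITY table, typed in from the table of counts
SA_counts <- matrix(c(1, 5, 35, 16,
                      0, 4, 26,  5),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(SEX = c("1", "2"),
                                    ACTIVITY = c("0", "1", "2", "3")))
n <- sum(SA_counts)

## Row profiles (Y | X) and the marginal distribution of Y
row_profiles <- prop.table(SA_counts, 1)
marginal_Y   <- colSums(SA_counts) / n

## Under exact independence every row profile equals the marginal of Y,
## so the expected count in cell (i, j) is (row total * column total) / n
expected <- outer(rowSums(SA_counts), colSums(SA_counts)) / n

round(rbind(row_profiles, all = marginal_Y) * 100, 2)
round(expected, 2)
```

Comparing `SA_counts` with `expected` cell by cell gives a rough idea of how far the observed table is from independence.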

(9)

Distribution of a quantitative variable X on a finite population

Example – First pulse rates

> PULSE1

 [1]  64  58  62  66  64  74  84  68  62  76  90  80  92  68  60  62  66  70  68
[20]  72  70  74  66  70  96  62  78  82 100  68  96  78  88  62  80  62  60  72
[39]  62  76  68  54  74  74  68  72  68  82  64  58  54  70  62  48  76  88  70
[58]  90  78  70  90  92  60  72  68  84  74  68  84  61  64  94  60  72  58  88
[77]  66  84  62  66  80  78  68  72  82  76  87  90  78  68  86  76

> sort(PULSE1)

 [1]  48  54  54  58  58  58  60  60  60  60  61  62  62  62  62  62  62  62  62
[20]  62  64  64  64  64  66  66  66  66  66  68  68  68  68  68  68  68  68  68
[39]  68  68  70  70  70  70  70  70  72  72  72  72  72  72  74  74  74  74  74
[58]  76  76  76  76  76  78  78  78  78  78  80  80  80  82  82  82  84  84  84
[77]  84  86  87  88  88  88  90  90  90  90  92  92  94  96  96 100

> round(prop.table(table(PULSE1)),2)

48 54 58 60 61 62 64 66 68 70 72 74 76 78 80

0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03

82 84 86 87 88 90 92 94 96 100

0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01

$x_k$ observed value of X

$f_k$ its (relative) frequency, with $f_k \in (0, 1)$ and $\sum_{k=1}^{K} f_k = 1$

Distribution of X: $(x_1, f_1), \ldots, (x_k, f_k), \ldots, (x_K, f_K)$

(cf. qualitative variables; here there are, in general, many more distinct observed values)

(10)

Dot-plot

When the observed values are “sparse”

[Figure: stacked dot-plot of PULSE1, horizontal axis from 50 to 100]

stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)

(11)

Cumulative distribution function of X, FX (or F )

$F_X(x)$ is the (relative) frequency of units with value less than or equal to $x$:

$$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n} \qquad \text{for } x \in \mathbb{R}$$

- if $x$ is observed, $x = x_k$: $F_X(x_k) = \sum_{i=1}^{k} f_i$

- if $x$ is between two observed values, $x \in [x_k, x_{k+1})$, then $F_X(x) = F_X(x_k)$

- if $x < \min X$ then $F_X(x) = 0$; if $x \ge \max X$ then $F_X(x) = 1$

$F_X : \mathbb{R} \to [0, 1]$

$F_X$ is a step function
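The three cases above can be checked on a small vector. This is a minimal sketch on hypothetical toy data (not taken from the running example); `F_manual` is an illustrative helper that applies the definition directly, while `ecdf()` is the base R function used later in these slides.

```r
## Hypothetical toy data (not from the running example)
x <- c(2, 3, 3, 5)

## F_X(t) = #{obs <= t} / n, written directly from the definition
F_manual <- function(t) mean(x <= t)

F_manual(1)   # 0    : below the minimum
F_manual(3)   # 0.75 : at an observed value
F_manual(4)   # 0.75 : between two observed values, same as F_manual(3)
F_manual(5)   # 1    : at the maximum

## ecdf() returns the same step function
F_hat <- ecdf(x)
F_hat(c(1, 3, 4, 5))
```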

(12)

Example – Female height (in cm)

• xk sorted (and non repeated) observed values

• nk count of xk

• Nk cumulative counts

• fk frequency of xk

• Fk cumulative frequencies

xk    nk   Nk   fk (%)   Fk (%)
155    1    1     2.86     2.86
157    5    6    14.29    17.14
159    1    7     2.86    20.00
160    4   11    11.43    31.43
163    2   13     5.71    37.14
165    4   17    11.43    48.57
166    1   18     2.86    51.43
168    4   22    11.43    62.86
170    3   25     8.57    71.43
173    6   31    17.14    88.57
175    3   34     8.57    97.14
178    1   35     2.86   100.00

[Figure: plot of ecdf(altezza_f); horizontal axis: altezza femmine (X), i.e. female height, from 155 to 180; vertical axis: F_X (x), from 0 to 1]

(13)

R code

altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inch to cm) acts on all elements
## round(A) is equivalent to round(A,0)

table(altezza_f)            ## counts of the variable
cumsum(table(altezza_f))    ## cumulative counts

round(prop.table(table(altezza_f))*100,2)
round(cumsum(prop.table(table(altezza_f)))*100,2)
## percentage frequencies and percentage cumulative frequencies

plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf: empirical cumulative distribution function
## lwd: width of the lines
## cex.axis: size of the axis annotation
## xlab, ylab: labels of the axes

(14)

Quantile and percentile function (reverse problem)

Examples: the central value, or the values that delimit the first quarter and the last quarter of the sorted data

Determine x such that FX(x) is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)

Example – Female height

rank ri 1 2 3 4 5 6 7 8 9 10

value x(i) 155 157 157 157 157 157 159 160 160 160

rank ri 11 12 13 14 15 16 17 18 19 20

value x(i) 160 163 163 165 165 165 165 166 168 168

rank ri 21 22 23 24 25 26 27 28 29 30

value x(i) 168 168 170 170 170 173 173 173 173 173

rank ri 31 32 33 34 35

value x(i) 173 175 175 175 178

- central value: 18th value (0.50 × 35 = 17.5)

- first quartile: between the 8th and 9th values (0.25 × 35 = 8.75)

- third quartile: between the 26th and 27th values (0.75 × 35 = 26.25)

In this example the height values are uniquely determined, but in general they are NOT.

Statistical software offers several definitions (R has 9).

The k-th percentile of a set of values divides them so that k% lie below and (100 − k)% lie above

(15)

Invert a not-invertible function (step function)

For each frequency $\alpha$, the quantile $q_\alpha$ is the minimum real number such that $F(q_\alpha) \ge \alpha$:

$$q_\alpha = \widetilde{F}^{-1}(\alpha) = \min\{x \in \mathbb{R} \ \text{s.t.}\ F(x) \ge \alpha\}, \qquad \alpha \in (0, 1)$$

It is the $i$-th sorted value, where $i$ is the ceiling of $n\alpha$.

If $\alpha$ is expressed as a percentage, $q_\alpha$ is called a percentile.

In general (with all definitions)

$$q_\alpha \in \left[ x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)} \right]$$

where $\lfloor n\alpha \rfloor$ is the integer part of $n\alpha$.

> quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178

Pay attention: type=1 gives “our” definition (the default is type=7). See help(quantile)
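The definition of $q_\alpha$ as the $\lceil n\alpha \rceil$-th sorted value can be compared with `quantile(..., type=1)`. The sketch below rebuilds the female heights from the frequency table of the example, so that it runs without the original data file; the names `values`, `counts`, `heights_f`, `q_manual` and `q_type1` are illustrative, with `heights_f` standing in for `altezza_f`.

```r
## Female heights (cm) rebuilt from the frequency table of the example
values <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
counts <- c(  1,   5,   1,   4,   2,   4,   1,   4,   3,   6,   3,   1)
heights_f <- rep(values, times = counts)
n <- length(heights_f)          # 35

alpha <- c(0.25, 0.50, 0.75)

## q_alpha = the ceiling(n * alpha)-th sorted value
q_manual <- sort(heights_f)[ceiling(n * alpha)]

## Same result with quantile(), provided type = 1 is requested
q_type1 <- quantile(heights_f, alpha, type = 1, names = FALSE)

q_manual   # 160 166 173
q_type1    # 160 166 173
```

With the default `type=7` the function interpolates between sorted values, so the results may differ from this definition on other data.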

(16)

Tukey’s five-number summary

minimum, Q1, Q2, Q3, maximum

> fivenum(altezza_f)
[1] 155 160 166 173 178

Inter Quartile Range (IQR): Q3 − Q1

The interval (Q1, Q3) contains the “central” 50% of the data

(17)

The box-plot

[Figure: horizontal box-plot marking Q1, Q2 and Q3, with the distances 1.5 IQR and 3 IQR on each side of the box and one point drawn as *]

- the left and right sides of the box are Q1 and Q3

- the vertical line inside the box is the median

- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box

- outliers (beyond the end of the whisker) are denoted by o (at times observations further than 3 IQR from the box are denoted by *)

(18)

Example - Weight

[Figure: dot-plot and box-plot of peso (weight in kg), horizontal axis from 50 to 90]

peso=round(WEIGHT*0.45359,1)
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)
par(new=T)
boxplot(peso, horizontal=T)

(19)

> peso=round(WEIGHT*0.45359,1)

> sort(peso)

 [1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5

> fivenum(peso)

[1] 43.10 56.70 65.80 70.75 97.50

- IQR = 70.75 − 56.70 = 14.05

- 1.5 IQR = 21.075

- 3 IQR = 42.15

- Q1 − 1.5 IQR = 56.70 − 21.075 = 35.625. No data is less than 35.6

- Q3 + 1.5 IQR = 70.75 + 21.075 = 91.825

- Q3 + 3 IQR = 70.75 + 42.15 = 112.9.
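The thresholds above decide where the whiskers end and which points are drawn as outliers. The following sketch uses only numbers printed in this example (the five-number summary and the largest sorted values of `peso`); the names `five`, `inner`, `outer_fences` and `top` are illustrative.

```r
## Five-number summary of peso, as printed by fivenum(peso)
five <- c(min = 43.10, Q1 = 56.70, Q2 = 65.80, Q3 = 70.75, max = 97.50)
iqr  <- unname(five["Q3"] - five["Q1"])

## Thresholds at 1.5 IQR and at 3 IQR from the box
inner        <- c(five["Q1"] - 1.5 * iqr, five["Q3"] + 1.5 * iqr)
outer_fences <- c(five["Q1"] - 3 * iqr,   five["Q3"] + 3 * iqr)

## Largest sorted values of peso, copied from the output of sort(peso)
top <- c(86.2, 86.2, 88.5, 97.5)

top[top > inner[2]]          # 97.5 : drawn beyond the upper whisker
max(top[top <= inner[2]])    # 88.5 : end of the upper whisker
top[top > outer_fences[2]]   # none : nothing is further than 3 IQR from the box
```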

(20)

Example - Second pulse rates

Are these true outliers?

Some students ran, some sat.

[Figure: box-plot of PULSE2, axis from 60 to 140]

(21)

Sub-groups: graphic comparisons

Example - Second pulse rates

Box plot

Each Tukey number for the students who ran is larger than for the students who sat.

Different dispersion

Different “form” (more or less symmetrical w.r.t. the median)

[Figure: box-plots of PULSE2 for the two groups, labelled corsa (ran) and fermo (sat); vertical axis from 60 to 140]

boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))
## "corsa" = ran, "fermo" = sat

(22)

Cumulative distribution function

The CDF of the students who sat is steeper than the CDF of the students who ran:

steep slope ≡ low dispersion

Each quantile of “ran” is greater than the corresponding quantile of “sat”

plot(ecdf(PULSE2[RAN==1]),pch=19,
cex.axis=1.2, xlim=c(50,140),col="blue",main="")
par(new=T)
# the following plot is drawn on the same graphic window
plot(ecdf(PULSE2[RAN==2]),pch=17,
cex.axis=1.2, xlim=c(50,140),col="red",main="")
# pch=19 circle, pch=17 triangle

[Figure: the two empirical CDFs Fn(x) on the same axes, x from 60 to 140; red triangles: sat, blue circles: ran]

(23)

Example - Male and female height

[Figure: box-plots of height for M and F, and the two empirical CDFs Fn(x) on the same axes; heights from 155 to 190]

same “form”, slope and dispersion; different values

(24)

Indices for quantitative variables

Position indices – central tendencies

- median: $Q_2 = \min\{x \mid F(x) \ge 0.50\}$

- mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

- trimmed mean: mean of the “central” 90% of the data

- mode: value with maximal frequency

Properties

1. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

2. $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - a)^2 \quad \forall\, a \in \mathbb{R}$: the mean is the centroid of the distribution (equilibrium point)

3. $\sum_{i=1}^{n} |x_i - Q_2| \le \sum_{i=1}^{n} |x_i - a| \quad \forall\, a \in \mathbb{R}$

The mean is affected by outliers; the median is not.

[Figure: dot-plot with a few isolated large values, horizontal axis from 60 to 140]
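The three properties, and the different sensitivity of mean and median to outliers, can be checked numerically. The sketch below uses a hypothetical toy vector with one large value; `ss`, `sad` and `grid` are illustrative names, and the minimisation is done by brute force over a grid rather than analytically.

```r
## Hypothetical toy data with one large value
x <- c(60, 62, 64, 66, 140)

## 1. Deviations from the mean sum to zero (up to rounding error)
sum(x - mean(x))

## 2. The mean minimises the sum of squared deviations
ss  <- function(a) sum((x - a)^2)
## 3. The median minimises the sum of absolute deviations
sad <- function(a) sum(abs(x - a))

grid <- seq(50, 150, by = 0.1)
best_ss  <- grid[which.min(sapply(grid, ss))]    # close to mean(x) = 78.4
best_sad <- grid[which.min(sapply(grid, sad))]   # close to median(x) = 64

## Effect of the outlier: the mean moves a lot, the median barely does
## (note: median() averages the two central values when n is even)
c(mean = mean(x),     median = median(x))
c(mean = mean(x[-5]), median = median(x[-5]))
```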

(25)

Variability indices

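• ranges

- range (R): max − min

- interquartile range (IQR): Q3 − Q1

• indices based on the deviations from a central value

- variance and standard deviation ($V(X)$ or $\sigma_X^2$ or $\sigma^2$; $\mathrm{std}(X)$ or $\sigma_X$ or $\sigma$)

$$V(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \mathrm{std}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

(Variance and standard deviation can also be defined with $n - 1$ in place of $n$)

- mean absolute deviations (from the mean and from the median)

$$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \qquad \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$$

• variability w.r.t. the central value: coefficient of variation

$$\mathrm{CV}(X) = \frac{\mathrm{std}(X)}{\bar{x}} \qquad (\text{if } \bar{x} \ne 0)$$

Properties: $\sigma \le \frac{R}{2}$ and $|\bar{x} - Q_2| \le \sigma$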
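Since the indices above are defined with denominator $n$, while R's `var()` and `sd()` divide by $n - 1$, a short check may help avoid confusion. The sketch below uses hypothetical toy data; all variable names are illustrative.

```r
## Hypothetical toy data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

## Variance and standard deviation with denominator n (as in the formulas above)
var_n <- mean((x - mean(x))^2)   # 4
sd_n  <- sqrt(var_n)             # 2

## R's var() and sd() use n - 1; rescaling recovers the n version
var_n1 <- var(x)                 # 32/7
all.equal(var_n, var_n1 * (n - 1) / n)

## Mean absolute deviations from the mean and from the median
mad_mean   <- mean(abs(x - mean(x)))
mad_median <- mean(abs(x - median(x)))

## Coefficient of variation
cv <- sd_n / mean(x)             # 0.4

## The two properties stated above
sd_n <= diff(range(x)) / 2
abs(mean(x) - median(x)) <= sd_n
```

Note that base R also has a function called `mad()`, which computes a different index (a scaled median of the absolute deviations from the median), so it is not used here.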

(26)

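Mean and variance in a population and its sub-groups

Two subgroups: A and B

$n_A$ and $n_B$ units, $f_A$ and $f_B$ frequencies, $\bar{x}_A$ and $\bar{x}_B$ means and $\sigma_A^2$ and $\sigma_B^2$ variances.

[Figure: histograms (frequency, horizontal axis from 10 to 50) of the two sub-groups]

$n_A = 100$, $\bar{x}_A = 9.9$, $\sigma_A^2 = 6.5$; $n_B = 300$, $\bar{x}_B = 30.0$, $\sigma_B^2 = 4.4$

Mean

$$\bar{x}_{tot} = f_A\, \bar{x}_A + f_B\, \bar{x}_B \qquad \text{(weighted mean)}$$

$$\bar{x}_{tot} = \frac{100}{400}\, 9.9 + \frac{300}{400}\, 30.0 \approx 24.98$$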

(27)

Variance

[Figure: histograms (frequency, horizontal axis from 10 to 50) of the two sub-groups in two cases; above: $\bar{x}_B = 30.0$, below: $\bar{x}_B = 40.0$]

$$\sigma_{tot}^2 = \left[ f_A\, \sigma_A^2 + f_B\, \sigma_B^2 \right] + \left[ f_A\, (\bar{x}_A - \bar{x}_{tot})^2 + f_B\, (\bar{x}_B - \bar{x}_{tot})^2 \right]$$

weighted variance plus weighted “variance of the means”

(28)

In the example:

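above: $\sigma_{tot}^2 = 80.68$

below: $\sigma_{tot}^2 = 179.53$

In general

$$\bar{x}_{tot} = \sum_{k=1}^{K} f_k\, \bar{x}_k$$

$$\sigma_{tot}^2 = \sum_{k=1}^{K} f_k\, \sigma_k^2 + \sum_{k=1}^{K} f_k\, (\bar{x}_k - \bar{x}_{tot})^2$$

total variance = within classes variance + between classes variance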
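The decomposition can be evaluated directly from the sub-group summaries. The sketch below uses the numbers of the first case of the example (sub-group B with mean 30.0) and reproduces the total variance of 80.68; the variable names are illustrative.

```r
## Sub-group summaries from the example (case with mean of B equal to 30.0)
n_k    <- c(A = 100, B = 300)
mean_k <- c(A = 9.9, B = 30.0)
var_k  <- c(A = 6.5, B = 4.4)

f_k <- n_k / sum(n_k)                           # sub-group frequencies

mean_tot <- sum(f_k * mean_k)                   # weighted mean: 24.975
within   <- sum(f_k * var_k)                    # within classes variance
between  <- sum(f_k * (mean_k - mean_tot)^2)    # between classes variance
var_tot  <- within + between                    # 80.676875, i.e. 80.68

c(mean_tot = mean_tot, within = within, between = between, var_tot = var_tot)
```

Most of the total variance here comes from the between classes term, because the two sub-group means are far apart compared with the spread inside each sub-group.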

(29)

The histogram

Pay attention to the different “form” of the two histograms of the same data with different break points.

Left: asymmetrical distribution. Right: symmetrical distribution.

[Figure: two histograms of dati_ist (frequency against dati_ist, horizontal axis from 110 to 160, vertical axis from 0 to 15), one for each set of break points used in the code that follows]

dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135,
136,137,138,139,141,142,144,148,150,155,156)
par(mfrow=c(1,2))
hist(dati_ist,breaks=c(115,130, 145,160),xlim=c(110,160),freq=T,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127, 142,157),xlim=c(110,160),freq=T,ylim=c(0,16))
par(mfrow=c(1,1))

Use a histogram only if different break points and class widths produce the same “form”.
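The change of “form” can also be read from the class counts, without drawing anything. The sketch below calls `hist()` with `plot = FALSE` on the same data and the same two sets of break points; `h_left` and `h_right` are illustrative names.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

## Same data, two sets of break points; plot = FALSE returns the counts only
h_left  <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h_right <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

h_left$counts    # 15 11  4 : decreasing, asymmetrical
h_right$counts   #  9 16  5 : highest class in the middle
```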

(30)

Distribution and graphic representation of two quantitative variables

X and Y measured on the same population

The set of the points (xi, yi) and of the corresponding frequencies is the joint distribution of X and Y .

The scatter-plot

Example: height and weight of the students in a graduate class

peso_alt=read.table(file="C:/DATA/stid98.txt",header=F,
col.names=c("altezza","peso","genere"))
## altezza = height, peso = weight, genere = gender
attach(peso_alt)
plot(altezza,peso,pch=16,col="red")

[Figure: scatter-plot of peso (50 to 90) against altezza (155 to 190)]

(31)

Indices for two quantitative variables

Covariance

$X^c$ and $Y^c$ centred variables

$$X^c = X - \bar{x} \qquad Y^c = Y - \bar{y}$$

$$\Gamma(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^c\, y_i^c = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})\, (y_i - \bar{y})$$

Interpretation

Place the origin at the centroid $(\bar{x}, \bar{y})$.

The points in the I and III quadrants contribute positively to $\sum_{i=1}^{n} x_i^c\, y_i^c$, while the points in the II and IV quadrants contribute negatively.

[Figure: scatter-plot with the axes through the centroid; two quadrants are marked “+”]

(32)

Then the covariance is

- positive if, on average, high values of X correspond to high values of Y (and low values of X to low values of Y)

- negative if, on average, high values of X correspond to low values of Y (and low values of X to high values of Y)

- about zero otherwise

[Figure: three scatter-plots of (x1, y1), (x2, y2) and (x3, y3); the first is marked “+” and the second “~0”]

(33)

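Correlation index (or linear correlation index)

$X^s$ and $Y^s$ standardized variables:

$$X^s = \frac{X - \bar{x}}{\sigma_X} \quad \text{and} \quad Y^s = \frac{Y - \bar{y}}{\sigma_Y}$$

$$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i^s\, y_i^s = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sqrt{\sum_j (x_j - \bar{x})^2}}\, \frac{(y_i - \bar{y})}{\sqrt{\sum_j (y_j - \bar{y})^2}} = \frac{\Gamma(X, Y)}{\sigma_X\, \sigma_Y}$$

Some properties:

- it does not depend on the variability of the variables

- it is a pure number

- $\rho(X, Y) \in [-1, 1]$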
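The definitions of $\Gamma(X, Y)$ and $\rho(X, Y)$ can be compared with R's `cov()` and `cor()`. A minimal sketch on hypothetical toy data follows; the point to notice is that `cov()` divides by $n - 1$, while the correlation is the same with either denominator. Variable names are illustrative.

```r
## Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

## Covariance with denominator n, as in the definition of Gamma(X, Y)
gamma_xy <- mean((x - mean(x)) * (y - mean(y)))   # 1.2

## R's cov() divides by n - 1
all.equal(gamma_xy, cov(x, y) * (n - 1) / n)

## Standard deviations with denominator n
sigma_x <- sqrt(mean((x - mean(x))^2))
sigma_y <- sqrt(mean((y - mean(y))^2))

## Correlation as the mean product of the standardized variables
rho <- mean(((x - mean(x)) / sigma_x) * ((y - mean(y)) / sigma_y))

## The n versus n - 1 choice cancels in the ratio, so cor() agrees
all.equal(rho, gamma_xy / (sigma_x * sigma_y))
all.equal(rho, cor(x, y))
```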

(34)

Remark 1 : positive correlation does not imply causality

Remark 2 : the correlation index detects only the linear dependence

[Figure: scatter-plot (horizontal axis from 20 to 40, vertical axis from 0 to 100) of two variables with ρ(X, Y) = 0.02]
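Remark 2 can be illustrated with a case where Y is completely determined by X and yet the correlation index is zero. This is a sketch on hypothetical data, not the data behind the figure.

```r
## Hypothetical data: Y is an exact function of X, but not a linear one
x <- -3:3
y <- x^2

cor(x, y)        # zero up to rounding error: no linear dependence
cor(abs(x), y)   # about 0.96 after a suitable transformation of X
```

A correlation index near zero therefore says that there is no linear dependence, not that the two variables are unrelated.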

(35)

Covariance and correlation indices in a population and its sub-groups

$$\Gamma_{tot}(X, Y) = \sum_{k=1}^{K} f_k\, \Gamma_k(X, Y) + \sum_{k=1}^{K} f_k\, (\bar{x}_k - \bar{x})\, (\bar{y}_k - \bar{y})$$

(cf. formula of the variance)

(36)

Some paradoxes

left: $\rho_{tot} < 0$, $\rho_1 > 0$, $\rho_2 > 0$, $\rho_3 > 0$

right: $\rho_{tot} \simeq 0$, $\rho_1 > 0$, $\rho_2 < 0$

[Figure: two scatter-plots with sub-groups; left panel with axes a and b, right panel with axes x and y]

Example: height and weight of the students in a graduate class (continued)

$\rho_{tot} = 0.61$, $\rho_M = 0.26$, $\rho_F = 0.78$.

plot(altezza,peso,pch=c(16,17)[genere],
col=c("red", "blue")[genere])

[Figure: scatter-plot of peso (50 to 90) against altezza (155 to 190), with symbol and colour by genere; tot: 0.61, M: 0.26, F: 0.78]

(37)

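Linear transformations of variables: how the indices change

Univariate case

X variable with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$

• change of position: R = X + a

$$\bar{r} = \bar{x} + a \qquad \sigma_R^2 = \sigma_X^2 \qquad q_\alpha^R = q_\alpha^X + a$$

• change of dispersion/scale: T = b X

$$\bar{t} = b\, \bar{x} \qquad \sigma_T^2 = b^2 \sigma_X^2 \qquad q_\alpha^T = b\, q_\alpha^X$$

• change of both indices: Z = b X + a

$$\bar{z} = b\, \bar{x} + a \qquad \sigma_Z^2 = b^2 \sigma_X^2 \qquad q_\alpha^Z = b\, q_\alpha^X + a$$

(the relations for the quantiles hold for b > 0; a negative b reverses the order of the data)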
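These rules are easy to verify numerically. The sketch below uses hypothetical toy data and coefficients; `var_n` and `q` are illustrative helpers for the variance with denominator $n$ and for the quantile with `type=1`.

```r
## Hypothetical toy data and coefficients
x <- c(1, 3, 4, 8)
a <- 8
b <- 2
z <- b * x + a

var_n <- function(v) mean((v - mean(v))^2)   # variance with denominator n
q     <- function(v, alpha) quantile(v, alpha, type = 1, names = FALSE)

all.equal(mean(z), b * mean(x) + a)          # mean: shifted and rescaled
all.equal(var_n(z), b^2 * var_n(x))          # variance: only rescaled, by b^2
all.equal(q(z, 0.25), b * q(x, 0.25) + a)    # quantile: holds because b > 0
```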

(38)

[Figure: distribution plots on a common horizontal axis from 0 to 20, in panels titled R=X+8 and Z=2X+8, followed by three panels titled X, R=X+3 and Z=2X+3]

Pay attention to the ticks on the axes

(39)

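Two special transformations:

• centered variable:

$$X^c = X - \bar{x} \qquad \bar{x}^c = 0 \qquad \sigma_{X^c}^2 = \sigma_X^2$$

• standardized variable:

$$X^s = \frac{X - \bar{x}}{\sigma_X} \qquad \bar{x}^s = 0 \qquad \sigma_{X^s}^2 = 1$$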

(40)

Bivariate case

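X with $\bar{x}$, $\sigma_X^2$ and $q_\alpha^X$

Y with $\bar{y}$, $\sigma_Y^2$ and $q_\alpha^Y$

• change of position: R = X + a, M = Y + c

$$\Gamma(R, M) = \Gamma(X, Y) \qquad \rho(R, M) = \rho(X, Y)$$

• change of dispersion: T = b X, N = d Y

$$\Gamma(T, N) = b\, d\, \Gamma(X, Y) \qquad \rho(T, N) = \mathrm{sign}(b d)\, \rho(X, Y)$$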
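The bivariate rules can be checked in the same way. The sketch below uses hypothetical toy data and coefficients, with a negative `d` so that the sign change of the correlation is visible; `cov_n` is an illustrative helper for the covariance with denominator $n$, and `shift_y` plays the role of the constant c.

```r
## Hypothetical toy data and coefficients
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
a <- 8; shift_y <- -1    # changes of position
b <- 2; d <- -3          # changes of scale (d is negative on purpose)

cov_n <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

## Change of position: covariance and correlation are unchanged
all.equal(cov_n(x + a, y + shift_y), cov_n(x, y))
all.equal(cor(x + a, y + shift_y), cor(x, y))

## Change of scale: covariance is multiplied by b * d,
## correlation keeps its absolute value and takes the sign of b * d
all.equal(cov_n(b * x, d * y), b * d * cov_n(x, y))
all.equal(cor(b * x, d * y), sign(b * d) * cor(x, y))
```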
