Exploratory Data Analysis part 2

Academic year: 2021

(1)

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

http://www.dima.unige.it/~rogantin/UnigeStat/

(2)

Distribution of a quantitative variable X on a finite population

Example – First pulse rates

PULSE1

 [1] 64 58 62 66 64 74 84 68 62 76 90 80 92 68 60 62 66 70 68
[20] 72 70 74 66 70 96 62 78 82 100 68 96 78 88 62 80 62 60 72
[39] 62 76 68 54 74 74 68 72 68 82 64 58 54 70 62 48 76 88 70
[58] ...

sort(PULSE1)

 [1] 48 54 54 58 58 58 60 60 60 60 61 62 62 62 62 62 62 62 62
[20] 62 64 64 64 64 66 66 66 66 66 68 68 68 68 68 68 68 68 68
[39] 68 68 70 70 70 70 70 70 72 72 72 72 72 72 74 74 74 74 74
[58] 76 76 76 76 76 78 78 78 78 78 80 80 80 82 82 82 84 84 84
[77] 84 86 87 88 88 88 90 90 90 90 92 92 94 96 96 100

round(prop.table(table(PULSE1)),2)

  48   54   58   60   61   62   64   66   68   70   72   74   76   78   80
0.01 0.02 0.03 0.04 0.01 0.10 0.04 0.05 0.12 0.07 0.07 0.05 0.05 0.05 0.03
  82   84   86   87   88   90   92   94   96  100
0.03 0.04 0.01 0.01 0.03 0.04 0.02 0.01 0.02 0.01

sum(round(prop.table(table(PULSE1)),2)) ## pay attention to round
[1] 0.96
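The warning about `round` can be checked on a tiny example. The sketch below assumes a hypothetical vector `x` (not taken from the slides) with three equally frequent values: the exact frequencies sum to 1, while the rounded ones do not.

```r
# Hypothetical data (an assumption for illustration): three values,
# each with relative frequency 1/3
x <- c(1, 2, 3)

f <- prop.table(table(x))          # exact relative frequencies

sum_exact   <- sum(f)              # frequencies before rounding
sum_rounded <- sum(round(f, 2))    # each 1/3 is rounded to 0.33 first

print(c(exact = sum_exact, rounded = sum_rounded))
```

Rounding is best left to the final display step, after all sums have been computed on the unrounded frequencies.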

$x_k$: observed value of $X$

$f_k$: its (relative) frequency, with $f_k \in (0, 1)$ and $\sum_{k=1}^{K} f_k = 1$

Distribution of $X$: $(x_1, f_1), \dots, (x_k, f_k), \dots, (x_K, f_K)$

(there are many more observed values in a quantitative variable than in a qualitative one)

(3)

Dot-plot

Useful when the observed values are “not too close” to one another

[Figure: dot-plot of PULSE1; horizontal axis from 50 to 100]

stripchart(PULSE1,method="stack", offset=.5,at =.15,pch=19)

(4)

The histogram

Note the different “form” of the two histograms of the same data, drawn with different break points

Left: asymmetrical distribution

Right: symmetrical distribution

[Figure: two histograms of dati_ist side by side; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 15]

dati_ist=c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,132,133,134,135,
           136,137,138,139,141,142,144,148,150,155,156)

par(mfrow=c(1,2)) ## two plots side by side
hist(dati_ist,breaks=c(115,130,145,160),xlim=c(110,160),freq=T,ylim=c(0,16))
hist(dati_ist,breaks=c(112,127,142,157),xlim=c(110,160),freq=T,ylim=c(0,16))
par(mfrow=c(1,1)) ## back to a single plot

Use a histogram only if different break points and class widths produce the same “form”
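The change of “form” can be verified numerically by looking at the class counts instead of the plots. A minimal sketch using the `dati_ist` values and the two sets of break points from the slide; the names `h_left` and `h_right` are introduced here only for illustration.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# plot = FALSE returns the histogram object without drawing it.
# By default the classes are right-closed, e.g. (115,130], (130,145], ...
h_left  <- hist(dati_ist, breaks = c(115, 130, 145, 160), plot = FALSE)
h_right <- hist(dati_ist, breaks = c(112, 127, 142, 157), plot = FALSE)

print(h_left$counts)   # counts per class, first set of break points
print(h_right$counts)  # counts per class, second set of break points
```

The first set of counts decreases from left to right, while the second peaks in the middle class, which is why the two pictures suggest different shapes for the same 30 observations.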

(5)

Use a large number of classes

[Figure: histogram of dati_ist with many classes; x-axis dati_ist from 110 to 160, y-axis Frequency from 0 to 3]

hist(dati_ist,breaks=30,xlim=c(110,160),freq=T)

## breaks=30 asks for about 30 classes; R treats the number as a suggestion and picks "pretty" break points

(6)

Cumulative distribution function of X, $F_X$ (or $F$)

$F_X(x)$ is the (relative) frequency of units with value less than or equal to $x$:

$$F_X(x) = \frac{\#\{\text{obs} \le x\}}{n} \quad \text{for } x \in \mathbb{R}$$

- if $x$ is observed, $x = x_k$: $F_X(x_k) = \sum_{i=1}^{k} f_i$
- if $x$ is between two observed values, $x \in [x_k, x_{k+1})$, then $F_X(x) = F_X(x_k)$
- if $x < \min X$ then $F_X(x) = 0$; if $x \ge \max X$ then $F_X(x) = 1$

$F_X : \mathbb{R} \to [0, 1]$

$F_X$ is a step function
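The definition translates almost literally into R. The sketch below reuses the `dati_ist` values from the histogram slides; `F_manual` and the evaluation points `pts` are names and values chosen here for illustration, and the result is compared with R's built-in `ecdf`.

```r
dati_ist <- c(117,117,118,119,121,122,122,126,127,128,128,128,129,129,129,
              132,133,134,135,136,137,138,139,141,142,144,148,150,155,156)

# Direct translation of the definition: share of observations <= x
F_manual <- function(x) mean(dati_ist <= x)

# Built-in empirical CDF: ecdf() returns a step function
F_ecdf <- ecdf(dati_ist)

# Points below the minimum, at observed values, between observed
# values, and above the maximum
pts <- c(116, 117, 129, 130, 131.9, 156, 200)
print(cbind(x = pts, manual = sapply(pts, F_manual), ecdf = F_ecdf(pts)))
```

Between 129 and 132 there are no observations, so the function stays flat there; it jumps only at observed values.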

(7)

Example – Female height (in cm)

• $x_k$: sorted (and non-repeated) observed values

• $n_k$: count of $x_k$

• $N_k$: cumulative counts

• $f_k$: frequency of $x_k$

• $F_k$: cumulative frequencies

x_k   n_k   N_k   f_k (%)   F_k (%)
155     1     1      2.86      2.86
157     5     6     14.29     17.14
159     1     7      2.86     20.00
160     4    11     11.43     31.43
163     2    13      5.71     37.14
165     4    17     11.43     48.57
166     1    18      2.86     51.43
168     4    22     11.43     62.86
170     3    25      8.57     71.43
173     6    31     17.14     88.57
175     3    34      8.57     97.14
178     1    35      2.86    100.00

[Figure: plot of ecdf(altezza_f); x-axis "altezza femmine (X)" from 155 to 180, y-axis "F_X (x)" from 0 to 1]

(8)

R code

altezza_f=round(HEIGHT[SEX==2]*2.54)
## select HEIGHT for units having SEX==2
## the product (from inch to cm) acts on all elements
## round(A) is equivalent to round(A,0)

count_alt_f=table(altezza_f)        ## counts of the variable
cum_count_alt_f=cumsum(count_alt_f) ## cumulative counts

freq_alt_f=prop.table(count_alt_f)
round(freq_alt_f*100,2)
round(cumsum(freq_alt_f)*100,2)
## percentage frequencies and percentage cumulative frequencies

plot(ecdf(altezza_f),lwd=2,cex.axis=1.2,
     xlab="altezza femmine (X)",ylab="F_X (x)")
## ecdf      empirical cumulative distribution function
## lwd       line width
## cex.axis  magnification of the axis tick labels
## xlab ylab labels of the axes
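The `HEIGHT` and `SEX` vectors are not listed in the slides, so the code above cannot be run as it stands. As a workaround, the sketch below rebuilds `altezza_f` from the counts in the female-height table (an assumption: the 35 values are taken to be exactly those tabulated) and reproduces the five columns of that table; the data frame name `tab` is introduced here for illustration.

```r
# Observed values and their counts, copied from the female-height table
x_k <- c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178)
n_k <- c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1)

# Rebuild the raw data: each value repeated as many times as its count
altezza_f <- rep(x_k, times = n_k)

count_alt_f <- table(altezza_f)        # counts
freq_alt_f  <- prop.table(count_alt_f) # relative frequencies

tab <- data.frame(
  xk = as.numeric(names(count_alt_f)),
  nk = as.vector(count_alt_f),
  Nk = as.vector(cumsum(count_alt_f)),
  fk = round(as.vector(freq_alt_f) * 100, 2),         # percentages
  Fk = round(as.vector(cumsum(freq_alt_f)) * 100, 2)  # cumulative percentages
)
print(tab)
```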

(9)

Quantile and percentile function (reverse problem)

Examples: the central value of the sorted data, and the values that delimit its first and last quarters

Determine x such that $F_X(x)$ is (about) 0.50, 0.25, 0.75 (median, first quartile, third quartile)

Example – Female height

rank r_i      1   2   3   4   5   6   7   8   9  10
value x_(i) 155 157 157 157 157 157 159 160 160 160

rank r_i     11  12  13  14  15  16  17  18  19  20
value x_(i) 160 163 163 165 165 165 165 166 168 168

rank r_i     21  22  23  24  25  26  27  28  29  30
value x_(i) 168 168 170 170 170 173 173 173 173 173

rank r_i     31  32  33  34  35
value x_(i) 173 175 175 175 178

- central value: 18th value (0.50 × 35 = 17.5; 17 values lie below it and 17 above)

- first quartile: between the 8th and 9th values (0.25 × 35 = 8.75)

- third quartile: between the 26th and 27th values (0.75 × 35 = 26.25)

Here the heights are uniquely determined, because the neighbouring sorted values coincide, but in general they are NOT.

Statistical software offers several definitions (R has 9).

The k-th percentile of a set of real numbers partitions it so that k% of its values lie below the k-th percentile and (100 − k)% lie above it

(10)

Inverting a non-invertible function (step function): a mathematical definition

For each frequency $\alpha \in (0, 1)$, the quantile $q_\alpha$ is the minimum real number such that $F(q_\alpha) \ge \alpha$:

$$q_\alpha = F^{-1}(\alpha) = \min\{x \in \mathbb{R} \ \text{s.t.}\ F(x) \ge \alpha\}, \quad \alpha \in (0, 1)$$

If $\alpha$ is expressed as a percentage, $\alpha \in (0, 100)$, $q_\alpha$ is called a percentile

In general (with all definitions; R has 9 definitions)

$$q_\alpha \in \left[ x_{(\lfloor n\alpha \rfloor)},\ x_{(\lfloor n\alpha \rfloor + 1)} \right]$$

where $\lfloor n\alpha \rfloor$ is the integer part of $n\alpha$

quantile(altezza_f,seq(0,1,0.25),type=1)
  0%  25%  50%  75% 100%
 155  160  166  173  178

Pay attention: type=1 gives “our” definition (the default is type=7)

See help(quantile)
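To see that `type=1` matches the definition $q_\alpha = \min\{x : F(x) \ge \alpha\}$, the sketch below implements the definition directly and compares it with `quantile`. It rests on two assumptions: `altezza_f` is rebuilt from the counts in the female-height table, and the helper `q_alpha` and the four-value vector `y` are hypothetical names and data introduced only for illustration.

```r
# Female heights rebuilt from the frequency table (values and counts)
altezza_f <- rep(c(155, 157, 159, 160, 163, 165, 166, 168, 170, 173, 175, 178),
                 times = c(1, 5, 1, 4, 2, 4, 1, 4, 3, 6, 3, 1))

# Definition: smallest observed x whose cumulative frequency reaches alpha
q_alpha <- function(x, alpha) {
  xs <- sort(unique(x))
  Fx <- ecdf(x)
  min(xs[Fx(xs) >= alpha])
}

probs   <- c(0.25, 0.50, 0.75)
q_def   <- sapply(probs, function(a) q_alpha(altezza_f, a))
q_type1 <- unname(quantile(altezza_f, probs, type = 1))
print(rbind(definition = q_def, type1 = q_type1))

# Hypothetical data where the definitions disagree: with an even number
# of distinct values, type 7 interpolates between the two central ones
y <- c(1, 2, 3, 4)
med_type1 <- unname(quantile(y, 0.5, type = 1))
med_type7 <- unname(quantile(y, 0.5, type = 7))
print(c(type1 = med_type1, type7 = med_type7))
```

With type 1 the quantile is always one of the observed values; the interpolating types can return a number that was never observed.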

(11)

Tukey’s five-number summary

minimum, Q1, Q2, Q3, maximum

fivenum(altezza_f)

[1] 155 160 166 173 178

Interquartile range (IQR): Q3 − Q1

The interval (Q1, Q3) contains the “central” 50% of the data

(12)

The box-plot

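[Figure: schematic horizontal box-plot; box from Q1 to Q3 with Q2 marked inside, and a span of 1.5 IQR marked on each side of the box]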

- the left and right sides of the box are Q1 and Q3

- the vertical line inside the box is the median

- the “whiskers” extend to the furthest observation which is no more than 1.5 times IQR from the box

- outliers (beyond the end of the whisker) are denoted by o

(13)

Example - Weight

[Figure: dot-plot and box-plot of peso; horizontal axis from 50 to 90]

peso=round(WEIGHT*0.45359,1) ## from pounds to kg
boxplot(peso, horizontal=T)
par(new=T) ## draw the next plot on the same graphic window
stripchart(peso,method="stack",offset=.5,at =.7,pch=19)

(14)

sort(peso)

 [1] 43.1 46.3 49.0 49.0 49.9 49.9 50.8 52.2 52.2 52.6 52.6 53.5
[13] 53.5 54.4 54.4 54.4 54.9 55.3 55.8 56.7 56.7 56.7 56.7 56.7
[25] 59.0 59.0 59.0 59.0 59.0 59.4 60.3 61.2 61.2 61.2 61.7 62.6
[37] 62.6 63.5 63.5 63.5 63.5 64.4 65.8 65.8 65.8 65.8 65.8 67.1
[49] 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 68.0 69.4 70.3
[61] 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 70.3 71.2 72.6 72.6
[73] 72.6 72.6 74.4 74.8 77.1 77.1 77.1 77.1 79.4 79.4 81.6 81.6
[85] 81.6 83.9 86.2 86.2 86.2 86.2 88.5 97.5

fivenum(peso)

[1] 43.10 56.70 65.80 70.75 97.50

- IQR = 70.75 − 56.70 = 14.05

- 1.5 IQR = 21.075

- 3 IQR = 42.15

- Q1 − 1.5 IQR = 56.70 − 21.075 = 35.625. No data value is less than 35.6

- Q3 + 1.5 IQR = 70.75 + 21.075 = 91.825. Only 97.5 is greater, so it is drawn as an outlier
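The arithmetic above can be reproduced in R. The sketch below types in the 92 sorted `peso` values printed by `sort(peso)` (the raw `WEIGHT` vector is not listed in the slides), computes the two limits by hand and compares them with `boxplot.stats`, the function `boxplot` relies on. The names `five`, `lower_fence`, `upper_fence` and `bs` are introduced here for illustration.

```r
# The 92 weights (kg) as printed by sort(peso)
peso <- c(43.1, 46.3, 49.0, 49.0, 49.9, 49.9, 50.8, 52.2, 52.2, 52.6, 52.6, 53.5,
          53.5, 54.4, 54.4, 54.4, 54.9, 55.3, 55.8, 56.7, 56.7, 56.7, 56.7, 56.7,
          59.0, 59.0, 59.0, 59.0, 59.0, 59.4, 60.3, 61.2, 61.2, 61.2, 61.7, 62.6,
          62.6, 63.5, 63.5, 63.5, 63.5, 64.4, 65.8, 65.8, 65.8, 65.8, 65.8, 67.1,
          68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 68.0, 69.4, 70.3,
          70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 70.3, 71.2, 72.6, 72.6,
          72.6, 72.6, 74.4, 74.8, 77.1, 77.1, 77.1, 77.1, 79.4, 79.4, 81.6, 81.6,
          81.6, 83.9, 86.2, 86.2, 86.2, 86.2, 88.5, 97.5)

five <- fivenum(peso)            # minimum, Q1, Q2, Q3, maximum
iqr  <- five[4] - five[2]        # Q3 - Q1

lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Observations outside the two limits are the candidate outliers
outliers <- peso[peso < lower_fence | peso > upper_fence]

bs <- boxplot.stats(peso)        # same rule, as applied by boxplot()
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
print(bs$stats)                  # whisker ends replace minimum and maximum
```

In `bs$stats` the first and last entries are the ends of the whiskers, that is, the most extreme observations still inside the limits, not the limits themselves.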

(15)

Example - Second pulse rates

Are they true outliers?

Some students ran, some sat.

boxplot(PULSE2)

### horizontal=F is the default

[Figure: box-plot of PULSE2; vertical axis from 60 to 140]

(16)

Sub-groups: graphic comparisons

Example - Second pulse rates

Box plot

Each Tukey number for the students who ran is larger than for the students who sat.

Different dispersion

Different “form” (more or less symmetrical w.r.t. the median)

[Figure: box-plots of PULSE2 for the groups corsa (ran) and fermo (sat); vertical axis from 60 to 140]

boxplot(PULSE2~RAN,cex.axis=1.2,lwd=2,names=c("corsa","fermo"))

(17)

Cumulative distribution function

The CDF of the students who sat is steeper than the CDF of the students who ran:

steep slope ≡ low dispersion

Each quantile of “ran” is greater than the corresponding quantile of “sat”

plot(ecdf(PULSE2[RAN==1]),pch=19,
     cex.axis=1.2, xlim=c(50,140),main="")
par(new=T)
#the following plot on the same graphic window
plot(ecdf(PULSE2[RAN==2]),pch=17,
     cex.axis=1.2, xlim=c(50,140),col="red",main="")
# pch=19 circle pch=17 triangle

[Figure: empirical CDFs Fn(x) of PULSE2 for the two groups, x from 60 to 140; red triangles: sat, black circles: ran]
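The two readings of the plot (quantile ordering and slope versus dispersion) can also be checked numerically. The `PULSE2` and `RAN` vectors are not listed in the slides, so the sketch below uses two small hypothetical samples, `sat` and `ran`, invented only to illustrate the comparison; they are not the course data.

```r
# Hypothetical pulse rates (assumption: NOT the PULSE2 data)
sat <- c(58, 62, 64, 66, 68, 72)
ran <- c(70, 84, 96, 104, 118, 136)

# Quantiles of the two groups at the same frequencies (type 1 definition)
probs <- seq(0.1, 0.9, by = 0.1)
q_sat <- quantile(sat, probs, type = 1)
q_ran <- quantile(ran, probs, type = 1)
print(rbind(sat = q_sat, ran = q_ran))

# TRUE when every quantile of "ran" exceeds the matching one of "sat"
all_greater <- all(q_ran > q_sat)

# A steeper CDF climbs from 0 to 1 over a shorter interval,
# which shows up as a smaller interquartile range
spread <- c(sat = IQR(sat), ran = IQR(ran))
print(all_greater)
print(spread)
```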

(18)

Example - Male and female height

[Figure: box-plots of altezza for M and F (vertical axis from 155 to 190), and empirical CDFs Fn(x) of the two groups for x from 155 to 190]

Same “form”, slope and dispersion; different values

altezza=HEIGHT*2.54

boxplot(altezza~SEX,cex.axis=1.2,lwd=2,names=c("M","F"))

plot(ecdf(altezza[SEX==1]),pch=19, cex.axis=1.2,
     xlim=c(min(altezza),max(altezza)),main="") ## pay attention to xlim
par(new=T) #the following plot on the same graphic window
plot(ecdf(altezza[SEX==2]),pch=17,cex.axis=1.2, ## pay attention to xlim
     xlim=c(min(altezza),max(altezza)),col="red",main="")
