Introductory course in statistics

(1)

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

(2)

Exploratory data analysis

analysis of the available data

Inferential statistics

analysis of the available sample, linked through probability to information about the population or process

No inference possible on census data

(3)

Structure of the course

• Exploratory data analysis ∼ 33%

• Inferential Statistics

– estimates and estimators ∼ 10%

– statistical hypothesis tests ∼ 57%

(4)

Exploratory Data Analysis

From

https://en.wikipedia.org/wiki/Exploratory_data_analysis

John W. Tukey wrote the book Exploratory Data Analysis in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

(5)

The objectives of EDA are to:

• Suggest hypotheses about the causes of observed phenomena

• Assess assumptions on which statistical inference will be based

• Support the selection of appropriate statistical tools and techniques

• Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are also being taught to young students as a way to introduce them to statistical thinking.

(6)

Running example

Students in a statistics class measured their pulse rates, then each tossed a coin: those who got heads ran for one minute, those who got tails sat. They then measured their pulse rates again.

Variables

PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: heads, ran – 2: tails, sat)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inches)
WEIGHT    Weight (lb)
ACTIVITY  Frequency (0: nothing – 1: low – 2: moderate – 3: high)

OBS  PULSE1  PULSE2  RAN  SMOKES  SEX  HEIGHT  WEIGHT  ACTIVITY
  1      64      88    1       2    1   66.00     140         2
  2      58      70    1       2    1   72.00     145         2
  3      62      76    1       1    1   73.50     160         3
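As a minimal sketch, the three observations shown above can be typed in by hand as an R data frame. The object name `pulse3` is hypothetical and chosen only for this illustration; the full data set used in the rest of the slides is not reproduced here.

```r
# First three observations of the running example, entered manually.
# (pulse3 is a hypothetical name; only the rows shown on the slide are used.)
pulse3 <- data.frame(
  PULSE1   = c(64, 58, 62),
  PULSE2   = c(88, 70, 76),
  RAN      = c(1, 1, 1),
  SMOKES   = c(2, 2, 1),
  SEX      = c(1, 1, 1),
  HEIGHT   = c(66.0, 72.0, 73.5),
  WEIGHT   = c(140, 145, 160),
  ACTIVITY = c(2, 2, 3)
)

str(pulse3)                                  # one row per student, one column per variable
pulse_change <- pulse3$PULSE2 - pulse3$PULSE1  # change between the two measurements
pulse_change
```

All the coded variables (RAN, SMOKES, SEX, ACTIVITY) are still plain numbers at this stage; they only become categorical once they are converted to factors.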

(7)

Quantitative and qualitative variables

Quantitative variables represent measurable quantities

weight, height, number of times that a phenomenon occurs, ...

Qualitative (or categorical) variables take values that are names or labels

- ordinal variables (the values admit a natural order)

response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, ...

- nominal variables

blood groups, colours, gender, smoking habit, ...
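One possible way to represent the three kinds of variable in R is sketched below: a numeric vector for a quantitative variable, an unordered factor for a nominal one, and an ordered factor for an ordinal one. The vector names and the few values used are illustrative assumptions, not the course data set.

```r
# Quantitative variable: stored as numeric, arithmetic is meaningful
weight <- c(140, 145, 160)
mean(weight)

# Nominal variable: unordered factor, labels only
smokes <- factor(c(2, 2, 1), levels = c(1, 2), labels = c("yes", "no"))
is.ordered(smokes)

# Ordinal variable: ordered factor, levels can be compared
activity <- factor(c(2, 2, 3), levels = 0:3,
                   labels = c("nothing", "low", "moderate", "high"),
                   ordered = TRUE)
is.ordered(activity)
activity[1] < activity[3]   # "moderate" precedes "high"
table(activity)             # levels with no observations still appear, with count 0
```

Comparisons such as `<` are defined only for the ordered factor; applying them to `smokes` returns `NA` with a warning, which reflects the absence of a natural order.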

(8)

Distribution of a categorical variable X and contingency tables

Notation

- E = {1, ..., k, ..., K}: coding of the levels (or other choices)

- n: number of units

- n_k: observed count of level k; $\sum_{k=1}^{K} n_k = n$

- f_k = n_k/n: (relative) frequency of level k; $\sum_{k=1}^{K} f_k = 1$

Example – Activity

Codings 0: no (none) – 1: poca (low) – 2: media (moderate) – 3: molta (high)

         k   n_k     f_k   f_k (%)
no       0     1  0.0109      1.09
poca     1     9  0.0978      9.78
media    2    61  0.6630     66.30
molta    3    21  0.2283     22.83
Total         92  1         100
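The notation can be checked numerically. The sketch below starts from the four observed counts in the Activity table and recomputes n and the relative frequencies; the vector names `n_k` and `f_k` simply mirror the notation and are not taken from the course scripts.

```r
# Observed counts n_k for the four ACTIVITY levels (values from the table)
n_k <- c(no = 1, poca = 9, media = 61, molta = 21)

n   <- sum(n_k)   # number of units: the counts add up to n
f_k <- n_k / n    # relative frequencies

# The relative frequencies add up to 1
stopifnot(n == 92, isTRUE(all.equal(sum(f_k), 1)))

# Counts, frequencies and percentages side by side
round(cbind(n_k, f_k, percent = 100 * f_k), 4)
```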

pulse$ACTIVITY=ordered(pulse$ACTIVITY, levels=c(0,1,2,3),
                       labels=c("no","poca","media","molta"))
tab=with(pulse, table(ACTIVITY))   ## counts, taken from the recoded column of pulse
freq=prop.table(tab)               ## relative frequencies
cbind(tab,round(freq,4),round(freq*100,2))

(9)

Barplot

[Figure: two barplots of ACTIVITY (no, poca, media, molta): counts on the left, relative frequencies on the right]

par(mfrow=c(1,2))

barplot(table(pulse$ACTIVITY),cex.names=2,cex.axis=2) ## counts

barplot(prop.table(table(pulse$ACTIVITY)),cex.names=2,cex.axis=2) ## frequencies

(10)

Joint distribution of two categorical variables X and Y

Example – Sex and Activity

Counts and relative frequencies

Counts

                    ACTIVITY (Y)
              no   poca   media   molta   Total
SEX (X)  M     1      5      35      16      57
         F     0      4      26       5      35
     Total     1      9      61      21      92

Relative frequencies (%)

                    ACTIVITY (Y)
              no   poca   media   molta   Total
SEX (X)  M  1.09   5.43   38.04   17.39   61.96
         F  0.00   4.35   28.26    5.43   38.04
     Total  1.09   9.78   66.30   22.83     100

pulse$SEX=factor(pulse$SEX, levels=c(1,2), labels=c("M","F"))
SA=with(pulse, table(SEX,ACTIVITY)); SA   ## joint counts
round(prop.table(SA)*100,2)               ## joint relative frequencies (%)
margin.table(SA,1); margin.table(SA,2)    ## marginal counts of SEX and of ACTIVITY
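For readers without the data file, the joint table can be rebuilt directly from the counts shown above. This is only a sketch: the table is entered by hand, and `addmargins()` is used here as one convenient way to append the totals, not necessarily the approach taken in the course.

```r
# Joint counts of SEX (rows) and ACTIVITY (columns), typed in from the slide
SA <- as.table(matrix(c(1, 5, 35, 16,
                        0, 4, 26,  5),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(SEX = c("M", "F"),
                                      ACTIVITY = c("no", "poca", "media", "molta"))))

addmargins(SA)                               # counts with row and column totals
round(addmargins(prop.table(SA)) * 100, 2)   # joint relative frequencies (%) with totals

# The margins of the joint table are the distributions of each variable alone
margin.table(SA, 1)   # SEX
margin.table(SA, 2)   # ACTIVITY
```

The column margin reproduces the ACTIVITY counts 1, 9, 61, 21 from the one-variable table, which is a quick consistency check on the joint counts.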

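[Figure: four barplots of the joint counts: stacked and side-by-side bars with one bar group per SEX (M, F), then stacked and side-by-side bars with one bar group per ACTIVITY level (no, poca, media, molta)]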

AS=with(pulse, table(ACTIVITY,SEX))
par(mfcol=c(1,4))

barplot(AS,beside=FALSE,cex.names=1.5); barplot(AS,beside=TRUE,cex.names=1.5) ## note beside=TRUE/FALSE

barplot(SA,beside=FALSE,cex.names=1.5); barplot(SA,beside=TRUE,cex.names=1.5) ## note the first variable
par(mfcol=c(1,1))

(11)

Row profiles:

distribution of Y in the sub-groups identified by X (Y | X)

row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sums to 100

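       ACTIVITY
SEX      no    poca   media   molta
  M    1.75    8.77   61.40   28.07
  F    0.00   11.43   74.29   14.29

[Figure: barplots of the row profiles, one panel per sex, titled "Maschi" (males) and "Femmine" (females); x-axis: ACTIVITY level (no, poca, media, molta), y-axis: relative frequency from 0 to 1]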

par(mfrow=c(1,2))

barplot(row_pr[1,],ylim=c(0,1),main="Maschi",cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)

barplot(row_pr[2,],ylim=c(0,1),main="Femmine",cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
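What `prop.table(SA,1)` computes can be spelled out by hand: each count is divided by the total of its own row. The sketch below rebuilds the table from the counts on the slides and checks the two computations against each other; `sweep()` is just one possible way to write the division.

```r
# Joint counts of SEX (rows) and ACTIVITY (columns), typed in from the slides
SA <- as.table(matrix(c(1, 5, 35, 16,
                        0, 4, 26,  5),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(SEX = c("M", "F"),
                                      ACTIVITY = c("no", "poca", "media", "molta"))))

row_pr <- prop.table(SA, 1)                 # row profiles (Y | X)
manual <- sweep(SA, 1, rowSums(SA), "/")    # each row divided by its own total

# Both computations agree, and every profile is a distribution (sums to 1)
stopifnot(isTRUE(all.equal(as.vector(row_pr), as.vector(manual))))
stopifnot(isTRUE(all.equal(as.vector(rowSums(row_pr)), c(1, 1))))

round(row_pr * 100, 2)
```

Because each row is rescaled by its own total, the profiles of males and females can be compared even though the two groups have different sizes (57 and 35).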

(12)

Column profiles:

distribution of X in the sub-groups identified by Y (X | Y )

col_pr=prop.table(SA,2);round(col_pr*100,2) #each column sum is 100

       ACTIVITY
SEX       no    poca   media   molta
  M   100.00   55.56   57.38   76.19
  F     0.00   44.44   42.62   23.81

[Figure: barplots of the column profiles, one panel per ACTIVITY level (no, low, medium, high); x-axis: SEX (M, F), y-axis: relative frequency from 0 to 1]

par(mfrow=c(1,4))

for (j in 1:dim(SA)[2]) {
  barplot(col_pr[,j],ylim=c(0,1),main=colnames(col_pr)[j],cex.names=2,cex.axis=2,cex.main=2)
  abline(h=0)
}

par(mfrow=c(1,1))

Similar profiles across the sub-groups ≡ “independence” of X and Y
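The link between profiles and independence can be illustrated numerically. If SEX and ACTIVITY were exactly independent, the count in each cell would be (row total × column total) / n, and every row profile would coincide with the marginal distribution of ACTIVITY. The sketch below, which rebuilds the table from the counts on the slides, checks this and sets the observed profiles next to the marginal one; it is a descriptive comparison only, not a formal test.

```r
# Joint counts of SEX (rows) and ACTIVITY (columns), typed in from the slides
SA <- as.table(matrix(c(1, 5, 35, 16,
                        0, 4, 26,  5),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(SEX = c("M", "F"),
                                      ACTIVITY = c("no", "poca", "media", "molta"))))

n      <- sum(SA)
marg_Y <- colSums(SA) / n        # marginal distribution of ACTIVITY
row_pr <- prop.table(SA, 1)      # observed row profiles (Y | X)

# Counts expected under exact independence
expected   <- outer(rowSums(SA), colSums(SA)) / n
exp_row_pr <- prop.table(expected, 1)

# Under independence both row profiles equal the marginal distribution
stopifnot(isTRUE(all.equal(unname(exp_row_pr[1, ]), unname(marg_Y))),
          isTRUE(all.equal(unname(exp_row_pr[2, ]), unname(marg_Y))))

# Observed profiles compared with the marginal profile (%)
round(rbind(unclass(row_pr), marginal = marg_Y) * 100, 2)
```

The closer the observed rows are to the `marginal` row, the closer the two variables are to independence in this sample.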

(13)

Software R

Site: http://www.r-project.org

How to download and install R

First install R

E.g. from http://cran.stat.unipd.it/

Download and install R on your platform

Then install RStudio

From