Introductory course in statistics


Academic year: 2021


Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova


Exploratory data analysis: analysis of the available data

Inferential statistics: analysis of the available sample which, through probability, gives information about the population or process

No inference possible on census data


Structure of the course

• Exploratory data analysis ∼ 33%

• Inferential Statistics

– estimate and estimators ∼ 10%

– statistical hypothesis tests ∼ 57%


Exploratory Data Analysis

From

https://en.wikipedia.org/wiki/Exploratory_data_analysis

John W. Tukey wrote the book Exploratory Data Analysis in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.


The objectives of EDA are to:

• Suggest hypotheses about the causes of observed phenomena

• Assess assumptions on which statistical inference will be based

• Support the selection of appropriate statistical tools and techniques

• Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are also being taught to young students as a way to introduce them to statistical thinking.


Running example

Students in a statistics class measured their pulse rates. Each student then tossed a coin: those who got heads ran for one minute, those who got tails sat. Then they all measured their pulse rates again.

Variables

PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: head, run – 2: tail, sit)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inch)
WEIGHT    Weight (lb)
ACTIVITY  Frequency (0: nothing – 1: low – 2: moderate – 3: high)

OBS  PULSE1  PULSE2  RAN  SMOKES  SEX  HEIGHT  WEIGHT  ACTIVITY
  1      64      88    1       2    1   66.00     140         2
  2      58      70    1       2    1   72.00     145         2
  3      62      76    1       1    1   73.50     160         3
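The three observations listed above can be typed in by hand to try out the commands of the following slides. This is only a sketch: the full data set has 92 students, and the name `pulse3` is an assumption introduced here so that it is not confused with the full data frame `pulse` used later.

```r
# First three observations, copied from the listing above
pulse3 <- data.frame(
  PULSE1   = c(64, 58, 62),
  PULSE2   = c(88, 70, 76),
  RAN      = c(1, 1, 1),
  SMOKES   = c(2, 2, 1),
  SEX      = c(1, 1, 1),
  HEIGHT   = c(66.0, 72.0, 73.5),
  WEIGHT   = c(140, 145, 160),
  ACTIVITY = c(2, 2, 3)
)

str(pulse3)                          # every column is numeric at this stage

# Change in pulse rate between the two measurements (all three students ran)
pulse3$PULSE2 - pulse3$PULSE1
```

At this point the coded variables (RAN, SMOKES, SEX, ACTIVITY) are still plain numbers; they become factors only after an explicit recoding.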


Quantitative and qualitative variables

Quantitative variables represent measurable quantities

weight, height, number of times that a phenomenon occurs, . . .

Qualitative (or categorical) variables take values that are names or labels

- ordinal variables (the values admit a natural order)

response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, . . .

- nominal variables

blood groups, colours, gender, smoking habit, . . .
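In R this distinction can be carried by the type of the object: numeric vectors for quantitative variables, `factor` for nominal variables and `ordered` for ordinal ones. A minimal sketch with made-up values (the three vectors below are illustrative only and are not course data):

```r
weight  <- c(140, 145, 160)                        # quantitative
blood   <- factor(c("A", "0", "AB"))               # nominal: levels have no order
improve <- ordered(c("healing", "no change", "worsening"),
                   levels = c("worsening", "no change",
                              "slight improvement", "healing"))  # ordinal

mean(weight)               # arithmetic is meaningful for quantitative data
is.ordered(blood)          # FALSE: a plain factor carries no order
improve[1] > improve[2]    # TRUE: comparisons follow the order of the levels
```

Declaring the levels explicitly matters for ordinal variables: without `levels=`, R would sort the labels alphabetically.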


Distribution of a categorical variable X and contingency tables

Notation

- $E = \{1, \dots, k, \dots, K\}$: coding of the levels (or other choices)

- $n$: number of units

- $n_k$: observed count of level $k$; $\sum_k n_k = n$

- $f_k = n_k/n$: (relative) frequency of level $k$; $\sum_k f_k = 1$

Example – Activity

Codings 0: no (none) – 1: poca (low) – 2: media (moderate) – 3: molta (high)

Level   k   n_k   f_k      f_k (%)
no      0     1   0.0109     1.09
poca    1     9   0.0978     9.78
media   2    61   0.6630    66.30
molta   3    21   0.2283    22.83
Total        92   1        100

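# Recode ACTIVITY as an ordered factor with the labels used in the table
pulse$ACTIVITY=ordered(pulse$ACTIVITY, levels=c(0,1,2,3), labels=c("no","poca","media","molta"))
# ACTIVITY is referenced directly: this assumes attach(pulse) was run after the recoding
tab=table(ACTIVITY)      # counts
freq=prop.table(tab)     # relative frequencies
cbind(tab,round(freq,4),round(freq*100,2))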
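The frequency table can also be rebuilt from the published counts alone, which is handy when the data file is not at hand. A small sketch that checks the two identities of the notation (the counts sum to n, the relative frequencies sum to 1); the names `n_k` and `f_k` simply mirror the notation and are not taken from the course code:

```r
# Observed counts for ACTIVITY, copied from the table above
n_k <- c(no = 1, poca = 9, media = 61, molta = 21)

n   <- sum(n_k)        # number of units
f_k <- n_k / n         # relative frequencies

n                      # 92
round(f_k, 4)          # 0.0109 0.0978 0.6630 0.2283
sum(f_k)               # 1
```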

Barplot

[Figure: two bar plots of ACTIVITY with bars for no, poca, media, molta. Left panel: counts (axis from 0 to 60). Right panel: relative frequencies (axis from 0 to 0.6).]

par(mfrow=c(1,2))

barplot(table(ACTIVITY),cex.names=2,cex.axis=2) ## counts

barplot(prop.table(table(ACTIVITY)),cex.names=2,cex.axis=2) ## frequencies


Joint distribution of two categorical variables X and Y

Example – Sex and Activity

Counts

                  ACTIVITY (Y)
             no   poca   media   molta   Total
SEX (X)  M    1      5      35      16      57
         F    0      4      26       5      35
     Total    1      9      61      21      92

Relative frequencies (%)

                  ACTIVITY (Y)
              no    poca   media   molta    Total
SEX (X)  M   1.09   5.43   38.04   17.39    61.96
         F   0.00   4.35   28.26    5.43    38.04
     Total   1.09   9.78   66.30   22.83   100

pulse$SEX=factor(pulse$SEX, levels=c(1,2), labels=c("M","F") )
SA=table(SEX,ACTIVITY);SA              # joint counts
round(prop.table(SA)*100,2)            # joint relative frequencies (%)
margin.table(SA,1);margin.table(SA,2)  # marginal counts of SEX and of ACTIVITY
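Without the raw data, the same contingency table can be entered directly from the counts above. The sketch below does so and adds the margins in one step with `addmargins`; the object name `SA_counts` is an assumption introduced here to keep it distinct from the table `SA` computed from the data frame.

```r
# Joint counts copied from the table above (rows: SEX, columns: ACTIVITY)
SA_counts <- as.table(matrix(c(1, 5, 35, 16,
                               0, 4, 26,  5),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(SEX = c("M", "F"),
                                             ACTIVITY = c("no", "poca", "media", "molta"))))

addmargins(SA_counts)                               # counts with row and column totals
round(addmargins(prop.table(SA_counts)) * 100, 2)   # joint and marginal percentages
```

Computing the margins from unrounded proportions avoids the small discrepancies that appear when rounded cell percentages are added by hand.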

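[Figure: four bar plots of the joint counts. Panels 1 and 2: bars for M and F, split by ACTIVITY, stacked and side by side. Panels 3 and 4: bars for no, poca, media, molta, split by SEX, stacked and side by side.]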

AS=table(ACTIVITY,SEX)
par(mfcol=c(1,4))
barplot(AS,beside=F,cex.names=1.5);barplot(AS,beside=T,cex.names=1.5) ## note beside=T/F: stacked vs side-by-side bars
barplot(SA,beside=F,cex.names=1.5);barplot(SA,beside=T,cex.names=1.5) ## note the first variable: it defines the segments within each bar
par(mfcol=c(1,1))


Row profiles:

distribution of Y in the sub-groups identified by X (Y | X)

row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sums to 100

        ACTIVITY
SEX     no    poca   media   molta
M      1.75   8.77   61.40   28.07
F      0.00  11.43   74.29   14.29

[Figure: bar plots of the two row profiles, one panel for males ("Maschi") and one for females ("Femmine"); bars for no, poca, media, molta; vertical axis from 0 to 1.]

par(mfrow=c(1,2))

barplot(row_pr[1,],ylim=c(0,1),main="Maschi",cex.names=2,cex.axis=2,cex.main=2)   # males
abline(h=0)
barplot(row_pr[2,],ylim=c(0,1),main="Femmine",cex.names=2,cex.axis=2,cex.main=2)  # females
abline(h=0)
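A row profile is obtained by dividing each count by the total of its row, n_ij / n_i. , which is what `prop.table(..., 1)` computes. The sketch below checks this by hand on the published counts; `SA_counts` and `row_by_hand` are names introduced here for illustration.

```r
# Joint counts of SEX by ACTIVITY, retyped from the contingency table
SA_counts <- as.table(matrix(c(1, 5, 35, 16,
                               0, 4, 26,  5),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(SEX = c("M", "F"),
                                             ACTIVITY = c("no", "poca", "media", "molta"))))

# Divide every cell by its row total (the row totals are recycled down the columns)
row_by_hand <- SA_counts / rowSums(SA_counts)

round(row_by_hand * 100, 2)                           # row profiles in percent
max(abs(row_by_hand - prop.table(SA_counts, 1)))      # should be numerically zero
rowSums(row_by_hand)                                  # each profile sums to 1
```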


Column profiles:

distribution of X in the sub-groups identified by Y (X | Y )

col_pr=prop.table(SA,2);round(col_pr*100,2) #each column sum is 100

        ACTIVITY
SEX       no    poca   media   molta
M     100.00   55.56   57.38   76.19
F       0.00   44.44   42.62   23.81

[Figure: bar plots of the four column profiles, one panel per activity level (no, low, medium, high); bars for M and F; vertical axis from 0 to 1.]

par(mfrow=c(1,4))

for (j in 1:dim(SA)[2]) {

barplot(col_pr[,j],ylim=c(0,1),main=colnames(col_pr)[j],cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)

}

par(mfrow=c(1,1))

similar profiles ≡ “independence”
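One way to read "similar profiles" numerically is to set each row profile next to the marginal distribution of ACTIVITY: under exact independence all of them would coincide. The sketch below does this on the published counts. It is an informal comparison, not a statistical test, and `SA_counts` and `marg_Y` are names introduced here.

```r
# Joint counts of SEX by ACTIVITY, retyped from the contingency table
SA_counts <- as.table(matrix(c(1, 5, 35, 16,
                               0, 4, 26,  5),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(SEX = c("M", "F"),
                                             ACTIVITY = c("no", "poca", "media", "molta"))))

row_pr <- prop.table(SA_counts, 1)              # distribution of Y within each level of X
marg_Y <- colSums(SA_counts) / sum(SA_counts)   # marginal distribution of Y

round(rbind(row_pr, ALL = marg_Y) * 100, 2)     # profiles next to the margin (%)
round(sweep(row_pr, 2, marg_Y) * 100, 2)        # profile minus margin, in percentage points
```

Small differences in the last table indicate profiles close to the margin; how small is "small enough" is a question for the inferential part of the course.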


Software R

Site: http://www.r-project.org

How to download and install R

First install R, e.g. from http://cran.stat.unipd.it/

Download and install R for your platform

Then install RStudio

From
