Introductory course in statistics
Eva Riccomagno, Maria Piera Rogantin
DIMA – Università di Genova
Exploratory data analysis: analysis of the available data
Inferential statistics: analysis of the available sample → probability → information about the population or process
No inference is possible on census data
Structure of the course
• Exploratory data analysis ∼ 33%
• Inferential Statistics
– estimate and estimators ∼ 10%
– statistical hypothesis tests ∼ 57%
Exploratory Data Analysis
From https://en.wikipedia.org/wiki/Exploratory_data_analysis
John W. Tukey wrote the book Exploratory Data Analysis in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
The objectives of EDA are to:
• Suggest hypotheses about the causes of observed phenomena
• Assess assumptions on which statistical inference will be based
• Support the selection of appropriate statistical tools and techniques
• Provide a basis for further data collection through surveys or experiments
Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are also being taught to young students as a way to introduce them to statistical thinking.
Running example
Students in a statistics class measured their pulse rates, then each tossed a coin: on heads they ran for one minute, on tails they sat. They then measured their pulse rates again.
Variables
PULSE1    First pulse measurement (rate per minute)
PULSE2    Second pulse measurement (rate per minute)
RAN       Whether the student ran or sat (1: heads, ran – 2: tails, sat)
SMOKES    Regular smoker? (1: yes – 2: no)
SEX       (1: M – 2: F)
HEIGHT    Height (inch)
WEIGHT    Weight (lb)
ACTIVITY  Frequency of physical activity (0: nothing – 1: low – 2: moderate – 3: high)

OBS  PULSE1  PULSE2  RAN  SMOKES  SEX  HEIGHT  WEIGHT  ACTIVITY
1        64      88    1       2    1   66.00     140         2
2        58      70    1       2    1   72.00     145         2
3        62      76    1       1    1   73.50     160         3
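As a small illustration of how such a table is held in R, the three observations shown above can be typed into a data frame. This is only a sketch: the name `pulse_head` and the derived column `DIFF` are hypothetical and are not part of the course data set.

```r
# The three observations listed above, typed in by hand.
# pulse_head is a hypothetical name used only for this illustration.
pulse_head <- data.frame(
  PULSE1   = c(64, 58, 62),
  PULSE2   = c(88, 70, 76),
  RAN      = c(1, 1, 1),
  SMOKES   = c(2, 2, 1),
  SEX      = c(1, 1, 1),
  HEIGHT   = c(66.0, 72.0, 73.5),
  WEIGHT   = c(140, 145, 160),
  ACTIVITY = c(2, 2, 3)
)

# DIFF (hypothetical column): change in pulse rate between the two measurements
pulse_head$DIFF <- pulse_head$PULSE2 - pulse_head$PULSE1

str(pulse_head)   # one row per student, one column per variable
```

All coded variables (RAN, SMOKES, SEX, ACTIVITY) are stored here as plain numbers, which is why the slides later recode them as factors.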
Quantitative and qualitative variables
Quantitative variables represent measurable quantities
weight, height, number of times that a phenomenon occurs, . . .
Qualitative (or categorical) variables take values that are names or labels
- ordinal variables (the values admit a natural order)
response to a drug (worsening, no change, slight improvement, healing), level of education, level of physical activity, . . .
- nominal variables
blood groups, colours, gender, smoking habit, . . .
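The three kinds of variable map onto different R types. The sketch below uses made-up values (not taken from the pulse data) and hypothetical object names, purely to show the distinction between a numeric vector, an unordered factor and an ordered factor.

```r
# Hypothetical values, for illustration only
weight <- c(140, 145, 160)                          # quantitative

blood <- factor(c("A", "B", "AB", "O"))             # nominal: labels without order

response <- factor(c("no change", "healing", "worsening"),
                   levels = c("worsening", "no change",
                              "slight improvement", "healing"),
                   ordered = TRUE)                  # ordinal: labels with an order

is.numeric(weight)
is.ordered(blood)      # unordered factor
is.ordered(response)   # ordered factor

# Order comparisons are defined only for ordered factors
response < "healing"
```

The order of an ordinal variable comes from the `levels` argument, not from the alphabetical order of the labels.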
Distribution of a categorical variable X and contingency tables
Notation
- $E = \{1, \dots, k, \dots, K\}$: coding of the levels (other codings are possible)
- $n$: number of units
- $n_k$: observed count of level $k$; $\sum_{k=1}^{K} n_k = n$
- $f_k = n_k/n$: (relative) frequency of level $k$; $\sum_{k=1}^{K} f_k = 1$
Example – Activity
Codings 0: no (nothing) – 1: poca (low) – 2: media (moderate) – 3: molta (high)
level   k   n_k   f_k      f_k (%)
no      0     1   0.0109     1.09
poca    1     9   0.0978     9.78
media   2    61   0.6630    66.30
molta   3    21   0.2283    22.83
Total        92   1        100
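pulse$ACTIVITY=ordered(pulse$ACTIVITY, levels=c(0,1,2,3), labels=c("no","poca","media","molta"))
## the bare name ACTIVITY assumes that pulse is attached (attach(pulse)) after the recoding
tab=table(ACTIVITY) ## counts
freq=prop.table(tab) ## relative frequencies
cbind(tab,round(freq,4),round(freq*100,2))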
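Without the data file, the frequency table can still be rebuilt from the four counts reported above. The following is a minimal sketch of the definitions of $n$, $n_k$ and $f_k$; the object names `nk`, `n` and `fk` are chosen here to mirror the notation and do not appear in the original code.

```r
# Counts of the ACTIVITY levels, copied from the table above
nk <- c(no = 1, poca = 9, media = 61, molta = 21)

n  <- sum(nk)   # number of units
fk <- nk / n    # relative frequencies, which sum to 1

# Counts, relative frequencies and percentages side by side
cbind(nk, fk = round(fk, 4), pct = round(100 * fk, 2))
```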
Barplot
[Figure: two barplots of ACTIVITY (no, poca, media, molta). Left panel: counts, axis from 0 to 60. Right panel: relative frequencies, axis from 0.0 to 0.6.]
par(mfrow=c(1,2))
barplot(table(ACTIVITY),cex.names=2,cex.axis=2) ## counts
barplot(prop.table(table(ACTIVITY)),cex.names=2,cex.axis=2) ## frequencies
Joint distribution of two categorical variables X and Y
Example – Sex and Activity
Counts and relative frequencies
Counts
             ACTIVITY (Y)
SEX (X)      no   poca   media   molta   Total
M             1      5      35      16      57
F             0      4      26       5      35
Total         1      9      61      21      92

Relative frequencies (%)
             ACTIVITY (Y)
SEX (X)      no   poca   media   molta   Total
M          1.09   5.43   38.04   17.39   61.96
F          0.00   4.35   28.26    5.43   38.04
Total      1.09   9.78   66.30   22.83  100

pulse$SEX=factor(pulse$SEX, levels=c(1,2), labels=c("M","F"))
SA=table(SEX,ACTIVITY);SA
round(prop.table(SA)*100,2)
margin.table(SA,1);margin.table(SA,2)
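The joint counts, the joint percentages and both sets of margins can also be reproduced from the eight cell counts alone. This is a sketch under the assumption that the counts above are correct; `SA_counts` is a hypothetical name, and `addmargins()` is used here as a convenience in place of the separate `margin.table()` calls.

```r
# Cell counts copied from the SEX x ACTIVITY table above
SA_counts <- as.table(matrix(c(1, 5, 35, 16,
                               0, 4, 26,  5),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(SEX = c("M", "F"),
                                             ACTIVITY = c("no", "poca", "media", "molta"))))

# Counts with row and column totals
addmargins(SA_counts)

# Joint relative frequencies (%) with their margins
round(100 * addmargins(prop.table(SA_counts)), 2)
```

The last row and last column of the second table are the marginal distributions of ACTIVITY and SEX.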
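[Figure: four barplots of the joint counts. Panels 1 and 2: one bar per SEX (M, F), stacked and side by side by ACTIVITY. Panels 3 and 4: one bar per ACTIVITY level (no, poca, media, molta), stacked and side by side by SEX.]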
AS=table(ACTIVITY,SEX)
par(mfcol=c(1,4))
barplot(AS,beside=FALSE,cex.names=1.5);barplot(AS,beside=TRUE,cex.names=1.5) ## note beside=TRUE/FALSE
barplot(SA,beside=FALSE,cex.names=1.5);barplot(SA,beside=TRUE,cex.names=1.5) ## note which variable comes first in table()
par(mfcol=c(1,1))
Row profiles:
distribution of Y in the sub-groups identified by X (Y | X)
row_pr=prop.table(SA,1);round(row_pr*100,2) # each row sums to 100
       ACTIVITY
SEX      no   poca   media   molta
M      1.75   8.77   61.40   28.07
F      0.00  11.43   74.29   14.29
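The row profiles above follow from dividing each count by its row total. In the sketch below, the symbols $n_{jk}$ (count of units with level $j$ of X and level $k$ of Y), $n_{j\cdot}$ (row total) and $f_{k \mid j}$ are notation assumed here for illustration; they extend the $n_k$, $f_k$ notation of the one-variable case and are not used in the original slides.

```latex
\[
f_{k \mid j} = \frac{n_{jk}}{n_{j\cdot}},
\qquad
n_{j\cdot} = \sum_{k} n_{jk},
\qquad
\sum_{k} f_{k \mid j} = 1 .
\]
% Worked example with the counts of the SEX x ACTIVITY table:
\[
f_{\text{media} \mid M} = \frac{35}{57} \approx 0.6140,
\qquad
f_{\text{media} \mid F} = \frac{26}{35} \approx 0.7429 .
\]
```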
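[Figure: barplots of the row profiles over ACTIVITY (no, poca, media, molta), one panel titled "Maschi" (males) and one titled "Femmine" (females), vertical axis from 0 to 1.]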
par(mfrow=c(1,2))
barplot(row_pr[1,],ylim=c(0,1),main="Maschi",cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
barplot(row_pr[2,],ylim=c(0,1),main="Femmine",cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
Column profiles:
distribution of X in the sub-groups identified by Y (X | Y )
col_pr=prop.table(SA,2);round(col_pr*100,2) # each column sums to 100
       ACTIVITY
SEX        no   poca   media   molta
M      100.00  55.56   57.38   76.19
F        0.00  44.44   42.62   23.81
[Figure: barplots of the column profiles over SEX (M, F), one panel per ACTIVITY level, titled "no", "low", "medium", "high", vertical axis from 0 to 1.]
par(mfrow=c(1,4))
for (j in 1:dim(SA)[2]) {
barplot(col_pr[,j],ylim=c(0,1),main=colnames(col_pr)[j],cex.names=2,cex.axis=2,cex.main=2)
abline(h=0)
}
par(mfrow=c(1,1))
similar profiles ≡ “independence”
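One way to read this statement: if the row profiles were exactly equal, each of them would coincide with the marginal distribution of Y, and every count would equal (row total × column total) / n. The sketch below compares the observed table with that ideal case; it is an illustration only, and the names `obs`, `marg_Y` and `expected` are hypothetical.

```r
# Counts copied from the SEX x ACTIVITY table of this section
obs <- matrix(c(1, 5, 35, 16,
                0, 4, 26,  5),
              nrow = 2, byrow = TRUE,
              dimnames = list(SEX = c("M", "F"),
                              ACTIVITY = c("no", "poca", "media", "molta")))
n <- sum(obs)

# Marginal distribution of ACTIVITY and the row profiles (ACTIVITY | SEX)
marg_Y <- colSums(obs) / n
row_pr <- prop.table(obs, 1)

# With identical profiles, every row would equal the marginal distribution
round(rbind(row_pr, all = marg_Y) * 100, 2)

# Counts that identical profiles would produce: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / n
round(expected, 2)
round(obs - expected, 2)   # distance of the observed table from that ideal case
```

The rows and columns of `expected` have the same totals as the observed table; only the way the counts are spread inside the table differs.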
Software R
Site: http://www.r-project.org
How to download and install R
First install R
E.g. from http://cran.stat.unipd.it/
Download and install R on your platform
Then install RStudio
From