Inferential Statistics Hypothesis tests Normal Probability Plot


Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova



Part I. Test on the equality of the means in sub-groups

1. Descriptive approach: within group and between group variances

2. Inferential approach: test on the equality of the group means (one-way ANOVA)

Part J. Normal Probability Plot


I. Test on the equality of the means in sub-groups (one-way ANOVA)

I.1 Descriptive approach: decomposition of the variance

Recall

K groups of sizes $n_1, \dots, n_K$, with total sample size $n = \sum_{k=1}^{K} n_k$

group means $\bar x_1, \dots, \bar x_K$ and variances $\sigma_1^2, \dots, \sigma_K^2$

within group variance: weighted sum of the variances of the groups

$$\sum_{k=1}^{K} \frac{n_k}{n}\, \sigma_k^2$$

between group variance: weighted variance of the group means

$$\sum_{k=1}^{K} \frac{n_k}{n}\, (\bar x_k - \bar x_{tot})^2$$

total variance = within group var. + between group var.

$$\sigma_{tot}^2 = \sum_{k=1}^{K} \frac{n_k}{n}\, \sigma_k^2 + \sum_{k=1}^{K} \frac{n_k}{n}\, (\bar x_k - \bar x_{tot})^2$$
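The decomposition can be checked numerically. The following is a minimal sketch on simulated data; the names `x`, `g` and `pvar`, as well as the group means and standard deviation used in the simulation, are assumptions made only for this illustration. Note that the variances here use the divisor n (not n − 1), as in the formulas above.

```r
# Simulated data for illustration only (means and sd are assumptions)
set.seed(1)
x <- c(rnorm(50, 10, 2), rnorm(60, 25, 2), rnorm(30, 40, 2))
g <- factor(rep(c("A", "B", "C"), times = c(50, 60, 30)))

# Variance with divisor n, as used in the decomposition
pvar <- function(v) mean((v - mean(v))^2)

n   <- length(x)
n_k <- tapply(x, g, length)   # group sizes
m_k <- tapply(x, g, mean)     # group means
v_k <- tapply(x, g, pvar)     # group variances

within  <- sum(n_k / n * v_k)
between <- sum(n_k / n * (m_k - mean(x))^2)
total   <- pvar(x)

# The last two entries should coincide
c(within = within, between = between, sum = within + between, total = total)
```

R's built-in `var()` uses the divisor n − 1, so it would not satisfy the identity exactly; this is why a separate helper is used.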


Three examples. Sizes: $n_A = 50$, $n_B = 60$, $n_C = 30$, $n = 140$

Example 1 and Example 2

Group variances: $\sigma_A^2 = 4.89$, $\sigma_B^2 = 4.94$, $\sigma_C^2 = 4.17$

Then, in both cases, the within group variance is

$$\frac{1}{140}(50 \times 4.89 + 60 \times 4.94 + 30 \times 4.17) = \frac{666}{140} \simeq 4.76$$

Example 3

Group variances: $\sigma_A^2 = 1190$, $\sigma_B^2 = 1195$, $\sigma_C^2 = 1785$

Then the within group variance is

$$\frac{1}{140}(50 \times 1190 + 60 \times 1195 + 30 \times 1785) = \frac{184750}{140} \simeq 1320$$

Means on the groups

Example 1: $\bar x_A = 10.1$, $\bar x_B = 24.8$, $\bar x_C = 39.8$

Example 2: $\bar x_A = 21.1$, $\bar x_B = 21.8$, $\bar x_C = 22.3$

Example 3: $\bar x_A = 11.8$, $\bar x_B = 18.9$, $\bar x_C = 31.7$

The group means of Example 1 and Example 3 are similar but the within group variances are strongly different
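The two within group variances can be reproduced from the group sizes and group variances alone. A small sketch follows; the helper name `within_var` is hypothetical and introduced only here.

```r
# Group sizes common to the three examples
n_k <- c(A = 50, B = 60, C = 30)

# Within group variance: weighted sum of group variances, weights n_k / n
within_var <- function(n_k, var_k) sum(n_k * var_k) / sum(n_k)

w12 <- within_var(n_k, c(4.89, 4.94, 4.17))   # Examples 1 and 2
w3  <- within_var(n_k, c(1190, 1195, 1785))   # Example 3

round(w12, 2)   # 4.76
round(w3)       # 1320
```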


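Example 1: $\bar x_A = 10.1$, $\bar x_B = 24.8$, $\bar x_C = 39.8$

[Figure: overlaid histograms of groups A, B and C (x-axis 0 to 50, y-axis 0 to 10), with vertical lines at $\bar x_A$, $\bar x_B$, $\bar x_C$ and $\bar x_{tot}$. Annotations: $\sigma_A^2 = 4.97$, $\sigma_B^2 = 5.05$, $\sigma_C^2 = 4.32$; $n_A = 50$, $n_B = 60$, $n_C = 30$]

$$\bar x_{tot} = \frac{1}{140}(50 \times 10.1 + 60 \times 24.8 + 30 \times 39.8) = \frac{3187}{140} \simeq 22.8$$

between group variance $\simeq 121.2$

$$\frac{1}{140}\left[50(10.1 - 22.8)^2 + 60(24.8 - 22.8)^2 + 30(39.8 - 22.8)^2\right] = \frac{16974.5}{140} \simeq 121.2$$

within group variance $\simeq 4.76$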


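Example 2: $\bar x_A = 21.1$, $\bar x_B = 21.8$, $\bar x_C = 22.3$

[Figure: overlaid histograms of groups A, B and C (x-axis 0 to 50, y-axis 0 to 10), with vertical lines at $\bar x_A$, $\bar x_B$, $\bar x_C$ and $\bar x_{tot}$. Annotations: $\sigma_A^2 = 4.97$, $\sigma_B^2 = 5.05$, $\sigma_C^2 = 4.32$; $n_A = 50$, $n_B = 60$, $n_C = 30$]

$$\bar x_{tot} = \frac{1}{140}(50 \times 21.1 + 60 \times 21.8 + 30 \times 22.3) \simeq 21.7$$

between group variance $\simeq 0.21$

$$\frac{1}{140}\left[50(21.1 - 21.7)^2 + 60(21.8 - 21.7)^2 + 30(22.3 - 21.7)^2\right] = \frac{29.4}{140} \simeq 0.21$$

within group variance $\simeq 4.76$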
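The between group variance needs only the group sizes and the group means. A minimal sketch for Example 2 follows; the helper name `between_var` is hypothetical. The code keeps the overall mean unrounded, so intermediate values differ slightly from a hand computation with the rounded mean 21.7, but the result agrees to two decimals.

```r
# Between group variance: weighted variance of the group means
between_var <- function(n_k, mean_k) {
  x_tot <- sum(n_k * mean_k) / sum(n_k)        # overall (weighted) mean
  sum(n_k * (mean_k - x_tot)^2) / sum(n_k)
}

n_k <- c(A = 50, B = 60, C = 30)
m2  <- c(A = 21.1, B = 21.8, C = 22.3)         # group means of Example 2

x_tot2 <- sum(n_k * m2) / sum(n_k)
b2     <- between_var(n_k, m2)

round(x_tot2, 1)   # 21.7
round(b2, 2)       # 0.21
```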


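Example 3: $\bar x_A = 11.8$, $\bar x_B = 18.9$, $\bar x_C = 31.7$

[Figure: overlaid histograms of groups A, B and C (x-axis −100 to 150, y-axis 0 to 5), with vertical lines at $\bar x_A$, $\bar x_B$, $\bar x_C$ and $\bar x_{tot}$. Annotations: $\sigma_A^2 = 1190$, $\sigma_B^2 = 1195$, $\sigma_C^2 = 1785$; $n_A = 50$, $n_B = 60$, $n_C = 30$]

Pay attention to the different scales with respect to the previous examples

$$\bar x_{tot} \simeq 19.1$$

between group variance $\simeq 53.1$

$$\frac{1}{140}\left[50(11.8 - 19.1)^2 + 60(18.9 - 19.1)^2 + 30(31.7 - 19.1)^2\right] = \frac{7430}{140} \simeq 53.1$$

within group variance $\simeq 1320$

A measure of the “similarity” of the group means is the ratio

$$\frac{\text{between group variance}}{\text{within group variance}}$$

Example 1: $\frac{121.2}{4.76} \simeq 25.5$

Example 2: $\frac{0.21}{4.76} \simeq 0.04$

Example 3: $\frac{53.1}{1320} \simeq 0.04$

In Example 1 the means are very different and the ratio is much larger than one.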


R code to generate the data of Example 1 and to construct the histograms

# group sizes and means used in the simulation
na=50; nb=60; nc=30
ma=10; mb=25; mc=40
# the vector named c shadows c(), but R still finds the function in calls
a=rnorm(na,ma,2.4); b=rnorm(nb,mb,2); c=rnorm(nc,mc,2.3)
x=c(a,b,c)
gruppi=c(rep("A",na),rep("B",nb),rep("C",nc))
# breaks start at 0 to match xlim; hist() fails if a value falls outside the breaks
br=seq(0,50,.5)
x_l=c(0,50); y_l=c(0,10)
# overlay the three histograms on the same axes
hist(a, breaks=br,main="",xlab="",ylab="",xlim=x_l,ylim=y_l,col="blue")
par(new=T)
hist(b, breaks=br,main="",xlab="",ylab="",xlim=x_l,ylim=y_l)
par(new=T)
hist(c, breaks=br,main="",xlab="",ylab="",xlim=x_l,ylim=y_l,col="red")
par(new=F)
# group means and overall mean
abline(v=mean(a),col="blue",lwd=2)
abline(v=mean(b),lwd=2)
abline(v=mean(c),col="red",lwd=2)
abline(v=mean(x),col="darkgreen",lwd=2)


I.2 Inferential approach: test on the equality of the means in sub-groups

For each group assume a Normal random variable $X^k$

$$X^k \sim N(\mu^k, \sigma_k^2), \qquad k = 1, \dots, K$$

Test hypotheses

$$H_0: \mu^1 = \mu^2 = \dots = \mu^K \qquad \text{and} \qquad H_1: \text{at least two are different}$$

Let $n_1, \dots, n_K$ be the sizes of the K independent samples from $X^1, \dots, X^K$ and let $n$ be the total sample size

The K sample mean random variables are

$$\bar X^1 \sim N\left(\mu^1, \frac{\sigma_1^2}{n_1}\right), \quad \dots, \quad \bar X^K \sim N\left(\mu^K, \frac{\sigma_K^2}{n_K}\right)$$

The K sample variance random variables are $S_1^2, \dots, S_K^2$


The estimators of the between and the within variances multiplied by n (also called between/within variations) are

between group variation: weighted variation of the group means

$$V_B = \sum_{k=1}^{K} n_k \left(\bar X^k - \bar X_{tot}\right)^2$$

where $\bar X_{tot} = \sum_{k=1}^{K} \frac{n_k}{n} \bar X^k$ is the weighted sum of the sample mean random variables

within group variation: weighted sum of the group variances

$$V_W = \sum_{k=1}^{K} (n_k - 1)\, S_k^2$$

Test statistic

$$F = \frac{V_B / (K - 1)}{V_W / (n - K)}$$

Under $H_0$, and if the groups share a common variance, it follows a Fisher distribution $F \sim F_{[K-1,\, n-K]}$, whose mean is $(n - K)/(n - K - 2)$

High values of the between/within ratio lead to rejecting $H_0$
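The test statistic can be computed step by step and compared with the value returned by R. The following is a sketch on simulated data; the group means, the common standard deviation and the names `x`, `g`, `V_B`, `V_W` are assumptions chosen for illustration.

```r
# Simulated data for illustration only (means and sd are assumptions)
set.seed(42)
g <- factor(rep(c("A", "B", "C"), times = c(50, 60, 30)))
x <- rnorm(140, mean = c(A = 21, B = 22, C = 22.5)[as.character(g)], sd = 2.2)

K <- nlevels(g)
n <- length(x)

n_k  <- tapply(x, g, length)   # group sizes
m_k  <- tapply(x, g, mean)     # group sample means
s2_k <- tapply(x, g, var)      # group sample variances (divisor n_k - 1)

V_B <- sum(n_k * (m_k - mean(x))^2)   # between group variation
V_W <- sum((n_k - 1) * s2_k)          # within group variation

F_stat <- (V_B / (K - 1)) / (V_W / (n - K))
p_val  <- pf(F_stat, df1 = K - 1, df2 = n - K, lower.tail = FALSE)

# Same quantities from the built-in one-way ANOVA
tab <- anova(lm(x ~ g))
c(F_by_hand = F_stat, F_anova = tab[["F value"]][1])
c(p_by_hand = p_val,  p_anova = tab[["Pr(>F)"]][1])
```

Since $V_B$ and $V_W$ are n times the between and within variances, F equals the descriptive ratio between/within multiplied by $(n - K)/(K - 1)$.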


Analysis of Variance Table

            Degrees of freedom   Sum of Squares   Mean of Squares   F value   p-value
factor      K − 1                v_B              v_B/(K − 1)       f         P(F > f)
residuals   n − K                v_W              v_W/(n − K)
total       n − 1                v_B + v_W

where lowercase letters denote the sample values of the estimators

Some software, such as R, does not display the “total” row
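The missing “total” row can be recovered from R's output by summing the two displayed rows. A minimal sketch on simulated data follows; the data and the names `x`, `g`, `total_df`, `total_ss` are assumptions for illustration.

```r
# Simulated data for illustration only
set.seed(7)
g <- factor(rep(c("A", "B", "C"), times = c(50, 60, 30)))
x <- rnorm(140, mean = 20, sd = 2)

tab <- anova(lm(x ~ g))   # rows: factor (g) and Residuals

# Total row: degrees of freedom and sums of squares add up
total_df <- sum(tab[["Df"]])        # (K - 1) + (n - K) = n - 1
total_ss <- sum(tab[["Sum Sq"]])    # v_B + v_W

c(Df = total_df, SumSq = total_ss)

# The total sum of squares is also the total variation of x
sum((x - mean(x))^2)
```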


Example 1

> anova(lm(x~gruppi))
Analysis of Variance Table

Response: x
           Df  Sum Sq Mean Sq F value    Pr(>F)
gruppi      2 18301.3  9150.6  1883.3 < 2.2e-16 ***
Residuals 137   665.6     4.9
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 2

> anova(lm(x1~gruppi))
Analysis of Variance Table

Response: x1
           Df Sum Sq Mean Sq F value  Pr(>F)
gruppi      2  28.76 14.3810  2.9598 0.05515 .
Residuals 137 665.65  4.8587
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Example 3

Analysis of Variance Table

Response: x2
           Df Sum Sq Mean Sq F value  Pr(>F)
gruppi      2   7928  3963.9    3.11 0.04777 *
Residuals 137 174615  1274.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In Example 1 there is strong evidence that the three group means differ

In Example 2 and Example 3 the evidence against $H_0$ is weak
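The p-values in the tables are upper tail probabilities of the Fisher distribution with 2 and 137 degrees of freedom. The sketch below recomputes them from the reported F values; the significance level `alpha = 0.05` is an assumption used only to illustrate the comparison.

```r
# F values and degrees of freedom as reported in the ANOVA tables
f2 <- 2.9598   # Example 2
f3 <- 3.11     # Example 3 (rounded in the table)

p2 <- pf(f2, df1 = 2, df2 = 137, lower.tail = FALSE)
p3 <- pf(f3, df1 = 2, df2 = 137, lower.tail = FALSE)
round(c(Example2 = p2, Example3 = p3), 4)

# Comparison with a conventional level (assumption: alpha = 0.05)
alpha <- 0.05
c(Example2 = p2, Example3 = p3) < alpha

# Critical value of F[2, 137] at that level
qf(1 - alpha, df1 = 2, df2 = 137)
```

Both p-values lie close to 0.05, on opposite sides of it, which is why the evidence against the null hypothesis is described as weak in both cases.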


Example. Nitrogen in red clover plants

Effect of bacteria (5 strains and a composite) on the nitrogen content of red clover plants.

[Figure: stacked strip charts of the nitrogen content (x-axis 10 to 35), one strip per strain: 3DOK1, 3DOK13, 3DOK4, 3DOK5, 3DOK7, COMPOS]

> clover=read.table("C:/DATA/anova_redclover.txt",header =T)
> attach(clover)
> anova(lm(Nitrogen~Strain))
Analysis of Variance Table

Response: Nitrogen
          Df Sum Sq Mean Sq F value    Pr(>F)
Strain     5 847.05 169.409   14.37 1.485e-06 ***
Residuals 24 282.93  11.789


Code for stripchart

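# factor() is needed when Strain was read as character (the default in R >= 4.0)
lev=levels(factor(Strain))
for (i in 1:6){
h=.15*i*2   # height of the i-th strip
stripchart(Nitrogen[Strain==lev[i]],
method="stack", offset=.5,at=h,pch=19,xlim=c(8,38),cex=2,col="red")
axis(2, at=h, labels = FALSE)
text(y=h, par("usr")[1]-0.2, labels=lev[i],pos = 2,xpd = TRUE)
abline(h=h);par(new=T)}   # par(new=T) overlays the next strip on the same axes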


Part J. Normal Probability Plot

Graphical technique for assessing whether or not the data can be considered a sample from a Normal distribution

The quantiles of the data are plotted against the corresponding quantiles of a standard Normal distribution

If the points form a nearly linear pattern, the Normal distribution is a good model for the data. Departures from the straight line indicate departures from normality
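The construction can be sketched by hand: sort the data and plot them against Normal quantiles evaluated at suitable probability points. The code below uses simulated data (an assumption, since no data set is needed to illustrate the idea) and the plotting positions returned by `ppoints()`, which are the ones `qqnorm()` uses.

```r
# Simulated sample for illustration only
set.seed(3)
y <- rnorm(100, mean = 70, sd = 10)
n <- length(y)

theo <- qnorm(ppoints(n))   # quantiles of the standard Normal
samp <- sort(y)             # sample quantiles (ordered data)

plot(theo, samp, pch = 16,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q plot built by hand")

# The same coordinates are returned by qqnorm()
qq <- qqnorm(y, plot.it = FALSE)
all.equal(sort(qq$x), theo)
```

For data that really come from a Normal distribution with mean m and standard deviation s, the points lie close to the line with intercept m and slope s.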


Example.

Normal probability plot of PULSE1

qqnorm(PULSE1,pch=16,
main="Normal Q-Q plot of PULSE1")
qqline(PULSE1,col="red",lwd=2)

fivenum(PULSE1)

[1] 48 64 71 80 100

round(qnorm(c(0.001,0.25,0.5,0.75, 0.999)),3)

[1] -3.090 -0.674 0.000 0.674 3.090

[Figure: Normal Q-Q plot of PULSE1; x-axis: Theoretical Quantiles (−2 to 2), y-axis: Sample Quantiles (50 to 100)]
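By default the reference line drawn by `qqline()` passes through the first and third quartiles of the data and of the standard Normal. A minimal sketch of that computation on simulated data follows; the sample and the names `qx`, `qy`, `slope`, `intercept` are assumptions for illustration. With the PULSE1 hinges 64 and 80 and the Normal quartiles ±0.674 shown above, the slope would be roughly 16/1.349 ≈ 11.9, although hinges from `fivenum()` may differ slightly from the quartiles used by `quantile()`.

```r
# Simulated sample for illustration only
set.seed(5)
y <- rnorm(92, mean = 72, sd = 11)

qy <- quantile(y, c(0.25, 0.75), names = FALSE)   # sample quartiles
qx <- qnorm(c(0.25, 0.75))                        # Normal quartiles, about -0.674 and 0.674

slope     <- diff(qy) / diff(qx)
intercept <- qy[1] - slope * qx[1]

qqnorm(y, pch = 16)
abline(intercept, slope, col = "red", lwd = 2)    # line through the two quartile points
c(intercept = intercept, slope = slope)
```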

Example.

Normal probability plot of BILIRUBINA

qqnorm(BILIRUBINA,pch=16,cex=0.5,
main="Normal Q-Q plot of BILIRUBINA")
qqline(BILIRUBINA,col="red",lwd=2)

fivenum(BILIRUBINA)

[1] 0.30 0.80 1.35 3.45 28.00

[Figure: Normal Q-Q plot of BILIRUBINA; x-axis: Theoretical Quantiles (−3 to 3), y-axis: Sample Quantiles (0 to 25)]