
(1)

Inferential Statistics

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

riccomagno@dima.unige.it rogantin@dima.unige.it

(2)

Part G

Distribution free hypothesis tests

1. Classical and distribution-free tests
2. Distribution-free statistics and tests
3. Aside of Probability. Two distribution-free statistics
4. The sign test
5. The Wilcoxon-Mann-Whitney test
6. The goodness-of-fit tests
   a) Chi-square test
   b) Kolmogorov-Smirnov tests (one and two samples)
7. Final remarks

(3)

1. Classical and distribution-free tests

• Differences between independent groups

– Classical: t-test (or Welch test) to compare the means of two groups; ANOVA for more groups

– Distribution-free: Mann-Whitney U test and Kolmogorov-Smirnov two-sample test; Kruskal-Wallis and Median test for more groups.

• Differences between variables

– Classical: t-test for paired samples; repeated measures ANOVA for more than two variables

– Distribution-free: Sign test and Wilcoxon’s matched pairs test.

• Relationships between variables

– Classical: correlation coefficient.

– Distribution-free: Spearman R, ... For binary variables: Chi-square test, Phi coefficient, and Fisher exact test.

(4)

2. Distribution-free statistics and tests

Let X1, . . . , Xn ∼ F be i.i.d. sample variables.

A statistic T = T(X1, . . . , Xn) is distribution-free if its distribution is the same for every distribution F of the sample variables.

An example: the Wald test (using the CLT approximation for the distribution of X̄n). Under H0 : µ = µ0, for large n,

(X̄n − µ0) / (S/√n) ∼approx N(0, 1)

This is a particular case of a general fact: statistics whose asymptotic (limit) distribution does not depend on the sample distribution are distribution-free.

A test is distribution-free if the test statistic is distribution-free.
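A quick simulation sketch (illustrative, with arbitrary settings: exponential data and n = 200) of why the Wald statistic is distribution-free in this asymptotic sense:

# Sketch: the studentized mean is approximately N(0,1) under H0
# whatever the sample distribution (here Exp(1), so mu0 = 1)
set.seed(1)
n <- 200
z <- replicate(5000, {
  x <- rexp(n)                      # sample with true mean mu0 = 1
  (mean(x) - 1)/(sd(x)/sqrt(n))     # Wald statistic under H0
})
mean(abs(z) > qnorm(0.975))         # empirical level, close to 0.05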

(5)

3. Aside of Probability. Two distribution-free statistics

Sign statistic

Consider any i.i.d. random sample X1, . . . , Xn with median equal to 0.

Assume P(Xi = 0) = 0, for i = 1, . . . , n (e.g. Xi continuous).

Define Zi = 1 if Xi > 0 and Zi = 0 if Xi < 0, and note that Zi ∼ B(1, 1/2).

The statistic B = Σ_{i=1}^n Zi ∼ B(n, 1/2) is distribution-free.

Furthermore, for large n,

(B − n/2) / ((1/2)√n) ∼approx N(0, 1)
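A minimal simulation check (illustrative settings): the distribution of B does not change with the sample distribution, as long as the median is 0.

# Sketch: B ~ B(n, 1/2) for any continuous distribution with median 0
set.seed(1)
n <- 50
b_norm <- replicate(5000, sum(rnorm(n) > 0))       # N(0,1) has median 0
b_exp  <- replicate(5000, sum(rexp(n) > log(2)))   # Exp(1) has median log(2)
c(mean(b_norm), mean(b_exp), n/2)   # all close to E(B) = n/2
c(var(b_norm),  var(b_exp),  n/4)   # all close to V(B) = n/4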

(6)

Rank statistics

Consider the sample variables X1, . . . , Xn and the corresponding rank variables R1, . . . , Rn where Ri represents the position of Xi in the sample

Note. In Lecture 2 we saw that an observed sample can be ordered by, e.g., the R command sort. Random variables can also be sorted, returning the ordered random vector (X(1), ..., X(n)). The rank variables are themselves random variables; e.g. R1 is the (random) position of X1 in the ordered sample.

The joint distribution of (R1, . . . , Rn) does not depend on the distribution of the sample variables.

We do not give here the details (proof based on combinatorial computation).

If the data contains ties, to the tied values assign the average of the ranks they would have received had they not been tied.

E.g. the values 13 14 14 16 17 are assigned the ranks 1 2.5 2.5 4 5.
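This averaging rule is the default of the R function rank (ties.method = "average"):

> rank(c(13, 14, 14, 16, 17))
[1] 1.0 2.5 2.5 4.0 5.0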

(7)

4. The simplest distribution-free test: the sign test

a) Test for the median of a random variable
b) Test for the equality of two medians (paired samples)

a) Test for the median of a random variable.

Consider an i.i.d. random sample X1, . . . , Xn and the test with hypotheses:

H0 : Q2 = λ0 against H1 : Q2 ≠ λ0

(H1 could also be Q2 < λ0 or Q2 > λ0.)

Consider Zi = 1 if Xi ≥ λ0 and Zi = 0 if Xi < λ0; then Zi ∼ B(1, 1/2) and the test statistic is

B = Σ_{i=1}^n Zi

Under H0, B ∼ B(n, 1/2). The test is carried out as usual.
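A minimal wrapper sketch (the function name is ours), since binom.test performs the exact binomial computation:

# Sketch: sign test for H0: Q2 = lambda0
sign_test_median <- function(x, lambda0, alternative = "two.sided") {
  b <- sum(x >= lambda0)                 # sample value of B
  binom.test(b, length(x), p = 0.5, alternative = alternative)
}
# e.g. sign_test_median(x, 0, alternative = "greater") for H1: Q2 > 0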

(8)

b) Test for the equality of two medians - paired samples

Let X and Y be two continuous random variables modeling some characteristic of the same population, with median Q2X and Q2Y respectively. Consider a test with hypotheses:

H0 : Q2X = Q2Y against H1 : Q2X > Q2Y

(H1 could also be Q2X ≠ Q2Y or Q2X < Q2Y.)

Let (X1, Y1), . . . , (Xn, Yn) be a paired random sample of size n and define (D1, . . . , Dn) with Di = Xi − Yi.

The test hypotheses become

H0 : Q2D = 0 against H1 : Q2D > 0

and we fall in the set-up of case a).

Remark. A more powerful alternative for both a) and b) is the Wilcoxon signed-rank test. We do not give the details here; a minimal usage sketch follows.
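The sketch below anticipates the deer data of the next example; note that ties in the |Di| force R to use a normal approximation for the signed-rank test.

# Sketch: paired sign test and Wilcoxon signed-rank test in R
hind <- c(142,140,144,144,142,146,149,150,142,148)
fore <- c(138,136,147,139,143,141,143,145,136,146)
d <- hind - fore
binom.test(sum(d > 0), sum(d != 0), alternative = "greater")  # sign test
wilcox.test(hind, fore, paired = TRUE, alternative = "greater")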

(9)

Example. Deer legs

Zar, Jerrold H. (1999), "Chapter 24: More on Dichotomous Variables", Biostatistical Analysis (Fourth ed.), Prentice-Hall.

The null hypothesis is that there is no difference between the hind leg and foreleg length in deer. The alternative hypothesis is that the hind leg length is longer than the foreleg length.

Thus:

Deer   Hind leg   Foreleg   Diff. sign
1      142        138       +
2      140        136       +
3      144        147       -
4      144        139       +
5      142        143       -
6      146        141       +
7      149        143       +
8      150        145       +
9      142        136       +
10     148        146       +

H0 : Q2D = 0 against H1 : Q2D > 0

Under H0 the test statistic is B = Σ_{i=1}^{10} Zi ∼ B(10, 1/2). Its sample value is b = 8.

The test is one-sided right.

The p-value of b is 0.055 (in R: 1-pbinom(7,10,0.5)).

(10)

Direct computation in R

> binom.test(8,10,alternative="greater")

Exact binomial test

data: 8 and 10
number of successes = 8, number of trials = 10, p-value = 0.05469
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.4930987 1.0000000
sample estimates:
probability of success
                   0.8

There is “weak evidence” against H0. When the sample size is small, the tails of the test statistic distribution carry large probability, so p-values tend to be large and H0 is rarely rejected. In practice, to overcome this, a higher α can be chosen. In our example there is evidence to retain H0.

(11)

5. The Mann-Whitney U test or Wilcoxon rank-sum test (equality of two distributions - unpaired samples)

The null hypothesis can be expressed as follows: the probability of an observation from population X exceeding an observation from population Y equals the probability of an observation from Y exceeding an observation from X:

H0 : P(X > Y ) = P(X < Y ) = 0.5

The alternative hypothesis can be stated in terms of one-sided (left or right) or two-sided test.

Here X and Y are two continuous independent random variables, and to test H0 we consider X1, . . . , Xn1 and Y1, . . . , Yn2, two independent random samples of possibly different sizes.

The variables could also be discrete or ordinal, provided P(X = Y ) = 0.

(12)

Put together the two samples, so that there are n = n1 + n2 observations in total.

Let R1, . . . , Rn1 be the rank variables assigned to X1, . . . , Xn1 and Rn1+1, . . . , Rn the rank variables assigned to Y1, . . . , Yn2.

The statistics

W1 = Σ_{i=1}^{n1} Ri   and   U1 = W1 − n1(n1 + 1)/2

are distribution-free and are used as test statistics.

U1 takes integer values between 0 and n1n2.

The statistics W2 and U2 (based on the ranks of the Y's) are defined analogously. Moreover W1 + W2 = n(n + 1)/2.

Which of W1 and W2 (or U1 and U2) should be considered? Usually the statistic with the lower sample value is used.

(13)

A small example

Does the treatment A produce lower values of a variable than the treatment B?

Denote by X and Y the variables modeling the results of treatments A and B respectively.

H0 : P(X < Y ) = P(X > Y )   against   H1 : P(X < Y ) > P(X > Y )

Seven elements are drawn at random from the population. Three, randomly chosen, are assigned to treatment A; the other four to treatment B: n1 = 3 and n2 = 4.

The sample values and the corresponding sample ranks are:

xi: 12 16 13        r(xi): 1 4 2
yi: 17 15 18 20     r(yi): 5 3 6 7

The sample value of W1 is w = 1 + 4 + 2 = 7.

(14)

Computation of the distribution of W1 under H0 (n1 = 3, n2 = 4)

W1 is the sum of 3 distinct numbers chosen among {1, . . . , 7}.

It takes values between 6 and 18. It is symmetrical w.r.t. 12.

How many ways are there to form w?

- 6: one way, 1 + 2 + 3;

- 7: one way, 1 + 2 + 4;

- 8: two ways, 1 + 2 + 5 and 1 + 3 + 4; . . .

Under H0, the three ranks of the X's are randomly chosen among {1, . . . , 7}: (7 choose 3) = 35 equally likely cases.

Then the distribution of W1 for n1 = 3 and n2 = 4 is

w    associated ranks                                 fW1(w)
6    (1,2,3)                                          1/35
7    (1,2,4)                                          1/35
8    (1,2,5); (1,3,4)                                 2/35
9    (1,2,6); (1,3,5); (2,3,4)                        3/35
10   (1,2,7); (1,3,6); (1,4,5); (2,3,5)               4/35
11   (1,3,7); (1,4,6); (2,4,5); (2,3,6)               4/35
12   (1,4,7); (1,5,6); (2,3,7); (2,4,6); (3,4,5)      5/35

The distribution of W1 depends only on n1 and n2: W1 is a distribution-free statistic.
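The table can be reproduced in R by enumerating all 35 rank triples:

# All C(7,3) = 35 equally likely rank triples for the X's under H0
w <- combn(7, 3, FUN = sum)     # the 35 possible sample values of W1
table(w)/choose(7, 3)           # matches the table above: 1/35, 1/35, 2/35, ...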

(15)

Some properties of W1 and U1 under H0

Minimum value: all the ranks of the Xi's are smaller than the ranks of the Yi's:

min(W1) = Σ_{i=1}^{n1} i = n1(n1 + 1)/2        min(U1) = 0

Maximum value: all the ranks of the Yi's are smaller than the ranks of the Xi's:

max(W1) = Σ_{i=n2+1}^{n} i = n1(n + n2 + 1)/2   max(U1) = n1n2

Mean value:

E(W1) = n1(n + 1)/2        E(U1) = n1n2/2

Variance:

V(W1) = V(U1) = n1 n2 (n + 1)/12

• W1 and U1 are symmetrical w.r.t. their mean values.

Moreover, for n1 and n2 greater than 10,

(U1 − E(U1)) / std(U1) ∼approx N(0, 1)
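These formulas can be checked by simulation (an illustrative sketch), again with n1 = 3 and n2 = 4:

# Sketch: under H0 the X-ranks are a uniform random 3-subset of {1,...,7}
set.seed(1)
w <- replicate(10000, sum(sample(7, 3)))   # simulated values of W1
c(mean(w), 3*(7 + 1)/2)                    # E(W1) = n1(n+1)/2 = 12
c(var(w),  3*4*(7 + 1)/12)                 # V(W1) = n1 n2 (n+1)/12 = 8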

(16)

Back to the test

The test is one-sided left; the sample value is w = 7 and its p-value is P(W1 ≤ 7) = 2/35 ≈ 0.057.

With such a small sample size, we can say that the evidence is against H0.

Direct computation in R

> x=c(12,16,13); y=c(17,15,18,20)
> wilcox.test(x,y,"less")

Wilcoxon rank sum test

data: x and y
W = 1, p-value = 0.05714

alternative hypothesis: true location shift is less than 0

The normal approximation for W1 is not appropriate for small sample sizes. But, in this case, the exact computation and the normal approximation give similar results (using E(W1) = 12 and V(W1) = 8 from the formulas above):

z = (7 − 12)/√8 = −1.77        p-value(−1.77) = 0.039
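The same normal approximation in R:

# Normal approximation for the one-sided left test above
n1 <- 3; n2 <- 4; n <- n1 + n2; w <- 7
z <- (w - n1*(n + 1)/2)/sqrt(n1*n2*(n + 1)/12)
c(z, pnorm(z))    # about -1.77 and 0.039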

(17)

6. Goodness-of-fit tests

Measures of goodness-of-fit typically summarize the discrepancy between observed values and the values expected under a known probability model.

Such measures can also be used to test whether two samples are drawn from identical distributions.

We consider here two goodness-of-fit tests:

a) Chi-square test (discrete variables)
b) Kolmogorov-Smirnov test

(18)

6. a) Chi-square goodness-of-fit tests

Let X be a discrete random variable with finite support and

P(X = xi) = πi,   i = 1, . . . , r

The test hypotheses are:

H0 : πi = πi0 for all i   against   H1 : πi ≠ πi0 for at least one i

Let

- X1, ..., Xn be a random sample

- F1, . . . , Fr be the sample variables denoting the sample frequencies of the values x1, . . . , xr

- N1, . . . , Nr be the corresponding count variables, Ni = nFi, i = 1, . . . , r.

Often the Ni's are called observed counts while the nπi0's are called expected counts, denoted by Oi and Ei respectively.

(19)

The test statistic is

Q = n Σ_{i=1}^{r} (Fi − πi0)²/πi0 = Σ_{i=1}^{r} (Ni − nπi0)²/(nπi0) = (simply) Σ_{i=1}^{r} (Oi − Ei)²/Ei

Its asymptotic distribution is a chi-square with r − 1 degrees of freedom:

Q ∼approx χ²[r−1]


The test is one-sided right because large sample values of Q indicate a large difference between observed and expected frequencies.
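In R the whole computation is performed by chisq.test; a minimal sketch with hypothetical counts:

# Sketch: chi-square goodness-of-fit test for given probabilities pi0
o  <- c(18, 25, 57)       # hypothetical observed counts (n = 100)
p0 <- c(0.2, 0.3, 0.5)    # probabilities under H0
chisq.test(o, p = p0)     # Q and its p-value on r - 1 = 2 df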

(20)

Dependence on a parameter

Often the πi's depend on an unknown parameter θ. Examples:

• X ∼ B(n, θ) (binomial),

• X ∼ U {0, θ} (discrete uniform between 0 and θ),

• X ∼ P(θ) (truncated Poisson, setting to zero the probability of “large” integers)

We can write πi = πi(θ) and the test hypotheses become:

H0 : πi = πi0(θ) for all i   and   H1 : πi ≠ πi0(θ) for at least one i

If Θn is a consistent estimator of θ with asymptotic normal distribution N(θ, V(Θn)) (e.g. the maximum likelihood estimator), then the test can be conducted with the statistic:

Q = n Σ_{i=1}^{r} (Fi − πi0(Θn))²/πi0(Θn)

(21)

Example. Sons among the first 7 children

(Edwards and Fraccaro 1960)

Consider the number of sons among the first seven children of 1334 Swedish ministers:

n. sons   0   1    2    3    4    5    6   7
counts    6   57   206  362  365  256  69  13

We want to test whether these are sample values of a random variable X ∼ B(7, θ).

The point estimator of θ is X̄/7, the maximum likelihood estimator. The estimate of θ is 0.514.

> x=c(0,1,2,3,4,5,6,7); o=c(6,57,206,362,365,256,69,13)
> t=sum(x*o)/sum(o)/7; t
[1] 0.5140287

The expected counts under H0 are:

> e=sum(o)*dbinom(0:7,7,t); round(e,1)
[1]   8.5  63.2 200.6 353.7 374.1 237.4  83.7  12.6

The sample value of Q is 5.98 with p-value 0.54 (in R: q=sum((o-e)^2/e); 1-pchisq(q,7)). Then there is no evidence to reject H0.

(22)

Effects of small sample size.

Recall that

Q = n Σ_{i=1}^{r} (Fi − πi0)²/πi0 = Σ_{i=1}^{r} (Ni − nπi0)²/(nπi0) ∼approx χ²[r−1]

The chi-square approximation is valid when the sample size is large and the expected counts nπi0 are not too small (at least 5 for all i = 1, . . . , r).

In fact:

(1) small n ⇒ small q ⇒ risk of type II error
(2) small nπi0 ⇒ large q ⇒ risk of type I error.

(23)

Examples.

Case (1): small n ⇒ small q ⇒ risk of type II error

Consider the expected and observed frequencies below, where the differences between them are greater than 40%.

           1      2
expected   0.40   0.60
observed   0.15   0.85

In such a case:

(f1 − π10)²/π10 + (f2 − π20)²/π20 = 0.2604

If n = 10, then q = 10 × 0.2604 = 2.604 with p-value 0.107 ⇒ retain H0.

If n = 30, then q = 30 × 0.2604 = 7.812 with p-value 0.005 ⇒ reject H0.

> e=c(0.4,0.6); o=c(0.15,0.85); cf=sum((o-e)^2/e); cf
[1] 0.2604167
> n=10; cbind(cf*n, 1-pchisq(cf*n,1))
[1,] 2.604167 0.1065832
> n=30; cbind(cf*n, 1-pchisq(cf*n,1))
[1,] 7.8125 0.005188608

(24)

Case (2): small nπi0 ⇒ large q ⇒ risk of type I error

Consider the expected and observed counts below. In (A) two of the expected counts are small.

(A)  values    0    1    2
     expected  10   2    2
     observed  12   3    6

(B)  values    0    1    2
     expected  10   12   12
     observed  12   13   16

In (A): q = 8.900 with p-value 0.0117 ⇒ reject H0.
In (B): q = 1.817 with p-value 0.4032 ⇒ retain H0.

> e=c(10,2,2); o=c(12,3,6); cf=sum((o-e)^2/e)
> cbind(cf, 1-pchisq(cf,2))
[1,] 8.9 0.01167857
> e=c(10,12,12); o=c(12,13,16); cf=sum((o-e)^2/e)
> cbind(cf, 1-pchisq(cf,2))
[1,] 1.816667 0.4031957

(25)

6 b1) Kolmogorov-Smirnov goodness-of-fit tests

Let X1, . . . , Xn be i.i.d. sample variables from a continuous random variable X with cumulative distribution function F.

Consider the test hypotheses:

H0 : F(x) = F0(x) for all x ∈ R

H1 : F(x) ≠ F0(x) for at least one x ∈ R

Let F̂ be the empirical cumulative distribution function:

F̂(x) = Σ_i (i/n) 1(X(i) ≤ x < X(i+1))

where (X(1), . . . , X(n)) is the sorted random sample and 1(·) denotes the indicator function (equal to 1 if the condition is satisfied and equal to 0 otherwise). F̂ is a step function.

The sample values of F̂(x) are discussed in the slides “Exploratory Data Analysis”.
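In R, F̂ is returned as a step function by ecdf; a small sketch on the data of the next example:

# ecdf() returns the empirical cdf as a step function
s    <- c(0.03,0.12,0.25,0.41,0.49,1.18,1.21,1.56,1.57,1.69)
Fhat <- ecdf(s)
Fhat(0.49)    # 0.5: five of the ten observations are <= 0.49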

(26)

The Kolmogorov test statistic is

D = sup_{x∈R} |F̂(x) − F0(x)| = max_{1≤i≤n} max{ i/n − F0(X(i)), F0(X(i)) − (i−1)/n }

D is a distribution-free statistic.

The test is one-sided right because a large sample value of D corresponds to a large difference between the empirical and the tested cumulative distribution function.
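A direct implementation of the max formula (a sketch, anticipating the uniform example below):

# Sketch: D for H0: X ~ U(0,2), computed from the formula above
s  <- sort(c(0.03,0.12,0.25,0.41,0.49,1.18,1.21,1.56,1.57,1.69))
n  <- length(s)
F0 <- punif(s, 0, 2)                              # F0 at the sorted data
D  <- max(pmax((1:n)/n - F0, F0 - (0:(n - 1))/n))
D                                                 # 0.255, as ks.test confirms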

(27)

Example.

Goodness-of-fit of a uniform random variable X ∼ U (0, 2)

We want to test if a uniform random variable X ∼ U (0, 2) fits the following (sorted) data:

0.03 0.12 0.25 0.41 0.49 1.18 1.21 1.56 1.57 1.69

A random variable X ∼ U(0, 2) has cumulative distribution function

F0(x) = 0 if x < 0;  x/2 if 0 ≤ x < 2;  1 if x ≥ 2

[Figure: the empirical cumulative distribution function (red) and F0 (black).]

The maximum distance between the two plots is achieved at x = 0.49 (the fifth sorted value): d = 5/10 − 0.49/2 = 0.255

(28)

Direct computation in R

> s=c(0.03,0.12,0.25,0.41,0.49,1.18,1.21,1.56,1.57,1.69)

> ks.test(s,"punif",0,2)

One-sample Kolmogorov-Smirnov test

data: s

D = 0.255, p-value = 0.4593

alternative hypothesis: two-sided

There is no evidence to reject H0.

(29)

Example. Approximate distribution of X̄n

see slides on Central limit theorem

Consider a simulation of 1000 samples, of size n each, from an exponential random variable X ∼ E(λ) with λ = 2.

The simulated distribution is compared with:

- a Normal variable with sample mean and standard deviation;
- a Normal variable with theoretical mean and standard deviation, which are known: 1/λ and 1/(λ√n) respectively.

• n = 10

> lambda=2;x=c(1:1000);n=10

> for (i in 1:1000) x[i]=mean(rexp(n,lambda))

> ######### empirical mean and standard deviation

> ks.test(x,"pnorm",mean(x),sd(x))

One-sample Kolmogorov-Smirnov test

data: x
D = 0.050283, p-value = 0.01273
alternative hypothesis: two-sided

(30)

> ######### theoretical mean and standard deviation

> ks.test(x,"pnorm",(1/lambda),(1/lambda/sqrt(n)))

One-sample Kolmogorov-Smirnov test

data: x
D = 0.042575, p-value = 0.05329
alternative hypothesis: two-sided

In the first case there is evidence against the hypothesis that the simulated distribution of X̄10 is Normal. In the second case the evidence is weak.

(31)

• n = 30

> lambda=2;x=c(1:1000);n=30

> for (i in 1:1000) x[i]=mean(rexp(n,lambda))

> ######### empirical mean and standard deviation

> ks.test(x,"pnorm",mean(x),sd(x))

One-sample Kolmogorov-Smirnov test

data: x
D = 0.035285, p-value = 0.1657
alternative hypothesis: two-sided

> ######### theoretical mean and standard deviation

> ks.test(x,"pnorm",(1/lambda),(1/lambda/sqrt(n)))

One-sample Kolmogorov-Smirnov test

data: x

D = 0.032839, p-value = 0.231

alternative hypothesis: two-sided

In both cases there is evidence to retain that the simulated distribution of X̄30 is Normal.

(32)

6 b2) Two-sample Kolmogorov-Smirnov goodness-of-fit tests

Let X and Y be two continuous independent random variables with cumulative distribution functions FX and FY respectively.

The test hypotheses are:

H0 : FX(t) = FY(t) for all t ∈ R

H1 : FX(t) ≠ FY(t) for at least one t ∈ R

Let X1, . . . , Xn1 and Y1, . . . , Yn2 be two independent random samples with empirical cumulative distribution functions F̂X and F̂Y respectively.

The Kolmogorov-Smirnov test statistic is

Dn1,n2 = sup_{x∈R} |F̂X(x) − F̂Y(x)|

Dn1,n2 is a distribution-free statistic.

(33)

Example. Juniper trees

We want to test whether the biomass of male and female Juniper trees has the same distribution.

The two samples have size 6 each.

> m=c(71,72,74,76,77,78); f=c(73,79,80,82,83,84)

>

> Fm_Ff=rbind(cumsum(table(factor(m, levels=71:84)))/6,
+             cumsum(table(factor(f, levels=71:84)))/6)
> round(Fm_Ff,2)
   71   72   73   74   75   76   77   78   79  80  81   82   83 84
 0.17 0.33 0.33 0.50 0.50 0.67 0.83 1.00 1.00 1.0 1.0 1.00 1.00  1
 0.00 0.00 0.17 0.17 0.17 0.17 0.17 0.17 0.33 0.5 0.5 0.67 0.83  1

plot(ecdf(m),col="blue",xlim=c(70,85),xlab="",ylab="",main="")
plot(ecdf(f),add=T,col="red",xlim=c(70,85),xlab="",ylab="",main="")

[Figure: the two empirical cumulative distribution functions, m in blue and f in red.]

(34)

The absolute values of the differences between F̂M and F̂F are listed below; their maximum is 0.83, reached at a biomass of 78.

> D=abs(Fm_Ff[1,]-Fm_Ff[2,])

> round(rbind(Fm_Ff,D),2)

   71   72   73   74   75   76   77   78   79  80  81   82   83 84
 0.17 0.33 0.33 0.50 0.50 0.67 0.83 1.00 1.00 1.0 1.0 1.00 1.00  1
 0.00 0.00 0.17 0.17 0.17 0.17 0.17 0.17 0.33 0.5 0.5 0.67 0.83  1
D 0.17 0.33 0.17 0.33 0.33 0.50 0.67 0.83 0.67 0.5 0.5 0.33 0.17  0

> max(D)

[1] 0.8333333

Direct computation in R

> ks.test(m, f)

Two-sample Kolmogorov-Smirnov test

data: m and f
D = 0.83333, p-value = 0.02597
alternative hypothesis: two-sided

There is evidence to reject H0.

(35)

7. Final remarks

From the book by T. Hill and P. Lewicki (2006), Statistics: Methods and Applications, StatSoft, p. 385:

It is not easy to give simple advice concerning the use of nonparametric procedures.

Each nonparametric procedure has its peculiar sensitivities and blind spots.

For example, the Kolmogorov-Smirnov two-sample test is not only sensitive to differences in the location of distributions (for example, differences in means) but is also greatly affected by differences in their shapes.

The Wilcoxon matched pairs test assumes that one can rank order the magnitude of differences in matched observations in a meaningful manner. If this is not the case, one should rather use the Sign test.

(36)

In general, if the result of a study is important (e.g., does a very expensive and painful drug therapy help people get better?), then it is always advisable to run different nonparametric tests; should discrepancies in the results occur contingent upon which test is used, one should try to understand why some tests give different results.

On the other hand, nonparametric statistics are less statistically powerful (sensitive) than their parametric counterparts, and if it is important to detect even small effects (e.g., is this food additive harmful to people?) one should be very careful in the choice of a test statistic.

Nonparametric methods are most appropriate when the sample sizes are small.
