
Inferential Statistics First Part

Academic year: 2021


(1)

Inferential Statistics First Part

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

[email protected] [email protected]

(2)

Part A

Generalities and point estimation

• Population and samples

• Point estimation

• Sample mean and its distribution

• Properties of estimators

(3)

Inferential statistics

Data collection or experiment, simulation – EDA
→ analysis of the available sample
→ (probability)
→ information about population or process, used to assess unknown parameter values of the whole population

Partial observation can be intrinsic to the problem:

- limitations due to costs (time and money)
- polls and elections
- invasive experiments (industrial, pharmaceutical)
- time forecasts

(4)

From the book

J. Maindonald and W. J. Braun. Data Analysis and Graphics. Cambridge University Press. 2010. Chapter 4, p. 102

A random sample is a set of values drawn independently from a larger population. A (uniform) random sample has the characteristic that all members of the population have an equal chance of being drawn.

[. . . ]

The challenge is to use the one sample that is available, together with the assumption of independent and identically distributed sample values, to infer the sampling distribution of the mean.

(5)

Probability provides a theoretical model of the variability, used to predict behaviour in non-sampled cases:

• starting from experience,

• formally consistent,

• able to describe the phenomenon,

• able to evaluate the inevitable approximations made in passing from the partial information of the observed data to statements about the entire population or the entire phenomenon

(6)

From the book: Larry Wasserman. All of Statistics. Springer. 2010. Chapter 6.1, p. 87

Statistical inference, or “learning” as it is called in computer science, is the process of using data to infer the distribution that generated the data.

A typical statistical inference question is:

Given a sample X1, . . . , Xn ∼ F, how do we infer F ? In some cases, we may want to infer only some feature of F such as its mean.

Typically the observed sample is indicated with lower case letters x1, x2, . . . , xn

(7)

Statistics/Data Mining Dictionary

Statistics            Computer Science        Meaning
estimation            learning                using data to estimate an unknown quantity
classification        supervised learning     predicting a discrete Y from X
clustering            unsupervised learning   putting data into groups
data                  training sample         (X1, Y1), . . . , (Xn, Yn)
covariates            features                the Xi's
classifier            hypothesis              a map from covariates to outcomes
hypothesis                                    subset of a parameter space Θ
confidence interval                           interval that contains an unknown quantity with given frequency
. . .

(8)

Population and samples

Example of populations: the inhabitants of a city, the groceries sold in a particular region.

It is important to select the observed sample appropriately, ideally the sample should be:

• representative of the population (for example, if we study the average price of a product the sample should not derive from only supermarkets, but also from small shops);

• formed by mutually independent elements (for example: variable: blood pressure; population: inhabitants of a region; sample: not only hospitalized patients).

(9)

Point estimation

Let X be a random variable modeling the data.

Example Estimation of the mean µ of the systolic blood pressure (mmHg) X in the population. Sample of 8 subjects whose systolic blood pressure is:

x1 x2 x3 x4 x5 x6 x7 x8

125 128 133 136 126 129 131 135

Choice of the estimator – two examples:

- sample mean: X̄ = (X1 + X2 + · · · + Xn)/n
- mid-range: T = [max(X1, X2, . . . , Xn) + min(X1, X2, . . . , Xn)]/2

Point estimates (rounded): x̄ = 130 and t = 131

Which estimate and estimator should we choose?

The sample mean is the best estimator because it has good statistical and mathematical properties
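As a quick check, the two estimates can be computed directly in R on the observed sample (the unrounded values are given in the comments):

x=c(125,128,133,136,126,129,131,135)   ## observed systolic blood pressures
mean(x)                                ## sample mean: 130.375
(max(x)+min(x))/2                      ## mid-range: 130.5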

(10)

Effect of the sample choice

on the estimation of the population mean

Consider the sample mean estimator X̄.

What are the possible samples from the population?

What are the possible values of X̄? How likely is each value?

A small example

Population: 4 subjects A, B, C, D. Sample size: 2

Systolic blood pressure (mmHg) of the 4 subjects – variable X

A    B    C    D
125  129  131  133

µ = (125 + 129 + 131 + 133)/4 = 129.5

Each subject (each value) has probability 1/4 of being drawn.

(11)

Aim: estimation of µ by one sample of size 2, using the sample mean estimator

List of all samples and the corresponding sample means

Note that the value of the population mean is not a possible value of X̄.

Each sample has probability 1/16 of being drawn.

Ex: the sample mean 131 has probability 3/16.

In practice, only one sample will be drawn!

sample   x1    x2    x̄

AA 125 125 125

AB 125 129 127

AC 125 131 128

AD 125 133 129

BA 129 125 127

BB 129 129 129

BC 129 131 130

BD 129 133 131

CA 131 125 128

CB 131 129 130

CC 131 131 131

CD 131 133 132

DA 133 125 129

DB 133 129 131

DC 133 131 132

DD 133 133 133

(12)

Values taken by the estimator X̄ and their probabilities

x̄           125   127   128   129   130   131   132   133
P(X̄ = x̄)   1/16  2/16  2/16  3/16  2/16  3/16  2/16  1/16

What is the probability of overestimating the mean value of blood pressure?

How likely is an estimated value close to the true mean?

What is the probability of being away from the true value for more than 2 mmHg?

Where is the randomness?

Why do we say that X̄ is a random variable?

The randomness is in randomly drawing a sample and obtaining one of the possible values.

The probability of observing a certain sample value is written above and relies on the assumption that all 16 samples are equally likely (i.i.d. sampling). If the sampling scheme changes, then the sampling distribution of X̄ changes accordingly.
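The enumeration of the 16 samples and the resulting distribution can be reproduced in R; a minimal sketch (the subject labels are only for readability):

pop=c(A=125, B=129, C=131, D=133)            ## population values
samples=expand.grid(first=pop, second=pop)   ## all 16 ordered samples of size 2
xbar=rowMeans(samples)                       ## sample mean of each sample
table(xbar)/length(xbar)                     ## sampling distribution of the sample mean
mean(xbar)                                   ## 129.5, the population mean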

(13)

Sampling distribution of X̄ and its mean

[Bar plot of the sampling distribution of X̄: values 125–133 with probabilities from 1/16 to 3/16]

E(X̄) = (125 + 2×127 + 2×128 + 3×129 + 2×130 + 3×131 + 2×132 + 133)/16 = 129.5

X̄ is centered at the parameter µ we want to estimate.

In general: the mean of X̄ is equal to the population mean!

Moreover, the distribution of X̄ is close to µ with high probability when n is “large”.

(14)

Properties of the estimators

A point estimator T of a parameter θ should be

• unbiased or centered (its mean is θ)

• consistent (unbiased and its variance tends towards 0 as the sample size tends to infinity)

Example

Estimating the maximum length θ of Mikado sticks. Suppose that the length is uniform in (0, θ), or in (a, θ + a), with a known.

We could consider

• the maximum in the sample

• twice the sample mean

The sampling distributions of the two estimators are plotted on the right.

[Plots of the sampling distributions of the two estimators on the scale 0–10, with the true value θ marked]

The variance of T is the average of the squared differences from the mean of T
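The comparison can be sketched by simulation for the uniform(0, θ) case; θ = 10 matches the plots, while the sample size n = 50 and the number of replications are illustrative choices, not from the slides:

set.seed(1)
theta=10; n=50; N=10000
t_max =replicate(N, max(runif(n,0,theta)))       ## estimator 1: sample maximum
t_mean=replicate(N, 2*mean(runif(n,0,theta)))    ## estimator 2: twice the sample mean
c(mean(t_max), mean(t_mean))   ## the maximum is slightly biased below theta, 2*mean is centered
c(var(t_max), var(t_mean))     ## but the maximum has a much smaller variance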

(15)

A bit of probability: the law of large numbers

X1, X2, . . . , Xn i.i.d. random variables with mean µ and variance σ².
Let X̄n be the sample mean random variable:

X̄n = Sn/n,   where Sn = X1 + X2 + · · · + Xn

It has theoretical mean µ and variance σ²/n.

Law of large numbers (LLN)

If the sample size n grows to infinity, the probability that X̄n takes values outside the interval (µ − δ, µ + δ) goes to zero, for any positive δ.

More precisely: P(|X̄n − µ| > δ) → 0 as n → ∞

The distribution of X̄n becomes more concentrated around µ as n gets large, where µ is both the parameter to be estimated and the expectation (or theoretical mean) of X̄n.

In other words, the distribution of X̄n piles up near µ.

Here we add the index n to make the dependence on the sample size explicit.
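The statement P(|X̄n − µ| > δ) → 0 can be illustrated by simulation; a sketch with Bernoulli(0.3) trials (the values of δ and of the number of replications are illustrative choices):

set.seed(1)
mu=0.3; delta=0.05
prob_out=function(n, N=1000)                   ## estimate P(|Xbar_n - mu| > delta)
  mean(abs(replicate(N, mean(rbinom(n,1,mu))) - mu) > delta)
sapply(c(10,100,1000,10000), prob_out)         ## the probabilities shrink towards 0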

(16)

Trusting in the experience and the LLN

The LLN is a theorem that describes the result of performing the same experiment a large number of times.

The LLN “guarantees” stable long-term results for the averages of random events.

For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the other.

From Wikipedia https://en.wikipedia.org/wiki/Law_of_large_numbers

(17)

Simulation in R of an experiment with two outcomes:

1 and 0

Example: probability of 1: 0.3 – probability of 0: 0.7

(we will analyze this type of experiment in detail later)

> out=rbinom(10,1,0.3); out          ## 10 trials
 [1] 1 0 0 1 0 0 1 0 0 1
> mean(out)
[1] 0.4

> out=rbinom(100,1,0.3); out         ## 100 trials
  [1] 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
 [33] 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0
 [65] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0
 [97] 0 1 0 0
> mean(out)
[1] 0.29

> out=rbinom(1000,1,0.3); mean(out)    ## 1000 trials
[1] 0.31

> out=rbinom(10000,1,0.3); mean(out)   ## 10000 trials
[1] 0.3005
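The same convergence can be seen on a single long run by plotting the running mean; a sketch, not part of the original slides:

set.seed(1)
out=rbinom(10000,1,0.3)                    ## 10000 trials with p = 0.3
running_mean=cumsum(out)/seq_along(out)    ## mean after 1, 2, ..., 10000 trials
plot(running_mean, type="l", xlab="number of trials", ylab="running mean")
abline(h=0.3, col="red", lwd=2)            ## the running mean settles around p = 0.3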

(18)

Part B

An introduction to hypothesis tests

• Introduction

• A probability aside: the binomial random variable

• Significance level (α) and rejection region

• Decision error

(19)

Example

An experiment has two possible outcomes:

1 for success, and 0 for failure

It is known that the probability of success, p, is either 0.3 or 0.7 and 20 independent trials are performed in exactly the same way.

Aim: infer the true value of p from the outcomes of the 20 trials

The outcomes are modelled by a binomial random variable X:

X ∼ B(20, p)

Two hypotheses for p:

H0 : p = 0.3   null hypothesis
H1 : p = 0.7   alternative hypothesis

In hypothesis testing, we choose some null hypothesis H0 and we ask if the data provide sufficient evidence to reject it.

In the example, rejecting H0 implies accepting H1

(20)

A probability aside. Binomial random variable

An experiment that satisfies the following four conditions can be modelled by a Binomial random variable

1. there is a fixed sample size (number of trials, n)
2. on each trial, the event of interest either occurs or does not
3. the probability of occurrence (or not) is the same on each trial (0 < p < 1)
4. trials are independent of one another

Example:

toss an unbalanced coin 10 times where, in a single toss, head has probability p and tail has probability 1 − p

What is the probability of observing the sequence HHTHHTTHHH?

p·p·(1 − p)·p·p·(1 − p)·(1 − p)·p·p·p = p^7 (1 − p)^3

Equal to the probability of observing any other sequence with 7 heads and 3 tails. How many sequences with exactly 7 heads?
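The count of such sequences, and the resulting probability, can be checked in R (the value p = 0.6 is only an illustrative choice for the unbalanced coin):

choose(10,7)                ## 120 sequences with exactly 7 heads in 10 tosses
p=0.6                       ## illustrative probability of head
choose(10,7)*p^7*(1-p)^3    ## probability of 7 heads in 10 tosses
dbinom(7,10,p)              ## same value from the built-in binomial density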

(21)

Let X be the random variable modeling the number of successes in n independent trials.

Coding 1 for success and 0 for failure, the number of successes is the sum of the 1’s.

The probability that X is equal to k is:

P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for k = 0, 1, . . . , n

where C(n, k) = n!/(k!(n − k)!) is the number of the sequences of n elements (success or failure) with k successes.

Example n=20; p=0.10; x=seq(0,n);df=dbinom(x,n,p)

k 0 1 2 3 4 5 6 7 8 . . .

P (X = k) 0.122 0.270 0.285 0.190 0.090 0.032 0.009 0.002 0.000 . . .

(22)

Plots of the probability density functions n = 20 and different p

[Four bar plots of the binomial probability density function with n = 20 and p = 0.1, 0.3, 0.5, 0.9]

First plot n=20; p=0.10; x=seq(0,n);df=dbinom(x,n,p)

k 0 1 2 3 4 5 6 7 8 . . .

P (X = k) 0.122 0.270 0.285 0.190 0.090 0.032 0.009 0.002 0.000 . . .

(23)

Example n=20; p=0.10; x=seq(0,n);df=dbinom(x,n,p)

k 0 1 2 3 4 5 6 7 8 . . .

P (X = k) 0.122 0.270 0.285 0.190 0.090 0.032 0.009 0.002 0.000 . . .

Functions for random variables in R

d<ran-var-name>(x, <other parameters>)   probability density function in x
p<ran-var-name>(x, <other parameters>)   cumulative distribution function in x
q<ran-var-name>(a, <other parameters>)   a-th quantile
r<ran-var-name>(n, <other parameters>)   random sample of size n

(24)

Binomial random variable

dbinom(x,n,p)   pbinom(x,n,p)   qbinom(a,n,p)   rbinom(N,n,p)

n=20; p=0.30
x=seq(0,n)
df=dbinom(x,n,p)
cdf=pbinom(x,n,p)
cbind(x,round(df,6),round(cdf,6))
par(mfrow=c(2,1))
plot(x,df,type="h",col="blue",lwd=3)
plot(x,cdf,type="s",col="blue",lwd=3)
par(mfrow=c(1,1))

       x
 [1,]  0 0.000798 0.000798
 [2,]  1 0.006839 0.007637
 [3,]  2 0.027846 0.035483
 [4,]  3 0.071604 0.107087
 [5,]  4 0.130421 0.237508
 [6,]  5 0.178863 0.416371
 [7,]  6 0.191639 0.608010
 [8,]  7 0.164262 0.772272
 [9,]  8 0.114397 0.886669
[10,]  9 0.065370 0.952038
[11,] 10 0.030817 0.982855
[12,] 11 0.012007 0.994862
[13,] 12 0.003859 0.998721
[14,] 13 0.001018 0.999739
[15,] 14 0.000218 0.999957
[16,] 15 0.000037 0.999994
[17,] 16 0.000005 0.999999
[18,] 17 0.000001 1.000000
[19,] 18 0.000000 1.000000
[20,] 19 0.000000 1.000000
[21,] 20 0.000000 1.000000

[Plots of df (probability density function) and cdf (cumulative distribution function) of B(20, 0.3) against x]

(25)

Back to the Hypothesis Tests

Plots of the probability density functions:
- under H0 : p = 0.3 (red)
- under H1 : p = 0.7 (black)

We have to choose a threshold s so as to decide which of the two hypotheses is more supported by the data.

[Plot: the two binomial densities B(20, 0.3) and B(20, 0.7) on the range 0–20]

We reject H0 if in the sample there are more 1’s than s.

Does rejecting H0 mean accepting H1? Here yes.

In order to determine s, choose the significance level of the test, α in (0, 1) (here 0.05).

Assume H0 is true (p = 0.3) and compute the smallest s such that:

P(X > s | p = 0.3) < 0.05

Meaning: if we obtain a “high” number of successes, we consider it more likely that the data are the realization of a random variable with p = 0.7.

(26)

The threshold s is the quantile of order 1 − α of X under H0

[Plots for p = 0.3: the probability density function and the cumulative distribution function, with the threshold s marked]

> s=qbinom(0.95,20,0.3); s
[1] 9

Decision rule here: “if in the 20 trials we obtain more than 9 successes, we reject H0 (p = 0.3)”

Terminology:

- the rejection region of the test is {10, 11, 12, . . . , 20}

- the test statistic is the number of ones in the sample
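A quick check that s = 9 is indeed the smallest threshold with P(X > s | p = 0.3) < 0.05, using the cumulative distribution function:

1-pbinom(9,20,0.3)    ## P(X > 9 | p = 0.3) ≈ 0.048 < 0.05
1-pbinom(8,20,0.3)    ## P(X > 8 | p = 0.3) ≈ 0.113 > 0.05, so s = 9 is the smallest threshold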

(27)

Review and comments

A hypothesis test is formed by:

- a null and an alternative hypothesis
- a test statistic (a function of the sample variables)
- a rejection region

The null is a default theory and we ask if the data provide sufficient evidence to reject the theory. If not, we retain it. We also say that we accept the null or fail to reject the null.

In the example, more than nine successes
- does not support the null hypothesis
- is evidence that the alternative hypothesis holds.

Rejecting the null does not necessarily imply accepting the alternative. We accept the alternative if the possible decisions are limited to the null and the alternative.

In order to investigate whether or not the data support the alternative, the test should be reformulated. Go back to slide 24 to see what changes.

(28)

The formulation of statistical hypothesis testing

Let X1, . . . , Xn ∼ F be a random sample.

Example.

Null hypothesis: the disease rate is the same in the groups A and B.

Alternative hypothesis: the disease rate is higher in the group A.

The rejection region R0: an appropriate subset of the outcomes.

Let x = (x1, . . . , xn) be the sample values. If x ∈ R0 we reject the null hypothesis, otherwise we do not reject it:

x ∈ R0 ⇒ reject H0
x ∉ R0 ⇒ retain (do not reject) H0

The test statistic T. Usually R0 has the form

R0 = {x such that T(x) > s}

where T is a test statistic and s is a critical value.

Problem: find appropriate T and s.
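For the binomial example, the decision rule can be written as a small R function; a minimal sketch with a hypothetical helper (not part of the slides) implementing "reject H0 when T(x) > s":

reject_H0=function(x, s=9) {      ## hypothetical helper, not from the slides
  T_x=sum(x)                      ## test statistic: number of successes in the sample
  T_x > s                         ## TRUE = reject H0, FALSE = retain H0
}
reject_H0(rbinom(20,1,0.3))       ## decision for one simulated sample of 20 trials under H0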

(29)

Behaviour in hypothesis testing

From the book: Larry Wasserman. All of Statistics. Springer. 2010. Chapter 6.1, p. 87

Hypothesis testing is like a legal trial. We assume some- one is innocent unless the evidence strongly suggests that he is guilty. Similarly, we retain H0 unless there is strong evidence to reject H0

[we look in the data for evidence against H0]

(30)

The decisions taken are affected by errors – Types of error

Two types of error:

type I error: rejecting H0 when H0 is true
type II error: retaining H0 when H0 is false

Usually the experimenter sets a maximum allowed probability α for the type I error (α = 0.1, 0.05, 0.01).

In the example with 20 tosses of a biased coin,

H0 : p = 0.3 H1 : p = 0.7

R0 = {more than 9 successes}

α was set to 0.05 (sum of the red probabilities in the plot).

[Plot: the binomial densities under H0 and H1, with the regions "retain H0" (x ≤ 9) and "reject H0" (x > 9) marked]

(31)

The probability of the type II error is indicated with β (the probability of retaining H0 when H1 is true).

It can happen that in the 20 trials you get 9 or fewer successes even if the true probability is 0.7

(sum of black probabilities in the plot)

[Plot: the binomial densities under H0 and H1, with the regions "retain H0" (x ≤ 9) and "reject H0" (x > 9) marked]

β is the cumulative distribution function of X under H1 evaluated at s:

β = P(X ≤ s | p = 0.7)

In our case: β = 0.017

> b=pbinom(9,20,0.7); round(b,3)
[1] 0.017

(32)

Types of error (continued)

                 DECISION                          PROBABILITY
            H0 retained     H0 rejected       H0 retained   H0 rejected
H0 true     correct         type I error      1 − α         α
H0 false    type II error   correct           β             1 − β

Cumulative distribution function plots under H0 and H1.

The threshold s is indicated.

Locate α and β

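The four probabilities of the table can be computed from the two cumulative distribution functions, here with the critical value s = 9 of the example:

n=20; s=9
alpha=1-pbinom(s,n,0.3)    ## P(reject H0 | H0 true)  ≈ 0.048
beta =pbinom(s,n,0.7)      ## P(retain H0 | H0 false) ≈ 0.017
round(matrix(c(1-alpha, alpha, beta, 1-beta), nrow=2, byrow=TRUE,
             dimnames=list(c("H0 true","H0 false"), c("H0 retained","H0 rejected"))), 3)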

(33)

Simulation in R

H0 : p = 0.3, H1 : p = 0.7 – T : number of successes – critical value s = 9

Simulate a binomial experiment

assuming H0:
> rbinom(1,20,0.3)
[1] 7
Correct decision

assuming H1:
> rbinom(1,20,0.7)
[1] 12
Correct decision

Simulate 100 binomial experiments and count how many times the test returns the correct decision.

assuming H0:
> a=rbinom(100,20,0.3)
> length(a[a<=9])
[1] 94
In 94% of cases the decision is correct

assuming H1:
> b=rbinom(100,20,0.7)
> length(b[b>9])
[1] 99
In 99% of cases the decision is correct

Compare α and β with these two percentages.
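The exact probabilities to compare with the two simulated percentages are:

pbinom(9,20,0.3)      ## 1 - alpha ≈ 0.952: probability of a correct decision under H0
1-pbinom(9,20,0.7)    ## 1 - beta  ≈ 0.983: probability of a correct decision under H1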

(34)

Formulation of the hypotheses

Examples.

1. A new drug should lower the probability of side effects. The best drug on the market has probability p = 0.2 of side effects

H0: p = 0.2 H1: p < 0.2

2. A new type of bulbs should increase how long a cut flower lasts. Currently they last on average less than 1500 days

H0: µ ≤ 1500 H1: µ > 1500

3. The percentage of left-handed Presidents of the USA is not 1/4, as it is in the general population.

H0: p = 1/4    H1: p ≠ 1/4

Reformulate 3. to investigate whether it is higher than in the general population

[we look in the data for evidence against H0]

(35)

One-sided and two-sided test

- Example 1: one-sided test (left)      H0: p = 0.2       H1: p < 0.2
- Example 2: one-sided test (right)     H0: µ ≤ 1500      H1: µ > 1500
- Example 3: two-sided test             H0: p = 1/4       H1: p ≠ 1/4

Simple and composite hypothesis

- H0: p = 1/4 simple

- H0: µ ≤ 1500 composite

(36)

The “form” of R0 depends on H1

H0: p = 0.3,    0.05 = P(X ∈ R0 | H0)

[Three plots of the B(20, 0.3) density with the rejection region marked, for H1: p > 0.3, H1: p < 0.3, and H1: p ≠ 0.3]

H1 : p > 0.3    one-sided (right)    R0 = {x > 9}
                s=qbinom((1-a),n,p)

H1 : p < 0.3    one-sided (left)     R0 = {x ≤ 2}
                s=qbinom(a,n,p)-1

H1 : p ≠ 0.3    two-sided            R0 = {x ≤ 1} ∪ {x > 10}
                s1=qbinom((a/2),n,p)-1
                s2=qbinom((1-a/2),n,p)
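The actual size of each rejection region can be checked with pbinom; all three are at most α = 0.05:

n=20; p=0.3
1-pbinom(9,n,p)                    ## one-sided right, R0 = {x > 9}:           ≈ 0.048
pbinom(2,n,p)                      ## one-sided left,  R0 = {x ≤ 2}:           ≈ 0.035
pbinom(1,n,p)+1-pbinom(10,n,p)     ## two-sided,       R0 = {x ≤ 1} ∪ {x > 10}: ≈ 0.025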

(37)

R coding for plots of page 24

n=20; p_0=0.3; p_1=0.7; a=0.05
s=qbinom((1-a),n,p_0)
x=seq(0,n)
y_0=dbinom(x,n,p_0)
y_1=dbinom(x,n,p_1)
plot(x+0.1,y_0,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ",col="red")
par(new=T)
plot(x-0.1,y_1,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ")
abline(h=0); abline(v=s+.5, col="blue",lwd=3)
## the bars are drawn slightly shifted (x+0.1 and x-0.1) for better visualisation

(38)

R coding for plots of page 35

n=20; p=0.3; a=0.05
par(mfrow=c(3,1))

s=qbinom((1-a),n,p)                       ## one-sided right
x1=seq(0,s); x2=seq(s+1,n)
y1=dbinom(x1,n,p); y2=dbinom(x2,n,p)
plot(x2,y2,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ",col="blue")
par(new=T)
plot(x1,y1,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,
     xlab=" ",ylab=" ",col="red", main="H0: p=0.3 -- H1: p>0.3")
abline(h=0); abline(v=s+.5, col="black",lwd=3)

s=qbinom(a,n,p); s=s-1; s                 ## one-sided left
x1=seq(0,s); x2=seq(s+1,n)
y1=dbinom(x1,n,p); y2=dbinom(x2,n,p)
plot(x2,y2,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ",col="red")
par(new=T)
plot(x1,y1,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,
     xlab=" ",ylab=" ",col="blue", main="H0: p=0.3 -- H1: p<0.3")
abline(h=0); abline(v=s+.5, col="black",lwd=3)

s1=qbinom((a/2),n,p); s1=s1-1; s1         ## two-sided
s2=qbinom((1-a/2),n,p); s2
x1=seq(0,s1); x2=seq(s1+1,s2); x3=seq(s2+1,20)
y1=dbinom(x1,n,p); y2=dbinom(x2,n,p); y3=dbinom(x3,n,p)
plot(x2,y2,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ",col="red")
par(new=T)
plot(x1,y1,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,xlab=" ",ylab=" ",col="blue")
par(new=T)
plot(x3,y3,xlim=c(0,n),ylim=c(0,.2),type="h",lwd=3,
     xlab=" ",ylab=" ",col="blue",main="H0: p=0.3 -- H1: p not 0.3")
abline(h=0); abline(v=c(s1+.5,s2+0.5), lwd=3)
