Inferential Statistics Part A
Eva Riccomagno, Maria Piera Rogantin
DIMA – Universit`a di Genova
http://www.dima.unige.it/~rogantin/UnigeStat/
Part A
Generalities and point estimation
• A1. Introduction
• A2. Population and samples
• A3. Point estimation
• A4. Sample mean and its distribution
• A5. Properties of estimators
• A6. A bit of probability: the law of large numbers
A1. Introduction to inferential statistics

analysis of the available sample → (probability) → information about the population or process
Data collection or experiment simulation – EDA – is used to assess unknown parameter values of the whole population.
Partial observation can be intrinsic to the problem:
- limitations due to costs (time and money)
- polls and elections
- invasive experiments (industrial, pharmaceutical)
- weather forecasts
From the book:
J. Maindonald and J. Brown. Data Analysis and Graphics. Cambridge University Press. 2010. Chapter 4, p. 102.
A random sample is a set of values drawn independently from a larger population. A (uniform) random sample has the characteristic that all members of the population have an equal chance of being drawn.
[. . . ]
The challenge is to use the one sample that is available, together with the assumption of independent and identically distributed sample values, to infer the sampling distribution of the mean.
Probability provides a theoretical model of the variability, used to predict behavior in non-sampled cases:
• starting from experience,
• formally consistent,
• able to describe the phenomenon,
• able to evaluate the inevitable approximations made in the transition from the partial information of the observed data to statements regarding the entire population or the entire phenomenon.
From the book: Larry Wasserman. All of Statistics. Springer. 2010. Chapter 6.1, p. 87.

Statistical inference, or "learning" as it is called in computer science, is the process of using data to infer the distribution that generated the data.
A typical statistical inference question is:
Given a sample (X1, . . . , Xn) ∼ F, how do we infer F? In some cases, we may want to infer only some feature of F, such as its mean.
Often one makes some assumptions about F, for instance that the sample is i.i.d.
Typically the observed sample is indicated with lower case letters x1, x2, . . . , xn
Statistics/Data Mining Dictionary

Statistics            Computer Science        Meaning
estimation            learning                using data to estimate an unknown quantity
classification        supervised learning     predicting a discrete Y from X
clustering            unsupervised learning   putting data into groups
data                  training sample         (X1, Y1), . . . , (Xn, Yn)
covariates            features                the Xi's
classifier            hypothesis              a map from covariates to outcomes
hypothesis            –                       subset of a parameter space Θ
confidence interval   –                       interval that contains an unknown quantity with given frequency
. . .
A2. Population and samples
Examples of populations: the inhabitants of a city, the groceries sold in a particular region.
It is important to select the observed sample appropriately; ideally the sample should be:
• representative of the population. Examples:
– if we study the average price of a product, the sample should not come only from supermarkets, but also from small shops
– if we study the blood pressure of the inhabitants of a region, the sample should not consist only of hospitalized patients
• formed by mutually independent elements. In the last example the sample should not be drawn from the same families.
A3. Point estimation

Let X be a random variable modeling the data.
Example Estimation of the mean µ of the systolic blood pressure (mmHg) X in the population. Sample of 8 subjects whose systolic blood pressure is:
x1 x2 x3 x4 x5 x6 x7 x8
125 128 133 136 126 129 131 135
Choice of the estimator – two examples:

- sample mean: X̄ = (X1 + X2 + · · · + Xn)/n
- mid-range: T = (max(X1, X2, . . . , Xn) + min(X1, X2, . . . , Xn))/2

Point estimates: x̄ = 130.375 ≈ 130 and t = 130.5 ≈ 131
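These two estimates can be reproduced in base R (the slide reports the rounded values):

```r
# systolic blood pressure sample (mmHg), from the table above
x <- c(125, 128, 133, 136, 126, 129, 131, 135)

xbar <- mean(x)                 # sample mean: 130.375
t    <- (max(x) + min(x)) / 2   # mid-range:   130.5

c(xbar, t)
```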
Which estimate and estimator should we choose?
The sample mean is the best estimator because it has good statistical and mathematical properties
A4. Estimation of the population mean

Effect of the sample choice: random selection. Consider the estimator sample mean X̄.

Which are the possible samples from a population? Which are the possible values of X̄? How likely is each value?
A small example
Population: 4 subjects A, B, C, D. Sample size: 2.
Systolic blood pressure (mmHg) of the 4 subjects – variable X:

  A    B    C    D
 125  129  131  133

µ = (125 + 129 + 131 + 133)/4 = 129.5

Each subject (each value) has probability 1/4 of being drawn.
Aim: estimation of µ by one sample of size 2, using the sample mean estimator
List of all samples and the corresponding sample means:

sample  x1   x2   x̄
AA      125  125  125
AB      125  129  127
AC      125  131  128
AD      125  133  129
BA      129  125  127
BB      129  129  129
BC      129  131  130
BD      129  133  131
CA      131  125  128
CB      131  129  130
CC      131  131  131
CD      131  133  132
DA      133  125  129
DB      133  129  131
DC      133  131  132
DD      133  133  133

Note that the value of the population mean is not a possible value of X̄.
Each sample has probability 1/16 of being drawn.
Ex: the sample mean 131 has probability 3/16.
In practice, only one sample will be drawn!
Values taken by the estimator X and their probabilities
x 125 127 128 129 130 131 132 133
P (X = x) 1/16 2/16 2/16 3/16 2/16 3/16 2/16 1/16
What is the probability of overestimating the mean value of blood pressure?
What is the probability of being more than 2 mmHg away from the true value?
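Both questions can be answered by enumerating the 16 equally likely samples; a sketch in base R:

```r
pop <- c(125, 129, 131, 133)          # blood pressure of A, B, C, D
mu  <- mean(pop)                      # population mean: 129.5

samples <- expand.grid(pop, pop)      # all 16 ordered samples of size 2
xbar    <- rowMeans(samples)          # the 16 sample means

mean(xbar > mu)                       # P(overestimate)      = 8/16 = 0.5
mean(abs(xbar - mu) > 2)              # P(|Xbar - mu| > 2)   = 6/16 = 0.375
```

Since the samples are equally likely, relative frequencies over the 16 cases are exactly the probabilities.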
Where is the randomness?
Why do we say that X is a random variable?
The randomness is in randomly drawing a sample and obtaining one of the possible values.
The probability of observing a certain sample value is written above and rests on the assumption that all 16 samples are equally likely (i.i.d. sampling). If the sampling scheme changes, then the sample distribution of X̄ changes accordingly.
Sample distribution of X̄ and its mean

[Plot: probabilities of the values 125–133 of X̄, ranging from 1/16 to 3/16]

E(X̄) = (125 + 2×127 + 2×128 + 3×129 + 2×130 + 3×131 + 2×132 + 133)/16 = 129.5
X̄ is centered at the parameter µ we want to estimate.
In general: the mean of X̄ is equal to the population mean!
Moreover, the distribution of X̄ concentrates around µ with high probability when n is "large".
Sample distributions of X and X̄ and their variances

[Plots: distribution of the variable X (left) and of the sample mean X̄ (right), on the same scale 124–134]

X̄ has smaller variance than X.
In general for i.i.d. samples:

Var(X̄) = Var(X)/n, with n the sample size

In the example: Var(X̄) = Var(X)/2
Var(X) = (1/4) [(125 − 129.5)² + (129 − 129.5)² + (131 − 129.5)² + (133 − 129.5)²] = 8.75

Var(X̄) = (1/16) [(125 − 129.5)² + 2×(127 − 129.5)² + 2×(128 − 129.5)² + 3×(129 − 129.5)² + 2×(130 − 129.5)² + 3×(131 − 129.5)² + 2×(132 − 129.5)² + (133 − 129.5)²] = 4.375
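Both variances can be checked numerically. Note that these are population variances, dividing by the number of cases rather than by n − 1, so R's var() is deliberately not used here:

```r
pop  <- c(125, 129, 131, 133)
xbar <- rowMeans(expand.grid(pop, pop))    # the 16 sample means

pvar <- function(v) mean((v - mean(v))^2)  # population variance: divide by N
pvar(pop)    # Var(X)    = 8.75
pvar(xbar)   # Var(Xbar) = 4.375 = 8.75 / 2
```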
Summary on the sample mean random variable

x1, x2, . . . , xn observed values – sample values
X1, X2, . . . , Xn ∼ F i.i.d. sample random variables with mean µ and variance σ²:
E(X1) = µ, Var(X1) = σ²

Let X̄ be the sample mean random variable

X̄ = (X1 + X2 + · · · + Xn)/n

It has theoretical mean µ and variance σ²/n:

E(X̄) = µ, Var(X̄) = σ²/n

Sometimes σ/√n is denoted SEM (standard error of the mean).
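Both identities follow from linearity of the expectation and, for the variance, from the independence of the Xi:

```latex
E(\bar X) \;=\; \frac{1}{n}\sum_{i=1}^{n} E(X_i) \;=\; \frac{1}{n}\,n\mu \;=\; \mu,
\qquad
\operatorname{Var}(\bar X) \;=\; \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
\;=\; \frac{1}{n^{2}}\,n\sigma^{2} \;=\; \frac{\sigma^{2}}{n}.
```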
The unbiased estimator of the variance of a variable X in the population

X1, . . . , Xn ∼ F i.i.d. – a sample
σ² the variance of each Xi, i = 1, . . . , n

The estimator

S² = (1/(n − 1)) Σⁿᵢ₌₁ (Xi − X̄)²

is unbiased, i.e. its mean is σ².

The theoretical variance of S² is

Var(S²) = (1/n) (µ₄ − ((n − 3)/(n − 1)) σ⁴)

where µ₄ is the fourth central moment of X.
Sample distribution of S2, estimator of the variance of a variable X, and its mean
Example (continued)
Systolic blood pressure of 4 subjects (variable X)
List of all samples i.i.d. (n = 2) and the corresponding s2.
sample  x1   x2   x̄    s² = Σ²ᵢ₌₁ (xi − x̄)²/(n − 1)
AA      125  125  125    0
AB      125  129  127    8 = [(125 − 127)² + (129 − 127)²]/1
AC      125  131  128   18 = [(125 − 128)² + (131 − 128)²]/1
AD      125  133  129   32 = . . .
BA      129  125  127    8
BB      129  129  129    0
BC      129  131  130    2
BD      129  133  131    8
CA      131  125  128   18
CB      131  129  130    2
CC      131  131  131    0
CD      131  133  132    2
DA      133  125  129   32
DB      133  129  131    8
DC      133  131  132    2
DD      133  133  133    0

Each sample has probability 1/16 of being drawn.
Values taken by the estimator S2 and their probabilities
s2 0 2 8 18 32
P (S2 = s2) 4/16 4/16 4/16 2/16 2/16
[Plot: probabilities of the values of S²]

E(S²) = (4 × 0 + 4 × 2 + 4 × 8 + 2 × 18 + 2 × 32)/16 = 8.75
S² is centered at the parameter σ² we want to estimate.
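The whole table and the value E(S²) = 8.75 = Var(X) can be reproduced in R; var() already uses the 1/(n − 1) formula:

```r
pop <- c(125, 129, 131, 133)
samples <- expand.grid(pop, pop)   # the 16 ordered samples of size 2
s2 <- apply(samples, 1, var)       # var() divides by n - 1 = 1

table(s2) / 16                     # distribution of S^2: 4/16, 4/16, 4/16, 2/16, 2/16
mean(s2)                           # E(S^2) = 8.75 = Var(X)
```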
A5. Properties of estimators

A point estimator T of a parameter θ should be
• unbiased or centered (its mean is θ)
• consistent (unbiased and its variance tends to 0 as the sample size tends to infinity)
X̄ is a consistent estimator of µ.

Example
For estimating the maximum length θ of a Mikado stick, suppose that the length is uniform in (0, θ), or in (a, θ + a) with a known. We could consider
• the maximum in the sample, max{X1, . . . , Xn}
• twice the sample mean, 2X̄
The sample distributions of the two estimators are plotted on the right.
[Plots: sampling distributions of max{X1, . . . , Xn} (left) and 2X̄ (right), with the true value θ marked on the axis]
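The two sampling distributions can be approximated by simulation; θ = 8 and n = 10 here are made-up illustration values, not taken from the slides:

```r
set.seed(1)
theta <- 8; n <- 10; R <- 10000        # illustration values (assumptions)

samp <- matrix(runif(n * R, 0, theta), nrow = R)
est_max  <- apply(samp, 1, max)        # never exceeds theta, biased downwards
est_2bar <- 2 * rowMeans(samp)         # unbiased, but larger spread near theta

c(mean(est_max), mean(est_2bar))       # approx. theta * n/(n+1) and theta
c(var(est_max),  var(est_2bar))
```

A histogram of the two vectors reproduces the shape of the two plots: the sample maximum piles up just below θ, while 2X̄ is symmetric around θ.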
Aside
Random selection is not the only reason why
• probability models are used to make inference from the sample to a large population
• uncertainty associated to inference is modelled by random samples and probability
Sources of uncertainty
• random selection
• measurement (e.g. in a Lab due to equipment, unit under test, operator, calibration of the measurement instrument, . . . )
• not yet observed process
• intrinsically uncertain outcome
• . . .
A6. A bit of probability: the law of large numbers

X1, X2, . . . , Xn i.i.d. random variables with mean µ and variance σ².

Let Sn = X1 + X2 + · · · + Xn be the sum of the sample random variables, and let X̄n∗ be the sample mean random variable, X̄n = Sn/n, with mean µ and variance σ²/n.

Law of large numbers (LLN)
If the sample size n grows to infinity, the probability that X̄n takes values outside the interval (µ − δ, µ + δ) goes to zero, for any positive δ.

More precisely: P(|X̄n − µ| > δ) → 0 as n → ∞.
The distribution of Xn becomes more concentrated around µ as n gets large, where µ is both the parameter to be estimated and the expectation (or theoretical mean) of Xn
In other words, the distribution of Xn piles up near µ
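When σ² is finite, the statement follows directly from Chebyshev's inequality applied to X̄n:

```latex
P\bigl(|\bar X_n - \mu| > \delta\bigr)
\;\le\; \frac{\operatorname{Var}(\bar X_n)}{\delta^{2}}
\;=\; \frac{\sigma^{2}}{n\,\delta^{2}}
\;\xrightarrow{\;n\to\infty\;}\; 0 .
```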
∗Here we add the index n to make the dependence on the sample size explicit.
Trusting in the experience and the LLN
The LLN is a theorem that describes the result of performing the same experiment a large number of times
The LLN “guarantees” stable long-term results for the averages of random events
For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the other.
From Wikipedia https://en.wikipedia.org/wiki/Law_of_large_numbers
Simulation in R of an experiment with two outcomes:
1 and 0
Example: probability of 1: 0.3 – probability of 0: 0.7
(we will analyze this type of experiment in detail later)
> out = rbinom(10, 1, 0.3); out   ## 10 trials
 [1] 1 0 0 1 0 0 1 0 0 1
> mean(out)
[1] 0.4

> out = rbinom(100, 1, 0.3); out  ## 100 trials
  [1] 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
 [33] 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0
 [65] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0
 [97] 0 1 0 0
> mean(out)
[1] 0.29

> out = rbinom(1000, 1, 0.3); mean(out)   ## 1000 trials
[1] 0.31
> out = rbinom(10000, 1, 0.3); mean(out)  ## 10000 trials
[1] 0.3005