Inferential Statistics Part A
Eva Riccomagno, Maria Piera Rogantin
DIMA – Universit`a di Genova
http://www.dima.unige.it/~rogantin/UnigeStat/
Part A
Generalities and point estimation
• A1. Introduction
• A2. Population and samples
• A3. Point estimation
• A4. Sample mean and its distribution
• A5. Properties of estimators
• A6. A bit of probability: the law of large numbers
A1. Introduction to inferential statistics

analysis of the available sample → (probability) → information about the population or process
Data collection or experiment simulation – EDA – is used to assess unknown parameter values of the whole population.
Partial observation can be intrinsic to the problem:
- limitations due to costs (time and money)
- polls and elections
- invasive experiments (industrial, pharmaceutical)
- weather forecasts
From the book:
J. Maindonald and J. Brown. Data Analysis and Graphics. Cambridge University Press. 2010. Chapter 4, p. 102.
A random sample is a set of values drawn independently from a larger population. A (uniform) random sample has the characteristic that all members of the population have an equal chance of being drawn.
[. . . ]
The challenge is to use the one sample that is available, together with the assumption of independent and identically distributed sample values, to infer the sampling distribution of the mean.
Probability provides a theoretical model of the variability, used to predict behavior in non-sampled cases:
• starting from experience,
• formally consistent,
• able to describe the phenomenon,
• able to evaluate the inevitable approximations made in the transition from the partial information of the observed data to statements regarding the entire population or the entire phenomenon.
From the book: Larry Wasserman. All of Statistics. Springer. 2010. Chapter 6.1, p. 87.

Statistical inference, or "learning" as it is called in computer science, is the process of using data to infer the distribution that generated the data.
A typical statistical inference question is:
Given a sample (X1, . . . , Xn) ∼ F, how do we infer F? In some cases, we may want to infer only some feature of F, such as its mean.
Often one makes some assumptions about F, for instance that the sample is i.i.d.
Typically the observed sample is indicated with lower case letters x1, x2, . . . , xn
Statistics/Data Mining Dictionary

Statistics            Computer Science        Meaning
estimation            learning                using data to estimate an unknown quantity
classification        supervised learning     predicting a discrete Y from X
clustering            unsupervised learning   putting data into groups
data                  training sample         (X1, Y1), . . . , (Xn, Yn)
covariates            features                the Xi's
classifier            hypothesis              a map from covariates to outcomes
hypothesis            –                       subset of a parameter space Θ
confidence interval   –                       interval that contains an unknown quantity with given frequency
. . .
A2. Population and samples
Examples of populations: the inhabitants of a city, the groceries sold in a particular region.
It is important to select the observed sample appropriately; ideally the sample should be:
• representative of the population. Examples:
– if we study the average price of a product, the sample should not come only from supermarkets, but also from small shops
– if we study the blood pressure of the inhabitants of a region, the sample should not consist only of hospitalized patients
• formed by mutually independent elements. In the last example the sample should not be drawn from the same families.
A3. Point estimation

Let X be a random variable modeling the data.
Example Estimation of the mean µ of the systolic blood pressure (mmHg) X in the population. Sample of 8 subjects whose systolic blood pressure is:
x1 x2 x3 x4 x5 x6 x7 x8
125 128 133 136 126 129 131 135
Choice of the estimator – two examples:

- sample mean: X̄ = (X1 + X2 + · · · + Xn)/n
- mid-range: T = (max(X1, X2, . . . , Xn) + min(X1, X2, . . . , Xn))/2

Point estimates: x̄ = 130.375 ≈ 130 and t = 130.5 ≈ 131
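These two estimates can be reproduced in base R (the slide reports the rounded values):

```r
# systolic blood pressure sample (mmHg), from the table above
x <- c(125, 128, 133, 136, 126, 129, 131, 135)

xbar <- mean(x)                 # sample mean: 130.375
t    <- (max(x) + min(x)) / 2   # mid-range:   130.5

c(xbar, t)
```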
Which estimate and estimator should we choose?
The sample mean is the best estimator because it has good statistical and mathematical properties
A4. Estimation of the population mean

Effect of the sample choice: random selection. Consider the estimator sample mean X̄.

Which are the possible samples from a population? Which are the possible values of X̄? How likely is each value?
A small example
Population: 4 subjects A, B, C, D. Sample size: 2.
Systolic blood pressure (mmHg) of the 4 subjects – variable X:

  A    B    C    D
 125  129  131  133

µ = (125 + 129 + 131 + 133)/4 = 129.5

Each subject (each value) has probability 1/4 of being drawn.
Aim: estimation of µ by one sample of size 2, using the sample mean estimator
List of all samples and the corresponding sample means:

sample  x1   x2   x̄
AA      125  125  125
AB      125  129  127
AC      125  131  128
AD      125  133  129
BA      129  125  127
BB      129  129  129
BC      129  131  130
BD      129  133  131
CA      131  125  128
CB      131  129  130
CC      131  131  131
CD      131  133  132
DA      133  125  129
DB      133  129  131
DC      133  131  132
DD      133  133  133

Note that the value of the population mean is not a possible value of X̄.
Each sample has probability 1/16 of being drawn.
Ex: the sample mean 131 has probability 3/16.
In practice, only one sample will be drawn!
Values taken by the estimator X and their probabilities
x 125 127 128 129 130 131 132 133
P (X = x) 1/16 2/16 2/16 3/16 2/16 3/16 2/16 1/16
What is the probability of overestimating the mean value of blood pressure?
What is the probability of being more than 2 mmHg away from the true value?
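Both questions can be answered by enumerating the 16 equally likely samples; a sketch in base R:

```r
pop <- c(125, 129, 131, 133)          # blood pressure of A, B, C, D
mu  <- mean(pop)                      # population mean: 129.5

samples <- expand.grid(pop, pop)      # all 16 ordered samples of size 2
xbar    <- rowMeans(samples)          # the 16 sample means

mean(xbar > mu)                       # P(overestimate)      = 8/16 = 0.5
mean(abs(xbar - mu) > 2)              # P(|Xbar - mu| > 2)   = 6/16 = 0.375
```

Since the samples are equally likely, relative frequencies over the 16 cases are exactly the probabilities.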
Where is the randomness?
Why do we say that X is a random variable?
The randomness is in randomly drawing a sample and obtaining one of the possible values.
The probability of observing a certain sample value is written above and rests on the assumption that all 16 samples are equally likely (i.i.d. sampling). If the sampling scheme changes, then the sample distribution of X̄ changes accordingly.
Sample distribution of X̄ and its mean

[Plot: probabilities of the values 125–133 of X̄, ranging from 1/16 to 3/16]

E(X̄) = (125 + 2×127 + 2×128 + 3×129 + 2×130 + 3×131 + 2×132 + 133)/16 = 129.5
X̄ is centered at the parameter µ we want to estimate.
In general: the mean of X̄ is equal to the population mean!
Moreover, the distribution of X̄ concentrates around µ with high probability when n is "large".
Sample distributions of X and X̄ and their variances

[Plots: distribution of the variable X (left) and of the sample mean X̄ (right), on the same scale 124–134]

X̄ has smaller variance than X.
In general for i.i.d. samples:

Var(X̄) = Var(X)/n, with n the sample size

In the example: Var(X̄) = Var(X)/2
Var(X) = (1/4) [(125 − 129.5)² + (129 − 129.5)² + (131 − 129.5)² + (133 − 129.5)²] = 8.75

Var(X̄) = (1/16) [(125 − 129.5)² + 2×(127 − 129.5)² + 2×(128 − 129.5)² + 3×(129 − 129.5)² + 2×(130 − 129.5)² + 3×(131 − 129.5)² + 2×(132 − 129.5)² + (133 − 129.5)²] = 4.375
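Both variances can be checked numerically. Note that these are population variances, dividing by the number of cases rather than by n − 1, so R's var() is deliberately not used here:

```r
pop  <- c(125, 129, 131, 133)
xbar <- rowMeans(expand.grid(pop, pop))    # the 16 sample means

pvar <- function(v) mean((v - mean(v))^2)  # population variance: divide by N
pvar(pop)    # Var(X)    = 8.75
pvar(xbar)   # Var(Xbar) = 4.375 = 8.75 / 2
```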
Summary on the sample mean random variable

x1, x2, . . . , xn observed values – sample values
X1, X2, . . . , Xn ∼ F i.i.d. sample random variables with mean µ and variance σ²:
E(X1) = µ, Var(X1) = σ²

Let X̄ be the sample mean random variable

X̄ = (X1 + X2 + · · · + Xn)/n

It has theoretical mean µ and variance σ²/n:

E(X̄) = µ, Var(X̄) = σ²/n

Sometimes σ/√n is denoted SEM (standard error of the mean).
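Both identities follow from linearity of the expectation and, for the variance, from the independence of the Xi:

```latex
E(\bar X) \;=\; \frac{1}{n}\sum_{i=1}^{n} E(X_i) \;=\; \frac{1}{n}\,n\mu \;=\; \mu,
\qquad
\operatorname{Var}(\bar X) \;=\; \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
\;=\; \frac{1}{n^{2}}\,n\sigma^{2} \;=\; \frac{\sigma^{2}}{n}.
```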
The unbiased estimator of the variance of a variable X in the population

X1, . . . , Xn ∼ F i.i.d. – a sample
σ² the variance of each Xi, i = 1, . . . , n

The estimator

S² = (1/(n − 1)) Σⁿᵢ₌₁ (Xi − X̄)²

is unbiased, i.e. its mean is σ².

The theoretical variance of S² is

Var(S²) = (1/n) (µ₄ − ((n − 3)/(n − 1)) σ⁴)

where µ₄ is the fourth central moment of X.
Sample distribution of S2, estimator of the variance of a variable X, and its mean
Example (continued)
Systolic blood pressure of 4 subjects (variable X)
List of all samples i.i.d. (n = 2) and the corresponding s2.
sample  x1   x2   x̄    s² = Σ²ᵢ₌₁ (xi − x̄)²/(n − 1)
AA      125  125  125    0
AB      125  129  127    8 = [(125 − 127)² + (129 − 127)²]/1
AC      125  131  128   18 = [(125 − 128)² + (131 − 128)²]/1
AD      125  133  129   32 = . . .
BA      129  125  127    8
BB      129  129  129    0
BC      129  131  130    2
BD      129  133  131    8
CA      131  125  128   18
CB      131  129  130    2
CC      131  131  131    0
CD      131  133  132    2
DA      133  125  129   32
DB      133  129  131    8
DC      133  131  132    2
DD      133  133  133    0

Each sample has probability 1/16 of being drawn.
Values taken by the estimator S2 and their probabilities
s2 0 2 8 18 32
P (S2 = s2) 4/16 4/16 4/16 2/16 2/16
[Plot: probabilities of the values of S²]

E(S²) = (4 × 0 + 4 × 2 + 4 × 8 + 2 × 18 + 2 × 32)/16 = 8.75
S² is centered at the parameter σ² we want to estimate.
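The whole table and the value E(S²) = 8.75 = Var(X) can be reproduced in R; var() already uses the 1/(n − 1) formula:

```r
pop <- c(125, 129, 131, 133)
samples <- expand.grid(pop, pop)   # the 16 ordered samples of size 2
s2 <- apply(samples, 1, var)       # var() divides by n - 1 = 1

table(s2) / 16                     # distribution of S^2: 4/16, 4/16, 4/16, 2/16, 2/16
mean(s2)                           # E(S^2) = 8.75 = Var(X)
```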
A5. Properties of estimators

A point estimator T of a parameter θ should be
• unbiased or centered (its mean is θ)
• consistent (unbiased and its variance tends to 0 as the sample size tends to infinity)
X̄ is a consistent estimator of µ.

Example
For estimating the maximum length θ of a Mikado stick, suppose that the length is uniform in (0, θ), or in (a, θ + a) with a known. We could consider
• the maximum in the sample, max{X1, . . . , Xn}
• twice the sample mean, 2X̄
The sample distributions of the two estimators are plotted on the right.
[Plots: sampling distributions of max{X1, . . . , Xn} (left) and 2X̄ (right), with the true value θ marked on the axis]
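The two sampling distributions can be approximated by simulation; θ = 8 and n = 10 here are made-up illustration values, not taken from the slides:

```r
set.seed(1)
theta <- 8; n <- 10; R <- 10000        # illustration values (assumptions)

samp <- matrix(runif(n * R, 0, theta), nrow = R)
est_max  <- apply(samp, 1, max)        # never exceeds theta, biased downwards
est_2bar <- 2 * rowMeans(samp)         # unbiased, but larger spread near theta

c(mean(est_max), mean(est_2bar))       # approx. theta * n/(n+1) and theta
c(var(est_max),  var(est_2bar))
```

A histogram of the two vectors reproduces the shape of the two plots: the sample maximum piles up just below θ, while 2X̄ is symmetric around θ.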
Aside
Random selection is not the only reason why
• probability models are used to make inference from the sample to a large population
• uncertainty associated to inference is modelled by random samples and probability
Sources of uncertainty
• random selection
• measurement (e.g. in a Lab due to equipment, unit under test, operator, calibration of the measurement instrument, . . . )
• not yet observed process
• intrinsically uncertain outcome
• . . .
A6. A bit of probability: the law of large numbers

X1, X2, . . . , Xn i.i.d. random variables with mean µ and variance σ².

Let Sn = X1 + X2 + · · · + Xn be the sum of the sample random variables, and let X̄n∗ be the sample mean random variable, X̄n = Sn/n, with mean µ and variance σ²/n.

Law of large numbers (LLN)
If the sample size n grows to infinity, the probability that X̄n takes values outside the interval (µ − δ, µ + δ) goes to zero, for any positive δ.

More precisely: P(|X̄n − µ| > δ) → 0 as n → ∞.
The distribution of Xn becomes more concentrated around µ as n gets large, where µ is both the parameter to be estimated and the expectation (or theoretical mean) of Xn
In other words, the distribution of Xn piles up near µ
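When σ² is finite, the statement follows directly from Chebyshev's inequality applied to X̄n:

```latex
P\bigl(|\bar X_n - \mu| > \delta\bigr)
\;\le\; \frac{\operatorname{Var}(\bar X_n)}{\delta^{2}}
\;=\; \frac{\sigma^{2}}{n\,\delta^{2}}
\;\xrightarrow{\;n\to\infty\;}\; 0 .
```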
∗Here we add the index n to make the dependence on the sample size explicit.
Trusting in the experience and the LLN
The LLN is a theorem that describes the result of performing the same experiment a large number of times
The LLN “guarantees” stable long-term results for the averages of random events
For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. It is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the other.
From Wikipedia https://en.wikipedia.org/wiki/Law_of_large_numbers
Simulation in R of an experiment with two outcomes:
1 and 0
Example: probability of 1: 0.3 – probability of 0: 0.7
(we will analyze this type of experiment in detail later)
> out = rbinom(10, 1, 0.3); out   ## 10 trials
 [1] 1 0 0 1 0 0 1 0 0 1
> mean(out)
[1] 0.4

> out = rbinom(100, 1, 0.3); out  ## 100 trials
  [1] 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
 [33] 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0
 [65] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0
 [97] 0 1 0 0
> mean(out)
[1] 0.29

> out = rbinom(1000, 1, 0.3); mean(out)   ## 1000 trials
[1] 0.31
> out = rbinom(10000, 1, 0.3); mean(out)  ## 10000 trials
[1] 0.3005