Estimation of linear model under normality assumptions

Assume
$$Y = X\beta + E, \qquad E \sim N(0, \sigma^2 I), \quad (1)$$
i.e. the errors are independent, homoscedastic and normally distributed.

Then $Y \sim N(X\beta, \sigma^2 I)$. (2)

If $y$ is the vector of observed data and $\theta = (\beta, \sigma^2)$ are the parameters to estimate, the likelihood $L(\theta \mid y)$ is

$$L(\theta \mid y) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2}\|y - X\beta\|^2\Big\}. \quad (3)$$

$$\log L(\theta \mid y) = LL(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2. \quad (4)$$

Maximizing the likelihood is equivalent to minimizing
$$\frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\|y - X\beta\|^2.$$
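As a numerical check, here is a minimal R sketch (simulated data; all names are illustrative) evaluating the log-likelihood (4) at the maximum-likelihood estimates and comparing it with the value reported by R's logLik:

    set.seed(0)
    n <- 30
    x <- rnorm(n)
    y <- 2 + x + rnorm(n)
    fit <- lm(y ~ x)

    rss <- sum(resid(fit)^2)
    s2  <- rss / n                                   # MLE of sigma^2 (see next slide)
    -n/2 * log(2*pi) - n/2 * log(s2) - rss / (2*s2)  # log-likelihood (4) at the MLE
    logLik(fit)                                      # should agree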


Maximum likelihood estimation

We need to find $\beta$ and $\sigma^2$ that minimize
$$\frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\|y - X\beta\|^2. \quad (5)$$

It is clear that the minimizing $\beta$ is the one that minimizes $\|y - X\beta\|^2$, i.e. the least-squares estimator
$$\hat\beta = (X^t X)^{-1} X^t Y.$$

Taking the derivative of (5) with respect to $\sigma^2$, one obtains the MLE
$$\tilde\sigma^2 = \frac{\|y - X\hat\beta\|^2}{n}. \quad (6)$$

Actually $\tilde\sigma^2$ is a biased estimator of $\sigma^2$, while
$$\hat\sigma^2 = \frac{\|Y - X\hat\beta\|^2}{n - p - 1} \qquad [X:\ n \times (p+1)\ \text{matrix}] \quad (7)$$
is an unbiased estimator of $\sigma^2$ and is generally preferred.
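A minimal R sketch (simulated data; names illustrative) computing the closed-form estimates above and checking them against lm:

    set.seed(1)
    n <- 100; p <- 2
    X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with intercept column
    y <- X %*% c(1, 2, -0.5) + rnorm(n, sd = 1.5)

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
    rss <- sum((y - X %*% beta_hat)^2)
    rss / n                                     # biased MLE (6)
    rss / (n - p - 1)                           # unbiased estimator (7)

    fit <- lm(y ~ X - 1)                        # same fit via lm
    summary(fit)$sigma^2                        # equals (7)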


A property of normal distributions

Theorem (Cochran)

Assume $Y \sim N(0, \sigma^2 I)$ ($n$-dimensional) and let $E_1, \dots, E_k$ (of dimensions $n_i$) be subspaces of $\mathbb{R}^n$ orthogonal to each other, with $P_i$ the corresponding orthogonal projections.

Then the $P_i Y$ are independent (and normal), and $\sigma^{-2}\|P_i Y\|^2 \sim \chi^2(n_i)$.

In particular, if $P$ is an orthogonal projection of dimension $k$, then $PY$ and $(I - P)Y$ are independent and $\sigma^{-2}\|(I - P)Y\|^2 \sim \chi^2(n - k)$.

Since $W \sim \chi^2(k)$ implies $E(W) = k$, it follows that $E(\|Y - X\hat\beta\|^2) = \sigma^2(n - p - 1)$, hence
$$\hat\sigma^2 = \frac{\|Y - X\hat\beta\|^2}{n - p - 1}$$
is an unbiased estimator of $\sigma^2$.
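A quick simulation in R (a sketch with illustrative parameters) checking that $\sigma^{-2}\|(I - P)Y\|^2$ behaves like a $\chi^2(n - p - 1)$ variable:

    set.seed(2)
    n <- 50; p <- 2; sigma <- 2
    X <- cbind(1, matrix(rnorm(n * p), n, p))
    P <- X %*% solve(t(X) %*% X) %*% t(X)      # orthogonal projection onto col(X)

    rss <- replicate(5000, {
      y <- X %*% c(1, 1, 1) + rnorm(n, sd = sigma)
      sum(((diag(n) - P) %*% y)^2)             # ||(I - P) y||^2 = ||y - X beta-hat||^2
    })
    mean(rss / sigma^2)                        # close to n - p - 1 = 47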


Reminders on multivariate normal

Definition

$Y = (Y_1, \dots, Y_n)$ is multivariate normal if, for all $a \in \mathbb{R}^n$, $a^t Y$ is univariate normal.

Equivalently, $Y$ is multivariate normal $\iff$ there exist $b \in \mathbb{R}^n$, an $n \times m$ matrix $A$, and $X = (X_1, \dots, X_m)$ independent standard normal r.v.'s such that $Y = AX + b$. It follows that $E(Y) = b$ and $\mathrm{Cov}(Y) = AA^t$, i.e. $Y \sim N(b, AA^t)$. An alternative characterization is via the characteristic function.

If $\mathrm{Cov}(Y) = S$ is positive definite (i.e. invertible), $Y \sim N(\mu, S)$ has density
$$f_Y(y) = (2\pi)^{-n/2}\,|S|^{-1/2} \exp\{-(y - \mu)^t S^{-1} (y - \mu)/2\}$$
(a non-singular distribution).

When projecting onto a subspace, one gets singular distributions.
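A short R sketch (with an arbitrary illustrative $A$ and $b$) of the characterization $Y = AX + b$, checking empirically that $E(Y) = b$ and $\mathrm{Cov}(Y) = AA^t$:

    set.seed(3)
    m <- 2; n <- 3
    A <- matrix(c(1, 0, 2, 1, 1, -1), n, m)    # arbitrary n x m matrix
    b <- c(1, 2, 3)

    Y <- replicate(20000, as.vector(A %*% rnorm(m)) + b)  # columns are draws of AX + b
    rowMeans(Y)                                # close to b
    cov(t(Y))                                  # close to A %*% t(A)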


Confidence intervals for $\beta_i$

From Cochran's theorem, $\hat Y = X\hat\beta = PY$ and $Y - \hat Y$ are independent ($P$ the projection onto the subspace generated by the columns of $X$). Hence the same is true for $\hat\beta$ and
$$\hat\sigma^2 = \frac{\|Y - X\hat\beta\|^2}{n - p - 1}.$$

Moreover we know that $\hat\beta \sim N(\beta, \sigma^2 (X^t X)^{-1})$. Letting $M = (X^t X)^{-1}$, it follows that

$$\sqrt{n - p - 1}\; \frac{(\hat\beta_i - \beta_i)\big/\sqrt{\sigma^2 M_{ii}}}{\sqrt{\sigma^{-2}\|\hat Y - Y\|^2}} = \frac{\hat\beta_i - \beta_i}{\sqrt{M_{ii}\,\hat\sigma^2}}$$
follows a t-distribution with $n - p - 1$ degrees of freedom.

Then $\hat\beta_i \pm t_\gamma \sqrt{M_{ii}\,\hat\sigma^2}$ is a $\gamma$ confidence interval for $\beta_i$. Correspondingly, $T = \hat\beta_i \big/ \sqrt{M_{ii}\,\hat\sigma^2}$ is a test statistic for the hypothesis $\beta_i = 0$ against $\beta_i \neq 0$.
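A sketch in R (simulated data; names illustrative) building these intervals by hand and checking them against confint:

    set.seed(4)
    d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
    d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(40)
    fit <- lm(y ~ x1 + x2, data = d)

    X  <- model.matrix(fit)
    M  <- solve(t(X) %*% X)                     # M = (X'X)^{-1}
    s2 <- sum(resid(fit)^2) / df.residual(fit)  # sigma^2-hat, df = n - p - 1
    se <- sqrt(diag(M) * s2)                    # sqrt(M_ii sigma^2-hat)
    tg <- qt(0.975, df.residual(fit))           # t_gamma for gamma = 0.95
    cbind(coef(fit) - tg * se, coef(fit) + tg * se)
    confint(fit)                                # should agree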


How to read the output of lm in R
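The original slide showed a screenshot of R output; as a stand-in, a minimal sketch (simulated data) producing the table the slide walks through:

    set.seed(5)
    x <- rnorm(30)
    y <- 1 + 0.8 * x + rnorm(30)
    summary(lm(y ~ x))
    # "Coefficients": per coefficient, the estimate beta-hat_i, its standard
    # error sqrt(M_ii sigma^2-hat), the t statistic for H0: beta_i = 0 and the
    # corresponding p-value. "Residual standard error" is sigma-hat, reported
    # with its n - p - 1 degrees of freedom.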


Hypothesis testing in linear models

Remember: everything here is under the assumption $E \sim N(0, \sigma^2 I)$.

A general test can be written as
$$H_0: \beta \in V_0 \quad \text{against} \quad H_1: \beta \in V_1 \setminus V_0,$$
with $V_0$, $V_1$ subspaces and $V_0 \subset V_1 \subset \mathbb{R}^{p+1}$ ($p + 1$ the rank of the matrix $X$).

Example 1: $V_0 = \{\beta_1 = \dots = \beta_p = 0\}$, $V_1 = \mathbb{R}^{p+1}$ (test of the whole regression).

Example 2: $V_0 = \{\beta_i = 0\}$, $V_1 = \mathbb{R}^{p+1}$ (test of the relevance of column $i$).

Example 3: $V_0 = \{\beta_i = \beta_j\}$, $V_1 = \mathbb{R}^{p+1}$ (are the two coefficients equal?); see the sketch below.
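Anticipating the F test of the next slides, a sketch of Example 3 in R (reusing the simulated data frame d from the confidence-interval sketch above): under $H_0$ the two columns share a single coefficient, so the reduced model regresses on their sum.

    fit0 <- lm(y ~ I(x1 + x2), data = d)   # beta in V0: beta_1 = beta_2
    fit1 <- lm(y ~ x1 + x2, data = d)      # beta in V1
    anova(fit0, fit1)                      # F test of H0: beta_1 = beta_2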


F test

Theorem

The critical regions $C$ can be written as
$$C = \{Y : F > c\}, \qquad F = \frac{\|X\hat\beta_1 - X\hat\beta_0\|^2 / (p_1 - p_0)}{\|Y - X\hat\beta_1\|^2 / (n - p_1)}. \quad (8)$$
If $\beta \in V_0$, then $F$ follows an $F(p_1 - p_0, n - p_1)$ distribution; $p_0$ and $p_1$ are the dimensions of $V_0$ and $V_1$, and $\hat\beta_0$, $\hat\beta_1$ the respective estimates.

An $F(q, r)$ distribution is the ratio of two independent $\chi^2$ variables, each divided by its degrees of freedom: the numerator $\chi^2(q)$, the denominator $\chi^2(r)$.

In practice, one computes $F_{obs}$ and finds the p-value, i.e. $P(F(p_1 - p_0, n - p_1) > F_{obs})$.
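A sketch of this computation in R (simulated data), comparing the hand-computed F and p-value with anova; it uses the fact that, for nested models, $\|X\hat\beta_1 - X\hat\beta_0\|^2 = RSS_0 - RSS_1$:

    set.seed(6)
    d2 <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
    d2$y <- 1 + 0.5 * d2$x1 + rnorm(50)

    fit0 <- lm(y ~ x1, data = d2)        # beta in V0 (the coefficient of x2 is 0)
    fit1 <- lm(y ~ x1 + x2, data = d2)   # beta in V1

    num <- (sum(resid(fit0)^2) - sum(resid(fit1)^2)) / 1  # p1 - p0 = 1
    den <- sum(resid(fit1)^2) / df.residual(fit1)         # df = n - p1
    F_obs <- num / den
    pf(F_obs, 1, df.residual(fit1), lower.tail = FALSE)   # p-value
    anova(fit0, fit1)                                     # same F and p-value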


Critical regions and likelihood ratio tests

A very general method for hypothesis testing is the likelihood ratio test (compare the Neyman–Pearson lemma).

Test $H_0: \vartheta \in \Theta_0$ against $H_1: \vartheta \in \Theta_1 \setminus \Theta_0$, with $\Theta_0 \subset \Theta_1 \subset \mathbb{R}^{p+1}$, and let $L(\vartheta)$ be the likelihood of the parameter $\vartheta$.

If $L_0 = L(\hat\vartheta_0) = \max_{\vartheta \in \Theta_0} L(\vartheta)$ and $L_1 = L(\hat\vartheta_1) = \max_{\vartheta \in \Theta_1} L(\vartheta)$, reject $H_0$ when $L_1 / L_0$ is larger than some constant $c_\alpha$ ($c_\alpha$ chosen so that the probability of a type I error is $\alpha$).

It can be proved that the F -test discussed above for linear models is equivalent to this test.


$R^2$ and significance of the regression

$$R^2 = \frac{SS_{model}}{SS_{total}} = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \bar Y\|^2}$$
is an indicator of the explanatory power of the model.

If the 0-th column of $X$ is all 1s,
$$\underbrace{\|Y - \bar Y\|^2}_{SS_{total}} = \underbrace{\|Y - \hat Y\|^2}_{SS_{residual}} + \underbrace{\|\hat Y - \bar Y\|^2}_{SS_{model}},$$
so that $R^2 \le 1$.

In a test of $V_0 = \{\beta_1 = \dots = \beta_p = 0\}$ vs. $V_1 = \mathbb{R}^{p+1}$,
$$F = \frac{SS_{model}/p}{SS_{residual}/(n - p - 1)}.$$

A high value of $R^2$ makes it likely that the null hypothesis is rejected; however, a high $R^2$ and significance of the regression are different conditions.
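The same quantities by hand in R (reusing the simulated d from the confidence-interval sketch), checked against summary:

    fit  <- lm(y ~ x1 + x2, data = d)
    ybar <- mean(d$y)
    ss_model <- sum((fitted(fit) - ybar)^2)
    ss_resid <- sum(resid(fit)^2)
    ss_total <- sum((d$y - ybar)^2)

    ss_model / ss_total                               # R^2
    (ss_model / 2) / (ss_resid / df.residual(fit))    # overall F, here p = 2
    summary(fit)$r.squared                            # should match
    summary(fit)$fstatistic                           # should match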


Tests of significance in R

Use anova(reg1, reg2), where reg1 and reg2 are two nested linear models, as in the sketch below.
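For instance (reusing the simulated d from above):

    reg1 <- lm(y ~ x1, data = d)        # reduced model
    reg2 <- update(reg1, . ~ . + x2)    # full model, adds x2
    anova(reg1, reg2)                   # F test comparing the nested fits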


Diagnostics

Check the residuals: $\hat\varepsilon_i = y_i - \hat y_i$.
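A minimal diagnostic sketch in R (reusing the simulated d from above):

    fit <- lm(y ~ x1 + x2, data = d)
    eps_hat <- resid(fit)               # residuals y_i - y-hat_i
    plot(fitted(fit), eps_hat)          # look for structure / heteroscedasticity
    qqnorm(eps_hat); qqline(eps_hat)    # check normality of residuals
    plot(fit)                           # R's built-in diagnostic plots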
