### Estimation of linear model under normality assumptions

Assume

$$Y = X\beta + E \quad \text{with } E \sim N(0, \sigma^2 I), \tag{1}$$

i.e. the errors are independent, homoscedastic and normally distributed.

Then

$$Y \sim N(X\beta, \sigma^2 I). \tag{2}$$

If $y$ is the vector of observed data and $\theta = (\beta, \sigma^2)$ are the parameters to estimate, the likelihood $L(\theta \mid y)$ is

$$L(\theta \mid y) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2}\|y - X\beta\|^2\Big\}, \tag{3}$$

$$\log L(\theta \mid y) = LL(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2. \tag{4}$$

Maximizing the likelihood is therefore equivalent to minimizing

$$\frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\|y - X\beta\|^2.$$
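As a sanity check on (4): the log-likelihood formula is just the sum of the $n$ univariate normal log-densities of the observations. A minimal numerical sketch (the design matrix, parameter values, and data below are invented for illustration; the course itself works in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p covariates
beta = np.array([1.0, 2.0, -0.5])
sigma2 = 0.7
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Log-likelihood as in (4)
rss = np.sum((y - X @ beta) ** 2)
ll_formula = -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2) - rss / (2 * sigma2)

# Same quantity as a sum of independent univariate normal log-densities
ll_direct = stats.norm.logpdf(y, loc=X @ beta, scale=np.sqrt(sigma2)).sum()
```

The two computations agree, since under (1) the $Y_i$ are independent $N((X\beta)_i, \sigma^2)$.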

5 March 2015


### Maximum likelihood estimation

We need to find the $\beta$ and $\sigma^2$ that minimize

$$\frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\|y - X\beta\|^2. \tag{5}$$

Clearly, the $\beta$ that minimizes (5) is the one that minimizes $\|y - X\beta\|^2$, i.e.

$$\hat\beta = (X^t X)^{-1} X^t Y.$$

Taking the derivative of (5) with respect to $\sigma^2$, one obtains the MLE

$$\tilde\sigma^2 = \frac{\|y - X\hat\beta\|^2}{n}. \tag{6}$$

Actually $\tilde\sigma^2$ is a biased estimator of $\sigma^2$, while

$$\hat\sigma^2 = \frac{\|Y - X\hat\beta\|^2}{n - p - 1} \qquad [X:\ n \times (p+1)\ \text{matrix}] \tag{7}$$

is an unbiased estimator of $\sigma^2$ and is generally preferred.
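The two estimators differ only by the divisor ($n$ vs. $n - p - 1$), so the MLE is always the smaller of the two. A numerical sketch (simulated data; numpy used here as a language-neutral stand-in for R):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1) design
beta = np.array([0.5, 1.0, -1.0, 2.0])
y = X @ beta + rng.normal(scale=0.3, size=n)

# MLE of beta: ordinary least squares (lstsq avoids forming (X^t X)^{-1} explicitly)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = rss / n                   # biased MLE, eq. (6)
sigma2_unbiased = rss / (n - p - 1)    # unbiased estimator, eq. (7)
```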



### A property of normal distributions

Theorem (Cochran)

Assume $Y \sim N(0, \sigma^2 I)$ ($n$-dimensional) and let $E_1, \dots, E_k$ (of dimension $n_i$) be subspaces of $\mathbb{R}^n$ orthogonal to each other, with $P_i$ the corresponding orthogonal projections.

Then the $P_i Y$ are independent (and normal), and $\sigma^{-2}\|P_i Y\|^2 \sim \chi^2(n_i)$.

In particular, if $P$ is an orthogonal projection of dimension $k$, then $PY$ and $(I - P)Y$ are independent and $\sigma^{-2}\|(I - P)Y\|^2 \sim \chi^2(n - k)$.

As $W \sim \chi^2(k) \implies E(W) = k$, it follows that $E(\|Y - X\hat\beta\|^2) = \sigma^2(n - p - 1)$, hence

$$\hat\sigma^2 = \frac{\|Y - X\hat\beta\|^2}{n - p - 1}$$

is an unbiased estimator of $\sigma^2$.
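A Monte Carlo sketch of this consequence (design and parameter values invented): for a fixed design $X$ of rank $p + 1$, the scaled residual sum of squares $\sigma^{-2}\|(I - P)Y\|^2$ should average to $n - p - 1$ over many replications, which is exactly the unbiasedness of $\hat\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # fixed design of rank p + 1
P = X @ np.linalg.inv(X.T @ X) @ X.T        # orthogonal projection onto col(X)
beta = np.array([1.0, -2.0, 0.5])
sigma2 = 1.5

reps = 20000
noise = rng.normal(scale=np.sqrt(sigma2), size=(n, reps))
Y = (X @ beta)[:, None] + noise
resid = Y - P @ Y                            # (I - P)Y = Y - X beta_hat
rss = np.sum(resid ** 2, axis=0)             # ||(I - P)Y||^2, one value per replication

# sigma^{-2} ||(I - P)Y||^2 ~ chi^2(n - p - 1), whose mean is n - p - 1 ...
mean_scaled = (rss / sigma2).mean()
# ... so rss / (n - p - 1) has mean sigma^2 (unbiasedness)
mean_unbiased = (rss / (n - p - 1)).mean()
```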



### Reminders on multivariate normal

Definition

$Y = (Y_1, \dots, Y_n)$ is multivariate normal if, $\forall a \in \mathbb{R}^n$, $a^t Y$ is univariate normal.

Equivalently, $Y$ is multivariate normal $\iff$ there exist $b \in \mathbb{R}^n$, an $n \times m$ matrix $A$, and $X = (X_1, \dots, X_m)$ independent standard normal r.v. such that $Y = AX + b$. It follows that $E(Y) = b$ and $\operatorname{Cov}(Y) = AA^t$, i.e. $Y \sim N(b, AA^t)$.

An alternative characterization is via the characteristic function.

If $\operatorname{Cov}(Y) = S$ is positive definite (i.e. invertible), $Y \sim N(\mu, S)$ has density

$$f_Y(y) = (2\pi)^{-n/2}|S|^{-1/2}\exp\{-(y - \mu)^t S^{-1}(y - \mu)/2\}$$

(a non-singular distribution).

When projecting onto a subspace, one gets singular distributions.
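The $Y = AX + b$ characterization doubles as a recipe for simulating multivariate normals. A short sketch (the matrix $A$ and vector $b$ are arbitrary choices for illustration) checking empirically that the mean is $b$ and the covariance is $AA^t$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 5
A = rng.normal(size=(n, m))
b = np.array([1.0, -1.0, 2.0])

# Y = A X + b with X standard normal  =>  Y ~ N(b, A A^t)
reps = 200000
Xs = rng.normal(size=(m, reps))          # columns are independent standard normal vectors
Ys = A @ Xs + b[:, None]

emp_mean = Ys.mean(axis=1)
emp_cov = np.cov(Ys)                     # sample covariance of the n components
```

Note that if $m < n$ (or $A$ is rank-deficient), $AA^t$ is singular and the distribution has no density: this is the singular case mentioned above.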



### Confidence intervals for β

From Cochran's theorem, $\hat Y = X\hat\beta = PY$ and $Y - \hat Y$ are independent ($P$ being the projection onto the subspace generated by the columns of $X$). Hence the same is true for $\hat\beta$ and $\hat\sigma^2 = \|Y - X\hat\beta\|^2/(n - p - 1)$.

We know moreover that $\hat\beta \sim N(\beta, \sigma^2(X^t X)^{-1})$. Letting $M = (X^t X)^{-1}$, it follows that

$$\sqrt{n - p - 1}\,\frac{\hat\beta_i - \beta_i}{\sigma\sqrt{M_{ii}}}\Big/\sqrt{\sigma^{-2}\|\hat Y - Y\|^2} = \frac{\hat\beta_i - \beta_i}{\sqrt{M_{ii}\hat\sigma^2}}$$

follows a $t$-distribution with $n - p - 1$ degrees of freedom.

Then $\hat\beta_i \pm t_\gamma\sqrt{M_{ii}\hat\sigma^2}$ is a $\gamma$ confidence interval for $\beta_i$. Correspondingly,

$$T = \frac{\hat\beta_i}{\sqrt{M_{ii}\hat\sigma^2}}$$

is a test statistic for the hypothesis $\beta_i = 0$ against $\beta_i \neq 0$.
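A numerical sketch of the interval $\hat\beta_i \pm t_\gamma\sqrt{M_{ii}\hat\sigma^2}$ (simulated data; index and confidence level chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.0, 2.0])
y = X @ beta + rng.normal(size=n)

M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

gamma = 0.95
i = 2
# two-sided t quantile with n - p - 1 degrees of freedom
t_gamma = stats.t.ppf(1 - (1 - gamma) / 2, df=n - p - 1)
half_width = t_gamma * np.sqrt(M[i, i] * sigma2_hat)
ci = (beta_hat[i] - half_width, beta_hat[i] + half_width)

# t statistic for H0: beta_i = 0 against beta_i != 0
T = beta_hat[i] / np.sqrt(M[i, i] * sigma2_hat)
```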


### How to read the output of lm in R
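R's `summary(lm(...))` prints, for each coefficient, the columns Estimate, Std. Error, t value and Pr(>|t|). These are exactly the quantities from the previous slide. A sketch of how the table is computed, here in Python rather than R (data simulated; the variable `x2` is deliberately irrelevant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 does not enter the true model

p = X.shape[1] - 1
M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

se = np.sqrt(np.diag(M) * sigma2_hat)                 # "Std. Error" column
t_val = beta_hat / se                                 # "t value" column
p_val = 2 * stats.t.sf(np.abs(t_val), df=n - p - 1)   # "Pr(>|t|)" column

for name, b, s, t, pv in zip(["(Intercept)", "x1", "x2"], beta_hat, se, t_val, p_val):
    print(f"{name:12s} {b:8.4f} {s:8.4f} {t:8.2f} {pv:10.4g}")
```

Reading the table: a small Pr(>|t|) for a row is evidence against $\beta_i = 0$, i.e. that the corresponding column of $X$ is relevant.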


### Hypothesis testing in linear models

Remember: everything is under the assumption $E \sim N(0, \sigma^2 I)$.

A general test can be written as

$$H_0\colon \beta \in V_0 \quad \text{against} \quad H_1\colon \beta \in V_1 \setminus V_0,$$

with $V_0$, $V_1$ subspaces and $V_0 \subset V_1 \subset \mathbb{R}^{p+1}$ ($p + 1$ being the rank of the matrix $X$).

Example 1: $V_0 = \{\beta_1 = \dots = \beta_p = 0\}$, $V_1 = \mathbb{R}^{p+1}$ (test of the whole regression).

Example 2: $V_0 = \{\beta_i = 0\}$, $V_1 = \mathbb{R}^{p+1}$ (test of the relevance of column $i$).

Example 3: $V_0 = \{\beta_i = \beta_j\}$, $V_1 = \mathbb{R}^{p+1}$ (are the two coefficients the same?).


### F test

Theorem

The critical regions $C$ can be written as $C = \{Y : F > c\}$, where

$$F = \frac{\|X\hat\beta_1 - X\hat\beta_0\|^2/(p_1 - p_0)}{\|Y - X\hat\beta_1\|^2/(n - p_1)}. \tag{8}$$

If $\beta \in V_0$, then $F$ follows an $F(p_1 - p_0, n - p_1)$ distribution, where $p_0$ and $p_1$ are the dimensions of $V_0$ and $V_1$, and $\hat\beta_0$ and $\hat\beta_1$ the respective estimates.

An $F(q, r)$ distribution is the ratio of two independent $\chi^2$ variables (each divided by its degrees of freedom), the numerator $\chi^2(q)$, the denominator $\chi^2(r)$.

In practice, one computes $F_{obs}$ and finds the $p$-value, i.e. $P(F(p_1 - p_0, n - p_1) > F_{obs})$.
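A sketch of computing (8) for two nested models (simulated data; the smaller model drops the covariate `x2`). A useful identity, by Pythagoras in $\mathbb{R}^n$ with $V_0 \subset V_1$, is $\|X\hat\beta_1 - X\hat\beta_0\|^2 = RSS_0 - RSS_1$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X0 = np.column_stack([np.ones(n), x1])          # smaller model (V0)
X1 = np.column_stack([np.ones(n), x1, x2])      # larger model (V1)
p0, p1 = X0.shape[1], X1.shape[1]

def fitted(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

yhat0, yhat1 = fitted(X0, y), fitted(X1, y)
rss0 = np.sum((y - yhat0) ** 2)
rss1 = np.sum((y - yhat1) ** 2)

# F statistic as in (8)
F = (np.sum((yhat1 - yhat0) ** 2) / (p1 - p0)) / (rss1 / (n - p1))
p_value = stats.f.sf(F, p1 - p0, n - p1)        # P(F(p1-p0, n-p1) > F_obs)
```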

### Critical regions and likelihood ratio tests

A very general method for hypothesis testing is the likelihood ratio test (compare the Neyman–Pearson lemma).

Test $H_0\colon \vartheta \in \Theta_0$ against $H_1\colon \vartheta \in \Theta_1 \setminus \Theta_0$, with $\Theta_0 \subset \Theta_1 \subset \mathbb{R}^{p+1}$, and let $L(\vartheta)$ be the likelihood of the parameter $\vartheta$.

If $L_0 = L(\hat\vartheta_0) = \max_{\vartheta \in \Theta_0} L(\vartheta)$ and $L_1 = L(\hat\vartheta_1) = \max_{\vartheta \in \Theta_1} L(\vartheta)$, reject $H_0$ when $L_1/L_0$ is larger than some constant $c_\alpha$ ($c_\alpha$ chosen so as to have probability $\alpha$ of errors of the first kind).

It can be proved that the $F$-test discussed above for linear models is equivalent to this test.
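The equivalence can be made concrete. In the linear model, maximizing (3) over $\beta$ and $\sigma^2$ within a subspace gives $\max L = (2\pi\,RSS/n)^{-n/2}e^{-n/2}$, so $L_1/L_0 = (RSS_0/RSS_1)^{n/2}$, and the $F$ statistic is an increasing function of this ratio: rejecting for large $L_1/L_0$ is rejecting for large $F$. A numerical sketch (simulated nested models):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + x1 + 0.5 * x2 + rng.normal(size=n)

X0 = np.column_stack([np.ones(n), x1])
X1 = np.column_stack([np.ones(n), x1, x2])
p0, p1 = X0.shape[1], X1.shape[1]

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rss0, rss1 = rss(X0, y), rss(X1, y)

# Profiling sigma^2 out, max L = (2 pi RSS/n)^{-n/2} e^{-n/2}, hence:
lr = (rss0 / rss1) ** (n / 2)                        # likelihood ratio L1 / L0
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))  # F statistic of eq. (8)

# F recovered as a monotone (increasing) function of the likelihood ratio
F_from_lr = (lr ** (2 / n) - 1) * (n - p1) / (p1 - p0)
```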



### R² and significance of the regression

$$R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}} = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \bar Y\|^2}$$

is an indicator of the explanatory power of the model.

If the 0-th column of $X$ is all 1s,

$$\underbrace{\|Y - \bar Y\|^2}_{SS_{\text{total}}} = \underbrace{\|Y - \hat Y\|^2}_{SS_{\text{residual}}} + \underbrace{\|\hat Y - \bar Y\|^2}_{SS_{\text{model}}},$$

so that $R^2 \leq 1$.

In a test of $V_0 = \{\beta_1 = \dots = \beta_p = 0\}$ vs. $V_1 = \mathbb{R}^{p+1}$,

$$F = \frac{SS_{\text{model}}/p}{SS_{\text{residual}}/(n - p - 1)}.$$

A high value of $R^2$ makes it likely to reject the null hypothesis; however, high $R^2$ and significance of a regression are different conditions.
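A sketch computing the decomposition, $R^2$, and the overall $F$ statistic on simulated data (the sum-of-squares identity should hold exactly when the design contains the column of 1s):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 70, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # 0-th column all 1s
y = X @ np.array([1.0, 1.5, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat
ybar = y.mean()

ss_total = np.sum((y - ybar) ** 2)
ss_resid = np.sum((y - yhat) ** 2)
ss_model = np.sum((yhat - ybar) ** 2)

R2 = ss_model / ss_total
F = (ss_model / p) / (ss_resid / (n - p - 1))   # F statistic of the whole regression
```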


### Tests of significance in R

Use `anova(reg1, reg2)` where `reg1` and `reg2` are two linear models.

### Diagnostics

Check the residuals: $\hat\varepsilon_i = y_i - \hat y_i$.
