
Linear models

including analysis of variance

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

riccomagno@dima.unige.it rogantin@dima.unige.it


Part A. Linear model

1. Introduction

2. Linear model on a sample

3. Inference on the coefficients of the model

4. Inference on the means of the responses

5. Analysis of the residuals for the goodness-of-fit

6. Test on a subset of coefficients

7. Prediction of the response and prediction error


1. Introduction

Example. Oxygen consumption in athletes

ossigeno eta peso tempo p_ferm p_med p_max

44.609 44 89.47 11.37 62 178 182

45.313 40 75.07 10.07 62 185 185

54.297 44 85.84 8.65 45 156 168

49.874 38 89.02 9.22 55 178 180

...

We want to determine if the oxygen consumption by athletes practicing endurance sports can be expressed as a linear combination of easily measurable explanatory variables:

- age (eta),

- weight (peso),

- time to a fixed distance (tempo),

- heart rate at rest, beats per minute (pulsferm),

- average heart rate per minute (pulsmed),

- maximum heart rate per minute when running (pulsmax)


Let Y and X1, . . . , Xp−1 be quantitative variables on n units.

We want to express Y as a linear combination of X1, . . . , Xp−1 plus a random residual.

- Y response variable; Y1, . . . , Yn sample variables and y1, . . . , yn the observed values

- X1, . . . , Xp−1 explanatory variables (or covariates) and xi1, ..., xi,p−1 the observed covariates for the i-th sample unit

For the i-th sample unit, i = 1, . . . , n,

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1 + εi = xti β + εi

In matrix form, Y = Xβ + ε, whose i-th row is made explicit above.


Two main applications of linear models.

• To quantify the strength of the relationship between Y and the covariates, to assess which covariate may have no relationship with Y at all, and to identify which subsets of the covariates contain redundant information about Y .

• After developing a model to approximate Y through appropriate covariates, if an additional sample unit and its covariates are given without the accompanying value of Y , the fitted model can be used to make a prediction of the value of Y .


Example. Simple linear regression: y = β0 + β1 x + ε

b0 + b1 xi (belonging to the fitted line) is the best linear approximation of yi through xi.

[Figure: scatterplot of y against x with the fitted regression line; the observed point (xi, yi) and the fitted point (xi, b0 + b1 xi) are marked.]

Residuals

The i-th residual is εi = yi − xti β, for i = 1, . . . , n.

It is a function of the parameters β.

Estimate of the parameters – a least squares problem

The estimate of β minimizes the sum of squares of the residuals

∑_{i=1}^n εi2
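To make the least squares problem concrete, a minimal sketch in R (not from the original slides), using the built-in cars data as a stand-in for the course data: minimising the sum of squared residuals numerically gives, up to numerical error, the same coefficients as lm.

# Sketch: least squares as a minimisation problem (built-in cars data, hypothetical stand-in)
ssr <- function(beta) sum((cars$dist - beta[1] - beta[2] * cars$speed)^2)  # sum of squared residuals
fit_num <- optim(c(0, 0), ssr)            # numerical minimisation, starting from (0, 0)
fit_lm  <- lm(dist ~ speed, data = cars)  # least squares via lm
fit_num$par                               # approximately equal to coef(fit_lm)
coef(fit_lm)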


2. Linear model on a sample

The parameter βk gives the “importance” of the explanatory variable Xk in the approximation of the response.

The β parameters are also called “effects”.

The aim is to study the parameters β

• estimating them

• computing confidence intervals

• carrying out tests


Estimator of the parameters

Let Y = (Y1, . . . , Yn) be independent sample variables.

The covariates X1, . . . , Xp−1 are assumed deterministic.

Let X be the matrix whose first column is a column of ones and whose i-th row contains the covariates’ values for the i-th observed unit. This is the data matrix (of the covariates) augmented by a column of ones.

The least squares estimator of the coefficients β = (β0, . . . , βp−1) is

B = (XtX)−1XtY

where B is a vector of estimators: B = (B0, . . . , Bp−1).
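As a check, the closed form can be computed directly from the data matrix; a minimal sketch on R's built-in trees data (a stand-in for the course data, not part of the original slides):

# Sketch: closed-form least squares estimate B = (X'X)^{-1} X'Y
X <- cbind(1, trees$Girth, trees$Height)   # data matrix augmented by a column of ones
y <- trees$Volume
b <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
cbind(b, coef(lm(Volume ~ Girth + Height, data = trees)))  # same values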


Response variable with normal distribution

Assume εi ∼ N(0, σ2), with σ unknown, and cov(εi, εj) = 0 for i ≠ j. Then Yi ∼ N(µi, σ2) and cov(Yi, Yj) = 0 for i ≠ j,

where µi = xti β = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1.

The responses Y1, . . . , Yn

• are independent because normally distributed and with zero covariances.

• are not identically distributed because the mean value of Yi depends on the covariates of the i-th sample unit.

The unbiased estimator of σ2 used is:

S2 = (1 / (n − p)) ∑_{i=1}^n (Yi − xti B)2
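A minimal sketch of this estimator in R (built-in trees data as a stand-in): s2 computed by hand coincides with the squared residual standard error reported by summary.

# Sketch: unbiased estimate of sigma^2
fit <- lm(Volume ~ Girth + Height, data = trees)
n <- nrow(trees); p <- length(coef(fit))
s2 <- sum(residuals(fit)^2) / (n - p)           # S^2 = sum of squared residuals / (n - p)
c(manual = s2, from_summary = summary(fit)$sigma^2)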


3. Inference on the coefficients β of the model

Point estimators

Recall that the least squares estimator of the model coefficients β0, . . . , βp−1 is B = (XtX)−1XtY, where B is a vector of estimators: B = (B0, . . . , Bp−1).

The variance/covariance matrix of the estimators B is:

V(B) = σ2(XtX)−1

and can be estimated using S2, point estimator of σ2.

We denote by σ̃k2 the variance of Bk and by S̃k2 its estimator, k = 0, . . . , p − 1.

The point estimators B of the model coefficients β are linear combinations of the sample variables (Y1, . . . , Yn); hence they are normal random variables.

Bk ∼ N(βk, σ̃k2) and Tk = (Bk − βk) / S̃k ∼ tn−p
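The estimated standard deviations s̃k can be recovered from s2 (XtX)−1; a minimal sketch (built-in trees data as a stand-in, not from the original slides):

# Sketch: standard errors of the coefficients from s^2 (X'X)^{-1}
fit <- lm(Volume ~ Girth + Height, data = trees)
X  <- model.matrix(fit)                        # data matrix with the column of ones
s2 <- summary(fit)$sigma^2
se <- sqrt(diag(s2 * solve(t(X) %*% X)))       # estimated standard deviations of the B's
cbind(manual = se, from_summary = summary(fit)$coefficients[, "Std. Error"])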


Example. Simple regression. Height of poplars

We want to express the height (m) of a variety of poplars as a linear function of the diameter (cm).

Diametro Altezza
2.23 3.76
2.12 3.15
1.06 1.85
...

> poplar=read.delim("DATA/pioppi.txt",header =T);attach(poplar)

> regr_H_D=lm(Altezza~Diametro,poplar[1:50,])

> summary(regr_H_D)


Call:
lm(formula = Altezza ~ Diametro, data = poplar[1:50, ])

Residuals:
    Min      1Q  Median      3Q     Max
-14.001  -2.686   0.797   2.937   8.396

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    8.045      2.487   3.234  0.00221 **
Diametro      14.560      1.082  13.460  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.933 on 48 degrees of freedom
Multiple R-squared: 0.7906,  Adjusted R-squared: 0.7862
F-statistic: 181.2 on 1 and 48 DF,  p-value: < 2.2e-16

Consider now only the Coefficients table:

- column Estimate: estimate bk of the parameter βk

- column Std.Error: estimate of the standard deviation of Bk


Confidence interval for βk

(Bk − t1−α/2 S̃k, Bk + t1−α/2 S̃k)

From the sample values bk and s̃k we get the sample confidence interval.

Sample confidence intervals in R

> confint(regr_H_D)

2.5 % 97.5 %

(Intercept)  3.043746 13.04638
Diametro    12.385520 16.73548
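The same intervals can be reproduced from bk ± t1−α/2 s̃k; a minimal sketch (built-in trees data as a stand-in for the poplar data, not from the original slides):

# Sketch: 95% confidence intervals for the coefficients, computed by hand
fit <- lm(Volume ~ Girth, data = trees)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
tq  <- qt(0.975, df = fit$df.residual)         # t_{1-alpha/2} with n - p degrees of freedom
cbind(lower = est - tq * se, upper = est + tq * se)
confint(fit)                                   # same intervals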


Test on the βk (individually)

The k-th covariate Xk has an effect on the response if βk is different from 0.

H0: βk = 0 and H1: βk ≠ 0

Under H0, Tk = Bk / S̃k ∼ tn−p

Given α, the rejection region is R0 = (−∞, −t1−α/2 s̃k) ∪ (t1−α/2 s̃k, +∞).

In the Coefficients table:

- column t value: tk = bk / s̃k
- column Pr(>|t|): p-value of tk

In the example, there is strong evidence to reject H0 for both coefficients β0 and β1.

In fact the sample confidence intervals do not contain 0 and the p-values are very small.
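The columns t value and Pr(>|t|) can be reproduced directly from the estimates and standard errors; a minimal sketch (built-in trees data as a stand-in):

# Sketch: t statistic and two-sided p-value for H0: beta_k = 0
fit <- lm(Volume ~ Girth, data = trees)
cf  <- coef(summary(fit))
tk  <- cf[, "Estimate"] / cf[, "Std. Error"]   # column "t value"
pk  <- 2 * pt(-abs(tk), df = fit$df.residual)  # column "Pr(>|t|)"
cbind(t_manual = tk, p_manual = pk, cf[, c("t value", "Pr(>|t|)")])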


4. Inference on the mean values µi of the responses

The mean value of the sample variable Yi is µi = xtiβ

Point estimator for µi (computed from the point estimators B): Ŷi = xti B

The variance/covariance matrix of the estimators Ŷ is:

V(Ŷ) = σ2 X(XtX)−1Xt

and can be estimated using S2, point estimator of σ2.

We denote by σi2 the variance of Ŷi and by Si2 its estimator.

Ŷi ∼ N(µi, σi2) and Ti = (Ŷi − µi) / Si ∼ tn−p


Confidence interval for µi



(Ŷi − t1−α/2 Si, Ŷi + t1−α/2 Si)

Sample confidence intervals in R

> round(predict(regr_H_D, interval="confidence"),1)
   fit  lwr  upr
1 25.1 22.4 27.7
2 31.1 29.1 33.0
3 23.5 20.6 26.3
4 23.0 20.1 26.0
5 25.2 22.6 27.9
6 28.3 26.0 30.5
7 27.0 24.6 29.4
8 29.0 26.8 31.2
...
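The same intervals follow from Ŷi ± t1−α/2 Si, where Si is the standard error returned by predict with se.fit = TRUE; a minimal sketch (built-in trees data as a stand-in):

# Sketch: confidence interval for the mean response, computed from se.fit
fit <- lm(Volume ~ Girth, data = trees)
pr  <- predict(fit, se.fit = TRUE)
tq  <- qt(0.975, df = fit$df.residual)
head(cbind(fit = pr$fit, lwr = pr$fit - tq * pr$se.fit, upr = pr$fit + tq * pr$se.fit))
head(predict(fit, interval = "confidence"))    # same three columns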


5. Analysis of the residuals for the goodness-of-fit

The residuals εi are estimated by Ei = Yi − Ŷi

The variance/covariance matrix of the estimators E is:

V(E) = σ2 (I − X(XtX)−1Xt)

and can be estimated using S2, point estimator of σ2. We denote by σi∗∗2 the variance of Ei and by Si∗∗2 its estimator.

Ei ∼ N(0, σi∗∗2) and Ti = Ei / Si∗∗ ∼ tn−p

The normality of the response variable is checked not through y1, . . . , yn but through the standardized residuals ti = (yi − ŷi)/si∗∗, i = 1, . . . , n. In fact y1, . . . , yn are single observations of n random variables with different means, while t1, . . . , tn can be treated as a sample of size n from one common distribution.
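A minimal sketch of how the standardized residuals are obtained (built-in trees data as a stand-in); the leverages hii are the diagonal of X(XtX)−1Xt:

# Sketch: standardized residuals t_i = e_i / (s * sqrt(1 - h_ii))
fit <- lm(Volume ~ Girth, data = trees)
s   <- summary(fit)$sigma
h   <- hatvalues(fit)                          # diagonal of X (X'X)^{-1} X'
t_i <- residuals(fit) / (s * sqrt(1 - h))
all.equal(t_i, rstandard(fit))                 # TRUE: same as R's standardized residuals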


The scatterplot of the residuals against the fitted values should show neither a trend nor a “form”: the cloud should be homogeneous around zero. If not, a transformation of either the response or the covariates can be performed.

Example. Poplar (continued)

[Figure: left, standardized residuals against fitted values; right, the R diagnostic plots “Residuals vs Fitted” and “Normal Q-Q” for regr_H_D.]

plot(predict(regr_H_D),rstandard(regr_H_D),pch=16,cex.axis=1.5,
     xlab="fitted values",ylab="standardized residuals",cex.lab=1.5)
abline(h=0)
par(mfrow=c(1,2)); plot(regr_H_D, which = 1:2)

Tukey's five-number summary of the residuals indicates that they are fairly symmetric about zero.

Residuals:

Min 1Q Median 3Q Max

-14.001 -2.686 0.797 2.937 8.396


The R2 index for the goodness-of-fit

The index has two equivalent interpretations:

• it is the ratio between the variance of the fitted values and the variance of the sample values:

R2 = ∑(ŷi − ȳ)2 / ∑(yi − ȳ)2

• it is the squared correlation between the fitted values and the sample values:

R2 = ρ2(y, ŷ)
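Both interpretations are easy to verify numerically; a minimal sketch (built-in trees data as a stand-in, not from the original slides):

# Sketch: the two equivalent forms of R^2
fit  <- lm(Volume ~ Girth, data = trees)
y    <- trees$Volume; yhat <- fitted(fit)
r2_a <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)  # ratio of variances
r2_b <- cor(y, yhat)^2                                  # squared correlation
c(r2_a, r2_b, summary(fit)$r.squared)                   # all equal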


Example. Oxygen consumption in athletes (continued)

> atleti=read.table("C:/DATA/atleti.txt",header=T);attach(atleti)

> regr_oxigen=lm(ossigeno~eta+peso+tempo+pulsferm+pulsmed+pulsmax)

> summary(regr_oxigen)

Goodness-of-fit

• Analysis of the residuals

plot(predict(regr_oxigen),rstandard(regr_oxigen),pch=16,cex.axis=1.5,
     xlab="fitted values",ylab="standardized residuals",cex.lab=1.5)
abline(h=0)

par(mfrow=c(1,2));plot(regr_oxigen,which = 1:2)

[Figure: standardized residuals against fitted values, and the diagnostic plots “Residuals vs Fitted” and “Normal Q-Q” for regr_oxigen.]

• R2 index (in the output of summary(regr_oxigen))

Multiple R-squared: 0.8487


Inference on the model coefficients

(in the output of summary(regr_oxigen))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.93448   12.40326   8.299 1.64e-08 ***
eta          -0.22697    0.09984  -2.273  0.03224 *
peso         -0.07418    0.05459  -1.359  0.18687
tempo        -2.62865    0.38456  -6.835 4.54e-07 ***
pulsferm     -0.02153    0.06605  -0.326  0.74725
pulsmed      -0.36963    0.11985  -3.084  0.00508 **
pulsmax       0.30322    0.13650   2.221  0.03601 *
---

Considering the coefficients individually:

there is no evidence to reject the “non influence” (on the oxygen consumption) of the variables peso and pulsferm, there is strong evidence to reject it for eta and pulsmax, and very strong evidence for tempo and pulsmed.

Interpretation: removing pulsferm the model is still good; removing peso the model is still good. What about a model without both?


6. Test for a subset of coefficients

The test hypotheses are

H0 : βi1 = . . . = βiq = 0 and H1 : at least one ≠ 0

The test is carried out comparing
- the sum of squares of the residuals in the reduced model, SSR
- the sum of squares of the residuals in the complete model, SSC
(SSR ≥ SSC always)

The test statistic is the relative difference of the two sums of squares (multiplied by a constant linked to the degrees of freedom of the reduced and complete models):

F = [(SSR − SSC) / q] / [SSC / (n − p)] ∼ F[q, n−p]

(Fisher distribution with q and n − p degrees of freedom)

The test is one-sided (right tail) because large sample values of F indicate a large difference between the complete and reduced models.
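A minimal sketch of the same statistic computed from the two residual sums of squares (built-in trees data as a stand-in, dropping Height from the complete model, so q = 1):

# Sketch: F test on a subset of coefficients, from the two sums of squares
complete <- lm(Volume ~ Girth + Height, data = trees)
reduced  <- lm(Volume ~ Girth, data = trees)
SSC <- sum(residuals(complete)^2); SSR <- sum(residuals(reduced)^2)
q <- 1; n <- nrow(trees); p <- length(coef(complete))
Fstat <- ((SSR - SSC) / q) / (SSC / (n - p))
c(F = Fstat, p_value = pf(Fstat, q, n - p, lower.tail = FALSE))
anova(reduced, complete)                       # same F and p-value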


Example. Oxygen consumption in athletes (continued)

The covariates peso and pulsferm are individually non-influential.

What about a model without both these variables?

> reduced = lm(ossigeno~eta+tempo+pulsmed+pulsmax)

> complete = lm(ossigeno~eta+peso+tempo+pulsferm+pulsmed+pulsmax)

> anova(reduced,complete)
Analysis of Variance Table

Model 1: ossigeno ~ eta + tempo + pulsmed + pulsmax
Model 2: ossigeno ~ eta + peso + tempo + pulsferm + pulsmed + pulsmax
  Res.Df    RSS Df Sum of Sq    F Pr(>F)
1     26 138.93
2     24 128.84  2    10.092 0.94 0.4045

There is evidence to retain H0, that is the coefficients of peso and pulsferm are null. The reduced model approximates the response as well as the complete model.


Let’s analyze the reduced model

Call:
lm(formula = ossigeno ~ eta + tempo + pulsmed + pulsmax)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  98.14789   11.78569   8.328 8.26e-09 ***
eta          -0.19773    0.09564  -2.068  0.04877 *
tempo        -2.76758    0.34054  -8.127 1.31e-08 ***
pulsmed      -0.34811    0.11750  -2.963  0.00644 **
pulsmax       0.27051    0.13362   2.024  0.05330 .
---

Residual standard error: 2.312 on 26 degrees of freedom
Multiple R-squared: 0.8368,  Adjusted R-squared: 0.8117

In this model, there is only weak evidence to reject that the coefficient of pulsmax is zero.

Exercise: check whether a reduced model without peso, pulsferm and pulsmax approximates the oxygen consumption as well as the complete model.


Test for the nullity of all the coefficients except the constant

Are the covariates, all together, useful to explain the response?

Test against the constant model.

In R (last row in the output of summary(regr_oxigen) for the complete model):

F-statistic: 22.43 on 6 and 24 DF, p-value: 9.715e-09

The reduced model has the constant only. Let SS0 be the sum of squares of the residuals. The test statistic is:

F = [(SS0 − SSC) / (p − 1)] / [SSC / (n − p)] ∼ F[p−1, n−p]

In the example of the oxygen consumption:

- p = 7, n = 31; degrees of freedom: 6 and 24
- F-statistic: 22.43
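The same statistic can be reproduced by comparing the fitted model with the constant-only model; a minimal sketch (built-in trees data as a stand-in for the oxygen data):

# Sketch: overall F statistic against the constant-only model
complete <- lm(Volume ~ Girth + Height, data = trees)
constant <- lm(Volume ~ 1, data = trees)
SS0 <- sum(residuals(constant)^2); SSC <- sum(residuals(complete)^2)
n <- nrow(trees); p <- length(coef(complete))
Fstat <- ((SS0 - SSC) / (p - 1)) / (SSC / (n - p))
c(manual = Fstat, from_summary = unname(summary(complete)$fstatistic["value"]))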


7. Prediction of the response and prediction error

Consider the linear model Yi = xtiβ + εi, for i = 1, . . . , n.

Let B be the point estimators of the coefficients based on the n sample units.

An additional sample unit with covariates x0 = (1, x01, . . . , x0,p−1) is given without the corresponding sample value of the response.

For the new observation consider the model:

Y0 = xt0β + ε0 = µ0 + ε0 with the same coefficients β as above.

The mean value of Y0, µ0, is predicted using the point estimator xt0B.


• Confidence interval for µ0, the mean of the new “sample” variable Y0

It holds:

E(xt0 B) = xt0 β and V(xt0 B) = σ2 v02, where v02 = xt0 (XtX)−1 x0.

Then a confidence interval for µ0 is:

(xt0 B − t1−α/2 S v0, xt0 B + t1−α/2 S v0)

• Prediction interval for Y0

1 − α = P( xt0 B − t1−α/2 S √(1 + v02) < Y0 < xt0 B + t1−α/2 S √(1 + v02) )

Remark: it is not a confidence interval because it is not about a parameter.

Note the semi-ranges of the two intervals: t1−α/2 S v0 and t1−α/2 S √(1 + v02).
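A minimal sketch contrasting the two intervals for a new unit (built-in cars data as a stand-in; the new value x0 = (1, 23) is hypothetical):

# Sketch: confidence interval for mu_0 and prediction interval for Y_0 at a new x_0
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 23)                  # hypothetical new unit
predict(fit, new, interval = "confidence")     # interval for the mean mu_0
predict(fit, new, interval = "prediction")     # wider interval for the response Y_0
# by hand: x0'b -/+ t * S * v0 and x0'b -/+ t * S * sqrt(1 + v0^2)
X   <- model.matrix(fit); x0 <- c(1, 23)
v02 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
S   <- summary(fit)$sigma; tq <- qt(0.975, fit$df.residual)
mu0 <- sum(x0 * coef(fit))
c(mu0 - tq * S * sqrt(v02),     mu0 + tq * S * sqrt(v02))      # confidence
c(mu0 - tq * S * sqrt(1 + v02), mu0 + tq * S * sqrt(1 + v02))  # prediction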


Example. Growth of mice

Growth in percent of mice subjected to a special diet is analyzed.

dose growth

1 10 73

2 10 78

3 15 85

4 20 90

5 20 91

6 25 87

7 25 86

8 25 91

9 30 75

10 35 65

11 40 NA

12 45 NA

[Figure: scatterplot of growth against dose.]

We want to predict the growth for the last doses.

A second-order polynomial model is considered:

growth = β0 + β1 dose + β2 dosesq + ε [dosesq=dose^2]


> topi=read.table("C:/DATA/topi.txt",header=T,na.string=".");attach(topi)

> dosesq=dose^2

> regr=lm(growth ~ dose + dosesq, subset=c(1:10))

Call:
lm(formula = growth ~ dose + dosesq, subset = c(1:10))

Residuals:
    Min      1Q  Median      3Q     Max
-3.6377 -1.2937 -0.1396  1.4450  3.5665

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.65744    5.61793   6.347 0.000386 ***
dose          5.26290    0.55802   9.431 3.14e-05 ***
dosesq       -0.12767    0.01281  -9.966 2.19e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.541 on 7 degrees of freedom
Multiple R-squared: 0.9364,  Adjusted R-squared: 0.9183
F-statistic: 51.56 on 2 and 7 DF,  p-value: 6.478e-05


> conf_int=predict(regr,topi,interval="confidence")

> pred_int=predict(regr,topi, interval="predict")

> cbind(dose, growth, round(conf_int,2),round(pred_int[,2:3],2))
   dose growth   fit    lwr   upr    lwr   upr
1    10     73 75.52  71.52 79.52  68.30 82.74
2    10     78 75.52  71.52 79.52  68.30 82.74
3    15     85 85.87  83.33 88.42  79.35 92.40
4    20     90 89.85  87.23 92.47  83.29 96.40
5    20     91 89.85  87.23 92.47  83.29 96.40
6    25     87 87.43  84.90 89.96  80.91 93.95
7    25     86 87.43  84.90 89.96  80.91 93.95
8    25     91 87.43  84.90 89.96  80.91 93.95
9    30     75 78.64  75.79 81.49  71.99 85.29
10   35     65 63.46  58.09 68.82  55.40 71.51
11   40     NA 41.89  31.94 51.85  30.27 53.52
12   45     NA 13.95  -2.27 30.17  -3.35 31.25


[Figure: observed growth, predicted values, 95% confidence band and 95% prediction band plotted against dose.]

- dot: observed data

- star and solid red line: predicted value for growth

- diamond and dashed blue line: 95% confidence interval for the means of the response

- empty dot and dashed black line: 95% prediction interval for the response


R coding for plots

plot(dose,growth,pch=16,xlim=c(10,45),ylim=c(-10,100),ylab=" ",cex=1.5)
par(new=T)

plot(dose,conf_int[,1],pch="*",type="b",col="red",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lwd=2,cex=1.5)

par(new=T)

plot(dose,conf_int[,2],pch=18,type="b",col="blue",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=4,lwd=2,cex=1.5)

par(new=T)

plot(dose,conf_int[,3],pch=18,type="b",col="blue",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=4,lwd=2,cex=1.5)

par(new=T)

plot(dose,pred_int[,2],pch=21,type="b",col="black",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=5,lwd=2,cex=1.5)

par(new=T)

plot(dose,pred_int[,3],pch=21,type="b",col="black",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=5,lwd=2,cex=1.5)
