
Linear models

including analysis of variance

Eva Riccomagno, Maria Piera Rogantin

DIMA – Università di Genova

riccomagno@dima.unige.it rogantin@dima.unige.it


Part A. Linear model

1. Introduction

2. Linear model on a sample

3. Inference on the coefficients of the model

4. Inference on the means of the responses

5. Analysis of the residuals for the goodness-of-fit

6. Test on a subset of coefficients

7. Prediction of the response and prediction error


1. Introduction

Example. Oxygen consumption in athletes

ossigeno eta peso tempo p_ferm p_med p_max

44.609 44 89.47 11.37 62 178 182

45.313 40 75.07 10.07 62 185 185

54.297 44 85.84 8.65 45 156 168

49.874 38 89.02 9.22 55 178 180

...

We want to determine if the oxygen consumption by athletes practicing endurance sports can be expressed as a linear combination of easily measurable explanatory variables:

- age (eta),

- weight (peso),

- time to a fixed distance (tempo),

- heart rate at rest, beats per minute (pulsferm),

- average heart rate per minute (pulsmed),

- maximum heart rate per minute when running (pulsmax)


Let Y and X1, . . . , Xp−1 be quantitative variables on n units.

We want to express Y as a linear combination of X1, . . . , Xp−1 plus a random residual.

- Y response variable; Y1, . . . , Yn sample variables and y1, . . . , yn the observed values

- X1, . . . , Xp−1 explanatory variables (or covariates) and xi1, ..., xi,p−1 the observed covariates for the i-th sample unit

For the i-th sample unit, i = 1, . . . , n,

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1 + εi = xti β + εi

In matrix form, Y = Xβ + ε, whose i-th row is made explicit above.


Two main applications of linear models.

• To quantify the strength of the relationship between Y and the covariates, to assess which covariate may have no relationship with Y at all, and to identify which subsets of the covariates contain redundant information about Y .

• After developing a model to approximate Y through appropriate covariates, if an additional sample unit and its covariates are given without the accompanying value of Y , the fitted model can be used to make a prediction of the value of Y .


Example. Simple linear regression: y = β0 + β1 x + ε

b0 + b1 xi (belonging to the fitted line) is the best linear approximation of yi through xi.

[Figure: scatterplot of y against x with the fitted regression line; the observed point (xi, yi) and the fitted point (xi, b0 + b1 xi) are marked.]

Residuals

The i-th residual is εi = yi − xti β, for i = 1, . . . , n.

It is a function of the parameters β.

Estimate of the parameters – a least squares problem

The estimate of β minimizes the sum of squares of the residuals

∑_{i=1}^n εi2
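To make the least squares problem concrete, a minimal sketch in R (not from the original slides), using the built-in cars data as a stand-in for the course data: minimising the sum of squared residuals numerically gives, up to numerical error, the same coefficients as lm.

# Sketch: least squares as a minimisation problem (built-in cars data, hypothetical stand-in)
ssr <- function(beta) sum((cars$dist - beta[1] - beta[2] * cars$speed)^2)  # sum of squared residuals
fit_num <- optim(c(0, 0), ssr)            # numerical minimisation, starting from (0, 0)
fit_lm  <- lm(dist ~ speed, data = cars)  # least squares via lm
fit_num$par                               # approximately equal to coef(fit_lm)
coef(fit_lm)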


2. Linear model on a sample

The parameter βk gives the “importance” of the explanatory variable Xk in the approximation of the response.

The β parameters are also called “effects”.

The aim is to study the parameters β

• estimating them

• computing confidence intervals

• carrying out tests


Estimator of the parameters

Let Y = (Y1, . . . , Yn) be independent sample variables.

The covariates X1, . . . , Xp−1 are assumed deterministic.

Let X be the matrix whose first column is a column of ones and whose i-th row contains the covariates’ values for the i-th observed unit. This is the data matrix (of the covariates) augmented by a column of ones.

The least squares estimator of the coefficients β = (β0, . . . , βp−1) is

B = (XtX)−1XtY

where B is a vector of estimators: B = (B0, . . . , Bp−1).
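As a check, the closed form can be computed directly from the data matrix; a minimal sketch on R's built-in trees data (a stand-in for the course data, not part of the original slides):

# Sketch: closed-form least squares estimate B = (X'X)^{-1} X'Y
X <- cbind(1, trees$Girth, trees$Height)   # data matrix augmented by a column of ones
y <- trees$Volume
b <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
cbind(b, coef(lm(Volume ~ Girth + Height, data = trees)))  # same values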


Response variable with normal distribution

Assume εi ∼ N(0, σ2), with σ unknown, and cov(εi, εj) = 0 for i ≠ j. Then Yi ∼ N(µi, σ2) and cov(Yi, Yj) = 0 for i ≠ j,

where µi = xti β = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1.

The responses Y1, . . . , Yn

• are independent because normally distributed and with zero covariances.

• are not identically distributed because the mean value of Yi depends on the covariates of the i-th sample unit.

The unbiased estimator of σ2 used is:

S2 = (1 / (n − p)) ∑_{i=1}^n (Yi − xti B)2
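A minimal sketch of this estimator in R (built-in trees data as a stand-in): s2 computed by hand coincides with the squared residual standard error reported by summary.

# Sketch: unbiased estimate of sigma^2
fit <- lm(Volume ~ Girth + Height, data = trees)
n <- nrow(trees); p <- length(coef(fit))
s2 <- sum(residuals(fit)^2) / (n - p)           # S^2 = sum of squared residuals / (n - p)
c(manual = s2, from_summary = summary(fit)$sigma^2)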


3. Inference on the coefficients β of the model

Point estimators

Recall that the least squares estimator of the model coefficients β0, . . . , βp−1 is B = (XtX)−1XtY, where B is a vector of estimators: B = (B0, . . . , Bp−1).

The variance/covariance matrix of the estimators B is:

V(B) = σ2(XtX)−1

and can be estimated using S2, point estimator of σ2.

We denote by σ̃k2 the variance of Bk and by S̃k2 its estimator, k = 0, . . . , p − 1.

The point estimators B of the model coefficients β are linear combinations of the sample variables (Y1, . . . , Yn); hence they are normal random variables.

Bk ∼ N(βk, σ̃k2) and Tk = (Bk − βk) / S̃k ∼ tn−p
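The estimated standard deviations s̃k can be recovered from s2 (XtX)−1; a minimal sketch (built-in trees data as a stand-in, not from the original slides):

# Sketch: standard errors of the coefficients from s^2 (X'X)^{-1}
fit <- lm(Volume ~ Girth + Height, data = trees)
X  <- model.matrix(fit)                        # data matrix with the column of ones
s2 <- summary(fit)$sigma^2
se <- sqrt(diag(s2 * solve(t(X) %*% X)))       # estimated standard deviations of the B's
cbind(manual = se, from_summary = summary(fit)$coefficients[, "Std. Error"])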


Example. Simple regression. Height of poplars

We want to express the height (m) of a variety of poplars as a linear function of the diameter (cm).

Diametro Altezza
2.23 3.76
2.12 3.15
1.06 1.85
...

> poplar=read.delim("DATA/pioppi.txt",header =T);attach(poplar)

> regr_H_D=lm(Altezza~Diametro,poplar[1:50,])

> summary(regr_H_D)


Call:
lm(formula = Altezza ~ Diametro, data = poplar[1:50, ])

Residuals:
    Min      1Q  Median      3Q     Max
-14.001  -2.686   0.797   2.937   8.396

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    8.045      2.487   3.234  0.00221 **
Diametro      14.560      1.082  13.460  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.933 on 48 degrees of freedom
Multiple R-squared: 0.7906,  Adjusted R-squared: 0.7862
F-statistic: 181.2 on 1 and 48 DF,  p-value: < 2.2e-16

Consider now only the Coefficients table:

- column Estimate: estimate bk of the parameter βk

- column Std.Error: estimate of the standard deviation of Bk


Confidence interval for βk

(Bk − t1−α/2 S̃k, Bk + t1−α/2 S̃k)

From the sample values bk and s̃k we get the sample confidence interval.

Sample confidence intervals in R

> confint(regr_H_D)

2.5 % 97.5 %

(Intercept)  3.043746 13.04638
Diametro    12.385520 16.73548
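The same intervals can be reproduced from bk ± t1−α/2 s̃k; a minimal sketch (built-in trees data as a stand-in for the poplar data, not from the original slides):

# Sketch: 95% confidence intervals for the coefficients, computed by hand
fit <- lm(Volume ~ Girth, data = trees)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
tq  <- qt(0.975, df = fit$df.residual)         # t_{1-alpha/2} with n - p degrees of freedom
cbind(lower = est - tq * se, upper = est + tq * se)
confint(fit)                                   # same intervals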


Test on the βk (individually)

The k-th covariate Xk has an effect on the response if βk is different from 0.

H0: βk = 0 and H1: βk ≠ 0

Under H0, Tk = Bk / S̃k ∼ tn−p

Given α, the rejection region is R0 = (−∞, −t1−α/2 s̃k) ∪ (t1−α/2 s̃k, +∞).

In the Coefficients table:

- column t value: tk = bk / s̃k
- column Pr(>|t|): p-value of tk

In the example, there is strong evidence to reject H0 for both coefficients β0 and β1.

In fact the sample confidence intervals do not contain 0 and the p-values are very small.
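The columns t value and Pr(>|t|) can be reproduced directly from the estimates and standard errors; a minimal sketch (built-in trees data as a stand-in):

# Sketch: t statistic and two-sided p-value for H0: beta_k = 0
fit <- lm(Volume ~ Girth, data = trees)
cf  <- coef(summary(fit))
tk  <- cf[, "Estimate"] / cf[, "Std. Error"]   # column "t value"
pk  <- 2 * pt(-abs(tk), df = fit$df.residual)  # column "Pr(>|t|)"
cbind(t_manual = tk, p_manual = pk, cf[, c("t value", "Pr(>|t|)")])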


4. Inference on the mean values µi of the responses

The mean value of the sample variable Yi is µi = xtiβ

Point estimator for µi (computed from the point estimators B): Ŷi = xti B

The variance/covariance matrix of the estimators Ŷ is:

V(Ŷ) = σ2 X(XtX)−1Xt

and can be estimated using S2, point estimator of σ2.

We denote by σi2 the variance of Ŷi and by Si2 its estimator.

Ŷi ∼ N(µi, σi2) and Ti = (Ŷi − µi) / Si ∼ tn−p


Confidence interval for µi



(Ŷi − t1−α/2 Si, Ŷi + t1−α/2 Si)

Sample confidence intervals in R

> round(predict(regr_H_D, interval="confidence"),1)
   fit  lwr  upr
1 25.1 22.4 27.7
2 31.1 29.1 33.0
3 23.5 20.6 26.3
4 23.0 20.1 26.0
5 25.2 22.6 27.9
6 28.3 26.0 30.5
7 27.0 24.6 29.4
8 29.0 26.8 31.2
...
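The same intervals follow from Ŷi ± t1−α/2 Si, where Si is the standard error returned by predict with se.fit = TRUE; a minimal sketch (built-in trees data as a stand-in):

# Sketch: confidence interval for the mean response, computed from se.fit
fit <- lm(Volume ~ Girth, data = trees)
pr  <- predict(fit, se.fit = TRUE)
tq  <- qt(0.975, df = fit$df.residual)
head(cbind(fit = pr$fit, lwr = pr$fit - tq * pr$se.fit, upr = pr$fit + tq * pr$se.fit))
head(predict(fit, interval = "confidence"))    # same three columns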


5. Analysis of the residuals for the goodness-of-fit

The residuals εi are estimated by Ei = Yi − Ŷi

The variance/covariance matrix of the estimators E is:

V(E) = σ2 (I − X(XtX)−1Xt)

and can be estimated using S2, point estimator of σ2. We denote by σi∗∗2 the variance of Ei and by Si∗∗2 its estimator.

Ei ∼ N(0, σi∗∗2) and Ti = Ei / Si∗∗ ∼ tn−p

The normality of the response variable is checked not through y1, . . . , yn but through the standardized residuals ti = (yi − ŷi)/si∗∗, i = 1, . . . , n. In fact y1, . . . , yn are single observations of n random variables with different means, while t1, . . . , tn can be treated as a sample of size n from one common distribution.
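A minimal sketch of how the standardized residuals are obtained (built-in trees data as a stand-in); the leverages hii are the diagonal of X(XtX)−1Xt:

# Sketch: standardized residuals t_i = e_i / (s * sqrt(1 - h_ii))
fit <- lm(Volume ~ Girth, data = trees)
s   <- summary(fit)$sigma
h   <- hatvalues(fit)                          # diagonal of X (X'X)^{-1} X'
t_i <- residuals(fit) / (s * sqrt(1 - h))
all.equal(t_i, rstandard(fit))                 # TRUE: same as R's standardized residuals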


The scatterplot of the residuals against the fitted values should show neither a trend nor a “form”: the cloud should be homogeneous around zero. If not, a transformation of either the response or the covariates can be performed.

Example. Poplar (continued)

[Figure: left, standardized residuals against fitted values; right, the R diagnostic plots “Residuals vs Fitted” and “Normal Q-Q” for regr_H_D.]

plot(predict(regr_H_D),rstandard(regr_H_D),pch=16,cex.axis=1.5,
     xlab="fitted values",ylab="standardized residuals",cex.lab=1.5)
abline(h=0)
par(mfrow=c(1,2)); plot(regr_H_D, which = 1:2)

Tukey's five-number summary of the residuals indicates that they are fairly symmetric about zero.

Residuals:

Min 1Q Median 3Q Max

-14.001 -2.686 0.797 2.937 8.396


The R2 index for the goodness-of-fit

The index has two equivalent interpretations:

• it is the ratio between the variance of the fitted values and the variance of the sample values:

R2 = ∑(ŷi − ȳ)2 / ∑(yi − ȳ)2

• it is the squared correlation between the fitted values and the sample values:

R2 = ρ2(y, ŷ)
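Both interpretations are easy to verify numerically; a minimal sketch (built-in trees data as a stand-in, not from the original slides):

# Sketch: the two equivalent forms of R^2
fit  <- lm(Volume ~ Girth, data = trees)
y    <- trees$Volume; yhat <- fitted(fit)
r2_a <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)  # ratio of variances
r2_b <- cor(y, yhat)^2                                  # squared correlation
c(r2_a, r2_b, summary(fit)$r.squared)                   # all equal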


Example. Oxygen consumption in athletes (continued)

> atleti=read.table("C:/DATA/atleti.txt",header=T);attach(atleti)

> regr_oxigen=lm(ossigeno~eta+peso+tempo+pulsferm+pulsmed+pulsmax)

> summary(regr_oxigen)

Goodness-of-fit

• Analysis of the residuals

plot(predict(regr_oxigen),rstandard(regr_oxigen),pch=16,cex.axis=1.5,
     xlab="fitted values",ylab="standardized residuals",cex.lab=1.5)
abline(h=0)

par(mfrow=c(1,2));plot(regr_oxigen,which = 1:2)

[Figure: standardized residuals against fitted values, and the diagnostic plots “Residuals vs Fitted” and “Normal Q-Q” for regr_oxigen.]

• R2 index (in the output of summary(regr_oxigen))

Multiple R-squared: 0.8487


Inference on the model coefficients

(in the output of summary(regr_oxigen))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.93448   12.40326   8.299 1.64e-08 ***
eta          -0.22697    0.09984  -2.273  0.03224 *
peso         -0.07418    0.05459  -1.359  0.18687
tempo        -2.62865    0.38456  -6.835 4.54e-07 ***
pulsferm     -0.02153    0.06605  -0.326  0.74725
pulsmed      -0.36963    0.11985  -3.084  0.00508 **
pulsmax       0.30322    0.13650   2.221  0.03601 *
---

Considering the coefficients individually:

there is no evidence to reject the “non influence” (on the oxygen consumption) of the variables peso and pulsferm, there is strong evidence to reject it for eta and pulsmax, and very strong evidence for tempo and pulsmed.

Interpretation: removing pulsferm the model is still good; removing peso the model is still good. What about a model without both?


6. Test for a subset of coefficients

The test hypotheses are

H0 : βi1 = . . . = βiq = 0 and H1 : at least one ≠ 0

The test is carried out comparing
- the sum of squares of the residuals in the reduced model, SSR
- the sum of squares of the residuals in the complete model, SSC
(SSR ≥ SSC always)

The test statistic is the relative difference of the two sums of squares (multiplied by a constant linked to the degrees of freedom of the reduced and complete models):

F = [(SSR − SSC) / q] / [SSC / (n − p)] ∼ F[q, n−p]

(Fisher distribution with q and n − p degrees of freedom)

The test is one-sided (right tail) because large sample values of F indicate a large difference between the complete and reduced models.
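A minimal sketch of the same statistic computed from the two residual sums of squares (built-in trees data as a stand-in, dropping Height from the complete model, so q = 1):

# Sketch: F test on a subset of coefficients, from the two sums of squares
complete <- lm(Volume ~ Girth + Height, data = trees)
reduced  <- lm(Volume ~ Girth, data = trees)
SSC <- sum(residuals(complete)^2); SSR <- sum(residuals(reduced)^2)
q <- 1; n <- nrow(trees); p <- length(coef(complete))
Fstat <- ((SSR - SSC) / q) / (SSC / (n - p))
c(F = Fstat, p_value = pf(Fstat, q, n - p, lower.tail = FALSE))
anova(reduced, complete)                       # same F and p-value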


Example. Oxygen consumption in athletes (continued)

The covariates peso and pulsferm are individually non-influential.

What about a model without both these variables?

> reduced = lm(ossigeno~eta+tempo+pulsmed+pulsmax)

> complete = lm(ossigeno~eta+peso+tempo+pulsferm+pulsmed+pulsmax)

> anova(reduced,complete)
Analysis of Variance Table

Model 1: ossigeno ~ eta + tempo + pulsmed + pulsmax
Model 2: ossigeno ~ eta + peso + tempo + pulsferm + pulsmed + pulsmax
  Res.Df    RSS Df Sum of Sq    F Pr(>F)
1     26 138.93
2     24 128.84  2    10.092 0.94 0.4045

There is evidence to retain H0, that is the coefficients of peso and pulsferm are null. The reduced model approximates the response as well as the complete model.


Let’s analyze the reduced model

Call:
lm(formula = ossigeno ~ eta + tempo + pulsmed + pulsmax)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  98.14789   11.78569   8.328 8.26e-09 ***
eta          -0.19773    0.09564  -2.068  0.04877 *
tempo        -2.76758    0.34054  -8.127 1.31e-08 ***
pulsmed      -0.34811    0.11750  -2.963  0.00644 **
pulsmax       0.27051    0.13362   2.024  0.05330 .
---

Residual standard error: 2.312 on 26 degrees of freedom
Multiple R-squared: 0.8368,  Adjusted R-squared: 0.8117

In this model, there is only weak evidence to reject that the coefficient of pulsmax is zero.

Exercise: check whether a reduced model without peso, pulsferm and pulsmax approximates the oxygen consumption as well as the complete model.


Test for the nullity of all the coefficients except the constant

Are the covariates, all together, useful to explain the response?

Test against the constant model.

In R (last row in the output of summary(regr_oxigen) for the complete model):

F-statistic: 22.43 on 6 and 24 DF, p-value: 9.715e-09

The reduced model has the constant only. Let SS0 be the sum of squares of the residuals. The test statistic is:

F = [(SS0 − SSC) / (p − 1)] / [SSC / (n − p)] ∼ F[p−1, n−p]

In the example of the oxygen consumption:

- p = 7, n = 31; degrees of freedom: 6 and 24
- F-statistic: 22.43
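The same statistic can be reproduced by comparing the fitted model with the constant-only model; a minimal sketch (built-in trees data as a stand-in for the oxygen data):

# Sketch: overall F statistic against the constant-only model
complete <- lm(Volume ~ Girth + Height, data = trees)
constant <- lm(Volume ~ 1, data = trees)
SS0 <- sum(residuals(constant)^2); SSC <- sum(residuals(complete)^2)
n <- nrow(trees); p <- length(coef(complete))
Fstat <- ((SS0 - SSC) / (p - 1)) / (SSC / (n - p))
c(manual = Fstat, from_summary = unname(summary(complete)$fstatistic["value"]))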


7. Prediction of the response and prediction error

Consider the linear model Yi = xtiβ + εi, for i = 1, . . . , n.

Let B be the point estimators of the coefficients based on the n sample units.

An additional sample unit with covariates x0 = (1, x01, . . . , x0,p−1) is given without the corresponding sample value of the response.

For the new observation consider the model:

Y0 = xt0β + ε0 = µ0 + ε0 with the same coefficients β as above.

The mean value of Y0, µ0, is predicted using the point estimator xt0B.


• Confidence interval for µ0, the mean of the new “sample” variable Y0

It holds:

E(xt0 B) = xt0 β and V(xt0 B) = σ2 v02, where v02 = xt0 (XtX)−1 x0.

Then a confidence interval for µ0 is:

(xt0 B − t1−α/2 S v0, xt0 B + t1−α/2 S v0)

• Prediction interval for Y0

1 − α = P( xt0 B − t1−α/2 S √(1 + v02) < Y0 < xt0 B + t1−α/2 S √(1 + v02) )

Remark: it is not a confidence interval because it is not about a parameter.

Note the semi-ranges of the two intervals: t1−α/2 S v0 and t1−α/2 S √(1 + v02).
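A minimal sketch contrasting the two intervals for a new unit (built-in cars data as a stand-in; the new value x0 = (1, 23) is hypothetical):

# Sketch: confidence interval for mu_0 and prediction interval for Y_0 at a new x_0
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 23)                  # hypothetical new unit
predict(fit, new, interval = "confidence")     # interval for the mean mu_0
predict(fit, new, interval = "prediction")     # wider interval for the response Y_0
# by hand: x0'b -/+ t * S * v0 and x0'b -/+ t * S * sqrt(1 + v0^2)
X   <- model.matrix(fit); x0 <- c(1, 23)
v02 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
S   <- summary(fit)$sigma; tq <- qt(0.975, fit$df.residual)
mu0 <- sum(x0 * coef(fit))
c(mu0 - tq * S * sqrt(v02),     mu0 + tq * S * sqrt(v02))      # confidence
c(mu0 - tq * S * sqrt(1 + v02), mu0 + tq * S * sqrt(1 + v02))  # prediction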


Example. Growth of mice

Growth in percent of mice subjected to a special diet is analyzed.

dose growth

1 10 73

2 10 78

3 15 85

4 20 90

5 20 91

6 25 87

7 25 86

8 25 91

9 30 75

10 35 65

11 40 NA

12 45 NA

[Figure: scatterplot of growth against dose.]

We want to predict the growth for the last doses.

A second-order polynomial model is considered:

growth = β0 + β1 dose + β2 dosesq + ε [dosesq=dose^2]


> topi=read.table("C:/DATA/topi.txt",header=T,na.string=".");attach(topi)

> dosesq=dose^2

> regr=lm(growth ~ dose + dosesq, subset=c(1:10))

Call:
lm(formula = growth ~ dose + dosesq, subset = c(1:10))

Residuals:
    Min      1Q  Median      3Q     Max
-3.6377 -1.2937 -0.1396  1.4450  3.5665

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.65744    5.61793   6.347 0.000386 ***
dose          5.26290    0.55802   9.431 3.14e-05 ***
dosesq       -0.12767    0.01281  -9.966 2.19e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.541 on 7 degrees of freedom
Multiple R-squared: 0.9364,  Adjusted R-squared: 0.9183
F-statistic: 51.56 on 2 and 7 DF,  p-value: 6.478e-05


> conf_int=predict(regr,topi,interval="confidence")

> pred_int=predict(regr,topi, interval="predict")

> cbind(dose, growth, round(conf_int,2),round(pred_int[,2:3],2))
   dose growth   fit    lwr   upr    lwr   upr
1    10     73 75.52  71.52 79.52  68.30 82.74
2    10     78 75.52  71.52 79.52  68.30 82.74
3    15     85 85.87  83.33 88.42  79.35 92.40
4    20     90 89.85  87.23 92.47  83.29 96.40
5    20     91 89.85  87.23 92.47  83.29 96.40
6    25     87 87.43  84.90 89.96  80.91 93.95
7    25     86 87.43  84.90 89.96  80.91 93.95
8    25     91 87.43  84.90 89.96  80.91 93.95
9    30     75 78.64  75.79 81.49  71.99 85.29
10   35     65 63.46  58.09 68.82  55.40 71.51
11   40     NA 41.89  31.94 51.85  30.27 53.52
12   45     NA 13.95  -2.27 30.17  -3.35 31.25


[Figure: observed growth, predicted values, 95% confidence band and 95% prediction band plotted against dose.]

- dot: observed data

- star and solid red line: predicted value for growth

- diamond and dashed blue line: 95% confidence interval for the means of the response

- empty dot and dashed black line: 95% prediction interval for the response


R coding for plots

plot(dose,growth,pch=16,xlim=c(10,45),ylim=c(-10,100),ylab=" ",cex=1.5)
par(new=T)

plot(dose,conf_int[,1],pch="*",type="b",col="red",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lwd=2,cex=1.5)

par(new=T)

plot(dose,conf_int[,2],pch=18,type="b",col="blue",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=4,lwd=2,cex=1.5)

par(new=T)

plot(dose,conf_int[,3],pch=18,type="b",col="blue",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=4,lwd=2,cex=1.5)

par(new=T)

plot(dose,pred_int[,2],pch=21,type="b",col="black",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=5,lwd=2,cex=1.5)

par(new=T)

plot(dose,pred_int[,3],pch=21,type="b",col="black",xlab=" ",ylab=" ", xlim=c(10,45),ylim=c(-10,100),lty=5,lwd=2,cex=1.5)
