Multiple linear model in R and SAS


In an anthropometric study, the aim is to describe body density (measured by immersion) [here multiplied by 100] as a linear combination of weight and of the sizes of the abdomen, thigh, biceps and wrist (variables peso, addome, coscia, bicipiti and polso in the code).

1 R software

1.1 Reading the data and the lm function

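setwd("<path>")

antrop=read.table("antrop_no_outlier.txt",header=TRUE)  # first line of the file holds the variable names
attach(antrop)  # makes the columns accessible by name

r_antrop=lm(densita~peso+addome+coscia+bicipiti+polso)
r_antrop[1:10]  # first ten components of the lm object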

The object built by the lm function is made up of several components; the main ones are listed below (only the first values are shown).

$coefficients
 (Intercept)         peso       addome       coscia     bicipiti        polso
118.88942260   0.03667805  -0.22281208  -0.04080249  -0.07159943   0.29922878

$residuals
           1            2            3            4            5            6            7            8
 1.098293551  0.912432470  0.421358358 -0.378276157 -0.923959775 -0.458556576  0.420540220  1.177998423

$rank
[1] 6

$fitted.values
       1        2        3        4        5        6        7        8        9       10       11
105.9817 107.6176 107.0886 103.7783 105.9440 105.9486 106.6195 107.8220 107.0413 107.8450 106.8790

$df.residual
[1] 232

$model
  densita   peso addome coscia bicipiti polso
1  1.0708 154.25   85.2   59.0     32.0  17.1
2  1.0853 173.25   83.0   58.7     30.5  18.2
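The components listed above can also be read through R's extractor functions, which is usually preferable to indexing the list directly. The following is a minimal sketch on synthetic data: the data frame dat and the variables y, x1, x2 are illustrative assumptions, not the anthropometric data.

```r
# Synthetic data (hypothetical names, not from the handout)
set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = dat)

names(fit)                # names of the components of the lm object
b   <- coef(fit)          # same content as fit$coefficients
e   <- residuals(fit)     # same content as fit$residuals
yh  <- fitted(fit)        # same content as fit$fitted.values
dfr <- df.residual(fit)   # n - p, here 30 - 3
```

Passing data = dat to lm avoids the need for attach and keeps the variables tied to one data frame.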

1.2 Some functions that operate on the object built by lm

• summary

> summary(r_antrop)

Call:

lm(formula = densita ~ peso + addome + coscia + bicipiti + polso)

Residuals:

Min 1Q Median 3Q Max

-2.0644 -0.7900 0.0127 0.6808 3.6582

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 118.889423   1.973971  60.229  < 2e-16 ***
peso          0.036678   0.007879   4.655 5.44e-06 ***
addome       -0.222812   0.013400 -16.628  < 2e-16 ***
coscia       -0.040802   0.026691  -1.529  0.12771
bicipiti     -0.071599   0.036281  -1.973  0.04963 *
polso         0.299229   0.106381   2.813  0.00533 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9838 on 232 degrees of freedom
Multiple R-squared:  0.7278, Adjusted R-squared:  0.7219
F-statistic: 124.1 on 5 and 232 DF,  p-value: < 2.2e-16


Some of the statistics shown in the output above can be extracted as follows.

> s=summary(r_antrop)

> s$coefficients ## returns a matrix
                Estimate  Std. Error    t value      Pr(>|t|)
(Intercept) 118.88942260 1.973970655  60.228566 1.234056e-143
peso          0.03667805 0.007878679   4.655355  5.442965e-06
addome       -0.22281208 0.013399703 -16.628135  2.074185e-41
coscia       -0.04080249 0.026691366  -1.528677  1.277067e-01
bicipiti     -0.07159943 0.036281017  -1.973468  4.962911e-02
polso         0.29922878 0.106381400   2.812792  5.332192e-03

> s$sigma
[1] 0.983766

> s$df ## returns a vector
[1]   6 232   6

> s$r.squared
[1] 0.7277806

> s$fstatistic ## returns a vector
   value    numdf    dendf
124.0507   5.0000 232.0000
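As a cross-check of what these components mean, the sketch below recomputes sigma, R-squared and the F statistic from the residuals of a fitted model. It uses synthetic data (dat, y, x1, x2 are illustrative names, not from the handout) and the standard definitions for a model with an intercept.

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = dat)
s <- summary(fit)

rss <- sum(residuals(fit)^2)            # residual sum of squares
tss <- sum((dat$y - mean(dat$y))^2)     # total (corrected) sum of squares
p   <- s$df[1]                          # number of estimated coefficients
dfe <- s$df[2]                          # residual degrees of freedom, n - p

sigma_hat <- sqrt(rss / dfe)                          # compare with s$sigma
r2        <- 1 - rss / tss                            # compare with s$r.squared
f         <- (r2 / (p - 1)) / ((1 - r2) / dfe)        # compare with s$fstatistic["value"]
```

With the values printed above, the same formula gives (0.7277806/5)/((1 - 0.7277806)/232), which agrees with the reported F of 124.05.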

• confint (realized confidence intervals for the parameters β)

> round(confint(r_antrop),3)
              2.5 %  97.5 %
(Intercept) 115.000 122.779
peso          0.021   0.052
addome       -0.249  -0.196
coscia       -0.093   0.012
bicipiti     -0.143   0.000
polso         0.090   0.509
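Each interval is the estimate plus or minus a Student t quantile times the standard error, with the residual degrees of freedom of the model (for 232 degrees of freedom the 0.975 quantile is about 1.97). A minimal sketch on synthetic data, with illustrative names dat, y, x1, x2 that are not part of the handout:

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = dat)

tab <- coef(summary(fit))                   # matrix of estimates, std. errors, t, p
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
tq  <- qt(0.975, df = df.residual(fit))     # t quantile for a 95% interval

ci_manual <- cbind(lower = est - tq * se, upper = est + tq * se)
ci_manual
confint(fit)                                # same limits, computed by R
```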

• model.matrix (matrix of the explanatory variables, including the constant)

> model.matrix(r_antrop)

(Intercept) peso addome coscia bicipiti polso

1 1 154.25 85.2 59.0 32.0 17.1

2 1 173.25 83.0 58.7 30.5 18.2

3 1 184.75 86.4 60.1 32.4 18.2

• predict with option interval="confidence"

(realized confidence intervals for the expected values Xβ of the response variable)

> round(predict(r_antrop, interval="confidence")[1:4,],2)
     fit    lwr    upr
1 105.98 105.71 106.26
2 107.62 107.36 107.88
3 107.09 106.84 107.34
4 103.78 103.50 104.06

• predict with option interval="prediction"

(realized prediction intervals for the response variable)

> round(predict(r_antrop, interval="prediction")[1:4,],2)
     fit    lwr    upr
1 105.98 104.02 107.94
2 107.62 105.66 109.57
3 107.09 105.13 109.04
4 103.78 101.82 105.74
Warning message:
In predict.lm(r_antrop, interval = "prediction") :
  predictions on current data refer to _future_ responses
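The two kinds of interval share the same centre and t quantile and differ only in the standard error: the prediction interval adds the residual variance to the variance of the fitted mean. The sketch below checks this on synthetic data; dat, new, y, x1, x2 are illustrative names, and new observations are supplied through newdata, which also avoids the warning shown above.

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = dat)

new <- data.frame(x1 = c(0, 1), x2 = c(0, -1))   # two hypothetical new points
pr  <- predict(fit, newdata = new, se.fit = TRUE)
tq  <- qt(0.975, pr$df)

half_conf <- tq * pr$se.fit                               # half-width, mean response
half_pred <- tq * sqrt(pr$se.fit^2 + pr$residual.scale^2) # half-width, new response

conf <- predict(fit, newdata = new, interval = "confidence")
pred <- predict(fit, newdata = new, interval = "prediction")
```

For observation 1 in the output above, the standard error of the mean is about 0.14 and sigma about 0.98, which is why the prediction interval is so much wider than the confidence interval.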

• rstandard

> round(rstandard(r_antrop),2)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

1.13 0.94 0.43 -0.39 -0.95 -0.47 0.43 1.23 0.18 0.47 1.29 -0.36 1.05 0.52 0.71 -1.24 -0.49


• vcov (estimate of the variance-covariance matrix of the coefficient estimators)

Construction of the corresponding correlation matrix:

> covb=vcov(r_antrop)

> corb=diag(diag(covb)^(-1/2))%*%covb%*%diag(diag(covb)^(-1/2))

> rownames(corb)=rownames(covb);colnames(corb)=colnames(covb)

> round(corb,4)

            (Intercept)    peso  addome  coscia bicipiti   polso
(Intercept)      1.0000  0.7330 -0.3255 -0.5428  -0.1281 -0.8207
peso             0.7330  1.0000 -0.6516 -0.5539  -0.2613 -0.4466
addome          -0.3255 -0.6516  1.0000  0.0696   0.0776  0.0633
coscia          -0.5428 -0.5539  0.0696  1.0000  -0.2462  0.2429
bicipiti        -0.1281 -0.2613  0.0776 -0.2462   1.0000 -0.1321
polso           -0.8207 -0.4466  0.0633  0.2429  -0.1321  1.0000
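Base R also provides cov2cor, which performs the same rescaling in one call and keeps the row and column names. A minimal sketch on synthetic data (dat, y, x1, x2, x3 are illustrative names, not from the handout) comparing it with the explicit matrix product:

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(40)
fit <- lm(y ~ x1 + x2 + x3, data = dat)

V     <- vcov(fit)                      # covariance matrix of the estimators
d_inv <- diag(1 / sqrt(diag(V)))        # diagonal matrix of 1 / standard errors
cor_manual  <- d_inv %*% V %*% d_inv    # explicit rescaling, as in the text
cor_builtin <- cov2cor(V)               # same result, dimnames preserved
round(cor_builtin, 4)
```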

• anova

> anova(r_antrop)

Analysis of Variance Table

Response: densita

           Df  Sum Sq Mean Sq  F value    Pr(>F)
peso        1 316.221 316.221 326.7436 < 2.2e-16 ***
addome      1 266.776 266.776 275.6528 < 2.2e-16 ***
coscia      1   7.097   7.097   7.3333  0.007273 **
bicipiti    1   2.528   2.528   2.6121  0.107413
polso       1   7.657   7.657   7.9118  0.005332 **
Residuals 232 224.529   0.968
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
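The sums of squares in this table are sequential: each row measures the contribution of a variable given those listed before it. Only for the last variable does the F test coincide with the t test of summary (above, polso has p-value 0.005332 in both outputs and 2.8128 squared is 7.9118, whereas coscia has 0.007273 here and 0.12771 in summary). A minimal sketch on synthetic data, with illustrative names dat, y, x1, x2, x3:

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(40)
fit <- lm(y ~ x1 + x2 + x3, data = dat)

a  <- anova(fit)                  # sequential decomposition
tt <- coef(summary(fit))          # t tests, each given all the other variables

f_last <- a["x3", "F value"]      # F for the last variable entered
t_last <- tt["x3", "t value"]     # its t statistic
c(f_last, t_last^2)               # the two values coincide
```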

1.3 Construction of the variance decomposition table

> n_espl=length(r_antrop$coefficients)-1

> SS_M=sum(anova(r_antrop)[1:n_espl,2])

> DF_M=sum(anova(r_antrop)[1:n_espl,1])

> SS_E=anova(r_antrop)[n_espl+1,2]

> DF_E=anova(r_antrop)[n_espl+1,1]

> SS_T=SS_M+SS_E;DF_T=DF_M+DF_E

> tab_dec_var=rbind(cbind(DF_M,SS_M),cbind(DF_E,SS_E),cbind(DF_T,SS_T))

> rownames(tab_dec_var)=c("Model", "Residuals", "Total");

> colnames(tab_dec_var)=c("DF","SumOfSquares");

> tab_dec_var ## variance decomposition table

DF SumOfSquares

Model 5 600.2786

Residuals 232 224.5285

Total 237 824.8072

Test that all coefficients except the constant are zero: the sample value of the test statistic and the corresponding p-value are reported in the output of summary(r_antrop):

F-statistic: 124.1 on 5 and 232 DF, p-value: < 2.2e-16
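The same F value can be recovered from the decomposition table as the ratio of the two mean squares. The sketch below simply plugs in the numbers printed in the table above; the names f_all and p_all are illustrative.

```r
# Values copied from the variance decomposition table above
SS_M <- 600.2786; DF_M <- 5      # model
SS_E <- 224.5285; DF_E <- 232    # residuals

f_all <- (SS_M / DF_M) / (SS_E / DF_E)               # ratio of mean squares
p_all <- pf(f_all, DF_M, DF_E, lower.tail = FALSE)   # upper-tail p-value
round(f_all, 2)   # 124.05
```

Using lower.tail = FALSE instead of 1 - pf(...) avoids loss of precision when the p-value is very small.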

1.4 Test that a subset of q coefficients is zero

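First build the reduced model, then compute the sample value of the test statistic and its p-value:

\[ f = \frac{\left(\|e_R\|^2 - \|e_C\|^2\right)/q}{\|e_C\|^2/(n-p)} \]

where e_R and e_C are the residual vectors of the reduced and of the complete model (their squared norms are SS_E_R and SS_E_C in the code), q is the number of coefficients tested and n − p is the residual degrees of freedom of the complete model.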

> r_antrop_rid=lm(densita~peso+addome+polso)

> s_rid=summary(r_antrop_rid)

> SS_E_C=(s$sigma^2)*s$df[2]

> SS_E_R=(s_rid$sigma^2)*s_rid$df[2]


> df_num=s$df[1]-s_rid$df[1]

> f=((SS_E_R-SS_E_C)/df_num)/(s$sigma^2)

> pvalue=1-pf(f,df_num ,s$df[2])

> round(c(f,pvalue),4)
[1] 4.1076 0.0177
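R can carry out the same comparison directly by passing the reduced and the complete model to anova. The sketch below, on synthetic data with illustrative names (dat, y, x1, x2, x3, fit_red, fit_full), computes the statistic by hand and checks it against that call.

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(40)

fit_full <- lm(y ~ x1 + x2 + x3, data = dat)   # complete model
fit_red  <- lm(y ~ x1, data = dat)             # reduced model: x2 and x3 dropped

rss_c <- sum(residuals(fit_full)^2)            # squared norm of e_C
rss_r <- sum(residuals(fit_red)^2)             # squared norm of e_R
dfe   <- df.residual(fit_full)                 # n - p
q     <- df.residual(fit_red) - dfe            # number of coefficients tested

f <- ((rss_r - rss_c) / q) / (rss_c / dfe)
p <- pf(f, q, dfe, lower.tail = FALSE)

cmp <- anova(fit_red, fit_full)                # same test in one call
cmp
```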

1.5 Estimate of the variance of \hat{Y}

The estimator of the variance of \hat{Y}_i is the i-th diagonal term of the matrix P_V (the so-called "leverage") multiplied by S^2. Its sample realization is obtained as follows; note that the code returns the square root, that is, the estimated standard deviation of \hat{Y}_i.

> X=model.matrix(r_antrop)

> leverage = hat(X)

> sd_pred=s$sigma*sqrt(leverage)

> sd_pred

 [1] 0.140 0.132 0.128 0.142 0.154 0.107 0.129 0.215 0.166 0.166 0.218 0.124
[13] 0.132 0.231 0.206 0.189 0.168 0.188 0.159 0.160 0.172 0.108 0.152 0.122
...
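The leverages can also be obtained with hatvalues applied to the fitted model, and the resulting standard errors agree with those returned by predict with se.fit=TRUE. A minimal sketch on synthetic data (dat, y, x1, x2 are illustrative names), which also builds the diagonal of the projection matrix explicitly:

```r
# Synthetic data (hypothetical names)
set.seed(1)
dat <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = dat)

X <- model.matrix(fit)
h_manual <- diag(X %*% solve(crossprod(X)) %*% t(X))  # diagonal of X (X'X)^{-1} X'
h        <- hatvalues(fit)                            # same leverages from the fit

se_fit <- summary(fit)$sigma * sqrt(h)                # estimated sd of each fitted value
se_ref <- predict(fit, se.fit = TRUE)$se.fit          # the same quantity from predict
sum(h)                                                # trace of the projection matrix
```

The leverages sum to the number of columns of X, since the trace of a projection matrix equals its rank.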

1.6 Residual plots

• Default output

par(mfrow=c(2,2))  # 2 x 2 grid for the four diagnostic plots
plot(r_antrop)

[Figure: the four default diagnostic plots produced by plot(r_antrop). Panels: Residuals vs Fitted (residuals against fitted values); Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location (against fitted values); Residuals vs Leverage (standardized residuals against leverage, with Cook's distance). Observations 84, 211 and 210 are labelled in the first three panels; observations 84, 236 and 46 in the last.]

• "Custom" graphical output

plot(predict(r_antrop),rstandard(r_antrop),pch=16,cex.axis=1.5,

xlab="fitted values",ylab="standardized residuals",cex.lab=1.5);abline(h=0)

[Figure: standardized residuals against fitted values, with a horizontal line at 0.]


2 SAS software

2.1 Reading the data and proc reg

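data antrop;
infile "<path>.txt" EXPANDTABS firstobs=2;  /* firstobs=2 skips the header line */
input densita eta peso altezza collo torace addome fianchi coscia ginocch caviglia bicipiti avambr polso;
run;

/* corrb: correlation of the estimates; clb: confidence limits for the parameters;
   clm, cli: confidence limits for the mean and for an individual prediction;
   p, r: predicted values and residual analysis */
proc reg data=antrop;
model densita=peso addome coscia bicipiti polso/ corrb clb clm cli p r;
run;quit;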

The REG Procedure
Model: MODEL1
Dependent Variable: densita

Number of Observations Read    238
Number of Observations Used    238

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5        600.27862     120.05572    124.05   <.0001
Error             232        224.52855       0.96780
Corrected Total   237        824.80717

Root MSE            0.98377    R-Square   0.7278
Dependent Mean    105.63277    Adj R-Sq   0.7219
Coeff Var           0.93131

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1            118.88942          1.97397     60.23     <.0001   115.00022   122.77862
peso         1              0.03668          0.00788      4.66     <.0001     0.02116     0.05220
addome       1             -0.22281          0.01340    -16.63     <.0001    -0.24921    -0.19641
coscia       1             -0.04080          0.02669     -1.53     0.1277    -0.09339     0.01179
bicipiti     1             -0.07160          0.03628     -1.97     0.0496    -0.14308    -0.00011705
polso        1              0.29923          0.10638      2.81     0.0053     0.08963     0.50883

Correlation of Estimates

Variable Intercept peso addome coscia bicipiti polso

Intercept 1.0000 0.7330 -0.3255 -0.5428 -0.1281 -0.8207

peso 0.7330 1.0000 -0.6516 -0.5539 -0.2613 -0.4466

addome -0.3255 -0.6516 1.0000 0.0696 0.0776 0.0633

coscia -0.5428 -0.5539 0.0696 1.0000 -0.2462 0.2429

bicipiti -0.1281 -0.2613 0.0776 -0.2462 1.0000 -0.1321

polso -0.8207 -0.4466 0.0633 0.2429 -0.1321 1.0000

Output Statistics

Obs   Dependent   Predicted   Std Error         95% CL Mean          95% CL Predict      Residual   Std Error   Student
      Variable    Value       Mean Predict                                                          Residual    Residual
  1   107.1       105.9817    0.1398       105.7063  106.2571    104.0240  107.9394     1.0983     0.974       1.128
  2   108.5       107.6176    0.1318       107.3579  107.8773    105.6620  109.5731     0.9124     0.975       0.936
  3   107.5       107.0886    0.1283       106.8359  107.3414    105.1340  109.0433     0.4214     0.975       0.432
  4   103.4       103.7783    0.1425       103.4975  104.0590    101.8198  105.7368    -0.3783     0.973      -0.389
...

Sum of Residuals                  -6.8141E-12
Sum of Squared Residuals            224.52855
Predicted Residual SS (PRESS)       237.31357

2.2 Test that a subset of q coefficients is zero

Use the test statement in proc reg.

proc reg data=antrop;

model densita=peso addome coscia bicipiti polso;

test coscia, bicipiti;

run;quit;


The REG Procedure
Model: MODEL1

Test 1 Results for Dependent Variable densita

Source         DF   Mean Square   F Value   Pr > F
Numerator       2       3.97530      4.11   0.0177
Denominator   232       0.96780
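The SAS output can be reconciled with the R computation of section 1.4 by taking the ratio of the two mean squares printed above. The numerator mean square corresponds to (‖e_R‖² − ‖e_C‖²)/q with q = 2, and the denominator to ‖e_C‖²/(n − p), which is the error mean square of the complete model:

```latex
\[
f \;=\; \frac{\text{Numerator Mean Square}}{\text{Denominator Mean Square}}
  \;=\; \frac{3.97530}{0.96780} \;\approx\; 4.1076,
\qquad \text{with } 2 \text{ and } 232 \text{ degrees of freedom.}
\]
```

This is the value 4.1076 obtained in R, with the same p-value 0.0177.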

2.3 Residual plots - default

("custom" plots can be found in the lecture notes, e.g. on pp. 35-36)

ods pdf file="<path>\res_sas.pdf" notoc startpage=no nogtitle;

ods noproctitle; ods graphics / width=17cm; option nodate;

proc reg data=antrop;

model densita=peso addome coscia bicipiti polso;

run;quit;

ods pdf close;


[Figure: SAS panel "Fit Diagnostics for densita" (Model: MODEL1, dependent variable densita). Panels: Residual against Predicted Value; RStudent against Predicted Value; RStudent against Leverage; Residual against Quantile; densita against Predicted Value; Cook's D against Observation; histogram (Percent) of Residual; Fit–Mean and Residual against Proportion Less. Inset: Observations 238, Parameters 6, Error DF 232, MSE 0.9678, R-Square 0.7278, Adj R-Square 0.7219.]


[Figure: SAS panel "Residual by Regressors for densita" (Model: MODEL1, dependent variable densita): residuals plotted against polso, bicipiti, coscia, addome and peso.]