Multiple linear model in R and SAS
In an anthropometric study, we want to describe body density (measured by immersion) [here multiplied by 100] as a linear combination of weight (peso) and of the size of the abdomen (addome), thigh (coscia), biceps (bicipiti) and wrist (polso).
1 R software
1.1 Reading the data and the lm function
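setwd("<path>")
antrop=read.table("antrop_no_outlier.txt",header=TRUE)   # header=TRUE: the first row holds the variable names
attach(antrop)   # makes the columns accessible by name
r_antrop=lm(densita~peso+addome+coscia+bicipiti+polso)
r_antrop[1:10]   # first ten components of the fitted object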
The object built by the lm function is made up of several components; the main ones are listed below (only the first values are shown).
$coefficients
 (Intercept)         peso       addome       coscia     bicipiti        polso
118.88942260   0.03667805  -0.22281208  -0.04080249  -0.07159943   0.29922878
$residuals
           1            2            3            4            5            6            7            8
 1.098293551  0.912432470  0.421358358 -0.378276157 -0.923959775 -0.458556576  0.420540220  1.177998423
$rank
[1] 6
$fitted.values
       1        2        3        4        5        6        7        8        9       10       11
105.9817 107.6176 107.0886 103.7783 105.9440 105.9486 106.6195 107.8220 107.0413 107.8450 106.8790
$df.residual
[1] 232
$model
  densita   peso addome coscia bicipiti polso
1  1.0708 154.25   85.2   59.0     32.0  17.1
2  1.0853 173.25   83.0   58.7     30.5  18.2
1.2 Some functions that act on the object built by lm
• summary
> summary(r_antrop)
Call:
lm(formula = densita ~ peso + addome + coscia + bicipiti + polso)
Residuals:
Min 1Q Median 3Q Max
-2.0644 -0.7900 0.0127 0.6808 3.6582
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 118.889423   1.973971  60.229  < 2e-16 ***
peso          0.036678   0.007879   4.655 5.44e-06 ***
addome       -0.222812   0.013400 -16.628  < 2e-16 ***
coscia       -0.040802   0.026691  -1.529  0.12771
bicipiti     -0.071599   0.036281  -1.973  0.04963 *
polso         0.299229   0.106381   2.813  0.00533 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9838 on 232 degrees of freedom
Multiple R-squared: 0.7278, Adjusted R-squared: 0.7219
F-statistic: 124.1 on 5 and 232 DF, p-value: < 2.2e-16
The individual statistics reported in the output above can be extracted as follows.
> s=summary(r_antrop)
> s$coefficients ## returns a matrix
                Estimate  Std. Error    t value      Pr(>|t|)
(Intercept) 118.88942260 1.973970655  60.228566 1.234056e-143
peso          0.03667805 0.007878679   4.655355  5.442965e-06
addome       -0.22281208 0.013399703 -16.628135  2.074185e-41
coscia       -0.04080249 0.026691366  -1.528677  1.277067e-01
bicipiti     -0.07159943 0.036281017  -1.973468  4.962911e-02
polso         0.29922878 0.106381400   2.812792  5.332192e-03
> s$sigma
[1] 0.983766
> s$df ## returns a vector
[1]   6 232   6
> s$r.squared
[1] 0.7277806
> s$fstatistic ## returns a vector
   value    numdf    dendf
124.0507   5.0000 232.0000
• confint (realized confidence intervals for the parameters β)
> round(confint(r_antrop),3)
              2.5 %  97.5 %
(Intercept) 115.000 122.779
peso          0.021   0.052
addome       -0.249  -0.196
coscia       -0.093   0.012
bicipiti     -0.143   0.000
polso         0.090   0.509
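As a quick check of what confint computes, each interval is the estimate plus or minus a Student t quantile times the standard error. The sketch below redoes the interval for peso by hand, using the estimate, standard error and residual degrees of freedom printed by summary(r_antrop); the names est, se, df_res, t_q and ci_peso are illustrative and not part of the original analysis.

```r
# Values copied from the summary(r_antrop) output for the coefficient of peso
est    <- 0.03667805    # estimate
se     <- 0.007878679   # standard error
df_res <- 232           # residual degrees of freedom

# 95% interval: estimate -/+ t quantile * standard error
t_q     <- qt(0.975, df = df_res)
ci_peso <- est + c(-1, 1) * t_q * se
print(round(ci_peso, 3))
```

The two rounded limits should agree with the peso row of the confint output, 0.021 and 0.052.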
• model.matrix (matrix of the explanatory variables, including the constant column)
> model.matrix(r_antrop)
(Intercept) peso addome coscia bicipiti polso
1 1 154.25 85.2 59.0 32.0 17.1
2 1 173.25 83.0 58.7 30.5 18.2
3 1 184.75 86.4 60.1 32.4 18.2
• predict with option interval="confidence"
(realized confidence intervals for the expected values of the response variable, Xβ)
> round(predict(r_antrop, interval="confidence")[1:4,],2)
     fit    lwr    upr
1 105.98 105.71 106.26
2 107.62 107.36 107.88
3 107.09 106.84 107.34
4 103.78 103.50 104.06
• predict with option interval="prediction"
(realized prediction intervals for the response variable)
> round(predict(r_antrop, interval="prediction")[1:4,],2)
     fit    lwr    upr
1 105.98 104.02 107.94
2 107.62 105.66 109.57
3 107.09 105.13 109.04
4 103.78 101.82 105.74
Warning message:
In predict.lm(r_antrop, interval = "prediction") :
  predictions on current data refer to _future_ responses
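The two kinds of interval differ only in the standard error used: the confidence interval uses the standard error of the estimated mean, while the prediction interval also adds the error variance. The sketch below illustrates this on simulated data (the variables x1, x2, y and the data frame new are hypothetical, not the antrop variables) and predicts at new points through the newdata argument, which also avoids the warning shown above.

```r
# Simulated data: x1, x2 and y are hypothetical, not the antrop variables
set.seed(123)
n <- 40
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 + 3 * d$x1 - d$x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = d)

# Two new points at which the response is to be predicted
new <- data.frame(x1 = c(0.2, 0.8), x2 = c(0.5, 0.5))

# se.fit = TRUE also returns the standard error of the estimated mean,
# the residual degrees of freedom and the residual standard deviation
p   <- predict(fit, newdata = new, se.fit = TRUE)
t_q <- qt(0.975, df = p$df)

# Confidence interval for the expected value: uses se.fit only
conf_int <- cbind(fit = p$fit,
                  lwr = p$fit - t_q * p$se.fit,
                  upr = p$fit + t_q * p$se.fit)

# Prediction interval for a new response: adds the error variance
se_pred  <- sqrt(p$residual.scale^2 + p$se.fit^2)
pred_int <- cbind(fit = p$fit,
                  lwr = p$fit - t_q * se_pred,
                  upr = p$fit + t_q * se_pred)

print(conf_int)
print(pred_int)
```

Both matrices should coincide with what predict returns for interval="confidence" and interval="prediction" on the same newdata, and the prediction intervals are always the wider ones.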
• rstandard
> round(rstandard(r_antrop),2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1.13 0.94 0.43 -0.39 -0.95 -0.47 0.43 1.23 0.18 0.47 1.29 -0.36 1.05 0.52 0.71 -1.24 -0.49
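A standardized residual is the raw residual divided by its estimated standard deviation, s·sqrt(1 − h_i), where h_i is the leverage of observation i. The sketch below checks this on simulated data; the variables x1, x2, y and the object names are hypothetical, not part of the original analysis.

```r
# Simulated data: x1, x2 and y are hypothetical, not the antrop variables
set.seed(42)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

e <- residuals(fit)        # raw residuals
h <- hatvalues(fit)        # leverages (diagonal of the projection matrix)
s <- summary(fit)$sigma    # residual standard error

# Standardized residuals computed by hand
r_manual <- e / (s * sqrt(1 - h))
print(all.equal(r_manual, rstandard(fit)))
```

The same relation can be read off the SAS "Output Statistics" table in section 2.1: for observation 1, the residual 1.0983 divided by its standard error 0.974 gives approximately the Student residual 1.128.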
• vcov (estimate of the variance-covariance matrix of the coefficient estimators)
Construction of the corresponding correlation matrix:
> covb=vcov(r_antrop)
> corb=diag(diag(covb)^(-1/2))%*%covb%*%diag(diag(covb)^(-1/2))
> rownames(corb)=rownames(covb);colnames(corb)=colnames(covb)
> round(corb,4)
            (Intercept)    peso  addome  coscia bicipiti   polso
(Intercept)      1.0000  0.7330 -0.3255 -0.5428  -0.1281 -0.8207
peso             0.7330  1.0000 -0.6516 -0.5539  -0.2613 -0.4466
addome          -0.3255 -0.6516  1.0000  0.0696   0.0776  0.0633
coscia          -0.5428 -0.5539  0.0696  1.0000  -0.2462  0.2429
bicipiti        -0.1281 -0.2613  0.0776 -0.2462   1.0000 -0.1321
polso           -0.8207 -0.4466  0.0633  0.2429  -0.1321  1.0000
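Base R also provides cov2cor, which performs the same rescaling of a covariance matrix in one call and keeps the row and column names. The sketch below compares it with the explicit matrix product on simulated data; the variables x1, x2, y and the object names are hypothetical.

```r
# Simulated data: x1, x2 and y are hypothetical, not the antrop variables
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

covb <- vcov(fit)

# Explicit construction: D^(-1/2) %*% covb %*% D^(-1/2), D = diag(covb)
d_inv       <- diag(diag(covb)^(-1/2))
corb_manual <- d_inv %*% covb %*% d_inv
dimnames(corb_manual) <- dimnames(covb)

# Same result with the built-in helper
corb <- cov2cor(covb)
print(round(corb, 4))
```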
• anova
> anova(r_antrop)
Analysis of Variance Table
Response: densita
           Df  Sum Sq Mean Sq  F value    Pr(>F)
peso        1 316.221 316.221 326.7436 < 2.2e-16 ***
addome      1 266.776 266.776 275.6528 < 2.2e-16 ***
coscia      1   7.097   7.097   7.3333  0.007273 **
bicipiti    1   2.528   2.528   2.6121  0.107413
polso       1   7.657   7.657   7.9118  0.005332 **
Residuals 232 224.529   0.968
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.3 Building the variance decomposition table
> n_espl=length(r_antrop$coefficients)-1
> SS_M=sum(anova(r_antrop)[1:n_espl,2])
> DF_M=sum(anova(r_antrop)[1:n_espl,1])
> SS_E=anova(r_antrop)[n_espl+1,2]
> DF_E=anova(r_antrop)[n_espl+1,1]
> SS_T=SS_M+SS_E;DF_T=DF_M+DF_E
> tab_dec_var=rbind(cbind(DF_M,SS_M),cbind(DF_E,SS_E),cbind(DF_T,SS_T))
> rownames(tab_dec_var)=c("Model", "Residuals", "Total");
> colnames(tab_dec_var)=c("DF","SumOfSquares");
> tab_dec_var ## Variance decomposition table
           DF SumOfSquares
Model       5     600.2786
Residuals 232     224.5285
Total     237     824.8072
Test that all coefficients except the constant are zero: the sample value of the test statistic and the corresponding p-value are reported in the output of summary(r_antrop):
F-statistic: 124.1 on 5 and 232 DF, p-value: < 2.2e-16
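The F statistic can also be recovered directly from the variance decomposition table, as the ratio between the model mean square and the residual mean square. The sketch below uses the sums of squares and degrees of freedom printed in that table; f_stat and p_value are illustrative names.

```r
# Values copied from the variance decomposition table
SS_M <- 600.2786; DF_M <- 5     # model
SS_E <- 224.5285; DF_E <- 232   # residuals

# F = (SS_M / DF_M) / (SS_E / DF_E)
f_stat  <- (SS_M / DF_M) / (SS_E / DF_E)

# Upper-tail probability of the F distribution with (DF_M, DF_E) degrees of freedom
p_value <- pf(f_stat, DF_M, DF_E, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value < 2.2e-16)
```

Using lower.tail = FALSE rather than 1 - pf(...) avoids losing precision when the p-value is very small.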
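1.4 Test that a subset of q coefficients is zero
First fit the reduced model, then compute the sample value of the test statistic and its p-value:
$$ f = \frac{\left(\lVert e_R\rVert^2 - \lVert e_C\rVert^2\right)/q}{\lVert e_C\rVert^2/(n-p)} $$
where e_R and e_C are the residual vectors of the reduced and of the complete model, and n - p is the residual degrees of freedom of the complete model.
> r_antrop_rid=lm(densita~peso+addome+polso) ## reduced model: coscia and bicipiti dropped
> s_rid=summary(r_antrop_rid)
> SS_E_C=(s$sigma^2)*s$df[2] ## residual sum of squares, complete model
> SS_E_R=(s_rid$sigma^2)*s_rid$df[2] ## residual sum of squares, reduced model
> df_num=s$df[1]-s_rid$df[1] ## q, the number of coefficients tested
> f=((SS_E_R-SS_E_C)/df_num)/(s$sigma^2)
> pvalue=1-pf(f,df_num ,s$df[2])
> round(c(f,pvalue),4)
[1] 4.1076 0.0177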
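The same comparison of nested models can be obtained by passing the reduced and the complete fit to anova. The sketch below checks, on simulated data, that the F value and p-value it reports match the formula above; the variables x1 to x4, y and the object names are hypothetical, not the antrop variables.

```r
# Simulated data: x1..x4 and y are hypothetical, not the antrop variables
set.seed(7)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + d$x1 + 0.5 * d$x2 + 0.3 * d$x3 + rnorm(n)

fit_full <- lm(y ~ x1 + x2 + x3 + x4, data = d)   # complete model
fit_red  <- lm(y ~ x1 + x2, data = d)             # reduced model (x3, x4 dropped)

# Built-in comparison of the two nested models
tab <- anova(fit_red, fit_full)
print(tab)

# Same statistic by hand; deviance() of an lm fit is the residual sum of squares
q        <- df.residual(fit_red) - df.residual(fit_full)
f_manual <- ((deviance(fit_red) - deviance(fit_full)) / q) /
            (deviance(fit_full) / df.residual(fit_full))
p_manual <- pf(f_manual, q, df.residual(fit_full), lower.tail = FALSE)
```

The second row of the table printed by anova holds the F statistic and its p-value, which should equal f_manual and p_manual.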
1.5 Estimating the variance of Ŷ
The estimator of the variance of Ŷ_i is the i-th diagonal term of the matrix P_V (the so-called "leverage") multiplied by S². The code below computes the sample value of its square root, i.e. the estimated standard deviation of Ŷ_i:
> X=model.matrix(r_antrop)
> leverage = hat(X)
> sd_pred=s$sigma*sqrt(leverage)
> sd_pred
 [1] 0.140 0.132 0.128 0.142 0.154 0.107 0.129 0.215 0.166 0.166 0.218 0.124
[13] 0.132 0.231 0.206 0.189 0.168 0.188 0.159 0.160 0.172 0.108 0.152 0.122
...
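The leverages can also be taken directly from the fitted object with hatvalues, and the resulting standard errors coincide with the se.fit component returned by predict. The sketch below verifies both facts on simulated data, building the projection matrix explicitly for comparison; the variables x1, x2, y and the object names are hypothetical.

```r
# Simulated data: x1, x2 and y are hypothetical, not the antrop variables
set.seed(11)
n <- 25
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 - d$x1 + 2 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

# Projection matrix onto the column space of X, built explicitly
X          <- model.matrix(fit)
H          <- X %*% solve(crossprod(X)) %*% t(X)
lev_manual <- diag(H)

# Leverages from the fitted object
lev <- hatvalues(fit)

# Estimated standard deviation of each fitted value
se_fit <- summary(fit)$sigma * sqrt(lev)

print(all.equal(unname(lev_manual), unname(lev)))
print(all.equal(unname(se_fit), unname(predict(fit, se.fit = TRUE)$se.fit)))
print(sum(lev))   # trace of the projection matrix = number of coefficients
```

Forming H explicitly is only reasonable for small n, since it is an n x n matrix; hatvalues avoids building it.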
1.6 Residual plots
• Default output
par(mfrow=c(2,2))   # 2 x 2 grid for the four diagnostic panels
plot(r_antrop)
[Figure: the four default diagnostic panels produced by plot(r_antrop). "Residuals vs Fitted": residuals against fitted values. "Normal Q-Q": standardized residuals against theoretical quantiles. "Scale-Location": standardized residuals against fitted values. "Residuals vs Leverage": standardized residuals against leverage, with Cook's distance contours. Observations 84, 210 and 211 are labelled in the first three panels; observations 84, 236 and 46 in the last.]
• Custom ("programmed") graphical output
plot(predict(r_antrop),rstandard(r_antrop),pch=16,cex.axis=1.5,
xlab="fitted values",ylab="standardized residuals",cex.lab=1.5);abline(h=0)
[Figure: standardized residuals against fitted values (x axis "fitted values", about 100 to 108; y axis "standardized residuals", about -2 to 4), with a horizontal line at 0.]
2 SAS software
2.1 Reading the data and proc reg
data antrop;
infile "<path>.txt" EXPANDTABS firstobs=2;
input densita eta peso altezza collo torace addome fianchi coscia ginocch caviglia bicipiti avambr polso;
run;
proc reg data=antrop;
model densita=peso addome coscia bicipiti polso/ corrb clb clm cli p r;
quit;run;
The REG Procedure
Model: MODEL1
Dependent Variable: densita
Number of Observations Read 238
Number of Observations Used 238
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 5 600.27862 120.05572 124.05 <.0001
Error 232 224.52855 0.96780
Corrected Total 237 824.80717
Root MSE 0.98377 R-Square 0.7278
Dependent Mean 105.63277 Adj R-Sq 0.7219
Coeff Var 0.93131
Parameter Estimates
                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|      95% Confidence Limits
Intercept    1   118.88942     1.97397     60.23     <.0001    115.00022     122.77862
peso         1     0.03668     0.00788      4.66     <.0001      0.02116       0.05220
addome       1    -0.22281     0.01340    -16.63     <.0001     -0.24921      -0.19641
coscia       1    -0.04080     0.02669     -1.53     0.1277     -0.09339       0.01179
bicipiti     1    -0.07160     0.03628     -1.97     0.0496     -0.14308   -0.00011705
polso        1     0.29923     0.10638      2.81     0.0053      0.08963       0.50883
Correlation of Estimates
Variable Intercept peso addome coscia bicipiti polso
Intercept 1.0000 0.7330 -0.3255 -0.5428 -0.1281 -0.8207
peso 0.7330 1.0000 -0.6516 -0.5539 -0.2613 -0.4466
addome -0.3255 -0.6516 1.0000 0.0696 0.0776 0.0633
coscia -0.5428 -0.5539 0.0696 1.0000 -0.2462 0.2429
bicipiti -0.1281 -0.2613 0.0776 -0.2462 1.0000 -0.1321
polso -0.8207 -0.4466 0.0633 0.2429 -0.1321 1.0000
Output Statistics
                              Std Error
     Dependent  Predicted          Mean                                                            Std Error   Student
Obs   Variable      Value       Predict       95% CL Mean          95% CL Predict      Residual     Residual  Residual
  1      107.1   105.9817        0.1398   105.7063  106.2571    104.0240  107.9394       1.0983        0.974     1.128
  2      108.5   107.6176        0.1318   107.3579  107.8773    105.6620  109.5731       0.9124        0.975     0.936
  3      107.5   107.0886        0.1283   106.8359  107.3414    105.1340  109.0433       0.4214        0.975     0.432
  4      103.4   103.7783        0.1425   103.4975  104.0590    101.8198  105.7368      -0.3783        0.973    -0.389
...
Sum of Residuals -6.8141E-12
Sum of Squared Residuals 224.52855
Predicted Residual SS (PRESS) 237.31357
2.2 Test that a subset of q coefficients is zero
Use the test statement in proc reg.
proc reg data=antrop;
model densita=peso addome coscia bicipiti polso;
test coscia, bicipiti;
quit;run;
The REG Procedure
Model: MODEL1
Test 1 Results for Dependent Variable densita
                        Mean
Source         DF     Square   F Value   Pr > F
Numerator       2    3.97530      4.11   0.0177
Denominator   232    0.96780
2.3 Residual plots - default
("programmed" plots can be found in the lecture notes, e.g. on pp. 35-36)
ods pdf file="<path>\res_sas.pdf" notoc startpage=no nogtitle;
ods noproctitle; ods graphics / width=17cm; option nodate;
proc reg data=antrop;
model densita=peso addome coscia bicipiti polso;
quit;run;
ods pdf close;
[Figure: "Fit Diagnostics for densita" (Model: MODEL1, Dependent Variable: densita). Panels: Residual vs Predicted Value, RStudent vs Predicted Value, RStudent vs Leverage, Residual vs Quantile, densita vs Predicted Value, Cook's D vs Observation, histogram of the residuals (Percent), and Fit-Mean / Residual vs Proportion Less. Summary box: Observations 238, Parameters 6, Error DF 232, MSE 0.9678, R-Square 0.7278, Adj R-Square 0.7219.]
[Figure: "Residual by Regressors for densita" (Model: MODEL1, Dependent Variable: densita). Residuals plotted against each regressor: peso, addome, coscia, bicipiti, polso.]