Regression in SAS: Proc Reg

The data to be analysed come from a survey of 61 towns in England and Wales. For each town they record the annual mortality rate per 100,000 male inhabitants, averaged over the years 1958 to 1964, and the calcium concentration (in parts per million) of the drinking water. Towns are classified as "Nord" (north) if they lie north of Derby and as "Sud" (south) otherwise.

The data are stored in the file regressione.xls and are listed in the DATASET section at the end of this part.

The questions to be answered are:

• Are the variables Mortality and Durezza acqua (water hardness) correlated?

• Is there a geographical factor in the relationship?

The SAS dataset is called Water and was built by importing the data from the Excel file regressione.xls.

STEP 1: DESCRIPTIVE ANALYSIS OF THE VARIABLES
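The import code itself is not shown in these notes. As a rough sketch only, the Water dataset could be created with PROC IMPORT before running the descriptive analysis. The file path is hypothetical, and the sketch assumes that the first worksheet row holds the column names and that SAS/ACCESS Interface to PC Files is available:

```sas
/* Hypothetical path: adjust to where regressione.xls is stored */
proc import datafile="C:\dati\regressione.xls"
            out=water
            dbms=xls
            replace;
   getnames=yes;   /* assumption: first row contains the variable names */
run;

/* Quick check of variable names and types after the import */
proc contents data=water;
run;
```

Checking the PROC CONTENTS listing is useful because the programs below expect the variables zona, mortality and durezza_acqua.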

Proc Univariate computes the main summary statistics; adding the normal option also runs tests of how well the data fit a normal distribution.

Proc Univariate data=water normal;

var mortality durezza_acqua;

histogram mortality durezza_acqua /normal;

run;

The output of the procedure is the following:

The UNIVARIATE Procedure

Variable: mortality (mortality)

Moments

N                         61    Sum Weights                61
Mean              1524.14754    Sum Observations        92973
Std Deviation     187.668754    Variance           35219.5612
Skewness          -0.0844436    Kurtosis           -0.4879484
Uncorrected SS     143817743    Corrected SS       2113173.67
Coeff Variation   12.3130307    Std Error Mean     24.0285217

Basic Statistical Measures

    Location                    Variability

Mean     1524.148     Std Deviation            187.66875
Median   1555.000     Variance                     35220
Mode     1486.000     Range                    891.00000
                      Interquartile Range      289.00000

NOTE: The mode displayed is the smallest of 3 modes with a count of 2.

Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t  63.43077    Pr > |t|    <.0001
Sign           M      30.5    Pr >= |M|   <.0001
Signed Rank    S     945.5    Pr >= |S|   <.0001

Tests for Normality

Test           --Statistic---    -----p Value------

Shapiro-Wilk   W     0.985543    Pr < W      0.6884

Quantile      Estimate

100% Max          1987
99%               1987
95%               1800
90%               1742
75% Q3            1668
50% Median        1555
25% Q1            1379
10%               1259
5%                1247
1%                1096
0% Min            1096

Extreme Observations

----Lowest----        ----Highest---

Value      Obs        Value      Obs

 1096       26         1772       29
 1175       38         1800        4
 1236       42         1807        7
 1247        1         1828       30
 1254       15         1987       45

Parameters for Normal Distribution

Parameter   Symbol   Estimate

Mean        Mu       1524.148
Std Dev     Sigma    187.6688

Goodness-of-Fit Tests for Normal Distribution

Test                  ---Statistic----     -----p Value-----

Kolmogorov-Smirnov    D      0.07348799    Pr > D      >0.150
Cramer-von Mises      W-Sq   0.04868837    Pr > W-Sq   >0.250
Anderson-Darling      A-Sq   0.33739780    Pr > A-Sq   >0.250

Quantiles for Normal Distribution

             ------Quantile------

Percent     Observed    Estimated

    1.0      1096.00      1087.56
    5.0      1247.00      1215.46
   10.0      1259.00      1283.64
   25.0      1379.00      1397.57
   50.0      1555.00      1524.15
   75.0      1668.00      1650.73
   90.0      1742.00      1764.65
   95.0      1800.00      1832.84
   99.0      1987.00      1960.73
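Several entries of the Moments table for mortality are linked by simple formulas, which gives a quick way to sanity-check the output. The following worked check uses only the values reported above, with n = 61, sample mean \bar{x} and sample standard deviation s:

```latex
\begin{align*}
\text{Std Error Mean} &= \frac{s}{\sqrt{n}} = \frac{187.668754}{\sqrt{61}} \approx 24.03 \\[4pt]
\text{Coeff Variation} &= 100 \cdot \frac{s}{\bar{x}} = 100 \cdot \frac{187.668754}{1524.14754} \approx 12.31 \\[4pt]
\text{Corrected SS} &= (n-1)\, s^2 = 60 \cdot 35219.5612 \approx 2113173.67 \\[4pt]
t &= \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1524.14754 - 0}{24.0285217} \approx 63.43
\end{align*}
```

The location tests against Mu0 = 0 carry little information for a mortality rate, which cannot be near zero. The normality tests are the relevant part: with a Shapiro-Wilk p-value of 0.6884 and large p-values for the other three tests, there is no evidence against normality for mortality.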

The UNIVARIATE Procedure

Variable: durezza_acqua (durezza acqua)

Moments

N                         61    Sum Weights                61
Mean              47.1803279    Sum Observations         2878
Std Deviation     38.0939664    Variance           1451.15027
Skewness          0.69223461    Kurtosis           -0.6657553
Uncorrected SS        222854    Corrected SS       87069.0164
Coeff Variation   80.7412074    Std Error Mean      4.8774326

Basic Statistical Measures

    Location                    Variability

Mean     47.18033     Std Deviation             38.09397
Median   39.00000     Variance                      1451
Mode     14.00000     Range                    133.00000
                      Interquartile Range       61.00000

Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t  9.673189    Pr > |t|    <.0001
Sign           M      30.5    Pr >= |M|   <.0001
Signed Rank    S     945.5    Pr >= |S|   <.0001

Tests for Normality

Test           --Statistic---    -----p Value------

Shapiro-Wilk   W     0.887867    Pr < W     <0.0001

Quantiles (Definition 5)

Quantile      Estimate

100% Max           138
99%                138
95%                122
90%                101
75% Q3              75
50% Median          39
25% Q1              14
10%                  8
5%                   6
1%                   5
0% Min               5

Extreme Observations

----Lowest----        ----Highest---

Value      Obs        Value      Obs

    5       39          107       38
    5        3          122       19
    6       41          122       59
    6       37          133       35
    8       45          138       26

Parameters for Normal Distribution

Parameter   Symbol   Estimate

Mean        Mu       47.18033
Std Dev     Sigma    38.09397

Goodness-of-Fit Tests for Normal Distribution

Test                  ---Statistic----     -----p Value-----

Kolmogorov-Smirnov    D      0.19666241    Pr > D      <0.010
Cramer-von Mises      W-Sq   0.39400529    Pr > W-Sq   <0.005
Anderson-Darling      A-Sq   2.39960138    Pr > A-Sq   <0.005

Quantiles for Normal Distribution

             ------Quantile------

Percent     Observed    Estimated

    1.0      5.00000    -41.43949
    5.0      6.00000    -15.47867
   10.0      8.00000     -1.63905
   25.0     14.00000     21.48634
   50.0     39.00000     47.18033
   75.0     75.00000     72.87432
   90.0    101.00000     95.99971
   95.0    122.00000    109.83933
   99.0    138.00000    135.80015
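For durezza_acqua all four tests reject normality (p-values below 0.01), and the fitted normal quantiles are negative in the lower tail, which is impossible for a concentration. A normal Q-Q plot can help show where the departure occurs. The following is a minimal sketch using the QQPLOT statement of PROC UNIVARIATE; it is an addition to the analysis in these notes, not part of the original program:

```sas
proc univariate data=water noprint;
   var durezza_acqua;
   /* Normal Q-Q plot; mu=est and sigma=est draw the reference line
      from the sample mean and standard deviation */
   qqplot durezza_acqua / normal(mu=est sigma=est);
run;
```

Points that bend away from the reference line at the low end would reflect the right skew (skewness 0.69) already visible in the Moments table.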

STEP 2: GRAPHICAL REPRESENTATION

The scatterplot is the graph used to examine the relationship between the two variables Mortality and Durezza_acqua. The SAS program is the following:

proc gplot data=water;

plot mortality*durezza_acqua;

run;

[Figure: scatterplot of mortality (vertical axis, 1000 to 2000) against durezza acqua (horizontal axis, 0 to 140)]

A few graphical options make the plot easier to read by distinguishing the observations from the North and the South.

symbol1 v=dot c=blue;

symbol2 v=star c=red;

proc gplot data=water;

plot mortality*durezza_acqua=zona;

run;

[Figure: scatterplot of mortality (vertical axis, 1000 to 2000) against durezza acqua (horizontal axis, 0 to 140), with separate symbols for zona = nord and zona = sud]
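On SAS 9.2 and later, a similar grouped scatterplot can be sketched with PROC SGPLOT, which does not need SYMBOL statements. This is offered as an alternative under that assumption, not as the method used in these notes:

```sas
proc sgplot data=water;
   /* one marker style per value of zona */
   scatter x=durezza_acqua y=mortality / group=zona;
   /* separate least-squares lines for nord and sud, without redrawing markers */
   reg x=durezza_acqua y=mortality / group=zona nomarkers;
run;
```

Overlaying one fitted line per zone makes it easier to judge by eye whether the two groups share a slope or differ mainly in level.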

The scatterplot seems to show a negative correlation between the two variables. Proc Corr computes the correlation between mortality and durezza_acqua.

SAS program

proc corr data=water pearson;

var mortality durezza_acqua;

run;

SAS output

The CORR Procedure

2 Variables: mortality durezza_acqua

Simple Statistics

Variable          N        Mean     Std Dev       Sum     Minimum     Maximum

mortality        61        1524   187.66875     92973        1096        1987
durezza_acqua    61    47.18033    38.09397      2878     5.00000   138.00000

Pearson Correlation Coefficients, N = 61

                 mortality    durezza_acqua

mortality          1.00000         -0.65485
durezza_acqua     -0.65485          1.00000

With the by statement, Proc Corr computes the correlation separately for North and South. The dataset must first be sorted by zona.

proc sort data= water ; by zona;

proc corr data= water pearson;

var mortality durezza_acqua;

by zona;

run;

--- zona=nord ---

The CORR Procedure

2 Variables: mortality durezza_acqua

Simple Statistics

Variable          N        Mean     Std Dev       Sum     Minimum     Maximum

mortality        26        1377   140.26918     35797        1096        1627
durezza_acqua    26    69.76923    40.36068      1814     5.00000   138.00000

Pearson Correlation Coefficients, N = 26

                 mortality    durezza_acqua

mortality          1.00000         -0.60215
durezza_acqua     -0.60215          1.00000

--- zona=sud ---

The CORR Procedure

2 Variables: mortality durezza_acqua

Simple Statistics

Variable          N        Mean     Std Dev       Sum     Minimum     Maximum

mortality        35        1634   136.93691     57176        1378        1987
durezza_acqua    35    30.40000    26.13449      1064     6.00000    94.00000

Pearson Correlation Coefficients, N = 35

                 mortality    durezza_acqua

mortality          1.00000         -0.36860
durezza_acqua     -0.36860          1.00000
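The listings above show only the coefficients. As a worked check of whether each one differs from zero, the usual t statistic for a Pearson correlation r based on n observations can be computed by hand (this calculation is an addition, using only the r and N values printed above):

```latex
\begin{align*}
t &= \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{with } n-2 \text{ degrees of freedom} \\[6pt]
\text{all towns } (n=61):\quad & t = \frac{-0.65485\,\sqrt{59}}{\sqrt{1-0.65485^{2}}} \approx -6.66 \\[4pt]
\text{zona = nord } (n=26):\quad & t = \frac{-0.60215\,\sqrt{24}}{\sqrt{1-0.60215^{2}}} \approx -3.69 \\[4pt]
\text{zona = sud } (n=35):\quad & t = \frac{-0.36860\,\sqrt{33}}{\sqrt{1-0.36860^{2}}} \approx -2.28
\end{align*}
```

All three values exceed, in absolute value, the two-sided 5% critical values of the t distribution for these degrees of freedom (roughly 2.0 to 2.1). The correlation is weaker within each zone than in the pooled data, and the two zones differ in both mean mortality (1377 against 1634) and mean hardness (69.8 against 30.4), so part of the pooled correlation appears to reflect the difference between zones.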

DATASET

zona   citta          mortality   durezza acqua

nord   Bath                1247             105
sud    Birkenhead          1668              17
nord   Birmingham          1466               5
sud    Blackburn           1800              14
sud    Blackpool           1609              18
sud    Bolton              1558              10
sud    Bootle              1807              15
nord   Bournemouth         1299              78
sud    Bradford            1637              10
nord   Brighton            1359              84
nord   Bristol             1392              73
sud    Burnley             1755              12
nord   Cardiff             1519              21
nord   Coventry            1307              78
nord   Croydon             1254              96
sud    Darlington          1491              20
sud    Derby               1555              39
sud    Doncaster           1428              39
nord   East                1318             122
nord   Exeter              1260              21
sud    Gateshead           1723              44
sud    Grimsby             1379              94
sud    Halifax             1742               8
sud    Huddersfield        1574               9
sud    Hull                1569              91
nord   Ipswich             1096             138
sud    Leeds               1591              16
nord   Leicester           1402              37
sud    Liverpool           1772              15
sud    Manchester          1828               8
sud    Middlesbrough       1704              26
sud    Newcastle           1702              44
nord   Newport             1581              14
nord   Northampton         1309              59
nord   Norwich             1259             133
sud    Nottingham          1427              27
sud    Oldham              1724               6
nord   Oxford              1175             107
nord   Plymouth            1486               5
nord   Portsmouth          1456              90
sud    Preston             1696               6
nord   Reading             1236             101
sud    Rochdale            1711              13
sud    Rotherham           1444              14
sud    Salford             1987               8
sud    Sheffield           1495              14
sud    South Shields       1713              71
nord   Southampton         1369              68
nord   Southend            1257              50
sud    Southport           1587              75
sud    St Helens           1591              49
sud    Stockport           1557              13
sud    Stoke               1640              57
sud    Sunderland          1709              71
nord   Swansea             1625              13
sud    Wallasey            1625              20
nord   Walsall             1527              60
nord   West Brom           1627              53
nord   West Ham            1486             122
nord   Wolverhampton       1485              81
sud    York                1378              71

Example

We want to determine whether the flow of a stream (the quantity of water passing in one minute) depends on the depth of the stream. The data are the following:

zona profond flusso

1 0.34 0.636

2 0.29 0.319

3 0.28 0.734

4 0.42 1.327

5 0.29 0.487

6 0.41 0.924

7 0.76 7.350

8 0.73 5.890

9 0.46 1.979

10 0.40 1.124

The SAS program that reads the data and runs a first regression analysis is the following:

data flusso;

input zona profond flusso;

datalines;

1 0.34 0.636

2 0.29 0.319

3 0.28 0.734

4 0.42 1.327

5 0.29 0.487

6 0.41 0.924

7 0.76 7.350

8 0.73 5.890

9 0.46 1.979

10 0.40 1.124

;

proc reg data=flusso;

model flusso= profond;

plot flusso* profond;

output out=regout p=predetti r=residui;

run;

SAS output:

The REG Procedure

Dependent Variable: flusso

Number of Observations Read    10

Root MSE           0.60347    R-Square    0.9467
Dependent Mean     2.07700    Adj R-Sq    0.9400
Coeff Var         29.05490

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|

Intercept     1     -3.98213      0.54298      -7.33      <.0001
profond       1     13.83363      1.16061      11.92      <.0001
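To see what the fitted line implies, the estimates above can be applied to a few of the observed depths. The following worked check is an addition that uses only the printed estimates and the data listed earlier; residuals are observed minus fitted values:

```latex
\begin{align*}
\widehat{\text{flusso}} &= -3.98213 + 13.83363 \cdot \text{profond} \\[6pt]
\text{profond}=0.28:\quad & \widehat{\text{flusso}} \approx -0.109, \qquad \text{residual} = 0.734 - (-0.109) \approx 0.843 \\[4pt]
\text{profond}=0.42:\quad & \widehat{\text{flusso}} \approx 1.828, \qquad \text{residual} = 1.327 - 1.828 \approx -0.501 \\[4pt]
\text{profond}=0.76:\quad & \widehat{\text{flusso}} \approx 6.531, \qquad \text{residual} = 7.350 - 6.531 \approx 0.819
\end{align*}
```

The line predicts a negative flow at the shallowest depth, and the residuals are positive at both ends of the depth range and negative in the middle. Despite the high R-Square, this pattern points to curvature that a straight line does not capture.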

[Figure: flusso (vertical axis, 0 to 8) plotted against profond (horizontal axis, 0.25 to 0.80) with the fitted line flusso = -3.9821 + 13.834 profond; N = 10, Rsq = 0.9467, Adj Rsq = 0.9400, RMSE = 0.6035]

To obtain the residual plot:

symbol v=dot;

proc gplot data=regout;

plot residui*predetti/vref=0;

run;

[Figure: residual plot for the linear model, Residual (vertical axis, -0.8 to 0.9) against Predicted Value of flusso (horizontal axis, -1 to 7)]

The fitted-line plot already suggests that a straight line does not describe the relationship well; this is seen even more clearly in the residual plot of the model that assumes a linear dependence.

The data and the preceding residual plot suggest a quadratic dependence; one can therefore fit a second-order polynomial model of the form:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

in which there are two explanatory variables, $x$ and $x^2$.

The SAS program is the following:

data flusso;

set flusso;

sqprof=profond**2;

proc reg data=flusso;

model flusso= profond sqprof;

output out=sqregout p = predetti r = residui;

proc gplot data=sqregout;

plot residui*predetti/vref=0;

run;

quit;

The REG Procedure

Model: MODEL1
Dependent Variable: flusso

Number of Observations Read    10

Analysis of Variance

                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F

Model               2       54.10549    27.05275     346.50    <.0001
Error               7        0.54652     0.07807
Corrected Total     9       54.65201

Root MSE           0.27942    R-Square    0.9900
Dependent Mean     2.07700    Adj R-Sq    0.9871
Coeff Var         13.45294

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|

Intercept     1      1.68269      1.05912       1.59      0.1561
profond       1    -10.86091      4.51711      -2.40      0.0472
sqprof        1     23.53522      4.27447       5.51      0.0009
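The summary figures of this output follow directly from the Analysis of Variance table. The worked check below is an addition that uses only the printed values, writing x for profond:

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{54.10549}{54.65201} \approx 0.9900 \\[4pt]
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{27.05275}{0.07807} \approx 346.5 \\[4pt]
\text{Root MSE} &= \sqrt{MS_{\text{Error}}} = \sqrt{0.07807} \approx 0.279 \\[8pt]
\widehat{\text{flusso}} &= 1.68269 - 10.86091\,x + 23.53522\,x^{2} \\[4pt]
x = 0.76:\quad & \widehat{\text{flusso}} \approx 7.022, \qquad \text{residual} = 7.350 - 7.022 \approx 0.328
\end{align*}
```

For the same observation the straight-line model leaves a residual of about 0.82, so the quadratic term roughly halves the error at the deepest point, and Root MSE drops from 0.603 to 0.279. The sqprof coefficient is clearly significant (p = 0.0009), which supports keeping the quadratic term.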

RESIDUAL PLOT

[Figure: Residual (vertical axis, -0.5 to 0.4) against Predicted Value of flusso (horizontal axis, 0 to 8)]

The residual plot for the second-order polynomial regression already looks better, but other models can be tried, for example:

• $\sqrt{y} = \beta_0 + \beta_1 x + \varepsilon$

• $\log(y) = \beta_0 + \beta_1 \log(x) + \varepsilon$

The first of these two models is closely related to the quadratic model, since squaring both sides gives a second-degree polynomial in $x$. The second is motivated by the fact that the two observations with the largest flow and depth are the ones that depart most from linearity, and the logarithm compresses the largest values.

data flusso;

set flusso;

logprof=log(profond);

logflusso=log(flusso);

sqrflusso=sqrt(flusso);

proc reg data=flusso;

model sqrflusso = profond;

output out=sqrtregout p = predetti r = residui;

proc gplot data=sqrtregout;

plot residui*predetti/vref=0;

run;

proc reg data=flusso;

model logflusso = logprof;

output out=logregout p = predetti r = residui;

proc gplot data=logregout;

plot residui*predetti/vref=0;

run;

quit;

The REG Procedure

Model: MODEL1
Dependent Variable: sqrflusso

Number of Observations Read    10

Analysis of Variance

                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F

Model               1        4.67505     4.67505     286.72    <.0001
Error               8        0.13044     0.01631
Corrected Total     9        4.80550

Root MSE           0.12769    R-Square    0.9729
Dependent Mean     1.26351    Adj R-Sq    0.9695
Coeff Var         10.10623

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|

Intercept     1     -0.55785      0.11489      -4.86      0.0013
profond       1      4.15836      0.24558      16.93      <.0001

[Figure: residual plot for the square-root model, Residual (vertical axis, -0.19 to 0.26) against Predicted Value of sqrflusso (horizontal axis, 0.6 to 2.7)]

The REG Procedure

Model: MODEL1
Dependent Variable: logflusso

Number of Observations Read    10
Number of Observations Used    10

Analysis of Variance

                              Sum of        Mean
Source             DF        Squares      Square    F Value    Pr > F

Model               1        8.77270     8.77270     121.24    <.0001
Error               8        0.57886     0.07236
Corrected Total     9        9.35156

Root MSE           0.26899    R-Square    0.9381
Dependent Mean     0.21475    Adj R-Sq    0.9304
Coeff Var        125.26096

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|

Intercept     1      2.66614      0.23833      11.19      <.0001
logprof       1      2.76413      0.25103      11.01      <.0001
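Because the SAS LOG function is the natural logarithm, the fitted log-log line can be rewritten on the original scale as a power law. The rewriting below is an addition based only on the two estimates printed above:

```latex
\begin{align*}
\widehat{\log(\text{flusso})} &= 2.66614 + 2.76413\,\log(\text{profond}) \\[4pt]
\Longrightarrow\quad \widehat{\text{flusso}} &= e^{2.66614}\cdot \text{profond}^{\,2.76413} \approx 14.4 \cdot \text{profond}^{\,2.76}
\end{align*}
```

Read this way, the slope is an exponent: under this model the flow grows roughly with the depth raised to the power 2.8.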

[Figure: residual plot for the log-log model, Residual (vertical axis, -0.4 to 0.6) against Predicted Value of logflusso (horizontal axis, -1 to 2)]
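The R-Square values of the models for flusso, sqrflusso and logflusso are not directly comparable, because each one measures variation on a different response scale. One way to compare the two transformed models is to back-transform their fitted values to the scale of flusso. The following is a minimal sketch under the assumption that sqrtregout and logregout were created by the programs above; the dataset name confronto and the new variable names are hypothetical:

```sas
/* Bring the fitted values of both models side by side, matched on zona */
data confronto;
   merge sqrtregout(keep=zona flusso predetti rename=(predetti=pred_sqrt))
         logregout(keep=zona predetti rename=(predetti=pred_log));
   by zona;
   flusso_sqrt = pred_sqrt**2;        /* inverse of the square root */
   flusso_log  = exp(pred_log);       /* inverse of the natural log */
   res_sqrt = flusso - flusso_sqrt;   /* residuals on the original scale */
   res_log  = flusso - flusso_log;
run;

proc print data=confronto;
   var zona flusso flusso_sqrt flusso_log res_sqrt res_log;
run;
```

The merge relies on zona being in ascending order in both datasets, which holds here because the data were entered as zona 1 to 10. The printed residuals can then be compared with those of the quadratic model, which are already on the scale of flusso.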