Statistical models A.Y. 2014/15

Written exam of June 15, 2015.

1. Consider the general linear model Y = Xβ + E, where Y = (y₁, …, y_n) is a vector in Rⁿ, X is an n × p matrix and E is a vector of random errors with mean 0.

(a) Show that β̂ = (X^tX)⁻¹X^tY is an unbiased estimator of β. Also compute V(β̂). This choice is generally called the least squares method; what exactly does this mean?

(b) Two possible motivations for the choice of β̂ are maximum likelihood estimation and the Gauss-Markov theorem. State the results precisely (proofs are not required, but are welcome if time allows), together with the assumptions on the errors E needed for each result.

(c) Let ε̂ = Y − Xβ̂ be the observed residuals; prove that V(ε̂_i) = σ²(1 − H_ii), where H = X(X^tX)⁻¹X^t and σ² = V(E_j) for each j. Explain why, if H_ii is close to 1 for some value of i, this result leads us to expect that ŷ_i will be close to y_i.

(d) Are ε̂_i and ε̂_j independent for i ≠ j? [Hint: check the proof of the previous result]
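
Parts (c) and (d) can be explored numerically in the simplest case. The following is a minimal Python sketch, assuming a design matrix with an intercept and a single predictor, for which H_ij = 1/n + (x_i − x̄)(x_j − x̄)/S_xx, and assuming uncorrelated errors with common variance σ², so that V(ε̂) = σ²(I − H). The function name hat_matrix, the data values and the value of sigma2 are hypothetical; the last data point is deliberately placed far from the others.

```python
def hat_matrix(x):
    """Hat matrix H = X (X^t X)^-1 X^t for X = [1, x] (intercept plus one predictor)."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [[1 / n + (xi - mean) * (xj - mean) / sxx for xj in x] for xi in x]


# Hypothetical predictor values: the last point lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 50.0]
sigma2 = 1.0  # assumed error variance
H = hat_matrix(x)
n = len(x)

# V(residuals) = sigma^2 (I - H): diagonal entries are the residual variances,
# off-diagonal entries are the covariances between pairs of residuals.
V = [[sigma2 * ((1.0 if i == j else 0.0) - H[i][j]) for j in range(n)] for i in range(n)]

leverages = [H[i][i] for i in range(n)]
print("leverages:", [round(h, 3) for h in leverages])
print("residual variances:", [round(V[i][i], 3) for i in range(n)])
print("cov(residual 1, residual 2):", round(V[0][1], 3))
```

In this sketch the isolated point has leverage close to 1 and hence a residual variance close to 0, while the off-diagonal entries of V equal −σ²H_ij and are not zero in general.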

2. Consider a linear model with response variable Y and two predictor variables, X₁ quantitative and X₂ qualitative (with three levels, say A, B and C).

(a) Write down (in a mathematical way) the assumptions of the additive model (Y ~ X₁ + X₂ in R).

(b) Give a graphical representation of this model.

(c) What are the (null and alternative) hypotheses that are routinely tested in this model?

(d) Write down (in a mathematical way) the assumptions of the model with interaction (Y ~ X₁ * X₂ in R).

(e) Give a graphical representation of this model.

(f) Does this model differ (and, if so, how) from performing separate linear regressions of Y on X₁ (Y ~ X₁ in R) for each of the subsets {X₂ = A}, {X₂ = B}, {X₂ = C}?

(g) How can we decide whether to choose the model with or without interaction?
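
The difference between the two model formulas in parts (a) and (d) can be seen in the design-matrix rows they generate. The following is a minimal Python sketch, assuming treatment (dummy) coding with A as the reference level, which is R's default for unordered factors; the function names additive_row and interaction_row and the sample value 2.5 are hypothetical.

```python
LEVELS = ["A", "B", "C"]  # levels of the qualitative predictor X2; A is the reference


def additive_row(x1, x2):
    """Design row for Y ~ X1 + X2: intercept, x1, dummy for B, dummy for C."""
    dummies = [1.0 if x2 == level else 0.0 for level in LEVELS[1:]]
    return [1.0, x1] + dummies


def interaction_row(x1, x2):
    """Design row for Y ~ X1 * X2: the additive columns plus x1 times each dummy."""
    row = additive_row(x1, x2)
    return row + [x1 * d for d in row[2:]]


# One observation with x1 = 2.5 in each of the three groups.
for level in LEVELS:
    print(level, additive_row(2.5, level), interaction_row(2.5, level))
```

With this coding the additive model has four mean parameters (one common slope and three intercepts), whereas the interaction model has six (a separate intercept and slope for each level); in both cases a single error variance is shared by all observations.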

3. On a dataset we perform the regression of ozone concentration on wind speed (Wind) and month (with values from 5 to 9). Using R, we obtain the following regression table:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     50.748     15.748   3.223  0.00169 **
Wind            -2.368      1.316  -1.799  0.07484 .
month6         -41.793     31.148  -1.342  0.18253
month7          68.296     20.995   3.253  0.00153 **
month8          82.211     20.314   4.047 9.88e-05 ***
month9          23.439     20.663   1.134  0.25919
Wind:month6      4.051      2.490   1.627  0.10680
Wind:month7     -4.663      2.026  -2.302  0.02329 *
Wind:month8     -6.154      1.923  -3.201  0.00181 **
Wind:month9     -1.874      1.820  -1.029  0.30569
---
Signif. codes:  *** < 0.001  ** < 0.01  * < 0.05  . < 0.1

Residual standard error: 23.12 on 106 degrees of freedom
Multiple R-squared: 0.5473, Adjusted R-squared: 0.5089
F-statistic: 14.24 on 9 and 106 DF, p-value: 7.879e-15

Describe clearly the final model obtained from this analysis. It is advisable to write down the model separately for the observations belonging to each month; in other words, write formulae Ozone = … if month = 5, Ozone = … if month = 6, and so on.
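
Under R's default treatment coding, month 5 is the reference level, so the coefficients labelled month6 to month9 and Wind:month6 to Wind:month9 are differences from the month-5 intercept and slope. The Python sketch below only carries out this arithmetic on the estimates in the table above; the variable names are hypothetical, and whether every term should be kept in the final model is a separate question.

```python
# Estimates copied from the regression table.
intercept = 50.748  # (Intercept): intercept for month 5, the reference level
wind = -2.368       # Wind: slope for month 5
month_shift = {6: -41.793, 7: 68.296, 8: 82.211, 9: 23.439}  # monthK rows
wind_shift = {6: 4.051, 7: -4.663, 8: -6.154, 9: -1.874}     # Wind:monthK rows

# Fitted line for each month: (intercept, slope with respect to Wind).
lines = {5: (intercept, wind)}
for month in month_shift:
    lines[month] = (intercept + month_shift[month], wind + wind_shift[month])

for month, (a, b) in sorted(lines.items()):
    print(f"month = {month}: Ozone = {a:.3f} + ({b:.3f}) * Wind")
```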

Describe precisely which tests have been performed (and reported with a P-value); discuss which results appear to be significant, what the potential problems of this analysis are, and how one could proceed.
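
Each row of the coefficient table reports a t test of the null hypothesis that the corresponding coefficient is zero: the t value is Estimate / Std. Error, compared with a t distribution on the residual degrees of freedom (106 here). The Python sketch below recomputes the t values from the rounded estimates and standard errors printed in the table, so small discrepancies in the last digit are to be expected; the list name rows is hypothetical.

```python
# (term, estimate, standard error, t value as printed in the table)
rows = [
    ("(Intercept)", 50.748, 15.748, 3.223),
    ("Wind", -2.368, 1.316, -1.799),
    ("month6", -41.793, 31.148, -1.342),
    ("month7", 68.296, 20.995, 3.253),
    ("month8", 82.211, 20.314, 4.047),
    ("month9", 23.439, 20.663, 1.134),
    ("Wind:month6", 4.051, 2.490, 1.627),
    ("Wind:month7", -4.663, 2.026, -2.302),
    ("Wind:month8", -6.154, 1.923, -3.201),
    ("Wind:month9", -1.874, 1.820, -1.029),
]

# Recompute each t statistic as estimate / standard error.
t_values = {term: estimate / std_error for term, estimate, std_error, _ in rows}

for term, estimate, std_error, t_printed in rows:
    print(f"{term:12s} t = {t_values[term]:7.3f} (table: {t_printed:7.3f})")
```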

As can be seen from the table, the variable “month” has been treated as qualitative. It could also have been treated as quantitative; what would the difference in the resulting model have been¹?

¹ I just want to see the structure of the model, not the numerical values.
