
(1)

Maximum Likelihood (ML)

Estimation and Specification Tests

Econometrics I Part II

2016

(2)

Introduction

The ML estimation methodology is grounded on the assumption that the (conditional) distribution of an observed phenomenon (the endogenous variable) is known up to a finite number of unknown parameters.

These unknown parameters are estimated by choosing the values that give the observed sample the highest probability (likelihood) of being drawn under the assumed conditional distribution.

(3)

Example 1: discrete setting

Consider a large pool filled with black and white balls. We are interested in the fraction of white balls, $p$, in this pool.

(4)

To obtain information on $p$ we extract a random sample of $n$ balls.

Let us denote $y_i = 1$ if ball $i$ is white and $y_i = 0$ otherwise.

Then it holds by assumption that

$\Pr\{y_i = 1\} = p$  (1)

(5)

Suppose our sample contains $n_1 = \sum_i y_i$ white balls and $n - n_1$ black balls. The probability of obtaining such a sample (in a given order) is given by

$L(p) = p^{n_1}(1 - p)^{n - n_1}$  (2)

The expression in (2), $L(p)$, seen as a function of the unknown parameter $p$, is referred to as the likelihood function.

(6)

The maximum likelihood estimator of $p$ is obtained by choosing the value of $p$ that maximizes the expression in (2), that is, the probability of drawing the observed sample.

(7)

For computational reasons it is often more convenient to maximize the natural log of the expression in (2) (a monotonic transformation). The resulting function is referred to as the log-likelihood function

$\log L(p) = n_1 \log p + (n - n_1)\log(1 - p)$  (3)

(8)

Maximizing the expression in (3) with respect to $p$ gives the first-order condition

$\dfrac{d \log L(p)}{dp} = \dfrac{n_1}{p} - \dfrac{n - n_1}{1 - p} = 0$  (4)

which, solving for $p$, gives the ML estimator

$\hat{p}_{ML} = \dfrac{n_1}{n}$  (5)

(9)

To be sure that the solution we have found corresponds to a maximum we also need to check the second-order condition

$\dfrac{d^2 \log L(p)}{dp^2} = -\dfrac{n_1}{p^2} - \dfrac{n - n_1}{(1 - p)^2} < 0$  (6)

which indeed shows that $L(\hat{p}_{ML})$ is a maximum.

(10)

Example 2: continuous setting

Consider a bivariate classical linear regression model augmented with the normality assumption on the error terms

$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \qquad \varepsilon_i \mid x \sim NID(0, \sigma^2)$  (7)

where the NID acronym stands for normally (N) and independently (I) distributed errors.

(11)

Given the assumptions in (7) the following holds

$y_i \mid x \sim NID(\beta_1 + \beta_2 x_i,\ \sigma^2)$  (8)

Therefore the contribution of observation $i$ to the likelihood function is the value of the density function at the observed point $y_i$. For the normal distribution this gives

$f(y_i \mid x_i; \beta, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\dfrac{1}{2} \dfrac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\}$  (9)

(12)

Because of the independence assumption, the joint density of $y_1, y_2, \ldots, y_n$ (conditional on $x$) is given by

$f(y_1, y_2, \ldots, y_n \mid x; \beta, \sigma^2) = \prod_i f(y_i \mid x_i; \beta, \sigma^2) = \left( \dfrac{1}{\sqrt{2\pi\sigma^2}} \right)^{n} \prod_i \exp\left\{ -\dfrac{1}{2} \dfrac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\}$  (10)

(13)

The likelihood function is identical to the joint density function in (10) but it is seen as a function of the unknown parameters $\beta$ and $\sigma^2$. We can therefore write

$L(\beta, \sigma^2) = \left( \dfrac{1}{\sqrt{2\pi\sigma^2}} \right)^{n} \prod_i \exp\left\{ -\dfrac{1}{2} \dfrac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\}$  (11)

and, by applying the log transformation,

$\log L(\beta, \sigma^2) = -\dfrac{n}{2}\log(2\pi\sigma^2) - \dfrac{1}{2} \sum_i \dfrac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2}$  (12)

(14)

As the first term in (12) does not depend upon $\beta$, it is easily seen that maximizing the expression in (12) with respect to $\beta_1$ and $\beta_2$ corresponds to minimizing the residual sum of squares $S(\beta)$.

(15)

That is, the ML estimators for $\beta_1$ and $\beta_2$ are identical to the OLS estimators. In general terms the following holds

$b_{ML} = b_{OLS} = (X'X)^{-1}X'y$  (13)
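A minimal sketch (on simulated data of my own choosing, using scipy's general-purpose optimizer; nothing here is computed on the slides) that maximizes the Gaussian log-likelihood in (12) numerically and compares the maximizer with the OLS solution in (13):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical simulated sample from y_i = beta1 + beta2 * x_i + eps_i
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

def neg_log_lik(theta):
    # theta = (beta1, beta2, log sigma^2); the log keeps sigma^2 positive
    b1, b2, log_s2 = theta
    s2 = np.exp(log_s2)
    resid = y - b1 - b2 * x
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * np.sum(resid**2) / s2

res = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(res.x[:2])  # ML estimates of beta1 and beta2
print(b_ols)      # OLS estimates (X'X)^{-1} X'y, numerically the same
```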

(16)

Given the expression in (13) we can substitute $y_i - \beta_1 - \beta_2 x_i$ in expression (12) with the corresponding ML residuals (which are also the OLS residuals)

$\log L(\sigma^2) = -\dfrac{n}{2}\log(2\pi\sigma^2) - \dfrac{1}{2} \sum_i \dfrac{e_i^2}{\sigma^2}$  (14)

After differentiating the expression in (14) with respect to $\sigma^2$

(17)

we obtain the first-order condition

$\dfrac{d \log L(\sigma^2)}{d\sigma^2} = -\dfrac{n}{2}\dfrac{2\pi}{2\pi\sigma^2} + \dfrac{1}{2} \sum_i \dfrac{e_i^2}{\sigma^4} = 0$  (15)

Solving for $\sigma^2$ yields the ML estimator for $\sigma^2$

$\hat{\sigma}^2_{ML} = \dfrac{e'e}{n}$  (16)

(18)

This estimator is consistent, although it is biased. In fact it does not coincide with the unbiased estimator for $\sigma^2$ derived from the OLS estimator, given by

$s^2 = \dfrac{e'e}{n - K}$  (17)
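The difference between the two variance estimators can be seen directly in a short sketch (again on hypothetical simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear-model data with K = 2 regressors (intercept and slope)
n, K = 200, 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.solve(X.T @ X, X.T @ y)   # ML = OLS coefficient estimates, eq. (13)
e = y - X @ b                           # residuals

sigma2_ml = (e @ e) / n                 # biased but consistent ML estimator, eq. (16)
s2 = (e @ e) / (n - K)                  # unbiased OLS-based estimator, eq. (17)
print(sigma2_ml, s2)                    # s2 is slightly larger; the gap vanishes as n grows
```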

(19)

General properties of the ML estimator

Suppose that we are interested in the conditional distribution of yi given xi.

The probability mass function (in a discrete setting) or the density function (in a continuous setting) can be written as

$f(y_i \mid x_i; \theta)$  (18)

where $\theta$ is the unknown parameter vector.

(20)

Assume that observations are mutually independent. In this situation the probability mass or joint density function of the sample $y_1, y_2, \ldots, y_n$ conditional on $x_1, x_2, \ldots, x_n$ can be written as

$f(y_1, y_2, \ldots, y_n \mid X; \theta) = \prod_i f(y_i \mid x_i; \theta)$  (19)

(21)

The likelihood function is therefore given by

$L(\theta) = \prod_i L_i(\theta) = \prod_i f(y_i \mid x_i; \theta)$  (20)

where $L_i(\theta)$ is the likelihood contribution of observation $i$, which represents how much observation $i$ contributes to the likelihood.

(22)

The ML estimator for $\theta$ is the solution to the maximization problem

$\max_\theta\ \log L(\theta) = \max_\theta\ \sum_i \log L_i(\theta)$  (21)

(23)

First-order conditions are given by

$\left. \dfrac{\partial \log L(\theta)}{\partial \theta} \right|_{\hat{\theta}_{ML}} = \sum_i \left. \dfrac{\partial \log L_i(\theta)}{\partial \theta} \right|_{\hat{\theta}_{ML}} = 0$  (22)

If the log-likelihood function is globally concave there is a unique global maximum and the ML estimator is uniquely determined by these first-order conditions.

Only in special cases, however, can the ML estimator be determined analytically. More often, numerical optimization methods are required.
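For illustration, a sketch of such a numerical maximization (the probit model and the simulated data are my own choices, used only because no analytical solution exists in that case; they do not appear on these slides):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical probit data: P(y_i = 1 | x_i) = Phi(theta_1 + theta_2 * x_i)
n = 500
x = rng.normal(size=n)
y = (rng.uniform(size=n) < norm.cdf(0.5 + 1.0 * x)).astype(float)

def neg_log_lik(theta):
    # Minus the sum of log-likelihood contributions, cf. eq. (21)
    prob = np.clip(norm.cdf(theta[0] + theta[1] * x), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

res = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print(res.x)  # numerical ML estimates; no closed-form solution is available here
```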

(24)

For notational convenience, we denote the vector of first derivatives of the log-likelihood function as

$s(\theta) \equiv \dfrac{\partial \log L(\theta)}{\partial \theta} = \sum_i \dfrac{\partial \log L_i(\theta)}{\partial \theta} = \sum_i s_i(\theta)$  (23)

where $s(\theta)$ is referred to as the score vector and $s_i(\theta)$ as the score contribution of observation $i$.

(25)

The first-order conditions thus say that the $K$ sample averages of the score contributions, evaluated at the ML estimate $\hat{\theta}_{ML}$, should be zero

$s(\hat{\theta}_{ML}) = \sum_i s_i(\hat{\theta}_{ML}) = 0$  (24)

(26)

Provided that the likelihood function is correctly specified, it can be shown under weak regularity conditions that

(a) the ML estimator is consistent for $\theta$:

$\hat{\theta}_{ML} \xrightarrow{\ p\ } \theta$  (25)

(27)

(b) the ML estimator is asymptotically efficient, that is, asymptotically, the ML estimator has the smallest variance among all consistent (linear and non-linear) estimators;

(28)

(c) the ML estimator is asymptotically normally distributed, according to

$\sqrt{n}\,(\hat{\theta}_{ML} - \theta) \xrightarrow{\ d\ } N(0, V)$  (26)

where $V$ is the asymptotic covariance matrix, which corresponds to the inverse of the information matrix $I(\theta)$.

(29)

The covariance matrix V is determined by the shape of the log-likelihood function.

To describe it in general terms we define the information in observation $i$ as

$I_i(\theta) \equiv -E\left[ \dfrac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right]$  (27)

Loosely speaking, this $K \times K$ matrix summarizes the expected amount of information about $\theta$ contained in observation $i$.

(30)

The average information matrix for a sample of size $n$ is given by

$I_n(\theta) \equiv \dfrac{1}{n} \sum_i I_i(\theta) = -E\left[ \dfrac{1}{n} \sum_i \dfrac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right] = -E\left[ \dfrac{1}{n} \dfrac{\partial^2 \log L(\theta)}{\partial \theta\, \partial \theta'} \right]$  (28)

(31)

while the limiting information matrix is defined as

$I(\theta) \equiv \lim_{n \to \infty} I_n(\theta)$  (29)

In the special case where observations are identically and independently distributed the following holds

$I(\theta) = I_n(\theta) = I_i(\theta)$  (30)

(32)

Intuitive interpretation of the information matrix

The expression in (28) is (minus) the expected value of the matrix of second order derivatives, scaled by the number of observations.

If the log-likelihood function is highly curved around its maximum, the second derivative is large (in absolute value), the variance is small and the ML estimator is relatively accurate.

If, on the other hand, the function is less curved, the second derivative is small, the variance is larger and the ML estimator less accurate.

Given the asymptotic efficiency of the ML estimator, the inverse of the information matrix provides a lower bound on the asymptotic covariance matrix, often referred to as the Cramér-Rao lower bound.

(33)

An alternative expression for the information matrix can be obtained from the result that the matrix

$J_i(\theta) \equiv E\left[ s_i(\theta)\, s_i(\theta)' \right]$  (31)

is identical to $I_i(\theta)$, provided that the likelihood function is correctly specified.

(34)

(d) the covariance matrix can be consistently estimated by replacing the expectations operator with a sample average and by replacing the unknown coefficients with the corresponding maximum likelihood estimates. The estimator based on (28) is

$\hat{V}_H = \left( -\dfrac{1}{n} \sum_i \left. \dfrac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right|_{\hat{\theta}_{ML}} \right)^{-1}$  (32)

whereas the estimator based on (31) is

$\hat{V}_G = \left( \dfrac{1}{n} \sum_i s_i(\hat{\theta}_{ML})\, s_i(\hat{\theta}_{ML})' \right)^{-1}$  (33)

(35)

Example 1: discrete setting

The second-order derivative is

$\dfrac{d^2 \log L(p)}{dp^2} = -\dfrac{n_1}{p^2} - \dfrac{n - n_1}{(1 - p)^2}$  (34)

We can therefore write that

(36)

I  −E 1n d2 log Lp

dp2  −E 1n − n1

p2 − n − n1

1 − p2

p

p2  1 − p

1 − p2

 1

p1 − p

(35)

(37)

and, finally,

$\sqrt{n}\,(\hat{p}_{ML} - p) \xrightarrow{\ d\ } N(0,\ p(1 - p))$  (36)

(38)

Example 2: continuous setting

The log-likelihood function is

$\log L(\beta, \sigma^2) = -\dfrac{n}{2}\log(2\pi\sigma^2) - \dfrac{1}{2} \sum_i \dfrac{(y_i - x_i'\beta)^2}{\sigma^2}$  (37)

(39)

The score contributions are therefore given by

$s_i(\beta, \sigma^2) = \begin{pmatrix} \dfrac{\partial \log L_i(\beta, \sigma^2)}{\partial \beta} \\[2mm] \dfrac{\partial \log L_i(\beta, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{y_i - x_i'\beta}{\sigma^2}\, x_i \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{(y_i - x_i'\beta)^2}{\sigma^4} \end{pmatrix} = \begin{pmatrix} \dfrac{\varepsilon_i}{\sigma^2}\, x_i \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{\varepsilon_i^2}{\sigma^4} \end{pmatrix}$  (38)
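The analytic score contribution in (38) can be checked against a numerical gradient of a single likelihood contribution; the sketch below does this for one hypothetical observation (all numbers are my own assumptions):

```python
import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.default_rng(6)

# One hypothetical observation; x_i includes an intercept
x_i = np.array([1.0, 0.7])
beta = np.array([1.0, 2.0])
sigma2 = 0.25
y_i = x_i @ beta + rng.normal(scale=np.sqrt(sigma2))

def log_lik_i(theta):
    # Contribution log L_i(beta, sigma^2) of one observation, cf. eq. (9)
    b, s2 = theta[:2], theta[2]
    r = y_i - x_i @ b
    return -0.5 * np.log(2 * np.pi * s2) - 0.5 * r**2 / s2

# Analytic score contribution from eq. (38)
eps = y_i - x_i @ beta
s_analytic = np.concatenate([eps / sigma2 * x_i,
                             [-0.5 / sigma2 + 0.5 * eps**2 / sigma2**2]])

s_numeric = approx_fprime(np.concatenate([beta, [sigma2]]), log_lik_i, 1e-7)
print(s_analytic, s_numeric)  # agree up to numerical error
```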

(40)

To obtain the asymptotic covariance matrix of the ML estimator we use the expression in (31).

After computing the outer product

$s_i(\beta, \sigma^2)\, s_i(\beta, \sigma^2)' = \begin{pmatrix} \dfrac{\varepsilon_i^2}{\sigma^4}\, x_i x_i' & \dfrac{\varepsilon_i}{\sigma^2} \left( -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{\varepsilon_i^2}{\sigma^4} \right) x_i \\[2mm] \dfrac{\varepsilon_i}{\sigma^2} \left( -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{\varepsilon_i^2}{\sigma^4} \right) x_i' & \left( -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{\varepsilon_i^2}{\sigma^4} \right)^2 \end{pmatrix}$  (39)

(41)

we obtain

$J_i(\beta, \sigma^2) = \begin{pmatrix} \dfrac{1}{\sigma^2}\, x_i x_i' & 0 \\[2mm] 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}$  (40)

Note that under normality

$E[\varepsilon_i] = 0, \quad E[\varepsilon_i^2] = \sigma^2, \quad E[\varepsilon_i^3] = 0, \quad E[\varepsilon_i^4] = 3\sigma^4$

(42)

Using the expression in (40), the asymptotic covariance matrix is given by

$I(\beta, \sigma^2)^{-1} = \begin{pmatrix} \sigma^2\, \Sigma_{xx}^{-1} & 0 \\[2mm] 0 & 2\sigma^4 \end{pmatrix}$  (41)

where $\Sigma_{xx} \equiv \lim_{n \to \infty} \dfrac{1}{n} \sum_i x_i x_i'$.

(43)

From all this we finally obtain

$\sqrt{n}\,(\hat{\beta}_{ML} - \beta) \xrightarrow{\ d\ } N(0,\ \sigma^2 \Sigma_{xx}^{-1})$

$\sqrt{n}\,(\hat{\sigma}^2_{ML} - \sigma^2) \xrightarrow{\ d\ } N(0,\ 2\sigma^4)$  (42)
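A closing sketch (again on simulated data of my own choosing) that turns (41)-(42) into estimated asymptotic standard errors:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical linear-model data: y_i = 1 + 2 x_i + eps_i, eps_i ~ N(0, 0.25)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)     # beta_hat_ML = beta_hat_OLS, eq. (13)
e = y - X @ b
s2_ml = (e @ e) / n                       # sigma^2_hat_ML, eq. (16)

cov_b = s2_ml * np.linalg.inv(X.T @ X)    # estimated covariance of beta_hat, from (41)
se_b = np.sqrt(np.diag(cov_b))            # standard errors for beta_1 and beta_2
se_s2 = np.sqrt(2 * s2_ml**2 / n)         # standard error for sigma^2_hat, from (42)
print(se_b, se_s2)
```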
