Maximum Likelihood (ML)
Estimation and Specification Tests
Econometrics I Part II
2016
Introduction
The ML estimation methodology is grounded in the assumption that the (conditional) distribution of an observed phenomenon (the endogenous variable) is known up to a finite number of unknown parameters.
These unknown parameters are estimated by taking those values that give the observed sample the highest probability (likelihood) of being drawn, given the assumed conditional distribution.
Example 1: discrete setting
Consider a large pool filled with black and white balls. We are interested in the fraction of white balls, $p$, in this pool.
To obtain information on $p$ we extract a random sample of $n$ balls.
Let $y_i = 1$ if ball $i$ is white and $y_i = 0$ otherwise.
Then it holds by assumption that
\[
\Pr\{y_i = 1\} = p. \qquad (1)
\]
Suppose our sample contains $n_1 = \sum_i y_i$ white balls and $n - n_1$ black balls. The probability of obtaining such a sample (in a given order) is given by
\[
L(p) = p^{n_1}(1 - p)^{n - n_1}. \qquad (2)
\]
The expression in (2), $L(p)$, seen as a function of the unknown parameter $p$, is referred to as the likelihood function.
The maximum likelihood estimator of $p$ is obtained by choosing the value of $p$ that maximizes the expression in (2), that is, the probability of drawing the observed sample.
For computational reasons it is often more convenient to maximize the natural logarithm (a monotonic transformation) of the expression in (2). The transformed objective is referred to as the log-likelihood function
\[
\log L(p) = n_1 \log p + (n - n_1)\log(1 - p). \qquad (3)
\]
Maximizing the expression in (3) with respect to $p$ gives the first-order condition
\[
\frac{d \log L(p)}{dp} = \frac{n_1}{p} - \frac{n - n_1}{1 - p} = 0, \qquad (4)
\]
which, solving for $p$, gives the ML estimator
\[
\hat{p}_{ML} = \frac{n_1}{n}. \qquad (5)
\]
To be sure that the solution we have corresponds to a maximum we also need to check the second-order condition
\[
\frac{d^2 \log L(p)}{dp^2} = -\frac{n_1}{p^2} - \frac{n - n_1}{(1 - p)^2} < 0, \qquad (6)
\]
which indeed shows that $L(\hat{p}_{ML})$ is a maximum.
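As an illustration, here is a minimal sketch in Python, assuming NumPy and SciPy are available and using a simulated sample with an arbitrarily chosen true fraction $p = 0.3$: it computes the closed-form estimator (5) and confirms it by maximizing the log-likelihood (3) numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Simulated draws from the pool: y_i = 1 (white) with probability p = 0.3
p_true = 0.3
y = rng.binomial(1, p_true, size=1000)
n, n1 = y.size, y.sum()

# Closed-form ML estimator (5): p_hat = n1 / n
p_hat = n1 / n

# Numerical check: maximize the log-likelihood (3) over (0, 1)
def neg_loglik(p):
    return -(n1 * np.log(p) + (n - n1) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(p_hat, res.x)  # the two values should coincide up to numerical precision
```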
Example 2: continuous setting
Consider a bivariate classical linear regression model augmented with the normality assumption for the error terms:
\[
y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \qquad \varepsilon_i \,|\, x_i \sim NID(0, \sigma^2), \qquad (7)
\]
where the acronym NID stands for normally (N) and independently (I) distributed errors.
Given the assumptions in (7), the following holds:
\[
y_i \,|\, x_i \sim NID(\beta_1 + \beta_2 x_i, \sigma^2). \qquad (8)
\]
Therefore the contribution of observation $i$ to the likelihood function is the value of the density function at the observed point $y_i$. For the normal distribution this gives
\[
f(y_i \,|\, x_i; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2}\frac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\}. \qquad (9)
\]
Because of the independence assumption, the joint density of $y_1, y_2, \dots, y_n$ (conditional on $X$) is given by
\[
f(y_1, y_2, \dots, y_n \,|\, X; \beta, \sigma^2) = \prod_i f(y_i \,|\, x_i; \beta, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \prod_i \exp\left\{ -\frac{1}{2}\frac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\}. \qquad (10)
\]
The likelihood function is identical to the joint density function in (10), but it is seen as a function of the unknown parameters $\beta$ and $\sigma^2$. We can therefore write
\[
L(\beta, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \prod_i \exp\left\{ -\frac{1}{2}\frac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2} \right\} \qquad (11)
\]
and, by applying the log transformation,
\[
\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_i \frac{(y_i - \beta_1 - \beta_2 x_i)^2}{\sigma^2}. \qquad (12)
\]
As the first term in (12) does not depend upon $\beta$, it can be easily seen that maximizing the expression in (12) with respect to $\beta_1$ and $\beta_2$ corresponds to minimizing the residual sum of squares $S(\beta)$.
That is, the ML estimators for $\beta_1$ and $\beta_2$ are identical to the OLS estimators. In general terms the following holds:
\[
b_{ML} = b_{OLS} = (X'X)^{-1}X'y. \qquad (13)
\]
Given the expression in (13), we can substitute $y_i - \beta_1 - \beta_2 x_i$ in expression (12) with the corresponding ML residuals $e_i$ (which are also the OLS residuals):
\[
\log L(\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_i \frac{e_i^2}{\sigma^2}. \qquad (14)
\]
After differentiating the expression in (14) with respect to $\sigma^2$ we obtain the first-order condition
\[
\frac{d \log L(\sigma^2)}{d\sigma^2} = -\frac{n}{2}\,\frac{2\pi}{2\pi\sigma^2} + \frac{1}{2}\sum_i \frac{e_i^2}{\sigma^4} = 0. \qquad (15)
\]
Solving for $\sigma^2$ yields the ML estimator for $\sigma^2$:
\[
\hat{\sigma}^2_{ML} = \frac{e'e}{n}. \qquad (16)
\]
This estimator is consistent but biased. In fact, it does not correspond to the unbiased estimator for $\sigma^2$ that was derived from the OLS estimator, given by
\[
s^2 = \frac{e'e}{n - K}. \qquad (17)
\]
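The sketch below, again a Python illustration on simulated data with arbitrary parameter values, checks that the coefficients maximizing (12) coincide with the OLS estimates (13), and contrasts the biased ML variance estimator (16) with the unbiased estimator (17).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated bivariate model (7): y_i = beta1 + beta2 * x_i + eps_i, eps_i ~ N(0, sigma^2)
n, beta1, beta2, sigma = 200, 1.0, 0.5, 2.0
x = rng.normal(size=n)
y = beta1 + beta2 * x + rng.normal(scale=sigma, size=n)
X = np.column_stack([np.ones(n), x])

# OLS / ML estimator (13): b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

sigma2_ml = e @ e / n            # ML estimator (16), biased
s2 = e @ e / (n - X.shape[1])    # unbiased OLS-based estimator (17)

# Numerical check: maximize the log-likelihood (12) over (beta1, beta2, sigma^2)
def neg_loglik(theta):
    b1, b2, sig2 = theta
    resid = y - b1 - b2 * x
    return 0.5 * n * np.log(2 * np.pi * sig2) + 0.5 * np.sum(resid**2) / sig2

res = minimize(neg_loglik, x0=[0.0, 0.0, 1.0],
               bounds=[(None, None), (None, None), (1e-6, None)])

print(b, sigma2_ml)   # closed-form ML estimates
print(res.x)          # numerical ML estimates: approximately the same values
print(s2)             # larger than sigma2_ml by the factor n / (n - K)
```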
General properties of the ML estimator
Suppose that we are interested in the conditional distribution of $y_i$ given $x_i$.
The probability mass function (in a discrete setting) or the density function (in a continuous setting) can be written as
\[
f(y_i \,|\, x_i; \theta), \qquad (18)
\]
where $\theta$ is the unknown parameter vector.
Assume that observations are mutually independent. In this situation the probability mass or joint density function of the sample $y_1, y_2, \dots, y_n$, conditional on $x_1, x_2, \dots, x_n$, can be written as
\[
f(y_1, y_2, \dots, y_n \,|\, X; \theta) = \prod_i f(y_i \,|\, x_i; \theta). \qquad (19)
\]
The likelihood function is therefore given by
\[
L(\theta) = \prod_i L_i(\theta) = \prod_i f(y_i \,|\, x_i; \theta), \qquad (20)
\]
where $L_i(\theta)$ is the likelihood contribution of observation $i$, which represents how much observation $i$ contributes to the likelihood.
The ML estimator for $\theta$ is the solution to the maximization problem
\[
\max_{\theta} \log L(\theta) = \max_{\theta} \sum_i \log L_i(\theta). \qquad (21)
\]
The first-order conditions are given by
\[
\left.\frac{\partial \log L(\theta)}{\partial \theta}\right|_{\hat{\theta}_{ML}} = \sum_i \left.\frac{\partial \log L_i(\theta)}{\partial \theta}\right|_{\hat{\theta}_{ML}} = 0. \qquad (22)
\]
If the log-likelihood function is globally concave there is a unique global maximum and the ML estimator is uniquely determined by these first-order conditions.
Only in special cases, however, can the ML estimator be determined analytically. More often, numerical optimization methods are required.
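As a minimal sketch of what such a method does, the Python code below runs Newton-type iterations (the building block of many numerical optimizers) on the globally concave log-likelihood of Example 1, using the analytical score (4) and second derivative (6); the sample counts n and n1 are hypothetical.

```python
import numpy as np

# Hypothetical sample: n draws, of which n1 are white balls
n, n1 = 1000, 312

def score(p):       # d log L(p) / dp, equation (4)
    return n1 / p - (n - n1) / (1 - p)

def hessian(p):     # d^2 log L(p) / dp^2, equation (6)
    return -n1 / p**2 - (n - n1) / (1 - p)**2

# Newton-Raphson iterations: p <- p - score(p) / hessian(p)
p = 0.5             # starting value
for _ in range(25):
    step = score(p) / hessian(p)
    p -= step
    if abs(step) < 1e-12:
        break

print(p, n1 / n)    # the iterations converge to the analytical solution n1 / n
```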
For notational convenience, we denote the vector of first derivatives of the log-likelihood function as
\[
s(\theta) \equiv \frac{\partial \log L(\theta)}{\partial \theta} = \sum_i \frac{\partial \log L_i(\theta)}{\partial \theta} = \sum_i s_i(\theta), \qquad (23)
\]
where $s(\theta)$ is referred to as the score vector and $s_i(\theta)$ as the score contribution of observation $i$.
The first-order conditions thus say that the $K$ sample averages of the score contributions, evaluated at the ML estimate $\hat{\theta}_{ML}$, should be zero:
\[
s(\hat{\theta}_{ML}) = \sum_i s_i(\hat{\theta}_{ML}) = 0. \qquad (24)
\]
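As a quick numerical illustration of (24), assuming NumPy and SciPy and reusing a simulated linear model as in Example 2 with arbitrary parameter values, the gradient of the log-likelihood evaluated at the closed-form ML estimates should vanish up to numerical error.

```python
import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.default_rng(2)

# Simulated linear regression data (illustrative values)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)
X = np.column_stack([np.ones(n), x])

def loglik(theta):
    b1, b2, sig2 = theta
    resid = y - b1 - b2 * x
    return -0.5 * n * np.log(2 * np.pi * sig2) - 0.5 * np.sum(resid**2) / sig2

# Closed-form ML estimates: OLS coefficients and sigma2 = e'e / n
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
theta_ml = np.array([b[0], b[1], e @ e / n])

# The score s(theta), i.e. the gradient of log L, should be zero at the ML estimate
print(approx_fprime(theta_ml, loglik, 1e-6))  # approximately [0, 0, 0]
```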
Provided that the likelihood function is correctly specified, it can be shown under weak regularity conditions that:
(a) the ML estimator is consistent for $\theta$,
\[
\hat{\theta}_{ML} \xrightarrow{p} \theta; \qquad (25)
\]
(b) the ML estimator is asymptotically efficient, that is, asymptotically, the ML estimator has the smallest variance among all consistent (linear and non-linear) asymptotic estimators;
(c) the ML estimator is asymptotically normally distributed, according to
\[
\sqrt{n}\,(\hat{\theta}_{ML} - \theta) \xrightarrow{d} N(0, V), \qquad (26)
\]
where $V$ is the asymptotic covariance matrix, which corresponds to the inverse of the information matrix $I(\theta)$.
The covariance matrix $V$ is determined by the shape of the log-likelihood function.
To describe it in general terms we define the information in observation $i$ as
\[
I_i(\theta) \equiv -E\left[\frac{\partial^2 \log L_i(\theta)}{\partial \theta\,\partial \theta'}\right]. \qquad (27)
\]
Loosely speaking, this $K \times K$ matrix summarizes the expected amount of information about $\theta$ contained in observation $i$.
The average information matrix for a sample of size $n$ is given by
\[
\bar{I}_n(\theta) \equiv \frac{1}{n}\sum_i I_i(\theta) = -E\left[\frac{1}{n}\sum_i \frac{\partial^2 \log L_i(\theta)}{\partial \theta\,\partial \theta'}\right] = -E\left[\frac{1}{n}\frac{\partial^2 \log L(\theta)}{\partial \theta\,\partial \theta'}\right], \qquad (28)
\]
while the limiting information matrix is defined as
\[
I(\theta) \equiv \lim_{n \to \infty} \bar{I}_n(\theta). \qquad (29)
\]
In the special case where observations are identically and independently distributed, the following holds:
\[
I(\theta) = \bar{I}_n(\theta) = I_i(\theta). \qquad (30)
\]
Intuitive interpretation of the information matrix
The expression in (28) is (minus) the expected value of the matrix of second order derivatives, scaled by the number of observations.
If the log-likelihood function is highly curved around its maximum, the second derivative is large (in absolute value), the variance is small, and the ML estimator is relatively accurate.
If, on the other hand, the function is less curved, the second derivative is small, the variance is larger, and the ML estimator is less accurate.
Given the asymptotic efficiency of the ML estimator, the inverse of the information matrix provides a lower bound on the asymptotic covariance matrix, often referred to as the Cramér-Rao lower bound.
An alternative expression for the information matrix can be obtained from the result that the matrix
\[
J_i(\theta) \equiv E\left[s_i(\theta)\, s_i(\theta)'\right] \qquad (31)
\]
is identical to $I_i(\theta)$, provided that the likelihood function is correctly specified.
(d) the covariance matrix $V$ can be consistently estimated by replacing the expectations operator with a sample average and by replacing the unknown coefficients with the corresponding maximum likelihood estimates. The estimator based on (28) is
\[
\hat{V}_H = \left( -\frac{1}{n}\sum_i \left. \frac{\partial^2 \log L_i(\theta)}{\partial \theta\,\partial \theta'} \right|_{\hat{\theta}_{ML}} \right)^{-1}, \qquad (32)
\]
whereas the estimator based on (31) is
\[
\hat{V}_G = \left( \frac{1}{n}\sum_i s_i(\hat{\theta}_{ML})\, s_i(\hat{\theta}_{ML})' \right)^{-1}. \qquad (33)
\]
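As an illustration, the small Python sketch below computes both estimators on simulated Bernoulli data from Example 1 (arbitrary true $p$). For this one-parameter model both expressions reduce analytically to $\hat{p}(1 - \hat{p})$, which matches the inverse information $1/I(p) = p(1 - p)$ derived in the example below.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated Bernoulli sample from Example 1 (arbitrary true p)
y = rng.binomial(1, 0.3, size=2000)
n = y.size
p_hat = y.mean()                            # ML estimate n1 / n

# Score contributions s_i(p) and second derivatives of log L_i(p), at p_hat
s_i = y / p_hat - (1 - y) / (1 - p_hat)
h_i = -y / p_hat**2 - (1 - y) / (1 - p_hat)**2

V_H = 1.0 / (-h_i.mean())     # Hessian-based estimator (32)
V_G = 1.0 / (s_i**2).mean()   # outer-product-of-gradients estimator (33)

print(V_H, V_G, p_hat * (1 - p_hat))   # all three coincide for this model
```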
Example 1: discrete setting
The second-order derivative is
\[
\frac{d^2 \log L(p)}{dp^2} = -\frac{n_1}{p^2} - \frac{n - n_1}{(1 - p)^2}. \qquad (34)
\]
We can therefore write (using $E(n_1) = np$)
\[
I(p) = -E\left[\frac{1}{n}\frac{d^2 \log L(p)}{dp^2}\right] = -E\left[\frac{1}{n}\left(-\frac{n_1}{p^2} - \frac{n - n_1}{(1 - p)^2}\right)\right] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p(1 - p)} \qquad (35)
\]
and, finally,
\[
\sqrt{n}\,(\hat{p}_{ML} - p) \xrightarrow{d} N(0,\, p(1 - p)). \qquad (36)
\]
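A small Monte Carlo sketch in Python (with arbitrary $p$, $n$ and number of replications) can be used to illustrate (36): the simulated variance of $\sqrt{n}(\hat{p}_{ML} - p)$ should be close to $p(1 - p)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo illustration of (36) with arbitrary p, n and replications
p, n, reps = 0.3, 2000, 5000
p_hat = rng.binomial(n, p, size=reps) / n   # n1 / n in each replication
z = np.sqrt(n) * (p_hat - p)

print(z.mean())              # close to 0
print(z.var(), p * (1 - p))  # simulated variance vs. asymptotic variance p(1 - p)
```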
Example 2: continuous setting
The log-likelihood function is
\[
\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_i \frac{(y_i - x_i'\beta)^2}{\sigma^2}. \qquad (37)
\]
The score contributions are therefore given by
\[
s_i(\beta, \sigma^2) =
\begin{pmatrix}
\frac{\partial \log L_i(\beta, \sigma^2)}{\partial \beta} \\[1.5ex]
\frac{\partial \log L_i(\beta, \sigma^2)}{\partial \sigma^2}
\end{pmatrix}
=
\begin{pmatrix}
\frac{y_i - x_i'\beta}{\sigma^2}\, x_i \\[1.5ex]
-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{(y_i - x_i'\beta)^2}{\sigma^4}
\end{pmatrix}
=
\begin{pmatrix}
\frac{\varepsilon_i}{\sigma^2}\, x_i \\[1.5ex]
-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{\varepsilon_i^2}{\sigma^4}
\end{pmatrix}. \qquad (38)
\]
To obtain the asymptotic covariance matrix of the ML estimator we use the expression in (31).
After computing the outer product
\[
s_i(\beta, \sigma^2)\, s_i(\beta, \sigma^2)' =
\begin{pmatrix}
\frac{\varepsilon_i^2}{\sigma^4}\, x_i x_i' & \frac{\varepsilon_i}{\sigma^2}\left(-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{\varepsilon_i^2}{\sigma^4}\right) x_i \\[1.5ex]
\frac{\varepsilon_i}{\sigma^2}\left(-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{\varepsilon_i^2}{\sigma^4}\right) x_i' & \left(-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{\varepsilon_i^2}{\sigma^4}\right)^2
\end{pmatrix} \qquad (39)
\]
we obtain
\[
J_i(\beta, \sigma^2) =
\begin{pmatrix}
\frac{1}{\sigma^2}\, x_i x_i' & 0 \\[1ex]
0' & \frac{1}{2\sigma^4}
\end{pmatrix}. \qquad (40)
\]
Note that under normality
\[
E(\varepsilon_i) = 0, \qquad E(\varepsilon_i^2) = \sigma^2, \qquad E(\varepsilon_i^3) = 0, \qquad E(\varepsilon_i^4) = 3\sigma^4.
\]
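These moments are what make the off-diagonal block in the expectation of (39) vanish and give the lower-right element $1/(2\sigma^4)$. The short simulation sketch below (Python, with arbitrary $\sigma^2$ and $x_i$) illustrates the step from (39) to (40).

```python
import numpy as np

rng = np.random.default_rng(6)

sigma2 = 2.0
x_i = np.array([1.0, 0.7])                  # an arbitrary regressor vector x_i
eps = rng.normal(scale=np.sqrt(sigma2), size=1_000_000)

# Normal moments used above: E(eps^3) = 0, E(eps^4) = 3 * sigma^4
print(np.mean(eps**3), np.mean(eps**4), 3 * sigma2**2)

# Sample averages of the blocks of the outer product (39)
upper_left = np.mean(eps**2) / sigma2**2 * np.outer(x_i, x_i)
off_diag = np.mean(eps / sigma2 * (-1 / (2 * sigma2) + 0.5 * eps**2 / sigma2**2)) * x_i
lower_right = np.mean((-1 / (2 * sigma2) + 0.5 * eps**2 / sigma2**2) ** 2)

print(upper_left)                          # approx (1 / sigma^2) * x_i x_i'
print(off_diag)                            # approx 0
print(lower_right, 1 / (2 * sigma2**2))    # approx 1 / (2 * sigma^4), as in (40)
```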
Using the expression in (40), the asymptotic covariance matrix is given by
\[
I(\beta, \sigma^2)^{-1} =
\begin{pmatrix}
\sigma^2 \Sigma_{xx}^{-1} & 0 \\
0' & 2\sigma^4
\end{pmatrix}, \qquad (41)
\]
where $\Sigma_{xx} \equiv \lim_{n \to \infty} \frac{1}{n}\sum_i x_i x_i'$. From all this we finally obtain
\[
\sqrt{n}\,(\hat{\beta}_{ML} - \beta) \xrightarrow{d} N(0,\, \sigma^2 \Sigma_{xx}^{-1}), \qquad
\sqrt{n}\,(\hat{\sigma}^2_{ML} - \sigma^2) \xrightarrow{d} N(0,\, 2\sigma^4). \qquad (42)
\]
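To close, a Monte Carlo sketch in Python (with arbitrary $\beta$, $\sigma^2$ and sample size) that illustrates (41) and (42): across replications, the covariance of $\sqrt{n}(\hat{\beta}_{ML} - \beta)$ should be close to $\sigma^2 \Sigma_{xx}^{-1}$ and the variance of $\sqrt{n}(\hat{\sigma}^2_{ML} - \sigma^2)$ close to $2\sigma^4$. With a standard normal regressor and an intercept, $\Sigma_{xx}$ is the $2 \times 2$ identity matrix.

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo illustration of (42) with arbitrary parameter values
n, reps = 500, 3000
beta, sigma2 = np.array([1.0, 0.5]), 4.0

b_draws = np.empty((reps, 2))
s2_draws = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)                          # Sigma_xx = I for this design
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(X.T @ X, X.T @ y)           # ML / OLS estimator (13)
    e = y - X @ b
    b_draws[r] = b
    s2_draws[r] = e @ e / n                         # ML estimator (16)

print(np.cov(np.sqrt(n) * (b_draws - beta), rowvar=False))      # approx sigma2 * I
print(np.var(np.sqrt(n) * (s2_draws - sigma2)), 2 * sigma2**2)  # approx 2 * sigma^4
```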