
3. Empirical Methods

3.2 Regression with binary dependent variables

So far, the independent variables of the linear regression model have been allowed to be either binary or continuous, with no particular issues in either case. But what happens when the dependent variable is binary? This is the case here, since the dependent variable indicates the presence or absence of a local ICT specialization, so the only admissible values are 0 and 1. Things then become more delicate, since the regression function has to be interpreted as a predicted probability (Stock & Watson, 2014).

The linear multiple regression model applied to a binary dependent variable is called the linear probability model (“linear” because it is a straight line and “probability” model because it models the probability that the dependent variable equals 1).

Within this model, the population coefficient 𝛽1 of the regressor 𝑋1 represents the change in the probability that 𝑌 = 1 associated with a unit change in 𝑋1, holding the other regressors constant (and likewise for 𝛽𝑖, 𝑖 = 1, … , 𝑘). The same interpretation applies to the estimated predicted values.

The linear probability model is the linear multiple regression model expressed as follows:

𝑌𝑖 = 𝛽0 + 𝛽1𝑋1𝑖 + ⋯ + 𝛽𝑘𝑋𝑘𝑖 + 𝑢𝑖   (3.14)

Since Y is binary, we have that:

Pr(𝑌 = 1|𝑋1, … , 𝑋𝑘) = 𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘   (3.15)

The regression coefficients can be estimated by OLS and, apart from 𝑅2, the usual OLS standard errors can still be used. The problem with 𝑅2 is the following: while for continuous dependent variables it can equal 1 (all data lie exactly on the regression line), this is impossible for a binary dependent variable (unless the regressors are binary as well).

One of the main issues with a binary dependent variable is that, since probabilities range between 0 and 1, the effect on the probability that 𝑌 = 1 of a given change in X must be nonlinear. Otherwise, very high (low) values of X would produce predicted probabilities higher than one (lower than zero), which is nonsensical for a probability. To address this problem, the probit and logit regression models have been introduced.
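As a minimal sketch with made-up data (the values below are purely illustrative), the following pure-Python snippet fits the linear probability model by OLS with a single regressor and shows that the fitted line can produce "probabilities" outside [0, 1]:

```python
# Hypothetical data: x is a continuous regressor, y a binary outcome.
x = [0.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS slope and intercept for y = b0 + b1*x.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar   # b1 = 0.4, b0 = -0.1

# Predicted "probabilities" at the extremes of x escape the unit interval.
p_low = b0 + b1 * 0.0     # -0.1: below 0
p_high = b0 + b1 * 3.0    #  1.1: above 1
```

This is exactly the defect that motivates the probit and logit models described next.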

Probit and logit regression are nonlinear regression models specifically designed for binary dependent variables, since they force predicted values to lie between 0 and 1 by using cumulative distribution functions (c.d.f.'s). Probit regression uses the standard normal c.d.f., whereas logit regression (also called logistic regression) uses the "logistic" c.d.f.
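The two c.d.f.'s can be written down directly; the standard normal one can be expressed via the error function available in Python's standard library. A brief sketch, with assumed example z values:

```python
import math

def phi(z):
    """Standard normal c.d.f. (probit link), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):
    """Standard logistic c.d.f. (logit link)."""
    return 1.0 / (1.0 + math.exp(-z))

# Both functions map any real z into (0, 1) and equal 0.5 at z = 0,
# which is what keeps predicted probabilities inside the unit interval.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    assert 0.0 < phi(z) < 1.0
    assert 0.0 < logistic(z) < 1.0
```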

3.2.1 Probit Regression

The probit regression model with a single regressor X is:

Pr(𝑌 = 1|𝑋) = Φ(𝛽0 + 𝛽1𝑋)   (3.16)

where Φ is the cumulative standard normal distribution function (tabulated in Appendix Table 1 and Table 2).

In the probit model, the term 𝛽0 + 𝛽1𝑋 plays the role of the "z" value in the cumulative standard normal distribution (Table 1 and Table 2). The probit coefficient 𝛽1 in equation (3.16) is the change in the z-value associated with a unit change in X. Considering:


• 𝛽1 > 0: an increase in X increases the value of z and thus the probability that 𝑌 = 1 increases;

• 𝛽1 < 0: an increase in X decreases the probability that 𝑌 = 1.

It is worth noting that, although the effect of X on the z-value is linear, its effect on the probability is nonlinear.

Considering now two regressors instead of one (probit regression with multiple regressors), equation (3.16) becomes:

Pr(𝑌 = 1|𝑋1, 𝑋2) = Φ(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2)   (3.17)

Since the population regression function is a nonlinear function of X, the expected change in Y due to a change in X (here, the change in the probability that 𝑌 = 1) is estimated in three steps. First, compute the predicted value of Y at the original value of X. Then, compute the predicted value of Y at the new value of X. Finally, take the difference between the two predicted values of Y.
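The three steps can be sketched as follows, using hypothetical probit coefficients (the values are illustrative, not estimated from real data):

```python
import math

def phi(z):
    # Standard normal c.d.f. via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical probit coefficients for a single regressor.
b0, b1 = -2.0, 3.0

# Step 1: predicted probability at the original value of X.
p_old = phi(b0 + b1 * 0.4)
# Step 2: predicted probability at the new value of X.
p_new = phi(b0 + b1 * 0.5)
# Step 3: the estimated effect is the difference in predicted probabilities.
effect = p_new - p_old
```

Because Φ is nonlinear, the same unit change in X produces a different `effect` depending on the starting value of X.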

The general expression of equation (3.17) with multiple regressors is the following:

Pr(𝑌 = 1|𝑋1, … , 𝑋𝑘) = Φ(𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘)   (3.18)

The probit coefficients are estimated by the method of maximum likelihood, which produces efficient estimators in a wide variety of applications. The maximum likelihood estimator is consistent and normally distributed in large samples, so t-statistics and confidence intervals for the coefficients can be constructed in the usual way.

3.2.2 Logit Regression

The logit regression model of the binary dependent variable Y with multiple regressors is:

Pr(𝑌 = 1|𝑋1, … , 𝑋𝑘) = 𝐹(𝛽0 + 𝛽1𝑋1 + ⋯ + 𝛽𝑘𝑋𝑘) = 1 / (1 + 𝑒^−(𝛽0+𝛽1𝑋1+⋯+𝛽𝑘𝑋𝑘))   (3.19)

Logit regression is similar to probit regression except for the cumulative distribution function, which differs from the cumulative standard normal distribution function. Here the cumulative standard logistic distribution function is used, denoted by F. It is defined in terms of the exponential function, as given in equation (3.19).

As with probit, the logit coefficients can be estimated by maximum likelihood. The maximum likelihood estimator is consistent and normally distributed in large samples, so t-statistics and confidence intervals for the coefficients can be constructed in the usual way.

The differences between the probit and logit regression functions are small, and the two models often produce similar results. Logit regression used to be preferred because the logistic cumulative distribution function could be computed faster, but with modern computers this difference is no longer important.

3.2.3 Maximum Likelihood Estimation

Having defined them, it can be noted that the probit and logit regression functions are nonlinear functions of the coefficients (i.e., the coefficients 𝛽0, 𝛽1, …, 𝛽𝑘 appear inside the cumulative standard normal distribution function in the probit case and inside the cumulative standard logistic distribution function in the logit case). For this reason, the coefficients 𝛽0, 𝛽1, …, 𝛽𝑘 cannot be estimated by OLS; they are instead estimated by maximum likelihood.
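As a minimal illustration of what maximizing the likelihood means in practice, the sketch below fits a logit model with one regressor on made-up data by plain gradient ascent on the log-likelihood (real software uses more sophisticated optimizers, but the objective is the same):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """Bernoulli log-likelihood of the logit model at coefficients (b0, b1)."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

# Hypothetical (non-separable) data, so a finite maximizer exists.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 0, 1]

# Gradient ascent: the gradient of the logit log-likelihood has the
# well-known form sum over i of (y_i - p_i) * (1, x_i).
b0, b1 = 0.0, 0.0
lr = 0.05
for _ in range(20000):
    g0 = sum(yi - logistic(b0 + b1 * xi) for xi, yi in zip(x, y))
    g1 = sum((yi - logistic(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
    b0 += lr * g0
    b1 += lr * g1
```

At convergence the gradient is (numerically) zero and the log-likelihood is higher than at any other coefficient values, which is precisely the MLE.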

The likelihood function is the joint probability distribution of the data, treated as a function of the unknown coefficients. The maximum likelihood estimator (MLE) of the unknown coefficients consists of the values of the coefficients that maximize the likelihood function.

The likelihood function for 𝑛 = 2 independent and identically distributed observations (𝑌1, 𝑌2) on a binary dependent variable with no regressors is:

𝑓(𝑝; 𝑌1, 𝑌2) = 𝑝^(𝑌1+𝑌2) (1 − 𝑝)^(2−(𝑌1+𝑌2))   (3.20)

where the only unknown parameter to estimate is the probability p that 𝑌 = 1, which is also the mean of Y.

The maximum likelihood estimator of p is the value of p that maximizes the likelihood function in the equation (3.20). For general n, the MLE 𝑝̂ of the Bernoulli probability p is the sample average (that is, 𝑝̂ = 𝑌̅).
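The claim that 𝑝̂ = 𝑌̅ maximizes the likelihood can be checked numerically on a small hypothetical sample, generalizing equation (3.20) from n = 2 to general n:

```python
# For n observations the Bernoulli likelihood is
# f(p) = p^s * (1-p)^(n-s), where s = Y1 + ... + Yn.
y = [1, 0, 1, 1, 0]          # hypothetical sample
n, s = len(y), sum(y)

def likelihood(p):
    return p ** s * (1.0 - p) ** (n - s)

p_hat = s / n                # MLE: the sample average, p̂ = Ȳ = 0.6

# Grid check: no other candidate value of p attains a higher likelihood.
assert all(likelihood(p_hat) >= likelihood(k / 100) for k in range(1, 100))
```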

3.2.4 Measures of Fit

As mentioned before, the 𝑅2 is a poor measure of fit for the linear probability model, and this is also true for probit and logit regression. The two measures of fit used for models with binary dependent variables are the "fraction correctly predicted" and the "pseudo-𝑅2".

The fraction correctly predicted is defined as follows: if 𝑌𝑖 = 1 and the predicted probability exceeds 50%, or if 𝑌𝑖 = 0 and the predicted probability is below 50%, then 𝑌𝑖 is said to be correctly predicted; otherwise, 𝑌𝑖 is incorrectly predicted. The fraction correctly predicted is the share of the n observations 𝑌1, … , 𝑌𝑛 that are correctly predicted.
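This measure can be computed directly. A sketch with hypothetical outcomes and fitted probabilities (observations with a predicted probability of exactly 50% are counted as predicting 𝑌 = 0 here, a tie-breaking convention the rule above leaves open):

```python
def fraction_correctly_predicted(y, p):
    """y: observed 0/1 outcomes; p: predicted probabilities that Y = 1."""
    correct = sum(1 for yi, pi in zip(y, p)
                  if (yi == 1 and pi > 0.5) or (yi == 0 and pi <= 0.5))
    return correct / len(y)

# Hypothetical outcomes and fitted probabilities.
y = [1, 0, 1, 0]
p = [0.8, 0.3, 0.4, 0.6]
# The first two observations are correctly predicted; the last two are not,
# so the fraction correctly predicted is 0.5.
fraction_correctly_predicted(y, p)
```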

The pseudo-𝑅2 measures the fit of the model using the likelihood function. Since the MLE maximizes the likelihood function, adding another regressor to the probit or logit model increases the value of the maximized likelihood. Therefore, the fit is measured by comparing values of the maximized likelihood function with all the regressors to the value of the likelihood with none.

Specifically, the pseudo-𝑅2 for the probit model is:

pseudo-𝑅2 = 1 − ln(𝑓^𝑝𝑟𝑜𝑏𝑖𝑡_𝑚𝑎𝑥) / ln(𝑓^𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖_𝑚𝑎𝑥)   (3.21)
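A sketch of the pseudo-𝑅2 computation in equation (3.21), using hypothetical outcomes and fitted probabilities in place of an actual probit fit (the comparison model with no regressors is the Bernoulli model, whose MLE is the sample mean):

```python
import math

def bernoulli_ll(y):
    """Maximized log-likelihood of the intercept-only (Bernoulli) model."""
    p = sum(y) / len(y)
    return sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p) for yi in y)

def model_ll(y, p_hat):
    """Log-likelihood of a fitted model evaluated at its predicted probabilities."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p_hat))

# Hypothetical outcomes and fitted probabilities from some probit/logit fit.
y = [1, 0, 1, 0, 1]
p_hat = [0.9, 0.2, 0.8, 0.3, 0.7]

# Both log-likelihoods are negative; a better-fitting model gives a ratio
# below 1 and hence a pseudo-R2 between 0 and 1.
pseudo_r2 = 1.0 - model_ll(y, p_hat) / bernoulli_ll(y)
```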
