
Master's Degree in Economics and Finance
Second cycle (D.M. 270/2004)

Final Thesis

Identifying influential variables for corporate credit ratings in the U.S. technology sector

Supervisor
Ch. Prof. Irene Mammi

Graduand
Luca Brusadin
Matriculation number 856248

Academic Year


Abstract

Credit Rating Agencies play a fundamental role in providing information to market participants. However, a large number of companies are not rated. This thesis therefore aims to provide a sensible model for forecasting corporate ratings of listed companies in the U.S. technology sector. The variables collected are financial and market-based information available to the public. Further, board and gender diversification data are included to investigate whether they have an impact on rating assignment. Since the outcome of this research is categorical, a logistic regression model was chosen to identify influential variables and how significantly they modify a rating outcome. Once the prediction accuracy of the model is assessed, the results are compared with the theoretical background to draw conclusions on the possibility of providing an initial rating outlook to unrated companies.


Contents

Chapter 1. Introduction . . . .1

Chapter 2. The Rating Process . . . 3

2.1 Introduction to credit ratings methodology . . . 3

2.2 Business Risk Profile . . . 4

2.2.1 Country Risk . . . 5

2.2.2 Industry Risk . . . 6

2.2.3 Company specific Risk . . . 6

2.3 Financial Risk Profile . . . 7

2.4 The Anchor . . . 9

2.5 The Stand-Alone Credit Profile . . . 10

Chapter 3. The Dataset . . . 11

3.1 Industries and Peer Groups . . . 11

3.2 Ratings . . . 13

3.3 Financial variables . . . 15

3.4 Board and gender data . . . 16

Chapter 4. Model specification and estimation . . . 19

4.1 Binary logistic regression . . . 19

4.2 Multinomial logistic regression . . . 21

4.3 Hypothesis tests . . . 23

4.3.1 Wald test . . . 23

4.3.2 Likelihood ratio test . . . 24

Chapter 5. Data Analysis . . . 25

5.1 Multicollinearity . . . 26

5.2 Correlation matrix . . . 27

5.3 Variance Inflation Factor . . . 29

5.4 Principal Component Analysis . . . 32

5.4.1 Definition 1: Unsupervised learning methods . . . 32

5.4.2 Definition 2: Principal component analysis . . . 32

5.4.3 Principal components computation . . . 33

5.5 Outliers detection . . . 41

5.6 Linearity Assumption . . . 43

Chapter 6. Models selection . . . 45

6.1 Overview . . . 45

6.2 Binary logistic regression . . . 45

6.2.1 Step 1: Univariate analysis . . . 45

6.2.2 Step 2: Multivariate analysis . . . 46

6.2.3 Step 3: Preliminary model . . . 48

6.2.4 Step 4: Final model . . . 51

6.2.5 Goodness of fit . . . 55

6.2.6 Inference . . . 57

6.2.7 Board and gender section . . . 60

6.2.8 Results . . . 63

6.3 Multinomial logistic regression . . . 67

6.3.1 Results . . . 69

Chapter 7. Conclusions . . . 71

References . . . 73

Data Sources . . . 74

Appendix A . . . 75

Appendix B . . . 77


Chapter 1. Introduction

Credit Rating Agencies are the most reliable and accurate institutions for assessing the default risk and financial strength of companies, since their ratings rely on information that is not publicly available. Rating agencies analyse and assess a firm's ability to meet principal and interest payments on its debts, and their ratings are then used by market participants to make investment decisions. On the corporate side, companies must rely on rating agencies for bond issuance because their debt needs to be rated. According to the data provided by the Securities and Exchange Commission in the "Annual Report on Nationally Recognized Statistical Rating Organizations" for the year 2019, there are three main internationally recognized rating agencies, which control 95.3% of the entire rating business: Standard and Poor's, Moody's and Fitch. In this work, we focus on Standard and Poor's as the leading agency because of its dominant position in the rating business, with a 49.5% market share.

The main advantage of obtaining a credit rating, for issuers, is having a certified independent opinion about their creditworthiness. Through credit ratings, companies can more easily attract investors from capital and money markets and raise money through a bond issuance instead of taking loans from a bank. Further, a good credit rating reduces a firm's cost of debt and has an impact on management's capital structure decisions. However, obtaining and maintaining a credit rating is an extremely expensive and intrusive process (Langohr, 2010). Hence, only companies that are frequent bond issuers rely on credit rating agencies nowadays, and there is accordingly a huge number of unrated companies. The choice to focus on the technology sector is motivated by its exponential growth in recent years and the significant impact it has in generating economic activity. According to a research report by CompTia (2020), the overall economic activity generated in the United States in 2019 was approximately 18.8 trillion dollars, of which 1.8 trillion dollars were generated by the tech sector. Thus, it accounts for 10% of total economic value, without considering indirect benefits for the overall economy, making it the third largest industry in the U.S. after manufacturing and government.


The purpose of this research is to identify the influential factors for corporate credit ratings within the technology sector in the United States and to build a model that helps unrated companies reach a preliminary rating outlook, in order to decide whether starting a rating process could be worthwhile for accessing capital markets. Following this purpose, a dataset of cross-sectional financial and market-based variables has been collected. A preliminary analysis of the collected data then led to the selection of two econometric models, a binary logistic regression and a multinomial logistic regression. Although the core models rely on financial and market-based variables, an additional model takes into account gender and board data to discover whether these factors contribute to better inferences about rating assignments. The identified influential variables are furthermore related to the corporate rating methodology adopted by Standard and Poor's, in order to draw conclusions in light of the general rating assignment process and to check whether they are in line with the financial literature.

The thesis has the following structure: chapter 2 is devoted to a literature review of the rating assignment process; chapter 3 presents the dataset construction and description; chapter 4 explains the models' theory and hypothesis tests; chapter 5 concerns data analysis; chapter 6 implements the logistic regression approach in order to identify influential variables; the last chapter is dedicated to conclusions and final remarks.


Chapter 2. The Rating Process

2.1 Introduction to credit ratings methodology

Focusing on the corporate methodology framework for credit rating assignment implemented by Standard and Poor's, it is possible to identify a standard procedure for all companies across different sectors and industries. The aim is to provide Stand-Alone Credit Profiles (SACP) and Issuer Credit Ratings (ICR). These two "ratings" differ substantially; indeed, following the definition provided by Standard & Poor's: "The SACP is S&P Global Ratings' opinion of an issuer's creditworthiness in the absence of extraordinary support or burden. It incorporates direct support already committed and the influence of ongoing interactions with the issuer's group and/or government. Therefore, the SACP differs from the ICR in that it does not include potential future extraordinary support from a group or government, during a period of credit stress for the issuer, except if that support is systemwide." To summarize, the SACP is a preliminary rating that concerns only the company as a stand-alone entity, without considering extraordinary support from other institutions, for instance governments. Once the credit rating agency has determined this preliminary rating, it incorporates extraordinary support in order to provide the main credit rating, also called the Issuer Credit Rating, which is the one available to market participants. Credit Rating Agencies assign outlooks and CreditWatch placements only to the ICR. Therefore, when we try to identify the determinants of credit ratings using qualitative and quantitative variables that are specific to a company, we are making inferences more on the SACP than on the ICR. Nevertheless, the credit rating assessment procedure is more complicated than a mere analysis of a firm's financial metrics and ratios.


Figure 2.1 Corporate Criteria Framework, Standard & Poor’s

The common framework for the determination of corporate credit ratings starts with the evaluation of two principal risk profiles of the firm, the business risk profile and the financial risk profile. They are evaluated separately at a first stage, then they are joined together to reach a preliminary assessment of the company's rating called the anchor. The anchor is further adjusted by additional analyses, called modifiers, to provide the Stand-Alone Credit Profile, which then leads to the Issuer Credit Rating after adjustments for extraordinary support, as explained before. The full procedure is illustrated in Figure 2.1. CRAs share the opinion that the business risk profile has a more significant impact on the ultimate rating conclusion for investment-grade companies, while the financial risk profile is much more important for speculative-grade companies. In the next sections, we explain in more detail the different stages of the process that leads to the Stand-Alone Credit Profile of a company, following the corporate methodology developed by Standard and Poor's.

2.2 Business Risk Profile

Companies compete within different sectors and within different countries, hence their business risk profile can be divided into nested categories of risk: a company-specific risk within the industry risk, within the country risk. Before giving a brief explanation of each risk category, the importance of the business risk profile evaluation in ratings assignment is emphasized by the following statement: "Credit ratings often are identified with financial analysis, and especially ratios. But it is critical to realize that ratings analysis starts with the assessment of the business and competitive profile of the company. Two companies with identical financial metrics are rated very differently, to the extent that their business challenges and prospects differ." (Standard & Poor's, 2006, p.19). Usually, the country risk and industry risk assessments are combined to form what is known as the issuer's Corporate Industry and Country Risk Assessment (CICRA). Then, the CICRA and the Company Specific Risk assessment are combined to reach the business risk profile of a company, as shown in Table 2.1.

Table 2.1 Business Risk Profile Assessment, Standard & Poor's

2.2.1 Country Risk

Country Risk is defined as the risk which addresses the economic risk, institutional and governance effectiveness risk, financial system risk, and payment culture or rule of law risk in the countries in which a company operates (Standard & Poor's, 2013). Following this definition, country risk plays a fundamental role in the range of potential ratings that companies in a specific country can achieve. For instance, the ratings of two well-performing companies with the same financial metrics and growth prospects can be different depending on the country in which they operate. Indeed, not all countries start with the maximum rating, AAA. In this research, we are dealing with companies that all operate in one country, the United States, hence country risk is not an issue for us. Further, a study on country risk assessments conducted by Coface in the first quarter of 2020 assigned a low country risk profile to the United States even during the pandemic outbreak, with a valuation of A2 on an eight-level ranking that ranges from A1 to E. The low country risk of the United States is also reflected in the fact that the only two companies left in the world with the highest credit rating from Standard & Poor's, Johnson & Johnson and Microsoft Corporation, both reside in the US. The country risk assessment is classified with six appraisals, from very low risk to very high risk.

2.2.2 Industry Risk

Industry Risk refers to the market in which the company operates. Credit rating agencies study the trends and competition of a specific industry to reach a broad understanding of its quality, because they are particularly interested in the pricing power that companies in the industry have. Following their conclusions, CRAs classify the industry as a growth, mature, niche, global, or cyclical sector. In the same way as for country risk, industry risk can put limitations on the credit quality that a company can obtain. Regarding this research, the companies selected are all within the technology sector, although they belong to different industries. The industry risk assessment is classified with six appraisals, from very low risk to very high risk.

2.2.3 Company specific risk

Company specific Risk refers to the competitive position that a company has when compared with all the companies in an industry, thus a comparison with its competitors. The strengths of the company are evaluated by looking at the trends in market share, product and sales diversity, sales growth, and pricing power relative to the competition. A final, but essential, consideration involves the company's management. Thus, the company specific risk analysis made by credit analysts is a combination of quantitative and qualitative factors which assigns the company a valuation in the range from excellent to vulnerable.

2.3 Financial Risk Profile

To evaluate a company's financial risk profile, Standard and Poor's utilise criteria that are based mostly on present and future cash flow prospects and on leverage. Specific ratios and financial metrics are chosen based on a first classification: there are companies with intermediate or stronger cash flow/leverage assessments and companies with weaker cash flow/leverage assessments. This initial classification is provided by two core ratios, funds from operations (FFO) to debt and debt to EBITDA, which are then compared against specific benchmarks for standard, medium and low volatility industries. In order to refine or confirm the preliminary cash flow/leverage assessment, the criteria utilise one or more supplemental ratios. There are generally five standard supplemental ratios: three payback ratios and two coverage ratios. The payback ratios are Cash Flow from Operations (CFO) to debt, Free Operating Cash Flow (FOCF) to debt and Discretionary Cash Flow (DCF) to debt, while the two coverage ratios are FFO plus cash interest paid to cash interest paid (FFO cash interest cover) and EBITDA to interest.

Therefore, following Standard and Poor's (2013), starting from the initial classification based on the core ratios, supplemental coverage ratios are of greater importance for weaker companies while supplemental payback ratios are of greater importance for intermediate or stronger companies. As a matter of fact, for companies with stronger cash flow assessments, CRAs need to measure the company's capacity and ability to repay its obligations, while for weaker companies CRAs need to measure the ability to pay obligations using cash earnings and the cushion a company possesses to withstand periods of financial stress.

Time horizon is another important feature when credit analysts deal with credit ratios. "A company's credit ratios may vary, often materially, over time due to economic, competitive, technological, or investment cycles, the life stage of the company, and corporate or strategic actions." (Standard and Poor's, 2013, p.31). Hence, Standard and Poor's procedure is to weight a time series of credit ratios differently, according to transformational events. Any event that could cause a significant change in the financial profile of a company is a transformational event: examples are mergers, acquisitions and management changes. Regarding the choice of time horizon, a rule of thumb is to consider a time series of five years divided as follows: two previous years, the current year and two forecasted years, with relative weights of 10%, 15%, 25%, 25% and 25%. Hence, more importance is given to the forecasted years than to past years. Of course, the number of years and the weights can vary in specific cases and when accounting for transformational events.
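As a purely illustrative sketch of this weighting scheme, the weighted ratio can be computed as follows; the FFO/debt values below are invented numbers, not data from the thesis.

```python
# Illustrative weighting of a five-year FFO/debt series with the indicative
# 10/15/25/25/25 weights described above; the ratio values are made up.
ffo_to_debt = [0.32, 0.35, 0.38, 0.40, 0.42]   # t-2, t-1, current, t+1, t+2 (forecasts)
weights     = [0.10, 0.15, 0.25, 0.25, 0.25]

weighted_ratio = sum(w * r for w, r in zip(weights, ffo_to_debt))
print(weighted_ratio)   # ~0.38, the figure compared against the benchmark tables
```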

The preliminary financial risk profile is then assessed by combining the relevant factors described above, thus the computation of core and supplemental ratios according to the time horizon features and their comparison with benchmark tables. Furthermore, the final cash flow assessment is determined by integrating volatility adjustments. Accordingly, companies are finally classified as stable, volatile or highly volatile.

- Companies are classified as stable when they are expected to rise by at least one category during periods of stress, based on their business risk profile. The final cash flow assessment remains stable.

- Companies are classified as volatile when they are expected to move down by one or two categories during periods of stress, based on their business risk profile, or equivalently when a decline of 30% in EBITDA from the current level is expected. The final cash flow assessment is modified to be one category weaker.

- Companies are classified as highly volatile when they are expected to move down by two or three categories during periods of stress, based on their business risk profile, or equivalently when a decline of 50% in EBITDA from the current level is expected. The final cash flow assessment is modified to be two categories weaker.

In the end, each company is assigned a value from one to six which represents its final cash flow assessment, hence its financial risk profile. Looking at the bounds of this cash flow assessment range, a value of 1 means that the company's leverage is minimal, while a value of 6 means that the company is highly leveraged.

2.4 The Anchor

The anchor is the combination of the business risk profile with the financial risk profile and provides the initial outline of the rating, before the adjustments applied by the modifiers in a subsequent analysis. It is worth noting from Table 2.2 that the notches provided by the anchor are never lower than b-; indeed, Standard and Poor's apply different criteria when they need to assign ratings equal to or below CCC+. In that case, the relevant reference is the paper "Criteria for Assigning 'CCC+', 'CCC', 'CCC-', And 'CC' Ratings," published Oct. 1, 2012 by Standard and Poor's. Indeed, ratings below CCC+ are assigned to companies which have announced that they will miss payments on interest and/or principal, or their intention to file a bankruptcy petition. Therefore, the timeframe of default, usually certain for these companies, and the degree of financial stress are the main assessments for credit rating assignment in this specific case.


2.5 The Stand-Alone Credit Profile (SACP)

Once the anchor is determined, Standard and Poor's applies modifiers which can increase, decrease or maintain the notches of the anchor in order to reach the Stand-Alone Credit Profile of a company, the stage before the final corporate credit rating, also called the ICR, is provided. There are six possible modifiers, and we briefly explain each of them. The diversification/portfolio effect modifier identifies the benefits of diversification in the business lines of a company, meaning that more than one earnings stream decreases the likelihood of default during periods of financial stress. The capital structure modifier aims to capture risks which did not arise from the cash flow and leverage assessment, thus related to mismatches between payments and sources of financing, which can be compounded by interest rate risk and currency risk. The financial policy modifier concerns short-term and long-term financial policies of the firm which can increase default risk. The liquidity modifier is focused on a qualitative analysis regarding such factors as the ability to react to different events, bank relationships and the degree of prudence of the company's financial risk management. The management and governance modifier assesses the company's organization and the management effectiveness that can shape the competitiveness of the company within its industry. The comparable rating analysis is the last modifier and is a holistic review of the company's characteristics in an aggregate way.


Chapter 3. The Dataset

Following the purpose of this thesis, 32 financial variables and other qualitative variables have been collected for 131 listed US companies belonging to the technology sector. The variables have been collected from Reuters for the year 2019, while each firm's Long-Term Debt Credit Rating has been collected from Standard & Poor's over the same time horizon.

In order to gather a sufficient number of companies within the technology sector, the "MSCI US InfoTech Index" was selected as the best source. Indeed, this index is designed to capture mid- and large-cap companies classified in the Information Technology sector and headquartered in the United States. The index contains a total of 317 companies, but only approximately 131 of them are rated. The large number of unrated companies is further evidence of how expensive obtaining and/or maintaining a rating is, as well as of the lack of information for market participants that arises from this situation.

For the sake of having a clear view of the several variables collected and of the composition of the dataset, this chapter is divided into four sections that explain and analyse different groups of features. The analysis of each group will be useful when dealing with outputs and conclusions after the implementation of the logistic regression models. The four sections concerning our dataset are: industries, ratings, financial variables, and board and gender data.

3.1 Industries and Peer Groups

The research is focused on the technology sector of the United States, but within this sector we can split companies into different industries of interest. Indeed, for each company we have collected the respective industry and the respective peer group. We introduce the definitions of industry and peer group in order to understand the slight differences between these two concepts. An industry is a group of companies that are related based on their primary business activities as well as their main source of revenue. In modern economies, there are dozens of industry classifications, which are typically grouped into larger categories called sectors. The main industry classification for the United States is provided by the Standard Industrial Classification (SIC). A peer group is another way to classify companies and is defined as a set of companies that compete in a similar business area and are of similar size. Looking at the previous definitions, we can conclude that industries and peer groups are two different classifications which share the feature of being subcategories of a sector.

Accordingly, we have created a pie chart for each subcategory, as shown in Figure 3.1. Regarding the industry classification, the two main industries in our dataset are "IT Services" and "Software", with equal weight, while for the peer group classification the main category is "business and consumer services", followed by "software" and "semiconductors".


3.2 Ratings

Looking at the ratings of the U.S. listed companies within the technology sector (Table 3.1), ratings range from AAA to CCC-, in line with the discussion in the previous chapter about the low country risk of the United States, which permits a full range of ratings without restrictions. Following the criterion, generally adopted by investors, of splitting companies' ratings into two categories, investment grade and speculative grade, we first grouped the ratings into a new dichotomous output variable. Hence, we assigned 𝑌 = 1 to ratings that range from AAA to BBB- and 𝑌 = 0 to the remaining ratings, which range from BB+ to CCC-. It is worth noting that after grouping companies into these two outcomes, the result is almost perfectly balanced between them: 66 companies are classified as investment grade while 65 companies are classified as speculative grade. Therefore, the most suitable econometric model for this new dependent variable is a binary logistic regression.

Afterwards, in order to better represent the ratings' heterogeneity, we decided to group companies' ratings into three categories. Under 𝑌 = 0 we grouped ratings from BB+ to CCC-, under 𝑌 = 1 ratings from BBB+ to BBB-, and 𝑌 = 2 is formed by companies with ratings from AAA to A-. This classification follows a different rationale, concerning mainly the quality of ratings: 𝑌 = 0 is formed by speculative and highly speculative grade companies, 𝑌 = 1 by lower-medium grade companies, and 𝑌 = 2 by prime and high grade companies. Instead of a balanced division across the three categories, approximately 50% of the observations fall under the speculative outcome. In this case, the most suitable econometric model for a dependent variable with three levels is the multinomial logistic regression. The pie chart in Figure 3.2 shows the complete division of ratings for the multinomial logistic regression.
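A minimal sketch of these two groupings in Python follows; the DataFrame and the column name "Rating" are assumptions made purely for illustration, not the thesis code.

```python
import pandas as pd

investment_grade = ["AAA", "AA+", "AA", "AA-", "A+", "A", "A-", "BBB+", "BBB", "BBB-"]
prime_to_a       = ["AAA", "AA+", "AA", "AA-", "A+", "A", "A-"]
bbb_band         = ["BBB+", "BBB", "BBB-"]

def binary_class(rating: str) -> int:
    # Y = 1 for investment grade (AAA to BBB-), Y = 0 for speculative grade
    return 1 if rating in investment_grade else 0

def three_class(rating: str) -> int:
    # Y = 2 for AAA to A-, Y = 1 for the BBB band, Y = 0 below BBB-
    if rating in prime_to_a:
        return 2
    if rating in bbb_band:
        return 1
    return 0

# Illustrative usage on a few made-up ratings
df = pd.DataFrame({"Rating": ["AA-", "BBB", "BB+", "B-"]})
df["Y_binary"] = df["Rating"].map(binary_class)
df["Y_multi"]  = df["Rating"].map(three_class)
print(df)
```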


Figure 3.2 Ratings classification for multinomial logit

RATING  COUNT
AAA     1
AA+     1
AA      2
AA-     3
A+      5
A       3
A-      8
BBB+    11
BBB     17
BBB-    15
BB+     17
BB      18
BB-     10
B+      6
B       7
B-      10
CCC-    2


3.3 Financial Variables

Following the classification adopted by Reuters, we collected 32 variables, including financial metrics and ratios, from Reuters and Yahoo Finance. All data are available to the public, in line with the purpose of creating a model that can be useful not only to unrated companies seeking a general idea of their preliminary rating outlook but also to investors who want to know the default risk of a specific company in the US technology sector. The full list of variables and their associated categories is presented in Table 3.2.

FINANCIALS:
- Non-Current Liabilities to Liabilities
- Current Liabilities to Liabilities
- Long-Debt to Equity
- Debt to Equity
- Debt Ratio
- Goodwill to Assets
- PP&E to Assets
- Current Ratio
- Liabilities to Assets
- Cash and short-term investments to Assets
- Equity to Assets
- Goodwill and Intangibles to Assets
- Cash Ratio
- Quick Ratio
- Operating Income to Net Sales
- Retained Earnings to Assets
- Account Receivables to Revenue
- Net Debt

MARKET-BASED:
- Market Capitalisation
- Beta
- EPS

FINANCIAL STRENGTH AND MARGINS:
- Free Cash Flow
- Gross Margin
- Net Profit Margin
- Operating Margin
- Pre-tax Margin

MANAGEMENT EFFECTIVENESS:
- ROA
- ROE
- ROI
- Asset Turnover
- Revenue to Assets
- Total Assets


3.4 Board and gender data

Nowadays the gender gap has become a relevant issue because, historically, almost all senior positions in companies were occupied by men. Over the years, many companies have introduced incentives to increase the presence of women in leading positions. In this research, board and gender data have been collected for each company to understand the gender gap situation and whether these data can contribute positively to ratings assignment, as will be evaluated in Chapter 6.

Firstly, we focus on the percentage of women on companies' boards. The average percentage of women on boards is 18%. Using this value as a benchmark, we classified the presence of women on companies' boards as shown in Table 3.3. Since we are interested in investigating significant differences between investment-grade firms (𝑌 = 1) and speculative-grade firms (𝑌 = 0), we also grouped companies into these two categories. At first glance, we cannot notice any evidence of a relationship between the number of women and the rating class, because the frequencies are approximately the same. Nonetheless, the number of women on boards is still particularly small compared to the number of men: 86% of boards have a percentage of women below 30%, and 47% of boards, basically half of the entire dataset, have less than 18% of women.

Table 3.3 Companies classification for board’s women percentage

The next step is to analyse the gender gap in relevant company positions. Hence, we collected data on the gender of the Chairman, the Chief Executive Officer (CEO) and the Chief Financial Officer (CFO) of each company. As shown in Figure 3.3, the number of women who occupy these roles is extremely small. This situation could be influenced by the fact that we are dealing with the technology sector, where historically the prevalent gender is male.


Chapter 4. Model specification and estimation

4.1 Binary Logistic Regression

As explained in the previous section, the outcome of interest can be expressed as a dichotomous variable that takes two possible categories, investment grade and speculative grade. Because we handle a discrete dependent variable, in particular a dichotomous (binary) one, the most suitable model is the logistic regression. Furthermore, since we have more than one explanatory variable, the model we implement is a multivariate logistic regression. The logistic regression assumes a logistic distribution, and there are two significant reasons to use it. First, from a mathematical point of view, it is an extremely flexible and easily used function. Second, its model parameters provide the basis for meaningful estimates of economic effects (Hosmer et al., 2013). Starting with the assumption that we have a collection of 𝑝 independent variables denoted by the vector 𝒙 = (x_1, x_2, ..., x_p) and that the conditional probability of an outcome belonging to the investment grade category is denoted by Pr(𝑌 = 1|𝒙) = 𝜋(𝒙), the specific form of the multiple logistic regression model is given by

$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}}$$

where 𝑔(𝑥) is the logit transformation function of a logistic regression model and is defined by the following equation,

$$g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p .$$

Because in some cases we do not deal only with continuous or interval-scaled variables but also with discrete variables transformed into design variables (dummy variables), the logit function can be rewritten as

$$g(x) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{l=1}^{k_j - 1} \beta_{jl} D_{jl} + \cdots + \beta_p x_p$$

In this equation, the $D_{jl}$ represent the $k_j - 1$ design variables used when the $j$-th independent variable is categorical with $k_j$ levels. One level is dropped because, unless stated otherwise, the model already includes a constant term.

Furthermore, in order to estimate the parameters of the logit function, the generally adopted method is maximum likelihood. A likelihood function is constructed to express the probability of the observed data as a function of the unknown parameters. We are then able to obtain maximum likelihood estimators of the parameters, which are the values that maximize the likelihood function. In other words, the estimated coefficients computed by this method maximize the probability of obtaining the observed set of data.

The logit transformation is of central importance in the study of logistic regression. Firstly, 𝑔(𝑥) has many of the desirable properties found in linear regression. As described by Hosmer et al. (2013), the logit transformation is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of 𝑥. Hence, thanks to these properties, we can apply the same principles used in linear regression analysis to logistic regression. Further, regarding the conditional distribution of the outcome variable, we may express the outcome given a vector of independent variables 𝒙 = (x_1, x_2, ..., x_p) as 𝑦 = 𝜋(𝒙) + 𝜀, where 𝜀 may assume one of two possible values. If 𝑦 = 1 then 𝜀 = 1 − 𝜋(𝒙) with probability 𝜋(𝒙), and if 𝑦 = 0 then 𝜀 = −𝜋(𝒙) with probability 1 − 𝜋(𝒙). Instead of having a normal distribution with zero mean and constant variance as in the linear regression case, 𝜀 has a distribution with mean zero and variance equal to 𝜋(𝒙)[1 − 𝜋(𝒙)]. Therefore, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, 𝜋(𝒙).
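As a hedged illustration of how such a model is estimated by maximum likelihood in Python with StatsModels (the package used later in this thesis), the following sketch fits a binary logit on simulated data; the covariates and coefficients are invented for the example and do not reproduce the thesis estimates.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a small dataset: three illustrative covariates and a binary outcome
# generated from the logistic model pi(x) = e^g(x) / (1 + e^g(x)).
rng = np.random.default_rng(0)
X = rng.normal(size=(131, 3))
g = 0.2 + X @ np.array([0.5, 1.0, -1.5])      # g(x) = beta_0 + beta_1 x_1 + ...
y = rng.binomial(1, 1 / (1 + np.exp(-g)))

# Fit by maximum likelihood; add_constant() introduces the intercept beta_0.
result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(result.summary())                       # coefficients, Wald z-statistics, p-values
```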


4.2 Multinomial Logistic Regression

When we deal with a categorical dependent variable that has more than two levels, the model of choice is the multinomial logistic regression. As with the binary logistic regression, we need to assign integer values, starting from zero, to each nominal outcome. In this research, for the multinomial model, we split the companies' ratings into three groups in order to better represent the heterogeneity of ratings. Therefore, we assign 𝑌 = 2 to ratings that range from AAA to A-, 𝑌 = 1 to ratings that range from BBB+ to BBB-, and 𝑌 = 0 to all ratings below BBB-. Another difference from the binary logistic regression is that in this case we need to compute not one but two logit functions and to decide which outcome category to use as the reference group. Once the baseline outcome category is chosen, in this case the one under 𝑌 = 0, we compare every other category to it. Assuming a vector 𝒙 of length p + 1, because we have a constant term and p covariates, we denote the two logit functions in the following way

$$g_1(x) = \ln\!\left(\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)}\right) = \beta_{10} + \beta_{11} x_1 + \cdots + \beta_{1p} x_p = x'\beta_1$$

and

$$g_2(x) = \ln\!\left(\frac{\Pr(Y = 2 \mid x)}{\Pr(Y = 0 \mid x)}\right) = \beta_{20} + \beta_{21} x_1 + \cdots + \beta_{2p} x_p = x'\beta_2 .$$

Using the two logit functions we can then compute conditional probabilities for the three outcome categories in the following way

$$\pi_0(x) = \Pr(Y = 0 \mid x) = \frac{1}{1 + e^{g_1(x)} + e^{g_2(x)}}$$

$$\pi_1(x) = \Pr(Y = 1 \mid x) = \frac{e^{g_1(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}$$

$$\pi_2(x) = \Pr(Y = 2 \mid x) = \frac{e^{g_2(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}$$


Then, also in the multinomial logistic regression we estimate the parameters of the logit functions for each covariate using the maximum likelihood method.

The interpretation of a fitted logistic model is not as straightforward as in the linear regression case. The sign of an estimated coefficient gives the direction of the effect of an explanatory variable, holding the others constant, on the likelihood of the outcome occurring in one category instead of the reference category. However, a more exhaustive interpretation of the estimated coefficients is given by the odds ratio. The odds ratio is a measure of association that approximates how much more likely or unlikely it is for the outcome to occur in a specified rating category rather than in the baseline category, 𝑌 = 0, for a unit change in the value of the independent variable at issue, assuming a ceteris paribus condition.

Regarding the computation of the odds ratio, we generalize the formula following the notation of binary logistic regression. In the formula below, the subscript j represents which outcome category is compared to the reference outcome.

$$OR_j(a, b) = \frac{\Pr(Y = j \mid x = a) \,/\, \Pr(Y = 0 \mid x = a)}{\Pr(Y = j \mid x = b) \,/\, \Pr(Y = 0 \mid x = b)}$$

We perform the binary logistic regression using the StatsModels package in Python; however, fitting the multinomial logistic regression in the same framework is not as straightforward. As an alternative, we fit two separate binary logistic regressions because, as proposed by Begg and Gray (1984), this is a good approach for approximating the multinomial logistic model. Therefore, we first fit a binary logit for 𝑌 = 1 against 𝑌 = 0 and then we fit 𝑌 = 2 against 𝑌 = 0. "Begg and Gray show that the estimates of the logistic regression coefficients obtained in this manner are consistent, and under many circumstances the loss in efficiency is not too great." (Hosmer et al., 2013, p.282)
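A minimal sketch of this individualized-logit approach is given below, on simulated data; the variable names and the simulated ratings are assumptions for illustration only, not the thesis code or results.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-ins for the thesis data: two covariates and a three-level outcome.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(131, 2)), columns=["ratio_a", "ratio_b"])
y = pd.Series(rng.integers(0, 3, size=131))          # rating classes 0, 1, 2

def fit_vs_baseline(category: int):
    # Keep the baseline (Y = 0) and the category of interest, then fit a binary logit,
    # following the Begg and Gray (1984) approximation of the multinomial model.
    mask = y.isin([0, category])
    y_bin = (y[mask] == category).astype(int)
    return sm.Logit(y_bin, sm.add_constant(X[mask])).fit(disp=0)

logit_1 = fit_vs_baseline(1)   # approximates g1(x): Y = 1 vs Y = 0
logit_2 = fit_vs_baseline(2)   # approximates g2(x): Y = 2 vs Y = 0
print(logit_1.params, logit_2.params, sep="\n")
```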


4.3 Hypothesis tests

The maximum likelihood method provides an estimated coefficient for each explanatory variable in a fitted logistic regression. However, to identify the most influential variables for credit ratings, we need to test the statistical significance of each regressor in order to obtain a restricted set of variables which best serves the purpose of this research. Therefore, during model selection in Chapter 6, we perform two hypothesis tests. They share the common aim of investigating whether, under the null hypothesis H_0, one or more regressors' coefficients are zero against the alternative hypothesis H_1 of being statistically significant, and thus labelled as influential factors. The tests are needed to decide which covariates are not significant in univariate or multivariate models (Wald test) and to decide which of two models is the most parsimonious (likelihood ratio test). We now describe them from a theoretical point of view.

4.3.1 Wald Test

The Wald test assesses the statistical significance of the estimated coefficients, i.e., whether a coefficient differs from zero in a fitted logistic regression. Regarding the significance of a single coefficient, under the null hypothesis the coefficient is zero, against the alternative of it being different from zero. The Wald statistic is the ratio between the maximum likelihood estimate of the variable's coefficient (slope) in the model and its estimated standard error. The formula is the following:

$$W = \frac{\hat{\beta}}{\widehat{SE}(\hat{\beta})}$$

Under the null hypothesis, W approximately follows a standard normal distribution (equivalently, W² follows a chi-square distribution with one degree of freedom). Therefore, for p-values lower than 0.05, at the 5% significance level, the null hypothesis is rejected in favour of the alternative.
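As a small worked illustration, the coefficient estimate and standard error below are invented numbers, not results from the thesis:

```python
from scipy import stats

# Wald statistic for a single coefficient: W = beta_hat / SE(beta_hat).
beta_hat, se = 0.84, 0.31                       # assumed illustrative values
W = beta_hat / se
p_normal = 2 * (1 - stats.norm.cdf(abs(W)))     # two-sided p-value from the standard normal
p_chi2 = 1 - stats.chi2.cdf(W**2, df=1)         # equivalent chi-square(1) p-value for W^2
print(round(W, 2), round(p_normal, 4), round(p_chi2, 4))   # ~2.71, ~0.0067, ~0.0067
```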


4.3.2 Likelihood Ratio Test

The likelihood ratio test is a hypothesis test for the comparison of two nested models, used to assess whether adding or removing some independent variables leads to a significantly better model in terms of log-likelihood. We calculate the statistic G in this way,

$$G = -2 \ln\!\left(\frac{\text{likelihood } M_1}{\text{likelihood } M_2}\right) = 2\,[\,\text{loglik}(M_2) - \text{loglik}(M_1)\,].$$

In this equation, M_1 represents the model with fewer explanatory variables while M_2 represents the one with more variables. Under the null hypothesis that the coefficients of the independent variables not present in M_1 are zero, the statistic G follows a chi-square distribution with n degrees of freedom, where n is the difference in the number of parameters between the two logistic models. Therefore, if the p-value is less than 0.05, at the 5% significance level, the model with more predictors fits better than the nested one.
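A hedged sketch of this comparison using StatsModels fit results follows; "reduced_fit" and "full_fit" are assumed names for two nested sm.Logit(...).fit() results, not objects defined in the thesis.

```python
from scipy import stats

def likelihood_ratio_test(reduced_fit, full_fit):
    """Compare two nested StatsModels logit fits: G = 2[loglik(M2) - loglik(M1)]."""
    G = 2 * (full_fit.llf - reduced_fit.llf)            # llf is the maximized log-likelihood
    df = full_fit.df_model - reduced_fit.df_model       # extra parameters in the full model
    p_value = 1 - stats.chi2.cdf(G, df)
    return G, p_value

# Example call on two previously fitted nested models:
# G, p = likelihood_ratio_test(reduced_fit, full_fit)
```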


Chapter 5. Data Analysis

Before using a logistic regression model, a fundamental preliminary step is to check the data, because we need to be sure that they satisfy four assumptions, which are the same for a binary and a multinomial logit model. These assumptions are required in order to be confident that the logistic model we are going to implement will give us valid results. The four assumptions are now presented and briefly discussed.

- Assumption 1: The dependent variable is categorical, and we assign discrete numbers, starting from zero, to each unique category. The categories of the dependent variable must be mutually exclusive and exhaustive.

- Assumption 2: One or more independent variables are needed. These variables can be either continuous or categorical. Continuous variables are numeric variables that can take an infinite number of values within an interval, while categorical variables contain a finite number of categories or distinct items, which might not have a logical order.

- Assumption 3: Independence of observations. More generally, we need to avoid multicollinearity among the data.

- Assumption 4: The linearity assumption, which implies a linear relationship between any continuous independent variables and the logit transformation function.

Luckily, we can set aside the first and second assumptions in our preliminary analysis, since the dataset collected consists of a categorical dependent variable, which represents the different rating classes, and a set of independent variables that are all continuous. Therefore, the next sections of this chapter focus on the third and fourth assumptions, as well as on outlier detection, which is an important feature of every econometric model. Our strategy is to delete, at each step, the set of regressors which is found not to satisfy the assumptions and is therefore useless for the purposes of the research.

5.1 Multicollinearity

Starting from the correlation matrix as a reference point, it is recommended to investigate whether multicollinearity is an issue in the data before performing any multiple logistic regression. We start with a definition of multicollinearity. "In general, the term multicollinearity is used to describe the problem when an approximate linear relationship among the explanatory variables leads to unreliable regression estimates. This approximate relationship is not restricted to two variables but can involve more or even all regressors" (Verbeek, 2012, p.81). Intuitively, when we deal with a large dataset containing several covariates, it is possible that many of them are highly correlated with each other. Turning back to the definition of multicollinearity, it can affect the reliability of the estimates and imply higher variances, which are likely to affect confidence intervals and hypothesis tests when performing a multiple logistic regression (Hoerl and Kennard, 1970). Furthermore, with highly correlated covariates we face another problem: it becomes difficult to identify which regressor has the higher significance and the higher impact on the output of a regression. For instance, assume that we have just two covariates, ROE and ROA, and that, hypothetically, they have a correlation coefficient greater than 0.9. Once we have performed a significant regression on these two variables, it becomes difficult to determine which of them plays the most important role in the estimation of the output. This is a significant reason why we need to avoid the multicollinearity problem in order to reach reliable conclusions. According to Verbeek (2012), we have different ways to detect and avoid multicollinearity. The first one is to compute the correlation matrix and then delete one of each pair of correlated variables, especially among covariates which present perfect multicollinearity. Exact multicollinearity occurs when one explanatory variable is an exact linear combination of one or more other explanatory variables (including the intercept); in that case the OLS estimator is not uniquely defined from the first-order conditions of the least squares problem, because the matrix 𝑋′𝑋 is not invertible. The second method, which is more sophisticated, is to use the Variance Inflation Factor (VIF) to detect collinearity in our set of independent variables. The Variance Inflation Factor is given by:

$$VIF(b_k) = \frac{1}{1 - R_k^2}$$

and indicates the factor by which the variance of a single coefficient, denoted by b_k, is inflated compared with the hypothetical situation in which there is no correlation between the corresponding regressor and any of the other explanatory variables. Here R_k^2 is the squared multiple correlation coefficient obtained by regressing the k-th explanatory variable on the other regressors (Verbeek, 2012). Once we have computed the VIF for each explanatory variable, a rule of thumb is that a VIF greater than 10 (meaning that R_k^2 > 0.9) is "too high", thus variables above this threshold should be removed.
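A hedged sketch of this screening with the statsmodels implementation of the VIF follows; the DataFrame X of continuous regressors is an assumed input, not an object defined in the text.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Compute VIF(b_k) = 1 / (1 - R_k^2) for every column of X."""
    X_const = sm.add_constant(X)                     # VIFs are computed with an intercept
    vifs = {col: variance_inflation_factor(X_const.values, i)
            for i, col in enumerate(X_const.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Rule of thumb used in the text: flag regressors with VIF > 10 (i.e. R_k^2 > 0.9).
# vifs = vif_table(X); high_vif = vifs[vifs > 10]
```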

Nevertheless, once multicollinearity is detected we can perform a Principal Component Analysis in order to avoid it. This process will be described and performed on specific variable categories in the section 5.4 of this chapter. Now we will focus on the correlation matrix to detect firstly exact multicollinearity and further we will compute Variance Inflation Factor on the remaining set of independent variables to identify which variable could negatively impact the logistic regression.

5.2 Correlation matrix

Looking at the correlation matrix (Figure 5.1), we found a large number of highly correlated pairs of variables. In this section we are mainly interested in those covariates for which the correlation coefficient is exactly ±1, because they represent exact multicollinearity, as discussed before. Sometimes exact multicollinearity arises not by chance but simply because some variables are mutually exclusive. Indeed, the analysis provided in this section shows a perfect correlation only for variables with mutually exclusive features.

Among all the regressors, the ones which present a perfect correlation, because they are mutually exclusive, are: "Current Liabilities" against "Non-Current Liabilities" and "Equity to Total Assets" against "Liabilities to Total Assets". At this point, the challenge was to decide which variable in each pair should be deleted from the dataset. To solve the problem, we decided to perform a univariate logistic regression of the binary dependent variable on each regressor in the list above and to look at the p-values of the Wald test statistics for the significance of the coefficients, as described in Hosmer et al. (2013). Furthermore, following the work of Mickey and Greenland (1989) on logistic regression, we should keep variables which have a p-value less than 0.25 as well as variables which are known to be "clinically important" in our research. Accordingly, we decided to keep "Current Liabilities" and "Liabilities to Total Assets"; the two independent variables "Equity to Total Assets" and "Non-Current Liabilities" were therefore deleted from the list of useful regressors for this research. Setting aside the variables mentioned above, we extract a list of variables using the same univariate logistic regression approach among all highly correlated pairs of covariates. The regressors which are most likely to be deleted to avoid multicollinearity problems in future regressions are the following: Long-Debt to Equity, ROA, ROE, Operating Margin, PreTax Margin, Quick Ratio, Current Ratio, Goodwill to Assets, Revenue to Assets, Cash Ratio, Operating Income to Sales and Free Cash Flow. However, at this point we do not delete this last set of variables, because we want to make a further analysis, through the Variance Inflation Factor, before reaching an ultimate decision.
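A hedged sketch of these two screening steps follows; the DataFrame X of regressors and the binary Series y are assumed inputs, and the 0.9 and 0.25 cut-offs simply mirror the thresholds used in the text.

```python
import pandas as pd
import statsmodels.api as sm

def correlated_pairs(X: pd.DataFrame, threshold: float = 0.9) -> list:
    """List pairs of regressors whose absolute correlation is at or above the threshold."""
    corr = X.corr()
    cols = list(corr.columns)
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if abs(corr.loc[a, b]) >= threshold]

def univariate_wald_pvalues(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit one binary logit per regressor and report the Wald p-value of its coefficient."""
    pvals = {}
    for col in X.columns:
        fit = sm.Logit(y, sm.add_constant(X[[col]])).fit(disp=0)
        pvals[col] = fit.pvalues[col]
    return pd.Series(pvals).sort_values()

# Candidates to keep: p-value < 0.25 (Mickey and Greenland, 1989).
# keep = univariate_wald_pvalues(X, y) < 0.25
```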


Figure 5.1 Heatmap of the Correlation Matrix of the original set of independent variables.

5.3 Variance Inflation Factor

In this section we perform the Variance Inflation Factor analysis on the same set of independent variables as before, in order to compare the results and finally delete the variables that could negatively affect the logistic regression due to multicollinearity. Comparing the set of variables found in the previous section with the highlighted variables in Table 5.1, all the covariates selected in the previous section have a VIF > 10, therefore we could eliminate those variables from the set of relevant features in our general analysis. There is just one exception, which is "Goodwill plus Intangibles_Ratio". This variable appears to be more inflated than "Goodwill_Ratio"; therefore, with respect to what was found in the previous section, we will delete the latter mentioned variable.


Furthermore, there are two subcategories of regressors which contain several highly correlated variables: the Management Effectiveness variables (ROA, ROE, ROI) and the Margins variables (Net Profit Margin, Operating Margin and PreTax Margin). Since these subcategories have interesting features in line with Standard and Poor's corporate methodology for credit assignments, we perform a Principal Component Analysis on them in the next section, in order to keep them as economically meaningful variables while at the same time avoiding multicollinearity.


5.4 Principal Component Analysis

This section focuses on Principal Component Analysis, also known as PCA. PCA is an unsupervised learning method that is useful to reduce a large number of features into a few principal components which gather most of the variables' information. Before computing and explaining in detail the methodology used for our variables of interest, we briefly explain what an unsupervised learning method is and provide the definition of principal component analysis from a theoretical point of view.

5.4.1 Definition 1: Unsupervised Learning Methods

Unsupervised learning is a set of statistical tools intended for the setting in which we have only a set of features X_1, X_2, ..., X_p measured on n observations. We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to investigate correlation patterns among the variables X_1, X_2, ..., X_p. Since we do not have an associated response variable Y, unsupervised methods are more challenging: there is no way to check our results, because there is no universally accepted mechanism for performing cross-validation or validating results on an independent dataset. Because of the subjectivity of the results, unsupervised learning methods are often performed as part of an exploratory data analysis.

5.4.2 Definition 2. Principal Component Analysis (PCA)

Principal component analysis (PCA) refers to the process by which principal components are computed, and to the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X_1, X_2, ..., X_p and no associated response Y. Apart from producing one or more new variables to use in supervised learning problems, as in the logistic regression we are going to perform, PCA also serves as a tool for data visualization in a preliminary stage of data analysis. To summarize, principal component analysis is a tool used for data visualization or data pre-processing before supervised techniques are applied.


5.4.3 Principal Components computation

The aim of our exploratory analysis is to use PCA in order to reduce different sets of variables to a low-dimensional representation of the data that captures as much of the information as possible. Mainly, we are interested in how much of the information is captured by the first principal component, so that we can use it as a new independent variable in the supervised methods that we are going to perform to find the main determinants of credit ratings for tech companies. The first principal component of a set of independent variables X_1, X_2, ..., X_p is the normalized linear combination of the features that has the largest variance:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$

We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component, which together form the principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})'$; "normalized linear combination" means that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

Because we are interested only in the variance, each of the variables in X has been centred to have mean zero. Then, to compute the first principal component of an n × p data set X, we have to solve the following optimization problem:

$$\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}} \;\left\{ \frac{1}{n} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Bigr)^{2} \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^{2} = 1$$

After the first principal component $Z_1$ of the features has been determined, we can find the second principal component $Z_2$. The second principal component is the linear combination of $X_1, \ldots, X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$.

The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip}$$

where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$. It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. The process can continue with the same criteria until all the regressors' information is gathered by a number of components smaller than the number of starting features.

We start by computing the principal components of the first set of highly correlated variables: ROA, ROE and ROI, which come from the "Management Effectiveness" category.

Before performing PCA, it is recommended to standardize each variable so that it has zero mean and standard deviation equal to one. We scale each variable in this way because the variables may have different scales or units, as well as large differences in variance. Since it is undesirable for the principal components to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before performing PCA. The standardization of all the data in this research, as well as the Principal Component Analysis, was performed in Python with "SkLearn", an open-source library for data analysis.
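A minimal sketch of this standardize-then-PCA step with SkLearn is shown below; the data are simulated and the three column names only mirror the "Management Effectiveness" variables, so nothing here reproduces the thesis results.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulate three strongly correlated columns standing in for ROA, ROE and ROI.
rng = np.random.default_rng(2)
base = rng.normal(size=131)
X = pd.DataFrame({"ROA": base + 0.05 * rng.normal(size=131),
                  "ROE": base + 0.05 * rng.normal(size=131),
                  "ROI": base + 0.05 * rng.normal(size=131)})

X_std = StandardScaler().fit_transform(X)       # zero mean, unit standard deviation
pca = PCA()
scores = pca.fit_transform(X_std)               # principal component scores z_im
print(pca.components_)                          # loading vectors phi_m
print(pca.explained_variance_ratio_.cumsum())   # cumulative proportion of variance explained
first_pc = scores[:, 0]                         # candidate new regressor replacing ROA/ROE/ROI
```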

Once we have computed the principal components, we can plot them against each other to produce a two-dimensional view of the data, as shown in Figure 5.2. Indeed, the biplot is a useful representation of the data contained in the first and second principal components, and it also shows how much weight each variable has on them. Further, the loading vectors (plotted in red) of each feature tell us how the variables are correlated with each other, by looking at their angles: two loading vectors are positively correlated if they are close, if they form a 90° angle they are not likely to be correlated, while if they form a 180° angle they are likely to be negatively correlated. The table of the principal component loading vectors with their relative weights is shown in Figure 5.3; there, index 0 refers to the first principal component, while index 1 refers to the second principal component.


Figure 5.2 Biplot of PCA on “Management Effectiveness” variables

Figure 5.3 Heatmap of Principal Components loading vectors weights

We are now interested in how much of the variance of the data is contained in the first principal component, in other words how much of the information of the three variables (ROA, ROE and ROI) is explained by the first principal component. More generally, we want to know the proportion of variance explained (PVE) by each principal component. We start by defining the total variance present in the data set as:

$$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^{2}$$

While the variance explained by the $m$th principal component is:

$$\frac{1}{n} \sum_{i=1}^{n} z_{im}^{2} = \frac{1}{n} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} \phi_{jm} x_{ij} \Bigr)^{2}$$

Therefore, the PVE of the $m$th principal component is given by:

$$\frac{\sum_{i=1}^{n} \bigl( \sum_{j=1}^{p} \phi_{jm} x_{ij} \bigr)^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}$$

Using any statistical software, we can quickly compute the proportion of variance explained by each principal component and then plot the cumulative explained variance, as in Figure 5.4. From this first analysis, the proportion of variance explained by the first principal component, indexed as zero in the plot, is approximately 97.5% of the total variance. This is an extremely good statistical result, which most likely comes from the high correlation among our set of variables. Thus, we can conclude that the first principal component is able to explain approximately all the information contained in the set of variables under analysis. It is worth noting, looking again at Figure 5.3, that all features have the same importance and weight in the first principal component.

Figure 5.4 The cumulative proportion of variance explained by the three principal components in the "Management Effectiveness" data

Afterwards, we perform the PCA in the same way as before on another set of variables that are highly correlated. The second set of variables comes from “Margins” category. We are focused on the following variables: Net Profit Margin, Operating Margin and PreTax Margin. Looking directly at Figure 5.7, we can see that also in this PCA, the first principal component can explain basically all the information contained in the three variables. Indeed, the proportion of variance explained is the 96.6%.

Furthermore, each variable contributes with the same weight to the variance explained by the first principal component, as shown in the first row of Figure 5.6.


Figure 5.5 Biplot of PCA on “Margins” variables

Figure 5.6 Heatmap of PC Loading Vectors of “Margins” variables


Lastly, we conduct a PCA on all the variables mentioned above together, that is, a PCA on six features. Even though the independent variables come from two different categories, we may expect the first principal component to gather most of the information, because all the variables are highly correlated, with pairwise correlations greater than 0.8. Therefore, we perform the Principal Component Analysis on ROE, ROA, ROI, Net Profit Margin, Operating Margin and PreTax Margin at the same time.

As shown in Figure 5.10, the proportion of variance explained by the first principal component is roughly 90%, a useful statistical result that allows us to condense six different features into a single variable. We name this new regressor PCA_6. Figure 5.9 again shows that each variable of this last set contributes with equal importance to explaining the 90% of total information.
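A minimal sketch of how the new regressor could be constructed is given below; the six column names are assumptions made for illustration, and df is the same DataFrame assumed in the earlier sketches.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assumed column names for the six highly correlated ratios
six = ["ROA", "ROE", "ROI",
       "Net_Profit_Margin", "Operating_Margin", "PreTax_Margin"]

X_scaled = StandardScaler().fit_transform(df[six])

# Keep only the first principal component and store it as the new regressor PCA_6
pca6 = PCA(n_components=1)
df["PCA_6"] = pca6.fit_transform(X_scaled).ravel()

print(pca6.explained_variance_ratio_)  # roughly 0.90 on the thesis data
```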


Figure 5.9 Heatmap of PC Loading Vectors of both “Management Effectiveness” and “Margins” categories

Figure 5.10 The cumulative proportion of variance explained by PCA of both “Management Effectiveness” and “Margins” categories


5.5 Outliers detection

We should check the number of outliers in each regressor, because a large number of non-standard values can negatively affect the estimation of the coefficients once the logistic regression is performed. As defined by Moore and McCabe (1999), “an outlier is an observation that lies outside the overall pattern of a distribution”. If a covariate shows a high number of outliers, it is recommended to apply a logarithmic transformation to reduce the influence of these values, since this transformation decreases the skewness of the data distribution, making the data as close to normally distributed as possible and thus reducing the overall number of outliers. Therefore, we begin by checking the number of outliers for each independent variable using the interquartile range method and the Z-score method. The interquartile range method is based on the five-number summary of a dataset. To identify outliers with this method, we subtract the first quartile from the third quartile to obtain the interquartile range (IQR). We then add 1.5 × IQR to the third quartile and subtract 1.5 × IQR from the first quartile to obtain a new range of values; values above or below this range are considered outliers. The Z-score method, on the other hand, relies on the assumption that the data follow a normal distribution. We standardize each observation using the formula $Z = (x - \mu)/\sigma$, so as to obtain a value that represents the number of standard deviations above or below the mean. To flag outliers, a standard cut-off is a Z-score of ±3 or further from zero. The results are presented in Table 5.2.

Table 5.2 Outliers detection in the set of explanatory variables
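The two detection rules can be summarized in a short sketch; numeric_cols is assumed to contain the names of the continuous covariates and df the full dataset, as in the earlier examples.

```python
import pandas as pd

def count_outliers_iqr(x: pd.Series) -> int:
    # Interquartile range rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return int(((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)).sum())

def count_outliers_zscore(x: pd.Series, cutoff: float = 3.0) -> int:
    # Z-score rule: flag values more than `cutoff` standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return int((z.abs() > cutoff).sum())

# numeric_cols is an assumed list with the names of the continuous covariates
summary = pd.DataFrame({
    "iqr_outliers": {c: count_outliers_iqr(df[c]) for c in numeric_cols},
    "zscore_outliers": {c: count_outliers_zscore(df[c]) for c in numeric_cols},
})
summary["iqr_share"] = summary["iqr_outliers"] / len(df)  # ratio used for the 10% rule
print(summary)
```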

The selection criterion was an arbitrary threshold of 10% on the ratio between the number of outliers and the total number of observations under the quantile (IQR) method, as shown in the third column of Table 5.2. Following this criterion, we decided to log-transform the following variables: PP&E_Assets, MarketCap, Assets and Net_Debt. However, since Net_Debt takes negative values and adding a constant to shift it is not a viable solution, we cannot log-transform this regressor. We therefore leave it in its original numerical form for now, and we will split it according to the criteria for dummy variables built from continuous covariates in later analysis. As shown in Table 5.3, the log-transformed variables no longer show a significant number of outliers, as expected: at this point, each variable shows a share of outliers below the 10% threshold.


Table 5.3 Log transformation of variables with high number of outliers
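A minimal sketch of the transformation, with column names assumed for illustration:

```python
import numpy as np

# Log-transform the heavily skewed, strictly positive covariates;
# Net_Debt is left untouched because it takes negative values.
for col in ["PP&E_Assets", "MarketCap", "Assets"]:
    df["Log_" + col] = np.log(df[col])
```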

5.6 Linearity Assumption

As stated before, the conditional distribution of the outcome in logit models follows a binomial distribution. Therefore, the independent variables are not subject to a normality assumption, and we do not need to check for it in order to estimate a statistically stable logistic model. Nevertheless, the distribution of the maximum likelihood estimator is assumed to be normal when Wald-based confidence intervals are used. For this reason, the likelihood-ratio test is recommended over the Wald test for assessing the significance of individual coefficients, as well as of the overall model (Hosmer et al., 2013). Hence, the normality assumption is not completely irrelevant, although it is not essential in our analysis. However, one assumption that must be verified before fitting the logistic regression model is the linearity between each continuous independent variable and the logit of the outcome. To check this assumption, we inspect the plots in Figure 5.11 and verify whether the relationship with the logit is approximately linear. Once linearity in the logit holds, we can rely on approaches analogous to those commonly used in linear regression. This diagnostic was performed with R Studio instead of Python, because only a few software packages implement this type of analysis.
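Although the thesis runs this diagnostic in R, an equivalent check could be sketched in Python; the snippet below is illustrative only and assumes that X is a pandas DataFrame holding the continuous covariates and y the binary rating outcome.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit a logistic regression on the assumed design matrix X and outcome y
X_const = sm.add_constant(X)
fit = sm.Logit(y, X_const).fit(disp=0)

# Logit of the fitted probabilities
p_hat = fit.predict(X_const)
logit_hat = np.log(p_hat / (1 - p_hat))

# One scatter plot per covariate: an approximately linear cloud supports the assumption
for col in X.columns:
    plt.figure()
    plt.scatter(X[col], logit_hat, s=10)
    plt.xlabel(col)
    plt.ylabel("Logit of fitted probability")
plt.show()
```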


Figure 5.11 Plots of independent variables against their logit outcomes

Indeed, looking at Figure 5.11, some variables behave approximately linearly while others do not. For instance, Gross_Margin does not show a linear relationship with the logit of the outcome, and other variables such as Cash_Assets, Goodwill_Assets and Log(PP&E_Assets) also fail to show linearity. In this research, we simply drop these variables because they would distort the performance of the model. Nonetheless, the theory suggests analysing them more deeply and transforming them with the most appropriate functional form, e.g. an exponential transformation or different polynomials, in order to achieve linearity in the logit.


Chapter 6. Models Selection

6.1 Overview

Following the decision-making procedure described by Hosmer et al. (2013), we are going to identify the best subset of covariates, which will be fitted first in a binary logistic model and then in a multinomial logistic model. The ultimate goal is to find the most parsimonious model across the universe of potential models, that is, the model that best describes the true outcome with the fewest possible variables, so that it is numerically stable and more easily used for inference purposes. Including too many variables in the model could inflate the estimated standard errors and thus make the estimates less reliable. This variable-selection analysis is described in detail for the binary logistic regression model, the simpler of the two. Nevertheless, once the final binary model is identified and assessed, we use the same optimal set of regressors to assess significance in the multinomial logistic regression model.

6.2 Binary Logistic Regression model

6.2.1 Step 1: Univariate analysis

We start with a univariate analysis of each independent variable. We then look at the univariable Wald statistics and discard all covariates whose p-value for the significance of the coefficient exceeds 0.25. The 0.25 threshold follows the work of Mickey and Greenland (1989) on logistic regression. Table 6.1 highlights all the variables above this threshold, which we are going to delete before fitting the multiple logistic regression model in the next step. It is worth noting that many of the highlighted variables are the same variables that do not respect the linearity assumption discussed in the last section of the “Data Analysis” chapter.


Table 6.1 Univariate logistic regression output for each explanatory variable
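The screening step can be illustrated with a short sketch based on statsmodels; as before, X is assumed to hold the candidate covariates and y the binary outcome.

```python
import pandas as pd
import statsmodels.api as sm

# Fit one univariate logistic regression per covariate and collect the Wald p-values
univariate_pvalues = {}
for col in X.columns:
    x_const = sm.add_constant(X[[col]])
    fit = sm.Logit(y, x_const).fit(disp=0)
    univariate_pvalues[col] = fit.pvalues[col]  # Wald p-value of the coefficient

screen = pd.Series(univariate_pvalues).sort_values()
keep = screen[screen <= 0.25].index.tolist()    # Mickey and Greenland (1989) threshold
print(screen)
print("Retained covariates:", keep)
```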

6.2.2 Step 2: Multivariate analysis

At this point, we fit the multivariate logistic regression model with all the remaining independent variables. Once the model is fitted, we look at the significance of the coefficients in a multivariate context. We therefore eliminate all the covariates whose Wald test p-value exceeds the 0.05 significance level, i.e. the covariates whose coefficients are not statistically significant. The output of the logistic regression model is shown in Figure 6.1. For a better understanding, Table 6.2 lists the independent variables that we are going to delete because their estimated coefficients are not statistically significant.


Figure 6.1 Multiple logistic regression output

Table 6.2 Variables to delete after the first multiple logistic regression is performed

We then fit the more parsimonious model, i.e. the model with the remaining set of regressors, and compare the two models through the Likelihood Ratio test. The output of the smaller fitted model is presented in Figure 6.2. Looking at the computed log-likelihoods of the two models, we have $loglik(M_{nested}) = -27.535$ for the nested one and $loglik(M_{full}) = -24.852$ for the original model. We then compute the G statistic as

$$G = -2\,[\,loglik(M_{nested}) - loglik(M_{full})\,] = -2\,(-27.535 + 24.852) = 5.366.$$


As stated in the theory of the Likelihood Ratio test (Chapter 3), the G statistic follows a chi-square distribution with $n$ degrees of freedom, where $n$ is the difference in the number of parameters between the two models. In this case we have 4 degrees of freedom, because $M_{nested}$ has 5 features while $M_{full}$ has 9. Thus, we compute the chi-square p-value of the statistic with 4 degrees of freedom,

$$P(\chi^2_4 > 5.366) = 0.252.$$

Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the coefficients of the additional predictors are zero, and we therefore conclude that the smaller model is adequate and preferable to the larger one.

Figure 6.2 Output of the nested logistic model
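The test can be reproduced with a few lines of Python; the sketch below simply re-uses the log-likelihoods reported above as inputs.

```python
from scipy import stats

# Log-likelihoods reported for the two fitted models
loglik_nested = -27.535  # smaller model, 5 covariates
loglik_full = -24.852    # larger model, 9 covariates

G = -2 * (loglik_nested - loglik_full)  # likelihood ratio statistic
df_diff = 9 - 5                         # difference in number of parameters
p_value = stats.chi2.sf(G, df_diff)

print(G, p_value)  # G ~ 5.366, p ~ 0.252 -> keep the smaller model
```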

6.2.3 Step 3: Preliminary model

At this point we look at the estimated coefficients of the two models fitted in Step 2. We should be concerned if the estimated coefficients in the smaller model have changed in magnitude by more than 20% compared with those in the larger model. Such a change would be caused by the elimination of one of the previous regressors and would indicate that one of the excluded variables provides a needed adjustment to the effects of the covariates remaining in the smaller model.
