
Department of Computer Science

Master Degree Programme in Business Informatics

MASTER THESIS

Using machine learning and linear programming approaches to

optimize TV advertising campaign performance

SUPERVISOR

Prof. Dino PEDRESCHI

CANDIDATE

Mario SHTJEFNI


TV advertisers have a limited budget they can spend on a television ad campaign. When promoting a product or service, their goal is to optimally allocate spot bookings on TV channel networks so as to reach the biggest audience possible while at the same time not surpassing the financial resources associated with the campaign.

In this paper, I have used machine learning models to analyze data from previous spot bookings and the corresponding consumer response to determine what input features are the most important in predicting a successful campaign. Afterwards, the interplay between explanatory and target variables was interpreted in a decision-making context. Lastly, based on model forecasts I have proposed an optimal basket of booking choices which takes all budgetary constraints into account and can be implemented by the advertiser in future campaigns.


Contents

1 INTRODUCTION
1.1 Problem formulation
1.2 Theory and methods
1.2.1 Data mining
1.2.2 Linear programming
1.3 Approach
1.4 Internship at Videobeat GmbH
1.5 Attribution calculation - providing context
1.6 Aim and expectation

2 DATA PREPROCESSING
2.1 Feature overview
2.1.1 Assessing quality of the data
2.2 Feature transformation
2.3 Target feature overview
2.3.1 Relationship between explanatory features and the target feature
2.4 Determining feature importances
2.4.1 Assessing feature importance with ANOVA
2.4.2 Assessing feature importance using random forests

3 RANDOM FOREST REGRESSION
3.1 Foreword on decision trees
3.2 The random forest regressor
3.3 The bias-variance trade-off
3.4 Handling skewed response variable distribution
3.5 Encoding categorical variables
3.6 Random forest hyperparameter tuning
3.7 Fitting the model to the data
3.8 Using cross-validation to determine best hyperparameter values

4 GRADIENT BOOSTING MACHINES REGRESSION
4.1 The gradient boosting approach
4.2 Performing gradient descent through boosting
4.3 Gradient boosting hyperparameter tuning
4.4 Fitting the model to the data

5 MULTI-LAYER PERCEPTRON REGRESSION
5.1 The architecture of a neural network
5.2 How do neural networks learn?
5.3 Multi-layer perceptron hyperparameter tuning
5.4 Fitting the model to the data
5.5 Comparing model performances

6 IMPACT OF PREDICTIVE FEATURES ON CPV
6.1 Interpreting random forests with treeinterpreter

7 OPTIMIZING TV SPOT CHOICE
7.1 Formulating optimization as a 0-1 knapsack problem
7.2 Implementing the maximization problem in Python

8 CONCLUSIONS AND DISCUSSION


1 INTRODUCTION

1.1 Problem formulation

TV advertisers constantly face the challenge of making optimal strategic decisions in a highly competitive environment that is characterized by uncertainty. If an advertiser decides to run an ad campaign on television with the intent of promoting a product or service, they need to carefully plan budget allocation for the entire campaigning period. In an optimization context, this usually means bringing as many people as possible in touch with the product while simultaneously not exceeding the limited financial resources available. Typically when planning a campaign, the most relevant variables to be taken into consideration by the advertiser are the broadcasting channels they are choosing to book their commercials on, the specific time of day selected to air these commercials, the duration of the commercial spots etc.

Historically, advertisers make spot bookings based on some prior domain knowledge. They generally know which TV channels have the biggest reach, or during what time of day the largest number of people are watching. However, it is often difficult to assess the effectiveness of a campaign because it is hard to quantify consumer response that was exclusively triggered by a spot which aired on TV. Note that the term ‘consumer response’ in this paper refers specifically to the advertiser’s website visits that were caused by spot airings on TV.

The company I have worked for as an intern provides a tool that accurately calculates this consumer response. Most of our clients are in fact advertisers who are very interested in knowing the magnitude of such customer feedback; and now that they can reliably measure the success of their ad campaigns, I can properly formulate the problem that lies at the core of this paper’s focus. We established that an advertiser’s goal is to get the most bang for their buck, or in other words, to optimize the allocation of bookings at every broadcasting channel so as to reach the most people. In this situation, the most important metric to consider is what we call cost-per-visit (CPV), which is simply the ratio between the price of buying a specific booking for a spot and the website visits attributed to it. A lower CPV generally implies that the advertiser was able to spend less money on the campaign and obtain more online visits, while in contrast, a higher CPV means that the campaign was not particularly effective and some decisions need to be reevaluated. It is in the advertiser’s interest to minimize overall CPV.

Problem: Given a data repository with the following information related to past TV spot airings:

• Channel name
• Week day
• Hour of the day
• Spot length

analyze historical data for one client to gain useful insights on CPV variation based on the predictive power of these four dimensions. After determining which dimensions have the most significant impact on CPV behavior, provide a method to ensure overall CPV minimization for an ad campaign by adequately allocating bookings, considering budgetary constraints and the minimal/maximal amount of money that can be spent per channel.

1.2 Theory and methods

My plan of attack in this paper was to split the problem into two clearly distinguishable tasks:

• Step 1 - Employ regression analysis using machine learning techniques in order to accurately forecast CPV based on the above mentioned features. Interpret how each predictive feature influences CPV.

• Step 2 - Identify and select the most discriminative combinations of dimensional values that minimize predicted CPV for each spot while not violating any budgetary constraints.

The first task is essentially a data mining problem, while the second one is an optimization problem.

1.2.1 Data mining

When it comes to discovering hidden, potentially complex relationships between variables and predicting future trends in large sets of data, a data mining approach is the way to go. In simple words, data mining is the non-trivial process of sifting through and analyzing rich sets of domain-specific data and then extracting the information and knowledge in the form of new relationships, patterns or clusters for decision-making purposes [1]. It often involves methods at the intersection of machine learning, statistics and database systems. This process is closely linked to, and an integral part of, the Knowledge Discovery in Databases (KDD) concept. The KDD process denotes the overall course of identifying novel, potentially useful, and ultimately understandable patterns in data. Note that the term knowledge is not easy to define and can rather be viewed as being “in the eyes of the beholder”. KDD applications deliver measurable benefits, including reduced cost of doing business, improved profitability, and enhanced quality of service.

Often before the knowledge extraction process can begin, a preprocessing stage is required. Preprocessing is necessary for transforming input into a format that is more understandable and better suited for further processing [2]. Transformations may include aggregation, sampling, dimensionality reduction, feature creation etc.

In practical applications of data mining, if we talk about machine learning algorithms we are usually talking about supervised learning models. In supervised learning, the main goal is to learn a model from labeled training data that allows us to make predictions about unseen or future data [3]. The task of the algorithm is to map an input to an output based on example input-output pairs. In contrast, when we are dealing with the analysis of unlabeled data, we are in the domain of unsupervised learning [3]. Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable. In this paper, I have used unsupervised learning algorithms to perform cluster analysis in the data preprocessing stage. Afterwards I have applied supervised learning algorithms to build reliable predictive models and interpreted them accordingly.

1.2.2 Linear programming

Businesses have limited resources. A manufacturing organization can only hire a limited number of employees; restaurants have a limited amount of space available for seating. Advertisers, too, can only spend a certain amount of money on their campaigns. Deciding how to make the most out of the available resources is a universal problem [4]. Usually, this involves choosing how to distribute the resources in such a way that profits are maximized and costs minimized.

Mathematical programming (MP) is a branch of management science that finds the optimal way of using limited resources so that an organization can attain its goals. It is applied in many domains such as determining product mix, routing and logistics, or financial planning. The optimization problem has to be articulated as a mathematical model. This model contains the decision variables, which typically represent the amounts or quantities that need to be tuned. Based on these variables, the objective function and constraints are defined. An objective function identifies some function of the decision variables that the decision maker wants to either maximize or minimize [4]. Constraints, on the other hand, can be viewed as functions of the decision variables that must be equal to, greater than, or less than some predefined value.

Linear programming (LP) is a sub-branch of MP and deals with the problem of optimizing a linear function subject to linear constraints [5]. In other words, the objective function and constraints are linear combinations of the decision variables. LP is a very powerful tool that can be applied in many business situations. Once defined, LP problems can be solved using variants of the Simplex method, heuristic approaches and so forth, depending on the computational complexity associated with them. The optimization problem presented in this paper is part of the LP family.

1.3 Approach

Throughout my work, I have relied on the Python programming language and the rich and powerful data analysis libraries it offers to conduct thorough experimentation with the data at hand. Given that we have a data repository and we want to automatically discover useful information in it, I considered three different machine learning regressors upon which predictions of the target variable (CPV) were carried out:

• Random forest regression
• Gradient boosting regression
• Multi-layer perceptron regression

The models used were all implemented using Python’s sklearn library.

Before fitting these models to the data, a preprocessing stage took place. It consisted mostly of binning categorical values of various features together into fewer groups and extracting new features from existing ones.

Subsequently, the machine learning models underwent hyperparameter tuning i.e. the process of finding the parameters for which a model yields the best performance on the given data. The predictive capabilities of every tuned model turned out to be rather modest. However, unveiling curious relationships between features was still possible. After having selected the best performing model for CPV prediction, the impact of every predictive feature on the target variable was quantified in order to provide more tangible results for the business domain.

The second task consisted of providing a method for choosing only the subset of spots in the dataset that would minimize the sum of predicted CPVs. Initially, the apparent minimization problem was translated into a maximization one. This way it could be rewritten as a slightly modified version of the 0-1 knapsack problem. In the knapsack problem, we have a set of items (spots) with weights (booking prices) and values associated with them. The objective is to pick items that maximize total value while not violating a given capacity constraint. Furthermore, additional constraints associated with channel-specific spending bounds were introduced. I have used the Python pulp library to define an optimization problem in terms of the objective function and constraints. pulp then exploits its dual simplex algorithm solver to choose which spots should be booked by the advertiser in the next campaign.
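To make the formulation more concrete, the following is a minimal sketch of how such a 0-1 knapsack model can be expressed with pulp. The data values, the channel bounds and the use of 1/CPV as the "value" of a spot are illustrative assumptions, not the exact code presented in chapter 7:

    import pulp

    # Illustrative candidate spots: booking price, channel and predicted CPV
    spots = [
        {"id": 0, "channel": "A", "price": 500.0, "cpv": 2.1},
        {"id": 1, "channel": "A", "price": 300.0, "cpv": 4.5},
        {"id": 2, "channel": "B", "price": 800.0, "cpv": 1.2},
    ]
    total_budget = 1000.0
    channel_min = {"A": 0.0, "B": 0.0}      # minimal spend per channel
    channel_max = {"A": 600.0, "B": 900.0}  # maximal spend per channel

    # 0-1 knapsack: one binary decision variable per spot
    prob = pulp.LpProblem("spot_selection", pulp.LpMaximize)
    x = {s["id"]: pulp.LpVariable(f"x_{s['id']}", cat="Binary") for s in spots}

    # Objective: maximize total value (here 1/CPV, so low-CPV spots are preferred)
    prob += pulp.lpSum(x[s["id"]] * (1.0 / s["cpv"]) for s in spots)

    # Overall budget constraint
    prob += pulp.lpSum(x[s["id"]] * s["price"] for s in spots) <= total_budget

    # Channel-specific spending bounds
    for ch in channel_min:
        spend = pulp.lpSum(x[s["id"]] * s["price"] for s in spots if s["channel"] == ch)
        prob += spend >= channel_min[ch]
        prob += spend <= channel_max[ch]

    prob.solve()
    chosen = [s["id"] for s in spots if pulp.value(x[s["id"]]) == 1]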

1.4 Internship at Videobeat GmbH

From mid-October of 2018 until early April 2019, I was part of the Data Science team at Videobeat GmbH. Based in Hamburg and initially a start-up agency, the company’s main activity consisted of booking and managing TV advertisement campaigns for various businesses trying to reach their potential customers through television. In the complicated world of TV spot bidding, Videobeat negotiates with German broadcasters on behalf of advertisers, securing discounts when possible and scheduling bookings during hours in which more people are thought to be watching. However, in recent years the company’s operations have expanded and become more complex, shifting the core business activity into a rather interesting domain.

The newly implemented technology is centered around measuring the impact that an advertiser’s TV spot has on their website traffic. I will refer to this impact as “attribution”, a word used to denote the amount of web visits that are caused by, or more precisely, can be attributed to the airing of said spot. Conceptually similar to Google’s web analytics tools, Videobeat provides its own tool that allows a business to closely quantify the effect their TV spot has had on their website’s visits. This is something that is generally highly desirable and has historically been lacking in the TV advertising industry. Videobeat fills this gap quite competently, which is why it has established itself as one of the front runners in data-driven video marketing in Germany. During my internship, I have mostly been tasked with analyzing and interpreting attribution results for various clients, identifying what went wrong and what went right during calculations, and providing incremental improvements to the main algorithm.

1.5 Attribution calculation - providing context

To isolate the web visits that occurred because a TV spot was aired, Videobeat relies on a complex attribution algorithm implemented by our hard-working and immensely talented team leader, Dr. Maria Gaschler.

The entire infrastructure upon which attribution calculations are run is written in the Python programming language. Two important inputs are needed in order to perform the computations: a so-called traffic file containing user session data from the website, and a spot file containing spot-related information such as airing time-stamps, broadcaster’s name, price paid for booking and so forth. The data from both files is then combined and undergoes preprocessing to generate a time series representation of the web traffic for a predefined period. Starting from this time series, the algorithm uses spline interpolation and wavelet transform models (bear with me here) to compute the traffic baseline, i.e. a time series of the traffic that would have occurred if no spot(s) had aired on TV during that period. The difference between the original time series and the calculated baseline is the spot attribution.


1.6 Aim and expectation

Our dedicated TV-team has been managing bookings quite efficiently, keeping clients very happy over the years with their expertise.

My aim is to support them in the decision-making process, which until now has been almost exclusively human-based. I believe that combining human expertise with machine learning, where possible, can lead to increased benefits for all the stakeholders involved.

By the end of my internship period I expect to provide Videobeat with a new tool that can be used to make better decisions regarding the TV spot booking process.


2 DATA PREPROCESSING

Data preprocessing is a crucial component of the knowledge discovery process. The steps involved in this stage include cleaning data to make it more suitable for analysis and selecting records and features that are relevant to the data mining task at hand [2].

The Videobeat client I have selected for analysis in this paper is a Berlin based vacation rental search engine that asked to remain anonymous. I will be focusing exclusively on the German market of this client, under the assumption that consumer behavior is different for each country and analyzing multiple markets at the same time would bias the results. The data collected covers a period of seven months, from June to December of 2018 and contains spot related information, including calculated attribution per spot.

2.1 Feature overview

Table 1 lists the most important features of the dataset along with their types and a brief description of each:

Feature | Type | Description
Channel | Nominal | Name of the TV channel on which the spot was aired
Date | Interval | Date of the spot airing
Time | Ordinal (cyclic) | Time of the spot airing (granularity is in hours)
Day | Ordinal (cyclic) | Day of the week (from Monday to Sunday) on which the spot was aired
Length | Ratio | Duration of the spot (in seconds)
CTC | Ratio | Cost-to-client: the price that the advertiser paid Videobeat for booking the spot (obtained by applying the applicable fees and discounts to the original price demanded by the channel)
Spot name | Nominal | A name that uniquely identifies every spot created by the client
Program | Nominal | Program that was being broadcast on the channel before the spot was aired
Visits | Ratio | Attribution, i.e. the number of website visits credited to the TV spot
Network | Nominal | Broadcast network to which the channel belongs

Table 1: The client dataset

Nominal and ordinal attributes are referred to as categorical attributes. Categorical attributes are discrete, i.e. they take a finite or countably infinite set of values. The other two types of attributes, interval and ratio, are referred to as numeric attributes. Numeric attributes are usually continuous; their values are real numbers and are often represented as floating point variables:

• Nominal - values provide only enough information to distinguish one object from another.
• Ordinal - values provide enough information to order objects.
• Interval - differences between values are meaningful, i.e., a unit of measurement exists.
• Ratio - both differences and ratios between values are meaningful.

Notice that the target feature is the CPV for every spot, a continuous, non-negative variable. Although it is not present in the dataset, it can be easily derived through the formula:

CPV = \frac{CTC}{Visits}

It means that CPV is the price paid for a spot divided by the number of visits it was able to attract. The pandas library was used to read the data and transform it into a pandas dataframe object, which can then be easily manipulated. The first five rows of the dataframe are presented in the figure below:

Figure 1: Client dataframe
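As a minimal sketch, and assuming the spot data has been exported to a CSV file with the column names of Table 1 (the file name here is hypothetical), the dataframe and the derived target feature can be obtained as follows:

    import pandas as pd

    # Hypothetical file name; the real input comes from Videobeat's internal pipeline
    df = pd.read_csv("client_spots.csv")

    # Derive the target feature: cost-per-visit = price paid / attributed visits
    df["CPV"] = df["CTC"] / df["Visits"]

    print(df.head())  # the first five rows, as shown in Figure 1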

In total, there are 50,331 rows present. Note that each row represents a single spot that was aired at a given point in time. The most important features that will be considered for this study are channel, time (in hours only), day and length. With the exception of the latter, every feature is categorical. However, because length only takes two values, I will treat it as a categorical variable as well.


2.1.1 Assessing quality of the data

The saying “garbage in, garbage out” applies to data analysis as to any other area [6]. We often need to worry about data quality before proceeding with deeper analysis. One way to assess the quality of the data is to evaluate its accuracy and completeness. Accuracy is defined as the closeness between the value in the data and the true value [6]. Completeness generally refers to the number of missing values that may be present in some attributes.

For the purposes of my analysis, the data I will work on is both highly accurate and complete. There are no missing values present in the dataset and attribute values such as length or time of spot airing were measured with high precision.

To conclude, it is true that the number of features taken into consideration is relatively low and the size of the dataset is relatively small. Since the explanatory features are categorical, it would not make much sense to apply normalization or to bring values to the same scale. In addition, we do not have to particularly worry about dimensionality reduction or sparsity. Nevertheless, there is “no free lunch” and in almost every case data needs to be meaningfully transformed so as to make it easier for the machine learning algorithms to recognize useful patterns.

2.2 Feature transformation

To get a better idea of how feature values are spread across the dataset, I have used the seaborn library to plot the frequency distribution for every category, as shown in the histograms below:


Figure 2: Week day frequency

Figure 3: Hour frequency

Figure 4: Channel frequency

Figure 5: Spot length frequency

It is important to consider that when we are dealing with categorical variables, it is good practice to merge their categories into larger groups, especially if the number of categories for each variable is high. This is done mainly to avoid possible overfitting when using them as predictors in machine learning models [2], as well as to eliminate values that occur at a very low frequency relative to other categories. Let’s describe each graph step by step.

In Figure 2 we can see that there is no significant difference in which week days the client decides to book their spots on. For now, there is no real incentive to further modify this attribute. In contrast, Figure 3 shows that there is a big difference in the hour selection. This variable has 19 categories. Late night and early morning hours are rarely chosen for spot airing, due to the lower exposure rates that are typically associated with them. On the other hand, noon or evening times are highly preferred. This gives us a good idea of a proper way to group hours into fewer categories. In fact, it would make a lot of sense to band hours together according to different parts of the day, so as to be more in line with the activities of the average person who watches TV. The German Broadcaster Association suggests six different periods of the day that provide a meaningful separation of TV transmission time:

• Late night → Midnight to 6:00 AM
• Morning → 6:00 AM to 9:00 AM
• Daytime1 → 9:00 AM to 1:00 PM
• Daytime2 → 1:00 PM to 5:00 PM
• Access time → 5:00 PM to 8:00 PM
• Prime time → 8:00 PM to Midnight

I am going to use this approach for clustering hours together.
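A minimal sketch of this grouping, assuming the airing hour has been extracted into an integer column named Hour (a hypothetical name derived from the Time attribute):

    # Map each airing hour (0-23) to one of the six dayparts listed above
    def to_daypart(hour):
        if hour < 6:
            return "Late night"
        elif hour < 9:
            return "Morning"
        elif hour < 13:
            return "Daytime1"
        elif hour < 17:
            return "Daytime2"
        elif hour < 20:
            return "Access time"
        else:
            return "Prime time"

    df["Daypart"] = df["Hour"].apply(to_daypart)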

Looking at the channel frequency distribution in Figure 4, we see that it is highly imbalanced. There are 45 channels in total, but channels such as Folx, Welt der Wunder and wetter.com stand out as the most popular choices of the client, whereas broadcasting giants like RTL, ARD and VOX have been picked far fewer times. This is likely cost-related, implying that channel frequency is influenced by the price channels demand to secure a booking. Typically, the higher the price, the more influential the network. The cost-to-client attribute does not directly reflect prices as demanded by channels. Rather, it is the price Videobeat charges the client after applying various fees and discounts that are regulated internally. Nevertheless, CTC can still be utilized to infer a channel’s influence or reach.

To find out whether some relationship between cost-to-client and channel frequency exists, I have used the pearsonr function of the scipy library. The Pearson correlation coefficient quantifies the linear relationship between CTC and channel frequency. It is mildly negative (≈ -0.25), suggesting that generally the more frequent the channel, the lower the CTC associated with it. Thus, a meaningful way to reduce the number of categories in this case is to cluster channels together according to frequency and the median CTC per channel. The median is better suited if datapoints are spread unevenly because it is less susceptible to noise and extreme values than the mean.
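A sketch of this check, assuming df is the dataframe from earlier and that the correlation is computed between per-channel booking frequency and per-channel median CTC (the intermediate variable names are illustrative):

    from scipy.stats import pearsonr

    # Per-channel booking frequency and median cost-to-client
    freq = df["Channel"].value_counts()
    median_ctc = df.groupby("Channel")["CTC"].median()

    # Align both series on the same channel order before correlating
    channels = freq.index
    r, p_value = pearsonr(freq[channels], median_ctc[channels])
    print(r, p_value)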


Cluster analysis is often referred to as unsupervised classification. It groups objects based only on information found in the data that describes the objects and their relationships [2]. The goal is that the points within a group be similar to each other and different from points in other groups. There are a few techniques that can be used to perform cluster analysis. I have decided to employ KMeans, using the KMeans() function from the sklearn library. KMeans is a simple technique that defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied in an N-dimensional Euclidean space. Before the clustering is done, we have to specify the number of groups k we want to split the points into. The algorithm steps are listed below:

1. Select k points as initial centroids
2. Repeat:
3. Form k clusters by assigning each point to its closest centroid
4. Recompute the centroid of each cluster
5. Until centroids do not change

A natural question that arises in this scenario is how to choose the right number of centroids k. If k is too high, clusters will be too small and will not convey meaningful information about the structure of the data. If k is too small, the splitting will be too simple. A popular approach is to use the so-called elbow rule [3]. We plot the number of clusters k against the sum of squared errors (SSE) that corresponds to each k. SSE in this case can be defined as the sum of squared differences between the points of a group and its centroid:

SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (p_{ij} - c_j)^2

p_{ij} is the i-th point in cluster j, while c_j is the centroid of cluster j. If we observe that moving from k to k+1 clusters does not cause a significant decrease in the total SSE, we select k as the optimal number of clusters. The idea is illustrated in the figure below.


Figure 6: SSE distribution

It seems that a good choice would be to partition the points into 3 groups because if we select k=4, we see that the decrease in total SSE is not significant (∆SSE ≈ 0.7).

Another way of evaluating the quality of the split is through the silhouette coefficient. It quantifies the similarity between points within a group (cohesion) and dissimilarity between points of different groups (separation). The value of the silhouette coefficient varies between -1 and 1. The higher the value, the better the clustering. Indeed, for k = 3 the obtained coefficient is the highest (≈ 0.675).
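A sketch of how both checks can be run with sklearn, assuming X is a two-column array holding booking frequency and median CTC per channel (built from the series above; in practice the two columns may need to be rescaled to comparable ranges first):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # One row per channel: [booking frequency, median CTC]
    X = np.column_stack([freq[channels].values, median_ctc[channels].values])

    for k in range(2, 8):
        km = KMeans(n_clusters=k, random_state=0).fit(X)
        # km.inertia_ is the total within-cluster SSE used for the elbow rule
        print(k, km.inertia_, silhouette_score(X, km.labels_))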

The graph below visualizes the clustering results. Because KMeans assigns every single datapoint to a cluster, it is up to us to identify potential patterns or anomalies in the clusters. We can see that channels in groups 1 and 2 are quite similar in terms of median CTC, but they differ when it comes to the frequency of appearance in the dataset. The third group consists of channels that have a high median CTC and low frequency. Channels in this group are outliers. Outliers are simply values or data objects that are far away or very different from all or most of the other data [6]. Unfortunately, this kind of clustering is somewhat too simplistic and is not likely to increase the predictive power of the channel name feature. This highlights a weakness of KMeans, namely the inability to identify well-differentiated clusters in areas with both high and low density of datapoints.

Figure 7: Channel clusters

For this reason I have used a simpler approach: binning channels together through CTC discretization. Discretization implies transforming a continuous attribute into a categorical attribute. After the median CTC per channel is sorted, we can bin the values that are closer to each other into groups. Each CTC value in a group corresponds to a channel, thus we are essentially doing channel binning. There is no generally best choice for the number of bins, but there are certain recommendations. Sturges’ rule proposes choosing the number of bins k according to the following formula:

k = \lceil \log_2(n) + 1 \rceil

where n is the sample size. This would mean six bins for our sample; however, I have decided to split the channels into five groups, given that the majority of CTC medians per channel are very close to each other. The number of channels per bin is not equal. The first four bins have 9-11 channels each, while the last bin contains only 4 - ARD, RTL, VOX and WELT (these are the channels whose median CTC is much higher than the average median value). Figure 8 shows some of the obtained channel groups:

Figure 8: Channel groups
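A sketch of this binning step, assuming median_ctc is the per-channel median CTC series from before. pd.qcut is used here as a convenient approximation; the actual groups in this chapter were formed by inspecting the sorted medians, which is why they are not of equal size:

    import pandas as pd

    # Sort channels by median CTC and cut them into five groups of neighbouring medians
    sorted_ctc = median_ctc.sort_values()
    groups = pd.qcut(sorted_ctc, q=5, labels=["G1", "G2", "G3", "G4", "G5"])

    # Map every spot to the group of its channel
    df["ChannelGroup"] = df["Channel"].map(groups)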

Lastly, Figure 5 tells us that 12-second spots are used more than twice as much as 15-second spots by the client. The length variable will not be modified.

2.3 Target feature overview

To visualize the target variable distribution, I used the seaborn kdeplot() function. “KDE” stands for kernel density estimation, a method used to generate a probability density function for a given random variable. We can see that CPV distribution is highly skewed:


Figure 9: Probability distribution for the CPV

Smaller values have a much higher probability of occurring. 75% of the values are smaller than 4.1, while the maximum value is 2851. Given that the mean is to the right of the median (the 50th percentile), a rule of thumb suggests that the data is skewed to the right. Table 2 summarizes the basic statistics for the CPV.

Mean | Standard deviation | Minimum | 25% | 50% | 75% | Maximum
31.7 | 143.1 | 0.0 | 0.9 | 1.8 | 4.1 | 2851

Table 2: CPV overview

A minor issue encountered when deriving CPV was that in about 2000 records the value of the visits attribute was 0, so CPV could not be calculated. In order not to discard these records, I replaced zero-value visits with a number close to zero (0.01). This conveniently yields high corresponding CPV values, which are not desirable anyway, since an advertiser wants to avoid receiving zero or close to zero visits on their website.
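A sketch of that replacement step (reusing the dataframe from earlier; 0.01 is the constant mentioned above):

    # Spots with no attributed visits get a small positive placeholder, which
    # deliberately produces a very high CPV for those records
    df["Visits"] = df["Visits"].replace(0, 0.01)
    df["CPV"] = df["CTC"] / df["Visits"]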


2.3.1 Relationship between explanatory features and the target feature

Let’s look at how the CPV behaves for different feature categories. We can use boxplots to accomplish this goal. Boxplots are a non-parametric and compact way of visualizing and summarizing important characteristics of a sample from a numerical attribute [6]. They can tell us if any outliers exist and what their values are, if the data is symmetrically spread or how tightly grouped it is. seaborn’s boxplot() function enables great visualization of boxplots. Figure 10 shows CPV distribution with respect to day categories:

Figure 10: Day boxplot

The width of the colored boxes represents the interquartile range (IQR) for every day of the week. IQR is simply the difference between the 75th and 25th percentiles (Q3 − Q1) and thus represents a measure of variability in the data. The middle band in each box is the median, i.e. the 50th percentile. The lower and the upper ends of the whiskers represent the Q1 − 1.5 × IQR and Q3 + 1.5 × IQR values respectively. Any datapoint outside of this range is considered to be an outlier.

Many outliers can be observed for every day of the week. This is another indicator of the skewed CPV distribution.

We can see that Sunday has the lowest CPV median and IQR among all week days. Thursday and Tuesday surprisingly account for the highest overall median and IQR.

Figure 11 displays boxplots for different parts of the day:

Figure 11: Daypart boxplot

It is curious to note that the CPV simultaneously has the largest IQR and highest median during late night time. This is due to the lower viewership associated with the late night period, which is in turn responsible for a higher CPV. Another interesting detail is that median CPV steadily increases from morning time to Daytime2, where it reaches its peak. Altogether, the cost per visit in different parts of the day does not seem to vary by a large margin. One possible reason is that CTC and visits change proportionally, meaning that, for example, if the client pays more for a spot booking during prime time, the audience that it reaches is also larger, hence CPV does not differ much with respect to the other periods.

It is also interesting to see how the CPV varies for different months in each season. Customer demand usually tends to increase during the periods before the holidays. For this purpose, I extracted the month attribute from the date. Figure 12 summarizes the CPV distribution by month.

Figure 12: Month boxplot

There are noticeable differences in CPV median and IQR between every month. December and September seem to be the most volatile and expensive periods for the client in terms of TV advertisement. August is the least expensive month and also has the smallest IQR.

Lastly, CPV rates are similar for the categories of channel group and length.

Figure 13 shows a heatmap of the CPV given week days and dayparts. Each cell represents the median of the CPV given a day and daypart. We observe that independently of the day, CPV is generally higher during late night, Daytime1 and Daytime2. It appears that the lowest values are reached on a Sunday as well as during morning time. The heatmap gives us a hint on what particular feature value combinations to either avoid or exploit.


2.4 Determining feature importances

2.4.1 Assessing feature importance with ANOVA

If we were dealing with continuous or discrete numerical attributes, one of the approaches to determine what features have the most predictive power would be to set up a correlation matrix in order to quantify the linear relationship between features. We are interested in those features that have a high correlation with our target variable [3]. This would make it easier to foresee its behavior based on the predictive features. With categorical attributes it does not make a lot of sense to test for correlation, especially with nominal features. In this scenario, a good approach would be to use analysis of variance (ANOVA). ANOVA is typically used to learn the relative importance of different sources of variation in a dataset [7]. We can perform a so-called one-way ANOVA test on every feature and see if the change in categorical values causes a statistically significant change in the corresponding mean of the target feature.


The idea is to quantify the total variation of the target feature in terms of within group variation and between group variation. Total variation is referred to as the Total Sum of Squares (SST) while within and between group variations are often labeled as Sum of Squares Within (SSW) and Sum of Squares Between (SSB). Say we have k groups (i.e. a feature with k unique categorical values), then the variations can be mathematically formulated as:

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

SSW = \sum_{j=1}^{k} \sum_{i=1}^{m} (y_{ij} - \bar{y}_j)^2

SSB = SST - SSW

where

• y_i is the i-th observation in the entire population of size n
• \bar{y} is the mean of the population
• y_{ij} is the i-th observation in the j-th group of size m
• \bar{y}_j is the mean of group j

The null and alternative hypotheses are:

H_0: \mu_1 = \mu_2 = \ldots = \mu_k
H_1: \mu_i \neq \mu_j for at least one pair i, j

One of the most important metrics provided by ANOVA is the F-statistic, which is the ratio between the mean SSB and the mean SSW:

F = \frac{\overline{SSB}}{\overline{SSW}}

where \overline{SSB} is the mean SSB and \overline{SSW} is the mean SSW. A high F-score means that the variation between groups is large, thus prompting us to reject the null hypothesis that the means of all groups are equal. In general, the higher the F-score, the stronger the evidence against the null hypothesis, i.e. the more likely it is that a change in categorical values causes a significant variation in the response variable.

I have used the anova_lm() function from the statsmodels library to perform a one-way ANOVA test of the importance of each independent attribute on the CPV. Table 3 summarizes the results:

Feature | F-statistic | p-value
Channel group | 123.18 | 6.12e−56
Day | 56.63 | 1.07e−23
Daypart | 52.86 | 1.07e−22
Month | 234.61 | 5.23e−73
Length | 5.77 | 9.01e−3

Table 3: One-way ANOVA scores
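A sketch of how one of these tests can be run, assuming df holds the transformed features under the (hypothetical) column names used in the earlier sketches; Daypart is taken as an example:

    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # One-way ANOVA: does the mean CPV differ across daypart categories?
    model = ols("CPV ~ C(Daypart)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=1))  # reports the F-statistic and the p-value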

If the p-value is smaller than 0.05, we reject the null hypothesis, otherwise we fail to reject it. We can see that the p-value is smaller than 0.05 for all features, indicating that they are all relevant in predicting CPV. However, not all features are equally important, as some of them clearly have higher F-scores. Month, Channel group and Day are the most discriminative features. Their corresponding p-values are practically zero, hence we accept the alternative hypothesis that the mean of CPV is statistically different for at least two categorical values of each feature. On the other hand, length has a higher p-value but it is still close to zero and we still reject the null hypothesis, hence switching between different spot lengths also causes a statistically significant change in the mean of the CPV.

2.4.2 Assessing feature importance using random forests

Another useful approach to selecting relevant features from a dataset is to use an ensemble technique called random forest (RF) [3]. I will focus more extensively on random forests in the next section. Random forests measure feature importance in terms of the averaged impurity decrease computed from all decision trees in the forest [3]. The RF variable importance measure can also automatically capture non-linear and interaction effects without specifying these a priori [8]. The sklearn library computes feature importances that can be accessed via the feature_importances_ attribute after fitting a random forest estimator.

Figure 14: Feature importances

The above image depicts the relative importances of the five independent features. The importance rank is the same as the one obtained using one-way ANOVA.
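A sketch of this step, assuming X is the encoded feature matrix (the encoding itself is described in section 3.5), y the CPV values, and feature_names the corresponding column names; since the target is continuous, a regressor is used here:

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X, y)

    # Mean impurity decrease per feature, as plotted in Figure 14
    for name, score in zip(feature_names, rf.feature_importances_):
        print(name, round(score, 3))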

Given that the number of input variables is relatively small, I will consider all five features above (channel group, month, length, daypart and day) when fitting various machine learning models in the upcoming analysis.


3 RANDOM FOREST REGRESSION

3.1 Foreword on decision trees

Decision trees are a simple yet widely used prediction technique. As the name implies, we can think of decision trees as models that partition data by making decisions based on asking a series of questions. They are a non-parametric approach or, in other words, they do not require any prior assumptions regarding the type of probability distribution or even the kind of relationship (linear or non-linear) between the response variable and the other attributes [2]. Decision trees are composed of nodes. There are three types of nodes:

• Root node - has no incoming links and zero or more outgoing links
• Internal node - has one incoming link and at least two outgoing ones
• Leaf node - has one incoming link and no outgoing links

We start at the root of the tree and split the data based on the feature that brings the most information gain (IG) [3]. Information gain can be explained in terms of node impurity. When we are dealing with classification, i.e. predicting a binary or multi-labeled response variable, node impurity can be defined as a measure of the homogeneity of the labels at the node. The higher the impurity, the lower the homogeneity in the node and vice versa. Various impurity indexes exist, among which the gini index, classification error and entropy are most commonly used. The information gain is simply the difference between the impurity of the parent node and the (sample-weighted) sum of the child node impurities [3]. The goal of the decision tree algorithm is to maximize information gain during the splitting procedure.
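In its usual formulation, the gain of splitting the data D_p at a parent node on feature f into child nodes D_j weights each child impurity by the fraction of samples that fall into that child:

IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)

where I is the chosen impurity measure, N_p and N_j are the number of samples at the parent and at the j-th child node, and m is the number of child nodes.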

In theory, we can iteratively repeat the splitting procedure for each child node until all the leaves are pure, but as I will explain later, this often leads to undesirable results. Therefore, we employ pruning techniques to limit full tree growth. If a leaf node happens to be impure, the prediction result will be the label that has the highest count in the node. Figure 15 illustrates a simple case of a decision tree in action. The rightmost node is a leaf node. We don’t need to split it further since it is already completely pure as indicated by the label color of each datapoint. The two internal nodes have been split further based on pre-defined pruning criteria.


Figure 15: Simple decision tree

Decision trees can also be used as regressors when the target variable we are trying to predict is continuous, which is indeed the case in this paper. There are two main differences compared to the classification task:

1. The impurity index is replaced with the standard deviation of the node sample:

S_k = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}

where x_i is the i-th observation in node k, \bar{x} is the sample mean and n the sample size. The standard deviation is a measure of the variability or diversity of the data, thus we want to split based on features that provide the highest standard deviation (variability) reduction.

2. Predictions are made by taking the average value of all datapoints in the leaf node sample.

Decision trees have a kind of “natural” inclination towards handling categorical data, hence I believe that selecting machine learning models that are based on decision tree estimators is a reasonable choice given the type of our input features.


3.2 The random forest regressor

Random forest is an ensemble technique designed specifically for decision trees. In an ensemble technique, we have a group of base (or weak) regressors that “train together” on the training set, and a prediction is then performed by averaging out the predictions of every base regressor in the set [2]. The goal is to decrease the overall generalization error. There are two conditions that all base predictors should meet:

1. They should be independent of each other.

2. Each base predictor should do better than a predictor that performs random guessing.

Independence can be enforced by diversifying the data on which each predictor is trained as much as possible, although in practice it is difficult to guarantee zero correlation between base predictors. This can be achieved through the bagging (bootstrap aggregation) process, a core idea behind random forest prediction. Essentially, bagging is an application of the bootstrap, which consists of repeatedly collecting (with replacement) samples from a dataset, usually in a random manner. Now, because the sampling is done with replacement, some records will appear in the sample more than once. It has been mathematically shown that if the dataset is large enough, the probability of each datapoint being selected converges to ≈ 0.632 [2]. In most implementations, the size of the bootstrap sample is chosen to be the same as the original training set [3].
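The ≈ 0.632 figure follows from a short calculation: in a bootstrap sample of size n, the probability that a given record is never drawn is (1 − 1/n)^n, so the probability of it appearing at least once is

1 - \left(1 - \frac{1}{n}\right)^{n} \longrightarrow 1 - e^{-1} \approx 0.632 \quad \text{as } n \to \infty.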

The random forest algorithm can be summed up in four steps:

1. Randomly sample n datapoints (randomly choose n samples from the training set with replacement).

2. Create a decision tree from the bootstrap sample. At each node:

• Split the node using the feature that provides the best split according to the objective function, for example, by maximizing the standard deviation reduction.

3. Repeat steps 1 to 2 k times.

4. Aggregate the predictions of the k trees to assign the final predicted value (for regression, by averaging).

Although not as easily interpretable as a decision tree, a big advantage of random forests is that we usually do not have to worry too much about pruning them since random forests are quite resilient to the overfitting of individual decision trees [3].

3.3 The bias-variance trade-off

The goal of any supervised machine learning algorithm is to best pick up the “signal” that is present in the data, without either corresponding too closely to a particular set of data or being completely unable to catch the underlying trend in the data. Simply put, we say that a model has high variance (overfits) when it does a good job of predicting the response variable of the dataset on which it has been trained but fails to generalize well on data it has never seen before. On the other hand, we say that a model has high bias (underfits) if it does a poor job of predicting outcomes for both observed and unobserved sets of data. The expected squared error that a predictive model makes on a given input x can be formulated as follows:

Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Variance is the variability of model prediction for a given datapoint and tells us the spread of our data. Irreducible error is simply noise that is ever present in the data and does not depend on the model implemented. Ideally we would want our model to have both low bias (not overly simplified) and low variance (it generalizes well on unseen records). Figure 16 visualizes how the bias-variance dynamic affects model prediction:


Figure 16: Bullseye illustration of bias-variance tradeoff

The bias-variance trade-off can be controlled in random forests. As outlined by Breiman [9], “the randomness used in tree construction has to aim for low correlation ρ while maintaining reasonable strength”. A large bootstrap sample size n means that we decrease the randomness, i.e. base regressors are more correlated with each other and thus the forest is more likely to overfit. In contrast, we can decrease the level of overfitting by choosing smaller values for n, at the cost of model performance. In sklearn’s random forest implementation, the size of the bootstrap sample is chosen to be equal to the number of samples in the original training set. This usually provides a good bias-variance trade-off [3].

3.4 Handling skewed response variable distribution

As observed in previous sections, the distribution of CPV is highly skewed. Around 75% of the values are smaller than 4, while the most extreme values are larger than 2700. Outliers like the latter are interesting to study because they can be a good indicator of what feature combinations to avoid. Despite their infrequent occurrence, a correct prediction of rare outcomes often has greater value than a correct prediction of the majority outcomes [2]. However, keeping them in the dataset creates an imbalance problem. Because the model learns from the most common occurrences, it gives the impression that it is performing quite well on unseen records. While this may be true most of the time, the model fails to accurately foresee the unlikeliest outcomes. A simple approach to handling imbalance problems is to resample the dataset in such a way that the distribution of the response variable is more balanced. This can be achieved through various methods like naive random oversampling or undersampling. Naive random oversampling means that input rows where the target variable takes outlier (minority) values are randomly oversampled so as to guarantee a balanced representation. Undersampling means actively under-representing input rows with the most common outcomes by randomly selecting a limited number of them, while minority input rows are all taken into consideration.

In my analysis, I decided to proceed with the latter approach; thus the final dataset used by the machine learning models to train on contains all the values that are above the 90th percentile, i.e. the outliers (≈ 5000 datapoints), while only considering a subset (≈ 15000 datapoints) of the records below the 90th percentile. This way, the imbalance ratio is reduced to 1:3 compared to the 1:9 it was before.
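A sketch of this resampling step with pandas, using the 90th percentile threshold and the sample size quoted above (the random_state values are arbitrary):

    import pandas as pd

    # Split the records into the heavy CPV tail and the common bulk of the distribution
    threshold = df["CPV"].quantile(0.90)
    tail = df[df["CPV"] > threshold]        # kept in full (≈ 5000 rows)
    bulk = df[df["CPV"] <= threshold]

    # Randomly undersample the majority part to roughly a 1:3 ratio, then shuffle
    bulk_sample = bulk.sample(n=15000, random_state=0)
    resampled = pd.concat([tail, bulk_sample]).sample(frac=1, random_state=0)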

3.5 Encoding categorical variables

As previously mentioned, all predictive features considered in this paper are categorical. It is a prerequisite in many machine learning libraries that categorical labels be encoded as integer values. Although estimators provided by sklearn convert labels to integers automatically, in order to avoid any unexpected behavior it is considered a good practice to supply labels that are first transformed into integer arrays. A popular technique used in this scenario is called onehot encoding. The idea is to translate all unique values of a categorical feature into a binary representation that is easy to interpret by the model. The number of bits used to portray every categorical value is equal to the total number of unique categories. For example, suppose we have a feature that takes three values: blue, green, and red. The onehot encoding approach would produce the following conversion:

• blue → 0 0 1
• green → 0 1 0
• red → 1 0 0


The ‘blue’ label corresponds to the first bit, the ‘green’ label to the second and the ‘red’ one to the third bit. Notice that only one bit will be set to 1 for every category, hence the name “one hot”. The position of the bit with the binary value 1 is sufficient to successfully distinguish between them. To perform this transformation on the dataset, I have used the OneHotEncoder that is implemented in the sklearn.preprocessing module. A noteworthy observation in this case is that the machine learning models will interpret all onehot encoded categories of a feature as standalone, independent features. For example, day will be replaced by seven new features (from Monday to Sunday) which will only take either 0 or 1 as values.
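A sketch of the encoding step, assuming the five categorical columns carry the (hypothetical) names used in the earlier sketches:

    from sklearn.preprocessing import OneHotEncoder

    cat_cols = ["ChannelGroup", "Month", "Length", "Daypart", "Day"]

    encoder = OneHotEncoder(handle_unknown="ignore")
    X = encoder.fit_transform(resampled[cat_cols]).toarray()  # one binary column per category
    # Recent sklearn versions expose the generated column names via get_feature_names_out;
    # older versions use get_feature_names instead
    feature_names = encoder.get_feature_names_out(cat_cols)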

3.6 Random forest hyperparameter tuning

The random forest implementation has quite a few parameters, also known as hyperparameters (I will use both terms interchangeably), that need to be set before fitting the model on the data. The process of comparing different parameter settings for the purpose of improving prediction performance is called model selection [3]. Different performance metrics exist; the one I have chosen to employ for the purposes of this paper is the root mean squared error (RMSE):

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}

where \hat{y}_i is the value returned by the model, y_i is the actual value for datapoint i and n is the size of the training set. RMSE represents the sample standard deviation of the differences between predicted values and observed values (called residuals).

The most important parameters I have considered for the random forest regressor are:

• Number of base estimators
• Minimal internal node size
• Number of randomly drawn candidate features (d)
• Tree depth

The number of base estimators refers to the number of trees in the forest. Usually, the more estimators are present, the better the model performance is going to be, but the improvement brought by each additional estimator keeps diminishing after a certain number of estimators is reached. Minimal node size is the minimum number of samples required to split an internal node, while tree depth is the maximum depth that all trees in the random forest are allowed to reach. Lastly, d is defined as the number of randomly drawn candidate variables out of which each split is selected when growing a tree [10]. The default value in the sklearn implementation is set to d = m, where m is the total number of features. However, a reasonable choice is also d = √m [3].

3.7 Fitting the model to the data

A popular approach for measuring the generalization performance of a model is the holdout method. This technique splits the dataset in two parts: the training set, upon which the model will be fitted, and an independent test set, which is later used to evaluate the performance [3]. It is less computationally expensive than other validation methods I will mention later.
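A sketch of the holdout evaluation used below, assuming X and y are the encoded feature matrix and the CPV target from the previous sections (the hyperparameter values here are only placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    rf = RandomForestRegressor(n_estimators=40, max_depth=6, random_state=0)
    rf.fit(X_train, y_train)

    # RMSE on both sets, the metric reported in the figures that follow
    rmse_train = np.sqrt(mean_squared_error(y_train, rf.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
    print(rmse_train, rmse_test)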

In order to determine the influence that each hyperparameter has on model performance, I have used the holdout method to iteratively experiment with different values for each parameter. For every hyperparameter being studied, the other hyperparameters are held constant at sklearn’s default RandomForestRegressor() values, namely:

• Number of base estimators: 10
• Tree depth: none (nodes are expanded until all leaves are pure)
• Minimum number of samples required to split an internal node: 2
• Number of features to consider for each split: d (all features)

Figure 17 shows how model performance on the training and test sets changes depending on the number of estimators. While the model is clearly overfitting, since the error on the training set is notably smaller (≈ 198) than the error on the test set (≈ 217), we can observe that model performance does not change significantly in either set after fitting 40 or more estimators. In addition, looking at the levels of tree depth, it is curious to note that performance on the test set is marginally better than on the training set for trees shallower than six levels. This may be due to the fact that data variability in the test set is slightly smaller compared to the training set. Figure 18 reveals that the error on the test set diminishes steadily until the model starts to overfit for tree depths greater than six.

Figure 17: Model performance for different numbers of estimators

Figure 18: Model performance for different tree depths

Finally, we observe how model performance changes depending on the minimum number of samples required at an internal node. If this minimum is smaller than 150, the model will tend to overfit, while for values greater than 150 the performance on both sets is similar, as shown in Figure 19. Based on the achieved results, it can be concluded that the random forest regressor tends to overfit on the data if the base decision tree estimators are not pruned by limiting the maximum depth or the minimum number of samples required to split a node.

Figure 19: Model performance for different minimal samples split values

3.8 Using cross-validation to determine best hyperparameter values

A disadvantage of the holdout method is that model performance often depends on how we partition the dataset into training and test sets [3]. Predictions may vary for different samples. The so-called k-fold cross-validation method provides a more robust approach to performance estimation. We randomly divide the training set into k folds without replacement, where k − 1 folds are used for model training and one fold is used for validation. This way we can obtain a performance evaluation that is less susceptible to the subpartitioning of the training data. K-fold cross-validation is typically adopted when trying to find the optimal hyperparameter values for a certain model [3]. Once good hyperparameter values have been found, the model is refitted on the complete training set and a final performance measure is extracted from the test set. A reasonable choice for k is 5, which is used in quite a few applications. Figure 20 summarizes this procedure.

The sklearn package offers two powerful tools that help us find the best hyperparameters: RandomizedSearchCV() and GridSearchCV(). The approach of grid search is simply a brute-force exhaustive search where we provide an array of different values for each parameter. The algorithm then tries out all of the possible combinations to extract the optimal set. However, a major disadvantage of grid search is that it is computationally expensive because it allocates too many trials to the exploration of dimensions that do not matter while suffering from poor coverage in dimensions that are important, as shown in a study by Bergstra and Bengio [11].

Figure 20: 5-fold cross-validation

On the other hand, randomized search performs a pseudo-random search on the combinations of hyperparameters. According to the sklearn documentation “in contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions.”.

Bergstra and Bengio [11] demonstrated in their study that “random experiments are more efficient than grid experiments for hyper-parameter optimization in the case of several learning algorithms on several datasets” because not all hyperparameters are equally important to tune. In addition, compared with the grid search experiments of Larochelle and others [12], random search found better models in most cases and required less computational time. Therefore, I decided to employ a random search and defined the random grid as follows (a sketch of the corresponding search setup follows the list):

– number of estimators: [50, 100, 200]
– maximum features to consider: [d, √d]
– maximum depth: [5, 6, 7, 8, 9]
– minimum samples split: [5, 10, 50, 100, 200]

There are a total of 150 candidate combinations, out of which I have chosen to consider only 100. With 5-fold cross-validation, five folds will be fitted for each of the 100 candidates, accounting for a total of 500 fits. The best combination of parameters is the one that yields the lowest mean cross-validation error. After running RandomizedSearchCV(), the best_estimator_ attribute tells us what the best estimator chosen by the search actually looks like:

– best number of estimators: 200
– features to consider for each split: d
– maximum depth: 6
– minimum samples split: 100

RMSE of the best estimator on the training set is ≈ 204.
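For reference, a search of this kind can be set up roughly as follows; the names X_train and y_train are hypothetical placeholders for the training features and the CPV target, and max_features=None corresponds to considering all d features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Random grid mirroring the values listed above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": [None, "sqrt"],
    "max_depth": [5, 6, 7, 8, 9],
    "min_samples_split": [5, 10, 50, 100, 200],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=100,                        # sample 100 of the 150 combinations
    cv=5,                              # 5-fold cross-validation -> 500 fits
    scoring="neg_mean_squared_error",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)           # X_train, y_train assumed to exist
print(search.best_estimator_)
print(np.sqrt(-search.best_score_))    # RMSE corresponding to the best mean CV score
```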

Note that the root mean squared error is a good measure in relative terms, i.e. for comparing the performance of different regressors with each other; however, on its own it does not tell us whether the fit of a model on the data is particularly good or bad. In order to evaluate the quality of the fit of the random forest regressor, sklearn provides the very useful learning_curve() function. This function computes the k-fold cross-validation scores on the training and validation sets for different sample sizes. Every value on the x-axis represents some sample size of the training set. The size of the respective validation sets can be inferred using the formula

validation size_i = training size_i / (k − 1)

where the index i indicates the ith value on the x-axis. The learning curve in Figure 21 was built using 5-fold cross-validation:

Figure 21: RF learning curve
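A curve like the one above can be computed with sklearn's learning_curve(); a minimal sketch under the same hypothetical X_train, y_train names, using the hyperparameters selected by the randomized search:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=200, max_depth=6,
                          min_samples_split=100, random_state=42),
    X_train, y_train,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
)
print(train_sizes)                    # absolute training set sizes on the x-axis
print(train_scores.mean(axis=1))      # mean training score per size
print(val_scores.mean(axis=1))        # mean validation score per size
```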

The cross-validation score lies between 0 and 1 and can be used to measure how well predictions approximate the actual CPV values. This score is the coefficient of determination R², which is often employed in linear regression. R² serves as an estimate for what is commonly referred to as the goodness of fit; I have addressed sklearn's goodness of fit estimation in a later section. From the graph it can be observed that the training score is always higher than the validation score. This is because the size of the training set is always bigger than that of the validation set, hence the model has more data to learn from. The scores on both sets show an increasing trend as the sample size gets progressively bigger. They lie relatively low, between 0.10 and 0.18.
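For reference, the default score returned by sklearn for regressors is computed as

$$ R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}, $$

where ȳ denotes the mean of the observed CPV values; a value of 1 corresponds to a perfect fit, while 0 corresponds to always predicting the mean.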


4 GRADIENT BOOSTING MACHINES REGRESSION

“If linear regression was a Toyota Camry, then gradient boosting would be a UH-60 Blackhawk Helicopter” - Ben Gorman (Kaggle master).

4.1 The gradient boosting approach

Similar to bagging, boosting is an ensemble method consisting of very simple base predictors that perform only slightly better than random guessing [3]. The main difference is that boosting adds new models to the ensemble sequentially, i.e. base models do not work together in parallel to yield a final prediction [13]. When employing gradient boosting, there is a loss (or cost) function the algorithm is trying to minimize that guides the whole procedure. For regression problems, this loss function is typically the error between the predicted and actual values of the response variable. As more simple models are introduced, the overall model becomes a stronger and stronger predictor, much like putting together many pieces of string forms a resilient rope. At each particular iteration, prediction error is decreased by adding a new weak learner which is trained with respect to the error of the whole ensemble learned so far.

To better understand the method, consider a simple case where we have a dataset of size N with only one input feature X and the response variable Y. The goal is to predict the actual values as accurately as possible. This means minimizing the mean squared error of predictions, which is the loss function in this case. Given that the model being fitted is fairly simple, we can expect some amount of error associated with every outcome observation:

ε_i = y_i − F_0(x_i)

ε_i denotes the error in the ith observation, y_i is the ith actual output, while F_0(x_i) refers to the model prediction given the ith observation of array X. The gradient boosting steps can be represented in pseudocode as follows:


Fit the model on the dataset to obtain the vector of predicted output values F_0(X).
for m = 1 to M do:
    1. Calculate the residuals vector: r_{m−1} = Y − F_{m−1}(X).
    2. Predict the residuals by training the model on r_{m−1}; let ∆_m be the predicted residuals vector.
    3. Add the predicted residuals to the previous prediction to obtain a new prediction: F_m(X) = F_{m−1}(X) + ∆_m(X).
end
return F_M

After M steps, the final vector of predicted values will be:

F_M(X) = F_0(X) + ∆_1(X) + ∆_2(X) + ... + ∆_M(X)

We iteratively “correct” the first prediction by taking small steps that bring us closer to the actual output values [14]. Notice that I did not specify what model exactly should be used for learning. In theory, any machine learning model can be implemented. This highlights the flexibility of gradient boosting, which stems from the fact that it is really just a framework for iteratively improving any weak learner. In practice, however, boosting models with regression trees as base predictors are widely used. These models are known as gradient boosting machines (GBM).
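To make the pseudocode concrete, below is a minimal sketch of such a loop with shallow regression trees as base learners; the function names gbm_fit and gbm_predict are illustrative, and the learning_rate argument anticipates the shrinkage parameter η discussed in Section 4.3 (set it to 1.0 to match the pseudocode exactly):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=100, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared-error loss with shallow regression trees."""
    f0 = np.mean(y)                          # F_0: the best constant prediction
    prediction = np.full(len(y), f0)
    trees = []
    for m in range(M):
        residuals = y - prediction           # r_{m-1} = Y - F_{m-1}(X)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # train the weak learner on the residuals
        prediction = prediction + learning_rate * tree.predict(X)  # F_m = F_{m-1} + eta * delta_m
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], f0)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction
```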

The intuition behind boosting can be easily grasped if we think of it as a golfer who initially hits a golf ball towards the hole at Y but only gets as far as F_0(X). The golfer then repeatedly taps the ball more softly, working it towards the hole, reassessing direction and distance at each stage. The diagram below illustrates this:


Figure 22: Intuition behind gradient boosting

4.2 Performing gradient descent through boosting

In short, gradient descent is an optimization algorithm for finding the local minimum of a function. The word gradient refers to a multivariate generalization of the derivative concept. Given a starting position, the gradient of that function gives us the direction of steepest ascent. This way, we know where to step in the function space in order to approach a local maximum. Naturally, because the aim is to find a local minimum, taking the negative gradient will point us closer to the local minimum. This is easiest to see in two dimensions:


If the slope is negative, x should move to the right to get closer to the minimum. In contrast, if the slope is positive, x moves to the left. The opposite sign of the slope tells us which direction to head, while the steepness tells us the magnitude of the change. The same concept is also valid for multidimensional spaces with a large number of variables; the only distinction is that the change to be applied is not a scalar, but rather a vector containing the amount of change corresponding to every variable. Note that there is no guarantee that the local minimum found by the algorithm is the smallest possible value, and finding a global minimum can be computationally expensive in real-life applications.
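A bare-bones sketch of this update rule in one dimension (the target function, step size and starting point are arbitrary examples):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)   # move opposite to the slope
    return x

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # converges towards 3
```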

The important question now is: how does gradient boosting perform gradient descent? It makes sense that bringing the predicted Ŷ closer and closer to the real Y would be gradient descent. Moreover, we know that the residual vector Y − Ŷ points in the direction of a better approximation. Since Ŷ = F_m(X), we can again think of Ŷ as the ball which the golfer is trying to “get into” Y. At every step, the direction and distance with which the golfer hits the ball is given by the predicted error. It can be proven mathematically that this distance happens to be in the opposite direction of the slope of the mean squared error loss function:

L(y, F_M(X)) = (1/N) Σ_{i=1}^{N} (y_i − F_M(x_i))²

If we take the partial derivative of the above function with respect to a specific approximation ŷ_j, it can quickly be demonstrated that the gradient (left-hand side) is the same as the negative residuals vector:

∇_ŷ L(y, ŷ) = −(y − ŷ)
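Writing out one component of the gradient makes this explicit; the constant factor 2/N only rescales the step and does not change its direction:

$$ \frac{\partial L}{\partial \hat{y}_j} = \frac{\partial}{\partial \hat{y}_j}\,\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = -\frac{2}{N}\,(y_j - \hat{y}_j). $$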

Therefore, as observed by Parr and Howard [14] “chasing the residuals vector in a gradient boosted machine is chasing the negative gradient of a loss function via gradient descent”.

4.3 Gradient boosting hyperparameter tuning

The process of fine tuning hyperparameters in a gradient boosting implementation can be split into two parts: tuning tree-specific parameters that directly affect each individual tree in the model and tuning boosting parameters that affect the boosting operation in the model.


Tree-specific parameters include classics such as the maximum depth, the minimum number of samples required to split an internal node, the number of features to consider for each split, etc. Boosting parameters include the learning rate, the number of base estimators and the size of the subsample used for training.

The learning rate (η), also known as shrinkage, is a parameter used in regularization to modify the update rule of predicted outcomes:

F_m(X) = F_{m−1}(X) + η · ∆_m(X)

where 0 < η ≤ 1. The purpose of shrinkage is to make the residual vectors converge slowly towards the observed values, thereby gaining more control over the boosting procedure and avoiding overfitting. As the learning rate increases, the number of estimators used is decreased and vice-versa, maintaining a trade-off. In addition, one of the simplest regularization procedures for GBM is subsampling. In fact, soon after introducing gradient boosting in 2001, author Jerome Friedman proposed a minor modification to the algorithm which consists in fitting the model, at each iteration, only on a subsample of the training set drawn without replacement. The idea is to introduce some randomness into the fitting procedure [13]. Subsampling requires a parameter called the “bag fraction”, which is no greater than 1 and specifies the fraction of the data to be used at each step. Choosing a bag fraction < 1 leads to a reduction in variance and an increase in bias. The most important parameters I have considered for the gradient boosting regression are listed below; a short usage sketch follows the list:

• Number of base estimators
• Learning rate
• Minimal internal node size
• Subsample (bag fraction)
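These parameters map directly onto the constructor arguments of sklearn's GradientBoostingRegressor; the values in the sketch below are purely illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=200,        # number of base estimators
    learning_rate=0.2,       # shrinkage eta
    min_samples_split=200,   # minimal internal node size
    subsample=0.8,           # bag fraction < 1 enables stochastic gradient boosting
    random_state=42,
)
# gbr.fit(X_train, y_train) would then train the ensemble on the (hypothetical) training data.
```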

4.4 Fitting the model to the data

The model evaluation procedure followed in this section is very similar to the one I followed for the random forest regression. sklearn offers a GradientBoostingRegressor() estimator that allows us to implement gradient boosting. Using the holdout method, I have tried out various values for each hyperparameter to observe how they affect model performance on the training and test sets, while keeping the remaining parameters constant at their default values². Figure 24 shows how the performance on the training and test set changes depending on the number of estimators.

Figure 24: Model performance for different numbers of estimators

Performance is similar when n_estimators ≤ 10, but the model starts to slightly overfit for more than ten base regressors. Figures 25 and 26 respectively show model behavior for different values of the minimal internal node size and the learning rate.

² Default values are:
• Number of base estimators: 100
• Learning rate: 0.1
• Minimum number of samples required to split an internal node: 2
• Subsample: 1.0 (all records)


Figure 25: Model performance for different minimal internal node sizes

As can be seen from the above graph, the model is overfitting for every selected number of samples, since the error on the training set is always slightly lower. Nevertheless, it seems that for a sample size of 100 the difference between the errors is smallest. Moreover, increasing the learning rate gradually leads to a smaller training set error and a bigger test set error, as shown in Figure 26. The value which optimizes the bias-variance trade-off is approximately 0.2.

In order to determine the best hyperparameter values, I have again used the function RandomizedSearchCV(), with the random grid defined as follows:

– number of estimators: [20, 50, 100, 200, 300]

– subsample: [0.6, 0.7, 0.8]

– learning rate: [0.1, 0.2, 0.3, 0.4, 0.5]

– minimum samples split: [20, 50, 100, 200, 300, 400]

Cross-validation is 5-fold, while the total number of parameter combinations sampled was 100. By calling the best_estimator_ attribute we obtain the best estimator found by the search:


– best number of estimators: 200
– best subsample fraction: 0.8
– best learning rate: 0.2
– best minimum samples split: 200

RMSE of the best estimator on the training set is ≈ 205.

Figure 26: Model performance for different learning rates

Finally, looking at Figure 27 we get an idea of how model performance changes in response to increasing training and validation set sizes. The learning curve was obtained using 5-fold cross-validation. The goodness of fit on the training set seems higher for smaller datasets and keeps diminishing as the sample size gets bigger. The scores of both sets become more similar as more data is introduced.


5 MULTI-LAYER PERCEPTRON REGRESSION

Artificial Neural Networks (ANN), or just neural networks, are one of the most widely used machine learning techniques to date. They have been demonstrated to achieve outstanding performance on complex problems such as image recognition or natural language processing. Their ability to handle a wide range of problems makes them ideal “black box” implementations. A widely used variant of neural networks is the multi-layer perceptron (MLP) model, which I have decided to adopt in this paper.

The mathematical notation behind how a neural network learns is initially not easy to grasp. In this chapter I have focused on the main aspects of the core algorithm that drives the entire procedure, aiming to give a more intuitive explanation rather than focusing too much on the notation, since it goes beyond the scope of this study.

5.1 The architecture of a neural network

The underlying structure of a neural network is loosely inspired by the way biological neural networks in the human brain process information. The basic unit of computation is called a neuron, or perceptron. In a classic setting, a perceptron receives binary inputs and produces a single binary output, as shown below:

Figure 28: A simple perceptron

In order to compute the output from the inputs, weights and thresholds are introduced. Weights are real numbers expressing the importance of the respective inputs to the output [15]. A threshold, on the other hand, is directly associated with a perceptron and reflects a bias that also influences the result of the perceptron. The bias is the negative threshold; thinking of the bias in terms of the negative threshold is a convention that helps formulate the output computation in a neat way. Figure 29
