
POLITECNICO DI MILANO

School of Industrial and Information Engineering
Master of Science in Mathematical Engineering

A discrete Bandit approach to continuous Hyperparameter Optimization

Supervisor: Dr. Daniele Loiacono
Author: Davide di Nello, student ID 875672
Academic Year 2018-2019


Abstract

In machine learning, the step of choosing the right algorithm parameters is known as Hyperparameter Optimization, and it is a growing field of research due to the exploding popularity of Data Science. This document addresses the tuning of the hyperparameters of a boosted trees model (a popular machine learning algorithm): it presents some of the common approaches to this type of optimization and introduces a new algorithm, developed as a more stable adaptation of Hyperband, one of the best-known methods in the literature.


Sommario

In the English-language Machine Learning literature, the process of choosing the right parameters for an algorithm is known as Hyperparameter Optimization, and it is a topic of growing interest due to the great popularity gained by Data Science. This document presents some of the most common techniques for the optimization of these metaparameters, and introduces a new algorithm whose aim is to improve Hyperband, one of the most famous optimization methods in the literature, in the specific case in which it is used on algorithms based on boosting over tree models.


Ringraziamenti

I thank my parents, for having helped, motivated, and supported (and perhaps also put up with) me, despite my wandering.

I thank my lifelong friends, in the hope that they never change.

I thank my university classmates, in Italy as in France, for making even the hardest moments enjoyable.

I thank all the people with whom I have shared a basketball and a parquet floor; without basketball I would probably have graduated sooner and with better grades, but I would never have been as happy.


Contents

Abstract

Ringraziamenti

1 Introduction
1.1 Motivation and Objectives
1.2 Outline of contents

2 State of the Art
2.1 Introduction
2.2 Definition
2.3 Model Free Optimization
2.3.1 Grid Search
2.3.2 Random Search
2.3.3 Quasi Random Search
2.3.4 Population Based Methods
2.4 Bayesian Optimization
2.4.1 Surrogate Models
2.4.2 Acquisition Functions
2.5 Multi-fidelity Optimization
2.5.1 Hyperband
2.5.2 Freeze-Thaw Bayesian Optimization
2.5.3 Fabolas
2.6 Summary

3 Best Arm Identification Hyperband
3.1 Introduction
3.2 The effect of Learning Rate on the Hyperband algorithm
3.2.1 Learning Rate
3.2.2 Bias in the optimization
3.2.3 Intuition behind the algorithm
3.3 Multi Armed Bandits
3.4 Bayesian updates
3.5 Thompson Sampling
3.6 Best Arm Identification Hyperband (BAIHB)
3.6.1 Epsilon greedy BAIHB
3.6.2 TS BAIHB
3.7 Better choices of the distributions

4 Implementation and Results
4.1 Introduction
4.2 Dataset
4.3 Machine Learning Algorithm
4.4 Experiment design
4.4.1 Hyperband
4.4.2 Random Search
4.4.3 Thompson Sampling BAIHB
4.4.4 Number of equivalent evaluations
4.5 Results
4.6 Summary and Final Considerations

Bibliografia

A Appendix
A.1 Derivation of Expected Improvement for Gaussian random variables
A.2 Original Hyperband algorithm


Chapter 1

Introduction

With the broadest of definitions, Machine Learning is the science of teaching computers how to act without programming them explicitly, and it is one of the fields of engineering that has exploded in popularity over the past decade. It is a discipline that crosses the boundaries of mathematics, statistics, and programming, and finds applications in almost any domain thanks to the availability of data in this new Information Age.

The financial industry, as is often the case, is pushing the limits of new technologies, and Machine Learning algorithms have been applied to the financial markets since the late 1990s. Nevertheless, thanks to the fast evolution that this field is experiencing, new technologies and algorithms are always waiting to be developed and tested.

1.1

Motivation and Objectives

Financial data is very peculiar and particularly challenging for data scientists. The efficiency of the market makes the signal-to-noise ratio almost non-existent, making it extremely hard to find recognizable patterns. For this reason, many statistical and machine learning practices that work well


in other environments are ineffective when dealing with financial data. During one of my latest working experiences, I spent quite some time on the automation of procedures to optimize the hyperparameters of machine learning algorithms for financial forecasting; these are the parameters related to the implementation of the models themselves, often defining their complexity and strongly impacting their performance. Examples include the penalization coefficient of a Lasso Regression, the depth of the trees of a Random Forest, or the number of neurons of a Deep Neural Network. The goal of this document is to present the results of my work, and to show an application of hyperparameter optimization to financial data through a new algorithm which I developed during that time to improve the robustness and efficiency of Hyperband, a well-known automated hyperparameter tuning algorithm.

1.2

Outline of contents

The present work will be structured as follows:

• Chapter 2 will define the hyperparameter optimization problem and analyze the current literature, presenting both the classic methods and some of the latest techniques.

• Chapter 3 will serve as a presentation of the original work, explaining its inception, detailing the underlying mathematical concepts and presenting the algorithm itself.

• Chapter 4 will contain some practical results on a real set of data, showing the potential advantages of this algorithm over the one it is designed to improve.

• Chapter 5 will hold the conclusions, final remarks, and suggestions about future work to be developed following this document.


Chapter 2

State of the Art

2.1

Introduction

Every Machine Learning algorithm depends on intrinsic parameters that help specify its complexity. We can find examples of these parameters in linear models, for example regularization coefficients, in unsupervised methods, such as the number of centroids of K-Means, or in deep architectures, where parameters define the shape, dimension, and functionality of the network's layers. The tuning of these values is called Hyperparameter Optimization (HPO from now on) and it is one of the main steps of machine learning. Lately, the rising complexity of models, as well as the push for automation in this field, has boosted the research for fast and efficient HPO algorithms. Indeed, algorithms are more than ever dependent on the setting of their parameters, as the number of available configurations grows exponentially with the complexity of the models themselves, as does the computational power required to train them. With these premises, automated HPO can bring several advantages [11]:

• Reducing the manual effort and time required to apply machine learning procedures - many HPO algorithms nowadays work by budgeting the evaluation of expensive functions to speed up the tuning process.

• Improving the performance of machine learning algorithms - tailoring the hyperparameter configuration to the problem at hand can result in drastic improvements over the "suggested configurations" provided by library documentation.

• Improving the reproducibility of studies - reducing the manual intervention of humans in the parameter choice makes work easier to reproduce, as it removes the variability of personal choices.

Several factors make HPO a tough challenge and differentiate it from classical mathematical optimization problems. To begin with, function evaluations can be extremely expensive, both in terms of time and money. For example, a single evaluation of a Deep Learning architecture entails its full training with a given parametrization, which can take hours, days, or even weeks, and may require renting an online cluster for computations. Another reason is that the configuration space is often very complex, containing a mix of continuous, categorical and conditional hyperparameters, sometimes without a clear meaning or range. Furthermore, the loss function is often retrieved via a black-box approach, without any (or with limited) knowledge of mathematical properties such as convexity, smoothness or derivatives (i.e. no Gradient Descent). Finally, as often happens in statistics, limited training data is available, but the performance of the optimization should generalize.

In the following sections, we will give a formal definition of the problem and provide an overview of the current literature, going from classical practices to the most recent algorithms.


2.2

Definition

Let us call A a machine learning algorithm with N hyperparameters, lying in the configuration space $\Lambda = \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_N$. The single domains of the hyperparameters can be real-valued (e.g. penalization coefficients, learning rate), integer-valued (e.g. depth of a tree, number of layers) or categorical (e.g. loss metric). The configuration space can even contain some conditionalities, meaning that some parameters may only be relevant for certain values of a different parameter. This happens often in the case of neural networks, since the choice of the architecture becomes a parameter itself. In that case, parameters related to the optimization of a given layer are dependent on the existence of the layer itself, and consequently on the total number of layers.

On a general note, given a dataset D drawn from a population $\mathcal{D}$, our goal is to find

$$\lambda^\star = \operatorname*{argmin}_{\lambda \in \Lambda} \; E_{\mathcal{D}}\left[\mathcal{L}(A_\lambda)\right]$$

i.e. the configuration λ that minimizes the loss $\mathcal{L}$ over the population $\mathcal{D}$, for algorithm A. However, in practice we only have a limited amount of data, and we will try to use this available data to generalize our choices as much as possible, by choosing:

$$\lambda^\star = \operatorname*{argmin}_{\lambda \in \Lambda} \; E_{D}\left[V(\mathcal{L}, A_\lambda, D_{train}, D_{val})\right]$$

where $V(\mathcal{L}, A_\lambda, D_{train}, D_{val})$ measures the loss generated by algorithm A with configuration λ, using validation protocol V on dataset D. Popular choices for $V(\cdot, \cdot, \cdot, \cdot)$ are holdout validation, k-fold cross validation, or walk-forward validation (for time series data), which are well-known methods in statistics [23] introduced in the late 1960s.
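
To make the validation protocol V concrete, the minimal sketch below evaluates a single configuration λ with k-fold cross validation using scikit-learn; the gradient boosting estimator and the specific hyperparameter names are illustrative assumptions, not the exact setup used later in this work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def validation_loss(config, X, y, k=5):
    """V(L, A_lambda, D): average k-fold validation loss of one configuration."""
    model = GradientBoostingRegressor(
        learning_rate=config["learning_rate"],
        max_depth=config["max_depth"],
        n_estimators=config["n_estimators"],
    )
    # scikit-learn maximizes scores, so we negate the (negative) MSE to get a loss
    scores = cross_val_score(model, X, y, cv=k, scoring="neg_mean_squared_error")
    return -scores.mean()

# Example usage with random data and one candidate configuration
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
print(validation_loss({"learning_rate": 0.1, "max_depth": 3, "n_estimators": 50}, X, y))
```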

Another valid approach to HPO used in machine learning, but which finds its origin in classical Bayesian statistics, is ensembling. This approach suggests averaging the outputs of the models resulting from multiple (possibly uncorrelated) configurations rather than only using the best one, thus improving generalization performance [27]. From an HPO perspective, this means that our algorithm would be better off finding many "good" configurations rather than a single "great" one.

2.3

Model Free Optimization

In the following sections, we will go through the main methods used for HPO, starting with the simplest practices used by data scientists and moving to the latest algorithms on multi-fidelity optimization. One method that will not be included is manual optimization. To some, it might seem naive even to discuss it, but this is still the most valid method for many practitioners, as some of the best Data Scientists on Kaggle (a popular website for Machine Learning contests, [1]) admit. The main problem with this approach is that it is extremely time-consuming, and even though with enough experience and knowledge it can lead to sound solutions, it cannot be used to implement an automated production system.

2.3.1 Grid Search

The most basic method for parameter optimization is Grid Search (full factorial design [28]). As the name suggests, it implies the evaluation of the model on a preselected discrete grid of the parameters. The user selects a finite set of values for each hyperparameter, usually at a uniform distance (on a linear or logarithmic scale), and generates a grid as the cartesian product of these sets. The simplicity of this approach is however heavily outweighed by its drawbacks; mainly, Grid Search suffers terribly from the curse of dimensionality, since the number of evaluations grows exponentially with the size of the configuration space. Nevertheless, this is still a very common method among practitioners, often combined with sequential manual optimization.


2.3.2 Random Search

A very simple and yet effective alternative is Random Search. This method has been used for a long time but was popularized in 2012 by a paper by Bergstra and Bengio [4] that aimed at demonstrating its superiority to Grid Search. Random Search works by choosing each configuration by sampling uniformly (or log-uniformly) on the domain of each hyperparameter. This works very well in any case where the importance of the parameters differs significantly; a visual representation can be seen in Figure 2.1. The reason for this is that, given a fixed budget of evaluations B and a dimensionality of the configuration space N, Grid Search will evaluate each hyperparameter at only $B^{1/N}$ distinct values, while Random Search will explore B different values. Another major advantage of Random Search over Grid Search is that it is easier to parallelize: workers do not need to communicate to perform parallel evaluations, and any failure on a worker does not impact the whole design of the experiment, granting a more flexible allocation of resources.

Random Search is often considered a good baseline for HPO methods and it is often integrated within other algorithms, in particular for early exploration of the configuration space. Of course, it also has some pitfalls; for starters, it is a purely exploratory algorithm and as such it will not learn from past iterations and might end up sampling configurations very close to each other even if they have extremely bad results. Therefore, in many settings guided search algorithms tend to perform far better, as shown in many papers [25, 10, 3].
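
As a minimal illustration of the two sampling schemes discussed above, the sketch below builds a grid and an equally sized random design over a toy two-dimensional space; the parameter names and ranges are invented for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)
budget = 16  # total number of configurations we can afford to evaluate

# Grid Search: with 2 hyperparameters, a budget of 16 gives only 16**(1/2) = 4
# distinct values per dimension.
lr_grid = np.logspace(-3, 0, 4)      # hypothetical learning-rate values
depth_grid = np.linspace(2, 8, 4)    # hypothetical tree-depth values
grid_configs = list(itertools.product(lr_grid, depth_grid))

# Random Search: the same budget explores 16 distinct values of each parameter.
random_configs = [
    (10 ** rng.uniform(-3, 0), rng.integers(2, 9))  # log-uniform lr, uniform depth
    for _ in range(budget)
]

print(len(grid_configs), len(set(lr for lr, _ in grid_configs)))      # 16, 4
print(len(random_configs), len(set(lr for lr, _ in random_configs)))  # 16, 16
```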

Figure 2.1: Graphical representation of the difference between Grid Search and Random Search over a 2-D space, with one important and one unimportant parameter. Random Search allows for a better exploration of the important parameter [4]

2.3.3 Quasi Random Search

A slight variation of Random Search, also discussed by Bergstra and Bengio in their paper [4], involves the use of quasi-random number generation. Techniques of this type - such as Sobol sequences or Latin Hypercube Sampling - have been developed for Monte Carlo algorithms and have better convergence properties than Random Search [5, 15], granting a more uniform exploration of the space by avoiding sampling points that are too close together. Obviously, HPO differs from Monte Carlo simulations, as the number of sampled points is lower, even by many orders of magnitude, but even if these techniques are asymptotically equivalent to Random Search, they are proven to have a better convergence curve, achieving good results in a shorter time.
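
The snippet below sketches how such low-discrepancy designs can be generated with SciPy's quasi-Monte Carlo module (assuming SciPy >= 1.7); the two-dimensional unit-cube setup and the scaling to hypothetical hyperparameter ranges are illustrative.

```python
from scipy.stats import qmc

# Sobol sequence: 16 quasi-random points in the 2-D unit cube
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
sobol_points = sobol.random_base2(m=4)  # 2**4 = 16 points

# Latin Hypercube Sampling over the same space
lhs = qmc.LatinHypercube(d=2, seed=0)
lhs_points = lhs.random(n=16)

# Scale the unit-cube samples to hypothetical hyperparameter ranges,
# e.g. learning rate in [1e-3, 1e0] (log scale) and tree depth in [2, 8]
scaled = qmc.scale(sobol_points, l_bounds=[-3, 2], u_bounds=[0, 8])
learning_rates, depths = 10 ** scaled[:, 0], scaled[:, 1].round()
print(learning_rates[:3], depths[:3])
```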

2.3.4 Population Based Methods

Population-based methods are a large family that includes multiple algorithms, such as genetic algorithms, evolutionary strategies and particle swarms. These algorithms usually define a population (which in our case is a set of configurations) and try to improve it over multiple iterations by modifying (mutations) or combining (crossovers) some of its members, thereby creating new "generations" of the population [31]. Some of these methods, such as CMA-ES (Covariance Matrix Adaptation Evolutionary Strategy) [26], have obtained some of the best performances for black-box function optimizations. However, in this document, I will not dig deeper into these, as I preferred focusing on different methods.

2.4

Bayesian Optimization

Bayesian optimization techniques have become very popular in recent years for HPO tasks, obtaining extremely competitive results in the tuning of machine learning algorithms such as neural networks for image classification or speech recognition. The main point of Bayesian optimization is to use the information extracted from previously tested configurations to make better sampling choices, therefore "guiding" the search towards more promising areas of the configuration space.

Bayesian Optimization algorithms can differ strongly, but they are all based on two key parts: a probabilistic model and an acquisition function. The first is a surrogate model over the space of configurations which gives a probabilistic estimate of the performance of the machine learning algorithm based on the points sampled so far; the second is a function defined on the surrogate model that is used to score candidate points and choose where to sample next, balancing the exploration of the space and the exploitation of current information.

Algorithm 1 Meta algorithm for Bayesian Optimization

1: Build a surrogate probability model for the objective function

2: Find the hyperparameters that perform best on the surrogate by means of the acquisition function

3: Apply the chosen configuration to the true objective function

4: Update the surrogate model incorporating the new point

5: Repeat steps 2-4 until max iteration or max time reached


2.4.1 Surrogate Models

The most common approach to building a surrogate model for a target black-box function is through the use of Gaussian processes. Gaussian processes (GP) are a very convenient way to express a prior distribution over functions, and are defined such that any finite set of M configurations $\{x_n\}_{n=1}^{M}$ induces a multivariate Gaussian distribution on $\mathbb{R}^M$ [29]. Moreover, they allow to analytically compute the predictive distribution for new configurations given the sampled ones. Given a mean function $\mu : \mathcal{X} \to \mathbb{R}$, a covariance function $\kappa : \mathcal{X}^2 \to \mathbb{R}$, and $f \sim \mathcal{GP}(\mu, \kappa)$, we have that $f(x)$ follows a Gaussian distribution $\mathcal{N}(\mu(x), \kappa(x, x))$ for every $x \in \mathcal{X}$. Then, if we have a set of n observations $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ from our function, we know the form of the predictive distribution $f_n \sim \mathcal{GP}(\mu_n, \kappa_n)$, which will have a posterior mean and covariance function given by:

$$\mu_n(x) = k^T (K + \eta^2 I)^{-1} Y$$

$$\kappa_n(x, x') = \kappa(x, x') - k^T (K + \eta^2 I)^{-1} k'$$

where $Y \in \mathbb{R}^n$ is the vector containing the observations $y_i$, while $k, k' \in \mathbb{R}^n$ are defined as $k_i = \kappa(x, x_i)$, $k'_i = \kappa(x', x_i)$, and $K \in \mathbb{R}^{n \times n}$ is characterized by $K_{i,j} = \kappa(x_i, x_j)$. The parameters of the GP itself, such as those of the kernel (covariance) function, can usually be updated through maximum likelihood or Markov Chain Monte Carlo integration [12].

GPs have achieved great results compared to vanilla Random Search, but they have a major drawback, which is their limited scalability: the computation of the kernel function scales like $O(N^3)$ with the number of sampled configurations and like $O(N^2)$ with the dimensionality of the configuration space. For this reason, other approaches have been proposed, such as approximate Gaussian processes, sparse Gaussian processes and other methods based on different techniques such as Random Forests.
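
A minimal numerical sketch of the posterior formulas above, using a squared-exponential kernel on a 1-D toy problem; the kernel choice, length scale and noise level are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance kappa(a, b) between two sets of points."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

# Toy observations (x_i, y_i) and query points where we want the posterior
X_obs = np.array([0.0, 1.0, 2.5])
Y_obs = np.array([0.5, -0.2, 0.7])
X_new = np.linspace(0.0, 3.0, 5)
noise = 0.1  # eta in the formulas

K = rbf_kernel(X_obs, X_obs)          # K_{i,j} = kappa(x_i, x_j)
k_star = rbf_kernel(X_new, X_obs)     # rows are k(x, x_i) for each query point x
K_inv = np.linalg.inv(K + noise**2 * np.eye(len(X_obs)))

post_mean = k_star @ K_inv @ Y_obs                                  # mu_n(x)
post_cov = rbf_kernel(X_new, X_new) - k_star @ K_inv @ k_star.T     # kappa_n(x, x')
print(post_mean, np.diag(post_cov))
```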

Another popular surrogate model, proposed by Bergstra [3], is the Tree Parzen Estimator (TPE). Contrary to Gaussian processes, this algorithm does not directly model the posterior distribution of our function given sampled points, p(y|λ), but rather p(λ|y). Starting from a percentile α on the target function, sampled observations are clustered into "good" and "bad", and their densities (respectively l(λ) and g(λ)) are estimated via Kernel Density Estimation. This allows for easy optimization of Expected Improvement, a popular acquisition function, and for very good scalability both in the number of sampled points and in the number of dimensions (O(N)).

2.4.2 Acquisition Functions

Acquisition functions have the goal of indicating which configurations would be most interesting to test next, given current information. Some simple examples are Probability of Improvement and Expected Improvement, which are defined as follows (for a configuration λ, and current sampled minimum $f^\star$):

$$PI(\lambda) = E\left[\mathbb{1}_{f(\lambda) < f^\star} \mid \lambda, D\right]$$

$$EI(\lambda) = E\left[\max(f^\star - f(\lambda), 0) \mid \lambda, D\right]$$

These two particular functions are pretty much self-explanatory and they are easy to compute, in particular when (as in the case of GPs) our surrogate model can be described via Gaussian distributions. Moreover, Expected Improvement is one-step optimal, meaning that it always proposes the best possible configuration to achieve a new minimum on the next step [12].
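
For a Gaussian surrogate with posterior mean μ(λ) and standard deviation σ(λ), both acquisition functions above have closed forms; the small sketch below computes them with SciPy, where the example values of μ, σ and the incumbent minimum are arbitrary.

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    """PI for a minimization problem under a Gaussian posterior N(mu, sigma^2)."""
    return norm.cdf((f_best - mu) / sigma)

def expected_improvement(mu, sigma, f_best):
    """EI = E[max(f_best - f, 0)] under a Gaussian posterior N(mu, sigma^2)."""
    z = (f_best - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

# Arbitrary example: incumbent loss 0.30, candidate with mean 0.28 and std 0.05
print(probability_of_improvement(0.28, 0.05, 0.30))
print(expected_improvement(0.28, 0.05, 0.30))
```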

Newer acquisition functions such as Entropy Search and Knowledge Gradient are instead less intuitive from a probabilistic standpoint, and they look for configurations achieving a greater decrease in the differential entropy of the surrogate model. These are usually not available in closed form, and therefore need more computation time, but they have been shown to achieve consistently good results on a set of different problems [14]. Nevertheless, given its interpretability and its perfect match with the TPE model, Expected Improvement remains the most popular acquisition function in the field.

Figure 2.2: Representation of three steps of a Gaussian process Bayesian optimization algorithm (for maximisation) with a generic acquisition function over a 1-D domain. At every step we compute the cheap acquisition function over the domain, find its maximum, use it as the next sample point for the expensive objective function, and finally update our Gaussian surrogate model with the new information.

2.5

Multi-fidelity Optimization

Multi-fidelity Optimization is a methodology for HPO that involves the use of low-budget evaluations to speed up the exploratory process. A good formal definition is given by Kandasamy [19]: let us assume that we want to maximise a function $f : \mathcal{X} \to \mathbb{R}$ over its domain $\mathcal{X}$. Our goal is to find $x^\star = \operatorname*{argmax}_{x \in \mathcal{X}} f(x)$, and consequently $f^\star = f(x^\star)$.

Multi-fidelity optimization assumes the existence of a fidelity space $\mathcal{Z}$ and a function $g : (\mathcal{Z} \times \mathcal{X}) \to \mathbb{R}$, related to f via $f(\cdot) = g(z_\bullet, \cdot)$, where $z_\bullet \in \mathcal{Z}$. Under the assumption that evaluations $g(z, \cdot)$ at a cheaper fidelity $z \in \mathcal{Z}$ are informative about our target function $g(z_\bullet, \cdot)$, our goal becomes to maximise f through evaluations of g, by exploring the space through low-fidelity evaluations and refining our search with high-fidelity tests only in the most promising regions of $\mathcal{X}$. In most common cases, the fidelity space represents the amount of training data used for the model or the length of the training, and $z_\bullet$ corresponds to a thorough training with all available points. A good example for a neural network would be $\mathcal{Z} = [0, N_\bullet] \times [0, T_\bullet]$, where the two intervals represent the number of points and the number of epochs. In this case $z_\bullet = (N_\bullet, T_\bullet)$ would be an evaluation with the full training dataset and with the highest number of epochs allowed.

Recent years have seen a surge in popularity of this field, with many research groups from prestigious universities focusing on this topic [19, 25, 22, 21, 32]. We will concentrate in particular on three recently published algorithms that use different approaches: Hyperband is a simple numerical algorithm that works as a "smartly budgeted" Random Search, Freeze-Thaw Bayesian Optimization tries to model and predict the learning curve of iterative models, and FABOLAS fits a Gaussian Process regression model onto the extended fidelity space $\mathcal{Z}$.

Figure 2.3: The plot shows an iteration of Successive Halving. The validation error of a set of configurations is displayed as a function of assigned resources (training time). As one can see, more promising configurations are trained longer, while bad ones are dropped at every step.

2.5.1 Hyperband

The first algorithm that we will introduce for multi-fidelity optimization is called Hyperband, and it was published in 2016 by Li and Jamieson [25]. The algorithm itself is an extension of Successive Halving, which was proposed by Jamieson and Talwalkar in 2015 [18]. The idea of Successive Halving is straightforward: allocate budget uniformly to a set of configurations, evaluate their performance, throw out the worst-performing half of them and then repeat until only one configuration is left. At every iteration, the total allocated budget is constant while the number of configurations decreases, and therefore the algorithm allocates exponentially increasing resources to the configurations that survive each round.


Figure 2.4: Summary of a Hyperband run. Every block represents a bracket with new configurations being tested, and it is indexed by s. In every bracket, configurations are selected sequentially by testing ni of them with budget ri.

The main drawback of Successive Halving is its dependence on the choice of the number of starting configurations n and the budget B. The solution introduced by Li addresses this issue by proposing a sort of hedging strategy that iterates over multiple values of n. In substance, Hyperband loops over Successive Halving, changing every time the values of n - the number of tested configurations - and r - the minimum resources to be allocated to a single evaluation. The terminology used in the paper refers to each separate iteration level of Successive Halving as a "bracket", and we will use the same in the rest of this document. Brackets are designed to use approximately the same amount of resources but with an ever-changing allocation of n and r.

A good way of understanding the algorithm is looking at Figure 2.4. The algorithm needs two parameters, R and η; the first represents the maximum budget we are allowed to allocate to a single configuration, while the second is the inverse of the ratio of retained configurations at every step. In the example, R = 81 and η = 3, therefore the algorithm uses five brackets. Within every bracket an initial number of configurations $n_0$ is tested with budget $r_0$. The best $n_1 = \lfloor n_0 / \eta \rfloor$ configurations are kept and re-evaluated with budget $r_1 = \eta r_0$, and so on until we reach $r_i = R$. By using different starting values for $n_0$, the algorithm hedges its bets between aggressive early stopping and plain exploration, with a worst-case scenario equivalent to a slower Random Search (given by the last bracket). [Full algorithm in Appendix A.2]
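
The sketch below implements the bracket logic just described for the R = 81, η = 3 example; `evaluate(config, budget)` is a placeholder for training a configuration with a given number of iterations and returning its validation loss, and is assumed here rather than taken from the original implementation.

```python
import math
import random

def successive_halving(configs, r0, R, eta, evaluate):
    """Run one bracket: keep the best 1/eta of configurations at each rung."""
    budget = r0
    while configs:
        losses = sorted(((evaluate(c, budget), c) for c in configs), key=lambda t: t[0])
        if budget >= R:
            return losses[0]                      # (best loss, best config)
        configs = [c for _, c in losses[: max(1, len(losses) // eta)]]
        budget *= eta

def hyperband(sample_config, evaluate, R=81, eta=3):
    s_max = int(round(math.log(R) / math.log(eta)))           # number of brackets - 1
    best = None
    for s in range(s_max, -1, -1):
        n0 = int(math.ceil((s_max + 1) * eta**s / (s + 1)))   # initial configurations
        r0 = R * eta ** (-s)                                   # initial budget each
        configs = [sample_config() for _ in range(n0)]
        result = successive_halving(configs, r0, R, eta, evaluate)
        if best is None or result[0] < best[0]:
            best = result
    return best

# Example usage with a toy objective: the "loss" improves with budget and depends
# on a single hypothetical hyperparameter x.
toy_eval = lambda c, b: (c["x"] - 0.3) ** 2 + 1.0 / b + 0.01 * random.random()
print(hyperband(lambda: {"x": random.random()}, toy_eval))
```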

This algorithm has been tested on a multitude of datasets, providing significant speedups compared to Bayesian Optimization methods (×5 to ×30), in particular when applied to kernel methods and neural networks. One of its main drawbacks is that the search is not "really" guided, and therefore its performance will suffer on long runs compared to Bayesian methods. For this reason, a Bayesian version of this algorithm, called BOHB [10], was developed in 2019, guiding the choice of new configurations through a Tree Parzen Estimator approach.

2.5.2 Freeze-Thaw Bayesian Optimization

A more complex but extremely interesting approach to multi-fidelity optimization is given by [32]. For the sake of simplicity, we will stay clear of the math presented in the paper as much as possible, but the interested reader can find more detailed explanations in the original publication. The idea behind Freeze-Thaw is to build a regression model that allows predicting the learning curve of an iterative training, and to do so via Gaussian Processes. In particular, the authors make the (fair) assumption that learning curves roughly follow an exponentially decaying function of the form $e^{-\lambda t}$, and use this as a building block for the kernel of their Gaussian Process. Instead of using a finite set of values for λ, Swersky and Snoek integrate over the whole $[0, \infty)$ range using a non-negative mixing measure ψ; therefore, the GP kernel function between two time steps takes the following form:

$$k(t, t') = \int_0^{\infty} e^{-\lambda t} e^{-\lambda t'} \, \psi \, \mathrm{d}\lambda$$

The authors then suggest using a Gamma distribution for the mixing measure, as this leads to simplifications and to an analytic solution of the integral.


Figure 2.5: (a) Directed acyclic graph representing the conditional independence model used for training. Every row represents a learning curve that is drawn from an independent GP prior, conditioned on its mean. The mean of each learning curve is jointly drawn with the means of the other learning curves using another GP prior. (b) Partially completed training curves, and GP prediction of their asymptote. (c) Posterior GP prediction at the asymptote. Each colored point represents the GP prediction at the configuration corresponding to the training curve of the same color. [Snoek]

Finally, the authors make an assumption of independence between training curves, conditioned on a prior mean. This is, of course, a simplification, but it is an essential one, since a Gaussian Process fitting all configurations and all training points would scale like $O(N^3 T^3)$ in the number of configurations N and the number of curve training points T. Conditional independence implies that every learning curve is drawn from a separate GP whose mean is itself drawn from a global GP; this allows the whole algorithm to scale like $O(N^3 + T^3 + N T^2)$.

As the training of a curve is occurring, the algorithm computes the Expected Improvement for the current training configuration as well as for new untested ones - if a new configuration looks more promising than the current one given the state of the learning curve, the training is frozen and the algorithm switches to the new configuration. Moreover, a set of (10) past learning curves is kept at all times with updated Expected Improvements, and old configurations can be picked up again and trained from where they had previously been frozen.

The main strength of this algorithm is the exploitation of partial information as training is happening, and it is a good solution for problems where the fidelity is related to the number of iterations/epochs. Its main problem is that, like every GP-based algorithm, it has cubic complexity, and this can be a problem in situations with very high-dimensional configuration spaces where many different tries are needed.

2.5.3 Fabolas

Fabolas - an acronym for FAst Bayesian Optimization on LArge data Sets - is another approach to hyperparameter optimization based on Gaussian Processes, published by Klein and Falkner [21].

This approach models the loss of a machine learning algorithm across dataset sizes; using the notation of the section on multi-fidelity optimization, Fabolas fits a Gaussian Process over the space $(\mathcal{Z} \times \mathcal{X})$, where $\mathcal{Z} = [0, 1]$ is a unit interval representing the fraction of the dataset being used, with $z_\bullet = 1$ being the full dataset. Then, Entropy Search is used to find out how much can be learned about performance on the full dataset from an evaluation at any z. Having a Gaussian Process on the whole space $(\mathcal{Z} \times \mathcal{X})$ allows predicting the performance of the algorithm on the full dataset without necessarily having to evaluate there, but this is only true if the regression model correctly captures the behaviour of the loss with respect to the dataset size used. It is safe to assume that the loss of a machine learning algorithm decreases with the size of the dataset used, and for this reason the authors suggest the use of a factorized kernel, made of a standard kernel over hyperparameter configurations (on $\mathcal{X}$ only) and a covariance function in z (on $\mathcal{Z}$ only):

$$k((x, z), (x', z')) = k_{5/2}(x, x') \cdot \left(\phi^T(z) \, \Sigma_\phi \, \phi(z')\right)$$

The authors choose $\phi_f(z) = (1, (1 - z)^2)^T$ to enforce the monotonicity of the loss with respect to the dataset size. Moreover, Fabolas uses an acquisition function that models not only the potential gain in information but also the computational cost of sampling at a given z, trying to balance both in the smartest way possible.

When tested against other HPO algorithms on similar problems, Fabolas has shown great performance, in particular with large datasets and medium to small configuration spaces. Once again, its weakness lies in the use of Gaussian Processes and their limited scalability in the number of points, which does not allow it to work well on problems where the dimensionality of the space is too high or where the maximum is particularly tricky to find.

2.6

Summary

In this chapter, we have defined the problem of Hyperparameter Optimization and presented both some of the most common and some of the latest algorithms on the market. Recent algorithms have been shown to achieve better results, but they are often more complicated to set up and to use, which is why many practitioners still prefer manual techniques on easier problems. Nevertheless, the inclusion of environment variables such as dataset size, number of iterations, etc., and the definition of multi-fidelity optimization, is probably one of the greatest recent advances in this field and is still a very active area of research.


Chapter 3

Best Arm Identification Hyperband

3.1

Introduction

The goal of this chapter is to introduce a new algorithm that tries to address the shortcomings of the previously mentioned Hyperband algorithm. There are many reasons behind the choice of this algorithm rather than the other ones presented in the last chapter. First of all, it is intuitive and does not make assumptions on the optimization space $(\mathcal{Z} \times \mathcal{X})$. By avoiding the use of Gaussian Processes, it also scales better, in terms of speed, to higher-dimensional spaces, and its implementation from scratch into production code is much quicker.

The first step of the chapter will be to present some of the pitfalls that I encountered when using the Hyperband algorithm and the main ideas behind the modifications that I wished to include. Then, I will discuss some concepts about Bayesian Statistics and Multi-Armed Bandit models that will be useful when presenting the new algorithm. Finally, I will present the full algorithm and discuss its use and implications.


3.2

The effect of Learning Rate on the Hyperband algorithm

When introducing multi-fidelity algorithms in the previous chapter, we briefly explained that common choices for the fidelity (or budget) space are dataset size and training time. For iterative algorithms such as tree-based models, additive models, or neural networks, the number of iterations is also a valid choice, as it allows for an easier implementation and has a more direct and understandable impact on the training of a model. Moreover, many of these iterative algorithms, like the Gradient Boosted Machines used during this study, allow saving partially trained models, loading them, and resuming training. In the context of Hyperband, this is an advantage that can save a huge amount of training time, and that is not available when budgeting through the amount of data.

3.2.1 Learning Rate

Before diving into the details of the algorithm and its problems, we need to take a step back and define what the learning rate is. Some algorithms, such as Neural Networks, train their parameters using Gradient Descent (or some variant of it), an iterative method that updates the parameters by following the negative gradient of the chosen loss function. Others, such as Gradient Boosted Machines and Generalized Additive Models, are built through an iterative additive procedure that greatly resembles Gradient Descent. The learning rate is the hyperparameter η common to these procedures that controls the size of the update at every step. This parameter has proven to be crucial to the training: values that are too high will make the iterative process diverge, while values that are too low will take too long to converge and will have a higher probability of getting stuck in local minima. For this reason, many studies have been published trying to assess the optimal value of this parameter and proposing variants that adapt its magnitude to the state of the training with the goal of improving convergence. All in all, this hyperparameter differentiates itself from the others in that it is the one that most influences the speed and the convergence of the training [8].
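
For reference, the generic gradient-descent update controlled by η, and its boosting analogue, can be written as follows (standard textbook forms, not taken from this thesis), where $h_m$ denotes the weak learner fitted at iteration m:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t) \qquad\qquad F_{m+1}(x) = F_m(x) + \eta \, h_m(x)$$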

3.2.2 Bias in the optimization

The problem we face when using Hyperband for iterative models is that it makes the implicit assumption that different configurations of a machine learning algorithm will rank similarly at different budgets. If this were completely true, it would mean that learning curves for different configurations never cross. We know of course that this is false, and this is why the evaluations are compared at different stages of the learning (i.e. at different budgets); but if we want to extract as much information as possible out of the learning curve we have to remove possible biases, and the learning rate is one of them. As explained in the previous subsection, the learning rate explicitly drives the convergence speed, and therefore configurations with very different learning rates should only be compared once they have been fully trained, as the shapes of their curves will be extremely different, and some of them will converge much more quickly than others.

For example, let us assume we are trying to optimize the hyperparameters of a machine learning algorithm that makes use of the learning rate, and let us call $A_\eta$ an instance of this algorithm that uses rate η. We will be using the Hyperband framework and budgeting our evaluations through the number of iterations (or epochs). Assuming that our learning rate is a continuous parameter that lies in the range [α, β], the first step of the Hyperband algorithm will be to sample multiple configurations, each with a randomly picked learning rate $\eta_i \in [\alpha, \beta]$, and compare the validation scores $V(\mathcal{L}, A_{\eta_i}, D_{train}, D_{val}, n_1)$ of our machine learning algorithm $A_{\eta_i}$ after $n_1$ iterations (with $n_1$ typically much smaller than the number of iterations needed for an instance of this algorithm to converge). Given the shape of the learning curves, we will have that

$$V(\mathcal{L}, A_{\eta_j}, D_{train}, D_{val}, n_1) \ll V(\mathcal{L}, A_{\eta_k}, D_{train}, D_{val}, n_1) \qquad \forall \eta_j, \eta_k : \eta_j \gg \eta_k$$

This means that Hyperband, over low-budget evaluations, will consistently choose configurations with a high learning rate. This problem has somewhat been addressed by the original authors, since evaluations in every new bracket start with a higher budget, therefore reducing the bias. However, this is an ineffective solution, as it relies on the last brackets to test configurations with a low learning rate, while high learning rate configurations will be output by the lower brackets. This not only alters the original sampling distribution for the learning rate, but is also biased, since the last brackets have a very limited exploration power (they sample a reduced number of configurations), and this can significantly reduce the efficiency of the algorithm when the best solutions have low learning rates.

3.2.3 Intuition behind the algorithm

Given the issues listed in the previous paragraph, the intuition behind the algorithm is quite straightforward: only compare budgeted evaluations that have a similar learning rate. The first step to do so is splitting the search interval of the learning rate uniformly on a logarithmic scale into non-overlapping sub-intervals. Once the interval has been split into smaller ones, the Hyperband algorithm can be run separately on these intervals, so that the full space is explored thoroughly and fairly, comparing configurations that use the same order of magnitude of learning rate. Finally, based on the performance obtained, one can choose to run Hyperband on the sub-intervals that yield the best results.
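
A minimal sketch of this splitting step; the learning-rate range and the number of sub-intervals are arbitrary example values.

```python
import numpy as np

def split_log_uniform(low, high, k):
    """Split [low, high] into k non-overlapping sub-intervals, uniform on a log scale."""
    edges = np.logspace(np.log10(low), np.log10(high), k + 1)
    return list(zip(edges[:-1], edges[1:]))

# Example: split the learning-rate search range [1e-4, 1e0] into 4 arms
print(split_log_uniform(1e-4, 1.0, 4))
# -> [(1e-04, 1e-03), (1e-03, 1e-02), (1e-02, 1e-01), (1e-01, 1e+00)]
```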

The algorithm proposed in this chapter automates this choice, identifying the best-performing split and balancing the exploration of the space and the exploitation of the best-performing interval, by using a Best Arm Identification framework.

3.3

Multi Armed Bandits

Multi Armed Bandits are a vast framework of algorithms that has been studied in probability theory since 1933 [33]. In the simplest setting, the algorithm has K possible choices - called arms - and T rounds. At each round, the algorithm chooses an arm and receives a reward, which is drawn from an unknown fixed distribution $\pi_i$ that depends on the arm. The goal of the algorithm is then to maximise its total reward over the T rounds or, equivalently, to minimize its cumulative regret, which is defined as the difference between the reward that the player received and the one they would have observed by always pulling the optimal arm.

This type of framework poses the classical problem of exploration versus exploitation. Since there is an amount of uncertainty given by the fact that rewards are stochastic, an algorithm must have the right amount of "greediness": one that is too greedy might get stuck pulling a suboptimal arm forever just because it gave a lucky reward at the beginning, while one that is not greedy enough will keep pulling suboptimal arms even when they are clearly underperforming with respect to the others.

A very simple option to balance exploration and exploitation is the family of epsilon-greedy algorithms; these algorithms start by pulling every arm at least once and then keep pulling the best-performing arm with probability 1 − ε, while choosing another arm at random with probability ε. To improve performance, it is possible to decrease ε with the number of rounds, in order to explore more at the beginning, when there is more uncertainty about the arms, and then take advantage of the best-performing arm once many rewards are available. Improving over this benchmark, many probabilistic algorithms with better asymptotic regret-minimizing properties have been proposed, such as Thompson Sampling [33] and Upper Confidence Bound [24]. Most of these algorithms build a probabilistic model for the rewards and then choose the arm that maximises some acquisition function (Probability of Improvement, Expected Improvement, Upper Confidence Bound, etc.). Thompson Sampling, in particular, introduces extra randomization by sampling the parameters in a Bayesian setting, and we will discuss it further in an upcoming section.

Going back to our problem, it is easy to see how it falls into this framework. Once we split up the learning rate interval, the sub-intervals become our arms and the stochastic reward is the loss obtained by a run of Hyperband on that particular split. However, our goal differs slightly from that of the basic setting; we are not interested in the arm that gives the best results on average, but rather in the arm that can give us the best result overall.

3.4

Bayesian updates

In this section we will give a very essential introduction to Bayesian statistics, just enough to give the reader the means to understand the following parts of this document.

When dealing with data, the goal is often to learn about some unknown parameter such as the mean of a distribution, a correlation or a proportion. In both frequentist and Bayesian statistics, data is used to make inference on this parameter, but the two methods differ strongly when it comes to dealing with uncertainty. Where frequentist methods give point estimates and confidence intervals, Bayesian statistics expresses beliefs through probability distributions. Bayes' theorem is the fundamental tool of Bayesian statistics, and expresses the relationship between prior and posterior beliefs:

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$

The posterior distribution p(θ|y) of our unknown parameter θ is proportional to its prior distribution p(θ) times the likelihood of the data p(y|θ).

The constant of proportionality is given by an integral that usually does not have a closed form, and for this reason most of the time we need to estimate posterior densities through Monte Carlo methods. In some special cases, however, this is not necessary and we can compute the distribution easily. Using the definition from Jackman [17]: suppose a prior density p(θ) belongs to a class of parametric densities, F. Then the prior density is said to be conjugate with respect to a likelihood p(y|θ) if the posterior density p(θ|y) is also in F. Adding to this definition, we can say that in most cases the parameters of the posterior distribution can be computed instantly through some simple update rule.

Let us now give an example, which will be useful later, of such conjugate distributions. Assume our data comes from some Gaussian distribution, $y = (y_1, \dots, y_n)'$ with $y_i \sim_{iid} \mathcal{N}(\mu, \sigma^2)$. In this case, our interest focuses on the vector of parameters $\theta = (\mu, \sigma^2)$. The likelihood of the data with respect to our parameters is given by

$$L(\mu, \sigma^2; y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(y_i - \mu)^2}{2\sigma^2} \right)$$

and it is a function of µ and σ² defined on a two-dimensional space. We find that for this likelihood, the conjugate prior density $p(\mu, \sigma^2)$ is defined (by marginalizing) as the product of a normal, conditional density for µ, and a density for σ². Explicitly, $p(\mu, \sigma^2) = p(\mu \mid \sigma^2)\, p(\sigma^2)$, where

$$\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/n_0) \qquad \sigma^2 \sim \text{InverseGamma}(\nu_0/2, \nu_0\sigma_0^2/2)$$

and the distributions are defined by the hyperparameters $\mu_0$, $n_0$, $\nu_0$ and $\sigma_0^2$. What is interesting about this combination is that the posterior density is also a Normal-InverseGamma, with updated hyperparameters

$$\mu_1 = \frac{n_0\mu_0 + n\bar{y}}{n_0 + n} \qquad n_1 = n_0 + n \qquad \nu_1 = \nu_0 + n$$

$$\nu_1\sigma_1^2 = \nu_0\sigma_0^2 + \sum_{i=1}^{n}(y_i - \bar{y})^2 + \frac{n_0\, n}{n_0 + n}(\mu_0 - \bar{y})^2$$

where n is the number of sampled points and $\bar{y}$ their average. This means we have a very simple update rule and we can easily make inference on the parameters µ and σ². Even better, in this case we also know the predictive density for a new point $y^*$ given the current data, i.e. $p(y^* \mid y)$: it is a Student-t density with location parameter $\mu_1$, scale parameter $\sigma_1\sqrt{(n_1 + 1)/n_1}$, and $\nu_0 + n$ degrees of freedom, which is a known density that we can sample from.
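
A small sketch of the conjugate update above; the prior hyperparameter values and the toy observations are arbitrary.

```python
import numpy as np
from scipy import stats

def normal_ig_update(y, mu0, n0, nu0, sigma0_sq):
    """Posterior hyperparameters of a Normal-InverseGamma prior after observing y."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    mu1 = (n0 * mu0 + n * ybar) / (n0 + n)
    n1 = n0 + n
    nu1 = nu0 + n
    nu1_sigma1_sq = (nu0 * sigma0_sq + ((y - ybar) ** 2).sum()
                     + n0 * n / (n0 + n) * (mu0 - ybar) ** 2)
    return mu1, n1, nu1, nu1_sigma1_sq / nu1

# Example: a weakly informative prior updated with a handful of validation scores
mu1, n1, nu1, sigma1_sq = normal_ig_update([0.62, 0.58, 0.65],
                                           mu0=0.5, n0=1, nu0=1, sigma0_sq=0.1)

# One draw of (mu, sigma^2) from the posterior, as used later by Thompson Sampling
sigma_sq = stats.invgamma(a=nu1 / 2, scale=nu1 * sigma1_sq / 2).rvs()
mu = stats.norm(loc=mu1, scale=np.sqrt(sigma_sq / n1)).rvs()
print(mu1, sigma1_sq, mu, sigma_sq)
```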

3.5

Thompson Sampling

As previously introduced in the section about Multi Armed Bandits, Thompson Sampling is a common algorithm for online decision-making in the bandit framework and was introduced in the 1930s [33]. Even though its inception is one of the oldest, the algorithm was ignored for many years, coming to fame in the 2010s thanks to two important publications that showed its empirical performance [30, 6]. Its popularity has been rising since then, with applications ranging from A/B testing [34] to recommender systems [20] and hyperparameter tuning [19].

The original Thompson Sampling algorithm was proposed in the scenario of a Bernoulli bandit. In this simple setting, every arm is characterized by a Bernoulli distribution and produces a reward of 1 with probability $p_i$. The algorithm itself is based on the Bayesian paradigm and therefore treats the parameter $p_i$ as an unknown variable, quantifying its uncertainty by setting a prior distribution given by a $Beta(\alpha_i, \beta_i)$ density. This prior distribution happens to be conjugate to the Bernoulli likelihood, and therefore allows for an easy update of the parameters of the prior in order to obtain the posterior.

Algorithm 2 Beta-Bernoulli Thompson Sampling

1: Initialization - Choose prior distribution parameters (α, β) for every arm
2: for t ∈ {1, 2, . . . } do
3:   for k ∈ {1, . . . , K} do
4:     Sample p̂k ∼ Beta(αk, βk)
5:   end for
6:   xt ← argmaxk(p̂k)
7:   Pull arm xt and observe reward rt ∈ {0, 1}
8:   (αxt, βxt) ← (αxt + rt, βxt + 1 − rt)
9: end for
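
A runnable sketch of Algorithm 2 on a simulated Bernoulli bandit; the true arm probabilities are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.2, 0.5, 0.65]          # hidden success probabilities of the 3 arms
alpha = np.ones(len(true_p))        # Beta prior parameters, one pair per arm
beta = np.ones(len(true_p))

for t in range(1000):
    p_hat = rng.beta(alpha, beta)             # step 4: one posterior draw per arm
    arm = int(np.argmax(p_hat))                # step 6: play the most promising arm
    reward = rng.random() < true_p[arm]        # step 7: observe a Bernoulli reward
    alpha[arm] += reward                       # step 8: conjugate Beta update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))  # should concentrate on arm 2
```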

In this simple case, the only step that differentiates this algorithm from a greedy MAB one is step 4. A greedy approach would estimate the parameter $p_k$ in a frequentist way ($p_k \leftarrow \alpha_k / (\alpha_k + \beta_k)$), ignoring the uncertainty around it. The fact that the algorithm instead samples this parameter allows it to take the uncertainty into account in the long run, balancing its exploration choices better while still giving an edge to the best-performing arms as time goes on.

This algorithm can be generalized to a broad array of problems beyond the Bernoulli bandit, with more complex parameter spaces and non-binary rewards. In a more general setting, for every action $x_t$ selected by the agent at time t, an outcome $y_t$ is observed, coming from some distribution $q_\theta(\cdot \mid x_t)$. The agent then receives a reward $r_t = r(y_t)$, where r is a known function.

As in the Bernoulli case, a prior distribution is placed on the parameter θ, and at every round the choice of the arm happens by maximising the expected reward over the likelihood $q_{\hat{\theta}}(\cdot \mid x_t)$, where $\hat{\theta}$ is sampled from the posterior distribution.

Algorithm 3 Generic K-arms Thompson Sampling

1: Initialization - Choose prior distribution π(θ)
2: for t ∈ {1, 2, . . . } do
3:   for k ∈ {1, . . . , K} do
4:     Sample θ̂k ∼ π
5:   end for
6:   xt ← argmaxk Eq_θ̂k[r(yt) | x = xt]
7:   Pull arm xt, observe outcome yt and compute reward rt
8:   π ← P(θ | x1, y1, . . . , xt, yt)
9: end for

As previously mentioned, this algorithm has very good regret-minimizing properties, as it balances its choices very well. Of course, given its Bayesian nature, it is dependent on the choice of the prior distribution, in particular in the early runs when there is not much evidence from the rewards.

3.6

Best Arm Identification Hyperband (BAIHB)

In this section we will finally put together all the pieces presented in this chapter and propose a new algorithm for automatic Hyperparameter Optimization for iterative machine learning algorithms, which builds on the strengths of Hyperband and tries to address some of its shortcomings. As the name suggests, BAIHB is based on two parts - a Best Arm Identification (multi armed bandit) setting and the Hyperband algorithm itself. It is a framework more than an algorithm, as it can be set up differently depending on the optimization procedure that is used within the Multi Armed Bandit part. As explained in section 3.2.3, the intuition is simple - discretize the learning rate parameter into separate intervals and run Hyperband separately on each one, while the Multi Armed Bandit part has the goal of choosing the arms in the way that maximises the probability of getting a good configuration. A small change was however made to the Hyperband algorithm itself: as suggested by Falkner and Klein in [10], I removed the last bracket of Hyperband, which is all in all equivalent to a step of Random Search. The reason behind this last Random Search bracket was to avoid penalizing good runs that had a slow learning curve, but given that our adaptation has been conceived and structured to solve this very problem, it only ended up slowing everything down.

Algorithm 4 Meta algorithm for BAIHB

1: Split learning rate interval into non-overlapping sub-intervals
2: Evaluate which interval (arm) to use next based on past runs (BAI)
3: Run Hyperband on selected interval and store results (HB)
4: Repeat steps 2-3 until budget is exhausted

In general, Hyperband needs to generate hyperparameter configurations on the whole configuration space, which is the cartesian product of the single intervals where each hyperparameter is defined. Running Hyperband on a sub-interval simply means generating the configurations on the same space, but using the sub-interval of the learning rate in the cartesian product. Also, Hyperband evaluates several configurations at different budgets, but only the full-budget evaluations are stored and fed as rewards to the Multi Armed Bandit part of the optimization.

The Multi Armed Bandit part takes up step 2 of the meta algorithm presented above and can be designed in different ways, as mentioned in sections 3.3 and 3.5. In the following paragraphs, we will present two different possible designs, detailing the acquisition procedure.


3.6.1 Epsilon greedy BAIHB

The epsilon-greedy version of the algorithm is of course the simpler of the two, and it is quite straightforward. Please note that in this subsection we will use the terms "outcome" and "reward" interchangeably; they will both represent the validation scores obtained by the fully trained configurations coming from a run of the Hyperband algorithm on a given learning rate interval.

After a first round of full exploration to obtain at least one reward per interval, the best arm is chosen based on past outcomes. In the classic bandit literature this is done by choosing the arm that has the highest average outcome, as the goal is to maximise the total cumulative reward; however, since our goal is to retain a small set of "best" configurations, we will choose the arm that has the best α-percentile mean (i.e. the mean of the top α outcomes). The next run of Hyperband is then executed on the best-performing interval with probability 1 − ε, or on one of the others, chosen at random, with probability ε. This works quite well, as very little computational overhead is added, and the greediness of the algorithm can be tweaked using the parameter ε. Also, this algorithm makes no assumption on the distribution of the outcomes or the parameters it estimates, and therefore it does not introduce any bias. Of course, as previously discussed, using a fixed ε either tends to explore for too long if the parameter is large, or risks getting stuck in a local optimum if it is small, and it is challenging to find the appropriate value. However, this should still be sufficient to improve performance over the original Hyperband algorithm.
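
A compact sketch of this selection rule (Algorithm 5 below formalizes it); `run_hyperband_on` stands in for a Hyperband run restricted to one learning-rate sub-interval and is assumed here, as are the number of arms, ε and the tail fraction.

```python
import random

def epsilon_greedy_baihb(arms, run_hyperband_on, n_rounds, eps=0.2, tail=0.1):
    """arms: list of learning-rate sub-intervals; returns the best score seen."""
    outcomes = {k: [] for k in range(len(arms))}   # full-budget scores per arm
    best = float("-inf")

    def tail_avg(scores):
        # mean of the top `tail` fraction of an arm's outcomes
        top = sorted(scores, reverse=True)[: max(1, int(len(scores) * tail))]
        return sum(top) / len(top)

    for t in range(n_rounds):
        unexplored = [k for k, s in outcomes.items() if not s]
        if unexplored:                               # explore every arm at least once
            k = unexplored[0]
        elif random.random() < 1 - eps:              # exploit the best tail average
            k = max(outcomes, key=lambda a: tail_avg(outcomes[a]))
        else:                                        # explore a random arm
            k = random.randrange(len(arms))
        scores = run_hyperband_on(arms[k])           # outcomes of one Hyperband run
        outcomes[k].extend(scores)
        best = max(best, max(scores))
    return best

# Example usage with a toy surrogate for the Hyperband run (higher score is better)
toy_run = lambda interval: [random.uniform(0, sum(interval) / 2) for _ in range(3)]
print(epsilon_greedy_baihb([(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1)], toy_run, 20))
```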

Algorithm 5 Epsilon Greedy BAIHB

Split the learning rate interval into K sub-intervals.
for t ∈ {1, 2, . . . } do
  ut ∼ Uniform(0, 1)
  if ut < 1 − ε then
    xt ← argmaxk(αk)
  else
    xt ∼ RandInt(1, K)
  end if
  Run Hyperband on interval xt, observe outcomes Yt = (y1,t, y2,t, . . . )
  αxt ← tail average of {Yi · 1{xi = xt}}i=1..t
end for

3.6.2 TS BAIHB

The application of the Thompson Sampling algorithm to Hyperband bears only a few differences from the generic TS algorithm presented in the previous section. In particular, we will define this application by specifying the distributions used both for the outcomes and for the parameters. These can of course be modified, and we will discuss this in a later section. Also, please bear in mind that, as opposed to the previous subsection, the terms "reward" and "outcome" are here not the same, as rewards will be a transformation of the outcomes.

We will use this algorithm with a Normal likelihood and a Normal-Inverse-Gamma prior, because of the ease of computation that they allow. Therefore, we assume that, for each arm, the outcomes $y_t$ of our Hyperband runs are approximately normally distributed, with parameters $\theta = (\mu, \sigma^2)$, where $\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/n_0)$ and $\sigma^2 \sim \text{InverseGamma}(\nu_0/2, \nu_0\sigma_0^2/2)$. The fixed hyperparameters $(\mu_0, n_0, \nu_0, \sigma_0^2)$ must be chosen to define the amount of information that the prior distribution carries, and as such they depend on the specific application and in particular on the score that will be returned by Hyperband: for example, using the log-likelihood as the score will require different values than using the mean squared error. In general, a good approach is to train the algorithm with a couple of random configurations and, based on the scores obtained, set up values for the prior that roughly mirror the expected behaviour without inserting too much information, or simply set up fully uninformative values. Once the setup is done, the Thompson Sampler starts. For every interval, we sample $(\mu, \sigma^2)$ from the respective distributions and compute the expected reward given these values. In order to express our interest in the extreme values of the distribution rather than its mean, we use the Improvement as our reward function, i.e. $r(y_t, y^\star) = \max(y_t - y^\star, 0)$.

This means that what we end up maximizing is the Expected Improvement, a known quantity in the literature that we have already introduced in the chapter on Bayesian Optimization, and that has a simple closed form in the case of Normal distributions [derivation in Appendix A.1]:

$$E\left[\max(y_t - y^\star, 0) \mid (\mu, \sigma^2)\right] = \sigma\,\phi\!\left(\frac{y^\star - \mu}{\sigma}\right) + (\mu - y^\star)\,\Phi\!\left(-\frac{y^\star - \mu}{\sigma}\right)$$

Once the arm with the highest Expected Improvement is selected, Hyperband is run on the related interval and the outcomes are stored. All the outcomes relative to a given interval are then used to compute the posterior distribution of its parameters $(\mu, \sigma^2)$ through the update rules that we saw in section 3.4. Compared to the greedy version, this algorithm introduces a bit more computational overhead on top of Hyperband. Between every two Hyperband runs, the model has to sample 2K values (K being the number of arms), compute the Expected Improvement for each interval, select the maximum and update the posterior distribution. Nevertheless, there is no reason to use a particularly high value of K, and therefore the added computational cost is still minuscule compared to the one usually needed for training the machine learning models that we are optimizing. Moreover, contrary to its greedy version, this setup should automatically balance exploration and exploitation, resulting in better performance in the long run.
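
A sketch of the per-round arm-selection step just described, i.e. one Normal-Inverse-Gamma posterior draw per arm followed by the closed-form Expected Improvement (Algorithm 6 below gives the full procedure); the prior values are illustrative and `posterior[k]` is assumed to hold the current hyperparameters of arm k.

```python
import numpy as np
from scipy.stats import invgamma, norm

def select_arm(posterior, y_best):
    """posterior: list of dicts with keys mu, n, nu, sigma_sq (one per arm)."""
    expected_improvements = []
    for p in posterior:
        # Thompson step: one draw of (mu, sigma^2) from the Normal-IG posterior
        sigma_sq = invgamma(a=p["nu"] / 2, scale=p["nu"] * p["sigma_sq"] / 2).rvs()
        mu = norm(loc=p["mu"], scale=np.sqrt(sigma_sq / p["n"])).rvs()
        sigma = np.sqrt(sigma_sq)
        z = (y_best - mu) / sigma
        # Closed-form EI for a maximization problem with incumbent y_best
        expected_improvements.append(sigma * norm.pdf(z) + (mu - y_best) * norm.cdf(-z))
    return int(np.argmax(expected_improvements))

# Example: three arms with the same weak prior and an incumbent score of 0.70
prior = {"mu": 0.5, "n": 1, "nu": 1, "sigma_sq": 0.05}
print(select_arm([dict(prior) for _ in range(3)], y_best=0.70))
```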

3.7

Better choices of the distributions

Algorithm 6 TS BAIHB

Split the learning rate interval into K sub-intervals.
Initialize the parameters of the prior distributions µ0, σ0², ν0, n0.
for t ∈ {1, 2, ...} do
    for k ∈ {1, ..., K} do
        Sample σ̂²_k ∼ IG(ν_{1,k}/2, ν_{1,k} σ²_{1,k}/2)
        Sample µ̂_k | σ̂²_k ∼ N(µ_{1,k}, σ̂²_k/n_{1,k})
        EI_k ← σ̂_k φ((y* − µ̂_k)/σ̂_k) + (µ̂_k − y*) Φ(−(y* − µ̂_k)/σ̂_k)
    end for
    x_t ← argmax_k EI_k
    Run Hyperband on interval x_t, observe outcomes Y_t = (y_{1,t}, y_{2,t}, ...)
    if max_i(y_{i,t}) > y* then
        y* ← max_i(y_{i,t})
    end if
    Update µ_{1,x_t}, σ²_{1,x_t}, ν_{1,x_t}, n_{1,x_t} based on Y_t
end for

After the last section on the Thompson Sampling BAIHB algorithm, any reader with some statistical background and mathematical rigor at heart will be left cringing in disgust. Indeed, I opted for the use of the Normal distribution to represent the likelihood of the outcomes obtained by Hyperband. However, these results come in the form of regression or classification scores - R², Root Mean Squared Error, Logarithmic Loss, Accuracy, Area Under the Curve - which are all bounded quantities, unlike Gaussian random variables.

The main reasons behind this mathematical murder are computational speed and simplicity. Our algorithm needs to sequentially compute posterior distributions multiple times, and it is built on top of an optimization procedure (Hyperband) whose speed is one of the main reasons for its competitiveness. Using a custom pair of distributions for likelihood and prior would most likely mean having to compute posterior distributions with a Markov Chain Monte Carlo sampler, making the computational overhead of our algorithm explode. Moreover, the Bayesian step of the algorithm has the goal of giving an idea about the mean and the variance of the outputs and of helping discern good intervals from bad ones, and the fact that the Gaussian distribution is R-valued does not prevent it from giving us useful information about these quantities. Finally, some distributions that could be used as more "proper" likelihoods, such as the Beta or the Gamma, have the Gaussian as their limiting distribution [16], which suggests that in the long run our approximation does not fall too far from the real values. Another point is that it is important to consider the environment in which this algorithm was created. As much as Data Science and Machine Learning are based on mathematics and statistics, the everyday life of practitioners is often very far from the formulas and the rigor that the academic world requires. The conception and the implementation of this algorithm came, as briefly explained in the introduction, in a work environment with the goal of speeding up the existing procedures. For this reason, it was better to have an algorithm that could easily adapt to any score metric, rather than a mathematically sound one that would need major adjustments for every new problem.

Nevertheless, I will briefly discuss a possible alternative choice of distributions that could be part of future developments. We know from the literature that it is possible to find a conjugate prior for any likelihood belonging to the exponential family [9]. This means that for any of these distributions we know explicitly the shape of the prior distribution and the update rule. However, for common likelihoods such as the Gamma and the Beta, the resulting conjugate prior is not a standard distribution, and the challenge is finding an efficient sampling strategy, possibly through Importance Sampling or Accept-Reject Sampling, that allows obtaining values from these distributions without slowing down the optimization process excessively.
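As a rough illustration of the accept-reject idea in this context, the sketch below draws from an arbitrary unnormalized density on a bounded interval using a uniform proposal; the density g and the bound M are placeholders, and finding a tight envelope (or good importance weights) is precisely the efficiency problem mentioned above.

import numpy as np

def rejection_sample(g, low, high, M, size, rng=None):
    """Draw `size` samples from the unnormalized density g on [low, high],
    assuming g(x) <= M on that interval (uniform proposal)."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = []
    while len(samples) < size:
        x = rng.uniform(low, high)
        if rng.uniform(0.0, M) < g(x):   # accept with probability g(x) / M
            samples.append(x)
    return np.array(samples)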


Chapter 4

Implementation and Results

4.1

Introduction

This chapter includes some of the main parts of this thesis: it shows the results of a practical implementation of the new algorithm, comparing its performance to the original version of Hyperband and to Random Search. We will first go over the data and the machine learning algorithm that were used, and then we will discuss the design of the experiment and its setup process. Finally, we will analyze the results and discuss possible future improvements to the work.

4.2

Dataset

Given the confidentiality of its content, it was unfortunately not possible to retain the original dataset that I used to train and develop the algorithm, nor can I discuss its content extensively.

As an alternative, I decided to test the application on a toy dataset. As it was not easy to find a dataset with properties similar to the original, I turned to Kaggle [1]. There I could find a good dataset, both in terms of size and complexity, allowing for a useful application of hyperparameter optimization. The dataset contains anonymized information about around 75 thousand clients of Santander, a Spanish credit institution, over more than 360 predictors, with the goal of predicting their satisfaction with the bank through a binary variable. Before the application, the data was cleaned, normalized, and transformed by taking the principal components. No other operation on the data was needed nor possible, since all variables were anonymized and had no explicit meaning.
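As an illustration, the preparation step can be sketched with a standard scikit-learn pipeline; the file name, the column names and the number of retained components below are assumptions for the Kaggle data, not taken from the original code.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names for the Kaggle data.
df = pd.read_csv("train.csv")
y = df["TARGET"]
X = df.drop(columns=["ID", "TARGET"])

# Drop constant columns, standardize, and keep the leading principal components.
X = X.loc[:, X.nunique() > 1]
preprocess = make_pipeline(StandardScaler(), PCA(n_components=100))
X_pca = preprocess.fit_transform(X)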

4.3

Machine Learning Algorithm

In order to perform classification on the data, we will use one of the most common and effective algorithms in the Machine Learning literature: XGBoost [7]. XGBoost is an ensemble algorithm that makes use of boosting, creating an additive model by sequentially fitting Classification and Regression Trees (CART) to the previous residuals [13]. This algorithm is known to be one of the best performing ones for classification and regression tasks on structured data, and it has been used continuously, and with great results, in machine learning and data science competitions since its publication. Its performance is however strongly dependent on multiple hyperparameters [2]:

• Number of trees - the maximum number of trees that are grown additively

• Learning rate - drives the size of the update steps of the iterative procedure

• Depth - indicates the number of binary splits that each tree makes. Higher numbers imply deeper, more complex and accurate trees, with a higher risk of overfitting.

• Subsample ratio - indicates the percentage of the training set that is used when building each tree. Having a subsample lower than 1 can reduce the correlation between trees and therefore improve the ensemble score, in a procedure similar to bagging

• Column sample ratio - ratio of predictors that is sampled when building each tree. Can have a similar effect to the data subsample, reducing the correlation between trees.

• Gamma - the minimum loss reduction needed to make a new split in the trees. It is a regularization parameter that can be used to avoid growing unnecessarily deep trees.

• Alpha and Lambda - L1 and L2 regularization coefficients on the weights, similar to those of Lasso and Ridge regressions

While the number of trees is used as the budget by our HPO algorithm, the other parameters need to be sampled. Learning rate and depth will be sampled on a logarithmic scale, as changes to these parameters have little impact when their values are high: in terms of predictive power, a tree with 15 levels of splits and one with 17 will be almost equivalent, while the same cannot be said about trees of depth 2 and 4. The sampling ratios and the regularization coefficients, on the other hand, will be sampled on a uniform scale, but with non-zero probability on their default value. What this means in practice is that, for example, the data subsample rate will be drawn uniformly between 0.5 and 1 half of the time and will be set to 1 the other half of the time. The reason for this is that these parameters are not always necessary, and it is therefore a good idea to also test plain models with no subsampling, no regularization, or only partial regularization. Finally, in order to avoid overfitting, we will train the model through cross validation, stopping the training early when no improvement is made with respect to the Area Under the ROC Curve, and then score the configuration's performance on a validation set.
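A minimal sketch of this sampling scheme and of the cross-validated evaluation is given below. The helper names, the depth bounds and the ranges for gamma and the regularization terms are illustrative assumptions; only the learning-rate range, the subsample range, the log-scale sampling and the default-with-probability-one-half rule come from the description above.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

def sample_configuration():
    """Draw one configuration following the scheme described above."""
    def maybe(default, low, high):
        # Keep the default value half of the time, otherwise sample uniformly.
        return default if rng.random() < 0.5 else float(rng.uniform(low, high))
    return {
        "eta": float(np.exp(rng.uniform(np.log(0.005), np.log(0.3)))),        # log scale
        "max_depth": int(round(np.exp(rng.uniform(np.log(2), np.log(16))))),  # log scale, assumed bounds
        "subsample": maybe(1.0, 0.5, 1.0),
        "colsample_bytree": maybe(1.0, 0.5, 1.0),
        "gamma": maybe(0.0, 0.0, 5.0),      # assumed range
        "alpha": maybe(0.0, 0.0, 1.0),      # assumed range
        "lambda": maybe(1.0, 0.0, 1.0),     # assumed range
        "objective": "binary:logistic",
        "eval_metric": "auc",
    }

def evaluate(cfg, dtrain, num_boost_round):
    """Cross-validated AUC with early stopping; dtrain is an xgb.DMatrix."""
    cv = xgb.cv(cfg, dtrain, num_boost_round=num_boost_round,
                nfold=5, early_stopping_rounds=50, seed=0)
    return cv["test-auc-mean"].iloc[-1]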


4.4

Experiment design

The design of the experiment is quite simple. The three algorithms ran for a comparable amount of time with the goal of optimizing the validation results of XGBoost over the described dataset. For each of them, we will compare the scores of fully trained configurations and their distributions, showing their similarities and differences. For BAIHB, I have decided to test the Thompson Sampling version, as it is the most interesting from both an academic and an implementation perspective.

In the following subsections we will detail the setup for each individual algorithm.

4.4.1 Hyperband

The setup of Hyperband, which is also necessary for BAIHB, requires choosing the budgets and the exploration coefficient η. For the latter I kept the standard value of η = 3 suggested by Li in the original paper, as it has shown to be the best performing one in most standard situations. This value implies that at every step of Successive Halving we keep one third of the configurations and triple the budget. In order to set up the minimum and maximum budgets, the easiest thing to do is to reason in terms of equivalences. Hyperband is most intuitive when using a budget that ranges from 1 to a power of η, so in this case I set an equivalence of 1 budget unit to 30 boosting iterations. This means that the models will first be compared after growing 30 trees, so as to avoid the earliest training phases, where the score can be particularly volatile. At the same time, I set a maximum budget R = η⁴ = 81, which equates to 2430 boosting iterations. This value has been explicitly chosen to be very high, in a way that allows slow-growing configurations to reach full training while only partially affecting fast ones, thanks to the use of early stopping.
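Under these choices, the bracket schedule follows directly from the formulas of the original Hyperband paper; the short sketch below only prints, for each bracket, the number of configurations started and the budget of the first rung, expressed through the 30-iteration unit described above.

import math

eta, R = 3, 81      # exploration coefficient and maximum budget, in units
unit = 30           # one budget unit = 30 boosting iterations

s_max = int(round(math.log(R) / math.log(eta)))   # = 4, since R is an exact power of eta
B = (s_max + 1) * R                               # total budget of one Hyperband run

for s in range(s_max, -1, -1):
    n = math.ceil(B / R * eta**s / (s + 1))   # configurations started in the bracket
    r = R // eta**s                           # initial budget per configuration, in units
    print(f"bracket s={s}: {n:3d} configs, first rung at {r * unit} boosting rounds")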


4.4.2 Random Search

The setup of Random Search is rather trivial, as we have already defined the sampling strategy for all the parameters in the section about XGBoost. The only variable that is not defined is the number of trees, and in order to make the results comparable to Hyperband's, Random Search will test configurations on the equivalent of the maximum budget, which is 2430 trees, while using early stopping.
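For completeness, the baseline reduces to a simple loop over the full budget, sketched here with the hypothetical helpers from the XGBoost section; the number of iterations is a placeholder.

# Plain Random Search baseline: every configuration gets the full budget
# (2430 boosting rounds), relying on early stopping to cut training short.
# sample_configuration, evaluate and dtrain are the hypothetical helpers
# from the earlier sketch; N_CONFIGS is a placeholder.
N_CONFIGS = 40
best_score, best_cfg = -float("inf"), None
for _ in range(N_CONFIGS):
    cfg = sample_configuration()
    score = evaluate(cfg, dtrain, num_boost_round=2430)
    if score > best_score:
        best_score, best_cfg = score, cfg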

4.4.3 Thompson Sampling BAIHB

The preparation phase for the Thompson Sampling BAIHB consists of three parts:

• The setup of Hyperband, which is identical to the one explained earlier

• The setup of the Multi armed bandit via the discretization of the space

• The choice of the parameters for the prior distributions of the Thompson Sampler

For the multi-armed bandit part I chose to use only three arms, and therefore three sub-intervals in the learning rate domain [0.005, 0.3]. Indeed, the domain in this application spans three orders of magnitude, so three sub-intervals should be enough to make the learning rates comparable while avoiding overparametrizing the problem. As previously defined, the split points for the interval are taken uniformly on the logarithmic scale, which gives us the following:

• Arm 1 - learning rate in [0.005, 0.020]

• Arm 2 - learning rate in [0.020, 0.077]

• Arm 3 - learning rate in [0.077, 0.3]
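The boundaries above are simply the points that split [0.005, 0.3] evenly in log space; a short illustrative check (values rounded to three decimals):

import numpy as np

# Boundaries that split [0.005, 0.3] evenly on the logarithmic scale (K = 3 arms).
low, high, K = 0.005, 0.3, 3
edges = np.exp(np.linspace(np.log(low), np.log(high), K + 1))
print(np.round(edges, 3))   # roughly [0.005, 0.02, 0.077, 0.3]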

As for the parameters of the prior distributions, I followed a rather heuristic approach. First, I set µ0 = 0.8 by running a couple of low-budget evaluations of the model and seeing where they landed. This keeps the prior mean from being too far from the actual evaluations, which could influence, positively or negatively, the convergence time of the posterior.

Second, I set n0 = 1. The reason for this is that n0 can be seen as an "equivalent prior sample size", and therefore by selecting a low value we make sure that our posterior takes most of its information from the data, even with only a handful of samples.

Finally, I set ν0 = 1 and σ0² = 0.005 by looking at the predictive distribution (see chapter 3.4). In particular, by setting this value for σ0² I make sure that most of the density of the predictive distribution lies in [0.5, 1], which is the domain of the AUC score that we are maximizing.

Figure 4.1: Initial predictive distribution given prior parameters
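This check can be reproduced with a quick Monte Carlo draw from the prior predictive; the sketch below samples σ² from the Inverse-Gamma prior, µ given σ², and then a score, and estimates the fraction of predictive mass falling in [0.5, 1]. It is only a sanity check of the chosen prior values, not part of the optimization itself.

import numpy as np

rng = np.random.default_rng(0)
mu0, n0, nu0, sigma0_sq = 0.8, 1, 1, 0.005

# Draw sigma^2 from the Inverse-Gamma prior, mu given sigma^2, then a score.
sigma_sq = 1.0 / rng.gamma(nu0 / 2, 2.0 / (nu0 * sigma0_sq), size=100_000)
mu = rng.normal(mu0, np.sqrt(sigma_sq / n0))
y = rng.normal(mu, np.sqrt(sigma_sq))
print(np.mean((y > 0.5) & (y < 1.0)))   # share of prior predictive mass in [0.5, 1]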

4.4.4 Number of equivalent evaluations

In order to compare the results of the three algorithms, we need to have them running for a comparable amount of time. Since the code was executed in different conditions and on different machines, the easiest way to compare
