
POLITECNICO DI MILANO

Scuola di Ingegneria Industriale e dell’Informazione

Corso di Laurea Magistrale in Ingegneria Matematica
Dipartimento di Matematica

Bayesian Optimization of Expensive Black-Box Functions in Big Data Analytics via Feature Selection

Advisor: Prof. Alessandra GUGLIELMI
Co-advisor: Prof. Danilo ARDAGNA

Master's thesis by:

José Yuri VILLAFAN, student ID 883536


Acknowledgments

I dedicate this thesis to my parents.

To my mother, a physician, a sincere and cheerful friend, who improvised as a mathematician to get a brilliant idea to me in time; courageous in steering me to gain experience on new paths, to overcome pitfalls, to find the light, and to understand that knowledge alone is not enough: it must be put to use, but first translated into mathematical language.

To my father, who in every difficulty has always remained serene and faithful to his goal of seeing me fulfilled as a man and as a mathematician.


Abstract

The amount of data we produce every day is growing faster than ever before, doubling every two years. Recent studies estimate that more than 150 zettabytes (i.e., 150 trillion gigabytes) of data will be analysed by 2025. Big data processing requires new types of infrastructures, and virtual machine (VM) clusters are nowadays the most suitable execution environment. At the application layer, Apache Spark is one of the most widely used environments to perform various big data analyses. Today, big data analytics running on clouds have become critical for almost every industry, and choosing the right cloud configuration is essential to service quality and business competitiveness.

The goal of the thesis is to identify the best cloud configuration in order to minimize the execution times and costs of analytic jobs. Since these are black-box functions, we use a Bayesian approach to global optimization. Specifically, we assume that the application execution cost is a random function sampled from a proper distribution, and we perform constrained optimization to find the optimal cloud configuration. Furthermore, we combine Bayesian optimization with feature selection techniques to avert problems associated with the high dimensionality of the data. To that end, we develop BOSS-V, an algorithm that integrates both techniques and identifies the best cloud configuration effectively.


Sommario

The amount of data currently generated doubles every two years. Recent studies estimate that more than 150 zettabytes of data, i.e., 150 trillion gigabytes, will be analysed by 2025. This large amount of data, big data, requires new infrastructures for its analysis, owing to the inadequacy of those currently available. Nowadays, big data analyses, big data analytics, are executed on clusters of virtual machines (VMs), and Apache Spark has become one of the most widely used platforms for running them. Big data analytics play a role of primary importance for almost every industry. To obtain quality of service and business competitiveness it is important to choose the best cloud configuration, minimizing the execution times of the applications run on the VMs and reducing their costs. In particular, to identify the optimal cloud configuration, we adopt a Bayesian approach to optimization. That is, we assume that the execution time, which is a black-box function of the configuration, is a random function with its own distribution. Furthermore, we combine the Bayesian optimization approach with feature selection techniques in order to cope with the difficulties it faces in high-dimensional problems, such as big data, and we propose a constrained optimization algorithm that computes the optimum described above. Moreover, we explore the possibility of integrating information about the optimum, obtained independently of the Bayesian optimization, into the optimization procedure.


Contents

Introduction

1 Optimization techniques
1.1 An introduction to Optimization
1.2 Computational optimization techniques
1.2.1 Optimization and Machine Learning
1.2.2 Bayesian optimization methods
1.2.3 Non-Bayesian methods

2 Bayesian optimization
2.1 An introduction to Bayesian optimization
2.2 Formalization of Bayesian approach in global optimization
2.3 Gaussian processes in Bayesian optimization
2.4 Model selection
2.5 Acquisition functions
2.5.1 Optimistic acquisition functions
2.5.2 Improvement-based acquisition functions
2.5.3 Information-based acquisition functions
2.5.4 Optimization of the acquisition function
2.5.5 Portfolio allocation strategy
2.6 Noise
2.7 Constraints
2.8 Stopping criteria
2.9 Shortcomings of Bayesian optimization

3 State of the art
3.1 Problem formulation
3.2 Performance models
3.2.1 Features description
3.2.2 Training and Test Sets
3.2.3 Data profiling for Bayesian optimization
3.3 Thesis goal

4 Feature selection methodology
4.1 Overview of feature selection methods
4.2 Linear regression techniques
4.2.1 Ridge
4.2.2 Lasso
4.2.3 Comparison of the two methods and geometric analysis
4.3 Sequential Feature Selection
4.3.1 Search organization
4.3.2 Selection criterion
4.4 Sequential feature selection for Bayesian optimization
4.4.1 Issues with sequential feature selection for Bayesian optimization
4.5 Encoding information about the optimum

5 Process description
5.1 Algorithm workflow
5.1.1 First stage: Processing
5.1.2 Second stage: Design space
5.1.3 Third stage: Feature selection
5.1.4 Fourth stage: Bayesian optimization
5.1.5 Fifth stage: Encoding information about the minimum
5.1.6 Sixth stage: Stopping criteria
5.2 BOSS-V algorithm description
5.3 BOSS-V algorithm implementation

6 Experiments and discussion
6.1 Benchmark optimization experiments
6.1.1 1-dimensional case
6.1.2 2-dimensional case
6.1.3 6-dimensional case
6.2 Big data applications analysis
6.2.1 Optimization test: Query 26, datasize 250 GB
6.2.2 Optimization test: Query 26, datasize 750 GB
6.2.3 Optimization test: Query 26, datasize 1000 GB
6.2.4 Extrapolation test: Query 26, datasize 1000 GB
6.2.6 Multiple test: K-means, datasize 20 million rows

Conclusions

Bibliography

A Experiments on K-means application
A.1 Big data applications analysis
A.1.1 Optimization test: K-means, datasize 5 million rows
A.1.2 Optimization test: K-means, datasize 10 million rows
A.1.3 Optimization test: K-means, datasize 15 million rows


List of Figures

2.1 A 1D Gaussian process with three observations.
2.2 Samples from GPs with different kernels.
2.3 Acquisition functions comparison.
3.1 DAG from TPC-DS benchmark.
3.2 Core interpolation scenarios.
4.1 Geometric comparison between Lasso and Ridge regressions.
4.2 2D function with low effective dimensionality.
4.3 Nascent minima distribution function for varying k.
4.4 Optimization methods comparison on 1D Forrester function.
6.1 Benchmark test: 1D.
6.2 Benchmark test: 1D. Stills of iterations.
6.3 Benchmark test: 1D. Kernel comparison.
6.4 Benchmark test: 2D.
6.5 Benchmark test: 2D. Stills of iterations.
6.6 Benchmark test: 2D. Adaptive methods comparison.
6.7 Benchmark test: 2D. Non adaptive methods comparison.
6.8 Benchmark test: 6D. Adaptive methods comparison.
6.9 Benchmark test: 6D. Non adaptive methods comparison.
6.10 Benchmark test: 6D. Feature selection comparison.
6.11 Benchmark test: 6D. Model sizes comparison.
6.12 BOSS-V variant A: Query 26, 250 GB, black box models.
6.13 BOSS-V variant A: Query 26, 250 GB, gray box models.
6.14 BOSS-V variant B: Query 26, 250 GB, black box models.
6.16 BOSS-V variant C: Query 26, 250 GB, black box models.
6.17 BOSS-V variant C: Query 26, 250 GB, gray box models.
6.18 BOSS-V variant D: Query 26, 250 GB, black box models.
6.19 BOSS-V variant D: Query 26, 250 GB, gray box models.
6.20 BOSS-V: Query 26, 250 GB. Cumulative costs plots comparison.
6.21 BOSS-V variant A: Query 26, 750 GB, black box models.
6.22 BOSS-V variant A: Query 26, 750 GB, gray box models.
6.23 BOSS-V variant A: Query 26, 750 GB, gray box models. Best evaluations.
6.24 BOSS-V variant B: Query 26, 750 GB, black box models.
6.25 BOSS-V variant B: Query 26, 750 GB, gray box models.
6.26 BOSS-V variant B: Query 26, 750 GB, gray box models. Best evaluations.
6.27 BOSS-V variant C: Query 26, 750 GB, black box models.
6.28 BOSS-V variant C: Query 26, 750 GB, gray box models.
6.29 BOSS-V variant D: Query 26, 750 GB, black box models.
6.30 BOSS-V variant D: Query 26, 750 GB, gray box models.
6.31 BOSS-V variant A: Query 26, 1000 GB, black box models.
6.32 BOSS-V variant A: Query 26, 1000 GB, black box models.
6.33 BOSS-V variant A: Query 26, 1000 GB, gray box models.
6.34 BOSS-V variant B: Query 26, 1000 GB, black box models.
6.35 BOSS-V variant B: Query 26, 1000 GB, gray box models.
6.36 BOSS-V variant C: Query 26, 1000 GB, black box models.
6.37 BOSS-V variant C: Query 26, 1000 GB, gray box models.
6.38 BOSS-V variant D: Query 26, 1000 GB, black box models.
6.39 BOSS-V variant D: Query 26, 1000 GB, gray box models.
6.40 BOSS-V variant A: Query 26, 1000 GB, black box with LR model training.
6.41 BOSS-V variant A: Query 26, 1000 GB, gray box with LR model training.
6.42 BOSS-V variant B: Query 26, 1000 GB, black box with LR model training.
6.43 BOSS-V variant B: Query 26, 1000 GB, gray box with LR model training.
6.44 BOSS-V variant C: Query 26, 1000 GB, black box with LR model training.
6.45 BOSS-V variant C: Query 26, 1000 GB, gray box with LR model training.
6.46 BOSS-V variant D: Query 26, 1000 GB, black box with LR model training.
6.47 BOSS-V variant D: Query 26, 1000 GB, gray box with LR model training.
A.1 BOSS-V variant A: K-means, 5 million rows, black box models.
A.2 BOSS-V variant A: K-means, 5 million rows, gray box models.
A.3 BOSS-V variant B: K-means, 5 million rows, black box models.
A.4 BOSS-V variant B: K-means, 5 million rows, gray box models.
A.5 BOSS-V variant C: K-means, 5 million rows, black box models.
A.6 BOSS-V variant C: K-means, 5 million rows, gray box models.
A.8 BOSS-V variant D: K-means, 5 million rows, gray box models.
A.9 BOSS-V: K-means, 5 million rows. Cumulative costs plots comparison.
A.10 BOSS-V variant A: K-means, 10 million rows, black box models.
A.11 BOSS-V variant A: K-means, 10 million rows, gray box models.
A.12 BOSS-V variant B: K-means, 10 million rows, black box models.
A.13 BOSS-V variant B: K-means, 10 million rows, gray box models.
A.14 BOSS-V variant C: K-means, 10 million rows, black box models.
A.15 BOSS-V variant C: K-means, 10 million rows, gray box models.
A.16 BOSS-V variant D: K-means, 10 million rows, black box models.
A.17 BOSS-V variant D: K-means, 10 million rows, gray box models.
A.18 BOSS-V: K-means, 10 million rows. Cumulative costs plots comparison.
A.19 BOSS-V variant A: K-means, 15 million rows, black box models.
A.20 BOSS-V variant A: K-means, 15 million rows, gray box models.
A.21 BOSS-V variant B: K-means, 15 million rows, black box models.
A.22 BOSS-V variant B: K-means, 15 million rows, gray box models.
A.23 BOSS-V variant C: K-means, 15 million rows, black box models.
A.24 BOSS-V variant C: K-means, 15 million rows, gray box models.
A.25 BOSS-V variant D: K-means, 15 million rows, black box models.
A.26 BOSS-V variant D: K-means, 15 million rows, gray box models.
A.27 BOSS-V: K-means, 15 million rows. Cumulative costs plots comparison.
A.28 BOSS-V variant A: K-means, 20 million rows, black box models.
A.29 BOSS-V variant A: K-means, 20 million rows, gray box models.
A.30 BOSS-V variant B: K-means, 20 million rows, black box models.
A.31 BOSS-V variant B: K-means, 20 million rows, gray box models.
A.32 BOSS-V variant C: K-means, 20 million rows, black box models.
A.33 BOSS-V variant C: K-means, 20 million rows, gray box models.
A.34 BOSS-V variant D: K-means, 20 million rows, black box models.
A.35 BOSS-V variant D: K-means, 20 million rows, gray box models.


List of Tables

3.1 Workload features.
3.2 Black and Gray box model features.
3.3 Workload data sizes in different scenarios.
4.1 Hyperparameters provided for regression methods.
6.1 Benchmark test: 2D. Iterations needed to reach a fixed tolerance.
6.2 Benchmark test: 2D. Best result across 10 runs.
6.3 Benchmark test: 6D. Adaptive approach. Iterations needed to reach a fixed tolerance.
6.4 Benchmark test: 6D. Best result across 10 runs.
6.5 Benchmark test: 6D. Non adaptive approach. Iterations needed to reach a fixed tolerance.
6.6 BOSS-V: Query 26, 250 GB. Cumulative costs values comparison.
6.7 BOSS-V extrapolation summary: Query 26, black box, without data.
6.8 BOSS-V extrapolation summary: Query 26, black box, with data.
6.9 BOSS-V extrapolation summary: Query 26, gray box, without data.
6.10 BOSS-V extrapolation summary: Query 26, gray box, with data.
6.11 BOSS-V extrapolation summary: K-means, black box, without data.
6.12 BOSS-V extrapolation summary: K-means, black box, with data.
6.13 BOSS-V extrapolation summary: K-means, gray box, without data.
6.14 BOSS-V extrapolation summary: K-means, gray box, with data.
A.1 BOSS-V: K-means, 5 million rows. Cumulative costs values comparison.
A.2 BOSS-V: K-means, 10 million rows. Cumulative costs values comparison.
A.3 BOSS-V: K-means, 15 million rows. Cumulative costs values comparison.


List of Algorithms

2.1 Bayesian optimization (BO) algorithm
4.1 Sequential Forward Floating Selection (SFFS) algorithm
5.1 Feature optimization (BOSS-V) algorithm


Introduction

Big data analytics is employed in several industries to allow organizations and companies to make better decisions. Its main applications include improving customer satisfaction, predicting market behavior, and improving processes in public health [61, Maros et al., 2019]. The most suitable execution environment for big data analytic applications is a cluster of virtual machines (VMs). This type of infrastructure often relies on distributed programming platforms, where problems are divided into tasks, each of which is solved by one or more computers, which operate concurrently and can fail independently without affecting the whole system's uptime. These platforms offer a suitable execution environment for big data applications by allowing the adjustment of the allocated resources to match the application's current needs. However, a poor choice of the latter may cause higher execution times and incur economic losses. Due to the diverse behaviors and resource requirements (CPU, memory, disk, network) of analytic jobs, choosing the best configuration to minimize completion times and reduce costs for a broad spectrum of applications is a challenging process. The goal of this thesis is to identify the minimum-cost cluster configuration of big data applications, at deployment time, and to guarantee that applications run within an a priori deadline.

Clearly, the execution cost of big data applications can only be evaluated at application completion. This makes it impossible to obtain any prior knowledge about the structure of the execution cost function, like convexity or linearity, and its dependence on the cloud configuration parameters. Functions that are expensive to evaluate and lack known structure that would make them easy to optimize using techniques that leverage such structure to improve efficiency are known as black-box functions. Due to the aforementioned characteristics of the objective function, the execution cost, we employ a Bayesian approach to global optimization. Indeed, Bayesian optimization is an iterative approach to optimizing black-box objective functions by setting a prior distribution on the objective function and combining it with evidence (data) to build a surrogate for the objective.


Furthermore, Bayesian optimization has the nice property of reducing the number of evaluations of the objective by modelling the uncertainty to guide the exploration of the search space. This property is a fundamental necessity in the framework of big data applications. Several articles in the literature show the advantages of the Bayesian approach to global optimization and, thanks to its flexibility, its application in the most diverse fields. Regarding the optimization problem at hand, a solution was proposed by Alipourfard et al. [9, 2017], who developed a framework called CherryPick that leverages Bayesian optimization for predicting near-optimal cloud configurations. Other applications see Bayesian optimization being used as a multi-objective optimization technique to identify the Pareto front of a Deep Neural Network (DNN) system for performance optimization [44, Iqbal et al., 2020], or to optimize DNN hardware accelerators for both accuracy and energy efficiency [79, Reagen et al., 2017]. Bayesian optimization has also been used to identify the cluster topology that minimizes the average hop-length in [106, Xu et al., 2020], to tune hyperparameters of complex systems via MLE maximization [50, Klein et al., 2017], and for auto-tuning of parameters [24, Dalibard et al., 2017].

However, in this work we focus on applications with high-dimensional cloud configurations. In particular, we analyse big data analytic benchmarks from the TPC-DS industry benchmark [4] and Sparkbench [2]. Bayesian optimization, by contrast, is mainly limited to low-dimensional input spaces (the objective function's domain) and moderately sized data sets, the reason being that it suffers from the well-known curse of dimensionality. Indeed, Bayesian optimization guides the exploration of the search space by quantifying the spatial cross-correlations of the observations in covariance structures that need to be inverted repeatedly, and the number of points needed to explore the input space in its entirety increases exponentially with the dimension. This issue, due to the computational complexity of Bayesian optimization being $O(N^3)$, where $N$ is the number of observations, leads to a severe bottleneck in the presence of moderately big datasets. Addressing this challenge has received great attention over the last decades, and several methods have been proposed to alleviate the computational cost and bring the scaling down to $O(d^2 N)$, where $d \ll N$ is the dimensionality of the input space. However, this becomes impractical for high-dimensional data sets. Furthermore, the experiments conducted by Wang et al. [102, 2013] confirm that Bayesian optimization degrades above a tipping point of about 15-20 dimensions, performing rather poorly just above this critical dimensionality and merely tying with random search. Thus, Snelson and Ghahramani [87, 2012] propose a dimensionality scaling approach and employ a technique that can be understood as performing supervised dimensionality reduction, whereas Wang et al. [102, 2013] propose a random embedding technique to solve high-dimensional problems with "low effective dimensionality". On the other hand, Perdikaris et al. [74, 2016] employ a frequency-domain learning technique that entirely avoids costly matrix inversions and can be applied to high-dimensional data. Lastly, Brochu et al. [14, 2010] point out that Bayesian optimization methods do not use feature selection techniques for dimensionality reduction.


Feature selection techniques are applied in a variety of high-dimensional problems to alleviate the effect of the curse of dimensionality [64, Miao et al., 2016] and to improve algorithms' generalization performance [107, Yang et al., 2012].

At the time of writing this thesis, and to the best of our knowledge, feature selection techniques have not been used to perform dimensionality reduction for Bayesian optimization. Furthermore, Maros et al. [61, 2019] show that the applications we focus on have "low effective dimensionality" and that their completion time can be expressed as a linear combination of the cloud configuration features. Therefore, to overcome the issue with dimensionality, we propose a novel approach based on performing supervised dimensionality reduction via feature selection techniques which employ Linear Regression (LR) models. Feature selection techniques may help alleviate the issues arising from the application of Bayesian optimization to the high-dimensional optimization problem we aim to solve, while the LR models guide the feature selection technique into choosing those dimensions which are the most relevant to the optimization problem and which capture the application performance behaviour. In summary, we develop a novel solution for high-dimensional Bayesian optimization by taking into account previous approaches developed by Maros et al. [61, 2019] and findings by Snelson and Ghahramani [86, 2006] and Wang et al. [102, 2013]. Specifically, we perform supervised dimensionality reduction via feature selection techniques, and we learn the low-dimensional structure by regressing on a low-dimensional projection of the inputs, in order to apply Bayesian optimization to problems with "low effective dimensionality". To that end, we develop BOSS-V, an algorithm that consists of two main components: a Bayesian optimization procedure, for the selection of near-optimal cloud configurations, and a feature selection technique, which enables us to perform Bayesian optimization, in order to identify the minimum-cost cluster configuration of high-dimensional big data applications.

We tested the performance of our algorithm by considering applications which are representative of different types of workloads under several scenarios: in particular, Query 26 from the TPC-DS industry benchmark [4] and K-means from Sparkbench [2]. Query 26 is an interactive query, representative of SQL-like workloads, whereas K-means, an unsupervised learning statistical technique, is representative of an iterative workload, usually characterized by larger execution time variability. The scenarios we consider derive from Machine learning (ML) gray box and black box performance prediction models, also adopted by Maros et al. [61, 2019]. Our experiments show that our algorithm performs satisfactorily, yielding near-optimal configurations in most scenarios considered. However, in the other scenarios most configurations found by the algorithm yield an application completion time which does not meet the a priori deadline. In order to guarantee that applications run within the a priori deadline, we develop three variants of the BOSS-V algorithm. We call the baseline algorithm, which performs Bayesian optimization and feature selection, variant A. The three variants are called variants B, C and D. They employ the LR model to predict an application's completion time and use this information to varying degrees. For instance,


variant B uses it to derive a coefficient and modulate the surrogate of the objective, whereas variant C uses it to prevent the exploration of those configurations which would not meet the deadline. Variant D applies both methods used in variants B and C and, therefore, yields better results in those scenarios in which the LR model is more accurate. However, numerical results show that variant C was frequently the best solution on black box models. Lastly, it is worth mentioning that the Bayesian method has asymptotic convergence properties, which we will discuss in this thesis by drawing on the work carried out by Mockus [68, 1989], Vazquez and Bect [97, 2010] and Bull [15, 2011]. The introduction of dimensionality reduction in the optimization procedure, however, may affect convergence, and we will describe the conditions under which our optimization algorithm recovers its convergence properties. The question that is of most significance to us is whether the optimization procedure will find a near-optimal configuration among the very first iterations, reducing the number of required evaluations: our analyses show this to be the case.

The thesis is organized as follows: in the first three chapters we describe what has been done to solve this complex optimization problem, and in the latter three we describe our contribution and the methods we employ to reach our goal. Specifically, in Chapter 1 we outline the tools we employ in this thesis; in Chapter 2 we describe in detail the Bayesian approach to global optimization, including its shortcomings and the most commonly used techniques for hyperparameter tuning; in Chapter 3 we introduce the optimization problem, and in Chapter 4 we specify the feature selection technique we employ to reduce its dimensionality; in Chapter 5 we detail the workflow of the process and the algorithm we built to solve the optimization problem; lastly, we illustrate the performance of our algorithm in Chapter 6. Conclusions are drawn in the final chapter.


CHAPTER 1

Optimization techniques

“ In God we trust, all others bring data. ”

W. Edwards Deming

In this chapter the reader will be presented with the most elementary description of the optimization problem in Section 1.1 and with the most suitable optimization tools and techniques that we will employ during the thesis in Section 1.2.

1.1 An introduction to Optimization

Mathematical optimization, or mathematical programming, attempts to develop solutions for efficiently identifying the best element, with regard to some criterion, from some set of available alternatives. An enormous body of scientific literature has been devoted to the problem of optimizing a real-valued function $f(x)$ over a compact set $\mathcal{A}$. In the realm of optimization, this problem is formulated concisely as follows:

\[
\min_{x \in \mathcal{A} \subset \mathbb{R}^d} f(x). \tag{1.1}
\]

Without loss of generality, an optimization problem is equivalent to a minimization problem, since minimization can be converted to maximization by a change of sign. For problems in finitely many continuous variables, which we concentrate on here, $\mathcal{A} \subset \mathbb{R}^d$ may be specified by a number of side conditions, called constraints, on $x = (x_1, \ldots, x_d)$.

The goal of optimization is finding an optimal solution $x^*$. An optimal solution is a feasible solution, i.e., one satisfying all constraints, at which the minimum of $f$ relative to some set is attained. A solution is optimal in a global sense when optimization is performed with respect to the set $\mathcal{A}$, and we define

\[
x^* = \arg\min_{x \in \mathcal{A} \subset \mathbb{R}^d} f(x).
\]

On the other hand, a solution is optimal in a local sense when

\[
f(x^*) \leq f(x) \quad \forall x \in B_r(x^*),
\]

where $B_r(x^*)$ is the subset of points within a specified distance $r$ from the optimum. Global minima are also local minima, whereas the converse is not true, unless $f(x)$ is convex. In solving an optimization problem one typically assumes that the objective function $f(x)$ has a known mathematical representation, is convex, or can be computed efficiently. Optimization problems of all sorts arise in all quantitative disciplines, from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries [26, Du et al., 2009]. Major subfields include the following:

• Convex programming: studies the case when the objective function is convex (minimization) or concave (maximization) and the constraint set is convex.

• Linear programming: a type of convex programming, studies the case in which the objective function f is linear and the constraints are specified using only linear equalities and inequalities. Such a constraint set is called a polytope if it is bounded.

• Nonlinear programming: studies the general case in which the objective function or the constraints or both contain nonlinear parts. This may or may not be a convex program. In general, whether the program is convex affects the difficulty of solving it.

• Integer programming: studies the case in which some or all the variables x are restricted to be integers. It is called mixed integer programming when some, but not all, variables are restricted to be integers, and pure integer programming when all decision variables must be integers. Integer programming has real-life applications in both linear and nonlinear problems.

• Stochastic programming: studies the case in which some of the constraints or parameters depend on random variables, function measurements are noisy, or inputs in the search process are random. It takes advantage of the fact that the probability distributions governing the data are known or can be estimated.

• Multi-objective optimization: studies the case in which the objective is a vector-valued function. Each component of f corresponds to an objective that must be optimized simultaneously with the others. In the search for optimality a trade-off must be created among the many objectives. The set of trade-off designs that cannot be improved upon according to one criterion (objective) without hurting another criterion is known as the Pareto set.

• Constraint programming: studies the case in which relations between variables are stated in the form of constraints. These constraints can be either "soft" or "hard". Hard constraints denote conditions that must not be violated by the algorithm, whereas soft constraints discourage the algorithm from entering some areas of the search space. A fairly common approach is to use a penalty method, which consists of assigning a penalty cost to constraint violations and adding it to the criterion. The measure of violation is nonzero when the constraints are violated and zero in the region where they are not.

The focus of this thesis is solving a single-objective, nonlinear, constrained global optimization problem with noisy function evaluations in high dimension ($d \sim 100$): namely, minimizing the execution cost of big data analytic jobs via identification of the best cloud configuration. However, the evaluation of big data application execution times is an expensive process and, clearly, can be carried out only at application completion. Thus, there is no way of knowing beforehand whether the objective function is convex, or any other information regarding its structure. All that is known about the objective are its values $y = f(x) + \varepsilon$, corrupted by noise, at some (few) cloud configurations $x$. In the framework of optimization, objective functions with no known structure that are expensive to evaluate are called black-box functions. We will describe the optimization problem in its entirety in Section 3.1. However, assumptions can be made about the regularity of the objective function. For instance, the following assumptions are going to be made in order to develop an algorithm capable of solving the problem at hand:

1. Compactness of the domain $\mathcal{A} \subset \mathbb{R}^d$;

2. Continuity of the objective function, $f \in C^0(\mathcal{A})$;

3. Existence of only one global minimum (not a multi-modal problem);

4. The noise is normally distributed: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma^2 > 0$.

Under the above hypotheses, the existence of a solution to the optimization problem at hand is given by the extreme value theorem of Karl Weierstrass (1815-1897). It states that a continuous real-valued function on a compact set attains its maximum and minimum values. More generally, a lower semi-continuous function on a compact set attains its minimum point, and an upper semi-continuous function on a compact set attains its maximum point.

It is worth noting that uniqueness is not guaranteed in real applications. Indeed, optimization problems are often multi-modal; that is, they possess multiple good solutions.


They could all be globally good (same cost function value) or there could be a mix of globally good and locally good solutions. Obtaining all (or at least some of) the multiple solutions is the goal of a multi-modal optimizer. Classical optimization techniques, due to their iterative approach, do not perform satisfactorily when they are used to obtain multiple solutions, since it is not guaranteed that different solutions will be obtained even with different starting points in multiple runs of the algorithm. Metaheuristics [100, Voß, 2001], however, are a very popular approach to obtaining multiple solutions in a multi-modal optimization task. Other complete techniques are covered in [71, Neumaier, 2004].

A classical approach in numerical methods is to design a sequence of points

\[
x_n \in \mathcal{A} \subset \mathbb{R}^d, \quad n = 1, 2, \ldots
\]

which converges to the exact solution $x^*$ when $n$ is large enough. However, even with these restrictions, it is not possible in general to find the optimal answer using a finite number of iterations. The problem will be considered successfully solved when a point $\tilde{x}$ within some tolerance $tol > 0$ of the global optimum $x^*$ is found with respect to some closeness measure, e.g., Euclidean distance ($\|\tilde{x} - x^*\| < tol$) or function value ($|f(\tilde{x}) - f(x^*)| < tol$).

In order to develop an algorithm that tackles the optimization problem at hand, the necessary tools will be introduced in the following section.

1.2 Computational optimization techniques

Once an optimization problem has been expressed as a function maximization problem, researchers may use algorithms that terminate in a finite number of steps, or iterative methods that converge to a solution (on some specified class of problems). An optimization algorithm is executed iteratively by comparing various solutions until an optimum or a satisfactory solution is found. In particular, a finite-step method would yield an exact solution in a known finite number of steps if it could be implemented in exact arithmetic, whereas an iterative method constructs a sequence that would converge to the solution after some finite but generally a priori unknown number of steps, resulting in an approximate solution to the problem [94, Lloyd, 1997].

Besides finitely terminating algorithms and convergent iterative methods, there are heuristics. A heuristic is any algorithm which is not guaranteed (mathematically) to find the solution. Its iterates need not converge, but may provide approximate solutions to some problems at a reasonable computational cost. It is designed to solve a problem more quickly when classic methods are too slow, or to find an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, or precision for speed. Well-known heuristics are the Nelder-Mead method [70, Nelder, Mead, 1965] [49, Kelley, 1999] and Tabu search [36, Glover, Laguna, Marti, 2008].


A very active area of research is the design of nature-inspired algorithms [18, Can and Alatas, 2017]. They belong to the metaheuristics family of algorithms which are top-level strategies that guide an underlying heuristic solving a given problem. A more complete treatment and a formal definition of metaheuristics may be found in [100, Voß, 2001].

As more and more optimization algorithms are developed, it becomes necessary to classify them into families. This thesis will use some of the algorithms from this section, but they are not its primary subject.

Stochastic vs. deterministic algorithms

Deterministic algorithms are those methods whose output depends on the input parameters, and possibly an initialization parameter. If the input parameters are the same, two successive runs will generate identical results. On the contrary, stochastic algorithms introduce some indetermination through the generation of random numbers. This indetermination affects the solution, such that two successive runs may yield different results.

Global vs. local search algorithms

Local search algorithms are those methods that start from a candidate solution and then iteratively move to a neighbor solution. The choice of which neighbor to move to is made using only information about the solutions in the neighborhood of the current one, hence the name local search. These methods are generally deterministic, and their main drawback lies in the high correlation between the quality of the result and the quality of the initialization. Local search algorithms are indeed incapable of exploring the search space and may remain confined in a local optimum when no improving configurations are present in the neighborhood. On the contrary, global search algorithms find areas that might contain the global minimum via exploration of the search space, either by stochastic procedures, e.g., random restarts, or by utilization of prior information coming from previous function evaluations, e.g., Bayesian optimization. A recent approach to the global optimization problem is via minima distribution [58, Luo, 2019].

Zero-order optimization algorithms

Another criterion is based on the information exploited by the algorithm to direct the search procedure. The derivatives provide detailed information for such optimizers. While evaluating Hessians and gradients improves the rate of convergence for functions for which these quantities exist and vary sufficiently smoothly, such evaluations increase the computational complexity (or computational cost) of each iteration. In some cases, the computational complexity may be excessively high. Zero-order (or derivative-free) algorithms use only the function value at some positions. They are popular when information about first and/or second derivatives of the objective is difficult to obtain, e.g., when no explicit function forms are given or functions are not differentiable. The main advantage is that few assumptions are made about the objective, since high-order regularity is not a requirement. Examples of zero-order algorithms are Monte Carlo and its adaptive variant [10, Bagaria et al., 2018] and DIRECT [45, Jones, 2001]. Several more are described in Ruffio et al. [80, 2011].
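As a minimal illustration of a zero-order method, the sketch below implements plain random search over a hyper-rectangle; it touches only function values, never derivatives. The quadratic test objective and all settings are hypothetical stand-ins, not part of the methods cited above.

```python
import numpy as np

def random_search(f, bounds, n_iter=1000, seed=0):
    """Zero-order minimization: sample uniformly in A and keep the best point.

    bounds: array of shape (d, 2) with [lower, upper] for each dimension.
    """
    rng = np.random.default_rng(seed)
    lower, upper = bounds[:, 0], bounds[:, 1]
    best_x, best_y = None, np.inf
    for _ in range(n_iter):
        x = rng.uniform(lower, upper)   # candidate point; only f(x) is used
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Hypothetical smooth objective with minimum at (0.5, -0.3)
f = lambda x: np.sum((x - np.array([0.5, -0.3]))**2)
bounds = np.array([[-1.0, 1.0], [-1.0, 1.0]])
print(random_search(f, bounds))
```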

First-order optimization algorithms

First-order algorithms optimize an objective function $f(x)$ using the values of its gradient with respect to the parameters $x$. The first-order derivative provides information regarding whether the function is decreasing or increasing at a particular point. Furthermore, Fermat's theorem exploits the value of the derivative $\nabla f$ at a point $x$ to yield a necessary condition for optimality when $f$ is differentiable.

Gradient Descent (GD), equation (1.2), is the most widely used minimization algorithm of this class, which has the form

\[
x_{n+1} = x_n - \alpha \nabla f(x_n) \tag{1.2}
\]

where $\alpha$ is the step size, not necessarily constant. By taking a step proportional to the positive of the gradient, one instead reaches a local maximum of $f$, and the procedure is known as Gradient Ascent. Variants of GD include stochastic procedures like stochastic average gradient (SAG) [84, Schmidt et al., 2016]. Other methods have been proposed to obtain faster convergence rates than GD in the minimization of strongly convex functions, for instance the heavy ball method, Nesterov's accelerated gradient descent and the triple momentum method [96, Van Scoy et al., 2018].
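A minimal sketch of gradient descent (1.2) with a fixed step size; the quadratic objective and its gradient are hypothetical, assumed available in closed form:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, n_iter=100):
    """Iterate x_{n+1} = x_n - alpha * grad_f(x_n), cf. equation (1.2)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - alpha * grad_f(x)
    return x

# Hypothetical objective f(x) = ||x - c||^2, whose gradient is 2(x - c)
c = np.array([0.5, -0.3])
grad_f = lambda x: 2.0 * (x - c)
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approaches c
```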

First-order stochastic methods are the state-of-the-art in large-scale Machine learning optimization owing to efficient per-iteration complexity [7, Agarwal et al., 2016].

Second-order optimization algorithms

Second-order algorithms use the second-order derivative (Hessian) to optimize the objective function. The Hessian is a matrix of second-order partial derivatives and provides us with information about the local curvature of the objective that may greatly speed up the convergence of the algorithm. Knowledge of the Hessian allows for a better local approximation of the objective in comparison to a first-order method, allowing for the use of methods such as the Trust region method [49, Kelley, 1999], but the objective is required to be twice differentiable. A necessary condition for optimality is that the Hessian be either positive definite (local minimum) or negative definite (local maximum) when evaluated at the candidate optimum. In such a case, Newton's method can be applied to find the roots of the derivative $\nabla f$, which are possible optima of the objective $f$ by Fermat's theorem:

\[
x_{n+1} = x_n - \left[\nabla^2 f(x_n)\right]^{-1} \nabla f(x_n) \tag{1.3}
\]

Often Newton's method (1.3) is modified to include an adaptive step size $\alpha = \alpha(n)$, while using the Armijo-Goldstein-Wolfe conditions to avoid poor choices of step lengths [51, Kolda, Lewis, Torczon, 2003]. The Levenberg-Marquardt algorithm instead uses an adaptive damping variable $\lambda = \lambda(n)$. A variety of secant-update quasi-Newton methods have been developed which approximate the Hessian using first derivatives while trying to incorporate the symmetry and definiteness properties in the approximation algorithm, e.g., the Broyden-Fletcher-Goldfarb-Shanno (BFGS) or the Davidon-Fletcher-Powell (DFP) algorithms. A more detailed description of these methods can be found in [49, Kelley, 1999], Chapters 3 and 4 respectively.
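A sketch of the Newton iteration (1.3), assuming the gradient and Hessian of a (hypothetical) strictly convex quadratic are available; a linear solve replaces the explicit matrix inversion:

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, n_iter=20):
    """Iterate x_{n+1} = x_n - [Hessian(x_n)]^{-1} grad_f(x_n), cf. equation (1.3)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))  # solve instead of inverting
    return x

# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x with A positive definite,
# so grad f = A x - b and the Hessian is the constant matrix A.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_f = lambda x: A @ x - b
hess_f = lambda x: A
print(newton_method(grad_f, hess_f, x0=[0.0, 0.0]))  # reaches the solution of A x = b in one step
```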

Second-order methods, while able to provide faster convergence, have a high complexity per step, which makes them prohibitive in Machine learning applications. Agarwal et al. [7, 2016] developed a second-order stochastic method whose cost is linear in the sparsity of the input data.

In summary, one major criterion for optimizers is the sheer number of required function evaluations. Usually, higher-order algorithms require fewer iterations but a higher number of function evaluations per iteration. Which one is best with respect to the number of function calls depends on the problem itself.

1.2.1 Optimization and Machine Learning

Paraphrasing Arthur L. Samuel's paper [81, 1959], "Machine learning is the art of programming a computer so that it will learn to perform a task better than the programmer himself". Machine learning algorithms use mathematical models in order to perform a specific task and achieve a predefined objective with no explicit instructions. Learning problems are usually formulated as the minimization of some loss function on a training set of examples, where loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances. For example, in classification [52, Kotsiantis, 2007], one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples.

Both Machine learning and optimization attempt to minimize the loss on known samples (the training set). While that is optimization's primary objective, Machine learning's ultimate goal is generalization. Indeed, the cause of poor performance in machine learning is either underfitting or overfitting the data, the latter happening when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Thus, Machine learning algorithms also attempt to minimize the loss on unseen samples (the testing set).

1.2.2 Bayesian optimization methods

Bayesian optimization is an approach to optimizing objective functions with no known structure, like convexity or linearity, that are expensive to evaluate, e.g., drug trials, destructive tests or financial investments. Furthermore, it tolerates stochastic noise in function evaluations and can be employed either as a global method or as a local one.

Bayesian optimization employs the Bayesian approach of setting a prior on the objective function and combining it with evidence (data) to build a surrogate for the objective. By doing so, a new function, called the acquisition, is defined from this surrogate and used to guide the evaluation procedure of the objective and, accordingly, the optimization process. Thus, rather than optimizing the (typically unknown) objective function directly, a new function, the acquisition function, is built on the known information and optimized in order to obtain a candidate point x which yields the maximum improvement with respect to some criterion. This step is key in guiding the search toward those areas of the domain which most likely contain the optimum, with as few function evaluations as possible.

By using the Bayesian method the original multimodal problem, in which the objective is unknown, is replaced by an auxiliary multimodal problem. The advantage of doing so is that the auxiliary problem is a rough prediction of the expected gain at the next observation and fast approximate optimization of the acquisition function is usually sufficient. To that end, any of the optimization methods previously introduced can be used. Lastly, while knowledge of the derivatives of the objective is not required, it is possible to incorporate such information in order to improve the efficiency of the algorithm [56, Lizotte, 2008].

For the above reasons Bayesian optimization is a suitable approach for the class of problems this thesis attempts to solve, and a more detailed description of the Bayesian optimization procedure will be given in the following Chapter 2. In this thesis we will focus on the Bayesian approach to global optimization. The theory behind the Bayesian approach to local optimization is described by Mockus [68, 1989, Chapter 7], while [14, Brochu et al., 2010] and [32, Frazier, 2018] are extensive tutorials on global Bayesian optimization.

1.2.3 Non-Bayesian methods

Several approaches have been proposed over the years that attempt to find global optima and have little to no connection to Bayesian optimization. Most methods balance attraction toward a local optimum with mechanisms to prevent getting stuck there. Some use local optimization techniques, whose objective is to find an optimum (either maximal or minimal) within a neighboring set of candidate solutions, in contrast to a global optimum, which is the optimal solution over the whole domain.

Well-known methods are:

• Random restarts: points are uniformly drawn from the domain $\mathcal{A} \subset \mathbb{R}^d$ and a local optimizer is executed starting from each one (see the sketch after this list).

• Density clustering: a random restart method that employs regions around each local optimum found so far to avoid starting the local optimizer within these regions.

• Simulated annealing: a random walk is defined over the domain $\mathcal{A} \subset \mathbb{R}^d$ in the hope that, in the limit, the walk will converge toward the global optimum.

• Lipschitz methods: used when the objective function is a Lipschitz-continuous function. They yield optimal methods in the minimax sense (see Section 2.2), but require a number of function evaluations which is exponential in the number of dimensions.

• Evolutionary algorithms: in this paradigm, candidate solutions to the optimization problem play the role of individuals in a population, which are evolved over time and evaluated using the objective function f. Those individuals that perform poorly are removed or replaced in the population, and the procedure is repeated until no further progress is made.

A more detailed description of the first four methods can be found in [42, Horst and Pardalos, 2013], and of the latter in [99, Vikhar, 2016].
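As a minimal sketch of the random-restarts strategy above, the following code draws uniform starting points and runs SciPy's local optimizer from each; the multimodal test objective is hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def random_restarts(f, bounds, n_restarts=20, seed=0):
    """Run a local optimizer from uniform random starts; keep the best result."""
    rng = np.random.default_rng(seed)
    lower, upper = bounds[:, 0], bounds[:, 1]
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(lower, upper)                         # uniform draw from A
        res = minimize(f, x0, bounds=list(zip(lower, upper)))  # local search (L-BFGS-B)
        if best is None or res.fun < best.fun:
            best = res
    return best

# Hypothetical multimodal objective: many local minima, global minimum at the origin
f = lambda x: np.sum(x**2) + 2.0 * np.sum(1.0 - np.cos(3.0 * x))
bounds = np.array([[-5.0, 5.0], [-5.0, 5.0]])
print(random_restarts(f, bounds).x)
```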


CHAPTER 2

Bayesian optimization

“ You should not ask questions without knowledge. ”

W. Edwards Deming

In this chapter the reader will be presented with a comprehensive description of the Bayesian optimization method in Sections 2.1, 2.2 and 2.3, as well as with several approaches regarding the introduction of information about the location of the minimum into the optimization procedure in Section 2.5.

2.1 An introduction to Bayesian optimization

Bayesian optimization is an iterative optimization method focused on solving the problem

\[
\min_{x \in \mathcal{A} \subset \mathbb{R}^d} f(x).
\]

It is a suitable approach when the objective function $f$ lacks known structure, like convexity or linearity, that would make it easy to optimize using techniques that leverage such structure to improve efficiency; especially when evaluating the objective function is expensive or even impossible, or when lack of knowledge about first- or second-order derivatives prevents the application of higher-order optimization algorithms [32, Frazier, 2018].

For the purposes of this thesis, the feasible set and objective function have the following properties:

• $\mathcal{A}$ is a hyper-rectangle delimited by hard constraints: $\{x \in \mathbb{R}^d : a_i \leq x_i \leq b_i\}$. Later (Section 2.7) this assumption will be relaxed.

• The objective function $f$ is continuous, $f \in C^0(\mathcal{A})$.


• If the objective function f is obscured by noise, the latter will be assumed independent across evaluations and normally distributed with constant variance.

The likely presence of noise and the absence of known structure of the objective $f$ make a Bayesian approach suitable for modeling the objective. The latter, assumed random, is modeled by a prior probability distribution over functions which may describe attributes of the objective function, such as smoothness or the most likely locations of the minimum. Information about the objective is then collected in the form of evaluations of the latter, and used to update the prior belief about the objective's distribution. Specifically, the statistical model provides a Bayesian posterior probability distribution that describes potential values of $f(x)$ at a candidate point $x$, and is updated each time $f$ is observed at a new point. In this way, previously unknown information, such as the location of the global optimum, is gradually uncovered with every iteration step.

There are other techniques that can be used to optimize expensive derivative-free functions with no known structure and that, like Bayesian optimization, maintain a surrogate that models the objective function. They are called "surrogate methods" [13, Booker et al., 1999]. Bayesian optimization distinguishes itself from other surrogate methods by using surrogates developed using Bayesian statistics, and by deciding where to evaluate the objective using a Bayesian interpretation of these surrogates.

Thus the focus of the optimization algorithm is shifted toward the construction of a structure capable of exploiting the information that comes from the evaluation of the objective and that permits a utility-based selection of the next observation to make on the objective. This structure is known in the Bayesian optimization framework under the name acquisition. The acquisition is a function that measures the utility of sampling the objective function at a new point x based on the current posterior distribution over f, by taking into account, directly or indirectly, the mean and variance of the predictions over the space.

The process of deciding where to sample next requires the choice of a suitable acquisition function and a way of optimizing this acquisition with respect to the posterior distribution of the objective function. The objective is then sampled at the argmax of the acquisition function, its posterior probability distribution is updated, and the process is repeated.
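As a concrete illustration, one common improvement-based acquisition (expected improvement, one of the criteria catalogued in Section 2.5) has a closed form in the posterior mean and standard deviation of the surrogate. The sketch below is hypothetical and written for minimization; the function name and the xi trade-off parameter are our own choices, not notation from this thesis:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """Closed-form expected improvement (for minimization).

    mu, sigma: posterior mean and standard deviation at candidate points.
    y_best: lowest objective value observed so far.
    xi: small margin controlling the exploration-exploitation trade-off.
    """
    sigma = np.maximum(sigma, 1e-12)       # guard against zero predictive variance
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```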

In summary, Bayesian optimization consists of two main components:

• a Bayesian statistical model for modeling the objective function;

• an acquisition function for deciding where to sample next.

Both components will be introduced in a more rigorous way in the following section and analysed in detail in Section 2.3 and Section 2.5 respectively, where several possible acquisition functions will be described.

The main steps of the algorithm are illustrated in the following Algorithm 2.1 using a generic acquisition function.


Algorithm 2.1 Basic pseudo-code for Bayesian optimization

Input: Choose N
1: Place a prior distribution on f
2: Observe f at n0 points (according to an initial space-filling experimental design)
3: Set n = n0
4: while n ≤ N do
5:     Update the posterior probability distribution on f using all available data
6:     Find xn that maximizes the acquisition function over x, where the acquisition function is computed using the current posterior distribution parameters
7:     Observe yn = f(xn)
8:     Increment n
9: end while
Output: Return the point evaluated with the smallest f(x)

Regarding step 2, Jones et al. [46, 1998] recommend using a space-filling Latin hypercube design of $n_0 = 10 \cdot d$ starting points, with $d$ being the dimensionality of the input space, to prevent the algorithm from getting stuck in a local optimum, as shown in [14, Brochu et al., 2010]. The algorithm then iterates until the maximum number of function evaluations $N$ is reached.
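Putting the pieces together, the following sketch mirrors Algorithm 2.1 for a 1D problem, assuming scikit-learn's GaussianProcessRegressor as the statistical model and reusing the expected_improvement function sketched earlier as the acquisition (maximized here by a simple grid search). The objective, the random initial design, and all settings are hypothetical:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimization(f, bounds, n0=10, N=30, seed=0):
    """Minimal BO loop: update the GP posterior, maximize EI, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initial design (uniform here, rather than the recommended Latin hypercube)
    X = rng.uniform(bounds[0], bounds[1], size=(n0, 1))
    y = np.array([f(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    candidates = np.linspace(bounds[0], bounds[1], 1000).reshape(-1, 1)
    for _ in range(N - n0):
        gp.fit(X, y)                                   # step 5: update the posterior
        mu, sigma = gp.predict(candidates, return_std=True)
        ei = expected_improvement(mu, sigma, y.min())  # step 6: acquisition values
        x_next = candidates[np.argmax(ei)]
        y_next = f(x_next)                             # step 7: observe the objective
        X, y = np.vstack([X, x_next]), np.append(y, y_next)
    return X[np.argmin(y)], y.min()                    # output: best point evaluated

# Hypothetical 1D objective
f = lambda x: np.sin(3.0 * x[0]) + 0.5 * x[0]**2
print(bayesian_optimization(f, bounds=(-2.0, 2.0)))
```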

Furthermore, in case the observations were corrupted by noise, the latter would be assumed independent across observations and normally distributed:

\[
y_n = f(x_n) + \varepsilon_n, \quad \text{where } \varepsilon_n \overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \quad n = 1, \ldots, N.
\]

Noise will be discussed in more detail in Section 2.6. In the following section the Bayesian statistical model will be formally introduced.

2.2 Formalization of Bayesian approach in global optimization

In this section the mathematical justification of the Bayesian method is given in terms of information theory [22, Cover and Thomas, 2006]. Specifically, we will describe the conditions under which the search strategy defined in Algorithm 2.1 is optimal. The following description draws on the extensive work carried out by Mockus in [68, 1989] and [66, 1994]; a more complete treatment may be found in the aforementioned book and paper.

In information theory an optimal strategy is defined by associating with each possible strategy a figure of merit and declaring a procedure to be optimal if it achieves the minimum possible value (over all possible strategies) for this criterion [90, Thisted, 2017, Chapter 4.5.2.1]. Two such criteria are the “minimax” criterion, also known as “worst case” analysis, and the “average case” analysis.

Let us start by introducing the setting over which the aforementioned criteria will be defined. Let $C_\mathcal{A}$ be a family of continuous functions defined over $\mathcal{A} \subset \mathbb{R}^d$ to which the objective $f$ belongs. Let $N$ be the total number of observations (evaluations of the objective) made so far at points $x_n$, $n = 1, \ldots, N$, collected in the vector $f(x_{1:N}) = [f(x_1), \ldots, f(x_N)]$. Let $s$ denote an iterative search strategy for the optimum, let $f$ be the function to be searched, let $x_{N+1} = x_{N+1}(s)$ denote the final decision of the method $s$, and let $\delta$ denote the loss (utility) function.

Mockus defined the loss as the deviation of the method $s$ from the global optimum $x^*$ with at most $N$ function evaluations of $f$:

\[
\delta = \delta(f, s) = f(x_{N+1}(s)) - f(x^*), \tag{2.1}
\]

which is a linear function of the difference $f(x_{N+1}(s)) - f(x^*)$. Kushner [53, 1964], instead, used a non-linear loss function:

\[
\delta = \delta(f, s) = H(f(x_{N+1}(s)) - y_0), \tag{2.2}
\]

where $H(\cdot)$ is the Heaviside step function and $y_0$ represents a desirable level of the objective function $f(x)$.

“Worst case” analysis

In a "worst case" analysis, the figure of merit associated with strategy $s$ is $\max_{f \in C_\mathcal{A}} \delta(f, s)$, and a minimax search strategy $s^*$ corresponds to the following condition:

\[
s^* = \arg\min_s \max_{f \in C_\mathcal{A}} \delta(f, s). \tag{2.3}
\]

Thus a “worst case” analysis, or “minimax” approach, means that some property of the method should always be present, even in the worst case, and an optimal method is one which provides minimal, or at least reasonably low, deviation from a global optimum for all functions from a given family [66, Mockus, 1994]. Its main disadvantage is its expensiveness: if the family of problems is broad, the worst case could be very bad, and an enormous number of iterations could be required in order to guarantee any convergence of the method. For instance, when the objective function $f$ belongs to the family of real-valued Lipschitz-continuous functions with Lipschitz constant $C$,

$$|f(x_1) - f(x_2)| \le C \|x_1 - x_2\| \quad \forall x_1, x_2 \in A,$$

any algorithm that provides a solution $x_{N+1}$ such that $|f(x_{N+1}) - f(x^*)| < tol$ or $\|x_{N+1} - x^*\| < tol$, for some $tol > 0$, must make $(C/2\,tol)^d$ function evaluations [11, Betrò, 1991] [42, Horst and Pardalos, 2013], which is exponential in the number of dimensions. This is an incredibly expensive premium to pay for insurance against unlikely scenarios.
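To get a sense of the magnitude of this bound (numbers chosen purely for illustration): with $C = 1$, $tol = 0.01$ and $d = 10$, it already requires

$$\left(\frac{C}{2\,tol}\right)^d = \left(\frac{1}{0.02}\right)^{10} = 50^{10} \approx 9.8 \times 10^{16}$$

function evaluations.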


“Average case” analysis

In global optimization of continuous functions and functions with noise it makes sense to apply an “average case” approach. Indeed, the family of continuous functions $C_A$ defined over $A$ is a noncompact set. Thus, only the “average case” approach can be defined, since it is not possible to define a “worst case” scenario in this setting [66, Mockus, 1994]. On the contrary, a “worst case” analysis can be defined on the set of Lipschitz-continuous functions with constant $C$, and compactness of such a family can be proven by means of the Ascoli–Arzelà theorem.

In an “average case” approach, the notion of average is defined by means of an integral, hence through the introduction of a probability measure $P$ which is part of the problem definition and fixed before any investigation of the latter starts. Thus, $P$ is called the a priori distribution and an “average case” analysis is called a Bayesian approach. The figure of merit associated with strategy $s$ is the a priori expected loss

$$\int_{C_A} \delta(f, s)\, dP(f)$$

and an optimal search strategy $s^*$ corresponds to the following condition:

$$s^* = \arg\min_s \int_{C_A} \delta(f, s)\, dP(f). \qquad (2.4)$$

In doing so, the function to be minimized is considered a sample of some stochastic process defined by the a priori distribution $P$, the existence of which has been proved in [68, 1989, Chapter 3]. Both $P$ and the search strategy depend on the available observations $f(x_{1:N})$ of the objective. Mockus proved that the search strategy $s^*$ is optimal both under uncertainty (noisy evaluations) and without [68, 1989, Theorem 3.4.1], and that it is defined by the solution of a system of recurrent equations:

$$\begin{cases}
v_N(f(x_{1:N})) = \min_{x \in A} E_N[f(x) \mid f(x_{1:N})] \\
v_{n-1}(f(x_{1:n-1})) = \min_{x \in A} E_N[v_n(f(x_{1:n-1}), f(x)) \mid f(x_{1:n-1})], \quad n = N, \ldots, 2 \\
v_0 = \min_{x \in A} E_N[v_1(f(x))]
\end{cases} \qquad (2.5)$$

where $E_N[f(x) \mid f(x_{1:N})]$ denotes the conditional expectation of the random variable $f(x)$ with respect to the random vector $f(x_{1:N})$, and the existence of the conditional distribution $P_N(f(x) \mid f(x_{1:N}))$ is assumed. The solution is a sequence of decision functions $s^* = [s_0, \ldots, s_N]$ which defines the relation between $x_{N+1}$ and $f(x_{1:N})$:

$$\begin{cases}
x_{N+1} = s_N(f(x_{1:N})) \\
x_n = s_{n-1}(f(x_{1:n-1})), \quad n = N, \ldots, 2 \\
x_1 = s_0.
\end{cases} \qquad (2.6)$$


In summary, the search strategy is defined by the conditional distribution $P_N$, and it uses the available observations $f(x_{1:N})$ to guide the exploration of the space.
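The average-case criterion can also be illustrated numerically. The following sketch (not part of the original treatment) estimates the expected loss (2.1) of a fixed, non-adaptive strategy by Monte Carlo, drawing sample functions from a Gaussian process prior on a fine grid; the squared-exponential kernel is an assumed, illustrative choice.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential covariance on a 1-d grid."""
    diff = np.subtract.outer(x1, x2)
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

grid = np.linspace(0, 10, 200)
K = rbf(grid, grid) + 1e-10 * np.eye(len(grid))   # jitter for stability

def average_loss(query_idx, n_draws=500, seed=0):
    """Monte Carlo estimate of the expected loss (2.1) for the strategy
    that evaluates f at `query_idx` and returns the best point found."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_draws):
        f = rng.multivariate_normal(np.zeros(len(grid)), K)  # f ~ P
        losses.append(f[query_idx].min() - f.min())          # delta(f, s)
    return np.mean(losses)

uniform_design = np.linspace(0, len(grid) - 1, 10, dtype=int)
print(average_loss(uniform_design))   # a priori expected loss of this strategy
```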

One-step approximation

Unrolling the recurrence relation in (2.5) all the way down to the first acquisition yields an equation which is known as the Bellman equation in the problem of acting optimally in finite-horizon Markov Decision Processes. Its complexity prevents analytical or numerical investigation, and therefore some approximation is indispensable.

Mockus proposed a one-step approximation where the next observation $x_{N+1}$ is considered as the last one:

$$x_{N+1} = \arg\min_{x \in A} E_N[\min(\mu_{min}, f(x)) \mid f(x_{1:N})], \qquad (2.7)$$
$$\mu_{min} = \min_{x \in A} \mu_N(x).$$

Method (2.7) is known as expected improvement in the Bayesian optimization framework and its convergence is connected with the a priori distribution $P$. Indeed, $P$ can be considered correct if it provides convergence to the global minimum of any continuous function when $N$ is large (asymptotic optimality) [66, Mockus, 1994]. Such a condition is a delicate issue.

A global optimization algorithm converges for all continuous functions if and only if the sequence of evaluation points produced by the algorithm is dense for every continuous function, as explained by Vazquez and Bect [97, 2010] and references therein. They show that when $P$ is a fixed Gaussian process prior of finite smoothness, expected improvement converges on the minimum of any $f$ in the reproducing-kernel Hilbert space associated with $P$, and $P$-almost surely for continuous $f$.

The one-step Bayesian method has good asymptotic behavior. However, its main advantage is that it minimizes the expected deviation from the global minimum for any fixed number of observations.
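Anticipating the Gaussian process posterior of Section 2.3, criterion (2.7) admits a well-known closed form. The following sketch uses the standard formulation with the best observed value $y_{min}$ as the incumbent (the variant with $\mu_{min}$ is analogous):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    """Closed-form expected improvement (for minimization) under a
    Gaussian posterior with mean `mu` and standard deviation `sigma`."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```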

Axiomatic approach

Rather than a stochastic process, it is possible to use a more general statistical model, such as a family of random variables. Calvin and Žilinskas [17, 2002] argue that the definition of a rational optimization method is not as obvious as (2.4) in such cases, and an axiomatic approach is used. Having defined the loss function as in (2.2), Kushner [53, 1964] proposed the following method:

$$x_{N+1} = \arg\max_{x \in A} P_N(f(x) < y_{min} \mid f(x_{1:N})), \qquad (2.8)$$
$$y_{min} = \min_{n \le N} f(x_n).$$


This algorithm, known as probability of improvement, may also be applied when a stochastic process is used to model minimization problems; its convergence is proved in [16, Calvin and Žilinskas, 1999].
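Under a Gaussian posterior with mean $\mu_N(x)$ and variance $\sigma_N^2(x)$, the probability in (2.8) reduces to a normal CDF evaluation; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_min):
    """P_N(f(x) < y_min | f(x_{1:N})) under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero variance
    return norm.cdf((y_min - mu) / sigma)
```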

Acquisition function

Methods (2.7) and (2.8) define two different nearly-optimal search strategies that provide the next point $x_{N+1}$ to be acquired by exploiting information regarding the objective. The methods differ in the way they collect and use information about the objective to choose the next point, that is, in the acquisition criterion.

From an information-theoretic perspective, the goal of an acquisition function is to learn about the objective function $f(x)$ as rapidly as possible in order to select where to evaluate next. In Bayesian Experimental Design (ED), see [59, MacKay, 1992] and [19, Chaloner and Verdinelli, 1995], the informativeness of a set of sampling points about the objective, $f(x_{1:N})$, is measured by the information gain (see [22, Cover and Thomas, 2006, pag. 20], [35, Ghojogh et al., 2019]),

$$I(f(x_{1:N}); f(x)) = H(f(x)) - H(f(x) \mid f(x_{1:N})) \qquad (2.9)$$

which is the mutual information between $f$ and its observations $f(x_{1:N})$ and quantifies the reduction in uncertainty about $f$ from revealing $f(x_{1:N})$. $H$ is the (Shannon) entropy of a continuous probability distribution $p(x)$ of a continuous random variable $X$, defined as

$$H(X) = -\int p(x) \log(p(x))\, dx. \qquad (2.10)$$

This analysis gives rise to several information-based acquisition functions described in Section 2.5, where a more detailed explanation of the methods (2.7) and (2.8) will be given.
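For instance (a standard closed form, not derived in the text above), for the Gaussian posterior used throughout this work, $f(x) \mid f(x_{1:N}) \sim \mathcal{N}(\mu_N(x), \sigma_N^2(x))$, the entropy (2.10) becomes

$$H(f(x) \mid f(x_{1:N})) = \frac{1}{2}\log\left(2\pi e\, \sigma_N^2(x)\right),$$

so the information gain (2.9) at a point $x$ is determined entirely by how much the observations shrink the posterior variance $\sigma_N^2(x)$.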

Consequences of the one-step optimality

It was shown by Mockus that expected improvement (2.7) is asymptotically optimal. However, both Calvin and Žilinskas [17, 2002] and Mockus [68, 1989] argue that the one-step algorithm performs a very local search, seeking the minimum in the neighborhood of the best found points. They explain that this behavior is a consequence of the one-step optimality: “if only one observation is planned then it seems not rational to risk but better to make an observation near the best point, where an improvement (maybe only small) is likely”, and therefore the influence of the remaining observations $N+2, N+3, \ldots$ is neglected in equation (2.7). While Mockus suggests the introduction of a parameter $\xi$ that would act as a trade-off between exploration of the space and exploitation of the best found observations, in order to improve the efficiency of the search and make the method more global, Calvin and Žilinskas argue that it is difficult to justify such a modification theoretically. The same tuning parameter can be applied to probability of improvement to achieve similar results.
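One common form of this modification, as presented in the tutorial of Brochu et al. [14, 2010] and written here for the minimization setting of this chapter, applies the trade-off parameter to probability of improvement (2.8):

$$PI_\xi(x) = P_N\left(f(x) < y_{min} - \xi \mid f(x_{1:N})\right) = \Phi\left(\frac{y_{min} - \xi - \mu_N(x)}{\sigma_N(x)}\right),$$

where $\Phi$ is the standard normal CDF; larger values of $\xi$ demand larger improvements before a point is considered attractive, pushing the search away from the incumbent and toward unexplored regions.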


Adaptive Bayesian approach

In the definition of Bayesian optimality (2.4) it was assumed that the a priori distribution $P$ is completely known before observations are started. Mockus [68, 1989] argues that it would be more realistic to assume that at least some parameters $\eta$ of $P$ are unknown and can be estimated only on the basis of observed values. By doing so, the probability $P$ is not fixed and can change under the influence of new observations, and therefore the method

$$\min_s \int_{C_A} \delta(f, s)\, dP_n(f), \quad n = 0, 1, \ldots, N \qquad (2.11)$$

is not well-defined, since $P_n$ depends on $n$. The one-step approximation can be used in this case to give meaning to (2.11), yielding the method

$$\min_{s_n} \int_{C_A} \delta(f, s_n)\, dP_n(f), \quad n = 0, 1, \ldots, N \qquad (2.12)$$

where each component $s_n$ of the search strategy $s$ is chosen under the assumption that the distribution $P_n$ will not change later.

In adapting the parameters of the distribution, two issues arise: convergence of the Bayesian method may fail, and the Kolmogorov consistency conditions are violated. The latter ensure that a sample path of the same stochastic process is considered during the optimization process. Mockus [68, 1989, Chapter 5] argues that it is reasonable to omit those conditions if by doing so one gets some computational advantage, because the stochastic model adapts to the observed data and therefore the performance of the optimization is improved. Thus, the Kolmogorov consistency conditions are no longer necessary if after each iteration $n$ different stochastic models of the function to be minimized are considered. Other conditions must be defined to provide consistency. They are:

1. Continuity of the acquisition function;
2. Convergence of the adaptive method (2.12) to a global minimum of any continuous function;
3. Simplicity of the expressions of $\mu_N(x)$ and $\sigma_N^2(x)$.

Regarding convergence of the Bayesian method, Bull [15, 2011] proved that for standard priors with estimated parameters there exist smooth functions $f$ on which the expected improvement acquisition function derived from (2.12) does not converge. He goes on to provide optimal parameters for the posterior variance that guarantee convergence in the noiseless case [15, Bull, 2011, Chapter 3.3]. In the noisy case, the standard one-step Bayesian method (2.7) should be used instead of the adaptive one if convergence of the method is necessary. Otherwise, the adaptive method can be used, which does not necessarily converge but provides some improvement in the average sense with regard to the initial point [68, Mockus, 1989, pag. 85].


2.3 Gaussian processes in Bayesian optimization

A Bayesian optimization method depends on a prior distribution by definition. A large family of distributions satisfy the aforementioned consistency conditions, hence some additional conditions were introduced by Mockus [66, 1994] to narrow this family, setting the framework for the Gaussian process prior which is widely adopted in machine learning today. Those conditions are the following:

1. Continuity of sample functions $f(x)$;
2. Homogeneity of the a priori distribution $P$;
3. Independence of the $d$-th differences.

Condition 2 means that the distribution $P$ of the sample $f(x)$ defined on the compact domain $A$ is invariant with respect to homeomorphisms of $A$ [104, Woll, 1959], e.g., translations or rotations of the coordinates about the origin. Condition 3 means that the partial differences of $d$-th order (regarded as discrete approximations of derivatives of the objective function) are stochastically independent:

$$\Delta_{\zeta_1}(x_1, \ldots, x_d) = f(x_1 + \zeta_1, \ldots, x_d) - f(x_1, \ldots, x_d)$$
$$\Delta_{\zeta_1 \zeta_2}(x_1, \ldots, x_d) = \Delta_{\zeta_1}(x_1, x_2 + \zeta_2, \ldots, x_d) - \Delta_{\zeta_1}(x_1, \ldots, x_d)$$
$$\cdots$$
$$\Delta_{\zeta_1 \ldots \zeta_d}(x_1, \ldots, x_d) = \Delta_{\zeta_1 \ldots \zeta_{d-1}}(x_1, \ldots, x_d + \zeta_d) - \Delta_{\zeta_1 \ldots \zeta_{d-1}}(x_1, \ldots, x_d)$$

where $\zeta_i > 0$, $i = 1, \ldots, d$. Condition 3 is the weakest condition compatible with continuity of samples, condition 1. From conditions 1–3 it follows that the stochastic process $f(x)$ is Gaussian [48, Katkauskaite, 1972].
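A small numerical illustration of this recursive definition (added here for concreteness, with an arbitrary test function):

```python
import numpy as np

def first_difference(f, x, i, zeta):
    """Delta_zeta along coordinate i: f(..., x_i + zeta, ...) - f(x)."""
    x_shifted = np.array(x, dtype=float)
    x_shifted[i] += zeta
    return f(x_shifted) - f(x)

def second_difference(f, x, zeta1, zeta2):
    """Mixed difference Delta_{zeta1 zeta2}, built by recursion."""
    inner = lambda z: first_difference(f, z, 0, zeta1)  # Delta_{zeta1}
    return first_difference(inner, x, 1, zeta2)         # difference in x_2

f = lambda x: x[0] ** 2 * x[1]
print(second_difference(f, np.array([1.0, 1.0]), 0.1, 0.1))
# prints 0.021, i.e. (2*x_1*zeta1 + zeta1**2) * zeta2 at x = (1, 1)
```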

A Gaussian process (GP) is a collection of random variables, any finite subset of which has a joint multivariate Gaussian distribution. It represents a distribution over functions and is completely specified by its mean function $m$, known as the prior mean, and its covariance function $k$, known as the kernel. A more complete treatment of Gaussian processes may be found in [103, Rasmussen and Williams, 2006].

The Gaussian process is used to define the a priori distribution over the objective:

$$f(x) \sim GP(m(x), k(x, x'))$$
$$m(x) = E[f(x)]$$
$$k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))].$$

In order to sample from the prior, a sequence of points $x_{1:N}$ is chosen and the vector $f(x_{1:N})$ is drawn from the corresponding multivariate normal distribution $\mathcal{N}(m(x_{1:N}), K)$, where $K_{ij} = k(x_i, x_j)$.
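For illustration, a sample path can be drawn with plain NumPy, assuming a zero prior mean and a squared-exponential kernel (both common but not mandatory choices):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x')."""
    sq_dists = (np.sum(X1 ** 2, 1)[:, None]
                + np.sum(X2 ** 2, 1)[None, :] - 2 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

# Choose the points x_{1:N} and draw f(x_{1:N}) from the prior
X = np.linspace(0, 10, 100).reshape(-1, 1)
K = rbf_kernel(X, X) + 1e-10 * np.eye(len(X))   # jitter for stability
sample = np.random.multivariate_normal(np.zeros(len(X)), K)
```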

