
University of Pisa

Department of Computer Science

Master of Science in Business Informatics

Combining ML and Mathematical Optimization techniques to tackle automatic IBM ILOG CPLEX configuration on Hydro Unit Commitment problems

Supervisor: Antonio FRANGIONI
External Tutors: Claudia D'AMBROSIO, Leo LIBERTI
Candidate: Gabriele IOMMAZZO

Academic Year 2016/2017


"A journey of a thousand miles begins with a single step." [Laozi]


Abstract

This document describes the work carried out for my MSc dissertation project, the goal of which was to use Machine Learning and Mathematical Optimization techniques to conveniently select the most appropriate algorithmic configuration of a general-purpose mathematical optimization solver.

The assumptions and planning choices adopted to carry out this task – especially those pertaining to how the problem had to be modeled – were elaborated with a specific application in mind: the Hydro Unit Commitment scheduling problem at a French company producing and supplying electricity.


Contents

1 Introduction
2 A review of Automatic Algorithm Configuration
  2.1 Definition of the problem
  2.2 Approaches
    2.2.1 Best default configuration across all instances
    2.2.2 Best configuration on a per-instance basis
    2.2.3 Mixed approaches: online learning
3 Framing the problem
  3.1 A different proposal
  3.2 Unfolding the approach
    3.2.1 Training set assembling and processing
    3.2.2 Offline supervised learning
    3.2.3 "Inverse" optimization problem and Testing
4 Hydro Unit Commitment instances
  4.1 The Hydro Unit Commitment problem at hand
    4.1.1 Some introductory concepts on the HUC
  4.2 A mathematical model for the HUC
    4.2.1 Nomenclature
    4.2.2 Linear Constraints
    4.2.3 The linearization of the objective function
    4.2.4 Deciding volume intervals and breakpoints
  4.3 Instance data and instance features
    4.3.1 Description of instance elements
    4.3.2 Identifying the instance features
5 IBM ILOG CPLEX parameters
  5.1 IBM ILOG CPLEX parameters
    5.1.1 Introducing IBM ILOG CPLEX
    5.1.2 Selected parameters
    5.1.3 Forced combinations of parameters
  5.2 Constraints on the parameters
  5.3 Two different kinds of parameters
    5.3.1 Dependent parameters
    5.3.2 Independent parameters
6 Building the training set
  6.1 Description of steps followed
    6.1.1 On the computed CPLEX performance measure
7 Support Vector Regression: mathematical formulation and Python implementation
  7.1 SVR: purpose
  7.2 SVR: theoretical background
    7.2.1 Basic idea and Primal formulation
    7.2.2 Dual formulation
    7.2.3 Our problem with an SVR model
    7.2.4 Gaussian Kernel
  7.3 SVR: Python implementation
    7.3.1 Data Pre-Processing
    7.3.2 Hyper-parameter Tuning
    7.3.3 Model Validation
    7.3.4 Model Selection
  7.4 Results
8 The inverse problem: testing the pipeline
  8.1 Mathematical formulation of the inverse problem
  8.2 Tests and results
9 Conclusions and future work


Chapter 1

Introduction

Numerically solving a mathematical optimization problem entails a number of well-known, customary steps: formulating a model, retrieving instances of the problem, submitting them to a mathematical programming solver and, finally, analyzing the results produced by the solver.

However, there is one feature of current Mathematical Programming (MP) solvers which is not usually exploited to its fullest, namely the configuration of the solver. This is usually left to the experience of the user, when it is taken into account at all.

One explanation for this is that solvers' default parameter settings are ordinarily tuned by their vendors to perform well on average over a huge testbed of cases. Yet, when an instance pertains to a specific class of problems, default algorithmic configurations are very likely to disappoint. When this happens, solver users are compelled to undertake a manual quest for the best parameter values.

Unfortunately, this process is coarse, approximate and inaccurate, as well as time-consuming.

For this reason, the benefits of finding a good algorithmic configuration of the solver would be obvious, and even more so if there were a way to "learn" how to accomplish that.

Therefore, with the work discussed in this document I tried to put into practice a two-fold idea: that supervised Machine Learning (ML) techniques can be used to automatically learn a performance model of a complex system with many parameters, as a function of the numerical and structural properties of the instance being solved; and that it is possible to harmoniously merge Machine Learning and Mathematical Optimization, in such a way that the optimization model (characteristic of the latter) relies on said performance function to guide the search for a good solver configuration. This work roughly consists of two main tasks:


• the Learning Phase, where Machine Learning – in particular, Support Vector Regression – was employed to learn a performance model of the solver;

• the Optimization Phase, where the performance function resulting from the previous stage was used as the objective function of a mathematical optimization problem. I called this problem the "inverse problem", as its main input is nothing but the very output of the Learning Phase.

To test the approach just introduced, experiments were limited to a small but appealing optimization problem named the Hydro Unit Commitment problem (HUC). There are multiple reasons why the HUC is interesting, but the main one is that it is such a hard problem to solve that even tiny improvements can yield rather significant benefits.
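The overall pipeline can be sketched as follows. This is only a minimal illustration with made-up data and illustrative names: scikit-learn's SVR stands in for the Learning Phase, and the enumeration of a finite candidate set stands in for the inverse problem, which in the actual work is a formally defined optimization model solved with a MINLP solver (Bonmin).

    import numpy as np
    from sklearn.svm import SVR

    np.random.seed(0)
    F = np.random.rand(300, 4)   # instance features of past runs (illustrative)
    C = np.random.rand(300, 2)   # solver configurations used in those runs
    t = np.random.rand(300)      # measured performance, e.g. scaled CPU time

    # Learning Phase: fit a model (features, configuration) -> performance
    model = SVR(kernel="rbf").fit(np.hstack([F, C]), t)

    # Optimization Phase (stand-in): for a new instance, pick the candidate
    # configuration that minimizes the predicted performance
    f_new = np.random.rand(4)
    candidates = np.random.rand(100, 2)
    rows = np.hstack([np.tile(f_new, (100, 1)), candidates])
    theta_star = candidates[np.argmin(model.predict(rows))]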

Ultimately, there is one aspect of this work that represents an additional strong point and could perhaps make it even more compelling: several other applications besides solver configuration clearly exist. Just to name one example, a field that might benefit from our work is medicine, where a physician may want to optimize the treatment administered to each patient considering all the measurements of their condition, i.e. select a range of optimal therapies knowing the most noteworthy medical traits of the patient.

The main programming language used during the development of this dissertation was Python 2.7. I also employed AMPL as a modeling language for describing optimization problems.

Chapter 2 provides an overview of the most significant (in the writer's opinion) research lines present in the literature on Automatic Algorithm Configuration. Chapter 3 focuses on giving a detailed overview of the adopted approach and also defines a formal structure for it. In Chapters 4–5, the main characteristics of the data are presented, and it is explained how they were handled in order to create a training set for the selected Machine Learning algorithm; the parameters of CPLEX (the MILP optimization solver that I decided to work with) are also introduced and described there. Chapter 6 deals with the actual assembling of the training set and the pre-processing of its records. Chapter 7 is centered around Support Vector Regression: there the mathematical foundations of SVR are recalled, followed by a thorough explanation of the implementation choices applied. Chapter 8, instead, provides a more in-depth look at the inverse problem and the algorithmic tools used to treat it, in particular the MINLP solver Bonmin; it also shows and analyzes some results, obtained by comparing my approach with CPLEX alone on instances that had not been part of the learning process. Chapter 9 gives a glimpse of possible future developments.


Chapter 2

A review of Automatic Algorithm Configuration

2.1 Definition of the problem

State-of-the-art algorithms for optimization stand out for the very many parameters that need to be painstakingly tuned in order to obtain good executions. It is not difficult to imagine how, as these parameters grow in number and problems get harder to solve, the tuning process becomes both more challenging and more inaccurate.

In these cases, since the user can only carry out an empirical trial-and-error search by evaluating supposedly promising settings, the notion of "best parameter configuration" seems to lose any deterministic meaning. For this reason, the result of such a process is often a suboptimal solution. Especially for mathematical optimization solvers, finding the optimal parameter values is essential to achieve good solutions quickly. On the other hand, that also turns out to be particularly arduous because of the sheer number of parameters to tweak. Just to give an idea: IBM ILOG CPLEX makes around 170 parameters available in its Version 12 Release 6.

The term Parameter Tuning identifies the problem of picking the best parameter configurations for a specific target algorithm, in order to attain high and robust performance; the best configuration is typically the one that minimizes the algorithm's CPU time or, more generally, optimizes a quality measure.

Formally, the Automatic Algorithm Configuration problem can be stated as follows ([22], [16]): given

(a) A: the target algorithm, whose performance needs improving;

(b) P = {p_1, p_2, ..., p_k}: the parameters of A;

(c) Θ ⊆ Θ_1 × Θ_2 × ... × Θ_k: the space of all possible configurations of the algorithmic parameters;

(d) A(θ), θ ∈ Θ: an instantiation of algorithm A with parameters[1] θ ∈ Θ;

(e) Π: the set of all possible instances of the problem;

(f) Π' ⊂ Π: the set of instances used to learn a good parameter setting for the target algorithm A; Π' contains the only instances known at the moment of learning;

(g) q : Π' × Θ → R, s.t. q(π', θ) is the observed quality measure for a given instance π' ∈ Π' and a given parameter setting θ ∈ Θ;

(h) q̃(·): the estimated (or rather "learnt") quality measure for a given instance π' ∈ Π' and a given parameter setting θ ∈ Θ,

and, given an instance π'' ∈ Π \ Π' that has never been seen before, one wants to find the parameter setting θ*_{A,π''} s.t.

    θ*_{A,π''} = argmin { q̃_A(π'', θ) : θ ∈ Θ }.    (2.1)

[1] The time limit t fixed on the target algorithm's execution also belongs to the set of parameters; nonetheless, since in my case t is fixed once and for all at the very beginning of the work and does not vary throughout the approach, it is treated as a constant and thus does not belong to the list of parameters to explore.

It is essential that q̃ is not only able to estimate the performance of A on the instances in Π'; most importantly, it must have good generalization capabilities, in order to handle even instances which are fairly dissimilar to those used for learning. Otherwise, q̃ would just yield a model that overfits the data in Π'.

Sometimes a variant of the problem shown above is used, expressed as the search for the parameter configuration that has the best performance, on average, over all instances. In this case, the problem can be more properly formulated as a minimization problem w.r.t. an averaged quality measure:

    θ*_{A,Π} = argmin { q̄_{A,Π'}(θ) : θ ∈ Θ },    (2.2)

which means that the best configuration is found relying on a subset Π' ⊂ Π and is then directly used on instances outside Π'. Such a set Π' must of course be very ample and heterogeneous, in order to guarantee the robustness of the best setting.
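In code terms, the difference between (2.1) and (2.2) is simply the quantity over which the argmin runs. A toy sketch, with purely illustrative names: Theta is a finite list of configurations, Pi_train the known instances, q a table of observed quality values (lower is better), and q_hat the learnt surrogate:

    import statistics

    def best_average_config(Theta, Pi_train, q):
        # (2.2): the configuration minimizing the average observed quality over Pi_train
        return min(Theta, key=lambda th: statistics.mean(q[(pi, th)] for pi in Pi_train))

    def best_per_instance_config(Theta, pi_new, q_hat):
        # (2.1): the configuration minimizing the *estimated* quality on a new instance
        return min(Theta, key=lambda th: q_hat(pi_new, th))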

As mentioned earlier, in spite of the (usually) fair amount of parameters to handle, parameter tuning is routinely approached manually. The resulting empirical process is likely to become tedious, terribly time-consuming and possibly inaccurate, even for an experienced user (i.e. one who is deeply knowledgeable about the class of problems to which the target algorithm is applied, or about the algorithm itself).

This chapter provides an overview of the most relevant methodologies in Automated Parameter Tuning, performed to automatically receive recommendations about the algorithmic parameter settings best suited to bring about optimal executions. In the literature, the problem is also referred to as (Automatic) Algorithm Configuration or Auto-Tuning, among other names.

In this document, any mention of "parameters to be tuned" implicitly references two distinct categories: numerical and categorical.

In particular, numerical parameters can be further divided into integer and continuous, and they preserve the order relations characterizing the sets on which they are defined; categorical (or symbolic) parameters, instead, "do not define any metric, thus cannot be ordered" [11].

In the literature, noteworthy efforts have also been invested in investigating ways to pick the best algorithm out of a portfolio. This is normally called the Online Algorithm Selection problem, and it comes somewhat close to Automatic Algorithm Configuration: running and comparing different algorithms can indeed be equated to searching for the one best parameter setting – among a variety of them – for a single target algorithm; this is especially true if one thinks of an algorithm portfolio as a set of parameter configurations.

Nonetheless, it is also clear that the vast space of all possible parameter configurations and the (considerably smaller) space composed of a finite number of selected algorithms in a portfolio can hardly be compared in terms of size. The main difference between the two classes of approaches, however, is in when the learning happens: whereas in the first case it is performed "offline", in the other it occurs during the execution of the target algorithm itself (which clearly also implies far simpler learning rules).

After reviewing research results on Automatic Algorithm Configuration, I will introduce the approach adopted for my dissertation, built on the assumption that there exists a relation between single instances of the problem being solved and the recommended best parameter configuration. In particular, the goal of this work will be to try to find the best parameter setting of IBM ILOG CPLEX for a specific Hydro Unit Commitment (HUC) problem.


2.2 Approaches

In my opinion, among the numerous surveys existing on the topic, "Recent Developments in Automatic Parameter Tuning for Metaheuristics" by Dobslaw [11], "Automated Algorithm Configuration and Parameter Tuning" by Hoos [14] and, notably, "Parameter Adjustment Based on Performance Prediction: Towards an Instance-Aware Problem Solver" by Hutter and Hamadi [15] are of the greatest interest.

The latter, in particular, offers an especially clear and honest overview of recent and not-so-recent research results in Automated Parameter Tuning. For this reason, I will mainly follow the categorization proposed in [15], which groups the auto-tuning methods developed in the last two decades into three classes:

• approaches that seek one best default configuration of a single algorithm, evaluated across an entire set of available instances (see (2.2) for a more formal definition of such approaches);

• approaches that aim to choose the parameter setting(s) of a given algorithm that is (are) best suited to each specific instance (see (2.1));

• mixed approaches that learn on-the-fly, as the algorithm itself is being executed; these can be distinguished from the previous ones, which pick an "optimal" configuration before the target algorithm is executed.

2.2.1 Best default configuration across all instances

These approaches work towards learning a single "best default parameter configuration", given a specific target algorithm A and a specific set of instances Π. Each parameter setting θ is evaluated according to how good its average algorithm performance across all of the instances, namely q̄_{A,Π}(θ), is.

The result is an "overall good final configuration", i.e. a point of the space of feasible settings that optimizes a performance function of the target algorithm for any instance.

On the one hand, Hutter and Hamadi in [15] note that "approaches that find the best default parameter configuration have the potential to improve parametric algorithms for any kind of problem. All that is required for this approach to be applicable is a distribution over (or a set of) problem instances and a (possibly black-box) parametric algorithm". A key assumption of their argument is that the structure and characteristics of the instances are unknown and need not be made explicit: there is no attempt to learn a relation between instance features and appropriate parameter settings. In that lies the biggest difference between these approaches and mine.

On the other hand, they observe that such methods could prove inferior to other approaches that are capable of exploiting "domain knowledge". This can be defined as "the knowledge provided by someone who has great expertise in the problem that the target algorithm is applied to"; it also plays an important role in identifying a priori which features can be profitably employed to produce good predictions. So, experts' knowledge makes it possible to interpret the data inside the instances, choose the most representative information and, more importantly, translate an instance into a set of meaningful features.

The approaches that I will be discussing and that fall under this category are: CALIBRA, F-Race, GGA, MADS, ParamILS, REVAC, SMAC.

CALIBRA [1] employs Taguchi's Fractional Factorial Experimental Designs and a Local Search procedure in order to optimize (at most) fifteen integer parameters of a specific algorithm. In case parameters are continuous, discretization is required.

Adenso-Díaz and Laguna [1] start by planning experiments and performing statistical analysis in order to: locate promising points (i.e. values of the parameters) in the search space; assess "the effect of changing the parameter values" within a specified range; and focus the search around promising regions while simultaneously pursuing diversification for the sake of exploration. Then, a local search is carried out to find a local optimum and reduce the range of each parameter value.

These two tasks are iteratively executed within an allotted "maximum number of executable experiments", MAXEX. Such a budget of course depends on the time available for conducting experiments, which, in turn, is determined by the number of parameters p and the number of critical values k to be considered for each parameter. Using Taguchi's method, though, Adenso-Díaz and Laguna [1] only needed a fraction of all the possible k^p experiments to obtain recommendations.

CALIBRA performed well on different problem settings.

F-Race [6], instead, is a bit more elaborate and it iteratively performs three phases:

• Initially, a statistical distribution is decided (e.g. a uniform distribution when no or scarce information about the parameter space is provided) and the space of the parameter configurations is sampled according to that distribution, in order to extract a list of starting configurations;

• Next, the best configurations are selected using a racing approach, which: runs the algorithm on a single instance with each parameter setting; discards the settings that show statistically worse performance than at least one of the others, according to a Friedman test (hence the 'F' in F-Race); and keeps running the survivors;

• Finally, at the end of each step, the current statistical distribution is updated and refined, so as to increase accuracy in sampling promising configurations.

These three tasks are repeated with different instances until a "satisfying" number of survivors is left, or a maximum amount of available instances has been employed, or a user-defined computational time has elapsed, or a given number of algorithm runs has been executed.
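A toy sketch of the racing idea follows (not the authors' implementation). Here run(c, inst) is an assumed black-box returning the cost of configuration c on instance inst, and elimination is simplified to dropping the worst configuration whenever a Friedman test flags a significant difference; F-Race proper uses post-hoc pairwise comparisons against the best.

    import numpy as np
    from scipy.stats import friedmanchisquare

    def f_race(configs, instances, run, alpha=0.05):
        survivors = list(configs)
        costs = {c: [] for c in configs}        # one cost per instance seen so far
        for inst in instances:
            for c in survivors:
                costs[c].append(run(c, inst))
            # the Friedman test needs at least 3 groups and a few blocks
            if len(survivors) >= 3 and len(costs[survivors[0]]) >= 5:
                _, p = friedmanchisquare(*[costs[c] for c in survivors])
                if p < alpha:
                    # simplification: drop the configuration with the worst mean cost
                    worst = max(survivors, key=lambda c: np.mean(costs[c]))
                    survivors.remove(worst)
            if len(survivors) == 1:
                break
        return survivors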

Unlike CALIBRA, F-Race can handle various types of parameters: categorical, ordinal, continuous or integer, as well as conditional ones ("only in effect when another parameter, usually a categorical one, takes certain values"). However, continuous parameters had to be discretized here too.

The experiments presented in [6] regard almost exclusively Ant Colony Optimization (black-box) algorithms for the Travelling Salesman Problem. Training and testing were performed on large sets composed of 100 and 300 Euclidean TSP instances, to guarantee the robustness of the approach.

Another "one-setting-fits-all" example is the Gender-based Genetic Al-gorithm (GGA): Ansótegui et al. [2] used a Genetic Algorithm for the automatic parameter tuning problem. Plus, they boosted it with a gender separation mechanism for applying "different selection pressure on the two gender populations". This mechanism is the strong point of their work: only individuals in one of the two groups have to undergo competition to win the right to mate in each season. One of the consequences is that the least promising individuals are very unlikely to survive. This of course requires that initially the "fittest" individuals are identified, thereby preventing them from having to compete in order to survive. Not only does such a mechanism allow to perform less fitness evaluations with respect to approaches where every individual has to be ranked based on fitness; but random offspring assignments to one of the two groups also ensure diversification.

After first selecting the top percentage of competitive individuals among the candidate parameter settings, the authors run them in parallel, making use of a racing framework (already seen in [6]) in order to pick the most competitive ones.

At each racing step, a random subset of all the available instances is employed; the size of such a subset is slowly increased as the most competitive parameter configurations survive the genetic selection. In this way, the comparison of parameter settings progressively gets more accurate, since only the fittest configurations survive and they are evaluated on a larger and larger number of instances.

This approach was demonstrated to be superior to a "standard" genetic algorithm. The sizes of the training and test sets used to perform F-Race vary from about 100 to 1000 SAT instances.

A different method, named Mesh Adaptive Direct Search (MADS), was studied by Audet and Orban [3]; it is a class of nonsmooth optimization algorithms that extend Generalized Pattern Search.

MADS makes use of a surrogate model to guide its search strategy towards promising regions of the parameter space. Given an initial performance function that can map dependencies between the target algorithm and its parameters, the advantage of relying on a surrogate model is that it normally provides cheaper or faster evaluations.

In MADS, no assumptions or closed formulas are given for the performance function: it could simply be the CPU time, the number of iterations or the number of problems not successfully solved. As a matter of fact, this implies a black-box function evaluation, "in the sense that a computer program must in general be run in order to evaluate" that function. The surrogate function, on the other hand, is defined as "the same measure (as with the performance function) of performance" of the target algorithm in solving a set of "easy" problem instances. The surrogate function is therefore obtained by just running the target algorithm on problem instances that take a shorter time to be solved to optimality.

Starting from a set of initial parameter values and an incumbent configuration – both initially decided by the user – at each step MADS tries to find points with a lower value of the performance function than that of the incumbent (SEARCH phase). At every iteration, this is accomplished by evaluating a set of promising candidate points, which must lie on a certain space called the mesh. The mesh is initially defined starting from the set of points on which the objective function has already been evaluated, and is then iteratively updated.

If no candidate proves better than the incumbent, then a limited area surrounding the incumbent is explored (POLL phase); otherwise, another iteration is performed, with a better incumbent.

What makes this approach interesting is that promising candidates of the SEARCH and POLL phases are first selected using the surrogate function and then evaluated with the true objective function, to obtain their exact performance value.

In [3], MADS is used to tune four box-constrained, continuous parameters of a trust-region algorithm. The target algorithm was run on 163 unconstrained optimization instances, 55 of which were picked as "easy" and used as the testbed for the surrogate function.

ParamILS is maybe one of the most cited examples in this category, perhaps because Hutter et al. [16] worked to create a structured, automatic framework that can solve the algorithm configuration problem potentially for any given class of problems. Their framework has in fact been tested on many known problems, such as complete and incomplete heuristic search algorithms for SAT, MPE (Most Probable Explanation), protein folding, university time-tabling and CPLEX.

The Iterated Local Search (ILS) algorithm at the core of ParamILS starts its search from some default parameter configuration, and then performs pairwise comparisons with configurations in the neighborhood (i.e. configurations obtained by modifying "one single parameter value at a time"). Each unseen configuration is evaluated on a range of instances in order to find the one "that yields the lowest mean CPU time on the benchmark set". Each iteration of the method begins with a perturbation move, executed in order to escape from local optima; then a search procedure runs until it is decided that no parameter setting outperforms the current best incumbent, or a superior configuration is found; in the latter case, the incumbent is updated and memorized. Also, in order to ensure diversification, the search can be re-initialized at random with a certain probability.

What makes ParamILS compelling are its little expedients: for example, the default execution of a special Adaptive Capping technique allows the user to avoid running the algorithm for too long with parameter configurations that are not promising, meanwhile performing more executions with good configurations. Adaptive Capping computes a lower bound on the evaluation of each parameter configuration after a certain number of runs and, if this bound is higher than the lower bound on the incumbent, the current configuration is discarded.

Each pair of configurations is evaluated on the same instances.
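A minimal sketch of the iterated local search at the heart of ParamILS follows, omitting adaptive capping and the BasicILS/FocusedILS variants. cost(theta) is an assumed black-box estimating the mean CPU time of the target algorithm under configuration theta on a fixed benchmark set; domains maps each parameter to its (discretized) set of values. All names are illustrative.

    import random

    def one_exchange_neighbours(theta, domains):
        # all configurations differing from theta in exactly one parameter value
        for p, values in domains.items():
            for v in values:
                if v != theta[p]:
                    nb = dict(theta)
                    nb[p] = v
                    yield nb

    def param_ils(theta0, domains, cost, iterations=100, restart_prob=0.01):
        incumbent, best_cost = dict(theta0), cost(theta0)
        current = dict(theta0)
        for _ in range(iterations):
            if random.random() < restart_prob:
                # random re-initialization for diversification
                current = {p: random.choice(vs) for p, vs in domains.items()}
            else:
                # perturbation move: randomly change a few parameter values
                for p in random.sample(list(domains), min(3, len(domains))):
                    current[p] = random.choice(domains[p])
            # first-improvement local search in the one-exchange neighbourhood
            improved, current_cost = True, cost(current)
            while improved:
                improved = False
                for nb in one_exchange_neighbours(current, domains):
                    nb_cost = cost(nb)
                    if nb_cost < current_cost:
                        current, current_cost, improved = nb, nb_cost, True
                        break
            if current_cost < best_cost:
                incumbent, best_cost = dict(current), current_cost
        return incumbent, best_cost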

Experiments were conducted on three target algorithms: SAPS, which has four continuous parameters, for each of which the authors chose a promising starting interval and selected seven values inside it; SPEAR, which has 10 categorical parameters plus 16 integer and continuous parameters that were discretized and for which between 3 and 8 values were picked; and CPLEX, for which 63 categorical, conditional, integer and continuous parameters were chosen.

Eight sets of benchmark instances were employed to test the target algorithms: QCP (2000 SAT-encoded quasi-group completion instances), SW-GCP (2000 SAT-encoded graph colouring instances), Regions100 (2000 MIP-encoded winner determination instances for combinatorial auctions), Regions200 (2000 almost identical instances to those contained in Regions100, but larger), MJB (343 machine-job assignment instances), CLS (100 MILP-encoded capacitated lot-sizing instances), MIK (120 MILP-encoded mixed-integer knapsack instances) and QP (2000 quadratic programs originated from RNA energy parameter optimization). For each set, half of the instances were dedicated to the training phase and the other half to the test phase.

In a sense, then, ParamILS is not so different from the other approaches described above: it ultimately relies on heuristics to find a good parameter setting. Furthermore, there is no clear attempt to explain the relations between the data and algorithm performance, and in this, I believe, lies its greatest limit.

Relevance Estimation and Value Calibration of Evolutionary Algorithms (REVAC) belongs to this research branch too, and it proposes a way to estimate parameter relevance for evolutionary algorithms. The target here is thus particularly circumscribed.

Nannen and Eiben [23] state that "instead of estimating the performance of an Evolutionary Algorithm for different parameter values [. . . ], this method estimates the expected performance when parameter values are chosen from a probability distribution".

REVAC therefore refines and iteratively improves a uniform distribution over the parameter space (i.e. over all the possible parameter vectors, simultaneously), and rewards with higher probability those points that are expected to yield higher performance of the EA. This procedure is somewhat reminiscent of [6].

In particular, the performance achieved by an EA using a specific parameter vector is called response.

So, as a first step, each parameter vector is evaluated and its utility is recorded. Then, starting from this initial set of vectors with known utility, a new population of vectors with greater expected utility is generated, using an evolutionary approach. These new vectors make it possible to build updated density functions, which assign higher probability to regions that are expected to perform well. REVAC also combines this process with a mechanism that can reduce the possibility of convergence to local maxima (a well-known drawback of evolutionary algorithms). In the end, the highest-probability values are selected.

Interesting tests were reported in [23] on a population of 100 parameter vectors (with values scaled to the interval [0, 1]), 50 of which were selected as parents. REVAC was requested to tune a target EA executed 1000 times on the response function.

In this class of approaches, another one that – comparably to ParamILS ([16]) – has many potential application contexts and proves very powerful is Sequential Model-based Algorithm Configuration (SMAC). This approach is based on so-called Sequential Model-Based Optimization (SMBO), which "iterates between fitting models and using them to make choices about which configurations to investigate".

SMAC is initialized by using training data to learn a supervised machine learning model (a Random Forest, based on regression trees) that can predict, for each instance and for each parameter setting, the performance of the algorithm being executed. Such training data is in the form (instance features, parameter configuration, algorithm performance).

Although at first glance this phase closely resembles the work described in this dissertation, SMAC then aggregates performance predictions across instances. With this passage, potential information regarding the relations between a specific instance and its most suited configuration is totally lost. Consequently, instance features are only exploited during the model-training phase, and at the end of the process only one best parameter setting is selected and returned.

After training and aggregating, three steps are iteratively repeated until a total time budget is exhausted:

• SelectConfigurations, which outputs a list of promising parameter configurations: "to quantify how promising a configuration θ is, SMAC uses the model's predictive distribution for θ" (built in the previous stage with regression techniques), so as "to compute its expected positive improvement EI(θ) over the best configuration seen so far (the incumbent)". In more detail, SelectConfigurations computes EI(θ) ∀ θ ∈ Θ, then picks ten configurations with maximal EI and begins a multi-start local search at each of them; the (list of) configurations selected at the end of this search have locally maximal EI (a sketch of the EI computation follows this list);

• Intensify, which runs the target algorithm on such a list of promising configurations and executes pairwise comparisons with the current best incumbent, to update both the incumbent and the current training set with new data. When comparing a pair of parameter configurations, the same instances are used to evaluate them;

• FitModel, which adjusts and improves the current model using the updated training data.
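As referenced above, here is a hedged sketch of the expected-improvement criterion used in SMBO-style methods such as SMAC, under the common assumption of a Gaussian predictive distribution; mu and sigma are the model's predictive mean and standard deviation for a configuration, f_min the cost of the incumbent (all names illustrative):

    from scipy.stats import norm

    def expected_improvement(mu, sigma, f_min):
        # for minimization: EI = E[max(f_min - f(theta), 0)] under N(mu, sigma^2)
        if sigma <= 0.0:
            return max(f_min - mu, 0.0)
        z = (f_min - mu) / sigma
        return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)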

SMAC was tested on SAT and MIP instances. For the former, 126 features were considered, such as "features based on graph representations of the instance, an LP relaxation, DPLL probing, local search probing, clause learning and survey propagation". For the latter, 39 features were computed, such as "features based on graph representations of the instance, an LP relaxation, the objective function and the linear constraint matrix". PCA was applied in order to perform feature selection. Hutter et al. [18] suggest that SMAC can still be used when no features are available for the instances: in these particular cases, it is still possible to work with an empty set of features or with domain-independent features (e.g. the size of an instance).

Unlike other "one-setting-fits-all" approaches, for which the structure and features of instances are normally unknown, SMAC involves an initial manual feature selection (or feature computation), later refined with PCA. This of course implies previous domain knowledge. Moreover, at its initial stage, SMAC uses ML to predict the performance of a target algorithm w.r.t. a pair (instance features, parameter setting). While it is true that the information thus acquired is then aggregated over the whole dataset and somehow gets lost, it is also clear that SMAC comes a bit closer to "per-instance" approaches than the other examples in this section.

In total, 17 algorithm configuration scenarios were taken into consideration for testing in [18], which confirms SMAC's adaptability.

They regard the same three target algorithms tuned in ParamILS ([16]): SAPS, SPEAR and CPLEX, with numerical and categorical parameters. During the training period, analytical transformations (e.g. logarithmic) were applied to parameters that "vary on a non-uniform scale", to constrain them in a domain where they show more uniform variations. Of course, when they had to be used in a target algorithm execution, their values were mapped back to the original scale.

2.2.2 Best configuration on a per-instance basis

These approaches are built on the assumption that there exists a relation between the characteristics of an instance and the most appropriate parameter values of the target algorithm being employed to solve that instance. Automatic parameter tuning here usually exhibits two recurring steps:

1. In a first, offline phase, a model is learnt that can predict the performance of a target algorithm for each pair (instance, parameter configuration).

An instance is identified by its features; this of course implies being able to represent it as a set of numerical (continuous, discrete or categorical) values, expressive of the difficulties encountered in solving that specific instance. A rather evocative term for conveying this is "instance hardness". So, prior to learning a model, the advice of experts is necessary in order to create a training set;

2. Later, when a new instance must be solved, the space of all parameter settings is evaluated. The model learned in the previous stage "lends the necessary predictive power and makes it somehow possible to find the optimal configurations" (i.e. the configuration that optimizes such a model).

The work carried out for this Master's dissertation certainly belongs to this class of approaches. Nonetheless, its novelty lies in the fact that, to my knowledge, none of the other approaches tried to build an "inverse problem" as outlined in the Introduction of this document (3.1). What can perhaps be pointed to as the most significant difference with these other approaches is that none of them actively gives the information coming from the learning phase (i.e. the learnt model, seen as a "white box") a structure that allows that information to be re-used in order to drive the subsequent search for an optimal configuration.

Among the most significant literature examples of per-instance approaches, I chose to discuss Bayesian Case-Based Reasoning, CluPaTra, LaO and Parameter Adjustment based on EHM.

Bayesian Case-Based Reasoning: Pavón et al. [25] use a 3-step approach:

• First, the algorithm that the user wants to tune is run on the available instances with various parameter configurations, and, for each instance and each configuration, the algorithm's performance is collected in a database of runs. The database is composed of triples (instance features, parameter setting, algorithm performance);

• Then, after aggregating the database records w.r.t. each distinct instance, a Bayesian Network (BN) model is induced, capable of recommending the best algorithm configuration for that specific instance. Bayesian Networks are data mining models that represent probability distributions with a graph, and can therefore quantify uncertainty in parameter recommendations. The result of this passage is a case base: a set of pairs (instance features, BN model) that contains a different graphical model for each different instance;

• Finally, when a new problem instance appears, a case-based reasoning approach is put into practice: "the features of a new instance problem are used to index into the case base and to retrieve the Bayesian Network solution potentially relevant". If the new instance proves similar (according to some similarity metric defined beforehand) to some case in the case base, the best configuration for that instance can easily be retrieved and recommended; this data then becomes fuel for updating the related BN in the case base. Otherwise, a new BN model is induced by repeating the first phase on the new instance only, and the case base is expanded.

The described approach was applied (according to [25]) to the field of graphic design, to tune a genetic algorithm employed to solve the Root Identification Problem. The data used to test it consists of 60 problems (that is, 60 databases of runs), and three parameters of a genetic algorithm were examined: mutation rate, crossover rate and population size, the last one an integer. 7 × 8 × 9 = 504 values were fixed for the parameters, based on "recommendations from the specialized literature", i.e. that "domain knowledge" which can only come from experts.

Lindawati et al. [21] attempted to tackle the problem from a different angle than others in this category, as they performed cluster-specific parameter tuning with CluPaTra (Clustering based on Pattern Trajectory). Firstly, they apply clustering techniques to a set of available instances, based not on the features of such instances but rather on the search trajectory patterns that "the (target) algorithm follows as it searches from an initial solution to its neighbors iteratively". Each cluster is therefore characterized by instances with a similar search trajectory.

Because search trajectories are extracted from sequences of actions, iterative solutions and their evaluations, this approach can only be applied to local-search-based target algorithms.

During clustering, Sequence Alignment is used to compute similarity scores between instances – based on their search trajectories – so that Hierarchical Clustering can then be run to define the clusters.

After that, a one-size-fits-all parameter-tuning method (in this case, ParamILS [16]) is applied to each cluster, with the goal of finding a list of configurations that optimize a certain performance measure.

In operation, given a new unknown instance, the target algorithm is run with a default initial configuration; the search trajectory of the instance is mapped to the most similar cluster; and the parameter settings recommended for that cluster are returned.
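A toy sketch of the clustering step in a CluPaTra-like scheme follows, with search trajectories represented as strings of symbols; difflib's similarity ratio is only a stand-in for the authors' Sequence Alignment score, and SciPy's hierarchical clustering defines the clusters (all names illustrative).

    import difflib
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_trajectories(trajectories, n_clusters=2):
        n = len(trajectories)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                sim = difflib.SequenceMatcher(None, trajectories[i], trajectories[j]).ratio()
                dist[i, j] = dist[j, i] = 1.0 - sim   # turn similarity into a distance
        Z = linkage(squareform(dist), method="average")
        return fcluster(Z, t=n_clusters, criterion="maxclust")

    # e.g. cluster_trajectories(["abba", "abca", "zzzy"]) groups the first two together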

It appears that no previous knowledge about the target algorithm or the instances is needed in CluPaTra, which brings it closer to the "one-setting-fits-all" methods seen before. Still, its final result is a recommendation that specifically depends on the cluster to which the instance belongs. Even though in the end I considered CluPaTra a "per-instance" (or, rather, a "per-cluster") approach, its characteristics make it very hard to assign it a distinct and conclusive label.

CluPaTra was employed to test two target algorithms: a variant of an Iterated Local Search (with 5 discrete parameters) for solving TSP instances, and a hybrid Simulated Annealing and Tabu Search algorithm (with 4 continuous and discrete parameters) for solving Quadratic Assignment Problem instances. All of the parameter values were discretized in order to comply with ParamILS's working principles. The method was tested on 70 instances from TSPLib and on 50 instances from QAPLib, keeping 20% of them for testing.

An appealing work is Learn-and-Optimize (LaO) by Brendel and Schoenauer [8], which aims to combine "optimizing (i.e. parameter tuning) and learning (i.e. finding the mapping between features and best parameters)". This approach is, in fact, interesting for its similarity to the one carried out for my dissertation. Here a clear-cut distinction is made between learning – through ML – and optimizing: LaO uses Machine Learning as a black box, which somehow hinders the passing of (some kind of) information between the two phases. LaO first uses a Feed-Forward Artificial Neural Network (ANN) to learn a relation between the features characterizing each instance and the optimal parameter values for that instance.

The model thus produced is used, as a surrogate model, to guide a Covariance Matrix Adaptation Evolution Strategy (CMA-ES, an evolutionary algorithm) in the search for optimal parameter settings, on a per-instance basis: regrettably (in my opinion), it is employed by the CMA-ES only to get hints on how to improve a "fitness function".

So, the optimization task boils down to selecting the parameter setting that optimizes such a fitness map, normally related to the CPU time that the target algorithm needs to find a solution, or expressing a quality criterion on the algorithm execution.

Moreover, a random gene transfer mechanism dubbed Genetransferer was implemented in LaO, to exploit the idea that "a good chromosome of one instance may at least help another instance [. . . ] thus it may be used as a hint in the optimization of that other instance". In fact, when a request for a hint arrives during the optimization of a certain instance, Genetransferer returns the current best parameter setting of a different instance. This confirms that learning and optimization are not really part of a continuous flow, but rather two disjointed acts of the process.

Since CMA-ES can only handle box-constrained parameters, every point in the parameter space had to be normalized into a [0,1] interval.

Experiments were conducted on between 55 and 108 instances from three different Artificial Intelligence Planning problem variants. The target of the tuning was Divide-and-Evolve, an evolutionary algorithm. Running LaO took several weeks.

Finally, Parameter Adjustment based on Empirical Hardness Models (EHM) deserves a special mention. Hutter and Hamadi [15] built on related work by Leyton-Brown et al. [20], who used supervised ML techniques to introduce an empirical hardness model that predicts the CPU time taken by an algorithm to solve a specific problem instance.

Hutter and Hamadi [15] make use of Ridge Regression to learn a map that can predict the CPU time of a target algorithm, for each pair (instance features, parameter configuration).

After an initial training phase, upon receiving a new instance, the search for a parameter setting that can yield the shortest CPU time begins. An alternative way to describe it is that the goal is to optimize the aforementioned regression function with respect to all the possible parameter configurations; this is achieved by either evaluating all possible settings (if there are few) or employing "more complicated methods [. . . ], such as gradient descent". During optimization, the algorithm need not be executed when evaluating the various configurations, since only the regression model returned by the first stage is needed to carry out the evaluations.
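A hedged sketch of this idea, with made-up data: a ridge-regression map from (instance features, parameter configuration) to CPU time is learnt, and a new instance is then configured by scoring a finite candidate set with the model instead of running the solver (all names illustrative).

    import numpy as np
    from sklearn.linear_model import Ridge

    np.random.seed(0)
    F = np.random.rand(200, 5)   # instance features of past runs
    C = np.random.rand(200, 3)   # parameter configurations used in those runs
    t = np.random.rand(200)      # observed (log) CPU times

    model = Ridge(alpha=1.0).fit(np.hstack([F, C]), t)

    def best_config(instance_features, candidate_configs):
        # brute-force optimization of the learnt map over a finite candidate set
        rows = np.hstack([np.tile(instance_features, (len(candidate_configs), 1)),
                          candidate_configs])
        return candidate_configs[np.argmin(model.predict(rows))]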

EHM is without a doubt the closest approach to the one discussed in this document. The main difference lies in how the optimization phase (what I earlier referred to as the "inverse problem") is tackled: Hutter and Hamadi [15] use either a brute-force approach or a first-order iterative approximation method; in the case of this dissertation, instead, a mathematical optimization problem is formally defined upon the performance map delivered by ML, and suitable methods are employed to solve it.

The authors of [15] also developed an extension of their approach that employs Bayesian Linear Regression instead of Ridge Regression, so as to take into account estimates of the uncertainty underlying the predictions. This variant was devised for situations where uncertainty can ensue from diversity in the instances' origins, e.g. when instances belong to different domains, different applications inside the same domain, or even different modeling choices inside the same domain and application. In such situations, the (CPU time) predictions could suffer from low accuracy, and it could therefore be useful to know to what extent such predictions can be trusted. This approach was tested on SAPS, a stochastic local search algorithm for SAT. Besides obtaining accurate predictions, the experiments tried to tune two continuous and strongly interdependent parameters of SAPS, which were discretized to four values each (16 combinations in total).

The instances used to conduct tests were all satisfiable SAT instances with varying solving difficulty.

Hutter and Hamadi [15] finally discuss how the features of the instances were obtained: they created three sets of features, the first being a subset of the second which, in turn, is a subset of the third. In particular, the first set comprises 46 Basic Features handpicked from the 91 SAT features that Leyton-Brown et al. presented in [20]. The second set, SLSF, adds twelve features to the Basic Features, computed by running SAPS 10 times with 1000 search steps, performing statistics on the runs and then calculating the mean, median and coefficient of variation for each of those statistics. The third set, called LLSF, has the same dynamic features as the second set, but the additional twelve features are produced by running SAPS 100 times with 10000 steps per run. Even though calculating the features in LLSF was quite expensive in terms of computational time, this set seems to outperform the others when it comes to accuracy of predictions.

2.2.3 Mixed approaches: online learning

Hutter and Hamadi, in their survey [15], define these approaches as those that tune the algorithm parameters during the search, and they distinguish "online algorithm configuration and selection problems" from "a priori variants, in which one commits to using a particular parameter configuration (or algorithm) before running the configuration (or the algorithm) on an instance". Particularly interesting, in this context, is so-called Reinforcement Learning, in which decisions must be taken in order to maximize a reward (or minimize a loss), given a set of stochastic rules that define the decision process: every time a new instance is solved, the models that drive decisions adapt to the ever-evolving environment. Therefore, if on the one hand an initial performance model is a prerequisite for Reinforcement Learning, on the other hand solving instances allows such a model to be incrementally improved.

In general, since adapting a model on-the-fly requires considerable computational resources, and this process must be performed every time a new instance appears, the learning rules applied tend to be simple (at least, simpler than the ones appearing in the two other main parameter tuning classes).

Based on the above analysis, and on the observation that new instances of the Hydro Unit Commitment problem have to be solved very often and very quickly, Online Learning does not seem the most appropriate solution for our case.

Anyway, briefly examining some Online Learning examples can be worthwhile, in the light of a possible hybridization between this approach and the others.

ACE-FORR [12] belongs to this category and is based on the so-called "FORR architecture", which uses collaborative heuristics. FORR assumes that useful knowledge about the domain is available or can be provided by experts in advance. Such knowledge is exploited to define a set of initially problem-class-independent heuristics, called advisors, that can recommend feasible actions, algorithms or useful data to solve the current problem.

After each problem instance is solved, FORR incrementally learns an array of weights for the advisors, to reward optimal choices. One drawback of this approach is that FORR is clearly only applicable to a set of problems belonging to the same domain, thus penalizing prospective generalization attempts.

Epstein and Freuder [12] also extended FORR with an Adaptive Constraint Engine (ACE) that can improve the learning of domain-specific expertise regarding the Constraint Satisfaction Problem (CSP).

Another very problem-specific online learning example is Auto-Walksat [24]. Walksat is the name of a family of stochastic algorithms used to solve Boolean Satisfiability Problems in CNF. Given an objective function that a Walksat algorithm needs to minimize as it searches for a truth assignment of a CNF formula, the invariant ratio can be defined as the ratio between the mean objective function value and the standard deviation of the objective function (computed over a range of "unsuccessful attempts to solve a formula"). The goal of Auto-Walksat is to find an optimal value of a Walksat parameter called noise, making use of the invariant ratio.

Upon receiving a new instance of the problem, Auto-Walksat first attempts to solve it with little computational power (i.e. a few iterations), in order to find good empirical values of the invariant ratio; such values are then "used to guide a search for the minimum invariant ratio which in turn leads to an estimated optimal noise level". The invariant ratio space is searched by means of a Golden Section Search.
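A minimal golden-section search, the one-dimensional method mentioned above; here f stands in for the empirically measured invariant ratio as a function of the noise parameter (a generic sketch, not Auto-Walksat's own code):

    def golden_section_search(f, lo, hi, tol=1e-3):
        phi = (5 ** 0.5 - 1) / 2                  # inverse golden ratio, ~0.618
        a, b = lo + (1 - phi) * (hi - lo), lo + phi * (hi - lo)
        fa, fb = f(a), f(b)
        while hi - lo > tol:
            if fa < fb:                           # minimum lies in [lo, b]
                hi, b, fb = b, a, fa
                a = lo + (1 - phi) * (hi - lo)
                fa = f(a)
            else:                                 # minimum lies in [a, hi]
                lo, a, fa = a, b, fb
                b = lo + phi * (hi - lo)
                fb = f(b)
        return (lo + hi) / 2

    # e.g. golden_section_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0) is ~0.3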

It is observed in [24] that the optimal noise can be obtained by adding 10% noise to the minimal invariant ratio value, and consequently this rule was adopted during the tests. Unfortunately, it was also demonstrated that this empirical relation between invariant ratio and noise can be nullified on some problems. This of course sets some serious limits on the use of Auto-Walksat, which is specifically tailored to SAT problems.

Gagliolo and Schmidhuber [13] worked on a bandit problem solver for loss games called GambleTA, and defined their online algorithm selection problem as "the process of allocating computational resources to a set of alternative algorithms. [. . . ] The algorithm set can contain multiple copies of the same algorithm, differing in their parameter settings".

In this approach, the starting input data is: a sequence of instances for which the user wants to minimize solution time, a portfolio of algorithms to be tested, a set of time allocators (TA) and a Bandit Problem Solver (BPS). Time Allocators are arbitrary functions that map the constantly updated performance data – collected during previous runs of the algorithms in the portfolio – to a share. The share helps to guide the distribution of computational resources among the algorithms.

The task assigned to the BPS is to choose, for each problem instance, an appropriate TA: picking an arm means choosing a certain TA to solve the current instance. The CPU times of the algorithms can be considered as losses for the BPS. "In the long term, the BPS would allow to select, on a per set base, the TA that is best at allocating computational resources to algorithms in the portfolio on a per instance base".

Every time a new instance in the provided sequence needs solving, a TA is chosen using the BPS, and all algorithms are then run in parallel, in accordance with the allocation of computational resources associated with the chosen TA. At each step, the best algorithm is the one that minimizes the ratio between CPU time (the time spent before finding a solution) and the share of resources assigned (depending on the chosen TA). After each instance has been solved, the information gathered during the run is used to update the BPS and the TAs accordingly. Thus, Time Allocators incrementally learn from past history.

The set of TAs is composed of one uniform allocator and nine dynamic allocators, built on "a model of the CPU time distribution that is updated after each problem is solved". In this case, too, experiments were carried out on SAT instances. The only feature considered in the tests was the number of variables, since the clauses-to-variables ratio does not change.

Similarly to GambleTA, Reinforcement Learning for automatic online algorithm selection [10] aims to pick one algorithm out of a portfolio. Instead of relying only on a bandit framework, Degroote et al. [10] start by training a Random Forest regression model for each algorithm in the portfolio, to predict "its performance in function of the problem features". After this first "offline" task, when a new instance is revealed, a contextual multi-armed bandit approach is employed to pick an algorithm to solve the instance. The bandit method involves the evaluation of the regression models previously built and progressively updated: "each algorithm is an arm and pulling an arm is the equivalent of selecting an algorithm. When selecting an algorithm for an instance, its features are known, which is the equivalent of having made a context vector available". This process therefore allows the relation between context vectors and rewards to be learnt incrementally. After the new instance has been solved, a new datapoint becomes available. When a minimal amount of new datapoints for an algorithm has been collected, they are employed to re-train the regression model associated with that algorithm, so as to improve its predictive power. The other models are not updated (no datapoints are produced for them), because they are not run and, consequently, their performance on the current instance is not known.

The described approach was tested on instances with between 22 and 155 features. Moreover, the set of benchmark instances was split into three subsets: a training set, a set of online instances and a verification set. In the verification stage, "the final model quality at the end of the online phase is evaluated [. . . ] by using the models to make predictions".

The last case of on-the-fly learning that I describe is Reactive Search (RS) by Battiti and Brunato [5]. With Reactive Search, some parameters of the algorithm being run are tuned according to the algorithm's behaviour, during its very execution. In particular, RS focuses on search heuristics and, based on a history of past search moves, automates the process of deciding whether the algorithm is trapped inside a certain region of the search space; if that is the case, the algorithm is compelled to perform an escape move, which of course incorporates a random component. Considering an algorithm's past steps somehow echoes what has been seen earlier, when discussing CluPaTra: search trajectories.

Normally, since the history of past moves of the search algorithm must be available, the choice of an appropriate data structure to store it acquires great importance: "the storage and access of the past events is executed through the well-known hashing or radix-tree techniques". In particular, the structures must be chosen in order to minimize computational complexity (since they have to be accessed often and read quickly) and must have a strong connection with the structure of the problem itself.


Chapter 3

Framing the problem

3.1 A different proposal

A good starting point to introduce the work pertaining to this dissertation is to compare it with other examples present in the literature.

Among the approaches described in the previous Chapter, the ones that seem most akin to the one presented here are EHM by Hutter and Hamadi [15] and LaO by Brendel and Schoenauer [8]:

• Hutter and Hamadi [15] especially focus on training a good ML model, that is, one that can accurately predict the CPU time taken by running the target algorithm. They concentrate their attention primarily on the learning stage.

However, not much attention seems to be given to the optimization phase, as they perform it by either evaluating all possible parameter configurations or using "more complicated methods [. . . ], such as gradient descent": the first approach can prove intractable if the space C of all parameter settings is very large; the second could easily prove inadequate for the constrained optimization problem defined in (3.3);

• Brendel and Schoenauer [8], instead, use an Artificial Neural Network to train their predictor, and do not obtain a model available for re-use during the optimization stage. So, whereas in my approach building ¯p produces a model that can later be re-employed to build a constrained "inverse optimization problem", their "learning" and "optimization" phases are totally unrelated. In particular, any method could be chosen for (3.3), regardless of the performance map learnt. As further proof of that, their ANN model is indeed just used to give the evolutionary algorithm of the optimization phase hints that can steer its search towards the best control.

It is true that a possible alternative way to achieve the same result would have been using a ML approach that directly suggests configurations; this would imply that the right configuration should be directly produced. Not many approaches that I am currently aware of could potentially do that – perhaps only Neural Networks. Yet, Brendel and Schoenauer [8] showed some limits of such a choice.

To sum up, the approach that I briefly sketched above makes it possible to first learn a performance map through Machine Learning and then to solve an "inverse problem" using that map. Here "inverse" refers to the fact that the output produced by SVR ultimately serves as an input to the mathematical optimization problem of the last phase. Relying on NNs would certainly have yielded solutions for our problem, but then again very little information on the underlying performance model (that is, on why such solutions are obtained) would have been made available and re-usable. In conclusion, the novelty of the approach presented in this report lies in the assumption that it is possible to model the structure of an estimator (created during the ML training phase) as an optimization problem, and that such a structure can be profitably re-employed at a later stage to face the parameter tuning issue.

To my knowledge, nobody has ever proposed an "inverse problem" formulation that tries to exploit the structure underlying the optimization problem solved during the training phase, in order to formulate a mathematical program that can actively re-use such a structure to guess the best parameter configuration.

This is the reason why my supervisors and I believed that, given the huge range of tools, algorithms and solvers available, dealing with a MP such as the one in (3.3) is an extremely promising way to tackle the problem at hand.

3.2 Unfolding the approach

The work associated with this dissertation falls under the second category of methods illustrated in Chapter 2: in fact its underlying assumption is that a relation exists between instance features and optimal parameter configuration of the target algorithm.

For this reason, similarly to "per-instance" examples, a set of features must be available prior to performing learning; defining features is indeed essential to build a training set for the ML.

It is obvious that such features have to be engineered starting from the instances available. It is also obvious that someone who is knowledgeable about the problem in question – in particular, about the structure of an instance – will play a big role during the process. In other words, it is essential to know in advance which features make an instance hard-to-solve: there is a high probability that such features will exercise a very marked influence in the prediction of "instance-based" best parameter settings.


That said, the purpose of this work was not to build a method that can automate the process of deducing a set of meaningful features from an instance. That task goes well beyond the objectives of this dissertation (at least at the moment), but will hopefully be carried out in the near future. As for the contents of this document, instance features were considered as given, and therefore it will only be explained why some of them were chosen and why others were discarded.

I can now introduce how parameter tuning was dealt with in my dissertation. Let F ⊂ R^m be the set of possible instance features, C ⊂ R^k the set of possible CPLEX parameter configurations (or "controls") and Y ⊂ R the set of CPLEX solution measures. The approach was then roughly divided into three phases: Training set assembling and processing, Offline supervised learning (represented in Figure 3.1) and "Inverse" optimization problem and Testing (sketched in Figure 3.2).

3.2.1 Training set assembling and processing

First of all, after receiving 21 instances of the Hydro Unit Commitment problem, I parsed them with the Python script Features.py. Thanks to this, each instance could be translated into a vector of continuous values, gathered in features.csv. However, not every piece of data pertaining to the available HUC instances was deemed equally important: there is a clear distinction between features that really make an instance harder or easier to solve and the remaining values, which do not affect how CPLEX behaves. Only the former came to be called "features" and appear in features.csv.

Then, the code in Combinations.py allowed me to generate parameter configurations. Because of the great number of parameters that CPLEX makes available (around 170), running all possible configurations on all given instances would have proven computationally unfeasible; consequently, only a small subset of 11 parameters was initially selected, thereby dramatically restricting the set of configurations that could be generated from that subset. Executing Combinations.py, I obtained combinations.csv, where parameter values were translated into simpler data vectors.
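A minimal sketch of how such an enumeration could be implemented; the parameter names and value lists below are purely illustrative placeholders (the subset actually selected is described in Chapter 5):

```python
import csv
import itertools

# hypothetical subset of tuned CPLEX parameters and admissible values
param_values = {
    "mip_strategy_branch": [-1, 0, 1],
    "mip_cuts_gomory": [-1, 0, 1, 2],
    "preprocessing_presolve": [0, 1],
}

names = list(param_values)
with open("combinations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(names)
    # one row per point of the Cartesian product of admissible values
    for combo in itertools.product(*param_values.values()):
        writer.writerow(combo)
```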

After that, each instance was solved by executing CPLEX without fixing a time limit, via the script LoopNoTL.py: with that I obtained, for each instance, its optimal solution value and a file named "model.lp". In turn, hydro_scaling.mod and hydro_scaling.run provided, respectively, the problem formulation (namely, a model for the Hydro Unit Commitment problem) and other execution instructions for CPLEX. These were written in AMPL, a mathematical programming language used to describe optimization problems and call solvers – such as CPLEX – to solve them.
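As a sketch of what such a driver could look like, assuming the ampl executable is on the PATH and that hydro_scaling.run reads the instance path from an environment variable (an assumption about plumbing not shown in this document):

```python
import os
import subprocess
from pathlib import Path

# hypothetical driver: one AMPL/CPLEX run per instance, no time limit;
# hydro_scaling.run is assumed to load hydro_scaling.mod, call CPLEX and
# write out model.lp together with the optimal objective value
for instance in sorted(Path("instances").glob("*.dat")):
    subprocess.run(["ampl", "hydro_scaling.run"],
                   env={**os.environ, "INSTANCE": str(instance)},
                   check=True)
```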

Finally, combinations.csv and the .lp models were used as inputs for TrainingSetOneModel.py, which runs the solver a second time and makes it possible to observe the effect of several algorithmic configurations on CPLEX performance. This time, though, a time limit of 15 seconds was fixed. In other words, for each considered pair (instance features, parameter values), I obtained a quality measure of the solution achieved by the solver in a short (15 seconds) span of time.
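A minimal sketch of one such time-limited run through the CPLEX Python API; the set_param helper is an assumption (parameters in that API are nested attributes ending in .set()), and the raw objective value is used here only as a stand-in for the actual quality measure discussed in Chapter 6:

```python
from functools import reduce
import cplex

def set_param(prob, dotted_name, value):
    # resolve e.g. "mip.cuts.gomory" inside prob.parameters, then set it
    reduce(getattr, dotted_name.split("."), prob.parameters).set(value)

def timed_run(lp_path, config, time_limit=15.0):
    prob = cplex.Cplex(lp_path)              # reads the exported model.lp
    prob.parameters.timelimit.set(time_limit)
    for name, value in config.items():       # apply one configuration
        set_param(prob, name, value)
    prob.solve()
    # stand-in quality measure: best objective value found within the limit
    return prob.solution.get_objective_value()
```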

Running FinalProcessedCsv.py allowed me to perform transformations on the data collected during the previous phases, and in this way to satisfy the requirements that SVR entails; these transformations are better known as "pre-processing" and their purpose is to reshape the data and give it a format that the chosen ML algorithm (in this case, SVR) can handle smoothly.

In conclusion, the output obtained by executing FinalProcessedCsv.py was a final training set, trSet_final.csv, which is composed of triples {(instance features, parameter values, CPLEX performance)} or, shortly, {(f, c, y)}, with f ∈ F, c ∈ C, y ∈ R.
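For illustration, the pre-processing step could be sketched as below; the column-naming convention and the choice of min-max scaling are assumptions made for this example (the transformations actually performed are detailed in Chapter 7):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("trSet_raw.csv")   # hypothetical intermediate file name
feature_cols = [c for c in df.columns if c.startswith("f_")]  # assumption
param_cols = [c for c in df.columns if c.startswith("c_")]    # assumption

# rescale features and controls to a common range, so that no single
# column dominates the Gaussian-kernel distances later computed by SVR
scaler = MinMaxScaler()
df[feature_cols + param_cols] = scaler.fit_transform(df[feature_cols + param_cols])
df.to_csv("trSet_final.csv", index=False)
```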

3.2.2 Offline supervised learning

Assuming that the interactions between the data and the solver can be described by a performance map p that makes it possible to evaluate the performance of a pair (f, c):

p : F × C −→ R, p(f, c) = y, (3.1)

Support Vector Regression was exploited to approximate p. The code in MySVR.py mainly relies on the scikit-learn Python library. The resulting model was a performance function ¯p. The information needed to accurately write the regression function was saved into svr_results.txt. Such information includes, but is not limited to: the support vectors, the optimal dual coefficients α, C and the optimally-tuned Gaussian kernel coefficient γ – namely, all of ¯p's constants.
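A condensed sketch of how such a model could be trained and its constants extracted with scikit-learn; the hyper-parameter grids shown are illustrative placeholders (the tuning actually performed is described in Chapter 7):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X = df[feature_cols + param_cols].values    # the (f, c) pairs
y = df["y"].values                          # CPLEX performance measures

grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X, y)
svr = grid.best_estimator_

# the constants needed to write the regression function explicitly
support_vectors = svr.support_vectors_
dual_coefs = svr.dual_coef_                 # signed alpha coefficients
intercept = svr.intercept_
gamma = svr.get_params()["gamma"]
```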

The most challenging task here consisted in selecting and using SVR to define an approach that, upon receiving an instance, recommends the solver configurations best suited for that instance. This required training SVR so that it is capable of telling whether a configuration is good (i.e. of evaluating it); put differently, by means of ML it was possible to learn a performance model ¯p, a function that can evaluate and predict the performance of CPLEX for a given pair (instance features, parameter setting). As remarked above, CPLEX performance was measured in terms of how good a solution found by the solver (within a fixed time limit) is. A more detailed explanation of the SVR model is given in Chapter 7.

3.2.3 "Inverse" optimization problem and Testing The performance model built was then used in order to drive the search for the best configuration, exploring the (combinatorially large) space of all configurations in search of a good one.

Mathematically, this corresponds to solving an optimization problem that minimizes the discussed performance function over a linearly constrained space. Numerically, this was achieved by defining linear constraints for C and writing the constraints and ¯p into an AMPL file (model_inv.mod). Another AMPL script (model_inv.run) was then run to call the Bonmin solver on the problem.

At full speed, upon the arrival of a new instance, identified by its feature vector ¯f, the optimal configuration will be identified as the control c∗ that minimizes ¯p given ¯f. Namely, the "inverse problem" can be modelled as follows:

c∗ ∈ argmin{ ¯p( ¯f, c) : c ∈ C }.     (3.2)

Finally, considering that at this stage ¯f is given as a constant of the problem ( ¯f stands for the instance features, hence corresponds to unvarying data), (3.2) can be reformulated as:

c∗ ∈ argmin{ ¯p_ ¯f (c) : c ∈ C }.     (3.3)
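To make the structure of (3.3) concrete, the following is a sketch of what ¯p looks like once the SVR constants are plugged in, assuming the standard form of a Gaussian-kernel SVR predictor (support vectors x_i, signed dual coefficients α_i, bias b):

```latex
\bar{p}(\bar{f}, c) = \sum_{i \in SV} \alpha_i
    \exp\!\left(-\gamma \left\lVert (\bar{f}, c) - x_i \right\rVert^2\right) + b,
\qquad
c^* \in \operatorname*{argmin}_{c \in C} \bar{p}(\bar{f}, c).
```

Since ¯f is fixed, only the c-components of the distances vary; together with the integrality restrictions on some of the controls, this yields the nonconvex MINLP that Bonmin is then asked to solve.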

The concepts described in this section are pictured in the left part of Figure 3.2, where it can be seen how the SVR results were retrieved and, together with 20 credible feature vectors { ¯f } (corresponding to newly generated HUC test instances), employed to build instances of the inverse problem. The 20 HUC test instances are not to be confused with the 21 instances that were used as input for the learning and for the inverse problem. Optimization for the inverse problem was performed using Bonmin, a solver for MINLP problems. These activities were thus devised to obtain solutions to (3.3): { c∗_t : t a test instance }.

So, because I needed to show proof that my approach could outperform CPLEX – or, on the contrary, verify that it is still inferior to the plain solver – what I did was to run CPLEX on the 20 HUC test instances with and without setting the specific parameter configuration c∗.

Finally, the results produced by Tester_pipeline.py and Tester_no_pipeline.py were compared to draw my final remarks, which are addressed in Chapter 8.


Figure 3.2: Optimization Phase and Testing

The phases described in Subsections 3.2.1 and 3.2.2 will be referred to as the "Learning Phase", whereas that of Subsection 3.2.3 will be called the "Optimization Phase".


Chapter 4

Hydro Unit Commitment instances

4.1 The Hydro Unit Commitment problem at hand

Should the approach outlined in the previous chapters be applied to any problem, the training set for the ML would probably need to be impractically large in order to obtain acceptable results.

In the light of this, I decided to restrict its application scope to a specific class of problems, for which it is very well known which features make an instance hard-to-solve and which CPLEX algorithms can strongly affect its behaviour: the Hydro Unit Commitment (HUC) problem at EDF.

HUCs are usually modelled using Mathematical Programming (MP) and solved using a range of algorithms going from heuristic to exact. According to the MP taxonomy, HUCs are natively Mixed-Integer Nonlinear Programs (MINLP), meaning they involve both continuous and integer decision variables, and both linear and nonlinear terms in their objective functions and constraints.
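For reference, the general textbook shape of such a program (not the specific HUC model, which is presented in Section 4.2) is:

```latex
\min_{x \in \mathbb{R}^n,\; y \in \mathbb{Z}^p} f(x, y)
\quad \text{s.t.} \quad g_j(x, y) \le 0, \quad j = 1, \dots, m,
```

where f and some of the g_j may be nonlinear; when f and all g_j are affine, the program reduces to a MILP.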

The HUCs that I handled yield correspondingly difficult MINLPs. Even when the nonlinearities are linearized, and the MINLP becomes a Mixed-Integer Linear Program (MILP), they pose formidable challenges. Currently, MILP solution technology is not even advanced enough to find a feasible solution to these MILPs, let alone an optimal one.

In the following sections I will illustrate the fundamental characteristics of the problem and then move on to describing its instances.

4.1.1 Some introductory concepts on the HUC

It is a well known fact that electricity cannot be stored easily, and that one of the most efficient and long-term methods for storing it is to turn it into potential energy by pumping water up mountain valleys into natural reservoirs.
