

Academic year: 2021



A Novel Approach to Model Selection in

Distribution Estimation Using

Markov Networks

AI & R Lab

Artificial Intelligence and Robotics Laboratory of Politecnico di Milano

Advisor: Prof. Matteo MATTEUCCI
Co-advisor: Ing. Luigi MALAGÒ

Master's Thesis by: Gabriele VALENTINI, student ID 734888


Evolutionary Algorithms consist of a broad class of optimization algorithms based on the Darwinian paradigm of the survival of the fittest. In particular, Estimation of Distribution Algorithms (EDAs) are a recent paradigm, often presented in the literature as an evolution of Genetic Algorithms, that integrates the evolutionary approach with techniques from Statistical Machine Learning. An EDA evolves a population of promising solutions to a given optimization problem by building and sampling probabilistic models. This thesis proposes to integrate EDAs based on Markov Networks with novel model selection techniques from the Statistical Machine Learning community. Particular focus is given to algorithms in the DEUM framework, a recent and promising search strategy based on undirected graphical models. We introduce the use of ℓ1-regularized logistic regression techniques with the aim of recovering the interactions among the variables of a given optimization problem and of encoding them in a statistical model without any prior knowledge. Besides, we provide a novel approach to model fitting in the DEUM framework, based on randomized techniques recently proposed in linear algebra, in order to reduce the computational effort required. The results of this work show that, when the size of the population is big enough, it is possible to almost perfectly recover the interactions among the variables of a given objective function and, consequently, to increase the probability of convergence to the optimal solution.


The research presented in this thesis concerns the broad area of optimization. By optimization we mean the set of scientific disciplines devoted to determining the extreme values of a given objective function, also known as a cost function, which may correspond to optimal solutions of a particular engineering problem. This research focuses in particular on combinatorial optimization, i.e., the optimization of problems in which the set of feasible solutions is discrete. The optimization algorithms presented below are based on a black-box approach to problem solving: no prior information about the function to be optimized is provided to the solver.

From Genetic Algorithms to Estimation of Distributions

Evolutionary Algorithms (EAs) [2] constitute a broad class of optimization algorithms based on the Darwinian paradigm of evolution. An EA evolves a population of individuals, which encode the (feasible) solutions to an optimization problem, across generations by repeatedly applying biologically inspired mechanisms of selection and reproduction. The quality of an individual in the population of an EA is evaluated through the fitness function, i.e., the function to be optimized or a suitable surrogate of it. The fitness value determines an individual's chances of surviving selection and becoming part of the next generation's population. The family of EAs comprises several paradigms; some of the best known are Genetic Algorithms (GAs) [30, 38], Evolution Strategies [3, 76] and Evolutionary Programming [26, 27, 48].

Genetic Algorithms are the most popular evolutionary paradigm outside the EA community. A GA tries to find the optimal solution to a given optimization problem by repeatedly applying selection and reproduction operators. At each generation, the best individuals of the population are selected. The chosen solutions then undergo recombination and mutation. Recombination combines several solutions (usually two), selected together, by exchanging some of their parts. Mutation sporadically introduces perturbations in the individuals generated in the previous step. The solutions created in this way replace part of the population, and the process is repeated until a given termination criterion is met.

However, problem-independent variation operators can often be destructive for contiguous blocks of information contained in promising solutions. These blocks are commonly called building blocks. Indeed, Genetic Algorithms perform remarkably well when applied to the optimization of functions whose building blocks are small and tightly located [86]; conversely, this evolutionary paradigm turns out to be rather ineffective for objective functions that do not satisfy these requirements. An example is given by deceptive functions, artificial problems specifically designed to mislead GAs [29, 57]. In such functions, regions of the search space with high fitness values are surrounded by low-fitness regions whose purpose is to drive the genetic search towards areas of the search space corresponding to locally optimal solutions. In addition, the variation operators of Genetic Algorithms are not supported by theoretical foundations, and this aspect often discourages the use of GAs outside the EA community.

A more recent evolutionary paradigm is represented by Estimation of Distribution Algorithms (EDAs) [50]. In the literature, EDAs are also known as Probabilistic Model Building Genetic Algorithms (PMBGAs) [65] and as Iterated Density Estimation Algorithms (IDEAs) [11]. EDAs retain the population-based search and the exploration by recombination and perturbation of promising solutions that characterize Genetic Algorithms. However, the variation operators that guide the exploration of the search space in GAs are replaced by statistical operators for the estimation and sampling of probability distributions, thus integrating the evolutionary paradigm with concepts from the field of Machine Learning. The use of probabilistic models enables statistical modelling and sampling techniques to automatically detect the regularities of the problem and to exploit this information for its resolution. Indeed, the selected population can be regarded as a sample of observations from which to reconstruct, using model estimation and selection techniques, the original probability distribution.

An EDA evolves a population of solutions to a given optimization problem. The workflow of an EDA starts with the random generation of a population of solutions. Then, at each generation, the population is evaluated through the fitness function and promising individuals are selected as in GAs. At this point, a probabilistic model is built using the selected population as a sample of observations. The model is then sampled to generate new solutions, and the new individuals are finally inserted into the population according to a certain replacement policy.
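A minimal sketch of this loop, assuming a univariate marginal model (in the style of UMDA) and the One Max fitness; the population size, selection rate, and full-replacement policy are illustrative choices, not the settings used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_max(pop):                      # fitness: number of ones per solution
    return pop.sum(axis=1)

n, pop_size, n_select, generations = 25, 100, 50, 50
pop = rng.integers(0, 2, size=(pop_size, n))          # random initial population

for _ in range(generations):
    fitness = one_max(pop)
    selected = pop[np.argsort(fitness)[-n_select:]]   # truncation selection
    model = selected.mean(axis=0)                     # univariate marginals
    offspring = (rng.random((pop_size, n)) < model).astype(int)  # sample model
    pop = offspring                                   # full replacement
    if one_max(pop).max() == n:                       # optimum reached
        break

print(one_max(pop).max())  # best fitness found
```

A multivariate EDA differs only in the model-building and sampling steps, where the product of marginals is replaced by a graphical model.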

Can We Improve DEUM?

Much of the work in the EDA literature relies on directed probabilistic graphical models (Bayesian networks) for the statistical modelling of the population of solutions. Undirected graphical models, such as Markov networks, are not very popular in this context. Indeed, such models are not exploited to the full extent of their expressive power, since some of their characteristics make doing so difficult or computationally expensive. Markov networks are often used for a first phase of population modelling and are subsequently converted into directed models for the sampling of new solutions. However, several real-world optimization problems cannot be efficiently represented by directed graphs. It is therefore important to find ways to exploit the full potential of these probabilistic models.

Recently, a family of algorithms based on the Markov network formalism has been proposed under the name Distribution Estimation Using Markov Networks (DEUM) [78, 79, 80, 81]. The idea behind DEUM that distinguishes it from other approaches in the EDA literature is its use of the information provided by the fitness function. In EDAs, the fitness function is usually reduced to a metric on which to build an ordering for the selection phase. In DEUM, instead, each solution contributes to the estimation of the model in proportion to its fitness value: the probabilistic model is built so as to assign each solution a probability proportional to its fitness. This property is very important: if the probabilistic model is estimated properly, the solution with maximum probability also corresponds to the optimal solution of the given problem. Sampling techniques can then be adopted with the aim of directly sampling the maximum-probability solution. Very significant results have been obtained with DEUM; in particular, the number of fitness function evaluations needed to optimally solve several optimization problems (e.g., the Ising Spin Glass function [78, 82], the MAXSAT problem [16], etc.) has been considerably reduced.
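The fitness-modelling idea can be sketched as follows (a reconstruction based on the Boltzmann form reported in the DEUM literature; the clique functions u_i, the temperature T, and the constant term are notational assumptions):

```latex
p(x) = \frac{e^{-U(x)/T}}{Z}, \qquad U(x) = \sum_i \alpha_i \, u_i(x)
```

Requiring p(x) to be proportional to f(x) for every solution x in the selected population yields one linear equation per solution, -U(x)/T = ln f(x) + const, so the unknown parameters α_i are obtained by solving a linear system in the least-squares sense.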

However, alongside this innovative use of the fitness information, DEUM suffers from some drawbacks. Estimating the parameters of the probabilistic model, which in DEUM takes the form of a Boltzmann distribution, requires the solution of a linear system of equations; fitting the model therefore has a very high computational cost. Moreover, state-of-the-art DEUM algorithms need prior information about the objective function to be optimized. Indeed, DEUM requires the structure of the Markov network, which encodes the conditional independencies among the variables, to be known, against the black-box optimization assumption. Very little work has been done in DEUM to extend this framework with methods for automatically learning the interactions among the problem variables. The lack of automatic learning methods strongly limits the use of DEUM for optimization problems in which the structure of the interactions among the variables is not known.

The work presented in this thesis aims to improve several aspects of the DEUM framework. Particular attention is devoted to the problem of automatic selection of the probabilistic model. Some results have also been obtained concerning the reduction of the computational complexity required to estimate the unknown parameters of the probabilistic model.

Original Contributions

In this section we summarize the most relevant original contributions to the Estimation of Distribution Algorithms community obtained by the research described in this thesis. Most of the work presented in this document focuses on Markov network based EDAs; however, several contributions apply beyond this specific setting as well.


i Reduction of model selection to an optimization problem. We introduced, in the context of Estimation of Distribution Algorithms based on undirected graphical models, an approach to the automatic selection of the model, i.e., of the structure of the Markov network, that reduces this task to the solution of an optimization problem. In particular, we introduced the use of ℓ1-regularized logistic regression to determine the set of variables of the optimization problem that interact with a given variable (its neighbourhood). The use of logistic regression techniques is motivated by the observation that real-world optimization problems are very often characterized by a sparse structure of the interactions among variables. This novel approach to the selection of the probabilistic model is formally introduced into the DEUM framework through the DEUMℓ1 algorithm.
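As an illustration of neighbourhood recovery, the following sketch fits one ℓ1-regularized logistic regression per variable on a toy chain-structured sample and keeps the variables with non-zero coefficients as that variable's neighbourhood. The data generator, the proximal-gradient fitter, the regularization strength, and the coefficient threshold are illustrative assumptions, not the thesis' actual DEUMℓ1 algorithm:

```python
import numpy as np

def l1_logistic(S, y, lam, steps=3000, lr=0.1):
    """Fit l1-regularized logistic regression by proximal gradient (ISTA)."""
    w = np.zeros(S.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(S @ w)))
        grad = S.T @ (p - y) / len(y)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

# Toy sample: binary solutions whose variables interact along a chain,
# so the Markov network to recover has edges 0-1, 1-2, ..., 6-7.
rng = np.random.default_rng(1)
n, samples = 8, 4000
X = rng.integers(0, 2, size=(samples, n))
for j in range(1, n):
    copy = rng.random(samples) < 0.9        # correlate bit j with bit j-1
    X[copy, j] = X[copy, j - 1]

S = 2 * X - 1                               # +/-1 encoding of the predictors
edges = set()
for j in range(n):
    w = l1_logistic(np.delete(S, j, axis=1), X[:, j], lam=0.05)
    for k in np.flatnonzero(np.abs(w) > 0.1):
        k = int(k) if k < j else int(k) + 1  # map back to variable index
        edges.add((min(j, k), max(j, k)))

print(sorted(edges))
```

Because the regression of each variable on all the others conditions away indirect correlations, only the direct neighbours survive the ℓ1 penalty, which is exactly what makes the recovered edge set sparse.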

ii Randomized model fitting. We introduced, in the context of the DEUM framework, the use of randomized methods for matrix decomposition. The randomized SVD reduces the computational effort required to determine the SVD of a matrix. This task is fundamental when estimating the parameters of the probabilistic model used by DEUM, since a linear system of equations has to be solved to this end; the SVD of a matrix is used to avoid the computation of the matrix inverse, which is computationally even more expensive. Our proposal of using randomized SVD methods may also be of interest in contexts other than the DEUM framework.
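The randomized decomposition can be sketched as follows, in the spirit of randomized range-finding methods; the target rank k, the oversampling amount, and the Gaussian test matrix are illustrative choices:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1. sketch the range of A with a Gaussian test matrix
    omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ omega)           # orthonormal basis for the sketch
    # 2. solve the small problem in the reduced space
    B = Q.T @ A                              # (k + oversample) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# usage: an exactly low-rank matrix is recovered almost perfectly
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 150))
U, s, Vt = randomized_svd(A, k=8)
err = np.linalg.norm(A - U @ (s[:, None] * Vt)) / np.linalg.norm(A)
print(err)
```

The cost is dominated by two passes over A instead of a full dense SVD, which is where the computational gain for large, nearly low-rank population matrices comes from.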

iii Evoptool: the Evolutionary Optimization Toolkit. Evoptool is a software package that supports the development and analysis phases required during the design of new evolutionary algorithms. Evoptool aims to ease the comparison of the performance of different algorithms, possibly coming from different optimization paradigms, which is necessary in this context. The toolkit offers an extensive set of objective functions, standardized and shared by the EA community, on which the performance of an algorithm can be tested. Moreover, Evoptool offers a collection of ready-to-use implementations of Genetic Algorithms and Estimation of Distribution Algorithms, against which researchers in this community can compare their work easily and directly. Evoptool is a project aimed at the comparison of algorithm performance over a shared set of optimization problems.

It is important to highlight the results obtained by the research summarized in this thesis concerning techniques for the automatic learning of a Markov network. We showed that, given a sufficiently large set of observations, it is possible to exactly recover, by means of ℓ1-regularized logistic regression, the structure of the interactions among the variables of an optimization problem and to encode it in a Markov network. Moreover, for the first time in the EDA literature, we propose an approach able to solve the model selection problem without any prior knowledge of the optimization problem, and therefore in a completely black-box context. It must be stressed that previous model learning methods, unlike our proposal, exploit information about the objective function that is, in general, provided to the algorithm as the maximum number of interactions per variable.

Significant results have also been obtained in reducing the computational requirements of the parameter estimation phase of the model used by DEUM. The use of randomized SVD methods allowed a computational gain of up to 85% in the best case. When DEUM algorithms are run in a multi-generational setting, our randomized approach to model fitting exploits the reduction of diversity in the population, decreasing the effective dimension of the matrix to be decomposed and, consequently, the computational requirements of this task. Moreover, the use of randomized methods mitigates the convergence problems, due to ill-conditioned matrices, that often affect classical SVD approaches.


Contents

Abstract
Estratto della Tesi
1 Introduction
1.1 From Genetic Algorithms to Estimation of Distributions
1.2 Can we improve DEUM? Why should we care?
1.3 Thesis Contributions and Road Map
2 Estimation of Distribution Algorithms
2.1 Probability Distributions on Graphs in EDAs
2.2 Bayesian Networks
2.2.1 MIMIC
2.2.2 COMIT
2.2.3 Bayesian Optimization Algorithm
2.3 Markov Random Fields
2.3.1 Factorized Distribution Algorithm
2.3.2 MN-FDA
2.3.3 MN-EDA
2.3.4 Distribution Estimation Using Markov Networks
2.4 Benchmark Fitness Functions
2.4.1 One Max
2.4.2 Alternated Bits
2.4.3 Ising Spin Glass
3 The DEUM framework
3.1 The Probabilistic Model
3.1.1 The Fitness Modelling Approach
3.2 Estimating the Parameters of the Model
3.3 Finding Marginals from the Boltzmann Distribution
3.3.1 Monte Carlo Approach to Sampling
4 An ℓ1-regularized Approach to Model Selection
4.1 Model Selection: Problem Definition
4.1.1 Performance Evaluation Measures
4.2 Structure Learning in Markov Networks based EDAs
4.2.1 Shannon's Information Entropy
4.2.2 Joint Equals Marginal Product
4.2.3 Pearson's χ2 Independence Test
4.2.4 Linkage Detection Algorithm
4.3 Learning with ℓ1-regularized Logistic Regression
4.3.1 The ℓ1-Markov Networks Learning Algorithm
4.3.2 The Regularization Parameter λ
4.4 Empirical Analysis
4.4.1 The Strength of the λ Parameter
4.4.2 Comparison of the Structure Learning Criteria
5 A New Proposal in the DEUM Framework
5.1 The DEUMℓ1 Algorithm
5.2 Empirical Analysis
5.2.1 Optimization of the One Max Function
5.2.1.1 Success Rate Against Population Size
5.2.1.2 Comparison With Other EDAs
5.2.2 Optimization of the Alternated Bits Function
5.2.2.1 Success Rate Against Population Size
5.2.2.2 Comparison With Other EDAs
5.2.3 Optimization of the Ising Spin Glass Function
5.2.3.1 Success Rate Against Population Size
5.2.3.2 Comparison With Other EDAs
5.2.4 General Considerations
6 Conclusions and Future Works
6.1 General Conclusions and Important Contributions
6.2 Future Lines of Work
Bibliography
A Evoptool: an Extensible Toolkit for Evolutionary Optimization Algorithm Comparison
A.1 Introduction
A.2.2 Algorithms
A.2.3 Benchmarks
A.2.4 Statistics
A.3 Technical description
A.3.1 Modules and features
A.3.2 Evoptool engine and graphical user interface
A.3.3 Algorithms hierarchy
A.3.4 Objective functions hierarchy
A.4 Conclusions
B Graphics of the Model Selection Performance
B.1 ℓ1-Markov Networks Learning algorithm vs Alternated Bits
B.2 Comparison with CE, JEMP and χ2

List of Figures

2.1 (a) A simple Bayesian Network with four variables (X1, X2, X3, X4). Vertices {1, 2, 3} are the parents of vertex 4, written π(4) = {1, 2, 3}. (b) A more complex Bayesian Network that defines a partial order on its vertices.
2.2 (a) A Markov Network on n = 5 vertices, with maximal cliques {1, 2, 4}, {2, 3, 4} and {2, 4, 5}. (b) A Markov Network which defines a 4-nearest neighbour lattice model in 2D.
2.3 (a) A junction graph of the undirected graph shown in Figure 2.2(a) and (b) its corresponding junction tree, where the square node represents a separator set.
2.4 Independence model underlying the One Max fitness function.
2.5 Chain model underlying the Alternated Bits fitness function.
2.6 Toroidal square lattice model underlying the Ising Spin Glass function.
3.1 Sample of solutions to an optimization problem and their relative fitness value.
4.1 The confusion matrix associated to a prediction problem. Rows refer to the actual value, i.e., the real value before classification, while columns refer to the predicted outcome. TP stands for true positive, FN for false negative, FP for false positive, and TN for true negative.
4.2 ℓ1-Markov Networks Learning applied to an Ising Spin Glass function of size n = 25: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.3 ℓ1-Markov Networks Learning applied to an Ising Spin Glass function of size n = 64: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.4 ℓ1-Markov Networks Learning applied to an Ising Spin Glass function of size n = 100: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.5 Markov Networks resulting from an execution of the ℓ1-Markov Networks Learning algorithm applied to an Ising Spin Glass function of size n = 25, with a population size of p = 20.6n and a percentage selection of ps = 10%. Selection operator: (a) binary tournament, (b) truncation selection.
4.6 F1-measure for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 25: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.7 F1-measure for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 64: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.8 F1-measure for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 100: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red dashed line refers to a 5% selection, the green dashed to 10%, and the blue dashed to 20%.
4.9 … ps = 10%. (a) ℓ1-Markov Networks Learning algorithm, (b) cross-entropy, (c) JEMP criterion, and (d) χ2 independence test.
5.1 Success rate over population size of the DEUMℓ1 algorithm applied to the resolution of the One Max function.
5.2 Average number of fitness calls and average execution time of the whole set of algorithms when applied to the resolution of the One Max function of size: (a) n = 25, (b) n = 64 and (c) n = 100.
5.3 Success rate over population size of the DEUMℓ1 algorithm applied to the resolution of the Alternated Bits function.
5.4 Average number of fitness calls and average execution time of the whole set of algorithms when applied to the resolution of the Alternated Bits function of size: (a) n = 25, (b) n = 64 and (c) n = 100.
5.5 Success rate over population size of the DEUMℓ1 algorithm applied to the resolution of the Ising Spin Glass function.
5.6 Average number of fitness calls and average execution time of the whole set of algorithms when applied to the resolution of the Ising Spin Glass function of size: (a) n = 25, (b) n = 64 and (c) n = 100.
5.7 Average number of fitness calls and average execution time required by DEUM-CE for different values of the maximum neighbourhood size m, when applied to an Ising Spin Glass function of size n = 100. The subscript of the name of the algorithms refers to the parameter m.
A.1 This figure shows the landscape of the Trap-5 benchmark problem.
A.2 This figure shows the images associated to the best individual, after 2400 generations, for a Texture Restoration benchmark problem over an image of 80x80 pixels. Algorithms: SGA with population size 500 and mutation rate 0.004, UMDA with population size 300. (a) original image, (b) corrupted image, (c) SGA best individual, and (d) UMDA best individual.
A.3 … population size 500 and mutation rate 0.001, PBIL with population size 1000 and learning rate 0.3, and sBOA with population size 5000 and max incoming edges 2.
A.4 Average population fitness trend over number of fitness calls for an Ising Spin Glass benchmark problem of size 225 (15x15). Algorithms: PBIL with population size 400 and learning rate 0.3, MIMIC with population size 500 and offspring size 250, and sBOA with population size 5000 and max incoming edges 4.
A.5 Algorithms class diagram. Classes represented by a double squared box are abstract classes.
B.1 ℓ1-Markov Networks Learning applied to an Alternated Bits function of size n = 25: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.2 ℓ1-Markov Networks Learning applied to an Alternated Bits function of size n = 64: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.3 ℓ1-Markov Networks Learning applied to an Alternated Bits function of size n = 100: (a) F1-measure with a tournament selection and (b) with a truncation selection, (c) precision statistics with a tournament selection and (d) with a truncation selection, (e) recall statistics with a tournament selection and (f) with a truncation selection. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.4 Precision for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 25: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.5 Recall for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 25: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.6 Precision for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 64: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.7 Recall for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 64: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.8 Precision for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 100: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.
B.9 Recall for a structure learning test onto an Alternated Bits (a,b,c,d) and an Ising Spin Glass function (e,f,g,h) of size n = 100: (a) and (e) ℓ1-Markov Networks Learning, (b) and (f) cross-entropy, (c) and (g) JEMP, and (d) and (h) χ2 independence test. The red-dashed line refers to a 5% selection, the green-dashed to 10%, and the blue-dashed to 20%.

List of Tables

3.1 Computational gain given by the randomized model fitting in the resolution of an Ising Spin Glass function through the IsDEUM algorithm for ε ∈ {10^-1, 10^-2, 10^-3}. Here, ε is the requested maximal error of the randomized SVD, n refers to the size of the problem, ET is the execution time in seconds, and G is the computational gain.
3.2 Computational gain given by the randomized model fitting in the resolution of an Ising Spin Glass function through the IsDEUM algorithm for ε ∈ {10^-5, 10^-10, 10^-25}. Here, ε is the requested maximal error of the randomized SVD, n refers to the size of the problem, ET is the execution time in seconds, and G is the computational gain.
5.1 Parameter settings for the whole set of algorithms when applied to the resolution of the One Max function of size (a) n = 25, (b) n = 64 and (c) n = 100.
5.2 Parameter settings for the whole set of algorithms when applied to the resolution of the Alternated Bits function of size (a) n = 25, (b) n = 64 and (c) n = 100.
5.3 Parameter settings for the whole set of algorithms when applied to the resolution of the Ising Spin Glass function of size (a) n = 25, (b) n = 64 and (c) n = 100.

List of Algorithms

1 PBIL
2 MIMIC
3 COMIT
4 BOA
5 MN-FDA
6 MN-EDA
7 DEUM
8 Gibbs Sampler
9 Randomized Singular Value Decomposition
10 ℓ1-Markov Networks Learning
11 DEUMℓ1
12 Gibbs sampler for Single Generation DEUM


Introduction

The area of interest of this thesis is optimization, i.e., the task of finding the extreme values of some objective function, also known as a cost function, which may correspond to the optimal solutions of a particular engineering problem. Planning, scheduling, and parameter tuning represent some of the most common applications of optimization algorithms. In this work we restrict our attention to the field of combinatorial optimization, i.e., the task of finding the optimal solution from a finite (discrete) set of feasible solutions, even if most of the work presented here could be extended to the continuous case as well. Moreover, we focus on black-box approaches to the optimization of objective functions, that is, on algorithms which do not exploit any prior knowledge about the problem to be solved. The aim of this section is to introduce the reader to the particular class of optimizers considered in this work, accompanying this choice with motivations and discussion. Once the scope of the work is clear, the aim and motivations of the work are presented. Then, a brief description of the contributions of the thesis is given before concluding with an overview of the manuscript.

1.1 From Genetic Algorithms to Estimation of Distributions

Evolutionary Algorithms (EAs) [2] consist of a broad class of optimization algorithms based on the Darwinian paradigm of the survival of the fittest. An Evolutionary Algorithm evolves a population of individuals, which encode the (possibly feasible) solutions to an optimization problem, along the generations (iterations) by repeatedly applying biologically inspired mechanisms of selection and reproduction. The goodness of an individual


belonging to the population of an EA is evaluated through some fitness function, i.e., the function to be optimized or an appropriate surrogate of it, and the fitness value determines the chance of the individual to survive to the next generation. Different paradigms belong to the family of Evolutionary Algorithms; some of the best known are Genetic Algorithms (GAs) [30, 38], Evolution Strategies [3, 76] and Evolutionary Programming [26, 27, 48].

Genetic Algorithms represent the most popular evolutionary paradigm outside the Evolutionary Computation community. A GA tries to find the optimal solution to a given optimization problem by repeatedly applying selection and reproduction operators. In a nutshell, the better solutions are extracted from the current population of individuals by the selection operator; the selected solutions are then processed by applying recombination and mutation operators. Recombination combines multiple (usually two) solutions that have been selected together by exchanging some of their parts. There are various strategies to do this, e.g., one-point and uniform crossover. Mutation performs a perturbation of the resulting offspring. The created solutions replace (some of) the old ones and the process is repeated until some termination criterion is met.
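The loop just described can be sketched in a few lines. The following is a minimal, illustrative implementation (tournament selection, one-point crossover, bit-flip mutation, full replacement); all names and parameter values are ours rather than part of any specific GA from the literature, and the OneMax fitness is used only as a toy example.

```python
import random

def one_max(x):
    # Toy fitness: number of ones in the binary string.
    return sum(x)

def simple_ga(fitness, n=20, pop_size=50, generations=100, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def select():
        # Tournament selection: the fitter of two random individuals survives.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n)                 # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            offspring.append(child)
        pop = offspring                                # full replacement
    return max(pop, key=fitness)

best = simple_ga(one_max)
```

Note how the operators are fixed and problem-independent: nothing in the crossover or mutation step knows anything about the structure of the fitness function, which is precisely the weakness discussed next.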

As a matter of fact, fixed and problem-independent recombination operators often either break the pieces of information contained in the promising solutions (building blocks) or do not mix them effectively. It turns out that GAs work particularly well for problems where the building blocks are located tightly in the individuals [86]. An example of this statement is provided by deceptive functions. Deceptive functions are synthetic problems specifically designed to mislead classical GAs [29, 57]. In such functions, regions of optimal fitness are surrounded by regions of lower fitness that have the role of leading the genetic search process towards different areas of the search space corresponding to local optima. Moreover, the classical variation operators employed by GAs, i.e., crossover and mutation, lack theoretical foundations, and this lack often discourages the use of GAs outside the Evolutionary Computation community.

A more recent paradigm within the community of EAs is represented by Estimation of Distribution Algorithms (EDAs) [50]. In the literature, EDAs are also known as Probabilistic Model Building Genetic Algorithms (PMBGAs) [65] and Iterated Density Estimation Algorithms (IDEAs) [11]. EDAs maintain the population-based search approach and the exploration by recombination and perturbation of promising solutions typical of Genetic Algorithms. However, the variation operators, which in GAs guide the exploration of the search space, are replaced by the use of probabilistic models,


Algorithm 1: PBIL
1  Set all values in PV to 0.5
2  while stopping criteria are not met do
3      Sample PV to generate the new population P
4      Evaluate P
5      Select a subset D of P
6      for each variable Xi do
7          Compute the empirical marginal probability p̂(Xi) over D
8          Update PV as in PVi(t+1) ← γ · p̂(t)(Xi) + (1 − γ) · PVi(t)
9      Increment the generations counter, t ← t + 1

that is, by integrating the evolutionary paradigm with concepts coming from the field of Machine Learning.

The employment of probabilistic models in EDAs allows the use of statistical modelling and sampling techniques to automatically discover and exploit problem regularities. On one hand, the population of individuals can be seen as a sample of observations, that are particular realizations of a certain probability distribution. On the other hand, this probability distribution can be recovered through statistical learning techniques applied to such a sample of observations.

The generic EDA evolves a population of promising solutions to a given optimization problem. If no prior information concerning the problem in the form of a population is supplied to the algorithm, the EDA begins by initializing a random population. Next, at each generation, the population is evaluated through the fitness function and the promising individuals are selected. At this stage, a probabilistic model is built over the selected portion of the population; the built model is then sampled with the aim of generating the new offspring; the new individuals are then incorporated in the population following some replacement strategy (e.g., elitism, full replacement, etc.).
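This generic loop can be made concrete with a small sketch in which the model-learning and model-sampling steps are pluggable; the univariate instantiation below (a vector of independent marginals, in the spirit of the univariate EDAs discussed shortly) is only an illustration, and all names are ours.

```python
import random

def eda(fitness, learn, sample, n, pop_size=100, sel_size=50, generations=50, seed=0):
    """Generic EDA loop: evaluate, select, build a model, sample new solutions."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection of the most promising individuals.
        selected = sorted(pop, key=fitness, reverse=True)[:sel_size]
        model = learn(selected)                               # model building
        pop = [sample(model, rng) for _ in range(pop_size)]   # model sampling
    return max(pop, key=fitness)

# Univariate instantiation: the "model" is just a vector of marginals.
learn_marginals = lambda sel: [sum(x[i] for x in sel) / len(sel)
                               for i in range(len(sel[0]))]
sample_marginals = lambda m, rng: [1 if rng.random() < p else 0 for p in m]
best = eda(sum, learn_marginals, sample_marginals, n=15)
```

Replacing `learn` and `sample` with routines that fit and sample a richer graphical model yields the more sophisticated algorithms reviewed in Chapter 2.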

Let us give the reader a feel for Estimation of Distribution Algorithms by taking as an example one of the simplest algorithms of this family, Population Based Incremental Learning (PBIL) [4]. PBIL represents for the community of EDAs what the Simple Genetic Algorithm (SGA) [22] is for the community of GAs: a forerunner. This algorithm models the population through an independence probabilistic model, i.e., each variable Xi in the problem has a certain marginal probability p(Xi) that does not depend on the other variables XV\i. The set of marginal probabilities is represented in PBIL as a probability vector PV. The PBIL algorithm, whose workflow is presented in Algorithm 1, evolves a population of solutions


P along the generations, by sampling and updating the probability vector PV. Thanks to the learning rate γ, the probability vector is updated by the algorithm considering both the selection D of promising solutions and the current status of PV according to

PVi(t+1) = γ · p̂(t)(Xi) + (1 − γ) · PVi(t),

where t is a counter that keeps track of the current generation number, and p̂(t)(Xi) is the empirical marginal probability of the variable Xi at generation t.

Convergence to a "stationary point" of the algorithm execution occurs when the probabilities in PV reach the values of 0 or 1, after which there will be no variation in the generated individuals, so that further generations of the algorithm will be useless. The learning rate γ allows the control of the speed of convergence of the algorithm. Indeed, through the learning rule, it weighs in different ways the contributions of both the past information, PV(t), and the current status of the population, p̂(t)(X).
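Algorithm 1 and the update rule above can be rendered in a few lines. The sketch below is a minimal PBIL implementation; the parameter values and the OneMax fitness are illustrative choices of ours, not prescribed by the original algorithm.

```python
import random

def pbil(fitness, n, pop_size=100, sel_size=20, gamma=0.1, generations=100, seed=0):
    rng = random.Random(seed)
    pv = [0.5] * n                       # step 1: all marginals set to 0.5
    for _ in range(generations):
        # Sample PV to generate the new population P and evaluate it.
        pop = [[1 if rng.random() < p else 0 for p in pv] for _ in range(pop_size)]
        # Select a subset D of promising individuals.
        d = sorted(pop, key=fitness, reverse=True)[:sel_size]
        # Learning rule: PV_i <- gamma * p_hat(X_i) + (1 - gamma) * PV_i.
        for i in range(n):
            p_hat = sum(x[i] for x in d) / len(d)
            pv[i] = gamma * p_hat + (1 - gamma) * pv[i]
    return pv

pv = pbil(sum, n=10)   # on OneMax, every marginal is pushed towards 1
```

With a small γ, each marginal moves slowly towards the empirical frequency observed in the selected set, which is exactly the trade-off between past information and current population described above.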

PBIL is one of the simplest Estimation of Distribution Algorithms, and it is based on a naïve assumption of independence among the variables of the given optimization problem; in doing so, PBIL belongs to the class of univariate EDAs [34, 61, 79], i.e., those algorithms which do not consider the interactions among the variables of the problem. Under this perspective, the PBIL algorithm, together with the class of univariate EDAs, does not differ from the SGA when looking at the interactions among variables. Besides univariate EDAs, there are also EDAs that employ probabilistic models with higher orders of interaction, such as bivariate [5, 21, 78, 82] or multivariate EDAs [64, 77].

1.2 Can we improve DEUM? Why should we care?

In the literature of EDAs much of the effort is focused on the employment of directed graphical models for the statistical modelling of the population. The use of undirected graphical models, such as Markov Networks, is not so popular in this context. As a matter of fact, most of the EDAs based on Markov Networks do not exploit all the expressive power of such a formalism. Markov Networks are used to model the population in a first stage, but are often converted to directed graphical models during the sampling of the new offspring. This is mainly due to some features of Markov Networks that make their use difficult or computationally expensive. However, several real-world optimization problems cannot be efficiently represented by directed graphs, while they can be naturally encoded through undirected models.


Recently, an interesting framework based on the formalism of Markov Networks has been proposed under the name of Distribution Estimation Using Markov Networks (DEUM) [78, 79, 80, 81]. The idea that distinguishes this framework from other approaches is based on the direct use of the information provided by the fitness function. In the general case, the fitness function is usually reduced to a ranking score for the selection operator, while in the DEUM framework each solution gives a contribution to the estimation of the model proportional to its fitness. Indeed, the probabilistic model is constructed so as to assign to each solution a probability proportional to its fitness value. That is, if the probabilistic model is well posed, i.e., the Markov Network is chosen according to the interactions among variables and there are sufficient observations to correctly estimate the parameters of the model, the solution having maximal probability corresponds to the optimal solution of the problem. Next, it is possible to employ sampling techniques aimed at finding such a solution regardless of the others. As a matter of fact, the DEUM framework models the fitness function through an undirected graphical model. This approach was shown to significantly reduce the number of fitness evaluations required to reach the optimal solution in several optimization problems (e.g., the Ising Spin Glass function [78, 82] and MAXSAT [16]).

Besides this innovative idea, the DEUM framework suffers from some issues. The task of estimating the parameters of the joint probability distribution, which in this framework takes the form of a Boltzmann distribution, requires the solution of a linear system, and this has a computational cost that might be very expensive. Furthermore, at the state of the art, the DEUM framework requires prior knowledge concerning the problem to be supplied in terms of the structure of the Markov Network.
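To make the linear-system step concrete, the sketch below fits a univariate DEUM-style model by least squares: each bit is mapped to a spin in {−1, +1}, and the energy U(x) = a0 + Σi ai·si is fitted so that exp(−U(x)) is proportional to the fitness. This is an illustrative toy of ours (function name, encoding and fitness included); the actual DEUM model and the randomized alternative to the SVD-based solver are presented in Chapter 3.

```python
import numpy as np

def fit_univariate_model(population, fitness_values):
    """Least-squares fit of a univariate DEUM-style model (illustrative sketch).

    Solves the over-determined linear system A a = -log f in the
    least-squares sense (NumPy's lstsq uses an SVD internally).
    """
    spins = 2.0 * np.asarray(population) - 1.0            # {0,1} -> {-1,+1}
    A = np.hstack([np.ones((spins.shape[0], 1)), spins])  # intercept + spins
    b = -np.log(np.asarray(fitness_values, dtype=float))  # f(x) ∝ exp(-U(x))
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs                                          # [a0, a1, ..., an]

# Toy usage: f(x) = 2^(number of ones), so every ai comes out negative,
# i.e. the model correctly prefers si = +1 (lower energy) for every bit.
rng = np.random.default_rng(0)
pop = rng.integers(0, 2, size=(200, 8))
fit = 2.0 ** pop.sum(axis=1)
coeffs = fit_univariate_model(pop, fit)
```

The cost of this step is dominated by the matrix decomposition, which is the motivation for the randomized model fitting contribution of this thesis.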

A limited amount of work has been done to enable the automatic discovery of problem regularities in DEUM. In particular, the lack of automatic learning techniques prevents the application of the DEUM framework to the resolution of those problems whose structure of interactions is unknown. In this thesis we aim at improving the DEUM framework from different points of view. Particular focus is given to the problem of automatic selection of the probabilistic model, although some effort is also spent with the aim of reducing the computational complexity of parameter estimation.

1.3 Thesis Contributions and Road Map

In this section, we summarize the original contributions given by the work presented in this thesis. Our primary objective was to extend the DEUM


framework with structure learning techniques, but most of the effort spent in this direction can be useful to the rest of the Estimation of Distribution Algorithms community. The original contributions provided by this research are:

i Model Selection as an Optimization Problem. The task of selecting the proper structure of the Markov Network is addressed by introducing the use of ℓ1-regularized logistic regression techniques.

ii Randomized Model Fitting. The computational effort required to estimate the parameters of the probabilistic model used by the DEUM algorithms is reduced by introducing the use of randomized methods for matrix decomposition, in contrast to the classical SVD.

iii Evoptool: the Evolutionary Optimization Toolkit. Evoptool provides a common platform for the development and testing of new evolutionary algorithms, with the aim of facilitating performance comparison. The toolkit offers a wide set of benchmark problems and a collection of implementations of GAs and EDAs.

The remainder of this document is organized as follows.

In Chapter 2, the reader is introduced to the Estimation of Distribution Algorithms literature. The most notable works are reviewed and categorized by their probabilistic model. A brief theoretical introduction to Probabilistic Graphical Models allows the reader to better understand them. In Chapter 3, the theoretical foundations of the DEUM framework are reviewed in depth. The probability distribution, the relationship between the probabilistic model and the fitness function, the estimation of the parameters, and the sampling of the model are described in detail. Finally, we introduce the first original contribution: the randomized model fitting.

In Chapter 4, we address the model selection problem in the literature of EDAs, with particular focus on the DEUM framework. Next, ℓ1-regularized logistic regression is introduced as a method for the selection of the model, with a preliminary analysis of its performance and a comparison with other approaches.

In Chapter 5, we propose a novel approach to optimization in the DEUM framework, and thus in the literature of EDAs, with the DEUMℓ1 algorithm. The chapter begins with a description of the workflow of our proposal. Next, the performance of the DEUMℓ1 algorithm is analysed and compared with that of other popular EDAs.

In Chapter 6, the goals of this thesis are summarized with a discussion on the results obtained. Next, we give several indications on future lines of


work that can be followed with the aim of further extending this work. In Appendix A, we present in a nutshell the Evolutionary Optimization Toolkit (Evoptool). The set of standardized benchmarks and the evolutionary search algorithms implemented within this toolkit are summarized. The appendix also contains an introduction to the software architecture of Evoptool.

In Appendix B, we provide additional graphics concerning the performance of model selection based on ℓ1-regularized logistic regression on the Alternated Bits function. We also provide graphics that report details of the performance comparison with other popular model selection criteria.


Estimation of Distribution Algorithms

Estimation of Distribution Algorithms use probabilistic modelling of promising solutions to guide their search towards an optimal solution. However, when the assumption of stochastic independence fails, the complexity of the probabilistic models rises, so that it is general practice to employ more powerful modelling techniques. Such techniques are represented by Probabilistic Graphical Models (PGMs) [43, 62, 88]. PGMs bring together graph theory and probability theory in a powerful formalism for multivariate statistical modelling. They provide a unifying framework for capturing complex dependencies among random variables and building large-scale multivariate statistical models. The use of PGMs is becoming very popular in several research communities, including signal and image processing, statistical physics, combinatorial optimization, Information Retrieval, Machine Learning, and, recently, the Evolutionary Computation community. The focus of this chapter is to provide the reader with the necessary background on Probabilistic Graphical Models to better understand this work, and, at the same time, to introduce her to Estimation of Distribution Algorithms by reviewing the most important works present in the literature, categorized by PGM.

2.1 Probability Distributions on Graphs in EDAs

In classical Genetic Algorithms an individual is usually modelled through a chromosome represented by a string of discrete or continuous values. This representation is formalized in EDAs in the sense that each allele of the individual is associated with a random variable Xi taking values in some


space Xi. Again, the state space Xi may be either continuous (e.g., Xi = R) or discrete (e.g., Xi = {0, 1, . . . , r}). In the remainder of this work lower case letters will be used to denote particular elements xi ∈ Xi, so that the notation Xi = xi corresponds to the event that the random variable Xi takes the value xi ∈ Xi. In order to simplify the notation, an individual of length n will be denoted simply by X, leaving the length of the vector implied, while a sample of random vectors, i.e., a population of p individuals, will be represented by Xp, where x(i) ∈ X stands for the i-th individual of the population. The notation XS, where the subscript S refers to a set of vertices, denotes the set of random variables {Xs | ∀s ∈ S ⊆ V}.

In most EDAs the probability models are represented by Probabilistic Graphical Models. The key idea of a PGM is that of factorization, or rather, a PGM consists of a collection of probability distributions that factorize according to the structure of a certain graph. A graph G = (V, E) is defined by a set of vertices V = {1, 2, . . . , n} and a set of edges E ⊆ V × V. Each edge consists of a pair of vertices s, t ∈ V, and may be either undirected, in which case there is no distinction between (s, t) and (t, s), or directed, in which case the edge will be denoted by (s → t) to distinguish the direction. At this stage, given a graph G = (V, E) and an individual X, every random variable Xi is associated with a vertex i ∈ V, while the edges between pairs of vertices encode a set of independence assumptions. Indeed, the topology of the graph defines a set of independences among random variables. This concept will be better defined in the next sections.

Again, an Estimation of Distribution Algorithm makes use of a PGM defined on a graph G = (V, E) with the aim of factorizing a joint probability distribution, p(x), that describes the population X. The structure of the graph can be either fixed a priori or learned from a sample of individuals depending on the nature of the algorithm considered. The structure of the graph is then used to sample a new population of individuals.

The remainder of this chapter defines the factorizations encoded by the most popular Probabilistic Graphical Models and gives examples of EDAs that make use of those models.

2.2 Bayesian Networks

Bayesian Networks (BNs) are probabilistic graphical models which define a family of probability distributions that factorize according to the structure of a Directed Acyclic Graph (DAG). A DAG is a graph where all the edges are directed, and in which it is not possible to find a directed path that begins in a vertex and ends in the same vertex. Given a directed edge


Figure 2.1: (a) A simple Bayesian Network with four variables (X1, X2, X3, X4). Vertices {1, 2, 3} are the parents of vertex 4, written π(4) = {1, 2, 3}. (b) A more complex Bayesian Network that defines a partial order on its vertices.

(s → t) of a DAG, we say that t is a child of s, or conversely, that s is a parent of t. For any vertex i ∈ V, let π(i) denote the set of the parents of the given node i as in Figure 2.1(a). If a vertex i has no parents, then the parent set of i is the empty set, π(i) = ∅.

Given a directed acyclic graph G = (V, E), it is possible to define a partial order on the set of vertices V by the notion of ancestry. A node s is an ancestor of the node t if there is a directed path (s, u1, u2, . . . , uk, t) between the vertices (e.g., in Figure 2.1(b) the vertex 3 is an ancestor of vertex 5 through the directed path (2, 4, 5)). Now, for each vertex i ∈ V and its parent set π(i), let pi(xi|xπ(i)) denote a non-negative function over the variables (xi, xπ(i)) normalized such that ∫ pi(xi|xπ(i)) dxi = 1. In terms of these local functions, a Bayesian Network consists of a collection of probability distributions which factorize according to

p(x) = ∏i∈V pi(xi|xπ(i)). (2.1)

Moreover, the non-negative function pi(xi|xπ(i)) represents the conditional probability of the variable Xi = xi given its parents Xπ(i) = xπ(i).
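The factorization (2.1) can be evaluated directly once the local conditional probabilities are stored as tables. The following sketch (with a hypothetical data layout of our choosing) computes the joint probability of a configuration as the product of the local terms.

```python
def bn_joint(x, parents, cpt):
    """Probability of configuration x under the factorization (2.1).

    parents maps each vertex to the tuple of its parents; cpt maps each vertex
    to a table from (value, parent values) to the local conditional probability.
    """
    p = 1.0
    for i in parents:
        pa = tuple(x[j] for j in parents[i])
        p *= cpt[i][(x[i], pa)]
    return p

# Toy DAG 0 -> 1 over binary variables.
parents = {0: (), 1: (0,)}
cpt = {
    0: {(0, ()): 0.6, (1, ()): 0.4},
    1: {(0, (0,)): 0.9, (1, (0,)): 0.1,
        (0, (1,)): 0.2, (1, (1,)): 0.8},
}
prob = bn_joint([1, 1], parents, cpt)   # p(X0 = 1) * p(X1 = 1 | X0 = 1)
```

Sampling from such a model proceeds in the ancestral order induced by the DAG: each variable is drawn from its local conditional once its parents have been assigned.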

Bayesian Networks are the most popular probabilistic graphical models employed in the EDA literature. Different algorithms make use of BNs to factorize the probability distribution of the population of individuals. In the simplest EDAs based on BNs, the structure of the DAG that characterizes the network may be restricted either to a single directed chain or to a directed tree. However, there are also algorithms that fully exploit the


power of this formalism. The next sections present some of these EDAs in increasing order of model complexity.

2.2.1 MIMIC

In [21] De Bonet et al. introduce for the first time in the literature of Estimation of Distribution Algorithms an algorithm which considers interactions among the variables of an optimization problem. The Mutual Information Maximizing Input Clustering (MIMIC) is an EDA that uses a directed chain model for the structure of the distribution which describes the population. Moreover, MIMIC learns this model from scratch. The algorithm makes use of a greedy heuristic based on Shannon's Information Entropy [83] with the aim of minimizing the Kullback-Leibler divergence [49] between the learned model and the true distribution. The workflow of the MIMIC algorithm is given in Algorithm 2.

The family of probability distributions employed by the MIMIC algorithm consists of a subset of the distributions encoded by a Bayesian Network. Indeed, given a permutation of the numbers between 1 and n, π = {π1, π2, . . . , πn}, the algorithm makes use of a class of probability distributions of the form

p(x) = p(xπ1|xπ2)p(xπ2|xπ3) . . . p(xπn−1|xπn)p(xπn)

where π defines the ancestral order encoded by the chain that, this time, consists of a total order of the variables.

At each generation, the algorithm determines the optimal permutation of the variables by means of a greedy heuristic that, given the last node added to the chain, computes the Information Entropy of all the remaining

Algorithm 2: MIMIC
1  Generate random initial population P
2  while stopping criteria are not met do
3      Select a subset D from P of individuals with fitness higher than the median value
4      Determine entropy of all the variables using D
5      Select Xi with lowest marginal entropy as head of the chain
6      while chain does not contain all the variables do
7          Compute pairwise conditional entropy given the last variable added to the chain, for remaining variables
8          Add the variable with lowest conditional entropy to the chain
9      Sample new population P according to the chain


variables and chooses the one with the minimum value. Once the distribution is chosen, generating samples is straightforward: for each individual to be generated, first choose a value for Xπn based on the empirical probability p(xπn), then proceed along the rest of the chain using the conditional empirical probabilities p(xπk|xπk+1).
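The chain-sampling step just described can be sketched as follows; the data layout (a marginal for the tail variable plus a table of conditionals per adjacent pair) is a hypothetical choice of ours.

```python
import random

def sample_chain(perm, tail_marginal, conditionals, rng):
    """Ancestral sampling along a MIMIC-style chain (sketch).

    perm lists the variable indices from head to tail; the tail variable is
    drawn from its marginal p(x_pi_n), every other variable from the
    conditional p(x_pi_k | x_pi_k+1), walking the chain backwards.
    """
    x = [None] * len(perm)
    tail = perm[-1]
    x[tail] = 1 if rng.random() < tail_marginal else 0
    for k in range(len(perm) - 2, -1, -1):
        cur, nxt = perm[k], perm[k + 1]
        p_one = conditionals[(cur, nxt)][x[nxt]]    # p(X_cur = 1 | X_nxt)
        x[cur] = 1 if rng.random() < p_one else 0
    return x

# Deterministic toy chain X0 <- X1 used only to exercise the code path.
rng = random.Random(0)
x = sample_chain([0, 1], tail_marginal=1.0,
                 conditionals={(0, 1): {0: 0.0, 1: 1.0}}, rng=rng)
```

Since every variable has at most one conditioning variable, sampling an individual costs O(n), which is what makes the chain model attractive despite its limited expressiveness.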

The performance of MIMIC on synthetic functions has been shown to be promising [21]. Nevertheless, real-world optimization problems rarely have an underlying chain structure, so that the relevance of this algorithm is more theoretical than practical. Indeed, the approach introduced by MIMIC has been extended and adapted by other popular EDAs, one of which is COMIT, to which we now move on.

2.2.2 COMIT

A straightforward variation of the MIMIC algorithm is represented by COMIT, Combining Optimizers with Mutual Information Trees [5, 6]. COMIT makes use of a tree-structured Bayesian Network rather than a chain, in such a way that two or more variables can share the same parent node.

In COMIT, the basic criterion used to learn the tree is the Mutual Information I among pairs of variables

I(Xi, Xj) = ∑xi∈Xi, xj∈Xj p̂(xi, xj) log [ p̂(xi, xj) / (p̂(xi) p̂(xj)) ],

which is strictly related to the value of the entropy, since it can be computed as the sum of the entropies of each variable minus the joint entropy. This consideration reinforces the relation between the COMIT and MIMIC algorithms. However, COMIT makes use of a maximum weight spanning tree algorithm [19] to build the tree-structured Bayesian Network rather than a greedy heuristic as in MIMIC. Moreover, the COMIT algorithm employs a

Algorithm 3: COMIT
1  Generate random initial population P
2  while stopping criteria are not met do
3      Create a tree-shaped probabilistic network T that models P
4      Use T to stochastically generate D solutions
5      Select a set Q of the fittest solutions from D
6      Execute fast-search procedure, either PBIL or hill-climber, starting from the points in Q
7      Replace up to M individuals in P using better individuals found during fast search


fast-search procedure with the aim of further optimizing the individuals generated by the stochastic sampling of the network. In particular, the algorithm has been proposed with a fast-search procedure based either on a hill-climber or on the PBIL [4] algorithm.

COMIT extends the class of probability distributions on DAGs employed by MIMIC, since the directed chain can be seen as a particular instance of a tree-shaped BN. However, no matter how much it improves the performance of the MIMIC algorithm, there is still more power in the formalism of Bayesian Networks which can be exploited by an EDA.
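The two ingredients of COMIT's model learning, empirical mutual information and a maximum-weight spanning tree, can be sketched as follows. This is an illustrative Prim-style construction of ours, not the specific spanning tree algorithm of [19].

```python
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information between discrete variables i and j."""
    n = len(data)
    pij = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def max_weight_spanning_tree(n, data):
    """Prim-style maximum-weight spanning tree with MI edge weights."""
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        s, t = max(((s, t) for s in sorted(in_tree)
                    for t in range(n) if t not in in_tree),
                   key=lambda e: mutual_information(data, *e))
        edges.append((s, t))          # s becomes the parent of t
        in_tree.add(t)
    return edges

# Toy data: X1 copies X0 while X2 is independent, so the strongest edge is (0, 1).
data = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
tree = max_weight_spanning_tree(3, data)
```

Orienting the resulting edges away from an arbitrary root yields the tree-shaped Bayesian Network that COMIT samples from.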

2.2.3 Bayesian Optimization Algorithm

An EDA able to fully exploit the potential of the BN formalism is represented by the well-known Bayesian Optimization Algorithm (BOA) [63, 64, 66]. BOA is able to learn from scratch a BN in which the structure of the interactions is not restricted a priori to any particular shape. In order to learn the most adequate structure for the Bayesian Network, a greedy algorithm is usually used to achieve a good compromise between search efficiency and model quality. In BOA, the quality of a given network is quantified by using popular scoring metrics for Bayesian Networks such as the Bayesian Information Criterion (BIC) [75] or the Bayesian-Dirichlet metric with likelihood equivalence (BDe) [20]. The workflow of the basic implementation of BOA is given in Algorithm 4. Nonetheless, the algorithm has been extended in numerous ways, some of which are discussed in the next paragraphs.

The BOA algorithm uses conditional probability tables (CPTs) to store the conditional probability p(Xi|Xπ(i)) for each variable Xi specified by the network. Nevertheless, the number of conditional probabilities for a variable that is conditioned on k parents grows exponentially with k. Moreover, the dependencies sometimes also contain regularities, and these regularities can be exploited to represent the parameters of the network with local structures.

Algorithm 4: BOA

1  Randomly generate initial population P
2  while stopping criteria are not met do
3      Select a subset D of promising individuals from P
4      Construct the network B using a chosen metric and constraints
5      Generate a set O of new individuals according to B
6      Replace some individuals in P with the better of O according to a replacement strategy


Such local structures allow a more efficient representation of local conditional probability distributions than full CPTs. Decision trees or decision graphs can be employed as such local structures. One of the first extensions of this algorithm is represented by dBOA [66], a version of BOA which exploits decision graphs with the aim of representing sets of conditional probabilities.

A further extension of BOA is represented by the Hierarchical Bayesian Optimization Algorithm (hBOA) [63]. The hBOA algorithm combines BOA with local structures and a niching technique with the aim of solving difficult hierarchical and deceptive optimization problems. Restricted tournament replacement (RTR) is used by hBOA to ensure effective niching. RTR selects a subset of the original population for each newly generated offspring and lets the offspring compete with the most similar members of this subset, where similarities are measured through the Hamming distance.

The Bayesian Optimization Algorithm has been extended in several other ways that we are not going to present here. A work on this algorithm that is worth mentioning is represented by L1BOA [93]. In this variant of BOA, ℓ1-regularized logistic regression is employed with the aim of restricting the set of candidate parents of each variable during the learning of the network. For each node, the regularization path2 is constructed, and in the meanwhile each solution given by the path is used to assess the minimum description length (MDL) of the network with that set of parents. The point in the regularization path that minimizes the MDL is used to set up an initial set of parents. This set of vertices is then used as the starting point for the usual learning algorithms employed by BOA.
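The core mechanism, regressing one variable on all the others with an ℓ1 penalty and keeping the variables with non-zero coefficients, can be sketched as follows. This is a simplified illustration of ours (a fixed penalty λ solved by proximal gradient descent, rather than a full regularization path scored by MDL as in L1BOA); all names and parameter values are hypothetical.

```python
import numpy as np

def l1_logistic_neighbors(X, i, lam=0.05, steps=2000, lr=0.1):
    """Neighbourhood recovery for variable i (illustrative sketch).

    Fits an l1-regularized logistic regression of X_i on all the other
    variables with proximal gradient descent (ISTA); the variables whose
    coefficients remain non-zero are taken as candidate parents/neighbours.
    """
    y = X[:, i].astype(float)
    Z = np.delete(X, i, axis=1).astype(float)
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))           # logistic predictions
        w -= lr * (Z.T @ (p - y) / n)                    # gradient step
        b -= lr * float(np.mean(p - y))
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    others = [j for j in range(X.shape[1]) if j != i]
    return {others[k] for k in range(d) if abs(w[k]) > 1e-6}

# Toy sample: X1 always equals X0 while X2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
X = np.stack([x0, x0, rng.integers(0, 2, 500)], axis=1)
neigh = l1_logistic_neighbors(X, 0)
```

The same sparsity-inducing mechanism underlies the model selection approach for Markov Networks proposed in Chapter 4 of this thesis, where the recovered neighbourhoods define the undirected structure.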

2.3 Markov Random Fields

When the graph G = (V, E) has an undirected nature it is possible to factorize a probability distribution according to functions defined over the set of cliques C which characterizes the graph. A clique C is a set of fully connected vertices, meaning that for each pair of vertices s, t ∈ C there exists an edge (s, t) belonging to E (e.g., in Figure 2.2(a) the set of vertices {1, 2} defines a clique). Moreover, a clique C is said to be a maximal clique if it is not part of any other clique C′ in the given graph (e.g., in Figure 2.2(a) the set of vertices {1, 2, 4} defines a maximal clique).

Let ψC be a function defined as ψC : ⊗s∈C Xs → R+, where ⊗s∈C Xs denotes the Cartesian product of the state spaces of the variables within the

2 The regularization path consists of the family of solutions of the logistic regression problems obtained when the strength of the regularization parameter λ is free to vary over the range (0, ∞).


Figure 2.2: (a) A Markov Network on n = 5 vertices, with maximal cliques {1, 2, 4}, {2, 3, 4} and {2, 4, 5}. (b) A Markov Network which defines a 4-nearest neighbour lattice model in 2D.

clique. A direct consequence of this definition is that the function ψC is a local quantity in the elements xC. Such a function ψC is usually called a compatibility function or potential function.

Following this notation, an undirected graphical model defines a family of distributions which factorize according to

p(x) = (1/Z) ∏C∈C ψC(xC), (2.2)

where Z is a normalization constant called the partition function3, which ensures that the distribution sums up to 1. Such undirected models are often referred to as Markov Random Fields (MRFs), but they are also known as Markov Networks (MNs) or Gibbs distributions. It is worth noticing that, differently from Bayesian Networks, the compatibility functions ψC do not have any obvious interpretation in terms of marginal or conditional probabilities.
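The factorization (2.2) can be evaluated directly for tiny models, which also makes the cost of the partition function tangible: the sketch below (an illustrative pairwise MRF of our choosing) computes Z by enumerating all configurations, something that becomes infeasible as soon as n grows.

```python
from itertools import product

def clique_product(y, cliques, psis):
    # Unnormalized probability: the product of the clique potentials (2.2).
    p = 1.0
    for c, psi in zip(cliques, psis):
        p *= psi(*(y[v] for v in c))
    return p

def mrf_prob(x, n, cliques, psis):
    """Normalized probability under (2.2), with Z computed by brute force.

    Enumerating all 2^n configurations is viable only for tiny n, which is
    precisely why the partition function is intractable in general.
    """
    Z = sum(clique_product(y, cliques, psis) for y in product((0, 1), repeat=n))
    return clique_product(tuple(x), cliques, psis) / Z

# Toy pairwise MRF on 3 vertices with "attractive" potentials on two edges.
cliques = [(0, 1), (1, 2)]
psis = [lambda a, b: 2.0 if a == b else 1.0] * 2
p = mrf_prob((1, 1, 1), 3, cliques, psis)
```

The intractability of Z is one of the features that make sampling from Markov Networks difficult, and it motivates approximate schemes such as the Gibbs sampler used later in this thesis.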

Given an undirected graph G = (V, E) and considering only its maximal cliques, it is possible to define a junction graph. A junction graph is defined as a graph in which the nodes correspond to the maximal cliques of G and between a pair of nodes there exists an edge if and only if their corresponding cliques overlap (e.g., Figure 2.3(a) shows the junction graph associated with the undirected graph of Figure 2.2(a)). Moreover, an undirected graph G is said to be a chordal graph if all the cycles in G made of more than

3 The partition function is defined as Z = ∑y∈Ω ∏C∈C ψC(yC), where Ω is the set of all possible configurations of X. In general, the partition function is exponential in the length of X and therefore is not feasible to compute.


Figure 2.3: (a) A junction graph of the undirected graph shown in Figure 2.2(a) and (b) its corresponding junction tree, where the square node represents a separator set.

three edges have a chord. Such undirected graphs are also known as triangulated graphs. Figure 2.2(a) shows an undirected graph which is also a chordal graph. An interesting property of a chordal graph is that on such a graph it is possible to construct a junction tree. A junction tree is a particular kind of tree where the nodes correspond either to the maximal cliques of an undirected graph or to the intersection of a pair of adjacent cliques, in which case the node is called a separator. A junction tree satisfies what is usually referred to as the running intersection property, which states that: given a clique tree, for any two clique nodes Ci and Cj, all nodes on the unique path joining them contain the intersection Ci ∩ Cj. The running intersection property is of considerable importance in this context, not only because any undirected graph which satisfies this property can be formulated as a junction tree, but in addition, for such graphs there is an exact inference algorithm called the Junction Tree Algorithm [51].

Several instances of EDAs based on Markov Random Fields are presented in the remainder of this section, some of which exploit the above-mentioned notions of junction tree and junction graph.

2.3.1 Factorized Distribution Algorithm

Mühlenbein et al. were the first to propose an EDA based on undirected graphical models. The Factorized Distribution Algorithm (FDA) [60] requires prior knowledge about the optimization problem to be supplied in the form of an Additive Decomposable Function (ADF); this means that FDA is not capable of learning the structure of the problem by itself. The ADF is given to the algorithm as a set of potential functions which define the cliques in the probabilistic model, a Markov Random Field. The distribution specified by the network is factorized into a junction tree, and the overall probabilistic


model takes the form

$$p(x) = \prod_{C \in \mathcal{C}} p(x_{C_r} \mid x_{C_s}),$$

where each clique C is divided into two sets: the residuals $C_r$ and the separators $C_s$.

In the FDA algorithm, parameter estimation is based on the computation of the empirical conditional probabilities defined by the model from the selected individuals. The sampling of the new solutions is similar to that used for Bayesian Networks: starting from the root of the junction tree, the configuration of the variables belonging to each clique is sampled from the associated conditional probability, following the ordering defined by the junction tree. Naturally, the values of the variables belonging to the separators are sampled before the residuals, since the former also belong to a clique that has just been sampled.
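A minimal sketch of this ancestral sampling scheme, assuming discrete variables and a hypothetical clique ordering given as (residuals, separators) pairs consistent with a junction tree; the conditional distributions are the empirical ones induced by the selected individuals:

```python
import random

def sample_individual(pop, order, rng):
    """Ancestral sampling along a junction tree ordering (illustrative sketch).

    pop:   selected individuals, each a dict mapping variable -> value
    order: list of (residual_vars, separator_vars) pairs; the separators of
           each clique must already be instantiated by earlier cliques
    """
    x = {}
    for res, sep in order:
        # individuals agreeing with the current separator assignment
        match = [ind for ind in pop if all(ind[v] == x[v] for v in sep)]
        # draw the residuals jointly from the empirical conditional
        # (fall back to the marginal if no individual matches)
        donor = rng.choice(match) if match else rng.choice(pop)
        for v in res:
            x[v] = donor[v]
    return x
```

For the junction tree of Figure 2.3(b), a valid ordering would be `[([2, 3, 4], []), ([5], [2, 4]), ([1], [2])]` (variable labels here are illustrative).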

It has been theoretically proven that, when the model is correct, the FDA solves decomposable problems quickly and accurately [60]. However, the algorithm requires prior information about the problem in the form of its decomposition and factorization. Unfortunately, this is usually not available when dealing with real-world problems, and therefore the use of FDA is limited to those problems where an accurate approximation of the decomposition is possible. Moreover, the supplied factorization must satisfy the running intersection property in order to be transformed into a junction tree. The latter requirement further reduces the field of application of the algorithm; nonetheless, FDA has been shown to be a good starting point for the development of other EDAs, some of which are discussed later.

2.3.2 MN-FDA

The Markov Network Factorized Distribution Algorithm (MN-FDA) [72] represents one of the variants of the FDA that try to overcome the limitations implied by the use of a junction tree. The MN-FDA does this by learning from scratch a junction graph instead of the junction tree employed by its predecessor. The junction graph encodes an approximate factorization of the probability distribution, which is then sampled to generate the new offspring.

The MN-FDA employs an articulated procedure to learn the probabilistic graphical model that describes the population. The first step consists in the construction of a Markov Network from a sample of individuals. It starts with a complete bivariate network, and then it tries to remove edges by

Algorithm 5: MN-FDA
1  Generate initial population P randomly
2  while stopping criteria are not met do
3      Select a subset of individuals D of P
4      Learn a Markov Network M from D
5      Find the set L of all the maximal cliques of M
6      Construct a labelled JG from L
7      Calculate the marginal probabilities for all the cliques in JG
8      Generate a new population by sampling from JG

testing for conditional independence between the linked nodes through a χ2 independence test. Moreover, the values of the χ2 statistics are used as weights for the remaining edges, which are employed in the following step: finding the maximal cliques of the undirected graph. This is done by means of the well-known Bron and Kerbosch algorithm [12], a branch and bound technique for finding the maximal cliques of a given undirected graph. The set of maximal cliques is then sorted in a list in descending order of weight. Finally, the ordered list of maximal cliques is passed to a heuristic which constructs a labelled junction graph, where each label refers to the position of the clique in the list.
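This structure-learning step can be sketched as follows, assuming binary variables in a NumPy array of selected individuals (data, variable count, and the significance threshold are all illustrative); networkx's `find_cliques` provides a Bron and Kerbosch style enumeration of the maximal cliques.

```python
import numpy as np
import networkx as nx
from scipy.stats import chi2_contingency

def learn_markov_network(data, alpha=0.05):
    """Prune a complete graph with pairwise chi-squared independence tests,
    keeping an edge only when independence is rejected (p < alpha)."""
    n_vars = data.shape[1]
    G = nx.complete_graph(n_vars)
    for i, j in list(G.edges()):
        # 2x2 contingency table of the two binary variables
        table = np.array([[np.sum((data[:, i] == a) & (data[:, j] == b))
                           for b in (0, 1)] for a in (0, 1)])
        chi2, p, _, _ = chi2_contingency(table)
        if p >= alpha:
            G.remove_edge(i, j)              # independence not rejected
        else:
            G.edges[i, j]["weight"] = chi2   # chi^2 value kept as edge weight
    return G
```

The maximal cliques of the learned network are then recovered with `list(nx.find_cliques(G))` and sorted by their edge weights before the junction graph is built.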

The sampling of the junction graph is similar to that of the junction tree in the FDA. The cliques are sampled according to the order defined by the set of labels. The variables of the first clique are sampled from the corresponding marginal probabilities; for the remaining cliques, each subset of variables that has not yet been instantiated is sampled conditionally on the other variables that belong to the clique. A relevant difference between a junction graph and a junction tree is that only the latter is guaranteed to be free of cycles. Nonetheless, this fact does not affect the sampling algorithm, because the conditioning and conditioned sets of variables always belong to the same clique.

2.3.3 MN-EDA

The Markov Network Estimation of Distribution Algorithm (MN-EDA) [73] is an extension of the MN-FDA, which in turn is a variation of the FDA. The MN-EDA replaces the junction graph used by its predecessor with a Kikuchi approximation [45] of the distribution, so as to further extend the representation capability of the factorization employed. The workflow of this EDA is given in Algorithm 6.


The core of the method lies in the choice of an appropriate set of marginal probabilities from which to obtain an approximation of the distribution. In particular, a Kikuchi approximation is a way to approximate the factorization of the energy in a Boltzmann distribution. Kikuchi developed the Cluster Variational Method [45], an algorithm that constructs an approximation of the energy as a function of a set of different marginals. The approximation is defined on a region-based decomposition of a Markov Network, calculating the marginal probabilities for each region, where a region is to be understood as a set of cliques belonging to the network. The key idea in this approach is to find a valid regional decomposition of the graph.

The MN-EDA shares with its predecessor MN-FDA part of the procedure used to learn the structure of the probabilistic model. As in MN-FDA, the algorithm learns a Markov Network from scratch, removing unnecessary edges from a fully connected graph using Pearson's χ2 independence test to decide on each edge. The graph is then refined to ensure a maximum neighbourhood size for each vertex, and the Bron and Kerbosch algorithm [12] is employed to recover the set of maximal cliques. At this stage, differently from MN-FDA, the MN-EDA finds through the Cluster Variational Method a valid set of regions starting from the initial set represented by the maximal cliques of the network. Once the regional decomposition of the Markov Network is defined, the next step consists in computing all the necessary marginal probabilities from a sample of individuals, and finally in sampling the distribution by means of a Gibbs sampler [28] in order to generate the new offspring.
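The Gibbs sampling step can be illustrated with a toy binary pairwise Markov network; the pairwise potentials below merely stand in for the region marginals of an actual Kikuchi approximation, and all names and values are illustrative.

```python
import math
import random

def gibbs_sample(n_vars, neigh, weight, n_sweeps, rng):
    """Single-site Gibbs sampling for a binary pairwise Markov network with
    p(x) proportional to exp(sum over edges (i,j) of w_ij * x_i * x_j)."""
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    for _ in range(n_sweeps):
        for i in range(n_vars):
            # local field felt by variable i from its instantiated neighbours
            field = sum(weight[frozenset((i, j))] * x[j] for j in neigh[i])
            # the conditional p(x_i = 1 | rest) is a logistic of the field
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
    return x
```

With a strong positive coupling, e.g. a chain of four variables with all edge weights set to 5, the sampler concentrates on aligned configurations, mimicking how MN-EDA draws offspring biased toward high-probability regions of the model.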

The approach proposed by Santana [73] with the MN-EDA allows for a full representation of all the interactions learned with the χ2 independence test, in contrast with the earlier FDA and MN-FDA algorithms, which were unable to do so.

Algorithm 6: MN-EDA
1  Generate initial population P randomly
2  while stopping criteria are not met do
3      Select a subset of individuals D of P
4      Learn a Markov Network M from D
5      If necessary, refine the graph M
6      Find the set C of all the maximal cliques of M
7      Construct a clique-based decomposition of the graph
8      Find the marginal probabilities for the regions of the decomposition
9      Generate a new population by sampling from the Kikuchi approximation
