

POLITECNICO DI MILANO

Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica e Informazione

Cutting-Plane methods for Inverse Reinforcement Learning

Advisor: Ing. Marcello Restelli
Co-advisor: Matteo Pirotta

Master's thesis by: Stefano Balloccu, student ID 787147


Contents

List of Figures
List of Algorithms
List of Theorems
List of Tables
Sommario
Abstract
Acknowledgements

1 Introduction
1.1 General introduction
1.2 Goals of the Thesis
1.3 Thesis structure

2 State of the Art: Reinforcement Learning
2.1 RL problem introduction
2.1.1 Markov Decision Process
2.1.2 Dynamic Programming
2.1.3 Mathematical optimizations
2.1.4 Reinforcement Learning methods

3 Inverse Reinforcement Learning
3.1 Inverse Reinforcement Learning introduction
3.1.1 Markov Decision Process in IRL problems
3.2 IRL works
3.2.1 Feature expectation focusing methods
3.2.2 Weight-based methods

4 Revisiting the IRL framework: a multi-objective perspective
4.1 Multi-objective optimization concept

5 State of the Art: the cutting-plane methods
5.1 Cutting-plane methods
5.2 Selecting the interior point
5.2.1 Analytic Center Cutting-Plane (ACCPM) method
5.2.2 Center of Gravity (CG) algorithm
5.2.3 Maximum Volume Inscribed Ellipsoid (MVIE) algorithm
5.2.4 Chebyshev Center (CC) algorithm

6 CPIRL algorithm
6.1 Cutting-plane in IRL
6.2 Theoretical Results
6.2.1 Bounding the performance loss to the size of the polyhedron
6.2.2 Bounding the number of iterations
6.2.3 Upper bounding the error in the weight space
6.2.4 Theoretical results with a non proper polyhedron
6.3 Sub-optimal Expert

7 Experiments
7.1 Gridworld
7.2 LQR
7.3 Human Walking

8 Conclusions and future works


List of Figures

2.1 Agent-environment interaction
4.1 Bi-dimensional convex Pareto frontier in the feature space. All the optimal expert solutions lie on the Pareto frontier. The light-shaded area represents the set of all the possible sub-optimal expert solutions.
5.1 The inequality a^T z ≤ b defines a cutting-plane at the query point ω, for the target set X, shown shaded, that lies in a polyhedron P_k.
5.2 Left: a deep cut for the point ω and target set X that lies in a polyhedron P_k. Right: a neutral cut for the point ω and target set X that lies in a polyhedron P_k.
5.3 A polyhedron P_k, query point ω_{k+1} and the new polyhedron P_{k+1} shown shaded. Left: the cutting-plane cuts a very small part of P_k and the new polyhedron P_{k+1} is not much smaller than P_k. Right: the cutting-plane cuts a large amount of space from P_k and the new polyhedron P_{k+1} is smaller.
5.4 Same scenario as Figure 5.3 but using a more central query point ω_{k+1} with different cutting-plane directions. In both cases we obtain a good reduction in the size of the localization polyhedron.
6.1 The Pareto front, the sub-optimal expert solution in red and one of the possible dominating solutions in green. All the solutions on the Pareto front that lie in the dark-gray shaded area dominate the expert.
7.1 The mean number of steps (standard deviation shaded) needed by CPIRL, using CG, to converge, as a function of the number of samples used to compute the center of gravity point at each step.
7.2 A person represented by the Human Walking model.
7.3 Examples of trajectories.


List of Algorithms

1 Analytic Center Cutting-Plane Method (ACCPM)
2 Cutting-Plane Inverse Reinforcement Learning (CPIRL)
3 Cutting-Plane Inverse Reinforcement Learning (CPIRL), Sub-Optimal case


List of Theorems

2.1 Definition (Markov property)
3.1 Definition (Game matrix)
4.1 Definition (Pareto dominance)
4.2 Definition (Pareto frontier)
4.3 Definition (Optimal expert)
4.4 Definition (Sub-optimal expert)
6.2.1 Lemma (CPIRL: expected performance loss)
6.1 Definition (Proper Polyhedron)
6.2.2 Theorem (CPIRL: bounding radius)
6.2.3 Theorem (CPIRL: bounding minimax problem solution)
6.2.4 Theorem (MVIE: bounding minimax problem)
6.2.5 Theorem (CPIRL: bounding iterations number)
6.2.6 Theorem (MVIE: bounding iterations number)
6.2.7 Theorem (CPIRL: bounding weight's distance)
6.2.8 Theorem (MVIE: bounding ω_E − ω distances in the polyhedron)


List of Tables

3.1 Some properties of the IRL algorithms proposed in the literature and described in this section
7.1 Gridworld model: comparing computational time and number of steps required by different cutting-plane methods to find an apprentice policy π_A s.t. J(π_A) ≥ 0.99 J(π_E)
7.2 Gridworld model: comparing times and number of steps required by the algorithms to find an apprentice policy π_A s.t. J(π_A) ≥ 0.99 J(π_E)
7.3 LQR model, optimal expert: comparing times and number of steps required by the different cutting-plane methods to find a policy π_A such that J(π_A) ≥ 0.98 J(π_E)
7.4 LQR model, optimal expert: comparing time (s) and number of steps required by the algorithms to find a policy π_A such that J(π_A) ≥ 0.98 J(π_E)
7.5 LQR, suboptimal expert: computational times and number of steps required by different cutting-plane methods to find a policy π_A that dominates the expert (µ(π_A) ⪰ µ(π_E))
7.6 LQR, suboptimal expert: computational times and number of steps required by the algorithms to find a policy π_A that dominates the expert (µ(π_A) ⪰ µ(π_E))
7.7 HW, optimal expert: time and number of steps required by different cutting-plane methods to find a policy π_A such that J(π_A) ≥ 0.99 J(π_E)
7.8 HW, optimal expert: time and number of steps required by the algorithms to find a policy π_A such that J(π_A) ≥ 0.99 J(π_E)
7.9 HW, suboptimal expert: time and number of steps required by different cutting-plane methods to find a policy π_A that dominates the expert (µ(π_A) ⪰ µ(π_E))
7.10 HW, suboptimal expert: time and number of steps required by the algorithms to find a policy π_A that dominates the expert (µ(π_A) ⪰ µ(π_E))


Sommario

Reinforcement Learning (RL) (see Chapter 2 for the state of the art) is a branch of machine learning that deals with the problem of making an agent, placed in a given environment, learn a behavior that achieves a goal through reward (feedback) mechanisms. This interaction between agent and environment happens through an exchange of information: at a given time the agent occupies a state s, representing its situation within the environment, and, based on the knowledge of the occupied state, it can select the action it deems most appropriate. For each state-action combination the environment returns to the agent a reward r(s, a), according to a reward function, and a new state s'. The agent builds a function called a policy, π, which defines the agent's behavior by associating with each state the probability of selecting a given action. The purpose of Reinforcement Learning methods is to specify how the agent updates this policy, based on the rewards received from the environment, so that the total accumulated reward is maximized.

Over the years, reinforcement learning techniques have been applied to increasingly complex problems and, as a consequence, it has become harder to define the reward function. Indeed, a badly defined reward function may lead the agent to learn undesired behaviors that are not easily predictable a priori. Therefore, as described in [17], it has been observed that for certain applications it is easier to provide practical examples, such as demonstrations, rather than specifying the reward function directly. From here, methods to solve Inverse Reinforcement Learning (IRL) problems (see Chapter 3 for the related state of the art) have become increasingly important: the objective is to solve the inverse problem in which, given demonstrations of a behavior, one has to reconstruct a reward function whose associated policy has performance very close to, if not better than, that of the provided demonstrations.


In this thesis we describe a new approach to solve this kind of problem, placing ourselves in a setting similar to the one considered in the literature [1]. The reward function is expressed as a linear combination of several functions, analogously to what happens in multi-objective problems (see Chapter 4), in which each objective is represented by a function µ and by a weight ω that represents the agent's preference for that objective. If one objective is considered more important than another, it will have a larger weight.

Our goal, and hence that of our algorithm, is, given demonstrations of an expert that is optimal with respect to the reward function R = ω · φ, with ω unknown, to find those values of ω for which the associated policy performs like the demonstrations.

Under these same assumptions, the approaches described in the literature that are most related to our work are the ones proposed in [28, 27], where the learning problem is formulated as an adversarial game whose objective is to dominate the expert, i.e., to do better than the expert on every objective. The algorithms described there are MWAL and LPAL: the first is an iterative method with a guarantee of convergence in finite time to an ε-optimal solution; the second solves an LP problem from which the policy that dominates the expert is derived, but with all the limitations inherent in LP approaches for solving MDP problems [12].

The approach proposed in this thesis, CPIRL (Cutting-Plane IRL), relies on cutting-plane methods to solve IRL problems. Cutting-plane methods [6] (see Chapter 5 for the related state of the art and the possible methods) are approaches whose objective is to find a point, or a set of points, in a convex space. Iteratively, these methods query a point inside the space, telling us whether it is the one we were looking for or not; in the latter case they compute a hyperplane associated with that point, called a cutting plane, which allows eliminating, cutting away, a part of the space in which we know the searched point does not lie. In IRL, and thus in our case, the point to be found is the weight vector ω and the initial convex space is the set of all the possible weight combinations. The CPIRL algorithm, described in detail in Chapter 6, starts from an initial space of possible weights and at each iteration finds a point inside it, computes the performance of the associated policy and compares it with the expert's performance obtained from the demonstrations. If the performance found is equal or anyway close, up to a small error ε, to the expert's, the algorithm terminates; otherwise CPIRL computes the hyperplane that reduces the space, eliminating all the weights for which the expert is not better (see Chapter 4), and then looks for another point inside the search space, and so on.

Besides being a novel algorithm, CPIRL offers strong guarantees on the obtained results thanks to cutting-plane theory. Indeed, as described in Section 6.2 of Chapter 6, CPIRL offers two kinds of guarantees: the first, based on how CPIRL works and independent of the cutting-plane method used, gives, while the algorithm is running, a bound on the performance loss, compared to the expert, that would be incurred by stopping at that iteration; the second, dependent on the cutting-plane method (specifically, we will see the guarantees for two methods, Analytic Center and Maximum Volume Inscribed Ellipsoid), gives an upper bound on the number of iterations needed to obtain a result with the desired accuracy. The basic version of CPIRL works when the expert is optimal, i.e., it maximizes a reward function that can be represented as a linear combination of the considered basis functions. When this does not hold, CPIRL cannot find an acceptable solution: since the expert is not optimal, no weight would exist in the considered space for which the expert is better, with the consequence that this space would shrink to the empty set.

The thesis also presents a version that works in the case of a sub-optimal expert; it does not search for the expert's weights but aims at finding a weight associated with a policy whose performance exceeds the expert's (see Section 6.3).

Finally, we ran tests on several domains, namely Gridworld, LQR and Human Walking, to evaluate the performance of the various cutting-plane methods, in order to give the reader an indication of which cutting-plane method CPIRL should preferably use depending on the situation. The same tests were also run to compare CPIRL with other IRL algorithms, observing very good performance that makes CPIRL a valid alternative to the other IRL methods.


Abstract

The purpose of this thesis is to propose a novel approach that uses cutting-plane methods to solve Inverse Reinforcement Learning (IRL) problems. The IRL goal is to identify a reward function by observing the behavior of an expert that is supposed to maximize the accumulation of the reward expressed by such a function. Our algorithm, which we call the Cutting-Plane Inverse Reinforcement Learning algorithm (CPIRL), works under the common IRL assumption that such a reward function can be represented as a linear combination of a given set of basis functions. It exploits cutting-plane methods to perform an iterative search for the optimal weights: at each iteration it selects a weight vector that lies inside a search space, computes the corresponding optimal policy and then introduces a cutting plane in order to reduce the search space by excluding all the weight vectors that cannot be the one optimized by the expert.

We will present two different versions of CPIRL: the first works when the expert is optimal, and CPIRL aims at finding the reward function used by the expert; the second works under the condition of a suboptimal expert, where CPIRL converges to a weight vector whose related policy is better than the expert's one. We will also illustrate various cutting-plane methods that can be used by our algorithm to select the interior query point and hence the cutting plane. We will provide some theoretical results for CPIRL, giving a bound on the convergence to an ε-optimal solution and an upper bound on the number of steps required by CPIRL to stop, along with some interesting properties related to the various cutting-plane methods. In the end we will compare different cutting-plane methods and also give an empirical comparison of CPIRL with related IRL algorithms in various domains.


Acknowledgements

A sincere and immense thank you to all the people who have been close to me, putting up with me and supporting me during these years.


Chapter 1

Introduction

1.1 General introduction

Reinforcement Learning (RL) faces the problem of finding the optimal policy, i.e., an optimal agent behavior to reach a goal, in a Markov Decision Process (MDP), when the underlying dynamics of the environment are unknown. The optimality is measured against an informative reward function that is assumed to be given by an expert. This kind of problem leverages the principle that in many problems it is easier to define the reward function than to manually code the optimal policy, but this is not true in many real situations. In fact there are several examples of applications, for instance applications that involve human interaction or robotic applications, where the design of the reward function is as difficult as defining the optimal policy. As noticed in [17], in some applications it can be easier to demonstrate examples of a desired behavior than to define the reward function directly. This is true in many cases where the control problem has been solved by a human expert, for example in car driving, where the goal is to learn how to behave by observing expert demonstrations. Since the early 2000s, several contributions have been provided to solve the problem of learning from demonstrations, giving rise to the Apprenticeship Learning (AL) framework [1]. In this framework, the goal is to learn a policy that performs similarly to the expert one, assuming that the expert is behaving in an MDP whose reward function is unknown to the apprentice. There are many ways to solve AL problems: for instance, they can be solved using policy imitation, by finding an agent whose performance is as good as the expert demonstrations, or within Inverse Reinforcement Learning (IRL) algorithms, whose goal is to recover a reward function for which the expert behavior is optimal.

A common assumption made by most IRL algorithms, for example in [17], is that the expert optimizes a multi-objective reward function that can be represented as a linear combination of a set of known features. As a consequence, the IRL goal is to identify the weights of the expert reward function. In these settings, the approaches most related to our work are the ones proposed in [28, 27]. They formulate the AL problem as an adversarial game where the goal is to dominate the expert on every objective. Their iterative algorithm (MWAL) is guaranteed to converge in a finite time to an ε-dominant solution. After MWAL, they proposed an LP algorithm (LPAL) that, exploiting the dual LP formulation, is able to derive an apprentice policy that dominates the expert, but it presents all the limitations of LP approaches when used to solve MDP problems.

1.2 Goals of the Thesis

The purpose of this thesis is to identify a solution method for IRL problems that provides good guarantees on convergence, accuracy and computational complexity. We therefore present Cutting-Plane Inverse Reinforcement Learning (CPIRL), a novel IRL algorithm based on a cutting-plane method. Cutting-plane methods, as described in [6], aim at finding a point, or a set of points, that lies in a convex space, by iteratively finding a set of linear inequalities, each one representing a hyperplane, in order to reduce the search space. These methods, in order to find the searched point, first pick a point in the current space; if it is not the point we are looking for, they compute the related hyperplane that removes part of the space, thus reducing the next search space. In IRL, and in our specific case, the point that the cutting-plane method is looking for is the set of weights that generate policies whose performance is as good as the expert policy performance, and the convex space is the set of all the possible weights. Following what is done in [28], we assume that, given q reward features, the expert weights belong to the unit (q − 1)-simplex. Given a weight in the unit (q − 1)-simplex, the performance of the policy that optimizes the corresponding reward function can be used to define a linear inequality, a cut, that splits the weight space into two half spaces: one contains the expert weights and the other one does not. So the CPIRL idea is to iterate the introduction of new cuts to monotonically reduce the size of the subspace of weights that contains the expert weight vector.
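To make this loop concrete, the following is a minimal sketch of the iteration just described, not the thesis' actual implementation: the helpers `solve_mdp` (the forward RL solver returning the feature expectations of the policy that is optimal for a given weight vector) and `query_point` (any of the interior-point rules of Chapter 5) are assumed to be available, and the termination test and the cut follow the inequality described above.

```python
import numpy as np

def cpirl(mu_expert, solve_mdp, query_point, eps=1e-3, max_iter=100):
    """Sketch of the CPIRL loop: localize the expert weight vector by cutting planes.

    mu_expert   : feature expectations of the expert, shape (q,)
    solve_mdp   : callable omega -> feature expectations of the optimal policy
                  for the reward R = omega . phi (forward RL problem)
    query_point : callable (A, b) -> a point inside {omega : A omega <= b}
                  (e.g. analytic center, center of gravity, Chebyshev center)
    """
    q = len(mu_expert)
    # initial localization polyhedron: omega_i >= 0 (the simplex constraint
    # sum(omega) = 1 is assumed to be handled inside query_point)
    A = [-np.eye(q)]
    b = [np.zeros(q)]
    for _ in range(max_iter):
        omega = query_point(np.vstack(A), np.concatenate(b))
        mu_k = solve_mdp(omega)                      # best response to omega
        loss = omega @ (mu_k - mu_expert)            # >= 0 since mu_k is optimal for omega
        if loss <= eps:                              # the expert is eps-optimal for omega
            return omega, mu_k
        # cut away every weight for which the expert would not be optimal:
        # keep only omega with omega . (mu_k - mu_expert) <= 0
        A.append((mu_k - mu_expert)[None, :])
        b.append(np.zeros(1))
    return omega, mu_k
```

Each cut discards a half space of candidate weights, so the localization polyhedron shrinks monotonically, which is the behavior the guarantees discussed below build on.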


Thanks to cutting-plane theory, CPIRL offers two kinds of guarantees on the results. The first guarantee, independent of which cutting-plane method is used, allows CPIRL to give an upper bound on the performance loss, i.e., the difference between the performance of the best policy found so far and the expert policy performance, in case we want to stop the algorithm at that iteration. The second guarantee depends on which cutting-plane method we are using and gives an upper bound on the number of steps needed by CPIRL to return an ε-optimal policy with respect to the expert one.

Since CPIRL works only when we have demonstrations given by an optimal expert, we also provide a version of CPIRL that works in the case of a sub-optimal expert and allows searching for a weight that corresponds to a policy which dominates the expert one.

In conclusion, we have tested different cutting-plane methods and compared their performance in different domains, in order to give the reader information about which method is preferable in a given situation. Furthermore, we have compared CPIRL with other IRL algorithms, showing that CPIRL can be a valuable alternative for solving IRL problems.

1.3 Thesis structure

The thesis is structured as follows: Chapters 2, 3 and 5 provide the state of the art of RL, IRL and cutting-plane methods, respectively; Chapters 4 and 6 explain the setting of our problem and describe the algorithm we have implemented to solve it, providing also theoretical results; finally, Chapters 7 and 8 provide experimental results, conclusions and future works.

A more detailed thesis structure is:

In Chapter 2, we will describe the concept of Reinforcement Learning, giving an introduction to the objectives of RL problems and providing a section with the RL state of the art, where the various approaches to solve RL problems are described.

In Chapter 3, we will describe Inverse Reinforcement Learning problems, providing also a section with the state of the art of the different IRL methods described in the literature.

In Chapter 4, we will illustrate our research problem setting, describing the multi-objective concept, the definitions of Pareto optimal solutions and defining what optimal expert and sub-optimal expert mean.

In Chapter 5, we will explain what cutting-plane methods are, providing also a cutting-plane state of the art describing each method that can be used in our algorithm.


In Chapter 6, we will describe our algorithm CPIRL, giving theoretical results and providing a second CPIRL version working in the case of a sub-optimal expert.

In Chapter 7, we will show experiments on different models such as Gridworld, LQR and Human Walking, comparing different cutting-plane methods and then giving a comparison between CPIRL and other IRL methods.

In Chapter 8, we summarize our achievements, evaluate our work and propose ideas about interesting possible future works.


Chapter 2

State of the Art: Reinforcement Learning

This chapter aims at explaining to the reader the fundamental concepts needed to better understand the problems we are focusing on and how we will try to solve them. We will explain what Reinforcement Learning (RL) problems are, why RL problems are useful and how a real situation can be modeled using the Markov Decision Process (MDP) framework. Moreover we will see different solutions to solve MDPs such as RL methods, Dynamic Programming (DP) and Linear Programming (LP).

2.1 RL problem introduction

The reinforcement learning problem is meant to be a framing of machine learning focused on learning from interaction to achieve a goal. The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent interacts with the environment by selecting actions; the environment responds to those actions and presents new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent tries to maximize over time.

More specifically, the agent and the environment interact at each of a sequence of discrete time steps t = 1, 2, 3, .... At each time step the agent receives some representation of the environment's state, s_t ∈ S, where S is the set of possible states, and on that basis selects an action, a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. One time step later, partly as a consequence of its action, the agent receives a numerical reward, r_{t+1} ∈ R, and finds itself in a new state, s_{t+1}.

Figure 2.1: agent-environment interaction

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted π, where π_t(s, a) is the probability that a_t = a if s_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special reward signal passing from the environment to the agent. At each time step, the reward is a simple number, r_t ∈ R. Informally, the agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative rewards in the long run.

The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses to maximize the expected discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}    (2.1)

where γ is a parameter, γ ∈ [0, 1) , called the discount rate.

The discount rate determines the present value of future rewards: a reward received k time steps in the future is worth only γ^{k−1} times what it would be worth if it were received immediately. Since γ < 1, the infinite sum has a finite value as long as the reward sequence {r_k} is bounded.

If γ = 0, the agent follows a greedy policy trying to maximize immediate rewards.
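As a quick numeric illustration of Equation 2.1 (a small sketch, not taken from the thesis), the discounted return of a finite reward sequence can be accumulated backwards:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k] for a finite reward sequence."""
    ret = 0.0
    for r in reversed(rewards):      # accumulate from the last reward backwards
        ret = r + gamma * ret
    return ret

# e.g. rewards (1, 0, 2) with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```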

The learner needs to perceive the environment in order to determine the current state and the effect of an action; by 'state' we mean whatever information is available to the agent that would be useful to it in making decisions.

Our state is a ’signal’ that summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state signal that succeeds in retaining all


relevant information is said to be Markov, or to have the following Markov property:

Definition 2.1 (Markov property). Given the history of an agent

H_t = {s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{t−1}, a_{t−1}, r_t, s_t},

a situation is a state iff

Pr(s_{t+1} = s, r_{t+1} = r | H_t, a_t) = Pr(s_{t+1} = s, r_{t+1} = r | s_t, a_t)

for all t ≥ 0, s ∈ S, r ∈ R, a ∈ A(s_t), and all possible H_t. The meaning of this statement is that the state s' reached by performing an action a at time t from a given state s, and the received reward r, depend only on the action and the state at time t, and not on any other element of the history.

It follows that Markov states provide the best possible basis for choosing actions. The Markov property is important in reinforcement learning because decisions and values are assumed to be a function only of the current state.

A sequential decision process that satisfies the Markov property is called a Markov decision process, or MDP.

2.1.1 Markov Decision Process

A Markov decision process (MDP) provides a mathematical framework for modeling a decision making problem under uncertainty about the effect of an agent’s action in an environment.

An MDP is defined by a tuple <S, A, T, R, γ> where:

• S is the set of states.

• A is the set of actions.

• T : S × A × S → [0, 1] is the state transition function, where T(s, a, s') denotes the probability of reaching state s' from state s by taking action a.

• R : S × A × S → R is the reward function, where R(s, a, s') denotes the expected value of the next reward given any current state and action, s and a, together with any next state, s'.

• γ ∈ [0, 1) is the discount factor.


A policy π in an MDP is a mapping from each state, s ∈ S, and action, a ∈ A, to the probability π(s, a) of taking action a when in state s. The value function of policy π at state s is the expected discounted return of starting in state s and executing the policy.

The value function can be computed as:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ] = Σ_a π(s, a) Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V^π(s') ]    (2.2)

So, given an MDP, the agent's objective is to find an optimal policy π* that maximizes the value for all the states and that has to satisfy the Bellman optimality equation:

V*(s) = max_π V^π(s) = max_a Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V*(s') ]    (2.3)

for all s ∈ S.

It is often useful to express the above equation in terms of the Q-function, or action-value function: π is an optimal policy if and only if π ∈ argmax_{a ∈ A(s)} Q^π(s, a), where

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ] = Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V^π(s') ]    (2.4)

which is the expected discounted return of executing action a in state s and then following the policy π. The optimal action-value function satisfies:

Q*(s, a) = Σ_{s'} T(s, a, s')[ R(s, a, s') + γ max_{a'} Q*(s', a') ]    (2.5)

for all s ∈ S, a ∈ A(s).

Previously we have talked about RL, but how can we solve RL problems? To solve RL problems we have three categories of methods: Dynamic Programming, mathematical optimization such as Linear Programming, and RL methods.


2.1.2 Dynamic Programming

In the previous section we introduced the concept of the optimal policy π* as the policy that maximizes V^π(s). A possible way to compute the optimal policy and the value function is to use Dynamic Programming (DP) methods. DP refers to a collection of algorithms that can be used to compute optimal policies given the model of the environment as an MDP.

The key idea of DP is the use of value functions to search for good policies. Examples of DP algorithms that aim at computing the optimal value and action-value functions, V* and Q*, defined in 2.3 and 2.5, and then the corresponding optimal policy, are policy iteration (PI) and value iteration (VI).

Policy iteration (PI). Policy iteration [25] is an algorithm which alternates policy evaluation, i.e., the computation of the value function V^π(s), and policy improvement, computing the new policy that improves on the old value function, as follows:

π_0 →E V^{π_0} →I π_1 →E V^{π_1} →I ... →I π* →E V*,

where →E denotes a policy evaluation and →I denotes a policy improvement, until the optimal policy is computed. Policy iteration guarantees that each policy is an improvement of the previous one (unless it is already optimal). This process must converge to an optimal policy and optimal value function in a finite number of iterations, because a finite MDP has only a finite number of policies.

Policy evaluation. The policy evaluation [25] step allows computing the value function related to a policy π. If the environment's dynamics are completely known and the MDP is discrete, Equation 2.2 is a system of |S| linear equations in |S| unknowns. Consider the sequence of value functions V_0, V_1, V_2, ..., each mapping S to R. The initial value function, V_0, is chosen arbitrarily, and each successive V_i is obtained by using the Bellman equation 2.2 as an update rule:

V^π_{k+1}(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V_k(s') ]    (2.6)

for all s ∈ S. It is proved that the sequence {V_k} converges to V^π as k → ∞. This algorithm is called iterative policy evaluation.

To produce each successive value function (V_{k+1} from V_k), iterative policy evaluation applies the same operation to each state s, replacing the old value of s with a new one obtained from the old values of the successor states of s and from the expected immediate rewards. Each iteration of iterative policy evaluation produces a new approximate value function V_{k+1}, until the value function changes by less than a small ε with respect to the previous one, that is, max_s |V_{k+1}(s) − V_k(s)| < ε for all s ∈ S.

Once we have found the value function we can obtain the new policy π(s) with policy improvement.

Policy improvement. Policy improvement [25] is an operation that aims at finding a new policy using the information given by the value function returned by policy evaluation, computing

π'(s) ∈ argmax_a Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V(s') ]

for each state s ∈ S. In this way the new policy is a greedy policy taking the best action according to V^π.
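As an illustration of the evaluation/improvement alternation described above, here is a minimal tabular sketch, assuming the MDP is given as arrays `T[s, a, s']` and `R[s, a, s']` (a simplified example, not the thesis' code):

```python
import numpy as np

def policy_iteration(T, R, gamma, eps=1e-8):
    """Tabular policy iteration. T and R have shape (S, A, S)."""
    n_s, n_a, _ = T.shape
    policy = np.zeros(n_s, dtype=int)              # deterministic initial policy
    while True:
        # policy evaluation: iterate the Bellman expectation backup (2.6)
        V = np.zeros(n_s)
        while True:
            V_new = np.einsum('sz,sz->s',
                              T[np.arange(n_s), policy],
                              R[np.arange(n_s), policy] + gamma * V)
            if np.max(np.abs(V_new - V)) < eps:
                break
            V = V_new
        # policy improvement: greedy with respect to the evaluated value function
        Q = np.einsum('saz,saz->sa', T, R + gamma * V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # no change: policy is optimal
            return policy, V
        policy = new_policy
```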

Another Dynamic Programming algorithm used to find the optimal policy of an MDP is value iteration.

Value iteration. This method, as described in [25], is obtained simply by turning the Bellman optimality equation into an update rule. Value iteration is identical to the policy evaluation update 2.6 except that it requires the maximum to be taken over all actions.

Like policy evaluation, value iteration requires an infinite number of iterations to converge exactly to V*, so we stop once the value function changes by only a small quantity ε.
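A corresponding sketch of value iteration, under the same assumed tabular `T` and `R` arrays, simply replaces the expectation over the policy with a maximum over actions:

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Tabular value iteration: Bellman optimality backups until the change is below eps."""
    n_s = T.shape[0]
    V = np.zeros(n_s)
    while True:
        Q = np.einsum('saz,saz->sa', T, R + gamma * V)   # one-step lookahead
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)               # V* and a greedy policy
        V = V_new
```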

2.1.3 Mathematical optimizations

In the next sections of this thesis we will not use mathematical optimization to solve RL problems, so we do not go into the details of this category of methods. We limit ourselves to saying that, like Dynamic Programming algorithms, mathematical optimization methods need the model of the environment to compute the optimal policy. This category aims at finding the optimal policy by solving a Linear Program (LP) formulation of the RL problem:

min_V Σ_{s ∈ S} V(s)
s.t. V(s) ≥ Σ_{s'} T(s, a, s')[ R(s, a, s') + γ V(s') ]   for all s ∈ S, a ∈ A.
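The following is a small sketch of how this LP can be assembled and handed to a generic solver (here SciPy's `linprog`), again under the assumption of tabular `T[s, a, s']` and `R[s, a, s']` arrays; it illustrates the formulation above, not a method used later in the thesis:

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve_mdp(T, R, gamma):
    """Solve min_V sum_s V(s) s.t. V(s) >= sum_s' T(s,a,s')[R(s,a,s') + gamma V(s')]."""
    n_s, n_a, _ = T.shape
    c = np.ones(n_s)                                   # objective: sum of state values
    A_ub, b_ub = [], []
    for s in range(n_s):
        for a in range(n_a):
            # rewrite the constraint as  gamma * T V - V(s) <= -E[R | s, a]
            A_ub.append(gamma * T[s, a] - np.eye(n_s)[s])
            b_ub.append(-np.dot(T[s, a], R[s, a]))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_s)
    V = res.x
    Q = np.einsum('saz,saz->sa', T, R + gamma * V)
    return V, Q.argmax(axis=1)                         # optimal values and a greedy policy
```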

2.1.4 Reinforcement Learning methods

In contrast with DP, these methods aim at computing the optimal policy π* without the model of the environment. We will see three classes of RL methods: Approximate Value Iteration (AVI), Approximate Policy Iteration (API) and then Policy Search methods.

Approximate Value Iteration (AVI). Since we do not have a model of the environment, or we simply have an infinite state space S, we need to use a function approximator, F, such as a linear approximation or k-nearest neighbors, to approximate the value function V(s). The approach of these methods (we consider API too) is to compute an approximation V_apx(s) of the value function V^π(s), defined in F, and then compute the corresponding policy π as

π(s) ∈ argmax_{a ∈ A} Σ_{s' ∈ S} T(s, a, s')[ R(s, a, s') + γ V_apx(s') ].

So AVI, the extension of VI, works in unknown-model situations: to find the optimal policy it first computes the approximate value function V_apx by sampling n states s_i ∈ S, i = 1, 2, ..., n, uniformly over S; then, for each action a ∈ A, it computes m state samples s'_{i,a} by taking action a in the state s_i and generates the reward samples R(s, a, s'); finally it applies VI on V_apx.

Approximate Policy Iteration (API). As for AVI, Approximate Policy Iteration is an extension of PI used when a complete model is unavailable. This algorithm chooses an initial policy and iteratively alternates approximate policy evaluation, to compute an approximation V_apx of the value function, with policy improvement, to find a newer policy π, until an optimal policy is found.

Policy Search. An alternative method to find a good policy is to search directly in the policy space. These methods can be divided into gradient-based and gradient-free methods.

Gradient-based methods. We start with a mapping from a vector space to the space of policies: given a parameter vector we compute the associated policy, and then, defining a performance function that is differentiable in the parameter vector, we compute its gradient. If this gradient is known we can use gradient ascent, otherwise we have to rely on a noisy estimate [26].


Gradient-free methods This category includes simulated annealing, cross-entropy search or methods of evolutionary computation [13]. Many of these methods can achieve a global optimum.

The issue with policy search methods is that they may converge slowly if the information on which they act is noisy, but value-function based methods might help in this case.


Chapter 3

Inverse Reinforcement Learning

In this chapter we will introduce Inverse Reinforcement Learning (IRL) problems, investigating the motivations that have made this problem a relevant task in machine learning and how MDPs are useful to model IRL problems too. Finally, in the last section we will give a general overview of the IRL literature.

3.1 Inverse Reinforcement Learning introduction

As seen in Subsection 2.1.1, the MDP formalism leverages the principle that in many problems it is easier to define the reward function than to manually code the optimal policy; however, in many real-world cases this is not possible.

In [17] the authors have noticed that in some applications it is easier to demonstrate examples of a desired behavior rather than to specify an informative reward function. There are several examples of applications where the design of the reward function may be as difficult as designing the optimal policy. For instance, in robotic applications an expert-designed reward function may give rise to undesired behavior that cannot be easily predicted a priori [8]; there are also other real-world applications where the control problem has been solved by a human expert or by a control agent designed by a human expert, e.g., car driving, humanoid robotics and many others.

So solving inverse RL problems means finding the reward function by observing expert demonstrations.

Since the early 2000s, the RL community has provided several contributions to solve the problem of learning from demonstrations, giving rise to the Apprenticeship Learning (AL) framework [1]. In AL, the goal is to learn a policy that performs similarly to the expert one, assuming that the expert is behaving in an MDP whose reward function is unknown to the apprentice. The AL problem can be solved using policy imitation, by finding an agent whose performance is as good as the expert demonstrations, or within Inverse Reinforcement Learning (IRL) algorithms, whose aim is to recover a reward function for which the expert behavior is optimal.

Inverse reinforcement learning (IRL) was first proposed by Russell (1998) as follows:

”Given measurements of an agent’s behavior over time, in a variety of circumstances, measurements of the sensory inputs to the agent, a model of the physical environment (including the agent’s body), determine the reward function that the agent is optimizing.”

IRL is a natural way to examine animal and human behaviors. If the decision maker is assumed to follow the principle of rationality (Newell, 1982), his/her behavior could be understood by knowing the reward function that the decision maker internally optimizes. Learning from demonstrations can be useful in situations, such as car driving problems, where it is too difficult to define an appropriate reward function and it is more appropriate to learn it by observing examples of driving behavior.

In addition, if the agent is rational, we can exploit the computed reward function to generate an agent that imitates the decision maker’s behavior. This will be a useful approach to build an intelligent agent.

Another advantage of IRL is that the solution of IRL problems, that is, the reward function, is one of the most transferable representations of the agent’s behavior.

Later we will see a number of studies on IRL that have been reported in the last decade, such as [1, 17, 16] and many others in the literature.

Since IRL problems are related to RL problems, we can model a real situation using the MDP framework, which provides a good starting point for developing IRL algorithms too.

3.1.1 Markov Decision Process in IRL problems

The IRL problem, in completely observable Markovian environments, is denoted by an MDP without the reward function, MDP\R, which is formally stated as a tuple <S, A, T, γ> where S, A, T and γ are the same as in the MDP defined in 2.1.1. Given an expert's policy π_E, the goal is to find the reward function R that makes π_E an optimal policy for the given MDP.

We can have two types of expert demonstrations: in the first type the expert's policy is explicitly given, while in the second type the expert's policy is implicitly given by a set of trajectories D = {τ_1, τ_2, ..., τ_n} generated by π_E, where a generic trajectory τ is a sequence of state-action pairs of finite length L: τ = {(s_i, a_i)}, i = 1, 2, ..., L.

As in the Abbeel and Ng studies [17] and in other works in the literature, the unknown reward function R can be represented as a linear combination of features φ(·), each one having a related weight:

R(s) = ω_1 φ_1(s) + ω_2 φ_2(s) + ... + ω_k φ_k(s),    (3.1)

where φ_1, ..., φ_k are fixed, known, bounded basis functions mapping from S into R, representing the goals, and the unknown ω_i's specify the weighting between these features.

This new reward formulation defines the value of a policy π as follows:

V^π(s_0) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π ] = E[ Σ_{t=0}^{∞} γ^t ω^T φ(s_t) | π ] = ω^T E[ Σ_{t=0}^{∞} γ^t φ(s_t) | π ]

where the expected discounted accumulated feature value vector E[ Σ_{t=0}^{∞} γ^t φ(s_t) | π ], which we define as µ(π) ∈ R^k, is called the feature expectations.

Using this notation the value of a policy can be written as V(s_0) = ω^T µ(π) and, in the case of a set of m trajectories, we can denote the empirical estimate of V(s_0) by

V̂(s_0) = ω^T µ̂(π)   where   µ̂(π) = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{∞} γ^t φ(s_t^i).
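A small sketch (using the same notation, with an assumed user-supplied feature map `phi`) of how the empirical feature expectations µ̂(π) can be computed from a set of trajectories, each given as the list of visited states:

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma):
    """Estimate mu_hat(pi) = (1/m) sum_i sum_t gamma^t phi(s_t^i).

    trajectories : list of m trajectories, each a list of visited states
    phi          : callable mapping a state to a feature vector of length k
    """
    m = len(trajectories)
    mu_hat = None
    for traj in trajectories:
        for t, s in enumerate(traj):
            f = (gamma ** t) * np.asarray(phi(s), dtype=float)
            mu_hat = f if mu_hat is None else mu_hat + f
    return mu_hat / m

# with a weight vector omega, the empirical value is then simply omega @ mu_hat
```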

3.2 IRL works

As said before, within a few years many studies have been reported about IRL problems and how they can be solved, with many different theories and algorithms.

We classify these solutions into three categories. We have algorithms that find the expert policy by focusing on finding the policy having:

• the same feature expectations as the expert policy;

• the same feature weights as the expert's weights;

• the same performance J = ω^T µ as the expert's policy.

In Table 3.1(a) we give a schema of how we have classified the methods we are going to report in the next sections, and in Table 3.1(b) we show some properties of the algorithms reported in this chapter.

Table 3.1: Some properties of the IRL algorithms proposed in the literature and described in this section.

(a) Weights-focusing vs feature-expectation-focusing algorithms:
• feature expectation focusing methods: Projection method, Max Margin method, MWAL, LPAL
• weights focusing methods: NG algorithms, MMP, MMP-BOOST, Natural Gradient IRL, Maximum Entropy, Relative Entropy, Policy walk algorithm, GPIRL

(b) Algorithm properties (finite S, infinite S, linear R, non-linear R, need for an LP/QP solver), by category:
• feature exp. focusing methods: Projection, Max margin, MWAL, LPAL
• mathematical methods: NG methods
• numerical optimization: MMP, MMPBOOST, Natural Gradient IRL, Maximum Entropy, Relative Entropy
• probabilistic methods: Policy walk, GPIRL

3.2.1 Feature expectation focusing methods

We can classify in this category the Abbeel and Ng algorithms [1] and the Syed algorithms [28, 27], since their goal is to find a mixed or stationary policy with performance near to the expert policy performance.

Both approaches assume the true reward function to be a linear combination of a set of known and observable features, as described in Section 3.1.1 (even if they are subject to different constraints), but the ideas that they exploit to solve the IRL problem are pretty different.


In [1], the authors aim at providing a set of policies whose feature expectations' convex closure is ε-close to the expert's feature expectations, given enough examples:

‖µ(π_E) − µ(π_i)‖_2 ≤ ε,   i ∈ {1, 2, ..., maxsteps}.

However, unlike the Abbeel and Ng algorithms, Syed et al. [28, 27] show that their algorithms may produce (when the expert is not optimizing a reward function contained in the search space) a policy that is substantially better than the expert one.

In [1] two algorithms are proposed: the Max-margin and the Projection method. The first is a QP-based algorithm that tries to find, at each iteration i, a reward function R(s) = ω_i^T φ(s), with φ(s) ∈ [0, 1]^k, such that E_{s_0 ∼ D}[V^{π_E}(s_0)] ≥ E_{s_0 ∼ D}[V^{π_i}(s_0)] + t. This means finding a reward on which the expert does better, by a margin of a positive value t, than any of the policies found previously, computing at each step the following maximization:

max_{t, ω}   t
s.t.   ω^T µ(π_E) ≥ ω^T µ(π_j) + t,   j = 0, ..., i − 1
       ‖ω‖_2 ≤ 1

This is equivalent to finding the maximum margin hyperplane separating the sets of points µ(πj) from µ(πE).

In the projection algorithm, instead, no QP solver is needed: at each step, using a geometric approach, it updates the weights, replacing the maximization of the Max-margin method by computing, at iteration i, the orthogonal projection µ̄ of µ(π_E) onto the line through the last two µ(π_i)'s found previously, and taking the difference µ(π_E) − µ̄^{(i−1)} as the new weight vector ω.
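The following is a minimal sketch of this projection scheme, assuming a hypothetical helper `solve_mdp` that returns the feature expectations of the policy that is optimal for a given weight vector; the arbitrary starting weights stand in for the random initial policy of the original algorithm:

```python
import numpy as np

def projection_method(mu_expert, solve_mdp, eps=1e-3, max_iter=100):
    """Sketch of the Abbeel & Ng projection algorithm.

    solve_mdp : callable omega -> feature expectations mu(pi) of the policy
                that is optimal for the reward R = omega . phi
    Returns the list of intermediate feature expectations mu(pi_i).
    """
    mus = [solve_mdp(np.random.randn(len(mu_expert)))]  # mu of an arbitrary first policy
    mu_bar = mus[0]
    for _ in range(max_iter):
        omega = mu_expert - mu_bar                      # new weight vector
        if np.linalg.norm(omega) <= eps:                # expert's mu is (almost) reached
            break
        mu_new = solve_mdp(omega)
        mus.append(mu_new)
        # orthogonal projection of mu_expert onto the line through mu_bar and mu_new
        d = mu_new - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return mus
```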

Syed and colleagues also proposed two algorithms, Multiplicative Weights for Apprenticeship Learning (MWAL) [28] and Linear Programming Apprenticeship Learning (LPAL) [27]. MWAL is faster and easier to implement than the max-margin and projection methods, and can be applied even in the absence of an expert by setting µ(π_E) = 0.

MWAL has different constraints on the feature and weight vectors with respect to the approaches of Abbeel & Ng: it assumes that φ(s) ∈ [−1, 1]^k and R*(s) = (ω*)^T φ(s) for some ω* ∈ R^k with ‖ω*‖_1 = 1 and ω* ⪰ 0. This method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [10] for playing repeated matrix games.

The goal is to find the mixed policy ψ* that achieves V*. Since V^ψ = (ω*)^T µ(ψ) for all ψ, we have that ψ* is the policy, in the policy space Ψ,


that maximizes V^ψ − V^{π_E} with respect to the worst-case possibility for ω.

Since ω∗ is unknown, maximizing for the worst case is appropriate.

The idea is based on a two-person zero-sum game, where the "min player" specifies a reward function by choosing ω, and the "max player" chooses a mixed policy ψ. The goal of the min player is to make the max player's policy perform as poorly as possible relative to the expert, and the max player's goal is just the opposite.

Definition 3.1 (Game matrix). Generally a game is defined by its associated game matrix, a k × |Π| matrix G, where Π is the policy space, and each component of G is defined as

G(i, j) = µ_i(π_j) − µ_i(π_E),    (3.2)

where µ(π_j) is the vector of feature expectations for the j-th deterministic policy π_j ∈ Π, and µ_i(·) is the i-th component of µ(·).

So, given the matrix G(ψ), the problem can be written in the form of a max-min or min-max problem as follows:

v* = max_{ψ ∈ Ψ} min_ω ω^T G(ψ) = min_ω max_{ψ ∈ Ψ} ω^T G(ψ)    (3.3)

where Ψ is the set of all mixed policies and v* is called the game value. In each iteration t of the MWAL algorithm, an optimal policy π_t and its feature expectations µ(π_t) are computed with respect to the current weight vector ω_t. Then the weight vector is updated: its i-th component ω_i is increased or decreased depending on whether π_t is a bad or a good policy (relative to π_E) with respect to the i-th basis reward function.
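Here is a minimal sketch of that multiplicative-weights loop; the forward solver `solve_mdp` is an assumed helper, and the learning rate and the rescaling of the per-feature payoff to [0, 1] follow the standard Freund–Schapire scheme, which may differ in detail from the thesis' formulation:

```python
import numpy as np

def mwal(mu_expert, solve_mdp, n_iter, gamma):
    """Sketch of the MWAL multiplicative-weights loop.

    solve_mdp : callable omega -> feature expectations of the optimal policy
                for the reward R = omega . phi (the forward RL problem)
    Returns the per-iteration weights and feature expectations, whose mixture
    defines the output mixed policy.
    """
    k = len(mu_expert)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(k) / n_iter))   # learning rate
    W = np.ones(k)
    history = []
    for _ in range(n_iter):
        omega = W / W.sum()                                   # current weights on the simplex
        mu_t = solve_mdp(omega)                               # best response to omega
        # per-feature payoff G(i) = mu_i(pi_t) - mu_i(pi_E),
        # rescaled from [-2/(1-gamma), 2/(1-gamma)] to [0, 1]
        g = ((1.0 - gamma) * (mu_t - mu_expert) + 2.0) / 4.0
        W = W * beta ** g           # shrink weights where the apprentice already beats the expert
        history.append((omega, mu_t))
    return history
```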

In [27], the authors propose several procedures to find the optimal policy, such as value iteration or policy iteration methods, both for MWAL (called respectively MWAL-VI and MWAL-PI) and for all the methods that have to solve the forward problem. In practice they differ in the number of iterations required to reach the optimal policy.

The previous algorithms have in common that their output is a mixed policy. The following method, called LPAL, uses a dual approach to solve an IRL problem, finding a stationary policy by computing the occupancy measure x(s, a) of each state-action pair. It solves the MDP by using linear programming:

max_x   Σ_{s,a,s'} T(s, a, s') R(s, a, s') x(s, a)
s.t.    Σ_a x(s, a) = α_s + γ Σ_{s',a} x(s', a) T(s', a, s)   for all s
        x(s, a) ≥ 0,    (3.4)


where α_s is the initial state distribution, γ is the discount factor and T(s, a, s') is the transition probability.

It is well known (Puterman, 1994) that if x* is a solution to (3.4), then

π*(s, a) = x*(s, a) / Σ_a x*(s, a)    (3.5)

is an optimal policy, and x* is the occupancy measure of π*. The constraints in Equation 3.4 are often called the Bellman flow constraints. The linear program in Equation 3.4 is actually the dual of the linear program that is typically used to find an optimal policy in an MDP.
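As a sketch of Equations 3.4-3.5 in code (a simplified tabular illustration using SciPy's `linprog`, with assumed arrays `T[s, a, s']`, `R[s, a, s']` and initial distribution `alpha`; this is the underlying MDP-solving LP, not the full LPAL algorithm):

```python
import numpy as np
from scipy.optimize import linprog

def solve_occupancy_lp(T, R, alpha, gamma):
    """Dual (occupancy-measure) LP of Eq. 3.4, with the policy extracted as in Eq. 3.5."""
    n_s, n_a, _ = T.shape
    # objective: maximize sum_{s,a,s'} x(s,a) T(s,a,s') R(s,a,s')  (linprog minimizes)
    c = -np.einsum('saz,saz->sa', T, R).reshape(-1)
    # Bellman flow constraints: sum_a x(s,a) - gamma sum_{s',a} x(s',a) T(s',a,s) = alpha_s
    A_eq = np.zeros((n_s, n_s * n_a))
    for s in range(n_s):
        for sp in range(n_s):
            for a in range(n_a):
                A_eq[s, sp * n_a + a] -= gamma * T[sp, a, s]
        for a in range(n_a):
            A_eq[s, s * n_a + a] += 1.0
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=[(0, None)] * (n_s * n_a))
    x = res.x.reshape(n_s, n_a)
    # pi*(s, a) = x(s, a) / sum_a x(s, a), guarding against unvisited states
    policy = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
    return x, policy
```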

Theoretical results. All these algorithms offer similar guarantees on the feature expectations. The max-margin and projection methods return as result the set of intermediate policies π_i, i = 0, ..., n. Both terminate in

n = O( k / ((1 − γ)^2 ε^2) · log( k / ((1 − γ) ε) ) )

iterations, and to ensure this with probability at least 1 − δ we need a number of samples

m ≥ 2k / (ε (1 − γ))^2 · log(2k / δ).

So µ(π_E) is guaranteed to be ε-close to the convex closure of the set of feature expectations of the intermediate policies, µ(π_1), µ(π_2), ..., µ(π_n), and we can compute the mixed policy µ(π_A) = Σ_i λ_i µ(π_i), i = 1, 2, ..., n, by solving the following QP:

min_λ ‖µ(π_E) − µ‖_2,   s.t.   µ = Σ_{i=1}^{n} λ_i µ(π_i),   λ_i ≥ 0,   Σ_i λ_i = 1    (3.6)

and finding the mixture weights λ_i.

The theoretical results for these algorithms guarantee that, letting π_A be the mixed policy ψ or the stationary policy returned by MWAL or LPAL, MWAL takes time T ≥ O( log k / (ε^2 (1 − γ)^2) ), while LPAL in the overall worst case takes time T(|S||A| + k), where T(n) is the complexity of an LP solver.

3.2.2 Weight-based methods

Now we consider methods whose goal is to find the expert's weights ω_E that have determined the optimal expert policy or trajectories. We can split these algorithms into three main sets:

• mathematical methods: methods that find the expert weights by solving linear or quadratic problems;

• numerical optimization based methods: methods using entropy or gradient based algorithms;

• probabilistic methods: methods that model the IRL problem from a probabilistic point of view.

Mathematical Methods. In the first category we consider algorithms that formulate linear or quadratic problems solved by LP and QP techniques. Two of these methods are the algorithms in [17], which try to find the expert's weights by using LP problems under the assumption that there is no other policy π better than the expert's policy π_E.

The first method works in an MDP with a finite state space S, a set of actions A = {a_1, ..., a_k}, matrices P_a giving the state transition probabilities when action a is taken in every state, and a given discount factor γ ∈ (0, 1). Their idea is that a policy π that in each state takes action a_1, π(s) ≡ a_1, is optimal iff, for all a = a_2, ..., a_k, the reward R satisfies:

(P_{a_1} − P_a)(I − γ P_{a_1})^{−1} R ⪰ 0    (3.7)

So, with respect to Equation 3.7, the desired reward function R can be found by solving the following optimization problem:

max    Σ_{i=1}^{N} min_{a ∈ {a_2,...,a_k}} { (P_{a_1}(i) − P_a(i))(I − γ P_{a_1})^{−1} R } − λ ‖R‖_1
s.t.   (P_{a_1} − P_a)(I − γ P_{a_1})^{−1} R ⪰ 0,   ∀ a ∈ A \ {a_1}
       |R(i)| ≤ R_max,   i = 1, ..., N

where P_a(i) denotes the i-th row of P_a, −λ‖R‖_1 is a weight-decay-like penalty term, λ is an adjustable penalty coefficient and R_max is an upper bound that prevents R from taking too large values.

The second method works instead in a large state space situation where we cannot define the matrix P, so in [17] an alternative working in this case is proposed. They assume R to be a linear function, R(s) = ω^T φ(s), as in the assumption made in [1], and they want to find a setting of the ω_i's so that the resulting reward function satisfies:

V^{π*}(s) ≥ V^{π_i}(s)    (3.8)

where V^{π_i}(s_0) denotes the value function of the policy π in the MDP when the reward function is R_i(s) = φ_i(s).

In this case the optimization becomes:

max    Σ_{i=1}^{k} p( V^{π*}(s) − V^{π_i}(s) )
s.t.   ω_i ≤ 1,   i = 1, ..., d    (3.9)


where p(x) = x for x ≥ 0 and p(x) = 2x for x < 0 is a constraint violation penalty.

This algorithm starts with a random policy π_0 and at each iteration the optimization gives a new setting of the weights ω_i, and hence a new reward function R = ω^T φ. Then the algorithm computes a policy π_{k+1} that maximizes V^{π_{k+1}}(s), adds π_{k+1} to the set of policies {π_1, ..., π_k}, and goes to the next iteration, until it finds a reward function R that is satisfactory.

Numerical Optimization Based Methods. The second category includes sub-gradient methods such as Max Margin Planning and its boosted version [23, 24] and Natural Gradient IRL [16], and entropy optimization methods such as Maximum Entropy IRL [29] and Relative Entropy IRL [4].

Gradient based approaches. The goal of the MMP and MMPBOOST methods is to find the expert's weights by computing a sub-gradient to update the weights at each iteration. MMP consists in minimizing a cost function by sub-gradient descent, where an MDP problem is solved at each step; similarly to what is done in Natural Gradient IRL, it tries to minimize a loss function l that penalizes deviations from the expert's policy.

The MMP main idea is to allow only weight vectors for which the example policies π_i have higher expected reward than all other policies, by a margin that scales with the loss. The loss function penalizes choosing actions different from the teacher's at any state the teacher reaches, or penalizes reaching states the teacher chooses not to enter.

MMP solves the following quadratic program:

min_{ω, ξ_i}   (1/2) ‖ω‖^2 + (γ/n) Σ_i β_i ξ_i^q
s.t.  ∀i         ω^T µ(π_i) + ξ_i ≥ s_i^T V_i
      ∀i, s, a   V_i(s) ≥ (ω^T φ_i + l_i)_{s,a} + Σ_{s'} T_i(s, a, s') V_i(s')    (3.10)

where: V_i(s) are the value functions that satisfy the Bellman primal constraints; the ξ_i are slack variables that permit violations of these constraints for a penalty that scales with the hyper-parameter γ ≥ 0; l_i is the loss vector; T_i(s, a, s') is the transition probability; β_i > 0 are data-dependent scalars that can be used for normalization when the examples are of different lengths; q ∈ {1, 2} is a scalar value; φ_i is the d × |S||A| feature matrix, where d is the number of features and S, A are the state and action spaces; and µ(π_i) ∈ R^d is the feature expectation vector for the policy π_i.

The issues of MMP are: 1) the number of constraints scales linearly with the number of state-action pairs and of training examples; 2) solving the quadratic program is at least as hard as solving the linear programming formulation of a single MDP.

To overcome these issues they presented a simpler approach based on the sub-gradient method, writing the problem as a single cost function:

c_q(ω) = (1/n) Σ_{i=1}^{n} β_i ( max_{x ∈ G_i} (ω^T φ_i + l_i^T) x − ω^T µ(π_i) )^q + (λ/2) ‖ω‖^2,    (3.11)

where x ∈ G_i ranges over the dual state-action frequency vectors of the policies for example i.

This objective function is convex, but non-differentiable, and can be optimized by a generalization of gradient descent called the sub-gradient method (Shor, 1985).

The sub-gradient g^q_ω ∈ ∂c(ω) is defined as

g^q_ω = (1/n) Σ_{i=1}^{n} q β_i ( (ω^T φ_i + l_i^T) x* − ω^T µ(π_i) )^{q−1} φ_i Δ_ω x_i + λ ω,    (3.12)

where x* = argmax_{x ∈ G_i} (ω^T φ_i + l_i^T) x and Δ_ω x_i = x* − x_i.

At each step MMP computes the optimal policy and the state-action visitation frequencies x*_i for each reward function, computed as (ω^T φ_i + l_i^T); it then computes the sub-gradient g^q_ω ∈ ∂c(ω) as in Equation 3.12 and updates ω = ω − α_t g, where α_t is a scalar step size.
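A minimal sketch of one such update, assuming a hypothetical helper `loss_augmented_frequencies(omega, example)` that runs the planner on the loss-augmented reward (ω^T φ_i + l_i^T) and returns the maximizing state-action frequency vector x*:

```python
import numpy as np

def mmp_subgradient_step(omega, examples, loss_augmented_frequencies,
                         lam, alpha, q=1):
    """One sub-gradient update of the MMP objective (Eq. 3.11-3.12), q = 1 or 2.

    examples : list of tuples (Phi_i, l_i, x_i, mu_i, beta_i), where Phi_i is the
               d x |S||A| feature matrix, l_i the loss vector, x_i the example's
               state-action frequencies and mu_i = Phi_i x_i its feature expectations.
    loss_augmented_frequencies : callable (omega, example) -> x* maximizing
               (omega^T Phi_i + l_i^T) x over the flow polytope G_i.
    """
    n = len(examples)
    g = lam * omega                                    # gradient of the regularizer
    for ex in examples:
        Phi_i, l_i, x_i, mu_i, beta_i = ex
        x_star = loss_augmented_frequencies(omega, ex)
        margin = (omega @ Phi_i + l_i) @ x_star - omega @ mu_i
        dx = x_star - x_i
        g = g + (1.0 / n) * q * beta_i * margin ** (q - 1) * (Phi_i @ dx)
    return omega - alpha * g                           # sub-gradient descent step
```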

One of the problems of MMP is that we have to choose a fixed number of features in advance, so a good solution can be the second version of MMP, MMPBOOST, proposed by Ratliff in [24], which aims at finding the set of features by solving a simple classification problem. At a high level, this algorithm learns a new feature by learning a classifier that is best correlated with the changes we would like to make in order to decrease the loss, using all the possible features we need.

Each iteration of MMPBOOST fits the model and computes the loss-augmented cost map, updates ω_{t+1} = ω_t − γ_t g_t, where γ_t, t = 1, 2, ..., is a prespecified step-size sequence and g_t is a sub-gradient at the current time step t, invokes the planner and computes the feature expectations. Then it builds positive examples by gathering the feature vectors encountered along this loss-augmented path, giving them label 1, and builds negative examples by gathering the feature vectors encountered along the example path, giving them label −1.

Finally, MMPBOOST learns a classifier on this labeled data set of paths in order to generalize these suggestions to other points on the map, applies the classifier to every cell of all example maps, and adds the result as a new feature to the feature matrix.

If the original set of features cannot correctly represent the expert reward function as a linear combination, the algorithm tries to find a new feature, as a nonlinear function of the original base features, that simultaneously raises the cost of the current erroneous paths and lowers the cost of the example paths as much as possible. Adding this feature to the current feature set provides an incremental step toward explaining the decisions made in the example paths (a sketch of this step is given below).
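The sketch below illustrates the feature-learning step just described. The path feature arrays are assumed to be gathered elsewhere, and the logistic-regression classifier is only an illustrative choice, not the learner used in [24].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_new_feature(loss_aug_path_feats, example_path_feats, all_cell_feats):
    """Learn one new feature column from the disagreement between paths.

    loss_aug_path_feats : (n1, d) feature vectors along the loss-augmented path
    example_path_feats  : (n2, d) feature vectors along the expert example path
    all_cell_feats      : (n_cells, d) feature vectors of every map cell
    """
    X = np.vstack([loss_aug_path_feats, example_path_feats])
    y = np.hstack([np.ones(len(loss_aug_path_feats)),     # label +1
                   -np.ones(len(example_path_feats))])    # label -1
    clf = LogisticRegression().fit(X, y)
    # evaluate the classifier on every cell and append it as a new feature
    new_feature = clf.decision_function(all_cell_feats)
    return np.hstack([all_cell_feats, new_feature[:, None]])
```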

The Natural Gradient IRL method proposed in [16] aims to find the expert reward within a parametric family of rewards $\omega \in \mathbb{R}^d$, where $d$ is the number of basis functions in the MDP, such that the corresponding (near) optimal policy $\pi_\omega(s)$ matches the expert's policy $\pi_E$ (more precisely, its empirical estimate).

The proposed method can be written as the optimization problem

$$\min_\omega J(\pi_\omega) \quad \text{s.t.} \quad \pi_\omega = Gr\big(Q^*_\omega(s,a)\big),$$

where $Q^*_\omega$ is the optimal action-value function corresponding to the reward function $\omega$, $Gr$ is a suitable smooth mapping that returns (near) greedy policies with respect to its argument, and $J$ is the loss function, defined as:

$$J(\pi) = \sum_{s\in S,\,a\in A} \mu(\pi_E)\big(Pr(a\mid\pi(s)) - Pr(a\mid\pi_E(s))\big)^2 \tag{3.13}$$

where $S$ and $A$ are the state and action spaces, respectively, and $Pr(a\mid\pi(s))$ is the probability of performing action $a$ in state $s$ under policy $\pi$. This loss function measures the distance between $\pi_E$ and its argument.

The update step applies a gradient algorithm to minimize the distance in Equation 3.13. To do this, $\pi_\omega$ is chosen to be the so-called Boltzmann policy

$$Gr(Q)(a \mid s) = \frac{\exp[\beta Q(s,a)]}{\sum_{a'\in A}\exp[\beta Q(s,a')]},$$

where $\beta > 0$ is a parameter that controls how close $Gr(Q)$ is to a greedy action selection. The reason for not relying on the optimal policy is to make the policy a differentiable function of $Q^*_\omega(s,\cdot)$.


The update is additive, using

$$\Delta_k = \sum_{(s,a)\in S\times A} \hat{\mu}(\pi_E)\big(\hat{\pi}_E(a\mid s) - \pi_{\omega_k}(a\mid s)\big)\,\partial_\omega \pi_{\omega_k}(a\mid s) \tag{3.14}$$

where $\partial_\omega\pi_\omega(a\mid s)$ is the gradient of $\pi_\omega(a\mid s)$ with respect to $\omega$.
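As an illustration, the sketch below computes the Boltzmann policy and one additive update of Equation 3.14 on a tabular MDP. The optimal Q-values, the empirical expert estimates, and the policy gradient `grad_pi` are assumed to be provided by other components; this is a simplified sketch, not the method's reference implementation.

```python
import numpy as np

def boltzmann_policy(Q, beta=10.0):
    """Gr(Q): near-greedy, differentiable policy over actions for each state."""
    z = beta * Q                              # Q has shape (n_states, n_actions)
    z -= z.max(axis=1, keepdims=True)         # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def natural_gradient_irl_update(w, Q_star, pi_E_hat, mu_E_hat, grad_pi, alpha=0.05):
    """One additive update of the reward weights (Equation 3.14).

    Q_star   : (n_states, n_actions) optimal Q-values for the current weights w
    pi_E_hat : (n_states, n_actions) empirical estimate of the expert policy
    mu_E_hat : (n_states,) empirical expert state-visitation weights
    grad_pi  : (n_states, n_actions, d) gradient of pi_w(a|s) w.r.t. w
    """
    pi_w = boltzmann_policy(Q_star)
    residual = mu_E_hat[:, None] * (pi_E_hat - pi_w)   # weighted policy mismatch
    delta = np.einsum('sa,sad->d', residual, grad_pi)  # sum over (s, a)
    return w + alpha * delta
```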

MMP and its boosted version are guaranteed to converge to the minimum of their cost function, although only at a sub-linear rate under a strong convexity assumption (see [14], Proposition 2.8). However, neither Natural Gradient IRL, MMP, nor MMPBOOST provides theoretical guarantees on the quality of the recovered reward; only empirical evidence is available.

Entropy based approaches  We now consider the Maximum Entropy and Relative Entropy algorithms in [29, 4]. The principle of Maximum Entropy states that the policy that best represents the demonstrated behavior is the one with the highest entropy, subject to the constraint of matching the reward value of the demonstrated actions.

The approach, similar to Relative Entropy, consists of finding the parameters $\omega$ of a policy $\pi$ that maximizes the entropy of the trajectory distribution, subject to the constraints:

$$\forall i\in\{1,\dots,k\},\quad \sum_{\tau\in\mathcal{T}} Pr(\tau\mid\pi,T)\,f_i(\tau) = \hat{f}_i \tag{3.15}$$

where $\tau$ is a trajectory, $\mathcal{T}$ is the trajectory space, $Pr(\tau\mid\pi,T)$ is the probability that $\tau$, defined in $\mathcal{T}$, is generated by the policy $\pi$, and $f_i(\tau)$ and $\hat{f}_i$ denote the expected feature count and the empirical feature count, respectively, for the $i$-th feature of the demonstrations.

Solving this problem leads to maximizing the likelihood $L$ of the demonstrated trajectories:

$$\omega^* = \arg\max_\omega L(\omega) = \arg\max_\omega \sum_\tau \log Pr(\tau\mid\omega,T) \tag{3.16}$$

under the following maximum-entropy trajectory distribution:

$$Pr(\tau\mid\omega,T) \propto x(s_1)\exp\Big(\sum_{i=1}^{k}\omega_i f_i(\tau)\Big)\prod_{t=1}^{L} T(s_t,a_t,s_{t+1}) \tag{3.17}$$

where $\tau$ is a trajectory of finite length $L$, $x(s)$ is the expected state visitation frequency of state $s$, and $T$ is the transition function.

The objective in Equation 3.16 is a convex function and can be optimized using gradient-based optimization methods.


This leads to

$$\nabla L(\omega) = \hat{f} - \sum_\tau Pr(\tau\mid\omega,T)\,f(\tau) = \hat{f} - \sum_{s_t} x(s_t)\,\phi(s_t).$$

The gradient is the difference between the empirical feature counts $\hat{f}$ and the agent's expected feature counts, expressed through the expected state visitation frequencies $x(s_t)$.
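The following sketch illustrates the gradient step above for a tabular MDP with a linear reward. The routine `expected_state_visitation` is a hypothetical helper that would run the usual forward pass of Maximum Entropy IRL under the current reward; the snippet is a simplified illustration, not the reference implementation of [29].

```python
import numpy as np

def maxent_gradient_step(w, phi, f_hat, expected_state_visitation, lr=0.01):
    """One gradient-ascent step on the MaxEnt log-likelihood.

    w     : (d,) current reward weights
    phi   : (n_states, d) state feature matrix
    f_hat : (d,) empirical feature counts from the demonstrations
    expected_state_visitation(reward): returns x, the (n_states,) expected
                                       state-visitation frequencies under the
                                       soft-optimal policy for that reward
    """
    reward = phi @ w                       # linear reward r(s) = w^T phi(s)
    x = expected_state_visitation(reward)  # forward pass (hypothetical helper)
    grad = f_hat - phi.T @ x               # gradient of the log-likelihood
    return w + lr * grad
```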

Relative Entropy IRL is based on Generalized Maximum Entropy methods: it minimizes the KL divergence between the empirical distribution of the state-action trajectories under a baseline policy and the distribution of the trajectories under a policy that matches the reward feature counts of the demonstrations, subject to the following constraints:

$$\forall i\in\{1,\dots,k\}:\ \Big|\sum_{\tau\in\mathcal{T}} Pr(\tau)\,f_i^\tau - \hat{f}_i\Big| \le \epsilon_i, \qquad \sum_{\tau\in\mathcal{T}} Pr(\tau) = 1, \qquad \forall\tau\in\mathcal{T}:\ Pr(\tau)\ge 0;$$

where, given $n$ sampled trajectories and a confidence probability $\delta$, the threshold $\epsilon_i$ is defined as

$$\epsilon_i = \sqrt{\frac{-\ln(1-\delta)}{2n}}\;\frac{\gamma^{L+1}-1}{\gamma-1}\Big(\max_{s,a} f_i(s,a) - \min_{s,a} f_i(s,a)\Big).$$

As in Maximum Entropy, the gradient (and later a sub-gradient) is computed by forming the Lagrangian of this problem, which yields the trajectory distribution:

$$Pr(\tau) = \frac{1}{Z(\omega)}\,Q(\tau)\exp\Big(\sum_{i=1}^{k}\omega_i f_i^\tau\Big)$$

where $Z(\omega)$ is a normalization constant.

The dual function resulting from this step is:

$$g(\omega) = \sum_{i=1}^{k}\omega_i \hat{f}_i - \log Z(\omega) - \sum_{i=1}^{k}|\omega_i|\,\epsilon_i$$

and the dual problem consists in maximizing $g(\omega)$ over $\omega \in \mathbb{R}^k$. The function $g$ is concave and differentiable everywhere except at $\omega_i = 0$; hence, it can be maximized by a sub-gradient ascent based on

$$\frac{\partial}{\partial \omega_i} g(\omega) = \hat{f}_i - \sum_{\tau\in\mathcal{T}} Pr(\tau\mid\omega)\,f_i^\tau - \alpha_i\epsilon_i,$$

with $\alpha_i = 1$ if $\omega_i \ge 0$ and $\alpha_i = -1$ otherwise, where $Q$ is the distribution over trajectories induced by the baseline policy and the transition matrix $T$.
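A minimal sketch of the resulting sub-gradient ascent is shown below. The expected feature counts under the current trajectory distribution are obtained from a hypothetical helper `expected_feature_counts` (e.g., estimated by importance sampling over sampled trajectories, since $T$ may be unknown); this is an illustration, not the implementation of [4].

```python
import numpy as np

def relative_entropy_irl(f_hat, eps, expected_feature_counts,
                         n_iter=200, lr=0.05):
    """Sub-gradient ascent on the dual g(w) of Relative Entropy IRL.

    f_hat : (k,) empirical feature counts of the demonstrations
    eps   : (k,) thresholds epsilon_i
    expected_feature_counts(w): returns the (k,) feature counts
                                E_{Pr(tau|w)}[f^tau] under the current weights
    """
    k = len(f_hat)
    w = np.zeros(k)
    for _ in range(n_iter):
        f_exp = expected_feature_counts(w)
        alpha = np.where(w >= 0, 1.0, -1.0)  # sub-gradient of |w_i|
        g = f_hat - f_exp - alpha * eps
        w += lr * g
    return w
```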


The sub-gradient $\partial_{\omega_i} g(\omega)$ cannot be obtained analytically unless the transition function $T$, used for calculating $Q$ and $Pr$, is known.

The Maximum Entropy method, like MMP and Natural Gradient IRL, does not offer any guarantee on the resulting weights; the output of Relative Entropy IRL, instead, is guaranteed to satisfy the constraints of the optimization problem, $\forall i\in\{1,\dots,k\}:\ \big|\sum_{\tau\in\mathcal{T}} Pr(\tau)f_i^\tau - \hat{f}_i\big| \le \epsilon_i$.

These entropy-based approaches provide a powerful framework for solving the IRL problem.

Probabilistic methods  Finally, in the last category we consider probabilistic methods, such as the method proposed in [22], which models IRL problems from a Bayesian perspective, and Levine's GPIRL algorithm, based on Gaussian processes.

The first algorithm treats the actions of the expert as evidence used to update a prior on reward functions. Reward learning and apprenticeship learning are then solved using this posterior, and inference is performed with a modified Markov Chain Monte Carlo (MCMC) algorithm called Policy Walk.

The Policy Walk approach derives a posterior distribution for the rewards $R$, used by the agent $\mathcal{X}$, from a prior distribution $P(R)$ and a probabilistic model of the expert's actions given the reward function.

The main idea is to find the policy $\pi$ that minimizes the expected policy loss over the posterior distribution for $R$, where the loss is computed as

$$L^p_{policy}(R,\pi) = \|V^*(R) - V^\pi(R)\|_p,$$

$V^*(R)$ is the vector of optimal state values achieved by the optimal policy for $R$, and $p$ is some norm. Clearly, given a Markov Decision Problem $M = (S, A, T, \gamma, E_P(R))$, the expected loss

$$E\big[L^p_{policy}(\pi)\big] = E\big[\|V^*(R) - V^\pi(R)\|_p\big]$$

is minimized for all $p$ by the optimal policy $\pi^*$ of the MDP $M$.

Policy Walk starts by sampling an initial random reward vector $R$ from $P(R)$ and computes the optimal policy $\pi$ for the current reward vector.

At each step it picks a new reward vector $\tilde{R}$ uniformly at random from the neighbors of $R$, computes the Q-function $Q^\pi(s,a,\tilde{R})$, $\forall s \in S, a \in A$, of policy $\pi$ using $\tilde{R}$, and then checks whether the optimal policy changes when moving to the next reward vector $\tilde{R}$ in the chain, i.e., whether there exists a state-action pair for which $Q^\pi(s,\pi(s),\tilde{R}) < Q^\pi(s,a,\tilde{R})$. If this happens, the new optimal policy, usually only slightly different from the old one, is computed by just a few steps of policy iteration (see [25]) starting from the old policy $\pi$; Policy Walk then moves to the next step, and so on until an equilibrium is reached.
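A compact sketch of the Policy Walk loop is given below. The helpers `sample_neighbor`, `q_values`, `policy_improvement_step`, and the density `posterior` (proportional to $P(D\mid R)P(R)$) are hypothetical, and the acceptance step follows the usual Metropolis rule; this only illustrates the structure described above.

```python
import numpy as np

def policy_walk(posterior, sample_neighbor, q_values, policy_improvement_step,
                R0, pi0, n_samples=5000, rng=None):
    """Sketch of the Policy Walk MCMC chain over reward vectors."""
    if rng is None:
        rng = np.random.default_rng(0)
    R, pi = R0, pi0                 # pi is an array of greedy actions per state
    samples = []
    for _ in range(n_samples):
        R_new = sample_neighbor(R, rng)   # uniform move on the reward grid
        Q = q_values(pi, R_new)           # Q^pi(s, a, R_new), shape (n_states, n_actions)
        # if pi is no longer greedy w.r.t. Q, refine it with a few policy-iteration steps
        if (Q.max(axis=1) > Q[np.arange(len(Q)), pi]).any():
            pi_new = policy_improvement_step(pi, R_new)
        else:
            pi_new = pi
        # Metropolis acceptance on the posterior ratio
        if rng.random() < min(1.0, posterior(R_new, pi_new) / posterior(R, pi)):
            R, pi = R_new, pi_new
        samples.append(R)
    return samples
```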

This method has the advantage that it does not need a completely specified optimal policy as input to the IRL agent, and it can incorporate external information about specific IRL problems into the prior of the model, or use evidence from multiple experts.

The GPIRL algorithm extends the Gaussian process model to learn the reward as a non-linear function without assuming the expert to be optimal. The learned GP kernel hyper-parameters capture the structure of the reward, including the relevance of each feature to the expert's policy, and make it possible to recover the reward for the current state space and to predict the reward for any unseen state space within the domain of the features.

As said before, the non-linear reward function is modeled as a Gaussian process, and its structure is determined by its kernel function. The Bayesian GP framework provides a method for learning the hyper-parameters of this kernel, thereby learning the structure of the unknown reward, using the complete log likelihood of the data $D$ under the reward $r$:

$$\log Pr(D\mid r) = \sum_i\sum_t \log Pr(a_{i,t}\mid s_{i,t}) = \sum_i\sum_t \big(Q_r(s_{i,t},a_{i,t}) - V_r(s_{i,t})\big) \tag{3.18}$$

to specify a distribution over GP outputs and to learn both the output values and the kernel function.
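The inner term of Equation 3.18 is the log-probability of each demonstrated action under a soft, Boltzmann-like policy. The sketch below computes it on a tabular MDP, under the assumption (not stated explicitly above) that the value is the soft value function $V_r(s) = \log\sum_a \exp Q_r(s,a)$ and that $Q_r$ has been precomputed for the current reward sample.

```python
import numpy as np
from scipy.special import logsumexp

def demo_log_likelihood(Q_r, demos):
    """log Pr(D | r) = sum_{i,t} (Q_r(s_{i,t}, a_{i,t}) - V_r(s_{i,t})).

    Q_r   : (n_states, n_actions) action values under the sampled reward r
    demos : list of trajectories, each a list of (state, action) index pairs
    """
    V_r = logsumexp(Q_r, axis=1)   # soft value function (assumption)
    ll = 0.0
    for trajectory in demos:
        for s, a in trajectory:
            ll += Q_r[s, a] - V_r[s]
    return ll
```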

The algorithm directly learns the most likely values of $\omega$, which represent the rewards associated with the feature coordinates $\mu(\pi_\omega)$; these can be the feature values of all states or of a subset of states. The rewards of the states not included in this subset are inferred by the GP.

As well as learning $\omega$, GPIRL learns the kernel hyper-parameters $\theta$ in order to recover the structure of the reward, by maximizing their probability under the expert demonstrations $D$ and the feature coordinates $\mu(\pi_\omega)$:

$$Pr(\omega,\theta\mid D,\mu(\pi_\omega)) \propto Pr(D,\omega,\theta\mid\mu(\pi_\omega)) = \left[\int_r Pr(D\mid r)\,Pr(r\mid\omega,\theta,\mu(\pi_\omega))\,dr\right] Pr(\omega,\theta\mid\mu(\pi_\omega)) \tag{3.19}$$

where the log of $Pr(D\mid r)$ is given by Equation 3.18, the GP posterior $Pr(r\mid\omega,\theta,\mu(\pi_\omega))$ is the probability of a reward function under the current values of $\omega$ and $\theta$, and $Pr(\omega,\theta\mid\mu(\pi_\omega))$ is the prior probability of a particular assignment to $\omega$ and $\theta$.

Policy Walk has the following guarantee: if $R_{max} = O(1/N)$, then the distribution $P$ can be efficiently sampled (within error $\epsilon$) in $O(N^2\log(1/\epsilon))$ steps, where $N = |S|$ and $S$ is the state space of the MDP. GPIRL, instead, does not offer guarantees on the recovered reward, but in many experiments across different domains it provides a learned reward function that reflects the true expert's policy more accurately than methods such as Maximum Entropy, MMP, MWAL and the Projection Method.


Chapter 4

Revisiting the IRL framework: a multi-objective perspective

In this chapter we introduce our algorithm, describing the optimization problem we want to solve and how we model it. We then explain the concepts of multi-objective optimization and Pareto-optimal solutions, and illustrate how our IRL problem relates to a multi-objective optimization problem, also introducing the definitions of optimal and sub-optimal expert.

In this thesis we propose a new IRL algorithm based on a cutting-plane method: Cutting-Plane Inverse Reinforcement Learning (CPIRL). CPIRL belongs to the weight-based category described in Subsection 3.2.2, whose methods try to recover the reward function optimized by the expert by finding the weights $\omega_E$ that correspond to the reward function optimized by the expert's policy.

Our interest in IRL problems stems from the natural desire to explain the behaviors we observe in everyday life. Given a model, IRL helps us understand and study these behaviors by recovering one of their most representative descriptions: the reward function. The IRL problem can be modeled as an MDP without the reward function, MDP/R, defined as a 5-tuple $M = \langle S, A, T, \gamma, D\rangle$ where:

• S is the state space,
• A is the action space,
• T is the transition model,
• γ is the discount factor,
• D is the set of expert demonstrations.
