
POLITECNICO DI MILANO

School of Industrial and Information Engineering

Department of Electronics, Information and Bioengineering

Master of Science Degree in Computer Science and Engineering

CONFIGURABLE MARKOV DECISION PROCESSES

AI & R Lab

The Artificial Intelligence and Robotics Lab of the Politecnico di Milano

Supervisor: Prof. Marcello Restelli

Co-supervisor: Dott. Alberto Maria Metelli

Author: Mirco Mutti, 858222


“There are two motives for reading a book; one, that you enjoy it; the other, that you can boast about it.” Bertrand Russell


Abstract

Markov Decision Processes (MDPs) are a fundamental framework to model Sequential Decision Making problems. In this context, a learning agent interacts with an environment in order to reach a, potentially long-term, goal. Solving an MDP means finding a behaviour for the agent that leads to optimal performance, while a learning algorithm is a method to solve an MDP starting from an initial behaviour. The performance of the agent is determined by the feedback provided by the environment, in the form of a reward signal. In the traditional formulation, the environment is fixed, stationary, and out of the control of the agent. However, in many real-world problems, there is the possibility to configure, to a limited extent, some environmental parameters to improve the performance of the learning agent.

This thesis proposes a novel framework, Configurable Markov Decision Processes (CMDP), which extends the traditional MDP framework in order to meet the need for modelling those real-world scenarios.

Moreover, we provide a new learning algorithm, Safe Policy Model Iteration (SPMI), to deal with the learning problem in the CMDP context. To this purpose, SPMI jointly and adaptively optimizes the agent behaviour and the environment configuration, while guaranteeing a monotonic improvement of the performance.

After having introduced our approach and derived some related theoretical results, we develop two prototypical domains representing CMDP scenarios. Then, we present the experimental evaluation on these domains to show the benefits of environment configurability on the performance of the learned policy, and to highlight the effectiveness of our approach in solving a CMDP.


Extended Abstract

Sequential Decision Making problems are common in a wide variety of fields, from automation control to economics, as well as robotics, operations research, and many others. In this class of problems, a decision maker has to make a series of decisions at different moments in time in order to reach a long-term goal. However, the effects of the decisions are usually not immediate; rather, they propagate into the future in a way that is not necessarily deterministic. In the field of Artificial Intelligence, numerous approaches have been proposed to model and, possibly, solve problems of this nature.

Markov Decision Processes (MDPs) are a classical framework to model Sequential Decision Making problems. In this framework, an agent, representing the decision maker of the problem, interacts with an environment, which embodies all the relevant aspects of the context in which the agent operates. The time horizon of the decision process is divided into a discrete sequence of time steps. The current state summarizes the situation in which the agent has to make its next decision. At each step, the agent selects an action among the options available in the current state, that is, it makes a decision. As an effect of the action, it receives two important pieces of information from the environment: the observation of the next state and a quantity measuring, in some way, the profit deriving from the transition from the current state to the next one. In the vocabulary of MDPs, this quantity is called the immediate reward. The step-by-step behaviour of an agent within the process is described by a policy. The policy represents the, possibly probabilistic, strategy with which the agent makes a decision in every state of the environment. Naturally, the aim of the agent is to learn a policy, on the basis of the information collected in previous interactions with the environment, so as to maximize its profit. The latter is not evaluated in terms of immediate reward, though, but rather in terms of the cumulative reward, that is, the profit collected along the entire time horizon.

The MDP model essentially rests on two fundamental assumptions. The first one, called the Markov Property, requires that a transition to a next state depends only on the current state and on the action selected by the agent, and not on the history of the process, intended as the sequence of states and actions preceding the considered time step. The second one concerns the stationarity of the environment, whereby the probability of moving from a current state to a next state, depending on the selected action, remains constant over the entire time span of the process.

Solving an MDP means finding an optimal policy for the agent, that is, one maximizing the cumulative reward. In the field of Dynamic Programming, numerous approaches have been presented to solve learning problems in MDPs in which the dynamics of the environment is completely known to the decision maker. However, in a great many real scenarios the agent operates under bounded rationality, where a complete model of the environment is not available. This kind of problem has been addressed in the field of Reinforcement Learning, in which the search for the optimal policy relies only on experience, i.e. on the previous interactions with the environment, and on no knowledge of the model. To succeed in this endeavour, Reinforcement Learning algorithms usually favour the execution of exploratory actions, namely actions aimed not at collecting more profit but at acquiring information that is useful for the search of an optimal policy in the long run. However, in many cases exploratory actions cannot be performed arbitrarily, for instance in a scenario in which the current profit is important anyway, as is typical of an on-line learning problem, or when we want to prevent the execution of actions that may, in some way, damage the system.

In the last decades, the class of Safe Learning methods has received more and more attention. These methods aim to solve a learning problem while guaranteeing, at the same time, a monotonic growth of the performance. This goal is achieved by means of a conservative approach, which limits the execution of exploratory actions even at the cost of significantly slowing down the convergence to the optimal policy. In particular, the methods developed by Kakade and Langford [21] and Pirotta et al. [33] are based on the choice of the next policy that guarantees the maximum performance in the worst case.

Although MDPs have been remarkably successful in modelling Sequential Decision Making problems, there exists a set of highly relevant real-world problems that cannot be represented by the traditional formulation. Not even the subsequent evolutions, such as Non-stationary MDPs and MDPs with imprecise probabilities, come to the rescue. Indeed, in all these formalizations the model of the environment, even when no longer static and immutable, remains in any case outside the control of the agent or of any other entity within the decision process. On the contrary, in our opinion several real cases admit the alteration of some environmental parameter, with the aim of increasing the performance of the decision maker. Think, for instance, of a driver racing a single-seater: the vehicle is an integral part of the environment, and yet the suspension setup, the aerodynamic profile, the tyres, and many other parameters can be modified to obtain the best lap time.

In this thesis, to meet this modelling need, we propose a new framework, called Configurable Markov Decision Processes (CMDP), which extends the traditional MDP formulation to account for scenarios in which it is possible to exercise an, at least partial, control over the environment. In principle, any parameter of the environment could be configured. In this document, we focus on modifications of the state transition model with the aim of obtaining better performance for the agent. Clearly, the new framework requires revisiting some of the traditional formalizations, starting from the learning problem. While in the classical context the goal is to find an optimal policy, in the new context the optimal policy is not independent of the configuration of the model. In the same way, we can define the optimal model as a configuration of the environment that guarantees maximum performance to the current policy. Different models might induce different optimal policies and, symmetrically, a given policy might induce one optimal model rather than another. The goal of learning in a CMDP is, therefore, to find the best possible combination of model and policy, so as to obtain the maximum performance for the agent.

After the introduction of the new CMDP formalism, we present an algorithm, called Safe Policy Model Iteration (SPMI), to solve a learning problem in this new context. The algorithm adopts a Safe Learning approach, largely inspired by the work of Pirotta et al. [33], which is based on the choice of the next policy-model pair that maximizes the profit in the worst case, thereby obtaining a monotonic growth of the performance. Thanks to these characteristics, this method turns out to be particularly suitable for learning problems in critical scenarios, in which exploration must be limited, that can be represented through the CMDP formalism.

As a final contribution, we propose an experimental analysis on two illustrative domains, which succeeds in showing the added value, provided by the possibility of configuring the model, on the performance of the policy. Furthermore, the experiments highlight the potential of the proposed learning approach in the joint and adaptive management of policy and model.

This thesis does not claim in any way to provide a conclusive contribution to the field of configurable environments. On the contrary, we hope that our work will inspire numerous future developments aimed at improving the learning approach we suggest and at presenting new ones, extending the modelling capabilities of the CMDP formalism.


Acknowledgements

A first thank you, as dutiful as it is heartfelt, can only go to Prof. Marcello Restelli and Dott. Alberto Maria Metelli. With their patient and knowledgeable guidance they introduced me to the world of research; with their constant presence, and a series of precious suggestions and sharp solutions, they effectively made the writing of this thesis possible.

Secondly, I would like to express my greatest thanks, and all my gratitude, to my family; with their silent and continuous support they have sustained me unceasingly, contributing in a way that cannot be quantified, not only to this thesis, but to every milestone reached along the way.

I would like to thank all my friends who, far less silently, have always helped me in a decisive way, creating around me the relaxed and serene environment that is at the basis of any good work.

Last, but certainly not least, I would like to thank my coursemates who, from the very first day, managed to make my whole university experience mysteriously fun.

Mirco
Milano, April 19, 2018


Contents

Abstract
Extended Abstract
Acknowledgements
Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction

2 Markov Decision Processes and Sequential Decision Making
2.1 Markov Decision Processes
2.1.1 Definition of the model
2.1.2 Policies and Markov Reward Processes
2.1.3 State occupancy and distribution
2.1.4 Performance of a policy
2.1.5 Value functions
2.2 Methods to solve MDPs
2.2.1 Policy Iteration
2.2.2 Value Iteration
2.2.3 Policy Search
2.3 Framework extensions
2.3.1 Non-stationary MDPs
2.3.2 MDP with imprecise probabilities

3 Safe Learning
3.1 The need for Safe Approaches

3.2 Conservative Policy Iteration
3.3 Safe Policy Iteration
3.3.1 Theoretical foundations

3.3.2 The algorithms
3.3.3 Other comparable approaches
3.4 Safe Gradient approaches
3.5 An alternative safety

4 Configurable Markov Decision Processes
4.1 Introduction and motivations
4.2 Framework formulation
4.3 Policy and model spaces
4.3.1 Unconstrained
4.3.2 Constrained
4.3.3 Parametric

5 A Safe Approach for CMDPs
5.1 Theoretical foundations
5.1.1 Bounds on state distributions
5.1.2 Bounds on performance improvements
5.2 Safe Policy Iteration and Safe Model Iteration
5.2.1 Safe Policy Iteration
5.2.2 Safe Model Iteration
5.3 Safe Policy Model Iteration
5.3.1 The algorithm
5.3.2 Target choice
5.3.3 Update strategies
5.3.4 Sequential approaches
5.4 Theoretical analysis
5.4.1 Finite model target set
5.4.2 Analogy with Policy Gradient

6 Experimental Evaluation
6.1 The Student-Teacher domain
6.1.1 Environment description
6.1.2 Experiments
6.2 The Racetrack simulator
6.2.1 Environment description
6.2.2 Experiments

7 Discussion and Conclusions

Bibliography

A Proofs and Derivations

C Examples of CMDPs
C.1 An example of CMDP with local optima

List of Figures

2.1 The framework for Sequential Decision Making (from [41]).
2.2 Policy Iteration (from [41]).
2.3 A Policy Search taxonomy (from [11]).
3.1 J^π_µ as a function of iterations for USPI, MSPI, CPI, Policy Iteration (PI), Natural Policy Gradient (NPG). The experiment domain is a Chain-Walk with 50 states. The plot is taken from [33].
3.2 J^π_µ as a function of iterations for the approximate versions of USPI, MSPI, CPI, PI. The experiment domain is a Blackjack game. The plot is taken from [33].
3.3 Comparison of Diverse Exploration (DE) and Safe Policy Improvement (SPI) on the Grid World domain, as presented in [10].
6.1 α, β coefficients for the Student-Teacher domain 2-1-1-2 for different update strategies.
6.2 Policy and model advantages for the Student-Teacher domain 2-1-1-2 for different update strategies.
6.3 Expected return and bound values for the Student-Teacher domain 2-1-1-2 for different update strategies and target choices.
6.4 Policy dissimilarity and target policy changes for the Student-Teacher domain 2-1-1-2 for different target choices.
6.5 Expected return for the Student-Teacher domains 2-1-1-2 (left) and 2-3-1-2 (right) for different update strategies.
6.6 Several statistics of the persistent (greedy) target choice for different update strategies in the case of Student-Teacher domain 2-1-1-2.
6.7 Several statistics of the greedy target choice for different update strategies in the case of Student-Teacher domain 2-1-1-2.
6.8 Several statistics of the persistent (greedy) target choice for different update strategies and sequential approaches in the case of Student-Teacher domain 2-3-1-2.
6.10 Basic representation of the tracks used in the Racetrack Simulator. From left to right: “twostart”, “twogoal”, “obstacle” and “straight” just below. Each position has a type label: 1 for initial states, 2 for goal states, 4 for walls, 5 for roadtracks.
6.11 Expected return and coefficient of the high speed stability vertex model for different update strategies, different target choices and sequential approaches in the Racetrack simulator (“twostart” track with two vertex models).
6.12 Expected return for different update strategies in the Racetrack simulator (two vertex models), “twogoal” track on the left, “obstacle” track on the right.
6.13 Expected return for different update strategies and target choices in the Racetrack simulator (“straight” track with four vertex models).
6.14 Coefficients of the different vertex models for different update strategies in the Racetrack simulator (“straight” track with four vertex models).
6.15 Several statistics of different update strategies and target choices in the Racetrack simulator considering vehicle stability only (“twostart” track with two vertex models).
6.16 Several statistics of different update strategies in the Racetrack simulator considering vehicle stability and engine setting (“straight” track with four vertex models).
C.1 An example of CMDP with local maxima.
C.2 An example of CMDP with mixed optimal model.
C.3 The state value function of the CMDP in Figure C.2 as a function of the parameter.
C.4 The state value function of states A and B (the only ones varying with the parameter) of the CMDP in Figure C.2. In green plain line the Pareto frontier.


List of Tables

5.1 The four possible optimal (α, β) pairs; the optimal pair is the one yielding the maximum bound value (all values are clipped in [0, 1]).
5.2 The values of the Decoupled Bound corresponding to each possible optimal (α, β) pair.
6.1 Additional information and hyper-parameters used for the different problem settings of the Student-Teacher domain.
6.2 Number of steps to convergence for the update strategies in different problem settings of the Student-Teacher domain. In bold the best algorithm and underlined the second best. The run has been stopped after 50000 iterations.
6.3 Additional information and hyper-parameters used for the different problem settings of the Racetrack simulator. In the cases with two vertices only stability vehicle configurations are considered.


List of Algorithms

1 Policy Iteration
2 Value Iteration
3 General Policy Gradient
4 Conservative Policy Iteration
5 Unique-parameter Safe Policy Improvement
6 Safe Policy Iteration
7 Safe Model Iteration


Chapter 1

Introduction

In many real-world problems, a decision-maker has to make a sequence of decisions to achieve a long-term goal. The effects of these decisions can be probabilistic and delayed in time. This type of problem, known as a Sequential Decision Making problem, is common in a variety of fields, such as automation control, robotics, operational research, and economics. Artificial Intelligence is a classical approach to address this kind of problem.

Markov Decision Processes (MDPs) are a fundamental framework to model Sequential Decision Making problems. In this context, a learning agent, i.e. the decision-maker, interacts with an environment receiving a feedback, in the form of a reward signal. The current state of the environment describes the situation in which the agent has to make its next decision. At each time step, the agent performs a suitable action from the current state. Then, it receives a payoff, or reward, and it observes the next state from the environment. The main assumption of the MDP framework is that the next state depends only on the current state and the taken action, and not on the history of the process (Markov Property). The agent behaviour, named policy, is the strategy that the agent employs in order to select an action in any state of the process. The agent aims to learn, by processing the information of a sequence of interactions with the environment, an optimal behaviour. Nonetheless, the utility of an action is not measured in terms of immediate payoff, but in terms of the cumulative reward collected during the entire time horizon.

Solving a problem in the MDP formulation means finding the above-mentioned optimal behaviour for the agent. The Dynamic Programming theory addresses the learning problem in an MDP by exploiting the full knowledge of the dynamics of the environment. However, in many real-world cases, a complete model of the environment is not available. Reinforcement Learning faces a learning problem in these situations, in which an optimal policy has to be recovered from the samples only, without any initial information on the dynamics of the environment. To this purpose, reinforcement learning algorithms usually perform exploratory actions, i.e. actions not intended to improve the actual performance but to gather information in order to target an optimal policy in the long term. Nevertheless, in some real-world cases we cannot perform arbitrarily free exploration. For instance, in on-line scenarios we are not allowed to give up much actual performance, and, in critical applications, we must avoid dangerous behaviours.

In the last decades, among the other methods, safe learning approaches have received more and more attention. Their purpose is to develop learning methods that guarantee, with high probability, a monotonic improvement of performance, even at the cost of decreasing the convergence rate to the optimal policy. Notably, Kakade and Langford [21] and Pirotta et al. [33] have presented a conservative approach in which, at each step of the learning process, the next policy is selected as the one optimizing a lower bound on future performance.

MDPs have been really successful in modelling Sequential Decision Making problems. Nevertheless, there exists a relevant set of real-world scenarios that are not covered by the MDP framework or any of its subsequent evolutions, such as Non-stationary MDPs and MDPs with imprecise probabilities. Indeed, in the traditional formulation, the dynamics of the environment are fixed and immutable during the process. Several real-world scenarios, instead, admit the possibility to dynamically alter some environmental parameters in order to improve the agent performance. For instance, a human car driver has at her/his disposal a number of possible vehicle configurations she/he can act on (e.g. seasonal tires, stability and vehicle attitude, engine model, automatic speed control, parking aid system) to improve the driving style or quicken the process of learning a good driving policy.

In this thesis we propose a novel framework, Configurable Markov Decision Processes (CMDP), extending the traditional MDP framework in order to model those real-world scenarios in which the environment is, at least partially, controllable. In principle, any of the CMDP's parameters can be tuned, but we restrict our attention to the transition model and we focus on the problem of identifying the environment that allows achieving the highest possible performance. To this purpose, we formalize a tailored learning problem for the new context. At an intuitive level, there exists a tight connection between environment and policy: variations of the environment induce modifications of the optimal policy. Furthermore, even for the same task, in the presence of agents having access to different policy spaces, the optimal environment might be different. Therefore, solving a CMDP means finding an optimal combination of policy and environment configuration, i.e. a combination leading to a maximal performance of the agent.

After the introduction of the CMDP framework, we present a learning algorithm, Safe Policy Model Iteration (SPMI), that jointly and adaptively optimizes the policy and the transition model in order to solve a learning problem in the new context. The algorithm adopts a safe approach, inspired by the work of Pirotta et al. [33], based on the maximization of a lower bound on the performance improvement. In this way, SPMI generates a sequence of policy-model pairs with monotonically increasing performance during the learning process, making our approach suitable for on-line and critical applications.

Having formalized the framework and a suitable learning approach, we provide an empirical evaluation in order to show the benefit of configuring the environment to improve the policy performance and, at the same time, to highlight the effectiveness of our approach in solving learning problems in the CMDP formulation. To this purpose, we develop two prototypical domains, representing basic abstractions of illustrative real-world scenarios that can be modelled by means of the CMDP framework, on which we base our experiments.

The contents of this thesis are organized as follows. In Chapter 2, we provide an overview of Markov Decision Processes. We present the framework definition and the common approaches to solve an MDP, along with two subsequent extensions of the traditional framework in which the environment is no longer fixed. Chapter 3 is dedicated to the state of the art of Safe Learning approaches. Especially, we present in detail the theoretical foundations and the corresponding methods that are at the basis of our learning approach. In Chapter 4, after a focus on the motivations underlying the framework proposal, we define Configurable Markov Decision Processes and we provide the corresponding formalizations. In Chapter 5, we first extend the theoretical results presented in Chapter 3 to take into account the model configurability. We present a learning algorithm, Safe Policy Model Iteration, exploiting these results to solve learning problems in the CMDP context. Then, we analyse some relevant theoretical aspects related to SPMI. Chapter 6 provides an empirical evaluation of our learning approach, along with a detailed description of the two environments on which the experiments are performed. In Appendix A, we report the proofs and derivations we have omitted in the text. Appendix B provides a theoretical insight on the dissimilarity measures. In Appendix C, we present two examples of CMDPs with a significant behaviour.


Chapter 2

Markov Decision Processes and Sequential Decision Making

In this chapter, we will first introduce Markov Decision Processes (Section 2.1), a fundamental framework which will be essential for the discussions reported in this document. Then, we will present a partial collection of approaches to solve Sequential Decision Making problems based on the aforementioned framework (Section 2.2), namely Policy Iteration, Value Iteration, and Policy Search. Finally, we will briefly consider some extensions of the traditional framework as presented in the literature (Section 2.3).

2.1 Markov Decision Processes

A Markov Decision Process (MDP) is a framework to effectively model Sequential Decision Making problems in which an agent interacts with an environment in order to reach a, potentially long-term, goal. In this document, we limit our perspective to discrete-time MDPs. At each time step, the agent receives information on the current state of the environment. Then, considering this information, it has to select an action from a set of applicable ones. Finally, the agent receives a payoff from the environment according to the current state and the selected action. This payoff, called reward, represents in a way the progress of the agent toward the goal accomplishment in the current time step. It is noteworthy that in the presented context the agent is not interested in maximizing the immediate reward, but in collecting as much reward as possible during the entire time horizon. Therefore, it may decide to give up an immediate reward in order to gather more long-term reward.


Figure 2.1: The framework for Sequential Decision Making (from [41]).

2.1.1 Definition of the model

There exist several definitions of MDP in the literature, such as [41], [35], [43]. For our purpose, we define a Markov Decision Process as a tuple ⟨S, A, P, R, µ, γ, H⟩, where:

• S is a non-empty set of states, named state space;

• A is a non-empty set of actions, named action space;

• P is a function P : S × A → ∆(S), named transition model, where P(·|s, a) represents the probability distribution over the next state s′ when the chosen action is a and the current state is s;

• R is a function R : S × A → ∆(ℝ), named reward model, where R(s, a), named reward function, is the expected value of the reward received by taking action a in state s;

• µ ∈ ∆(S) is the initial state distribution;

• γ ∈ [0, 1] is the discount factor for future rewards;

• H is the time horizon, i.e. the maximal length, potentially infinite, of a trajectory.

A trajectory is a sequence ⟨s_t, a_t, r_t⟩_{t=0,...,H} in which s_t, a_t, r_t represent the current state, the action taken, and the reward collected at time step t. For notation completeness, it is worthwhile to formalize also the state-action space SA, which is the Cartesian product between the state space and the action space.
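To make the definition concrete, the following minimal sketch shows how a small, finite, discrete-time MDP can be represented in code. The class name and the toy two-state, two-action numbers are assumptions made for this example, not objects defined in the thesis; only the tuple ⟨S, A, P, R, µ, γ, H⟩ mirrors the definition above.

    import numpy as np

    class TabularMDP:
        """A finite MDP stored as the tuple <S, A, P, R, mu, gamma, H>."""

        def __init__(self, P, R, mu, gamma, horizon=np.inf):
            # P[s, a, s'] = probability of reaching s' by taking a in s
            # R[s, a]     = expected immediate reward of taking a in s
            self.P, self.R, self.mu, self.gamma, self.horizon = P, R, mu, gamma, horizon
            self.n_states, self.n_actions, _ = P.shape

    # A toy 2-state, 2-action instance (numbers are arbitrary).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.8, 0.2], [0.1, 0.9]]])
    R = np.array([[0.0, 1.0],
                  [0.5, 2.0]])
    mdp = TabularMDP(P, R, mu=np.array([1.0, 0.0]), gamma=0.95)

This toy instance is reused in the other example sketches of this chapter.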

In the traditional formulation, the dynamics of the environment, described by the transition model, have to satisfy two main properties:

• Markov property: the next state s′ depends only on the current state s and the chosen action a;


• Stationarity: the dynamics of the environment does not change over time.

While the first property, which states that the transition probabilities do not depend on the history of the process but only on the current state and action, represents a fundamental feature of MDPs, we will see in Section 2.3 that non-stationary transition models have been considered in the literature.

2.1.2 Policies and Markov Reward Processes

The behaviour of the agent in an MDP is modeled by a stationary, stochastic policy:

π : S → ∆(A),

where π(·|s) is a probability distribution over A when the current state is s and π(a|s) is the probability density of action a in state s. An agent which follows the policy π selects the action a in state s by drawing a sample from the aforementioned conditional distribution. A policy is said to be deterministic if for each s ∈ S there exists an a ∈ A such that π(a|s) = 1. In many applications, the agent is inherently limited in how it can interact with the environment. For instance, it may not be able to take a particular action even if it is included in the action space. Moreover, an action available in a given state may not be applicable in a different state. These constraints are formalized by explicitly defining a set of suitable policies Π, named policy space.

Any Markov Decision Process paired with a policy π induces a Markov Reward Process (MRP) defined as a tuple ⟨S, P^π, R^π, µ, γ⟩, where:

• S, µ, γ are state space, initial state distribution and discount factor of a Markov Decision Process;

• P^π is a probability measure over S, named kernel function, obtained by marginalizing the transition model P of the MDP over the actions: $P^\pi(s'|s) = \int_{\mathcal{A}} \pi(a|s)\, P(s'|s, a)\, da$;

• R^π is a probability measure over real numbers, conditioned on S, with $R^\pi(s) = \int_{\mathcal{A}} \pi(a|s)\, R(s, a)\, da$.

The MRP formulation conceals the notion of action by coupling the effects of the transition model and the policy. Therefore, MRPs are more suitable to describe uncontrollable processes, such as a stochastic environment producing a reward at each time step, while an MDP formulation is preferable when the explicit intervention of the agent is inherent in the process.
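As a small illustration of the MRP induced by a policy, the sketch below marginalizes the transition and reward models of the toy TabularMDP defined earlier over a stochastic policy. The function name and the uniform policy are assumptions of this example.

    import numpy as np

    def induced_mrp(mdp, policy):
        """Return the kernel P_pi[s, s'] and reward R_pi[s] of the MRP induced by `policy`.

        `policy[s, a]` is the probability of taking action a in state s.
        """
        # P_pi(s'|s) = sum_a policy(a|s) P(s'|s, a)
        P_pi = np.einsum("sa,sax->sx", policy, mdp.P)
        # R_pi(s) = sum_a policy(a|s) R(s, a)
        R_pi = np.einsum("sa,sa->s", policy, mdp.R)
        return P_pi, R_pi

    uniform = np.full((mdp.n_states, mdp.n_actions), 0.5)  # uniform policy over 2 actions
    P_pi, R_pi = induced_mrp(mdp, uniform)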


2.1.3 State occupancy and distribution

Following a policy on an MDP clearly induces a probability distribution over the state space, i.e. an agent will encounter a state s at the t-th time step of the process with a certain probability. For our purposes, recalling the formulations presented in [42], we formalize this concept by defining the γ-discounted future state occupancy as:
$$\mathring{d}^{\pi}_{\mu}(s) = \sum_{t=0}^{+\infty} \gamma^{t}\, \Pr(s_t = s \mid \mu, \pi, P), \qquad \forall s \in \mathcal{S},$$
where $\Pr(s_t = s \mid \mu, \pi, P)$ is the probability that the trajectory turns up in state s at the t-th time step, given the initial state distribution µ, the policy π and the transition model P. We can recursively redefine the state occupancy as follows:
$$\mathring{d}^{\pi}_{\mu}(s) = \mu(s) + \gamma \int_{\mathcal{S}} \mathring{d}^{\pi}_{\mu}(s') \int_{\mathcal{A}} \pi(a|s')\, P(s|s', a)\, da\, ds'.$$

According to the aforementioned formulations, the state occupancy is not, formally, a probability distribution, since its sum over the state space is not 1. Therefore, it is worthwhile to define a γ-discounted future state distribution, which can be easily obtained by normalizing the state occupancy:
$$d^{\pi}_{\mu}(s) = (1-\gamma)\, \mathring{d}^{\pi}_{\mu}(s), \qquad \forall s \in \mathcal{S}.$$
Straightforwardly, we state the recursive formulation for the state distribution as:
$$d^{\pi}_{\mu}(s) = (1-\gamma)\mu(s) + \gamma \int_{\mathcal{S}} d^{\pi}_{\mu}(s') \int_{\mathcal{A}} \pi(a|s')\, P(s|s', a)\, da\, ds'. \qquad (2.1)$$
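In the finite case, the recursion (2.1) is a linear system in d^π_µ and can be solved directly. The sketch below does so for the toy MDP and uniform policy introduced above; the function name is an assumption of this example.

    import numpy as np

    def discounted_state_distribution(P_pi, mu, gamma):
        """Solve d = (1 - gamma) * mu + gamma * P_pi^T d for a finite MRP kernel P_pi."""
        n = P_pi.shape[0]
        # (I - gamma * P_pi^T) d = (1 - gamma) * mu
        d = np.linalg.solve(np.eye(n) - gamma * P_pi.T, (1.0 - gamma) * mu)
        return d

    d_pi = discounted_state_distribution(P_pi, mdp.mu, mdp.gamma)
    assert np.isclose(d_pi.sum(), 1.0)  # the normalized occupancy is a proper distribution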

2.1.4 Performance of a policy

The goal of an agent in an MDP can be restated as finding an optimal policy, w.r.t. some optimality criterion. Therefore, we aim to define a function to properly order the policies according to this criterion. The general idea is that a policy π1 is better than a policy π2 if a trajectory induced by π1 collects more reward, on average, than a trajectory generated by π2. However, a reward gathered in the current time step may be worth more, from the agent's perspective, than an equal reward gathered far in the future. Thus, future rewards are usually discounted, in order to consider the temporal distance w.r.t. the current time step along with the absolute reward value.

There exist several approaches in the literature [35]; a common one is to measure the value of a trajectory in terms of the γ-discounted cumulative reward, also called return, defined as:
$$v = \sum_{t=0}^{H} \gamma^{t} r_{t},$$
where H is the time horizon of the process, i.e. the maximal length of a trajectory, and $r_t$ is the reward collected at time step t. Then, we define the value of a policy π by taking the expected return of the trajectories induced by π under the initial state distribution:
$$J^{\pi}_{\mu} = \mathbb{E}_{\mu,\pi}[v].$$

In the following, we often refer to this measure as performance, since it basically represents the overall quality of the agent's decisions. We can rewrite the expected return exploiting the state distribution as follows:
$$J^{\pi}_{\mu} = \frac{1}{1-\gamma} \int_{\mathcal{S}} d^{\pi}_{\mu}(s) \int_{\mathcal{A}} \pi(a|s)\, R(s, a)\, da\, ds.$$
Having defined a performance measure for a policy, we can qualify the optimal policy π* as:
$$\pi^{*} \in \arg\max_{\pi \in \Pi} J^{\pi}_{\mu},$$
that is, the one maximizing the performance in the policy space.
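Continuing the toy example (and reusing mdp, uniform, and d_pi from the previous sketches), the expected return can be computed exactly from the state distribution as in the formula above, and cross-checked with a Monte Carlo estimate over sampled trajectories. The rollout length and number of episodes are arbitrary choices made for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    # Exact performance: J = 1/(1-gamma) * sum_s d(s) * sum_a pi(a|s) R(s, a)
    J_exact = (d_pi @ np.einsum("sa,sa->s", uniform, mdp.R)) / (1.0 - mdp.gamma)

    def rollout_return(mdp, policy, max_steps=200):
        """Discounted return of one sampled trajectory (truncated at max_steps)."""
        s = rng.choice(mdp.n_states, p=mdp.mu)
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = rng.choice(mdp.n_actions, p=policy[s])
            ret += discount * mdp.R[s, a]
            discount *= mdp.gamma
            s = rng.choice(mdp.n_states, p=mdp.P[s, a])
        return ret

    # Monte Carlo estimate of the expected return; it should be close to J_exact.
    J_mc = np.mean([rollout_return(mdp, uniform) for _ in range(2000)])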

2.1.5 Value functions

It is valuable for the agent to measure the utility of being in a state, or of selecting an action in a given state, in order to guide its search toward the optimal policy: value functions provide such utility measures. It is rather clear that the value of a state, or state-action pair, strictly depends on the current policy, i.e. on the decisions the agent will make from the current state onwards. Then, we define a state value function V^π : S → ℝ under a policy π by the following recursive equation (named Bellman expectation equation [4]):
$$V^{\pi}(s) = \int_{\mathcal{A}} \pi(a|s) \left[ R(s, a) + \gamma \int_{\mathcal{S}} P(s'|s, a)\, V^{\pi}(s')\, ds' \right] da, \qquad (2.2)$$

where the function V^π assigns to each state the expected return that is obtained from the state s by following the policy π. The state value function allows us to rewrite the performance of a policy π as:
$$J^{\pi}_{\mu} = \int_{\mathcal{S}} \mu(s)\, V^{\pi}(s)\, ds.$$

It is worth noting that an alternative optimality criterion for a policy can be derived from the concept of value function. Indeed, an optimal policy can be defined as a policy that maximizes the value function in each state simultaneously, i.e. $\pi^{*} \in \arg\max_{\pi} V^{\pi}(s)$ for all s ∈ S. While this optimality condition is stricter than the one presented in the previous section, it does not fit well cases with a limited policy space, in which an optimal policy thus defined may not exist.

Since the value function does not provide any information about the utility of selecting an action in a given state, it does not support the agent's decisions. Then, for control purposes, it is more useful to define a state-action value function Q^π : S × A → ℝ, which assigns to each state-action pair the expected return that is obtained by taking action a from state s and, then, following the policy π. As for the state value function, also the state-action value function can be defined by a recursive equation:
$$Q^{\pi}(s, a) = R(s, a) + \gamma \int_{\mathcal{S}} P(s'|s, a) \int_{\mathcal{A}} \pi(a'|s')\, Q^{\pi}(s', a')\, da'\, ds'. \qquad (2.3)$$
Beyond the introduction of the state and state-action value functions, we now consider a third interesting measure, known as the advantage function, which will be rather significant in this document:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s). \qquad (2.4)$$

The advantage function effectively represents the benefit of selecting an action a in the state s instead of drawing an action from the policy π. Furthermore, it allows us to define a relative advantage function, which measures the utility of drawing an action from a policy π′, w.r.t. the current policy π, in a state s:
$$A^{\pi'}_{\pi}(s) = \int_{\mathcal{A}} \pi'(a|s)\, A^{\pi}(s, a)\, da.$$
Averaging the relative advantage function according to the state distribution, we obtain a measure of the ability of the policy π′ to select advantageous actions in comparison to the policy π:
$$A^{\pi'}_{\pi,\mu} = \int_{\mathcal{S}} d^{\pi}_{\mu}(s)\, A^{\pi'}_{\pi}(s)\, ds,$$
which is called the expected relative advantage function.
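For a finite MDP, the Bellman expectation equation (2.2) is again a linear system, so V^π, Q^π, and the advantage A^π can be computed in closed form. The sketch below does this for the running toy example; the function name is an assumption of this example.

    import numpy as np

    def value_functions(mdp, policy):
        """Return V^pi, Q^pi and the advantage A^pi for a finite MDP and policy."""
        P_pi = np.einsum("sa,sax->sx", policy, mdp.P)   # kernel of the induced MRP
        R_pi = np.einsum("sa,sa->s", policy, mdp.R)     # expected reward under pi
        # V = R_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = R_pi
        V = np.linalg.solve(np.eye(mdp.n_states) - mdp.gamma * P_pi, R_pi)
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = mdp.R + mdp.gamma * mdp.P @ V
        A = Q - V[:, None]                              # advantage, equation (2.4)
        return V, Q, A

    V, Q, A = value_functions(mdp, uniform)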

2.2 Methods to solve MDPs

Solving an MDP means finding an optimal policy, i.e. finding an optimal behaviour to interact with the environment. How to approach this fundamental task highly depends on the particular setting of the learning problem. A first crucial distinction is based on the degree of knowledge about the environment:

• Model-based learning: the agent has at its disposal a full model of the environment, which can be either inherently available or estimated;


• Model-free learning: the agent learns without an explicit model of the environment.

When the agent has a full representation of the environment, Dynamic Programming (DP) algorithms can be exploited to find the optimal policy, while Reinforcement Learning (RL) theory deals with Model-free learning problems. RL methods, essentially, provide sample-based versions of the DP algorithms to effectively estimate the environment behaviour from experience. How the experience is presented to the agent also induces a dichotomy in the learning problems:

• Infinite-horizon learning task: the experience is composed of a single, infinite, trajectory;

• Episodic learning task: the experience is divided into finite segments, called episodes.

Moreover, we can easily distinguish between:

• On-line learning: the agent learns while the experience is collected;

• Off-line learning: the agent starts learning when all the experience has been collected.

Another significant dichotomy is based on the policy used to collect the experience:

• On-policy learning: the agent directly alters the policy employed to interact with the environment while the policy learning goes on;

• Off-policy learning: the agent tries to learn an optimal policy while the experience is collected under another policy.

After this partial presentation of the dichotomies that qualify a learning problem, we briefly discuss the main learning approaches to solve an MDP in the following sections.

2.2.1 Policy Iteration

Policy Iteration, as presented in [20], is an approach based on alternating policy evaluation and policy improvement phases. A policy iteration algorithm usually starts from an arbitrary policy π^(0). Then, it generates a sequence of policies ⟨π^(t)⟩_{t=0,...,N} that is guaranteed to converge to an optimal policy in a finite number of steps, say N, if the state-action space is finite [35]. In the evaluation phase, the algorithm evaluates the current policy π^(t) by means of the state-action value function Q^{π^(t)}. This can be done by performing iterative applications of equation (2.3). Then, in the improvement phase, the algorithm aims to improve the current policy considering the local information provided by the state-action value function. A common choice to make such an improvement is to select as π^(t+1) the greedy policy π⁺, that is, a deterministic policy obtained by taking in each state the action that maximizes the value function:
$$\pi^{+}(s) \in \arg\max_{a \in \mathcal{A}} Q^{\pi}(s, a), \qquad \forall s \in \mathcal{S}.$$

Figure 2.2: Policy Iteration (from [41]).

Algorithm 1 Policy Iteration.
  initialize policy π^(0) arbitrarily
  for t = 0, 1, 2, ... until convergence do
    initialize Q^{π^(t)}_0(s, a) arbitrarily
    for i = 0, 1, 2, ... until convergence do
      ∀s ∈ S, ∀a ∈ A, compute:
        Q^{π^(t)}_{i+1}(s, a) = R(s, a) + γ ∫_S P(s′|s, a) ∫_A π^(t)(a′|s′) Q^{π^(t)}_i(s′, a′) da′ ds′
    end for
    π^(t+1)(s) = arg max_{a∈A} Q^{π^(t)}_i(s, a), ∀s ∈ S
  end for
  return optimal policy π^(t)

Although we have decided to present a Policy Iteration based on the state-action value function, a quite similar algorithm based on the state value function can be derived. Several approaches have been studied to extend the Policy Iteration paradigm to the Model-free learning case, such as Monte Carlo Control and SARSA [41].
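A compact Python version of Algorithm 1 for the tabular case is sketched below, reusing the value_functions helper defined in the earlier example; the convergence test on the greedy policy is an implementation choice of this sketch, not a prescription of the thesis.

    import numpy as np

    def policy_iteration(mdp, max_iters=1000):
        """Tabular policy iteration: alternate exact evaluation and greedy improvement."""
        policy = np.full((mdp.n_states, mdp.n_actions), 1.0 / mdp.n_actions)
        for _ in range(max_iters):
            _, Q, _ = value_functions(mdp, policy)      # evaluation (solves (2.3) exactly)
            greedy = np.zeros_like(policy)
            greedy[np.arange(mdp.n_states), Q.argmax(axis=1)] = 1.0
            if np.allclose(greedy, policy):             # greedy policy is stable: optimal
                break
            policy = greedy                             # improvement step
        return policy

    pi_star = policy_iteration(mdp)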

2.2.2 Value Iteration

Value Iteration, as presented in [4], solves the MDP by directly computing the optimal value function, i.e. the value function V^{π*} induced by an optimal policy. The general idea underlying the Value Iteration approach is to merge the evaluation and improvement steps of Policy Iteration into a single operation. In particular, a Value Iteration algorithm consists of iterative updates of the initial value function V^(0) exploiting the Bellman optimality equation [4]:
$$V^{(t+1)}(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \int_{\mathcal{S}} P(s'|s, a)\, V^{(t)}(s')\, ds' \right],$$
which can be derived from the Bellman expectation equation (2.2) by taking the maximum over actions instead of weighting with the current policy. When the aforementioned procedure ends, i.e. V^{π*} is obtained, the optimal policy is computed as the greedy policy induced by the optimal value function. While the asymptotic convergence to the optimal value function is guaranteed [4], the algorithm presents a significant difference w.r.t. Policy Iteration. Focusing on value functions only, it completely avoids intermediate policy representations. Moreover, the intermediate value functions V^(t) may not correspond to any policy. As for Policy Iteration, also the Value Iteration approach has been extended to deal with Model-free cases. A rather famous instance is Q-learning, which is a sample-based version of Value Iteration capable of learning off-policy [46].

Algorithm 2 Value Iteration.
  initialize policy π^(0) arbitrarily
  initialize value V^(0) arbitrarily
  for t = 0, 1, 2, ... until convergence do
    V^(t+1)(s) = max_{a∈A} [ R(s, a) + γ ∫_S P(s′|s, a) V^(t)(s′) ds′ ], ∀s ∈ S
  end for
  Q^{π*}(s, a) = R(s, a) + γ ∫_S P(s′|s, a) V^(t)(s′) ds′, ∀s ∈ S, ∀a ∈ A
  π*(s) = arg max_{a∈A} Q^{π*}(s, a), ∀s ∈ S
  return optimal policy π*
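The corresponding tabular sketch of Algorithm 2, again on the running toy MDP; the stopping tolerance is an arbitrary choice of this example.

    import numpy as np

    def value_iteration(mdp, tol=1e-8, max_iters=10_000):
        """Tabular value iteration via repeated application of the Bellman optimality backup."""
        V = np.zeros(mdp.n_states)
        for _ in range(max_iters):
            Q = mdp.R + mdp.gamma * mdp.P @ V       # one-step lookahead for every (s, a)
            V_new = Q.max(axis=1)                   # Bellman optimality backup
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        greedy = np.zeros((mdp.n_states, mdp.n_actions))
        greedy[np.arange(mdp.n_states), Q.argmax(axis=1)] = 1.0
        return V, greedy

    V_star, pi_greedy = value_iteration(mdp)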

2.2.3 Policy Search

While Policy Iteration and Value Iteration focus on the value functions to guide the recovery of an optimal policy, Policy Search methods consist of a direct exploration of the policy space in order to find an optimal policy. This may be rather beneficial in scenarios where either the state space or the action space is continuous, making a direct computation of the value functions impractical. Taking inspiration from [11], we can classify Policy Search methods into three broad categories: Expectation Maximization approaches, as presented in [6], Information-Theoretic approaches, such as REPS [30], and Policy Gradient approaches. The third category is the one we are going to focus on a little more, since an introductory comprehension of this approach will become useful in the following chapters.

Figure 2.3: A Policy Search taxonomy (from [11]).

Policy Gradient methods, introduced by [48], exploit a standard gradient optimization in the context of Policy Search. In this scenario, the search for an optimal policy is restricted to a class of parametrized stochastic policies defined as:

$$\Pi_{\Theta} = \left\{ \pi_{\theta} : \theta \in \Theta \subset \mathbb{R}^{k} \right\},$$

where θ is a vector of parameters, and each parametric policy π_θ has to be stochastic and also differentiable in θ. Like any gradient-based approach, the idea is to update the parameters toward the maximum growth direction, i.e. the gradient, of a properly defined performance measure in order to find an optimal parameter vector. In the case of Policy Search, it is straightforward to select the expected return $J^{\pi_\theta}_{\mu}$, or J(θ) for the sake of brevity, as the performance measure. Therefore, a Policy Gradient algorithm aims to find a policy parametrization θ* such that:
$$\theta^{*} = \arg\max_{\theta \in \Theta} J(\theta).$$

To accomplish this goal, the algorithm starts from an arbitrarily initialized parametrization θ₀, then it performs a sequence of updates in the direction of the gradient:
$$\theta_{t+1} = \theta_{t} + \alpha_{t} \nabla_{\theta} J(\theta_{t}),$$
where α_t ≥ 0 is the learning rate. Although the sequence of updates may end in a local optimum, due to the inherent locality of the gradient, the convergence of the algorithm is guaranteed under common regularity assumptions [37].


Algorithm 3 General Policy Gradient.
  initialize the policy parameters θ₀ arbitrarily
  for t = 0, 1, 2, ... until convergence do
    compute (or estimate) the policy gradient ∇_θ J(θ_t)
    update the policy parameters θ_{t+1} = θ_t + α_t ∇_θ J(θ_t)
    update the learning rate α_t to α_{t+1}
  end for
  return optimal policy parameters θ*

It is worth noting that the Policy Gradient Theorem [41] provides an exact expression of ∇_θ J(θ):
$$\nabla_{\theta} J(\theta) = \int_{\mathcal{S}} d^{\pi_{\theta}}_{\mu}(s) \int_{\mathcal{A}} \pi_{\theta}(a|s)\, \nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s, a)\, da\, ds.$$

Nevertheless, in many model-free scenarios it is quite impractical to compute the exact gradient. For this reason, the critical part is usually to obtain an effective sample-based estimate of ∇_θ J(θ). Several approaches have been proposed to overcome this issue, which we can classify into Finite-Difference methods [23] and Likelihood Ratio methods, such as REINFORCE [48], G(PO)MDP [3], and PGT [42].
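As an illustration of a likelihood-ratio estimate, the sketch below runs a REINFORCE-style update for a tabular softmax policy on the running toy MDP. The parametrization, step size, episode count, and horizon are assumptions of this example, not choices made in the thesis.

    import numpy as np

    rng = np.random.default_rng(1)

    def softmax_policy(theta):
        """theta[s, a] -> pi(a|s) via a per-state softmax."""
        z = np.exp(theta - theta.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)

    def reinforce_gradient(mdp, theta, episodes=200, horizon=100):
        """Monte Carlo likelihood-ratio estimate of grad J(theta) (basic REINFORCE)."""
        grad = np.zeros_like(theta)
        for _ in range(episodes):
            pi = softmax_policy(theta)
            s = rng.choice(mdp.n_states, p=mdp.mu)
            traj, ret, discount = [], 0.0, 1.0
            for _ in range(horizon):
                a = rng.choice(mdp.n_actions, p=pi[s])
                traj.append((s, a))
                ret += discount * mdp.R[s, a]
                discount *= mdp.gamma
                s = rng.choice(mdp.n_states, p=mdp.P[s, a])
            for s_t, a_t in traj:                  # grad log pi(a_t|s_t) for a softmax policy
                grad_log = -pi[s_t]
                grad_log[a_t] += 1.0
                grad[s_t] += grad_log * ret        # score function times the episode return
        return grad / episodes

    theta = np.zeros((mdp.n_states, mdp.n_actions))
    for _ in range(50):                            # plain gradient ascent on J(theta)
        theta += 0.05 * reinforce_gradient(mdp, theta)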

2.3 Framework extensions

The framework presented in Section 2.1 represents, to some extent, a traditional formulation of a Markov Decision Process. However, the literature proposes numerous extensions to this formulation in order to flexibly adapt the model to a broad range of real-world scenarios that were not completely covered by the original framework. For our purposes, we are going to present, in a nutshell, a couple of instances that make the transition model no longer immutable.

2.3.1 Non-stationary MDPs

A Non-stationary MDP is, basically, a Markov Decision Process in which the dynamics of the environment and the reward model are allowed to change over time [7]. Then, we can reconsider the formulations of the transition model and the reward function as follows:
$$P(s'|s, a, t) = P_{t}(s_{t+1} = s' \mid s_{t} = s, a_{t} = a), \qquad R(s, a, t) = R_{t}(s, a) = \mathbb{E}[r_{t} \mid s_{t} = s, a_{t} = a],$$
where the transition probabilities and the collected rewards are conditioned on the current time step. What makes it possible to learn a policy by interacting with an evolving environment is that contiguous time frames are, usually, highly correlated. Indeed, the experience gathered at time step t could guide the agent in the following time steps. At the same time, the experience collected far in the past has to be forgotten, since it could be no longer informative, or even misleading, about the current dynamics. Several works have been done in the field of Non-stationary MDPs: first, to define an optimality criterion [19], then to effectively find optimal policies [15], [9], [16].

2.3.2 MDP with imprecise probabilities

An MDP with imprecise probabilities (MDPIP), as presented in [12], and previously introduced by [38] and [47], is simply a Markov Decision Process in which the transition probabilities may be imprecisely specified. To model this scenario, let us consider a set of probabilities K, named transition credal set, for each transition between states. At each time step the actual transition probability is picked from the corresponding credal set:
$$P(s_{t+1} \mid s_{t}, a_{t}) \in K(s_{t+1} \mid s_{t}, a_{t}).$$

While the transition model is uncertain, it is worth noting that MDPIPs are element-wise stationary. A common solution to this class of problems is to adopt a minimax approach in which a policy is selected in order to maximize the performance in the worst possible transition model [12].


Chapter 3

Safe Learning

In the last decade Safe Learning methods, i.e. learning methods with guaranteed monotonic performance improvement, have become more and more important. In this chapter, we aim to present some motivations underlying Safe Learning (Section 3.1). Then, we will consider an approach to extend Policy Iteration with guaranteed monotonic improvement, by presenting Conservative Policy Iteration (Section 3.2) and Safe Policy Iteration (Section 3.3). For the sake of completeness, in Section 3.4 we will mention some safe approaches to Policy Gradient, and in Section 3.5 we will report studies based on a definition of safety different from the one considered in this document.

3.1 The need for Safe Approaches

The learning methods presented in the previous chapter have been really successful in solving MDPs in exact settings. However, their generalization to approximate scenarios turned out to be exposed to an important issue known as the policy oscillation problem [5], [45]. We have policy oscillation, in general, when the performance measure $J^{\pi}_{\mu}$ incurs a pathological oscillating behaviour over the learning iterations. It is worth noting that non-monotonic performance is quite natural in model-free learning, due to estimation errors or exploratory learning steps, which trade short-term performance to gather more information on the model of the environment. This behaviour is natural and admissible in many cases, but it becomes pathological when it prevents convergence to an optimal policy. Moreover, even if the trend of $J^{\pi}_{\mu}$ remains overall increasing during the learning process, in several critical scenarios the magnitude of the oscillations may be unacceptable.

In the following, we consider some problems that policy oscillations may cause in real-world scenarios:

• Dangerous behaviour: in on-line settings, a policy corresponding to an oscillation low peak may induce critical interactions with the environment. For instance, let us consider an on-line autonomous car driving scenario. The goal is to improve a driving policy as fast as possible, but we also desire to avoid risky behaviours, which may lead to breaking the car or causing a crash.

• Poor on-line performance: in several cases, the learning goal is to reach an optimal behaviour in the future while preserving a good level of performance during the process. Let us think about an automatic production system, which adapts itself to small changes in the environment. In this case, low performance peaks, which negatively affect the current production, may not be affordable in general.

• Sub-optimal policy selection: in learning processes with fixed durations we may stop at a time step corresponding to a low performance policy. A common scenario is a learning task with a limited time-budget, in which we may halt the process while an exploratory behaviour is running.

To overcome these issues, a need for Safe Learning emerged. Although it is always related to the idea of guaranteed performance improvement, at least to some extent, the definition of safety is not unique in the literature. Then, it is worthwhile to explicitly formalize the concept of safety as interpreted in this document, while in Section 3.5 an alternative definition will be briefly discussed.

Definition 3.1.1. A learning algorithm is said to be safe if, starting from an initial policy $\pi_0$ with performance $J^{\pi_0}_{\mu}$, it produces a sequence of policies $\langle \pi_i \rangle_{i=0,\dots,T}$ such that $J^{\pi_{i+1}}_{\mu} \ge J^{\pi_i}_{\mu}$ for all i, and $\lim_{i \to \infty} J^{\pi_i}_{\mu} = J^{*}$, where $J^{*}$ is the performance of an optimal policy.

According to the definition, Safe Learning approaches attain monotonic performance improvement, effectively overcoming the policy oscillation problem. However, they often give up speed of convergence to reach this goal. It is rather clear that a trade-off between oscillating behaviour and speed of convergence exists. Usually, to have fewer oscillations, it is necessary to give up some exploration or to increase the number of samples used for estimations, which both lead to a slower convergence rate. This trade-off is an instance of the well-known exploration-exploitation dilemma [41], which is rather common in learning tasks on real systems.

3.2 Conservative Policy Iteration

Policy Iteration, as presented in Section 2.2.1, learns an optimal policy by iteratively performing policy evaluation and policy improvement. When the value functions or the corresponding greedy policies cannot be exactly computed, we have to resort to the Approximate Policy Iteration (API) approach. This family of methods, basically, reproduces the scheme of Policy Iteration by substituting the evaluation phase with an approximate value function computation, then performing an update towards an estimated greedy policy in the improvement phase. Although a performance improvement is guaranteed when considering an exact greedy policy, the approximation error on the greedy estimation may cause policy oscillations. To overcome this issue, several studies have focused on producing better approximations in the policy evaluation step, such as [24], [27], [25], [14].

In [21], Kakade and Langford proposed a novel solution to approach the policy oscillation problem in API algorithms. Notably, in order to guarantee monotonic performance, they considered limiting the magnitude of the updates. Indeed, they focused their attention on the policy improvement phase instead of the policy evaluation. The general idea is that we cannot completely rely on the policy evaluation when we move the policy too far from the one under which this evaluation is computed. Therefore, it may be beneficial to keep the updated policy closer to the current one, operating in a conservative way in the policy improvement phase.

To clearly formalize this intuition, let us consider the current policy π and a policy π̄, called the target policy, which establishes the direction in which the policy update is performed. Usually, in Approximate Policy Iteration algorithms, the target policy is the greedy policy w.r.t. the current estimated state-action value function:
$$\bar{\pi}(s) \in \arg\max_{a \in \mathcal{A}} \hat{Q}^{\pi}(s, a),$$

and this same target is selected as the next policy π′. Kakade and Langford proposed to broaden the viable policy updates to the convex hull of the current and the target policies:
$$\pi' = \alpha\,\bar{\pi} + (1 - \alpha)\,\pi,$$

where α ∈ [0, 1] is the convex combination coefficient. Analogously to the Policy Gradient step size (which will be discussed in Section 3.4), the combination coefficient α can be interpreted as a measure of the reliability of the target policy π̄. Then, it also measures how confident we are in the accuracy of the policy evaluation, which is, instead, affected by the number of samples used in the estimation, named batch size in Policy Gradient theory (Section 3.4).

The rationale is to optimize the value of α, w.r.t. some criterion, in order to achieve monotonic performance improvement. To be as conservative as possible, the authors proposed the maximization of a lower bound on the performance improvement:
$$J^{\pi'}_{\mu} - J^{\pi}_{\mu} \ge \frac{\alpha}{1-\gamma} \left( A^{\bar{\pi}}_{\pi,\mu} - \frac{2\alpha\gamma\epsilon}{1 - \gamma(1-\alpha)} \right),$$


Algorithm 4 Conservative Policy Iteration.
  initialize policy π^(0) arbitrarily
  for t = 0, 1, 2, ... until convergence do
    evaluate the current policy π^(t)
    compute the target π̄(s) ∈ arg max_{a∈A} Q̂^{π^(t)}(s, a)
    compute the advantage A^{π̄}_{π,µ}
    compute α* = min( (1−γ) A^{π̄}_{π,µ} / (4γε), 1 )
    update the policy as π^(t+1) = α* π̄ + (1 − α*) π^(t)
  end for
  return optimal policy π^(t)

where $\epsilon = \max_{s \in \mathcal{S}} \mathbb{E}_{a \sim \bar{\pi}}\left[ A^{\pi}(s, a) \right]$. This lower bound is characterized by a trade-off between two important components. The first one, the positive term, is related to the expected relative advantage of the target policy w.r.t. the current one, and it grows with the α coefficient. The second term is a penalization term, which, basically, takes into consideration the magnitude of the update: the farther we move from the current policy, the higher the penalization. Intuitively, we aim to accept larger updates only if the target is really promising in terms of future performance, i.e. the advantage has a high value.

Under the assumption of a positive relative advantage, which is quite reasonable since it is always met when taking the exact greedy policy as target, by maximizing the bound we obtain an update coefficient that guarantees a monotonic performance improvement. From this maximization problem we can straightforwardly write the optimal value of the coefficient (α*) and the corresponding bound value:

α* = (1 − γ)A^{π̄}_{π,µ} / (4γε/(1 − γ)),      J^{π′}_µ − J^π_µ ≥ ((1 − γ)A^{π̄}_{π,µ})² / (8γε/(1 − γ)).      (3.1)

Kakade and Langford presented an algorithm, named Conservative Policy Iteration (CPI), based on these theoretical results (Algorithm 4). Basically, it is a generalization of traditional Policy Iteration in which the policy improvement phase does not take a full step toward the greedy policy, but updates the current policy by maximizing the lower bound on future performance. The authors also provided a sample-based version of the algorithm to deal with approximate settings. Although the expected relative advantage cannot be exactly computed and has to be estimated, monotonic improvement of performance is guaranteed with high probability.
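To make the conservative update concrete, the following Python sketch shows a minimal, hypothetical tabular implementation of the CPI improvement step: the policies are assumed to be stored as |S| × |A| stochastic matrices, and the advantage A^{π̄}_{π,µ} and the constant ε are assumed to be already estimated. The coefficient follows the reconstruction of Eq. (3.1).

    import numpy as np

    def cpi_update(pi, pi_target, adv_target, eps, gamma):
        # One Conservative Policy Iteration improvement step (sketch).
        # pi, pi_target: arrays of shape (n_states, n_actions) with rows summing to one
        # adv_target: estimate of the expected relative advantage of the target policy
        # eps: estimate of the constant appearing in the penalization term
        # Convex-combination coefficient from Eq. (3.1), clipped to [0, 1]
        alpha_star = min((1.0 - gamma) * adv_target / (4.0 * gamma * eps / (1.0 - gamma)), 1.0)
        # Conservative update: convex combination of target and current policies
        return alpha_star * pi_target + (1.0 - alpha_star) * pi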

3.3 Safe Policy Iteration

The basis of Kakade and Langford's seminal work, i.e. the maximization of a lower bound on future performance to overcome policy oscillations, is quite smart and has proven rather successful, inspiring several further studies in the last decade. However, their main contribution is more theoretical than practical, since the proposed algorithm (CPI) has turned out to be over-conservative, leading to a poor convergence rate due to the excessive looseness of the bound.

In [33], starting from the valuable theoretical foundations presented by Kakade and Langford, Pirotta et al. derived a tighter lower bound on performance improvement w.r.t. the one in Conservative Policy Iteration. They also proposed two safe algorithms to exploit this novel lower bound, along with an experimental study to show the quality of their results on illustrative examples.

3.3.1 Theoretical foundations

For the purpose of lower bounding the performance improvement between two policies π and π′ given the corresponding advantage function, Pirotta et al. noted that A^{π′}_{π,µ} can be a good prediction of the future performance J^{π′}_µ only if the state distributions induced by the two policies are similar, i.e. the two policies visit all the states with similar probabilities. This intuition is upheld by the following result, presented by Kakade and Langford [21].

Lemma 3.3.1. For any stationary policies π and π′ and any starting state distribution µ:

J^{π′}_µ − J^π_µ = (1 / (1 − γ)) ∫_S d^{π′}_µ(s) A^{π′}_π(s) ds.

Unfortunately, computing the exact performance improvement by exploiting the previous lemma is quite impractical, since it requires estimating the state distribution d^{π′}_µ for each candidate π′. Instead, we aim to estimate the future performance relying only on local information, i.e. the current state distribution d^π_µ and the relative advantage. To this purpose, the authors derived an upper bound on the difference between the state distributions.

Lemma 3.3.2. Let π and π′ be two stationary policies for an infinite horizon MDP. The L1-norm of the difference between their γ-discounted state distributions under starting state distribution µ can be upper bounded as follows:

‖d^{π′}_µ − d^π_µ‖₁ ≤ (γ / (1 − γ)²) D^{π′,π}_∞,

where D^{π′,π}_∞ = sup_{s∈S} ‖π′(·|s) − π(·|s)‖₁.


On the basis of this upper bound on state distributions, which depends only on the discount factor and the difference between the two policies, Pirotta et al. derived a novel lower bound on performance improvement. The idea is that the penalization term has to be related to the difference between the state distributions, instead of the magnitude of the update as proposed by Kakade and Langford.

Theorem 3.3.1. For any stationary policies π and π′ and any starting state distribution µ, the difference between the performance of π′ and the one of π can be lower bounded as follows:

J^{π′}_µ − J^π_µ ≥ A^{π′}_{π,µ} / (1 − γ) − γ D^{π′,π}_∞ ΔA^{π′}_π / (2(1 − γ)²),

where ΔA^{π′}_π = sup_{s,s′∈S} |A^{π′}_π(s′) − A^{π′}_π(s)|.

The lower bound is composed of two terms, like in [21]. While the positive term remains unchanged, the penalization term is now related to the difference between the two policies and to ΔA^{π′}_π. Pirotta et al. also provided a looser but simplified version of the lower bound on performance improvement.

Corollary 3.3.1. For any stationary policies π and π′ and any starting state distribution µ, the difference between the performance of π′ and the one of π can be lower bounded as follows:

J^{π′}_µ − J^π_µ ≥ A^{π′}_{π,µ} / (1 − γ) − γ (D^{π′,π}_∞)² ΔQ^π / (2(1 − γ)²),

where ΔQ^π = sup_{s,s′∈S, a,a′∈A} (Q^π(s′, a′) − Q^π(s, a)).

3.3.2 The algorithms

Pirotta et al. presented two different algorithms to exploit the novel lower bound on performance improvement. The first one, Unique-parameter Safe Policy Improvement (USPI), is developed in the same problem setting as Conservative Policy Iteration. We consider a current policy π and a target policy π̄. At each step, a policy improvement is performed in the policy space identified by the convex hull of the current and the target policies:

π′ = απ̄ + (1 − α)π,      α ∈ [0, 1].

In order to find an optimal value for the α coefficient, we aim to maximize a lower bound on performance improvement. Thus, the optimization problem leads to the following values of the coefficient and of the bound:

α* = (1 − γ)A^{π̄}_{π,µ} / (γ D^{π̄,π}_∞ ΔA^{π̄}_π),


Algorithm 5 Unique-parameter Safe Policy Improvement
  initialize policy π(0) arbitrarily
  for t = 0, 1, 2, ... until convergence do
    evaluate the current policy π(t)
    compute the target π̄(s) ∈ arg max_{a∈A} Q̂^{π(t)}(s, a)
    compute the terms A^{π̄}_{π(t),µ}, D^{π̄,π(t)}_∞, ΔA^{π̄}_{π(t)}
    compute α* = min{(1 − γ)A^{π̄}_{π(t),µ} / (γ D^{π̄,π(t)}_∞ ΔA^{π̄}_{π(t)}), 1}
    update the policy as π(t+1) = α*π̄ + (1 − α*)π(t)
  end for
  return optimal policy π(t)

J^{π′}_µ − J^π_µ ≥ ((1 − γ)A^{π̄}_{π,µ})² / (2γ D^{π̄,π}_∞ ΔA^{π̄}_π).

Comparing the value of this bound with the one of CPI (3.1), we can easily notice that the two have identical numerators. Moreover, since D^{π̄,π}_∞ ΔA^{π̄}_π ≤ 4ε/(1 − γ), the bound of USPI is tighter than the one derived by Kakade and Langford. For the sake of clarity, we report the pseudocode of USPI (Algorithm 5).
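As an illustration, the USPI improvement step can be sketched in Python as follows (hypothetical tabular setting; the advantage, D∞ and ΔA terms are assumed to be already computed or estimated):

    def uspi_update(pi, pi_target, adv_target, d_inf, delta_adv, gamma):
        # One Unique-parameter Safe Policy Improvement step (sketch).
        # adv_target: estimate of the expected relative advantage of the target policy
        # d_inf: distance D_infinity between target and current policy (Lemma 3.3.2)
        # delta_adv: span of the relative advantage function
        # Coefficient maximizing the lower bound, clipped to [0, 1]
        alpha_star = min((1.0 - gamma) * adv_target / (gamma * d_inf * delta_adv), 1.0)
        return alpha_star * pi_target + (1.0 - alpha_star) * pi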

Pirotta et al. presented a second algorithm, called Multiple-parameter Safe Policy Improvement (MSPI), which deviates a bit from the setting of Conservative Policy Iteration by taking into consideration state-dependent policy updates:

π′(s) = α(s)π̄(s) + (1 − α(s))π(s),      α(s) ∈ [0, 1],  ∀s ∈ S.

This approach significantly enriches the update possibilities far beyond the convex hull of the two policies. The idea is to selectively update the policy toward the target only in the more promising states, i.e. the states with higher A^{π̄}_π, setting all the other α(s) to zero. Although this generalization seems quite valuable, the computation of the coefficients can only be done iteratively. Thus, the algorithm turns out to be rather inefficient in comparison with the simpler state-independent approach. Moreover, it does not guarantee better performance improvements, from a theoretical standpoint, w.r.t. USPI.
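For completeness, the form of the state-dependent update can be sketched as below. Note that this only shows the shape of the update: the actual iterative computation of the per-state coefficients performed by MSPI is not reproduced here.

    import numpy as np

    def state_dependent_mixture(pi, pi_target, alphas):
        # Per-state convex combination of current and target tabular policies.
        # alphas: array of shape (n_states,), one mixture coefficient per state, each in [0, 1]
        return alphas[:, None] * pi_target + (1.0 - alphas[:, None]) * pi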

Pirotta et al. also provided a sample-based version of the presented algorithms, following the same scheme proposed by Kakade and Langford for Conservative Policy Iteration. The authors derived a further simplification of the lower bound on performance improvement:

J^{π′}_µ − J^π_µ ≥ A^{π′}_{π,µ} / (1 − γ) − γ (D^{π′,π}_∞)² / (2(1 − γ)³).


Figure 3.1: J^π_µ as a function of iterations for USPI, MSPI, CPI, Policy Iteration (PI), Natural Policy Gradient (NPG). The experiment domain is a Chain-Walk with 50 states. The plot is taken from [33].

Figure 3.2: J^π_µ as a function of iterations for the approximate versions of USPI, MSPI, CPI, PI. The experiment domain is a Blackjack game. The plot is taken from [33].

In this way, the only term that has to be estimated is the advantage A^{π′}_{π,µ}. An ε-accurate estimation of the advantage can be easily obtained following the sampling procedure described in [21].
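A possible Monte Carlo estimator of the relative advantage, in the spirit of the sampling procedure of [21], is sketched below. The generative model sample(s, a), the initial-state sampler reset() and the truncation horizon are assumptions introduced here only for illustration and are not part of the original procedure.

    import numpy as np

    def estimate_relative_advantage(sample, reset, pi, pi_target, n_actions, gamma, n_samples, horizon):
        # Monte Carlo estimate of the expected relative advantage of pi_target over pi (sketch).
        # reset() draws an initial state from mu; sample(s, a) returns (next_state, reward);
        # pi(s) and pi_target(s) return probability vectors over the n_actions actions.

        def rollout(s, a):
            # Truncated Monte Carlo estimate of Q^pi(s, a)
            ret, discount = 0.0, 1.0
            for _ in range(horizon):
                s, r = sample(s, a)
                ret += discount * r
                discount *= gamma
                a = np.random.choice(n_actions, p=pi(s))
            return ret

        estimates = []
        for _ in range(n_samples):
            # Draw a state approximately from the discounted distribution d^pi_mu:
            # follow pi, stopping with probability 1 - gamma at every step
            s = reset()
            while np.random.rand() < gamma:
                a = np.random.choice(n_actions, p=pi(s))
                s, _ = sample(s, a)
            a_target = np.random.choice(n_actions, p=pi_target(s))
            a_current = np.random.choice(n_actions, p=pi(s))
            estimates.append(rollout(s, a_target) - rollout(s, a_current))
        return float(np.mean(estimates))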

In [33], the authors also provided an empirical evaluation of the presented algorithms. In Figure 3.1, we report the results of an experiment on a Chain-Walk domain with 50 states. In this case, a full model of the environment is available. Not surprisingly, Policy Iteration outperforms all the other methods in this scenario. Still, it is noteworthy that USPI and MSPI perform far better than CPI. In Figure 3.2, we show the performance plot of a Blackjack domain experiment in the approximate setting. Approximate Policy Iteration (aPI) results in a pathological oscillating behaviour in this case. Instead, the safe methods all lead to better performance, with Approximate MSPI (aMSPI) significantly outscoring Approximate USPI (aUSPI) and Approximate CPI (aCPI).


3.3.3 Other comparable approaches

The idea of limiting the magnitude of the updates according to a penalization term, somehow related to the distance between the current and the next policy, has several other instances in the literature.

Schulman et al., in [39], derived theoretical results quite similar to the ones presented in [33], using the Kullback-Leibler divergence between π and π′, instead of the ℓ∞-norm, for the penalization term. Considering several approximations of these results, they presented a practical algorithm, called Trust Region Policy Optimization (TRPO). Although the monotonic performance improvement is no longer guaranteed from a theoretical standpoint, their algorithm has proven to be quite successful in practice, dealing with a set of diverse learning problems. The approach of TRPO has been further developed in the subsequent [40], where the problem of safe learning has been set aside a bit, in favour of the search for a general approach for efficient Policy Gradient.

In [30], Peters et al. addressed the problem of penalizing the loss of information underlying a policy update. To this purpose, they proposed an algorithm, Relative Entropy Policy Search, in which the policy update is the result of a constrained optimization problem where the loss of information is bounded by a hard constraint. Therefore, in their solution, it is this constraint that limits the distance between the current and the next policy, instead of a penalization term as in [33].

The idea of producing updates with state-dependent coefficients, similarly to MSPI, has also been considered in the literature. Bartlett et al. [1] proposed an approach employing state-dependent mixture coefficients. In their solution, the mixture coefficients can be computed efficiently to attain a monotonic improvement of performance. The resulting algorithm, Linearized Policy Improvement (LPI), guarantees better improvements w.r.t. CPI, allowing at the same time larger updates and faster convergence.

3.4 Safe Gradient approaches

Policy Gradient methods, introduced in Section 2.2.3, have proven to be highly effective in dealing with several complex, continuous, and high-dimensional real-world control problems, such as robotic control systems [29]. Nevertheless, also for this class of algorithms, policy oscillation problems have been pointed out [45], leading to a need for Safe Policy Gradient methods.

Policy Gradient is based on iterative updates to the policy parametrization in the gradient direction, of the form θ_{t+1} = θ_t + α_t ∇̂_θ J(θ_t). The coefficient α_t is called step-size, while ∇̂_θ J(θ_t) is a sample-based estimate of the gradient, and the number of samples used to compute it is called batch-size. In general, we have a policy oscillation every time J(θ_t + α_t ∇̂_θ J(θ_t)) < J(θ_t), i.e. whenever the parameter update leads to a degradation of the performance.
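For concreteness, a minimal REINFORCE-style sketch of such an update is reported below (no baseline; the score_function callable returning ∇_θ log π_θ(a|s) is a hypothetical ingredient introduced here, and the sketch is only meant to show the roles of the step-size and of the batch of trajectories):

    import numpy as np

    def policy_gradient_step(theta, trajectories, score_function, gamma, step_size):
        # One update theta <- theta + step_size * estimated gradient (sketch).
        # trajectories: a batch of trajectories, each a list of (state, action, reward)
        #               tuples; the number of trajectories is the batch-size
        # score_function(theta, s, a): gradient of log pi_theta(a | s) w.r.t. theta
        grad_estimates = []
        for tau in trajectories:
            ret, discount = 0.0, 1.0
            score_sum = np.zeros_like(theta)
            for (s, a, r) in tau:
                ret += discount * r
                discount *= gamma
                score_sum += score_function(theta, s, a)
            # REINFORCE-style estimate: (sum of score functions) * (discounted return)
            grad_estimates.append(score_sum * ret)
        return theta + step_size * np.mean(grad_estimates, axis=0)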

