Batch reinforcement learning for highway driving


POLITECNICO DI MILANO

School of Industrial and Information Engineering
Department of Electronics, Information and Bioengineering

Master's Degree in Computer Science and Engineering

Batch Reinforcement Learning for Highway Driving

AI & R Lab

The Artificial Intelligence and Robotics Lab of the Politecnico di Milano

Supervisor: Prof. Marcello Restelli

Co-supervisors: Dott. Alberto Maria Metelli, Dott. Amarildo Likmeta

Author: Andrea Mecchia, 10452804


Abstract

An Autonomous Vehicle (AV), also known as a self-driving or driverless car, is a vehicle that is capable of moving safely with little or no human input. As evidence of the advances in the field of Autonomous Driving (AD), in 2014 the Society of Automotive Engineers (SAE) International defined five stages of development, from simple driving assists (level 1) to full automation (level 5), meaning that the autonomous system completely replaces the human driver.

In order to achieve full automation, the vehicle should be able to react promptly to possibly unseen situations in an intelligent and safe manner, just like a human would. Whereas partial automation can be achieved with classic task planning frameworks, the dynamic nature of the driving task requires a level of flexibility that cannot be guaranteed by classic approaches. For this reason, in recent years AD research has focused its efforts on exploring new approaches, mainly methods that belong to a subfield of Artificial Intelligence (AI) known as Reinforcement Learning (RL).

Reinforcement learning is a branch of Machine Learning (ML) that aims at solving decision-making problems. In a typical RL setting, an autonomous agent observes and interacts with the surrounding environment, with the purpose of fulfilling a specific goal. By interacting with the environment, the agent observes a reward, which depends on the current state of the environment and the actions performed by the agent, and is usually a measure of how well the agent is behaving. Therefore, most goals can be described as maximizing the cumulative reward over time. For example, a dog can be taught to give its paw by giving the dog some kind of reward every time it performs the task correctly: the dog will receive the reward only if it gives its paw when requested; over time, the dog will eventually learn to associate the extended hand of the owner with the gesture of giving its paw.

Many autonomous driving tasks can be well described as RL problems. In our specific case, we tackle the problem of driving through dense traffic on a multi-lane road; the goal of the task is to drive as fast as possible, learning how to efficiently overtake other vehicles without getting stuck in traffic. The method we propose to solve this problem is called Fitted Q Iteration (FQI), which belongs to a class of RL algorithms known as Batch Reinforcement Learning algorithms. Finally, we provide an empirical evaluation of the proposed algorithm and compare its performance in a simulated environment with a well-known on-line method, the Deep Q Network (DQN) algorithm.


Estratto in lingua italiana

An autonomous vehicle is a vehicle capable of moving safely with minimal or no human intervention. As evidence of the progress made in the field of autonomous driving, in 2014 SAE International defined five stages of development, from simple driving assists (level 1) up to full vehicle autonomy (level 5), in which the autonomous system will completely replace the human driver.

In order to support full autonomy, the system must be able to react in an intelligent and safe manner to situations it may never have faced before, exactly as a human being would. While partial autonomy can be achieved with classic task-planning tools, the dynamic nature of the driving task requires a very high level of flexibility that this kind of simple approach simply cannot guarantee. For this reason, in recent years research has concentrated its efforts on exploring new techniques, mainly methods belonging to a branch of Artificial Intelligence known as Reinforcement Learning.

Reinforcement learning is an area of machine learning whose purpose is to study and solve decision-making processes. In a typical problem, an autonomous agent observes and interacts with the surrounding environment in order to reach a predefined goal. By interacting with the environment, the agent obtains a reward that depends on the configuration of the environment and on the actions performed, and is usually an indicator of the quality of those actions. Consequently, the goal of the agent consists in maximizing the sum of the rewards obtained over time. For example, it is possible to teach a dog to give its paw by giving it a reward every time the procedure is performed correctly: the dog will receive the reward only if it gives its paw when asked; over time, the dog will learn to associate the owner's extended hand with the gesture of giving its paw.

Most tasks related to autonomous driving can be described as reinforcement learning problems. In our specific case, the problem we set out to solve is that of autonomous driving through traffic on a multi-lane road; the goal is to drive as fast as possible, learning to overtake other vehicles without getting stuck in traffic. The method we propose to solve this problem is called Fitted Q Iteration and belongs to a class of algorithms known as Batch Reinforcement Learning. While in on-line algorithms the agent learns on the spot, by trial and error, in batch methods the learning phase happens off-line, using previously collected experience. Finally, we present an evaluation of the proposed algorithm and compare it with Deep Q Network (DQN), an on-line algorithm that has been successfully applied to various autonomous driving problems.


Ringraziamenti

I would like to thank Prof. Marcello Restelli, Dott. Alberto Maria Metelli and Dott. Amarildo Likmeta; in a difficult period such as the one we are living through, their commitment and availability were fundamental to the writing of this thesis.

I thank my family, who supported me emotionally and financially throughout my university studies, and I sincerely hope that this achievement can partly repay their efforts.

Finally, I thank all the people who have been close to me, my everyday friends and my university colleagues, for all the happy moments spent together.

Andrea
Milan, June 6, 2020


Contents

Abstract
Estratto in lingua italiana
Ringraziamenti
Contents

1 Introduction

2 Background
2.1 Markov Decision Processes
2.1.1 Decision making and policies
2.1.2 Markov Chains
2.1.3 Value-functions
2.1.4 Bellman expectation
2.1.5 Bellman optimality
2.1.6 Solving MDPs
2.2 Dynamic Programming
2.2.1 Value iteration
2.2.2 Policy iteration
2.3 Model-free prediction
2.3.1 Monte Carlo methods
2.3.2 Temporal difference learning
2.4 Model-free control
2.4.1 SARSA learning
2.4.2 Q-learning
2.5 Function Approximation
2.5.1 Deep Q Learning

3 Batch Reinforcement Learning
3.1 Fitted value iteration
3.2 Fitted Q Iteration
3.3 Tree-based methods

4 Reinforcement Learning for Autonomous Driving
4.1 Driving Scenarios and Tasks
4.2 Methodologies

5 Experiments
5.1 DeepTraffic
5.2 Results
5.2.1 DQN
5.2.2 FQI

6 Conclusions

Bibliography


List of Figures

2.1 General interaction between agent and environment in Reinforcement Learning
2.2 Interaction of evaluation and improvement phases in GPI algorithms
3.1 The algorithm partitions the input space into distinct regions (3.1a); 3.1b shows the equivalent tree structure
5.1 The DeepTraffic simulation interface
5.2 Two elements of the simulation: 5.2a the safety system, which prevents collisions by limiting a vehicle's speed and preventing it from turning left/right; 5.2b the discrete grid through which the environment is actually perceived
5.3 Simplified architecture of the deep neural network used in DQN
5.4 The relative grid used to perceive the vehicle's surroundings
5.5 Performance evaluation of FQI with extra-trees using a batch collected while following a pure-random policy (5.5a) and using a batch collected following an ε-greedy policy derived from one of the recovered policies (5.5b); the light area is the standard deviation of the average speed over 500 runs
5.6 The reward distribution for the batch collected with the random policy (5.6a) and with the ε-greedy policy (5.6b)


List of Algorithms

1 Value iteration
2 Policy iteration
3 SARSA learning
4 Q-learning
5 The vanilla DQN algorithm
6 DQN with experience replay
7 Fitted value iteration
8 Fitted Q iteration


Chapter 1

Introduction

Over the last decades, vehicles have become an integral part of our lives, as a convenient form of transport. The manufacture of large volumes of vehicles has led to a reduction in their cost, and thus to a wider audience being able to afford one. However, the growing number of vehicles on the streets has also increased the number of traffic accidents. A survey conducted by the National Highway Traffic Safety Administration (NHTSA) from 2005 to 2007 reported that 94% of accidents are caused by human error. This is one of the many reasons why Autonomous Driving (AD) has become one of the central topics in the Information and Communication Technology (ICT) field. Driving a car is a task that requires a high level of flexibility: the driver must be able to react promptly to unforeseen events and situations, and this type of flexibility is extremely difficult to achieve using traditional task planning methods. For this reason, AD research has focused on exploring new techniques that belong to a subfield of Artificial Intelligence (AI) known as Reinforcement Learning (RL) [1].

Reinforcement learning is a branch of Machine Learning (ML) that aims at solving decision-making problems, in which an autonomous agent observes and interacts with the environment during the training phase, collecting experience that will enable it to make informed decisions in the future. The framework used in the research literature to describe RL problems is the Markov Decision Process (MDP) [2]. An MDP describes the RL problem in terms of a set of environment configurations (states), a set of available actions, a transition function that captures the dynamics of the environment, and a reward function. The agent interacts with the environment by iteratively performing actions and observing the new environment configuration; the agent also observes a reward signal, which depends on the current state and the actions taken and is usually a measure of how well the agent is behaving.


The agent of an MDP is driven by a goal, which is usually expressed in terms of cumulative reward, i.e. maximizing the sum of the rewards over time; thus, the shape of the reward signal encodes a specific task. In many real-world settings, the actual reward could be delayed, so the agent may have to learn to sacrifice immediate reward for the benefit of a larger long-term payoff.

Solving an RL problem means finding an optimal policy, i.e. knowing how to choose the best action in any state. There are many methods to solve RL problems; in our work, we focus on a class of algorithms known as Batch Reinforcement Learning [2], in which the agent learns the optimal policy not by interacting with the environment, but using experience collected a priori. In particular, we focus on applying batch RL methods to the problem of driving efficiently through dense traffic on a multi-lane road.

Goals and motivations

Batch methods are known to be more data-efficient than pure on-line RL methods: in pure on-line methods the information acquired at each training step is processed individually, whereas batch methods can process the whole experience collectively. As a consequence, batch methods usually require a smaller amount of experience than on-line methods.

Batch methods also allow for a more controlled exploration, since the learning algorithm is completely decoupled from the exploration phase. In general, the easiest way to collect experience is by following a pure random policy, i.e. randomly choosing a new action at each iteration. In a real-world setting, such as autonomous driving, this could result in very dangerous behaviour. The initial performance of on-line algorithms usually displays nearly random behaviour, caused by the high uncertainty about the outcome of the chosen actions and the need to explore.

Sometimes, we simply cannot interact further with the environment and can only use the experience collected during previous interactions. This naturally leads to the use of batch methods.

Our goal is to demonstrate that a vehicle driving through traffic is able to learn a good policy using only a fixed batch of samples. For this purpose, we evaluate the performance of FQI in a simulated highway environment and compare it with the performance of the Deep Q Network (DQN) [3], a method which has proven to be a robust and reliable solution in many contexts, including autonomous driving.


Outline of the thesis

In Chapter 2, a background on reinforcement learning and Markov decision processes is given; we first define the reinforcement learning problem and the mathematical framework used to formulate such problems; then we define the concept of value-function and provide a brief analysis of the common approaches used to solve MDPs and RL problems. In Chapter 3, we introduce the class of RL algorithms known as batch reinforcement learning methods, focusing on the fitted Q iteration algorithm with tree-based methods. In Chapter 4, we review the state of the art of autonomous driving, with particular emphasis on RL approaches. In Chapter 5, we show empirical results of the application of FQI to the problem of highway driving. Finally, in Chapter 6, we further discuss the results obtained and possible future work.


Chapter 2

Background

Reinforcement Learning (RL) is an area of machine learning concerned with how agents should interact with the environment in order to maximize some notion of utility. The agent explores the environment, driven by the reward signal it receives from its interactions, and collects experience that will be used to make informed decisions in the future. Fig. 2.1 illustrates the process.


Figure 2.1: General interaction between agent and environment in Reinforcement Learning

The most interesting cases present themselves when the actions of the agent affect not only the immediate reward but also the environment state, and through that, all the subsequent rewards. Therefore, the agent has to learn how to map states to actions in a trial-and-error fashion. Problems with these characteristics are best described using the Markov Decision Process (MDP) framework.


2.1 Markov Decision Processes

A Markov Decision Process (MDP) [1] is a framework for sequential decision-making problems in which a goal-directed agent interacts with the environment by performing actions and observing how the environment evolves. We consider discrete-time MDPs, in which at each time step the agent has to decide which action to take in order to maximize the utility function. Whenever an action is performed, the environment evolves into a new state and the agent receives a payoff, called the immediate reward. The utility function of the agent is usually defined in terms of cumulative reward, thus the decision cannot depend on the immediate reward alone: for instance, the agent may need to sacrifice a short-term reward in order to gain a better long-term reward.

Definition 1 A discrete-time MDP is a tuple M = {S, A, P, R, µ, γ}, where:

• S is a non-empty set of possible states;

• A is a non-empty set of possible actions;

• P : S × A → ∆(S) is the transition model;

• R : S × A × S → ∆(R) is the reward function;

• µ is the distribution of the initial state;

• γ ∈ [0, 1] is the discount factor.

The state-space S and the action-space A can be either finite or infinite, discrete or continuous. Usually, not all actions can be performed in all states, in which case it is convenient to define the subset A(s) ⊂ A, ∀s ∈ S which is the set of all actions available in state s.

The dynamics of the environment are captured by the transition model P(· | s, a), which defines a probability distribution over the state-space S. Given the current state s and current action a, it assigns to every state s′ the quantity P(s′ | s, a), which is the probability of transitioning into state s′ starting from state s and performing action a.

The immediate reward is represented as a scalar quantity¹. The reward function R(· | s, a, s′) is a probability measure defined over R. Given the current state s, current action a and next state s′, the quantity R(r | s, a, s′) is the

¹ Sutton and Barto [1] suggested that any goal can be thought of as the maximization of a cumulative scalar reward. This assumption may not always be satisfied, but it is so simple and flexible that we will not argue against it.


probability of receiving reward r when transitioning from the current state s to the next state s′ while performing action a.
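To make the objects of Definition 1 concrete, the sketch below encodes a tiny finite MDP with plain numpy arrays; the two-state, two-action numbers are invented purely for illustration and are not taken from the thesis.

    import numpy as np

    # Hypothetical 2-state, 2-action MDP used only to illustrate Definition 1.
    n_states, n_actions = 2, 2

    # Transition model P[s, a, s'] = P(s' | s, a); each row sums to 1.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])

    # Expected immediate reward R[s, a] (deterministic rewards for simplicity).
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])

    mu = np.array([1.0, 0.0])   # initial-state distribution
    gamma = 0.95                # discount factor

    assert np.allclose(P.sum(axis=2), 1.0)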

A sequence of state-action-reward tuples τ = {(s_t, a_t, r_t)}_{t=0}^{T} is referred to as a trajectory. Sometimes we will omit the reward component of a trajectory. A Markov Decision Process satisfies the Markov property.

Property 1 A stochastic process X_t is Markovian if the following condition holds:

P(X_{t+1} = z | X_t = y_t, X_{t−1} = y_{t−1}, ..., X_0 = y_0) = P(X_{t+1} = z | X_t = y_t)    (2.1)

In other words, the future state only depends on the current state-action pair. This is a fundamental property of MDPs.

2.1.1 Decision making and policies

At each time step t, the agent in an MDP observes the current state s_t and chooses which action a_t to perform. A decision rule d_t tells the agent how to choose the action to perform based on the history H_t up to time step t. If the decision rule is deterministic it prescribes a single action, d_t : H_t → A; if the decision rule is stochastic it provides a probability distribution over the set of possible actions, d_t : H_t → ∆(A). A decision rule that satisfies the Markov property prescribes the action to perform based on the current state rather than the history, d_t : S → ∆(A). This class of decision rules is of particular importance in MDP theory.

Definition 2 A policy π is a sequence of decision rules, one for each time step:

π = {d_t : S → ∆(A)}_{t=0}^{T}

If the policy is stationary, the decision rule d_t does not depend on the time step t, i.e. d_t = d, ∀t ∈ T, in which case we indicate d with π. If the policy is Markovian, the quantity π(a | s) is the probability of selecting action a given the current state s.

Definition 3 A Markovian policy is a function π : S → ∆(A) that for each state s defines a probability distribution π(· | s) over the set of all possible actions A; if the policy is deterministic then it prescribes a single action, π : S → A.


If not otherwise specified, we will use the term policy to refer to Markovian stationary policies.

Given an MDP, a policy π induces a distribution over the trajectory space, whose density function is given by:

p(τ | P, µ, π) = µ(s_{τ,0}) ∏_{t=0}^{T−1} π(a_{τ,t} | s_{τ,t}) P(s_{τ,t+1} | s_{τ,t}, a_{τ,t})    (2.2)

where P is the transition model, µ is the distribution of the initial state, and {(s_{τ,t}, a_{τ,t}) | t = 0, ..., T} is the sequence of state-action pairs of the trajectory τ.

2.1.2 Markov Chains

Given an MDP defined by the tuple (S, A, P, R, µ, γ) and a policy π, the triple (S, P^π, µ) is called a Markov Chain (MC). P^π(· | s) is a probability measure over the set of states S whose density function is obtained by marginalizing over the set of possible actions:

P^π(s′ | s) = ∫_A π(a | s) P(s′ | s, a) da,   ∀s, s′ ∈ S    (2.3)

The value P^π(s′ | s) is the probability of transitioning into state s′, starting from state s, according to policy π. We can also recursively define the probability of reaching state s′ starting from s after n steps:

P_n^π(s′ | s) = ∫_S P_{n−1}^π(s′ | s″) P^π(s″ | s) ds″,   ∀s, s′ ∈ S,  n = 2, 3, ...    (2.4)

The tuple (S, P^π, R^π, µ, γ) is called a Markov Reward Process (MRP), where R^π(· | s, s′) is a probability measure over R whose density function is obtained as:

R^π(r | s, s′) = ∫_A π(a | s) R(r | s, a, s′) da,   r ∈ R,  ∀s, s′ ∈ S    (2.5)

In MRPs and MCs the notion of action disappears. Thus, MRPs and MCs are best suited to describing uncontrollable problems, i.e. problems where the agent cannot perform actions.


2.1.3 Value-functions

The agent in an MDP performs actions driven by the goal of maximizing its utility function. The utility function is almost always defined in terms of cumulative reward, i.e. the sum of the immediate rewards along a given trajectory. Given an MDP and a policy π, we can define the expected cumulative reward induced by the policy π as:

J^π = E_{p(τ)} [ ∑_{t=0}^{T−1} R(s_{τ,t}, a_{τ,t}) ]    (2.6)

where p(τ) ∼ p(· | P, π, µ) is the distribution over the trajectory space induced by policy π.

If the MDP has an infinite horizon the series might not converge, in which case it is convenient to resort to the mean expected reward:

J^π = lim_{T→∞} E_{p(τ)} [ (1/T) ∑_{t=0}^{T−1} R(s_{τ,t}, a_{τ,t}) ]    (2.7)

A cumulative reward function widely used in the literature is the discounted expected cumulative reward, also known as the expected return:

J^π = E_{p(τ)} [ ∑_{t=0}^{T−1} γ^t R(s_{τ,t}, a_{τ,t}) ] = E_{p(τ)} [R(τ)]    (2.8)

where R(τ) is the return value of the trajectory τ and γ ∈ [0, 1] is called the discount factor. A value of γ ∈ [0, 1) has the advantage of preventing the cumulative reward from diverging, provided that the immediate reward is bounded in absolute value. The discount factor has various important interpretations, though it is usually regarded as a measure of the importance the agent gives to a reward at a certain time step: the agent might be more interested in a payoff obtained in the near future than in one obtained further in the future. Values close to 0 lead to a myopic view, whereas values close to 1 lead to a far-sighted view. A far-sighted agent might learn to sacrifice immediate rewards in order to gain a better long-term payoff.
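As a concrete illustration of the expected return, the small helper below computes the discounted return R(τ) of a single sampled trajectory from its reward sequence; the function name and the example rewards are ours, not from the thesis.

    def discounted_return(rewards, gamma):
        """Return sum_t gamma^t * r_t for one trajectory."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # A far-sighted agent (gamma close to 1) values the delayed payoff more.
    print(discounted_return([0.0, 0.0, 10.0], gamma=0.5))   # 2.5
    print(discounted_return([0.0, 0.0, 10.0], gamma=0.99))  # about 9.8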

Given a policy π for an MDP, the state value-function Vπ : S → R assigns to each state s the expected return starting from state s and following policy π.

Definition 4 Given an MDP and a policy π, the state value-function in a state s ∈ S is the expected return starting from state s according to policy π:


V^π(s) = E_{τ ∼ p(· | P, π, s_0 = s)} [R(τ)],   ∀s ∈ S    (2.9)

The state value-function provides the agent with a way to compute the optimal behaviour: we first compute the optimal state value-function and then determine an optimal behaviour with relative ease. The optimal value-function specifies the best possible performance for the given MDP.

Definition 5 Given an MDP and the set of policies Π defined over the MDP, the optimal state value-function in a state s ∈ S is given by:

V*(s) = max_{π∈Π} V^π(s),   ∀s ∈ S    (2.10)

Let us now introduce another value-function defined over the set of state-action pairs, called state-action value-function.

Definition 6 Given an MDP and a set of policies Π defined over the MDP, the action value-function of a state-action pair (s ∈ S, a ∈ A) is the expected return starting from state s, performing action a and then following policy π:

Q^π(s, a) = E_{τ ∼ p(· | P, π, s_0 = s, a_0 = a)} [R(τ)],   ∀s ∈ S, a ∈ A    (2.11)

Unlike the state value-function, the action value-function can be used to measure the performance of the possible actions A(s) in a given state s without knowledge of the transition model P , therefore the agent may use this information to choose the best action.

Definition 7 Given an MDP and the set of policies Π defined over the MDP, the optimal action value-function of a state-action pair (s ∈ S, a ∈ A) is given by:

Q*(s, a) = max_{π∈Π} Q^π(s, a),   ∀s ∈ S, a ∈ A    (2.12)

Once the optimal action value-function is known, the agent can choose the greedy action at each time step, i.e. the action that yields the maximum utility, a_t = argmax_{a∈A(s)} Q*(s, a).


2.1.4 Bellman expectation

The relationship between the state value-function and the action value-function is given by:

V^π(s) = ∫_A π(a | s) Q^π(s, a) da,   ∀s ∈ S    (2.13)

and conversely:

Q^π(s, a) = E[R(s, a)] + γ ∫_S P(s′ | s, a) V^π(s′) ds′,   ∀s ∈ S, a ∈ A    (2.14)

This relationship induces a recursive definition of the state value-function:

V^π(s) = ∫_A π(a | s) Q^π(s, a) da
       = ∫_A π(a | s) [ E[R(s, a)] + γ ∫_S P(s′ | s, a) V^π(s′) ds′ ] da

Definition 8 The Bellman expectation operator T^π : R^S → R^S maps state value-functions to state value-functions:

(T^π V)(s) = E_{a∼π(·|s), s′∼P(·|s,a)} [R(s, a) + γ V(s′)]    (2.15)

It is possible to prove that V^π is the fixed point of the Bellman expectation operator, i.e. T^π V^π = V^π.

Equivalently, the action value-function can be written recursively as:

Q^π(s, a) = E[R(s, a)] + γ ∫_S P(s′ | s, a) V^π(s′) ds′
          = E[R(s, a)] + γ ∫_S P(s′ | s, a) [ ∫_A π(a′ | s′) Q^π(s′, a′) da′ ] ds′

Therefore we can define the Bellman expectation operator for the action value-function as well.

Definition 9 The Bellman expectation operator K^π : R^{S×A} → R^{S×A} maps action value-functions to action value-functions:

(K^π Q)(s, a) = E_{s′∼P(·|s,a), a′∼π(·|s′)} [R(s, a) + γ Q(s′, a′)]    (2.16)


Q^π is the fixed point, i.e. K^π Q^π = Q^π. We can find the fixed points V^π and Q^π by repeatedly applying the Bellman operators².

If the model is fully known, i.e. we know the MRP (S, P^π, R^π, µ, γ) induced by the policy π, we can find V^π and Q^π analytically by solving a linear system. In matrix form:

V^π = R^π + γ P^π V^π
(I − γ P^π) V^π = R^π
V^π = (I − γ P^π)^{−1} R^π    (2.17)

and equivalently:

Q^π = R + γ P π Q^π
(I − γ P π) Q^π = R
Q^π = (I − γ P π)^{−1} R    (2.18)

Equations 2.17 and 2.18 are also known as the Bellman expectation equations.
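For a finite MRP, equation 2.17 can be solved directly with a linear solver. The snippet below does so for a hypothetical two-state chain; the P^π and R^π values are invented for illustration.

    import numpy as np

    # Hypothetical MRP induced by some fixed policy pi.
    P_pi = np.array([[0.7, 0.3],
                     [0.4, 0.6]])      # P^pi[s, s']
    R_pi = np.array([1.0, -0.5])       # expected immediate reward per state
    gamma = 0.9

    # V^pi = (I - gamma * P^pi)^{-1} R^pi  (eq. 2.17), solved without an explicit inverse.
    V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    print(V_pi)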

2.1.5 Bellman optimality

The Bellman expectation operators and equations can be used to evaluate the performance of a given policy π in an MDP. This leaves us with the problem of identifying the optimal policy π∗ among all the possible policies, which is the one that maximizes the utility function of the agent.

Definition 10 A policy π∗ is optimal if it satisfies the condition:

V^{π*} = V*

In other words, the value-function induced by the optimal policy π∗ is the optimal value-function. Thus, a naive approach to learning the optimal policy is to enumerate all the policies and evaluate them by computing the value-functions Vπ, ∀π ∈ Π. To do this we need to define an ordering over the policy space.

² The convergence to the fixed point is guaranteed by the max-norm contraction property of the Bellman expectation operators.


Definition 11 Given an MDP and two policies π, π′ ∈ Π, policy π is better than or equal to policy π′ iff the value-function of π is greater than or equal to the value-function of π′:

π ≽ π′ ⟺ V^π(s) ≥ V^{π′}(s),   ∀s ∈ S

It is clear that an optimal policy π∗ is such that:

π* ≽ π,   ∀π ∈ Π

A better method consists in directly finding the optimal value-function V* and extracting the optimal policy. If we write the equation that relates state value-functions to action value-functions for the optimal value-function, we obtain:

V*(s) = max_{π∈Π} V^π(s)
      = max_{π∈Π} ∫_A π(a | s) Q^π(s, a) da
      = max_{a∈A} Q*(s, a)
      = max_{a∈A} [ R(s, a) + γ ∫_S P(s′ | s, a) V*(s′) ds′ ]

We can now define the Bellman optimality operator for state value-functions.

Definition 12 The Bellman optimality operator for state value-functions T* : R^S → R^S maps state value-functions to state value-functions and is defined as:

(T* V)(s) = max_{a∈A} { R(s, a) + γ E_{s′∼P(·|s,a)} [V(s′)] }    (2.19)

V* is the unique fixed point. Analogously, Q* is the unique fixed point of the Bellman optimality operator for action value-functions.

Definition 13 The Bellman optimality operator for action value-functions K* : R^{S×A} → R^{S×A} maps action value-functions to action value-functions:

(K* Q)(s, a) = R(s, a) + γ E_{s′∼P(·|s,a)} [ max_{a′∈A} Q(s′, a′) ]    (2.20)


Due to the non-linearity of the max operator no closed-form solution exists, but we can find V* and Q* by iteratively applying the optimality operators, i.e. given any V and Q, V* = lim_{k→∞} (T*)^k V and Q* = lim_{k→∞} (K*)^k Q.

Given the optimal action value-function, we can easily extract an optimal deterministic policy π* by greedily choosing the best action in each state s ∈ S:

π*(s) = argmax_{a∈A(s)} Q*(s, a)    (2.21)

Theorem 1 For any MDP the following statements hold:

• there exists an optimal policy π* that is better than or equal to all other policies, i.e. π* ≽ π, ∀π ∈ Π;

• all optimal policies π* achieve the optimal value-function, i.e. V^{π*} = V*;

• there always exists a deterministic optimal policy.

The theorem ensures the existence of an optimal (deterministic) policy but does not guarantee its uniqueness: for example, if there exist two actions a, a′ ∈ A(s) such that Q*(s, a) = Q*(s, a′), then there exist at least two distinct deterministic optimal policies.

2.1.6 Solving MDPs

Solving an MDP consists in finding an optimal policy. There are various methods to approach this kind of task, and which to choose depends on the setting of the learning problem at hand.

In general we can identify two classes of learning algorithms:

• value-based methods estimate the state value-function V^π or the action value-function Q^π, from which the policy π is extracted. Two important algorithms in this class are value iteration and policy iteration;

• policy-based methods explore the policy space directly to find an optimal policy.

Another important distinction is between model-based learning algorithms, in which the agent has full knowledge of the model of the environment, and model-free learning algorithms, in which the agent learns the model of the environment by interacting with it. In the latter case, we further classify the algorithms as:


• online, if the interaction phase and the learning phase are interleaved, vs. offline, if the two phases are separated;

• on-policy, if the policy learned by the agent is also used for exploration, vs. off-policy, if the agent learns an optimal policy while following another policy when interacting with the environment.

In the next sections, we discuss techniques to solve the control problem in MDPs.

2.2 Dynamic Programming

If we know the transition model P of the MDP, we can solve the MDP using dynamic programming.

Dynamic programming [4] is a general method to solve problems that have two properties: overlapping subproblems and optimal substructure.

Property 2 (Overlapping subproblems) A problem is said to have overlapping subproblems if it can be subdivided into subproblems which recur several times.

Property 3 (Optimal substructure) A problem is said to have optimal substructure if an optimal solution can be constructed from optimal solutions of its subproblems.

Markov Decision Processes satisfy both properties: the Bellman equations provide a recursive decomposition, and value-functions provide a way to cache and reuse the solutions of subproblems.

2.2.1 Value iteration

Value iteration is an off-policy method that finds the optimal value-function V* without computing any intermediate policy. This method is based on the iterative application of the Bellman optimality operator. Starting from an initial value-function V_0, we repeatedly apply the optimality operator until a stopping condition is met.

The optimal policy is extracted by taking the greedy policy induced by the state-value-function. This step requires knowledge of the transition model.


Input: M = (S, A, P, R, T, γ)
Output: π*
    Initialize value-function V_0;
    for t = 0, 1, ..., T − 1 do
        /* Apply Bellman optimality operator */
        V_{t+1} ← T* V_t;
    end
    /* Compute optimal policy */
    π*(s) = argmax_{a∈A} { R(s, a) + γ E_{s′∼P(·|s,a)} [V_T(s′)] },  ∀s ∈ S;
    return optimal policy π*

Algorithm 1: Value iteration.
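A minimal numpy version of Algorithm 1 for a finite MDP, using the P[s, a, s'] / R[s, a] array layout sketched earlier; the fixed number of sweeps stands in for the horizon T of the pseudo-code, and all names are illustrative.

    import numpy as np

    def value_iteration(P, R, gamma, n_sweeps=100):
        """P[s, a, s'] transition model, R[s, a] expected reward."""
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        for _ in range(n_sweeps):
            # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
            Q = R + gamma * P @ V
            V = Q.max(axis=1)
        # Greedy policy extraction (requires the transition model).
        policy = (R + gamma * P @ V).argmax(axis=1)
        return V, policy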

2.2.2 Policy iteration

Policy iteration is an on-policy algorithm that consists of two interacting processes: evaluating the policy π by computing the value-function Vπ; improving the policy π, i.e. making the policy greedy with respect to the computed value-function.

Input: M = (S, A, P, R, T, γ)
Output: π*
    Initialize policy π_0;
    for t = 0, 1, ..., T − 1 do
        Evaluate policy π_t, obtaining V^{π_t};
        /* Make policy π_t greedy w.r.t. V^{π_t} */
        π_{t+1}(s) ← argmax_{a∈A} { R(s, a) + γ E_{s′∼P(·|s,a)} [V^{π_t}(s′)] },  ∀s ∈ S;
    end
    return optimal policy π_T

Algorithm 2: Policy iteration.

The policy evaluation problem can be solved either via the closed-form solution 2.17 or by iteratively applying the Bellman expectation operator of Definition 8.

Sutton and Barto [1] use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact. Many reinforcement learning methods are well described as GPI. The GPI process is illustrated in fig. 2.2.



Figure 2.2: Interaction of evaluation and improvement phases in GPI algorithms

2.3 Model-free prediction

Usually, the transition model of the MDP is not known, or the size of the model makes solving it via dynamic programming infeasible, in which cases we have to resort to model-free algorithms. In model-free prediction, the agent learns in a trial-and-error fashion, using the experience collected during the interaction with the environment.

2.3.1 Monte Carlo methods

Monte Carlo (MC) methods are a broad class of randomized algorithms that rely on repeated random sampling to obtain a numerical result. MC learns directly from episodes of experience, using the simple idea that the mean return is a consistent estimator for the value-function.

Recall that the value-function of a state s (Definition 4) is the expected return starting from state s and following policy π. MC uses the empirical mean return instead of the expected return.

Definition 14 Let X be a random variable with mean µ = E[X] and variance σ² = Var[X]. Let x_i ∼ X, i = 1, 2, ..., n be i.i.d. realizations of X. The empirical mean of X is:

μ̂ = (1/n) ∑_{i=1}^{n} x_i

MC first-visit averages returns only for the first visit of state s. MC every-visit averages returns for every visit of state s. MC every-visit is a consistent but biased estimator, and usually converges faster than MC first-visit.

The return value v_t of a state s_t is the discounted sum of the immediate rewards collected along the rest of the episode:

τ = {(s_0, a_0, r_0), ..., (s_n, a_n, r_n)},    v_t = ∑_{i=t}^{n} γ^{i−t} r_i    (2.22)

And the update rule is given by:

V(s_t) ← V(s_t) + α (v_t − V(s_t))

where usually α = 1/N(s_t), N(s_t) being the number of times state s_t has been visited.
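A sketch of first-visit Monte Carlo prediction under the stated averaging rule α = 1/N(s). Episodes are assumed to be lists of (s_t, r_{t+1}) pairs collected while following π; the interface and names are illustrative assumptions.

    from collections import defaultdict

    def mc_first_visit(episodes, gamma):
        """episodes: list of [(state, reward), ...] sampled while following pi."""
        V = defaultdict(float)
        N = defaultdict(int)
        for episode in episodes:
            G = 0.0
            returns = []
            # Compute returns backwards: G_t = r_{t+1} + gamma * G_{t+1}.
            for state, reward in reversed(episode):
                G = reward + gamma * G
                returns.append((state, G))
            returns.reverse()
            seen = set()
            for state, G in returns:
                if state in seen:          # first-visit: update once per episode
                    continue
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]   # alpha = 1 / N(s)
        return dict(V)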

2.3.2 Temporal difference learning

The main disadvantage of Monte Carlo methods is that they can only be used for episodic MDPs, because they require sampling entire episodes. Temporal difference (TD) learning is another model-free algorithm used to estimate the value-function of an MDP.

TD replaces the MC return value with:

v_t = r_{t+1} + γ V(s_{t+1})    (2.23)

Thus the update rule becomes:

V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t)),   α ∈ [0, 1]    (2.24)

TD can be extended using the n-step return:

v_t^{(n)} = ∑_{i=1}^{n} γ^{i−1} r_{t+i} + γ^n V(s_{t+n})    (2.25)

As n → ∞, the TD target v_t^{(n)} converges to the MC return value 2.22.
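The corresponding TD(0) prediction update (eq. 2.24), which bootstraps from the current estimate instead of the full return; the transition format and names are illustrative assumptions.

    from collections import defaultdict

    def td0_prediction(transitions, gamma, alpha=0.1):
        """transitions: iterable of (s, r, s_next, done) sampled while following pi."""
        V = defaultdict(float)
        for s, r, s_next, done in transitions:
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # eq. 2.24
        return dict(V)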

2.4 Model-free control

MC and TD methods can also be used to estimate the optimal value-function V* and to find an optimal policy π*. The procedure is analogous to the value iteration algorithm (Algorithm 1), except that we need to estimate the transition model P. Otherwise, we can directly learn the optimal action value-function and take the induced greedy policy.


2.4.1 SARSA learning

SARSA is an on-line, model-free RL algorithm that learns about policy π from experience sampled according to policy π itself. At each time step t, we draw action a_t ∼ π_t(· | s_t) from the policy π_t derived from the action value-function Q_t, collect the reward r_t and update the action value-function using the following update rule:

Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α [ r_t + γ Q_t(s_{t+1}, π_t(s_{t+1})) − Q_t(s_t, a_t) ]    (2.26)

The algorithm pseudo-code is reported below:

Input: M = (S, A, P, R, T, γ)
Output: π*
    Initialize Q(s, a);
    repeat
        Pick a random initial state s;
        Draw a from the policy derived from Q;
        repeat
            Take action a, observe the new state s′ and reward r;
            Draw a′ from the policy derived from Q;
            /* Update Q value-function */
            Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)];
            /* Update current state and action */
            s ← s′;  a ← a′;
        until s is terminal;
    until the desired number of episodes has been run;
    return greedy policy derived from Q

Algorithm 3: SARSA learning.

The SARSA method is based on GPI [1]: at each iteration, we first evaluate the policy (policy-evaluation), then improve the policy according to the computed action value-function (policy-improvement).

As with most on-line approaches, we have to face the exploration-exploitation dilemma, i.e. the trade-off between the need to obtain new knowledge and the need to use that knowledge to improve performance. In SARSA, it comes down to how we derive the policy from Q. For example, if we were to use the greedy policy π(s) = argmax_{a∈A} Q(s, a) we would select at each time step the action we consider to be the best, but if we select the same


action over and over we do not gain new knowledge. Therefore we need a mechanism to balance exploration and exploitation.

The simplest solution is to use the ε-greedy policy, which with probability ε selects a random action a among all possible actions (exploration), and with probability 1 − ε selects the greedy action, i.e. the action with the highest Q-value (exploitation):

π(a | s) = ε/m + 1 − ε   if a = argmax_{a′∈A(s)} Q(s, a′)
           ε/m           otherwise

with m = |A(s)|.
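A small helper implementing the ε-greedy rule above for a tabular Q stored as a dictionary; with probability ε a uniformly random action is taken, otherwise the greedy one. The data layout is an illustrative assumption.

    import random

    def epsilon_greedy(Q, state, actions, epsilon):
        """Q: dict mapping (state, action) pairs to values; actions: actions available in `state`."""
        if random.random() < epsilon:
            return random.choice(actions)                          # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit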

Theorem 2 For any ε-greedy policy π, the ε-greedy policy π′ with respect to Q^π is an improvement [1].

ε is either fixed or controlled over time. If the value of ε decreases over time, i.e. lim_{t→∞} ε_t = 0, the ε-greedy policy converges to a greedy policy.

Definition 15 A learning policy is Greedy in the Limit of Infinite Exploration (GLIE) if it satisfies the following properties:

• all state-action pairs are visited infinitely many times:

lim_{t→∞} N_t(s, a) = ∞

• the policy converges to a greedy policy:

lim_{t→∞} π_t(a | s) = 1   if a = argmax_{a′∈A(s)} Q_t(s, a′)
                        0   otherwise

An ε-greedy policy with ε = 1/t satisfies both conditions. The GLIE property is of particular importance because it ensures that both exploration and exploitation are achieved.

2.4.2 Q-learning

Unlike SARSA, Q-learning [5] is off-policy, which means that it learns about the optimal policy while following a different one. The action to perform is drawn from a policy π derived from the Q value-function, but the update rule for Q is given by:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′∈A(s′)} Q(s′, a′) − Q(s, a) ]    (2.27)


Input: M = (S, A, P, R, T, γ)
Output: π*
    Initialize Q(s, a);
    repeat
        Pick a random initial state s;
        repeat
            Draw action a from the policy π(· | s) derived from Q;
            Observe the new state s′ and the immediate reward r;
            /* Update Q value-function */
            Q(s, a) ← Q(s, a) + α [r + γ max_{a′∈A(s′)} Q(s′, a′) − Q(s, a)];
            /* Update current state */
            s ← s′;
        until s is terminal;
    until stopping condition;
    return greedy policy derived from Q

Algorithm 4: Q-learning.

The pseudo-code of the algorithm is reported in Algorithm 4. Again, the ε-greedy policy is a valid choice for exploration.
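A tabular sketch of Algorithm 4 with ε-greedy exploration. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the hyper-parameter values are assumptions made for illustration.

    import random
    from collections import defaultdict

    def q_learning(env, actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=500):
        """Tabular Q-learning; `env` is assumed to expose reset() and step(a) -> (s', r, done)."""
        Q = defaultdict(float)
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy behaviour policy
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(a)
                # off-policy target: greedy value of the next state (eq. 2.27)
                best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q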

SARSA and Q-learning have the same update rule under the condition that we use the greedy policy:

π(a | s) = 1   if a = argmax_{a′∈A(s)} Q(s, a′)
           0   otherwise

In fact, the update rule for SARSA becomes:

Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)],   a′ ∼ π(· | s′)
Q(s, a) ← Q(s, a) + α [r + γ max_{a′∈A(s′)} Q(s′, a′) − Q(s, a)]

which is equivalent to the Q-learning update rule 2.27.

2.5 Function Approximation

The RL algorithms presented so far make use of a tabular representation for the value-function. While tabular algorithms are easy and efficient to implement, they cannot be used when the state-action space is continuous, for obvious reasons, or when it is too large to manage, due to memory and time limitations.

The solution proposed in the literature is to use function approximation to approximate the value-function or the policy itself. In the context of function approximation, learning consists in choosing a function V̂ from a hypothesis space H that minimizes some notion of loss function J:

V̂ = argmin_{f∈H} J(f)

The loss function is usually expressed in terms of distance from the real value-function V^π:

J(f) = ||V^π − f||_q,   q = 0, 1, 2, ...

If the hypothesis space is a parametric model, our goal reduces to the more familiar problem of parameter estimation. Let H be a space of functions defined by the parameter vector θ ∈ Θ. The function approximation problem can then be solved by finding the parameter θ̂ that minimizes the loss function:

θ̂ = argmin_{θ∈Θ} J(f_θ)

2.5.1 Deep Q Learning

The Deep Q Learning (DQL) [6] algorithm, also known as Deep Q Network (DQN), is an on-line, model-free algorithm that uses a deep Neural Network (NN) to approximate the action value-function. The algorithm was first introduced in [6], where it was used to beat various Atari 2600 games by learning to map raw pixels to in-game actions.

DQN is based on GPI. In its simplest form, it uses Stochastic Gradient Descent (SGD) to update the weights of the parameterized Q-function, using the estimated value r + γ max_{a′} Q̂(s′, a′) as the target of the update. At each iteration t, we draw a new action a_t from an ε-greedy policy, perform a_t and observe the new state s_{t+1} and the reward signal r_t. Then we build the target y_t = r_t + γ max_{a′} Q̂(s_{t+1}, a′) and perform a gradient descent step on (y_t − Q̂(s_t, a_t))².

The pseudo-code of vanilla DQN is reported in Algorithm 5.

Input: M = {S, A, P, R, µ, γ}, M, T
Output: Q̂*
    Initialize the network Q̂ with random weights;
    for episode = 1, ..., M do
        Observe the initial state s_0;
        for t = 0, ..., T − 1 do
            With probability ε draw a random action a_t, otherwise select the action a_t = argmax_a Q̂(s_t, a);
            Execute action a_t and observe the new state s_{t+1} and reward r_t;
            Set y_t = r_t                               if s_{t+1} is terminal
                      r_t + γ max_{a′} Q̂(s_{t+1}, a′)   otherwise;
            Perform a gradient descent step on (y_t − Q̂(s_t, a_t))²;
        end
    end
    return Q̂

Algorithm 5: The vanilla DQN algorithm.

Usually, in order to improve the performance of the learning phase, we introduce an Experience Replay buffer, whose purpose is to store and reuse past experience. There are mainly two advantages to using an experience replay buffer. First of all, we can perform the gradient descent step on a mini-batch of samples rather than on a single datum; this greatly improves the stability of the learning process. Secondly, the mini-batch is obtained by randomly sampling the experience buffer, which means that the data used to perform the gradient descent step are more likely to be i.i.d.
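A minimal replay buffer of the kind described above: transitions are stored in a bounded FIFO and mini-batches are drawn uniformly at random, which is what makes the samples closer to i.i.d. The class name and default capacity are ours, not from the thesis.

    import random
    from collections import deque

    class ReplayBuffer:
        """Bounded FIFO store of transitions with uniform random mini-batch sampling."""

        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)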

The pseudo-code of DQN with experience replay is reported in Algorithm 6.

Another popular variation of DQN uses two networks, identified by their weights θ and θ′, to further improve the stability of the learning process. The original network Q̂_θ still receives the gradient updates, whereas the other network Q̂_{θ′} is used to retrieve the Q-values of the targets:

y_j = r_j                                 if s_{j+1} is terminal
      r_j + γ max_a Q̂_{θ′}(s_{j+1}, a)    otherwise

After a certain number of updates, we synchronize the networks:

θ′ ← θ

The problem with using a single network is that the update for a state-action pair (s_j, a_j) also shifts the estimates of other state-action pairs; if the same network is used to compute the targets, these shifts feed back into the targets and can destabilize the learning process.


Input: M = {S, A, P, R, µ, γ}, M, T
Output: Q̂*
    Initialize the network Q̂ with random weights;
    Initialize the experience replay buffer D of size N;
    for episode = 1, ..., M do
        Observe the initial state s_0;
        for t = 0, ..., T − 1 do
            With probability ε draw a random action a_t, otherwise select the action a_t = argmax_a Q̂(s_t, a);
            Execute action a_t and observe the new state s_{t+1} and reward r_t;
            Store the transition (s_t, a_t, r_t, s_{t+1}) in D;
            Sample a mini-batch of n transitions from D;
            For j = 1, ..., n set
                y_j = r_j                              if s_{j+1} is terminal
                      r_j + γ max_a Q̂(s_{j+1}, a)      otherwise;
            Perform a gradient descent step on (y_j − Q̂(s_j, a_j))²;
        end
    end
    return Q̂

Algorithm 6: DQN with experience replay.
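As a concrete illustration of the inner loop of Algorithm 6, the sketch below performs one gradient step on a uniformly sampled mini-batch in PyTorch. It assumes `q_net` is an `nn.Module` mapping a batch of states to one Q-value per action and `buffer` exposes a `sample(batch_size)` method like the replay buffer sketched earlier; the hyper-parameters are illustrative, not those used in the experiments.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
        """One DQN gradient step on a mini-batch drawn from the replay buffer."""
        states, actions, rewards, next_states, dones = buffer.sample(batch_size)
        states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)
        next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
        dones = torch.as_tensor(dones, dtype=torch.float32)

        # y_j = r_j if the episode ended, r_j + gamma * max_a Q(s_{j+1}, a) otherwise.
        with torch.no_grad():
            targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

        # Gradient descent step on (y_j - Q(s_j, a_j))^2.
        q_values = q_net(states).gather(1, actions).squeeze(1)
        loss = F.mse_loss(q_values, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()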


An alternative to this periodic hard synchronization, widely adopted in practice, is to perform a soft update of the target network at each iteration:

θ′ ← τ θ + (1 − τ) θ′

where τ ∈ (0, 1) is a small constant.

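A sketch of the soft update written for two PyTorch modules with the same architecture; `tau` and the function name are illustrative assumptions.

    import torch

    @torch.no_grad()
    def soft_update(target_net, online_net, tau=0.005):
        # theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter.
        for target_param, param in zip(target_net.parameters(), online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)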

Chapter 3

Batch Reinforcement Learning

The term Batch Reinforcement Learning is used to describe an RL setting in which the learning experience is given a priori and fixed, in contrast to pure on-line learning methods, where experience is sampled indefinitely during the learning process.

Sutton and Barto [1] define the batch learning problem as the task of finding a policy that maximizes the expected return, without interacting with the environment during the learning process. The learner only receives a set X = {(s_t, a_t, r_t, s_{t+1}) | t = 0, 1, ..., n} of n transitions sampled from the environment, whose nature is not known: they may be sampled by a pure-random policy; they may not even be sampled along connected trajectories. Using this information, the learner has to derive the best possible policy, which will be used to interact with the environment in the future. Since the transition set is finite and the learner is not allowed to interact further with the environment, it might not be possible to derive an optimal policy, thus the objective is simply to derive the best possible policy.

Differently from on-line learning methods, batch methods do not suffer from the usual exploration-exploitation dilemma for the simple reason that the exploration phase is fully decoupled from the learning task.

In Section 3.1 we describe the fitted value iteration (FVI) algorithm [7], which serves as a basis to introduce the fitted Q iteration (FQI) algorithm in Section 3.2. Finally, in Section 3.3 we analyze various tree-based regression algorithms and their application in the FQI algorithm.


3.1 Fitted value iteration

The fitted value iteration (FVI) [7] algorithm is a batch method that computes an approximate representation of the state value-function using a fixed batch of samples.

FVI is an extension of the value iteration algorithm (Algorithm 1). The value iteration algorithm works by repeatedly applying the Bellman optimality operator T*, until a stopping condition is met:

V_{t+1} ← T* V_t

The value iteration algorithm uses a vector representation for the value-function V. This allows us to represent any possible value-function exactly, but it does not scale well and does not work out of the box for MDPs with continuous state-spaces (a very simple workaround is to discretize the state-space).

FVI replaces the tabular representation with a more compact representation. Rather than setting V_{t+1} to T* V_t, we first compute (T* V_t)(x) only for the states x in the batch X ⊂ S, then fit an approximator to these training values, and call the resulting function V_{t+1}. The batch X should be reasonably small, but it should be representative enough that we can learn something about the MDP by examining only the states x ∈ X.

It is convenient to define a function approximation operator Π_X : (S → R) → (S → R), for any X ⊂ S, that maps state value-functions to state value-functions. The input of the operator Π_X is a function f that represents the training values computed at the collected states, while the output is a function f̂ = Π_X f that represents the fitted values. In this definition the states in the batch X are fixed, and changing X results in a different function approximator.

The FVI update rule is given by:

V_{t+1} ← Π_X (T* V_t)    (3.1)

The pseudo-code for the FVI algorithm is reported in Algorithm 7.

Input: π, X ⊂ S, T
Output: V_T
    Initialize V_0;
    for t = 0, 1, ..., T − 1 do
        /* Approximate V using the samples in X */
        V_{t+1} ← Π_X (T* V_t);
    end
    return V_T

Algorithm 7: Fitted value iteration.

The characteristics of the approximator Π_X determine how it behaves when combined with value iteration. In particular, Gordon [7] emphasizes how certain approximators, such as linear regressors and neural nets, do not ensure convergence, whereas others, like k-nearest-neighbour (KNN), do. More precisely, many function approximators exaggerate the difference between two target functions, i.e. given two target functions f and g, the fitted functions f̂ = Π_X f and ĝ = Π_X g are farther apart in max norm than f and g are:

||f̂ − ĝ||_∞ > ||f − g||_∞

Theorem 3 Let M be an MDP with discount γ < 1 and let Π_X be a function approximator. If Π_X is a nonexpansion in max norm, then the operator Π_X ∘ T* has contraction factor γ and the fitted value iteration algorithm based on Π_X converges in max norm at rate γ.

3.2 Fitted Q Iteration

The fitted Q iteration algorithm [8] is a batch-mode RL algorithm that applies the idea of FVI to the problem of Q-learning. The algorithm yields an approximate representation of the action value-function Q_T corresponding to a T-step optimization horizon.

Much like FVI, FQI uses a function approximation operator Π_X to fit the action value-function using training samples from a fixed batch X = {(s_n, a_n, r_n, s_{n+1}) | n = 0, 1, ..., N − 1}:

• the approximate representation of Q_1 is obtained by fitting the reward function, i.e. applying a regression algorithm to a training set whose inputs are the pairs (s_n, a_n) and whose targets are the immediate rewards r_n;

• subsequent iterations derive an approximation of Q_t by refreshing the targets of the training set using the Q-learning target return and applying the regression algorithm to the resulting training set:

q_{n,t} = r_n + γ max_{a∈A} Q̂_{t−1}(s_{n+1}, a),   n = 0, 1, ..., N − 1

The pseudo-code of the FQI algorithm is reported in Algorithm 8.

Input: X = {(s_n, a_n, r_n, s_{n+1}) | n = 0, 1, ..., N − 1}, T
Output: Q_T
    /* Fit the reward function */
    Q_1 ← Π_X R;
    for t = 1, ..., T − 1 do
        /* Refresh targets using the Q-learning target return value */
        q_{t+1,n} ← r_n + γ max_a Q̂_t(s_{n+1}, a),   n = 0, 1, ..., N − 1;
        /* Fit the target return value */
        Q_{t+1} ← Π_X q_{t+1};
    end
    return Q_T

Algorithm 8: Fitted Q iteration.
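A compact sketch of Algorithm 8 using a tree-based regressor from scikit-learn (ExtraTreesRegressor), in the spirit of the tree-based methods discussed in Section 3.3. The batch layout, hyper-parameters and the helper `greedy_action` are illustrative assumptions, not the exact setup used in the experiments; terminal transitions are ignored for brevity.

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fitted_q_iteration(batch, actions, gamma, n_iterations=50):
        """batch: list of (s, a, r, s_next) with s a 1-D feature array and a a numeric action."""
        states = np.array([s for s, a, r, s_next in batch])
        acts = np.array([[a] for s, a, r, s_next in batch])
        rewards = np.array([r for s, a, r, s_next in batch])
        next_states = np.array([s_next for s, a, r, s_next in batch])

        inputs = np.hstack([states, acts])          # regression inputs are (s, a) pairs

        # Q_1: fit the reward function.
        q = ExtraTreesRegressor(n_estimators=50, min_samples_leaf=5).fit(inputs, rewards)

        for _ in range(n_iterations - 1):
            # Refresh targets: q_n = r_n + gamma * max_a Q_hat(s_{n+1}, a).
            next_q = np.column_stack([
                q.predict(np.hstack([next_states, np.full((len(batch), 1), a)]))
                for a in actions
            ])
            targets = rewards + gamma * next_q.max(axis=1)
            # Refit the regressor on the refreshed targets.
            q = ExtraTreesRegressor(n_estimators=50, min_samples_leaf=5).fit(inputs, targets)
        return q

    def greedy_action(q, state, actions):
        """Greedy policy induced by the fitted Q-function."""
        values = [q.predict(np.hstack([state, [a]]).reshape(1, -1))[0] for a in actions]
        return actions[int(np.argmax(values))]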

Contrary to stochastic approximation algorithms [2, 9], which can only use parametric function approximators (for example, linear regressors and neural nets), FQI allows us to take full advantage of the generalization capabilities of any regressor.

As we did for FVI, we can study the convergence of the fitted Q iteration algorithm.

Definition 16 The fitted Q iteration algorithm is said to converge if there exists a function Q̂ such that for all ε > 0 there exists a T ∈ N such that:

||Q̂_t − Q̂|| < ε,   ∀t > T

Ernst [8] proved that all kernel-based methods [7] described by eq. 3.2 satisfy the above condition:

f(z) = ∑_{i=0}^{|X|−1} k_X(x_i, z) · t_i    (3.2)

with the kernel k_X being fixed from one iteration to the other and satisfying the normalizing condition:

∑_{i=0}^{|X|−1} |k_X(x_i, z)| = 1,   ∀z    (3.3)

Kernel-based methods include k-nearest-neighbours methods, partition and multi-partition methods, and linear and multi-linear interpolation.

It is important to understand that the definition 16 does not guarantee convergence of ˆQ to the real, optimal value-function Q∗; the propagation of the approximation error in FQI is thoroughly studied in [10].

3.3 Tree-based methods

Ernst [8] poses particular emphasis on tree-based regression algorithms, non-parametric models that offer great modeling flexibility, a property of particular importance for the FQI framework, since the regression algorithm must be able to model any function Q in the sequence {Q_t | t = 1, 2, ..., T}.

A regression tree [7] is a regression model that partitions the input-space into distinct regions and assigns to each region a constant value, determined by averaging the target values of the elements of the training set which belong to that region, as shown in fig. 3.1.

Figure 3.1: The algorithm partitions the input space into distinct regions, as shown in fig. 3.1a (input-space partition); fig. 3.1b shows the equivalent tree representation.

The model generated by a regression tree can be described using eq. 3.2, replacing the kernel with the expression:

k_X(z, z′) = I_{S(z′)}(z) / ∑_{(x,t)∈X} I_{S(z′)}(x)    (3.4)

where I_R(·) is the characteristic function of the region R:

I_R(z) = 1   if z ∈ R
         0   otherwise

and S(·) is the function that assigns to each sample the region it belongs to. It is possible to prove [8] that the kernel defined by eq. 3.4 satisfies the normalizing condition 3.3.

The model produced by an ensemble of regression trees averages the predictions of the different trees to make the final prediction. Such a model can be described in the same form, and its kernel satisfies the normalizing condition as well.

Algorithms implementation

Most tree-based algorithms are top-down, in that they create their partition starting from a single subset and progressively splitting its subsets into different regions. At each node, the algorithm selects a cut-direction (or feature) x_i and a cut-point v according to some criterion, and splits the subset into two disjoint partitions p_0 = [x_i ≤ v] and p_1 = [x_i > v].

The classification and regression tree (CART) [11] algorithm, for instance, selects the configuration (i.e., cut-direction and cut-point) that maximizes the average variance reduction of the output variable.

The k-dimensional trees (kd-tree) algorithm, instead, picks the cut-direction incrementally (e.g. if the cut-direction of the parent node is x_i, it picks x_{i+1} for the current node) and the cut-point at the local median, so that each subset is split into two subsets of the same cardinality.

CART trees are also used to build ensembles of regression trees, along with the bagging algorithm [12]. Each tree of the ensemble is built using a bootstrap replica of the training set, obtained by randomly sampling with replacement the same number of elements. Bagging usually reduces the model variance dramatically, but significantly increases computing times.

Like tree-bagging, the extremely randomized trees algorithm [13], also called extra-trees, works by building several regression trees from the training set, but unlike tree-bagging, where each tree is built from a bootstrap replica of the training set using the CART algorithm, all trees are built using the entire training set. The configuration of each node is chosen by evaluating the score of K random features, at random cut-points, and picking the one that yields the best score. The algorithm stops splitting a subset when the number of elements in the subset is less than n_min.

Note that with a value of K = 1 the tree is built in a totally random fashion (i.e. each node configuration is selected randomly), in which case the configuration does not depend on the output values t of the training set. This specific case of extremely randomized trees is called totally randomized trees [8].
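For reference, in scikit-learn's ExtraTreesRegressor the parameters described above map roughly onto `max_features` (the number K of random candidate features scored at each split) and `min_samples_split`/`min_samples_leaf` (related to the n_min stopping threshold); the values below are illustrative, not those used in the experiments.

    from sklearn.ensemble import ExtraTreesRegressor

    # K random candidate features per split and a minimum subset size before splitting stops.
    extra_trees = ExtraTreesRegressor(
        n_estimators=50,      # number of trees in the ensemble
        max_features=5,       # K: candidate cut-directions evaluated at each node
        min_samples_leaf=2,   # related to the n_min stopping threshold
    )

    # With max_features=1 each node configuration is chosen without looking at the outputs,
    # which mirrors the totally randomized trees described above.
    totally_random_like = ExtraTreesRegressor(n_estimators=50, max_features=1)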

Convergence of tree-based methods

Since the models produced by tree-based methods may be described by eq. 3.2, with kernels k_X satisfying the normalizing condition 3.3, convergence is guaranteed as long as the kernel remains the same from one iteration to the other. This latter condition is satisfied if the tree structure does not change throughout all iterations. The only tree-based algorithm that has this property is the kd-trees algorithm, since it is deterministic and its structure does not depend on the output values of the training set.

The tree structure generated by the totally randomized trees algorithm does not depend on the output values of the training set either, but, due to its random nature, the algorithm will not produce the same tree structure at each call. However, the tree structure only needs to be refreshed at each iteration of FQI when it depends on the output values t; since this is not the case here, the structure can be built once and then kept fixed, so it is also possible to ensure the convergence of FQI using totally randomized trees.
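To illustrate this point, the following sketch (based on scikit-learn, which is an assumption; it is not the implementation used in this work) grows a totally randomized tree once, independently of the FQI regression targets, and then only refreshes the constant leaf values at each iteration:

import numpy as np
from sklearn.tree import ExtraTreeRegressor

def build_structure(X, n_min=2, seed=0):
    # max_features=1 with random splits gives a totally randomized tree; the
    # random placeholder targets only keep the builder splitting and carry no
    # information about the actual FQI targets, so the partition is fixed.
    placeholder = np.random.default_rng(seed).standard_normal(len(X))
    tree = ExtraTreeRegressor(max_features=1, min_samples_split=n_min,
                              random_state=seed)
    return tree.fit(X, placeholder)

def refresh_leaf_values(tree, X, t):
    # New constant value of a leaf = average target of the samples it contains.
    leaves = tree.apply(X)
    return {leaf: t[leaves == leaf].mean() for leaf in np.unique(leaves)}

def predict(tree, leaf_values, X_query):
    return np.array([leaf_values[leaf] for leaf in tree.apply(X_query)])

Since the partition never changes, the kernel of eq. 3.4 is the same at every iteration, which is exactly what the convergence argument requires.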

Though they do not satisfy the convergence conditions, the tree-bagging and extra-trees algorithms still have some valuable properties that set them apart from other regression algorithms, at least in the context of fitted Q iteration. It is easy to prove that the sequence of Q-functions $\{\hat{Q}_n \mid n = 1, 2, \dots, N\}$ generated by the proposed tree-based methods does not diverge to infinity, whereas this is not true for some supervised learning techniques, as already pointed out in other contexts [14]. Indeed, the prediction value of a leaf is merely the average of the outputs of the elements of a subset of the training set. Therefore, $\hat{Q}_n$ is bounded by:

$$
\begin{aligned}
\|\hat{Q}_n\|_\infty &\le B_r + \gamma \, \|\hat{Q}_{n-1}\|_\infty \\
\|\hat{Q}_n\|_\infty &\le B_r + \gamma \left[ B_r + \gamma \, \|\hat{Q}_{n-2}\|_\infty \right] \\
\|\hat{Q}_n\|_\infty &\le B_r + \gamma \left[ B_r + \gamma \left( B_r + \dots \left[ B_r + \gamma \, \|\hat{Q}_0\|_\infty \right] \dots \right) \right]
\end{aligned}
$$

But since $Q_0(s, a) = 0$ for all $s, a$, the right-hand side reduces to the geometric sum $B_r \sum_{k=0}^{n-1} \gamma^k$, hence:

$$
\|\hat{Q}_n\|_\infty \le \frac{B_r}{1 - \gamma}, \quad \forall n \in \mathbb{N} \tag{3.5}
$$


Chapter 4

Reinforcement Learning for Autonomous Driving

The idea of a self-driving vehicle dates back as far as the concept of the car itself, although truly autonomous cars did not appear before the 1980s, when a vision-guided robotic van cruised out of the Mercedes-Benz labs at a staggering speed of 63 km/h on a traffic-free street. In March 2004, the US-funded DARPA project held a contest in the Mojave Desert, offering a $1 million prize to any team that could design an autonomous vehicle capable of driving a 150-mile-long course. At that time, no vehicle achieved anything close to the competition goal. In 2014, following the progress of autonomous driving technology and the growing interest of the general public, SAE International (Society of Automotive Engineers) published a standard that defines different levels of vehicle automation, ranging from simple driving assists at levels 1 and 2, to autonomous driving capabilities at levels 3 and 4, up to complete human replacement at level 5.

In Section 4.1 we define goals and tasks of Autonomous Driving (AD). In Section 4.2 we explore the state-of-the-art solutions in the existing research literature, with a focus on reinforcement learning approaches.

4.1 Driving Scenarios and Tasks

The learning goal in autonomous driving is to learn some driving capabilities, which can be general or specific to a certain scenario.

Urban driving, for instance, is one of the most complex scenarios, mainly due to the heterogeneity of the agent population (e.g. vehicles with different levels of automation, cyclists, pedestrians) and the complexity of the driving task itself, which may require the vehicle to follow complex trajectories, frequently change speed or even come to a complete stop. The presence of human agents, such as non-automated vehicles, bicycles and pedestrians, introduces unpredictable behaviours that make learning a safe policy very difficult.

The highway driving scenario, on the other hand, allows for certain assumptions to be made in order to reduce the complexity of the driving task. In the simplest setting, the speed of the vehicle is fixed and the vehicle is only required to switch lanes; the assumption of constant speed is not as limiting as it sounds, considering that in most countries highways are subject to speed limits, and that many vehicles are already equipped with more or less advanced cruise control systems. In addition, we can assume a straightforward behaviour of the other vehicles.

Another classic scenario is the circuit scenario, where the vehicle races on a single- or multi-lane track in the absence of traffic. In this scenario, the goal is usually to find the best trajectory and minimize the elapsed time.

These scenarios may be split into more specific subtasks, such as merging onto a highway or crossing an intersection. In this case, the goal is the re-use or composition of pre-defined driving policies, where each policy aims at solving a single task.

4.2 Methodologies

The research literature has produced a vast range of techniques to approach and solve the AD problem, ranging from traditional motion planning methodologies to modern approaches such as RL and Imitation Learning (IL). In this section, we review the state of the art of RL algorithms for autonomous driving.

Deep Reinforcement Learning (DRL) algorithms aim at learning an optimal control policy using a deep neural network to approximate the policy or the value function. The reward is usually hand-crafted to encode the particular driving task. For instance, a negative reward may be awarded for the time elapsed before reaching a target destination; the vehicle speed may be used as a reward signal when driving through dense traffic; large negative rewards may be generated by undesired behaviours, such as collisions with other agents (e.g. vehicles or pedestrians) and driving off the road; the reward signal may even model the level of driving comfort, penalizing harsh maneuvers. Most of the time, the reward is a combination of multiple components.

DRL techniques are found both in end-to-end (E2E) approaches, where the perception inputs, usually raw sensor readings and camera feeds, are mapped directly to low-level control variables (e.g. continuous values for throttle, brake and steering angle), and in high-level control problems, where the agent learns to make high-level decisions and the perception and actuation tasks are delegated to submodules of the AV.
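As a purely illustrative example of such a composite reward (weights, penalty values and signal names are assumptions, not taken from any of the cited works), one could write:

def driving_reward(speed, collided, off_road, jerk,
                   w_speed=0.01, w_comfort=0.1):
    reward = w_speed * speed          # encourage driving fast through traffic
    if collided:
        reward -= 10.0                # large penalty for collisions
    if off_road:
        reward -= 10.0                # large penalty for leaving the road
    reward -= w_comfort * abs(jerk)   # penalize harsh, uncomfortable maneuvers
    return reward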

In [15], the AV learns to steer in a 3D real-world physics simulation using the Deep Q Network (DQN) algorithm. DQN is an online RL algorithm that uses a deep neural network to approximate the Q function. The input consists of a low-resolution image acquired by a camera mounted at the front of the car. To control the vehicle, a set of 5 actions is defined, where each action sets the vehicle's steering angle to a specified value while driving at constant speed. The network architecture is composed of two convolutional layers. Convolutional Neural Networks (CNN) [16] are widely used in the AD literature due to their dimensionality reduction and feature extraction capabilities.

In [17], DQN is used to learn a high-level control policy for tactical decision making in highway driving scenarios. The state consists of internal features (vehicle speed and lane index, distance to goal) and an occupancy grid centered on the vehicle. Available actions represent driving decisions, such as accelerate, move left, move right, etc. The network, which uses a single convolutional layer to process the occupancy grid, outputs a Q-value for each action. In each state the greedy policy chooses the action with the highest Q-value.
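A hedged PyTorch sketch of this kind of architecture (grid size, number of filters and of internal features are arbitrary illustrative choices, not the values used in [17]):

import torch
import torch.nn as nn

class TacticalQNetwork(nn.Module):
    def __init__(self, grid_size=11, n_internal=3, n_actions=5):
        super().__init__()
        # A single convolutional layer processes the occupancy grid centered on the vehicle.
        self.conv = nn.Sequential(nn.Conv2d(1, 16, kernel_size=3), nn.ReLU())
        conv_out = 16 * (grid_size - 2) ** 2
        # Internal features (e.g. speed, lane index, distance to goal) are
        # concatenated to the flattened convolutional output.
        self.head = nn.Sequential(nn.Linear(conv_out + n_internal, 64),
                                  nn.ReLU(),
                                  nn.Linear(64, n_actions))

    def forward(self, grid, internal):
        x = self.conv(grid).flatten(start_dim=1)
        return self.head(torch.cat([x, internal], dim=1))   # one Q-value per action

The greedy policy then simply picks the action with the largest output, e.g. q_values.argmax(dim=1).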

Hierarchical RL is combined with Monte-Carlo Tree Search (MCTS) in [4] to simultaneously learn a high-level policy for tactical decision making and low-level control policies for actuation. All control policies, whose outputs are continuous values of steering angle and acceleration, are represented as neural networks with a single hidden layer of 32 fully connected neurons, trained using the Deep Deterministic Policy Gradient (DDPG) algorithm. The high-level options policy is trained using DQN.

A similar hierarchical approach is adopted in [18], where the tactical decision making module is implemented as a deep RL agent, but routing, planning and control modules are realized with non-learning methods for safety and comfort.

In Actor-Critic (AC) approaches, both the policy and the action-value function are approximated by a neural network. These methods rely on the interaction of the actor and critic components, with the actor choosing the policy to adopt and updating it based on the response obtained from the critic. Deep Deterministic Policy Gradient (DDPG) [19] is an AC algorithm based on Deterministic Policy Gradient (DPG) [20]. DDPG maintains a parameterized actor function, which represents the current policy, and a critic function that is learnt using the Bellman equation as in Q-learning. This makes it possible to apply DDPG to problems with continuous action spaces using arbitrary function approximators, such as neural networks, whereas DQN can only handle discrete and low-dimensional action spaces. In [19], DDPG is successfully used to teach a vehicle to drive around a circuit in TORCS, a physics-based racing simulator. The vehicle is controlled by acceleration, braking and steering, and the input is either a set of low-dimensional features or low-resolution frames captured by a camera.

The initial performance of online RL algorithms may be nearly random and often dangerous in real-world settings such as autonomous driving. Batch RL methods and imitation learning can be used to overcome this problem.

In [21], FQI is used to learn overtaking maneuvers in a highway driving scenario using a fixed batch of samples. The state consists of low-dimensional features related to the controlled vehicle (velocity and orientation) and to the vehicle's surroundings (relative distance, velocity and orientation of the other vehicles). The agent is controlled by 5 discrete actions that represent driving decisions.

A similar approach is used in [22], where FQI is used to learn a policy to drive a real robot car. A neural network is used to map the input, which consists of 5 continuous variables (speed, angle and relative position of the car), to the steering angle. To avoid abrupt changes of value, the discrete output is further processed by an integrator that smooths the steering angle over time. Instead of maximizing the reward, the RL task is formulated as the minimization of the expected cumulative cost over time, where the cost function is proportional to the deviation of the car from the current track.

The policies learnt using batch methods may be used as a starting point to improve the initial performance of DRL algorithms. In [23], Normalized Actor-Critic (NAC) is used to combine learning from demonstration data and online learning. At each training iteration, a mini-batch is sampled either from the set of demonstrations or from the replay buffer, rather than restricting the samples to be on-policy as in standard actor-critic methods. NAC uses a unified loss function to process both demonstration data and online experience, which penalizes state-action pairs not seen in the demonstrations. The performance of the algorithm was evaluated on two realistic 3D simulated environments, TORCS and the action-adventure game Grand Theft Auto V. In TORCS, the goal of the agent is to drive as fast as possible around the track without colliding with obstacles; the input is a low-resolution grayscale image, and the agent is controlled by 9 discrete actions that combine longitudinal and lateral movement (left and up, no-op and down, etc.). The same setup is used in the other game.

