
POLITECNICO DI MILANO

Facoltà di Ingegneria
Scuola di Ingegneria Industriale e dell'Informazione
Dipartimento di Elettronica, Informazione e Bioingegneria

Master of Science in Computer Science and Engineering

Adversarial Imitation Learning under Covariate Shift

Supervisor: Marcello Restelli
Co-Supervisor (University of Illinois at Chicago): Brian D. Ziebart

Master Graduation Thesis by: Andrea Tirinzoni
Student ID n. 849827
Academic Year 2016-2017


ACKNOWLEDGMENTS

I would like to thank my advisor, Professor Marcello Restelli, for his help in the realization of this thesis and his valuable suggestions. I would also like to thank my advisor at the University of Illinois at Chicago, Professor Brian Ziebart, for giving me the opportunity to work on this project and supporting me during its development.

I am grateful to all my friends for their continuous support during this long journey. Also, many thanks to all who shared the last months in Chicago with me.

Last but not least, my sincere gratitude goes to my parents. Without their economic and moral support, I would not be writing this document.


CONTENTS

Abstract
1 Introduction
    1.1 Problem Description
    1.2 Contribution
    1.3 Document Outline
2 Background
    2.1 Markov Decision Processes
    2.2 Partially Observable Markov Decision Processes
        2.2.1 Point-based Value Iteration
    2.3 Directed Information Theory
    2.4 Imitation Learning
        2.4.1 Behavioral Cloning
        2.4.2 Inverse Reinforcement Learning
3 Related Work
    3.1 Feature Matching
    3.2 Maximum Causal Entropy IRL
    3.3 Maximum Margin Planning
    3.4 Adversarial Inverse Optimal Control
4 Adversarial Formulation
    4.1 Problem Definition
    4.2 Formulation
    4.3 Learning Algorithm
        4.3.1 Gradient Estimation
        4.3.2 Gradient Descent
5 Multiple-MDP Optimization
    5.1 Problem Definition
    5.2 Approximate Dynamic Programming
        5.2.1 Dynamic Program Properties
    5.3 Modified Point-Based Value Iteration
        5.3.1 Modified Value Backup
        5.3.2 Modified Belief Expansion
    5.4 Application to Adversarial Imitation Learning
6 Experiments
    6.1 Gridworld
    6.2 Taxi Problem
7 Conclusions
Bibliography


LIST OF FIGURES

Figure 6.1  A 10x10 gridworld with 2x2 blocks. White states have high reward, while black states have low reward.
Figure 6.2  The state distribution induced by the expert's policy under (a) test dynamics, and (b) train dynamics. White denotes high probability, black denotes low probability, and red circles denote absorbing cells.
Figure 6.3  Mean test loss for different numbers of trajectories in the case without covariate shift.
Figure 6.4  Mean test loss for different numbers of trajectories in the covariate shift case.
Figure 6.5  The original taxi environment (a) and the modified one we use to generate demonstrations (b).
Figure 6.6  Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the case without covariate shift.
Figure 6.7  Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the covariate shift case.


ACRONYMS

IRL    Inverse Reinforcement Learning
IOC    Inverse Optimal Control
RL     Reinforcement Learning
BC     Behavioral Cloning
MDP    Markov Decision Process
POMDP  Partially Observable Markov Decision Process
PWLC   Piece-Wise Linear and Convex
PBVI   Point-Based Value Iteration
AIL    Adversarial Imitation Learning
AIOC   Adversarial Inverse Optimal Control
ME     Maximum Entropy Inverse Reinforcement Learning
FM     Feature Matching


ABSTRACT

Imitation learning, the problem of estimating a policy that reproduces demonstrated behavior, has become an essential methodology for training intelligent agents to solve complex tasks. A powerful approach to solving imitation problems is inverse reinforcement learning, which attempts to rationalize given trajectories by recovering the unknown reward function being optimized by the expert. Most of the research in this field has focused on estimating a policy capable of imitating the demonstrator under the unknown state distribution of the latter, given samples distributed in the same way. In this work, we analyze the case where there is a shift between these two distributions. We propose an adversarial formulation, based on inverse reinforcement learning, that is able to produce a single deterministic policy minimizing a general loss function with respect to the unknown expert's policy. We prove that covariate shift leads to an NP-hard optimization sub-problem, the computation of a deterministic policy maximizing the total expected reward from two different Markov decision processes. We propose a tractable approximation by reducing the latter to the optimal control of partially observable Markov decision processes. We evaluate the performance of our approach on two common reinforcement learning benchmarks and show its advantages over other state-of-the-art algorithms.


SOMMARIO

Imitation learning, the problem of estimating a policy that reproduces demonstrated behavior, has become an essential method for training intelligent agents to solve complex tasks. A powerful approach to imitation problems is inverse reinforcement learning, which tries to rationalize the given trajectories by recovering the unknown reward function optimized by the expert. Most of the research in this field has focused on estimating a policy capable of imitating the demonstrator under the latter's unknown state distribution, given samples distributed in the same way. In this work, we analyze the case where there is a shift between these two distributions. We propose a formulation, based on inverse reinforcement learning, capable of producing a single deterministic policy that minimizes a general loss function with respect to the unknown expert's policy. We prove that the shift in distributions leads to an NP-hard optimization problem, namely the computation of a deterministic policy maximizing the total reward from two different decision processes. We propose a tractable approximation by reducing the latter to the optimal control of partially observable decision processes. We evaluate the performance of our approach on two common reinforcement learning problems and show its advantages over other state-of-the-art algorithms.


1  INTRODUCTION

Imitation learning, also known as learning from demonstrations [1], is the problem of estimating a control policy capable of performing a task demonstrated by an expert. When building intelligent agents using Reinforcement Learning (RL) [2], it is often difficult to design a reward function that allows the agent to learn complex tasks. On the other hand, it is much easier to directly show how to perform the task and let the agent infer the best policy from such demonstrations. This is the main motivation behind imitation learning. Consider, for example, the problem of learning how to autonomously drive a car [3, 4]. A driver typically takes into consideration several variables in order to decide what is the best thing to do, such as the speed limit on the current road, the presence of other cars or pedestrians, and so on. Intuitively, a reward function that allows the agent to learn how to drive should somehow weigh all these variables, but finding the right weights so as to obtain the desired behavior is very difficult. In practice, this leads to a trial-and-error approach to tuning the reward function that might take a long time. Instead, it seems much wiser to have a person demonstrate the driving task and let the agent learn from such demonstrations.

1.1 Problem Description

The two main approaches to imitation learning are Behavioral Cloning (BC) [5] and Inverse Reinforcement Learning (IRL) [3, 6, 7]. The former uses supervised techniques to directly infer a policy from the data. Since sequential prediction problems violate the i.i.d. assumption made by supervised algorithms, BC typically does not generalize well and requires a large number of samples. IRL, on the other hand, poses the problem in the Markov Decision Process (MDP) formalism and tries to recover the unknown reward function being optimized by the expert. Then, the best imitation policy is computed using RL and the recovered reward function. As a consequence of these more general assumptions, IRL generally performs much better than BC and has been successfully applied to a variety of real-world problems [3, 8].

However, IRL approaches still suffer from some limitations. To the best of our knowledge, no state-of-the-art algorithm handles the case where there is a shift between the demonstrator's state distribution and that of the provided data. Such a situation arises rather frequently in practice. Consider, for example, that the environmental dynamics under which demonstrations are generated might be noisy and, thus, slightly different from the expected ones. Going back to the autonomous driving example, the human might provide demonstrations on a car he has never used before, or in a place he has never seen. The expert still knows how to drive properly, but he might make sub-optimal decisions due to the fact that the environment is slightly different from the one he is used to. If we simply ignore this shift, IRL algorithms are likely to consider such decisions as optimal, and the inferred policy will be a poor imitator of the expert. Furthermore, the typical approach used in machine learning to handle covariate shift [9, 10], i.e., importance sampling, is not easily applicable in our case because neither of the two distributions is known (the expert's policy is not given). These examples motivate us to explicitly deal with this problem.

We suppose the expert's policy is optimal under some given dynamics (which we call test dynamics), but we only have access to a finite number of demonstrations generated by the same policy under different, known dynamics (which we call train dynamics). Our goal is to imitate the expert policy under the test dynamics. Building on top of the framework described in [11], we propose a new formulation to solve the imitation learning under covariate shift problem. Our formulation is a zero-sum game where the learner tries to minimize an imitation loss under the test dynamics, while the adversary tries to maximize such loss while being constrained to match the feature counts derived from the data under the train dynamics. As in [11], our algorithm does not make any particular assumption on the loss function (as long as it decomposes over states). This constitutes a significant improvement over other imitation learning algorithms, which typically require convexity. We show that the introduction of covariate shift into [11] leads to an NP-hard optimization problem, that is, the computation of a deterministic policy maximizing the total expected reward from two different MDPs. We propose a tractable approximation by reducing the latter to the optimal control of partially observable Markov decision processes [12, 13].

1.2 Contribution

Our contribution is two-fold. First, we propose an approximate solution to the NP-hard problem of computing the optimal deterministic policy maximizing the total reward from different processes. This is a general problem, not necessarily related to IRL, whose solution can be re-used in many contexts. To this end, we analyze and solve it from the most general perspective possible, so as to ease its re-usability. Then, we adopt this result to solve our imitation learning under covariate shift problem.


1.3 Document Outline

The rest of this document is organized as follows. In Chapter 2, we provide the mathematical background needed to understand this work. Focus is given to MDPs, the mathematical model that specifies our decision-making settings, POMDPs, which we adopt to approximate the NP-hard sub-problem of our formulation, and causally conditioned probability distributions, which compactly represent our stochastic processes. Furthermore, we provide a deeper description of imitation learning, behavioral cloning, and inverse reinforcement learning. In Chapter 3, we introduce the most important state-of-the-art IRL algorithms (feature matching, maximum entropy, and maximum margin planning), which we use for comparison. We also detail the adversarial approach we extend to derive our formulation. Chapter 4 formally defines the problem we are trying to solve, presents our adversarial formulation, and derives our learning algorithm. In Chapter 5, we analyze the NP-hard problem of computing the optimal policy for different MDPs. We propose a modified POMDP algorithm, analyze its performance, and specify its application to our formulation. In Chapter 6, we demonstrate the benefits of our approach on two common reinforcement learning benchmarks, gridworld and the taxi problem [14], by comparing to other state-of-the-art algorithms. Finally, in Chapter 7 we draw some conclusions on this work.


2  BACKGROUND

This chapter introduces the concepts that are used extensively throughout this document. Section 2.1 starts by describing MDPs, the mathematical tool used to model the environments where reinforcement learning and inverse reinforcement learning are applied. Then, Section 2.2 provides an overview of POMDPs, which we employ in Chapter 5 to solve an important sub-problem of our formulation. Since these two topics are well known and well described in the literature, we simply provide a quick overview, while referring the reader to other sources for more detailed explanations. Section 2.3 quickly introduces directed information theory and causally conditioned probability distributions, which allow for a more concise representation of our stochastic processes. Finally, Section 2.4 describes the problem of imitation learning, focusing on the two main methodologies to solve it: behavioral cloning and inverse reinforcement learning.

2.1 Markov Decision Processes

This section provides a quick introduction to MDPs. Since in this document we restrict ourselves to finite-state, finite-horizon MDPs, we focus on that case. A description of the more general settings can be found in [15] and [2].

MDPs are discrete-time stochastic control processes used to model complex decision-making problems under uncertainty. At each time step of the process, the agent is in some state, takes some action, and observes a reward for taking that action in that particular state. The agent's task is to find the sequence of actions that maximizes the long-term total reward. Definition 2.1 formalizes this concept.

Definition 2.1. An MDP is a tuple $\langle S, A, \tau, R, p_0 \rangle$, where:

• $S$ is a finite set of states;
• $A$ is a finite set of actions;
• $\tau$ are the state transition dynamics, where $\tau(s_{t+1} \mid s_t, a_t)$ is the probability of the next state being $s_{t+1}$ given that the agent executes action $a_t$ in state $s_t$;


• $R$ is a reward function, where $R(s_t)$ specifies the reward that the agent receives for entering state $s_t$¹;
• $p_0$ is a probability distribution over initial states, where $p_0(s)$ is the probability that $s$ is the state where the process starts.

A control policy $\pi(a_t \mid s_t)$ is a probability distribution over actions given states. The process starts at time $t = 1$, where the first state is drawn from the initial distribution $p_0$. Then, the agent selects the first action $a_1$ according to its policy $\pi(a_1 \mid s_1)$ and makes a transition to another state as specified by the dynamics $\tau(s_2 \mid s_1, a_1)$. The agent successively selects the second action, and so on until the process ends in state $s_T$. The goal is to find the policy $\pi^\star$ that maximizes the sum of expected rewards, that is:

$$\pi^\star = \operatorname*{argmax}_{\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} R(S_t) \,\middle|\, \tau, \pi\right] \tag{2.1}$$

It is possible to prove that knowledge of the current state is sufficient for acting optimally in the future; that is, knowing the past history of state-action sequences does not add any further information. A process where this is the case is said to be Markovian. Furthermore, the optimal policy in an MDP is always deterministic, i.e., it can be specified as a mapping $\pi^\star(s_t)$ from states to actions returning the best action $a_t$ for each state $s_t$.

For a particular policy $\pi$, the state value function represents the total expected reward that is achieved from each state $s_t$ by following $\pi$:

$$V^\pi(s_t) = R(s_t) + \mathbb{E}\left[\sum_{i=t+1}^{T} R(S_i) \,\middle|\, \tau, \pi\right] \tag{2.2}$$

while the state-action value function represents the total expected reward that is achieved by executing action $a_t$ from state $s_t$ and then following $\pi$:

$$Q^\pi(s_t, a_t) = R(s_t) + \mathbb{E}_{s_{t+1} \sim \tau(\cdot \mid s_t, a_t)}\left[\sum_{i=t+1}^{T} R(S_i) \,\middle|\, \tau, \pi\right] \tag{2.3}$$

These two quantities are related by the so-called Bellman expectation equations [16], as specified in Theorem 2.1.

¹ Notice that the reward is usually specified as a function $R(s, a)$ of state and action, but in this document we consider it as a function of the state alone. The extension is trivial.


Theorem 2.1. Let $M = \langle S, A, \tau, R, p_0 \rangle$ be an MDP and $\pi(a_t \mid s_t)$ be a policy. Then we can compute the state value function $V^\pi(s_t)$ and the state-action value function $Q^\pi(s_t, a_t)$ for $\pi$ by solving the following dynamic program:

$$
\begin{cases}
V^\pi(s_T) = R(s_T) \\
V^\pi(s_t) = \sum_{a_t} \pi(a_t \mid s_t)\, Q^\pi(s_t, a_t) & \forall t = T-1, \dots, 1 \\
Q^\pi(s_t, a_t) = R(s_t) + \sum_{s_{t+1}} \tau(s_{t+1} \mid s_t, a_t)\, V^\pi(s_{t+1}) & \forall t = T-1, \dots, 1
\end{cases}
\tag{2.4}
$$

The following theorem states Bellman's optimality equations [16], which represent one of the most important results in this field and allow the computation of the optimal policy $\pi^\star$ by means of dynamic programming.

Theorem 2.2. Let $M = \langle S, A, \tau, R, p_0 \rangle$ be an MDP. Then we can compute the optimal policy $\pi^\star(s_t)$ for $M$ by solving the following dynamic program:

$$
\begin{cases}
V^{\pi^\star}(s_T) = R(s_T) \\
V^{\pi^\star}(s_t) = \max_{a_t} Q^{\pi^\star}(s_t, a_t) & \forall t = T-1, \dots, 1 \\
\pi^\star(s_t) = \operatorname*{argmax}_{a_t} Q^{\pi^\star}(s_t, a_t) & \forall t = T-1, \dots, 1
\end{cases}
\tag{2.5}
$$

where $Q^{\pi^\star}$ is computed as specified in Equation 2.4.
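To make the backward recursion of Theorems 2.1 and 2.2 concrete, the following is a minimal Python sketch of finite-horizon value iteration; the array shapes and the function name are our own assumptions and are not part of the original text.

```python
import numpy as np

def finite_horizon_value_iteration(R, tau, T):
    """Backward induction for the dynamic program of Theorem 2.2.

    R   : (S,) state rewards R(s)
    tau : (S, A, S) transition probabilities, tau[s, a, s'] = P(s' | s, a)
    T   : horizon (number of time steps)
    Returns V[t, s] and the optimal deterministic policy pi[t, s].
    """
    S, A, _ = tau.shape
    V = np.zeros((T, S))
    pi = np.zeros((T, S), dtype=int)
    V[T - 1] = R                                  # V(s_T) = R(s_T)
    for t in range(T - 2, -1, -1):                # t = T-1, ..., 1 (0-based indexing)
        Q = R[:, None] + tau @ V[t + 1]           # Q(s_t, a_t), shape (S, A)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```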

2.2 Partially Observable Markov Decision Processes

This section quickly introduces POMDPs, focusing on the properties we use in this work. For a more detailed description, we refer the reader to [17].

POMDPs [12, 13] provide an extension of MDPs to partially observable environments, i.e., those where the agent cannot directly observe the state but only receives partial information about it. Definition 2.2 formalizes this new model.

Definition 2.2. A Partially Observable Markov Decision Process (POMDP) is a tuple $\langle S, A, \Omega, \tau, O, R, b_0, \gamma \rangle$, where:

• $S$ is a finite set of states;
• $A$ is a finite set of actions;
• $\Omega$ is a finite set of observations, where each observation gives partial information about the state the agent is in;
• $\tau$ are the state transition dynamics, where $\tau(s_{t+1} \mid s_t, a_t)$ is the probability of the next state being $s_{t+1}$ given that we execute action $a_t$ from state $s_t$;
• $O$ are the conditional observation probabilities, where $O(o_{t+1} \mid s_{t+1}, a_t)$ is the probability of observing $o_{t+1}$ after taking action $a_t$ and transitioning to state $s_{t+1}$;
• $R$ is the reward function, where $R(s_t, a_t)$ specifies the utility the agent obtains after taking action $a_t$ from state $s_t$;
• $b_0$ is the initial state probability distribution (also called the initial belief state);
• $\gamma \in [0, 1]$ is the discount factor, used to discount future rewards in infinite-horizon processes.

The process starts at time $t = 1$, where the first state $s_1$ is drawn from $b_0$. However, the agent cannot directly observe that state but receives an observation $o_1$. Then, the agent selects an action $a_1$, transitions to the unobserved state $s_2$ according to $\tau(s_2 \mid s_1, a_1)$, receives an observation $o_2$ according to $O(o_2 \mid s_2, a_1)$, and finally obtains a reward $R(s_1, a_1)$. Then, the process is repeated (forever in the infinite-horizon case). The agent's goal is to select the sequence of actions maximizing the total (discounted) reward over time.

In order to solve the problem, the agent keeps, at each time, a distribution over states $S$ called the belief state. A belief state is an $|S|$-dimensional vector where each component represents the probability that the agent is in that particular state. Then, the partially observable MDP is reduced to a continuous fully observable MDP where the state space is $B$, the space of all belief states (if we have $|S|$ states, $B$ is the $(|S|-1)$-dimensional simplex). In such an MDP, the state-transition dynamics are:

$$\tau(b_{t+1} \mid b_t, a_t) = \sum_{o \in \Omega} \Pr(b_{t+1} \mid b_t, a_t, o)\,\Pr(o \mid a_t, b_t) \tag{2.6}$$

and the reward function is:

$$R(b_t, a_t) = \sum_{s \in S} R(s, a_t)\, b_t(s) \tag{2.7}$$

According to the Bellman optimality equations, the value function for the optimal policy of the fully observable MDP can be written as:

$$V_t(b_t) = \max_{a_t \in A} \left[ R(b_t, a_t) + \gamma \sum_{b_{t+1} \in B} \tau(b_{t+1} \mid b_t, a_t)\, V_{t+1}(b_{t+1}) \right] \tag{2.8}$$


This is a function of a continuous variable and cannot be computed by standard value iteration [2]. However, it turns out that such a function is piece-wise linear and convex in $b$ [12] and can be written as:

$$V_t(b_t) = \max_{\alpha \in \mathcal{V}_t} b_t^\top \alpha \tag{2.9}$$

where $\mathcal{V}_t$ is a set of vectors representing the normal directions to the hyperplanes describing the function (we call them alpha-vectors).
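As a small illustration of the belief-state reduction and of Equation 2.9, the sketch below (our own, with assumed array shapes) shows the Bayesian belief update and the evaluation of a PWLC value function from a set of alpha-vectors.

```python
import numpy as np

def belief_update(b, a, o, tau, O):
    """Belief update after taking action a and observing o.

    b   : (S,) current belief state
    tau : (S, A, S) transition probabilities tau[s, a, s']
    O   : (S, A, Z) observation probabilities O[s', a, o]
    """
    b_next = O[:, a, o] * (b @ tau[:, a, :])      # P(o | s', a) * P(s' | b, a)
    return b_next / b_next.sum()

def pwlc_value(b, alpha_vectors):
    """Evaluate Equation 2.9: V(b) = max_alpha b . alpha,
    where alpha_vectors has shape (N, S)."""
    return float(np.max(alpha_vectors @ b))
```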

Such formulation allows for efficient algorithms to compute the value function (and the associated optimal policy). The most common approach, and the one we adopt in this document, is Point-Based Value Iteration (PBVI), which we describe in the next section.

2.2.1 Point-based Value Iteration

PBVI [18] approximates the optimal value function of a POMDP by considering only a restricted set of representative belief states $B$ and computing one alpha-vector for each of them. Then, the value function is represented as the set $\mathcal{V}$ containing all alpha-vectors learned by the algorithm. The motivation behind PBVI is that most alpha-vectors in the full set exactly describing $V$ are dominated by some others. This means that, when taking the maximum over actions, such vectors are never used and can be safely pruned. Since the total number of alpha-vectors grows exponentially with the number of actions and observations, pruning is necessary to make the problem tractable. In PBVI, this is done implicitly by computing the dominating alpha-vector at each belief in $B$. Thus, no dominated alpha-vector is ever added to $\mathcal{V}$.

The algorithm proceeds iteratively by alternating two main procedures:

• value backup: given the current belief set $B$ and the current value function $\mathcal{V}$, computes an update $\mathcal{V}'$ of $\mathcal{V}$ by applying backup equations similar to the Bellman optimality equations (a modified version of Equation 2.8);
• belief expansion: adds new belief states to the current belief set $B$. For each belief in $B$, this is done by taking each action and simulating a step forward, thus leading to a new belief state. Then, only the farthest belief from the current set $B$ is retained.

These two procedures are then repeated for a specified number of iterations. The more iterations are executed, the better the value function approximation (i.e., the number of alpha-vectors that are computed grows with the number of iterations).
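For reference, a compact sketch of the value backup step described above, following the standard point-based backup of [18]; the array layout is our own assumption and the belief expansion step is omitted.

```python
import numpy as np

def pbvi_backup(Gamma, B, R, tau, O, gamma):
    """One point-based value backup: one alpha-vector per belief in B.

    Gamma : (N, S) current alpha-vectors
    B     : (M, S) belief points
    R     : (S, A) rewards R(s, a)
    tau   : (S, A, S) transition probabilities
    O     : (S, A, Z) observation probabilities O(o | s', a)
    """
    S, A = R.shape
    Z = O.shape[2]
    new_Gamma = []
    for b in B:
        best_val, best_alpha = -np.inf, None
        for a in range(A):
            alpha_a = R[:, a].copy()
            for o in range(Z):
                # Project every alpha-vector one step back through (a, o):
                # gamma * sum_{s'} tau(s,a,s') O(o|s',a) alpha(s')
                proj = gamma * (tau[:, a, :] * O[:, a, o]) @ Gamma.T   # (S, N)
                # Keep only the projection that dominates at this belief point
                alpha_a += proj[:, np.argmax(b @ proj)]
            if b @ alpha_a > best_val:
                best_val, best_alpha = b @ alpha_a, alpha_a
        new_Gamma.append(best_alpha)
    return np.vstack(new_Gamma)
```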


2.3 Directed Information Theory

In this document, we make use of causally conditioned probability distributions to represent our stochastic processes. Such distributions arise naturally from directed information theory [19], which associates a direction to the flow of information in a system.

Given two random vectors $Y_{1:T}$ and $X_{1:T}$, the causally conditioned probability distribution of $Y$ given $X$ is:

$$p(Y_{1:T} \,\|\, X_{1:T}) = \prod_{t=1}^{T} p(Y_t \mid X_{1:t}, Y_{1:t-1}) \tag{2.10}$$

Notice the difference with respect to the conditional probability of $Y$ given $X$, which lies in the knowledge about the conditioning variable $X$ that is available:

$$p(Y_{1:T} \mid X_{1:T}) = \prod_{t=1}^{T} p(Y_t \mid X_{1:T}, Y_{1:t-1}) \tag{2.11}$$

Using this notation, we can compactly represent our processes as:

$$\tau(S_{1:T} \,\|\, A_{1:T-1}) = \prod_{t=1}^{T} p(S_t \mid S_{1:t-1}, A_{1:t-1}) \overset{(M)}{=} \prod_{t=1}^{T} p(S_t \mid S_{t-1}, A_{t-1}) \tag{2.12}$$

$$\pi(A_{1:T-1} \,\|\, S_{1:T-1}) = \prod_{t=1}^{T-1} p(A_t \mid S_{1:t}, A_{1:t-1}) \overset{(M)}{=} \prod_{t=1}^{T-1} p(A_t \mid S_t) \tag{2.13}$$

where the last equalities (M) hold only under the Markovian assumption. The product of these two quantities represents the joint distribution of state-action sequences:

$$p(S_{1:T}, A_{1:T-1}) = \tau(S_{1:T} \,\|\, A_{1:T-1})\, \pi(A_{1:T-1} \,\|\, S_{1:T-1}) \tag{2.14}$$

and can be used to concisely write the total expected reward as:

$$\mathbb{E}\left[\sum_{t=1}^{T} R(S_t) \,\middle|\, \tau, \pi\right] = \sum_{S_{1:T}, A_{1:T-1}} p(S_{1:T}, A_{1:T-1})\, R(S_{1:T}) \tag{2.15}$$
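As a small illustration of this notation, the sketch below (with an assumed tabular representation of $\tau$ and $\pi$) computes the two causally conditioned probabilities of a Markovian trajectory; their product gives the joint probability of Equation 2.14 up to the initial state distribution.

```python
import numpy as np

def causally_conditioned_probs(states, actions, tau, pi):
    """tau(S_{1:T} || A_{1:T-1}) and pi(A_{1:T-1} || S_{1:T-1}) for a
    Markovian trajectory (Equations 2.12 and 2.13); the initial factor
    p_0(s_1) is left out and should be multiplied in separately.

    states  : [s_1, ..., s_T], actions : [a_1, ..., a_{T-1}]
    tau     : (S, A, S) transition probabilities
    pi      : (S, A) policy probabilities pi(a | s)
    """
    p_tau = np.prod([tau[states[t], actions[t], states[t + 1]]
                     for t in range(len(actions))])
    p_pi = np.prod([pi[states[t], actions[t]] for t in range(len(actions))])
    return p_tau, p_pi   # their product gives Equation 2.14 (up to p_0)
```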


2.4 Imitation Learning

In imitation learning, an agent tries to mimic another agent's behavior. The imitating agent is usually called the "learner", while the imitated one is referred to as the "expert" or "demonstrator". The environment for this context is generally modeled as a set of states in which agents can be located, and a set of actions that they can take to perform state transitions. Then, the problem of imitation learning reduces to finding a policy, that is, a mapping from states to actions (or a probability distribution over actions given states), that best approximates the expert's policy. The latter is known only through demonstrations, i.e., the expert shows how to behave (by taking actions in different states) and the learner has to find the policy that best reproduces the demonstrated state-action sequences.

The two most common approaches to imitation learning are behavioral cloning and inverse reinforcement learning, which are described in the next two sections.

2.4.1 Behavioral Cloning

BC [5] is the most common approach to imitation learning. The main problem is reduced to a supervised learning one, where a mapping from states, or features of such states, to actions is learned by adopting classification techniques. This method has been successfully adopted in several different fields. Robotics is the most common, where excellent results have been achieved in a wide variety of tasks. Among the notable examples, Pomerleau [20] implemented an autonomous land vehicle that learns to follow a road given demonstrated trajectories (from camera images) and using a neural network. LeCun et al. [4] proposed a system for learning obstacle avoidance from human demonstrations (again in the form of images) using a convolutional neural network. Sammut et al. [21] designed an algorithm to learn how to fly an aircraft given human demonstrations from simulation software and using decision trees.

Although behavioral cloning has proven to perform very well on some specific tasks, it provides poor results when the goal is to maximize a long-term utility. The main issue with behavioral cloning (and with the supervised techniques adopted) is that samples (typically state-action pairs) are supposed to be independent. This is clearly not the case when sequential decision making is demonstrated. For example, the state the agent is in depends on the previous state and action. Thus, when the demonstrator's behavior aims at maximizing a long-term reward, behavioral cloning algorithms tend to generalize poorly and fail at reproducing the optimal performance. Furthermore, they usually require a large number of samples to learn good imitation policies.


2.4.2 Inverse Reinforcement Learning

The idea for tackling imitation learning problems when the demonstrator exhibits sequential decision-making behavior is to formalize them as MDPs. In this context, the expert is supposed to be optimizing an unknown long-term reward function, and imitation learning reduces to estimating such a function. This problem is known as IRL or Inverse Optimal Control (IOC) [3, 6, 7].

In optimal control and RL [2], the agent is given a model of the environment and a reward function, and must generate optimal behavior by finding a policy that maximizes the long-term reward². In IRL, on the other hand, the agent is given trajectories showing the expert's (optimal) policy together with a model of the environment, and must recover the reward function being optimized.

Differently from behavioral cloning, IRL attempts to rationalize demonstrated sequential decisions by estimating the utility function that is being maximized by the expert. Since the whole field of reinforcement learning [2] is based on the idea that "the reward function is the most succinct, robust, and transferable definition of the task" [7], its recovery seems wiser than directly learning a mapping from states to actions. The estimated reward function can subsequently be used to learn the best control policy via classic RL.

We can formalize the IRL problem as follows [6]. Given:

• a model of the environment the expert is acting in (i.e., state-transition dynamics $\tau$), and
• a set of trajectories $\zeta_i$, where each trajectory is a sequence $(s_1, a_1, s_2, \dots)$ of states and actions generated by executing the expert's (optimal) policy $\pi^\star$ under $\tau$,

estimate the reward function $R^\star$ being optimized by the expert. More specifically, the problem is reduced to estimating a reward function that makes the demonstrated behavior optimal (i.e., rationalizes such behavior). Formally, the estimated reward function $R$ must satisfy:

$$\mathbb{E}\left[\sum_{t=1}^{T} R(S_t) \,\middle|\, \tau, \pi^\star\right] \geq \mathbb{E}\left[\sum_{t=1}^{T} R(S_t) \,\middle|\, \tau, \pi\right] \quad \forall \pi \tag{2.16}$$

However, this formulation has several challenges. First, it constitutes an ill-posed problem since it is easy to prove that there exist many solutions (actually, infinitely many) [7]. As an example, a constant reward (e.g., a reward that is always zero) makes every policy optimal. Nevertheless, it is very unlikely that such a function matches the one that is sought. Second, it is not possible to explicitly compute the left-hand side since the expert's policy $\pi^\star$ is not given but is only demonstrated through sample trajectories. Furthermore, this formulation assumes that the expert is optimal (i.e., $\pi^\star$ is optimal). When this is not the case, the problem becomes infeasible. Last, we can solve the inequalities only by enumerating all possible policies, which is computationally impractical.

² To be precise, in RL the agent is not explicitly given these two quantities but has the ability to take actions and observe their outcomes.

Many algorithms have been proposed to tackle such difficulties. We defer the description of the most important ones to the next chapter.


3  RELATED WORK

In Chapter 2, we introduced and formally defined IRL, one of the main approaches to imitation learning, while describing its main challenges. Since our work is based on IRL, in this chapter we present the state-of-the-art algorithms which we use for comparison, focusing on how they tackle such complications. Section 3.1 describes feature matching, one of the first IRL algorithms, whose underlying assumptions represent the foundations of many successive works. Then, Section 3.2 describes maximum entropy IRL, which provides a principled way of estimating the demonstrated policy, while Section 3.3 presents maximum margin planning, which casts the main problem into a maximum margin one. Finally, Section 3.4 introduces the adversarial approach that we extend to build our framework.

3.1 Feature Matching

Abbeel and Ng [3] represent rewards as linear combinations of given features of the states:

$$R(s) = w^\top \phi(s) \tag{3.1}$$

Given a policy $\pi$, they define the feature expectations of $\pi$ as:

$$\mu(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\middle|\, \pi\right] \tag{3.2}$$

Their main intuition is that, if the feature expectations of a policy $\pi$ match those of the expert's policy $\pi^\star$:

$$\|\mu(\pi) - \mu(\pi^\star)\| \leq \epsilon \tag{3.3}$$

then $\pi$ is guaranteed to perform as well as $\pi^\star$ for all rewards with $\|w\|_1 \leq 1$.

They propose an iterative procedure that looks for a policy satisfying the condition of Equation 3.3. The algorithm keeps a set of policies $\pi^{(i)}$ together with the corresponding feature expectations $\mu^{(i)}$. At each iteration, the following quadratic program is solved to find the weights $w$ that maximally separate the expert's feature expectations $\mu_E$ from the ones in the above-mentioned set:

$$\max_{w : \|w\|_2 \leq 1} \; \min_i \; w^\top (\mu_E - \mu^{(i)}) \tag{3.4}$$


Notice that this is equivalent to finding a maximum margin hyperplane separating $\mu_E$ from all $\mu^{(i)}$. Then, the optimal policy for the new weights is computed together with its feature expectations, and the algorithm is iterated until the two sets are separated by a margin less than $\epsilon$. Finally, the output is one of the learned policies, if the demonstrator is optimal, or a mixture of some of them, if the demonstrator is sub-optimal.

Although this algorithm always achieves the feature matching property (which provides a way to solve the degeneracy problem of reward functions), it is not guaranteed to recover a reward that is similar to the true one.
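For concreteness, the max-margin step of Equation 3.4 can be written as a small convex program. The sketch below uses cvxpy, which is our own choice and not part of the original algorithm description.

```python
import cvxpy as cp

def feature_matching_weights(mu_E, mu_list):
    """Solve  max_{||w||_2 <= 1}  min_i  w^T (mu_E - mu^(i))  (Equation 3.4).

    mu_E    : (K,) expert feature expectations (numpy array)
    mu_list : list of (K,) feature expectations of the policies found so far
    """
    K = mu_E.shape[0]
    w, t = cp.Variable(K), cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [(mu_E - mu_i) @ w >= t for mu_i in mu_list]
    cp.Problem(cp.Maximize(t), constraints).solve()
    return w.value, t.value   # weights and achieved margin
```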

3.2 Maximum Causal Entropy IRL

While Abbeel and Ng's algorithm [3] solves the degeneracy problem by matching the empirical sum of features, which leads to a (mixture) policy that achieves a performance very close to that of the demonstrator even without recovering the true reward function, it introduces another ambiguity: many policies, or mixtures of policies, that match the feature expectations exist. No principled way to choose among them is proposed by the authors.

Ziebart et al. [8, 22] solve the above-mentioned ambiguity by employing the principle of maximum causal entropy. Their formulation seeks a probability distribution over actions given states that maximizes the causal entropy while matching the feature expectations:

$$\operatorname*{argmax}_{\pi} \; H(A_{1:T} \,\|\, S_{1:T}) \quad \text{such that} \quad \mu(\pi) = \mu_E \tag{3.5}$$

where $H(A_{1:T} \,\|\, S_{1:T}) = \mathbb{E}[-\log \pi(A_{1:T} \,\|\, S_{1:T})]$ denotes the causal entropy of the distribution $\pi$ [23]. The authors prove that solving this optimization problem reduces to minimizing the worst-case prediction log-loss and yields a stochastic policy of the form:

$$\pi(a_t \mid s_t) = e^{Q(s_t, a_t) - V(s_t)} \tag{3.6}$$

where:

$$Q(s_t, a_t) = w^\top \phi(s_t) + \mathbb{E}_{\tau(\cdot \mid s_t, a_t)}[V(S_{t+1})] \tag{3.7}$$

$$V(s_t) = \operatorname*{softmax}_{a_t} Q(s_t, a_t) \tag{3.8}$$


and the softmax function is defined as:

$$\operatorname*{softmax}_{x} f(x) = \log \sum_{x} e^{f(x)} \tag{3.9}$$

Then, the reward weights achieving such a probability distribution can be computed by adopting a convex optimization procedure. The authors show how the resulting algorithm allows for efficient inference and successfully apply it to the problem of modeling route preferences and inferring the destination given partial trajectories.
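To illustrate Equations 3.6-3.9, here is a minimal finite-horizon soft value iteration sketch; the terminal condition and the tabular representation are our own assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def max_causal_ent_policy(w, phi, tau, T):
    """Compute pi(a_t | s_t) = exp(Q(s_t, a_t) - V(s_t)) by soft value iteration.

    w   : (K,) reward weights, phi : (S, K) state features
    tau : (S, A, S) transition probabilities, T : horizon
    """
    S, A, _ = tau.shape
    R = phi @ w                               # R(s) = w . phi(s)
    V = np.zeros((T, S))
    pi = np.zeros((T, S, A))
    V[T - 1] = R                              # no action is taken at the final step
    for t in range(T - 2, -1, -1):
        Q = R[:, None] + tau @ V[t + 1]       # Equation 3.7, shape (S, A)
        V[t] = logsumexp(Q, axis=1)           # softmax of Equations 3.8-3.9
        pi[t] = np.exp(Q - V[t][:, None])     # Equation 3.6
    return pi
```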

3.3 Maximum Margin Planning

Ratliff et al. [24] propose a different approach to selecting a policy (and corresponding reward weights) that makes the expert better than all alternatives. They cast the problem into a maximum margin one, where the goal is to find a hyperplane separating the demonstrator's feature expectations from those of any other policy by a structured margin. The resulting formulation is the quadratic program:

$$\min_{w} \; \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad w^\top \mu(\pi^\star) \geq w^\top \mu(\pi) + L(\pi^\star, \pi) - \xi \quad \forall \pi \tag{3.10}$$

where $L$ denotes a loss function comparing two policies and $\xi$ are slack variables used to soften the margin (whose effect is controlled by the constant $C$). The rationale behind the inclusion of a loss function into the maximum margin is that the latter should be larger for policies that are very different from the demonstrated one. The authors allow the usage of any loss function, as long as it can be factored over state-action pairs.

One drawback of the quadratic program of Equation 3.10 is that it has a very large number of constraints. In order to make learning practical, the authors solve it using an approximate algorithm based on subgradient methods.

Differently from feature matching and maximum causal entropy IRL, this algorithm does not try to find a policy that achieves the same performance as the expert, but directly estimates the reward weights $w$. The policy used for imitation can subsequently be computed by optimizing over the learned reward function.

3.4 Adversarial Inverse Optimal Control

Chen et al. [11] propose an adversarial framework for a more general imitation learning and IRL setting, that is, the case where learner and demonstrator are acting in different environments. They consider a demonstrator acting according to the (unknown) policy $\pi$ under (known) dynamics $\tau$, and a learner acting under different (known) dynamics $\hat\tau$. Then, the main idea is to find a policy $\hat\pi$ minimizing a loss function comparing the learner's and the demonstrator's state sequences. Formally:

$$\operatorname*{argmin}_{\hat\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(S_t, \hat S_t) \,\middle|\, \tau, \pi, \hat\tau, \hat\pi\right] \tag{3.11}$$

However, the demonstrator's policy $\pi$ is not known and the expectation cannot be explicitly computed. In order to provide an estimate of such a policy, an adversary is introduced to find the policy $\check\pi$ that maximizes the imitative loss. To prevent the adversary from choosing very bad policies, they consider the set of all stochastic policies that match the empirical sum of features of the demonstrated trajectories [3] and allow $\check\pi$ to be picked only from that set. Thus, the idea to deal with the feature-matching ambiguity is, in this case, to choose the worst-case (i.e., loss-maximizing) policy from the restricted set. The final formulation is a zero-sum game between the learner, looking for the best (stochastic) policy to minimize the loss, and the adversary, trying to maximize such loss by selecting another (stochastic) policy in a constrained way:

$$\min_{\hat\pi} \; \max_{\check\pi \in \Theta} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \check\pi, \hat\tau, \hat\pi\right] \tag{3.12}$$

where $\Theta$ is the set of all policies matching the expert's feature expectations:

$$\check\pi \in \Theta \;\Leftrightarrow\; \mathbb{E}\left[\sum_{t=1}^{T} \phi(\check S_t) \,\middle|\, \tau, \check\pi\right] = \mu_E \tag{3.13}$$

In order to solve this problem, they reduce the constrained zero-sum game to one that is parameterized by a vector of Lagrange multipliers and solve it using the double oracle method [25], while optimizing over such parameters using a simple convex optimization procedure [26].

This algorithm provides several advantages over existing methods. First, any loss function can be used (as long as it can be decomposed over states). Second, it provides another principled way to solve the feature-matching ambiguity: simply pick the worst-case policy. This allows the algorithm to generalize well to new situations. Finally, the algorithm is proven to be Fisher consistent, i.e., the learned policy minimizes the loss under the assumption that the feature set is sufficiently rich.


Notice, however, that this approach assumes the demonstrator's distribution matches the data distribution, which simplifies the estimation of the expert's policy. Thus, it is not suitable for dealing with our imitation learning under covariate shift problem and needs to be adapted accordingly. We describe how to achieve this in the next chapter.


4  ADVERSARIAL FORMULATION

This chapter describes the adversarial formulation we adopt to solve the imitation learning under covariate shift problem. Section 4.1 formalizes this problem, while Section 4.2 details our approach. Finally, Section 4.3 describes our learning algorithm and points out its main challenges.

4.1 Problem Definition

We start by formalizing the imitation learning under covariate shift problem. Consider a finite set of states $S$, a finite set of actions $A$, and a probability distribution $p_0(s)$ over initial states. Suppose the expert acts according to the unknown policy $\pi(A_{1:T-1} \,\|\, S_{1:T-1})$, which is optimal for the unknown reward function $R^\star(s)$ under known test dynamics $\tau(S_{1:T} \,\|\, A_{1:T-1})$. Furthermore, suppose $R^\star$ can be written as a linear combination of given state features $\phi$ [3]:

$$R^\star(s) = w^\star \cdot \phi(s) \tag{4.1}$$

where $\phi$ and $w^\star$ are $K$-dimensional vectors, and $\cdot$ denotes the inner product operator. We are given a finite set of $N$ trajectories under the known train dynamics $\tilde\tau(S_{1:T} \,\|\, A_{1:T-1})$ and the expert's policy $\pi$, where each trajectory is a sequence:

$$\zeta_i = (s_1, a_1, s_2, \dots, a_{T-1}, s_T \mid \pi, \tilde\tau)_i \tag{4.2}$$

Then, our goal is to estimate a policy $\hat\pi$ minimizing the expected loss with respect to the expert's policy $\pi$ under the test dynamics $\tau$. Formally:

$$\operatorname*{argmin}_{\hat\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, S_t) \,\middle|\, \tau, \pi, \hat\pi\right] \tag{4.3}$$

Minimizing such a loss requires knowledge of the unknown demonstrator's state distribution:

$$p(s_t) = \sum_{s_{t-1}} \sum_{a_{t-1}} \tau(s_t \mid s_{t-1}, a_{t-1})\, \pi(a_{t-1} \mid s_{t-1})\, p(s_{t-1}) \tag{4.4}$$

while we only have data under:

$$\tilde p(s_t) = \sum_{s_{t-1}} \sum_{a_{t-1}} \tilde\tau(s_t \mid s_{t-1}, a_{t-1})\, \pi(a_{t-1} \mid s_{t-1})\, \tilde p(s_{t-1}) \tag{4.5}$$


Notice that, although we supposed the demonstrator's policy to be optimal and the reward to be linear in given features, we do not require these conditions to hold, as we show in our experiments.

Of fundamental importance for our algorithm is the empirical sum of features observed on the given trajectories, which allows us to estimate the expert's feature expectations under the train dynamics $\tilde\tau$:

$$\tilde\mu_E = \frac{1}{N} \sum_{i=1}^{N} \sum_{s \in \zeta_i} \phi(s) \tag{4.6}$$
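A direct implementation of Equation 4.6 is straightforward; the sketch below assumes trajectories are stored as lists of (state, action) pairs and a tabular feature map, which are our own representation choices.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi):
    """Empirical estimate of the expert's feature expectations under the
    train dynamics (Equation 4.6).

    trajectories : list of trajectories, each a list of (state, action) pairs
    phi          : (S, K) state features
    """
    mu = np.zeros(phi.shape[1])
    for zeta in trajectories:
        for state, _ in zeta:
            mu += phi[state]
    return mu / len(trajectories)
```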

4.2 Formulation

We now describe how to adapt the adversarial framework described in [11] to handle covariate shift. Our new formulation is a constrained zero-sum game where the learner chooses a stochastic policy $\hat\pi$ that minimizes a given loss under the test dynamics $\tau$, whereas the adversary chooses a stochastic policy $\check\pi$ that maximizes such loss while matching the feature expectations under the train dynamics $\tilde\tau$:

$$\min_{\hat\pi} \; \max_{\check\pi \in \Theta} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\pi\right] \tag{4.7}$$

where $\Theta$ is the set of all stochastic policies matching the expert's feature expectations (defined in Equation 4.6):

$$\check\pi \in \Theta \;\Leftrightarrow\; \mathbb{E}\left[\sum_{t=1}^{T} \phi(\check S_t) \,\middle|\, \tilde\tau, \check\pi\right] = \tilde\mu_E \tag{4.8}$$

The main idea of this new formulation is to have the adversary estimate the expert's policy by choosing the loss maximizer among the set of all policies matching the feature expectations under the train dynamics. Then, the learner chooses a policy minimizing the loss with respect to this estimate under the test dynamics. As proved in [11], when features are sufficiently rich, the constraint set contains only the expert's policy, and the learner achieves the minimum theoretical loss. Notice that the difference in embodiment considered in [11] is removed, and the new formulation deals with covariate shift instead.

As can be seen from this formulation, we do not require the true reward to be linear in given features, nor do we need to know such features. These assumptions are only needed to provide guarantees on the final expected reward. As long as the chosen features sufficiently constrain the adversary's choice, we are sure the learned policy is a good imitator of the expert's.


As described in [11], we reduce the constrained zero-sum game of Equation 4.7 to an unconstrained zero-sum game parameterized by a vector of Lagrange multipliers. Theorem 4.1 formulates our new game.

Theorem 4.1. The constrained zero-sum game of Equation 4.7 can be reduced to the following optimization problem:

$$\min_{w} \; \min_{\hat\pi} \; \max_{\check\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\pi\right] + w^\top \left( \mathbb{E}\left[\sum_{t=1}^{T} \phi(\check S_t) \,\middle|\, \tilde\tau, \check\pi\right] - \tilde\mu_E \right) \tag{4.9}$$

Proof. This result follows immediately from [11] and [27]. □

4.3 Learning Algorithm

We now present the learning algorithm we employ to solve the optimization problem of Equation 4.9. We consider the outermost minimization over the Lagrange multipliers:

$$\min_{w} f(w) \tag{4.10}$$

where the function $f$ is defined as:

$$f(w) = \min_{\hat\pi} \; \max_{\check\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\pi\right] + w^\top \left( \mathbb{E}\left[\sum_{t=1}^{T} \phi(\check S_t) \,\middle|\, \tilde\tau, \check\pi\right] - \tilde\mu_E \right) + \frac{\lambda}{2}\|w\|^2$$

Notice that we add a regularization term to the objective function. This is motivated in [8] by the fact that a small amount of regularization (controlled by the parameter $\lambda$) helps in dealing with the uncertainty in the data.

Theorem 4.2 states a fundamental result that allows us to find a simple algorithm to solve the problem of Equation 4.10.

Theorem 4.2. The function $f(w)$ of Equation 4.10 is convex.

Proof. We start by rewriting $f$ in the more concise form:

$$f(w) = \min_{\hat\pi} \; \max_{\check\pi} \; L(\hat\pi, \check\pi) + w^\top (F(\check\pi) - \tilde\mu_E) + \frac{\lambda}{2}\|w\|^2 \tag{4.11}$$

where $L$ and $F$ are, respectively:

$$L(\hat\pi, \check\pi) = \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\pi\right] \tag{4.12}$$

$$F(\check\pi) = \mathbb{E}\left[\sum_{t=1}^{T} \phi(\check S_t) \,\middle|\, \tilde\tau, \check\pi\right] \tag{4.13}$$


In order to prove that $f$ is convex, we need to show that [26]:

$$f(\theta w_1 + (1-\theta) w_2) \leq \theta f(w_1) + (1-\theta) f(w_2) \quad \forall \theta \in [0,1],\; \forall w_1, w_2 \tag{4.14}$$

Since the regularization term is a well-known convex function, we prove the above-mentioned property only for the remaining part of $f$. Then, convexity follows from the fact that the sum of convex functions is also convex [26]. By expanding Equation 4.14 we obtain:

$$
\begin{aligned}
&\min_{\hat\pi} \max_{\check\pi} \; L(\hat\pi, \check\pi) + (\theta w_1 + (1-\theta) w_2)^\top (F(\check\pi) - \tilde\mu_E) \\
&\quad= \min_{\hat\pi} \max_{\check\pi} \left\{ L(\hat\pi, \check\pi) + (\theta w_1 + (1-\theta) w_2)^\top F(\check\pi) \right\} - (\theta w_1 + (1-\theta) w_2)^\top \tilde\mu_E \\
&\quad\leq \theta \min_{\hat\pi} \max_{\check\pi} \left[ L(\hat\pi, \check\pi) + w_1^\top (F(\check\pi) - \tilde\mu_E) \right] + (1-\theta) \min_{\hat\pi} \max_{\check\pi} \left[ L(\hat\pi, \check\pi) + w_2^\top (F(\check\pi) - \tilde\mu_E) \right] \\
&\quad= \min_{\hat\pi} \max_{\check\pi} \left\{ \theta L(\hat\pi, \check\pi) + \theta w_1^\top F(\check\pi) \right\} - \theta w_1^\top \tilde\mu_E + \min_{\hat\pi} \max_{\check\pi} \left\{ (1-\theta) L(\hat\pi, \check\pi) + (1-\theta) w_2^\top F(\check\pi) \right\} - (1-\theta) w_2^\top \tilde\mu_E \\
&\quad= \min_{\hat\pi} \max_{\check\pi} \left\{ \theta L(\hat\pi, \check\pi) + \theta w_1^\top F(\check\pi) \right\} + \min_{\hat\pi} \max_{\check\pi} \left\{ (1-\theta) L(\hat\pi, \check\pi) + (1-\theta) w_2^\top F(\check\pi) \right\} - (\theta w_1 + (1-\theta) w_2)^\top \tilde\mu_E
\end{aligned}
$$

Notice that the terms $(\theta w_1 + (1-\theta) w_2)^\top \tilde\mu_E$ cancel out from both sides. By applying the triangle inequality of the max function, we can now write:

$$
\begin{aligned}
&\min_{\hat\pi} \max_{\check\pi} \; L(\hat\pi, \check\pi) + (\theta w_1 + (1-\theta) w_2)^\top F(\check\pi) \\
&\quad= \min_{\hat\pi} \max_{\check\pi} \; \theta L(\hat\pi, \check\pi) + (1-\theta) L(\hat\pi, \check\pi) + \theta w_1^\top F(\check\pi) + (1-\theta) w_2^\top F(\check\pi) \\
&\quad\leq \min_{\hat\pi} \max_{\check\pi} \left[ \theta L(\hat\pi, \check\pi) + \theta w_1^\top F(\check\pi) \right] + \min_{\hat\pi} \max_{\check\pi} \left[ (1-\theta) L(\hat\pi, \check\pi) + (1-\theta) w_2^\top F(\check\pi) \right] \\
&\quad= \theta \min_{\hat\pi} \max_{\check\pi} \left[ L(\hat\pi, \check\pi) + w_1^\top F(\check\pi) \right] + (1-\theta) \min_{\hat\pi} \max_{\check\pi} \left[ L(\hat\pi, \check\pi) + w_2^\top F(\check\pi) \right]
\end{aligned}
$$

which concludes the proof. □

Given the result of Theorem 4.2, we can minimize the function $f(w)$ using standard convex optimization techniques [26]. Before doing that, we need to show how we can compute the gradient of $f$.

4.3.1 Gradient Estimation

It is easy to show that the gradient of $f$ with respect to the Lagrange multipliers $w$ is:

$$\nabla_w f(w) = \mathbb{E}\left[\sum_{t=1}^{T} \phi(S_t) \,\middle|\, \check\pi^\star, \tilde\tau\right] - \tilde\mu_E + \lambda w \tag{4.15}$$


where $\check\pi^\star$ is the adversary's equilibrium strategy. Thus, in order to estimate the gradient of $f$, we need to compute a Nash equilibrium of the zero-sum game of Equation 4.7. That is, we want to solve the optimization problem:

$$\min_{\hat\pi} \; \max_{\check\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\pi\right] + \mathbb{E}\left[\sum_{t=1}^{T} w^\top \phi(\check S_t) \,\middle|\, \tilde\tau, \check\pi\right] \tag{4.16}$$

Notice that we can neglect the feature expectations $\tilde\mu_E$ since they are constant terms and do not affect the equilibrium strategy.

As described in [11], we consider deterministic policies as the basic strategies of each player and denote them by the letter $\delta$. Then, we represent stochastic policies as mixtures of deterministic ones. Since there are two different definitions of mixture policy, we specify the one we consider in Definition 4.1.

Definition 4.1. Given a set $P$ of $N$ deterministic policies $\{\delta_1, \delta_2, \dots, \delta_N\}$ and mixing coefficients $\alpha_1, \alpha_2, \dots, \alpha_N$, such that $\alpha_i \geq 0$ and $\sum_{i=1}^{N} \alpha_i = 1$, the mixture policy $\pi$ of $P$ is the stochastic policy where, at the first time step, one of the $N$ deterministic policies in $P$ is chosen with probability given by $\alpha_i$ and then deterministically used throughout the whole process.

Notice the difference between Definition 4.1 and the more general case where mixtures are fully stochastic policies obtained by mixing the deterministic ones at each time step (and not only at the beginning of the process). Although more restrictive, our definition is necessary to allow the computation of expectations under the mixture policy as a linear combination of the expectations under the deterministic policies (which is necessary when we want to use the mixture policy to match feature counts). Notice also that there always exists a stochastic policy that achieves the same expectation as our mixture policy, thus we are able to match the features even when the demonstrator's policy is stochastic.

Since the payoff matrix that would result from these assumptions is exponential in the number of states [11], we use the double oracle method to iteratively build the matrix and find a Nash equilibrium. The algorithm we employ is exactly the one described in [11], thus we do not specify it here. The only big difference is in the two subroutines to compute the best responses of each player. We now show how this can be achieved.

Learner's best response. Given the adversary's mixed strategy $\check\pi$, the learner's best response is the deterministic policy $\hat\delta$ given by:

$$BR_{\min}(\check\pi) = \operatorname*{argmin}_{\hat\delta} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\delta, \check\pi\right] \tag{4.17}$$


This can be solved as a simple optimal control problem by considering an MDP with dynamics $\tau$ and reward:

$$R(s_t) = \mathbb{E}\left[ -\mathrm{loss}(s_t, \check S_t) \,\middle|\, \tau, \check\pi \right] \tag{4.18}$$

Notice that the minus sign in front of the loss function is necessary since we are minimizing. Notice also that the feature-matching terms are not controlled by the learner's strategy, thus they can be safely omitted. Value iteration can be used to efficiently solve this problem, as proposed in [11].
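A sketch of this best-response computation is given below: it builds the time-dependent reward of Equation 4.18 from the adversary's state distribution (assumed here to be precomputed) and runs finite-horizon value iteration under the test dynamics. Names and array shapes are our own assumptions.

```python
import numpy as np

def learner_best_response(loss, tau, p_check, T):
    """Learner's best response (Equation 4.17) via value iteration on an MDP
    whose reward is the expected negative loss (Equation 4.18).

    loss    : (S, S) matrix, loss[s_hat, s_check]
    tau     : (S, A, S) test dynamics
    p_check : (T, S) state distribution induced by the adversary's mixed
              strategy pi_check under tau
    """
    S, A, _ = tau.shape
    R = -(loss @ p_check.T).T                 # R_t(s) = -E[loss(s, S_check_t)], shape (T, S)
    V = np.zeros((T, S))
    delta = np.zeros((T, S), dtype=int)
    V[T - 1] = R[T - 1]
    for t in range(T - 2, -1, -1):
        Q = R[t][:, None] + tau @ V[t + 1]    # (S, A)
        V[t] = Q.max(axis=1)
        delta[t] = Q.argmax(axis=1)
    return delta                              # deterministic policy delta_hat[t, s]
```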

Adversary's best response. The computation of the adversary's best response, on the other hand, represents the main difficulty of this work. Given the learner's mixed strategy $\hat\pi$, the adversary's best response is the deterministic policy $\check\delta$ given by:

$$BR_{\max}(\hat\pi) = \operatorname*{argmax}_{\check\delta} \; \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{loss}(\hat S_t, \check S_t) \,\middle|\, \tau, \hat\pi, \check\delta\right] + \mathbb{E}\left[\sum_{t=1}^{T} w^\top \phi(\check S_t) \,\middle|\, \tilde\tau, \check\delta\right] \tag{4.19}$$

This is the problem of finding the optimal deterministic policy that maximizes the total expected reward over two different MDPs, which has been shown to be NP-hard [28]. Since the solution of this problem is a large part of this work and should be analyzed in a more general context, its description is deferred to the next chapter. At this point, the reader should simply assume the existence of a procedure to tractably compute $BR_{\max}$.

4.3.2 Gradient Descent

We are now ready to describe our learning algorithm, which is shown in Algorithm 1. We adopt a simple gradient descent procedure which uses the double oracle method to estimate the gradient. We stop when the $L_1$-norm of the gradient divided by the number of features is less than a given threshold $\epsilon$. Since the gradient is the difference between the feature expectations under the estimated policy $\check\pi^\star$ and the expert's feature expectations, we are sure that, upon termination, all features are, on average, matched with an error of at most $\epsilon$.


Algorithm 1: Gradient Descent for Imitation Learning under Covariate Shift

Input: learning rate $\eta$, convergence threshold $\epsilon$, regularization weight $\lambda$
Output: weights $w$, imitation policy $\hat\delta$

Initialize weights: $w \leftarrow \mathcal{N}(0, 1)$
while $\frac{1}{K}\|\nabla_w f(w)\|_1 > \epsilon$ do
    $(\hat\pi^\star, \check\pi^\star) \leftarrow \mathrm{doubleOracle}(w)$
    $\nabla_w f(w) \leftarrow \mathbb{E}\left[\sum_{t=1}^{T} \phi(S_t) \mid \check\pi^\star, \tilde\tau\right] - \tilde\mu_E + \lambda w$
    $w \leftarrow w - \eta\, \nabla_w f(w)$
end
Compute imitation policy: $\hat\delta \leftarrow BR_{\min}(\check\pi^\star)$

After gradient descent terminates, the final estimate $\check\pi^\star$ of the expert's policy is used to compute the best (deterministic) imitation policy. Furthermore, the final weights $w$ returned by our algorithm can be used to estimate the reward function being optimized by the expert as:

$$R(s) = w^\top \phi(s) \tag{4.20}$$
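The outer loop of Algorithm 1 can be sketched as follows; the helpers `double_oracle`, `feature_expectations`, and `best_response_min` are hypothetical stand-ins for the procedures described in this chapter, not actual library functions.

```python
import numpy as np

def adversarial_il(eta, eps, lam, K, double_oracle, feature_expectations,
                   best_response_min, mu_E_tilde):
    """Gradient descent of Algorithm 1 (a sketch; helper functions assumed)."""
    w = np.random.normal(0.0, 1.0, size=K)             # w ~ N(0, 1)
    while True:
        pi_hat_star, pi_check_star = double_oracle(w)  # equilibrium strategies
        grad = feature_expectations(pi_check_star) - mu_E_tilde + lam * w
        if np.linalg.norm(grad, 1) / K <= eps:          # all features matched on average
            break
        w = w - eta * grad
    delta_hat = best_response_min(pi_check_star)        # final imitation policy
    return w, delta_hat
```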


5  MULTIPLE-MDP OPTIMIZATION

In Chapter 4, we proved that a difficult sub-problem arising from our adversarial formulation is the computation of a deterministic policy maximizing the total expected reward over two different MDPs. In this chapter, we analyze the more general problem where we are given multiple MDPs and must find a deterministic policy that maximizes the total reward from all of them. We start by formally defining this problem in Section 5.1. Then, we show how we can reduce its solution to the optimal control of POMDPs [12] in Section 5.2, and we propose a modified version of PBVI [18] that solves this reduction in Section 5.3. Finally, Section 5.4 shows how we can use this result to tractably find the adversary's best response in our adversarial framework.

5.1 Problem Definition

Suppose we are given a set of $D$ MDPs $\mathcal{M} = \{M_1, M_2, \dots, M_D\}$, all defined over state space $S$ and action space $A$. Each MDP $M_j \in \mathcal{M}$ is a tuple $\langle S, A, \tau_j, R_j, p_0 \rangle$, where $\tau_j(S_{1:T} \,\|\, A_{1:T-1})$ are the state-transition dynamics, $R_j(S_t, A_t, S_{t+1})$ is the reward function, and $p_0(S)$ is the initial state distribution. Our goal is to find a deterministic policy $\pi$ that maximizes the total expected reward from all MDPs in $\mathcal{M}$. Formally, we want to solve the optimization problem:

$$\operatorname*{argmax}_{\pi} \; \sum_{j=1}^{D} \mathbb{E}\left[\sum_{t=1}^{T-1} R_j(S_t, A_t, S_{t+1}) \,\middle|\, \tau_j, \pi\right] \tag{5.1}$$

Unfortunately, this problem is NP-hard [28]. It is easy to show that the policy achieving the maximum is non-Markovian, i.e., it depends on the whole state-action sequence and not only on the current state. The next section proposes a tractable approximation that makes the policy Markovian by introducing a new continuous variable.

5.2 Approximate Dynamic Programming

The optimization problem of Equation 5.1 is not practical to solve using classic dynamic programming since the optimal policy is non-Markovian. Therefore, we propose an approximate dynamic program that makes the optimal policy Markovian by incorporating knowledge of the state-action sequences into a new continuous variable. We call the latter the "belief state", in analogy to POMDPs. Our main result is given in Theorem 5.1.

Theorem 5.1. The optimization problem of Equation 5.1 can be solved by considering the following dynamic program. We define a "belief state" vector $b$ incorporating knowledge of the transition probabilities from all MDPs in $\mathcal{M}$ as:

$$b_t \overset{\text{def}}{=} \frac{1}{\sum_{j=1}^{D} \tau_j(s_{1:t} \,\|\, a_{1:t-1})} \begin{bmatrix} \tau_1(s_{1:t} \,\|\, a_{1:t-1}) \\ \tau_2(s_{1:t} \,\|\, a_{1:t-1}) \\ \vdots \\ \tau_D(s_{1:t} \,\|\, a_{1:t-1}) \end{bmatrix} \tag{5.2}$$

Then, the state-action value and state value functions are, respectively:

$$Q(s_t, a_t, b_t) = \sum_{j=1}^{D} b_{t,j} \sum_{s_{t+1}} \tau_j(s_{t+1} \mid s_t, a_t) \left[ R_j(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) \right] \tag{5.3}$$

$$V(s_t, b_t) = \max_{a_t} Q(s_t, a_t, b_t) \tag{5.4}$$

The next belief state $b_{t+1}$ can be computed from $b_t$, given $s_t$, $a_t$ and $s_{t+1}$, as:

$$b_{t+1} = \frac{1}{\sum_{j=1}^{D} b_{t,j}\, \tau_j(s_{t+1} \mid s_t, a_t)} \; b_t \odot \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t) \\ \tau_2(s_{t+1} \mid s_t, a_t) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t) \end{bmatrix} \tag{5.5}$$

where the symbol $\odot$ denotes the point-wise (or element-wise) product of the two vectors. Finally, the optimal deterministic policy is:

$$\pi^\star(s_t, b_t) = \operatorname*{argmax}_{a_t} Q(s_t, a_t, b_t) \tag{5.6}$$

Proof. Suppose we reach time step $T-1$ after observing state sequence $s_{1:T-1}$ and action sequence $a_{1:T-2}$, and we must choose the last action $a_{T-1}$, which leads to the final state $s_T$. The contribution of this last decision to the total expected reward is:

$$
\begin{aligned}
\sum_{j=1}^{D} \mathbb{E}\left[ R_j(S_{T-1}, A_{T-1}, S_T) \mid \tau_j, \pi \right]
&= \sum_{j=1}^{D} \tau_j(s_{1:T-1} \,\|\, a_{1:T-2})\, \pi(a_{1:T-2} \,\|\, s_{1:T-2}) \sum_{a_{T-1}} \pi(a_{T-1} \mid a_{1:T-2}, s_{1:T-1}) \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T) \\
&= \sum_{a_{T-1}} \pi(a_{1:T-1} \,\|\, s_{1:T-1}) \sum_{j=1}^{D} \tau_j(s_{1:T-1} \,\|\, a_{1:T-2}) \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T)
\end{aligned}
$$

We want to find the best action $a_{T-1}$ which maximizes this expectation, namely:

$$
\begin{aligned}
&\operatorname*{argmax}_{a_{T-1}} \sum_{j=1}^{D} \tau_j(s_{1:T-1} \,\|\, a_{1:T-2}) \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T) \\
&\quad= \operatorname*{argmax}_{a_{T-1}} \frac{\sum_{j=1}^{D} \tau_j(s_{1:T-1} \,\|\, a_{1:T-2}) \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T)}{\sum_{j=1}^{D} \tau_j(s_{1:T-1} \,\|\, a_{1:T-2})} \\
&\quad= \operatorname*{argmax}_{a_{T-1}} \sum_{j=1}^{D} b_{T-1,j} \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T)
\end{aligned}
\tag{5.7}
$$

where we define the belief state $b_t$ as:

$$b_t \overset{\text{def}}{=} \frac{1}{\sum_{j=1}^{D} \tau_j(s_{1:t} \,\|\, a_{1:t-1})} \begin{bmatrix} \tau_1(s_{1:t} \,\|\, a_{1:t-1}) \\ \tau_2(s_{1:t} \,\|\, a_{1:t-1}) \\ \vdots \\ \tau_D(s_{1:t} \,\|\, a_{1:t-1}) \end{bmatrix} = \frac{1}{\sum_{j=1}^{D} \prod_{i=1}^{t} \tau_j(s_i \mid s_{i-1}, a_{i-1})} \begin{bmatrix} \prod_{i=1}^{t} \tau_1(s_i \mid s_{i-1}, a_{i-1}) \\ \prod_{i=1}^{t} \tau_2(s_i \mid s_{i-1}, a_{i-1}) \\ \vdots \\ \prod_{i=1}^{t} \tau_D(s_i \mid s_{i-1}, a_{i-1}) \end{bmatrix} \tag{5.8}$$

Notice that the last equality holds since we are considering only Markovian dynamics. Furthermore, the dependence on the whole state-action sequence has now been removed: the maximization over actions depends only on the current state and belief state. We now define the state-action value function at time step $T-1$ as the part of Equation 5.7 that is being maximized:

$$Q(s_{T-1}, a_{T-1}, b_{T-1}) = \sum_{j=1}^{D} b_{T-1,j} \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T) \tag{5.9}$$

and the state value function at time step T − 1 as: V(sT −1,bT −1) = max

aT −1

Q(sT −1, aT −1,bT −1) (5.10)

Finally, the computation of the optimal action at time step T − 1 can be reformu-lated in terms of the state-action value function as:

π?(sT −1,bT −1) = argmax

aT −1

Q(sT −1, aT −1,bT −1) (5.11) We now prove the update rule for the belief state. If the current belief state is bt,

we take action $a_t$ from state $s_t$ and we end up in state $s_{t+1}$, the $i$-th component of the next belief state is:

$$b_{t+1,i} = \frac{b_{t,i}\, \tau_i(s_{t+1} \mid s_t, a_t)}{\sum_{j=1}^{D} b_{t,j}\, \tau_j(s_{t+1} \mid s_t, a_t)} \tag{5.12}$$

To see this, we substitute the definition of $b_t$ in Equation 5.12:

$$b_{t+1,i} = \frac{\dfrac{\tau_i(s_{1:t} \,\|\, a_{1:t-1})}{\sum_{j=1}^{D} \tau_j(s_{1:t} \,\|\, a_{1:t-1})}\, \tau_i(s_{t+1} \mid s_t, a_t)}{\sum_{j=1}^{D} \dfrac{\tau_j(s_{1:t} \,\|\, a_{1:t-1})}{\sum_{k=1}^{D} \tau_k(s_{1:t} \,\|\, a_{1:t-1})}\, \tau_j(s_{t+1} \mid s_t, a_t)} = \frac{\tau_i(s_{1:t} \,\|\, a_{1:t-1})\, \tau_i(s_{t+1} \mid s_t, a_t)}{\sum_{j=1}^{D} \tau_j(s_{1:t} \,\|\, a_{1:t-1})\, \tau_j(s_{t+1} \mid s_t, a_t)} = \frac{\tau_i(s_{1:t+1} \,\|\, a_{1:t})}{\sum_{j=1}^{D} \tau_j(s_{1:t+1} \,\|\, a_{1:t})} \tag{5.13}$$

and we get that the last equation is exactly the $i$-th component of $b_{t+1}$. We now move to time step $T-2$. The state-action value function can be easily modified to account for the immediate reward and the value of the next state:

$$Q(s_{T-2}, a_{T-2}, b_{T-2}) = \sum_{j=1}^{D} b_{T-2,j} \sum_{s_{T-1}} \tau_j(s_{T-1} \mid s_{T-2}, a_{T-2}) \Big[ R_j(s_{T-2}, a_{T-2}, s_{T-1}) + V(s_{T-1}, b_{T-1}) \Big]$$

where the state value function is again:

$$V(s_{T-2}, b_{T-2}) = \max_{a_{T-2}} Q(s_{T-2}, a_{T-2}, b_{T-2}) \tag{5.14}$$

and the optimal action is:

$$\pi^\star(s_{T-2}, b_{T-2}) = \operatorname*{argmax}_{a_{T-2}} Q(s_{T-2}, a_{T-2}, b_{T-2}) \tag{5.15}$$

We can now continue this procedure backward up to the first time step, thus concluding the proof. □
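To make the recursion of Theorem 5.1 concrete, the following minimal Python/NumPy sketch implements the belief update of Equation 5.5 and the evaluation of Equations 5.3-5.4 for a single time step. The array layout (`transition_probs[j, s, a, s']` for $\tau_j(s' \mid s, a)$ and `rewards[j, s, a, s']` for $R_j(s, a, s')$) and the function names are illustrative assumptions, not the implementation used in this thesis; the next-step value enters through an arbitrary callable `V_next`, anticipating the point-based approximation introduced later.

```python
import numpy as np

def belief_update(b, s, a, s_next, transition_probs):
    """Equation 5.5: element-wise product of the current belief with the
    per-MDP transition probabilities, followed by normalization."""
    taus = transition_probs[:, s, a, s_next]        # tau_j(s_next | s, a), shape (D,)
    unnormalized = b * taus
    return unnormalized / unnormalized.sum()

def q_value(s, a, b, transition_probs, rewards, V_next):
    """Equations 5.3-5.4: belief-weighted expected immediate reward plus the
    value of the next (state, belief) pair. V_next(s_next, b_next) -> float."""
    n_states = transition_probs.shape[1]
    q = 0.0
    for s_next in range(n_states):
        taus = transition_probs[:, s, a, s_next]                 # shape (D,)
        q += float(np.dot(b, taus * rewards[:, s, a, s_next]))   # reward term
        p_next = float(np.dot(b, taus))                          # P(s' | s, a, b)
        if p_next > 0.0:
            q += p_next * V_next(s_next, belief_update(b, s, a, s_next, transition_probs))
    return q
```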

Theorem 5.1 defines a dynamic program that can be used to solve the optimization problem of Equation 5.1. However, its solution is still not trivial. In order to make everything Markovian, we introduce a continuous variable that prevents us from using any tabular representation of policies and value functions. The simplest solution is belief discretization: for a D-dimensional belief state vector, we partition the space $\mathbb{R}^D$ into a finite set of hypercubes and discretize the belief over such partition. Then, we are able to solve the dynamic program of Theorem 5.1 using a tabular representation. However, the number of discretized belief states necessary to get a satisfactory approximation grows exponentially with the belief dimension D. Thus, this algorithm would be practical only for low-dimensional beliefs. The next section analyzes some properties of this formulation that allow a much more efficient solution.
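To illustrate why this naive tabulation breaks down, the following hedged sketch enumerates a regular grid on the $(D-1)$-simplex (where the belief actually lives, as discussed in the next subsection) with resolution $1/k$ per coordinate. The helper name is hypothetical, and restricting the grid to the simplex rather than all of $\mathbb{R}^D$ is an assumption about the intended construction.

```python
from itertools import product
from math import comb

def simplex_grid(D, k):
    """Belief vectors whose components are multiples of 1/k and sum to one:
    a regular discretization of the (D-1)-simplex. Only feasible for small D."""
    return [tuple(c / k for c in counts)
            for counts in product(range(k + 1), repeat=D) if sum(counts) == k]

# The table size is C(k + D - 1, D - 1), which explodes with the belief dimension D.
for D in (2, 3, 5, 10, 20):
    print(D, comb(10 + D - 1, D - 1))

print(len(simplex_grid(D=3, k=10)))   # 66 tabular belief states for D = 3, k = 10
```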

5.2.1 Dynamic Program Properties

We analyze some properties of the dynamic program introduced in Theorem 5.1, focusing on its relation to POMDPs.

Let us start by commenting on the meaning of the belief state. Although the name is derived in analogy to POMDPs, the two definitions are quite different. In a POMDP, the belief state is an $|S|$-dimensional vector whose $i$-th component represents the probability of the process being in the $i$-th state. Thus, the belief state vector belongs to the $(|S|-1)$-dimensional simplex. In our case, the belief state is a $D$-dimensional vector whose $i$-th component represents the (normalized) probability of the state-action sequence observed up to the current time under the $i$-th dynamics. Since our belief is normalized, it belongs to the $(D-1)$-dimensional simplex.

The other main difference with respect to POMDPs is in the value function $V$. In our case, $V$ is a function of both the state $s_t$ and the belief state $b_t$, and represents the expected future reward that we get starting from $s_t$, executing the optimal policy $\pi^\star$, and given that the state-transition probabilities are encoded into $b_t$. In a POMDP, $V$ is a function of the belief state $b$ alone, and represents the expected future reward the agent obtains considering that the unknown current state is distributed according to $b$.


Although many differences exist, our value function has a property that allows us to solve the dynamic program of Theorem 5.1 using (modified) POMDP algorithms. Such property is given in Theorem 5.2.

Theorem 5.2. The value function defined in Equation 5.4 is Piece-Wise Linear and Convex (PWLC) in the belief state $b_t$.

Proof. We prove the theorem by induction. By replacing $Q$ with its definition, we can write the value function at time step $T-1$ as:

$$V(s_{T-1}, b_{T-1}) = \max_{a_{T-1}} \sum_{j=1}^{D} b_{T-1,j} \sum_{s_T} \tau_j(s_T \mid s_{T-1}, a_{T-1})\, R_j(s_{T-1}, a_{T-1}, s_T) = \max_{a_{T-1}}\; b_{T-1}^{\top} \begin{bmatrix} \sum_{s_T} \tau_1(s_T \mid s_{T-1}, a_{T-1})\, R_1(s_{T-1}, a_{T-1}, s_T) \\ \sum_{s_T} \tau_2(s_T \mid s_{T-1}, a_{T-1})\, R_2(s_{T-1}, a_{T-1}, s_T) \\ \vdots \\ \sum_{s_T} \tau_D(s_T \mid s_{T-1}, a_{T-1})\, R_D(s_{T-1}, a_{T-1}, s_T) \end{bmatrix} \tag{5.16}$$

The function that is maximized in 5.16 is clearly linear in $b_{T-1}$ for any choice of state $s_{T-1}$ and action $a_{T-1}$. Thus, $V(s_{T-1}, b_{T-1})$ is the maximum over a set of linear functions, i.e., it is PWLC.

Suppose now the value function is PWLC at time step $t+1$. This means that it can be written as a maximization over a set of hyperplanes:

$$V(s_{t+1}, b_{t+1}) = \max_{k}\; b_{t+1}^{\top} \alpha_k(s_{t+1}) \tag{5.17}$$

where $\alpha_k(s_{t+1})$ are the normal directions to such hyperplanes (we call them alpha-vectors in the remainder, again in analogy to POMDPs). We need to prove that this implies that the state value function is PWLC at time step $t$. This can be written as:

$$\begin{aligned}
V(s_t, b_t) &= \max_{a_t}\; b_t^{\top} \begin{bmatrix} \sum_{s_{t+1}} \tau_1(s_{t+1} \mid s_t, a_t) \big[ R_1(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) \big] \\ \sum_{s_{t+1}} \tau_2(s_{t+1} \mid s_t, a_t) \big[ R_2(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) \big] \\ \vdots \\ \sum_{s_{t+1}} \tau_D(s_{t+1} \mid s_t, a_t) \big[ R_D(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) \big] \end{bmatrix} \\
&= \max_{a_t} \sum_{s_{t+1}} \left\{ b_t^{\top} \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t)\, R_1(s_t, a_t, s_{t+1}) \\ \tau_2(s_{t+1} \mid s_t, a_t)\, R_2(s_t, a_t, s_{t+1}) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t)\, R_D(s_t, a_t, s_{t+1}) \end{bmatrix} + V(s_{t+1}, b_{t+1})\; b_t^{\top} \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t) \\ \tau_2(s_{t+1} \mid s_t, a_t) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t) \end{bmatrix} \right\}
\end{aligned} \tag{5.18}$$

By expanding the definition of $b_{t+1}$ in the PWLC representation of $V(s_{t+1}, b_{t+1})$ given in 5.17, we get:

$$V(s_{t+1}, b_{t+1}) = \frac{1}{\sum_{j=1}^{D} b_{t,j}\, \tau_j(s_{t+1} \mid s_t, a_t)} \max_{k} \sum_{j=1}^{D} b_{t,j}\, \tau_j(s_{t+1} \mid s_t, a_t)\, \alpha_{k,j}(s_{t+1}) \tag{5.19}$$

If we now substitute this into 5.18, we notice that the normalization term of 5.19 and the last term of 5.18 cancel out. Thus, we have:

$$\begin{aligned}
V(s_t, b_t) &= \max_{a_t} \sum_{s_{t+1}} \left\{ b_t^{\top} \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t)\, R_1(s_t, a_t, s_{t+1}) \\ \tau_2(s_{t+1} \mid s_t, a_t)\, R_2(s_t, a_t, s_{t+1}) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t)\, R_D(s_t, a_t, s_{t+1}) \end{bmatrix} + \max_{k}\; b_t^{\top} \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t)\, \alpha_{k,1}(s_{t+1}) \\ \tau_2(s_{t+1} \mid s_t, a_t)\, \alpha_{k,2}(s_{t+1}) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t)\, \alpha_{k,D}(s_{t+1}) \end{bmatrix} \right\} \\
&= \max_{a_t} \left\{ b_t^{\top} \begin{bmatrix} \sum_{s_{t+1}} \tau_1(s_{t+1} \mid s_t, a_t)\, R_1(s_t, a_t, s_{t+1}) \\ \sum_{s_{t+1}} \tau_2(s_{t+1} \mid s_t, a_t)\, R_2(s_t, a_t, s_{t+1}) \\ \vdots \\ \sum_{s_{t+1}} \tau_D(s_{t+1} \mid s_t, a_t)\, R_D(s_t, a_t, s_{t+1}) \end{bmatrix} + \sum_{s_{t+1}} \max_{k}\; b_t^{\top} \begin{bmatrix} \tau_1(s_{t+1} \mid s_t, a_t)\, \alpha_{k,1}(s_{t+1}) \\ \tau_2(s_{t+1} \mid s_t, a_t)\, \alpha_{k,2}(s_{t+1}) \\ \vdots \\ \tau_D(s_{t+1} \mid s_t, a_t)\, \alpha_{k,D}(s_{t+1}) \end{bmatrix} \right\}
\end{aligned} \tag{5.20}$$

We see that the function being maximized over actions is the sum of a linear term and a set of |S| PWLC functions, i.e., it is again a PWLC function of $b_t$. Since the maximum over a set of PWLC functions is PWLC, this concludes the proof. □

Theorem 5.2 proves a fundamental property: as for POMDPs, our value function is PWLC. This means that the alpha-vectors, the normal directions to the hyperplanes describing the function, are sufficient to fully represent it. Algorithms for solving POMDPs rely on this fact to efficiently approximate the value function and, thus, the optimal policy. Since the PWLC property of V is the only requirement to apply such algorithms, we have reduced the solution of our problem to that of a partially observable domain.
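As a small illustration of what the PWLC property buys us, a value function at a given time step can be stored as a set of alpha-vectors per state and evaluated by a maximum of dot products. This is a hedged sketch with illustrative names, not the thesis code:

```python
import numpy as np

def evaluate_value(s, b, alpha_vectors):
    """PWLC evaluation: V(s, b) = max_k  b^T alpha_k(s).
    alpha_vectors[s] is a list of length-D arrays, one per hyperplane."""
    return max(float(np.dot(b, alpha)) for alpha in alpha_vectors[s])

def maximizing_alpha(s, b, alpha_vectors):
    """The alpha-vector attaining the maximum, i.e. the linear piece active at b."""
    return max(alpha_vectors[s], key=lambda alpha: float(np.dot(b, alpha)))
```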

Many algorithms have been proposed in the POMDP literature. In this work, we adopt point-based procedures, which approximate the value function by storing a finite set of representative belief states and computing one alpha-vector for each of them. Since, as we proved at the beginning of this section, there are a few differences between our dynamic program and that of a POMDP, we need to extend such algorithms to solve our case. The next section shows how we can do this for PBVI [18].

5.3 Modified Point-Based Value Iteration

In the previous section, we proved that the value function defined in our dynamic program is PWLC. This allows the reduction of our problem to the optimal control of a POMDP. In this section, we show how we can adapt PBVI [18], one of the most common and efficient algorithms for solving POMDPs, to our specific case.

We show the skeleton of our modified algorithm, which is very similar to the original one, in Algorithm 2, while we defer the description of the modified subroutines to the next two sections.


Algorithm 2: Modified Point-Based Value Iteration
Input: number of iterations N
Output: value function V = {V_1, V_2, ..., V_T}
Initialize belief set: B = {b_1}
for i = 1 to N do
    V_T(s_T) = ∅  ∀ s_T ∈ S
    for t = T − 1 to 1 do
        V_t = modifiedValueBackup(V_{t+1}, B)
    end
    B = modifiedBeliefExpansion(B)
end
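A hedged Python rendering of Algorithm 2, with `modified_value_backup` and `modified_belief_expansion` passed in as the subroutines described in the next two sections; all names are illustrative, not the thesis implementation:

```python
def modified_pbvi(num_iterations, horizon_T, states, initial_belief,
                  modified_value_backup, modified_belief_expansion):
    """Sketch of Algorithm 2: alternate full backward sweeps of value backups
    with expansions of the belief set."""
    belief_set = [initial_belief]
    value = {}                                       # value[t][s] = list of alpha-vectors
    for _ in range(num_iterations):
        value[horizon_T] = {s: [] for s in states}   # V_T(s_T) = empty set
        for t in range(horizon_T - 1, 0, -1):        # t = T-1, ..., 1
            value[t] = modified_value_backup(value[t + 1], belief_set)
        belief_set = modified_belief_expansion(belief_set)
    return value
```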

Notice immediately two differences with respect to the original algorithms presented in [18]. First, the initial belief state we use to initialize B is not a distribution over states (as in POMDPs), but is defined as:

$$b_1 = \frac{1}{\sum_{j=1}^{D} p_0(s_1)} \begin{bmatrix} p_0(s_1) \\ p_0(s_1) \\ \vdots \\ p_0(s_1) \end{bmatrix} = \begin{bmatrix} 1/D \\ 1/D \\ \vdots \\ 1/D \end{bmatrix} \tag{5.21}$$

where the last equality holds since we consider all MDPs to have the same initial state distribution. The second main difference is that, since we are considering a finite-horizon problem, our value function is represented by a different set of alpha-vectors at each time step. Thus, we write $\mathcal{V} = \{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_T\}$ and we run $T-1$ backups before we apply the modified belief expansion. Furthermore, since our value function depends on the state, we have a different set of alpha-vectors for each state and time step. Thus, we write each $\mathcal{V}_t$ as the set $\mathcal{V}_t = \{\mathcal{V}_t(s_t) \mid \forall s_t \in \mathcal{S}\}$. Notice that $\mathcal{V}_T(s_T) = \emptyset\ \forall s_T \in \mathcal{S}$ since there is no action to take at the last time step.

5.3.1 Modified Value Backup

Value backup updates the current value function given the current belief set B by computing, for each $b \in B$, the alpha-vector achieving the maximum over actions. Since our value function is very different from that of a POMDP, we need to adapt the original procedure. We use a notation similar to the one used in [18], so that the reader can easily understand the new formulation.


Our goal is to compute the alpha-vectors describing $V(s_t, b)$ for each $b \in B$. We denote the set containing such vectors as $\mathcal{V}_t(s_t)$. Notice that, from now on, we drop the time subscript from $b$ since we are keeping a time-independent set of belief states. We consider the representation of $V(s_t, b)$ given in Equation 5.20, which most highlights the different terms forming the alpha-vectors. Then, we define the following quantities, generalizing from the original algorithm [18], for each state $s_t \in \mathcal{S}$:

• $\Gamma_t^{a_t,\star}$: the vector multiplying $b$ in the first linear term of $V(s_t, b)$ for action $a_t$;

• $\Gamma_t^{a_t,s_{t+1}}$: the set of all vectors multiplying $b$ in the inner maximization of $V(s_t, b)$ for state $s_{t+1}$ and action $a_t$;

• $\Gamma_t^{a_t,b}$: the alpha-vector that is multiplied with $b$ to compute the outer maximum of $V(s_t, b)$ for action $a_t$ and belief $b \in B$.

In order to simplify the notation, we define $\boldsymbol{\tau}(s_{t+1} \mid s_t, a_t)$ as the vector containing $\tau_j(s_{t+1} \mid s_t, a_t)$ for each $j = 1, \ldots, D$. We do the same for the reward functions by defining $\boldsymbol{R}(s_t, a_t, s_{t+1})$. Given such quantities, the modified algorithm is shown in Algorithm 3.

Algorithm 3: Modified Value Backup
Input: value function V_{t+1}, belief set B
Output: updated value function V_t = {V_t(s_t) | ∀ s_t ∈ S}
foreach s_t ∈ S do
    Γ_t^{a_t,⋆} = Σ_{s_{t+1}} τ(s_{t+1} | s_t, a_t) ⊙ R(s_t, a_t, s_{t+1})
    Γ_t^{a_t,s_{t+1}} = { τ(s_{t+1} | s_t, a_t) ⊙ α(s_{t+1}) | ∀ α ∈ V_{t+1}(s_{t+1}) }
    Γ_t^{a_t,b} = Γ_t^{a_t,⋆} + Σ_{s_{t+1}} argmax_{α ∈ Γ_t^{a_t,s_{t+1}}} b^⊤ α
    V_t(s_t) = { argmax_{Γ_t^{a_t,b}, ∀a_t} b^⊤ Γ_t^{a_t,b} | ∀ b ∈ B }
end
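The following hedged Python sketch mirrors Algorithm 3 under the same illustrative array layout used earlier (`transition_probs[j, s, a, s']`, `rewards[j, s, a, s']`), with `value_next[s']` holding the alpha-vectors of $\mathcal{V}_{t+1}(s')$; it is a sketch of the backup, not the thesis implementation.

```python
import numpy as np

def modified_value_backup(value_next, belief_set, states, actions,
                          transition_probs, rewards):
    """For each state and each stored belief point, build the candidate
    alpha-vectors Gamma_t^{a,b} and keep the maximizing one."""
    value_t = {}
    for s in states:
        alphas = []
        for b in belief_set:
            best_alpha, best_val = None, -np.inf
            for a in actions:
                # Gamma_t^{a,*}: sum over next states of tau (element-wise) R
                gamma_ab = sum(transition_probs[:, s, a, sn] * rewards[:, s, a, sn]
                               for sn in states)
                # add, for each next state, the best projected alpha-vector in Gamma_t^{a,s'}
                for sn in states:
                    projected = [transition_probs[:, s, a, sn] * alpha
                                 for alpha in value_next[sn]]
                    if projected:
                        gamma_ab = gamma_ab + max(projected, key=lambda g: float(np.dot(b, g)))
                val = float(np.dot(b, gamma_ab))
                if val > best_val:
                    best_val, best_alpha = val, gamma_ab
            alphas.append(best_alpha)        # alpha-vector chosen for this belief point
        value_t[s] = alphas
    return value_t
```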

5.3.2 Modified Belief Expansion

Belief expansion updates the current belief set B by simulating every action and adding only the resulting belief state that is the farthest from B (if not already contained in B).
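A hedged sketch of how such an expansion step could look, adapted from the standard PBVI expansion [18] and using the belief update of Equation 5.5; the choice of simulated current state, the L1 distance, and all names are assumptions made for illustration, not the procedure used in this thesis.

```python
import numpy as np

def modified_belief_expansion(belief_set, states, actions, transition_probs, rng=None):
    """For each stored belief, simulate every action once and keep the candidate
    belief that is farthest, in L1 distance, from the current set."""
    rng = rng or np.random.default_rng()
    new_beliefs = list(belief_set)
    for b in belief_set:
        s = int(rng.integers(len(states)))      # illustrative choice of a current state
        candidates = []
        for a in actions:
            # mixture next-state distribution: P(s'|s, a, b) = sum_j b_j tau_j(s'|s, a)
            p_next = transition_probs[:, s, a, :].T @ b
            s_next = int(rng.choice(len(states), p=p_next / p_next.sum()))
            unnormalized = b * transition_probs[:, s, a, s_next]   # Equation 5.5
            candidates.append(unnormalized / unnormalized.sum())
        distances = [min(np.abs(c - bb).sum() for bb in new_beliefs) for c in candidates]
        if max(distances) > 0:
            new_beliefs.append(candidates[int(np.argmax(distances))])
    return new_beliefs
```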
