
POLITECNICO DI MILANO

School of Industrial and Information Engineering Department of Mathematics

Master of Science Degree in Mathematical Engineering

A study of Importance Sampling

Techniques for Policy Optimization

AI & R Lab

Laboratorio di Intelligenza Artificiale

e Robotica del Politecnico di Milano

Supervisor: Prof. Marcello Restelli

Co-supervisors: Dott. Alberto Maria Metelli

Dott. Matteo Papini

Prof. Jürgen Schmidhuber

Author:

Francesco Faccio, 863358


Abstract

Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective-function estimate. In this work, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.


Estratto in Lingua Italiana

I metodi di ottimizzazione della politica si sono rivelati approcci molto efficaci per la soluzione di problemi di controllo continuo nell'apprendimento per rinforzo. Alcuni risultati recenti hanno mostrato che alternare ottimizzazione online e offline è una scelta vincente per riutilizzare in modo efficiente l'informazione contenuta nelle traiettorie. Tuttavia, decidere quando fermare il processo di ottimizzazione e iniziare a raccogliere nuove traiettorie non è banale, in quanto occorre tenere conto della varianza della stima della funzione obiettivo. In questo lavoro proponiamo un nuovo algoritmo di ricerca della politica, POIS, che può essere applicato sia nel setting delle azioni sia in quello dei parametri. Per prima cosa deriviamo un limite inferiore ad alta confidenza per la stima tramite campionamento di importanza; poi definiamo una funzione obiettivo surrogata, che viene ottimizzata offline ogni volta che un nuovo gruppo di traiettorie viene raccolto. Infine, l'algoritmo è testato su una selezione di problemi di controllo continuo, sia con politiche lineari sia con reti neurali profonde, e confrontato con lo stato dell'arte dei metodi di ottimizzazione della politica.


Ringraziamenti

Vorrei ringraziare innanzitutto il Prof. Marcello Restelli, il Dott. Alberto Maria Metelli e il Dott. Matteo Papini, per avermi accompagnato e sostenuto in questo progetto di ricerca.

Ringrazio la mia famiglia, Bruno, Teresa e Federico, per tutto il supporto ricevuto in questi anni. Grazie ad Erin, per avermi dato la forza nei momenti in cui ne avevo più bisogno.

Grazie infine ai soci dell’AIM e SIAM, ai coinquilini di Via Lambrate, Jacopo, Giordano e Tommaso, a tutti coloro che hanno partecipato ai σ-party.


Contents

Abstract
Estratto in Lingua Italiana
Ringraziamenti
1  Introduction
2  Reinforcement Learning and Markov Decision Process
   2.1  Markov Decision Process
        2.1.1  Definition
        2.1.2  Value Functions
   2.2  Dynamic Programming
        2.2.1  Policy Iteration
3  Policy Search
   3.1  Action-based Methods
   3.2  Parameter-based Methods
   3.3  State-of-the-art algorithms in Policy search
   3.4  Policy Gradient
        3.4.1  Policy Gradient Theorem
        3.4.2  REINFORCE/GPOMDP
        3.4.3  Natural Gradient
   3.5  Constraint Methods
        3.5.1  TRPO
   3.6  Surrogate methods
        3.6.1  PPO
4  Importance Sampling
   4.1  The Importance Sampling estimator
        4.1.1  Rényi divergence
   4.2  Analysis of the IS estimator
        4.2.1  Unequal variances
        4.2.2  Equal variances
   4.3  The Self-Normalized Importance Sampling Estimator
   4.4  Analysis of the SN Estimator
   4.5  Importance Sampling and Natural Gradient
5  Optimization via Importance Sampling
   5.1  Concentration Inequality
   5.2  Policy Optimization via Importance Sampling
        5.2.1  Action-based POIS
        5.2.2  Estimation of the Rényi divergence
        5.2.3  Parameter-based POIS
   5.3  Implementation details
        5.3.1  Line Search
        5.3.2  Computation of the Fisher Matrix
        5.3.3  Practical surrogate objective functions
        5.3.4  Practical P-POIS for Deep Neural Policies (N-POIS)
6  Experimental Evaluation
   6.1  Description of the Simulated Environments
        6.1.1  Cart-Pole Balancing
        6.1.2  Mountain Car
        6.1.3  Double Inverted Pendulum
        6.1.4  Acrobot
        6.1.5  Swimmer
   6.2  Results for Linear Policy
        6.2.1  Experiments Details
   6.3  Results for Deep Policy
        6.3.1  Experiments Details
7  Conclusion

Chapter 1

Introduction

Reinforcement Learning (RL) [48] is a Machine Learning framework for modeling the interactions between an artificial agent and the environment. In recent years, a class of RL methods called policy search methods [8] has achieved state-of-the-art results in continuous control tasks [21, 39, 41, 40], robotic locomotion [51, 18] and partially observable environments [29]. Policy search algorithms can be classified into two different families, depending on whether the search is performed in the space of parametrized policies (as in policy gradient methods), or directly in the space of parameters, by exploiting global optimizers [38, 15, 46, 50] or by following a proper gradient direction, as in Policy Gradients with Parameter-based Exploration (PGPE) [43, 57, 44]. In this work, we focus on a very general problem of Reinforcement Learning: given a dataset of trajectories collected by an agent which interacts with the environment using a fixed policy, how can we use the gathered information in the most efficient way?

On-policy methods use a batch of trajectories to perform a single update step in the optimization process. These methods are the most widely used; they can rely on the stochastic policy gradient [49], as in REINFORCE [58] and G(PO)MDP [4], or on more advanced constrained methods such as Trust Region Policy Optimization (TRPO) [39]. However, these methods do not exploit the information contained in the trajectories efficiently, since they require a new batch for each update step. On the other hand, off-policy methods collect trajectories using a behavioral policy and then perform multiple optimization steps, evaluating and improving a target policy. Off-policy learning was first adopted in value-based RL [56, 31, 27] and was then extended to Deterministic Policy Gradient (DPG) [45] and Deep Deterministic Policy Gradient (DDPG) [21]. Parameter-based methods have also been adapted in order to perform off-policy optimization.

Off-policy optimization can be performed by alternating online steps, in which a behavioral policy collects new data, and offline optimization steps, in which a target policy is optimized with these new data. This idea has been used in Proximal Policy Optimization (PPO) [41], which has achieved new state-of-the-art performance in continuous control tasks. Optimizing a target policy with samples collected from a behavioral policy can be done using the Importance Sampling (IS) technique [30, 16], which weights each sample by the likelihood of having been generated by the target distribution. Unfortunately, the IS estimator, although unbiased, can have a very high variance when the two distributions are dissimilar.

In this work, we propose a novel, model-free, actor-only, policy optimization algorithm, named Policy Optimization via Importance Sampling (POIS), that alternates online and offline optimization to efficiently exploit the information contained in a batch of trajectories.

The main contributions of this work are theoretical, algorithmic and experimental.

The structure of the remaining part of this document is as follows: Chapter 2 provides an introduction to the Reinforcement Learning framework; Chapter 3 briefly reviews the state-of-the-art methods in Policy search; Chapter 4 contains an analysis of the Importance Sampling estimator and its relation with the dissimilarity of the policies; Chapter 5 contains the derivation of the novel algorithm, POIS, and the details of its implementation; Chapter 6 discusses the results of a comparison between POIS and the current state-of-the-art Policy search methods on a selected group of benchmark problems; Chapter 7 summarizes the work and suggests future directions of research. The implementation of POIS can be found at https://github.com/T3p/pois. POIS was accepted at the 32nd Conference on Neural Information Processing Systems (NIPS 2018) and selected for an oral presentation [24].


Chapter 2

Reinforcement Learning and

Markov Decision Process

In many Machine Learning problems, an agent has to learn to solve a task by interacting with the environment. A Markov Decision Process (MDP) [34] can be used to model this interaction.

2.1

Markov Decision Process

2.1.1

Definition

A discrete-time MDP can be formally defined as a tuple M = (S, A, P, R, γ, µ0)

in which:

• S ⊆ R^n is the set of possible states of the agent, s ∈ S;

• A ⊆ R^m is the set of possible actions the agent can perform, a ∈ A;

• P(·|s, a) is a stochastic Markovian transition kernel, which assigns to each pair (s_t, a_t) the probability of transitioning to state s_{t+1};

• R(s, a) ∈ [−R_max, R_max] is the reward function, which assigns a scalar reward to the agent for performing action a in state s;

• γ ∈ [0, 1] is a discount factor, measuring the present value of future rewards;

• µ_0 is a probability distribution over the initial state of the agent.

At each discrete time-step t, the agent observes its state s_t and interacts with the environment by performing an action a_t. The environment receives the action of the agent and emits a scalar reward r_t = R(s_t, a_t). The state of the agent then evolves according to the dynamics of the transition kernel P. The goal of the agent is to choose actions such that the cumulative discounted reward

v = \sum_{t=0}^{H−1} γ^t r_t

is maximized, where H is the horizon length. We define the return v_t as the total discounted reward from time-step t:

v_t = r_{t+1} + γ r_{t+2} + \cdots = \sum_{k=0}^{H−t−1} γ^k r_{t+k+1}.

In order to define the behavior of the agent, we need to introduce the policy, which is a map from the state set S to probability distributions over the action set A. More formally, we define π : S → Δ(A), which specifies the probability of performing action a in state s. A policy is deterministic when for each state s there exists an action a such that π(a|s) = 1. When the agent interacts with the environment for H time steps, the sequence of states and actions collected defines a trajectory, i.e., τ = (s_{τ,0}, a_{τ,0}, ..., s_{τ,H−1}, a_{τ,H−1}, s_{τ,H}). Given a policy π, we can compute the probability of a trajectory τ as:

p(τ) = µ_0(s_{τ,0}) \prod_{t=0}^{H−1} π(a_{τ,t}|s_{τ,t})\, P(s_{τ,t+1}|s_{τ,t}, a_{τ,t}).

We can express the problem of maximizing the cumulative discounted reward in terms of the policy. The objective function is an expectation over π:

J_π = \mathbb{E}_π[v].

2.1.2

Value Functions

Given a policy π, we want to be able to evaluate its utility in each state. The state-value function, defined as the expected return for being in a state s and following policy π, is used to measure the value of the policy in each state and it is employed by many Reinforcement Learning algorithms. More formally, it is defined as:

V^π(s) = \mathbb{E}_π\left[ v_t \,|\, s_t = s \right].

The value function can also be defined recursively, using Bellman's Expectation Equation:

V^π(s) = \int_A π(a|s) \left[ R(s, a) + γ \int_S P(s'|s, a)\, V^π(s')\, ds' \right] da.

By integrating over the state space S, we can express the maximization of the expected cumulative reward in terms of the value function:

J_π = \int_S µ_0(s)\, V^π(s)\, ds.

Another way to measure how good a policy is, is to define the action-value function Q^π, i.e., the expected return for performing action a in state s and then following the policy π:

Q^π(s, a) = \mathbb{E}_π\left[ v_t \,|\, s_t = s, a_t = a \right],

which can also be expressed recursively, using Bellman's Expectation Equation:

Q^π(s, a) = R(s, a) + γ \int_S P(s'|s, a) \int_A π(a'|s')\, Q^π(s', a')\, da'\, ds'.

2.2

Dynamic Programming

For solving an MDP, we need to find the optimal policy. One naive approach would be to enumerate all possible policies, evaluate their value functions, and then return the best policy found. However, the number of deterministic Markovian policies is exponential (|A|^{|S|}) when actions and states are discrete, and the approach is impractical when they take continuous values.

Dynamic Programming techniques offer a general solution for complex problems that can be divided into smaller, easier subproblems. In particular, for solving an MDP, one alternates between two main steps: Prediction and Control.

2.2.1

Policy Iteration

In the Policy Iteration algorithm, these two steps correspond to Policy Evaluation and Policy Improvement.

• Policy Evaluation: given a policy π, compute its action-value function Q^π using the Bellman Equation.

• Policy Improvement: given the action-value function Q^π for policy π, compute a better policy π'(s) ∈ \arg\max_{a∈A} Q^π(s, a).

Iterating this procedure we obtain a deterministic policy, and the method is guaranteed to find the optimal policy. However, in most Reinforcement Learning problems the action-value function cannot be computed exactly and needs to be estimated from samples. This can lead to a high-variance estimate, which can slow down or prevent the convergence of the algorithm. One possible solution is to use methods that do not require knowing the dynamics of the MDP and that directly model the policy, constraining it to a parametric subspace. We thus obtain a class of methods called model-free policy search methods.
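As a concrete illustration of the two steps above, here is a minimal sketch of tabular policy iteration for a small MDP whose transition kernel and reward function are known; the toy arrays and the function name are illustrative choices, not part of the thesis implementation.

import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # deterministic policy
    while True:
        # Policy Evaluation: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[np.arange(n_states), policy]         # (S, S)
        R_pi = R[np.arange(n_states), policy]         # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy Improvement: greedy step w.r.t. the action-value function
        Q = R + gamma * P @ V                         # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy 2-state, 2-action MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi, V = policy_iteration(P, R)
print("greedy policy:", pi, "values:", V)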


Chapter 3

Policy Search

In Policy Search methods [8] we consider a parametrized policy π_θ belonging to a parametric policy space Π_Θ = {π_θ : θ ∈ Θ ⊆ R^p}. The goal is to find the parameters of the policy such that the expected cumulative reward is maximized. At a high level, we can divide these methods into two categories: action-based and parameter-based. In the following sections, the two approaches are described.

3.1

Action-based Methods

Action-based methods, also known as Policy Gradient methods, optimize the objective function by following the direction of its gradient w.r.t. the parameters θ, which is estimated using a batch of trajectories generated by the policy. The expected return can be expressed as an expectation over the trajectory space:

J(θ) = \int_T p(τ|θ)\, R(τ)\, dτ,

where R(τ) = \sum_{t=0}^{H−1} γ^t R(s_{τ,t}, a_{τ,t}) is the trajectory return. The goal is to find the parameter θ* such that J(θ) is maximized. Assuming that the policy π_θ is stochastic and differentiable, we can update the parameters using gradient ascent: θ' = θ + α ∇_θ J(θ), where:

∇_θ J(θ) = \int_T p(τ|θ)\, ∇_θ \log p(τ|θ)\, R(τ)\, dτ.


3.2

Parameter-based Methods

In Parameter-based methods, like Policy Gradients with Parameter-based Exploration (PGPE) [43, 57, 44], the agent is endowed with a hyperpolicy ν belonging to a parametric hyperpolicy space N_P = {ν_ρ : ρ ∈ P ⊆ R^r}. At the beginning of each episode, a vector of policy parameters θ is sampled from the hyperpolicy ν_ρ. The policy π_θ, in the parameter-based setting, is a deterministic controller and does not need to be differentiable. The expected return can be expressed as an expectation over the trajectory space and over the policy parameter space:

J(ρ) = \int_Θ \int_T ν_ρ(θ)\, p(τ|θ)\, R(τ)\, dτ\, dθ,

where p(τ|θ) = µ_0(s_{τ,0}) \prod_{t=0}^{H−1} π_θ(a_{τ,t}|s_{τ,t})\, P(s_{τ,t+1}|s_{τ,t}, a_{τ,t}) is the trajectory density function. The goal is to find the hyperparameters ρ* such that J(ρ) is maximized. This can be done, without assuming differentiability of the policy π_θ, using gradient ascent:

ρ' = ρ + α ∇_ρ J(ρ),

where α > 0 and

∇_ρ J(ρ) = \int_Θ \int_T ν_ρ(θ)\, p(τ|θ)\, ∇_ρ \log ν_ρ(θ)\, R(τ)\, dτ\, dθ.

Sampling the parameters only once per episode can considerably reduce the variance of the gradient estimate.
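To make the parameter-based update concrete, the following is a minimal sketch of a PGPE-style gradient estimate for a Gaussian hyperpolicy ν_ρ = N(µ, diag(σ²)); the black-box sample_return and the toy objective are hypothetical stand-ins for an episode run with a deterministic controller, not the thesis code.

import numpy as np

rng = np.random.default_rng(0)

def sample_return(theta):
    # Hypothetical black box: runs one episode with the deterministic
    # controller u_theta and returns the (discounted) trajectory return.
    return -np.sum((theta - 1.0) ** 2) + 0.1 * rng.standard_normal()

def pgpe_gradient(mu, sigma, n_episodes=100):
    """Monte Carlo estimate of the gradient of J(ρ) for ρ = (µ, σ)."""
    grad_mu = np.zeros_like(mu)
    grad_sigma = np.zeros_like(sigma)
    for _ in range(n_episodes):
        theta = mu + sigma * rng.standard_normal(mu.shape)   # θ ~ ν_ρ
        ret = sample_return(theta)
        # Score function of a diagonal Gaussian hyperpolicy
        grad_mu += (theta - mu) / sigma ** 2 * ret
        grad_sigma += ((theta - mu) ** 2 - sigma ** 2) / sigma ** 3 * ret
    return grad_mu / n_episodes, grad_sigma / n_episodes

mu, sigma = np.zeros(3), np.ones(3)
for _ in range(200):                       # plain gradient ascent on ρ
    g_mu, g_sigma = pgpe_gradient(mu, sigma)
    mu += 0.01 * g_mu
    sigma = np.maximum(1e-3, sigma + 0.01 * g_sigma)
print("learned mean parameters:", mu.round(2))

In practice, a baseline is usually subtracted from the returns to further reduce the variance of this estimate.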

3.3

State-of-the-art algorithms in Policy search

Policy search methods [8] have become widely used because of their ability to solve Reinforcement Learning problems with continuous control tasks [21, 39, 41, 40], robotic locomotion [51, 18] and partially observable environments [29]. We can divide the state-of-the-art Policy search methods into three categories, depending on how the optimization process is carried out in order to find the optimal policy. The first, and most commonly used, category is that of Policy Gradient methods, in which the expected cumulative return is optimized by following the gradient direction. The second category also follows the direction of the gradient, but subject to a constraint that forbids updates leading to policies too far from the current one. This constraint is typically expressed in terms of the dissimilarity between the two policies. The third category, which includes our algorithm, POIS, optimizes a surrogate objective function: a function similar to the objective function of the problem, but much easier to optimize.


3.4

Policy Gradient

In this section, we review some Policy search algorithms based on the policy gradient.

3.4.1

Policy Gradient Theorem

We first note that, to perform a parameter update θ' = θ + α ∇_θ J(θ), we have to compute the quantity ∇_θ J(θ). This can be done analytically, thanks to the Policy Gradient Theorem [49]:

Theorem 3.4.1. For any Markov Decision Process, the following holds:

∇_θ J(θ) = \int_S µ_0(s) \int_A ∇_θ π_θ(a|s)\, Q^{π_θ}(s, a)\, da\, ds.

3.4.2

REINFORCE/GPOMDP

Unfortunately, in many applications we need to estimate the term Q^{π_θ}(s, a) from samples. One straightforward way of doing so is to use the return:

\hat{Q}^{π_θ}(s, a) \simeq \sum_{k=0}^{H−1} γ^k R(s_k, a_k).

Moreover, we can rewrite the Policy Gradient Theorem using the likelihood ratio technique [58]:

∇_θ π_θ(a|s) = π_θ(a|s)\, ∇_θ \log π_θ(a|s),

and obtain:

∇_θ J(θ) = \int_S µ_0(s) \int_A π_θ(a|s)\, ∇_θ \log π_θ(a|s)\, Q^{π_θ}(s, a)\, da\, ds.

Combining this expression with the approximation of the action-value function, we obtain the REINFORCE estimator [58]:

\hat{∇}_θ J(θ) = \left\langle \left( \sum_{k=0}^{H−1} ∇_θ \log π_θ(a_k|s_k) \right) \left( \sum_{k=0}^{H−1} γ^k R(s_k, a_k) \right) \right\rangle_N,

where \langle \cdot \rangle_N denotes the sample mean over N trajectories. One problem of REINFORCE is that, in the estimator, the score of each pair (s_k, a_k) is multiplied also by the rewards collected before time k, which do not depend on that action. Removing this dependence on past rewards, we obtain the G(PO)MDP gradient estimator [4]:

\hat{∇}_θ J(θ) = \left\langle \sum_{t=0}^{H−1} \left( \sum_{k=0}^{t} ∇_θ \log π_θ(a_k|s_k) \right) γ^t R(s_t, a_t) \right\rangle_N.
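The following sketch contrasts the two estimators on a batch of pre-collected trajectories; the per-step score ∇_θ log π_θ(a_k|s_k) is assumed to be available as an array (a hypothetical interface that would be provided by the chosen policy parametrization).

import numpy as np

def reinforce_gradient(trajectories, gamma):
    """trajectories: list of dicts with 'grad_log_pi' (H, d) and 'rewards' (H,)."""
    grads = []
    for tau in trajectories:
        score_sum = tau["grad_log_pi"].sum(axis=0)              # sum_k grad log pi
        discounted_return = np.sum(gamma ** np.arange(len(tau["rewards"])) * tau["rewards"])
        grads.append(score_sum * discounted_return)
    return np.mean(grads, axis=0)

def gpomdp_gradient(trajectories, gamma):
    grads = []
    for tau in trajectories:
        cum_score = np.cumsum(tau["grad_log_pi"], axis=0)       # sum_{k<=t} grad log pi
        discounts = gamma ** np.arange(len(tau["rewards"]))
        # the reward at time t only multiplies scores of actions taken up to time t
        grads.append((cum_score * (discounts * tau["rewards"])[:, None]).sum(axis=0))
    return np.mean(grads, axis=0)

rng = np.random.default_rng(1)
trajs = [{"grad_log_pi": rng.standard_normal((20, 4)),
          "rewards": rng.standard_normal(20)} for _ in range(50)]
print(reinforce_gradient(trajs, 0.99))
print(gpomdp_gradient(trajs, 0.99))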

3.4.3

Natural Gradient

When searching in the space of parameters for the optimal policy, the optimization process strongly depends on the chosen policy parametrization. Natural gradient methods [1] allow searching directly in the space of policies and are invariant w.r.t. the parametrization used. The update direction for the natural gradient becomes:

\tilde{∇}_θ J(θ) = F^{-1}(θ)\, ∇_θ J(θ),

where F is the Fisher Information Matrix [36, 2]:

F(θ) = \int_S µ_0(s) \int_A π_θ(a|s)\, ∇_θ \log π_θ(a|s)\, ∇_θ \log π_θ(a|s)^T\, da\, ds.

3.5

Constraint Methods

The methods belonging to this category still follow the direction of the gradient, but with a constraint on the step size that depends on how different the updated policy is from the current one.

3.5.1

TRPO

The most used algorithm of this category, which achieved state-of-the-art results in many continuous control tasks, is Trust Region Policy Optimization (TRPO) [39], whose update formula is:

\max_{θ'} \; \hat{\mathbb{E}}_{t∼θ}\left[ w_{θ'/θ}(a_t|s_t)\, \hat{A}(s_t, a_t) \right] \quad \text{s.t.} \quad \hat{\mathbb{E}}_{t∼θ}\left[ D_{KL}\left( π_{θ'}(·|s_t) \,\|\, π_θ(·|s_t) \right) \right] \leq δ,

where w_{θ'/θ}(a_t|s_t) = \frac{π_{θ'}(a_t|s_t)}{π_θ(a_t|s_t)} is the importance weight, \hat{A}(s_t, a_t) is an estimate of the advantage function Q^{π_θ}(s, a) − V^{π_θ}(s), which measures the advantage of taking action a in state s, and D_{KL}(π_{θ'}(·|s_t) \,\|\, π_θ(·|s_t)) is the Kullback-Leibler divergence between the two distributions.


3.6

Surrogate methods

This class of methods maximizes a surrogate objective function: a function similar to the objective function of the problem, but much easier to optimize.

3.6.1

PPO

The most widely used surrogate policy gradient method is Proximal Policy Optimization (PPO) [41], which optimizes the following surrogate objective function:

\max_{θ'} \; \hat{\mathbb{E}}_{t∼θ}\left[ \min\left\{ w_{θ'/θ}(a_t|s_t)\, \hat{A}(s_t, a_t),\; \mathrm{clip}\left( w_{θ'/θ}(a_t|s_t),\, 1 − ε,\, 1 + ε \right) \hat{A}(s_t, a_t) \right\} \right],

where ε > 0 is a clipping parameter.
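As a minimal sketch, the clipped surrogate for a batch of transitions can be evaluated as follows; the probability ratios and advantage estimates are assumed to be precomputed, and only the objective (not the full PPO training loop) is shown.

import numpy as np

def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """ratios: w_{θ'/θ}(a_t|s_t), advantages: estimates of A(s_t, a_t)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # elementwise min, then empirical mean over the batch
    return np.minimum(unclipped, clipped).mean()

rng = np.random.default_rng(2)
ratios = np.exp(0.1 * rng.standard_normal(256))
advantages = rng.standard_normal(256)
print("surrogate objective:", ppo_clipped_objective(ratios, advantages))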


Chapter 4

Importance Sampling

4.1

The Importance Sampling estimator

Importance sampling is a technique that can be used for studying a distribution, which we call the target, using samples collected from another distribution, called the behavioral. In particular, we want to study the problem of estimating the expected value of a deterministic bounded function f (‖f‖_∞ < +∞) of a random variable x under a target distribution P, having samples collected from another distribution Q. If we assume that P and Q admit p and q as Lebesgue probability density functions, we can use the importance sampling (IS) estimator [6, 30], which weights the collected samples by the likelihood of being generated by the target distribution. This correction relies on the importance weights (or Radon-Nikodym derivative, or likelihood ratio) w_{P/Q}(x) = p(x)/q(x):

\hat{µ}_{P/Q} = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i) = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i), (4.1)

where x = (x_1, x_2, \dots, x_N)^T is sampled from Q and we assume q(x) > 0 whenever f(x)p(x) ≠ 0. The IS estimator is unbiased (\mathbb{E}_{x∼Q}[\hat{µ}_{P/Q}] = \mathbb{E}_{x∼P}[f(x)]), but it can have a large variance if the two distributions differ significantly. Even in the particular case in which the two distributions are Gaussian, the IS estimator can have infinite variance.
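As a concrete illustration of (4.1), the sketch below estimates E_{x∼P}[f(x)] for two univariate Gaussians using samples drawn only from Q; the particular choices of P, Q and f are arbitrary examples.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N = 10_000
mu_Q, sigma_Q = 0.0, 1.0          # behavioral distribution Q
mu_P, sigma_P = 0.5, 0.8          # target distribution P (variance smaller than Q's)
f = np.tanh                        # bounded test function

x = rng.normal(mu_Q, sigma_Q, size=N)                          # samples from Q
w = norm.pdf(x, mu_P, sigma_P) / norm.pdf(x, mu_Q, sigma_Q)    # importance weights p(x)/q(x)

is_estimate = np.mean(w * f(x))                                # IS estimator (4.1)
mc_reference = np.mean(f(rng.normal(mu_P, sigma_P, size=N)))   # direct Monte Carlo from P
print(f"IS estimate: {is_estimate:.4f}  Monte Carlo from P: {mc_reference:.4f}")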

We can measure how dissimilar two distributions are using the Rényi divergence, an information-theoretic index [37, 54].


4.1.1

Rényi divergence

Consider two probability measures P and Q on a measurable space (X, F) such that P ≪ Q (P is absolutely continuous w.r.t. Q) and Q is σ-finite. Let P and Q admit p and q as Lebesgue probability density functions (p.d.f.), respectively. The α-Rényi divergence is defined as:

D_α(P‖Q) = \frac{1}{α − 1} \log \int_X \left( \frac{dP}{dQ} \right)^{α} dQ = \frac{1}{α − 1} \log \int_X q(x) \left( \frac{p(x)}{q(x)} \right)^{α} dx, (4.2)

where dP/dQ is the Radon-Nikodym derivative of P w.r.t. Q and α ∈ [0, ∞]. Some particular cases are the following:

• for α = 1, D_1(P‖Q) = D_{KL}(P‖Q);

• for α = ∞, D_∞(P‖Q) = \log \operatorname{ess\,sup}_{X} dP/dQ.

Using the notation of [7], we define the exponentiated α-Rényi divergence as d_α(P‖Q) = \exp\left( D_α(P‖Q) \right). Whenever it is clear from the context, D_α(P‖Q) will be replaced by D_α(p‖q).

There is a tight relation between the dissimilarity of the two distributions and the moments of the importance weights. Indeed:

\mathbb{E}_{x∼Q}\left[ w_{P/Q}(x)^{α} \right] = d_α(P‖Q).

Moreover,

\mathrm{Var}_{x∼Q}\left[ w_{P/Q}(x) \right] = d_2(P‖Q) − 1,

and, from [7],

\operatorname{ess\,sup}_{x∼Q} w_{P/Q}(x) = d_∞(P‖Q).

4.2

Analysis of the IS estimator

In this section, the importance weights are analyzed under the assumption that both the behavioral and target distributions are Gaussian. We can note that, for multivariate Gaussian distributions, there exists an analytic expression for the Rényi divergence [5]. Let P ∼ N(µ_P, Σ_P), Q ∼ N(µ_Q, Σ_Q) and α ∈ [0, ∞]:

D_α(P‖Q) = \frac{α}{2} (µ_P − µ_Q)^T Σ_α^{-1} (µ_P − µ_Q) − \frac{1}{2(α − 1)} \log \frac{\det(Σ_α)}{\det(Σ_P)^{1−α} \det(Σ_Q)^{α}}, (4.3)

where Σ_α = α Σ_Q + (1 − α) Σ_P, under the assumption that Σ_α is positive-definite.
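A small helper implementing (4.3) is sketched below (an illustrative utility, not the thesis code); for α = 2, the exponentiated value d_2 = exp(D_2) is exactly the quantity that controls the variance of the importance weights.

import numpy as np

def renyi_gaussians(alpha, mu_p, cov_p, mu_q, cov_q):
    """α-Rényi divergence D_α(P‖Q) between multivariate Gaussians, Eq. (4.3)."""
    cov_alpha = alpha * cov_q + (1.0 - alpha) * cov_p
    assert np.all(np.linalg.eigvalsh(cov_alpha) > 0), "Σ_α must be positive-definite"
    diff = mu_p - mu_q
    quad = alpha / 2.0 * diff @ np.linalg.solve(cov_alpha, diff)
    log_det = np.log(np.linalg.det(cov_alpha)
                     / (np.linalg.det(cov_p) ** (1 - alpha)
                        * np.linalg.det(cov_q) ** alpha))
    return quad - log_det / (2.0 * (alpha - 1.0))

mu_p, mu_q = np.array([0.5, 0.0]), np.zeros(2)
cov_p, cov_q = 0.8 * np.eye(2), np.eye(2)
D2 = renyi_gaussians(2.0, mu_p, cov_p, mu_q, cov_q)
print("D_2(P‖Q) =", D2, " d_2 = exp(D_2) =", np.exp(D2))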


In the particular case of univariate Gaussian distributions, the expression of the importance weights and their probability density function f_w can be computed in closed form. Let Q ∼ N(µ_Q, σ_Q^2) be the behavioral distribution and P ∼ N(µ_P, σ_P^2) the target distribution. Under the assumption that σ_Q^2, σ_P^2 > 0, we can analyze the importance weights in the following two cases: unequal variances and equal variances. The importance weight w_{P/Q}(x) will be denoted by w(x) for brevity.

4.2.1

Unequal variances

We start by considering the case in which the variances of the behavioral and target policies differ: σ_Q^2 ≠ σ_P^2. In this case, the importance weights take the following form:

w(x) = \frac{σ_Q}{σ_P} \exp\left( \frac{1}{2} \frac{(µ_P − µ_Q)^2}{σ_Q^2 − σ_P^2} \right) \exp\left( −\frac{1}{2} \frac{σ_Q^2 − σ_P^2}{σ_Q^2 σ_P^2} \left( x − \frac{σ_Q^2 µ_P − σ_P^2 µ_Q}{σ_Q^2 − σ_P^2} \right)^2 \right), (4.4)

for x ∼ Q. We can study when w(x) is bounded based on the sign of the exponent: if σ_Q^2 − σ_P^2 > 0, then w(x) is upper bounded by A = \frac{σ_Q}{σ_P} \exp\left( \frac{1}{2} \frac{(µ_P − µ_Q)^2}{σ_Q^2 − σ_P^2} \right), while if σ_Q^2 − σ_P^2 < 0, then w(x) is unbounded and admits A as its minimum value. We then compute the probability density function of w(x):

Proposition 4.2.1. Let Q ∼ N(µ_Q, σ_Q^2) be the behavioral distribution and P ∼ N(µ_P, σ_P^2) be the target distribution, with σ_Q^2 ≠ σ_P^2. The probability density function of w(x) = p(x)/q(x) is given by:

f_w(y) = \begin{cases} \dfrac{σ}{y \sqrt{π \log \frac{A}{y}}} \exp\left( −\frac{1}{2} µ^2 \right) \left( \dfrac{y}{A} \right)^{σ^2} \cosh\left( µσ \sqrt{2 \log \frac{A}{y}} \right), & \text{if } σ_Q^2 > σ_P^2,\; y ∈ [0, A], \\[10pt] \dfrac{σ}{y \sqrt{π \log \frac{y}{A}}} \exp\left( −\frac{1}{2} µ^2 \right) \left( \dfrac{A}{y} \right)^{σ^2} \cosh\left( µσ \sqrt{2 \log \frac{y}{A}} \right), & \text{if } σ_Q^2 < σ_P^2,\; y ∈ [A, ∞), \end{cases}

where µ = \frac{σ_Q}{σ_Q^2 − σ_P^2} (µ_P − µ_Q) and σ^2 = \frac{σ_P^2}{|σ_Q^2 − σ_P^2|}.

Proof. Let us consider w(x) as a function of the random variable x ∼ Q. We define the following constants:

m = \frac{σ_Q^2 µ_P − σ_P^2 µ_Q}{σ_Q^2 − σ_P^2}, \qquad τ = \frac{σ_Q^2 − σ_P^2}{σ_Q^2 σ_P^2}.

We start by computing the c.d.f.:

F_w(y) = \Pr\left( w(x) \leq y \right) = \Pr\left( A \exp\left( −\frac{1}{2} τ (x − m)^2 \right) \leq y \right) = \Pr\left( τ (x − m)^2 \geq −2 \log \frac{y}{A} \right).

According to the sign of τ, we study two different cases, observing that x = µ_Q + σ_Q z with z ∼ N(0, 1).

τ > 0:

F_w(y) = \Pr\left( (x − m)^2 \geq \frac{2}{τ} \log \frac{A}{y} \right) = \Pr\left( x \leq m − \sqrt{\frac{2}{τ} \log \frac{A}{y}} \right) + \Pr\left( x \geq m + \sqrt{\frac{2}{τ} \log \frac{A}{y}} \right)

= \Pr\left( z \leq \frac{m − µ_Q}{σ_Q} − \sqrt{\frac{2}{τ σ_Q^2} \log \frac{A}{y}} \right) + \Pr\left( z \geq \frac{m − µ_Q}{σ_Q} + \sqrt{\frac{2}{τ σ_Q^2} \log \frac{A}{y}} \right).

We set µ = \frac{m − µ_Q}{σ_Q} = \frac{σ_Q}{σ_Q^2 − σ_P^2} (µ_P − µ_Q) and σ^2 = \frac{1}{τ σ_Q^2} = \frac{σ_P^2}{σ_Q^2 − σ_P^2}, so that:

F_w(y) = \Phi\left( µ − \sqrt{2σ^2 \log \frac{A}{y}} \right) + 1 − \Phi\left( µ + \sqrt{2σ^2 \log \frac{A}{y}} \right),

where Φ is the c.d.f. of a standard normal distribution. We can obtain the p.d.f. by taking the derivative w.r.t. y:

f_w(y) = \frac{∂ F_w(y)}{∂ y} = \frac{\sqrt{2}\, σ}{2 y \sqrt{\log \frac{A}{y}}} \left[ φ\left( µ − \sqrt{2σ^2 \log \frac{A}{y}} \right) + φ\left( µ + \sqrt{2σ^2 \log \frac{A}{y}} \right) \right] = \frac{σ}{y \sqrt{π \log \frac{A}{y}}} \exp\left( −\frac{1}{2} µ^2 \right) \left( \frac{y}{A} \right)^{σ^2} \cosh\left( µσ \sqrt{2 \log \frac{A}{y}} \right),

where φ is the p.d.f. of a standard normal distribution.

τ < 0: We can derive the p.d.f. using similar steps. Defining σ^2 = −\frac{1}{τ σ_Q^2} = \frac{σ_P^2}{σ_P^2 − σ_Q^2}, the c.d.f. becomes:

F_w(y) = \Phi\left( µ + \sqrt{2σ^2 \log \frac{y}{A}} \right) − \Phi\left( µ − \sqrt{2σ^2 \log \frac{y}{A}} \right),

and the p.d.f. is:

f_w(y) = \frac{σ}{y \sqrt{π \log \frac{y}{A}}} \exp\left( −\frac{1}{2} µ^2 \right) \left( \frac{A}{y} \right)^{σ^2} \cosh\left( µσ \sqrt{2 \log \frac{y}{A}} \right).

Both cases are covered by defining σ^2 = \frac{σ_P^2}{|σ_Q^2 − σ_P^2|}.

Studying the behavior of the tail of the distribution when w is unbounded, we discover that it displays a fat-tail behavior.

Proposition 4.2.2. If σ_P^2 > σ_Q^2, then there exist c > 0 and y_0 > 0 such that, for any y ≥ y_0, the p.d.f. f_w can be lower bounded as f_w(y) ≥ c\, y^{−1−σ^2} (\log y)^{−1/2}.

Proof. Set z = y/A and let a > 0 be a constant. Since \cosh(x) \geq \frac{1}{2} e^{|x|}, for sufficiently large y it holds that:

f_w(y) \geq a\, z^{−1−σ^2} (\log z)^{−1/2} \exp\left( \sqrt{2}\, |µσ| \sqrt{\log z} \right). (4.5)

We can first observe that, for z > 1, we have \exp\left( \sqrt{2}\, |µσ| \sqrt{\log z} \right) \geq 1. Then, by replacing z with y/A, we just need to change the constant a into c > 0.

Thanks to this result, we can see that the α-th moment of w(x) does not exist when α − 1 − σ^2 \geq −1, i.e., when α \geq σ^2 = \frac{σ_P^2}{σ_P^2 − σ_Q^2}. As a consequence, we cannot bound the importance weights in probability using Bernstein-like inequalities. Note also that the α-Rényi divergence is defined only when σ_α^2 = α σ_Q^2 + (1 − α) σ_P^2 > 0, i.e., α < \frac{σ_P^2}{σ_P^2 − σ_Q^2}, which confirms the non-existence of the finite moments of the importance weights beyond that order.

4.2.2

Equal variances

If σ_Q^2 = σ_P^2 = σ^2, the importance weights take the following form:

w(x) = \exp\left( \frac{µ_P − µ_Q}{σ^2} \left( x − \frac{µ_P + µ_Q}{2} \right) \right), (4.6)

for x ∼ Q. Under this condition, the importance weights w(x) are unbounded and have infimum 0. Also in this case we can derive the p.d.f. of the importance weights.

Proposition 4.2.3. Let Q ∼ N(µ_Q, σ^2) be the behavioral distribution and P ∼ N(µ_P, σ^2) be the target distribution. The probability density function of w(x) = p(x)/q(x) is given by:

f_w(y) = \frac{|\tilde{σ}|}{\sqrt{2π}\, y^{3/2}} \exp\left( −\frac{1}{2} \left( \tilde{µ}^2 + \tilde{σ}^2 (\log y)^2 \right) \right), (4.7)

where \tilde{µ} = \frac{µ_P − µ_Q}{2σ} and \tilde{σ} = \frac{σ}{µ_P − µ_Q}.

Proof. We start by computing the c.d.f.:

F_w(y) = \Pr\left( \exp\left( \frac{µ_P − µ_Q}{σ^2} \left( x − \frac{µ_P + µ_Q}{2} \right) \right) \leq y \right) = \Pr\left( \frac{µ_P − µ_Q}{σ^2} \left( x − \frac{µ_P + µ_Q}{2} \right) \leq \log y \right).

Let us first consider the case µ_P − µ_Q > 0 and observe that x = µ_Q + σ z, where z ∼ N(0, 1):

F_w(y) = \Pr\left( x \leq \frac{µ_P + µ_Q}{2} + \frac{σ^2}{µ_P − µ_Q} \log y \right) = \Pr\left( z \leq \frac{µ_P − µ_Q}{2σ} + \frac{σ}{µ_P − µ_Q} \log y \right).

Set \tilde{µ} = \frac{µ_P − µ_Q}{2σ} and \tilde{σ} = \frac{σ}{µ_P − µ_Q}. Then we have:

F_w(y) = \Pr\left( z \leq \tilde{µ} + \tilde{σ} \log y \right) = \Phi\left( \tilde{µ} + \tilde{σ} \log y \right).

We can obtain the p.d.f. by taking the derivative w.r.t. y:

f_w(y) = \frac{∂ F_w(y)}{∂ y} = \frac{\tilde{σ}}{y} \frac{1}{\sqrt{2π}} \exp\left( −\frac{1}{2} \left( \tilde{µ} + \tilde{σ} \log y \right)^2 \right) = \frac{\tilde{σ}}{\sqrt{2π}\, y^{\tilde{µ}\tilde{σ} + 1}} \exp\left( −\frac{1}{2} \left( \tilde{µ}^2 + \tilde{σ}^2 (\log y)^2 \right) \right),

and, since \tilde{µ}\tilde{σ} = \frac{1}{2}, the exponent of y is 3/2. When µ_P − µ_Q < 0, the p.d.f. differs only by a sign; both cases are covered by taking |\tilde{σ}| in the final formula.

We can note that, when the variances of the behavioral and target policies are equal, the tail behavior is different.

Proposition 4.2.4. If σ_P^2 = σ_Q^2, then for any α > 0 there exist c > 0 and y_0 > 0 such that, for any y ≥ y_0, the p.d.f. can be upper bounded as f_w(y) ≤ c\, y^{−α}.

Proof. Collecting all constant terms into c, we can write the p.d.f. as:

f_w(y) = c\, y^{−3/2} \exp\left( −\frac{\tilde{σ}^2}{2} (\log y)^2 \right). (4.8)

For any α > 0, let us solve the following inequality:

y^{3/2} \exp\left( \frac{\tilde{σ}^2}{2} (\log y)^2 \right) \geq y^{α}, \quad \text{which, for } y > 1, \text{ is equivalent to } \quad y \geq \exp\left( \frac{2}{\tilde{σ}^2} \left( α − \frac{3}{2} \right) \right). (4.9)

Thus, for y \geq \exp\left( \frac{2}{\tilde{σ}^2} \left( α − \frac{3}{2} \right) \right) we have f_w(y) \leq c\, y^{−α}.


[Figure 4.1: two panels, (a) equal variance and (b) equal mean, showing the probability density function f_w(y) of the importance weights for several target distributions.]

Figure 4.1: Probability density function of the importance weights when the behavioral distribution is N(0, 1) and the target mean is changed while keeping the variance equal to 1 (a), or the target variance is changed while keeping the target mean equal to 1 (b).

The corresponding Rényi divergence is \frac{α (µ_P − µ_Q)^2}{2σ^2}; therefore, all moments of the importance weights exist. However, the distribution of w(x) remains subexponential, since \exp\left( −\frac{\tilde{σ}^2}{2} (\log y)^2 \right) \geq e^{−η y} for sufficiently large y.

In Figure 4.1 the p.d.f. of the importance weights, for different values of mean and variance of the target distribution, is reported.

4.3

The Self-Normalized Importance Sampling Estimator

In the previous section, we saw that the IS estimator can exhibit a very high variance when the target distribution is too far from the behavioral one in terms of the Rényi divergence. In order to mitigate the variance problem of the IS estimator, we can use the self-normalized importance sampling (SN) estimator [6]:

\tilde{µ}_{P/Q} = \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} = \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i) f(x_i), (4.10)

where \tilde{w}_{P/Q}(x) = w_{P/Q}(x) / \sum_{i=1}^{N} w_{P/Q}(x_i) is the self-normalized importance weight. \tilde{µ}_{P/Q} is a biased estimator, but it is consistent [30] and, since |\tilde{µ}_{P/Q}| \leq ‖f‖_∞, it always has finite variance. This typically leads to a more desirable behavior than the IS estimator. The SN estimator can be interpreted as the expected value of f under an approximation of the distribution P made of N deltas, i.e., \tilde{p}(x) = \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i)\, δ(x − x_i). The behavior of the SN estimator can be studied using diagnostic indices [30]. The effective sample size (ESS), introduced in [19], represents the number of samples drawn from P such that the variance of the Monte Carlo estimator \tilde{µ}_{P/P} is approximately equal to the variance of the SN estimator \tilde{µ}_{P/Q} computed with N samples. The definition and its estimate are the following:

\mathrm{ESS}(P‖Q) = \frac{N}{\mathrm{Var}_{x∼Q}\left[ w_{P/Q}(x) \right] + 1} = \frac{N}{d_2(P‖Q)}, (4.11)

\widehat{\mathrm{ESS}}(P‖Q) = \frac{1}{\sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i)^2}. (4.12)

We can note that if d_2(P‖Q) = 1, i.e., P = Q almost everywhere, then ESS = N, since we are performing plain Monte Carlo estimation. If P ≠ Q, the more the dissimilarity between the distributions increases, the more the ESS decreases. In order to also account for the properties of f, other diagnostics similar to the ESS have been proposed [22].
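The sketch below computes the SN estimate (4.10) together with the ESS diagnostic (4.12) for a pair of univariate Gaussians; the distributions and the test function are again arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N = 5_000
mu_Q, sigma_Q = 0.0, 1.0
mu_P, sigma_P = 1.0, 1.0           # equal variances: weights are unbounded but all moments exist
f = np.tanh

x = rng.normal(mu_Q, sigma_Q, size=N)
w = norm.pdf(x, mu_P, sigma_P) / norm.pdf(x, mu_Q, sigma_Q)
w_tilde = w / w.sum()                                    # self-normalized weights

sn_estimate = np.sum(w_tilde * f(x))                     # SN estimator (4.10)
ess_hat = 1.0 / np.sum(w_tilde ** 2)                     # ESS estimate (4.12)
ess_exact = N / np.exp((mu_P - mu_Q) ** 2 / sigma_Q ** 2)    # N / d_2(P‖Q) for equal variances
print(f"SN estimate: {sn_estimate:.4f}  ESS: {ess_hat:.1f} (exact: {ess_exact:.1f})")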

4.4

Analysis of the SN Estimator

In this section, some results regarding the bias and the variance of the self-normalized importance sampling estimator are provided. Thanks to the result in [7], we can bound the expected squared difference between the non-self-normalized weight w(x) and the self-normalized weight \tilde{w}(x).

Lemma 4.4.1. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_2(P‖Q) < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q. Then, for N > 0 and for any i = 1, 2, ..., N, it holds that:

\mathbb{E}_{x∼Q}\left[ \left( \tilde{w}_{P/Q}(x_i) − \frac{w_{P/Q}(x_i)}{N} \right)^2 \right] \leq \frac{d_2(P‖Q) − 1}{N}. (4.13)

Proof. Since \mathrm{Var}_{x∼Q}\left[ w_{P/Q}(x) \right] = d_2(P‖Q) − 1, with basic algebraic manipulations we have:

\mathbb{E}_{x∼Q}\left[ \left( \tilde{w}_{P/Q}(x_i) − \frac{w_{P/Q}(x_i)}{N} \right)^2 \right] = \mathbb{E}_{x∼Q}\left[ \left( \frac{w_{P/Q}(x_i)}{\sum_{j=1}^{N} w_{P/Q}(x_j)} \right)^2 \left( 1 − \frac{\sum_{j=1}^{N} w_{P/Q}(x_j)}{N} \right)^2 \right]

\leq \mathbb{E}_{x∼Q}\left[ \left( 1 − \frac{\sum_{j=1}^{N} w_{P/Q}(x_j)}{N} \right)^2 \right] = \mathrm{Var}_{x∼Q}\left[ \frac{\sum_{j=1}^{N} w_{P/Q}(x_j)}{N} \right] = \frac{1}{N} \mathrm{Var}_{x_1∼Q}\left[ w_{P/Q}(x_1) \right] = \frac{d_2(P‖Q) − 1}{N}.

Similarly, a bound on the bias of the SN estimator can be derived.

Proposition 4.4.1. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_2(P‖Q) < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q and f: X → R a bounded function (‖f‖_∞ < +∞). Then, the bias of the SN estimator can be bounded as:

\left| \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] − \mathbb{E}_{x∼P}\left[ f(x) \right] \right| \leq ‖f‖_∞ \min\left\{ 2,\; \sqrt{\frac{d_2(P‖Q) − 1}{N}} \right\}. (4.14)

Proof. Since |\tilde{µ}_{P/Q}| \leq ‖f‖_∞, the bias cannot be larger than 2‖f‖_∞. Exploiting the fact that the IS estimator is unbiased, i.e., \mathbb{E}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] = \mathbb{E}_{x∼P}\left[ f(x) \right], a bound for the bias, which vanishes as N → ∞, can be derived:

\left| \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] − \mathbb{E}_{x∼P}\left[ f(x) \right] \right| = \left| \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} − \hat{µ}_{P/Q} \right] \right| \leq \mathbb{E}_{x∼Q}\left[ \left| \tilde{µ}_{P/Q} − \hat{µ}_{P/Q} \right| \right]

= \mathbb{E}_{x∼Q}\left[ \left| \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} \right| \left| 1 − \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)}{N} \right| \right] (4.15)

\leq \left( \mathbb{E}_{x∼Q}\left[ \left( \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} \right)^2 \right] \right)^{1/2} \left( \mathbb{E}_{x∼Q}\left[ \left( 1 − \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)}{N} \right)^2 \right] \right)^{1/2} (4.16)

\leq ‖f‖_∞ \sqrt{\frac{d_2(P‖Q) − 1}{N}}, (4.17)

where (4.16) follows from (4.15) by applying the Cauchy-Schwarz inequality, and in (4.17) the fact that \left( \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} \right)^2 \leq ‖f‖_∞^2 is used.

Since the normalization term makes all the samples interdependent, it is not trivial to bound the variance of the SN estimator. We can exploit the boundedness of \tilde{µ}_{P/Q} and derive trivial bounds such as \mathrm{Var}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \leq ‖f‖_∞^2.

The following approximation, derived using the delta method [55, 30], has been proposed in the literature:

\mathrm{Var}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] = \frac{1}{N} \mathbb{E}_{x_1∼Q}\left[ w_{P/Q}^2(x_1) \left( f(x_1) − \mathbb{E}_{x∼P}\left[ f(x) \right] \right)^2 \right] + o(N^{-2}). (4.18)

The Mean Squared Error (MSE) of the SN estimator, which is the sum of the variance and the squared bias, can instead be bounded directly.

Proposition 4.4.2. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_2(P‖Q) < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q and f: X → R a bounded function (‖f‖_∞ < +∞). Then, the MSE of the SN estimator can be bounded as:

\mathrm{MSE}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \leq 2 ‖f‖_∞^2 \min\left\{ 2,\; \frac{2 d_2(P‖Q) − 1}{N} \right\}. (4.19)

Proof. Since \tilde{µ}_{P/Q} is bounded by ‖f‖_∞, its MSE cannot be larger than 4‖f‖_∞^2. We can sum and subtract the IS estimator \hat{µ}_{P/Q} and manipulate the MSE expression:

\mathrm{MSE}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] = \mathbb{E}_{x∼Q}\left[ \left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \right)^2 \right] = \mathbb{E}_{x∼Q}\left[ \left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \pm \hat{µ}_{P/Q} \right)^2 \right] (4.20)

\leq 2\, \mathbb{E}_{x∼Q}\left[ \left( \tilde{µ}_{P/Q} − \hat{µ}_{P/Q} \right)^2 \right] + 2\, \mathbb{E}_{x∼Q}\left[ \left( \hat{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \right)^2 \right] (4.21)

= 2\, \mathbb{E}_{x∼Q}\left[ \left( \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} \right)^2 \left( 1 − \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)}{N} \right)^2 \right] + 2\, \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] (4.22)

\leq 2 ‖f‖_∞^2\, \mathbb{E}_{x∼Q}\left[ \left( 1 − \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)}{N} \right)^2 \right] + 2\, \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] (4.23)

= 2 ‖f‖_∞^2\, \mathrm{Var}_{x∼Q}\left[ \frac{\sum_{i=1}^{N} w_{P/Q}(x_i)}{N} \right] + 2\, \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] (4.24)

\leq 2 ‖f‖_∞^2\, \frac{d_2(P‖Q) − 1}{N} + 2 ‖f‖_∞^2\, \frac{d_2(P‖Q)}{N} (4.25)

= 2 ‖f‖_∞^2\, \frac{2 d_2(P‖Q) − 1}{N}, (4.26)

where line (4.21) follows from (4.20) by applying the inequality (a + b)^2 \leq 2(a^2 + b^2), and line (4.23) follows from (4.22) by observing that \left( \frac{\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\sum_{i=1}^{N} w_{P/Q}(x_i)} \right)^2 \leq ‖f‖_∞^2.


4.5

Importance Sampling and Natural Gradient

We can look at a parametric distribution P_ω, having p_ω as its density function, as a point on a probability manifold with coordinates ω ∈ Ω. If p_ω is differentiable, we can define the Fisher Information Matrix (FIM) [36, 2] as F(ω) = \int_X p_ω(x)\, ∇_ω \log p_ω(x)\, ∇_ω \log p_ω(x)^T\, dx. The FIM is an invariant metric (up to a scale) [1] on the parameter space Ω, i.e., κ (ω' − ω)^T F(ω) (ω' − ω) is independent of the specific parameterization and provides a second-order approximation of the distance between p_ω and p_{ω'} on the probability manifold, up to a scale factor κ ∈ R. We can define the natural gradient [1, 17] of a loss function L(ω) as \tilde{∇}_ω L(ω) = F^{-1}(ω) ∇_ω L(ω), which represents the steepest ascent direction on the probability manifold. Thanks to the invariance property, there is a connection between the geometry induced by the Rényi divergence and the Fisher information metric.

Theorem 4.5.1. Let p_ω be a p.d.f. differentiable w.r.t. ω ∈ Ω. Then it holds that, for the Rényi divergence:

D_α(p_{ω'} ‖ p_ω) = \frac{α}{2} (ω' − ω)^T F(ω) (ω' − ω) + o(‖ω' − ω‖_2^2),

and for the exponentiated Rényi divergence:

d_α(p_{ω'} ‖ p_ω) = 1 + \frac{α}{2} (ω' − ω)^T F(ω) (ω' − ω) + o(‖ω' − ω‖_2^2).

Proof. Let us compute the second-order Taylor expansion of the α-Rényi divergence. First, consider the term:

I(ω') = \int_X \left( \frac{p_{ω'}(x)}{p_ω(x)} \right)^{α} p_ω(x)\, dx = \int_X p_{ω'}(x)^{α}\, p_ω(x)^{1−α}\, dx. (4.28)

The gradient w.r.t. ω' is:

∇_{ω'} I(ω') = \int_X ∇_{ω'} p_{ω'}(x)^{α}\, p_ω(x)^{1−α}\, dx = α \int_X p_{ω'}(x)^{α−1}\, p_ω(x)^{1−α}\, ∇_{ω'} p_{ω'}(x)\, dx.

Thus, ∇_{ω'} I(ω')\big|_{ω'=ω} = 0. We can compute the Hessian matrix:

H_{ω'} I(ω') = ∇_{ω'} ∇_{ω'}^{T} I(ω') = α \int_X (α − 1)\, p_{ω'}(x)^{α−2}\, p_ω(x)^{1−α}\, ∇_{ω'} p_{ω'}(x)\, ∇_{ω'}^{T} p_{ω'}(x)\, dx + α \int_X p_{ω'}(x)^{α−1}\, p_ω(x)^{1−α}\, H_{ω'} p_{ω'}(x)\, dx.

Evaluating the Hessian at ω' = ω, and observing that the second term vanishes since \int_X H_ω p_ω(x)\, dx = H_ω \int_X p_ω(x)\, dx = 0, we have:

H_{ω'} I(ω')\big|_{ω'=ω} = α(α − 1) \int_X p_ω(x)^{-1}\, ∇_ω p_ω(x)\, ∇_ω^{T} p_ω(x)\, dx = α(α − 1) \int_X p_ω(x)\, ∇_ω \log p_ω(x)\, ∇_ω^{T} \log p_ω(x)\, dx = α(α − 1)\, F(ω).

Now, D_α(p_{ω'} ‖ p_ω) = \frac{1}{α − 1} \log I(ω'). Thus:

∇_{ω'} D_α(p_{ω'} ‖ p_ω)\big|_{ω'=ω} = \frac{1}{α − 1} \frac{∇_{ω'} I(ω')}{I(ω')}\bigg|_{ω'=ω} = 0,

H_{ω'} D_α(p_{ω'} ‖ p_ω)\big|_{ω'=ω} = \frac{1}{α − 1} \frac{I(ω')\, H_{ω'} I(ω') − ∇_{ω'} I(ω')\, ∇_{ω'}^{T} I(ω')}{I(ω')^2}\bigg|_{ω'=ω} = \frac{1}{α − 1} H_{ω'} I(ω')\big|_{ω'=ω} = α F(ω),

since I(ω) = 1. Computing the gradient and the Hessian of d_α(p_{ω'} ‖ p_ω) = \exp\left( D_α(p_{ω'} ‖ p_ω) \right) w.r.t. ω', we have:

∇_{ω'} d_α(p_{ω'} ‖ p_ω)\big|_{ω'=ω} = \exp\left( D_α(p_{ω'} ‖ p_ω) \right) ∇_{ω'} D_α(p_{ω'} ‖ p_ω)\big|_{ω'=ω} = 0,

H_{ω'} d_α(p_{ω'} ‖ p_ω)\big|_{ω'=ω} = \exp\left( D_α(p_{ω'} ‖ p_ω) \right) \left[ H_{ω'} D_α(p_{ω'} ‖ p_ω) + ∇_{ω'} D_α(p_{ω'} ‖ p_ω)\, ∇_{ω'}^{T} D_α(p_{ω'} ‖ p_ω) \right]\Big|_{ω'=ω} = α F(ω).

The claimed expansions then follow from a second-order Taylor expansion around ω' = ω, using D_α(p_ω ‖ p_ω) = 0 and d_α(p_ω ‖ p_ω) = 1.

Thanks to this result, we have an approximate expression for the variance of the importance weights:

\mathrm{Var}_{x∼p_ω}\left[ w_{ω'/ω}(x) \right] = d_2(p_{ω'} ‖ p_ω) − 1 \simeq (ω' − ω)^T F(ω) (ω' − ω).

Since performing a step along the natural gradient has a measurable effect on the variance of the importance weights, this result justifies the use of the natural gradient in off-policy optimization.


Chapter 5

Optimization via Importance

Sampling

5.1

Concentration Inequality

In the off-policy optimization problem [52], we want to find the best target policy π_T (or hyperpolicy ν_T), given samples collected by a behavioral policy π_B (or hyperpolicy ν_B). We can consider the following, more general, optimization problem: finding the target distribution P maximizing the expected value of a bounded function, \mathbb{E}_{x∼P}[f(x)], using samples collected from a behavioral distribution Q. If we decide to directly optimize the importance sampling estimator \hat{µ}_{P/Q} or \tilde{µ}_{P/Q}, the optimization process is going to assign as much probability mass as possible to the maximum value among the f(x_i). This would lead to an unreliable target policy, as the variance of the estimator could be high. In order to avoid this scenario, we adopt a risk-averse approach, optimizing a statistical lower bound of the expected value \mathbb{E}_{x∼P}[f(x)] which holds with high confidence. The use of a lower bound can be seen as a penalized optimization method: it maximizes the IS estimator, but it penalizes choices of the target distribution leading to a high variance. The connection between the variance of the estimator and the dissimilarity of the distributions can be expressed in terms of the Rényi divergence. Indeed, studying the behavior of the IS estimator, we derive the following bound on the variance of \hat{µ}_{P/Q} in terms of the Rényi divergence.

Lemma 5.1.1. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q. Let x = (x_1, x_2, ..., x_N)^T be i.i.d. random variables sampled from Q and f: X → R a bounded function (‖f‖_∞ < +∞). Then, the variance of \hat{µ}_{P/Q} can be upper bounded as:

\mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] \leq \frac{1}{N} ‖f‖_∞^2\, d_2(P‖Q). (5.1)

Proof. Since the x_i are i.i.d. we can write:

\mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] = \frac{1}{N} \mathrm{Var}_{x_1∼Q}\left[ \frac{p(x_1)}{q(x_1)} f(x_1) \right] \leq \frac{1}{N} \mathbb{E}_{x_1∼Q}\left[ \left( \frac{p(x_1)}{q(x_1)} f(x_1) \right)^2 \right] \leq \frac{1}{N} ‖f‖_∞^2\, \mathbb{E}_{x_1∼Q}\left[ \left( \frac{p(x_1)}{q(x_1)} \right)^2 \right] = \frac{1}{N} ‖f‖_∞^2\, d_2(P‖Q).

When P = Q almost everywhere, we get \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{Q/Q} \right] \leq \frac{1}{N} ‖f‖_∞^2, the usual bound on the variance of a Monte Carlo estimator. Thanks to the relation (4.11) between the Rényi divergence and the Effective Sample Size, we can also write the bound in terms of the ESS: \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right] \leq \frac{‖f‖_∞^2}{\mathrm{ESS}(P‖Q)}, i.e., the variance scales with the ESS instead of N.

Several statistical lower bounds have been proposed in the literature for offline policy evaluation [53] and for optimization [52]. However, studying the properties of the IS estimator, we discover that many of the assumptions required by these lower bounds fail to hold, or introduce unacceptable limitations.

Student-T inequality

A first possible choice could be to rely on the Central Limit Theorem (CLT). If we assume that the importance-weighted samples w_{P/Q}(x_i) f(x_i) are i.i.d. with finite variance σ^2, then, applying the CLT, for N sufficiently large we have:

\hat{µ}_{P/Q} \;\dot\sim\; N\left( µ_{P/Q}, \frac{σ^2}{N} \right).

This would lead to the following statistical lower bound:

Theorem 5.1.1. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and \mathrm{Var}_{x∼Q}\left[ w_{P/Q}(x) f(x) \right] = σ^2 < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q, and f: X → R a bounded function (‖f‖_∞ < +∞). Then, for N sufficiently large and for any 0 < δ ≤ 1, with probability at least 1 − δ it holds that:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) − \frac{σ}{\sqrt{N}}\, t_{1−δ, N−1}, (5.2)

where t_{1−δ, N−1} is the (1 − δ)-quantile of a Student-t distribution with N − 1 degrees of freedom.

However, the value of σ is unknown and must itself be estimated via importance sampling. Choosing this bound would therefore introduce further uncertainty into the optimization process and would lead to an unreliable, high-variance estimate.

Hoeffding Inequality

Hoeffding's inequality assumes that we know the maximum of the estimator. Under this assumption, we have the following lower bound:

Theorem 5.1.2. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_∞(P‖Q) = M < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q, and f: X → R a bounded function (‖f‖_∞ < +∞). Then, for any 0 < δ ≤ 1 and N > 0, with probability at least 1 − δ it holds that:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) − ‖f‖_∞ M \sqrt{\frac{\log(1/δ)}{2N}}. (5.3)

Constraining the estimator to be bounded, i.e., ‖\hat{µ}_{P/Q}‖_∞ < +∞, is equivalent to constraining d_∞(P‖Q) < +∞. Unfortunately, even in the particular case of univariate Gaussian distributions, this leads to an unacceptable limitation: as reported in Section 4.2, this constraint does not allow the optimization process to reach a target policy whose variance is higher than the behavioral one. Hence, the variance of the policy would decrease at every optimization step, which might be a problem if we want the policy to keep exploring the environment.

Bernstein Inequality

Bernstein-like inequalities, which can also be used under Hoeffding's assumptions, require the estimator to have a sub-exponential distribution, which means, in particular, that its tail has to converge to zero at least as fast as that of an exponential distribution. Thanks to the analysis of the IS distribution in Section 4.2, we know that this is not the case: the tail of the distribution is lower bounded by a polynomial function (Proposition 4.2.2), hence displaying a fat-tail behavior. This prevents the use of bounds that rely on Bernstein's inequality.

Cantelli’s inequality

We decide to rely on Chebyshev-like inequalities, in particular on Cantelli's inequality, which only requires the variance of the importance weights to be finite, i.e., d_2(P‖Q) < +∞. Under this condition, when using univariate Gaussian distributions, the target policy is allowed to have a variance which is at most two times the behavioral one; thus, the policy can still sufficiently explore the environment.

Theorem 5.1.3. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_2(P‖Q) < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q, and f: X → R a bounded function (‖f‖_∞ < +∞). Then, for any 0 < δ ≤ 1 and N > 0, with probability at least 1 − δ it holds that:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) − ‖f‖_∞ \sqrt{\frac{(1 − δ)\, d_2(P‖Q)}{δ N}}. (5.4)

Proof. We consider the random variable \hat{µ}_{P/Q} = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) and apply Cantelli's inequality:

\Pr\left( \hat{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \geq λ \right) \leq \frac{1}{1 + \frac{λ^2}{\mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right]}}. (5.5)

Set δ = \frac{1}{1 + \frac{λ^2}{\mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right]}} and consider the complementary event. Then, with probability at least 1 − δ, we have:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \hat{µ}_{P/Q} − \sqrt{\frac{1 − δ}{δ}\, \mathrm{Var}_{x∼Q}\left[ \hat{µ}_{P/Q} \right]}. (5.6)

By replacing the variance with the bound of Lemma 5.1.1, we get the result.

In this bound, we can see a trade-off between the estimated performance and the variability introduced by changing the distribution. The main intuition is that we should optimize the unbiased IS estimator \hat{µ}_{P/Q}, but we should penalize optimization steps leading to a target distribution that is too dissimilar from the behavioral one. In such cases, the estimator is not reliable and the learning process could be highly unstable. The penalization term is expressed in terms of the Rényi divergence of order two between the target distribution P and the behavioral distribution Q. When using the SN estimator, accounting for the bias, we can derive a similar lower bound; hence, the optimization process is the same. We can condense all the constants of the lower bound of Theorem 5.1.3 into λ = ‖f‖_∞ \sqrt{(1 − δ)/δ} and obtain a surrogate objective function.
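To illustrate the resulting objective, the following sketch evaluates the penalized estimate of Theorem 5.1.3 for a few candidate univariate Gaussian targets, given samples from the behavioral distribution; d_2 is computed with the closed form of (4.3), and all distributional choices are illustrative.

import numpy as np
from scipy.stats import norm

def d2_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Exponentiated 2-Rényi divergence d_2(P‖Q) for univariate Gaussians."""
    sigma2_alpha = 2 * sigma_q ** 2 - sigma_p ** 2       # Σ_α for α = 2
    assert sigma2_alpha > 0, "d_2 is finite only if σ_P² < 2 σ_Q²"
    D2 = ((mu_p - mu_q) ** 2 / sigma2_alpha
          - 0.5 * np.log(sigma2_alpha * sigma_p ** 2 / sigma_q ** 4))
    return np.exp(D2)

def penalized_objective(x, f, mu_p, sigma_p, mu_q, sigma_q, delta=0.2, f_sup=1.0):
    """Right-hand side of (5.4): IS estimate minus the Cantelli penalty."""
    w = norm.pdf(x, mu_p, sigma_p) / norm.pdf(x, mu_q, sigma_q)
    is_estimate = np.mean(w * f(x))
    penalty = f_sup * np.sqrt((1 - delta) * d2_gaussian(mu_p, sigma_p, mu_q, sigma_q)
                              / (delta * len(x)))
    return is_estimate - penalty

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=2_000)          # samples from the behavioral Q = N(0, 1)
for mu_p in (0.1, 0.5, 1.5):                  # candidate targets: farther means get penalized more
    print(mu_p, round(penalized_objective(x, np.tanh, mu_p, 1.0, 0.0, 1.0), 4))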

Thanks to this result, a high confidence bound for the SN estimator, using Cantelli’s inequality, can be derived.

Proposition 5.1.1. Let P and Q be two probability measures on the measurable space (X, F) such that P ≪ Q and d_2(P‖Q) < +∞. Let x_1, x_2, ..., x_N be i.i.d. random variables sampled from Q and f: X → R a bounded function (‖f‖_∞ < +∞). Then, for any 0 < δ ≤ 1 and N > 0, with probability at least 1 − δ:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i) f(x_i) − 2 ‖f‖_∞ \min\left\{ 1,\; \sqrt{\frac{d_2(P‖Q)(4 − 3δ)}{δ N}} \right\}.

Proof. In order to obtain this result, we apply Cantelli's inequality and account for the bias. Consider the random variable \tilde{µ}_{P/Q} = \sum_{i=1}^{N} \tilde{w}_{P/Q}(x_i) f(x_i) and let \tilde{λ} = λ − \left| \mathbb{E}_{x∼P}\left[ f(x) \right] − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \right|:

\Pr\left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \geq λ \right) = \Pr\left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \geq λ + \mathbb{E}_{x∼P}\left[ f(x) \right] − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \right) \leq \Pr\left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \geq \tilde{λ} \right).

We can apply Cantelli's inequality:

\Pr\left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼P}\left[ f(x) \right] \geq λ \right) \leq \Pr\left( \tilde{µ}_{P/Q} − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \geq \tilde{λ} \right) (5.7)

\leq \frac{1}{1 + \frac{\tilde{λ}^2}{\mathrm{Var}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right]}} = \frac{1}{1 + \frac{\left( λ − \left| \mathbb{E}_{x∼P}[f(x)] − \mathbb{E}_{x∼Q}[\tilde{µ}_{P/Q}] \right| \right)^2}{\mathrm{Var}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right]}}. (5.8)

Set δ equal to the right-hand side of (5.8) and consider the complementary event: then, with probability at least 1 − δ, we have:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \tilde{µ}_{P/Q} − \left| \mathbb{E}_{x∼P}\left[ f(x) \right] − \mathbb{E}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right] \right| − \sqrt{\frac{1 − δ}{δ}\, \mathrm{Var}_{x∼Q}\left[ \tilde{µ}_{P/Q} \right]}. (5.9)

The bias term can be bounded using (4.14) and the variance term can be bounded using the MSE bound (4.19). Then we have:

\mathbb{E}_{x∼P}\left[ f(x) \right] \geq \tilde{µ}_{P/Q} − ‖f‖_∞ \sqrt{\frac{d_2(P‖Q) − 1}{N}} − ‖f‖_∞ \sqrt{\frac{1 − δ}{δ} \frac{2(2 d_2(P‖Q) − 1)}{N}}

\geq \tilde{µ}_{P/Q} − ‖f‖_∞ \sqrt{\frac{d_2(P‖Q)}{N}} − ‖f‖_∞ \sqrt{\frac{1 − δ}{δ} \frac{4 d_2(P‖Q)}{N}}

= \tilde{µ}_{P/Q} − ‖f‖_∞ \sqrt{\frac{d_2(P‖Q)}{N}} \left( 1 + 2 \sqrt{\frac{1 − δ}{δ}} \right)

\geq \tilde{µ}_{P/Q} − 2 ‖f‖_∞ \sqrt{\frac{d_2(P‖Q)}{N}} \sqrt{1 + \frac{4(1 − δ)}{δ}}

= \tilde{µ}_{P/Q} − 2 ‖f‖_∞ \sqrt{\frac{d_2(P‖Q)(4 − 3δ)}{δ N}},

where in the last steps the fact that \sqrt{a} + \sqrt{b} \leq 2\sqrt{a + b} for any a, b \geq 0 is used. Finally, since the range of the SN estimator is 2‖f‖_∞, we obtain the result.

Note that the bound has the same dependence on d_2(P‖Q) as the one in Theorem 5.1.3. This allows us to optimize the same surrogate objective function for both the IS and the SN estimators.

Algorithm 1 Action-based POIS
Initialize θ_0^0 arbitrarily
for j = 0, 1, 2, ..., until convergence do
    Collect N trajectories with π_{θ_0^j}
    for k = 0, 1, 2, ..., until convergence do
        Compute G(θ_k^j), ∇_{θ_k^j} L(θ_k^j / θ_0^j) and α_k
        θ_{k+1}^j = θ_k^j + α_k G(θ_k^j)^{-1} ∇_{θ_k^j} L(θ_k^j / θ_0^j)
    end for
    θ_0^{j+1} = θ_k^j
end for

5.2

Policy Optimization via Importance Sampling

The bound we found in Theorem 5.1.3 can be adapted to the Reinforcement Learning setting. In particular, in this section we see how to customize it for policy optimization. This leads to a novel, model-free, actor-only, policy search algorithm, which we call Policy Optimization via Importance Sampling (POIS). The proposed bound is adapted to both the Action-based and Parameter-based settings; we call these two variants A-POIS and P-POIS, respectively.

5.2.1

Action-based POIS

In Action-based methods, we want to find the policy parameters θ* maximizing J(θ) within a parametric space Π_Θ = {π_θ : θ ∈ Θ ⊆ R^p} of stochastic differentiable policies. In order to use the results obtained in the previous section, and to adapt the surrogate objective function, we replace the behavioral (resp. target) distribution Q (resp. P) with the behavioral (resp. target) distribution over trajectories p(·|θ) (resp. p(·|θ')), generated by the behavioral policy π_θ (resp. the target policy π_{θ'}). In this setting, the uniformly bounded function f becomes the trajectory return R(τ) (it is uniformly bounded since |R(τ)| \leq R_{max} \frac{1 − γ^H}{1 − γ}).

5.2.2

Estimation of the Rényi divergence

The surrogate objective function in A-POIS contains the Rényi divergence between the target distribution p(·|θ') and the behavioral distribution p(·|θ) over trajectories. Computing this term exactly is intractable, as it requires integrating over the whole trajectory space. Moreover, when the agent acts in a stochastic environment, computing it also requires knowing the transition model P, which is unavailable in the model-free setting. We can, however, provide an exact bound on the Rényi divergence over trajectories when the horizon of the task is finite.

Proposition 5.2.1. Let p(·|θ) and p(·|θ') be the behavioral and target trajectory probability density functions. Let H < ∞ be the task horizon. Then, it holds that:

d_α\left( p(·|θ') \,\|\, p(·|θ) \right) \leq \left( \sup_{s∈S} d_α\left( π_{θ'}(·|s) \,\|\, π_θ(·|s) \right) \right)^{H}.

Proof. The proposition is proven by induction on the horizon H. Let us denote by d_{α,H} the α-Rényi divergence at horizon H. For H = 1 we have:

d_{α,1}\left( p(·|θ') \,\|\, p(·|θ) \right) = \int_S µ_0(s_0) \int_A π_θ(a_0|s_0) \left( \frac{π_{θ'}(a_0|s_0)}{π_θ(a_0|s_0)} \right)^{α} \int_S P(s_1|s_0, a_0)\, ds_1\, da_0\, ds_0 = \int_S µ_0(s_0) \int_A π_θ(a_0|s_0) \left( \frac{π_{θ'}(a_0|s_0)}{π_θ(a_0|s_0)} \right)^{α} da_0\, ds_0

\leq \int_S µ_0(s_0)\, ds_0 \; \sup_{s∈S} \int_A π_θ(a_0|s) \left( \frac{π_{θ'}(a_0|s)}{π_θ(a_0|s)} \right)^{α} da_0 \leq \sup_{s∈S} d_α\left( π_{θ'}(·|s) \,\|\, π_θ(·|s) \right),

where in the last-but-one passage Hölder's inequality is used. Assume now that the proposition holds for any H' < H. We prove that it holds for H:

d_{α,H}\left( p(·|θ') \,\|\, p(·|θ) \right) = \int_S µ_0(s_0) \cdots \int_A π_θ(a_{H−2}|s_{H−2}) \left( \frac{π_{θ'}(a_{H−2}|s_{H−2})}{π_θ(a_{H−2}|s_{H−2})} \right)^{α} \int_S P(s_{H−1}|s_{H−2}, a_{H−2}) \int_A π_θ(a_{H−1}|s_{H−1}) \left( \frac{π_{θ'}(a_{H−1}|s_{H−1})}{π_θ(a_{H−1}|s_{H−1})} \right)^{α} \int_S P(s_H|s_{H−1}, a_{H−1})\, ds_H\, da_{H−1}\, ds_{H−1}\, da_{H−2} \cdots ds_0

\leq \int_S µ_0(s_0) \cdots \int_A π_θ(a_{H−2}|s_{H−2}) \left( \frac{π_{θ'}(a_{H−2}|s_{H−2})}{π_θ(a_{H−2}|s_{H−2})} \right)^{α} \int_S P(s_{H−1}|s_{H−2}, a_{H−2})\, ds_{H−1}\, da_{H−2} \cdots ds_0 \;\times\; \sup_{s∈S} \int_A π_θ(a_{H−1}|s) \left( \frac{π_{θ'}(a_{H−1}|s)}{π_θ(a_{H−1}|s)} \right)^{α} da_{H−1}

\leq d_{α,H−1}\left( p(·|θ') \,\|\, p(·|θ) \right) \; \sup_{s∈S} d_α\left( π_{θ'}(·|s) \,\|\, π_θ(·|s) \right) \leq \left( \sup_{s∈S} d_α\left( π_{θ'}(·|s) \,\|\, π_θ(·|s) \right) \right)^{H},

where, as before, Hölder's inequality is applied and, in the last step, the inductive hypothesis is used.

Unfortunately, this bound is very conservative, and it requires computing a supremum over all states. For A-POIS, we therefore need an estimate of the Rényi divergence. We can obtain an estimator using a sample-based version of definition (4.2):

\hat{d}_α(P‖Q) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{p(x_i)}{q(x_i)} \right)^{α} = \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}^{α}(x_i), (5.10)

where x_i ∼ Q. This estimator is unbiased, but it requires evaluating the distributions over trajectories. We can express these probabilities exactly in terms of the policies, providing the following formula for the α-Rényi divergence:

d_α\left( p(·|θ') \,\|\, p(·|θ) \right) = \int_T p(τ|θ) \left( \frac{p(τ|θ')}{p(τ|θ)} \right)^{α} dτ = \int_T µ_0(s_{τ,0}) \prod_{t=0}^{H−1} P(s_{τ,t+1}|s_{τ,t}, a_{τ,t}) \prod_{t=0}^{H−1} π_θ(a_{τ,t}|s_{τ,t}) \left( \frac{π_{θ'}(a_{τ,t}|s_{τ,t})}{π_θ(a_{τ,t}|s_{τ,t})} \right)^{α} dτ.

The term d_α(π_{θ'}(·|s) ‖ π_θ(·|s)) can be computed exactly, since both π_θ and π_{θ'} are known. Thus, we propose the following estimate for the Rényi divergence between the two distributions over trajectories:

\hat{d}_α\left( p(·|θ') \,\|\, p(·|θ) \right) = \frac{1}{N} \sum_{i=1}^{N} \prod_{t=0}^{H−1} d_α\left( π_{θ'}(·|s_{τ_i,t}) \,\|\, π_θ(·|s_{τ_i,t}) \right). (5.11)

Thus, we obtain the following surrogate objective:

L^{A-POIS}_λ(θ'/θ) = \frac{1}{N} \sum_{i=1}^{N} w_{θ'/θ}(τ_i) R(τ_i) − λ \sqrt{\frac{\hat{d}_2\left( p(·|θ') \,\|\, p(·|θ) \right)}{N}}, (5.12)

where w_{θ'/θ}(τ_i) = \frac{p(τ_i|θ')}{p(τ_i|θ)} = \prod_{t=0}^{H−1} \frac{π_{θ'}(a_{τ_i,t}|s_{τ_i,t})}{π_θ(a_{τ_i,t}|s_{τ_i,t})}. We focus on the particular case in which π_θ(·|s) is a Gaussian distribution whose mean is a function of the state and whose diagonal covariance is state-independent: N(u_µ(s), diag(σ²)), where θ = (µ, σ). Learning alternates online and offline optimization steps. At each online iteration j, we use the current behavioral policy π_{θ_0^j} to interact with the environment and collect a batch of N trajectories.

Algorithm 2 Parameter-based POIS
Initialize ρ_0^0 arbitrarily
for j = 0, 1, 2, ..., until convergence do
    Sample N policy parameters θ_i^j from ν_{ρ_0^j}
    Collect a trajectory with each π_{θ_i^j}
    for k = 0, 1, 2, ..., until convergence do
        Compute G(ρ_k^j), ∇_{ρ_k^j} L(ρ_k^j / ρ_0^j) and α_k
        ρ_{k+1}^j = ρ_k^j + α_k G(ρ_k^j)^{-1} ∇_{ρ_k^j} L(ρ_k^j / ρ_0^j)
    end for
    ρ_0^{j+1} = ρ_k^j
end for

These trajectories are used offline to optimize the surrogate objective function L^{A-POIS}_λ via gradient ascent: θ_{k+1}^j = θ_k^j + α_k G(θ_k^j)^{-1} ∇_{θ_k^j} L(θ_k^j/θ_0^j), where α_k > 0 is the step size, selected using a line search method (see Section 5.3.1), and G(θ_k^j) is a general positive semi-definite matrix (e.g., F(θ_k^j), the FIM, for the natural gradient). The pseudo-code of A-POIS is reported in Algorithm 1.
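The sketch below shows how the surrogate (5.12) could be evaluated for a batch of trajectories, assuming the per-step log-densities of the target and behavioral policies and the per-step d_2 terms have already been computed from the stored states and actions; this is an illustrative interface, not the released implementation at https://github.com/T3p/pois.

import numpy as np

def a_pois_surrogate(logp_target, logp_behav, d2_steps, returns, delta=0.2, f_sup=1.0):
    """
    Surrogate objective (5.12).
    logp_target, logp_behav: (N, H) per-step log π_{θ'}(a_t|s_t) and log π_θ(a_t|s_t)
    d2_steps:                (N, H) per-step d_2(π_{θ'}(·|s_t) ‖ π_θ(·|s_t))
    returns:                 (N,)   trajectory returns R(τ_i)
    """
    N = returns.shape[0]
    log_w = (logp_target - logp_behav).sum(axis=1)     # log importance weight per trajectory
    w = np.exp(log_w)
    d2_hat = np.mean(np.prod(d2_steps, axis=1))        # estimate (5.11) with α = 2
    lam = f_sup * np.sqrt((1 - delta) / delta)         # penalty coefficient λ
    return np.mean(w * returns) - lam * np.sqrt(d2_hat / N)

# Toy batch with random but plausible magnitudes
rng = np.random.default_rng(6)
N, H = 50, 30
logp_b = rng.normal(-1.0, 0.1, size=(N, H))
logp_t = logp_b + rng.normal(0.0, 0.02, size=(N, H))   # target close to the behavioral policy
d2_steps = np.exp(rng.normal(0.01, 0.005, size=(N, H)))
returns = rng.normal(10.0, 2.0, size=N)
print("L^A-POIS:", a_pois_surrogate(logp_t, logp_b, d2_steps, returns))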

5.2.3

Parameter-based POIS

In Parameter-based methods, we want to find the hyperpolicy parameters ρ* maximizing J(ρ) within a parametric hyperpolicy space N_P = {ν_ρ : ρ ∈ P ⊆ R^r} of stochastic differentiable hyperpolicies. In this setting, the policy π_θ, which is assumed to be deterministic (π_θ(a|s) = δ(a − u_θ(s)), where u_θ is a deterministic function of the state s [42, 14]), is not required to be differentiable. In order to use the results obtained in the previous section, and to adapt the surrogate objective function, we replace the behavioral (resp. target) distribution Q (resp. P) with the behavioral (resp. target) hyperpolicy ν_ρ (resp. ν_{ρ'}). As in A-POIS, the uniformly bounded function f becomes the trajectory return R(τ). At the beginning of each episode, the policy parameters θ are sampled from the hyperpolicy. The importance weight for P-POIS becomes:

w_{ρ'/ρ}(θ) = \frac{ν_{ρ'}(θ)\, p(τ|θ)}{ν_ρ(θ)\, p(τ|θ)} = \frac{ν_{ρ'}(θ)}{ν_ρ(θ)}.

Thus, we obtain the following surrogate objective:

L^{P-POIS}_λ(ρ'/ρ) = \frac{1}{N} \sum_{i=1}^{N} w_{ρ'/ρ}(θ_i) R(τ_i) − λ \sqrt{\frac{d_2\left( ν_{ρ'} \,\|\, ν_ρ \right)}{N}}. (5.13)

We focus on the particular case in which ν_ρ is a Gaussian with diagonal covariance matrix, i.e., N(µ, diag(σ²)), with ρ = (µ, σ).

As in A-POIS, learning alternates online and offline optimization steps. At each online iteration j, we use the current behavioral hyperpolicy ν_{ρ_0^j} to interact with the environment and collect a batch of trajectories. These trajectories are used offline to optimize the surrogate objective function L^{P-POIS}_λ via gradient ascent: ρ_{k+1}^j = ρ_k^j + α_k G(ρ_k^j)^{-1} ∇_{ρ_k^j} L(ρ_k^j/ρ_0^j), where α_k > 0 is the step size, selected using a line search method (see Section 5.3.1), and G(ρ_k^j) is a general positive semi-definite matrix (e.g., F(ρ_k^j), the FIM, for the natural gradient). The pseudo-code of P-POIS is reported in Algorithm 2.

Differently from A-POIS, here the term d_2(ν_{ρ'} ‖ ν_ρ) can be computed exactly. Moreover, when the hyperpolicy is a Gaussian with diagonal covariance matrix, the FIM F(ρ) is also diagonal and can be computed exactly [25]. This makes the use of the natural gradient much easier in P-POIS.

5.3

Implementation details

In this section, we provide some details about our implementation of POIS.

5.3.1

Line Search

At every iteration $k$ performed offline, we change the parameters in the direction of the natural gradient $G(\theta_k^j)^{-1}\nabla_{\theta_k^j} L(\theta_k^j/\theta_0^j)$ with a step size $\alpha_k$ such that the improvement is maximized. Most gradient-based methods treat the step size of the optimizer as a hyperparameter to tune. We adopt a different approach and derive a line search method for finding an adaptive step size. The main idea of this method is to locally approximate the objective function $L(\theta)$, in the direction of the gradient $G^{-1}(\theta)\nabla_\theta L(\theta)$, as a concave parabola in the Riemannian manifold with $G(\theta)$ as Riemann metric tensor. Suppose we know an initial point $\theta_0$, the gradient in that point $G(\theta_0)^{-1}\nabla_\theta L(\theta_0)$, and another point $\theta_l = \theta_0 + \alpha_l G(\theta_0)^{-1}\nabla_\theta L(\theta_0)$. Then we can compute the loss function at both points, $L_0 = L(\theta_0)$ and $L_l = L(\theta_l)$, and define $\Delta L_l = L_l - L_0$ as the objective function improvement. Under this approximation, we can compute the global maximum, which corresponds to the vertex of the parabola. We define the parabola $l(\alpha) = L\big(\theta_0 + \alpha G^{-1}(\theta_0)\nabla_\theta L(\theta_0)\big) - L(\theta_0)$, which can be written as $l(\alpha) = a\alpha^2 + b\alpha + c$. Since, by definition of $l(\alpha)$, we have $c = 0$, we can compute the values of $a$ and $b$ as follows:

\[
b = \left.\frac{\partial l}{\partial \alpha}\right|_{\alpha=0} = \left.\frac{\partial}{\partial \alpha}\Big[L\big(\theta_0 + \alpha G^{-1}(\theta_0)\nabla_\theta L(\theta_0)\big) - L(\theta_0)\Big]\right|_{\alpha=0} = \nabla_\theta L(\theta_0)^T G^{-1}(\theta_0)\nabla_\theta L(\theta_0) = \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)},
\]
\[
l(\alpha_l) = a\alpha_l^2 + b\alpha_l = a\alpha_l^2 + \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l = \Delta L_l \implies a = \frac{\Delta L_l - \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l}{\alpha_l^2}.
\]
Therefore, the parabola has the form:
\[
l(\alpha) = \frac{\Delta L_l - \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l}{\alpha_l^2}\,\alpha^2 + \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha. \qquad (5.14)
\]

The parabola is concave only if $\Delta L_l < \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l$. We can compute the position of the vertex as:
\[
\alpha_{l+1} = \frac{\|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l^2}{2\big(\|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l - \Delta L_l\big)}. \qquad (5.15)
\]
As in [23], we can define $\alpha_l = \epsilon_l / \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}$ to obtain a simplified expression. Then we have:
\[
\epsilon_{l+1} = \frac{\epsilon_l^2}{2(\epsilon_l - \Delta L_l)}. \qquad (5.16)
\]

When the parabola is not concave, i.e., $\Delta L_l \ge \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}\,\alpha_l = \epsilon_l$, the vertex update (5.16) yields an infinite or negative step. More generally, given $\eta > 1$, we can distinguish between two cases: i) if $\Delta L_l \le \epsilon_l(2\eta-1)/(2\eta)$, the improvement is sublinear and we use (5.16) to determine the new step size $\alpha_{l+1} = \epsilon_{l+1}/\|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}$; ii) if $\Delta L_l > \epsilon_l(2\eta-1)/(2\eta)$, the improvement is close to linear or superlinear, and (5.16) would increase the step size by more than a factor $\eta$ (or would not be well defined): in this case we simply multiply the step size by $\eta$, i.e., $\alpha_{l+1} = \eta\alpha_l$. We can condense the update procedure as:
\[
\epsilon_{l+1} = \begin{cases} \eta\,\epsilon_l & \text{if } \Delta L_l > \frac{\epsilon_l(2\eta-1)}{2\eta} \\ \frac{\epsilon_l^2}{2(\epsilon_l - \Delta L_l)} & \text{otherwise.} \end{cases} \qquad (5.17)
\]

We iterate this process for a fixed number of iterations, or until the objective function improvement becomes too small. The pseudocode is reported in Algorithm 3.


Algorithm 3 Parabolic Line Search
  Input: $tol_{\Delta L} = 10^{-4}$, $M_{ls} = 30$, $L_0$
  Output: $\alpha^*$
  $\alpha_0 = 0$, $\epsilon_1 = 1$, $\Delta L_0 = -\infty$
  for $l = 1, 2, \dots, M_{ls}$ do
    $\alpha_l = \epsilon_l / \|\nabla_\theta L(\theta_0)\|^2_{G^{-1}(\theta_0)}$
    $\theta_l = \theta_0 + \alpha_l G^{-1}(\theta_0)\nabla_\theta L(\theta_0)$
    $\Delta L_l = L(\theta_l) - L_0$
    if $\Delta L_l < \Delta L_{l-1} + tol_{\Delta L}$ then
      return $\alpha_{l-1}$
    end if
    $\epsilon_{l+1} = \begin{cases} \eta\,\epsilon_l & \text{if } \Delta L_l > \epsilon_l(2\eta-1)/(2\eta) \\ \epsilon_l^2 / \big(2(\epsilon_l - \Delta L_l)\big) & \text{otherwise} \end{cases}$
  end for
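A minimal Python sketch of the parabolic line search of Algorithm 3, assuming the surrogate objective and the natural-gradient direction are available as inputs; the argument names and default values are illustrative.

import numpy as np

def parabolic_line_search(L, theta0, nat_grad_dir, grad_norm_sq,
                          eta=2.0, tol=1e-4, max_iter=30):
    """Parabolic line search along the natural-gradient direction (Algorithm 3).

    L:            callable, surrogate objective theta -> scalar
    theta0:       current parameter vector
    nat_grad_dir: G^{-1}(theta0) grad L(theta0)
    grad_norm_sq: ||grad L(theta0)||^2_{G^{-1}(theta0)}
    """
    L0 = L(theta0)
    alpha_prev, eps = 0.0, 1.0
    dL_prev = -np.inf
    for _ in range(max_iter):
        alpha = eps / grad_norm_sq
        dL = L(theta0 + alpha * nat_grad_dir) - L0
        # Stop as soon as the improvement stagnates and keep the previous step.
        if dL < dL_prev + tol:
            return alpha_prev
        # Step-size update (5.17): grow by eta in the (super)linear regime,
        # otherwise jump to the vertex of the fitted parabola.
        if dL > eps * (2 * eta - 1) / (2 * eta):
            eps = eta * eps
        else:
            eps = eps ** 2 / (2 * (eps - dL))
        alpha_prev, dL_prev = alpha, dL
    return alpha_prev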

5.3.2 Computation of the Fisher Matrix

In A-POIS we cannot obtain the exact expression of the Fisher Information Matrix and we need to estimate it off-policy from samples. Using the IS estimator, we obtain the following estimate for the FIM:

\[
\widehat{F}(\theta'/\theta) = \frac{1}{N}\sum_{i=1}^{N} w_{\theta'/\theta}(\tau_i)\left(\sum_{t=0}^{H-1}\nabla_{\theta'}\log\pi_{\theta'}(a_{\tau_i,t}|s_{\tau_i,t})\right)\left(\sum_{t=0}^{H-1}\nabla_{\theta'}\log\pi_{\theta'}(a_{\tau_i,t}|s_{\tau_i,t})\right)^{T}.
\]

We can replace $w_{\theta'/\theta}(\tau_i)$ with $\widetilde{w}_{\theta'/\theta}(\tau_i)$ in order to obtain an estimate of the FIM using the SN estimator. Unfortunately, when $\theta'$ is far from $\theta$, these estimators are not reliable due to the high variance induced by the IS process. On the contrary, in P-POIS, when Gaussian hyperpolicies are used, we can compute the FIM exactly [47]. If the hyperpolicy has a diagonal covariance matrix, i.e., $\nu_{\mu,\sigma} = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, the FIM is also diagonal:

\[
F(\mu, \sigma) = \begin{pmatrix} \mathrm{diag}(1/\sigma^2) & 0 \\ 0 & 2I \end{pmatrix},
\]
where $I$ is a properly-sized identity matrix.
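For concreteness, here is a small NumPy sketch of the two cases discussed above: the IS-weighted sample estimate of the FIM for A-POIS and the exact diagonal FIM of a Gaussian hyperpolicy. The $(\mu, \log\sigma)$ parameterization in the second function is an assumption of the sketch, chosen to be consistent with the constant $2I$ block.

import numpy as np

def fim_is_estimate(weights, score_sums):
    """IS estimate of the FIM for A-POIS.

    weights:    (N,)   importance weights w_{theta'/theta}(tau_i)
    score_sums: (N, d) per-trajectory sums of grad log pi_theta'(a_t|s_t)
    """
    # Weighted average of the outer products of the per-trajectory scores.
    outer = score_sums[:, :, None] * score_sums[:, None, :]   # (N, d, d)
    return np.mean(weights[:, None, None] * outer, axis=0)

def fim_gaussian_hyperpolicy(sigma):
    """Diagonal of the exact FIM of N(mu, diag(sigma^2)), assuming the
    hyperpolicy is parameterized by (mu, log sigma): the mean block is
    diag(1/sigma^2) and the log-std block is 2I."""
    return np.concatenate([1.0 / sigma ** 2, 2.0 * np.ones_like(sigma)])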

5.3.3 Practical surrogate objective functions

The exact Rényi divergence for P-POIS and the approximated version for A-POIS tend to be very conservative. In order to mitigate this problem, we can rely on the fact that $d_2(P\|Q)/N = 1/ESS(P\|Q)$, from equation (4.12). We can use an estimate $\widehat{ESS}(P\|Q)$, as presented in equation (4.12), in order to estimate the Rényi divergence. This leads to the following approximated surrogate objective functions:

\[
\widetilde{L}_\lambda^{A\text{-}POIS}(\theta'/\theta) = \frac{1}{N}\sum_{i=1}^{N} w_{\theta'/\theta}(\tau_i)\, R(\tau_i) - \frac{\lambda}{\sqrt{\widehat{ESS}\big(p(\cdot|\theta')\,\|\,p(\cdot|\theta)\big)}},
\]
\[
\widetilde{L}_\lambda^{P\text{-}POIS}(\rho'/\rho) = \frac{1}{N}\sum_{i=1}^{N} w_{\rho'/\rho}(\theta_i)\, R(\tau_i) - \frac{\lambda}{\sqrt{\widehat{ESS}\big(\nu_{\rho'}\,\|\,\nu_{\rho}\big)}}.
\]

Moreover, in all the experiments, we use the empirical maximum reward in place of the true Rmax.
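As an illustration, the sketch below computes a common ESS estimate from the importance weights, namely $(\sum_i w_i)^2/\sum_i w_i^2$, and plugs it into the practical A-POIS penalty; whether this is exactly the estimator of equation (4.12) is an assumption of the sketch.

import numpy as np

def ess_estimate(weights):
    """Standard effective-sample-size estimate: (sum w)^2 / sum(w^2)."""
    return np.sum(weights) ** 2 / np.sum(weights ** 2)

def practical_a_pois_surrogate(weights, returns, lam):
    """Approximated A-POIS surrogate using the ESS-based penalty."""
    return np.mean(weights * returns) - lam / np.sqrt(ess_estimate(weights))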

5.3.4 Practical P-POIS for Deep Neural Policies (N-POIS)

When using deep policies, P-POIS suffers from a curse of dimensionality because of the high number of parameters (the Deep Neural Network we use as policy has on the order of $10^3$ parameters). The dimensionality of the corresponding Gaussian hyperpolicy (with diagonal covariance matrix) is therefore very high, producing very unstable results when sampling. This can cause a very conservative behavior in the optimization process, preventing any learning. We decide to group the policy parameters into smaller blocks and to learn each group independently. We can see each neuron of the network as a function:

\[
U_m(x|\theta_m) = g(x^T \theta_m),
\]
where $x$ is the input vector of the neuron, $\theta_m$ is the vector of weights and biases connected to its input, and $g(\cdot)$ is an activation function. We associate a set of parameters $\Theta_m$ with each unit $U_m$, such that $\theta_m \in \Theta_m$. Then, for each corresponding hyperparameter subspace $P_m$, we can compute, as before, a surrogate objective function which we optimize using natural gradient ascent, with the step size found by the line search:
\[
\widetilde{L}_\lambda^{N\text{-}POIS}(\rho'_m/\rho_m) = \frac{1}{N}\sum_{i=1}^{N} \widetilde{w}_{\rho'_m/\rho_m}(\theta_m^i)\, R(\tau_i) - \frac{\lambda}{\sqrt{\widehat{ESS}\big(\nu_{\rho'_m}\,\|\,\nu_{\rho_m}\big)}},
\]

where $\rho_m, \rho'_m \in P_m$ and $\theta_m \in \Theta_m$. This trick reduces the dimension of the multivariate Gaussian hyperpolicies from $\sim 10^3$ to $\sim 10^2$. This variant of the algorithm is called Neuron-Based POIS (N-POIS).
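As an illustration of the per-neuron grouping, the sketch below splits the parameter vector of a dense layer into one block per unit (its incoming weights plus its bias) and attaches an independent diagonal Gaussian hyperpolicy to each block; the layer sizes and the flat parameter layout are assumptions made for the example.

import numpy as np

def neuron_blocks(n_inputs, n_units):
    """Index blocks: each unit owns its incoming weights and its bias.
    Assumes a flat layout [W (n_inputs x n_units, one contiguous column per
    unit), b (n_units)]."""
    blocks = []
    for m in range(n_units):
        w_idx = np.arange(m * n_inputs, (m + 1) * n_inputs)
        b_idx = np.array([n_inputs * n_units + m])
        blocks.append(np.concatenate([w_idx, b_idx]))
    return blocks

# One independent diagonal Gaussian hyperpolicy (mu_m, sigma_m) per block,
# each optimized with its own surrogate L^{N-POIS}_lambda(rho'_m / rho_m).
blocks = neuron_blocks(n_inputs=16, n_units=8)
hyperpolicies = [{"mu": np.zeros(len(b)), "sigma": np.ones(len(b))} for b in blocks]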

