
Continuous Control of UGV for Mapless Navigation: a Virtual-to-Real Deep Reinforcement Learning Approach



SCUOLA DI INGEGNERIA

LAUREA MAGISTRALE IN INGEGNERIA ROBOTICA E DELL’AUTOMAZIONE

MASTER'S THESIS

CONTINUOUS CONTROL OF UGV FOR MAPLESS NAVIGATION:

A VIRTUAL-TO-REAL DEEP REINFORCEMENT LEARNING APPROACH

SUPERVISORS:

PROF. PALLOTTINO LUCIA

ING. FRANCHI MICHELE

CANDIDATE:

CORDELLA GIOVANNI MARCO

ACADEMIC YEAR 2017/2018


Contents

1 Introduction
  1.1 Related work
2 Learning-based Mapless Navigation
3 Reinforcement Learning
  3.1 Background
  3.2 Markov Decision Process
    3.2.1 Agent-Environment Interface
    3.2.2 Reward Shaping
    3.2.3 Value and Policy Function
  3.3 Algorithms for Reinforcement Learning
    3.3.1 Dynamic Programming
    3.3.2 Monte Carlo Methods
    3.3.3 Temporal-Difference Learning
      3.3.3.1 On-policy TD methods
      3.3.3.2 Off-policy TD methods
4 Reinforcement Learning Neural Network
  4.1 Artificial Neural Network
  4.2 Deep Reinforcement Learning
    4.2.1 Actor-Critic Method
      4.2.1.1 Replay buffer
      4.2.1.2 Target networks
    4.2.2 Deep Deterministic Policy Gradient algorithm
5 Experimental study
  5.1 Scenario
  5.2 Training Phase
    5.2.1 Input Dataset and Output
    5.2.2 Reward Function
    5.2.3 2D Virtual Environment
    5.2.4 Agent Structure
    5.2.5 Networks parameters
    5.2.6 Results
  5.3 Evaluation Phase
    5.3.1 3D Virtual Environment
    5.3.2 Real Environment
      5.3.2.1 Hardware
      5.3.2.2 ROS Interface
      5.3.2.3 Discussion
6 Conclusion


Chapter 1

Introduction

An autonomous mobile robot must be able to adapt its skills in order to react adequately in complex and dynamic environments. Two different approaches can be used to handle these kinds of applications: map-based navigation and map-less navigation. With the first approach the robot uses an a priori known map, or learns it from scratch while localizing itself within the environment. The second approach has the advantage of generating navigation trajectories without a model of the environment or an obstacle map.

The map-based approach may lead to better results in terms of navigation path, but it requires a large amount of data, memory and computational resources. Moreover, because this method integrates measurements over time to compute the position of the robot, it is subject to cumulative error that continuously degrades the estimate. The map-less approach needs less memory, since there is no need to reconstruct and store a world model. A traditional map-less approach is obviously not satisfying in highly complex environments, where knowledge of the map can help. Neither method is able to adapt quickly to a dynamic environment, due to the cost of updating the map and predicting the cost-map.


Regarding the map-less navigation problem, it is useful to reduce the dataset coming from the sensors and to make the robot able to react rapidly to environment changes. A good method for achieving this goal is Reinforcement Learning (RL), because it does not require complete knowledge of the environment and it also permits the robot to learn online [1]. Algorithms developed in this area can be viewed as computational processes that transform observations of states and actions into policy parameters. Robots learn how to move in the environment by trial and error, sampling the environment to store experiences and improving their prediction of which action to choose, following the best policy. One of the fundamental components of such interactions is understanding the correlation and causality between the actions of an agent and the changes of the environment as a result of these actions [2].

Reinforcement learning has been used in a variety of applications, often providing better performance than other techniques. Many works show the power of this method, which does not require knowledge of the environment and thus avoids the problem of reconstructing the map and using it for navigation. The method proposed in [4] for model-free navigation operates in a continuous action space, using reinforcement learning to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain. Other works propose control design techniques based on reinforcement learning, where the non-linearity and complexity of dynamics and disturbances cause the failure of classical linear approaches, as shown in [5].


1.1 Related work

This work aims to prove that a robot can move towards a target position, detecting and avoiding obstacles, based on low-dimensional data taken directly from sensors, without knowledge of the environment map. This goal is achieved using a Deep Reinforcement Learning approach: the networks are trained in simulation before being deployed for inference in the real world.

The following chapter introduces the reasons why the map-less approach has been chosen, explaining its advantages and why a deep learning-based approach can lead to better robot behaviour in an unknown environment. Then, a brief survey on reinforcement learning methods is conducted, explaining the strengths of this family of algorithms. Starting from classic approaches such as Dynamic Programming and Monte Carlo methods, a further kind of RL implementation is introduced, called Temporal-Difference Learning. The fourth chapter is about Artificial Neural Networks and their use in support of RL. Among the several Deep RL methods, a specific approach that suits the purpose of this work is described.

The last chapter gives a detailed description of the Deep RL implementation on a real case. The task this work tries to solve is the map-less navigation of an unmanned ground vehicle within an unknown environment. The agent is trained in simulation using a 2D environment; then it is evaluated in a 3D dynamic simulator to understand whether the trained agent can be deployed in a real scenario without damaging the hardware. Continuous control has been chosen to navigate the robot within the environment, instead of using discrete actions and reducing the robot's allowed movements. A particular Deep Reinforcement Learning approach called Actor-Critic has been used. It requires two different networks that work together to maximize a well-designed score collected during training. In this way convergence is achieved and the task accomplished. The network layouts have been designed and trained using the Keras framework and TensorFlow libraries on a high-performance computer. Network inference on the robot's platform in the real scenario was deployed using the TensorRT libraries.


Chapter 2

Learning-based Mapless Navigation

An autonomous robot navigates within an environment basing its motions on information about what surrounds it and where it is with respect to the space it seeks to explore. This kind of information is provided by fusing the environment map with the data from the sensors the robot is equipped with. Methods that require an a priori known map are called map-based methods. Maps can be more or less complex and contain a set of landmarks chosen at design time. As the robot moves within the environment, it collects data from its sensors and extracts features; then it tries to identify the observed landmarks by searching the database for possible matches. Once a match is obtained, the system computes its position as a function of the observed landmarks and their positions in the database. Methods that use vision are heavily based on this approach.

In some cases, model descriptions of the environment are not easy to generate.

Therefore, a better choice is to use robots that explore their environment and build an internal representation of it [?]. These are map-learning methods. Many works have proved the efficiency of a grid map built by merging different kinds of sensor data. The data is projected onto the plane of the floor and each square of the grid represents the probability of occupancy of the corresponding square in the world by an object: the so-called occupancy grid. Other works are based on a topological map instead. These are also map-based methods, because before any navigation can be carried out the system must create a map.

All these approaches rely heavily on the data provided by the sensors and need a large amount of data to function properly. One of these methods is simultaneous localization and mapping (SLAM). It consists in constructing or updating a map of an unknown environment while simultaneously keeping track of the agent's location within it. Several algorithms are known for solving this problem in certain environments; most of them are based on the extended Kalman filter or on particle filters. Many robotic applications such as unmanned ground vehicles (UGV), autonomous underwater vehicles (AUV) or more widespread applications such as domestic robots use this approach to handle map-based navigation.

One of the main problems of these approaches is the interdependence between the map-learning process and the localization process: a map is needed for localization and a pose estimate is needed for mapping. Hence, errors that arise during one of these two phases influence the other and subsequently need to be detected and corrected, a classic chicken-or-egg problem. Other issues of this approach are the time-consuming building and updating of the obstacle map and the strong dependence on a precise, dense laser sensor for the mapping work. Learning a map of an environment from scratch is intrinsically difficult for an autonomous robot. When the map has to be learned, an important additional problem is to evaluate whether the robot is in a part of the environment already stored in the map, or in a part never visited before [?].


In some kinds of applications, such as outdoor applications, it is quite difficult or even impossible to know or construct the environment map. Thus, a map-less approach can be used. The required robot motions are determined by observing and extracting relevant information about the elements in the environment, and it is not necessary to know the robot's absolute position within it. However, with only local observations it is difficult for a motion planner to generate a global navigation behaviour without knowledge of the entire obstacle map. It is still a challenge to rapidly generate appropriate navigation behaviours based on sparse information.

In order to improve the navigation behaviour using less information than a map-based approach, a powerful tool is Deep Reinforcement Learning. Using this approach it is possible to build a map-less method, since the robot is not required to learn the map directly, but rather how to react to environment changes. The robot senses how far the obstacles are using low-dimensional sensor data, knows where the target position is and learns how to move to accomplish the task. Moreover, the robot learns a local cost-map that allows it to choose the next step. This method does not require localization while learning.

This learning-based map-less navigation method owes its strength to Deep Neural Networks. They allow raw sensor data to be used directly, without pre-processing. This brings the advantage of reducing the memory used by the navigation framework with respect to traditional approaches. Thus, with an initial effort in time and resources during training, it is possible to run the trained network on a platform with limited memory. Deep networks make it possible to solve the navigation problem in unknown, complex environments and make the robot's skills adaptable to unseen situations.


Chapter 3

Reinforcement Learning

3.1 Background

Machine learning is a subset of Artificial Intelligence (AI) that gives computers the ability to learn and improve specific tasks from experience, without being explicitly programmed. There are different ways an algorithm can model a problem based on its interaction with the experience or the environment. This way of organizing machine learning algorithms concerns the different role of the input data used for training. There are three main styles of machine learning algorithms:

1. Supervised Learning (SL); input data is called training data and has a known label or result

2. Semi-Supervised Learning (SSL); input data is a mixture of labelled and unlabelled examples

3. Unsupervised Learning (UL); input data is not labelled and does not have a known result


Reinforcement learning fits between SSL and UL: the input data is not labelled, being collected from the environment through experience, but there is a desired prediction problem. The model must learn how to organize the data as well as make predictions.

3.2 Markov Decision Process

As said before, with RL the robot learns by choosing actions based on the current state and using feedback signals to improve the prediction of the next action, in order to execute tasks following the best policy and maximizing the total reward. Markov Decision Processes, or MDPs, are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those, future rewards. MDPs are used in this work as a mathematically idealized form of the reinforcement learning problem, for which precise theoretical statements can be made [6].

3.2.1 Agent-Environment Interface

The most important parts of an MDP are:

1. the agent, the learner and decision maker that selects actions;

2. the environment, which responds to the actions by presenting new situations, in terms of states and rewards, to the agent.

The agent uses the feedback signals from the environment and seeks to maximize the reward over time through its choice of actions. These two parts interact continually, as shown in Figure 3.1.


Figure 3.1: Agent-Environment interaction

At each time step t of the discrete time sequence t = 0, 1, 2, ... the agent chooses an action a_t ∈ A(s) based on the current state s_t ∈ S and receives a new state s_{t+1} and a reward r_{t+1} ∈ R ⊂ ℝ from the environment as a result of its choice. The interaction gives rise to a sequence of data like this:

$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, \dots \tag{3.1}$$

In a finite MDP this leads to the definition of the probability distribution of the random variables r ∈ R and s' ∈ S, dependent only on the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr(s_t = s', r_t = r \mid s_{t-1} = s, a_{t-1} = a) \tag{3.2}$$

for all s', s ∈ S, r ∈ R and a ∈ A(s). The function p defines the dynamics of the MDP.

That is, for particular values of s' and r, there is a probability of those values occurring at time t, given the previous values of s and a.


3.2.2 Reward Shaping

The reward signal is crucial for the success of the learning process. Sutton and Barto (1998) define the reward as:

“the way of communicating to the robot what you want it to achieve, not how you want it achieved”.

The reward signal is the formalization of the idea of a goal, which is a distinctive feature of RL.

The agent tries to maximize the total reward received in the long run, not only the immediate reward. If the reward is provided in such a way that by maximizing it the agent achieves the goal, the task will likely succeed. Thus, it is crucial to choose rewards that correspond to what the agent should accomplish. There are different ways to formulate the goal in terms of rewards: positive rewards encourage the agent to keep doing what it is doing, while negative rewards push the agent to complete the task as quickly as possible.

Mixed rewards can also be used; for an agent that learns to reach a target position while avoiding obstacles, the reward could be positive for success, negative in case of collision with an obstacle, and constant for each nonterminal state. In more complex scenarios, to make the learning process easier, nonterminal states can receive rewards that change at each time step, e.g. rewards that depend on the distance between agent and target, instead of step-like rewards that jump from one constant value to another.

In general, what the agent tries to maximize is the expected return, where the return is defined as a function of the reward sequence r_{t+1}, r_{t+2}, r_{t+3}, .... Denoting the return as G_t, a simple way to define it is the sum of the rewards:

$$G_t \doteq r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T \tag{3.3}$$

where T is the final time step. This definition makes sense if the task breaks into subsequences called episodes and each episode ends in a terminal state at time step T. These kinds of tasks are called episodic tasks.

On the other hand, there are many tasks in which the interaction between agent and environment does not break into episodes, and expression (3.3) could be infinite since T = ∞. These tasks are called continuing tasks. To address the problem of an infinite return, it is necessary to introduce the concept of discounting. The return can then be written as

$$G_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{3.4}$$

with 0 ≤ γ ≤ 1 called the discount factor. If γ < 1 and the reward sequence is bounded, (3.4) is finite. The agent will seek to maximize the expected discounted return. Returns at successive time steps are related to each other by the recursion shown below:

$$G_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \dots) = r_{t+1} + \gamma G_{t+1} \tag{3.5}$$

In order to use a single notation for the return in episodic and continuing tasks, G_t can be defined as:

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k \tag{3.6}$$

with the possibility that T = ∞ or γ = 1, but not both at the same time.
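The two forms of the return are equivalent on a finite reward sequence. The sketch below computes (3.4) directly and through the recursion (3.5); the reward list and discount factor are illustrative values, not taken from the thesis.

```python
# Minimal sketch: the discounted return G_t of (3.4)-(3.6) computed in two ways.

def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum_k gamma^k * r_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_by_recursion(rewards, gamma):
    """Backward recursion G_t = r_{t+1} + gamma * G_{t+1}, computed for every t."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

rewards = [-1.0, -1.0, -1.0, 100.0]   # example episodic reward sequence
gamma = 0.99
assert abs(discounted_return(rewards, gamma) - returns_by_recursion(rewards, gamma)[0]) < 1e-9
```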

3.2.3 Value and Policy Function

Reinforcement learning methods are based on computing the value of the state in which the agent is, or the value of a state-action pair. This value estimates how good it is for the agent to be in a certain state, or how good it is to perform a given action in a given state. The function that computes this value is called the value function; it returns the expected return and depends on what actions the agent will take. Thus, the value function is defined with respect to the agent's way of acting, called the policy.

The policy π(a|s) is a mapping from states to the probability of selecting each action, that is, the probability that a_t = a if s_t = s. The value function can be defined with respect to a state and a policy π as follows:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right] \quad \forall s \in S. \tag{3.7}$$

If the state-action pair and π are used, the value function can be computed as:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]. \tag{3.8}$$

The former is called the state-value function for policy π, the latter the action-value function for policy π. These functions can be estimated from experience: if the agent explores the environment following policy π and maintains an average of the actual returns obtained from a state (or state-action pair), this average converges to the state value v_π(s) (or to the state-action value q_π(s, a)). For any policy π, a recursive condition holds between the value of s and the values of its possible successor states:

$$\begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid s_t = s] \\ &= \mathbb{E}_\pi[r_{t+1} + \gamma G_{t+1} \mid s_t = s] \\ &= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a)\Big[r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid s_{t+1} = s']\Big] \\ &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big] \quad \forall s \in S \end{aligned} \tag{3.9}$$

This equation is called the Bellman equation for v_π; an analogous one can be written for q_π. For RL in a finite MDP, an optimal policy can be defined as the policy with the greatest return. In general, a policy π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s ∈ S.

Hence, the optimal policy is defined through the optimal value functions:

$$v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in S \tag{3.10}$$

or

$$q_*(s, a) \doteq \max_\pi q_\pi(s, a) \quad \forall s \in S,\ a \in A. \tag{3.11}$$

For the state-action pair, the function q_*(s, a) gives the expected return of taking action a in state s and thereafter following the optimal policy. Hence, q_* can be written in terms of v_* as follows:

$$q_*(s, a) = \mathbb{E}[r_{t+1} + \gamma v_*(s_{t+1}) \mid s_t = s, a_t = a] \tag{3.12}$$

The value functions v_*(s) and q_*(s, a) must satisfy the condition given by the Bellman equation. Since they are the optimal value functions, they can also be written in a special form without reference to any specific policy, called the Bellman optimality equation. For the action-value function q_*(s, a), the Bellman optimality equation is:

$$q_*(s, a) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a'} q_*(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right] \tag{3.13}$$

In the next section, some methods to estimate value functions and compute optimal policies are introduced.


3.3 Algorithms for Reinforcement Learning

Reinforcement learning can be implemented using different kinds of algorithms. In order to follow the best policy, some algorithms try to solve the prediction problem using perfect knowledge of the environment. Other algorithms require only experience, i.e. sample sequences from the environment, which can be real or simulated, or they combine both features. Three main families of algorithms are set out below.

3.3.1 Dynamic Programming

Dynamic Programming (DP) is a set of algorithms used to compute optimal policies given a perfect knowledge of the environment, modelled as an MDP. The key idea of DP is to find good policies using value functions. The optimal policy can be easily obtained once the optimal value functions, which satisfy the Bellman optimality equation, are computed.

One of the main steps of a DP algorithm is policy evaluation; that is, given a policy π, the value function can be iteratively computed from (3.9) as follows:

$$v_{k+1}(s) \doteq \mathbb{E}_\pi[r_{t+1} + \gamma v_k(s_{t+1}) \mid s_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_k(s')\big] \tag{3.14}$$

It is possible to show that the sequence v_k converges to v_π as k → ∞, for the same reason that guarantees the existence of v_π. All the updates done in DP algorithms are called expected updates, because they are based on an expectation over all possible next states rather than on a sample next state.
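As a concrete illustration of the update (3.14), the sketch below sweeps the state set until the value estimates stop changing. The transition model P, the policy pi and the tolerance are placeholders the caller must supply; they are assumptions for the example, not objects defined in the thesis.

```python
import numpy as np

# Minimal sketch of iterative policy evaluation (3.14) for a small finite MDP.
# P[s][a] is a list of (prob, next_state, reward) tuples; pi[s] is a list of
# action probabilities.

def policy_evaluation(P, pi, gamma=0.99, tol=1e-6):
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # expected update over actions and next states
                    v_new += action_prob * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```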

After evaluating an arbitrary deterministic policy and obtaining its value function v_π, one can assess, for some state s, the possibility of choosing a different action. This is called policy improvement. Choosing an action a ≠ π(s) can lead to a better or worse policy with respect to the previous one. One way to evaluate the new policy is to choose a = π'(s) in s and then follow the existing policy π. If the state-action value q_π(s, a) is greater than the state value v_π(s), it is better to select a in s and then follow π. The new greedy policy π' is defined as follows:

$$\begin{aligned} \pi'(s) &\doteq \arg\max_a q_\pi(s, a) \\ &= \arg\max_a \mathbb{E}[r_{t+1} + \gamma v_\pi(s_{t+1}) \mid s_t = s, a_t = a] \\ &= \arg\max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big] \end{aligned} \tag{3.15}$$

and its value function is:

$$v_{\pi'}(s) = \max_a \mathbb{E}[r_{t+1} + \gamma v_{\pi'}(s_{t+1}) \mid s_t = s, a_t = a] = \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi'}(s')\big] \tag{3.16}$$

Policy improvement thus must return a strictly better policy except when the original policy is already optimal.

The two steps defined above, policy evaluation (E) and policy improvement (I), are then used in a sequence of steps that yields an iterative process called policy iteration [6]:

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \tag{3.17}$$

where each policy is guaranteed to be strictly improved over the previous one, unless it is already optimal.

In DP methods the estimates of the values of states are based on the estimates of the values of successor states. This property is called bootstrapping. DP may not be practical for very large problems, because it requires the whole probability distribution over the state set. However, compared with other methods for solving MDPs, it is actually efficient, even more so with today's computers.

In the next section a different method, which does not require a model and does not bootstrap, is introduced.

3.3.2 Monte Carlo Methods

Monte Carlo (MC) methods are ways of solving the reinforcement learning problem based on averaging sample returns. These methods do not require complete knowledge of the environment, but need only sample transitions, unlike DP.

In order to guarantee well-defined returns, episodic tasks are considered. Evaluation and improvement of value functions and policies are made only at the end of each episode, not at each step. Thus, Monte Carlo methods are incremental in an episode-by-episode sense [6]. One of the main differences with respect to DP methods is that MC methods learn value functions from sample returns rather than computing them from knowledge of the MDP.

Monte Carlo prediction is used to learn the state-value function v_π(s) of a state s given a policy π. Recall that the value of state s is the expected cumulative future discounted return starting from that state. The simplest way to compute this value is to average the returns observed after visits to s. There are two ways to estimate the value function: the first-visit MC method and the every-visit MC method. The first averages the returns following the first visit to s in each episode; the second averages the returns following every visit to s. In both cases, the estimate converges to v_π(s) as the number of visits, or first visits, goes to infinity.

If the environment is completely unknown, it is better to estimate action values q_π(s, a), because state values alone may not be sufficient. A state-action pair is said to be visited if state s is visited and action a is taken in that state. Hence, MC methods estimate the value of a state-action pair as the average of the returns that have followed all the visits to it. The problem is that some state-action pairs may never be visited when following a deterministic policy π, so their estimates cannot improve with experience. In order to address this problem of maintaining exploration, continual exploration must be assured, e.g. by starting every episode in a different state-action pair, with non-zero probability for all pairs (exploring starts).

As seen in the DP chapter, policy evaluation and policy improvement work together to reach the optimal policy π_* with optimal action value q_*(s, a). Policy evaluation computes q_{π_k}(s, a) for a given policy π_k after an infinite number of episodes, and policy improvement is done by constructing π_{k+1} as the greedy policy with respect to q_{π_k}, as follows:

$$\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s, a) \tag{3.18}$$

Policy π_{k+1} is better than π_k, or just as good as π_k, in which case they are both optimal policies. This assures that the process converges to the optimal policy and optimal value function. In this way Monte Carlo methods can be used to find optimal policies given only sample episodes and no other knowledge of the environment's dynamics.

Exploring starts cannot be relied upon when interacting with and learning from the actual environment, so different solutions can be adopted to avoid this problem. In the case of on-policy methods, as with exploring starts itself, it is possible to define a different policy improvement step that uses an ε-greedy policy, as follows:

$$\pi(a \mid s) = \begin{cases} \dfrac{\epsilon}{|A(s)|}, & \text{if } a \text{ is nongreedy} \\[2ex] 1 - \epsilon + \dfrac{\epsilon}{|A(s)|}, & \text{if } a \text{ is greedy} \end{cases} \tag{3.19}$$

with ε > 0. This means that π(a|s) > 0 for all s ∈ S and a ∈ A(s), and that the policy gradually shifts towards a deterministic optimal policy. On-policy methods face the problem that they learn the value function in order to reach optimal behaviour, but need to follow non-optimal behaviour to explore all possible actions and improve the estimates. These methods are therefore a compromise: they learn action values for a near-optimal policy.
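In code, the ε-greedy rule of (3.19) is usually realized as in the sketch below: a uniformly random action with probability ε, the greedy action otherwise, which assigns each nongreedy action probability ε/|A(s)| and the greedy one 1 − ε + ε/|A(s)|. The dictionary-based Q table and the helper name are illustrative assumptions.

```python
import random

# Minimal sketch of epsilon-greedy action selection over a tabular Q estimate
# stored as a dict keyed by (state, action).

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                              # exploratory choice
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # greedy choice
```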

A better way to proceed is to use off-policy methods. Two kinds of policy are involved in this case: the policy that is learned, called the target policy, and the policy that generates behaviour, called the behaviour policy. An advantage of this separation is that the target policy can be deterministic while the behaviour policy keeps exploring the environment. Thus, off-policy MC prediction learns the value function of the target policy from data generated by the behaviour policy. This approach will be used in the experimental case of this work. In the last section of this chapter, bootstrapping and a model-free approach are the key features of another method.

3.3.3 Temporal-Difference Learning

Temporal-Difference (TD) learning is a combination of DP and MC methods. Like Monte Carlo methods, it learns directly from raw experience without knowledge of the environment, but it does not wait for the end of the episode to update its estimates of the value function. Indeed, it can improve the estimates at each step based on other learned estimates; thus, it bootstraps like DP. As before, the first step is to examine the prediction problem (or evaluation problem) for TD learning. The update rule of the estimated value function V(s_t) for MC methods is

$$V(s_t) \leftarrow V(s_t) + \alpha\big[G_t - V(s_t)\big] \tag{3.20}$$

where G_t is the target (the actual return following time t) and α is a constant step-size parameter. TD learning needs to wait only until the next step, not the whole episode as MC. Thus, the update rule can be written as follows:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big] \tag{3.21}$$

where the target is r_{t+1} + γV(s_{t+1}), as seen in DP. This particular TD method is called one-step TD, or TD(0), because its target takes into account only two consecutive steps. The quantity r_{t+1} + γV(s_{t+1}) − V(s_t) is called the TD error, because it is the error between the estimated value of s_t and the target, which is a better estimate of the value of that state.

The advantage of TD methods over MC methods is that they are more suitable for tasks with long episodes or continuing tasks, because the estimates are updated at each step. A great advantage of TD over DP is that it does not require a model of the environment and its probability distributions. Besides these advantages, which make many problems easier to face, TD(0) has also been proved to converge to v_π for a constant or decreasing step-size parameter.

Another TD(0) variant that guarantees convergence is batch (incremental) updating. Suppose there is a finite amount of experience, called a batch of training data: this method computes the increments of (3.21) at each step, but the value function is changed only once, with the sum of all the increments over the batch. This technique will be adopted later.

3.3.3.1 On-policy TD methods

In order to use TD prediction methods for value and policy estimation while trading off exploration and exploitation, it is possible to choose between an on-policy and an off-policy approach. For the first case, a widely used algorithm is Sarsa. Its name comes from the tuple used to collect data from the environment, as shown in (3.1). Instead of the state value, the action value Q(s, a) is now computed as an estimate of q_π(s, a), and its update rule is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big] \tag{3.22}$$

This update is repeated until a terminal state occurs, for which the action value Q(s_{t+1}, a_{t+1}) is defined as zero.

As in all on-policy methods, the action value q_π of the current policy π is continually estimated and, at the same time, the policy is made more greedy with respect to q_π. Sarsa converges with probability 1 to an optimal policy π_* and action-value function q_* as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy [6].
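One Sarsa episode implementing the update (3.22) can be sketched as below. The `env` object with reset()/step() methods and the use of the `epsilon_greedy` helper sketched above are assumptions made for the example.

```python
# Minimal sketch of one tabular Sarsa episode (update rule 3.22).
# env.reset() -> state, env.step(action) -> (next_state, reward, done) are assumed.

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        # the action value of a terminal state is defined as zero
        a_next = None if done else epsilon_greedy(Q, s_next, actions, epsilon)
        target = r + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0))
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next
    return Q
```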

3.3.3.2 Off-policy TD methods

If an off-policy approach is chosen, the Q-learning algorithm is one of the best choices. Its strength is that the action-value function Q approximates q_* regardless of the policy being followed. This property is clear in the update rule below:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big] \tag{3.23}$$

In this case, the maximum over the next state-action pairs is used instead of a sampled action value. The minimal requirement for convergence is that each state-action pair continues to be updated. Under this assumption, the action value Q converges with probability 1 to the optimal value q_*.
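For comparison with the Sarsa sketch, one tabular Q-learning episode implementing (3.23) could look as follows; the same `env` and `epsilon_greedy` assumptions apply, and the max over next actions is what makes the learned values independent of the behaviour policy.

```python
# Minimal sketch of one tabular Q-learning episode (update rule 3.23).

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon)      # behaviour policy only generates data
        s_next, r, done = env.step(a)
        best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        s = s_next
    return Q
```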


Chapter 4

Reinforcement Learning Neural Network

Many applications of reinforcement learning involve large state spaces, and the methods seen so far may not be sufficient to manage them. Thus, instead of computing value and policy functions exactly, this chapter introduces methods that approximate these functions using limited computational resources.

The problems that occur with large state spaces are related to the memory needed to store the computed function values and to the time and data needed to fill them accurately. It is necessary to go beyond tabular methods, learning more from a limited set of data and generalizing from previous encounters with states that are in some sense similar to the current one. Hence, generalization is the main issue and the strength of the methods shown later, which combine RL with existing generalization methods.

One of the best ways to generalize is function approximation, which starts from examples of the desired function and attempts to generalize them into an approximation of the whole function. Artificial Neural Networks (ANN) are widely used for nonlinear function approximation.

4.1 Artificial Neural Network

An artificial neural network (ANN) is a network of interconnected units that have some of the properties of neurons. These units can be arranged in different layers, such as an input layer, an output layer and hidden layers. The larger the number of hidden layers, the deeper the net; deeply-layered ANNs are responsible for some of the most impressive abilities of machine learning systems, including reinforcement learning systems. A simple feedforward ANN is shown in Figure 4.1.

Figure 4.1: ANN with three input units, 2 hidden layers and 1 output unit

The word feedforward means there are no loops in the net through which a unit's output influences its own input. Each link between two units is weighted by a real number that corresponds to the efficacy of the connection. The output of a unit is computed by applying a nonlinear function, called the activation function, to the weighted sum of its input signals. This is a semi-linear behaviour. Activation functions are typically S-shaped, e.g. the sigmoid or the hyperbolic tangent; sometimes a rectifier nonlinearity is used. The reason why activation functions are nonlinear is that, if all the units in a multi-layer feedforward ANN had linear activation functions, the entire network would be equivalent to a network with no hidden layers [6].
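As a small illustration, the Keras sketch below builds the feedforward network of Figure 4.1, with three input units, two hidden layers and one output unit; the layer widths and activation choices are illustrative assumptions, not the ones used later in the thesis.

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Minimal sketch of the feedforward ANN of Figure 4.1.
inputs = Input(shape=(3,))                       # three input units
h1 = Dense(8, activation="tanh")(inputs)         # first hidden layer (S-shaped activation)
h2 = Dense(8, activation="tanh")(h1)             # second hidden layer
output = Dense(1, activation="linear")(h2)       # single output unit
model = Model(inputs=inputs, outputs=output)
model.summary()
```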

In order to approximate functions, deep neural networks learn the function itself through the training process. Training an ANN means changing the weight of each connection to improve the network's ability to approximate the function it is trying to learn. Another advantage of this training procedure is the possibility of automatically creating features appropriate for a given problem, without relying exclusively on hand-crafted features, letting the net improve on its own. During training, each weight is updated in a direction aimed at improving the performance measured by an objective function. Since the purpose of training is to change the weights based on how their variation influences the network's performance, it is necessary to compute the partial derivative of the objective function with respect to each weight. In RL problems, a TD error can be used as the performance index, that is, as the objective function. The gradient is the vector of these partial derivatives.

The standard way to compute and apply these updates is the backpropagation algorithm, which consists of alternating forward and backward passes through the network. During the forward step, each unit computes its output and propagates it to the units of the next layer, as shown in Figure 4.2.

Figure 4.2: Forward pass of the training process. Each unit computes the weighted sum of the previous outputs and then applies the activation function.


In the backward step, the error between the actual output of the net and the target is computed and backpropagated to the previous layers, updating each weight as follows:

$$w_{ij} = w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}} \tag{4.1}$$

where α is the learning rate and L is the loss function computed at the end of the previous forward step. The learning rate is a constant (usually very small) that forces the weights to be updated smoothly and slowly, avoiding big steps and chaotic behaviour. The partial derivative in (4.1) can be computed as:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial \mathrm{out}_j} \frac{\partial \mathrm{out}_j}{\partial \mathrm{in}_j} \frac{\partial \mathrm{in}_j}{\partial w_{ij}}. \tag{4.2}$$

The drawback of backpropagation is that it may not work well for deeper networks. This problem has two important causes. The first is the large number of weights, which makes it difficult to avoid overfitting, that is, the failure to generalize to cases the ANN has not visited. The second occurs in the backward step, because the partial derivatives decay or grow rapidly toward the input of the network, making the learning process slow or unstable.

There are many methods that aim to avoid overfitting and make the learning process of deep ANNs more efficient and stable. One of these is batch normalization (BN), which normalizes the output of deep layers before it is transferred to the following layers. More precisely, this method adjusts the input variables to have zero mean and unit variance.

An interesting type of network used in RL problems is the deep convolutional network, made of convolutional layers specialized in processing high-dimensional arrays such as images. Each layer produces feature maps, and each unit of the following layer performs the same operation (a convolution) on the part of the data it sees from the previous layer, its receptive field. These networks are able to extract patterns from data and recognize features thanks to their deep hierarchical structure.

The impressive results obtained on RL problems with nonlinear approximation are due to the performance of ANNs and their capacity for adaptation, a behaviour similar to that of the neurons of the human brain.

4.2 Deep Reinforcement Learning

To handle the large number of state-action pairs, using a neural network to approximate the Q function results in a more compact representation than a lookup table; it also makes it possible to interpolate for state-action pairs that have not been visited. These are great advantages for generalizing during training. Significant progress has been made by combining advances in deep learning for sensory processing with reinforcement learning, resulting in Deep Reinforcement Learning (DRL), which solves problems with high-dimensional observation spaces. There are many ways to implement DRL, in discrete and continuous action spaces. Deep Q-Network (DQN) is one of the first algorithms developed for this purpose; it is capable of human-level performance using unprocessed data from sensors, thanks to deep neural network function approximators.

DQN solves the high-dimensional state space problem, but it can only handle discrete and low-dimensional action spaces. Since DQN finds the action that maximizes the action-value function Q, the continuous case would require an iterative optimization process at every step, which could be computationally heavy. Other methods have been developed to address continuous action space problems, such as the one introduced in the following section.

4.2.1 Actor-Critic Method

One of the most successful methods to implement DRL in a continuous action space is the Actor-Critic method. This approach separates policy evaluation and optimization from the estimation of the value function, using a different network for each task. Actor-Critic constitutes an important class of reinforcement learning algorithms that can deal with continuous actions and states in an easy and natural way [7]. The Actor-Critic method is a combination of value-based methods and policy-based methods, as shown in Figure 4.3. The actor learns by using feedback from the critic.

Figure 4.3: Actor-critic set-up

Although DQN is unusable in continuous action spaces, two innovative features that brought it to success can be reused in the Actor-Critic method to guarantee a stable and robust learning process: off-policy training using samples from a replay buffer, and target networks.


4.2.1.1 Replay buffer

The replay buffer is a memory in which the agent stores samples containing the current state, the action taken, the reward, the new state and a boolean variable marking terminal states, as follows:

$$\text{sample}: \big(s_t, a_t, r_{t+1}, s_{t+1}, \text{is\_terminal}\big) \tag{4.3}$$

At each training step, while the on-policy side of the algorithm chooses actions and stores samples of the environment dynamics, the off-policy side takes a random batch of samples from the buffer to build target functions and improve the prediction. When the replay buffer is full, the oldest samples are discarded. Taking samples randomly is a key step towards more stable learning, because it minimizes the correlation between samples. Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
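A minimal sketch of such a buffer, assuming a bounded FIFO memory with uniform random minibatch sampling (capacity and batch size are illustrative values, not those of the thesis):

```python
import random
from collections import deque

# Minimal sketch of a replay buffer storing transitions of the form (4.3).
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded when full

    def store(self, state, action, reward, next_state, is_terminal):
        self.buffer.append((state, action, reward, next_state, is_terminal))

    def sample(self, batch_size=64):
        # uniform random sampling minimizes correlation within a minibatch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```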

4.2.1.2 Target networks

Another addition that makes DQN unique is the use of a second network during the training procedure. This second network generates the target values used to compute the loss for every action during training. The issue is that at every training step the network's predictions shift, and if this constantly shifting set of values is used to adjust the network's values, the value estimates can easily spiral out of control: the network can become destabilized by falling into feedback loops between the target and the estimated Q values. In order to mitigate this risk, the target network's weights are kept fixed and only periodically or slowly updated towards the primary network's values. These kinds of update are called hard update and soft update respectively; in this way training proceeds in a more stable manner. With the hard update, the weights of the target network take the values of the primary network's weights once every n steps, with n ≈ 10³. The soft update makes this transition smoother, updating the target network at each step as a combination of its current weights w'_t and the values of the primary network w_t:

$$w'_{t+1} = \tau w_t + (1 - \tau) w'_t \tag{4.4}$$

with τ << 1.
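The two update rules can be sketched as below for Keras-style models exposing get_weights()/set_weights(); the value of τ and the update period are illustrative assumptions.

```python
# Minimal sketch of the hard and soft target-network updates.

def soft_update(primary, target, tau=0.001):
    """Soft update (4.4): w' <- tau*w + (1 - tau)*w', applied at every step."""
    new_weights = [tau * w + (1.0 - tau) * w_t
                   for w, w_t in zip(primary.get_weights(), target.get_weights())]
    target.set_weights(new_weights)

def hard_update(primary, target):
    """Hard update: copy the primary weights into the target once every n steps."""
    target.set_weights(primary.get_weights())
```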

4.2.2 Deep Deterministic Policy Gradient algorithm

Policy gradient is a particular way of updating the policy by following the gradient itself. This method can provide a strong learning signal on how to improve a parameterised policy. A recent development in the context of actor-critic algorithms is the Deterministic Policy Gradient (DPG), the extension of the standard policy gradient theorems to deterministic policies. In this setting, the Actor-Critic method uses a learned value function as a baseline for the policy gradient. In this work a DPG algorithm is implemented with the advantages of deep neural networks (deep DPG, or DDPG). Figure 4.4 shows how this algorithm works with the replay buffer and target networks mentioned before.

Starting from the Bellman equation:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1}}\Big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[Q^\pi(s_{t+1}, a_{t+1})\big]\Big] \tag{4.5}$$


Figure 4.4: DDPG algorithm

Since the target policy µ : S → A is deterministic, the inner expectation can be avoided and the above relationship can be written as follows:

$$Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1}}\Big[r(s_t, a_t) + \gamma\, Q^\mu\big(s_{t+1}, \mu(s_{t+1})\big)\Big] \tag{4.6}$$

In this way it is possible to learn Q^µ off-policy, using transitions generated from a different, stochastic behaviour policy. The function approximator parameters θ^Q are optimized by minimizing the loss function:

$$L(\theta^Q) = \mathbb{E}_{s_t, a_t, r_t}\Big[\big(y_t - Q(s_t, a_t \mid \theta^Q)\big)^2\Big] \tag{4.7}$$


where y_t is the target function:

$$y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q\big). \tag{4.8}$$

The critic is trained by minimizing the loss function (4.7), as in standard Q-learning; the actor is updated using the gradient of the critic's output with respect to the actions, which are in turn produced by the actor's weights. This update is implemented using the following chain rule:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t}\Big[\nabla_{\theta^\mu} Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t \mid \theta^\mu)}\Big] = \mathbb{E}_{s_t}\Big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t}\Big] \tag{4.9}$$

As said before, in order to stabilize training, target networks are used for the computation of the target function (4.8), because the Q update is prone to divergence. Given the critic and actor target networks Q'(s, a|θ^{Q'}) and µ'(s|θ^{µ'}), equation (4.8) can be written as follows:

$$y_t = r(s_t, a_t) + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big). \tag{4.10}$$

This change moves the relatively unstable problem of learning the action-value func- tion closer to the case of supervised learning, a problem for which robust solutions exist. This may slow learning, since the target network delays the propagation of value estimations, but the gained improvement in stability is a great result.
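Putting (4.7)-(4.10) together, one training step on a sampled minibatch can be sketched in TensorFlow 2 style as below. The model and optimizer objects, the terminal mask `dones` and the use of GradientTape are assumptions about the implementation, not a description of the exact code used in the thesis.

```python
import tensorflow as tf

# Minimal sketch of one DDPG update on a minibatch sampled from the replay buffer.
def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt,
                states, actions, rewards, next_states, dones, gamma=0.99):
    # target y_i = r_i + gamma * (1 - done_i) * Q'(s_{i+1}, mu'(s_{i+1})), eq. (4.10)
    target_q = target_critic([next_states, target_actor(next_states)])
    y = rewards + gamma * (1.0 - dones) * target_q

    # critic update: minimize the mean squared error of eq. (4.7)
    with tf.GradientTape() as tape:
        q = critic([states, actions])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # actor update: ascend dQ/dtheta_mu through the actions, eq. (4.9)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))
    return critic_loss, actor_loss
```

After each call, the target networks would be moved towards the primary ones with the soft update of (4.4).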

In order to handle the exploration problem in DDPG, an exploration policy µ̂ is created by adding noise sampled from a noise process N to the actor policy µ. This new policy is:

$$\hat{\mu}(s_t) = \mu(s_t \mid \theta^\mu) + \mathcal{N}. \tag{4.11}$$


In this work the Ornstein-Uhlenbeck process is used; this process introduces a variation of the variable to be randomized, defined as follows:

$$dx_t = \theta(\mu - x_t)\,dt + \sigma\, dW_t \tag{4.12}$$

where θ > 0, µ and σ > 0 are parameters and W_t denotes the Wiener process. The advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm. In terms of pseudo-code, the algorithm can be implemented as shown in Algorithm 1.

In the next chapter the experimental case is introduced in order to test the method presented so far. Other important aspects, such as the hardware and its related problems, the tuning of the training parameters and the results, are also treated.


Algorithm 1 DDPG

Initialize critic Q(s, a|θ^Q) and actor µ(s|θ^µ) with parameters θ^Q and θ^µ
Initialize target networks Q' and µ' with parameters θ^{Q'} ← θ^Q and θ^{µ'} ← θ^µ
Initialize replay buffer R
for episode = 1, MAX_EPISODE do
    Initialize random process N for exploration
    Take first observation state s_1
    for step = 1, MAX_STEP do
        Select action a_t = µ(s_t|θ^µ) + N_t
        Execute action a_t and take the new observation s_{t+1} and reward r_t
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Compute target function y_i = r_i + γ Q'(s_{i+1}, µ'(s_{i+1}|θ^{µ'})|θ^{Q'})
        Update critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))²
        Update actor policy using the chain rule:
            ∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s|θ^µ)|_{s=s_i}
        Update target networks with soft update:
            θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
            θ^{µ'} ← τ θ^µ + (1 − τ) θ^{µ'}
    end for
end for


Chapter 5

Experimental study

5.1 Scenario

In this chapter a real case is introduced in order to evaluate the DDPG algorithm. The agent has to reach a target position according to the mission and to react to obstacles, detecting and avoiding them based on its sensors. The agent has no information about the environment, its size, its shape or the obstacle distribution. Thus, it controls the vehicle with continuous actions, trying to solve a map-less navigation problem.

The vehicle used for this purpose is a fully electric, nonholonomic tracked vehicle with differential kinematics. The sensors used to let the vehicle adapt its behaviour to environment changes are a laser scanner first and a depth camera later. The target position is supplied by combining GPS and IMU data with respect to the vehicle.

The agent is trained in a 2D virtual environment, then tested in a 3D virtual environment that emulates the vehicle dynamics and the sensor data more precisely. After these two phases, if the agent behaves well with respect to the mission to accomplish, the whole system is evaluated in the real scenario.


Since DDPG requires sampling the environment to improve and generalize learning, one of the best ways to maximize the number of state-action pairs visited by the agent is to start each episode in a random position, set a random target position and build a different environment. Clearly, this is very difficult to implement in a real environment; thus, the networks are trained in a virtual environment and then deployed in the real case. This idea is called virtual-to-real DDPG [8].

5.2 Training Phase

This work aims to provide continuous control for the map-less navigation of an unmanned nonholonomic ground vehicle. The purpose is to find a translation function:

$$v_t = f(x_t, d_t, h_t) \tag{5.1}$$

where x_t is the observation from the raw sensor information, and d_t and h_t are the distance and heading error of the target with respect to the robot frame. Together they can be regarded as the instantaneous state s_t of the mobile robot. The model learns to correctly map the state to a new action, the velocity v_t, through the training process. At each step the agent collects a sample of the completely unknown environment dynamics: it chooses an action based on the current state and receives the reward related to the new state. Thus, it learns which action leads to the best result in terms of reward.

5.2.1 Input Dataset and Output

As shown in (5.1), the network takes different pieces of information as input and returns new values interpreted as actions. The choice of the dataset provided to the network is closely related to the sensors available on the vehicle. The input state used in this work is made up of two main parts. The first part is a 16-dimensional vector of distance measures. The second part consists of two 1-dimensional states representing the target position in polar coordinates (distance and angle) with respect to the mobile robot coordinate frame.

In order to generalize training and guarantee that the weights obtained from training in the 2D environment can be used in the real environment, the component relative to the distance of the target from the vehicle is supplied to the network as the reciprocal value 1/d_t. In this way it is possible to train in a smaller environment, since the networks see a range of reciprocal values wider than the one produced by the larger distances of the real environment.

The simulated laser covers the range between −90° and 90° with a fixed angular spacing of 11.25° and a maximum range of 5 m. Thus, the agent is able to detect an obstacle or a wall closer than 5 m. The choice of limiting the laser range forces the agent to base its actions on less information. Regarding the camera sensor, the data from the simulated laser have been modified to emulate data from a depth camera, reducing the field of view from 180° to 90° with the same number of beams, as shown in Figure 5.1.

Figure 5.1: Difference between the simulated laser scanner (a) and the simulated depth camera (b) in the 2D environment
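Assembling the 18-dimensional input state described above could look like the sketch below; the clipping to the 5 m maximum range and the variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: 16 range measures plus the reciprocal target distance 1/d_t
# and the heading error h_t, stacked into a single state vector.
def build_state(ranges, target_distance, heading_error, max_range=5.0):
    ranges = np.clip(np.asarray(ranges, dtype=np.float32), 0.0, max_range)  # 16 beams
    return np.concatenate([ranges,
                           [1.0 / max(target_distance, 1e-3)],   # reciprocal distance
                           [heading_error]])                     # heading error
```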


As shown later, these three pieces of information about the environment are fed into the actor network, which processes them and returns new values as the action. The action is a 2-dimensional vector that includes the angular and linear velocity of the mobile robot. The range of the linear velocity is constrained to (0, 1), while the angular velocity can assume values within (−1, 1). Backward motion is not allowed, because the laser and camera cannot cover the area behind the mobile robot. The exploration noise added to the action is generated using the following parameters:

OU parameter         Value
θ                    1
σ                    0.2
µ (linear action)    0.5
µ (angular action)   0

Table 5.1: Parameters of the Ornstein-Uhlenbeck process.

As shown in the table above, the value of µ is different for the linear and angular actions because of their different output ranges. This parameter represents the value the action tends to when the predicted value is near zero, as stated in (4.12). Thus, the Ornstein-Uhlenbeck process guarantees that the actions assume the desired values when the predicted action is too small and the agent gets stuck in the same position and orientation, encouraging exploration. During the training phase, as the agent becomes more expert and improves its skills, the noise component tends to zero in order to decrease exploration in favour of exploitation. The implementation of (4.11) is:

$$\hat{\mu}(s_t) = \mu(s_t \mid \theta^\mu) + \epsilon\, \mathcal{N}; \tag{5.2}$$

where ε → 0 with a decay rate of 0.995.
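The exploration noise of (4.12) and (5.2), with the parameters of Table 5.1 and the decaying scale ε, could be implemented as in the sketch below; the Euler discretization of the process and the placeholder actor output are implementation assumptions.

```python
import numpy as np

class OUNoise:
    """Euler discretization of (4.12): dx = theta*(mu - x)*dt + sigma*dW."""
    def __init__(self, mu, theta=1.0, sigma=0.2, dt=1.0):
        self.mu = np.asarray(mu, dtype=float)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

noise = OUNoise(mu=[0.5, 0.0])                        # mu for linear and angular action (Table 5.1)
epsilon = 1.0
predicted_action = np.array([0.05, 0.0])              # placeholder actor output, not from the thesis
action = predicted_action + epsilon * noise.sample()  # eq. (5.2)
epsilon *= 0.995                                      # decay towards exploitation
```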


5.2.2 Reward Function

The reward function used by the critic to learn the value function Q(s, a) is:

$$r(s_t, a_t) = \begin{cases} 100, & \text{if the target was reached} \\ -100, & \text{if a collision occurs} \\ -\dfrac{d_t}{n^2}, & \text{otherwise} \end{cases} \tag{5.3}$$

where d_t is the normalized distance of the target from the vehicle and n is the smallest value of the sensor data from the laser (or camera). The idea behind this reward shape is to encourage the network to follow the shortest path. Since at each step the reward function returns a negative value, the agent has to accomplish the task as fast as it can in order to maximize the Q value.

The distance term d_t ensures that the closer the vehicle is to the target, the smaller the weight of this penalty. The denominator relates the reward to the distance from obstacles or walls: the agent has to maximize the distance to the nearest obstacle in order to maximize the whole step reward.
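A sketch of this step reward in code is given below; the thresholds for reaching the target and for collision follow Section 5.2.3 (1 m), while the exact shape of the nonterminal term is reconstructed from the garbled source and should be treated as an assumption.

```python
# Minimal sketch of the step reward (5.3).
def step_reward(d_t, n, reached, collided):
    """d_t: normalized target distance; n: smallest range reading (laser or depth)."""
    if reached:
        return 100.0
    if collided:
        return -100.0
    # nonterminal penalty: shrinks as the target gets closer and obstacles get farther
    return -d_t / (n ** 2)
```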

5.2.3 2D Virtual Environment

To train the model, a 2D virtual environment has been built. It consists of a 20×20 m² area with multiple obstacles. The vehicle is represented as a 2×1 m² box, with the sensor beams starting from its centre in the case of the laser, or from a forward position in the case of the depth camera. An obstacle is a 2×1 m² red box with two possible orientations. The target position is identified by a 1×1 m² green box. The vehicle start position, the target position and the obstacle positions and orientations are generated randomly whenever the vehicle collides with an obstacle or a wall. If the agent reaches the target, only the position of the target is generated randomly again over the whole area, guaranteed to be collision-free with the other obstacles. The agent wins the episode only when its distance from the target is less than 1 metre. Moreover, a collision with obstacles or walls occurs when the smallest laser (or depth) measure is less than 1 metre. Thus, the collision check is based on the sensors, not on the pixel collision of the agent and obstacle images. This aims to make the training results safer for real deployment, avoiding damage to the vehicle.

Figure 5.2: 2D virtual environment. (a) shows the environment for training with the laser, while (b) shows the environment for training with the camera

The model of the environment is learned from scratch, so knowledge of the environment map, including the obstacles, is not required. The highly random generation of all the components guarantees that most of the environment is explored and that a better learning convergence is achieved.

5.2.4 Agent Structure

Before describing the network layouts in detail, it is useful to take a look at the frameworks used to design them. A powerful tool is Keras, an open-source Python library built specifically for neural network design. Keras provides several functions that make building and training neural networks very easy. It is based on TensorFlow, another open-source library provided by Google to handle more general machine learning problems.

These tools allow not only the structure, number of units and activation function of each layer to be chosen, but also give the possibility to go deeper and manage aspects of the training process, including the computation of the loss function and of the gradients.

As said in (5.1), all the state components are fed into the agent's critic and actor networks. The actor takes the laser (or camera) data and the polar coordinates of the target with respect to the vehicle as three different input layers. The target distance and the heading error each enter a first fully-connected layer with 512 nodes; the resulting vectors are then transferred to a batch-normalization layer, which adjusts the input to have zero mean and unit variance, before an activation layer that applies the relu function. After this processing, both the target distance state and the heading error state go through another fully-connected layer with 100 nodes. These two output vectors are merged with the output of the laser data pipeline, which has the same structure defined above, but repeated more times.


Figure 5.3: Network layout of DDPG Actor


The input vector coming from the laser (or depth) data goes through 5 fully-connected layers of 512 nodes, each of which has its own BN layer and relu activation layer. As for the other state components, this vector is then transferred to a last fully-connected layer of 100 nodes before the merging layer. The merging layer takes the three output vectors of the first stage and returns one vector whose length is the sum of their lengths. As in the first stage, the merged vector enters another fully-connected layer with 512 nodes, then a BN layer and a relu activation layer. At this point, the same vector goes towards two different layers in order to compute the action used for the linear velocity and the action for the angular velocity. The choice of splitting the pipeline, instead of using a single 2-dimensional action vector, aims to provide two different output ranges, one for the linear velocity and one for the angular velocity; as said before, they can assume values in (0, 1) and (−1, 1) respectively. Thus, the vector coming from the last activation layer goes towards one fully-connected layer with 1 node per action, followed by an activation layer. The activation function for the linear velocity is a sigmoid, while for the angular velocity it is a tanh. These two functions guarantee the desired range for each action. Only at the end of the network architecture are the two actions combined into one 2-dimensional vector, which is the output of the actor prediction during training and inference.
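A Keras sketch of the actor layout just described (Figure 5.3) is given below. The layer ordering and sizes follow the text; initializers, the optimizer and the rest of the training wiring are omitted, and the exact thesis implementation may differ.

```python
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Activation, Concatenate
from tensorflow.keras.models import Model

def fc_bn_relu(x, units):
    x = Dense(units)(x)
    x = BatchNormalization()(x)        # zero mean, unit variance before the activation
    return Activation("relu")(x)

laser_in = Input(shape=(16,), name="laser")       # 16 range measures
dist_in = Input(shape=(1,), name="target_dist")   # reciprocal target distance
head_in = Input(shape=(1,), name="heading_err")   # heading error

# target distance and heading branches: Dense(512)+BN+ReLU, then Dense(100)
dist = Dense(100)(fc_bn_relu(dist_in, 512))
head = Dense(100)(fc_bn_relu(head_in, 512))

# laser branch: five Dense(512)+BN+ReLU blocks, then Dense(100)
laser = laser_in
for _ in range(5):
    laser = fc_bn_relu(laser, 512)
laser = Dense(100)(laser)

merged = Concatenate()([laser, dist, head])
x = fc_bn_relu(merged, 512)

linear_v = Dense(1, activation="sigmoid")(x)   # linear velocity in (0, 1)
angular_v = Dense(1, activation="tanh")(x)     # angular velocity in (-1, 1)
action = Concatenate()([linear_v, angular_v])

actor = Model(inputs=[laser_in, dist_in, head_in], outputs=action)
```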

The architecture of the critic network is very similar to that of the actor. All the sensor data are processed by the same pipeline that ends with the merge layer. The main difference in this part of the network is the presence of an action input layer. As said in the previous chapter, the critic learns the Q(s, a) function, which depends on the current state and the current action. Thus, the critic takes as input the whole current state and the current output of the actor, that is, the current action.


Figure 5.4: Network layout of DDPG Critic

The action vector goes through the same three layers as the target distance and heading error: it is fed into a fully-connected layer of 512 nodes, then into a BN layer and finally into a relu activation layer, before entering another fully-connected layer of 100 nodes. As in the actor network, all these vectors are merged and then transferred in
