Memory Augmented Neural Networks in Reinforcement Learning

Federico Rossetto - fedingo@gmail.com

Supervisors:

Dr. Davide Bacciu

Dr. Andrew Melnik

September 2019

Abstract

In this work we will explore the applicability of Memory Augmented Neural Networks in Reinforcement Learning scenarios. We will first introduce two tasks, inspired by the work of Oh et al. [2016], that aim at testing an agent's ability to learn to store information in memory. The I-Maze and the Pattern Matching Maze are two partially observable tasks in which the agent needs to memorize a piece of information at the beginning of an episode in order to solve the task at the end. We will also introduce a new architecture, called the Memory Addressing Network, that provides a simple implementation of an external memory in a neural network. We will explore several possible configurations of this model and compare it with Recurrent Neural Networks and Neural Turing Machines on their ability to solve the proposed tasks.


Acknowledgements

I would like to thank Dr. Andrew Melnik and Dr. Davide Bacciu, my Thesis supervisors, for their constant support, their patience and their help throughout this work.

I would also like to thank my Erasmus colleagues for an enlightening and enriching last year, and to extend my thanks to my colleagues from Pisa for the many challenging and interesting discussions that we had.

Finally, I would like to thank my family, my friends and my girlfriend for their patience, their support and the jokes over the years.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Thesis Contribution
  1.4 Thesis Structure
2 Background
  2.1 Reinforcement Learning
    2.1.1 General Setup
    2.1.2 Q-Learning
    2.1.3 Policy Gradient
    2.1.4 Actor Critic
    2.1.5 Asynchronous Advantage Actor Critic (A3C)
  2.2 Partially Observable Markov Decision Processes
3 Memory Augmented Networks for Reinforcement Learning
  3.1 Recurrent Neural Networks (RNN)
  3.2 Related Works on MANNs
    3.2.1 Neural Turing Machines
    3.2.2 Memory Q Networks
  3.3 Memory Addressing Network
    3.3.1 Writing/Reading Memory
    3.3.2 Addressing techniques
    3.3.3 Memory initialization
    3.3.4 Other model features
    3.3.5 General Data-flow
4 Experimental Setup
5 Results
  5.1 Evaluation Metrics
  5.2 Experiments using I-Maze
    5.2.1 Reading Experiments
    5.2.2 Memory dimensions
    5.2.3 Writing Experiments
    5.2.4 Normalized Replace
    5.2.5 Memory initialization
    5.2.6 Summary
  5.3 Experiments using Pattern-Matching Maze
    5.3.1 Memory Feedback
    5.3.2 Read/Write heads
  5.4 Comparison with NTMs and LSTMs

1 Introduction

Reinforcement Learning has been a steadily growing field of Machine Learning in the last few years. It has shown great promise in its ability to learn without strong supervision. As an example of the work that has been done, consider Vinyals et al. [2019], who trained an agent to play the strategy game StarCraft II and were able to defeat one of the world champions. The game requires planning actions and managing an economic system, and both are very challenging for a Reinforcement Learning agent.

Another growing trend, although it still lacks a widespread practical application, is the field of Memory Augmented Neural Networks. The work from Graves et al. [2014] and Graves et al. [2016] laid down the foundations for these architectures, showing how they were able to learn to perform some algorithmic tasks.

The work from Oh et al. [2016] showed an interesting overlap of these two fields. They developed a model, derived from the Neural Turing Machines of Graves et al. [2014], and applied it to Reinforcement Learning tasks. The tasks that they proposed were maze navigation tasks, where the agent had to memorize some information at the beginning of the episode to make a decision in the final part. These environments were developed using the Minecraft framework.

One big drawback of this work is that it does not fully explore the capabilities of a Memory Augmented Model, but rather just tries to learn to extract information from a sequence of memories. A clear weakness is that this model requires a memory as big as its experience buffer to be able to operate effectively.

1.1 Motivation

Our goal in this work is to explore more thoroughly how Memory Augmented Neural Networks can perform in Reinforcement Learning tasks. To do so, we propose a new flexible architecture that allows us to experiment with different choices for the components of the model.

We will experiment with these different configurations to find the best performing one, and then compare it with more popular approaches from the literature, such as the Long Short Term Memory of Hochreiter and Schmidhuber [1997] and the Neural Turing Machines of Graves et al. [2014].

1.2 Research Questions

In this work we aim at answering the question: are Memory Augmented Neural Networks applicable in a Reinforcement Learning environment using the currently available algorithms?

To answer this question, we create a flexible architecture that aims at representing, in the most generic way possible, the class of models that we want to test. In this way we can experiment with many variations, and show how each may be more or less suited for the task.

We also define some specific tasks to test our models with, which force the agent to actively use its memory in order to complete the task.

1.3 Thesis Contribution

This work proposes a new model called the Memory Addressing Network. We will show how this model is able to outperform the current state-of-the-art approaches in the tested tasks. This is probably due to the nature of the tasks, and to the fact that the model has an easier time identifying temporal dependencies than more common architectures do.

1.4 Thesis Structure

This work is organized in five sections. In the beginning we will discuss some background knowledge on Reinforcement Learning, summarizing some common approaches used in the literature. After that we will discuss the topic of Memory Augmented Neural Networks: we will first summarize the works of Graves et al. [2014] and Oh et al. [2016], and then introduce our proposed model.

In the following section we will discuss the experimental setup that we used to test the models' capabilities. Then we will report our results, exploring our model architecture and comparing it with other models from the literature. In the final section we will summarize the contribution of this work and draw some conclusions.

2 Background

In this section we will cover some background knowledge regarding Reinforcement Learning and the definition of Partially Observable Markov Decision Processes.

2.1 Reinforcement Learning

We are now going to summarize the most common approaches used in the field of Reinforcement Learning (from now on called RL). We will start by giving a formal definition of the problem, and then review the most common algorithms in use.

2.1.1 General Setup

The formal definition of the problem that we aim to solve with RL is called "Markov Decision Process" (from now on called MDP). It is defined by the tuple $(S, A, P(\cdot), R(\cdot))$, where:

• $S$ is the finite set of states;

• $A$ is the finite set of actions;

• $P(s, a, s')$ is a transition function, defining the probability $P(S_{t+1} = s' \mid A_t = a, S_t = s)$;

• $R(s, a)$ is a reward function, giving the immediate reward obtained by performing action $a$ while in state $s$.

Based on this formal definition of the problem, we now define the Gain, the Policy, the State-Value function and the Action-Value function.

Gain

The Gain is a function that can be defined as "the expected, discounted sum of future rewards that we are going to obtain". More formally, if we denote by $R_t$ the reward obtained at a certain time $t$, we can define the Gain as:

$$G_t = \sum_{k=1}^{\infty} \gamma^{k-1} \cdot R_k \qquad (1)$$

where $\gamma$ is called the discount rate, and controls the temporal dynamics of the reward system.
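As a concrete illustration, the following minimal Python sketch computes a truncated version of the Gain of equation (1) for a finite list of rewards; the reward values used here are hypothetical.

```python
def discounted_gain(rewards, gamma=0.99):
    """Truncated version of equation (1): G = sum_k gamma^(k-1) * R_k."""
    gain = 0.0
    for k, reward in enumerate(rewards, start=1):
        gain += gamma ** (k - 1) * reward
    return gain

# Hypothetical reward sequence collected during one episode
print(discounted_gain([-0.04, -0.04, -0.04, 2.0], gamma=0.99))
```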


Policy

A Policy is a probability distribution over the actions, given the state that we are in. We can define it formally as:

$$\pi(a \mid s) = P(A_t = a \mid S_t = s) \qquad (2)$$

The Policy is used to define the behavior of an entity, usually called the Agent, within an MDP.

State-Value Function

A State-Value function can be defined as the expected Gain of following a policy $\pi$, starting in the state $s$. Formally:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \qquad (3)$$

This function describes how rewarding a Policy is, starting from a state $s$.

Action-Value Function

Similarly to before, the Action-Value function defines the expected Gain of performing a specific action and then following a policy $\pi$. Formally:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid A_t = a, S_t = s] \qquad (4)$$

This function is generally called the Q-Value function, and it also gives its name to the Q-Learning algorithm.

Optimal Policy

We also want to give a definition of what we consider an Optimal Policy. We can define it as a Policy $\pi^*$ that gives the maximum Gain over all the possible states of the environment:

$$\forall \pi, \forall s \in S: \quad V_{\pi^*}(s) \ge V_\pi(s) \qquad (5)$$

We consider an RL problem solved when we have found the Optimal Policy.

2.1.2 Q-Learning

Q-Learning is one of the first RL algorithms that aims at approximating an Optimal Policy.

To explain it, we will first introduce the Bellman Optimality Equations, which are the foundation of this algorithm.

Bellman Optimality Equations

The goal of these equations is to formalize the recursive property of the Q-Value and State-Value functions, and their relation when we are dealing with the Optimal Policy. We define them as:

$$V_{\pi^*}(s) = \max_a \left[ R(s, a) + \gamma \cdot \sum_{s'} P(s, a, s') \cdot V_{\pi^*}(s') \right]$$
$$Q_{\pi^*}(s, a) = R(s, a) + \gamma \cdot \sum_{s'} P(s, a, s') \cdot \max_{a'} Q_{\pi^*}(s', a') \qquad (6)$$

The main idea here is that we define the current value of the function based on the current reward and the value of the function for the next state. We will see its application in the Q-Learning algorithm.

Q-Learning Algorithm

The Q-Learning algorithm is a practical implementation of the Bellman Equations (6). The idea is to learn the Optimal Q-Value function, and from that derive the Optimal Policy. As a matter of fact, we know that this relation holds:

$$\pi^*(s) = \arg\max_a Q_{\pi^*}(s, a) \qquad (7)$$

Hence, if we know the Optimal Q-Value function, we also know the Policy itself.

This algorithm is also called off-policy. This means that it learns by looking at actions performed by a different policy, usually called an exploration policy. It is also said to be model-free, because it does not require a model of the environment itself.

The idea is that we want to move our estimation towards the expected value, according to this relationship:

$$Q_{new}(s_t, a_t) = (1 - \alpha) \cdot Q_{old}(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_a Q_{old}(s_{t+1}, a) \right) \qquad (8)$$

where $(s_t, a_t, r_t, s_{t+1})$ is a tuple sampled from the experience, and $\alpha$ is a parameter called the learning rate. It is also worth mentioning that if $s_{t+1}$ is a final state, we take $Q(s_{t+1}, a) = 0, \forall a$.

To use this algorithm, we need to use a Function Approximator. Early approaches used a Tabular Function, but nowadays neural networks are a more common approach.
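As a minimal sketch, the tabular update of equation (8) can be written as follows; the state/action space sizes and the sampled transition are hypothetical, and a real agent would also need an exploration policy.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # tabular function approximator
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next, done):
    """One Q-Learning step following equation (8)."""
    # A terminal next state contributes 0 to the bootstrap target, as noted above.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Hypothetical transition sampled from the experience
q_learning_update(s=0, a=1, r=-0.04, s_next=3, done=False)
print(Q[0, 1])
```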

Deep Q Networks

The work from Mnih et al. [2015] showed a first approach to Q-Learning using deep neural networks. It showed that an algorithm as simple as Q-Learning could learn many of the original Atari games, achieving super-human performance in some of them.

One of the most impressive aspects of their approach was that it was based on the raw image of the game. This meant that the agent had to learn to identify objects and to act on them.

However, even if this algorithm has shown to be very powerful and effective at solving many problems, it has the downside of being very unstable. To mitigate this problem, several improvements to the algorithm have been developed.


We are now going to discuss the Double Q Networks from van Hasselt et al. [2015] and the Dueling Q Networks from Wang et al. [2015].

Double Q Networks [van Hasselt et al., 2015]

One of the reasons why Q-Learning is so unstable is the fact that we are constantly moving our estimate of the Q-function. A simple solution that is usually implemented is to use an older copy of the model to compute the new value of the function. This older model is updated more slowly, making the process slower but more stable. This means that equation (8) becomes:

$$q_{target} = r_t + \gamma \cdot \max_a Q_{frozen}(s_{t+1}, a)$$
$$Q_{new}(s_t, a_t) = (1 - \alpha) \cdot Q_{old}(s_t, a_t) + \alpha \cdot q_{target}$$

In this equation we have also split out the target value $q_{target}$ that we want to move towards. This is not yet Double Q-Learning, but it is a first improvement over the algorithm baseline.

The Double Q-Learning algorithm extends the idea of the frozen model. In noisy environments, $\max_a Q_{frozen}(s_{t+1}, a)$ is inclined to overestimate the action-values. To obtain a better estimate, we use $Q_{frozen}$ only for the value estimation, and not for the action choice itself. Formally, we redefine $q_{target}$ as:

$$q_{target} = r_t + \gamma \cdot Q_{frozen}\left( s_{t+1}, \arg\max_a Q(s_{t+1}, a) \right) \qquad (9)$$

From this equation we can see that we take the maximizing action from the current instance of the model, but we use the value estimate of the "frozen" instance.

This approach has been shown by van Hasselt et al. [2015] to outperform regular Q Networks. In their paper they show how regular Q-Learning has a tendency to overestimate the Q-Values, and that Double Q-Learning is able to mitigate this problem.
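A minimal NumPy sketch of the target computation in equation (9), assuming the Q-values of the next states have already been computed with the online and the frozen networks for a batch of sampled transitions; all names and values are illustrative.

```python
import numpy as np

def double_q_targets(rewards, dones, q_next_online, q_next_frozen, gamma=0.99):
    """Double Q-Learning targets following equation (9).

    q_next_online: Q-values of s_{t+1} from the current network, shape (batch, n_actions)
    q_next_frozen: Q-values of s_{t+1} from the frozen copy, same shape
    """
    best_actions = np.argmax(q_next_online, axis=1)        # action choice: online network
    frozen_values = q_next_frozen[np.arange(len(best_actions)), best_actions]  # value: frozen network
    return rewards + gamma * (1.0 - dones) * frozen_values

# Hypothetical batch of two transitions with three actions
r, d = np.array([1.0, -0.04]), np.array([1.0, 0.0])
q_online = np.array([[0.1, 0.5, 0.2], [0.3, 0.1, 0.4]])
q_frozen = np.array([[0.2, 0.4, 0.1], [0.2, 0.2, 0.5]])
print(double_q_targets(r, d, q_online, q_frozen))
```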

Dueling Q Networks [Wang et al., 2015]

The idea of the Dueling architecture is to split the final part of the network between the advantage and the state-value, and then combine the two values into Q-values. The goal is better sample efficiency: during standard Q-Learning, when we use a sample, we only move the estimate of the (state, action) pair that we are considering. Instead, we want to define an architecture that benefits from each of our predictions more efficiently. The architecture is shown in Figure 1.

The main idea is to split the definition of the Q-Values. To better explain this, we will first introduce the concept of Advantage.

Figure 1: Dueling DQN Architecture

Advantage

We define the Advantage as a function $A_\pi(s, a)$ that is the difference between the Q-Value function and the State-Value function. Formally:

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s) \qquad (10)$$

This function is useful because it decouples the State-Value from the Q-Value. It is interesting when applied in reverse to estimate the Q-Value function:

$$Q_\pi(s, a) = V_\pi(s) + A_\pi(s, a)$$

This is the idea applied in the Dueling DQN by Wang et al. [2015]. We have two streams in the architecture, as we can see from Figure 1. We then combine these values following the Advantage definition in equation (10) to obtain the Q-Values.

But this sum has a downside. By summing the two estimates of the State-Value and the Advantage, we lose the semantics of these two functions, as stated by Wang et al. [2015]. This is due to the fact that we could have a canceling interference of constant values between the $V(s)$ and $A(s, a)$ functions. To remove this problem, we should formally consider:

$$Q_\pi(s, a) = V_\pi(s) + \left( A_\pi(s, a) - \max_{a'} A_\pi(s, a') \right) \qquad (11)$$

This equation comes from the relation $Q_\pi(s, a) = V_\pi(s)$ if $a = \pi(s)$, which uses the formal definition of the $Q(\cdot)$ function. This also means that, if $\pi^*$ is the optimal policy, then $\forall s, \max_a A_{\pi^*}(s, a) = 0$.

Equation (11) is formally correct, but lacks stability from an optimization point of view. The authors instead proposed an approximation of equation (11) that is much more stable, replacing the maximum with the mean of the advantages:

$$Q_\pi(s, a) = V_\pi(s) + \left( A_\pi(s, a) - \frac{1}{|A|} \sum_{a'} A_\pi(s, a') \right) \qquad (12)$$

This equation loses the strict formal definition, but increases the stability of the model.

So, following equation (12), we can combine the two streams representing the State-Value and the Advantages without losing their semantic meaning.

The reason behind this architecture is that it drastically increases the sample efficiency. Each sample now produces a gradient on both the Advantage and the State-Value. This implies that each time we move the Advantage associated with the (s, a) pair that we are considering, we also move our current estimate of the State-Value, which affects the computation of all future Q-values associated with the state $s$ that we are considering.

As shown by Wang et al. [2015], this optimization of the architecture greatly improves the performance of the model in many RL tasks.
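A minimal sketch of the aggregation in equation (12), combining hypothetical value and advantage streams into Q-values; in a real Dueling DQN the two streams would be produced by the network heads.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine the two streams as in equation (12):
    Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    return state_value + (advantages - advantages.mean(axis=1, keepdims=True))

# Hypothetical outputs of the two streams for a batch of 2 states and 4 actions
V = np.array([[0.5], [1.2]])                       # shape (batch, 1)
A = np.array([[0.1, -0.2, 0.3, 0.0],
              [0.0,  0.4, -0.1, 0.2]])             # shape (batch, n_actions)
print(dueling_q_values(V, A))
```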

In the Results section we will report the results of our experiments, which were performed using a Double Dueling DQN architecture. The choice of this algorithm has been driven by the fact that it is a well-established baseline, and we did not want to rely on specifically crafted algorithms to test the capabilities of our models at performing the tasks.

In the remainder of this subsection we give a brief recap of other popular Reinforcement Learning algorithms in use today.

2.1.3 Policy Gradient

The Policy Gradient algorithm introduced by Gullapalli [1990] is based on the Policy Gradient Theorem. The idea is built on how we evaluate a policy, specifically using the State-Value function to evaluate it. This means that an optimal policy is defined as:

$$\pi^* = \arg\max_\pi \mathbb{E}_{s_0}\left[ V_\pi(s_0) \right] \qquad (13)$$

This means that we can use the State-Value function as the loss of our model, and train to maximize it. This method then maximizes the expected gain for every possible starting state that we consider.

To maximize this, we need to use an optimization method like Gradient Descent, and to apply it we need to be able to compute the gradient of this function. This is where the Policy Gradient Theorem comes in.

Policy Gradient Theorem

Let us consider $L(\theta)$ to be the loss of our model, where $\theta$ is the set of model parameters. Let the loss be defined as the State-Value function of our RL environment:

$$L(\theta) = \mathbb{E}_{s_0}\left[ V_\pi(s_0) \right]$$

Then it can be inferred that the gradient of this loss function is the following:

$$\nabla_\theta L(\theta) = \mathbb{E}_\pi\left[ Q_\pi(s, a) \nabla_\theta \log \pi_\theta(s, a) \right] \qquad (14)$$

where $a, s \sim \pi_\theta$, meaning that $a$ and $s$ are taken from a path obtained by following the policy $\pi_\theta$.

Using this theorem we are able to build a loss for our policy network that allows us to maximize the expectation of the State-Value function. Our loss will be:

$$L(\theta) = \mathbb{E}_\pi\left[ G_t \log \pi(s_t, a_t) \right] \qquad (15)$$

As we can notice, we have substituted the Action-Value function with the Gain. This has been done because of the relationship $Q(s_t, a_t) = \mathbb{E}_\pi[G_t]$. We also recall that $s_t, a_t \sim \pi$ is a path obtained by following the policy $\pi$, and $G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted gain obtained from time $t$ onward.

The algorithm that uses this loss is called Monte-Carlo Policy Gradient, since it requires sampling some paths from the current policy and then optimizing the policy.
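A minimal sketch of the Monte-Carlo Policy Gradient loss of equation (15) for a single sampled episode; the per-step gains are computed backwards, and the action probabilities are hypothetical outputs of the policy network.

```python
import numpy as np

def discounted_gains(rewards, gamma=0.99):
    """Per-step gains G_t = sum_i gamma^i * r_{t+i}, computed backwards."""
    gains, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        gains[t] = running
    return gains

def reinforce_loss(chosen_action_probs, gains):
    """Surrogate loss of equation (15); minimizing the negative maximizes the expected gain."""
    return -np.mean(gains * np.log(chosen_action_probs))

# Hypothetical episode: pi(s_t, a_t) for the actions actually taken, and the rewards received
probs = np.array([0.7, 0.2, 0.9])
rewards = [-0.04, -0.04, 2.0]
print(reinforce_loss(probs, discounted_gains(rewards)))
```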

There are many other variants that derive from the Policy Gradient Theorem. Their goal is to make the optimization more stable without introducing any bias in the algorithm. Two of the most relevant ones are the Actor-Critic from Witten [1977] and the A3C from Mnih et al. [2016]. We will now briefly summarize their main components.

2.1.4 Actor Critic

The Actor-Critic algorithm, originally introduced by Witten [1977], is based on the use of two different models, called, as the name suggests, Actor and Critic. The first one approximates a policy function, whereas the second approximates an Action-Value function.

The idea is to train the two models at a faster rate: with vanilla policy gradient we can only train at the end of each episode, when we can compute the gain of the followed path.

In the case of the Actor-Critic, instead, we can use the Q-Value estimated by the Critic to train the Actor. The Critic, in turn, can be trained with standard Q-Learning as we have seen before.

2.1.5 Asynchronous Advantage Actor Critic (A3C)

In this algorithm the focus is on two main points. First, it is optimized for parallel training, and it is synchronized from time to time. Second, it uses Advantages instead of Q-Values, which have a much lower variance and help stabilize the training.

The work from Mnih et al. [2016] shows that this asynchronous training achieves a quicker and more stable training process, thanks to the multiple parallel actor-learners.


2.2 Partially Observable Markov Decision Processes

A POMDP is a generalization of a Markov Decision Process where the agent cannot directly access the state of the environment. Instead, it receives observations that yield a probability distribution over the state space.

Formally it is defined as the tuple $(S, A, T, R, \Omega, O, \gamma)$, where:

• $S$ is the state set

• $A$ is the action set

• $T$ is the transition function that describes the probabilities $P(s_{t+1} \mid s_t, a_t)$

• $R$ is the reward function that returns a value for each $(s, a)$ pair

• $\Omega$ is the observation set, that contains all the possible observations that the agent can make

• $O$ is a set of conditional probabilities $P(o_t \mid s_t, a_t)$

• $\gamma$ is the discount rate

A POMDP describes a process where the agent does not have perfect information. This means that it has to act under uncertainty. To model this uncertainty, we use an agent belief over the states, meaning that at each timestep we have $b_t$, a probability distribution over $S$.

In general, POMDPs are computationally intractable to solve exactly. To tackle this type of problem we will apply Recurrent Neural Networks and Memory Augmented Networks, to test their ability to solve such problems and to implicitly model the belief state of an agent.

3 Memory Augmented Networks for Reinforcement Learning

Recurrent Neural Networks [Pearlmutter, 1989] have proven very effective at handling stream-like data in recent years. They have been used to model Markov chains to solve problems like economic forecasting [Salzano, 1999].

One big limitation of RNNs is their ability to model long-term dependencies. This has been studied in depth by many authors, such as Hochreiter et al. [2001], Bengio et al. [1994] and Hochreiter [1998]. Due to the gradient vanishing over the time-steps, these models struggle to reliably learn dependencies spanning more than about 10 steps.

To mitigate this problem, some extensions of the "Vanilla RNN" have been developed. Some examples are the "Long Short Term Memory" (LSTM) from Hochreiter and Schmidhuber [1997], or the "Gated Recurrent Units" (GRU) from Cho et al. [2014].

We will first introduce the Recurrent Neural Networks (from now on referred to as RNNs), originally introduced by Pearlmutter [1989], to give an initial background on how this category of models works. We will also briefly discuss LSTMs [Hochreiter and Schmidhuber, 1997], since they are among the most used recurrent architectures nowadays.

Then we will move to the branch of "Memory Augmented Neural Networks" (from now on called MANNs). We will first discuss the Neural Turing Machines by Graves et al. [2014] and the Memory Q Networks by Oh et al. [2016]. At the end of the section we will introduce our approach to the problem, which aims at testing how different addressing techniques affect the ability of the model to learn to solve RL problems.

3.1 Recurrent Neural Networks (RNN)

RNNs have been widely used to model problems with sequences or time-series because of their ability to model sequential dependencies.

We will first introduce a "Vanilla" RNN, originally introduced by Pearlmutter [1989]. In general this model can be defined as:

$$RNN(x_t): \quad h_t = f\left( W \cdot [x_t, h_{t-1}] + b \right) \qquad (16)$$

where $x_t, h_t$ represent the input and output pair of the model, and $f(\cdot)$ is the activation function. The $[\cdot]$ represents the concatenation of vectors.

From Equation (16) we can see that at each iteration we process not only the current input, but also the previous output of the model. This is helpful because it allows the model to maintain a dynamic context that can carry information across the whole sequence that we are processing.

Sadly, this model has some limitations. It has been shown by Hochreiter et al. [2001] and others ([Bengio et al., 1994] and [Hochreiter, 1998]) that RNNs struggle at learning long-term dependencies due to the gradient vanishing through the unrolling.
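As an illustration, here is a minimal NumPy sketch of the recurrence in equation (16), with tanh as the activation $f(\cdot)$ and randomly initialized weights standing in for learned parameters.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One step of a vanilla RNN, equation (16): h_t = f(W [x_t, h_{t-1}] + b)."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]) + b)

input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
b = np.zeros(hidden_dim)

# Unroll over a hypothetical sequence of 5 observations
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h = rnn_step(x, h, W, b)
print(h.shape)   # (16,)
```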


Long Short Term Memory (LSTM)

To try and mitigate this problem, the Long Short Term Memory (from now on called LSTM) was introduced.

This model is a more complex architecture that aims at improving the "mnemonic" capabilities of RNNs. It uses "gates" to define the operations that process the input, control the forgetting and compute the outputs. The formal definition of the LSTM is as follows:

$$i_t = \sigma(x_t \cdot U^i + h_{t-1} \cdot W^i)$$
$$f_t = \sigma(x_t \cdot U^f + h_{t-1} \cdot W^f)$$
$$o_t = \sigma(x_t \cdot U^o + h_{t-1} \cdot W^o)$$
$$\tilde{c}_t = \tanh(x_t \cdot U^g + h_{t-1} \cdot W^g)$$
$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t$$
$$h_t = \tanh(c_t) \cdot o_t$$

where $i, f, o$ indicate the Input, Forget and Output gate operations, and $U, W$ indicate the weight matrices associated with those operations. Also, $c$ and $h$ indicate the long and short term memory vectors.

This model has shown very strong performance across many tasks, and it is nowadays used to tackle many state-of-the-art problems.

3.2 Related Works on MANNs

The idea behind MANNs is to augment the capabilities of Neural Networks with an "external" memory. With RNNs and their variations, we always keep one or two vectors as context. MANNs instead use a set of context vectors, which are not all updated at the same time.

We are now going to see some specific applications of this idea.

3.2.1 Neural Turing Machines

Graves et al. [2014] proposed a model called Neural Turing Machines. Their idea was to create a trainable architecture that could learn to perform some program-like tasks, such as copying, sorting or associative recall.

The architecture contains a Memory matrix, which is accessed at each time step with a reading and a writing operation. We can see a schema of the architecture in Figure 2.

The main goal of this model was to have a completely differentiable architecture, so that it could be trained end-to-end. To achieve this goal, they had to redefine the standard Reading and Writing operations. To better understand how they implemented these operations, we will first define their Addressing operation.

Addressing via weighting

To define differentiable operations for reading and writing, they used a weight vector $w_t$. This vector contains a weight for each location of the Memory matrix $M$, and controls the interaction between the Memory and the current input.

Figure 2: Architecture schema for the Neural Turing Machines

To generate this vector, they implemented four different operations:

• Addressing with a key vector $k_t$, and a parameter for the key strength $\beta_t$:

$$w_t^c(i) = \frac{\exp(\beta_t \cdot k_t M_t(i))}{\sum_j \exp(\beta_t \cdot k_t M_t(j))}$$

• Interpolation with the past $w_{t-1}$, controlled with a parameter $g_t$:

$$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}$$

• Rotation with a convolutional filter $s_t$:

$$w_t^s(i) = \sum_{j=0}^{N-1} w_t^g(j) \cdot s_t(i - j)$$

• Sharpening of the obtained distribution with a coefficient $\gamma_t$:

$$w_t(i) = \frac{w_t^s(i)^{\gamma_t}}{\sum_j w_t^s(j)^{\gamma_t}}$$

The operations described above are the ones that the model performs to obtain the distribution, and each of them is differentiable. The whole process is controlled by the set of parameters $(k_t, \beta_t, g_t, s_t, \gamma_t)$. To generate these parameters, a Controller Network is created that, for each input vector, generates several of these parameter sets.

We now tackle the Writing and Reading operations, which use the generated $w_t$ distribution to update the memory and generate a reading vector.


Writing

The writing operation is split by the authors into two components, an erase and an add operation. Both use the same distribution $w_t$, but rely on two additional vectors $e_t, a_t$ that represent the information that we want to erase from and add to the Memory matrix. Formally, the operation is defined as:

$$\hat{M}_t = M_{t-1} \circ (1 - w_t^\top \cdot e_t)$$
$$M_t = \hat{M}_t + w_t^\top \cdot a_t \qquad (17)$$

It is worth highlighting that the products $(w_t^\top \cdot x)$, where $x$ is a generic vector, generate matrices of the same dimension as $M_t$, so that the $\circ$ element-wise multiplication and the addition can be performed.

Reading

The reading operation is defined in a more straightforward way: the reading vector is generated by the dot product between the $w_t$ distribution and the memory matrix. Formally:

$$r_t = w_t \cdot M_t \qquad (18)$$

All the operations that we have described so far are differentiable, making the whole architecture differentiable and trainable end-to-end.
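To make the read/write cycle concrete, here is a minimal NumPy sketch of equations (17) and (18) for a single step, assuming the weighting $w_t$ and the erase/add vectors have already been produced by a controller; the dimensions are illustrative.

```python
import numpy as np

def ntm_write(M, w, erase, add):
    """Erase-then-add write of equation (17)."""
    M_hat = M * (1.0 - np.outer(w, erase))   # element-wise erase
    return M_hat + np.outer(w, add)          # additive write

def ntm_read(M, w):
    """Weighted read of equation (18): r_t = w_t . M_t."""
    return w @ M

n_locations, vec_dim = 8, 4
M = np.zeros((n_locations, vec_dim))
w = np.zeros(n_locations); w[2] = 1.0        # a sharp, almost one-hot weighting
erase = np.ones(vec_dim)                     # fully erase the addressed location
add = np.array([1.0, 0.5, -0.5, 0.0])        # new content to store

M = ntm_write(M, w, erase, add)
print(ntm_read(M, w))                        # recovers the stored vector
```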

Neural Turing Machines have shown some potential in the context of modeling very long-term dependencies, but they have the drawback of being very unstable as a model. They are very difficult to train and to make converge, even on the toy algorithmic problems that they were built for.

Zaremba and Sutskever [2015] have also proposed the use of RL algorithms to train them to solve the algorithmic tasks. What they concluded from their experiments was that training NTMs on algorithmic problems with an RL approach was feasible only for very simple problems. They failed at solving sorting and reversing tasks, concluding that learning to access memory in complex patterns was too difficult for the NTMs.

Even though their approach failed, they were still trying to learn full algorithms in such a fashion. In general, RL scenarios access the memory more sparsely, and this could make it easier for the model to learn.

We now move our focus to the Memory Q Networks, an approach based on the NTM work that tries to extend its capabilities to a reinforcement learning setup.

3.2.2 Memory Q Networks

Oh et al. [2016] present a new set of challenging tasks for RL models. They specifically want to emphasize issues like the partial observability of the environment, sparse reward functions, high-dimensional observations and the use of active perception. They then propose a new model called the Memory Q Network, derived from the Neural Turing Machines, that aims at solving these tasks with very good generalization capabilities.

Cognition Inspired RL Tasks

The tasks that Oh et al. [2016] introduce are based on the Minecraft platform. Their goal is to test how RL models approximate some basic cognitive capabilities. This, as they state, is a limited exploration of the cognitive faculties that humans have.

They proposed 4 different tasks:

• I-Maze

• Pattern Matching Maze

• Random Maze / Random Maze with Indicator

The idea behind all these tasks is similar: a partially observable environment (due to the first-person view in Minecraft), and the need to memorize information that will not be available later. We are now going to see in detail how they applied this concept in the tasks that they designed.

Figure 3: Top-down view of an I-Maze

I-Maze

This task is an I-shaped maze that has an indicator tile at the top, and two coloured tiles at the bottom. We can see a top-down view of the maze in Figure 3.

The goal of this task is to reach the Blue or the Red tile at the bottom of the maze, according to the color of the indicator at the beginning. If the indicator is yellow, the goal is to go towards the Red tile, which gives a +1 reward, while the Blue one gives -1. If instead the indicator is Green, the goal is to go towards the Blue tile, which gives a +1 reward, while the Red one gives -1.


Pattern Matching Maze

This task is a maze composed of 2 rooms, with the same 2 colored tiles at the bottom as in the I-Maze. As we can see from Figure 4, each room has a pattern of yellow and green tiles.

Figure 4: Top-down view of a Pattern Matching Maze

The goal of the agent is to compare these two patterns, and decide if they are identical or not. If the two rooms are identical, the agent has to go to the Blue tile, otherwise it has to go to the Red tile.

The reward function is structured as for the I-Maze, giving +1 for reaching the correct tile, and -1 for reaching the other one. The generation of the rooms is balanced so that half of the time the rooms are equal and half of the time they are different, in order to have a balanced experience.

Figure 5: Top-Down view of a Random Maze

Random Maze/Random Maze with Indicator

With this type of environment they defined different tasks. Two of them use indicators, and two do not. These tasks are very challenging for a Reinforcement Learning agent, since they also require navigational capabilities. The agents are also tested on unseen layouts of the maze, which makes the task even more difficult.

We will now describe them:

• Single Goal: the goal of this task is to reach the Blue tile (+1), while avoiding the Red one (-1).

• Sequential Goals: the goal in this task is to visit sequentially first the Red tile (+0.5) and then the Blue tile (+1). If the agent visits them in the reverse order, it gets -0.5 and -1 respectively.

• Single Goal with Indicator: this task is similar to the first one, but the agent sees an indicator at the beginning that tells it whether it has to reach the Blue or the Red tile.

• Sequential Goals with Indicator: this task is the union of all the tasks seen so far, where an Indicator tells the agent in which order it should visit the Red/Blue tiles.

We are now going to summarize the architecture that they presented to tackle these problems. Being a Memory Augmented NN, it has some similarities with the Neural Turing Machines. As a point of difference, however, it simplifies the reading and writing operations quite substantially.

Model Architecture

Oh et al. [2016] propose a new model to tackle these tasks, which require a stronger memory and the ability to access information from past experience. To achieve this, they introduce the "Memory Q Network", an architecture that uses an external Memory matrix to compute the Q-Values for its actions.

Figure 6: Memory Q Network Architecture

In Figure 6 we can see the general architecture of the model. A context vector controls the memory interaction, and this context vector can be generated in different ways. To test how the generation of this context influences the capabilities of their model, the authors created three different architectures. The first one uses a feed-forward NN as controller for the memory (Memory Q Network - MQN). The second uses a recurrent NN (Recurrent Memory Q Network - RMQN). In the last one the authors add a feedback connection from the memory, to test how this influences the capabilities of the model (Feedback Recurrent Memory Q Network - FRMQN).

The context in these three architectures is generated following these equations:

$$\text{MQN:} \quad h_t = W^c x_t + b^c$$
$$\text{RMQN:} \quad h_t = W^c [x_t, h_{t-1}] + b^c$$
$$\text{FRMQN:} \quad h_t = W^c [x_t, h_{t-1}, o_{t-1}] + b^c \qquad (19)$$

where $[\cdot]$ indicates the concatenation operation.

The Memory operations are instead organized in a simple Key-Value schema. The writing operation is implemented as a FIFO queue, where we take the last $n$ observations and linearly project them into two matrices $M_t^{key}, M_t^{value}$:

$$E_t = [e_t, \dots, e_{t-n}]$$
$$M_t^{key} = W^{key} E_t$$
$$M_t^{value} = W^{value} E_t \qquad (20)$$

where $W^{key}$ and $W^{value}$ are two learned matrices, and the $e_t$ vectors represent the pre-processed observations of the environment at time $t$.

After generating the Key-Value matrices for the memory, we want to obtain a reading vector $o_t$. To perform this operation, we use the context vector $h_t$ to get an attention distribution from the Key matrix. After that, we use the generated distribution to gather the information that we need from the Value matrix. Formally, the Read operation is defined as follows:

$$p_t = \text{softmax}(h_t M_t^{key})$$
$$o_t = M_t^{value} p_t \qquad (21)$$

This gives us the reading vector $o_t$, which will be used in the end to generate our Q-Values. The authors used a 2-layer Neural Network on the concatenation of the read vector and the context vector:

$$q_t = W_2 \cdot \text{ReLU}(W_1 \cdot [o_t, h_t]) \qquad (22)$$

where again $[\cdot]$ denotes the concatenation operation.
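A minimal NumPy sketch of the key-value read of equations (20)-(22); the projection matrices, the stored observation embeddings and the context vector are random placeholders for quantities that would normally be learned or computed by the controller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mem, e_dim, m_dim, n_actions = 6, 10, 8, 4

E = rng.normal(size=(e_dim, n_mem))                     # last n pre-processed observations e_t
W_key, W_val = rng.normal(size=(m_dim, e_dim)), rng.normal(size=(m_dim, e_dim))
M_key, M_val = W_key @ E, W_val @ E                     # equation (20)

def mqn_read(h):
    """Attention read of equation (21): p_t = softmax(h_t M_key), o_t = M_val p_t."""
    logits = h @ M_key
    p = np.exp(logits - logits.max()); p /= p.sum()
    return M_val @ p

h = rng.normal(size=m_dim)                              # context vector from the controller
o = mqn_read(h)
W1, W2 = rng.normal(size=(32, 2 * m_dim)), rng.normal(size=(n_actions, 32))
q = W2 @ np.maximum(0.0, W1 @ np.concatenate([o, h]))   # equation (22)
print(q.shape)                                          # (n_actions,)
```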

The authors show that these models outperform the more standard Q-Learning approaches, being able to generalize and solve unseen problems with much better results. The Feedback Recurrent Memory Q Network is the one that performs best across all the tasks that they introduced. Oh et al. [2016] hypothesize that this increase in performance is due to the cycling of memory and context vectors, which allows the model to "ponder" a decision across multiple time-steps.

In the next sections we introduce our implementation of a Memory Augmented Network. Our goal is to create a very simple model that can test how RL algorithms can solve this problem, and which addressing technique performs best at solving tasks derived from the ones presented by Oh et al. [2016].

3.3 Memory Addressing Network

One of the biggest simplifications that Oh et al. [2016] make is in how they perform the Writing operation. They use a fixed-size memory with a FIFO queue. This clearly works in their task scenario, since they fix the memory size to exactly the maximum number of steps in the environment.

From a more general perspective, however, we would like a writing operation that keeps the relevant information and discards all the non-relevant one. To achieve this, we need an operation similar to the one used in the Neural Turing Machines [Graves et al., 2014].

3.3.1 Writing/Reading Memory

We decided to implement two different approaches to write the memory. The first one, called replace, mimics the operation of replacing information inside the memory. The second, called overwrite, does a similar job. The biggest difference is that the first removes some information from the memory, whereas the second just adds new information to the memory each time. Formally, the two operations are defined as follows:

$$\text{Overwrite}(M_t, a_t, x_t) = M_t + a_t^\top \cdot x_t$$
$$\text{Replace}(M_t, a_t, x_t) = M_t \circ (1 - a_t) + a_t^\top \cdot x_t \qquad (23)$$

where $a_t$ is an attention distribution and $x_t$ is a new information vector that we want to add to our memory. We will explain later how we generate the attention distribution.
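A minimal NumPy sketch of the two writing operations of equation (23), using a hypothetical sharp attention distribution and information vector.

```python
import numpy as np

def overwrite(M, a, x):
    """Overwrite(M_t, a_t, x_t) = M_t + a_t^T x_t, equation (23)."""
    return M + np.outer(a, x)

def replace(M, a, x):
    """Replace(M_t, a_t, x_t) = M_t o (1 - a_t) + a_t^T x_t, equation (23)."""
    return M * (1.0 - a)[:, None] + np.outer(a, x)

n_loc, n_dim = 12, 24
M = np.zeros((n_loc, n_dim))
a = np.zeros(n_loc); a[3] = 1.0       # sharp attention on a single location
x = np.ones(n_dim)                    # hypothetical new information vector

M = replace(M, a, x)
print(M[3, :3], M[0, :3])             # written location vs untouched location
```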

There is however a limitation that the Overwrite operation can have: if we try to write information in the same cell multiple times, we will have catastrophic interference. To address this issue, we will also try to use a per-vector normalization.

The reading operation is instead the same as in the NTMs, where we multiply the attention distribution over the current memory:

$$\text{READ}(M_t, a_t) = a_t \cdot M_t \qquad (24)$$

We are now going to show how we have tackled the attention generation problem, which can be seen as a soft addressing of the memory.

3.3.2 Addressing techniques

To tackle this problem we have implemented four different methods to compute an attention distribution. We wanted to test how a single Dense layer could perform in addressing the memory. We have used a straight attention generation and a key-based one. We have also tested with implicit attention, based on the cosine similarity between memory and the new data.

Finally, we have tried to generate an attention based on a Gaussian distribution. This function generates a single value $\mu$ that is used as the mean of the Gaussian attention over the memory cells.


$$\text{STRAIGHT}(M_t, x_t) = \text{softmax}(W x_t + b)$$
$$\text{BY KEY}(M_t, x_t) = \text{softmax}\left( M_t \cdot f(W x_t + b) \right)$$
$$\text{IMPLICIT}(M_t, x_t) = \text{softmax}(x_t M_t)$$
$$\text{INDEXING}(M_t, x_t) = \text{softmax}\left( \text{Gaussian}\left( f(W x_t + b), \sigma \right) \right) \qquad (25)$$

where Gaussian indicates a discrete Gaussian distribution function, and $\sigma$ is a hyper-parameter. $x_t$ represents the current input vector to the network, and $M_t$ is the current state of the Memory matrix. Finally, $W$ represents a learnable weight matrix. We have chosen to always use the softmax operation because a desirable property of our distribution is to be quite sharp, allowing the model to write to and read from the memory clearly.
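A NumPy sketch of the four addressing functions of equation (25). The projection $f(Wx_t + b)$ is represented by a random linear layer with a tanh non-linearity, the input is assumed to already have the memory width, and the way the indexing mean $\mu$ is mapped onto cell indices is an assumption of this sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_loc, n_dim, sigma = 12, 24, 1.0
in_dim = n_dim                        # input assumed already projected to the memory width
rng = np.random.default_rng(0)
W_s, b_s = rng.normal(scale=0.1, size=(n_loc, in_dim)), np.zeros(n_loc)   # straight head
W_k, b_k = rng.normal(scale=0.1, size=(n_dim, in_dim)), np.zeros(n_dim)   # by-key head
W_i, b_i = rng.normal(scale=0.1, size=(1, in_dim)), np.zeros(1)           # indexing head

def straight(M, x):
    return softmax(W_s @ x + b_s)

def by_key(M, x):
    return softmax(M @ np.tanh(W_k @ x + b_k))

def implicit(M, x):
    return softmax(M @ x)             # the input itself plays the role of the key

def indexing(M, x):
    # f(Wx + b) squashed to a location mean mu; this mapping onto cells is an assumption
    mu = (n_loc - 1) * 0.5 * (np.tanh(W_i @ x + b_i)[0] + 1.0)
    cells = np.arange(n_loc)
    return softmax(-0.5 * ((cells - mu) / sigma) ** 2)   # Gaussian-shaped attention

M, x = rng.normal(size=(n_loc, n_dim)), rng.normal(size=in_dim)
for head in (straight, by_key, implicit, indexing):
    print(head.__name__, head(M, x).round(2))
```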

These addressing methods have different properties. First, the straight and indexing ones are location-based, whereas the implicit and by key are instead content-based.

We also expect the indexing addressing to be a more regularized version of the straight addressing, one that incentivizes more contiguous access to the memory, whereas the straight one will be more spread out.

The content-based methods are similar to each other, but the addressing by key uses a neural projection, whereas the implicit one uses the input vector itself. This means that the latter relies on the actual content of the Memory matrix to "self-address" itself, whereas the by key addressing can learn to generate an appropriate key to extract the desired information.

As we can see, in this attention generation process there are fewer parameters that the model uses to control the generation of the attention. This should help the model learn to control these operations more quickly.

Finally, we mention the Memory Initialization. Since we also use key-based addressing, the initial state $M_0$ is highly relevant for the architecture.

3.3.3 Memory initialization

To initialize the memory, we have experimented with three different techniques. Our goal was to find the best performing one, and at the same time verify the effect of each one compared to the others.

The first one that we applied is a constant initialization to zero. This has the good property of not introducing any interference, but cannot be applied with content-based addressing like the key-based one.

The second one uses an Identity matrix to initialize the memory. This has the requirement that $n_{loc} \le n_{dim}$, where the memory is an $n_{loc} \times n_{dim}$ matrix. This is because, if instead $n_{loc} \gg n_{dim}$, the initialization would degenerate, since it would not be possible to assign a distinct non-zero row to every memory location.

The third technique is a small random initialization. This is a good option when we can use neither of the previous two, but has the downside of introducing a small amount of noise that is more difficult for the network to learn to ignore.

3.3.4 Other model features

The model also implements the ability to write and read at multiple points of the memory simultaneously, allowing the agent to memorize and retrieve multiple pieces of information at each time-step. This will be particularly helpful in the second task, where the agent needs to memorize multiple pieces of information.

Another possible improvement that the model implements is the memory feedback. This has been inspired by Oh et al. [2016], who first implemented this concept. The idea is to compute the current memory attention not only from the current vector $x_t$, but also considering the information read at the previous time-step, $o_{t-1}$. This could allow the agent to process information across multiple time-steps, allowing for some "pondering".

Finally, before moving to the next section, we give a brief recap of how the data-flow works in the Memory Addressing cells.

3.3.5 General Data-flow

Assume we have an input vector $x_t$, which is the output of a pre-processing fully connected layer. Then the MAN cell, using the current state of the memory $M_t$, works as follows:

$$v_t = [x_t, o_{t-1}] \text{ if feedback else } x_t$$
$$r_t = \text{read\_addressing}(M_t, v_t)$$
$$w_t = \text{write\_addressing}(M_t, v_t)$$
$$M_{t+1} = \text{write}(M_t, w_t, v_t)$$
$$o_t = r_t \cdot M_t$$
$$\text{Return } [x_t, o_t]$$

where $[\cdot]$ is the concatenation operator, and $r_t, w_t$ are the reading and writing attention distributions. There is one point that we want to underline: the return value is the concatenation of the input and the read vector. This has been done because the memory reading should focus on past information, enriching the current processing of the state.
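Putting the pieces together, here is a minimal NumPy sketch of one MAN cell step following the data-flow above, using the straight addressing and the replace write; the weights are random placeholders, and a projection of $v_t$ down to the memory width is added here as an assumption, since the internal dimensions of the cell are not fully specified in the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class MANCell:
    """Sketch of the MAN data-flow: address, write, read, return [x_t, o_t]."""
    def __init__(self, in_dim, n_loc, n_dim, feedback=True, seed=0):
        rng = np.random.default_rng(seed)
        self.feedback = feedback
        v_dim = in_dim + (n_dim if feedback else 0)
        self.W_read = rng.normal(scale=0.1, size=(n_loc, v_dim))    # straight read head
        self.W_write = rng.normal(scale=0.1, size=(n_loc, v_dim))   # straight write head
        self.W_proj = rng.normal(scale=0.1, size=(n_dim, v_dim))    # assumed projection to memory width
        self.M = np.zeros((n_loc, n_dim))                           # constant (zero) initialization
        self.o_prev = np.zeros(n_dim)

    def step(self, x_t):
        v_t = np.concatenate([x_t, self.o_prev]) if self.feedback else x_t
        r_t = softmax(self.W_read @ v_t)       # read addressing
        w_t = softmax(self.W_write @ v_t)      # write addressing
        content = np.tanh(self.W_proj @ v_t)   # information to store
        self.M = self.M * (1.0 - w_t)[:, None] + np.outer(w_t, content)   # replace write
        o_t = r_t @ self.M                     # read
        self.o_prev = o_t
        return np.concatenate([x_t, o_t])      # enriched state representation

cell = MANCell(in_dim=32, n_loc=12, n_dim=24)
out = cell.step(np.random.default_rng(1).normal(size=32))
print(out.shape)   # (56,): input (32) concatenated with the read vector (24)
```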

We now introduce the tasks that we have tested our architectures with. The tasks are derived from the ones presented in Oh et al. [2016]. We summarize how they are designed and what challenges the models have to face.

4 Experimental Setup

The tasks that we have designed are set in a Grid World, and offer only a partial view to the agent. How the view is implemented depends on the specific task, but it is in general a partial view of the surroundings of the agent's position. This makes the environment partially observable.

We also want to highlight the design of the reward function. Since our goal is to test the capabilities of models at learning temporal dependencies through memory, we have designed a reward function that simplifies the exploration process of the agent. The reward function is as follows:

• -0.5 for each unsuccessful step, meaning that the agent tries to move towards a wall;

• -0.04 for each successful step;

• +2 for reaching the correct end-tile;

• -1 for reaching the wrong end-tile.

This reward schema helps the agent quickly learn not to go towards a wall, and instead explore the maze. This approach has the upside of speeding up the training process, but has the downside of introducing another sub-optimal solution.
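A small sketch of this reward schema as a helper function; the step-outcome flags are hypothetical, and the actual environments derive them from the grid layout.

```python
def step_reward(moved, reached_goal, correct_goal):
    """Reward schema used in our tasks (values from the list above)."""
    if reached_goal:
        return 2.0 if correct_goal else -1.0
    return -0.04 if moved else -0.5    # successful step vs bumping into a wall

# Hypothetical outcomes
print(step_reward(moved=False, reached_goal=False, correct_goal=False))  # -0.5
print(step_reward(moved=True, reached_goal=True, correct_goal=True))     #  2.0
```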

All these tasks have two sub-optimal solutions that the agents will have to try to avoid:

• The first one is introduced by the reward function that we defined. It is a policy where the agent just moves backward and forward, effectively standing still. This gives $\mathbb{E}_s[V_\pi(s)] = -2$, since the agent always takes successful steps but never reaches the end.

• The second one is a solution where the agent always goes to the same end-tile, disregarding the indicator at the beginning. This means that $\mathbb{E}_s[V_\pi(s)] = +0.5$, since half of the time the agent will be correct, and half of the time it will be wrong.

We now go into the implementation of these tasks, specifically highlighting what makes them different from the ones presented by Oh et al. [2016].

I-Maze

This first task is based on the I-Maze Task presented by Oh et al. [2016]. The biggest difference is that in this task we are not using the Minecraft framework, but instead using a Grid-world implementation.

The agent has a limited view of its surroundings. The observation is represented as a square centered on the agent; in our experiments we use a radius of 2 (so the square side is 5). Moreover, each cell is represented with a 1-hot encoding.

Figure 7: I-Maze view of the environment

The maze has a variable length. During training we use a length in the range of 4 to 7 cells for the main corridor. Once the agent is trained, we test it on mazes with lengths between 3 and 40. This is a much wider range, which includes many lengths never seen by the agent during training.

As a separate experiment, we also added a switch of the two end-tiles. This, unexpectedly, resulted in a much harder task for the agent, which struggles to learn the correct behavior. We suppose that this difficulty arises from an exploration problem, and since it is not a goal of this work, we will not report results on this variant of the environment.

Pattern-Matching Maze

This second task is also based on the work from Oh et al. [2016]. As for the previous one, we have used a grid-world implementation with a top-down view. Also, we did not vary the length of this environment, since it was not the core focus of this task.

Figure 8: Pattern-Matching Maze view of the environment

We have used reduced dimensions for the rooms that the agent has to compare. This has been done to simplify the task that the agent has to learn, since it was very difficult to solve with 3 × 3 rooms. We will show how the performance drops as the dimensions of the rooms grow, and discuss this in the next section.

We decided not to test our agent on any task of the random-maze class presented by Oh et al. [2016], because the skills involved fall outside of our scope. These tasks mainly require good navigational skills, which go beyond the memorization ability that we wanted to test in our agents.

We now report the results from our experiments, and discuss the consequences that these results yield.

5 Results

In this section we report and discuss the results of our experiments. In the first two subsections we explore the different configurations of the model, and progressively build the best performing architecture. In the first subsection we mainly experiment using the I-Maze environment, with different reading and writing addressing techniques. We also explore different memory initializations, and the efficacy of memory normalization.

In the second subsection we use the Pattern Matching Maze environment, experimenting with the remaining settings. We evaluate the use of feedback connections and multi-head reading and writing.

Lastly, in the third subsection we compare the performance of the obtained model with NTMs and LSTMs. The goal is to show how challenging the tasks are, and to compare the performance of the proposed model with some common approaches.

We were not able to do an exhaustive experimentation of all the possible model settings, since it would have been unfeasible. Instead, we show a progressive building of the choices, varying one or two configurations at a time, and show how the performance changes.

We now introduce the evaluation metrics used in these experiments.

5.1 Evaluation Metrics

To evaluate a model after the training process we use two metrics. The first one is the cumulative reward, which consists of the sum of rewards obtained by the agent during the episode. We call this metric Score.

The second metric measures the solving capabilities of the model. The number of steps required by the agent to solve the task is variable, especially in the I-Maze environment. To truly evaluate the capabilities of the model, we want to measure how many times the agent is able to reach the correct final position in the environment. To do so, we compute, across a batch, how many times the agent solves the task correctly, considering an environment solved only if the final score is greater than 0. This works because of the reward function that we are using, which gives a positive reward only if the agent reaches the desired position. We call this metric Solved.
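A sketch of how the two metrics can be computed over a batch of evaluation episodes, assuming each episode is represented simply by the list of rewards it produced.

```python
import numpy as np

def evaluate(episode_rewards):
    """Score: mean cumulative reward; Solved: fraction of episodes with final score > 0."""
    scores = np.array([sum(rewards) for rewards in episode_rewards])
    return scores.mean(), (scores > 0).mean()

# Hypothetical batch of three evaluation episodes
batch = [[-0.04, -0.04, 2.0], [-0.5, -0.04, -1.0], [-0.04, 2.0]]
score, solved = evaluate(batch)
print(round(score, 2), round(solved, 2))   # 0.78 0.67
```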

The scores are computed in two different settings. The first one, called training in the tables, is computed using environments that have also been seen during the training process. For example, in the I-Maze environment, the agent is trained on mazes of length from 4 to 7.

The second one, called test in the tables, is instead a slightly different setup that has not been encountered by the agent during training. For example, in the I-Maze environment the agent is tested on mazes of length from 3 to 40.


An early stopping technique has been applied in all the experiments, to avoid overfitting or divergence due to over-training. In the I-Maze the agent has been trained for at most 1000 episodes, and stopped when the average score over the last 100 episodes was higher than 1.0 (we call this value the target score). For the Pattern Matching Maze, instead, the maximum number of training episodes has been raised to 2000, and the target score to 1.5.

This has been done to account for the greater complexity of the second environment. The difference in the target score is also due to the difference in the task. In the case of the I-Maze, the environment has 2 possible configurations, so we wanted to evaluate whether the agent was stuck in a sub-optimal policy or had reached the optimal policy. In the Pattern Matching Maze, instead, there are many possible configurations of the environment. This means that sometimes the agent could learn to act correctly only in some of these configurations, acting sub-optimally in the others.

All the reported results are obtained by running 10 separate trainings and computing an average over the performance of the obtained models. In the case of the Pattern Matching Maze we have not reported any test results; we will discuss this choice later. We have used a fixed learning rate of $10^{-3}$ for all the reported experiments.

5.2 Experiments using I-Maze

In this section we report the results of the experiments with our proposed architecture. We will start by experimenting with the reading operation.

5.2.1 Reading Experiments

We first compare different reading techniques. As writing methods we apply two different techniques. The first is the FIFO, which provides a baseline. After that, we also use the replace writing described in equation (23), since it does not require normalization, and we use the same addressing for both reading and writing.

We test the reading with two different writing methods for one main reason: the FIFO memory provides a "perfect" memory, but does not give a stable positioning of the information, which means that content-based addressing will most likely perform better. With the replace method, instead, we have a more location-based memorization of the information, which means that other addressings will probably work best.

We are also using the constant initialization for the memory, and we will test later the influence of this choice in different architectures.

We are going to summarize the results in Table 1.

Reading     Writing             Training Score   Training Solved   Test Score   Test Solved
straight    FIFO                 1.15             90%              -0.46         60%
by key      FIFO                -0.79             71%              -1.74         62%
indexing    FIFO                 0.24             50%              -0.51         47%
straight    replace-straight     1.25             85%              -0.17         64%
by key      replace-by key       0.88             74%              -0.08         62%
indexing    replace-indexing    -0.43             35%              -0.93         33%

Table 1: Results of the experiments on the memory reading

As we can see from the results, the straight addressing seems to be the best performing one, although its performance is very close to that of the addressing by key. The experiments using the FIFO writing show that using a more general approach does not yield a significant drop in performance; the results are quite close, with generally better test scores. The only exception is the indexing, which performs very poorly. We also want to highlight that the addressing by key with the FIFO writing has the reported results because its training had a very high variance: sometimes the results were leaning towards the optimal policy, while other times they converged to a sub-optimal policy or even diverged.

In the end, we select the straight addressing as the best performing addressing method for the reading operation, since it yields good performance with both content-based and location-based writing methods.

5.2.2 Memory dimensions

Another crucial point that we need to consider is the dimensions of the memory. For the experiments so far, we have used a 40 × 24 matrix, where 40 is the number of memory locations and 24 the dimension of the vectors. This has been done because, for the FIFO memory, we need enough space to contain the whole history of the episode. For comparison, we have kept the same memory size in the replace experiments.

We now experiment with these dimensions, starting with the number of memory locations. In Table 2 we report the results:

Memory Locations   Training Score   Training Solved   Test Score   Test Solved
40                  1.25            85%              -0.17         64%
20                  1.55            95%               0.88         92%
12                  1.74           100%               0.89         92%
8                   1.74           100%               0.72         87%
4                   1.59            95%               0.17         74%

Table 2: Results on the number of memory locations

As we can see from these results, reducing the number of available memory locations significantly improves the performance of the model. The best generalizing setting is clearly 12, and it will be used from now on in the next experiments.


We also want to test the impact of the dimension of the memory vectors. We expect it to have very little impact, since the "concepts" that we are trying to learn are quite simple, and do not require a very big vector space to be represented. We report the results of this experiment in Table 3.

Vector Size   Training Score   Training Solved   Test Score   Test Solved
48             1.74            100%              0.76         89%
36             1.74            100%              0.64         85%
24             1.74            100%              0.89         92%
12             1.74            100%              0.82         89%

Table 3: Results on the memory-vector size

As we can see from the results, the vector size does not have a clearly defined effect. All the choices reach the same training score, but the size of 24 seems to have a small edge in generalization. For this reason, we keep the vector size at 24.

5.2.3 Writing Experiments

We now experiment with the writing operation. We compare all the combinations of writing methods with all the learnable addressing generation techniques. At first we do not consider implicit addressing as a viable option; later we experiment with this type of addressing as well and show how it performs compared to the learnable ones.

We report in table 4 the results of the experiment with all the combinations of writing and addressing methods.

Write addressing  Write method  Train score  Train solved  Test score  Test solved
straight          Replace          1.74         100%          0.89        92%
by key            Replace          1.63          96%          0.66        85%
indexing          Replace          1.74         100%          0.85        91%
straight          Overwrite        1.74         100%          0.96        96%
by key            Overwrite        1.74         100%          0.89        94%
indexing          Overwrite        1.74         100%          0.93        94%

Table 4: Experiments on the writing of the memory

As we can see from this table, the overwrite technique performs noticeably better than the replace. One probable reason is that the overwrite requires some kind of normalization, and this normalization may affect the performance of the model. We explore this aspect later in this section.

As for the addressing method, all three seem to perform quite well. We keep all of them to compare the effect of the next modifications.
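Our reading of the distinction between the two write operations is sketched below: the blend itself is the same as in the replace write shown earlier, but the overwrite first normalizes the write address across locations, which is the normalization referred to above. The exact equations are defined earlier in the thesis; this snippet is an assumed form for illustration only.

import numpy as np

def overwrite_write(memory, value, logits):
    """Assumed form of the 'overwrite' operation: unlike 'replace', the write
    address is first normalized with a softmax across locations, so exactly
    one unit of attention is distributed over the memory before the blend."""
    e = np.exp(logits - logits.max())
    w = (e / e.sum())[:, None]              # normalized write weights, sum to 1
    return (1.0 - w) * memory + w * value   # same blend as the replace write

memory = np.zeros((12, 24))
memory = overwrite_write(memory, np.ones(24), logits=np.random.randn(12))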


As a last comparison, we also report the results obtained with implicit addressing.

Write addressing  Write method  Train score  Train solved  Test score  Test solved
implicit          Replace          1.59          95%         -0.56        61%
implicit          Overwrite        1.74         100%         -0.55        69%

Table 5: Experiments with implicit memory addressing

As we can see from the results in Table 5, the implicit addressing does reduce the ability of the model to solve the task. The network is still able to learn it sometimes, but in general it is much less reliable than with the learnable addressing methods. We also needed to use the identity initialization to allow the model to route values to different locations, rather than blending the result across the whole memory matrix.

In conclusion, we decided to discard implicit addressing as a viable option, since it compares poorly to the other methods.

5.2.4 Normalized Replace

Here we report the results of the experiments on memory normalization. The overwrite operation needs memory normalization by design, while replace does not, which allows us to isolate the impact of normalization on performance.

The results are reported in Table 6:

Write addressing  Normalized  Train score  Train solved  Test score  Test solved
straight          True           1.74         100%          0.91        93%
indexing          True           1.03          76%         -1.03        59%
by key            True           1.74         100%          0.15        77%
straight          False          1.74         100%          0.89        92%
indexing          False          1.74         100%          0.85        91%
by key            False          1.63          96%          0.66        85%

Table 6: Experiments on normalizing the memory using the replace writing method

As we can see from this table, normalization has only a marginal effect on the replace write. The straight addressing is also the only one able to maintain the same score in both situations, while the other two lose a considerable amount of generalization performance.

Since the memory overwrite still performs better, we decided to use that method for the next experiments.


5.2.5 Memory initialization

We now report the results of the experiments on memory initialization. The initial value of the memory is a crucial choice for the performance of the model.
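Before looking at the results in Table 7, the three initialization policies can be written in a few lines. This is a minimal sketch under our assumptions: "Eye" is taken to be an identity-like matrix padded to the memory shape, "Constant" fills the matrix with a single value (zero is assumed here), and "Random" samples i.i.d. Gaussian noise with an assumed scale.

import numpy as np

N_LOC, VEC = 12, 24

def init_memory(policy):
    if policy == "eye":        # identity-like matrix, padded to the memory shape
        return np.eye(N_LOC, VEC, dtype=np.float32)
    if policy == "constant":   # every cell set to the same value (0 assumed here)
        return np.zeros((N_LOC, VEC), dtype=np.float32)
    if policy == "random":     # i.i.d. Gaussian noise (scale assumed)
        return 0.1 * np.random.randn(N_LOC, VEC).astype(np.float32)
    raise ValueError(f"unknown policy: {policy}")

memory = init_memory("constant")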

Write addressing  Initialization  Train score  Train solved  Test score  Test solved
straight          Eye                1.29          89%         -0.03        65%
indexing          Eye                0.62          65%         -2.79        49.8%
by key            Eye               -0.71          90%         -3.14        60%
straight          Constant           1.74         100%          0.96        96%
indexing          Constant           1.74         100%          0.93        94%
by key            Constant           1.74         100%          0.89        94%
straight          Random             1.73         100%          0.45        79%
indexing          Random            -0.83          55%         -2.38        45%
by key            Random             1.74         100%          0.32        74%

Table 7: Results of the experiments on the memory initialization policy

As we can see from Table 7, the results show three main points.

• The most stable and effective initialization policy is the constant one. This is probably due to two main reasons. First, with the overwrite method we are using memory normalization, so a random initialization only increases the amount of noise the algorithm has to learn to ignore. Second, since we are using location-based addressing, the identity initialization brings no benefit, but rather adds noise to the initial state of the memory that the network needs to learn to ignore.

• A rather curious phenomenon is that the constant initialization makes the addressing by key perform better than the identity one. This is interesting, but we suspect it is not a reliable behavior: since only one piece of information has to be memorized, a blurred read across the whole memory can perform better than a sparse one. When we use multi-head attention, this behavior might change.

• The straight addressing proves to be the most resilient method, which could mean that it works better in more complex situations; in these experiments it consistently performed best. We will experiment further with the Pattern-Matching Maze to test its efficacy in a more complex task, where more than one piece of information needs to be carried over.


5.2.6 Summary

In this section we have explored the different configurations of our model, using the I-Maze task to evaluate them and find the best configuration for this type of task. We now summarize our final choices.

Reading Addressing      straight
Writing Method          overwrite
Writing Addressing      TBD
Memory Normalization    True
Initialization          Constant
Memory Locations        12
Vector Size             24

Table 8: Final model configuration

We have decided not to experiment with different numbers of read/write heads in the I-Maze, since only one piece of information needs to be stored in memory; the task is not well suited to test the capabilities of multiple memory access points.

We have also not explored the use of memory feedback here, since our final generalization score of 96% is already very high. A test done with the feedback yielded the same result, so we will compare the two options in a more challenging scenario.

In the next section we finish testing the remaining settings of our model, checking how multi-head reading and writing work and whether feedback is really beneficial.

5.3 Experiments using Pattern-Matching Maze

This task is much more challenging than the previous one, as it requires the model to understand "visual concepts". Our model is designed to be fully connected, so we currently flatten the environment state representation, losing all the spatial information. We believe this task could benefit greatly from convolutional layers, but our current design does not allow for their use; we leave this as a possible future improvement.

5.3.1 Memory Feedback

In this section we experiment with the memory feedback and test its efficacy on the Pattern Matching Maze. We use three configurations of the maze: we first try 1 × 1 rooms and then progressively increase their size. We will see that the performance of the network drops heavily as the rooms grow, because the network needs to learn how to act in every possible combination it faces.
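Under the assumption that "memory feedback" means feeding the read vector(s) produced at the previous time step back to the controller together with the current observation (the precise definition is the one given earlier in the thesis), the mechanism reduces to a simple concatenation, as in this sketch:

import numpy as np

def controller_input(observation, prev_read, use_feedback=True):
    """Build the controller input: the flattened observation, optionally
    concatenated with the read vector(s) from the previous time step."""
    obs = observation.reshape(-1)
    return np.concatenate([obs, prev_read.reshape(-1)]) if use_feedback else obs

frame = np.random.randn(5, 5, 3)   # hypothetical environment frame, shape for illustration
prev_read = np.zeros(24)           # read vector produced at the previous step
x = controller_input(frame, prev_read)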


Since this is a more complex task, we increased the maximum number of training episodes to 2000 and the target score to 1.5, allowing the network to better learn how to act inside the environment.

Table 9 reports the results with the different addressing methods, with and without memory feedback.

Addressing  Environment   Score  Solved  Feedback
straight    PM Maze 1x1    1.56    97%   True
straight    PM Maze 1x1    1.56    90%   False
by key      PM Maze 1x1    1.73    95%   True
by key      PM Maze 1x1    1.79    97%   False
indexing    PM Maze 1x1   -1.52    47%   True
indexing    PM Maze 1x1    1.11    75%   False
straight    PM Maze 1x2    0.85    71%   True
straight    PM Maze 1x2    0.41    62%   False
by key      PM Maze 1x2    0.42    72%   True
by key      PM Maze 1x2    1.11    74%   False
indexing    PM Maze 1x2    0.12    45%   True
indexing    PM Maze 1x2   -0.05    50%   False
straight    PM Maze 2x2   -0.70    57%   True
straight    PM Maze 2x2    0.44    55%   False
by key      PM Maze 2x2    0.80    66%   True
by key      PM Maze 2x2    0.03    48%   False

Table 9: Results of the experiments on the Pattern Matching Maze

There are many experiments in this table, so we will analyze some critical aspects of these results.

First, we notice that the indexing addressing performs very poorly in this task. Its training runs are very unstable and frequently converge to sub-optimal policies, even in the 1 × 1 rooms task. We decided not to test it on the 2 × 2 rooms because the results of the previous experiments were bad enough to discard it.

Second, we notice that the memory feedback has a positive effect overall. Sometimes the performance remains roughly the same, but in general it improves the ability of the model to learn the task.

Lastly, we want to underline that three times during the experiments with addressing by key and memory feedback, the network was able to convincingly learn to solve the 2 × 2 pattern matching task.

Based on these results, we choose to keep the memory feedback in our final configuration. We could also use the addressing by key for the Pattern Matching Maze, but the effect of combining it with the constant initialization, which we did not fully explore earlier, is still unclear. We now dig a little deeper into this point.


Constant initialization on addressing by key

Figure 9: Visualization of the memory; the top graph shows the matrix state, and the bottom vector shows one of the reading attention vectors.

In this section we aim at explaining more clearly what happens with the combination of key-addressing and constant initialization. As we can see from Figure 9, the result is quite simple: the attention distribution is always uniform across the whole memory, so the effect is that we write a blurred version of the information we need.
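This blurring can be verified directly: at the start of an episode, if every memory row is identical (as with a constant initialization), the key similarities are all equal, so the softmax returns exactly 1/n_loc for every location regardless of the key. A minimal check, using a nonzero constant so the cosine similarity stays well defined:

import numpy as np

n_loc, vec = 12, 24
memory = np.full((n_loc, vec), 0.5)        # constant-initialized memory (nonzero constant)
key = np.random.randn(vec)

# cosine similarity of the key against every (identical) memory row
sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
attention = np.exp(sims - sims.max())
attention /= attention.sum()

print(attention)                           # every entry equals 1 / n_loc, about 0.083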

To better understand the impact of this, we experiment by reducing the number of memory locations down to 1. This should make the addressing irrelevant, but it gives a sharper representation, and it also degenerates the model towards a common RNN cell.

The results of these experiments are reported in Table 10.

Environment   Score  Solved  Memory locations
PM Maze 1x1    1.73    95%   12
PM Maze 1x1    1.69    95%   6
PM Maze 1x1    0.43    75%   3
PM Maze 1x1    0.51    92%   1
PM Maze 2x2    0.25    55%   12
PM Maze 2x2    0.58    58%   6
PM Maze 2x2   -1.79    46%   3
PM Maze 2x2    0.01    54%   1

Table 10: Results on reducing the number of memory locations with the addressing by key and constant memory initialization

As we can see from these results, reducing the number of memory locations down to 1 has no effect. This is probably because the operation is conceptually equivalent regardless of the number of memory locations: we suppose that the network learns to ignore the constant blurring factor introduced by the addressing, which is 1/n_loc.

In general, this is a behavior that we do not like, especially if we want to apply multiple read/write heads. After this test we set aside the addressing by key and use the straight addressing for the next experiments, since it is the second best performing method and its application makes more sense.


The use of multiple memory access heads could be a critical point that helps the model learn to solve the task even better, and we are going to explore what the optimal number of heads is.

5.3.2 Read/Write heads

In this section we experiment with different numbers of read and write heads. This gives the agent the opportunity to retrieve and write more data in memory when needed, and to read more than one piece of information at the end, when it has to make a decision.

We have decided to keep the number of read and write heads equal, since in our initial experiments we did not notice any difference when increasing only one of the two.
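In this setting, multiple heads amount to each head producing its own attention vector, with the per-head reads concatenated before the policy layers and the writes applied one after the other. A minimal sketch under these assumptions (the blend and the logits are placeholders, not the thesis implementation):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_head_read(memory, head_logits):
    """One attention vector per head; the per-head reads are concatenated."""
    return np.concatenate([softmax(l) @ memory for l in head_logits])

def multi_head_write(memory, values, head_logits):
    """Apply each head's soft write in sequence."""
    for value, logits in zip(values, head_logits):
        w = softmax(logits)[:, None]
        memory = (1.0 - w) * memory + w * value
    return memory

memory = np.random.randn(12, 24)
logits = [np.random.randn(12) for _ in range(4)]          # 4 read/write heads
reads = multi_head_read(memory, logits)                   # shape (4 * 24,)
memory = multi_head_write(memory, [np.random.randn(24) for _ in range(4)], logits)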

We report the results in table 11:

Environment   Score  Solved  R/W heads
PM Maze 1x1    1.56    97%   1
PM Maze 1x1    0.61    95%   2
PM Maze 1x1   -0.67    73%   4
PM Maze 1x2    0.85    71%   1
PM Maze 1x2    0.94    72%   2
PM Maze 1x2    0.82    65%   4
PM Maze 2x2   -0.71    57%   1
PM Maze 2x2    0.34    53%   2
PM Maze 2x2    0.71    61%   4

Table 11: Experiments on varying the number of read/write heads inside the model

The results give us a clear picture: the number of read/write heads is influential on this task, as we expected. In the first task, where the rooms are small, the amount of information to retrieve is small, and increasing the number of heads does not seem to help; this is probably because it increases the amount of noise extracted from the memory matrix, which makes training less stable.

From the second task onward we see a different trend, where the performance does not drop and instead remains stable. In the third task, having 4 heads seems to help the performance of the model.

This parameter is probably very task dependent. In simple tasks like the I-Maze or the PM-Maze 1x1, increasing the interactions with the memory does not help, while in a more complex task like the PM-Maze 2x2 interacting more with the memory does help the model learn to solve the task.

All in all, the Pattern-Matching task remains quite complex to solve as the room dimensions scale up. We were not able to reach the capabilities shown
