Continual Learning with
Recurrent Neural Network
Models
Andrea Cossu
Master's Thesis in Computer Science
Department of Computer Science
Università di Pisa
Academic Year 2018-2019
Supervisors:
Dr. Davide Bacciu
Dr. Antonio Carta
Contents
1 Introduction
  1.1 Continual Learning
  1.2 Catastrophic forgetting
  1.3 Continual Learning metrics
    1.3.1 CL for sequential data processing

2 Continual Learning strategies
  2.1 Regularization techniques
  2.2 Dynamic expansion techniques
  2.3 Complementary Learning System
  2.4 Approaches for CL on sequential data processing

3 Proposed approach
  3.1 Recurrent Neural Networks
  3.2 Augmented recurrent models

4 Experiments
  4.1 Copy Task
    4.1.1 Task Description
    4.1.2 Experiments settings
    4.1.3 Results
  4.2 Sequential Stroke MNIST Task
    4.2.1 Task Description
    4.2.2 Single digit classification
    4.2.3 Multi digits classification
    4.2.4 Sequence classification
  4.3 Audioset task
    4.3.1 Task Description
    4.3.2 Experiments settings
    4.3.3 Results

5 Discussion
  5.1 Evaluation of the proposed approach
  5.2 A tentative comparison with existing works

6 Conclusions and Future Works

Appendices
  A SSMNIST sequence classification classes
Acknowledgements
My heartfelt thanks go to my supervisors, Davide Bacciu and Antonio Carta, for assisting me throughout this work.
None of the results presented here would have been possible without their fundamental contribution.
Chapter 1
Introduction
The long-term objective of Artificial Intelligence (AI) [43] is the development of rational, intelligent agents capable of operating in the real world in ways that are comparable, or even superior, to the ones commonly attributed to human beings.
In the past decade, the rise of Machine Learning (ML) [28] brought great popularity to AI. The development of ML systems able to learn from and adapt to available data led to impressive results in many practical applications. However, the use of ML techniques in complex, real-world scenarios is still very limited.
In order to operate in these dynamic environments, ML models should be able to adapt to new interactions, new experiences and new goals as soon as they are available. The aspect of versatility is strictly required when dealing with heterogeneous sources of information that can change in time and influence future behaviors.
Unfortunately, current ML models are able to address only specific tasks in very narrow contexts, though with very accurate results. They are, as a matter of fact, more similar to advanced engineered applications than to truly intelligent agents.
Continual Learning (CL) aims at developing adaptive models able to learn continuously over a dynamic stream of multiple data sources that may not be all available at once [29]. A CL algorithm should build, incrementally, a rich set of internal representations and update them during the learning process, in order to address different kinds of tasks.
While this is the long-term achievement that CL systems should in principle reach, current work is still in its early stages.
CL and ML research are deeply interconnected. In fact, current trends in CL often focus on the extension of ML models (in particular, Artificial Neural Networks - ANNs) with the ability to efficiently address multiple distributions and tasks.
As an example, consider a standard feedforward (MLP) or convolutional (CNN) neural network trained on a custom dataset of license plate images. The trained model is deployed on a speed detector station in order to autonomously recognize the license plates of vehicles traveling faster than the speed limit. The model's performance is excellent and it proves capable of correctly detecting license plates from the pictures taken by the speed detector.
However, at a certain point in time, the government issues a new regulation regarding the license plates format, completely changing their size, their shape, their colors, even the font of the characters.
The model is not able to adjust to these changes and consistently fails to recognize the new license plates.
With current ML techniques, the only option is to train a new model on both the old and new license plates, since some of the old ones are still circulating, waiting for dismissal.
The retraining, however, takes a very long time. In the meantime, speed detectors are unable to recognize new license plates, greatly reducing their effectiveness.
Another solution would be to train the already existing model on data about new license plates only. However, as described in Section 1.2, this process leads the model to forget its knowledge about old license plates. A CL algorithm represents the only solution enabling a continuous update of the model, without needing to interrupt its operation. Moreover, a CL algorithm will also deal with forgetting, preventing old knowledge from being overwritten.
In addition to important improvements in real world applications, CL can also bring benefits to AI in general.
The study of CL is driven by the evidence that the human brain is perfectly capable of adapting to a continuously changing environment. As a consequence, this capability is likely a necessary condition for artificial systems to exhibit human-level intelligence.
1.1 Continual Learning
The definition of Continual Learning itself presents many variants (one of them being Lifelong Learning [46]).
For the purpose of this thesis, CL can be defined as "the unending process of learning new things on top of what has already been learned" [40].
A more formal definition centered around the computational aspects of a CL algorithm is presented in [29]:
Definition 1. Given a potentially infinite sequence of unknown distributions D = {D_1, D_2, D_3, ...}, where each D_i = {X_i, Y_i} includes the input data X_i and the target labels Y_i, a Continual Learning algorithm A_i^{CL} is defined by the following signature:

∀ D_i ∈ D, A_i^{CL} ≡ <h_{i−1}, TR_i, M_{i−1}, t_i> → <h_i, M_i>

where TR_i is the training set drawn from the corresponding data distribution D_i, h_i is the function that the model has learned after having seen all the training sets up to the i-th, M_i is an external memory in which to store some patterns from TR_i and t_i is the label identifying the task.
This definition highlights the fact that the model updates its own hypothesis (from h_{i−1} to h_i) when trained on new incoming data TR_i.
It is important to clarify the terminology related to task and distribution. In the definitions presented in this Chapter, the notion of task is associated with a broad class of objectives, like classification, regression, clustering. A distribution, instead, plays the role of generating a set of inputs that can be related to one or more tasks.
There are many ways in which these distributions can be presented to (and learned by) a CL architecture [29]:
• Single-Incremental-Task (SIT): in which there is a single task that is learned incrementally, as the data about it (i.e. a different number of distributions) becomes available
• Multi-Task (MT): in which there are different tasks to be learned, each with its own data distribution
• Multi-Incremental-Task (MIT): in which multiple different tasks have to be learned, but they are learned incrementally, as the data about them becomes available. Hence each of them can have multiple associated data distributions.
Moreover, depending on the informational content of the incoming new distributions, an additional classification is possible [33] [29]:
• New Instances (NI): each step introduces new examples drawn from already encountered distributions
• New Classes (NC): each step introduces new examples drawn from new distributions
• New Instances and Classes (NIC): each step introduces new examples drawn from both new and already encountered distributions
1.2
Catastrophic forgetting
The problem of catastrophic forgetting, sometimes called catastrophic interference, is a well-studied and challenging one [42] [13] [35]. It can be summarized as follows:
If, after its original training is finished, a network is exposed to the learning of new information, then the originally learned information will typically be greatly disrupted or lost. ([42])
As is immediate to see from the definition above, the catastrophic forgetting problem is strongly related to the CL objective, since it prevents standard algorithms and models from achieving a continuous learning of new knowledge without forgetting the existing one.
This definition explicitly states the problem for any network, meaning any connectionist model that learns from data (e.g. ANNs in ML). However, nothing prevents the generalization of the phenomenon to other kinds of models. There is large evidence of the presence of the catastrophic forgetting phenomenon in feedforward ANNs [37]. To understand the origins of this issue, it is very helpful to introduce the stability-plasticity dilemma [37] [42] (or sensitivity-stability [11]), a more general phenomenon that heavily influences catastrophic forgetting:
The stability-plasticity dilemma regards the extent to which a system must be prone to integrate and adapt to new knowledge and, importantly, how this adaptation process should be compensated by internal mechanisms that stabilize and modulate neural activity to prevent catastrophic forgetting. ([37])
In practice, whenever new information becomes available to a system, two different needs start to compete with one another: the first is the need to adequately learn what is new (plasticity), the second amounts to avoiding the disruption of what has been previously learned (stability).
On a static, fixed architecture, these two aspects of the learning process can hardly be satisfied at the same time, and sooner or later one will prevail over the other, due to limited resource availability.
This is exactly what happens with ANNs: the knowledge about the input data stored in the network's adaptive parameters is overwritten as soon as different kinds of distributions arrive. The model then needs to be retrained on all the available distributions in order to efficiently encompass all the information. Retraining is a very costly procedure which is rarely feasible. Different solutions have been found to deal with this phenomenon. For example, balancing the stability-plasticity dilemma, which for ANNs is largely shifted towards the plasticity component, greatly helps in preserving the stability of previous knowledge.
Despite all the efforts in limiting catastrophic forgetting in current ANNs, the problem is far from being solved [24] and it constitutes an active area of research.
The similarity between ANNs and biological neural networks is very loose: ANNs represent a multi-layered structure made of fully interconnected functional units. Each unit resembles a biological neuron only in the very basic computational principle it implements. In the end, there is not much of biological principles left inside an ANN. This is an important point when dealing with forgetting, since the human brain is capable of addressing it in a very spontaneous way and with impressive results. The brain, during its normal activity, is able to retain most of the collected knowledge, despite the constantly changing nature of the environment and the challenges it has to face. This very basic fact inspired the study of catastrophic forgetting in light of biological principles, leading to a similar view on the stability-plasticity dilemma. One of the most famous and simple formalizations dates back to 1949, when Donald Hebb, in his "The Organization of Behavior" [18], proposed a mechanism accounting for the structural and functional changes occurring in the human brain at the synaptic level in response to an external stimulus. This formulation leads to the concept of Hebbian plasticity, represented in the following equation (this process is summarized in [37]):
∆w = m · x · y · η (1.1)

where ∆w measures the change (weighted by the learning rate η) in the synaptic strength w depending on the pre-synaptic activity x, the post-synaptic activity y and on the modulatory signal m, which accounts for the instability of pure Hebbian learning [1].
This abstract characterization of the stability (driven by m) versus plasticity (driven by x · y) dilemma suggests different approaches to the problem. In particular, it leverages bio-inspired methodologies in addition to the classic ML approach (e.g. [36]), thus reinforcing the already strong interactions between neuroscience and AI.
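The update rule in Eq. 1.1 is simple enough to be sketched directly. The following Python snippet is only an illustration of the formula; the function name and the numeric values are chosen here, not taken from [1] or [37]:

```python
def hebbian_update(w, x, y, m, eta=0.01):
    """Modulated Hebbian update (Eq. 1.1): delta_w = m * x * y * eta.

    w   -- current synaptic strength
    x   -- pre-synaptic activity
    y   -- post-synaptic activity
    m   -- modulatory signal (stability side); x * y drives plasticity
    eta -- learning rate
    """
    delta_w = m * x * y * eta
    return w + delta_w

# Correlated pre- and post-synaptic activity with positive modulation
# strengthens the synapse; setting m = 0 freezes it entirely.
w_new = hebbian_update(0.5, x=1.0, y=1.0, m=1.0)
```

Note how m acts as a gate on plasticity: with m = 0 the weight is perfectly stable, while a large modulatory signal lets the correlation term x · y dominate.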
1.3 Continual Learning metrics
The catastrophic forgetting problem, being the most influential issue in the CL field nowadays, has a large impact on the performance measures for CL algorithms. However, not all of them focus on it [9].
Some of the most used metrics are Average Accuracy, Backward Transfer and Forward Transfer [32].
They are computed using the accuracy matrix R ∈ R^(T×T), where R_{i,j} indicates the test accuracy on task t_j after observing the last sample from task t_i. Moreover, b̄ represents the vector of test accuracies on all the tasks at random initialization.
The Average Accuracy is described by the following formula:

ACC = (1/T) Σ_{i=1}^{T} R_{T,i} (1.2)
ACC computes the average accuracy on each task after the final one has been learned. It could also be useful to measure it at intermediate steps (e.g. computing it after learning each task).
The Backward Transfer is related to the effect that learning a new task has on the previous ones (basically, it measures forgetting) and it is computed by the
following equation:

BWT = (1/(T − 1)) Σ_{i=1}^{T−1} (R_{T,i} − R_{i,i}) (1.3)
It compares the accuracy on a certain task after all the others are learned to the accuracy on the same task right after it has been learned.
Finally, the Forward Transfer focuses on knowledge transfer from a certain task to future ones. It can be computed by measuring the accuracy over a set of future, unseen input data and comparing it with the accuracy achieved by a randomly initialized model on the same task. Formally:
FWT = (1/(T − 1)) Σ_{i=2}^{T} (R_{i−1,i} − b̄_i) (1.4)
The BWT and FWT are not computed for the last and first task, respectively. The Negative Transfer Learning is a measure that is not explicitly stated in the literature, but that is strongly related to negative values of the Forward Transfer. Both [32], which introduces the Forward Transfer, and [29], which takes into consideration many other measures, do not put particular emphasis on Negative Transfer Learning. They focus, instead, on positive values of Forward Transfer, correctly identifying them as signs of zero-shot learning capabilities.
Negative Transfer Learning is a concept familiar to psychology [50]: it describes the negative effect that learning one or more tasks can have on learning a new one, making the process slower or even impossible.
This thesis proposes a formalization of Negative Transfer Learning for CL. Its value is computed from two different quantities.
The first value is computed by training a model only on the target task and retrieving the accuracy obtained on a separate test set.
The second value is computed by training a model on a series of different tasks (with at least 1 task) and then training the model on the target task, retrieving the corresponding test accuracy at the end. If the target task is indicated by t_i, and the aforementioned two values by a_i and R_{i,i}, it is possible to formally describe the Negative Transfer Learning:

NTL = (1/(T − 1)) Σ_{i=2}^{T} (a_i − R_{i,i}) / b̄_i (1.5)

where b̄_i is used as a normalization coefficient in order to obtain an NTL of 1 (maximum NTL) if R_{i,i} is on par with the performance of a random classifier.
There can be no Negative Transfer Learning for the first task, hence the measure is averaged over T − 1 tasks.
The FWT differs from the NTL. The former computes the accuracy on the target task without training the model on it (using a random initialization or a model trained on previous subtasks). The measure introduced here, NTL, trains the model on the target task and then computes the accuracy.
Hence, FWT evaluates possibilities of zero-shot learning, while NTL detects negative interference between tasks. A positive value for NTL does not necessarily imply positive values for FWT.
It should be noted that when a_i > R_{i,i} there is effectively Negative Transfer Learning, while when a_i < R_{i,i} there is Positive Transfer Learning; in the latter case, however, it is not appropriate to talk about zero-shot learning.
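All four metrics can be computed directly from the accuracy matrix R, the random-initialization accuracies b̄ and the isolated-training accuracies a_i. The following NumPy sketch mirrors Eqs. 1.2-1.5; the function name and the 0-based indexing are implementation choices, not part of the definitions:

```python
import numpy as np

def cl_metrics(R, b, a):
    """CL metrics from an accuracy matrix (Eqs. 1.2-1.5).

    R -- (T, T) matrix; R[i, j] is the test accuracy on task j after
         observing the last sample of task i (0-based indices here)
    b -- (T,) test accuracies of a randomly initialized model per task
    a -- (T,) test accuracies of models trained on each task in isolation
    """
    T = R.shape[0]
    acc = R[T - 1].mean()                                          # Eq. 1.2
    bwt = np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])   # Eq. 1.3
    fwt = np.mean([R[i - 1, i] - b[i] for i in range(1, T)])       # Eq. 1.4
    ntl = np.mean([(a[i] - R[i, i]) / b[i] for i in range(1, T)])  # Eq. 1.5
    return acc, bwt, fwt, ntl
```

A strongly negative bwt signals forgetting, while a positive ntl signals negative interference from the previously learned tasks.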
Other metrics that can be taken into account are related to the memory space required by M_i in order to (optionally) store previous patterns, and to the computational and memory efficiency of the CL models (in terms of the number of operations and parameters required). Finally, a weighted sum of all the metrics can be used to provide a comprehensive performance evaluation (the so-called CL score, CLscore) [9].
1.3.1 CL for sequential data processing
Even if CL has recently gained a lot of interest inside the ML community, its application has been mostly limited to reinforcement learning and computer vision scenarios, where many results are already available.
However, there is very little work related to the problem of sequential data processing in a CL fashion (see Section 2.4). This is somewhat surprising, as sequential data are present in a large number of important applications in which ML has already proven its capabilities: Natural Language Processing, robotics, bioinformatics, signal processing (as in Text to Speech and Speech to Text) are all fundamental areas to which ML has given an important contribution.
In general, a sequential data pattern X is composed of a sequence of vectors (x_1, x_2, ..., x_T), where each x_t ∈ R^m and the index t ∈ [1, T] represents the sequential observation order.
This thesis introduces a definition of a CL algorithm for sequential data processing, extending the one described in Def. 1.
Definition 2. Given a potentially infinite sequence of unknown distributions of sequences D = {D_1, D_2, D_3, ...}, where each D_i = {X_i, Y_i} includes the sequential input data X_i and the target labels Y_i, a Continual Learning algorithm A_i^{CL} is defined by the following signature:

∀ D_i ∈ D, A_i^{CL} ≡ <h_{i−1}, s_{i−1}, TR_i, M_{i−1}, t_i> → <h_i, s_i, M_i>

where s_i is the internal representation of the past inputs up to step i that the algorithm has developed and can use to embed knowledge about previous patterns.
The need to extend the definition with an internal state s_i comes directly from the observation that traditional ML models able to deal with sequential data almost always include such a hidden state. In particular, Recurrent Neural Networks (RNNs) are a class of ML models able to store the recent history of past inputs inside their hidden state, hence they are able to process sequential patterns in a very natural way.
The internal representation s_i could be included inside the term h_i of Def. 1. However, s_i has been explicitly pointed out in Def. 2, since it allows separating the concerns of feature extraction (h_i) and memory (s_i). CL methods may address both components, or put more emphasis on one of them over the other, focusing on different aspects of Def. 2.
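To make the signature of Def. 2 concrete, the following deliberately trivial Python sketch shows the shape of one step of a CL algorithm for sequences; the function and its storage policy are invented for illustration only:

```python
def naive_cl_step(h_prev, s_prev, tr_i, m_prev, t_i):
    """A trivial instance of the signature in Def. 2.

    Maps <h_{i-1}, s_{i-1}, TR_i, M_{i-1}, t_i> to <h_i, s_i, M_i>.
    Nothing is learned here: the hypothesis and state pass through
    unchanged, and one pattern of TR_i is stored in the external memory.
    """
    m_i = m_prev + tr_i[:1]   # keep only the first pattern of TR_i
    return h_prev, s_prev, m_i

# One step over a toy training set of two sequential patterns.
h, s, m = naive_cl_step(lambda x: x, [0.0], [[1, 2], [3, 4]], [], "task-1")
```

A real CL algorithm would, of course, update h and s from TR_i; the sketch only fixes the interface that Chapters 3 and 4 instantiate.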
The fundamental research question that drove this work can then be stated as follows: is the problem of CL on sequential data processing worth studying by using recurrent neural network models?
This question is a very general one but, given the scarcity of available results, it still deserves to be addressed explicitly. However, since neither CL nor ML offers (yet) a strong mathematical formulation of the problem on which to rely when studying such a general issue, the only feasible approach is an empirical one.
Therefore, this thesis proceeds by first introducing a new recurrent architecture for CL on sequential data inspired by existing approaches to CL for feedforward models. Then, it proceeds by experimenting with it on a variety of different datasets in order to study its behavior in distinct contexts.
In particular, with respect to the different learning methodologies for CL outlined above, the adopted approach follows the SIT + NC setting [33], in which a set of non-overlapping distributions is presented to the model. New class examples (drawn from a new distribution) are provided as soon as the previous distribution is learned.
The rest of the thesis is structured as follows: Chapter 2 provides a comparison of existing methods for CL, with a focus on the different design strategies around which they are clustered. Chapter 3 describes a new recurrent architecture for CL on sequential data processing, together with the corresponding baseline models used in the experimental phase. Their description and results are presented in Chapter 4. Chapter 5 offers a general discussion of the experimental results and compares them against existing works. Finally, Chapter 6 concludes the thesis by wrapping up what has been done and drawing inspiration for future works.
Chapter 2
Continual Learning strategies
A plethora of different CL models exists in the literature and the vast majority of them focuses on the problem of catastrophic forgetting and on how it could be addressed.
Since there is no clear winner among existing solutions, it is necessary to choose the best one on a case by case basis. Important constraints that can guide the selection process are the computational resources available, the environment type or the specific task to address.
A convenient and popular approach to explore the CL literature proceeds by grouping available methods into different paradigms [37] [29]:

1. Regularization

2. Dynamic expansion

3. Complementary Learning System
An overview of each paradigm (mostly in a supervised setting) is given in the next sections.
There are also models that lie in the intersection between paradigms, putting together different combinations of approaches. As an example, the AR1 model [33] is a technique to address single-incremental-task scenarios. It includes both regularization and architectural strategies. It is able to continuously learn not only the top hidden layers (like most of the other regularization methods) but also the middle ones, without being affected by forgetting. AR1 relies on two other methods: a slightly modified version (mean-shift and zero initialization) of Copy Weight with Reinit (CWR+) [31] and an Elastic Weight Consolidation alternative called Synaptic Intelligence (SI) [51].
2.1 Regularization techniques
Regularization techniques are very popular, even outside the CL community. They are used to improve generalization on the test set, since they prevent the model from overfitting to the data seen during training.
Regularization techniques proceed by adding a penalty term to the standard loss function. The design of the penalty characterizes the desired objective. As an example, large absolute values of the model parameters usually lead to overfitting; hence, a regularizer should penalize parameter matrices with large norms.
In CL, regularization targets different parts of the computation, with the final objective of enhancing the stability of the predictions over different input distributions.
The simplest approach penalizes large changes in the model parameters. A model with parameters that do not change in time is the most stable one, but it is also the least useful since it prevents any kind of future learning.
Elastic Weight Consolidation (EWC) [25] is a technique that inhibits the model parameters θ from changing much when the input distribution changes. It achieves this objective by applying a quadratic penalty to the usual target loss, as follows:

L(θ) = L_new(θ) + Σ_task (λ/2) Σ_i F_i (θ_i − θ*_{task,i})² (2.1)

where F_i is the i-th element of the diagonal of the Fisher information matrix, λ is the hyperparameter weighting the penalty term, L_new is the standard loss on the current task and θ*_task are the learned parameters for a previous task. Hence, in this formulation there is a penalty term for each incoming task (or input distribution).
EWC relies on the hypothesis that, for an over-parameterized model, there are many combinations of the parameters θ that produce the same performance on a certain input distribution. It is therefore likely that, even when constraining the parameters to stay approximately in the same region, it will be possible to learn new distributions.
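As a sketch, the penalty of Eq. 2.1 can be written in a few lines of NumPy, assuming flattened parameter vectors and one precomputed diagonal Fisher estimate per previous task; the function name and data layout are illustrative, not the reference EWC implementation:

```python
import numpy as np

def ewc_loss(task_loss, theta, anchors, lam=1.0):
    """EWC-style loss (Eq. 2.1): current-task loss plus quadratic penalties.

    theta   -- current (flattened) parameter vector
    anchors -- list of (theta_star, fisher_diag) pairs, one per old task
    lam     -- strength of the penalty term
    """
    penalty = sum((lam / 2) * np.sum(F * (theta - theta_star) ** 2)
                  for theta_star, F in anchors)
    return task_loss + penalty
```

Parameters with a large Fisher value are deemed important for an old task and are anchored tightly, while unimportant ones remain free to adapt to the new distribution.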
EWC resembles the sparse approach. The sparse approach relies on a segmented parameter space made of many non-overlapping regions. This structure enables an effective continual learning of many different distributions [12], preventing forgetting: a parameter update, in fact, affects only the corresponding region, without changing the others.
While EWC proceeds by small changes in the adaptive parameters, Learning without Forgetting (LwF) [30] adopts the same strategy for the output activations. LwF is inspired by the distillation procedure [20], which transfers the knowledge from a larger, "cumbersome" network (or from an ensemble of experts) into a smaller one. The transfer is performed using the output of the large model as a soft target for the distilled model.
LwF has a set of shared parameters θ_s and multiple sets of task-specific parameters θ_t, where t represents the task. The distillation loss is a cross-entropy computed on temperature-scaled outputs:

q_i = exp(z_i / T) / Σ_j exp(z_j / T) (2.2)
LwF includes the distilled loss in its final loss L:
L = L_new(Y_n, Ŷ_n) + λ L_old(Y_o, Ŷ_o) + R (2.3)
The first term is the usual cross-entropy loss between the output and the target, the second one is the distilled loss and R is a traditional regularizer. The distilled loss requires two terms, Y_o and Ŷ_o.
The former uses the task-specific parameters of the previous tasks to compute the output on the new task's data; it is recorded before starting training on the new task. The latter, Ŷ_o, is the output on the same data computed using the updated task-specific parameters; it is computed while learning the new task.
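A minimal NumPy sketch of the distillation ingredients follows; the temperature value and the function names are illustrative, and a real LwF implementation operates on the network's recorded and current logits:

```python
import numpy as np

def soft_targets(z, T=2.0):
    """Temperature-scaled softmax of Eq. 2.2; higher T gives softer targets."""
    e = np.exp(z / T - np.max(z / T))   # subtract the max for numerical stability
    return e / e.sum()

def distillation_loss(z_old, z_new, T=2.0):
    """Cross-entropy between softened old outputs (Y_o) and current ones."""
    q_old, q_new = soft_targets(z_old, T), soft_targets(z_new, T)
    return -np.sum(q_old * np.log(q_new + 1e-12))
```

The loss is minimized when the current outputs match the recorded ones, which is exactly the stability pressure LwF applies to the old tasks' outputs.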
Regularization techniques often succeed in preventing catastrophic forgetting. However, they eventually lead to a solution which is a tradeoff between the performance on new and old tasks.
2.2 Dynamic expansion techniques
One of the reasons why catastrophic forgetting happens is related, as already said, to the stability-plasticity dilemma.
While regularization approaches work on the stability side, it is also possible to focus on the plasticity component.
This is exactly what the dynamic expansion approach pursues, by increasing, as the name immediately suggests, the computational capabilities of a model. This expansion creates new learning opportunities without being forced, in principle, to overwrite existing knowledge.
This idea has taken many forms in the literature and it would be difficult to give a fair overview of each and every method proposed. This section will focus on the Progressive network, one of the main inspirations for this thesis, and on some other works that could be interesting for future extensions.
The Progressive Network architecture [44], depicted in Fig. 2.1, is composed of multiple feedforward networks (called columns), glued together by lateral adaptive connections. The state of the i-th hidden layer in column k is described by the following equation:
h_i^{(k)} = σ(W_i^{(k)} h_{i−1}^{(k)} + U_i^{(k:j)} σ(V_i^{(k:j)} α_{i−1}^{(<k)} h_{i−1}^{(<k)})) (2.4)

where W_i^{(k)} are the feedforward weights of the k-th column, U_i^{(k:j)} are the lateral connections from column j to column k, V_i^{(k:j)} is a projection matrix ensuring that the total number of parameters stays of the same order as the first column's, α_{i−1}^{(<k)} is a learned scalar adjusting different input scales and σ is a nonlinearity. The rightmost part of the equation defines what is called
Figure 2.1: Progressive Network architecture, taken from [44]
an adapter, which empowers the lateral connections with a single-hidden-layer feedforward network.
Progressive networks are resilient to catastrophic forgetting and can enable transfer learning via their lateral connections. However, the number of parameters grows linearly with the number of columns and the full capacity of the network is not completely exploited. Approaches that prune the final model can reduce the computational cost without affecting performance. More importantly, since each column is trained to be an expert of its own domain, at test time the best performance would be obtained by the column trained on the input data at hand. However, knowing this correspondence would amount to knowing the data generating distribution. Such knowledge is often not available in real world scenarios, thus limiting the inference phase.
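A single layer of Eq. 2.4 can be sketched as follows, assuming one earlier column whose previous-layer activations have already been concatenated; all names and shapes here are illustrative:

```python
import numpy as np

def progressive_layer(h_own, h_lateral, W, U, V, alpha, sigma=np.tanh):
    """One hidden layer of a Progressive Network column (Eq. 2.4).

    h_own     -- previous-layer activation of this column, shape (n,)
    h_lateral -- previous-layer activations of earlier columns, shape (p,)
    W, U, V   -- feedforward, lateral and adapter projection weights
    alpha     -- learned scalar rescaling the lateral input
    """
    adapter = sigma(V @ (alpha * h_lateral))   # dimensionality-reducing adapter
    return sigma(W @ h_own + U @ adapter)
```

Freezing W for old columns while training new columns (and their U, V, alpha) is what makes the architecture immune to forgetting.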
Other examples of dynamic expansion in CL include denoising autoencoders in which units are added during the training phase and then merged together to optimize the computational cost [39] [52].
Similarly to the distillation process, student networks can learn to extract knowledge from a teacher model and then dynamically adapt their depth (number of layers) or width (number of units) in order to optimize performance [7].
As for unsupervised approaches (i.e. approaches that do not need an explicit target label), self-organizing maps (SOM) are a popular choice. They are used together with grow-when-required (GwR) algorithms in order to dynamically expand their number of units and adapt their neighborhood size in response to shifts in the input distribution [34] [5].
Dynamic solutions favor the plasticity of a model, but they can also deal with its stability: previous knowledge can in fact be stored in the old parameters with little to no modification. No additional data is in principle required to make these models work.
The approach, however, is very delicate to manage and adds to the overall complexity. The type of expansion has to be carefully chosen in order to avoid an explosion in the number of adaptive parameters. The techniques through which the architecture is expanded are a key component of the overall algorithm and are one of the main topics of current works.
2.3 Complementary Learning System
Complementary Learning System (CLS) based models [27] take inspiration from neuroscience, in particular from the study of two neural regions of the mammalian brain: the hippocampus and the neocortex. Since they work with these two main components, they are also called dual models [37]. Each component learns at a different rate: the slow learning component plays the role of the neocortex, accounting for long-term retention of overlapped knowledge and higher-level computation. The fast learning component plays the role of the hippocampus, building an episodic memory through short-term acquisition of sparse knowledge representations. The two components are connected via the interplay of old memories (i.e. relearning of old input patterns) flowing from the fast-rate component to the slow-rate component. This phenomenon is often called rehearsal.
One of the earliest implementations of a CLS model is presented in [19], where two clearly separated types of weights are adjusted using a fast and a slow learning rate. The model shows that, when previous knowledge is "blurred" by subsequent learning, rehearsal of a few input patterns is sufficient to restore it. This is possible thanks to the fast weights, which reset the slow weights to a previous state in which such knowledge was available. Subsequently, a new learning session begins and the cycle restarts.
Rehearsal techniques can in principle be combined with any other CL strategy, but they are naturally integrated with CLS, as they provide a connection between the components of dual models.
Rehearsal [42] consists in learning again previous input patterns in order to recall past representations. The assumption is that it takes less time (fewer input patterns) to recall a representation than it takes to learn it from scratch. However, in many practical cases there are restrictions on the available memory that prevent the use of rehearsal (for example in an online setting). Since rehearsal is a very useful technique, there exists a workaround, called pseudo rehearsal, to avoid an increasing memory usage.
Pseudo rehearsal [42] is a very simple idea that leverages popular ML models called generative models. In this context, they are trained to approximate the input distribution and then used to reproduce patterns similar to the ones seen during training. The most popular choice among generative models is the
Generative Adversarial Network (GAN) [15]. GANs act as an unsupervised learner (they do not require any target label) of the input distribution, producing patterns very similar to the original ones.
The use of GANs enables rehearsal with only a constant amount of memory (the GAN’s fixed parameters).
2.4
Approaches for CL on sequential data processing
The vast majority of existing CL works deals with computer vision or reinforcement learning environments: popular examples are reinforcement learning mazes (2D mazes in [41] and 3D mazes in [3]), simple artificially generated tasks [34] or, more often, computer vision tasks with datasets like MNIST and its variations (FashionMNIST, PermutedMNIST, ...), CIFAR and ImageNet [52], [2], [32], [51].
A specific dataset for CL on computer vision tasks, CORe50, has been published in [31].
For an overview of popular datasets used in Robotics, where both Computer Vision and reinforcement learning techniques are useful, see [29].
There are very few attempts to apply CL methods to sequential data processing.
One of them is presented by Sodhani et al. in [45], where the authors introduce a new model that merges together ideas from Gradient Episodic Memory (GEM) [32] and Net2Net [7], two existing CL models.
The authors use the Copy dataset, the Associative Recall dataset (introduced in [16]) and the Sequential Stroke MNIST (SSMNIST) dataset [22], training the models on a curriculum of increasingly difficult levels.
The authors use an LSTM extended through the network expansion algorithm introduced in Net2Net. They train the overall architecture with a GEM update, which however requires an explicit memory for old patterns.
Another approach dealing with sequential patterns has been presented in [3], as an extension of the Progressive network architecture. The authors use a recurrent architecture with an external memory and attention mechanisms. The attention slots are dynamically expandable and they are used to tackle NLP tasks on the MultiNLI corpus [49].
A combination of autoencoders, GANs and rehearsal is used in [48] to address audio classification tasks.
Finally, in [24], the Audioset dataset [14] is used in conjunction with non-recurrent models. The Audioset dataset is used for sound classification from multiple audio clips. Even if this is a typical sequential task, the authors use feedforward models to tackle it. In fact, each audio clip in Audioset lasts 10 seconds and is embedded through a VGG-inspired model into 10 fixed-size vectors. The fixed-size representation allows for the use of non-recurrent architectures.
Chapter 3
Proposed approach
One of the main contributions of this thesis is the development of a new recurrent approach targeting the problem of CL for sequential data processing. This chapter describes a new technique to build dynamic RNN architectures, highlighting both the advantages and disadvantages that one has to face when using such models in practice.
3.1
Recurrent Neural Networks
A Recurrent Neural Network is a type of ML model able to encapsulate the past history of the inputs inside its hidden state. The hidden state establishes a temporal dependence between the states of the network at different steps of the computation.
The simplest form of RNN was first proposed by Jordan in [23] and then further explored by Elman in [10]. The network is defined by the following equations, similar to Elman’s version:
h_t = σ(W_hh h_{t−1} + W_xh x_t + b_h)    (3.1)

y_t = σ(W_hy h_t + b_y)    (3.2)

where h_t is the hidden state at time step t, subject to a recurrent connection with h_{t−1}, y_t is the output at time step t, x_t is the input vector, all b and W are adaptive biases and weights, and σ is a nonlinear function applied element-wise.
How much past information an RNN like the one above is able to maintain is a difficult question, at the heart of many problems and inspirations in the development of new recurrent models.
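To make the recurrence concrete, the following is a minimal numpy sketch of Eq. 3.1, using tanh as the element-wise nonlinearity σ; the dimensions and weight scales are illustrative, not taken from the experiments:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of the simple RNN: h_t = sigma(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((4, 3))  # input-to-hidden weights
W_hh = 0.1 * rng.standard_normal((4, 4))  # recurrent weights
b_h = np.zeros(4)

h = np.zeros(4)                            # initial hidden state
for x_t in rng.standard_normal((5, 3)):    # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h now summarizes the whole prefix
```

The final `h` encapsulates the history of the whole input sequence, which is exactly the role of the hidden state described above.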
This work relies on two recurrent architectures as baseline models, namely the Long Short-Term Memory (LSTM) [21] and the Linear Memory Network (LMN) [4].
The first one is a very popular recurrent model, as it has been proven capable of addressing the vanishing gradient problem.
For standard RNNs, in fact, it is difficult to learn long temporal dependencies within a sequence, due to the annihilation of the gradient along the backward pass [21].
The LSTM is able to overcome this problem by introducing gates in the architecture, as follows:

i_t = σ(W_ii x_t + b_ii + W_hi h_{t−1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t−1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t−1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t−1} + b_ho)
c_t = f_t ∗ c_{t−1} + i_t ∗ g_t
h_t = o_t ∗ tanh(c_t)
The equations define respectively the input gate i_t, the forget gate f_t, the cell gate g_t, the output gate o_t, the cell state c_t and the hidden state h_t of the LSTM at each time step t. Each W is a matrix of adaptive parameters with its corresponding bias b, while x_t represents the input at the current time step.
The gates are useful to control the amount of information that flows through the network. The cell state combines the gates in order to prevent the vanishing gradient problem by enabling a backward path of gradient multiplications with norms approximately equal to 1.
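The gate equations can be sketched as a single numpy update step (a didactic re-implementation, not the code used in the experiments; parameter shapes are illustrative):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the gate equations above; p holds the W and b."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(p["Wii"] @ x + p["bii"] + p["Whi"] @ h_prev + p["bhi"])      # input gate
    f = sig(p["Wif"] @ x + p["bif"] + p["Whf"] @ h_prev + p["bhf"])      # forget gate
    g = np.tanh(p["Wig"] @ x + p["big"] + p["Whg"] @ h_prev + p["bhg"])  # cell gate
    o = sig(p["Wio"] @ x + p["bio"] + p["Who"] @ h_prev + p["bho"])      # output gate
    c = f * c_prev + i * g        # cell state: additive path that eases gradient flow
    h = o * np.tanh(c)            # hidden state
    return h, c

rng = np.random.default_rng(0)
p = {}
for g in "ifgo":                  # build toy parameters for the four gates
    p[f"Wi{g}"] = 0.1 * rng.standard_normal((4, 3))
    p[f"Wh{g}"] = 0.1 * rng.standard_normal((4, 4))
    p[f"bi{g}"] = np.zeros(4)
    p[f"bh{g}"] = np.zeros(4)
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), p)
```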
A fixed number of LSTM units (also called LSTM cells) are usually combined together in order to form a recurrent layer. The same approach is then used to produce multi-layered recurrent networks.
The LMN is arguably the first attempt to disentangle the functional component of a recurrent model from its corresponding memory component: current RNNs, in fact, do not clearly distinguish between the two. Even complex architectures, like the Differentiable Neural Computer (DNC) [17], rely on external memories alongside a standard RNN controller; in these models, however, the two components are still mixed.
In the LMN, instead, these different objectives are clearly separated, allowing for a better memory capacity management as well as a more accurate inspection of hidden functional features.
The functional component is implemented by a standard feedforward neural network (with one hidden layer by default), while the memory component is implemented using a linear autoencoder for sequences. The linear autoencoder can be separately pretrained [38] to optimize the final performance, thus exploiting the already mentioned decoupling.
Figure 3.1: Depiction of the LMN architecture with a clear distinction between the functional and memory components and the two alternative types of outputs.
More formally, an LMN can be described as follows:

h_t = σ(W_xh x_t + W_mh h^m_{t−1})
h^m_t = W_hm h_t + W_mm h^m_{t−1}
y^h_t = σ(W_ho h_t)    (LMN-A)
y^m_t = σ(W_mo h^m_t)    (LMN-B)

where h_t is the functional component activation and h^m_t is the memory state (that is, the state of the linear autoencoder for sequences) at time t, strongly intertwined as shown in Fig. 3.1. The W matrices are filled, as always, with adaptive parameters (biases omitted for brevity).
The model output y can be produced in two possible ways: by starting from the functional activation h_t and producing y^h_t (LMN-A), or by starting from the memory state h^m_t and producing y^m_t (LMN-B).
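A minimal numpy sketch of one LMN step may help: the functional activation passes through a nonlinearity, while the memory update is purely linear (dimensions are illustrative; tanh stands in for σ and only the type-B output is shown):

```python
import numpy as np

def lmn_step(x_t, hm_prev, W_xh, W_mh, W_hm, W_mm):
    """One LMN step: nonlinear functional activation, linear memory update."""
    h_t = np.tanh(W_xh @ x_t + W_mh @ hm_prev)  # functional component h_t
    hm_t = W_hm @ h_t + W_mm @ hm_prev          # linear autoencoder memory h^m_t
    return h_t, hm_t

rng = np.random.default_rng(0)
nh, nm, nx = 4, 5, 3                  # functional, memory, and input sizes
W_xh = 0.1 * rng.standard_normal((nh, nx))
W_mh = 0.1 * rng.standard_normal((nh, nm))
W_hm = 0.1 * rng.standard_normal((nm, nh))
W_mm = 0.1 * rng.standard_normal((nm, nm))
W_mo = 0.1 * rng.standard_normal((2, nm))   # output matrix for LMN-B

h, hm = np.zeros(nh), np.zeros(nm)
for x_t in rng.standard_normal((6, nx)):    # a sequence of 6 input vectors
    h, hm = lmn_step(x_t, hm, W_xh, W_mh, W_hm, W_mm)
y_m = np.tanh(W_mo @ hm)                    # LMN-B output from the memory state
```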
3.2
Augmented recurrent models
This section describes a new kind of architecture for CL for sequential data processing introduced by this thesis.
The proposed technique strictly falls within the dynamic expansion approach to CL, explained in Section 2.2. It is largely inspired by the Progressive network architecture, where a new feedforward column is added for each new distribution or task the model has to face.
The proposed model has been called Augmented architecture and, as the name suggests, it does not identify precisely a unique model; instead, it defines a general procedure that can be applied to virtually any kind of RNN.
Supposing the availability of a general RNN model, the construction and training of its Augmented version is detailed in Algorithm 1.
Algorithm 1 Build an Augmented architecture and train it on many distributions
Require: A RNN baseline model, a sequence of distributions D = {D1, D2, D3, ...}
Ensure: |D| > 1
1: Create the first module of the Augmented architecture as a baseline RNN, initialize its parameters and train it on D1.
2: while a new distribution Di is available do
3: Create a new RNN module and initialize its parameters
4: Connect the new module to the previous one
5: Freeze the parameters of the previous module
6: Train the overall architecture on Di
7: end while
8: Test the overall architecture
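Algorithm 1 can be sketched in Python as follows, where `make_module` and `train` are hypothetical callbacks standing in for module construction and gradient-based training (modules are represented as plain dictionaries for illustration):

```python
def build_and_train_augmented(make_module, train, distributions):
    """Sketch of Algorithm 1. make_module() returns a fresh module;
    train(modules, D) updates the non-frozen modules on distribution D."""
    assert len(distributions) > 1          # Ensure: |D| > 1
    modules = [make_module()]
    train(modules, distributions[0])       # first module, trained without restrictions
    for D in distributions[1:]:            # a new distribution D_i is available
        for m in modules:
            m["frozen"] = True             # freeze every previous module
        modules.append(make_module())      # new module, connected to the last one
        train(modules, D)                  # only the new (unfrozen) module learns
    return modules

modules = build_and_train_augmented(
    make_module=lambda: {"frozen": False},
    train=lambda mods, D: None,            # placeholder training step
    distributions=["D1", "D2", "D3"],
)
```

After the loop, every module except the last is frozen, matching the freezing step of the algorithm.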
Even if the algorithm is very general, it describes the main steps of the process and highlights some of its characteristics.
The number of data distributions does not have to be known in advance (but there must be at least 2 different distributions), and the complexity of the model, like in the Progressive networks, grows linearly with the number of distributions.
When learning the first distribution, there is no difference with respect to the training of a standard RNN and the model is free to learn without restrictions. As soon as a new distribution is detected, a new component (called module) is added and connected to the previous one. There are multiple ways in which this connection can be implemented, and they strongly depend both on the type of RNN and on the targeted information flow between modules.
The freezing of the old module’s parameters is required in order to prevent catastrophic forgetting. It withholds previous knowledge and keeps it available for future use through the inter-module connections.
In this thesis, the Augmented architecture technique is applied to the two baseline RNNs described in the previous section. These models are consequently called Augmented-LSTM (A-LSTM) and Augmented-LMN (A-LMN).
The A-LMN is depicted in Fig. 3.2.
The forward passes of the A-LSTM and the A-LMN (type B) are described by Algorithm 2 and Algorithm 3.
The flow of the computation inside an augmented model starts from the first module: it takes the input x and updates its hidden state (for the A-LSTM) or its functional activations and memory state (for the A-LMN).
Then, the result of this update is concatenated with the input x and forwarded to the next module in the chain. The module uses it as the input for the LSTM cells (in the case of the A-LSTM) or for the functional component (in the case of the A-LMN).

Figure 3.2: A-LMN architecture with three interconnected LMN modules
The computation goes on until the last module is reached. At that point, the module also computes the output relative to the current input (the A-LSTM uses a linear layer to compute the output from its hidden state in order to match the output vector dimensionality).
The entire architecture is differentiable end to end; hence traditional backpropagation can be used to compute the gradient of the loss and then to update the parameters with one of the standard methods (like gradient descent).
Algorithm 2 A-LSTM Forward Pass
Require: A-LSTM with N modules, input x
Ensure: N > 1
1: Initialize the hidden state of module 1: h_{1,i}
2: h_{1,f} = A-LSTM_1(x, h_{1,i})  ▷ Compute the final state of module 1
3: for d ← 2, N do
4:   Initialize the hidden state of module d: h_{d,i}
5:   x̂ = [x; h_{d−1,f}]  ▷ Concatenate the previous hidden state with the input
6:   h_{d,f} = A-LSTM_d(x̂, h_{d,i})  ▷ Obtain the final state of module d
7: end for
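The chaining logic of Algorithm 2 reduces to the following sketch, with hypothetical callables standing in for each module's recurrent computation over a whole sequence:

```python
import numpy as np

def augmented_forward(modules, x):
    """Forward chaining of Algorithm 2: module 1 sees x, every later
    module d sees the concatenation [x; h_{d-1,f}] of the input and the
    previous module's final state."""
    h = modules[0](x)                         # module 1 processes the raw input
    for module in modules[1:]:
        h = module(np.concatenate([x, h]))    # feed [x; h_prev] to module d
    return h

# toy modules: affine maps standing in for full recurrent modules
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((4, 3))        # module 1: input dim 3 -> state dim 4
W2 = 0.1 * rng.standard_normal((4, 7))        # module 2: input dim 3 + state dim 4
h_final = augmented_forward(
    [lambda v: np.tanh(W1 @ v), lambda v: np.tanh(W2 @ v)],
    rng.standard_normal(3),
)
```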
Algorithm 3 A-LMN Forward Pass
Require: A-LMN with N modules, input x
Ensure: N > 1
1: Initialize the memory state of module 1: h^m_{1,i}
2: h^m_{1,f}, h_{1,f} = A-LMN_1(x, h^m_{1,i})  ▷ Compute the final memory and functional state of module 1
3: for d ← 2, N do
4:   Initialize the memory state of module d: h^m_{d,i}
5:   x̂ = [x; h^m_{d−1,f}; h_{d−1,f}]  ▷ Concatenate the previous functional and memory states with the input
6:   h^m_{d,f}, h_{d,f} = A-LMN_d(x̂, h^m_{d,i})  ▷ Compute the final memory and functional state of module d
7: end for
8: y^m_N = σ(W^N_mo h^m_{N,f})  ▷ Output of the final module

But how does this architecture relate to catastrophic forgetting?
During learning, if the weight update were propagated down to the first module, catastrophic forgetting would surely happen: the update would disrupt knowledge stored in the previous modules’ parameters.
In order to prevent forgetting, the parameters of the previous modules are conveniently frozen and never updated again, making each module an expert of its own domain. During each learning session only the top module is updated, while the remaining ones are kept fixed.
The contributions of the previous modules, however, are not completely destroyed: during the forward pass, by exploiting the inter-module connections, each module can leverage the final representation computed by the previous one, including previous experts’ knowledge in the current module’s computation. Similarly to what happens in the Progressive network, this process can lead to transfer learning between modules.
The Augmented architecture is consistent with the CL metrics described in Section 1.3. In fact, it is possible to construct the train-test accuracy matrix R from the sequential training on all the tasks. Moreover, the Negative Transfer Learning (NTL), the novel metric introduced in Section 1.3, can be easily included in this setting: through the matrix R it is possible to compute the NTL by training a single baseline RNN one more time on each task, separately. The progressive approach of the Augmented architecture, in fact, learns the first task in the same way as the corresponding baseline model.
The normalization factor, that is, the performance obtained by a random classifier, is easily computed by using a randomly initialized baseline model for inference.
The Augmented architecture is simplified with respect to the Progressive network. In the latter, in fact, there is an additional component, the adapter, that manages each connection with a feedforward network. Other adaptive parameters are employed in order to improve initial conditions and perform dimensionality reduction at the same time. The adapters, however, require each column to be connected to every following one, making the model very expensive to build and train. In the Augmented architecture the connections exist only between neighboring modules, thus mitigating this cost.
The use of Augmented architectures leads directly to another problem, shared again with the Progressive network model: by using experts for each session that do not learn outside their domain, at test time there is the need to choose the right module for inference (or the right column in the case of the Progressive model), since any other module would not be able to achieve good performance.
In the Progressive network the authors state clearly the problem, but leave the solution to future works.
Another contribution of this thesis is the integration of the Augmented architectures with an approach to select the correct module at run time. The approach has been introduced in [2] for feedforward experts in a CL setting: this thesis represents the first attempt to apply it on top of recurrent architectures.
It proceeds by associating to each expert (RNN module) an autoencoder that is trained to reconstruct the input data during each learning session. Hence, during the first session there is only one expert and its associated autoencoder, which jointly learn to address the task (the expert) and to reconstruct the input (the autoencoder). When a new session comes, another expert, together with its own autoencoder, is created and trained on the new input distribution. The previous autoencoders are never trained again. At the end of the process there is a set of autoencoders, each an expert in reconstructing a certain type of input data. The training and test phases of the gating autoencoders are depicted respectively in Fig. 3.3 and Fig. 3.4 for the A-LSTM. The A-LMN follows the same process.
The use of sequence autoencoders has also been proven useful as a pretraining procedure for recurrent neural networks (in particular the LSTM) [8]. In this thesis, however, this approach has not been followed due to the online nature of a CL scenario.
A standard (feedforward) autoencoder reconstructs the input by using a hidden layer of units, where the number of hidden units is usually much smaller than the input dimension. This ensures the avoidance of trivial transformations (i.e. the identity). Alternatively, it can be highly regularized to learn meaningful manifolds. The equation representing the process is the following:
r(x) = W_o σ(W_i x)    (3.3)

where r(x) is the reconstructed input and the W are the usual matrices of parameters.
The loss associated to the autoencoder is the reconstruction loss, which is often a simple Mean Squared Error (MSE): (1/N) Σ_{i=1}^{N} (r(x)_i − x_i)^2.
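A numpy sketch of Eq. 3.3 and the associated MSE reconstruction loss (a sigmoid is assumed for σ; shapes are illustrative):

```python
import numpy as np

def reconstruct(x, W_i, W_o):
    """Eq. 3.3: r(x) = W_o sigma(W_i x), with a sigmoid as sigma."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    return W_o @ sigma(W_i @ x)

def reconstruction_loss(x, W_i, W_o):
    """MSE reconstruction loss: (1/N) sum_i (r(x)_i - x_i)^2."""
    r = reconstruct(x, W_i, W_o)
    return np.mean((r - x) ** 2)

rng = np.random.default_rng(0)
W_i = 0.1 * rng.standard_normal((2, 6))   # encoder bottleneck: 6 -> 2 units
W_o = 0.1 * rng.standard_normal((6, 2))   # decoder: 2 -> 6
loss = reconstruction_loss(rng.standard_normal(6), W_i, W_o)
```

The bottleneck (fewer hidden units than inputs) is what rules out the trivial identity transformation mentioned above.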
Figure 3.3: Augmented architecture together with gating autoencoders at training time. The last autoencoder is the only one receiving the input x_t and it is trained to reconstruct it. The A-LMN receives the input x_t (forwarded to all modules) and it is trained to map it to the corresponding output.
Figure 3.4: Augmented architecture together with gating autoencoders at test time. The input is presented to all the autoencoders. In this case, the second autoencoder AE2 obtains the minimum reconstruction error and selects the corresponding A-LSTM module. The input is processed by the A-LSTM up until the second module, where the final output is produced.
With sequential data however, a feedforward autoencoder does not suffice and a recurrent version must be used.
In this thesis the autoencoders are LSTM autoencoders: sequence-to-sequence models where the encoder takes the entire sequence in input, updates its state and then passes it to the decoder. The decoder has the objective of producing the original sequence in output (using a linear layer to produce the output from the final hidden state).
Finally, at test time, without any knowledge about the input distribution, each pattern is fed to all the autoencoders. The module associated with the autoencoder that obtains the minimum reconstruction error is used to produce the output.
The steps of the computation are outlined in Algorithm 4.
Algorithm 4 Training and testing of the LSTM autoencoders, including output selection from the Augmented architecture
Require: A sequence of distributions D = {D1, D2, D3, ...}
Ensure: |D| > 1
1: l_ae ← []  ▷ Create empty list
2: while a new distribution D_k is available do
3:   LSTM_enc, LSTM_dec = init-autoencoder()
4:   l_ae.append((LSTM_enc, LSTM_dec))  ▷ Append new autoencoder
5:   Initialize the hidden state of the encoder h_{enc,i}
6:   for training batch x ∈ D_k do
7:     h_{enc,f} = LSTM_enc(x, h_{enc,i})  ▷ Compute encoder state
8:     h_{dec,f} = LSTM_dec(0, h_{enc,f})  ▷ Compute decoder state
9:     y = W_f h_{dec,f}  ▷ Compute output
10:    J = MSE(x, y)  ▷ Compute reconstruction loss
11:    Take an optimization step using ∂J/∂w computed with backpropagation
12:   end for
13: end while
14: for test batch x ∈ D do
15:   l_rec ← []  ▷ Create empty list
16:   for (LSTM_enc, LSTM_dec) ∈ l_ae do
17:     y ← output of the current autoencoder on x
18:     l_rec.append(MSE(x, y))  ▷ Append the reconstruction loss
19:   end for
20:   m = arg min l_rec  ▷ Retrieve the minimum-loss autoencoder’s index
21:   Select the output produced by the m-th module of the Augmented architecture
22: end for
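The test-time selection (lines 14-21 of Algorithm 4) boils down to an argmin over reconstruction errors, as in this sketch with toy stand-in autoencoders:

```python
import numpy as np

def select_module(x, autoencoders):
    """Feed x to every autoencoder and pick the index with minimum MSE.

    `autoencoders` are assumed callables mapping an input to its
    reconstruction; the returned index selects the Augmented module."""
    errors = [np.mean((ae(x) - x) ** 2) for ae in autoencoders]
    return int(np.argmin(errors)), errors

# toy autoencoders: the first outputs zeros, the second reconstructs perfectly
x = np.array([1.0, 2.0, 3.0])
idx, errs = select_module(x, [lambda v: np.zeros_like(v), lambda v: v])
```

Here the second autoencoder achieves zero reconstruction error, so its associated module would be selected for inference.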
If the gating autoencoders succeed in learning the input manifolds, then, at test time, they will be able to autonomously distinguish between patterns. In this way, the corresponding expert is selected to compute the overall network prediction without the need of external supervision, resulting in a robust generalization to unseen examples.
Finally, by using the reconstruction error values it is possible to compute a relatedness measure between the data distributions modeled by the autoencoders. Given a new task T_k, an old task T_a and their associated autoencoders, the relatedness is computed by using the reconstruction errors Er_k and Er_a made by the autoencoders on validation data of task T_k.
More formally:

Rel(T_k, T_a) = 1 − (Er_a − Er_k) / Er_a    (3.4)

The relatedness measure is not symmetric. It is computed only between a new task and all the previous ones.
The formulation of the relatedness presented in Eq. 3.4 is a slightly modified version of the one presented in [2]. More precisely, the denominator is changed from Er_k to Er_a. This modification allows for a better normalization of the relatedness coefficient, since it uses the error that is likely the larger of the two. The experiments carried out with the original version, in fact, resulted in inconsistent values of the relatedness (large negative values were often present).
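Eq. 3.4 is a one-liner; the toy values below illustrate its behavior at the two extremes (identical errors give relatedness 1, a much larger old-task error gives a value near 0):

```python
def relatedness(err_new, err_old):
    """Eq. 3.4 with the modified normalization: 1 - (Er_a - Er_k) / Er_a.

    err_new is Er_k (the new task's autoencoder), err_old is Er_a (the old
    task's autoencoder), both measured on validation data of the new task."""
    return 1.0 - (err_old - err_new) / err_old

r_same = relatedness(0.10, 0.10)    # old autoencoder reconstructs equally well
r_far = relatedness(0.10, 1000.0)   # old autoencoder fails badly on the new data
```

`r_same` equals 1 (fully related tasks), while `r_far` is close to 0 (unrelated tasks), matching the intended semantics of the measure.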
Chapter 4
Experiments
Experiments have been performed on 3 different datasets: the synthetic Copy dataset (Sec. 4.1), the Sequential Stroke MNIST dataset (Sec. 4.2) and the Audioset dataset (Sec. 4.3).
Each dataset is used in one or more tasks and, throughout all the experiments, each task has been partitioned into a fixed number of subtasks (or learning sessions). The models are trained sequentially, one subtask after the other, in order to implement a CL environment where the input distribution changes between each learning session.
The subtask order is fixed, but it was chosen at random when designing the experiment.
Finally, after the training phase is completed, the models are tested on all the subtasks, in order to estimate their ability to overcome the catastrophic forgetting problem.
All the learning curves presented in this chapter show both the training curve (solid line) and the validation curve (dotted line). Each subtask is identified by its own color.
4.1
Copy Task
4.1.1
Task Description
The Copy Task (introduced in [16]) is a synthetic task whose objective is the reconstruction of sequences of binary vectors: the models receive a batch of sequences in input and they learn to reproduce them after having seen all the vectors in the sequences, without any intermediate help.
A delimiter vector is used to signal the end of each sequence, by using an additional bit which is always zero, except for the last vector of the sequence. The task is structured in a curriculum learning fashion [6], with the sequence length becoming increasingly longer as the task progresses.
The Copy Task is not a proper CL benchmark, since the distribution of the input data does not change across the subtasks involved. Nonetheless, it has been used in this work as it is a convenient and widely adopted method to evaluate the memory capacity of a model.
This task is important as a preliminary test to measure the relative performance achieved by each model and to compare it with respect to all the others. It is also important to notice that, from the results reported on this task, it is not possible to draw any conclusion about the memory capacity the models may exhibit on another task.
Due to the nature of the task, in this context it does not make sense to use the gating autoencoders to autonomously recognize each subtask, since the objective of the Copy Task is precisely that of reconstructing the input sequence. Hence, the autoencoders would play the role of the recurrent models and completely solve the problem.
4.1.2
Experiments settings
The experiment settings used for the Copy Task are listed below:
1. Each vector in the sequence has 8 binary values (either 0 or 1). The last bit is always 0, except for the delimiter vector which is a vector of zeros with the last bit set to 1.
2. The Copy Task has been partitioned in 4 subtasks in which the input sequence length is progressively increased from sequences with 5 ± 2 vectors to sequences with 26 ± 2 vectors with steps of 7 vectors.
3. The training loop scans the subtasks sequentially, moving to the next only after a fixed number of training steps. The dataset is generated dynamically at each step with a fixed mini-batch size. The sequence length for each step is drawn at random according to the lengths corresponding to the current subtask.
4. The loss function used is the Binary Cross Entropy (BCE) loss and the optimizer is RMSProp with learning rate of 3e-5, momentum of 0.9 and no weight decay. The gradient clipping is set to 5.0. The mini-batch size is 3. The metric used to evaluate the performance is the accuracy (percentage of correctly classified items).
5. The LSTM and A-LSTM use 256 hidden units, while the LMN and the A-LMN use 126 memory units and 126 feedforward units. With this setting the models are better comparable since they use the same number of units.
6. For what concerns the LMN and the A-LMN, the memory matrices have been initialized as orthogonal, due to the benefits on the norms of the matrices explained in [47]. Moreover, in order to constrain the memory matrices to remain approximately orthogonal during training, the BCE loss has been augmented with an additional penalty term that penalizes non-orthogonal memory matrices: λ‖W W^T − I‖. The hyperparameter λ associated with this penalty term has been set to 0.1.

Figure 4.1: Accuracy (left) and loss (right) for LSTM on Copy Task. 120,000 steps, metrics reported every 200 steps.
Figure 4.2: Accuracy (left) and loss (right) for LMN on Copy Task. 120,000 steps, metrics reported every 200 steps.
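The orthogonality penalty from setting 6 can be sketched as follows (a Frobenius norm is assumed for ‖·‖; a 2×2 rotation matrix serves as the exactly orthogonal example):

```python
import numpy as np

def orthogonality_penalty(W, lam=0.1):
    """Penalty lam * ||W W^T - I|| added to the BCE loss to keep the
    memory matrix approximately orthogonal during training."""
    return lam * np.linalg.norm(W @ W.T - np.eye(W.shape[0]))

# an exactly orthogonal matrix (a rotation) incurs (numerically) zero penalty
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
```

Scaling `R` by any factor other than ±1 breaks orthogonality and makes the penalty strictly positive, which is exactly the behavior the regularizer relies on.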
4.1.3
Results
The LSTM exhibits the worst performance among all the models, as shown in Fig. 4.1. Long sequences (with at least 19 vectors) are a challenge that the LSTM is not able to tackle appropriately.
The LMN (Fig. 4.2) obtains a better performance, even if its learning curves are noisier than the LSTM’s. For sequences with at least 26 vectors, however, the model is not able to reach 100% accuracy.
This behavior occurs also if an LSTM or an LMN is trained directly starting from sequences of the proper length (26). The models are not able to learn the task and they obtain an accuracy around 60%, even with more hidden units (tested up to 1024).
The augmented models, on the contrary, are able to reach 100% accuracy (or values near it) on all the subtasks (Fig. 4.3, Fig. 4.4). The main difference between A-LSTM and A-LMN lies in the convergence speed: while the A-LSTM requires 350,000 steps to be trained, the A-LMN needs only 120,000 steps.

Figure 4.3: Accuracy (left) and loss (right) for A-LSTM on Copy Task. 350,000 steps, metrics reported every 350 steps.
Figure 4.4: Accuracy (left) and loss (right) for A-LMN on Copy Task. 120,000 steps, metrics reported every 200 steps.
The augmented models, being the best on the task, have also been trained with longer sequences in order to stress their memory capabilities (Fig. 4.5). From the experiment on longer sequences, it appears even more evident that the A-LMN learns to reach full accuracy faster than the A-LSTM.
In order to prevent the accuracy and loss peaks shown by the LMN and by its augmented version on long sequences, the orthogonality-preserving BCE loss can be further extended with a penalization on the hidden state norm [26]:
(β/T) Σ_{t=1}^{T} (‖h_t‖_2 − ‖h_{t−1}‖_2)^2    (4.1)

where the hidden state h_t is collected for every input step t.
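Eq. 4.1 can be sketched as follows, averaging over the consecutive pairs of collected hidden states (how the t = 1 boundary term is handled is an assumption here):

```python
import numpy as np

def norm_penalty(hidden_states, beta=0.2):
    """Eq. 4.1: (beta/T) * sum_t (||h_t|| - ||h_{t-1}||)^2, computed over
    the rows of `hidden_states` (one hidden state per input step)."""
    norms = np.linalg.norm(hidden_states, axis=1)       # ||h_t|| for each step
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)

# states whose norm stays constant incur zero penalty, as intended:
hs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])    # all norms equal 1
pen = norm_penalty(hs)
```

The penalty grows only when the hidden state norm changes between consecutive steps, which is what suppresses the peaks observed in the learning curves.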
The result on the hidden state norm is shown in Fig. 4.6 for a value of β of 0.2.
In the Copy Task it is not appropriate to talk about the catastrophic forgetting problem, as the input distribution remains the same across all learning sessions (Bernoulli with p = 0.5); what changes is only the sequence length.
Figure 4.5: Accuracy for A-LSTM (left) and A-LMN (right) on Copy Task on longer sequences. The A-LSTM has been trained for 450,000 steps, metrics reported every 450 steps, while the A-LMN for 200,000 steps, metrics reported every 200 steps.
Figure 4.6: Norm of the hidden state for LMN (128 hidden units for both memory and functional components) during training without any penalization (left) and with both the penalizations on large hidden state norms and non-orthogonal matrices (right). 80,000 steps, metric reported every 200 steps.
Changes in the sequence length are a challenge for the baseline and augmented models, and more in general for RNN models, which are not able to address older subtasks, obtaining a 50% accuracy, on par with a random classifier.
4.2
Sequential Stroke MNIST Task
4.2.1
Task Description
Sequential Stroke MNIST (SSMNIST) [22] is a variation of the popular MNIST dataset of handwritten digits. In this version, each digit is represented as a sequence of pen strokes, where each pen stroke is represented by two coordinates (x and y, scalars), an end-of-stroke bit eos and an end-of-digit bit eod. The first element of the sequence represents the first stroke’s starting point. The rest of the sequence values are −1, 0 or 1, representing atomic offsets from the current position.
The average length of the sequences is 40 items, resulting in sequences much longer than the ones of Copy Task.
The SSMNIST Task allows for the use of the gating autoencoders in order to autonomously recognize the different incoming subtasks at test time. In this way, the appropriate module of the augmented architectures can be selected to tackle the corresponding classification problem.

Figure 4.7: A single digit sequence is classified into its corresponding digit class. The vertical segments correspond to a single input vector (x, y, eos, eod). The last segment is the eod, indicating the end of the digit sequence. The eos vectors are not explicitly marked.

In fact, all the augmented models use a different output layer for each learning session; hence it is not possible for a submodule to correctly classify any input pattern that does not belong to the learning session it has been trained on. The autoencoders are trained jointly with the classifiers, receiving the same input patterns for the same number of training steps.
The SSMNIST dataset has been used for this work on a variety of different experiments, each of which is described in the following sections. In all the experiments, x and y features have been normalized with zero mean and unit variance.
4.2.2
Single digit classification
The purpose of this task is to accurately classify a single digit by using its entire sequence representation (Fig. 4.7).
It is useful to validate, on simple input patterns, both the classification performance of the models and the gating autoencoders’ capability of recognizing the current subtask.
In these experiments, each subtask is assigned a set of fixed, non-overlapping digit classes from which to draw the input patterns. Hence, when switching from one subtask to the next, the input distribution suddenly changes from a distribution representing a certain set of digits to one representing a completely different set.
Experiments settings
The experiments have been carried out by creating 5 different subtasks, each of which covers 2 classes, in order to include the entire dataset of 10 digits. The subtasks, numbered from 1 to 5 (following the order of the training phase), are associated to the digit classes according to the list below:
1. { 2, 3 }
2. { 0, 1 }
3. { 4, 5 }
4. { 6, 7 }
5. { 8, 9 }
Hence, during each subtask, a single digit is selected from the current set and its sequence representation is processed by the models. The output is the class corresponding to the input sequence.
The details of the experiment are outlined below:
• LSTM and A-LSTM with 64 hidden units and 1 hidden layer
• LMN and A-LMN with 64 hidden units both for the memory component and the functional component
• Each output layer has 10 units (on all subtasks), corresponding to the digits
• RMSProp optimizer with learning rate of 3e−5, L2 regularizer of 1e−3, gradient clipping with norm 5.0 and momentum of 0.9
• Mini-batch size of 4 sequences
• The LSTM autoencoders use 40 hidden units for both the encoder and the decoder; they are optimized with the Adam optimizer, learning rate of 1e−4 and L2 decay of 1e−3.
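The optimization settings listed above can be sketched in PyTorch as follows. This is a hedged sketch: the `torch.nn.LSTM` below is a stand-in for the actual models (which include the augmented variants), and the input size of 4 is an assumption, not taken from the text.

```python
import torch

# Stand-in recurrent model: 64 hidden units, 1 hidden layer.
model = torch.nn.LSTM(input_size=4, hidden_size=64, num_layers=1)

# RMSProp with learning rate 3e-5, L2 regularization 1e-3 and
# momentum 0.9, as listed in the experiment settings.
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=3e-5, weight_decay=1e-3, momentum=0.9)

def training_step(loss):
    """One optimization step with gradient clipping at norm 5.0."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```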
Results
The single digit experiment shows that classifying a digit is easy for both the baselines and the augmented models, since the accuracy almost always reaches the top value of 100%, as shown in Fig. 4.8.
The gating autoencoders worked correctly: on all runs, the minimum reconstruction error was always associated to the autoencoder in charge of the subtask presented in input. As a direct consequence, the performance on the test set after training on all subtasks is the same as the one on the validation set, achieving nearly 100% on all learning sessions.
The baseline models, on the contrary, are strongly subject to catastrophic forgetting: they maintain the validation-set performance only on the last subtask, while dropping to an accuracy of 0% on all the previous ones.
The drastic drop in accuracy reported by the baseline models is caused by the fact that each model only learns to classify a subset of the possible classes: when trained on a subtask, the model strongly overfits the input patterns and learns to use only the two output units related to the current subtask. When a new learning session starts, it completely destroys the previous knowledge.
Figure 4.8: Accuracy of the models (a) A-LMN, (b) A-LSTM, (c) LMN and (d) LSTM on the SSMNIST task with single digits over 50,000 steps. Accuracy reported every 1,000 steps.
Hence, only the output units related to the final subtask are used at test time. The learning curves of the autoencoders are shown in Fig. 4.10. Table 4.2 reports the absolute values of the reconstruction errors of each of the 5 autoencoders on each of the 5 subtasks (training lasted for 50,000 steps). Table 4.3 reports the relatedness values between subtasks.
The only poor performance is related to the LMN architecture: it achieves 50% accuracy not only on the third subtask, as shown in Fig. 4.8, but also on the first one, if learned third in the sequence.
The final accuracy achieved by the LMN largely depends on the previous subtasks. By restricting the analysis to the training of three subtasks, there are 5 × 4 × 3 = 60 possible ordered combinations. Table 4.1 shows the behavior obtained in 9 of the 60 possible cases. They are already sufficient to give a fair overview of the phenomenon: in some cases the LMN correctly learns subtasks 1 and 3, while in others it does not.
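The count of ordered triples of distinct subtasks can be verified directly:

```python
from itertools import permutations

# Ordered selections of 3 distinct subtasks out of 5: 5 * 4 * 3 = 60.
orders = list(permutations([1, 2, 3, 4, 5], 3))
print(len(orders))  # → 60
```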
This behavior is a concrete realization of the Negative Transfer phenomenon and shows how it can impact learning new tasks when starting from an already existing representation.
This result also represents the first example of how the augmented models (A-LMN, in this case) are able to overcome negative transfer by relying on a modular architecture.
Order    1       2       3
1-2-3    95%     100%    50%
1-3-2    100%    100%    100%
1-3-4*   95%     100%    100%
1-3-5    100%    100%    100%
3-2-1    100%    100%    50%
3-4-5    100%    100%    100%
4-5-1    100%    100%    100%
4-5-3*   100%    100%    100%
5-4-3    100%    100%    50%
Table 4.1: Validation accuracy obtained by an LMN when training on 3 subtasks and switching their order between different experiments. Each row indicates the order of the training subtasks with the achieved accuracies. The columns refer to the position in the training order.
The * sign means that the subtask achieved that accuracy at the end of training, but it maintained a 50% accuracy for a long time during the training phase. The results show a concrete realization of the Negative Transfer Learning phenomenon, associated to the NTL metric.
        AE1      AE2      AE3      AE4      AE5
T1      0.1984   0.3901   0.2501   0.2019   0.2602
T2      0.2842   0.0484   0.2140   0.2804   0.2250
T3      0.1722   0.1955   0.0931   0.1732   0.1131
T4      0.2823   0.3574   0.2335   0.1028   0.1991
T5      0.1753   0.1988   0.1194   0.1706   0.0881
Table 4.2: Absolute reconstruction errors of all the LSTM autoencoders on the subtasks of SSMNIST on single digit.
In these starred cases, the subtask achieves 100% accuracy from a certain point on, while at the beginning of training it maintains an accuracy on par with that of a random classifier.
4.2.3 Multi digits classification
The multi digits experiment focuses mainly on the autonomous subtask selection performed via the gating autoencoders.
The input to the models is a sequence of concatenated digits and the expected output is the corresponding sequence of digit classes (Fig. 4.11).
The fact that each subtask draws sequences of digits from a set of 3 digit classes makes this task more challenging for the autoencoders with respect to the single digit one. In fact, in the single digit task the sequences are composed of 1 digit only, and each subtask draws input patterns from a set of 2 digits. Hence, the input distribution of the multi digits task is much more heterogeneous. The classification aspect of the task, however, is similar to the one of the single digit task: in this context, in fact, the models are provided with additional
Figure 4.9: Accuracy for the subtask order 4-5-3, described in Table 4.1
Figure 4.10: Autoencoders reconstruction errors during training on all of the 5 subtasks.

Subtasks   T1       T2       T3       T4       T5
T1         1.0000
T2         0.1703   1.0000
T3         0.5407   0.4762   1.0000
T4         0.3642   0.2876   0.4403   1.0000
T5         0.0102   0.4432   0.7379   0.5164   1.0000
Table 4.3: Relatedness values between subtasks on SSMNIST dataset with single digits experiment.