
DEPARTMENT OF COMPUTER SCIENCE

Master Degree in Computer Science

Master Thesis

Incremental pretraining of multi-resolution memory networks

Author: Diego Giorgini

Supervisor: Prof. Davide Bacciu

Abstract

In the context of temporal sequences and Recurrent Neural Networks, the vanishing gradient and the need to discover and memorize long-term dependencies and hierarchical information are actively studied problems, but they may also lead to overly complicated networks. Some researchers have therefore decided to separate concerns in order to control such complexity. We combine the Linear Memory Network, which conceptually separates the functional input-output transformation from the memory capabilities, with the Clockwork Recurrent Neural Network, which better memorizes dependencies at different resolutions thanks to dedicated modules. We call this new model the Clockwork Linear Memory Network. We also develop an incremental pretraining algorithm for this model as an extension of the pretraining algorithm available for the memory component of Linear Memory Networks, in which we incrementally add and train one memory module at a time. We show that our model outperforms related models from the literature, such as gated networks, in tasks of sequence generation on signals and spoken word recognition, and that the pretraining algorithm provides better performance, improved training stability and possibly lower training times.

Contents

1 Introduction
1.1 Thesis outline

2 Background
2.1 Notation
2.2 Machine Learning
2.3 Artificial Neural Networks
2.4 Recurrent Neural Networks
2.5 Clockwork RNN
2.6 Linear Autoencoder for Sequences
2.7 Linear Memory Networks (LMN)
2.8 Other related work

3 Clockwork LMN
3.1 Model architecture
3.2 Pretraining

4 Experiments
4.1 Sequence Generation
4.2 Spoken Word Classification

5 Conclusions
5.1 Future work

Bibliography
List of Figures
List of Tables
List of Acronyms

1 Introduction

Machine Learning (ML) [29] is becoming one of the most important areas of Artificial Intelligence. It consists of using statistical models and inference to solve problems for which we have neither an equation nor 'specific instructions'. Very often Machine Learning is applied to difficult tasks previously tackled with other methods, obtaining better performance with less prior knowledge about the particular task. It has gained interest from the public and from companies that realized its importance and want to incorporate it in order to gain an advantage over their competitors, and some of its applications are starting to appear in our daily lives.

ML models are able to learn complex dependencies in the data or learn hierarchical representations of information, as certain problems can only be solved by constructing higher and higher abstractions. For example, in vision we do not directly see a face as it is: the part of our brain responsible for recognizing faces probably starts from colors, lighter and darker areas and shades, detects edges and depth, recognizes parts of objects, then puts things together in order to understand that the big object in front of it is a face, and finally compares its features with those in memory in order to recognize it as friend or stranger. Many problems need systems designed this way. For instance, Convolutional Neural Networks (CNNs), which work well in machine vision tasks [44, 35], do not need prior knowledge of which filters are best (previously these were hand-crafted), but learn them by themselves, creating higher-level abstractions at each layer.

An important class of problems requires processing, identifying patterns in, and making predictions from sequential or temporal data, i.e. data with sequential or temporal dependencies among its components. Data coming from physical sensors, sequences of images and speech are only a few of the many examples. Recurrent Neural Networks (RNNs) are the paradigm of Artificial Neural Networks (ANNs) that deals with this kind of problem. Since required pieces of information may be far from each other, and since temporal data may also have an inherently hierarchical nature, we need to research new models with this in mind.

This is why the issues in training RNNs, such as the vanishing gradient or, more generally, the difficulty of learning long-term dependencies, are crucial.

A well-known, standard line of research is the one dealing with gates, such as Long Short-Term Memory (LSTM) [17] and Gated Recurrent Units (GRUs) [10], which allow certain information to flow forward and other information to be forgotten, and the gradient to flow backward less degraded.

A more recent line of research regards hierarchical models and is focused on the problem of finding hierarchies in temporal data. The authors of [12] observed that information from higher hierarchies changes more slowly within the temporal sequence, hence some models have been devised with the aim of separating information at different resolutions into different parts of the architecture. Whereas CNNs find these hierarchies through depth, RNNs find them through different update speeds, as they are naturally deep networks with respect to time within the sequence. The model we consider in this thesis is the Clockwork Recurrent Neural Network (CW-RNN) [23], which employs a multi-scale architecture.

The second idea from which this thesis originated is model complexity. Researchers often create increasingly complicated models that solve a certain task better and better; making things more complex is sometimes necessary, but it is also sometimes good to find ways to separate components in order to make each of them simpler. This way they may work even better (as we will show) and be conceptually more understandable. Indeed, the work we started from is the one introducing Linear Memory Networks (LMNs) [2], in which the functional task is conceptually separated from the memory task: the former consists in mapping the input to an output, the latter in retrieving the required information from the memory of past states.

Finally, training complex networks is difficult, so apart from several techniques that may help the optimization algorithm, there is also the approach of pretraining the network so that training starts from a point in parameter space closer to the solution, with the goal of making learning faster, more stable and able to reach better solutions. Pretraining was initially used to train Deep Belief Networks (DBNs) [16, 6], when it was difficult to train them directly: after pretraining the layers pairwise, the network was fine-tuned with backpropagation. Something similar can be done for sequential data, so we continue the work done in the context of RNNs with LMNs in [2], in which a pretraining scheme for the memory component of the network was devised.

In brief, the contributions of this thesis can be summarized as follows. We took the conceptually simpler architecture of the LMN and enhanced its memory with the multi-scale architecture of CW-RNNs, which makes it able to internally separate the representation of information at different resolutions, i.e. probably at hierarchically different heights, with the same number of learned parameters, because keeping the resolutions separated decreases internal interference. This may seem counterintuitive when thinking about complexity, but we think that modelling increasingly abstract information may require an architecture that mimics it. We also built an incremental pretraining algorithm specific to our multi-resolution model; it is incremental in the sense that we build the memory piece by piece: at each step we initialize a new memory module with pretrained weights and then train it. We found that we were able to increase performance on a speech recognition task, a subtask of TIMIT [13], without increasing the number of learned parameters with respect to well-accepted models such as LSTMs.

1.1 Thesis outline

In Chapter 2 we introduce useful background material, such as our notation, an introduction on Machine Learning and Neural Networks, a summary of the most important models for understanding the rest of this thesis and brief coverage of some other related works which share some of our goals.

In Chapter 3 we define our new model, its architecture, how to build it efficiently by putting together all weights from different memory modules into a single matrix, and we describe our incremental pretraining algorithm.

In Chapter 4 we show experimental results on two tasks: a sequence generation one, and a speech recognition one. We compare our models, with and without pretraining, with the basic RNNs, the standard LSTMs, simpler multi-resolution CW-RNNs, and LMNs.

Finally, in Chapter 5 we conclude this thesis with a discussion of the contributions of this work, as well as an analysis of the advantages and disadvantages of the proposed model, and we discuss some future work that could be done to enhance it.

2 Background

This chapter covers some background material useful to understand the content of this thesis. We start with some notation in Section 2.1 and an introduction to Machine Learning in Section 2.2. We then introduce Neural Networks in Section 2.3 and continue with a brief coverage of Recurrent Neural Networks in Section 2.4, because all the models we show here belong to that class of Machine Learning models. We then present Clockwork Recurrent Neural Networks in Section 2.5. To understand an interesting pretraining technique we cover Linear Autoencoders for Sequences in Section 2.6 and Linear Memory Networks in Section 2.7; we are especially interested in these, as our own model is an attempt at extending Linear Memory Networks with a multi-resolution memory. Finally, in Section 2.8 we briefly discuss interesting related works.

2.1 Notation

This work is all about Artificial Neural Networks, so we need to define the required notation about nodes, layers, and hidden states.

With x, h and y we refer respectively to the input vector, the hidden state vector and the output vector of the network. We use superscript numbers and letters to refer to a vector at a specific timestep, for instance h^0, h^1, h^t, while we use subscript numbers to refer to a particular hidden state when there are multiple ones, such as h_0, h_1, h_g. The term m is used to refer to the memory, a particular type of hidden state.

Uppercase letters such as A, B, V, Ξ represent matrices. In particular, W^{xh} and W^{ho} represent learnable weight matrices associated respectively with the connections from the input layer to the hidden layer and from the hidden layer to the output layer. We refer to a single weight, associated with the connection from neuron x_i to neuron h_j, with w^{xh}_{ij}.

We use N_x, N_h, N_m, N_o to indicate dimensions related to neural network layers and lowercase letters like g, l, k, b to indicate other quantities. We use I_{N_x}, 0_{N_x} to indicate identity and zero matrices of specific dimensions.

We use σ to indicate a non-linear activation function such as tanh. With N(0, ε) we refer to the normal distribution with mean 0 and small variance.

2.2 Machine Learning

With Machine Learning (ML) [29, 28] we refer to the area of Artificial Intelligence that naturally intersects Computer Science and Statistics. Computer Science builds computer systems that solve problems and studies their tractability, while Statistics, given some modeling assumptions, focuses on what can be inferred from data and how reliably. As a consequence, ML seeks to build computer systems that, instead of being manually programmed, program themselves: systems that learn how to infer information from data, changing their behavior along the way; it also studies the fundamental laws that govern learning processes.

Mitchell defines a learning task this way:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

ML is particularly useful on hard problems for which a rigorous mathematical explanation (or simply an equation) is not known; in these cases we try to build a mathematical model that, relying on patterns and inference, is able to solve the problem for us, without having to identify an explicit solution.

ML can be divided into different categories:

• Supervised Learning: the training data consists of input-output pairs. The learning system tunes the parameters of the function it is optimizing by responding to the error on the outputs it predicts wrongly. Input data with different outputs instruct the system on what makes data different, while input data with the same output instruct the system on what makes data similar.

Under this category we can define two broad classes of tasks:

– Classification: the output is restricted to a limited set of discrete values; each element of this set is called a class. The mathematical model, given a training sample, should decide which class the sample belongs to (or output a probability distribution over the classes).

– Regression: the output is not restricted to a limited set, but is allowed to take any numerical value inside a range. The learning model now has to produce a continuous value in the same range.

In certain cases it can be difficult in terms of economic or human effort to build a sufficiently large dataset, hence it is useful to study how a learning system may be able to derive information without outputs. This leads to the next type of learning task.

• Unsupervised Learning: the training data does not supply expected outputs, so the learning model cannot be provided with a measurement of the error to guide learning. What the learning model can do instead is find commonalities in the data, identify groups and clusters, and self-organize in such a way as to react, when given new data, based on the presence or absence of these commonalities.

• Reinforcement Learning: this kind of learning is used when trying to model agents that take actions in an environment that provides stimuli, under a notion of cumulative reward. The objective is to maximize the cumulative reward.

2.3 Artificial Neural Networks

ANNs [39] (which we will often call Neural Networks (NNs) for simplicity) are systems inspired by the biology of the neurons inside our brain. ANNs form a large family of mathematical models, all sharing the common idea of building a complex system by connecting many simple artificial neurons.

Figure 2.1: On the left, the general architecture of a Multi Layer Perceptron, composed of three layers (input, hidden and output); every node is connected by an edge, representing a weight, to every node in the next layer of the network. On the right, how a single neuron is computed: σ is the activation function and net(x) is the sum of the products of each input by its weight (see Equation 2.1).

A simple, yet important, NN in the class of Feedforward Neural Networks is the Multi Layer Perceptron (MLP), which is made of three layers (see Figure 2.1). Let us assume we are dealing with a single sample: it is made of a certain number of features (two in our example), one per node in the input layer, and each feature is a real number. In the forward pass the input sample is fed to the network and values flow from layer to layer up to the last one, which produces an output. Each edge represents a connection and is associated with a weight, whose strength is what actually changes through learning.

The core of the NN is the structure of the node, or neuron. Let us assume we want to know the output of one of the neurons of the hidden layer of Figure 2.1 and refer to the function it computes as h_j(x). This function is the result of the application of an activation function σ over the net of the neuron h_j:

h_j(x) = σ(net(x)) = σ( Σ_{i=0}^{1} w^{xh}_{ij} x_i + b )   (2.1)

where w^{xh}_{ij} is the weight on the edge between x_i and h_j and σ is a special non-linear function, which is fundamental: otherwise the network would only be able to learn linear functions.

There are many activation functions, some more suited to a particular task than others. In order to be able to train the network (more on this later) we need to be able to compute derivatives, hence we need differentiable functions. Three of the most famous ones are the logistic function (or sigmoid), the hyperbolic tangent and the softmax. The first produces values in the continuous range (0, 1), the second in (-1, 1), and the last reshapes an output vector with one value for each class into a probability distribution: every value is in the range (0, 1) and all values must sum to one. We report the equations of the first two and their derivatives:

s(x) = 1 / (1 + e^{-x}),    s'(x) = s(x)(1 - s(x))   (2.2)
tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}),    tanh'(x) = 1 - tanh(x)^2.   (2.3)

Learning. NNs are typically trained with Backpropagation [40]. The idea of this algorithm is to compute the forward pass, compute a loss function (or error function) L representing a measure of the inaccuracy of the prediction, and then propagate the error backward, layer by layer, to every weight of the network.

What we actually compute is the partial derivative of the loss function with respect to the network parameters. Learning in this context amounts to setting up a non-convex optimization problem in which the weights represent a point inside the space of all solutions the network is able to reach, and the gradient of the error represents the direction in which that point should move in order to increase or decrease the function (the loss). A standard way to approach the problem is to use an optimization algorithm: the most common is Stochastic Gradient Descent, which simply moves in the direction pointing to a local minimum. We know this may not be the global optimum of the function, which is why a lot of techniques have been proposed to help the search through the space of all solutions.
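To make this procedure concrete, the following sketch shows one forward pass, loss computation, backpropagation and gradient-descent update for a small MLP. It is our own illustration in PyTorch (the framework used later in the thesis); the layer sizes, learning rate and data are arbitrary choices for the example, not values from the thesis.

```python
import torch
import torch.nn as nn

# A minimal MLP with one hidden layer, as in Figure 2.1 (sizes are arbitrary).
torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 1))
loss_fn = nn.MSELoss()
lr = 0.1

x = torch.randn(8, 2)   # a batch of 8 samples with 2 features each
y = torch.randn(8, 1)   # their target outputs

# Forward pass, loss, backward pass (backpropagation), gradient-descent update.
pred = mlp(x)
loss = loss_fn(pred, y)
mlp.zero_grad()
loss.backward()
with torch.no_grad():
    for p in mlp.parameters():
        p -= lr * p.grad   # move against the gradient, towards a local minimum
```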

2.4 Recurrent Neural Networks

An RNN is a class of Neural Networks specialized to work with sequences as input, such as temporal sequences, as it is able to encode information (dependencies among particular elements) about the sequence in its hidden state.

Figure 2.2: On the left, the general representation of an RNN: the hidden state is connected to itself. On the right, the unfolded representation: the hidden state is different at every timestep, but the weights are always the same. Some nodes are drawn without a circle to indicate that no particular computation happens at that node, differently from h.

The difference with non-recurrent Neural Networks is that RNNs have cycles, allowing their internal state to be updated step by step, giving them the capacity to keep a dynamic memory of the history of inputs and therefore to handle sequences of inputs with temporal dependencies.

Figure 2.2 depicts a generic RNN with its unfolding. The simplest form of RNN has a single layer of hidden units h connected to itself. The hidden state h is actually a sequence of hidden states h^t, one for every timestep t. At each timestep a single element x^t of the input sequence {x^1, x^2, . . . , x^t, . . . , x^l} is provided to the hidden units together with the previous hidden state h^{t-1}, in order to update the hidden state and produce an output y^t. The task decides which of the produced outputs are useful: for instance, in classification tasks such as predicting the class of the input sequence only the last one is kept, while in tasks such as generating or predicting the next word of a sentence an output at each timestep is required.

The state and the output of a vanilla RNN are updated with the following formulas:

h^t = σ(W^x x^t + W^h h^{t-1})   (2.4)
y^t = W^o h^t.   (2.5)
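As an illustration, a single timestep of Equations (2.4)-(2.5) can be written as follows. This is a sketch with arbitrary sizes and random weights, not the implementation used in the thesis.

```python
import torch

N_x, N_h, N_o = 4, 8, 3            # arbitrary sizes for the example
W_x = torch.randn(N_h, N_x) * 0.1  # input-to-hidden weights
W_h = torch.randn(N_h, N_h) * 0.1  # recurrent weights
W_o = torch.randn(N_o, N_h) * 0.1  # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One timestep of a vanilla RNN: Equations (2.4) and (2.5)."""
    h_t = torch.tanh(W_x @ x_t + W_h @ h_prev)
    y_t = W_o @ h_t
    return h_t, y_t

# Process a whole sequence by carrying the hidden state along.
h = torch.zeros(N_h)
for x_t in torch.randn(10, N_x):   # a random sequence of length 10
    h, y = rnn_step(x_t, h)
```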

Training. RNNs are trained with an extension of backpropagation called Backpropagation Through Time (BPTT) [31, 37, 49]. We will also discuss a more practical variant, the Truncated BPTT [50].

Figure 2.3: An unfolded RNN in the special case of a task that requires a single output at the end of the sequence. An RNN can be seen as a very deep feedforward network (with as many layers as the length of the input sequence, l) in which the weights between the hidden layers are shared (as well as the weights between the input and the hidden layers). Thus it presents some of the same issues, such as gradient vanishing.

The idea of BPTT is to compute the contribution to the gradient on a network unfolded over k timesteps. For now let us assume k = l. Hence, we have k input-output couples ⟨(x^0, y^0), (x^1, y^1), . . . , (x^k, y^k)⟩. We completely unfold the network, forward-propagate, compute the predicted output and the error, then compute the gradient backwards, and finally update the weights.

Given long sequences, this kind of network often suffers from vanishing and exploding gradients [5], due to the repeated multiplications performed when backpropagating, leading to the inability to learn long dependencies and to convergence to bad local minima. In Figure 2.3 we can see how an unfolded RNN resembles a very deep feedforward network with weight sharing between the hidden layers. The reason behind this issue is that, when backpropagating through many timesteps in order to compute the partial derivative of the loss L with respect to a distant hidden state, ∂L/∂h^k, the recurrent weight matrix appears raised to the power of l - k. By applying the chain rule it can be shown that

‖∂L/∂h^k‖ ≤ ‖W^{l-k}‖   (2.6)

where W^{l-k} is the recurrent weight matrix raised to the power of l - k. It follows that the influence is determined by its biggest eigenvalue λ: if λ > 1 the gradients will tend to explode, while if λ < 1 they will tend to vanish. The activation function also contributes to the vanishing gradient, as its derivative has a maximum value of 1/4 for the sigmoid and 1 for tanh, and both have a minimum value of 0, thus shrinking the gradient further and further.

One of the simplest ways to address exploding gradients and make the computation more efficient for very long sequences (or endless streaming sequences) is to unfold the network only up to a k much smaller than the length of the sequence (this is called Truncated BPTT), compute the error, update the weights, and then proceed to unfold the next k timesteps. This is not always a solution, as it precludes the network from learning long-term dependencies.
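A minimal sketch of Truncated BPTT in PyTorch, as an assumption of ours for illustration (not the code used in the thesis): the sequence is processed in chunks of k steps, and the hidden state is detached between chunks so that gradients do not flow further back.

```python
import torch
import torch.nn as nn

k = 20                                   # truncation length (hypothetical value)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
readout = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

seq = torch.randn(1, 1000, 4)            # one long input sequence
target = torch.randn(1, 1000, 1)
h = torch.zeros(1, 1, 8)

for start in range(0, seq.size(1), k):
    chunk = seq[:, start:start + k]
    out, h = rnn(chunk, h)
    loss = nn.functional.mse_loss(readout(out), target[:, start:start + k])
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                       # stop gradients at the chunk boundary
```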

Another issue is that at every step all weights are updated, hence long-term dependencies often get forgotten in favor of recent information, which is often useless for generalizing to new sequences.

Models have been devised with either or both of these issues in mind. For instance, models with gates, such as LSTM [17] and GRUs [10], and also CW-RNNs, allow information to flow in the forward pass from very old nodes directly to the ones in the current timestep, while allowing the gradient to flow backward through the same connections without being diminished by many gradient computations. LMNs [2] employ a linear memory which is internally updated without activation functions or gates.

In the next sections we discuss CW-RNNs and LMNs more in depth, and how the problem is faced in the literature.

2.5 Clockwork RNN

Instead of using gates to prevent low-level information from overwriting long dependencies, CW-RNNs [23] try to separate the hidden state into different components, each with its own time resolution, so that faster modules learn short-term dependencies while slower ones learn long-term ones.

Figure 2.4: The intuitive architecture of a CW-RNN. The hidden states are factorized across g modules with increasing clock rates {T_1, . . . , T_g}. The state of module i gets updated every T_i timesteps. At each update, module i receives the contribution of the equal or slower modules j such that T_i ≤ T_j. Weights are omitted for clarity; every edge has its own weight matrix. The implementation exploits properties of block matrices, see Section 2.5 and Figure 2.5.

The hidden layer is thus partitioned into modules running at different clock rates or periods. The network architecture is shown in Figure 2.4. Each module is connected to all modules with non-decreasing clock rates. We order the modules by increasing clock rates {T_1, . . . , T_g} and assign them increasing indices {1, . . . , g}; each module i then gives its contribution to itself and to all the faster modules to its left. This architecture is effective because slower modules, updating only once in a while, focus on the broader context, while providing information about long-term dependencies to the faster modules, which focus on short-term ones.

Before introducing blocks, let us define or recall some quantities: N_x is the input size; g is the number of modules; N_h is the hidden size; N_o is the output size.

In order to make the computation more parallelizable (especially on GPU), instead of considering the hidden states as separate nodes, we can exploit properties of block matrices to aggregate their weights into a single matrix, and their states into a single one as well. In particular, instead of having g sets of weights from the input to the hidden layer {W^{x,h_1}, . . . , W^{x,h_g}}, g sets of recurrent weights and g sets of output weights, we only need three block matrices: W^{xh} ∈ R^{gN_h × N_x}, which holds all the weights from the input layer to the hidden layer; W^{hh} ∈ R^{gN_h × gN_h}, which holds all the recurrent weights; and W^{ho} ∈ R^{N_o × gN_h} for the output layer. Moreover, we call h^t the matrix containing the state of all modules at a specific timestep. In Figure 2.5 we show the matrices graphically and how to update the hidden states.

Figure 2.5: The hidden state update h^t = σ(W^{xh} x^t + W^{hh} h^{t-1}) with block matrices, showing all dimensions assuming g = 5 and t = 6. Given g the number of modules, N_h the hidden state size, b the batch size and N_x the input size, and assuming that only the first two modules have to be updated at time t, the figure shows which blocks are involved in the computation: only the first two block rows of W^{hh} and W^{xh}.

These are the equations used to compute the recurrent step and the output:

h^t = σ(W^{hh} h^{t-1} + W^{xh} x^t)   (2.7)
y^t = σ(W^{ho} h^t)   (2.8)

Equation (2.7) assumes we update the state of all modules, but using block matrices has the advantage that, if we do not need to update a module, we can slice out the relative block-row from the weight matrices, and make the computation faster as a consequence.

To be more precise, the weight matrix W^{hh} is made of g block rows, each one made of g blocks, one for every module; each of these blocks contains the weights between a pair of modules. Since modules do not contribute to the update of slower modules, as we explained earlier, all blocks below the diagonal are 0. The other weight matrix W^{xh} is similar, but each of its block rows simply holds the input weights of the corresponding module, as sketched below.
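The following sketch (our illustration, with assumed sizes; not the original implementation) shows how the block structure can be exploited: at each timestep only the block rows of the active modules are used, and the zeroed lower-triangular blocks of W^{hh} prevent faster modules from contributing to slower ones.

```python
import torch

g, N_h, N_x = 4, 3, 5                    # modules, units per module, input size
periods = [1, 2, 4, 8]                   # exponential clock rates
W_xh = torch.randn(g * N_h, N_x) * 0.1
W_hh = torch.randn(g * N_h, g * N_h) * 0.1

# Zero out blocks below the diagonal: module i only receives from modules j >= i.
for i in range(g):
    for j in range(i):
        W_hh[i * N_h:(i + 1) * N_h, j * N_h:(j + 1) * N_h] = 0.0

def cw_rnn_step(x_t, h_prev, t):
    """Update only the modules whose clock divides t (Equation 2.7, sliced)."""
    n_active = sum(1 for T in periods if t % T == 0) * N_h   # active block rows
    h_t = h_prev.clone()
    h_t[:n_active] = torch.tanh(W_xh[:n_active] @ x_t + W_hh[:n_active] @ h_prev)
    return h_t

h = torch.zeros(g * N_h)
for t, x_t in enumerate(torch.randn(16, N_x)):
    h = cw_rnn_step(x_t, h, t)
```

With exponential periods the active modules always form a contiguous prefix, so slicing the first block rows is enough.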

2.6 Linear Autoencoder for Sequences

Nonlinear autoencoder networks have become famous for their ability to build internal representations of a set of inputs with interesting properties. In this thesis we focus on the results of [46], which show how linear autoencoders perform PCA on the input data and how to apply them to different data structures. We are, of course, interested in sequences, as performing PCA on a set of sequences is equivalent to finding a linear system that compresses the data into the most efficient state vector. We will use such a compression to mitigate the vanishing gradient issue of training long recurrent networks. Autoencoders are typically trained through backpropagation, or as Restricted Boltzmann Machines (RBMs), but when dealing with linear ones, closed-form solutions exist.

Let us see a summary of how to get the closed formula to train linear autoencoders.

Given a temporal sequence {x^1, x^2, . . . , x^t, . . . , x^l} as input, the dynamical system is defined by the following equations:

y^t = A x^t + B y^{t-1}   (2.9)
[x^t; y^{t-1}] = C y^t   (2.10)

where A ∈ R^{N_m × N_x}, B ∈ R^{N_m × N_m} and C ∈ R^{(N_x + N_m) × N_m} are the model parameters¹. A contains the parameters related to the input, while B those related to the memory term. We define the state matrix Y ∈ R^{l × N_m} as the matrix that encodes the input sequence at all timesteps, then we decompose each y^t with its equation (2.9) down to y^0 = 0, and factorize the input vectors x^t out of the parameter matrices A and B:

Y = \begin{bmatrix} (y^1)^T \\ (y^2)^T \\ \vdots \\ (y^l)^T \end{bmatrix}
  = \underbrace{\begin{bmatrix} (x^1)^T & 0 & \cdots & 0 \\ (x^2)^T & (x^1)^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ (x^l)^T & (x^{l-1})^T & \cdots & (x^1)^T \end{bmatrix}}_{\Xi}
    \underbrace{\begin{bmatrix} A^T \\ A^T B^T \\ \vdots \\ A^T (B^{l-1})^T \end{bmatrix}}_{\Omega}.   (2.11)

¹ In order to make the notation more uniform, we are already using here the dimensions we are going to use for the memory component in the rest of the thesis.

We factorize Ξ with its truncated SVD decomposition Ξ = V Σ U^T. The proof starts by imposing Ω = U, exploiting the structure of Ξ and noticing that, since U^T is orthogonal, ΩU^T = I; it ends by concluding that we are able to derive optimal A and B (by keeping only as many rows as rank(Ξ)). Given the decomposition of U into l submatrices, the problem reduces to finding matrices A and B such that:

Ω = \begin{bmatrix} A^T \\ A^T B^T \\ \vdots \\ A^T (B^{l-1})^T \end{bmatrix}
  = \begin{bmatrix} U_1 \\ U_2 \\ \vdots \\ U_l \end{bmatrix} = U.   (2.12)

Since we are using an SVD decomposition, Σ contains the singular values and each column of U^T is a singular vector corresponding to a principal direction of the data, i.e. we are performing PCA (apart from technicalities).

For more details and an in-depth proof, see [46]. It is enough for us to say that the former equation is solved by defining the matrices

P = \begin{bmatrix} I_{N_x} \\ 0_{N_x(l-1) \times N_x} \end{bmatrix}, \qquad
R = \begin{bmatrix} 0_{N_x \times N_x(l-1)} & 0_{N_x \times N_x} \\ I_{N_x(l-1)} & 0_{N_x(l-1) \times N_x} \end{bmatrix}   (2.13)

and by computing

A = U^T P,    B = U^T R U,   (2.14)

where I_{N_x} is the identity matrix of size N_x and 0_{N_x × N_x} is the zero matrix of size N_x × N_x.
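A small sketch of this closed-form construction follows. It is our own NumPy rendering of Equations (2.11)-(2.14), written under the assumption that the whole Ξ fits in memory; the actual implementation in [46] and in the thesis may differ.

```python
import numpy as np

def train_linear_autoencoder(xs, N_m):
    """xs: array of shape (l, N_x), one input vector per timestep.
    Returns A (N_m x N_x) and B (N_m x N_m) of the linear system
    y^t = A x^t + B y^{t-1}, obtained from the SVD of the data matrix Xi."""
    l, N_x = xs.shape
    # Build Xi: row t contains [x^t, x^(t-1), ..., x^1, 0, ..., 0].
    Xi = np.zeros((l, l * N_x))
    for t in range(l):
        for j in range(t + 1):
            Xi[t, j * N_x:(j + 1) * N_x] = xs[t - j]
    # Truncated SVD: Xi = V Sigma U^T; keep at most N_m components.
    V, S, Ut = np.linalg.svd(Xi, full_matrices=False)
    rank = min(N_m, int(np.sum(S > 1e-10)))
    U = Ut[:rank].T                      # shape (l * N_x, rank)
    # P selects the first N_x rows, R shifts blocks down by one (Equation 2.13).
    P = np.zeros((l * N_x, N_x)); P[:N_x, :] = np.eye(N_x)
    R = np.zeros((l * N_x, l * N_x)); R[N_x:, :-N_x] = np.eye((l - 1) * N_x)
    A = U.T @ P                          # Equation (2.14)
    B = U.T @ R @ U
    return A, B

A, B = train_linear_autoencoder(np.random.randn(12, 3), N_m=6)
```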

We will see later how this is exploited for pretraining the memory component of the Linear Memory Network and how it can be easily extended to a memory with a different structure, such as a multi-resolution one.

2.7 Linear Memory Networks (LMN)

LMNs arise from the intuition that sequence processing problems may be decomposed into two separate tasks: a functional one, in which the network has to map the input sequence into a certain output and encodes the non-linearities, and a memory one, in which information needs to be stored and retrieved in order to produce the output that the functional part has selected. An architecture like this may be less complex, because by decoupling the two parts each of them can be simpler: in LMNs the functional part is a simple fully connected layer, while the memory is a linear autoencoder for sequences.

Figure 2.6: The architecture of an LMN. Functional component: h^t contains the information needed to map the input x^t to an output; the previous memory content m^{t-1} is also provided. Memory component: m^t is a memory of previous hidden states, implemented as a linear autoencoder for sequences of h^t elements.

The hidden states are computed with the following equations:

h^t = σ(W^{xh} x^t + W^{mh} m^{t-1})   (2.15)
m^t = W^{hm} h^t + W^{mm} m^{t-1}.   (2.16)

The output activation is standard, apart from the way it can be connected to the rest of the network, either to the functional part or to the memory, resulting in one of the two equations:

o_h^t = σ(W^{ho} h^t)   (2.17)
o_m^t = σ(W^{mo} m^t).   (2.18)
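As a sketch of Equations (2.15)-(2.18), with assumed sizes and random weights (not the original implementation), one LMN timestep with the output taken from the memory looks like this:

```python
import torch

N_x, N_h, N_m, N_o = 5, 8, 12, 3          # arbitrary sizes for the example
W_xh = torch.randn(N_h, N_x) * 0.1
W_mh = torch.randn(N_h, N_m) * 0.1
W_hm = torch.randn(N_m, N_h) * 0.1
W_mm = torch.randn(N_m, N_m) * 0.1
W_mo = torch.randn(N_o, N_m) * 0.1

def lmn_step(x_t, m_prev):
    """Functional update (2.15), linear memory update (2.16), output from memory (2.18)."""
    h_t = torch.tanh(W_xh @ x_t + W_mh @ m_prev)
    m_t = W_hm @ h_t + W_mm @ m_prev      # note: no non-linearity in the memory
    o_t = torch.tanh(W_mo @ m_t)
    return h_t, m_t, o_t

m = torch.zeros(N_m)
for x_t in torch.randn(10, N_x):
    h, m, o = lmn_step(x_t, m)
```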

Pretraining. Pretraining techniques are useful when dealing with hard-to-learn models. We initialize the model with parameters that represent a point in the space of all solutions which is nearer to the optimal one, thus hopefully preventing the training process from ending up in bad local minima.

An interesting pretraining technique was first used to train DBNs [16, 6]. The authors pretrained all layers, two adjacent ones at a time, as RBMs with the contrastive divergence algorithm [15], and then fine-tuned the network (especially the output layer) with backpropagation.

Figure 2.7: Model architectures for the first two steps of the pretraining of an LMN. (a) Step 1: the unrolled network trained with BPTT. (b) Step 2: the explicit memory {h^{t-1}, . . . , h^{t-k}}.

What we repropose here is a pretraining scheme that encodes into the weights of the LMN (thus, into the memory component) the information necessary to reconstruct the hidden state sequence, and then trains the network starting from there [2].

The memory is a linear autoencoder; one of the properties of linearity is that we can derive an explicit solution for the optimal encoding (as discussed in Section 2.6) and use it to initialize the memory.

The general idea is to first build an unfolded version of the network, then decouple function and memory, and then bring it into an efficient form.

In the first step we build a recurrent network with an unfolding up to k timesteps in the past. The output at time t depends on the current hidden state and k previous ones (Figure 2.7a). Each hidden state is computed with information coming from the current input and from k previous time steps. We train this network through standard backpropagation.

In the second step, we take the hidden states computed this way and look at them from a different perspective: we want to take the previous hidden states with respect to time t and construct a memory from them. The idea is that only h^t has access to the current input x^t, thus it is probably the one solving the input-output transformation at time t, while the other hidden states should just be a memory of past states. We call {h^{t-1}, . . . , h^{t-k}} the explicit memory for h^t. This is based on equivalence results between recurrent and feedforward neural networks [45].

Figure 2.8: The linear autoencoder architecture for compressing k input nodes (the hidden states of the explicit LMN) into a single one. Since the network has the job of reconstructing the input nodes as well as possible, the weights are forced to compress the input into m, so that the decoder is able to perform the reconstruction. Note that ĥ^t = h^t + error. Being linear, what this network actually does is compute the PCA, apart from mathematical technicalities. This is used in step 3 of the pretraining.

The explicit representation is inefficient, so in the third step we want to compress the memory. We train a linear autoencoder to encode a sequence of hidden states into an efficient representation. In Figure 2.8 we show the general architecture of an autoencoder for our problem. It is possible to solve the task with an optimal closed form solution; as seen in section 2.6 a linear autoencoder basically performs PCA.

Solving the linear autoencoder allows us to obtain the input matrix A and the recurrent matrix B. We can use them to initialize the LMN and then fine-tune it with backpropagation. We also need to recall that

U = \begin{bmatrix} A^T \\ A^T B^T \\ \vdots \\ A^T (B^{k-1})^T \end{bmatrix}.

Given all this, we initialize the parameters of the LMN in the following way:

W^{hm} = A   (2.19)
W^{mm} = B   (2.20)
W^{mh} = [W_1^{hh}, . . . , W_k^{hh}] U   (2.21)
W^{mo} = [W_1^{o}, . . . , W_k^{o}] U   (2.22)

2.8 Other related work

In this section we briefly present other research lines that try to tackle hierarchical information processing, along with some promising models specifically aimed at reducing the vanishing gradient problem.

Hierarchical models. Deep learning [4] became famous for being able to learn increasingly abstract representations of information in its highest layers; for instance, in problems in which information is organized spatially, deep convolutional neural networks were shown to classify images on ImageNet much better than any previous model [24].

Temporal data may also have an inherently hierarchical nature [12], hence the need to devise new recurrent models able to capture this kind of structure. It has been observed that higher-level abstractions change more slowly within temporal sequences than lower-level ones. From this observation, many models and approaches emerged: multi-scale RNNs such as the Skip-Recurrent Neural Network (Skip-RNN) [8], the Hierarchical Multiscale RNN (HM-RNN) [42, 12] and the CW-RNN [23], which we discussed in Section 2.5 and which we extend in Chapter 3.

In the LSTM [17], each hidden unit has update and forget gates and thus can learn multi-scale information, but the timescales are not organized hierarchically and information is still diluted over hundreds of time steps due to leaky integration.

Under the line of conditional updates, one approach, aimed at reducing the side effects of very long sequences on training, is the Skip-RNN [8]. When the sequence is very long, or efficiency is important, a suboptimal solution is to sample the data less frequently, or simply decide to skip certain sequence values. Skip-RNN introduces a learnable parameter, used as a probability of updating the hidden states. With this, the network may learn to skip a certain number of useless steps. One of the drawbacks of this approach is that the network has to be trained in a reinforcement learning fashion, and determining the reward may be non-trivial.

CW-RNNs [23], which we discussed in a dedicated section (2.5), follow this line of research, as they try to learn dependencies at different scales by separating the hidden states into modules that run at different speeds.

Another approach with explicit boundaries is the Hierarchical RNN (HRNN) [26]. This model is used for neural machine translation; word boundaries are obtained through word tokenization.

Some work has been done trying to process the sequence in its entirety, instead of in the traditional online manner (one step at a time), such as in [20, 21]. The authors of [9, 3] tried to merge and pool low-level representations with the aim of obtaining shorter sequences representing higher-level abstractions.

The last approach we cite is the HM-RNN [11], which tries to discover the latent hierarchical information inside temporal sequences by making the model learn boundaries that identify places where information at a different scale begins or ends, and by adding some operations, in a manner similar to gated networks, that intuitively let the model decide whether or not to update the hidden state based on those boundaries.

We chose to work with CW-RNNs: some of the advantages are that slower modules are less affected by the vanishing gradient, we can update slower modules less often, and we can decide how many neurons to dedicate to each module (i.e. to each representation at a certain time resolution). One of the disadvantages is that the different time-scales are a hyperparameter, while in many contexts, if the temporal data is non-stationary, it may be necessary to have the network learn them by itself.

Orthogonality. A line of research that is directly focused on one of the two causes of vanishing (and exploding) gradients [5] is the one that revolves around orthogonal matrices. The idea introduced in [1] is that, since the problem is caused by repeated multiplications of a recurrent matrix that has a spectral radius smaller or bigger than 1, we should try to find a way to enforce the recurrent weight matrix to be orthogonal (unitary in their case, as they work in the complex field). They devise a composition of matrices with certain properties in order to obtain a weight matrix with such properties.

Enforcing this constraint perfectly is not necessarily the best thing to do. Other researchers working on this idea proposed another kind of solution: instead of using a special decomposition, they introduce a soft bound [48], a special regularization term that penalizes the loss when the spectral radius is far from 1, hence enforcing the constraint only asymptotically. Given a matrix W, W^T W = I if and only if W is orthogonal. Hence the regularization term measures how far W is from being orthogonal:

λ ‖W^T W - I‖
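For illustration, such a soft penalty can be added to the task loss as in the following sketch (with an assumed penalty weight λ and a placeholder task loss; see [48] for the exact formulation):

```python
import torch

def orthogonality_penalty(W, lam=1e-3):
    """lam * ||W^T W - I||: penalizes the distance of W from orthogonality."""
    eye = torch.eye(W.shape[1], device=W.device)
    return lam * torch.norm(W.t() @ W - eye)

# Usage: add the penalty to the task loss of a recurrent model, where W is its
# recurrent weight matrix; lam is a hypothetical regularization strength.
W = torch.nn.Parameter(torch.randn(8, 8) * 0.1)
task_loss = (W @ torch.randn(8)).pow(2).mean()   # placeholder task loss
total_loss = task_loss + orthogonality_penalty(W)
total_loss.backward()
```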

In [19] it is shown how to make the network less parametrized and this product more efficient by decomposing the recurrent weight matrix as the Kronecker product of a few smaller matrices and enforcing the constraint on the smaller matrices.

3 Clockwork LMN

Our work consists in extending the memory component of LMNs with multi-resolution modules as done in CW-RNNs. By doing this we expect to obtain a more flexible memory, able to learn dependencies of different lengths (possibly higher-level information), while the linearity lessens the vanishing gradient effect. We name the new model the Clockwork Linear Memory Network (CW-LMN).

3.1 Model architecture

The basic architecture is shown in Figure 3.1. The functional component is the same as in an LMN, while the memory m ∈ R^{gN_m} is organized in g modules of size N_m.

h^t ∈ R^{N_h} is the hidden state of the functional component. Given W^{xh} the weight matrix from the input to the hidden layer of the functional component, W^{m_i h} the weight matrix from memory module m_i to the hidden layer, and σ the tanh activation function, the equation to update h^t at timestep t is:

h^t = σ( W^{xh} x^t + Σ_{i=1}^{g} W^{m_i h} m_i^{t-1} ),   (3.1)

so at each update h^t receives information from the whole memory.

The memory m is instead quite different from that of an LMN. The memory is composed of g modules, in which module i has clock rate T_i ∈ {T_1, . . . , T_g}. We update the state of module m_i only with information from slower or equal modules. The intuition is that faster modules are provided with the wider context, while slower modules are not distracted by low-level information. The simplest case is to use exponentially increasing clock rates {1, 2, 4, 8, 16, 32, . . .}:

Figure 3.1: The architecture of a CW-LMN with g modules. The functional hidden state h^t is updated with information from the input x^t and from the whole memory at the previous timestep. Then the memory modules with a clock rate such that (t mod T_i = 0) are updated with information from h^t and from slower or equal modules. For instance, m_1^t receives information from h^t and from all modules in {m_1^{t-1}, . . . , m_g^{t-1}}. Finally, the output layer can be connected either to the functional component or to the memory. Weights are omitted for clarity; see Figure 3.2 for the weight matrix structure.

it seems a reasonable way of increasing the context size and it allows for a straightforward update policy. At any timestep t, only the modules with a clock rate such that

t mod T_i = 0   (3.2)

will be updated. We sort modules by non-decreasing clock rates, such that for all i < j, T_i ≤ T_j; hence, at each timestep, we only update modules with contiguous indices, always starting from module m_1. An efficient implementation of the model can use a single block matrix for each set of weights: W^{hm} ∈ R^{gN_m × N_h} holds all the weights from the state of the functional component to the memory:

W^{hm} = \begin{bmatrix} W_1^{hm} \\ \vdots \\ W_g^{hm} \end{bmatrix}.   (3.3)

Figure 3.2: The memory update m^t = σ(W^{hm} h^t + W^{mm} m^{t-1}) with block matrices, showing all dimensions assuming g = 5 and t = 6 (only modules m_1 and m_2 have to be updated). Since t mod 1 = 0, t mod 2 = 0 and t mod 4 ≠ 0, only the first two block rows of W^{hm} and W^{mm} are needed in order to compute the state of memory m^t. The blocks of W^{mm} under the diagonal are 0 because of the asymmetry of the update rule based on clock rates.

W^{mm} ∈ R^{gN_m × gN_m} contains all the recurrent weights of the memory:

W^{mm} = \begin{bmatrix} W_1^{mm} \\ \vdots \\ W_g^{mm} \end{bmatrix}   (3.4)

and W^{mo} ∈ R^{N_o × gN_m} holds the weights of the output connected to the memory, where N_o is the size of the output layer:

W^{mo} = [W_1^{mo}, . . . , W_g^{mo}].   (3.5)

Figure 3.2 shows how the block structure is exploited to update the recurrent weights, analogously to the CW-RNN case. b is the batch size; we exclude the bias from all figures and equations for simplicity.

We can now define the equation used to update the memory:

m^t = σ(W^{hm} h^t + W^{mm} m^{t-1}),   (3.6)

where, at each time step, we use only the block rows of W^{mm} and W^{hm} that correspond to the executed modules:

W_i^{mm} = \begin{cases} W_i^{mm} & \text{for } (t \bmod T_i) = 0 \\ 0 & \text{otherwise} \end{cases}   (3.7)

W_i^{hm} = \begin{cases} W_i^{hm} & \text{for } (t \bmod T_i) = 0 \\ 0 & \text{otherwise.} \end{cases}   (3.8)

As for a regular LMN, the output layer can be wired to the functional component h^t, with W^{ho} ∈ R^{N_o × N_h}, or to the memory m^t:

y_h^t = σ(W^{ho} h^t)   (3.9)
y_m^t = σ(W^{mo} m^t).   (3.10)

The non-linearity used for computing the hidden states and the memory is tanh. The network is trained through standard BPTT.
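To make Equations (3.1) and (3.6)-(3.10) concrete, here is a minimal sketch of one CW-LMN timestep (our own illustration, with assumed sizes and random weights; the actual implementation uses the block-row slicing of Figure 3.2):

```python
import torch

g, N_x, N_h, N_m, N_o = 3, 5, 8, 4, 2       # assumed sizes for the example
periods = [1, 2, 4]                          # clock rates T_1 <= ... <= T_g
W_xh = torch.randn(N_h, N_x) * 0.1           # input -> functional
W_mh = torch.randn(N_h, g * N_m) * 0.1       # whole memory -> functional
W_hm = torch.randn(g * N_m, N_h) * 0.1       # functional -> memory (block rows)
W_mm = torch.randn(g * N_m, g * N_m) * 0.1   # memory -> memory
W_mo = torch.randn(N_o, g * N_m) * 0.1       # memory -> output

# Module i must not receive contributions from faster modules: zero the blocks
# of W_mm below the block diagonal.
for i in range(g):
    for j in range(i):
        W_mm[i * N_m:(i + 1) * N_m, j * N_m:(j + 1) * N_m] = 0.0

def cw_lmn_step(x_t, m_prev, t):
    h_t = torch.tanh(W_xh @ x_t + W_mh @ m_prev)              # Equation (3.1)
    n_active = sum(1 for T in periods if t % T == 0) * N_m    # contiguous prefix
    m_t = m_prev.clone()
    m_t[:n_active] = torch.tanh(                               # Equation (3.6), sliced
        W_hm[:n_active] @ h_t + W_mm[:n_active] @ m_prev)
    y_t = torch.tanh(W_mo @ m_t)                               # output from memory (3.10)
    return h_t, m_t, y_t

m = torch.zeros(g * N_m)
for t, x_t in enumerate(torch.randn(12, N_x)):
    h, m, y = cw_lmn_step(x_t, m, t)
```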

3.2 Pretraining

We extended the pretraining scheme based on the results on linear autoencoders for sequences in Section 2.6 and presented in Section 2.7 in order to efficiently pretrain the CW-LMN modules in an incremental fashion.

The intuition is that, since what changes with respect to a regular LMN is the architecture of the memory, we want to intervene in the compression phase of the explicit memory by encoding the past hidden states into a multi-resolution memory instead of into a single-module memory. We construct the CW-LMN incrementally by building, pretraining with an approximate initialization algorithm, and then training the network, one module at a time.

We briefly explain why we chose an incremental pretraining over a single-step one, and then we describe the pretraining scheme in detail.

Incremental pretraining motivations. The natural extension of the pretraining of the LMN would be to devise a mechanism to pretrain the weights of all modules together, but this would have an important drawback. In the first pretraining step, we choose an unrolling factor k that represents how many hidden states we unroll, and then we train the unrolled network. Then, over the past states of this network, we build a linear autoencoder that compresses them into a single hidden state. In the case of a multi-scale memory we could think of connecting the input layer and the encoding layer of the autoencoder in such a way as to replicate the behavior of the number of modules we want to have, but the slowest module should be able to get information from hidden states very far back in time, farther than k steps. This means we would need to increase k accordingly, which would lead to an unrolled network too expensive in terms of time and memory.

What we do instead is to build the network incrementally. For clarity, we call pretraining step (p-step) the pretraining scheme of a single module, as described in Section 2.7, and we call incremental step (i-step) the whole procedure done to add a module, which contains several p-steps.

In the next paragraphs we describe the whole procedure in detail.

Incremental pretraining steps. Let us assume G to be the final number of modules, and T_1 = 1 and T_2 = 2 to be the first two module periods. Let us also define the names of the relevant weight matrices:

• W_j^{hm} ∈ R^{N_m × N_h} is the weight matrix associated with the connection from the hidden state of the functional component to memory module j;

• W_{ij}^{mm} ∈ R^{N_m × N_m} is the (recurrent) weight matrix associated with the connection from memory module i to memory module j;

• W_i^{mh} ∈ R^{N_h × N_m} is the weight matrix associated with the connection from memory module i to the hidden state of the functional component;

• W_i^{mo} ∈ R^{N_o × N_m} is the weight matrix associated with the connection from memory module i to the output layer.

Throughout the following steps we refer to Figure 3.3 and Figure 3.4. Biases are excluded from all dimensions and figures.

• i-step 1: We start with an untrained CW-LMN with no modules, only the functional part, with equations:

h^t = σ(W^{xh} x^t)   (3.11)
y_h^t = σ(W^{ho} h^t),   (3.12)

where W^{xh} ∈ R^{N_h × N_x} and W^{ho} ∈ R^{N_o × N_h} are initialized with small random values.

– p-step 1: We build an unrolled network with k previous hidden states other than h^t (Figure 2.7a). The unrolled network is defined by the following equations:

h^t = σ( W^{xh} x^t + Σ_{i=1}^{k} W_i^{hh} h^{t-i} )
y^t = σ( Σ_{i=0}^{k} W_i^{o} h^{t-i} )
for t mod T_i = 0,   (3.13)

where W_i^{hh} ∈ R^{N_h × N_h} represents the relationship between the current hidden state and the state at time t - i, while W_i^{o} ∈ R^{N_o × N_h} represents the relationship between the output and the hidden state at time t - i. Being the first incremental step, we choose input values according to our first period, T_1 = 1, hence we use the whole sequence {x^0, x^1, . . . , x^l}, where l is the length of the sequence. To predict the first k output vectors we pad the missing inputs (inputs x^i with i < 0) with 0. We train this network with backpropagation.

– p-step 2: Based on equivalence results between recurrent and feedforward neural networks [45], we can construct an explicit memory from the hidden states of the unrolled network (Figure 2.7b). Intuitively, the hidden state h^t is the only state aware of the current x^t, while {h^{t-1}, . . . , h^{t-k}} constitutes a memory of previous hidden states which are used to compute the next hidden state.

We want to optimize this expensive explicit memory, so we build a linear autoencoder for sequences trained to compress all previous hidden states h^{t-1}, . . . , h^{t-k} into an inner representation of fixed size N_m. Thanks to the results of [46], an optimal closed-form solution exists and can be used. We end up with two parameter matrices, A ∈ R^{N_m × N_h} and B ∈ R^{N_m × N_m}.

– p-step 3: We use the obtained parameters to initialize the CW-LMN. The process is illustrated in Figure 3.3a. We add the first module, resulting in a model with the equations of a CW-LMN as described earlier (3.1, 3.6, 3.9, 3.10) but with only g = 1 modules, and we initialize the new matrices as follows:

W_1^{hm} = A   (3.14)
W_{11}^{mm} = B   (3.15)
W_1^{mh} ~ N(0, ε)   (3.16)
W_1^{mo} ~ N(0, ε).   (3.17)

The last two matrices are initialized from a normal distribution with mean 0 and small variance because we want the network to learn those connections by itself and to prevent the random initialization from interfering with the pretraining results.

At this point, we fine-tune the CW-LMN with backpropagation for some epochs before adding the second module.

• i-steps 2,. . .,G: In summary, at every i-step we add a module to the CW-LMN, initialize the added weights either with the result of the linear autoencoder trained on a new unrolled network (skipping some input elements according to the module period), or randomly, then fine-tune the CW-LMN for a fixed number of epochs.

The newly added module should see the sequence at a different time-scale, so at each i-step we build a new unrolled model (equations 3.13), but each time we skip different elements of the input sequence, according to the period of the module. Given i the index of the current module, we visit only the input elements x^t such that

t mod T_i = 0.   (3.18)

For example, in the second incremental step of our example (see Figure 3.3b and Figure 3.4), T_2 = 2, so we skip the inputs at odd timesteps, using {x^0, x^2, . . . , x^l}.

The linear autoencoder is trained as before on the previous hidden states of the unrolled network, h^{t-1}, . . . , h^{t-k}.

We then use the matrices obtained this way to add module j, initializing its weights as follows:

W_j^{hm} = A   (3.19)
W_{jk}^{mm} ~ N(0, ε)   ∀k ∈ {1, . . . , j - 1}   (3.20)
W_{jj}^{mm} = B   (3.21)
W_j^{mh} ~ N(0, ε)   (3.22)
W_j^{mo} ~ N(0, ε).   (3.23)

Again, the newly added weights that we do not get from the linear autoencoder are initialized to very small values to minimize the initial influence of these connections. New biases are included as well.
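As an illustration of p-step 3 at a generic i-step, the following sketch (our own code, with hypothetical names, assumed shapes and an assumed ε; not the thesis implementation) grows the block matrices of the memory when a new module is added, following Equations (3.19)-(3.23):

```python
import torch

def add_pretrained_module(params, A, B, N_h, N_o, eps=0.01):
    """Grow the CW-LMN memory by one module, as in Equations (3.19)-(3.23).
    `params` is a dict with W_hm (gN_m x N_h), W_mm (gN_m x gN_m),
    W_mh (N_h x gN_m), W_mo (N_o x gN_m); it may be empty for the first module."""
    N_m = A.shape[0]
    g_old = 0 if "W_hm" not in params else params["W_hm"].shape[0] // N_m
    rnd = lambda *shape: torch.randn(*shape) * eps            # N(0, eps) blocks

    W_hm = A if g_old == 0 else torch.cat([params["W_hm"], A])            # (3.19)
    W_mm = torch.zeros((g_old + 1) * N_m, (g_old + 1) * N_m)
    if g_old:
        W_mm[:g_old * N_m, :g_old * N_m] = params["W_mm"]
        W_mm[:g_old * N_m, g_old * N_m:] = rnd(g_old * N_m, N_m)          # (3.20)
    W_mm[g_old * N_m:, g_old * N_m:] = B                                  # (3.21)
    W_mh = rnd(N_h, N_m) if g_old == 0 else torch.cat([params["W_mh"], rnd(N_h, N_m)], dim=1)  # (3.22)
    W_mo = rnd(N_o, N_m) if g_old == 0 else torch.cat([params["W_mo"], rnd(N_o, N_m)], dim=1)  # (3.23)
    return {"W_hm": W_hm, "W_mm": W_mm, "W_mh": W_mh, "W_mo": W_mo}
```

A full incremental run would alternate this growth step with the training of the unrolled network (p-step 1), the closed-form autoencoder solution (p-step 2) and a few epochs of fine-tuning of the CW-LMN.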

Figure 3.3: First two steps of the incremental pretraining scheme for the CW-LMN. We start with a network with no modules (only the functional part) and add one module at a time, with some pretrained matrices and some random weights; we continue until we reach the desired number of modules. At each step we build A and B taking into account the clock of the module we want to add. (a) i-step 1, p-step 3: we add the first module and initialize the new weights as W_1^{hm} = A and W_{1,1}^{mm} = B, while W_1^{mh} and W_1^{mo} are initialized with very small random values. (b) i-step 2, p-step 3: we add the second module and initialize the new weights as W_2^{hm} = A and W_{2,2}^{mm} = B, while W_2^{mh}, W_{2,1}^{mm} and W_2^{mo} are initialized with very small random values.

Figure 3.4: An alternative view of i-step 2 (Figure 3.3b). We show the memory update equation (3.6) with block matrices after adding the second module. The newly assigned matrices are shown in red; rand means we initialize the block matrix with very small random numbers. Matrix dimensions are omitted; refer to Figure 3.2.

4 Experiments

We compared our new model with an RNN, an LSTM, an LMN and a CW-RNN. The tasks we chose involve audio signals, because our network has modules that work at different time-scales and hence easily captures information at different frequencies, which helps with long dependencies and global features of the sequence.

The first task, Sequence Generation, requires the model to predict the next element of a signal, given no input. The second task, Spoken Word Classification, requires the model to distinguish spoken words. In this case, each sequence is preprocessed into Mel-frequency cepstral coefficients (MFCCs) [41].

Every network has a single hidden layer. The activation function for the hidden layer is always tanh. The total number of nodes is chosen to get approximately the same number of parameters to train. The initial hidden state of the recurrent layer is set to 0. The initial values of all weights are drawn from a normal distribution with mean 0 and small standard deviation. As optimization algorithm we used Adam [22] with L2 weight decay as regularization technique [27]. For each model we executed a Random Search [7] in order to estimate the best hyperparameters. The models are implemented with PyTorch 1.0 [34] and tested on a CUDA-enabled GPU [32].
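For reference, this setup corresponds to something like the following PyTorch configuration (a sketch with assumed values and a placeholder model; the actual per-model hyperparameters are reported in the tables below):

```python
import torch

model = torch.nn.RNN(input_size=13, hidden_size=31)   # placeholder model for the example

# Initialize all weights from N(0, std) with a small standard deviation.
std = 0.1
with torch.no_grad():
    for p in model.parameters():
        p.normal_(mean=0.0, std=std)

# Adam with L2 weight decay as the regularizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```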

4.1 Sequence Generation

The Sequence Generation task consists in taking a single sequence and training the model to predict, step by step, the next element of the same sequence. We used a sequence generated from a portion of a music file, sampled at 44.1 kHz starting at a random position and turned into 300 data points. The sequence elements were scaled to lie in the range [-1, 1]. The task, having no input to memorize, only tests long-term dependencies.

Table 4.1: Number of parameters for every trained model in the Sequence Generation task. We chose the number of hidden units for each configuration to keep the total number of parameters constant across the experiments. The values in the CW-RNN, LMN and CW-LMN columns read, respectively, 'total units (units per module)', 'hidden units, memory units' and 'hidden units, memory total units (memory units per module)'. CW-RNN and CW-LMN used 9 modules.

# parameters | RNN | LSTM | CW-RNN | LMN    | CW-LMN
100          | 9   | 4    | 9 (1)  | 4, 6   | 1, 9 (1)
250          | 15  | 7    | 18 (2) | 7, 10  | 1, 18 (2)
500          | 22  | 10   | 27 (3) | 11, 13 | 1, 27 (3)
1000         | 31  | 15   | 36 (4) | 2, 29  | 1, 36 (4)

Table 4.2: The hyperparameters and the final NMSE for all the models considered in the Sequence Generation task. For CW-RNN and CW-LMN we used 9 modules. Moreover, we set the bias of the forget gate of the LSTM to 5. The best performing model is the CW-LMN.

                 | RNN     | LSTM    | CW-RNN  | LMN     | CW-LMN
Hidden Units     | 31      | 15      | 36      | 2       | 1
Memory Units     | /       | /       | /       | 29      | 36
Learning Rate    | 1e-3    | 1e-2    | 5e-5    | 5e-4    | 5e-3
Weight Decay     | 1e-5    | 1e-8    | 1e-6    | 1e-6    | 1e-5
Weights Init Std | 0.1     | 1.0     | 0.6     | 0.05    | 0.1
# parameters     | 1000    | 1000    | 1000    | 1000    | 1000
Epochs           | 6000    | 12000   | 2000    | 5000    | 8000
NMSE             | 0.07958 | 0.02074 | 0.01251 | 0.03847 | 0.00011

In [23] the CW-RNN was shown to perform much better than the RNN and the LSTM, so we wanted to see how LMNs and CW-LMNs perform.

The model is provided with no input; at each timestep t the network is asked to produce as output the element of the sequence at time t + 1. Since the output at each timestep is a single real number, the output layer has only one node, hence only one weight per hidden neuron. This way the network is forced to encode information about the sequence into the hidden weights, without relying (too much) on the weights of the output layer.

We tested all models with 4 different parameter budgets: approximately 100, 250, 500 and 1000 parameters, varying the number of hidden neurons (see Table 4.1). As loss function we used the Normalized Mean Squared Error (NMSE), that is the Mean Squared Error (MSE) divided by the range of possible values (2 in our case).
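For clarity, the NMSE defined above can be written as the following small helper (the function name is ours):

import torch

def nmse(predictions, targets, value_range=2.0):
    # MSE divided by the range of possible values ([-1, 1] gives a range of 2).
    return torch.mean((predictions - targets) ** 2) / value_range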


In Table 4.2 you can see some of the final hyperparameters for each model. With the LMN and the CW-LMN we obtained better results using the outputs produced by the memory. For the CW-LMN we chose 9 modules with exponential clock rates {1, 2, 4, 8, 16, 32, 64, 128, 256}. More modules would be useless, as the sequence is 300 data points long. The number of hidden units for the CW-RNN and of memory units for the CW-LMN is given by the number of modules times the number of units per module. We obtained the best results by setting the bias of the forget gate of the LSTM to 5.
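As a sketch of the clockwork schedule (an assumption about the implementation, not code from the thesis), module i has clock rate 2^i and its units are updated only at timesteps that are multiples of that rate:

clock_rates = [2 ** i for i in range(9)]   # {1, 2, 4, 8, 16, 32, 64, 128, 256}

def active_modules(t):
    # Module i updates its block of memory units only when t is a multiple of 2**i.
    return [i for i, rate in enumerate(clock_rates) if t % rate == 0]

# For example, at t = 8 only the modules with rates 1, 2, 4 and 8 are updated.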

We show in Figure 4.1 and report in Table 4.2 the results of our experiments. We show the reconstruction of the sequence by every model, selecting only the best configuration of each. From the figure we can see that:

• The output of the RNN precisely reconstructs only the first data points; for the rest of the sequence the network predicts the mean of the output range. This is a textbook example of the vanishing gradient problem;

• The output of the LSTM resembles a sliding average, with certain zones approximated more closely, others less so. Thanks to its gates it is able to memorize more information spread across the whole sequence, but it is not able to fit it completely. Moreover, its capacity has a limit: in the last part of the sequence it outputs a mean, like the RNN does;

• The output of the CW-RNN closely matches the whole sequence, even at the end and better than the LSTM, but some areas are only approximated. This shows that a multi-scale architecture is very useful. We can hypothesize that the fixed scales we chose were not ideal for representing the information at those time steps;

• The output of the LMN is a slightly worse sliding average with respect to the LSTM, but its precision is of the same quality throughout the sequence. The memory is linear, thus free from one of the causes of the vanishing gradient, but it is hard for a single module to internally represent information at different time-scales simultaneously;

• The output of the CW-LMN matches the sequence almost perfectly. We can guess that decoupling the functional component frees the memory (which has the same time-resolutions as the CW-RNN) from solving the functional part of the task, letting it focus on memorization at different time-scales.


Figure 4.1: The plots for the Sequence Generation task showing the original sequence we produced (dashed blue) overlapped with the one reconstructed by each model. The panel titles report the NMSE of each model: RNN 0.079582, LSTM 0.020735, LMN 0.038471, Clockwork-RNN 0.012510, Clockwork-LMN 0.000116.


4.2 Spoken Word Classification

TIMIT dataset. The TIMIT [13] corpus is designed for training acoustic-phonetic models and automatic speech recognition systems. It contains recordings from different speakers of the major American English dialects. The dataset provides handmade splits and information about the phonemes that make up each word, so there is no need to work with a vocabulary of words.

TIMIT has been used for about three decades for research purposes: many models have been applied to it, starting from Hidden Markov Models (HMMs) [25], then RNNs [38], hybrid RNN/HMM models [38], MLPs [36], specialized models such as TempoRAl Patterns (TRAPs) [43], DBNs [30], kernel methods [18], deep LSTMs [14] and deep convolutional networks [47], among others.

Task description and preprocessing. We took recordings of sentences spoken by different speakers from TIMIT and trained the models to recognize different words, regardless of the speaker. To be specific, we followed the experimental setup defined in [23]: we took 25 words pronounced by 7 different speakers, for a total of 175 examples. The words can be grouped into 5 clusters, one for each common suffix. The shared suffix forces the network to discriminate according to the first part of each word, since we want to test the models' ability to learn long-term dependencies. The clusters and words are the following:

• Class 1: making, walking, cooking, looking, working
• Class 2: biblical, cyclical, technical, classical, critical
• Class 3: tradition, addition, audition, recognition, competition
• Class 4: musicians, discussions, regulations, accusations, conditions
• Class 5: subway, leeway, freeway, highway, hallway

which are taken from the file codes reported in Table 4.3.

The audio signals are in WAV format. Each file contains a whole sentence, so we trimmed it to the part corresponding to the target word using the metadata provided by the authors of the dataset.

At this point we had to prepare the data for speech recognition: we used a well-known technique from this research field, taking the MFCCs as features; they are designed to model the response of the human auditory system [41].


Table 4.3: To help with the reproducibility of the Spoken Word Classification task, we report, for each word, the code indicating which files have been used. For some words there are more recordings than necessary (to compare with the experiments in [23] we favoured balanced classes, leaving out some examples), so it is helpful to indicate the precise examples we used.

Word        Code     Word          Code     Word          Code
Making      SX430    Classical     SX52     Discussions   SX40
Walking     SX320    Critical      SX52     Regulations   SX41
Cooking     SX177    Tradition     SX137    Accusations   SX191
Looking     SX229    Addition      SX45     Conditions    SX349
Working     SX4      Audition      SX194    Subway        SX246
Biblical    SX42     Recognition   SX251    Leeway        SX230
Cyclical    SX146    Competition   SX141    Freeway       SX233
Technical   SX135    Musicians     SX15     Highway       SX233
                                            Hallway       SX106

Some of the parameters used to extract the MFCCs are the following: analysis window length: 25 ms; step between successive windows: 1 ms; preemphasis filter coefficient: 0.97; number of cepstral coefficients: 13, where we replaced the zeroth cepstral coefficient, which is considered not useful, with the energy (the log of the total frame energy). As a result, we got 13 features. Finally, we normalized each feature over the whole dataset in order to have mean 0 and variance 1. We want to highlight that we work with MFCC features computed on the whole word signal, not with phonemes as other word-based approaches do.
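The thesis does not name the library used for feature extraction; the sketch below uses python_speech_features, whose options match the parameters listed above (the file name and the per-file normalization are simplifications).

from python_speech_features import mfcc
from scipy.io import wavfile

rate, signal = wavfile.read("word.wav")    # a word trimmed from a TIMIT sentence

features = mfcc(signal, samplerate=rate,
                winlen=0.025,        # 25 ms analysis window
                winstep=0.001,       # 1 ms step between successive windows
                numcep=13,           # 13 cepstral coefficients
                preemph=0.97,        # preemphasis filter
                appendEnergy=True)   # zeroth coefficient replaced by log frame energy

# Normalize each feature to mean 0 and variance 1 (in the experiments the
# statistics are computed over the whole dataset, not per file).
features = (features - features.mean(axis=0)) / features.std(axis=0)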

Training procedure. For the training procedure we wanted a setting comparable with the work of [23], from which we took the multi-scale clockwork architecture. We divided the dataset with the same split ratios, but not with the same assignments, as they were not indicated: in order to ensure balanced classes in both sets, we took, for each class, 5 examples for training and 2 for testing. We found the best hyperparameters with a Random Search [7] as model selection. We are aware that using the test set this way leads to an optimistic assessment of the performance of the models. Nonetheless, what we are interested in is the ranking among the models, which remains meaningful because we used the same model selection procedure for every model. To alleviate this overestimation, after the hyperparameter tuning phase in which we fixed the hyperparameters, we reshuffled the dataset before splitting it into training and test sets, performed the final retraining and tested the models on this new split.
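A minimal sketch of such a class-balanced split follows (the function and variable names are ours):

import random

def balanced_split(examples_by_class, n_train=5, seed=0):
    # For every class, n_train examples go to training and the rest to test.
    rng = random.Random(seed)
    train, test = [], []
    for label, examples in examples_by_class.items():
        shuffled = examples[:]
        rng.shuffle(shuffled)
        train += [(x, label) for x in shuffled[:n_train]]
        test += [(x, label) for x in shuffled[n_train:]]
    return train, test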

Given 25 classes, the output layer of every network is a Softmax and every network was trained to minimize the Cross-Entropy Loss. We used a batch size of 25, with one sample per class to keep the batches balanced, shuffling the training set at each epoch.
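In PyTorch this corresponds to a linear output layer with one logit per class combined with nn.CrossEntropyLoss, which applies the softmax internally; the sizes below are placeholders.

import torch

n_hidden, n_classes = 36, 25
output_layer = torch.nn.Linear(n_hidden, n_classes)
criterion = torch.nn.CrossEntropyLoss()

logits = output_layer(torch.randn(25, n_hidden))   # a batch of 25, one sample per class
targets = torch.arange(25)                         # one label per class
loss = criterion(logits, targets)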
