
Department of Information Engineering
Master Degree in Computer Engineering

Hebbian Learning Algorithms for Training

Convolutional Neural Networks

Author: Gabriele Lagani

e-mail: gabriele.lagani@gmail.com

Supervisors:

G. Amato

F. Falchi

C. Gennaro

Academic Year 2018/2019


Abstract

The concept of Hebbian learning refers to a family of learning rules, inspired by biology, according to which the weight associated with a synapse increases proportionally to the values of the pre-synaptic and post-synaptic stimuli at a given instant of time. Different variants of Hebbian rules can be found in the literature. In this thesis, three main Hebbian learning approaches are explored: Winner-Takes-All competition, Self-Organizing Maps and a supervised Hebbian learning solution for training the final classification layer of a network. In the literature, applications of Hebbian learning rules to train networks for image classification tasks exist, although they are currently limited to relatively shallow architectures. In this thesis, the possibility of applying Hebbian learning rules to deeper network architectures is explored and the results are compared to those achieved with Gradient Descent on the same architectures. The results suggest that the Hebbian approach is adequate for training the lower and the higher layers of a neural network, but not for the intermediate layers. Supervised Hebbian training is effective for training a final network layer taking high-level feature representations as input and providing class scores as output. In addition, the Hebbian approach is much faster than Gradient Descent in terms of the number of epochs required for training. Currently, a possible application of the Hebbian approach could be that of re-training the higher layers of a pre-trained network for a new task in a few epochs. Hebbian learning approaches are currently open to further explorations in order to discover effective solutions also for training the intermediate network layers.


Contents

1 Introduction and related work
1.1 Weight Update Rules and Hebbian Learning
1.2 Competitive Learning
1.3 Self-Organizing Maps
1.4 Supervised Hebbian Learning
1.5 Some Existing Experimental Results
1.6 Analogy between Hebbian Learning and Dynamic Routing between Capsules
1.7 Spike Timing Dependent Plasticity
1.8 Pitfalls and Potentials of Hebbian Learning and STDP
1.9 Outline of this Document

2 Design
2.1 Configurations
2.2 Training and Testing
2.3 Data Loading
2.4 Neural Network Models
2.5 Hebbian Module

3 Implementation
3.1 Configurations
3.2 Training and Testing
3.3 Data Loading
3.4 Neural Network Models
3.5 Hebbian Module

4 Experimental Setup
4.1 Baseline
4.2 Supervised Hebbian Classifier
4.3 Classifiers on Baseline Deep Layers
4.4 Classifiers on Hebbian Deep Layers
4.5 Hybrid Networks

5 Results
5.1 Baseline
5.2 Supervised Hebbian Classifier
5.3 Classifiers on Baseline Deep Layers
5.4 Classifiers on Hebbian Deep Layers
5.5 Hybrid Networks

6 Conclusions and Future Works

Appendices
Appendix A ZCA-Whitening and Dataset Statistics
Appendix B List of Configurations

List of Figures
List of Tables
Listings
Bibliography


Chapter 1

Introduction and related work

1.1

Weight Update Rules and Hebbian Learning

A discrete-time learning rule can be expressed as a law that, given the input x(t) and the synaptic weights w(t) of the network at the discrete time step t, allows computing the new weights for the next time step t + 1:

∆w(t) = f (x(t), w(t)) (1.1)

and

w(t + 1) = w(t) + ∆w(t) (1.2)

although, more generally, the update equation might also depend on the history of updates at past time instants 0, 1, . . . , t − 1.

The above formulation defines an online learning rule, i.e. a single input is shown to the network at time t; alternatively, in a batch formulation, multiple inputs are shown to the network at a given time step:

∆w(t) = f(x^(1)(t), . . . , x^(N)(t), w(t)) (1.3)

In the familiar case of gradient descent, and omitting the time step for simplicity, the general rule instantiates as

∆w = −η ∇wE(x, w) (1.4)

or in the batch version

∆w = −η ∇w E(x^(1), . . . , x^(N), w) (1.5)

η being the learning rate, E the loss to be minimized, and ∇w denoting the gradient w.r.t. the parameters w.

Gradient descent with error backpropagation is very popular among ANN engineers, but it requires an error signal to be delivered to each neuron, which is not biologically realistic. Although backpropagation might be supported by bidirectional connections (see, for example, the random feedback alignment algorithm [32]), which are actually present in biological neural networks [51] [44] [37] [12], neuroscientists have doubts about the biological plausibility of a per-neuron error-delivery mechanism [37].


Figure 1.1: Schematic representation of a neuron.

The concept of Hebbian learning refers to a family of learning rules, inspired by biology, according to which the weight associated with a synapse increases proportionally to the values of the pre-synaptic and post-synaptic stimuli at a given instant of time. The ability of biological synapses to modify their strength in response to stimuli is called synaptic plasticity, which can take the form of an increase, called Long-Term Potentiation (LTP), or a decay, referred to as Long-Term Depression (LTD) [4] [51] [44] [37]. An overview of the synaptic plasticity rules discussed hereafter is given in [21], while [25] provides an introduction to Hebbian learning in Artificial Neural Networks (ANNs).

Despite the biological inspiration, synaptic plasticity rules can be applied to ANNs, the only difference being that biological plasticity takes place in a continuous time environment, while we prefer to switch to a discrete time setting when simulating ANNs.

Therefore, a Hebbian plasticity rule in its simplest formulation can be expressed as (for the online case, again omitting the time dependency for simplicity):

∆w = η y(x, w) x (1.6)

The rule applies individually to each neuron of the network (fig. 1.1). In the formula, y(x, w) is the post-synaptic activation of the neuron, which is a function of the input and the weights and is assumed to be non-negative (e.g. a dot product followed by a ReLU or sigmoid activation).

The main problem of rule 1.6 is that it allows weights only to grow, but not to decrease. In order to prevent the weight vector from growing unbounded it is possible to normalize it after every update, to add saturation constraints or to introduce a weight decay (forgetting) term [21]. In particular, let's consider the latter case:

∆w = η y(x, w) x − γ(x, w) (1.7)

The weight decay term is γ(x, w) and a possible choice is to take γ(x, w) = η y(x, w) w [25], i.e. the decay is proportional to the value of the weight, but its extent is also controlled by the value of the activation. The goal is to make small weight modifications when the output is low and high weight modifications when the output is high. In this way, eq. 1.7 becomes

∆w = η y(x, w) (x − w) (1.8)

If we assume that η y(x, w) is smaller than 1, the latter equation admits the following physical interpretation (fig. 1.2): at each iteration the weight vector is modified by taking


(a) Weight vector subject to an update. Points are inputs (organized in a cluster), the green point is the input currently being processed. The blue arrow represents the direction of the update that will affect the weight vector w, while the red arrow is the actual update.

(b) Final position of the weight vector after training.

Figure 1.2: Effect of Hebbian updates on a weight vector.

a step towards the input, the size of the step being proportional to the similarity between the input and the weight vector, so that if a similar input is presented again in the future the neuron will be more likely to produce a stronger response. If an input (or a cluster of similar inputs) is presented repeatedly to the neuron, the weight vector tends to converge towards it, eventually acting as a matching filter. In other words, the input is memorized in the synaptic weights. In this perspective, the neuron can be seen as an entity that, when stimulated with a frequent pattern, learns to recognize it.
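To make the update concrete, the following minimal sketch (plain NumPy, not the code used in this thesis; all names are illustrative) applies eq. 1.8 to a single neuron whose activation is a dot product followed by a sigmoid:

import numpy as np

def hebbian_step(w, x, eta=0.1):
    # Post-synaptic activation: dot product followed by a sigmoid (non-negative)
    y = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    # Eq. 1.8: take a step from w towards x, with size proportional to the activation
    return w + eta * y * (x - w)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
cluster_center = np.array([1.0, 0.5, 0.0, -0.5, 1.0])
for _ in range(500):
    x = cluster_center + 0.05 * rng.normal(size=5)
    w = hebbian_step(w, x)
# After repeated presentations, w has moved close to the cluster center:
# the frequent pattern has been memorized in the synaptic weights.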

1.2

Competitive Learning

Equations in the form of 1.8 (or similar) were used in the context of competitive learning [48] [24] [28] [42] (see also [21] [25]), which is now illustrated. As mentioned above, a neuron stimulated with a repeated pattern learns to recognize it, but if there are several different types of patterns as input, it might be desirable to use different neurons as well to learn those patterns. However, neurons are typically organized in layers and all the neurons in a layer will see the same inputs: the plain Hebbian approach does not provide any mechanism to allow different neurons to learn different patterns.

A possible mechanism that was introduced for this reason is competition among neurons, specifically in the form of the Winner-Takes-All (WTA) approach [24] [42]:

1. when an input is presented to the network, the neurons start a competition

2. the winner of the competition is the neuron whose weight vector is closest to the input vector (according to some distance metric, e.g. angular distance [24] or euclidean distance [25]), while all the other neurons get inhibited

3. the neurons update their weights according to eq. 1.8, where y is set to 0 for the inhibited neurons and to 1 for the winner neuron

(a) Winning weight vector subject to an update. Points are inputs (organized in a cluster), the green point is the input currently being processed. The blue arrow represents the direction of the update that will affect the weight vector w, while the red arrow is the actual update.

(b) Final position of the weight vectors after training.

Figure 1.3: Hebbian updates with competition.

Thus, the winner neuron is the only one to actually update its weights, and, by doing so, it moves its weight vector closer to the input that made it win, so that it will be more likely to win again on similar inputs in the future (fig 1.3).

This type of organization requires the presence of lateral interaction between neurons, which is biologically plausible [18] [51] [44] [37].
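The procedure above can be sketched in a few lines (again a toy NumPy example, not the thesis implementation, with euclidean distance used to select the winner):

import numpy as np

def wta_step(W, x, eta=0.1):
    # W holds one weight vector per row (one per neuron)
    # Competition: the winner is the neuron whose weight vector is closest to x
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    # Update (eq. 1.8 with y = 1 for the winner and y = 0 for the inhibited neurons)
    W[winner] += eta * (x - W[winner])
    return W

rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0]])  # two clusters of inputs
W = rng.normal(size=(2, 2))
for _ in range(1000):
    x = centers[rng.integers(2)] + 0.1 * rng.normal(size=2)
    W = wta_step(W, x)
# Typically, each row of W ends up near one of the cluster centers,
# i.e. different neurons have learned different patterns.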

1.3

Self-Organizing Maps

In [28], the work on competitive learning was further extended with the introduction of the Self-Organizing Maps (SOMs). A SOM is a layer of neurons arranged in an n-dimensional lattice (typically 1-dimensional or 2-dimensional, the latter being more common).

After the competitive phase, but before the weight update, training is extended with a new cooperative phase, in which lateral interaction takes place in the form of a lateral feedback signal that is provided by the winning neuron to its neighbors in the lattice topology (fig. 1.4). The strength of this signal decreases with the distance from the winning neuron. Specifically, denoting with i(x) the winning neuron on input x, the strength of the signal delivered to any neuron j, whose distance from i(x) in the lattice topology


(a) 1-dimensional lattice.

(b) 2-dimensional lattice.

Figure 1.4: Self-Organizing Maps with neurons arranged in different topologies. Some of the lateral feedback connections are highlighted in green.


(a) 2-dimensional lattice with radius highlighted in green and distance dj,i between neuron j and neuron i highlighted in blue.

(b) Gaussian neighborhood function.

Figure 1.5: Lateral interaction with Gaussian neighborhood function.

is dj,i(x), is determined by the neighborhood function h(dj,i(x)). This function should be

equal to 1 when dj,i(x) is 0 and should decrease with the distance. For instance, a possible

choice for the neighborhood function is the Gaussian function centered at zero (fig. 1.5) [25]. Other possible choices for the neighborhood function and further theoretical details about the SOMs are discussed in [33] [34] [9] [10] [11] [7]. The neighborhood function is characterized by a radius (the standard deviation in the Gaussian case) which is typically initialized to be equal to the radius of the lattice and then shrunk over time. Once the cooperative phase is completed, the weight update takes place by applying eq. 1.8, in which y is set to h(dj,i(x)). Note that the WTA approach can be seen as a particular case

of SOM in which the neighborhood function has radius zero.

A property of the SOMs is that, during the training process, thanks to the lateral feedback, weight vectors tend to be updated so that neurons that are close in the lattice topology tend to respond to input patterns that are similar in the input space. However, at the beginning, weight vectors start in random positions, and it takes several update steps (in the order of thousands [25]) for them to disentangle and reach an ordered placement (fig. 1.6), and this is the most expensive part of the training.

In [29] (see also [7]) also a batch version of the SOM algorithm is proposed. The modification of the weight vector of each neuron j consists of a single update step taken towards the point obtained as the weighted average of all the input vectors xk in the batch, with weights given by the corresponding neighborhood function values.

(a) Before training (lattice entangled). (b) After training (lattice disentangled).

Figure 1.6: Organization of SOM neurons before and after training.
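As an illustration of the cooperative phase, the following toy sketch (NumPy, not the thesis code) updates a 1-dimensional SOM with a Gaussian neighborhood whose radius shrinks over time:

import numpy as np

def som_step(W, x, grid, eta=0.1, sigma=1.0):
    # Competitive phase: find the winning neuron i(x)
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    # Cooperative phase: Gaussian lateral feedback, decreasing with lattice distance
    d = np.linalg.norm(grid - grid[winner], axis=1)
    h = np.exp(-d ** 2 / (2 * sigma ** 2))
    # Weight update: eq. 1.8 with y set to h(dj,i)
    return W + eta * h[:, None] * (x - W)

rng = np.random.default_rng(0)
grid = np.arange(10, dtype=float)[:, None]    # 1-dimensional lattice of 10 neurons
W = rng.normal(size=(10, 2))                  # 2-dimensional weight vectors
for t in range(2000):
    sigma = max(5.0 * np.exp(-t / 500), 0.1)  # neighborhood radius shrinks over time
    W = som_step(W, rng.uniform(-1, 1, size=2), grid, sigma=sigma)
# Neurons that are close in the lattice now respond to nearby points of the input space.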

1.4

Supervised Hebbian Learning

Hebbian learning is an unsupervised approach to neural network training, because each neuron updates its weight without relying on labels provided with the data. However, it is possible to use a simple trick to apply Hebbian rules in a supervised fashion: the teacher neuron technique [45] [39] involves imposing a teacher signal on the output of the neurons that we want to train, thus replacing the output that they would naturally produce (fig. 1.7). By doing so and by applying a Hebbian learning rule, neurons adapt their weights in order to actually reproduce the desired output when the same input is provided. In biological systems, such a teacher signal might be produced by mirror neurons, which are present in the human brain and are thought to play an important role for transferring knowledge in some learning processes [41].
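A minimal sketch of the teacher neuron idea (illustrative NumPy code, not the thesis implementation): the output of each class neuron is replaced by the teacher signal before the Hebbian rule is applied.

import numpy as np

def supervised_hebbian_step(W, x, label, eta=0.01):
    # Teacher signal: 1 for the neuron of the correct class, 0 for all the others
    teacher = np.zeros(W.shape[0])
    teacher[label] = 1.0
    # Hebbian update (eq. 1.8) with the imposed output in place of the natural one
    return W + eta * teacher[:, None] * (x - W)

# After training, the weight vector of the neuron dedicated to class c converges
# towards (a running average of) the inputs of class c, so that the products W @ x
# can be used as class scores at inference time.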

1.5

Some Existing Experimental Results

There are already some existing works regarding the application of Hebbian learning rules in the context of deep neural network training for image classification. In [50] and [49], the authors propose a deep Convolutional Neural Network (CNN) architecture consisting of three convolutional layers, followed by an SVM classifier. The convolutional layers are trained, without supervision, to extract relevant features from the inputs. The proposed training algorithm, named Adaptive Hebbian Learning (AHL), combines Hebbian weight update with k-WTA (a variation of WTA in which k > 1 neurons are chosen as winners after each competition), pre-synaptic competition (given two winning neurons j and k, a pre-synaptic neuron i and the connecting synaptic weights wi,j and wi,k, only the highest

between wi,j and wi,k is updated) and dynamic recruiting/pruning of neurons.


The bias term is important to balance the activations of different neurons; hence, the idea is to keep a running average of neuron activations as r, choose a target activation value Abias and

increase or decrease the bias in order to make r approach Abias. The rule is the following:

∆b = η (r − Abias) (1.9)

The output provided by a layer of neurons should compactly represent the input. The discussions in [14] [50] [49] describe some important characteristics that should be found in the neural activations:

• sparse, distributed coding, i.e. an input pattern is encoded in the activity of a few neurons

• decorrelation, i.e. different neurons should not learn to respond to the same features in the input pattern, otherwise they just encode redundant information

These features are also observed to arise spontaneously in networks trained with back-propagation [1].

It has been noticed that, in neural networks trained with backpropagation, neurons in deeper layers (after the second) tend to develop a certain class-specificity, which is important for the classification task [1] [53]. Discriminative Hebbian Learning (DHL) is another algorithm, proposed in [49] and intended to work together with AHL, which aims at reproducing this behavior in Hebbian networks. It consists in choosing a number of neurons at a given layer (in [49], the algorithm is applied at layer 3) to dedicate to each class; then, when an input of class c is fed to the network, only the neurons associated to class c undergo a Hebbian update. Thus, DHL is supervised, since it requires labels to be provided together with the inputs, and it is somewhat similar to using a teacher signal on the deep layers of the network.

The authors of [50] [49] applied these ideas on different image datasets, including CIFAR-10 [30]. On the latter, the algorithm achieved above 75% accuracy with a three-layer network, possibly leaving some margin for improvement by moving to more complex network architectures.

A different approach is taken in [38], where the authors obtain Hebbian weight update rules by minimizing an appropriate loss function. The proposed loss function is the strain loss and the output of a layer should be the argument that minimizes it:

Y∗ = arg min_Y ‖XᵀX − YᵀY‖²_F (1.10)

where X is a matrix obtained by concatenating a set of input vectors and similarly Y is the matrix of the output vectors, while ‖·‖_F is the Frobenius norm. Let's give an

intuitive interpretation of what this loss represents: XᵀX is a matrix whose elements are the dot products of pairs of input vectors, hence they represent the similarity of input vectors with other input vectors, and the same holds for YᵀY. Therefore, that difference represents how much the similarity metric gets distorted when moving from the input space to the output space, and this is what should be minimized. The authors show that the problem can be solved by applying the following rules:

y = Wx − My
∆Wi,j = yi (xj − Wi,j yi) / Di
∆Mi,j = yi (yj − Mi,j yi) / Di for j ≠ i, Mi,i = 0
∆Di = yi² (1.11)

where matrices W and M represent respectively the weights associated to the feed-forward and lateral interactions, while D is a vector containing the cumulative squared activations of the neurons, which acts in the equations as a dynamic learning rate.
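A rough sketch of how the rules in eq. 1.11 could be iterated for a single input is given below (plain NumPy, not the implementation of [38]; it assumes D is initialized to positive values and that the recurrent dynamics y = Wx − My converge, which requires the lateral weights M to be small):

import numpy as np

def similarity_matching_step(W, M, D, x, n_iter=50):
    # Output: fixed point of the recurrent dynamics y = W x - M y
    y = np.zeros(W.shape[0])
    for _ in range(n_iter):
        y = W @ x - M @ y
    # Hebbian update of the feed-forward weights: dWij = yi (xj - Wij yi) / Di
    W = W + (y[:, None] * (x[None, :] - W * y[:, None])) / D[:, None]
    # Anti-Hebbian update of the lateral weights: dMij = yi (yj - Mij yi) / Di, Mii = 0
    M = M + (y[:, None] * (y[None, :] - M * y[:, None])) / D[:, None]
    np.fill_diagonal(M, 0.0)
    # Cumulative squared activations, acting as per-neuron dynamic learning rates
    D = D + y ** 2
    return W, M, D, y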

This model was applied to image classification tasks in [3], again on the CIFAR-10 dataset, achieving 80% accuracy with a single convolutional layer followed by an SVM classifier.

1.6

Analogy between Hebbian Learning and Dynamic

Routing between Capsules

An interesting analogy can be pointed out between Hebbian learning and the recent al-gorithm of Dynamic Routing between Capsules [43]. A capsule can be considered as a set of neurons packed together and specialized into a particular task, e.g. recognizing a particular object. From a higher-level perspective, capsules can be thought of as entities that take a number of vectors as input and give a vector as output. From this point of view, a capsule can be considered as a generalization of a neuron whose inputs and outputs are vectors instead of simple scalars.

Now, capsules, just like simple neurons, are organized in layers, and each capsule in one layer is connected to each capsule in the next layer with a connection characterized by a coupling coefficient. A capsule computes the sum of the input vectors, weighted by the corresponding coupling coefficients, and then a non-linearity is applied to obtain the output vector. The output vector is then multiplied by a prediction matrix, learned via backpropagation, and the resulting prediction vector is sent as input to a next-layer capsule. The goal of the dynamic routing algorithm is to compute the coupling coefficients associated to the connections between pairs of capsules, and it does so by measuring the similarity between the input and output vectors, in terms of scalar product: the more an input prediction agrees with an output vector, the more a coupling coefficient is reinforced. In Hebbian learning, a synaptic weight is also reinforced when the input on a synapse "agrees" with the output of the neuron; however, according to the Hebbian rule, the agreement is measured simply in terms of scalar multiplication, rather than dot product, between input and output values, as in eq. 1.6. In this perspective, dynamic routing can be viewed in some sense as a generalization of the Hebbian rule to the case of computational units with vector inputs and outputs.



Figure 1.8: Profile of STDP weight update law.

1.7

Spike Timing Dependent Plasticity

Spiking Neural Networks (SNNs) [21] are a class of biologically plausible network models in which neurons communicate by means of trains of pulses, or spikes, rather than continuous-valued signals. Pulses are all equal to each other and values are encoded by neurons in the spiking rate. Neurons behave like integrators, summing up pulses received as input until a threshold is exceeded, and at this point an output spike is emitted. In practice, this "integration logic" is implemented in terms of an electric potential which is accumulated on the neural membrane every time an input spike is received; when the threshold is reached and the output spike is released, the neural membrane discharges the accumulated potential and the process restarts. According to this process, a neuron needs to receive several spikes as inputs before an output spike is produced; therefore, it seems like the output frequency of a neuron in a given layer will necessarily be slower than the output frequency of neurons in a previous layer. Actually, this is not what happens, because the fact that a neuron needs several input spikes before it can fire is compensated by the fact that the neuron receives inputs from thousands of other neurons.

Learning occurs in the form of Spike Timing Dependent Plasticity (STDP) [6]: when an input spike is received on a synapse and it is immediately followed by an output spike, then the weight on that synapse is increased. Specifically, a possible STDP rule can be expressed as follows:

∆w = η+ e^(−|∆t|/τ+) if ∆t > 0
∆w = η− e^(−|∆t|/τ−) otherwise (1.12)

where ∆t is the time difference between the post-synaptic and the pre-synaptic spike, η+ and η− are learning rate parameters (η+ > 0 and η− < 0) and τ+ and τ− are time constants. Long Term Potentiation occurs when a pre-synaptic spike has been a likely cause for a post-synaptic spike, hence pre-synaptic and post-synaptic activations are correlated. If instead a pre-synaptic spike occurred right after a post-synaptic one, then the two activations are anti-correlated and LTD occurs. Figure 1.8 shows the profile of a possible function that determines the dependency between ∆w and ∆t according to the STDP weight update rule.
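For illustration, eq. 1.12 can be written directly as a small function (a sketch with arbitrary parameter values, not taken from any of the cited works):

import numpy as np

def stdp_delta_w(delta_t, eta_plus=0.01, eta_minus=-0.01, tau_plus=20.0, tau_minus=20.0):
    # delta_t: post-synaptic spike time minus pre-synaptic spike time
    if delta_t > 0:
        # pre-synaptic spike precedes the post-synaptic one: LTP
        return eta_plus * np.exp(-abs(delta_t) / tau_plus)
    # post-synaptic spike precedes the pre-synaptic one: LTD
    return eta_minus * np.exp(-abs(delta_t) / tau_minus)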

Rules similar to 1.12, or more efficient approximations, have been used to train networks for image classification. In [8], STDP has been successfully applied to handwritten digit recognition. The teacher neuron technique [39] was also proposed in the context of SNNs, in order to combine STDP with supervised learning, and again applied to digit recognition [45]. In [13], CNNs trained with STDP and WTA competition were used on different image datasets, including CIFAR-10. In particular, the architecture proposed for the CIFAR-10 classification task consisted of a single convolutional layer trained with STDP + WTA, followed by three fully connected layers trained with backpropagation. The network achieved above 71% accuracy.

1.8

Pitfalls and Potentials of Hebbian Learning and

STDP

The main pitfall of Hebbian learning and STDP is that their performance is not yet comparable to that of backpropagation-based approaches in terms of classification accuracy. However, there are several reasons that motivate the study of biologically plausible learning models. While research on backpropagation-based models has been extensive, biologically plausible models have not received the same efforts so far. Therefore, there might still be a significant margin of improvement by investing further efforts both in the refinement of the algorithms and in their application to more complex network architectures, and this work hopes to stimulate further interest in this sense.

A possible advantage of the Hebbian learning rules presented above is that they are local, in the sense that each layer of neurons can perform an update without needing to wait for the whole network to have finished processing the input (each layer is independent of all the others). This opens the possibility for layer-wise training which can be highly parallelized.

Additionally, the Hebbian learning rules do not require gradient computations, which could make it easier to train deeper architectures without worrying about vanishing gradient problems; a possible application would be, for instance, efficient training of deep Radial Basis Function (RBF) networks. At the same time, it would be interesting to study how networks trained with Hebbian algorithms behave against adversarial examples [23]. For instance, combining an RBF network architecture with a Hebbian learning approach might result in a model that learns efficiently and exhibits a more biologically plausible behavior against adversarial examples [22].

Power-efficiency is an important aspect of biological neural networks and it is mostly achieved thanks to the communication paradigm based on short pulses. An important aspect of SNNs and STDP is the possibility of realizing energy-efficient neural network implementations in hardware, which could also find application in embedded devices.


Some examples of this kind of implementations, a.k.a. neuromorphic hardware, are [47] [15] [17] [16] [19] [5] [36] [20] [2] [52].

Finally, research efforts focused on the formulation of an algorithmic theory of the brain and further investigation of biologically plausible learning models are important in order to achieve a deeper understanding of how the human brain works, which could open possibilities of further advances both in technological and in medical fields [35].

1.9

Outline of this Document

The goal of this thesis was to explore the effects of a Hebbian-WTA learning algorithm when applied to train a deep convolutional neural network. A module that learns by means of the Hebbian-WTA rule was implemented in Python [46], specifically using the Pytorch library [40], and then used to build and train different neural network architectures on image classification tasks, specifically on the CIFAR-10 [30] dataset. Additionally, the supervised Hebbian training algorithm was also used to train classifiers for CIFAR-10. The results were compared to those obtained by using Gradient Descent learning on the same network architectures.

The remainder of this document is structured as follows:

• Chapter 2 describes the design of the software used to simulate the experiments, including descriptions of the various modules that constitute the software, their interfaces, how they interact with each other, how data are processed and further details on the design choices related to the learning algorithms analyzed in this study.

• Chapter 3 deals with the implementation details of the modules described in Chap-ter 2.

• Chapter 4 provides details about the configurations used in the experimental sce-narios, including network architectures and parameter settings.

• Chapter 5 illustrates and discusses the results obtained from the experiments.

• Chapter 6 presents final considerations on the results described in Chapter 5 and concludes with a discussion on possible future work directions.

The code used for the experiments described in this document can be found at: https://github.com/GabrieleLagani/HebbianLearningThesis


Chapter 2

Design

The goal of this thesis is to apply Hebbian learning to deep networks and to test them on image classification tasks. For this purpose, we implemented the code to run our experiments using the Python [46] language. In particular, we used Pytorch [40] to simulate, train and test our models. This chapter describes the structure of our project, the functionalities of the various modules that constitute our code and the way they interact with each other.

Our project aimed at satisfying the following requirements:

• Implementation of software for training and testing a variety of network models, using different parameter configurations and training algorithms, with minimal effort required from the user.

• Implementation of Pytorch modules, equivalent to convolutional and fully connected layers, that learn according to the Hebbian paradigm rather than Gradient Descent.

• Implementation of different network models and preparation of experimental setups that will be used to run the experiments that are the object of this study.

Figure 2.1 shows how our project was structured in terms of directories/packages. The following files were placed inside the project root folder:

• params.py: contains generic constants and parameters used throughout the code, such as folder names, dataset-related constants (e.g. dataset size, input image dimensions, number of output classes), devices (CPU or GPU) available on the current machine.

• configs.py: contains different configurations of training-related parameters that can be chosen when launching the experiments.

• utils.py: contains utility methods invoked throughout the code.

• data.py: contains the code to load the dataset.

• train.py and evaluate.py: contain the code to train and test the models.

Moreover, the following folders were created:


• basemodel: contains different network models that can be used in experimental sessions based on Gradient Descent training.

• hebbmodel: contains the file hebb.py, where Hebbian algorithms and related func-tionalities are implemented, and different network models that can be used in ex-perimental sessions based on Hebbian training.

• data: this is the folder where datasets are stored.

• results: this is the folder where the results of the experiments are stored. The results of experiments based on Gradient Descent are stored in the gdes sub-folder. Specifically, the results of each experiment are saved in the gdes/<config_name> sub-folder, where <config_name> is a name associated with the experiment (for

example, the results of an experiment named config_base are saved in the gdes/config_base sub-folder). Each of these folders contains, in turn, a figure sub-folder, where

figures generated during the experiments are saved (graphs showing how accuracy varies during epochs and images showing the features learned by neural network kernels), a save sub-folder, where the trained network models are saved, and a test_results.csv where the accuracy values achieved by models during testing are stored. The results of the experiments based on Hebbian training are stored in the hebb folder, following the same internal structure used for the gdes sub-folder. The stats sub-folder contains files where various statistics are stored.

2.1

Configurations

Since Python is an interpreted language, it is possible to define our program parameters in configuration files written using the Python syntax directly. The advantage is that we can modify these parameters when necessary, without the need to execute a compilation step whenever a configuration is modified.

In our project, each experiment that we may want to run corresponds to an experimental configuration defined in the config.py file. A Configuration is an object defined in config.py which contains all the necessary information to execute an experiment. A Configuration can be created as shown in the code of listing 2.1.

Listing 2.1: Configuration creation.

Configuration(
    config_family=<config_family>,
    config_name=<config_name>,
    net_class=<net_class>,
    batch_size=<batch_size>,
    num_epochs=<num_epochs>,
    iteration_ids=<iteration_ids>,
    val_set_split=<val_set_split>,
    augment_data=<augment_data>,
    whiten_data=<whiten_data>,
    learning_rate=<learning_rate>,
    lr_decay=<lr_decay>,
    milestones=<milestones>,
    momentum=<momentum>,
    l2_penalty=<l2_penalty>,
    pre_net_class=<pre_net_class>,
    pre_net_mdl_path=<pre_net_mdl_path>,
    pre_net_out=<pre_net_out>
)

The parameters that are provided when a Configuration is created are the following:

• <config_family>: either gdes or hebb, depending on whether the experiment associated with this Configuration is meant to perform Gradient Descent or Hebbian training. In the params.py file, the constants CONFIG_FAMILY_GDES and CONFIG_FAMILY_HEBB are defined, which are placeholders for the keys gdes and hebb, respectively. The

key gdes is used to specify that a configuration corresponds to an experiment where Gradient Descent training is used, while the key hebb is used to specify that a configuration corresponds to an experiment where Hebbian training is used.

• <config_name>: mnemonic name associated with the Configuration (this is also the name that will be assigned to the folder inside results/gdes or results/hebb where the results of the experiment are stored).

• <net_class>: a reference to the class of the network model that we want to use for this experiment.

• <batch_size>: the size of the mini-batches of data that will be fed to the network during training.

• <num_epochs>: the number of epochs for which training is performed.

• <iteration_ids>: a list of non-repeated integers representing the identifiers of different iterations for which the experiment is replicated. During each iteration, the corresponding id is used as seed to initialize the Random Number Generators (RNGs), so that each experiment replica will be independent of the others.

• <val_set_split>: specifies in which point to split the training set in order to obtain a subset of data that will be used as validation set.

• <augment_data>: boolean value specifying whether data augmentation should be used (see ch. 4 for the details on the transformations used to perform data augmentation).

• <whiten_data>: boolean value specifying whether data whitening should be used (see app. A for the details on whitening).

• <learning_rate>: the learning rate to be used during training.

• <lr_decay>: factor by which the learning rate is reduced, on the basis of a learning rate scheduling policy, at predefined epochs during training.


• <milestones>: list of the predefined epochs at which learning rate scheduling is applied.

• <momentum>: momentum coefficient to be used during training. • <l2_penalty>: coefficient to be used for L2 regularization loss.

• <pre_net_class>: a reference to the class of a network model to be used as pre-processing network from which features are extracted before they are fed to the net_class network during this experiment.

• <pre_net_mdl_path>: path to a file from which a trained network model will be loaded into the pre_net_class network.

• <pre_net_out>: symbolic name of the <pre_net_class> network layer from which features are extracted.

All the available Configurations are gathered in a list contained in the config.py file. Users can launch experiments corresponding to any of these Configurations. In case the users need to launch a new experiment (for example, in order to train a model using different hyper-parameters), they just need to create the corresponding Configuration and add it to the list in config.py. Alternatively, rather than creating a new Configuration, users may modify an existing one, if they prefer. The complete list of Configurations used in this thesis can be found in appendix B.
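For illustration only, a hypothetical Configuration entry might look like the following; the model class reference and the hyper-parameter values are made up for this example (they do not correspond to any configuration actually used in the experiments, which are listed in appendix B), and params is assumed to be imported as P as in listing 2.4:

Configuration(
    config_family=P.CONFIG_FAMILY_GDES,
    config_name='config_example',       # results saved in results/gdes/config_example
    net_class=basemodel.model.Net,      # hypothetical model module
    batch_size=64,
    num_epochs=20,
    iteration_ids=[0, 1, 2],            # three independent replicas
    val_set_split=40000,                # split point between training and validation data
    augment_data=True,
    whiten_data=False,
    learning_rate=1e-3,
    lr_decay=0.1,
    milestones=[10, 15],                # epochs at which the learning rate is reduced
    momentum=0.9,
    l2_penalty=5e-4,
    pre_net_class=None,                 # no pre-processing network
    pre_net_mdl_path=None,
    pre_net_out=None
)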

2.2

Training and Testing

The train.py and evaluate.py files can be used to launch the experiments. Specifically, the train.py script can be used to launch a training session, while the evaluate.py script can be used for testing.

On Linux systems, training can be launched with the command:

PYTHONPATH=<project_root> python <project_root>/train.py --config <config_family>/<config_name>

where <project_root> is a relative path to the directory where the project files are stored and the argument of the --config option specifies which of the experimental configurations defined in the config.py file we want to use. The selected configuration is identified by the string <config_family>/<config_name>, where <config_family> is either gdes or hebb (depending on whether we want to perform Gradient Descent or Hebbian training) and <config_name> is the name associated with the configuration in the config.py file. For example, we can type the command

PYTHONPATH=<project_root> python <project_root>/train.py --config gdes/config_base

which launches a training experiment using the model and the hyper-parameters specified in the configuration identified as gdes/config_base.

Similarly, testing can be launched with the command:

PYTHONPATH=<project_root> python <project_root>/evaluate.py --config <config_family>/<config_name>

In case the user prefers to launch an experiment directly from an Integrated Development Environment (IDE), without passing command-line arguments, it is also possible to provide the configuration identifier as a parameter DEFAULT_CONFIG in the params.py file.

The results of the experiments (both in the case of training and evaluation) are automatically saved in the folder results/<config_family>/<config_name>. Training should be performed first, so that a trained model can be saved in the save sub-directory. Then, when testing is launched, the trained model is automatically loaded and the test performance can be evaluated.

2.3

Data Loading

The experiments discussed in the next chapters of this thesis were performed on the CIFAR-10 dataset [30].

All the logic related to data loading was implemented in the data.py file. Specifically, a class DataManager was implemented for this purpose. A DataManager is an object that makes it possible to:

• Download the dataset (if necessary).

• Split the dataset into training/validation/test sets.

• Fetch the input images and the corresponding labels.

• Apply possible pre-processing operations.

• Feed the inputs to the network currently being used.

A DataManager is instantiated in the train.py and in the test.py file in order to obtain the training, validation and test sets.

A DataManager is created as shown in the code of listing 2.2.

Listing 2.2: DataManager creation.

import data
dataManager = data.DataManager(config)

where config is a Configuration object described in section 2.1. It is necessary to pass this object to the constructor because it contains important information related to data loading, like the size of the mini-batches to be fed to the network, the pre-processing operations to be applied on the data (data augmentation or data whitening, see ch. 4 and app. A) and information on how to divide training/validation/test sets.

The DataManager provides the methods shown in listing 2.3.

Listing 2.3: DataManager methods.

class DataManager:
    ...

    # Methods for obtaining train, validation and test set

    def get_train(self):
        ...

    def get_val(self):
        ...

    def get_test(self):
        ...

These methods return an iterable object which allows looping over the mini-batches extracted from the training, validation and test set respectively.
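A minimal usage sketch is shown below; it assumes that each mini-batch is returned as an (inputs, labels) pair, as is conventional in Pytorch data loaders, and the training logic is only hinted at:

import data

dataManager = data.DataManager(config)   # config: a Configuration from config.py
for inputs, labels in dataManager.get_train():
    ...                                  # forward pass and weight update on this mini-batch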

2.4

Neural Network Models

The neural network models that we used in our project are defined in Python modules stored under the basemodel package, for models trained with Gradient Descent, or the hebbmodel package, for models trained with the Hebbian approach. In each of these modules, a Net class is defined, which is going to represent our neural network. Users can define their own neural network by creating a new module and defining the desired Net class within it. According to Pytorch design, neural network models must extend the torch.nn.Module base class, the various layers that compose the network must be instantiated in the constructor and the neural network model must implement the forward() method, where it is defined how the network layers process the input in order to compute the output. An example skeleton of a Net class implementation is shown in the code of listing 2.4.

Listing 2.4: Net class example skeleton.

import torch.nn as nn
import params as P

class Net(nn.Module):
    # Layer names
    ... # String identifiers for the network layers are defined here
    FC = 'fc'
    CLASS_SCORES = FC # Symbolic name of the layer providing the class scores as output

    def __init__(self, input_shape=P.INPUT_SHAPE):
        super(Net, self).__init__()

        # Neural network layers are defined here
        ...
        self.fc = ... # Fully Connected layer

    # Here we define the flow of information through the network
    def forward(self, x):
        out = {}

        # Compute the output of the various network layers
        ...
        fc_out = ... # Output of the FC layer defined above

        # A dictionary is built and returned, where the keys are
        # identifiers of the layers and the values are the
        # corresponding outputs generated during this forward
        # operation.
        out[...] = ...
        ...
        out[self.FC] = fc_out
        return out

It is possible to observe that, before the constructor, some static constants are defined, for example FC = 'fc'. These are string identifiers, symbolic names associated with the

layers that will constitute our network.

Another observation is that the constructor takes an optional argument

input_shape, which represents the dimension of the desired inputs that the network is

going to process. The default value is the INPUT_SHAPE constant defined in params.py, which

corresponds to 3x32x32 images (three color channels, 32 pixels height, 32 pixels width). Finally, the forward() method takes an argument x, which is the input passed to the network, performs the various processing steps implemented in the body of the method and returns a dictionary of key-value pairs, where the keys are the identifiers of the network layers defined above and the values are the outputs generated by the corresponding layers. In this way, an external module that invokes this method obtains this dictionary with the outputs generated by each layer of the network, then it is possible to extract feature representations from any internal layer of the network simply by using the key corresponding to the desired layer. Most of the time, we need to extract just the output of the last layer of the network, containing the classification predictions; for this reason a specific constant CLASS_SCORES is defined to address specifically the entry of the output

dictionary containing the class scores.
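As an illustration of how the returned dictionary can be used (assuming a concrete Net implementation and the default 3x32x32 input shape stored as a tuple in P.INPUT_SHAPE):

import torch
import params as P

net = Net()
x = torch.randn(8, *P.INPUT_SHAPE)       # a dummy mini-batch of 8 images
out = net(x)                             # dictionary mapping layer names to outputs
class_scores = out[Net.CLASS_SCORES]     # output of the final classification layer
fc_features = out[Net.FC]                # same entry in this skeleton, since CLASS_SCORES = FC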

2.5

Hebbian Module

The core of our project resides in the file hebbmodel/hebb.py, where a Hebbian learning module and various support functions are implemented. The Hebbian learning module is equivalent to a convolutional layer, but it is trained following a Hebbian rule specified during construction and it learns as soon as an input is provided, while in the usual backpropagation-based algorithms learning occurs in a final backward step. Following the Pytorch naming conventions, the module was named HebbianMap2d, since it performs 2-dimensional convolutions. A Hebbian module equivalent to a Fully-Connected (FC) layer was not implemented. The reason is that the HebbianMap2d itself can be used to replace an FC layer; in fact, it is sufficient to use convolutional filters of the same dimensions as the inputs, and this is equivalent, for all intents and purposes, to an FC layer. The learning behavior of a layer can be enabled by setting the training mode (invoking the .train()


method of Pytorch module) or disabled by setting the evaluation mode (invoking the .eval() method of Pytorch module).

The Hebbian module can be used as described in the example code of listing 2.5.

Listing 2.5: Hebbian layer creation.

conv1 = H.HebbianMap2d(
    in_channels=<in_channels>,
    out_size=<out_size>,
    kernel_size=<kernel_size>,
    out=<out>,
    similarity=<similarity>,
    competitive=<competitive>,
    random_abstention=<random_abstention>,
    lfb_value=<lfb_value>,
    weight_upd_rule=<weight_upd_rule>,
    eta=<eta>,
    lr_schedule=<lr_schedule>,
    tau=<tau>
)

The meaning of the parameters is the following:

• <in_channels>: the number of input channels.

• <out_size>: the number of convolutional kernels, which is also the number of output channels of this layer.

• <kernel_size>: the size of the convolutional filters. It can be a tuple with two elements, which will represent height and width of the convolutional filter, or an integer, in which case height and width of the convolutional filter will be both equal to this value.

• <out>: the activation function of the neuron. In this argument an actual Python function must be passed, which defines how the layer neurons compute their output given an input, the layer weights and optional layer bias terms. The signature required for this function is out(x, w, b=None), x being the input, w the weights and b the bias.

• Other parameters: the subsequent parameters are used to modify the learning behavior of the module, by allowing different types of learning rules, different schemes of interaction among neurons, etc. All these features are described in the following.

The fundamental learning rule applied within the Hebbian module is the one of equation 1.8, described in section 1.1, which we report here for simplicity:

∆w = η y(x, w) (x − w) (2.1)

where y(x, w) is a similarity metric between the input and a weight vector. In our Hebbian module, this similarity metric is specified in the constructor by passing a function in place of the <similarity> argument, which defines how a similarity score for each neuron in the Hebbian layer is computed given the input, the layer weights and an optional bias term. The signature required for this function is similarity(x, w, b=None), x being the input, w the weights and b the bias. This is very similar to the out(x, w, b=None) function; in fact, in principle we could have used the same argument to define both a similarity metric for the update rule and an activation law for the output, but we preferred to define the two arguments separately, in order to give more flexibility to the user in the choice of these functions, opening the possibility for modeling more complex network behaviors.
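For instance, a similarity function based on the cosine of the angle between (flattened) input patches and kernels could be written as follows; this is only a sketch, and the tensor shapes expected by the actual thesis code may differ:

import torch

def cosine_similarity(x, w, b=None):
    # x: (num_inputs, d) flattened input patches; w: (num_neurons, d) flattened kernels
    x = torch.nn.functional.normalize(x, dim=1)
    w = torch.nn.functional.normalize(w, dim=1)
    s = x.matmul(w.t())                  # (num_inputs, num_neurons) similarity scores
    return s + b if b is not None else s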

Competitive learning, specifically in the form of Winner-Takes-All (WTA) competition (described in section 1.2), can be enabled by means of the parameter <competitive> in listing 2.5. This is a boolean value which activates competition when set to True. When competition is enabled, the weight update rule is modified to:

∆w = η r (x − w) (2.2)

where the quantity r, which replaces y(x, w), is 1 for the winner neuron and 0 for the others (the letter r stands for reward, the idea being that updates are reward-driven; in the WTA case, the reward depends on the outcome of the competition). The winner neuron is the neuron that achieved the highest similarity value of

its weight vector with the input, i.e. highest y(x, w).

The former rule needs to be further extended to the case of convolutional layers. In this scenario, the same set of kernels is applied to patches extracted from the images at different offsets. In our implementation, all the patches are extracted from the input image and treated as distinct inputs, each of which is fed to a set of neurons which apply the layer kernels. Furthermore, it should be considered that, typically, the module won't be processing a single image at a time, but rather a mini-batch of images. Patches are extracted from each of these images. As a result, the module will be processing a macro-batch of patches, each of which is treated as a distinct input and fed to a set of neurons which apply the layer kernels. Competition is performed among the neurons acting on the same input patch and the respective ∆w's are computed and kept in memory. This procedure is applied to the whole macro-batch of patches extracted from the inputs. Eventually, the module ends up with an array of ∆w's which should be applied to the corresponding neurons. The next problem to face is related to the weight sharing policy in convolutional layers: when an input is processed, a set of neurons applies the same kernel before the update. After the update, these neurons are still forced to share the same weights. However, the ∆w's computed for each of these neurons might be different. Therefore, we need to perform an aggregation step in which the different ∆w's that should be applied on the same kernel are used to produce a global ∆wagg which is actually used

for the update. A very simple type of aggregation that could be performed is an average, over the macro-batch, of the ∆w's related to the same kernel. However, most of the ∆w's that we encounter are just null vectors, since they correspond to inputs over which the neurons associated with the considered kernel didn't win the competition. These null vectors drastically limit the size of the resulting update step, which is unwanted because the size of the update step should rather be controlled with the learning rate hyper-parameter. It seems more appropriate to perform the aggregation step by computing a weighted average, over the macro-batch, of the ∆w's related to the same kernel, where the weights are 1 for ∆w's associated to victories and 0 otherwise (i.e. we compute the average only over those ∆w's corresponding to inputs where the neurons associated to the considered kernel won). Put another way, the values of the weighted average coefficients are the same r's of equation 2.2. In formula, the aggregated update ∆wagg for a given kernel is computed as:

∆wagg = (Σk rk ∆wk) / (Σk rk) (2.3)

where rk and ∆wk are the r and ∆w values achieved on the k-th input patch by the neuron

sharing the considered kernel.
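A possible vectorized sketch of this aggregation step is given below (the tensor shapes are hypothetical and do not necessarily match the thesis implementation):

import torch

def aggregate_updates(delta_w, r):
    # delta_w: (num_patches, num_kernels, kernel_elems) per-patch updates
    # r: (num_patches, num_kernels) step-modulation coefficients (1 for wins, 0 otherwise)
    weighted_sum = (r.unsqueeze(-1) * delta_w).sum(dim=0)    # numerator of eq. 2.3
    total_r = r.sum(dim=0).clamp(min=1e-12).unsqueeze(-1)    # denominator, guarded against zero
    return weighted_sum / total_r                            # one aggregated update per kernel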

Another feature called random abstention can be enabled by means of the argument <random_abstention> in the constructor described in listing 2.5. This is a boolean value which can be set to True to enable said feature. Random abstention is a mechanism designed to balance the number of victories associated with the different kernels; in fact, it might be possible that, in some cases, a few kernels are associated with a large number of victories, while others almost never win. In situations like this, the kernels that do not produce any victories never get updated, therefore they are prevented from learning to recognize some pattern. With random abstention, the Hebbian module keeps a counter of the number of victories associated with each kernel. Neurons may randomly decide to refrain from competition, by forcefully setting their similarity score y(x, w) to a minimum value, so that they will be automatically precluded from obtaining a victory. The abstention probability should be computed so that neurons associated with top-winning kernels have high probability of refraining from competition, while neurons associated with a small number of victories have low probability of being subject to random abstention. The function that we used to compute the abstention probability pi of a neuron

associated with a generic kernel i is:

pi = (vi − minj(vj)) / (maxj(vj) − minj(vj) + ρ) (2.4)

a positive smoothing constant, the purpose of which is described hereafter. According to eq. 2.4, the abstention probability is zero for neurons associated with a minimum number of victories, it grows proportionally with the victory count vi and reaches its maximum

for the neurons associated with the highest number of victories. This is also represented graphically in figure 2.2. However, the maximum value of the abstention probability also depends on the smoothing constant ρ and on the victory gap between the top-winning and the least-winning neurons, i.e. the difference maxj(vj) − minj(vj). In particular:

• When ρ is much larger than the victory gap, all the abstention probabilities will be very small.

• When the victory gap is comparable to ρ, the abstention probabilities for the highly-winning neurons will be significant.

• When the victory gap is much larger than ρ, the abstention probabilities for the highly-winning neurons will be high, approaching 1 for the top-winning neurons.


Figure 2.2: Random abstention probability.

It is clear now that the purpose of the constant ρ is to provide a metric of comparison to establish whether a given value of the victory gap is critical, and therefore intensive random abstention should occur, or it is not significant, and therefore random abstention can be relaxed. Specifically, when the victory gap is much smaller than ρ, victories can be considered roughly balanced, but when the victory gap is comparable to ρ, the victories should be considered unbalanced and random abstention should occur. Our choice for the parameter ρ is

ρ = I / N (2.5)

where I is the number of inputs in the macro-batch currently being processed and N is the number of kernels in our layer. The ratio between these two quantities is the expected number of victories attributed to each kernel per macro-batch, under the assumption that victories are uniformly distributed over kernels, and it is reasonable to consider the victory gap significant when it becomes comparable to this value.
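The abstention probabilities of eq. 2.4, with ρ chosen as in eq. 2.5, can be sketched as follows (illustrative code, not the thesis implementation):

import torch

def abstention_probabilities(victory_counts, num_inputs):
    # victory_counts: one victory counter per kernel; num_inputs: inputs in the macro-batch
    v = victory_counts.float()
    rho = num_inputs / v.numel()                       # eq. 2.5: expected victories per kernel
    return (v - v.min()) / (v.max() - v.min() + rho)   # eq. 2.4

# A neuron then abstains (its similarity score is forced to a minimum value)
# with probability given by the corresponding entry of the returned tensor.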

The parameter <lfb_value> of listing 2.5 makes it possible to enable different types of lateral interaction among neurons. A string can be passed in place of this parameter in order to enable lateral interaction in the form used by Self-Organizing Maps (SOM), introduced in section 1.3. Competition must also be enabled by setting competitive to True. Different constants are defined in the HebbianMap2d class, which represent strings that can be passed as lfb_value in order to enable different types of neighborhood interaction. The types of neighborhood interaction provided by the module are:

• HebbianMap2d.LFB_GAUSS: Gaussian neighborhood interaction (fig. 2.3a).

h(dj,i) = e^(−x²/(2σ²)) (2.6)

• HebbianMap2d.LFB_EXP: Exponential neighborhood interaction (fig. 2.3b).

h(dj,i) = e^(−x/σ) (2.7)

• HebbianMap2d.LFB_DoG: Difference-of-Gaussians neighborhood interaction (fig. 2.3c).

h(dj,i) = 2e^(−x²/(2σ²)) − e^(−x²/(4σ²)) (2.8)

• HebbianMap2d.LFB_DoE: Difference-of-Exponentials neighborhood interaction (fig. 2.3d).

h(dj,i) = 2e^(−x/σ) − e^(−x/(2σ)) (2.9)

Figure 2.3: Different types of neighborhood functions: (a) Gaussian, (b) Exponential, (c) DoG, (d) DoE.

When the SOM mode is enabled, the module behavior is modified as follows: competition is performed as usual, but the winner provides a lateral feedback signal to its neighbors, the value of which is determined by the chosen neighborhood function h(dj,i). The quantity

dj,i is the distance between the neuron j receiving the feedback and the neuron i winner of the competition in the physical topology in which neurons are arranged. In our implementation, neurons can be arranged in a 1-dimensional, 2-dimensional or 3-dimensional lattice topology. The shape of the topology can be controlled by means of the <out_size> parameter in the module constructor: a 1-dimensional lattice with N neurons can be created by passing the value N as an integer or as a unit length tuple or list, a 2-dimensional lattice with height H and width W can be created by passing a length-two tuple or list [H, W] and a 3-dimensional lattice with height H, width W, depth D can be created by passing a length-3 tuple or list [H, W, D]. The distance in the lattice topology is measured in terms of the L∞ metric, i.e.

dj,i = maxk |coordk(j) − coordk(i)| (2.10)

where coord_k(j) and coord_k(i) denote the k-th coordinate of neuron j and neuron i in the lattice topology. Neighborhood functions like those in fig. 2.3 are characterized by a parameter σ that controls their spread. This parameter is initially set equal to the radius of the lattice and decreased exponentially over time (measured in terms of update steps), with time constant equal to <tau> (see listing 2.5) update steps. Once the function h(d_{j,i}) is evaluated for a given neuron j, the weight update is computed according to eq. 2.2, but instead of having r = 0 or r = 1, we have r = h(d_{j,i}). In fact, the case in which r is 1 for the winner neuron and 0 for the others can be considered a particular case of lateral interaction in which the neighborhood function is:

h(d_{j,i}) = 1 if d_{j,i} = 0, and 0 otherwise    (2.11)

Actually, the module also offers the possibility of using a type of lateral interaction like that of eq. 2.11, but with the lateral feedback signal provided to non-winning neurons set to an arbitrary value, not only 0. This can be achieved by setting the <lfb_value> parameter in listing 2.5 to the desired value. The default WTA competitive behavior can be obtained by setting <competitive> to True and passing 0 as <lfb_value>. As we already observed, once the ∆w's are computed, they need to be aggregated into a unique update step for each kernel. When lateral feedback interaction is enabled, this can be done in the same way as in the simple case of WTA competition, i.e. with equation 2.3, but again considering r = h(d_{j,i}) (and not just r = 0 or r = 1). In other words, aggregation is performed by computing a weighted average of the ∆w's, where the weights are given by the h(d_{j,i}) values themselves.
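The following is a minimal sketch, not the module's actual code, of the lattice machinery behind the SOM mode: L∞ distances between neurons arranged on a 2-dimensional lattice (eq. 2.10), an exponentially decaying σ, and a Gaussian lateral feedback r_j = h(d_{j,winner}). All names and constants are illustrative:

import math
import torch

def linf_distances(lattice_shape):
    # One row of coordinates per neuron in the lattice
    axes = [torch.arange(s) for s in lattice_shape]
    coords = torch.stack(torch.meshgrid(*axes, indexing='ij'), dim=-1)
    coords = coords.reshape(-1, len(lattice_shape)).float()
    # d_{j,i} = max_k |coord_k(j) - coord_k(i)|   (eq. 2.10)
    return (coords[:, None, :] - coords[None, :, :]).abs().max(dim=-1).values

D = linf_distances([4, 4])           # 16 neurons on a 4x4 lattice
sigma0, tau, step = 2.0, 1000, 0     # initial radius and decay time constant (illustrative)
sigma = sigma0 * math.exp(-step / tau)
winner = 5
r = torch.exp(-D[:, winner] ** 2 / (2 * sigma ** 2))   # Gaussian lateral feedback for each neuron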

In listing 2.5 it is possible to pass different keys in place of the parameter <weight_update_rule> in order to use different variants of the weight update rule. Two keys are defined in the Hebbian module: HebbianMap2d.RULE_BASE and HebbianMap2d.RULE_HEBB. Their effect is to modify the way the r coefficient in eq. 2.2 is computed. When HebbianMap2d.RULE_BASE is used, the coefficient is simply:

r = h(d_{j,i})    (2.12)

where the function h() is that of eq. 2.11 when WTA competition is enabled, any other neighborhood function when the SOM behavior is enabled, or simply a constant equal to 1 when competition is disabled. When HebbianMap2d.RULE_HEBB is used, the coefficient r is computed as:

r = h(d_{j,i}) y(x, w)    (2.13)

where h() behaves in the same way as above, but there is an additional multiplicative contribution from the term y(x, w). We can use this setting and, at the same time, disable the competitive flag, in order to obtain the original Hebbian rule without competition, i.e. that of eq. 1.8:

∆w = η y(x, w) (x − w). (2.14)

In fact, when these settings are enabled, r becomes simply equal to y(x, w), and the rule above is obtained. It is also possible to enable competition or self-organization and use HebbianMap2d.RULE_HEBB in order to obtain further variants of the learning rules discussed above. What all these rules have in common is the general structure based on a vector (x − w), representing the direction along which an update step is taken, and a coefficient r, which modulates the size of the step. In any case, batch aggregation can always be performed according to the rule in eq. 2.3, i.e. a weighted average of the various ∆w's is computed, where the weights are determined by the step modulation coefficients r.

Actually, a small correction should be applied to the update rules discussed so far. The reason is that we made the implicit assumption that the term r is positive, which is not always true in practice. Therefore, we now discuss how the learning rules are modified in order to account for negative r values as well. Recall that the original learning rule of eq. 1.8 was composed of two terms: a Hebbian reinforcement term η y(x, w) x and a weight decay term −η y(x, w) w. Similarly, when the rule is modified by replacing y(x, w) with a generic coefficient r, we can still point out the two contributions ηrx and −ηrw. The problem with these rules lies in the definition of the weight decay term, which does not account for the sign of the multiplicative coefficient. In fact, when the coefficient r (or y(x, w)) is positive, the weight decay term actually causes the weight vector to shrink. On the other hand, when r (or y(x, w)) is negative, the effect of the weight decay term is reversed, causing the weight vector to grow instead of shrinking. The weight decay term is therefore modified to −η|r|w, i.e. by taking the absolute value of r, thus removing the sign component. At this point, the weight update rule can be written as

∆w = η sign(r) |r| x − η |r| w,

finally leading to the update equation

∆w = η |r| (sign(r) x − w). (2.15)

The batch aggregation rule of eq. 2.3 should also be corrected, because the coefficients used when the weighted average is computed should all be positive. Therefore, the aggregation rule is modified to:

∆w_agg = ( Σ_k |r_k| ∆w_k ) / ( Σ_k |r_k| )    (2.16)
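A minimal sketch of eqs. 2.15 and 2.16 for a single kernel follows; it assumes flattened patches and one step-modulation coefficient per patch, and the function name, shapes and constants are illustrative rather than the module's actual implementation:

import torch

def hebbian_update(w, x, r, eta):
    # Per-patch updates: delta_w_k = eta * |r_k| * (sign(r_k) * x_k - w)   (eq. 2.15)
    delta = eta * r.abs().unsqueeze(1) * (r.sign().unsqueeze(1) * x - w)
    # Aggregation: weighted average of the per-patch updates with weights |r_k|   (eq. 2.16)
    return (r.abs().unsqueeze(1) * delta).sum(0) / (r.abs().sum() + 1e-12)

w = torch.randn(27)        # flattened 3x3x3 kernel (toy size)
x = torch.randn(100, 27)   # 100 input patches
r = torch.randn(100)       # step-modulation coefficients, e.g. h(d) or h(d)*y(x, w)
w = w + hebbian_update(w, x, r, eta=0.01)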

The parameter <eta> in listing 2.5 can be used to set the desired learning rate of the Hebbian module, corresponding to the coefficient η of the weight update rule. A learning rate scheduling policy can also be provided by means of the parameter <lr_schedule>. A callable object can be passed in place of this parameter, with signature lr_schedule(eta). The argument eta is the learning rate to be updated. The module performs a call at the end of every update step, passing the current learning rate as argument to the callable. The latter returns the new learning rate that will be used by the module during the next update step.
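For illustration, two possible lr_schedule callables are sketched below; the constructor call at the end is hypothetical and only mirrors the placeholder parameter names of listing 2.5:

def exp_decay(eta):
    return eta * 0.99   # multiplicative decay applied at every update step

def constant(eta):
    return eta          # keep the learning rate fixed

# Hypothetical construction, assuming the placeholder names of listing 2.5:
# layer = HebbianMap2d(..., eta=0.1, lr_schedule=exp_decay)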

A teacher signal can be provided to a Hebbian layer to allow supervised Hebbian learning as described in section 1.4. This can be done by means of the set_teacher_signal() method, shown in listing 2.6:

Listing 2.6: Method for setting teacher signal.

import torch.nn as nn

class HebbianMap2d(nn.Module):
    ...

    def set_teacher_signal(self, y):
        ...

When the method is invoked, the desired teacher signal is passed in the y argument as a matrix with one column per layer kernel and one row per mini-batch input. The element in position (i, j) of this matrix is the teacher signal to be imposed on the neurons operating on input i and sharing kernel j. All the patches extracted from input i of the mini-batch will share the same teacher signals [(i, 1), (i, 2), . . . ] associated with that input. The teacher signal can be unset by calling the method and passing None in the y argument. When training the layer, the teacher signal should be set before the input is fed to the layer and unset afterward. The presence of a teacher signal modifies the module behavior as follows (a usage sketch is given after the list):

• When competition is enabled, the teacher signal is used to multiply the y(x, w) value of the corresponding neuron before competition occurs. The result of this multiplication is used as the actual score to determine the winner of the competition. The teacher signal of some neurons can be set to 1, so that they will not be affected during the competition. At the same time, we can set the teacher signal of other neurons to a small value, so that they will be prevented from winning the competition. In this way, the teacher signal can be used to forcefully exclude some neurons from the competition.

• When competition is disabled, the teacher signal replaces h(d_{j,i}) in the weight update rule. Again, the teacher signal of some neurons can be set to 1 in order to allow those neurons to perform an update, or it can be set to 0 in order to inhibit any update step.
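As a hedged usage sketch (names, shapes and the helper function are illustrative, not taken from the project code), a one-hot-like teacher matrix for supervised training of a final classification layer could be built and applied as follows:

import torch

def one_hot_teacher(labels, num_kernels, off_value=0.):
    # One row per mini-batch input, one column per kernel; the column of the
    # target class gets 1, all the others get off_value.
    y = torch.full((labels.size(0), num_kernels), off_value)
    y[torch.arange(labels.size(0)), labels] = 1.
    return y

# teacher = one_hot_teacher(labels, num_kernels=10)
# layer.set_teacher_signal(teacher)   # set before feeding the mini-batch
# out = layer(inputs)
# layer.set_teacher_signal(None)      # unset afterward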


Chapter 3

Implementation

In this chapter, we delve into the implementation of the project described in chapter 2 in more detail. So far we have focused on the behavior of the modules from an external point of view, but now we take a look at their internal structure, showing the implementation details and the solutions used to guarantee efficiency. The main implementation requirements that we targeted are:

• Code portability both on CPU and on GPU.

• Efficient implementation and, in particular, exploitation of parallelism on GPU, when this is available.

In order to satisfy these requirements, we strongly relied on Pytorch [40] primitives. In particular, we used parallelized primitives, instead of explicit loops, whenever there was a chance to do so, in order to exploit GPU parallelism as much as possible.
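As a toy illustration of this principle (not taken from the project code), the similarity of every input patch against every kernel can be computed with a single batched matrix product instead of two nested Python loops:

import torch

patches = torch.randn(4096, 75)   # e.g. flattened 5x5x3 patches
kernels = torch.randn(32, 75)

# Loop version (slow, CPU-bound):
# sims = torch.stack([torch.stack([p.dot(k) for k in kernels]) for p in patches])

# Vectorized version, same result, runs in parallel on GPU if the tensors are moved there:
sims = patches.matmul(kernels.t())   # shape [4096, 32]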

3.1 Configurations

In this section we take a look at the config.py file, where the Configurations described in the previous chapter are created. The implementation of the Configuration class is shown in the code of listing 3.1:

Listing 3.1: Configuration class.

import params as P

class Configuration:
    def __init__(self,
                 config_family,
                 config_name,
                 net_class,
                 batch_size,
                 num_epochs,
                 iteration_ids,
                 val_set_split,
                 augment_data,
                 whiten_data,
                 learning_rate=None,
                 lr_decay=None,
                 milestones=None,
                 momentum=None,
                 l2_penalty=None,
                 pre_net_class=None,
                 pre_net_mdl_path=None,
                 pre_net_out=None):
        self.CONFIG_FAMILY = config_family
        self.CONFIG_NAME = config_name
        self.CONFIG_ID = self.CONFIG_FAMILY + '/' + self.CONFIG_NAME

        self.Net = net_class

        self.BATCH_SIZE = batch_size
        self.NUM_EPOCHS = num_epochs
        self.ITERATION_IDS = iteration_ids

        # Paths where to save the model
        self.MDL_PATH = {}
        # Path where to save accuracy plot
        self.ACC_PLT_PATH = {}
        # Path where to save kernel images
        self.KNL_PLT_PATH = {}
        for iter_id in self.ITERATION_IDS:
            # Path where to save the model
            self.MDL_PATH[iter_id] = P.RESULT_FOLDER + '/' + self.CONFIG_ID + '/save/model' + str(iter_id) + '.pt'
            # Path where to save accuracy plot
            self.ACC_PLT_PATH[iter_id] = P.RESULT_FOLDER + '/' + self.CONFIG_ID + '/figures/accuracy' + str(iter_id) + '.png'
            # Path where to save kernel images
            self.KNL_PLT_PATH[iter_id] = P.RESULT_FOLDER + '/' + self.CONFIG_ID + '/figures/kernels' + str(iter_id) + '.png'
        # Path to the CSV where test results are saved
        self.CSV_PATH = P.RESULT_FOLDER + '/' + self.CONFIG_ID + '/test_results.csv'

        # Define the splitting point of the training batches between training and validation datasets
        self.VAL_SET_SPLIT = val_set_split

        # Define whether to apply data augmentation or whitening
        self.AUGMENT_DATA = augment_data
        self.WHITEN_DATA = whiten_data

        self.LEARNING_RATE = learning_rate  # Initial learning rate, periodically decreased by a lr_scheduler
        self.LR_DECAY = lr_decay  # LR decreased periodically by a factor of 10
        self.MILESTONES = milestones  # Epochs at which LR is decreased
        self.MOMENTUM = momentum
        self.L2_PENALTY = l2_penalty

        self.PreNet = pre_net_class
        self.PRE_NET_MDL_PATH = pre_net_mdl_path
        self.PRE_NET_OUT = pre_net_out

The constructor simply stores the arguments in internal attributes of the Configuration object. There are only a few additional attributes w.r.t. the constructor arguments, which are:

• self.CONFIG_ID: stores a unique identifier for the Configuration. The identifier has the form <config_family>/<config_name>.

• self.MDL_PATH: collection of paths to the files where the trained models will be stored. A trained model is generated for each training iteration performed with a different RNG seed. Each model is saved to a file named model<id>.pt, where <id> is the integer used to identify the iteration that produced that model.

• self.ACC_PLT_PATH, self.KNL_PLT_PATH: collections of paths to the files where figures obtained during training will be stored. The same naming convention as for self.MDL_PATH is used. The figures obtained during training are graphs showing how accuracy varies over epochs (corresponding to self.ACC_PLT_PATH) and images showing the layer-1 learned filters (corresponding to self.KNL_PLT_PATH).

• self.CSV_PATH: path to the .csv file where the test results will be stored.

In the same file, the list of all the Configurations is defined (listing 3.2).

Listing 3.2: Configuration list.

CONFIG_LIST = [
    Configuration(
        ...
    ),
    ...
]

CONFIGURATIONS = {}
for c in CONFIG_LIST: CONFIGURATIONS[c.CONFIG_ID] = c

The name of the Configuration list is CONFIG_LIST. If a user wants to prepare a new experiment, they have to add the corresponding Configuration to it. At the end of the config.py file, a dictionary named CONFIGURATIONS is also created and filled with all the Configurations in CONFIG_LIST. For each Configuration, the corresponding key used to store it in the dictionary is the CONFIG_ID. This allows fast access to a Configuration when the corresponding identifier is provided, which is exactly what happens when an experiment is launched: the user provides the identifier of a Configuration, the script extracts it from the CONFIGURATIONS dictionary, and the experiment is launched. In fact, the CONFIGURATIONS dictionary is used exactly in this way in the evaluate.py and train.py scripts, which are illustrated in the next section.
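As an illustration, a script can resolve a Configuration as in the following sketch; the identifier string is only a placeholder in the <config_family>/<config_name> format, and argument parsing is omitted:

import config as C

config_id = '<config_family>/<config_name>'   # e.g. taken from the command line
config = C.CONFIGURATIONS[config_id]
print(config.NUM_EPOCHS, config.BATCH_SIZE)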

3.2 Training and Testing

When training and testing experiments are launched, the code of the function launch_experiment() is executed. This function is defined in evaluate.py and is reported in listing 3.3.
