
POLITECNICO DI MILANO

School of Industrial and Information Engineering

Master of Science in Computer Science and Engineering

Deep Learning Methods for One-Shot Learning on Image

Recognition

Supervisor: Prof. Matteo MATTEUCCI

Student: Bardh SHABANI (10598678)


I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.


Abstract

One of the biggest challenges in the field of machine learning is being able to learn new concepts under the constraint of limited data. While humans can grasp new visual concepts from just one example, contemporary machine learning algorithms require hundreds or thousands of examples and can be very computationally expensive. These algorithms do not generalize well on problems consisting of small datasets. One-shot learning is particularly concerned with this sort of problem, i.e., being able to predict the class of an input image where only one single example is available for each class.

In this thesis, we develop a series of deep learning models for the one-shot classification problem. In-depth analysis of these models is performed continuously, yielding relevant information that is used as a guide towards the adjustments necessary for improving the models. Consequently, we gain valuable insights into what impacts network performance the most.

Four different models were developed as step-by-step improvements in terms of architecture, training procedure, and data preprocessing. Finally, using the best performing techniques attained during the analysis of our models, combined with the power of transfer learning, we were able to achieve near state-of-the-art performance in one-shot classification on the Omniglot dataset.


Table of Contents

1 Introduction
  1.1 Overview
  1.2 Approach
  1.3 Thesis organization
2 State of the Art
3 The Omniglot Dataset
  3.1 Structure
  3.2 One-Shot Performance Evaluation
  3.3 State of the Art in the Omniglot Dataset
4 Siamese Neural Network
  4.1 Deep Siamese Networks for Image Verification
  4.2 Model Architecture
  4.3 Training procedure
  4.4 One Shot Task
5 Triplet Models
  5.1 Triplet Loss
  5.2 TriameseNet
    5.2.1 Training procedure
  5.3 TriNet
    5.3.1 Training procedure
  5.4 One Shot Task
6 Transfer Learning Model
  6.1 Model Architecture
  6.2 Training
  6.3 One-Shot Task
7 Implementation Details
8 Results
9 Conclusions
Bibliography


Table of Figures

Figure 2.1 Letters of different alphabets from the Omniglot dataset [2]
Figure 3.2 a) First, three random letters are selected, and then 2 drawings are chosen for each letter
Figure 4.1 High-level architecture of a Convolutional Siamese Neural Network
Figure 4.2 Architecture of the Siamese Neural Network [1]
Figure 4.3 Binary Crossentropy Loss during the first phase of training
Figure 4.4 One-shot accuracy monitoring of the model during the first phase of training
Figure 4.5 Binary Crossentropy Loss during the second phase of training
Figure 4.6 One-shot accuracy monitoring of the model during the second phase of training
Figure 5.1 Distance between positive and negative pairs before and after training [10]
Figure 5.2 The first two letters are from the Sylheti alphabet while the last two are from the Latin alphabet
Figure 5.3 TriameseNet Architecture
Figure 5.4 One-shot accuracy in the training and validation set over 40k iterations in the first phase
Figure 5.5 Triplet loss over 40k iterations in the first phase
Figure 5.6 One-shot accuracy in the training and validation set over 40k iterations in the second phase. Initial weights are loaded from the first phase of training
Figure 5.7 Triplet loss over 40k iterations in the second phase. Initial weights are loaded from the first phase of training
Figure 5.8 Mappings of distances between features on the sigmoid function
Figure 5.9 Training batch created by the Batch Hard algorithm with P=3 and K=2
Figure 5.10 One-shot accuracy of TriNet using high weight decay as regularization
Figure 5.11 Loss of TriNet using high weight decay as regularization
Figure 5.12 Distance percentiles between embeddings in batch during training iterations, when using weight decay as regularization
Figure 5.13 One-shot accuracy of TriNet using dropout as regularization
Figure 5.14 Loss of TriNet using dropout as regularization
Figure 5.15 Distribution of distances between embeddings in batch during training iterations when using dropout as regularization
Figure 5.16 Distribution of 2-Norm embeddings in batch during training iterations when using dropout as regularization
Figure 5.17 Distribution of embedding values in batch during training iterations when using dropout as regularization
Figure 5.18 Graph of the soft-margin function. When D is zero the function has value 0.69
Figure 5.19 Distribution of embedding values in batch during training iterations when using ReLu activation in embeddings as regularization
Figure 5.20 Distribution of 2-Norms of features of embeddings in batch during training iterations when using ReLu activation in embeddings as regularization
Figure 5.21 Distribution of distances between embeddings in batch during training iterations when using ReLu activation in embeddings as regularization
Figure 5.22 Number of features in the embeddings the network is using during training iterations when using ReLu activation in embeddings as regularization
Figure 5.23 Loss during training iterations when using ReLu activation in embeddings as regularization
Figure 5.24 Comparison of one-shot accuracy on validation set between TriNet with ReLu in embeddings (pink line) and TriNet without ReLu in embeddings (blue line)
Figure 5.25 One-shot accuracy on training and validation set of TriNet with ReLu in the final embeddings
Figure 6.1 Sample images from the ImageNet Dataset
Figure 6.2 One-shot accuracy of our adjusted ResNet-18 on training and validation set
Figure 6.3 Loss during training of our adjusted ResNet-18
Figure 6.4 Distance percentiles of our adjusted ResNet-18 model during training
Figure 6.5 Embedding percentiles of our adjusted ResNet-18 model during training
Figure 6.6 2-Norm feature percentiles of our adjusted ResNet-18 model during training


1 Introduction

1.1 Overview

Humans, in general, are able to understand new concepts quickly and then also recognize variations of these concepts. Furthermore, aside from being able to generalize from a perceived example to its variations, people can also recognize similarities and differences between different concepts.

Traditional machine learning algorithms have proven to be very successful in many problems, like image recognition, speech recognition, recommendations, search engines, etc. However, these algorithms require huge datasets in order to achieve state-of-the-art results, and when tested on data distributions for which little supervised information is given, results are often not satisfactory. There are many scenarios where the ability to generalize to new concepts without retraining is crucial to the success of the task. An illustration of this is the ability to classify a new concept where only one example of each possible class is available. This is called one-shot learning.

A very important topic where one-shot learning is required is the problem of person re-identification: the capability of associating images of the same person taken on different occasions or from different cameras. Humans easily re-identify others by leveraging descriptors based on a person’s face, height, build, clothing, hairstyle, walking pattern, and other features. Unfortunately, this easy task for humans is an extremely difficult problem for machines. Traditionally, face recognition is used for this task. However, face recognition only works when the subject is close enough and facing the camera; in many scenarios, because of the environment and the motion blur caused by the cameras, that is not the case. Person re-identification is of significant importance because of its applications: it is used for tracking a particular person across multiple cameras and detecting the trajectory of a person for forensic, surveillance, and security purposes.

Most successful computing models that deal with one-shot learning use extra domain-specific knowledge of the dataset. As a result, these methods cannot generalize to problems from other domains; in other words, they are not scalable. In this thesis, we develop and compare deep learning models which approach this problem in a generalized manner, without using any prior domain-specific data, so that the algorithms can be extended to datasets more complex than the one we use.


1.2 Approach

Generally, computer vision methods for one-shot learning fall into two categories: feature learning and metric learning. The methods this thesis presents are based on feature learning, where the models seek to extract meaningful features from the input images and perform classification based on them.

Considering the computational power limitations, our models are trained only for character recognition using the Omniglot dataset [1]. However, since none of our methods rely on prior domain-specific knowledge, they can also be used on other, bigger, and more complex datasets. In this thesis, we develop a series of deep learning models for the one-shot classification problem. In-depth analysis of these models is performed, obtaining relevant information that is used as a guide towards the adjustments necessary for improving the models.

We start by implementing the Siamese neural network presented by Koch et al. [2] with slight variations. The model is first trained on the standard verification task, learning to predict whether two given characters belong to the same class. It is expected that models that perform well on the verification task should also generalize to the one-shot classification task. The one-shot task is executed by comparing the new image of unknown class to each test image in a pairwise manner. The verification network outputs the probability that the input image belongs to a given test image’s class; the input image is then assigned to the class of the test image with which it showed the highest probability. This neural network architecture is also the starting point from which we develop our second model.

In contrast to the Siamese network, our second model, instead of learning to output the probability that two images belong to the same class, learns embeddings from which we can calculate the L2 distance between two inputs. In other words, it maps the input image to a numerical feature space, where we can easily calculate distances mathematically. This way, in a one-shot classification scenario, we choose the class from which the distance is minimal. During learning, the algorithm learns to “push” instances from different classes apart and “pull” instances from the same class together. This is done implicitly using the triplet loss function while choosing the triplets in a “smart” and efficient way. Although the technique used in this model is very beneficial, it presents many constraints in terms of architecture and hyperparameter choice. These constraints motivate us to choose a stronger architecture in terms of learning capacity.

For the final model, by taking advantage of transfer learning techniques, we adopt a strong architecture that enables us to bypass the constraints and limitations presented by the techniques used in the previous model. We use transfer learning with a slightly modified version of ResNet-18. We initialize the weights trained on the ImageNet dataset and train the whole network, fine-tuning the pre-trained weights while also learning the


weights of the adjusted layers. The final results show that this model easily outperforms the others.

1.3 Thesis organization

The overall structure of this thesis comprises nine chapters, including this introductory chapter. An overview of the following chapters, with a brief description, is given below.

Chapter 2 Contains a review of the literature and background necessary to understand the rest of the thesis. It starts by depicting the state of the art in one-shot learning and then explains topics of special importance for the methods used in this thesis.

Chapter 3 Describes the dataset used for training the models treated in this thesis, as well as the evaluation method that is pre-defined for models trained on this dataset. In the end, it shows the state-of-the-art results of methods trained on this dataset.

Chapter 4 Describes the first model considered in this thesis, starting from the architecture, then describing the training procedure, as well as the results attained with this method.

Chapter 5 Starts with an extended literature review specific to the models treated in this chapter. Afterwards, it explains two methods implemented as step-by-step improvements over the first model. Particular emphasis is put on the importance of different plots, representing different kinds of information about the inner state of the neural network, as a means for making the right decisions towards model improvement.

Chapter 6 Examines the drawbacks of the model described in the previous chapter while providing a better architecture using transfer learning techniques. It achieves near state-of-the-art performance in the one-shot classification task.

Chapter 7 Depicts the implementation details, focusing on computational optimization using parallel algorithms.


Chapter 8 Discusses and compares the results attained with the implemented models in the one-shot classification task, including comparisons with other baseline and state-of-the-art models.

Chapter 9 Concludes with the key aspects of this thesis while also discussing possible approaches that could further improve the models discussed.


2 State of the Art

This thesis was initially inspired by the work of Gregory Koch et al. [2]. That paper introduced a deep learning model based on a siamese convolutional neural network. The authors trained the network on the standard verification task for image recognition, comparing a test image against the class-identity image and learning to estimate the probability that the input pair belongs to the same class. The same network was then used in the one-shot classification task without any further retraining.

The idea of addressing one-shot learning problems using machine learning algorithms started in the early 2000s with the work of Kenneth Yip and Gerald Jay Sussman. In [3], [4], Li Fei-Fei et al. introduced a variational Bayesian framework for one-shot image classification, using the premise that previously learned classes can be leveraged to help forecast future ones when few examples are available from a given class.

Lake et al. used cognitive science theories, introducing a method called Hierarchical Bayesian Program Learning (HBPL), and achieved state-of-the-art results on one-shot learning for character recognition [5], in a later publication surpassing even human performance on this task [6]. Across their many publications, they modelled the process of drawing characters generatively in order to break down the image into small pieces [1], [7].

Many have also approached the one-shot learning problem using metric learning techniques. Wolf et al. used a bag of features from which they tried to learn a similarity function using metric learning, for image classification of insects [8]. More recently, Google DeepMind developed a novel approach to one-shot learning based on deep neural features and external memory-augmented neural networks, which enables rapid learning, using a training procedure that matches the testing conditions [9]. They show state-of-the-art results surpassing all deep learning methods in the field. They also contribute by defining tasks that can be used to benchmark other approaches on both the ImageNet dataset and small-scale language modelling.

Schroff et al. from Google introduced FaceNet [10], a system that directly learns a mapping from face images to a Euclidean space in which distances correspond to a measure of face similarity. Although that paper was not directly intended for one-shot classification, the technique can be applied directly to the problem we are dealing with: once the Euclidean space has been produced, tasks such as face recognition, verification, clustering, and also one-shot classification can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Hermans et al., in their paper on triplet loss for person re-identification [11], present an improvement of FaceNet’s method. They introduce a novel technique for choosing “hard triplets” in a smart and efficient way for the triplet loss function. They also


present a new version of the triplet loss function called “Soft Margin”, removing the necessity of a margin hyperparameter and showing better overall results.

A more detailed explanation of the related work in this field is given in the next chapters of this thesis, especially in Chapters 5 and 6, where a literature review is necessary in order to fully understand the design and development of our models.


3 The Omniglot Dataset

3.1 Structure

The Omniglot dataset was created by Lake et al. [1] using Amazon’s Mechanical Turk, as a standard benchmark for learning from few examples. Omniglot contains letters from 50 different alphabets. In addition to the letter images, they also collected the strokes which formed those letters; in this thesis we do not use that data, since our goal is to approach the one-shot classification problem in a generalized manner. Therefore, we do not use any prior domain-specific knowledge about the dataset.

Omniglot contains a varying number of letters per alphabet, while for every letter it contains 20 different drawings, each drawn by a different person. In contrast to MNIST1, Omniglot contains many different classes while having only 20 examples for each class. For this reason, Omniglot is also known as the “MNIST transpose”.

Lake split the dataset into a background set of 40 alphabets and an evaluation set of 10 alphabets. We then divided the background set into 30 alphabets for training the network and 10 alphabets for validating the model during training, so that we can regularize the model and prevent overfitting. The 10 alphabets from the evaluation set are used for the final evaluation of the model’s performance, using the method described in the next section.

1 MNIST – the Modified National Institute of Standards and Technology database, a large database of handwritten digits that is commonly used for training various image processing systems. It contains only 10 different classes, with thousands of examples for each class.


3.2 One-Shot Performance Evaluation

In order to evaluate the one-shot learning performance of a model, Lake et al. developed a 20-way one-shot learning classification test [5]. The classification test is done only between letters that belong to the same alphabet. At first, an alphabet is chosen from the evaluation set, along with 20 random characters from that alphabet. Then, two drawings are selected for each selected character. Now that we have two drawings for every character, we use the first one as a test image and individually compare it against every second drawing of the selected letters. This way we get twenty 20-way one-shot trials for every alphabet. Because we have 10 alphabets in the evaluation set, we acquire 200 20-way one-shot trials. We repeat this process twice and end up with 400 one-shot trials in total. Figure 3.2 illustrates this process with a simpler example using only 3 characters, generating three 3-way one-shot trials.

Figure 3.2 a) First, three random letters are selected, and then 2 drawings are chosen for each letter. b) Three one-shot trials are created by matching characters from Drawing 1 to every character from Drawing 2
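As an illustration, the trial-generation step can be sketched as follows; this is a minimal sketch, and the data layout (a dict mapping each character id to its list of drawings) is our assumption, not the thesis code.

```python
import random

def make_one_shot_trials(alphabet, n_way=20):
    """Build n-way one-shot trials for a single alphabet, following the
    protocol described above. `alphabet`: dict character -> drawings."""
    chars = random.sample(list(alphabet), n_way)
    first, second = {}, {}
    for c in chars:
        # Two distinct drawings per character: one test, one class image.
        first[c], second[c] = random.sample(alphabet[c], 2)
    # Each character's first drawing is matched against the second drawing
    # of every selected character, yielding n_way trials per alphabet.
    return [(first[c], c, second) for c in chars]
```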


3.3 State of the Art in the Omniglot Dataset

Many different researchers use this dataset and its performance evaluation technique as a benchmark for a model's capability to learn from few examples per class and, most importantly, to generalize to classes never encountered during training.

Table 3.1 depicts the baseline and state-of-the-art results in one-shot classification task on the Omniglot dataset.

Table 3.1 State of the art results in the Omniglot dataset

Method One-shot accuracy

Humans 95.50%

HBPL 96.70%

Matching Networks 93.80%

Hierarchical Deep 65.20%

Deep Boltzmann Machine 62.00%

Simple Stroke 35.20%

1-Nearest Neighbor 21.70%


4 Siamese Neural Network

Our first approach is based on Gregory Koch’s work [2], with some slight variations in the architecture, mostly because of the limited computing power.

In general, image representations are learned using a supervised metric-based approach with siamese neural networks, and then the network’s features are reused for one-shot learning without any extra retraining.

Our model will be able to predict the class of a hand-written character by having only one example for each class. We use a siamese convolutional network that is trained using standard optimization techniques. At first, we train the neural network for the standard verification task: to discriminate between the class identities of image pairs. The verification model outputs the probability that the two inputs belong to the same class. We expect that the same neural network will show good results on the one-shot learning task, without any further training. Since in one-shot learning we have one image available for every class, we can run the verification network on the test image paired with every class image to calculate the probability that they belong to the same class. The pairing with the highest probability gives the result of the one-shot task.

4.1 Deep Siamese Networks for Image Verification

A Siamese neural network consists of two identical neural networks which accept different inputs and are joined at the top by a metric function. Both networks share the same parameters, which guarantees that two very similar inputs cannot be mapped by their networks to very different locations in the feature space.

Like Koch et al. [2], at the top of the twin networks we use the weighted L1 distance between the twin feature vectors, combined with a sigmoid activation.

Figure 4.1 High-level architecture of a Convolutional Siamese Neural Network

4.2 Model Architecture

We adopted the model from the work of Gregory Koch et al. [2]. The model is a sequence of convolutional layers of varying sizes and a fixed stride of 1. It contains 4 convolutions, with max-pooling only after the first 3, one fully connected layer, and a layer


computing the induced distance metric between the two fully connected layers of the twin networks. With the convolutional layers we use the rectified linear unit (ReLu) activation, while with the other two layers we use the sigmoid function. The final layer, computing the metric distance between the two inputs, is given as:

\[
p = \sigma \left( \sum_{j} \alpha_j \left| h_{1,L-1}^{(j)} - h_{2,L-1}^{(j)} \right| \right)
\]

where $h_{1,L-1}^{(j)}$ and $h_{2,L-1}^{(j)}$ are the hidden feature vectors of the twin networks, respectively. The $\alpha_j$ are additional parameters learned during training, in order to weight the importance of the different features captured by the network.
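As an illustration, this head can be written as a small PyTorch module; a sketch assuming a shared embedding network `embed` with output dimension `dim`, not the exact thesis code.

```python
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    """Weighted L1 distance between twin feature vectors, followed by a
    sigmoid, as in the formula above."""
    def __init__(self, embed: nn.Module, dim: int):
        super().__init__()
        self.embed = embed               # the twin network (shared weights)
        self.alpha = nn.Linear(dim, 1)   # one learnable alpha_j per feature

    def forward(self, x1, x2):
        h1, h2 = self.embed(x1), self.embed(x2)
        # sigma(sum_j alpha_j * |h1_j - h2_j|): probability of same class
        return torch.sigmoid(self.alpha(torch.abs(h1 - h2)))
```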

Figure 4.2 Architecture of the Siamese Neural Network [1]

4.3 Training procedure

Our verification network takes as input two batches of images and, for each pair of images, outputs a number between 0 and 1 as the probability that the two images belong to the same class. We compare the network outputs with the known values: 0 if the images belong to different classes, 1 if they belong to the same class. We use binary cross-entropy as the objective function.


We train the network in two phases, first using the Adam optimizer and then using Stochastic Gradient Descent (SGD). In the first phase we set the learning rate to 6e-5 and the weight decay to 1e-3, while in the second phase we set the initial learning rate to 5e-3, the momentum to 0.7, and the weight decay to 0. Figure 4.3 shows the progression of the binary cross-entropy loss (BCEL) during the first training phase. The decrease in BCEL means that the network is learning the verification task. Figure 4.4 shows that the accuracy in the one-shot classification task increased as well, so the network was able to learn the one-shot classification task indirectly. Still, it is obvious that the model is overfitting. Adding more regularization to the model by using dropout [12] did not help at this stage and caused the training to collapse. This motivated us to continue with another training phase. In the second phase we use dropout regularization in all layers, with nulling probabilities of 0%, 5%, 10%, 20%, 50%, and 50%, respectively. Also, a scheduler is used to decrease the learning rate by 1% each epoch. By using SGD and a higher learning rate, we expect the model to break out of the local minimum reached during the first training phase with the Adam optimizer.

Figure 4.4 One-shot accuracy monitoring of the model during the first phase of training

Figure 4.5 Binary Crossentropy Loss during the second phase of training

Figure 4.5 and Figure 4.6 show the BCEL and the one-shot classification accuracy during the second phase of training. It can be noticed that after the second phase of training we obtain a higher and more stable one-shot accuracy on the validation set.
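As a concrete illustration, the two-phase schedule can be sketched in PyTorch as follows. This is a minimal sketch, assuming a `model` that returns the verification probability for a pair batch and a `pair_loader` yielding `(x1, x2, same)`; neither name comes from the thesis.

```python
import torch

def train_phase(model, loader, optimizer, scheduler=None, epochs=200):
    bce = torch.nn.BCELoss()
    for epoch in range(epochs):
        for x1, x2, same in loader:            # batches of image pairs
            prob = model(x1, x2).squeeze(1)    # P(same class) for each pair
            loss = bce(prob, same.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if scheduler is not None:
            scheduler.step()                   # decay learning rate per epoch

# Phase 1: Adam, lr 6e-5, weight decay 1e-3, no scheduler.
train_phase(model, pair_loader,
            torch.optim.Adam(model.parameters(), lr=6e-5, weight_decay=1e-3))

# Phase 2: SGD, lr 5e-3, momentum 0.7, learning rate decayed 1% per epoch.
opt = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.7)
train_phase(model, pair_loader, opt,
            torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99))
```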

We initialize all network weights in the convolutional layers from a normal distribution with zero mean and a standard deviation of 10^-2. Biases are initialized from a normal distribution with mean 0.5 and standard deviation 10^-2. In the fully connected layers the biases are initialized in the same way, while the weights are initialized from a normal distribution with zero mean and standard deviation 2×10^-1.
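A minimal PyTorch sketch of this initialization scheme (our illustration, not the thesis code):

```python
import torch.nn as nn

def init_weights(m: nn.Module):
    """Initialization scheme described above (a sketch)."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=1e-2)    # conv weights
        if m.bias is not None:
            nn.init.normal_(m.bias, mean=0.5, std=1e-2)  # conv biases
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=2e-1)    # FC weights
        if m.bias is not None:
            nn.init.normal_(m.bias, mean=0.5, std=1e-2)  # FC biases

model.apply(init_weights)  # `model` is the siamese network defined earlier
```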

In contrast to Gregory Koch’s work, we did not create a fixed training set before training the network. At every iteration, we dynamically called a method which returned a batch of training data containing pairs of drawings that belong to the same alphabet.
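A sketch of such a dynamic pair-batch method; the nested dict layout of the training set (alphabet -> character -> drawings) is our assumption.

```python
import random

def make_pair_batch(train_set, batch_size=128):
    """Sample a batch of same-alphabet image pairs with same/different labels."""
    pairs, labels = [], []
    for i in range(batch_size):
        alphabet = train_set[random.choice(list(train_set))]
        if i % 2 == 0:  # positive pair: two drawings of the same character
            char = random.choice(list(alphabet))
            pairs.append(random.sample(alphabet[char], 2))
            labels.append(1)
        else:           # negative pair: drawings of two different characters
            c1, c2 = random.sample(list(alphabet), 2)
            pairs.append([random.choice(alphabet[c1]),
                          random.choice(alphabet[c2])])
            labels.append(0)
    return pairs, labels
```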

The network was trained for 200 epochs, each epoch consisting of 200 iterations with a batch size of 128. At the end of every epoch, the model was monitored using the method explained in Section 3.2 and the 10 alphabets from the validation set. The model attained in the first training phase was used as the initial state for the second training phase. The final model is the one that showed the best one-shot performance on the validation set.

4.4 One Shot Task

Once we have finished training our verification network, we can test its performance on the one-shot learning task. The performance evaluation is done using the evaluation set, which was reserved at the beginning for testing the final model. We evaluate the one-shot learning performance with 400 one-shot trials, as described in Section 3.2.
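A sketch of a single trial under this procedure, assuming `model` is the trained verification network and the images are already batched tensors:

```python
import torch

@torch.no_grad()
def one_shot_predict(model, test_img, class_imgs):
    """One 20-way trial: pair the test image with each class image and
    pick the class with the highest verification probability."""
    test_batch = test_img.expand(len(class_imgs), -1, -1, -1)
    probs = model(test_batch, class_imgs).squeeze(1)  # one prob per class
    return int(torch.argmax(probs))                   # predicted class index
```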

The one-shot results are given in Table 4.1. We borrow the baseline and state-of-the-art results from [2], while updating some of them from [6] and [9].

Table 4.1. One shot learning results including the baseline and the state of the art

Method One-shot accuracy

Humans 95.50%

HBPL 96.70%

Hierarchical Deep 65.20%

Deep Boltzmann Machine 62.00%

Simple Stroke 35.20%

1-Nearest Neighbor 21.70%

Koch’s Siamese Neural Net 92.00%

Our Siamese Neural Net 85.25%

The best performing model is HBPL, which also surpasses human performance on this task. It can also be noted that our model performed a little lower than the model of Koch et al.


Although we used almost the same architecture, the difference lies in the choice of some hyperparameters. Koch [2] used Whetlab, a Bayesian optimization framework, to perform hyperparameter selection. They fed the tool the ranges from which the hyperparameters were to be chosen, and it returned the best performing configuration, including layer-wise learning rates, layer-wise momentum, and layer-wise L2 regularization penalties. Whetlab was acquired by Twitter and is no longer available as a public tool. Other hyperparameter optimization tools are available, but because they try many possible combinations of hyperparameters, they require very high computational power, which we do not possess; therefore, the usage of these tools was not possible in this project, and hyperparameter selection was done manually. Koch also did not use dropout, but in our model it proved effective at lowering overfitting.


5 Triplet Models

Our second model differs from the Siamese Network in the sense that it directly learns a mapping from images to a Euclidean space in which distances measure how similar or different two embeddings are. The dimensions of the produced Euclidean space correspond to the number of features we extract from the images. The network is trained in such a way that the L2 Euclidean distances in the feature space correspond to the similarity between the two input images: images that belong to the same class have small distances, while images belonging to different classes have large distances. Once we calculate the space embeddings, the one-shot classification task becomes a simple L2 distance calculation between the unlabeled input and the class-representation images we have.

In this chapter we present and analyze two models, the first being an intermediate model between the Siamese network and the second, which is based solely on the feature extraction described in the previous paragraph. We name the second model TriNet; it is the main work of this chapter.

While the Siamese Network relies on the hypothesis that a model which performs well on the verification task should also generalize to one-shot learning, TriNet is trained to directly optimize the embeddings, which are directly used in the one-shot learning task. We call this model TriNet since the basis for this model is the triplet loss function it uses for learning.

5.1 Triplet Loss

Triplet loss is a well-known function used for learning feature space embeddings. The triplets contain two matching-class images and one non-matching-class image. The loss objective is to separate the same-class pair from the third image by a distance margin. The embeddings of the input images are represented by $f(x) \in \mathbb{R}^d$, a $d$-dimensional Euclidean space. As mentioned in [10], we want to make sure that an anchor image $x_i^a$ is closer to all other images that belong to the same class $x_i^p$ (positive) than it is to any image belonging to a different class $x_i^n$ (negative). The triplet loss being minimized is represented by the following function:

\[
\sum_{i}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \qquad \forall \, (x_i^a, x_i^p, x_i^n) \in \tau
\]

where $\alpha$ is the margin distance enforced between positive and negative pairs and $\tau$ is the set of all possible triplets. This is also illustrated in Figure 5.1.
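For reference, a minimal PyTorch version of this classical triplet loss over a batch of embeddings (our sketch; the margin value shown is illustrative, not a thesis setting):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Classical triplet loss over a batch of embedding triplets."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared L2 to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared L2 to negative
    return F.relu(d_pos - d_neg + margin).mean()   # hinge keeps loss >= 0
```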


Figure 5.1 Distance between positive and negative pairs before and after training [10]

There are two drawbacks to using this equation directly as the loss function of our learning algorithm. First, it is computationally exhaustive and therefore inefficient. Secondly, most of the triplets are easily satisfied, and therefore there is no benefit in using them at all. Eliminating easily satisfied triplets is necessary, otherwise the training will stagnate. On the other hand, selecting only the hardest triplets can make the training unstable by preventing it from learning simple features. So, the choice of triplets used in the training algorithm is a very important task and has a crucial role in achieving good results. Many triplet mining techniques exist that resolve this issue, but the drawback of most of them is that they require an extra step separate from training and add a substantial overhead. A better approach is presented in [11]. They argue that in the classical triplet loss implementation, only a handful of the triplet combinations inside a batch are considered, and a lot of useful information is wasted. They propose to build batches by randomly sampling P classes (in our case P letters) and K images for each class (in our case K drawings per letter), forming a batch of PK images. Then, for each image in the batch, the hardest positive and the hardest negative samples within that batch are selected. This triplet selection is called Batch Hard.

Because we are only selecting the hardest samples inside a small batch, this approximates moderate triplet mining over the whole dataset, while still executing on the GPU without causing any delay. Another novel improvement from [11] is the removal of the margin from the triplet loss function by introducing another variant of it, called Soft Margin:

\[
\log \left( 1 + e^{D} \right)
\]

where $D$ is the difference between the hardest-positive and hardest-negative distances. This method provides a double benefit, improving the overall result and removing one hyperparameter which would otherwise need to be tuned.
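In code, the soft margin is simply the softplus function applied to $D$; a minimal sketch (ours, not from [11]):

```python
import torch.nn.functional as F

def soft_margin(d_pos, d_neg):
    # softplus(x) = log(1 + e^x), with D = d_pos - d_neg per anchor
    return F.softplus(d_pos - d_neg).mean()
```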


In our model we experimented with different versions of the triplet loss. First, we implemented the classical one, using different margins as a hyperparameter. Afterwards, we implemented the batch hard version from [11], both with the margin hyperparameter and with the soft margin. Finally, we reached the same conclusion as [11]: the batch hard triplets with soft margin showed the best results.

5.2 TriameseNet

The initial architecture we use for this model is the Siamese Network architecture from our first model, but instead of using two identical networks for embedding calculation, we use three. As input the network takes triplets of images, built as described in the previous section for the classical triplet loss. In order to limit the number of easy examples from which the network would not benefit, we only choose triplets belonging to the same alphabet, relying on the assumption that two different letters from the same alphabet are more likely to be similar than two letters from different alphabets. In Figure 5.2 we see an example of two different letters from the Sylheti alphabet and two different letters from the Latin alphabet. It is obvious that letters from the same alphabet look much more similar than those from different alphabets.

Figure 5.2 The first two letters are from the Sylheti alphabet while the last two are from the Latin alphabet

In this model we are not going to use the Euclidean distance, as is usually expected with the triplet loss; instead we add an extra fully connected layer on top of the L1 distance of the embeddings in order to calculate the weighted L1 distance, in the same manner as in the Siamese Network. Figure 5.3 depicts an illustration of the “double Siamese” or, as we name it, “Triamese” network that calculates the triplet loss for a triplet input.

5.2.1 Training procedure

TriameseNet’s architecture enables us to use the same training procedure as in our version of the Siamese Network. The difference is that the network takes batches of image triplets as input, and the output is not a probability (a number between zero and one) but directly the mean triplet loss calculated over every triplet in the batch (the classical triplet loss).


As in the Siamese Network, we train the network in two phases, first using the Adam optimizer and then using Stochastic Gradient Descent (SGD). In the first phase we set the learning rate to 6e-5 and the weight decay to 1e-3, while in the second phase we set the initial learning rate to 6e-4, the momentum to 0.7, and the weight decay to 0. Additionally, in the second phase we use dropout regularization in all layers, with nulling probabilities of 0%, 5%, 10%, 20%, 50%, and 50%, respectively. In the second phase we also set a scheduler to decrease the learning rate by 1% each epoch.

In contrast to Gregory Koch’s work [2], we did not create a fixed training set before training the network. At every iteration, we dynamically created a batch of training data in a randomized manner, containing the triplets from which the positive and negative pairs are formed. Also, as noted earlier, in order to prevent training stagnation from easy triplets, all the triplets contain only drawings from the same alphabet.


The network, in each phase, was trained for 40,000 iterations with a batch size of 128, while being monitored every 200 iterations on the one-shot learning accuracy on both the validation set and the training set. This monitoring allows us to observe the model’s performance and gain insight into the network so that we can make more informed decisions on how to continuously improve the model. Figure 5.4 shows the one-shot accuracy for both the training and validation sets, while Figure 5.5 depicts the triplet loss progression with respect to the number of training iterations (Figure 5.6 and Figure 5.7 show these results for the second phase of learning). After the first phase is finished, we continue with the second phase of training as described earlier, using the trained weights of the first phase as the initial weights. The idea behind the two-phase training strategy is based on the properties of the optimization algorithms. The Adam optimizer, being an adaptive optimizer, is known to converge faster to a local optimum. On the other hand, Stochastic Gradient Descent (SGD) with momentum has slower convergence but, because of the momentum, it is less prone to getting stuck in a local minimum. By using the Adam optimizer in the first phase we can reach an acceptable local minimum. In the second phase, by using the SGD optimizer with momentum and a high learning rate, the network breaks out of the local minimum achieved in the first phase and, more often than not, ends up in a better position. By using this strategy, we could improve the final test set results by up to 5%.

Figure 5.5 Triplet loss over 40k iterations in the first phase

Figure 5.6 One-shot accuracy in the training and validation set over 40k iterations in the second phase. Initial weights are loaded from the first phase of training

Figure 5.7 Triplet loss over 40k iterations in the second phase. Initial weights are loaded from the first phase of training


5.3 TriNet

In this model, we apply a feature extraction technique through deep convolutional networks. By using the triplet loss as the loss function, we are able to train the network to directly optimize the feature embeddings for the final one-shot task. The architecture we employ contains 4 convolutional layers combined with the rectified linear unit (ReLu) and max pooling. At the end, the model contains a fully connected layer, producing the final embeddings. The architecture is similar to the one used for the Siamese network [2]. However, in TriNet we remove the last layer, which was used to calculate the verification probability, since we do not need the network to compute a probability or distance directly; we want the feature embeddings as the output of the network. It is also necessary to remove the sigmoidal function at the final feature embeddings because, as depicted in Figure 5.8, it forces features into a very limited range and maps very distant features close to one another. Besides, using a sigmoidal function at the embeddings can hide training problems, like vanishing or exploding embedding values.

TriNet computes the feature embeddings of the input batch; therefore, in contrast to the TriameseNet, the input need not be triplets. It is more efficient in terms of memory and computing time to feed all the images of one batch as single images, get their embeddings, and then continue with further processing using only the embeddings.

TriNet uses the Batch hard with Soft margin version of triplet loss [11]. As explained in section 5.1, this enables us to only use triplets which contribute to the learning process by choosing the hardest triplets inside the batch, resulting in moderate triplets relative to the entire dataset. This method shows superior results compared to other triplet loss methods discussed in Section 5.1, while still being significantly more computationally efficient.


5.3.1 Training procedure

TriNet’s training procedure differs from the ones described above, since the neural network doesn’t output the loss directly; instead, it outputs the feature embeddings that are used to compute the hard triplets inside the batch, from which the soft margin triplet loss is then calculated. After the neural network outputs the embeddings for all images in the batch, the next processing step is the hard triplet calculation. It is important that this calculation is done in parallel on the Graphical Processing Unit (GPU); otherwise, it becomes a bottleneck in computation time. The algorithm for parallel computation of hard triplets is explored in detail in Chapter 7. The final step is feeding the hard triplets to the soft-margin loss function and updating the network’s weights through backpropagation.
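The following is a compact sketch of how such a Batch Hard mining step can be done entirely on the GPU with masked reductions over the pairwise distance matrix. It is our illustration, not the thesis code; it assumes `emb` is the (PK, d) embedding matrix and `labels` holds the integer class of each row.

```python
import torch
import torch.nn.functional as F

def batch_hard_soft_margin(emb, labels):
    """Batch Hard mining [11]: for every anchor pick the hardest positive
    and hardest negative in the batch, then apply the soft-margin loss."""
    dist = torch.cdist(emb, emb)                       # (PK, PK) L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    # Hardest positive: max distance among same-class pairs (excluding self).
    d_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: min distance among different-class pairs.
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.softplus(d_pos - d_neg).mean()            # log(1 + e^D)
```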

The training set, as explained in the TriameseNet section, is not fixed before training starts; instead, the training batches are created dynamically in a randomized manner. In contrast to the TriameseNet, one batch contains a list of single images instead of triplets. The images are drawn, as described in Section 5.1, by first choosing P letters and then fetching K drawings of each letter. The drawback of this method is that it requires two extra hyperparameters, P and K, to be tuned. An example of a batch with P=3 and K=2 is shown in Figure 5.9.
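A minimal sketch of this P×K sampling, assuming `train_set` maps each character to its list of drawings (a hypothetical layout):

```python
import random

def sample_pk_batch(train_set, P=32, K=5):
    """Sample P characters and K drawings each, as in the Batch Hard scheme."""
    batch, labels = [], []
    for label, char in enumerate(random.sample(list(train_set), P)):
        for drawing in random.sample(train_set[char], K):
            batch.append(drawing)
            labels.append(label)
    return batch, labels  # P*K images with their class labels
```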

Figure 5.9 Training batch created by the Batch Hard algorithm with P=3 and K=2

Figure 5.10 One-shot accuracy of TriNet using high weight decay as regularization


The usage of the soft margin loss benefits us by removing one hyperparameter to be tuned, while also providing better results. On the other hand, because of its exponential formulation, $\log(1 + e^{D})$ limits us in the choice of other hyperparameters, like the number of features in the last layer of the network. A big drawback of using this function is that it can often cause exploding loss values, preventing the network from learning. We need to choose the hyperparameters and initialize the weights in a manner that prevents $e^{D}$ from overflowing. One technique that helps lower the value of $D$ is the normalization of input images from the 0-255 to the 0-1 range. Another limitation is the number of features we can use in the embedding vectors; a high number of features would rapidly increase the value of $D$ because of the L2 norm.

Figure 5.11 Loss of TriNet using high weight decay as regularization

Figure 5.12 Distance percentiles between embeddings in batch during training iterations, when using weight decay as regularization

Because of these limitations, choosing the optimal hyperparameters by only monitoring the one-shot accuracy and loss during training becomes problematic. Due to the unexpected failures of many training sessions of TriNet, especially when regularization was applied, we decided to take a deeper look at what was happening inside the network during training and what was causing these failures. In Figure 5.10 we see the progression of the one-shot accuracy of a model with high weight decay over the training iterations. We can see that training stagnated very early in the process. If we look at Figure 5.11, we notice that the loss decreased up to a certain point, corresponding to the increase in one-shot accuracy in the beginning, and then was not able to pass that threshold. Now we know that the accuracy stagnated because the model was not able to minimize the loss below a certain point, but from these graphs we cannot get any more detailed information as to what caused the loss to stop at that point. Figure 5.12 depicts a more insightful graph. It displays the distance percentiles (100th, 95th, 50th, 5th and 0th percentiles) between the images chosen by the batch hard algorithm. In this graph we notice that the model was continuously lowering those distances. In a perfect scenario, the triplet loss algorithm should lower the distance between same-class pairs and increase the distance between different-class pairs. Since the number of positive and negative pairs is equal, it is expected that half of the distances will decrease, but we also expect the other half to increase.

Figure 5.13 One-shot accuracy of TriNet using dropout as regularization

In the last example, the model was using high weight decay. Weight decay is a known regularization method which adds an extra term to the loss function to penalize the model for having high weights, and therefore encourages the optimization algorithm to move towards smaller weights in order to minimize the loss function. Having lower weights implies having low embedding values, and having low embedding values implies low distances. At this point we can assume that by using a different regularization method the model can overcome this problem. Figure 5.13 shows another training session, this time using dropout instead of weight decay. We can see that the accuracy dropped very fast in the early learning iterations. It is normal to have low accuracy in the beginning when using dropout, since it is widely known that dropout requires more training iterations in order to start learning, but by looking at the loss graph depicted in Figure 5.14 we can see that the loss is stuck again at the same threshold as in the previous example. The progression of distance percentiles over the training iterations in Figure 5.15 shows the same phenomenon happening even with dropout regularization. We go further and analyze more of the insight graphs with which we monitored the training procedure. In Figure 5.16 we have the percentiles of the 2-norms of the embeddings over the training iterations, while in Figure 5.17 we have the percentiles of the feature values in the embeddings. By observing these graphs, we come to the conclusion that all the feature embeddings are converging to zero. This means that no matter what input images we have, the network is always going to output a zero feature vector, and therefore a zero 2-norm of the embeddings and zero distances between inputs. In Figure 5.11 and Figure 5.14 it can be noticed that the threshold value the loss gets stuck at is 0.69.

Figure 5.15 Distribution of distances between embeddings in batch during training iterations when using dropout as regularization

Figure 5.16 Distribution of 2-Norm embeddings in batch during training iterations when using dropout as regularization

Figure 5.17 Distribution of embedding values in batch during training iterations when using dropout as regularization

By also observing the soft-margin loss function in Figure 5.18, it can be noticed that 0.69 is the value of the function when D, the difference between positive and negative distances, is zero. From this observation we concluded that the model, by lowering all the distances and therefore all the embeddings, is driving D to zero. That is a local minimum, and because of the small gradients at that point, the model is unable to escape it. This is a well-known phenomenon, also explained by A. Hermans et al. [11] in their paper on person re-identification. According to them, a typical training session usually proceeds as follows: initially, all embeddings are pulled together towards their center of gravity in the Euclidean feature space. When they come close to each other, they “pass” each other to join “their” clusters and, once this cross-over has happened, training mostly consists of pushing the clusters further apart and fine-tuning them. But they also note that under certain conditions the training can collapse. One factor that might cause the training to collapse is a small spread of the initial embeddings, which can be a direct consequence of a small standard deviation of the initial weights in the neural network. At the beginning of this section we mentioned the problem of an overflowing loss value under certain conditions. Initializing weights with a high standard deviation contributes to that problem, so we are limited in this direction.

Figure 5.18 Graph of the soft-margin function. When D is zero the function has value 0.69

While common regularization methods did not work well on TriNet, adding a Rectified Linear Unit (ReLu) after the final feature embeddings improved the model significantly. This seems unexpected at first, since it cancels out approximately half of the features in the embedding, thereby losing possibly useful information. On the other hand, in methods where feature embeddings are used (e.g., word embedding methods) it is common to have sparse embeddings; in our model that is not the case, as Figure 5.16 and Figure 5.17 make it obvious that none of the features in the embeddings is zeroed out until the training collapses. In Figure 5.19 we show the embedding percentiles during the training of our model using ReLu in the final embeddings. It can be noticed that more than 50% of the embedding values are zero from the beginning of training. Also, by analyzing the 2-norms of all the features in the embeddings (Figure 5.20), we conclude that the same features are being zeroed out for all the embeddings in one batch. By further analyzing these graphs, it can be noticed that during the first training iterations all the embedding values are shrinking. This is most obvious in the graph depicting the distance percentiles (Figure 5.21). In the early training iterations, just like in the previous collapsing models, the distances shrink towards zero, but at a certain point some features in the embeddings, and all the distances between the embeddings, start to increase rapidly. Here we encountered what A. Hermans et al. [11] call “the difficult packed phase”: in the beginning, all embeddings are pulled toward a center point, where they bypass each other and join their clusters; in our case, they join the points representing the same character. Once this bypassing is over, training consists of pushing different-class points further apart, which is manifested by increasing distances over the training iterations, exactly as our graphs show.

Figure 5.19 Distribution of embedding values in batch during training iterations when using ReLu activation in embeddings as regularization

Figure 5.20 Distribution of 2-Norms of features of embeddings in batch during training iterations when using ReLu activation in embeddings as regularization

Figure 5.21 Distribution of distances between embeddings in batch during training iterations when using ReLu activation in embeddings as regularization

Figure 5.22 Number of features in the embeddings the network is using during training iterations when using ReLu activation in embeddings as regularization
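As a quick check of the plateau value discussed above, the soft-margin loss at $D = 0$ evaluates to:

\[
\log \left( 1 + e^{0} \right) = \log 2 \approx 0.693
\]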

In this model, because there is a ReLu at the final embeddings, we also monitored the number of non-zero features in the embeddings, so that we could get a picture of how many features the model was using and how many it was zeroing out during training. TriNet has a total of 1024 features in the last-layer embeddings. Figure 5.22 shows the number of active features that the model uses to calculate the distances between embeddings over the training iterations. It can be noticed that the peak is reached during the “difficult packed phase”, while the embeddings are “bypassing” each other to reach their own clusters; afterwards it drops significantly and remains in a range between 100 and 250 features for the rest of the training, a significantly lower number than the expected half of the total features (512). The difficult phase can also be noticed in the loss graph depicted in Figure 5.23. In the early stages of training a flat area can be noticed, where the “bypassing” is happening. The fact that the loss is flat here does not mean that the network is not learning; on the contrary, by looking at the one-shot accuracy it can be noticed that the largest change happens during that stage. The loss stays flat for a while because of the Batch Hard algorithm: as the model learns some hard cases, the batch hard algorithm presents other hard cases to learn, until the learning process enters the “push” phase and the loss continues to drop.

Figure 5.24 Comparison of one-shot accuracy on validation set between TriNet with ReLu in embeddings (pink line) and TriNet without ReLu in embeddings (blue line)

Figure 5.24 shows the improvement in one-shot accuracy from using the ReLu activation in the final embeddings. Using ReLu not only showed significantly better results, but the network also overfits far less as the training iterations increase. So, based on Figure 5.22 and Figure 5.24, we can conclude that using the ReLu activation in the embeddings gives the network the flexibility to use as many features as it needs in different training phases. This also works as network regularization, since it lowers the gap between the accuracy on the training set and the validation set by 10%.

The best version of TriNet used the ReLu activation function at the embeddings and achieved the best result so far in our experiments: 87% accuracy on the validation set. The progression of training on both the training and validation sets is shown in Figure 5.25.

Figure 5.25 One-shot accuracy on training and validation set of TriNet with ReLu in the final embeddings


Although TriNet outperformed our previous models, Figure 5.25 shows that the model is still overfitting. Adding more regularization to the model causes the training to collapse during the “difficult packed phase”. According to A. Hermans [11], one of the main reasons that training may collapse during the “difficult packed phase” is an architecture that is not strong enough. This leads us to our next and last model, which uses the same technique but a stronger architecture in terms of learning capacity.

The best TriNet architecture is depicted in Table 5.1, while the best performing hyperparameters are shown in Table 5.2.

Table 5.1 TriNet's best performing architecture

Layer Name | Output Size | Configuration
conv 1 | 48 x 48 x 64 | 10 x 10, 64, stride 1 + ReLu + MaxPool 2x2
conv 2 | 21 x 21 x 128 | 7 x 7, 128, stride 1 + ReLu + MaxPool 2x2
conv 3 | 9 x 9 x 128 | 4 x 4, 128, stride 1 + ReLu + MaxPool 2x2
conv 4 | 6 x 6 x 256 | 4 x 4, 256, stride 1 + ReLu
flatten | 1 x 1 x 9216 |
fully connected | 1024 | 9216 x 1024 fully connected
ReLu | 1024 |

Table 5.2 TriNet's best performing configuration of hyperparameters

Hyperparameter | Value
Learning rate | 8e-5
Weight decay | 0.08
No. of characters in one batch (P) | 32
No. of drawings per character (K) | 5


5.4 One Shot Task

After the models are trained, the weights that showed the highest accuracy on the one-shot task on the validation set are saved. Those weights are then used to test the models on the final one-shot task on the test set. The testing procedure is exactly the one described by Lake et al. [5]; using the same procedure is necessary in order to compare our results against the state of the art and the baseline models. The performance evaluation is done on the evaluation set, which was reserved at the beginning for testing the final model. We evaluate the one-shot learning performance with 400 one-shot trials, as described in Section 3.2.
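A minimal sketch of this evaluation protocol is shown below; `sample_trial` and `predict` are hypothetical helpers, standing in for our trial sampler and for the model-specific prediction rules described next.

```python
def evaluate_one_shot(predict, sample_trial, n_trials=400):
    """predict(support, query) -> predicted class index (0..19);
    sample_trial() -> (support, query, target) for one 20-way trial."""
    correct = 0
    for _ in range(n_trials):
        support, query, target = sample_trial()
        correct += int(predict(support, query) == target)
    return correct / n_trials  # one-shot accuracy
```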

The execution of the one-shot trials for the TriameseNet is the same as for the Siamese network. We discard one of the three identical networks and use the other two in a similar manner as in the verification task. The difference is that, instead of a similarity probability, this network outputs a distance. After calculating the distance between the test image and all 20 class-identity images, we choose the class with the lowest distance as the result.

The execution of the one-shot trial for the TriNet is simpler and faster. It does not require 20 iterations on the GPU to calculate distances in a pairwise manner; instead, the network treats all 21 images (20 class-identity images and 1 test image) as a single batch and returns their embeddings. Afterwards, the distance calculation is also done on the GPU using matrix operations.
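For TriNet, this prediction rule can be sketched as follows, assuming a PyTorch model that maps an image batch to embeddings: the 21 images are embedded in a single forward pass, and the class whose identity image lies closest to the test image is returned.

```python
import torch

def predict_trinet(model, support, query):
    """support: (20, C, H, W) class-identity images; query: (1, C, H, W) test image."""
    with torch.no_grad():
        emb = model(torch.cat([support, query]))  # (21, embedding_dim) in one pass
    dists = torch.cdist(emb[-1:], emb[:-1])       # (1, 20) test-to-identity distances
    return dists.argmin().item()                  # class with the lowest distance
```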

Table 5.3 One-shot learning results including the baseline and the state of the art

Method                         | Test accuracy
Humans                         | 95.50%
HBPL                           | 96.70%
Hierarchical Deep              | 65.20%
Deep Boltzmann Machine         | 62.00%
Simple Stroke                  | 35.20%
1-Nearest Neighbor             | 21.70%
Koch's Siamese Neural Net      | 92.00%
Our SiameseNet                 | 85.25%
TriameseNet                    | 85.75%
TriNet                         | 86.00%
TriNet with ReLU on embeddings | 88.25%


The one-shot results on the evaluation set are given in Table 5.3. We borrow the baseline and state-of-the-art results from [2], adding the newer results from [6] and [9].


6 Transfer Learning Model

In the last model described, we witnessed an improvement in the one-shot classification task by mapping the input images to a feature space and then performing the one-shot classification only by calculating distances between those feature embeddings. Another benefit of this method is that, once the network is able to successfully map images to feature embeddings, we can easily reuse the embeddings on other problems, such as verification or identification. However, our best model still suffers from overfitting, and applying more regularization causes the training to collapse. We attributed this to a shortcoming of the architecture we previously used.

In this chapter, in order to validate our explanation of the previous model's shortcomings, we present a stronger architecture in terms of learning capacity, expecting better results through reduced overfitting.

6.1 Model Architecture

Considering the exceptional performance of ResNet-18 [13] and its ease of use, we decided to use this network as a feature extractor for our final model. The decision is also supported by the fact that this model presents a good trade-off between computational time and model capacity, both of which are crucial in our scenario. In the last chapter, we argued that a strong model is a must for our learning algorithm to be able to learn useful features that generalize to examples not encountered during training. In other words, a strong model architecture is necessary in order to apply enough regularization and prevent overfitting.

Table 6.1 depicts the original ResNet-18 architecture [14]. It can be noted that it involves five convolutional stages, followed by a fully-connected layer. ResNet-18 was originally designed for classification problems. In classification models, convolutional layers are usually followed by fully-connected layers, which output the final result using the softmax function; the number of output nodes is equal to the number of classes the network is trying to predict [14]. In contrast to classification problems, in one-shot classification we do not have a fixed number of classes. Using our feature embedding method, we want the network's output nodes to represent the features extracted by the network. Therefore, we adjust the last layer to output the number of features we need, dropping the softmax function since it is not suitable for our problem. The number of fully-connected layers, as well as the number of their nodes, are now hyperparameters of our network that need to be tuned.
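As a minimal sketch of this adjustment, assuming a PyTorch/torchvision implementation (our framework choices are detailed in Chapter 7), the classification head of ResNet-18 can be swapped for an embedding head:

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-18; `pretrained=True` fetches the ImageNet weights
# (newer torchvision versions use the `weights=` argument instead).
model = models.resnet18(pretrained=True)

# The original fc layer maps the 512-dim pooled features to 1000 class
# scores for softmax. We replace it with a 512 -> 1024 embedding head
# and drop the softmax entirely: the output is now a feature embedding.
model.fc = nn.Linear(model.fc.in_features, 1024)
```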

Our hyperparameter selection technique is based on trying different combinations that make sense, updating the range of parameters based on the results achieved, and repeating this process several times. Table 6.2 shows the best performing architecture attained by this process. While one-shot accuracy on the validation set was the main criterion for comparing the performance of different architectures, we also monitored several other attributes during the training process, as described in the next section.

Table 6.2 Adjusted ResNet-18 architecture

Layer Name      | Output Size    | Configuration
conv1           | 112 × 112 × 64 | 7 × 7, 64, stride 2; 3 × 3 max pool, stride 2
conv2_x         | 56 × 56 × 64   | [3 × 3, 64; 3 × 3, 64] × 2
conv3_x         | 28 × 28 × 128  | [3 × 3, 128; 3 × 3, 128] × 2
conv4_x         | 14 × 14 × 256  | [3 × 3, 256; 3 × 3, 256] × 2
conv5_x         | 7 × 7 × 512    | [3 × 3, 512; 3 × 3, 512] × 2
average pool    | 1 × 1 × 512    | 7 × 7 average pool
fully connected | 1024           | 512 × 1024 fully connections



6.2 Training

The motivation for implementing this model was the inability of our previous model (TriNet) to take full advantage of the training technique based on Batch Hard and the soft-margin loss. The shortcoming of TriNet was its architecture, which was not strong enough to pass the "difficult packed phase" under the pressure of high regularization. Therefore, the training steps for the new model are no different from those used for training TriNet.

ResNet-18 was initially designed for the ImageNet dataset (Figure 6.1), and its parameters trained on ImageNet are publicly available. Using pre-trained weights is a well-known technique which often results in outstanding performance. The ImageNet dataset contains RGB⁷ images of size 224x224, while Omniglot, the dataset we are using, contains grayscale images of size 105x105. Therefore, the images must undergo some preprocessing steps before being fed to ResNet. During the preprocessing phase, we also added affine distortions to the images, including small rotations, translations, scaling, and shear. Although affine distortions did not show any improvement in the performance of our previous models, with ResNet they proved to be very effective, boosting one-shot accuracy by up to 5%. The implementation details of image preprocessing are explained in more detail in Chapter 7.

⁷ RGB – Red Green Blue
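A sketch of such a preprocessing pipeline is given below, assuming torchvision transforms; the distortion ranges shown are illustrative assumptions, since the exact values are given in Chapter 7.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate the gray channel to RGB
    transforms.Resize((224, 224)),                # 105x105 Omniglot -> ResNet input size
    transforms.RandomAffine(                      # small affine distortions
        degrees=10,                               # rotation
        translate=(0.05, 0.05),                   # translation
        scale=(0.9, 1.1),                         # scaling
        shear=5),                                 # shear
    transforms.ToTensor(),
])
```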


The same hyperparameter selection technique was also used for choosing the learning rate, weight decay, and batch size. The best performing configuration, which defines the hyperparameters of our final model, is depicted in Table 6.3.

The network was trained starting from the ImageNet pre-trained weights of ResNet-18. None of the weights were frozen, so the training algorithm fine-tuned the already-trained convolutional layers while also training the new fully-connected layer. The weights of the fully-connected layer were initialized using the He initialization method, since ReLU is used as the activation in the network layers.

Table 6.3 Best performing configuration of hyperparameters

Hyperparameter                  | Value
Learning rate                   | 6e-5
Weight decay                    | 0
No. of characters in one batch  | 32
No. of drawings per character   | 5
Number of epochs                | 95
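Putting the pieces together, the training setup can be sketched as follows. The optimizer choice here (Adam) is an assumption for illustration, while the learning rate and weight decay come from Table 6.3.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # as in Section 6.1
model.fc = nn.Linear(model.fc.in_features, 1024)  # new embedding head

# He (Kaiming) initialization for the new head; all other layers keep
# their ImageNet weights and remain trainable (no freezing).
nn.init.kaiming_normal_(model.fc.weight, nonlinearity='relu')
nn.init.zeros_(model.fc.bias)

# Optimizer is an assumption; lr and weight decay follow Table 6.3.
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5, weight_decay=0)
```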


Figure 6.2 shows the progression of one-shot accuracy on the training and validation sets. This model outperforms all the other models we considered in this thesis, and the learning capacity of ResNet is evident in comparison with our previous models. In contrast to those models, the "difficult packed phase" cannot be noticed at all in the loss plot in Figure 6.3. By analyzing the percentiles that we examined in the previous models, only a small shrink in the percentile values can be seen in the very first training epochs. It is most noticeable in the distance percentiles (Figure 6.4), where the distances shrink in the beginning and, after that phase, continue to expand for the entire training session. An interesting fact is that the embedding and 2-norm percentile values are very stable, while the distance percentile values

Figure 6.3 Loss during training of our adjusted ResNet-18

Figure 6.4 Distance percentiles of our adjusted ResNet-18 model during training

Figure 6.5 Embedding percentiles of our adjusted ResNet-18 model during training
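The percentile monitoring used throughout these experiments can be sketched as below; this is a minimal illustration assuming PyTorch, computing percentiles of the pairwise embedding distances for a batch (the embedding and 2-norm percentiles are obtained analogously).

```python
import torch

def distance_percentiles(embeddings: torch.Tensor, qs=(5, 25, 50, 75, 95)):
    """embeddings: (N, D) batch of feature embeddings."""
    d = torch.cdist(embeddings, embeddings)  # (N, N) pairwise distances
    n = d.shape[0]
    # Drop the zero self-distances on the diagonal before taking quantiles.
    off_diag = d[~torch.eye(n, dtype=torch.bool, device=d.device)]
    return {q: torch.quantile(off_diag, q / 100).item() for q in qs}
```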
