
UNIVERSITÀ DI PISA

Dipartimento di Ingegneria dell’Informazione

Master Degree in Computer Engineering

Graduate Thesis

Implementation of techniques for

adversarial detection in image

classification

Supervisors: Prof. Fabrizio Falchi, Prof. Roberto Caldelli, Prof. Giuseppe Amato

Candidate: Roberta Fumarola

Academic Year 2016/2017


Contents

1 Background
  1.1 Convolutional Network
    1.1.1 ConvNet Layers
    1.1.2 ConvNet training
  1.2 Linear explanation of adversarials
    1.2.1 Fooling a linear classifier
  1.3 Adversarial robustness of a linear classifier
  1.4 Techniques for generating adversarials
    1.4.1 Fast Gradient sign method
    1.4.2 Evolutionary Algorithms
    1.4.3 Box constrained method
  1.5 Solutions to make a neural network more robust
    1.5.1 Training with adversarial examples
    1.5.2 Noise injection
    1.5.3 Auto Encoder
    1.5.4 Denoising Auto Encoder
    1.5.5 Deep Contractive network

2 Preliminary tests
  2.1 OverFeat
  2.2 Generating Adversarial Examples
    2.2.1 Fooling images generated with Box Constrained
    2.2.2 Fooling images generated with Fast Gradient
  2.3 Analysis of the feature space
    2.3.1 Average distances
  2.4 Search of kNNs

3 Final experiments
  3.1 Feature extraction
  3.2 kNN lists and classification of the queries with OverFeat
    3.2.1 Software classification
    3.2.2 OverFeat classification
  3.3 Adversarial detection
  3.4 Experimental results
    3.4.1 Incorrectly classified images
    3.4.2 Results


Introduction

The earliest approach to Artificial Intelligence (AI) was the implementation of programs to solve problems that human brains perform easily, such as understanding text or recognizing objects in an image. The major issue in those years was the restricted capability of computers to solve problems: in order to get a result, they had to follow a set of specific instructions (algorithms), and such a set of instructions worked well only if the solution of the problem was already known. Computations were too expensive and computers needed access to huge amounts of information in order to be smart. As a consequence, the first works gave very disappointing results and their progress was too slow. Researchers understood that a new approach was necessary. The first step toward Artificial Neural Networks (ANNs) came in 1943, when Warren McCulloch and Walter Pitts (a neurophysiologist and a mathematician, respectively) wrote a paper on how neurons might work. In 1949 Donald Hebb reinforced the concept of neurons in his book The Organization of Behavior, pointing out that knowledge and learning occur in the brain through the formation and change of synapses between neurons.

The advancement of electronic computers allowed the basis of these theories to be modelled in the 1950s by IBM's Nathaniel Rochester in the first computer simulation of a neural network. In the following years Frank Rosenblatt, a neurobiologist at Cornell intrigued by the operation of the eye of a fly, began work on the Perceptron: a mathematical model of how neurons operate in our brains. The Perceptron was first implemented as software for the IBM 704 and was subsequently implemented in custom-built hardware as the Mark I Perceptron.

Figure 1: Mark I Perceptron at the Cornell Aeronautical Laboratory. Hardware implementation of the first Perceptron.

Despite these successes in neural networks, the traditional von Neumann architecture took over the computing scene, and neural research was left behind. In addition, many people in the field were using a flawed learning function, because of the use of a threshold activation function that was not differentiable across the entire line. As a result, research and funding went drastically down. This was reinforced by the fact that the early successes of some neural networks led to an exaggeration of their potential, especially considering the practical technology of the time. Promises went unfulfilled, and writers began to talk about the effect that "thinking machines" would have on humans.

In 1982, interest in the field was renewed. John Hopfield of Caltech presented a paper to the National Academy of Sciences in which he introduced a neural network whose neurons are connected in a bidirectional way. That same year, Reilly and Cooper used a hybrid network with multiple layers, each layer using a different problem-solving strategy. But the real catalyst that brought new funds and research to this field was a conference, held the same year, at which Japan announced a new Fifth Generation effort on neural networks; the USA began to worry that it could be left behind in the field.

In 1986, with multiple-layered neural networks in the news, the problem was how to extend the Widrow-Hoff rule to multiple layers. Three independent groups of researchers, one of which included David Rumelhart, a former member of Stanford's psychology department, came up with similar ideas, now called back-propagation networks because pattern-recognition errors are distributed throughout the network. While hybrid networks used just two layers, back-propagation networks use many; as a result they are "slow learners", possibly needing thousands of iterations to learn. By 1985 the American Institute of Physics had begun what has become an annual meeting: Neural Networks for Computing.

Back to the present, neural networks are used in several applications, and ANNs have recently evolved into Deep Neural Networks (DNNs). DNNs have successfully addressed many problems such as image classification, language translation and identifying spam in email. Neural network architectures have become more intricate and their computational capabilities have improved markedly.

Despite this, and the high performance reached in classification tasks, they have to face a new problem: adversarial examples. Szegedy was the first researcher to introduce this concept, explaining how easy it is to alter the class predicted by the network by applying a small but carefully chosen perturbation of the pixels in the input to an image classifier: the perturbed inputs are the so-called adversarial examples or fooling images.

Contribution of this thesis The aim of this thesis was the design and implementation of a technique for adversarial image detection in image classification performed through artificial neural networks. Since adversarial perturbation does the most damage in the last level of a neural network (i.e. the classifier), we decided to focus on the study of the activations of neurons (deep features) in the layers that precede the classifier, in order to infer some properties that make adversarial images distinguishable from original ones. The thesis followed these steps: first, two state-of-the-art methods for generating fooling images and techniques used to make a DNN more robust were tested and studied; second, adversarial examples were generated through such methods from several original images by using a Convolutional Neural Network; third, deep features of original and fooling images were extracted from hidden layers of the network. Finally, we performed a k-Nearest-Neighbors (kNN) search, comparing these features with the ones extracted from the whole training set, in order to verify whether adversarial examples lie close to the class predicted by the CNN classifier or not.
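As a rough illustration of this idea (not the exact detection criterion developed later in the thesis), the following Python sketch compares the deep features of a query image against those of the training set with a kNN search. The names `train_features`, `train_labels`, `query_feature` and `predicted_class` are hypothetical inputs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_agreement(train_features, train_labels, query_feature,
                        predicted_class, k=10):
    """Fraction of the query's k nearest training neighbours (in deep-feature
    space) that share the class predicted by the network. A low agreement is a
    hint that the query might be an adversarial example."""
    index = NearestNeighbors(n_neighbors=k).fit(train_features)
    _, neighbour_idx = index.kneighbors(query_feature.reshape(1, -1))
    neighbour_labels = train_labels[neighbour_idx[0]]
    return float(np.mean(neighbour_labels == predicted_class))
```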

Experimental results for both the preliminary tests and the final experiments will be provided, in the hope that this work will be helpful for future research.


To my mom and dad They will always be my heroes and will forever remain my life's biggest inspiration. To my sister Ilaria She's my sister, she's my mirror. She's my witness who sees me at my best and worst. And loves me anyway. She's my partner in crime, my midnight companion. She knows I'm smiling even in the dark. She's my sister, she's my heart. To my boyfriend Gaetano Since you've walked into my life, now I see why it didn't work out with anyone else.


Chapter 1

Background

Deep Neural Networks (DNNs) have recently led to significant improvements in many areas of machine learning, from speech recognition to computer vision. The main skill of DNNs is the ability to take a raw input and transform it into a representation at a higher, slightly more abstract level. To do this, DNNs perform computations that are sometimes difficult to interpret and can have counter-intuitive properties. In particular, it was shown [1] that by applying a well-chosen perturbation of the pixels in the input to an image classifier, we can completely alter the class predicted by the network: such perturbed inputs are called adversarial examples or fooling images. Typically, the difference between the original and the perturbed image is imperceptible to a human observer.

Adversarial examples are interesting for two fundamental reasons. First, they demonstrate that machine learning methods do not yet truly understand the tasks they are asked to perform, even though they often achieve human-level performance. Second, adversarial examples also have important implications for computer security, because although they are designed to fool a specific model of a neural network (say model A), they are often able to fool another model (say model B) as well: we call this cross-model generalization of adversarial examples. When the two models are trained on different training sets we get the so-called cross-dataset generalization. Cross-model and cross-dataset generalization imply that adversarial examples pose a security risk even under a threat model where the attacker does not have access to the target's model definition, model parameters, or training set. The attacker can prepare a training set (for the same task), train a model on their own training set, create adversarial examples that mislead their own model, and then send these adversarial examples against the target system.

Early explanations suggested that this vulnerability was due to the nonlinearity of deep neural networks and to overfitting. Only recently have researchers argued that, contrary to these hypotheses, the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature.

To better understand how an adversarial example works, we will start by explaining briefly what a Convolutional Network is (because this model works well in visual tasks); then we will describe what an adversarial example exploits to fool the classifier and what happens in a ConvNet when this kind of image is given as input.

1.1 Convolutional Network

A Convolutional Network¹ (also called ConvNet or CNN) is a powerful tool for image processing and computer vision tasks. A CNN is very similar to an ordinary neural network: its neurons have learnable weights and biases; each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, starting from the raw image pixels and ending with class scores, and it still has a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer. So what is the difference between them? These networks take advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way. This special architecture takes inspiration from the human visual cortex, where neurons respond to stimuli in a restricted region of space known as the receptive field. Unlike a traditional neural network, in which each hidden layer is made up of a set of neurons fully connected to the ones in the previous layer, convolutional networks have neurons arranged in 3 dimensions: width, height and depth (note that depth refers to the third dimension of an activation volume, not to the depth of a full neural network). For example, an image in CIFAR-10 of size 32x32x3 (32 width, 32 height, 3 color channels: red, green, blue) is an input volume of activations with dimensions 32x32x3 (width, height, depth respectively). Hence, the neurons in a layer will only be connected to a small region of the layer before it, instead of to all of the neurons in a fully-connected manner. Moreover, since images in CIFAR-10 are grouped in 10 classes, the final output layer has dimensions 1x1x10, because the last layer of the ConvNet architecture reduces the full image to a single vector of class scores, arranged along the depth dimension.

¹For further details on Convolutional Network architecture see http://cs231n.github.io/convolutional-networks/ and http://neuralnetworksanddeeplearning.com/chap6.html

Usually a ConvNet is built by stacking three main types of layers: the Convolutional Layer, the Pooling Layer and the Fully-Connected Layer.

1.1.1 ConvNet Layers

Convolutional Layer The Convolutional layer is the core building block of a CNN and derives its name from the convolution operator. The primary purpose of this layer is to extract features from the input image. The convolutional layer preserves the spatial relationship between pixels by learning image features using small squares of input data.

The parameters of a Convolutional layer consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. To be more precise, each neuron in the first hidden layer is connected to a small region of the input neurons, say, for example, a 5x5 region, corresponding to 25 input pixels. So, for a specific hidden neuron, we might have connections that look like this:

Figure 1.1: Local receptive field

The region with black circles in the figure above forms the local receptive field for a hidden neuron (it is like a little window on the input pixels). Each connection learns a weight, and the hidden neuron learns an overall bias as well. We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer.

During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when they see some specific type of feature at some spatial position in the input. The more filters we have, the more image features get extracted and the better the network becomes at recognizing patterns in unseen images.

Figure 1.2: Different local receptive fields

The size of the Feature Map (Convolved Feature) is controlled by three parameters:

• Depth: corresponds to the number of filters we use for the convolution operation.

• Stride: the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 we move the filters one pixel at a time; when the stride is 2, the filters jump 2 pixels at a time as we slide them around. A larger stride produces smaller feature maps.

• Zero-padding: sometimes it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix. A nice property of zero-padding is that it allows us to control the size of the feature maps. Convolution with zero-padding is also called wide convolution, while convolution without zero-padding is a narrow convolution.

Pooling Layer A Pooling layer is periodically inserted between successive Convolutional layers. Its purpose is to progressively reduce the spatial size of the input representation. In particular, pooling:

• reduces the dimensionality of the feature maps;

• reduces the number of parameters and computations in the network, and hence controls overfitting;

• makes the network insensitive to small transformations, distortions and translations in the input image.

Spatial pooling can be of different types: Max, Average, Sum, etc. In the case of Max Pooling, we define a spatial neighborhood (for example, a 2x2 window) and take the largest element of the rectified feature map within that window. In addition to Max Pooling, the pooling units can also perform other functions, such as Average Pooling or even L2-norm Pooling. In practice Max Pooling is the most widely used operation and has been shown to work better than the others.

In summary, the pooling layer:

• accepts a volume of size W1 x H1 x D1;

• requires two hyperparameters: the spatial extent F and the stride S;

• produces a volume of size W2 x H2 x D2, where W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1 and D2 = D1 (a small helper computing these sizes is sketched below);

• introduces zero parameters, since it computes a fixed function of the input.

We can see an example of a Pooling layer in figure 1.3.

Figure 1.3: Example of a Pooling layer. (a) The input volume of size [224x224x64] is pooled with filter size 2 and stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved. (b) The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2: each max is taken over 4 numbers (a little 2x2 square).
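As a sanity check of the output-size formulas above, here is a small Python helper (an illustrative sketch, not code from this thesis):

```python
def pool_output_shape(w1, h1, d1, f, s):
    """Output volume of a pooling layer with spatial extent f and stride s."""
    w2 = (w1 - f) // s + 1
    h2 = (h1 - f) // s + 1
    return w2, h2, d1  # depth is preserved

# The example of figure 1.3(a): a 224x224x64 volume pooled with F=2, S=2.
assert pool_output_shape(224, 224, 64, f=2, s=2) == (112, 112, 64)
```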

Fully-Connected Layer The Fully-Connected layer is a traditional Multi Layer Perceptron (MLP) that typically uses a softmax activation function in the output layer. The term "fully connected" means that every neuron in the previous layer is connected to every neuron in the next layer. Its purpose is to use the extracted features for classifying the input image into different classes based on the training dataset. Moreover, adding a Fully-Connected layer is also a cheap way of learning non-linear combinations of these features: most of the features from the convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better.

The sum of the output probabilities from the Fully-Connected layer is 1, because the Softmax function takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one.

Introducing Non-Linearity (ReLU) A Rectified Linear Unit (ReLU) is an additional operation used after every convolution operation. It is applied element-wise and replaces all negative pixel values in the feature map with zero (see figure 1.4).

Figure 1.4: ReLU

The purpose of a ReLU is to introduce non-linearity in a ConvNet, since most of the real-world data we want ConvNets to learn is non-linear (convolution, matrix multiplication, etc. are all linear operations).
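To make the stacking of these layer types concrete, here is a minimal PyTorch sketch of a small ConvNet for 32x32x3 CIFAR-10-like inputs. It is an illustration only (the layer sizes are arbitrary choices), not an architecture used in this thesis.

```python
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Conv (+ReLU) -> Pool -> Conv (+ReLU) -> Pool -> Fully-Connected."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),   # 3x32x32 -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 32x16x16
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # -> 64x16x16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 64x8x8
        )
        # The fully-connected layer maps the feature volume to class scores.
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```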

1.1.2 ConvNet training

Figure 1.5 shows a complete ConvNet architecture, with all the previously described layers stacked.

The Convolutional and Pooling layers act as feature extractors from the input image, while the Fully-Connected layer acts as a classifier.

The overall training process of a Convolutional Network consists of two stages: forward pass and backpropagation.

Figure 1.5: Graphical illustration of a LeNet model. The lower layers are composed of alternating convolution and max-pooling layers. The upper layers are fully-connected and correspond to a traditional MLP (hidden layer + logistic regression). The input to the first fully-connected layer is the set of all feature maps of the layer below.

The first stage starts with the random initialization of all the filters, parameters and weights. Then a training image is given as input to the network; this input goes through the convolutional, ReLU and pooling layers. At the end of the forward step, the output resulting from the previous layers is fed into the Fully-Connected layer, where the ConvNet computes the output probabilities for each class and calculates the total error. The backpropagation step is used to calculate the gradients of the error with respect to all weights in the network. The gradient descent algorithm is then used to update all filter values, weights and parameters so as to minimize the output error. The weights are adjusted in proportion to their contribution to the total error. When the same image is given as input again, the output probabilities will be closer to the target vector: the network has learnt to classify this particular image correctly by adjusting its weights and filters so that the output error is reduced. Parameters such as the number of filters, the filter sizes and the architecture of the network are all fixed before training and do not change during the training process; only the values of the filter matrices and connection weights get updated.

The two stages are repeated for all the images in the training set.
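The two stages can be sketched in a few lines of PyTorch; this is a generic illustration, assuming a cross-entropy loss and plain SGD, not the training setup used in this thesis.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, lr=0.01):
    """One pass over the training set: forward pass, error, backpropagation, update."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)            # forward through conv/ReLU/pool/FC layers
        loss = criterion(logits, labels)  # total error on this batch
        loss.backward()                   # gradients of the error w.r.t. all weights
        optimizer.step()                  # gradient-descent update of filters and weights
```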

1.2 Linear explanation of adversarials

Now that we know the basic concepts of a Convolutional Network, we can understand how an adversarial example fools the classifier and what weaknesses it exploits [1].

Suppose we have a network whose neurons' outputs are just the weighted sums of their inputs (a linear model), and suppose that each input feature has a finite precision ε.

Figure 1.6: Neuron in a neural network.

Now we construct an adversarial example like this:

x' = x + η

where x is the original input and η is a perturbation smaller than ε. Theoretically, there is no reason for the classifier to assign a different class to x and x' as long as ‖η‖∞ < ε.

Consider the dot product between a weight vector w and an adversarial example x':

w^T x' = w^T x + w^T η

With this kind of perturbation the activation grows by w^T η. We can maximize this growth, subject to the max-norm constraint on η, by assigning η = ε · sign(w); if w has n dimensions and the average magnitude of its elements is m, the activation then grows by ε · m · n. Since ‖η‖∞ is upper bounded but the change in activation caused by the perturbation η can grow linearly with n, we can make many infinitesimal changes to the input that add up to one large change in the output.

The use of linear functions makes models susceptible to this attack. ConvNets are a complex deep learning model that expresses a highly non-linear function; however, the components that make up a ConvNet are linear: convolution of a filter with its input is a linear operation, sliding a filter through the input and computing dot products is a linear operation, and so are matrix multiplications.
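A tiny NumPy experiment (illustrative only, with randomly chosen weights) shows the effect: with the per-feature budget fixed at ε, the activation change w^T η grows roughly linearly with the input dimensionality n.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.007  # per-feature perturbation budget

for n in (10, 1_000, 100_000):          # input dimensionality
    w = rng.normal(size=n)              # a random weight vector
    eta = eps * np.sign(w)              # max-norm-bounded adversarial direction
    print(n, w @ eta)                   # grows roughly like eps * E|w_i| * n
```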

Figure 1.7: Example linear classifiers for a few ImageNet classes. The weights can be thought of as templates: the images show what the classifier is looking for. For example, Granny Smith apples are green, so the linear classifier has positive weights in the green color channel and negative weights in the blue and red channels, across all spatial positions.

1.2.1 Fooling a linear classifier

To better understand the impact of this minimal perturbation, let us fool a linear classifier³. In a linear classifier every class score is computed as a dot product between all the image pixels and a learnable weight vector, one for each class. With input images of size width × height × 3 and 1000 ImageNet classes we therefore have width × height × 3 × 1000 weights and 1000 biases. We can then visualize each of the learned weight vectors by reshaping them as images (see figure 1.7).

Suppose we have already trained such a model; we can start to produce fooling images. Since the classifier is linear, we do not need the backpropagation step: the score function is a dot product s = w^T x, so the gradient on the image x is simply ∇x s = w. Hence, we take an image we want to corrupt, and if we want the classifier to think that it belongs to another class, we take the weights corresponding to the desired class and add some fraction of those weights to the image (figure 1.8).

Figure 1.8: The starting image (left) is classified as a goldfish. If we add a small amount of the "daisy" weights to the image (last row, middle), the classifier is convinced that it is looking at one with high confidence.

³Complete example is available on

Teeny tiny example We can understand this process in even more detail by simplifying the problem into a very elementary example.

Suppose we train a binary logistic regression, where the probability of class 1 is defined as follows:

P(y = 1 | x; w, b) = σ(w^T x + b)

where

σ(z) = 1 / (1 + e^(-z))

is the sigmoid function, which squashes the class-1 score s = w^T x + b into the interval [0, 1], with s = 0 mapped to 0.5.

The classifier hence decides that the class of the input is 1 if σ(s) > 0.5. Consider now the following input x and weight vector w:

x = [2, -1, 3, -2, 2, 2, 1, -4, 5, 1]
w = [-1, -1, 1, -1, 1, -1, 1, 1, -1, 1]

If we take the dot product, we get -3. Hence, the probability of class 1 is 1/(1 + e^3) = 0.0474; in other words, the classifier is about 95% certain that this example belongs to class 0. We are now going to try to fool the classifier, that is, to find a tiny change to x such that the score comes out much higher. We build an adversarial x' by moving every element of x by 0.5 in the direction of the corresponding weight, i.e. x' = x + 0.5 · w:

x' = [1.5, -1.5, 3.5, -2.5, 2.5, 1.5, 1.5, -3.5, 4.5, 1.5]

The dot product now becomes 2 and the probability of class 1 is 1/(1 + e^(-2)) = 0.88. We improved the probability of class 1 from 4.74% to 88% with only a small perturbation.

If we generalize this computation to an image, which has many more dimensions than this simple example, we can build an adversarial example by making little changes across all of them, so as to raise the score of any class into which we want the image to fall.
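The arithmetic of this toy example can be checked with a few lines of NumPy (a verification sketch, not part of the original experiments):

```python
import numpy as np

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(w @ x, sigmoid(w @ x))          # -3.0 and ~0.047: class 1 is very unlikely

x_adv = x + 0.5 * w                   # move each input by 0.5 in the direction of its weight
print(w @ x_adv, sigmoid(w @ x_adv))  # 2.0 and ~0.88: class 1 now looks very likely
```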

1.3 Adversarial robustness of a linear classifier

We just saw how a linear classifier can be easily fooled. The fundamental question underlying this subject is the following: can we establish a limit on the robustness of classifiers? And if so, what is the difference between random noise and adversarial noise? Fawzi et al. [3] proposed an upper bound on the robustness valid for all classifiers, independently of the training procedure, and drew up mathematical definitions for this concept.

In order to better explain the subject, they started by analyzing a binary classifier, i.e. a classifier that maps input data into two classes.

Let μ be the probability measure on R^d of the data points we want to classify, with bounded support such that P_μ(‖x‖_2 ≤ M) = 1 for some M > 0. Let y(x) ∈ {-1, 1} be the label of a point x ∈ R^d, and let μ_1 and μ_{-1} be the distributions of class 1 and class -1. Let f : R^d → R be an arbitrary classification function, with the classification rule obtained by taking its sign. We can define the following notions:

Risk The probability of misclassification according to μ:

R(f) = P_μ(sign(f(x)) ≠ y(x)) = p_1 · P_{μ_1}(f(x) < 0) + p_{-1} · P_{μ_{-1}}(f(x) ≥ 0)

where p_{±1} = P_μ(y(x) = ±1). In other words, the risk is the probability that the classifier assigns a wrong label to the input x; it is the sum of two mutually exclusive terms, each being the probability that the input belongs to class ±1 while the sign of f(x) corresponds to the opposite label.

Δ_adv(x; f) The norm of the smallest perturbation that changes the sign of f:

Δ_adv(x; f) = min_{r ∈ R^d} ‖r‖_2   subject to f(x) · f(x + r) ≤ 0

Notice that the perturbed point x + r is not required to belong to the dataset.

Robustness to adversarial perturbation The average of Δ_adv(x; f) over all x:

ρ_adv(f) = E_μ(Δ_adv(x; f))

Figure 1.9: Difference between the norms of points perturbed with random uniform noise and with adversarial noise. The red line denotes the classifier decision boundary; Δ_adv(x; f) is equal to the distance from x to this line. The spherical cap region below this line has measure ε, meaning that the probability that a random point sampled on the sphere has label +1 is 1 - ε.

Δ_unif,ε(x; f) As illustrated in figure 1.9, this can be seen as the maximal radius of a sphere centered at x such that points sampled uniformly at random from this sphere are classified similarly to x with high probability. Given ε ∈ [0, 1]:

Δ_unif,ε(x; f) = max_{η ≥ 0} η   such that P_{n∼ηS}(f(x) · f(x + n) ≤ 0) ≤ ε

where ηS denotes the uniform measure on the sphere centered at 0 and of radius η in R^d. This radius provides an upper bound on the adversarial norm Δ_adv(x; f) for all ε.

Robustness to random uniform noise The average of Δ_unif,ε(x; f) over all x:

ρ_unif,ε(f) = E_μ(Δ_unif,ε(x; f))

We now apply these notions to a linear classifier, in order to derive its robustness. Suppose we have the usual classification function f(x) = w^T x + b described in the previous section, and suppose further that the bias is bounded by |b| ≤ M‖w‖_2. In this case the adversarial perturbation is given by

Δ_adv(x; f) = |w^T x + b| / ‖w‖_2

which is equal to the distance from x to the hyperplane {f(x) = 0}. The robustness is bounded by

ρ_adv(f) ≤ ‖p_1 E_{μ_1}(x) - p_{-1} E_{μ_{-1}}(x)‖_2 + M(|p_1 - p_{-1}| + 4R(f))

When the two classes are equiprobable, i.e. p_{±1} = 1/2, the above bound simplifies as follows:

ρ_adv(f) ≤ (1/2) ‖E_{μ_1}(x) - E_{μ_{-1}}(x)‖_2 + M(|p_1 - p_{-1}| + 4R(f))

The robustness to adversarial perturbation is hence bounded by the sum of two terms: the first is the difference between the first-order moments of the two distributions μ_{±1} and measures the distinguishability of the two classes; the second involves the risk R(f). Moreover, the first term is independent of the classification function, so when the difference between the means of the distributions is small, the robustness depends essentially only on the risk R(f).

The robustness of a linear classifier on images of size √d × √d to random uniform noise is instead given by:

ρ_unif,ε(f) ≥ max(C_1(ε)·√d, 1) · ρ_adv(f)
ρ_unif,ε(f) ≤ C̃_2(ε, d) · ρ_adv(f) ≤ C_2(ε)·√d · ρ_adv(f)

with

C_1(ε) = 1/√(2 ln(2/ε)),   C̃_2(ε, d) = 1/√(1 - (12ε)^(1/d)),   C_2(ε) = 1/√(1 - 12ε)

These bounds show that a linear classifier is more robust to uniform random noise than to adversarial noise by a factor of √d.
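The √d gap between adversarial and random-noise robustness of a linear classifier can also be observed numerically. The following NumPy sketch (illustrative, on arbitrary synthetic data) computes Δ_adv(x; f) exactly and estimates, by Monte-Carlo sampling, the probability that uniform noise of a given norm flips the decision.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_adv(x, w, b):
    """Exact distance from x to the hyperplane w.x + b = 0 (smallest adversarial L2 norm)."""
    return abs(w @ x + b) / np.linalg.norm(w)

def flip_probability(x, w, b, radius, n_samples=2000):
    """Fraction of perturbations sampled uniformly on a sphere of the given radius
    that change the sign of f(x) = w.x + b."""
    noise = rng.normal(size=(n_samples, x.size))
    noise *= radius / np.linalg.norm(noise, axis=1, keepdims=True)
    return float(np.mean(np.sign(w @ x + b) != np.sign((x + noise) @ w + b)))

d = 2500                                   # e.g. a 50x50 "image"
w, b = rng.normal(size=d), 0.0
x = rng.normal(size=d)

r = delta_adv(x, w, b)
for k in (1, 5, 10, 25, 50):               # radii as multiples of the adversarial distance
    print(k, flip_probability(x, w, b, k * r))
# The flip probability only becomes appreciable once the radius is on the order
# of sqrt(d) times the adversarial distance, in line with the bounds above.
```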

1.4 Techniques for generating adversarials

As mentioned above, neural networks are too linear to resist linear adversarial perturbations. ReLU and maxout networks [4] are easy to optimize precisely because they are designed to behave in very linear ways. Other models, which instead use nonlinear activation functions (for example, the sigmoid), are carefully adjusted to stay in the non-saturating zone for the same reason; in this regime their behaviour is again quite linear.

Researchers took advantage of this weakness and implemented many techniques to generate fooling images, such as the fast gradient sign method, the box-constrained method and evolutionary algorithms.

1.4.1 Fast Gradient sign method

One way to generate adversarial examples is to use the fast gradient sign method [2]. This method exploits the back-propagation learning algorithm and stochastic gradient descent. A perturbation is added to the original data sample; this perturbation is proportional to the sign of the gradient back-propagated from the output to the input layer.

Mathematically speaking, let θ be the parameters of the model, x the input to the model, y the target (the desired output) associated with x, and J(θ, x, y) the cost function used to train the neural network. The cost function can be linearized around the current value of θ, obtaining an optimal max-norm constrained perturbation of

η = ε · sign(∇_x J(θ, x, y))

This is the fast gradient sign method for generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.

In practice, an adversarial example can be created from an input image with the following steps (notice that during the back-propagation pass the weights are frozen and not updated; a code sketch is given after this list):

• propagate the input x forward to the output layer, as in standard back-propagation;

• calculate the error and back-propagate the gradient all the way to the input layer;

• update the input with the perturbation η, so that we get the new input x' = x + η.
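These steps translate almost directly into code. Below is a hedged PyTorch sketch of the method, assuming a differentiable classifier `model` that returns logits and inputs scaled to [0, 1]; it is an illustration, not the exact implementation used later in this thesis.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Fast gradient sign method: x' = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # J(theta, x, y)
    loss.backward()                           # gradient w.r.t. the *input*; weights stay frozen
    eta = epsilon * x_adv.grad.sign()
    return (x_adv + eta).clamp(0.0, 1.0).detach()
```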

It was found that this technique is able to fool different types of models. Several experiments were performed with different configurations of the parameters, and some of them gave quite interesting results. For example, setting ε = 0.25 and feeding the resulting fooling images to a shallow softmax classifier and to a maxout network caused error rates of 99.9% and 89.4% respectively on the MNIST test set, with an average confidence of 79.3% for the first classifier and 97.6% for the second one. With a different setting (ε = 0.1) on a convolutional maxout network and a preprocessed CIFAR-10 test set, an error rate of 87.15% and an average confidence of 96.6% were obtained.

Obviously there are other simple methods of generating adversarial examples; for example, it is enough to rotate x by a small angle in the direction of the gradient to produce adversarial examples.

Figure 1.10: A demonstration of fast adversarial example generation. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change the classification of the image. This adversarial example was generated to fool the GoogLeNet model. Here the value of .007 corresponds to the precision ε of an 8-bit image encoding after GoogLeNet's conversion to real numbers.

1.4.2 Evolutionary Algorithms

Researchers have found another method to generate adversarial examples. This method was actually not created to fool a neural network; rather, it was built while trying to answer a question regarding the high performance of CNNs in image recognition tasks: "What differences remain between computer and human vision?"

The technique relies on Evolutionary Algorithms (EAs) [5] and is able to produce images that are completely unrecognizable to humans, but that DNNs believe to be recognizable objects with 99.99% confidence. More precisely, the authors chose the AlexNet DNN trained on the 1.3M-image ILSVRC 2012 ImageNet dataset and the LeNet model trained on the MNIST dataset, and then created images with this algorithm that the DNNs believed with high confidence to belong to each dataset class.

1.4.2.1 Generating images with evolution

EAs are inspired by Darwinian evolution. They consist of a population of organisms (in this case, images) from which we keep the best individuals and then apply random perturbations, such as mutation and/or crossover. The organisms are selected according to a fitness function, which in this case is the prediction score given by a DNN to an image for a certain class. Instead of a traditional EA (which would evolve images to match a single ImageNet class), a newer algorithm called the Multidimensional Archive of Phenotypic Elites (MAP-Elites) is used, which is able to simultaneously evolve a population that contains individuals scoring highly on many classes (e.g. all 1000 ImageNet classes). MAP-Elites performs the following steps (a rough sketch is given below):

• keep the best individual found so far for each objective (class);

• at each iteration, choose a random organism from the population, mutate it randomly, and replace the current champion for any objective if the new individual has higher fitness on that objective. Here, fitness is determined by sending the image to the DNN; if the image generates a higher prediction score for any class than has been seen before, the newly generated individual becomes the champion in the archive for that class.

Figure 1.11: Left side: a DNN recognizes natural images. Right side: images produced by the EA that fool DNNs.

The EA was tested with two different encodings, i.e. two ways of representing an image as a genome, called direct and indirect.
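The MAP-Elites loop described above can be sketched in a few lines of Python. The `random_image`, `mutate` and `score_fn` functions are hypothetical placeholders (in the paper, `score_fn(image, c)` would be the DNN's prediction score for class c); this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def map_elites(score_fn, random_image, mutate, n_classes, n_iterations):
    """Keep one champion image per class; mutate random champions and replace
    any champion that the new image beats on that class's score."""
    rng = np.random.default_rng(0)
    champions = [random_image() for _ in range(n_classes)]
    best_score = [score_fn(img, c) for c, img in enumerate(champions)]

    for _ in range(n_iterations):
        parent = champions[rng.integers(n_classes)]   # random organism from the archive
        child = mutate(parent)
        for c in range(n_classes):
            s = score_fn(child, c)
            if s > best_score[c]:                     # new champion for objective c
                champions[c], best_score[c] = child, s
    return champions, best_score
```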

Images produced by these encodings are shown in figure 1.13.

Direct encoding This encoding has:

• one grayscale integer for each of the 28x28 pixels for MNIST;

• three integers (H, S, V) for each of the 256x256 pixels for ImageNet.

Each pixel value is initialized with uniform random noise in the [0, 255] range. The values are mutated independently; each of them has a certain probability of being chosen for mutation. This probability starts at 0.1 (a 10% probability) and drops by half every 1000 generations. The numbers chosen to be mutated are then altered via the polynomial mutation operator with a fixed mutation strength of 15.

(a) Images that the MNIST DNN believes with 99.99% confidence to be digits 0-9. Each column represents a digit class, and each row shows the result after 200 generations of a randomly selected, independent run of evolution.

(b) Images that the MNIST DNN believes with 99.99% confidence to be digits 0-9. The column and row descriptions are the same as for (a).

Figure 1.12: Direct (a) and indirect (b) encoding for the LeNet model, trained on the MNIST dataset.

Indirect encoding This encoding produces much more regular images, meaning images that contain patterns (symmetry and repetition). The indirect encoding used here is a compositional pattern-producing network (CPPN), which can evolve complex, regular images that resemble natural and man-made objects.

1.4.2.2 Evolving images to match MNIST

Several images were evolved with both the direct and the indirect encoding to match the MNIST dataset.

Direct encoding The directly encoded images were declared by LeNet to be digits from 0 to 9. Multiple, independent runs of evolution repeatedly produce images that the DNN believes with 99.99% confidence to be digits, but that are unrecognizable to humans. Results after 200 generations are shown in figure 1.12.

Indirect encoding As previously described, a CPPN is used to evolve more recognizable images. As a result, the images contain more strokes and other regularities, and LeNet labels these unrecognizable images as digits with 99.99% confidence after only a few generations. By 200 generations, the median confidence is 99.99%. Certain patterns repeatedly evolve in some digit classes and appear indicative of that digit: as we can see in figure 1.12, images classified as a 1 tend to have vertical bars, while images classified as a 2 tend to have a horizontal bar in the lower half of the image.


1.4.2.3 Evolving images to match ImageNet

Several images were also evolved with both the direct and the indirect encoding to match the ImageNet dataset. Results for both methods are shown in figure 1.13.

Direct encoding In this case the directly encoded EA was less successful at producing high-confidence images. Even after 20,000 generations, evolution failed to produce high-confidence images for many categories. However, evolution did manage to produce images for 45 classes that are classified with > 99% confidence to be natural images.

Indirect encoding Once again, the authors tested whether the CPPN encoding might produce more recognizable images than the direct encoding. In 5 independent runs, evolution produced many images with DNN confidence scores > 99.99% that are nevertheless unrecognizable to humans. High-confidence images are found in most categories. While a human would not label a CPPN image as belonging to the class given by AlexNet, the generated images do often contain some features of the target class (see figure 1.13).

For many of the produced images, once given the class label one can begin to identify why the DNN believes the image is of that class. This is because evolution needs only to produce features that are unique to, or discriminative for, a class, rather than an image that contains all of the typical features of that class.

Figure 1.13: (a) Evolved images that are either directly (top) or indirectly (bottom) encoded, and that DNNs trained on ImageNet believe with > 99.6% certainty to be a familiar object. (b) Images selected to showcase diversity from 5 evolutionary runs. The diversity suggests that the images are non-random, and that evolution is instead producing discriminative features of each target class. The mean DNN confidence score for these images is > 99.12%.

Figure 1.14: (a) Median confidence scores of 21.59% from 5 runs of directly encoded, evolved images for all 1000 ImageNet classes. For 45 classes, evolution can produce images that the DNN believes with over 99% confidence to be in a natural ImageNet class. (b) Median confidence scores of 88.11% from 5 runs of CPPN-encoded, evolved images for all 1000 ImageNet classes. Evolution can produce many images that the DNN believes with over 99% confidence to belong to ImageNet classes.

1.4.3 Box constrained method

Box Constrained [1] [6] is one of the first methods created to deceive a neural network. The fundamental idea is to construct a small perturbation of a data point x in order to force the network to misclassify the training example x with some incorrect label y'. The adversarial examples are generated for a given training point (x, y) by using L-BFGS to solve a box-constrained optimization problem.


Assume that the network, given an input X ∈ I corresponding to the pixels of a fixed-size image, outputs a vector of probabilities p = [p_1, ..., p_i, ..., p_n] of the image belonging to class label i. Let h be the label with the highest probability p_h. Assume further that I = [L, U] for grayscale images, or I = [L, U]^3 for RGB images, where L and U are the lower and upper limits of the pixel scale, and let c be the correct label.

This method starts with h = c, otherwise there is no point in fooling the classifier. It adds the smallest distortion D to X such that the highest probability is no longer assigned to h. The distortion must keep the input inside its space, hence it must ensure that X + D ∈ I; in other words, the input is box-constrained.

Thus, we have the following optimization problem:

minimize_D   ‖D‖ + C · H(p, p_A)
subject to   L ≤ X + D ≤ U,   p = f(X + D)

where we introduce the adversarial probability target p_A, which assigns zero probability to all but a chosen adversarial label a, and H is the cross-entropy between the two probability assignments. The constant C balances the importance of the two objectives: the lower the constant, the more we minimize the distortion norm; values that are too low, however, may make the optimization unfeasible. We want the lowest, but still feasible, value of C.

Through this method researchers realized for the first time that adversarial examples have the two properties of cross-model and cross-dataset generalization: indeed, a relatively large fraction of fooling images was misclassified by different network models (e.g. trained with different hyper-parameters) and by networks trained on a disjoint training set. As we can see in tables 1.1 and 1.2, different models of neural networks were chosen to test cross-model generalization.

Tables 1.3 and 1.4 show the cross-dataset generalization. The training set of MNIST (60k images) was partitioned into two parts P1 and P2, each of size 30k. Three non-convolutional networks with sigmoid activations were trained on them:

• FC100-100-10 and FC123-456-10 were trained on P1;

• FC100-100-10' was trained on P2.


Model Name      Description                                  Training error   Test error   Av. min. distortion
FC10(10^-4)     Softmax with λ = 10^-4                       6.7%             7.4%         0.062
FC10(10^-2)     Softmax with λ = 10^-2                       10%              9.4%         0.1
FC10(1)         Softmax with λ = 1                           21.2%            20%          0.14
FC100-100-10    Sigmoid network, λ = 10^-5, 10^-5, 10^-6     0%               1.64%        0.058
FC200-200-10    Sigmoid network, λ = 10^-5, 10^-5, 10^-6     0%               1.54%        0.065
AE400-10        Autoencoder with Softmax, λ = 10^-6          0.57%            1.9%         0.086

Table 1.1: Different models used to test the generalization of adversarial images on the MNIST dataset.

                FC10(10^-4)   FC10(10^-2)   FC10(1)   FC100-100-10   FC200-200-10   AE400-10   Av. distortion
FC10(10^-4)     100%          11.7%         22.7%     2%             3.9%           2.7%       0.062
FC10(10^-2)     87.1%         100%          35.2%     35.9%          27.3%          9.8%       0.1
FC10(1)         71.9%         76.2%         100%      48.1%          47%            34.4%      0.14
FC100-100-10    28.9%         13.7%         21.1%     100%           6.6%           2%         0.058
FC200-200-10    38.2%         14%           23.8%     20.3%          100%           2.7%       0.065
AE400-10        23.4%         16%           24.58%    9.4%           6.6%           100%       0.086

Table 1.2: Cross-model generalization of adversarial examples. The first column specifies the model for which the adversarial examples are generated. The second to seventh columns give the error induced by the fooling images when sent to a different model. The last column shows the average distortion necessary to reach 0% accuracy on the training set.

The reason the researchers trained two networks on P1 was to study the cumulative effect of changing the network hyperparameters and the training set at the same time. Models FC100-100-10 and FC100-100-10' share the same hyperparameters (both are 100-100-10 networks), while FC123-456-10 has a different number of hidden units. In this experiment, the distortions were applied to the elements of the test set rather than the training set.

Model                          Error on P1   Error on P2   Error on test   Min. av. distortion
FC100-100-10 trained on P1     0%            2.4%          2%              0.062
FC123-456-10 trained on P1     0%            2.5%          2.1%            0.059
FC100-100-10' trained on P2    2.3%          0%            2.1%            0.058

Table 1.3: Models trained to study cross-training-set generalization of the generated adversarial examples.


                                                  FC100-100-10   FC123-456-10   FC100-100-10'
Distorted for FC100-100-10 (av. stddev=0.062)     100%           26.2%          5.9%
Distorted for FC123-456-10 (av. stddev=0.059)     6.25%          100%           5.1%
Distorted for FC100-100-10' (av. stddev=0.058)    8.2%           8.2%           100%

Table 1.4: Cross-training-set generalization error rates for the sets of adversarial examples generated for the different models.

1.5 Solutions to make a neural network more robust

After the discovery of adversarial examples, several methods for improving the robustness of neural networks have been explored.

1.5.1 Training with adversarial examples

Because it was thought that one of the main causes of adversarial examples was overfitting, the first solution adopted was the augmentation of the training set with these fooling images. Data augmentation is one of the principal techniques used to regularize models and control overfitting; typical schemes augment the data with transformations, such as translations, that are expected to actually occur in the test set. This form of data augmentation instead uses inputs that are unlikely to occur naturally but that expose failures in the way the model conceptualizes its decision function.

This method was applied to all the types of adversarial examples described above.

Data augmentation with box-constrained adversarials A pool of adversarial examples was mixed into the original training set [1], and a subset of these fooling images was continuously replaced by newly generated adversarials. In this way a two-layer 100-100-10 non-convolutional neural network was trained to a test error below 1.2%. The same network reaches 1.6% error when regularized by weight decay alone, which can be improved to around 1.3% by carefully applied dropout.

Data augmentation with fast gradient sign adversarials It was found that training with an adversarial objective function based on the fast gradient sign method [2] is an effective regularizer:

J'(θ, x, y) = α · J(θ, x, y) + (1 - α) · J(θ, x + ε·sign(∇_x J(θ, x, y)), y)

where α = 0.5. The set of adversarial examples is continually updated, so that they resist the current version of the model. The tests were done on the MNIST dataset, on a maxout network with dropout and early stopping, terminating learning after the validation-set error rate had not decreased for 100 epochs. The error rate was reduced from 0.94% without adversarial training to 0.84% with adversarial training. With adversarial training, while the validation-set error was very flat and reached stability over time, the adversarial validation-set error was not; thus early stopping was performed on the adversarial validation-set error. The network was then retrained on all 60,000 examples. The training was run five times, each run with different seeds for the random number generators used to select subsets of training examples, initialize weights, and generate dropout masks. The result was four trials with an error rate of 0.77% on the test set and one trial with an error rate of 0.83%. The model also became somewhat resistant to adversarial examples: with adversarial training, the error rate on adversarial examples decreased from 89.4% to 17.9%. Regarding cross-model generalization, fooling images generated via the original model yield an error rate of 19.6% on the adversarially trained model, while adversarial examples generated via the new model yield an error rate of 40.9% on the original model. For the new model, the average confidence on a misclassified example was 81.4%.
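A compact PyTorch sketch of this adversarial objective (assuming cross-entropy as J and a model returning logits; an illustration, not the original maxout setup) could look as follows:

```python
import torch
import torch.nn.functional as F

def adversarial_training_loss(model, x, y, epsilon, alpha=0.5):
    """J'(theta,x,y) = alpha*J(theta,x,y) + (1-alpha)*J(theta, x + eps*sign(grad_x J), y)."""
    clean_loss = F.cross_entropy(model(x), y)

    # Build the fast-gradient-sign perturbation without updating the weights.
    x_req = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x_req + epsilon * x_req.grad.sign()).detach()

    adv_loss = F.cross_entropy(model(x_adv), y)
    return alpha * clean_loss + (1 - alpha) * adv_loss
```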

Data augmentation with evolutionary adversarials Unlike the previous approaches, where adversarials are mixed into the existing training set, in this case the training set is augmented by adding a new class (called, for example, "fooling images") that contains these examples [5]. Strictly speaking, a network can be retrained so that the images that previously fooled it are no longer considered as belonging to any of the original classes, but are instead recognized as a new "fooling images" class.

Tests were performed with CPPN-encoded images on both the MNIST and ImageNet DNNs. The procedure is the following:

• train DNN1 on a dataset (ImageNet or MNIST);

• evolve CPPN images that produce a high confidence score for DNN1 for each class;

• put these images into the dataset, forming a new class n + 1;

• train DNN2 on this enlarged dataset;

• (optional) repeat the process, but put the images evolved for DNN2 in the n + 1 category rather than in a new n + 2 class, because any image that fools a DNN is a fooling image and can thus go in the n + 1 category. To represent different types of images, at each iteration m images (randomly sampled from both the first and last generations of multiple runs of evolution that produce high-confidence images for DNNi) are added to the n + 1 category.

Each evolution run on MNIST or ImageNet produces 20 and 2000 images respectively, with half taken from the first generation and half from the last.

Regarding MNIST, to make the n + 1 class have the same number of images as the other classes of this dataset, in the first iteration 6k images (taken from 300 evolutionary runs) are added to the training set. For each additional iteration, 1k new images are added to the training set. In this case, retraining the LeNet model with the new class did not improve its robustness: evolution still produces many unrecognizable images for DNN2 with confidence scores of 99.99%. Furthermore, repeating the process by training 15 DNNs with images that fooled DNN1 through DNNi-1 is useless, since evolution can always find new fooling images for DNNi (see figure 1.15).

Regarding ImageNet, the original training dataset was extended with a 1001st class, to which 9k images that fooled DNN1 were added. Although an ImageNet class contains up to 1300 images, it was found that giving the new fooling class the same number of images did not prevent fooling. Contrary to the MNIST results, for ImageNet models evolution was less able to evolve high-confidence images for DNN2 than for DNN1: the median confidence score significantly decreased, from 88.1% for DNN1 to 11.7% for DNN2 (figure 1.16).

To see whether DNN2 had learned features specific to the CPPN images that fooled DNN1, or whether it had learned features general to all CPPN images, several recognizable CPPN images (taken from Picbreeder.org) were sent to DNN2. Despite DNN2 having never seen those images, it correctly classified 64% of them. The retrained model thus learned features generic to CPPN images, helping to explain why producing new images that fool DNN2 is more difficult.


Figure 1.15: Columns are digits 0-9; rows are DNNi for i = 1...15. Each row shows the 10 final, evolved images from one randomly selected run (out of 30) per iteration. Medians are taken over images from all 30 runs.

Figure 1.16: Green line: median and max confidence scores for DNN1. Orange line: median and max confidence scores for DNN2 with images that fooled the previous DNN1.

1.5.2 Noise injection

Another strategy to immunize a neural network against adversarials is based on additional corruption of the inputs [7]. In this sense, the experiments use additive Gaussian noise and Gaussian blurring.

Figure 1.17 shows the trade-off between the adversarial examples recovered and the clean examples misclassified as the amount of additive noise is varied. For the ConvNet model, adding Gaussian noise with σv = 0.1 at L* allows the model to recover more than 35% of adversarial examples at the price of a small loss in model performance on clean data.

Figure 1.17: Gaussian additive noise: test error on clean data (left) and on adversarial data (right) with respect to the standard deviation of the additive Gaussian noise. L1 refers to noise applied at the input layer; L* refers to noise applied at the input layer plus all hidden layers.

Gaussian blurring is applied only at the input layer. The results show that this convolution works well in recovering from adversarial examples: for example, by applying a Gaussian blur kernel of size 11 to all input data, a ConvNet can recover more than 50% of adversarial examples, at the price of a 3% increase in the test error on clean data (Table 1.5).

However, neither Gaussian additive noise nor blurring is able to remove enough of the perturbation for the error on adversarial examples to become comparable to the error on clean data.
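A minimal sketch of the input-layer ("L1") variant of this defense, assuming a PyTorch classifier and tensor inputs, is shown below; it illustrates the idea rather than the exact pipeline of [7].

```python
import torch

def classify_with_input_noise(model, x, sigma_v=0.1):
    """Add zero-mean Gaussian noise with std sigma_v at the input layer before classifying."""
    return model(x + sigma_v * torch.randn_like(x))
```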

                  Clean data                     Adversarial data
Blur kernel size    -      5      11               -       5      11
N100-100-10       1.8%   2.6%   11.3%            99.9%   43.5%  62.8%
N200-200-10       1.6%   2.5%   14.8%            99.9%   47.0%  65.5%
AE400-10          2.0%   3.2%   16.6%            99.6%   68.3%  78.8%
ConvNet           0.9%   1.2%    4.0%            100%    53.8%  43.8%

Table 1.5: Gaussian blurring: test error on clean data (left) and on adversarial data (right) with respect to the blur kernel size.

1.5.3 Auto Encoder

An Auto Encoder⁴ (AE) takes an input x ∈ [0, 1]^{d_x} and maps it (with an encoder) to a hidden representation h ∈ [0, 1]^{d_h} through a deterministic mapping:

h = s_e(W x + b_h)

where s_e is the encoder's non-linear activation function (for example, the sigmoid σ(z) = 1/(1 + e^(-z))), W is the d_h × d_x weight matrix and b_h ∈ R^{d_h}.

Figure 1.18: Autoencoder

The hidden representation h is then mapped back (with a decoder) into a reconstruction y of the same shape as x. The mapping happens through a similar transformation:

y = s_d(W' h + b')

where s_d is the decoder's non-linear activation function (for example, again a sigmoid), W' is the d_x × d_h weight matrix and b' ∈ R^{d_x}. y should be seen as a prediction of x, given the code h. When we have tied weights, the weight matrix W' of the reverse mapping is constrained to be the transpose of the forward mapping: W' = W^T.

Given m training data points, the learning process of an AE simply consists in finding the model parameters θ = {W, W', b_h, b'} that minimize the following objective function:

J_AE(θ) = Σ_{i=1}^{m} L(x^(i), y^(i))

where L is a loss function penalizing y^(i) for being dissimilar from x^(i), such as the squared L2 norm of their difference.

⁴For further details on autoencoder architecture and its variants see
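A minimal PyTorch sketch of the single-hidden-layer case with sigmoid activations and tied weights (W' = W^T) could look as follows; the layer sizes are arbitrary examples, not those of the model described next.

```python
import torch
import torch.nn as nn

class TiedAutoencoder(nn.Module):
    """h = sigmoid(W x + b_h), y = sigmoid(W^T h + b'); reconstruction y ~ x."""
    def __init__(self, d_x=784, d_h=256):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(d_h, d_x))
        self.b_h = nn.Parameter(torch.zeros(d_h))
        self.b_out = nn.Parameter(torch.zeros(d_x))

    def forward(self, x):                                   # x: (batch, d_x)
        h = torch.sigmoid(x @ self.W.t() + self.b_h)        # encoder
        return torch.sigmoid(h @ self.W + self.b_out)       # decoder with tied weights

# Training minimizes J_AE = sum_i L(x_i, y_i), e.g. the squared L2 reconstruction error:
#   loss = ((model(x) - x) ** 2).sum()
```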

Researchers trained a 3-hidden-layer autoencoder (784-256-128-256-784 neurons) to map adversarial examples back to the original data samples. They also trained the model to map original training data back to itself, so that the autoencoder preserves the original data when non-adversarial samples are fed in. The autoencoder was trained using adversarial examples from the training set only, and the cross-model generalization capabilities of adversarial examples were then tested.

It was observed that AEs generalize very well on adversarial examples from different models: all autoencoders are able to recover at least 90% of adversarial errors, regardless of the model from which they originate. However, one drawback was found: if the autoencoder and its corresponding classifier are stacked to form a new feed-forward neural network, adversarial examples can again be generated from this stacked network, which is even more susceptible to adversarial noise.

              N100-100-10   N200-200-10   AE400-10   ConvNet
N100-100-10   2.3%          2.4%          2.3%       5.2%
N200-200-10   2.3%          2.2%          2.2%       5.4%
AE400-10      3.6%          3.5%          2.7%       9.2%
ConvNet       7.7%          7.6%          8.3%       2.6%

Test error (clean)     2.1%    1.9%    2.1%    1.1%
Avg adv distortion     0.049   0.051   0.043   0.038

Table 1.6: Cross-model autoencoder generalization test: error rates on adversarial test data. Columns indicate whose adversarial data the autoencoder is trained on; rows indicate whose adversarial test data the autoencoder is used to denoise. Entries correspond to error rates when the outputs of the autoencoder are fed into the model identified by the row label.

1.5.4 Denoising Auto Encoder

Denoising Auto Encoders (DAE) is a stochastic version of the auto-encoder and it is based on a simple concept: in order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, an autoencoder is trained to reconstruct the input from a corrupted version of it.

Intuitively, a DAE does two things: try to encode the input (preserve the information about the input), and try to undo the eect of a corruption pro-cess stochastically applied to the input of the auto-encoder. The stochastic corruption process randomly sets some of the inputs (at least half of them) to zero. Hence the denoising auto encoder tries to predict the corrupted (miss-ing) values from the uncorrupted (non-miss(miss-ing) values, for randomly selected subsets of missing patterns.

Researchers trained a standard DAE without knowledge of the adversarial noise distribution. At each training batch, each pixel in the input data is corrupted by adding independent Gaussian noise with mean 0 and standard deviation σ_v.
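A single training step of such a DAE could look as follows in Python/PyTorch (a sketch under the assumption that model is any reconstruction network, for example the autoencoder sketched above, that the inputs are flattened vectors, and that a standard optimizer is given; sigma_v mirrors the σ_v above):

    import torch

    def dae_training_step(model, optimizer, x, sigma_v=0.1):
        """Corrupt each input pixel with additive Gaussian noise (mean 0,
        std sigma_v) and train the model to reconstruct the clean input."""
        x_noisy = x + sigma_v * torch.randn_like(x)
        y = model(x_noisy)                         # reconstruction from the corrupted input
        loss = ((y - x) ** 2).sum(dim=1).mean()    # L2 distance to the CLEAN x
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()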

Results are reported in table 1.7. The DAE can still recover a significant portion of the adversarial noise. However, this model also suffers from the same deficiency: a stacked network is more susceptible to adversarials.

                      N100-100-10   N200-200-10   AE400-10   ConvNet
    DAE, σ_v = 0.1        5.0%          4.9%        11.5%      9.1%
    DAE, σ_v = 0.5       10.0%         10.6%        16.3%     15.3%

Table 1.7: Denoising autoencoder test. Error on the adversarial test data.

1.5.5 Deep Contractive network

A Deep Contractive Network (DCN) [7] is a generalization of the Contractive Auto Encoder (CAE), which is a variant of an autoencoder with an additional penalty, corresponding to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This forces the model to learn a function that does not change much when x changes slightly. Because this penalty is applied only at training examples, it forces the autoencoder to learn features that capture information about the training distribution.

For the CAE, the objective function has an additional term:

    J_CAE(θ) = Σ_{i=1}^{m} L(x^{(i)}, y^{(i)}) + λ ‖∂h^{(i)}/∂x^{(i)}‖²

where λ is a scaling factor that trades off the reconstruction objective with the contractive objective, and ‖∂h^{(i)}/∂x^{(i)}‖² is the squared Frobenius norm of the Jacobian matrix of h^{(i)} with respect to x^{(i)}.

A Deep Contractive Network (DCN) outputs y ∈ R^{d_y} with a target t ∈ R^{d_y}. For a network with H hidden layers, let f_i denote the function computing the hidden representation h_i ∈ R^{d_h} at hidden layer i: h_i = f_i(h_{i−1}), i = 1, ..., H+1, with h_0 = x and h_{H+1} = y. Ideally, the model should penalize the following objective:

    J_DCN(θ) = Σ_{i=1}^{m} L(t^{(i)}, y^{(i)}) + λ ‖∂y^{(i)}/∂x^{(i)}‖²

However, this penalty is computationally expensive, because it requires partial derivatives at each layer in the standard back-propagation framework. Hence, the authors used a layer-wise approximation of the above objective function, i.e.:

    J_DCN(θ) = Σ_{i=1}^{m} [ L(t^{(i)}, y^{(i)}) + Σ_{j=1}^{H+1} λ_j ‖∂h_j^{(i)}/∂h_{j−1}^{(i)}‖² ]
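As a rough illustration of this layer-wise approximation, the following Python sketch computes the penalty in closed form for fully connected sigmoid layers (for a sigmoid layer the squared Frobenius norm of the Jacobian factorizes as shown in the comments); only the hidden layers are penalized here, and all names are our own:

    import torch
    import torch.nn.functional as F

    def contractive_penalty(linear, h_prev):
        """Squared Frobenius norm of the Jacobian of h = sigmoid(W h_prev + b)
        with respect to h_prev, in closed form for a sigmoid layer:
        ||J||_F^2 = sum_k (h_k (1 - h_k))^2 * sum_l W_kl^2."""
        h = torch.sigmoid(linear(h_prev))
        sig_grad_sq = (h * (1.0 - h)) ** 2            # shape (batch, d_out)
        w_row_sq = (linear.weight ** 2).sum(dim=1)    # shape (d_out,)
        return (sig_grad_sq * w_row_sq).sum(dim=1).mean(), h

    def dcn_objective(hidden_layers, out_layer, x, t, lambdas):
        """Task loss plus the layer-wise contractive penalties approximating
        J_DCN above (only the hidden layers are penalized in this sketch)."""
        penalty, h = 0.0, x
        for linear, lam in zip(hidden_layers, lambdas):
            p, h = contractive_penalty(linear, h)
            penalty = penalty + lam * p
        logits = out_layer(h)
        return F.cross_entropy(logits, t) + penalty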


The DCN models are trained until their accuracy reaches almost the same value as that of the original models without a contractive penalty. Thanks to the contractive penalty, the minimum distortion of the adversarial noise is increased (see table 1.8). Table 1.9 shows how Deep Contractive Networks are more robust than a standard neural network trained with Gaussian input noise. In addition, the minimum distortion of adversarial noises can be further improved by combining a DCN with Gaussian input noise.

    Model Name     DCN error   DCN avg. distortion   orig. error   orig. adv. distortion
    N100-100-10      2.3%            0.107              1.8%              0.084
    N200-200-10      2.0%            0.102              1.6%              0.087
    AE400-10         2.0%            0.106              2.0%              0.098
    ConvNet          1.2%            0.106              0.9%              0.095

Table 1.8: The error rates on clean test data and the average distortion of adversarial examples generated from the original model (orig) and the same model with contractive penalty (DCN). The error rates on the adversarial examples are 100%.

    Training condition           Test error   Avg. distortion
    DCN                             2.0%           0.102
    GN, L1, σ_v = 0.1               1.8%           0.095
    GN, L*, σ_v = 0.1               2.0%           0.099
    DCN + GN, L1, σ_v = 0.1         2.2%           0.108

Table 1.9: The error rates on clean test data and the average distortion on adversarial examples from N200-200-10 models with different training conditions. GN stands for Gaussian additive noise during training. L1 means that the noise is applied only at the input layer, L* that it is applied to the input layer and the hidden layers.


Chapter 2

Preliminary tests

In this chapter we have gathered all the preliminary analyses that were done to study the nature of adversarial examples. We investigated in detail the Euclidean space of image features by taking a couple of classes from the ImageNet training set and comparing all the original images with fooling images generated from them.

2.1 OverFeat

Before starting to discuss the statistical distributions obtained, we take a moment to introduce the neural network used for features extraction and generation of the adversarial images. We chose OverFeat [8], which is a Convolutional Network-based image features extractor and classifier. This network is already trained on the ImageNet¹ 2012 training set.

This neural network is a modified version of the typical Alex Network; during the training process it uses the same fixed input size approach proposed in [9], but turns to multi-scale for classification. According to the constant input dimensionality required by the Alex Network model, OverFeat takes all the full-resolution images of ImageNet and down-samples them to a fixed resolution of 256 pixels. In other words, given a rectangular image, the network first rescales it such that the shorter side is of length 256 and then, from the resulting image, it crops out the central 256 × 256 patch. After that, 5 random crops (and their horizontal flips) of size 221x221 pixels are extracted and presented to the network in mini-batches of size 128.
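For illustration only, this training-time preprocessing could be rendered with torchvision roughly as follows (the original OverFeat pipeline is implemented in Torch7, so this is an approximate sketch that draws one random crop per call):

    from torchvision import transforms

    train_preprocess = transforms.Compose([
        transforms.Resize(256),              # rescale so that the shorter side is 256 pixels
        transforms.CenterCrop(256),          # keep the central 256x256 patch
        transforms.RandomCrop(221),          # one of the random 221x221 crops
        transforms.RandomHorizontalFlip(),   # ...and, with probability 0.5, its horizontal flip
        transforms.ToTensor(),
    ])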

Parameters of the network are set as follows:

¹ For further details on the ImageNet dataset structure visit the site


    Model           Top-1 error   Top-5 error
    AlexNet           40.7%         18.2%
    OverFeat fast     39.28%        17.12%

Table 2.1: Error rates for the classification task on the validation set.

    Layer            1          2          3         4         5         6      7      8
    Stage            conv+max   conv+max   conv      conv      conv+max  fc     fc     fc
    # of channels    96         256        512       1024      1024      3072   4096   1000
    Filter size      11x11      5x5        3x3       3x3       3x3       -      -      -
    Conv. stride     4x4        1x1        1x1       1x1       1x1       -      -      -
    Pooling size     2x2        2x2        -         -         2x2       -      -      -
    Pooling stride   2x2        2x2        -         -         2x2       -      -      -
    ZP size          -          -          1x1x1x1   1x1x1x1   1x1x1x1   -      -      -
    SI size          231x231    24x24      12x12     12x12     12x12     6x6    1x1    1x1

Table 2.2: Fast model architecture. ZP means Zero Padding, SI is Spatial Input. Layer 8 is the output layer.

• the weights in the network are initialized randomly with μ = 0 and σ = 1 × 10^-2. Stochastic gradient descent with a momentum term of 0.6 and an ℓ2 weight decay of 1 × 10^-5 is used to update them;

• the learning rate is initially 5 × 10^-2 and is successively decreased by a factor of 0.5 after {30, 50, 60, 70, 80} epochs;

• DropOut [10] with a rate of 0.5 is employed on the Fully Connected layers (6th and 7th) in the classifier.

The detailed architecture is summarized in table 2.2. Layers 1-5 are similar to the standard AlexNet, using ReLU and Max Pooling, but with the following differences:

• no contrast normalization is used;
• pooling regions are non-overlapping;
• due to a smaller stride in the 1st and 2nd layers, the resulting feature maps are larger than in the AlexNet model.

OverFeat is provided with two different models: fast and accurate. We chose the fast one, which has a top-1 error score of 39.28% and a top-5 error score of 17.12% (see table 2.1).


• input: 3x231x231
• stage 1: convolution 11×11, stride 4×4; ReLU; max pooling 2×2, stride 2×2; output (layer 3): 96x24x24
• stage 2: convolution 5×5, stride 1×1; ReLU; max pooling 2×2, stride 2×2; output (layer 6): 256x12x12
• stage 3: convolution 3×3, stride 1×1, zero-padded; ReLU; output (layer 9): 512x12x12
• stage 4: convolution 3×3, stride 1×1, zero-padded; ReLU; output (layer 12): 1024x12x12
• stage 5: convolution 3×3, stride 1×1, zero-padded; ReLU; max pooling 2×2, stride 2×2; output (layer 16): 1024x6x6
• stage 6: convolution 6×6, stride 1×1; ReLU; output (layer 18): 3072x1x1
• stage 7: fully connected; ReLU; output (layer 20): 4096x1x1
• stage 8: fully connected; output (layer 21): 1000x1x1
• output stage: softmax; output (layer 22): 1000x1x1
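The stage list above translates almost directly into the following hypothetical PyTorch definition; it sketches the topology only (the official model is distributed as Torch7 weights, and DropOut on the fully connected stages is omitted for brevity):

    import torch.nn as nn

    overfeat_fast = nn.Sequential(
        # stage 1: 11x11 conv, stride 4, then 2x2 max pooling -> 96 feature maps
        nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
        # stage 2: 5x5 conv, stride 1, then 2x2 max pooling -> 256 feature maps
        nn.Conv2d(96, 256, kernel_size=5, stride=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
        # stages 3-5: 3x3 zero-padded convolutions; pooling only after stage 5
        nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
        # stage 6: 6x6 conv acting as the first fully connected layer -> 3072x1x1
        nn.Conv2d(1024, 3072, kernel_size=6, stride=1), nn.ReLU(inplace=True),
        nn.Flatten(),
        # stages 7-8: fully connected layers; softmax is applied on top of the 1000 outputs
        nn.Linear(3072, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000),
    )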

2.2 Generating Adversarial Example

The first step of our studies consists in the generation of the adversarial examples. We therefore retrieved the necessary scripts in order to build them. We decided to use the Box Constrained and Fast Gradient methods.²

The choice of the class from which to create fooling images fell on Kit Fox, vulpes macrotis.

2.2.1 Fooling images generated with Box Constrained

The script downloaded for the Box Constrained method takes as input an initial seed, through which the algorithm randomly selects an adversarial class. Thus, the original image sent to this script will be perturbed such that the resulting one will be classified by OverFeat as belonging to the adversarial class with a high confidence score.

Since the same initial seed gave us the same random adversarial class, the script allowed us to steer our fooling images toward a specific class. The adversarial class chosen was Pelican.

² Codes are available on the sites:
- Fast Gradient: https://github.com/e-lab/torch-toolbox/tree/master/Adversarial
- Box Constrained: https://github.com/tabacof/adversarial

We generated 5 adversarial images: figure 2.1 shows the original images and the fooling ones, together with the difference between them. Notice that the algorithm perturbed the original images in a very imperceptible way and that this perturbation is localized only in the region of the image where, probably, the features are characteristic of the class Kit Fox.
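The downloaded script is written in Torch; purely as an illustration of the idea (minimize the perturbation norm plus the loss towards the chosen adversarial class, while keeping pixels inside a valid box), a targeted box-constrained attack can be sketched in Python as follows. The sigmoid change of variables, the constant c and all names are our own choices and do not reproduce the actual implementation:

    import torch
    import torch.nn.functional as F

    def box_constrained_adversarial(model, x, target_class, c=0.01, steps=100):
        # Assumptions of this sketch: x is a 1x3xHxW image tensor with pixel
        # values in [0, 1] and model returns class logits.
        x_c = x.clamp(1e-4, 1 - 1e-4)
        w = torch.log(x_c / (1 - x_c))       # logit of x, so sigmoid(w) ~ x at the start
        w.requires_grad_(True)
        target = torch.tensor([target_class])
        optimizer = torch.optim.LBFGS([w], max_iter=steps)

        def closure():
            optimizer.zero_grad()
            x_adv = torch.sigmoid(w)         # stays inside the [0, 1] box by construction
            loss = c * (x_adv - x).norm() + F.cross_entropy(model(x_adv), target)
            loss.backward()
            return loss

        optimizer.step(closure)
        return torch.sigmoid(w).detach()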

2.2.2 Fooling images generated with Fast Gradient

Unlike with the previous method, we were not able to steer the perturbation in order to choose the adversarial class for the fooling images created, but we were quite lucky because we managed to get some adversarials with the same class. The code only allowed us to modify the number of iterations of the algorithm and the intensity ε of the perturbation. According to [2] we set ε = 0.007. The number of iterations was set to 15, because with a larger number the perturbation was too visible and the script was not able to generate a good adversarial. We executed the script several times until we obtained some adversarials with the same class.
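With the settings used here (ε = 0.007, 15 iterations), the iterative fast-gradient-sign procedure can be sketched as follows; the sketch assumes the model returns class logits and that pixels lie in [0, 1], both simplifications of the actual Torch script:

    import torch
    import torch.nn.functional as F

    def fast_gradient_adversarial(model, x, label, eps=0.007, iterations=15):
        # label: tensor with the original class index for each image in the batch
        x_adv = x.clone().detach()
        for _ in range(iterations):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), label)
            grad, = torch.autograd.grad(loss, x_adv)
            # take a small step in the direction that increases the loss for the true label
            x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()
        return x_adv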

If we analyze the fooling images we clearly notice the main difference between this method and Box Constrained: the localization of the perturbation. Here the noise is distributed throughout the whole image; this distribution of the perturbation allows the method to jump from one class to another until the maximum number of iterations is reached. As a consequence, Fast Gradient Sign can generate many types of adversarials, most of which fall into 4 main categories:

• true adversarial: the original image is classified correctly by the network, but when perturbed it receives a completely unrelated class label;
• re-focused adversarial: with the perturbation, the ConvNet re-focuses its attention from the original entity in the image to a new significant entity present in the same image;
• conaturally adversarial: labels having nuanced synsets, like dog and cat, lend themselves strongly to this type of adversarial example. Perturbed images of this class mostly fall in the same family, genus or species;
• benign adversarial: the original image is misclassified by the network (this can happen since neural networks have an error rate on the classification task), while the adversarial example is classified correctly with relatively high confidence. These examples are adversarial in the true sense of the definition, but are not dangerous.

Figure 2.1: Adversarial images generated by Box Constrained. Left column contains original images. Central column shows the difference between original and fooling images. Right column contains adversarial images. Classifications and confidence scores (original → adversarial): Kit Fox 36.37% → Pelican 50.96%; Kit Fox 47.94% → Pelican 59.48%; Kit Fox 53.49% → Pelican 99.98%; Kit Fox 56.45% → Pelican 99.98%; Kit Fox 87.96% → Pelican 99.98%.

Figure 2.2: Adversarial images generated by Fast Gradient. Left column contains original images. Right column contains adversarial images. Classifications and confidence scores (original → adversarial): Kit Fox 76.53% → Dingo 99.96%; Kit Fox 47.94% → Dingo 99.97%; Kit Fox 56.45% → Dingo 92.66%.

Adversarials generated by Box Constrained are obviously true adversarials, while Fast Gradient adversarials are conaturally adversarial. Figure 2.2 shows the fast gradient adversarials.

2.3 Analysis of the features space

Once we had created the adversarial samples, we began the study of the features space. In the previous chapter we discussed the linear explanation of
