Università degli Studi di Pisa

DIPARTIMENTO DI FISICA
Corso di Laurea Magistrale in Fisica

A generative adversarial network approach for the attenuation correction in PET-MR hybrid imaging

Candidato:

Francesco Laruina

Relatore:


Contents

Introduction

1 Techniques for attenuation correction in PET-MR
  1.1 MR-based methods
    1.1.1 Segmentation based approaches
    1.1.2 Atlas based approaches
  1.2 PET-based methods
    1.2.1 Transmission imaging
    1.2.2 Emission imaging

2 Machine learning algorithms
  2.1 Machine learning approaches
    2.1.1 Supervised learning
    2.1.2 Unsupervised learning
    2.1.3 Reinforcement learning
  2.2 Artificial Neural Networks
    2.2.1 Stochastic gradient descent
  2.3 Convolutional neural network
  2.4 AutoEncoder
  2.5 Generative networks
    2.5.1 Variational Autoencoder
    2.5.2 Generative Adversarial Networks
  2.6 Image-to-image translation: the choice of CycleGAN

3 Materials and computing facilities
  3.1 Dataset
    3.1.1 PPMI
    3.1.2 ABIDE
    3.1.3 NAPLAB
    3.1.4 Data preprocessing
  3.2 Computing facilities and tools
    3.2.1 TensorFlow
  3.3 Computing resources

4 Approaches using autoencoders
  4.1 Our replication attempt using autoencoders
  4.2 Our first 2D CycleGAN implementation
  4.3 Comparison of paired vs. unpaired training performances

5 Development of a 3D CycleGAN
  5.1 Residual blocks
  5.2 Generators
  5.3 Discriminator
  5.4 Full network architecture
    5.4.1 Training the generative chain
    5.4.2 Training the discriminating chain
  5.5 Performance evaluation
    5.5.1 Visual comparison
    5.5.2 Quantitative results
  5.6 Further improvements


Introduction

In this master thesis, a novel deep learning approach to the problem of attenuation correction in PET-MR is discussed. We explore a data-driven algorithm for attenuation correction that transforms structural T1-weighted magnetic resonance (MR) images into pseudo-CTs, i.e. images whose intensity values are similar to the ones expected in a computed tomography (CT) image.

Positron emission tomography (PET) provides functional images useful to track metabolic processes in the body and enables the diagnosis of several diseases. A tracer marked with a β+ emitter is injected into the patient's body. The annihilation of positrons with electrons in the human body produces two 511 keV photons, which travel away almost back-to-back [1]. The direction along which they travel, since they move almost collinearly, defines a line of flight (LOF). A ring of detectors is used to detect the photons. Each couple of detectors defines a line of response (LOR), to which events are associated when the two detectors are activated within a time window (coincidence window). After the scan, a reconstruction algorithm produces a map of activity in the patient's body. Photons do not travel in vacuum but in the human body, thus a correction for their attenuation in tissues is required.

PET images are characterized by limited spatial resolution. In order to combine morphological details with functional ones, PET-CT (PET and computed tomography) and PET-MR (PET and magnetic resonance) systems have been developed. Linear attenuation coefficient maps are obtainable directly from the CT scan in the case of PET-CT by means of an accurate energy rescaling to 511 keV.


No such direct conversion is available in PET-MR to derive the attenuation properties of tissues from MR signals. In chapter 1, an overview of the techniques that have been developed to address this kind of problem is presented. Our approach is based on machine learning (ML) and artificial intelligence, whose main concepts will be highlighted in chapter 2. Developing and training a deep neural network requires huge amounts of data to process, thus we collected scans from different datasets, as explained in chapter 3. Previously implemented deep learning techniques for this purpose require paired data, thus our first attempt was to replicate similar approaches, as detailed in chapter 4. Unfortunately, it is quite hard to obtain a large dataset of paired medical images, i.e. MR and CT images belonging to the same patient. To overcome this limitation, we chose to develop an approach based on a Generative Adversarial Network (GAN) trained on unpaired data. A GAN is a deep learning architecture composed of two neural networks, a generator and a discriminator, fighting against each other: the generator tries to map the input to the desired output and the discriminator tells whether the generated output is good or not. In the training phase, the generator has to maximize the similarity to the desired output and the score provided by the discriminator; the discriminator instead has to distinguish between the fakes, produced by the generator, and the original data. After the training phase, the generator is capable of mapping any point in the input space (MR images) to a point in the output space (pseudo-CT images).

The generation of pseudo-CTs from MRs with an unpaired training set has been approached here by using a CycleGAN with some ad-hoc modifications that adapt it to work with three-dimensional volumes. In fact, prior implementations of deep learning models for the generation of medical images work on single slices of the acquired images, due to the availability of algorithms developed for 2D natural images and the lack of computing power. CycleGAN is characterized by the presence of four sub-networks: two generators, implementing the transformations from the MR to the CT domain (MR2CT) and vice versa (CT2MR), and two discriminators (fake CT vs. real CT, fake MR vs. real MR). A cyclic consistency constraint imposes that the whole cycle is the identity operator: MR ≈ CT2MR(MR2CT(MR)).


This requirement, introduced in the loss function, guides the network training to generate not just any image but an image of the specific input patient.

To evaluate the achieved performance, the structural similarity index and the mean squared error between the generated output and the expected one have been computed. Details on the implementation and the results achieved are presented in chapter 5.


Chapter 1

Techniques for attenuation correction in PET-MR

Data acquired in MR scanners are signals related to the proton density and relaxation times of tissues after a magnetic stimulus has been provided. There is no direct relationship between MR signal intensity and the attenuation coefficient of tissues.

Tissues characterized by low proton density and very fast relaxation times, like bones and lungs, usually provide a response of low intensity and are difficult to distinguish in MR. Bone is an emblematic case of a tissue that standard MR sequences are unable to distinguish, yet it is the most photon-attenuating one in the human body.

Several techniques for attenuation correction in PET-MR, differing in the physical principles they rely on and in their implementations, have been developed to address the problem. They can be divided into three main categories: those based on MR, those based on transmission imaging and those based on emission imaging. In order to test the effectiveness of a method, the results are compared against a reference standard provided by PET-CT data. With respect to this reference, MR has the advantage of not requiring any extra dose to the patient.

In PET-CT scans there is a mismatch between the PET and CT images: the CT acquisition can be considered a snapshot, while PET takes longer times.


In the latter case, the averaging over several respiratory cycles produces artifacts at the boundary between the diaphragm and the lungs.

1.1 MR-based methods

1.1.1 Segmentation based approaches

Segmentation approaches classify pixels of the acquired MR images into discrete classes [2]. Each label corresponds to an anatomical region to which a linear attenuation coefficient is assigned. Inside each region the variability of µ is not accounted for.

Segmentation is achieved with different algorithms, ranging from thresholding and morphological transforms to more complex operations. A big advantage of this kind of technique is that it works with lots of different structures with different shapes, but at the same time it is unable to distinguish inter-subject differences in attenuation coefficients.

Subject-specific data coming from the MR scan would be useful to fill the gap between a general map and the individual one, but no useful relation has been discovered so far.

Signals coming from standard MR sequences do not provide any piece of information with regard to bones and air.

Bones have a very short T2 relaxation time that can be measured by means of a specific family of sequences called UTE (Ultrashort Echo Time). Using these sequences, bones give a relatively high signal and can be discriminated from other tissues [3]. The downside of this approach is that the sequence length is of the order of 3-5 minutes per bed position and artifacts appear for large objects, making it particularly useful in brain scans and not easily applicable to general use cases.

Protons in water have a different Larmor frequency than protons in fat. Based on that fact, Dixon sequences are used to distinguish tissues mainly made of water from fat-based ones [4]. Another important thing to note is that Dixon sequences are helpful in localizing PET uptake. Combining information coming from different sequences, errors could be reduced.


1.1.2 Atlas based approaches

Template methods do not require any additional time or the acquisition of different sequences because they are based on atlases built upon datasets of MR and CT scans; sometimes they also use data from PET transmission measurements [5].

There is a wide range of techniques to generate templates combining different kinds of information. Once a template is obtained as a couple of an MR and a CT image, the flow that leads from the clinical scan to the µ-map is the following:

1. acquisition of an image (MR) of the patient

2. registration of the template to the patient's image

3. application of the same transform to the CT template in order to anatomically fit it to the body of the patient

An advantage of these techniques is that they do not require any additional scans to include bones, so no UTE sequences [3] have to be performed; moreover, more anatomical details and a continuous distribution of linear attenuation coefficients are represented, compared to segmentation approaches. A clear disadvantage is the total absence of abnormalities: if the patient's anatomy is well represented by the template everything works; if that is not the case, problems arise.

Atlas-based techniques have proved to be useful in brain scans, but they are not suitable for whole-body usage because they do not take into account the individual's abnormalities. Several segmentation-based techniques make use of a priori knowledge about human morphology to regularize the problem and make it converge to a good solution. Unfortunately, they are modeled on a standardized image, so they share similar issues with atlas-based approaches with regard to the specific patient's abnormalities.


1.2 PET-based methods

1.2.1 Transmission imaging

For scanners that are PET-only (no CT or MR), transmission imaging has been used and provides a direct measurement of the linear attenuation coefficients for photons at 511 keV for all the objects in the field of view [5] (RF coils included, in the case of PET-MR). In principle, this could work better than PET-CT because no energy rescaling to 511 keV is needed to create the right µ-map to use when reconstructing the image coming from the PET scan.

Unfortunately, there are some limitations to the practical implementation. Transmission requires a source that emits at 511 keV, and in order to obtain a reconstruction of good quality high counts are required. This cannot be achieved easily because the higher the counts, the higher the dose the patient has to take. Combined emission/transmission systems require a long time for sequential data acquisition, and cross-talk, in the case of simultaneous scans, is a major concern that needs to be carefully addressed.

In machines capable of measuring the time of flight, when the emitting source is at a relatively long distance from the body, the additional piece of information can help discriminate whether a photon originated from transmission or emission.

In the case of a combined PET-MR apparatus, spatial constraints limit the possible positions in which to place the emitting source and rotating sources are not easy to implement.

1.2.2 Emission imaging

A sinogram can be obtained from PET emission data that are not attenuation corrected by using algorithms like Maximum Likelihood reconstruction of Attenuation and Activity (MLAA) [6]. It has been shown that the information recovered in this way is not sufficient to derive attenuation coefficients, but using the PET signal in combination with MR-based µ-maps has proven to be helpful in reducing gaps in the truncated MR map reconstruction.


Only tissues where the tracer accumulates give signal, so if there is no uptake in some areas no information about attenuation can be inferred. Everything other than the patient's body does not emit. Scatter correction is complex and may result in a bad estimation of both the attenuating properties of tissues and the emission.


Chapter 2

Machine learning algorithms

The term artificial intelligence (AI) refers to devices which, given some inputs, are capable of making choices to achieve a goal.

Examples of methods based on artificial intelligence are expert systems and machine learning (ML) techniques. The former reach their goals by evaluating the input provided against a chain of if-then rules. The latter, instead, are related to pattern recognition and statistical inference. Provided an adequate number of samples, which should represent the inputs in real use cases, a machine learning algorithm is asked to extract useful features and deduce the functional relation between input and output with no human intervention.

The idea behind the proposed attenuation correction method, based on an unsupervised machine learning algorithm, is basically to obtain a pseudo-CT image from a T1-weighted MR image of the patient.

In the following sections the general concepts and terminology used in the context of deep learning will be presented.

2.1 Machine learning approaches

There are three main types of approaches to ML, which will be discussed below:

• supervised learning;
• unsupervised learning;
• reinforcement learning.

2.1.1 Supervised learning

Supervised learning requires a paired dataset to train on. Paired, in this context, means that each sample is composed of a couple (x, y), where x is the input data, usually provided in the form of a vector or a matrix, and y is the corresponding output. We expect that a relation f exists so that y = f(x), and we ask a machine to deduce from the samples the best approximation for it. The downside of the approach is the need for an adequate amount of samples to train on. In the case of classification tasks, a database of, usually, human-labeled inputs is required. Apart from the aforementioned classification tasks, regression problems are also approachable by means of supervised training.

2.1.2 Unsupervised learning

The training dataset, when dealing with unsupervised techniques, is made up of inputs only: there is no target output to map the input into. We assume that there is some structure in the data that we want to exploit. One of the best known examples of unsupervised learning is clustering, which divides a set of observations, given as input, into subsets. Once a similarity metric is defined, the aim of the training is finding a rule to assign each observation to a specific cluster, requiring the point to be near the other ones in the same cluster and, at the same time, far from points belonging to other groups.

2.1.3 Reinforcement learning

The goal of a machine learning algorithm based on reinforcement learning is to maximize a reward by taking decisions in a dynamic context. Let us make an example: a human plays chess against a software agent. Maximizing the reward means achieving a victory. At each move, the variables in the game change, and the software needs to try and learn new winning strategies from experience.


2.2 Artificial Neural Networks

An artificial neural network can be described as a connected and directed graph whose nodes take values which are the result of a non-linear activation function applied to a weighted mean of the values of the nodes pointing to it. The nodes, called artificial neurons or perceptrons, are inter-connected by arrows directed from input to output. To each edge of the graph a weight is associated.

As shown in figure 2.1, the basic structure of feed-forward neural networks, such as multilayer perceptrons (MLP), is made up of at least three layers: an input layer, an output layer and one or more hidden layers in between. In a fully connected network, each neuron is connected to all the perceptrons in the preceding layer. We can arrange the values taken by the

Figure 2.1: The typical structure of a feed-forward network. Image taken from www.pyimagesearch.com.

nodes in a layer as a vector, and the weights involved in the calculation as a matrix. For the i-th layer, except for the input one, the following expression holds:

v^{(i)} = f^{(i)} ( \sum_{j=0}^{m} W_{ij} v^{(i-1)}_j )

where f^{(i)} is a non-linear activation function and m is the number of neurons in the (i-1)-th layer. In case an activation function is not used, the whole network would reduce to a composition of linear operators, thus the entire


transform could be written in terms of a matrix dot product. The addition of non-linearities greatly improves the generalisability of the same network architecture to a wide range of different tasks. The choice of the activation function is of paramount importance and each layer can have a different one. In order to build a network, decisions on how many neurons, how many layers, which activation to use, etc. have to be taken. After the structure has been defined, the weights characterize the mapping from the input to the output space. How are the weights chosen? The aim is minimizing the discrepancy, measured by a loss function, between the produced output and the desired one. For that purpose several optimization strategies are possible, one of the most famous is called Stochastic Gradient Descent (SGD), which, as the name suggests, is based on the gradient descent method, which will be described later on.
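As a concrete illustration of the layer expression above, the following minimal sketch (hypothetical sizes and names, not code from this thesis) computes the output of one fully connected layer in Python/NumPy:

import numpy as np

def dense_layer(v_prev, W, activation=np.tanh):
    # v_prev: values of the previous layer, shape (m,)
    # W: weight matrix of the current layer, shape (n, m)
    # returns the n values of the current layer, v = f(W v_prev)
    return activation(W @ v_prev)

rng = np.random.default_rng(0)
v_in = rng.normal(size=4)          # 4 neurons in the previous layer
W = rng.normal(size=(3, 4))        # 3 neurons in the current layer
print(dense_layer(v_in, W))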

Basically, the training phase of a feed-forward neural network could be described as follows:

1. weights are initialized (usually in a random manner);

2. a learning algorithm minimizes the loss by iteratively adapting the weights.

In general, the structure of the network, together with the choice of activation functions, defines a class of operators which, depending on the weights, map an input tensor x to an output tensor y. This class of operators can be described as

F_w : x → y

where x ∈ R^n, y ∈ R^m and the subscript w indicates a combination of weights. Let us suppose a relationship y = g(x) exists, even if not explicitly known: the aim of the training is to find the weights w_0 that make the operator F_{w_0} approximate g(x) in the best possible way.

Optimization is achieved by means of iterative procedures, which take into account how well the approximation works on the samples. An iteration over the entire dataset is defined as an epoch.


Figure 2.2: Underfitting, optimal and overfitting conditions. Image taken from pythonmachinelearning.pro.

A good operator can be found only if the network has been defined correctly and no underfitting or overfitting occurs.

To better explain these situations, let us make an example (see figure 2.2). We extract some points (x_i, y_i) from a parabola, then we plot them. If we try to fit the data with a linear model, underfitting shows up: the model does not have enough parameters to adapt to the quadratic curvature of the data. Let the number of points be N; in case we fit a polynomial of degree N + 1, we will get no error: the polynomial crosses every point of the dataset. Is the model perfect? What happens if we extract some more data from the parabola? Chances are high that the (N + 1)-th extracted point would not be matched and, if we continue, the results will get worse and worse. This situation is known as overfitting: the parameters are modeled on the specific data but are not capable of generalizing the result to new input.

Underfitting, usually, is a symptom that the network is not complex enough to adapt to the task it is being trained for, thus increasing the model capacity is required. On the other hand, the more complex the structure, the easier it is to run into overfitting. In many other cases the space of solutions is just huge, and constraints are required in order to lead the learning algorithm to find the best weights. L1 or L2 regularization terms, for example, force the network to avoid learning weights that are too big, by adding to the cost function a penalty proportional to the norm of the weights. This prevents the network from overestimating some weights with respect to others, which would make the process of updating the weights less effective: if a weight is two orders of magnitude bigger than another one, the effectiveness of iteratively updating the second weight becomes negligible.

Figure 2.3: An insight of how dropout works conceptually. Image taken from https://pythonmachinelearning.pro.

Another interesting and effective technique is dropout [7]. To adapt to the training data, cooperative effects between neurons arise, but unfortunately they do not generalize to unseen samples. To break this mechanism, a fraction of neurons is randomly chosen and deactivated during training; by doing so, no co-adaptation is made possible and more general features are learned. The downside is an increase in the computational cost, i.e. time, of the training phase.

2.2.1 Stochastic gradient descent

Usually an optimization problem is solved by minimizing a loss function that can be written as a summation over the error made on each sample. In the general continuous case, we define the discrepancy between the produced result and the expected one as

L(z, w) = l(F_w(x), y)

where z = (x, y).


The expected risk is defined as:

E(f_w) = \int l(f_w(x), y) dP(z)

and it represents the loss averaged over the distribution of the samples, the z values; at the same time it measures the expected loss even on future examples, which are assumed to follow the same distribution as the ones provided in training.

The training set is formed by a discrete number of couples (x, y), so the best approximation we can get for the expected risk is the empirical risk:

E_n(f_w) = (1/n) \sum_{i=1}^{n} l(f_w(x_i), y_i)

Minimizing the empirical risk instead of the expected risk, according to statistical learning theory, is a legitimate approach if some restrictions on the class F apply and if the samples provided correctly mimic the real distribution.

The function to minimize depends on z = (x, y) parametrically and on w as variables, so the optimization algorithm will find the weights w_0 such that E_n(z, w_0) is at its minimum.

An optimizing iterative procedure consists in updating the weights along the direction opposite to the gradient. The gradient indicates the direction of maximum growth of the function, so going in the opposite direction leads towards the minimum (see figure 2.4). This algorithm is known as gradient descent. The rule used to update the weights is defined in the following recursion formula, where γ is known as the learning rate:

w_{t+1} = w_t − γ (1/n) \sum_{i=1}^{n} ∇_w L(z_i, w_t)

The expression shows that in order to update the weights a calculation over the whole dataset is needed. That would make for a huge computational cost, thus a simplification, the stochastic gradient descent (SGD), is required


Figure 2.4: A graphic representation of how gradient descent works.

and takes the form

w_{t+1} = w_t − γ ∇_w L(z_r, w_t)

where z_r is a randomly picked sample. The computation is a lot cheaper

than the standard gradient descent, but the convergence speed of SGD is influenced by how well the true gradient is approximated by the single sample chosen. Not averaging over the dataset makes the calculation a lot noisier and more dependent on the learning rate. The steps taken towards the minimum are directly proportional to both the learning rate and the gradient. A low learning rate usually guarantees good results at the cost of longer computations, which means it is not always viable to choose small values for this parameter. On the other hand, big learning rates could lead either to a plateau, a situation in which the cost function oscillates around a constant value, or to a divergence. There is no general rule to choose the best value for the learning rate; a monitoring phase of the training and some exploratory work on different choices are required.

In practice, a feasible approach to move out of plateaus exists and it requires scheduling a reduction in the learning rate. For example, if the loss function


Figure 2.5: How the loss evolution should qualitatively look. Image taken from towardsdatascience.com.


remains unchanged for five epochs, lowering the learning rate by a certain factor, decided while implementing the training procedure for the network, could help moving out of that situation.

An alternative implementation of the SGD algorithm is characterized by the calculation being carried out not on a single sample but on a randomly chosen mini-batch, i.e. a subset, of the samples. This makes the evolution of the loss function through the epochs smoother and the approximation of the true gradient more robust.
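As an illustration of the update rules above, the following minimal sketch (hypothetical names and a random stand-in dataset, not taken from the thesis code) implements mini-batch SGD for a linear model with a mean squared error loss:

import numpy as np

def sgd_epoch(w, X, Y, lr=0.01, batch_size=8):
    # one pass over the dataset, updating w on randomly chosen mini-batches
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], Y[batch]
        # gradient of the mean squared error on the mini-batch w.r.t. w
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(batch)
        w = w - lr * grad                  # w_{t+1} = w_t - gamma * grad
    return w

# toy linear regression problem
X = np.random.randn(100, 3)
Y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for epoch in range(50):
    w = sgd_epoch(w, X, Y)
print(w)                                   # approaches [1.0, -2.0, 0.5]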

Gradient-based optimizers require plenty of gradients to be calculated by means of back-propagation. Back-propagation is nothing more than the chain rule used in calculus to differentiate compositions of functions. Let f : R^n → R^m, g : R^m → R and h : R^n → R be three functions such that y = f(x) and h = g(y); the derivative of the composition is

dh/dx_j = \sum_{i=1}^{m} (dh/dy_i)(dy_i/dx_j)

which translates into the multidimensional expression:

∇_x h = (dy/dx)^T ∇_y h.

The calculations in the network are carried out from the input layer to the output one, while in training, the back-propagation works in the opposite direction from the last to the first layer.
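In practice, these derivatives are not coded by hand: the framework builds them from the computation graph. A minimal sketch of asking for a gradient, written in the TensorFlow 1.x style used later in this work (hypothetical values, not the thesis implementation), could be:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
y = tf.reduce_sum(tf.square(x))        # y = sum_i x_i^2
grad = tf.gradients(y, x)[0]           # back-propagation: dy/dx = 2x

with tf.Session() as sess:
    print(sess.run(grad, feed_dict={x: [1.0, 2.0, 3.0]}))  # [2. 4. 6.]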

2.3 Convolutional neural network

The kinds of networks described in the previous section, where the neurons of each layer are connected to all the neurons in the previous layer, are said to be fully connected (FC). In the field of image processing, the input layer takes the values of all the pixels in the input image. That requires a huge number of parameters to be optimized, i.e. an unaffordable computational cost for most tasks. Another disadvantage of fully connected networks is the non-locality of the information: the n-th element of the output depends on all


the other neurons in the preceding layers. In order to reduce the amount of parameters to train and, at the same time, to introduce a direct spatial relationship of the output layer with respect to the input image provided, convolutional neural networks (CNN) have been introduced. The idea comes from biology: Hubel and Wiesel [8] found that the monkey's striate cortex was activated only partially in response to specific visual stimuli. In particular, that observation led to the concept of receptive field, i.e. the outcome of an observation is due to the processing of information belonging to patches of the input image. Instead of a fully-connected network where each pixel is mapped to a neuron, the connections are made between a neuron and patches of the input. This results in the n-th layer being the convolution of the (n-1)-th layer with a kernel of dimension k-by-k. The choice of k defines the extent of the patches (they will be k-by-k) and is one of the hyper-parameters to tweak when building the structure of the network. Though there are ongoing studies in that regard, no optimal a priori choice can be made, thus exploring different possibilities is still required. Let X be an image, a 2D layer in the network, and H be a k-by-k kernel, i.e. two matrices; their convolution, C_XH, is defined as:

C_{XH}(i, j) = [X ∗ H](i, j) = \sum_{p=0}^{k} \sum_{q=0}^{k} X(i − p, j − q) H(p, q)

In the context of convolutional neural networks, C_XH is called an activation map. When building the network, we define how many kernels to use in each layer, so that each activation map shows different features extracted by different kernels. In the forward direction, applying N filters means that N activation maps are produced and stacked to form a volume V:

V(i, j, k) = [X ∗ H^{(i)}]_{jk}

where H^{(i)} is the i-th filter. For each (i,j,k), V(i,j,k) reflects the action of the convolution with H^{(i)} on the receptive field centered on the point of coordinates (j,k) in the input matrix. So, the "position" of the information is preserved along the width and height of the image.


Figure 2.6: How the input representation changes through the CNN. Image taken from cs231n.github.io.

If the input is three-dimensional, such as in the case of RGB images (images whose colors are expressed as a combination of red, green and blue), the convolution takes place along the xy directions and on the whole depth simultaneously, i.e. the kernel size is actually k-by-k-by-3: the produced activation map is always two-dimensional (for 2D convolutional layers). For example, the result of an input RGB image of size 50x50 convolved with 64 3x3 filters (or kernels) is a volume of dimensions (48,48,64) (in case no padding is applied). Another characteristic of CNNs is the usage of pooling layers, i.e. layers whose action is to aggregate pixel values according to some operation, such as the maximum, in order to get a down-sampled version of the input provided. To further reduce the quantity of data to process, another down-sampling technique is using a stride, i.e. a slide of the kernel on the image larger than one pixel, when performing the convolutions. In the training phase, the optimization procedure has to find the parameters of the kernels in convolutional layers and of the other trainable kinds of layers, such as the fully connected ones at the end of a classifier. Pooling operations do not add any parameters to adapt because they are carried out according to a fixed algorithm.
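A quick way to verify the output shapes described above is to build a single convolutional layer and inspect the resulting tensor. The following minimal sketch (hypothetical, written with the Keras API mentioned in chapter 3) reproduces the 50x50 RGB example:

import numpy as np
from tensorflow import keras

# a single 2D convolutional layer: 64 kernels of size 3x3, no padding
layer = keras.layers.Conv2D(filters=64, kernel_size=3, padding="valid")

x = np.zeros((1, 50, 50, 3), dtype="float32")   # a batch containing one 50x50 RGB image
y = layer(x)
print(y.shape)   # (1, 48, 48, 64): one 48x48 activation map per kernel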

A typical classifier is obtained as a combination of convolutional layers, pooling layers, normalization layers (more on that in the following chapter) and, usually, a fully connected layer carrying out the last part of the calculation to get the classification score. The state of the art in the field of computer vision is showcased in the ImageNet Large Scale Visual Recognition Challenge


Figure 2.7: An example of how max pooling works. Image taken from wikipedia.org.

(ILSVRC), where the aim is the classification of images into 1000 categories. A dataset of more than 1.2 million images is provided and the event has always been a source of ideas. Many of the models most employed today came from that competition, including VGG [9], AlexNet [10] and ResNet [11]. Taking a look at the structure of the proposed networks can be of great help when deciding how to build and train CNNs, even on different problems, classification or not.

2.4 AutoEncoder

The main scope of autoencoders, at least when they showed up for the first time, was data compression. In general, compression techniques can be split into two main families: lossless and lossy ones. Lossless algorithms are based on mathematical transforms that can be inverted, i.e. encoding followed by decoding is exactly the identity operator and no information is lost. They try to avoid redundancies in the data and to represent information in a more efficient manner. The lower the entropy of the information, the more effective the compression is.


Sometimes, a stronger compression than the one achievable without losses is required. Lossy algorithms are optimized to the task of compressing data with minimal loss; for example, Fourier-based techniques, such as JPEG, reduce size by truncating higher frequencies in images, allowing a cheaper resampling due to a reduced Nyquist frequency. Traditional compression algorithms are based on human-engineered transforms that try to minimize the impact of data simplification in real use cases. Let us suppose a company has to deploy a movie streaming service and tries to reduce the required bandwidth. Minimizing the impact of data compression means identifying what a human viewer needs in order to have a great experience, for example sharp images, and what can be degraded without worsening the perceived quality. In the case of data-driven techniques, the task of identifying features and encoding them in an efficient way is carried out by a machine, which learns from samples how to reduce the dimensionality of the input data while preserving useful pieces of information. A convolutional neural network built in such a way that the input is translated to a low dimensional representation and then back to the original size is called a convolutional autoencoder.

Figure 2.8: Change of representation through a convolutional autoencoder. Image taken from hackernoon.com.

The layer with the smallest size, the latent representation, defines the boundary between the two components of an autoencoder: the encoder and the decoder. The goal of the entire autoencoder is to map the input tensor to itself but, the intermediate representation being smaller than the one in the first layer, it is not possible to obtain the identity operator (unless the data are redundant and inefficiently stored). In order to approximate it, the encoder


has to learn to filter out non-essential details, i.e. noise (see figure 2.9).

Figure 2.9: Denoising through a convolutional autoencoder. Image taken from blog.sicara.com.

Training the network using noisy inputs and clean targets makes the encoder learn a mapping useful for denoising. Another approach is using autoencoders for inpainting, i.e. filling gaps in images, or for upsampling, like Google does with RAISR [12]. In the latter case, the autoencoder is trained on low-resolution images and it is asked to produce a high resolution version. That approach proved useful in reducing the size by a factor of four without a major degradation in quality. Other applications, for example, segment in real time elements on the street, to be used in software implementations for self-driving cars.
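To make the encoder/decoder split concrete, here is a minimal convolutional autoencoder sketch (hypothetical layer sizes, written with the Keras API used later in this work); the bottleneck in the middle plays the role of the latent representation:

from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(64, 64, 1))                       # grayscale input
# encoder: progressively reduce the spatial size
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
latent = layers.MaxPooling2D(2)(x)                         # latent representation
# decoder: mirror the encoder with upsampling
x = layers.Conv2D(8, 3, activation="relu", padding="same")(latent)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
out = layers.Conv2D(1, 3, activation="tanh", padding="same")(x)

autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")          # reconstruction loss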

2.5 Generative networks

In the field of image generation, the decoder part usually takes the name of generator. Generative network stands for a network that can produce data resembling the ones given as input. Let us say we trained a generator of cat images, i.e. we built a function that maps a vector of the latent space into an image of a cat. What if we use a random vector as input? The idea is that the structure of the latent space is such that every vector is


mapped to a meaningful output, thus new images can be generated. Two main generative models are variational autoencoders (VAEs) and generative adversarial networks (GANs).

2.5.1 Variational Autoencoder

In standard convolutional autoencoders, the input is processed by the encoder, then its output becomes the starting point for the decoding pipeline. In variational autoencoders, instead, the vector coming from the encoder, the latent representation, is split into two parts: a mean vector and a standard deviation one.

Figure 2.10: Variational autoencoder. The latent vector is obtained combining the mean and the standard deviation of the encoded data. Image taken from kvfrans.com.

These are merged together to form the sampling layer z according to the following expression:

z = z̄ + ε e^{(1/2) log σ²}

All the operations are intended to be element-wise. The parameter ε has the same dimensions as the other latent variables, z̄ and σ, and takes values sampled from a unit normal distribution.
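A minimal sketch of this sampling step (the reparameterization trick), with hypothetical tensor names and written in the TensorFlow 1.x style used in this work, could be:

import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    # draw epsilon from a unit normal, then shift and scale it:
    # z = z_mean + epsilon * exp(0.5 * log(sigma^2))
    epsilon = tf.random_normal(tf.shape(z_mean))
    return z_mean + epsilon * tf.exp(0.5 * z_log_var)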

The cost function is composed of two terms: the former imposes similarity with respect to the target (it can be measured by the mean squared error, for example) and the latter ensures that the distribution of the produced data is a unit Gaussian (the dissimilarity is measured via the Kullback-Leibler divergence). The constraints applied force the encoder to learn a robust and efficient representation of the information, which cannot be easily corrupted by the addition of some noise, i.e. ε. As a result, an operator mapping similar latent representations into similar images is obtained, thus making the structure of the latent space continuous. In figure 2.11, a smiling face is represented. All the images are generated starting from the same vector, an encoded smile, by small variations.

Figure 2.11: Artificially generated smiles.

2.5.2 Generative Adversarial Networks

Generative adversarial networks [13] are a composition of two elements, a generator and a discriminator, fighting against each other, hence the name. The general idea is to map a vector coming from the space of latent representations into realistic images (or whichever type of result we are aiming for). The generator learns to map vectors into images of the same kind present in the training set. The discriminator validates the realism of the synthetic images: in case it is tricked into classifying the artificial data as if they were original, we assume they are similar to the samples provided in training. There are no strong constraints, like in VAEs, about the distribution


of the produced outputs or the robustness to small variations, thus no continuity of the space of latent representations is to be expected. In practical applications, though, it has been shown that GANs are indeed capable of achieving greater realism in the outputs they produce. There are ongoing studies about VAE-GANs, i.e. the integration of generative adversarial networks with the typical constraints and advantages of variational autoencoders. Having only shallow constraints makes training GANs quite difficult compared to other types of networks. All the hyper-parameters need to be carefully tweaked in order to avoid plateaus or divergences in both the generative and the discriminative chain.

There are two alternating phases in the optimization procedure. At first, the generator is required to replicate the target image by transforming a lower dimensional vector, chosen at random. The scope is generating images in a way that is not dependent on the particular input vector: each vector is a coded image that can be decoded into a meaningful output.

At each iteration a score is calculated based on how well the synthetic images fooled the discriminator: the generator is trained to maximize the false positives detected by the discriminator. As the training epochs advance, the generator will get better and better scores until the discriminator becomes unable to correctly identify fakes. When that situation occurs, an adaptation of the weights of the discriminator is required to make it work well again. The discriminator is then trained on a dataset composed of the original images and the artificial ones, labeled respectively as real and fake (non-real).

To summarize: the discriminator is able to classify real vs. fake data; when the generator improves its scores, discriminating becomes harder, so the training of the generator is stopped. A pool of fake images (more realistic ones with respect to the previous epochs) is generated and used as non-real input to train the discriminator. Then the loop repeats as many times as needed to reach an optimal set of weights.
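The alternation just described can be sketched with a toy, dense-layer example (hypothetical sizes and random stand-in data, written with the Keras API; this is not the network used in this work):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 16

# toy generator and discriminator (dense layers instead of convolutions, for brevity)
generator = keras.Sequential([layers.Dense(32, activation="relu", input_shape=(latent_dim,)),
                              layers.Dense(64, activation="tanh")])
discriminator = keras.Sequential([layers.Dense(32, activation="relu", input_shape=(64,)),
                                  layers.Dense(1, activation="sigmoid")])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# the generator is trained through the combined model, with the discriminator frozen
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_data = np.random.normal(size=(256, 64)).astype("float32")   # stand-in for real samples

for step in range(100):
    z = np.random.normal(size=(32, latent_dim)).astype("float32")
    fake = generator.predict(z)
    real = real_data[np.random.choice(len(real_data), 32)]

    # discriminator step: real samples labeled 1, generated ones labeled 0
    discriminator.train_on_batch(np.concatenate([real, fake]),
                                 np.concatenate([np.ones((32, 1)), np.zeros((32, 1))]))

    # generator step: try to make the discriminator answer 1 on fakes (false positives)
    gan.train_on_batch(z, np.ones((32, 1)))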

It should be noted that after the training completes, only the generative branch is used to generate data, thus a reduction of the computational cost per sample generation is achieved. So the tool we created using a GAN is a


Figure 2.12: Visual representation of data flow in a GAN.

vector to image mapper, where the vector space of representations has been defined by the GAN in a data-driven fashion.

2.6 Image-to-image translation: the choice of CycleGAN

Image-to-image translation is a class of problems in computer vision that aims at converting images from a domain X to a domain Y.

The learnt operator G : X → Y is such that the produced output, G(x), has the same distribution as Y. This kind of work can be carried out in different manners: every autoencoder, convolutional or variational, for example, will work just fine if correct choices of hyper-parameters and architectures are made. Unfortunately, the vast majority of algorithms require being trained on paired datasets. Such a requirement is not always easy, or even possible, to satisfy. In Zhu et al. [14], a double generative adversarial network with a cyclic consistency constraint is proposed and some applications are shown. In particular, the inspiration for the approach followed in


this work came from the zebra-horse and the style transfer examples (see figures 2.13 and 2.14).

Figure 2.13: Horse-zebra example for CycleGAN. Image taken from github.com/junyanz/CycleGAN.

In the former, a dataset composed of zebras and horses has been used in training. Let X and Y be respectively the sets of zebras and horses; the mappings to learn are:

F : X → Y and G : Y → X.

It should be noted that no zebra version of a horse exists, so no paired dataset can be built, i.e. the network needs to be trained on unpaired data. A first generative chain uses zebras as if they were latent representations of horses, then the output produced is checked for its photo-realism by a horse vs. non-horse discriminator, D_Y. An analogous chain processes images the other way around, from horses to zebras, and gets a score from a zebra vs. non-zebra classifier, D_X. With no further constraint the network learns to generate new zebras or new horses, but it is not guaranteed that the images produced are really related to the input provided. Let us make an example: we would like to transform a horse in the snow into a zebra in the same exact scene and position. In case no modifications to the losses are introduced, the produced images will be just zebras in a scene unrelated to the one of the input horse image.


Figure 2.14: Style transfer implemented by CycleGAN. Image taken from github.com/junyanz/CycleGAN.

To move in the direction of image-to-image translation, cyclic consistency is imposed, so that G(F(x)) ≈ x and F(G(y)) ≈ y: the whole cycle should learn to approximate the identity; a consequence of this constraint is that a correspondence between the original and the transformed images is created and the whole scene is preserved. Another really impressive application shown is style transfer: the same architecture learned to convert photographs into paintings styled as if they were made by Monet, Van Gogh, Cezanne or in the Ukiyo-e fashion (see figure 2.14).

When we started the present work, a major challenge we faced was the difficulty in finding computed tomography and magnetic resonance images belonging to the same patient. A really small, publicly available dataset with these characteristics was collected from the PPMI project. Unfortunately, in this dataset only fourteen subjects matched our requirement of having both CT and MR scans. An additional problem was that the available images were highly smoothed and preprocessed.


Our first attempts aimed at generating pseudo-CTs from MR using variational autoencoders (VAEs). Training the network on such a small dataset, no useful results were obtained. Thus, we moved towards the exploration of methods which do not require paired data. The choice has been CycleGAN.

In the following chapters, all the adaptations whose implementation has been necessary to process three-dimensional data and to make the training converge to a working solution, in the case of MR to CT image translation, will be described.

2.7 A bit more formal description of GAN and CycleGAN

We want to transform data coming from a set X into similar data having features typical of the Y domain. To learn this mapping we have at our disposal two datasets, {x_i}_{i=1}^{N} and {y_i}_{i=1}^{M}, which contain samples of the two classes, x_i ∈ X, y_i ∈ Y. The whole architecture is made up of four components: two generators,

F : X → Y and G : Y → X,

and two discriminators,

D_X : X → R and D_Y : Y → R,

where, to be more precise, the codomain of the discriminators is the interval [0, 1] ⊂ R. The ideal discriminator would be a function with two discrete values as possible outcomes, but the implementation works best with continuous and differentiable functions. Remember that backpropagation requires evaluating derivatives.

D_X^{ideal}(x) = 1 if x ∈ X, 0 otherwise.

The two adversarial loss functions are defined as:

L_{GAN}(F, D_Y, X, Y) = E_{p_y}[log D_Y(y)] + E_{p_x}[log(1 − D_Y(F(x)))]

L_{GAN}(G, D_X, X, Y) = E_{p_x}[log D_X(x)] + E_{p_y}[log(1 − D_X(G(y)))].

Since the Ds take non-negative values less than or equal to one, the maximum and minimum of the previous expressions are respectively 0, when the classification always happens correctly, and −∞, in the opposite case. With respect to these losses, discriminators and generators have different optimization goals: the former aim at better classification, thus maximizing the cost function, while the latter aim at increasing the number of false positives, i.e. minimizing the loss.

So the best pair of generator and discriminator, G_0 and D_{X0}, are the ones that fulfill both requirements at the same time:

G_0, D_{X0} = arg min_G max_{D_X} L_{GAN}(G, D_X, X, Y).

The aforementioned loss by itself makes the generator learn how to produce images that reflect the same distribution of the original ones, but no constraint on the relationship between the specific input and output is enforced. If we want to mimic a distribution, any random permutation of the samples works equally well, i.e. the distribution is a global property of the set. In an image-to-image translation, instead, the output should be strongly and visibly correlated to the input, so that contexts, shapes and anatomies are preserved. In the case of an MR to CT translation, we would like to obtain a scan of the specific patient, not just a generically CT-like result. A further restriction of the space of solutions helps in achieving the scope of preserving patient-specific information. In particular, imposing the output of a loop to be equal to the provided input (cycle consistency) proved to work well if applied in both directions:

L_{cyc}(G, F) = E_{p_x}[ ||G(F(x)) − x||_1 ] + E_{p_y}[ ||F(G(y)) − y||_1 ]

Note that the consistency term applies directly to the generators, which should minimize it. Thus, the full expression for the cost function is a composition of cycle consistencies and adversarial losses:

L(F, G, D_X, D_Y) = L_{GAN}(F, D_Y, X, Y) + L_{GAN}(G, D_X, X, Y) + λ L_{cyc}(G, F)

where λ balances the relative importance of the terms involved. The solution to the problem, which we aim at finding via training on samples belonging to the two different classes, i.e. CT and MR scans, is written in implicit form as:

G_0, F_0 = arg min_{G,F} max_{D_X, D_Y} L(G, F, D_X, D_Y).

For our purpose we only need the generator that maps MR acquisitions to CTs. The other parts of the network, such as the discriminators, are used during training but are of no use afterwards. We are training a really big network, but after the training phase the solution to deploy and put into production is just a smaller part, which carries out the image transform with relatively low computational effort.
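Putting the pieces together, the composite loss above can be sketched as follows (a simplified TensorFlow 1.x style sketch with hypothetical tensor and model names; it is not the exact implementation described in chapter 5):

import tensorflow as tf

def cyclegan_loss(real_mr, real_ct, MR2CT, CT2MR, D_CT, D_MR, lam=10.0):
    # generators map each domain into the other one
    fake_ct = MR2CT(real_mr)
    fake_mr = CT2MR(real_ct)

    # adversarial terms (log-likelihood form of L_GAN given above)
    adv_ct = tf.reduce_mean(tf.log(D_CT(real_ct))) + tf.reduce_mean(tf.log(1.0 - D_CT(fake_ct)))
    adv_mr = tf.reduce_mean(tf.log(D_MR(real_mr))) + tf.reduce_mean(tf.log(1.0 - D_MR(fake_mr)))

    # cycle consistency: going around the loop should give back the input (L1 norm)
    cyc = tf.reduce_mean(tf.abs(CT2MR(fake_ct) - real_mr)) + \
          tf.reduce_mean(tf.abs(MR2CT(fake_mr) - real_ct))

    return adv_ct + adv_mr + lam * cyc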


Chapter 3

Materials and computing facilities

In this chapter, some information about the dataset used to train and develop the proposed method for attenuation correction will be provided. The computing facilities and tools required to process all the collected images will also be presented.

3.1 Dataset

The collection of a dataset of MR and CT scans has been carried out by integrating data from different sources. Our first attempts at implementing a method based on machine learning techniques (see chapter 4) were developed on the publicly available database of the Parkinson's Progression Markers Initiative. Successive tests, which led to the final three-dimensional implementations, have been conducted on images taken from the Autism Brain Imaging Data Exchange initiative and from the NeuroAnatomy and image Processing LABoratory (NAPLAB) of the IRCCS SDN (Naples, IT).


3.1.1 PPMI

The Parkinson's Progression Markers Initiative (PPMI)¹ aims at identifying biomarkers to track Parkinson's disease progression. It collects data from clinical sites spread across Australia, Europe, Israel and the United States by means of a wide range of analyses, including PET, MR and CT scans, and makes them publicly available.

Looking through the database, it emerged that only fourteen subjects had both MR and CT images available to download.

3.1.2 ABIDE

The Autism Brain Imaging Data Exchange (ABIDE)² initiative aims at publishing images acquired across several imaging laboratories, in order to permit studies on the brain focused on finding biomarkers for autism spectrum disorders. More than 2000 structural scans are publicly available.

We chose to take just 60 magnetic resonance images, in order to have the same number of samples for each modality, MR and CT.

3.1.3 NAPLAB

An agreement with the NeuroAnatomy and image Processing LABoratory (NAPLAB)³ of the IRCCS SDN (Naples, IT) allowed us to collect a number of images belonging to its proprietary database, including ones coming from clinical CT scanners. These, together with the MR scans provided by the ABIDE project, allowed us to build a dataset consisting of 60 images per imaging modality.

3.1.4 Data preprocessing

The task of image-to-image translation is quite hard, thus an alignment to a standard orientation is required to make the training converge. Registration

¹ http://www.ppmi-info.org/
² http://fcon_1000.projects.nitrc.org/indi/abide/
³ http://www.sdn-napoli.it/irccs-sdn/


is the procedure that finds the best transform to make a moving image overlap with a fixed one.

The transform we chose is a composition of a rotation, a rigid translation and the well-performing SyN [15] diffeomorphic transform. All the calculations necessary to find the best parameters, which maximize the mutual information between the moving and the fixed image, have been carried out with ANTs⁴.

All the MR scans have been registered to the template realized by the Montreal Neurological Institute[16], while the CT images have been aligned to the template provided by Rorden et al.[17]

Before being fed as training inputs, the gray levels in the images have been standardized to take values in the range (-1, 1) by means of a linear rescaling. This has been done on MR and CT data independently. Due to computational limits, all the images have been resized to the same shape, (104, 128, 104), with a voxel size of 1 mm x 1 mm x 1 mm.
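A minimal sketch of this intensity standardization (hypothetical function and variable names; the resampling to the common shape was handled by the registration pipeline) could be:

import numpy as np

def rescale_to_unit_range(volume):
    # linearly map the gray levels of a single volume into (-1, 1)
    vmin, vmax = volume.min(), volume.max()
    return 2.0 * (volume - vmin) / (vmax - vmin) - 1.0

# example with a random stand-in volume of the target shape
vol = np.random.rand(104, 128, 104).astype("float32")
vol = rescale_to_unit_range(vol)
print(vol.min(), vol.max())   # -1.0 1.0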

3.2 Computing facilities and tools

The whole process of code development has been conducted in the Python language. First attempts at generating CT images from MR data using autoencoders have been written using Keras⁵, a high-level application programming interface (API) to a backend for tensor calculus, which in our case is TensorFlow⁶.

Further experimentation made us realize we needed finer control over the implementation, thus we started to code directly against the backend API. The CycleGAN approach presented in this work is written entirely using TensorFlow.

⁴ the Advanced Normalization Tools: http://stnava.github.io/ANTs/
⁵ https://keras.io/


3.2.1 TensorFlow

TensorFlow implements a set of functions which allow the creation of computation graphs. Two types of node are available:

• tensor nodes;
• operation nodes.

Operation nodes take zero or more tensor nodes as input and produce tensors as output. When defining a graph, no operations, not even variable initializations, are actually executed. All the calculations are lazy-evaluated, i.e. performed only when necessary. Let us make a simple example: the following piece of code produces the graph in figure 3.1.

a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)
total = a + b
print(total)

If we try to print the value of total (Add in the figure), we do not obtain 7 as we might expect. In order to do the maths, a call to the run method of a session is required. When called, TensorFlow will fulfill all the requirements, assigning 3 and 4 to a and b, and then it will do the computation. The graph can be executed on CPUs, GPUs or TPUs and even on multiple computing nodes.
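For completeness, a sketch of how the value is actually obtained with the standard TensorFlow 1.x session API (not necessarily the exact code used in this work):

import tensorflow as tf

a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)
total = a + b

with tf.Session() as sess:
    print(sess.run(total))   # 7.0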


HOSTNAME   N. CPUs   RAM      GPU
cudawn1    32        111.9G   NVIDIA Tesla V100
cudawn9    20        63.8G    8 x NVIDIA K80

Table 3.1: Summary of the GPU-equipped machines used to develop the network.

3.3 Computing resources

Our implementation has been developed thanks to the computing resources available at the Istituto Nazionale di Fisica Nucleare (INFN) in Pisa. In particular, the training has been conducted on two GPU-equipped machines: cudawn1 and cudawn9. In table 3.1, a summary of the hardware installed on them is presented. The amount of available RAM allowed us to load the entire dataset in memory, thus minimizing the latency in training due to retrieving data from disk. All the data given as input to the network have the form of 5D tensors. The index of the first axis identifies the particular image in the batch, the following three axes represent the width, height and depth of the volumetric scan, and the last dimension is for the color channel.
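As an illustration, with the preprocessed volumes of shape (104, 128, 104) and a single gray-level channel, a batch of four scans would be stored as follows (hypothetical batch size):

import numpy as np

batch = np.zeros((4, 104, 128, 104, 1), dtype="float32")
# axes: (batch, width, height, depth, channels)
print(batch.shape)   # (4, 104, 128, 104, 1)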


Chapter 4

Approaches using autoencoders

The idea behind the method presented in this work and the first tests originally came from an article by Liu et al. [18]. In that paper, an approach to attenuation correction using a convolutional autoencoder (CAE) is proposed. We tried to replicate the pipeline they developed to transform MR input images into pseudo-CT ones but, unfortunately, no satisfactory result was obtained.

With the advent of CycleGANs, we decided to attempt to use them to solve our problem, as this approach allowed us to use a wider dataset composed of unpaired samples.

The lack of computing power, in prior attempts, did not permit working on three-dimensional data, thus we needed to build a network trained on 2D slices; inconsistencies in the gray levels between different slices required a post-processing pipeline. The installation of a more powerful GPU at the INFN facilities allowed us to overcome some computational limits and permitted the development of an original 3D model of CycleGAN.

4.1 Our replication attempt using autoencoders

In our first implementation, we tried to replicate the approach based on convolutional autoencoders. MR and CT images are treated as stacks of two-dimensional axial slices; the autoencoder learns to map an MR slice to a two-dimensional pseudo-CT image, which features three gray levels. This


result is due to the fact that the target CTs provided in the training set are preprocessed to be the sum of masks for air, bone and soft tissue. That is achieved by means of a pixel-intensity based thresholding:

1. bones: > 600 HU;
2. air: < −500 HU;
3. the rest is soft tissue.
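A minimal sketch of this three-class thresholding (hypothetical names; the HU cuts are the ones listed above) could be:

import numpy as np

def threshold_ct(ct_hu, bone_hu=600, air_hu=-500):
    # map a CT volume in Hounsfield units to three discrete classes
    mask = np.zeros_like(ct_hu, dtype=np.int8)   # 0 = soft tissue
    mask[ct_hu > bone_hu] = 1                    # 1 = bone
    mask[ct_hu < air_hu] = 2                     # 2 = air
    return mask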

Figure 4.1: Example of the training input. Images have been co-registered, then a thresholding has been applied on the CT scan.

Before being fed to the training procedure, the images are co-registered. In the field of image processing, registering means that all the images are oriented in the same way and aligned to a template.

The operator to be learned has to classify the pixels of the MR data into air, bone and soft tissue, then map them to a new image whose intensity values reflect the three classes; appropriate values are assigned in order to get meaningful information expressed in Hounsfield units.

The encoder part of the network is essentially a classifier; the well-known VGG16 network used in object recognition tasks has been employed. From such a network architecture, the final fully-connected part is removed and only the convolutional, pooling and normalization layers are preserved and used as the encoder. The decoder structure is built by mirroring that of the encoder.


Convolutions are replaced by deconvolutions and pooling layers by upsampling procedures. The size and shape of the last layer, the output, are the same as those of the input.

Unfortunately, collecting a set of CT and MR scans is not straightforward: MR scans are relatively abundant in public databases compared to CT ones, and only a small number of paired scans, i.e. both types of images acquired on the same patient, is publicly available.

The first implementations of the network were trained on two-dimensional axial slices extracted from scans of the PPMI dataset (see chapter 3). The low number of training examples, together with the fact that most of the CT data were already preprocessed and highly smoothed, made it harder to work out a solution. This first implementation, based on a convolutional autoencoder, did not provide any satisfactory result.

To reduce the space of solutions, further constraints needed to be introduced to regularize the problem and make the training converge. A second implementation of ours, based on variational autoencoders, also led to noisy outputs. Varying the hyperparameters and applying data augmentation did not compensate for the insufficient number of training samples.

4.2 Our first 2D CycleGAN implementation

The publication of CycleGAN opened up the possibility of training on unpaired data. The availability of public CT scans is quite limited, but being able to use data coming from different patients enabled us to collect a wider dataset.

Due to the high computational cost of the training procedure, working on two-dimensional slices was the only viable option. Following the CycleGAN scheme, we obtained some qualitatively good results, in the sense that the produced slices looked as if they had been obtained from CT scans. Stacking the output images to form a volume, however, showed that some post-processing work was required in order to obtain smooth images with no artificial discontinuities along the axial direction: the lack of information on the third dimension requires a continuity condition to be applied on the slices being merged into a volume.

When this work started, we had 8 NVIDIA Tesla K80 GPUs at our disposal. Unfortunately, their limited memory did not allow exploratory work on three-dimensional volumes as a whole. Starting from December 2018, a new NVIDIA Tesla V100 was made available at the INFN facilities. It made possible the approach we propose, which led to a rewriting of the idea behind CycleGAN to extend its capabilities and adapt it to volumetric medical images. The network is now able to map the image provided by the MR scanner directly to a 3D pseudoCT without any further intervention, and the presence of spatial information along the axial direction removed the inter-slice inconsistencies in gray levels that affected the 2D approach. In the next chapters, a description of how the dataset is composed and how the proposed pipeline is implemented and works will be presented.

4.3 Comparison of paired vs. unpaired training performances

No paired set of images is available to us, thus we cannot directly compare the results achievable by networks trained on paired versus unpaired data. A comparison of the performance obtained when training with paired and unpaired samples has been made[19] on a pool of 24 patients, all of whom were scanned with both modalities, CT and MR, on the same day. The collected samples were used to train both a CycleGAN and the generator of synthetic CTs alone: the former was trained on unpaired data, the latter on paired data. Computing the mean absolute error between the real CTs and the generated ones, the authors found slightly better performance when the network was trained on unpaired samples. It has also been observed that fewer artifacts are visible in the images generated by the GAN model.
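The mean absolute error referred to here is presumably the standard voxel-wise definition,

\[
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{CT}_{\mathrm{real}}(i) - \mathrm{CT}_{\mathrm{synth}}(i)\right|,
\]

where N is the number of voxels and the index i runs over them.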


Chapter 5

Development of a 3D CycleGAN

In the following sections, the implementation of our model for three-dimensional data is presented. The network is composed of two generators, i.e. MR to CT and vice versa, and two discriminators, i.e. fake MR vs. real MR and fake CT vs. real CT. The building blocks are convolutional and deconvolutional layers together with residual blocks. The models have been built to work on five-dimensional arrays, as TensorFlow requires (see chapter 3), but we will avoid referring to the first (batch) dimension unless specified.

5.1 Residual blocks

When using techniques based on gradient descent, such as SGD, the computation of several derivatives is involved in order to update the weights. Differentiation is carried out by means of backpropagation, i.e. the chain rule of calculus, so each calculated gradient is a product of several terms. When one or more of the factors to be multiplied approaches zero, the resulting gradient will be small or zero. The step taken when updating the weights is proportional to the gradient, so if the gradient vanishes the entire optimization procedure loses its effectiveness. In deep-learning problems, the larger the number of layers, the more frequently issues related to vanishing gradients occur. In 2015, ResNet won the classification challenge held by the ImageNet project and introduced residual blocks. These are composed of several layers operating on the input x to produce R(x); then H(x) = x + R(x) is computed with no added parameters to learn: it is just a non-weighted sum.

The quantity to be learned thus takes the form of a difference, R(x) = H(x) − x, i.e. the network only has to find the best combination of weights for the residual of the operation. This approach has proved helpful in mitigating the problem of vanishing gradients. Residual blocks working on three-dimensional data have been implemented and used as components of the generative chain in our GAN. In particular, the structure of the residual block we developed is the following:

• a padding layer;
• a convolutional layer;
• a padding layer;
• a convolutional layer;

• a layer to sum the output of the previous one with the input.

The entire block does not alter the size of the input it processes. The aim of the padding layers is to make the output of the following convolutions have the same size as the input of the block. The padded values reflect the ones at the inner boundary; reflection has been chosen to preserve the context of the pixels involved (for example, similar background values at the corners).

The sample to process is 5D, i.e. (batch size, width, height, depth, 1), but padding is carried out only on the spatial dimensions of the image; the effect is the addition of a one-pixel border.

Both convolutions are computed with a stride of 1 pixel and 3-by-3-by-3 kernels, whose number is decided when building the generator. The outputs of the first convolutional layer and of the summation layer are passed through a ReLU activation function. ReLU stands for rectified linear unit and is defined as ReLU(x) = max(0, x), or equivalently as

\[
\mathrm{ReLU}(x) =
\begin{cases}
x, & \text{if } x \geq 0 \\
0, & \text{if } x < 0
\end{cases}
\]

The great advantages of employing a ReLU function are its low computational cost, the sparsity of the activation maps it produces, and the fact that it has been shown to be more resistant to vanishing gradients than other choices such as sigmoids.
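A minimal sketch of such a residual block with TensorFlow/Keras, assuming the 5D tensor convention of chapter 3 (the helper names are our own), could be the following:

```python
import tensorflow as tf

def reflect_pad_3d(x, p):
    """Reflection padding of p pixels on the three spatial axes only."""
    pad = [[0, 0], [p, p], [p, p], [p, p], [0, 0]]
    return tf.keras.layers.Lambda(lambda t: tf.pad(t, pad, mode="REFLECT"))(x)

def residual_block_3d(x, filters=128):
    """Pad -> 3x3x3 conv (ReLU) -> pad -> 3x3x3 conv -> sum with input -> ReLU."""
    r = reflect_pad_3d(x, 1)                                   # one-pixel border
    r = tf.keras.layers.Conv3D(filters, 3, strides=1,
                               padding="valid", activation="relu")(r)
    r = reflect_pad_3d(r, 1)
    r = tf.keras.layers.Conv3D(filters, 3, strides=1, padding="valid")(r)
    out = tf.keras.layers.Add()([x, r])                        # H(x) = x + R(x)
    return tf.keras.layers.Activation("relu")(out)             # ReLU after the sum
```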

5.2 Generators

Both generators implemented in the network share the same architecture. GANs provide more realistic results than other machine-learning based image generation techniques, at the cost of a more difficult training procedure. In a generator, the input tensor undergoes a series of transformations that reduce its dimensionality (the encoding part); then the smallest representation is transformed and upsampled back to the original shape, i.e. decoded. Thus, an upsampling and a downsampling strategy have to be adopted.

Downsampling is commonly achieved by means of pooling operations: a number of pixels are aggregated into one value by operations such as taking the maximum or the average. The advantage of these approaches is that no parameters need to be learned and they are very fast to compute. Unfortunately, in generative adversarial networks pooling operations are best avoided, so the downsampling is carried out by means of convolutions with a stride wider than a single pixel. This enables the network to adapt its weights and make the best data-driven choices on how to downsample the information.

On the other hand, there are at least three possible choices for the upsampling strategy:

• introducing a layer which increases the number of pixels according to a known function, i.e. an interpolator;


• interpolating and then applying a convolution, so that the network has the possibility to learn how to compensate for the artifacts introduced by the interpolation;

• using a deconvolutional layer.

The latter approach has been followed; thus, how to best resample the data, both up and down, is entirely learned from the data themselves.

The first layers of the generator are a padding layer, which adds a border of 3 pixels whose intensity values are chosen as reflections of the nearest ones, and a convolution with 32 7-by-7-by-7 kernels and a stride of 1 pixel. The shape of the images provided as input is (104, 128, 104), with a voxel size of 1 mm x 1 mm x 1 mm. After these first operations, each image is translated into a hyper-volume of shape (104, 128, 104, 32), where 32 is the number of filters.

Note that at this stage the size of the data actually increases. The next layers are two convolutional ones with a stride of 2 pixels and, respectively, 64 and 128 filters. As the number of activation maps increases, their width, height and depth are reduced; in particular, each map shrinks by a factor of 8 in volume, since the stride is 2 along each dimension, while the number of kernels is doubled layer after layer.

After that, 6 residual blocks follow, each of them using 128 kernels. This residual part does not alter the shape of the data but rearranges it to better extract features; in fact, the capability and flexibility of the network benefit from having more weights to adapt to the specific task.

After the residual part, two deconvolutional layers are used to upsample by a factor of 2 in each dimension; a padding layer adds a border of 3 pixels all around the image, and then a convolutional layer with a single 7-by-7-by-7 kernel and a stride of one pixel brings the data back to the original shape. All but the last convolutional layer use ReLUs as activation functions; for the last layer a hyperbolic tangent has been employed in order to produce outputs with values in the range (-1, 1). The general criterion followed in the design of the generators is to double the number of kernels while halving each spatial dimension.


Layer            Kernel size   Number of kernels   Stride   Output shape
Input            /             /                   /        (104, 128, 104, 1)
Pad – 6 px       /             /                   /        (110, 134, 110, 1)
Conv3d           7             32                  1        (104, 128, 104, 32)
Conv3d           3             64                  2        (52, 64, 52, 64)
Conv3d           3             128                 2        (26, 32, 26, 128)
6 x Res. block   3             128                 1        (26, 32, 26, 128)
Deconv3d         3             64                  2        (52, 64, 52, 64)
Deconv3d         3             32                  2        (104, 128, 104, 32)
Pad – 6 px       /             /                   /        (110, 134, 110, 32)
Conv3d           7             1                   1        (104, 128, 104, 1)

Table 5.1: Summary of the network structure for the generators. All layers but the last one use ReLU as activation function. The last layer is activated by a hyperbolic tangent.

This approach forces the network to learn filter parameters that extract features from the spatial dimensions and represent them in specialized activation maps; hence the choice of increasing the number of kernels while reducing the other dimensions. The structure of the network and how the shape of the data is modified through it are summarized in table 5.1.
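As a sketch of how table 5.1 translates into code, reusing the reflect_pad_3d and residual_block_3d helpers from the residual-block sketch in section 5.1 (the "same" padding of the inner strided convolutions is inferred from the output shapes in the table, not stated in it):

```python
import tensorflow as tf

def build_generator(input_shape=(104, 128, 104, 1)):
    inp = tf.keras.layers.Input(shape=input_shape)
    # 3-pixel reflection border, then 32 7x7x7 kernels with stride 1
    x = reflect_pad_3d(inp, 3)
    x = tf.keras.layers.Conv3D(32, 7, strides=1, padding="valid",
                               activation="relu")(x)
    # Encoding: strided convolutions double the kernels and halve each axis
    x = tf.keras.layers.Conv3D(64, 3, strides=2, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.Conv3D(128, 3, strides=2, padding="same",
                               activation="relu")(x)
    # Six residual blocks, 128 kernels each (shape unchanged)
    for _ in range(6):
        x = residual_block_3d(x, filters=128)
    # Decoding: transposed convolutions upsample by 2 along each axis
    x = tf.keras.layers.Conv3DTranspose(64, 3, strides=2, padding="same",
                                        activation="relu")(x)
    x = tf.keras.layers.Conv3DTranspose(32, 3, strides=2, padding="same",
                                        activation="relu")(x)
    # Final 3-pixel reflection border and a single 7x7x7 kernel, tanh output
    x = reflect_pad_3d(x, 3)
    out = tf.keras.layers.Conv3D(1, 7, strides=1, padding="valid",
                                 activation="tanh")(x)
    return tf.keras.Model(inp, out)
```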

5.3 Discriminator

The discriminator networks share a simple architecture made up of convolutional layers only. No pooling layers are present: reductions in size are achieved by means of three-dimensional convolutions with strides of two pixels, so at each stage the input is downsampled by a factor of 8 (each dimension is halved). The activation functions used in all but the last layer are leaky ReLUs with a leak factor of 0.2. Leaky ReLUs are defined as LReLU(x) = max(x, λx), or, equivalently, as:

\[
\mathrm{LReLU}(x) =
\begin{cases}
x, & \text{if } x \geq 0 \\
\lambda x, & \text{if } x < 0
\end{cases}
\]


For non-negative values LReLU(x) = ReLU(x), but for values less than zero the output of a standard ReLU is zero, while the leaky version returns the input multiplied by a factor smaller than one: LReLUs attenuate negative values instead of zeroing them.

In section 5.1, the problem of vanishing gradients has been discussed. If the network produces negative values, a ReLU activation function returns a constant zero; the gradient then evaluates to zero, nullifying the effectiveness of gradient descent and producing dead neurons.

A way to reduce the formation of dead neurons is to use leaky ReLUs, which have no flat region, i.e. their derivative is non-zero everywhere.

Applying this kind of activation to the generators showed no improvement, so we chose to use the leaky version only in the discriminators, thus avoiding the extra computational cost that would otherwise have been required.

Another trick implemented to reduce the number of calculations involved in the discrimination is the truncation of the last layer. Typically, in a two-class classifier, the output of the last convolutional layer is processed by a fully-connected network ending in a single neuron, which indicates the label by taking values near 0 or 1. Activation functions such as sigmoids mimic the shape of a step function, the ideal discriminator, while retaining the advantage of being continuous and differentiable.

Our implementation deliberately avoids the extra cost of a fully-connected part in the discriminating chain: the last convolutional layer achieves the classification goal by learning to produce flat output maps, and the mean value of that output map is interpreted as the label. As an example, when a fake image is identified the output of the discriminator should ideally be a small image with a mean value of zero, while for real images it should be 1.
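A minimal sketch of such a discriminator is given below; the number of stages and of filters (64-128-256) and the 4x4x4 kernel size are illustrative assumptions, while the stride-2 downsampling, the LeakyReLU slope of 0.2 and the flat output map averaged into a label follow the description above.

```python
import tensorflow as tf

def build_discriminator(input_shape=(104, 128, 104, 1)):
    inp = tf.keras.layers.Input(shape=input_shape)
    x = inp
    # Each stage halves every spatial axis (a factor of 8 in volume per stage)
    for filters in (64, 128, 256):
        x = tf.keras.layers.Conv3D(filters, 4, strides=2, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(alpha=0.2)(x)
    # Last convolution: a single kernel, no leaky ReLU; it outputs a small map
    patch_map = tf.keras.layers.Conv3D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model(inp, patch_map)

# The mean of the output map plays the role of the real/fake label, e.g.
# score = tf.reduce_mean(build_discriminator()(ct_batch), axis=[1, 2, 3, 4])
```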
