
UNIVERSITY OF PISA

MASTER THESIS

Deep Learning for Track Reconstruction in

Particle Physics

Author:

Antonio Carta

Supervisors:

Dr. Davide Bacciu

Dr. Felice Pantaleo


University of Pisa

Abstract

Department of Computer Science

Deep Learning for Track Reconstruction in Particle Physics

by Antonio Carta

Track reconstruction is one of the most expensive tasks performed during event analysis due to its combinatorial nature and the high number of fake tracks generated. In this thesis we develop a deep learning model capable of recognizing fake tracks using only the two hit shapes of a track seed.


Contents

Abstract

1 Introduction

2 Experiment Description and Problem Formulation
   2.1 CMS Detector
       2.1.1 Coordinate system
       2.1.2 Detector design
       2.1.3 Silicon Tracker
   2.2 Track Reconstruction Algorithms
       2.2.1 Pixel Hit Reconstruction
       2.2.2 Track Reconstruction
       2.2.3 Performance of the algorithm
   2.3 Future Track Seeding

3 Machine Learning Background
   3.1 Machine Learning
       3.1.1 Learning System
       3.1.2 Linear Models
   3.2 Neural Networks
       3.2.1 Recurrent Neural Network
       3.2.2 Convolutional Neural Network
             Convolution
             Pooling
       3.2.3 Regularization
             Dropout
             L2 regularization
       3.2.4 Binarized Neural Network
   3.3 ATLAS Higgs Boson challenge
   3.4 Neutrino event classification with deep CNN

4 Methodology
   4.1 Data description
   4.2 Data preprocessing
   4.3 Doublet Model
   4.4 Hardware implementation
       4.4.1 General Purpose GPU
             GPU architecture
       4.4.2 Intel Xeon Phi
       4.4.3 Knights Landing
       4.4.4 Machine Learning on Xeon Phi
   4.5 FPGA

5 Experimental Analysis and Results
   5.1 Doublet Filtering Model
   5.2 Results
   5.3 Analysis
   5.4 Benchmarks

6 Conclusion


Chapter 1

Introduction

High-Energy Physics experiments, like those studied at CERN, require the processing of huge amounts of data, on the order of PB/s.

The Large Hadron Collider (LHC), located in a 27 km tunnel 100 meters underground in the Geneva region, is the biggest particle accelerator in the world and generates 600 million proton-proton collisions per second. It was built to study the fundamental forces of the universe, and in 2012 the Higgs boson was discovered (Chatrchyan et al., 2012) by analyzing the LHC data. It is unfeasible to store every generated event on permanent storage, and only a small number of events contain interesting physics and must be retained for offline analysis. To find the interesting events, the raw detector data must be analyzed in real time. This process keeps events at a rate of about 1 kHz in the CMS experiment (Chatrchyan et al., 2008), one of the two general purpose detectors installed at the LHC. A first coarse selection is performed by the Level-1 trigger, which takes the raw detector data and uses FPGAs and custom hardware to reduce the event rate from 40 MHz to 100 kHz, using simple signals to filter uninteresting events. The events selected by the Level-1 trigger are sent to the High Level Trigger (HLT), a farm of multiprocessors containing about 22 000 x86 cores, which further reduces the rate to 1 kHz, a quantity that can reasonably be written to permanent storage for further analysis.

One of the most expensive reconstruction steps performed in the HLT is track reconstruction (collaboration, 2014), which by its nature is a combinatorial problem. Particles leave small hits when they cross the detector layers, and track reconstruction connects them to recover the original trajectories. During this process many fake tracks are generated, increasing the computational cost. In 2025 the High-Luminosity LHC will increase the pileup (the number of simultaneous collisions) by a factor of 10. A higher level of pileup increases the number of hits in each layer, causing a much higher number of fake tracks. Current track reconstruction algorithms will become unfeasible (Cerati et al., 2015) and new approaches must be developed, otherwise some events will be lost.


Track reconstruction algorithms were developed 25 years ago for single-core CPUs. Today, many-core architectures and GPUs make it possible to write massively parallel programs, but they require a careful design of the algorithm and of its implementation to achieve the maximum degree of parallelism. Research on this front shows that it is possible to improve track reconstruction by porting some parts of the algorithms to GPU architectures (Pantaleo, 2017).

Deep learning techniques have reached the state of the art in many computer vision applications, for example in Simonyan and Zisserman, 2014 and Redmon et al., 2016. Similarly, the same techniques are being used successfully in high energy physics for offline event classification, for example in (Aurisano et al., 2016) and ("Heavy flavor identification at CMS with deep neural networks" 2017).

In this thesis, a model to filter fake partial tracks (track seeds) using only two hits is implemented. Such a model can reduce the number of fake tracks, the main reason behind the high computational cost of track reconstruction. The model is based on convolutional neural networks and can be efficiently executed on manycore and GPU architectures. The model performance has been extensively studied and different networks are proposed. Finally, a preliminary study of the computational cost at inference time is performed and different implementation techniques are proposed to meet the real-time requirements of the online preprocessing.

Chapter 2 gives an overview of the CERN LHC experiment and the CMS detector, with a focus on the track reconstruction problem and the algorithms used to solve it. Chapter 3 is an overview of the machine learning and deep learning area and its applications in high energy physics. Chapter 4 describes the experimental setup used during the experiments, including the generation of the dataset and the architectures that were used for inference. Chapter 5 details the experimental results and the differences between the models regarding performance and architectural decisions. Finally, Chapter 6 concludes with a review of the thesis and some ideas for further development of the work.


Chapter 2

Experiment Description and Problem Formulation

The objective of this thesis is to develop a model to filter hit doublets in the trajectory reconstruction pipeline used in the data acquisition process of the CMS detector. This chapter describes how high energy physics experiments work, with a brief description of the LHC accelerator and the CMS detector, and how the track reconstruction algorithms are currently implemented. Furthermore, the challenges posed by track reconstruction, which led to the approach proposed in the subsequent chapters, are highlighted.

2.1 CMS Detector

The CMS (Compact Muon Solenoid) is one of the main experiments performed at CERN, the European Center for Nuclear Research. It is one of the biggest experiments and uses a general purpose detector to collect data from particle collisions. It was developed to study the fundamental forces in the universe, to detect the Higgs boson, which was found in 2012 (Chatrchyan et al., 2012), to study its properties, and to search for evidence of physics beyond the Standard Model. A particle accelerator, the LHC (Large Hadron Collider), is used to accelerate particles up to 6.5 TeV per beam. The LHC is the most powerful particle accelerator in the world; it is located in a 27 km tunnel in the Geneva region, 100 meters below ground. Figure 2.1 shows a schematic view of the LHC accelerator and the position of each detector in the tunnel. Two proton beams are accelerated through multiple stages: protons are generated from a hydrogen bottle, stripping the electrons with an electric field; the Linac 2, a linear accelerator, is used as the first acceleration stage, taking the particles up to 50 MeV; then the Proton Synchrotron Booster, the Proton Synchrotron, and the Super Proton Synchrotron take the beam up to 1.4 GeV, 25 GeV, and 450 GeV respectively; the beams are then injected into two beam pipes inside the LHC, where they circulate in opposing directions, one going clockwise and the other anticlockwise. The two beams are bent by a strong magnetic field, accelerated up to 6.5 TeV, and then made to collide inside the detector with a center-of-mass energy of 13 TeV. The particles created in the proton-proton collisions are detected with the CMS and the data of the event can be saved for further analysis. Each beam is divided into bunches of protons separated by empty space, such that an event happens every 25 ns. Each bunch crossing can generate multiple collisions, a phenomenon called pileup.

FIGURE 2.1: Schematic view of the CERN accelerator complex and the detectors. Source: cern.ch

Currently, there is an ongoing upgrade program ("The European Strategy for Particle Physics Update 2013. La stratégie européenne pour la physique des particules Mise à jour 2013. 16th Session of European Strategy Council" 2013) at CERN designed to improve the luminosity. Four shutdown periods, two of which are already completed, will be used to upgrade the LHC accelerator and the CMS detector. After the final upgrade, which will be completed in 2025, the luminosity will be increased by a factor of 10. Higher levels of luminosity pose greater challenges: the detector must be more precise and faster, and the quantity of generated data increases, requiring faster algorithms. Pileup measures the number of simultaneous proton-proton collisions in each single bunch crossing. A higher luminosity causes a higher pileup, which affects the performance of the current algorithms. Current trajectory reconstruction algorithms suffer an explosion of possible trajectories caused by the higher number of events, which affects the execution time. Higher pileup also causes more errors in the reconstruction process, reducing the quality of the data. New algorithms must be developed to face these challenges.


FIGURE 2.2: CMS coordinate system.

2.1.1 Coordinate system

The x axis points toward the center of the LHC, the y axis points upward, and the z axis points west, in the direction of the beam. In polar coordinates the angle φ in the xy plane and the radial coordinate r will be used. θ is the angle in the rz plane and the pseudorapidity is defined as η = − ln tan(θ/2). The momentum component transverse to the beam direction is denoted pT. This coordinate system is shown in Figure 2.2 and will be used afterward to refer to points inside the detector and to define the track parameters.
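To make these conventions concrete, the following short Python sketch (a minimal illustration, not code from CMSSW) converts global Cartesian coordinates and momenta into the (r, φ, θ, η, pT) variables defined above:

    import numpy as np

    def to_detector_coords(x, y, z):
        # Convert global Cartesian coordinates into the (r, phi, theta, eta)
        # variables used to describe positions inside the CMS detector.
        r = np.hypot(x, y)                  # radial coordinate in the xy plane
        phi = np.arctan2(y, x)              # azimuthal angle in the xy plane
        theta = np.arctan2(r, z)            # polar angle in the rz plane
        eta = -np.log(np.tan(theta / 2.0))  # pseudorapidity
        return r, phi, theta, eta

    def transverse_momentum(px, py):
        # Momentum component transverse to the beam (z) direction.
        return np.hypot(px, py)

    # Example: a point at a radius of a few centimeters from the beam line.
    r, phi, theta, eta = to_detector_coords(x=3.0, y=3.2, z=10.0)
    pt = transverse_momentum(px=0.8, py=1.1)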

2.1.2 Detector design

The Compact Muon Solenoid (Chatrchyan et al., 2008) (CMS) is a particle detector installed at the Large Hadron Collider (LHC). The detector weighs about 14 000 tonnes, is 15 meters high and 21 meters long. It is composed of several cylindrical layers and lateral endcaps to ensure maximal coverage of the area surrounding the collision. Two proton beams at 6.5 TeV collide 40 million times per second inside the accelerator, once every 25 ns. At design luminosity, approximately 23 inelastic events occur on average whenever two proton bunches collide. The simultaneous collision of multiple particles is called pileup, and it will increase when the HL-LHC (High-Luminosity LHC) becomes active in 2025. Pileup increases the computational cost of the track reconstruction pipeline due to the increased number of tracks to reconstruct and of hits to match with tracks.


At the current luminosity about 1000 charged particles emerge from the interaction region every 25 ns. Good granularity and time resolution are needed in the measurements to avoid confusing particle interactions from different events.

It is unfeasible to store every event recorded by the detector, so events must be processed and filtered in real time to retain only the interesting ones for offline storage and further analysis. After this first event selection phase only about 1000 events per second are retained for offline storage. This imposes hard constraints on the latency and bandwidth of the algorithms used in the data acquisition process and on the design of the readout chips in the detector.

FIGURE 2.3: Modular view of the CMS detector. Source: Chatrchyan et al., 2008

Figure 2.3 shows a schematic view of the detector's components. Starting from the center of the detector, the first component is the inner tracker, divided into 3 inner layers of pixel detector (a fourth layer has been added after the upgrade (Lu, 2012)) and 10 silicon strip detector layers. These layers are used to track the position of the particles and to reconstruct the trajectory followed by each particle generated in the collision. The silicon tracker is the most interesting part of the detector for our purposes since it is the component that registers the particle trajectories.


Since it is the innermost part of the detector, it is designed to endure high levels of radiation and is completely made of silicon. After the tracking system there are the electromagnetic calorimeter (ECAL) and the hadronic calorimeter (HCAL), used to record the energy of the particles. The next layer is the superconducting solenoid magnet, which generates a 4 T magnetic field to deflect the charged particles. The magnetic field bends charged particles in different directions, allowing a positively charged particle to be distinguished from a negative one. The last 4 layers are the muon detectors.

2.1.3 Silicon Tracker

FIGURE 2.4: Schematic view of the silicon tracker modules. Source: Chatrchyan et al., 2008

The silicon tracker is the most important component for trajectory reconstruction. A schematic figure showing its modules is shown in Figure 2.4. The innermost part is the pixel detector (indicated with the label PIXEL in the figure), composed of three barrel layers at radii between 4.4 cm and 10.2 cm. The system also contains two endcaps on each side of the barrel to extend the acceptance of the tracker to values of pseudorapidity up to |η| < 2.5. The pixel detector contains a total of 1440 pixel modules and a total of 66 million pixels. A recent upgrade (Lu, 2012) of the pixel detector added another barrel layer and two endcaps on each side. At design luminosity, the hit rate density for the first layer is 1 MHz/mm² at a radius of 4 cm. To obtain a low occupancy of about 1%, each pixel has a size of 100 × 150 µm², giving a high-resolution 3D image for the particle hits.


Each sensor is 285 µm thick, with high-dose n-implants placed into a resistive n-substrate and a p-implant on the back. A charged particle that passes through a sensor releases electron-hole pairs, which are collected at the n-implant. The charge carriers are subjected to the Lorentz force. This means that adjacent pixels can share the charge released by the same particle, forming a cluster. The pixel outputs pass through a zero-suppression filter and are zeroed out if they fall below a predetermined threshold of 3200 electrons.

The strip detector is composed of 10 barrel layers extending up to 1.1 m from the center of the collision. The detector is completed by 3+9 endcaps on each side, amounting to a total of 15 148 strip modules and 24 244 sensors. Since the strip detector layers have a smaller occupancy, a single strip is a rectangular area of 10 cm × 80 µm in the first layers, up to 25 cm × 180 µm in the outer region of the tracker. The strip detector can be divided into three submodules: the tracker inner barrel and disks (TID+, TIB, TID-), composed of 4 barrel layers and 3 endcaps on each side; the tracker outer barrel (TOB), composed of 6 barrel layers; and the tracker endcaps (TEC-, TEC+), composed of 9 disks on each side.

The silicon tracker detects particle hits at each layer and passes the relevant data to the track reconstruction algorithm, which uses the hit information to reconstruct each particle trajectory.

2.2 Track Reconstruction Algorithms

Classic HEP trajectory reconstruction algorithms are more than 25 years old. After the pixel hit reconstruction phase, an iterative algorithm for track reconstruction is applied. The algorithm is divided into several phases: seed generation, track finding, track fitting, and track selection. A fast algorithm is fundamental since it runs in the high-level trigger, the filter that reduces the rate of events from 100 kHz to 1 kHz. Higher pileup, like that expected at the HL-LHC, causes a combinatorial explosion in the seed generation phase, which can render the algorithm unfeasible.

2.2.1 Pixel Hit Reconstruction

The pixel detector is the innermost layer of the detector and provides hits up to pseudorapidity |η| < 2.5. The charge left by a particle hit can span multiple pixels, because of the angle of the particle, the Lorentz drift caused by the magnetic field, or damage in the sensors due to the high levels of radiation.

The algorithm used for pixel hit reconstruction (Giurgiu, 2008) compares the recorded cluster shape with a cluster shape generated with Pixelav (Chiochia et al., 2008) for two given incidence angles α and β. The incidence angles must already have been computed, so this algorithm is only applicable in a second pass; in the first pass, other algorithms are used. Given the two cluster shapes, their X and Y projections are computed, as shown in Figure 2.5. A χ² minimization is performed over X and Y to obtain the hit position. This algorithm, originally developed for radiation-damaged sensors, has obtained better performance than the original one even in the case of undamaged sensors.

FIGURE 2.5: Cluster hits with X and Y projections. Source: Giurgiu, 2008

2.2.2 Track Reconstruction

Track reconstruction is an iterative algorithm. At each iteration, a different set of parameters is given to the seed generation phase to capture different trajectories. The first iterations search only for the easiest tracks, using tight bounds on the interaction region, while the successive iterations search larger regions. Hits corresponding to a reconstructed track are removed at the end of each iteration to reduce the number of seeds generated in the following ones. Each iteration is divided into the following phases:

• seed generation
• track finding
• track fitting
• track selection

Seed generation is the phase that creates the initial sets of hits from the innermost layers of the silicon tracker. Charged particles in the detector follow a helical path, and therefore their trajectory can be described with 5 parameters. Intuition suggests that it would be easier to find track seeds starting from the outer layers, since they are less dense, but this approach cannot work because the sensors in the inner layers have a better resolution and hence a smaller occupancy. Furthermore, some particles do not traverse the entire tracker and would not be found with this solution. Seed generation starts with a loose approximation of the primary vertices generated by a very fast algorithm. It is controlled by two parameters: the seeding layers and the tracking regions. The first indicates the layers (two or three) in which to search for hits, and the tracking regions define the accepted regions where a hit must be found to be part of a track seed. The pairs or triplets of hits in the corresponding layers that respect the constraints imposed by the tracking regions form the track seeds.

Track finding is performed with a Kalman filter (Kalman, 1960). Seeds generated in the previous step are used to initialize the filter, and the trajectory parameters are updated at each step by finding compatible hits in the successive layer. To find compatible hits, an analytical propagator assuming a perfect helix trajectory, computed with the current estimate of the parameters, is used to select compatible modules. The hits in each module are considered and a χ² test is performed to check compatible hits. There is also the possibility to add ghost hits that are present in the trajectory but not detected by the module (e.g. because the corresponding module is damaged). New track candidates are generated by adding the compatible hits to the previous trajectory and updating its parameters. The process is repeated for the successive layers until a termination condition is met. The number of track candidates is limited to 5 to avoid an excessive increase in the number of candidates; candidates are evaluated with a normalized χ² with a bonus for valid hits and penalties for ghost hits.

Track fitting evaluates the trajectory parameters of the tracks, using the track’s hits and previously estimated parameters. The track is refitted because the parameters found in the previous process can be biased by the constraint.

Track selection removes fake tracks generated in the previous steps. Various requirements on the number of layers with valid hits, the number of 3D hits, the maximum number of layers with no hits, and the track parameters are checked to remove fake tracks.

2.2.3 Performance of the algorithm

The described algorithm has been used effectively at the LHC, but the upgrade coming with the HL-LHC poses bigger challenges. Both the performance and the run time of the algorithm are affected by the pileup (PU). Figure 2.6 shows the increase in CPU time as a function of the pileup. PU25 corresponds to the pileup encountered in the 2012 run, while PU140 is an estimate of the pileup at the HL-LHC. The main source of inefficiency of these algorithms is the high number of track seeds generated in the first phase. The density in each region of the detector increases with the pileup and causes an explosion of the number of track seeds. To reduce the cost of the track reconstruction algorithm, this thesis develops a machine learning model to filter track seeds.


2.3 Future Track Seeding

Current track reconstruction algorithms were developed 25 years ago, when the pileup at the LHC was lower and Moore's law guaranteed an exponential increase in hardware performance. Today the new challenges imposed by the HL-LHC and by modern computer architectures, like those described in section 4.4, require a complete redesign of the current approach. In light of these considerations, a new algorithm for track seeding has been developed in (Pantaleo, 2017), which can be easily parallelized on GPU architectures.

The algorithm is based on a cellular automaton and is called Hit Chain Maker. Every cell in a cellular automaton updates its state depending only on its neighbors, hence using only local data and lending itself to easy parallelization, even on massively parallel architectures like GPUs. The objective of the algorithm is to create track seeds with the Phase 2 pixel detector, which has 4 barrel layers and 3 endcaps.

First, a graph representing the layer configuration is created, with the layers as nodes and edges between neighboring layers. Then, a graph of layer pairs is created, depending on the configuration of layers selected to find quadruplets. This is used to avoid looking for the same doublets in a layer pair multiple times for different configurations that share the same layer pair.

Each cell represents a single doublet and contains a state, which will be used in the evolution phase, and a vector of outer cells, which are cells found in a different layer pair that share a common hit with the cell. Doublets are hit pairs found within adjacent layers that are compatible with the beam spot hypothesis. Cells can be initialized in parallel since they are all independent. After each cell has been initialized, the connections between cells must be created. Two cells are connected if they share a common hit, are located in different layer pairs, and are compatible with alignment requirements in the R-z plane and with a circular trajectory coming from the beam spot in the x-y plane. If these requirements are satisfied, the two doublets can be connected to form a triplet.

The last step is the evolution step. To perform a propagation suitable for a GPU implementation, the state of each cell is updated at each iteration with a simple rule: if CellState_A = CellState_B and cell A is the inner cell, then CellState_A is incremented by 1. This update is performed in parallel on all the cells by first setting a flag that indicates whether the state must be updated; after all the cells have set their flag, the states are updated. The algorithm is generic and can find any kind of N-tuplet, for example using hits in the silicon strip tracker. An example of the final output of the algorithm is shown in Figure 2.7. To form quadruplets two evolution steps are sufficient, and the quadruplets can be found by searching for the chains that start with CellState = 2 in the root layers.
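The following is a minimal Python sketch of the evolution rule described above, assuming each cell stores only its state and the indices of its connected outer cells; the data layout and function names are illustrative and do not reproduce the actual GPU implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Cell:
        state: int = 0                                        # cellular automaton state
        outer_neighbors: list = field(default_factory=list)   # indices of connected outer cells

    def evolve(cells, n_iterations):
        # Synchronous evolution: first flag every cell that has the same state
        # as one of its outer neighbors, then increment all flagged cells,
        # mimicking the two-step parallel update performed on the GPU.
        for _ in range(n_iterations):
            flags = [any(cells[j].state == c.state for j in c.outer_neighbors)
                     for c in cells]
            for cell, flag in zip(cells, flags):
                if flag:
                    cell.state += 1
        return cells

    # Example: a chain of three connected doublets (a quadruplet candidate).
    cells = [Cell(outer_neighbors=[1]), Cell(outer_neighbors=[2]), Cell()]
    evolve(cells, n_iterations=2)
    assert cells[0].state == 2   # the innermost cell of a quadruplet reaches state 2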


FIGURE 2.7: Example of the Hit Chain Maker output. Source: Pantaleo, 2017

The algorithm has been implemented both on GPUs and on CPUs, and the timing performance is shown in Table 2.1, which compares the GPU and CPU versions of the cellular automaton, the triplet propagation algorithm (the trivial extension of the 2016 track seeding to form quadruplets instead of triplets), and the 2016 pixel tracks algorithm (the old algorithm run on the old detector geometry with just 3 barrel and 2 endcap layers).

Algorithm                   Timing
GPU Cellular Automaton      (1.2 ± 0.9) ms
CPU Cellular Automaton      (14.0 ± 6.2) ms
Triplet Propagation         (72.1 ± 25.7) ms
2016 Pixel Tracks           (29.3 ± 13.1) ms

TABLE 2.1: Timing performance for different track seeding algorithms. Source: Pantaleo, 2017


Chapter 3

Machine Learning Background

In this chapter we give a short introduction to machine learning theory, explaining the most important points that lead to the design of the doublet filtering model. Section 3.1 gives a formal definition of machine learning and an example of a simple linear classifier. Section 3.2 introduces neural networks and the different kinds of architectures, for example recurrent and convolutional networks. The last two sections describe two examples of successful machine learning applications in the high energy physics domain.

3.1 Machine Learning

According to Mitchell's definition (Mitchell, 1997), a learning task can be defined as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Machine learning can be divided into three categories, depending on the characteristics of the data: supervised learning, where each data sample is labelled with the corresponding target and the task is to train a model to predict the target given a sample; unsupervised learning, where only the data is present, without a target, and the task is to model the data distribution; and reinforcement learning, where the task is to control an agent performing actions that result in losses (or rewards) and the objective is to minimize the loss (or maximize the reward).

These categories do not represent a perfect division but should be used only as a general guideline; for example, unsupervised learning is often used to solve supervised learning problems by first finding a good model for the data and then solving the supervised task.

Furthermore, supervised learning can be divided into two categories: classification, where the target is chosen from a finite set, and regression, where the target is a real value.


3.1.1 Learning System

After the definition of the problem it is necessary to define a learning system to solve it. Since doublet filtering is a classification problem, this is the only task that will be discussed from now on. Given X, the domain of the input samples, and Y, the domain of the targets, the problem to solve is that of finding a function h : X → Y, called the hypothesis, that approximates f : X → Y, the function that associates the input samples with their targets.

To define a learning system two components are necessary: H, the set of functions to use as hypotheses, and a training algorithm to decide which function h ∈ H will be chosen to approximate f. Notice that this introduces two different types of bias into the system: first, H could be a subset of all possible functions, and it is possible that f ∉ H; second, the learning algorithm must decide which hypotheses are better approximations of f.

3.1.2 Linear Models

To clarify the previous definitions, weather prediction can be used as a classic example of classification and regression with linear models. Suppose that the target of the prediction is the mean temperature of a day, given a set of features such as the weather conditions in the previous days, represented as a vector of real values. The problem can be modeled as a regression problem, and a linear model can be used as the learning system. A linear model can only represent linear functions; this means that the hypothesis will be of the form:

h(x) = w^T x + b

Now suppose that the training samples are of the form ⟨x_i, y_i⟩, where x_i is the vector of input features and y_i is the target; for simplicity they will be represented in matrix form as X and y, with each row corresponding to a different sample. The linear model can be trained with a least squares method, solving the following optimization problem:

min_{w,b} (1/2) ‖Xw + b − y‖²

The solution of this problem gives the parameters w and b of a linear hypothesis, which can be used to compute predictions on unseen data.
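As a simple illustration (not part of the thesis pipeline), such a least squares linear model can be fitted with NumPy as follows:

    import numpy as np

    # Toy regression data: 100 samples with 3 weather-like features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w, true_b = np.array([1.5, -2.0, 0.5]), 3.0
    y = X @ true_w + true_b + 0.1 * rng.normal(size=100)

    # Append a column of ones so that the bias b is learned together with w.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

    # Least squares solution of min ||X_aug p - y||^2 with p = [w, b].
    p, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    w, b = p[:-1], p[-1]

    # Predictions on unseen data use the learned hypothesis h(x) = w^T x + b.
    x_new = rng.normal(size=3)
    y_pred = x_new @ w + b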

3.2 Neural Networks

Artificial Neural Networks (Rosenblatt, 1958) are machine learning models inspired by the neurons in the human brain. Figure 3.1 shows a schematic view of the Rosenblatt neuron, while Figure 3.2 shows a biological neuron. Biological neurons are organized in complex networks connected by synapses and communicate by voltage propagation. Each neuron receives electric current as input from the connected input neurons and, if the input signal is strong enough, it generates an electric signal, called a spike, that is sent along the axon to the synaptic terminals, where it is received by the connected output neurons. Artificial neurons simulate this behaviour by summing all the inputs and applying a nonlinear activation, like the sign function in the example. This model does not resemble in any way the biological processes underlying the neuronal dynamics, but it is fast to compute and powerful enough to solve many problems.

FIGURE 3.1: Rosenblatt neuron.

FIGURE 3.2: Biological neuron. Source: Wikipedia

Multiple neurons can be connected together to form a network.

It is possible to represent the neural network input as a vector and the network as a series of layers, each consisting of a matrix multiplication followed by a nonlinear activation. Multiple layers can be combined to create deeper networks. A single layer performs the following operation:

y = f (W x + b)

where x is the input vector, y the output vector, W and b the weight matrix and bias, and f a nonlinear activation function; common choices for the activation are sigmoid, scaled sigmoid, and rectified linear functions. Each layer performs a matrix multiplication followed by a nonlinear activation. This type of computation can be easily parallelized using GPUs or vectorization instructions on the CPU.

The model can be trained by performing gradient descent on a differentiable error function. Backpropagation (Rumelhart, McClelland, and Group, 1988) is used to compute the partial derivatives with respect to the model parameters and update them. The algorithm can be executed in batch mode, computing the error function on the whole training set, or with mini-batches, computing the error function on a smaller part of the training set.
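A minimal NumPy sketch of the layer equation y = f(Wx + b) applied to build a small feedforward network; the layer sizes and the choice of ReLU and sigmoid activations are illustrative:

    import numpy as np

    def relu(x):
        # Rectified linear activation, one of the common choices mentioned above.
        return np.maximum(x, 0.0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dense_layer(x, W, b, activation=relu):
        # Single layer: matrix multiplication plus bias, followed by a nonlinearity.
        return activation(W @ x + b)

    # A two-layer network applied to a single input vector of size 4.
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
    W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

    hidden = dense_layer(x, W1, b1)
    output = dense_layer(hidden, W2, b2, activation=sigmoid)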


3.2.1 Recurrent Neural Network

The neural network seen in the previous paragraph takes as input a single vector, but most real-world data is not represented in vector form. If the data is structured it must be mapped into a single vector before being used as input to a feedforward neural network, but in this way all the information about the structure is lost, even if it could be useful to solve the problem. A common type of structured data is the time series, which can represent sequences of data, like stock prices or periodic sensor measurements. It is possible to use time series with the previous feedforward model by concatenating the time series values into a single vector. Since this vector must have a constant size, it will contain only a finite number of timesteps. This model is called Input Delay Neural Network (IDNN) (Haykin et al., 2009). The advantage of this model is that no changes are necessary to the feedforward network architecture; it is sufficient to preprocess the time series. The disadvantage is that only a limited number of timesteps of the time series are considered and that no information about the structure of the data is encoded in the network, making learning less efficient.

Recurrent neural networks (RNN) solve these problems by using the time series directly as input to the neural network. Assume that a state transition system can model the time series generation mechanism:

x_t = τ(x_{t-1}, l_t),    y_t = g(x_t)

where l_t represents the input at time t, x_t the internal state of the system, τ the state transition function, y_t the output at time t, and g the output function. It is possible to create a neural network that models the same system, where the g and τ functions are computed as matrix multiplications followed by nonlinear activations and the matrix weights are learned from the data:

x_t = σ(W_h x_{t-1} + W_l l_t + b_x),    y_t = σ(W_y x_t + b_y)

This type of RNN is also called an Elman network (Elman, 1990). The backpropagation algorithm can be modified to be used with RNNs; the resulting algorithm is called Backpropagation Through Time (BPTT).
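A minimal NumPy sketch of the Elman recurrence above, with illustrative dimensions and the logistic function as σ:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def elman_step(x_prev, l_t, Wh, Wl, bx, Wy, by):
        # One step of the Elman recurrence: update the hidden state from the
        # previous state and the current input, then compute the output.
        x_t = sigmoid(Wh @ x_prev + Wl @ l_t + bx)
        y_t = sigmoid(Wy @ x_t + by)
        return x_t, y_t

    # Unroll the recurrence over a toy sequence of 10 timesteps with 3 input features.
    rng = np.random.default_rng(0)
    n_hidden, n_inputs, n_outputs = 5, 3, 1
    Wh, Wl = rng.normal(size=(n_hidden, n_hidden)), rng.normal(size=(n_hidden, n_inputs))
    Wy = rng.normal(size=(n_outputs, n_hidden))
    bx, by = np.zeros(n_hidden), np.zeros(n_outputs)

    x_t = np.zeros(n_hidden)   # initial state
    for l_t in rng.normal(size=(10, n_inputs)):
        x_t, y_t = elman_step(x_t, l_t, Wh, Wl, bx, Wy, by)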

3.2.2 Convolutional Neural Network

Convolutional Neural Networks (CNN) are neural networks specifically developed to deal with input images. From a biological point of view, it is known that neurons are organized hierarchically at increasing levels of abstraction (Riesenhuber and Poggio, 1999) and that neurons are sparsely connected with each other. CNNs are obtained by applying multiple convolutional layers: the hierarchy is generated by the layer composition, while the convolution operation ensures the sparseness of the connections.

CNNs have been employed to solve image classification tasks with architectures such as AlexNet (Krizhevsky, Sutskever, and Hinton, 2012), shown in Figure 3.3, VGG (Simonyan and Zisserman, 2014), and ResNet (He et al., 2016), where they achieve state of the art results. The advances in the techniques employed to train and regularize this type of network make it possible today to use very deep networks (ResNet has been evaluated with up to 152 layers).

FIGURE 3.3: AlexNet model architecture. Source: Krizhevsky, Sutskever, and Hinton, 2012

The particular architecture of CNNs makes them much more efficient than standard feedforward neural networks for image classification tasks. First of all, the same convolution operation can be applied to images of different sizes, allowing a single network to take as input images of different size when the only operation is the convolution. Furthermore, since the kernel size is much smaller than the input image, the resulting connectivity is sparse. For example, considering an input of dimension n and an output of dimension m, the number of parameters of a feedforward weight matrix is n × m, while the number of parameters of a convolution is just k, the dimension of the kernel matrix, which is some orders of magnitude smaller.

Using the convolution operation the output for each pixel depends only on a local neighbourhood of its input. In the first layers, the convolution detects features using only a small neighbourhood of the pixel, like simple shapes or strokes. Deeper layers can use these features to detect higher level features.

A layer in a CNN is often divided into three phases: convolution, nonlinear activation, and pooling. In the first phase, a convolution is applied to detect features. Convolution is a linear operation, while the second stage is a nonlinear activation, for example a rectified linear activation. The ReLU function is often used since it is more efficient to compute and helps with the vanishing gradient problem in deep networks. Finally, a pooling stage is applied to reduce the dimension of the output and to make the network invariant to small input translations. In the next sections the convolution and pooling operations will be defined, along with their most common variants and parameters.


Convolution

Given two functions x and w, the convolution operation is defined as:

y(t) = ∫ x(a) w(t − a) da

In the CNN terminology, x is the input, y is the output, also called the feature map, and w is the convolution kernel. CNNs usually employ the discrete convolution, a discrete variant of the convolution operation defined as:

y(t) = Σ_{a=−∞}^{+∞} x(a) w(t − a)

which can be easily extended to multiple dimensions:

y(m, n) = Σ_{i=−∞}^{+∞} Σ_{j=−∞}^{+∞} x(i, j) w(m − i, n − j)

All the subsequent models use 2D and 3D convolutions. Figure 3.4 shows a graphical representation of the operation.

FIGURE 3.4: Convolution operation applied to a 2-dimensional input. Source: colah.github.io

CNN architectures typically use small kernel sizes. These significantly reduce the number of parameters with respect to a fully connected neural network. For example, consider a convolution between a 28x28 image and a 3x3 kernel: the output size is 26x26 without padding and the kernel has only 9 parameters. With the same input and output sizes, the total number of parameters of a full matrix multiplication is 28 × 28 × 26 × 26.

The previous definitions consider infinite sizes for the input and kernel functions. With a finite input, it is necessary to decide a strategy to deal with pixels at the border. Two common strategies are employed. Consider an input of size n and a kernel of size k: the first strategy, called valid convolution, does not compute the convolution for pixels at the border of the input image, where the kernel would partially lie outside of the input, and produces an output of size n − k + 1. The second strategy, called same convolution, produces an output with size equal to the input by padding the input image with zeros and applying the valid convolution.

Another important parameter is the stride. A stride of 1 performs the classic convolution operation expressed in the previous equations. A bigger value allows some elements to be skipped, downsampling the output. With stride s, kernel size k, and input size n × m, using the valid strategy the convolution can be defined as:

y(m, n) = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} x(s·m + i, s·n + j) w(i, j)

Pooling

The pooling operation is used to reduce the size of a feature map and to make the network invariant to small translations. A pooling operation considers a rectangular area of the input and applies a function to it. Typical pooling functions are the max, the average, and a weighted average centered on the central pixel. If the pooling function takes k arguments, the output has dimension n/k.
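A small NumPy sketch of the valid strided convolution and the max pooling operations described above (written with explicit loops for clarity, not for speed):

    import numpy as np

    def conv2d_valid(x, w, stride=1):
        # Valid 2D convolution (cross-correlation, as used in CNNs) with a stride.
        n, m = x.shape
        k = w.shape[0]
        out_h = (n - k) // stride + 1
        out_w = (m - k) // stride + 1
        y = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
                y[i, j] = np.sum(patch * w)
        return y

    def max_pool2d(x, k=2):
        # Non-overlapping k x k max pooling; each dimension shrinks by a factor k.
        n, m = x.shape
        return x[:n - n % k, :m - m % k].reshape(n // k, k, m // k, k).max(axis=(1, 3))

    # A 28x28 input convolved with a 3x3 kernel gives a 26x26 valid output, as in the text.
    x = np.random.default_rng(0).normal(size=(28, 28))
    w = np.ones((3, 3)) / 9.0
    assert conv2d_valid(x, w).shape == (26, 26)
    assert max_pool2d(conv2d_valid(x, w)).shape == (13, 13)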

3.2.3 Regularization

One of the key aspects that differentiate machine learning from classic optimization is the generalization capability of the model. In classic optimization, the global minima (or equivalently maxima) of a function are the points of interest, while in machine learning the objective is the minimization of the error on unseen data. If a model has a low error on the training set but a high error on unseen data, the model has overfitted the data. More formally, given a space of hypotheses H and a hypothesis h ∈ H, h overfits the training set if there exists a hypothesis h′ that has a bigger error on the training set but a smaller error on unseen data (Mitchell, 1997). To avoid overfitting two different strategies are possible: increase the size of the training data or decrease the model complexity. The first strategy is not always feasible because data acquisition can be expensive. Many different techniques have been developed to decrease the model complexity. In the next sections, two regularization techniques that are widely employed with neural networks are described.


Dropout

Dropout (Srivastava et al., 2014) is one of the most used techniques to prevent overfitting. To apply dropout before a layer during the training phase, a random bitmask of the input size is created by sampling bits from a random bit generator, where the probability of a 1 in each element of the bitmask is determined by the hyperparameter p. The input is then multiplied elementwise by the bitmask and the obtained vector is passed as input to the layer. At test time, the activations must be rescaled by p so that their expected value matches the one seen during training (equivalently, they can be divided by p during training, the so-called inverted dropout). Dropout is an approximate and efficient way to perform model ensembling with neural networks while training only one slightly bigger network. Dropout also helps to reduce the co-adaptation of neuron activations, since each neuron cannot rely on a particular input being active during the training phase. p must be properly tuned with model selection. Typical values of p lie around 0.5.
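A minimal sketch of dropout in its inverted formulation, where the rescaling is applied at training time so that no correction is needed at test time (p is the keep probability):

    import numpy as np

    def dropout(x, p, training=True, rng=None):
        # Inverted dropout: during training each element is kept with probability p
        # and rescaled by 1/p, so that no rescaling is needed at test time.
        if not training:
            return x
        rng = rng or np.random.default_rng()
        mask = rng.random(x.shape) < p   # 1 with probability p, 0 otherwise
        return x * mask / p

    x = np.ones(8)
    x_train = dropout(x, p=0.5, training=True)
    x_test = dropout(x, p=0.5, training=False)   # identity at test time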

L2 regularization

L2 regularization (Tikhonov, Arsenin, and John, 1977) is another widely used technique to regularize neural networks and linear models. This method is implemented by adding a weight penalization term to the error function:

E_new = E_old + γ ‖w‖₂²

With this regularizer, large weights of the network are penalized, effectively reducing the model complexity. The penalty is equivalent to a Gaussian prior centered around zero. If the L1 norm is used instead, the prior distribution of the weights becomes a Laplacian and the network will have a sparser weight matrix.
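As a short illustration, the penalized error and its gradient can be written as follows; the extra gradient term shrinks the weights toward zero at every update (weight decay):

    import numpy as np

    def l2_penalized_loss(loss, w, gamma):
        # Error function with the added penalty term gamma * ||w||^2.
        return loss + gamma * np.sum(w ** 2)

    def l2_penalized_gradient(grad_loss, w, gamma):
        # The penalty contributes 2 * gamma * w to the gradient, which shrinks
        # the weights toward zero at every update (weight decay).
        return grad_loss + 2.0 * gamma * w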

3.2.4 Binarized Neural Network

Large scale inference can be expensive with neural networks, since each model can require up to tens of millions of multiplications. One possible way to reduce the computational cost is to lower the precision of the floating point computations. In the extreme case, binary weights and activations can be used instead of real values. This type of network is called a Binarized Neural Network (Courbariaux and Bengio, 2016). Weight binarization can be seen as a strong form of regularization, but it is also an interesting way to reduce the computational cost at inference time by substituting all the multiplications and additions with bitwise operations.

Furthermore, this type of architecture can be efficiently implemented in specialized hardware like FPGAs, and previous work (Umuroglu et al., 2017) achieved 12.3 million inferences per second.

This kind of network can also be used for embedded devices where the memory could be limited or the power consumption is a limiting factor.


Binarization can be performed in a deterministic way using the sign function:

w_k^b = sign(w_k)

or in a probabilistic way, with the probability determined by a hard sigmoid:

P(w_k^b = 1 | w_k) = σ(w_k),   where σ(x) = clip((x + 1)/2, 0, 1)

The latter approach is more expensive because it requires a source of random bits, and it is not as effective as the deterministic one at reducing the inference computational cost. For these reasons, the following description concentrates on the first approach.

The output of the network can be computed with the following algorithm:

    # Forward pass of a binarized network with L layers; a[0] is the input
    # and w[k] are the real-valued weights kept for training.
    for k = 1 to L:
        wbin[k] = Binarize(w[k])            # binarize the stored real-valued weights
        s[k] = a[k-1] * wbin[k]             # matrix multiplication with binary weights
        a[k] = batch_norm(s[k], theta[k])   # batch normalization with parameters theta[k]
        if k < L:
            a[k] = Binarize(a[k])           # binarize activations, except for the output layer

where Binarize is the binarization function, batch_norm is the batch normalization, and L is the number of layers.

The weights w_k are floating point values and they are binarized only before the matrix multiplication. Backpropagation is applied to the original floating point parameters.

At inference time the binarization can be implemented more efficiently by packing 32 binary weights into a single 32-bit integer and performing the matrix multiplication with bitwise operations. The multiplication corresponds to a bitwise XNOR, while the sum can be computed with a popcount, which counts the number of 1 bits.
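A pure-Python sketch of the XNOR/popcount trick for the dot product of two ±1 vectors packed into integers (real implementations pack the bits into 32- or 64-bit words and use a hardware popcount instruction):

    def pack_bits(values):
        # Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1, -1 -> 0).
        word = 0
        for i, v in enumerate(values):
            if v > 0:
                word |= 1 << i
        return word

    def binary_dot(word_a, word_b, n):
        # Dot product of two packed +/-1 vectors of length n: XNOR marks the
        # positions where the signs agree, popcount counts them, and each
        # agreement contributes +1 while each disagreement contributes -1.
        xnor = ~(word_a ^ word_b) & ((1 << n) - 1)
        matches = bin(xnor).count("1")   # popcount
        return 2 * matches - n

    a = [+1, -1, +1, +1]
    b = [+1, +1, +1, -1]
    assert binary_dot(pack_bits(a), pack_bits(b), 4) == sum(x * y for x, y in zip(a, b))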

3.3 ATLAS Higgs Boson challenge

The Higgs Boson Machine Learning challenge (Perez et al., 2014), hosted on Kaggle, was organized in 2014 by the ATLAS experiment (one of the main experiments currently active at the LHC). The competition attracted a total of 1785 teams, most of them working without any physicist among their members. The top three winning models are two ensembles of neural networks and a boosted decision tree model. During this competition XGBoost (Chen and Guestrin, 2016), a boosted decision tree library, was developed by one of the competitors. The objective of the challenge is to distinguish the tau decay of a Higgs boson from the background. The data is generated with a simulation and for each event a series of high-level features is given as input. Today boosted decision trees are often used in offline analysis, for example in (Papaefstathiou, Yang, and Zurita, 2013), and they are implemented in ROOT, the main library used by physicists to perform data analysis.

3.4 Neutrino event classification with deep CNN

The NuMI Off-axis Neutrino Appearance (NOvA) experiment (Patterson, 2013) at Fermilab is designed to study neutrino oscillation in the NuMI (Neutrinos at the Main Injector) beam. A fundamental part of the experiment is the classification of the detected events. Only certain types of events are interesting and must be kept for further analysis, so a good classification is fundamental to ensure high quality data. Standard techniques for this problem include the definition of high-level features, like the masses and energies of each particle, or their trajectories, which are then passed to a classification model, like decision trees, k-nearest neighbors or multilayer perceptrons. Another possibility is to perform the classification using the raw data directly from the detector, as in Aurisano et al., 2016. The main idea is to pass the raw data as an image to a convolutional neural network, which extracts the interesting features for the classification of neutrino events. The original images are too big to be used as input for the CNN, up to 350 000 in the Near Detector. To reduce the amount of data, only 80x100 slices of the original sensor data are fed to the network. The slices are clusters found with the DBSCAN algorithm. The network architecture is shown in Figure 3.5 and is based on GoogLeNet (Szegedy et al., 2015). Differently from the original GoogLeNet architecture, this one has been pruned and only the first output is used, because the reduced depth was sufficient to solve the task efficiently. The network has been trained with stochastic gradient descent with a decaying learning rate. L2 regularization has been applied to the weights and dropout is used in the last layer before the softmax output. Gaussian noise has been added to the input images during training, and some of the training images are reflected to augment the dataset. The model beat the previous state of the art in neutrino event classification.


FIGURE 3.5: CNN architecture for neutrino event classification. Source: Aurisano et al., 2016


Chapter 4

Methodology

In this chapter, the software and hardware setup used during the project is described, along with the reasons that led to its choice. Section 4.1 describes the data, including the generation mechanism and the preprocessing applied to the dataset. Section 4.4 describes the different hardware architectures used and evaluated during the experiments.

4.1 Data description

The dataset is generated with a Monte Carlo simulation which produces particle trajectories simulating the LHC and the CMS detector, taking into account the magnetic field generated by the solenoid, the material budget, and other types of interference. The software used to generate the data is CMSSW, available at http://cms-sw.github.io/, which was modified to collect the dataset. Each doublet extracted from the data is represented by the following features:

run, evt identifiers of the physics event simulated;

(inZ, inX, inY) global coordinates of the inner hit;

(outZ, outX, outY) global coordinates of the outer hit;

isBarrelIn/Out boolean flag that indicates if the inner/outer hit is on a barrel layer or endcap;

isBigIn/Out indicates if the silicon pixel hit spans two read out chips (ROCs) and therefore contains big pixels;

isEdgIn/Out indicates if the inner/outer cluster is on the edge of a module and thus the corresponding 8x8 pad is not entirely inside it;

isBadIn/Out indicates if the inner/outer hit cluster contains pixels marked as bad or malfunctioning;


isFlippedIn/Out indicates the module orientation;

layerIn/Out layer ID;

moduleIn/Out module ID;

ladderIn/Out ladder ID;

sideIn/Out side ID;

diskIn/Out disk ID;

panelIn/Out panel ID;

bladeIn/Out blade ID;

detSeqIn/Out index relative to the layer of the hit, taken directly from CMSSW;

detCounterIn/Out index relative to the layer of the hit, counting the layers going from barrel layers (0, 1, 2, 3 for BPix1, BPix2, BPix3 and BPix4) to forward endcap layers (FPix_pos_1, FPix_pos_2 and FPix_pos_3) to backward endcap layers (FPix_neg_1, FPix_neg_2 and FPix_neg_3);

"in/outPhi" The angle of the inner/outer hit positions.

"in/outR The radius of the corresponding layer for the inner/outer hit

"in/outOverflowX indicates if the cluster shape on the X axis is longer than the image dimension and therefore is not entirely represented in the image

"in/outOverflowY indicates if the cluster shape on the Y axis is longer than the image dimension and therefore is not entirely represented in the image

"in/outCSizeX" inner/outer cluster size along the X axis

"in/outCSizeY" inner/outer cluster size along the Y axis

"diffADC" difference between the sum of the ADC values between the inner and outer clusters.

inPix 15x15 image with the ADC levels for the inner cluster shape, centered around the cluster barycenter.

outPix 15x15 image with the ADC levels for the outer cluster shape, centered around the cluster barycenter.

The most important features are inPix and outPix, which contain the cluster shapes and are the primary input to the convolutional neural network. Figure 4.1 shows some cluster shape examples, while Figure 4.2 shows the average ADC values, divided between inner and outer hits and between true and fake seeds, for BPix0/BPix1 doublets. The mean values for true and fake doublets look quite different, and this suggests that at least some simple cases can be recognised.


FIGURE 4.1: Hit shape samples extracted from the dataset.

Other features, like the isBad, isEdge, and isBig flags, identify situations that may alter the hit shapes, making them harder to recognize. The orientation of a module, indicated by the boolean flag isFlipped, also influences the cluster shape, and there is no easy way to transform the hit shape from a flipped module into the equivalent that would be read by a non-flipped module, so it is necessary to decide how to handle these two cases. Finally, each doublet can have three different configurations for the internal and external hit:

barrel-barrel where both hits are located in barrel layers;

endcap-endcap where both hits are located in endcap layers;

barrel-endcap where the internal hit is located in a barrel layer and the external hit is located in an endcap layer.

The endcap-barrel configuration is missing because it is impossible. The barrel-endcap configuration should in principle be more difficult because the outer layer is perpendicular to the inner one and not parallel, as in the other two cases. The endcap-endcap case also presents some difficulties because it is the rarest of the three modes. During the evaluation and analysis of the models, the dataset has been divided to select only one of the three configurations to assess the model performance in each case. Some models are trained with the whole dataset and then tested separately for each configuration, while other models are trained specifically with the data from a single configuration.


(A) true inner hits  (B) true outer hits  (C) fake inner hits  (D) fake outer hits

FIGURE 4.2: Mean ADC values for each pixel in BPix0/BPix1 doublets, divided between inner and outer hits and between fake and true seeds.



4.2 Data preprocessing

The cluster shapes are raw data registered by the detector and can be preprocessed to obtain a better representation to give as input to the model. Looking at Table 4.1 it can be noticed that most of the cluster shapes have a small extent in both dimensions, so they can be represented as 15x15 images to reduce the computational complexity of the model. These considerations are necessary because the final model needs to be as fast as possible in order to filter the data in real time. Notice that some of the features described in the previous section give information about the cluster size, so even if the input image does not contain the entire shape, the length and the other features extracted from the cluster shape carry information about the complete cluster.

TABLE 4.1: Statistics for cluster size

         dim_x_in   dim_y_in   dim_x_out   dim_y_out
mean         2.46       3.58        2.39        2.48
std          2.30       3.39        2.76        2.50
min             1          1           1           1
25%             2          2           1           2
50%             2          2           2           2
75%             3          4           3           3
max            83        145          56         145

The angle between the particle trajectory and the detector module strongly influences the cluster shape. If the trajectory is almost perpendicular, a small cluster shape is expected, while a small angle can cause longer shapes because the particle crosses more pixels.

To correct this behavior, two additional input images are added as separate channels for each cluster shape. They represent angular corrections of the original cluster and are computed as:

sin_correction = hit ⊙ cos(arctan(y/z))
cos_correction = hit ⊙ sin(arctan(y/z))

where y and z are the corresponding components of the hit global coordinates and ⊙ is the elementwise product.
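A sketch of this preprocessing step, assuming hit is the 15x15 ADC image and (y, z) are taken from the hit global coordinates; the stacking of the channels is an illustrative choice:

    import numpy as np

    def angular_corrections(hit, y, z):
        # Build the two extra channels: the cluster image scaled by the cosine
        # and sine of the incidence angle estimated from the hit position.
        angle = np.arctan(y / z)
        sin_correction = hit * np.cos(angle)
        cos_correction = hit * np.sin(angle)
        # Stack the original image and the two corrections as a 3-channel input.
        return np.stack([hit, sin_correction, cos_correction], axis=-1)

    hit = np.random.default_rng(0).integers(0, 256, size=(15, 15)).astype(float)
    channels = angular_corrections(hit, y=3.2, z=10.0)   # shape (15, 15, 3)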


FIGURE 4.3: Single model approach.

FIGURE 4.4: Multiple models approach.

These two corrections were applied for the first time in a previous version of the doublet filtering model and were found necessary to achieve a good performance.

Other factors that influence the cluster shape are the layer and the module orientation. These problems are treated in different ways in the trained models to test different strategies. The first idea is the single model approach: the differences are completely ignored in the convolutional part of the network, and the only distinguishing features are the flags regarding layer id and module orientation in the feedforward part of the neural network, hoping that the model will be able to capture the differences using just these features. This can work in some limited cases, as will be shown in the subsequent chapters. A different strategy requires training a different model for each possible layer pair, to ensure that each model is optimised in the best possible way. This approach is possible but expensive, since it requires training a large number of different models, each needing its own parameter selection and performance analysis. Nonetheless, it is a cost that is paid only at training time, while at runtime the only difference is a higher memory consumption caused by the need to keep the weights of each model in memory. This is the multiple models approach.

A third solution is to train all the different models together and separate them only at inference time to reduce the computational cost of each model. A different channel for each inner and outer layer can be added to the input. This is the feature map approach. For each training sample, only two channels contain the cluster shapes, the ones corresponding to the actual inner and outer layers where the doublet resides, while all the other channels contain only zeros. The training of this architecture is done only once for all the cases, and each case is still processed differently because the convolution operation differs depending on the layers, while the subsequent convolutional kernels are shared between all the models. At inference time the models can be separated to avoid the additional cost of adding the zero channels for each sample.



FIGURE 4.5: feature map approach at training time and at test time. The labels below the input data represent the layer pair (Bpix1-Bpix2 for the first two barrel layers, FPix2+ FPix3+ for the last two endcap layers).

The assumption is that one specialized convolutional layer is sufficient and all the other layers can be shared, but the same idea could be used to separate further layers. Figures 4.3, 4.4 and 4.5 show a graphical interpretation of the three approaches introduced above.
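The distinctive part of the feature map approach is how the per-layer channels are built at training time. A minimal sketch follows, assuming integer layer indices and a fixed number of layers; the function name and channel ordering are assumptions.

import numpy as np

# Build the per-layer input channels for one doublet: all channels are zero
# except the ones indexed by the inner and outer layer, which hold the
# corresponding 15x15 cluster shapes.
def add_layer_channels(inner: np.ndarray, outer: np.ndarray,
                       inner_layer: int, outer_layer: int,
                       n_layers: int) -> np.ndarray:
    size = inner.shape[0]
    channels = np.zeros((size, size, n_layers), dtype=np.float32)
    channels[:, :, inner_layer] = inner
    channels[:, :, outer_layer] = outer
    return channels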

4.3 Doublet Model

To filter the fake track seeds it is necessary to solve a binary classification problem. Since the model operates in one of the first stages of data acquisition, high-level features cannot be used because they have not been computed yet (for some features there may be a loose approximation computed by fast algorithms). The only information available is the cluster shapes of the hits and their position inside the detector, such as the global coordinates and the layer and module where they are detected, as detailed in section 4.1.

15x15 images are extracted from the hit shapes and used as input for a classification model; in this case convolutional layers were chosen to extract features from the hit shapes, which is a classic choice both in particle physics and computer vision, as seen in chapter 3. The angular transformation is added as additional input to the convolutional layers and, in some of the experiments, also the feature maps built with the feature map approach. The additional features, such as the positional features and the scalar features extracted from the cluster shape, are concatenated with the high-level features extracted by the convolutional layers and given as input to the final feedforward layers.
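A minimal Keras sketch of this two-branch architecture follows, under stated assumptions: the channel count, the number of scalar features and the layer widths are illustrative choices, not the tuned configuration used in the thesis.

from tensorflow.keras import layers, Model

NUM_CHANNELS = 6   # e.g. 2 cluster shapes + 4 angular-correction channels (assumption)
NUM_SCALARS = 20   # e.g. coordinates, layer flags, cluster sizes (assumption)

# Convolutional branch over the 15x15 cluster images.
img_in = layers.Input(shape=(15, 15, NUM_CHANNELS), name="cluster_images")
x = layers.Conv2D(16, 3, activation="relu", padding="same")(img_in)
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)

# Scalar/positional features are concatenated with the convolutional features.
scalars_in = layers.Input(shape=(NUM_SCALARS,), name="scalar_features")
h = layers.Concatenate()([x, scalars_in])
h = layers.Dense(64, activation="relu")(h)
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="is_true_doublet")(h)

model = Model(inputs=[img_in, scalars_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])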

The doublet model can be used to filter fake doublets by applying a threshold to the predicted probability. This threshold can be used to avoid inserting fake doublets in the cellular automaton graph. The same predicted probability for each doublet could also be used to filter triplets and quadruplets. The integration with the track seeding is not implemented yet and is outside the scope of this thesis.
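A brief usage sketch of the thresholding step, continuing the model sketch above: `cluster_images` and `scalar_features` are assumed to hold the prepared inputs, and the 0.5 threshold is an illustrative value, not the thesis working point.

import numpy as np

# Keep only the doublets predicted as true seeds above the working point.
threshold = 0.5
probs = model.predict([cluster_images, scalar_features]).ravel()
selected_doublets = np.flatnonzero(probs > threshold)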



4.4 Hardware implementation

Modern architectures exploit parallelism at different levels in order to achieve maximum performance. In the past, thanks to Moore's law, it was possible to write unoptimized programs and still obtain improvements with newer hardware. Today it is necessary to design an application to exploit all the different kinds of parallelism in an efficient and scalable way. Modern data centers are often built with heterogeneous components such as multicore CPUs, manycores, GPUs and FPGAs, and each of these components must be programmed efficiently. In the following paragraphs we give an overview of the architectures that can be used to implement neural network inference, their hardware design and their software and framework support.

4.4.1 General Purpose GPU

A Graphics Processing Unit (GPU) is a highly parallel architecture specialized for the parallel execution of computer graphics routines. In recent years they have been used for general purpose programming and today most deep learning frameworks are optimized for GPU architectures. It is possible to program these architectures using the Compute Unified Device Architecture (CUDA) language (Nickolls et al., 2008), developed by Nvidia for their GPUs, or its open source alternative OpenCL (Stone, Gohara, and Shi, 2010).

In CUDA the main computational unit is the kernel function, a single function typically defined to operate on a single element of a data structure. The kernel function is run in parallel, and up to three indices are available inside the invocation which can be used, for example, to address different elements of a common data structure. Threads are organized in a grid, which is further divided into blocks. Each thread and block can have up to three IDs to identify it. This 3D mapping of threads makes it easy to deal with 2D or 3D structures. The kernel invocation defines the size and number of the blocks, as shown in the example below.



// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

Threads inside the same block can share data through a local cache and can synchronize their execution. Memory is separated into per-thread local memory, per-block shared memory and global memory, plus some additional space to store constants and textures. The programming model assumes that the GPU is used as a coprocessor, with a host that sends data and instructions.

GPU architecture

The architecture is implemented in hardware as an array of streaming multiprocessors (SM). When a kernel function is invoked, each thread block is distributed to an SM. Each SM can execute multiple blocks concurrently if it has enough threads, often on the order of hundreds. This approach is called single instruction multiple threads (SIMT). Threads are divided into warps and each warp is executed in parallel. Threads in the same warp must execute the same instructions to achieve full efficiency: if a branch diverges, each path must be executed sequentially. Different warps may execute different instructions.

The Nvidia P100 (NVIDIA Tesla P100 whitepaper 2016), one of the GPUs used in our experiments, is based on the Pascal architecture and comprises 3584 CUDA cores, with a single-precision performance of 9.3 TFLOPS.



4.4.2 Intel Xeon Phi

The Intel Xeon Phi (Duran and Klemm, 2012) is a many-core processor developed by Intel since 2006, designed for high performance computing. In the current Top500 ranking, 3 out of 10 supercomputers use Xeon Phi. The second generation (KNL) can be used both as a PCI Express coprocessor, offloading part of the computation, or standalone, booting the operating system directly on the Xeon Phi, while the first generation (KNC) can be used only as a coprocessor. It is based on the Intel Many Integrated Core (MIC) architecture and its instruction set includes 512-bit wide vector instructions (AVX512), which allow up to 16 single-precision or 8 double-precision simultaneous operations.

To get the maximum performance from this kind of architecture it is necessary to design an application that exploits all the forms of parallelism present in the hardware. This means using all threads (the Xeon Phi has up to 72 cores with 4-way hyperthreading), vectorizing the code, and maximizing local data access.

4.4.3 Knights Landing

In our experimental setup we used a Knights Landing (KNL) (Jeffers, Reinders, and Sodani, 2016), the second generation of Xeon Phi, to benchmark the inference time of our models. The architecture, shown in figure 4.6, is built with Intel's 14 nm process technology and is made of 36 tiles interconnected by a 2D mesh. Each tile contains 2 cores (for a total of 72 cores), 2 VPUs per core and 1 MB of L2 cache shared between the pair of cores. The cores are based on the Atom Silvermont microarchitecture and support out-of-order execution and 4-way hyperthreading. Each VPU includes 2 AVX512 units and supports Fused Multiply-Add (FMA) operations.

While the first Xeon Phi generation, the Knights Corner (KNC), required the recompilation of applications, KNL supports legacy code without recompilation. The AVX512 instruction set has been extended with AVX512CD (Conflict Detection), AVX512PF (Prefetch) and AVX512ER (Exponential and Reciprocal) instructions.

The KNL includes an additional 16 GB of high bandwidth memory (MCDRAM). KNL cores are connected with a mesh-of-rings structure and three different cluster modes can be configured: All-to-All, Quadrant and Sub-NUMA-Clustering. These modes are selected at boot time. Additionally, three different modes are available to use the two memories (DDR and MCDRAM), also selected at boot time: in cache mode the MCDRAM is used as another cache level for the DDR memory in a way that is completely transparent to the application. This is the easiest option because it does not require any change to the application. Flat mode exposes the MCDRAM in the same address space as the DDR memory, and the application programmer must explicitly specify which data should be allocated on the MCDRAM.




A third option is the hybrid mode, which uses part of the MCDRAM as cache for the DDR memory and part as separate memory in the same address space as the DDR. The choice between the modes is a design decision and depends on the application's working set size.

4.4.4 Machine Learning on Xeon Phi

It is possible to use the Xeon Phi for neural network training and inference. Tensorflow (Abadi et al., 2015) supports the Xeon Phi and AVX512 instructions. To enable AVX512 instructions it is necessary to compile Tensorflow with the Math Kernel Library (MKL) backend. MKL implements vectorized math routines optimized for Intel architectures, such as convolution, matrix multiplication, transposition, local response normalization and tensor manipulations.

Since most deep learning frameworks are optimized for GPU execution, some modifications are needed to obtain the best performance on different hardware. For example, convolution input has a (samples, height, width, channels) format by default, while on CPU architectures the (samples, channels, height, width) format has better data locality and improves performance. This requires a modification in the network definition to ensure the correct format, otherwise MKL will perform the transposition at runtime, incurring additional cost.
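A minimal sketch of this data-layout change in Keras follows; it simply asks the framework to build convolutions in (samples, channels, height, width) order so that the MKL backend does not have to transpose tensors at runtime.

from tensorflow.keras import backend as K
from tensorflow.keras import layers

# Switch the default image layout to channels-first for CPU/MKL execution.
K.set_image_data_format("channels_first")

# Equivalently, the format can be set per layer.
conv = layers.Conv2D(16, 3, padding="same", data_format="channels_first")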

A number of optimizations are executed on the computational graph to ensure that data conversion operations are reduced to a minimum, Tensorflow operations are substituted with the corresponding MKL-optimized versions and operations are fused when possible.

Other parameters that can increase performance are the number of threads, which can be controlled with the OMP_NUM_THREADS environment variable; the intra-op and inter-op parallelism, which regulate the level of parallelism for each node of the computational graph; the batch size; the transposition for matrix multiplication; and how long each thread should wait after completing a task, which can be controlled with KMP_BLOCKTIME.
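The sketch below shows how these knobs could be set, using the TF 1.x configuration API that matches the framework version of this work; the specific values are placeholders that would be chosen by benchmarking on the target machine.

import os
import tensorflow as tf

# Environment variables read by the OpenMP runtime used by MKL.
os.environ["OMP_NUM_THREADS"] = "68"   # one OpenMP thread per physical core (placeholder)
os.environ["KMP_BLOCKTIME"] = "1"      # ms a thread waits before sleeping (placeholder)

# Intra-op parallelism splits work inside a graph node; inter-op parallelism
# lets independent nodes run concurrently.
config = tf.ConfigProto(
    intra_op_parallelism_threads=68,
    inter_op_parallelism_threads=2,
)
session = tf.Session(config=config)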

Furthermore, Intel announced that in Q4 2017 the new Intel Knights Mill will be released. This is a new Xeon Phi generation optimized for deep learning applications, with new vectorized instructions for low-precision operations (AVX512-4VNNIW and AVX512-4FMAPS).

4.5 FPGA

Another possibility is the use of FPGAs for neural network inference. These architectures can be optimized specifically for this task, while GPUs and manycores are more general architectures. FPGAs are interesting for embedded systems due to their low power requirements and limited memory size, but some experiments in the literature, like Umuroglu et al., 2017 and Qiu et al., 2016, explore their use inside data centers for large scale inference, where low power
