
Deep Learning techniques for wireless communications

Academic year: 2021



UNIVERSITÀ DI PISA

DIPARTIMENTO DI INGEGNERIA DELL'INFORMAZIONE

Master's Thesis in Telecommunications Engineering

DEEP LEARNING TECHNIQUES FOR WIRELESS COMMUNICATIONS

Author

S.T.V. (A.N.) Claudio ANDRIOLI

Supervisors:

Prof. Marco Luise, Prof. Luca Sanguinetti, Ing. Carmine Vitiello, Ing. Andrea Pizzo, T.V. (A.N.) Fabio Zorzi

Pisa, April 2018


This is... ...for my father, my only guide... ...for my mother, the woman of my life... ...for my brother, my star in the sky.


"There is no distance too far between friends, for friendship gives wings to the heart." Unknown Author.


Acknowledgements

My first thought goes to the wireless communications laboratory: a fantastic, close-knit staff that welcomed me from the very start not just as a student but above all as a friend. They supported me in the most delicate moments of my work, teaching me not to give up and to persist stubbornly even when my efforts went unrewarded and results were slow to come. I have stressed you quite a bit over these last months, but at the end of such an intense journey I can say that with you I spent the best period of my university life.

Mum, Dad, by the time you read these acknowledgements it will all be over. It seems like yesterday when, still a child, you accompanied me with tears in your eyes to the island of Sant'Elena, not yet sure of your choice. It was a hard experience, but one that made us grow and mature not only as people but above all as a family. I must thank you for everything you have done and are doing for us: your constant concern, your interest in every single decision and your affection are tangible proof of your love. You have always put us first, at the centre of your attention. Looking back, I see not only countless smiles but also many sacrifices that not everyone would have had the courage to make. I must thank you for this milestone; without your advice and encouragement none of this would have been possible. I must thank you simply for being unique.

My thoughts now turn to you, Sara. We have not known each other long, and yet you made yourself loved from the very beginning for your simplicity and for the small gestures which, I assure you, never went unnoticed. In a short time you have become, to all effects, part of our family, and I am sincerely happy that Luca met a girl like you on his path. A special thanks goes to my grandparents, those who are still here and those who would have wanted to be, and to the aunts, uncles and cousins who watched me grow up and become a man.

A sincere thank-you now goes to the city of Pisa: the city I fell in love with, the city that gave me new energy and a new life. The city that let me meet special people, from dear friends such as "cameretta 4/5" and my colleagues of the Armi Navali, Vale, Sando and Ivana, all the way to the love of my life: my "pic". To you I dedicate a special thought and a special place in my heart. You are the person who, in this city, has supported and put up with me, loved and hated me, the most. I owe you a great deal: you helped me grow, improve, and live life trying to seize the fleeting moment. None of this, however, would have been possible without your beautiful, large family, which raised you with the values that made you a special girl, my special girl. Thanks to your parents for welcoming me into their home like a son, and like a brother for Katia and Ale. Thanks also to your grandparents, aunts and uncles, who from the first moment have shown me so much affection. I thank Pisa for letting me meet again the friends of a lifetime, the friends I first met back in September 2006.

I thank the brothers of the Theseus course. Whatever city I have visited over these long years, I have always found one of you there, ready to give me carefree days, taking me back in time and proving that ours is not a simple friendship but a great family. In particular I thank Pietro, Andrea and Luigi. Over these years I have come to know you, discovered in you extraordinary people, and I am glad it was you who starred in my Pisan journey. Your presence shaped this chapter of my life, which would certainly have been less exciting without you. I promise to guard your memory jealously, the memory of our pizza nights and evenings spent between Orso Bruno and the banks of the river Arno. Thanks also to my "brother of stripes" Andrea Paletti who, besides welcoming me into the heart of London, actively contributed to the writing of this work, always with full availability and interest. I thank you, Fabio, for always being close to me in all these years. We have been through so much together, from holidays to the silliest quarrels which, like a proper couple, we never lacked. Our lives do not let us meet often, and yet here we still are, proof that true friendships have neither limits nor borders. Finally, I would like to thank my childhood friends, Claudio and Gianluca, and apologise for the time I could not devote to them over these long years. I hope time will bring us closer again and give us the chance to pick up our friendship where we left off.

I know, you think I have forgotten you, but that is not so. For you I have no words; it would be reductive to squeeze our bond into a few lines, so I will keep it short: thank you for loving me.


Summary

Vertical markets and industries are paving the way for a large diversity of heterogeneous services, use cases, and applications in future 5G networks. It is currently common understanding that, to satisfy all those needs, a flexible, adaptable, and programmable network architecture is required. In this context, operators need the ability to automate their architecture configuration and monitoring processes to reduce their OPerational EXpenditure (OPEX) and, more importantly, to ensure that the quality-of-service and quality-of-experience requirements of the offered services are not violated. The use of Artificial Intelligence (AI) techniques is emerging as a promising way to achieve these goals and to replace complex and expensive human-dependent decision-making processes. Among the many algorithms of the AI family and of its Machine Learning branch, schemes based on Artificial Neural Networks (ANNs) are becoming very popular in the context of future 5G networks. The primary objective of this thesis is to develop a fundamentally new way to think about communication system design: as an end-to-end reconstruction task, carried out through an ANN, that seeks to jointly optimize the transmitter and receiver components in a single process. As a first instance of this design approach, we apply the idea to an Orthogonal Frequency-Division Multiplexing (OFDM) system, and we show how it can be used to implement a receiver architecture that resembles classical receiving schemes without any a priori knowledge of the format of the communication standard.


List of Abbreviations

5G 5th Generation wireless systems
ADC Analog-to-Digital Converter
AE Autoencoder
AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
ASK Amplitude-Shift Keying
AWGN Additive White Gaussian Noise
BB BaseBand
BER Bit Error Rate
BGD Batch Gradient Descent
BPTT Back-Propagation Through Time
CFR Channel Frequency Response
CIR Channel Impulse Response
CP Cyclic Prefix
CUDA Compute Unified Device Architecture
DAC Digital-to-Analog Converter
DDS Direct Digital Synthesizer
Deep-FNN Deep Feed-Forward Neural Network
Deep-NN Deep Neural Network
DFT Discrete Fourier Transform
DL Deep Learning
FCW Frequency Control Word
FDM Frequency-Division Multiplexing
FNN Feed-Forward Neural Network
FSK Frequency-Shift Keying
FT Fourier Transform
GD Gradient Descent
GPUs Graphics Processing Units
IBI Inter-Block Interference
IDFT Inverse Discrete Fourier Transform
IoT Internet of Things
ISI Intersymbol Interference
ISO International Organization for Standardization
LP LowPass
LS Least Squares
LSTM Long Short-Term Memory
MAP Maximum A Posteriori probability
MbGD Mini-batch Gradient Descent
MF Matched Filter
ML Machine Learning
MLP Multi-Layer Perceptron
MMSE Minimum Mean Square Error
MSE Mean Square Error
NCO Numerically Controlled Oscillator
NLOS Non-Line-of-Sight
NN Neural Network
OFDM Orthogonal Frequency-Division Multiplexing
PDP Power-Delay Profile
PSK Phase-Shift Keying
QAM Quadrature Amplitude Modulation
QoS Quality of Service
RC Raised Cosine
ReLU Rectified Linear Unit
RF Radio Frequency
RL Reinforcement Learning
RNN Recurrent Neural Network
RRC Root Raised Cosine
RTRL Real-Time Recurrent Learning
SER Symbol Error Rate
SGD Stochastic Gradient Descent
SNR Signal-to-Noise Ratio
Tanh Hyperbolic tangent


Contents

List of Abbreviations

List of Figures

List of Tables

1 Introduction
  1.1 Introduction
  1.2 Main Contribution and Outline

2 Artificial Neural Networks Technology
  2.1 An introduction to Artificial Neural Networks
    2.1.1 Modelling ANNs
  2.2 Activation functions for Artificial Neural Networks
  2.3 Training process and backpropagation algorithm
  2.4 Types of Artificial Neural Networks
  2.5 Libraries for Machine Learning applications

3 An overview of single-carrier communication systems
  3.1 Introduction
  3.2 Optimum receiver for signals corrupted by AWGN
    3.2.1 Optimum single-carrier receiver for the AWGN channel

4 Artificial Neural Networks applied to end-to-end communication systems
  4.1 An introduction to Autoencoders
  4.2 Deep learning on communication systems
    4.2.2 Autoencoder technologies on end-to-end communication systems

5 OFDM: Modulation Technique on Wireless Communications
  5.1 An introduction to channel models
  5.2 An introduction to Orthogonal Frequency-Division Multiplexing
    5.2.1 OFDM transmitter
    5.2.2 OFDM receiver

6 Deep learning approach for Orthogonal Frequency-Division Multiplexing
  6.1 Artificial Intelligence on OFDM technologies
  6.2 Architecture of Neural Networks for the OFDM receiver
  6.3 Training data set model
  6.4 Performance of the deep learning approach
  6.5 Training impact on performance

7 Conclusions
  7.1 Perspectives


List of Figures

2.1 A sketch of a biological neuron [1].
2.2 An artificial neuron is a mathematical model of a biological neuron [1].
2.3 A Neural Network is composed of three different types of layers: input layer, hidden layers, and output layer. The lines between the nodes indicate the connections between nodes, characterized by specific values called weights. This particular type of neural network has more than one layer and the information flows only from the input to the output. Other types of neural networks exist which can have more intricate connections, such as feedback paths.
2.4 Activation functions in artificial neural networks.
2.5 Classification of the widespread ML algorithms available in the research literature used to train a NN [1].
2.6 Main architectures of Neural Networks.
2.7 Feed-forward Neural Network. In this network, the information moves only from the input layer to the output layer through the hidden layers.
2.8 Recurrent Neural Network. One or more feedback connections characterize an RNN.
2.9 The Back-Propagation Through Time algorithm, a technique for training RNNs.
2.10 Local and distributed implementations of the TensorFlow library [2].
3.1 ISO-OSI model [3].
3.2 Example of a rectangular pulse shape for wireless communication.
3.3 Quadrature Amplitude Modulation. Each symbol is connected to a string
3.4 Single-carrier transmitter for the AWGN channel.
3.5 Zero-IF down-converter. The received RF signal is converted to a BB signal by means of two different sinusoidal signals.
3.6 Single-carrier receiver for the AWGN channel.
4.1 Structure of an autoencoder with an input layer, encoder and decoder.
4.2 A traditional end-to-end communication system for the AWGN channel.
4.3 Receiver for the AWGN channel implemented as a deep-FNN composed of three hidden layers.
4.4 Loss and accuracy vs. number of epochs for both the training and testing stages.
4.5 BER vs. SNR of a communication system (using 4-QAM modulation) with a NN receiver, compared to the theoretical BER.
4.6 Structure of the AE for digital communication (training configuration).
4.7 End-to-end communication system designed as two deep-FNNs: an encoder (transmitter) and a decoder (receiver).
4.8 Loss and accuracy vs. number of epochs for both the training and testing stages.
4.9 BER vs. SNR of a communication system implemented as an autoencoder, compared to the theoretical BER.
4.10 Encoder output compared to Gray encoding without noise.
5.1 Mathematical model of a linear channel.
5.2 Mathematical model of a time-variant linear channel.
5.3 Impact of frequency-selective and flat channels.
5.4 Power-Delay Profile of the Urban macro-cell C2 model.
5.5 Urban macro-cell C2 channel realization applied to OFDM systems.
5.6 OFDM signal spectrum [4].
5.7 OFDM transmitter.
5.8 OFDM receiver.
6.1 Structure of the deep-FNN designed for the OFDM receiver.
6.2 OFDM receiver based on the deep-learning approach without computing the DFT.
6.3 OFDM receiver based on the deep-learning approach computing the DFT.
6.4 Received signal model for the OFDM system based on the deep-learning approach.
6.5 BER vs. SNR of the OFDM system based on the deep-learning approach on an AWGN channel, compared with the performance of an ideal OFDM system.
6.6 BER vs. SNR of the OFDM system based on the deep-learning approach over a Rayleigh channel, compared with three classical OFDM systems using different estimation algorithms.
6.7 BER vs. SNR of the OFDM system over the Urban macro-cell C2 channel model, compared to two classical OFDM systems which use different estimation algorithms.
6.8 BER vs. SNR of two OFDM systems based on the deep-learning approach over the Urban macro-cell C2 channel.
6.9 BER vs. SNR of OFDM receivers based on the deep-learning approach trained with different numbers of iterations.
6.10 Testing and training loss of the NNs for the OFDM receiver vs. epochs.
6.11 BER vs. SNR of OFDM receivers based on the deep-learning approach trained


List of Tables

2.1 Loss functions for the training process.


CHAPTER 1

Introduction

1.1 Introduction

“The development of full artificial intelligence could spell the end of the human race... It would take off on its own, and re-design itself at an ever increasing rate. Humans, who are limited by slow biological evolution, couldn't compete, and would be superseded”, Stephen Hawking told the BBC. Nowadays, Artificial Intelligence (AI) is a novel tool that finds application in many branches of active research, from computer science to intelligent software, through speech and image recognition and medical equipment. AI aims to design all these applications in an intelligent way, mimicking human behavior.

Human brains can find solutions to problems automatically through education, logic and the memory of past experience. In the same way, AI aims to reconstruct a human approach to achieving many tasks, learning from experience and creating its own background and knowledge without relying on an explicit mathematical method, description and solution for each problem.

The acquisition of this knowledge can be realized by extracting patterns from raw data. This capability is called Machine Learning (ML). ML performance depends on the data representation and on its particular features. The desired information to be acquired through data analysis may differ according to the specific application, so it is necessary to define the data correctly and with precise features. This operation is called representation learning. A representation learning algorithm allows discovering a large set of features (simple and complex) for achieving a specific task. Nevertheless, this approach sometimes proves unhelpful, since it is difficult to extract high-level, abstract features from raw data. In order to solve the representation problem by finding a simple model of the real world, a novel approach called Deep Learning (DL) has been introduced. In recent years, a deep learning machine has come to be identified with an Artificial Neural Network (ANN), an architecture inspired by biological neural networks. Similarly to those networks, deep learning is characterized by different layers able to represent the information. This simple ANN structure makes it an interesting subject of study for applications in wireless communications.

Wireless communication is being radically changed by the advent of the Internet of Things (IoT). Indeed, the traditional human-centric network will become a heterogeneous ecosystem where several types of devices, such as smartphones, sensors, drones and vehicles, will interact with each other, managing a large amount of real-time data in different operational scenarios. With the increasing number of services available on the wireless network, like cloud-based gaming, immersive virtual reality services, real-time HD streaming and conventional multimedia services, the next 5th generation wireless systems (5G) will have to guarantee different communication requirements, such as ultra reliability, low latency and high data rate.

In this novel view of wireless networking, AI, and in particular ML and ANNs, may represent very useful tools for satisfying these requirements, ensuring the Quality of Service (QoS) essential for the emerging wireless and IoT services. For example, AI can also play an interesting role in designing a physical layer without any a priori knowledge of the mathematical models that are used to implement an optimum communication system over a wide range of channels. Recent research has demonstrated how it is possible to implement physical-layer functions based on ML and ANN technologies [6]. Furthermore, AI can help optimize classical technologies: its ability to manage resources in a smart way allows it to address a variety of problems, ranging from cell association and radio access technology selection to frequency allocation, spectrum management, power control, and intelligent beamforming. AI is capable of operating in a fully online manner by learning the states of the wireless environment and of the network users, mimicking and learning human behavior and improving its own performance over time. In particular, ML can support the communication of traditional systems by taking into account imperfections and non-linearities, such as non-linear power amplifiers (PAs) and finite-resolution quantization, that can be captured only approximately by mathematical models. In this way, it will represent a helpful tool to create a network able to adapt its functions to human demands, maximizing the quality of experience and life of the user.

1.2 Main Contribution and Outline

The primary contribution of this work is to develop new strategies for designing the physical layer of wireless communication systems using machine learning tools, in order to redefine some functions of the communication system that are traditionally designed on rigid mathematical models; moreover, it is sometimes difficult to characterize natural phenomena with mathematical models at all. In this context, this thesis focuses on applying ML and Neural Network (NN) theory to design an entire single-carrier communication system over an ideal channel, and on a novel demodulation strategy, which subsumes channel estimation, equalization and demapping, applied to an Orthogonal Frequency-Division Multiplexing receiver, emphasizing the capability of this approach.

Firstly, we will introduce the main features of ANNs. The ANN architecture will be described as a complex system similar to a biological neural network, where elementary units called neurons are connected to each other. In Chapter 2, the components of a NN will be described, focusing on the training and learning process. Moreover, at the end of the chapter, we will present the most popular frameworks for ML applications, highlighting their ability to parallelize the training process by exploiting Graphics Processing Units (GPUs), thus reducing the training time.

Secondly, we will present a novel approach to designing a baseband end-to-end communication system for an ideal channel using NN theory. We will present a simple receiver built with a Deep Feed-Forward Neural Network (Deep-FNN) trained with a supervised learning approach. Then, we will study a communication system as an application of an Autoencoder (AE). The AE is a particular NN which must recover, at its output, the signal presented at its input through the combination of an encoder and a decoder, converting the input data into a different representation and then back into the original format. Thanks to this particular structure, the AE is very helpful for designing a communication system by dividing the application into two stages: first the AE is trained in a supervised manner, and then it is broken down into two Deep-FNNs, which work as transmitter and receiver separately.
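To make the two-stage idea concrete, here is a minimal toy sketch in plain Python. All values are hypothetical: a QPSK-like constellation stands in for what a trained encoder might learn, and a minimum-distance rule stands in for the trained decoder network; neither is taken from the thesis.

```python
import random

# Hypothetical toy instance of the autoencoder view of a link: the encoder
# maps one of M = 4 messages to a 2-D channel symbol, the channel adds
# Gaussian noise, and the decoder recovers the most likely message.
CONSTELLATION = [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # QPSK-like, illustrative

def encoder(msg):
    """Transmitter half: map a message index to a channel symbol."""
    return CONSTELLATION[msg]

def channel(sym, sigma, rng):
    """AWGN channel: add independent Gaussian noise to each component."""
    return tuple(s + rng.gauss(0.0, sigma) for s in sym)

def decoder(y):
    """Receiver half: minimum-distance decision, standing in for the
    decision rule a trained decoder network would learn."""
    return min(range(len(CONSTELLATION)),
               key=lambda m: sum((a - b) ** 2
                                 for a, b in zip(y, CONSTELLATION[m])))

rng = random.Random(1)
errors = sum(decoder(channel(encoder(m), 0.3, rng)) != m
             for _ in range(250) for m in range(4))
print(errors)  # small at this noise level
```

The split mirrors the training procedure described above: once the end-to-end chain is learned, `encoder` and `decoder` can be deployed as separate transmitter and receiver blocks.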

The last chapters introduce a deep learning approach applied to channel estimation, equalization and symbol detection in an OFDM system. We will demonstrate that NNs have the ability to recover the transmitted information by learning and analyzing the characteristics of wireless channels. Furthermore, we will emphasize how the NN improves the channel estimation when a large training data set and a large number of iterations are used, increasing the overall performance. Moreover, two different problems will be studied and compared with each other: we will revise a traditional OFDM receiver, first replacing the Discrete Fourier Transform (DFT), channel estimation, equalization and demapping with a complex system of NNs, and then applying the same considerations to an OFDM system designed to compute the DFT in the traditional way.


CHAPTER 2

Artificial Neural Networks Technology

2.1 An introduction to Artificial Neural Networks

The pillar of AI is constituted by the ANN, thanks to its ability to model complex relationships between input and output and to extract the statistical structure of an unknown joint probability distribution function from observed data, mimicking human behaviour. ANNs are inspired by the structure and functional aspects of biological neural networks, which can learn information from complicated and imprecise observational data. Furthermore, ANNs may process data in a way similar to the human brain. An ANN is composed of a large number of interconnected processing elements built to achieve a certain aim; in particular, they can carry out tasks that neither the human brain nor a traditional computer can easily solve. Moreover, an ANN can create its own organization or representation of the information it receives during the learning process.

2.1.1 Modelling ANNs

ANNs have a structure similar to a nervous system, composed of different layers made up of a number of simple and highly interconnected processing elements called neurons. A biological neuron elaborates the input received from other neurons and retransmits the resulting signal to adjacent neurons.

Figure 2.1: A sketch of a biological neuron [1].

Figure 2.2: An artificial neuron is a mathematical model of a biological neuron [1].

The structure of a biological neuron is shown in figure 2.1: it is composed of a nucleus, dendrites and axons; the connection point between two adjacent neurons is called a synapse.

The dendrites and axons represent the connections with other neurons, each characterized by a specific strength. They permit, respectively, receiving and transmitting electrical impulses: in a biological neuron the nucleus receives impulses from the previous neurons through the dendrites, elaborates the input by changing the polarization of its membrane and, when the membrane potential exceeds a certain value, becomes active and sends the signal to the connected neurons.

The same considerations can also be applied to an artificial neuron. As figure 2.2 shows, in an ANN a neuron $n_{i,l}$ is designed to have the same features as a biological neuron: each neuron is connected with the previous ($n_{i,l-1}$) and next ($n_{i,l+1}$) neurons through different links. Each neuron plays a specific role on the basis of a non-linear function called the activation function $f(\cdot)$, which computes one or more outputs from its inputs $r_{i,l-1}$ and a set of parameters called weights $\theta_{i,l}$. In general, a neuron has a certain number of incoming connections $N_{in}$ (dendrites), which receive the signal from the previous nodes, and outgoing connections $N_{out}$ (axons), which retransmit the output to the next neurons.

Figure 2.3: A Neural Network is composed of three different types of layers: input layer, hidden layers, and output layer. The lines between the nodes indicate the connections between nodes, characterized by specific values called weights. This particular type of neural network has more than one layer and the information flows only from the input to the output. Other types of neural networks exist which can have more intricate connections, such as feedback paths.

The activation function for every neuron in a layer must be the same. Furthermore, each neuron must be connected only with the neurons of the adjacent layers; this means that connections between the neurons of the same layer are not allowed.

Basically, the ANN is composed of several layers (fig. 2.3): an input layer, where the number of neurons coincides with the number of input signals; an output layer, where each neuron represents an output signal; and one or more hidden layers with a fixed number of neurons, which have to solve a specific task. The hidden layers may be thought of as a black box used to find a relationship between input and output.

The structure of the ANN and its input and output signals can be synthesized mathematically. The input and output signals of each layer can be represented through a vector $r$. Furthermore, as previously mentioned, all the neurons within a layer are characterized by connection weights ($\theta_l$); thus, we can assume that each layer is described by a matrix, whose elements can be interpreted as the strengths of the connections between the neurons in one layer and the neurons in the successive layer. In particular, we can distinguish an input weight matrix, hidden weight matrices and an output weight matrix.

The previous considerations permit computing a neuron output through the following formula:

$$ r_{i,l} = f_l(\theta_{i,l}, r_{l-1}) \in \mathbb{R}^{N_l} \qquad (2.1) $$

where $r_{i,l}$ is the output signal of the neuron $n_{i,l}$, $f_l : \mathbb{R}^{N_{l-1}} \rightarrow \mathbb{R}^{N_l}$, $\forall l \in [1, L]$, is the $l$-th layer activation function, $\theta_{i,l}$ is the weight of the $i$-th neuron in the $l$-th layer, $N_l$ is the number of neurons in the $l$-th layer and $r_{l-1}$ is the $(l-1)$-th layer output signal. Thus, supposing that $r_0 \in \mathbb{R}^{N_0}$ is the NN input and $r_L \in \mathbb{R}^{N_L}$ is the NN output, as in figure 2.3, we can extend the previous consideration:

$$
\begin{cases}
r_1 = f_1(\theta_1, r_0) \in \mathbb{R}^{N_1} \\
\quad\vdots \\
r_l = f_l(\theta_l, r_{l-1}) \in \mathbb{R}^{N_l} \\
\quad\vdots \\
r_L = f_L(\theta_L, r_{L-1}) \in \mathbb{R}^{N_L}
\end{cases}
\qquad (2.2)
$$

where $\Theta = \{\theta_1, \theta_2, \dots, \theta_L\}$ is the set of connection weights. The weights $\theta_l$ are random variables; thus, the mapping of the input $r_0$ to the output vector $r_L$ across $L$ iterative steps will also be stochastic. The weights $\theta_l$ usually depend on two other parameters: a matrix $W$ and a bias vector $b$. In this case the layer is called dense or fully connected, and the argument of the activation function for each neuron is a linear combination of the input $r_{l-1}$, the matrix $W$ and the bias $b$:

$$ r_{i,l} = f_l\Big(b_{i,l} + \sum_{j=1}^{N_{l-1}} r_{j,l-1}\, w_{ij,l}\Big). \qquad (2.3) $$
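A dense layer as in eq. (2.3), and the layer-by-layer chaining of eq. (2.2), can be sketched in a few lines of plain Python. The weights, biases and the ReLU choice below are illustrative values of our own, not taken from the thesis:

```python
def relu(x):
    """A common activation function choice: max(0, x)."""
    return max(0.0, x)

def dense_layer(r_prev, W, b, f):
    """One fully connected layer, eq. (2.3):
    r_i = f(b_i + sum_j r_prev[j] * W[i][j])."""
    return [f(b_i + sum(r * w for r, w in zip(r_prev, W_i)))
            for W_i, b_i in zip(W, b)]

# Toy two-layer network, chained as in eq. (2.2): r0 -> r1 -> r2.
r0 = [1.0, -1.0]
r1 = dense_layer(r0, [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.5], relu)
r2 = dense_layer(r1, [[1.0, 1.0]], [0.0], relu)
print(r1)  # [0.0, 2.5]
print(r2)  # [2.5]
```

Each row of `W` holds the incoming connection strengths of one neuron, matching the weight-matrix interpretation given above.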

2.2 Activation functions for Artificial Neural Networks

The main topic of this section is to illustrate the most typical classes of activation functions and their purposes. The activation functions $f(\cdot)$ are usually nonlinear, since linear functions are unbounded: if a linear function were used as the activation function, the NN would not reach a stable state after the training process. This is an important constraint, because it limits the set of functions we can choose to carry out the given task. Moreover, within the same NN the activation function can differ from layer to layer. For these reasons, the criteria adopted to choose the activation function are based on the following points:

• the activation function must be nonlinear;
• the training process must be optimized.

As already mentioned, the activation function is selected to achieve the aim of the NN. In this work a set of functions satisfying the previous conditions is proposed; in particular, the list is taken from [7]. The proposed activation functions permit setting up a NN that realizes a physical layer for a communication system. To achieve this aim, both non-dense and dense activation functions must be studied, together with their purpose and functionality.

The first non-dense activation function we mention is called Noise. Its aim is to reproduce the behaviour of a noisy communication channel by adding a random vector $n \sim \mathcal{N}(0, \beta I_{N_{l-1}})$. Thus, the output of this layer can be computed through the following function:

$$ f_l(r_{l-1}; \theta_l) = r_{l-1} + n \qquad (2.4) $$

where $n$ is a vector of noise. The second type of layer characterized by a non-dense activation function is the normalization layer. This layer is useful during the training process to regulate the weights of each layer. As Sergey Ioffe notes in his paper [8], the advantages of using a normalization layer are:

• the improvement of the gradient flow in stochastic gradient descent (the most common training algorithm);
• the possibility of increasing the learning rate;
• no need for an initialization layer.

Many different functions can be used to evaluate the normalization layer output. Following Tim O'Shea, in this work we will assume that the activation function for a normalization layer is defined as:

$$ f_l(r_{l-1}; \theta_l) = \sqrt{N_{l-1}}\, \frac{r_{l-1}}{\lVert r_{l-1} \rVert_2}. \qquad (2.5) $$
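These two non-dense layers are easy to prototype. The plain-Python sketch below implements eqs. (2.4) and (2.5); the value of β and the random seed are our own illustrative choices, and the final check verifies the defining property of eq. (2.5), namely that the normalized output has squared norm equal to $N_{l-1}$:

```python
import math
import random

def noise_layer(r, beta, rng):
    """Eq. (2.4): add i.i.d. Gaussian noise n ~ N(0, beta * I)."""
    return [x + rng.gauss(0.0, math.sqrt(beta)) for x in r]

def normalization_layer(r):
    """Eq. (2.5): rescale so that ||output||_2 = sqrt(len(r))."""
    norm = math.sqrt(sum(x * x for x in r))
    scale = math.sqrt(len(r)) / norm
    return [scale * x for x in r]

rng = random.Random(0)
symbols = normalization_layer([3.0, 4.0])    # energy-normalized "transmit" vector
noisy = noise_layer(symbols, 0.01, rng)      # channel perturbation
print(sum(x * x for x in symbols))           # ~2.0, i.e. N = 2
```

In the autoencoder systems of later chapters, a normalization layer of this kind plays the role of a transmit power constraint, while the noise layer emulates the channel during training.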

Layers with dense activation functions are usually more common and useful than layers with non-dense activation functions. It is necessary to introduce five different types of dense activation functions (fig. 2.4), which allow realizing a NN that plays the role of a transmitter or receiver in a communication system:


Chapter 2. Artificial Neural Networks Technology

Figure 2.4: The main dense activation functions: (a) linear, (b) Rectified Linear Unit, (c) sigmoid, (d) hyperbolic tangent, (e) softmax.



• Linear activation function. It is typically used at the output layer in the context of regression tasks such as estimation of a real-valued vector. It can be defined as:

f_l(r_{l−1}; W_l, b_l) = r_{l−1}. (2.6)

• Rectified Linear Unit (ReLU). It is a simple threshold at 0. It is also known as the positive ramp function, because its output is equal to the input if the input is greater than 0, and 0 otherwise:

f_l(r_{l−1}; W_l, b_l) = max(0, r_{l−1}). (2.7)

[9] reports that a ReLU layer has several pros and cons. The operations of this layer are quite simple; thus it is characterized by a faster training process than other activation functions such as sigmoid or tanh. However, a neuron with a ReLU activation function is often fragile during training and it can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. This happens more frequently when the learning rate is too high; thus, a good choice of this parameter can diminish the problem.

• Sigmoid. The sigmoid is a nonlinear function, described by the following formula:

f_l(r_{l−1}; W_l, b_l) = 1 / (1 + e^{−r_{l−1}}). (2.8)

The main purpose of a sigmoid layer is to map the input values into the range [0,1]. [9] reports that it is rarely used nowadays, since it has two major drawbacks. Firstly, it saturates and kills gradients during the training process. Training is usually based on the Gradient Descent algorithm, and the consequence of sigmoid neuron saturation is that the local gradient rapidly goes to almost zero, so no signal can flow through the neuron. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation: for example, if the initial weights are too large, then most neurons will saturate and the network will barely learn. Moreover, sigmoid outputs are not zero-centred. This is undesirable, since neurons in later layers of a Neural Network would receive data that is not zero-centred. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive, then the gradients on the weights Θ will become either all positive or all negative.

• Hyperbolic tangent (Tanh). Similar to the sigmoid, tanh compresses the input signal into the range [-1,1]. The mathematical expression which identifies the tanh layer behaviour is:

f_l(r_{l−1}; W_l, b_l) = tanh(r_{l−1}). (2.9)

Also in this case, tanh still saturates and kills gradients, but it is zero-centred. For that reason, the tanh activation function is more commonly used than the sigmoid activation function.

• Softmax activation function. It is also a nonlinear function, used for classification. According to [7], it can be defined as:

f_l(r_{l−1}; W_l, b_l) = e^{r_{l−1}} / Σ_j e^{r_{j,l−1}}. (2.10)

Like the sigmoid function, it squashes the input between 0 and 1, but in this case the outputs sum to 1. Thus, it produces a probability for each class, which can be used to classify each input. Therefore, the last layer of a classification Neural Network is usually characterized by a softmax function.
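A compact NumPy sketch of the five dense activation functions (2.6)–(2.10); the max-subtraction in the softmax is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def linear(r):                 # (2.6): identity, used at regression output layers
    return r

def relu(r):                   # (2.7): simple threshold at zero
    return np.maximum(0.0, r)

def sigmoid(r):                # (2.8): squashes the input into (0, 1)
    return 1.0 / (1.0 + np.exp(-r))

def tanh(r):                   # (2.9): squashes the input into (-1, 1), zero-centred
    return np.tanh(r)

def softmax(r):                # (2.10): non-negative outputs summing to 1
    e = np.exp(r - np.max(r))  # subtracting the max avoids overflow in exp
    return e / e.sum()
```

For example, `softmax(np.array([1.0, 2.0, 3.0]))` returns a probability vector whose components sum to 1, which is why it is the usual choice for the last layer of a classifier.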

Training process and backpropagation algorithm

Every NN, as previously described, requires a time window to learn information from its input, in order to adjust and update the weights of the connections between the neurons in the system. This process is better known as the training process. In general, training can be done through three different methods [10]: a) supervised learning, b) unsupervised learning, and c) Reinforcement Learning (RL).



Figure 2.5: Classification of the widespread ML algorithms available in the research literature used to train a NN [1].

As figure 2.5 shows, the choice of the training algorithm depends on the task which must be achieved by the ANN.

Supervised learning is based on the use of labelled data, where input and output are known a priori; on the contrary, unsupervised learning permits exploring the data and inferring the right structures without labelled data. In other words, the desired output for each input is unknown. For example, an unsupervised approach could use the method of moments, since the parameters may be considered random variables. For this reason, it is impossible to determine the accuracy of the training process in this approach. Moreover, a mix of both (semi-supervised learning) is allowed; it is used when the cost of a fully labelled training set is relatively high. RL is an approach where the ANN learns from the work environment to find the best strategies for a given agent, in every environment. The difference between a), b) and c) is that the first two methods need to be trained with historical data, whereas RL is trained with the data coming from the implementation.

Firstly, it is necessary to understand how the training process can be carried out. For that purpose, we will focus our attention on supervised learning algorithms. In particular, the goal of these algorithms is to find the weight matrix which minimizes a particular cost function, such as the error between the desired output signal and the actual output signal:

L(Θ) = (1/S) Σ_{j=1}^{S} L(r^d_{j,L}, r_{j,L} | r_{j,0}) (2.11)

where r^d_{j,L} and r_{j,L} are respectively the desired output and the current output for the j-th training input r_{j,0}, S is the number of training examples, and L : ℝ^{N_L} × ℝ^{N_L} → ℝ is the loss function.

According to [11] and [7], several loss functions can be adopted. The choice of the loss function impacts the operating principle of the NN. In [12] a loss function is described as a measure of "the quality of a particular set of parameters based on how well the induced scores agreed with the ground truth labels in the training data". Table 2.1 reports ten different loss functions which "appear in ML literature, however some in slightly different context than a classification loss". The MSE loss function, e.g., "is sometimes (however nowadays quite rarely) applied to weights in order to prevent them from growing to infinity", and "it still has reasonable probabilistic interpretation for classification and can be used as a main classification objective". In particular, this work focuses on the Mean Square Error (MSE) and the Categorical Cross-Entropy.

In table 2.1, π(·) denotes a probability estimate, r̂^d_{j,k,L} is the true label in +1/−1 encoding, and the subscript k indicates the k-th dimension of a given vector.

symbol     name                                    equation
L1         L1 loss                                 ‖ r^d_{j,L} − r_{j,L} ‖_1
L2         Mean Square Error (MSE)                 ‖ r^d_{j,L} − r_{j,L} ‖_2^2
L1 ∘ π     expectation loss                        ‖ r^d_{j,L} − π(r_{j,L}) ‖_1
L2 ∘ π     regularised expectation loss            ‖ r^d_{j,L} − π(r_{j,L}) ‖_2^2
L∞         Chebyshev loss                          max_k | π(r_{j,k,L}) − r^d_{j,k,L} |
log        log (Categorical Cross-Entropy) loss    − Σ_k r^d_{j,k,L} log(r_{j,k,L})
log^2      squared log loss                        − Σ_k [ r^d_{j,k,L} log(r_{j,k,L}) ]^2
hinge      hinge (margin) loss                     Σ_k max(0, 1/2 − r̂^d_{j,k,L} r_{j,k,L})
hinge^2    squared hinge (margin) loss             Σ_k max(0, 1/2 − r̂^d_{j,k,L} r_{j,k,L})^2
hinge^3    cubed hinge (margin) loss               Σ_k max(0, 1/2 − r̂^d_{j,k,L} r_{j,k,L})^3

Table 2.1: Loss functions for the training process.
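The two losses this work focuses on, MSE and Categorical Cross-Entropy, can be sketched in NumPy as follows (function names are ours; note that the MSE row of the table is a squared L2 norm, without averaging over dimensions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error (L2 row of table 2.1): squared L2 norm of the error."""
    return np.sum((y_true - y_pred) ** 2)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """log-loss row of table 2.1: y_true is one-hot, y_pred are probabilities.

    The small eps guards against log(0) when a predicted probability is zero.
    """
    return -np.sum(y_true * np.log(y_pred + eps))
```

For a perfect prediction the MSE is 0, while the cross-entropy of a uniform two-class prediction against a one-hot target is −log(0.5) ≈ 0.693.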

The most common method to perform this minimization is the Gradient Descent (GD) algorithm. The principal aim of GD is to update the weights so that the desired output r^d_{j,L} is equal to the current output r_{j,L} for all training inputs r_{j,0}; in other words, the loss function L(Θ) must move towards zero. The next considerations are based on [13], which defines GD as "one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks". Moreover, Ruder reports that "Gradient descent is a way to minimize an objective function l(θ) parameterized by a model's parameters θ ∈ ℝ^p by updating the parameters in the opposite direction of the gradient of the objective function ∇_θ l(θ) w.r.t. the parameters". GD depends on the learning rate η, which determines the size of the steps taken to "reach a (local) minimum".

Several variants of GD exist. They "differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update". In particular, Ruder distinguishes:

• Batch Gradient Descent (BGD). This approach converges to the global minimum for convex loss functions, and to a local minimum for non-convex ones. θ is computed over the whole dataset to perform just one update, according to the formula:

θ_{t+1} = θ_t − η∇_θ L(θ). (2.12)

The BGD drawback is that it can be "very slow and intractable for datasets which do not fit in memory", and it does not allow updating our model in an online manner.

• Stochastic Gradient Descent (SGD). It updates the parameters θ for each training example x^{(i)} and label y^{(i)}:

θ_{t+1} = θ_t − η∇_θ L(θ; x^{(i)}, y^{(i)}). (2.13)

"SGD performs frequent updates with a high variance that cause the objective function to fluctuate". The fluctuations permit it "to jump to new and potentially better local minima" at each step. However, convergence to the exact minimum is not always achieved. Moreover, SGD and GD show the same behaviour when η decreases: they converge to a "local or global minimum for non-convex and convex optimization respectively".

SGD is usually faster than GD: "GD performs redundant computations for large datasets, as it computes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time". Thanks to its high convergence speed, it is useful for online applications.

• Mini-batch Gradient Descent (MbGD). It usually achieves better performance than GD and SGD. The weights are updated on a mini-batch of n training examples, according to the following formula:

θ_{t+1} = θ_t − η∇_θ L(θ; x^{(i:i+n)}, y^{(i:i+n)}). (2.14)

In [13] two MbGD advantages are reported: "a) it reduces the variance of the parameter updates, which can lead to more stable convergence; and b) it can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient". Mini-batch sizes can vary for different applications, but they are usually between 50 and 256.
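The three variants above can be sketched in a few lines of NumPy on a toy least-squares problem (the toy model and variable names are ours, not from [13]):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x                        # toy noiseless target: the true weight is 3

def grad(w, xb, yb):
    """Gradient of the mean squared error (w*x - y)^2 with respect to w."""
    return 2.0 * np.mean((w * xb - yb) * xb)

w, eta, batch = 0.0, 0.1, 32       # initial weight, learning rate, mini-batch size
for epoch in range(50):
    idx = rng.permutation(len(x))  # reshuffle the dataset at every epoch
    for s in range(0, len(x), batch):
        b = idx[s:s + batch]
        w -= eta * grad(w, x[b], y[b])   # one update per mini-batch, as in (2.14)

# batch = len(x) recovers BGD (2.12); batch = 1 recovers SGD (2.13)
```

The same loop therefore implements all three variants just by changing the mini-batch size, which is exactly the trade-off Ruder describes between update accuracy and update cost.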

GD and its variants are used together with the backpropagation algorithm, which is very useful to compute the gradient for each neuron. In this work, the same approach of [14] has been followed to describe the backpropagation method; in particular, only fully connected layers are considered (θ_{i,l} = {w_{i,l}, b_{i,l}}).

The backpropagation can be summarized through the following steps. Firstly, the input signal is propagated from the input layer to the output layer. Secondly, the error between the desired output and the current output is computed. This error also permits defining an error propagation term as the first-order derivative of the previously computed error. It is also simple to verify that the propagated error of a neuron (n_{j,l}) depends both on the value of its output and on the outputs of the next-layer neurons (n_{j,l+1}). This is the main point of the backpropagation algorithm: the backpropagated error is calculated from the output layer to the input layer. After that, the backpropagated error is used to update the weight values through (2.12), (2.13) and (2.14). This approach is repeated until the minimum of (2.11) is achieved for each weight.

Backpropagation and gradient descent do not guarantee finding a global minimum: the GD target is usually to update the weights in order to achieve a local minimum. Moreover, supervised learning is based on a set of labelled data: if the labelled dataset is too large, then the backpropagation algorithm will be very slow. For that reason, SGD is more useful than BGD. However, MbGD is the best way to update θ, since the weight values computed through SGD do not converge to the optimum value. The mini-batch also does not guarantee convergence to a global minimum; in general, convergence depends on the learning rate value and on the size of the mini-batch.

The size of the mini-batch also impacts the overfitting and underfitting problems. In particular, when the size is too small with respect to the number of neurons, the NN can learn the random fluctuations and the noise during the training process. On the contrary, if the mini-batch size is too large, the learning algorithm will not fit the data well enough. To overcome these problems, the training algorithm can be supervised through four different approaches:

• Dataset augmentation. The learning algorithm processes more training data to avoid learning noise.

• Early stopping. In the training process, a test error and a training error are introduced. The main purpose of the training algorithm is to decrease both. Increasing the training steps excessively, the model could start overfitting and learning noise: in this case, the training error keeps decreasing while the test error increases. Thus, to avoid this problem, it is possible to stop the training early by setting the optimum number of steps.

• Dropout layer. As previously mentioned, overfitting starts when the number of nodes for each layer is too high with respect to the size of the mini-batch. For that reason, a good way to avoid overfitting is to remove some nodes from the network. In [13] it is reported that with this approach "the nodes become more insensitive to the weights of other nodes, and therefore the model becomes more robust. If a hidden unit has to be working well with different combinations of other hidden units, it's more likely to do work well individually".

• Weight penalty (or weight decay). In general, a NN with a small number of weights is better than a network with a large number of weights. Thus, the weight penalty allows increasing the number of neurons in the network: when the gradient assumes large values, it drives all the weights towards smaller values.
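For instance, the dropout idea can be sketched with the common "inverted dropout" formulation in NumPy (the function name and scaling convention are ours, not from [13]):

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout: during training, zero a fraction p_drop of the
    activations and rescale the survivors by 1/(1 - p_drop), so that the
    expected activation is unchanged and no rescaling is needed at test time."""
    if not train:
        return h                      # at test time the layer is the identity
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```

Because the surviving activations are rescaled during training, the same network can be used unmodified at inference time, which is why this variant is the one usually implemented in practice.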

The training process has been described in this section: the main loss functions, GD and the backpropagation algorithm have been introduced. Moreover, GD and backpropagation usually suffer from two problems, overfitting and underfitting, which depend on the structure of the NN. A large number of NN structures exist, determined by the target of each NN. In the next section, different NN constructions are described and some examples are proposed.

Types of Artificial Neural Networks

As mentioned in the previous section, a NN can be implemented in several ways, on the basis of the target that must be achieved. Basically, a NN is composed of three layers: input layer, hidden layer and output layer. On the basis of the connections and the



Figure 2.6: Main architectures of Neural Networks.

Figure 2.7: Feed-forward Neural Network. In this network, the information moves only from the input layer to the output layer through the hidden layer.



Figure 2.8: Recurrent Neural Network. One or more feedback connections characterize an RNN.

role of each layer, a classification of NNs can be drawn. Figure 2.6 summarises a list of NNs which have been widely utilized. Among these, we focus on three structures which will be adopted throughout this work: the Feed-Forward Neural Network (FNN), the Recurrent Neural Network (RNN), and the Deep Neural Network (Deep-NN). The basic NN structure is the FNN, represented in figure 2.7. It is composed of three layers: the input layer, the output layer and the hidden layer. Each layer is connected to the next layer, and connections with the previous layers are absent. This means that there are no loops inside the network. A FNN is trained through the GD and backpropagation algorithms already described. A RNN is instead characterized by feedback connections, as figure 2.8 shows. According to [15], "the RNN can have many different forms. One common type consists of a standard Multi-Layer Perceptron (MLP) plus added loops. These can exploit the powerful non-linear mapping capabilities of the Multi-Layer Perceptron, and also have some form of memory. Others have more uniform structures, potentially with every neuron connected to all the others, and may also have stochastic activation functions". The learning process of a RNN differs from the gradient descent and backpropagation used to train a FNN. It is possible to distinguish two different approaches: the Real-Time Recurrent Learning (RTRL) and the Back-Propagation Through Time (BPTT) learning algorithms.

According to [15], BPTT "extends the ordinary BP algorithm to suit the recurrent neural architecture". In [16] BPTT is described as "a very powerful tool with applications such as pattern recognition and dynamic modelling. It can be applied to NNs, to econometric models, to fuzzy logic structures, to fluid dynamics models, and to almost any system built up from elementary subsystems or calculations". BPTT consists of unrolling the recurrent structure into a form similar to a feed-forward NN and then applying backpropagation, as in figure 2.9. Moreover, it may get more often trapped in numerous



Figure 2.9: The figure represents the Back-Propagation Through Time algorithm, which is a technique for training RNNs.

sub-optimal local minima due to the cyclic connections, and it may be very time-consuming if the training set is very large.

Real-Time Recurrent Learning (RTRL) is used to overcome this problem. Its main goal is to compute the exact error gradient at every time step; indeed, this approach can be used for online learning tasks. In RTRL, the weight update depends not only on the gradient value at time t but also on the gradient value at the previous time instant (unlike the backpropagation algorithm). In other words, the error propagation is forward in time instead of backward. In this case the problem is the high computational complexity. Training the weights of RNNs can also be modelled as a nonlinear global optimization problem; for instance, the most common global optimization method for training RNNs is via the use of genetic algorithms. In general, arbitrary global optimization techniques may then be used to minimize the sum-squared difference.
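As a toy NumPy illustration (the cell and function names are ours), unrolling a simple recurrent cell over time makes clear why BPTT reduces to ordinary backpropagation on the unrolled, feed-forward graph:

```python
import numpy as np

def rnn_forward(x_seq, h0, W_x, W_h, b):
    """Unroll a simple recurrent cell over a sequence:
    h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    The returned list of hidden states is exactly the unrolled graph on
    which BPTT applies ordinary backpropagation, one copy per time step.
    """
    h, hs = h0, []
    for x_t in x_seq:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hs.append(h)
    return hs
```

Each list entry corresponds to one "layer" of the unrolled network; the shared weights W_x, W_h and b are reused at every step, which is what makes the gradient accumulate over time during BPTT.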

The most useful NN is the Deep-NN. This kind of NN is shown in figure 2.3. It is composed of several hidden layers, connected to each other and to the input and output layers. Usually they introduce a nonlinear transformation. Different types of Deep-NN exist, such as the deep Convolutional Neural Network. Using more than one hidden layer, it is also possible to implement deep configurations of FNN and RNN: a deep-FNN and a deep-RNN, e.g. the Long Short-Term Memory (LSTM).

A deep configuration generally achieves higher performance than a shallow Neural Network. Firstly, it can improve the computing capability. This is possible thanks to the advances in hardware development and in data processing capability: for instance, the Graphic Processing Units (GPUs), designed for computer graphics, permit speeding up the execution of machine learning algorithms. Secondly, a large amount of data can be managed more easily, and finally a Deep-NN can be realized with a lower number of neurons per layer. For this reason, the Deep-NN is more versatile than a traditional NN. Owing to the large number of parameters that compose a Deep-NN, overfitting becomes an important problem. It can be mitigated with the methods mentioned in the previous section.

The purpose of this work is to implement an end-to-end baseband communication system with a Neural Network. In the next chapter we will describe how to achieve this goal: basically, a Deep-FNN with no more than 9 layers will be utilized. Before that, some libraries and tools to build and train large NNs have to be introduced.

Libraries for Machine Learning application

In recent years, ML has found wide application in computer vision, with a high computational complexity. For that reason, a large number of tools and libraries to build and train NNs have emerged. In this section we analyze the most common tools and libraries which use a high-level language to take advantage of GPU capabilities. Among these we can mention Caffe, MXNet, Theano, Torch, TensorFlow and Keras. According to [7], they "allow for high level algorithm definition in various programming languages or configuration files, automatic differentiation of training loss functions through arbitrarily large networks, and compilation of the network's forwards and backwards passes into hardware optimized concurrent dense matrix algebra kernels".

Most of these frameworks support the Compute Unified Device Architecture (CUDA). CUDA is a parallel computing platform and programming model that enables the use of a GPU for general-purpose computation. In agreement with [17], "CUDA is an extension to C based on a few easily-learned abstractions for parallel programming, coprocessor offload, and a few corresponding additions to C syntax. CUDA represents the coprocessor as a device that can run a large number of threads. The threads are managed by representing parallel tasks as kernels (the sequence of work to be done in each thread) mapped over a domain (the set of threads to be invoked). Kernels are scalar and represent the work to be done at a single point in the domain. The kernel is then invoked as a thread at every point in



the domain. The parallel threads share memory and synchronize using barriers. Data is prepared for processing on the GPU by copying it to the graphics board's memory". For that reason, CUDA permits performing a large number of mathematical operations faster, thanks to GPU capabilities: a parallel computing platform can run a significantly larger number of operations per second than the CPU. Moreover, "in this context, graphics processors (GPUs) become attractive because they offer extensive resources even for non-visual, general-purpose computations: massive parallelism, high memory bandwidth, and general purpose instruction sets, including support for both single and double precision IEEE floating point arithmetic (albeit with some limitations on rounding). In fact, GPUs are really "manycore" processors, with hundreds of processing elements". This is exactly why CUDA finds a great deal of application not only in ML fields, but also in fast video transcoding, video enhancement, medical imaging, computational sciences, neural networks, gate-level VLSI simulation and fluid dynamics.

Caffe is a popular deep learning framework. Its core language is C++; however, Matlab and Python also provide a Caffe interface for feature extraction and the training of simple models. In [18] Caffe is described as "a complete toolkit for training, testing, finetuning, and deploying models, with well-documented examples for all of these tasks. As such, it is an ideal starting point for researchers and other developers looking to jump into state-of-the-art machine learning. At the same time, it's likely the fastest available implementation of these algorithms, making it immediately useful for industrial deployment". Since this software was born to be as modular as possible, lots of layers and loss functions are already implemented. Moreover, "Caffe supports network architectures in the form of arbitrary directed acyclic graphs. Upon instantiation, Caffe reserves exactly as much memory as needed for the network, and abstracts from its underlying location in host or GPU. Switching between a CPU and GPU implementation is exactly one function call".

MXNet is a multi-language ML library specialized for Deep-NNs. It is efficient in terms of memory and complexity, and it can run on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. [19] reports that "MXNet offers powerful tools to help developers exploit the full capabilities of GPUs and cloud computing. While these tools are generally useful and applicable to any mathematical computation, MXNet places a special emphasis on speeding up the development and deployment of large-scale deep neural networks". In particular, the capabilities offered by MXNet are:



• Device placement: with MXNet, it is possible to specify where each data structure should live.

• Multi-GPU training: MXNet makes it easy to scale computation with the number of available GPUs.

• Automatic differentiation: MXNet automates the derivative calculations that once bogged down neural network research.

• Optimized Predefined Layers: While you can code up your own layers in MXNet, the predefined layers are optimized for speed, outperforming competing libraries.

Theano is a Python library developed to optimize and evaluate mathematical expressions efficiently. It usually works well with multi-dimensional arrays, because it widely exploits CPU and GPU mathematical compilers. This makes it very useful for ML applications, to train and build NNs. We refer to [20], which describes Theano as a Python library which "allows a user to symbolically define mathematical expressions and have them compiled in a highly optimized fashion either on CPUs or GPUs (the latter using CUDA), just by modifying a configuration flag. Furthermore, Theano can automatically compute symbolic differentiation of complex expressions, ignore the variables that are not required to compute the final output, reuse partial results to avoid redundant computations, apply mathematical simplifications, compute operations in place when possible to minimize the memory usage, and apply numerical stability optimization to overcome or minimize the error due to hardware approximations. To achieve this, the mathematical expressions defined by the user are stored as a graph of variables and operations, that is pruned and optimized at compilation time. The interface to Theano is Python, a powerful and flexible language that allows for rapid prototyping and provides a fast and easy way to interact with the data. The downside of Python is its interpreter, that is in many cases a poor engine for executing mathematical calculations both in terms of memory usage and speed. Theano overcomes this limitation, by exploiting the compactness and ductility of the Python language and combining them with a fast and optimized computation engine". Theano is also very familiar thanks to its similarity with Numpy [21]. This makes it easy to use and able to generate high-performance code for CPU and GPU through the definition of custom graph expressions written in Python, C++, or CUDA.

Torch7 is a flexible numeric computing framework and machine learning library to build and train NNs. According to [22], Torch7 can easily be interfaced to third-party software thanks to Lua's light interface. It allows implementing, quickly and easily, mathematical algorithms with a high computational complexity, and it includes other libraries. Lua is a library written in clean C; in [23] it is described as a powerful, efficient, lightweight, embeddable scripting language. It supports procedural programming, object-oriented programming, functional programming, data-driven programming, and data description.

TensorFlow is a software library for numerical computation and machine learning applications. TensorFlow was designed both for research and for Google's products. A TensorFlow computation is described by a graph composed of several nodes. According to [24], each node in the graph represents a mathematical operation, and the graph edges are the multidimensional data arrays communicated between them. An operation has a name and represents an abstract computation. The nodes can have zero or more inputs and outputs. In agreement with [2], "an operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node to perform the operation. One common use of attributes is to make operations polymorphic over different tensor element types. A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations". The value flowing along each edge is called a tensor, and it can be a multidimensional array. TensorFlow supports a variety of tensor element types: complex numbers and string types, signed and unsigned integers, float and double types. Tensors flow from the input layer to the output layer, updating their values at each hidden layer. Instead, a Variable is a special operation which returns "a handle to a persistent mutable tensor that survives across executions of a graph. Handles to these persistent mutable tensors can be passed to a handful of special operations that mutate the referenced tensor. For machine learning applications of TensorFlow, the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model".
To interact with TensorFlow, clients need to use a Session interface. The principal operation supported by Session is Run, which computes "the transitive closure of all nodes that must be executed in order to compute the outputs that were requested, and can then arrange to execute the appropriate nodes in an order that respects their dependencies". The TensorFlow system is composed of different components which communicate with each other. The main components are the clients, which communicate with the master, and one or more worker processes. Each worker process controls one or more computational devices, such as CPU cores or GPU cards, which are the computational heart of TensorFlow. TensorFlow can be deployed in two different ways: a local implementation or a distributed implementation. The local implementation takes place when client, master and worker run



Figure 2.10: Local and distributed implementation of TensorFlow library [2].

on a single machine which represents an operating system process with the option of using multiple devices. The distributed implementation can be considered as an exten-sion of the local implementation (indeed they can also share the same codes), where the clients, the master, and the workers can run on more than one machine as show in the figure 2.10 extracted from [2]. In conclusion, TensorFlow is an open source software made available by Google. It has flexible architecture which allows to take advantage of one or more CPUs and GPUs in a desktop, server, or mobile device making faster the training process. Keras is a high-level library to implement a NN. It can support both Theano and TensorFlow. Its main purpose is to implement fast experimentation to ver-ify some idea or intuition. Keras is a simple environment to build Sequential model and functional Application Programming Interface (API). Basically, the Sequential model is useful for simple models with linear stack of layers while API model allows to re-alize more complicate structures with more than one input and output layers, directed acyclic graphs, or models with shared layers. In [25] there are the main guidelines to introduce the Keras environment and its features. Firstly it is “User friendliness. Keras is an API designed for human beings, not machines. It puts user experience front and centre. Keras follows best practices for reducing cognitive load: it offers consistent and simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error”. Secondly, it is modular: “a model is understood as a sequence or a graph of standalone, fully-configurable mod-ules that can be plugged together with as little restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, regularization schemes are all standalone modules that you can combine to create new models”. 
Finally, it is easily extensible. In Keras, "New modules are simple to add (as new classes and functions), and existing modules provide ample examples. To be able to easily create new modules allows for total expressiveness, making Keras suitable for advanced research". Furthermore, it is widely used because its main programming language is Python, which is intuitive [26] and easy to understand.
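The "linear stack of layers" idea behind the Sequential model can be sketched in plain Python. This is a toy illustration only, not the actual Keras API: the `Dense` and `Sequential` classes below are hand-rolled stand-ins with invented signatures, kept minimal to show how each layer's output feeds the next.

```python
import math

def sigmoid(v):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

class Dense:
    """A toy fully-connected layer: y_i = activation(sum_j W[i][j] * x[j] + b[i])."""
    def __init__(self, weights, biases, activation=None):
        self.weights = weights                      # one row of weights per output unit
        self.biases = biases
        self.activation = activation or (lambda v: v)

    def __call__(self, x):
        return [self.activation(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.weights, self.biases)]

class Sequential:
    """A linear stack of layers: each layer's output is the next layer's input."""
    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Two-layer network: 2 inputs -> 2 hidden units -> 1 output in (0, 1)
model = Sequential([
    Dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], sigmoid),
    Dense([[1.0, 1.0]], [0.0], sigmoid),
])
print(model.predict([1.0, 2.0]))
```

In real Keras, the same pattern is expressed by stacking `Dense` layers inside `keras.models.Sequential`; the functional API instead wires layers together explicitly, which is what allows multiple inputs, multiple outputs, and shared layers.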


Chapter 2. Artificial Neural Networks Technology

Matlab provides the Neural Network Toolbox, where algorithms and pretrained models to create, train and simulate Neural Networks exist. These NNs permit solving the basic problems of classification, regression, clustering, dimensionality reduction, and time-series forecasting [27]. For each task, the toolbox provides several Neural Network models, including "convolutional neural networks (ConvNets, CNNs), directed acyclic graph (DAG) network topologies, and autoencoders for image classification, regression, and feature learning. For time-series classification and prediction, the toolbox provides long short-term memory (LSTM) deep learning networks. For small training sets, you can quickly apply deep learning by performing transfer learning with pretrained deep network models (GoogLeNet, AlexNet, VGG-16, and VGG-19) and models from the Caffe Model Zoo". Moreover, in [26] Matlab provides a list of tested training algorithms and the acronyms used to identify them. The choice of training algorithm is not easy and depends on several factors, "including the complexity of the problem, the number of data points in the training set, the number of weights and biases in the network, the error goal, and whether the network is being used for pattern recognition (discriminant analysis) or function approximation (regression)".

One of the principal advantages of these frameworks is the possibility of speeding up the training process with GPUs. Training is usually very expensive in terms of time. To reduce the time spent in training, the frameworks previously described can be supported by CUDA, which increases the NN capabilities by exploiting the parallelism of the GPU. For that reason Matlab supplies a Parallel Computing Toolbox to take advantage of multicore processors and GPUs on the desktop, achieving higher levels of parallelism. Moreover, Matlab also supports most CUDA-enabled NVIDIA GPUs with compute capability 3.0 or higher for training deep neural networks. Other details about the key features of the Neural Network Toolbox are available on the Matlab website [28].


CHAPTER 3

An overview of single carrier communication system

Introduction

A communication system is a set of devices, such as transmission systems, relay stations, tributary stations, and data terminal equipment (DTE), interconnected with each other to work as a coherent system and guarantee reliable communication between two or more customers. The purpose of any form of communication is to transfer a message from a sender to a receiver with high reliability. Every day people come into contact with multiple telecommunication devices such as the telephone, radio, television, and the Internet. These devices allow people to communicate instantaneously across different continents, and to receive information about the various developments and notable events that occur all around the world.

The communication world is extremely varied. To better understand the purpose of this work, let us consider a conceptual model that describes and standardizes the main functions of a telecommunication or computing system. The model, shown in figure 3.1 ([29]), is called the ISO-OSI model, where ISO stands for International Organization for Standardization and OSI for Open System Interconnection. The model is composed of seven layers: physical, data link, network, transport, session, presentation and application.

Figure 3.1: ISO-OSI Model [3].

In particular, the main purpose of this work is to study and re-analyze what happens in the physical layer and to try to modify the conventional transmitters and receivers using the NN theory introduced in the first chapter.

Each layer has to carry out a specific task. According to [3], the layers can be described as follows:

• "Application: Provides different services to the application;
• Presentation: Converts the information;
• Session: Handles problems which are not communication issues;
• Transport: Provides end to end communication control;
• Network: Routes the information in the network;
• Data Link: Provides error control;
• Physical: Connects the entity to the transmission media".
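The layered structure above can be made concrete with a toy encapsulation sketch: each layer wraps the payload handed down from the layer above with its own header, and the receiving side strips the headers in reverse order. This is purely illustrative; the header format and function names are invented and not part of any real protocol stack.

```python
# Toy view of OSI-style encapsulation: the Application layer's message is
# wrapped by each layer on the way down, so the outermost header on the
# wire belongs to the Physical layer.
LAYERS = ["Application", "Presentation", "Session",
          "Transport", "Network", "Data Link", "Physical"]

def encapsulate(message):
    """Walk down the stack, wrapping the payload at each layer."""
    pdu = message
    for layer in LAYERS:
        pdu = f"[{layer}]{pdu}"
    return pdu

def decapsulate(pdu):
    """Walk back up the stack, stripping one header per layer."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]"
        assert pdu.startswith(header), f"malformed PDU: expected {header}"
        pdu = pdu[len(header):]
    return pdu

frame = encapsulate("hello")
print(frame)               # outermost header is [Physical]
print(decapsulate(frame))  # recovers "hello"
```

The round trip mirrors the model's peer-to-peer principle: each layer at the receiver only interprets the header added by its peer at the transmitter.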

The physical layer has to transmit a message over a propagation channel. The messages are transmitted as a flow of bits with values 0 and 1. The time needed to send a bit is called the bit interval (Tb), while the bit rate is the number of bits sent per unit of time; in other words, it represents the transmission speed (Rb). Bit time and bit rate are related by Rb = 1/Tb.
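The reciprocal relation between bit rate and bit interval can be checked with a quick worked example (the symbols Rb and Tb follow the text; the numbers are arbitrary illustrative values):

```python
# Standard relation between bit rate and bit interval: Rb = 1 / Tb.
Rb = 1e6           # bit rate: 1 Mbit/s
Tb = 1.0 / Rb      # bit interval in seconds
print(Tb)          # 1 microsecond per bit

# At this rate, the number of bits sent in a 2 ms burst is:
bits = 2e-3 * Rb
print(bits)        # 2000 bits
```

So doubling the bit rate halves the bit interval, and vice versa.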
