Deep Learning-based MIMO Indoor Positioning

Academic year: 2021


ADVISORS:

Prof. Marco LUISE

Prof. Luca SANGUINETTI

Candidate:

Alessandro Sciva


Contents

1 Introduction 4

1.1 Radio Indoor Positioning . . . 5

1.1.1 Range-based Indoor Positioning . . . 6

1.1.2 MIMO-based Indoor Positioning . . . 7

1.2 Related works . . . 9

2 Deep learning and Google TensorFlow 10

2.1 Introduction . . . 10

2.2 Deep Neural Networks overview . . . 11

2.2.1 Artificial neuron . . . 11

2.2.2 Feed Forward Neural Network . . . 12

2.2.3 Convolutional Neural Network . . . 18

2.2.4 VGG networks . . . 22

2.3 The TensorFlow framework . . . 24

2.3.1 Overview . . . 24

2.3.2 Tensors and Flow-graphs . . . 25

2.3.3 TensorFlow's basic steps . . . 27

2.3.4 TensorFlow on Google Colab . . . 31

3 Positioning based on Convolutional Neural Network 33

3.1 Model-driven CNN training (Model-CNN) . . . 33

3.1.1 Model description . . . 33

3.1.2 Dataset generation . . . 35

3.1.3 Training procedure . . . 36

3.1.4 Training result . . . 38

3.1.5 Test on a real dataset . . . 39

3.2 Real dataset CNN training (Real-CNN) . . . 41

3.2.1 Scenario and dataset description . . . 42

3.2.2 Raw dataset Training result . . . 44

3.2.3 Data preprocessing . . . 50

3.2.4 Preprocessed dataset training result . . . 52

3.3 Hybrid CNN training (Hybrid-CNN) . . . 54

3.3.1 Training result . . . 55

(3)

A Neural network 65

A.1 Delta rule . . . 65

A.2 Back Propagation algorithm . . . 68

A.3 ADAM . . . 69

A.4 Convolutional parameters . . . 70

B AWGN channel estimation 72

B.1 Channel estimation . . . 72

B.2 Average as noise reduction . . . 73

C Wireless channel 76

C.1 Multipath distortion . . . 76


The Global Positioning System (GPS), based on satellite communication, is able to provide the position of an object located on the Earth surface with an accuracy of ≈ 10 meters [1]. Unfortunately, since those systems are strictly tied to open environments, they have not yet been able to meet the following challenge: providing the position of an object in an enclosed space. In order to face the Indoor Positioning Problem, a wide variety of techniques, based on different technologies, has been proposed, and it is still an active research field to the present day. This is due to the numerous applications involved, some of them reported in Figure 1.1, such as:

• Health: tracking special medical equipment in hospitals. Especially for very large healthcare environments, knowing the position of special emergency equipment turns out to be crucial in order to provide responsive medical assistance.

• Logistics: tracking inventory in warehouses by locating objects. Moreover, a growing number of warehouses nowadays seek to benefit from the use of robots in order to move objects, for instance from the shelf to the conveyor belt. For this reason, an efficient coordination mechanism is needed, based on the exact position of the robots in the warehouse.

• Industry: improving productivity and safety in industrial environments by knowing the position of workers. In such a scenario, the working environment is often divided into dedicated zones, where the number of workers in each area should be constantly monitored.

Figure 1.1: Indoor Positioning applications — (a) medical equipment localization, (b) warehouse robots coordination, (c) warehouse inventory tracking, (d) user localization

In the present work, a neural network approach aimed at solving the Indoor Positioning problem is proposed, which seeks to benefit from the spreading MIMO technology. By way of introduction, the main idea is to take distinct wireless channel fingerprints, coming from different antennas, and use them to train a neural network, which will then be able to provide the position of the transmitter.

1.1 Radio Indoor Positioning

As mentioned before, GPS technology is not suitable for solving the indoor positioning problem. In particular, this is due to the lack of precision: even if 10 meters may seem a sufficiently accurate position estimate on the Earth surface, it turns out to be too much for an indoor environment. Another reason is related to wireless propagation issues: GPS was designed on the underlying concept of line-of-sight propagation while, in enclosed environments, the presence of obstacles results in multipath propagation.

The solution to the indoor positioning problem has been sought in the fingerprint of a wave, distorted by the surrounding environment while travelling from a transmitter to a receiver. Depending on the nature of the wave, different solutions have been proposed, such as optical [2] or acoustic [3] ones.

The techniques based on electromagnetic waves are known as Radio Indoor Positioning, and the first attempts involved technologies such as Wi-Fi and Bluetooth. Initially, these solutions were based on MAC address localization in a network, resulting in privacy issues.

1.1.1 Range-based Indoor Positioning

Range-based techniques rely on an infrastructure of reference nodes (anchors) and are typically organized in two phases.

Ranging-phase: the distance d between the object and each anchor is estimated. One possibility is to exploit the received power which, in free space, is related to the distance by

P_RX = P_TX G_TX G_RX ( λ / (4πd) )^2   (1.1)

where P_TX, G_TX, P_RX and G_RX represent the transmission power and antenna gain of transmitter and receiver respectively, while λ is the wavelength of the travelling wave.

Another way to estimate the distance takes into account the delay between the instant in which a message is transmitted (t_TX) and the one in which it is received (t_RX). Since electromagnetic waves travel at the speed of light (c), the distance can be estimated as

d = c (t_RX − t_TX).   (1.2)

Estimation-phase: once the distances from the anchors are known, methods such as trilateration can be used. Under the assumption that each distance estimate identifies a circle of that radius centred on the corresponding anchor, three anchors are sufficient to obtain the intersection point representing the position of the object of interest, as reported in Figure 1.2.
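As a toy illustration of the estimation phase, the following Python sketch (anchor layout and values are purely illustrative, not taken from this work) solves the 2D trilateration problem by linearizing the three circle equations into a linear system.

import numpy as np

def trilaterate_2d(anchors, distances):
    """Estimate a 2D position from three anchor positions and ranges.
    Subtracting the first circle equation from the other two yields a
    linear system A p = b in the unknown position p = (x, y)."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    A = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                  [2 * (x3 - x1), 2 * (y3 - y1)]])
    b = np.array([d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2,
                  d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2])
    return np.linalg.solve(A, b)

anchors = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]   # hypothetical anchor layout
target = np.array([2.0, 3.0])
ranges = [np.linalg.norm(target - np.array(a)) for a in anchors]
print(trilaterate_2d(anchors, ranges))           # ~ [2. 3.]

With noisy range estimates the three circles no longer meet in a single point, which is exactly the intersection-region effect discussed below.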

This strategy carries a set of problems: first of all, an infrastructure of anchors is required. Then, since the received power is affected by multipath, the distance estimation may lack precision, while, in the case of the delay between transmission and reception, an extremely precise synchronization mechanism is required. Based on that, it is easy to understand how, in a real implementation, the intersection point turns into an intersection region, reducing the accuracy of the position estimate.

1.1.2 MIMO-based Indoor Positioning

The methods using the wireless channel fingerprint are based on the concept that a received signal is distorted by the channel, depending on the reciprocal position between transmitter and receiver and on the surrounding environment. The channel fingerprint can be obtained either in a totally deterministic way, through the Ray Tracing procedure, or in a statistical way, by modelling it.

When considering a static scenario, the position of transmitter and receiver and the surrounding environment do not change over time. In this context, with a Single-Input-Single-Output (SISO) communication system, each transmitted signal will be distorted in the same way. Indicating with x_k the signal transmitted at instant k, and with h the time invariant channel response, the received signal will be

y_k = h x_k.   (1.3)

Therefore, the overall information about the transmitter position is embedded into one single channel realization or, in other words, into one single statistical sample.

The Multiple-Input-Multiple-Output (MIMO) technology introduced the possibility of increasing the channel capacity by exploiting multiple transmitting and receiving antennas. The simplified scenario reported in Figure 1.3 shows a single transmitting antenna sending out the signal x_k and a receiver equipped with M different antennas (MISO communication system). Each one of them, if properly spaced, will experience a different copy of the transmitted signal. Indicating with h^(m) the channel response between the transmitting antenna and the m-th receiving one, the received vector can be expressed in the following way:

y_k^(m) = h^(m) x_k.   (1.4)

A MIMO system is able to provide all the h^(m), ∀m ∈ {1, . . . , M}, through a channel estimation procedure. This results in having M independent channel responses, thus M statistical samples. These can be used to reduce the statistical fluctuations related to the multipath presence and obtain a cleaned channel response embedding information about the transmitter position.

Channel samples h^(m) may now be employed in a wireless indoor positioning technique.

Figure 1.3: MISO communication systems

Such a technique requires a channel model embedding information about the position of the transmitter. The model must take into account a massive multipath presence due to the indoor environment, together with a MIMO communication system. The stochastic physical models that may be adopted are the following:

• Extended Saleh-Valenzuela [4]: based on modelling multipath in clusters and on the assumption that the directions of departure and of arrival of rays are independent and identically distributed. This hypothesis is not always verified in real scenarios [5].

• COST 273 [6]: based on a set of external parameters, which model the surrounding environment through probability density functions.

• IEEE 802.11n [7]: based on the processing of measurements performed at 2 GHz and 5 GHz within small and large offices, residential houses and open spaces.

Unfortunately, the last two models are based on parameters related to the environment, in which also the building materials should be taken into account. These, in fact, play a crucial role in the reflecting properties of surfaces, so much so that different behaviours have been observed when considering models for western countries in comparison to other nations [8].

For the above-mentioned reasons, this work aims at employing a neural network in order to obtain the transmitter position on the basis of the channel estimates provided by a MIMO communication system. It should be observed that the receiver is already equipped with a mechanism capable of estimating the wireless channel, in order to compensate the distortion of the signal. For this reason, channel estimates are already available on the receiver side and can be immediately provided as input to the neural network to estimate the transmitter position. Additionally, this kind of strategy is totally decoupled from any hardware identifier, thus it provides the possibility of implementing an additional identification mechanism ensuring compliance with privacy protection regulations.

1.2 Related works

In the academic literature, the neural network approach has already been adopted. In particular, different studies focused on the neural network topology, from a feed forward only structure [9] to more specific ones where portions of the network are trained separately with the modulus and the phase of the channel response [10]. In other cases, a channel measurement system is set up in order to provide a specific dataset for the purpose of training the network [11] [24].

Deep learning and Google TensorFlow

2.1 Introduction

In its early days, Artificial Intelligence (AI) easily solved all those classes of problems hard to deal with for human beings, but easy to solve for a machine when described by a list of formal and mathematical rules. The next step involved making AI able to solve the class of problems easy to solve for a human being, but hard to describe in a formal way. Within this context we find image and voice recognition, making predictions, data classification according to some non-formal principle, and so on.

The intuition behind the goal of bringing AI to the next level was to make computers able to learn from experience (machine learning) and, at the same time, to make them able to understand complex concepts through a hierarchical structure. Thus, a concept at a certain level of complexity can be defined through a relation among simpler concepts, understood at the previous levels. This approach is nowadays known as deep learning.

A first way to proceed was based on the idea of coding real world knowledge into a formal language, leading to the so-called knowledge base approach to machine learning. This turned out to be a bad choice, as shown by resulting projects such as Cyc, referred to as a "catastrophic failure" [12], due to the amount of data required to provide a productive and useful result, together with the inability of autonomous evolution.

This first approach led to a different one, where AIs should acquire their own knowledge autonomously, by extracting patterns from raw data [13]. The representation of the data provided to the AI was crucial in order to extract pieces of information (features). Even if at first the data representation used to improve feature extraction was designed manually, AIs which learned on their own how to represent data showed better results [13].

In the end, deep learning solved the problem of data representation and feature extraction, once more, by introducing hierarchical representations. Nowadays, machine learning is considered the best way to achieve AI, and deep learning, as a machine learning approach, fits the purpose very well [13].

2.2 Deep Neural Networks overview

2.2.1 Artificial neuron

In neural networks, the artificial neuron represents the most elementary component. It presents N inputs (x_i), each of them coupled with a dedicated adjustable weight (w_i), one single output (y) and a specific activation function f(·). A scheme of an artificial neuron is reported in Figure 2.1.

Its purpose is to compute the result of the activation function applied to the weighted sum of its inputs. Among them, a special one (x_0) is always set to 1 and its associated weight w_0 = b is called bias, leading to the following relation:

y = f( Σ_{i=0}^{N} x_i w_i ) = f( Σ_{i=1}^{N} x_i w_i + b ).   (2.1)

Figure 2.1: Artificial neuron

As it is, a simple single-input artificial neuron can already be employed for linear fitting purposes. Suppose we have N observations corrupted by additive white Gaussian noise and coming from a linear relation f̂ : y = a x + c. After training, an artificial neuron equipped with a linear activation function aims to implement f : y = x_1 w_1 + b such that f ≈ f̂. The scheme of a single-input neuron, together with the fitting result, is reported in Figures 2.2.a and 2.2.b respectively.

Figure 2.2: Linear fitting example — (a) single-input neuron scheme, (b) fitting result

Another possible usage is within the context of classification of linearly separable problems. Supposing we have N bidimensional observations o_i = (x_i, y_i), belonging to classes (C_1, C_2), a double-input artificial neuron aims to find a hyperplane in order to correctly classify the observations according to the classes they belong to. In binary classification problems, it is very common that the activation function of the neuron is a threshold function, defined as

f(a) = 1 if a ≥ θ, 0 if a < θ.   (2.2)

According to the neuron scheme illustrated in Figure 2.3.a, the equation of the decision boundary line is

x_1 w_1 + x_2 w_2 + w_0 = 0  →  x_2 = −(w_1 / w_2) x_1 − w_0 / w_2   (2.3)

and the corresponding representation on a Cartesian plane is reported in Figure 2.3.b.

Figure 2.3: Classification example — (a) double-input neuron scheme, (b) classification result

2.2.2 Feed Forward Neural Network

In most cases, problems turn out to be more complex than a linear fitting operation. For a larger class of real world applications, such as non-linear fitting and non-linearly-separable problems, a single artificial neuron cannot be used. Thus, an interconnection of several neurons, properly named neural network, can be introduced.

Among the different topologies, those without any cycles in the interconnections of nodes are referred to as feed forward neural networks. With reference to Figure 2.4, it is possible to define the structure of such networks, composed of three different layers:

Input layer: a set of neurons with the only purpose of providing the samples to the network as they are. Their number is related to the dimension of the input data.

Hidden layer: a set of neurons implementing the functionality provided by the network once the training phase is completed.

Output layer: a set of neurons implementing the output of the network. In particular, the type of activation function is strictly related to the problem to be solved, for instance a linear function for fitting purposes or a Softmax for classification ones. The number of neurons in the output layer depends on the nature of the output data: in the case of fitting f : ℝ^m → ℝ^n, the number of output neurons will be n, as well as in the case of an n-class classification problem.

Figure 2.5: Deep Feed Forward neural network structure

Adding further hidden layers, as in the deep feed forward structure of Figure 2.5, increases once again the level of non-linearity supported by the network, together with the benefits introduced by a hierarchical structure in terms of feature selection. A simple example is reported in Figure 2.6, where four neural networks with different amounts of neurons have been trained in order to approximate a sinusoidal function. More in detail, these networks are composed as follows:

• Feed forward, 3 neurons: 12 trainable parameters, Figure 2.6.a.
• Feed forward, 5 neurons: 18 trainable parameters, Figure 2.6.b.
• Feed forward, 10 neurons: 33 trainable parameters, Figure 2.6.c.
• Deep Feed forward, 10 neurons per layer: 143 trainable parameters, Figure 2.6.d.

It is clearly visible how the last two networks perform better than the previous ones due to the higher level of non-linearity supported.

Figure 2.6: Sinusoidal fitting example — (a) FF 3 neurons, (b) FF 5 neurons, (c) FF 10 neurons, (d) Deep-FF 20 neurons

2.2.2.1 Training process and training algorithm

The training process is at the basis of the learning capability of a neural network.

Starting from a dataset P : (p, t_p) containing pairs of examples p (or patterns) and desired outputs t_p (or targets), the learning process formally has the goal of making the output produced by the network over the sample p, denoted y_p, as similar as possible to the associated target t_p. In other words, after the training process it will be:

y_p ≈ t_p  ∀p ∈ P.   (2.4)

This is achieved by gradually adjusting the weights of the model, also known as trainable parameters, in order to reduce an Error function defined as

E : W → ℝ   (2.5)

where W is a vector containing all the trainable parameters of the model. Regardless of the topology, the training process is based on the back propagation algorithm (or on one of its variants), in which the reduction of the error function is achieved by a proper computation of the weight update ∆w_{k,h}, depending on the position of the neuron, as better depicted in the Delta Rule description reported in Appendix A.1.

An important variant of the back propagation algorithm is based on ADAM (ADAptive Momentum estimation), recalled in Appendix A.3. It represents the latest trend in neural network training due to the performance improvement shown in experiments [14], and for this reason it was adopted in this work.

A classic implementation of the back propagation algorithm is based upon the Stochastic Gradient Descent (SGD), where the weight variation is computed through the gradient of the error function, more in detail as

∆w_{k,h} = −η ∂E(W) / ∂w_{k,h}.   (2.7)

We first note that η (the learning rate) is independent of the weight and thus constant for all the parameters in the model. Moreover, the variation of a weight only depends on the current value of the gradient, rather than taking into account also its past history. ADAM faces both issues, by providing an adaptive learning rate for each parameter of the model and by taking advantage of the estimation of the momentum, with an exponentially decaying moving average over the gradient history. Thus, the weight computed by the back propagation algorithm will be:

w_{k,h}(n + 1) = w_{k,h}(n) − η m̂(n) / ( √v̂(n) + ε )   (2.8)

where m̂ and v̂ are the first and second moment estimates of the gradient.
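A compact sketch of the update rule (2.8), using the standard ADAM recursions for the moment estimates (hyperparameter names and values chosen here for illustration):

import numpy as np

def adam_step(w, grad, m, v, n, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: exponentially decaying averages of the gradient (m)
    and of its square (v), bias-corrected and used as in (2.8)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**n)          # bias-corrected first moment estimate
    v_hat = v / (1 - beta2**n)          # bias-corrected second moment estimate
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m = np.zeros_like(w)
v = np.zeros_like(w)
for n in range(1, 201):                 # minimise E(w) = ||w||^2, whose gradient is 2w
    w, m, v = adam_step(w, 2 * w, m, v, n)
print(w)                                # converges towards the minimum at the origin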

The dataset plays a crucial role within the entire training process. More in detail, it is usually divided into three separate slices:

Training set: the portion of the dataset used to train the model and gradually adjust the weights, according to the back propagation algorithm. The entire process of providing all the examples to the neural network is often referred to as an epoch.

Validation set: this portion of the dataset is used to obtain an unbiased evaluation of a model fit on the training dataset. Since a model is often characterized by different hyperparameters, such as the learning rate, the number of hidden layers, the number of neurons per hidden layer and so on, the purpose of the validation dataset is that of tuning the model. In a practical approach, the validation dataset is used to spot overfitting, which is the situation where the model starts to memorize the input-output association and stops learning [15].

Test set: this additional portion of the dataset is used to finally evaluate the model after training. Its purpose is to assess the performance of a fully specified model [15]. From a practical point of view, this dataset is used to select a specific model rather than another one (i.e. with a different topology). As a counterpart, it is sometimes difficult to obtain such a dataset, especially if the amount of data provided in the overall dataset is not sufficient to reach a specific error-target value: each example used for testing is a lost chance to learn more.

2.2.2.2 Activation functions

As mentioned, in both feed forward networks and deep ones, the goal is to increase the level of non-linearity supported. This is strictly related to the nature of the activation function employed. Over the years, several activation functions have been proposed, each with its own characteristics, such that the adoption of a specific one depends on the context. The most common activation functions are:

Linear function: y = kx.

• Easy to compute; takes place in the output layer in the case of fitting purposes.

• Cannot be used within hidden layers, because its derivative is always constant and, since the training algorithm is based upon it, it would not work.

Sigmoid function: S(x) = 1 / (1 + e^{−x}).

• Smooth gradient, prevents drastic variations of the output values. These are also bounded between 0 and 1, normalizing the output of each neuron and providing, in the output layer, clear predictions in terms of binary classification probability.

• Computationally expensive, not zero centred and, for both very high and very low values of x, the function is almost constant, leading to a null gradient. This phenomenon is known as Vanishing Gradient, resulting in networks which are too slow in the learning process or which stop learning altogether.

Hyperbolic tangent function: H(x) = tanh(x).

• Zero centred and bounded between −1 and 1.

• Like the sigmoid, for both very high and very low values of x the function is almost constant and its gradient vanishes.

ReLU function: ReLU(x) = max(0, x).

• Computationally very cheap and widely used in hidden layers.

• For negative values of x the output is constant and its gradient vanishes.

Softmax function: σ(x)_j = e^{x_j} / Σ_{k=1}^{K} e^{x_k}, with j = 1, . . . , K.

• Based on exponentiation and a normalized sum, this function provides, for each of the K classes, the probability that a sample belongs to that class. For this reason, it is widely used in the output layer for classification purposes.

Swish function: sw(x) = x / (1 + e^{−x}).

• Discovered by researchers at Google, it provides better results than the ReLU function when used in the hidden layers of classification networks [16].

• Computationally less efficient than the ReLU function.

In Figure 2.7, all the above functions are graphically reported.
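For reference, the activation functions listed above can be written in a few lines of Python (a sketch mirroring the definitions given in this section):

import numpy as np

def linear(x, k=1.0):  return k * x
def sigmoid(x):        return 1.0 / (1.0 + np.exp(-x))
def relu(x):           return np.maximum(0.0, x)
def swish(x):          return x * sigmoid(x)          # x / (1 + e^{-x})
def softmax(x):
    e = np.exp(x - np.max(x))   # shifted for numerical stability
    return e / e.sum()

x = np.linspace(-5, 5, 11)
print(sigmoid(x).round(3))
print(softmax(np.array([1.0, 2.0, 3.0])).round(3))    # sums to 1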

When comparing feed forward neural networks and deep ones, it must be remembered that every deep network can be flattened into a feed forward one having the same number of neurons but, in general, not the same performance. This is true even if a feed forward network equipped with one single hidden layer can, in principle, approximate any kind of function, according to the Universal approximation theorem [17]. Considering as metric the number of regions of linearity (response regions) a model is able to represent, the higher this number, the better the network will be able to approximate an arbitrary curved shape. According to this metric, it can be proved that the number of regions per parameter grows exponentially faster with a deep network than with a simple feed forward one [18].

2.2.3 Convolutional Neural Network

Convolutional Neural Networks (CNNs) are specialized in processing data with a rigid grid-like topology. Among them, it is possible to find mono-dimensional data such as time series, or bi-dimensional ones such as images. CNNs employ the linear operation of convolution rather than a simple input-weights matrix multiplication. By definition, a CNN is a neural network that uses convolution in at least one of its layers [13].

Figure 2.7: Activation functions — (a) Linear, (b) Sigmoid, (c) Hyperbolic tangent, (d) ReLU, (e) Softmax, (f) Swish

The convolution operation is defined for real valued functions as:

s(t) = x(t) ∗ w(t) = ∫ x(τ) w(t − τ) dτ.   (2.9)

Within the context of neural networks, usually x is the input, w the kernel and s the extracted feature map. More in detail, both x and w are usually multidimensional arrays of finite size; thus, given an M × N bidimensional input I and a kernel K, the convolution operation reported in (2.9) can be expressed as:

S(i, j) = I(i, j) ∗ K(i, j) = Σ_{m∈M} Σ_{n∈N} I(m, n) K(i − m, j − n).   (2.10)

An example of the convolution operation is reported in Figure 2.8, where, starting from the top-right corner, the product-sum between the portion of the input covered by the kernel and the kernel itself is computed. Then, the kernel is shifted in order to cover the entire input, generating the corresponding output.

Figure 2.8: Example of 2D-Convolution
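A direct sliding-window implementation for a small input and kernel is sketched below (illustrative only; real frameworks use far more efficient routines, and in practice they compute the product-sum form without flipping the kernel, which differs from (2.10) only by a kernel flip):

import numpy as np

def conv2d_valid(I, K):
    """Slide the kernel K over the input I and compute the product-sum at
    each position ('valid' padding, no kernel flip)."""
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input
K = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
print(conv2d_valid(I, K))                      # 3x3 feature map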

2.2.3.1 Motivations behind convolution

The motivations behind CNNs can be briefly summarized as follows:

• Sparse interaction: due to the reduced dimension of the kernel with respect to the input, the number of interactions in the convolution is kept fixed, up to the kernel size k. As reported in Figure 2.9.a, each value of the input of the convolution only affects 3 values of the result. This approach creates a sparse connectivity and leads to a reduced number of matrix multiplications with respect to a fully connected approach, reported in Figure 2.9.b, where a single value of the input is involved in all the results.

At a first glance, the receptive field of a convolutional layer may appear smaller than that of a fully connected one. However, with reference to Figure 2.10, it is possible to note how the receptive field of a convolutional layer keeps growing the deeper the network.

• Parameter sharing: in a traditional fully connected approach, we have a single parameter for each input unit. In a convolutional approach, instead, the parameters are the values inside the kernel, which are kept constant while it moves across the input, and thus shared. In other terms, instead of learning an entire set of parameters per position, only one set is used and shared across the input, leading to a strong reduction of the required amount of parameters.

Figure 2.9: Interaction in convolutional (a) and fully connected (b) layers

Figure 2.10: Receptive field in a deep CNN

Moreover, in CNNs the kernels are reactive to neighbouring points. This turns out to be very useful for image recognition, where we are often more interested in a group of pixels rather than in a single one.

In general, a CNN is composed of three stages: the Convolution stage, which performs this operation, the Detector stage, which computes a non-linear activation function of each result of the convolution operation, and finally a Pooling stage. A graphical representation of this architecture is reported in Figure 2.11.

2.2.3.2 Pooling

The pooling operation, starting from a single value of the detector stage, computes a summary statistic of the nearby outputs. Several pooling operations have been proposed, each with its own characteristics and specific uses depending on the context. Two of the most used are the MaxPool and the AveragePool, which perform the maximum selection and the mean operation over a rectangular neighbourhood, respectively.

Figure 2.11: CNN stages

Pooling, due to its neighbourhood approach, makes the input representation invariant with respect to small translations. This allows the network to understand whether a feature is present, rather than learning its exact position. Additionally, since pooling is an aggregating operation, it reduces the number of inputs to the next stages, thus improving the computational efficiency of the whole network.
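A minimal sketch of a (2 × 2) MaxPool with stride 2 (illustrative only):

import numpy as np

def maxpool2d(X, size=2):
    """Non-overlapping max pooling: each output value summarises a
    size x size neighbourhood of the input."""
    h, w = X.shape[0] // size, X.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = X[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

X = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 5, 7],
              [1, 1, 3, 2]], dtype=float)
print(maxpool2d(X))   # [[6. 2.]
                      #  [2. 7.]]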

2.2.4 VGG networks

Visual Geometry Group (VGG) networks are particularly deep CNN structures, offering good performance in large-scale image recognition contexts [19].

A single stack of the network is composed of a set of one or more convolutional layers followed by a MaxPool one, for a total amount of five stacks. These are followed by a set of three fully connected layers and finally a Softmax one, as reported in Figure 2.12, where the terminology ConvA-B refers to a convolutional layer with B parallel filters, each of them with a squared (A × A) kernel.

VGG networks are based on the following main concepts:

• Keeping constant the feature map size after each convolution operation.

• Keeping small the size of the kernel filters, in particular (3 × 3).

• Doubling the number of filters after each stack (except for the last one).

Figure 2.12: VGG net

For the first point, it is intuitive, with reference to Figure 2.13, how a (3 × 3) kernel with a 1-ZeroPad produces a convolution result which does not alter the size of the feature map. The philosophy behind this is to make the feature map size invariant to convolution, while letting the pooling operation deal with its reduction.

Figure 2.13: Feature map invariance

For the second point, with reference to Figure 2.14, it is straightforward to understand that stacking two convolutional operations in sequence, each of them with a (3 × 3) kernel, gives the second kernel the same receptive field as a single convolution with a (5 × 5) kernel. The same can be proved for three stacked convolutional layers with a (3 × 3) kernel each and a single convolutional layer with a (7 × 7) kernel. Considering this last example, stacked layers allow the same receptive field of a larger single one, adding three non-linear injections in between, and thus making the decision function more discriminative [19].

Moreover, this technique reduces the number of trainable parameters for the same receptive field. As reported in Appendix A.4, a Conv7-256 involves 12544 parameters, while stacking 3 layers of Conv3-256 requires 6912 parameters; the single larger layer thus needs 5632 additional parameters.
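The figures quoted above can be verified with the arithmetic below, counting only the kernel weights of a single input channel and neglecting biases (the convention that reproduces the numbers quoted from Appendix A.4):

filters = 256
conv7 = 7 * 7 * filters                 # single Conv7-256 layer: 12544 weights
conv3_stack = 3 * (3 * 3 * filters)     # three stacked Conv3-256 layers: 6912 weights
print(conv7, conv3_stack, conv7 - conv3_stack)   # 12544 6912 5632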

Figure 2.14: Active receptive field

2.3 The TensorFlow framework

2.3.1 Overview

Over the last decade, several libraries have been developed with the common goal of deploying neural networks. Among them, we find:

TensorFlow: framework developed by Google and employed as the base technology of the Google Translate service.

Keras: high-level API, written in Python, suitable to be used stand alone or stacked on top of other technologies such as TensorFlow or Theano. It was developed in order to provide an instrument allowing fast experimentation.

PyTorch: Python framework developed and widely used by Facebook.

DL4J: Java library involved in neural network applications such as image recognition, fraud detection, and natural language processing.

MXNet: framework developed by the Apache Software Foundation, able to support different languages such as Python, R, and Scala.

For the purpose of this work, Google TensorFlow was selected as the framework to deploy a neural network. This was due both to the presence of a comprehensive support documentation and to the possibility of adopting the Python programming language. In addition, TensorFlow, coupled with Google Colab, provides a complete development environment, by offering both the APIs and an appropriate amount of computational power for free.

More precisely, TensorFlow is an open source library for high-performance numerical computation and large-scale machine learning. All the instruments provided by TensorFlow to programmers come with Python, which is easy to learn and work with. Additionally, we find Nodes and Tensors, the two main pieces the framework is based on, which will both be described in the next section; they are Python objects, therefore TensorFlow applications are effectively Python ones.

However, mathematical operations, which represent a critical issue in terms of performance, are not performed in Python, but through high-performance C++ binaries instead. Therefore, the actual role of Python in TensorFlow is to direct traffic among the pieces, while providing a high-level programming abstraction to put them together.

Another relevant aspect of TensorFlow is represented by the inclusion of the Keras APIs, which allow, once more through Python, the possibility of building large and complex neural network topologies in a few simple steps. Keras, due to its fast experimentation goal, provides efficient APIs to build complex models by simply stacking several layers, such as convolutional and feed-forward ones, one on top of the other. Also, within the Keras APIs included in TensorFlow, a set of powerful instruments is provided, such as optimizers and different evaluation metrics.

2.3.2 Tensors and Flow-graphs

In TensorFlow, data are represented by Tensors, which are multi-dimensional arrays where data are stored in order to be provided as input to the neural network. According to the dimensions of tensors, we can easily represent the following data types:

Timeseries: tensors of dimension (N).

Grey scale images: tensors of dimension (N, M).

RGB images: tensors of dimension (N, M, 3).

RGB videos: tensors of dimension (N, M, 3, t).

A tensor of dimension (N, M, W) is reported in Figure 2.15.
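For instance, the data types listed above map to tensor shapes as follows (hypothetical sizes, shown with zero-filled tensors):

import tensorflow as tf

timeseries = tf.zeros([128])              # (N)
grey_image = tf.zeros([480, 640])         # (N, M)
rgb_image  = tf.zeros([480, 640, 3])      # (N, M, 3)
rgb_video  = tf.zeros([480, 640, 3, 25])  # (N, M, 3, t)
print(rgb_video.shape)                    # (480, 640, 3, 25)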


Figure 2.16: Perceptron in TensorFlow

Without loss of generality, it is possible to extend the Tensors and Flow concept to an entire neural network representation, as reported in Figure 2.17. Here a simple architecture is shown, representing a two-layer feed forward network equipped with the REctified Linear Unit (ReLU) activation function. In addition to the presence of nodes dedicated to matrix multiplication, bias addition and ReLU computation, it is possible to find:

Input node: directly connected to the Pipeline Input Object.

Reshape node: allows transformations on the dimensions of tensors.

Loss-metric node: allows the computation of the loss-metric to be minimized during the training process. More precisely, this node implements all the loss-metrics provided by the Keras API, according to the goal of the neural network:

• Classification problems: categorical, sparse and binary cross-entropy, Kullback-Leibler divergence, Hinge loss.

• Regression problems: mean squared error, mean absolute error, mean squared logarithmic error, Huber function and cosine proximity.

Figure 2.17: Perceptron in TensorFlow

2.3.3 TensorFlow's basic steps

As mentioned before, the input node is connected to the Pipeline Input Object, which is a special feature of TensorFlow, accessible through the tf.data APIs, allowing transformations to be applied on raw data [20]. Within the context of neural networks, it is very common to apply some data transformation techniques such as shuffling, or introducing random perturbations (image augmentation). The main advantages of these APIs are:

• Applying the transformation on-the-fly while streaming the data to the neural network input, rather than loading the entire dataset, applying the transformation, storing it back into memory and then streaming it.

• Decoupling the intent of pre-processing from the way it is executed. Rather than using standard CPU computational capabilities, data loading, transformation and batching performed through these APIs can make use of hardware accelerators, such as GPUs and TPUs.

Once more with reference to Figure 2.17, it is possible to note the presence of a specific component named SGD Trainer. Its purpose is to collect information from the gradient computation object and update the weights of the model according to a specific optimizer. This can be chosen from a set of available ones provided by the Keras APIs, such as:

SGD: Stochastic Gradient Descent.

RMSprop: similar to SGD with momentum, restricting error oscillations to specific directions [21].

Adagrad: an optimizer which takes into account the past history of the gradients, in terms of variations, to adapt the learning rate accordingly [22].

Adadelta: similar to Adagrad, considering only a window of the past history rather than the entire one [22].

Adam: similar to Adadelta, considering an exponentially decaying window of the past history of the gradient.

Another important mention must be made about the model update policy, which is provided as a parameter to the Gradient object, as reported in Figure 2.17. A classic method adopted in neural network training is the Stochastic Gradient Descent algorithm, in which a single sample is provided to the network and the output error is computed together with the resulting gradient. Then, the weights of the model are immediately updated according to the gradient and the learning rate. This results in a noisy error training process, which leads to:

• an immediate boost of performance in terms of error;

• a faster learning process with respect to elapsed epochs;

• the avoidance of local minima, thanks to the noisy error process;

• computational expensiveness, due to the large amount of gradient computations and model updates;

• a large amount of time required to train the model;

• difficulty for the model to settle on a minimum error, due to the noisy process.

As a counterpart, the Batch Gradient Descent algorithm was proposed. This is based on the calculation of the output error for each sample in the training set. Then, when all samples have been evaluated by the network, all the gradients are computed and combined in a single weight update for the model. Considering that evaluating the entire dataset is often referred to as an epoch, Batch Gradient Descent performs a single model update at the end of each training epoch. This results in a stable training process, which leads to:

• efficient example evaluation, since the examples may be pipelined through the model;

• a stable training error over the epochs;

• the chance of parallel gradient computations, since all the errors are available at the same time;

• the risk for the model of getting stuck in a local minimum, due to the stable training error;

• memory demand, due to the accumulation of the errors;

• an increase of the time required for a single model update, due to the gradient computations.

Furthermore, since all the samples must be evaluated before a single update is processed, Batch Gradient Descent cannot be employed for online learning. The Mini-Batch Gradient Descent algorithm lies in the middle between the previous two. Samples from the dataset are gathered in batches of small size, usually a power of 2, and individually provided to the neural network. The output error is then computed for each sample in the batch. Once all these samples have been evaluated by the network, the errors are combined to compute the gradients, and a model update takes place. In other terms, the model is updated once per batch. This strategy takes the best of the two previously described algorithms, at the cost of introducing an additional hyperparameter, the batch size.

To summarize, deploying a neural network with TensorFlow involves three basic steps:

1. Load and pre-process the dataset through the pipeline input.

2. Define the model by stacking layers and choosing the activation functions, loss-metrics and optimizers for the model.

3. Train and evaluate the model.

These three basic steps turn into the following lines of code, which aim to define, build and train a convolutional neural network for binary image classification.

import tensorflow as tf
import tensorflow_datasets as tfds

# definition of the pre-process function for the pipeline input
def preprocess(record):
    ...

# dataset loading and preprocessing
dataset, info = tfds.load('my_dataset', with_info=True)
train_dataset = dataset['train'].map(preprocess).batch(BATCH_SIZE).shuffle(buffer_size)
valid_dataset = dataset['validation'].map(preprocess).batch(BATCH_SIZE).shuffle(buffer_size)
test_dataset = dataset['test'].map(preprocess).batch(BATCH_SIZE)

# model definition by layer stacking
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(num, kernel_size, padding=padding, activation=activation_function),
    tf.keras.layers.MaxPooling2D(window_size),
    ...
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation=activation_function),
    ...
    tf.keras.layers.Dense(2, activation='softmax')
])
model.compile(optimizer=optimizer, loss=loss_metric)

# model training and evaluation
history = model.fit(train_dataset, epochs=epochs, steps_per_epoch=steps_per_epoch,
                    validation_data=valid_dataset)


2.3.4 TensorFlow on Google Colab

Google Colab is a cloud-based service offering computational power for free. It is based on the Jupyter Notebook technology, an open source project that allows code to be written and executed on remote machines. A single document is made of one or more cells, which may contain either code to be executed or markdown formatted text. Thanks to cells and text, Colab tends to be very useful in educational contexts, since the entire execution can be split into blocks, each of them enriched with descriptive text and executed one at a time. An example of cells, code and text is reported in Figure 2.18.

Figure 2.18: Google Colab notebook example

Google Colab introduces a relevant level of flexibility, supporting more than 40 programming languages, such as Python, R, Julia and Scala.

Files, even large ones such as entire datasets, can be uploaded from the local device or imported from remote sources, for instance the Google Drive storage service and GitHub. Produced files can be easily downloaded locally through either automatic code procedures or the web user interface.

                Colab        Azure Notebook
GPU             Tesla P100   Unknown
TPU             Google TPU   No
File holding    No           Up to 100 MB

Table 2.1: Colab vs Azure Notebook comparison

From the comparison it is possible to conclude that:

• Colab allows most datasets to entirely fit in memory.

• Azure Notebook is not suitable for holding datasets in memory, due to the amount of storage provided. A larger one can be requested under payment.

• Colab allows the usage of TPUs, hardware accelerators specifically designed for tensor processing.

• Colab allows the usage of hardware accelerators for no longer than 12 consecutive hours. This is done to prevent the usage of Colab for mining purposes. It does not represent an issue for training purposes, especially considering that a session refresh solves the problem.

• Since Colab does not implement file holding, all contents have to be reloaded when the session ends. This can be partially mitigated through the Google Drive storage, which allows a faster cloud-to-cloud upload rather than a local one. Even if Azure Notebook allows file holding, the amount of granted storage is not enough to accommodate the dataset.

In the end, for the purpose of this work, Google Colab was selected as the IDE. This choice was made considering the computational power offered by the platform which, coupled with the TensorFlow APIs, offers a complete development environment for machine learning.


Positioning based on Convolutional Neural Network

3.1 Model-driven CNN training (Model-CNN)

The term Model-driven CNN refers to a neural network trained with samples generated by a model.

At a first glance, a model-driven CNN may appear meaningless, since neural networks are often used to approximate unknown models starting from real data (Data-Driven CNN). The reasoning here is to first train a network on data coming from a model, which we will refer to as simulated data, and then test it on real data. This can be considered as a first approach to evaluate the similarity between the behaviour described by the model and the one reported by real data.

If the error obtained by the Model-driven CNN on simulated data is sufficiently low and comparable with the one obtained on real data, the model can be considered a good approximation of the real scenario. Hence, it can be used to enlarge the real dataset in order to provide an additional learning source for the CNN.

3.1.1 Model description

A widely used channel estimation procedure is based upon the transmission of pilot tones (p), known in advance to both transmitter and receiver. At time k, the complex symbol received against the pilot transmission through an AWGN channel is

y_k = p_k h_k + n_k   (3.1)

where p_k is the pilot tone, h_k is the channel realization and n_k is complex Gaussian noise. Thanks to the pilot tones, the channel can be easily estimated as:

ĥ_k = y_k / p_k = h_k + n_k / p_k ∼ CN( h_k, σ_n² / |p_k|² ).   (3.2)


Figure 3.1: MIMO planar wave on (N × M) surface model

In this scenario, considering a planar wave with wavelength λ, which impinges on the surface with azimuth angle ϕ and elevation angle θ, it is possible to compute the array response according to [23]. By gathering all the responses, starting from the antenna placed in the origin, the vector response can be expressed as

a(ϕ, θ) = [ e^{jα^{ϕ,θ}_{1,1}}, . . . , e^{jα^{ϕ,θ}_{1,M}}, e^{jα^{ϕ,θ}_{2,1}}, . . . , e^{jα^{ϕ,θ}_{N,M}} ]^T   (3.3)

where the phase shift related to the antenna located at position (n, m) in the array, due to an impinging planar wave with angles (ϕ, θ), can be expressed as:

α^{ϕ,θ}_{n,m} = (2πd / λ) [ (m − 1) cos(θ) sin(ϕ) + (n − 1) sin(θ) ].   (3.4)

The path-loss related to the attenuation depends on the distance (ρ) between transmitter and receiver, according to

β(ρ) = 1 / ρ^K   (3.5)

where K is a factor depending on the environment. The path-loss can be considered constant for all the antennas in the array.

Finally, for a given point in space, expressed in spherical coordinates (ρ, ϕ, θ), the channel response can be expressed as

h(ρ, ϕ, θ) = √β(ρ) a(ϕ, θ)   (3.6)

and, according to (3.2), taking into account also the transmission power P_TX, the channel estimate can be expressed as

ĥ_k(ρ, ϕ, θ) = h_k(ρ, ϕ, θ) + n_k = √P_TX √β_k(ρ) a_k(ϕ, θ) + n_k.   (3.7)

3.1.2 Dataset generation

According to the model previously described, a dataset composed of 200000 samples, with the parameters reported in Table 3.1, was generated through a Matlab script according to the following steps:

Parameter            Value
Samples              200000
Antenna array        (8 × 2)
Carrier frequency    f0 = 2.4 GHz
Subcarriers          22
Transmission power   P_TX = 20 dBm
Noise power          σ_n² = −94 dBm
Path-loss factor     K = 3
x-range [m]          1.8 ÷ 6.1
y-range [m]          −1.5 ÷ 2.84
z-range [m]          −0.4 ÷ −0.5

Table 3.1: Model dataset parameters

1. Random generation of the (x, y, z) Cartesian coordinates from a uniform distribution.

x = min_x + (max_x - min_x).*rand(1,MEASURE_AMOUNT)';
y = min_y + (max_y - min_y).*rand(1,MEASURE_AMOUNT)';
z = min_z + (max_z - min_z).*rand(1,MEASURE_AMOUNT)';

2. Conversion from Cartesian coordinates to spherical ones, according to

ρ = √(x² + y² + z²),  ϕ = atan(y/x),  θ = acos(z/ρ).

3. Dataset creation: the ch data structure is filled with samples according to the model reported in (3.7). More in detail, it is a 4-D structure, with a dedicated layer for the modulus and one for the phase. Within each layer, the channel responses of a single antenna are stacked vertically, as in the following excerpt of the generation script:

    ch(i,m,:,1)=abs(real_part + 1i*imag_part);
    ch(i,m,:,2)=unwrap(angle(real_part + 1i*imag_part));
else
    ch(i,m+M,:,1)=abs(real_part+1i*imag_part);
    ch(i,m+M,:,2)=unwrap(angle(real_part+1i*imag_part));

It must be noticed that, due to the flat fading channel model, the sample (both modulus and phase) of a given antenna from a given position (x, y, z) is replicated 22 times, once per subcarrier.

4. The datasets are exported into the ch_abs_phase.mat, ch_real_imag.mat and position.mat files, respectively.
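The same generation steps can be summarized in a short Python sketch of the model of Section 3.1.1 (a simplified illustration using the parameters of Table 3.1; variable names and the half-wavelength antenna spacing are assumptions made here, not taken from the original scripts):

import numpy as np

rng = np.random.default_rng(0)
N_ANT, M_ANT = 8, 2                  # (8 x 2) antenna array
P_TX = 10 ** (20 / 10) * 1e-3        # 20 dBm in watts
SIGMA2_N = 10 ** (-94 / 10) * 1e-3   # -94 dBm noise power
K_PL = 3                             # path-loss factor
D_OVER_LAMBDA = 0.5                  # assumed antenna spacing d/lambda

def array_response(phi, theta):
    """Vector response a(phi, theta) of the planar array, eqs. (3.3)-(3.4)."""
    n, m = np.meshgrid(np.arange(N_ANT), np.arange(M_ANT), indexing="ij")
    alpha = 2 * np.pi * D_OVER_LAMBDA * (m * np.cos(theta) * np.sin(phi) + n * np.sin(theta))
    return np.exp(1j * alpha).reshape(-1)

def channel_estimate(x, y, z):
    """Noisy channel estimate for a transmitter in (x, y, z), eqs. (3.5)-(3.7)."""
    rho = np.sqrt(x**2 + y**2 + z**2)
    phi, theta = np.arctan2(y, x), np.arccos(z / rho)
    beta = 1.0 / rho ** K_PL
    h = np.sqrt(P_TX) * np.sqrt(beta) * array_response(phi, theta)
    noise = np.sqrt(SIGMA2_N / 2) * (rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape))
    return h + noise

# one random position inside the x/y/z ranges of Table 3.1
x, y, z = rng.uniform(1.8, 6.1), rng.uniform(-1.5, 2.84), rng.uniform(-0.5, -0.4)
h_hat = channel_estimate(x, y, z)
print(h_hat.shape)   # (16,) -> one complex sample per antenna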

Figure 3.2: Dataset visualization

3.1.3 Training procedure

Regardless of the nature of the dataset, for both Mod-Phase and Real-Imag the training procedure starts by uploading the proper dataset into a dedicated Google Drive folder. Then, the remote filesystem is locally mounted into the current Colab session, where the dataset is copied into a local folder, as reported in Figure 3.3 and shown by the following Python code.

!mkdir dataset
from google.colab import drive
drive.mount('/content/gdrive')
!cp /content/gdrive/'My Drive'/dataset/abs_phase/ch_abs_phase.mat dataset
!cp /content/gdrive/'My Drive'/dataset/abs_phase/position_.mat dataset

Figure 3.3: Dataset import

From the local Colab dataset directory, the files are entirely loaded into memory through the h5py API, a Python library dedicated to .h5 file management.

f = h5py.File("dataset/ch_abs_phase.mat")
ch = f['channel_response']
ch = np.array(np.transpose(ch))
f = h5py.File("dataset/position.mat")
pos = f['position_carth']
pos = np.array(np.transpose(pos))

The entire dataset is now split into training, validation and test sets, for the reasons depicted in Section 2.2.2.1 and, finally, the respective dataset objects are obtained through the tf.data APIs, as described in Section 2.3.3. More in detail, all the datasets are batched, in order to implement the Mini-Batch Gradient Descent training, and also shuffled, to avoid statistical biasing. The training set is also repeated, which allows the dataset iterator to be reinitialized after each training epoch for a predefined number of epochs.

BATCH_SIZE = 64
train = 0.6
...
val_dataset = tf.data.Dataset.from_tensor_slices((val_ch, val_pos)).batch(BATCH_SIZE).repeat()
test_dataset = tf.data.Dataset.from_tensor_slices((test_ch, test_pos)).batch(BATCH_SIZE)

The following steps, such as network definition, training and evaluation, follow the same procedure depicted in Section 2.3.3 and are hence omitted here.

3.1.4 Training result

As previously mentioned, the same virtual dataset is available in both the Mod-Phase and the Real-Imag input forms. For this reason, two different CNNs were trained in order to select the best one in terms of an error evaluation metric. More specifically, defining as x_t, y_t, z_t and x_p, y_p, z_p the target values and the predicted ones, the evaluation metric is the average distance error:

e = (1/N) Σ_{i=1}^{N} √( (x_t^(i) − x_p^(i))² + (y_t^(i) − y_p^(i))² + (z_t^(i) − z_p^(i))² ).   (3.8)

From this point on, the term best CNN denotes a CNN obtained after a trial and error process, involving the tuning of the hyperparameters, to achieve a neural network with the smallest possible error.
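The metric (3.8) can be computed directly from the target and predicted coordinate arrays, as in the following illustrative sketch:

import numpy as np

def avg_distance_error(targets, predictions):
    """Average Euclidean distance between target and predicted (x, y, z), eq. (3.8)."""
    return np.mean(np.linalg.norm(targets - predictions, axis=1))

targets = np.array([[1.0, 2.0, 0.5], [0.5, 1.5, 0.4]])
predictions = np.array([[1.1, 2.0, 0.5], [0.5, 1.3, 0.4]])
print(avg_distance_error(targets, predictions))   # 0.15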

After the training process, the best CNN for both the Mod-Phase and the Real-Imag input is the one reported in Table 3.2. The batch size was selected in order to achieve a reasonable compromise between accuracy and training time. For both the convolutional and the dense layers, the activation function is the ReLU, described in Section 2.2.2.2. Finally, the kernel size of the convolutional layers was chosen according to the flat fading channel model and in order to process the antennas separately.

Figures 3.4.a and 3.4.b report the training results for the Mod-Phase network and the Real-Imag one, respectively. It is possible to note that both CNNs do not show evidence of overfitting and converge to a minimum distance error. Figure 3.5 compares both, while Table 3.3 reports the resulting average distance error obtained on the training set. According to the results, the Mod-Phase CNN can be considered the best network for this kind of topology and dataset.

Table 3.2: Model-CNN configuration

Dataset
  Samples          200000
  Training (%)     60%
  Validation (%)   20%
  Testing (%)      20%
  Batch size       64

CNN topology
  Conv2D(16,(1,22))
  Conv2D(16,(1,22))
  Conv2D(16,(1,22))
  Avg2DPool(1,2)
  Conv2D(32,(1,22))
  Conv2D(32,(1,22))
  Conv2D(32,(1,22))
  Avg2DPool(1,2)
  Flatten layer
  3×Dense(64)
  Dense(3)

  Trainable parameters   ≈240000
  Learning rate          10^−5
  Optimizer              ADAM
  Loss metric            MSE
  Epochs                 50

Table 3.3: Model-CNN training results

Network      Train params     AVG dist. err. [cm]

Mod-Phase ≈240000 2.17±0.01

Real-Imag ≈240000 3.47±0.01

3.1.5 Test on a real dataset

As previously mentioned, the Model-CNN is now tested with a real dataset composed of approximately 17000 samples. This dataset, fully described in the next section, comes from a real scenario where the channel response is estimated through an (8 × 2) planar antenna array.

On the basis of previous results, the Mod-Phase CNN has been selected as a candidate for a real test. For this reason, also the real dataset has

been reshaped in order to fit the input of the Model-CNN. Details about the real dataset preprocessing and reshaping are reported in Section 3.2.3.

Figure 3.4: Mod-Phase (a) and Real-Imag (b) CNN training results

The testing procedure is Colab based. The Model-CNN is restored from file and the real dataset is locally copied, once more, through Google Drive.

model = tf.keras.models.load_model('abs_phase_model_CNN.h5')

drive.mount('/content/gdrive')
!cp /content/gdrive/'My Drive'/real_dataset/abs_phase/measures.mat dataset
!cp /content/gdrive/'My Drive'/real_dataset/abs_phase/position.mat dataset

f = h5py.File("dataset/measures.mat")
ch = np.transpose(np.array(f['measures']))

f = h5py.File("dataset/position.mat")
pos = np.transpose(np.array(f['positions']))

Then, a dataset object is created and provided as input to the CNN.

test_dataset = tf.data.Dataset.from_tensor_slices((ch, pos))
prediction = model.predict(test_dataset)

The results of the test on the real dataset are reported in Table 3.4, compared with those obtained by the same CNN on the model dataset.

Figure 3.5: Mod-Phase and Real-Imag training comparison

Table 3.4: Abs-Phase Model-CNN results comparison

Network      Dataset          AVG dist. err. [cm]

Mod-phase model-dataset 2.17±0.01

Mod-phase real-dataset 227.05±1.59

3.2 Real dataset CNN training (Real-CNN)

From the results reported in Table 3.4, it is easy to understand the need for a CNN directly trained on real data, in order to achieve the goal of positioning with an acceptable error.

In this section we use a real dataset provided during the IEEE CTW 2019 competition [24], where participants aimed to develop a positioning algorithm based on fingerprinting, interpolation, or machine learning.

First, a convolutional neural network is trained over the raw dataset, without applying any kind of preprocessing, for both the Mod-Phase approach and the Real-Imag one. Then, in order to reduce the overall error, a preprocessing based on knowledge of the wireless channel behaviour is applied.

3.2.1 Scenario and dataset description

Figure 3.6: Real scenario environment

Figure 3.7: Planar (8 × 2) antenna array

The estimation procedure is based upon a novel MIMO channel sounder [25]. The transmitter sends OFDM symbols containing pilot tones and embedding in the payload its estimated position on the table. Then, at the receiver, the signals from each of the M antennas are down-mixed from the carrier frequency fc to an intermediate one fm, m = 1, . . . , M, multiplexed, and provided to a DSP. Here, all the signals are analog-to-digital converted, split in frequency and I/Q sampled. The result of this process is then used to obtain the Channel State Information (CSI), together with the associated position contained in the payload of the message. The entire transmission takes place over a 20 MHz bandwidth, centred at a carrier frequency fc = 1.25 GHz, with an OFDM modulation equipped with 1024 subcarriers and 10% guard intervals (resulting in 924 available subcarriers).

The resulting dataset is composed of 17486 measurements and three files:

• h_Estimated: [17486 × 924 × 16] complex samples with the CSI.

• r_Position: [17486 × 3] Cartesian estimated positions in the order (x, y, z).

• SNR_Estimated: [17486] SNR estimates.

From this point on with the term Raw Real-Imag dataset, we intend a collection of CSIs, with associated positions, where the rst are expressed in term of real and imaginary parts, without any kind of preprocessing. As counterpart, with the term Raw Mod-Phase dataset is intended the same collection with the exception of CSIs represented in terms of modulus and phase, once more, without any kind of preprocessing. A randomly extracted sample from the dataset is reported in Figures 3.8.a and 3.8.b for Raw Real-Imag and Raw Mod-Phase representations respectively. It is easy to note the presence of both additive white Gaussian noise and multipath distortion, where, especially the second one is a very well known peculiarity of the wireless channel, briey mentioned in Appendix C.1.

Figure 3.8: (a) Real-Imag sample, (b) Abs-Phase sample


following one. For each topology, the total number of trainable parameters is also reported, together with the portion related to the convolutional stage only.

Table 3.6 reports the dataset configuration. After some initial trials, it turned out that the amount of data available in the dataset does not leave room for a test-dedicated portion. Thus, in order to exploit as much data as possible for learning, the average error distance taken into account is the one computed on the validation set only.
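A minimal sketch of such a 90/10 split (with no test portion) using the tf.data API is reported below; ch and pos denote the CSI and position arrays loaded as shown earlier, and the shuffling seed is arbitrary.

import tensorflow as tf

n_samples = ch.shape[0]          # 17486 measurements
n_train = int(0.9 * n_samples)   # 90% training, 10% validation, no test set

dataset = (tf.data.Dataset.from_tensor_slices((ch, pos))
           .shuffle(n_samples, seed=0, reshuffle_each_iteration=False))
train_dataset = dataset.take(n_train).batch(8)   # batch size 8
valid_dataset = dataset.skip(n_train).batch(8)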

Figure 3.9 reports the localization of each topology on the parameters plane, which expresses the number of convolutional parameters against the total number of parameters in the network. The more a point is located towards the top left, the larger the convolutional stage is with respect to the overall number of trainable parameters.

Figure 3.9: Network localization on parameters plane (Conv params [k] vs. Tot params [mln] for netA, netB, netC, netD)

In Figures 3.10.a, 3.10.b, 3.11.a and 3.11.b we report the evolution of the training and validation metrics epoch by epoch. It is clear that, after a certain epoch, depending on the network taken into account, the network stops learning, exhibiting an asymptotic behaviour of the average validation distance error. Looking at the comparison in terms of the validation metric only, as reported in Figure 3.12, it is possible to note a more accurate result as the network depth increases. This is also confirmed by the numerical results reported in Table 3.7.


Table 3.5: Abs-Phase raw dataset CNNs configurations

NetA: Conv2D(32,(1,4)), Avg2DPool((1,4)), Conv2D(64,(1,4)), Avg2DPool((1,4)), 3 × Dense(256), Dense(3)
NetB: Conv2D(32,(1,4)), Avg2DPool((1,4)), Conv2D(64,(1,4)), Avg2DPool((1,4)), Conv2D(128,(1,4)), Avg2DPool((1,4)), 3 × Dense(256), Dense(3)
NetC: 2 × Conv2D(32,(1,4)), Avg2DPool((1,4)), 2 × Conv2D(64,(1,4)), Avg2DPool((1,4)), 2 × Conv2D(128,(1,4)), Avg2DPool((1,4)), 3 × Dense(256), Dense(3)
NetD: 3 × Conv2D(32,(1,4)), Avg2DPool((1,4)), 3 × Conv2D(64,(1,4)), Avg2DPool((1,4)), 3 × Conv2D(128,(1,4)), Avg2DPool((1,4)), 3 × Dense(256), Dense(3)

Train params [mln]: NetA 15.08, NetB 7.51, NetC 7.6, NetD 7.68
Conv params [k]: NetA 8.54, NetB 41.44, NetC 127.67, NetD 213.92

Table 3.6: Dataset and training configuration

Samples: 17486
Training (%): 90
Validation (%): 10
Testing (%): 0
Batch size: 8
Learning rate: 10^-5
Loss metric: MSE
Optimizer: ADAM
Epochs: 100
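As a concrete illustration, the NetD configuration of Table 3.5 can be expressed in Keras roughly as follows. This is a sketch rather than the original training code: the input shape (16, 924, 2) (antennas × subcarriers × 2 channels), the 'same' padding and the ReLU activations are assumptions, chosen because they reproduce the parameter counts reported in the table.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_net_d(input_shape=(16, 924, 2)):
    model = models.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    # three convolutional stages, each followed by average pooling along the frequency axis
    for filters in (32, 64, 128):
        for _ in range(3):
            model.add(layers.Conv2D(filters, (1, 4), padding="same", activation="relu"))
        model.add(layers.AveragePooling2D((1, 4)))
    model.add(layers.Flatten())
    for _ in range(3):
        model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(3))   # (x, y, z) position estimate
    return model

model = build_net_d()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")
model.summary()   # about 7.68 mln trainable parameters, about 214 k in the convolutional stage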


Table 3.7: Abs-Phase raw dataset CNNs performances

                       NetA           NetB           NetC           NetD
Train params [mln]     15.08          7.51           7.6            7.68
CNN params [k]         8.54           41.44          127.66         213.92
CNN-Tot ratio          0.05%          0.55%          1.68%          2.78%
Avg dist. err. [cm]    32.63 ± 1.45   31.97 ± 1.39   18.52 ± 0.93   14.58 ± 0.81

Figure 3.10: Abs-Phase raw dataset training result, (a) NetA, (b) NetB (training and validation average distance error [m] vs. epochs)


Figure 3.11: Abs-Phase raw dataset training result, (a) NetC, (b) NetD (training and validation average distance error [m] vs. epochs)

Figure 3.12: Validation average error distance comparison, Abs-Phase raw dataset (NetA, NetB, NetC, NetD; average distance error [m] vs. epochs)


Figure 3.13: Real-Imag raw dataset training result, (a) NetA, (b) NetB, (c) NetC, (d) NetD (training and validation average distance error [m] vs. epochs)

Once more, the trend suggests the deeper the network, the higher the accuracy.


Figure 3.14: Validation average error distance Real-Imag raw dataset (NetA, NetB, NetC, NetD; average distance error [m] vs. epochs)

Table 3.8: Real-Imag raw dataset CNNs performances

                       NetA     NetB     NetC     NetD
Train params [mln]     15.08    7.51     7.6      7.68
CNN params [k]         8.54     41.44    127.66   213.92
CNN-Tot ratio          0.05%    0.55%    1.68%    2.78%


Figure 3.15: AWGN distortion (top: channel modulus vs. subcarrier for positions A and B; bottom: positions A, B and the RX on the x-y plane)

Another problem is related to the dimensions of a single sample provided to the network. With 924 subcarriers and 16 antennas, a single sample is composed of 924 × 16 × 2 ≈ 30000 points. Such an input may become too complex for a CNN to deal with, especially if coupled with an inadequate amount of data.

For this reason, a possible solution is a frequency subsampling of the channel response. This approach, performed through a moving average window, may solve both problems at the same time: on one side, the averaging operation reduces the noise power by a factor equal to the width of the averaging window. In other words, given a constant value $x$ affected by AWGN, the $k$-th realization will be $y_k = x + n_k \sim \mathcal{N}(x, \sigma_n^2)$. Considering an averaging window of width $N$, it holds:

\hat{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\sum_{i=1}^{N} (x + n_i) \sim \mathcal{N}\!\left(x, \frac{\sigma_n^2}{N}\right). \qquad (3.9)


This behaviour is shown in detail in Appendix B.2. On the other side, the averaging operation also acts as a data compressor, reducing the dimensionality of each example.
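The variance reduction predicted by Eq. (3.9) can be verified numerically with a few lines of NumPy; the values of x, σ_n and the number of trials below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x, sigma_n, N = 1.0, 0.1, 42            # constant value, noise std, averaging window width
trials = 100000

# y_k = x + n_k, averaged over an N-wide window, repeated over many independent trials
y = x + sigma_n * rng.normal(size=(trials, N))
y_hat = y.mean(axis=1)

print(np.var(y[:, 0]))   # about sigma_n**2      (single noisy sample)
print(np.var(y_hat))     # about sigma_n**2 / N  (after the averaging window)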

A special mention must be made of the parameter representing the width of the averaging window. It must be chosen so as to reduce the noise power as much as possible without, at the same time, introducing too much distortion on the dynamic of the channel estimation. Rather than an iterative approach, and especially for the second reason, the Coherence Bandwidth fits the purpose very well. It is defined for a channel affected by multipath and represents the bandwidth over which the channel response can be considered flat. Considering an N-ray multipath channel and defining $\tau_M$ as the delay related to the longest path among the N, the coherence bandwidth is approximated as

B_c = \frac{1}{(10 \div 100)\,\tau_M}. \qquad (3.10)

Since the positions are available in the real dataset and due to the indoor environment, a good approximation for the largest delay is the one related to the point of transmission furthest from the receiver array. From this, the coherence bandwidth can be estimated, as well as the number of samples composing the width of the averaging window. In other terms, the subsampling step is obtained as

\mathrm{SUB\_STEP} = \frac{B}{B_c}. \qquad (3.11)

From a procedural point of view, the subsampling step is computed through a Matlab script according to the following code.

B = 20*10^6;                     % TX bandwidth [Hz]
x = r_Position(:,1); y = r_Position(:,2); z = r_Position(:,3);
rho = sqrt(x.^2 + y.^2 + z.^2);  % distance of each TX position from the receiver array
tau_max = max(rho)/(3*10^8);     % maximum propagation delay
Bc = 1/(10*tau_max);             % coherence bandwidth (factor 10 within the 10-100 range of Eq. (3.10))
SUB_STEP = round(B/Bc);          % subsampling step

Computations led to the following results:

τ_M = 207.55 ns        B_c = 481.8 kHz        SUB_STEP = 42        (3.12)
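A minimal NumPy sketch of the corresponding frequency subsampling is reported below. The array name h_Estimated follows the dataset description above; grouping the subcarriers into non-overlapping windows of width SUB_STEP = 42 (so that 924 = 22 × 42) and averaging modulus and phase separately are illustrative choices, and phase wrapping is ignored for simplicity.

import numpy as np

SUB_STEP = 42                        # averaging window width, from Eq. (3.12)

# h_Estimated: complex CSI of shape (17486, 924, 16)
n_samples, n_sub, n_ant = h_Estimated.shape
n_win = n_sub // SUB_STEP            # 924 // 42 = 22 subcarrier groups

# group the subcarriers into non-overlapping windows of width SUB_STEP
h_win = h_Estimated[:, :n_win * SUB_STEP, :].reshape(n_samples, n_win, SUB_STEP, n_ant)

# average modulus and phase over each window
mod_denoised = np.abs(h_win).mean(axis=2)      # shape (17486, 22, 16)
phase_denoised = np.angle(h_win).mean(axis=2)  # shape (17486, 22, 16)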

Figure 3.16.a reports the application of the coherence bandwidth to the modulus of the channel response. It is possible to note the interval within which the channel can be considered flat, together with the associated central point, used for further interpolation. It is clear to see how the usage of the coherence bandwidth as the metric to select the subsampling step allows the de-noised response to track the dynamic of the real one. Figure 3.16.b reports the application of the de-noise procedure to both modulus and phase. Finally, Figures 3.17.a and 3.17.b report a comparison of the same sample provided in Figure 3.15, before and after the de-noise procedure respectively. It is clear to see how, after the de-noise procedure, the modulus channel responses are fully distinguishable. The frequency subsampling procedure scales down the number of subcarriers from 924 to 22, so that a single sample is represented by 704 points (22 × 16 × 2), a reduction by a factor of 42.

Figure 3.16: Coherence bandwidth and de-noise procedure application, (a) coherence bandwidth visualization, (b) de-noise procedure application

Figure 3.17: De-noise procedure, (a) before and (b) after (channel modulus vs. subcarrier for positions A and B)

3.2.4 Preprocessed dataset training result

After data preprocessing, the dataset configuration used for the training process is left unchanged with respect to the one reported in Table 3.6. Once both the Mod-Phase and the Real-Imag datasets were evaluated, together with different network topologies, the combination showing the best results is the one involving the Real-Imag dataset and the CNN described in Table 3.9. The evolution of the training and validation metric is reported in Figure 3.18, while the final performance is reported in Table 3.10.

Due to the encouraging results obtained after 100 training epochs, the procedure was extended up to 200 epochs, in order to get even better results.

Table 3.9: Denoise CNN configuration

CNN topology:
  7 × Conv2D(32,(1,4))
  Avg2DPool((1,2))
  7 × Conv2D(64,(1,4))
  Avg2DPool((1,2))
  Flatten layer
  3 × Dense(512)
  Dense(3)

Train params [mln]: 3.28
CNN params [k]: 132
Learning rate: 10^-5
Optimizer: ADAM
Loss metric: MSE
Epochs: 200

For this network topology, several convolutional layers are stacked before reaching the average pooling layer in both the first and the second stage. This is intentionally done in order to cope with the reduced dimensions of the input data. Since average pooling is performed horizontally, the data dimension is reduced from 22 (input) down to 11 (first stage) and then, again, down to 5 (second stage). This introduces the need to pack the convolutional layers, required to reach an acceptable level of non-linearity, into two stages only, rather than three as in the previous networks.
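For completeness, a Keras sketch of the topology of Table 3.9 is reported below. As before, the input shape (16, 22, 2), the 'same' padding and the ReLU activations are assumptions, chosen so that the subcarrier dimension follows the 22 → 11 → 5 reduction described above and the parameter counts of Table 3.9 are matched.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_denoise_cnn(input_shape=(16, 22, 2)):
    model = models.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    # two convolutional stages: seven stacked convolutions, then horizontal pooling
    for filters in (32, 64):
        for _ in range(7):
            model.add(layers.Conv2D(filters, (1, 4), padding="same", activation="relu"))
        model.add(layers.AveragePooling2D((1, 2)))   # 22 -> 11 -> 5 along the subcarriers
    model.add(layers.Flatten())
    for _ in range(3):
        model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(3))
    return model

model = build_denoise_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")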


Figure 3.18: Denoise dataset training result (training and validation average distance error [m] vs. epochs)

Table 3.10: Denoised Real-Imag dataset CNN performances

                                    NetA
Train params [mln]                  1.57
CNN params [k]                      132
CNN-Tot ratio                       8.4%
Avg dist. err. [cm] @ 100 epochs    8.88 ± 0.71
Avg dist. err. [cm] @ 200 epochs    6.97 ± 0.63

3.3 Hybrid CNN training (Hybrid-CNN)

The purpose of the hybrid dataset training is to provide better learning chances by adding simulated data to the real dataset. More in detail, the target dataset is the denoised one, divided into the following slices (a minimal assembly sketch is given after the list):

• Original training dataset: a portion of the dataset coming from the denoised one.

• Model dataset: a portion of virtual data generated from the model described in Section 3.1.1. This portion amounts to k% of the original training dataset, where k is gradually increased, with k ∈ {5%, 10%, 25%, 50%, 100%}.

• Validation dataset: a portion of the dataset coming exclusively from the denoised one.

• Training dataset: the union of the original training dataset and the model one.

The dataset composition is better depicted in Figure 3.19.
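A sketch of how such a hybrid training set could be assembled with the tf.data API is reported below; train_real (the unbatched real training dataset), ch_model and pos_model (the model-generated CSIs and positions) are illustrative names, not part of the original code.

import tensorflow as tf

def make_hybrid_dataset(train_real, ch_model, pos_model, k, batch_size=8):
    """Add k% (of the real training set size) of model-generated data to the real data."""
    # assumes train_real was built with from_tensor_slices, so its cardinality is known
    n_real = int(train_real.cardinality().numpy())
    n_model = int(round(k / 100 * n_real))

    model_ds = tf.data.Dataset.from_tensor_slices((ch_model, pos_model)).take(n_model)
    hybrid = train_real.concatenate(model_ds)

    # shuffle so that real and simulated examples are mixed within each batch
    return hybrid.shuffle(n_real + n_model).batch(batch_size)

# e.g. hybrid_50 = make_hybrid_dataset(train_real, ch_model, pos_model, k=50)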

Figure 3.19: Dataset structure

The network structure is the same used for the denoised dataset, reported in Table 3.9.

3.3.1 Training result

After the training procedure, Figures 3.20.a and 3.20.b report a comparison among the different hybrid datasets considered.

Figure 3.20: Hybrid dataset training, (a) validation distance comparison, (b) validation distance zoom (average distance error [m] vs. epochs for Hybrid 5%, 10%, 25%, 50%, 100%)
