
5.2 Classification techniques

5.2.4 Neural Networks

The basic architecture of a Neural Network [40] (NN) is a multilayer stack of simple modules, also called neurons: networks of this kind are also called fully connected NNs or feed-forward NNs.



Figure 5.8: An example of a Neural Network, with the detail of a single neuron.

What happens inside a neuron is a weighted sum of the inputs of that neuron, followed by an activation function, as shown in Figure 5.8.
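As a minimal sketch of this computation, with purely illustrative values for the inputs and weights:

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus bias, followed by a sigmoid activation."""
    z = np.dot(w, x) + b             # w1*x1 + ... + wn*xn + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation function

# Example: a single neuron with three inputs
x = np.array([0.5, -1.2, 3.0])       # inputs x1..xn
w = np.array([0.8, 0.1, -0.4])       # weights w1..wn
b = 0.2                              # bias
print(neuron_output(x, w, b))
```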

The activation function plays an essential role inside the network: it introduces the non-linearities that distinguish the network from a shallow model. Several types of activation functions exist, with the most common being:

Sigmoid f(x) = 1 / (1 + e^(−x))

historically popular, but with many disadvantages: it saturates (i.e. the output has an upper and a lower limit), the output is not zero-centred and it contains an exponential, which is expensive to compute;

tanh f(x) = tanh(x)

zero-centred, but still saturates;

ReLU f (x) = max(0, x)

the Rectified Linear Unit does not saturate in the positive region and it is very efficient to compute. Unfortunately, the output is still not zero-centred and a negative input produces an output equal to zero;

PReLU f(x) = max(αx, x)

the Parametric ReLU keeps all the advantages of the ReLU, but its neurons are not "killed" by negative inputs, thanks to the learnable slope α in the negative region.

The issue with non-zero-centred activation functions is due to the peculiar way a NN is trained: an always-positive (or always-negative) output of an activation function slows down the convergence of the network [41], as explained below.

In practice, ReLU or PReLU are usually chosen. Figure 5.9 shows these functions, to better visualise the behaviour described above.
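For reference, the four activation functions can be written in a few lines of Python (a simple sketch; the fixed value of α used for the PReLU is only illustrative, since in a real network α is a learned parameter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # saturates, not zero-centred

def tanh(x):
    return np.tanh(x)                  # zero-centred, but still saturates

def relu(x):
    return np.maximum(0.0, x)          # no saturation for x > 0, kills negative inputs

def prelu(x, alpha=0.01):
    return np.maximum(alpha * x, x)    # keeps a small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, prelu):
    print(f.__name__, f(x))
```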


Figure 5.9: Examples of activation functions. From left to right, on the top: sigmoid and tanh; on the bottom: ReLU and PReLU.

At the end of all the fully connected layers, a softmax function compresses the outputs so that they sum to one and can be interpreted as probabilities. The equation of the softmax function is:

σ(z)_i = e^(z_i) / ∑_j e^(z_j)

where z_i is the current element for which the softmax is computed and the sum at the denominator runs over all the output elements.
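As a short sketch, the softmax can be implemented in a numerically stable way by subtracting the maximum value before exponentiating (a standard implementation trick, not discussed in the text above):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) does not change the result,
    because the constant cancels between numerator and denominator."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])    # raw outputs of the last layer
probs = softmax(scores)
print(probs, probs.sum())             # probabilities summing to 1.0
```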

The training of a neural network happens differently from the shallow algorithms explained above: it is composed of two phases, the forward propagation and the back-propagation. Given a single sample, each of its features is typically used as input for a different neuron of the input layer, and the output of each neuron is then forwarded to the neurons of the next layer, until the output layer is reached: this is the simpler part of the process and it is called forward propagation.
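In vectorised form, the forward propagation of a fully connected network is a chain of matrix multiplications and activation functions. A minimal sketch with arbitrary layer sizes and random weights, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy network: 4 input features, one hidden layer of 8 neurons, 2 output classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: weighted sums + ReLU
    z = W2 @ h + b2                    # output layer: raw scores
    e = np.exp(z - z.max())
    return e / e.sum()                 # softmax turns scores into probabilities

x = rng.normal(size=4)                 # one sample with 4 features
print(forward(x))
```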

At this point, the error between the predicted output and the actual label must be computed with a "loss" function (or cost function). There are different possible choices for the loss function, but the most popular one is the logistic loss [42], which is defined as:

L(y, p) = −(y log(p) + (1 − y) log(1 − p))

where y ∈ {0, 1} is the true label and p = P (y = 1) is the probability estimate.
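As a quick sketch, the loss can be computed directly from this definition (with a small clipping of p to avoid log(0), an implementation detail not mentioned above):

```python
import numpy as np

def logistic_loss(y, p, eps=1e-12):
    """Binary logistic loss for a true label y in {0, 1}
    and a predicted probability p = P(y = 1)."""
    p = np.clip(p, eps, 1.0 - eps)                       # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(logistic_loss(1, 0.9))   # small loss: confident and correct
print(logistic_loss(1, 0.1))   # large loss: confident but wrong
```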

The loss measure is used to perform the second step of the training, the back-propagation. This step consists in the computation of gradients to update the weights of the nodes. In more detail, through a component of the NN called "optimizer", it is possible to choose among different update algorithms: the most common ones are Stochastic Gradient Descent [43] (SGD) and Adam. The objective of these algorithms is to minimise the loss function, proceeding iteratively: after each batch of forwarded input data, the computed loss is used to update the weights and, once the whole training set has been forwarded through the network, the epoch ends.

The number of epochs, i.e. the number of times the whole training set is forwarded through the network, can be fixed in advance, or the interruption of the training can be triggered dynamically. For example, a common method is to stop the training if the loss has not decreased by at least a fixed percentage over a certain number of epochs.
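The following is a minimal sketch of such a training loop, written with PyTorch purely for illustration (the framework, the layer sizes, the random data and the 1% stopping threshold are assumptions, not taken from the experiments described in this work):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy fully connected network: 10 input features, 2 output classes
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()        # applies the softmax internally
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Random data standing in for the training set (200 samples, batches of 32)
X, y = torch.randn(200, 10), torch.randint(0, 2, (200,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

previous_loss, max_epochs = float("inf"), 100
for epoch in range(max_epochs):
    epoch_loss = 0.0
    for xb, yb in loader:                # one pass over all batches = one epoch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # forward propagation + loss computation
        loss.backward()                  # back-propagation: compute the gradients
        optimizer.step()                 # weight update (Adam in this example)
        epoch_loss += loss.item()
    # Dynamic stopping rule: stop if the loss improved by less than 1%
    if epoch_loss > 0.99 * previous_loss:
        break
    previous_loss = epoch_loss
```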

Finally, another type of neural network exists: Convolutional Neural Networks [44] (CNN). They are designed to process data that come in the form of multiple arrays. Besides images, many other kinds of data have this form, e.g. sequences and text, audio spectrograms and videos. CNNs try to exploit some intrinsic characteristics of this form of data with some additions to the fully connected NN architecture: convolutional layers and pooling layers. A convolutional layer can be seen as a filter that moves over the input: each time the filter moves, it computes the scalar product between its weights and the values of the part of the input it is passing over. Like in standard NNs, the output produced by a layer is used as input for the next one. These filters are helpful to grasp the local features typical of images and of similar inputs like audio or video tracks. Pooling layers, instead, are needed to reduce dimensionality and make the representations more manageable. An example of a pooling filter that keeps the maximum value of each area of the input is shown in Figure 5.10. Some popular CNN architectures are ResNet [45] and Inception [46].
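As a rough sketch of what a single convolutional filter computes (a hypothetical single-channel input, no padding, stride 1; real implementations also handle channels, padding and strides):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take the scalar product at each
    position ("valid" mode: no padding, stride 1, single channel)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # a simple 2x2 filter
print(convolve2d(image, kernel))
```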

Figure 5.10: An example of a pooling mechanism: a 2×2 max-pooling filter with stride 2 reduces the 4×4 input [[1, 3, 0, 2], [8, 4, 4, 5], [1, 3, 2, 0], [2, 1, 0, 4]] to the 2×2 output [[8, 5], [3, 4]].
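The values in Figure 5.10 can be reproduced with a short sketch for a single-channel input, a 2×2 window and stride 2:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the maximum of each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 0, 2],
              [8, 4, 4, 5],
              [1, 3, 2, 0],
              [2, 1, 0, 4]])
print(max_pool_2x2(x))   # [[8 5]
                         #  [3 4]]
```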

Chapter 6

Solution design

This chapter’s objective is to explain the workflow followed during this work’s experiments and to describe in detail the different phases that compose it.

The two main steps of the experiments are the creation of the machine learning models that allow the classification of network traffic as benign or malicious, and the integration and deployment of these models inside a working IDS, Suricata. The technical details on IDSes can be found in Chapter 4, while the overview of machine learning techniques is in Chapter 5. Finally, Figure 6.1 shows a high-level diagram with the whole workflow of the experiments.

Figure 6.1: The high-level workflow of the proposed solution, divided into two phases. Model Creation and Selection: flow statistics creation with Tstat, selection of useful features, training and validation of the models, test and choice of the final models. Suricata Integration: initialisation of Suricata, update of the Tstat internal structures with packets, classification with the new Suricata rules using the ML model.

6.1 Creation and selection of classifiers

This phase of the experiments starts from the download of the two chosen datasets, whose characteristics are analysed in depth in Chapter 7, and ends with the creation of several different trained classifiers, which will later be employed in the integration phase. The design followed to obtain this result is the one described in Section 5.1: a first phase of pre-processing of the dataset, followed by the visualisation of the dataset characteristics and by another phase of pre-processing, with aims different from the first one.

Using the results and the observations of these first steps, the algorithms that better suit the characteristics of the dataset have been chosen. Finally, the training, validation and testing pipelines have been created, in order to produce and store different classifiers, ready for later use. In this regard, several trained models have been stored for each of the chosen algorithms, depending on the specific training set used to create them. The technical steps to reproduce this workflow are described in Appendix B, while the explanation of the usage of the final tools from a user perspective can be found in Appendix A.