Protecting Smart Home Environments by exploiting Multi-Level Distributed Intrusion Detection System


Computer Engineering Master Thesis


Advisors:

Prof. Gianluca Dini

Dr. Pericle Perazzo

Dr. Andrea Saracino (IIT-CNR)

Candidate:

Simone Facchini

479391

Academic Year 2019/2020

Pisa - 9/12/2019


Abstract

The proliferation of smart devices and the introduction of the Internet of Things (IoT) paradigm have played a significant role in the creation of smart environments. The increased deployment of smart devices has led to an increase in potential security risks: most devices lack security features, and a single security hole in one device can lead to the compromise of the entire network. This work proposes a novel multi-level approach to the design of a Distributed Intrusion Detection System for Smart Home environments. The proposed approach aims to detect, in a distributed way, unexpected behaviors of network components by exploiting the collaboration between the different IoT devices. The problem has been addressed by implementing an architecture based on a distributed hash table (DHT) that allows nodes to share network and system information. An Intrusion Detection System, distributed across the nodes of the network, is the core component for detecting malicious behavior. The proposed Intrusion Detection System exploits a binary classifier, based on machine learning, which analyzes, in a novel way, the aggregation of features extracted from data coming from the kernel, network and DHT levels.


Chapter 1 Introduction

1.1 Thesis Structure

The thesis is structured as follows. Chapter 1 introduces smart homes and their vulnerabilities. Chapter 2 gives a general overview of the knowledge and concepts required to fully understand our approach. Chapter 3 surveys related work. The adopted approach is then introduced and described in Chapter 4. Chapter 5 presents technical details about the implementation of the simulation environment built to collect the data used in our study. The results of the experiments we conducted are reported in Chapter 6, while Chapter 7 discusses the obtained results together with some limitations and ideas for future improvements. Finally, Chapter 8 concludes the thesis with a short summary of the entire work.

1.2 Smart Home

A smart home can be defined as "an intelligent environment that is able to acquire and apply knowledge about its inhabitants and their surroundings in order to adapt and meet the goals of comfort and efficiency" [1]. The term refers to modern homes that have appliances, lighting and/or electronic devices that can be controlled remotely by the owner, often via a mobile app. Smart home-enabled devices can also operate in conjunction with other devices in the home and communicate information to other smart devices; for this reason they are part of the so-called Internet of Things (IoT).

Figure 1.1: Smart home representation.

The proliferation of smart devices and the introduction of the Internet of Things (IoT) paradigm have played a significant role in the creation of smart environments. According to a research report from the IoT analyst firm Berg Insight, the number of smart homes in Europe and North America reached 64 million in 2018, and the firm estimates that more than 60.3 million homes in North America will be smart by 2023 (41% of all homes in the region). The smart home penetration rate among households is 7.7% in 2019 and is expected to grow to 18.1% by 2023 (as shown in Figure 1.2). Revenue in the Smart Home market is also increasing very quickly: it is around 73M US$ in 2019, up from 56M US$ in 2018 [2]. This growth is probably due to the efficiency of new technologies and the fact that devices are becoming cheaper, but it has a drawback: it makes these devices attractive to malicious hackers.

Figure 1.2: Households penetration in Smart Home market, source: Statista [2].

A Smart Home environment automates the entire home. It therefore provides services that support everyday activities and improve quality of life, such as sophisticated control of energy, higher security against break-ins, innovations in home entertainment, health monitoring, and independent/assisted living arrangements. Smart devices can include appliances like refrigerators, washing machines, dryers, heating and air conditioning units, lighting systems, and surveillance cameras.

The increased deployment of such smart devices has led to an increase in potential security risks. Hackers' interest depends strongly on how widespread the targeted technology is: if a vulnerability is found in a widespread device, the opportunity to exploit it is more significant, and so is the reward. Therefore, as diffusion grows, the need for security in Smart Homes is quickly increasing. On the other hand, most smart home devices have just a few (if any) security features, and a single security hole in one device can lead to the compromise of the entire network, despite reasonable security measures. Moreover, these smart devices are rarely updated, even when the producer makes patches available for known vulnerabilities. Since the interconnected devices have a direct impact on users' lives, there is a need for a well-defined security threat classification and a proper security infrastructure with new systems and protocols that can enforce privacy, data integrity, and availability in IoT [3].

1.3 Threats

A Smart Home environment is pervasive, manages a large amount of personal user data and controls many features of the house. This makes cybersecurity a central topic, closely tied to the growth in diffusion. Due to the different standards and communication stacks involved, the limited computing power and the high number of interconnected devices, traditional security countermeasures may not work efficiently in IoT systems. For this reason, developing specific security solutions for IoT is essential to let users and organizations seize all the opportunities it offers [4].

An important aspect of Smart Home security is that the simplest IoT devices often have few security features, or none at all, and a vulnerability in one device can lead to the compromise of the entire network, despite reasonable security measures. Moreover, even though producers almost always release updates to fix well-known vulnerabilities, it has been observed that users' devices are often not updated and instead run the same software version they had when they were bought.

In 2017 Senrio, a team of cybersecurity specialists, found a vulnerability called Devil's Ivy in a service that imports third-party libraries from gSOAP, used by a common IP camera, the Axis M3004 security camera. By exploiting it, they were able to remotely reset the device to factory defaults and then set new credentials, gaining access to all data recorded by the camera [5]. A year later, in 2018, they demonstrated that by combining the same vulnerability on the same device with another well-known vulnerability on some common home routers, they were able to gain access to a Network Attached Storage (NAS) device connected to the same network as the IP camera. In the end they obtained a command interface on the NAS that gave them full access to the filesystem [6].

1.4 Thesis Contributions

This work introduces a Multi-Level Distributed Intrusion Detection System for Smart Homes capable of detecting unexpected behaviors of a component by exploiting the collaboration between the different IoT devices, either to fix the compromised node or, in the worst case, to exclude it from the network. The essential component of this IDS is a machine learning classifier that uses multi-level data collected from the system to detect any anomaly in the behavior of a node as quickly as possible. The main contributions of the work presented in this thesis are:

• Implementation of a distributed IoT environment.

• Data collection and feature extraction for anomalous and normal system behavior, generated through a simulation environment.

• Experiments on a multi-level IDS exploiting machine learning techniques with respect to a specific malicious behavior.


Chapter 2 Background

2.1 Cyber Attacks in Internet of Things

The Internet is the backbone supporting the IoT, hence almost all the security threats that exist on the Internet propagate to the IoT as well. Compared with traditional networks, the sensitive nodes of the IoT are deployed in positions without manual supervision and have weak capabilities and limited resources, which makes IoT security issues quite troublesome. Furthermore, the fast development and wider adoption of IoT devices in our lives underline the urgency of addressing these security threats before deployment. Due to intrinsic limitations in processing capability and speed, traditional security countermeasures cannot be applied as they are to IoT-based security threats [7]. Since the interconnected devices have a direct impact on users' lives, there is a need for a well-defined security threat classification and a proper security infrastructure with new systems and protocols that can mitigate the security challenges regarding privacy, data integrity and availability.

Because IoT is a relatively new concept, its security goals need to be defined too. The most desirable security objective of IoT is to protect the collected data, since data collected from physical devices may also include sensitive user information. For this reason, any IoT system needs to be resilient to data-related attacks and to provide trust, data security and privacy. Data security and privacy refer to the protection of any collected or stored data in an IoT system: at any moment the system needs to provide data confidentiality, integrity and availability. Trust is a complicated concept consisting of different properties and aims: i) Trust relations between each IoT layer. ii) Trust for the security and privacy at each IoT layer. iii) Trust between the user and the IoT system [8].

IoT attacks can be divided into four distinct classes: physical, network, software and encryption attacks. Encryption attacks are outside our scope because they are based solely on breaking the encryption scheme used in an IoT system, so they are not specific to IoT security.

Physical Attacks

Attacks of this kind focus on the hardware components of the IoT system, and the attacker needs to be physically close to, or inside, the IoT system for the attacks to work. Attacks that harm the lifetime or functionality of the hardware are also included in this category.

Node tampering. The attacker can cause damage to a sensor node by physically replacing the entire node or part of its hardware, or even by electronically interrogating the node to gain access and alter sensitive information or impact the operation of higher communication layers.

Malicious node injection. The attacker can physically deploy a new malicious node between two or more nodes of the IoT system, hence controlling all data flow from and to the nodes and their operations.


Sleep deprivation. The attacker keeps the nodes awake, which results in significant power consumption that eventually causes battery-powered nodes to shut down. The attacker could thus, for example, force the shutdown of devices used to prevent physical intrusions.

Malicious code injection. Malicious code could be injected through physical devices (e.g. a USB storage device) and potentially disrupt the entire smart home network. Alternatively, an attacker may obtain all information in the home network by injecting malicious software that seamlessly transmits all information outside of the protected network [9].

Network Attacks

These attacks are centered on the IoT system network and the attacker does not necessarily need to be close to the network for the attack to work.

Sybil attack. A Sybil attack is an advanced form of impersonation attack in which an attacker steals multiple identities and performs malicious activity over the network, presenting itself to the other nodes as a node with multiple identities. In this situation the attacker could manipulate and dominate the whole system, disseminate spam and malware, violate other users' privacy and so on [10].

Sinkhole attack. In a sinkhole attack, an adversary compromises a node inside the network and performs the attack through this node. The compromised node could, for example, send fake routing information to its neighboring nodes, claiming that it has the minimum-distance path to the base station and thereby attracting the traffic. It can then alter the data and also drop packets. Furthermore, the intruder may also cooperate with other malicious nodes in the network to interfere with detection algorithms [11].


Routing information attack. The attacker can complicate the operation of the network by spoofing, modifying or sending routing information. The result can be packets being allowed or dropped, wrong data being forwarded, or the network being partitioned [12].

Traffic analysis. An attacker can sniff confidential information or any other data flowing from the nodes, because the nodes communicate wirelessly.

Software Attacks

Software attacks are the main source of security vulnerabilities in any computerised system. Software attacks exploit the system by using Trojan horse programs, worms, viruses, spyware and malicious scripts that can steal information, tamper with data, deny services and even harm the devices of an IoT system.

Malicious code injection. An attacker can obtain access to one or more vulnerable devices (like IP cameras [13]), observe and store data related to the time schedules and routines of house residents, and possibly locate valuable objects in order to perform a physical intrusion.

Sinkhole, sybil, routing information attack. An attacker could, for example, choose a particular area of the house (e.g. the garage) and partition the network by exploiting vulnerabilities in routing protocols [14], in order to make all security devices in that area harmless or bring them completely under the attacker's control.

Malware. An attacker can infect vulnerable IoT devices with malware (e.g. Mirai [15]). Such devices often use standard passwords, implement few security features and do not run up-to-date software, even when they have publicly known vulnerabilities.


Malicious code injection, "lateral" attacks. It has been proven that exploiting vulnerabilities of the simplest IoT devices [16] (e.g. IP cameras [5][6]), or even embedding malware into smartphone apps that the user unwittingly runs inside the home network [17], gives an attacker a good chance of compromising the router that provides Internet connectivity to the devices. Because home routers are often left with default or very simple usernames and passwords and, moreover, rarely have up-to-date software, the attacker's job could become very easy. Once full access to the router has been gained, a malicious actor can read, store, redirect and drop all or part of the network traffic and/or try to gain access to other devices connected to the home network.

2.2 Intrusion Detection Systems in IoT

Some ongoing projects for enhancing IoT security include methods for providing data confidentiality and authentication, access control within the IoT network, privacy and trust among users and things, and the enforcement of security and privacy policies. However, even with these mechanisms, IoT networks are vulnerable to multiple attacks that aim to disrupt the network. For this reason, another line of defense designed for detecting attackers is needed. Intrusion Detection Systems (IDS) fulfill this purpose [18].

Intrusion detection is the activity of detecting actions that intruders carry out against information systems. These actions, known as intrusions, aim to obtain unauthorized access to a computer system. The primary assumptions of intrusion detection are: user and program activities are observable, for example via system auditing mechanisms; and, more importantly, normal and intrusion activities have distinct behaviors. Intrusion detection therefore involves capturing audit data and reasoning about the evidence in the data to determine whether the system is under attack. Research efforts about intrusion detection have been conducted since the beginning of the 1980s and the IDS has consolidated its position as a popular defense technology for traditional IP networks, with several solutions on the market.

Despite the maturity of IDS technology for traditional networks, current solutions are inadequate for IoT systems because of particular IoT characteristics that affect IDS development. First, while in traditional networks IDS agents are deployed in nodes that usually have high computing capacity, IoT networks are usually composed of resource-constrained nodes. The second characteristic is related to the network architecture: in traditional networks, end systems are directly connected to specific nodes (e.g. wireless access points, switches, routers) that are responsible for forwarding packets to their destination. IoT networks, on the other hand, are usually multi-hop, so regular nodes may simultaneously forward packets and work as end systems. The last important difference lies in the protocols used: for various reasons, IoT devices often use protocols that are not employed in traditional networks. Different protocols bring new vulnerabilities and new demands on the IDS.

2.3 Machine Learning Classifiers

People are often prone to making mistakes during analyses or when trying to establish relationships between multiple features. This makes it difficult for them to find solutions to certain problems. Machine learning can often be successfully applied to these problems, improving the efficiency of systems and the designs of machines [19]. Every instance in any dataset used by machine learning algorithms is represented using the same set of features. The features may be continuous, categorical or binary. If instances are given with known labels (the corresponding correct outputs) then the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled. Another kind of machine learning is reinforcement learning [20]. Here, the training information provided to the learning system by the environment (external trainer) takes the form of a scalar reinforcement signal that measures how well the system operates. The learner is not told which actions to take, but rather must discover which actions yield the best reward by trying each action in turn.

The choice of which specific learning algorithm to use is a critical step. Once preliminary testing is judged to be satisfactory, the classifier (a mapping from unlabeled instances to classes) is available for routine use. The classifier's evaluation is most often based on prediction accuracy (the number of correct predictions divided by the total number of predictions). Different techniques are used to estimate a classifier's accuracy. One technique is to split the available data, using two-thirds for training and the other third for estimating performance. In another technique, known as cross-validation, the training set is divided into mutually exclusive, equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets and tested on the held-out subset. The average of the error rates over the subsets is then an estimate of the error rate of the classifier.
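The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names and the trivial majority-class learner are hypothetical stand-ins and are not the classifiers used in this thesis. It assumes that the number of folds does not exceed the number of samples.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k mutually exclusive, near-equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def majority_class_trainer(train_labels):
    """Hypothetical stand-in learner: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common


def cross_validated_error(features, labels, k):
    """Average, over the k folds, of the error rate measured on the held-out fold."""
    fold_errors = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        # Train on the union of all the other folds.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        classify = majority_class_trainer(train_labels)
        # Test on the held-out fold only.
        wrong = sum(1 for i in held_out if classify(features[i]) != labels[i])
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 0, 1]
    print(cross_validated_error(list(range(6)), labels, k=3))
```

Replacing `majority_class_trainer` with a real learner leaves the splitting and averaging logic unchanged.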

Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (Logical/Symbolic techniques), Perceptron-based techniques and Statistics (Bayesian Networks, Instance-based techniques). Some well-known algorithms are based on the notion of the perceptron: a hypothetical nervous system, or machine, designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special conditions which hold for particular biological organisms [21]. A multi-layer perceptron neural network (Fig. 2.1) consists of a large number of units (neurons) joined together in a pattern of connections. Units in a net are usually segregated into three classes: input units, which receive information to be processed; output units, where the results of the processing are found; and units in between, known as hidden units. First, the network is trained on a set of paired data to determine the input-output mapping. Once training is complete, the weights of the connections between neurons are fixed and the network is used to determine the classification of a new set of data.

Figure 2.1: The structure of a multilayer perceptron neural network.

In pattern recognition, the k-Nearest Neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. The neighbors are taken from a set of objects for which the class is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric for continuous variables is Euclidean distance. A drawback of the basic "majority voting" classification occurs when the class distribution is skewed, because examples of the more frequent class tend to dominate the vote [22].

Figure 2.2: K-Nearest Neighbors classification.
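As an illustration of the classification phase described above, the following is a minimal sketch of unweighted k-NN with Euclidean distance and plurality voting. The function names and the toy data are assumptions made for this example only and do not come from the thesis implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training_set, query, k):
    """Label the query with the most frequent label among its k nearest samples.

    training_set is a list of (feature_vector, label) pairs; "training" is
    nothing more than storing this list.
    """
    neighbors = sorted(training_set, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-dimensional samples.
    samples = [
        ((0, 0), "normal"), ((0, 1), "normal"), ((1, 0), "normal"),
        ((5, 5), "attack"), ((6, 5), "attack"),
    ]
    print(knn_classify(samples, (0.5, 0.5), k=3))
```

A distance-weighted variant would replace the plain vote count with a sum of weights such as 1/d for each neighbor.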

Decision trees (Fig. 2.3) have proved to be valuable tools for the description, classification and generalization of data. Decision trees are trees that classify instances by sorting them based on feature values; the feature that best divides the training data becomes the root node of the tree. Each node in a decision tree represents a feature in an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values [23].

Figure 2.3: A generic example of decision tree.
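The way an instance is sorted from the root down to a leaf can be illustrated with a small sketch. The tree below is hand-written and its feature names and class labels are purely hypothetical; it is not a tree learned from the data of this thesis, and no tree-induction algorithm is shown.

```python
# Internal nodes are dicts naming a feature and mapping each of its values
# to a child; leaves are plain class labels. All names are hypothetical.
EXAMPLE_TREE = {
    "feature": "uses_default_password",
    "branches": {
        True: "malicious",
        False: {
            "feature": "traffic_volume",
            "branches": {"high": "malicious", "low": "benign"},
        },
    },
}


def classify(node, instance):
    """Follow the branch matching the instance's feature value until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node


if __name__ == "__main__":
    print(classify(EXAMPLE_TREE, {"uses_default_password": False, "traffic_volume": "low"}))
```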

Support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting) [24]. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Figure 2.4: Support Vector Machine Classification.
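To make the idea of "the side of the gap" concrete, the sketch below shows only the prediction step of a linear SVM, assuming a weight vector and bias that have already been learned. The parameter values are hypothetical, and the margin-maximizing training procedure and the kernel trick are not shown.

```python
import math


def linear_svm_predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


def margin_width(weights):
    """Width of the gap between the two margin hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2 / math.sqrt(sum(w * w for w in weights))


if __name__ == "__main__":
    w, b = (1.0, -1.0), 0.0  # hypothetical, already-trained parameters
    print(linear_svm_predict(w, b, (2.0, 1.0)), margin_width(w))
```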

Naïve Bayes classifiers are a family of simple "probabilistic classifiers", based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models. Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers. Naïve Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. Despite their naïve design and apparently oversimplified assumptions, naïve Bayes classifiers have worked quite well in many complex real-world situations. An advantage of naïve Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification [25].

Figure 2.5: Illustration of how a Gaussian Naive Bayes (GNB) classifier works.
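A minimal Gaussian naïve Bayes sketch can illustrate both the closed-form training and the independence assumption. The function names and the toy data are assumptions for this example; the small constant added to each variance is a common numerical safeguard and is also an assumption, not something taken from the thesis.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Closed-form training: per class, a prior plus a mean and variance per feature."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        prior = len(rows) / len(samples)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant avoids a zero variance (assumed safeguard).
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[y] = (prior, stats)
    return model


def log_gaussian(v, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)


def predict(model, x):
    """Pick the class maximizing log prior plus the sum of per-feature log likelihoods."""
    def score(y):
        prior, stats = model[y]
        # Summing per-feature terms is exactly the naive independence assumption.
        return math.log(prior) + sum(log_gaussian(v, m, s) for v, (m, s) in zip(x, stats))
    return max(model, key=score)


if __name__ == "__main__":
    xs = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
    ys = [0, 0, 0, 1, 1, 1]
    print(predict(fit_gaussian_nb(xs, ys), (1.1, 2.0)))
```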

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests apply the technique of bagging (bootstrap aggregating) to decision tree learners. Bagging is a method of generating new datasets from existing data by sampling the existing data with replacement. This means there could be repeated values in each of the newly created datasets. Bagging is a key reason for the popularity of random forests, because it limits overfitting even as the number of trees increases. This is because it averages many low-bias, high-variance predictors, thereby reducing the variance without increasing the bias. Consequently, random forests can achieve high accuracy with a reduced risk of overfitting. Also, since multiple versions of the dataset are generated, it is possible to work with relatively small datasets. Random forest is an ensemble decision tree algorithm because the final prediction, in the case of a regression problem, is the average of the predictions of the individual decision trees; in classification, it is the most frequent prediction. The algorithm thus aggregates the outputs of many decision trees to arrive at a final prediction [26].

Figure 2.6: Random Forests Classification.
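The two ingredients described above, sampling with replacement and aggregating the per-tree outputs, can be sketched as follows. The helper names are hypothetical, and the individual decision trees (as well as the random feature selection used by full random forests) are deliberately left out.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement, so some items may repeat."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(predictions):
    """Classification: the final output is the most frequent per-tree prediction."""
    return Counter(predictions).most_common(1)[0][0]


def mean_prediction(predictions):
    """Regression: the final output is the average of the per-tree predictions."""
    return sum(predictions) / len(predictions)


if __name__ == "__main__":
    rng = random.Random(0)
    print(bootstrap_sample(list(range(10)), rng))
    print(majority_vote(["attack", "normal", "attack"]))
    print(mean_prediction([1.0, 2.0, 3.0]))
```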

AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ("weak learners") is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset; AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative "hardness" of each training sample is fed into the tree-growing algorithm so that later trees tend to focus on harder-to-classify examples [27].

Figure 2.7: Illustration of AdaBoost algorithm for creating a strong classifier based on multiple weak linear classifiers.
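The weighted sum and the re-weighting of misclassified instances can be illustrated with the formulas of the standard discrete AdaBoost algorithm for labels in {-1, +1}. This is a sketch under that assumption; the text above does not state which AdaBoost variant is meant, and the function names and the stump thresholds are hypothetical.

```python
import math


def learner_weight(error):
    """Weight of a weak learner given its weighted error rate (0 < error < 0.5)."""
    return 0.5 * math.log((1 - error) / error)


def reweight(sample_weights, predictions, labels, alpha):
    """Increase the weight of misclassified samples, decrease the rest, then normalize."""
    updated = [
        w * math.exp(-alpha * y * p)
        for w, p, y in zip(sample_weights, predictions, labels)
    ]
    total = sum(updated)
    return [w / total for w in updated]


def boosted_predict(weak_learners, alphas, x):
    """Final output: sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(a * h(x) for h, a in zip(weak_learners, alphas))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    labels = [1, 1, -1, -1]
    predictions = [1, 1, -1, 1]       # the last sample is misclassified
    weights = [0.25] * 4
    alpha = learner_weight(0.25)      # weighted error of this learner is 0.25
    print(reweight(weights, predictions, labels, alpha))
```

After the update, the single misclassified sample carries half of the total weight, which is what pushes the next weak learner to focus on it.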

2.4 Classifiers Evaluation Metrics

For binary classification problems, the evaluation of the best (optimal) solution during classification training can be defined based on the confusion matrix (Fig. 2.8). Each row of the matrix represents the actual class, while each column represents the predicted class. In this confusion matrix, TP and TN denote the number of positive and negative instances that are correctly classified. Meanwhile, FP and FN denote the number of misclassified negative and positive instances, respectively. From the confusion matrix, several commonly used metrics can be derived to evaluate the performance of a classifier with different evaluation focuses [28], as shown in Tab. 2.1.


Figure 2.8: Confusion Matrix. TN: True Negatives; FP: False Positives; FN: False Negatives; TP: True positives.

Accuracy is the most widely used evaluation metric in practice, for both binary and multi-class classification problems. Accuracy evaluates the quality of the produced solution as the percentage of correct predictions over the total number of instances. The complement of accuracy is the error rate, which evaluates the produced solution by its percentage of incorrect predictions. Both of these metrics are commonly used by researchers in practice to discriminate between solutions and select the optimal one.

| Metric | Formula | Evaluation Focus |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + FP + TN + FN}$ | Measures the ratio of correct predictions over the total number of instances evaluated. |
| Precision (p) | $\frac{TP}{TP + FP}$ | Measures the positive patterns that are correctly predicted out of all the patterns predicted as positive. |
| Recall (r) | $\frac{TP}{TP + FN}$ | Measures the fraction of positive patterns that are correctly classified. |
| F-Measure | $\frac{2 \cdot p \cdot r}{p + r}$ | Represents the harmonic mean between recall and precision values. |
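The metrics above can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definition of recall, TP / (TP + FN), and non-zero denominators; the function name and the example counts are hypothetical and are not results from this thesis.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive
    and at least one actual positive).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # standard definition: share of actual positives found
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```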


2.5 Kademlia

The protocol we use to store data and provide communication is Kademlia, one of the most popular peer-to-peer (P2P) Distributed Hash Tables (DHTs) in use today. Here we provide a brief functional description of the protocol; for an exhaustive description, we refer the reader to the original article in which it was presented [29].

A DHT is a class of decentralized distributed system that provides a lookup service similar to a hash table: (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses to documents to arbitrary data [30].

Why Kademlia

Kademlia provides many desirable features that are not simultaneously offered by any other DHT. These include:

• Kademlia minimizes the number of inter-node introduction messages.

• Configuration information such as nodes on the network and neighboring nodes spread automatically as a side effect of key lookups.

• Nodes have knowledge of other nodes, which allows queries to be routed through low-latency paths.

• Kademlia uses parallel and asynchronous queries which avoid timeout delays from failed nodes.


Key

Kademlia uses keys to identify both nodes and data on the Kademlia network. Kademlia keys are opaque, 160-bit quantities. Each participating node has a key, called NodeId, in the 160-bit key-space. Since Kademlia stores content in the form of (key, value) pairs, each data item on the DHT is also uniquely identified by a key in the 160-bit key-space.

Overview

Kademlia nodes are represented as the leaves of a binary tree, as shown in Figure 2.9. The NodeId of each node can be found by tracing the bit values of the edges leading to that node; the highlighted node in the tree, for example, has NodeId 0011. From the perspective of a single node, the tree is divided into a series of successively lower subtrees that do not contain that node. The highest subtree consists of the half of the binary tree not containing the node, the next one consists of the half of the remaining tree not containing the node, and so on. The Kademlia protocol ensures that every node knows at least one node in each of its subtrees, if that subtree contains a node.

XOR Metric and Distance Calculation

In order to decide which node a (key, value) pair should be stored at, Kademlia uses the notion of distance between two identifiers. Given two 160-bit identifiers, x and y, Kademlia defines the distance between them as their bitwise eXclusive OR (XOR) interpreted as an integer: d(x, y) = x ⊕ y. XOR captures the notion of distance implicit in the binary tree sketch of the system. In a fully populated binary tree of 160-bit IDs, the magnitude of the distance between two IDs is the height of the smallest subtree containing both. When a tree is not fully populated, the closest leaf to an ID x is the leaf whose ID shares the longest common prefix with x.

Figure 2.9: Example of Kademlia tree.
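The XOR metric can be illustrated with a few lines of Java. This is a minimal sketch under stated assumptions: the class XorDistance and its methods are hypothetical names, identifiers are modeled as non-negative BigInteger values, and 4-bit IDs are used in the example only for readability (real Kademlia IDs are 160 bits long).

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

/** XOR distance between identifiers (hypothetical helper class). */
final class XorDistance {

    /** d(x, y) = x XOR y, read as a non-negative integer. IDs must be non-negative. */
    static BigInteger distance(BigInteger x, BigInteger y) {
        return x.xor(y);
    }

    /** Returns the candidate ID closest to target, or null if there are no candidates. */
    static BigInteger closest(BigInteger target, List<BigInteger> candidates) {
        BigInteger best = null;
        for (BigInteger id : candidates) {
            if (best == null || distance(target, id).compareTo(distance(target, best)) < 0) {
                best = id;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 4-bit IDs for readability; real Kademlia IDs are 160 bits long.
        BigInteger target = new BigInteger("0011", 2);
        List<BigInteger> nodes = Arrays.asList(
                new BigInteger("0101", 2),
                new BigInteger("0010", 2),
                new BigInteger("1011", 2));
        System.out.println("closest to 0011: " + closest(target, nodes).toString(2));
    }
}
```

In the example, the ID 0010 shares the longest prefix (001) with the target 0011, and it is also the one with the smallest XOR distance.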

Node State

In Kademlia, nodes store contact information about each other in order to route query messages: for each 0 ≤ i < 160, every node keeps a list of triplets (IP Address, UDP Port, NodeId) for nodes at a distance between $2^i$ and $2^{i+1}$ from itself. These lists are called k-buckets. Each Kademlia node also has a Routing Table: a binary tree whose leaves are k-buckets. Moreover, Kademlia uses a replication parameter k, which specifies on how many nodes a data item should be replicated, as well as the size of a node's routing table.
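The bucket in which a contact belongs follows directly from the XOR distance: a node at distance d with 2^i ≤ d < 2^(i+1) falls into bucket i, which is the position of the highest set bit of d. The sketch below illustrates this; the class BucketIndex is a hypothetical name, and returning -1 for the node's own ID is a convention assumed here.

```java
import java.math.BigInteger;

/** Index of the k-bucket a contact falls into (hypothetical helper class). */
final class BucketIndex {

    /**
     * Returns i such that 2^i <= (self XOR other) < 2^(i+1),
     * or -1 when other equals self (distance 0 belongs to no bucket).
     */
    static int bucketIndex(BigInteger self, BigInteger other) {
        // bitLength() is the position of the highest set bit plus one.
        return self.xor(other).bitLength() - 1;
    }

    public static void main(String[] args) {
        BigInteger self = new BigInteger("0011", 2); // 4-bit IDs for readability
        for (String id : new String[] {"0010", "0101", "1011"}) {
            System.out.println(id + " -> bucket " + bucketIndex(self, new BigInteger(id, 2)));
        }
    }
}
```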

Kademlia Protocol

The Kademlia protocol consists of four Remote Procedure Calls (RPCs):

• PING: probes a node to see if it’s online.

• STORE: instructs a node to store a (key, value) pair for later retrieval.

• FIND_NODE: takes a 160-bit key as an argument. The recipient of the RPC returns information about the k nodes it knows that are closest to the target ID.

• FIND_VALUE: behaves like FIND_NODE, returning the k nodes closest to the target identifier, with one exception: if the RPC recipient has received a STORE for the given key, it just returns the stored value.
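The difference between FIND_NODE and FIND_VALUE can be sketched with an in-memory model of the recipient side. This is a simplified illustration, not the API of any Kademlia library: the class RpcSketch and its members are hypothetical, contacts are reduced to bare node IDs (the real protocol returns contact triplets), and no networking is involved.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Recipient-side view of STORE, FIND_NODE and FIND_VALUE (hypothetical, in-memory only). */
final class RpcSketch {

    /** Reply to FIND_VALUE: either a value or a list of closer nodes. */
    static final class Reply {
        final String value;           // non-null only if the key was stored on this node
        final List<BigInteger> nodes; // closest known nodes otherwise

        Reply(String value, List<BigInteger> nodes) {
            this.value = value;
            this.nodes = nodes;
        }
    }

    private final Map<BigInteger, String> stored = new HashMap<>();
    private final List<BigInteger> knownNodes = new ArrayList<>();
    private final int k;

    RpcSketch(int k) {
        this.k = k;
    }

    void addContact(BigInteger nodeId) {
        knownNodes.add(nodeId);
    }

    /** STORE: keep the pair for later retrieval. */
    void store(BigInteger key, String value) {
        stored.put(key, value);
    }

    /** FIND_NODE: the (at most) k known nodes closest to the target by XOR distance. */
    List<BigInteger> findNode(BigInteger target) {
        List<BigInteger> sorted = new ArrayList<>(knownNodes);
        sorted.sort((a, b) -> a.xor(target).compareTo(b.xor(target)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** FIND_VALUE: the stored value if present, otherwise the same answer as FIND_NODE. */
    Reply findValue(BigInteger key) {
        String value = stored.get(key);
        if (value != null) {
            return new Reply(value, Collections.<BigInteger>emptyList());
        }
        return new Reply(null, findNode(key));
    }

    public static void main(String[] args) {
        RpcSketch node = new RpcSketch(2);
        node.addContact(BigInteger.valueOf(1));
        node.addContact(BigInteger.valueOf(4));
        node.addContact(BigInteger.valueOf(7));
        node.store(BigInteger.valueOf(5), "21.5");
        System.out.println(node.findValue(BigInteger.valueOf(5)).value);
        System.out.println(node.findValue(BigInteger.valueOf(6)).nodes);
    }
}
```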

2.6 Mirai Botnet

A botnet is a logical collection of Internet-connected devices, such as computers, smartphones or IoT devices, whose security has been breached and whose control has been ceded to a third party. Each compromised device, known as a "bot", is created when a device is penetrated by software from a malware (malicious software) distribution. The controller of a botnet (called C&C, Command and Control) is able to direct the activities of these compromised devices through communication channels [15].

Mirai is a malware that turns networked devices running Linux into remotely controlled bots that can be used as part of a botnet in large-scale network attacks [31]. It primarily targets online consumer devices such as IP cameras and home routers. The Mirai botnet was first found in August 2016. Devices infected by Mirai continuously scan the Internet for the IP addresses of Internet of Things (IoT) devices. Mirai includes a table of IP address ranges that it will not infect, including private networks and addresses allocated to the United States Postal Service and Department of Defense. Mirai then identifies vulnerable IoT devices using a table of more than 60 common factory default usernames and passwords, and logs into them to infect them with the Mirai malware. Infected devices continue to function normally, except for occasional sluggishness and an increased use of bandwidth. A device remains infected until it is rebooted, which may involve simply turning the device off and, after a short wait, turning it back on. After a reboot, unless the login password is changed immediately, the device will be reinfected within minutes. Upon infection, Mirai identifies any "competing" malware, removes it from memory, and blocks remote administration ports.

Mirai identifies victim IoT devices by first entering a rapid scanning phase in which it asynchronously and "statelessly" sends TCP SYN probes to pseudo-random IPv4 addresses, excluding those in a hard-coded IP blacklist, on Telnet TCP ports 23 and 2323. If an IoT device responds to the probe, the attack enters a brute-force login phase. During this phase, the attacker tries to establish a Telnet connection using predetermined username and password pairs from a list of credentials. Most of these logins are default usernames and passwords from the IoT vendor. If the IoT device allows Telnet access, the victim's IP address, along with the successfully used credentials, is sent to a collection server.

2.7 Tools

To gather kernel-level data from our simulation devices we used sysdig [32]: a simple tool for deep system visibility, with native support for containers. Sysdig instruments physical and virtual machines at the OS level by installing into the Linux kernel and capturing system calls and other OS events. Sysdig also makes it possible to create trace files for system activity, similarly to what you can do for networks with tools like Wireshark [33].

Wireshark is a free and open-source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education. Wireshark is a data capturing program that "understands" the structure (encapsulation) of different networking protocols. It can parse and display the fields, along with their meanings, as specified by different networking protocols. Wireshark uses .pcap files to store data related to captured packets [34]. We used Wireshark to collect network-level data.


Chapter 3 Related Work

Several approaches to Intrusion Detection Systems in the IoT ecosystem have been proposed in the literature [18]. A first distinction can be made between centralized and distributed approaches. In IoT networks, the IDS can be placed in the border router, in one or more dedicated hosts, or in every physical object.

In the centralized IDS placement, the IDS is placed in a centralized component, for example in the border router or in a dedicated host. All the data that nodes gather and transmit to the Internet cross the border router, as do the requests that Internet clients send to the nodes. Therefore, an IDS placed in the border router can analyze all the traffic exchanged between the network components and the Internet. The advantage of placing the IDS in the border router is the detection of intrusion attacks from the Internet against the objects in the physical domain. Kasinathan et al. [35] employed the centralized placement, but they took into consideration only IDS protection against DoS (Denial of Service) attacks. The authors decided to deploy the IDS analysis engine and the IDS reporting system in a powerful dedicated host. They deployed the IDS sensors in the network nodes, which were responsible for sniffing the network traffic and sending this data to the IDS analysis engine. The IDS dedicated host is connected by wire to the IDS sensors, avoiding the transmission of IDS data and regular network data over the same wireless network. Therefore, if a DoS attack degrades the wireless transmission quality, IDS data transmission is not affected. Wallgreen et al. [36] proposed a centralized approach in which the IDS is placed in the border router. The objective of the proposed solution is to detect attacks within the physical domain. However, analyzing the traffic that traverses the border router is not enough to detect attacks that involve only nodes within the network. Researchers must therefore design IDSs that can monitor the traffic exchanged between nodes, without ignoring the impact that this monitoring activity may have on the operation of low-capacity nodes. Also, a centralized IDS may have difficulty monitoring the nodes during an attack that compromises part of the network. Moreover, systems like these have a single point of failure, so the attacker could try to attack the intrusion detection agent directly in order to make the entire network vulnerable.

In the distributed IDS placement, IDS agents are placed in every physical object of the network [37]. Nodes may also be responsible for monitoring their neighbors; nodes that do so are referred to as watchdogs. Cervantes et al. [38] proposed a solution called INTI (Intrusion detection of Sinkhole attacks on 6LoWPAN for Internet of ThIngs) that combines concepts of trust and reputation with watchdogs for detecting and mitigating attacks. First, nodes are classified as leader, associated or member nodes, composing a hierarchical structure. The role of each node can change over time due to network reconfiguration or an attack event. Then, each node monitors a superior node by estimating its inbound and outbound traffic. When a node detects an attack, it broadcasts a message to alert the other nodes and to isolate the attacker.
The authors did not discuss the impact of the solution on low-capacity nodes. Placing the IDS in the nodes might decrease the communication overhead associated with network monitoring, but requires more resources (processing, storage, and energy) from them. Distributing IDS agents across some dedicated nodes might be a solution to meet the requirements for less monitoring traffic and more processing capacity.

As reported in [39], based on the method applied to detect the intrusion, IDSs can be classified as misuse-based (also called signature-based) or anomaly-based. In signature-based approaches, IDSs detect attacks when system or network behavior matches an attack signature stored in the IDS internal databases. If any system or network activity matches stored patterns or signatures, an alert is triggered. Signature-based IDSs are accurate and very effective at detecting known threats, and their mechanism is easy to understand. In [40], the authors proposed a distributed IDS for a Wireless Sensor Network based on a defined number of rules. Oh et al. [41] proposed a distributed signature-based lightweight IDS, defining an algorithm to match attack signatures and packet payloads. However, this approach is ineffective at detecting new attacks and variants of known attacks, because a matching signature for these attacks is still unknown. The main disadvantage of misuse-based IDSs is the need for frequent manual updates of the database of rules and signatures.

Anomaly-based IDSs compare the activities of a system at a given instant against a normal behavior profile and generate an alert whenever the deviation from normal behavior exceeds a threshold. However, anything that does not match normal behavior is considered an intrusion, and learning the entire scope of normal behavior is not a simple task. As a result, this method usually has high false positive rates. Cho et al. [42] proposed a detection scheme for botnets using the anomaly-based method. The authors assumed that botnets cause unexpected changes in the traffic of 6LoWPAN sensor nodes.
The proposed solution computes the average of three metrics to compose the normal behavior profile: the sum of the TCP control field, the packet length, and the number of connections of each sensor. Then, the system monitors network traffic and raises an alert when the metrics for any node violate the computed averages. Summerville et al. [43] developed a deep-packet anomaly detection approach that aims to run on resource-constrained IoT devices. The authors argue that small IoT devices use few and relatively simple protocols, resulting in network payloads that are highly similar. Based on this idea, they use a technique called bit-pattern matching to perform feature selection. Network payloads are treated as a sequence of bytes, and the feature selection operates on overlapping tuples of bytes, called n-grams. A match between a bit-pattern and an n-gram occurs when the corresponding bits match in all positions.

The growth of computational intelligence has brought major advantages to the development of anomaly-based IDSs, whose aim is to model the normal system behavior and identify anomalies as deviations from the learned behavior. The idea of Gupta et al. [44] is to apply Computational Intelligence algorithms to build normal behavior profiles for network devices. For each different IP address assigned to a device, there would be a distinct normal behavior profile. In [45], an online network IDS based on an autoencoder trained in an unsupervised way has been proposed. In [46], the authors survey different data mining and machine learning techniques adopted for cyber security intrusion detection.

Although the majority of IDSs in IoT focus on network flow analysis, other works have proposed host-based IDSs. Mudgerikar et al. [47] present a host-based anomaly detection system for IoT devices in which collecting system-level information, such as running process parameters and their system calls, in an autonomous, efficient, and scalable manner helps in detecting anomalous behaviors. To the best of our knowledge, there are no works that combine network data sources and system-level information to detect intrusions in the IoT ecosystem.


Chapter 4 Approach

4.1 Smart and Not-So-Smart devices

Our scenario consists of a common Smart Home environment composed of two basic kinds of devices: smart devices and what we call "not-so-smart" devices. The former are all the devices that run an almost complete Operating System (e.g. Android Things), which allows the user to install third-party programs and components, and that are not severely constrained in resources such as computing power and battery. The so-called not-so-smart devices, instead, are those that run a very simple operating system and only pre-installed and/or proprietary software, allowing the user to use only predefined applications. This distinction is needed because the idea is to implement an architecture in which smart devices store and maintain data and communicate with each other, and each of them is responsible for a certain number of not-so-smart devices, which instead communicate only with the corresponding smart device.


4.2 System Overview

Our architecture is composed of a certain number of smart devices such as Smart TVs, Smart Speakers (e.g. Google Home, Amazon Echo), Smart Fridges and so on. Every smart device is responsible for a certain number of not-so-smart devices such as, for example, the temperature sensors spread all over the house and used by a smart heating system. Smart devices are interconnected through a common home network, communicate and share application data on a Distributed Hash Table (DHT), and are required to periodically put on the DHT data related to their behavior. Time is ideally divided into slots of predefined and constant duration: at the end of each slot, every node puts on the DHT the data it has collected, which summarize its behavior. Each smart node contains an Intrusion Detection System agent that examines the behavior of the other nodes, analyzing it on three different levels: kernel, network and DHT [48]. A reputation system is integrated in the distributed network. When a node detects suspicious activity, it collaborates in assigning a low reputation level to the responsible node and eventually starts a procedure to exclude the compromised node from the network. Figure 4.1 shows the overall architecture.

[Figure 4.1: overall architecture. Recoverable labels: Internet; P2P Network (DHT); IDS agents; DHT, Network and Kernel levels.]


4.3 Threat Model

In our scenario we consider that one of the smart devices in the system could be compromised. Compromised means that the device has been subjected to an intrusion: for example, malware could have been installed on it, or an attacker could have gained access to a remote command interface through which commands can be executed on the device. An intrusion can impact different security properties: confidentiality, if the attacker's objective is to steal private information; availability, if the attacker performs a Denial of Service (DoS) attack on other nodes of the system; and integrity, if the attacker sends false information to the other nodes.

4.4 Distributed Architecture

The proposed smart home architecture adopts a peer-to-peer (P2P) approach that exploits a Distributed Hash Table (DHT) indexing scheme to organize the smart home network nodes. A DHT provides a lookup service similar to a hash table. (key, value) pairs are stored locally on a certain number of nodes, and any participating node can efficiently retrieve the value associated with a given key without needing to know the node on which it is actually stored. Keys are unique identifiers that map to particular values, which in turn can be anything from addresses to documents to arbitrary data [30]. By exploiting a P2P network, the intrusion information is distributed among all network nodes (smart devices), each of which contains its own IDS agent. Such distribution makes it possible to analyze the behavior from different viewpoints, thus increasing the chance of detecting an anomaly.
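The lookup service described above can be illustrated with a small in-memory model. This is a sketch under several assumptions, not the implementation used in this work: the class DhtSketch and the resource names are hypothetical, keys and node IDs are derived with SHA-1 (a common choice for 160-bit key-spaces, assumed here), each pair is stored on the k nodes whose IDs are closest to the key by XOR distance, and all nodes live in the same process.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of DHT put/get with k-fold replication (hypothetical, no networking). */
final class DhtSketch {
    // nodeId -> local (key, value) store of that node
    private final Map<BigInteger, Map<BigInteger, String>> nodes = new HashMap<>();
    private final int k;

    DhtSketch(int k, String... nodeNames) {
        this.k = k;
        for (String name : nodeNames) {
            nodes.put(keyOf(name), new HashMap<BigInteger, String>());
        }
    }

    /** Assumption: 160-bit identifiers are obtained by hashing a name with SHA-1. */
    static BigInteger keyOf(String name) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: read the digest as non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** The (at most) k nodes whose IDs are closest to the key by XOR distance. */
    List<BigInteger> responsibleNodes(BigInteger key) {
        List<BigInteger> ids = new ArrayList<>(nodes.keySet());
        ids.sort((a, b) -> a.xor(key).compareTo(b.xor(key)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    void put(String name, String value) {
        BigInteger key = keyOf(name);
        for (BigInteger id : responsibleNodes(key)) {
            nodes.get(id).put(key, value);
        }
    }

    /** The caller never needs to know which node actually holds the value. */
    String get(String name) {
        BigInteger key = keyOf(name);
        for (BigInteger id : responsibleNodes(key)) {
            String value = nodes.get(id).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Number of nodes currently holding a copy of the named resource. */
    int replicaCount(String name) {
        BigInteger key = keyOf(name);
        int count = 0;
        for (Map<BigInteger, String> store : nodes.values()) {
            if (store.containsKey(key)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DhtSketch dht = new DhtSketch(2, "tv", "speaker", "fridge");
        dht.put("room1/temperature", "21.5");
        System.out.println(dht.get("room1/temperature")
                + " stored on " + dht.replicaCount("room1/temperature") + " nodes");
    }
}
```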


4.5 Reputation Mechanism

To determine the behavior of a node, a reputation mechanism is needed. If a specific node, analyzing the information shared in the DHT, detects malicious behavior by another node, it puts on the DHT a resource containing that information, inviting the other nodes to exclude it from the network. At the same time, we cannot allow a malicious node to exclude benign nodes by declaring them malicious when they are not. To this end, a distributed reputation mechanism is needed to assign, in a cooperative way, a reputation score [49]. When a node finds on the DHT an entry declaring a node malicious, it evaluates the trust of that entry, relying on its confidence in the node that put that resource on the DHT.

The main properties of a reputation system are the representation of reputation, how the reputation is built and updated, and, for the latter, how the ratings of others are considered and integrated. The reputation of a given node is the collection of ratings maintained by others about this node.

An example of a reputation mechanism that could be adopted in our system is described in [50]. In this approach, a node i maintains two ratings about every other node j. The reputation rating represents the opinion formed by node i about node j's behavior as an actor in the system. The trust rating represents node i's opinion about how honest node j is as an actor in the reputation system (i.e. whether the first-hand information summaries published by node j are likely to be true). The ratings that node i has about node j are represented as data structures $R_{i,j}$ for reputation and $T_{i,j}$ for trust. In addition, node i maintains a summary record of first-hand information about node j in a data structure called $F_{i,j}$.

When node i makes a first-hand observation of node j's behavior, the first-hand information $F_{i,j}$ and the reputation rating $R_{i,j}$ are updated. From time to time, nodes publish their first-hand information to their neighbors. Say that node i receives from k some first-hand information $F_{k,j}$ about node j. If k is classified as "trustworthy" by i, or if $F_{k,j}$ is close to $R_{i,j}$, then $F_{k,j}$ is accepted by i and is used to slightly modify the rating $R_{i,j}$. Otherwise, the reputation rating is not updated. In all cases, the trust rating $T_{i,k}$ is updated: if $F_{k,j}$ is close to $R_{i,j}$, the trust rating $T_{i,k}$ slightly improves, otherwise it slightly worsens. The updates are based on a modified Bayesian approach. Only the first-hand information $F_{i,j}$ is published; the reputation and trust ratings $R_{i,j}$ and $T_{i,j}$ are never disseminated.
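The acceptance rule described above can be sketched as follows. This is a deliberately simplified illustration and does not reproduce the exact update rules of [50]: ratings are modeled as a pair of evidence counters with a uniform prior, "close" is decided by comparing expected values against a threshold, and the constants DEVIATION, WEIGHT and TRUST_LIMIT, as well as all class and method names, are hypothetical.

```java
/** Simplified handling of a second-hand report (hypothetical names and parameters). */
final class ReputationSketch {

    /** Evidence counters: 'bad' counts negative evidence, 'good' counts positive evidence. */
    static final class Rating {
        double bad;
        double good;

        Rating(double bad, double good) {
            this.bad = bad;
            this.good = good;
        }

        /** Expected probability of misbehavior (or of dishonest reporting, for trust ratings). */
        double expected() {
            return bad / (bad + good);
        }
    }

    // Hypothetical parameters, chosen only for illustration.
    static final double DEVIATION = 0.25;   // a report is "close" if it deviates less than this
    static final double WEIGHT = 0.1;       // how strongly an accepted report moves R(i,j)
    static final double TRUST_LIMIT = 0.5;  // k is "trustworthy" if T(i,k) is below this value

    /**
     * Node i processes the first-hand information F(k,j) published by node k about node j.
     * Returns true if the report was accepted and merged into R(i,j).
     */
    static boolean onReport(Rating rIJ, Rating tIK, Rating fKJ) {
        boolean close = Math.abs(fKJ.expected() - rIJ.expected()) < DEVIATION;
        boolean trustworthy = tIK.expected() < TRUST_LIMIT;
        boolean accepted = trustworthy || close; // decided before the trust update below

        if (accepted) {
            // Slightly modify the reputation rating with the reported evidence.
            rIJ.bad += WEIGHT * fKJ.bad;
            rIJ.good += WEIGHT * fKJ.good;
        }
        // The trust rating is updated in all cases.
        if (close) {
            tIK.good += 1.0;
        } else {
            tIK.bad += 1.0;
        }
        return accepted;
    }

    public static void main(String[] args) {
        Rating reputation = new Rating(1, 1); // uniform prior: no opinion about j yet
        Rating trust = new Rating(1, 1);      // uniform prior: no opinion about k yet
        boolean accepted = onReport(reputation, trust, new Rating(9, 1));
        System.out.println("accepted=" + accepted + " trust=" + trust.expected());
    }
}
```

In this sketch an unknown reporter whose claim strongly disagrees with the local opinion is ignored and loses trust, which is the property that prevents a single malicious node from excluding benign nodes by itself.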

4.6 Multi-Level Intrusion Detection System

The Intrusion Detection System (IDS) is integrated in every smart node of the network. It is based on a feature extraction component and a pre-trained binary machine learning classifier able to distinguish normal from malicious node behavior using the extracted features. The feature extraction component models the node behavior considering multiple abstraction layers, corresponding to kernel, network and DHT. Data collected at kernel level consist of the number of invocations of each system call in a given list, which summarizes the internal behavior of the device. The network data describe the packets exchanged between smart nodes; the network traffic is used to identify unusual traffic flows. Finally, to better characterize the node behavior, the classifier considers data collected at DHT level, which represent the number and type of operations performed by nodes on the DHT. The complete list of the features extracted and used by the classifier is shown in Table 4.1. Together, the features extracted at the different abstraction levels provide a complete behavioral characterization, useful to detect multiple types of malicious intrusions that can attack a node on different layers.


Data Level | Feature Group | Feature Description
Kernel | epoll_wait | Wait for an I/O event on an epoll file descriptor.
Kernel | read | Read from a file descriptor.
Kernel | mprotect | Set protection on a region of memory.
Kernel | mmap2 | Map files into memory.
Kernel | close | Close a file descriptor.
Kernel | openat | Open and possibly create a file.
Kernel | fstat64 | Get file status.
Kernel | futex | Fast user-space locking.
Kernel | rt_sigaction | Examine and change a signal action.
Kernel | recvmsg | Receive a message from a socket.
Kernel | stat64 | Get file status.
Kernel | fcntl | Manipulate file descriptor.
Kernel | getdents64 | Get directory entries.
Kernel | brk | Change data segment size.
Kernel | poll | Wait for some event on a file descriptor.
Kernel | write | Write to a file descriptor.
Kernel | uname | Get name and information about current kernel.
Kernel | pipe | Create pipe.
Network | total_packets¹ | Total packets.
Network | total_volume¹ | Total bytes.
Network | pktl¹,² | Packet size.
Network | lat¹,² | Amount of time between two packets.
Network | duration | Duration of the flow.
Network | active² | Amount of time the flow was active.
Network | idle | Amount of time the flow was idle.
Network | sflow_packets¹ | Number of packets in a sub flow.
Network | sflow_bytes¹ | Number of bytes in a sub flow.
Network | psh_cnt¹ | Number of times the PSH flag was set.
Network | urg_cnt¹ | Number of times the URG flag was set.
Network | total_hlen¹ | Total bytes used for headers.
DHT | GET | Number of GET operations performed on the DHT.
DHT | PUT | Number of PUT operations performed on the DHT.

Table 4.1: Feature vector representation. Network features are grouped for reasons of space. ¹: two distinct features, one for the backward and one for the forward direction. ²: four distinct features: minimum, mean, maximum and standard deviation.
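To make the aggregation of the three levels concrete, the sketch below assembles one sample for a time slot by concatenating kernel, network and DHT features in a fixed order. It is only an illustration: the class FeatureVectorSketch and its method are hypothetical, the network features are passed as an already expanded array (per direction and per statistic), and the actual layout used by the classifier in this work may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Builds one per-slot sample from kernel, network and DHT data (hypothetical layout). */
final class FeatureVectorSketch {

    // System calls listed in Table 4.1, in a fixed order so that every sample has the same layout.
    static final List<String> SYSCALLS = Arrays.asList(
            "epoll_wait", "read", "mprotect", "mmap2", "close", "openat",
            "fstat64", "futex", "rt_sigaction", "recvmsg", "stat64", "fcntl",
            "getdents64", "brk", "poll", "write", "uname", "pipe");

    /**
     * Concatenates: one count per system call (0 if the call was not observed),
     * the network features as given, then the number of DHT GET and PUT operations.
     */
    static double[] build(Map<String, Integer> syscallCounts,
                          double[] networkFeatures, int dhtGets, int dhtPuts) {
        double[] vector = new double[SYSCALLS.size() + networkFeatures.length + 2];
        int i = 0;
        for (String syscall : SYSCALLS) {
            vector[i++] = syscallCounts.getOrDefault(syscall, 0);
        }
        for (double feature : networkFeatures) {
            vector[i++] = feature;
        }
        vector[i++] = dhtGets;
        vector[i] = dhtPuts;
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("read", 3);
        counts.put("write", 2);
        // Two hypothetical network values, e.g. total packets and total bytes in one direction.
        double[] sample = build(counts, new double[] {10.0, 1500.0}, 4, 1);
        System.out.println(Arrays.toString(sample));
    }
}
```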


Chapter 5 Implementation

5.1 Simulation environment

In order to gather useful data for the analysis, we built a simulation environment using three Raspberry Pi 2 Model B boards. The Raspberry Pi 2 Model B is a small single-board computer with the following main features:

• Broadcom BCM2836 900MHz quad-core ARM Cortex-A7 CPU.

• 1 GB SDRAM.

• 4 USB 2.0 ports

• 10/100 Mbit/s Ethernet

Arch Linux, a simple and lightweight Linux distribution, has been installed as the Operating System on each Raspberry Pi. In our simulation environment, each smart node manages a certain number of not-so-smart devices distributed in different rooms. The not-so-smart devices considered are temperature and motion sensors. The motion sensors register entrances to and exits from the rooms, sending values to the corresponding smart node when the action occurs, while the temperature sensors periodically send their updates. Data are maintained on the DHT to share the information with other nodes.

Figure 5.1: Simulation Environment used for data collection.

To perform our analysis we needed both data extracted during normal system behavior and data extracted when one of the nodes has been compromised. In order to simulate normal behavior, entrances, exits and temperature changes are simulated through random values generated by a Java program. Each node periodically performs a random action (selected among room entrance or exit registration, temperature value update, and requests for the temperature value or the number of people in a room) at a random time instant extracted from a specific interval.

To simulate a compromised node, we installed the Mirai malware on one of the Raspberry Pi boards representing a smart node. Mirai is a worm-like family of malware that infects IoT devices and corrals them into a DDoS botnet [31]. A Mirai-infected device, even without receiving an explicit attack command from its C&C, periodically scans the network and tries to infect other reachable devices, generating network traffic and altering the normal device behavior. When Mirai identifies a potential victim, it enters a brute-force login phase in which it attempts to establish a Telnet or SSH connection using username and password pairs selected randomly from a list. The list of normal and malicious actions is summarized in Table 5.1.

Action | Description | Actor | Behavior
GET temperature | Temperature value request | Temperature Sensor | Normal
PUT temperature | Temperature value update | Temperature Sensor | Normal
PUT Entrance | Person room entrance | Motion Sensor | Normal
PUT Exit | Person room exit | Motion Sensor | Normal
Network Scanning | Telnet or SSH connection looking for other vulnerable devices | Mirai | Malicious
DDoS Attack | UDP Flooding | Mirai | Malicious

Table 5.1: Normal/Malicious behavior actions.

5.2 Kademlia Implementation

We decided to adopt an already deployed Java implementation of Kademlia: an open-source project started by the developer Joshua Kissoon¹, which contains all the basic Kademlia features we are interested in. On top of it we defined an application that exploits the DHT provided and managed by Kademlia. The central component of our application is defined in SmartNode.java, a Java class representing a smart node (Fig. 5.2). In our simulation environment each node manages a certain number of not-so-smart devices, distributed in different rooms. For each room, every entrance and exit is registered and the temperature is measured periodically, in order to maintain on the distributed hash table the temperature value and the number of people in every room. Moreover, a certain number of classes implementing the Java interface KadContent, provided by the Kademlia implementation, have been defined in order to represent the resources stored on the DHT (Fig. 5.2). We defined the following contents:

• DHTContentPeople.java: Java class responsible for storing the number of people in a certain room on the DHT.

• DHTContentTemperature.java: Java class responsible for storing the last measured temperature value for a certain room on the DHT.

• DHTNodeResources.java: Java class that maintains a list of all the resources managed by a specific smart node.

• DHTLogContent.java: Java class used to store on the DHT log data of different types, from the DHT operations made to network and kernel data of a specific node.

5.3 Mirai Malware

The Mirai malware code has been taken from a public GitHub repository², made available by the code author himself. In order to use it we made some modifications. We did not need to implement a C&C: it was sufficient to modify the code so that it periodically scans the network looking for other vulnerable devices and performs a Distributed Denial of Service (DDoS) attack, without sending information to, or receiving orders from, anyone outside the network.


public class SmartNode {
    private JKademliaNode kadNode;
    private ArrayList<KadContent> contents;
    private String name;
    private DHTLogContent dhtLog;
    private DHTLogContent userInteraction;
    private DHTNodeResources resources;
    private ScheduledExecutorService dataCollectorScheduler;
    private ScheduledExecutorService temperatureScheduler;

    public SmartNode(int nodeId, InetAddress ip) throws IOException {...}
    public SmartNode(int nodeId, int port) throws IOException {...}
    public SmartNode(int nodeId, int port, InetAddress ip) {...}

    public void joinDHT(Node root) throws RoutingException, IOException {...}
    public void joinDHT() throws RoutingException, IOException {...}

    public void addContent(KadContent content) throws IOException {...}
    public void addContent(DHTContentPeople content) throws IOException {...}
    public void addContent(DHTContentTemperature content) throws IOException {...}
    public KadContent getContent(String type, int roomId) throws IOException {...}

    public void registerUserInteraction(long time) throws IOException {...}
    public void updateNodeInfo() throws IOException {...}
    public void updateLog(String op, KadContent contentUpdated) throws IOException {...}

    public void addRooms(int nrooms) throws IOException {...}
    public void personIn(int roomId) throws IOException {...}
    public void personOut(int roomId) throws IOException {...}

    public JKademliaNode getKadNode() {...}
    public String getName() {...}
}

Figure 5.2: SmartNode.java

public class DHTLogContent implements KadContent {
    private KademliaId key;
    private String data;
    private String ownerId;
    private final long createTs;
    private long updateTs;
    private String type;

    public DHTLogContent(String ownerId, String type) {...}
    public DHTLogContent(KademliaId key, String ownerId, String type) {...}

    public void setData(String newData) {...}
    public void appendData(String newData) {...}
    public void reset() {...}

    public String getData() {...}
    public KademliaId getKey() {...}
    public String getType() {...}
    public String getOwnerId() {...}

    public void setUpdated() {...}
    public long getCreatedTimestamp() {...}
    public long getLastUpdatedTimestamp() {...}

    public byte[] toSerializedForm() {...}
    public DHTLogContent fromSerializedForm(byte[] data) {...}
    public String toString() {...}
}

Figure 5.3: DHTLogContent.java

5.4 Dataset Production

We needed to introduce random behavior in our application in order to produce valid data to train machine learning classifiers. Each Smart Node, at a random instant extracted from a certain time interval, randomly generates three values: command, room and node. Command represents the command to execute; it can be one of the following:

• PERSON_IN: registration of an entrance in a specific room (PUT operation on the DHT).

• PERSON_OUT: registration of an exit from a specific room (PUT operation on the DHT).

• GET_NUMPEOPLE: request the number of people in a specific room (GET operation on the DHT).

• GET_TEMPERATURE: request the internal temperature of a specific room (GET operation on the DHT).

Room specifies the room to which the command refers. Node states the node to which the request (GET_NUMPEOPLE or GET_TEMPERATURE) will be made.
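The random generation of the (command, room, node) triple and of the instant at which it is executed can be sketched as follows. This is an illustration only: the class RandomActionSketch, its types and the uniform sampling of each value are assumptions, and the code does not reproduce the generator actually used in our application.

```java
import java.util.Random;

/** Random generation of the next simulated action (hypothetical names, uniform sampling assumed). */
final class RandomActionSketch {

    enum Command { PERSON_IN, PERSON_OUT, GET_NUMPEOPLE, GET_TEMPERATURE }

    static final class Action {
        final Command command;
        final int room;     // room the command refers to
        final int node;     // node the request is addressed to (relevant for GET commands)
        final long delayMs; // waiting time before the action is performed

        Action(Command command, int room, int node, long delayMs) {
            this.command = command;
            this.room = room;
            this.node = node;
            this.delayMs = delayMs;
        }
    }

    /** Draws command, room and node uniformly, and a delay in [minDelayMs, maxDelayMs). */
    static Action next(Random rng, int rooms, int nodes, long minDelayMs, long maxDelayMs) {
        Command[] commands = Command.values();
        Command command = commands[rng.nextInt(commands.length)];
        int room = rng.nextInt(rooms);
        int node = rng.nextInt(nodes);
        long delayMs = minDelayMs + (long) (rng.nextDouble() * (maxDelayMs - minDelayMs));
        return new Action(command, room, node, delayMs);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        Action action = next(rng, 3, 3, 1000L, 5000L); // hypothetical sizes and interval
        System.out.println(action.command + " room=" + action.room
                + " node=" + action.node + " after " + action.delayMs + " ms");
    }
}
```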

For each configuration we performed three simulations: one without a malicious node in the network, in order to collect normal behavior data; one with a malicious node that periodically scans the network looking for new possible victims (Mirai scanner); and one with a malicious node that performs a Mirai DDoS attack (Mirai DDoS).

5.5 Data Collection

As mentioned in Section 4.2, time is slotted. Every kind of collected data refers to a specific time slot. During the collection phase, in each slot we collect kernel, network, and DHT data. To gather kernel-level data from our simulation devices, we used sysdig [32], a tool for deep system visibility used to capture system calls and other OS events. The network-level data are captured using Wireshark [33], a free and open-source packet analyzer that allows storing network-related data in a specific file format. Finally, the DHT-level data are collected at the end of each time slot, when every node puts on the DHT a resource containing all the operations (GET and PUT) performed by that node on the DHT. We performed five collection campaigns using time slots of 5, 10, 15, 20, and 30 seconds. In each time slot, we collected data related both to the compromised node and to the other nodes. We labeled each action as malicious or normal behavior with respect to the node on which it occurred. For each time slot configuration we performed simulations lasting several hours, with and without an infected node in the system. The simulation duration depends on the considered time slot: the shorter the time slot, the more samples we collect in a given amount of time, so longer time slots required longer simulations.
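The relation between slot length, simulation duration and number of samples can be made concrete with a small sketch. It assumes one sample per node per complete time slot and a hypothetical helper class; the durations used in the usage note are illustrative and are not the ones used in the experiments.

```java
/** Sketch of the time-slot arithmetic; names and sample model are assumptions. */
class TimeSlots {

    /** Index of the slot that contains the timestamp (slot 0 starts at startMillis). */
    static long slotIndex(long timestampMillis, long startMillis, int slotSeconds) {
        if (slotSeconds <= 0 || timestampMillis < startMillis) {
            throw new IllegalArgumentException("invalid slot length or timestamp");
        }
        return (timestampMillis - startMillis) / (slotSeconds * 1000L);
    }

    /** Complete slots, hence samples per node, in a simulation of the given length. */
    static long samplesPerNode(long simulationSeconds, int slotSeconds) {
        if (slotSeconds <= 0 || simulationSeconds < 0) {
            throw new IllegalArgumentException("invalid duration or slot length");
        }
        return simulationSeconds / slotSeconds;
    }

    /** Simulation length needed to reach a target number of samples per node. */
    static long requiredSeconds(long targetSamples, int slotSeconds) {
        if (slotSeconds <= 0 || targetSamples < 0) {
            throw new IllegalArgumentException("invalid target or slot length");
        }
        return targetSamples * slotSeconds;
    }
}
```

Under these assumptions, one hour of simulation yields 720 samples per node with 5-second slots but only 120 with 30-second slots, so collecting 720 samples with 30-second slots takes six hours.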

Kernel level

Data collected at kernel level consist of the list and count of system calls performed by a smart node in a certain time period. This information is collected through Sysdig [32], one of the tools explained in Section 2.7. Each node periodically executes the following Linux command:

$ sysdig -q -S

Option -S prints the event summary (i.e. the list of the top events) when the capture ends. Option -q suppresses the printing of events during the capture process. Fig. 5.4 shows an example of the (partial) output generated by running the command for three seconds.
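One possible way to run the capture for exactly one time slot from Java is sketched below. It is not the mechanism used in this work, which the text does not describe. The sketch assumes that the GNU coreutils timeout utility is available on the node and that sysdig treats the interrupt signal as the end of the capture and then prints its summary; the class name and the output file are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch of a per-slot sysdig capture; relies on the coreutils "timeout" tool. */
class SysdigSlotCapture {

    /** Builds the command line: interrupt "sysdig -q -S" after slotSeconds. */
    static List<String> command(int slotSeconds) {
        if (slotSeconds <= 0) {
            throw new IllegalArgumentException("slot length must be positive");
        }
        return Arrays.asList(
                "timeout", "--signal=INT", Integer.toString(slotSeconds),
                "sysdig", "-q", "-S");
    }

    /** Runs one capture and stores whatever sysdig prints in summaryFile. */
    static int captureSlot(int slotSeconds, File summaryFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command(slotSeconds));
        builder.redirectErrorStream(true);    // merge stderr into stdout
        builder.redirectOutput(summaryFile);  // one output file per time slot
        Process process = builder.start();
        return process.waitFor();             // exit status reported by timeout
    }
}
```

Calling captureSlot once per time slot, with a different file each time, would produce one summary per slot. Capturing system calls with sysdig normally requires root privileges.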

Network level

Data collection at network level consists of capturing the data packets exchanged between smart nodes in the network. To collect these data we used tshark [51], a terminal-oriented version of Wireshark (illustrated in Section 2.7) designed for capturing and displaying packets when an interactive user interface is not necessary or available. Each node periodically executes the following Linux command:

$ tshark -w out.pcap

Option -w specifies the output file in which all collected data will be stored. Fig. 5.5 shows an example of the output generated by the command executed for five seconds.

1 0.000000000 169.254.193.98 -> 169.254.43.253 UDP 95 1234 -> 1234 Len=53
2 0.000928907 169.254.43.253 -> 169.254.193.98 UDP 135 1234 -> 1234 Len=93
3 0.012198855 169.254.193.98 -> 169.254.43.253 UDP 635 1234 -> 1234 Len=593
4 0.042612084 169.254.193.98 -> 169.254.43.253 UDP 95 1234 -> 1234 Len=53
5 0.043398803 169.254.43.253 -> 169.254.193.98 UDP 135 1234 -> 1234 Len=93
6 0.054571198 169.254.193.98 -> 169.254.43.253 UDP 672 1234 -> 1234 Len=630
7 0.087006459 Raspberr_df:ed:43 -> Broadcast ARP 60 Who has 9.9.9.10? Tell 169.254.78.248
8 0.087008490 Raspberr_df:ed:43 -> Broadcast ARP 60 Who has 1.1.1.1? Tell 169.254.78.248
9 0.243712500 Raspberr_a3:53:5a -> Broadcast ARP 60 Who has 9.9.9.10? Tell 169.254.193.98
10 0.281841407 169.254.193.98 -> 169.254.43.253 UDP 95 1234 -> 1234 Len=53
11 0.282702084 169.254.43.253 -> 169.254.193.98 UDP 135 1234 -> 1234 Len=93
12 0.294536407 169.254.193.98 -> 169.254.43.253 UDP 819 1234 -> 1234 Len=777
13 0.341770313 169.254.43.253 -> 169.254.193.98 UDP 95 1234 -> 1234 Len=53
14 0.342902396 169.254.193.98 -> 169.254.43.253 UDP 135 1234 -> 1234 Len=93
15 0.354918959 169.254.43.253 -> 169.254.193.98 UDP 634 1234 -> 1234 Len=592
16 0.356558594 169.254.43.253 -> 169.254.193.98 UDP 95 1234 -> 1234 Len=53
17 0.364565261 169.254.193.98 -> 169.254.43.253 UDP 135 1234 -> 1234 Len=93
18 0.369605729 169.254.43.253 -> 169.254.193.98 UDP 710 1234 -> 1234 Len=668
19 0.909864427 Raspberr_e7:7e:20 -> Broadcast ARP 42 Who has 9.9.9.10? Tell 169.254.43.253
20 0.909908646 Raspberr_e7:7e:20 -> Broadcast ARP 42 Who has 1.1.1.1? Tell 169.254.43.253
21 1.126950104 Raspberr_df:ed:43 -> Broadcast ARP 60 Who has 1.1.1.1? Tell 169.254.78.248
22 1.126996042 Raspberr_df:ed:43 -> Broadcast ARP 60 Who has 9.9.9.10? Tell 169.254.78.248
23 1.144816042 Raspberr_a3:53:5a -> Broadcast ARP 60 Who has 8.8.8.8? Tell 169.254.193.98
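To show how the fields of such a packet listing can be read programmatically, the following sketch parses one summary line (frame number, relative time, source, destination, protocol, frame length, info) and totals the bytes per protocol. It is an illustration only and plays no role in the feature extraction, which works on the pcap file. The class name is hypothetical, and the regular expression assumes the column layout shown in the listing above; it accepts the arrow written as "->", as a minus sign followed by ">", or as a single arrow character.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a parser for one line of a tshark-style packet summary. */
class TsharkSummaryLine {

    // number, time, source, arrow, destination, protocol, frame length, info
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+)\\s+(\\d+\\.\\d+)\\s+(.+?)\\s+(?:->|\\u2212>|\\u2192)\\s+"
            + "(\\S+)\\s+(\\S+)\\s+(\\d+)\\s+(.*)$");

    final int number;
    final double time;
    final String source;
    final String destination;
    final String protocol;
    final int length;
    final String info;

    private TsharkSummaryLine(int number, double time, String source,
                              String destination, String protocol,
                              int length, String info) {
        this.number = number;
        this.time = time;
        this.source = source;
        this.destination = destination;
        this.protocol = protocol;
        this.length = length;
        this.info = info;
    }

    /** Parses a single summary line or throws if the layout is not recognised. */
    static TsharkSummaryLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised line: " + line);
        }
        return new TsharkSummaryLine(
                Integer.parseInt(m.group(1)), Double.parseDouble(m.group(2)),
                m.group(3), m.group(4), m.group(5),
                Integer.parseInt(m.group(6)), m.group(7));
    }

    /** Sums the frame lengths of the given lines, grouped by protocol. */
    static Map<String, Integer> bytesPerProtocol(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            TsharkSummaryLine packet = parse(line);
            totals.merge(packet.protocol, packet.length, Integer::sum);
        }
        return totals;
    }
}
```

In the listing, the UDP lines carry the DHT traffic exchanged on port 1234, while the ARP lines are broadcast requests, so a per-protocol total already separates the two kinds of traffic.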

DHT level

Data collected at DHT level consist of a log file containing all the operations performed by a node on the DHT in a time slot. These operations can be PUT or GET commands executed on a specific resource stored on the DHT. Periodically, the log file is consumed by the machine learning component and refreshed at the same time. Fig. 5.6 shows an example of the operations performed by a node during a time slot.

Figure 5.6: A Distributed Hash Table log file regarding the operations of a specific node during a simulation.
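The "consume and refresh" behavior of the per-node log can be sketched as an in-memory structure that is drained once per time slot. This is a minimal illustration, not the actual implementation: the class, the entry fields and the textual line format are all hypothetical, since the exact log layout is not specified in the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a per-node DHT operation log that is drained once per time slot. */
class DhtOperationLog {

    enum Operation { GET, PUT }

    /** One logged operation; the fields and line format are assumptions. */
    static final class Entry {
        final long timestampMillis;
        final Operation operation;
        final String resourceKey;

        Entry(long timestampMillis, Operation operation, String resourceKey) {
            this.timestampMillis = timestampMillis;
            this.operation = operation;
            this.resourceKey = resourceKey;
        }

        String toLogLine() {
            return timestampMillis + " " + operation + " " + resourceKey;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Records one GET or PUT performed by this node on the DHT. */
    synchronized void record(Operation operation, String resourceKey, long timestampMillis) {
        entries.add(new Entry(timestampMillis, operation, resourceKey));
    }

    /** Returns the entries of the current slot and empties the log in one step. */
    synchronized List<Entry> drain() {
        List<Entry> snapshot = new ArrayList<>(entries);
        entries.clear();
        return snapshot;
    }

    /** Counts the entries of a given type, e.g. as a simple DHT-level feature. */
    static int count(List<Entry> slotEntries, Operation operation) {
        int total = 0;
        for (Entry entry : slotEntries) {
            if (entry.operation == operation) {
                total++;
            }
        }
        return total;
    }
}
```

Making record and drain synchronized ensures that an operation logged while the slot is being closed ends up either in the drained snapshot or in the next slot, never in both.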

5.6 Features Extraction

For each time slot we extracted the following files: a .pcap file containing all network data; a .sys file containing the system calls performed by the node; and a .log file containing all the operations performed by the node on the DHT. A feature engineering phase has been applied to the network-level data to extract the main features used to describe the network flow. To this end, a publicly available network feature extractor called flowtbag has been used. For each network flow (a sequence of packets from a source to a destination node) found in the collected pcap file, it extracts a vector containing all the network features described in Table 4.1. As features at kernel level, we considered the number of system calls executed during the system operations. The list of system calls
