Once the machine learning models are ready, a way to actually integrate their usage into the Suricata IDS has been designed. Using the Suricata workflow described in Section 4.4.1 as a reference, some of its modules have been modified. The resulting workflow is shown in Figure 6.4, where the added functionalities have been highlighted with respect to the original workflow. The initialisation phase and the decoding and detection modules have been modified, as explained in the following sections. Moreover, a Python script to perform the classification has been created.
Through the whole integration process, the Tstat API functions provided by its library (libtstat) have been used to communicate with Tstat and obtain the TCP flows statistics. The five API functions provided by Tstat are:
1. int tstat init(char *config);
2. void tstat new logdir(char *file, struct timeval *pckt time);
3. int tstat next pckt (struct timeval *pckt time, void *ip hdr, void *last ip byte, int tlen, ip direction);
4. tstat report *tstat close (tstat report *report);
5. void tstat print report (tstat report *report, FILE *file).
The first two functions, tstat_init and tstat_new_logdir, have to be called during the initialisation of Suricata, because they are needed to setup Tstat internal structures. Instead, tstat_next_pckt has to be called in a place where the IP packet is available inside Suricata, since it needs a pointer to the IP header. Finally, the tstat_close function is used to flush to file the collected statistics, while tstat_print_report simply prints on screen a report of the flows analysed by Tstat.
Solution design
Reading of configuration
files
Parsing of command line
options
Loading of rules and syntax checking
Initialisation of log files and internal structures
Creation of worker threads
Initialisation
Extract packet from the packets pool
Decode next protocol of the packet
Update interal structures Decoding
Run rule pre-filtering
Get next signature from
list
Check header match
Check rest of the rule for match Detection
Contains other protocols?
Are there other signatures?
Perform signature action Match?
Match?
Packet
Yes No
Yes
Yes
Yes No
No No
Parsing of new ML options
from rules
Tstat & Python initialisation
(If IP protocol) Pass packet
to Tstat
Match? Yes
ML-based detection
No
Figure 6.4: The modified workflow of the Suricata modules.
6.2.1 Suricata initialisation
Inside the Suricata initialisation phase have been inserted two more sequences of initialisation: the Tstat initialisation and the Python interpreter initialisation. The Tstat initialisation is needed by the Tstat tool, integrated inside Suricata by using two of the five API functions provided by libtstat, while the Python interpreter initialisation is needed to be able to use the C/Python API4, as deeply explained in AppendixB.
The best position to initialise Tstat is inside the main function of Suricata, before the initialisation of the threads that will perform all the Suricata activities.
After all the basic checks and configurations of internal modules have been com-pleted, the Tstat initialisation function has been inserted. In this way, it is possible to take advantage of the Suricata internal logging system to produce informative logs about the status of the integration mechanism.
This process performs many preparative actions before actually calling the Tstat
4https://docs.python.org/3.8/extending/embedding.html
Solution design
API tstat_init. Some of these actions are directly needed to obtain some configu-ration parameters for Tstat, while other are useful to setup some internal variables that will be used later by the decoding module. First of all, the configuration file used for the integration is retrieved (called “integration.conf”): in case it is not found, a critical error is emitted with the Suricata logging system and the execu-tion is interrupted since this file is necessary to continue with the initialisaexecu-tion.
Then, the tstat_init and tstat_new_logdir API functions are called and the Tstat initialisation sequence ends.
Once Tstat has been successfully initialised, the Python interpreter performs some similar actions. It is initialised through some of its API functions and then a configuration file containing information about the ML models is read, with the result of storing the ML models inside an ad-hoc data structure, ready to be used at a later stage. The reason for this choice is the fact that as much of preparative work as possible has been moved inside the initialisation phase, to allow a faster execution of the detection phase. For this same reason, the pointer to the Python function contained inside the external Python script is stored: in this way the function is ready to be called as soon as the needed arguments are ready, as explained later in the classification phase.
6.2.2 Packet decoding
The remaining part of the integration concerns the creation of the Tstat flows statistics and the subsequent detection of malicious flows. The optimal place to perform this step of the integration has been located inside the detection module of Suricata, where the packets are directly available and can be passed to Tstat, as required by the tstat_next_pckt function. More in detail, the DecodeIPV4 Suri-cata function is the exact place where the IP packet bytes are available. However, the whole section of integration inside this function has been considered a critical section, since multiple threads operate concurrently in Suricata: for this reason a mutex lock has been required before the beginning of the actual integration section and has been released after all the operations have been performed. However, this integration step performs only a single action: the IP packet is passed to Tstat, that automatically updates its internal structures with the packet data.
6.2.3 Packet classification
The final phase of the integration happens inside the Suricata detection module.
The ML-based detection has been placed after the normal Suricata process of sig-nature matching: in this way, the ML classification is performed only if all the other options of a rule provide a match, allowing to further reduce the added overhead.
The ML-based detection happens in different stages: first, the flags of the current signature analysed by Suricata are read and the new machine learning flags are retrieved; then, the statistics of the flow of the current packet are obtained from Tstat; finally, the ML flags and the statistics are passed to the external Python function, that performs the prediction and returns the outcome to Suricata.
Solution design
More in detail, the flags are obtained from a Suricata internal structure (a C struct) that contains information about a specific signature. This structure has been populated by Suricata during its initialisation phase and contains, among the other data, the options of the specific signature (e.g. action, protocol, id). If the ML options are found, the ML-based detection process starts, otherwise the normal Suricata workflows continues. These special flags contain information about which ML models have to be used during the detection phase of a specific signature.
Instead, the statistics related to the current packet’s TCP flow are obtained from Tstat using a sixth API function: this function has been created ad-hoc to retrieve statistics of a single flow because the other libtstat functions do not allow to do so.
A specific flow is identified inside the Tstat internal data structures by using the IP address/port pair of both source and destination of the analysed packet, plus a timestamp to not find eventual old flows.
Finally, the external Python function performs very few actions: given the already loaded models, the statistics of a flow and the list of flags, it performs the prediction with each one of the models selected by the flags and returns the outcome to Suricata.
Since the ML-based detection happens at the end of the Suricata detection phase, if the outcome is positive Suricata performs the action specified in the signature, while it continues with the analysis of the next signature if the outcome is negative.
Chapter 7
Dataset analysis
This chapter contains an in-depth description of the datasets used for the experi-ments described in Chapter8. Then, it proceeds with an overview of the features se-lected from the ones extracted by Tstat, among the ones described in Section2.2.1.
Finally, Section7.2 explores the statistical characteristics of the final dataset.