
UNIVERSITY OF PISA

DEPARTMENT OF INFORMATION ENGINEERING

MASTER OF SCIENCE IN COMPUTER ENGINEERING

MASTER THESIS

Browdom: Detecting malicious web pages

directly within the browser

CANDIDATE

Giacomo Vecere

SUPERVISORS

Prof. Gianluca Dini


Contents

Abstract 6
1 Introduction 7
1.1 Where does the problem come from? 8
1.2 Drive-by download 10
1.3 Browdom 10
1.4 Structure 12
2 Background 13
2.1 What does malware look like? 13
2.2 How a drive-by download attack works 16
2.3 Botnet 19
2.4 How to deal with malware? 20
2.5 Other malware detection systems 22
2.5.1 Static approach 22
2.5.2 Dynamic approach 23
3 Design 25
3.1 Overview 25
3.2 Browdom 26
3.2.1 Browser choice 26
3.2.2 Implementation 26
3.3 Features extractor 30
3.3.1 HTML features 30
3.3.2 JavaScript features 31
3.3.3 Host-based features 33
3.3.4 Other features 34
3.4 Machine-learning 36
3.4.1 Training Mode 36
3.4.2 Dataset 37
3.4.3 Server 38
3.4.4 Classifier 38
3.4.5 Detecting Mode 43
3.5 Database 44
3.6 Design of experiments 46
3.6.1 Experiments 47
4 Evaluation 48
4.1 Model derivation 48
4.1.1 Input dataset 48
4.1.2 Parameters setting 50
4.1.3 Optimize features selection 51
4.2 Performance 53
4.2.1 Accuracy 53
4.2.2 Overhead 55
4.2.3 Throughput 56
4.3 Discussion 57
4.3.1 Performance indicators 57
4.3.2 Comparison with previous work 58
5 Conclusion 63
5.1 Future work 64
Acknowledgments 65


List of Figures

1.1 Total malware programs over the last 10 years. In November 2014 there were more than 300,000,000 registered malwares 7
1.2 Browser vulnerabilities, 2011 - 2013 8
1.3 The quarterly trends for the top 10 families detected by Microsoft enterprise security products, between the third quarter of 2012 and the second quarter of 2013, by percentage of computers encountering each family 9
2.1 False branding used by a number of commonly detected rogue security software programs 13
2.2 Examples of the lock screen used by a ransomware 14
2.3 How a web attack works. No user's interaction is required to start a drive-by download. In 0.5 seconds the computer is infected 16
2.4 Consequences of exploitation for the 1st half of 2013 18
2.5 The bot master controls the botnet. He is able to perform attacks to other computers on the Internet and steal data from them 19
2.6 Outdated Browsers worldwide 20
2.7 Outdated browsers and plugins overall (the blue indicates the usage, the red shows the ones that are outdated) 21
3.1 Overview of the architecture of the system. The input URL is processed and classified as either benign or malicious 25
3.2 Top 5 Desktop, Tablet and Console Browsers on October 2014 26
3.3 Drive-by download pages worldwide 34
3.4 High-level architecture of the system for the training phase 36
3.5 Training dataset, composed by benign and malicious samples 37
3.6 Internal view of the server and its components 38
3.7 Overall scheme of the classification model 39
3.8 Cross-Validation module, the training and the testing subprocesses 40
3.9 Stacking module 40
3.10 High-level view of the architecture of the system in the detecting mode 43
3.11 Application phase, the classifier is applied to real-world unlabeled data 44
3.12 Overall architecture of the system used for the experiments 46
4.1 Final stacking tree produced by the Decision Tree Parallel operator 50
4.2 Features selection 52
4.3 Accuracy of Browdom in terms of False Negative and False Positive 54
4.4 Average and standard deviation of the time needed to open a web page, with and without Browdom, for the 10 most visited websites 55
4.5 The throughput of Browdom. Number of tabs means the total number of tabs opened, divided into each Chrome instance 56
4.6 Printfun 60


List of Tables

3.1 Object properties hooked by Browdom 27
3.2 Functions hooked by Browdom 28
3.3 Other functions hooked by Browdom 29
3.4 Features divided into four main classes 30
3.5 HTML features 30
4.1 The training dataset before and after the filtering process 49
4.2 Evolution of the accuracy of the model during the training phase, in terms of False Negatives (FN) and False Positives (FP) rates 53
4.3 Accuracy of Browdom on a total of 5000 URLs 54


Abstract

Nowadays, most malware authors target web browsers and their plugins in order to steal personal information and gain control of the infected machine. They take advantage of the vulnerabilities present in the user's system and of the lack of critical security updates (a recent study reports that critical software security updates are missing on about 87% of all analysed computers [2]). The cybercriminals' vector of choice to deliver malware stealthily to a user's machine is the drive-by download attack. Using this technique, the attacker is able to infect a computer without any user interaction, by exploiting vulnerabilities in the browser or its plugins. Moreover, these attacks are often unleashed from legitimate sites that have been compromised.

In this thesis we present a novel approach to the detection of malicious URLs. We designed and implemented a malware detection system, called Browdom, that runs directly within the browser as an extension of Google Chrome. The tool detects the malicious behavior of a web page by tracking its actions. Browdom creates a log composed of many different traces associated with events that happen during the loading and the execution of the page and that can be related to malicious behavior. The features extracted from the log derive from the HTML and JavaScript code, the host information, and the URL of the web page. A classification model is derived from this information using machine-learning techniques applied to labeled datasets.

Since Browdom executes inside a popular browser, it can be effective in protecting users right on their own machines. Because of this, all the sophisticated techniques to detect virtualized analysis environments, which malware authors have perfected over the years, are ineffective against Browdom.

We performed experiments in order to demonstrate the effectiveness of Browdom. We analysed and discussed its performance in terms of overhead, accuracy and throughput.


Chapter 1

Introduction

The world wide web has become a fundamental part of our daily lives. We regularly use online services to store and manage sensitive information related to bank accounts, e-commerce, social networks, and so on. Besides being full of users' private information (hence, potential money), the Internet is a complex system, perhaps the largest man-made many-body system. Last but not least, the web is accessible by anyone, at any time. All these factors make the Internet an incredibly attractive target for a host of illicit activities, where miscreants attempt to abuse the web and its users to make illegal profits.

The number of attacks launched from web resources located all over the world increased from 1 595 587 670 in 2012 to 1 700 870 654 in 2013. The main tool behind browser-based attacks is still the exploit pack, which gives cybercriminals a surefire way of infecting victim computers that do not have a security product installed, or have at least one popular application that is vulnerable (requiring security updates) [3].

The AV-TEST Institute registers over 450,000 new malicious programs every day. The total number of malware samples, which was over 300,000,000 in November 2014, is shown in Figure 1.1 [1].

Figure 1.1: Total malware programs over the last 10 years. In November 2014 there were more than 300,000,000 registered malwares

1.1 Where does the problem come from?

Malware is short for malicious software, and it stands for any piece of software that you might find installed on your system without your knowledge and consent. Malware is used to disrupt computer operation, gather sensitive information, or gain access to private computer systems. It can appear in the form of executable code, scripts, active content, and other software [6].

An exploit can be thought of as a part of the malicious process, the actual malicious code, that takes advantage of software vulnerabilities to infect, disrupt, or take control of a computer without the user's consent and typically without their knowledge. Exploits target vulnerabilities in operating systems, web browsers, applications, or software components that are installed on a computer. In some scenarios, targeted components are add-ons that are pre-installed by the computer manufacturer before the computer is sold. A user may not even use the vulnerable add-on or be aware that it is installed. In addition, some software has no facility for updating itself, so even if the software vendor publishes an update that fixes the vulnerability, the user may not know that the update is available or how to obtain it, and therefore remains vulnerable to attack [5].

Vulnerabilities continue to be one of the main vectors for the delivery of malicious code. They are being exploited to serve up all sorts of threats such as ransomware, trojans, backdoors, and botnets.

As we can see from the Symantec Internet Security report of 2014, even if browser vulnerabilities declined this year compared to 2012, they are still the biggest problem for browser security. Moreover, Internet Explorer, which is the browser most targeted by attackers, saw an increase in reported vulnerabilities from 60 to 139. As shown in Figure 1.2, while Safari reported the most vulnerabilities in 2012, the Chrome browser came out on top in 2013, with 212 vulnerabilities [7]. Considering that Chrome is the most used browser, we understand the relevance of the problem of browser vulnerabilities.


Many ordinary users and small businesses are comfortable managing their own web servers, whether internally or externally hosted, since it is now easier to do and relatively inexpensive. However, while the ease of installation and cost of maintenance may have decreased, many new administrators are perhaps not familiar with how to secure their servers against attacks from the latest web attack toolkits. They may not be diligent about keeping their sites secure and patched with the latest software updates. These services have become major targets for abuse by hackers, and a single vulnerability may be used across thousands of sites.

In fact, the Symantec Internet Security report of 2014 states that approximately 67 percent of websites used to distribute malware were identified as legitimate websites that had been compromised, compared with 61 percent in 2012.

Figure 1.3: The quarterly trends for the top 10 families detected by Microsoft enter-prise security products, between the third quarter of 2012 and the second quarter of 2013, by percentage of computers encountering each family

As shown in Figure 1.3, by the end of 2012, web-based attacks had surpassed traditional network worms to become the top threats facing enterprises. In fact, in the second quarter of 2013 six out of the top ten threats facing enterprises were associated with malicious or compromised websites.

One of the main problems is that, for most of these web attacks, traditional defenses (such as firewalls and antiviruses) are not effective and pose no barrier to infection. Hence, the damage can be much greater than with many other kinds of attacks.

For example, in the case of HTML/IframeRef, attackers have built automated systems that probe websites to identify and infect vulnerable web servers. Once compromised, an infected server can then host a small, seemingly benign piece of code that is used as a redirector [5]. This process is part of a widely used and dangerous technique called drive-by download.

1.2 Drive-by download

Along with the Internet's evolution, the design and use of malware has changed dramatically. Today, malware uses stealth mechanisms and polymorphism to avoid being detected. Crashing the system is not the main target anymore. Most web malware is intended either to steal the user's personal information, such as credit card details and passwords, or to cause the victim machine to join a botnet [9].

Drive-by downloads occur when a user visits a website that exploits browser vulnerabilities and launches the automatic download and installation of malware without the knowledge or permission of the user. The variety of client applications installed on web servers and personal computers gives attackers plenty of room to find vulnerable systems to attack. According to [9], almost 80% of web users are using unpatched versions of Adobe Flash and Acrobat Reader, popular web browser plug-ins.

There are several factors that have made drive-by download attacks so common and effective. One is the presence of many vulnerabilities in web clients and their plug-ins. Another important fact is that outdated, and therefore vulnerable, web clients are commonly used. If we also consider that sophisticated tools (like exploit packs) and techniques are easily obtainable and well documented, it is easy to understand how large and dangerous this type of threat is.

The goal of our project is to fight this type of attack by detecting malicious behavior that can lead to an infection of the host that is visiting a malicious page containing a drive-by download.

1.3 Browdom

In this thesis, we present the design and implementation of a malware detection system, called Browdom, which is able to detect the malicious behavior of a web page directly within the browser.

Browdom is implemented as a Google Chrome extension and is able to analyse each web page loaded in the browser and create a collection of traces that are then processed into features. These features come from the HTML page, the embedded JavaScript code, and information related to the host and the URL. By combining a number of models derived with supervised machine-learning techniques, we built a classification model for discovering likely malicious web pages.

Our approach is different from previous work that attempts to detect malicious web pages. Let us briefly go into the main differences, which are related to the possible evasion of such systems:

• unlike honeyclients, we do not run in a virtual machine. This is important because we cannot be evaded by the use of red pills (which attempt to detect if the browser is running inside a virtual machine) [17]

• we run inside an actual browser, currently Google Chrome, and the concepts behind Browdom can be extended to the other main browsers

We have introduced the concept of evasion, which is a key aspect for a malware detection system. What malware authors want to know is whether they are interacting with an actual user, who usually is not aware that they are about to be infected, or with a malware detection system, which deliberately tries to get infected in order to expose them.

With the new approach of Browdom these two scenarios overlap: there is the normal user, who is unaware of the malware inside the page, and the detection system, which is ready to analyse and detect malicious behaviors.

Distributed approach

The novel approach of Browdom differs from the current paradigm for this kind of system, and we believe it is the direction to follow in order to improve today's malware detection systems. We propose a solution that goes directly into the user's browser.

The result of this approach is twofold:

• on the one hand, we distribute malware detection to the users, so the whole detection process can go from a centralized to a distributed approach. This might help spread this kind of system to users who are not aware of the risk posed by malicious software

• on the other hand, from the user's point of view, by simply installing an extension, a user can have a malware detection system in their browser. This tool might become, as future work, a real-time protection system against malicious websites

As we will see in the next sections, the simplicity of Browdom and the choices we made in order to have a good trade-off between performance and accuracy allow the system to achieve a high throughput. We will compare it with previous work.

1.4 Structure

The thesis is composed of the following chapters:

Background: This chapter introduces the concept of malware and describes in more detail the most common web attacks, with a focus on drive-by downloads. It discusses how anti-malware systems work and analyses the state of the art of malware detection systems and the most important related projects.

Design: This part of the thesis discusses the design choices made in the implementation of Browdom. It explains in detail how the system is built and what the main parts and ideas behind it are.

This chapter also covers the main machine-learning concepts that we used in this project. There is a brief description of the architecture of the system for the training and the detecting mode, and also for the experiments we performed.

Evaluation: This part evaluates and discusses the main ideas behind the tool. It reports the results of the experiments and discusses them in terms of accuracy, overhead and throughput of the system.

We discuss and compare our results with previous work.

Conclusion: The final chapter of the thesis summarizes the goals of the project and the extent to which they were reached. It also considers future work related to Browdom in order to improve it.


Chapter 2

Background

2.1 What does malware look like?

The common user might be interested to know whether there is a way to recognize malware and avoid becoming infected. Unfortunately, for many users it is very hard to detect its presence in a website. Malware authors have evolved their techniques, and crashing the system is not their main target anymore. Even worse, the most common attack that cybercriminals perform nowadays is the drive-by download attack, where the victim is attacked without even clicking anywhere, just by visiting the website. In this case, what we can do is rely on security software and keep the system (and all its components) always updated.

Let us focus on the most common threats present on the Internet, to get a better understanding of what they look like and how they behave, in order to try to avoid becoming infected, if possible.

Rogue security software. This type of malware has become one of the most common methods that attackers use to swindle money from victims. Rogue security software, also known as scareware, is software that appears to be beneficial from a security perspective but provides limited or no security, generates erroneous or misleading alerts, or attempts to lure users into participating in fraudulent transactions. Some common rogue security software programs are shown in Figure 2.1.

Figure 2.1: False branding used by a number of commonly detected rogue security software programs


These programs typically mimic the general look and feel of legitimate security software programs and claim to detect a large number of nonexistent threats while urging users to pay for the so-called “full version” of the software to remove the nonexistent threats. Attackers typically install rogue security software programs through exploits or other malware, or use social engineering to trick users into believing the programs are legitimate and useful. Some versions emulate the appearance of the Windows Security Center or unlawfully use trademarks and icons to misrepresent themselves.

Email threats. More than 75 percent of the email messages sent over the Internet are unwanted. Not only does all this unwanted email tax recipients’ inboxes and the resources of email providers, but it also creates an environment in which emailed malware attacks and phishing attempts can proliferate. Email providers, social networks, and other online communities have made blocking spam, phishing, and other email threats a top priority.

Ransomware. This is a type of malware that is designed to render a computer or its files unusable until the computer user pays a certain amount of money to the attacker or takes other actions. It often pretends to be an official-looking warning from a well-known law enforcement agency, such as the US Federal Bureau of Investigation (FBI) or the Metropolitan Police Service of London (also known as Scotland Yard). Typically, it accuses the computer user of committing a computer-related crime and demands that the user pay a fine via electronic money transfer or a virtual currency such as Bitcoin to regain control of the computer. A ransomware infection does not mean that any illegal activities have actually been performed on the infected computer. First appearing in 2012 these threats escalated in 2013, and grew by 500 percent over the course of the year. These attacks are highly profitable and attackers have adapted them to ensure they remain profitable [7].


Now that we have seen what typical malware looks like, let us consider if and how we can recognize a malicious website.

A big problem for users concerns the websites that attackers use to conduct attacks and/or distribute malware. Malicious websites typically appear to be completely legitimate and provide no outward indicators of their malicious nature, even to experienced computer users. In many cases, these sites are legitimate websites that have been compromised by malware, SQL injection, or other techniques in efforts by attackers to take advantage of the trust users have invested in such sites [5].

The main ways to be attacked and infected by malware nowadays are:

• visiting a website where the web page is malicious or has been compromised by cybercriminals

• viewing an email message that redirects to a malicious web page

• being a victim of social engineering techniques

Some attacks target vulnerabilities in the user's browser plug-ins, and, if successfully exploited, enable the attackers to execute their code in the browser's environment and obtain full or partial control of the victim's system.

2.2 How a drive-by download attack works

A drive-by download is the process of inadvertently downloading malicious web code simply by visiting a web page. This happens automatically and without the user knowing.

Malware drive-by downloads are a big challenge, as their prevalence in malware distribution attacks keeps increasing. They are a serious threat to the safety of the Internet, so understanding the details of these attacks is of major importance.

Figure 2.3 describes the five steps of a web attack, which involves a drive-by download: entry, traffic distribution, exploit, infection and execution.

Figure 2.3: How a web attack works. No user’s interaction is required to start a drive-by download. In 0.5 seconds the computer is infected


Entry. The first part of an attack involves a drive-by download from an entry point, either a hijacked website or an email that contains a malicious link. The most common type of drive-by download is an invisible iframe, with a reference to a malicious domain, that contains malicious JavaScript code. This sophisticated JavaScript can be masked by obfuscation, as well as polymorphism (i.e., the code changes with each view). Traditional signature-based antivirus solutions cannot detect this kind of tricky code.

Traffic distribution. Once a drive-by download has reached the browser, the unsuspecting user is redirected to download an exploit kit. However, rather than sending users to known exploit kit hosting sites, elaborate traffic distribution systems (TDS) create multiple redirections that are nearly impossible to track and therefore blacklist.

Some TDS systems are legitimate, for instance those used for advertising and referral networks. But like any software, legitimate TDS solutions are prone to being hacked and exploited to drive traffic to malware hosting sites instead of a benign destination.

What's more, these TDS networks often filter traffic to keep their sites hidden from search engines and security companies. They also use fast-flux networks to cycle thousands of IP addresses through DNS records, preventing their malware hosting sites from being blacklisted.

Exploit. The next phase of a modern web attack is the downloading of an exploit pack from the malware hosting site. These kits execute a large number of exploits against vulnerabilities in web browsers and associated plugins such as Java, PDF readers, and media players.

Cybercriminals typically purchase exploit packs on the black market, making money for their creators.

Infection. Once the attacker exploits an application vulnerability to gain some control over the computer, the next step in the attack is to download a malicious payload to infect the system. The payload is the actual malware or virus that will ultimately steal data or extort money from the user.

Execution. In this final stage of the attack, the malicious payload has been downloaded and installed on the victim's system, and its job is now to make money for the criminal behind it. It can do that in a number of ways: by harvesting credentials, banking or credit card information that can be sold on the black market, or by extorting the user into paying directly. Ransomware and FakeAV are both examples of malware that extort victims into paying [10].


As we can see from Figure 2.4, there are several consequences of the exploitation of a system. It frequently results in partial or full control of the infected machine. In most cases, the exploitation results in gained access to the user's system or application. This provides the attacker with complete control over the affected system, which allows them to steal data, manipulate the system, or launch other attacks from that system [11].

Figure 2.4: Consequences of exploitation for the 1st half of 2013

In other words, once a victim's system becomes infected, it comes under the control of the attackers, turning the machine into a bot, which is a member of an (illegal) botnet.

2.3 Botnet

A botnet is a collection of compromised computers whose security defenses have been breached and control conceded to a third party. Botnets are the primary means for cybercriminals to carry out their malicious tasks, which are:

• sending spam emails

• launching (distributed) denial of service attacks

• stealing personal data such as email accounts, intellectual property, military secrets, embarrassing information or bank credentials

• infecting other machines

Figure 2.5 shows the structure and the behavior of an illegal botnet. The bot herder (also known as botmaster), who controls the botnet, communicates with all the nodes and coordinates the activities of the entire structure through the command and control (C&C) infrastructure.

Attackers can recruit bots by spreading malware, typically via phishing campaigns that deliver the malicious agent by email, or by renting the entire architecture on the underground market. As we previously mentioned, the most effective option, frequently used by attackers to compromise a large number of hosts, is the drive-by download technique to infect user machines.

Figure 2.5: The bot master controls the botnet. He is able to perform attacks to other computers on the Internet and steal data from them

2.4 How to deal with malware?

When it comes to IT security, experts always point out that the most important factor is an up-to-date computer system. According to a recent study published by F-Secure, antiquated systems pose one of the most serious security risks, especially in the corporate sphere. On about 87 percent of all checked and analysed computers, critical software security updates are missing [2].

Having taken 200,000 computer systems into account, on every second PC (49 percent) one to four security updates are missing. On one out of four machines, five to nine security holes are not closed, and on 13 percent of all analysed systems ten or more critical security updates are not installed. Notably, risky applications like browsers and their plugins are not up to date: 54 percent have not performed the latest Java update, and 36 percent run outdated Adobe Flash Players. But there is more: according to the F-Secure study, it appears that 83% of all infections could have been avoided if the software had been updated in the first place.

Now, as we can see from the interactive map of Check & Secure [13], 79% of British internet users are using an outdated browser to surf the web. This is actually nowhere near the worst rate in Europe, with users in Germany (82%), Denmark (85%) and Norway (83%) all having less frequently updated browsers. A screenshot of the interactive map is shown in Figure 2.6.

Figure 2.6: Outdated Browsers worldwide

The situation is particularly critical because the trend is not changing over the years. Even with all the security experts' efforts, the situation is not getting any better. These user behaviors complicate an already worrisome scenario.


We mentioned in the previous section that drive-by download attacks mainly target vulnerabilities in browser plugins. As we can see from Figure 2.7, the plugins posing the greatest risk are Java (where 56% of users do not have the correct version installed) and Adobe Acrobat (26%). Cybercriminals are well aware of this, even if the users themselves are not, and use these plugins to deliver malware, trojans and other viruses onto PC systems [14].

Figure 2.7: Outdated browsers and plugins overall (the blue indicates the usage, the red shows the ones that are outdated)

The figure also tells us about the situation regarding the main browsers in terms of updates. The situation is critical: most user browsers are outdated. The worst scenario in terms of browser updates is the one regarding Internet Explorer: almost all Internet Explorer users have an outdated version of the browser. The situation for these users is even worse, considering that Internet Explorer is the most attacked browser due to the fact that it runs on Windows, the operating system most targeted by cybercriminals (because Windows is the most used operating system, hence, potentially there is more money to steal for malware authors).

2.5 Other malware detection systems

Now that we have a general overview of the main Internet threats, what they look like and how they can attack and infect our systems, let us consider what is available to protect us and our machines.

Today, the most used and widespread protection that we have for a browser is the URL blacklist, like Google Safe Browsing, a service provided by Google that offers lists of URLs of web resources that contain malware or phishing content. Once you access a website that is in these blacklists, the system blocks the connection and shows a warning.

The question is: how is a URL marked as malicious and added to those lists? Several different malware detection systems are used. Let us briefly go into some details of the main ones, divided by the approach they use, which is static and/or dynamic.

2.5.1 Static approach

In general, when we think about a system that protects our machines, we think about an antivirus (AV) software. To detect malware, AV software uses static characteristics such as suspicious strings of instructions in the binary. Sadly, it is quite easy for malware authors to create many different code variants that are functionally equivalent, both manually and automatically, thus defeating static analysis rather easily. For instance, one malware family, AnserverBot, had 187 code variations [15].

Unfortunately, the same problems are present when we focus on the web, if we follow the same approach as AVs do. A malware detection system that uses a static approach tries to find malicious web pages by analysing some static aspects of them, such as properties and values of the HTML and JavaScript code and characteristics of the URL of the page. In past years, there were detection systems based just on the URL. Once malware authors discovered the presence of these systems, they changed their techniques in order to evade systems based only on URL characteristics.

A similar problem is associated with the signatures used by many malware detection systems, which are string patterns that are commonly used in malicious code and by which a system can recognize an already-seen malware. Again, the problem is that signatures can be evaded quite easily by the use of obfuscation techniques. As we have seen by studying many malware samples, it is very common for recent malware to use functions like escape/unescape, String.fromCharCode, atob/btoa and similar in order to obfuscate the code.

This does not mean that we do not use static analysis at all. We will see that we still perform a static analysis on the JavaScript files present in the web page, with the aim of extracting parameters like string length, the presence of different symbols and capital letters, and some suspicious strings. At the same time, we need to highlight that the benefits of static analysis are not that significant, especially compared to the ones that come out of a (more complete but also more expensive) dynamic analysis.

2.5.2 Dynamic approach

A more complex and effective approach to identifying malware is to run a system that tries to become infected in order to discover the malicious behavior and categorize a page as malicious. This kind of system, instead of just looking at static properties of the HTML and JavaScript code, loads and executes the page and monitors the events that happen during its execution. This is what is called the dynamic approach.

The state of the art for this kind of system are honeyclients: active security devices that search for malicious web pages that attack clients. They visit the presumably malicious page and analyse the behavior of the page, the browser and the whole system on which they are running. There are two different types of these malware detection systems: low-interaction and high-interaction honeyclients.

Low-interaction honeyclients

Low-interaction honeyclients are emulated systems: they can emulate different browser versions and different plugin modules, and they are scalable. They are easier to deploy than high-interaction client honeypots and they also perform better. However, they are likely to have a lower detection rate, since attacks have to be known to the client honeypot in order for it to detect them; new attacks, like zero-day attacks, are likely to go unnoticed. They also suffer from the problem of evasion by exploits, which may be exacerbated by their simplicity, making it easier for an exploit to detect the presence of the client honeypot [16].

High-interaction honeyclients

In this case emulation is not necessary: there is a real browser on a real operating system inside a virtual machine. The browser loads the web page being analysed. After this, the system checks for artifacts that indicate a successful attack, such as executable files on the file system or unexpected processes. High-interaction honeyclients are able to discover zero-day attacks and are more difficult for attackers to evade, compared to low-interaction honeyclients. The tradeoff for this accuracy is a performance hit from the amount of system state that has to be monitored to make an attack assessment. The main weakness is that the analysis is expensive: after each successful exploit, the virtual machine needs to be restored, since the analysis platform can no longer be trusted. Moreover, recent studies show that it is possible to detect if a system is running on a virtual machine (and hence evade it) by using red pills, which are essentially pieces of JavaScript code [17].

Alternative detection approaches

Other than these systems, researchers have proposed alternative detection approaches for malicious web pages. In particular, there are systems like Wepawet [18] and PhoneyC [19] that rely on instrumented JavaScript run-time environments to detect the execution of malicious scripts. Even if these systems solve part of the problems of high-interaction honeyclients, there are still some concerns that we will analyse.


We will consider Wepawet as a system to compare our results in terms of accuracy and throughput.

Our approach considers the limitations of current malware detection systems and tries to overcome them. Browdom runs neither on an emulated browser nor inside a virtual machine. This makes evading our system much more difficult. Moreover, we will see how Browdom has a higher throughput than Wepawet, allowing us to process more URLs per unit of time.


Chapter 3

Design

3.1 Overview

Our goal is to create a tool that can detect web pages with malicious behavior (i.e., malicious or compromised web pages) directly within the browser. To perform this classification task, the system uses a model that evaluates the features extracted from a page. This model is derived using supervised machine-learning techniques. The core of the system is Browdom, a tool implemented as a Google Chrome extension. Considering that the extension will be used extensively by common users on their browsers, we cannot just focus on the accuracy of the detection; from the early phases of the design, particular attention must be paid to the performance of the system and the overhead that might be caused by this tool.

Figure 3.1 shows the architecture of the whole system from a high-level point of view.

Figure 3.1: Overview of the architecture of the system. The input URL is processed and classified as either benign or malicious

3.2 Browdom

Browdom is the core of our system, and it is implemented as a Google Chrome extension. Google Chrome extensions are small software programs used to modify and enhance the functionality of the browser. They are written using web technologies such as HTML, JavaScript, and CSS.

Browdom analyses each web page loaded in the browser and creates a log for that page. This log is a collection of traces related to events generated by the page and recorded by our tool. The events we are interested in are the ones that can be related to a malicious behavior, such as calls to the eval and document.write functions, hidden iframes and so on. This sort of profiling of the page is what we need in order to perform a malware detection analysis.

3.2.1 Browser choice

We chose to develop our tool for the Google Chrome browser because it was, by far, the most used browser family as of October 2014, as shown in Figure 3.2 [20]. Another important factor that influenced our choice is the possibility to take advantage of a Chrome extension, which perfectly fits what we need to do.

Figure 3.2: Top 5 Desktop, Tablet and Console Browsers on October 2014

3.2.2 Implementation

In Google Chrome, an extension is a zipped bundle of files — HTML, CSS, JavaScript, images, and so on — that adds functionality to the Google Chrome browser. Extensions are essentially web pages, and they can use all the APIs that the browser provides to web pages. Extensions can interact with web pages or servers using content scripts, which are JavaScript files that run in the context of the web pages loaded in the browser. Content scripts, however, execute in an isolated environment: it looks to each content script as if there were no other JavaScript in execution on the page it is running on. The same is true in reverse: JavaScript running on the page cannot call any functions or access any variables defined by content scripts.

Considering that the main idea is to override the functions we want to track, the fact that content scripts execute in an isolated environment is a problem, because every content script sees only its own functions and variables. To overcome this problem, we inject a JavaScript script directly inside the web page, before the Document Object Model (DOM) is loaded. With this trick we are able to override the functions we want, because our code is loaded before any other script in the page, even before the DOM. It is worth pointing out, however, that this problem would be easily overcome if we could add the code directly inside the browser, and not just as an external component that runs in an isolated environment.
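To make the idea concrete, the following is a minimal sketch, not Browdom's actual code, of how a content script declared in the manifest with "run_at": "document_start" can inject a script element into the page context before the DOM is built; the function hooks themselves would be placed inside the injected function.

// Minimal sketch of page-context injection from a content script
// (assumes the manifest declares this content script with "run_at": "document_start").
var script = document.createElement('script');
script.textContent = '(' + function () {
  // Function hooks would be installed here, in the page's JavaScript context,
  // before any of the page's own scripts have a chance to run.
  console.log('hooks installed before the DOM is built');
} + ')();';
// document.documentElement (the <html> element) already exists at document_start.
document.documentElement.appendChild(script);
script.parentNode.removeChild(script); // the code has already run; keep the DOM clean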

Object properties hooked

Table 3.1 shows the object properties we are able to hook and control. For each of these hooks we obtain a parameter called the trace depth, an integer representing the depth of traced calls. This parameter is very useful because many attackers use obfuscation techniques in order to make the code look different and avoid detection. To do this they frequently create a chain of functions, one inside the other, which makes static analysis much harder. Thanks to the trace depth value, we are able to discover this behavior.

Table 3.1: Object properties hooked by Browdom

HTML Object   Specific Properties
iframe        height, width, src, srcdoc, sandbox
script        async, type, src
source        media, type, src
object        height, width, data, type
embed         height, width, type, src

Properties common to all objects: innerHTML, hidden, aria-hidden
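The trace depth described above can be maintained by the hook wrappers themselves. The following is a minimal sketch, not Browdom's actual code, of how a hooked function such as eval could record a trace together with its depth; the logTrace helper and the traces array are assumptions of this sketch.

// Sketch of a trace-depth-aware hook, installed by the injected script.
(function () {
  var traceDepth = 0;  // number of hooked calls currently on the stack
  var traces = [];     // collected traces, later turned into features

  function logTrace(name, info) {
    traces.push({ name: name, depth: traceDepth, info: info });
  }

  var originalEval = window.eval;
  window.eval = function (code) {
    traceDepth++;
    logTrace('eval', { argLength: String(code).length }); // also feeds the argument-length feature
    try {
      return originalEval(code); // note: indirect eval runs in global scope (a simplification)
    } finally {
      traceDepth--;
    }
  };
})();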

(29)

Functions controlled by Browdom

Looking at Table 3.2, we can observe the main functions that Browdom is able to hook. The first class is relative to the functions that can evaluate and execute code inside the browser; in other words, they are the final stage of the malicious process. The attackers load the code into the page in some way, usually with an XSS or SQL injection attack or by including code from an external domain, and then they execute it using this class of functions. Hence, it is clear why it is extremely important to keep an eye on them.

Cybercriminals often use the Document Object Model (DOM) manipulation functions to dynamically add elements to the DOM, such as iframes, flash objects, inline and external scripts and so on.

To understand why the third class of functions is important, we need to take a look at the evolution of cybercriminal techniques in the last years. Drive-by downloads used to contain only the code that exploits the browser. This approach was defeated by static detection of the malicious code using signatures. The attackers then started to obfuscate the malicious code in order to make the attacks impossible to match with signatures. Obfuscated code needs to be executed by a JavaScript engine to truly reveal the final code that performs the attack. Nowadays, a common piece of malicious code is a combination of obfuscator functions that sets up a shellcode; then, for instance, another code snippet triggers a memory corruption vulnerability, which, if successful, causes the shellcode to be executed.

Table 3.2: Functions hooked by Browdom

Function class Function name

Function class              Function names
Code evaluation             eval(), new Function(), setTimeout(), setInterval()
DOM manipulation            document.write(), document.writeln(), node.appendChild()
Obfuscation/Deobfuscation   String.fromCharCode(), String.prototype.concat(), String.prototype.substr(), String.prototype.substring(), window.atob(), window.btoa(), escape(), unescape(), encodeURI(), encodeURIComponent(), decodeURI(), decodeURIComponent()
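As an illustration of why the obfuscation/deobfuscation class matters, consider the following harmless snippet written in the nested style these hooks target. With the hooks installed, unescape and the outer eval would each be traced at depth 1, while the inner eval, which runs while the outer one is still executing, would be traced at depth 2.

// Decodes to eval("alert(1)"): the outer eval evaluates a string that calls eval again.
eval(unescape('%65%76%61%6C%28%22%61%6C%65%72%74%28%31%29%22%29'));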


AJAX requests and cross-origin communication

As shown in Table 3.3, Browdom overrides the functions related to the XMLHttpRequest object, which is used to make AJAX requests. Through these calls, a web page can communicate with an external server, request and obtain external resources, and send information with the POST method. Inspecting these requests can be very informative.

Another quite common characteristic of malicious pages is the fact that they open popup pages, even without the user's interaction. This behavior is very uncommon for benign websites and is hence a strong indicator of the maliciousness of a web page. We are able to control this characteristic by hooking the window.open function, fired every time this behavior occurs. The postMessage method of the window object is used to communicate between different pages; it enables cross-origin communication, so it can be dangerous if used by attackers.

Table 3.3: Other functions hooked by Browdom

Function class   Function names
AJAX Request     xmlhttp.open(), xmlhttp.send(), xmlhttp.setRequestHeader()
Others           window.open(), window.postMessage()
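A minimal sketch of how these hooks could look; logTrace is the assumed helper introduced in the earlier sketch, not part of Browdom's documented API.

var originalWindowOpen = window.open;
window.open = function (url, name, specs) {
  logTrace('window.open', { url: url }); // pop-ups opened without user interaction are suspicious
  return originalWindowOpen.apply(window, arguments);
};

var originalXhrOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url) {
  logTrace('xmlhttp.open', { method: method, url: url });
  return originalXhrOpen.apply(this, arguments);
};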

JavaScript code inside events

Cybercriminals try to hide the malicious code they inject into a web page in many ways. One of these techniques consists in putting the JavaScript code inside an event they can trigger. To make it clearer, consider the following example:

<img src="an_invalid_source"
     onerror="eval(atob('ZG9jdW1lbnQud3JpdGUoNSk='))">

This should be a common image; instead, it contains potentially malicious code. The fact that the source field is not valid triggers the onerror event, which causes the execution of the code inside the eval. In our case, the function that will be executed is a trivial, obfuscated document.write.

In order to manage these stealthy cases, Browdom logs all the JavaScript code found inside events like the one shown above, for further analysis.
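One possible way to collect this code, sketched here under the assumption that the scan runs once the DOM is available (not necessarily how Browdom performs it), is to walk the document and record every on* attribute:

function collectInlineHandlers(doc) {
  var found = [];
  var elements = doc.querySelectorAll('*');
  for (var i = 0; i < elements.length; i++) {
    var attrs = elements[i].attributes;
    for (var j = 0; j < attrs.length; j++) {
      if (attrs[j].name.indexOf('on') === 0) { // onerror, onload, onclick, ...
        found.push({ tag: elements[i].tagName, event: attrs[j].name, code: attrs[j].value });
      }
    }
  }
  return found; // each entry can then be logged for further analysis
}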

3.3 Features extractor

From the traces collected by Browdom, we need to extract the features used to determine whether a page is malicious or not. The feature values that we collect derive from different sources: the HTML code, the JavaScript code, and information related to the page's host. In our project we introduce and evaluate a total of 180 features, with the aim of detecting malicious web pages.

Table 3.4: Features divided into four main classes

Class of features   Number of features
HTML                8
JavaScript          159
Host-based          9
Others              4

3.3.1 HTML features

The 8 HTML features are shown in Table 3.5. The focus is on hidden elements, especially the ones whose source is on an external domain, and on flash objects, in particular when they are hidden. Elements like iframe, embed and object are considered in computing these features because they can be used to include external content in a web page. The <img> element and similar elements are not considered, because they cannot be used to include any executable code.

It is worth pointing out that many of these features are particularly difficult for an attacker to evade. For example, for out-of-window elements, we ask the browser directly for the position of the elements, since it knows exactly where they are. In this way, even if the position is changed dynamically, we are still able to detect this suspicious behavior.

Table 3.5: HTML features

Element   Features
iframe    hidden iframe
          hidden iframe whose source is on an external domain
object    hidden object
          hidden object whose source is on an external domain
          flash object
          hidden flash object
embed     hidden embed
          hidden embed whose source is on an external domain


High-level features

In order to improve the accuracy of the classification model, we tried to abstract from single property values and create a sort of high-level feature. To better understand what this means, let us consider the hidden high-level feature. There are many different ways to hide an element in a browser, using both HTML and JavaScript code, such as:

• height = 0 and/or width = 0

• hidden = true

• aria-hidden = true

• display = none

• visibility = hidden

• putting the element out of the window

We summarize these different cases in one single feature, to support the concept of feature rather than that of property. In other words, for our purpose, it does not matter whether the attacker tries to hide the element using the hidden property or by putting it out of the window. We are interested in the result: the fact that the element is not visible to the user and can be used to contain malicious content without the user noticing it.
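A minimal sketch of how such a high-level "hidden" check could be computed for a single element; the exact checks Browdom performs are not reproduced here, this only illustrates the idea of collapsing several properties into one feature.

function isHidden(el) {
  var style = window.getComputedStyle(el);
  var rect = el.getBoundingClientRect(); // the browser knows the element's real position
  return el.hidden === true ||
         el.getAttribute('aria-hidden') === 'true' ||
         style.display === 'none' ||
         style.visibility === 'hidden' ||
         rect.width === 0 || rect.height === 0 ||
         rect.right < 0 || rect.bottom < 0 ||                            // entirely above or left of the viewport
         rect.left > window.innerWidth || rect.top > window.innerHeight; // entirely below or right of the viewport
}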

3.3.2 JavaScript features

These features result from both the static analysis of inline and external JavaScript files and the evaluation of the JavaScript code. Most malicious JavaScript files are obfuscated or packed to make their analysis difficult. To detect these characteristics, we also take into consideration some string properties such as string length, the number of different symbols, letter case variations and the presence of some keywords. To do this, we first tried to parse the JavaScript code with regular expression patterns. Even if we were able to extract very useful information from the code, this process slowed down the features extraction a lot, making the solution unfeasible. So, in order to parse and extract that information from the JavaScript code in a more efficient way, we considered the JavaScript Abstract Syntax Tree (AST) extracted using the PyNarcissus JavaScript parser [21].

Dynamic analysis

From the evaluation of the JavaScript code we get the most important features, the ones that contribute most to the accuracy of the classification model.

As shown in Table 3.2, Browdom hooks many functions used by attackers to execute and obfuscate their code and to manipulate the DOM in order to achieve their purpose. For each of these functions, as previously mentioned, we obtain the trace depth, an integer representing the depth of traced calls. The features we extract identify the number of calls of a function with a particular value of the trace depth, for a total of 152 different features (19 functions with 8 possible trace depth ranges). At first, we set 5 possible values for this parameter (1, 2, 3, 4, 5+), thinking that more than 5 nested calls would be a strong indicator of maliciousness. Then, after some tests, we discovered that there are many websites, both benign and malicious, that use nested calls up to a hundred levels deep. Hence, we decided to use 8 possible ranges for the trace depth: 1, 2, 3, 4, 5-25, 26-50, 51-99, 100+. Moreover, in order to establish a relation between these features (we have to keep in mind that, for the machine-learning system, these are just different features with no relation to each other), we added one that summarizes them: the maximum trace depth for each of the 19 functions.
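A sketch of how the collected traces could be turned into these counts and maxima; the feature naming used here is illustrative, not Browdom's actual encoding.

function depthBucket(depth) {
  if (depth <= 4) return String(depth); // buckets 1, 2, 3, 4
  if (depth <= 25) return '5-25';
  if (depth <= 50) return '26-50';
  if (depth <= 99) return '51-99';
  return '100+';
}

function tracesToFeatures(traces) {
  var counts = {};   // e.g. counts['eval@5-25'] = number of eval calls traced at depth 5 to 25
  var maxDepth = {}; // e.g. maxDepth['eval'] = deepest traced eval call
  traces.forEach(function (t) {
    var key = t.name + '@' + depthBucket(t.depth);
    counts[key] = (counts[key] || 0) + 1;
    maxDepth[t.name] = Math.max(maxDepth[t.name] || 0, t.depth);
  });
  return { counts: counts, maxDepth: maxDepth };
}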

Another important feature is the argument length. We measure the length of the strings passed as arguments to functions like eval, setInterval, setTimeout and all the others we control. It is common for malicious scripts to dynamically evaluate complex code using these functions. That means that:

eval(a=32+10);

is, potentially, less dangerous and suspicious than:

eval("function this_is_an_attack(){alert('I am the attacker!')}; this_is_an_attack();");

This is a simple example, but the behavior it shows is heavily used together with obfuscation functions in order to avoid detection by static analysis tools. An attacker could evade these features by not using obfuscation; this would leave the malicious code in the clear, or would significantly constrain the techniques usable for obfuscation. In both cases, the malicious code would be exposed to simple, signature-based detectors and easy analysis.

Due to these obfuscation techniques used by attackers, we take into consideration not only the length of strings passed as arguments to functions that evaluate code, but also of strings passed to the functions used for the actual obfuscation, like unescape, atob and similar.

Static analysis

The static analysis is performed on both the inline JavaScript code, included in a web page via an inline <script> element, and the external files with a content type of text/javascript. More precisely, we extracted the following 7 features that derive from the static analysis of the JavaScript content:

Long strings: this parameter is related to string length, and it is high if there are many long strings compared to the median value computed on the whole script. There are two different features expressing string length, one for variable names and one for argument names.

Symbols: this parameter is associated with the number of symbols (i.e., non-alphanumeric characters) present in the script, compared to the median value of the whole script. There is one feature relative to variable names and another one for argument names.


Suspicious strings: this feature has been added after manually analysing several malicious scripts and noticing that most of them, if not obfuscated, tend to use certain strings as variable or function names. Thus, we check whether a script contains some common strings such as "botnet", "shellcode", "spray", "evil" and others. The feature counts how many occurrences of these strings are found.
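A sketch of this count; the keyword list below contains only the examples mentioned above, the full list used by Browdom is not reproduced here.

var SUSPICIOUS = ['botnet', 'shellcode', 'spray', 'evil'];

function countSuspiciousStrings(scriptText) {
  var lower = scriptText.toLowerCase();
  return SUSPICIOUS.reduce(function (count, word) {
    return count + (lower.split(word).length - 1); // occurrences of this keyword
  }, 0);
}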

3.3.3 Host-based features

These features derive from the analysis of URL and host information, such as the presence of an IP address in the URL, DNS resolution errors and so on. We have a total of 9 features related to this information; let us discuss them in detail.

Presence of an IP address in the URL: several web sites hosting malware are not associated with domain names but are addressed directly by their IPs. The reason for this is that the malware is hosted on a victim machine on a public network that was compromised.

This behavior is very uncommon for benign websites, considering that their aim is to reach as many users as possible, which can be achieved only by using a symbolic address rather than an IP. This makes it a strong feature for our goal.

Localhost reference in the URL: this feature has been added after analysing several malicious URLs in which there was a reference to localhost, with both an IP and a symbolic address. The reason is still not completely clear, but apparently the attackers are trying to use a backdoor, attempting to access the host machine in a sneaky way.

DNS resolution exceptions/errors: there are 3 different features related to DNS resolution errors; they consider the exceptions or errors that occur (such as NXDOMAIN, DNS timeout and DNS exception). It is quite common for malicious web pages to include content from sites with no valid DNS name.

Use of HTTP/HTTPS: this feature considers the transfer protocol used to access the resources, whether HTTP or HTTPS. It counts how many resources are accessed using each of the two.

At first, one could think that most malicious websites are accessed through HTTP rather than HTTPS. This is not completely true, which becomes clear if we consider that around 70% of malicious websites are legitimate but compromised sites. So, we keep this feature because there is still some correlation, but we cannot consider it one of the strongest.


Country-related features: as shown in Figure 3.3, there is a strong correlation between drive-by download pages and the country in which the website resources are hosted. This is why we introduced two types of features related to the countries to which the web page and its resources belong.

• Number of countries reached: malicious content is often very distributed. With this feature we take this fact into consideration by counting the total number of countries in which the resources are located. This feature is extracted using geoip queries to retrieve location information relative to a URL [22].

• Country code: for each URL analysed, using geoip queries, we retrieve all the countries to which the resources included in the web page belong.

Figure 3.3: Drive-by download pages worldwide

There was another feature that considered the total number of domains reached from the web page but, after the training process of our classification model, we discovered that it was not relevant for determining the maliciousness of the web page.

3.3.4 Other features

There are other features that do not belong to any of the previous classes but are very common in malicious websites, and hence important to include in our analysis.

Auto-redirection: this feature considers the redirection from the initial URL to another one (most of the time there is more than one redirection, to make the tracing process difficult). This is rather suspicious, considering that we do not click on any of the web page links but are just visiting it.


Number of redirections: we record the number of times the browser is redirected to a different URL, for example, by responding with HTTP Status 302 or by the setting of specific JavaScript properties (e.g., document.location).

Pop-ups: when visiting malicious websites, new browser windows are often opened to display advertisements or other content. This feature counts the number of pop-ups opened while we are on the page. This is rather common behavior for malicious web pages.

File automatically downloaded: this is a very important feature for our classifier. In fact, downloading a file without any user interaction is really uncommon for benign websites, whereas it is much more common for the malicious ones. Since usually only one file is downloaded, the feature simply expresses whether or not this behavior is present.
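Pop-ups and automatic downloads can only be observed from inside the browser, but the redirection-related features can be approximated outside it. A minimal sketch using the requests library (an assumption, not part of Browdom) follows:

```python
import requests  # assumption: used here only for illustration, not by Browdom

def redirection_features(url: str) -> dict:
    # Follow HTTP-level redirects (e.g. 301/302) and count them. JavaScript
    # redirections via document.location are only visible inside the browser.
    response = requests.get(url, allow_redirects=True, timeout=10)
    return {
        "auto_redirect": int(len(response.history) > 0),
        "num_redirections": len(response.history),
        "final_url": response.url,
    }
```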


3.4

Machine-learning

The system makes intensive use of machine-learning techniques in order to build a model that classifies a set of feature values as either likely malicious or likely benign. Two operating modes are analyzed in the following sections: the training mode and the detection mode. In the first, the classification model learns the characteristics of both benign and malicious web pages. In the detection mode, the classifier is used to classify the input web pages as either malicious or benign.

3.4.1

Training Mode

The first step in building a classifier is the training phase. Figure 3.4 shows the architecture of our system, from a high-level point of view, during the training phase. The framework used is supervised learning, where the labeled input dataset contains both benign and malicious samples.

Figure 3.4: High-level architecture of the system for the training phase

To better understand the operating procedure, let us consider the central entity that interfaces with all the other components. The server:

• reads the labeled dataset, which contains the input URLs
• for each URL:

– sends it to an instance of the browser


3.4.2

Dataset

In this project, we used the supervised learning framework, where each example is a pair consisting of an input object and a desired output value. It is easy to understand how crucial this point is if we think about what we are looking for: we need labeled samples, which means we must first know whether each of the URLs on which we train our system is benign or malicious.

Figure 3.5: Training dataset, composed by benign and malicious samples

Figure 3.5 shows what we mean by dataset: it contains both benign and malicious samples. It is worth pointing out that, even though the classifier must be able to generalize, the benign dataset is bigger than the malicious one. This reflects reality, where a visited website is far more likely to be benign than malicious, by a factor on the order of hundreds. We will discuss the dataset filtering process in more detail in Section 4.1.


3.4.3

Server

As shown in Figure 3.4, the server is the central entity of our system, the component that interfaces with and coordinates all the other elements. The critical aspects for the server are performance and efficiency, since it has to process the input URLs, submit them to the browser, elaborate the results and, finally, build the classifier. To this end, submission to the browser is done in parallel, with a few browser instances and several tabs per browser, which greatly speeds up the training process (a sketch of this parallel submission is shown after Figure 3.6).

After we collected enough traces to build a classification model, we extracted, using a heuristic approach, the features that should characterise the page's behavior. At this point, the server is finally able to build the classifier that will be used in the detection phase to discover malicious behaviors. The entire process we have just described and the internal components of the server are illustrated in the following figure:

Figure 3.6: Internal view of the server and its components
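The parallel submission described above could look roughly like the following sketch, where the function driving a browser tab is a simplified placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def visit_with_browser(url: str) -> dict:
    # Placeholder for driving one browser tab with Browdom and collecting
    # the extracted feature values; the real implementation lives in the extension.
    return {"url": url, "features": {}}

def collect_traces(urls, max_parallel_tabs=8):
    # Submit URLs in parallel, mirroring "a few browser instances, several tabs per browser".
    with ThreadPoolExecutor(max_workers=max_parallel_tabs) as pool:
        return list(pool.map(visit_with_browser, urls))
```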

3.4.4

Classifier

The purpose of the entire process is to build a model that is able to classify a set of feature values as belonging to either likely benign or likely malicious web page.


To do this, we use the RapidMiner machine-learning platform [23].

Let us explain the structure of the machine-learning system, to better understand the concepts that are involved and the strategy we adopted.

Figure 3.7: Overall scheme of the classification model

Figure 3.7 shows the whole process to build the actual classifier, represented as modules in the RapidMiner platform. The input of the system is the set of feature values coming from both benign and malicious web pages. This information is extracted by visiting the actual web page with the Google Chrome browser and Browdom, the tool we have developed.

Cross-Validation

The core of our machine-learning system is the validation module, where the input data is processed after being normalized. This operator performs a cross-validation in order to estimate the statistical performance of a learning operator (usually on unseen datasets, as in our case). It is used to estimate how accurately the model, which is learnt by a particular learning operator that we will discuss later, will perform in practice.

The Cross-Validation operator is a nested operator and it has two subprocesses: a training and a testing subprocess. The training subprocess is used for training a model, using one or more classification models. The trained model is then applied in the testing subprocess, where the performance of the model is measured. In a cross-validation module, the input dataset is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing dataset (i.e. input of the testing subprocess), and the remaining k - 1 subsets are used as the training dataset (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations can then be combined to produce a single estimation.
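The same k-fold procedure can be expressed outside RapidMiner; a sketch using scikit-learn (an assumption for illustration only, since the thesis uses RapidMiner's Cross-Validation operator) looks like this:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def estimate_performance(X, y, k=10):
    # Normalize the feature values, then run k-fold cross-validation:
    # each of the k subsets is used exactly once as the testing set.
    model = make_pipeline(StandardScaler(), DecisionTreeClassifier())
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    return scores.mean()
```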


Figure 3.8: Cross-Validation module, the training and the testing subprocesses

As we can see from Figure 3.8, for the training process we used a stacking operator, which combines the base learners rather than choosing among them, thereby achieving better performance than any single base learner. Let us consider the stacking operator in detail, to understand which components and models are involved.

Figure 3.9: Stacking module

The Stacking operator is a nested operator and it has two subprocesses: the Base Learners and the Stacking Model Learner subprocess.

We used 5 different classification models, each of which is particularly good in some situations and weaker in others. By combining them into a single classification model, we try to take advantage of the strengths of all of them in order to build the best classifier we can.

k-NN. The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given test example with training examples that are similar to it: given an unknown example, a k-nearest neighbor algorithm searches the pattern space for the k training examples that are closest to the unknown example. "Closeness" is defined in terms of a distance metric, such as the Euclidean distance.

The k-Nearest Neighbor algorithm is among the simplest of all machine-learning algorithms, and thus not that strong in terms of generalization. However, it is good at classifying examples similar to something already seen. In fact, we will see that this model was chosen among the others by the optimization process we performed.

Decision Tree. A decision tree is a tree-like model. This representation of the data has the advantage, compared with other approaches, of being meaningful and easy to interpret. The goal is to create a classification model that predicts the value of a target attribute based on several input attributes of the input dataset. Each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf.

Naive Bayes. A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature.

The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix.

Rule induction. Starting with the less prevalent classes, the algorithm iteratively grows and prunes rules until there are no positive examples left or the error rate is greater than 50%. In the growing phase, conditions are greedily added to each rule until it is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with the highest information gain. In the pruning phase, for each rule, any final sequence of antecedents is pruned using the pruning metric p/(p+n).

Rule Sets have the advantage, compared to Decision Tree learners, that they are easy to understand, representable in first order logic, and prior knowledge can be added to them easily. Their major disadvantages are that they scale poorly with training set size and have problems with noisy data.

SVM. The Support Vector Machine learning method can be used for both regression and classification and provides a fast algorithm and good results for many learning tasks. The standard SVM takes a set of input data and predicts, for each given input, which of the two possible classes comprises the input, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Having described all of the classification models we used, we can see that each of them contributes to the final model accuracy in its own way. The last step, in order to produce a single classification model, is to use a Decision Tree Parallel learner. This operator builds a decision tree (like the base learner we saw before) combining the classifications from the other models mixed with some feature values. We will analyse in detail the decision tree produced by this model in the next section. The result of this process is the actual classifier, which is able to classify an unseen sample based on its features.
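As an illustration of this combination, an equivalent stacked ensemble could be assembled in scikit-learn as sketched below. This is an analogue, not the thesis implementation: RapidMiner's rule-induction learner has no direct scikit-learn counterpart and is omitted, and all parameters are placeholders.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_stacked_classifier():
    base_learners = [
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier()),
        ("nb", GaussianNB()),
        ("svm", SVC(probability=True)),
    ]
    # The meta-learner is a decision tree that combines the base learners'
    # predictions and, with passthrough=True, the original feature values.
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=DecisionTreeClassifier(),
        passthrough=True,
    )
```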


3.4.5

Detecting Mode

The actual mode of operation of the system is to classify a given URL as either likely benign or likely malicious. To complete the machine-learning process and finally have a classifier usable in the real world, we need to test the classification model produced in the previous phase, the learning process.

Figure 3.10: High-level view of the architecture of the system in the detecting mode

As shown in Figure 3.10, we used a Web Crawler that systematically browses the web and feeds URLs to our system. Then, after directly visiting the web page with the browser and our extension, the classifier is able to classify the URL as either benign or malicious.

Web Crawler

A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites it copies and saves the information as it goes. Such archives are usually stored such that they can be viewed, read and navigated as they were on the live web, but are preserved as snapshots [24].

The web crawler we used was not a commercial one; it was developed in our laboratory but, since it had already been used and tested, it worked perfectly. Even in terms of speed we did not have any problems, because the crawler was much faster than the whole classification system.
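A minimal illustration of the seed/frontier mechanism described above, using only the Python standard library (the laboratory crawler is considerably more sophisticated):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of anchor tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier, visited = deque(seeds), set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        yield url  # URL handed over to the classification system
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
```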


Application phase

We will go into the details regarding the accuracy of the classifier later, but it is worth looking at the operating scheme of the classification model in the application phase on the RapidMiner platform.

Figure 3.11: Application phase, the classifier is applied to real-world unlabeled data

Figure 3.11 shows the modules involved in this phase. The structure is quite simple: the input data is the real-world data, which is of course unlabeled, and it is first normalized. The apply-model operator reads the classification model produced before and applies it to the data given as the other input.

The result of this phase is a collection of rows of the form:

< URL, prediction, confidence (benign), confidence (malicious) >
...
< URL, prediction, confidence (benign), confidence (malicious) >
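A sketch of how such rows could be produced from a trained model exposing per-class confidences (assuming a scikit-learn-style classifier with predict_proba; in the thesis the rows come from RapidMiner's apply-model operator):

```python
def classify_urls(model, urls, feature_rows):
    # feature_rows[i] holds the normalized feature values extracted for urls[i].
    # Column order of predict_proba is assumed to be (benign, malicious).
    probabilities = model.predict_proba(feature_rows)
    for url, (p_benign, p_malicious) in zip(urls, probabilities):
        prediction = "malicious" if p_malicious > p_benign else "benign"
        print(f"<{url}, {prediction}, {p_benign:.3f}, {p_malicious:.3f}>")
```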

3.5

Database

For managing the database we used MongoDB [26], a document database that provides high performance, high availability, and easy scalability. MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster. MongoDB is often categorized as a “NoSQL” database, a term that became increasingly popular in late 2009. While this term is a rather generic characterization of a database, it does clearly mark a break from traditional SQL-based databases. A MongoDB database lacks a schema, or rigid pre-defined data structures such as tables. Data stored in MongoDB consists of JSON-like documents whose structure can change dynamically to accommodate evolving needs; this was a particularly useful feature for our purposes. A NoSQL database provides a simple, lightweight mechanism for storage and retrieval of data that offers higher scalability and availability than traditional relational databases.
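For example, a trace with its extracted features could be stored as a single document along these lines (pymongo, with database, collection and field names that are illustrative only):

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017/")
traces = client["browdom"]["traces"]  # database and collection names are illustrative

# No rigid schema: documents for different URLs may carry different feature sets.
traces.insert_one({
    "url": "http://example.com/",
    "label": "benign",
    "features": {"num_redirections": 0, "ip_in_url": 0, "num_countries": 1},
})
```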
