
POLITECNICO DI MILANO

MASTER THESIS

Analysis and Detection of Social Bots on Twitter Mimicking Interests in People or Contents

Author: Lorenzo VITALI
Supervisor: Dr. Florian DANIEL
Academic Year 2019-2020

A thesis submitted in fulfillment of the requirements for the degree in Mathematical Engineering - Applied Statistics - Big Data Analytics

Abstract

This work studies the Twitter world, aiming at the identification of bots, namely the algorithmically driven accounts that populate this Social Network. The thesis focuses on bots that mimic human behavior through the interactions they are able to perform on the platform, thereby identifying three main actions performed by bots (retweet, mention, hashtag) together with the harms they can bring.

The dissimilar behaviors of genuine users and content polluters can be spotted through differences in the tweets they post. Therefore the first step is the collection of users and tweets through the streaming API of Twitter, in order to approach the problem in an unsupervised way by gathering the data in real time. In particular, the data were collected by saving the tweets and the users that were posting on Twitter from throughout the United States in a specific time window. This is followed by the building of graphs from every collected user, based on the actions they performed, resulting in one graph for the retweets, one for the mentions and one for the hashtags, for each user. Then a node embedding algorithm was applied in order to obtain numerical features, with the goal of obtaining distinct clusters. This approach did not bring the desired results, probably because the context of the application was not optimal.

The second phase of the thesis is a collection of previous works in the literature in order to assemble a labeled dataset, so that the approach can be supervised. Starting from these data, an accurate feature extraction pays particular attention to the features that consider the aforementioned interactions. The next phase goes deeper into the analysis of the bot category. The thesis proposes a finer level of granularity and departs from many works in the literature that consider the problem of telling bots and humans apart in a binary way (0-1). It identifies, indeed, three particular types of bot behavior: bots that are massively retweeting users in order to endorse someone or something (RETWEETER), bots replying to and interacting with humans in a deceiving way, in order to expand their web (MENTIONER), and bots that are spamming hashtags related to their field in order to promote themselves or the products they are selling (PROMOTER).

Once all these specific categories are identified, the resulting multi-class dataset is divided into a train and a test set, then several classification algorithms are trained on it: the best-performing one is the Random Forest, giving good numerical results (average F1 score of 0.868). The final phase of the thesis is testing the trained algorithm "in the wild", namely on an unbiased dataset completely different from the one used in the training/testing phase, in particular the one collected in the first step through the streaming API. This is followed by an estimation of the Twitter population and a manual estimate of the real precision of the algorithm on these random data.

(3)

Sommario

This work aims to study the Twitter world, targeting the identification of bots, the so-called algorithmically driven accounts that populate this Social Network. The thesis focuses on bots that imitate human behavior through the interactions that the platform allows them, identifying three main actions that bots perform (retweet, mention, hashtag) together with the harms they can cause.

The dissimilar behaviors of genuine users and bots can be found in the differences between their tweets. The first step is therefore to collect users and tweets through the streaming API provided by Twitter, so as to approach the problem in an unsupervised way by gathering the data in real time. In particular, the data collection was carried out by saving the tweets and the users posting on Twitter from the United States in a given time window. Then graphs are built starting from the interactions; specifically, for every user one graph is built for the retweets, one for the mentions and one for the hashtags. A node embedding algorithm is then run to obtain numerical data, with the goal of identifying distinct groups. This approach does not yield the hoped-for results, probably because it is applied in a non-optimal context.

The second phase of the thesis is therefore to collect data from previous works in the literature, in order to have a labeled dataset and thus be able to adopt a supervised approach. Starting from these data, an accurate feature extraction pays particular attention to the features that consider the aforementioned interactions. The next phase goes deeper into the study of the collected bot category. The thesis proposes a finer level of granularity and departs from many works in the literature that instead treat the problem of recognizing a bot in a binary way (0-1). Indeed, three particular types of bot behavior are identified: bots that massively retweet other users to endorse someone or something (RETWEETER), bots that reply to and interact with humans in a deceitful way to expand their network of friends (MENTIONER), and bots that spam hashtags to promote themselves or the products they sell (PROMOTER).

Once these categories are identified, the resulting multiclass dataset is split into a train set and a test set, and several classification algorithms are trained on it: the best turns out to be the Random Forest, with an average F1 score of 0.868. The final phase of the thesis tests the previously trained algorithm "in the wild", that is, on unbiased data completely different from those used during the training/validation phase of the algorithm, namely the first dataset obtained through the streaming API. An estimate of the Twitter population then follows, together with a manual estimate of the true precision of the algorithm on these random data.

(4)

Contents

1 Introduction
  1.1 Context
  1.2 Problem statement
  1.3 Contributions
  1.4 Structure of the thesis

2 State Of Art
  2.1 Traditional approach
  2.2 Modern Social Bot Detection
  2.3 The way ahead

3 Graph Based Approach
  3.1 Streaming API Data Collection
  3.2 Graph Building
    3.2.1 Retweet graph
    3.2.2 Mention graph
    3.2.3 Hashtag Graph
  3.3 Node Embedding algorithm
  3.4 Unsupervised Techniques
    3.4.1 Hierarchical clustering
    3.4.2 KMeans
    3.4.3 DBScan
  3.5 Labelled Data Collection
    3.5.1 Cresci-RTbust-2019
    3.5.2 Cresci-stock-2018
    3.5.3 Botometer-Feedback-2019
    3.5.4 Gilani-2017
    3.5.5 Varol-2017
    3.5.6 Caverlee-2011
    3.5.7 Cresci-2017
  3.6 Node Embedding Supervised Approach
    3.6.1 Motivation of the failure

4 Features Based Approach
  4.1 Profile features
  4.2 Behavioral features

5 Exploratory Data Analysis
  5.1 Preprocessing
    5.1.1 Profile Feature Preprocessing
    5.1.2 Behavioral Features Preprocessing
  5.2 Data Exploration
    5.2.1 Basic Features Exploration
    5.2.2 Retweet Features Exploration
    5.2.3 Mentions and Hashtags Features Exploration
  5.3 Topic extraction
    5.3.1 Latent Dirichlet Allocation
    5.3.2 Results of LDA algorithm
  5.4 Bot exploration and labelling
    5.4.1 Cresci-Stock
    5.4.2 Social Spambots 2
    5.4.3 Caverlee bot
    5.4.4 Inclusions of the other dataset

6 Classification Algorithm development
  6.1 Support Vector Machine
  6.2 Random Forest
  6.3 Feed Forward Neural Networks
  6.4 Into the wild testing
    6.4.1 Mentioner
    6.4.2 Retweeter
    6.4.3 Promoter

7 Conclusion
  7.1 Limitations
  7.2 Future Works


1 Introduction

1.1 Context

Social media permeate society. Their unregulated nature has given rise to a flood of information of questionable credibility. For instance, it has been shown that businesses such as hotels post manipulated content on social media to gain an unfair advantage over their competitors (Mayzlin et al. [1]).

Against this backdrop, it becomes clear that social media have increasingly become interesting to people and organisations looking to influence the discussion on a certain topic. The automated dissemination of messages promises to be an efficient way to reach many people with little effort. In April 2017, Facebook reported 100,000 monthly active bots on the Messenger platform. In March 2017, Varol et al. [2] estimated that between 9% and 15% of active Twitter accounts are bots (29-49 million accounts out of 328 million, https://bit.ly/2v3AT6O).

Recently, social bots, algorithms programmed to mimic human behaviour on social media platforms, have become increasingly attractive for people and organisations aiming to automatically distribute their messages to many recipients at very low cost. Current studies reveal that social bots are involved in online discussions about current political events, such as the armed conflict between Ukraine and Russia and the war in Syria, by spamming the discussion with one-sided arguments or unrelated content to distract participants (Abokhodair et al. 2015; Hegelich and Janetzko 2016 [3]).

The mere presence of automated actors in vital opinion-shaping discussions provokes the fear of manipulation and thus ethical concerns. This has led to increasing attention on the impact of these automatically driven accounts. The great public interest in social bots underlines the importance of a profound scientific analysis of the topic.

A starting point to understand what kinds of harm may occur in practice are concrete examples of what we can call bot failures, that is, incidents where a bot reportedly caused damage to someone. Daniel et al. [4] followed a particular methodology to derive a taxonomy, based on Google Search, Google Scholar, as well as the ACM/IEEE/Springer online libraries, to search for anecdotal evidence of bot failures.

In the following, we describe the identified types of harm and the selected examples:


• Psychological harm occurs when someone's psychological health or wellbeing gets endangered or injured; it includes feelings like worry, depression, embarrassment, shame, guilt, anger, loss of self-confidence, or inadequacy. An example of a bot causing psychological harm is Boost Juice's Messenger bot, which was meant as a funny channel to obtain discounts by mimicking a dating game with fruits (botname: BoostJuice). Unfortunately, the bot was caught using inappropriate language, that is, dis-educating children or teenagers.

• Legal harm occurs when someone becomes subject to law enforcement or prosecution; it includes, for example, the breach of a confidentiality agreement or contract, the release of protected information, or threatening. A good example is the case of Jeffry Van Der Goot, a Dutch developer, who had to shut down his Twitter bot, which generated random posts, after the bot sent out death threats to other users (botname: DeathThreat). Police held him responsible for what the bot published.

• Economic harm occurs when someone incurs a monetary cost or loses time that could have been spent differently, e.g., due to the need to pay a lawyer or to clean one's own social profile. Bots are also new threats to security that eventually may lead to economic harm. For instance, a concrete example of an infiltration by a bot happened on Reddit in 2014, where the bot WiseShibe provided automated answers and users rewarded the bot with tips in the digital currency Dogecoin, convinced they were tipping a real user.

• Social harm occurs when someone's image or standing in a community gets affected negatively, e.g., due to the publication of confidential and private information like a disease. An example of a bot causing social harm was documented by Jason Slotkin, whose Twitter identity was cloned by a bot, confusing friends and followers (botname: JasonSlotkin6). However, bot owners may also incur social harm: Geico, Puma and Oreo had to publicly apologize for their bots respectively engaging with racist users (Geico), re-tweeting offensive content (Puma), and answering tweets from accounts with offensive names (Oreo).

• Democratic harm occurs when democratic rules and principles are undermined and society as a whole suffers negative consequences, e.g., due to fake news or the spreading of misinformation. Bessi and Ferrara [5], for instance, showed that bots were pervasively present and active in the online political discussion about the 2016 U.S. Presidential election (predating Robert S. Mueller III's investigation into the so-called Russian meddling). Without trying to identify who operated these bots, their finding is that bots intentionally spread misinformation, and that this further polarized the online political discussion (botname: PolarBot). A specific example is that of Seth Rich, a staff member of the Democratic National Committee, whose murder was linked to the leaking of Clinton campaign emails and artificially amplified by bots (botname: SethRich).

What these examples show is that as long as there are bots interacting with humans, there will be the risk of some kind of harm, independently of whether it is caused intentionally or unintentionally. Thus, the key question is: is it possible to prevent people from getting harmed by bots?

1.2 Problem statement

It is clear that there are issues related to this topic, and this holds for different kinds of Online Social Networks (OSNs). In this work we focus on Twitter in order to exploit its free and public API, which allows one to retrieve data "easily". We use the quotes because Twitter, of course, has limitations that were a huge constraint for our purposes, and a big effort was made in order to deal with them. Yet, Twitter remains the only social network that provides free access to its content without content or account-specific restrictions.

The data we are referring to are basically the timeline of each user. The timeline consists of a list of timestamp-text pairs, where the text can be up to 280 characters long. Because of the aforementioned limitations we retrieved 200 tweets for each user (we will discuss the data retrieval method later).

Now we focus on the interactions between accounts, which are the foundation of Twitter; by interacting with each other in different ways, the users build up a network.

Millimaggi and Daniel [6] studied 60 Twitter bot code repositories on GitHub in order to characterize potentially abusive actions, and they derived the following list:


• Talk to: Send direct messages to users to converse with them;

• Mention (@): Mention other users in a tweet or reply to someone under a post;

• Search: Search users or tweets using names, keywords, hashtags, ids or similar, or by navigating social network relationships;

• Retweet (RT): Used as an endorsement, its goal is to spread someone's tweet because it appears in the user's timeline;

• Like: Similar to a retweet because a user is showing an appreciation of someone's tweet, but it is very different because it doesn't show up in the user's timeline and a history cannot be retrieved;

• Follow: When a user follows someone else, they are building their own network of friendship where one can find all the timelines of the users they are following.

The work described in this thesis specifically aims at studying how social bots mimic interest in contents or people. Therefore, we take into account the interactions that users (genuine or not) have with each other: mentions, retweets and hashtags (#). The latter is not properly an interaction, but it can be seen as an endorsement of the underlying topic, as a user adopting a certain hashtag contributes to increasing its popularity.

Now the aim of our work is to find out which users are exploiting the Twitter functionalities to interact in an automatic and/or deceiving way with other accounts. These algorithmically driven accounts are called bots, and Twitter itself struggles a lot to tell apart this kind of user, which goes against its rules, from genuine accounts.

Therefore, among the actions we selected, we focus on the behaviors that are similar to the ones found by analyzing the code of the bots on GitHub (Millimaggi and Daniel [6]), summarized in Table 1 below.

In order to take hashtags into account as well, we identify mass hashtagging as an abuse: aggressively posting tweets having the same content and promoting the same hashtags.

We decided to focus on these interactions because nowadays it is very easy to manipulate information in Online Social Networks' communication. Moreover, we found on the internet that it is indeed very easy to get in touch with bot creators by paying an affordable price. For example, in order to make your hashtag trending, it is enough to pay just about two hundred dollars [23], as you can see from Figure 1.

Action    Abuse                      Description
Mention   Indiscriminate mention     Mention other users without checking suitability of username or content shared
          Targeted mention           Classify users on the basis of their tweets and mention them in targeted messages
          Whitelist-based mention    Mention only users whose attributes match some element of a whitelist
          Blacklist-based mention    Don't mention users whose attributes match elements of a blacklist
Retweet   Indiscriminate retweet     Retweet tweets without checking content or username for suitability
          Whitelist-based retweet    Retweet content only from users whose attributes match some element of a whitelist
          Blacklist-based retweet    Don't retweet tweets whose attributes or users satisfy some condition expressed in a blacklist
          Mass retweet               Aggressively retweet multiple tweets by selected users

Table 1: List of Abuses


And again, it is also possible to buy accounts that automatically retweet one of the tweets that the user chooses. In this way one can greatly increase the popularity of the tweet he/she decides to promote. There are several examples of sites offering this, like [22], in which the user can buy 250 retweets for just 4 dollars; a screenshot of the main page of the site is reported in Figure 2:

Figure 2: Retweet shop window

It is clear now that there are many ways to break Twitter rules [19, 20] and several countermeasures have been taken as we will explain later.

This thesis aims to label particular kinds of bots, focusing on the interactions they have with each other and with humans. Once these groups are defined, we follow the classical Data Science methodology in order to develop an efficient algorithm that is able to detect abusive behaviors and tell apart bots and genuine accounts.

1.3 Contributions

• We provide two new labelled datasets: one collected through the streaming API of Twitter, and a second one obtained by collecting previous works in the literature. With our work we approach the problem by enhancing its granularity; more specifically, we focus on how bots mimic interests: this activity is the basis for spreading misinformation or falsely endorsing people with authority they don't have. In doing this, we identify and label specific classes of bots, making the problem a multiclass one, instead of the usual binary one of most works in the literature.

• We study a graph-based approach, which consists in building up graphs of the retweets, mentions and hashtags and then using a node embedding algorithm in order to have a vector representation of every node, explaining why it does not work as we expected. The idea is that nodes with similar connections should, after applying the embedding, have "close" vector representations, so the hope is to group them together through a clustering algorithm. Since the assumption is that human accounts and bots have different behaviors and therefore different connections, they should be grouped in different clusters.

• We estimate the Twitter population by applying the algorithm to the first unlabelled dataset collected through the streaming API of Twitter. We also give a rough estimate of the precision of the algorithm in identifying bots, by taking a random sample of the users labelled as bots and computing the precision manually:

1) ≈ 88.5% of the Twitter population are genuine users
2) The mean precision in identifying bots is 33.3%

1.4 Structure of the thesis

As we were saying before, we follow the Data Science methodology and, in particular, the work develops in the following steps:

• Chapter 2 → Introduction to the state of the art of the problem;

• Chapter 3 → Graph-based approach: by exploiting the tweepy library we collected data from tweets sent from throughout the United States in real time. Then we built up graphs of the aforementioned interactions and applied a node embedding algorithm in order to obtain a vector representation of such graphs. We finally performed several unsupervised clustering techniques in order to obtain clusters and find groups of bots. After a second data collection from previous works with labelled data, we used the embedded features extracted from the built graphs again, and we saw that the results were not satisfying.

• Chapter 4 → Features-based approach: explanation and motivation of the handcrafted features extracted from the data previously collected.

• Chapter 5 → Exploratory Data Analysis: we analyzed the characteristics of the collected data by using traditional visualization and exploratory analysis techniques. We further went into the data labelled as bot in order to understand patterns and to create our own labels.

• Chapter 6 → Classification Algorithm Development: evaluation of several supervised algorithms in order to find the optimal one in terms of precision, recall and F1 score. After the selection of the best-performing algorithm, we apply it to the first dataset collected, as an "into the wild" test, in order to obtain an estimation of the Twitter population. Moreover, we apply random sampling of the three groups of bots discovered by the algorithm to compute a rough estimate of the precision.

• Chapter 7 → Conclusion: summary and comments on the work conducted, plus some ideas on possible future developments.


2 State Of Art

Nowadays a lot of effort has been made to find an optimal solution to tell apart bots and genuine accounts. An online tool called Botometer has also been developed. The tool builds on more than 1000 features among network, user, friends, temporal, content and sentiment features, and uses a Random Forest classifier for each subset of features. The training data used is based on bot accounts collected in prior work by Lee et al. [7], who used several Twitter honeypots to lure bots and collected about 36,000 candidate bot entities following or messaging their honeypot accounts. Botometer is a machine learning classifier system based on engineered features of heterogeneous nature. It considers attributes stemming from sentiment analysis, timing, the user's network, the user's content and metadata. This tool is available online and it is considered and taken as the state of the art by several works in the literature.

Cresci [8] studied the evolution of the problem from the first works to the newest techniques and gave a characterization of the steps.

2.1 Traditional approach

This approach comprises the vast majority of the papers published nowadays. It basically consists in designing machine learning features in order to maximize the detection performance of a well-known machine learning algorithm.

Despite achieving promising initial results, the traditional approach, which still comprises the majority of papers published nowadays, has a number of drawbacks. The first challenge in developing a supervised detector is related to the availability of a ground truth (i.e., labeled) dataset, to be used in the learning phase of the classifier.

Moreover, as more work was done in this field and the achievements improved, social bot developers learned how to evade the classifiers, and this led to the development of more refined bots. They were no longer spamming the same messages over and over again, but were instead posting several messages with the same meaning but with different words, in order to evade detection techniques based on content analysis.

Starting from these findings, Yang et al. [9] also proposed a supervised machine learning classifier that was specifically designed for detecting evolving bots. Their classifier simultaneously leveraged features computed from the content of posted messages, social connections, and tweeting behaviors, and initially proved capable of accurately detecting the sophisticated bots.

Unfortunately, the work of Cresci et al. [10] showed that the paradigms of social networks are changing and that even more sophisticated social spambots exist which are very similar to genuine accounts. The work shows that the classification algorithms developed up to that point were failing against this new kind of automated user.

What this paper found is that current sophisticated bots are practically indistinguishable from legitimate accounts if analyzed one at a time, whilst if we take a whole group of bots acting together as the granularity of the analysis, the results improve a lot.

2.2 Modern Social Bot Detection

As a consequence of this paradigm shift, modern bot detectors are particularly effective at detecting evolving, coordinated, and synchronized bots. So it is clear that an unsupervised approach is needed, and a technique discussed in [11] associates each account with a sequence of characters that encodes its behavioral information. Such sequences are then compared with one another to find anomalous similarities among the sequences of a subgroup of accounts. The similarity is computed by measuring the longest common subsequence shared by all the accounts of the group. Another recent example of an unsupervised, group-based technique is RTbust [27], which is tailored for detecting mass-retweeting bots. The technique leverages unsupervised feature extraction and clustering. An LSTM autoencoder converts the retweet time series of accounts into compact and informative latent feature vectors, which are then clustered by a hierarchical density-based algorithm. Accounts belonging to large clusters characterized by malicious retweeting patterns are labeled as bots, since they are likely to represent retweeting botnets. Given that bot detection techniques belonging to this modern approach still represent the minority of all published papers on social bot detection, we still lack a thorough and systematic study of the improvement brought by the modern approach to social bot detection, but the results of these few works look very promising. Furthermore, the majority of modern bot detectors are semi-supervised or unsupervised, which mitigates the challenges related to the acquisition of a reliable ground truth.

2.3 The way ahead

It is interesting to notice that this paradigm shift is happening because of the need to improve the performance of the detectors in finding bots. A major implication of this reactive approach is that improvements in bot detection are possible only after having collected evidence of new bot mischief. In turn, this means that scholars and OSN (Online Social Network) administrators are constantly one step behind bot developers, and that bots have a significant time span (i.e., the time needed to design, develop, and deploy a new detector) during which they are essentially free to tamper with our online environments.

That is the reason why a new approach has just started to be investigated by researchers: adversarial machine learning [27]. This technique has been applied to several fields, such as computer vision and speech recognition, with exceptional results.

This branch of research aims at studying possible attacks on existing bot detectors, with the goal of building more robust and more secure detectors. In this context, adversarial examples might be sophisticated types of existing bots that manage to evade detection by current techniques, or even bots that do not exist yet but whose behaviors and characteristics are simulated, or bots developed ad hoc for the sake of experimentation. Finding good adversarial examples can, in turn, help scholars understand the weaknesses of existing bot detection systems, before such weaknesses are effectively exploited by bot developers.

By changing the approach from reactive to proactive, it is clear that countermeasures can be taken before the damage has occurred, and the big advantage would be to predict the development of bots before they are implemented, giving a great edge to the prevention of harm.

We learned how the history of research in this area has developed, from the traditional approach focused on user-based detection, to a completely different point of view with the adversarial and proactive way of approaching the problem.

Each of them has pros and cons, and we can say that they focus on different granularities of the problem:

• traditional → user based;
• modern → group based;
• adversarial → bot detector based.

With this thesis we want to face the problem in a Modern way, through a Graph based approach and in a Traditional way with a Feature Based approach.

In the past the problem of bot detection has mostly been seen as the problem of telling bots and humans apart. With our work we want to go one step further and analyze more specifically how bots mimic interest. This activity is the basis for spreading misinformation, falsely endorsing people with authority they don't have, or tricking genuine users into spending their money. In order to do this, we improve the granularity of the analysis by creating different labels of bots instead of reducing the problem to a binary one; therefore we can identify specific behaviors and give answers that other approaches would not be able to give.


3 Graph Based Approach

Since the state of the art of the field lies between the modern and the traditional approach, and even though there is much more literature on the traditional approach, we decided to go with the former one to begin with.

As already explained in the introduction, the focus of the thesis is on interactions between users; specifically, the ones considered are retweets, mentions and hashtags. We will explain later how to build a graph from these actions.

Initially we need to collect data from users; in our case the data are the timelines of 200 tweets of every collected user. For this purpose we exploited the Twitter library tweepy.

3.1 Streaming API Data Collection

The tweepy library is one of the most popular APIs (Application Programming Interfaces) used by developers and researchers. It has several functionalities and it works with objects; we mainly focus on the Tweet Object (link1) and the User Object (link2).

Moreover, in this phase of the thesis we exploit another useful functionality, the streaming API. This tool allows one to get real-time tweets by letting the developer set some filters in order to retrieve the desired data. In our case we wanted to have the least possible distortion, namely we did not want to retrieve data that are topic-specific; in addition to that, we wanted tweets written in the Latin alphabet. For these reasons we used as filters the United States coordinates and the English language.

The workflow that leads from the streaming API phase to the actual graph representation is described in Figure 3; in particular, the action "Downloading Tweets from Users" is carried out with the tweepy function api.user_timeline, which retrieves up to 200 tweets of the user id passed as input.
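To make this collection step concrete, the following is a minimal sketch of how such a pipeline could look with tweepy 3.x; the bounding box, the credential placeholders and the collect_timeline helper are illustrative assumptions rather than the exact code used for the thesis.

```python
import tweepy

# Illustrative credentials: replace with real API keys (assumption, not from the thesis).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

collected_user_ids = set()

class CollectListener(tweepy.StreamListener):
    """Stores the ids of the users observed in the real-time stream."""
    def on_status(self, status):
        collected_user_ids.add(status.user.id)

    def on_error(self, status_code):
        # Stop on rate-limit errors to respect Twitter's policy.
        return status_code != 420

# Rough continental-US bounding box (lon/lat pairs), English-only tweets.
US_BBOX = [-125.0, 24.0, -66.0, 50.0]
stream = tweepy.Stream(auth=api.auth, listener=CollectListener())
stream.filter(locations=US_BBOX, languages=["en"], is_async=True)

# Later, download up to 200 tweets of each collected user's timeline.
def collect_timeline(user_id, n=200):
    return api.user_timeline(user_id=user_id, count=n, tweet_mode="extended")
```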

3.2 Graph Building

Now we give some details about the construction of the graphs by taking into account the different interactions.


Figure 3: Streaming and Graph building workflow

3.2.1 Retweet graph

We consider as a retweet a simple tweet that begins with the string "RT @". The user who is posting the tweet is taken as the central node and a link is created with the cited user: namely, if we are analysing the collected account X, as soon as we find a tweet that begins with "RT @Y + text", a link is created between X and Y and a weight equal to 1 is set on this link. Since we do not care about the cited user, the link can be unidirectional or bidirectional; this does not change anything for our purposes. By doing this for the whole timeline of the user, if we find the same cited user Y we increase the weight of the link previously set, and at the end we obtain a weighted graph that looks like the one represented in Figure 4. Formally, the graph is a tuple <A, E>, where A is the set of accounts and E the set of (weighted) non-directed edges; in our case, the higher the weight, the thicker the link in Figure 4. The central "Account" is the one collected in our dataset, while the surrounding "Users" are the ones retweeted by the central node.


Figure 4: Example of a retweet graph

The expectation is that bots, which tend to retweet the same people over and over, have graphs with fewer edges, but those few edges are heavier than the human ones; therefore we hope to capture this behavior with the embedding.
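As an illustration of this construction, here is a minimal sketch using networkx; it assumes each tweet is available as plain text, and the "RT @" parsing below is a simplification of what is done on the full tweet objects.

```python
import re
import networkx as nx

RT_PATTERN = re.compile(r"^RT @(\w+)")

def build_retweet_graph(account, tweets):
    """Weighted, undirected star graph: the collected account in the center,
    one edge per retweeted user, weight = number of retweets of that user."""
    g = nx.Graph()
    g.add_node(account)
    for text in tweets:
        match = RT_PATTERN.match(text)
        if match:
            retweeted_user = match.group(1)
            if g.has_edge(account, retweeted_user):
                g[account][retweeted_user]["weight"] += 1
            else:
                g.add_edge(account, retweeted_user, weight=1)
    return g

# Example: retweeting the same account twice yields one edge of weight 2.
demo = build_retweet_graph("X", ["RT @Y hello", "RT @Y again", "RT @Z other"])
print(demo.edges(data=True))
```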

3.2.2 Mention graph

In Twitter, mentioning one or more users can be done in several ways, explained in the following list:

1. Replying to somebody's tweet: the API retrieves a tweet of the form "@user + text"

2. Citing a user in a tweet: "text + @user + text"

3. Replying to a retweet: "@original account + retweeting account + text"

For each collected account we search for such interactions and build up a graph in this way:


• The central node is the considered account

• A bidirectional link is created every time we find in the text the string “@USER”

• If the same user is found more than one time the weight of the link is increased by one

The result is a graph of the form of Figure 4. In this way we expect to capture different behaviors with the embedding; in particular, bots can use the "Indiscriminate mention" action (see Table 1) and this can be a good way to spot it.

3.2.3 Hashtag Graph

As already explained, the hashtag is not properly an interaction with a user, but we consider it so because it expresses an endorsement of the underlying topic, and therefore it can be seen as an "interaction" with a content. So, for every user, we searched for the string "#generic hashtag" in the tweet history, and every time we found one we created a link between the user and the hashtag. Moreover, every time we saw the same hashtag, the weight of the edge already created was increased by one.

In this way we obtain, for every user, a graph similar to the one in Figure 4, but with links between the account and the hashtags instead of between users. With this we hope to capture the different ways of using this Twitter feature; in particular, we expect that bots make more use of the "mass hashtag" action explained in the introduction. Indeed, by spamming the same hashtag over and over, it is possible to increase its popularity. Twitter itself has a ranking, updated in real time, that tells users which are the trending topics.
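The mention and hashtag graphs follow the same pattern as the retweet graph, only changing how the neighbor is extracted from the text. A possible sketch, again assuming plain tweet texts rather than parsed tweet objects, with simple illustrative regular expressions for the extraction:

```python
import re
from collections import Counter

MENTION_PATTERN = re.compile(r"@(\w+)")
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def edge_weights(account, tweets, pattern):
    """Counts how many times each mentioned user (or hashtag) appears in the
    account's timeline; each entry becomes a weighted edge from the account."""
    counts = Counter()
    for text in tweets:
        counts.update(pattern.findall(text))
    return {(account, neighbor): w for neighbor, w in counts.items()}

tweets = ["@alice check this #deal #deal", "thanks @alice and @bob #deal"]
print(edge_weights("X", tweets, MENTION_PATTERN))   # mention graph edges
print(edge_weights("X", tweets, HASHTAG_PATTERN))   # hashtag graph edges
```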

3.3 Node Embedding algorithm

Any supervised machine learning algorithm requires a set of informative, discriminating, and independent features. In prediction problems on networks this means that one has to construct a feature vector representation for the nodes and edges. A typical solution involves hand-engineering domain-specific features based on expert knowledge.

An alternative approach is to learn feature representations by solving an optimization problem. In order to do so, we exploit an algorithm built by Aditya Grover and Jure Leskovec called "node2vec" [12]. The idea is inspired by the Skip-Gram model in word embedding [13], where the algorithm scans over the words of a document and aims to embed every word so that the extrapolated features can predict nearby words (i.e., words inside some context window). So, the same way a document is an ordered sequence of words, one could sample sequences of nodes from the underlying network and turn a network into an ordered sequence of nodes. We leave the details of the functioning of the algorithm to the reader, since they are not the aim of this thesis.

Therefore, we apply the algorithm to our built graphs. The most important hyper-parameter is the dimension of the feature vector we want to obtain. Since "node2vec" allows us to learn low-dimensional vector representations of nodes, we set the output dimension to 10. Indeed, the context of our problem is a sequence of several little graphs that represent how the central node, namely the collected account, interacts with other users.

Our framework is now the following:

• We collected 65k accounts with the streaming API;

• For each of the 3 interactions and for each account collected, we build up a graph as already explained;

• We applied the "node2vec" algorithm to each of these graphs (for the users that do not have a certain interaction we created a vector of zeros as embedding);

• The results are three datasets with 65k rows and 10 columns that can be combined together in order to exploit the information of all the actions: retweet, mention and hashtag.
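A sketch of the embedding step, assuming the node2vec package available on PyPI; the walk hyper-parameters below are illustrative defaults, not the exact values used in the thesis.

```python
import numpy as np
from node2vec import Node2Vec  # pip install node2vec
import networkx as nx

def embed_account_graph(graph, account, dimensions=10):
    """Returns the 10-dimensional node2vec vector of the central account,
    or a vector of zeros when the account has no interactions of this type."""
    if graph.number_of_edges() == 0:
        return np.zeros(dimensions)
    n2v = Node2Vec(graph, dimensions=dimensions, walk_length=10,
                   num_walks=50, weight_key="weight")
    model = n2v.fit(window=5, min_count=1)
    return model.wv[str(account)]

# Example on a tiny weighted star graph (placeholder for a real retweet graph).
g = nx.Graph()
g.add_edge("X", "Y", weight=2)
g.add_edge("X", "Z", weight=1)
print(embed_account_graph(g, "X"))
```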

3.4 Unsupervised Techniques

The goal is labelling the data we have previously collected by exploiting the features extracted with the embedding algorithm. In order to do so we try some clustering algorithms and aim to identify some groups that are behaving in a “similar” way. Similar in the sense that they are sharing a comparable pattern among the interactions that have been considered.

The clustering algorithms are the following:

• Hierarchical clustering
• KMeans
• DBScan

3.4.1 Hierarchical clustering

This algorithm aims to group together the clusters that are "close" to each other. Initially, every point is a cluster in itself, so the algorithm needs to compute the distance matrix of all the points and then group them together. This approach has some pros and cons: it is not very scalable, since the time complexity is O(N^3) and the space complexity is O(N^2). Since our dataset is too large, we take a random sample of 6500 users from the initial data, in order to reduce the size of the problem and run the algorithm on a smaller yet representative set. On the other hand, we do not need to set the number of clusters a priori.

We know that one core aspect of the hierarchical clustering algorithm is the choice of the method used to merge two clusters together, called linkage. The one we choose is the most common and best-performing one, namely the Ward linkage, which groups together the clusters whose merging leads to the minimum increase in total within-cluster variance. Mathematically speaking, it solves the following problem:

\arg\min_{A,B} \Delta(A,B) = \arg\min_{A,B} \frac{n_A\, n_B}{n_A + n_B}\, \lVert \vec{m}_A - \vec{m}_B \rVert^2 \qquad (1)

where n_A and n_B are the cardinalities of clusters A and B, respectively, \vec{m}_j is the vector representing the center of cluster j, and \lVert \cdot \rVert is the Euclidean distance.

The algorithm builds up the so-called dendrogram, representing the distances at which two clusters are merged, shown in Figure 5, from which we can notice that a clear distinction between groups is not present. Even going into the highlighted groups and analyzing the users within the same cluster, we observe that different behaviors coexist inside the same clusters. So, initially, the results do not look very promising.
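A minimal sketch of this step with SciPy, assuming the embedded features of the sampled users are stacked in a matrix X (here a random placeholder of shape (6500, 30)):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# X: embedded features of the sampled users; placeholder for the real embeddings.
X = np.random.rand(6500, 30)

# Ward linkage: merges the pair of clusters with the minimum increase
# in total within-cluster variance, as in equation (1).
Z = linkage(X, method="ward")

dendrogram(Z, truncate_mode="lastp", p=30)  # show only the last 30 merges
plt.title("Dendrogram of the embedded users")
plt.show()

# Cut the tree into a chosen number of flat clusters for manual inspection.
labels = fcluster(Z, t=5, criterion="maxclust")
```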

3.4.2 KMeans

The algorithm belongs to the family of representative-based clustering. It basically works by minimizing the intra-cluster distances and is stochastic because the starting centroids are chosen randomly. The drawback is that the number of clusters has to be set a priori, but on the other hand it does not have to compute any distance matrix.


Figure 5: Dendrogram

Even with this clustering approach we did not obtain any interesting results by analyzing the clusters obtained. We show below the knee/elbow analysis plot:

Figure 6: WCSS behavior

From Figure 6 we can note that there is no elbow, but a smooth decrease of the WCSS (Within-Cluster Sum of Squares). This means that, by increasing the number of clusters, there is not a clearer distinction between groups.
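The elbow analysis of Figure 6 can be reproduced with a loop like the following scikit-learn sketch; the range of k values is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(6500, 30)  # placeholder for the embedded features

wcss = []
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for k clusters

# Plotting k against wcss and looking for an "elbow": a smooth curve,
# as observed here, suggests that no clear cluster structure emerges.
```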


3.4.3 DBScan

This is a density-based clustering technique. The main idea is that a point is considered part of a group if it has a minimum number of points in its defined neighborhood. A point that does not satisfy this property is considered a noise point. This kind of algorithm is very useful to avoid some limitations that KMeans has, like not working properly with non-convex data. Unfortunately this algorithm did not reach the expected results either.
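A corresponding scikit-learn sketch; the eps and min_samples values below are illustrative and would need tuning (for instance via a k-distance plot).

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(6500, 30)  # placeholder for the embedded features

# Points with at least min_samples neighbors within radius eps become core
# points; points reachable from no core point are labeled as noise (-1).
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
n_noise = int((db.labels_ == -1).sum())
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, "clusters,", n_noise, "noise points")
```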

This whole approach based on unsupervised techniques does not bring us to the desired goals. In order to better understand its weaknesses, we change the point of view by applying the entire procedure described in Figure 3 to a dataset where we have a ground truth, namely the labels for telling humans and bots apart, which means moving from the unsupervised to the supervised world.

3.5 Labelled Data Collection

Now the purpose is to collect a complete dataset from previous works in the literature. We study and select several papers that we consider useful for our thesis, which we describe below.

3.5.1 Cresci-RTbust-2019

In this paper [14] the authors exploited the temporal pattern of retweets in order to find groups of bots. The idea is that humans have been proven to exhibit much more behavioral heterogeneity than automated accounts. As a consequence, the heterogeneous humans are expected not to be sufficiently "dense" to be clustered. With this approach the authors found two previously unknown groups of bots that simultaneously retweeted some Italian singers.

3.5.2 Cresci-stock-2018

Cresci et al. [15] retrieved and studied 9M tweets related to stocks of the 5 main financial markets in the US. By comparing tweets with financial data from Google Finance, they highlighted important characteristics of Twitter stock microblogs. More importantly, they discovered a malicious practice aimed at promoting low-value stocks by retweeting them together with high-value ones. The behavior of the bots was coordinated, and this new form of financial spam could have severe consequences on unaware investors as well as on automatic trading systems.

3.5.3 Botometer-Feedback - 2019

This dataset [25] was constructed by manually labeling accounts flagged by feedback from Botometer users. The work was conducted by K.C. Yang.

3.5.4 Gilani-2017

Gilani et al. [16] provided a manual annotation of Twitter data collected through the streaming API. They recruited four undergraduate students for the purposes of annotation, who classified the accounts over the period of a month. This was done using a tool that automatically presents Twitter profiles and allows the recruits to annotate the profile with a classification (bot or human) and add any extra comments. Each account was reviewed by all recruits before being aggregated into a final judgement using a final collective review (via discussion among recruits if needed).

3.5.5 Varol-2017

The dataset by Varol et al. [2] contains a list of Twitter accounts labeled as bots (1) or humans (0). The construction of this dataset starts with the identification of a representative sample of users, by monitoring a Twitter stream for 3 months, starting in October 2015. Thanks to this approach it is possible to collect data without bias; in fact, other methods like snowball sampling (a technique that nominates new samples starting from the social networks of an initial pool of users) or breadth-first search (a graph exploration technique) need an initial user set. During the observation window about 14 million user accounts were gathered. Using the classifier trained on the honeypot dataset in [17], the authors performed a classification over the active collected accounts. Then the samples were grouped by their score and 300 accounts from each bot-score decile were randomly selected. The 3,000 extracted accounts were manually labeled by volunteers. The authors analyzed the users' profiles, friends, tweets, retweets and interactions with other users. Then a label was assigned to each user.


3.5.6 Caverlee-2011

The dataset [17] is composed of content polluters, detected by a system of 60 social honeypots, i.e. Twitter accounts created to serve the purpose of tempting bots, and of genuine accounts, randomly sampled from Twitter. The authors observed the content polluters that interacted with their honeypots during a 7-month time span. The accounts that were not deleted under the policy terms of the social platform were clustered with the Expectation-Maximization algorithm for soft clustering. At the end of this process, the authors found nine different clusters, which were grouped into four main categories, summarized in Table 2.

Cluster                 Description
Duplicate Spammers      Accounts that post nearly the same tweets, with or without links
Duplicate @ Spammers    Similar to the Duplicate Spammers, but they also use Twitter's @ mechanism by randomly including a genuine account's username
Malicious Promoters     These bots post tweets about marketing, business, and so on
Friend Infiltrators     Their profiles and tweets seem legitimate, but they mimic the mutual-interest following relationship

Table 2: Clusters of bots found by the authors

Each cluster has been manually checked, in order to assess its credibility. For each content polluter, the authors collected the 200 most recent tweets, the following and follower graphs, and the temporal and historical profile information.

In order to collect genuine users too, some Twitter ids were randomly sampled and monitored for three months, to see if they were still active and not suspended by the social platform moderation service.

The authors subsequently built a classifier framework trained on their dataset, which uses crafted features grouped in four clusters: User Demographics, User Friendship Networks, User Content, and User History. After testing several classification algorithms, the best-performing one was the Random Forest.


3.5.7 Cresci-2017

Cresci et al. [10] collected several groups of accounts, among them a novel group of social spambots that are very similar to genuine accounts.

The categories are the following:

• Genuine Accounts: the authors randomly contacted Twitter users by asking a simple question in natural language. All the replies to their questions were manually verified, and the accounts that did not answer were discarded and not used in their study;

• Social Spambots #1: this dataset was created after observing the activities of a novel group of social bots that the authors discovered on Twitter during the Mayoral election in Rome, in 2014. One of the runners-up employed a social media marketing firm for his electoral campaign, which made use of almost 1,000 automated accounts on Twitter to publicize his policies;

• Social Spambots #2: the authors discovered a group of social spambots that spent several months promoting the #TALNTS hashtag. Specifically, Talnts is a mobile phone application for getting in touch with and hiring artists working in the fields of writing, digital photography, music, and more. The vast majority of tweets were harmless messages, occasionally interspersed with a tweet mentioning a specific genuine (human) account and suggesting that they buy the VIP version of the app from a Web store;

• Social Spambots #3: they advertise products on sale on Amazon.com. The deceitful activity was carried out by spamming URLs pointing to the advertised products. Similarly to the retweeters of the Italian political candidate, this family of spambots also interleaved spam tweets with harmless and genuine ones;

• Traditional Spambots #1: this dataset is the training set used in [9]. To label a user as a bot, the authors utilized two methods to detect malicious or phishing URLs in the tweets:

  1. Google Safe Browsing (GSB) [18]: a widely used and trustworthy blacklist to identify malicious/phishing URLs;
  2. A high-interaction client-side honeypot based on Capture-HPC [19], which visits the URL using a real browser in a virtual machine.

  If a user posts at least one malicious tweet, it is labelled as a bot;

• Traditional Spambots #2: rather simplistic bots that repeatedly mention other users in tweets containing scam URLs. To lure users into clicking the malicious links, the content of their tweets invites the mentioned users to claim a monetary prize;

• Traditional Spambots #3 and #4: these datasets are related to 2 different groups of bots that repeatedly tweet about open job positions and job offers.

3.6 Node Embedding Supervised Approach

For our purposes, the bots we are interested in are the ones that have interactions with other users; in this sense we avoid collecting those users that simply post tweets without mentioning, retweeting or posting any hashtags. Moreover we want to have, for each user, a time series of tweets/retweets that is long "enough", because we do not want to collect inactive users. In light of these considerations, we kept all the works previously mentioned except the Social Spambots #3 and the Traditional Spambots #2, #3, #4 datasets. Moreover, for all the collected data, we kept only those accounts for which at least 50 tweets are available, namely accounts that posted at least 50 tweets in their history.

After this selection, the result is a rich dataset of bots (labelled as 1) and genuine (labelled as 0) accounts:

Account type    Number
Genuine         22819
Bot             25646

Table 3: Resulting dataset

Now we want to apply the classification algorithm that is usually the best-performing one in this field, namely the Random Forest. In this way we can have a measure of the performance of the approach based on the embedded features.

Therefore we repeat the same procedure described in Figure 3 and obtain once again the embedding (10 features for each interaction) of the new data. The big difference is that now there is a column indicating the label of each row/account, so that we can perform the classification.

Then we apply a stratified division between train set and test set, giving the latter 20% of the numerosity of the former. Finally we train the Random Forest classifier, obtaining the results shown in Table 4:

Metric      Genuine   Bot
Precision   0.58      0.69
Recall      0.72      0.56
F1 score    0.64      0.62

Table 4: Results of RF with embedding features

The metrics used are the classic ones for classification problems. They are obtained by combining the number of the positive instances well classified (TP= True Positive), the number of the positive instances classified as negative (FN= False Negative), the number of the negative instances classified as positive (FP= False Positive) and the number of the negative instances correctly classified (TN= True Negative).

• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F1 score: 2 · Precision · Recall / (Precision + Recall)

As can be noticed from Table 4, the results are not satisfying; indeed, just by randomly guessing the bot class, the precision would be about 0.53 ≈ 25646/48465.
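The supervised baseline of this section can be summarized by the following scikit-learn sketch, where X is assumed to hold the embedded features per account and y the 0/1 labels (the variable names and the placeholder data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholders: the real data has 48465 rows and 30 embedded features.
rng = np.random.default_rng(0)
X = rng.random((1000, 30))
y = rng.integers(0, 2, size=1000)   # 0 = genuine, 1 = bot

# Stratified 80/20 split so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Precision, recall and F1 per class, as reported in Table 4.
print(classification_report(y_test, rf.predict(X_test),
                            target_names=["genuine", "bot"]))
```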

3.6.1 Motivation of the failure

For all the analyses made, the failure is probably due to the fact that the embedding algorithm is not performing at its best because of the context in which we are using it. To understand this we need to go a little deeper into it.

The algorithm is built to derive a vector representation of every node of one single graph. The workflow in Figure 3 creates an embedding of several "little" graphs, and it seems that this vector representation is almost random.

The embedding is computed by solving the following optimization problem:

\max_f \sum_{u \in V} \log \Pr\big( N_S(u) \mid f(u) \big) \qquad (2)

where f : V → R^d is the function mapping the set of nodes V to the d-dimensional embedding space, while N_S(u) is the neighborhood of node u obtained with the sampling strategy S.

The sampling technique is based on a biased random walk through the graph. For our problem we have several "little" graphs, and this low dimension of the problem does not let the random walk differentiate well between one graph and another. The algorithm probably needs a problem of bigger magnitude, namely more complex graphs.

To confirm this, we perform the embedding in lower dimensions, i.e. 2D and 3D, in order to see if something changes and to be able to plot the resulting points.

The numerical results are not very different from the previous ones with the 10D embedding, as expected, and now we have a graphical representation of the points in the vector space, shown in Figure 7.

Figure 7: 3D embedding of retweet graphs

This, for example, is a random sample of 7500 points representing the retweeting activity. As can be seen, the points seem quite randomly spread across the domain; there is no clear cluster, nor mini-clusters building up.

This behavior can be noticed also in 2 dimensions, for instance we show the mention activity in figure 8.


Figure 8: 2D embedding of mention graphs

Again the scatterplot shows this sparse behavior of the points in the space, confirming the explanation we gave before. It seems that the embedding is not working as expected, and the reason is probably the low dimension of our graphs.

In any case this approach does not give us the outcomes that we would like to obtain. Indeed, we expect to be able to capture the different behaviors of the interactions by catching the differences between the connections of every single account. At this point we want to follow a completely different approach by handcrafting the features ourselves, with a particular focus on the features that take into account the interactions.


4 Features Based Approach

This chapter is devoted to the explanation of the handcrafted features that we extrapolate from the accounts' data and metadata, which we divide into different categories.

Technically, we had all the accounts' ids from the previous works, and we retrieved the data by exploiting the already mentioned tweepy library, in particular these two functions:

Name               Functionality
api.get_user       Retrieves the user object, taking the user id as input
api.user_timeline  Retrieves the last tweet objects (up to 200) of the user id passed as input

Table 5: Main functions of the tweepy library
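In practice, the feature extraction loops over the account ids with calls like the following sketch, assuming an authenticated tweepy 3.x api object as in Chapter 3; error handling for deleted, suspended or protected accounts is omitted.

```python
def fetch_account(api, user_id):
    """Returns the profile metadata and up to 200 recent tweets of an account."""
    user = api.get_user(user_id=user_id)          # User object (profile features)
    tweets = api.user_timeline(user_id=user_id,   # Tweet objects (behavioral features)
                               count=200, tweet_mode="extended")
    return user, tweets
```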

4.1 Profile features

The retrieved User object has several attributes that indicate the basic information of the account. Although they are the easiest features to obtain, they are very important.

Since we are talking about data collected from the previous works described in Chapter 3, many of the listed accounts, especially bots, were no longer present on the Twitter platform, and for some of them many tweets had been deleted, so we faced some difficulties in finding and selecting the accounts for the analysis. In the end, as already mentioned, we decided to keep the accounts that have at least 50 tweets in their history, so that we have a time series long enough to be meaningful.


Feature name                    Type            Description
user_id                         int             Unique identification number of the user
name                            string          The full name of the user
screen_name                     string          The nickname of the user
location                        string          Where the user is located
url                             string or null  Link in the user's profile (if present)
description                     string or null  Description of the user (if present)
protected                       boolean         1/0 if the account is protected or not
verified                        boolean         1/0 if the account is verified or not
listed_count                    int             Number of lists in which the user is present
favourites_count                int             Number of likes given by the user since the account's creation
statuses_count                  int             Number of tweets posted since the account's creation
created_at                      string          Time of creation of the account
profile_background_color        string          Color of the profile's background image (if present)
profile_background_image_url    string          Link to the background image
profile_background_tile         boolean         1/0 if the profile's image has the tile or not
profile_image_url               string          Link to the profile image
profile_link_color              string          Color of the profile's links
profile_text_color              string          Color of the profile's text
profile_use_background_image    boolean         1/0 if the user has a profile background image or not
default_profile                 boolean         1/0 if the user has not altered the theme or background of their user profile
default_profile_image           boolean         1/0 if the user has not uploaded their own profile image and a default image is used instead
followers_count                 int             Number of accounts that follow the user
friends_count                   int             Number of accounts the user is following

Table 6: Profile Features


From followers count and friends count we also compute the ratio f_ratio = followers_count / friends_count, in order to have information on the balance between the accounts followed by the user and those following the user.

4.2 Behavioral features

In this section we describe how and why we extracted this kind of features. They are perhaps the most significant ones, because they are based on the timestamps and therefore on the frequency of the user's activities. Recall that for each collected account we have a time series of at least 50 and up to 200 points, from which we can extract some synthetic measures.

To begin with, we compute the tweet and retweet intradistances in order to obtain a temporal distribution of the tweeting activity:

Figure 9: Tweeting and retweeting intradistances example

The results are two vectors of deltas, one for the tweets and one for the retweets, representing how frequently the user posts a tweet or retweets somebody's tweet.

The next step is to calculate some synthetic measures from these two distributions, which for us are: minimum, maximum, average and median. We therefore have 8 new handcrafted features, listed in table 7.

Feature name | Type | Description
min time tweet | float | minimum tweet intradistance
max time tweet | float | maximum tweet intradistance
average time tweet | float | average tweet intradistance
median time tweet | float | median of the tweet intradistances
min time retweet | float | minimum retweet intradistance
max time retweet | float | maximum retweet intradistance
average time retweet | float | average retweet intradistance
median time retweet | float | median of the retweet intradistances

Table 7: Behavioral Features


Note that the vector of tweet or retweet intradistances can be empty; in this case the corresponding features are set to null. Also note that the distances are expressed as seconds elapsed between consecutive tweets or retweets.

From the collected time series we also have the number of tweets and retweets and the time elapsed between the first and the last tweet/retweet. We can use this information to compute two other features:

tweet_per_hour = (num_of_tweets / total_elapsed_time) * 3600

retweet_per_hour = (num_of_retweets / total_elapsed_time) * 3600

(3)

where 3600 is the number of seconds in one hour, so that both measures are expressed per hour.
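A minimal sketch of these computations follows, under the assumption that each timeline entry has been reduced to a (created_at, is_retweet) pair; the helper names are illustrative and not taken from the thesis code.

```python
# Illustrative sketch of the behavioral features: intradistances and
# per-hour rates computed from a list of (datetime, is_retweet) pairs.
from statistics import mean, median

def intradistances(timestamps):
    """Seconds elapsed between consecutive events (timestamps sorted ascending)."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

def behavioral_features(timeline):
    """timeline: list of (created_at, is_retweet) pairs for one account."""
    tweet_ts = [t for t, is_rt in timeline if not is_rt]
    retweet_ts = [t for t, is_rt in timeline if is_rt]

    features = {}
    for name, deltas in (("tweet", intradistances(tweet_ts)),
                         ("retweet", intradistances(retweet_ts))):
        # empty delta vectors leave the corresponding features as null (None)
        features["min_time_" + name] = min(deltas) if deltas else None
        features["max_time_" + name] = max(deltas) if deltas else None
        features["average_time_" + name] = mean(deltas) if deltas else None
        features["median_time_" + name] = median(deltas) if deltas else None

    all_ts = [t for t, _ in timeline]
    elapsed = (max(all_ts) - min(all_ts)).total_seconds()  # total_elapsed_time
    if elapsed > 0:
        features["tweet_per_hour"] = len(tweet_ts) / elapsed * 3600
        features["retweet_per_hour"] = len(retweet_ts) / elapsed * 3600
    return features
```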

To conclude, we handcrafted 10 behavioral features that strictly refer to the tweeting and retweeting activity, aiming to capture the differences in frequency among the collected users. To take a step further, we need to go inside the text of the tweets and extract features that give us some insights about their content.

4.3 Content features

As previously said, we collected the most recent tweets (up to 200 per user) with the API function api.user timeline, ending up with a collection of almost 1 million tweets. Every tweet is a string of at most 280 characters. It is clear that we have to handle a very large quantity of data and face some memory constraints, together with a high computational time to extract all the features. This leads us to pay attention to the optimization of the code.

In any case, we go through every tweet of every single user and follow the procedure described below:

• Select one user;

• Go through every tweet and count:
  1. the number of retweets
  2. the length of the tweets
  3. the number of urls
  4. the number of hashtags
  5. the number of quotes (i.e. mentions, see table 8)

• Divide by the total number of tweets and obtain the features listed in table 8 (a minimal sketch of this computation follows the table).

Feature name | Type | Description
RT percentage | float | the number of retweets over the number of total tweets
avg tweet length | float | the average length of a tweet
avg url | float | the average number of links per tweet
avg hashtags | float | the average number of hashtags per tweet
avg quotes | float | the average number of mentions per tweet

Table 8: 1st set of Content Features
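The sketch below illustrates this counting procedure, assuming each tweet is available as a dictionary in the Twitter API JSON format (with a retweeted_status field marking retweets and an entities field listing urls, hashtags and user mentions); the function name is illustrative.

```python
# Illustrative sketch of the first set of content features; tweets are assumed
# to be dictionaries in the Twitter API JSON format.
def content_features(tweets):
    n = len(tweets)
    n_retweets = sum(1 for t in tweets if "retweeted_status" in t)
    total_length = sum(len(t["text"]) for t in tweets)
    total_urls = sum(len(t["entities"]["urls"]) for t in tweets)
    total_hashtags = sum(len(t["entities"]["hashtags"]) for t in tweets)
    total_mentions = sum(len(t["entities"]["user_mentions"]) for t in tweets)

    return {
        "RT_percentage": n_retweets / n,
        "avg_tweet_length": total_length / n,
        "avg_url": total_urls / n,
        "avg_hashtags": total_hashtags / n,
        "avg_quotes": total_mentions / n,  # "quotes" = mentions, as in Table 8
    }
```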

Since we want to preserve what we were trying to capture with the graph embedding approach, namely information about the links between the account and the users it interacts with, we extract a second set of features. This time we take into account whether the user has already retweeted/mentioned a specific user or used a specific hashtag before, by not counting it again if it has appeared more than once. The resulting features are the following:

Feature name | Type | Description
unique RT perc | float | the number of unique retweets over the number of total tweets
avg unique men | float | the average number of unique mentions per tweet
avg unique hash | float | the average number of unique hashtags per tweet
avg unique url | float | the average number of unique links per tweet

Table 9: 2nd set of Content Features

So now we have better information about the user's behavior. For the sake of completeness, we also decided to craft two additional relative features that give a synthetic view of both sets of content features:


diff_men = avg_unique_men / avg_quotes ∈ [0, 1]

diff_hash = avg_unique_hash / avg_hashtags ∈ [0, 1]

(4)

These features indicate how much a user mentions the same users or uses the same hashtags. Note that, when the denominator is zero, the value is set to null.
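A minimal sketch of the unique mention/hashtag counts and of the two relative features in equation (4) follows, again assuming tweets as dictionaries in the Twitter API format; the function name is illustrative.

```python
# Illustrative sketch of the relative features of equation (4).
def relative_features(tweets):
    n = len(tweets)
    all_men = [m["id"] for t in tweets for m in t["entities"]["user_mentions"]]
    all_hash = [h["text"].lower() for t in tweets for h in t["entities"]["hashtags"]]

    avg_quotes = len(all_men) / n            # mentions per tweet (Table 8)
    avg_unique_men = len(set(all_men)) / n   # never-seen-before mentions per tweet (Table 9)
    avg_hashtags = len(all_hash) / n
    avg_unique_hash = len(set(all_hash)) / n

    # ratios in [0, 1]; set to None (null) when the denominator is zero
    diff_men = avg_unique_men / avg_quotes if avg_quotes > 0 else None
    diff_hash = avg_unique_hash / avg_hashtags if avg_hashtags > 0 else None
    return {"diff_men": diff_men, "diff_hash": diff_hash}
```

With this definition, an account that always mentions the same user obtains a diff_men close to 0, while an account mentioning a different user in every tweet obtains a value close to 1, matching the two extreme situations listed below.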

We handcraft these features because we expect a certain kind of bot to retweet or mention the same people over and over, or to keep using the same hashtags. On the other hand, other bots could randomly mention or retweet users in order to reach more people.

The two extreme situations are:

• random mentioning, retweeting, hashtags ⇒ diff_men → 1, diff_hash → 1, unique_RT_perc → 1

• specific mentioning, retweeting, hashtags ⇒ diff_men → 0, diff_hash → 0, unique_RT_perc → 0

The next step is to go deeper and understand the characteristics of the dataset we have collected.


5 Exploratory Data Analysis

5.1 Preprocessing

Before exploring the dataset, we need a preliminary step to clean the data and prepare them for the analyses, as well as to make them more readable.

5.1.1 Profile Feature Preprocessing

The preprocessing applied is described below:

• location (string) → have location (boolean)
• url (string) → have url (boolean)

• from description (string) we extracted two other features:

  1. have description (boolean), indicating whether or not the user has a description
  2. len description (int), indicating the length of the description (0 if have description is false)

  and eliminated the former one

• from listed count (int) we extracted in list (boolean)

• from created at (string) we extract account age (int) and then delete the former one. With further analysis we decided to discard account age as well, since it is a biased feature. Indeed, the works are collected in the present but refer to different years, so older works have accounts with a greater account age, and this does not help in telling humans and bots apart

• profile background color → deleted because useless for the analysis
• profile background image url → deleted because useless for the analysis
• profile image url → deleted because useless for the analysis
• profile text color → deleted because useless for the analysis
• profile link color → deleted because useless for the analysis


• protected → deleted because useless for the analysis
• verified → deleted because useless for the analysis

Concerning f_ratio = followers_count / friends_count, we decided to impute the null values, namely those for which the denominator is equal to 0, with the median. There are 1249 such values, about 2.6% of the dataset.
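A minimal pandas sketch of these transformations is shown below, assuming the raw profile features are stored in a DataFrame `df` with the listed column names (underscored); it is illustrative, not the thesis code.

```python
# Illustrative sketch of the profile feature preprocessing described above.
import numpy as np
import pandas as pd

def preprocess_profile(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["have_location"] = df["location"].notna() & (df["location"] != "")
    df["have_url"] = df["url"].notna()
    df["have_description"] = df["description"].notna() & (df["description"] != "")
    df["len_description"] = df["description"].fillna("").str.len()
    df["in_list"] = df["listed_count"] > 0

    # followers/friends ratio; a zero denominator yields null, imputed with the median
    df["f_ratio"] = df["followers_count"] / df["friends_count"].replace(0, np.nan)
    df["f_ratio"] = df["f_ratio"].fillna(df["f_ratio"].median())

    # columns judged useless or biased for the analysis
    # (created_at is dropped here; account_age was computed from it and later discarded)
    drop_cols = ["location", "url", "description", "created_at",
                 "profile_background_color", "profile_background_image_url",
                 "profile_image_url", "profile_text_color", "profile_link_color",
                 "protected", "verified"]
    return df.drop(columns=drop_cols)
```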

5.1.2 Behavioral Features Preprocessing

These features are summarized in table 7 and equations 3. Since we distinguish between the time series of tweets and that of retweets, a user with only tweets or only retweets has null values in, respectively, the retweet or the tweet features. This is why we impute these null values with 0: if a user has, for instance, 0 retweets, then the retweet frequency is 0.
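A corresponding minimal sketch, assuming a DataFrame `df` with the frequency column names used above:

```python
# Illustrative sketch of the behavioral feature imputation.
import pandas as pd

def impute_behavioral(df: pd.DataFrame) -> pd.DataFrame:
    # users with only tweets (or only retweets) have a null frequency for the
    # missing activity, which we replace with 0
    freq_cols = ["tweet_per_hour", "retweet_per_hour"]
    df[freq_cols] = df[freq_cols].fillna(0)
    return df
```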

5.2 Data Exploration

After all the preprocessing, we have a clean dataset of 48465 rows and 40 columns, 28 numerical features and 12 categorical ones, where each row carries a label: 1 if it is a bot and 0 if it is a genuine account.

5.2.1 Basic Features Exploration

To begin with, we analyze some basic features to see whether we can visualize interesting differences between the human and bot accounts. For example, the boxplot of the favourites count feature in figure 10 shows that bots (1) are less likely to give likes to other users. Indeed, favourites count is distributed much more tightly around 0, whilst for humans it has more variability and a higher median.

This is somewhat expected, because the “like” is a very human action: it is less likely to be a focus of bots, since it is not a strong and visible interaction, and they usually prefer retweets or mentions. Numerically, the median is 808 for humans and 3 for bots.

A very similar behavior is indeed shown in figure 11 by the statuses count feature, namely the total number of tweets posted in the history of the user.


Figure 10: Boxplot of favourites count

Figure 11: Boxplot of statuses count

This time the reason could be that the genuine accounts (0) live longer and so they post much more over their history than the bots do.

On the other hand, an example of a feature that is almost indistinguishable between the two categories is the number of lists the user is part of, namely the listed count feature shown in figure 12:

Indeed, here you can notice how similar the two cumulative density functions are, which makes this feature weak information for telling the two categories apart.


Figure 12: Cumulative Density Function of listed count

We therefore move on to the features that are more interaction-specific.

5.2.2 Retweet Features Exploration

Concerning the retweet interaction, we analyze the median time between one retweet and the next (namely the median time retweet feature), because this metric is robust with respect to outliers, which makes it more appealing for the analysis.


Figure 13 makes it straightforward that the distribution of the median retweet time is much more concentrated around zero for bot accounts than for genuine ones. This means that the retweeting activity is much more frequent for algorithmically-driven accounts than for human ones, as we expected. Indeed, bots that are focused on retweeting are often programmed to retweet several tweets within a small time span, and that is why we observe such a behavior. Numerically, the median of the distribution is 53.25 seconds for bots and 20155.5 seconds for humans.

Moreover, we show in figure 14 the kernel density function of the unique RT percentage feature. This feature tells how often (in percentage) a user retweets the same accounts: it is closer to 0 if he is retweeting the same people over and over, and closer to 1 if he is retweeting different users.

Figure 14: Kernel Density Function of unique RT percentage

This figure is very interesting because it shows the differences in distribution between bot and human accounts, as well as giving us information on the different kinds of retweeting activity within the bot category. Indeed, different kinds of algorithmically-driven accounts coexist in this group.

From the estimated density in figure 14 we can notice how the genuine accounts (blue line) have a peak at 0, after which the density grows monotonically until it reaches two maxima around 0.8 and 1.0. This means that


a group of humans does not retweet at all, while the modes are close to one, so the majority of them do not retweet the same users over and over. The median of the human distribution is indeed about 0.72.

Regarding the bot category instead, it has a more spiked distribution. Bots have their highest maximum at 0, much higher than the human one, meaning that a bigger group of bot accounts do not retweet at all. The distribution then increases more smoothly than the human one, reaching a second peak at 0.75 and decreasing again towards a third peak at 1. It becomes evident that bots are typically very specialized in their different activities and not as diverse as human users. This suggests that a binary classification may be too general to capture all the different specialized behaviors inside the bot category; later we will go deeper into this group to see whether we can derive finer labels from the binary one, in order to perform a more precise multiclass classification.

A similarly spiked behavior for the bot category appears if we take into account the RT percentage feature, namely the percentage of retweets over the total tweets, shown in figure 15.

Figure 15: Kernel Density Function of RT percentage

The density of the human group, indeed, is much smoother, while the bots present high peaks around 0 and 1, confirming the need for a more specific insight into the category in order to split the different behaviors.
