
Master's Degree in Computer Science

Data and Knowledge: Science and Technologies

MASTER'S THESIS

USER BEHAVIOUR ANALYSIS AND BOT DETECTION

IN THE STEEM BLOCKCHAIN

SUPERVISORS

Dr. Barbara GUIDI

Dr. Andrea MICHIENZI

Candidate

Carmine CASERIO

Academic year 2021


Contents

1 Introduction
2 State of the art
2.1 The Blockchain technology
2.1.1 Consensus algorithms
2.1.2 Types of blockchain
2.2 Blockchain for Social Media
2.3 Supervised vs. Unsupervised Learning
2.3.1 Supervised learning
2.3.2 Unsupervised learning
2.3.2.1 Clustering approaches
2.4 Bot detection in social environments
2.4.1 Twitter studies
2.4.2 Reddit studies
2.4.3 Facebook studies
2.4.4 Summary
3 Steem and SteemIt: an overview
3.1 Steem: a blockchain for social media
3.1.1 Cryptocurrencies
3.1.2 Rewarding
3.1.3 Users' roles
3.2 SteemIt
3.2.1 Reputation classes
3.2.2 Social users' roles within SteemIt
4.2 Analysis of Users and Bots behaviour
4.2.1 Differences among bots and users' behaviours
4.2.2 Behavioural differences among the reputation classes
4.2.3 Bots' behavioural differences through textual analysis
4.3 Bot detection
5 Implementation and tools
5.1 Blockchain and user structure's parsing
5.1.1 User structure's parsing
5.1.2 Blockchain parsing
5.1.3 Saving procedure
5.2 Behavioural analysis
5.2.1 Behavioural differences
5.2.2 Comments filtering
5.2.3 Textual analysis and word cloud creation
5.2.4 Word2Vec model - choice of parameters
5.3 Bot Detection task
6 Behavioural analysis over bots & users and reputation classes
6.1 Reputation class-based Statistics
6.1.1 Reputation classes: a statistic overview
6.1.2 Most frequently used tags in the network
6.1.2.1 Plankton users' most frequently used tags
6.1.2.2 Minnow users' most frequently used tags
6.1.2.3 Dolphin users' most frequently used tags
6.1.2.4 Shark users' most frequently used tags
6.1.2.5 Whale users' most frequently used tags
6.1.2.6 Summary
6.1.3 Posts and comments from reputation classes
6.1.4 Voters from reputation classes
6.1.5 Votes from reputation class to reputation class
6.1.5.1 Plankton received votes
6.1.5.2 Minnow received votes
6.1.6 Follows from reputation classes to reputation classes
6.1.6.1 Plankton followers
6.1.6.2 Minnow followers
6.1.6.3 Dolphin followers
6.1.6.4 Shark followers
6.1.6.5 Whale followers
6.1.7 Reblogs from reputation classes to reputation classes
6.1.7.1 Plankton reblogs obtained
6.1.7.2 Minnow reblogs obtained
6.1.7.3 Dolphin reblogs obtained
6.1.7.4 Shark reblogs obtained
6.1.7.5 Whale reblogs obtained
6.1.8 Transfers from reputation classes to reputation classes
6.1.8.1 Transfers with Plankton senders
6.1.8.2 Transfers with Minnow senders
6.1.8.3 Transfers with Dolphin senders
6.1.8.4 Transfers with Shark senders
6.1.8.5 Transfers with Whale senders
6.1.9 Summary
6.2 Users & bots' presence in the Blockchain data
6.2.1 Users and bots' statistics' analysis
6.2.1.1 Users and bots' analysis over the vote operation
6.2.1.2 Users and bots' analysis over the comment operation
6.2.1.3 Users and bots' analysis over the follow operation
6.2.1.4 Users and bots' analysis over the reblog operation
6.2.1.5 Users and bots' analysis over the transfer operation
6.2.1.6 Users and bots' access applications
6.2.2 Summary
7 Bot detection through the analysis of the social activity
7.1 Word embedding application
7.1.1 Word2Vec results
7.2 Clustering models application
7.2.2.1 k-means over all the features
7.2.2.2 k-means: the importance of the no_reblogs attribute
7.2.2.3 k-means: the importance of the no_posts attribute
7.2.2.4 k-means: the combination of the no_following, the no_reblogs, and the count_transfers_steem attributes
7.2.2.5 k-means: the combination of no_posts, no_following, and no_reblogs attributes
7.2.2.6 k-means: the combination of count_votes, no_reblogs, count_transfers_sbd, and count_transfers_steem attributes
7.2.2.7 k-means: the combination of count_votes, no_comments, count_transfers_sbd, and count_transfers_steem attributes
7.2.2.8 k-means: the combination of no_following and no_reblogs attributes
7.2.2.9 k-means: the combination of the count_votes and no_comments attributes
7.2.2.10 Summary
8 Conclusions and Future works


List of Figures

2.1 Example of header and body of generic blocks
2.2 Different types of blockchain available, presented in [1]
3.1 Different reputation classes of the network
4.1 Workflow
4.2 Parsing phase over blockchain
4.3 k-means functioning
5.1 Data available from a previous work
5.2 Word2Vec Workflow
5.3 Word2Vec architectures
6.1 Frequency of users' vote operation
6.2 Frequency of bots' vote operation
6.3 Frequency of users' post operation
6.4 Frequency of bots' post operation
6.5 Frequency of users' comment operation
6.6 Frequency of bots' comment operation
6.7 Frequency of users' follow operation with users' receipt
6.8 Frequency of users' follow operation with bots' receipt
6.9 Frequency of bots' follow operation with users' receipt
6.10 Frequency of bots' follow operation with bots' receipt
6.11 Frequency of users' reblog operation with users' receipt
6.12 Frequency of bots' reblog operation with users' receipt
6.13 Frequency of users' reblog operation with bots' receipt
6.14 Frequency of bots' reblog operation with bots' receipt
6.15 Frequency of users' transfer operation with users' receipt
7.1 Results from the application of the wordcloud package
7.2 Pearson correlation matrix computed over all the columns of the bot dataset
7.3 PCA variance ratio of the fields in the bot dataset
7.4 Pearson correlation matrix computed over the selected columns of the bot dataset
7.5 Error values obtained after k-means application by varying the number of clusters on a set of users used for the choice of the correct hyperparameter k
7.6 k-means application over the #following and #reblogs
7.7 k-means application over the attributes no_comments and count_votes


List of Tables

2.1 The most popular consensus mechanisms among blockchains
5.1 Fields saved in the votes file
5.2 Fields saved in the comments file
5.3 Fields within the comments file
5.4 Fields within the comments file
5.5 Fields within the comments file
6.1 Cardinality of reputation classes
6.2 Most common tags over all the reputation classes
6.3 Most common tags over the Plankton reputation class
6.4 Most common tags over the Minnow reputation class
6.5 Most common tags over the Dolphin reputation class
6.6 Most common tags over the Shark reputation class
6.7 Most common tags over the Whale reputation class
6.8 Post operation
6.9 Comment operation
6.10 Votes operation, overall and per user
6.11 Votes obtained by Plankton users from the other reputation classes
6.12 Average weights per vote obtained by Plankton users
6.13 Votes obtained by Minnow users from the other reputation classes
6.14 Average weights per vote obtained by Minnow users
6.15 Votes obtained by Dolphin users from the other reputation classes
6.16 Average weights per vote obtained by Dolphin users
6.17 Votes obtained by Shark users from the other reputation classes
6.18 Votes obtained by Whale users from the other reputation classes
6.19 Followers of Plankton users
6.20 Followers of Minnow users
6.24 Rebloggers of Plankton users
6.25 Rebloggers of Minnow users
6.26 Rebloggers of Dolphin users
6.27 Rebloggers of Shark users
6.28 Rebloggers of Whale users
6.29 Transfers sent by Plankton users
6.30 Transfers received by Plankton users, on average per single user
6.31 Transfers sent by Minnow users
6.32 Transfers received by Minnow users, on average per single user
6.33 Transfers sent by Dolphin users
6.34 Transfers sent by Dolphin users, on average per single user
6.35 Transfers sent by Dolphin users, on average per single transfer
6.36 Transfers sent by Shark users
6.37 Transfers sent by Shark users, on average per single user
6.38 Transfers received by Whale users
6.39 Transfers sent by Whale users, on average per single user
6.40 Transfers sent by Whale users, on average per single transfer
6.41 Number of vote operations
6.42 Weights on vote operations
6.43 Number of post operations
6.44 Number of comment operations
6.45 Number of follow operations
6.46 Number of follow operations from and to {user, bot} entities
6.47 Average number of follow operations from and to {user, bot} entities
6.48 Number of reblog operations
6.49 Number of reblog operations from and to {user, bot} entities
6.50 Average number of reblog operations from and to {user, bot} entities
6.51 Number of transfer operations
6.52 Average number of times in which users and bots have been receiver
6.53 Number of transfer operations from and to {user, bot} entities
6.54 Amount on the basis of the transfer operations
6.55 Average amount per user and bot obtained in the role of receiver
6.56 Average price for the bot service
6.58 Top-20 preferred applications for standard users
7.1 Centroids' values over all the attributes
7.2 Closest users to the corresponding cluster over all the attributes
7.3 Closest users to the corresponding cluster over all the attributes
7.4 Centroids' values over all the attributes, but the no_reblogs
7.5 Closest users to the corresponding cluster over all the attributes, but the no_reblogs
7.6 Closest users to the corresponding cluster over all the attributes, but the no_reblogs
7.7 Centroids' values over all the attributes, but the no_posts
7.8 Closest users to the corresponding cluster over all the attributes, but the no_posts
7.9 Closest users to the corresponding cluster over all the attributes, but the no_posts
7.10 Centroids' values over the #following, #reblogs and #transfers
7.11 Closest users to the corresponding cluster over the #following, #reblogs and #transfers
7.12 Closest users to the corresponding cluster over the #following, #reblogs and #transfers
7.13 Centroids' values over the #posts, #following and #reblogs
7.14 Closest users to the corresponding cluster over the #posts, #following and #reblogs
7.15 Closest users to the corresponding cluster over the #posts, #following and #reblogs
7.16 Centroids' values over the #votes, #reblogs, #transfers_sbd and #transfers_steem
7.17 Closest users to the corresponding cluster over the #votes, #reblogs, #transfers_sbd and #transfers_steem
7.18 Closest users to the corresponding cluster over the #votes, #reblogs, #transfers_sbd and #transfers_steem
7.19 Centroids' values over the #votes, #comments, #transfers_sbd and #transfers_steem
7.21 Closest users to the corresponding cluster over the #votes, #comments, #transfers_sbd and #transfers_steem
7.22 Centroids' values over the #following and #reblogs
7.23 Closest users to the corresponding cluster over the #following and #reblogs
7.24 Closest users to the corresponding cluster over the #following and #reblogs
7.25 Classification of users
7.26 Centroids' values over the #votes and #comments
7.27 Closest users to the corresponding cluster over the #votes and #comments
7.28 Closest users to the corresponding cluster over the #votes and #comments


1 Introduction

Online Social Networks [2] (OSNs), such as Facebook, Instagram, and Twitter, have permeated the daily life of many people around the world. Today, they represent the most used web applications, having reached about 3 billion users overall. OSNs are platforms specifically designed for users, where people are encouraged to interact, consume, generate, and share content with other users by establishing connections. Nowadays, dozens of OSNs exist and, although different, they all focus on making users cooperate and interact.

From the technical point of view, OSNs have a centralised structure, following a client-server architecture. Centralisation has been considered one of the main sources of their issues. Indeed, they are exposed to problems and criticisms, such as the privacy of the data published by their users. The major scandal which involved Facebook, one of the most used OSNs, is the Cambridge Analytica scandal. About 87 million Facebook users used an application (approved by Facebook) which was able to collect the profiles of users and their friends. The data were delivered to Cambridge Analytica, which analysed them for political goals. This is one of the main examples of privacy disclosure, but it is not the only problem. Indeed, there is the possibility of server failures, and consequently of data loss, but also the scalability of the architecture, which poses challenges concerning how to make the service available in real time all over the world. Furthermore, another important issue concerns censorship. Facebook has been banned in some countries, such as China, Tunisia, and Iran, to mention only a few cases.

To overcome these kinds of problems, Decentralised Online Social Networks (DOSNs) have been proposed as alternatives. They are Online Social Networks that lie on a distributed platform, such as a peer-to-peer system. DOSNs can achieve better results in certain respects. Indeed, privacy can be more easily achieved through cryptographic hashing techniques; besides, anonymity is guaranteed in all these platforms, since user information is not linked to personal details or e-mail addresses. Additionally, scalability can be guaranteed given the nature of the peer-to-peer systems on which the platforms rely.

During the last few years, there has been a constant evolution of decentralisation techniques, which has increasingly involved the blockchain and the introduction of external file systems used to store and index data (such as IPFS [3]). In particular, the Blockchain technology has been considered in several research fields, and it has been widely used to improve the utility of Social Media and to solve the issues of DOSNs, giving birth to Blockchain-based Social Media (BOSMs) [4–6].

A blockchain is essentially a public distributed ledger of records that are shared among the participating parties, and it can be described as a chain of blocks. The first major and most successful application of the Blockchain technology is Bitcoin [7]. The other major application is Ethereum [8], launched in 2015, upon which users are able to define and use smart contracts, which are pieces of code describing self-executing contracts with agreement terms between buyer and seller.

Several BOSM applications have been proposed [4]. The most famous one is SteemIt [9], which has surpassed one million users. The main motivation shared by all these proposals is the need to give value to generated content and to provide a way to reward the content creator. This is the main difference with respect to the most popular centralised OSNs such as Facebook, Twitter, or Instagram. Indeed, while in all these platforms the exploitation and commercialisation of personal data is the focal point of their business, in BOSMs the main goal is to reward users for their social activity over the network. Social activity includes publishing a post, liking, sharing, and so on; for all these kinds of activities, users receive a reward.

The introduction of the rewarding process seems to be changing the actual behaviour of the entire social platform, as happens in SteemIt, with the introduction of social roles that behave differently from those of the most known OSNs (for example, bots are used to obtain more tokens in the BOSM scenario, instead of having promotional aims as happens in OSNs). The real benefits of the introduction of the blockchain in social environments are still unclear, at least considering how the reward affects the social activity.


It is also unclear which OSN issues Blockchain-based Social Media can help to handle, and for which issues a solution has not been found yet. One of these problems is the spreading of fake news, that is, the diffusion of untruthful information to the users of the Social Network. There are many examples of situations where this problem has arisen. One that we might mention is the so-called Pizzagate, which went viral on Twitter during the 2016 United States presidential election cycle and involved the candidate Hillary Clinton.

The BOSM scenario can help solve the fake news problem because the reward given to the content published by users is proportional to the quality of that content; as a consequence, fake news cannot circulate within the network for long, since it will be considered low-quality content by the users.

The environment in which this thesis has been carried out is the SteemIt social network, built over the Steem blockchain. The main reason is that SteemIt is considered the father of BOSMs and is taken as a reference in the development of new BOSMs. It is one of the most used platforms and its data are publicly available, making it a good case study.

The peculiarity of this social network is a rewarding mechanism that evaluates the content produced by users in terms of how much social activity that content generates. The better a piece of content is evaluated, the more reward is granted to its creator. The reward is usually granted in the form of cryptocurrency. This means that when a user publishes content within the network, for every vote, comment, or share that it receives, the user will gain an amount of cryptocurrency. Furthermore, within this social network users are categorised into different reputation classes depending on their wealth. Users can behave in different ways within the network, by producing content, interacting with other users, etc. Very often, however, the implicit purpose of the users that belong to the network is to exploit the rewarding system of the SteemIt platform, obtaining rewards from their activities. As a side effect, they reach better reputation classes and may exploit features not available to the initial ones.

However, there is a specific kind of user, called a bot, that, with the aim of helping newly registered users or those users that want to reach higher reputation class levels, provides a fee-based service of voting, commenting, or sharing. By doing so, bots provide a visibility boost to the user's content, because in this way the content will be more easily reachable by the other users to which the bots are connected. Therefore, by implementing this kind of service, users are stimulated to get in touch with a bot and publish low-quality content with the sole aim of being rewarded by the other users to which the bot is linked.

For this reason, these bots and the users that get in touch with them exploit a workaround to take advantage of the Steem rewarding system in order to achieve an easy reward. Several SteemIt bots have been publicly identified [10, 11]; however, many are still unknown, and the behaviour of these bots, which would be useful for building a bot detection method in SteemIt, has not been studied before.

The aim of our work is to enrich the knowledge concerning the users of BOSMs. The rationale behind our idea is that users tend to behave differently compared to OSNs because of the rewarding system. While revolutionary, this idea has the potential to condition the genuine social relationships that form in the counterparts that do not include a rewarding system. Attracted by the idea of easy rewards, users can be prone to trying to cheat the system in several ways. One such possible way is the usage of services, usually run by bots, to increase their visibility and reward. Moreover, since bots are naturally part of this ecosystem, they have access to the rewarding system as well. And while on one hand human users are "limited" in their actions by their nature, bots can instead be continuously active, thus being much more present in the platform. This gives them a big advantage, and has major consequences in a platform where the richest are the most powerful. Therefore, we perform a comprehensive set of analyses to understand what the typical behaviour of users, divided by their reputation classes, and of bots is. This set of analyses uncovers interactions among the users that are hidden in the general view of the activity. Moreover, the analyses are preparatory for the design and development of a bot detection mechanism based on the important features we identified during the behavioural analysis phase.

In detail, the thesis work is divided into three main steps:

• parsing the data, retrieved using the official APIs, and organising the data in convenient ways for subsequent analysis (a minimal parsing sketch is given after this list). The SteemIt users' data, which is publicly available, is mainly contained in the blockchain in the form of transactions; however, since most of our study revolves around the analysis of the activity performed by single users or groups of users, we need to organise the data according to our aim. We also point out that, despite the Steem blockchain being made to natively support social applications like SteemIt, it is not true that every social action that a user can perform in the platform is mapped to one transaction, and sometimes one transaction type corresponds to several social actions. As an example, some trickery was needed to discern between the cases in which a transaction corresponds to a new post and those in which it corresponds to a comment to an existing post. Moreover, data is extremely redundant in the blockchain, up to the point where, without any aggregation technique, the data cannot be handled without proper computational and storage resources. The parsing procedure ended up generating several JSON files, such that the information needed for the subsequent steps was already available;

• carrying out different behavioural analyses over the sets of bots and users. In order to accomplish this, three kinds of sub-analyses have been carried out. The first step of our analysis consists of highlighting the behaviour of users and bots, with the aim of laying the foundations for a bot detection system in SteemIt. We study the characteristics of bots and users in order to outline the typical behaviours of these two different kinds of users. Two different paths have been taken to delineate the behaviour of the bots: we first consider the different amounts of operations performed by the well-known bots and by the users, and then we exploit the bots' comments in order to understand the typical words used by bots. We then study the characteristics of the various reputation classes into which the users are partitioned;

• applying a clustering method, crucial to perform the bot detection task, by analysing different sets of operations performed by each single user in the network, be it a standard user or a well-known bot. In order to discover bots, we exploit some well-known bots, found in lists online [10, 11]. We collected 206 bots, which are classified according to their interactions within the SteemIt network, by means of specifically chosen features.
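As a rough illustration of the first step, the following is a minimal sketch, not the actual thesis code, of how operations could be pulled from the chain and grouped into per-type JSON files. It assumes a public Steem RPC endpoint such as https://api.steemit.com exposing the condenser_api.get_block method; the block range, bucket names, and file layout are purely illustrative.

```python
import json
import requests

STEEM_RPC = "https://api.steemit.com"  # assumed public RPC endpoint


def get_block(block_num):
    """Fetch one block through the condenser_api.get_block JSON-RPC call."""
    payload = {
        "jsonrpc": "2.0",
        "method": "condenser_api.get_block",
        "params": [block_num],
        "id": 1,
    }
    return requests.post(STEEM_RPC, json=payload).json().get("result")


def bucket_operations(first_block, last_block):
    """Group operations by type so later analyses can read one file per type."""
    buckets = {}  # e.g. "vote", "comment", "transfer", "custom_json", ...
    for num in range(first_block, last_block + 1):
        block = get_block(num)
        if not block:
            continue
        for tx in block.get("transactions", []):
            # each operation is a [name, payload] pair in the condenser API
            for op_name, op_data in tx["operations"]:
                buckets.setdefault(op_name, []).append(op_data)
    return buckets


if __name__ == "__main__":
    ops = bucket_operations(30_000_000, 30_000_010)  # tiny illustrative range
    for name, entries in ops.items():
        with open(f"{name}.json", "w") as fp:
            json.dump(entries, fp)
```

In practice, as noted above, some transaction types map to several social actions and need further post-processing, and aggregation is required to keep the output size manageable.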

The rest of the thesis is organised as follows. In Chapter 2, we will provide the state of the art of the principal research topics exploited in this thesis. It comprehends an introduction to the blockchain technology and the depiction of some of the Blockchain-based Online Social Media available nowadays. Furthermore, we will describe the properties of the supervised and unsupervised learning approaches and, finally, we will present the related works about the bot detection task performed over different social scenarios, such as Twitter, Reddit, and Facebook. Chapter 3 contains the description of the scenario in which the thesis is situated, that is, the Steem blockchain and the SteemIt Social Network built upon it. We will focus on the characteristics of the blockchain and on the operations that a user can perform over the network, together with the reputation classes to which users belong. Furthermore, we will provide the classification of the social roles that users can play. In Chapter 4, we will present the principal aims of the thesis and its motivations. In Chapter 5, we focus on the implementation aspects and on the tools that have been used in the thesis work. Chapter 6 reports the results of the analysis performed over the differences between users and bots, and the analysis performed among the reputation classes of the SteemIt environment. Chapter 7 reports the results of the textual analysis performed over the bots' comments in order to find common words in their content, and explains the bot detection procedure in all its steps. Finally, Chapter 8 presents the conclusions and the expected future work.


2 State of the art

In this Chapter, we provide an overview of the state of the art of the principal research topics exploited in this thesis. We identify six main topics, and the rest of the Chapter is organised as follows. Section 2.1 gives a general overview of the properties and mechanisms regulating the blockchain data structure and the social media platforms based on the distributed ledger. In Section 2.2, we discuss the various aspects of the Blockchain-based Online Social Media (BOSMs). Section 2.3 offers a general overview of the classification methods and the differences between the supervised and unsupervised approaches. Finally, Section 2.4 presents important contributions related to the bot detection task, performed over different datasets (Reddit, Twitter, and Facebook).

2.1 The Blockchain technology

The Blockchain technology can be considered a milestone for a new type of Internet, where information is available to each of its users. It has been a turning point for handling every kind of service, because the digital information stored there is distributed and tamper-proof.

We could mention blockchain applications in a number of fields, from the economic one to public administration and governance, passing through the entertainment and social fields. The economic field has been the first area in which the blockchain technology has been applied. Indeed, Bitcoin [7] has been its first application and, since then, other blockchains have emerged, such as Ethereum [8], proposed as the "World's Computer".


The blockchain is a distributed, decentralised, and public ledger [12, 13], that is, a database consensually shared, replicated, and synchronised, which guarantees the coverage of security and privacy issues by making use of completely transparent transactions.

The properties held by the blockchain are inherited both from the underlying peer-to-peer network and from the cryptographic hash functions used for digital signatures [14].

The properties owned by the peer-to-peer network are described in the following. The peer-to-peer network is distributed and decentralised, since there is no central authority that owns the blockchain and controls it. Furthermore, all the information published on the peer-to-peer network is public and available to everyone; this means that censorship cannot be applied to the contents published by users. This may be a double-edged sword in all those countries where there is heavy control over social media websites by the government. The user is free to publish any content, and thus may spread information about government decisions, but is still reachable and identifiable through, e.g., an IP address, so the user may still be in jeopardy.

The properties owned by the digital signature algorithm, used primarily for signing messages digitally and secondarily for encryption purposes, are the following (a minimal signing and verification sketch is given after the list):

• the non-repudiability property of all the contents that users publish in the network, since the traceability of each transaction and the hashing mechanism that rules the insertion of each new block guarantee that all the data is tamper-proof. Once the data has been signed, the user that signed it cannot later deny having sent the message;

• the integrity property, with respect to all the published transactions, together with all the other kinds of operations performed by users: the hashes will not match if the data has been altered;

• the authenticity property of all the contents published in the blockchain, signed with the private key of the user and verified by the other users with the corresponding public key.
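To make these properties concrete, the following is a minimal signing and verification sketch in Python, using the cryptography package with an Ed25519 key pair. It is only illustrative: the key type, payload, and messages are assumptions, and Steem's actual signing scheme is not reproduced here.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Key pair: the private key signs, the public key lets anyone verify.
private_key = ed25519.Ed25519PrivateKey.generate()
public_key = private_key.public_key()

transaction = b'{"from": "alice", "to": "bob", "amount": "1.000 STEEM"}'

# Signing gives non-repudiability: only the private-key holder can produce this.
signature = private_key.sign(transaction)

# Verification gives authenticity and integrity: any change breaks the check.
try:
    public_key.verify(signature, transaction)
    print("signature valid")
except InvalidSignature:
    print("signature invalid")

# A tampered payload no longer verifies.
try:
    public_key.verify(signature, transaction.replace(b"1.000", b"9.000"))
except InvalidSignature:
    print("tampered payload rejected")
```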

The peer-to-peer related properties are just some of the properties held by the blockchain; indeed, other properties that blockchains own are:

• the data-availability property, related to all the contents in the blockchain, because once a block has been chained to the blockchain, it is immediately ready to be accessed by the whole network;

• the immutability property, reached because it is extremely difficult to tamper with or alter a block: data written to a blockchain can never be changed;

• the resiliency property, which should be guaranteed any time a node fails or there are network problems, such as high network latency, packet drops, etc.;

• the security property over the access to the blockchain, guaranteed by the usage of public-private key pairs;

• the auditability property in relation to all the operations performed by users, which become in this way transparent, including the financial ones. In this way, the amounts possessed by each user can be tracked, so there is no way to hide a user's profits and the corresponding source of revenue.

In more detail, the blockchain can be defined as a growing list of blocks, in which each block is chained to the previous one together with the timestamp and the transaction data. Each block contains a header, which includes all the meta information that may be useful for identifying the block and for chaining it into the ledger, and a body that includes all the transactions. An example showing how the described blocks are chained together is outlined in Figure 2.1.

Figure 2.1: Example of header and body of generic blocks

For ease of reference, each block within the blockchain is identified by a hash computed on the header of the block itself. Each block also references its previous block (called the parent block) through the hash of the previous block's header and saves this information in its own header. In this way, each block is chained to its previous one, and the sequence of blocks created through hashing implicitly refers back to the first block.


The transactions are the core part of a blockchain. Whatever the blockchain is, the transactions are crucial because they include all the operations that have been performed by the users of the network.

When an entity that participates in the blockchain wants to insert a new transaction in the ledger, the transaction has to be signed with a digital signing protocol and spread to all the users of the network. Once they have received the transaction, the users decide through a distributed consensus mechanism whether or not to add the transaction to a block, and then they add the block to the ledger.

Although a block can only have one parent, there are situations in which a block could have multiple children. When multiple children arise, there is a fork in the blockchain. It may happen for various reasons: in particular, it may arise when a blockchain diverges into two potential paths forward, as happens in Bitcoin. It could also be necessary when there is a change in the protocol, as well as when two blocks have the same height within the blockchain.

2.1.1 Consensus algorithms

Since the blockchain relies on a decentralised peer-to-peer network, with no established central authority, it is very hard for the nodes to take decisions, such as deciding which will be the next block of the blockchain. For this reason, a consensus mechanism is needed.

Nowadays, the consensus problem has been rediscovered, since the peer-to-peer environment presents the same problems that were present in the communication among different computers. Consensus is a fault-tolerant mechanism, reformulated in the blockchain environment in order to handle the inclusion of data in the blockchain. This mechanism is performed with the purpose of reaching an agreement between a selected set of users. Since the network is distributed, consensus is needed in order to establish which are the valid blocks.

Several kinds of consensus mechanisms have been proposed and are used in the different blockchain-based platforms [14].

Table 2.1: The most popular consensus mechanisms among blockchains

Consensus mechanism       Blockchain
Proof of Work             Bitcoin, Ethereum, Zerocash, Litecoin
Proof of Stake            Tezos, TRON, Nxt
Delegated Proof of Stake  Steem, BitShares, EOS, Lisk


The most popular consensus mechanisms used in current blockchains are introduced in Table 2.1, and are the Proof of Work, the Proof of Stake, and the Delegated Proof of Stake.

The Proof of Work is a consensus protocol used to prove that a user has performed some kind of work, hence the name. This can be related to the execution of many different tasks. A usual way to perform the Proof of Work, generally adopted by real operating systems, consists of computing the inverse of a cryptographic hash function. Such functions are used since they are difficult to invert: these so-called one-way functions map arbitrary-sized data to a fixed-size hash value. The only way to solve the problem, that is, given a value y, find an x such that the hash function h applied to x returns the given value y, is through a brute-force search.

Usually, in order to ensure a link between the previous block b_{x-1} and the current block b_x, the hash of the previous block, h(b_{x-1}), is inserted in the current block. Thus, when the same procedure is applied to the current block b_x, the subsequent block b_{x+1} will indirectly also contain h(b_{x-1}), thereby creating a chain of hashes.
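A toy sketch of the two ideas just described, linking blocks through the hash of the previous block and brute-forcing a nonce as proof of work, is given below in Python with hashlib. The difficulty, block contents, and header layout are invented for illustration and do not reflect any real blockchain.

```python
import hashlib
import json


def mine_block(previous_hash, transactions, difficulty=4):
    """Brute-force a nonce until the block hash starts with `difficulty` zeros."""
    nonce = 0
    while True:
        header = json.dumps({
            "previous_hash": previous_hash,   # h(b_{x-1}) stored inside block b_x
            "transactions": transactions,
            "nonce": nonce,
        }, sort_keys=True)
        block_hash = hashlib.sha256(header.encode()).hexdigest()
        if block_hash.startswith("0" * difficulty):
            return {"header": header, "hash": block_hash}
        nonce += 1  # the only available strategy is exhaustive search


# Build a tiny chain of three blocks: each block commits to the previous hash.
chain = []
prev = "0" * 64  # conventional placeholder for the genesis block's parent
for txs in (["alice->bob"], ["bob->carol"], ["carol->dave"]):
    block = mine_block(prev, txs)
    chain.append(block)
    prev = block["hash"]

# Verification is cheap: recompute one hash and check the leading zeros.
assert all(
    hashlib.sha256(b["header"].encode()).hexdigest() == b["hash"] for b in chain
)
```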

After a block is added to the chain, it becomes impossible to tamper with blocks because the ledger is distributed and can be viewed by all users, who continually update it and keep it synchronised. Indeed, the decentralised nature eliminates control by a single entity, and since every transaction is tracked, it becomes impossible to tamper with data [4].

This kind of consensus is used in Bitcoin for block generation, and the steps are performed by the miners, who, in order to compete in the block generation phase, must carry out a proof of work covering all the data in the block. This process is commonly very time-consuming and has a high computational cost, but is afterwards very easily verified.

Given the difficulty of the task, it is not predictable which worker will solve it and, hence, which one will be able to generate the next block.

The Proof of Stake (PoS) is another kind of consensus protocol. Its functioning depends on the participation of the network's nodes, called validators, that compete for the creation of a new block. The stake is based on some property of the participants, such as, for example, their wealth, so that in this case only those users that are richer than the others can participate.

The validators have to bet their stake and, at block creation time, one designated validator is chosen to create the new block. It then creates the block and loses its bet, while all the other participants are rewarded with the transaction fees present in the created block. In case a validator attempted to perform an attack, this would be discouraged, since it would mean putting the entire stake in danger, including its own. A variant of PoS is used in Ethereum [15].

The Delegated Proof of Stake, used in Steem, is a variant of PoS in which the participating users pool their stakes, represented by an amount of money that is blocked in their wallets and delegated to the pool's staking balance.

The network periodically selects a predefined number of top staking pools, that is, those with the highest staking balances, and allows them to validate transactions in order to get a reward. The rewards are then shared with the delegators, proportionally to their stakes in the pool.
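A small numeric sketch of this proportional sharing, with made-up stakes and reward values:

```python
# Delegators pool their stakes; the pool's reward is split proportionally.
stakes = {"alice": 500.0, "bob": 300.0, "carol": 200.0}  # delegated amounts
pool_reward = 10.0                                       # reward earned by the pool

total = sum(stakes.values())
shares = {name: pool_reward * stake / total for name, stake in stakes.items()}
print(shares)  # {'alice': 5.0, 'bob': 3.0, 'carol': 2.0}
```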

The network stabilises itself against possible attacks because an attacker would have to be voted in order to become a validator. This cannot easily happen because of the low probability of a vote being cast in favour of a malevolent participant.

2.1.2 Types of blockchain

In relation to the type of access that can be granted to the users of a blockchain, blockchains can be classified into four categories [1], as shown in Figure 2.2. There are: public and permissionless blockchains, which are open and transparent and offer disintermediation and anonymity; they are open to anyone. Public and permissioned blockchains allow anyone to read transactions, but only permissioned users can write transactions. In private and permissionless blockchains, everyone can issue transactions, but only a few selected nodes can read them. Finally, private and permissioned blockchains, which can also be called consortium blockchains, are accessible only by a set of identifiable entities.

Figure 2.2: Different types of blockchain available, presented in [1]

2.2 Blockchain for Social Media

The world of Online Social Media platforms is constantly growing. By means of these platforms, individuals, communities, and organisations can share, discuss, participate in, and modify user-generated content or self-curated content posted online. Networks formed in a Social Media environment change the way groups of people interact and communicate [4].

During the last ten years, Online Social Networks have increased their popularity and, at the same time, their issues. In particular, the privacy issue has become one of the principal motivations for studying an alternative. A well-known case is the Cambridge Analytica scandal2, which erupted in early March 2018, when the personal data of 87 million Facebook users were acquired through a Facebook application called "This Is Your Digital Life".

The main alternative concerned the usage of decentralised architectures. The first proposals were defined as Decentralised Online Social Networks, relying mainly on underlying peer-to-peer networks. Nowadays, a new generation of Social Media has been proposed, thanks to the support of the blockchain technology.

The peculiarity of Blockchain Online Social Media (BOSMs) is given by the internal rewarding system. While in typical Online Social Networks the primary goods are the users that belong to them, in these platforms the user is at the centre. The leading principle of these decentralised platforms is that users have to be rewarded for their activity on the network.

As described in [4], the main aim of all these platforms is to overcome the problems of current OSNs, in particular Facebook. We identify four common points which represent the main characteristics of these platforms:

2https://www.theguardian.com/news/series/cambridge-analytica-files


• No Single-Point of Failure. Current OSMs are centralised and suffer from all the centralisation-related issues, such as vulnerability to attacks, data availability, etc. Instead, Social Media platforms based on blockchain do not have a single point of failure, thanks to the decentralisation of data. Indeed, the decentralised nature eliminates the control by a single entity, and since every transaction is tracked, it becomes impossible to tamper with data;

• No Censorship. This property is very relevant in all those countries, such as China, North Korea, and Syria, where an active block of Social Media websites is present: citizens can be prevented by the government from accessing Social Media and other kinds of content. Decentralisation instead could lead to a solution to overcome the censorship issue;

• Rewards for Valuable Content. A content creator or a simple social media user can be rewarded for valuable content with cryptocurrency payments. Thanks to the blockchain, the rewarding phase is transparent, because transactions are tracked and can be audited by everyone. This represents one of the main points of a blockchain-based OSM, because rewarding is considered the key to success in giving value to content, building an economic model, etc. In particular, most of the current platforms take inspiration from the attention economy [16] and the token economy [17];

• Content Authenticity. People have been exposed to fake news, and current OSMs do not have specific solutions to face this problem. Instead, the usage of the blockchain technology is useful to treat this problem by using economic incentives to both rank and reward content.

All the Blockchain social media proposals are based on these four common points. In detail, the Single-Point of Failure problem is faced by exploiting the blockchain technology, which is decentralised. Thanks to the blockchain, the No Censorship problem is also faced: indeed, the immutability of the blockchain means total freedom from censorship, and people are free to share information. The other two points are strictly related to content. Content Authenticity is faced by introducing specific rewarding systems; instead, to evaluate the value of a content, specific mechanisms are proposed, such as the dislike button. Several BOSMs have been proposed. SteemIt is the most well-known Blockchain-based OSM, with more than 1 million users [6]. SteemIt is the first application built over the Steem blockchain. The structure of SteemIt resembles that of Medium4 and has some similarities also with Reddit5. Indeed, content fruition is open also to external users who are not members of the SteemIt community, and the published contents of a user are visible and can be commented on by every user in the network. In Chapter 3, we analyse this Social Network in detail.

Lit6 is a platform created to integrate social media services, similar to Instagram and SnapChat, with cryptocurrencies. The main feature of Lit is that users can share stories via Lit Stories, and their stories allow them to obtain Mithril tokens (MITH). The reward is provided on the basis of the impact and influence of these stories across the network. Stories are represented by any multimedia content of the user, that is, photos, slideshows, videos, posts, etc. [18].

Sapien is a social news platform with the principal aim of fighting fake news by giving users more control over their data. Instead of using Twitter, YouTube, Facebook, etc. for different forms of news and media, users can use Sapien for everything. Sapien offers users the possibility to choose which personal information they share and with whom. Moreover, users have the power to control the information they receive by tailoring the received news to their interests. One of the fundamental principles of the platform, in terms of fake news and valuable content, is to organise the first Democratized Autonomous Platform. This means that users are able to vote on proposals within a virtual community, facilitating democratic decisions at the community level. The Sapien platform is flexible and allows users to have a public or private identity. This means users can operate with their real identities or in anonymity whenever they want. Indeed, Sapien enables the storage of identities on the Blockchain for the purpose of identification. Sapien uses the Ethereum blockchain and introduces a new cryptocurrency, the SPN token. The rewarding system is based on the Proof of Value.

SocialX is another example of Blockchain-based Distributed Online Social Media. It is fully decentralised, as described in [19], and allows users to give feedback on content and reward it with tokens. Being fully decentralised means that all media (photos and videos) and data (messages, posts, etc.) are stored in a decentralised manner.

4 https://medium.com/
5 https://www.reddit.com/
6 https://mith.io/en-US/


2.3 Supervised vs. Unsupervised Learning

Machine learning is a wide field of study which consists in the execution of a task by a computer that has not been explicitly programmed to perform it. Its application is very relevant, since a machine learning model is able to find hidden structure within the provided data and, once it has been trained, to predict outcomes for new given data. A definition of the algorithms concerning the machine learning field has been provided by Thomas Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Therefore, a program can learn from the actions performed over some data if its performance in the task improves.

Together with the growth of the machine learning area, social-related data have increased greatly, as a direct consequence of the birth and spread of a large number of social media websites. Therefore, the interest of the scientific community in this study area has grown, aimed at building models for solving related tasks (such as prediction, recognition, planning, and analysis, even in uncertain scenarios) over data coming from these social networks.

When we talk about the different machine learning approaches we are able to use, we have to deal with the scenario in which they might be exploited: the initial dataset at our disposal is the main criterion for applying a supervised or unsupervised model. The choice between supervised and unsupervised models depends on the availability of the target field used for the training phase and for the subsequent test of the model accuracy. The two categories will be described in the rest of the Section. The learning process can be controlled by tuning the so-called hyper-parameters, by means of which it is possible to manage the learning capability of the model.

2.3.1 Supervised learning

This kind of learning approach consists in the application of an algorithm that takes a set of input variables, has a corresponding set of output variables, and needs to learn the mapping between input and output. This means that the dataset, for each entry, has significant values both for the input variables, traditionally known as x, and for the output ones that correspond to the target value, known as y, recalling the functional notation.

Once the model has been selected with certain hyper-parameters, the model is trained. Subsequently, the model can be validated over a set of already labelled data: in this process we validate the trained model in order to analyse its accuracy. Afterwards, the model resulting from the learning phase is able to predict on new data, historically called the test set.

The scenario is specified by the dataset. The dataset has different features representing its dimensionality, i.e., each feature represents a quality of an entry in the dataset. For this reason, the set of features of a dataset entry is called a feature vector, and the vector space associated with them is called the feature space.

In the large circle of supervised learning approaches, a further division is possible, based on the kind of task to perform: there exist models for handling classification tasks and models related to regression tasks.

• For the classification category, there are those models whose goal is to divide the dataset's entries into two or more classes. This procedure can be performed by taking into account that the purpose of the classification is to split a multidimensional dataset in order to classify each entry properly. In this process the model computes a certain number of hyper-planes H_1, ..., H_n where, for every i, H_i is a linear subspace of a vector space V such that, if V is an n-dimensional vector space, then H_i is an (n-1)-dimensional subspace. Every hyper-plane identifies a corresponding hyper-space on which the model performs the classification task. Proceeding in this way, the multidimensional dataset is split into different hyper-spaces, and every hyper-space is capable of uniquely identifying the entries belonging to each different class.

• For the regression category, there are those models whose purpose is to estimate the relationships between the dependent variable and the independent variables. The dependent variable (or outcome) is the one that depends on the input variables, while the independent variables are also known as predictors, features, or covariates.

The work that needs to be carried out in order to perform a regression task is to select a model that must be estimated and then use a chosen method to estimate the parameters of that model.

Among the supervised learning approaches, the models that can perform a regression task can also perform classification tasks, while the opposite is not true. This means that, while there exist models that can be used only for classification tasks, all those models that can be used for a regression task can also perform a classification one.

A possible example is the Support Vector Machine for Regression (SVR), which is a regression model and has a version for handling classification called Support Vector Machine8 (SVM). Another option is the Regression Tree, which is a model for regression whose corresponding classification model is called Decision Tree9. On the contrary, the Naïve Bayes Classifier10 is a model that is able to perform solely classification tasks and has no respective model for handling a regression task; or rather, one can be built [20], but the results show that it performs much worse than linear regression.
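As a minimal illustration of the supervised workflow described in this Section (train on labelled data, validate, then predict on a held-out test set), the sketch below uses scikit-learn on a synthetic dataset; the models, split sizes, and parameters are arbitrary choices, not those used later in this thesis.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic labelled dataset: X is the feature matrix, y the target labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out data for validation (model choice) and for the final test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

for model in (SVC(kernel="rbf"), DecisionTreeClassifier(max_depth=5)):
    model.fit(X_train, y_train)  # learn the mapping from inputs to outputs
    val_acc = accuracy_score(y_val, model.predict(X_val))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(val_acc, 3), round(test_acc, 3))
```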

2.3.2 Unsupervised learning

This kind of learning approach consists in modelling the underlying hidden structure or distribution of the dataset, which does not need any pre-existing label.

Differently from supervised learning, which makes use of labelled data, unsupervised learning models the probability densities over the inputs.

Based on the approach followed, the unsupervised learning methods include:

• clustering: the task of grouping the entries of the dataset according to certain criteria, by applying a metric transformation in order to have a way to evaluate the similarities. This approach will be presented in more detail in Section 2.3.2.1;

• anomaly detection: the identification of rare observations in the dataset which could be considered as outliers (a minimal sketch is given after this list). These entries differ significantly from the majority of the data and may be considered as noise, novelties, deviations, or exceptions. It finds its field of interest in:

– density-based techniques [21] (such as k-nearest neighbour), in which the aim is to find those elements that correspond to sparse points in the multidimensional dataset;

– one-class support vector machines [22], with which an element is classified either as belonging to the class or as an external point;

8 https://en.wikipedia.org/wiki/Support_vector_machine
9 https://en.wikipedia.org/wiki/Decision_tree_learning
10 https://en.wikipedia.org/wiki/Naive_Bayes_classifier


– replicator neural networks [23], such as autoencoders, variational autoencoders, LSTM neural networks, etc., in which the models try to replicate the most important features of the training data. When the model faces anomalies, its reconstruction performance worsens, since it has not learned how to precisely reconstruct the main features of such data;

– Bayesian networks [24], that is, probabilistic graphical models representing a set of variables and their conditional dependencies via a directed acyclic graph. Bayesian networks are ideal for understanding which is the contributing factor of an event that occurred, by taking into account several possible known causes;

– hidden Markov models [25], that is, statistical Markov models in which a system (assumed to be a Markov process) has an unobservable state. The assumption is to have another process y that depends on the Markov one, and the purpose is to learn about the Markov model by exploiting the outputs of y. The learning process, which relies on the conditional probability distribution of y, does not have to depend on the input data of the Markov process;

• neural networks [26]: one of the main and oldest families of models for learning from a given dataset. In the case of unsupervised learning, the tasks are generally estimation problems, while the applications include clustering, the estimation of statistical distributions, filtering, and compression.

A kind of model that relies on an artificial neural network is the Self-Organising Map [27] (SOM), trained in order to produce a low-dimensional representation of the training samples' input space, called a map, which is discretised. Therefore, it is a method to perform dimensionality reduction, and it obtains increasingly accurate results as the number of nodes is increased. This variable is actually a hyper-parameter, since the model's complexity strictly depends on this number.
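As an example of the anomaly detection techniques listed above, the following minimal sketch fits a one-class SVM with scikit-learn on synthetic two-dimensional data; the data and parameters are invented for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Mostly "normal" points around the origin, plus a few far-away outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# The one-class SVM learns the boundary of the normal region;
# predictions are +1 for inliers and -1 for anomalies.
detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X)
labels = detector.predict(X)
print("flagged as anomalies:", int((labels == -1).sum()))
```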

Besides, there are other approaches that are resistant to noisy or incorrect target values in the training dataset. These are the models for handling so-called classification with noisy labels [28], which may be useful in cases similar to our study. The problem they face is the handling of datasets where the labels used to train the machine learning models are incorrect.

We do not explain the mentioned models in detail, since we cannot exploit them in our scenario, but there are works such as [29], where the authors handle the problem of noise within the labels used for training by estimating the different labels in a proper way. The noisy datasets taken into account have 10, 20, and 30% of wrong labels.

2.3.2.1 Clustering approaches

Cluster analysis [30] is an important and well-known task in exploratory data mining and statistical data analysis, including information retrieval and pattern recognition. The most common clustering approaches present in the literature are the following:

• hierarchical clustering [31]. It is bound to the connectivity concept behind the dataset entries: some records in the dataset can be considered closer to others according to a specific metric distance. The iterated application of this process leads to clusters outlined by the two farthest entries that belong to the same cluster.

The procedure outlined by the model starts by considering each entry in the dataset as a separate cluster; afterwards, it repeatedly searches for and merges the two closest clusters. This procedure continues until a unique cluster is obtained. This means that at different distances different clusters will form, represented by means of a dendrogram, and the resulting partitioning of the dataset is not unique;

• k-means [32]. It is a centroid-based clustering algorithm. Each cluster is represented by a central vector, which is usually not an element of the dataset.

By fixing the number of clusters to a certain value k, k-means finds the k cluster centres and assigns the objects to them so that the squared distances from the cluster centres are minimised. The algorithm starts by taking k random points, called centroids; it then iteratively assigns each dataset entry to its closest centroid and, at the end of each pass, updates the centroids so that the error measure is minimised. This iterative procedure stops when no entry changes cluster any more or, if an early stopping condition has been provided, after the number of steps set by that parameter.

The main drawback is that the value of k, that is the number of clusters obtained after the execution, must be specified in advance, whereas other algorithms do not require it. Besides, since the problem is known to be NP-hard, approximate solutions are usually sought;

• mixture models [33]. A mixture model is a probabilistic model describing the presence, within the dataset's entries, of clusters of points that are close according to the chosen metric. In the literature, problems associated with mixture distributions usually aim at properties of the whole dataset; here, instead, mixture models are used to make statistical inferences about the properties of the sub-populations, without any information on the identity of each sub-population;

• DBSCAN [34]. It stands for Density-Based Spatial Clustering of Applications with Noise and is a density-based clustering algorithm. Given a set of points and a metric in the space defined by the dataset entries, it groups together points that are densely packed, that is, points with many neighbours. The closeness of two points is governed by a model parameter ε that works as a threshold: if two entries lie closer than ε, they are considered neighbours. Points that lie in sparser regions, whose nearest neighbours are too far away, are instead marked as outliers (a minimal sketch of k-means and DBSCAN on toy data is given after this list).
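
The sketch below illustrates two of the approaches listed above on toy data, using scikit-learn; the two-blob dataset, k = 2 and ε = 0.5 are illustrative assumptions, not values used in this thesis.

# Minimal sketch of k-means and DBSCAN on synthetic two-blob data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.6, random_state=0)

# k-means: k must be fixed in advance; centroids are refined iteratively.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", np.bincount(kmeans.labels_))

# DBSCAN: no k is required; eps plays the role of the neighbourhood
# threshold described above, and points in sparse regions get label -1.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN outliers:", np.sum(dbscan.labels_ == -1))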

2.4 Bot detection in social environments

The state of the art related to bot detection is wide and spans different scenarios. This research field is growing fast, and the datasets analysed are related to different kinds of social networks.

All the related works in this section rely on supervised learning methods, because all of them have ground-truth information about which accounts are users and which are bots. In all the mentioned works, the authors found a way to recognise (a considerable subset of) the bots before approaching the problem. They could hence retain only some of the standard users in their dataset, so as to work with a balanced dataset. In this way, they always have a means of identifying the bots and the standard users.

In the following we analyse the different works, grouped by the social network the authors exploited as their use case.


2.4.1 Twitter studies

Twitter is an American microblogging and social networking service on which users post and interact with short messages known as "tweets". Twitter has been the target of several bot detection studies.

The first work discussed related to bot detection is [35], where the authors face the problem in the Twitter environment by proposing a deep neural network approach. The model is based on a contextual long short-term memory (LSTM) architecture and makes use of both content and metadata to detect bots at the tweet level. Combining these two kinds of data had never been done in the context of social media classification, having primarily been applied to language models.

They found that relying solely on the textual features of the tweet is not a highly predictive approach for bot detection. For this reason, they decided to exploit additional features, such as account metadata, temporal activity patterns or network structure information, in order to yield more robust and accurate results. The dataset they used consists of a large number of users together with three sets of known bots: for the aim of the work, they use a mixture of the groups genuine accounts, social spambots #1, social spambots #2 and social spambots #3.

For the account-level classification task, the Random Forest and AdaBoost classifiers give the best results, with a misclassification rate below 2%; an improved version of the AdaBoost classifier reaches near-perfect accuracy: 99.81%.

For the tweet-level classification task, the authors show that, when considering only the tweets' text, the classifiers reach an accuracy of about 95.53%; by also taking the metadata into account, the model's accuracy increases to 96.33%.
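
As a rough illustration of the account-level setting just described (not the authors' actual pipeline), the sketch below trains the two classifiers mentioned above on a hypothetical table of account features; the feature semantics, the synthetic data and the labels are assumptions made only for the example.

# Illustrative account-level bot classification with Random Forest and AdaBoost;
# features, data and labels are synthetic placeholders, not the dataset of [35].
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-account features: followers, followees, tweets/day, account age.
X = rng.normal(size=(n, 4))
y = (X[:, 2] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)  # 1 = bot

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            AdaBoostClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, "accuracy:", accuracy_score(y_te, clf.predict(X_te)))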

Another innovative work, presented in [36], approaches the problem by analysing the novel semantic feature of the sentiment within the tweets of human and bot accounts.

The architecture, together with its set of algorithms, is called SentiBot and is able to extract features from the users' tweets, ranging from their sentiment to syntactic and semantic information. An example of a syntactic feature is the average number of hashtags in the tweets or the average length of a published content, while an example of a semantic feature is the recognition of the topic of discussion for the author of the tweet.

Besides, a user can also be characterised through their friend and follower networks, for instance by evaluating the degree of discord between a user and their followings (that is, the users followed by the user under focus).

The key result is that, out of the 25 features involved, 19 are sentiment related. This means that sentiment has a significant role in the bot detection task, and that taking the topics of interest into account is very important for identifying bots associated with a specific topic.

Their results show an improvement of the accuracy of up to 53%, together with some other useful behavioural differences between bots and humans:

• the sentiment of bots oscillates much less frequently than that of humans;

• the sentiment expressed by humans is stronger than the one expressed by bots;

• bots agree more often than humans with the general sentiment of the Twitter population.

In [37] the authors face the problem of recognising social bots in Twitter. They propose a simple taxonomy that divides the approaches proposed in the literature into three classes:

• bot detection systems based on social network information, which approach the problem as SybilRank [38] does. This approach relies on the innocent-by-association property [39], which assumes that the probability that two non-linked users interact is low. The problem with this approach is that Twitter specifically encourages this kind of interaction, which yields a high false positive rate; for this reason, complementary detection techniques have been defined to aid, together with the manual identification of legitimate users, the training of supervised learning algorithms;

• systems based on crowdsourcing and leveraging human intelligence, in which a number of workers is hired to manually classify a set of accounts. The recognition task is performed by these workers adhering to a majority voting protocol: the same profile is shown to multiple workers and, after their votes, the majority determines the final verdict. In this way a near-zero false positive rate has been reached;


• machine learning methods based on the identification of highly revealing features that discriminate between bots and humans, with a particular focus on behavioural patterns. These patterns can be encoded as features and fed to machine learning models in order to understand the behavioural differences between humans and bots; this, in turn, can be exploited to drive the classification of the accounts based on the behaviour observed.

The authors then combine these approaches in order to better understand the differences found, and the results shown are significant in terms of the z-scores obtained.

The results show that social bots produce fewer tweets, replies and mentions than their human counterparts. They retweet more than humans and are retweeted less. Besides, bots have longer user names, and their accounts also tend to be more recent than those of humans.

Another work related to the bot detection theme is [40], in which the author proposes a Bayesian classifier for recognising bots from the content the users published in the Twitter social network, paying particular attention to posts and comments containing URLs. The features taken into account are of two kinds:

• the following relations, which lead to a graph-based approach, taking into account that Twitter detects the presence of bots by comparing a user's number of followings with their number of followers. Three graph-based features have been considered, that is, the number of friends, of followers and of followings;

• the tweet content, which leads to a content-based approach, considering the twenty most recent tweets of each user. For each user, the author takes into account the Levenshtein distance among the various tweets, the presence of (possibly shortened) links, and the number of replies and mentions, since a spam account draws users' attention by sending undesired replies and mentions (a rough sketch of such features is given below, after the discussion of this work).

As for the dataset used to extract user information, the author collected the 20 most recent tweets per user and manually labelled each account as spam or non-spam. Since the dataset was too unbalanced (1% of the entries was considered spam, whereas this percentage should be around 3%, according to the author), he took advantage of part of the accounts reported as spam by Twitter users, reaching nearly 3%. The Bayesian classifier has been compared with SVM, neural network, decision tree and k-nearest neighbours classifiers, and it gives better results because it is robust to noise: precision, recall and F-measure are all equal to 0.917.
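
The sketch below gives a hedged illustration of the kind of content and graph features described above fed to a (naive) Bayesian classifier; the tweets, follower counts and labels are invented for illustration and do not reproduce the setup of [40].

# Toy content/graph features (average pairwise edit distance of recent tweets,
# number of links, follower and following counts) fed to a naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def account_features(tweets, n_followers, n_following):
    """Average pairwise edit distance of recent tweets plus two graph features."""
    pairs = [(a, b) for i, a in enumerate(tweets) for b in tweets[i + 1:]]
    avg_dist = np.mean([levenshtein(a, b) for a, b in pairs]) if pairs else 0.0
    n_links = sum("http" in t for t in tweets)
    return [avg_dist, n_links, n_followers, n_following]

# Two invented accounts: a repetitive link-spamming one and a normal one.
X = np.array([
    account_features(["buy now http://x", "buy now http://y"], 10, 900),
    account_features(["nice weather today", "watching the game tonight"], 300, 250),
])
y = np.array([1, 0])  # 1 = spam/bot, 0 = genuine

clf = GaussianNB().fit(X, y)
print(clf.predict(X))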

The bot detection problem has also been faced in [41], where the authors built a classification framework to separate bots from users. The case study is the Twitter social network, and the initial dataset has been built by collecting 200 public tweets from each user's timeline and up to 100 of the most recent tweets in which the user was mentioned.

The features used by the framework, either obtained through the Twitter API or computed afterwards, have been divided into six different categories:

• user-based features, which have been extracted directly from Twitter using its API;

• friends features, considering four kinds of links, that is, retweeting, being retweeted, mentioning, and being mentioned;

• network features, by constructing three types of networks, that is, retweet, mention, and hashtag co-occurrence networks;

• temporal features, for which they consider the average rates of tweets published over various time periods and how many activities have been performed over different time intervals;

• content and language features, obtained by applying the Part-of-Speech (POS) tagging technique. It identifies different types of POS tags, so that the tweets' analysis is performed by considering how POS tags are distributed;

• sentiment analysis, carried out by extracting various sentiment features including polarisation, emotion score, strength, valence score, etc.

In order to train the framework, they initially used a publicly available dataset consisting of fifteen thousand manually verified Twitter bots and sixteen thousand verified human accounts.

Afterwards, to obtain an updated evaluation of the model's accuracy, they constructed an additional, manually-annotated collection of other accounts.

Then they evaluate the model over the manually-annotated dataset, in order to classify the users that belong to it. The results obtained are less accurate, given that the newly extracted data also contain bots that were not present in the initial dataset.

For this reason the authors update the models by combining both datasets, in order to create multiple balanced datasets. They performed a 5-fold cross-validation to evaluate the accuracy of the models (a minimal sketch of this protocol is given after the list below):

• for the manually-annotated dataset, they trained and evaluated the model using only annotated accounts, with labels assigned by majority voting among the annotators. This results in an accuracy of about 0.89;

• they merge both datasets for training and testing, and the resulting classifier achieves an accuracy of 0.94, which is a remarkable result given that the new dataset contains a variety of more recent bots;

• they mix accounts from both datasets with different ratios, obtaining a high accuracy, between about 0.90 and 0.94.
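
The 5-fold cross-validation protocol used above can be illustrated with the following minimal sketch; the synthetic feature matrix and the choice of a Random Forest classifier are assumptions for the example only, not the exact setup of [41].

# Minimal sketch of 5-fold cross-validation on a synthetic bot/human feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))   # e.g. user, friend, network, temporal features...
y = (X[:, 0] + rng.normal(scale=0.8, size=1000) > 0).astype(int)  # 1 = bot

scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=5)
print("fold accuracies:", np.round(scores, 3), "mean:", round(scores.mean(), 3))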

2.4.2 Reddit studies

Reddit is an American social news aggregation, web content rating and discussion website, organised in so-called subreddits: within each subreddit, users can discuss the related topic, in a user-to-topic subscription fashion.

In an important study [42] the authors analyse the Reddit environment in order to better understand how bots and paid human agents perform their spamming work inside the subreddits of their interest.

The proposed approach to identify suspicious behaviour is network-based; the network created by the Reddit platform is organised as a tree, where the Reddit main page represents the root and all the subreddits are its children (corresponding to the interior nodes of the tree). The subreddits to which users can subscribe correspond to the topics they can discuss. For each subreddit there are several posts published by users, and each post may contain several comments from the users' community.

For this reason they choose as case study the subreddit “The_Donald”, which became the subreddit hub for Donald Trump supporters during his 2016 campaign.

Their research interest was in analysing the subreddits in order to find distinguishable groups of suspicious users. Their intuition was to find duplicitous users
