Harnessing the Social Sensing Revolution: Challenges and Opportunities

UNIVERSITÀ DI PISA

DOTTORATO DI RICERCA IN INGEGNERIA DELL’INFORMAZIONE

Harnessing the Social Sensing Revolution: Challenges and Opportunities

DOCTORAL THESIS

Author

Stefano Cresci

Tutor(s)

Prof. Marco Avvenuti
Prof. Maurizio Tesconi

Reviewer(s)

Prof. Roberto Di Pietro
Prof.ssa Kalina Bontcheva

The Coordinator of the PhD Program

Prof. Marco Luise

Pisa, May 2018
Cycle XXX


"As long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes." Denis Diderot – Encyclopédie (1755)


Acknowledgements

I would like to start these acknowledgements by thanking my family. Needless to say, without them, I wouldn’t be who I am. I thank my father for always being a great inspiration to me. I doubt I’ll ever manage to get anywhere close to the exceptional person he is. Deeply knowledgeable, patient, and always willing to help, he has been the source of many useful suggestions and insightful discussions. I am also forever grateful to my mother. Since the very start, she has been the first supporter of my work. She was always keen to know what I was working on, and how things were developing. She undoubtedly showed great perseverance in caring about my research, even if my explanations were probably mostly obscure and boring for her! I would also like to thank my parents for their stubbornness, which I believe I inherited. Indeed, it is that stubbornness, as well as the tenacity and ambition of my personality, that allowed me to reach my goals.

A very special thank-you goes to Lorena. I believe that achieving each and every goal requires a great deal of support from one’s loved ones, and I truly think that she supported me as best she could. She spent countless hours reading my drafts and constructively criticizing them. Indeed, she is the first “reviewer” of all my works, and if something convinces her, then it means that I should definitely keep on developing that topic!

I would also like to express my deepest appreciation to my advisors. I am very thankful to Prof. Maurizio Tesconi for the many opportunities that he gave me during these years. Since the beginning of my work at IIT-CNR, he encouraged me to work on challenging topics and to push my limits. I have learned much of what I know by collaborating with the many bright researchers he introduced me to. I also sincerely thank Prof. Marco Avvenuti, who has likewise been an inspiration for me and an example of professionalism. I truly appreciate, and I understand the importance of, every moment I spend with him. He always puts forward useful advice and interesting considerations, and indeed, I have learned much from him.

Finally, I thank all my colleagues and coworkers. We’ve shared deadlines, sleepless nights, and many highs and lows. During the course of these years, I have also annoyed many of you with my absurd ideas. Yet, you shared with me your precious time and your useful suggestions.


Summary

The recent proliferation of handheld devices that are equipped with a large number of sensors and communication capabilities, as well as the ubiquitous presence of communication facilities and infrastructures, and the mass diffusion and availability of social networking applications, have created a socio-technical convergence capable of sparking a revolution in the sensing world. One of the most promising and fascinating consequences of this new socio-technical convergence is the possibility to significantly extend, complement, and possibly substitute, conventional sensing by enabling the collection of data through networks of humans. Indeed, these unprecedented sensing and sharing opportunities have enabled situations where individuals not only play the role of sensor operators, but also act as data sources themselves. This spontaneous behavior has driven a new thriving – yet challenging – research field, called social sensing, investigating how human-sourced data can be gathered and used to gain situational awareness in a number of socially relevant domains.

However, the social sensing revolution does not come without costs. Now that each of us can send messages for the entire world to read, or upload pictures for the entire world to see, the amount of real-time information out there far exceeds our cognitive capacity to consume it. Today, we have access to a plethora of blogs, discussion forums, and online social network accounts that provide orders of magnitude increases in the number of news sources. We are thus witnessing the development of a widening gap between information production and our consumption capacity. Moreover, the reliability of such sources is not guaranteed. Indeed, it has already been demonstrated that observations produced by social sensors might be affected by a number of issues that undermine their usefulness and applicability. Among such issues are the widespread presence of fictitious, malicious, and deceptive social sensors, and the spreading of deceptive content, such as fake news. As a consequence, in order to fully harness this unfolding sensing revolution, we are in dire need of novel algorithms, techniques, and tools that are capable of turning this deluge of messy data into concise, meaningful, and reliable information. The possibility to fruitfully exploit this citizen-sensed stream of big data for novel applications – and ultimately for improving our societies and our everyday life – represents a tantalizing opportunity, counterbalanced by the many challenges related to the assessment of the reliability of such information, as well as its aggregation, summarization, and filtering.

The goal of this thesis is to investigate the two sides of the “social sensing” coin. Thus, the main contributions of this doctoral work are twofold: (i) investigate the problem of credibility and reliability of social sensors; and (ii) explore the opportunities opened up by social sensing for a practically relevant scenario, such as that of emergency management.


Sommario (Italian summary)

The recent proliferation of mobile devices equipped with a multitude of sensors and communication capabilities, together with the constant presence of communication infrastructures and the massive diffusion of social networking applications, has given rise to a socio-technical convergence capable of starting a revolution in the field of sensing. Among the most promising and fascinating consequences of this revolution is the possibility to extend, support, and even replace traditional sensing methods by collecting data through networks of humans. These unprecedented opportunities for data collection and sharing have enabled situations where individuals do not act exclusively as sensor operators, but serve as actual sensors themselves (so-called “social sensors”). The spontaneous behavior of these individuals has given rise to a flourishing, albeit challenging, new research field, called social sensing. This scientific field aims to investigate how the data produced by social sensors can be collected, analyzed, and aggregated in order to extract useful knowledge in many application scenarios.

However, the social sensing revolution is not free of challenges. Now that each of us can share messages with the rest of the world, or upload multimedia content that everyone can see, the amount of data available online considerably exceeds our capacity to exploit it. Today we have access to a multitude of blogs, online discussion forums, and social network profiles, which represent an increase of several orders of magnitude in the number of available information sources. We are therefore witnessing the emergence of a growing gap between the information that is continuously produced and our capacity to exploit it. In addition, the reliability of this multitude of information sources is not guaranteed. Indeed, it has already been widely demonstrated that the data produced by these social sensors can be affected by various problems that compromise their practical usefulness. Among these problems are the pervasive diffusion of fake or automated social accounts, and the growing spread of fake content, such as fake news. If we want to fully exploit the social sensing revolution, there is therefore a pressing need for new algorithms, techniques, and tools capable of transforming this mountain of messy data into a limited amount of detailed, reliable, and highly meaningful information. The possibility to profitably exploit this stream of big data extracted from social sensors, for new applications and services aimed at improving our societies and, hence, our lives, represents an enticing opportunity, counterbalanced however by the many pitfalls related to estimating the trustworthiness of this information, as well as to its aggregation and filtering.

The goal of this thesis is therefore to investigate the “two sides of social sensing”. Consequently, the contributions of this work are twofold, namely: (i) the study of the problem of the credibility and reliability of social sensors; and (ii) the exploration of the opportunities offered by applying social sensing methodologies to an important application domain, such as that of emergency management.


List of publications

International Journals

1. Avvenuti, M., Cresci, S., Del Vigna, F., Fagni, T., and Tesconi, M. (2018). CrisMap: A Big Data Crisis Mapping System based on Damage Detection and Geoparsing. Information Systems Frontiers. Springer. (in press)

2. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2017). Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling. IEEE Transactions on Dependable and Secure Computing. IEEE. (in press)

3. Avvenuti, M., Cresci, S., La Polla, M., Meletti, C., and Tesconi, M. (2017). Nowcasting of Earthquake Consequences using Big Social Data. IEEE Internet Computing, 21(6), 37-45. IEEE.

4. Avvenuti, M., Cresci, S., Del Vigna, F., and Tesconi, M. (2017). On the need of opening up crowdsourced emergency management systems. AI & SOCIETY, 1-6. Springer.

5. Avvenuti, M., Cresci, S., Marchetti, A., Meletti, C., and Tesconi, M. (2016). Predictability or early warning: using social media in modern emergency response. IEEE Internet Computing, 20(6), 4-6. IEEE.

6. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2016). DNA-inspired online behavioral modeling and its application to spambot detection. IEEE Intelligent Systems, 31(5), 58-64. IEEE.

7. Avvenuti, M., Cresci, S., Del Vigna, F., and Tesconi, M. (2016). Impromptu crisis mapping to prioritize emergency response. Computer, 49(5), 28-37. IEEE.

8. Avvenuti, M., Cimino, M. G., Cresci, S., Marchetti, A., and Tesconi, M. (2016). A framework for detecting unfolding emergencies using humans as sensors. SpringerPlus, 5(1), 43. Springer.


9. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2015). Fame for sale: efficient detection of fake Twitter followers. Decision Support Systems, 80, 56-71. Elsevier.

International Conferences/Workshops with Peer Review

1. Cresci, S., Lillo, F., Regoli, D., Tardelli, S. and Tesconi, M. (2018). $FAKE: Evidence of spam and bot activity in stock microblogs on Twitter. The 12th International AAAI Conference on Web and Social Media. AAAI. (in press)

2. Avvenuti, M., Cresci, S., Tesconi, M., Cimino, A., and Dell’Orletta, F. (2018). Real-World Witness Detection in Social Media via Hybrid Crowdsensing. The 12th International AAAI Conference on Web and Social Media. AAAI. (in press)

3. Avvenuti, M., Cresci, S., Nizzoli, L., and Tesconi, M. (2018, June). GSP (Geo-Semantic-Parsing): Geoparsing and Geotagging with Machine Learning on top of Linked Data. In Proceedings of the 2018 Extended Semantic Web Conference. Springer LNCS. (in press)

4. Cresci, S., Petrocchi, M., Spognardi, A., and Tognazzi, S. (2018, April). From Reaction to Proaction: Unexplored Ways to the Detection of Evolving Spambots. In Proceedings of The 2018 Web Conference Companion. ACM. (in press)

5. Vadicamo, L., Carrara, F., Falchi, F., Cresci, S., Tesconi, M., Cimino, A., and Dell’Orletta, F. (2017, October). Cross-Media Learning for Image Sentiment Analysis in the Wild. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops, 308-317. IEEE.

6. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2017, October). Exploiting digital DNA for the analysis of similarities in Twitter behaviours. In Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics, 686-695. IEEE.

7. Avvenuti, M., Bellomo, S., Cresci, S., La Polla, M. N., and Tesconi, M. (2017, April). Hybrid crowdsensing: A novel paradigm to combine the strengths of opportunistic and participatory crowdsensing. In Proceedings of the 26th International Conference on World Wide Web Companion, 1413-1421. ACM.

8. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2017, April). The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of the 26th International Conference on World Wide Web Companion, 963-972. ACM.

9. Cresci, S., Gazzè, D., Lo Duca, A., Marchetti, A., and Tesconi, M. (2015, May). Geo data annotator: a web framework for collaborative annotation of geographical datasets. In Proceedings of the 24th International Conference on World Wide Web Companion, 23-24. ACM.

10. Cresci, S., Tesconi, M., Cimino, A., and Dell’Orletta, F. (2015, May). A linguistically-driven approach to cross-event damage assessment of natural disasters from social media messages. In Proceedings of the 24th International Conference on World Wide Web Companion, 1195-1200. ACM.


11. Avvenuti, M., Del Vigna, F., Cresci, S., Marchetti, A., and Tesconi, M. (2015, November). Pulling information from social media in the aftermath of unpredictable disasters. In Proceedings of the 2nd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), 258-264. IEEE.

12. Cresci, S., Cimino, A., Dell’Orletta, F., and Tesconi, M. (2015, November). Crisis mapping during natural disasters via text analysis of social media messages. In Proceedings of the 16th International Conference on Web Information Systems Engineering, 250-258. Springer LNCS.

13. Del Vigna, F., and Cresci, S. (2015). Social Media for the Common Good: the case of EARS. In Proceedings of the 1st International Workshop on Community Intelligence for the Common Good.

14. Cimino, A., Cresci, S., Dell’Orletta, F., and Tesconi, M. (2014). Linguistically-motivated and lexicon features for sentiment analysis of Italian tweets. In Proceedings of the 4th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2014), 81-86.

Others

1. Cresci, S., Del Vigna, F., and Tesconi, M. (2017). I Big Data nella ricerca politica e sociale. In Andretta, M., and Bracciale, R., (eds.), Social Media Campaigning. Le elezioni regionali in #Toscana2015, 113-140. Pisa University Press.

2. Cresci, S., La Polla, M., and Tesconi, M. (2017). Il fenomeno dei Fake Follower in Twitter. In Andretta, M., and Bracciale, R., (eds.), Social Media Campaigning. Le elezioni regionali in #Toscana2015, 141-162. Pisa University Press.

3. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2016). Social fingerprinting - or the truth about you. ERCIM News, 106, 26-27. ERCIM.

4. Meletti, C., Cresci, S., and Tesconi, M. (2016). Rapid estimation of earthquake intensity from Twitter’s social sensors. In Proceedings of the 35th General Assembly of the European Seismological Commission. ESC.

5. Meletti, C., Cresci, S., La Polla, M., Marchetti, A., and Tesconi, M. (2014). Social Media as Seismic Networks for the Earthquake Damage Assessment. AGU Fall Meeting Abstracts, 1, 4338. AGU.

Submitted

1. Tognazzi, S., Cresci, S., Petrocchi, M., and Spognardi, A. (2018). A Masked Ball: Synthesis and Robustness of SocialBots 2.0. The 27th ACM International Conference on Information and Knowledge Management. ACM.

2. Tonga, Y., Uriarte, R., Cresci, S., Petrocchi, M., Tesconi, M., and De Nicola, R. (2018). Communicating an attractive ecosystem: Exploring the relation between festivals and cities by the lenses of Twitter. Tourism Management. Elsevier.


3. Prieto Curiel, R., Cresci, S., Muntean, C., and Bishop, S. (2018). Crime, its fear and activism in social media. Royal Society Open Science. Royal Society.


List of Abbreviations

A

API Application Programming Interface.

AUC Area under the curve.

C

CAPTCHA Completely automated public Turing test to tell computers and humans apart.

CCDF Complementary Cumulative Distribution Function.

CDF Cumulative Distribution Function.

D

DCG Discounted Cumulative Gain.

DYFI USGS’s Did You Feel It? system.

F

FEMA The U.S. Federal Emergency Management Agency.

FN False Negative.

FP False Positive.

FSC Facebook Safety Check.

G

GPS Global Positioning System.

H

HaaS Human as a Sensor.

I

ICT Information and Communication Technology.

IDCG Ideal Discounted Cumulative Gain.

IG Information Gain.

INGV The Italian National Institute of Geophysics and Volcanology.

J

JMA Japan Meteorological Agency.

L

LBSN Location-Based Social Network.

LCS Longest Common Substring.

LDA Linear Discriminant Analysis.

LTA Long-Term Average.

M

MAE Mean Absolute Error.

MCC Matthews Correlation Coefficient.

MCL Markov Cluster Algorithm.

MMS Multimedia Messaging Service.

N

nDCG Normalized Discounted Cumulative Gain.

NLP Natural Language Processing.

O

OSN Online Social Network.

P

PCC Pearson Correlation Coefficient.

PMI Pointwise Mutual Information.

POS Part of Speech.

Q

Q&A Question answering.

R

REST Representational State Transfer.

RFID Radio-Frequency Identification.

RMSE Root Mean Squared Error.

ROC Receiver Operating Characteristic.

S

SMS Short Messaging Service.

SNA Social Network Analysis.

STA Short-Term Average.


T

TN True Negative.

TP True Positive.

U

UML Unified Modeling Language.


Contents

List of Publications
List of Abbreviations

1 Introduction
1.1 The paradigm of social sensing
1.2 Challenges and opportunities in social sensing
1.3 Contributions
1.3.1 Challenges in the detection of deceptive social sensors
1.3.2 Opportunities in emergency management via social sensing
1.4 Materials
1.4.1 Twitter as a social sensing platform
1.4.2 Twitter datasets of social sensors
1.4.3 Twitter datasets of social sensors observations
1.4.4 Reproducibility
1.5 Evaluation methods
1.5.1 Correlation
1.5.2 Classification
1.5.3 Ranking
1.5.4 Regression
1.6 Outline

2 Related Work and State-of-the-Art
2.1 Detection of deception in social media
2.1.1 Spammers and spam
2.1.2 Bots and automated accounts
2.1.3 Cyborgs and compromised accounts
2.1.4 Sybils, fake followers and account markets
2.1.5 Concluding remarks
2.2 Social media emergency management
2.2.2 Fully working systems
2.2.3 Practical experiences
2.2.4 Concluding remarks

3 Detection of deceptive social sensors
3.1 Fake followers
3.1.1 Detection with algorithms based on classification rules
3.1.2 Detection with algorithms based on feature sets
3.1.3 A lightweight classifier of fake Twitter accounts
3.2 Social spambots
3.2.1 Testing Twitter’s defenses
3.2.2 Crowdsourcing social spambots detection
3.2.3 Benchmarking current spambot detection techniques
3.3 Towards an accurate detection of social spambots
3.3.1 Modeling collective online behaviors
3.3.2 The digital DNA behavioral modeling technique
3.4 DNA-inspired detection of social spambots
3.4.1 LCS curves of legitimate and malicious accounts
3.4.2 A supervised detection technique
3.4.3 An unsupervised detection technique
3.4.4 Social spambots detection results
3.5 Wrap-up and final remarks

4 Emergency management via social sensing
4.1 A social sensing framework for emergency detection
4.1.1 Core concepts and functionalities
4.1.2 Architectural design
4.1.3 Proof-of-concept implementation: the EARS system
4.1.4 Validation of earthquake detection via social sensing
4.2 Going beyond emergency event detection
4.2.1 Nowcasting of earthquake intensity with big social data
4.2.2 Detection of damage mentions
4.3 Crisis mapping
4.3.1 The CrisMap system
4.3.2 Detecting damaged areas
4.3.3 A novel geoparsing technique
4.3.4 Qualitative evaluation of crisis maps
4.3.5 Quantitative evaluation of crisis maps
4.4 Post-disaster user engagement
4.4.1 Beyond traditional participatory and opportunistic crowdsensing
4.4.2 A system for hybrid crowdsensing
4.4.3 Hybrid crowdsensing results
4.4.4 Practical applicability of hybrid crowdsensing
4.5 Augmenting hybrid crowdsensing with witness detection
4.5.1 Building a rigorous ground-truth for witness detection
4.5.2 The witness detection system
4.6 Wrap-up and final remarks

5 Conclusions
5.1 Results and lessons learned
5.2 Future research directions
5.3 Social sensing: the way ahead


CHAPTER 1

Introduction

1.1 The paradigm of social sensing

Until now, human and environmental phenomena, as well as real-world events, have always been monitored via conventional sensing approaches. Such approaches generally rely on environments that are densely enriched with ICT-enabled sensing devices that are exploited for provisioning context-aware, adaptable, and personalized services for better interacting with – and understanding – the surrounding world [159]. Modern cities represent a good example of such sensing-enabled environments, since they are already filled with cameras and microphones for security purposes, sensor networks for environmental monitoring, and RFID-based badges and antennas for keeping track of object and user movements, to name but a few well-known examples. Indeed, in order to deepen the understanding of our own world and to provide novel services and applications, in the last decades we have witnessed the fast development of a number of scientific fields, such as those of remote sensing, wireless sensor networks, and pervasive computing [199]. Nowadays, however, the proliferation of handheld devices equipped with a large number of sensors and communication capabilities, the ubiquitous presence of communication facilities and infrastructures, and the mass diffusion and availability of social networking applications have created a socio-technical convergence capable of triggering a paradigm shift in the sensing world. One of the most promising and fascinating consequences of this new socio-technical convergence lies in the possibility to significantly extend, complement, and possibly substitute, conventional sensing by enabling the collection of data through networks of humans [16]. Novel sensing paradigms such as crowd-, urban-, or citizen-sensing have been coined to describe how information can be sourced from the average individual in a coordinated way. Data gathering can be either participatory or opportunistic, depending on whether the users intentionally contribute to the acquisition campaign (possibly receiving an incentive), or they simply act as the bearers of a sensing device from which data is transparently collected by some situation-aware system [140, 211].

In this scenario, the advent of online social networks (OSN) – such as Twitter¹, Weibo², and Instagram³ – that have grown to become a primary hub for public expression and interaction, has added facilities for ubiquitous and real-time data-sharing [82]. These unprecedented sensing and data-sharing opportunities have enabled situations where individuals not only play the role of sensor operators, but also act as data sources themselves. In fact, humans have a great aptitude for processing and filtering observations from their surroundings and, with communication facilities at hand, for readily sharing the information they collect [216]. This spontaneous behavior has driven a new challenging research field, called social sensing [4], investigating how human-sourced data, modeled by the human as a sensor (HaaS) paradigm [16, 248], can be gathered and used to gain situational awareness and to nowcast events in different domains such as health, transportation, energy, social and political crises, and even warfare. Among the advantages of social sensing are the natural tendency of OSN users to promptly convey information about the context [74, 155] and the fact that those proactively posted messages, especially when witnessing emergency situations, are likely to be free of pressure or influence [265]. The most notable case is Twitter, where users are encouraged to make their messages (so-called tweets) publicly available by default and where, due to the 140-character length limit, they are forced to share more topic-specific content.

Now that each of us can send messages for the entire world to read – such as tweets – or upload pictures for the entire world to see – thanks to social networks that support multimedia information broadcast – the amount of real-time information out there far exceeds our cognitive capacity to consume it [247]. In addition, we also have to cope with the growing volume of information generated by traditional sensors, which are becoming more widely deployed and capable of generating information in real-time. The pace at which technological advances increase the rate of information production far outstrips the rate at which our own cognitive capacity evolves to consume it. Indeed, we still read and comprehend information at the same rate as our ancestors did. We are thus witnessing the development of a widening gap between information production and our consumption capacity [247]. As a consequence, in order to profitably exploit this unfolding sensing revolution, we are in dire need of novel algorithms, techniques, and tools that are capable of turning this deluge of messy data into concise and meaningful information. To bridge the widening gap between data production and consumption rates, new algorithms must cut noise from unreliable sources and focus receiver attention on a small subset of relevant and credible observations [247]. The possibility to fruitfully exploit this citizen-sensed stream of big data for novel applications – and ultimately for improving our societies and our everyday life – represents a tantalizing opportunity, counterbalanced by the many challenges related to the assessment of the reliability of such information, as well as its aggregation, summarization, and filtering.

¹ https://twitter.com/
² https://www.weibo.com
³ https://www.instagram.com


1.2 Challenges and opportunities in social sensing

As previously introduced, social sensors are a huge network of distributed, spontaneous, and real-time data sources. Collectively, the big social data continuously generated by social sensors represent an unprecedented opportunity to study many real-world phenomena as they unfold. For instance, data derived from social sensors have been recently employed to study the patterns of human mobility [118] and the human cognitive capacity in establishing and maintaining active social relationships [85, 156]. In turn, models and results derived from these analyses have been used jointly [246], for instance to monitor and predict the spreading of epidemics [30, 117]. Social sensors observations have also been used for commercial and financial purposes. Indeed, a number of systems based on social sensing have been designed to forecast stock market prices [38], and to predict box office incomes of films [171]. In addition, data derived from social sensors have also been leveraged to predict electoral outcomes in many countries [60, 207, 233] and to predict the number of future citations received by recently published scientific papers [93].

All these examples highlight that social sensing systems typically make it possible to obtain accurate results much earlier than traditional systems. This interesting characteristic of social sensing systems derives partly from the real-time, responsive nature of social sensors data. In addition, it is also due to the possibility to develop scalable, fully automatic systems that are capable of producing the output of their analyses in a very limited time, when compared to traditional methodologies of analysis.

The responsiveness that is typical of social sensing systems, as well as their appropriateness for carrying out monitoring and nowcasting tasks, makes these kinds of systems particularly suitable for all those practical scenarios characterized by rapidly evolving phenomena, with stringent time requirements. Emergency management is a primary example of such a practical scenario. In fact, in the aftermath of mass emergencies, responders have little information on which to base their decisions, and little time to act. A decision support system capable of automatically collecting, analyzing, and organizing social sensors observations may provide valuable help to emergency responders [23]. For this reason, scholars have recently shown a growing interest in studying the applicability of social sensing techniques for improving emergency management [133].

In order to obtain satisfactory results in this critical research field, many challenges need to be solved, such as those related to the design of data collection techniques capable of overcoming the trade-off between data completeness and data specificity [22], the design of efficient algorithms for analyzing the collected data, and the design of visualization techniques for concisely and effectively presenting the results of the analyses. These challenges represent the majority of currently open scientific and technological issues in this field. Overcoming such challenges is the key to fully tapping the opportunities brought by the social sensing revolution.

Adding to the challenges previously introduced, we also have to consider reliability and trustworthiness requirements, which are mandatory for critical social sensing systems. Traditionally, information was disseminated by a few trustworthy sources – e.g., by news agencies. Today, we have access to a plethora of blogs, discussion forums, and OSN accounts that provide orders of magnitude increases in the number of news sources. However, the reliability of such sources is not guaranteed [247]. Indeed, it has been demonstrated that observations produced by social sensors, and especially those collected from OSNs, are affected by a number of issues that might drastically undermine their usefulness and applicability [186]. The first issue arises from the observation that, for the vast majority of OSN accounts, it is not guaranteed that the account belongs to a real, existing person⁴. Indeed, a considerable percentage of all OSN accounts are fictitious or automated. While a few of these automated accounts serve a beneficial purpose, the remaining ones are malicious and can be coarsely classified as fakes, bots, and spammers [58]. Malicious accounts have been involved in a number of shady or illicit activities, ranging from spreading malware, viruses and phishing attacks, to spreading fake news, spamming commercial messages, and influencing the public opinion about important societal topics. Obviously, information coming from an unverified account cannot be directly trusted. This problem is also worsened by the ease with which OSNs allow users to create new accounts, and to manage multiple accounts at a time. From the technical viewpoint, this first issue involves the assessment of the nature of an OSN account, in order to filter out all those accounts that do not belong to a real, existing person.

A second problem is related to the spreading of fake and unverified information. In this case, even legitimate and genuine accounts, operated by real persons, might contribute to the spread of such harmful content. In fact, rumors tend to spread rapidly through OSNs, and their veracity is hard to establish in a timely fashion. For instance, as reported in [186], during the earthquake that occurred in Chile in 2010, rumors spread through Twitter that a volcano had become active and that there was a tsunami warning in Valparaiso. Later, these reports were found to be false. Similarly, after a severe earthquake that occurred in Italy in 2012, a famous Italian public character reported that the earthquake had been predicted by an online news site. The claim was obviously fake, since earthquakes cannot be predicted [23, 109]. However, the popularity of the public character and the delay with which Italian authorities shared an official denial allowed the news to spread extensively on Twitter, causing widespread protests and discontent. A survey conducted in 2012 by Pew Internet Research investigated the future of big data and concluded that we are running a high risk of “distribution of harms” due to the abundance and spreading of inaccurate and false information [11]. In 2013, the World Economic Forum also listed “massive digital misinformation” – either intentional or unintentional – as one of the main risks for our modern society [129].

Given this picture, it is not surprising that veracity is now widely recognized as the fourth “V” associated with big data, joining volume, variety, and velocity [196]. It is therefore mandatory to jointly consider both the opportunities and the challenges – among which veracity is a crucial one – in social sensing, in order to perform impactful, yet responsible, research on the big data produced by social sensors.

1.3 Contributions

The goal of this thesis is to investigate the two sides of the “social sensing” coin. Thus, the main contributions of this doctoral work are twofold: (i) investigate the problem of credibility and reliability of social sensors; and (ii) explore the opportunities opened up by social sensing for a practically relevant scenario, such as that of emergency management.

⁴ An exception is represented by verified accounts, whose real identity has been verified by the OSN administrators. Typically, verified accounts are in the region of 0.1% of all the accounts of an OSN, and they generally are the accounts of famous, public characters.

1.3.1 Challenges in the detection of deceptive social sensors

We tackle the assessment of credibility and reliability of social sensors as the task of characterizing and identifying deceptive accounts in social media. Specifically, we focus on two types of deceptive accounts that received little attention from the scientific community: fake followers and social spambots. With respect to prior work in this area, our contributions are related to the study of currently available techniques for the classification of fake Twitter followers [66]. The results of this work demonstrate that current techniques and social analytics tools are not capable of accurately detecting fake followers in Twitter. Therefore, we designed and developed an efficient machine learning classifier that achieved excellent detection results [67]. Subsequently, we focused on a novel type of deceptive accounts: social spambots. Such accounts present advanced human-like characteristics that make them extremely hard to detect [230]. We benchmarked state-of-the-art spambot detection techniques, as well as human performance, in detecting such novel social spambots, and we demonstrated the need for novel analytic tools capable of detecting such accounts [70]. Then, leveraging the results of this study, we devised a novel behavioral modeling technique, called digital DNA [68]. We demonstrated that modeling the online behavior of Twitter accounts with their digital DNA uncovers significant differences between social spambots and human-operated accounts [69, 73]. We leveraged such differences to design and implement a detection technique capable of accurately detecting even the most sophisticated social spambots [71].

1.3.2 Opportunities in emergency management via social sensing

The goal of this body of work is to investigate the usefulness of social sensing for solving emergency management tasks, such as: detecting the occurrence of an emergency, assessing the consequences of the emergency, understanding the places that require a prioritized intervention, identifying individuals that are directly involved in the emergency, and more. Solving these high-level issues also requires overcoming a number of other low-level challenges, such as geoparsing social media texts, detecting bursts of keywords related to a given topic, filtering out unimportant chatter from social streams, and representing a multitude of information in intuitive, clear, and elegant visualizations.

Since much prior work already focused on the detection of different specific emergencies, our first contribution is the design of a general framework for detecting emergencies via social sensing [16]. The proposed framework has also been implemented for the detection of earthquakes, with the EARS system [22, 26, 80]. After the detection of an emergency event, it is important to understand its consequences [19, 23]. Thus, we developed a statistical system for nowcasting earthquake intensity – that is, a quantitative estimation of earthquake damage – from social sensors observations [21]. The proposed system provides accurate estimations much earlier than the traditional approach. We also developed a classification system for detecting mentions of damage in social media messages [74]. This classification system is exploited jointly with a semantic geoparsing technique [24] in order to produce crisis maps in the aftermath of an emergency [18, 64]. Crisis maps are used to highlight the most stricken locations, in order to prioritize emergency interventions [17]. Finally, to augment post-disaster operations, we developed a hybrid crowdsensing technique with the twofold goal of maximizing the amount and the quality of collected social sensors observations, and of directly contacting a selected subset of social sensors in order to elicit additional, targeted information from them [15]. We also developed a classification system for automatically detecting those social sensors that are witnesses of an emergency event. The novel hybrid crowdsensing technique can be used in conjunction with the witness detection system to directly contact – and solicit detailed information from – witnesses of an emergency.

1.4 Materials

Since this doctoral work has been carried out with a data-driven approach, the main material of this thesis consists of the real-world data that we collected in order to carry out realistic experiments. Specifically, most of the data used in the remainder of this thesis have been collected from the Twitter social networking and microblogging platform. In this section, we describe the use of Twitter data in social sensing, the Twitter datasets used in our work, and the methodologies with which they have been collected. Additional data coming from different sources are described on a per-experiment basis, within the specific sections in which they are used.

1.4.1 Twitter as a social sensing platform

Nowadays, Twitter is one of the largest social networking platforms, and it is widely used for the real-time social sharing of information through microblogging [147]. Microblogs are short text messages that, in the case of Twitter, are called tweets and are limited to 140 characters. Twitter’s character limit for tweets encourages the service to be used for short, frequent exchanges of information. However, as a result of this length limitation, abbreviations, jargon, and colloquial phrasing are common, making the automation of content filtering a challenging task. In addition to sharing tweets, users can also interact socially by subscribing to feeds of tweets from other users.

Twitter’s growing popularity has coincided with a rise in smartphone usage, resulting in Twitter being used frequently on mobile devices. Given this ease of use, many Twitter users post tweets throughout the day, wherever they might be, as they engage in other activities. This gives Twitter data good spatial and temporal coverage, and results in content that often concerns the users’ day-to-day experiences. Twitter delivers all of this content in real-time, meaning that a tweet written by a particular user is made accessible and pushed to that user’s followers almost instantaneously. Twitter’s real-time delivery enables content to be consumed as it is happening. When the tweets from a large number of users are aggregated, real-time information about current occurrences can be extracted [163]. This makes Twitter a valuable social sensing source for live information on events as they progress. For instance, Twitter is very popular during emergencies and disasters, and it is used by both official government agencies and the public to disseminate time-sensitive information. Central to the operation of Twitter is the use of hashtags, words or short unspaced phrases prefixed with the hash (#) sign. Hashtags are content identifiers frequently used to tag a tweet, as well as to search and filter information. The creation and use of a hashtag can be established by any user who wants to create a concept category to share specific information about a subject [47]. For example, during the 2013 floods in Sardinia (Italy), the hashtag #allertameteoSAR was used by both users and agencies to share information about this particular event.

Twitter has a large volume of activity, with over 500 million new tweets shared each day, more than 340 million monthly active users⁵, and over 1.3 billion subscribers as of 2017⁶, giving analyses conducted on its data increased relevance to the global population. Twitter’s annual advertising revenue in 2014 has been estimated at around $480 million⁷. Twitter information is particularly well suited for data analyses for several specific reasons. First, because of Twitter’s focus on the public broadcasting of content, only a small percentage of Twitter accounts are private. Additionally, tweets can be automatically encoded with a location. Indeed, users can enable a setting to have all of their tweets automatically tagged with their coordinates, or they can choose to attach their location to certain tweets [163]. However, only a small percentage of tweets natively carry such geographic information [18, 64]. Finally, Twitter maintains a large and free set of Web APIs⁸ through which filtered samples of tweets can be collected. The quantity of data, its relevance to social sensing analyses, and the ease of access all make Twitter an appealing source for social sensing. For all these reasons, within the context of this thesis Twitter data is used to extract information about social sensors, as well as about the observations that social sensors share.

1.4.2 Twitter datasets of social sensors

In Twitter, every account registered to the platform is potentially a social sensor, capable of providing observations about certain phenomena. Thus, it is possible to study the characteristics of social sensors on Twitter by analyzing the rich metadata associated with Twitter accounts, as well as their behavior on the social platform. In our work, such information is exploited in order to discriminate between real social sensors and deceptive (e.g., fictitious, automated) accounts.

Twitter offers a number of APIs for collecting extensive data about the accounts registered to the social platform. Specifically, all the metadata available for a Twitter account are described in a specific page of the official Twitter documentation for developers⁹. Account metadata include: the account name, its unique identifier in Twitter (ID), the date the account was created, the URL to its profile picture, its language, its biography, its location, its birthdate, the URL to a personal Web site, etc. In addition to this information, Twitter also allows collecting the full list of accounts with which a target account, specified by its account ID, established social relationships – that is, all the followers¹⁰ and friends¹¹ of that account. Finally, given an account ID, it is also possible to retrieve the contents of the account’s timeline¹² – that is, the list of the most recent tweets posted by that account. To summarize, for each Twitter account it is possible to retrieve its metadata information, its social relationships, and its posting behavior. This data collection operation can be performed programmatically for a large number of accounts, by means of a Twitter crawler.

⁵ https://www.omnicoreagency.com/twitter-statistics/
⁶ C. Smith, By The Numbers: 150+ Amazing Twitter Statistics, http://goo.gl/o1lNi8
⁷ Statistic Brain, Twitter statistics, http://goo.gl/XEXB1
⁸ https://dev.twitter.com/overview/api
⁹ https://dev.twitter.com/overview/api/users
¹⁰ https://dev.twitter.com/rest/reference/get/followers/list
¹¹ https://dev.twitter.com/rest/reference/get/friends/list
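The three kinds of information that a crawler gathers for each account (metadata, social relationships, and timeline) can be pictured as one record per account. The following Python sketch is only an illustration of that layout: the class name AccountRecord and all field names are hypothetical, they do not reproduce the field names of the Twitter APIs, and they are not taken from the crawler used in this work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AccountRecord:
    """Hypothetical container for the data crawled about one account."""
    account_id: int                    # unique identifier of the account
    name: str
    created_at: str                    # creation date, kept as plain text here
    language: Optional[str] = None
    biography: Optional[str] = None
    location: Optional[str] = None
    follower_ids: List[int] = field(default_factory=list)  # social relationships
    friend_ids: List[int] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)      # most recent tweets

    def summary(self) -> dict:
        """Return simple counts that a later analysis could start from."""
        return {
            "followers": len(self.follower_ids),
            "friends": len(self.friend_ids),
            "tweets": len(self.timeline),
        }


# Made-up example values, for illustration only.
record = AccountRecord(account_id=1, name="example", created_at="2012-12-12",
                       follower_ids=[2, 3], friend_ids=[4], timeline=["hello"])
print(record.summary())
```

Keeping relationships and timeline as lists inside the record is a convenience of this sketch; a crawler handling many accounts would more likely store them in separate tables keyed by account ID.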

The datasets that we exploited for studying deceptive social sensors are related to different types of Twitter accounts. The idea underlying this part of our work is to compare the characteristics of different types of deceptive accounts with those of legitimate, human-operated accounts. The differences uncovered by this comparison can then be leveraged to design detection techniques for automatically identifying deceptive accounts. In the following, we provide details about each social sensors dataset. Detailed statistics about all the datasets are also reported in Table 1.1.

Fake followers

The first type of deceptive accounts that we studied are so-called fake followers (or fakes) [223], as further discussed in Section 3.1. Fake followers are fake accounts that are massively created to follow a target account and that can be bought from online markets. In other words, their goal is to increase the number of followers, and consequently the popularity, of the target account. In order to build a ground truth of fake followers, in April 2013 we bought 3,000 fake Twitter accounts from three different Twitter online markets. In particular, we bought 1,000 fakes from http://fastfollowerz.com, 1,000 fakes from http://intertwitter.com, and 1,000 fakes from http://twittertechnology.com, at a price of $19, $14 and $13 respectively. Surprisingly, fastfollowerz.com and intertwitter.com gave us a few more accounts than we paid for: 1,169 and 1,337 respectively, instead of 1,000. We collected data about those accounts with a Twitter crawler and we built a fastfollowerz.com dataset, labeled FAKE-FSF, and an intertwitter.com dataset, labeled FAKE-INT. Instead, we were unable to crawl all the 1,000 fakes bought from twittertechnology.com, since 155 of them were suspended by Twitter almost immediately. The remaining 845 accounts constitute the twittertechnology.com dataset, which is labeled FAKE-TWT. All accounts in the FAKE-FSF, FAKE-INT, and FAKE-TWT datasets are English-speaking.

Notably, this ground truth differs from those typically employed in this research field. Indeed, in order to build datasets of malicious and legitimate accounts, a manual annotation process is often used, where each account of the dataset is manually inspected by a human annotator who infers its nature (whether malicious or legitimate). Here instead, we directly bought fake followers from the account markets that sell them on the Web, thus obtaining a strong ground truth and avoiding the need for a manual annotation.

We acknowledge that this fake followers dataset is just illustrative, and not exhaustive of all the possible existing sets of fake followers. However, it is worth noting that we found the Twitter account marketplaces by simply searching for them on the most common Web search engines. Thus, one can argue that this dataset represents what was easily found on the Web at the time of searching.

Social spambots

The second type of deceptive accounts that we investigated are so-called social spambots [100], whose details are thoroughly disclosed in Section 3.2. In general, spambots are automated accounts (i.e., accounts driven by a bot) that repeatedly advertise unsolicited and often harmful content (e.g., malware, URLs to phishing Web sites, fake news, etc.). For this novel type of deceptive accounts, there is no publicly available online market. Hence, we did not have the chance to directly buy the accounts for our ground truth and, instead, we had to look for them in the Twitter ecosystem. A first dataset of social spambots (henceforth SPAM-POL) was created after observing the activities of a suspicious group of accounts that we discovered on Twitter during the Mayoral election in Rome, in 2014. One of the runners-up employed a social media marketing firm for his electoral campaign that made use of almost 1,000 automated accounts to publicize his policies. Surprisingly, we found such automated accounts to be similar to genuine ones in every way. Every profile was accurately filled with detailed – yet fake – personal information such as a stolen photo, short-bio, location, etc. Those accounts also represented credible sources of information since they all had thousands of followers and friends, the majority of which were genuine users. Furthermore, the accounts showed a tweeting behavior that was apparently similar to that of genuine accounts, with a few tweets posted every day, mainly quotes from popular people. However, every time the political candidate posted a new tweet from his official account, all the automated accounts retweeted it in a time span of just a few minutes. By resorting to this farm of bot accounts, the political candidate was able to reach many more genuine accounts in addition to his direct followers, and managed to alter Twitter engagement metrics during the electoral campaign. Amazingly, we also witnessed tens of human accounts who tried to engage in conversation with some of the spambots. The most common form of such human-to-spambot interaction was represented by a human reply to one of the quotes tweeted by the spambots. Quite obviously, no human account who tried interacting with the spambots ever received a meaningful reply from them. All the accounts belonging to the SPAM-POL dataset tweet in Italian.
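The retweeting pattern described above, where every post of the candidate is retweeted by the same accounts within minutes, suggests a simple check that can be sketched in Python. This is an illustrative heuristic written for this passage, not the detection technique developed in this thesis: the function name, the input layout, the 5-minute delay, and the 90% share are all assumptions, and the sample data is made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def fast_retweeters(posts, retweets, max_delay=timedelta(minutes=5), min_share=0.9):
    """Return accounts that retweeted at least `min_share` of the posts within `max_delay`.

    posts:    dict mapping post id -> datetime of the original tweet
    retweets: iterable of (account, post id, datetime of the retweet)
    """
    if not posts:
        return set()
    fast = defaultdict(set)  # account -> ids of the posts it retweeted quickly
    for account, post_id, when in retweets:
        posted = posts.get(post_id)
        if posted is not None and timedelta(0) <= when - posted <= max_delay:
            fast[account].add(post_id)
    return {a for a, ids in fast.items() if len(ids) / len(posts) >= min_share}


# Made-up data: one account retweets both posts within minutes, another reacts hours later.
t0 = datetime(2014, 1, 1, 12, 0)
posts = {"p1": t0, "p2": t0 + timedelta(hours=3)}
retweets = [
    ("bot", "p1", t0 + timedelta(minutes=2)),
    ("bot", "p2", t0 + timedelta(hours=3, minutes=1)),
    ("human", "p1", t0 + timedelta(hours=5)),
]
suspects = fast_retweeters(posts, retweets)
print(suspects)
```

A check of this kind only captures the timing regularity; as noted in the text, the profiles and everyday tweets of these accounts looked genuine, which is why timing alone is a weak signal on its own.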

We further investigated this issue and found it to be widespread also outside of Italy. Indeed, we discovered a second group of social bots, labeled SPAM-TLN, which spent several months promoting the #TALNTS hashtag. Specifically, Talnts is a mobile phone application for getting in touch with and hiring artists working in the fields of writing, digital photography, music, and more. The vast majority of tweets were harmless messages, occasionally interspersed with tweets mentioning specific legitimate (human) accounts and suggesting that they buy the VIP version of the app from a Web store.

Furthermore, we uncovered a third group of social bots, labeled SPAM-AMZ, which advertise products on sale on Amazon.com. The deceitful activity was carried out by spamming URLs pointing to the advertised products. Similarly to the retweeters of the Italian political candidate, this family of spambots also interleaved spam tweets with harmless and genuine ones. Accounts belonging to the SPAM-TLN and SPAM-AMZ datasets tweet in the English language.

Traditional spambots

Accounts belonging to this category are not the core focus of this thesis, since both scholars and platform administrators have already focused on their detection with many satisfactory results, mainly coming from Academia [70]. Nevertheless, these unsophisticated, old-school spambots can serve as a baseline against which to compare the novel, advanced, and sophisticated deceptive accounts.


dataset   description                                         accounts      tweets  year

legitimate accounts                                              5,424  11,009,252     –
HUM-TFP   followers of @thefakeproject account                     469     563,693  2010
HUM-E13   citizens that discussed Italian political elections    1,481   2,068,037  2010
HUM-HYB   random accounts that answered questions                3,474   8,377,522  2011

fake followers                                                   3,351     196,027     –
FAKE-FSF  fakes bought from fastfollowerz.com                    1,169      22,910  2013
FAKE-INT  fakes bought from intertwitter.com                     1,337      58,925  2012
FAKE-TWT  fakes bought from twittertechnology.com                  845     114,192  2010

social spambots                                                  4,912   3,457,344     –
SPAM-POL  retweeters of an Italian political candidate             991   1,610,176  2012
SPAM-TLN  spammers of paid apps for mobile devices               3,457     428,542  2014
SPAM-AMZ  spammers of products on sale at Amazon.com               464   1,418,626  2011

traditional spambots                                             2,661   6,148,293     –
OLD-YANG  training set of [254]                                  1,000     145,094  2009
OLD-SCAM  spammers of scam URLs                                    100      74,957  2014
OLD-JOB1  automated accounts spamming job offers                   433   5,794,931  2013
OLD-JOB2  other automated accounts spamming job offers           1,128     133,311  2009

Table 1.1: Statistics about the social sensors datasets used in the remainder of this thesis. The year column reports the average creation year of the accounts belonging to the different groups.

The OLD-YANG dataset is the training set used by Yang et al. in [254], kindly provided to us by the authors of that work. In [254], the dataset has been used to train a machine learning classifier for the detection of evolving Twitter spambots. Accounts belonging to the OLD-SCAM dataset are rather simplistic bots that repeatedly mention other users in tweets containing scam URLs. To lure users into clicking the malicious links, the content of their tweets invites the mentioned users to claim a monetary prize. The OLD-JOB1 and OLD-JOB2 datasets are related to two different groups of bots that repeatedly tweet about open job positions and job offers. All the accounts of the OLD-YANG, OLD-SCAM, OLD-JOB1, and OLD-JOB2 datasets tweet in English.

Notably, another well-known dataset, introduced in [152], could have been used as a baseline dataset of traditional spambots. However, our OLD-YANG dataset already had characteristics comparable to those of the dataset of [152]. Moreover, the OLD-YANG dataset is particularly interesting for our studies since it contains spambots that were considered evolved – thus, more sophisticated and more difficult to detect – at the time of dataset release.

Legitimate accounts

Together with the data about deceptive Twitter accounts, we also collected data about legitimate, human-operated accounts. A first dataset of legitimate accounts is derived from the followers of The Fake Project Twitter account. The Fake Project started its activities on 12 December 2012, with the creation of the homonymous Twitter account @thefakeproject¹³. Its profile reports the following motto: “Follow me only if you are NOT a fake” and explains that the initiative is linked to a research project owned by researchers at IIT-CNR, in Pisa (Italy). In a first phase, team members of The Fake Project contacted other researchers and journalists to advertise the initiative. Foreign journalists and bloggers also supported the initiative in their respective countries. In a twelve-day period (December 12 to 24, 2012), the Twitter account gained 574 followers. Via Twitter APIs, we collected all information about these followers, together with that of their followers and friends. For building this dataset, we crawled the 574 accounts, leading to the collection of 616,193 tweets and 971,649 social relationships. All these followers voluntarily joined the project. To include them in our dataset of humans, we also launched a verification phase, where each follower received a direct message on Twitter containing a URL to a unique CAPTCHA. We considered as certified humans all the 469 accounts (out of the 574 followers) that successfully completed the CAPTCHA. In the remainder of this thesis this dataset is referred to as HUM-TFP. Accounts of this dataset are mainly Italian-speaking, although there are accounts that also tweet in other languages.

The #elezioni2013 dataset, henceforth HUM-E13, has been created to support a research initiative for a sociological study carried out in collaboration with the University of Perugia and the Sapienza University of Rome. The study focused on the strategic changes in the Italian political panorama for the 3-year period 2013-2015. Researchers identified 84,033 unique Twitter accounts that used the hashtag #elezioni2013 in their tweets, during the period between January 9 and February 28, 2013. The identification of these accounts was based on specific keyword-driven queries on the username and biography fields of the accounts’ profiles. Keywords include blogger, journalist, social media strategist/analyst, and congressperson. Specific names of political parties have also been searched. In conclusion, all the accounts belonging to politicians and candidates, parties, journalists, bloggers, specific associations and groups, and whoever was somehow officially involved in politics, have been discarded. The remaining accounts (≈40,000) have been classified as citizens. This last set has been sampled (with confidence level 95% and confidence interval 2.5), leading to a final set of 1,488 accounts, which have been subject to a manual verification to determine the nature of their profiles and tweets. The manual verification process has been carried out by two sociologists from the University of Perugia, Italy. It involved the analysis of the profile pictures, biographies, and timelines of the accounts under investigation. Accounts not having a biography or a profile picture have been discarded. URLs in biographies have also been manually checked to allow for a deeper analysis of the subject. Only accounts labeled as humans by both sociologists have been included in the HUM-E13 dataset. Overall, the manual verification phase lasted roughly two months. As a result, 1,481 accounts became part of the HUM-E13 dataset. All the accounts in HUM-E13 tweet in Italian.
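The sampling step can be made concrete with a short calculation. The text does not state which formula was used; the Python sketch below assumes the standard sample-size formula for a proportion (with the most conservative choice p = 0.5) and a finite population correction, and it takes the approximate population of 40,000 citizens mentioned above. The function name and its defaults are hypothetical.

```python
import math


def sample_size(population, margin, z=1.96, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    z=1.96 corresponds to a 95% confidence level and p=0.5 is the most
    conservative choice; both defaults are assumptions of this sketch.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


n = sample_size(population=40_000, margin=0.025)
print(n)
```

With these inputs the function returns 1,480, which is close to the 1,488 accounts reported. The small gap is compatible with the population figure being only approximate, but the exact procedure followed by the researchers is not known from the text.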

Finally, the HUM-HYB dataset is a random sample of genuine (human-operated) accounts. Following a hybrid crowdsensing approach [15], we randomly contacted Twitter users by asking them a simple question in natural language. All the replies to our questions were manually verified and all the 3,474 accounts that answered were certified as legitimate. The accounts that did not answer our questions were discarded and are not used in this study. Since all questions have been asked in English, almost all accounts of the HUM-HYB dataset are English-speaking.


1.4.3 Twitter datasets of social sensors observations

Besides the analysis of the characteristics of social sensors, we are also interested in studying the observations that social sensors share. In other words, differently from the previous section that describes datasets of Twitter accounts, in this section we describe how we built datasets comprising the observations (i.e., the tweets) that such accounts share. Exploiting social sensors observations can be particularly profitable in the aftermath of disasters, in order to support emergency management procedures, which is among the goals of this thesis.

The information that Twitter discloses about tweets is described in a specific page of the developers documentation¹⁴. The information provided includes: the unique identifier (ID) of the tweet, its textual content, the links to possible multimedia (images/videos) content included in the tweet, the publication timestamp, the author of the tweet (ID and name), and more. The data collection process for social sensors tweets has been carried out with a specific focus on emergencies. This allowed us to obtain targeted datasets, suitable for studying the information shared in the aftermath of notable emergencies. Twitter offers two main methods for collecting tweets about specific topics: (i) the Search API, and (ii) the Streaming API. Twitter’s Streaming API¹⁵ gives access to a global stream of tweets, optionally filtered by search keywords. This API opens a persistent connection to Twitter, allowing all newly shared tweets matching the search criteria to be delivered in quasi-real time¹⁶. Notably, the Streaming API does not allow collecting tweets shared in the past – that is, before the connection with Twitter is established. This API is therefore suitable for long-term monitoring of specific phenomena, or for monitoring predictable events that are known in advance. For collecting tweets about an event that has already happened, Twitter provides the Search API¹⁷. Twitter’s Search API is a REST API that accepts a list of search keywords as input and returns the list of tweets that match the search criteria and that have been shared no more than 1 week in the past. The 1-week limitation makes the Search API unsuitable for collecting data about historic events (i.e., important events that happened years ago). However, in the case of emergency management via social sensing, it is important to study the information shared in the aftermath of severe historic disasters that occurred in the age of social media. For this reason, we relied on data resellers (e.g., GNIP’s Historical API¹⁸) in order to collect data about relevant disasters that occurred years in the past.
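The choice among the three collection channels described above depends only on when the event took place relative to the start of the collection. A minimal Python sketch of this decision follows. The function name and the returned labels are hypothetical, no Twitter endpoint is called, and the 7-day window simply mirrors the 1-week limit mentioned in the text.

```python
from datetime import datetime, timedelta

SEARCH_WINDOW = timedelta(days=7)  # the 1-week limit of the Search API described above


def collection_method(event_time, now):
    """Pick a collection channel for tweets about an event (illustrative only)."""
    if event_time >= now:
        return "streaming"   # event known in advance: open the stream beforehand
    if now - event_time <= SEARCH_WINDOW:
        return "search"      # recent event: past tweets are still searchable
    return "historical"      # older event: data must come from a reseller


# Made-up dates, for illustration only.
now = datetime(2017, 1, 10)
upcoming = collection_method(datetime(2017, 1, 12), now)
recent = collection_method(datetime(2017, 1, 8), now)
old = collection_method(datetime(2009, 4, 6), now)
print(upcoming, recent, old)
```

The sketch treats the three channels as mutually exclusive for simplicity. In practice a recent event could be covered by combining a search for the past week with a stream for newly shared tweets.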

All the aforementioned APIs are designed so as to deliver sets of tweets that match user-specified search keywords. We exploited a different set of search keywords for every different disaster that we investigated, in order to collect the most relevant tweets about it. Whenever possible, we resorted to hashtags specifically created to share re-ports of a particular disaster, such as the #allertameteoSAR hashtag for the Sardinia floods of 2013. In this way, we were able to select only tweets actually related to that disaster. However, for historical disasters we couldn’t rely on specific hashtags and we had to exploit generic search keywords already proposed in literature [22, 204]. For example, this is the case of the 2009 L’Aquila earthquake, for which we exploited the “terremoto” (earthquake) and “scossa” (tremor) Italian keywords. As a final remark,

¹⁴ https://dev.twitter.com/overview/api/tweets
¹⁵ https://dev.twitter.com/streaming/overview
¹⁶ The typical latency in acquiring tweets from the Streaming API is in the range of a few seconds.
¹⁷ https://dev.twitter.com/rest/public/search


| dataset | location | language | accounts | tweets | year |
|---|---|---|---|---|---|
| earthquakes | | | 11,841 | 16,456 | – |
| EAQ-LAQ | L’Aquila, Italy | IT | 563 | 1,062 | 2009 |
| EAQ-CHR | Christchurch, New Zealand | EN | 1,438 | 1,998 | 2011 |
| EAQ-EML | Emilia, Italy | IT | 2,761 | 3,170 | 2012 |
| EAQ-AMA | Amatrice, Italy | IT | 7,079 | 10,226 | 2016 |
| floods | | | 862 | 1,410 | – |
| FLO-SAR | Sardinia, Italy | IT | 597 | 976 | 2013 |
| FLO-GEN | Genoa, Italy | IT | 265 | 434 | 2014 |
| power outages | | | 163 | 391 | – |
| PWO-MIL | Milan, Italy | IT | 163 | 391 | 2013 |

Table 1.2: Statistics about the social sensors observations datasets used in the remainder of this thesis.

we only used “fresh” data shared in the aftermath of the disasters under investigation. For instance, all the 3,170 tweets that we collected about the 2012 Emilia earthquake were posted within 24 hours of the earthquake. Lastly, to investigate a wide range of situations we picked disasters with varying degrees of severity: some caused only moderate damage, while others produced widespread damage and hundreds of deaths. Table 1.2 reports detailed statistics about the social sensors observations datasets used for emergency management in the remainder of this thesis. Details about the specific disasters covered by the datasets are reported in the following.
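The keyword-based selection and the “fresh data” criterion described above can be sketched as a simple filter. This is only an illustrative sketch, not the pipeline used to build the datasets: the function name `select_tweets`, the dictionary layout of a tweet (`text`, `created_at`), and the sample tweets are all assumptions made for the example. The 24-hour window mirrors the Emilia example and is not necessarily the window applied to every dataset.

```python
from datetime import datetime, timedelta


def select_tweets(tweets, keywords, event_time, window=timedelta(hours=24)):
    """Keep tweets that mention a keyword and were posted within `window` after the event.

    Each tweet is assumed to be a dict with a "text" string and a
    "created_at" datetime (hypothetical layout).
    """
    lowered = [k.lower() for k in keywords]
    selected = []
    for tweet in tweets:
        text = tweet["text"].lower()
        age = tweet["created_at"] - event_time
        is_fresh = timedelta(0) <= age <= window
        is_relevant = any(k in text for k in lowered)
        if is_fresh and is_relevant:
            selected.append(tweet)
    return selected


if __name__ == "__main__":
    # Approximate main shock time of the 2009 L'Aquila earthquake (around 3 a.m.).
    event = datetime(2009, 4, 6, 3, 0)
    # Invented sample tweets, for illustration only.
    sample = [
        {"text": "Forte scossa a L'Aquila", "created_at": datetime(2009, 4, 6, 3, 40)},
        {"text": "Buongiorno a tutti", "created_at": datetime(2009, 4, 6, 8, 0)},
        {"text": "Ricordando il terremoto", "created_at": datetime(2009, 4, 9, 10, 0)},
    ]
    for tweet in select_tweets(sample, ["terremoto", "scossa"], event):
        print(tweet["text"])
```

Plain substring matching is the simplest possible choice here. Generic keywords such as “scossa” can also match unrelated tweets, which is why dedicated hashtags are preferable when they exist.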

Earthquakes

Since the emergency management part of this thesis has been carried out in collaboration with the Italian National Institute for Geophysics and Volcanology (INGV), the authority responsible for monitoring seismic events in Italy, the majority of the datasets of this section are related to earthquakes. In detail, the EAQ-LAQ dataset comprises Italian tweets shared in the aftermath of the 2009 L’Aquila earthquake¹⁹. The earthquake struck a wide region in Central Italy, with the main shock occurring at around 3 a.m. local time on 6 April 2009. The earthquake has been assigned a moment magnitude of 6.3 and caused over 300 deaths, about 1,500 injured, over 65,000 homeless refugees, and widespread damage. Tweets of this dataset have been collected by resorting to the historical paid services of Twitter data resellers.

The EAQ-CHR dataset is related to the devastating earthquake that occurred on 22 February 2011 at around 12 p.m. local time in Christchurch, New Zealand²⁰. The earthquake registered 6.3 on the Richter magnitude scale and caused widespread damage across Christchurch, killing 185 people in what has been the nation’s fifth-deadliest disaster. EAQ-CHR is the only emergency-related dataset of this thesis comprising English tweets. It has been included in this work because it has already been widely studied in previous works and, as such, it allows for comparisons with those approaches. We obtained this dataset from the authors of [107, 108, 173], who used it in their works.

¹⁹ https://en.wikipedia.org/wiki/2009_L%27Aquila_earthquake
²⁰ https://en.wikipedia.org/wiki/2011_Christchurch_earthquake


Then, the EAQ-EML dataset contains Italian tweets about the earthquakes that struck the Emilia Romagna regional district in Italy on 20 May 2012 starting from 4 a.m. local time²¹. In particular, 3 strong earthquakes occurred in a time span of a few hours near the villages of Finale Emilia, Bondeno, and Sermide, in a wide rural area. The strongest of these earthquakes has been assigned a magnitude of 6.1. They caused 7 deaths and severe damage to many of the ancient, rural buildings of the area. Data for the EAQ-EML dataset have been collected via the historical paid services of Twitter data resellers.

Finally, the recent Central Italy earthquake of 2016²² is covered by the EAQ-AMA dataset. The dataset comprises Italian tweets shared in the aftermath of the magnitude 6.0 earthquake that occurred near the village of Accumoli and the city of Amatrice on 24 August 2016, at 3:30 a.m. local time. Overall, the earthquake caused 299 deaths and widespread damage. Tweets about this disaster have been collected via a combination of the Streaming and the Search APIs.

Floods

In addition to earthquakes, we also built datasets to study emergency communications in the case of floods and power outages. Regarding floods, we covered 2 relevant events that occurred in Italy in 2013 and 2014. Specifically, the FLO-SAR dataset is related to the floods that occurred in the Sardinia regional district between 17 and 19 November 2013, as a consequence of the Cleopatra extratropical cyclone (named Ruven by the Free University of Berlin)²³. The floods mainly occurred in the northern part of the island of Sardinia, around the city of Olbia, but also affected minor areas in central and southern Sardinia. Overall, the disaster caused 18 deaths and left more than 3,000 people homeless. The Italian tweets of the FLO-SAR dataset have been collected via a combination of the Streaming and the Search APIs.

The FLO-GEN dataset contains Italian tweets collected during and in the aftermath of the floods that occurred near the city of Genoa between 9 and 11 October 2014²⁴. The floods were caused by torrential rain and resulted in 1 death and several damaged buildings in Genoa. Tweets of this dataset have been collected via the Streaming API.

Power outages

The PWO-MIL dataset is related to a power outage (i.e., a blackout) that occurred in the city of Milan, in northern Italy, in the night between 14 and 15 May 2013. Despite not causing any serious consequences, the blackout has been extensively discussed on Twitter. This dataset is also interesting since it has been used to evaluate the crisis mapping system presented in [173]. As such, it provides grounds for a sound comparison between the techniques developed in this work and those used in [173]. The Italian tweets that make up the PWO-MIL dataset have been shared with us by the original authors of [173].

²¹ https://en.wikipedia.org/wiki/2012_Northern_Italy_earthquakes
²² https://it.wikipedia.org/wiki/Terremoto_del_Centro_Italia_del_2016_e_del_2017
²³ https://en.wikipedia.org/wiki/2013_Sardinia_floods
