
Automatic analysis and interpretation of multimedia user generated content for emergency management


UNIVERSITÀ DI UDINE
Dottorato di Ricerca in Comunicazione Multimediale, Ciclo XXVIII

Ph.D. Thesis

AUTOMATIC ANALYSIS AND INTERPRETATION OF MULTIMEDIA USER GENERATED CONTENT FOR EMERGENCY MANAGEMENT

Supervisor: Prof. Gian Luca Foresti
Assistant Supervisor: Prof. Christian Micheloni
Candidate: Marco Vernier

April 2016

Reviewer 1: Prof. Sergio Canazza
Reviewer 2: Prof. Alfredo Gardel Vicente

Day of the defense: April 22, 2016

Head of PhD committee: Prof. Leopoldina Fortunati

Abstract

Traditional Emergency Management Systems (EMS) are mainly focused on the institutional warning response and do not fully exploit the active participation of the citizens involved. In the case of emergency events, citizens are usually considered as people to be rescued rather than active participants. Nowadays the widespread adoption of digital media and the production of content by ordinary people have marked a significant change in the study of the disaster context and have allowed analysis of the event from a completely new perspective: that of the citizens involved. Thanks to the use of blogs, social networking sites, and video/photo-sharing applications, a large number of citizens are able to produce, upload and share content related to the impact of a disaster, the emergency response, the search and rescue operations, the restoration phase, etc. All this social content can be exploited to provide a more accurate situational awareness of the event from below, in addition to the traditional EMS. This thesis presents a Smart Multimedia User Generated Content Retrieval system (SMR) expressly conceived for event detection and situational awareness applications. Based on state-of-the-art clustering algorithms, it is able to locate an event and extract the most significant multimedia content. Contrary to existing EMS, the proposed SMR system analyses not only the textual content posted by users during an event, but also the visual content. To perform such a task, specific computer vision algorithms are exploited to evaluate images retrieved from social platforms. Retrieved images are then displayed to emergency operators through a user-friendly graphical interface. The system was tested on over 60 events that occurred in 2015, during which more than 130K images were retrieved and analysed by the proposed SMR system. The results are very promising and show the feasibility and value of the proposed SMR system.

Contents

List of Figures

1 Introduction
  1.1 Background and Problem Statement
  1.2 Contribution of the Thesis
  1.3 Organization of the Thesis

2 State of the art
  2.1 Introduction
  2.2 Emergency Management Systems
    2.2.1 Traditional Emergency Management Systems
    2.2.2 Smart Emergency Management Systems
    2.2.3 Social Emergency Management Systems
  2.3 ASyEM: an Advanced System for Emergency Management
    2.3.1 Analysis of User Generated Content on Social Platforms
  2.4 The use of Social Platforms for Events Detection
    2.4.1 The role of Twitter in supporting emergency operations
  2.5 Data mining on Twitter Social Platform
    2.5.1 Twitter Situational Awareness System classification
      2.5.1.1 Semantic Systems
      2.5.1.2 Meta-data systems
      2.5.1.3 Smart Self-Learning Systems
  2.6 Data Mining on Instagram

3 System architecture
  3.1 Introduction
  3.2 A smart system for multimedia data analysis
  3.3 The system interface

4 MUGC Data Extraction
  4.1 Introduction
  4.2 MUGC Data Extraction overview
    4.2.1 MUGC Data Extraction implementation
    4.2.2 The Crawler
    4.2.3 The Graph database
    4.2.4 Tweets analyser
      4.2.4.1 Geographic clustering (GC)
      4.2.4.2 Trend Hashtag Computation (THC)
  4.3 Running example
    4.3.1 Evaluation protocol
    4.3.2 Dataset
    4.3.3 Tests on real data

5 MUGC Data Analyser
  5.1 Introduction
  5.2 MUGC Data Analyser overview
  5.3 Image Classifier
  5.4 Image Analyser
    5.4.1 Panoramic Image Builder
    5.4.2 Image Matching
      5.4.2.1 Feature extraction
      5.4.2.2 Feature comparison
    5.4.3 Features matching
    5.4.4 Decision process

6 Experimental Results
  6.1 Introduction
  6.2 General overview of the retrieved data
  6.3 MUGC Data Extraction evaluation
    6.3.1 Running example results
  6.4 MUGC Data Analyser evaluation
    6.4.1 French Riviera Floods
    6.4.2 Expo Milano Clashes
    6.4.3 Hurricane Patricia
    6.4.4 Terrorist attacks in Paris
  6.5 General results discussion

7 Conclusions and Future Work

References


List of Figures

1.1 Active users on social platforms in 2015 (source: http://wearesocial.net/tag/statistics/).
1.2 Daily number of photos uploaded and shared on the most diffused social platforms (source: http://recode.net/).
2.1 An example of a Traditional Emergency Management System.
2.2 An example of a Smart Emergency Management System: it uses heterogeneous sensors located at specific locations.
2.3 An example of a Social Emergency Management System.
2.4 ASyEM: the Advanced System for Emergency Management proposed by Foresti et al. [2014b].
2.5 An example of tweets posted on Twitter during the 2013 Wellington earthquake.
2.6 A potential application of a system to extract data from the Twitter stream during an emergency event.
2.7 The architecture proposed by Twitcident [Abel et al., 2012].
2.8 An example of a smart self-learning system.
3.1 Logical architecture of the smart system for automatic analysis and interpretation of MUGC for emergency management.
3.2 The sub-modules which compose the SMR system.
3.3 An example of a panoramic image view retrieved by the Event-Analyser sub-module from the Google Street View database.
3.4 An overview of the search configuration page, which is composed of several sections.
3.5 The keywords section.
3.6 Through the Google Map the operator can select the specific location where to start the search; through the slider picker the radius of the entire geographical area can be regulated.
3.7 Through the duration time section, the operator can set the total time period of a search.
3.8 An overview of the "Other Options" section.
3.9 An overview of the page which includes the list of searches.
3.10 The three icons included on each list item.
3.11 An example of charts in the statistics page.
3.12 The tag-cloud and the bar chart used to show the trend keywords found during the search.
3.13 An overview of the page which includes all retrieved posts related to a specific search.
3.14 The checkboxes that can be used to filter the results.
3.15 An overview of the proposed graphical interface where emergency operators can monitor the geographical area of the event.
4.1 The architecture of the MUGC Data Extraction module, composed of three main components: the crawler software, the tweets analyser and the graph database.
4.2 An example of keywords detected by the spider software on the Twitter stream.
4.3 Example of the bounding box obtained using two pairs of latitude and longitude coordinates.
4.4 Input hashtag selection: the most popular hashtags adopted by Twitter users during previous flood events in Italy are computed to individuate the K set of input hashtags.
4.5 A map showing the different hierarchy levels of the SMR system.
4.6 The proposed trend hashtag computation.
5.1 An overview of the MUGC Data Analyser, composed of two main sub-modules: the Image Classifier and the Image Analyser.
5.2 An overview of the Neural Tree algorithm used for the disaster image classifier.
5.3 An example of images retrieved from Twitter during the Hurricane Patricia event. The left and centre images are not strictly connected to the event itself; the right image describes the given event and what is happening in that specific area.
5.4 An overview of the image analyser sub-module, which identifies whether an image was taken at a specific location.
5.5 An overview of the panoramic image builder component used to retrieve the panoramic image view of a given location from Google Street View.
5.6 An example of an HTTP GET request to the Google Street View server.
5.7 The proposed image matching procedure for detecting meaningful images for a given event.
5.8 On the left, the image convolved with the Gaussian kernel to form sets of octaves. On the right, the difference-of-Gaussian (DoG) images, obtained by subtracting adjacent Gaussians. At the end of each octave, the Gaussian image is sub-sampled by a factor of 2 and the process is reiterated [Lowe, 1999].
5.9 Localization of local maxima and minima.
5.10 Descriptor construction starting from the gradient orientations in the neighbourhood of the point of interest.
5.11 Features extracted using SIFT descriptors.
5.12 On the left, the image is sub-sampled and up-scaled by SIFT. On the right, with the SURF descriptor, the use of integral images allows the filter size to increase while the image size is kept constant.
5.13 Representation of the orientation calculation: within the sliding window (in blue) the values of all Haar wavelet responses are summed.
5.14 The computation of a SURF descriptor.
5.15 Features extracted using SURF.
5.16 Scale-space interest point detection [Leutenegger et al., 2011].
5.17 BRISK sampling pattern with N = 60 points [Leutenegger et al., 2011].
5.18 Features extracted using BRISK descriptors.
5.19 Computation methods proposed by the ORB descriptor.
5.20 Orientation of the point of interest.
5.21 Features extracted using the ORB descriptor.
5.22 A comparison of the feature descriptors used to extract significant keypoints from images posted by users on social platforms during a given event.
5.23 Feature matching comparison using SIFT (left images) and SURF (right images) descriptors.
5.24 An example of false matching: although the images are completely different, the feature matching process detects valid matches.
5.25 Histograms obtained by analysing two matched images.
6.1 Overview of results obtained by testing the system with 60 searches.
6.2 The most used devices for tweeting.
6.3 Data retrieved by searches performed on four specific events.
6.4 Trend hashtags computed for each level of the system.
6.5 The data retrieved by the system during the French Riviera flood event.
6.6 Experimental results obtained by analysing four specific events that occurred in 2015.
6.7 Geo-located images retrieved during the flooding events that occurred in the French Riviera.
6.8 Results obtained by analysing images retrieved from the French Riviera search.
6.9 An example of images recognized as meaningful by both SIFT and SURF descriptors.
6.10 Some examples of spam images detected by the matching module: the number of matches is lower than the thp threshold and the dissimilarity is over the thd threshold.
6.11 An example of false-positive images: the system detected them as valid since the number of matches and the dissimilarity measures comply with the imposed thresholds.
6.12 Geo-located images retrieved during the clashes that occurred in Milan during Expo 2015.
6.13 Results obtained by matching the images with SIFT (left bar chart) and SURF (right bar chart) descriptors using the standard thp and thd thresholds.
6.14 Results obtained using the SURF descriptor while reducing the thp value from 4 (left bar chart) to 3 (right bar chart) with the goal of retrieving more valid images.
6.15 An example of valid images obtained during the NoExpo clashes in Milan.
6.16 An example of images evaluated as valid by the proposed matching algorithms despite their high level of dissimilarity.
6.17 An example of spam images detected by the image analyser.
6.18 The images retrieved by the MUGC Data Extraction module during Hurricane Patricia in Mexico.
6.19 Results obtained by the analysis of images retrieved during Hurricane Patricia.
6.20 An example of images retrieved during Hurricane Patricia which were taken in low lighting conditions.
6.21 The first fifty images retrieved on Twitter during the terrorist attacks in Paris.
6.22 Results obtained by matching the images using the SIFT descriptor.
6.23 Results obtained by matching the images using the SURF descriptor.
6.24 An example of images retrieved from Twitter during the terrorist attacks which could not be matched by the system due to their low quality.


1 Introduction

1.1 Background and Problem Statement

In the last few years, social media have grown in popularity, with millions of users who produce and share digital content online every day, the so-called User-Generated Content (UGC). While in the nineties, with the advent of the web, sharing digital content was feasible only for users with computer science skills able to use some web programming language, nowadays, in the era of web 2.0, anyone with a computer, a smartphone or a tablet can create and share digital content online. The growth of digital content sharing across the web, and in particular on social platforms, shows how the use of the Internet is evolving, driven by the rapid technological progress of the last few years. Social media platforms are considered the most widespread computer-mediated tools that allow people to create and exchange information. They include different types of services, such as podcasts, blogs, social networks, email, texting and Internet forums. Examples are now countless, although among the most popular there are Facebook, Twitter, Google+ and Instagram, with an average of 500 million active accounts in the first quarter of 2015 (Figure 1.1). Due to the constant presence of these services in users' lives, social networks have a decidedly strong social impact. Every day, all social media services generate a sheer volume of UGC that consists of social interactions and information such as comments, tweets, personal data, images and videos.

Figure 1.1: Active users on social platforms in 2015 (source: http://wearesocial.net/tag/statistics/).

These interactions create a very complex bundle of information (big data) spread all over the Internet, which contains very complex network structures. These structures can be extracted and analysed in order to provide an insight into these networks and help to detect their main key features. As reported in Meeker & Wu [2013], every day 1.8 billion images are published on the web, of which 350 million on Facebook (Figure 1.2). Comparing these data with those from past years, we can see that the number of images uploaded on social networks has increased exponentially. As an example, in 2011 the number of uploaded images did not exceed 300 million per day. Social media platforms have been built from the beginning to be socially used, oriented around collaboration and sharing. These potentialities are emphasised in extraordinary contexts, when ordinary people adopt these tools to provide or search for first-hand and real-time information regarding a certain event (e.g., an emergency event). The most recent catastrophic events, from the 2010 Haiti earthquake to the devastating 2013 Colorado floods, have shown that these platforms are strongly used both during and after disasters, allowing a real-time dissemination of information to the wider public, an effective situational awareness and an up-to-date picture of what is happening on the ground [Farinosi & Micalizzi, 2013], [White et al., 2014]. According to Kotsiopoulos [2014], in extraordinary situations social media enable citizens to play at least three roles: i) first responders/volunteers; ii) citizen journalists/reporters; and iii) social activists.

Figure 1.2: Daily number of photos uploaded and shared on the most diffused social platforms (source: http://recode.net/).

Oftentimes individuals experiencing the event first-hand are on the scene of the disaster and are able to provide updates more quickly than disaster response organizations and traditional news media [Sweetser & Metzgar, 2007], [Procopio & Procopio, 2007]. Given the increasing availability of data and meta-data produced and/or distributed by users on social platforms, it is important to understand how these data can be exploited for event detection purposes, thus integrating them with the existing traditional Emergency Management Systems (EMS). Social EMS are those which better consider the data produced by citizens to support emergency operators (Civil Protection, Red Cross, Fire Department) during a given emergency event [Vernier et al., 2015]. The real-time UGC produced creates an ever-growing heap of information that can be exploited in order to gather knowledge and to obtain relevant information about the event. In the literature there are different types of social EMS which analyse the textual content shared by users in order to locate the area where a given emergency event is occurring (see Chapter 2). Nevertheless, one of the main limitations of these systems is that they do not consider the multimedia content (images, video, etc.) shared by users during an event. Indeed, this content could be a useful resource for emergency operators to obtain a better overview of what is happening in that area, thus organizing the rescue operations in a more accurate way. As reported in Johnston & Marrone [2009], Bruns et al. [2013], during an emergency event users share on social platforms significant images regarding the event itself.

In order to extract and analyse the multimedia content shared by users, some constraints need to be investigated, considering the crowded scenario of social networks. Social platforms are heterogeneous, both in how they are used and in how digital content is shared. For example, Twitter allows the user to communicate with 140-character messages, while Instagram allows sharing only images or videos combined with a brief description. Other constraints come from the nature of user profiles, which can be public or private. This means that if a user's profile is set as private, his/her shared content is not available either to other users (belonging to the same social platform) or to third-party applications that would access this content. Moreover, each social platform includes its own Application Programming Interfaces (API) and services which allow the extraction of content generated by users. While the above-mentioned limitations are in most cases connected to the logical and physical architecture of a social platform, other limitations are related to the truthfulness of the content shared by users. Fake news reports on the Internet are a phenomenon that is becoming more frequent with each passing day. Social media are increasingly filled with purposely falsified content which in most cases is out of context. In particular, images represent the most typical media content to be falsified. Even before the advent of Twitter or Photoshop, images were in many cases modified in order to manipulate public information; and nowadays, with social platforms, it is easy for anyone with a PC or a smartphone to spread fake images all over the web. An example is the recent terrorist attack in Paris at the Charlie Hebdo offices: several fake images were spread in order to deceive people about the truthfulness of the event. Since it is not always possible to recognise fake images, several manual methods can be adopted to overcome this problem. For example, relevant information can be obtained by consulting the EXIF data attached to each image (the information reported in the image properties). Other methods exploit Google Images in order to retrieve similar photos, thus understanding whether the supposedly fake images already exist or have been manipulated. However, robust state-of-the-art methods able to recognize fake images do not exist. In particular, considering the potential offered by UGC on social platforms, there is currently no system able to extract in real time the multimedia content generated by users and to recognize fake images, or at least to evaluate whether an image comes from the area where it is claimed to have been taken.
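As a rough illustration of the EXIF check mentioned above, the sketch below reads the GPS tags embedded in a JPEG so that the position declared in a post can be compared with the position stored in the photo itself. It assumes Python with the Pillow library, which is only one possible choice and is not prescribed by the thesis; the file name is hypothetical.

```python
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def rational(x):
    # Pillow may expose EXIF rationals as IFDRational objects or (num, den) pairs.
    return x[0] / x[1] if isinstance(x, tuple) else float(x)

def exif_gps(path):
    """Return (lat, lon) read from a JPEG's EXIF block, or None if absent."""
    exif = Image.open(path)._getexif() or {}
    # Map numeric EXIF tag ids to their human-readable names.
    named = {TAGS.get(k, k): v for k, v in exif.items()}
    gps_raw = named.get("GPSInfo")
    if not gps_raw:
        return None
    gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}
    try:
        # Degrees, minutes, seconds converted to decimal degrees.
        lat = sum(rational(v) / f for v, f in zip(gps["GPSLatitude"], (1, 60, 3600)))
        lon = sum(rational(v) / f for v, f in zip(gps["GPSLongitude"], (1, 60, 3600)))
        if gps.get("GPSLatitudeRef") in ("S", b"S"):
            lat = -lat
        if gps.get("GPSLongitudeRef") in ("W", b"W"):
            lon = -lon
    except KeyError:
        return None
    return lat, lon

if __name__ == "__main__":
    # Hypothetical file downloaded from a post; prints None if no GPS tags are present.
    print(exif_gps("photo_from_tweet.jpg"))
```

Note that many social platforms strip EXIF blocks on upload, which is one reason why such a manual check alone is not sufficient.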

1.2 Contribution of the Thesis

Current state-of-the-art social EMS are limited to analysing the textual content shared by users on social platforms. In particular, they perform a semantic analysis of the shared text in order to detect the geographical area where a given event is occurring. The main contribution of this thesis is a Smart Multimedia User Generated Content Retrieval system (SMR) expressly conceived for event detection and situational awareness applications. SMR allows the analysis of the multimedia content attached to each post, with the main goal of selecting only the most relevant content, which better describes the given event. Based on state-of-the-art clustering algorithms, the system aggregates posts, which are then analysed in order to detect the event of interest. The system also includes an advanced image analyser module in charge of identifying only the most meaningful images, i.e. those which best describe the given event; other images are automatically discarded. This module exploits the GPS coordinates of each post in order to retrieve a panoramic image from Google Street View. The panoramic image is then compared with the image retrieved from the specific social platform in order to find valid matches. To perform such a task, specific computer vision algorithms are exploited to perform an accurate image analysis based on feature extraction and matching procedures (a minimal sketch of such a matching step is given at the end of this chapter). Finally, to give the emergency operators a more precise and accurate situational awareness of the event, the system is able to show, through a user-friendly graphical user interface, the images attached to each post combined with a panoramic image of the area where each image has been taken.

1.3 Organization of the Thesis

The thesis is structured as follows. In Chapter 2 we give a brief literature review of current state-of-the-art EMS. In Chapter 3 the logical and physical architecture of the proposed SMR system is presented. In the following chapters, each module which composes the entire system is described: in particular, Chapters 4 and 5 explain the MUGC Data Extraction and the MUGC Data Analyser modules. In Chapter 6 experimental results are presented in order to show the efficiency and reliability of the proposed system. Finally, in Chapter 7, conclusions and possible future directions are drawn.
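The sketch below illustrates the kind of matching step described in Section 1.2: a posted photo is compared with a Street View crop of the claimed location. The thesis evaluates SIFT, SURF, BRISK and ORB descriptors with its own thp/thd thresholds (see Chapter 5); here ORB with a brute-force Hamming matcher in OpenCV is used only as one concrete stand-in, and the file names and match threshold are illustrative assumptions rather than the thesis's configuration.

```python
import cv2

def count_matches(post_image_path, panorama_path, max_distance=60):
    """Count cross-checked ORB matches between a posted image and a
    Street View crop; many matches suggest the photo really shows the
    claimed location."""
    img_a = cv2.imread(post_image_path, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(panorama_path, cv2.IMREAD_GRAYSCALE)
    if img_a is None or img_b is None:
        return 0

    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0

    # Hamming distance suits ORB's binary descriptors; cross-checking
    # keeps only mutually best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    return sum(1 for m in matches if m.distance < max_distance)

if __name__ == "__main__":
    # Illustrative decision threshold; the thesis tunes its own thp/thd values.
    if count_matches("tweet_photo.jpg", "street_view_tile.jpg") >= 10:
        print("image plausibly taken at the reported location")
```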


2 State of the art

2.1 Introduction

In the last decade, progress in low-cost, high-performance computing networks and digital communications on heterogeneous, mobile and fixed broadband networks [Hofstee, 2005], [Pande et al., 2005], [Kim, 2009], [Abad et al., 2012] has supported the development of innovative systems for emergency management. The last few years have seen an explosive growth in the adoption of social media in all kinds of catastrophic events, from the 2010 Haiti earthquake to the 2015 hurricane Patricia. Large-scale use of web 2.0 platforms and, in particular, of socio-mobile applications by ordinary people allows the generation of substantial information, which represents a great opportunity for emergency management stakeholders and agencies [Meier, 2013]. To valorise and exploit grass-roots data, it is crucial to design advanced architectures able to collect, select, process and integrate data produced by citizens with data acquired by sensors already present in the environment, in order to support institutions when responding to a specific event. Nevertheless, there is a need to combine multimodal real-time big data into actionable situations, specifically to model and recognize emergency events [Singh et al., 2012]. The chapter is structured as follows: in section 2.2 a classification of Emergency Management Systems is presented. Then, in section 2.3, ASyEM, an advanced system for emergency management, is described. In section 2.4 some examples of systems which analyse UGC on social platforms for event detection purposes are shown.

In particular, in section 2.5 the most important systems based on Twitter are presented. Finally, in section 2.6 a brief overview of data mining on the Instagram social platform is given.

2.2 Emergency Management Systems

In the literature, there are several Emergency Management Systems (EMS) able to monitor and manage rescue operations in the aftermath of a disaster event. Considering the input data and the logical architecture, these systems can be mainly classified into three categories: i) Traditional Emergency Management Systems; ii) Smart Emergency Management Systems; iii) Social Emergency Management Systems [Foresti et al., 2014a].

2.2.1 Traditional Emergency Management Systems

Traditional Emergency Management Systems are those systems that do not make use of sensors to monitor the scene during and after a disaster event (Figure 2.1). In many cases, the alarm is raised by citizens or public operators through traditional communication systems such as landlines or mobile phones. Today these systems are still in use, although they present strong limitations. In some cases, the information provided to the rescue personnel may be limited, ambiguous and imprecise, since it is spread by people who have experienced emotional shock. In addition, Traditional EMS, given that they are not based on sensors able to monitor a certain area or situation, can be activated only after the event has happened, through a call from the affected area to the fire brigade or Civil Protection that initiates rescue operations.

Figure 2.1: An example of a Traditional Emergency Management System.

2.2.2 Smart Emergency Management Systems

The second category of EMS is represented by the Smart EMS. These innovative systems are based on a network of advanced smart sensors (e.g., PTZ cameras, audio sensors, etc.) able to automatically monitor a certain area during and after the disaster event and transmit relevant environmental data to a remote unified operative centre (Figure 2.2). In the literature, there are several examples of smart EMS. For instance, a software infrastructure developed in 2004, called CodeBlue, integrates different devices, such as wearable vital sign monitors, location-tracking tags and handheld computers, and allows wireless monitoring and tracking of patients and first responders [Lorincz et al., 2004]. In 2007, an emergency communication network platform named DUMBONET was designed for collaborative simultaneous emergency response operations [Kanchanasut et al., 2007]. It is based on the integration of a Mobile Ad Hoc Network (MANET) and a satellite IP network operating with the traditional terrestrial Internet, and can be deployed in a number of disaster-affected areas. In 2009, a rescue information system for earthquake disasters, able to support a large number of rescues under catastrophic natural disasters, was realized by Jang et al. [2009]. It aims to overcome infrastructure network problems that can paralyse entire communication systems, as occurred in the Jiji (Taiwan) earthquake. Another interesting system is AELPS (Artificial Emergency-Logistics-Planning System), which has been created to help governments and disaster relief agencies to prepare for and manage severe disasters [Li & Tang, 2008]. It is based on a complex computational platform that generates logistics phenomena during disaster relief and gives intuitive results that can be used in emergency-logistics planning. In addition, there is SoKNOS [Paulheim et al., 2009], which allows operations with various and heterogeneous information sources and enables emergency organizations to collaborate in an efficient way. It combines different methods for visual analysis and data aggregation that work on a consistent information basis and address a large range of tasks relevant for emergency management, such as information integration, visualization and interaction. Furthermore, Brunner et al. [2009] presented a system able to collect geospatial feature data from distributed sources and integrate them to support a collaborative and rapid emergency response. The system enables rapid crowdsourced mapping and supports customized on-demand image processing and geospatial data queries.

2.2.3 Social Emergency Management Systems

Finally, the third category is represented by the Social EMS. These systems include those infrastructures mainly based on the so-called bottom-up approach (Figure 2.3).

Figure 2.2: An example of a Smart Emergency Management System: it uses heterogeneous sensors located at specific locations.

As already mentioned above, nowadays, thanks to the use of blogs, social networking sites, and video/photo-sharing applications, a large number of citizens are able to produce, upload and share content (UGC) related to the impact of the disaster, the emergency response, the search and rescue operations, the restoration phase, etc. This new phenomenon constitutes a real paradigm shift in the existing literature because usually, in the case of disasters, citizens have always been considered as people to be rescued rather than active participants. Social media instead offer opportunities for two-way dialogue and interaction between citizens and emergency organizations [Bortree & Seltzer, 2009]. Furthermore, especially when official sources provide relevant information too slowly [Spiro et al., 2012], people turn to social media in order to obtain time-sensitive and unique information [Kavanaugh et al., 2012], [Kodrich & Laituri, 2011], [Sutton et al., 2008], [Riva & Galimberti, 1998], [Stephens & Malone, 2009]. As explained by Fraustino et al. [2012], oftentimes individuals experiencing the event first-hand are on the scene of the disaster and can provide updates more quickly than traditional news sources and disaster response organizations. In this sense some scholars have used the definition of "citizens as sensors" [Goodchild, 2007], [Schade et al., 2010], as non-specialist creators of geo-referenced information who contribute to crisis situational awareness. Previous studies have shown that socio-mobile applications in emergency contexts can be useful not only to facilitate the search for information, but also to maintain a sense of community and human contact [Farinosi & Micalizzi, 2013]. Moreover, this kind of application can help people to organize emergency relief and self-mobilize from both near and afar [Starbird & Palen, 2010], [Starbird & Palen, 2011], [Starbird & Palen, 2012], [Farinosi & Trerè, 2014].

Quite a bit is known about the perspectives of emergency managers toward social media and their expectations. A recent report realised by CNA (http://www.cna.it/) in partnership with NEMA highlights that emergency management agencies make "modest use of social media compared to ground-breaking crowdsourcing and crisis-mapping activities conducted by digital volunteers in large-scale events" [San et al., 2013], but they expect these tools to have at least a moderate impact on future response efforts.

Figure 2.3: An example of a Social Emergency Management System.

2.3 ASyEM: an Advanced System for Emergency Management

Most systems for emergency management presented in the literature have at least two main limitations: i) they are addressed only to professional users and, even when they are based on a collaborative platform, the collaboration is usually between the different institutions that manage the emergency response and not between institutions and the population involved in the disaster; ii) they do not take into account the grassroots participation of citizens and, above all, they are not based on a system for two-way communication. All systems developed so far are in fact mainly based on one-way communication, where government institutions play a central role in the emergency response. Moreover, there are no systems able to collect data created from the bottom and aggregate them in different ways for automatic interpretation of events. ASyEM, the advanced system for Emergency Management proposed by Foresti et al. [2014b], aims to overcome these limitations, combining data generated through socio-mobile applications with data generated by infrastructure sensors, creating in this way an innovative system for emergency response. Thanks to ASyEM, citizens are actively involved in disaster management, and social media and personal devices become an integral part of the emergency response.

This system is based on an innovative two-way communication architecture that combines content produced by citizens involved in the disaster with data generated by sensors located in a specific area. The architecture of ASyEM is built mainly to address the following functional requirements:

• to collect geo-referenced multimedia data from multiple sensors, such as infrastructure sensors (e.g. video cameras, microphones, etc.) or social media applications (e.g. Twitter, Instagram, Facebook, etc.);
• to communicate data on detected events via a high-speed wireless communication network to a Unified Operative Centre (UOC);
• to process locally the raw data acquired by infrastructure sensors and fuse them through the JDL fusion model [Steinberg et al., 1999];
• to integrate different multimedia data into the UOC;
• to share targeted information among different executive monitoring units;
• to collect on demand additional data (e.g. short video sequences) using UAVs (Unmanned Aerial Vehicles).

As shown in Figure 2.4, the logical architecture of ASyEM consists of four layers: 1) sensor, 2) local transmission, 3) network and 4) management. In particular, at the sensor layer, input data are acquired from different kinds of sensors; at the local transmission layer, data are pre-processed (e.g., compressed), collected and passed to the network layer, which is in charge of sending them to a remote control centre [Martinel et al., 2014]. Finally, at the management layer, all sensor data are processed, fused and used to generate situational awareness and to suggest to the operators a plan of the emergency responses to be activated. The sensor layer is composed of three different kinds of sensors: i) environment sensors, distributed permanently in the environment; ii) mobile personal devices (smartphones, tablets, netbooks, etc.), which not only provide data that allow users' localization, but can be used directly by individuals to produce and spread grassroots information online; iii) mobile system sensors, placed on board unmanned aerial vehicles (UAV), useful to inspect specific areas during or just after the disaster. Data coming from distributed sensors are pre-processed and coded to save bandwidth resources at the local processing layer and routed at the network level. The communication medium is normally represented by wireless LAN (e.g., IEEE 802.11g, IEEE 802.11n, etc.) or mobile digital networks (e.g. HSDPA for mobile phones), as well as broadband media such as optical fibres, coax cables or IEEE 802.16 WiMAX, which extends over 30 miles and allows a bandwidth of 50 Mbps [Regazzoni et al., 2001], [Rinner & Wolf, 2008].

Figure 2.4: ASyEM: the Advanced System for Emergency Management proposed by Foresti et al. [2014b].

Data are finally sent to a Unified Operative Centre (UOC) where an advanced decision support system can handle both emergencies and prevention to improve citizens' safety [Snidaro & Foresti, 2007].

2.3.1 Analysis of User Generated Content on Social Platforms

As just mentioned, the ASyEM system proposes an innovative way to analyse the user generated content on social platforms. It uses a neural tree (NT) network [Micheloni et al., 2012] which, after a training phase, is able to determine and identify the type of emergency event using the most significant keywords extracted from users' social posts. In this respect, the first selection of the vocabulary used to perform the search is conducted on the basis of a manual inspection of the posts, in order to identify the most important terms used by people during a catastrophic event. In this way, when a user writes a post or uploads a photo on a social platform (e.g. Twitter) using, for example, the hashtag #earthquake or other important specific keywords, such as "disaster", "flood", "fire", etc., the system is able to identify and collect the socio-mobile data, locating it on a map. When an emergency event occurs, the posts written by people usually contain some specific words related to the event itself. These words can be used to create a set of keywords useful to recognize and identify what is happening.

Starting from this assumption, the system uses purpose-built web crawler software called the ASyEM spider. The ASyEM spider is able to visit web sites, read and analyse the content of their pages and other useful information, such as the mark-up language (e.g. HTML), in order to find the established keywords and retrieve the associated content. The ASyEM spider is also able to analyse the content of an article posted in an online newspaper and search for significant keywords to recognise a disaster event. For example, if in an article it finds words such as earthquake, alarm, emergency, disaster, etc., and if these words appear a sufficient number of times, it is plausible that the article refers to an earthquake rather than a flood. Through the ASyEM spider, all the users' posts published on social platforms can be carefully evaluated and the most important keywords can be detected. To perform this operation, the ASyEM spider has been set to analyse and retrieve the information over a period of time ranging from 5 to 15 minutes; this range was chosen after several experimental attempts. Finally, all extracted keywords are analysed by the NT algorithm, which is able to detect and classify large sets of complex data, separating emergency events from ordinary events. The NT algorithm requires two different phases: a learning phase and a classification phase. In the learning phase (off-line phase) the NT is built by training it with data acquired from previously occurred disasters. Then, the obtained NT is applied (on-line phase) to analyse the keywords extracted by the ASyEM spider just after the occurrence of a disaster and to correctly classify the type of event. In the off-line phase, a supervised keyword classification is required. Past disasters (e.g. the earthquake in L'Aquila, the floods in Genoa, etc.) have been inspected and a keyword selection has been performed. Keywords have been classified according to the number of times they have been used to report the disaster. Regular expression techniques were used to avoid differences between uppercase and lowercase words or simple typos that can occur during the keyword detection process. The probability that a post is classified as an emergency event increases with the number of posts which are classified by the NT as generated during an emergency event. For example, if 80% of the posts are classified by the NT as generated by an emergency event, it is highly probable that an emergency event is occurring. As mentioned above, one of the peculiarities of this system is its capacity to consider the data generated by users on the social platforms with the goal of offering emergency organizations more accurate information on the given emergency event.
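The neural tree itself is not reproduced here; the fragment below is only a minimal Python sketch of the keyword normalisation and counting step that feeds such a classifier, with a naive share-based decision standing in for the NT. The keyword list and the 80% figure follow the text above; everything else (function names, sample posts) is illustrative.

```python
import re
from collections import Counter

# Illustrative emergency vocabulary; the thesis builds it from past disasters.
EMERGENCY_KEYWORDS = {"earthquake", "flood", "fire", "alarm", "emergency", "disaster"}

def keyword_hits(post_text):
    """Lower-case the text and strip the '#' of hashtags before counting,
    so that '#Earthquake', 'EARTHQUAKE' and 'earthquake' are treated alike."""
    tokens = (t.lstrip("#") for t in re.findall(r"#?\w+", post_text.lower()))
    return Counter(t for t in tokens if t in EMERGENCY_KEYWORDS)

def looks_like_emergency(posts, share=0.8):
    """Naive stand-in for the neural-tree decision: flag an emergency when at
    least `share` of the posts contain at least one emergency keyword."""
    flagged = sum(1 for p in posts if keyword_hits(p))
    return bool(posts) and flagged / len(posts) >= share

if __name__ == "__main__":
    sample = ["#Earthquake felt downtown, buildings shaking",
              "Emergency crews on site after the flood",
              "Nice coffee this morning"]
    print(looks_like_emergency(sample))
```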

2.4 The use of Social Platforms for Events Detection

As has emerged in the introduction, the use of social platforms has hugely increased in the last few years. Social platforms are oftentimes used to share various kinds of digital content such as images, videos, audio and so on. Several studies show that the various social platforms have also been used during emergency events for different purposes. It is possible to classify them in at least five categories according to their features:

1. Social Networking Sites (SNS): the most popular category; they allow people to create a public or semi-public profile and share information, photos and/or videos with others in their networks. They can be used not only for disseminating content, but also for gathering or requesting specific things [Boyd & Ellison, 2008];

2. Photo and video sharing platforms: they not only allow the sharing of rich multimedia information, but also provide a sort of collective live streaming of the event;

3. Blogs: they represent a way to write and disseminate richer articles on any subject and permit visitors to comment on posts. They can embed photos or videos and have no text character limit;

4. Wikis: online collaborative spaces where anyone can add, delete or modify content. During emergencies this kind of platform can be used in a variety of ways for content management and can be set up for specific topics, but it is particularly useful to collect logistic information about needs and resource requirements, places where to find/offer accommodation for people involved in the catastrophe and/or collective situation reports;

5. Mashup/mapping software: based on the concept of crowdsourcing, these tools allow information collection, visualization and interactive mapping. Usually these kinds of applications are based on "social GIS data" and provide a greater understanding of locations for people unfamiliar with the area, giving at the same time a good overview of the information.

2.4.1 The role of Twitter in supporting emergency operations

In the literature, the great majority of existing studies and research related to the use of social platforms for situational awareness and emergency management purposes are focused on Twitter (https://twitter.com/). Twitter is a popular social platform that allows users to share text messages of up to 140 characters, simply called tweets. Twitter was founded in 2006 by Jack Dorsey, Evan Williams, Biz Stone and Noah Glass and is nowadays one of the most used social platforms, with about 250 million active users and about 500 million tweets shared per day (https://dev.twitter.com/). The platform is mainly based on the exchange of messages within a network of contacts. Each user account can create his/her custom network by following other users' feeds.

In turn, each user can be followed by other users, called followers. Besides the possibility of sharing short text content, Twitter also offers the opportunity to share visual content, such as videos and images, or URLs to specific web sites. The reason why the majority of research is focused on Twitter derives from many factors. First of all, given the instantaneous nature of communication on Twitter, the platform is particularly suitable for real-time communications. Furthermore, the architecture and some specific features of Twitter seem to facilitate widespread dissemination of information. Among the most important characteristics which differentiate Twitter from the other social platforms, there is the possibility to share content using hashtags. A hashtag is an annotation format represented by the "#" symbol, used to indicate with a single word (or a combination of words) the core meaning of the tweet. Conversations centred on a specific hashtag promote focused discussions, even among people who are not in direct contact with each other. In addition, the choice to analyse Twitter is also motivated by the prevailing public nature of the great majority of the accounts (only a small percentage of the accounts is private), a feature that distinguishes this platform from other social networking sites [Weller et al., 2014]. This peculiarity promotes public conversations, even among users that were not previously in contact with each other (Twitter in fact offers the possibility to interact with other users, to share content with them and to reply or mention someone in your own tweet simply by using the symbol "@" followed by the username of the person you want to tag). On the other hand, it makes it easier to conduct analyses that aim to rebuild the spread of communication flows within the platform. Last but not least, this characteristic makes the use of tweets for research purposes less critical from an ethical point of view. For example, the analysis of the 2011 floods in Queensland provides a detailed mapping of the general dynamics of Twitter use during emergencies and offers useful general indications [Bruns & Burgess, 2014]. Their findings highlight that the space-time variables represent a crucial element to obtain relevant data and to improve situational awareness during disasters. For instance, the physical distance of the Twitter users from the site of the catastrophe can reflect a different type of need to be met and a different perception of danger. In addition, given that a social platform like Twitter is structurally connected to just-in-time forms of activation, the time variable plays a fundamental role. Previous research demonstrates that immediately after the event there is a greater presence of forms of instinctive response, while tweets containing links to official news sources tend to arrive later [Acar & Muraki, 2011]. Moreover, it is worth noting that the behaviour of Twitter users during emergencies depends strongly on the type of phenomenon: for instance, the grass-roots reaction to a flood will not be the same as the response to an earthquake. Also the geography of the area, along with the type of human settlement affected by the event, can have a clear impact on the number of tweets (the major clusters, for example, are in the areas most densely populated by people connected to the network).
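The hashtag and mention conventions described above are easy to exploit programmatically. The following minimal Python sketch extracts them from tweet text with simple regular expressions; the patterns are deliberately simplified compared with Twitter's real tokenisation rules, and the sample tweet is invented.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def annotations(tweet_text):
    """Extract the hashtags and user mentions that mark a tweet's topic
    and its addressees (simplified with respect to Twitter's own rules)."""
    return {
        "hashtags": HASHTAG.findall(tweet_text),
        "mentions": MENTION.findall(tweet_text),
    }

if __name__ == "__main__":
    print(annotations("Heavy rain in #Brisbane, stay safe @QldPolice #qldfloods"))
    # {'hashtags': ['Brisbane', 'qldfloods'], 'mentions': ['QldPolice']}
```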

Figure 2.5: An example of tweets posted on Twitter during the 2013 Wellington earthquake.

2.5 Data mining on Twitter Social Platform

Tweet data extraction operations involve the use of suitable API libraries, which provide all the tools and services to retrieve users' tweets and their correlated information. When a user posts a tweet on Twitter, besides the text content of 140 characters, they can share other useful information, for example videos, images, links, etc. Moreover, they can also include their position (the GPS coordinates or a position obtained through a Wi-Fi positioning system). Considering a given emergency event, this information could be useful to accurately locate the area where the event is occurring and to better arrange rescue operations. In the state of the art there are several APIs (Application Programming Interfaces) which provide useful tools to collect and analyse tweets shared by users. Mainly, it is possible to distinguish between official Twitter libraries and unofficial libraries created for specific programming languages (e.g. Neo4j, http://neo4j.com). Usually the unofficial libraries use the core libraries of the Twitter APIs to retrieve data from Twitter's servers but are customized for specific purposes. Twitter APIs are the official programming libraries proposed by Twitter (https://dev.twitter.com/). These APIs allow the performance of different types of operations, for example to extract tweets from the Twitter stream or to create custom applications for one's own web site.

They are freely available, although they can be used with some limitations (see Twitter API limitations in the next section). Twitter APIs can be generally classified into four groups:

1. Twitter for Websites: these APIs allow users to create custom web applications. An example could be the "follow" button, which can be found on many websites and is used to show the tweets posted by a specific Twitter account.

2. Search APIs: the Search APIs have been specifically implemented to extract tweets from the Twitter stream by simply executing and processing data queries against the Twitter databases. An example could be the possibility to extract only tweets that contain a specific keyword or, for example, to retrieve tweets posted by a specific user account.

3. REST APIs: the REST APIs allow developers to access core Twitter data. These are very useful and include several functionalities, for example the possibility to retrieve the timeline data of a specific user and update his/her status in real time. Moreover, they allow the extraction of specific data such as the geo-located information of a tweet (the GPS information) or the images shared by specific users.

4. Streaming APIs: these APIs allow developers to extract data from the Twitter stream in real time. As reported in the official Twitter documentation, Twitter offers several streaming endpoints, each customised to certain use cases. Moreover, streaming APIs can be differentiated into:

   1. Public stream: streams of public data flowing through Twitter, suitable for following specific users or topics, and for data mining;

   2. User streams: single-user streams, containing roughly all of the data corresponding to a single user's view of Twitter;

   3. Site streams: the multi-user version of user streams. Site streams are intended for servers which must connect to Twitter on behalf of many users (https://dev.twitter.com/streaming/overview).
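The official libraries are not reproduced here; the fragment below is a minimal, library-free Python sketch of the kind of selection the Search and Streaming APIs perform, applied to already-collected tweet-like records. The field names mirror a simplified tweet JSON (with longitude before latitude, as in GeoJSON points) and are assumptions rather than a full schema; the sample data and bounding box are invented.

```python
def in_bbox(lon, lat, bbox):
    """bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def select_tweets(records, keyword, bbox):
    """Yield geo-tagged records that mention `keyword` and fall inside `bbox`.
    Each record is a simplified tweet-like dict, e.g.
    {"text": "...", "coordinates": [lon, lat]}."""
    keyword = keyword.lower()
    for rec in records:
        coords = rec.get("coordinates")
        if not coords:
            continue  # skip tweets without a GPS position
        if keyword in rec.get("text", "").lower() and in_bbox(coords[0], coords[1], bbox):
            yield rec

if __name__ == "__main__":
    nice_area = (7.10, 43.60, 7.35, 43.75)   # rough box around Nice, France
    sample = [{"text": "#flood on the Promenade", "coordinates": [7.26, 43.70]},
              {"text": "sunny day", "coordinates": [2.35, 48.85]}]
    print(list(select_tweets(sample, "flood", nice_area)))
```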

Twitter APIs limitations

Twitter APIs are freely available to anyone who intends to create a custom application with the goal of analysing the Twitter stream data and extracting useful information. Nevertheless, the APIs come with some limitations. First, there is a restriction on the number of requests that can be sent to the Twitter servers. For example, if you want to extract data about a specific user (e.g., information about the profile of the user, his/her tweets, etc.) or simply collect tweets containing a specific keyword, it is only possible to send 180 requests every 15 minutes. If the maximum limit is reached, the system automatically discards any following requests until the 15-minute time window has elapsed. Moreover, there are other limitations regarding the use of the Twitter APIs. For example, considering the Streaming APIs, only a random sample corresponding to 1% of the total tweets is retrieved from the Twitter data stream. This means that some relevant information could be missed during the extraction process. The REST APIs also have some limitations: they do not provide access to tweets older than about one week. Moreover, only tweets posted from public Twitter profiles can be retrieved, collected and analysed.

2.5.1 Twitter Situational Awareness System classification

As explained above, in the literature there is much significant research that exploits the use of Twitter to locate and analyse a given event in real-time for situational awareness. In particular, these systems are able to extract and save tweets from the Twitter stream and analyse them with the goal of collecting useful information. All these systems are very similar to each other, although they can be classified into three main groups on the basis of their characteristics [Vernier et al., 2015]:

1. Semantic systems, based on a textual analysis of the content;

2. Meta-data systems, based on the extraction of meta-data information;

3. Smart self-learning systems, able to identify a trending topic in real-time and to automatically search for other secondary hashtags connected to the first one.

2.5.1.1 Semantic Systems

Semantic systems focused on situational awareness (Figure 2.6) are characterised by the possibility of analysing the textual content of the tweets written and shared online by Twitter users. In several cases the tweets posted by users can be very useful to identify relevant topics of discussion and, in doing so, to detect specific contextual information (i.e., collapsed buildings, number of wounded, number of deaths, etc.). A first example of semantic systems can be identified in Corvey et al. [2012], which is able

to extract linguistic and behavioural information from tweet text to aid in the task of information integration. Through linguistic annotation, in the form of Named Entity Tagging, and behavioural annotations, it captures tweets and then analyses and classifies their content, contributing to situational awareness. Another interesting example of this kind of system is the one developed by Schulz & Probst [2012], which uses crowdsourcing and Linked Open Data to enhance, classify, and filter the information shared by people on social media (i.e., Twitter or Facebook). The result is a structured dataset that contributes to identifying a certain event. A third example can be identified in the system implemented by Zielinski & Bügel [2012], which is able to analyse multilingual Twitter feeds for emergency events. Exploring the number of tweets posted online before and after ten earthquakes in the Mediterranean area, it detects a disaster by observing a rapid increase in Twitter activity, focusing in particular on posts written in the native language of the area hit by the earthquake. Also worth mentioning is the system developed by Klein et al. [2012], a Twitter crawler that analyses the stream through a combined social and content-aware analysis approach. It conducts a grammatical analysis in order to classify the textual content of tweets into nouns, verbs, adjectives, etc. and groups single words into meaningful units, in this way extracting knowledge from texts through named entity recognition algorithms. By querying the Twitter Streaming API with ad hoc keywords focused on the specific disaster and the location, it is able to detect emergencies in real-time. Another example of a tool that can be employed for situational awareness, even if it was originally conceived for a broader purpose, is Eddi, an interactive topic-based browser for Twitter feeds. It allows the organization of tweets on the basis of specific topics (e.g., a crisis event, political elections, etc.), making their visualisation more feasible. It provides different features: 1) a tag cloud overview for showing the major topics, each of which is scaled proportionally to the number of tweets assigned to it; 2) a topic detail view, available by clicking on each tag, which provides a view of all tweets about a specific topic; 3) a navigation list, which contains a complete list of the feed's topics sorted by popularity; 4) a topic dashboard for displaying an overview of the most interesting topics the user might like; and 5) a timeline that focuses on the temporal factor, stressing spikes and trends over time. Among the possible contexts of use, Eddi could be adopted during and after a natural disaster to identify trending topics of discussion and relevant content of tweets.

2.5.1.2 Meta-data systems

Regarding meta-data systems (Figure 2.6), these kinds of applications are able to use and aggregate different types of data. Like semantic systems, they are able to extract and analyse the textual content of the tweets related to a certain event, but they also take into account other types of data directly produced by the social platform itself, such as the time and date of the tweet, GPS coordinates, URLs shared, and so on.

Figure 2.6: A potential application of a system to extract data from the Twitter stream during an emergency event.

A first example of a meta-data system can be identified in Twitcident [Abel et al., 2012], a framework and web-based system that is automatically able to filter, search and analyse tweets about real-world incidents, crises or natural hazards. Adopting semantic filtering strategies, which include tweet classification, named entity recognition, and linkage to related external online resources, it monitors emergency broadcasting services and automatically collects and filters tweets whenever an incident occurs. The core of Twitcident is composed of several modules: 1) the incident detection module, which is able to identify incidents employing the P2000 communication network, an ad hoc broadcasting service used in the Netherlands by police and other emergency operators; whenever an incident is detected, the collected messages are semantically analysed by 2) the semantic enrichment module, in order to identify the tweets relevant to a given incident and perform a real-time analysis. Twitcident also collects additional meta-data about the publisher of a tweet, his/her profile picture, number of followers, and his/her location when publishing the tweet. All these data are useful to assess the trustworthiness of a tweet and improve the reliability of the whole system. Another system for supporting emergency planning and risk assessment activities that takes meta-data into account is the one developed by De Longueville et al. [2009]. This project analyses the activity of Twitter users during a fire that broke out in Marseille in July 2009 through Twitter's API and a search of the keywords "incendie" and "Marseille". The authors classify the content on the basis of the users and identify three major roles: citizens, media and aggregators. This last category did not generate primary content but distributed existing sources (i.e., news produced by citizen journalists) through abbreviated URLs to blogs, news portals and picture websites such as Flickr and Twitpic.

Figure 2.7: The architecture proposed by Twitcident [Abel et al., 2012].

It is worth noting also TEDAS [Li et al., 2012], a system which uses keywords and GPS coordinates to discover spatial and temporal patterns of events, and the region of major influence. The high coverage of Twitter users, along with the rich information associated with tweets, lets TEDAS monitor events in a fast and accurate way. It is able to recognise new events, rank them according to their importance and generate a spatial-temporal pattern for every event. The tool takes a query, which contains keywords, a temporal period and a location, and retrieves the relevant tweets. Furthermore, it identifies and classifies tweets based on some relevant keywords, giving more relevance to those posted by an authority. Another interesting system has been developed by Sakaki et al. [2010]. Through the analysis of tweets and keywords (e.g., earthquake or shake), it is able to automatically detect target events and promptly send notifications to those who are registered. Moreover, the system applies Kalman and particle filtering in order to accurately detect the location of the event. Also worth mentioning is the tool created to enhance situational awareness in emergency events, which uses the Twitter Search APIs to find tweets through a query of case-insensitive terms and then analyses textual data, GPS coordinates and the registered location of the tweets, and visualises their content through the E-Data Viewer (EDV) [Vieweg et al., 2010]. A more specific system for disaster response is the one developed to help relief workers during

natural disasters [Ashktorab et al., 2014]. It adopts different methods (i.e., sLDA, SVM) to classify in real-time the material shared from areas close to the disaster location and extracts the most relevant phrases about structure and infrastructure damage. A similar system has been developed by Walther & Kaisser [2013]. It is able to detect geo-spatial real-world events in real-time. It focuses on clusters of tweets sent in a short time span and stores them for at least 48 hours in an open-source database (MongoDB). Then, through a machine learning component that uses 41 features addressing several aspects of the tweets, the material is analysed and filtered in order to eliminate tweets generated by bots or false positives.

2.5.1.3 Smart Self-Learning Systems

The last category refers to smart self-learning systems (Figure 2.8) and includes those applications able to identify in real-time a location affected by a certain event by combining characteristics used by semantic systems and meta-data systems. However, in contrast to the previous categories, they are also able to self-learn from the content of the tweets analysed, identifying the most adopted words and hashtags and automatically looking for other secondary hashtags connected to the first ones. A first example of these systems can be identified in CrisisLex [Olteanu et al., 2014]. The aim of this system is to improve the query to the Twitter servers in order to harvest the most relevant tweets during a specific crisis. As a first stage, the system creates a lexicon of crisis-related terms that frequently appear in relevant messages posted during different types of past crisis situations. The lexicon is then used by the system to automatically identify new terms that describe a given crisis. Moreover, it allows the automation of the choice of keywords contained in the tweets. Data collection is optimised using keyword or geolocation searches. CrisisLex constitutes a first attempt to develop a smart self-learning system. Further research is needed to develop more accurate systems able to identify the specific area where the event is occurring, extracting only the data related to that event in that area. In the next section, a possible innovative smart solution will be illustrated.

2.6 Data Mining on Instagram

Instagram (https://www.instagram.com/) is another popular social platform that allows the sharing of digital content on the web. Created and launched by Kevin Systrom and Mike Krieger in October 2010 as a free mobile app, it now counts over 300 million monthly active users. Among the main features which characterise this social platform is the possibility of applying different filters in order to transform the appearance of the image you want to share.

Figure 2.8: An example of smart self-learning systems.

Moreover, as on Twitter, users can attach some text information to the images or videos they share, such as a caption or hashtags (#) related to specific keywords. Again, the users who post content can mention (tag) other people using the symbol @, or include geo-location information such as latitude and longitude coordinates. To extract digital content from the Instagram databases, the social platform offers third-party developers complete APIs which can be exploited to create custom desktop, web-based and mobile applications. Moreover, several REST services are provided to easily retrieve specific content such as, for example, all the images which share a specific hashtag, the friend connections of a specific user and so on. In the state of the art, there are several web applications that exploit the Instagram APIs to retrieve and extract the digital data shared by users. Instaport.me (http://instaport.me/), Dinsta (http://www.dinsta.com/), Downgram (http://downgram.com/), Gramfeed (http://www.gramfeed.com/) and Iconsquare (http://iconosquare.com/) are just some examples of applications which allow the retrieval and downloading of images posted by users on Instagram. Among the main features that characterise these online platforms is the possibility of searching and retrieving only specific photos which share one or more specific hashtags. Moreover, the images can be downloaded at various resolutions. The presented works are just examples of custom web applications which allow users to retrieve and download digital content from Instagram. Nevertheless, it is also important to consider the scientific contributions given by studies on the use of Instagram

for research purposes. As a first example, in Weilenmann & Hillman [2013], the authors investigate how the Instagram application is used to communicate visitors' experiences while visiting a museum of natural history. The research is based on the analysis of 222 photos taken by users in the museum and shared on Instagram. The main goal was to understand how the sharing practice is used to share museum experiences. Moreover, the authors discuss the ways in which visitors re-categorise and document objects found in the museum, the engagement between visitors and science centres, and suggest ways to extend this dialogue beyond an institution's physical location. In Hu et al. [2014], the authors present a quantitative and qualitative analysis of photos shared by users on Instagram. They use state-of-the-art computer vision techniques to examine the photo content with the goal of identifying, through clustering, the different types of active users on Instagram. The final results revealed eight popular photo categories, five distinct types of Instagram users in terms of their posted photos, and that the user audience, measured by the number of followers, is independent of the type of photos shared on Instagram. Regarding the computer vision technique adopted to characterise the type of photos posted on Instagram, they used the well-known Scale Invariant Feature Transform (SIFT) descriptor [Lowe, 2004] to detect and extract local discriminative features from photo samples. Following the standard image vector quantization approach (i.e., SIFT feature clustering [Szeliski, 2010]), they obtained the codebook vectors for each photo. Finally, they used k-means clustering to obtain 15 clusters of photos, where the similarity between two photos is calculated in terms of the Euclidean distance between their codebook vectors. These clusters served as an initial set of coding categories, where each photo belongs to only one category. Another important piece of research on the data produced by users on the Instagram social platform is presented in Silva et al. [2013]. To study urban social behaviour and city dynamics on a large scale, the authors consider Instagram as the most popular Participatory Sensing System (PSS) on the Internet. In particular, they show that the frequency of photo sharing is unequal, both spatially and temporally, and is highly correlated with the typical routine of people.
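To make the codebook-and-clustering pipeline described for Hu et al. [2014] concrete, the following sketch reproduces its main steps with OpenCV and scikit-learn. The vocabulary size, the helper names and the list of image paths are illustrative assumptions, not the exact parameters used in that study; only the final number of photo clusters (15) comes from the text above.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_paths):
    """Extract SIFT descriptors (128-D each) from a list of image files."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image.append(desc if desc is not None else np.empty((0, 128)))
    return per_image

def codebook_vectors(per_image, vocab_size=200):
    """Quantise descriptors into a visual vocabulary and build one
    normalised histogram (codebook vector) per photo."""
    vocab = KMeans(n_clusters=vocab_size, n_init=10)
    vocab.fit(np.vstack([d for d in per_image if len(d)]))
    histograms = []
    for desc in per_image:
        hist = np.zeros(vocab_size)
        if len(desc):
            words = vocab.predict(desc)
            hist = np.bincount(words, minlength=vocab_size).astype(float)
            hist /= hist.sum()
        histograms.append(hist)
    return np.array(histograms)

# Cluster the photos themselves (Euclidean distance on codebook vectors),
# e.g. into the 15 groups reported by Hu et al. [2014].
# paths = [...]  # hypothetical list of downloaded Instagram photos
# hists = codebook_vectors(sift_descriptors(paths))
# photo_clusters = KMeans(n_clusters=15, n_init=10).fit_predict(hists)
```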


3. System architecture

3.1 Introduction

In Chapter 2, an overview of Emergency Management Systems is given. In particular, the most diffused systems for Twitter data mining are presented. These systems allow the recognition of specific events by analysing the text content generated by users. User Generated Content (UGC) modifies the coverage of crisis events and helps to obtain more timely reporting and more up-to-date information than traditional media [Conklin & Dietrich, 2010], [Goolsby, 2010]. The bottom-up communication practices related to its adoption accelerate information flows and contribute to communities' empowerment, even if the content produced and shared online can sometimes be incorrect and needs checks for accuracy and validation. Accordingly, UGC extracted from social platforms needs to be validated, in order to discard superfluous information while maintaining that which is relevant. The information retrieved can then be used by emergency operators to obtain more accurate details of the event, with the goal of better organising the rescue operations. In this chapter, the architecture of the proposed SMR system for the analysis and interpretation of Multimedia User Generated Content is presented. Exploiting specific Application Programming Interfaces (APIs), this system is able to extract content shared by users on specific social platforms. The retrieved content is then analysed in order to detect the geographical area where a given event (e.g., an emergency
