
TOWARDS BIG DATA METHODS AND TECHNOLOGIES FOR OFFICIAL STATISTICS


Academic year: 2021


UNIVERSITÀ DI PISA

DOTTORATO DI RICERCA IN INGEGNERIA DELL’INFORMAZIONE

TOWARDS BIG DATA METHODS AND TECHNOLOGIES FOR OFFICIAL STATISTICS

DOCTORAL THESIS

Author

Lorenzo Gabrielli

Tutor (s)

Prof. Francesco Marcelloni

Prof.ssa Fosca Giannotti

Prof. Mirco Nanni

Reviewer (s)

Prof. Davy Janssens

Prof.ssa Alessandra Raffaetà

The Coordinator of the PhD Program

Prof. Marco Luise

Pisa, 05 2018 XXX


"Con i dati si può mentire, ma senza dati non si può dire la verità" "With the data you can lie, but without data we can not tell the truth"


Acknowledgements

A Ph.D. is designed to be an individual endeavor, but in reality it is the sum of many collective experiences within which the candidate expresses his creativity. For this reason, I wish to thank all the people with whom I collaborated. I therefore thank all the colleagues of KDDLAB and the scientific coordinators of my thesis, Fosca Giannotti and Mirco Nanni. I am grateful to the friends with whom I take the train every day: although they belong to different research fields, they gave me advice, help, and expertise that improved my research. With one of them, Luca Bastiani, I have also submitted a paper.

Finally, the most important thanks go to my family, for the human and material support received during my studies. Particular thanks go to my niece Giulia, who has often helped me choose the colors for what she calls scribbles but which are in fact charts showing the results of my research.


Ringraziamenti (Acknowledgements)

The doctoral path is conceived as an individual one, while in reality it is the sum of many collective experiences, within which the candidate expresses his creativity and his willingness to question himself. For this reason, my thanks go to everyone with whom I have engaged in various capacities. I therefore thank all the colleagues of KDDLAB and the scientific supervisors of my thesis, Fosca Giannotti and Mirco Nanni. I thank my fellow commuters who, although they belong to different research environments, gave me many suggestions on how to carry out my research. With one of them, Luca Bastiani, I even managed to produce a publication. Finally, the most important thanks go to my family, for the human and material support received during this and my previous courses of study. Special thanks go to my niece Giulia, who often helped me choose the colors for what she calls scribbles but which are actually charts showing the results in the papers.


Summary

This thesis aims to demonstrate in a tangible way how mobile phone data, private vehicle tracks, and scanner data are useful for measuring complex systems. The three main areas of application concern the use of Big Data: i) for measuring presence within a territory through Data Mining techniques, ii) for now-casting the socio-economic development of a country, and iii) for measuring the dynamics of cities.

First, a tool for real-time demography has been developed, demonstrating how to use mobile phone data over a wide area to obtain new Official Statistics indicators. The study showed that Big Data, whether mobile phone data or scanner data, are useful and effective for carrying out a continuous census of the population.

Second, an analytical framework has been proposed to evaluate the relations between relevant aspects of human behavior and the well-being of a territory. We found that the diversity of human mobility mirrors some aspects of socio-economic development and well-being. We then showed how mobility features help to improve the performance of state-of-the-art methodologies such as small area estimation.

Finally, we analyzed how mobility interacts with the territory through the movement of people. We proposed using mobile phone data and GPS tracks to support city government by measuring the attractiveness of cities. Furthermore, a data analysis approach has been proposed to identify mobility functional areas in a completely data-driven way.

The main findings of the thesis concern the statistical and ethical evaluation of the results against official sources, and show that the methodologies can be applied in other contexts and with different data sources as well. We showed how the geographic information contained in the data sources is extremely useful for observing our society with a new “microscope”. Thanks to the opportunity provided by the varied scientific context of SoBigData, the European Research Infrastructure for Big Data and Social Mining, the Ph.D. also contributed to developing and promoting responsible data science, because the ethical framework is considered part of the CRISP model, not a problem to be treated separately.


Sommario (Summary)

The purpose of this thesis is to demonstrate in a tangible way how mobile phone data, GPS tracks, and purchase data are an extraordinary source of information for measuring the habits of individuals and of our society. The three research areas presented in this thesis concern the use of Big Data to: i) carry out a continuous census of the population over the territory, ii) measure the socio-economic development of nations in near real-time, and iii) measure the impact of mobility on the dynamics of cities.

First, a methodology has been developed to implement real-time demography in the context of Official Statistics. The study shows how Big Data, both those derived from mobile phone traffic and those derived from purchase data, are extremely useful and effective for measuring the population present in a municipal territory.

Second, an analytical framework has been built that measures the relations between the mobility components associated with a territory and its well-being. I showed that the diversity generated by human mobility is a proxy for some aspects of the socio-economic development of the territory in which it occurs. It has also been shown that mobility indicators help to improve the performance of well-known state-of-the-art methodologies such as small area estimation.

Finally, it has been analyzed how the movement of individuals is useful for building indicators that measure the attractiveness of places. In this context, an analysis technique has also been proposed to identify how to partition the territory into homogeneous areas that share the same mobility behavior, in a completely data-driven way.

The main results of the thesis concern the statistical and ethical validation of the results; it is also shown that the methodologies are general, that is, they work with different data sources and in different contexts. The common thread of this thesis is the use of the geographic information contained in the data sources to observe our society with a new “microscope”. Thanks to the opportunities provided by the broad scientific context of SoBigData, the Ph.D. contributed to the development and promotion of a responsible and ethical use of data science, because ethical aspects are considered part of the analytical process and not a problem to be treated separately.


List of publications

International Journals

1. Furletti, B., Trasarti, R., Cintia, P., & Gabrielli, L. (2017, June). Discovering and Understanding City Events with Big Data: The Case of Rome. Journal of Information. Multidisciplinary Digital Publishing Institute. 8(3), 74.

2. Calastri, C., Hess, S., Choudhury, C., Daly, A., & Gabrielli, L. (2017, April). Mode choice with latent availability and consideration: theory and a case study. Transportation Research Part B: Methodological.

3. Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., & Ricci, L. (2016, December). Scalable and flexible clustering solutions for mobile phone based population indicators. International Journal of Data Science and Analytics.

4. Pappalardo, L., Vanhoof, M., Gabrielli, L., Smoreda, Z., Pedreschi, D., & Giannotti, F. (2016, February). An analytical framework to nowcast well-being using mobile phone data. International Journal of Data Science and Analytics, 1-18.

5. Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F., Pedreschi, D., & Gabrielli, L. (2015, June). Small area model-based estimators using big data sources. Journal of Official Statistics, 31(2), 263-281.

International Conferences/Workshops with Peer Review

1. Gabrielli, L., Fadda, D., Rossetti, G., Nanni, M., Piccinini, L., Lattarulo, P., Pedreschi, D. and F. Giannotti (2017, October). Discovering Mobility Functional Areas: A Mobility Data Analysis Approach. Complenet2018.

2. Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., & Ricci, L. (2016, June). Improving population estimation from mobile calls: a clustering approach. In 2016 IEEE Symposium on Computers and Communication (ISCC) (pp. 1097-1102). IEEE.


3. Gabrielli, L., Furletti, B., Trasarti, R., Giannotti, F., & Pedreschi, D. (2015, October). City users’ classification with mobile phone data. In Big Data (Big Data), 2015 IEEE International Conference on (pp. 1007-1012). IEEE.

4. Gabrielli, L., Guido, D., Giannotti, F., & Bastiani, L. (2016, April). A Synthetic Measurement for Political Engagement of Spending: Pilot study to measure performance of local government using Open Government Data. International Conference on Digital Society and eGovernments.

5. Campagni, R., Gabrielli, L., Giannotti, F., Guidotti, R., Maggino, F. and D. Pedreschi. (2016, February). Measuring Wellbeing extracting Social Indicators from Big Data. In DATA SCIENCE & SOCIAL RESEARCH (DSSR). International Conference.

National Conferences/Workshops with Peer Review

1. Guidotti, R. and L. Gabrielli (2017, July). Recognizing Residents and Tourists with Retail Data Using Shopping Profiles. GOODTECHS 2017.

2. Campagni, R., Gabrielli, L., Giannotti, F., Guidotti, R., Maggino, F. and D. Pedreschi. (2017, June). Measuring Wellbeing extracting Social Indicators from Big Data. Scuola Italiana di Statistica (SIS) 2017. Scuola Italiana di Statistica.

3. Bocci, C., Fadda, D., Gabrielli, L., Nanni, M. and L. Piccini. (2017, June). Using GPS Data to Understand Urban Mobility Patterns: An Application to the Florence Metropolitan Area. Scuola Italiana di Statistica (SIS) 2017. Scuola Italiana di Statistica.

4. Elena Salvatori, Lorenzo Gabrielli, Fosca Giannotti and Dino Pedreschi. A Data Driven approach for evaluating foundation skills of adults. DIDAMATICA 2017.

5. Gabrielli, L., Riccardi, G. and Pappalardo, L. (2016, June). Using retail market Big Data to nowcast Customer Price Index. Scuola Italiana di Statistica (SIS) 2016. Scuola Italiana di Statistica.

Submitted

1. Guidotti, R., Gabrielli, L., Monreale, A., Pedreschi, D. and F. Giannotti (2017, September). Discovering Temporal Regularities in Retail Customers’ Shopping Behavior. EPJ Data Science 2017.

2. Cintia, P., Gabrielli, L., Giannotti, F., Monreale, A. and Francesca Pratesi. Privacy-Aware Sociometer: a Mitigation Strategy for Quantification of City Users. (in preparation).


Contents

1 Introduction

2 Setting the stage
2.1 Background
2.2 Data sources

3 Real Time Demography
3.1 Problem Definition: Population Estimation
3.2 Sociometer: a Solution to Real-Time Demography
3.3 Extending the Sociometer Engine
3.3.1 Semi Automatic Annotation of Prototypes
3.3.2 Scalability issues
3.3.3 Privacy issues
3.3.4 New directions: integration with Retail Data

4 Monitoring Well-being from Big Data
4.1 Problem definition: multidimensional well-being
4.2 Using mobile phone data to monitor well-being
4.2.1 Measuring Human Behavior
4.2.2 Correlation Analysis
4.2.3 Predictive Models
4.2.4 Interpretation of Results
4.2.5 Future directions
4.3 Exploring usability of GPS data to monitor well-being in small areas
4.3.1 Measure poverty index using Mobility Diversity
4.3.2 Big data as a covariate of small area estimation method
4.4 Early results on Retail Data
4.4.1 Extracting social Indicators from Customers’ Shopping Behavior
4.4.2 Using retail market Big Data to nowcast Customer Price Index

5 City Dynamics and Big Data
5.1 Problem 1: Event detection in Large cities
5.2 Problem 2: Identifying functional areas at Regional scale
5.2.1 Background
5.2.2 Mobility Functional Areas Discovery (MFAD)
5.2.3 Experiments
5.2.4 Evaluation
5.2.5 Future directions

6 City Atlas Booklet: a dashboard for Human mobility

7 Conclusions

Bibliography


CHAPTER 1

Introduction

With the spread of the Internet, the amount of data produced by individuals has grown enormously: just think of the number of searches made on Google or Wikipedia, or the considerable number of items purchased through eBay or Amazon. Moreover, with the diffusion of mobile devices, each individual produces a vast amount of information about their travels, preferences, and opinions. Such data, whether freely accessible, proprietary, or collected by volunteers, can be classified as Big Data in terms of size, quality, and the frequency with which they are produced.

We have the opportunity to use this volume of data to study society from social and economic perspectives. Data alone do not have value; rather, it is the process of extracting knowledge from data that can produce social benefits. This thesis aims to contribute to the debate on the challenges of, and solutions for, using Big Data as sensors of our complex society. Employing Big Data in Official contexts, for instance, implies a revolution in organizational, analytical, technological, methodological and cultural terms.

From a statistical point of view, it is necessary to demonstrate the validity of the available sample of data or, more generally, its quality. To this end, a theoretical framework is required that reconciles Big Data Analytics and traditional statistical models. It is worth noting that working with Big Data in Statistics requires a significant cultural change, especially in model design.

Regarding the technological issues, over the last few years hardware has been developed to manage vast amounts of information by exploiting the advantages of parallel and distributed computing. These technological advances aim to provide efficient and effective processing for the different application areas, and they imply a redefinition of Machine Learning tasks towards a new distributed and efficient paradigm.


[...] are not to be underestimated. Privacy issues must be addressed from the design stage of the methods, taking care to apply strategies for removing or mitigating risks. It is worth pointing out that privacy issues concern not only data processing but also the release of the results obtained from the processed data.

This thesis aims to demonstrate in a tangible way how mobile phone data, private vehicle tracks, and scanner data are useful tools for measuring complex systems. The work done during the thesis also contributes to the objectives of the research infrastructure SoBigData.eu, which creates the Social Mining & Big Data Ecosystem in order to promote responsible data science. SoBigData.eu proposes to create a research infrastructure (RI) providing an integrated ecosystem for ethic-sensitive scientific discoveries and advanced applications of social data mining on the various dimensions of social life, as recorded by “big data”. For the Ph.D., it has been advantageous to work in a large community of researchers belonging to different disciplines.

After the introduction of the research questions and the description of the data used in Chapter 2, Chapter 3 addresses the use of Big Data for measuring population presence within a territory through Data Mining techniques. Here, we develop a scalable, flexible and ethical framework for Real-Time Demography, representative of the whole population. The study has been jointly designed by WIND-TRE Telecommunication, ISTAT, the National Council of Research, and the University of Pisa within the project “ISTAT Big Data Commission”.

In Chapter 4, we address the use of Big Data to now-cast the socio-economic development of a country. The implemented analytical framework shows an interesting relationship between the mobility of individuals, measured with mobile phone data, and a deprivation index on a national scale. The study has been developed in collaboration with the French telco Orange Telecom. In this chapter we also propose to use large retail market data to construct indicators of collective well-being. Starting from the scanner data of individuals, three phenomena have been studied: i) temporal buying habits, ii) the impact of the crisis on purchases, and iii) the cost of living. This activity was carried out in collaboration with a distribution chain (UNICOOP-TIRRENO).

In Chapter 5, we show some application cases, at city level and at national level, concerning several mobility phenomena that can be observed with Big Data (event detection, new borders of cities). This experience highlights some areas of application of various Big Data Analysis methods (some introduced in previous chapters, others presented in this one). The studies are conducted in collaboration with IRPET (Regional Agency for Economic Planning in Tuscany) and WIND-TRE telecommunication¹.

The process of transforming raw data into knowledge is very complex, and it is necessary to provide visual metaphors that are understandable to decision makers. In Chapter 6, we introduce an analytical platform that extracts information on the mobility of individuals from vehicle and mobile phone data by applying Data Mining methodologies, both taken from the literature and developed as part of this thesis. Chapter 7 concludes the thesis and indicates some open issues.

¹ The ASAP FP7 research project developed a dynamic open-source execution framework for scalable data analytics.


CHAPTER 2

Setting the stage

Big Data, the masses of digital breadcrumbs produced by the information technologies that humans use in their daily activities, allow us to scrutinize individual and collective behavior at an unprecedented scale, detail, and speed. Building on this opportunity, we have the potential capability of creating a digital nervous system of our society, enabling the measurement, monitoring, and prediction of relevant aspects of the socio-economic structure in quasi-real-time [51]. The interest in the analysis of Big Data provides nowadays the possibility to study human behavior both at an individual and collective level in all branches of human knowledge, from economy [102], to human mobility [94], social networks [8] and even sports [24, 25].

Understanding the individual decision process from a micro-perspective would ideally lead to truly emergent behaviour, which can be observed at the macro-level by means of Big Data. For example, in [46], the authors apply a data-driven approach to implement simulation models for transport systems. They combine Big Data and an Agent Based Model to analyze and understand the complex interactions between different entities in order to implement a carpooling application.

Technological evolution has brought, in recent years, a remarkable increase in the diffusion of devices that can produce and record digital footprints of our behavior on a daily basis, tracking a vast range of activities, from mobility to shopping, from social relationships to economic interactions. The constant and basically unintentional production of such tracks generates, by accumulation, huge datasets that contain, albeit not in an easily intelligible form, a precious quantum of information about socio-economic behavior that may be extracted and used for socio-economic research and policy analysis [50]. Rising interest in Big Data has generated a wave of academic and applied literature in different fields that uses unstructured data as a replacement for, or a supplement to, more traditional forms of data collection (institutional, administrative or sampled data). Big data sources can help public institutions by providing data with better timeliness, reduced cost, improved robustness, and applicability in cases where a developed statistical system is not yet established. This may support policymakers in the ex-ante phase of policy implementation, by providing a more sophisticated depiction of the socio-economic environment and enabling a more accurate evaluation of needed actions and expected results. Now-casting can be used to monitor critical indicators during the operational phase of the policies, allowing real-time adjustment and constant refinement. Finally, Big Data sources may be used for ex-post evaluation purposes in quasi-experimental designs and counter-factual settings. The literature on the matter and practical experiences have highlighted pros and cons of this approach [119]. Some of the advantages have already been mentioned and include timeliness, cost-effectiveness, spatial and temporal disaggregation, and the emergence of unexpected and/or otherwise unobservable phenomena. On the other hand, given the relative novelty of the methodologies used to deal with these data, extra care is needed to acknowledge possible shortcomings in terms of quality, accessibility, applicability, relevance, privacy policy and ownership of the data, all of which may affect the quality of policy evaluation and appraisal. Nonetheless, we believe that Big Data sources can be successfully used to strengthen the capabilities of public institutions to deal with complex problems, to plan effective policies and to evaluate the outcomes of their actions.

2.1 Background

This thesis aims to contribute to the debate on the challenges of, and solutions for, using Big Data as sensors of our complex society. Employing Big Data in Official contexts implies a revolution in organizational, analytical, technological, methodological and cultural terms. An intriguing open issue that I want to address in this thesis is whether measurements derived from Big Data recording human activities can yield high-fidelity proxies of the complex system. In particular, we aim to address the following questions:

Can we implement new statistical indicators for estimating population (and movements) in near real-time? Typically, official demographic data are collected systematically every ten years, during the nationwide official census. However, census data, while very rich in information and detail, have two major drawbacks: the temporal lag between censuses, during which we have no information on mobility, and the focus on what we call systematic mobility, i.e., the mobility that happens almost every day and is mainly related to routine movements. This leaves out an increasingly relevant segment of non-systematic mobility, which, by its nature, is difficult to capture with traditional methods. Thanks to Big Data, we can increase our analytical capability with an informative base that can be updated almost continuously and that includes all presences, not only the systematic ones. To overcome the limitations of the traditional approach, we propose to use mobile phone data, because mobile devices are today one of the principal means by which people disseminate digital tracks of their everyday activities. In particular, mobile phones and the data they produce have proved to be a high-quality proxy for studying people's mobility in different domains, such as environmental monitoring [70, 108], transportation planning [18], smart cities and social relationship analysis [33, 137].


The vast majority of works in the context of Big Data for official statistics are based on the analysis of mobile phone data, the so-called CDRs (Call Detail Records) of the calling and texting activity of users. Mobile phone data guarantee the repeatability of experiments in different countries and on different scales since, nowadays, they can be retrieved in every nation given the worldwide diffusion of mobile phones [11]. A set of recent works use mobile phone data as a proxy for socio-demographic variables. Brea et al. [15], for example, study the structure of the social graph of mobile phone users in Mexico and propose an algorithm for predicting the age of mobile phone users. One of the first works using mobile data to estimate the population was presented by Terada et al. [129]. In this work, they monitor the mobile terminals present in each base station area in different time intervals. Such data are refined with census information and, at the end, the per-cell populations are aggregated into grid sections or municipalities. This result may be affected by errors: checking only presence in a cell cannot determine whether an individual is a resident, who should be counted as living in the area, or just a visitor. For this reason, subsequent works try to exploit mobile data in a different manner.

Mobile phone traces have been used to monitor traffic in cities and analyze tourist movements. In particular, two well-known works focus on this issue for the cities of Rome [18] and Graz [113]. Following these works, many others, for instance Ahas et al. [2], show that it is possible to identify the places visited by individuals by investigating the calls they make. Also, a plethora of works, for instance the winner of the Nokia Mobile Data Challenge [37], build predictors able to determine the next position of an individual given the current context.

De Jonge et al. [29] study different approaches making use of call records spanning two weeks in the Netherlands. They give insights into the indicators obtainable by analyzing the phone calls. For instance, such data can be used to estimate the level of economic activity, because the number of phone calls can be an indicator of the economic activity of a specific region. They use the k-means clustering algorithm to determine daily pattern clusters of the call activity. However, they suggest that a more in-depth study of the calling behavior should be performed on a larger dataset covering multiple weeks to estimate population density correctly.
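To make the clustering of daily call patterns concrete, the following is a minimal sketch of plain k-means (Lloyd's algorithm) applied to activity vectors. It is not the procedure of [29]: the function name `kmeans`, the four time slots per day, and the toy values are assumptions introduced only for illustration.

```python
import random


def kmeans(profiles, k, max_iter=100, seed=0):
    """Plain Lloyd's k-means on equal-length activity vectors.

    `profiles` is a list of lists, e.g. the share of a day's calls falling
    in each time slot (hypothetical layout, not taken from the source).
    """
    rng = random.Random(seed)
    # Initialise the centroids with k distinct observed profiles.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = []
        for p in profiles:
            dists = [sum((x - c) ** 2 for x, c in zip(p, centroid))
                     for centroid in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy daily patterns: share of calls in [night, morning, afternoon, evening].
days = [
    [0.05, 0.45, 0.40, 0.10],
    [0.04, 0.50, 0.38, 0.08],
    [0.06, 0.42, 0.42, 0.10],
    [0.10, 0.15, 0.20, 0.55],
    [0.12, 0.10, 0.22, 0.56],
    [0.08, 0.12, 0.25, 0.55],
]
labels, centroids = kmeans(days, k=2)
```

On this toy input the daytime-heavy days and the evening-heavy days end up in separate clusters; with real data the number of clusters and the time slots would have to be chosen and validated.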

Deville et al. [31] build on the ideas of De Jonge and exploit mobile phone data for estimating population density. According to this methodology, population density is calculated as a function of the night-time phone calls occurring in a given area. However, a simple rule-based approach to identifying user presence may make it hard to derive more useful information about the calling behavior of the users. For instance, it would be cumbersome to define rules able to characterize individuals that are Commuters or Visitors.
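As an illustration of what "a function of the night-time phone calls" can look like, the sketch below assumes a power-law relation, density = alpha * activity^beta, and fits it by least squares in log-log space. The text above does not state the functional form, so the power law, the function names and the synthetic numbers are all assumptions of this example.

```python
import math


def fit_power_law(night_activity, population_density):
    """Fit density = alpha * activity ** beta by OLS on the logarithms.

    Both inputs must contain strictly positive values, one per area.
    The power-law form is an assumption of this sketch.
    """
    xs = [math.log(a) for a in night_activity]
    ys = [math.log(d) for d in population_density]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta


def estimate_density(activity, alpha, beta):
    """Apply the fitted relation to an area with known night-time activity."""
    return alpha * activity ** beta


# Synthetic calibration areas (hypothetical values): census density is known.
activity = [10.0, 50.0, 200.0, 1000.0]
census_density = [2.0 * a ** 0.8 for a in activity]
alpha, beta = fit_power_law(activity, census_density)
```

In practice the parameters would be calibrated on areas where official density is available and then applied elsewhere, which is also where the rule-based limitation discussed above becomes visible: the relation says nothing about who is a resident and who is not.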

To overcome the limitations mentioned above, in a seminal work Furletti et al. [43] defined how to build individual profiles based on mobile phone calls. Such patterns characterize the calling behavior of a user, in different time slots. By analyzing these profiles, it is possible to identify three categories of users: Residents, Commuters or Visitors. For data mining analytics and applications, these data are very significant in terms of size and representativeness. In [43] the authors demonstrate how the number of residents observed with mobile phone data is highly correlated with the number of residents identified by official estimates.
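A minimal sketch of how such an individual profile could be assembled from call timestamps is given below. The slot boundaries, the weekday/weekend split and the choice of counting active days per week are assumptions made for illustration; they are not taken from [43].

```python
from collections import Counter
from datetime import datetime

# Assumed time slots (start hour inclusive, end hour exclusive, label).
SLOTS = ((0, 8, "night"), (8, 19, "day"), (19, 24, "evening"))


def time_slot(hour):
    """Map an hour of the day (0-23) to its slot label."""
    for start, end, name in SLOTS:
        if start <= hour < end:
            return name
    raise ValueError(f"hour out of range: {hour}")


def call_profile(timestamps):
    """Build a profile for one user in one area.

    Returns a Counter keyed by (ISO year, ISO week, day type, slot) whose
    value is the number of distinct days with at least one call.
    """
    active = set()
    for ts in timestamps:
        iso_year, iso_week, iso_weekday = ts.isocalendar()
        day_type = "weekend" if iso_weekday >= 6 else "weekday"
        active.add((iso_year, iso_week, ts.date(), day_type, time_slot(ts.hour)))
    profile = Counter()
    for year, week, _day, day_type, slot in active:
        profile[(year, week, day_type, slot)] += 1
    return profile


# Hypothetical calls of a single user observed in a single municipality.
calls = [
    datetime(2018, 1, 1, 9, 0),
    datetime(2018, 1, 1, 10, 30),
    datetime(2018, 1, 2, 21, 0),
    datetime(2018, 1, 3, 12, 0),
    datetime(2018, 1, 6, 3, 0),
]
profile = call_profile(calls)
```

A user whose profile is dense in every slot, weekends included, looks like a resident, while presence limited to weekday daytime suggests a commuter; turning such intuitions into a classifier is what the profile analysis in [43] is for, and is deliberately left out of this sketch.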


Currently, a hot topic in the modernization of official statistics is precisely how to use Big Data in combination with traditional data sources, to improve the quality, timeliness and spatiotemporal granularity of statistical information [10]. Along this line, mobile phone data, despite their limits in spatial precision compared to other location data such as GPS tracks, are of the utmost interest due to their global availability in any country and their ability to portray mobility independently of the means of transportation. This is documented by many contributions realized within the D4D challenge¹, for example in (i) [89], where Call Data Records (CDRs) from an African city were used to reconstruct an Origin and Destination (OD) matrix describing typical traffic flows, or in (ii) [5], where the authors created temporal profiles of the call activities in order to identify the different location types (residential, business, mixed area) in a city. Other recent works use different types of mobility data, e.g., GPS tracks and retail market data, to show that Big Data on human movements can be used to support official statistics and understand people’s purchase needs. Pennacchioli et al. [101], for example, provide empirical evidence of the influence of purchase needs on human mobility, analyzing the purchases of an Italian supermarket chain to show a range effect of products: the more sophisticated the requirements they satisfy, the more the customers are willing to travel.
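The idea of reconstructing an OD matrix from CDRs can be sketched as follows: order each user's records in time and count every change of zone between consecutive records as one trip. This is a simplified illustration, not the method of [89]; the record layout `(user_id, timestamp, zone)` and the function name are assumptions.

```python
from collections import Counter, defaultdict


def od_matrix(records):
    """Count origin-destination transitions from call records.

    `records` is an iterable of (user_id, timestamp, zone) tuples
    (hypothetical layout). Returns a Counter mapping (origin, destination)
    to the number of observed transitions.
    """
    by_user = defaultdict(list)
    for user, ts, zone in records:
        by_user[user].append((ts, zone))
    flows = Counter()
    for events in by_user.values():
        events.sort()  # chronological order within each user
        for (_, origin), (_, dest) in zip(events, events[1:]):
            if origin != dest:  # consecutive records in the same zone are not trips
                flows[(origin, dest)] += 1
    return flows


# Toy records: integer timestamps and zone labels are illustrative only.
records = [
    ("u1", 1, "A"), ("u1", 2, "A"), ("u1", 3, "B"), ("u1", 4, "C"),
    ("u2", 1, "A"), ("u2", 5, "B"), ("u2", 9, "A"),
]
flows = od_matrix(records)
```

Because a phone is only located when it is used, real pipelines typically add filters (for instance a maximum time gap between the two records) before accepting a transition as a trip; such thresholds are omitted here.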

Open issues addressed in this thesis: In this thesis, we propose to use Big Data to overcome the drawbacks of traditional data sources, namely their lack of timeliness (temporal lag) and their limited spatial coverage of small areas. One of the main issues related to the use of Big Data in the field of official statistics concerns the technological aspects of data processing and the validation of data quality. It is also necessary to ensure that all aspects of data protection and ethical use of the results are respected. An interesting methodological perspective regarding the use of different data sources is the definition of new similarity metrics, so that the metric, rather than the algorithm, is adapted to the data.

Can we monitor and predict the socio-economic development of a territory just by observing the behavior of its inhabitants through the lens of Big Data? There are several initiatives within the official statistics community related to these issues. Many national statistical agencies and researchers are now developing, evaluating, and implementing poverty estimation and poverty-mapping methodologies, while also promoting Big Data methodologies and best practices. For example, the European Commission has funded projects such as SAMPLE (Small Area Methods for Poverty and Living Condition Estimates) and AMELI (Advanced Methodology for European Laeken Indicators) related to this topic. The EU-SILC provides information on household income for each of the sampled households: this information is fundamental for computing monetary poverty indicators, such as the Head Count Ratio (HCR), for any domain or area of interest. The HCR, also known as the At-Risk-of-Poverty Rate, measures the incidence of poverty. It is a particular case of the generalized measures of poverty introduced by Foster et al. [41].
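The relation between the HCR and the generalized poverty measures can be made explicit with a short sketch. The Foster-Greer-Thorbecke family is usually written as P_alpha = (1/N) * sum over poor units of ((z - y_i) / z)^alpha, where z is the poverty line and y_i the income of unit i; alpha = 0 gives the HCR. The function name, the unweighted form and the toy incomes below are assumptions of this example, and survey weights are ignored.

```python
def fgt(incomes, poverty_line, alpha=0):
    """Foster-Greer-Thorbecke poverty measure of order `alpha`.

    alpha = 0 gives the Head Count Ratio (share of units below the line),
    alpha = 1 the poverty gap, alpha = 2 the poverty severity.
    Units are unweighted in this sketch.
    """
    if not incomes:
        raise ValueError("incomes must not be empty")
    total = sum(
        ((poverty_line - y) / poverty_line) ** alpha
        for y in incomes
        if y < poverty_line  # only units below the line contribute
    )
    return total / len(incomes)


# Toy household incomes for one small area (illustrative values only).
incomes = [500, 800, 1200, 2000]
hcr = fgt(incomes, poverty_line=1000, alpha=0)
poverty_gap = fgt(incomes, poverty_line=1000, alpha=1)
```

With few sampled households per area, a direct estimate of this kind becomes unstable, which is the motivation for the small area estimation methods discussed in this thesis.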

These fascinating questions, also stimulated by the United Nations in recent reports [135,136], have attracted the interest of researchers from several disciplines, who started investigating the relations between human behavior and economic development based on extensive experimental datasets collected for completely different purposes [35].

¹ D4D challenge: http://d4d.orange.com/


2.1. Background

As a first result along this line, a seminal work exploited a nationwide mobile phone dataset to discover that the diversity of social contacts of the inhabitants of a municipality is positively associated with a socio-economic indicator of poverty, independently surveyed by the official statistics institutes [35]. This result suggests that social behavior, to some extent, is a proxy for the economic status of a given territory. However, little effort has been put into investigating how human mobility affects and is affected by the socio-economic development of an area. Theoretical works suggest that human movement is related to economic well-being, as it can nourish the economy and facilitate flows of people and goods, whereas constraints on the possibility of moving freely can diminish economic opportunities [69]. It is therefore reasonable to investigate the role of human mobility in the socio-economic development of a given territory.

How to measure social complexity? With a focus on the identification and quantification of social exclusion and deprivation, there is a demand for local-level estimates of the most important poverty and well-being indicators. The local level (local administrative area, the zone of local governance) constitutes a so-called unplanned domain of estimation in sample surveys. Oversampling to increase the sample size in the domains of interest could be a feasible solution for assessing poverty and deprivation at a local level, say at Local Administrative Units levels 1 and 2 (LAU 1 and LAU 2, levels in the Nomenclature of Territorial Units for Statistics used by Eurostat), as is often required by policymakers. However, the high cost in terms of time and financial resources makes this approach impractical for obtaining accurate estimates. Big Data can represent an alternative source of data for the same areas, usually reaching a very high level of geographical detail. Big Data can be analyzed from two alternative perspectives: either as data collected on a self-selected sample from the population (that is, under a survey design perspective) or not.

Measures from Big Data sources are usually obtained very quickly; however, they can be affected by a severe self-selection bias. Conversely, small area estimates are methodologically sound, but they require timely survey and population data that can be difficult to obtain. Comparing the two alternative sets of measures referring to the same areas can provide useful insights into the potential of Big Data to benchmark small area estimates. If Big Data and survey data agree on the recorded level of deprivation and poverty in a given small domain/area, then analysts and policymakers may rely on substantial evidence. Otherwise, if there is a discrepancy between the results obtained from the two sources of data, further investigation of those domains/areas is needed. A lot of effort has been put in recent years into the use of mobile phone data to study the relationships between human behavior and collective socio-economic development. A review of some works in this field of research follows.
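The benchmarking idea described above can be illustrated with a small sketch: given two sets of estimates for the same areas, it lists the areas where the two sources disagree by more than a tolerance. The function name, the area labels, the values and the tolerance of 0.05 are all hypothetical choices made for this example.

```python
def flag_discrepant_areas(survey_est, bigdata_est, tolerance=0.05):
    """Return the areas covered by both sources whose estimates differ by more than tolerance."""
    shared = sorted(set(survey_est) & set(bigdata_est))
    return [area for area in shared
            if abs(survey_est[area] - bigdata_est[area]) > tolerance]


# Hypothetical poverty-rate estimates per area from the two sources.
survey = {"A": 0.20, "B": 0.10, "C": 0.30}
bigdata = {"A": 0.22, "B": 0.25, "D": 0.40}
flagged = flag_discrepant_areas(survey, bigdata)
```

Areas that appear in only one source are skipped here; in practice the tolerance would more naturally be derived from the estimated variance of the small area estimates than fixed by hand.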

The seminal work by Eagle et al. [35] analyzes landline calls and a nationwide mobile phone dataset to show that, in the UK, regional communication diversity is positively associated with a socio-economic ranking. Gutierrez et al. [59] address the issue of mapping poverty with mobile phone data through the analysis of airtime credit purchases in Ivory Coast. Blumenstock [13] found a relationship between the history of mobile phone transactions and individual wealth, while [30] revealed a strong correlation between the consumption of food rich in vitamins and airtime purchase. Frias-Martinez et al. [42] analyzed the relationship between human mobility and the


Chapter 2. Setting the stage

socio-economic status of urban zones, building a model to predict the socio-economic level of urban zones from mobile phone data. Amini et al. [3] used mobile phone data to compare the human mobility patterns of a developing and a developed country, showing that cultural diversity in developing regions can present challenges to mobility models defined for less culturally diverse areas. Smith-Clarke et al. [123] analyzed the aggregated mobile phone data of two developing countries and extracted features that are strongly correlated with poverty indexes derived from official census data. Pappalardo et al. [95] analyze mobile phone data and obtain meaningful mobility measures for cities, discovering an interesting correlation between human mobility aspects and socio-economic indicators. Decuyper et al. [30] use mobile phone data to study food security indicators, finding a strong correlation between the consumption of vegetables rich in vitamins and airtime purchase.

Open issues addressed in this thesis: Studies on well-being reveal the limits of traditional sources in producing local estimates. An open issue is how to define an analytical process that, from Big Data, captures individual and collective indicators to measure the well-being of territories. It is also interesting to understand which of the features inferred from Big Data are most related to individual and collective well-being.

Can we measure the dynamics of cities with Big Data? In the domain of city and human dynamics, the main issue is to detect and characterize relevant and unusual events at the urban level and to estimate the composition of the population that attended them. The objective is to identify and isolate significant and unusual peaks of presences among all the mobile signals registered during the day, and to quantify and understand the events and their impact on the city dynamics and population composition.

Over the last few years, several authors have proposed the use of Big Data for the exploration of city dynamics. Ratti et al. in [112] illustrated several urban phenomena of the city of Graz by using different types of mobile phone records: CDRs for traffic intensity, handovers for traffic migration, and volunteers’ tracks for reconstructing individual movements. In [19], the authors reviewed several techniques for extracting knowledge from mobile phone data to perform a global sensing of cities and reason about the mechanisms driving city life. Pollution at the urban level has also been studied with the support of mobile phone data: in [48], for example, Gariazzo et al. used the records to estimate the population density and draw a distribution map over the city to evaluate the actual population exposure to pollutants. User categorization represents a further semantic layer that enriches the information provided by every single CDR, thus introducing a novel approach with respect to [58] and [32]. The former proposes a method to detect unusual events relying on users’ mobility profiles, considering each antenna the user connected to as a location; unusual crowds are then identified by detecting users who gather in areas uncommon for them, according to their mobility profiles. The latter proposes a supervised approach to learn the pattern of an event; this solution relies on the availability of a list of known events.

Studies from different disciplines document a stunning heterogeneity of human travel patterns [53, 97], and at the same time observe a high degree of predictability [34,126]. The patterns of human mobility have been used to build generative models of individual human mobility and human migration flows [67, 94, 98, 99, 121], to construct methods for profiling individuals according to their mobility patterns [94], to discover geographic borders according to recurrent trips of private vehicles [114], to predict the formation of social ties [21, 137], and to predict the kind of activity associated with individuals’ trips on the sole basis of the observed displacements [66, 72, 115]. There are widely accepted mobility models and measures, e.g., radius of gyration [53,94], mobility entropy [95, 126], individual mobility networks [66, 115] and origin-destination matrices [114], that can be used to study different aspects of both individual and collective mobility.
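As an illustration of two of the measures mentioned above, the sketch below computes a radius of gyration and a mobility entropy from a list of visited locations. It assumes one (latitude, longitude) pair per observation, approximates the center of mass by the arithmetic mean of the coordinates (reasonable only for small regions), and uses the Shannon entropy of visit frequencies, which is one of several entropy variants in the literature. All function names are hypothetical.

```python
import math
from collections import Counter


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def radius_of_gyration(visits):
    """Root mean square distance (km) of the visits from their center of mass."""
    n = len(visits)
    # Approximation: center of mass as the plain mean of latitudes and longitudes.
    center = (sum(p[0] for p in visits) / n, sum(p[1] for p in visits) / n)
    return math.sqrt(sum(haversine_km(p, center) ** 2 for p in visits) / n)


def mobility_entropy(visits):
    """Shannon entropy (bits) of the visit frequency over distinct locations."""
    n = len(visits)
    counts = Counter(visits)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A user observed equally often at two towers one degree of longitude apart.
visits = [(0.0, 0.0), (0.0, 1.0)]
rg = radius_of_gyration(visits)
entropy = mobility_entropy(visits)
```

A user who is always seen at the same location has a radius of gyration and an entropy of zero; both values grow as the visits spread over more, and more distant, locations.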

Lotero et al. [77] analyze the architecture of urban mobility networks in two Latin-American cities from the multiplex perspective. They discover that the socio-economic characteristics of the population have an extraordinary impact on the layer organization of these multiple systems.

Nowadays Europe is the second most urbanized continent after Latin America: about 70% of the population (mostly in the 20-64 age group) lives in urban areas [38]. Understanding the spatial organization of similar regions and how places are linked to each other can improve analytical approaches to governance challenges such as the economic development of complex nationwide systems. Indeed, policymakers are paying increasing attention to the role of homogeneous economic agglomeration and to the capacity of local areas to contribute to social growth [64]. It is therefore essential to use human movements to study the boundaries of cities from a functional point of view.

The gold standards for the territorial organization of large geographical areas into functional units are experiencing a steady change of perspective, due to the dynamics behind the growth and interconnection of cities. The traditional interpretation of the urban hierarchy refers merely to the size of the city, with its population and boundaries. From the theoretical point of view, a slightly different perspective is given by the concept of polycentrism [16]: urban areas are often evolving from mono-centric agglomerations to more complex systems made of integrated urban centers (cores) and sub-centers. In other territories, some cities and towns are increasingly linking up, forming polycentric integrated areas. Moreover, the contraction of public expenditure and the subsequent need for a more efficient use of available resources in the public sector have driven a process of service concentration towards denser urban areas. The accurate delineation of service provision areas using actual mobility patterns can help increase the efficiency of public administrations without marginalizing the surrounding territories.

From a statistical and economic point of view, Boix et al. [14] illustrate different methodologies used to solve the problem of redefining urban areas. Among these, it is worth mentioning Dynamic Metropolitan Areas (DMA), specifically designed to deal with the characteristics of polycentricity. It uses the origin-destination matrix derived from commuting flow data, collected from the census, as the primary measure of interaction. The first stage of the DMA algorithm has a top-down approach: it identifies first-order centers (seeds) that have at least 50,000 inhabitants and merges with them the surrounding municipalities from which at least 15% of the inhabitants commute to the seed. From a data science perspective, several group discovery methods might be applied, basically following either a clustering or a network-based perspective.
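The first stage of the DMA algorithm described above can be sketched as follows. This is a simplified illustration, not the procedure of Boix et al.: it takes the two thresholds from the text (50,000 inhabitants and 15% of commuters), ignores spatial contiguity, and assigns a municipality to the seed that receives its largest commuting share. The data structures and names are hypothetical.

```python
def dma_first_stage(population, commuters, min_seed_pop=50_000, min_share=0.15):
    """Group municipalities around seeds.

    population: {municipality: inhabitants}
    commuters:  {(origin, destination): number of commuters}
    Returns {seed: set of municipalities merged with it, seed included}.
    """
    seeds = sorted(m for m, pop in population.items() if pop >= min_seed_pop)
    areas = {seed: {seed} for seed in seeds}
    for m, pop in population.items():
        if m in areas or pop == 0:
            continue
        # Pick the seed receiving the largest share of m's inhabitants.
        best_seed, best_share = None, 0.0
        for seed in seeds:
            share = commuters.get((m, seed), 0) / pop
            if share > best_share:
                best_seed, best_share = seed, share
        if best_seed is not None and best_share >= min_share:
            areas[best_seed].add(m)
    return areas


# Hypothetical toy data: one seed and two smaller towns.
population = {"Core": 80_000, "T1": 10_000, "T2": 5_000}
commuters = {("T1", "Core"): 2_000, ("T2", "Core"): 500}
areas = dma_first_stage(population, commuters)
```

Here T1 sends 20% of its inhabitants to the seed and is merged, while T2 sends only 10% and stays outside.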


Clustering methods build groups by putting together objects that are similar to each other under some specific notion of similarity. The three classical and most frequently adopted examples are: k-means, representing a family of partitioning methods that create compact clusters, trying to minimize the diversity within a cluster and to maximize it across different clusters; hierarchical clustering, producing several different partitionings at various levels of aggregation; density-based clustering, which puts together groups of objects that form locally dense areas, not enforcing any constraint on the size and shape of clusters (as opposed to k-means, for instance).

Network-based methods define groups as communities, i.e., groups of linked nodes that share common properties. Different approaches solve slightly different instantiations of the problem, optimizing different objective functions [26, 39]. For this reason, identifying the right community detection approach for a specific context is mandatory for finding substructures from which to derive fruitful insights. In the context of territorial partitioning, community discovery has become an essential tool for decision makers who need to study complex social systems, e.g., by grouping together municipalities showing similar characteristics [64]. In this sense, community discovery has analogies with the clustering problem. Indeed, by adopting a community discovery approach, we can obtain, in a bottom-up way, an unsupervised classification of territories.

Open issues addressed in this thesis: In the context of city dynamics, it is crucial to define new tools for planning and for measuring the attractiveness of cities and the interactions among them. The challenge is, therefore, to apply real-time demography methodologies to observe the presence and flow of individuals, and to define new methods to investigate the dynamics of cities from a functional perspective. All the methods presented above adopt standard and generic objective functions that might not fit the specific objectives we are pursuing.

Summary of the contribution of the thesis: Despite an increasing interest in this field of research, a review of the state of the art cannot avoid noticing that there is no unified methodology to exploit Big Data for Official Statistics. It is also surprising that widely accepted measures of human mobility (e.g., radius of gyration [53] and mobility entropy [126]) have not been used so far. The first proposal of the thesis is to use a continuous stream of mobile phone data over a wide area to build a new Official Statistics tool for real-time demography. To put the framework to work, the thesis presents the scientific contributions made to validate the tool and to address ethical and scalability issues. The second proposal is a Big Data analytical framework supporting official statistics, which allows for a systematic evaluation of the relations between relevant aspects of human behavior and the development of a territory. Finally, I propose to use real-time demography to discover, understand and characterize city events from mobile phone data, providing city mobility managers with a useful tool to manage activities and take future decisions regarding security and mobility. Furthermore, I propose a data analysis approach aimed at identifying functional mobility areas in an entirely data-driven way.


2.2 Data sources

In this thesis, several data sources have been used to show how the digital tracks we leave can be used in different application contexts. Chapter 3 shows how to use mobile phone data to measure the presence of users within a territory through Data Mining techniques. Chapter 4 shows how to study the link between mobility, sociality and regional development. Finally, Chapter 5 presents an application within the territory of Rome, where several events have been detected using one year of data.

Chapter 4 also shows how to use GPS tracks as a covariate of a model for estimating levels of poverty at the local level, and Chapter 5 shows how to use the same source to identify the borders of cities following a functional mobility approach.

The last data source used in the thesis is Market Retail Data. Chapter 4 shows how to measure the impact of the economic crisis by observing purchasing data, and then presents a methodology to infer inflation at the district level by analyzing the real basket of individuals. Lastly, Chapter 3 demonstrates how the temporal purchasing habits of individuals make it possible to study the presence of tourists in a territory.

Mobile phone data: Between 2013 and 2017, mobile phone penetration was expected to rise from 61.1% to 69.4% of the global population, according to several reports [138]. A common assumption is that the position of a mobile device is the position of its user [111]. As a consequence, it is possible to use such information in many different domains and in ways for which it was not meant. For instance, such data have been successfully used for monitoring traffic [20] or tourist movements [91].

The CDRs, generally collected by mobile phone operators for billing and operational purposes, contain an enormous amount of information on how, when, and with whom people communicate. This wealth of information makes it possible to capture different aspects of human behavior and has stimulated the creativity of scientists from different disciplines, who demonstrated that CDRs are a high-quality proxy for studying individual mobility and social ties [53, 92]. Table 2.1 illustrates an example of the structure of CDRs.

CDRs collect geographical, temporal and interaction information on mobile phone use and show an enormous potential to empirically investigate human dynamics on a society-wide scale [62]. Each time an individual makes a call, the mobile phone operator registers the connection between the caller and the callee, the duration of the call and the coordinates of the phone tower communicating with the served phone, which makes it possible to reconstruct the user’s time-resolved trajectory. Therefore, we can rebuild the path of a user from the time-ordered list of cell phone towers from which she made her calls during the period of observation (Fig. 2.1).
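The reconstruction of a trajectory from CDRs can be sketched in a few lines. The example assumes records shaped like the columns of Table 2.1 (timestamp, tower, caller, callee) and a lookup table from tower identifier to coordinates; the function name and the second record of the sample user are hypothetical, added only so that the trajectory has two points.

```python
from datetime import datetime

# Tower coordinates, in the form of Table 2.1 (b).
towers = {36: (49.54, 3.64), 37: (48.28, 1.258), 38: (48.22, -1.52)}

# CDRs in the form of Table 2.1 (a); the second record is hypothetical.
cdrs = [
    ("2007/09/10 23:34", 36, "4F80460", "4F80331"),
    ("2007/09/10 08:15", 38, "4F80460", "2B01359"),
    ("2007/10/10 01:12", 36, "2B01359", "9H80125"),
]


def user_trajectory(cdrs, towers, user):
    """Time-ordered list of (timestamp, (lat, lon)) for the calls made by user."""
    calls = [
        (datetime.strptime(ts, "%Y/%m/%d %H:%M"), tower)
        for ts, tower, caller, _callee in cdrs
        if caller == user
    ]
    calls.sort()  # order the calls by time
    return [(ts, towers[tower]) for ts, tower in calls]


trajectory = user_trajectory(cdrs, towers, "4F80460")
positions = [pos for _ts, pos in trajectory]
```

Each point of the trajectory is the position of a tower, not of the user, so the spatial resolution is bounded by the density of towers in the area.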

CDR data have been extensively used to study human mobility due to the following advantages: they provide a means of sampling user locations at large population scales; they can be retrieved for different countries and geographic scales given their worldwide diffusion; they offer an objective concept of place, i.e., the phone tower. Nevertheless, it is worth noting that CDRs suffer from different types of bias [63, 109]: (i) the position of an individual is known only at the granularity level of phone towers; (ii) the location of an individual is known only when she makes a phone call; and (iii) phone calls are sparse in time, i.e., the time between consecutive calls follows a heavy-tailed distribution [9, 53]. In other words, since individuals are inactive most of the time, CDRs allow reconstructing only a subset of the mobility of an individual. Several works in the literature study the bias in CDR data by comparing the mobility patterns observed on CDR data to the same patterns seen on GPS data [94,96,97,100] or handover data (data capturing the location of mobile phone users recorded every hour or so) [53]. The studies agree that the bias in CDR data does not significantly affect the study of human mobility patterns.

timestamp         tower  caller   callee
2007/09/10 23:34  36     4F80460  4F80331
2007/10/10 01:12  36     2B01359  9H80125
2007/10/10 01:43  38     2B19935  6W1199
...               ...    ...      ...

(a) Call Data Record

tower  latitude  longitude
36     49.54     3.64
37     48.28     1.258
38     48.22     -1.52
...    ...       ...

(b) Geodata

Table 2.1: Example of Call Detail Records (CDRs). Every time a user makes a call, a record is created with the timestamp, the phone tower serving the call, the caller identifier and the callee identifier (a). For each tower, the latitude and longitude coordinates are available to map the tower on the territory (b).

Figure 2.1: The detailed trajectory of a single user. The phone towers are shown as grey dots, and the Voronoi lattice in grey marks the approximate reception area of each tower. CDRs record the identity of the closest tower to a mobile user; thus, we cannot identify the position of a user within a Voronoi cell. The trajectory describes the user’s movements during four days (each day in a different color). The tower where the user made the highest number of calls during nighttime is depicted in bolder grey.

Vehicular tracks: In recent years there has been a strong increase in the digital traces left by devices installed on private cars. These data sources help mobility managers study individual mobility habits and propose sustainable mobility plans. Thanks to the great detail of the information collected, appropriate techniques can extract from GPS data the systematic and/or occasional mobility profiles of each individual [115] [130]. This makes carpooling applications [57] [131] and improvements to public transportation [44] feasible.

vid  timestamp            latitude   longitude
63   2014-06-18 06:31:24  43.557703  10.337913
63   2014-06-18 06:31:26  43.557725  10.33794
63   2014-06-18 06:31:27  43.557735  10.337955
...  ...                  ...        ...

Table 2.2: Example of GPS Records. The collected GPS data consist of the sequence of space-time detections of vehicles on which the positioning device is installed. Every time a vehicle switches on, a record is created consisting of the vehicle identifier, the timestamp, and the latitude and longitude coordinates.

Figure 2.2: The detailed trajectory of a single user. The start point of each trajectory is a red dot; the end point is represented as a red triangle. The trajectory describes the user’s movements during several days (each day in a different color).

GPS data are very promising but, at the same time, may present some bias. The first issue concerns the representativeness of human behavior: although such data are collected over a portion of the population, it has been shown that under certain conditions these data are highly significant [97]. The second is that the data do not refer to a particular topological structure: for some applications it is therefore necessary to map the data onto road arcs [22].

Several companies specialize in the provision of telematics systems and services for the insurance and automotive market. For commercial purposes or to prevent fraud, insurance companies offer their customers favorable terms in exchange for the adoption of the tracking device. This system provides better protection against theft, as well as an objective measurement of the operating mode of the vehicle. The company in charge of tracking a fleet of cars via on-board GPS devices provided KDD Lab with a sample of such data for the Tuscan area. Tab. 2.2 reports an example of the data collected by the device installed on the car.

The collected GPS data consist of the sequence of space-time detections of vehicles on which the positioning device is installed. This device is responsible for continuously rebuilding the geographic location of the vehicle and for communicating regularly with the central system. The information is transmitted at regular intervals so as to reduce the volume of transmitted data. The device switches on and off automatically together with the vehicle, so tracking is carried out without interruption. Through continuous observation of cars we are able to reconstruct individual or collective trips (Fig. 2.2).
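Since the device is active only while the vehicle is switched on, one simple way to reconstruct trips is to cut the sequence of detections wherever a long pause occurs. The sketch below does this for a single vehicle; the 20-minute threshold and the function name are assumptions of this example, and the last sample point is hypothetical (the first three come from Table 2.2).

```python
from datetime import datetime, timedelta


def split_into_trips(points, max_gap=timedelta(minutes=20)):
    """Split one vehicle's (timestamp, lat, lon) points into trips at long pauses."""
    trips = []
    for point in sorted(points):  # tuples sort by timestamp first
        if trips and point[0] - trips[-1][-1][0] <= max_gap:
            trips[-1].append(point)  # close enough in time: same trip
        else:
            trips.append([point])  # first point, or a pause: new trip
    return trips


points = [
    (datetime(2014, 6, 18, 6, 31, 24), 43.557703, 10.337913),
    (datetime(2014, 6, 18, 6, 31, 26), 43.557725, 10.33794),
    (datetime(2014, 6, 18, 6, 31, 27), 43.557735, 10.337955),
    (datetime(2014, 6, 18, 18, 0, 0), 43.60, 10.40),  # hypothetical evening point
]
trips = split_into_trips(points)
```

The first point of each trip is its origin and the last its destination, which is what Fig. 2.2 marks with a dot and a triangle.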

Market retail data: The availability of an enormous quantity of retail data stimulates more and more challenging questions, which can be answered by an in-depth and smart analysis of different aspects of customers’ shopping sessions. Retail data are an extraordinarily complex type of data: they contain different dimensions that can be analyzed from many points of view. The main dimensions are what customers buy, when and where they make their purchases, and how relevant each purchase is. The choice of analyzing one set of dimensions rather than another depends on the kind of phenomena to be investigated: considering all the aspects in the same analysis can lead to very complex models or too weak generalizations. Most works in the literature focus on capturing and understanding behaviors and habits by analyzing what customers buy [1, 68], and a few of them have also exploited the temporal dimension as a feature for enriching models based primarily on the items purchased [55, 74, 86].

One of the datasets used in this thesis consists of economic transactions collected by COOP, a system of Italian consumers’ cooperatives that owns the largest supermarket chain in Italy. The company operates 138 shops on the west coast of Italy and serves millions of customers every year, the vast majority of whom are identifiable through fidelity cards, with almost a million unique clients. The whole dataset contains retail market data in a time window from January 1st, 2007 to December 31st, 2016, with almost 280 billion product scans (see footnote 2).

Some statistics on the habits that emerge from the data are shown below. July, August, and December are the months in which most purchases are made. It should be noted that the chain is distributed along the Tyrrhenian coast, so sales can be driven by tourism during the summer. The increase in purchases during December is because the chain also sells non-food items, which can be given as gifts at Christmas (Fig. 2.3).

Fig. 2.4 reports the weekly shopping habits. Saturday is the day when most shopping is done, while the time slots show a big difference between weekdays and weekends: on weekdays the peak is after 17:00, while on weekends it is between 11:00 and 13:00, because buying habits are driven by individuals’ working times.
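An hourly profile of the kind summarized above can be obtained by counting purchases per hour, separately for weekdays and weekends. The sketch below is a minimal illustration on hypothetical timestamps; it does not reproduce the actual processing of the retail dataset.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(timestamps):
    """Count purchases per hour of the day, separately for weekdays and weekends."""
    profile = {"weekday": Counter(), "weekend": Counter()}
    for ts in timestamps:
        kind = "weekend" if ts.weekday() >= 5 else "weekday"  # 5 = Saturday, 6 = Sunday
        profile[kind][ts.hour] += 1
    return profile


# Hypothetical purchase times: two on a Wednesday, one on a Saturday.
purchases = [
    datetime(2016, 6, 15, 17, 30),
    datetime(2016, 6, 15, 17, 45),
    datetime(2016, 6, 18, 11, 10),
]
profile = hourly_profile(purchases)
weekday_peak_hour = profile["weekday"].most_common(1)[0][0]
```

Dividing each counter by its total would give the normalized distributions that make the weekday and weekend curves directly comparable.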

Footnote 2: KDDLab created a very large sales data warehouse, equipped with advanced BI solutions for strategic marketing and customer care based on data mining predictive analytics. The dataset, prepared over the years thanks to the valuable contribution of Pennacchioli and Guidotti, has been the object of several publications including: i) Pennacchioli, D., Coscia, M., Rinzivillo, S., Giannotti, F., & Pedreschi, D. (2014). The retail market as a complex system. EPJ Data Science, 3(1), 33; ii) Guidotti, R., Coscia, M., Pedreschi, D., & Pennacchioli, D. (2015, October). Behavioral entropy and profitability in retail. In Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference on (pp. 1-10). IEEE.


(a) Distribution of purchases by month (b) Geographical distributions of shops

Figure 2.3: UNICOOP-TIRRENO revenues are higher in July, August and December. In the summer, the increase is due to the location of the shops, many of which are in tourist locations. The increase during Christmas is common to many supermarkets; it is worth noting that the UNICOOP chains also sell non-food products.

(a) Day of week (b) Weekdays vs Weekend

Figure 2.4: Hourly distribution of purchasers. UNICOOP-TIRRENO revenues are higher in the late afternoon during weekdays while they are higher in the morning on weekends. The conclusion is that spending at the supermarket depends a lot on the working hours of the customers and their free time.


Chapter 3. Real Time Demography

The Big Data originating from our daily activities let us observe the individual and collective behavior of people at an unprecedented level of detail. Many dimensions of our social life have Big Data “proxies”, such as mobile call data for mobility [43]. In this chapter we investigate how to use Big Data to implement new tools for real-time demography. The study has been jointly developed by ISTAT, CNR, the University of Pisa and the “ISTAT Big Data Commission”.

The Italian National Institute of Statistics (ISTAT) uses the census to measure the habits and lifestyles of citizens, but administrative data do not contain information on the frequency of mobility. The idea is to specify an estimation method, based on mobile phone data, that defines for each municipality the stock of people who live in, work in or visit the city.

The work presented in this chapter deals with scalability and ethical issues. The following publications are related to this chapter:

1. Guidotti, R., & Gabrielli, L. (2017, July). Recognizing Residents and Tourists with Retail Data Using Shopping Profiles. GOODTECHS 2017.

2. Cintia, P., Gabrielli, L., Giannotti, F., Monreale, A., & Pratesi, F. Privacy-Aware Sociometer: a Mitigation Strategy for Quantification of City Users. Ready to be submitted to Data & Knowledge Engineering (DKE).

3. Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., & Ricci, L. (2016, December). Scalable and flexible clustering solutions for mobile phone based population indicators. International Journal of Data Science and Analytics.

4. Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., & Ricci, L. (2016, June). Improving population estimation from mobile calls: a clustering approach. In 2016 IEEE Symposium on Computers and Communication (ISCC) (pp. 1097-1102). IEEE.

5. Gabrielli, L., Furletti, B., Trasarti, R., Giannotti, F., & Pedreschi, D. (2015, October). City users’ classification with mobile phone data. In Big Data (Big Data), 2015 IEEE International Conference on (pp. 1007-1012). IEEE.

3.1 Problem Definition: Population Estimation

For data mining purposes, Call Data Records (CDRs) have proved to be significant in terms of size and representativeness of the sample [43]. Thanks to this source of data we are able to localize people and to build support tools for applications in several domains such as health care, coordination of social groups, transportation and tourism [30, 75, 89]. It is nevertheless necessary to demonstrate that the measurements carried out with alternative sources are statistically significant. Achieving success in this direction would make it possible to safely integrate existing population and flow statistics with the continuously up-to-date estimates obtained from CDR data, and would thus be a first step towards exploiting Big Data in Official Statistics.

The first objective of this work is to correctly estimate the population that belongs to each of the following categories, already defined by ISTAT in the ongoing project “Persons and Places” using administrative data. Each class represents a different way of experiencing the territory and, correspondingly, a different usage of its resources.

• Standing residents in A: residents who have formal residence and place of work (study) in the same municipality A, or who do not work (study).

• Embedded city users in A: people who spend long periods working (studying) in a municipality A (e.g., most days of the week), while being formally resident in another municipality, different from A.

• Daily city users in A: people who commute to municipality A, having a formal residence in another municipality, different from A. The category describes people who move at high frequency, not strictly daily, for work or for study to the place under examination.

• Visitors in A: people who spend short periods visiting A.

Since mobile phone usage implies presence in a territory, the four classes defined above can be easily translated into classification rules. Given the call behavior of an individual, we automatically assign it to the proper category, taking into account some differences. In particular, compared to the classes adopted in the ISTAT project “Persons and Places”, the lack of administrative information about CDR users does not allow us to distinguish between Standing residents and Embedded city users, since in practice their physical presence in the residence/embedded area tends to be identical. On the other hand, the physical presence of users makes it easy to distinguish (at least in principle) Dynamic from Static residents, since the former are usually not present in the residential municipality during working hours. This small mismatch between the two classifications will be considered when we compare the population estimates obtained with the two methods (the one based on official data vs. the one based on CDR data).

We define the following categories, which we characterize by observing the call behavior of people.

• Residents (or Static Residents): individuals who live and work in the same area, so that their presence in a specific municipality is significant across the whole day and in all time slots.

• Dynamic Residents: people who reside in some municipality A but work in a different one (B). The presence in A is expected to be always significant, except during working days and working hours.

• Commuters: people who work or study in municipality A while residing (as Dynamic Residents) in a different municipality B. The presence in A is expected to be almost exclusively concentrated during working days and working hours.

• Visitors: people that visit a municipality only once or a few times.

Starting from the calling behavior, by applying an ad hoc analytical process, we can quantify the presence of different categories of individuals within a territory.

3.2 Sociometer: a Solution to Real-Time Demography

The Sociometer is a Data Mining process [43] that classifies the call behavior of people in order to quantify the different types of city users within a territory.

The basic statistical unit of our analytical process is the Individual Call Profile (ICP). The ICPs of a user are the set of aggregated spatiotemporal profiles computed by applying spatial and temporal rules to the raw CDRs. The structure of an ICP is a matrix of the type shown in Fig. 3.1. The temporal aggregation is by week, where the days of each week are grouped into weekdays and weekend. Given, for example, a temporal window of 28 days (4 weeks), the resulting matrix has eight columns (two for each week: one for the weekdays and one for the weekend). A further temporal partitioning is applied to the hours of the day: a day is divided into several time slots representing interesting times of the day, and each time slot adds a row to the matrix. The numbers in the matrix represent the number of events (in this case, presences of the user) performed by the user in a particular period within a specific time slot. For instance, the more intense the color of a cell in Fig. 3.1, the more distinct weekdays or weekend days the individual was present in the area of interest in that period and time slot. Please note that the measured presence depends on call events: if an individual is in the territory but does not make any call, the corresponding ICP will be empty.
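The construction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the record layout `(user, municipality, timestamp)`, the function name `build_icps`, and in particular the time-slot boundaries in `TIMESLOTS` are assumptions, since the text does not state them. The sketch also assumes that the observation window starts on a Monday and that a cell counts the distinct days with at least one call event.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical time slots (start hour inclusive, end hour exclusive); the
# actual boundaries used by the Sociometer are not given in the text.
TIMESLOTS = [(0, 8), (8, 19), (19, 24)]


def build_icps(cdrs, window_start, n_weeks=4):
    """Return {(user, municipality): matrix}, one row per time slot and two
    columns per week (weekdays, weekend).

    Each cell counts the distinct days on which the user had at least one
    call event in that municipality, time slot, and period.
    Assumes window_start is a Monday.
    """
    # Collect distinct (day, slot) presences first, so that several calls in
    # the same slot of the same day are counted once.
    presences = defaultdict(set)
    for user, municipality, ts in cdrs:
        day_offset = (ts.date() - window_start).days
        if not 0 <= day_offset < 7 * n_weeks:
            continue  # event outside the observation window
        slot = next(i for i, (lo, hi) in enumerate(TIMESLOTS) if lo <= ts.hour < hi)
        presences[(user, municipality)].add((ts.date(), slot))

    icps = {}
    for key, day_slots in presences.items():
        matrix = [[0] * (2 * n_weeks) for _ in TIMESLOTS]
        for day, slot in day_slots:
            week = (day - window_start).days // 7
            is_weekend = day.weekday() >= 5  # Saturday = 5, Sunday = 6
            matrix[slot][2 * week + int(is_weekend)] += 1
        icps[key] = matrix
    return icps


# Toy CDRs (invented for illustration only).
cdrs = [
    ("u1", "A", datetime(2018, 1, 1, 9, 30)),  # Monday, week 0, slot 1
    ("u1", "A", datetime(2018, 1, 1, 10, 0)),  # same day and slot: counted once
    ("u1", "A", datetime(2018, 1, 2, 9, 0)),   # Tuesday, week 0, slot 1
    ("u1", "A", datetime(2018, 1, 6, 21, 0)),  # Saturday, week 0, slot 2
    ("u1", "B", datetime(2018, 1, 8, 3, 0)),   # Monday, week 1, slot 0
]
icps = build_icps(cdrs, window_start=date(2018, 1, 1))
```

With a 4-week window each matrix has eight columns, and a user who never calls from a municipality simply has no ICP for it, which mirrors the remark that presence without call events is not observed.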

During the modeling phase of the ICP, we tried different configurations of the time slot definition. The configuration reported here is the one we found acceptable, and it was also confirmed by a comparison with the domain expert. Empirical results have shown that changing the structure of the ICP causes some categories of users to be confused or lost. When the time slots were reduced to two, we verified that dynamic_resident and commuter could no longer be properly recognized. On the other hand, increasing the number of time slots makes it more difficult to identify similar clusters.


Chapter 3. Real Time Demography

Figure 3.1: Individual call profile (ICP)

The analytical process consists of several phases (Fig. 3.2). The first phase is ICP Building. In the second phase we group similar ICPs (Prototypes Extraction). The third phase labels the centroid of each cluster computed in the previous phase according to the definitions introduced above (Prototype Labeling). The last phase is Label Propagation: each point of each cluster is labeled by propagating the label of its stereotype. At the end of the process, we can quantify the stock of individuals present in the area and the flows of individuals among the different regions.

In detail, ICP Building computes the ICPs from the CDRs. During this transformation, each column of the matrix is normalized by the number of days it covers (i.e., 5 for weekdays and 2 for the weekend). Then the columns of each ICP are rearranged so that the weeks with more presences come first (leftmost) and those with fewer presences come last. This makes it easier to compare the profiles of different users whose calls may be concentrated in different weeks.
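The two transformations can be illustrated with a short sketch. The function name `normalize_and_sort_weeks` is hypothetical, and two details are assumptions not fixed by the text: the presence of a week is measured as the sum of its normalized cells, and the weekday and weekend columns of a week are moved together.

```python
def normalize_and_sort_weeks(icp):
    """Normalize an ICP and order its weeks by decreasing presence.

    `icp` is a list of rows (time slots); columns alternate
    weekdays, weekend, weekdays, weekend, ... one pair per week.
    """
    n_weeks = len(icp[0]) // 2

    # Divide weekday columns (even index) by 5 and weekend columns (odd) by 2.
    normalized = [
        [value / (2 if col % 2 else 5) for col, value in enumerate(row)]
        for row in icp
    ]

    # Assumption: a week's presence is the sum of its normalized cells.
    def week_presence(week):
        return sum(row[2 * week] + row[2 * week + 1] for row in normalized)

    order = sorted(range(n_weeks), key=week_presence, reverse=True)

    # Rebuild each row, keeping the (weekdays, weekend) pair of a week together.
    return [
        [row[2 * week + part] for week in order for part in (0, 1)]
        for row in normalized
    ]


# Toy ICP with two time slots and two weeks; the second week is the busier one.
raw_icp = [
    [1, 0, 5, 2],
    [0, 0, 5, 0],
]
sorted_icp = normalize_and_sort_weeks(raw_icp)
# sorted_icp == [[1.0, 1.0, 0.2, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

After this step, two users with the same habits but active in different calendar weeks produce the same matrix, which is what makes their profiles comparable in the clustering phase.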

In Prototypes Extraction, a set of clusters is extracted from the ICPs by using the K-means algorithm. The corresponding k centroids, called stereotypes, represent the characteristic behaviors of the population. In Prototype Labeling, the domain expert manually labels all k stereotypes according to the city user categories mentioned above. The choice of the parameter k is fundamental for identifying high-quality clusters. Empirical studies on our data have shown that k = 100 is a good compromise for identifying a sufficient variety of behaviors. A higher number does not help to capture significant variations in behavior.
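As an illustration of the extraction step, the following is a minimal pure-Python sketch of Lloyd's iteration for K-means on flattened ICPs. It is not the implementation used in the thesis: the function names, the random initialization, and the handling of empty clusters are assumptions, and the toy data use k = 2 rather than the k = 100 adopted in our experiments.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50, seed=0):
    """Cluster `points` (flattened ICPs) and return (centroids, assignment)."""
    rng = random.Random(seed)
    # Initialize the centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy flattened profiles: two users with high presence, two with low presence.
points = [[1.0, 1.0], [0.9, 1.0], [0.0, 0.1], [0.1, 0.0]]
stereotypes, assignment = kmeans(points, k=2)
```

The returned centroids play the role of the stereotypes; in practice a library implementation of K-means would normally be preferred over this sketch.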

Finally, the Label Propagation sub-task applies the classification model to classify new instances.
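A possible reading of this step is a nearest-stereotype classifier: each ICP receives the label of the closest labeled stereotype, and the labels are then counted per municipality to obtain the stock of each category. The sketch below follows that reading; the function names, the Euclidean distance, and the two-value toy profiles are assumptions made only for illustration.

```python
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def propagate_labels(profiles, stereotypes, stereotype_labels):
    """Give each (user, municipality) profile the label of its nearest stereotype."""
    labels = {}
    for key, profile in profiles.items():
        nearest = min(
            range(len(stereotypes)),
            key=lambda i: squared_distance(profile, stereotypes[i]),
        )
        labels[key] = stereotype_labels[nearest]
    return labels


# Toy stereotypes, already labeled; each vector is a drastically simplified
# profile (weekday presence, weekend presence) used only for illustration.
stereotypes = [[1.0, 1.0], [1.0, 0.0], [0.1, 0.0]]
stereotype_labels = ["resident", "commuter", "visitor"]

profiles = {
    ("u1", "A"): [0.9, 1.0],
    ("u2", "A"): [0.8, 0.0],
    ("u3", "A"): [0.2, 0.0],
    ("u1", "B"): [0.0, 0.5],
}
labels = propagate_labels(profiles, stereotypes, stereotype_labels)

# Stock of each category per municipality.
stock = Counter((municipality, label) for (_, municipality), label in labels.items())
```

Here the same user `u1` is a resident in A and a visitor in B, which shows why profiles are kept per user and per municipality.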

Concerning the flow of Fig. 3.2, let D be the initial dataset, U the set of users and M the number of municipalities. The ICP Building process has a complexity of O(D + (U ∗ M)), calculated as the sum of the cost of reading the input and the cost of constructing the output. The first term is the cost of scanning the entire dataset D; the second reflects that, in the worst case, an ICP must be built for each of the U users in each of the M municipalities present in our dataset.

The Label Propagation step has a complexity bounded by O(U ∗ M).


Figure 3.2: Sociometer. Starting from raw call data records, in phase 1 we build an Individual Call Profile (ICP) for each user and each zone. Then we apply a clustering algorithm to group users with similar behavior (phase 2). From each cluster we extract a centroid (phase 3), and we label it w.r.t. the closest representative archetype (phase 4).


The Sociometer described above is a sequential process that reads the input data from a relational database and runs on a personal computer. Although its classification performance is good, this implementation has critical limitations regarding computational time and the ability to process large amounts of data. A continuous stream of mobile phone data over a wide area could be used by Official Statistics, given a new tool that makes real-time demography possible. To put the Sociometer to work as a tool for official statistics, the following sections present the scientific contributions made during the Ph.D. for validating the stock of presences measured with the tool and for addressing scalability and ethical issues.

3.3 Extending the Sociometer Engine

The first version of the Sociometer, as a tool for monitoring the population present in a territory, is presented in [43]. That work defines the characteristics that a tool must have in order to achieve Real Time Demography through the use of mobile phone data. The following sections describe the work done during the Ph.D.: Sec. 3.3.1 discusses a semi-automatic approach to classifying call behavior [45], Sec. 3.3.2 addresses the scalability issues [79], Sec. 3.3.3 deals with ethical issues [23], and Sec. 3.3.4 discusses how to extend the methodology to market retail data.

3.3.1 Semi Automatic Annotation of Prototypes

The most direct strategy to label the stereotypes obtained with clustering (Fig. 3.2, phase 3) is manual labeling performed by an expert, as shown in Fig. 3.2, phase 4. This human activity (i) introduces a bias in the process, because different experts may generate different labelings, (ii) makes the systematic re-computation of the model very expensive and complicated, and (iii) requires different domain experts for different geographic areas or contexts.

To limit the dependence of the Sociometer on human support, in [45] we propose a semi-automatic strategy in which a set of archetypes, i.e., abstract representations of the behavioral categories, is used to annotate the stereotypes automatically.

The automatic labeling procedure replicates the cognitive process followed by the experts, taking one stereotype at a time and assigning it the label of the closest archetype. An archetype (Fig. 3.3) has the same structure and semantics as an ICP but represents a perfect example of a behavioral profile. This shift between archetypes and stereotypes is the real power of the semi-automatic labeling, since it allows the abstract representation of the categories to adapt to the actual homogeneous behaviors in the data.
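The labeling rule can be sketched as a nearest-archetype assignment. Everything concrete in the sketch is an assumption made for illustration: the toy profile shape (3 time slots, 2 weeks), the cell values of the four archetypes (the actual archetypes are those of Fig. 3.3), the Euclidean distance, and all function and constant names.

```python
# Toy shape: 3 time slots x (2 weeks x {weekdays, weekend}), flattened row by row.
SLOTS, COLUMNS = 3, 4
WORK_SLOT = 1          # hypothetical index of the working-hours time slot
WEEKDAY_COLS = (0, 2)  # weekday columns of week 0 and week 1


def is_working_time(slot, col):
    return slot == WORK_SLOT and col in WEEKDAY_COLS


def profile(cell):
    """Build a flattened profile from a function giving the value of each cell."""
    return [float(cell(s, c)) for s in range(SLOTS) for c in range(COLUMNS)]


# Hypothetical archetypes: idealized profiles for each category.
ARCHETYPES = {
    "resident": profile(lambda s, c: 1),
    "dynamic_resident": profile(lambda s, c: not is_working_time(s, c)),
    "commuter": profile(lambda s, c: is_working_time(s, c)),
    "visitor": profile(lambda s, c: 0.5 if c == 1 and s > 0 else 0),
}


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def label_stereotypes(stereotypes, archetypes=ARCHETYPES):
    """Assign to each stereotype the label of its closest archetype."""
    return [
        min(archetypes, key=lambda name: squared_distance(s, archetypes[name]))
        for s in stereotypes
    ]


# Two toy stereotypes: one active only in working time, one present almost always.
s1 = profile(lambda s, c: {0: 0.8, 2: 0.6}[c] if is_working_time(s, c) else 0)
s2 = profile(lambda s, c: 0.9)
stereotype_labels = label_stereotypes([s1, s2])
# stereotype_labels == ["commuter", "resident"]
```

Because only the k stereotypes are compared with the archetypes, re-labeling after a new clustering run costs k distance computations per archetype and requires no expert intervention.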

Moreover, it is important to notice that using the archetypes to label the stereotypes, and then using the latter as the classification model, is different from using the archetypes to classify the users directly, as shown in Fig. 3.4. In the picture, a1 and a2 are the archetypes, while s1 and s2 are the stereotypes; δ′ and δ″ are the distances of the ICP u from a1 and a2 respectively, while σ′ and σ″ are the distances of u from s1 and s2 respectively. If u is compared directly with a1 and a2, it will be assigned to a2 because δ″ < δ′. If we first label s1 and s2 with the closest archetypes, u will be assigned
