Social networks, human mobility and economic development: a data-driven study in France


Università di Pisa

Department of Computer Science

Master's Degree in Computer Science for Economics and Business (Business Informatics)

Master's Thesis

Candidates: Giovanni Lima, Pierpaolo Paolini

Supervisors: Prof. Dino Pedreschi, Dr. Luca Pappalardo

Tutor: Dr. Zbigniew Smoreda

Co-examiner: Prof. Roberto Bruni


Whatever work you do, if you turn what you are doing into art, you will most likely find that you have become an interesting person to others rather than an object. This is because your decisions, made with Quality in mind, change you as well. Better: they change not only you and the work, but other people too, because Quality is like a wave. That piece of Quality work you thought nobody would notice is noticed indeed, and whoever sees it feels a little better: they will probably pass this feeling on to others, and in this way Quality will keep spreading.


Abstract

Nowadays, the striking proliferation of Big Data and the new scientific tools provided by the emerging field of Data Science finally pave the way to realizing a longstanding dream of scientists and policy makers: drawing a comprehensive picture of human social behavior. Big Data, indeed, hide a huge amount of predictive power, which can be exploited by governments and companies to unveil and understand the complexity underlying our society.

In the present thesis we propose a multidimensional study of human social behavior, aimed at understanding how the social network, the mobility patterns and the socio-economic status of people in a big European country are connected to each other. To this end we exploit the access to a big mobile phone dataset provided by the French telecom provider Orange. Thanks to Big Data management tools like Hadoop, we computed several individual measures, each describing different aspects of the social or mobility behavior of individuals and their aggregation at various geographic scales. Our analysis confirmed the existence of known patterns and revealed new interesting ones. Firstly, the observation at neighborhood level of the three biggest cities in France (Paris, Marseilles, Lyon) uncovers a very strong correlation between social diversification and mobility predictability. People who spread their calls evenly over their social contacts tend to have more erratic mobility behavior. Moreover, even more interesting relationships emerged between mobility behavior and economic status: mobility diversity is strongly correlated with some indexes describing the socio-economic level of the country. These striking results suggest that the greater the diversification of the mobility behavior of people within a territory, the higher their economic prosperity. Such findings open interesting future perspectives for the study of human social behavior. New statistical indexes can be defined which rely on mobile phone data to describe and predict (nowcast) the current and future economic health of a territory.


Summary

The proliferation of Big Data and of the new analysis tools provided by the emerging discipline of Data Science finally shows us the way towards realizing a long-standing dream of scientists and policy makers of every era: drawing a complete picture of social behavior. Big Data, indeed, hide an enormous predictive power, which governments and private companies can exploit to understand the great complexity that characterizes our social fabric.

In this thesis we propose a multidimensional study of human behavior, aimed at understanding how the social behavior, the mobility behavior and the economic well-being of the people living in a large European country are connected to each other. To this end we exploit the access to a large dataset of mobile phone calls provided by the French telecommunications provider Orange. Thanks to Big Data management tools such as Hadoop, we computed a series of individual measures, each describing different aspects of the social or mobility behavior both of individuals and of their aggregation at different geographic scales. Our analysis confirmed some previous findings and revealed new interesting behavioral patterns. Firstly, the observation at neighborhood level of the three largest cities in France (Paris, Marseilles, Lyon) reveals a very strong correlation between social diversification and the predictability of movements. People whose calls show a high diversification of social contacts tend to have more irregular mobility behavior. Moreover, an even more interesting relationship emerges between mobility behavior and economic status: the predictability of movements is strongly correlated with some indexes describing the level of economic well-being. These striking results leave no room for doubt: the greater the diversification of the mobility behavior of people within a territory, the greater its economic well-being. These findings open interesting future perspectives for the study of human social behavior. New statistical indexes can be defined based on mobile phone data, in order to describe and predict the current and future economic state of a given territory.


List of Figures

3.1 The Hadoop MapReduce System
3.2 Storage and processing layers in Hadoop
3.3 Example of MapReduce execution on the problem of counting words frequency
3.4 Pig compilation and execution stages
3.5 The process of translating a logical query plan into a Map-Reduce execution plan
3.6 Comparison between HiveQL and Pig
4.1 Example of trajectories extracted from mobile phone data
4.2 Distribution of Orange users per municipality
4.3 Heat map of Orange users per municipality
4.4 Distribution of calls and degree
4.5 Distribution of visited locations
4.6 Heat map of installed towers per municipality
4.7 Distributions of income and unemployment rate
4.8 Heat map of income per municipality
4.9 Heat map of unemployment rate per municipality
4.10 Distribution of education variables
4.11 Heat map of the European Deprivation Index (EDI)
4.12 Distribution of the European Deprivation Index (EDI)
4.13 Correlation between European Deprivation Index (EDI), income and primary education rate at municipality level
4.14 Correlation between European Deprivation Index (EDI), no diplomas and unemployment rate at municipality level
5.2 Correlation between social degree and social diversity
6.1 Distributions of radius of gyration and social degree
6.2 Distributions of mobility entropy and social diversity
6.3 Correlation between individual mobility and sociality measures
6.4 Distribution of Pearson’s coefficient at municipality and department level
6.5 Heat maps of Pearson’s coefficient at municipality level
6.6 Correlation between aggregated mobile and social measures at district level (Paris, Marseilles, Lyon)
6.7 Correlation between aggregated mobile and social measures at district level, including cities in the periphery of Paris, Marseilles and Lyon
6.8 Correlation between mobility/sociality measures and the individual income, at district level (Paris, Marseilles, Lyon)
6.9 Relation between mobility entropy and European Deprivation Index (EDI)
6.10 Relation between mobility entropy and income pro capite
6.11 Distributions of mobility entropy in the deciles of EDI and in the deciles of income pro capite
6.12 Relation between no diplomas and mobility entropy
6.13 Distributions of mobility entropy in the deciles of no diplomas and primary education level rate
6.14 Relation between socio-economic indexes and mobility entropy at municipality and department level
6.15 Relation between social diversity and European Deprivation Index (EDI)
6.16 Relation between social diversity and income pro capite
6.17 Distributions of social diversity in the deciles of EDI and income pro capite
6.18 Relation between social diversity and no diplomas
6.19 Distributions of social diversity in the deciles of no diplomas and in the deciles of primary education rate
6.20 Relation between socio-economic indexes and social diversity
6.21 Relation between social diversity and mobility entropy
6.22 Relation between radius of gyration and social degree


List of Tables

3.1 Example of input file to calculate the mobility entropy measure
4.1 Example of Call Detail Records
4.2 Features of the initial and filtered datasets
4.3 Economic measures used in our study
5.1 Description of the individual mobility measures
5.2 Description of the individual sociality measures


Listings

3.1 Words count from a text in HiveQL
3.2 HiveQL query which counts the number of triangles in the call graph
3.3 HiveQL query which creates the table edges
3.4 Pig Latin script to compute the PageRank
3.5 Control flow for the PageRank in Python
3.6 Map phase for calculating the mobility entropy in Python


Chapter 1 Introduction: the challenge of Human Behavior

Drawing a comprehensive picture of human social behavior is a longstanding dream of policy makers and scientists from different disciplines. Over the past centuries, reaching the fundamental laws governing society seemed to be a chimera, because of the lack of a suitable tool for observing individuals. Nowadays, the striking proliferation of data that characterizes our modern era finally offers the opportunity to turn the dream into reality. Every day we produce large amounts of data simply by living in our technological world: we make calls from mobile phones, check in on online social networks, post geotagged photos on websites and blogs, and produce GPS data by driving our cars. All these actions are translated into masses of digital data, collected by institutions and companies that compile them into a comprehensive picture of human behavior. Such a huge corpus of data constitutes a powerful social microscope, capable of capturing the main aspects of human actions: just as biologists observe and describe microorganisms through sophisticated microscopes, data scientists are able to observe the behavior of humans through the powerful lens of Big Data.

"There is a huge amount of predictive power in these data", says Albert-László Barabási, the Hungarian physicist best known for his research work in network science. Such descriptive and predictive power is catching the eye of governments and public institutions. Big Data can help governments in many different ways, such as: flushing out tax evaders through the identification of suspicious behavior; predicting the spreading patterns of epidemics; developing crime prediction and prevention; and analyzing human movements to design new sustainable cities by improving public transportation and reducing air pollution.

The power of data constitutes a valuable resource for private companies as well, and in many cases a significant competitive advantage. Amazon, for instance, exploits consumers’ online purchases to improve its product recommendations, increasing both its profits and its customers’ satisfaction. Several other companies like Walmart, Apple and eBay are making Big Data part of their DNA to improve their products and services, and to help managers in the decision making process.

The huge availability of data and the new scientific tools provided by the emerging field of Data Science [1, 2, 52] have revolutionized modern science and society, giving us for the first time in history the opportunity to unveil and understand the complexity underlying human social behavior.

In the present thesis we propose a multidimensional study of human social behavior, aimed at understanding how the social network, the mobility patterns and the socio-economic status of people in a big European country are connected to each other. To this end we exploit the access to a big mobile phone dataset provided by the French telecom provider Orange [48]. The dataset contains temporal and spatial information on all calls and text messages sent and received by about 20 million French customers over a period of forty-five days. This type of data constitutes a particularly suitable social microscope for our purpose. Indeed, mobile phones are nowadays very common technological devices that people carry with them in their daily routine, offering a good proxy to capture both the social interactions and the individual trajectories of people. Thanks to Big Data management tools like Hadoop, we accessed this huge amount of data and computed several individual measures for each user. Each measure describes a different aspect of the social or mobility behavior of that individual. Then, we analyzed the correlations between social and mobility aspects at different levels of detail of our social microscope. At the most detailed level, we studied the correlations for each individual in the dataset, discovering that mobility aspects are weakly correlated with social ones. This means that an overly detailed view of French mobile phone users does not reveal any significant relationship between how people interact and how people move. However, the aggregation of users with similar sociality/mobility behavior or living in the same geographical location (same district, municipality or department) reveals new interesting and unexpected patterns.

Firstly, the observation at district level of the three biggest cities in France (Paris, Marseilles, Lyon) uncovers a very strong correlation between social diversification and mobility predictability. An individual who has a high social diversity of communication ties also shows more erratic and unpredictable mobility behavior. Moreover, we also found an interesting relationship between social/mobility behavior and socio-economic status. At municipal and departmental levels, social and mobility measures vary with some socio-economic indexes such as income, education and unemployment rate. In this context, we observed that mobility diversity is correlated with the European Deprivation Index (EDI), a measure computed by selecting needs associated with both objective and subjective poverty [30]. This striking result leaves no room for doubt: the greater the diversification of the mobility behavior of people within a territory, the higher their economic prosperity.

Such results suggest that the different dimensions that compose our complex society are not unconnected, but are linked aspects of the same phenomenon. Unveiling patterns regarding a single dimension can help us to understand the other connected aspects of society. Such findings open interesting future perspectives for the study of human social behavior. For instance, new statistical indexes can be defined which rely on mobile phone data to describe and predict the current and future economic health of a territory.


Chapter 2 State of the art

In this chapter we provide an overview of the studies about mobility patterns, social networks and the socio-economic status. In Section 2.1 some related works about human mobility patterns are presented, while in Section 2.2 the main measures used to characterize complex networks are briefly analyzed. Finally, Section 2.3 presents some works about the interplay between mobile, social and economic dimensions.

2.1 Human Mobility patterns

Fueled by big mobility data collected by a wide range of tools and technologies, many recent studies provided statistical laws and realistic models to describe some important aspects of how people move. Brockmann et al. [4] performed a first large-scale analysis of human mobility, finding a power law in the probability that a bank note traverses a given distance. Other researchers analyzed a massive mobile phone dataset, discovering: i) a huge variability in the characteristic distance traveled by individuals [5]; ii) a decreasing tendency of people to visit previously unvisited locations [6, 7]. Such mobility patterns also apply to car travel, as shown in [27] through the analysis of a GPS dataset representing 10 million trips made by 150,000 cars. From the data mining community, the authors of [9] provide a mining methodology to extract mobility profiles from GPS traces.

Recently, many scientists started to study human mobility by using the power and flexibility of networks. In [7], the authors introduce the concept of mobility network to compute the predictability of each individual’s trajectory. Bagrow and Lin [11] construct a mobility network capturing the detailed flows between individual locations. They apply the Infomap community detection algorithm to each user’s mobility network, defining the groups of locations they discover as mobility “habitats”. Rinzivillo et al. [10] extract relevant clusters from mobility networks inferred from vehicles’ GPS tracks, finding a good match with the existing administrative borders. Brilhante et al. [12] combine public points of interest with trajectories of individuals moving within a city, and build a network connecting points of interest through the individual trajectories passing through them. On the resulting network they apply a community detection algorithm to find groups of places highly connected by the mobility of the individuals. In [13], the authors propose a methodology to represent a mobility scenario as a weighted contact graph and, analyzing its structure using tools from complex network analysis and graph theory, they find in human mobility a strong modularity and typical small-world characteristics.

2.2 Complex Network analysis

Network science is a recent interdisciplinary field that examines the interconnections among diverse physical, engineered, information, biological, cognitive, semantic and social systems. Over the past decade, data on many real network systems were collected, allowing network scientists to study the properties that characterize real complex networks.

The average path length of a network measures the average number of steps it takes for one node to reach another node in the network. Real networks are found to have a very small average path length, which is best known as the “small world” property. In the social context, for example, individuals on the planet are separated by six degrees of social contacts [14]. The clustering coefficient, first introduced by Watts and Strogatz [15], captures the presence of densely connected cliques in a network. Precisely, it measures the fraction of neighbors of a node that are also connected to each other. Another important feature used to describe complex networks is the degree distribution. It measures the probability that a randomly selected node has k edges. The random graph model provided by Erdős and Rényi predicts that the degree distribution follows a Poisson distribution, corresponding to a network where every node has roughly the same degree. However, a variety of real networks exhibit the “scale-free” property: the degree distribution follows a power law. This means that real networks are highly heterogeneous: most nodes in the network have very low degree, while there is a notable number of nodes with a large number of connections. The concept of tie strength has also attracted particular attention in the study of social networks. It was introduced by the sociologist Mark Granovetter [16], who proposed a model of society consisting of small and fully connected circles of friends, linked by strong ties. Weak ties connect the members of these intimate circles to their acquaintances, who have strong ties to their own friends. The existence of a local coupling between tie strengths and network topology is confirmed by recent research [17], which exploits the huge quantity of human interactions recorded by modern tools and technologies.
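As an illustration of two of the measures discussed above, the following minimal Python sketch computes the local clustering coefficient of a node and the empirical degree distribution of a small undirected graph. The function names, the adjacency-dictionary representation and the toy graph are assumptions made for this example only; they are not the implementation used in the thesis.

```python
from collections import Counter
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: nodes with fewer than 2 neighbors get 0
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * linked / (k * (k - 1))


def degree_distribution(adjacency):
    """Empirical probability P(k) that a randomly selected node has degree k."""
    counts = Counter(len(neighbors) for neighbors in adjacency.values())
    n = len(adjacency)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical toy graph: a triangle a-b-c plus a pendant node d attached to a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(local_clustering(toy_graph, "a"))
print(degree_distribution(toy_graph))
```

In the toy graph only one of the three pairs of neighbors of node a is connected, so its coefficient is one third, while node b, whose two neighbors are linked, has coefficient one.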

2.3 Interplay between mobile, social and economic dimensions

The interplay between social, mobility and economic aspects of society is becoming one of the central issues in modern science. Here, we review some data-driven works which aim to describe and model the interplay between sociality, mobility and economic development.

Interplay between friendships and mobility. Thanks to the increasing availability of mobile phone datasets and location-based online social networks, data scientists are uncovering the patterns underlying the interplay between social and mobile aspects of human behavior.

A central hypothesis is that social interactions increase with physical proximity. Scellato et al. [18] analyze data from Location Based Social Networks and find a weak positive correlation between the number of friends and their average distance. In [19], the authors compute the entropy related to people’s locations in order to understand how it affects the underlying social network. They observe that people who visit locations of higher entropy tend to be more social, having more ties in the social network. Crandall et al. [20] study the relationship between spatial co-location and friendship through Flickr data, finding that even a small number of co-occurrences leads to orders-of-magnitude greater probabilities of a social tie. Wang et al. [37] exploit the access to a big mobile phone dataset to study to what extent individual mobility patterns shape and impact the social network. They find that social similarity strongly correlates with both mobile similarity and tie strength. They use such results to build several supervised and unsupervised classifiers, discovering that combining both mobility and network measures actually produces better performance in the link prediction problem. The problem of the interaction between a person’s social network structure and individual mobility is also addressed in [21], where the authors try to understand whether friendships influence where people travel, or whether it is rather traveling that influences and shapes social networks. Analyzing data from mobile phones and location-based social networks, they find that the probability of a user visiting the home of an existing friend is higher than the probability that a check-in leads to a new friendship. Moreover, the data also display a strong dependency between probability of friendship and trajectory similarity, suggesting that there is a strong presence of social and geographical homophily.

Interplay between socio-mobility and economic development. The problem of the interplay between social behavior and economic wellness is still mostly open, given the difficulty of retrieving accurate information about the economic situation of individuals or territories.

A couple of studies on the relationship between socio-economic level and human mobility [22, 24] try to measure the impact of socio-economic aspects on individual mobility. In [22] the authors find relevant linear correlations between the economic status and the number of different towers used, the radius of gyration and the diameter of the area of influence: the socio-economic value of individuals increases with the three mobility variables. Another large-scale analysis [23] studies the usage of cell phones in an emerging economy, combining large-scale mobile phone datasets and census data collected by the National Statistical Institute. Other works focus on the social and mobility behavior of individuals to understand their city life. Bettencourt et al. [34] demonstrate the qualitative changes associated with the scale of cities, while Batty shows that the quality of life increases with the city size, since the benefits outweigh the costs [32]. In this context, a new science of cities is emerging, aimed at studying the dynamics and evolution of a city [33, 31]. Nathan Eagle et al. study the relationship between socio-economic wellness and social network diversity using phone records and social development indicators in the U.K. [25]. They discover that social network diversity is a good indicator of the economic development of social communities.


Chapter 3 Managing Big Data

Big data is the term used for a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. As we will see in Chapter 4, the size and complexity of the datasets we used fit the definition of Big Data: it is very difficult to manage them using most relational database management systems, desktop statistics or visualization packages. The management of such complex data requires instead massively parallel software running on many servers. In this chapter we describe the main features of the Big Data management tools we used to analyze data and perform the experiments.

3.1 The MapReduce Paradigm

Apache Hadoop is an open source framework able to manage the access to very large datasets. Hortonworks [49], a software company which develops and supports Hadoop, defines it as a framework that allows users to gain insight from massive amounts of structured and unstructured data quickly and without significant investment. Hadoop provides two core services: the HDFS file system and the MapReduce mechanism. The HDFS (Hadoop Distributed File System) is a Java-based file system that provides data storage features such as scalability, fault tolerance and high efficiency. It was designed to work with the MapReduce paradigm, showing scalability on a storage of 200 petabytes across a cluster of 4,500 servers. The MapReduce paradigm distributes tasks across a cluster of coordinated nodes. It was designed to run on commodity hardware and to scale up or down without system interruption. Thanks to the MapReduce programming model, problems on Big Data that would take traditional data tools days or weeks can often be solved by Hadoop in a few hours or minutes.

The Hadoop MapReduce System consists of the JobTracker, which is the master, and of the slave nodes called TaskTrackers [53]. Similarly, the HDFS has a master/slave architecture [54] composed of the NameNode, which is the master, and the slave nodes called DataNodes. Conceptually, MapReduce is based on the map() and reduce() functions, and follows two main phases:

1. In the mapping phase, the JobTracker node partitions the input into smaller sub-problems and distributes them to some TaskTracker nodes. In turn, the TaskTrackers solve the sub-problems and return the results to the JobTracker.

2. In the reducing phase, the JobTracker collects the outcomes and combines them to obtain a solution to the original problem.

Figure 3.1 shows how the JobTracker and the TaskTracker work in the Hadoop MapReduce System [55].

Figure 3.1: The Hadoop MapReduce System.

In the commodity cluster computing environment [56] where Hadoop runs, the NameNode and the DataNode also have important roles. The NameNode keeps the directory tree of all the files in the Hadoop file system, while the DataNode stores the data in the HDFS. When a client application or a JobTracker needs to know where a given file is located, it asks the NameNode, which responds with a list of DataNode servers where the data are located [59]. In short, to store Big Data we use the Hadoop HDFS layer across the NameNode and the DataNodes. To perform tasks on Big Data we use Hadoop MapReduce across the JobTracker node and the TaskTracker nodes. Figure 3.2 (left) illustrates the storage and processing layers of Hadoop, and how the nodes in a small cluster can be deployed on a group of servers (Figure 3.2, right) [57] [58].

Figure 3.2: Storage and processing layers in Hadoop (left). Typical Hadoop cluster on a minimum of four machines (right).

An example of MapReduce execution on the problem of counting word frequencies in a file is shown in Figure 3.3. In the splitting process, the flat file is split into three lines. In the next phase, a map instance is created for each line, and the sentence is split into words, each of which forms an initial key-value pair. In the shuffling process the key-value pairs are sorted by key in some logical order. Finally, in the reducing process the pairs are grouped by key and the sum function is applied to their values.
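The four steps of the word count example (splitting, mapping, shuffling, reducing) can be imitated on a single machine with a few lines of Python. This is only a sketch of the data flow, assuming a tiny in-memory input; it does not use Hadoop, and the function names and the sample text are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit one (word, 1) key-value pair per word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffling: group the emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reducing: apply the sum function to the values of each key."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(text):
    lines = text.splitlines()  # splitting: one record per line
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical three-line input file.
sample = "deer bear river\ncar car river\ndeer car bear"
print(word_count(sample))
```

In a real Hadoop job the map instances and the reducers run on different TaskTracker nodes; here they are plain function calls, which keeps the flow of key-value pairs easy to follow.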


In the real world, the MapReduce paradigm is harder to program, since the problems to solve are more complex. To make interaction with Hadoop easier, we used two data services: Hive for the handling, filtering and computing phases, and Pig for the processing phase.


3.2 Big Data Analytic tools

3.2.1 Hive

Hive is a successful Apache project used by many organizations as a scalable data processing platform. Hive’s SQL dialect, called HiveQL, provides statements similar to standard SQL statements but does not support transactions or row-level inserts, updates and deletes. In other words, Hive is a data warehouse software for querying and managing large distributed datasets, built on Hadoop. Since it takes SQL queries and translates them into MapReduce tasks, it is useful when real-time responsiveness to queries and record-level inserts, updates, and deletes are not required. Hive is mainly used for batch jobs over large sets of immutable data, such as web logs or CDRs. The word count problem explained previously can be easily solved by the HiveQL code in Listing 3.1:

-- Load the raw text, one line per row.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input.tsv' OVERWRITE INTO TABLE docs;
-- Split each line on whitespace, emit one row per word, then count per word.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Listing 3.1: Words count from a text in HiveQL

In our study we need to manage a big call graph in order to compute several structural network measures which summarize the social behavior of an individual. For this purpose, we used the power of Hive to reduce the computation time. The clustering coefficient, for example, is a network measure of the degree to which nodes in a graph tend to cluster together. It is expensive to compute, since we need to calculate the number of triangles incident on each node in the call graph. Such a computation can be expressed very simply through the HiveQL code in Listing 3.2 [60]:


SELECT count(*)
FROM edges e1
JOIN edges e2 ON e1.number2 = e2.number1 AND e1.number1 < e2.number1
JOIN edges e3 ON e2.number2 = e3.number1 AND e3.number2 = e1.number1
  AND e2.number1 < e3.number1

Listing 3.2: HiveQL query which counts the number of triangles in the call graph

where the table edges is created in the following way:

CREATE TABLE edges (number1 INT, number2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Listing 3.3: HiveQL query which creates the table edges
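To make the logic of the triangle-counting query easier to follow, the sketch below reproduces the same three-way self-join on a small in-memory edge list in Python. It is an illustration only: the function name and the sample edges are hypothetical, and it assumes that the edge list contains no duplicate rows (the HiveQL query would count duplicates several times, while this sketch collapses them).

```python
from collections import defaultdict


def count_triangles(edges):
    """Count chains a -> b -> c -> a with a < b < c, as in the HiveQL joins."""
    edge_set = set(edges)  # assumption: duplicate rows are collapsed
    successors = defaultdict(set)
    for number1, number2 in edge_set:
        successors[number1].add(number2)

    total = 0
    for a, b in edge_set:  # e1: a -> b, with e1.number1 < e2.number1
        if a >= b:
            continue
        for c in successors[b]:  # e2: b -> c, with e2.number1 < e3.number1
            if b < c and (c, a) in edge_set:  # e3: c -> a closes the triangle
                total += 1
    return total


# Hypothetical call graph stored in both directions: triangle 1-2-3 plus edge 3-4.
sample_edges = [(1, 2), (2, 1), (2, 3), (3, 2), (1, 3), (3, 1), (3, 4), (4, 3)]
print(count_triangles(sample_edges))
```

The ordering conditions ensure that each triangle is counted once rather than once per rotation. Note that the closing edge must be stored in the direction c -> a, so an edge table that keeps each call pair in only one direction, ordered from the smaller to the larger number, would yield no matches.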

We also used Hive for loading and storing data, and for creating and filtering the datasets. The best alternative to Hive is Pig, a top-level Apache project developed for Hadoop. Pig is described as a data flow language, rather than a query language. Both Pig and Hive provide a way to operate on datasets stored with Hadoop.


3.2.2 Pig

Apache Pig is a framework to process and analyze very large datasets. It provides an engine for running data flows on Hadoop. Complex data transformation tasks are managed as data flow sequences, making them easy to write, understand, and maintain. Using Pig, we focus on the data rather than on the nature of the execution, by means of Pig Latin, the data flow language used to write Pig scripts. Pig takes Pig Latin scripts and translates them into MapReduce tasks. Pig Latin defines a set of transformations on a dataset such as aggregation, join and sort, allowing users to describe how data should be read, processed and stored.

Data flows can be very simple, like the word count example, or complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG) [66], where the edges are data flows and the nodes are the operators [65] which process the data [41]. This directed acyclic graph can be optimized before its nodes are executed.

Figure 3.4 clarifies the concept showing how Pig compiles and executes Pig Latin scripts [40].

Figure 3.4: Pig compilation and execution stages.

In the parsing phase the parser performs a syntactic analysis of the Pig program. The output of the parser is a canonical logical plan with a one-to-one correspondence between Pig Latin statements and logical operators, arranged in a directed acyclic graph (DAG). The logical optimizer phase optimizes the logical plan; for instance, it places a filter operation before the join operation. In the MapReduce compiler phase, the optimized logical plan is compiled into a series of MapReduce jobs, which then pass through another optimization phase called the MapReduce optimizer. Finally, in the Hadoop job manager phase, the DAG of optimized MapReduce jobs is topologically sorted and the jobs are submitted to Hadoop for execution. Figure 3.5 illustrates the process of translating a logical query plan into a MapReduce execution plan [40].

(a) Pig Latin to logical plan translation.

(b) Logical plan to physical plan translation.

(c) Physical plan to MapReduce plan translation.

Figure 3.5: The process of translating a logical query plan into a Map-Reduce execution plan.


The following Pig Latin code (Listing 3.4) shows how Pig can be used to compute PageRank, a centrality measure which quantifies the importance of a node in a graph [41].

previous_pagerank = load '$docs_in' as (url:chararray, pagerank:float,
                    links:{link:(url:chararray)});
outbound_pagerank = foreach previous_pagerank generate
                    pagerank / COUNT(links) as pagerank,
                    flatten(links) as to_url;
cogrpd = cogroup outbound_pagerank by to_url, previous_pagerank by url;
new_pagerank = foreach cogrpd generate group as url,
               (1 - $d) + $d * SUM(outbound_pagerank.pagerank) as pagerank,
               flatten(previous_pagerank.links) as links,
               flatten(previous_pagerank.pagerank) as previous_pagerank;
store new_pagerank into '$docs_out';
nonulls = filter new_pagerank by previous_pagerank is not null and
          pagerank is not null;
pagerank_diff = foreach nonulls generate ABS(previous_pagerank - pagerank);
grpall = group pagerank_diff all;
max_diff = foreach grpall generate MAX(pagerank_diff);
store max_diff into '$max_diff';


The previous code can be called from a control flow written in the following Python script (Listing 3.5) [41].

#!/usr/bin/python
# Runs under Pig's embedded Jython interpreter.
from org.apache.pig.scripting import *

# The compiled string stands for the Pig Latin script shown above.
P = Pig.compile("""
-- Pig Latin script for calculating the PageRank social measure,
-- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
""")

params = {"d": "0.5", "docs_in": "data/crawl"}

for i in range(10):
    out = "out/pagerank_data" + str(i + 1)
    max_diff = "out/max_diff" + str(i + 1)
    params["docs_out"] = out
    params["max_diff"] = max_diff
    # Remove the output of a previous run, if any
    Pig.fs("rmr " + out)
    Pig.fs("rmr " + max_diff)
    bound = P.bind(params)
    stats = bound.runSingle()
    if not stats.isSuccessful():
        raise RuntimeError("failed")
    mdv = float(str(stats.result("max_diff").iterator().next().get(0)))
    print("max_diff_value = " + str(mdv))
    if mdv < 0.01:
        print("done at iteration " + str(i))
        break
    # The output of this iteration is the input of the next one
    params["docs_in"] = out

Listing 3.5: Control flow for the PageRank in Python


3.2.3 Pig Latin vs HiveQL

Pig Latin overcomes the limits of the SQL language in data processing. While SQL is designed for the RDBMS environment, Pig Latin is designed directly for Hadoop, allowing an easier use of the MapReduce paradigm. Pig can understand, analyze, check and optimize the data flow described by the user, so that MapReduce code can be produced in a very short time. Here, we show how some of the more familiar HiveQL idioms can be implemented in Pig, and how Pig allows the user to control the execution plan explicitly. Figure 3.6 covers many of the usual SQL data processing concepts, such as filtering, selecting, grouping, and ordering. It also shows how we can directly optimize the execution flow by placing the filter operator country region = ’Asia’ before the join operator.

Unlike Hive, Pig can be useful during an ETL (extract, transform, load) process and in search operations on raw data, since it works in situations where the schema is unknown, incomplete, or inconsistent. Pig Latin also allows the user to manage and clean datasets from different data sources before loading them into a data warehouse. However, Pig Latin is not just a procedural version of SQL or a declarative dialect like HiveQL. As the differences in Figure 3.6 show, it is designed for Hadoop rather than for the RDBMS environment (Pig Latin is native to parallel data-processing systems), which allows a direct use of the MapReduce paradigm. Because Pig can analyze, check and optimize the data flow that the user is describing, MapReduce code is produced indirectly and with a lower cost in time. This aspect is especially important for private companies, where the development cycle in MapReduce is usually hard and time-consuming.


SQL                               | Pig
... FROM MyTable ...              | A = LOAD 'MyTable' USING PigStorage(';') AS (col1:int, col2:int, col3:int);
SELECT col1 + col2, col3 ...      | B = FOREACH A GENERATE col1 + col2, col3;
... WHERE col2 > 2                | C = FILTER B BY col2 > 2;
SELECT col1, col2, sum(col3)      | D = GROUP A BY (col1, col2);
FROM X GROUP BY col1, col2        | E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
... HAVING sum(col3) > 5          | F = FILTER E BY $2 > 5;
... ORDER BY col1                 | G = ORDER F BY $0;
SELECT DISTINCT col1 FROM X       | I = FOREACH A GENERATE col1;
                                  | J = DISTINCT I;
SELECT col1, COUNT(DISTINCT col2) | K = GROUP A BY col1;
FROM X GROUP BY col1              | L = FOREACH K {
                                  |     M = DISTINCT A.col2;
                                  |     GENERATE FLATTEN(group), COUNT(M);
                                  | }

Figure 3.6: Similarities and differences between HiveQL and Pig Latin (top). Some of the more familiar HiveQL idioms are implemented in Pig (bottom).


3.2.4 Python

Python is a dynamic programming language that supports many programming paradigms, such as object-oriented, imperative and functional programming. It is rapidly becoming one of the most popular dynamic, interpreted programming languages among data scientists, thanks to its simplicity and its ability to express concepts in a few lines of code. Moreover, many Python libraries are freely available for developing data science applications, such as NumPy and SciPy for exploratory analysis and Matplotlib for data visualization. Python users can also rely on the IPython shell, which is very useful for interactive and exploratory computing.

By combining the efficiency of Hadoop with the flexibility of Python, we computed complex measures on the data in a very fast way. For example, the mobility entropy measures the uncertainty in the next locations visited by an individual. It is hard to compute on the mobile phone dataset, since we need to reconstruct the trajectories of each user. Using the map/reduce paradigm, in the map phase we read the input file line by line. The output of the map phase is the input for the reduce phase. Table 3.1 shows an example of the structure of the input file. To compute the measure on the data, we just require two Python scripts, map.py and reduce.py, which are shown in the following pseudocode listings 3.6 and 3.7.

caller   | timestamp           | duration | lat_caller | lon_caller
HJ123423 | 2007/10/01 23:45:00 | 1 sec    | 599366     | 2443429
TR234S3  | 2007/10/01 23:51:00 | 3467 sec | 599366     | 2443429
HJ123423 | 2007/10/01 23:52:00 | 327 sec  | 599366     | 2443429
...      | ...                 | ...      | ...        | ...

Table 3.1: Example of input file to calculate the mobility entropy measure.

data = 'input.csv'
for each line in data
    create a couple c = <key, value>
    key = user
    value = tuple (timestamp, duration, location of the tower)
output = list of couples

Listing 3.6: Map phase for calculating the mobility entropy in Python

data = output of the map phase
for each user in data
    calculate the entropy of the user
output = list of individual entropy measures

Listing 3.7: Reduce phase for calculating the mobility entropy in Python
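The two pseudocode listings can be sketched as plain Python functions. This is a minimal in-memory illustration, not the map.py and reduce.py scripts actually run on Hadoop: it assumes comma-separated lines with the five fields of Table 3.1, identifies a location by its (lat, lon) pair, ignores consecutive calls from the same tower, and computes an unnormalized Shannon entropy (natural logarithm) over the observed transitions between towers. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_phase(lines):
    """Emit one (user, (timestamp, duration, location)) pair per input line."""
    for line in lines:
        caller, timestamp, duration, lat, lon = line.strip().split(",")
        yield caller, (timestamp, duration, (lat, lon))


def transition_entropy(records):
    """Shannon entropy of the transitions between consecutively used towers."""
    # Timestamps in YYYY/MM/DD HH:MM:SS form sort correctly as strings.
    towers = [location for _, _, location in sorted(records)]
    moves = Counter((a, b) for a, b in zip(towers, towers[1:]) if a != b)
    total = float(sum(moves.values()))
    if total == 0:
        return 0.0  # the user never changed tower
    return -sum((n / total) * math.log(n / total) for n in moves.values())


def reduce_phase(pairs):
    """Group the mapper output by user and compute one entropy per user."""
    by_user = defaultdict(list)
    for user, value in pairs:
        by_user[user].append(value)
    return {user: transition_entropy(recs) for user, recs in by_user.items()}


# Hypothetical input: u1 makes three distinct moves, u2 calls only once.
sample_lines = [
    "u1,2007/10/01 08:00:00,10 sec,1,1",
    "u1,2007/10/01 09:00:00,10 sec,2,2",
    "u1,2007/10/01 10:00:00,10 sec,3,3",
    "u1,2007/10/01 11:00:00,10 sec,4,4",
    "u2,2007/10/01 08:00:00,10 sec,1,1",
]
entropies = reduce_phase(map_phase(sample_lines))
```

In a real Hadoop Streaming job the grouping by key is done by the framework between the two phases; here it is done explicitly in `reduce_phase` so that the sketch runs on its own.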


Chapter 4 Big data as Social Microscope

In this chapter we describe the data source used in this study. A discussion about the nature of the data is provided, together with a description of the structure and the preprocessing/filtering operations needed to make the data more reliable and suitable for the analysis.

4.1 Mobile Phone Data

Mobile phones are nowadays very common technological devices, carried by people in their daily routine, and they offer a good proxy to study the structure and dynamics of human social behavior. Indeed, phone records capture information about both social links and human displacements: each time we make a call a social relationship of some kind is expressed, and the tower that communicates with our phone is recorded by the carrier, effectively pinpointing our location. Unfortunately, the spatial information is not very accurate, because a user could be anywhere within the tower’s reception area, which can span tens of square kilometers.

In this study we exploit the access to an anonymized GSM dataset collected by the French telecom provider Orange [48] for billing and operational purposes. Orange is the biggest French carrier, covering approximately 30% of the mobile phone market in France. The dataset consists of Call Detail Records (CDRs) describing each phone call sent or received by Orange users in a period of 45 days (from 01/09/2007 to 15/10/2007). It stores information about 21,410,549 different users, 215,947,416 phone calls and 87,712 phone towers. Each call is represented by a tuple with the timestamp, the caller and callee identifiers, the duration of the call, and the geographical coordinates of the tower serving the call. Table 4.1 provides an example of the structure of CDRs.

timestamp           | caller   | callee    | duration | tower coords
2007/10/01 23:45:00 | HJ123423 | R482G9342 | 365 sec  | (48.862, 2.349)
2007/13/01 12:10:04 | TR234S3  | 43FG3423  | 125 sec  | (48.859, 2.338)
...                 | ...      | ...       | ...      | ...

Table 4.1: Example of CDRs (Call Detail Records).

The time-ordered list of towers from which a user performed his calls composes a mobile trajectory, which describes his movements during the period of observation. Such mobility traces are more accurate in densely populated areas, where many more phone towers are installed to carry the heavier load. This means that in rural areas, where a single tower usually covers several kilometers, short movements are not collected. Unfortunately, sparsely populated municipalities are the majority: only 3% of the French municipalities have more than two Orange towers installed. In very dense municipalities, on the contrary, several towers may be installed in an area as small as a square kilometer. To mitigate the problem of heterogeneous tower density, all towers within a distance of 800 meters are collapsed into one. We assign each user to a home location using the algorithm proposed in [36], which sets the home as the tower where the user performs the highest number of calls during nighttime hours (from 10 p.m. to 7 a.m.). Figure 4.1 shows the trajectories of two anonymized mobile phone users who visited N = 22 (left) and N = 76 (right) different towers during the period of observation.
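The home assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the nighttime rule as stated in the text, not the full algorithm of [36]: the function names, the timestamp format (taken from Table 4.1) and the arbitrary tie-breaking are assumptions made here.

```python
from collections import Counter
from datetime import datetime


def is_night(moment):
    """True if the call falls in the 10 p.m. to 7 a.m. window."""
    return moment.hour >= 22 or moment.hour < 7


def home_location(calls):
    """Return the tower with the most nighttime calls, or None if there are none.

    `calls` is an iterable of (timestamp_string, tower_id) pairs for one user.
    Ties are broken arbitrarily (first tower encountered).
    """
    counts = Counter(
        tower
        for stamp, tower in calls
        if is_night(datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"))
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical user: most calls are made from tower B, but by day.
sample_calls = [
    ("2007/10/01 23:45:00", "A"),
    ("2007/10/02 06:30:00", "A"),
    ("2007/10/02 12:00:00", "B"),
    ("2007/10/02 13:00:00", "B"),
    ("2007/10/02 14:00:00", "B"),
    ("2007/10/02 22:10:00", "C"),
]
```

For this user the home is tower A, even though tower B collects more calls overall, because only nighttime calls are counted.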

Some preprocessing and filtering operations are needed to overcome some limitations and shortcomings of the mobile phone data. Such shortcomings concern both the mobility and the social information provided by the calls. The location of a user is recorded only when he uses his phone, and it is not necessarily a relevant place for his systematic mobility. To clarify this point, consider the following scenario. A user performs a call while he is driving on the highway. In our dataset a new entry will be recorded, and the tower covering that portion of the highway stored. Is that location meaningful for the description of the movements of that person? Of course not, since that location does not describe a place that the user visited voluntarily. To avoid such random locations and capture only the systematic mobility, we applied several filters to the data. Firstly, for each user u we discarded all the locations he visited only once or with a visitation frequency lower than a given threshold. A location i is discarded when f = n_i/N < 0.005, where n_i is the number of calls performed by u in location i and N is the total number of calls performed by u during the period of observation. This condition simply checks whether the location is relevant with respect to the specific volume of calls of the user. For example, let us consider a location l with frequency n = 2 of a user who performed a total of N = 2000 phone calls in the period of observation. Location l will be discarded because f = 2/2000 = 0.001 < 0.005, i.e. the location accounts for less than 0.5% of the user's calls. Since it is meaningless to analyze the mobility of individuals who do not move, all the users with only one location left after this filter have been discarded.

Figure 4.1: Example of mobility traces extracted from mobile phone data.

To deal with the problem of the bursty nature of human activities¹, we selected only users with a high call frequency. A user u is discarded when f = N/45 < 0.5, where N is the total number of calls performed by u and 45 days is our period of observation. This means that all users with fewer than one call every two days (on average) have been deleted. Finally, to avoid the presence of abnormally active users, we discarded the users with a huge number of calls, N > k * 45, where k = 300. In this way, anomalous users like line testers and alarm managers are excluded from the analysis.

¹ For most of the time people are inactive, and hence for most of the time we do not
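The three filters can be summarized in a short Python sketch. The thresholds are those stated in the text; the function and constant names are hypothetical, and we assume that N in the call-frequency filters is the total number of calls of the user before any location is discarded, which the text does not specify.

```python
from collections import Counter

OBSERVATION_DAYS = 45
MIN_LOCATION_SHARE = 0.005  # n_i / N threshold for a relevant location
MIN_CALLS_PER_DAY = 0.5     # N / 45 threshold for an active user
MAX_CALLS_PER_DAY = 300     # k: users with N > k * 45 are discarded


def relevant_locations(towers):
    """Keep towers visited more than once and with share n_i / N >= 0.005.

    `towers` is the list of towers of one user, one entry per call.
    """
    total = float(len(towers))
    counts = Counter(towers)
    return {
        tower
        for tower, n in counts.items()
        if n > 1 and n / total >= MIN_LOCATION_SHARE
    }


def keep_user(towers):
    """Apply the call-frequency, abnormal-activity and mobility filters."""
    total = len(towers)
    if total / float(OBSERVATION_DAYS) < MIN_CALLS_PER_DAY:
        return False  # fewer than one call every two days
    if total > MAX_CALLS_PER_DAY * OBSERVATION_DAYS:
        return False  # abnormally active (e.g. line testers)
    return len(relevant_locations(towers)) > 1  # the user must actually move


# The worked example of the text: 2 calls out of 2000 from location "l".
example_user = ["home"] * 1200 + ["work"] * 798 + ["l"] * 2
```

For `example_user`, location "l" is dropped (2/2000 = 0.001 < 0.005) while the user is kept, since two relevant locations remain.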

On the resulting users, we extracted a call graph taking the duration of the calls (in seconds) as the weight of the edges. We imposed a filter on the strength of a social tie by considering reciprocated edges only [16]. This means that an edge (u, v) exists in the call graph if u called v and v called u at least once. Finally, only the giant component of the network has been considered, corresponding to 98% of the nodes in the graph. Starting from 21,410,549 users, the filtering phase resulted in 6,289,865 active users and 33,071,948 edges. Table 4.2 summarizes some characteristics of the initial and the filtered datasets.
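A minimal Python sketch of the two graph filters follows. It is an illustration under stated assumptions rather than the code used on the Orange dataset: the function names are hypothetical, and the weight of a reciprocated edge is taken to be the total call duration in both directions, a choice the text does not spell out.

```python
from collections import defaultdict, deque


def reciprocated_graph(calls):
    """Undirected weighted graph keeping only pairs who called each other.

    `calls` is an iterable of (caller, callee, duration_in_seconds).
    """
    directed = defaultdict(float)
    for caller, callee, seconds in calls:
        if caller != callee:
            directed[(caller, callee)] += seconds
    graph = defaultdict(dict)
    for (u, v), weight in directed.items():
        if (v, u) in directed:  # keep the tie only if it is reciprocated
            graph[u][v] = weight + directed[(v, u)]
    return dict(graph)


def giant_component(graph):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical calls: a-d is one-directional, x-y is a separate small component.
sample = [("a", "b", 10), ("b", "a", 5), ("b", "c", 7), ("c", "b", 3),
          ("a", "d", 100), ("x", "y", 1), ("y", "x", 1)]
call_graph = reciprocated_graph(sample)
```

On the sample, user d disappears because the tie with a is not reciprocated, and the giant component is {a, b, c}.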

Table 4.2: Features of the initial and filtered datasets.

initial dataset         | filtered dataset
users      21,410,549   | users         6,289,865
calls      215,947,416  | social links  33,071,948
locations  87,712       | locations     17,662

Figure 4.2 displays the distribution of Orange users per municipality. Since it follows a power law, the vast majority of the population is concentrated within a few municipalities, corresponding to the biggest and densest cities. Indeed, although France has more than 30,000 municipalities, Paris, Marseilles and Lyon together host about 10% of the overall population. The heat map in Figure 4.3 shows how the population of Orange users is distributed over the territory of France.

Figure 4.4 shows that the distribution of outgoing calls per user during the period of observation is a power law, as is the distribution of the number of social contacts. This means that there is a big difference in how individuals use the mobile phone service. During the 45 days of observation, the majority of users perform very few calls, fewer than one hundred in total, while a few of them perform several thousand calls.

The same unequal behavior applies for the mobility aspect, since the distribution of visited locations per user is also a power law (Figure 4.5).

[Figure 4.2 plot, logarithmic axes. x-axis: Orange users; y-axis: p(Orange users).]

Figure 4.2: Distribution of Orange users per municipality.

Figure 4.3: Heat map of Orange users per municipality. The number of Orange users is normalized in the range [0, 1]. The map can be downloaded in high definition at this link: https://github.com/Pierpa/maps_high_def.git

Many users are very static, visiting fewer than a dozen locations in 45 days. Others are instead very erratic, visiting hundreds of locations. The number of phone towers installed in municipalities, however, is not uniformly distributed over the territory, but tends to follow the population density of the municipality (heat map in Figure 4.6). This means that the denser the municipality, the finer the observation of users’ movements.

[Figure 4.4 plots, logarithmic axes. Left: x-axis outgoing calls, y-axis p(outgoing calls). Right: x-axis degree, y-axis p(degree).]

Figure 4.4: Distributions of outgoing calls (left) and social degree (right).

[Figure 4.5 plots, logarithmic axes. Left: x-axis locations per user, y-axis p(locations per user). Right: x-axis locations per municipality, y-axis p(locations per municipality).]

Figure 4.5: Distributions of visited locations per user (left) and visited locations per municipality (right).


Figure 4.6: Heat map of installed towers per municipality. The number of towers installed is normalized in the range [0, 1]. The municipalities with no Orange towers have been excluded. The map can be downloaded in high definition at this link: https://github.com/Pierpa/maps_high_def.git


4.2 Economic data

To study the economic dimension of human behavior we used several socio-economic indicators provided by the French National Institute of Statistics and Economic Studies (INSEE) [3]. Such indicators are at municipality level: for each of the 36,568 French municipalities, we collected information about the number of residents, working activity, education level and a deprivation index which provides a measure of objective and subjective poverty.

Income and unemployment rate provide information about the working activity of a municipality. The income is the sum of the incomes of the residents in the municipality, while the unemployment rate is the ratio between the number of unemployed individuals between 15 and 64 years old and the number of individuals older than 15 years old. We also have information about the male and female unemployment rates separately. Figure 4.7 (left) shows that the distribution of the income variable is a power law. This means that an economic inequality exists among the municipalities, following the 80/20 rule first observed by Pareto: 80% of the wealth is in the hands of 20% of the municipalities. For example, the sum of the incomes of Paris, Lyon and Marseilles is about 10% of the total French income. A picture of the heterogeneity which characterizes the wealth distribution is provided by the heat map in Figure 4.8. From the map, we notice that the income is much higher in urban areas (the red colored municipalities), while it tends to be lower as we move away towards the countryside.

[Figure 4.7 plots. Left: x-axis income, y-axis p(income), logarithmic axes. Right: x-axis unemployment rate, y-axis p(unemployment rate), linear axes.]

Figure 4.7: Distributions of income (left) and unemployment rate (right).

The distribution of the unemployment rate is well fitted by a normal-like distribution: municipalities have more or less the same unemployment rate (Figure 4.7, right). Such homogeneity can be better observed from the heat map in Figure 4.9. The map is colored uniformly, telling us that there is not much difference among the cities in the unemployment rate. Moreover, the mean rate is low, equal to 0.06 (i.e. 6%), consistent with official statistics about general unemployment in France [39].

Figure 4.8: Heat map of income per municipality. The income is normalized in the range [0, 1]. The map can be downloaded in high definition at this link: https://github.com/Pierpa/maps_high_def.git

The education level provides useful information about the degree of development of a country. INSEE provides two measures regarding the education level: the ratio of people without diploma and the ratio of people with primary education only. As we notice from Figure 4.10, both measures are normally distributed. They also show a low average, meaning that in general the French education level is very good, since only a small percentage of the population has a low level of education.

Figure 4.9: Heat map of unemployment per municipality. The map can be downloaded in high definition at this link: https://github.com/Pierpa/maps_high_def.git

[Figure 4.10 plots, linear axes. Left: x-axis no diplomas rate, y-axis p(no diplomas rate). Right: x-axis primary education rate, y-axis p(primary education rate).]

Figure 4.10: Distributions of no diploma rate (left) and primary education rate (right).

Finally, for each municipality we collected the European Deprivation Index (EDI), which is calculated by selecting fundamental needs associated with both objective and subjective poverty [30]. This index is a combination of socio-economic and ecological variables identified to reflect the individual experience of deprivation:

EDI = 0.11 * “Overcrowding”
    + 0.34 * “No access to a system of central or electric heating”
    + 0.55 * “Non-owner”
    + 0.47 * “Unemployment”
    + 0.23 * “Foreign nationality”
    + 0.52 * “No access to a car”
    + 0.37 * “Unskilled worker-farm worker”
    + 0.45 * “Household with 6+ persons”
    + 0.19 * “Low level of education”
    + 0.41 * “Single-parent household”

Clearly the coefficients are the same for all municipalities, while the values of the variables change from one municipality to another. Figure 4.12 shows the distribution of the EDI, which is well fitted by a normal distribution with average 1.58, minimum 0.76 and maximum 2.79. Big cities and urban areas show the highest deprivation index (heat map in Figure 4.11). Although this could seem surprising, we think there are two possible explanations for the phenomenon. First, the EDI is a complex measure partly based on ecological quality, which tends to be lower in big and overcrowded urban areas. Second, the distribution of wealth is known to be a power law, following the above cited 80/20 rule. Presumably, in the areas with the highest peaks of wealth (metropolises and big urban areas) the power law is more skewed, resulting in a stronger inequality between the rich and the poor part of the population. Since the majority of the population is poor, the majority of the population is in a subjective and objective condition of deprivation. Figure 4.14 shows that the EDI measure, as expected, is positively correlated with the unemployment rate and with a low level of education. Moreover, the EDI is also positively correlated with the total income of the municipalities (Figure 4.13, left), suggesting a positive relationship between the total wealth of a territory and its deprivation index.
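The EDI formula is a fixed weighted sum, which can be sketched in Python as follows. The weights are those listed above; the dictionary keys are hypothetical names introduced for this example, and we assume that each variable is expressed as a share between 0 and 1 for the municipality, which the text does not state explicitly.

```python
# Weights of the EDI formula; the key names are illustrative only.
EDI_WEIGHTS = {
    "overcrowding": 0.11,
    "no_central_or_electric_heating": 0.34,
    "non_owner": 0.55,
    "unemployment": 0.47,
    "foreign_nationality": 0.23,
    "no_car": 0.52,
    "unskilled_or_farm_worker": 0.37,
    "household_6_plus": 0.45,
    "low_education": 0.19,
    "single_parent_household": 0.41,
}


def edi(shares):
    """Weighted sum of the ten deprivation variables of one municipality.

    `shares` maps every key of EDI_WEIGHTS to the value of that variable.
    """
    missing = set(EDI_WEIGHTS) - set(shares)
    if missing:
        raise KeyError("missing variables: " + ", ".join(sorted(missing)))
    return sum(weight * shares[name] for name, weight in EDI_WEIGHTS.items())
```

Under the share assumption the index is bounded by the sum of the weights, 3.64, which is compatible with the observed maximum of 2.79.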


Figure 4.11: Heat map of European Deprivation Index (EDI) per municipality. The EDI is normalized in the range [0, 1]. The map can be downloaded in high definition at this link: https://github.com/Pierpa/maps_high_def.git

[Figure 4.12 plot, linear axes. x-axis: EDI; y-axis: p(EDI).]

Figure 4.12: Distribution of the European Deprivation Index (EDI).


measure                | description
residents              | official number of residents in the municipality
total income           | sum of the incomes of the residents
unemployment rate      | unemployed people (15-64 y.o.) / people older than 15 y.o.
no diplomas            | people without diploma / people older than 15 y.o.
primary education rate | people with primary education only / people older than 15 y.o.
EDI                    | European Deprivation Index [30]


Figure 4.13: Correlation between European Deprivation Index (EDI) and municipality’s income (left) and between European Deprivation Index (EDI) and primary education rate (right) at municipality level.

Figure 4.14: Correlation between European Deprivation Index (EDI) and no diplomas (left) and between European Deprivation Index (EDI) and unemployment rate (right) at municipality level.


Chapter 5 Measuring Human Behavior

In this chapter, we describe the main measures we used in our study to characterize human behavior. For both mobility and sociality we defined two types of individual measures: a measure of volume, specifying the size of the social and mobility world of an individual; and a measure of diversity, representing the dynamicity and the uncertainty related to his social interactions and movements.

5.1 Mobility measures

Inspired by the mechanics of rigid bodies, we can imagine individual mobility as a rigid body made of several parts (locations), and think of the importance of a location as the mass associated with that particular position. In this conceptual framework, the center of mass $\vec{r}_{cm}$ of a user represents his barycenter, the center of gravity of individual mobility. Mathematically, we define it as a two-dimensional vector representing the weighted mean point of the locations visited by an individual. We measure the mass associated to a location with its visitation frequency, obtaining the following definition [26, 27]:

\[ \vec{r}_{cm} = \frac{1}{N} \sum_{i \in L} n_i \vec{r}_i \tag{5.1} \]

where $L$ is the set of locations visited by the user, $\vec{r}_i$ is a two-dimensional vector of the coordinates of location $i$, $n_i$ is its visitation frequency¹, and $N = \sum_{i \in L} n_i$ is the sum over all the locations of $n_i$ (i.e. the total number of calls of the user). Two forces shape the center of mass of a user: the geographic positions of his locations, and their visitation frequency.

The radius of gyration $r_g$ of a user is the characteristic traveled distance, a measure of how far he is from his center of mass [26, 27]. In mathematical terms, it is defined as the root mean square distance of the locations from the center of mass:

\[ r_g = \sqrt{\frac{1}{N} \sum_{i \in L} n_i (\vec{r}_i - \vec{r}_{cm})^2} \tag{5.2} \]

where $L$ is the set of locations visited by the user, $n_i$ is the visitation frequency of location $i$, $N = \sum_{i \in L} n_i$ is the sum of all the single weights, and $\vec{r}_i$ and $\vec{r}_{cm}$ are the coordinate vectors of location $i$ and of the center of mass respectively. Computing the radius of gyration of a user means reconstructing his trajectories and drawing a circle around them, sizing up the neighborhoods the user frequented. It provides a measure of mobility volume, indicating how large the typical distance traveled by an individual is and providing an estimate of his tendency to move. The higher the radius of gyration, the bigger the mobility volume.
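Equations 5.1 and 5.2 translate directly into code. The Python sketch below is a minimal illustration that assumes planar (projected) coordinates and Euclidean distances; with raw latitude and longitude a geodesic distance would be needed instead. The function names and the input layout are hypothetical.

```python
import math


def center_of_mass(visits):
    """Weighted mean point of the visited locations (Equation 5.1).

    `visits` is a list of (x, y, n) tuples: planar coordinates of a
    location and the number of calls n registered there.
    """
    total = float(sum(n for _, _, n in visits))
    x_cm = sum(n * x for x, _, n in visits) / total
    y_cm = sum(n * y for _, y, n in visits) / total
    return x_cm, y_cm


def radius_of_gyration(visits):
    """Root mean square distance from the center of mass (Equation 5.2)."""
    total = float(sum(n for _, _, n in visits))
    x_cm, y_cm = center_of_mass(visits)
    squared = sum(
        n * ((x - x_cm) ** 2 + (y - y_cm) ** 2) for x, y, n in visits
    )
    return math.sqrt(squared / total)
```

As a check, a user with three calls at (0, 0) and one call at (4, 0) has center of mass (1, 0) and a squared radius of (3 * 1 + 1 * 9) / 4 = 3.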

The mobile predictability of an individual can be estimated through the entropy [28], a measure used in physics and information science to compute the degree of disorder characterizing a system. In our case, locations are the particles of the system, whose disorder is determined by the movements performed by the user. In other words, the mobility entropy of an individual quantifies how accurately his future whereabouts can be predicted. To clarify the concept, let us consider the mobility systems of Mark and John. Mark goes to work every day at nine A.M., has lunch at the same restaurant at noon, and leaves the office for home at five P.M., where he stays until the following morning. Mark’s mobility entropy is close to zero, because there is little mystery as to his future whereabouts. John, on the contrary, does not have a daily routine. Every day, he visits a different city of the country, eats each time in a different restaurant, and sleeps each day in a different place. John’s entropy is one, because his mobility life has no systematic nature.

¹ Number of calls registered in location i for the user.

Table 5.1: Description of the individual mobility measures.

measure            | symbol         | description
center of mass     | $\vec{r}_{cm}$ | barycenter of mobility
radius of gyration | $r_g$          | characteristic traveled distance
mobility entropy   | $E$            | predictability of user movements

We first defined the outgoing entropy of a location a with the following formula:

\[ E(a) = -\sum_{b \in V} p(a, b) \log p(a, b) \tag{5.3} \]

where p(a, b) represents the probability of observing a movement by the user from location a to location b, estimated from the number of times he visited b right after location a. E(a) tends to zero when all the movements outgoing from a are directed to a single location, while it grows when the outgoing trips are distributed uniformly over all the other available locations. We then extended it to the concept of mobility entropy of a user u:

\[ E(u) = -\sum_{e \in E} p(e) \log p(e) \tag{5.4} \]

where p(e) is the probability of observing a movement between the locations composing the edge e = (a, b). E(u) tends to one when, starting from a given location, each outgoing trip is directed to a different location (the case of John’s mobility), while it tends to zero when all the outgoing trips are directed to the same location (the case of Mark’s mobility). The mobility entropy of a user also provides information about his mobile diversity. Indeed, it indicates how much that individual diversifies his mobility across the locations. Mark, for example, does not have a great mobile diversity because he visits the same few places over and over. His is a static mobility world. Conversely, John is very erratic and highly dynamic, showing a very diverse mobile system. Figure 5.1 shows the correlation between the radius of gyration and the mobility entropy. We notice that only a weak correlation exists between the two mobility measures, confirming that they capture two different aspects of the same complex phenomenon.
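As a small worked illustration of Equation 5.3 (the numbers are hypothetical and not taken from the dataset), consider a location $a$ whose outgoing trips all reach the same destination, and a location $a'$ whose outgoing trips are spread evenly over four destinations:

```latex
\begin{align*}
E(a)  &= -1 \cdot \log 1 = 0, \\
E(a') &= -\sum_{b=1}^{4} \tfrac{1}{4} \log \tfrac{1}{4} = \log 4 \approx 1.39
         \quad \text{(natural logarithm)}.
\end{align*}
```

The first case corresponds to a routine like Mark's. The second shows that, as written, the entropy of uniformly distributed trips grows with the number of destinations. Dividing by the logarithm of the number of destinations, as is done for the social diversity in Equation 5.6, would bound the value by one; whether such a normalization is applied to the mobility entropy is an assumption here and is not stated by the formulas above.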


Figure 5.1: Correlation between the radius of gyration and the mobility entropy computed over the population of Orange users.


5.2 Sociality measures

The importance of a user within a social network can be measured by the degree, i.e. the number of his social links. The degree provides information about the sociality volume of that individual, or in other words his degree of sociability. Real social networks are highly heterogeneous with respect to the number of friendships [46, 47]: some individuals have very few social contacts while others act as hubs or connectors, having an anomalously large number of links. Due to the large variability in the number of social contacts, the degree provides useful information, capturing whether a user is a hub or a socially inactive individual.

Since our social network is directed, nodes have two different degrees, the in-degree, which is the number of incoming edges, and the out-degree, which is the number of outgoing edges. We define the social degree of a user as the sum of the in-degree and the out-degree:

\[ \deg(u) = \deg_{in}(u) + \deg_{out}(u) \tag{5.5} \]

The social diversity measure, first defined in [25], quantifies the topo-logical social diversity in a social network as the Shannon entropy associated with an individual’s communication behavior:

dsocial(i) = −

Pk

j=1pijlog(pij)

log(k) (5.6)

where $k$ is the out-degree of user $i$, and $p_{ij} = V_{ij} / \sum_{j=1}^{k} V_{ij}$, where $V_{ij}$ is the call volume between user $i$ and user $j$. It measures the social diversification of each individual according to his call pattern. A user who always calls the same people has low diversification, while a user who spreads his calls over a vast network of contacts shows high social diversification. Social diversification is related to the predictability of users’ calls: the higher the diversification, the lower the social predictability of the user. Let us consider the case of Igor and Lucas. Both individuals have a high number of social contacts. However, Igor predominantly calls only one person during the period of observation. His social diversity is hence close to zero, because it is very easy to predict the destination of his future calls. Conversely, Lucas spreads his calls evenly over many different social contacts. Lucas’ social diversity is close to one, because his call pattern is hard to predict. Figure 5.2 shows the correlation between social degree and social diversity. As in the case of the mobility measures, a weak correlation exists.
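The definition of social diversity and the example of Igor and Lucas can be sketched in a few lines of Python. The function name `social_diversity`, the dictionary input format and the call volumes below are illustrative assumptions, not data from the Orange dataset. The value returned for a user with fewer than two contacts is also a choice of this sketch, since the normalization by log(k) is undefined for k = 1.

```python
import math


def social_diversity(call_volumes):
    """Normalized Shannon entropy of one user's outgoing call volumes.

    `call_volumes` maps each contact j to V_ij, the call volume from
    user i to contact j (hypothetical input format).
    """
    volumes = [v for v in call_volumes.values() if v > 0]
    k = len(volumes)  # out-degree of the user
    if k < 2:
        # log(1) = 0 would make the ratio undefined; returning 0.0
        # (fully predictable) is an assumption of this sketch.
        return 0.0
    total = sum(volumes)
    entropy = -sum((v / total) * math.log(v / total) for v in volumes)
    return entropy / math.log(k)


# Invented volumes: Igor concentrates his calls on one contact,
# Lucas spreads them evenly over all of his contacts.
igor = {"c1": 97, "c2": 1, "c3": 1, "c4": 1}
lucas = {"c1": 25, "c2": 25, "c3": 25, "c4": 25}

d_igor = social_diversity(igor)
d_lucas = social_diversity(lucas)
```

With these invented volumes, Igor's diversity is roughly 0.12 while Lucas' is 1 up to floating-point error, which matches the qualitative description above.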

Table 5.2: Description of the individual sociality measures.

measure             symbol          description
social degree       $deg$           sociability of the user
social diversity    $d_{social}$    social diversity of the user

Figure 5.2: Correlation between the social degree and the social diversity computed over the population of Orange users.
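A correlation of this kind could be quantified, for instance, with the Pearson coefficient. The thesis text does not state here which coefficient was used, so the choice of Pearson, the function name `pearson` and the toy vectors are assumptions of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Invented per-user values, only to exercise the function.
toy_degree = [2, 5, 9, 14, 30]
toy_diversity = [0.70, 0.82, 0.78, 0.85, 0.80]
r = pearson(toy_degree, toy_diversity)
```

In practice the two input vectors would hold the social degree and the social diversity of each user, in the same user order.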


Chapter 6 Results

In this chapter we present the core work of this thesis, describing the experiments performed and the results obtained. A detailed discussion of the interpretation of the results is provided in the next chapter.

6.1 Individual level

In this section, we investigate the interconnection between mobility and sociality at the individual level, trying to answer the following question: is the social behavior of an individual influenced by his mobility behavior, and vice versa? In other words, do our mobility features correlate with the characteristics of our social interactions?

To address these issues, for each user in the filtered dataset we computed the four individual measures defined in Chapter 5: radius of gyration, mobility entropy, social degree and social diversity. First, we studied how the volume and diversity measures are distributed over the population of Orange users: a distribution shows how much a given quantity varies across the population. Figure 6.1 shows the probability distributions of the volume measures: radius of gyration (left plot) and social degree (right plot). Both variables show great variability across the population, since the distributions are well fitted by power law-like functions. In a mobility scenario, the emergence of a power law means that most of us confine our lives to a very small mobility circle, a few kilometers at most, moving back and forth among several nearby locations. However, this highly localized majority coexists with some people who move dozens of kilometers every day, and a few individuals who travel hundreds of kilometers. Our results are consistent with those found in previous studies on human mobility [5, 27].

Such great heterogeneity also characterizes our social behavior: the power law that emerges from the degree distribution indicates that the majority of users in the social network have only a few social links, and that these numerous tiny nodes coexist with a few big hubs, nodes with an anomalously high number of social bonds. The observed social and mobile heterogeneity thus forces us to abandon the idea of a characteristic scale or a characteristic node. In other words, radius of gyration and social degree are scale-free: in a continuous hierarchy there is no single node which we could pick out and claim to be characteristic of all the nodes.

Figure 6.1: Probability density functions of radius of gyration (left) and social degree (right).
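The text does not specify how the power law-like fits were obtained. As one possible approach, the sketch below estimates the exponent of a continuous power law with the standard maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). The function name `powerlaw_alpha_mle`, the threshold `x_min` and the synthetic sample are assumptions of this example; for a discrete quantity such as the social degree the continuous estimator is only an approximation.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Maximum-likelihood exponent of p(x) ~ x^(-alpha) for x >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


# Synthetic sample drawn by inverse-transform sampling from a power law
# with a known exponent, used only to check the estimator.
rng = random.Random(0)
true_alpha, x_min = 2.5, 1.0
sample = [
    x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]
alpha_hat = powerlaw_alpha_mle(sample, x_min)
```

On the synthetic sample the estimate should fall close to the exponent used to generate it; on real data the result depends strongly on the choice of `x_min`.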

When it comes to social or mobile predictability, power laws are replaced by Gaussians. Indeed, the distributions of mobility entropy and social diversity are peaked around characteristic values, as we can see from the plots in Figure 6.2. The two peaks, or modes, of the distributions (0.75 and 0.8 for mobility entropy and social diversity, respectively) imply that the vast majority of individuals have high mobility entropy or social diversity, and that nodes deviating from this characteristic behavior are rather rare. Whether we limit our life to a neighborhood or drive dozens of kilometers each day, we are just as predictable as everyone else. In other words, despite the heterogeneity observed in the mobility volume, when it comes to our whereabouts we are all equally
