Mobility Data-driven Analysis and Characterization of Urban Regions

(1)

D GD

-8I 8 8 G 8A C 8 8

C

8C

IG C GG C D B8 G

-

2

2 -D9 A

8 8

C C8A G G 8C

8 8

L8 DC D

98C 2

DCG

2 - 02

/

0 D /8CC

G DE 1 8A L

CCD 8 8 B D

(2)

Abstract

Methods for the description and characterization of urban regions with human mobility GPS data are presented. The spatial distribution of human activity is measured with spatial entropy, which expresses how equally activities are distributed across areas. Fur-thermore, Moran’s I, a measure for spatial autocorrelation, is used to measure if activities are geographically clustered or dispersed. Flows within urban areas are modeled as a network whose characteristics are used to describe the regions. On the level of individual drivers, networks that represent the movements of single cars (Individual Mobility Net-works) are used to describe the traveling behavior of a region’s inhabitants. Finally, road networks are analyzed by calculating the amount of intersections, dead-ends and roads, as well as the concentration of traffic flowing through them. All methods are applied to a database consisting of 18 million car trajectories in the Tuscany region and the results are presented. The methods and results aim at facilitating future research in human mobility science and serve as a concise way of describing urban regions.

(3)

(4)

Introduction

1.1 Context

The recent emergence of tracking devices has endowed human mobility science with a vast amount of valuable data. GPS data produced by phones and trackers in vehicles provide a detailed view of individual movements on a large scale. Human mobility scientists, coming from a wide range of fields such as geography, sociology, economics, urban planning and computer science, use these new data sources to mine for patterns of individual behavior as well as global rules that govern human mobility. For example at CNR Pisa, Pappalardo et al. [19] have combined individual GPS data from mobile phones with regional data about economic prosperity and show a correlation between mobility patterns and well-being. At the same institution, Rinzivillo et al. [21] have developed a learning algorithm that classifies the purpose of individual car trajectories (trip to work, leisure, shopping etc.), thus enriching mobility datasets with a semantic layer.

A major obstacle in research is that human mobility does not necessarily work the same way across di↵erent regions. A trajectory pattern in a mountainous countryside may have other implications than the same pattern has in the suburbs of a large town. The trajectories in a planned city with rectangular streets and strict zoning laws might be completely di↵erent than the ones in a town that has grown organically without any

(7)

clear structure. Therefore, knowledge and models that were produced with the data of a particular area cannot simply be transferred to another geographic location.

1.2 Objectives and Content of the Thesis

This thesis aims at making a first step to tackle this problem. Its objective is to develop and present a range of quantitative measures that describe and characterize urban regions. The methods introduced in this thesis aim at producing results that can serve as a first step in other research, by providing a numerical description about what kind of urban area is being dealt with. Moreover, the measures presented in this work can be used as an input for transfer learning, that is the transformation of knowledge gained in one geographical region in order to apply it to another region with di↵erent characteristics. Finally, the methods of this thesis should produce results that stand for itself, as a multilayer description of urban regions and a means for displaying di↵erences between municipalities.

Four di↵erent approaches are used: In chapter 2, we investigate how human activities are spatially concentrated or dispersed across urban areas. Chapter 3 models the movements within an urban region as a graph that contains the flow of trajectories between the di↵erent parts of a region. Chapter 4 treats the traffic on an individual level, by analyzing networks that represent the places and movement of single cars. Lastly, chapter 5 presents the characteristics of road networks and how traffic is distributed in them.

For this work, concepts from di↵erent scientific backgrounds were used as starting points and adapted for the purpose of characterizing urban regions: In chapter 2, measures from geography are used. The third chapter describes urban regions with key figures from network science and interaction models that stem from economic geography. In chapter 4, a Human Mobility Mining approach is adopted, which was first presented by Rinzivillo et al. [21], while chapter 5 operates with simple descriptive statistics. Many of these methods had originally been developed for specific questions and issues, such as the understanding of commuting patterns [12] or the spread of economic activities [11]. The novelty in this thesis is that these methods were adapted to describe and characterize urban areas in a general sense.

(8)

1.3. CASE STUDY 3

Figure 1.1: Left: The areas used for the case study: A 10⇥10km square around the center of each of the 276 municipalities of Tuscany. Right: The area used for Pisa with a 20_{⇥20-grid, which will} be used in chapters 2 and 3.

Another novelty of this work is that all methods were implemented in a way that makes it possible to automatically calculate all characteristics for hundreds of di↵erent cities and entire regions. The resulting library in the Python programming language enables the user to process an unlimited amount of data simply by passing a database with trajectories and a list containing the positions of the geographical areas of interest as an input.

The results of this work are already being used and enhanced in the scope of another student’s Master thesis at CNR Pisa, whose objective it is to investigate traffic across regional boundaries. Furthermore, the results of the case study, which will be described in the next section, are being used in the Knowledge Discovery and Data Mining Laboratory (KDD Lab) for Track & Know, a European project about Big Data in the context of human mobility and are currently planned to be published.

1.3 Case Study

All methods presented in this thesis were applied to a database of human mobility trajec-tories in the region of Tuscany, made available by the KDD Lab in Pisa. The database contains 18.9 million trajectories of 250,239 cars, which were collected during a period of

(9)

seven weeks in the beginning of 2014. The trajectories are preprocessed signals of GPS trackers in the cars of participating drivers. Each of the trajectories consists of a sequence of longitude-latitude pairs, given on average in intervals of 1-2 minutes, that form an ap-proximation of the trip. Additional information in the database are the timestamps of the trajectory’s beginning and end, as well as an anonymized identification number that links trajectories to cars.

All measures were calculated separately for each of the 276 municipalities of Tuscany. The results are presented at the end of each chapter. Since the goal of this case study is to obtain an objective comparison of di↵erent regions, the areas to investigate were chosen to be a 10_{⇥10km rectangle for each municipality, with the sides approximately parallel to} the meridians and parallels, centered around the town or village center (see Figure 1.1).

The advantage of this method is that it is not necessary to consider possible e↵ects of shapes and sizes on the numerical outcomes. The obvious disadvantages are that there are many overlapping municipalities (mainly in the dense area around Florence) and that large municipalities with a diameter of more than 10km, such as Grosseto, are not fully covered in this case study, as we can see in a plot of all areas in Figure 1.1. Also, the areas of some costal areas and island municipalities are partly above the ocean, which influences some of the measures calculated.

(10)

Chapter 2

Spatial Concentration

One of the most important aspects in the description of urban regions is the question of how the density of people and activities vary across the area. In empirical research, this question was traditionally focused on people’s residency and workplace, since that was the only available data, mostly coming from census or government records. Due to the lack of more detailed data, the granularity of many studies is often on a large scale, with the smallest distinguishable areas being administrative or geographical entities, such as in Neme¸s et al. [14], where census and government data are used to analyze population density, economic turnover and green spaces. More recent research is profiting from the availability of more detailed data from mobile phones, vehicle trackers and satellite imag-ing, like Jat et al. [9], who use remote sensing from satellite images to measure urban growth in a scale of 5m - 25m.

Spatial concentration is used in a vast range of di↵erent fields. Examples of research using this concept are the analysis of the concentration in di↵erent industrial economic activities [11]. Another application was performed by Shekhar [26], who measured urban sprawl in India with satellite imaging. In this thesis, the concept of spatial concentration will be focused on the overall amount of mobility, undi↵erentiated by types of activity. The question of interest is: Are the activities concentrated in cluster-like centers of high density or are they spread-out across the map?

(11)

Three approaches to answer this question will be presented in this chapter: spatial entropy, Moran’s measure for spatial autocorrelation (Moran’s I ) and the average nearest neighbor distance. The first two can only be calculated after the geographical space has been divided into a set of disjoint areas. A classic approach for this subdivision is the use of small administrative boundaries, such as voting districts. Another approach is the use of a grid, also called raster [22]. The grid-method does not necessarily have to split the area into equally-sized fields. For example Yuan et al. [28] create a grid by segmenting a city into fields that are separated by major roads, in order to identify di↵erent functional regions.

Administrative boundaries will not be used in this thesis, since they bear large disad-vantages: There is the risk of not being consistent across regions. Also, the boundaries are chosen by the authorities in order to separate areas that historically or functionally belong together. Since we would like to obtain the most objective analysis possible and do not want to be influenced by the preexisting analysis by government officials, only the equally-spaced grid approach is used in this thesis. In the case study (described in section 2.4), the 100km2 _{area is divided into raster of equal rows and columns, resulting}

in a grid consisting of equally-sized quadratic fields.

2.1 Entropy

The first technique for the measurement of spatial concentration is Shannon’s entropy. It can be used as a measure of equality for various applications. In this thesis, it measures how equally activities are distributed across the grid fields. It was presented in 1948 by Shannon [25] as an application of the concept of physical entropy in the field of information theory. It is applied to a discrete random variable X that has n possible outcomes and in our case, n possible grid fields. The definition is:

entropy =

n

X

i=1

P(xi) log P(xi) (2.1)

where_{x1, . . . , xn} are the possible values of X and P(xi) is the probability of X having

state i. The basis of the logarithm can be chosen at will. In this thesis, the natural logarithm is used. In the case of human mobility the random variable is the position of

(12)

2.1. ENTROPY 7

Figure 2.1: Two example grids with di↵erent entropy and Moran’s I values. (expected Moran’s I under random distribution: -0.067)

an individual, which can end up in n di↵erent fields on the map. The probabilities are estimated by taking the amount of data points in the respective fields divided by the total amount. An example of two grids with di↵erent entropy can be seen in Figure 2.1.

The concept was adapted for geographical applications by Batty in 1974 [3], who coined the concept of spatial entropy, which takes into consideration that fields might have di↵erent sizes and adapts the entropy formula to balance out those size e↵ects. Since in this thesis, all entropy calculations are performed on a grid with equally-sized fields, we will only use the above-presented normal form of Shannon’s entropy.

The case of maximum entropy would be a situation with an equal amount of activity in all fields, as opposed to the case of minimal entropy, where all activity is amassed in one single field and zero activity persists in all other fields. The former would yield an entropy of log(n), n being the amount of fields, the latter would have an entropy of zero.

In order to compare entropy scores of di↵erent-sized grids, the measure must be normalized by dividing it by the expected entropy of a uniform distribution, log(n). The normalized entropy is therefore

normalized entropy = Pn

i=1P(xi) logbP(xi)

(13)

Figure 2.2: Grids with the same entropy values as the grids in Figure 2.1, but higher Moran’s I values. (expected Moran’s I under random distribution: -0.067)

This entropy ignores the geographical positions of the fields and the distances between them. Figure 2.2 grids illustrate this. Both grids have the same entropy value as the ones in Figure 2.1, despite the fact that, from an intuitive point of view, they have a much stronger spatial concentration, because the high-value fields are closer together.

2.2 Moran’s I

The entropy measure from the previous section ignores the geographical position of the bins in which the concentration is measured. It consequently lacks an important aspect of spatial concentration, namely that dense fields are near each other. We therefore in-troduce a property that takes into account how the fields are positioned in space: spatial autocorrelation [22]. As the name suggests, it represents the degree to which the fields’ values are correlated to the value of neighboring fields.

For spatial autocorrelation measures, the nearness between all pairs of fields must be defined with a so-called weight matrix w, where wij is the nearness (or negative distance) between nodes i and j. A simple form of weight matrix is an adjacency matrix, with the value 1 if fields are adjacent and the value 0, if not. Adjacency in a rectangular grid is usually defined either as rook or as queen adjacency (both terms allude to the movement patterns of the chess pieces of the same name), the former only considering fields as neighbors if they are adjacent sideways, the latter also including diagonal neighbors. In

(14)

2.2. MORAN’S I 9

this thesis, rook adjacency matrices are used. Four examples of simple 4⇥ 4-grids with their concentration measures can be seen in Figure 2.1 and Figure 2.2.

An important di↵erence to the entropy is that spatial autocorrelation has two directions. A high autocorrelation indicates that values of the same magnitude are prone to be next to each other, while a low autocorrelation means that similar values are less likely to be near each other than under random positioning. Somewhere in between lies a value of autocorrelation in which the population of the fields is how one would expect it to be under a random distribution.

The most famous of many spatial autocorrelation measures is Moran’s I , which has the following definition [13]: I = N W P i P jwij(xi x)(x¯ j x)¯ P i(xi x)¯ 2 (2.3)

where N is the number of fields with the indices i and j. x is the value of the corresponding fields (in the case of density the amount of activity or population) and ¯x is the average field value. w is the above-mentioned with matrix with W as the sum of all its weights.

A low (high) Moran’s I indicates a low (high) spatial autocorrelation. The minimum and maximum possible values of Moran’s I depend on the weight matrix used. In most applications, also in this thesis, a symmetric and row-standardized matrix is used, that is a matrix where two fields have only one distance, no matter which field is named first, and the sum of distances of one field to all others is 1. In this case, Moran’s I ranges from -1 to +1.

An important caveat is that the absence of autocorrelation is not given at a Moran’s I of zero, but at the value that is expected under assumption of a random distribution with no autocorrelation, which is

E(I) = 1

(15)

The expected value is negative, but under high amounts of fields, such as in grids with very small fields, this value tends to zero.

2.3 Nearest Neighbor Distance

The previous measures have the drawback that there is the arbitrary element of the choice of the grid field size. In order to have at least one measure that is not dependent on a grid and the arbitrary choices that come along with it, the Average Nearest Neighbor Distance(ANND) is used as a third and last measure for spatial concentration. The concept of this measure is simple: For every point, the distance to its nearest neighbor is calculated. The mean of those values is the ANND. The equation is

AN N D = P

imin(di)

N (2.5)

where di is a vector containing the distances of point i to all the other points. N is the

amount of points. The ANNDis easily interpretable, since it measures an actual geo-graphical distance. The lower this value is, the higher is the average spatial concentration in the areas surrounding the points. It is important to note that this definition bears a similar weakness as the entropy: if we consider the data to be pairs of nearest neighbors distributed through space, the positioning of those pairs across the space does not a↵ect the ANNDas the distances within the pairs remains unchanged, as Figure 2.3 exemplifies.

The expected average nearest neighbor distance under assumption of a uniform distribu-tion of points across the area, called Mean Random Nearest Neighbor Distance (MRNND), is

M RN N D = 0.5 r

A

N (2.6)

with A as the surface of the area and N as the amount of points. It is important to consider that, in contrast to entropy, the expected value does not mark the value of least spatial

(16)

2.3. NEAREST NEIGHBOR DISTANCE 11

Figure 2.3: Example average nearest neighbor distances. The lower maps have the same ANND-value as the ones above them, although the lower points are globally more spread out.

concentration, since under a random uniform distribution, we expect a certain amount of random concentration. The situation of maximum possible ANNDwould occur, when all points are perfectly spread out across the whole area and accordingly every point has the same distance to their nearest neighbors. The ANNDwould then be

q

A N.

This meaning behind this value becomes clear, if we consider the case of a quadratic area. In that case the points are aligned in horizontal and vertical lines, forming the corners of a grid consisting of equally-sized squares. The horizontal pN lines consist of pN points each. Each of those lines has the side length of the area, that ispA. Since the points are equally spaced throughout the line, the distance between the points are

q

A

N. On the other

end, the lowest possible ANNDvalue, which signifies the maximum possible concentration, is evidently zero, in the case that every point lies on top of another point.

The MRNND of Equation 2.6 can be used to transform the Average Nearest Neighbor Distanceinto an index that is comparable among samples with di↵erent sample sizes and areas. By dividing the ANNDby the MRNND, we obtain the Nearest Neighbor Index :

N N I = P imin(di) N , (0.5 r A N) (2.7)

(17)

A N earestN eighborIndex smaller than 1 indicates that there is higher spatial concentra-tion than in a random case, whilst a value above 1 shows that the points are spread out across the map more than one expects in a random scenario.

2.4 Case Study

Grid-based Measures

For each of the 276 municipalities in the database, all trajectories that end within the municipality’s bound were extracted. The end points of those trajectories are considered to be the position of a human activity, since they occur, whenever an individual has stopped their car. The municipalities’ areas were divided into a rectangular grid made up of equally-sized squares (see Figure 1.1 for an example grid). For each grid field, the number of total activities that have taken place within its bounds, was calculated. Note that all activities regardless of day and time are summed up. The resulting measure of concentration therefore refers to the overall usage of space and not to the actual spatial concentration at a given point in time.

For the calculation of entropy, the values in the grid are normalized to have a sum of 1, so the resulting numbers can be used as probabilities in formula 2.1. Moran’s I was calculated with the same values (normalization does not influence Moran’s I ) and a rook adjacency matrix was used as weight matrix. For the calculation of the Average Nearest Neighbor Distance, no subdivision into a grid is needed.

In order to determine how strongly the choice of the grid influences the entropy and Moran’s I, both measures were calculated in three variants. Each variant is a grid that splits the relevant area into a certain number of equally large rows and columns. The three grid variants are 10⇥ 10, 20 ⇥ 20 and 50 ⇥ 50. The respective sizes of a single field in the quadratic area of 100 km2 are 1 km2 (1 km side length), 0.25 km2 (0.5 km side length) and 0.04 km2 _{(0.2 km side length).}

In Figure 2.4, the Pearson correlation coefficients between the di↵erent measures can be seen. The entropy values of di↵erent sizes are highly correlated with the coefficients of 0.97,

(18)

2.4. CASE STUDY 13

Figure 2.4: Correlation of normalized entropy and Moran’s I with di↵erent grid sizes (10, 20 and 50 rows/columns).

0.96 and 0.88 for the row/column number pairs of 50–20, 20–10 and 50–10, respectively. The correlation between the Moran’s I values of di↵erent sizes are also high, but with slightly lower coefficients of 0.82, 0.82 and 0.66 for the pairs of 50–20, 20–10 and 50–10, respectively. The low correlation of 0.66 for Moran’s I with grid sizes 50_{⇥50 and 10⇥10} shows that there is some e↵ect of the scale on the measurement of spatial autocorrelation, but all in all, the arbitrary choice of grid size does not seem to have an overwhelming influence on the results. The following analysis will only use of the values for a 20_⇥20 grid and leave the other grid sizes aside. With this size, it is still possible to have large enough sample sizes, also for less active regions, but it also allows to capture phenomena on a small-scale level of rectangles with the size of 500⇥500m.

Average Nearest Neighbor Distance

The average nearest neighbor distance di↵ers from the previous measures in that it does not depend on a grid. A second di↵erence is that, since it measures actual distances between points, it is not suitable to aggregate all activities over time, like we did in the case of entropy and Moran’s I. This type of aggregation would remove the meaning of

(19)

the measure as a distance between cars and would result in numbers close to zero, since individuals tend to repeat their activities at the same locations.

Therefore, for the ANND, the locations of all ongoing activities were retrieved for every full hour of every day in the database. The timing of an activity is defined as the interval between the stopping of the car to when it starts a new trajectory. A maximum duration of 24 hours per activity was chosen to avoid overestimating the duration of activities. For every hourly snapshot of activities, the ANNI was calculated.

For all of the 276 municipalities in the database, the average value of all available snapshots was calculated. It was taken into account that weekdays could potentially show di↵erent driving patterns from weekends. To avoid a possible averaging-out of those characteristics, only the weekdays were used, of which 35 are recorded in the database.

The problem with the results is that only the largest cities of Tuscany have large enough data for a reliable estimate. Florence for example has a vast amount of activity, with an average amount of activity points of 9200 at every full hour. More than 100 municipalities however have an average activity count of less than 400. Over 50 municipalities did not have a single snapshot with more than 100 activities, which we considered a minimum threshold, to perform the calculations.

Technically, normalizing the average nearest neighbor distance into the average nearest neighbor index, should compensate for di↵erent sample sizes. But the sample sizes of many municipalities are so small that they carry the danger of randomness and noisy data, as well as the possibility of a systematic bias coming from missing values. Furthermore, the lack of correlation to any other measure of spatial concentration (0.06 to entropy and -0.078 to Moran’s I , see Figure 2.7) causes doubt about the expressiveness of the measure. Therefore, the nearest neighbor distance will not play a prominent part in the rest of this chapter. Instead, the focus will lie on the two grid-base measures, entropy and Moran’s I .

(20)

2.4. CASE STUDY 15

Figure 2.5: Heatmap of activities in Pisa (left, entropy: 0.816, Moran’s I : 0.65) and Florence (right, entropy: 0.889, Moran’s I : 0.698) with a 20_{⇥20 grid.}

Results

The distributions of the concentration measures can be seen in Figure 2.6. Both distri-butions have a clear peak in the center. The entropy has outliers on the left side of the spectrum, which is mainly due to island municipalities, whose grid fields only consist of ocean, thus leading to a higher spatial concentration measure.

Figure 2.6: Histograms showing the distributions of normalized Entropy (20_{⇥20 grid) (left),} Moran’s I (20*20 grid) (right).

(21)

Table 2.1 shows the averages and the results for the top-5 municipalities in terms of entropy and the 5 cities with the lowest entropy. The Moran’s I values point to a positive spatial concentration in all 276 municipalities, in fact all Moran’s I values are higher than -0.0025, which is the neutral value in a 20_{⇥20 grid.}

Rank Municipality Ent 20*20 Moran’s I 20*20 26 Pisa 0.816 0.650 1 Firenze 0.889 0.698 132 Montepulciano 0.667 0.336 87 Grosseto 0.714 0.794 1 Firenze 0.889 0.698 2 Campi bisenzio 0.881 0.487 3 Agliana 0.881 0.501 4 Prato 0.877 0.548 5 Porcari 0.872 0.488 ... ... ... ... 272 Badia tedalda 0.431 0.295 273 Monteverdi marittimo 0.43 0.098 274 Palazzuolo sul senio 0.406 0.012 275 Isola del giglio 0.391 0.186 276 Capraia isola 0.235 0.201 Mean 0.663 0.437 Median 0.661 0.438 Std.dev. 0.109 0.15

Table 2.1: Values of entropy and Moran’s I (for a grid of 20 rows and columns), ranked by de-creasing entropy: Four example cities, top and lowest 5 municipalities in terms of entropy, mean, median and standard deviation of all 276 municipalities.

In Figure 2.7, we see that the Moran’s I and entropy are strongly positively correlated. It is a counter-intuitive result, since entropy and Moran’s I both measure concepts related to the general idea of the concentration of activities, the entropy negatively and Moran’s I positively. One would therefore expect a strong negative correlation. It is however not so contradictory when we consider that the same field sizes of 500m_{⇥500m have been} used for both measures. The fact that the activities mainly take place in a few fields does not imply that these fields are in spatial proximity. Vice-versa the fact that fields which are near each other also share similar amounts of activity does not imply that activities are concentrated in few fields. Quite the contrary, in the concept of human

(22)

2.4. CASE STUDY 17

Rank City Entropy 20⇥20 Moran’s I 20⇥20 1 Campagnatico 0.573 0.105 2 Sillano Giuncugnano 0.551 0.119 3 Semproniano 0.552 0.127 4 Santa Luce 0.580 0.140 5 Lajatico 0.567 0.165 ... ... ... ... 40 Volterra 0.554 0.587 41 Piombino 0.586 0.604 42 Massa Marittima 0.547 0.617 43 Campiglia Marittima 0.594 0.644 44 Follonica 0.597 0.670

Table 2.2: Only considering regions with entropy between 0.547 and 0.597 (See blue space in Figure 2.8): Top and lowest 5 values (for a grid of 20 rows and columns), ranked by ascending Moran’s I .

activity in urban regions, it seems realistic that there is an inverse relationship, e.g. that relatively high-density fields are associated with surrounding fields of low density or that city centers somehow geographically repel each other. Firenze and Pisa are an example of this positive relationship between entropy and Moran’s I. Their density heatmaps can be seen in Figure 2.5.

Figure 2.7: Correlation of normalized entropy (20_{⇥20 grid), Moran’s I (20⇥20 grid), average} nearest neighbor index and sample size (amount of trips in the database).

(23)

Despite the highly positive correlation of 0.54, there is still a large amount of municipalities where Moran’s I is highly di↵erent, although their entropy value is roughly the same. To take a look at those divergence in spatial autocorrelation which is independant of the entropy in the grid, an arbitrary range was chosen (marked with a blue box in Figure 2.8), in which all municipalities lie in a middle narrow range of entropy, but show a high degree of variance in Moran’s I. The top and bottom 5 cities of these 44 municipalities in terms of Moran’s I are shown in table 2.2. The span between the highest and the lowest Moran value is very large, going from 0.105 (1st percentile of all values) to 0.67 (97th percentile of all values).

Figure 2.8: Scatterplot of Moran’s I against entropy (both measured with a 20⇥20 grid).

In Figure 2.9, the activity heatmaps of two extreme cases are shown: Campagnatico with the lowest and Massa Martittima with the third-highest Moran’s I value of the selected range. The two highest ones, Follonica and Campiglia Marittima were not used since their bounding boxes are intersecting the mediterranean sea to a substantial degree. It is only logical that they obtain high values of spatial autocorrelation since all fields over the ocean, have values of zero. Those zero-values are all adjacent and thus lead to a high Moran’s I.

(24)

2.4. CASE STUDY 19

Figure 2.9: Heatmap of activities in grid fields and map of two regions with similar entropy, but di↵erent Moran’s I. Left: Campagnatico (Entropy: 0.573, Moran’s I: 0.105), Right: Massa Marittima (Entropy: 0.547, Moran’s I: 0.617). For better visibility the heatmaps and street maps are plotted separately.

We see in Figure 2.9 that, judged by the naked eye, the degrees of activity, although much higher in Campagnatico in absolute numbers than in Massa Marittima, are similarly distributed over all grid fields, as the near-identicall entropy value suggests. The dense fields of Massa Marittima (on the right side) however are mostly clustered together in the center and, to a smaller amount, in the south west, near Valpiana. In contrast, acive fields of Campagnatico (on the left) are distributed all over the grid and often have empty spaces between them. Campagnatico’s center fills only one field in the middle. The sole fact that Massa Marittima has a large center that spans accross many fields drives up the measure of spatial autocorrelation. This fits the information in Figure 2.7, that Moran’s I is positively correlated to the number of trajectories in the database used (Pearson coefficient: 0.54), provided that this number can be used as a proxy for heavily populated areas with big centers.

(25)

This e↵ect can also be seen in Figure 2.10, a plot of the results on a map of Tuscany, with colors representing the entropy and Moran values. There, the Moran’s I values (right image) are high in the communities around Florence, Livorno and Arezzo, three of the most populated cities of Toscany. The same goes for the entropy (left image), which is also positively correlated to the number of trajectory samples (Pearson coefficient: 0.68).

Figure 2.10: Results of all 276 municipalities on a map with color indicating the entropy (left) and Moran’s I (right) for a 20_{⇥20 grid.}

(26)

Chapter 3

Flows in a Grid Network

In order to capture the information about flows in urban regions, the data of every mu-nicipality is transformed into a directed weighted graph that represents the flow of the people’s trajectories. The graph consists of

• a set of nodes or vertices V , which represent places that are origins and destinations of trajectories,

• a set of edges E, which are the directed connections between the nodes,

• a weight function w : E ! IR: a function that maps each edge to a weight, which indicates the amount of trajectories that occur along the edge.

For this purpose, only trajectories that begin and end within the boundaries of the area of interest are used. In order to have a limited amount of nodes, the map is split into fields of a rectangular grid across the whole surface, like in chapter 2. All origins and destinations are matched to the field in which they lie. The network is created by assigning to each edge weight the amount of flows occurring along the edge. The weight function w is therefore basically equivalent to an origin destination matrix. The network perspective allows us to gain knowledge about the structure of a region by looking at the properties of the resulting network. The major disadvantage of this approach is, like in the case of Moran’s I and entropy in the previous chapter, that the choice of the grid field size is somehow arbitrary and has an influence on the outcome.

(27)

3.1 Node Degrees

A basic property of the network is the distribution of its degrees. Degree is hereby defined as the total traffic (sum of in- and out-flow) of a grid field. In graph terms, this would be the sum of the weights of all outgoing and incoming edges. This measure is sometimes also referred to as node-flux [24].

The degree distribution function returns the frequency of each degree. It can be graphically presented in the form of a histogram or scatterplot. In many real-world networks, this distribution takes on the form of a scale-free decreasing distribution. [2] Another approach is to take the list of all nodes’ degrees and sort them in non-ascending order. We can then compare the values between di↵erent regions. Higher values in the low positions indicate that arrivals and departures are unequally distributed and thus concentrated in few fields.

3.2 Louvain Modularity

An interesting quality of networks is the degree to which they can be partitioned into groups of nodes, such that the connectivity is high within those groups, and low in between. In the context of urban regions, the corresponding question is: Can the city be split into areas that are relatively autonomous and have only low interaction between them?

In network science, modularity measures this property for a given partitioning. As opposed to the concept of clustering, a graph partitioning separates the graph’s nodes exhaustively into non-overlapping communities. Modularity shows the di↵erence between the relative amount of inner-community links and the expected relative amount under random linking in a non-directed weighted graph [2]. Its equation is:

Q = 1 2m X ij  Aij kikj 2m (ci, cj) (3.1)

(28)

3.2. LOUVAIN MODULARITY 23

• ki and kj are the sum of the weights of the edges attached to nodes i and j

respec-tively,

• m is the sum of all of the edge weights in the graph (and 2m is the sum of all node degrees, when the degree is defined as the summed weights of edges adjacent to a node,)

• ci and cj are the communities of the nodes and

• is a simple indicator function that returns 1 if the two arguments belong to the same community.

The modularity goes from 1 to +1, where 0 marks the value expected in a network where all possible edges have the same expected weight. Modularity is classically only defined for undirected networks. Although there are attempts to generalize the concept for weighted graphs [16], the normal non-directed version is used in this thesis, since it is sufficient to explain the questions whether the urban regions can be partitioned well into relatively independent sub-regions. The direction of traffic flow is not important for this question. The grid networks in this thesis are therefore transformed into non-directed networks before the modularity is calculated. This is achieved by assigning to each non-directed edge weight the sum of the two previous directed edges.

Modularity does not describe a network on its own, but a network along with its partition. In order to quantify, how well an urban region is separable into di↵erent sub-areas, we must find the best possible partitioning, the modularity of which will serve us as a measure. The amount of possible partitions in a network of n nodes is given by the Bell number Bn.

The Bell sequence has a growth rate that exceeds exponential complexity. It is therefore not feasible to compare all possibilities, even with a small grid network of a hundred fields. An approximative algorithm to find the best partition with regards to modularity is the Louvain Algorithm[4]. It does not guarantee an optimal solution, but it performs well and reduces the time complexity drastically, to linear speed with regards to the amount of edges. The algorithm achieves that with a hierarchical approach. It starts with small communities of high modularity and merges those communities together into a single node, thus simplifying the network for the next iteration. At the end of he algorithm, there is a small network with few nodes, each of which represents a community that contains nodes of the original network. All that remains is to calculate the modularity score of the resulting partition.

(29)

3.3 Interaction Models

The flow network allows us to test how well the empirical data aligns with two estab-lished models that describe human interaction in space. Two models are introduced. In section 3.4, the case study data will be fitted to the models and the R2 values will be presented to indicate how well the data fits the models. The approach orients itself on Masucci et al. [12], who applied the same technique on commuter data and transportation flows of England and Wales.

Gravitation Model

The older of the two models, the gravitational model, was first introduced by Alonso [1] in 1976. Its basic idea is that the traffic flow from place i to place j depends on the origin population mi and the destination population nj. Highly populated places, according to

the model, attract flow towards them, like objects with large masses do with gravity. The classic model predicts the traffic flow from place i to place j, which have a distance of r as

Gij = A

m↵_in_j

r (3.2)

A is a normalization factor which assures that the result has the right scale. ↵, and are the model’s parameters. They can be optimized by multiple regression when fitting data to the model. In this thesis a more simple model with only one parameter is used, which is a special case of the general model of equation 3.3 with ↵ and = 1 [12]:

Gij = A

minj

(30)

3.4. CASE STUDY 25

Radiation Model

The second model was introduced by Barabasi in 2012 [27]. Like the gravity model, it uses the origin population miand the destination population nj and introduces a third element:

sij, that is the population within a circle around place i, with a radius of its distance to

place j, minus miand nj. The intuition behind the model is that outgoing trips are being

attracted by nearby populations, like particles being “absorbed by surrounding locations” [12]. It predicts the flow Tij between i and j with the following equation:

T_ij1= Ti

mi nj

(mi+ sij)(mi+ nj+ sij)

(3.4)

Ti is the sum of outflows from place i, so the right part of the formula is the fraction of i’s

outflows that go to j. The model was built for systems with an infinite population. For finite samples, the equation underestimates the outflow from i by a factor of _{M m}M

i where

M = (P_imi), that is the total sample population. The corrected version of the model for

finite systems is:

Tij = Ti 1 mi M mi nj (mi+ sij)(mi+ nj+ sij) (3.5)

3.4 Case Study

Like in chapter 2, the 276 municipalities were divided with a 20⇥20 grid made of 500m2

squares. From all available trajectories in the database, only trips that originate and end within a municipality’s 100km2 area were used. It is important to note that this is only a subset of all traffic available in the database. Long trips between di↵erent urban regions are not part of this analysis. The resulting networks technically each have 400 nodes, corresponding to the grid fields, though many fields have a degree of zero, that is, they exhibit no in- or outflow at all. The average and median amount of nodes with a positive degree are 134 and 127, respectively, which means that the largest part of the areas are devoid of any arrivals or departures.

(31)

Figure 3.1: Degrees of the flow networks in non-increasing order, on a logarithmic scale: 4 example cities, the mean and the standard deviation (blue area). (Plot ends after the 150th highest degree, since the later degrees are all near-zero.)

Node Degrees

The 400 nodes of the 20⇥20 grid are not enough to obtain a useful plot of the degree distribution. Therefore, we prefer to plot the 400 degrees , ordered by non-ascending size (Figure 3.1). Since all 276 networks of the case study have the same amount of nodes, they can be compared side by side. The average values and the standard deviation of the 276 samples can be seen in the blue line with a light blue surrounding. It is plotted on a logarithmic scale for better visibility. The degree on the y-axis is presented in relative terms, by dividing the absolute degree by the sum of all degrees. Note that the degrees in this graph are only shown until the 150th of all 400 nodes, since the values drop very fast to near-zero, which is not suitable for the logarithmic plot.

(32)

3.4. CASE STUDY 27

City Degr100 Degr95 Degr90 Modul Partit. GravR2 Pisa 0.025 0.012 0.008 0.325 6 0.258 Firenze 0.013 0.009 0.007 0.42 6 0.42 Montepulciano 0.120 0.015 0.005 0.352 8 0.352 Grosseto 0.038 0.020 0.007 0.288 7 0.288 Mean 0.099 0.012 0.005 0.435 8.975 0.256 Median 0.089 0.012 0.006 0.420 8.5 0.249 Stddev 0.062 0.003 0.002 0.094 3.251 0.073

Table 3.1: Four examples and population mean, median and standard deviation: Degrees of per-centiles 100, 95 and 90 (measured as relative amount of in- and outflows), Louvain Modularity Score, amount of partitions produced by the Louvain algorithm, R2-Score of Gravitation model.

Figure 3.1 also shows, as an example, the degrees of four municipalities: Pisa, Firenze, Montepulciano and Grosseto. The two larger cities, Pisa and Firenze, start out with much lower degrees, but have higher values in the less busy nodes on the right side of the plot, that is to say, Pisa and Firenze’s trajectories are more evenly distributed than the average municipality. The mean and median values, as well as the standard deviation, for degree percentiles 100, 95 and 90 are stated in Table 3.1, along with the values for the four example cities.

The higher the largest degree percentiles are, the more of the start and end points of trajectories take place in the same grid fields. This is a notion very similar to the en-tropy, measured in chapter 2, which negatively measured the concentration of all arrivals, including those from outside the municipality. This similarity is reflected in the highly negative correlation coefficient between the entropy values of chapter 2 and the degrees of the 100th percentile in this chapter, of -0.87.

Modularity

The Louvain algorithm produced modularity scores with a mean of 0.435 and a median of 0.42, as can be seen in Table 3.1, along with the values for the four example cities previously used. In Table 3.2, the five highest and the five lowest modularity scores can be observed. The amount of partitions generated by the algorithm has a mean of 8.975 and a median of 8.5. The distribution of the modularity scores of all municipalities can be seen in Figure 3.2.

(33)

Rank Municipality Louvain Score 1 Capraia isola 0.083 2 Portoferraio 0.255 3 Siena 0.271 4 Livorno 0.272 5 Poggibonsi 0.277 ... ... ... 272 Pescaglia 0.655 273 Palaia 0.659 274 Montecatini val di cecina 0.675 275 Stazzema 0.719 276 Casola in lunigiana 0.732

Table 3.2: Highest 5 and lowest 5 modularity scores achieved by the Louvain algorithm

Rank Municipality Gravity R2 1 Zeri 0.099 2 Bagnone 0.113 3 Villafranca in lunigiana 0.12 4 Licciana nardi 0.121 5 Porcari 0.125 ... ... ... 272 Grosseto 0.425 273 Portoferraio 0.436 274 Massa marittima 0.437 275 Piombino 0.439 276 Follonica 0.493

(34)

3.4. CASE STUDY 29

Figure 3.2: Left: Histogram of Modularity Score achieved by Louvain Algorithm. Right: Histogram of R2 Score achieved by Gravity Model.

Interaction Models

Both the gravitational and the radiation model for spatial interaction were applied to the data. The radiation model performed very badly on the data. With the 20_{⇥20 grid, out} of the 276 municipalities, only 15 (5%) obtained a positive R2 score, that is to say, for the other 95%, the radiational model performed worse than a trivial estimator that predicts the average value of interaction for all field pairs.

There is no significant di↵erence to the grid sizes 10_{⇥10 and 50⇥50, which yielded 18 and} 10 positive R2 scores. The radiational model seems not to be suited well for observations of the scale of a 10⇥10 km2 _{area. Masucci et al. [12] have applied the same model to three}

datasets. While it obtained positive R2 values for travels across the whole of the United Kingdom, including flights and long-distance trains, it did not achieve a positive R2 score for local commuting patterns and bus trajectories within the boundaries of London. Since the radiational model is not well-suited for the analysis on a municipal level, we will leave this model aside and only consider the gravitational model.

The gravitational model fits the data, divided into a 20_{⇥20 grid, with an average R2 score} of 0.25 across all 267 municipalities, of which every single one has a positive value. The distribution can be seen in the histogram in Figure 3.2. The model performs better with a 10_{⇥10 grid (average R2: 0.36) and substantially worse with a 50⇥50 grid (average R2:} 0.09). In Table 3.3, there is a list of the five best and the five worst fitting municipalities and their R2 values.

(35)

Figure 3.3: Correlation of the 95th degree percentile, Louvain Modularity Score, Gravity Model R2 score and amount of trips available in the data.

Correlations

Figure 3.3 shows the correlations between all measures discussed in this chapter and the number of trips used, as a heatmap. As opposed to the measures of chapter 2, the number of trips in the network has only a limited association to the measures of the grid network, with virtually no Pearson correlation to the gravity R2 score, only a slight positive one (0.11) to the 95th degree percentile and a small negative (-0.36) correlation to the modularity obtained by the Louvain algorithm.

The number of trips can be viewed as a proxy for highly populated areas, so that it seems that the interactions within larger agglomerations seems to be more mixed, while in less dense regions, spaces can be grouped into areas that have high flows within and few flows from and to other areas. The correlation coefficients between the other variables are negligible, except for the gravity R2 score, which has a negative coefficient of -0.3 towards the degrees of the 95th percentile node and suggests that the networks with a high concentration in few nodes fit worse to the gravitational model of interaction.

(36)

3.4. CASE STUDY 31

Geographical Distribution

As an example for the geographical distributions, maps for the Louvain modularity score and for the R2 value of the gravitation model are shown in Figure 3.4. Both of them do not show a clear pattern, as opposed to the spatial concentration maps in the last chapter, which mainly highlighted Arezzo and the region around Pisa and Firenze. The high values in both plots appear to be distributed rather evenly across the whole regions and there are no clusters of neighboring high-value municipalities.

Figure 3.4: Results of all 276 municipalities. Left: Modularity Score after Louvain algorithm, Right: R2 score for fitting the data to the gravitational model.

(37)

(38)

Chapter 4

Individual Mobility

This chapter focuses on the aspects of mobility on an individual level. Since the trajectories used in the case study are linked to specific cars by an anonymized ID, urban regions can be described by aggregated values of their inhabitants’ individual mobility behavior.

4.1 Individual Mobility Networks (IMN)

Without further processing a set of basic statistics can be calculated for each individual directly from the trajectory database:

• Average distance/duration per trip

• Average driving distance/duration per day • Average amount of trips per day

In order to gain a deeper insight into the individual travel habits, following the methods proposed by Rinzivillo et al. [21], the individuals’ mobility data can be transformed into Individual Mobility Networks (IMN), which is a representation of a person’s travel behav-ior in the form of a weighted directed network, where the set of nodes V represents places that are visited once or repeatedly by the individual and the edges E represent trajectories

(39)

from one of those places to another. The edge’s weights (defined as a function w) equal the amount of times a trajectory from one node to another has occurred.

Gu= (V, E) ! : E ! N (4.1)

In order to create the network, one must define what constitutes a place. Since we work with GPS traces, the arrivals and departures in the data will never take place at the exact same location. At first, a threshold must be defined, how far apart the locations of arrivals or departures can be, in order to be regarded as the same place, and thus as the same node in the network. As a second step, a clustering of all arrival and departure GPS locations is performed, which groups the real locations together into places with a maximum diameter of the predefined threshold. For the case study, the threshold has been set to 250m, which is a scale that allows a certain tolerance for errors in the GPS traces and takes into account that parking spots could be in a few minutes of walking distance away from a point of activity.

For the clustering of locations into nodes of the IMNs, hierarchical clustering is used, specifically agglomerative clustering. This technique starts out with the set of all points, each one constituting a cluster of size one. In each iteration, the two nearest clusters are merged to form a new cluster. The process goes on until all points are united in one giant cluster, or until it is interrupted.

Nearness, used for choosing which points to merge, can be defined in various ways. The choice for the calculation of IMNs is the complete linkage method, which defines the distance between two clusters as the largest distance available between a two points, one of each cluster. With this choice, we can interrupt the algorithm right before the first cluster reaches a geographical diameter that exceeds the preset threshold, such that we have grouped together as many points as possible, without exceeding the threshold [10].

4.2 Network properties

(40)

4.2. NETWORK PROPERTIES 35

• Size of the network

Maybe the most simple characteristic of any network is the amount of nodes and edges. This represents the amount of distinct places visited by an individual and the number of distinct trips.

• Temporal-uncorrelated entropy

To measure how equally the di↵erent places of the IMN are visited, the measure of entropy, which has already been introduced in section 2.1, can be applied to the in-degrees of the IMN’s nodes. (In-degree is defined as the summed weights of incoming arcs). This measures the predictability of an individual’s destination, not taking into consideration the temporal order of the their trips.

• Radius of gyration

This measure approximates the average distance of an individual from its center of mass [7]. This concept appears similar to the average distance travelled per day, but ultimately measures a fundamentally di↵erent quality. Specifically, it measures how far a driver usually moves away from their center of activity. It does not necessarily grow with the distance travelled. Taxi drivers for example spends the whole day on the road, but remain in a close vicinity and therefore have large distance travelled and yet a small radius of gyration. [18]

• Regularity of trajectories

An interesting question about individual travel behavior is how much of the traffic takes place regularly. We define regularity as the percentage of trips that are driven more often than a certain threshold per time. For this purpose, we calculate the amount of edges that are driven more or equal than i times during a certain amount of time (for example twice per week), where i is an arbitrary parameter. This measure is calculated simply by counting the edges whose weight exceeds the threshold and dividing it by the total sum of edge weights.

• Modularity The Louvain Algorithm[4], which was already introduced in chapter 3, can be applied to IMNs to approximate their modularity under a perfect partitioning. An example of an individual with a high Louvain modularity score would be a person who has a typical set of workday locations (home, children’s school, work, gym, back home) and a set of weekend activities that usually follow each other in an alternating order without an intermediate trip home (friends, family, shopping, restaurants etc). This example would produce a high Louvain modularity score, since the only edge

(41)

between the two modules would be the individual’s home location while the nodes within the two sets are all inter-connected.

4.3 Case Study

All trajectories of the database were processed, out of which only those were used that have at least two trajectories. This resulted in 218,505 IMNs, which were then grouped to the municipalities by the location of the node with the highest degree, that is, their most frequent location. About half of them (107,286) were inside a municipality and could therefore be successfully grouped. The most frequent location was chosen as the grouping factor, since it has a high probability of being the individual’s residency or at least their place of work. It is important to consider that, due to this method, the results of this section do not refer to the actual traffic taking place in the analyzed communities, but instead reflect the properties of their inhabitants and workers.

Results

In Table 4.1, the average values for several key statistics can be seen, along with the four example cities of Pisa, Firenze, Montepulciano and Grosseto. The amount of nets equals the number of individuals in the database that have their most frequent location within the 10⇥ 10km grid of the relevant location. It is on average 302 and has a very high standard deviation, which reflects the heterogeneity within the 276 areas. The examples in the same table illustrate this with Firenze having more than 28 times the amount of Montepulciano.

The amounts of average kilometers totally driven lie in a relatively narrow range. Over 90% of the values are between 600 and 1200km, which can also be seen in scatterplot 4.3. It is important to note that the single values are already averages within municipalities. The standard deviations in Table 4.1 therefore only reflect the variance between cities and not the fluctuations within the di↵erent IMNs. Most of the average radius of gyration values lie between 10km and 30km (see scatterplot 4.3). The average entropy lies between

(42)

4.3. CASE STUDY 37

Municipality #nets Ø driven km Ø #nodes Ø Rad.Gyr Ø Entr. Ø Reg. % Pisa 2669 847.5 34.9 31.5 0.445 11.5 Firenze 9732 660.1 36.0 21.8 0.482 9.3 Montepulciano 343 849.7 40.3 25.4 0.444 8.3 Grosseto 3089 822.4 41.3 16.6 0.457 11.5 Mean 791.7 889.3 40.5 19.3 0.435 12.5 Median 302.0 898.4 42.2 17.5 0.435 12.8 Stddev 1224.8 211.0 7.9 7.5 0.016 3.4

Table 4.1: Four example cities, mean, median and standard deviation of the following measures: Amount of IMNs (#nets), Average total distance driven in km (Ø Km driven), Average amount of network nodes (Ø #nodes) Radius of Gyration (Ø Rad. Gyr.), Uncorrelated normalized Entropy (Ø Entropy), Ratio of distance driven regularly to the total distance driven in percent. (Ø Regular ratio %).

0.4 and 0.5 for all but 3 municipalities, with an average of 0.435. The average value of the mean fraction of regular meters travelled is 12.8%.

Correlations

In Figure 4.1 the correlations between the above-mentioned measures can be seen. In-terestingly, the number of nets, which is positively associated with the density of drivers, does not have a strong correlation to any of the other measures. Its highest correlation co-efficient (0.26) is towards the average entropy within the IMNs. This mirrors the positive correlation between the amount of trips and the spatial entropy observed in chapter 2.

The strongest linear correlation in the other variables lies between the average amount of kilometers driven and the amount of nodes in the network with a coefficient of 0.75, in accordance to the intuition that people who spend more time driving also have larger set of locations.

Less obvious is the negative coefficient of -0.54 between the average number of nodes and the radius of gyration. In Figure 4.2, we see this relationship in a scatterplot. The negative relationship is clearly visible and the far outliers are mostly islands and communities with a small sample sizes like Casole d’elsa with 97 networks.

(43)

Figure 4.1: Correlation heatmap of the following aggregate measures of Individual Mobility Net-works: Amount of IMNs (#nets), Average total distance driven in km (Km driven), Average amount of network nodes (#nodes) Radius of Gyration (Rad. Gyr.), Uncorrelated normalized Entropy (Entropy), Ratio of distance driven regularly to the total distance driven (Regular ratio)

The amount of kilometers driven by an average individual is also negatively correlated (coefficient: -0.12) to the radius of gyration, which seems counter-intuitive. Scatterplot 4.3 shows that this relationship is less clear. Apart from a few small outliers, the scatterplot consists of nearby data points with no obvious shape. The average ratio between regular meters driven and total meters driven is associated positively with the average amount of nodes (coefficient: 0.48) and negatively with the average radius of gyration (coefficient: -0.61). All in all it seems that a high amount of kilometers driven comes along with more distinct destination, but is surprisingly connected to a smaller area of movement and more repetitive travel itineraries.

(44)

4.3. CASE STUDY 39

Figure 4.2: Scatterplot of all 276 municipalities: Average number of nodes in the IMNs against average radius of gyration. The darker the points, the higher is the number of IMNs available in the data (# nets).

Figure 4.3: Scatterplot of all 276 municipalities: Average radius of gyration against the average amount of km driven. The darker the points, the higher is the number of IMNs (# nets).

(45)

Geographical Distribution

In Figure 4.4, the geographical distribution of two exemplary measures, the average radius of gyration and the average amount of nodes can be seen on a map. The radius of gyra-tion graphically shows no clustering. The four municipalities with the highest values are Civitella in val di Chiana and Marciano della Chiana, which are both in the surroundings of Arezzo, as well as Castelfranco Piandisc`o and Barberino di Mugello in the north of Florence.

In the right image, we can see that the average number of IMN nodes is clustered together in nearby municipalities, namely in the northern area around Livorno, Florence and Lucca, and further south around Siena, near Massa Marittima and north of Arezzo.

Figure 4.4: Values of all 276 municipalities. Left: Average radius of gyration of the IMNs, Right: Average number of nodes in the IMNs.

(46)

Chapter 5

Roads and Traffic

5.1 Static Road Network

This section focuses on the road network, which is modeled as a directed graph G = (E, V ) where

• V is a s et of nodes representing intersections and dead-ends. • E is a set of directed edges, which model the road segments.

• l : E ! IR is a function that maps each edge to its geographical length.

Basic Statistics

For a first overview, a few simple statistics of the road network will be calculated:

1. Amount of edges and nodes/node density: In this thesis, only map data of equal sizes will be compared, so the amount of nodes and edges can be used in absolute terms when comparing di↵erent cities. Otherwise, the amount of nodes can be divided by the total area of the map, resulting in the node density, which equals

(47)

the density of intersections and dead-ends. By ignoring nodes that have only one ingoing and/or one outgoing arc, we obtain the density of intersections.

2. Amount of intersections/intersection density: The nodes of the network con-sist of intersections and dead-ends. If we subtract from the amount of nodes all dead-ends, e.g. nodes with no more than one in- or out-degree, the amount of inter-sections can be calculated. This value can also be used to calculate the intersection density, that is the amount of intersections per square kilometer.

3. Average node degree/average intersection degree: The amount of edges and the edge density show the (relative) amount of street segments in the map segment. It can be used to calculate the average node degree: 2_⇤_{amount of nodes}amount of edges. Considering only intersections, the average intersection degree can be obtained. This measure indicates how many arms an average node has. For these measures we use the non-directed version of the network, which unites oppositely-non-directed edges into a single edge, in order to avoid double-counting of bidirectional streets.

One can get a sense of the redundancy of roads, by comparing the average amount of edges per intersection with the minimum average node degree of 3, which is necessary to connect all nodes of the graph, in the simplified case that there are no dead ends and no self loops. Figure 5.1 compares the average node degree of two exemplary Manhattan-style nets and a connected net with the minimal amount of edges. In real road networks, self-loops (e.g. circular streets) and dead-ends decrease this measure, so values below 3 are possible.

4. Total Length of edges/mean edge length: The total length of edges gives us the size of a road network in terms of the length of roads. Since the road network is directed, all bidirectional streets are doubled. The mean edge length indicates after how many meters a street hits an intersection or a dead-end on average.

Road Network Centrality

Nodes in any network can be evaluated regarding their centrality. Centrality is a concept with many concurring definitions, which all have in common that they aspire to measure a node’s importance. For this thesis, we chose to evaluate the road network’s closeness centrality, which evaluates how long the shortest path to any given other node is, mea-sured in the length of the corresponding streets. The average of those path lengths is a

(48)

5.1. STATIC ROAD NETWORK 43

Figure 5.1: Node degrees of a Manhattan-like grid (left), a connected net with minimum edges (middle) and a Manhattan-grid with two additional diagonal roads. (Values for a non-directed network)

node’s average farness from other nodes. The reciprocal of this value is a node’s closeness centrality:

C(x) = P 1

yd(y, x)

(5.1)

where x and y are nodes and the function d returns the length of the shortest path between its arguments. The distance function used for the road network in this thesis, measures considers the length as the summed up road lengths of the edges of the shortest path.

The distribution and average value of betweenness centrality can be compared between di↵erent urban regions in the case study. This approach has been used by Porta et al. [20], who used betweenness centrality along with three other measures of centrality to characterize di↵erent urban regions. In the case study, only the average centrality is used, that is the mean of all node’s closeness centralities.

(49)

5.2 Traffic in the Road Network

Map Matching

To investigate how traffic is distributed in a network of roads, one must match the se-quences of GPS locations to nodes and edges in the road network. This problem, called map matching, is not trivial since GPS locations are often imprecise and road networks can be very dense in central areas. There is a variety of algorithms that handle this problem, many of which take into consideration that the map position of one part of the trajectory influences the probabilities of all other points’ positions, such as hidden Markov models [15]. These methods are especially useful for sparse and very noisy GPS records.

In the case study of this thesis, the GPS data is reliable enough, so that a simpler algorithm was implemented. It independently maps every point of a trajectory to a node in the road network. The nodes are then connected and form a path that describes the individual’s trajectory. For this purpose, as an additional input, the shape and position of every network edge is used.

The map matching of a trajectory to a network path is performed as follows:

• For every point of the trajectory

– Find their nearest edge. (Distance is measured as the minimum distance to the lines that make up the edge.)

– Get the two nodes that are connected by the edge. Calculate which of the two nodes is nearer to the spot on the edge that has minimum distance to the point. Match the point to that node.

• Connect all matched points in chronological order with the shortest network path. The length of the path is thereby defined as the sum of the edges’ geographical length. The resulting network path represents the trajectory.

For the first step of the algorithm, a spatial index is used, which speeds up the process of finding the nearest edge. Spatial indexes are data structure that allow fast spatial querying. The spatial index used in the case study works with an r-tree to speed up the

(50)

5.2. TRAFFIC IN THE ROAD NETWORK 45

Table 5.1: Example for the calculation of the cumulative percentage of traffic that fits into a given percentage of the most dense meters of road. The example consists of six road segments (edges) with length 1 or 2, seen on top of the figure. For the calculation, they are ordered by non-ascending traffic flow.

retrieval of the desired results. An r-tree is one of many possible variants of a spatial index. With this method, cascading geographical areas are saved in a tree-structure, which saves the objects’ positions on di↵erent granularity levels.

The fact that the method only produces paths from node to node, instead of specific positions on edges, has the disadvantage that the points of arrival and departure are always imprecise to some degree, unless the individuals depart and arrive exactly at a node location (intersection or dead end). This error is negligible since most edge lengths in the database are between 80 and 160 meters. The advantage of the technique is that all trajectories can be expressed in network paths and the amount of traffic flowing through the network can be easily be expressed as edge weights.

Traffic distribution

We can now create a function that tells us how many percent of total traffic flow through a given percentage of the most dense roads. For this purpose, all edges are sorted by their traffic flow in non-ascending order. Cumulative traffic, measured in cars⇥ meter (or any other arbitrary unit of distance), is calculated for the end of every edge by multiplying the edge length with the amount of traffic flow and adding the result to the previous amount of cumulative traffic. The intermediary values within edges can be calculated by linear interpolation. For any given percentage of roads, the percentage of traffic in those roads is calculated by dividing the cumulative traffic until that point by the total amount of traffic. The calculations are illustrated in a simple example in Table 5.1.

Mobility Data-driven Analysis and Characterization of Urban Regions

D GD

-8I 8 8 G 8A C 8 8

C

8C

IG C GG C D B8 G

-

2

2

-D9 A

8 8

C C8A G G 8C

8 8

L8 DC D

98C 2

DCG

2 - 02

/

0

D /8CC

G DE 1 8A L

CCD 8 8 B D

Abstract

Contents

Chapter 1

Introduction

1.1

Context

1.2

Objectives and Content of the Thesis

1.3

Case Study

Chapter 2

Spatial Concentration

2.1

Entropy

2.2

Moran’s I

2.3

Nearest Neighbor Distance

2.4

Case Study

Chapter 3

Flows in a Grid Network

3.1

Node Degrees

3.2

Louvain Modularity

3.3

Interaction Models

3.4

Case Study

Chapter 4

Individual Mobility

4.1

Individual Mobility Networks (IMN)

4.2

Network properties

4.3

Case Study

Chapter 5

Roads and Traffic

5.1

Static Road Network

5.2

Traffic in the Road Network