A framework for evaluating and configuring location-privacy solutions.

(1)

UNIVERSITY OF PISA

DEPARTMENT OF INFORMATION ENGINEERING Master Degree in Computer Engineering

A Framework For Evaluating And Configuring

Location-Privacy Solutions

Candidate: Relators:

Tommaso Vongher Prof. Gianluca Dini

Ing. Pericle Perazzo Academic Year 2016/2017

(2)

(3)

Summary

Abstract 5 1. Introduction 6

2. Background technologies & methods 9

2.1 LBS - Location Based Services 10

2.1.1 Definition and usage areas 10

2.1.2 Types of services 11

2.1.3 Architecture and positioning 12

2.1.4 Privacy issues 14

2.1.5 Privacy protection schemes 16

2.2 K-Anonymity approach 18

2.2.1 K-anonymous location information 18

2.2.2 Cloaking algorithm 19

2.2.3 Threats 21

2.3 Mix zones approach 22

2.3.1 Pseudonyms 22 2.3.2 Mix zones 23 2.3.3 Threats 24 2.4 HUMsim 25 2.4.1 Waypoint generation 27 2.4.2 Person generation 27

2.4.3 Semantic trajectory generation 28

2.4.4 Raw trajectory generation 30

3. Novel framework 31

3.1 Mix zones placement 32

3.2 The idea 34

3.2.1 Metrics 35

3.2.2 K-anonymity selection 36

3.2.3 Mix zones selection 37

4. Development 38

4.1 HUMsim refactoring 39

4.1.1 New behavioral model 39

(4)

4.1.3 New person generation process 44 4.1.4 New raw trajectory generation process 45

4.1.5 Lines-Of-Code reduction 48

4.2 K-Anonymity approach 50

4.3 Mix zones approach 54

5. Simulations & Results 59

5.1 Simulation setup 60

5.1.1 Map and waypoints 60

5.1.2 Dataset creation 62

5.1.3 Manual selections of mix zones 67

5.2 Results 68

6. Conclusions & future works 82

6.1 Conclusions 83

6.2 Future work 85

References 87

(5)

Abstract

The increased capabilities of technology in both fields of mobile devices and position retrieval during last years allowed a widespread use of Location-Based services. If in a first point of view LBS simplified user’s life on the other side they enlarged possible vulnerabilities that may be used to violate privacy of users.

Guaranteeing privacy is a key aspect and lots of studies were made to accomplish this requirement; in this work is presented a novel framework in which the k-anonymity method is used to raise performances of the mix zones approach, this goal is reached defining an automated way to select the zones instead of using a manual/empirical method.

In section n.1 there is a brief introduction on all the work done, in section n.2 background technologies and methods used are explained, in section n.3 we propose our novel framework (KAOS-placer) and in section n.4 we explain the development of the work. In the end in section n.5 results of simulation are shown and in section n.6 conclusions and future works considerations are present.

Keywords: location based services framework • location privacy framework • k-anonymity • mix zones • location obfuscation • privacy framework • mix zones optimal placement •mix zones optimal selection

(6)

(7)

Nowadays almost every person we know possesses a smartphone with different applications installed; in the majority of cases these applications are freely downloaded and used but in exchange they provide you with advertisements and in the worst scenario they can keep track of what you are doing.

A particular case is the one of location-based applications which, as the name says, use location data to give back to the user a specific information; examples of such services are spread in fields like navigation, commerce, communications, games and much more.

In such a situation, being connected using location based services means being tracked for the whole time we are using the service, so basically we are telling to the service provider, whose privacy policies we probably accepted without reading, everywhere we were during our day.

At this point we have no control over the amount of information we generated and sent to the provider; an hypothetic adversary (and this can be either a malicious user or the service provider itself) may use these data at wish trying to infer sensible informations about us.

In order to avoid this to happen anonymity approaches comes in hand to guarantee privacy to the user; their scope is to perform a sort of data alteration in order to increase the effort an adversary has to deal with to retrieve informations.

Our work is based on two particular methods to guarantee privacy, which are the k-anonymity and the mix zones approaches; the first being the one defined by Gruteser and Grunwald in [3] and based on quadtree-like space

(8)

division, the second being the one developed by Beresford and Stajano in [10] and providing an obfuscation method based on the change of pseudonym inside defined areas.

In particular our proposal aims to provide an automatic way to configure the mix zones approach, in fact in current literature there isn’t such a method to configure zones except for the manual/empirical a priori selection.

The performance indexes taken in consideration to define the goodness of the framework proposed are the average tracking percentage of a person, intended as the average amount of time a person is tracked, the average K-EPC (K-Efficient Pseudonym Changes) per person per day, intended as the average number of pseudonym changes for a person in a day inside a mix zone when other k-1 persons are in the same zone and finally the average cloaking size intended as the average area of the mix zones used. In order to test our framework we needed a dataset of mobility traces and we decided to use HUMsim (Human Mobility Simulator) which is a tool developed by Giurlanda et al. [5] able to generate mobility traces based on behavioral models, we used models representing the movements of a student between a set of waypoints during a day; the area used for the traces generation is a portion of the city of Pisa.

The whole project implementation was made in Python 2.7 language, HUMsim was initially ported from C++ to Python, then it underwent into some changes to cope with our necessities.

(9)

(10)

2.1 LBS - Location Based Services

2.1.1 Definition and usage areas

Location Based Services can be defined as services that integrate a mobile

device’s location or position with other information so as to provide added value to

a user; the information’s nature usually depends on the usage context which

may vary in fields like marketing, emergency, navigation, information services, sports, tracking, location-based social media and almost every activity that may include the need of GPS coordinates.

All these usage areas can be divided in three main categories: military and government industries, emergency services and the commercial sectors. The former is the one for which this kind of technology was developed in the first time by the U.S Department of Defense using satellite localization; after a while though, thanks to liberalization to use GPS localization worldwide and thanks to the birth of new technologies like GSM and wireless communications, the other two categories growed up.

The emergency services were a first LBS challenge and resulted to be very important, they allow to retrieve the position of the user, for example, directly from the 911 call exploiting the carrier itself; most of the times in fact the emergency call is made by mobile phone and the user is not so able to provide an exact location, instead using this solution the information is instantly provided to the emergency service without relying on the user. The third one, the commercial sector, was developed as soon as location’s precision related to technology reached a ‘good’ level and

(11)

telecommunication companies realized that their revenues could have been higher with new kinds of service to the user.

2.1.2 Types of services

A first main categorization of services can be done in these two types:

● _{Person-oriented LBSs are user-based services. This means that the}

focus is to find the position of a person or use the person’s position to guarantee a service; in other words the person can control the service.

● Device-oriented LBS are services external to the user, in a sense that

they may need or not to retrieve the person’s position; instead it can be located an object, a group of people,etc. Briefly the person/object located is not controlling the service (car tracking after a theft, for example).

Then, another distinction can be done, this one regarding the application design:

● _{Push services imply that an user receives notifications as a result of his}

or her whereabouts without having to actively request it. In this situation there may be a prior consent or not by the user to the service provider.

● _{Pull services, on the other side, imply that the user actively uses the}

application and requests informations from the service provider when he or she need it.

(12)

Because of the costs that seems to outweigh the benefits together with privacy issues in sending messages without prior consent and a doubtful proceeds, push services are not so popular nowadays; on the other side there is a high and growing number of pull services.[1]

A last distinction, but not less important, is the one about the input needed by the service:

● _{Snapshot LBS are services that need a single positional information}

from the user as input in exchange of the desired information.

● _{Continuous LBS instead are the ones that need multiple positions or a}

continuous stream, i.e. they track the user, like the navigation services.

2.1.3 Architecture and positioning

LBSs are services mostly intended for mobile networks, in fact the principal attraction (especially in the case of pull services) for an user is to be able to retrieve informations in anytime and anywhere and this feature is fully exploited in a mobile network; this mobile environment is provided by mobile networks operators in the form of cellular network (but we can have also other types of wireless networks).

(13)

The paradigm followed is the one of a Client-Server application in which the user act as the client which sends request using the LBS application to the service provider acting as the server.

Normally the position is discovered by the client using a LoCation Service (LCS) with a precise positioning method and then delivered to the server following the WGS84 standard format. [2]

The most used positioning method are:

● _{dead reckoning}

● _{proximity sensing: signal signature tracking}

● _{trilateration: signal strength analysis and Time Of Arrival - TOA} ● _{multilateration: Time Difference Of Arrival - TDOA}

● _{triangulation: Angle Of Arrival - AOA} ● etc.

As you can see there are lots of methods to obtain the position and going in depth of each one will be out of the scope of this article, so we can generalize in the following macro categories by technology used:

● GPS based LBS is the simple and standard solution, which allows to compute the position based on the distance between other known positions using the principle of geometric trilateration. Obviously the mobile device has to be equipped with GPS capabilities.

(14)

● _{GSM localization is another one of the most used approach and} consists in computing the position referred to the cell in which the user lies. The position is computed looking at the signal strength related to multiple cells.

● _{Near LBS is a technique that exploits local-range technologies like} WLAN, Bluetooth, infrared RFID or NFC communications to allow discovering user’s surroundings.

● _{Other Local Positioning Systems are used mainly in environments} where other technologies cannot work well, like indoor areas. Technologies used are Bluetooth, RFID, WiFi, UltraWideBand (UWB).

In a typical system, location information is determined by a location information source such as a GPS receiver in the user’s device, it is then periodically transmitted to the location server through a cellular or wireless network.

Anytime the user sends a message or request to the LBS service this one access the location server, which acts as a proxy or middleware, to get the user actual position.

The final information is then returned to the user to fulfill the request. [3]

2.1.4 Privacy issues

There are two types of privacy tied to LBS, the first is the query-privacy and the other one is the location-privacy; the first is related to user’s private

(15)

informations contained inside the query to the LBS, the second one is related to the location information.

These two aspects are different in theory but close each other in reality because violating one of them can give more informations to an attacker to violate the other one; but it can happen also that one is violated while the other not.

Major threats to user privacy can be the following:

● _{profiling: customers bring with them their device most of the time,} while they are using the LBS the position information can be used to infer activities made and to profile them.

● _{identity theft: when location data are coupled with other personal} informations, collected via applications on the device for example, and if the amount of data is big then criminals may steal the identity of the user and use it.

● _{personal security: when location data are collected and a profile of the} user is inferred then criminals can use this information to know present or future positions of the user, this scenario become the worst one when this informations are used to invade user’s property for theft or to stalk.

To assess the goodness of a privacy protection strategy we need a metric, having two categories of privacy means having different metrics depending on the type of privacy in consideration, for the query-privacy a typical metric is the k-anonymity or the entropy level.

(16)

The k-anonymity is a concept based on providing indistinguishability of a particular user between k-1 other users present in the same zone together and it is a very popular metric.

The entropy level derives from the Shannon entropy in information theory and it is another metric widely used to assess the amount of information an attacker can retrieve by obtaining a single or a series of LBS queries.

About location-privacy metrics it possible to use again entropy, but it has to be defined in a different way; other metrics rely on the accuracy of the position estimated by the adversary with respect to the actual true location. Once a metric has been defined is possible to select a protection scheme.

2.1.5 Privacy protection schemes

There are different protection schemes, like the policy based schemes or the Private Information Retrieval schemes; the one we will focus on and give a brief overview is the one based on location perturbation and obfuscation. Location perturbation in LBS has been the most investigated topic in the last years, driven by the fact that an higher possibility to provide privacy is the most convincing incentive to the use and prosperity of location based services.

There are several techniques to perform location obfuscation, the most used regard pseudonyms, spatial cloaking, random noise adding and dummies. The pseudonyms one can be found in the mix zones approach presented by Beresford and Stajano in [10], this method is based on the use of mix zones which are selected areas into which user’s movement are not tracked; every

(17)

user has a pseudonym assigned and when happens that a user enter and then exit from a mix zone, if there were other persons in the same zone, a path confusion method is applied by switching pseudonyms between users in the zone.

Spatial cloaking is the approach used in various algorithms, the most important is the original one developed by Gruteser and Grunwald in [3]; here a perturbation algorithm is applied to the location information which consists in providing not the point but the area in which there are simultaneously k-1 other person more than the actual user making him indistinguishable from the other K-1 person.

The approach on adding noise regards modifying with random noise the true location information at a middleware level before sending the request to the LBS, once the middleware receive the response decrypts it subtracting the noise added and providing the user with a useful information.

The approach with dummies is based upon insertion of other fake requests while asking also for the true one, the set of true and fake queries create the obfuscated region.

The aforementioned strategies are based on a centralized architecture in mobile networks, where there is a middleware acting as anonymizer/modifier/translator of queries sent back and forth from the client to the LBS server.

This structure has its drawbacks, like the single-point of failure and the trustworthiness of the server, that limit the usefulness of the solution; in last years scientists tried to adapt the goods of this approach to other

(18)

environment (moving from server-based to device-based architectures) that avoid these two aspects, resulting in new frameworks based on the same concept of obfuscation but developed in different ways; however, unfortunately, also this newer solutions have drawbacks mostly due to battery lifetime issues and to the computational power offered by mobile-devices in comparison to the required one for anonymizing algorithms used.

In the next two paragraphs we will give a brief overview of the location perturbation techniques that we took in consideration in our work, which are part of the server-based approaches mentioned.

2.2 K-Anonymity approach

2.2.1 K-anonymous location information

Gruteser and Grunwald defined a subject as k-anonymous with respect to location information if and only if the location information is indistinguishable from the location informations of at least k-1 other subjects.

Following this definition, instead of presenting to the LBS the exact GPS coordinates of the position, a box is presented that corresponds at the area in which k-1 other subjects are present within the actual user; it can be used also a temporal uncertainty range to anonymize ulteriorly.

(19)

[x1, 2], y1, 2], t1, 2])

(

x

[

y

[

t

Where

[

x1, 2] , y1, 2]

x

[

y

represent minimum and maximum coordinates respectively for a 2-dimensional area,

[

t1, 2]

t

instead is the time interval; a line like this one tells that in the area defined by

[

x1, 2] , y1, 2]

x

[

y

there were k users in the time interval

[

t1, 2]

t

, one of which is the actual users that requested the service.

The k value represents a degree of privacy, in fact the higher it is the higher is the number of other persons in the same zone (and probably bigger is the zone itself, it depends on the population density), so lower is the probability to infer the real identity of the user between the set of persons in the zone (1/k).

2.2.2 Cloaking algorithm

The idea behind the method is to make always possible to fulfill the k-anonymity requirement beside any population density only by reducing the accuracy of the zone found.

In other words, if the population density is not so high, it will be possible to provide a k-anonymization by increasing the size of the zone until it contains k persons; this will provide the anonymization but will reduce the utility of the information.

The algorithm is based on quadtree division methods, starting from the application area it will search for the obfuscation area by iterative divisions

(20)

until the k-anonymity requirement stops being fulfilled, at that point the last good area is returned as obfuscation zone.

A pseudo-code of the division algorithm used is the following:

1. Initialize the quadrants q and qprev as the total area covered by the anonymizer

2. Initialize a traffic vector with the current positions of all known vehicles

3. Initialize p as the position of requestor vehicle

4. If the number of vehicles in traffic vector < kmin , then return the previous

quadrant qprev

5. Divide q into quadrants of equal size

6. Set qprev to q

7. Set q to the quadrant that includes p

8. Remove all vehicles outside q from the traffic vector

9. Repeat from Step 2

Fig. 1: Pseudocode for quadtree division.

This algorithm above does not take in consideration temporal cloaking but only spatial, the method used for temporal cloaking works on delaying the request until k-1 other users have visited the same area chosen for the requestor.

The algorithm need several input values, has to find the area of the selected resolution in which the requestor is located and then compute the time interval by looking at the times when other k-1 different users visited the zone.

(21)

2.2.3 Threats

An hypothetical adversary can obtain informations in different ways: he can intercept wired or wireless communications with sorts of man-in-the-middle attacks, can violate service provider’s systems and can have a priori knowledge on locations and/or users involved.

Of this set, the possibles countermeasures are applicable only in the first (but is out of the scope) and second situations, while on the third we have no control.

About the situation in which an adversary violates the system, the main concern is to avoid that the informations inside the system are used to re-identify the originator of the message, this is possible if the informations are anonymized and have no reference over private user’s data, which is what k-anonymity algorithms do; anyway there will always be a positional information, albeit obfuscated, in the request stored and an adversary can have, like already said, a priori knowledge that he can correlate to the location information present in the system to gain more data about the user. Gruteser and Grunwald defined three situations in which user’s privacy is at risk when the adversary knows something a priori; suppose that a subject reveals his location L in a message M to a location-based service and an adversary A has access to this information. Then, sender anonymity and location privacy is threatened by location information in the following ways: ● _{Restricted Space Identification. If A knows that space L exclusively}

(22)

● _{Observation Identification. If A has observed the current location L of} subject S and finds a message M from L then A learns that S has sent M.

● _{Location Tracking. If A has identified subject S at location L} _i and can link series of location updates L 1,L2,....Li,...Ln to the subject, then A

learns that S visited all locations in the series.

2.3 Mix zones approach

2.3.1 Pseudonyms

The environment present in a mix zone model can be the one of a client-server with the use of an intermediate middleware, both positioning system and middleware are trusted entities while the LBS obviously is not; the role of the middleware is to mediate request/response protocols between user and LBS and also help the user to hide his identity.

To maintain unknown the identity of individuals to the LBS service the approach used is the pseudonyms one, the middleware map every user with a pseudonym and use the latter to transmit requests to LBS server instead of using real identifiers, on receiving of responses it translates back from the pseudonym to the real user address and dispatches the information.

However, only assigning new pseudonyms is not enough to keep secure user’s identity, Beresford and Stajano proved that is possible to de-anonymize users correlating a priori informations with data on “long-term pseudonyms”, which are pseudonyms used for a long interval

(23)

of time; the idea developed by the two was to allow users to change pseudonym multiple times in a day, using the concept of mix zones.

2.3.2 Mix zones

In this approach there are two kind of zones: application-zones and mix zones.

The firsts are zones in which the user transmits location updates to LBS, which means that the user is tracked during navigation in these zones.

On the other side, mix zones are a priori selected areas inside which there is no communication between user and LBS, moreover every time a user enters a mix zone he changes his pseudonym with the one of another user present in the mix zone at the same time, if the anonymity set requirement is valid; in this sense a mix zone directly follows the case of a k-anonymity area, in which a level of privacy is guaranteed by the amount of persons present in the zone; if this requirement is verified a “mix” of pseudonyms happens. This path-confusion method provides unlinkability between the old pseudonym to the new one for each person present in the same time in the mix zone, in this way the adversary in theory is not able to easily understand which of the outgoing persons is the one that has entered at certain time and point.

The mix zone is said to be K-anonymized with a set of user if and only if: 1. the set has K or more members.

2. there exist a time in which all the K members are inside the zone. 3. every person spends a random time inside the zone.

(24)

4. the probability of transitions from any entry point to any exit point for every user that enters the zone follows a uniform distribution. 5. the location of users cannot be tracked inside the zone.

If the above definitions are verified the unlinkability provided can be measured with the Shannon entropy equation:

(i )

( p

)

H

′

= − ∑

jεA

p

_{i →j}_′

× log

₂ _{i →j}_′

where:

●

A

_{is the anonymity set of K persons.}

● i

ε A

_{is the person and}

i

′

_{is the exiting pseudonym associated to .}

i

● p

_{i →j}_′

is the probability to map

i

′

j

to

for every

ε A

j

, which is

equiprobable if the definitions above are verified,

p

_{i →j}_′

=

1 .

|A|

Unfortunately, this entropy measure is based on some assumptions that not always are verified, which are the third and fourth one, that open up possibilities for attacks.

2.3.3 Threats

The privacy in the mix zones approach is guaranteed as soon as the attacker cannot easily discern which is the person of interest between all the ones that traveled across the mix zone, as we already said this happens in two cases: when the persons spend random time inside the mix zone and when

(25)

all the entry points and exit points in the mix zone have the same probability to be chosen.

If something changes in these relationships the unlinkability becomes less stronger and the attacker can remove from computations some low probable mappings, having more informations to discern the identity.

If this happens two types of attack are possible: ● _{timing attack}

● _{transition attack}

Both of these attacks rely on the fact that the attacker increases his confidence on some mappings.

In the timing attack if, for example, persons spend constant time in the zone, the attacker is able to guess the identity just looking at the time of exit from the zone; this is due to the fact that the zone acts as a FIFO system in which the first person entering is the first exiting.

In the transition attack, for example, if the conditions about following a uniform distribution for the transitions are relaxed the adversary can look at all possible transitions and remove the ones that are less probable gaining more insights on where the persons can be when they will exit from the zone.

2.4 HUMsim

HUMsim (Human Urban Mobility simulator) is a generator of synthetic human traces aimed at the validation of location privacy solutions.

(26)

HUMsim generates semantic trajectories which are sequences of semantic waypoints, i.e., locations labelled with semantic tags. Semantic trajectories are generated according to a behavioral model of a person, which describes his daily behavior in terms of visited semantic waypoints (which one and for how long); shortly, semantic waypoints can reveal information about the person’s habits that can put at risk his/her privacy.

After that HUMsim translates semantic trajectories into raw trajectories, which take into account real maps and movements onto it. It follows that the resulting trajectories not only represent “realistic” movements of a person but they also convey privacy-relevant information.

As such, differently from existing mobility models which generate trajectories without a semantic value, the semantic trajectories produced by HUMsim allow us to validate location-privacy solutions. [5]

This picture represent HUMsim design.

Fig.2 HUMsim, first design.

(27)

We can see outside HUMsim the use of behavioral model, Google tools and the selection of the urban area by the user; inside instead we can see the main components of HUMsim: person generator, waypoint generator, semantic trajectory generator and finally the raw trajectory generator.

2.4.1 Waypoint generation

The waypoint generator produces a set of usual and opportunistic waypoints, usual ones are waypoints chosen once in the whole simulation of a person (like home or university or work for example), opportunistic ones are waypoint chosen time by time depending on the person’s actual position.

These waypoints were retrieved using Google Maps or in a random way following a Gaussian distribution centered on a urban center specified by the user, then they were stored in a database accessible by HUMsim.

2.4.2 Person generation

The person creation consists in defining the basic characteristics of every person that will take part to the simulation.

Every person in HUMsim is characterized by two aspects: the usual waypoints and the behavioral model.

As already said, usual waypoints differ from opportunistic waypoints in the fact that are waypoints fixed for the person during the whole simulation, examples of these waypoints are “home”,”university” or “work”; opportunistic waypoints instead can change from time to time depending on the shorter distance of the waypoint from the actual position of the person.

(28)

The behavioral model technically defines the person we are simulating, it defines all the types of waypoint reached during the day and the possible transitions from a waypoint to another; in this part the behavioral model is used to select the usual waypoints, later will be used to generate the traces. The person generation consists in selecting two usual waypoints for each person in the simulation, one for his home waypoint and one for the other waypoint at which the person spend most of his time during the day, like a work waypoint or an university waypoint (in our case is an university waypoint because we simulate students’ behaviour); the first selected is the “home” one, then following a power law distribution on the distance from this one the “university” waypoint is selected.

2.4.3 Semantic trajectory generation

Once the basic features of a person are defined the phase of semantic trajectory generation can start, this step aims to generate the set of waypoints “touched” during the whole day by a person, also called daily semantic trajectory.

The daily semantic trajectory is a sequence of semantic waypoints coupled with the arrival time at the point itself and expresses the day of a person; it is created using the behavioral model. As previously stated, the behavioral model defines states and transitions between states of a person; it’s a .xml definition of the person’s mobility in probabilistic terms.

We can see the behavioral model as a composition between two schemes: the pause scheme, which is a table composed by the waiting time in each

(29)

state and the transition scheme which is a discrete-time Markov chain whose states represent distinct visits of semantic waypoints.

Fig. 3: discrete-time Markov chain in HUMsim.

Fig. 4: pause table where Tc and Tv are constant and variable times for the state.

Both pause table and transitions scheme can totally be derived from the behavioral model, the pause scheme is retrieved using the values “constant time” and “variable time” which are present in the definition of every state; the constant time defines the amount of time which the person will spend surely in the state, the variable time defines the maximum amount of additional time spent in the state.

The transitions scheme can be derived from the transitions’ definitions where there are informations on all the possible transitions from a

(30)

particular state to another together with the probability of the transition to happen.

At this point, to generate the semantic trace, it is needed to run the Markov chain for each person for every day of simulation; the result will be the series of waypoints reached by the user during the simulation with additional informations on GPS coordinates and leaving time.

2.4.4 Raw trajectory generation

Raw trajectory generation is done using the semantic trace already created, the map file in .osm (OpenStreetMap) format and an additional routing software; in first design Google Maps API were used to route, then was used SUMO [7] (Simulation of Urban MObility).

From the semantic trace, for every person and for every day of simulation, all the waypoints reached are taken into consideration in set of two consecutive. From this couple of lines in the semantic trace we use the coordinates of the starting and ending state, together with the datetime values to route the person via SUMO.

The result of the routing is written in an output file which will be the final output of the simulator, with all the movements for every person in every day of simulation.

(31)

3. Novel framework

(32)

3.1 Mix zones placement

It is quite obvious that a mix zones approach is based most of all on the positioning of the mix zones over the application area and on the quantity of mix zones used; if one or more zones are used but are placed where no one travel through them then no one will change his pseudonym and an hypothetical adversary can retrieve all the identities he seeks for, because there will be no unlinkability provided between pseudonyms (every person will maintain his pseudonym during all the navigation).

Since the very first application of the mix zones approach, there was no significant focus on how to choose the location of mix zones; the prevalent method was to decide a priori where to place them inside the application area, in these cases happened that zones were selected empirically from all the simulation area or with some kind of a priori reasoning.

A clear example is in the first appearance of the mix zone model, Beresford and Stajano used a set of mix zones decided a priori inside the AT&T Labs Cambridge where the mobility data came from the Active Bat systems installed on the three floors of the building; this system used transceivers called bat, little devices that provided informations on their position at very small intervals of time, these bat were given to all researchers in the building and tracked them for two weeks during working time.

In every floor of this building the simulation was made over six zones, for a total of eighteen zones; results showed that only three zones (with the second and third zones made by the same rooms in every floor) fulfilled the

(33)

anonymity set requirements, moreover the pseudonymity level reached was not enough to provide location privacy to users in that context (but this was due to population distribution and environment).

At this point it was clear that, to provide location privacy in this context, some other studies had to be made on things like population distribution, mobility behaviours and zones’ placement.

Later on, other researches were made on how to place mix zones, a must mention is “On the optimal placement of mix zones” done by Freudiger, Shokri and Hubaux [11]; in this work the authors present a different metric to compute effectiveness of mix zones in providing anonymity called “Flow-Based metric”, this metric relies on statistics of mix zones such as mobility flows and mobility profiles which are essentially characteristics related to person’s movements across mix zones.

All the possible mix zones were selected a priori and every one was tested using the Jensen-Shannon divergence technique and the adversary’s error probability to compute the level of effectiveness, once a classification was ready an optimization problem was used to select subsets of zones to deploy.

The placement framework resulted in a set of 6 optimal zones, the authors then compared this solution, in terms of average tracking percentage and average number of travels through mix zones per person, with other 3 situations: a “bad” choice of mix zones, a random choice and the usage of all possible zones; results showed that the optimal selection followed the same average tracking percentage of the total usage (53% versus 48%) and

(34)

better than the “bad” one and the random one (98% and 78% respectively), while the average number of travels attested to 1.55 times per person in the optimal case against 1.56 times in the random one whereas it was 3.56 time in the total usage case and 0.48 in the “bad” selection.

Another comparison was done over the tracking success of the adversary, it resulted that the optimal selection performed better than all other selection and his behaviour was not modified by changing mobility from fluid to congested, changing radius of the zone or modifying adversary precision. In a few words this last research showed pros and cons of the mix zones approach but also remarked the importance of studying how to place them to provide a “good” level of location privacy.

3.2 The idea

Our main idea is to provide a selection of mix zones based on the K-anonymity approach, use them in a following mix zones approach and then compute the level of location privacy provided in terms of average tracking percentage and K-Efficient Pseudonym Change (K-EPC) in order to see if and where are the improvements.

The major scope behind this idea is to find a first selection of zones correlated with people distribution and behaviors during simulation time, then the further selection aims to maximize the privacy level while maintaining good values of tracking percentages in order to have always a useful service.

(35)

3.2.1 Metrics

The first metric used is the average tracking percentage, this is computed over all the persons participating in the simulation considering their whole mobility traces during 7 days; every position information of a person is considered as tracked if it is inside the application-zone, whereas is considered obfuscated if inside one of the mix zones used.

Mix zones in our approach can change both in dimension, although having always rectangular shape, and in number so the tracking percentages can change from one configuration to another.

It is obvious that to provide higher privacy a tracking percentage lower is better, but in this case the service will be not so good in giving informations to users, so a trade-off is the key; related to this issue is the mix zone’s dimension which can vary and directly affects the tracking percentage, also in this case is important to keep low the resolution of mix zones but a good trade-off is needed to guarantee privacy too.

The second metric used is the K-Efficient Pseudonym Change (K-EPC), this one also is computed over all persons during 7 days of simulation; in particular this metric resembles the one used in [11] to keep track of travels through mix zone with the difference that here the location informations are considered good to be counted in the K-EPC only if they are inside the mix zone when other k-1 person are present inside, if this requirement is not verified the location information is always considered obfuscated

(36)

because is inside a no communication zone but will not take part to the computing of EPC.

This metric allows to have a clear insight on how useful a mix zone is in providing the pseudonym changes to persons travelling through it; higher the EPC is, higher is the location privacy provided, so our intent is to keep it as higher as possible while maintaining good values of tracking percentages and small cloaking areas.

3.2.2 K-anonymity selection

The idea is to perform the k-anonymity approach and retrieve a set of possible zones, every zone computed by the k-anonymity derives from a LBS request taken as a snapshot from the mobility trace of one person during a day of simulation; we extracted from every trace and for every person a series of LBS requests and we considered all these requests happening with interarrival times distributed exponentially.

Adding to this the fact that for each snapshot found so far a possible zone is returned, we have a representation of one person’s movements across the simulation in term of zones visited, each one satisfying the K-anonymity requirement.

Using this approach, based on quadtree division, it is possible to retrieve the same zone multiple times; so our intent is to count the different times the same zone is returned and make a list of occurrence to sort later in order to see the most used zones.

(37)

3.2.3 Mix zones selection

Once a first selection of zones is available we apply a reduction on the set, first we sort the list by descending order of occurrences (so we’ll have most occurred zones first) and we reduce the set cutting away the last 70%; in this way we avoid to take into consideration sporadic zones with lower usages.

After this reduction we order the list by ascending order of cloaking size, having as first the smallest areas among the most used, we pick then a set of 3,4,5 and 6 zones (in our case, but it can be modified) to use with the next mix zones approach.

The reason in selecting the smallest sized ones is based on the fact that inside mix zones the user is not tracked at all, so having big zones results in having high probability of not tracking the user for long time because they occupy more space; this reduces the average tracking time that we have as metric and that we use to check the level of usefulness of the service.

(38)

4. Development

(39)

4.1 HUMsim refactoring

HUMsim underwent the porting from C++ to Python 2.7 to follow others k-anonymity and mix zones tools, in the following sections we explain all the changes made.

4.1.1 New behavioral model

The first thing that we need to show is the behavioral model, on which all the computations related to mobility simulation are based.

From the original design there are few changes and some additions, while the main concept remained the same, a composition between the pause scheme and the transition scheme.

We represented the pause times in every state using two parameters, the “minimum leaving time” (minLT) and the “maximum leaving time” (maxLT) which are respectively the minimum and maximum time at which a person can leave the waypoint; these parameters represent the same concept as the constant time and variable time used in the first design, the constant time was a portion of time that the person was going to spend in the waypoint no matter what, while the variable time was an additional interval of time taken randomly from the interval value defined in the parameter. Now with minLT and maxLT we have an availability interval associated to a state, from this interval is guaranteed that at least half of this time is spent in the state plus another quantity which is taken randomly from the second half and added to the total time spent.

(40)

Another parameter that can be found in the behavioral model which was not there in the previous design is the “avg-snapshot-time”; this quantity defines the average time from a LBS request to another one and represent the λ parameter in an exponential distribution. This quantity is present in both states and transitions and is expressed in seconds, it will be used after the raw trajectories generation to pick up from the entire trace a selection of requests exponentially distributed.

Some other parameters instead were discarded, like the “area” one which tells if the state is an area state or is a point state (we used all point states); another one that was discarded is the usual parameter which tells if the state is an usual waypoint or not, in the end the probability parameter associated to a behavioral model has been discarded too because now the information is directly given providing the number of person to generate for each model.

Talking about the Markov chain instead, we can find informations about it inside the “trans” elements, where we have the starting state name with the arriving state name and the probability of the transitions in addition with the avg-snapshot time which we talked about earlier.

The final organization of a behavioral model is showed in the next figure. <?xml version="1.0" encoding="UTF-8"?>

<bm>

<name>High-sociality student</name> <states>

<state name="H0" pos="start" minLT="7:00:00" maxLT="8:30:00" waypoint-type="home"

avg-snapshot-time="1800"/>

<state name="H1" pos="none" minLT="8:30:00" maxLT="12:30:00" waypoint-type="home"

(41)

<state name="U0" pos="none" minLT="8:30:00" maxLT="12:30:00" waypoint-type="university"

<state name="UM" pos="none" minLT="13:30:00" maxLT="18:30:00" waypoint-type="university"

<state name="C0" pos="none" minLT="12:30:00" maxLT="13:30:00" waypoint-type="cookhouse"

<state name="C1" pos="none" minLT="19:30:00" maxLT="21:00:00" waypoint-type="cookhouse"

<state name="F0" pos="none" minLT="11:30:00" maxLT="12:30:00" waypoint-type="fun"

<state name="M0" pos="none" minLT="11:30:00" maxLT="12:30:00" waypoint-type="market"

<state name="M1" pos="none" minLT="18:30:00" maxLT="19:30:00" waypoint-type="market"

<state name="SR0" pos="none" minLT="8:30:00" maxLT="12:30:00" waypoint-type="study"

<state name="SRM" pos="none" minLT="13:30:00" maxLT="18:30:00" waypoint-type="study"

avg-snapshot-time="1800"/> </states>

A framework for evaluating and configuring location-privacy solutions.