
University of Pisa

Department of Computer Science

Ph.D. in Computer Science

Privacy Risk Assessment in Big Data Analytics and

User-Centric Data Ecosystems

Ph.D. Thesis

Ph.D. Candidate Francesca Pratesi

Supervisors Dott. Fosca Giannotti

Prof. Dino Pedreschi


Abstract

Nowadays, our daily life is centered on data. Whether or not we are aware of it, our simple everyday interactions through digital devices produce a myriad of data, which are combined to create Big Data. We leave traces relating to our movements via our mobile phones and GPS devices, to our relationships within social networks, and to our habits and tastes through query logs and records of what we buy. These digital breadcrumbs are a treasure trove: a way to discover new patterns in human activities and to better understand many aspects of human behavior that were impossible to study or analyze just a few years ago. The resulting data can also enable a totally new class of services that can directly and appreciably improve our society, or provide ways to tackle and solve problems from new perspectives. The other side of the coin is the question of privacy: since the data describe our life at a very detailed level, privacy breaches can occur, along with inferences that reveal the most personal details. For example, a malicious party could uncover our home location from GPS tracks, our love life from call records or communication in social networks, and our health status from the products that we buy in a supermarket. For this reason, we are witnessing changes in ethical and legal norms, with a move towards a novel vision of data management, which focuses on giving appropriate priority to privacy and individuals.

The objective of this thesis is two-fold. Firstly, we propose a framework that aims to enable a privacy-aware data sharing ecosystem, based on Privacy-by-Design. This framework, called PRISQUIT (Privacy RISk versus QUalITy), can support a Data Provider in sharing collected personal data with an external entity, e.g., a Service Developer. PRISQUIT helps to decide the right level of aggregation of the data and the appropriate strategies for enforcing privacy, by quantifying the actual, empirical privacy risk of individuals and highlighting the users most at risk, and consequently the data related to them. It then analyzes the data quality obtained when only the data of users not at risk are released. The framework is modular, so it is possible to define, implement and enrich it with new kinds of data, new privacy risk and utility functions, potential new types of background knowledge, new services to be developed and new mitigation strategies.

Secondly, we investigate the privacy perspective within a user-centric model, where each individual has full control over the life cycle of their personal data. To this end, we take advantage of the outcome of PRISQUIT by studying the correlation between individual features, such as the entropy of visited locations, and the actual privacy risk. We then design a method that allows each user to obtain an estimate of their own privacy risk. This tool increases awareness about individual personal data and thus helps people choose whether or not to share their data with third parties. After that, we propose three privacy-preserving transformations based on the differential privacy paradigm, which offers very strong privacy guarantees regardless of any external knowledge held by a malicious agent. This can render the data private before they leave the individual who produces them.

We provide a wide range of experiments on three kinds of real-world data (mainly mobility data, but also mobile phone and retail data), to prove the flexibility and utility of the PRISQUIT framework and the usefulness of the two approaches related to the user-centric ecosystem.


Contents

Introduction

I Setting the Stage

1 Towards an Ecosystem of Data Driven Services
  1.1 Inspiring case studies
  1.2 Ethical and Legal Implications
    1.2.1 General Data Protection Regulation (GDPR)
    1.2.2 Privacy by Design
    1.2.3 Data Protection Impact Assessment
    1.2.4 GDPR on Research Data

2 State of the Art
  2.1 Privacy-Preserving Data Publishing and Data Mining
    2.1.1 Differential Privacy Model
    2.1.2 Privacy by Design in Data Mining
  2.2 Privacy in Spatio-Temporal Data
    2.2.1 Privacy in Location Based Services
    2.2.2 Privacy in Moving Objects Databases
    2.2.3 Privacy in Call Detail Records
  2.3 Privacy Risk Assessment
  2.4 Towards a User-Centric Model for Personal Data
    2.4.1 Personal Data Store (PDS)

II Assessing Privacy Risk vs. Quality in Data Sharing Ecosystems

3 The PRISQUIT Framework
  3.1 Privacy-aware ecosystem
  3.2 Privacy Risk Assessment
    3.2.1 Data Dimensions
    3.2.2 Background Knowledge Dimensions
    3.2.3 Privacy Risk Measures
    3.2.4 Quality Measures
    3.2.5 Data Catalog
  3.4 Towards Distributed Computation
  3.5 Releasing multiple Dataviews

4 PRISQUIT for Services based on Movement Data
  4.1 Preliminaries and Dataset Presentation
  4.2 Privacy Risk Assessment for Presence Data
  4.3 Privacy Risk Assessment for Trajectory Data
  4.4 Privacy Risk Assessment for Road Segment Data

5 PRISQUIT for Services based on Mobile Phone Data
  5.1 The Sociometer
  5.2 Attack Models
  5.3 Privacy Risk Assessment in the Rome 2015 Dataset
  5.4 Privacy Risk Assessment in the Pisa 2014 Dataset
  5.5 Privacy Risk Mitigation
    5.5.1 Towards Distributed Computation
    5.5.2 Towards a complete Privacy Risk Mitigation Method
    5.5.3 Experiments

6 PRISQUIT for Services based on Multidimensional Data
  6.1 Privacy Preserving Multidimensional Profiling
  6.2 Promotion Service based on Recurrent Events
  6.3 Service based on Off-line Economical Impact Analysis
  6.4 Promotion Service based on Daily-life Activities
  6.5 Experiments
    6.5.1 Datasets presentation
    6.5.2 Privacy Analysis

III Towards a User-centric Ecosystem

7 Estimation of Privacy Risk based on Individual Features
  7.1 A Data Mining approach for Privacy Risk Assessment
    7.1.1 Individual Mobility Features
    7.1.2 Attack Models
    7.1.3 Construction of training dataset
    7.1.4 Alternative usage of the data mining approach
  7.2 Experiments

8 Differential Privacy in Distributed Mobilytics
  8.1 Preliminaries and Problem Definition
    8.1.1 System Architecture
    8.1.2 Privacy Model
    8.1.3 Approach Overview
  8.2 Architecture
    8.2.1 Data Collector Node Computation
    8.2.2 Coordinator Computation
  8.3 Experiments
    8.3.1 Space Tessellation
    8.3.2 Utility Measures
    8.3.3 Analytical evaluation
  8.4 The LocalSensitivity Approach

Conclusion

Acknowledgement

Bibliography


Introduction

“I like large parties. They’re so intimate. At small parties there isn’t any privacy.”

Jordan Baker - The Great Gatsby

Techniques to analyze and discover valuable knowledge from databases have become increasingly important over the last few years. Interaction with digital devices provides a huge and ever-growing source of data. These data are more and more complex, and they have been given the title of Big Data, to summarize their main intrinsic characteristics: the data are very large and they have a very fine level of detail, making it harder to perform analyses. A commonly accepted definition of Big Data is given in [74], where it is stated that they are data that exceed the processing capacity of conventional database systems; in other words, they are too big and they move too fast. Every day, we create 2.5 quintillion bytes of data [180]; the growth is so rapid that 90% of the data in the world today has been created in the last three years alone. These data come from everywhere: sensors used to gather weather information, posts and relationship information recorded in social networks, digital pictures and videos, purchase transaction records, query logs stored by search engines and cell phone GPS signals, to name just a few. Together these devices and events result in a huge quantity of data. For example, from the invention of the camera to 2011, it has been estimated that around 3.5 trillion photos were taken, and Facebook has hosted over 140 billion of them [25]. For every minute of 2016, 20 million WhatsApp messages, 2.4 million Google queries and 347 thousand tweets were generated, 150 million emails were sent and 2.7 million YouTube videos were watched [147]. Mobile phones produce a large quantity of these data: indeed, the number of mobile phone users in the world is expected to pass the five billion mark by 2019 [198]. In 2014, nearly 60% of the population worldwide already owned a mobile phone, and mobile phone penetration is forecast to reach 67% by 2019 [147].

Big Data offer many new opportunities to learn more about our society, because they provide personal details of the activities of a large part of the population. Therefore, sophisticated techniques for analysis have been developed, in order to gather, store and analyze increasingly complex data. These techniques are able to extract patterns, models, profiles and general rules that describe the behavior of a community. Indeed, through the analysis of personal data with sophisticated tools, we can create new opportunities to interpret complex phenomena, such as mobility in urban areas, and to foresee the diffusion of an economic crisis or the spread of epidemics and viruses.

The worrying aspect of this very fine level of detail is that the data can also contain personal information. Consequently, the opportunities to acquire knowledge bring an increased risk of privacy violation for the people represented in the data. The threat includes identification of personal or even sensitive aspects of people’s lives, such as home address, habits and religious or political beliefs. Managing this kind of data is a very complex task. It is not sufficient to rely solely on de-identification (i.e., removing the direct identifiers contained in the data) in order to preserve the privacy of the people involved. In fact, many examples of re-identification from supposedly anonymous data have been reported in the scientific literature and in the media, from health records [225] to GPS trajectories [114, 65] and even movie ratings of on-demand services [176].
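To make the failure of bare de-identification concrete, here is a minimal sketch of a linkage attack of the kind cited above: a table with names removed is joined with a public register on shared quasi-identifiers. All records, field names and values are hypothetical toy data, not taken from the studies referenced in this chapter.

```python
# Toy linkage attack: re-identify a "de-identified" record by joining it
# with a public register on quasi-identifiers (ZIP code, birth date, sex).
deidentified = [
    {"zip": "56127", "birth": "1985-03-02", "sex": "F", "diagnosis": "flu"},
    {"zip": "56127", "birth": "1962-11-19", "sex": "M", "diagnosis": "diabetes"},
]
public_register = [
    {"name": "Alice Rossi", "zip": "56127", "birth": "1985-03-02", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth", "sex")

def link(record, register):
    """Return the register entries matching a record on every quasi-identifier."""
    return [p for p in register
            if all(p[q] == record[q] for q in QUASI_IDENTIFIERS)]

for record in deidentified:
    matches = link(record, public_register)
    if len(matches) == 1:  # a unique match means re-identification
        print(matches[0]["name"], "->", record["diagnosis"])  # Alice Rossi -> flu
```

Although no name appears in the first table, the combination of attributes is unique enough to restore it, which is exactly why the privacy models discussed in Chapter 2 go beyond removing identifiers.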

Several techniques have been proposed to develop technological frameworks for countering privacy violations without losing the benefits of Big Data analytics technology [96]. Unfortunately, no general method exists that is capable of both handling generic personal data and preserving generic analytical results. Nevertheless, Big Data and privacy are not necessarily opposites: in fact, many practical and impactful services can be designed in such a way that the quality of results can coexist with a high protection of personal data, if the Privacy-by-Design paradigm is applied [171]. The Privacy-by-Design (PbD) paradigm [38, 42, 40], introduced by Cavoukian in the 1990s, aims to protect privacy by inscribing it into the design specifications of information technologies, accountable business practices, and networked infrastructures, from the very start. It represents a profound innovation with respect to traditional methods; the idea is a significant shift from a reactive model to a proactive one, i.e., preventing privacy issues instead of remedying them.

Privacy-by-Design has raised interest especially in the last few years because an elaboration of this paradigm is explicitly referred to in the new European General Data Protection Regulation [60]. Indeed, the new regulation states that controllers shall implement appropriate technical measures for ensuring, by default, the protection of personal data. Another task introduced is the Data Protection Impact Assessment, which should be carried out prior to the processing activities. The aim is to assess the specific likelihood and severity of the risk, taking into account the context and the purpose of processing as well as the sources of the risk. Safeguards to mitigate that risk are then applied. Thus, impact assessment and privacy mitigation strategies should act in tandem, at the time each process arises, in order to ensure data is protected by default.

Until a few years ago, most of the research work carried out in the context of privacy-preserving data mining and data analytics focused on an organization-centric model for personal data management [96, 254, 91]. Unfortunately, this model has some drawbacks that limit the fascinating opportunity to analyze human data. First of all, personal data are often fragmented, since they are gathered from a wide variety of sources; this does not allow for a holistic view of individuals. Second, there is a lack of interaction by individuals with their own data. Users are not involved in the life cycle of their data, and they have a very limited opportunity to control their own data and to take advantage of them according to their needs and wishes. For these reasons, individuals often become more and more skeptical about the potential benefits and advantages of Big Data collection and analysis. Furthermore, as personal data are mainly under the control of organizations, the focus of authorities has been primarily to protect personal data, in order to reduce the risks of uncontrolled use. Promoting the full utilization of data, when paired with a higher control from their “owners”, has been of secondary importance.

To counter these problems, we have recently been witnessing a change of perspective towards a user-centric model for personal data management (a vision promoted by the World Economic Forum [88, 89, 90]). This model aims to give users an active role by introducing transparency and full control over the life cycle of their own personal data. To implement this model, individuals need to have a copy of their data, which they have the right to dispose of or distribute with the desired privacy level. The aim is to receive services or other benefits, or to increase their knowledge about themselves or about the society they live in. The advent of the user-centric model sheds a new light on privacy issues: most of the existing solutions consider an architecture where central sites make the data private before releasing them; in the new model, the privacy transformation has to be performed before the data leave the user. This encourages the voluntary participation of users, thus limiting the fear that often leads people not to access the benefits of extracting knowledge from their own data, at both the personal and collective level [107].

In this thesis, we investigate two aspects: the assessment and the enforcement of privacy in Big Data. Both these goals are analyzed in two different contexts: the standard organization-centric model and the user-centric data ecosystem.

Context One: Organization-centric model. Concerning the first goal, i.e., privacy assessment, we study an analytical framework able to “set the data free” in the organization-centric model for personal data management. The idea is to provide organizations (data owners or data providers) and data analysts or service developers with tools to make privacy risk assessments and to enforce the desired level of privacy protection. Regarding the second goal, we explore the enforcement of privacy in the organization-centric data management model, aiming to maintain a good quality of services.

In particular, we introduce a framework, called PRISQUIT, which represents the main result of this thesis. PRISQUIT enables quantitative measuring of re-identification risk as well as data quality. The proposed process helps both data providers and service developers to find a trade-off between these two conflicting targets, through a systematic exploration of the dimensions of the problem. The particular strengths of PRISQUIT are its flexibility and modularity, which make it possible to integrate specific functions, data formats, background knowledge and mitigation strategies into the framework. We validate our proposal in three different contexts (i.e., mobility, mobile phone and retail data), showing how different kinds of data can be analyzed by our framework to provide meaningful measurements of risks for users and also data quality. In addition, PRISQUIT is able to identify the most suitable techniques for enforcing the privacy of the individuals described in the data. We investigate a novel mitigation strategy that can be applied to an aggregation of mobile phone data, taking advantage of the privacy risk evaluation provided by PRISQUIT.
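A worked micro-example may help fix intuitions before the formal definitions of Chapter 3. A common way to quantify re-identification risk is the inverse of the number of individuals compatible with an adversary's background knowledge: if k users match what the adversary knows, the probability of picking the right one is 1/k. The sketch below illustrates this idea on toy data; the data, function names and the simplified risk formula are ours for illustration, not the framework's actual API.

```python
# Illustrative re-identification risk: the adversary knows a set of places
# (bk) visited by the target; the risk for a user is 1/k, where k counts
# the users whose location history contains bk.
user_locations = {
    "u1": {"home", "office", "gym"},
    "u2": {"home", "office"},
    "u3": {"home", "mall", "gym"},
}

def privacy_risk(user, bk, dataset):
    """Risk that `user` is re-identified by an adversary knowing `bk`."""
    if not bk <= dataset[user]:
        return 0.0  # this knowledge does not point to the user at all
    k = sum(1 for locations in dataset.values() if bk <= locations)
    return 1.0 / k

print(privacy_risk("u1", {"office"}, user_locations))         # 0.5: u1, u2 match
print(privacy_risk("u1", {"office", "gym"}, user_locations))  # 1.0: only u1 matches
```

The framework then confronts such risk values with quality measures on the released data, considering the background knowledge an adversary may plausibly hold.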

Context Two: User-centric model. We investigate how the problems of assessment and enforcement of privacy change in the user-centric ecosystem. We particularly focus on how it is possible to adapt or extend the Privacy-by-Design methodology, taking into consideration some important aspects that characterize this particular setting. The peculiarity of this ecosystem is that each individual has access to only their own data, plus potentially some aggregated information shared by other peers or by central entities. We describe two approaches applied to the mobility data context. First, we explore a data mining solution that analyzes the correspondence between specific individual mobility features, which can be computed locally by each person, and the risk of re-identification. This yields an estimate of privacy risk rather than the actual value, but it is sufficiently accurate to increase individuals’ awareness of their own privacy risk. The awareness gained helps each individual to decide whether or not to share their data with third parties. For this reason, our data mining approach could be integrated into available user-centric tools. Second, we investigate a different solution that protects each individual’s movement data independently of the other users, i.e., without computing the privacy risk, by applying the differential privacy paradigm. The differential privacy model provides strong guarantees without any assumption on background knowledge. In our framework, this model is used by each vehicle, which transforms its data before sending them, so as to ensure privacy. We provide three solutions, characterized by different trade-offs between privacy and data utility, and we empirically show that these transformations preserve some important properties of the original data, allowing analysts to use them for important mobility analyses.
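For reference, the guarantee invoked here can be stated precisely. A randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ differing in the data of a single individual, and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

Intuitively, the output distribution barely changes whether or not any single individual's data is included, so an adversary learns almost nothing specific to that individual, regardless of the background knowledge they hold; the model is recalled in detail in Section 2.1.1.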

The thesis is organized into three parts. In the first part, Setting the Stage, we introduce the preliminary notions necessary to become familiar with the fields of research covered by this thesis. In Chapter 1, we present examples of the many ways personal data can create economic and social value for governments, organizations and citizens, and we analyze the potential of a data sharing ecosystem for providing new kinds of services. Moreover, we provide a brief overview of the ethical and legal implications of the use of personal data, and we introduce two main concepts that are the basis of this thesis: Privacy-by-Design and Data Protection Impact Assessment. In Chapter 2, we recall the basic concepts related to privacy-preserving data publishing and data mining, with particular focus on Privacy-by-Design data mining and differential privacy [75], and we report the current state of the art with respect to privacy in spatio-temporal data, such as location-based services, moving objects and mobile phone data. After that, we illustrate the user-centric model with the current state of Personal Data Stores, the implementations currently available and what they offer to the customer. Then, we describe related work on privacy risk assessment and the user-centric ecosystem.

In the second part of this thesis, we get to the core of our work, describing PRISQUIT, a framework for assessing privacy risk and quality in a data sharing ecosystem, where a data provider is willing to give access to their data to an external entity, such as a service provider. The framework was first presented in [200] and is already used within European projects like the High Impact Initiative (http://www.eitictlabs.eu/innovation-entrepreneurship/future-cloud/) and SoBigData (http://www.sobigdata.eu) [145, 120]. In Chapter 3, we illustrate the theoretical principles, the role of the actors in the ecosystem, the definitions needed by the framework and the goal of the process. Then, in Chapters 4 and 5, we provide some instances of the framework regarding, respectively, GPS and mobile phone data. In both these chapters, we present the possible data formats needed to implement specific services, and the definition of some possible attacks related to those data formats. Then, we provide a wide experimentation using real datasets, analyzing in detail the actual privacy risk and quality of each data format. We show that there are some common characteristics between these two different kinds of data. Moreover, for mobile phone data (Chapter 5), we extend both the theoretical and the experimental part with an ad hoc mitigation strategy. The results presented in Chapter 4 appeared in [200] and are the basis of the privacy risk assessment of [26, 33], where we use routines, i.e., special trajectories that represent systematic individual movements, instead of standard trajectory data. The assessment of privacy risk on routines was used within the Petra project (http://www.petraproject.eu/) [240], while the one on trajectories was used within the High Impact Initiative project [145]. The results presented in Chapter 5, instead, are the evolution of those described in [171]; they are partially presented in [56] and used within the Asap (https://www.asap-fp7.eu/) and SoBigData projects [28, 109]. In Chapter 6, we envision an ecosystem of data sharing, where different data providers make available different kinds of data, in order to enable a new class of multidimensional services, i.e., services that cannot be carried out with only one type of data. We first describe three possible multidimensional services; then, for each service and for each combination of data types, we state the minimum data formats necessary to provide the service. In addition, we list the possible attacks and we assess the privacy risk and the quality of the services in each of the intermediate steps of the process. Here, the novelty lies both in imagining a multidimensional service and in measuring what happens when we combine knowledge from different data sources. Part of the work presented in this chapter is described in [196, 199].

In the third part of this thesis, we move the perspective from the organization-centric ecosystem analyzed in Part II, where data are primarily collected by a data provider and then shared with a third party, to a user-centric ecosystem, i.e., a system where privacy risk is counteracted before users release their data. In Chapters 7 and 8, we use two different approaches to analyze a specific scenario. In Chapter 7, we provide a solution aimed not directly at mitigating privacy risk, but at estimating it. A data mining approach is used to discover correspondences between individual features and actual privacy risk. Using this approach, each individual has the opportunity to test their own privacy risk on the basis of their behavior. If the estimated risk complies with the user’s expectations, they can choose to release the data. This chapter presents part of the work described in [194, 195]. In Chapter 8, instead, we describe three data transformation methods, based on the differential privacy model, that allow individuals to reach the desired level of privacy. The differential privacy paradigm ensures privacy regardless of the context, in terms of the external information owned by any potential attacker. Here, it is used to obtain privacy mitigation in a distributed environment, where private data are collected by a central entity in order to perform some spatial analyses. This chapter derives from the work presented in [173, 171].
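The three transformations themselves are defined in Chapter 8; as a generic illustration of the building block this kind of approach rests on, the sketch below shows the standard Laplace mechanism, which releases a count in an ε-differentially private way by adding noise calibrated to the query's sensitivity (Δf = 1 for a count, since one individual changes it by at most one). This is textbook differential privacy, not the thesis code.

```python
# Standard Laplace mechanism: epsilon-DP release of a count query.
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Return true_count + Laplace(0, sensitivity/epsilon) noise."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# E.g., a vehicle locally perturbing how often it crossed a zone this week
# before sending the value to the central collector:
print(dp_count(true_count=12, epsilon=0.5))
```

Smaller values of ε give stronger privacy but noisier counts, which is exactly the privacy/utility trade-off that the three proposed variants navigate differently.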

Finally, we conclude the thesis by summarizing the main findings, and presenting possible future work that can build on the results described in this dissertation.



Part I

Setting the Stage


Chapter 1

Towards an Ecosystem of Data Driven Services

“Atticus had said it was the polite thing to talk to people about what they were interested in, not about what you were interested in.”

Jean Louise “Scout” Finch - To Kill a Mockingbird

“The one thing that doesn’t abide by majority rule is a person’s conscience.” Atticus Finch - To Kill a Mockingbird

One of the most pressing and fascinating challenges of our time is to understand the complexity of the global interconnected society we inhabit. The fast growth of the Internet and the Web, the speed with which global communication and trade now take place, and the rapid diffusion of breaking news and information around the world, as well as of epidemics, trends, and financial and social crises: these are all signals that mankind has entered a new era, a new techno-social ecosystem whose inner mechanisms are significantly different from those of the past.

Ours is also a time of opportunity to observe and evaluate the inner workings of our society through the big data originating from the digital breadcrumbs of human activities, derived from the ICT systems that we use every day. Indeed, multiple dimensions of our social life now have big data “proxies”. Our shopping patterns and lifestyles are traced in transaction records and query logs. Our relationships are recorded as social media and telephone contacts. Our feelings and opinions are expressed via internet platforms. Our movements are tracked by our phones and navigation systems.


Nowadays, companies and governmental institutions use this ocean of Big Data to unleash powerful analytic capabilities. They connect data from different sources, find patterns and generate new insights, all of which add to the ever-deepening pool of data. This can help transform the lives of individuals, fuel innovation and growth, and help solve many of modern society’s challenges. As elaborated in World Economic Forum reports, personal data represent an emerging asset class, potentially every bit as valuable as other assets such as trade goods, gold or oil.

As stated in [89], the growing quantity and quality of personal data create enormous value for the global economy. Personal data play a vital role in countless facets of our everyday lives. Medical practitioners use health data to better diagnose illnesses, develop new cures for diseases and address public health issues. Individuals use data about themselves and others to find increasingly relevant information and services, coordinate actions and connect with people who share similar interests. On-line stores use accurate recommendation systems to provide targeted suggestions and advertisements. Financial institutions model patterns of credit card usage to detect potentially fraudulent purchases in real time. Government institutions use personal data to protect public safety, to improve law enforcement and to strengthen national security. And companies employ a wide range of personal data to innovate their business, improve efficiency and design new products that stimulate economic growth. Estimates show that the Internet economy was valued at US$ 2.3 trillion in 2010, or 4.1% of total GDP, within the G20 group of nations. The economic value of the Internet is larger than the economies of Brazil or Italy, and was expected to have nearly doubled by 2016, to US$ 4.2 trillion.

This huge, ever-growing flow of data is undeniably useful for a wide range of applications. But the interesting question is: what could happen if these data were made available beyond those who gather them? Permitting external subjects to use the data could improve aspects that potentially differ from the original goal of the storage entity (usually a company) and that concern other private companies or public institutions. As a consequence, additional classes of services can be enabled, based on different backgrounds or multiple sources of data, and these services can lead to great advantages for individuals, private companies or the entire collectivity.

In [89], the World Economic Forum listed the many ways personal data can create economic and social value for governments, organizations and citizens:

• Responding to global challenges. Real-time personal data and social media can help to better understand and respond to global crises, like disaster response, unemployment and food security. They represent an unprecedented opportunity to track the human impacts of crises as they unfold and to get real-time feedback on policy responses.

• Generating efficiencies. For centuries, increased access to information has created more efficient ways of doing business. These days, organizations in every industry are using vast amounts of digital data to streamline their operations and boost overall productivity.

• Making better predictions. Personal data are stimulating the creation of innovative new products tailored to and personalized for the specific needs of individuals.

• Democratizing access to information. Consumers benefit from “free” services like search engines, e-mail, news websites and social networks that previously either did not exist or had a significant monetary cost in other forms in the off-line world.

• Empowering individuals. Empowered consumers are taking greater control over the use of data created by and about them. Rather than individuals being passive agents, they are engaging with organizations in a collective dialog. In addition, individuals are using the information they share about themselves, their beliefs and their preferences to connect with each other like never before.

Unfortunately, all these opportunities come with new risks to the privacy of the individuals the data belong to. For this reason, we first provide some examples of services that are already enabled in a data sharing ecosystem, in order to illustrate the power, the potentiality and the novelty offered by such services. Then, in Section 1.2, we outline the ethical implications and the legal constraints that must be addressed when platforms that massively use personal data are created.

1.1 Inspiring case studies

In the following, we report some popular services and possibilities, which are extraordinary examples of the potentiality and the advantages of a data-sharing ecosystem.

Developing integrated data platforms [90, 85] Big Data makes it possible to combine data from conventional sources such as censuses, national household surveys and farm surveys with data generated in real time from sources such as satellite and drone images, social media, mobile phone records and digital financial transactions, broadening the range of data that can be incorporated into a database. Another possibility is integrating data from different departments and agencies, to permit comparisons of indicators across agencies and over time. An example of an integrated data platform is Aquastat [85], promoted by FAO, which combines satellite data with population density, long-term average precipitation, maps of cultivated areas, water withdrawal per inhabitant, etc., in order to provide a synthesis of the water resources on Earth.

Increasing civic participation [92] While traditional media have long been central to informing the public and focusing public attention on particular subjects, digital media are helping to amplify the response to humanitarian crises and to support those afflicted by them. During the Arab Spring of 2011-2012, digital media served as a vehicle to mobilize resources, organize protests and draw global attention to the events. Through digital media, users around the world collected $2 million in just two days for victims of the Nepal earthquake of 2015. Refugees fleeing the war in Syria have cited Google Maps and Facebook groups as sources of information that helped them not only to plan travel routes, but also to avoid human traffickers. Digital media have also enhanced information sharing across the world, giving people much greater access to facts, figures, statistics and the like, and allowing that information to circulate much faster.


Private companies giving access to their databases for research challenges [188, 133] The most famous example is the Orange (http://www.orange.fr/, in French) and Sonatel (http://www.sonatel.com/, in French) Data for Development (D4D) challenge, which in 2013 and 2014 gave access to a series of statistical databases and anonymized samples extracted from mobile network management signals regarding Côte d’Ivoire and Senegal. For the challenge, five priority subject matters were defined: health, agriculture, transport/urban planning, energy and national statistics. This emphasizes the social impact expected by the promoter of the initiative. Some of the participants proposed to analyze antenna traffic data of the GSM network to capture hidden local socio-demographic heterogeneity, such as illiteracy or poverty; to detect anomalies in human mobility patterns, providing early warning in case of events that could change or disrupt daily life; to analyze the intensities of calls and thus understand how the urban system works, promoting the development of sustainable land use; to study local energy needs, which is crucial for electricity infrastructure planning; and to model millet prices in a spatially explicit model, explaining more than 80% of the price differentials observed in the 40 markets and assessing the social welfare impact of further development of both road and mobile phone networks in the country.

Another example is the Telecom Italia (http://www.telecomitalia.com/tit/en.html) Big Data Challenge of 2014 and 2015. Telecom Italia provided access to heterogeneous (from both a spatial and a temporal point of view) datasets of communication (tweets, calls, mobility information). Some of the proposed ideas were: (i) analyzing communications and electric energy consumption, discovering a correlation that makes it possible to build predictors of daily average and peak energy consumption, with the aim of limiting energy production when it is not needed; (ii) creating “footprints” of the urban territory and building a classification of the usage of the various districts, in order to reduce costs for urban stakeholders and enable repeatable monitoring of their evolution over time, allowing for better planning of public services; (iii) analyzing the pollution of Milan (one of the most polluted cities in Europe), classifying the various substances (such as PM10 and PM2.5), the time and the quantity, so that citizens can check an app and discover the air quality in real time; (iv) classifying the interests of persons based on the text of geolocated tweets, in order to propose to each user areas containing events (or users with interests) close to their passions.

Enabling automated ride sharing Ride sharing (commonly known as carpooling) is the sharing of private car journeys so that more than one person travels in a car. By having more people using one vehicle, ride sharing reduces individual travel costs, such as fuel costs and tolls, and the stress of driving. Moreover, ride sharing is an environmentally friendly and sustainable way to travel, reducing carbon emissions, traffic congestion on the roads, and the need for parking spaces. Authorities often encourage carpooling, especially during periods of high pollution or high fuel prices, offering reserved lanes or discounts on tolls as rewards. There are many ride sharing services, such as BlaBlaCar (https://www.blablacar.co.uk/) and Zimride (https://www.zimride.com/), which require departure and arrival locations and times, and possibly also interests and hobbies to integrate the social aspect.

The sharing of private vehicles’ GPS traces, in tandem with data mining strategies [9, 234, 156, 117], can enable the automatic matching between drivers and passengers. This can be done at least for systematic mobility (i.e., the daily routes that we perform almost every day), which is usually the mobility with the greatest impact on private costs, traffic and the environment.
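As a toy illustration of what such automatic matching can look like, the sketch below pairs two systematic trips when their origins and destinations both lie within a small distance of each other. It is purely illustrative (coordinates, tolerance and the matching rule are our own simplifications); the data mining strategies cited above match full routines and schedules, not just endpoints.

```python
# Toy carpool matching on systematic trips: two commuters are candidate
# carpoolers if their origins and destinations are both within tol_km.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def match(trip_a, trip_b, tol_km=1.0):
    return (haversine_km(trip_a["origin"], trip_b["origin"]) <= tol_km and
            haversine_km(trip_a["dest"], trip_b["dest"]) <= tol_km)

# Two hypothetical daily commutes in Pisa:
a = {"origin": (43.7085, 10.4036), "dest": (43.7228, 10.3966)}
b = {"origin": (43.7100, 10.4050), "dest": (43.7220, 10.3950)}
print(match(a, b))  # True: a candidate carpool pair
```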

Enabling location-based advertising [20, 34] Location-based advertising (see https://econsultancy.com/blog/67418-what-is-location-based-advertising-why-is-it-the-next-big-thing/) refers to marketer-controlled information specially tailored for the place where users access an advertising medium. It represents an opportunity for advertisers to personalize their messages to people based on their interests and their current location, in real time. Using location data collected from their mobile devices, advertisers can send different messages to people depending on where they are. Information about promotions or deals often reaches consumers at inconvenient times or locations, and this prevents them from taking advantage of such offers. Certainly, the location is an important factor, but it is also useful to put information into the right context. If advertisements are properly targeted, they are almost certain to be more relevant and, therefore, less likely to be ignored. Thus, making advertising both relevant and accessible seems the key to a novel approach to “marketing in the moment”.

Exploring the opportunities and risks of using personal data in a real-world context through living labs [90, 45, 137] MTL (Mobile Territorial Lab) and LivLab (Livorno Lab) are experimental “living labs” whose aim is to understand opportunities and risks, and to balance protection and utilization of personal data. MTL aims at creating an open infrastructure and a real community to perform experiments to understand people’s approaches, attitudes and feelings towards the user-centric Personal Data Store (PDS) paradigm. It is developed in cooperation with the Telecom Italia SKIL Lab (http://jol.telecomitalia.com/jolskil/), Fondazione Bruno Kessler (https://www.fbk.eu/en/), the Human Dynamics group (http://hd.media.mit.edu/) at MIT Media Lab, the Institute ID3 (https://idcubed.org/) and Telefonica I+D (http://www.tid.es/). MTL collected communications data (call and SMS logs), location data (through GPS and the cell towers of the GSM network) and periodic surveys (mainly about personality and expenses) from 100-142 volunteers from February 2013 to December 2014 (2 years of observation). Each participant can take advantage of the data collected in their PDS by means of personal applications for life monitoring, behavior awareness and social behavior comparison, and can freely decide to contribute their personal data to research analyses and city monitoring. The study analyzed communication patterns, highlighting a strong daily seasonality in both calls and SMSs (even if the two tools are used at different moments to communicate with different persons), mobility habits with respect to the population external to MTL, and the correlation between communications and some personal traits (e.g., discovering that emotionally unstable people are in contact with more individuals, tend to send and receive more SMSs, and are more likely to have longer phone conversations). The weakness of this approach is that its incentives are very expensive: participants were provided with a smartphone and a monthly credit.


LivLab is a sort of continuation of MTL, developed in collaboration with Unicoop Tirreno (https://www.unicooptirreno.it/, in Italian), the KDDLab (http://kdd.isti.cnr.it/) at the Italian National Research Council, Telecom Italia and Fondazione Bruno Kessler. It started in April 2015 and, after 2 years of observation, it is currently composed of about 100 active users. The main difference with respect to MTL regards the kind of collected data, which are location data (through GPS) and purchasing data (collected directly by the retail chain). Participants can consult their historical data and they can compare some personal models and indicators against those of the collectivity (e.g., analyzing the (un)predictability of their basket with respect to typical basket compositions [116]). Individuals can take advantage of this knowledge to understand how they behave with respect to the mass, and have access to some “what-if” analyses, like quantifying the savings if they substituted some products with the Coop-branded ones, and so on. In LivLab the economic effort is sustained by the retail company, but it is definitely less expensive than in MTL, since it corresponds to a monthly discount of €10 on the next purchases in the chain stores.

In any case, it is not necessary for all these scenarios to be conducted for free. Data monetization, by both individuals and companies, is often possible. Data can be offered in return for specific services or even money. Obviously, external entities will pay only if the data meet specific quality standards that make it possible for them to achieve their goals. Thus, we need a better understanding of the evolution of such platforms and of the implications of this development.

1.2 Ethical and Legal Implications

In Section 1.1, we saw that the explosive growth in the quantity and quality of personal data has created a significant opportunity to generate new forms of economic and social value. But for data to flow well, there is a need for the same kinds of rules and frameworks that exist for other asset classes.

Citizens are increasingly concerned about what companies and institutions do with their data, and ask for clear positions and policies from both governments and data owners. Despite this increasing need, there is no unified view on privacy laws across countries. The European Union regulates privacy by Directive 95/46/EC (October 24, 1995) and Regulation (EC) No 45/2001 (December 18, 2000). The European regulations, as well as other regulations such as the U.S. rules on protected health information (from HIPAA), are based on the notion of “non-identifiability”. The regulation on privacy in the EU was recently revised by the comprehensive reform of the data protection rules proposed on January 25, 2012 by the European Commission, which will be applied on May 25, 2018 in the form of a Regulation, i.e., the General Data Protection Regulation (GDPR).

As legal frameworks evolve, ethical concerns and guidelines are changing too. As highlighted in [93], this is reflected by social networks continuously updating their privacy policies and settings, by newsrooms making frequent updates to publishing guidelines on how they use material sourced from social media platforms, and by the continuous shifts in what is or is not considered appropriate when individuals post on social media platforms.


Moreover, both active and passive data collection raise questions. Even though users are publishing messages and personal information on public networks, many users do not consider that anyone other than their close friends and family will see their posts. And, while many users are aware of the information they have logged into social networks, they are much less aware of the data being collected from them. The World Economic Forum warns both people and organizations. It points out that people need to be informed about the potential impact of their content being shared widely. On the other hand, organizations must be honest with the user about when and how the content will be used, and whether it will be syndicated to other publishers or organizations.

The World Economic Forum is not the only entity that invokes transparency. Indeed, transparency is one of the pillars of ethics, and it is related to several parts of the big data process, such as seeking the permission of users, explaining the terms of use, and data usage after collection. The Bertelsmann Foundation [86], the Organisation for Economic Co-operation and Development [87], the UK Cabinet Office [36] and the Council of Europe [184] state that notice and consent are fundamental tasks in big data management. They also offer other important considerations about ethics. In particular, in [86] we can find a good excursus on the history of individual control, on the cultural differences between Europeans and Americans, and a list of key concepts useful for addressing the challenges of privacy management. These guidelines are: individual empowerment (through education that teaches individuals basic technology and data portability), corporate accountability (through a voluntary, self-regulatory risk assessment) and collective accountability (through government-mandated entities that can assess the impact of any big data process). In [87], the OECD framework is presented, along with fundamental principles that should be respected in the data usage process: collection limitation (the data collected are the minimum necessary and they must be obtained by lawful and fair means), data quality (personal data should be relevant to the purposes for which they are to be used, and they must be complete and up-to-date), purpose specification (purposes should be specified before data collection), use limitation (data must be used and disclosed only for the specified purpose), security safeguards (data must be protected by reasonable security safeguards), openness (about developments, practices and policies), individual participation (individuals should have the right to control, rectify or have the data erased) and accountability (data controllers should be accountable for complying with measures regarding the other principles).

In [36], we can find a short summary, along with some practical examples of good and bad practices, of the six key principles they consider essential in data management: (i) to highlight user needs and public benefit from the start of the definition of the methods; (ii) to use data and tools with the minimum intrusion necessary; (iii) to create robust data science models, analyzing the representativeness of the data and the presence of potentially discriminating features; (iv) to be aware of public perception, understanding how people expect their data to be used; (v) to be clear and open about data, tools and algorithms, providing explanations in plain English; and (vi) to keep data secure, following the guidelines provided by the Information Commissioner’s Office (https://ico.org.uk/for-organisations/guide-to-data-protection/principle-5-retention/). Finally, the Council of Europe [184] drafts some guidelines too. The majority of these ethical principles are widely shared among different institutions, and many of them are included in the new EU Regulation. However, in relatively loose regulatory environments, ethical rules are particularly important. In [256], the authors list 10 rules for performing ethical research on big data. Some of them are inspired by the concepts described above (for example, inserting ethics directly into the workflow of research, or documenting clearly when decisions are made), while others are specifically oriented to research. For example, they list the importance of debating issues within a group of peers, and of sharing data, which is a fundamental task in some projects, like studies of rare genetic diseases. Moreover, Zook et al. [256] encourage researchers to consider the complete context and interests surrounding the data, stress the fact that most data represent or impact people, and warn that even anonymized datasets, if combined with other variables, might lead to unexpected re-identification.

1.2.1 General Data Protection Regulation (GDPR)

Besides ethical principles, it is clearly fundamental that the law enforce personal rights, adapting regulation to cover the current situation. So far, despite this increasing need, there has been no unified view on privacy laws across countries. However, the regulation on privacy in the EU was recently revised by the comprehensive reform of the data protection rules, proposed on January 25, 2012 by the European Commission, approved on April 14, 2016, and to be applied from May 25, 2018 in the form of a Regulation, i.e., the General Data Protection Regulation (GDPR, Reg. EU 679/2016) [60].

The GDPR is composed of 11 chapters, ranging from principles and the right to access data, through the transfer of personal data and competences, to implementing acts. In particular, in Article 5 the GDPR states some of the ethical principles considered fundamental, such as data minimization, transparent information, purpose limitation and accountability.

The GDPR also introduces some novelties with respect to Directive 95/46/EC, such as explicit references to data protection by design and by default, to the data protection impact assessment, and new obligations for data processors. A Data Controller is a “natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data” (Article 4(7)). A Data Processor is a “natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller” (Article 4(8)).

The Data Controller needs to:

• implement appropriate technical and organizational measures to ensure and to be able to demonstrate that processing is performed in accordance with the Regulation (Article 24);

• implement appropriate data protection policies (Article 24);

• implement appropriate technical and organizational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed (Article 25);

• carry out a data protection impact assessment (Article 35).

The Data Processor has, among its obligations, to:

• guarantee to implement appropriate technical and organizational measures in such a manner that processing ensures the protection of the rights of the data subject, where processing is to be carried out on behalf of a controller;


• inform the controller of any intended changes concerning the addition or replacement of other processors;

• process the personal data only on documented instructions from the controller (including the categories of processing carried out and any transfer to a third country);

• take all the data protection measures also required of the Data Controller.

The data protection measures should be applied to any information concerning an identified or identifiable natural person. Identified data (e.g., name and social security number) are directly linked to the individual, whereas identifiable data (e.g., nickname or address) are attributable to a specific person through some additional information. These data protection measures could consist, among other things, of minimizing the processing of personal data, pseudonymizing personal data as soon as possible, transparency with regard to the functions and processing of personal data, enabling the data subject to monitor the data processing, and enabling the controller to create and improve security features.
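As a small illustration of one of these measures, the sketch below pseudonymizes a direct identifier with a keyed hash (HMAC): records stay linkable across tables, while reversing the pseudonym requires the secret key. This is a generic technique offered for orientation, not a recipe mandated by the GDPR, and the key shown is of course hypothetical; in practice, key management and storage separation are the hard parts.

```python
# Pseudonymization via keyed hashing: the same identifier always maps to
# the same pseudonym, but recovering the identifier requires SECRET_KEY.
import hmac
import hashlib

SECRET_KEY = b"example-key-store-me-separately"  # hypothetical key

def pseudonymize(identifier: str) -> str:
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

print(pseudonymize("mario.rossi@example.com"))  # stable pseudonym
```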

1.2.2 Privacy by Design

Article 25 of the GDPR deals with data protection by design and by default. In particular, it states that the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organizational measures, such as pseudonymization. These measures are designed to implement data protection principles, such as data minimization, in an effective manner. Moreover, the measures shall integrate the necessary safeguards into the processing in order to meet the requirements of the Regulation and protect the rights of data subjects. It then continues by asserting that the controller shall implement appropriate technical and organizational measures for ensuring that, by default, only personal data that are necessary for each specific purpose of the processing are collected. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that, by default, personal data are not made accessible, without the individual’s intervention, to an indefinite number of natural persons.

The Privacy-by-Design paradigm [38, 42, 40], introduced by Cavoukian in the 1990s, gives principles useful for managing privacy, including privacy by default. In [39], Cavoukian defines the 7 Foundational Principles of Privacy-by-Design:

1. being proactive instead of reactive, in order to prevent problems instead of remedying them;

2. inserting privacy into the system, by default;

3. embedding privacy into the system, without diminishing its functionality;

4. demonstrating that it is possible, and far more desirable, to have both privacy and security;

5. ensuring strong security measures, from start to finish of the life-cycle of the data involved;

6. establishing accountability and trust, through visibility and transparency;


7. respecting user privacy, requiring architects and operators to keep the interests of the individual uppermost.

In particular, for the privacy by default principle, the following Fair Information Practices should be applied [39]:

• Purpose Specification: the purposes for which personal information is collected, used, retained and disclosed shall be communicated to the data subject at or before the time the information is collected. Specified purposes should be clear, limited and relevant to the circumstances.

• Collection Limitation: the collection of personal information must be fair, lawful and limited to that which is necessary for the specified purposes.

• Data Minimization: the collection of personally identifiable information should be kept to a strict minimum. The design of programs, information and communications technologies, and systems should begin with non-identifiable interactions and transactions, as the default. Wherever possible, identifiability, observability, and linkability of personal information should be minimized.

• Use, Retention, and Disclosure Limitation: the use, retention, and disclosure of personal information shall be limited to the relevant purposes identified to the individual, for which the individual has consented, except where otherwise required by law. Personal information shall be retained only as long as necessary to fulfill the stated purposes, and then securely destroyed.

Where the need for or use of personal information is not clear, there shall be a presumption of privacy and the precautionary principle shall apply: the default settings shall be the most privacy-protective.

However, from a legal point of view, Privacy-by-Design should not be considered a master key that relieves the data controller and data provider of their obligations with respect to data protection, but, on the contrary, a different form of accountability and responsibility towards data subjects [62].

1.2.3 Data Protection Impact Assessment

Data Protection Impact Assessment and Privacy Impact Assessment are two strictly related concepts. In the GDPR (and in particular in Article 35) one can find references only to the Data Protection Impact Assessment (DPIA), which is a process that the data controller or data processor, possibly with the advice of the data protection officer, must carry out for risky data processing. The DPIA, taking into account the nature, scope, context and purposes of the processing, enables the assessment of the impact of the envisaged processing operations on the protection of personal data. A single assessment may address a set of similar processing operations that present similar high risks. A DPIA is especially required in the case of: (a) a systematic and extensive evaluation of personal aspects relating to natural persons which is based on automated processing, including profiling, and on which decisions are based that produce legal effects concerning the natural person or similarly significantly affect the natural person; (b) processing on a large scale of special categories of data referred to in Article 9(1), or of personal data relating to criminal convictions and offences referred to in Article 10; or (c) a systematic monitoring of a publicly accessible area on a large scale.

In particular, the DPIA evaluates the origin, nature, particularity and severity of the risk. The outcome of the assessment should be taken into account when determining the appropriate measures to be applied in order to demonstrate that the processing of personal data complies with the GDPR. Controllers should examine the likely impact of the intended data processing on the rights and fundamental freedoms of data subjects in order to [184]:

• identify and evaluate the risks of each processing activity involving Big Data and its potential negative outcomes on individuals’ rights and fundamental freedoms, taking into account the social and ethical impacts;

• develop and provide appropriate measures, such as “by-design” and “by-default” solutions, to mitigate these risks;

• monitor the adoption and the effectiveness of the solutions provided.

The concept of impact assessment is not a novelty: indeed, in the literature there exist many studies related to this problem, even if the general goal is slightly different. Privacy Impact Assessment (PIA) is a slightly wider concept than DPIA, since it includes data protection but is also driven by social values. It aims to study all the ethical aspects of data management and to generate or increase data controllers’ awareness. The purpose of the PIA is to ensure that privacy risks are minimized while allowing the aims of the project to be met whenever possible [186]. Risks can be identified and addressed at an early stage by analyzing how the proposed uses of personal information and technology will work in practice. First of all, a PIA process should identify the need for a PIA; then, it should identify the privacy and related risks, evaluate the privacy solutions and integrate them into the project plan [186]. The PIA process should be a flexible one, and it can be integrated with an organization’s existing approach to managing projects. Moreover, the implementation of the core PIA principles should be proportionate to the nature of the organization carrying out the assessment and to the nature of the project [186].

Currently, there is considerable interest in PIA, and several frameworks have been implemented to manage this task. [102] studies the link between data protection and the notion of risk. In particular, Gellert tries to compare data protection as risk regulation with the so-called risk-based approach, hypothesizing a shift from the regulation of risk to the regulation through risk (the risk management of everything). An important claim that he reports is the fact that zero-risk (i.e., avoiding all risks stemming from an activity) is simply impossible. It may have been thought so at some point, but the uncertainties resulting from modern technological risks have shown that, precisely because the notion of risk relies upon an infinite number of factors, it is simply impossible to fully prevent a risk. Indeed, one may always discover a new factor, so that whatever one does, and no matter how many risk factors they untangle, there will always be another reason for the risk to occur. This claim is repeated and largely shared among experts. For example, Ann Cavoukian has stated several times that “Zero-Risk doesn’t exist” [41]. A managerial approach to risks is used by the Commission Nationale de l’Informatique et des Libertés16 (the French Data Protection Authority), and the code of conduct of the UK Information Commissioner’s Office is quite open about this: “It is important to remember that the aim of a PIA is not to completely eliminate the impact on privacy. The purpose of the PIA is to reduce the impact to an acceptable level while still allowing a useful project to be implemented.”17

Figure 1.1: Goal of the Privacy Risk Assessment

There are many practical studies that can help industries and researchers. The University of British Columbia [182] provides a tool for determining a project’s privacy and security risk classification. TRUSTe [236] offers a consulting service for analyzing personally identifiable information, looking at risk factors and assisting in the development of policies and training programs. The Information Commissioner’s Office provides a handy step-by-step guide through the process of deciding whether to share personal data [185], and a useful guideline for understanding PIA [186]. The Brussels Laboratory for Data Protection & Privacy Impact Assessments, or d.pia.lab [206], provides training and delivers policy advice related to impact assessments in the areas of innovation and technology development. PIA has also raised interest in the United States [183] and Canada [238].

However, it is interesting to note that, whereas some experts state that PIA and DPIA are different concepts, others (such as [113]) affirm that they are essentially the same thing under a different name.

Privacy Risk Assessment

Privacy Risk Assessment concerns only one aspect of DPIA, i.e., the one related to the privacy risk, overlooking, for example, all the steps related to cost analysis. However, to the best of our knowledge, PIA does not provide data-driven tools that can effectively measure the privacy risk; it aims mainly to create self-awareness.

The research field studying Privacy Risk Assessment has the goal of creating tools able to identify and reduce the privacy risks in information systems, and of helping people design more efficient and effective processes for handling personal data. In particular, as summarized in Figure 1.1, an (automated or semi-automated) process, having access to personal data about individuals, classifies those individuals based on their privacy risk.

16 CNIL, ‘Methodology for Privacy Risk Management: How to Implement the Data Protection Act’ (CNIL 2012).

17 ICO (n 106) 27, at 21.


The privacy risk can be expressed by some categorical values (i.e., low - medium - high, as shown in the figure) or by a more detailed quantification (e.g., the percentage of risk of re-identification of each specific individual, as we will consider in the remainder of this thesis). In this way, it is possible to recognize whether there exist individuals whose privacy is at risk and to apply appropriate (e.g., the least destructive possible) countermeasures.

As stated in [102], risk is the chance (understood as a probabilistic notion) that a danger (i.e., an event with harmful consequences) will materialize. Risk signals the threat of harm appraised through statistics and probabilities. Gellert [102] continues with a more technical definition: risk is an objective, measurable entity combining the probability of an adverse event and the magnitude of its consequences.
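To make this definition concrete, here is a minimal sketch in Python; the probabilities, magnitudes and band thresholds are purely hypothetical, chosen only to illustrate how a risk score can be computed and then mapped onto the categorical levels of Figure 1.1.

```python
# Minimal sketch: risk as probability x magnitude, then banding.
# All thresholds and input values below are hypothetical.

def risk_score(probability: float, magnitude: float) -> float:
    """Combine the chance of an adverse event with the severity of its consequences."""
    return probability * magnitude

def risk_band(score: float) -> str:
    """Map a numeric risk score onto categorical levels (low/medium/high)."""
    if score < 0.1:
        return "low"
    if score < 0.4:
        return "medium"
    return "high"

# Three individuals with (re-identification probability, harm magnitude).
individuals = {"u1": (0.05, 0.8), "u2": (0.30, 0.9), "u3": (0.70, 0.9)}
for uid, (p, m) in individuals.items():
    print(uid, risk_band(risk_score(p, m)))
```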

In [32], Boehm discusses the components of a typical risk assessment process; in particular, he defines three major steps:

• Risk Identification: the project-specific risks are generated using various identification techniques.

• Risk Analysis: after risks are detected in the identification step, certain properties are assigned to each risk statement so that they can be distinguished and grouped.

• Risk Prioritization: a ranked order of risks is produced in order to show their importance.

Clearly, the above phases are also suitable for the Privacy Risk Assessment process.
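As a small illustration of how these three phases might be operationalized for privacy risks, consider the following sketch; the risk statements, probabilities and impacts are hypothetical.

```python
# Boehm's three steps applied to privacy risks (illustrative values only).
from dataclasses import dataclass

@dataclass
class Risk:
    statement: str      # Risk Identification: a project-specific risk statement
    probability: float  # Risk Analysis: chance of the adverse event
    impact: float       # Risk Analysis: magnitude of its consequences

    @property
    def exposure(self) -> float:
        return self.probability * self.impact

risks = [
    Risk("re-identification from location traces", 0.4, 0.9),
    Risk("attribute inference from purchase records", 0.6, 0.5),
    Risk("membership disclosure from aggregate statistics", 0.2, 0.3),
]

# Risk Prioritization: a ranked order of risks, most severe first.
for r in sorted(risks, key=lambda r: r.exposure, reverse=True):
    print(f"{r.exposure:.2f}  {r.statement}")
```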

1.2.4 GDPR on Research Data

Article 89 of the GDPR lists special situations in which the legal obligations can be relaxed, i.e., it analyzes derogations for archiving purposes in the public interest, for scientific or historical research purposes, or for statistical purposes. In particular, it states that safeguards shall be developed to ensure respect for the principles of data minimization and of non-re-identifiability of the data subjects (this could include pseudonymization, if anonymization is not possible).
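As an illustration of one such safeguard, the sketch below performs pseudonymization via a keyed hash (HMAC-SHA256); the key shown is a placeholder, and in practice it must be stored securely and separately from the data, so that pseudonyms cannot be reversed or linked back without it.

```python
# A minimal sketch of pseudonymization via a keyed hash (HMAC-SHA256).
# The key below is a placeholder; in practice it must be kept secret and
# stored separately from the pseudonymized data.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("mario.rossi@example.com"))
```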

Special derogations are allowed, by Union or Member State law, if they are necessary for the fulfillment of the above purposes, regarding some rights listed in the GDPR, such as:

• right of access by the data subject (Article 15, where there are references to the period for which the personal data will be stored and to the restriction of processing of personal data);

• right to rectification of inaccurate personal data (Article 16);

• right to restriction of processing (Article 18);

• right to data portability from one controller to another (Article 20);

• right to object to a particular processing of personal data, including profiling (Article 21).


The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. The GDPR does not, therefore, concern the processing of such anonymous information, including for statistical or research purposes.


Chapter 2

State of the Art

“People were stupid, sometimes. They thought the Library was a dangerous place because of all the magical books, which was true enough, but what made it really one of the most dangerous places there could ever be was the simple fact that it was a library.”

The Librarian - Guards! Guards!

In this chapter, we review the most important results obtained by the scientific community on privacy in various contexts. In particular, we provide some basic notions about privacy strategies; for our purposes, the most central will be Differential Privacy, Privacy-by-Design and k-anonymity. We also give an overview of relevant work on privacy applications in topics related to our research, i.e., spatio-temporal data.

2.1 Privacy-Preserving Data Publishing and Data Mining

In recent years, individual privacy has been one of the most discussed jurisdictional issues in many countries. The problem of protecting individual privacy when disclosing information is not trivial, and this makes the problem scientifically attractive. It has been studied extensively in the data mining community, under the general umbrella of privacy-preserving data mining and data publishing. The aim of the methods proposed in the literature is to assure the privacy protection of individuals during both the analysis of human data and the publishing of data and extracted knowledge. There are two main families of approaches to the problem of privacy preservation: anonymity by randomization and anonymity by indistinguishability. We provide the general idea and some related work for each.


Anonymity by Randomization

Randomization methods are used to modify data in order to preserve the privacy of sensitive information. They were traditionally used for statistical disclosure control [5] and were later extended to the privacy-preserving data mining problem [8]. Randomization techniques perturb the data using a noise quantity; from the perturbed data, it is still possible to extract patterns and models. In the literature, there exist two types of random perturbation techniques: additive random perturbation and multiplicative random perturbation.

In the additive random perturbation method, the original dataset is denoted by $X = \{x_1, \dots, x_m\}$ and the new distorted dataset, denoted by $Z = \{z_1, \dots, z_m\}$, is obtained by independently drawing a noise quantity $n_i$ from a probability distribution (Uniform or Gaussian) and adding it to each record $x_i \in X$. Note that both the $m$ instantiations of the distorted dataset $Z$ and the distribution of the noise are known. The original record values cannot be easily guessed from the distorted data, while the distribution of the dataset can be easily recovered by using one of the methods discussed in [8, 7]. So, individual records are not available, while it is possible to obtain distributions along individual dimensions describing the behavior of the original dataset $X$. It is evident that traditional data mining algorithms are not adequate, as they are based on statistics extracted from individual records or multivariate distributions. Therefore, new data mining approaches have to be devised that work with aggregate distributions of the data in order to obtain mining results. In the works presented in [8, 252, 253], the authors propose new techniques based on the randomization approach in order to perturb data and, then, build classification models over the randomized data. In [7], Agrawal and Aggarwal show that the choice of the reconstruction algorithm affects the accuracy of the original probability distribution. Furthermore, they propose a method that converges to the maximum likelihood estimate of the data distribution. The authors of [252, 253] introduce methods to build a Naive Bayesian classifier over perturbed data. Randomization approaches are also applied to solve the privacy-preserving association rule mining problem, as in [82, 210].
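To illustrate, here is a minimal sketch of the additive scheme in Python, assuming Gaussian noise and a simple one-dimensional dataset (values and parameters are hypothetical): individual records are masked, while simple aggregates remain approximately recoverable from the perturbed data.

```python
# Additive random perturbation: z_i = x_i + n_i, with n_i drawn i.i.d.
# from a known noise distribution (here Gaussian). Individual records are
# masked, but zero-mean noise leaves simple aggregates almost unchanged.
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=40.0, size=10_000)            # original records
noise = rng.normal(loc=0.0, scale=15.0, size=X.shape)   # known noise distribution
Z = X + noise                                           # distorted dataset to publish

print("first record:", round(X[0], 2), "->", round(Z[0], 2))  # single value hidden
print("true mean:", round(X.mean(), 2), "estimated mean:", round(Z.mean(), 2))
```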

For privacy-preserving data mining, multiplicative random perturbation techniques can also be used. The main techniques of multiplicative perturbation are based on the work presented in [135].
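For flavor, the sketch below implements one common multiplicative scheme, a random rotation of the data; this is an illustration of the general idea, and the exact techniques in [135] may differ in detail. Since the transformation is orthogonal, Euclidean distances between records are preserved, so distance-based mining can still operate on the perturbed data.

```python
# Multiplicative perturbation via a random orthogonal (rotation) matrix.
# Euclidean distances are preserved (up to floating point), so
# distance-based mining still works on the perturbed data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                # 1000 records, 5 attributes

Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Z = X @ Q                                     # perturbed dataset

d_orig = np.linalg.norm(X[0] - X[1])
d_pert = np.linalg.norm(Z[0] - Z[1])
print(f"distance before: {d_orig:.6f}, after: {d_pert:.6f}")
```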

Unfortunately, the main problem of randomization methods is that they are not safe against attacks exploiting prior knowledge: indeed, in [136], Kargupta et al. show that the original data matrix can be recovered from a randomized data matrix using a random matrix-based spectral filtering technique.

Differential Privacy

A recent model of randomization, though based on different assumptions, is Differential Privacy. This privacy notion was introduced by Dwork in [79]. The key idea is that the privacy risks for a respondent should not increase as a result of appearing in a statistical database. Differential privacy ensures, in fact, that the ability of an adversary to inflict harm should be essentially the same, independently of whether any individual opts in to, or opts out of, the dataset. This privacy model is called $\epsilon$-differential privacy, after the parameter $\epsilon$ that quantifies the level of privacy guaranteed. It assures a record owner that any privacy breach will not be a result of participating in the database, since nothing, or almost nothing, that can be learned from the database with his record included could not also be learned without it.
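As a concrete taste of the model (anticipating the formal treatment of this notion), the following sketch answers a count query under $\epsilon$-differential privacy with the classic Laplace mechanism: a count has sensitivity 1, since adding or removing one individual changes the true answer by at most one, so Laplace noise with scale $1/\epsilon$ suffices.

```python
# The Laplace mechanism for a count query under epsilon-differential privacy.
# A count has sensitivity 1: adding or removing one individual changes the
# true answer by at most 1, so Laplace noise with scale 1/epsilon suffices.
import numpy as np

rng = np.random.default_rng(2)

def dp_count(true_count: int, epsilon: float) -> float:
    sensitivity = 1.0
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(dp_count(4213, epsilon=0.5))  # noisy answer close to 4213
```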
