Politecnico di Milano

Scuola di Ingegneria Industriale e dell’Informazione

Corso di Laurea Magistrale in Ingegneria Informatica

Dipartimento di Elettronica, Informazione e Bioingegneria

Analysis and Benchmarking of Privacy-preserving

Methods to Access Outsourced Data

Advisor: Prof. Gerardo PELOSI

Co-advisor: Ing. Alessandro BARENGHI

Master's Thesis by:


Acknowledgements

My thanks go to my advisor and co-advisor, Gerardo Pelosi and Alessandro Barenghi, who assisted me in understanding and carrying out this thesis.

I also want to thank my family, who supported me throughout my studies and allowed me to finally reach the goal of a Master's degree. In particular, I thank my mother Attilia, who has always pushed me to improve myself.


Summary

Since the birth of the Internet, the technological development of new devices has never stopped. In recent years we have witnessed small revolutions in the way we approach the Internet and in the way we experience it. Smartphones, tablets and smartwatches are now part of our everyday life; these devices constantly communicate over the Internet and allow us to quickly transfer amounts of information that were unimaginable twenty years ago. Together with these new devices, new infrastructures have appeared that make it possible to store large quantities of data on systems that exist only on the network: cloud computing was born. This document addresses the problem of securing communications and data transfers over the network, and of how it is possible to protect oneself using particular algorithms called Oblivious Random Access Memory (ORAM). In particular, it analyses the problem of keeping the contents stored on external databases secret, and how the operators of these platforms can extract information by applying statistical models. It also shows how the classic defence, encryption, is by no means sufficient to contain this threat, and how a series of countermeasures is needed to be certain of the privacy of one's data. To this end, this thesis presents all the elements needed to understand the problem and the types of solutions, and finally describes four different implementations of ORAM algorithms, with the corresponding benchmarks, in order to discuss the strengths and weaknesses of each system.


Abstract

Since the Internet was born, the technological development of new devices has never stopped. In the last few years we have witnessed small revolutions in the way we approach the Internet and in how it affects our lives. Smartphones, tablets and smartwatches are part of our everyday life, and these devices continuously communicate over the Internet, enabling us to transfer very quickly amounts of information that would have been inconceivable twenty years ago. Furthermore, along with these new devices, a new kind of online infrastructure capable of storing large amounts of data was born: cloud computing. This document discusses the security problem of communications over the Internet and how it is possible to protect data transfers using special algorithms called Oblivious Random Access Memory (ORAM). In particular, it discusses the privacy threat that affects documents saved in external storage and how cloud service providers can extract secret information by applying statistical models. It also shows how common security precautions, such as encryption, are not enough to solve the problem, and how a series of countermeasures is necessary to guarantee the privacy of the data. The thesis therefore explains the privacy problem and how service providers can extract information, then discusses the possible solutions, followed by the implementations of four ORAM algorithms. The last part of the document presents a series of benchmarks of the four security systems that show the strengths and weaknesses of each solution.


Contents

Introduction
1 The Problem
  1.1 The Attack
  1.2 The Countermeasures
2 Shuffle Index
  2.1 Structure of Shuffle Index
  2.2 Theory and Protection Techniques
    2.2.1 Cover Searches
    2.2.2 Cached Searches
    2.2.3 Shuffling
    2.2.4 The Implementation
  2.3 Functional Example
  2.4 Complexity
  2.5 Security & Privacy Considerations
3 Path Oram
  3.1 Structure of Path Oram
  3.2 Theory of Path Oram
  3.3 Functional Example
  3.4 Complexity
  3.5 Security & Privacy Considerations
4 Ring Oram
  4.1 Structure of Ring Oram
  4.2 Theory of Ring Oram
    4.2.1 Access Function
    4.2.2 ReadPath Function
    4.2.3 EvictPath Function
    4.2.4 EarlyReshuffle Function
  4.3 Functional Example
  4.4 Complexity
  4.5 Security & Privacy Considerations
5 Xor Ring Oram
  5.1 Theory of XRing Oram
    5.1.1 ReadXorPath Function
    5.1.2 Dummy Blocks Property
    5.1.3 WriteBucket Function
  5.2 Structure of XRing Oram
  5.3 Complexity
  5.4 Security & Privacy Considerations
6 Experimental Evaluation
  6.1 System Specifications
  6.2 Benchmarks Specifications
    6.2.1 Measurements
    6.2.2 Network
    6.2.3 Comparison Criterion
  6.3 Base Tests
    6.3.1 Shuffle Index
    6.3.2 Path Oram
    6.3.3 Ring Oram
    6.3.4 XRing Oram
  6.4 Confrontation Tests
    6.4.1 Average Internet Accesses
  6.5 Bandwidth Evaluations
    6.5.1 Bandwidth Formulae
    6.5.2 Bandwidth Confrontations
Conclusion
  7.1 Summary
  7.2 Conclusion


List of Figures

2.1 Shuffle Index structure as seen by the server
2.2 Ideal Shuffle Index structure for the client
2.3 Real Shuffle Index structure for the client
2.4 Example of search into the cloud and change of variables during the Access procedure
2.5 Initial status of the B+-tree
2.6 Permutation inside the B+-tree
2.7 Shuffle Index final result
2.8 Cipher Block Chaining (CBC) encryption mode
3.1 Oram tree
3.2 Oram tree after stash insertions
3.3 Oram tree after saves
3.4 Cipher Block Chaining (CBC) encryption mode
4.1 Ring Oram tree
4.2 ROram tree after the a_R search
4.3 ROram tree after the EvictPath function
4.4 ROram tree after the EarlyReshuffle function
4.5 Cipher Block Chaining (CBC) encryption mode
5.1 Cipher Block Chaining (CBC) encryption mode
6.1 HDD performances
6.2 Shuffle Index LAN test with B=25
6.3 Shuffle Index LAN test with B=40
6.5 Shuffle Index WAN 30ms test with B=40
6.6 Path Oram LAN test
6.7 Path Oram WAN test
6.8 Ring Oram LAN test
6.9 Ring Oram WAN test
6.10 XRing Oram LAN test
6.11 XRing Oram WAN test
6.12 Confrontation test in LAN environment and ∆#Blocks = −76
6.13 Confrontation test in LAN environment and ∆#Blocks = 856
6.14 Confrontation test in WAN environment and ∆#Blocks = −76
6.15 Confrontation test in WAN environment and ∆#Blocks = 856
6.16 Bandwidth trend, ∆#Blocks = −76
6.17 Bandwidth trend, ∆#Blocks = 856
6.18 Bandwidth trend, ∆#Blocks = −479


List of Tables

3.1 Legend of Path Oram
4.1 Table of the main variables
5.1 Main variables table
6.1 System Hardware Specifications
6.2 Tuning Parameters


List of Algorithms

2.2.1 Shuffle Index algorithm
3.2.1 Path Oram Access algorithm
4.2.1 Non-recursive Ring ORAM
4.2.2 ReadPath procedure
4.2.3 GetBlockOffset procedure
4.2.4 EvictPath procedure
4.2.5 ReadBucket procedure
4.2.6 WriteBucket procedure
4.2.7 EarlyReshuffle procedure
5.1.1 ReadXorPath procedure
5.1.2 WriteBucket procedure


Introduction

What is an Oblivious RAM? What does its acronym stand for? What is its purpose? Why do we need it? The first person to talk about ORAM was Oded Goldreich, an Israeli professor of computer science who formulated the ORAM theory in 1987. Then, in 2012, a PhD student at the University of California, Berkeley, published Path Oram [1], one of the most important and efficient algorithms of its category. ORAM is the abbreviation of Oblivious Random Access Memory and denotes a category of algorithms whose purpose is to hide the way in which people access data stored in the cloud. In fact, the cloud service provider can obtain non-trivial information about the content stored in its servers even if that content is protected by a strong cipher. Today, these intersection attacks [2] have become easier and more feasible than before, because big companies have the computational power and the economic resources to carry them out. Moreover, the explosion of affordable cloud services and the computerization of small and medium enterprises have exposed more and more data to indiscreet eyes. For this reason, since 1987, many security algorithms have been published to protect the privacy of the users.

Objective

The goal of this thesis is to compare four main ORAM algorithms by running a series of tests, in order to see how they behave in common situations and not only in their best-case scenario. Each algorithm has a dedicated chapter describing all of its features and explaining its functioning in detail. Inside each of these chapters there is also a simple functional example that clarifies how the system works. The last part of the thesis is dedicated to the benchmarks, how they were performed and which comparison criterion was used, followed by a detailed analysis and explanation of the obtained results.


Chapter 1

The Problem

The privacy of personal information is one of the biggest problems that our society is facing nowadays. Many e-commerce and social media platforms employ users' information to make a profit, using the data themselves or selling them to third parties. Basically, they observe how users approach their platforms, what is searched, clicked and viewed, and thereby obtain significant information. The same kind of data extraction happens on cloud storage, where companies save important information such as government documents, hospital records or personal emails. In fact, small and medium enterprises that do not have the economic resources to build and maintain their own cloud storage generally buy a cloud service from a provider, exposing themselves to a privacy threat. Nevertheless, both the cloud service providers and the clients take advantage of the cloud computing infrastructure. In fact, this kind of market has seen a big expansion in the last few years and is still growing due to the great demand.

The privacy threat derives from the fact that the cloud service providers have physical access to the machines and can observe how many requests are made in a certain period of time, how the resources are used and which access patterns emerge. It is important to note that the service provider does not have malevolent intentions, but it is very curious about what is stored in its servers. In fact, gaining more information about its customers can represent a great chance for big companies to earn more money. This way, they can offer customized services to their clients and cut costs by modifying the hardware on the servers or migrating the service to different machines without a noticeable loss of performance for the clients, improving the efficiency of the cloud. These service and cloud improvements are made possible by analysing the data obtained from intersection attacks. For the service providers this knowledge extraction is fundamental in order to be more competitive on the market.

The core problem is the storage of data in third-party databases, because that is where the leakage of information takes place. The first countermeasure to this threat is concealing all the data with a cipher before saving them into the external storage. This is a good first step, but it does not solve the problem: encryption can only achieve the secrecy (confidentiality) of the data, it cannot hide how the owner accesses them. In fact, it is possible to infer the content of the encrypted information using the techniques examined in depth in Section 1.1. Furthermore, the service provider can see the access patterns of its clients, and this represents the main information leakage factor.

It is important to clarify that, in this analysis, it is assumed that the server knows the client's security measures and the algorithms employed, but not the client's security keys. This assumption is essential because the security of a system must not rely on the secrecy of the chosen algorithm, but on how it works and which security features it employs.

To go deeper into the problem, it is fundamental to explain every critical point of outsourcing data to external databases:

1. Static memory placement: when data is stored, it is placed in a specific location of the memory and this location changes very rarely. Unless the client applies special countermeasures, this weakness enables the server to count how often the client accesses his information. Also, the server can map memory areas to the query identifiers used by the client. Therefore, the service provider can reconstruct the content of the whole storage with a certain degree of accuracy, either by having a little knowledge of its clients' business sector or with the help of external sources like Google Trends. How this is possible is explained in Section 1.1.

2. Sequential queries: the normal behaviour of the client is to access the data in a way that is useful to him. In fact, he wants to extract information from the cloud by making a series of queries that serve his goals. For instance, the client can search for emails coming from a specific sender, check the sales volumes of the last two months, or check the incomes of his employees. To obtain knowledge from the cloud, the client makes a sequence of queries that share a common topic. The server knows this common behaviour, and the learning process of the service provider is even easier if the data extractions are made in bursts of queries. With this behaviour, the server can draw a map of correlations between the encrypted contents and can infer information by collecting all the leakages. The main reason why this is feasible is the static memory placement of the data.

3. Repetition of queries: this problem is characterized by the repetition of equal or similar requests during the lifetime of the cloud. It does not matter whether the client makes an identical request with two seconds or two years between one and the other, because the server can easily see that they are equal. In fact, the service provider can count how many times each query is made by investing a very small amount of resources.

In case the server does not want to employ permanent memory or has a limited amount of storage, the repetition time becomes fundamental. In the first case, the server can detect equal queries until the memory is refreshed or the machine is rebooted. In the second case, the server can count only the most frequent queries, as it does not have enough space to save them all. So, the information leakage is even clearer when the repetitions of a query happen within a very short period of time of each other.

4. Frequency of queries: the number of requests made in a certain period of time defines the frequency with which the client accesses the cloud. Usually the client only performs queries when he needs some data. Sometimes he needs to retrieve all the information from the cloud, but sometimes he already has the information in a local memory, so he can drop one or more queries. Also, it is common for the client to have a cache system that holds useful data for the current task, and this client-side feature emphasizes the frequency of queries problem. In this way the service provider can profile his client's behaviour and gain information about the usage of cloud data and the correlation between them.

All of these information leakages, combined together, represent a serious threat to the security of an external storage. In fact, the four problems together make intersection attacks feasible. The core idea of these attacks is to guess what is saved in the cloud by observing the behaviour of the client, without breaking the cipher.

Each piece of information has its unique identifier and its fixed position in the cloud. Every time the client makes a query, the server can see which identifiers are involved and which memory locations are accessed more often. With this expedient, the server can make assumptions on the importance of the encrypted contents, sorting them by their access frequencies. The more queries are made, the higher the speculation accuracy of the server becomes, and the larger the information leakage about the encrypted content stored in the cloud. How is it possible to protect the data? During the last decades, the security community has proposed a multitude of algorithms called Oblivious Random Access Memory (ORAM). The adoption of one of these security systems makes the server's counting useless and hides the correlation between data, making intersection attacks infeasible.

1.1 The Attack

Before explaining how the attack is carried out against a cloud storage, it is important to define some baseline characteristics of the system. The first thing to say is that all the contents on the cloud are encrypted using a secure cryptographic algorithm, so this is not the weak link in the chain. Second, all the information is accessed using a unique identifier (id) and only the client knows the pairing <id, content>, where the content is a useful piece of information. There are no other security measures implemented by the client, so each of the four problems mentioned previously is present in the outsourced data scheme.

The main characteristic of the intersection attacks [2] consists in counting how often an id is used by analysing the client's requests. Every time an identifier is used, an identical query is generated on the server side to retrieve the necessary information (repetition of queries problem). At the same time, the server can understand the specific memory area associated to an id, because of the static memory placement problem.

Now the service provider has all the necessary elements to mount the attack. The key idea is very similar to the method employed for breaking a mono-alphabetic cipher. The attacker counts the frequencies of the letters of a ciphered text, then he uses the basic knowledge of the letter frequencies of the language in which the message is written. At this point, he only needs to pair equal or similar frequencies between the ciphertext and the alphabet and recover the hidden text. In case the language of the plaintext is unknown, an additional guessing phase is necessary, choosing the alphabet that has the most similar letter frequencies and the same number of distinct characters as the hidden text. The precision of the attack depends on the length of the ciphered text and on the size of the alphabet. If the accuracy is too low, the attacker only needs to collect more ciphered text and repeat the procedure.

In the cloud scenario there are many analogies: the concealed source is the whole content of the cloud, the identifiers are the alphabet letters, the number of queries is comparable to the length of the ciphered text, and the additional knowledge (the plain language) is the business sector of the client (medical, technological, military, etc.). As before, the attacker counts how many times an id is used and associates it to a memory area with the same access frequency. Then he can use the information about the client's business sector and select an auxiliary knowledge base (e.g., Google Trends) to learn the most frequently searched keywords in that specific field, pairing keywords and ids with the same or similar frequencies. If the service provider has no information about its client, it has to apply the same kind of guessing technique explained before; obviously it is a more complicated scenario, but the principle is the same. In fact, the attack can be successful if the server dynamically adjusts its guess of the client's business sector and repeats the intersection attack multiple times on different knowledge bases. The guess might be inaccurate in the very first rounds, since at the beginning the server has no idea about the exact user's background, but as the attack rounds increase, it can gradually identify the exact business sector of the client. It is important to understand that this kind of attack is not infeasible.
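To make the frequency-matching step concrete, the following minimal sketch (Python; the names observed_trace, keyword_popularity and match_by_rank are hypothetical and used only for illustration, they do not come from the thesis) pairs the most frequently accessed identifiers with the most popular keywords of the guessed business sector:

```python
from collections import Counter

def match_by_rank(observed_trace, keyword_popularity):
    """Guess which keyword hides behind each identifier by matching frequency ranks.

    observed_trace: list of block identifiers seen by the curious server,
                    one entry per query (repetition + static placement leaks).
    keyword_popularity: dict keyword -> expected relative frequency, taken from
                        an auxiliary source such as public search trends.
    """
    # The server can always count how often every identifier is accessed.
    id_counts = Counter(observed_trace)

    # Sort identifiers and keywords by decreasing frequency.
    ids_by_rank = [i for i, _ in id_counts.most_common()]
    keywords_by_rank = sorted(keyword_popularity, key=keyword_popularity.get, reverse=True)

    # Pair items of equal rank, exactly like frequency analysis pairs the symbols
    # of a mono-alphabetic ciphertext with the letters of the plaintext alphabet.
    return dict(zip(ids_by_rank, keywords_by_rank))

# Toy usage: identifier 42 is accessed most often, so it is guessed to hide "flu".
trace = [42, 42, 7, 42, 13, 7]
trends = {"flu": 0.5, "fracture": 0.3, "allergy": 0.2}
print(match_by_rank(trace, trends))   # {42: 'flu', 7: 'fracture', 13: 'allergy'}
```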


Now the service provider knows all the keywords of the cloud and it can extract further knowledge using the correlation between contents. As mentioned before, the sequential queries problem and the frequency of queries problem clearly expose the access patterns of the client and the correlation between specific data. With this further step, the attacker can read the cloud. For example, if the outsourced data scheme is the one of a hospital, it is possible to associate patients with their diseases or with their therapy using the disclosed contents and the access patterns.

1.2 The Countermeasures

This section deals with the countermeasures that can be implemented to solve the problems of outsourced data schemes. The basic idea is to make intersection attacks infeasible by creating confusion on the server side and hiding real searches among useless requests. But how is this possible? There are different problems that afflict an outsourced data scheme, but for each of them there is a solution.

One of the first countermeasures commonly used in ORAM algorithms is to obfuscate the correlation between multiple queries by performing a series of fake searches. This technique is more powerful if, for each real information request, more cover searches are performed. This number is usually chosen by the client once and then maintained for all the queries. It is important to understand that these fake searches must be different every time they are scheduled. There are constraints to follow and these useless requests have a considerably high cost in terms of resources, but by using them sequential queries can be performed with a certain degree of security.
The second countermeasure is to change continuously the position of the contents inside the cloud. The client has the power to choose where to place each piece of information inside the external storage. This behaviour has the purpose of randomizing the data positions from the server's point of view and has the same objective as shuffling a deck of cards. How is this done? Basically, the client changes the pairing <id, content> every time he has the possibility, and the effect on the cloud side is a storage that is continuously shuffled. What changes in the database is the content placement, not the cloud structure. Unfortunately, the service provider can still recognize the same encrypted information relocated in the cloud if its encryption is always the same. Symmetric encryption schemes always produce the same output if the inputs are always the same. So, it is necessary to modify something before applying the encryption algorithm. Possible solutions are changing the initialization vector (IV) or the encryption key; it is also common to add some random padding inside the structure of the information. With these stratagems, the server has no possibility of recognizing the relocation of the same content. This countermeasure solves the static memory placement and the repetition of queries problems, because every piece of information is forced to change its position and identifier. Moreover, with the help of cover searches, the problem of correlation leakage is solved, too.
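As a small illustration of this re-encryption trick, the sketch below encrypts the same node content twice with AES-CBC and a fresh random IV, obtaining two different ciphertexts. It is only a sketch, assuming the pyca/cryptography package; the thesis does not prescribe any specific library or function names.

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_node(key: bytes, plaintext: bytes) -> bytes:
    """AES-CBC encryption with a fresh random IV, prepended to the ciphertext."""
    iv = os.urandom(16)                                  # new IV at every re-upload
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()

key = os.urandom(16)
node = b"same node content"
# Two uploads of the same plaintext look completely unrelated to the server,
# so a shuffled node cannot be recognized as one it has already seen.
print(encrypt_node(key, node) != encrypt_node(key, node))   # True
```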

The last remaining threat is the frequency of queries problem, which reveals how the client uses his data. To solve this issue, it is possible to fix the number of queries performed in a single period of time. This makes it impossible to profile the client's behaviour, because there is no variation in his actions.


Chapter 2

Shuffle Index

The first algorithm presented in this document is Shuffle Index [3]. It is an algorithm published in 2015 by professors and researchers of Università degli Studi di Milano, Università degli Studi di Bergamo and Politecnico di Milano. The main goal of the algorithm is to guarantee the privacy and the security of outsourced data. There are many similar protocols with the same features, but this one has substantial differences from the others, as it implements particular solutions that could make it one of the most balanced algorithms in terms of performance and cost.

2.1 Structure of Shuffle Index

This part of the document focuses on the description of the data structures implemented in Shuffle Index, fundamental for the comprehension of the protocol. The main data structure is an unchained B+-tree. Every node of the unchained B+-tree has at most a fan-out of F children, except for the root, which always has F children (the case of a tree with only the root is not representative). Inside each node there is a vector of indexes that points to its children. The internal organization of the data structure follows this rule: the i-th child of any internal node storing the ordered values v_1, ..., v_{q−1} is the root of a subtree containing the values v such that v < v_1 for i = 1, v_{i−1} ≤ v < v_i for i = 2, ..., q − 1, and v ≥ v_{q−1} for i = q.


Each leaf node contains a bucket of blocks, and it is here that it is possible to save information. The number of blocks per bucket is equal to B = F − 1, and every block has a unique identifier used to identify and search for it.

Definition 1.3 (B+-tree): the tree of height H is formed by (F^H − 1)/(F − 1) nodes. Every node has an identifier id ∈ [0, (F^H − 1)/(F − 1) − 1]. Inside the tree there are H levels, identified by L ∈ [0, H − 1]. The first H − 1 levels contain internal nodes, while the last level contains the leaf nodes.

Definition 1.1 (Internal Node): let <id, keys[B], childrenIds[B + 1]> be the configuration of a node that is not a leaf. The first component is the unique identifier of the node. The keys[B] vector represents the space where it is possible to store B search keys, useful for choosing which path to follow. The childrenIds[B + 1] vector contains all the identifiers of the child nodes. B is the parameter that defines the bucket size.

Definition 1.1 (Leaf Node): let <id, bucket[B]> be the configuration of a leaf node. The first component is the unique identifier of the node and bucket[B] is the vector where the blocks are saved. These nodes are the only ones that hold information. B is the parameter that defines the bucket size.

Definition 1.2 (Block): let <id_b, data> be the structure of a block, where id_b ∈ [0, B · ((F^H − 1)/(F − 1) − 1)] is a unique identifier that is also the search key of the block. The data field is the space dedicated to saving useful information.
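The structures defined above map almost one-to-one onto code. The following sketch (Python dataclasses, purely illustrative; the field names follow the definitions, everything else is an assumption) shows the client-side view of a node and of a block:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Block:
    id_b: int                 # unique identifier, also used as the search key
    data: bytes = b""         # useful information stored in the block

@dataclass
class InternalNode:
    id: int                   # unique node identifier
    keys: List[int]           # up to B search keys guiding the descent
    children_ids: List[int]   # the B + 1 identifiers of the child nodes

@dataclass
class LeafNode:
    id: int                   # unique node identifier
    bucket: List[Block] = field(default_factory=list)   # at most B blocks
```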

To better understand the physical structure of Shuffle Index, Figure 2.1 proposes a simple representation of an unchained B+-tree as it is seen by the server.



Figure 2.1: Shuffle Index structure as seen by the server

Figure 2.2 shows how the data structure is ideally filled.

Figure 2.2: Ideal Shuffle Index structure for the client

Figure 2.3 represents an instance of the same B+-tree as it could be saved on the cloud. These interlaced relationships between nodes have the purpose of confusing the server and hiding the access patterns. Note, however, that the node identifiers remain in their places.

Figure 2.3: Real Shuffle Index structure for the client

It is important to understand that the physical structure of the tree always remains the same, whatever action the client performs. What changes every time are the node contents and the parent/child relationships. In fact, the algorithm keeps the tree continuously shuffled, constantly modifying the data inside the structure. How this is done is explained in more detail in the next section.


2.2 Theory and Protection Techniques

Before going into the details of the algorithm, it is important to define the security features of Shuffle Index. The algorithm uses three techniques to ensure that no information can leak from the cloud or during the data transfer through the Internet. These strategies are: the cover searches, the cached searches and the shuffling. Also, each node of the tree is encrypted using a strong and secure cipher to ensure the confidentiality of its content.

2.2.1 Cover Searches

As the name suggests, the purpose of the cover searches is to hide the real searches performed on the cloud. It is important to conceal the real queries, otherwise counting the access frequencies of the contents becomes feasible and intersection attacks can be performed. In fact, the server can pair a physical location with a specific content (they have the same frequencies) and, at the same time, it can guess the importance of the information by ordering the frequencies. For instance, consider two consecutive requests for 'F' on the scheme of Figure 2.3. The server will access nodes {(001); (103); (207)} twice, revealing that the two queries refer to the same information. To mitigate the repetition of queries problem, the stratagem is to search for useless contents during a real request. This is a simple and very effective action, because it confuses the server. The number of cover searches performed for each real request is chosen by the client by setting the parameter num_cover. For example, with one cover search per real query, the client will perform four accesses: two pointing to block 'F' and two pointing towards two other blocks, for instance 'I' and 'M'. So, for the first query, the server will access nodes {(001); (101, 103); (201, 207)} and then {(001); (103, 104); (207, 211)}. With this trick, the server has a probability p = 0.5 · 0.5 = 0.25, rather than p = 1, of inferring a sequential access to the same node. In case the required content is in the client cache, the algorithm performs an additional cover search (num_cover + 1), otherwise there are two possible security breaches. In the first case, the client would simply avoid accessing the cloud, but this reveals that the target content is already in cache (frequency of queries problem). In the second case, the client would make only the num_cover searches, revealing that all these searches are fake, and the server would still infer a cached content. For this reason it is important to make the same number of queries every time. The cover searches are formally defined below.


Definition 1.4 (Cover searches): let {<id_0, b_0>, ..., <id_m, b_m>} be a set of nodes forming a Shuffle Index built over a candidate key with domain D, and let v_0 be a value in D. A set {v_1, ..., v_n} of values in D is a set of cover searches for v_0 if ∀ v_i, v_j ∈ {v_0, ..., v_n} : v_i ≠ v_j ⟹ path(v_i) ∩ path(v_j) = {<id_0, b_0>}, that is, the intersection contains only the root of the Shuffle Index.

This means that the only node in common between any two searches, fake or not, must be the root of the tree and nothing else. This is a very strong constraint and a very safe measure for the algorithm, because it binds the value of num_cover to the fan-out F of the root. To summarize the way in which a search works: when the algorithm has to retrieve a content from the cloud, it performs one real search plus the cover searches. This sum cannot be greater than the fan-out, because there are no more than F possible paths leaving the root of the tree. Each cover search path is chosen uniformly at random, to make real and fake searches indistinguishable. Another safety feature is that all the searches are performed in parallel, level by level (the num_cover + 1 nodes at each level of the Shuffle Index are retrieved before proceeding to the next level). This means that the parent-child relationship between nodes of adjacent levels is broken, because at each level any of the num_cover + 1 parents could be associated with any of the num_cover + 1 children, therefore producing (num_cover + 1)^h potential paths.
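In practice, the cover values can be drawn by rejection sampling until each of them descends into a distinct, non-cached child of the root. The sketch below (Python, illustrative names only; child_of_root stands for the ChildToFollow call applied to the root) mirrors the constraints of lines 13 to 18 of Algorithm 2.2.1:

```python
import random

def choose_cover_values(domain, child_of_root, target_value, num_cover, cached_level1_ids):
    """Draw num_cover + 1 cover values whose root-level children are all distinct.

    domain: the list of candidate key values D.
    child_of_root: function mapping a value to the id of the root child to follow.
    cached_level1_ids: identifiers already in Cache_1, which must be avoided.
    Assumes the fan-out is large enough (F >= num_cover + 2), as required by the text.
    """
    forbidden = {child_of_root(target_value)} | set(cached_level1_ids)
    covers = []
    while len(covers) < num_cover + 1:
        candidate = random.choice(domain)
        child = child_of_root(candidate)
        if child not in forbidden:      # disjoint path, not cached, not the target's path
            covers.append(candidate)
            forbidden.add(child)
    return covers

# Toy usage: values 0-99, four root children selected by integer division.
print(choose_cover_values(list(range(100)), lambda v: v // 25,
                          target_value=10, num_cover=1, cached_level1_ids=[2]))
```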

2.2.2 Cached Searches

A cached search is a special case of data request: the needed information is already in the client cache and it is not necessary to carry out the whole search procedure to retrieve the content. As mentioned before, in this case the Shuffle Index algorithm performs num_cover + 1 fake searches to hide the unnecessary request and prevent any leakage of information (frequency of queries problem). This strategy also hinders intersection attacks further, because the searches are completely uncorrelated with each other. Before going on with the explanation, it is important to define how the cache works.

Definition 1.5 (Cache): let {<id_0, b_0>, ..., <id_m, b_m>} be a set of nodes forming a Shuffle Index. A cache of size num_cache for the unchained B+-tree is a layered structure of L + 1 sets Cache_0, ..., Cache_L, where:

• Cache_0 contains the root node <id_0, n_0>;

• Cache_l, l = 1, ..., L, contains num_cache nodes belonging to the l-th level of the unchained B+-tree;

• ∀ n ∈ Cache_l, l = 1, ..., L, the parent of n in the unchained B+-tree belongs to Cache_{l−1} (path continuity property).

In other words, the cache is composed of multiple levels, going from zero to L. When a node is taken from the server, it is placed in the cache level corresponding to its depth in the cloud. When a cache level is full and a new node has to be inserted, the Least Recently Used (LRU) policy is applied to make space for the new node. The path continuity property guarantees that all the nodes retrieved for the real search are in cache, from the root (always in cache) down to the requested leaf.

The cache is useful for hiding equal searches made within a distance of num_cache from each other, because the targeted information is still in the cache and the algorithm only performs fake searches in the second query. This protects from short intersection attacks, and the server cannot distinguish any different behaviour of the client, because he always makes the same number of requests.
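A minimal sketch of one cache level with the LRU policy described above (Python, using the standard library's OrderedDict; the class and method names are illustrative assumptions):

```python
from collections import OrderedDict

class CacheLevel:
    """One level Cache_l of the layered cache, evicting the least recently used node."""

    def __init__(self, num_cache):
        self.num_cache = num_cache
        self.nodes = OrderedDict()        # node id -> node, oldest first

    def refresh(self, node_id):
        """A cache hit: move the node to the most-recently-used position."""
        self.nodes.move_to_end(node_id)

    def insert(self, node_id, node):
        """Insert a freshly read node, evicting the LRU one if the level is full."""
        evicted = None
        if len(self.nodes) >= self.num_cache:
            evicted = self.nodes.popitem(last=False)   # drop the least recently used
        self.nodes[node_id] = node
        return evicted      # the evicted node must be re-encrypted and re-uploaded

level = CacheLevel(num_cache=2)
level.insert(103, "node-103"); level.insert(104, "node-104")
level.refresh(103)
print(level.insert(207, "node-207"))   # evicts (104, 'node-104')
```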

2.2.3 Shuffling

Cover and cached searches are not sufficient to counteract all the possible leakages of information, because the static memory placement problem is still not solved. In fact, the server can still see the access patterns if its observation goes beyond the cache size. To counteract this weakness, the Shuffle Index algorithm implements the node shuffling mechanism.

Definition 1.6 (Shuffling): let N = {<id_1, n_1>, ..., <id_m, n_m>} be a set of nodes at the same level of an unchained B+-tree and let π be a permutation of id_1, ..., id_m. The node shuffling of N with respect to π is the set {<id_1, n′_1>, ..., <id_m, n′_m>} of nodes, where n′_i = n_j whenever id_i = π(id_j).


The definition states that every time a level of the cloud is accessed, all the nodes read at that depth, together with the cached ones, are shuffled, changing their node identifiers. It is important to remember that a real query with its cover searches is performed in parallel, level by level. Another point of strength of the shuffling operation is how the permutation is performed: the π function involves the nodes in Cache_l and the last read nodes. As one can imagine, all the parent-child relationships and the data locality of the nodes are eliminated from the server's point of view. For instance, let us assume three accesses with the following paths: {(001); (101, 103); (201, 207)}; {(001); (103, 104); (207, 211)}; {(001); (102, 103); (207, 208)}. The server sees the presence of a common leaf node, 207. The three queries aim at accessing the same node only if: the second and third requests are for the content of node 207 (the probability is 0.5 · 0.5 = 0.25); the data target of the first request coincides with the content of node 207 after the first shuffling operation (the probability is 0.5); and the content of node 207 is not moved by the second shuffling operation (the probability is 0.5). As a consequence, 0.0625 is the probability that the three requests aim at the same node. Without this procedure, the server could infer with probability 0.5 · 0.5 · 0.5 = 0.125 that the three queries point at the same data location. One can argue that this is not a high probability, but repeating this observation multiple times can seriously threaten the privacy of the information.
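The shuffling step of Definition 1.6 amounts to drawing a random permutation of the involved identifiers and relabelling the nodes; a minimal sketch (Python, with illustrative names that are not part of the thesis) is:

```python
import random

def shuffle_nodes(nodes):
    """Apply a random permutation pi to the identifiers of the given nodes.

    nodes: dict mapping the current node id to its (decrypted) content,
           i.e. the nodes just read from level l plus those in Cache_l.
    Returns (shuffled nodes, pi) so that parent pointers can be updated through pi.
    """
    old_ids = list(nodes)
    new_ids = old_ids[:]
    random.shuffle(new_ids)
    pi = dict(zip(old_ids, new_ids))              # pi[old_id] = new_id
    shuffled = {pi[i]: content for i, content in nodes.items()}
    return shuffled, pi

level_1 = {101: "n101", 102: "n102", 103: "n103", 104: "n104"}
shuffled, pi = shuffle_nodes(level_1)
# Parent pointers, the cached target_id and the cover_ids must then be remapped
# through pi, as lines 42 to 46 of Algorithm 2.2.1 do.
print(pi)
```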

2.2.4 The Implementation

This part of the document deals with the internal functioning of the Shuffle Index algorithm and how it reaches its security objectives. All the variables, auxiliary functions and subroutines are highlighted and explained in detail. It is an important part of the discussion of Shuffle Index because, without a clear picture of how the algorithm works, it would be difficult to understand the following observations and considerations.



Algorithm 2.2.1: Shuffle Index algorithm

 1 /* S: Shuffle Index on a candidate key with domain D, leaf level L, fan-out F */
 2 /* Cache_l, l = 0, ..., L: cache */
 3 /* num_cache: number of nodes in Cache_l, l = 1, ..., L */
 4 /* num_cover: number of cover searches */

Input: target_value: value to be searched in the Shuffle Index
Output: n: leaf node that contains target_value

 5 MAIN
 6 /* Initialize variables */
 7 Non_Cached := Non_Cached_P := ∅
 8 let n_0 be the unique node in Cache_0
 9 target_id := n_0.id
10 cache_hit := TRUE    /* the root always belongs to Cache_0 */
11 num_cover := num_cover + 1
12 for i := 1 ... num_cover do cover_id[i] := target_id
13 /* Choose cover searches */
14 for i := 1 ... num_cover do
15     randomly choose cover_value[i] in D s.t. ∀ j := 1, ..., i − 1:
16         ChildToFollow(n_0, cover_value[i]) ≠ ChildToFollow(n_0, cover_value[j]) and
17         ChildToFollow(n_0, cover_value[i]) ∉ {n.id | n ∈ Cache_1} and
18         ChildToFollow(n_0, cover_value[i]) ≠ ChildToFollow(n_0, target_value)
19 /* Search, shuffle and update cache and index structure */
20 for l := 1 ... L do
21     let n ∈ Cache_{l−1} such that n.id = target_id
22     target_id := ChildToFollow(n, target_value)
23     /* identify the nodes to read from the server */
24     if target_id ∉ {n.id | n ∈ Cache_l} then
25         ToRead_ids := {target_id}
26         if cache_hit then
27             cache_hit := FALSE
28             num_cover := num_cover − 1
29     else
30         ToRead_ids := ∅
31
32     for i := 1 ... num_cover do
33         let n ∈ Cache_{l−1} ∪ Non_Cached_P such that n.id = cover_id[i]
34         cover_id[i] := ChildToFollow(n, cover_value[i])
35         ToRead_ids := ToRead_ids ∪ {cover_id[i]}
36     /* read blocks */
37     Read := Decrypt(ReadBlocks(ToRead_ids))
38     /* shuffle nodes */
39     let π be a permutation of ToRead_ids ∪ {n.id | n ∈ Cache_l}
40     foreach n ∈ Read ∪ Cache_l do n.id := π(n.id)
41     /* determine effects on parents and store nodes at level l − 1 */
42     foreach n ∈ Cache_{l−1} ∪ Non_Cached_P do
43         for i := 1 ... F do n.pointers[i] := π(n.pointers[i])
44         WriteBlock(n.id, Encrypt(n))
45     target_id := π(target_id)
46     for i := 1 ... num_cover do cover_id[i] := π(cover_id[i])
47     /* update cache level l */
48     Non_Cached := Read
49     if cache_hit then
50         refresh the timestamp of n ∈ Cache_l s.t. n.id = target_id
51     else
52         let deleted be the least recently used node in Cache_l
53         let n ∈ Read s.t. n.id = target_id
54         insert n into Cache_l
55     Non_Cached_P := Non_Cached
56 /* Write nodes at level L */
57 foreach n ∈ Cache_L ∪ Non_Cached_P do WriteBlock(n.id, Encrypt(n))
58 /* Return the target leaf node */
59 return n

60 CHILDTOFOLLOW(n, v)
61     i := 0
62     if v ≥ n.values[1] then
63         while i + 1 < Length(n.values) AND v > n.values[i+1] do
64             i := i + 1



There is only one parameter passed to the function:

• target_value: this variable is a search key that identifies a precise block. It is used to decide which path to follow, using the internal organization of the unchained B+-tree.

Now it is possible to move on to the major steps accomplished by the Shuffle Index algorithm:

• Select cover_values (lines 12 to 18): the parameter num_cover determines the number of cover searches to perform during a single request. By default, the algorithm defines num_cover + 1 cover_values, useful to hide the target_value search; the additional cover search is needed in case the target_value is in cache. These cover values have to satisfy some constraints:

– Not same path (line 16): the paths to follow for any two cover_values must have only the root as a common node.

– Not in cache (line 17): the level-one child to follow for a cover_value must not be in cache.

– Not same path as target_value (line 18): the path of any cover_value must be different from the path of the target_value.

These constraints are very binding because, from the root, it is possible to have only F different paths in total. It is mandatory to have a fan-out of at least F = num_cover + 2 (one path for the target_value, the others for the num_cover + 1 cover searches), or the algorithm will not work properly. So, the dimension of the bucket B, the fan-out F and the security parameter num_cover are strictly related.

This part deals with the level-by-level search (i.e., l ∈ [1, L]):

• Node in cache (lines 21 to 30): here it is checked whether the next target node, computed with ChildToFollow, is already in Cache_l or not. Basically, if the node is not present in Cache_l, it is necessary to retrieve the target_id node and to drop one cover search. In fact, the algorithm always performs num_cover + 1 reads: if the target_id node is in cache, the read requests will be only for cover values, otherwise one of the num_cover + 1 reads will be a search for useful content. This is achieved by incrementing or decrementing num_cover and by filling the ToRead_ids vector with the proper identifiers. In the case of a cache hit, ToRead_ids will contain only cover_ids, otherwise there will be one less cover_id in ToRead_ids and its place will be taken by the target_id.

• Nodes to read (lines 32 to 35): here the next node identifiers for the cover_values are determined. All of them are inserted into the ToRead_ids vector.

• Read operation (line 37): all the nodes whose identifiers are in ToRead_ids are retrieved from the cloud and then decrypted. Now the algorithm has a Read vector full of nodes extracted from level l of the unchained B+-tree.

• Permutation (lines 39 to 40): all the nodes inside Read and inside Cache_l undergo the permutation procedure. What really happens is that the node identifiers are swapped. It is not necessary to update other variables of the nodes, because they will be changed in the next iteration of the Access function, during the update-fathers phase.

• Update fathers and store (lines 42 to 44): it is fundamental to update the children indexes inside the father nodes, otherwise the internal organization of the tree would be destroyed permanently. So, all the fathers previously read receive the new identifiers of their children and then they are encrypted and stored on the cloud.

• Update cover_ids and target_id (lines 45 to 46): the update of the identifiers must also be applied to the target_id and to the cover_id vector.

• Cache update (lines 49 to 55): the node with identifier target_id is placed inside Cache_l. In case there is not enough space, the LRU policy is applied to free a slot for the new node. All the other nodes of the level read during this iteration are placed in Non_Cached_P, so that they can be re-uploaded in the next iteration.



After the algorithm has reached the bottom of the unchained B+-tree and uploaded the permuted nodes at the leaf level, there is still one operation to do before ending the Access procedure:

• Return (line 59) (retrieval of the real node): at the beginning of the search procedure, the target_id node can be in the cloud or in the cache. At this point of the Access function, the requested node must be inside Cache_L, and it is possible to retrieve it and return it to the client.

The only auxiliary function inside the Access operation is ChildToFollow. Its purpose is to retrieve the next node identifier, given a node n and a search key v. The function uses the internal organization of the node to choose which child identifier must be returned.
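A minimal Python sketch of this descent, assuming the internal-node layout of the definitions in Section 2.1 (keys and children_ids as field names; the code is illustrative, not the thesis implementation):

```python
def child_to_follow(node, v):
    """Return the identifier of the child whose subtree may contain the value v.

    node.keys is the sorted vector of at most B search keys and
    node.children_ids holds the corresponding B + 1 child identifiers.
    """
    i = 0
    # Advance while v is at least the next separating key, following the
    # subtree rule stated in Section 2.1.
    while i < len(node.keys) and v >= node.keys[i]:
        i += 1
    return node.children_ids[i]
```

Combined with the InternalNode sketch of Section 2.1, child_to_follow(InternalNode(1, [10, 20], [4, 5, 6]), 15) returns 5, i.e., the identifier of the middle subtree, which holds the values v with 10 ≤ v < 20.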


2.3 Functional Example

There is no better way to explain the whole Access operation than using an example. The initial parameters are num_cover = 1, num_cache = 2 and target_value = 'F'. Figure 2.4 shows how the data structures are populated inside the Access function during the search. The table must be read row by row, from left to right. The rows represent at which level l of the B+-tree the algorithm is performing its operations and which cache level (Cache_l) is in use.

Figure 2.4: Example of search into the cloud and change of variables during the Access procedure

Starting from the possible cloud configuration shown in Figure 2.5, the first step accomplished by the algorithm is to choose the cover searches. Following the constraints for the selection of the cover_values, it is supposed that the two additional searches are 'S' and 'M'.

Figure 2.5: Initial status of the B+-tree

In iteration l = 1, the next node to read for 'F' is 103, obtained using ChildToFollow(001, 'F'), which is in Cache_1. For the two cover values the results are 104 and 102, respectively, and these identifiers are added to the ToRead_ids vector. Then the algorithm reads the selected nodes, decrypts them and puts them into the Read vector. Each node in Cache_1 and Read goes through the permutation phase, so the node identifiers are swapped as follows: (101 → 102), (103 → 101), (102 → 104), (104 → 103). Figure 2.6 gives a graphical representation of the permutation that takes place (consider only level l = 1). Now target_id = π(target_id) = 101. The node 101 (the old 103) is saved in Cache_1 and the other read nodes are placed in Non_Cached. The last step is to update the children pointers of node 001 following π, then encrypt the node and store it in the cloud.

Figure 2.6: Permutation inside the B+-tree

In iteration l = 2, ToRead_ids will be {202, 207}, because the target_id node is not in cache, so the second cover search is dropped. The nodes are read and then permuted, (202 → 210), (203 → 207), (207 → 202), (210 → 203), using the nodes in Cache_2 and Read. The target_id node 202 (the old 207) is placed in Cache_2, discarding 203 because it is supposed that node 203 is the least recently used one (LRU cache policy). Then the update phase occurs and the father nodes {101, 102, 103, 104}, present in Cache_1 and in Non_Cached_P, are updated following π. After that, the nodes are encrypted and stored into the cloud.


Figure 2.7: Shuffle Index final result

2.4 Complexity

This section deals with the time and space complexities of the main data structures of Shuffle Index. Reading the function descriptions, it is possible to figure out which parts of the algorithm could be a bottleneck.

• Multilevel Cache (client): it is the most important structure on the client side. It works just like a cache, so it is very useful for the user because it memorizes the last num_cache · H nodes read during the real searches. It is also a flexible structure, because its dimension can be customized by the client, so it is tailored to his needs. The cache size is determined by the parameter num_cache and by the height H of the B+-tree, so the spatial complexity is O(num_cache · (L + 1)). A more accurate analysis gives O((num_cache · L) + 1), because the root level of the B+-tree contains only one node, so only one element has to be memorized in Cache_0. The normal usage of the cache is level by level, so the algorithm works on single vectors that have a time complexity of O(num_cache) (O(1) for Cache_0).

• Non_Cached (client): it is not a permanent vector but it is used multiple times in the Access function. Its time and spatial complexity are O(num_cover).

• Non_Cached_P (client): it is not a permanent vector but it is used multiple times in the Access function. Its time and spatial complexity are O(num_cover).



2.5 Security & Privacy Considerations

This section deals with the security features of Shuffle Index and how they prevent the leakage of information. The four key points to guarantee are: Data confidentiality, Data integrity, Access confidentiality and Pattern confidentiality.

Figure 2.8: Cipher Block Chaining (CBC) encryption mode

The data confidentiality and the data integrity are guaranteed by an AES cipher. A 16-byte security key and the Cipher Block Chaining (CBC) encryption mode were chosen. This mode of operation, shown in Figure 2.8, hides the content by partitioning it into fragments of the same length as the key. The first fragment is XORed with an Initialization Vector (IV) and then encrypted; each subsequent fragment is encrypted after being XORed with the previously encrypted fragment.

The Access confidentiality and the Pattern confidentiality are guaranteed by the shuffling procedure. In fact, the permutation function π reshuffles the nodes, choosing the permutation uniformly at random. From the server's point of view, it is impossible to predict or guess where a precise content is repositioned, even if the server knew the initial position of the relocated content. This also prevents the server from recognising the access patterns of the client (pattern confidentiality). Furthermore, the cached nodes and the cover searches make the algorithm even more resistant to any possible intersection attack.


Chapter 3

Path Oram

Path Oram [1] is an algorithm created by Emil Stefanov, a brilliant student who was among the first to work on oblivious random access memory protocols. This algorithm uses very simple ideas to solve the complex privacy problem, and many other algorithms have reproposed similar solutions, as it is a landmark in this research area. The key idea of this security system is to use one of the most powerful data structures for the cloud (a binary tree), in terms of insertion/deletion/search complexity, and to implement a fast access procedure. Furthermore, Path Oram employs a completely different randomizing technique compared with Shuffle Index, which guarantees the privacy of the cloud contents.

3.1 Structure of Path Oram

The memory structure chosen for the cloud is a binary tree with height H and 2^(H−1) leaves. The implemented tree is perfect (complete and balanced), which implies that all leaves have the same distance from the root. The memory structure can be subdivided into levels, numbered from 0 to L, where 0 is the root level and L = H − 1 is the leaf level.

Each node in the tree contains a bucket, composed of B blocks where information can be stored. If a bucket is not full of real blocks, the remaining slots are filled with dummy blocks.


Definition 1.1 (Binary tree): the binary tree of height H = L + 1 is formed by 2^H − 1 nodes, each of them characterized by <id_i, b>, where id_i is the identifier, with i ∈ [0, 2^H − 2], and b a bucket.

Definition 1.2 (Bucket): let bucket[B] be a vector of dimension B filled with <id_b, z> blocks.

Definition 1.3 (Block): let <id_b, data> be a block of the tree, where id_b ∈ [0, B · (2^H − 2)] is the unique block identifier. The data field is the space dedicated to saving useful information.

There are 2^L leaves in the tree, and each one has a unique path obtainable using the function P(x), where x is the leaf identifier. This function retrieves all the node identifiers along the path from the root to leaf x. There is also a similar function P(x, l) that returns the identifier of the node at depth l along that path, starting from the top of the tree.
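For a perfect binary tree stored with the usual heap-style numbering (root = 0, children of node i at 2i + 1 and 2i + 2), P(x) and P(x, l) can be sketched as below. The numbering convention is an assumption made only for this illustration; the thesis does not fix one:

```python
def path_to_leaf(x, L):
    """P(x): the node identifiers from the root down to leaf x.

    x is the leaf index in [0, 2**L - 1]; L is the leaf level.
    Nodes are numbered heap-style: root = 0, children of node i are 2i+1 and 2i+2.
    """
    node = 2**L - 1 + x               # identifier of leaf x in heap numbering
    path = [node]
    while node > 0:
        node = (node - 1) // 2        # move to the parent
        path.append(node)
    return list(reversed(path))       # root first, leaf last

def node_at_level(x, l, L):
    """P(x, l): the identifier of the node at depth l on the path to leaf x."""
    return path_to_leaf(x, L)[l]

# Example with L = 2 (a 7-node tree): the path to leaf 3 is [0, 2, 6].
print(path_to_leaf(3, 2), node_at_level(3, 1, 2))
```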

On the client side there are two data structures:

• Stash: the purpose of the stash is to collect all the blocks that cannot be saved on the cloud. This can happen sometimes during the normal access procedures towards the server, and it is demonstrated that the worst-case size is O(log N) · ω(1) [1]. The stash contains a variable number of retrieved blocks <id_b, info>.

• Position map: the client stores all the pairs <id_b, x>, where x is a leaf identifier of the tree and id_b is a block identifier. This mapping is used during the search process to retrieve the position of a specific block. Every block, if not in the stash, must be present in one of the nodes along the path P(x).

On the server side the memory structure is a collection of nodes <id_i, b>. In fact, during the normal reading and writing processes on the cloud, the server continuously observes a fixed logical structure of identifiers, while the buckets inside the nodes change constantly.
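A minimal sketch of the client-side state, assuming the stash is a dictionary keyed by block identifier and the position map a dictionary from block identifiers to leaf identifiers (both representations, and the class name PathOramClient, are illustrative choices, not the thesis implementation):

```python
import random

class PathOramClient:
    """Client-side state of Path Oram: the stash and the position map (sketch)."""

    def __init__(self, num_blocks, leaf_level):
        self.L = leaf_level
        self.stash = {}      # id_b -> data, blocks not yet written back to the server
        # Every block starts associated with a leaf chosen uniformly at random.
        self.position = {b: random.randrange(2**leaf_level) for b in range(num_blocks)}

    def remap(self, block_id):
        """Return the current leaf of block_id and assign it a fresh random leaf."""
        old_leaf = self.position[block_id]
        self.position[block_id] = random.randrange(2**self.L)
        return old_leaf

client = PathOramClient(num_blocks=8, leaf_level=2)
print(client.remap(3), client.position[3])
```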



3.2 Theory of Path Oram

In this section, all the details of Path Oram and its working mechanisms are discussed. To begin with, it is useful to show the legend of the symbols.

N: total number of blocks outsourced to the server
H: height of the binary tree
L: leaf level of the binary tree
Z: block size (in bits)
B: capacity of each bucket (in blocks)
P(x): path from the root to leaf node x
P(x, l): node identifier at level l along the path P(x)
S: client's local stash
position: client's local position map
x := position[a]: block a is currently associated with leaf node x, i.e., block a resides somewhere along P(x) or in the stash

Table 3.1: Legend of Path Oram

The initial situation on the server side is a tree completely filled with random dummy blocks. On the client side all the auxiliary data structures, like the stash and the position map, are empty, because no real blocks are saved on the cloud yet.

The two main operations executable on the cloud are reads and writes, and both are performed by the Access function. The pseudo-code below shows how the function works step by step.


Algorithm 3.2.1: Path Oram Access algorithm

 1 Access(op, a, data*):
 2     x ← position[a]
 3     position[a] ← UniformRandom(0 ... 2^L − 1)
 4     for l ∈ {0, 1, ..., L} do
 5         S ← S ∪ ReadBucket(P(x, l))
 6     data ← Read block a from S
 7     if op = write then
 8         S ← (S − {(a, data)}) ∪ {(a, data*)}
 9     for l ∈ {L, L − 1, ..., 0} do
10         S′ ← {(a′, data′) ∈ S : P(x, l) = P(position[a′], l)}
11         S′ ← Select min(|S′|, Z) blocks from S′
12         S ← S − S′
13         WriteBucket(P(x, l), S′)
14     return data

There are three parameters that characterize the Access function:

• op: this variable describes the operation that the client wants to perform (read or write).

• a: it is the identifier of the block requested by the client, on which the read or write operation will be applied.

• data*: it is the new data to write on the cloud. In a read operation this parameter is not used.

The procedure of cloud accessing can be resumed in four main steps: 1. Remap block (line 2 to 3): the position of the needed block is

re-trieved and then it is remapped with a new leaf identifier. The old value is saved in x until the end of the procedure.

2. Read path (line 4 to 6): all the node buckets belonging to the path P (x) are read and placed into the stash. With this operation the pres-ence of the block a is forced inside the stash. In fact, before the Access function call, the a block only can be in two places: in cloud along the path P (x), or it is already in stash due to a previous operation.


3. Update block (lines 6 to 8): the block with identifier equal to a is retrieved from the stash. If the specified operation op is a write, the chosen block is updated with the new data* and saved again in the stash.

4. Write path (lines 9 to 13): this is the real concealing feature of the algorithm. Starting from the leaf level and going up to the root, for each node along the path P(x), the algorithm tries to fill up each bucket with all the possible blocks in the stash, following this condition: a block a′ ∈ stash can be placed in the node bucket P(x, l) if and only if the path P(position[a′]) intersects P(x) exactly at level l. In other words, for each a′ ∈ stash, it is checked whether its path P(position[a′]) has a common node with P(x) at depth l (∀a′ ∈ stash, if P(x, l) = P(position[a′], l) then a′ can be saved in P(x, l)). The maximum number of blocks per bucket is equal to B. If there are not enough blocks to fill the bucket, the remaining free slots are occupied by fake random blocks. The purpose of this procedure is to shuffle the content of the path P(x) every time the client accesses the cloud. Before a node is uploaded to the cloud, it must be encrypted. A minimal sketch of the whole access procedure is given right after this list.
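The following Python sketch illustrates the four steps above under simplifying assumptions: the server is a plain in-memory list of buckets, and encryption as well as dummy-block padding are omitted. The helper names (path, access) are illustrative and not part of the thesis implementation.

# Illustrative sketch of the Path Oram Access procedure (not the thesis code).
import random

def path(x, L):
    """Node indices of P(x) from the root (level 0) down to leaf x (level L),
    using the usual heap layout: root = 0, children of node i are 2i+1 and 2i+2."""
    i = (2 ** L - 1) + x           # index of leaf x
    nodes = []
    while True:
        nodes.append(i)
        if i == 0:
            break
        i = (i - 1) // 2
    return list(reversed(nodes))   # [root, ..., leaf]

def access(op, a, data_new, position, stash, server, L, B):
    # 1. Remap block (lines 2-3)
    x = position[a]
    position[a] = random.randrange(2 ** L)
    # 2. Read path (lines 4-6): move every block along P(x) into the stash
    for node in path(x, L):
        for blk_id, blk_data in server[node]:
            stash[blk_id] = blk_data
        server[node] = []
    data = stash.get(a)
    # 3. Update block (lines 7-8)
    if op == "write":
        stash[a] = data_new
    # 4. Write path (lines 9-13), from the leaf up to the root
    nodes_on_path = path(x, L)
    for level in range(L, -1, -1):
        node = nodes_on_path[level]
        eligible = [blk_id for blk_id in stash
                    if path(position[blk_id], L)[level] == node]
        chosen = eligible[:B]      # at most B blocks per bucket
        server[node] = [(blk_id, stash.pop(blk_id)) for blk_id in chosen]
    return data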


3.3 Functional Example

The example proposed below is intended to clarify any doubts pertaining to the Path Oram algorithm and its logic. A possible representation of a Path Oram tree with four-slot buckets (B = 4) and a possible stash are shown in Figure 3.1. For convenience, dummy blocks are omitted.

Figure 3.1: Oram tree

Starting from this configuration, a search on the server will be simulated, supposing a write request on the block R. The steps to retrieve the content are:

• Access function: the function call will be Access("write", a_R, "#").

• Retrieve and remap: the algorithm discovers the position of the needed block on the cloud and pairs it with a new leaf, for instance the one with id 201. The remapping will only take effect when the block is saved again on the server side, so for the moment the block still lies on the path P(x). The value of x is obtained through x ← position[a_R]; in this case x = 202.

• Stash insertions: the client saves into the stash all the buckets along the path P(202), as shown in Figure 3.2. During this phase the block a_R can only be in two places: either on the cloud or already in the stash. If a_R is already in the stash, it is still crucial to perform the search for security reasons (query frequency problem).

Figure 3.2: Oram tree after stash insertions

• Search in stash: the content of the block a_R is retrieved from the stash.

• Write: if the specified operation is a write, as in this case, the content of the retrieved block is replaced with "#", which is the value of the data* parameter. Otherwise there is no data manipulation.

• Saves: from the leaf 202 up to the root 001, the algorithm starts to fill up the buckets on the path P(202). For this example it is supposed that, at the end of this phase, the data structures appear as shown in Figure 3.3.


Figure 3.3: Oram tree after saves

It is possible to notice that the contents have changed. It is also important to mention that, thanks to the new pairing < #, 201 > of the accessed block, the algorithm is able to place it in node 101, because at that depth the old path P(202) and the new path P(201) converge, so the block can be saved on the cloud. If, for instance, the new path were P(204), the only common node would be the root 001.
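Using the access() sketch of Section 3.2, the write request of this example could be reproduced as follows. The setup values are illustrative: leaves 201-204 are numbered 0-3, so leaf 202 corresponds to index 1, and for brevity the server starts empty.

# Reproducing the example of Figure 3.1 with the access() sketch
# (illustrative setup, not the thesis implementation).
L, B = 2, 4                                     # leaf level and bucket capacity
server = [[] for _ in range(2 ** (L + 1) - 1)]  # 7 empty buckets: 001, 101, 102, 201-204
stash = {}
position = {"R": 1}                             # block R currently mapped to leaf 202

old_data = access("write", "R", "#", position, stash, server, L, B)
# After the call, R holds "#", is paired with a new random leaf, and has been
# written back to a node where the old path P(202) and its new path intersect.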

3.4 Complexity

The time and memory complexity are fundamental parameters for comparisons among algorithms. They also clearly show the costs and the resources needed to run the algorithm properly. In particular, the complexity discussion is subdivided between the client and the server side.

• Stash (client): this structure has a worst-case memory complexity of O(log(N) · B) and a time complexity for a search of O(log(N)), where N here denotes the number of blocks in the stash. It is important to mention that, by its nature, the stash is a data structure that continuously changes its dimension.

• Position: it is a map pairing block identifiers and leaf identifiers < id_b, x >. It has a size complexity of O(N) · Ω(b_id + x) and a time complexity of O(log(N)). This mapping function has a fixed maximum dimension of O((2^(L+1) − 1) · B) · Ω(b_id + x) when the binary tree is used at its maximum capacity.

• Cloud: it has a size complexity equal to O(2^(L+1) − 1) · Ω(node). The time complexity is equal to O(log(N)), or equivalently O(H), because the cloud is a perfect binary tree.

• Packet size: the dimension of the packet travelling through the network is Ω(node) = Ω(id_node + id_leftNode + id_rightNode + B · Ω(block)), where a block has size Ω(block) = Ω(id_b + data). A small worked example of these figures is reported below.
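As a concrete illustration of these formulas, the following snippet computes the number of server nodes and an approximate packet size for hypothetical parameters. The chosen byte sizes for identifiers and data are assumptions, not values taken from the thesis.

# Worked example of the complexity figures above (hypothetical parameters).
import math

N = 2 ** 20                          # outsourced blocks (assumption)
B = 4                                # bucket capacity, in blocks
L = math.ceil(math.log2(N))          # leaf level chosen so that there are at least N leaves
num_nodes = 2 ** (L + 1) - 1         # O(2^(L+1) - 1) nodes stored on the server

# Purely illustrative sizes:
id_size = 8                          # bytes per identifier
data_size = 4096                     # bytes of payload per block

block_size = id_size + data_size                 # Omega(id_b + data)
node_size = 3 * id_size + B * block_size         # Omega(id_node + children ids + B blocks)
path_transfer = (L + 1) * node_size              # one Access reads and rewrites L+1 nodes

print(f"nodes on the server:     {num_nodes}")
print(f"bytes per node (packet): {node_size}")
print(f"bytes moved per access:  {path_transfer}")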

3.5 Security & Privacy Considerations

This section deals with the security features of Path Oram and how they prevent the leakage of information.

The four key points to guarantee are: Data confidentiality, Data integrity, Access confidentiality, Pattern confidentiality.

Figure 3.4: Cypher Block Chaining (CBC) encryption mode

Data confidentiality and data integrity are guaranteed by an AES cipher. A 16-byte security key and the Cypher Block Chaining (CBC) encryption mode were chosen. This mode, shown in Figure 3.4, hides the content by partitioning it into 16-byte fragments (the AES block size, which in this configuration also equals the key length). The first fragment is XORed with an Initialization Vector (IV) and then encrypted. Each subsequent fragment is XORed with the previously encrypted fragment before being encrypted. A small illustrative example of this encryption step is shown below.
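The following snippet sketches how a node could be encrypted with AES-128 in CBC mode before being uploaded. PyCryptodome is used here only as an example library; the thesis does not necessarily rely on it, and the function names are illustrative.

# Illustrative AES-128-CBC encryption of a node before upload
# (PyCryptodome as example library; not the thesis implementation).
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = get_random_bytes(16)                # 16-byte secret key kept on the client

def encrypt_node(plaintext: bytes) -> bytes:
    iv = get_random_bytes(16)             # fresh random IV for every upload
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(pad(plaintext, AES.block_size))

def decrypt_node(blob: bytes) -> bytes:
    iv, body = blob[:16], blob[16:]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return unpad(cipher.decrypt(body), AES.block_size)

# The same bucket encrypted twice yields different ciphertexts,
# since a new random IV is drawn each time.
bucket = b"serialized bucket content"
assert decrypt_node(encrypt_node(bucket)) == bucket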

Access confidentiality and pattern confidentiality are guaranteed by the way in which the blocks are paired with the leaf identifiers. Each time the algorithm requests a block, the association < id_b, x > is changed. The new leaf identifier is chosen uniformly at random among the 2^L leaves of the tree. This means that the next time the block is uploaded to the cloud, its position will lie along a new random leaf path. Moreover, the height at which the block is placed is not predictable, making any intersection attack infeasible.

It is also important to explain why the stash does not grow out of all proportion. All the blocks are paired with a leaf chosen uniformly at random, which means that the client accesses the cloud uniformly at random, since every access reads a different path from the root to a leaf. This holds even if the client repeatedly requests the same block, because its < id_b, x > association changes constantly. At the same time, the stash cannot grow out of control, because each path is read and refreshed uniformly at random, so over multiple accesses it is always possible to find an intersection node where a stash block can be placed.


Chapter 4

Ring Oram

Ring Oram [4] is an advanced version of Path Oram. It was proposed for the first time at the 24th USENIX Security Symposium, held in Washington DC in 2015. The differences between Path Oram and Ring Oram are significant, and all of them pursue a single goal: reducing the bandwidth. In fact, many Oram algorithms require a great amount of resources to run, and this is one of their drawbacks because it makes them unaffordable. The authors of Ring Oram therefore worked on the efficiency and the cost reduction of Path Oram and pushed forward the state of the art of this protocol. The new algorithm achieves a 2.3x to 4x bandwidth reduction, as claimed by the authors. This result is obtained through a number of small stratagems that together make a substantial difference. During the detailed explanation of Ring Oram, all of these new solutions will be made clear and pointed out.

4.1 Structure of Ring Oram

The main structure of Ring Oram is a perfect binary tree (complete and balanced) with a height H and 2^H − 1 nodes. The tree can be subdivided into levels numbered from 0 to L, where 0 is the root level and L = H − 1 is the leaf level. It is a very simple and organized structure, easy to build and maintain, and it has useful properties. The binary tree is composed of nodes, inside each of which a bucket is placed.

Definition 1.1 (Binary tree): The binary tree of height H is formed by 2^H − 1 nodes, each of which is characterized by < id_i, b >, where id_i is the node identifier, with i ∈ [0, 2^H − 2], and b is a bucket.

Definition 1.2 (Bucket): Let Blocks[Z + S] be a vector of dimension B = Z + S filled with blocks. Z represents the number of potential real blocks per bucket, while S is the number of dummy blocks.

Definition 1.3 (Block): Let < z > be a block of the tree, where z is the encrypted information. The field z is obtained by applying z = E_k(info || randomPad), with E a symmetric encryption function, k the encryption key, and randomPad a value chosen at random during the encryption.

Also, inside a node, there are the following variables:

• Count: it is a counter of how many times a node has been accessed. It is useful for a new security feature of the algorithm; its application is described in the Theory section.

• Valids[Z + S]: it is a vector that represents the validity of each block (both real and dummy).

• Metadata: it is a memory section containing multiple variables. It is used by Ring Oram for bandwidth usage reduction and for choosing precisely which block to pick up from a node in the cloud.

– Addrs[Z + S]: every block needs to be identified with a key (or address) ∈ [0, (Z + S) · (2^H − 2)], and this vector contains the identifiers of all the blocks inside the bucket. The key in position i (with i ∈ [0, Z + S)) is the identifier of the block stored in position i of the node's Blocks vector.

– Ptrs[Z]: this vector represents the positions of the (potentially) Z real blocks inside the vector Blocks[Z + S] of the node. In fact, the positions of real and dummy blocks in a node bucket are randomized, and the Ptrs vector records where the real ones are placed. The implementation of this shuffling feature is explained in the Theory section.

– Leafs[Z + S]: this vector contains the leaf labels of all the blocks in the bucket.


• Blocks[Z + S]: here are stored all the real and dummy blocks of the node. Each block is composed of:

– data: in this field is saved the real content to be stored on the cloud.

– randomPad: this field is important for security reasons. It is generated completely at random before a block is saved on the cloud. This guarantees that encrypting the same block twice never produces the same ciphertext.

• idNode: it is the identifier of the node in the cloud.

• idLeftNode: it is the identifier of the left child.

• idRightNode: it is the identifier of the right child.

A minimal sketch of this node layout is given right below.
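The following Python sketch summarizes the node layout just described. The field names mirror the ones listed above, while the class name RingOramNode and the default values are illustrative assumptions, not the thesis code.

# Illustrative sketch of a Ring Oram node (field names follow the description
# above; the class itself is not the thesis implementation).
import random

class RingOramNode:
    def __init__(self, node_id, left_id, right_id, Z, S):
        self.idNode = node_id
        self.idLeftNode = left_id
        self.idRightNode = right_id
        self.count = 0                       # accesses since the last reshuffle of the bucket
        self.valids = [True] * (Z + S)       # validity flag of every slot (real and dummy)
        # Metadata section
        self.addrs = [None] * (Z + S)        # block identifier (address) of every slot
        self.ptrs = [None] * Z               # slots holding the (up to Z) real blocks
        self.leafs = [None] * (Z + S)        # leaf label of every slot
        # Payload: Z+S slots, each holding (data, randomPad); encrypted before upload
        self.blocks = [(b"", random.getrandbits(64)) for _ in range(Z + S)]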

For the algorithm to work correctly, all the blocks are paired with a leaf identifier. In fact, as in Path Oram, the two data structures position and stash are still present. They maintain their role in the algorithm, but with some small changes:

• Stash: the purpose of the stash is to collect all the blocks that cannot be saved on the cloud. This occasionally happens during the normal access procedures to the server, and it has been demonstrated that the worst-case size is O(log(N) · Ω(stashElem)) [4]. The stash contains a variable number of < id_b, stashElem > elements. The stashElem is a new data structure used to save all the important information of a block: due to the different requirements, it is necessary to keep in the stash the block identifier id_b, the block content data and the leaf identifier leaf associated with the block.

So stashElem has the following fields:

– data: the variable where the useful information of a block is stored.

– addr: the variable where the block identifier is stored.

– leaf: the variable where the leaf identifier is stored.

• Position map: in the client, all the pairs < id_b, l > are stored, where l is a leaf identifier of the tree and id_b is a block identifier. This mapping is used, as in Path Oram, to retrieve the path along which a specific block is stored. A minimal sketch of these client-side structures is given below.
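A minimal Python sketch of the client-side structures just described is reported below; the name StashElem and the use of plain dictionaries are illustrative assumptions, not the thesis implementation.

# Illustrative sketch of the Ring Oram client-side structures
# (names and types are assumptions, not the thesis code).
from dataclasses import dataclass

@dataclass
class StashElem:
    data: bytes      # useful information of the block
    addr: int        # block identifier id_b
    leaf: int        # leaf identifier currently associated with the block

stash = {}           # id_b -> StashElem
position = {}        # id_b -> leaf identifier l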
