
DEPARTMENT OF INFORMATION ENGINEERING
UNIVERSITY OF PISA

Master of Science in Computer Engineering

A blockchain-based approach for digital archives: a decentralized Ethereum application for the safeguarding of the cultural heritage.

Supervisors:

Prof. Gianluca Dini

Prof.ssa Cinzia Bernardeschi

Candidate: Mariano Basile


Abstract

The recordkeeping process in an accounting system usually results in a digital archive, i.e. a collection of data which aims at fulfilling an organization's specific needs.

In the case of a structured collection of data, a digital archive is typically maintained as a database, usually under the supervision of a single organization.

There are several issues related to these kinds of solutions.

First, the stored data are controlled by specific organizations. Customers must therefore trust the data being exposed, which in turn means trusting the organizations exposing them.

Second, there is the problem of the single point of failure. This was particularly common in Web 1.0, mainly due to failures of the node running the database service or to attacks against the node storing the data.

Things did not get much better with Web 2.0. Cloud providers offer cheap, sometimes even free, scalable services, but the way scalability and fault tolerance are achieved is unsatisfactory: it comes at the cost of providers owning the content and profiling users, i.e. recording which content gets accessed, by whom and when, so as to turn it into a profit. Moreover, this has made surveillance and censorship easier.

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism. This database plays a significant role in the spreading of knowledge and in the promotion of the Italian national heritage.

The goal of this work is to design and develop a distributed solution, efficient both from an economic and a performance viewpoint, able to overcome the above-mentioned problems while ensuring authenticity, privacy, integrity and long-term data availability.

A blockchain-based architecture has been chosen so as to guarantee three of the four properties above, namely authenticity, integrity and availability. In particular, Ethereum has been considered because it is a public, Turing-complete blockchain.

Moreover, an off-chain data storage approach has been adopted, by means of a content-addressed, versioned, peer-to-peer distributed file system named the InterPlanetary File System (IPFS).

It is worth highlighting that the solution still requires a database service, since we are dealing with structured data on which searches still have to be performed. In this respect, the choice was to use a document-store NoSQL database called Elasticsearch.

Finally, the privacy requirement has been achieved at the application level by storing data, or part of them, in encrypted form.


Contents

Abstract

Introduction

1 Digital archives

2 The safeguarding of cultural heritage use case
2.1 “Il catalogo generale dei beni culturali”
2.2 How cultural property is described: the ICCD standards
2.3 ICCD Standard: “schede di catalogo” version 3.00
2.4 ICCD open-data
2.5 Gathering ICCD open-data
2.5.1 ICCD Standard: “schede di catalogo” OA/RA version 3.00

3 From ICCD open-data to digital archives
3.1 The “ELK” stack
3.2 Elasticsearch indices as digital archives
3.3 Deploying a containerized dev-environment with Docker
3.4 Storing artworks: the “oa3_0” Elasticsearch index
3.5 Storing archeological exhibits: the “ra3_0” Elasticsearch index

4 The blockchain and its use for digital archives
4.1 Blockchain in a nutshell
4.2 From Nakamoto’s blockchain to the Ethereum blockchain
4.3 Ethereum accounts: Externally Owned Accounts vs contracts
4.4 Ethereum state
4.5 How code is executed in Ethereum
4.6 The application binary interface: ABI
4.7 The key mechanism underlying the security of Ethereum: Gas
4.8 Logs vs Storage
4.9 The role of the blockchain for digital archives

5 A blockchain-based approach for digital archives
5.1 Towards a distributed and efficient solution
5.1.1 First attempt: (raw) Archive contents accounting on the blockchain
5.1.2 Second attempt: Off-chain archive contents storage + Merkle DAG accounting on the blockchain
5.1.2.1 The InterPlanetary File System: IPFS
5.1.2.2 Second attempt: Overview
5.1.3 Third attempt: Off-chain operations storage + IPFS links accounting on the blockchain
5.1.3.1 Some considerations
5.1.4 Partial proposed solution: Off-chain operations storage + IPFS links accounting on the blockchain using logs
5.1.4.1 Some considerations
5.1.5 Proposed solution

6 A decentralized Ethereum application for the safeguarding of the cultural heritage
6.1 Decentralized Ethereum applications
6.2 Proposed architectural solution
6.2.1 Containerized dev-environment with Docker: remaining services
6.3 Finalizing data migration: part I
6.4 Dapp backend
6.4.1 Compiling the CulturalGood smart contract
6.4.2 Testing the CulturalGood smart contract
6.4.3 Deploying the CulturalGood smart contract
6.5 Dapp frontend
6.5.1 Insert Interface
6.6 Dapp: Get up and running
6.6.1 Setup Metamask
6.6.2 Launch lite-server
6.7 Interacting with the Dapp
6.7.1 Finalizing data migration: part II
6.7.2 Insert a new cultural property
6.7.3 Search cultural property
6.7.4 Show details about a specific cultural property
6.7.5 Delete a cultural property
6.7.6 Update information about a cultural property
6.8 Restore “oa3_0”/“ra3_0” Elasticsearch indices
6.9 GitHub hosting
6.10 Future work

Appendix 1: Lookup table (Cultural good typology to uint8)


Introduction

The recordkeeping process in an accounting system usually results in a digital archive, i.e. a collection of data which aims at fulfilling an organization's specific needs.

In the case of a structured collection of data, a digital archive is typically maintained as a database, usually under the supervision of a single organization.

There are several issues related to these kinds of solutions.

First, the stored data are controlled by specific organizations. Customers must therefore trust the data being exposed, which in turn means trusting the organizations exposing them.

Second, there is the problem of the single point of failure. This was particularly common in Web 1.0, mainly due to failures of the node running the database service or to attacks against the node storing the data.

Things did not get much better with Web 2.0. Cloud providers offer cheap, sometimes even free, scalable services, but the way scalability and fault tolerance are achieved is unsatisfactory: it comes at the cost of providers owning the content and profiling users, i.e. recording which content gets accessed, by whom and when, so as to turn it into a profit. Moreover, this has made surveillance and censorship easier.

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism. This database plays a significant role in the spreading of knowledge and in the promotion of the Italian national heritage. At the same time, it helps in the deployment of countermeasures aimed at preventing crimes and natural disasters affecting that heritage.

The goal of this work is to design and develop a distributed solution, efficient both from an economic and a performance viewpoint, able to overcome the above-mentioned problems while ensuring authenticity, privacy, integrity and long-term data availability.

By authenticity we mean letting only authorized entities record new cultural property. Integrity means protecting the attributes describing each specific cultural good. Privacy means hiding some of these attributes, when necessary. Finally, availability means guaranteeing that information remains available in the very long run.


We have chosen a blockchain-based architecture so as to guarantee three of the four properties above, namely authenticity, integrity and availability.

In particular, we have considered Ethereum because it is a public, Turing-complete blockchain.

By means of an Ethereum smart contract, an archive can operate programmatically without the need for ownership or control by a particular entity. Furthermore, smart contracts have their own storage, which means that the data comprising the archive can, in principle, be stored on the blockchain. However, precisely because these computations, including the storage, have to be replicated on each node of the Ethereum network in order to be secured, storing data is really expensive.

In addition, as soon as the size of the archive becomes non-negligible, the time required to retrieve and store the new archive contents may lead to high latencies. Things get even worse considering that accounting is required every time a new operation alters the archive's current state. The proposed solution addresses both issues.

The performance concern has been solved by accounting, each time, only for the specific operation (insert/update/delete) and the associated data, instead of the whole archive contents. Furthermore, this kind of accounting has been performed using an off-chain data storage approach, by means of a content-addressed, versioned, peer-to-peer distributed file system named the InterPlanetary File System (IPFS).

This has enabled storing on the Ethereum blockchain just the immutable, permanent IPFS links, in the form of the cryptographic hash of each operation together with its associated data, i.e. a fixed amount of bytes (32).

This idea has been further explored so as to end up with the cheapest solution obtainable: in the end, no smart-contract storage has been used at all; instead, the IPFS links have been stored in Ethereum logs.

It is worth highlighting that the solution still requires a database, since we are dealing with structured data on which searches still have to be performed. In this respect, the choice was to use a document-store NoSQL database called Elasticsearch.

Moreover, the following assumption has been made: nodes in the Elasticsearch cluster are not Byzantine, i.e. they are not malicious. This has allowed reflecting each insert/update/delete operation on Elasticsearch right before propagating it off-chain, letting search operations involve only Elasticsearch. In other terms, the latter acts as a front-end cache system.

Under this assumption, the archive gets reconstructed only in the case of node failures within the Elasticsearch cluster. This is achieved by reading the IPFS links from the blockchain and hence retrieving operations and associated data from IPFS.

Finally, the privacy requirement has been achieved at the application level by storing data, or part of them, in encrypted form.


Chapter 1

Digital archives

We are currently living in an era in which basically everything, either physical or not, has some form of digital representation. We usually refer to a collection of digital content as a digital archive.

In the specific case of structured, related digital information, a digital archive is usually maintained as a database. Data are structured differently according to the type of database considered: traditional SQL databases structure data in tables, while NoSQL databases use other formats, such as JSON documents and key-value pairs. In any case, the recordkeeping process is usually performed by a single organization, either public or private, to fulfill its specific needs. As already mentioned, there are several issues related to these kinds of solutions.

What this work aimed at finding is a distributed and efficient solution that solves these issues. Distributed solutions have proven extremely robust in terms of offering zero-downtime, fully fault-tolerant operation. In a properly decentralized network, where content is stored redundantly across many nodes, there is very little chance that all the nodes that redundantly store the data are down at the same time.

The solution has been developed with the requirement that it possess the following relevant features:

• authentication: for ensuring the trustworthiness of the creator of data;

• integrity: to guarantee that the data being stored do not get modified by unauthorized entities;

• availability: to provide long-term access;

• privacy: to hide some of the attributes describing a cultural good, when necessary.


Chapter 2

The safeguarding of cultural heritage use case

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism.

2.1 “Il catalogo generale dei beni culturali”

The so-called “Catalogo Generale dei Beni Culturali” is a database which collects and organizes, in a centralized fashion, data about Italian cultural goods. It responds to protection and promotion purposes and, at the same time, it helps in the deployment of countermeasures aimed at preventing crimes and natural disasters affecting the Italian national heritage. The cataloging is aimed at identifying and describing all cultural property for which a specific artistic, historical, archeological or anthropological interest has been recognized.

It can be consulted through a web application [1].


2.2 How cultural property is described: the ICCD standards

Cultural goods are described by means of specific standards. The “Central Institute for Cataloguing and Documentation” (aka ICCD) coordinates the research for the definition of these standards [2]. Different kinds of standards have been defined, among which the so-called “schede di catalogo” (catalog records) play the most important role.

“Schede di catalogo” get classified according to the following criteria:

• category (3): movable property, immovable property, intangible property.

• disciplinary sector (9): archeological goods, architectonical goods, anthropological goods, photographic goods, musical goods, naturalistic goods, numismatic goods, scientific and technological goods, artistic and historical goods.

• typology (30): archeological monuments, tables of archeological supplies, stratigraphic essays, archeological facilities, archeological sites, numismatic-archeological goods, archeological exhibits, anthropological exhibits, parks/gardens, architecture, intangible anthropological goods, tangible anthropological goods, extra-European tangible anthropological goods, photographic funds, photos, musical instruments - organ, naturalistic goods - paleontology, naturalistic goods - mineralogy, naturalistic goods - zoology, naturalistic goods - petrology, naturalistic goods - botanic, numismatic goods, scientific and technological heritage, contemporary artworks, drawings, artworks, stamped matrices, ancient and contemporary clothes, artistic and historical goods - numismatic, printings.

In the most general terms, we can state that “schede di catalogo” are basically sets of attributes which take into account the peculiarities and intrinsic key features of each cultural good.


2.3 ICCD Standard: “schede di catalogo” version 3.00

Reporting the full ICCD “schede di catalogo” standards is out of scope, but we can present their general structure according to the currently used version, that is, 3.00.

Each “scheda di catalogo” is organized into different sets of homogeneous data called paragraphs. Each paragraph contains fields: a field can be basic, i.e. an individual entry to fill in, or structured, i.e. an element made up of different subfields to be filled in as well.

Paragraphs, basic fields, structured fields and subfields are represented according to specific graphical formalisms and conventional definitions, as shown in the following schema:

Figure 2.3: ICCD standard: “schede di catalogo”. Formalisms being used.

In particular, for each field or subfield the standard specifies the following properties:

• length: the number of characters available when filling out the element;

• repeatability: indicates whether or not the element can be repeated, so as to record multiple pieces of information of the same type for that element;

• mandatory: indicates whether or not the element has to be filled out. Some elements have to be filled out in any case; this is indicated by the symbol ’*’. Other elements have to be filled out only when specific conditions are met; this is indicated by the symbol ’(*)’.

• dictionary: indicates that a specific set of terms has to be used when filling out the element. In the case of a closed dictionary, indicated by the letter ’C’, the list of terms is defined a priori. In the case of an open dictionary, indicated by the letter ’A’, the list of terms is also defined a priori, but it can be expanded by the cataloguer when filling out the element. In all other cases, elements are considered free-text fields.

• visibility: in order to regulate the public diffusion of catalogued data, each element has a predefined level of visibility, according to the possibility that the element may or may not contain confidential information. The available levels of visibility are specified in the following schema:

1 - low level of confidentiality: the piece of information is made publicly available

2 - medium level of confidentiality: privacy concerns; personal data regarding private entities

3 - high level of confidentiality: privacy and safeguarding concerns; data are considered confidential because they allow the precise localization of the cultural property

Table 2.1: ICCD standard: “schede di catalogo”. Fields’ visibility levels.

The effectiveness of these three levels of visibility is related to a specific access profile value assigned to each cultural good by the cataloguer. The access profile value gets specified when the cultural good is stored in the database, within the following:

• paragraph: “AD - ACCESSO AI DATI”

• structured field: “ADS”

• subfield: “ADSP”

A closed dictionary taking on the values ’1’, ’2’, ’3’ is used for specifying the access profile value. The meaning of these values is described below:

• ’1’: the content of all fields can be made publicly available, irrespective of their own visibility level;

• ’2’: only the content of fields with a visibility level equal to ’1’ or ’3’ can be made publicly available;

• ’3’: only the content of fields with a visibility level equal to ’1’ can be made publicly available;

A particular case is represented by elements having a visibility level equal to ’0’. This is the case for data specifying economic estimates or for stocktaking-related information. Fields or subfields with a visibility level equal to ’0’ are never released publicly. These rules are summarized in the sketch below.
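As an illustration only, the following JavaScript sketch encodes the visibility rules just described; the function name and the encoding of the rules are ours, not part of the ICCD standard.

```javascript
// Sketch: decide whether a field may be published, given its visibility
// level (0-3) and the access profile value ('1'..'3') of the cultural good.
// Encodes the rules of Section 2.3 as we read them; illustrative only.
function isFieldPublic(fieldVisibility, accessProfile) {
  if (fieldVisibility === 0) return false;          // never released publicly
  switch (accessProfile) {
    case '1': return true;                          // every field is public
    case '2': return fieldVisibility === 1 || fieldVisibility === 3;
    case '3': return fieldVisibility === 1;
    default: throw new Error('Unknown access profile: ' + accessProfile);
  }
}

// Example: with access profile '3', localization data (visibility 3) stay hidden.
console.log(isFieldPublic(3, '3')); // false
console.log(isFieldPublic(1, '3')); // true
```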

The standard just presented has been used for different purposes in the development of the solution; more on this later.


2.4 ICCD open-data

Around May 2016, the ICCD launched a project [3] which aims at sharing (raw) data about the cultural goods being catalogued. The open data being exposed constitute a subset of all the information stored and available in the “Catalogo Generale dei Beni Culturali”.

In particular, the datasets refer only to state-owned goods and are organized according to the following three criteria:

• the region in which the properties are located;

The other two criteria are related, instead, to the ICCD standard “scheda di catalogo” used for describing cultural property. These other two criteria are:

• the typology: one of the thirty (30) typologies presented in Figure 2.2;

• the version number: either 2.00 or 3.00;

2.5 Gathering ICCD open-data

Gathering ICCD open-data has been the very first step: in this way, a clone of the “Catalogo Generale dei Beni Culturali” could be created. As already mentioned, the exposed open data refer to state-owned goods only, hence it was not possible to create a full clone.

That, however, was not even a requirement: we just needed data to work on.

For this reason, ICCD open-data related to one specific region only have been considered. The selected region was, naturally, Tuscany. The available open data refer to the following typologies:

• archeological exhibits: described by means of the ICCD standard “scheda di catalogo” RA according to the version 2.00 and 3.00;

• artworks: described by means of the ICCD standard “scheda di catalogo” OA according to the version 2.00 and 3.00;


In both cases, as shown in the previous Figure 2.2, the currently used version is 3.00. For this reason, we gathered open data [4][5] related to this specific version.

Figure 2.4: Gathered ICCD open data

At that point, ICCD open data were available in the form of two .csv files:

• regione-toscana_RA3.00.csv: containing archeological exhibits;

• regione-toscana_OA3.00_0.csv: containing artworks;

2.5.1 ICCD Standard: “schede di catalogo” OA/RA version 3.00

In order to understand the meaning of each field available in the two gathered .csv open-data files, an in-depth understanding of both ICCD standards, “schede di catalogo” OA/RA version 3.00 [6][7], was required.


Chapter 3

From ICCD open-data to digital archives

Once ICCD open-data were gathered, the very next step was to actually create the reduced-size clone of the “Catalogo Generale dei Beni Culturali”. In other terms, ICCD open data were required to be stored in some kind of database.

We knew that, because of the CAP theorem, there’s no way to create a fully distributed database which ensures, at the same time, consistency, availability and partition tolerance.

At that point, a choice had to be made. With the aim of designing a distributed solution, the selection of the kind of database to use was restricted to NoSQL databases, since strictly SQL DBMSs primarily provide consistency.

In the end, the choice was to use a document-store NoSQL database called Elasticsearch.

3.1 The “ELK ” stack

The general idea behind the creation of the reduced-size clone is to:

• retrieve records from each of the two .csv files in which open data are stored;

• parse the retrieved records according to the involved ICCD standard;

• store the results into a specific Elasticsearch index.


For this purpose we relied upon the "ELK " stack.

Figure 3.1: “ELK” stack

“ELK” is the acronym for three open-source projects: Elasticsearch, Logstash, and Kibana. Specifically:

• Logstash is a server-side data processing pipeline that ingests data from multiple sources (like a .csv file, for example), transforms it, and then sends it to a “stash”, like Elasticsearch.

• Elasticsearch allows storing, searching, and analyzing data;

• Kibana lets users visualize Elasticsearch data and navigate the Elastic Stack.

3.2 Elasticsearch indices as digital archives

When a piece of information is stored into Elasticsearch, it is stored in something called an Elasticsearch index.

The latter is a collection of documents that have somewhat similar characteristics. An index is identified by a name, and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

A document is a basic unit of information that can be indexed. It is expressed in JSON, a ubiquitous internet data interchange format.

The collection of one or more nodes that together hold the entire data and provide federated indexing and search capabilities is called an Elasticsearch cluster. A cluster is identified by a unique name, which by default is “elasticsearch”.

This said, the idea was to store the result of the parsing of each record available in the .csv files into a specific Elasticsearch index. Specifically, we have used two Elasticsearch indices.


One was aimed at storing artworks, the other at storing archeological exhibits.

Basically, each of the two Elasticsearch indices represents a digital archive that stores a structured collection of data. In practice, in order to actually accomplish this task, a dev-environment has been set up.
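To make the index/document terminology concrete, here is a minimal sketch (ours, with invented field values) of how a parsed record could be indexed into the “oa3_0” index through Elasticsearch’s REST document API, assuming a version where documents live under the _doc endpoint:

```javascript
// Sketch: index one parsed record as a JSON document into the "oa3_0" index
// via the standard document API (PUT /<index>/_doc/<id>).
// The field values below are invented for illustration.
const doc = {
  CODICE_UNIVOCO: '0900000001',  // national unique identifier (made-up value)
  OGTD: 'dipinto',               // object definition
  LDCN: 'Galleria degli Uffizi'  // location name
};

fetch('http://localhost:9200/oa3_0/_doc/0900000001', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(doc)
})
  .then(res => res.json())
  .then(body => console.log(body.result)); // "created" or "updated"
```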

3.3 Deploying a containerized dev-environment with Docker

What we needed at that time was a dev-environment with the “ELK” stack available. To this aim, the “ELK” stack has been deployed within a containerized dev-environment with Docker. In practice, not only the “ELK” stack, but the entire architectural solution, which will be presented in Chapter 6.2, has been deployed within this environment. Building and deploying applications with Docker containers has proven to be faster:

• Docker containers wrap up software and its dependencies into a standardized unit for software development that includes everything needed to run: code, runtime, system tools and libraries. This guarantees that containerized applications always run the same whatever the underlying computing environment.

• Different containers can run on the same machine, sharing the OS kernel, each running as an isolated user-space process.

In our specific case, what we needed to build was a multi-container Docker application. For this reason, the Docker Compose tool has been used.

Specifically, not only the different pieces that strictly make up the architectural solution, but also all the additional (but required) services were defined in the ad-hoc created docker-compose.yaml file.

This was indeed the case for the two additional services aimed at creating the reduced size clone. To be precise, the task accomplished by each of these two services consisted in:

• retrieving records from a specific .csv file;

• parsing the retrieved records according to the involved ICCD standard;

• storing the results into the specific Elasticsearch index.

These two services have been called “openiccd_oa_migration” and “openiccd_ra_migration”. Each of the two represents a Logstash instance aimed at the purposes just mentioned. The services, as defined in the docker-compose.yaml file, are shown in Figure 3.2.


Figure 3.2: docker-compose.yaml: openiccd_oa_migration and openiccd_ra_migration services

These two represent the “L” part of the “ELK” stack. Of course, the two remaining components were required too. For this reason, two other services were defined, named “elasticsearch” and “kibana”.

Figure 3.3: docker-compose.yaml: elasticsearch and kibana services

The “elasticsearch” service represents the Elasticsearch endpoint, running on localhost:9200, in which ICCD open data get stored, searched and retrieved.

The “kibana” service was used, instead, as a window into Elasticsearch. It runs as a web application at localhost:5601.

It is worth highlighting that these two services are part of the architectural solution itself, i.e. they are not additional services.


3.4 Storing artworks: the “oa3_0” Elasticsearch index

The openiccd_oa_migration service executes according to an ad-hoc defined Logstash .conf file called “logstash-oa3_0.conf”. The behavior of the Logstash pipeline defined in this file is outlined below:

• retrieve records from regione-toscana_OA3.00_0.csv;

• parse retrieved records according to the ICCD standard “schede di catalogo” OA version 3.00;

• store the results into a specific Elasticsearch index, called “oa3_0 ”.

Figure 3.4: openiccd_oa_migration service

regione-toscana_OA3.00_0.csv comes with 75,585 records, 51,934 images in the form of URLs, and 46 attributes for each record.

What we expected from the service was to store all 75,585 records in the “oa3_0” Elasticsearch index.

Before presenting the result of the execution, it must be highlighted that, because images came in the form of URLs, we were required to download the images too. This was necessary in order to store into Elasticsearch the complete set of information related to each record.

Indeed, the output element of the Logstash pipeline was programmed to download an image, from the specified URL, every time a record with a value in the “IMG” field was retrieved from the .csv.

In general, binary data can be stored into Elasticsearch. However, the latter is not optimized for dealing with this type of data. For this reason, the idea was to store images in some form of distributed file system, which we will present in Chapter 5.1.2.1. Therefore, what we stored in Elasticsearch were the corresponding identifiers only.

However, since there was no way to add images to the chosen distributed file system from within the openiccd_oa_migration service, the two tasks just mentioned had to be performed later in time.

This meant the following: a way to link each downloaded image with the Elasticsearch document in which its identifier had to be stored was required.

In this regard, it is worth highlighting that each record in the .csv files comes with an identifier, namely the value of the attribute “CODICE_UNIVOCO”. This identifier uniquely identifies the specific cultural good at the national level.


Hence, the solution was to compute a digest of this identifier and use this digest both as the image name and as the “document_id” in Elasticsearch. Basically, each document in Elasticsearch has a “document_id” which uniquely identifies it.

The original image extension was also stored in Elasticsearch, so as to be able to correctly render the image when required.

While computing digests, however, the code raised the following exception:

Figure 3.5: Ruby runtime exception during data migration

At that point, I looked at the Logstash fingerprint filter plugin codebase and found out, from the last commit message, that support for non-keyed regular hash functions had been added only a few months earlier, as shown in the following:

Figure 3.6: Logstash fingerprint filter plugin commit message

For this reason, in order to solve the issue, I generated a 128-bit HMAC key by relying on a CSPRNG, so as to use a keyed hash function.


Specifically I relied on the “crypto” module exposed by Node.js. The following script was developed to accomplish this task:

Figure 3.7: Generating a 128-bit HMAC key
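A minimal sketch of what such a key-generation script could look like, relying on the CSPRNG behind crypto.randomBytes (the hex output format is an assumption):

```javascript
// Sketch: generate a 128-bit (16-byte) HMAC key with Node.js' CSPRNG.
// The hex encoding is an assumption; any encoding accepted by the Logstash
// fingerprint filter's "key" option would do.
const crypto = require('crypto');

const hmacKey = crypto.randomBytes(16).toString('hex'); // 128 bits of entropy
console.log(hmacKey); // value to paste into the fingerprint filter's key option
```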

At that point, the HMAC key was specified in the Logstash .conf file and indeed the exception was not raised anymore. In the end we had:

• images downloaded and stored into a folder.

• data, obtained from the parsing of the remaining fields of each record, indexed into the “oa3_0” Elasticsearch index.

These two items explain the rightmost part of Figure 3.4; the reason behind the icon representing a JSON file will be explained in Chapter 6. It is worth pointing out that, even though each record was described by “only” 46 attributes (out of roughly 400 defined in the ICCD standard “schede di catalogo” OA version 3.00), an Elasticsearch template was defined with a “mappings” section (as well as a “settings” section and a “simple pattern template”) that reflects exactly what the standard defines.

A mapping defines how a document and the fields it contains are stored and indexed. It is also usually referred to as the schema.


The template gets automatically applied as soon as the first record gets indexed into Elasticsearch, as shown below:

Figure 3.8: openiccd_oa_migration service: Template being installed

Now we can actually present the result of the execution: data migration ended successfully, the Elasticsearch index “oa3_0” was created and 75,582 out of 75,585 records were indexed. The 3 records which were not indexed did not respect the .csv format. A summary is reported in the following figures:


Figure 3.10: Available Elasticsearch indices after artworks opendata migration

Figure 3.11: Template applied with “oa3_0 ” index

It is fair to say that during this run images were not actually downloaded, so as to avoid storing more than 50K files on the machine. Once the evidence of all records being correctly indexed was obtained, we selected just the first 100 records and re-ran the migration, this time letting images be downloaded too. Migration results are shown below:

Figure 3.12: openiccd_oa_migration service: Number of successfully downloaded images (first 100 records considered)


Figure 3.13: openiccd_oa_migration service: Number of successfully indexed documents (first 100 records considered)

3.5 Storing archeological exhibits: the “ra3_0” Elasticsearch index

The openiccd_ra_migration service executes according to a second, ad-hoc created Logstash .conf file, called “logstash-ra3_0.conf”. The behavior of the Logstash pipeline defined in this file is outlined below:

• retrieve records from regione-toscana_RA3.00_0.csv;

• parse retrieved records according to the ICCD standard “schede di catalogo” RA version 3.00;

• store the results in the “ra3_0 ” Elasticsearch index.

Figure 3.14: openiccd_ra_migration service

regione-toscana_RA3.00_0.csv comes with 77 records, 43 images in the form of URLs, and 46 attributes for each record.

What we expected from the openiccd_ra_migration service, instead, was to index all 77 records in the “ra3_0” Elasticsearch index.


Also in this case, an Elasticsearch template with a “settings” section, a “mappings” section and a “simple pattern template” was defined.

The mapping in this case was defined so as to resemble the ICCD standard “schede di catalogo” RA version 3.00.

As in the previous case, the template gets applied as soon as the first record gets indexed into the “ra3_0” Elasticsearch index, as can be seen in the figure below:

Figure 3.15: openiccd_ra_migration service: Template being installed

The result of the execution is presented below: also in this case data migration ended successfully, i.e. the “ra3_0” Elasticsearch index was created, all 77 raw records were indexed and all 43 images were downloaded.


Figure 3.17: openiccd_ra_migration service: Number of successfully downloaded images

Figure 3.18: Available indices into Elasticsearch after archeological exhibit open data migration


Chapter 4

The blockchain and its use for digital archives

The aim of this chapter is to briefly present what a blockchain is and its main underlying concepts. Special attention is then paid to “new generation” blockchains (aka blockchain 2.0) and, in particular, to Ethereum.

4.1 Blockchain in a nutshell

Roughly at the end of 2008, Satoshi Nakamoto came up with the idea of a peer-to-peer digital currency named Bitcoin. At that point, some kind of database was required in order to record how much money everyone has at each point in time, that is, how much money anyone is allowed to spend at any given moment.

Solving this issue in a decentralized way, in the spirit of systems like BitTorrent, was actually a very hard computer science problem: even without a central authority, trust and security still had to be guaranteed.

In order to have security, the idea of economic incentives came up; in order to have economic incentives, a currency was required. Nakamoto came up with the first solution that was really practical in this kind of open, permissionless context. The solution was named Nakamoto’s blockchain. It enabled peer-to-peer transactions in a decentralized network and, at the same time, established trust among unknown peers.

From a data structure point of view, a blockchain can be seen as a linked list built using hash pointers. The way this linked list is built is such that if someone alters the data stored in any of the elements of the list, called blocks, the modification gets detected: the hash pointer in the following block becomes incorrect, and so on down the chain, as the sketch below illustrates.
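A toy example (ours, not taken from any client implementation) linking blocks through SHA-256 hash pointers and detecting tampering:

```javascript
// Sketch: a toy hash-pointer chain. Each block stores the hash of the
// previous block, so altering any block's data invalidates all later links.
const crypto = require('crypto');

const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

function appendBlock(chain, data) {
  const prevHash = chain.length ? chain[chain.length - 1].hash : '0'.repeat(64);
  chain.push({ data, prevHash, hash: sha256(prevHash + data) });
}

function verify(chain) {
  return chain.every((b, i) => {
    const prevHash = i === 0 ? '0'.repeat(64) : chain[i - 1].hash;
    return b.prevHash === prevHash && b.hash === sha256(prevHash + b.data);
  });
}

const chain = [];
appendBlock(chain, 'tx: A pays B');
appendBlock(chain, 'tx: B pays C');
console.log(verify(chain)); // true
chain[0].data = 'tx: A pays D'; // tamper with an old block...
console.log(verify(chain)); // false: the modification is detected
```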

From a high-level point of view, a blockchain can be seen as a decentralized, distributed database of immutable transactions. Every new transaction is validated across the distributed network before it is stored in a block.


In other terms, a blockchain provides secure computation in a sandboxed environment. It has to be a sandbox, since every node has to compute exactly the same output starting from the same input, in a way that is completely deterministic. Transactions are moreover protected by strong cryptographic algorithms.

What a blockchain aims to provide is a shared single source of truth. The truth is shared because it is distributed among all the nodes in the network: every participant stores its own copy of it.

Participants are encouraged to keep their copies synchronized by an economic incentive mechanism, which steers participants into following the rules defined by the protocol.

4.2 From Nakamoto’s blockchain to the Ethereum blockchain

The problem with Bitcoin is that it is essentially a value transfer system: anything more complex than that is hard to do with that specific blockchain.

At the same time, having a dedicated blockchain for each specific use case did not seem to be a good approach, even though this happened for a while after Bitcoin came out.

Then, in 2015, Ethereum [8] came out. It is part of the “new generation” of blockchains, aka blockchain 2.0; actually, it can be considered their most prominent example.

Ethereum defines itself as the world computer, in the sense that applications, on top of it, run on a single worldwide network of nodes, meaning that it is possible to execute code in an auditable way.

It is basically a blockchain, i.e. a linked list of blocks, but with a few additions; in particular, it has a built-in programming language which is Turing complete. This language is essentially a hybrid between a standard virtual machine architecture and extended Bitcoin scripts. The extended scripting features form a full-blown code execution framework, called a smart contract.

The point is that people can write their own programs and let them run on the Ethereum network, i.e. on the world computer. Programs can be written either directly as scripts or, more realistically, in a high-level language, such as Solidity [9], that gets compiled down to bytecode. The bytecode then gets inserted into a transaction which is sent off to the blockchain. When the transaction gets confirmed, an address (20 bytes) is generated and, at that point, a special kind of account called a contract is available.

Anyone can create an application with any rules by defining it as a contract. Once the application has been created, anyone can interact with it by sending transactions to it, specifying the contract address as the destination address of the transaction. This is the same mechanism used in Bitcoin for sending money to someone else; indeed, Ethereum also allows sending Ethers among participants in the network.

Finally, as in Bitcoin, every node on the Ethereum blockchain runs every transaction. If the transaction goes to a contract it means that every node runs the contract’s code.


4.3 Ethereum accounts: Externally Owned Accounts vs contracts

In Ethereum there are two types of accounts: externally owned accounts (EOAs) and contract accounts. EOAs are controlled by private keys, which means that EOAs use public-key cryptography to sign transactions.

Contracts, instead, are controlled by code. They are basically automata sitting on the blockchain, executing exactly as the code on the blockchain says.

4.4 Ethereum state

In Bitcoin the state is simply represented by the set of unspent transaction outputs (UTXOs).

In Ethereum, the state is more complex: it consists of a key-value mapping from addresses to account objects. Both EOAs and contracts are account objects, and both have a state.

The state of all accounts is the state of the Ethereum network. The state gets updated with every block.

In the case of an EOA, the state is simply represented by a nonce and a balance.

In the case of a contract, the state consists of two extra fields (in addition to a nonce and a balance): the “code hash”, that is, the hash of the code of that particular contract, and the “storage trie root”, that is, the Merkle root hash of the persistent storage of that particular account. A sketch of the two shapes follows.
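As a plain illustration of the two account shapes (field names are ours; real clients store RLP-encoded structures, not JSON):

```javascript
// Sketch: the two kinds of account objects in the Ethereum state.
// Field names and values are illustrative only.
const eoa = {
  nonce: 7,                 // number of transactions sent from this account
  balance: '1.5 ether'
};

const contractAccount = {
  nonce: 1,
  balance: '0 ether',
  codeHash: '0x3f…',        // hash of the contract's EVM bytecode (elided)
  storageRoot: '0x9a…'      // Merkle root of the contract's storage trie (elided)
};
```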

4.5 How code is executed in Ethereum

Every transaction specifies a “to” address it is sending to; the only exception is a transaction creating a new contract. If the “to” address refers to an EOA, ethers are just being moved around.

If the “to” address refers to a contract, the transaction activates the contract’s code and actually lets it run. The contract’s code gets executed within the Ethereum Virtual Machine (EVM), so as to allow any node to execute the code independently of the underlying hardware or OS. Inside the EVM there are:

• a “stack”: 32-byte fields, up to a maximum of 1024 of them;

• “memory”: a variable-length byte array;

• “contract storage”: permanent storage;

• “environment variables”;

• “logs”;

• “sub-calling”: an opcode letting the EVM call other contracts’ functions from within a contract.

As mentioned, there is no need to write EVM bytecode directly: there are high-level languages and compilers that can be used to compile smart contracts into bytecode. Among these high-level languages there is Solidity, which is now the de-facto standard programming language for smart contract development.

4.6 The application binary interface: ABI

The ABI basically represents the standard way used to interact with contracts in the Ethereum ecosystem.

Functions exposed by smart contracts can be called either from an EOA or from a contract account.

Function calls are compiled into transaction data, aka calldata, i.e. a sequence of bytes. The first four bytes represent the function selector, which identifies the function to be called: they are the first four bytes of the Keccak-256 hash of the function’s signature. The rest of the calldata carries the arguments, in chunks of 32 bytes.
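For instance, assuming the js-sha3 package for Keccak-256 (Node's built-in crypto module does not expose Keccak), the selector of the well-known ERC-20 transfer function can be computed as follows:

```javascript
// Sketch: computing a function selector, i.e. the first 4 bytes of the
// Keccak-256 hash of the canonical function signature.
const { keccak256 } = require('js-sha3'); // npm install js-sha3

const signature = 'transfer(address,uint256)';
const selector = keccak256(signature).slice(0, 8); // 4 bytes = 8 hex chars
console.log('0x' + selector); // 0xa9059cbb
```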

4.7 The key mechanism underlying the security of Ethereum: Gas

Having untrusted parties create pieces of code, in the form of smart contracts, that are executed in a Turing-complete programming language by every single computer in the entire network runs into the so-called halting problem.

It has been proven mathematically impossible to create an algorithm that can tell, for all possible inputs, whether or not a given program will keep on running forever.

For this reason, Ethereum implements the concept of “gas”, so as to prevent attackers from creating programs that just keep on running on everyone’s computer forever.

The idea is to charge a fee for every computational step that the contract code execution takes. The way it is implemented is that:

• each block in the Ethereum blockchain has a “gas limit”, so basically a limit on computational steps. The “gas limit” is set by a simple voting mechanism every time a miner mines a new block: in particular, the winning miner can upvote or downvote the current “gas limit” by a maximum factor of 1/1024.

• each transaction, moreover, specifies a “gas price”, which is the amount of ether that the user is willing to pay per unit of gas (that is, per unit of computational step). At the same time, a transaction “gas limit”, that is, the maximum amount of gas the transaction is allowed to consume, also gets specified automatically, based on estimates.

If the execution takes less than the specified gas limit, the user is charged only for the amount of ether that the transaction execution actually required. This amount is obtained as the product between the specified “gas price” and the actual amount of “gas” the transaction execution required.

If the execution instead takes more than the specified gas limit, the transaction is rolled back, that is, all of its effects are reverted, but an amount of ether equal to the product of “gas price” and “gas limit” is paid anyway. The sketch below makes the two cases concrete.
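A minimal sketch of the fee rule just described (the numeric values are invented):

```javascript
// Sketch: the fee charged to the sender under the gas mechanism.
// If execution stays within gasLimit, pay gasPrice * gasUsed;
// if it runs out of gas, effects revert but gasPrice * gasLimit is paid.
function feeWei(gasPrice, gasLimit, gasUsed) {
  return gasUsed <= gasLimit ? gasPrice * gasUsed : gasPrice * gasLimit;
}

const gwei = 10n ** 9n;
const gasPrice = 10n * gwei;                      // 10 Gwei, as in the evaluation below
console.log(feeWei(gasPrice, 100000n, 21000n));   // within limit: 21000 gas paid
console.log(feeWei(gasPrice, 100000n, 150000n));  // out of gas: full 100000 gas paid
```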

Finally, it is important to underline that there is no 1-to-1 mapping between a unit of gas and a unit of computational work: the gas system is intrinsically a set of incentives designed to encourage people to use the resources of the Ethereum Virtual Machine responsibly.

4.8 Logs vs Storage

Every time a transaction gets confirmed and added to a block, a “receipt” gets created. Each “receipt” is basically an object which contains different pieces of data:

• “intermediate state root”: a hash which represents the entire Ethereum state after the transaction has been executed;

• “cumulative gas used”: the total amount of gas used in that particular block including that transaction;

• “logs”: a different kind of storage. In the general case, every time a variable is set in a smart contract, a key is set in the contract’s own storage. Each contract’s storage can be read by other contracts, but can be written only by the contract itself. Logs are a different kind of storage in the sense that they are append-only: they get generated by the Ethereum Virtual Machine (EVM) every time an “event” is emitted during code execution. Logs cannot be accessed by (any) contract, and they appear just in that particular block, not in any kind of persistent state. For this reason, logs are, on average, roughly 10x cheaper than storage. Each log has a “data” field and up to four “topics”; topics are intended to allow efficient client access to logs, as sketched below.
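As an illustration of how a client can retrieve logs by topic, here is a minimal sketch using the standard eth_getLogs JSON-RPC method against a local node (the address and topic values are placeholders, not real values):

```javascript
// Sketch: fetching logs emitted by a contract via the eth_getLogs JSON-RPC
// call. The first topic of an event log is the Keccak-256 hash of the event
// signature; a null topic matches anything.
const payload = {
  jsonrpc: '2.0',
  id: 1,
  method: 'eth_getLogs',
  params: [{
    fromBlock: '0x0',
    toBlock: 'latest',
    address: '0x0000000000000000000000000000000000000000', // placeholder
    topics: [null] // or the hashed event signature to filter on
  }]
};

fetch('http://localhost:8545', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload)
})
  .then(res => res.json())
  .then(({ result }) => console.log(result)); // array of { data, topics, ... }
```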


4.9 The role of the blockchain for digital archives

So why use the blockchain for the specific case of digital archives?

Well, the answer lies in its core value proposition: a shared source of truth, non-repudiable and censorship-resistant; in other terms, a layer of information that everybody trusts.

Moreover, being intrinsically distributed, with all the information replicated on all the nodes of the network, it gives the guarantee that data are preserved over time: if a failure occurs on a node of the network, information is not lost.

Furthermore it guarantees also decentralization, in the sense that data are not hosted and managed by a single central authority.

Finally, it provides authenticity and integrity because of its intrinsic, cryptography-based nature.


Chapter 5

A blockchain-based approach for digital archives

At the end of Chapter 3 we were left with two Elasticsearch indices, “oa3_0” and “ra3_0”, storing, respectively, artworks and archeological exhibits.

Then, in Chapter 4, we’ve presented the main benefits that a blockchain-based solution can provide in the specific case of digital archives.

One could wonder why Ethereum was specifically presented in that chapter. It did not come about by chance: it was presented because it is one component of the architectural solution.

So, on one hand we have Elasticsearch with the two above-mentioned indices; on the other hand, we have the Ethereum blockchain.

What this chapter aims to present is the way in which Ethereum has been used in order to develop a distributed solution which also ensures, for the considered use case, authenticity, privacy, integrity and long-term data availability.

5.1 Towards a distributed and efficient solution

The problem we needed to solve concerned a way to integrate, somehow, our available digital archives with the Ethereum blockchain.

Specifically, we needed a way to integrate Elasticsearch with Ethereum. We made several attempts before obtaining a solution viable both from a performance and from an economic viewpoint.

Before presenting them, it is foremost important to explain the assumption that has been made: nodes within the Elasticsearch cluster are not Byzantine. This is not generally the case: strictly speaking, no currently available database implementation has been proven to be “Byzantine Fault Tolerant”.

Indeed, the “Byzantine Generals Problem” has actually been addressed only in the blockchain space, and by means of a probabilistic approach.


Without this assumption, and without considering any possible optimization, we would have had to retrieve data from the blockchain and recreate our Elasticsearch indices every time a search, update or delete operation was issued. In other terms, we would have been left with a non-trivial problem to solve.

With this assumption, instead, we guaranteed two things:

• In the first place, data stored in the cluster do not get altered or deleted. For this specific reason, the data associated with each new insert, update or delete operation, regarding either an artwork or an archeological exhibit, can be stored on Elasticsearch before involving the blockchain in any way.

• The available Elasticsearch indices, that is, our digital archives, get reconstructed only in the case of node failures within the Elasticsearch cluster.

These two points have to be kept in mind from now on.

5.1.1 First attempt: (raw) Archive contents accounting on the blockchain

This first attempt is based on the idea of storing, after each Elasticsearch insert/update/delete operation, the new archive contents, as a whole, on the blockchain.

This can be achieved in the following way: Elasticsearch exposes a specific API, called snapshot, which basically resembles a mysqldump and can be used to obtain a backup of the entire Elasticsearch cluster.

This backup can be serialized and the result of the serialization can then be stored on the blockchain.

As soon as there is a node failure within the Elasticsearch cluster, the raw data stored on the blockchain can be retrieved and deserialized, and the result of this deserialization process can be used to restore the snapshot with a specific Elasticsearch API, called restore.

By default, all indices in the snapshot get restored. This behavior can be achieved using the smart contract shown in Figure 5.1.
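Alongside the contract of Figure 5.1, here is a minimal JavaScript sketch of the flow just described, assuming ethers.js v6, a pre-registered filesystem snapshot repository named "backup", and a deployed contract exposing storeRawData(bytes) as discussed later in this section (repository name and deployment details are our assumptions):

```javascript
// Sketch: first attempt - snapshot the whole cluster, serialize it and send
// it to the contract's storeRawData(bytes). Illustrative only; a real
// snapshot lives in a shared repository, not in the HTTP response body.
const { ethers } = require('ethers'); // npm install ethers

async function accountWholeArchive(contractAddress, signer) {
  // 1. take a snapshot of the entire cluster (Elasticsearch snapshot API)
  await fetch('http://localhost:9200/_snapshot/backup/snap_1?wait_for_completion=true',
    { method: 'PUT' });

  // 2. serialize the snapshot contents (here: naively, as UTF-8 bytes)
  const snapshot = await fetch('http://localhost:9200/_snapshot/backup/snap_1');
  const raw = ethers.toUtf8Bytes(JSON.stringify(await snapshot.json()));

  // 3. store the serialized archive on-chain (this is the expensive step)
  const abi = ['function storeRawData(bytes _rawData)'];
  const contract = new ethers.Contract(contractAddress, abi, signer);
  await contract.storeRawData(raw);
}
```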


We can immediately notice that this approach is certainly not efficient from a performance point of view, in particular in terms of experienced latency, since it requires retrieving all the Elasticsearch documents, i.e. artworks and archeological exhibits, stored in the nodes of the Elasticsearch cluster on every insert/update/delete operation.

Now, one may wonder whether the solution is at least efficient from an economic perspective. In order to perform this evaluation, we faked storing 1024 bytes of random data obtained from the random.org [10] web application; the randomness comes from atmospheric noise. The obtained data are presented in the following figures:

Figure 5.2: Generating 1024 bytes of random data

Figure 5.3: Random data obtained

At that point, in order to perform the evaluation, the “Remix” IDE [11] was used.

The latter is a web-based integrated development environment to create, deploy, execute and explore the workings of smart contracts on the Ethereum blockchain.


Specifically, it supports three runtime test environments, among which the “Javascript VM” was used.

The very first step was to compile and deploy the contract shown in Figure 5.1. In this regard, it is fair to also consider the cost to be paid for the deployment of the smart contract.

The contract creation transaction is reported below:

Figure 5.4: Contract creation transaction for this first attempt

As can be seen from the above figure, the overall cost was 232,699 gas.

In order to understand where this value comes from, it is necessary to present the Ethereum fee schedule, as reported in the yellow paper [12]:


The overall cost of a transaction is basically the sum of two inner costs: the “execution cost” and the “transaction cost”. The former is the cost to be paid for the execution of a transaction, in terms of required computational steps. The latter, instead, is the cost of sending data to the blockchain.

Actually, Remix uses the label “transaction cost” to indicate the overall transaction cost. This means that the specific transaction cost can be obtained by simply subtracting the execution cost from the overall cost.

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited (the input field in Figure 5.4).

The specific transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code of the transaction;

• 4 gas paid for every zero byte of code of the transaction.

Let us see whether the specific transaction cost, following the above reasoning, matches the expected value, that is: 232,699 − 135,583 = 97,116 gas.

In this specific case, the generated bytecode was the following:

Figure 5.6: EVM bytecode related to the smart contract in Figure 5.4

It accounts for 709 bytes. In the figure, each zero byte has been highlighted; there are 64 of them.

So, following the reasoning, the specific transaction cost is: 32000 + 21000 + 68 × 645 + 64 × 4 = 97,116 gas (the bytecode contains 709 − 64 = 645 non-zero bytes), which is the value that we expected.

Considering a “gas price” equal to 10 Gwei (confirmation time < 2 min on 2018-10-16) and the exchange rate at that specific time (183 Euro/ETH on 2018-10-16), the actual economic cost of the contract deployment is: 232,699 × 10^-8 × 183 ≈ 0.43 Euro.


Soon after, the random data obtained from the random.org website were used to call the smart contract’s method storeRawData(bytes _rawData), hence faking the storage of 1 KB of cultural goods data.

The result of the transaction being confirmed is shown below:

Figure 5.7: storeRawData(bytes _rawData) invocation cost

The execution cost of storeRawData(bytes _rawData) was mainly due to:

• 20 Kgas paid for each SSTORE operation. Since 1024 bytes were sent and each SSTORE stores 32 bytes, this amounts to 32 SSTORE operations, i.e. 640 Kgas;

• 23,040 gas due to operations of the sets W_base, W_verylow, W_low, W_mid and W_high.

The transaction cost, instead, was mainly due to:

• 21 Kgas paid for the transaction;

• 68 gas paid for every non-zero byte of data of the transaction;

• 4 gas paid for every zero byte of data of the transaction.

Considering a “gas price” equal to 10 Gwei (confirmation time < 2 min on 2018-10-16) and the exchange rate at that specific time (183 Euro/ETH on 2018-10-16), the actual economic cost of storing 1 KB of data was: 754,072 × 10^-8 × 183 ≈ 1.38 Euro. That means more than one million Euro per GB.

Things get even worse if we consider that this cost would need to be paid on each call to storeRawData(bytes _rawData).

A scenario like this is not at all unrealistic; indeed, if we just consider the specific case of the regione-toscana_OA3.00_0.csv file, we have:


• roughly 80K records and 52K images in the form of URLs, which overall account for more than 100 MB. Assuming an average image size of 150 KB, we are already way, way above the considered scenario;

• in practice, roughly 3 million records are actually stored inside the “Catalogo Generale dei Beni Culturali”.

It is right to underline that a read operation, instead, is always free of charge, as long as it comes from an EOA.

In this specific case, a read operation means a call to the smart contract’s method retrieveRawData(). For this reason, no economic evaluation is performed for reads.

To conclude, we can state that this approach is infeasible both from an economic and from a performance perspective.

5.1.2 Second attempt: Off-chain archive contents storage + Merkle DAG accounting on the blockchain

In order to present this solution, it is worth first describing, very briefly, a peer-to-peer file system named the InterPlanetary File System [13].

5.1.2.1 The Interplanetary File System: IPFS

The InterPlanetary File System (IPFS) is a content-addressed, versioned, peer-to-peer file system. It is a synthesis of well-tested internet technologies such as distributed hash tables (DHTs), the Git versioning system and BitTorrent.

The DHT is used in IPFS for routing, in other words to announce added data to the network and help locate data that is requested by any node.

Small values (equal to or less than 1 KB) are stored directly on the DHT. For larger values, the DHT stores references, which are the node identifiers of peers who can serve the block. Once a PKI key pair is generated, the node identifier is obtained by hashing the public key.

The exchange of objects in IPFS is inspired by BitTorrent, but is not 100% BitTorrent. An IPFS object is a data structure with two fields:

• Data - a blob of unstructured binary data of size < 256 kB.

• Links - an array of Link structures. These are indeed links to other IPFS objects.

A Link structure has three data fields:

• Name - the name of the Link.

• Hash - the hash of the linked IPFS object. This is also referred to as the content identifier, aka CID. A CID doesn’t indicate where the content is stored; rather, it forms a kind of address based on the content itself. CIDs can take a few different forms, in the sense of different encoding bases (aka CID versions), because hashes can be represented in different bases. When IPFS was first designed, it used base 58-encoded “multihashes” [14]. If a CID is 46 characters long and starts with “Qm”, it is referred to as CID version 0 or, for short, CIDv0. One important aspect to underline is that all CIDv0 begin with “Qm”. This is because, as mentioned, the hash is actually a “multihash”: the hash itself specifies, in its first two bytes, the hash function used and the length of the digest. CIDv0 have the first two bytes equal to 0x1220, where 0x12 denotes the SHA-256 hash function and 0x20 is the length of the digest in bytes, i.e. 32 bytes.

• Size - the cumulative size of the linked IPFS object, including everything reachable by following its links.

The collection of IPFS objects stored inside an IPFS node results in a Merkle DAG - DAG meaning Directed Acyclic Graph, and Merkle signifying that this is a cryptographically authenticated data structure that uses cryptographic hashes to address content. In general, any difference in the content will produce a different CID, whereas the same piece of content added to two different IPFS nodes, using the same settings, will produce exactly the same CID. Hence IPFS removes duplication across the network.

IPFS can easily represent a file system consisting of files and directories:

• A small file (< 256 kB) is represented by an IPFS object with data being the file contents (plus a small header and footer) and no links, i.e. the links array is empty. The file name is not part of the IPFS object, so two files with different names and the same content will have the same IPFS object representation and hence the same hash.

• A large file (> 256 kB) is represented by a list of links to file chunks that are each < 256 kB, plus a minimal Data field specifying that this object represents a large file. The links to the file chunks have empty strings as names.

• A directory is represented by a list of links to IPFS objects representing files or other directories contained in it. The names of the links are the names of the files and directories.

It is important to underline that each node in the IPFS network stores only content it is interested in. In particular IPFS nodes treat data they store like a cache, meaning that there is no guarantee that the data will continue to be stored.

“Pinning a CID” tells an IPFS node that the data is important and must not be thrown away. Any important content should always be pinned, to ensure that it is retained long-term. When garbage collection is triggered on a node, any pinned content is automatically exempt from deletion; non-pinned data may be deleted.


5.1.2.2 Second attempt: Overview

Having said that, in this second attempt the idea is the following: after each Elasticsearch insert/update/delete operation, we retrieve the new archive contents, as a whole, using the Elasticsearch snapshot API (as in the previous attempt).

Then, unlike the previous case, the snapshot gets “added” (as it is usually called in IPFS jargon) to the InterPlanetary File System, without serializing it.

What we get in return is the immutable, permanent IPFS link, in the sense of a CIDv0, related to the specific Merkle DAG representation of the archive’s state.

At that point the IPFS link gets published on the blockchain. This timestamps and secures the archive’s state itself. This approach is usually referred to as off-chain data storage.

As soon as there’s a node failure within the Elasticsearch cluster, the IPFS link gets retrieved from the blockchain and used to gather the associated archive contents, stored in the form of an Elasticsearch snapshot, from IPFS. At that point, the snapshot can be restored using the Elasticsearch restore API.

This specific behavior can be achieved using the following smart contract:

Figure 5.8: Smart contract used with this second attempt
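Since Figure 5.8 is likewise an image, a minimal sketch of such a contract follows. The method storedbCID(bytes32 _dbCID) is the one named below in the text; the contract name, the state variable and the getter retrievedbCID() are assumptions:

    pragma solidity ^0.4.25;

    // Minimal sketch (assumed names): second attempt, only the snapshot CID on-chain.
    contract ArchiveSnapshot {
        bytes32 private dbCID; // CIDv0 digest with the constant 0x1220 prefix chopped off

        // A bytes32 fits a single EVM word: one SSTORE per update.
        function storedbCID(bytes32 _dbCID) public {
            dbCID = _dbCID;
        }

        // Free of charge when executed as a local call from an EOA.
        function retrievedbCID() public view returns (bytes32) {
            return dbCID;
        }
    }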

As for the previous attempt, we still have a solution that is inefficient in terms of latency: all artworks and archaeological exhibits stored in the nodes of the Elasticsearch cluster must still be retrieved after every insert/update/delete operation.

Let’s see, instead, how much more efficient this off-chain storage approach is from an economic perspective.


Figure 5.9: Contract creation transaction for this second attempt

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited in state (cf. the input field in the figure above);

The transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code for the transaction;

• 4 gas paid for every zero byte of code for the transaction.

Considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of contract deployment was: 116251 × 10⁻⁸ × 183 ≈ 0.21 Euro. In other terms, deploying this solution cost less than half as much as the previous one.

Evaluating, instead, the economic cost of a call to the smart contract’s storedbCID(bytes32 _dbCID) method requires some further explanation.

As previously mentioned, what gets stored on the blockchain is the immutable, permanent IPFS link, in the sense of a CIDv0. In other terms, what we were required to store was a base 58-encoded multihash of 46 characters.

For this reason, in principle, a string datatype variable should have been used. However, the Solidity docs [15] state that:


“As a rule of thumb, use bytes for arbitrary-length raw byte data and string for arbitrary-length string (UTF-8) data. If you can limit the length to a certain number of bytes, always use one of bytes1 to bytes32 because they are much cheaper.”

Therefore, as shown in Figure 5.8, a bytes32 datatype variable was used instead. Since a bytes32 fits in a single word of the EVM, it uses less gas and hence costs less.

To accomplish this, in general, the base 58-encoded multihash first gets converted to its base 16 encoding, and then the first two bytes are chopped off, since they are always the same (0x1220).
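As a worked example (with a placeholder digest): a 46-character CIDv0 of the form Qm… decodes, in base 16, to 34 bytes consisting of 0x1220 followed by the 32-byte SHA-256 digest; dropping the constant 0x1220 prefix leaves exactly the 32-byte digest, which fits a bytes32. Presumably, whenever the CID has to be handed back to IPFS, the client performs the reverse conversion, re-prepending 0x1220 and re-encoding in base 58.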

To evaluate the cost of a storedbCID(bytes32 _dbCID) method invocation, we faked storing 32 bytes of random data obtained, again, from the random.org website.

The result of the transaction being confirmed is shown below:

Figure 5.10: storedbCID(bytes32 _dbCID) invocation cost

The execution cost of storedbCID(bytes32 _dbCID) was mainly due to:

• 20 Kgas paid for the single SSTORE operation;

• 245 gas due to operations belonging to the following Yellow Paper fee-schedule sets: Wbase, Wverylow, Wlow, Wmid, Whigh.

The transaction cost, instead, was mainly due to:

• 21 Kgas paid for the transaction;

• 68 gas paid for every non-zero byte of data for the transaction;

• 4 gas paid for every zero byte of data for the transaction.
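The 43693 gas total shown in Figure 5.10 again decomposes consistently, assuming the 36 bytes of ABI-encoded calldata (4-byte function selector + 32-byte argument) contain no zero bytes, which is likely for random data:

    43693 = (20000 + 245) + 21000 + 68 × 36 gas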


Again, considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of storing 32 bytes of data was: 43693 × 10⁻⁸ × 183 ≈ 0.08 Euro.

The actual overall cost will depend on the number of insert, update, delete operations that will be performed over time. Indeed, for each new operation in this set, the accounting of a new IPFS link on the blockchain is required.

In order to give an estimate, we’ve used reports provided by the ICCD about the number of cultural goods stored each year in the “Catalogo Generale dei Beni Culturali” and their associated cost.

Provided statistics [16] are reported below:

Figure 5.11: ICCD statistics about the “Catalogo Generale dei Beni Culturali” (2002-2009)

Based on the information available in the ICCD report and on the actual number of cultural goods stored, we can state that roughly 1.35 million cultural goods get indexed every eight years.

If we suppose the same rate for the next eight years (2019-2026), the overall cost of this solution will be: 0.08 × 1.35 × 10⁶ = 108 KEuro. It should be noted that, under this hypothesis, we are only considering insert operations.

In conclusion, this solution may be feasible from an economic perspective, but it is still weak from a performance point of view.

5.1.3 Third attempt: Off-chain operations storage + IPFS links accounting on the blockchain

The aim of this solution is to overcome the performance issue affecting all the solutions presented so far, while still relying on an off-chain storage approach, so as to maintain the economic benefits.

The performance concern is due to the fact that, after each Elasticsearch insert/update/delete operation, the new archive contents must be retrieved as a whole in order to be stored off-chain, i.e. added to IPFS. This may lead to high latencies as soon as the size of the archive becomes non-negligible.

Recall that the whole archive contents are represented by the set of all Elasticsearch documents stored in the cluster, that is, the set of all artworks and archaeological exhibits.

Equivalently, this means that the whole archive’s state, at any given time, can be seen as the result of the execution of all Elasticsearch insert/update/delete operations ever performed.

So, basically, the performance issue was solved by adding to IPFS, each time, only the specific operation together with its associated data, instead of the whole archive contents.

Strictly speaking, we didn’t even need to specify the operation, since the data were structured in a way that lets the different operations be distinguished from the data itself.

This kind of solution required, however, some changes with respect to the smart contract presented in the last attempt.

Indeed, this time it is required to store on the blockchain one IPFS link for each insert/update/delete operation that causes associated data to be stored on IPFS.

Hence, a dynamic bytes32 array datatype was required, representing an indefinitely expandable array of IPFS CIDs (version 0).

In case of node failures within the Elasticsearch cluster, all IPFS links get retrieved from the blockchain and used to gather the data stored on IPFS. The operations encoded in the data get re-executed and, in the end, the whole archive content is reconstructed.

The smart contract involved in this specific solution is the following:

Figure 5.12: Smart contract used with this third attempt
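As before, since Figure 5.12 is an image, a minimal sketch of such a contract is given below; storeIpfsLink and retrieveIpfsLinks are the method names cited in the text, while the contract and array names are assumptions:

    pragma solidity ^0.4.25;

    // Minimal sketch (assumed names): third attempt, one CID per operation.
    contract ArchiveOperations {
        bytes32[] private ipfsLinks; // indefinitely expandable array of CIDv0 digests

        // A push writes two storage slots: the new element and the array length.
        function storeIpfsLink(bytes32 _ipfsLink) public {
            ipfsLinks.push(_ipfsLink);
        }

        // The loop makes the gas requirement unbounded in the array length,
        // hence the compiler warning of Figure 5.13.
        function retrieveIpfsLinks() public view returns (bytes32[]) {
            bytes32[] memory links = new bytes32[](ipfsLinks.length);
            for (uint i = 0; i < ipfsLinks.length; i++) {
                links[i] = ipfsLinks[i];
            }
            return links;
        }
    }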

As can be noticed, with respect to the last solution a loop in the retrieveIpfsLinks() method is now required, so as to retrieve all the stored IPFS CIDs.

As already mentioned, reading data from the blockchain comes at no cost, whenever the read is performed as a local call issued by an EOA.

However, even in the case of a read operation, because of the halting problem, gas is still accounted for, so as to avoid denial-of-service attacks.

In this case, the compiler indeed warns that the gas consumption for the execution of this view function may be infinite:

Figure 5.13: solc compiler warning: gas requirement may be infinite due to the presence of the loop

The contract creation transaction related to this solution is reported below:

Figure 5.14: Contract creation transaction for this third attempt

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited in state (cf. the input field in the figure above);

The transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code for the transaction;

• 4 gas paid for every zero byte of code for the transaction.


Considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of contract deployment was: 184295 × 10⁻⁸ × 183 ≈ 0.34 Euro.

This means that, with respect to the second solution, the deployment cost has increased by roughly 59% (184295 gas vs 116251 gas).

In order to evaluate the cost of a storeIpfsLink(bytes32 _ipfsLink) method invocation, we faked, again, storing 32 bytes of random data.

The result of the transaction execution is reported below:

Figure 5.15: storeIpfsLink(bytes32 _ipfsLink) invocation cost

The execution cost of storeIpfsLink(bytes32 _ipfsLink) was mainly due to the fact that, at the end of the method’s execution, 2 storage locations are actually in use: one holding the newly pushed array element and one holding the updated array length, which Solidity keeps in the slot where the dynamic array is declared:

Figure 5.16: Smart contract’s storage state at the end of the execution of storeIpfsLink(bytes32 _ipfsLink)
