
DEPARTMENT OF INFORMATION ENGINEERING
UNIVERSITY OF PISA

Master of Science in Computer Engineering

A blockchain-based approach for digital archives: a decentralized Ethereum application for the safeguarding of the cultural heritage.

Supervisors:

Prof. Gianluca Dini

Prof.ssa Cinzia Bernardeschi

Candidate: Mariano Basile


Abstract

The recordkeeping process in an accounting system usually results in a digital archive, i.e. a collection of data which aims at fulfilling an organization's specific needs.

In the case of a structured collection of data, a digital archive is typically maintained as a database, usually under the supervision of a single organization.

There are several issues related to these kinds of solutions.

First, the stored data are controlled by specific organizations. Customers must therefore trust the data being exposed, which in turn means trusting the organizations exposing them.

Second, there is the problem of the single point of failure. This was particularly common in Web 1.0, mainly due to failures of the node running the database service or to attacks against the node storing the data.

Things did not get much better with Web 2.0. Cloud providers offer cheap, sometimes even free, scalable services, but the way scalability and fault tolerance are achieved is unsatisfactory: it comes at the cost of providers owning the content and profiling users, i.e. recording which content gets accessed, by whom and when, so as to turn it into a profit. Moreover, this has made surveillance and censorship easier.

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism. This database plays a significant role in the spreading of knowledge and in the promotion of the Italian national heritage.

The goal of this work is to design and develop a distributed solution, efficient both from an economic and a performance viewpoint, able to overcome the above-mentioned problems while ensuring authenticity, privacy, integrity and long-term data availability.

A blockchain-based architecture has been chosen so as to guarantee three of the four properties above, namely authenticity, integrity and availability. In particular, Ethereum has been considered because it is a public, Turing-complete blockchain.

Moreover, an off-chain data storage approach has been adopted, by means of a content-addressed, versioned, peer-to-peer distributed file system named the InterPlanetary File System (IPFS).

It is worth highlighting that the solution still requires a database service, since we are dealing with structured data on which searches still have to be performed. In this respect, the choice was to use a document-store NoSQL database called Elasticsearch.

Finally, the privacy requirement has been achieved at the application level by storing data, or part of them, in encrypted form.


Contents

Abstract

Introduction

1 Digital archives

2 The safeguarding of cultural heritage use case
2.1 “Il catalogo generale dei beni culturali”
2.2 How cultural property is described: the ICCD standards
2.3 ICCD Standard: “schede di catalogo” version 3.00
2.4 ICCD open-data
2.5 Gathering ICCD open-data
2.5.1 ICCD Standard: “schede di catalogo” OA/RA version 3.00

3 From ICCD open-data to digital archives
3.1 The “ELK” stack
3.2 Elasticsearch indices as digital archives
3.3 Deploying a containerized dev-environment with Docker
3.4 Storing artworks: the “oa3_0” Elasticsearch index
3.5 Storing archeological exhibits: the “ra3_0” Elasticsearch index

4 The blockchain and its use for digital archives
4.1 Blockchain in a nutshell
4.2 From Nakamoto’s blockchain to the Ethereum blockchain
4.3 Ethereum accounts: Externally Owned Accounts vs contracts
4.4 Ethereum state
4.5 How code is executed in Ethereum
4.6 The application binary interface: ABI
4.7 The key mechanism underlying the security of Ethereum: Gas
4.8 Logs vs Storage
4.9 The role of the blockchain for digital archives

5 A blockchain-based approach for digital archives
5.1 Towards a distributed and efficient solution
5.1.1 First attempt: (raw) Archive contents accounting on the blockchain
5.1.2 Second attempt: Off-chain archive contents storage + Merkle DAG accounting on the blockchain
5.1.2.1 The InterPlanetary File System: IPFS
5.1.2.2 Second attempt: Overview
5.1.3 Third attempt: Off-chain operations storage + IPFS links accounting on the blockchain
5.1.3.1 Some considerations
5.1.4 Partial proposed solution: Off-chain operations storage + IPFS links accounting on the blockchain using logs
5.1.4.1 Some considerations
5.1.5 Proposed solution

6 A decentralized Ethereum application for the safeguarding of the cultural heritage
6.1 Decentralized Ethereum applications
6.2 Proposed architectural solution
6.2.1 Containerized dev-environment with Docker: remaining services
6.3 Finalizing data migration: part I
6.4 Dapp backend
6.4.1 Compiling the CulturalGood smart contract
6.4.2 Testing the CulturalGood smart contract
6.4.3 Deploying the CulturalGood smart contract
6.5 Dapp frontend
6.5.1 Insert Interface
6.6 Dapp: Get up and running
6.6.1 Setup Metamask
6.6.2 Launch lite-server
6.7 Interacting with the Dapp
6.7.1 Finalizing data migration: part II
6.7.2 Insert a new cultural property
6.7.3 Search cultural property
6.7.4 Show details about a specific cultural property
6.7.5 Delete a cultural property
6.7.6 Update information about a cultural property
6.8 Restore “oa3_0”/“ra3_0” Elasticsearch indices
6.9 GitHub hosting
6.10 Future work

Appendix 1: Lookup table (Cultural good typology to uint8)


Introduction

The recordkeeping process in an accounting system usually results in a digital archive, i.e. a collection of data which aims at fulfilling an organization's specific needs.

In the case of a structured collection of data, a digital archive is typically maintained as a database, usually under the supervision of a single organization.

There are several issues related to these kinds of solutions.

First, the stored data are controlled by specific organizations. Customers must therefore trust the data being exposed, which in turn means trusting the organizations exposing them.

Second, there is the problem of the single point of failure. This was particularly common in Web 1.0, mainly due to failures of the node running the database service or to attacks against the node storing the data.

Things did not get much better with Web 2.0. Cloud providers offer cheap, sometimes even free, scalable services, but the way scalability and fault tolerance are achieved is unsatisfactory: it comes at the cost of providers owning the content and profiling users, i.e. recording which content gets accessed, by whom and when, so as to turn it into a profit. Moreover, this has made surveillance and censorship easier.

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism. This database plays a significant role in the spreading of knowledge and in the promotion of the Italian national heritage. At the same time, it helps in the deployment of countermeasures aimed at preventing crimes and natural disasters affecting that heritage.

The goal of this work is to design and develop a distributed solution, efficient both from an economic and a performance viewpoint, able to overcome the above-mentioned problems while ensuring authenticity, privacy, integrity and long-term data availability.

By authenticity we mean letting only authorized entities record new cultural property. Integrity means protecting the attributes describing each specific cultural good. Privacy means hiding some of these attributes, when necessary. Finally, availability means guaranteeing that information remains available in the very long run.


We have chosen a blockchain-based architecture so as to guarantee three of the four properties above, namely authenticity, integrity and availability.

In particular, we have considered Ethereum because it is a public, Turing-complete blockchain.

By means of an Ethereum smart contract, an archive can operate programmatically without the need for ownership or control by a particular entity. Furthermore, smart contracts have their own storage, which means that the data comprising the archive can, in principle, be stored on the blockchain. However, precisely because these computations, including the storage, have to be replicated on each node of the Ethereum network in order to be secured, storing data is really expensive.

In addition, as soon as the size of the archive becomes non-negligible, the time required to retrieve and store the new archive contents may lead to high latencies. Things get even worse considering that accounting is required every time a new operation alters the archive's current state. The proposed solution addresses both issues.

The performance concern has been solved by accounting, each time, only for the specific operation (insert/update/delete) and the associated data, instead of the whole archive contents. Furthermore, this kind of accounting has been performed using an off-chain data storage approach, by means of a content-addressed, versioned, peer-to-peer distributed file system named the InterPlanetary File System (IPFS).

This has enabled storing on the Ethereum blockchain just the immutable, permanent IPFS links, in the form of the cryptographic hash of each operation together with its associated data, i.e. a fixed amount of bytes (32).

This idea has been further explored so as to end up with the cheapest solution obtainable: in the end, no smart-contract storage has been used at all; instead, the IPFS links have been stored in Ethereum logs.

It is worth highlighting that the solution still requires a database, since we are dealing with structured data on which searches still have to be performed. In this respect, the choice was to use a document-store NoSQL database called Elasticsearch.

Moreover, the following assumption has been made: nodes in the Elasticsearch cluster are not Byzantine, i.e. they are not malicious. This has allowed reflecting each insert/update/delete operation on Elasticsearch right before propagating it off-chain, letting search operations involve only Elasticsearch. In other terms, the latter acts as a front-end cache system.

Under this assumption, the archive gets reconstructed only in the case of node failures within the Elasticsearch cluster. This is achieved by reading the IPFS links from the blockchain and hence retrieving operations and associated data from IPFS.

Finally, the privacy requirement has been achieved at the application level by storing data, or part of them, in encrypted form.


Chapter 1

Digital archives

We are currently living in an era in which basically everything, either physical or not, has some form of digital representation. We usually refer to a collection of digital content as a digital archive.

In the specific case of structured, related digital information, a digital archive is usually maintained as a database. Data are structured differently according to the type of database considered: traditional SQL databases structure data in tables, while NoSQL databases use other formats, such as JSON documents and key-value pairs. In any case, the recordkeeping process is usually performed by a single organization, either public or private, to fulfill its specific needs. As already mentioned, there are several issues related to these kinds of solutions.

What this work aimed at finding is a distributed and efficient solution that solves these issues. Distributed solutions have proven extremely robust in terms of offering zero-downtime, fully fault-tolerant operation. In a properly decentralized network, where content is stored redundantly across many nodes, there is very little chance that all the nodes that redundantly store the data are down at the same time.

The solution has been developed with the requirement that it possess the following relevant features:

• authentication: for ensuring the trustworthiness of the creator of data;

• integrity: to guarantee that the data being stored do not get modified by unauthorized entities;

• availability: to provide long-term access;

• privacy: to hide some of the attributes describing a cultural good, when necessary.


Chapter 2

The safeguarding of cultural heritage use case

With respect to the use case considered, 2018 being the European Year of Cultural Heritage, a specific database was examined: the so-called “Catalogo Generale dei Beni Culturali”, under the control of the Ministry of Cultural Heritage and Activities and Tourism.

2.1 “Il catalogo generale dei beni culturali”

The so-called “Catalogo Generale dei Beni Culturali” is a database which collects and organizes, in a centralized fashion, data about Italian cultural goods. It responds to protection and promotion purposes and, at the same time, it helps in the deployment of countermeasures aimed at preventing crimes and natural disasters affecting the Italian national heritage. The cataloging is aimed at identifying and describing all cultural property for which a specific artistic, historical, archeological or anthropological interest has been recognized.

It can be consulted through a web application [1].


2.2 How cultural property is described: the ICCD standards

Cultural goods are described by means of specific standards. The “Central Institute for Cataloguing and Documentation” (aka ICCD) coordinates the research for the definition of these standards [2]. Different kinds of standards have been defined, among which the so-called “schede di catalogo” (catalog records) play the most important role.

“Schede di catalogo” get classified according to the following criteria:

• category (3): movable property, immovable property, intangible property.

• disciplinary sector (9): archeological goods, architectonical goods, anthropological goods, photographic goods, musical goods, naturalistic goods, numismatic goods, scientific and technological goods, artistic and historical goods.

• typology (30): archeological monuments, tables of archeological supplies, stratigraphic essays, archeological facilities, archeological sites, numismatic-archeological goods, archeological exhibits, anthropological exhibits, parks/gardens, architecture, intangible anthropological goods, tangible anthropological goods, extra-European tangible anthropological goods, photographic funds, photos, musical instruments - organ, naturalistic goods - paleontology, naturalistic goods - mineralogy, naturalistic goods - zoology, naturalistic goods - petrology, naturalistic goods - botanic, numismatic goods, scientific and technological heritage, contemporary artworks, drawings, artworks, stamped matrices, ancient and contemporary clothes, artistic and historical goods - numismatic, printings.

In the most general terms, we can state that “schede di catalogo” are basically sets of attributes which take into account the peculiarities and intrinsic key features of each cultural good.


2.3 ICCD Standard: “schede di catalogo” version 3.00

Reporting the full ICCD “schede di catalogo” standards is out of scope, but we can present their general structure according to the currently used version, that is, 3.00.

Each “scheda di catalogo” is organized into different sets of homogeneous data called paragraphs. Each paragraph contains fields: a field can be basic, i.e. an individual entry to fill in, or structured, i.e. an element made up of different subfields to be filled in as well.

Paragraphs, basic fields, structured fields and subfields are represented according to specific graphical formalisms and conventional definitions, as shown in the following schema:

Figure 2.3: ICCD standard: “schede di catalogo”. Formalisms being used.

In particular, for each field or subfield the standard specifies the following properties:

• length: the number of characters available when filling out the element;

• repeatability: indicates whether or not the element can be repeated, so as to record multiple pieces of information of the same type for that element;

• mandatory: indicates whether or not the element has to be filled out. Some elements have to be filled out in any case; this is indicated by the symbol ’*’. Other elements have to be filled out only when specific conditions are met; this is indicated by the symbol ’(*)’.

• dictionary: indicates that a specific set of terms has to be used when filling out the element. In the case of a closed dictionary, indicated by the letter ’C’, the list of terms is defined a priori. In the case of an open dictionary, indicated by the letter ’A’, the list of terms is also defined a priori, but it can be expanded by the cataloguer when filling out the element. In all other cases, elements are considered free-text fields.

• visibility: in order to regulate the public diffusion of catalogued data, each element has a predefined level of visibility, according to the possibility that the element may or may not contain confidential information. The available levels of visibility are specified in the following schema:

1 - low level of confidentiality: the piece of information is made publicly available

2 - medium level of confidentiality: privacy concerns; personal data regarding private entities

3 - high level of confidentiality: privacy and safeguarding concerns; data are considered confidential because they allow the precise localization of the cultural property

Table 2.1: ICCD standard: “schede di catalogo”. Fields’ visibility levels.

The effectiveness of these three levels of visibility is related to a specific access profile value assigned to each cultural good by the cataloguer. The access profile value gets specified when the cultural good is stored in the database, within the following:

• paragraph: “AD - ACCESSO AI DATI”

• structured field: “ADS”

• subfield: “ADSP”

A closed dictionary taking on the values ’1’, ’2’, ’3’ is used for specifying the access profile value. The meaning of these values is described below:

• ’1’: the content of all fields can be made publicly available, irrespective of their own visibility level;

• ’2’: only the content of fields with a visibility level equal to ’1’ or ’3’ can be made publicly available;

• ’3’: only the content of fields with a visibility level equal to ’1’ can be made publicly available;

A particular case is represented by elements having a visibility level equal to ’0’. This is the case for data specifying economic estimates or for stocktaking-related information. Fields or subfields with a visibility level equal to ’0’ are never released publicly. These rules are summarized in the sketch below.
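As an illustration only, the following JavaScript sketch encodes the visibility rules just described; the function name and the encoding of the rules are ours, not part of the ICCD standard.

```javascript
// Sketch: decide whether a field may be published, given its visibility
// level (0-3) and the access profile value ('1'..'3') of the cultural good.
// Encodes the rules of Section 2.3 as we read them; illustrative only.
function isFieldPublic(fieldVisibility, accessProfile) {
  if (fieldVisibility === 0) return false;          // never released publicly
  switch (accessProfile) {
    case '1': return true;                          // every field is public
    case '2': return fieldVisibility === 1 || fieldVisibility === 3;
    case '3': return fieldVisibility === 1;
    default: throw new Error('Unknown access profile: ' + accessProfile);
  }
}

// Example: with access profile '3', localization data (visibility 3) stay hidden.
console.log(isFieldPublic(3, '3')); // false
console.log(isFieldPublic(1, '3')); // true
```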

The standard just presented has been used for different purposes in the development of the solution; more on this later.


2.4 ICCD open-data

Around May 2016, the ICCD launched a project [3] which aims at sharing (raw) data about the cultural goods being catalogued. The open data being exposed constitute a subset of all the information stored and available in the “Catalogo Generale dei Beni Culturali”.

In particular, the datasets refer only to state-owned goods and are organized according to the following three criteria:

• the region in which the properties are located;

The other two criteria are related, instead, to the ICCD standard “scheda di catalogo” used for describing cultural property. These other two criteria are:

• the typology: one of the thirty (30) typologies presented in Figure 2.2;

• the version number: either 2.00 or 3.00;

2.5 Gathering ICCD open-data

Gathering ICCD open-data has been the very first step: in this way, a clone of the “Catalogo Generale dei Beni Culturali” could be created. As already mentioned, the exposed open data refer to state-owned goods only, hence it was not possible to create a full clone.

That, however, was not even a requirement: we just needed data to work on.

For this reason, ICCD open-data related to one specific region only have been considered. The selected region was, naturally, Tuscany. The available open data refer to the following typologies:

• archeological exhibits: described by means of the ICCD standard “scheda di catalogo” RA according to the version 2.00 and 3.00;

• artworks: described by means of the ICCD standard “scheda di catalogo” OA according to the version 2.00 and 3.00;


In both cases, as shown in the previous Figure 2.2, the currently used version is 3.00. For this reason, we gathered open data [4][5] related to this specific version.

Figure 2.4: Gathered ICCD open data

At that point, ICCD open data were available in the form of two .csv files:

• regione-toscana_RA3.00.csv: containing archeological exhibits;

• regione-toscana_OA3.00_0.csv: containing artworks;

2.5.1 ICCD Standard: “schede di catalogo” OA/RA version 3.00

In order to understand the meaning of each field available in the two gathered .csv open-data files, an in-depth understanding of both ICCD standards, “schede di catalogo” OA/RA version 3.00 [6][7], was required.


Chapter 3

From ICCD open-data to digital archives

Once ICCD open-data were gathered, the very next step was to actually create the reduced-size clone of the “Catalogo Generale dei Beni Culturali”. In other terms, ICCD open data were required to be stored in some kind of database.

We knew that, because of the CAP theorem, there’s no way to create a fully distributed database which ensures, at the same time, consistency, availability and partition tolerance.

At that point, a choice had to be made. With the aim of designing a distributed solution, the selection of the kind of database to use was restricted to NoSQL databases, since strictly SQL DBMSs primarily provide consistency.

In the end, the choice was to use a document-store NoSQL database called Elasticsearch.

3.1 The “ELK ” stack

The general idea behind the creation of the reduced-size clone is to:

• retrieve records from each of the two .csv files in which open data are stored;

• parse the retrieved records according to the involved ICCD standard;

• store the results into a specific Elasticsearch index.


For this purpose we relied upon the "ELK " stack.

Figure 3.1: “ELK” stack

“ELK” is the acronym for three open-source projects: Elasticsearch, Logstash, and Kibana. Specifically:

• Logstash is a server-side data processing pipeline that ingests data from multiple sources (like a .csv file, for example), transforms it, and then sends it to a “stash”, like Elasticsearch.

• Elasticsearch allows storing, searching, and analyzing data;

• Kibana lets users visualize Elasticsearch data and navigate the Elastic Stack.

3.2 Elasticsearch indices as digital archives

When a piece of information is stored into Elasticsearch, it is stored in something called an Elasticsearch index.

The latter is a collection of documents that have somewhat similar characteristics. An index is identified by a name, and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

A document is a basic unit of information that can be indexed. It is expressed in JSON, a ubiquitous internet data interchange format.

The collection of one or more nodes that together hold the entire data and provide federated indexing and search capabilities is called an Elasticsearch cluster. A cluster is identified by a unique name, which by default is “elasticsearch”.

This said, the idea was to store the result of the parsing of each record available in the .csv files into a specific Elasticsearch index. Specifically, we have used two Elasticsearch indices.


One was aimed at storing artworks, the other at storing archeological exhibits.

Basically, each of the two Elasticsearch indices represents a digital archive that stores a structured collection of data. In practice, in order to actually accomplish this task, a dev-environment has been set up.
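To make the index/document terminology concrete, here is a minimal sketch (ours, with invented field values) of how a parsed record could be indexed into the “oa3_0” index through Elasticsearch’s REST document API, assuming a version where documents live under the _doc endpoint:

```javascript
// Sketch: index one parsed record as a JSON document into the "oa3_0" index
// via the standard document API (PUT /<index>/_doc/<id>).
// The field values below are invented for illustration.
const doc = {
  CODICE_UNIVOCO: '0900000001',  // national unique identifier (made-up value)
  OGTD: 'dipinto',               // object definition
  LDCN: 'Galleria degli Uffizi'  // location name
};

fetch('http://localhost:9200/oa3_0/_doc/0900000001', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(doc)
})
  .then(res => res.json())
  .then(body => console.log(body.result)); // "created" or "updated"
```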

3.3 Deploying a containerized dev-environment with Docker

What we needed at that time was a dev-environment with the “ELK” stack available. To this aim, the “ELK” stack has been deployed within a containerized dev-environment with Docker. In practice, not only the “ELK” stack, but the entire architectural solution, which will be presented in Chapter 6.2, has been deployed within this environment. Building and deploying applications with Docker containers has proven to be faster:

• Docker containers wrap up software and its dependencies into a standardized unit for software development that includes everything needed to run: code, runtime, system tools and libraries. This guarantees that containerized applications always run the same whatever the underlying computing environment.

• Different containers can run on the same machine, sharing the OS kernel, each running as an isolated user-space process.

In our specific case, what we needed to build was a multi-container Docker application. For this reason, the Docker Compose tool has been used.

Specifically, not only the different pieces that strictly make up the architectural solution, but also all the additional (but required) services were defined in the ad-hoc created docker-compose.yaml file.

This was indeed the case for the two additional services aimed at creating the reduced size clone. To be precise, the task accomplished by each of these two services consisted in:

• retrieving records from a specific .csv file;

• parsing the retrieved records according to the involved ICCD standard;

• storing the results into the specific Elasticsearch index.

These two services have been called “openiccd_oa_migration” and “openiccd_ra_migration”. Each of the two represents a Logstash instance aimed at the purposes just mentioned. The services, as defined in the docker-compose.yaml file, are shown in Figure 3.2.


Figure 3.2: docker-compose.yaml: openiccd_oa_migration and openiccd_ra_migration services

These two represent the “L” part of the “ELK” stack. Of course, the two remaining components were required too. For this reason, two other services were defined, named “elasticsearch” and “kibana”.

Figure 3.3: docker-compose.yaml: elasticsearch and kibana services

The “elasticsearch” service represents the Elasticsearch endpoint, running on localhost:9200, in which ICCD open data get stored, searched and retrieved.

The “kibana” service was used, instead, as a window into Elasticsearch. It runs as a web application at localhost:5601.

It is worth highlighting that these two services are part of the architectural solution itself, i.e. they are not additional services.


3.4 Storing artworks: the “oa3_0” Elasticsearch index

The openiccd_oa_migration service executes according to an ad-hoc defined Logstash .conf file called “logstash-oa3_0.conf”. The behavior of the Logstash pipeline defined in this file is outlined below:

• retrieve records from regione-toscana_OA3.00_0.csv;

• parse retrieved records according to the ICCD standard “schede di catalogo” OA version 3.00;

• store the results into a specific Elasticsearch index, called “oa3_0 ”.

Figure 3.4: openiccd_oa_migration service

regione-toscana_OA3.00_0.csv comes with 75,585 records, 51,934 images in the form of URLs, and 46 attributes for each record.

What we expected from the service was to store all 75,585 records in the “oa3_0” Elasticsearch index.

Before presenting the result of the execution, it must be highlighted that, because images came in the form of URLs, we were required to download the images too. This was necessary in order to store into Elasticsearch the complete set of information related to each record.

Indeed, the output element of the Logstash pipeline was programmed to download an image, from the specified URL, every time a record with a value in the “IMG” field was retrieved from the .csv.

In general, binary data can be stored into Elasticsearch. However, the latter is not optimized for dealing with this type of data. For this reason, the idea was to store images in some form of distributed file system, which we will present in Chapter 5.1.2.1. Therefore, what we stored in Elasticsearch were the corresponding identifiers only.

However, since there was no way to add images to the chosen distributed file system from within the openiccd_oa_migration service, the two tasks just mentioned had to be performed later in time.

This meant the following: a way to link each downloaded image with the Elasticsearch document in which its identifier had to be stored was required.

In this regard, it is worth highlighting that each record in the .csv files comes with an identifier, namely the value of the attribute “CODICE_UNIVOCO”. This identifier uniquely identifies the specific cultural good at the national level.


Hence, the solution was to compute a digest of this identifier and use this digest both as the image name and as the “document_id” in Elasticsearch. Basically, each document in Elasticsearch has a “document_id” which uniquely identifies it.

The original image extension was also stored in Elasticsearch, so as to be able to correctly render the image when required.

While computing digests, however, the code raised the following exception:

Figure 3.5: Ruby runtime exception during data migration

At that point, I looked at the Logstash fingerprint filter plugin codebase and found out, from the last commit message, that support for non-keyed regular hash functions had been added only a few months earlier, as shown in the following:

Figure 3.6: Logstash fingerprint filter plugin commit message

For this reason, in order to solve the issue, I generated a 128-bit HMAC key by relying on a CSPRNG, so as to use a keyed hash function.


Specifically I relied on the “crypto” module exposed by Node.js. The following script was developed to accomplish this task:

Figure 3.7: Generating a 128-bit HMAC key
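A minimal sketch of what such a key-generation script could look like, relying on the CSPRNG behind crypto.randomBytes (the hex output format is an assumption):

```javascript
// Sketch: generate a 128-bit (16-byte) HMAC key with Node.js' CSPRNG.
// The hex encoding is an assumption; any encoding accepted by the Logstash
// fingerprint filter's "key" option would do.
const crypto = require('crypto');

const hmacKey = crypto.randomBytes(16).toString('hex'); // 128 bits of entropy
console.log(hmacKey); // value to paste into the fingerprint filter's key option
```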

At that point, the HMAC key was specified in the Logstash .conf file and indeed the exception was not raised anymore. In the end we had:

• images downloaded and stored into a folder.

• data, obtained from the parsing of the remaining fields of each record, indexed into the “oa3_0” Elasticsearch index.

These two items explain the rightmost part of Figure 3.4; the reason behind the icon representing a JSON file will be explained in Chapter 6. It is worth pointing out that, even though each record was described by “only” 46 attributes (out of roughly 400 defined in the ICCD standard “schede di catalogo” OA version 3.00), an Elasticsearch template was defined with a “mappings” section (as well as a “settings” section and a “simple pattern template”) that reflects exactly what the standard defines.

A mapping defines how a document and the fields it contains are stored and indexed. It is also usually referred to as the schema.


The template gets automatically applied as soon as the first record gets indexed into Elasticsearch, as shown below:

Figure 3.8: openiccd_oa_migration service: Template being installed

Now we can actually present the result of the execution: data migration ended successfully, the Elasticsearch index “oa3_0” was created and 75,582 out of 75,585 records were indexed. The 3 records which were not indexed did not respect the .csv format. A summary is reported in the following figures:


Figure 3.10: Available Elasticsearch indices after artworks opendata migration

Figure 3.11: Template applied with “oa3_0 ” index

It is fair to say that during this run images were not actually downloaded, so as to avoid storing more than 50K files on the machine. Once the evidence of all records being correctly indexed was obtained, we selected just the first 100 records and re-ran the migration, this time letting images be downloaded too. Migration results are shown below:

Figure 3.12: openiccd_oa_migration service: Number of successfully downloaded images (first 100 records considered)


Figure 3.13: openiccd_oa_migration service: Number of successfully indexed documents (first 100 records considered)

3.5 Storing archeological exhibits: the “ra3_0” Elasticsearch index

The openiccd_ra_migration service executes according to a second, ad-hoc created Logstash .conf file, called “logstash-ra3_0.conf”. The behavior of the Logstash pipeline defined in this file is outlined below:

• retrieve records from regione-toscana_RA3.00_0.csv;

• parse retrieved records according to the ICCD standard “schede di catalogo” RA version 3.00;

• store the results in the “ra3_0 ” Elasticsearch index.

Figure 3.14: openiccd_ra_migration service

regione-toscana_RA3.00_0.csv comes with 77 records, 43 images in the form of URLs, and 46 attributes for each record.

What we expected from the openiccd_ra_migration service, instead, was to index all 77 records in the “ra3_0” Elasticsearch index.


Also in this case, an Elasticsearch template with a “settings” section, a “mappings” section and a “simple pattern template” was defined.

The mapping in this case was defined so as to resemble the ICCD standard “schede di catalogo” RA version 3.00.

As in the previous case, the template gets applied as soon as the first record gets indexed into the “ra3_0” Elasticsearch index, as can be seen in the figure below:

Figure 3.15: openiccd_ra_migration service: Template being installed

The result of the execution is presented below: also in this case data migration ended successfully, i.e. the “ra3_0” Elasticsearch index was created, all 77 raw records were indexed and all 43 images were downloaded.


Figure 3.17: openiccd_ra_migration service: Number of successfully downloaded images

Figure 3.18: Available indices into Elasticsearch after archeological exhibit open data migration


Chapter 4

The blockchain and its use for digital archives

The aim of this chapter is to briefly present what a blockchain is and its main underlying concepts. Special attention is then paid to “new generation” blockchains (aka blockchain 2.0) and, in particular, to Ethereum.

4.1 Blockchain in a nutshell

Roughly at the end of 2008, Satoshi Nakamoto came up with the idea of a peer-to-peer digital currency named Bitcoin. At that point, some kind of database was required in order to record how much money everyone has at each point in time, that is, how much money anyone is allowed to spend at any given moment.

Solving this issue in a decentralized way, in the spirit of systems like BitTorrent, was actually a very hard computer science problem: even without a central authority, trust and security still had to be guaranteed.

In order to have security, the idea of economic incentives came up; in order to have economic incentives, a currency was required. Nakamoto came up with the first solution that was really practical in this kind of open, permissionless context. The solution was named Nakamoto’s blockchain. It enabled peer-to-peer transactions in a decentralized network and, at the same time, established trust among unknown peers.

From a data structure point of view, a blockchain can be seen as a linked list built using hash pointers. The way this linked list is built is such that if someone alters the data stored in any of the elements of the list, called blocks, the modification gets detected: the hash pointer in the following block becomes incorrect, and so on down the chain, as the sketch below illustrates.
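A toy example (ours, not taken from any client implementation) linking blocks through SHA-256 hash pointers and detecting tampering:

```javascript
// Sketch: a toy hash-pointer chain. Each block stores the hash of the
// previous block, so altering any block's data invalidates all later links.
const crypto = require('crypto');

const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

function appendBlock(chain, data) {
  const prevHash = chain.length ? chain[chain.length - 1].hash : '0'.repeat(64);
  chain.push({ data, prevHash, hash: sha256(prevHash + data) });
}

function verify(chain) {
  return chain.every((b, i) => {
    const prevHash = i === 0 ? '0'.repeat(64) : chain[i - 1].hash;
    return b.prevHash === prevHash && b.hash === sha256(prevHash + b.data);
  });
}

const chain = [];
appendBlock(chain, 'tx: A pays B');
appendBlock(chain, 'tx: B pays C');
console.log(verify(chain)); // true
chain[0].data = 'tx: A pays D'; // tamper with an old block...
console.log(verify(chain)); // false: the modification is detected
```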

From a high-level point of view, a blockchain can be seen as a decentralized, distributed database of immutable transactions. Every new transaction is validated across the distributed network before it is stored in a block.


In other terms, a blockchain provides secure computation in a sandboxed environment. It has to be a sandbox, since every node has to compute exactly the same output starting from the same input, in a way that is completely deterministic. Transactions are moreover protected by strong cryptographic algorithms.

What a blockchain aims to provide is a shared single source of truth. The truth is shared because it is distributed among all the nodes in the network: every participant stores its own copy of it.

Participants are encouraged to keep their copies synchronized by an economic incentive mechanism, which steers participants into following the rules defined by the protocol.

4.2 From Nakamoto’s blockchain to the Ethereum blockchain

The problem with Bitcoin is that it is essentially a value transfer system: anything more complex than that is hard to do with that specific blockchain.

At the same time, having a dedicated blockchain for each specific use case did not seem to be a good approach, even though this happened for a while after Bitcoin came out.

Then, in 2015, Ethereum [8] came out. It is part of the “new generation” of blockchains, aka blockchain 2.0; actually, it can be considered their most prominent example.

Ethereum defines itself as the world computer, in the sense that applications, on top of it, run on a single worldwide network of nodes, meaning that it is possible to execute code in an auditable way.

It is basically a blockchain, i.e. a linked list of blocks, but with a few additions; in particular, it has a built-in programming language which is Turing complete. This language is essentially a hybrid between a standard virtual machine architecture and extended Bitcoin scripts. The extended scripting features form a full-blown code execution framework, called a smart contract.

The point is that people can write their own programs and let them run on the Ethereum network, i.e. on the world computer. Programs can be written either directly as scripts or, more realistically, in a high-level language, such as Solidity [9], that gets compiled down to bytecode. The bytecode then gets inserted into a transaction which is sent off to the blockchain. When the transaction gets confirmed, an address (20 bytes) is generated and, at that point, a special kind of account called a contract is available.

Anyone can create an application with any rules by defining it as a contract. Once the application has been created, anyone can interact with it by sending transactions to it, specifying the contract address as the destination address of the transaction. This is the same mechanism used in Bitcoin for sending money to someone else; indeed, Ethereum also allows sending Ethers among participants in the network.

Finally, as in Bitcoin, every node on the Ethereum blockchain runs every transaction. If the transaction goes to a contract it means that every node runs the contract’s code.


4.3 Ethereum accounts: Externally Owned Accounts vs contracts

In Ethereum there are two types of accounts: externally owned accounts (EOAs) and contract accounts. EOAs are controlled by private keys, which means that EOAs use public-key cryptography to sign transactions.

Contracts, instead, are controlled by code. They are basically automata sitting on the blockchain, executing exactly as the code on the blockchain says.

4.4 Ethereum state

In Bitcoin the state is simply represented by the set of unspent transaction outputs (UTXOs).

In Ethereum, the state is more complex: it consists of a key-value mapping from addresses to account objects. Both EOAs and contracts are account objects, and both have a state.

The state of all accounts is the state of the Ethereum network. The state gets updated with every block.

In the case of an EOA, the state is simply represented by a nonce and a balance.

In the case of a contract, the state consists of two extra fields (in addition to a nonce and a balance): the “code hash”, that is, the hash of the code of that particular contract, and the “storage trie root”, that is, the Merkle root hash of the persistent storage of that particular account. A sketch of the two shapes follows.
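As a plain illustration of the two account shapes (field names are ours; real clients store RLP-encoded structures, not JSON):

```javascript
// Sketch: the two kinds of account objects in the Ethereum state.
// Field names and values are illustrative only.
const eoa = {
  nonce: 7,                 // number of transactions sent from this account
  balance: '1.5 ether'
};

const contractAccount = {
  nonce: 1,
  balance: '0 ether',
  codeHash: '0x3f…',        // hash of the contract's EVM bytecode (elided)
  storageRoot: '0x9a…'      // Merkle root of the contract's storage trie (elided)
};
```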

4.5 How code is executed in Ethereum

Every transaction specifies a “to” address it is sending to; the only exception is a transaction creating a new contract. If the “to” address refers to an EOA, ethers are just being moved around.

If the “to” address refers to a contract, the transaction activates the contract’s code and actually lets it run. The contract’s code gets executed within the Ethereum Virtual Machine (EVM), so as to allow any node to execute the code independently of the underlying hardware or OS. Inside the EVM there are:

• a “stack”: 32-byte fields, up to a maximum of 1024 of them;

• “memory”: a variable-length byte array;

• “contract storage”: permanent storage;

• “environment variables”;

• “logs”;

• “sub-calling”: an opcode letting the EVM call other contracts’ functions from within a contract.

As mentioned, there is no need to write EVM bytecode directly: there are high-level languages and compilers that can be used to compile smart contracts into bytecode. Among these high-level languages there is Solidity, which is now the de-facto standard programming language for smart contract development.

4.6 The application binary interface: ABI

The ABI basically represents the standard way used to interact with contracts in the Ethereum ecosystem.

Functions exposed by smart contracts can be called either from an EOA or from a contract account.

Function calls are compiled into transaction data, aka calldata, i.e. a sequence of bytes. The first four bytes represent the function selector, which identifies the function to be called: they are the first four bytes of the Keccak-256 hash of the function’s signature. The rest of the calldata carries the arguments, in chunks of 32 bytes.
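For instance, assuming the js-sha3 package for Keccak-256 (Node's built-in crypto module does not expose Keccak), the selector of the well-known ERC-20 transfer function can be computed as follows:

```javascript
// Sketch: computing a function selector, i.e. the first 4 bytes of the
// Keccak-256 hash of the canonical function signature.
const { keccak256 } = require('js-sha3'); // npm install js-sha3

const signature = 'transfer(address,uint256)';
const selector = keccak256(signature).slice(0, 8); // 4 bytes = 8 hex chars
console.log('0x' + selector); // 0xa9059cbb
```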

4.7 The key mechanism underlying the security of Ethereum: Gas

Having untrusted parties create pieces of code, in the form of smart contracts, that are executed in a Turing-complete programming language by every single computer in the entire network runs into the so-called halting problem.

It has been proven mathematically impossible to create an algorithm that can tell, for all possible inputs, whether or not a given program will keep on running forever.

For this reason, Ethereum implements the concept of “gas”, so as to prevent attackers from creating programs that just keep on running on everyone’s computer forever.

The idea is to charge a fee for every computational step that the contract code execution takes. The way it is implemented is that:

• each block in the Ethereum blockchain has a “gas limit”, so basically a limit on computational steps. The “gas limit” is set by a simple voting mechanism every time a miner mines a new block: in particular, the winning miner can upvote or downvote the current “gas limit” by a maximum factor of 1/1024.

• each transaction, moreover, specifies a “gas price”, which is the amount of ether that the user is willing to pay per unit of gas (that is, per unit of computational step). At the same time, a transaction “gas limit”, that is, the maximum amount of gas the transaction is allowed to consume, also gets specified automatically, based on estimates.

If the execution takes less than the specified gas limit, the user is charged only for the amount of ether that the transaction execution actually required. This amount is obtained as the product between the specified “gas price” and the actual amount of “gas” the transaction execution required.

If the execution instead takes more than the specified gas limit, the transaction is rolled back, that is, all of its effects are reverted, but an amount of ether equal to the product of “gas price” and “gas limit” is paid anyway. The sketch below makes the two cases concrete.
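A minimal sketch of the fee rule just described (the numeric values are invented):

```javascript
// Sketch: the fee charged to the sender under the gas mechanism.
// If execution stays within gasLimit, pay gasPrice * gasUsed;
// if it runs out of gas, effects revert but gasPrice * gasLimit is paid.
function feeWei(gasPrice, gasLimit, gasUsed) {
  return gasUsed <= gasLimit ? gasPrice * gasUsed : gasPrice * gasLimit;
}

const gwei = 10n ** 9n;
const gasPrice = 10n * gwei;                      // 10 Gwei, as in the evaluation below
console.log(feeWei(gasPrice, 100000n, 21000n));   // within limit: 21000 gas paid
console.log(feeWei(gasPrice, 100000n, 150000n));  // out of gas: full 100000 gas paid
```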

Finally, it is important to underline that there is no 1-to-1 mapping between a unit of gas and a unit of computational work: the gas system is intrinsically a set of incentives designed to encourage people to use the resources of the Ethereum Virtual Machine responsibly.

4.8 Logs vs Storage

Every time a transaction gets confirmed and added to a block, a “receipt” gets created. Each “receipt” is basically an object which contains different pieces of data:

• “intermediate state root”: a hash which represents the entire Ethereum state after the transaction has been executed;

• “cumulative gas used”: the total amount of gas used in that particular block including that transaction;

• “logs”: a different kind of storage. In the general case, every time a variable is set in a smart contract, a key is set in the contract’s own storage. Each contract’s storage can be read by other contracts, but can be written only by the contract itself. Logs are a different kind of storage in the sense that they are append-only: they get generated by the Ethereum Virtual Machine (EVM) every time an “event” is emitted during code execution. Logs cannot be accessed by (any) contract, and they appear just in that particular block, not in any kind of persistent state. For this reason, logs are, on average, roughly 10x cheaper than storage. Each log has a “data” field and up to four “topics”; topics are intended to allow efficient client access to logs, as sketched below.
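As an illustration of how a client can retrieve logs by topic, here is a minimal sketch using the standard eth_getLogs JSON-RPC method against a local node (the address and topic values are placeholders, not real values):

```javascript
// Sketch: fetching logs emitted by a contract via the eth_getLogs JSON-RPC
// call. The first topic of an event log is the Keccak-256 hash of the event
// signature; a null topic matches anything.
const payload = {
  jsonrpc: '2.0',
  id: 1,
  method: 'eth_getLogs',
  params: [{
    fromBlock: '0x0',
    toBlock: 'latest',
    address: '0x0000000000000000000000000000000000000000', // placeholder
    topics: [null] // or the hashed event signature to filter on
  }]
};

fetch('http://localhost:8545', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload)
})
  .then(res => res.json())
  .then(({ result }) => console.log(result)); // array of { data, topics, ... }
```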


4.9 The role of the blockchain for digital archives

So why use the blockchain for the specific case of digital archives?

Well, the answer lies in its core value proposition: a shared source of truth, non-repudiable and censorship-resistant; in other terms, a layer of information that everybody trusts.

Moreover, being intrinsically distributed, with all the information replicated on all the nodes of the network, it gives the guarantee that data are preserved over time: if a failure occurs on a node of the network, information is not lost.

Furthermore it guarantees also decentralization, in the sense that data are not hosted and managed by a single central authority.

Finally, it provides authenticity and integrity because of its intrinsic, cryptography-based nature.


Chapter 5

A blockchain-based approach for digital archives

At the end of Chapter 3 we were left with two Elasticsearch indices, “oa3_0” and “ra3_0”, storing, respectively, artworks and archeological exhibits.

Then, in Chapter 4, we’ve presented the main benefits that a blockchain-based solution can provide in the specific case of digital archives.

One could wonder why Ethereum was specifically presented in that chapter. It did not come about by chance: it was presented because it is one component of the architectural solution.

So, on one hand we have Elasticsearch with the two above-mentioned indices; on the other hand, we have the Ethereum blockchain.

What this chapter aims to present is the way in which Ethereum has been used in order to develop a distributed solution which also ensures, for the considered use case, authenticity, privacy, integrity and long-term data availability.

5.1 Towards a distributed and efficient solution

The problem we needed to solve concerned a way to integrate, somehow, our available digital archives with the Ethereum blockchain.

Specifically, we needed a way to integrate Elasticsearch with Ethereum. We made several attempts before obtaining a solution viable both from a performance and from an economic viewpoint.

Before presenting them, it is foremost important to explain the assumption that has been made: nodes within the Elasticsearch cluster are not Byzantine. This is not generally the case: strictly speaking, no currently available database implementation has been proven to be “Byzantine Fault Tolerant”.

Indeed, the “Byzantine Generals Problem” has actually been addressed only in the blockchain space, and by means of a probabilistic approach.


Without this assumption, and without considering any possible optimization, we would have had to retrieve data from the blockchain and recreate our Elasticsearch indices every time a search, update or delete operation was issued. In other terms, we would have been left with a non-trivial problem to solve.

With this assumption, instead, we guaranteed two things:

• In the first place, data stored in the cluster do not get altered or deleted. For this specific reason, the data associated with each new insert, update or delete operation, regarding either an artwork or an archeological exhibit, can be stored on Elasticsearch before involving the blockchain in any way.

• The available Elasticsearch indices, that is, our digital archives, get reconstructed only in the case of node failures within the Elasticsearch cluster.

These two points have to be kept in mind from now on.

5.1.1 First attempt: (raw) Archive contents accounting on the blockchain

This first attempt is based on the idea of storing, after each Elasticsearch insert/update/delete operation, the new archive contents, as a whole, on the blockchain.

This can be achieved in the following way: Elasticsearch exposes a specific API, called snapshot, which basically resembles a mysqldump and can be used to obtain a backup of the entire Elasticsearch cluster.

This backup can be serialized and the result of the serialization can then be stored on the blockchain.

As soon as there is a node failure within the Elasticsearch cluster, the raw data stored on the blockchain can be retrieved and deserialized, and the result of this deserialization process can be used to restore the snapshot with a specific Elasticsearch API, called restore.

By default, all indices in the snapshot get restored. This behavior can be achieved using the smart contract shown in Figure 5.1.
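Alongside the contract of Figure 5.1, here is a minimal JavaScript sketch of the flow just described, assuming ethers.js v6, a pre-registered filesystem snapshot repository named "backup", and a deployed contract exposing storeRawData(bytes) as discussed later in this section (repository name and deployment details are our assumptions):

```javascript
// Sketch: first attempt - snapshot the whole cluster, serialize it and send
// it to the contract's storeRawData(bytes). Illustrative only; a real
// snapshot lives in a shared repository, not in the HTTP response body.
const { ethers } = require('ethers'); // npm install ethers

async function accountWholeArchive(contractAddress, signer) {
  // 1. take a snapshot of the entire cluster (Elasticsearch snapshot API)
  await fetch('http://localhost:9200/_snapshot/backup/snap_1?wait_for_completion=true',
    { method: 'PUT' });

  // 2. serialize the snapshot contents (here: naively, as UTF-8 bytes)
  const snapshot = await fetch('http://localhost:9200/_snapshot/backup/snap_1');
  const raw = ethers.toUtf8Bytes(JSON.stringify(await snapshot.json()));

  // 3. store the serialized archive on-chain (this is the expensive step)
  const abi = ['function storeRawData(bytes _rawData)'];
  const contract = new ethers.Contract(contractAddress, abi, signer);
  await contract.storeRawData(raw);
}
```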


We can immediately notice that this approach is certainly not efficient from a performance point of view, in particular in terms of experienced latency, since it requires retrieving all the Elasticsearch documents, i.e. artworks and archeological exhibits, stored in the nodes of the Elasticsearch cluster on every insert/update/delete operation.

Now, one may wonder whether the solution is at least efficient from an economic perspective. In order to perform this evaluation, we faked storing 1024 bytes of random data obtained from the random.org [10] web application; the randomness comes from atmospheric noise. The obtained data are presented in the following figures:

Figure 5.2: Generating 1024 bytes of random data

Figure 5.3: Random data obtained

At that point, in order to perform the evaluation, the “Remix” IDE [11] was used.

The latter is a web-based integrated development environment to create, deploy, execute and explore the workings of smart contracts on the Ethereum blockchain.


Specifically, it supports three runtime test environments, among which the “Javascript VM” was used.

The very first step was to compile and deploy the contract shown in Figure 5.1. In this regard, it is fair to also consider the cost to be paid for the deployment of the smart contract.

The contract creation transaction is reported below:

Figure 5.4: Contract creation transaction for this first attempt

As can be seen from the above figure, the overall cost was 232,699 gas.

In order to understand where this value comes from, it is necessary to present the Ethereum fee schedule, as reported in the yellow paper [12]:


The overall cost of a transaction is basically the sum of two inner costs: the “execution cost” and the “transaction cost”. The former is the cost to be paid for the execution of a transaction, in terms of required computational steps. The latter, instead, is the cost of sending data to the blockchain.

Actually, Remix uses the label “transaction cost” to indicate the overall transaction cost. This means that the specific transaction cost can be obtained by simply subtracting the execution cost from the overall cost.

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited (the input field in Figure 5.4).

The specific transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code of the transaction;

• 4 gas paid for every zero byte of code of the transaction.

Let us see whether the specific transaction cost, following the above reasoning, matches the expected value, that is: 232,699 − 135,583 = 97,116 gas.

In this specific case, the generated bytecode was the following:

Figure 5.6: EVM bytecode related to the smart contract in Figure 5.4

It accounts for 709 bytes. In the figure, each zero byte has been highlighted; there are 64 of them.

So, following the reasoning, the specific transaction cost is: 32000 + 21000 + 68 × 645 + 64 × 4 = 97,116 gas (the bytecode contains 709 − 64 = 645 non-zero bytes), which is the value that we expected.

Considering a “gas price” equal to 10 Gwei (confirmation time < 2 min on 2018-10-16) and the exchange rate at that specific time (183 Euro/ETH on 2018-10-16), the actual economic cost of the contract deployment is: 232,699 × 10^-8 × 183 ≈ 0.43 Euro.


Soon after, the random data obtained from the random.org website were used to call the smart contract’s method storeRawData(bytes _rawData), hence faking the storage of 1 KB of cultural goods data.

The result of the transaction being confirmed is shown below:

Figure 5.7: storeRawData(bytes _rawData) invocation cost

The execution cost of storeRawData(bytes _rawData) was mainly due to:

• 20 Kgas paid for each SSTORE operation. Since 1024 bytes were sent and each SSTORE stores 32 bytes, this amounts to 32 SSTORE operations, i.e. 640 Kgas;

• 23,040 gas due to operations of the sets W_base, W_verylow, W_low, W_mid and W_high.

The transaction cost, instead, was mainly due to:

• 21 Kgas paid for the transaction;

• 68 gas paid for every non-zero byte of data of the transaction;

• 4 gas paid for every zero byte of data of the transaction.

Considering a “gas price” equal to 10 Gwei (confirmation time < 2 min on 2018-10-16) and the exchange rate at that specific time (183 Euro/ETH on 2018-10-16), the actual economic cost of storing 1 KB of data was: 754,072 × 10^-8 × 183 ≈ 1.38 Euro. That means more than one million Euro per GB.

Things get even worse if we consider that this cost would need to be paid on each call to storeRawData(bytes _rawData).

A scenario like this is not at all unrealistic; indeed, if we just consider the specific case of the regione-toscana_OA3.00_0.csv file, we have:


• roughly 80K records and 52K images in the form of URLs, which overall account for more than 100 MB. Assuming an average image size of 150 KB, we are already way, way above the considered scenario;

• in practice, roughly 3 million records are actually stored inside the “Catalogo Generale dei Beni Culturali”.

It is right to underline that a read operation, instead, is always free of charge, as long as it comes from an EOA.

In this specific case, a read operation means a call to the smart contract’s method retrieveRawData(). For this reason, no economic evaluation is performed for reads.

To conclude, we can state that this approach is infeasible both from an economic and from a performance perspective.

5.1.2 Second attempt: Off-chain archive contents storage + Merkle DAG accounting on the blockchain

In order to present this solution, it is worth first describing, very briefly, a peer-to-peer file system named the InterPlanetary File System [13].

5.1.2.1 The Interplanetary File System: IPFS

The InterPlanetary File System (IPFS) is a content-addressed, versioned, peer-to-peer file system. It is a synthesis of well-tested internet technologies such as distributed hash tables (DHTs), the Git versioning system and BitTorrent.

The DHT is used in IPFS for routing, in other words to announce added data to the network and help locate data that is requested by any node.

Small values (equal to or less than 1 KB) are stored directly on the DHT. For larger values, the DHT stores references, which are the node identifiers of peers who can serve the block. Once a PKI key pair is generated, the node identifier is obtained by hashing the public key.

The exchange of objects in IPFS is inspired by BitTorrent, but is not 100% BitTorrent. An IPFS object is a data structure with two fields:

• Data - a blob of unstructured binary data of size < 256 kB.

• Links - an array of Link structures. These are indeed links to other IPFS objects.

A Link structure has three data fields:

• Name - the name of the Link.

• Hash - the hash of the linked IPFS object. This is also referred to as the content identifier, aka CID. A CID doesn’t indicate where the content is stored; rather, it forms a kind of address based on the content itself. CIDs can take a few different forms, in the sense of different encoding bases (aka CID versions), because hashes can be represented in different bases. When IPFS was first designed, it used base 58-encoded “multihashes” [14]. If a CID is 46 characters long and starts with “Qm”, it is referred to as CID version 0 or, for short, CIDv0. One important aspect to underline is that all CIDv0 begin with “Qm”. This is because, as mentioned, the hash is actually a “multihash”: the hash itself specifies, in its first two bytes, the hash function used and the length of the digest. CIDv0 have the first two bytes equal to 0x1220, where 0x12 denotes the SHA-256 hash function and 0x20 is the length of the digest in bytes, i.e. 32 bytes.

• Size - the cumulative size of the linked IPFS object, including everything reachable by following its links.

The collection of IPFS objects stored inside an IPFS node results in a Merkle DAG - DAG meaning Directed Acyclic Graph, and Merkle signifying that this is a cryptographically authenticated data structure that uses cryptographic hashes to address content. In general, any difference in the content will produce a different CID, whereas the same piece of content added to two different IPFS nodes, using the same settings, will produce exactly the same CID. Hence IPFS removes duplication across the network.

IPFS can easily represent a file system consisting of files and directories:

• A small file (< 256 kB) is represented by an IPFS object with data being the file contents (plus a small header and footer) and no links, i.e. the links array is empty. The file name is not part of the IPFS object, so two files with different names and the same content will have the same IPFS object representation and hence the same hash.

• A large file (> 256 kB) is represented by a list of links to file chunks that are each < 256 kB, plus a minimal Data field specifying that this object represents a large file. The links to the file chunks have empty strings as names.

• A directory is represented by a list of links to IPFS objects representing files or other directories contained in it. The names of the links are the names of the files and directories.

It is important to underline that each node in the IPFS network stores only content it is interested in. In particular IPFS nodes treat data they store like a cache, meaning that there is no guarantee that the data will continue to be stored.

“Pinning a CID” tells an IPFS node that the data is important and must not be thrown away. Any important content should always be pinned, to ensure that it is retained long-term. When garbage collection is triggered on a node, any pinned content is automatically exempt from deletion; non-pinned data may be deleted.


5.1.2.2 Second attempt: Overview

Having said that, in this second attempt the idea is the following: after each Elasticsearch insert/update/delete operation, we retrieve the new archive contents, as a whole, using the Elasticsearch snapshot API (as in the previous attempt).

Then, unlike the previous case, the snapshot gets “added” (as it is usually called in IPFS jargon) to the InterPlanetary File System, without serializing it.

What we get in return is the immutable, permanent IPFS link, in the sense of a CIDv0, related to the specific Merkle DAG representation of the archive’s state.

At that point the IPFS link gets published on the blockchain. This timestamps and secures the archive’s state itself. This approach is usually referred to as off-chain data storage.

As soon as there’s a node failure within the Elasticsearch cluster, the IPFS link gets retrieved from the blockchain and used to gather the associated archive contents, stored in the form of an Elasticsearch snapshot, from IPFS. At that point, the snapshot can be restored using the Elasticsearch restore API.

This specific behavior can be achieved using the following smart contract:

Figure 5.8: Smart contract used with this second attempt
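Since Figure 5.8 is likewise an image, a minimal sketch of such a contract follows. The method storedbCID(bytes32 _dbCID) is the one named below in the text; the contract name, the state variable and the getter retrievedbCID() are assumptions:

    pragma solidity ^0.4.25;

    // Minimal sketch (assumed names): second attempt, only the snapshot CID on-chain.
    contract ArchiveSnapshot {
        bytes32 private dbCID; // CIDv0 digest with the constant 0x1220 prefix chopped off

        // A bytes32 fits a single EVM word: one SSTORE per update.
        function storedbCID(bytes32 _dbCID) public {
            dbCID = _dbCID;
        }

        // Free of charge when executed as a local call from an EOA.
        function retrievedbCID() public view returns (bytes32) {
            return dbCID;
        }
    }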

As for the previous attempt, we still have a solution that is inefficient in terms of latency: all artworks and archaeological exhibits stored in the nodes of the Elasticsearch cluster must still be retrieved after every insert/update/delete operation.

Let’s see, instead, how much more efficient this off-chain storage approach is from an economic perspective.


Figure 5.9: Contract creation transaction for this second attempt

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited in state (cf. the input field in the figure above);

The transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code for the transaction;

• 4 gas paid for every zero byte of code for the transaction.

Considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of contract deployment was: 116251 × 10⁻⁸ × 183 ≈ 0.21 Euro. In other terms, deploying this solution cost less than half as much as the previous one.

Evaluating, instead, the economic cost of a call to the smart contract’s storedbCID(bytes32 _dbCID) method requires some further explanation.

As previously mentioned, what gets stored on the blockchain is the immutable, permanent IPFS link, in the sense of a CIDv0. In other terms, what we were required to store was a base 58-encoded multihash of 46 characters.

For this reason, in principle, a string datatype variable should have been used. However, the Solidity docs [15] state that:


“As a rule of thumb, use bytes for arbitrary-length raw byte data and string for arbitrary-length string (UTF-8) data. If you can limit the length to a certain number of bytes, always use one of bytes1 to bytes32 because they are much cheaper.”

Therefore, as shown in Figure 5.8, a bytes32 datatype variable was used instead. Since a bytes32 fits in a single word of the EVM, it uses less gas and hence costs less.

To accomplish this, in general, the base 58-encoded multihash first gets converted to its base 16 encoding, and then the first two bytes are chopped off, since they are always the same (0x1220).
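As a worked example (with a placeholder digest): a 46-character CIDv0 of the form Qm… decodes, in base 16, to 34 bytes consisting of 0x1220 followed by the 32-byte SHA-256 digest; dropping the constant 0x1220 prefix leaves exactly the 32-byte digest, which fits a bytes32. Presumably, whenever the CID has to be handed back to IPFS, the client performs the reverse conversion, re-prepending 0x1220 and re-encoding in base 58.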

To evaluate the cost of a storedbCID(bytes32 _dbCID) method invocation, we faked storing 32 bytes of random data obtained, again, from the random.org website.

The result of the transaction being confirmed is shown below:

Figure 5.10: storedbCID(bytes32 _dbCID) invocation cost

The execution cost of storedbCID(bytes32 _dbCID) was mainly due to:

• 20 Kgas paid for the single SSTORE operation;

• 245 gas due to operations belonging to the following Yellow Paper fee-schedule sets: Wbase, Wverylow, Wlow, Wmid, Whigh.

The transaction cost, instead, was mainly due to:

• 21 Kgas paid for the transaction;

• 68 gas paid for every non-zero byte of data for the transaction;

• 4 gas paid for every zero byte of data for the transaction.
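The 43693 gas total shown in Figure 5.10 again decomposes consistently, assuming the 36 bytes of ABI-encoded calldata (4-byte function selector + 32-byte argument) contain no zero bytes, which is likely for random data:

    43693 = (20000 + 245) + 21000 + 68 × 36 gas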


Again, considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of storing 32 bytes of data was: 43693 × 10⁻⁸ × 183 ≈ 0.08 Euro.

The actual overall cost will depend on the number of insert, update, delete operations that will be performed over time. Indeed, for each new operation in this set, the accounting of a new IPFS link on the blockchain is required.

In order to give an estimate, we’ve used reports provided by the ICCD about the number of cultural goods stored each year in the “Catalogo Generale dei Beni Culturali” and their associated cost.

Provided statistics [16] are reported below:

Figure 5.11: ICCD statistics about the “Catalogo Generale dei Beni Culturali” (2002-2009)

Based on the information available in the ICCD report and on the actual number of cultural goods stored, we can state that roughly 1.35 million cultural goods get indexed every eight years.

If we suppose the same rate for the next eight years (2019-2026), the overall cost of this solution will be: 0.08 × 1.35 × 10⁶ = 108 KEuro. It should be noted that, under this hypothesis, we are only considering insert operations.

In conclusion, this solution may be feasible from an economic perspective, but it is still weak from a performance point of view.

5.1.3 Third attempt: Off-chain operations storage + IPFS links accounting on the blockchain

The aim of this solution is to overcome the performance issue affecting all the solutions presented so far, while still relying on an off-chain storage approach, so as to maintain the economic benefits.

The performance concern is due to the fact that, after each Elasticsearch insert/update/delete operation, the new archive contents must be retrieved as a whole in order to be stored off-chain, i.e. added to IPFS. This may lead to high latencies as soon as the size of the archive becomes non-negligible.

Recall that the whole archive contents are represented by the set of all Elasticsearch documents stored in the cluster, that is, the set of all artworks and archaeological exhibits.

Equivalently, this means that the whole archive’s state, at any given time, can be seen as the result of the execution of all Elasticsearch insert/update/delete operations ever performed.

So, basically, the performance issue was solved by adding to IPFS, each time, only the specific operation together with its associated data, instead of the whole archive contents.

Strictly speaking, we didn’t even need to specify the operation, since the data were structured in a way that lets the different operations be distinguished from the data itself.

This kind of solution required, however, some changes with respect to the smart contract presented in the last attempt.

Indeed, this time it is required to store on the blockchain one IPFS link for each insert/update/delete operation that causes associated data to be stored on IPFS.

Hence, a dynamic bytes32 array datatype was required, representing an indefinitely expandable array of IPFS CIDs (version 0).

In case of node failures within the Elasticsearch cluster, all IPFS links get retrieved from the blockchain and used to gather the data stored on IPFS. The operations encoded in the data get re-executed and, in the end, the whole archive content is reconstructed.

The smart contract involved in this specific solution is the following:

Figure 5.12: Smart contract used with this third attempt
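As before, since Figure 5.12 is an image, a minimal sketch of such a contract is given below; storeIpfsLink and retrieveIpfsLinks are the method names cited in the text, while the contract and array names are assumptions:

    pragma solidity ^0.4.25;

    // Minimal sketch (assumed names): third attempt, one CID per operation.
    contract ArchiveOperations {
        bytes32[] private ipfsLinks; // indefinitely expandable array of CIDv0 digests

        // A push writes two storage slots: the new element and the array length.
        function storeIpfsLink(bytes32 _ipfsLink) public {
            ipfsLinks.push(_ipfsLink);
        }

        // The loop makes the gas requirement unbounded in the array length,
        // hence the compiler warning of Figure 5.13.
        function retrieveIpfsLinks() public view returns (bytes32[]) {
            bytes32[] memory links = new bytes32[](ipfsLinks.length);
            for (uint i = 0; i < ipfsLinks.length; i++) {
                links[i] = ipfsLinks[i];
            }
            return links;
        }
    }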

As can be noticed, with respect to the last solution a loop in the retrieveIpfsLinks() method is now required, so as to retrieve all the stored IPFS CIDs.

As already mentioned, reading data from the blockchain comes at no cost, whenever the read is performed as a local call issued by an EOA.

However, even in the case of a read operation, because of the halting problem, gas is still accounted for, so as to avoid denial-of-service attacks.

In this case, the compiler indeed warns that the gas consumption for the execution of this view function may be infinite:

Figure 5.13: solc compiler warning: gas requirement may be infinite due to the presence of the loop

The contract creation transaction related to this solution is reported below:

Figure 5.14: Contract creation transaction for this third attempt

The execution cost was mainly due to:

• 32 Kgas paid for the CREATE operation;

• 200 gas paid for each byte of contract code deposited in state (cf. the input field in the figure above);

The transaction cost, instead, was mainly due to:

• 32 Kgas paid for the contract creation transaction;

• 21 Kgas paid for the transaction itself;

• 68 gas paid for every non-zero byte of code for the transaction;

• 4 gas paid for every zero byte of code for the transaction.


Considering a gas price of 10 Gwei (confirmation time below 2 minutes on 2018-10-16) and the exchange rate at that date, the actual economic cost of contract deployment was: 184295 × 10⁻⁸ × 183 ≈ 0.34 Euro.

This means that, with respect to the second solution, the deployment cost has increased by roughly 59% (184295 gas vs 116251 gas).

In order to evaluate the cost of a storeIpfsLink(bytes32 _ipfsLink) method invocation, we faked, again, storing 32 bytes of random data.

The result of the transaction execution is reported below:

Figure 5.15: storeIpfsLink(bytes32 _ipfsLink) invocation cost

The execution cost of storeIpfsLink(bytes32 _ipfsLink) was mainly due to the fact that, at the end of the method’s execution, 2 storage locations are actually in use: one holding the newly pushed array element and one holding the updated array length, which Solidity keeps in the slot where the dynamic array is declared:

Figure 5.16: Smart contract’s storage state at the end of the execution of storeIpfsLink(bytes32 _ipfsLink)
