• Non ci sono risultati.

The Role of Distributed Computing in Big Data Science: Case Studies

N/A
N/A
Protected

Academic year: 2021

Condividi "The Role of Distributed Computing in Big Data Science: Case Studies"

Copied!
3
0
0

Testo completo

(1)

Università degli Studi di Salerno

Dipartimento di Informatica

Dottorato di Ricerca in Informatica (Ciclo XIV - Nuova Serie)

Tesi di Dottorato in Informatica

The Role of Distributed Computing in Big Data Science: Case Studies

in Forensics and Bioinformatics Abstract

Candidato:

Gianluca Roscigno Matr.: 8880900108

Tutor:

Prof. Giuseppe Cattaneo Coordinatore: Prof. Gennaro Costagliola

A.A. 2014/2015

(2)

Abstract

The era of Big Data is leading the generation of large amounts of data, which require storage and analysis capabilities that can be only ad- dressed by distributed computing systems. To facilitate large-scale distributed computing, many programming paradigms and frame- works have been proposed, such as MapReduce and Apache Hadoop, which transparently address some issues of distributed systems and hide most of their technical details.

Hadoop is currently the most popular and mature framework sup- porting the MapReduce paradigm, and it is widely used to store and process Big Data using a cluster of computers. The solutions such as Hadoop are attractive, since they simplify the transformation

of an application from non-parallel to the distributed one by means of general utilities and without many skills. However, without any algorithm engineering activity, some target applications are not alto- gether fast and ecient, and they can suer from several problems and drawbacks when are executed on a distributed system. In fact, a distributed implementation is a necessary but not sucient condition to obtain remarkable performance with respect to a non-parallel coun- terpart. Therefore, it is required to assess how distributed solutions are run on a Hadoop cluster, and/or how their performance can be improved to reduce resources consumption and completion times.

In this dissertation, we will show how Hadoop-based implementations can be enhanced by using carefully algorithm engineering activity, tuning, proling and code improvements. It is also analyzed how to achieve these goals by working on some critical points, such as: data local computation, input split size, number and granularity of tasks, cluster conguration, input/output representation, etc.

i

(3)

In particular, to address these issues, we choose some case studies coming from two research areas where the amount of data is rapidly increasing, namely, Digital Image Forensics and Bioinformatics. We mainly describe full-edged implementations to show how to design, engineer, improve and evaluate Hadoop-based solutions for Source Camera Identication problem, i.e., recognizing the camera used for taking a given digital image, adopting the algorithm by Fridrich et al., and for two of the main problems in Bioinformatics, i.e., alignment- free sequence comparison and extraction of k-mer cumulative or local statistics.

The results achieved by our improved implementations show that they are substantially faster than the non-parallel counterparts, and re- markably faster than the corresponding Hadoop-based naive imple- mentations. In some cases, for example, our solution for k-mer statis- tics is approximately 30× faster than our Hadoop-based naive im- plementation, and about 40× faster than an analogous tool build on Hadoop. In addition, our applications are also scalable, i.e., execution times are (approximately) halved by doubling the computing units.

Indeed, algorithm engineering activities based on the implementation of smart improvements and supported by careful proling and tun- ing may lead to a much better experimental performance avoiding potential problems.

We also highlight how the proposed solutions, tips, tricks and insights can be used in other research areas and problems.

Although Hadoop simplies some tasks of the distributed environ- ments, we must thoroughly know it to achieve remarkable perfor- mance. It is not enough to be an expert of the application domain to build Hadop-based implementations, indeed, in order to achieve good performance, an expert of distributed systems, algorithm engi- neering, tuning, proling, etc. is also required. Therefore, the best performance depend heavily on the cooperation degree between the domain expert and the distributed algorithm engineer.

ii

Riferimenti

Documenti correlati

However, within groups with a steeper age-earnings profile high school and college graduates employed in the private sector the generalised Lorenz curve associated with

Strinati ha dimostrato non solo che il volgarizzamento del Chigiano è del tutto indipendente dalla versione latina del Tifernate, ma anche che molto

[ −2]proPSA: [ −2]pro-prostate specific antigen; BCR: biochemical recurrence; BMI: Body Mass Index; Bx: biopsy; DRE: digital rectal examination; DT: doubling time; GS: Gleason

CSMON-LIFE (Citizen Science MONitoring) aims at involving citizens in monitoring Italian biodiversity by a citizen science approach. Novel IT tools for mobile devices have

European studies have now suggested that several polymorphisms can modulate scrapie susceptibility in goats: in particular, PRNP variant K222 has been associated with resistance

(a) Voltage output from one of the 16 tactile units; (b) result of the algorithm application using a 40 ms window for the ON/OFF signal computation; (c) result of the

The conference has been organized by AUTEC – the Academic Association of Topography and Cartography, SIRA – the Academic Association of Architectural Restoration, and IGMI –

The aim of the present study was to review the experience of our Unit, comparing three different regimens for breast cancer, used in subsequent years: CMF, protocols