• Non ci sono risultati.

Genomic big data management, modeling and computing

N/A
N/A
Protected

Academic year: 2021

Condividi "Genomic big data management, modeling and computing"

Copied!
2
0
0

Testo completo

(1)

Genomic Big Data Management, Modeling And

Computing

Masseroli M1,§, Pinoli P1, Canakoglu A1, Bernasconi A1, Gulino A1, Nanni L1, Orlova O1, Pallotta S1, Ceri S1

1Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy

§Email:marco.masseroli@polimi.it

Motivation

Modern genomics promises to answer fundamental questions for biological and clinical re-search, e.g., how cancer develops, how driving mutations occur. Unprecedented efforts in genomics are made possible by Next Generation Sequencing (NGS), a family of technolo-gies that is progressively reducing the cost and time of reading the DNA. Huge amounts of sequence data are continuously collected by research laboratories, often organized through world-wide consortia (such as ENCODE, TCGA, the 1000 Genomes Project, and Epige-nomics Roadmap); precision medicine, based on genomic information, is becoming a real-ity. So far, the bioinformatics research community has been mostly challenged by primary analysis (production of sequences in the form of short DNA segments, or ”reads”) and sec-ondary analysis (alignment of reads to a reference genome and search for specific genomic features on the reads, such as variants/mutations and peaks of expression). The most important emerging problem is the so-called tertiary analysis, concerned with sense mak-ing, e.g., discovering how heterogeneous regions interact with each other, by integrating heterogeneous DNA features, such as variants or mutations in a DNA position, or signals and peaks of expression, or structural properties of the DNA, e.g., break points (where the DNA is damaged) or junctions (where DNA creates loops). According to many biologists, answers to crucial genomic questions are hidden within genomic data already available in public repositories, but suitable tools for processing them are lacking.

Methods

We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM), which mediates existing data formats, and the GenoMetric Query Language (GMQL), a high-level, declarative query language required by tertiary data analysis. Its enhancement and implementation using a cloud-based technologies provided a GMQL-based system with enhanced accessibility, portability, scalability and performance.

Results

Our new GMQL system is publicly available at http://www.bioinformatics.deib.polimi.it/GMQLsystem/); it has a modular architecture including an intermediate representation supporting

oper-ations over genomic regions and metadata that are executed by a Spark engine, a data framework on the cloud that proved to be extremely efficient in supporting massive ge-nomic queries, a high-level technology-independent repository abstraction supporting dif-ferent repository types (e.g., local file system, Hadoop File System, or others), several system interfaces, including an intuitive Web-based interface and a Web Service inter-face publicly available at http://www.gmql.ue/, as well as two programmatic interinter-faces: a

(2)

pyGMQL library (https://pygmql.readthedocs.io/en/latest/) for Python language and a RGMQL package (https://bioconductor.org/packages/release/bioc/html/RGMQL.html) for R/Bioconductor environment. The GMQL system is associated with an integrated repository of heterogeneous datasets from multiple public sources. It includes datasets form ENCODE, Roadmap Epigenomics and TCGA, whose metadata have been homog-enized through a rule-based framework that we specifically developed for automatize the maintenance of the repository and its extension with the inclusion of new data sources. Data and metadata integrated in the repository can be comprehensively queried through GMQL scripts, as demonstrated by several biological use case examples, and extracted data can then be directly processed and analyzed using analytical Python libraries or R/Bioconductor packages. Further extensions of our GMQL system will support data sharing and federated processing of data located in distinct GMQL instances running on different systems/clouds, by automatically splitting and transferring the processing where the data to be evaluated are located. This will avoid as much as possible expensive downloads and transfers of big data quantities between repositories, also allowing tak-ing advantage of processtak-ing resources available in different organizations. Furthermore, and more important, it will ease the integrated processing of valuable data made avail-able by different sources, for their comprehensive evaluation toward biomedical knowledge discoveries.

Riferimenti

Documenti correlati

(C) Evolution of VDOS from ab initio MD simula- tions along compression (solid lines) and decompression (dashed lines) at T = 300 K. Color represents structural state of the

Each edition of a course is characterized by a starting date, an ending date, a duration, a venue, a set of teachers (one or more), and a set of attendees (employees of the

• Each film is characterized by the title, the year during which it has been released, the production country, the producer, the film maker, the actors, and the actresses.. The set

• Each show is identified by a number, which univocally identifies it within the artistic event it belongs to (the first show of event E27, the second show of event E27, and so on),

All six isolated compounds have been evaluated firstly for the inhibition of both Human Immunodeficiency Virus type 1 (HIV-1) Reverse Transcriptase (RT)-associated DNA Poly-

LCeMS analysis of the L -FDAA-derivatized hydrolysates of 3 and 4, and the NMR data revealed all a -amino acid residues to possess con figurations identical to those in 1.

The simulations are compared to the available literature data, confirming the accuracy of the calculations, and to experimental Raman spectra taken at room temperature on a

Doppia pelle del tipo cell rooms con intercapedine ventilata naturalmente costituita da facciata interna continua trasparente con tamponamento in vetro basso-emissivo e telaio