(1)

School of Information Engineering – Laurea Magistrale in Ingegneria Informatica

Integration of data evolving over time, also in the schema – The bioinformatics case

Dipartimento di Elettronica e Informazione

(2)

Laurea Magistrale in Ingegneria Informatica

Bioinformatica & e-Health Track

Marco Masseroli, PhD

marco.masseroli@polimi.it

(3)

Laurea Magistrale in Ingegneria Informatica (http://ccs-informatica.ws.dei.polimi.it/)

(4)

Laurea Magistrale in Ingegneria Informatica

Bioinformatica & e-Health

(5)

Bioinformatica & e-Health

Educational and training goals

The goal of the Bioinformatica & e-Health Track is to provide advanced notions of, and insight into, ICT technologies and methodologies, together with basic competences in biology, molecular biology, physiology, biological signals and data, advanced technologies for the production of biomedical-molecular data, and clinical-healthcare diagnosis and care processes, as well as competences on the technological standards that enable the interoperability of biomedical-healthcare data and systems

These biomedical notions are complemented with practical examples of the effective application of ICT methodologies and technologies in areas of the Life and Health Sciences, such as bioinformatics, healthcare robotics, telemedicine, e-health, …

(6)

Bioinformatica & e-Health

Educational and training goals

Depending on the specific computer science area chosen (ICT Management, Software Methodologies, or Architectures), the professional profile shaped by the Bioinformatica & e-Health Track is that of an ICT expert with applied knowledge of the biomedical-healthcare domain; in particular for:

• development and management of biomedical-healthcare information systems and of biomedical decision support systems (ICT Management)

• development of software for the collection and analysis of biomedical data and information and for the management of clinical-healthcare processes (Software Methodologies)

• development of integrated hardware and software architectures supporting biomedical-healthcare activities (Architectures)

(7)

Bioinformatica & e-Health

Educational and training goals

The Track therefore aims to train leaders in Computer Engineering applied to Biology and Medicine who, besides a solid ICT background, also know the complex, specific problems of the Life and Health Sciences, can interact from the outset with personnel of other backgrounds (medical, biomedical, biological, chemical, …), and can easily and profitably join, bringing their ICT competences, the interdisciplinary teams that are nowadays indispensable to play a relevant role in the Life and Health Sciences

(8)

Laurea Magistrale in Ingegneria Informatica

Bioinformatica & e-Health Track

(9)

Laurea Magistrale in Ingegneria Informatica

Bioinformatica & e-Health Track

(10)

Laurea Magistrale in Ingegneria Informatica

Bioinformatica & e-Health Track

Integration and Computational Analysis of Genomic and Proteomic Information (http://dottorato.dei.polimi.it/dettagli_corso_b_eng.php?id_corso=491)

(11)

Integration of data evolving over time, also in the schema:

The bioinformatics case

(12)

Motivation

Many tasks in bioinformatics require comprehensive evaluation of many different types of data:

• structural

• functional

• phenotypic

E.g. to identify the biomolecular phenomena involved in the differential expression of a gene set in a specific biological condition

(13)

Motivation

[Figure: number of publicly accessible databanks per publication year, growing from 202 in 1999 to 1,380 in 2012]

Such data are generally available in numerous, distributed and heterogeneous data sources:

• in January 2012, more than 1,380 publicly accessible databanks

• increasing coverage of both:

– biomolecular entities (genomic DNAs, genes, transcripts, proteins)

– descriptions of their structural and functional biomedical features (sequences, expression in different tissues, involvement in biological processes and genetic disorders)

(14)

Motivation

The information about a given biomolecular entity is often scattered across many different databanks:

• the nucleotide sequence of a gene in one databank

• the expression of the gene in a second databank

• the three-dimensional structure of its products in a third one

• data regarding interactions of the gene products with other proteins in a fourth databank

Combining information from multiple databanks is paramount for biomolecular investigation

(15)

Motivation

Several approaches have been proposed to integrate data from multiple sources, including:

• data warehousing

• multi-databases

• federated databases

• information linkage

• mediator-based systems

(16)

Motivation

Data warehousing is the most adequate approach when the following are required:

• integration of numerous data

• efficient off-line data processing

• comprehensive mining of the integrated data

It requires that the information to be integrated from the distributed databanks is automatically retrieved and processed, to create and keep updated an integrated and consistent collection of the originally distributed data

Considering this scenario, to support the biomedical interpretation of high-throughput gene lists, we developed the Genome Function INtegrated Discoverer (GFINDer) system and its Genomic and Proteomic Data Warehouse (GPDW)

(17)

The GFINDer system

(http://www.bioinformatics.polimi.it/GFINDer/)

GFINDer is a publicly available Web system, running at Politecnico di Milano since 2004, which implements ICT to support biomolecular investigation. It allows:

• Performing comprehensive evaluations and mining of gene annotations sparsely available in numerous different databanks accessible via the Internet

• Highlighting functional annotation categories significantly enriched or depleted in user-uploaded gene ID lists, supporting the interpretation of their biological meaning through:

– dynamic retrieval and integration of gene annotations

– statistical analysis of their relevance

– mining of controlled functional and phenotypic terms within the retrieved annotations

(18)

Biomedical-molecular annotations

Annotation: association of a biomolecular entity (gene or gene product) with a biomedical/biomolecular concept (term)

• Each gene (or gene product) is associated with multiple terms (has multiple features)

• Each term (category) is assigned to many genes (or gene products) (many genes or gene products share the same feature)

• Ontological annotations must be assigned by associating genes (or gene products) with the most specific terms describing their features

• When a gene (or gene product) is annotated (associated) to an ontological term (i.e. identified as having the feature described by the term), it is implicitly annotated also to all the parent terms, which describe more general features (i.e. annotation unfolding); a sketch of this computation follows below
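To make the unfolding idea concrete, here is a minimal Java sketch (not the actual GFINDer/GPDW code: class and variable names are assumptions, and the toy parent links are illustrative, not the real GO hierarchy): starting from a directly annotated term, all ancestor terms are collected as implicit annotations.

```java
import java.util.*;

// Minimal sketch of annotation unfolding over an ontology DAG.
public class AnnotationUnfolder {

    // term -> direct parent terms (e.g. GO "is a" relationships)
    private final Map<String, Set<String>> parents;

    public AnnotationUnfolder(Map<String, Set<String>> parents) {
        this.parents = parents;
    }

    // Returns the term itself plus all of its ancestors (the implicit annotations).
    public Set<String> unfold(String term) {
        Set<String> reached = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.push(term);
        while (!toVisit.isEmpty()) {
            String t = toVisit.pop();
            if (reached.add(t)) { // visit each term once, even in a DAG
                toVisit.addAll(parents.getOrDefault(t, Set.of()));
            }
        }
        return reached;
    }

    public static void main(String[] args) {
        // Illustrative parent links only, not the real GO hierarchy
        AnnotationUnfolder u = new AnnotationUnfolder(Map.of(
                "GO:0005388", Set.of("GO:0015662"),
                "GO:0015662", Set.of("GO:0003674")));
        // A term's unfolding includes the term itself and all its ancestors
        System.out.println(u.unfold("GO:0005388"));
    }
}
```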

(19)

Biomedical-molecular annotations

• Each annotation is represented by a record including:

– Gene (or gene product) ID
– Gene Ontology ID
– Reference ID(s) (e.g. PubMed ID(s))
– Evidence code(s)
– Evidence modifier (Qualifier)

For example:

Protein ID   GO ID        Evidence code   PubMed ID        Qualifier
P05147       GO:0047519   IDA             PMID:2976880     null
P98194       GO:0005388   IDA             PMID:16192278    NOT

(20)

GFINDer implementation

• Three-layer architecture: data layer, processing layer, user layer

• MySQL as relational DBMS for the multi-database system in the data layer

• Java application for the automatic updating of remotely retrieved data

• ASP and JavaScript for the Web application in the processing layer

• HTML and JavaScript for the user layer

[Figure: GFINDer three-tier architecture — on-line databanks (Entrez Gene, Swiss-Prot, Homologene, InterPro, eVOC, OMIM, Gene Ontology, Pfam, KEGG) feed the data tier databases (MyGO DB, Master DB, GeneData DB) on the database server through automatic updating procedures; the processing tier runs on the Web server, serving the user tier]

(21)

GFINDer Web site

http://www.bioinformatics.polimi.it/GFINDer/

Nearly 190,000 accesses from more than 10,000 IPs since 2004 (about 27,000 accesses from nearly 1,700 IPs in the last year)

(22)

GFINDer multi-database system

• The GFINDer multi-database data warehouse integrates gene and protein data and annotations of 14 organisms (8.5 GB MySQL DB):

– 830,261 nucleotide sequences
– 336,068 genes (24,682 human protein-coding genes)
– 1,125,161 proteins (312,161 human)

• expressed by 15 controlled terminologies and ontologies, including:

– 26,493 Gene Ontology concepts
– 40,353 Gene Ontology associations
– 396 biochemical pathways (KEGG)
– 11,132 protein families and 3,149 protein domains (Pfam and InterPro)
– 2,465 inherited disorders (OMIM)
– 9,912 phenotypic data items (signs and symptoms) (OMIM)

(23)

GFINDer Updater

Information integrated in the GFINDer data warehouse is kept updated monthly by the GFINDer Updater software, which:

• Automatically retrieves gene and protein data and their controlled annotations from the original databanks

– Downloading data files of different formats

• Uses specific parsers according to the various file formats to:

– Extract the data of interest from the downloaded files

– Import them into specific GFINDer data warehouse tables
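As a rough illustration of this parser-per-format design (a sketch under assumed names, not the actual GFINDer Updater code), one parser interface with a format-specific implementation might look like this in Java, the language the Updater is written in:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// One parser per source file format; the updater picks the right one per databank.
interface DatabankParser {
    List<Map<String, String>> parse(Path file) throws IOException;
}

// Example parser for a simple tab-separated databank export.
class TsvParser implements DatabankParser {
    private final String[] columns;
    TsvParser(String... columns) { this.columns = columns; }

    @Override
    public List<Map<String, String>> parse(Path file) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            if (line.isBlank() || line.startsWith("#")) continue; // skip comments
            String[] fields = line.split("\t");
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < columns.length && i < fields.length; i++) {
                record.put(columns[i], fields[i]);
            }
            records.add(record); // each record would then be loaded into a DW table
        }
        return records;
    }
}
```

A hypothetical use would be new TsvParser("gene_id", "symbol", "description").parse(Path.of("gene_info.tsv")), yielding the records to import into the corresponding data warehouse table.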

(24)

GFINDer data warehouse releases

The GFINDer system opened in March 2004 included a total of:

• 163,023 genes of 9 species from Entrez Gene and Swiss-Prot

• 6 types of controlled annotations of such genes from 4 biomolecular databanks (GO, KEGG, Pfam and OMIM)

Since then, the GFINDer data warehouse has had 3 major releases:

1. in 2005: integration of gene annotations about genetic phenotypes and phenotype locations from OMIM

2. in 2006: integration of gene product annotations about protein families, domains and functional sites from InterPro

3. in 2007: integration of human gene annotations about four eVOC ontologies describing gene expression in anatomical systems, cellular types, developmental stages, and pathologies

(25)

GFINDer data warehouse evolution

Since 2004, the integrated GFINDer DB has continuously grown in:

• controlled terminologies and ontologies

• their terms and associations with genes and proteins

• genes and gene products

The GFINDer data warehouse evolution had to face both:

• the addition of new data types

• changes in the structure of the remotely provided integrated data

The required global data schema changes were often not rapid to introduce

• since the global data schema was classically designed by individually modeling the different data and annotation types to be integrated

(26)

Data warehouse schema design

When many heterogeneous data types from different sources need to be integrated in a single data warehouse, classical global data schema design leads to:

• a complex data schema

– difficult to maintain and to extend with additional data types

• a limitation for genomic and proteomic data warehouse development

(27)

The GPDW project

To overcome the above issues, in 2008 we started the design and construction of a Genomic and Proteomic Data Warehouse (GPDW)

• Focused on the integration of genomic and proteomic controlled annotation data of different species, which represent the available biomolecular knowledge

• Supporting the integration of data sources that evolve quickly in number, data content and structure

• Assuring quality and provenance tracking of the integrated data

This requires:

1. A flexible and abstracted global data schema, which fits the changing requirements of data sources

(28)

The GPDW project

2. Procedures with a high degree of automation, in order to sustain the updating of the integrated data and the plug-in of new sources, and capable of adapting to small format changes while loading data from the sources

3. Automatic procedures for data consistency checking, provenance tracking and data quality assessment

Our goal was to use the integrated data warehouse as an underlying infrastructure for data analysis and mining, supporting:

• Discovery of new biomedical knowledge in response to experimental input data or to user interaction

• Development of an enhanced version of GFINDer

(29)

Novel generalized modular multi-level data schema

We are developing the GPDW with:

• an abstract, generalized and modular conceptual data schema

– easily supporting data evolution in quantity, type and (to a limited extent) schema

Biomedical features of biomolecular entities are described as multiple associations of the latter with the former, i.e. through their annotations

(30)

Novel generalized modular multi-level data schema

(31)

Novel generalized modular multi-level data schema

Multi-level modular data model:

• Aggregated data level

• Imported data level

(32)

Novel generalized modular multi-level data schema

(33)

Novel generalized modular multi-level data schema

Integrated data level

(34)

Automated data integration factory

The defined data schema (mainly its abstraction and modularity around the feature concept) makes it possible to implement a generalized parametric software architecture that can be customized to support the automated integration of multi-source heterogeneous data

The GPDW software architecture supports two main tasks:

1. Data import: importing data from their different sources into the source-import tier of the defined global data model

Import manager

– Parsers (one for each imported data format)
– Loaders (one for each data file / API)
– Importers (one for each data source)
– Unfolder (to make the hierarchical relationships of ontological data explicit)
– Lowest common ancestor (LCA) calculator

(35)

Automated data integration factory

Some pre-processing of the imported data is performed:

• Unfolding, i.e. the computation of explicit hierarchical relationships, of gene and protein ontological annotations (e.g. Gene Ontology annotations) is pre-computed and stored

– to enable faster processing of queries for standard annotation enrichment analysis, frequently performed to support the interpretation of gene ID lists

• The lowest common ancestor (LCA) of ontological terms, i.e. the most specific common ancestor (the one farthest from the ontology root), is pre-calculated and stored

– to support fast calculation of semantic similarities between biomolecular entities annotated to those terms; a sketch follows below
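A minimal Java sketch of the LCA pre-calculation idea (names and the precomputed-ancestor representation are assumptions, not the GPDW implementation): given each term's unfolded ancestors with their depth from the root, the LCA of two terms is their deepest common ancestor.

```java
import java.util.*;

// Sketch: LCA of two ontology terms from precomputed ancestor sets.
public class LcaCalculator {

    // term -> (ancestor -> ancestor's depth from the root, root = 0); assumed
    // precomputed by the unfolding step, each term included among its own ancestors
    private final Map<String, Map<String, Integer>> ancestors;

    public LcaCalculator(Map<String, Map<String, Integer>> ancestors) {
        this.ancestors = ancestors;
    }

    // Returns the deepest (most specific) common ancestor of the two terms.
    public Optional<String> lca(String termA, String termB) {
        Map<String, Integer> a = ancestors.getOrDefault(termA, Map.of());
        Map<String, Integer> b = ancestors.getOrDefault(termB, Map.of());
        String best = null;
        int bestDepth = -1;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            if (b.containsKey(e.getKey()) && e.getValue() > bestDepth) {
                best = e.getKey(); // deeper common ancestor found
                bestDepth = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Toy depths: terms X and Y share the ancestors root (0) and A (1)
        Map<String, Map<String, Integer>> anc = Map.of(
                "X", Map.of("root", 0, "A", 1, "X", 2),
                "Y", Map.of("root", 0, "A", 1, "Y", 2));
        System.out.println(new LcaCalculator(anc).lca("X", "Y")); // Optional[A]
    }
}
```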

(36)

Automated data integration factory

2. Data integration: integrating the imported data into the instance-aggregation and concept-integration tiers of the global data model

Integration manager

– Integrated table manager
– Translation manager
– History translator
– Similarity translator
– Association translator

(37)

Integrated data quality checking

In the integrated data, GPDW automatic data processing identifies:

• data structure differences in new updates

• inconsistencies among data from different sources or from the same source

This is ensured at each update of the source data files by:

• strict checking of the data parsed from the source data files

• checking for absence of null data and for modifications of the data structure

• using regular expressions for ID checking and identification

– this assures the correct use of aliases and of historical or obsolete IDs

• checking and cross-validating data imported from different sources, to identify redundant and mismatching information

(38)

Integrated data quality checking

Regular expressions

Regular expressions for biomolecular entity IDs:

ID type                   Regular expression
DNA sequence RefSeq ID    AC_[0-9]{6}\.[0-9]+
DNA sequence RefSeq ID    N[CGSTW]_[0-9]{6,9}\.[0-9]+
DNA sequence RefSeq ID    NZ_[A-Z]{4}[0-9]{8}\.[0-9]+
Entrez Gene ID            [0-9]+
Ensembl Gene ID           ENSG[0-9]{11}
Transcript RefSeq ID      [NX][MR]_[0-9]{6,9}\.[0-9]+
UniProt ID                [A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]
UniProt ID                [OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]
IPI ID                    IPI[0-9]{8}

(39)

Integrated data quality checking

Regular expressions

Regular expressions for biomedical feature IDs:

ID type            Regular expression
Gene Ontology ID   GO:[0-9]{7}
KEGG Pathway ID    [0-9]{5}
InterPro ID        IPR[0-9]{6}
Pfam ID            PF[0-9]{5}
eVOC ID            EV:[0-9]{7}
MIM ID             [0-9]{6}
(40)

Integrated data quality checking

ID reconciliation

• When different data sources have different updating frequencies, data source IDs (or their assignment) can vary differently among updating versions

• At a given point in time, associations between different data sources (e.g. annotations) can refer to a version of the associated data sources different from the one currently available from those sources themselves

• This is an important issue when all such data are integrated together

• To minimize it, whenever historical ID data are available, they can be used to automatically reconcile IDs of the same data source coming from different providers (i.e. obsolete IDs of external data sources); a sketch follows below
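A minimal Java sketch of history-based ID reconciliation (class and map names, and the toy IDs, are assumptions; real history data may also mark IDs as withdrawn rather than replaced):

```java
import java.util.Map;

// Sketch: mapping obsolete IDs to current ones through historical ID data.
public class IdReconciler {

    // obsolete ID -> replacing ID, built from the source's ID history file
    private final Map<String, String> history;

    public IdReconciler(Map<String, String> history) { this.history = history; }

    // Follows the replacement chain until a current (non-replaced) ID is reached.
    public String reconcile(String id) {
        String current = id;
        while (history.containsKey(current)) {
            current = history.get(current); // assumes the history chain is acyclic
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical history: ID "123" was replaced by "456", later by "789"
        IdReconciler r = new IdReconciler(Map.of("123", "456", "456", "789"));
        System.out.println(r.reconcile("123")); // -> 789
    }
}
```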

(41)

Integrated data quality checking

Cross-validation of data from different sources

Analysis of overlaps and relationship loops among the imported data:

• can help verify data consistency and unveil unexpected information patterns

• can possibly lead to biological discoveries

E.g., by checking cross-references between the Gene Ontology, Entrez Gene and UniProt databanks, the consistency of the GO annotations of proteins and of their codifying genes can be tested

We found 27,884 (7.23%) GO annotations (about 2,124 different GO terms) of 11,557 human proteins that are not comprised in the GO annotations of their codifying genes provided by Entrez Gene

• including 588 (2.11%) annotations with evidence stronger than that inferred from electronic annotation (IEA); a sketch of the check follows below
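As a rough illustration of this check (names and toy term sets are assumptions; in practice both annotation sets would first be unfolded to include ancestor terms), per protein/gene pair the test reduces to a set difference:

```java
import java.util.*;

// Sketch of the gene/protein GO-annotation consistency check described above.
public class GoConsistencyCheck {

    // Protein GO annotations not covered by the codifying gene's GO annotations.
    public static Set<String> missingInGene(Set<String> proteinGoTerms,
                                            Set<String> geneGoTerms) {
        Set<String> missing = new HashSet<>(proteinGoTerms);
        missing.removeAll(geneGoTerms); // set difference: protein-only annotations
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical (already unfolded) annotation sets of a protein and its gene
        Set<String> protein = Set.of("GO:0005388", "GO:0015662");
        Set<String> gene    = Set.of("GO:0015662");
        System.out.println(missingInGene(protein, gene)); // -> [GO:0005388]
    }
}
```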

(42)

Data warehouse quality

Besides the quality of the data contained, the quality and usability of a data warehouse depend on key factors such as:

1. Ease of maintenance and extensibility

– the software architecture procedures allow automatic updating of the data warehouse with the newest data version available in the original sources

2. Coverage of the integrated data

– leveraging external reference data, provided as pairs of a data source's internal IDs and IDs of other (external) data sources

3. Query performance

– horizontal and vertical data partitioning, suitable indexes, data encodings, pre-computed data derivations (unfolding), …

(43)

Conceptual data modeling of integrated sources

Gene Ontology Annotation

(44)

Conceptual data modeling of integrated sources

Expasy Enzyme

(45)

Conceptual data modeling of integrated sources

NCBI Taxonomy

(46)

Logical data modeling of integrated sources

NCBI Taxonomy

(47)

Conceptual vs. logical data modeling

NCBI Taxonomy

(48)

GPDW data coverage

Currently, the created multi-organism GPDW, by importing data and external references from 13 databanks (Entrez Gene, Homologene, IPI, MINT, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome, eVOC and OMIM), integrates gene and protein data and annotations from a total of 28 data sources

It contains data about:

• 8,355,596 genes (7,556,095 protein-coding) of 8,568 different organisms (20,424 human protein-coding genes)

• 51,386,222 proteins of 372,458 species (891,781 human proteins)

• 103,106,992 gene annotations (6,548,493 human) and 183,209,462 protein annotations (3,706,717 human)

– regarding 14 biomedical controlled terminologies

(49)

GPDW data coverage

The integrated annotations regard, among others:

• 33,511 Gene Ontology concepts and 62,337 relationships (51,057 is a and 5,948 part of)

• 1,065 eVOC ontology concepts and 1,057 is a relationships between them, describing the expression of human genes in anatomical systems (517 terms), cellular types (192 terms), developmental stages (157 terms) and pathologies (199 terms)

• 27,667 biochemical pathways (490 from BioCyc, 475 from KEGG and 26,702 from Reactome)

• 4,903 Expasy Enzymes

• 14,163 InterPro protein families, domains and functional sites

(50)

GPDW data coverage

• 7,215 OMIM genetic disorders and 34,216 phenotypes (signs and symptoms)

– from the OMIM clinical synopsis semi-structured descriptions (not included in any other integrative databank)

• 395,590 biomolecular interaction data (391,624 protein-protein interactions) from MINT and IntAct

GPDW currently contains more than 1,650 million data tuples, for a total of about 432 GB of disk space (including index space)

Downloading time is about 2 hours and 45 minutes; automatic import and integration require about 70 hours and 67 hours, respectively

• This time is acceptable given the amount of data handled and the data quality, integration and indexing processing performed

• Processing, performed off-line, allows faster run-time querying

(51)

Conclusions

• Computational systems and data warehouses, such as GFINDer and GPDW, provide support for:

– comprehensive use and analysis of sparsely available genomic structural, functional and phenotypic data

– answering biological questions that require integrated access to numerous biomolecular information and knowledge sources

• To be effective bioinformatics instruments they need to:

– have a data schema that ensures the easiest possible maintenance and extensibility of the data warehouse

– guarantee good performance of high-throughput queries

– include automatic procedures for updating the integrated data and supporting their quality checking

(52)

Conclusions

• The quality of the integrated data can be strengthened by:

– automatic correctness checking of all operations performed on the integrated data

– reconciliation of unsynchronized and obsolete data by using available historical data information

– testing the accuracy of cross-references imported between the data of different databanks

• Data quality controls can reveal inconsistencies or missing information in public databanks:

– supporting quality improvement of the genomic and proteomic information available to the whole scientific community

– allowing their correct use in support of high-throughput, data-driven biological discoveries

(53)

Conclusions

• Our newly designed genomic and proteomic data warehouse can easily include virtually any annotation data type from any data source, thanks to its novel generalized and modular schema

• We are also redesigning the software architecture that supports the creation, extension and automatic update of the data warehouse

• The new data and software architecture automates as much as possible the plug-in of new data sources, easing:

– the integration of several annotation types from many different biomolecular databanks

– the quality checking of the integrated annotations and their structuring in a way suitable for high-throughput, data-driven biological discoveries

(54)

Projects and Theses

Projects:

• Possibility to carry out a practical software development project on bioinformatics topics for the course:

– PROBLEM ANALYSIS ATELIER – Progetto di BIOINFORMATICA e e-HEALTH (Prof. Paolini)

Theses:

• Integration, Web visualization and navigation of genomic and proteomic data

• Analysis and mining of genomic and proteomic information

Information:

– some on the courses' Web site (Corsi On Line, http://corsi.metid.polimi.it/)

– contact M. Masseroli directly

(55)

Projects and Theses

Examples

• Genomic and Proteomic Data Warehouse development
(http://www.bioinformatics.polimi.it/masseroli/BCBMM/projects/GPDW_project_proposal.pdf)

• Quantitative description and quality analysis of integrated genomic and proteomic data
(http://www.bioinformatics.polimi.it/masseroli/BCBMM/projects/QuantitativeQualityAnalysis_prject_proposal.pdf)

• I2B2 + GPDW
(http://www.bioinformatics.polimi.it/masseroli/BCBMM/projects/I2B2-GPDW_project_proposal.pdf)

• Gene similarity bioinformatics analysis
(http://www.bioinformatics.polimi.it/masseroli/BCBMM/projects/GeneSimilarity_project_proposal.pdf)
