
Chapter 4

GAPP Novel Data Views

My master's thesis focuses on the development of an optimal solution for proteomics data analysis and the related visualization. Although GAPP provides many benefits (high-confidence results, high throughput, no need to store large spectral data files, mapping of peptides back to the genome and identification of variant peptides), it is a fairly new repository; therefore, at the beginning of my work, the following aspects needed improvement:

• Database data access through SQL queries: a graphical interface for visualizing the results of complex queries, instead of a pure SQL interface, is very important, because biologists interested in analyzing their data are not interested in learning the SQL programming language (or lack the knowledge needed to do so).

The design of the web interface requires writing dynamic SQL queries and a deep knowledge of the underlying GAPP database.

Moreover, part of the work consists of evaluating which information to show so as to allow a full understanding of the search conducted. An important aspect is to avoid confusing users with unnecessary information or an untidy representation of the results.

• A more logical and organic exposition of the results: simple tables holding results are often not enough to show the complex data obtained from proteomics experiments. Users need an easy graphical view to analyse the protein details and the experimental information, and to compare the proteins identified in different experiments.

The purpose of my thesis work is to address these weak points of the GAPP. In particular it consists of two main, interrelated parts:

1. Data-mining: As reported in the section about GAPP in the second chapter, after the analysis of the sample, the data retrieved are stored in a relational database. Access to the results requires extensive data-mining work;

2. Visualization: Dynamic web pages show the data extracted from the database in an easy and intuitive way, established through interaction with experts in the field.

Having identified the weak points of the GAPP, the next step was to devise a sensible way of improving those aspects. Through various discussions with experts in the field, mainly from Cambridge University, the need for three different data views was identified. The three data views chosen for the representation of the results are complementary and present three different facets of the information, providing a complete and comprehensive vision of the results. They are:

• Experiment view, holding the list of the experiment features stored in the database together with the information on the related gene products;

• Protein view, the detailed description of every single protein, together with the corresponding peptides, identified by the GAPP;

• Differential view, the core of the thesis work, a graphical representation of the gene products identified across different experiments, allowing the comparison between experiments in terms of identified proteins. Because of the large number of proteins identified in a set of experiments, proteins can also be filtered by gene ontology (GO). The Gene Ontology, briefly, consists of three controlled vocabularies (ontologies) describing gene products in terms of their associated biological processes, cellular components and molecular functions. The Gene Ontology is becoming more and more popular within the proteomics community as an annotation tool supporting research. This part of the work involved several aspects, ranging from the design and creation of a specialized database and the usage of some bioinformatics tools to extract the annotation data from repositories, to the implementation of a program to fill the database with the gene ontology entries.

Another aspect of the GAPP system that needed modification was data submission, because the previous system was fully automatic but not easy to use.

The technologies learned and employed in the implementation of my thesis work are:

• HTML and GD library for the web visualization;

• PHP for the creation of dynamic web pages (including the queries);

• Perl to implement supporting programs;

• MySQL databases and SQL query language.

4.1 GAPP database

GAPP is a pipeline that uses a combination of analysis algorithms. Peptides identified by the GAPP pipeline are automatically collated at regular intervals into a catalogue of observed peptides in the database, and are publicly available for the benefit of researchers working in proteomics and related fields. Creating an efficient interface that shows the results in an optimal way requires deep knowledge of the database structure in order to manage the data. The GAPP database, shown in Figure 4.1, is a relational database consisting of 14 main tables:

• The table ‘parameter’ contains all the information about the user (user name, email, profiles selected) and about their experiments (enzyme, instrument, tissue type, sample species, etc.). The key of the table is the unique_id, which basically is the experiment id.

• The table ‘protein’ contains all the information about the proteins identified, such as the protein id and the searchdb (the table keys), and other important information such as the protein sequence, the description, an id for potential isoforms, the transcript and the status.

• The table ‘datainfo’ includes all the information about spectra (spectra assigned in the first loop, spectra assigned in the secondary loop, file type, number of spectra contained in the original file); the key is again unique_id.

Figure 4.1: GAPP database.


• ‘Mod_parameter’ contains the potential modifications that users can search for on certain amino-acids, indicating to the search engine whether they are to be considered fixed or variable (in other words, whether they are assumed to be always present or not).

• The table ‘Modifications’ lists the information about the identified modifications of amino-acids: the amino-acid(s) modified, the protein and the peptide in which the modification has been detected, the position of the amino-acid in the protein sequence, and the spectrum providing the experimental evidence of the modification. The unimod (www.unimod.org) identifier of the modification and the mass change due to the modification are stored as well.

• ‘Mod_type’ holds information on all the possible modifications that can be detected by GAPP.

• ‘hitlist’ stores the sumscore and the frequency of the proteins identified in an experiment; both values are used to calculate the average APS. This, as seen in the second chapter, is a score used to discard peptides whose score falls below a certain threshold, in order to obtain a false-positive rate very close to zero.

• ‘Peptide’ consists of the features of all identified peptides, such as the position in the protein sequence in which that peptide starts and ends, the peptide sequence, the spectrum id, how many missed cleavages it has and other useful information.

• The ‘Enzyme’ table is about the enzyme used for the digestion of the proteins, and ‘spectrum’ is about the spectra recognised.

• ‘GeneDesc’ contains the whole list of descriptions and gene ontology description of the proteins identified.

• Other tables present in the database are ‘splice_variants’ and ‘APS_score_info’, which hold the potential splice variants and all the thresholds set for the reverse search, respectively.

4.2 GAPP Data Views

Before the beginning of this work, GAPP did not have its own data view: the only way to obtain the results was by writing queries in SQL. This meant that biologists interested in data analysis were required to have some SQL programming skills. Moreover, to obtain the desired information, a good knowledge of the GAPP database structure (Figure 4.1) was necessary. Finally, by writing SQL queries alone it is not possible to quickly analyse the differences among various experiments or proteins. These were clearly important limitations of the system’s usability.

Discussions with experts in the proteomics field revealed the need for three complementary ways (views) to fully describe the results.

The core of my thesis therefore consists of the design of these three views:

1. Experiment view;

2. Protein view;

3. Differential view.

From the home page of the web site (www.gapp.info), in the ‘Mine the GAPP’ section, it is possible to carry out a ‘Simple Search’, which allows the user to select a particular item of interest by entering the corresponding experiment id (unique id), the protein id or the protein description.

By entering an experiment number, the user obtains all the information about that particular experiment: the search leads to the page of the experiment view. By entering a particular protein id, instead, the user is redirected to the protein view.

Finally, when the user enters a protein description, GAPP returns a list of the proteins whose description matches that particular word. A link to each protein view is also provided.

4.3 Experiment View

One of the needs of proteomics researchers is to have a clear understanding of the experiments performed by other scientists to know which experimental conditions allowed the identification of certain proteins and peptides.

The main aim of the experiment view is to provide experiment details as complete as possible and to keep a clear description of the experiments available in GAPP.

A summary (Figure 4.2) at the top of the page reports the amount of data stored in the repository.

Since users can flag their data as “public” or “private” at the moment of data submission, as described in the second chapter, the summary reports the number of proteins seen in the user’s experiments and the number of experiments performed by the user, whether they have been flagged as public or as private. Moreover, the summary shows the number of proteins seen in public experiments performed by other users and the number of “public” experiments.

A link to the public data list is provided with the file names and their own experiment ids at the bottom of the window.

Figure 4.2: Summary.

After the summary, the main experiment selection is shown. This is the list of experiments submitted to GAPP, grouped into the user’s experiments, if available, and public experiments, that is, all the experiments performed by other users who gave permission to share their data publicly.

For each experiment the main features are listed to facilitate a quick choice of the desired experiment. They are the following:

1. Number of proteins identified, the number of proteins identified in that experiment;

2. Cell line, the name of the cell line (cell line names are collected in dedicated databases); it can be a human or animal line and provides information about the tissue analysed, in particular about the species and potential pathologies or tumours;

3. Species, the species examined;

4. Tissue, the tissue sample;

5. Disease state, indicates whether the tissue is healthy or, if not, the name of the pathology.

Biologists, looking at the experiment outline shown at the bottom, can choose the experiment they are interested in and click on it to obtain more details.

Figure 4.3: Experiment profiles.

Given the large number of experiments already contained in the database, and since the amount of data will increase in the future, the experiments are shown in groups of twelve per page. The user can move to the other experiments through a page menu.
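The paging logic can be illustrated with a minimal sketch. It is written in Python for brevity (the thesis pages themselves are PHP), and the function names are hypothetical; only the page size of twelve comes from the text. In a MySQL query the same slice would typically be expressed with a LIMIT clause and an offset of (page - 1) * 12.

```python
import math

PAGE_SIZE = 12  # the text states twelve experiments per page


def paginate(items, page, page_size=PAGE_SIZE):
    """Return the slice of items shown on a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    return items[start:start + page_size]


def page_count(n_items, page_size=PAGE_SIZE):
    """Number of pages needed to show n_items entries."""
    return math.ceil(n_items / page_size)


experiment_ids = list(range(1, 31))       # 30 made-up experiment ids
print(page_count(len(experiment_ids)))    # 3
print(paginate(experiment_ids, 3))        # [25, 26, 27, 28, 29, 30]
```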

Once the experiment has been selected, the page experiment_view_res.php opens in a new tab. This page lists all the important features of the chosen experiment.

First of all, the PHP script checks if the user is logged in or not for the connection to the appropriate database (according to the GAPP’s mentioned data policy).

The part of the script about the database connection is listed below.

Once the connection with the database is established, data-mining work is needed to extract the data from the very large repository. In this section the data of interest describe the experiment in which the protein was identified, in particular:

1. Metadata, a collection of data about data, such as cell line, cell type, instrument;

2. Search parameter, parameters about peptides identified;

3. Data format, the electronic format of data submitted to the pipeline;

4. Number of spectra in file, number of spectra contained in the file;

5. Number of peptides identified, number of peptides identified in the experiment;

6. Average protein APS score, a measure of the accuracy of the result;

7. Number of proteins identified, number of proteins identified in the experiment.

Some JavaScript functions open a window with the details of each field when the user clicks on it.

4.3.1 Metadata

Metadata are “data about data”. This information is very important for understanding the experiment consulted. The fields of this section are obtained with a simple query on the parameter table of the relational database (Figure 4.4).

Figure 4.4: MySQL query to find out the celltype of the particular experiment ‘$id’.
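The query of Figure 4.4 is not reproduced in this text. As a rough illustration only, the following Python sketch runs a comparable lookup against an in-memory SQLite database rather than the MySQL server used by GAPP. The table name parameter and the key unique_id come from Section 4.1 and the column name celltype from the figure caption; the tissue_type column, the function name and the sample row are assumptions made for the example.

```python
import sqlite3

# Toy stand-in for the GAPP 'parameter' table (columns beyond unique_id
# and celltype are hypothetical, as is the sample row).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parameter ("
    "unique_id INTEGER PRIMARY KEY, celltype TEXT, tissue_type TEXT)"
)
conn.execute(
    "INSERT INTO parameter VALUES (?, ?, ?)", (42, "epithelial", "serum")
)


def fetch_cell_type(connection, experiment_id):
    """Return the cell type of one experiment, or None if it is unknown."""
    row = connection.execute(
        "SELECT celltype FROM parameter WHERE unique_id = ?",
        (experiment_id,),
    ).fetchone()
    return row[0] if row else None


print(fetch_cell_type(conn, 42))
```

The experiment id is passed as a bound parameter instead of being pasted into the SQL string, which is one common way to keep user input from altering the query.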

The metadata listed (Figure 4.5) are the following:

• Cell line, the cell line, which is representative of a particular cell type;

• Cell type, characterised by a distinct morphological or functional form of cell;

• Dev stage, the cellular stage of development;

• Sample coll time, the sample collection time;

• Tissue type, the type of tissue sample;

• Disease state, the state of health of the cell sample;

• Treatment, if the sample was subjected to a specific treatment, this field indicates which one;

• Sample species, the species to which the sample belongs;

• Instrument, the machine used during the experiment;

• Calibration, information about the calibration of the instrument used in the experiment, if known.

Figure 4.5: Metadata.

4.3.2 Search parameters

This section consists of all parameters of the selected experiment (Figure 4.6) that describe the peptides identified, such as:

• Search db, indicates the database from which the theoretical spectra are computed; these are compared with the m/z peak lists of the instrument output file to identify peptide fragments by matching masses;

• Cleavage enzyme, the name of the enzymatic cleavage reagent used to cleave the peptide backbone of a protein to produce peptides. Different sites along the backbone have different reaction rates and kinetics, so some sites that should be cleaved by the reagent may not be completely cleaved during the course of the experiment; this can lead to missed cleavages in the identified peptides;

• Mass option, one of two options for the computation of the theoretical spectra: the average mass, computed from the average atomic masses of the elements (weighted by the natural abundance of their isotopes), and the monoisotopic mass, that is the sum of the masses of the atoms in a molecule using the unbound, ground-state, rest mass of the most abundant isotope of each element;

• Max. missed cleavages, the maximum number of missed cleavage sites allowed within a peptide;

• Ms tolerance, the permissible deviation of the molecular mass measured by mass spectrometry from the theoretical mass computed in silico by the search engine, expressed in daltons;

• Ms/Ms tolerance, the permissible deviation of the molecular mass of the fragments (once again compared with the theoretically computed values), expressed in daltons;

• Peptide charge up to, the maximum charge that the precursor ion (the ion that may dissociate to form fragments) can have during the in silico creation of peptides;

• Fixed modifications, a fixed modification is applied throughout the whole peptide, although it is specific to the type of amino acid it modifies; it may also apply to the N-terminus or C-terminus of the peptide;

• Variable modifications, modifications that may or may not be present. They can be a very powerful means of finding a match, but there are also dangers to be aware of. Even a single variable modification will generate many possible additional peptides to be tested. More than one variable modification causes the number of arrangements to increase geometrically. This means that a search can take dramatically longer than the same search with fixed modifications. More importantly, testing all possible arrangements of modifications generates many more random matches, so that discrimination can be sharply reduced. The best advice is to do a preliminary search with a small number of variable modifications, followed by an error-tolerant second search to pick up additional matches to peptides containing unusual modifications.
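To make the growth in the number of candidates concrete, the following Python sketch enumerates every arrangement of a single variable modification on one peptide. It is only an illustration: the function name and the peptide are made up, and real search engines apply further constraints. With n modifiable residues, each site is either modified or not, giving 2^n arrangements.

```python
from itertools import product


def modification_arrangements(peptide, residue):
    """List every way a variable modification on `residue` can occur.

    Each arrangement is a tuple of 0-based positions carrying the
    modification; the empty tuple is the unmodified peptide.
    """
    sites = [i for i, aa in enumerate(peptide) if aa == residue]
    arrangements = []
    for states in product((False, True), repeat=len(sites)):
        arrangements.append(
            tuple(pos for pos, on in zip(sites, states) if on)
        )
    return arrangements


# Hypothetical peptide with three methionines (oxidation of methionine is
# a commonly searched variable modification).
forms = modification_arrangements("MAMKM", "M")
print(len(forms))  # 8, i.e. 2 ** 3
```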

Figure 4.6: Search parameters.

The collection of these data sometimes requires joining two tables, as in the case of the cleavage enzyme below.
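Since the original listing is not included in this text, the following Python sketch shows what such a two-table lookup might look like, using an in-memory SQLite database in place of MySQL. The table names parameter and enzyme come from Section 4.1; the columns enzyme_id and name, the function name and the sample rows are hypothetical and chosen only to make the join runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: the real GAPP column names may differ.
conn.executescript(
    """
    CREATE TABLE enzyme (enzyme_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parameter (unique_id INTEGER PRIMARY KEY, enzyme_id INTEGER);
    INSERT INTO enzyme VALUES (1, 'Trypsin');
    INSERT INTO parameter VALUES (42, 1);
    """
)


def fetch_cleavage_enzyme(connection, experiment_id):
    """Return the enzyme name for an experiment by joining both tables."""
    row = connection.execute(
        "SELECT e.name FROM parameter AS p "
        "JOIN enzyme AS e ON e.enzyme_id = p.enzyme_id "
        "WHERE p.unique_id = ?",
        (experiment_id,),
    ).fetchone()
    return row[0] if row else None


print(fetch_cleavage_enzyme(conn, 42))
```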


4.3.3 Proteins identified

The main bar indicates the number of proteins identified in the selected experiment.

This section can be expanded by clicking on the bar. The list of all proteins identified in that experiment then appears; the protein ids provide links that allow the user to navigate straight to the corresponding protein view page (Figure 4.7).

Figure 4.7: List of the proteins identified, Ensembl description view.

Beside each protein, the corresponding description is reported together with a link to the Ensembl database (i.e. to the page http://www.ensembl.org/Homo_sapiens/geneview), where the user can find a lot of additional information on the gene product of interest. Clicking on ‘Toggle to gene ontology view’ preserves the previous layout, but the gene ontology description is now shown instead of the protein description. The information about the protein description and the gene ontology description is stored in the table geneDesc. A link ‘Toggle to Ensembl description view’ (Figure 4.8) is then available to go back to the protein description.

Figure 4.8: List of the proteins identified, gene ontology view.

4.3.4 Further information

The other information reported in the experiment view page, shown in Figure 4.9, is:

• Data format: the format of the data submitted to GAPP for the analysis. It is retrieved from filetype in the ‘datainfo’ table;

• Number of spectra in file: each file contains several spectra and this field shows their exact number. Like the data format, this value is retrieved from ‘datainfo’;

• Number of peptides identified: the number of peptides identified in a particular experiment. This information is stored in the ‘peptide’ table;

• Average protein APS score: computed by summing, over the proteins identified in the experiment, the ratio between the sumscore and the corresponding frequency (both retrieved from the table ‘hitlist’), and dividing the result by the number of proteins identified in the experiment considered.
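The average APS calculation can be written out as a short sketch. This is one reading of the description above, expressed in Python rather than the PHP/SQL of the original pages; the function name, the row layout (one sumscore and frequency pair per protein) and the numbers are assumptions for illustration.

```python
def average_protein_aps(hitlist_rows):
    """Average of sumscore / frequency over the proteins of one experiment.

    hitlist_rows: iterable of (sumscore, frequency) pairs, assumed here
    to hold one pair per identified protein.
    """
    rows = list(hitlist_rows)
    if not rows:
        return 0.0  # no proteins identified, nothing to average
    total = sum(sumscore / frequency for sumscore, frequency in rows)
    return total / len(rows)


# Made-up values: 90/3 + 40/2 + 10/1 = 60, divided by 3 proteins.
print(average_protein_aps([(90.0, 3), (40.0, 2), (10.0, 1)]))  # 20.0
```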

Figure 4.9: List of various information in experiment view.

The Experiment view provides a detailed description of the experiments stored in GAPP. On this page the metadata, the search parameters, the data format, the number of spectra contained in the file, the number of peptides identified, the average APS and the number of proteins identified are all shown to the user to offer a full picture of the experiment. Moreover, the list of the proteins identified is reported with links to the corresponding Protein view, described in the following section.

4.4 Protein View

The Protein view aims to provide a detailed description of every protein and corresponding peptides identified by GAPP. As discussed earlier, the idea was to meet the needs for an intuitive graphical interface able to offer a complete and clear visualization of the proteins and peptides identified.

As in the case of the Experiment view, a summary at the top of the page reports the number of proteins identified by GAPP, distinguishing the user’s experiments, if any, from the public experiments, that is, the experiments performed by other users who gave permission to share their data.

After the summary, a form is available to select the protein of interest and to visualize the details about its identification. A view of the page is shown in Figure 4.10.

Figure 4.10: Searching of the protein in Protein view.

Once the protein has been selected, another page appears in the browser, “Protein_view_res.php”. This page maintains the same style as “Experiment_view_res.php”: it is also composed of several blocks, each one associated with a certain feature of the protein analysed, and again the user is given the option to collapse or expand sections of data.

Using forms with the GET method, it is possible to send the data as part of the URL, as can be seen in the code reported below.

The code above refers to the “Sequence” block, but the other blocks have been implemented following the same principles.
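The original form code is not included in this text. To illustrate what sending data "as part of the URL" means, the following Python sketch builds and parses a GET-style URL. The parameter names protein_id and section, the helper name and the identifier value are hypothetical; only the page name comes from the text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_view_url(page, **params):
    """Append the parameters to the page name as a GET query string."""
    return f"{page}?{urlencode(params)}"


# Hypothetical parameters: which protein to show and which block to expand.
url = build_view_url(
    "protein_view_res.php", protein_id="ENSP00000000001", section="sequence"
)
print(url)

# On the receiving side the values are read back from the query string.
query = parse_qs(urlsplit(url).query)
print(query["section"][0])
```

Because the state travels in the URL, a page opened this way can be bookmarked or shared, which is a common reason to prefer GET for read-only views.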

The features displayed for each protein are the following:

1. Sequence, the sequence of amino-acids that form the protein, in other words its primary structure;

2. Description, the description or official name of the protein. This information is taken from the Ensembl release of the specific organism. At the moment several organisms are supported by the GAPP, including human, horse, guinea pig, fruit fly, mouse and xenopus;

3. Gene ontology description, the description of gene products in terms of their associated biological processes, cellular components and molecular functions, using three species-independent ontologies;

4. Statistics, the statistical distribution of representative metadata, that is, data about various properties of the experiments in which the protein has been identified. The graphical instrument chosen to illustrate the statistical data is a 3D pie-chart, due to its easy interpretation and widespread usage within the proteomics community;

5. Number of experiments, the number of experiments in which a protein has been identified;

6. Average number of peptides used in the identification, the mean number of peptides used in the identification process of the protein;

7. Average APS, calculated as the sum of the fraction sumscore/frequency for each peptide, divided by the number of experiments in which the protein has been identified;

8. Modifications, possible post-translational modifications, that is, modifications that proteins sometimes undergo after protein biosynthesis;

9. Peptide coverage, the graphical representation of the mapping of the identified peptides onto the sequence of the protein. It consists of two views: the first maps the peptides identified in different experiments all together onto the protein sequence, to show how many times a certain peptide has been identified across the various experiments; the other shows the peptides mapped in each experiment, annotating any modifications.

Figure 4.11: Protein view.

4.4.1 Sequence

The primary structure of a protein is specified by the sequence of amino-acids along its backbone. This structure can be subject to modification: proteins can become cross-linked, most commonly by disulfide bonds, the chiral centers of a polypeptide chain can undergo racemization, or the protein can undergo a variety of post-translational modifications.

The whole sequence of amino-acids forming the protein is reported in this field. Amino-acids are grouped in sets of ten, and in each row a number indicates the position of its first element (Figure 4.12). The indexing makes a particular amino-acid easier to locate within the sequence.

Figure 4.12: Sequence of amino-acids composing the protein.

The sequence of amino-acids composing the protein identified is stored in the “protein” table; for display it must be split into groups of ten elements. The script used to achieve this is shown below.
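The original script is not reproduced in this text. A minimal Python sketch of the same idea (groups of ten residues, with each row prefixed by the 1-based position of its first residue) could look as follows; the number of groups per row, the function name and the example sequence are assumptions.

```python
def format_sequence(sequence, group=10, groups_per_row=6):
    """Split a protein sequence into indexed rows of ten-residue groups."""
    rows = []
    row_len = group * groups_per_row
    for start in range(0, len(sequence), row_len):
        chunk = sequence[start:start + row_len]
        blocks = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # The leading number is the position of the first residue in the row.
        rows.append(f"{start + 1:>6} " + " ".join(blocks))
    return rows


# Made-up 25-residue sequence, shown with two groups per row.
for line in format_sequence("ACDEFGHIKLMNPQRSTVWYACDEF", groups_per_row=2):
    print(line)
```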

4.4.2 Description

The description (Figure 4.13) holds the official name of the protein, as it pertains to UniProt/Swiss-Prot. UniProt stands for UNIversal PROTein resource; it is a central repository of protein data created by a consortium formed by Swiss-Prot, TrEMBL and PIR to develop the most comprehensive resource on protein information. It is considered the gold standard of protein annotation. Swiss-Prot has developed terminological resources for genes and proteins: the official gene symbols, official full names and synonyms.

GAPP retrieves the description of the proteins identified from Ensembl Gene Report (its link is reported near the protein official name).

Figure 4.13: Protein description.

The information about the description of the protein and the corresponding protein id are stored in “geneDesc” table. The SQL query used to retrieve it is listed below.

4.4.3 Gene ontology description

The GO consortium is a group of model organism, protein database and biological research communities all involved in the same project, which provides a consistent common description for the gene products found in different databases. The aim of the project is to solve the problems caused by wide variations in terminology, which cause a huge loss of time for experimentalists looking for information in each small area of research.

The project promotes the development of three ontologies to describe gene products in terms of their ‘biological process’, ‘cellular component’ and ‘molecular function’. A biological process is a recognized series of events or molecular functions; a cellular component is a part of the cell, that is, it describes locations at the level of subcellular structures and macromolecular complexes; finally, the molecular function ontology describes activities that occur at the molecular level.

After an effort to understand the structure of the ontologies, the Gene Ontology information has been retrieved and stored in a dedicated database, termed "gene_ontology".

When the gene ontology section is expanded, the GO descriptions are shown grouped by the three different ontologies, as can be seen in Figure 4.14. A link to the GO website is provided near each gene description; this is useful to see other details on the hierarchical structure of the GO description.

Figure 4.14: Gene ontology description of a protein.

4.4.4 Statistics

Experimentalists need to compare the different experimental conditions which allowed the identification of the protein selected and analysed in this page.

Understanding this demand, the idea was to develop an intuitive statistical graph to visualise the distribution of the metadata characterising the experiments concerning that protein.

The first step was to establish which metadata were relevant; this step was accomplished in close collaboration with some biologists of Cranfield University. The choice was to show the following parameters: the instruments used for the experiments, such as ESI-TRAP, ESI-QUAD-TOF and MALDI-TOF-TOF; the potential disease state, such as a carcinoma or a healthy state; the tissue types analysed, such as serum, brain or saliva; the cell types; the development stages of the sample, for example adult or unknown; the treatments; the time; the max_charge; and the experiment type. To visualise one of these statistical parameters minimal input is required from the user, who has only to select the data of interest from the menu, as in Figure 4.15.

For the design of the dynamic graph the GD library was used, an open source code library for the dynamic creation of images by programmers, available for Perl, PHP and other languages.

The application needed a function for the dynamic generation of well-spaced colours, because the problem encountered in the realisation of this graph was the discrimination of two neighbouring sectors. Since the number of variables, that is the number of pie sectors, changes dynamically, a dynamic number of well-spaced colours is required to prevent adjacent sectors from being too similar and therefore indistinguishable.

Finally, the realization of the pie-chart consisted of three steps:

1. The definition of dynamic queries, necessary to retrieve the data and to calculate the corresponding percentages;

2. The implementation of a function to generate well-spaced colours, so that contiguous sectors do not have nearly the same colour;

3. The design of a 3D pie-chart and legend.

The design of the statistical graph has been implemented in a separate page, “piechart_protein.php”. The first step requires the definition of two variables, sent from “protein_view_res.php”, which store the protein identified and the property chosen for the representation.

These variables are then used to create dynamic queries and to calculate the width of the sectors:
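The original listing is not included in this text. As an illustration of the sector-width step only, the following Python sketch turns a list of metadata values (as a query might return them, one per experiment) into percentages and start/end angles. The function name and the input values are made up; in the actual page the counts may well come directly from a grouped SQL query.

```python
from collections import Counter


def sector_angles(values):
    """Return (label, percentage, start_angle, end_angle) for each sector.

    Angles are in degrees; sectors are ordered from most to least frequent.
    """
    counts = Counter(values)
    total = sum(counts.values())
    sectors = []
    start = 0.0
    for label, count in counts.most_common():
        fraction = count / total
        sweep = 360.0 * fraction
        sectors.append((label, 100.0 * fraction, start, start + sweep))
        start += sweep
    return sectors


# Hypothetical tissue types of four experiments identifying one protein.
for sector in sector_angles(["serum", "serum", "brain", "saliva"]):
    print(sector)
```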

Once the sector widths of the pie-chart are known, a function was created to assign well-spaced colours to the sectors.
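The colour function itself is not shown in this text, so the following Python sketch is only one possible approach, not the thesis implementation. It places n hues evenly around the colour wheel and walks through them with a stride that is coprime with n, so that every pair of neighbouring sectors (including the last and the first) is the same number of hue steps apart. The function name and the saturation and brightness values are assumptions; the result is a list of RGB triples with components in the range 0 to 255.

```python
import colorsys
from math import gcd


def spaced_colours(n):
    """Return n distinct RGB triples (0-255) for n pie sectors."""
    if n <= 0:
        return []
    if n > 2:
        # Largest step up to n/2 that still visits every hue exactly once.
        stride = max(s for s in range(1, n // 2 + 1) if gcd(s, n) == 1)
    else:
        stride = 1
    colours = []
    for i in range(n):
        hue = ((i * stride) % n) / n
        r, g, b = colorsys.hsv_to_rgb(hue, 0.75, 0.95)
        colours.append((round(r * 255), round(g * 255), round(b * 255)))
    return colours


print(spaced_colours(5))
```

For some sector counts (for example 4 or 6) the only usable stride is 1, so neighbouring sectors get adjacent hues; the colours are still all distinct, but a more elaborate ordering would be needed to separate them further.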


Finally, the pie-chart and the legend are created. The 3D effect is obtained by overlapping several circles shifted below the uppermost one.

Figure 4.15: 3D Pie-chart of statistic metadata values, in this case showing tissue type.

4.4.5 Modifications

After translation, that is the process by which the messenger RNA (mRNA) is decoded to guide the synthesis of a chain of amino acids that form the protein, proteins can be modified. The post-translational modifications of amino-acids are important because they extend the range of the protein functions.

A modification can be caused by attaching other biochemical functional groups such as acetate, phosphate, various lipids and carbohydrates to it, by changing the chemical nature of an amino acid or by making structural changes, like the formation of disulfide bridges.

All the modifications observed with mass spectrometry for the protein analysed are listed in a window, an example of which is shown below. For each modification the corresponding experiment and the complete sequence of amino-acids are reported. The modifications are highlighted in red in the protein sequence, and for each of them the amino-acid modified, its position inside the sequence, the type of modification and the mass change are specified.

Figure 4.16: Post-translational modifications observed.

Finally, a link to the unimod database (www.unimod.org), where all the details about that particular modification are stored, is provided as well.

The implementation required extracting, with a suitable SQL query on the modifications table, the modification ids, the amino-acids affected, the positions of the modified amino-acids and the consequent mass variations for that protein in each experiment.

Once the necessary information has been retrieved, a script allows users to visualize the amino-acid modified in the correct position for a quick and intuitive understanding of the modification state of the peptides observed.

Below, the queries to determine the modifications:

Once the data have been retrieved, a script takes care of their visualisation:
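The original script is not included in this text. The highlighting step can be sketched in Python as follows, assuming the modified positions are 1-based indices into the protein sequence; the function name and the inline-style markup are illustrative choices, not the markup used by GAPP.

```python
from html import escape


def highlight_modifications(sequence, positions):
    """Return the sequence as HTML with modified residues shown in red.

    positions: 1-based positions of the modified amino-acids (assumed).
    """
    marked = set(positions)
    parts = []
    for index, residue in enumerate(sequence, start=1):
        if index in marked:
            parts.append(f'<span style="color:red">{escape(residue)}</span>')
        else:
            parts.append(escape(residue))
    return "".join(parts)


# Made-up example: the methionine at position 4 carries a modification.
print(highlight_modifications("ACDM", [4]))
```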


4.4.6 Peptide coverage

This part of the page is very important because it gives a complete picture of the peptides identified. In particular, it is structured in two views:

1. Summary view, the representation of all the identified peptides mapped onto the protein sequence;

2. Detailed view, a detailed view of the peptides identified in every single experiment.

These two views have been implemented using the GD graphics library and two new web pages: “Peptide_summary.php” and “Peptide.php”.

In the Summary view, all the identified peptides are mapped onto the protein sequence using a scale of blue as an indicator of the number of peptides identified at a given position of the sequence: the darker the colour, the more peptides have been identified in the corresponding part of the protein. A scale placed at the top of the section indicates the length of the sequence, so that users can locate the peptides within the protein. At the bottom, a legend reports the quantitative data about the frequency associated with the intensity of the colour used in the graph.
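The mapping from peptide counts to shades of blue can be sketched as follows. This is only an illustrative Python sketch, not the GD-based PHP code of GAPP: the function names, the 1-based inclusive peptide coordinates and the particular RGB ramp are assumptions.

```python
def coverage_depth(protein_length, peptides):
    """Count how many peptides cover each position of the protein.

    `peptides` is a list of (start, end) pairs, 1-based and inclusive.
    """
    depth = [0] * protein_length
    for start, end in peptides:
        for position in range(start - 1, end):
            depth[position] += 1
    return depth


def blue_shade(count, max_count):
    """Map a peptide count to an RGB colour: more peptides, darker blue."""
    if count == 0:
        return (255, 255, 255)  # uncovered positions stay white
    level = round(200 * (1 - count / max_count))
    return (level, level, 255)


depth = coverage_depth(10, [(1, 4), (3, 6)])
colours = [blue_shade(count, max(depth)) for count in depth]
```

In this toy case positions 3 and 4 are covered by both peptides and therefore receive the darkest shade.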

Finally, a link at the top makes it possible to switch to the Detailed view. The Summary view is shown below, in Figure 4.17.

Figure 4.17: Summary view of the peptides identified mapped on the protein.

The Detailed view completes the information provided by the Summary view. In this view the experiments are listed sorted by sumscore/freq. For each experiment the protein sequence is reported and the peptides identified in that particular experiment are mapped onto it. When the mouse passes over a peptide, a flag shows the peptide sequence and its start and end positions within the protein, which makes the peptide easier to locate.

As in the Summary view, the scale indicating the length of the protein identified is shown at the top of the page.

The Detailed view also reports the modifications, mapped at their position inside the individual peptides, together with the modified amino-acids.

The colour of a peptide corresponds to the instrument used in the experiment. At the bottom of the page, a legend explains the correspondence between the peptide colours used in the drawing and the instruments used in the corresponding experiments.

In Figure 4.18, some parts of a Detailed view are shown with the 88 peptides identified. The first part shows some experiments in which modifications have been observed, the second part shows the variety of colours that characterise the instruments used in the various experiments, and the legend is at the bottom.

Figure 4.18: Detailed view of the peptides identified.

4.4.7 Other information

Other information shown in “Protein_view_res.php” includes:

• Number of experiments, the number of experiments in which the protein has been identified.

• Average number of peptides used in the identifications.

• Average APS, the sum of sumscore/frequency over all peptides divided by the number of experiments in which the protein has been identified.
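The Average APS defined in the last item can be written out as a short calculation. The sketch below is in Python and uses hypothetical field names (sumscore, frequency); the numbers are invented purely to show the arithmetic.

```python
def average_aps(peptides, n_experiments):
    """Sum sumscore/frequency over the peptides, then divide by the
    number of experiments in which the protein was identified."""
    if n_experiments <= 0:
        raise ValueError("the protein must be identified in at least one experiment")
    total = sum(p["sumscore"] / p["frequency"] for p in peptides)
    return total / n_experiments


# Invented example: two peptides, protein seen in two experiments.
example_peptides = [
    {"sumscore": 90.0, "frequency": 3},  # contributes 30.0
    {"sumscore": 40.0, "frequency": 2},  # contributes 20.0
]
example_aps = average_aps(example_peptides, n_experiments=2)  # (30 + 20) / 2
```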

The Protein view provides a detailed description of the proteins and corresponding peptides identified by GAPP. The graphical interface developed is intuitive and complete thanks to a number of sections that provide an easy and clear visualization of the protein properties and of the peptides identified.

Experimentalists are interested in the sequence of amino acids that forms the protein, shown graphically in the Protein view by clicking on Sequence, and in the description of the protein identified, available by clicking on Description. Moreover, in GAPP it is possible to see the universal names of the gene products in terms of their associated biological processes, cellular components and molecular functions, by clicking on Gene ontology.

Statistical representations of data are much appreciated by biologists because they give an intuitive picture of the percentage of cases and conditions related to the identification of a certain protein. This information is included in Statistics, in which the metadata distribution is shown by a 3D pie chart. Other details contained in the Protein view are the Number of experiments in which the protein has been identified, the Average number of peptides used in the identification process of the protein, and the Average APS, calculated as the sum of sumscore/frequency over all peptides divided by the number of experiments in which the protein has been identified.

Finally, Modifications, i.e. any post-translation modifications, are shown graphically inside the amino-acid sequence, specifying the position at which each modification occurs and the amino-acid that has been modified. The last piece of information is the Peptide coverage, which consists of two views: the Summary view, in which all the identified peptides are mapped onto the protein sequence and the scale of blue gives an idea of how frequently each part of the sequence has been identified, and the Detailed view, in which the identified peptides are again mapped onto the sequence but grouped by experiment, and the colours indicate the instrument used in the experiment. In this view it is also possible to observe any modifications that affect individual amino-acids.

Thanks to all this information, represented in such a clear way, the Protein view is an efficient tool for visualising the features of the proteins identified.

4.5 Differential View

The increasing need to compare results obtained from different experiments led to the development of a dedicated view, the differential view.

In essence, it consists of an insightful representation of the gene products identified across different experiments, which allows users to compare them.

The idea is to create a tabular view having the proteins as rows and the experiments in which those proteins have been identified as columns. In each cell a colour on a blue scale indicates the relevant APS score which, as discussed earlier, is a measure of the confidence of the identification. Due to the large amount of data stored in GAPP, it can be difficult to have a meaningful comparative global view of results that can involve several thousand proteins.

For this reason, a filter has been implemented to bring out the important information by selecting only the proteins of interest for the user’s search. The filtering of the identified proteins is based on Gene Ontology, the most popular annotation tool used by scientists. Briefly, Gene Ontology, as seen in the previous paragraphs, describes a gene product in terms of its biological process, cellular component and molecular function; it is discussed in more detail in the next paragraphs.

Thus, after selecting the proteins to visualise by indicating the experiments of interest, the user can filter the proteins by their gene ontology.

A pie-chart is also used to represent in a concise and graphical way the gene ontology features of all the proteins filtered.

4.5.1 Selection of proteins

The visualization of the differential view requires the selection of the interesting proteins identified and stored in GAPP for the comparison.

The selection of the proteins can be achieved by choosing the experiments in which the proteins of interest have been identified. The choice of experiment can be based on:

• Properties of the experiment, in particular the tissue type and/or the cell type of the sample, the sample species, the disease state (if any) and the instrument used for the identification of the proteins;

• Experiment IDs, either by typing the experiment IDs directly, if known, or by choosing the experiments of interest from a list.

The differential view page, shown in Figure 4.19, is composed of a form for the selection of the experiment properties mentioned above.

Figure 4.19: Differential view.

The queries used to select the experiments and proteins desired for the analysis are dynamic, and they use the following variables, corresponding to the possible choices available to the user:

Part of each dynamic query is a variable string that depends on what the user selects in the form.

After the variable definition, a query retrieves the experiment IDs.
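A minimal sketch of how such a dynamic query could be assembled is given below, in Python with an in-memory SQLite database standing in for GAPP. The table name experiment, its columns and the sample rows are all hypothetical; only the idea of appending one condition per selected form field comes from the text. The sketch uses bound parameters rather than string concatenation of user input, which is a design choice of the example.

```python
import sqlite3

# Hypothetical column names, one per property selectable in the form.
ALLOWED_FILTERS = ("tissue", "cell_type", "species", "disease", "instrument")


def build_experiment_query(selection):
    """Build the SQL text and parameter list from the user's selection."""
    clauses, params = [], []
    for column in ALLOWED_FILTERS:
        value = selection.get(column)
        if value:  # fields left empty in the form add no condition
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT experiment_id FROM experiment"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY experiment_id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (experiment_id INTEGER, tissue TEXT, "
    "cell_type TEXT, species TEXT, disease TEXT, instrument TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "liver", "hepatocyte", "human", "", "LTQ"),
        (2, "liver", "hepatocyte", "mouse", "", "QTOF"),
        (3, "brain", "neuron", "human", "", "LTQ"),
    ],
)

sql, params = build_experiment_query({"tissue": "liver", "species": "human"})
experiment_ids = [row[0] for row in conn.execute(sql, params)]
```

Because the column names come from a fixed list and the values are bound as parameters, the variable part of the query cannot be altered by what the user types into the form.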

If, instead of the properties, the user prefers to select the experiment IDs manually, it is enough to click on the blue link below the window of the differential view page: another page, consisting of two forms, appears in the browser.

The first form, Figure 4.20, requires the experiment IDs to be typed in, separated by semicolons.

Figure 4.20: Experiment selection by IDs.

In this case, the string $experiment_id must be split at the semicolons.

The resulting array $experiments then contains the selected experiment IDs.
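The splitting step can be illustrated with a few lines of Python. The function name is hypothetical, and the sketch assumes that experiment IDs are positive integers, which the text does not state; anything else typed into the form is silently skipped.

```python
def parse_experiment_ids(raw):
    """Split a semicolon-separated string into a list of integer IDs."""
    ids = []
    for token in raw.split(";"):
        token = token.strip()
        if token.isdigit():  # ignore empty or non-numeric entries
            ids.append(int(token))
    return ids


experiments = parse_experiment_ids("12; 7;;abc;30")
```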

The selection of experiments by ticking the experiments of interest is shown in Figure 4.21.

Figure 4.21: Experiment selection by IDs.

The previous representation consists of the list of all the experiments stored in GAPP. The experiments are grouped in two sets: the user’s own experiments, if available, and the public experiments, i.e. experiments whose owners have given their consent to data sharing.

Grouping the experiments in two sets helps the user to locate them quickly.

Moreover, by clicking on an ID in the experiment list, it is possible to open the Experiment view of the selected experiment. This gives users a complete description of the experiment properties and facilitates the choice.

For this kind of experiment selection, a set of variables is sent from the form to the page for the protein selection and added to the array $experiments:

Once the array $experiment has been filled, the table can be created.

The creation of the differential view requires the list of experiments, the protein identified for each experiment and the corresponding APS score.

Finally, the algorithm for the computation of the APS score consists of an if/else statement that considers the two cases, experiments selected by properties and experiments selected directly by IDs. In each case it retrieves, by means of queries, the proteins and the data necessary to calculate the APS: in the first case using the variable $param_qry seen previously, and in the second using the array $experiment shown above.

4.5.2 Table creation

Once the experiments have been selected and the identified proteins have been retrieved together with the corresponding APS scores, all the data necessary for the creation of the table are available.
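To make the shape of these data concrete, the sketch below pivots a flat list of (protein, experiment, APS) records into a protein-by-experiment table. It is written in Python for illustration; the record layout and the identifiers are invented and do not reflect the GAPP schema.

```python
def pivot_aps(records, experiments):
    """Arrange (protein, experiment, aps) records as rows of a table.

    Cells for experiments in which a protein was not identified stay None.
    """
    table = {}
    for protein, experiment, aps in records:
        row = table.setdefault(protein, {e: None for e in experiments})
        row[experiment] = aps
    return table


# Invented example records.
records = [("P1", 1, 40.0), ("P1", 2, 25.0), ("P2", 2, 10.0)]
table = pivot_aps(records, experiments=[1, 2])
```

Each row of the resulting table corresponds to one protein and each key of the inner dictionary to one selected experiment, so an empty cell is simply a None value.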

The image is composed of a collection of smaller images, created in another web page, which must be passed the relevant APS, APS max and APS min:

The creation of the picture needs the following three variables:

An algorithm has been implemented to allocate the colours. Once again, it is important to have well-spaced colours so that differences in the APS values can be detected easily:

The image is a rectangle, coloured with the colour allocated as above:
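One possible way of allocating well-spaced colours from the three values APS, APS min and APS max is sketched below in Python. It is not the allocation algorithm used in GAPP: the number of steps, the RGB ramp and the function name are assumptions chosen for the example.

```python
def aps_to_blue(aps, aps_min, aps_max, steps=5):
    """Map an APS value to one of `steps` evenly spaced shades of blue.

    Higher APS gives a darker blue. `steps` must be at least 2.
    """
    if aps_max <= aps_min:
        bucket = steps - 1  # degenerate range: use the darkest shade
    else:
        fraction = (aps - aps_min) / (aps_max - aps_min)
        fraction = min(max(fraction, 0.0), 1.0)  # clamp to [0, 1]
        bucket = min(int(fraction * steps), steps - 1)
    level = 220 - bucket * (220 // (steps - 1))
    return (level, level, 255)


lightest = aps_to_blue(10, aps_min=10, aps_max=50)
middle = aps_to_blue(30, aps_min=10, aps_max=50)
darkest = aps_to_blue(50, aps_min=10, aps_max=50)
```

Using a small number of discrete steps instead of a continuous gradient keeps neighbouring shades far enough apart to be told apart by eye, at the cost of hiding small differences within a step.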

The resulting view, shown in Figure 4.22, is a table having the experiments selected at the beginning of the process as columns and the corresponding identified proteins as rows. The experiments in the columns and the proteins in the rows are links to the corresponding Experiment view page and Protein view page. Each cell is filled with a colour indicating the degree of confidence of the protein identification, given by the APS (Average Peptide Score).

Figure 4.22: Differential view.

At the top of the picture, a legend explains the meaning of the cell colours.

A flag for each cell (see Figure 4.22) indicates the exact APS score and the number of peptides used in the identification of that protein within the corresponding experiment.

4.5.3 Gene Ontology

The design of the filter to select proteins according to their gene ontology features has been one of the crucial parts of the work of this thesis. It required a precise understanding of the Gene Ontology organization as well as the development of the filter itself (which in turn required the implementation of various scripts and the definition of an in-house database). This paragraph is devoted to an explanation of the structure of the Gene Ontology.

The Gene Ontology (GO) is maintained by a consortium composed of many databases for plant, animal and microbial genomes and of research communities, all involved in the development and application of the Gene Ontology. The GO consortium members, such as the Berkeley Bioinformatics and Ontology Project (BBOP), the British Heart Foundation - University College London (BHF-UCL) and many others, participate actively in the use and development of the Gene Ontology, in particular in the development and maintenance of the ontologies themselves, in the annotation of gene products and in the development of tools that facilitate the creation, maintenance and use of ontologies.

The importance of the GO project lies in the creation of a common terminology accepted by the scientific world.

The GO consists of three structured ontologies describing gene products in terms of their associated biological processes, cellular components and molecular functions. The description is universal, meaning that it does not depend on species.

The characteristic that renders GO popular and special, as well as complex, is that GO is “structured”. This means that it is accessible at different levels, or better, there is a hierarchy that allows users to find all the gene products of a specific species described by the three vocabularies at different levels of detail, such as the gene products in the mouse genome that are involved in signal transduction or, zooming in, down to the level of all the receptor tyrosine kinases.

This allows scientists to annotate genes or gene products at different levels, depending on the depth of knowledge about that entity.

Every term used in GO to annotate a gene or gene product is identified by a unique identifier of the form GO:xxxxxxx (where each x is a digit) and by a term name, and is assigned to one of the ontologies (biological process, cellular component or molecular function).
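The identifier format lends itself to a simple validity check, sketched here in Python. The helper name is hypothetical; the pattern encodes only the "GO:" prefix followed by seven digits described above and says nothing about whether the identifier actually exists in the ontology.

```python
import re

# "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_go_id(text):
    """Return True if `text` has the shape of a GO identifier."""
    return GO_ID_PATTERN.fullmatch(text) is not None
```

For instance, is_go_id("GO:0016020") holds for the identifier shown in Figure 4.23, while a string with fewer digits or a different prefix is rejected.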

GO sometimes uses synonyms to refer to fields that may not have exactly the same meaning as the term they are attached to. In this case the gene product annotation will consist of the same GO ids but different names (synonyms). This is to limit the number of GO ids used in the ontology. The relationship between the synonyms and the corresponding term is indicated in the GO file.

Moreover, it may happen that a term describes a concept that would be better represented in another way or that has been modified due to some new knowledge. In this case the term is marked as ‘obsolete’ in newer releases of the GO but, for compatibility reasons, it is preserved in the database.

4.5.4 GO vocabularies: Biological process, cellular component and molecular function

As seen above, the Gene Ontology consists of three structured vocabularies:

Biological process, a series of events or molecular functions; Cellular component, the localization of the element described by the ontology and, finally, Molecular function, activities that occur at the molecular level.

The ontologies are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent). An example of the hierarchical structure is reported in Figure 4.23, which refers to the term GO:0016020 (the marked circle).

Moreover, GO terms can be linked by five types of relationships: is_a, a simple class-subclass relationship; part_of, a more complex relationship; and finally regulates, positively_regulates and negatively_regulates, which describe interactions between biological processes and other biological processes, molecular functions or biological qualities. All the cited relationships are transitive.

Each gene product is associated with one or more of these fields and, more precisely, it can also be associated with more than one description in the same field.

Figure 4.23: Hierarchical structure of GO:0016020.

Biological Process covers recognized series of events or molecular functions.

A biological process is not equivalent to a pathway, although some GO terms do describe pathways.

A process recorded in Biological Process is related to its parent by is_a, if it is an instance of the parent process, or by part_of, if it is an instance of only a portion of the parent process.

In biological process the following information can be annotated:

1. Cell cycles, physical processes as well as temporal stages of the cell cycle can be represented within the GO.

The structure of the cell cycle description consists of a set of nodes: the cell cycle node sits under cellular physiological process (GO:0050875) and is split into types of cell cycle (such as meiotic or mitotic) and stages (such as M phase, S phase, etc.), plus a regulation term.

The cell cycle structure is also the structure used for development terms involving tissues, organs and organisms: development, which describes the progression of a generic element x over time, from its formation to the mature structure; morphogenesis, the process by which the anatomical structures of x are generated and organized; formation, the initial formation of the structure x from unspecified parts; cell differentiation, the process in which a cell acquires the specialized features of a certain type of cell; structural organization, the process that creates the structural organization of x; and, finally, maturation, the developmental process required for x to attain its fully functional state.

2. Interaction between organisms, this term and its children describe interactions that occur between organisms of different species and also include the existing terms that describe interactions between organisms. The categorisation of interactions between organisms is based on the nature of the interaction (it can be behavioural or physiological, and inter- or intra-species). The node structure is shown below:

3. Metabolism, includes both biosynthesis and catabolism. Metabolic processes can be divided into two types: the metabolism of the organism, which occurs at the level of a multi-cellular organism or in more than one cell type, and cellular metabolism, which occurs at the level of the cell and is therefore restricted to a single cell or cell type.

4. Regulation, the regulation of a process is not a part of the process itself; rather, a regulation term denotes a process that modulates its parent process.

The regulates relationship is transitive.

Regulation terms include: regulation of process, which modulates the frequency or extent of a process; regulation of enzyme activity, which modulates the frequency, rate or extent of enzyme activity, i.e. the catalysis of the reaction; regulation of function, which modulates the frequency, rate or extent of a function; regulation of phenotype, which modulates a phenotype; negative regulation of process, which stops, prevents or reduces the frequency, rate or extent of a process; and positive regulation of process, which activates, maintains or increases the frequency, rate or extent of a process. They also include activation of process and of enzyme activity, maintenance of process and of phenotype, up-regulation of process, i.e. any process that increases the frequency, rate or extent of the active process, down-regulation of process, which reduces the frequency, rate or extent of the active process but does not terminate it, inhibition of process, termination of process and inactivation of enzyme activity.

5. Detection of and response to stimulus, describes the response of a cell or an organism to a stimulus occurring within or outside the cell or organism, i.e. a change in the state or activity of a cell or an organism as a result of a stimulus. Moreover, it describes the detection of the stimulus, i.e. the process by which a stimulus is received by a cell and converted into a molecular signal.

6. Sensory perception, occurs only in organisms capable of performing neuro-physiological processing of the stimuli. It involves the detection of the stimulus and subsequent recognition and characterization of it. Five different stimulus types are involved in sensory processing: chemical, mechanical, electrical, light and temperature.

7. Signaling pathway, the series of molecular signals that form the signalling pathway.

8. Transport and localization, processes that influence the location of an entity inside or outside the cell, including the establishment of localization, which covers transport and/or autonomous movement of substances or cellular components as well as the orienting of a protein or organelle, and the maintenance of localization, which covers sequestering and active retrieval processes.

9. Transporter activity, as the name suggests, describes the activities of transporters, active and passive. The active transmembrane X transporter activity is the catalysis of the transfer of X from one side of the membrane to the other, up the solute's concentration gradient. The process is followed by a series of conformational changes. The passive trans-membrane X transporter activity is the catalysis of the transfer of X from one side of the membrane to the other, down the solute's concentration gradient.

In summary, the Biological Process Ontology includes the description of every event or molecular function that can involve a gene. Its structure is complex but well structured in a hierarchical fashion.

Cellular Component. In general terms, a gene product is located in a particular cellular component or is a subcomponent of it. The Cellular Component Ontology describes these locations at the level of sub-cellular structures and macromolecular complexes. This ontology includes multi-subunit enzymes and protein complexes, but it does not include individual proteins, nucleic acids or multi-cellular anatomical terms. As the name of the ontology suggests, it describes the cellular components. A cell can be defined as the set of all components within the plasma membrane (including the membrane itself) and any external encapsulating structures. Therefore, the ontology includes:

1. Protein complexes. They are named with the enzyme name followed by the word 'complex'.

The name can be confused with that of a molecular function, which however has the word 'activity' appended: for example, pyruvate dehydrogenase complex is a cellular component, but pyruvate dehydrogenase activity is a molecular function.

2. Membranes and envelopes. The membranes that surround organelles, as annotated in GO, are divided into single membranes (a single bi-layer) and double membranes (i.e. two bi-layers). While the term of the single membrane includes information about the inter-membrane space, the double membrane is described only by the lipid bi-layer. An example of this structure follows:

3. Membrane proteins, describes the location where a gene product may act. These terms actually describe spatial relationships between a membrane and a gene product.

The classification of membrane proteins (Figure 4.24) in GO depends on their location. They can be extrinsic to membrane, i.e. gene products associated with membranes, or intrinsic to membrane, i.e. gene products that have some covalently attached moiety embedded in the membrane; the latter can be integral to membrane or anchored to membrane.

Figure 4.24: Membrane proteins.

In summary, the Cellular Component Ontology describes the cell components and the localization of gene products or it indicates if they are subcomponents of a particular cellular component.

Molecular function describes activities that can be performed by individual gene products or by assembled complexes of gene products. The activities annotated in the Molecular function ontology occur at the molecular level. As written above, it is possible to confuse a gene product name with its molecular function. For this reason, many GO molecular function terms have the word "activity" added at the end of the name.

The molecular function structure is reported below:

The standard definitions for molecular function terms are: x binding, interacting selectively with x; and activity, the catalysis of a reaction, which can be distinguished into x receptor activity and x transporter activity.

A problem with this ontology is that a gene product may have many different functions, and therefore many different terms. To overcome this limit, the gene product information should be captured at the annotation stage.

Briefly, Molecular function describes activities performed by individual gene products or by assembled complexes of gene products; it stands alongside Biological process, which describes a set of events or molecular functions, and Cellular component, which annotates details about parts of the cell.

4.5.5 Gene Ontology database

Having delved into the meaning of the Gene Ontology and realised the importance of a common ontology adopted by the scientific world, my work focused on the analysis of the Gene Ontology database structure.

In fact, the realization of the filter for the proteins requires the use of an in-house database containing all the known GO annotations for the protein IDs contained in the Ensembl proteomic database. Knowledge of the GO database was therefore necessary to decide whether to use it directly as it is or whether the development of the filter required the creation of a new database organized in a different way.

The GO Database is a relational database containing both the Gene Ontology and the annotations of genes and gene products to terms in the GO. The schema of the database is shown in Figure 4.25 and consists of tables for storing GO terms and relationships between terms, as well as annotations and gene products.

Figure 4.25: GO database schema.

The schema is organised into different modules. One of these modules is the go-graph module, which houses the ontology. The main tables in go-graph are term, which forms the nodes in the ontology graph, and term2term, each entry of which corresponds to an arc/edge in the ontology graph, representing a relationship that holds between two entities. The annotations are stored in the go-association module. Here, the main tables are association, which provides a link between a gene product record and an ontology term, and gene_product, in which genes or gene products are stored, typically at the species level.

The whole GO database is structured to create a graph, in which the GO terms are the nodes and the relationships between them are the arcs. The graph is equipped with a hierarchy; this is evident from the term2term table, in which the edges of the graph and the relations among the different entities are stored.

The structure of the GO database makes it possible to iterate through the graph without requiring multiple SQL calls. This is possible because the GO database pre-computes the path from every node to all of its ancestors, thanks to the concept of transitive closure of a relationship. “Is an ancestor of” is a transitive relation: if an entity x is an ancestor of y and y is an ancestor of z, then x is an ancestor of z. This concept is used to create the graph structure of terms. The term2term table contains the relationships between the entities (the nodes of the graph). By using this table it is possible to move up and down the graph structure (from parents to children and vice versa) by swapping around term1 and term2.
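The navigation described above can be illustrated with a toy version of the term2term table. The sketch below is in Python with an in-memory SQLite database; it assumes that term1_id holds the parent and term2_id the child, and it uses made-up term names instead of real GO identifiers. The ancestors function computes the transitive closure on the fly, which is what a pre-computed path table avoids having to do at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed convention: term1_id is the parent, term2_id is the child.
conn.execute("CREATE TABLE term2term (term1_id TEXT, term2_id TEXT)")
conn.executemany(
    "INSERT INTO term2term VALUES (?, ?)",
    [
        ("root", "membrane"),
        ("root", "organelle"),
        ("membrane", "organelle membrane"),   # a child with two parents
        ("organelle", "organelle membrane"),
    ],
)


def parents(term):
    """Move up: rows where the term appears as the child."""
    rows = conn.execute(
        "SELECT term1_id FROM term2term WHERE term2_id = ? ORDER BY term1_id", (term,)
    )
    return [row[0] for row in rows]


def children(term):
    """Move down: the same query with term1 and term2 swapped."""
    rows = conn.execute(
        "SELECT term2_id FROM term2term WHERE term1_id = ? ORDER BY term2_id", (term,)
    )
    return [row[0] for row in rows]


def ancestors(term):
    """Transitive closure of the parent relation, one query per visited node."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

The loop in ancestors issues one query per visited node, which shows why storing the closure in advance reduces the number of SQL calls.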

In conclusion, the GO database is very useful but complex and, moreover, it contains a lot of information that is not needed for our purposes. For these reasons, we decided that the creation of a dedicated database in GAPP was the best choice for our needs.

To follow this route we therefore had to design a new specific database as well as some scripts to populate the database automatically. For performance reasons, we decided to recreate the graph structure of the GO terms only for those terms that are connected to the genes annotated within the Ensembl release of the organisms for which we ran the proteomics experiments, rather than using the entire graph structure of the GO terms.

4.5.6 Algorithm for the filling of the Gene Ontology database in GAPP

The differential view requires a filter for the identified proteins based on their gene ontology at different levels. The idea is to filter the proteins by choosing the protein properties that the user wants to display at a certain level of detail, keeping in mind the hierarchical structure of Gene Ontology.

To accomplish this task, a way to link the Ensembl database of proteins with the GO was needed first. After that, the new database had to be devised and, finally, an efficient way to populate the database by means of an automatic script had to be implemented.

The link between the Ensembl database and the in-house GO database was provided by the BioMart (www.biomart.org) interface to the Ensembl database.

Thanks to this tool provided by Ensembl (www.ensembl.org), we obtained a tab-separated file containing the known GO annotations for all the gene products of the Ensembl database. This allowed us to create a GO structure of terms similar to that used within the GO database but containing only the subset of terms that belong to at least one of the genes in the Ensembl database. This improved the performance of the system since, in general, the structure we created is smaller than the entire GO term structure.
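Reading such a file and collecting the subset of terms actually in use can be sketched as follows. The sketch is in Python; the two-column layout (gene product ID, GO ID) is an assumption about the export, and the identifiers are placeholders rather than real Ensembl or GO entries.

```python
import csv
import io
from collections import defaultdict


def load_annotations(handle):
    """Read a tab-separated export and group GO IDs by gene product.

    Assumed layout: column 0 is the gene product ID, column 1 the GO ID.
    Rows without a GO ID (unannotated gene products) are skipped.
    """
    terms_by_product = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[1]:
            continue
        terms_by_product[row[0]].add(row[1])
    return dict(terms_by_product)


# Placeholder content standing in for the exported file.
export = io.StringIO(
    "PROT_A\tGO:9999991\n"
    "PROT_A\tGO:9999992\n"
    "PROT_B\t\n"
    "PROT_C\tGO:9999991\n"
)
annotations = load_annotations(export)
# Only these terms (and their ancestors) need to be stored in-house.
used_terms = set().union(*annotations.values())
```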

The following step was the design of the in-house database. Two pieces of information were needed for our filtering purposes: the first was the correspondence between gene products and GO terms, while the second was a representation of the relationships between terms. The latter was needed because we wanted to allow the user to navigate up and down the GO structure while filtering proteins. For the representation of this second piece of information we had two options.

The problem was that, for each annotated gene product, the BioMart tool provided one, and sometimes more, GO terms (the synonyms seen in section 4.5.4). Moreover, to allow the user to navigate the GO structure up and down, we needed a way to recreate the full path from the term provided by BioMart up to the root of the structure, which was conventionally assumed to be either Biological Process, Cellular Component or Molecular Function. In particular, we needed a simple and quick way to show all the possible ontology terms at any given depth of the GO structure. Moreover, these terms had to belong to one of the proteins identified in our experiment. The following two possible solutions were initially identified.

The first option was to explicitly represent, for each Gene Ontology term G associated with a gene product, the relationship with each of its ancestors up to the root, keeping the information on the distance between the term G and each ancestor A. Each row of our database would contain in this case three fields: the term G (which in the differential view will typically correspond to one of the terms associated with one of the proteins identified in our experiments), an ancestor term A and the distance between the two. Therefore, given a term G, the term A_x having the highest distance from it among the rows in the database will always be the [...]

Even if the second solution seems the more efficient (in terms of database size), this structure requires a more complex query which slows down the filtering process. For this reason, the first option was initially adopted.
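The three-field rows of the first option (term G, ancestor A, distance) can be generated from the parent links with a breadth-first walk. The Python sketch below is only an illustration: the parent map, the term names and the choice of the shortest path as the distance when several paths exist are assumptions, not details taken from the GAPP implementation.

```python
from collections import deque


def ancestor_rows(term, parent_map):
    """Return (term, ancestor, distance) rows for every ancestor of `term`.

    `parent_map` maps each term to the list of its direct parents.
    The distance is the length of the shortest path to the ancestor.
    """
    rows = []
    distance = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parent_map.get(current, ()):
            if parent not in distance:
                distance[parent] = distance[current] + 1
                rows.append((term, parent, distance[parent]))
                queue.append(parent)
    return rows


# Toy graph: G has two parents, which share the same top node.
toy_parents = {"G": ["P1", "P2"], "P1": ["top"], "P2": ["top"]}
rows = ancestor_rows("G", toy_parents)
farthest = max(rows, key=lambda row: row[2])[1]
```

With rows of this shape, listing the terms available at a given depth becomes a single selection on the distance field, at the price of one row per (term, ancestor) pair.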

After the implementation of the code, this option turned out not to be feasible from a computational viewpoint, since it was too slow. Another solution was therefore required, which became the definitive one.

The new structure consisted of three columns: the entry, the ancestor and the ancestor parent. This solution is a trade-off between the two solutions seen before. In fact, like the first one it involves several combinations for a gene term G, and like the second option it is based on the relationship existing between adjacent nodes. Thus, the relationship between the first and second columns is similar to the first solution, whereas the relationship between the second and the third column is