Towards Improved Handling of CNV Data for Genetic Analysis - development of a dedicated module for the Gemini toolset

University of Pisa

Master Course in Biomedical Engineering

Dipartimento di Ingegneria dell’Informazione

Author: Andrea Spinelli

Supervisors: Alessio Bechini, Romina D’Aurizio

Advisor: Maurizio Mangione

Andrea Spinelli: Towards Improved Handling of CNV Data for Genetic Analysis, Development of a dedicated module for the Gemini toolset, ©

ABSTRACT

Over the years, researchers have shown that all kinds of DNA variation play a role in susceptibility to genetic disease: not only single-nucleotide polymorphisms (SNPs), but also larger structural variants called copy-number variants (CNVs). A CNV is an alteration of DNA, either a duplication or a deletion, whose length ranges from 50 bp to millions of bases; the affected sequences span large portions of the genome and contain many repetitions of a base pattern, possibly encompassing several genes. CNVs are also known to modulate different aspects of genetic disease.

In this study, variants are extracted from raw sequencing data and then stored in a tab-separated file format called Variant Call Format (VCF). To evaluate whether a single variant or a set of variants may confer risk, the variants are compared against different reference resources.

Gemini is a free and open-source framework for exploring genome variation, based on Python. This thesis work aims to extend the functionalities of Gemini, creating additional tools to handle CNV data.

Unlike existing software, Gemini integrates genetic variation with a diverse set of genome annotations (e.g. ENCODE, UCSC, ClinVar) into a unified, portable database based on SQLite. Its portability and flexibility, the possibility of integrating further genome annotations, the ability to query variant information, and its extensibility with other Python tools make Gemini an extremely interesting piece of software to use and to extend.

As the main result of this work, support for CNV data has been added to Gemini: it is now possible to load VCF files and create a database that integrates some existing genome annotations. To filter the variants, a tool has been developed that overlaps them with a track of the Database of Genomic Variants (DGV) containing a CNV map of benign variants. This tool allows filtering by overlap fraction, alteration (deletion or duplication), length of overlap, and sample. A second tool annotates variants with the Gemini gene map (a track of Ensembl v.75); it also provides options to load a custom gene map and to select the sample on which to run the annotation. The annotation tool also produces a heatmap that shows the correlation between variant alterations and the genes affected by the variants. The browser-based interface of Gemini has been extended with the parts required by every new tool, including a wizard for loading VCF files.

There is no gene for the human spirit.

Gattaca

ACKNOWLEDGEMENTS

I thank my brothers Simone and Matteo for the pushes we have given each other over these years.

I thank my mother for the trust, love and support she has given me.

I thank Flaminia for standing by me, encouraging me and guiding me during the years we have been together.

I thank Alessio and Sonia, former fellow students but above all friends beyond time and space.

Thanks to Riccardo and Michele for letting me live a parallel life.

I thank my soul, which has always told me to keep going, fed me with curiosity about everything and let me breathe. I will take care of you too.

Special thanks go to Dr. Romina D’Aurizio for having trusted me from the start, for introducing me to the world of bioinformatics and for staying close to me even in moments that were difficult for her. Thanks also to Margherita; one day, perhaps, you will hate me.

CONTENTS

1 Introduction
I
2 Genomic structural variant
2.1 The four canonical bases
2.2 Copy-number variant - CNV
2.2.1 CNV formation
2.2.2 CNV detection
2.3 DNA workflow analysis
3 High-throughput technologies for genomic studies
3.1 First generation sequencing
3.2 Microarray
3.3 Next-generation sequencing
3.3.1 Illumina technology
3.4 What’s next?
4 Data analysis tools
4.1 NGS software tools
4.2 VCF as CNV file format
4.3 Database of Genomic Variants - DGV
4.4 Pybedtools
4.5 Gemini
II
5 Development of Gemini tools for CNV treatment
5.1 Load of VCF with CNV data
5.2 Load of a CNV map
5.3 Loading Wizard
5.4 Overlap tool
5.4.1 Intersect
5.4.2 Jaccard index
5.4.3 Filtering
5.4.4 Overlap on browser
5.5 Overlap gene tool
5.5.1 Heatmap
5.5.2 Overlap gene on browser
6 Conclusion
III Appendix
A Bottle
B Git
C Some code
D Gemini-CNV database
Bibliography

LIST OF FIGURES

Figure 1 The major structures in DNA compaction: DNA, the nucleosome, the 10 nm "beads-on-a-string" fibre, the 30 nm fibre and the metaphase chromosome.
Figure 2 SNPs are single-nucleotide variations, INDELs are extra or missing DNA sequences, while SVs are large blocks of extra, missing or rearranged DNA sequences.
Figure 3 In this figure, CNVs are represented as dots beside chromosomes. Changes in the number of dots reflect relative losses or gains. Panels (a), (b) and (c) show simple deletions and duplications, while (d) shows a multi-allelic variant and (e) a complex CNV.
Figure 4 Non-allelic homologous recombination (NAHR) cases. (A) Interchromosomal NAHR can generate deletions and duplications. (B) Interchromatid NAHR can also generate deletions and duplications. (C) Intrachromatid NAHR results in deletion and in the generation of ring chromosomes. (D) Intrachromatid NAHR between LCRs in opposing orientation can result in re-arrangement [13].
Figure 5 Signatures and patterns of SVs for deletion (A), novel sequence insertion (B), inversion (C), and tandem duplication (D) in read count (RC), read-pair (RP), split-read (SR), and de novo assembly (AS) methods [75].
Figure 6 A typical NGS clinical workflow
Figure 7 A high-level genomic workflow, from a presentation by Aaron Quinlan
Figure 8 Cost per genome [17]
Figure 9 The Sanger chain-termination method for DNA sequencing.
Figure 10 The principle of hybridization.
Figure 11 Array-based, genome-wide methods for the identification of copy-number variants. (A) Array-based comparative genome hybridization (array-CGH). (B) Representational oligonucleotide microarray analysis (ROMA) [24].
Figure 12 Next-generation sequencing Illumina technology [35]
Figure 13 SMRT and nanopore technologies are both based on the idea of working directly on non-amplified DNA; the nanopore system (A) measures the direct effect of the DNA sample passing through a channel, while SMRT (B) measures the light emitted by a complementary strand of the DNA sample.
Figure 14 Basic workflow for whole-exome and whole-genome sequencing projects [57].
Figure 15 VCF example [19]
Figure 16 VCF storing CNV
Figure 17 DGV content provenance [48].
Figure 18 Gemini overview of workflow [58].
Figure 19 Gemini database scheme [58].
Figure 20 Gemini browser example [26].
Figure 21
Figure 22 Schema for loading a Copy Number Variation map from DGV
Figure 23 Loading Wizard
Figure 24 Loading Wizard workflow
Figure 25 Overlap tool schema
Figure 26 Intersect
Figure 27 Overlap tool
Figure 28 Overlap gene schema

LIST OF TABLES

Table 1 A summary of the tools and algorithms for the investigation of SVs [75].
Table 2 The most widely used current platforms and a comparison of their specifications [29].
Table 3 Variant Annotation Approaches [38]
Table 4 Summary of CNVs in the genome based on the inclusive and stringent maps [81]
Table 5 Summary of available wrapped BEDTools programs [5]
Table 6 Variants_cnv table
Table 7 dgv_map table
Table 8 Overlap table

LISTINGS

Listing 1 The Gemini API: creating custom scripts in Python [26]
Listing 2 Loading wizard Bottle code
Listing 3 Loading wizard template
Listing 4 gemini_main.py overlap tool parser
Listing 5 Overlap tool
Listing 6 overlap.j2.html
Listing 7 Overlap on web
Listing 8 Annotation with default Ensembl 75
Listing 9 Annotation with custom map
Listing 10 Gene heatmap
Listing 11 Overlap gene view
Listing 12 Overlap gene function web version
Listing 13 Overlap gene on web
Listing 14 Overlap gene custom map function

ACRONYMS

BAC Bacterial Artificial Chromosome: an artificially constructed segment of nucleic acid used to sequence the genome of organisms.

CNV Copy Number Variant: a long DNA sequence, ranging from 50 bp to millions of bases, whose number of copies in an individual genome differs from that in a reference genome. CNVs are alterations: duplications and deletions.

CNP Copy Number Polymorphism: a copy-number variant present to some appreciable degree within a population (e.g. > 1%).

HGNC HUGO Gene Nomenclature Committee: the only worldwide authority that assigns standardised nomenclature to human genes.

INDEL Insertion/Deletion: classified among small genetic variations, measuring from 1 bp to 10 kbp in length.

PED PLINK Pedigree and Genotype Table: a file format to store pedigree and genotype information, which includes fields such as family id, sample id, paternal id, maternal id, sex and genotypes [80].

SNP Single Nucleotide Polymorphism: a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).

VCF Variant Call Format: a tab-separated file format to store gene sequence variations.

1

INTRODUCTION

In recent years, genetic studies have revealed the relationship between various pathological phenotypes and genomic variants. The introduction of high-throughput sequencing technologies has favoured new discoveries, thanks to the substantial reduction of both the cost of genome sequencing and the time needed for a clinical response. Nevertheless, analysis tools are still lacking: each case requires many different kinds of experts and resources to be analysed correctly. There is a growing need for new software dedicated to managing and interpreting genome-scale variation in the context of a disease phenotype, especially for large variants that encompass several genes. The challenge is to reduce the costs of analysis and interpretation, keeping pace with the current trend in raw data generation [50], by building new software tools.

In this kind of study, variants are extracted from raw sequencing data and then stored in tab-separated files, according to the Variant Call Format (VCF), the standard format used by the 1000 Genomes Project. A VCF file is like a big matrix, where the rows are the human genome positions at which genetic variations have been observed in the sequenced sample group, and the columns are the genotypes of each sample. At this point, the challenge is to interpret the genetic variations by trying to isolate those that confer risk. A hypothesis is then posed for every variation and evaluated through different reference resources, which are often highly heterogeneous, large, and stored at different websites.
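The matrix view of a VCF file can be illustrated with a minimal sketch. It is not part of Gemini: the toy records, the sample names `S1` and `S2`, and the helper `read_vcf_matrix` are invented for illustration, and the sketch assumes that the genotype (`GT`) is the first sub-field of each sample column.

```python
import io

# Toy VCF content: two variant positions (rows) and two samples (columns).
# Sample names and records are invented for illustration.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"]
RECORDS = [
    ["1", "1000", ".", "A", "G", "50", "PASS", ".", "GT", "0/1", "0/0"],
    ["1", "2500", ".", "C", "T", "60", "PASS", ".", "GT", "1/1", "0/1"],
]
VCF_TEXT = "##fileformat=VCFv4.1\n" + "\n".join(
    "\t".join(row) for row in [HEADER] + RECORDS
) + "\n"


def read_vcf_matrix(handle):
    """Return (samples, rows); each row is (chrom, pos, ref, alt, genotypes)."""
    samples, rows = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        # Assumption: GT is the first colon-separated sub-field of each sample.
        genotypes = [sample_field.split(":")[0] for sample_field in fields[9:]]
        rows.append((chrom, int(pos), ref, alt, genotypes))
    return samples, rows


samples, rows = read_vcf_matrix(io.StringIO(VCF_TEXT))
for chrom, pos, ref, alt, genotypes in rows:
    print(chrom, pos, ref, alt, dict(zip(samples, genotypes)))
```

Each printed line corresponds to one row of the matrix: a genomic position followed by one genotype per sample.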

Several tools and scripts exist for the treatment of variants, such as ANNOVAR, ANNtools, SCAN and SNPnexus. Although some of them offer high performance, they also show various limitations: issues in database portability and management, closed or proprietary input/output file formats, limited databases, no web or GUI interface, and complexity of use.

Among the software tools currently available for the annotation and visualization of variants, with links to third-party tools or databases, GEMINI can be considered one of the most interesting and popular. Created by the Quinlan Lab at the University of Utah, GEMINI (GEnome MINIng) is a Python-based framework that collects different tracks of reference databases. All of this information is stored in a portable database that allows the exploration and interpretation of both coding and non-coding variation, using either the tools made available by Gemini or an enhanced SQL engine [58]. Gemini was designed to address SNPs and INDELs, that is, single-nucleotide or small variants. Over the years, research has revealed that a class of large variants present in a variable number of copies, called copy-number variants (CNVs), also confers risk in a pathological context. With the enhancements described in this thesis, the Gemini platform is now able to deal with CNVs as well.
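The idea of exploring variants through SQL on a portable database can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table layout, the column names (`start_pos`, `end_pos`) and the rows are hypothetical and do not reproduce the actual Gemini schema.

```python
import sqlite3

# In-memory SQLite database with a toy, hypothetical "variants" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "chrom TEXT, start_pos INTEGER, end_pos INTEGER, type TEXT, gene TEXT)"
)
toy_rows = [
    ("chr1", 100, 101, "snp", "GENE_A"),
    ("chr1", 5000, 5003, "indel", "GENE_A"),
    ("chr2", 20000, 95000, "cnv", "GENE_B"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", toy_rows)


def variants_longer_than(connection, min_length):
    """Select variants spanning at least min_length bases."""
    query = (
        "SELECT chrom, start_pos, end_pos, gene FROM variants "
        "WHERE end_pos - start_pos >= ? ORDER BY chrom, start_pos"
    )
    return connection.execute(query, (min_length,)).fetchall()


large = variants_longer_than(conn, 50)
print(large)
```

Because the whole database lives in a single SQLite file (here, in memory), it can be copied and queried without a database server, which is what makes this kind of storage portable.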

This thesis project is part of the effort to improve the collection of tools for data mining of CNVs, so as to make it easier to draw genetic diagnoses and understand genetic diseases. All this activity is in line with the overall Gemini project. The idea is to add a loading command for VCF files containing CNVs, a tool for filtering benign variants, and a tool for annotating and visualizing the relationship between variants and the genes involved.
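The principle behind filtering benign variants by overlap can be sketched as follows. This is not the implementation developed in this thesis: the function names, the half-open `(chrom, start, end)` interval convention and the 0.5 threshold are assumptions chosen for illustration, and the alteration type (deletion or duplication) is ignored for brevity.

```python
def overlap_length(a, b):
    """Overlapping bases between two half-open intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0  # different chromosomes never overlap
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlap_fraction(variant, region):
    """Fraction of the variant covered by the region."""
    length = variant[2] - variant[1]
    return overlap_length(variant, region) / length if length > 0 else 0.0


def filter_benign(variants, benign_map, min_fraction=0.5):
    """Keep variants whose best overlap with any benign region is below the threshold."""
    kept = []
    for variant in variants:
        best = max((overlap_fraction(variant, region) for region in benign_map),
                   default=0.0)
        if best < min_fraction:
            kept.append(variant)
    return kept


# Invented coordinates for illustration.
variants = [("chr1", 1000, 2000), ("chr1", 5000, 6000), ("chr2", 100, 400)]
benign_map = [("chr1", 900, 1800), ("chr1", 5900, 7000)]
kept = filter_benign(variants, benign_map)
print(kept)
```

The first variant is covered for 80% of its length by a benign region and is discarded; the other two fall below the threshold and are kept for further analysis.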

The thesis is composed of two parts. The first part describes genomic structural variants (chapter 2), with CNV formation (subsection 2.2.1) and detection (subsection 2.2.2); next, it gives an overview of sequencing technologies, from the Sanger technique (section 3.1) to the latest nanopore technology (section 3.4); the last chapter of the first part is dedicated to the software tools and packages used in this thesis project. The second part represents the core of our work. Here we develop new tools and commands for handling CNVs and present their code: the loading process (section 5.1), the loading wizard tool (section 5.3), the CNV map overlap (section 5.4), and the overlap for gene annotation and heatmap building (section 5.5). Finally, conclusions are drawn, along with a description of possible improvements and future steps of development.

2

GENOMIC STRUCTURAL VARIANT

The central topic in the genetic study of human diseases is the identification of genetic DNA variations that confer risk for clinical phenotypes. Before the introduction of high-resolution molecular technologies, the main approach to genetic diseases was to observe variants in the number and structure of chromosomes under the microscope. Over the years, research has shown the presence of variants in the genetic code that range from single-nucleotide polymorphisms (SNPs) to long tandem repeats and copy-number variants (CNVs), long sequences present in a variable number of copies in the genome and encompassing different genes.

The difficulty in studying complex diseases lies in correlating the phenotype with variation in a set of genes, whereas Mendelian diseases are often modulated by a single gene. Indeed, research has revealed that all kinds of DNA variation play a role in susceptibility to genetic disease: not only single-nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs) are relevant, but also larger structural variants called copy-number variants (CNVs). This is because CNVs encompass large fragments of genes [24] and even several genes. This chapter describes what copy-number variants are, how they form and how they are detected.

2.1 The four canonical bases

The human genome is composed of 6 billion bases of DNA (deoxyribonucleic acid) packaged into two sets of 23 chromosomes, one set inherited from each parent. Each unit of DNA, called a nucleotide, is composed of a phosphate group, a sugar (deoxyribose), and a base (guanine, cytosine, thymine or adenine). DNA takes the form of a double-stranded helix, whose strands are linked by hydrogen bonds between guanine and cytosine and between thymine and adenine. Each such linkage is a base pair (bp): 3 billion bp constitute the human genome. The particular order of the bases arranged along the sugar-phosphate backbone encodes all the information necessary for building and maintaining life [10]. The double helix of DNA is packaged to form the chromosomes through various intermediate structures, as shown in Figure 1.

Regions at specific positions on chromosomes are called genes, and the transmission of these regions to an organism’s offspring is the basis of the inheritance of phenotypic traits.
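The base-pairing rule (guanine with cytosine, thymine with adenine) means that one strand fully determines the other. A minimal sketch, with function names chosen here for illustration:

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))
print(reverse_complement("GATTACA"))
```

The reversal reflects the fact that the two strands of the double helix run in opposite directions.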

Figure 1: The major structures in DNA compaction: DNA, the nucleosome, the 10 nm "beads-on-a-string" fibre, the 30 nm fibre and the metaphase chromosome.

DNA stores biological information, but only a small fraction of the sequence encodes proteins. Human DNA contains 20,000-25,000 protein-coding genes, which make up only 1.5% of the entire genome [16]. The remaining 98.5% of the genome is non-protein-coding, including the intergenic regions between them [11,12]. Differences in the DNA sequence of our genomes contribute to our uniqueness and allow humans to evolve and adapt [81]. For a long time, it was thought that only a simple alteration of the sequence, the single-nucleotide polymorphism (SNP), was correlated with genetic and phenotypic human variation. Subsequently, the advent of microarrays and sequencing technologies uncovered two other classes of variation. The first is an intermediate-size class known as indels (insertions and deletions), ranging from 1 to 50 bp in length. The second is a large-size class called structural variation [24], which includes tandem repeats, deletions, insertions, inversions, translocations, mobile-element transpositions and copy-number variants (CNVs) [75].

Figure 2: SNPs are single-nucleotide variations, INDELs are extra or missing DNA sequences, while SVs are large blocks of extra, missing or rearranged DNA sequences.

Structural variants (SVs) account for 1.2% of the variation among human genomes, while single-nucleotide polymorphisms (SNPs) represent only 0.1% [59]. Many structural variants are associated with genetic diseases, but many are not, and there is a continuous spectrum of phenotypic effects of SVs, from adaptive traits to embryonic lethality. The ability to examine the genome at high resolution has led to the discovery of widespread copy-number variation in the human genome, both polymorphic variation in healthy individuals and novel pathogenic copy-number imbalances. Copy-number variants (CNVs) can influence gene transcription and translation levels and have been associated with susceptibility to complex diseases. This is because a copy-number variant can encompass parts of a gene or several genes, modulating gene expression.

2.2 Copy-number variant - CNV

Copy-number variation is a phenomenon in which sections of the genome are repeated, and the number of repeats varies between individuals. A copy-number variant is an alteration, either a duplication or a deletion, that encompasses a segment of DNA with a length ranging from 50 bp to millions of bases [81].
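The size-based classes discussed so far can be summarised in a small sketch. The function and the labels are hypothetical, and the treatment of the boundary is an assumption: the text gives 1 to 50 bp for indels and 50 bp upwards for CNVs, so exactly 50 bp is assigned to the larger class here.

```python
CNV_MIN_LENGTH = 50  # assumption: variants of 50 bp or more count as structural


def classify_variant(kind, length):
    """Classify a variant by its kind and the number of bases it affects.

    kind is one of 'substitution', 'insertion', 'deletion', 'duplication'.
    """
    if length < 1:
        raise ValueError("length must be at least 1 bp")
    if kind == "substitution":
        return "SNP" if length == 1 else "other"
    if length < CNV_MIN_LENGTH:
        return "INDEL"
    if kind in ("deletion", "duplication"):
        return "CNV"
    return "other SV"


for kind, length in [("substitution", 1), ("deletion", 12),
                     ("deletion", 50), ("duplication", 2_000_000),
                     ("insertion", 300)]:
    print(kind, length, classify_variant(kind, length))
```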

A CNV can be simple in structure, such as a tandem duplication, or may involve complex gains or losses of homologous sequences at multiple sites in the genome [65].

Some CNVs are found in healthy individuals, while others contribute to various diseases such as cancer, cardiovascular disease, HIV acquisition and progression, autoimmune disease, and Alzheimer’s and Parkinson’s diseases [66,73]. This is because a CNV can encompass entire genes and influence gene dosage, which can cause genetic disease either alone or in combination with other genetic or environmental factors [24].

There are two principal classes of CNV: inherited, derived from the parents, and de novo, arising from a new recombination event (subsection 2.2.1). Furthermore, CNVs can be differentiated according to the type of alteration: deletions or duplications; bi-, tri-, or multi-allelic; simple or complex [65]. An overview is given in Figure 3. Through family analysis, CNV inheritance can often be established and de novo changes can usually be discerned. As shown in Figure 3.c, both the maternal (red) and paternal (blue) genomes have 3 copies of the CNV, but their offspring have a relative loss or gain.

2.2.1 CNV formation

During meiosis and mitosis, CNV formation is thought to be due mainly to two mechanisms: recombination between interspersed duplicated sequences by non-allelic homologous recombination (NAHR), and non-homologous end-joining (NHEJ).

Figure 3: In this figure, CNVs are represented as dots beside chromosomes. Changes in the number of dots reflect relative losses or gains. Panels (a), (b) and (c) show simple deletions and duplications, while (d) shows a multi-allelic variant and (e) a complex CNV.

2.2.1.1 Non-allelic homologous recombination - NAHR

CNVs are usually flanked by repeat elements, typically 10–300 kbp in length, called low copy repeats (LCRs), which share significant sequence identity (>95–97%) [13].

subsequent cross-over can result in a congenital disorder. The outcome of NAHR can vary depending on repeat orientation and location. When LCRs are located on the same chromosome and in direct orientation, NAHR results in deletion and/or duplication (Figure 4B and C). Inversions result when LCRs on the same chromosome are in opposite orientation (Figure 4D); NAHR between LCRs located on different chromosomes results in translocation.

Figure 4: Non-allelic homologous recombination (NAHR) cases. (A) Interchromosomal NAHR can generate deletions and duplications. (B) Interchromatid NAHR can also generate deletions and duplications. (C) Intrachromatid NAHR results in deletion and in the generation of ring chromosomes. (D) Intrachromatid NAHR between LCRs in opposing orientation can result in re-arrangement [13].

2.2.1.2 Non-homologous End-Joining - NHEJ

NHEJ is a mechanism of genetic recombination involved in the repair of double-strand breaks (DSBs) and required for the proper development of the vertebrate immune system, during the early stages of T and B cell maturation. NHEJ is composed of four principal steps: (1) detection of a double-strand break; (2) formation of a molecular bridge that holds the DNA ends together; (3) a processing step that modifies non-matching and/or damaged DNA ends into compatible, ligatable ends; (4) the final ligation [79].

2.2.2 CNV detection

Until 2003, the traditional approach to identifying CNVs relied on cytogenetic technologies, such as karyotyping and fluorescence in situ hybridization (FISH). Afterwards, array-based comparative genome hybridization (array-CGH) (section 3.2) and single-nucleotide polymorphism (SNP) array approaches became the main methods for the genome-wide detection of CNVs [82].

Since 2005, next-generation sequencing (NGS) has been a widespread strategy for genotyping and characterizing CNVs, generating hundreds of millions of short reads in a single run (section 3.3). The detection of genomic structural variants from NGS data (raw paired reads) is based on four different strategies (Figure 5):

• Read-depth (or read-count) approaches assume a random distribution of mapping depth and look for departures from this distribution to highlight duplications and deletions: sequencing of duplicated/amplified regions results in higher read depth, while deleted regions show reduced read depth compared to normal regions.

• Read-pair methods evaluate the span and orientation of paired-end reads: read pairs mapping too far apart are associated with deletions, while those found closer than expected are indicative of insertions; orientation inconsistencies can indicate inversions and a specific class of tandem duplications.

• Split-read methods align reads to a reference: a stretch of gaps in the read marks a deletion, while a stretch of gaps in the reference reflects an insertion.

• De novo assembly methods merge and order short fragments to reassemble the original sequence from which the short fragments were sampled [75].
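The read-depth strategy from the list above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any published tool: the window counts are invented, and the pseudocount and the two log2-ratio thresholds are assumptions.

```python
import math


def log2_ratios(sample_counts, control_counts, pseudocount=1.0):
    """Per-window log2 ratio of library-size-normalised read counts."""
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    ratios = []
    for sample, control in zip(sample_counts, control_counts):
        sample_norm = (sample + pseudocount) / sample_total
        control_norm = (control + pseudocount) / control_total
        ratios.append(math.log2(sample_norm / control_norm))
    return ratios


def call_windows(ratios, gain=0.4, loss=-0.6):
    """Label each window by comparing its log2 ratio with two thresholds."""
    calls = []
    for ratio in ratios:
        if ratio >= gain:
            calls.append("duplication")
        elif ratio <= loss:
            calls.append("deletion")
        else:
            calls.append("normal")
    return calls


# Invented read counts in eight consecutive genomic windows.
sample = [100, 98, 150, 152, 101, 48, 50, 99]
control = [100] * 8
calls = call_windows(log2_ratios(sample, control))
print(calls)
```

A heterozygous duplication is expected near log2(3/2) ≈ 0.58 and a heterozygous deletion near log2(1/2) = -1, which motivates placing the thresholds between those values and zero. Real callers replace the fixed thresholds with statistical segmentation.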

Each strategy has its own advantages and limitations. Depending on the case, numerous software tools use one or several of these approaches, as reported in Table 1.

Our team has developed a tool called EXCAVATOR for the detection of copy number variants (CNVs) from whole-exome sequencing data, based on the RC approach. EXCAVATOR combines a three-step normalization procedure with a novel heterogeneous hidden Markov model algorithm and a calling method that classifies genomic regions into five copy number states [49].

Figure 5: Signatures and patterns of SVs for deletion (A), novel sequence insertion (B), inversion (C), and tandem duplication (D) in read count (RC), read-pair (RP), split-read (SR), and de novo assembly (AS) methods [75].

2.3 DNA workflow analysis

Next-generation sequencing has become a powerful tool for the clinical management of patients, with applications in diagnosis, guidance of treatment, prediction of drug response, and carrier screening [38]. The main challenge for the clinical implementation of this technology is managing the large amount of data generated, in particular for the annotation and clinical interpretation of genomic variants. The clinical workflow typically involves the following steps:

Figure 6: A typical NGS clinical workflow

1. A pre-analytical phase of patient review that concentrates on determining the questions of the referring physician, and involves pretest counseling and sample collection.

2. An analytical phase of targeted, exome, or whole-genome sequencing with variant annotation and interpretation.

3. A post-test phase of reporting and clinical review to reconcile the clinical phenotype with laboratory data, together with counseling and developing a clinical management plan (Figure 6).

Next-generation sequencing has been shown to be clinically and economically useful. With an estimated success rate of over 50% in undiagnosed genetic disorders, its application after the first clinical visit could result in a higher rate of genetic diagnosis at considerable cost savings per successful diagnosis [71].

The high-throughput sequencing phase consists of a number of laboratory steps, including library preparation and sequencing, followed by three layers of data analysis. The first is base calling from the raw output of the sequencing machine. The second consists of the alignment of reads to a human reference sequence and the identification of sequence variants. The third is the annotation of sequence variants. The final steps are interpretation and clinical contextualization, based on reports from published in vivo and in vitro experiments [38].

Focusing our attention on the analytical phase, at a very high level, a typical study follows a basic pipeline.

For example, sequencing can be used in a cohort study: a group of people with the disease (cases) is compared with a group of people without the disease (controls). The comparison must reveal the risk variants that distinguish the two groups. The same process can be applied to family-based studies or cancer genomics (Figure 7). It is like a generic signal-to-noise problem: for every variation, formulate a hypothesis of risk, then mark up or annotate, interpret and prioritize using different web resources.
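The case/control comparison can be sketched as a simple frequency filter. This is a toy illustration with invented sample and variant names; the two frequency thresholds are assumptions, both cohorts are assumed to be non-empty, and a real study would rely on a proper statistical association test.

```python
def carrier_fraction(variant, cohort):
    """Fraction of samples in the cohort that carry the variant."""
    carriers = sum(1 for sample_variants in cohort.values()
                   if variant in sample_variants)
    return carriers / len(cohort)


def candidate_risk_variants(cases, controls, min_case=0.5, max_control=0.1):
    """Variants frequent among cases and rare among controls."""
    all_variants = set().union(*cases.values())
    candidates = []
    for variant in sorted(all_variants):
        in_cases = carrier_fraction(variant, cases)
        in_controls = carrier_fraction(variant, controls)
        if in_cases >= min_case and in_controls <= max_control:
            candidates.append((variant, in_cases, in_controls))
    return candidates


# Invented cohorts: each sample maps to the set of variants it carries.
cases = {"case1": {"varA", "varB"}, "case2": {"varA", "varC"}, "case3": {"varA"}}
controls = {"ctrl1": {"varB"}, "ctrl2": {"varB", "varC"}, "ctrl3": set()}
candidates = candidate_risk_variants(cases, controls)
print(candidates)
```

Only `varA` is carried by every case and by no control, so it is the only variant that survives the filter and would be passed on to annotation and interpretation.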

3

HIGH-THROUGHPUT TECHNOLOGIES FOR GENOMIC STUDIES

DNA sequencing is the process of determining the nucleotide order of a given DNA fragment.

The cost of this technology has fallen to around $1000, roughly equivalent to a magnetic resonance imaging scan, which has made it a realistic proposition for routine diagnostic use in molecular pathology. High-throughput technologies open up diverse opportunities, such as testing for familial predisposition to hereditary diseases, predicting drug response and toxicity, non-invasive prenatal testing from maternal plasma DNA, and detecting Mendelian and rare genetic disorders [38]. In essence, the cost and time of diagnosis have been significantly reduced, compared with traditional assessment methods, for patients presenting complex clinical phenotypes such as multiple anomalies or a heterogeneous disorder [71].

Figure 8: Cost per genome [17]

In 2000, the first project to map and sequence the human genome, the Human Genome Project (HGP), produced a first reference sequence of the human genome [40,77]. As shown in Figure 8, in a relatively short period, thanks to revolutionary advances in DNA sequencing technologies, the cost per genome has fallen sharply, faster than predicted by Moore’s law. Sequencing human genomes is nowadays aided by the possibility of comparing samples with a reference sequence of the human genome.

DNA sequencing consists essentially of three phases: fragmentation of the DNA sample and library preparation, the physical sequencing, and the reassembly.

A genome is very large, and its bases cannot be read out (sequenced) in order, end-to-end, in a single step. Therefore, to sequence a genome, its DNA must first be broken down into smaller pieces, each of which is then subjected to chemical reactions that allow the identity and order of its bases to be deduced.

The base order derived from each piece of DNA is called a sequence read, and the resulting set of sequence reads (often numbering in the billions) is then computationally assembled back together to deduce the sequence of the starting genome. When an entire genome is sequenced, the process is called whole-genome sequencing. An alternative is the targeted sequencing of part of a genome. Most often, this involves sequencing just the protein-coding regions of a genome, which reside within DNA segments called exons; this process is called whole-exome sequencing. Computationally, longer reads allow a more accurate re-assembly, with fewer gaps.
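The reassembly of reads into the starting sequence can be illustrated with a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, a single strand, no repeats, a minimum overlap of 3 bases chosen arbitrarily); the sequence and the function names are invented, and production assemblers use far more elaborate graph-based methods.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge reads greedily by largest overlap; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads entirely contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


genome = "ATGGCGTGCAATCCGTA"  # invented toy sequence
# Overlapping fragments, deliberately given out of order.
reads = [genome[8:16], genome[0:8], genome[11:17], genome[4:12]]
contigs = greedy_assemble(reads)
print(contigs)
```

When two fragments share no overlap, the loop stops and several contigs remain, which is the "gap" situation mentioned above; longer reads make such gaps less likely.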

Since 1953, when Watson and Crick discovered the structure of DNA, researchers have focused on developing methods to determine the order of the bases in the genome. After various approaches applied to RNA, Walter Fiers’ laboratory produced the first complete protein-coding gene sequence in 1972, based on the detection of radiolabelled partial-digestion fragments after two-dimensional fractionation (electrophoresis and chromatography) [25]. The first generation of DNA sequencing was developed by Maxam and Gilbert, whose chemical sequencing method uses radiolabelled DNA and chemicals that break the chain at specific bases; the DNA fragments are then run on a polyacrylamide gel to determine the sequence [51]. Nowadays, a very recent approach to sequencing is nanopore technology [33], which allows sequencing in a very short time and at a very low cost [70].

In this chapter we present an overview of the main sequencing methods and technologies, from Sanger sequencing to next-generation sequencing (NGS). We also present microarray technology, which is not a sequencing technology but a method to detect structural variation.

3.1 First generation sequencing

The major breakthrough in DNA sequencing technology came in 1977 with the development of Sanger’s chain-termination technique [67]. This technique, simpler than that of Maxam and Gilbert, uses chemical analogues of the nucleotides called dideoxyribonucleotides, abbreviated as ddNTPs (ddGTP, ddATP, ddTTP and ddCTP). A dideoxyribonucleotide lacks the 3’ hydroxyl group that is required for extension of DNA chains, and therefore cannot form a bond with the 5’ phosphate of the next nucleotide. Each ddNTP is radiolabelled or labelled with a fluorescent dye. Four parallel reactions are performed, each with a primer, DNA polymerase, the four standard dNTPs, the amplified DNA sample and one of the four ddNTPs. The incorporation of a ddNTP into the growing strand stops the polymerase, so each reaction yields fragments of different lengths. The length of each fragment indicates the position of the base at which elongation stopped. By running the radiolabelled fragments on an electrophoresis gel it is possible to rebuild the order of the sequence. With fluorescent ddNTPs, capillary electrophoresis is used instead, producing a plot of fluorescent peaks (a chromatogram), as reported in Figure 9.
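The way fragment lengths encode the base order can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the template string and the function names are hypothetical, strand orientation is ignored, and every possible termination product is assumed to be present and perfectly resolved on the gel.

```python
# Toy model of Sanger chain termination (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def terminated_lengths(template, dd_base):
    """Lengths of the fragments that end with the given ddNTP.

    The new strand is the base-by-base complement of the template; a
    fragment of length i + 1 exists wherever new_strand[i] == dd_base.
    """
    new_strand = "".join(COMPLEMENT[b] for b in template)
    return [i + 1 for i, b in enumerate(new_strand) if b == dd_base]

def read_gel(template):
    """Rebuild the synthesized strand by ordering fragments by length."""
    lanes = {base: terminated_lengths(template, base) for base in "ACGT"}
    by_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

template = "TACGGTCA"  # hypothetical template
print(read_gel(template))  # ATGCCAGT
```

Each of the four lanes contributes the lengths at which its ddNTP stopped elongation; sorting all fragments by length and reading the terminating base of each one gives the synthesized strand.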

The Sanger method was soon automated and prevailed from the 1980s until the mid-2000s, producing the first human genome in 2001 [40].


Figure 9: The Sanger chain-termination method for DNA sequencing.

3.2 Microarray

Microarray technology is not a true sequencing technology, but it allows genomic structural variation to be detected.

Also called a gene chip, a microarray is a set of microscopic DNA probes attached to a solid surface (plastic, glass or silicon). This technology is useful for different aims: gene expression profiling, biomarker determination, alternative splicing detection, and SNP and structural variant detection [9]. The main principle behind microarrays is the hybridization between two DNA strands: complementary nucleic acid sequences pair specifically with each other. The labelled samples are mixed with a proprietary hybridization solution for a period of time, after which the excess is washed off and the microarray is scanned under laser light. The process is schematized in Figure 10.

The major technique is array-based comparative genome hybridization (array-CGH), where reference and test DNA samples are differentially labelled with fluorescent tags (Cy5 and Cy3, respectively) and are then hybridized to genomic arrays. The array can be spotted with one of several DNA sources: BAC clones, PCR fragments or oligonucleotides. After hybridization, the fluorescence ratio (Cy3:Cy5) is determined to reveal copy-number differences between the two DNA samples. The process is shown in Figure 11a.
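How a fluorescence ratio translates into a copy-number call can be sketched as follows. This is a minimal illustration, not the procedure of any specific array platform: the function name, the intensity values and the 0.3 threshold on the log2 ratio are assumptions chosen for the example.

```python
import math

def classify_probe(cy3_test, cy5_reference, threshold=0.3):
    """Classify one probe from its test/reference (Cy3:Cy5) intensity ratio.

    A log2 ratio near 0 means equal copy number in both samples; the
    threshold used to call a gain or a loss is an assumed value.
    """
    log_ratio = math.log2(cy3_test / cy5_reference)
    if log_ratio >= threshold:
        return "gain", log_ratio
    if log_ratio <= -threshold:
        return "loss", log_ratio
    return "neutral", log_ratio

# Hypothetical intensities: 3 copies vs 2, 1 copy vs 2, 2 copies vs 2.
for test, reference in [(1500, 1000), (500, 1000), (1000, 1000)]:
    print(classify_probe(test, reference))
```

In a diploid reference, a single-copy gain ideally gives log2(3/2) ≈ 0.58 and a single-copy loss gives log2(1/2) = -1, which is why a threshold between 0 and these values separates the three classes in this idealized setting.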


Figure 10: The principle of hybridization.

Another array-based method is representational oligonucleotide microarray analysis (ROMA). It is a variant of array-CGH in which, to reduce the sample complexity before hybridization, the DNA to be hybridized on the array is digested by a restriction enzyme. Only fragments shorter than a threshold length are amplified; fragments longer than the threshold are lost, thereby reducing the complexity of the DNA that will be hybridized to the array. The process is shown in Figure 11b.

Figure 11: Array-based, genome-wide methods for the identification of copy-number variants. (A) Array-based comparative genome hybridization (array-CGH). (B) Representational oligonucleotide microarray analysis (ROMA) [24].


3.3 Next-generation sequencing

Next-generation sequencing (NGS) is a high-throughput sequencing (HTS) technology that allows millions of short DNA fragments (reads) to be sequenced simultaneously. NGS can process a whole human genome in three days at 500-fold less cost than previous methods [52, 78]. This technology has reshaped the economics and scale of human genome sequencing, redefining the possibilities for population healthcare studies.

All NGS platforms share two common working principles: massively parallel sequencing and cyclic-array sequencing. While Sanger sequencing is based on the electrophoretic separation of chain-termination products produced in individual sequencing reactions, NGS uses massively parallel sequencing of amplified or single DNA molecules that are spatially separated in a flow cell. This design is a paradigm shift from first-generation sequencing. The sequencing itself is performed by repeated cycles of polymerase-mediated nucleotide extension or, in one format, by iterative cycles of oligonucleotide ligation [78].

Within a few years there has been a rapid development of NGS platforms, including Illumina, the Applied Biosystems SOLiD System, 454 Life Sciences (Roche), Helicos HeliScope, Pacific Biosciences PacBio and Life Technologies Ion Torrent [4], each with its own approach, advantages and disadvantages, as reported in Table 2.

The workflows of all of them are very similar:

1. template preparation phase: consists of genomic DNA fragmentation and ligation to common adaptors.

2. nucleic acid sequencing phase: creates an array of millions of spatially immobilized PCR colonies.

3. imaging and data analysis phase: alternating cycles of enzyme-driven biochemistry and imaging-based data acquisition.

The NGS platforms allow the generation of many kinds of sequence data, such as whole-genome sequencing, de novo sequencing, candidate region targeted resequencing, DNA sequencing, RNA sequencing (for applications such as transcriptome and small RNA analysis), methylation analysis, and protein-nucleic acid interaction analysis.

Instrument | Principle | Purchase cost | Total bases per run | Run time | Observed raw error rate | Read length
Roche 454 | pyrosequencing | ∼$ 100 | 500 Mb | 10 h | 0.1% | 250 bases
ABI SOLiD | sequencing by ligation | ∼$ 500 | 100 Gb | 11-12 days | 4.00% | 50 bases
Helicos HeliScope | single molecule sequencing | - | 35 Gb | 30 days | 2-7% | 35 bases
Ion Torrent PGM | ion semiconductor sequencing | $ 49.5 | 1 Gb | 2 h | 1.71% | up to 100 bases
PacBio RS | single molecule real-time sequencing | ∼$ 700 | 100 Mb | 2 h | 12.86% | 860-1100 bases
Illumina HiSeq 2000 | sequencing-by-synthesis | $ 690 | 600 Gb | 11 days | 0.26% | up to 150 bases
Illumina MiSeq | sequencing-by-synthesis | $ 125 | 1.5-2 Gb | 27 h | 0.80% | up to 150 bases
Illumina NextSeq 500 | sequencing-by-synthesis | - | 100-200 Gb | 30 h | 0.80% | 2 x 150 bases
Illumina HiSeq X 10 | sequencing-by-synthesis | - | 800-900 Gb | < 3 days | 0.50% | 2 x 150 bases

Table 2: The most currently used platforms and comparison of their specifications [29].

3.3.1 Illumina technology

Nowadays, the Illumina platform is the most successful and widely adopted next-generation sequencing technology in research and clinical labs [30]. Illumina machines are based on the concept of sequencing by synthesis (SBS): all four dNTPs are fluorescently labelled and compete for addition to the growing strand, each emitting a characteristic light signal when incorporated. The end result is true base-by-base sequencing with a low raw error rate (below 1% for the Illumina instruments listed in Table 2). Introduced in 2006, the first HTS machine was the Illumina Genome Analyzer (the first Solexa sequencer), which produced sequence reads of 32–40 bp with a throughput of about 1300 Mb per run in 4 days, that is, roughly 1 gigabase (Gb) of data in a single run. Nowadays, the latest Illumina NextSeq Series sequencer is able to generate 400 million reads per run (2 x 150 bp read length) for a total of 120 Gb of data in 12–30 hours. The Illumina NGS workflow includes 4 basic steps [35], as reported in Figure 12:

• Library preparation - The sequencing library is prepared by random fragmentation of the DNA or cDNA into pieces of 500 bp or less, followed by 5’ and 3’ adapter ligation. Through reduced-cycle amplification, additional motifs are introduced, such as the sequencing primer binding site, indices and regions complementary to the flow cell oligos.

• Cluster generation - The library is loaded into a flow cell, where each fragment is isothermally amplified. The flow cell is a glass slide with 8 individual lanes, each coated with two types of oligos. The first type starts the hybridization process and is complementary to the adapter region on one of the fragment strands. Each fragment is amplified into a clonal cluster through bridge amplification. When cluster generation is complete, the templates are ready for sequencing.

• Sequencing - Sequencing begins with extension of the first sequencing primer to produce the first read. With each cycle, fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. After the addition of each nucleotide the clusters are excited by a light source and a characteristic emission from each cluster is recorded. This process is called sequencing-by-synthesis. The cycle is repeated "n" times to create a read length of "n" bases. Hundreds of millions of clusters are sequenced in a massively parallel process. The first read product is then washed away, and a second read is produced in the same way.

• Data analysis - The entire process generates millions of reads representing all the fragments. For each sample, reads with similar stretches of base calls are locally clustered, and forward and reverse reads are paired to create contiguous sequences. These contiguous sequences are aligned to and compared with a reference. After alignment, differences between the reference genome and the newly sequenced reads can be identified.
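The throughput figures quoted for the NextSeq can be checked with simple arithmetic: each cluster is read once per cycle, so the total yield is the number of clusters times the bases read per cluster. The sketch below is a worked example; the function names are hypothetical, and the 3.1 Gb human genome size used for the coverage estimate is an assumption that does not come from the text.

```python
def run_yield_gb(n_clusters, read_length_bp, paired_end=True):
    """Total bases per run, in gigabases (1 Gb = 1e9 bases)."""
    reads_per_cluster = 2 if paired_end else 1
    return n_clusters * reads_per_cluster * read_length_bp / 1e9

def mean_coverage(yield_gb, genome_size_gb):
    """Average number of times each base of the genome is read."""
    return yield_gb / genome_size_gb

# 400 million clusters, 2 x 150 bp paired-end reads.
nextseq_gb = run_yield_gb(400_000_000, 150)
# Assumed human genome size of about 3.1 Gb.
coverage = mean_coverage(nextseq_gb, 3.1)
print(nextseq_gb, round(coverage, 1))  # 120.0 38.7
```

The result, 120 Gb, matches the total quoted for a 2 x 150 bp run, and under the assumed genome size it corresponds to a mean depth of roughly 39x for a single human genome.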


Figure 12: Next-generation sequencing Illumina technology [35]

3.4 What’s next?

The main ideas behind third-generation sequencing technologies are single-molecule real-time sequencing (SMRT) and nanopore sequencing.

The SMRT approach is based on directly observing a single molecule of DNA polymerase as it synthesizes a strand of DNA, whereas nanopore sequencing is based on measuring the changes in an electric current caused by the transit of a DNA molecule through a nanoscopic pore (Figure 13). Although these methods seem different, both allow the sequencing of non-amplified DNA and produce reads much longer than those generated by previous methods, avoiding the biases and errors associated with amplification.


Figure 13: SMRT and nanopore technologies are both based on the idea of working directly on non-amplified DNA. The nanopore system (A) measures the direct effect of the DNA sample passing through a channel, whereas SMRT (B) measures the light emitted during the synthesis of a strand complementary to the DNA sample.

In the SMRT method a DNA polymerase is confined in a zero-mode waveguide nanostructure array and performs uninterrupted synthesis using four fluorescently labelled bases. Base incorporation is continuously measured through the waveguide [22]. This technology is commercialized by Pacific Biosciences.

Nanopore sequencing technology is commercialized by Oxford Nanopore with the MinION. A double-stranded DNA molecule is denatured by an enzyme which ratchets one of the strands through a biological nanopore embedded in a synthetic membrane, across which a voltage is applied. As the single strand of DNA passes through the nanopore, the different bases impede the ionic flow in distinctive ways, allowing the sequence of the molecule to be inferred by monitoring the current at each channel [31].

4 Data analysis tools

This chapter introduces the main tools used in our project. In the first section we present an overview of the most widely used NGS analysis software. Next, we describe the principal file format that we use to store variation data. As a starting point for annotation, the sample variations can be compared with a database of known variation; in our case, we propose to use a track of the Database of Genomic Variants (Section 4.3) as reference.

We then get to the core of the project, describing the principal software packages used to handle genomic structures: pybedtools (Section 4.4), a Python wrapper of the well-known BEDTools package, and Gemini (Section 4.5), the framework for exploring genome variation on which our project is based.

4.1 NGS software tools

The critical phase of an NGS workflow comprises variant annotation and gene and variant prioritization. Analysis pipelines used to consist of a number of custom scripts that integrate different lines of evidence to predict the likely significance of variants and to prioritize them for review. Nowadays, a number of commercial and open-source software tools have been developed to assist in this task, such as VariantStudio (Illumina), IonReporter (Life Technologies), Geneticist Assistant (Softgenetics), Expressionist (GeneData), and GEMINI (Quinlan Lab - University of Utah) [38].

After completing the laboratory work and the actual sequencing with the NGS sequencer, researchers and physicians are confronted with a huge amount of raw data. The analysis of the data can be decomposed into five distinct steps (Figure 14):

Figure 14: Basic workflow for whole-exome and whole-genome sequencing projects [57].

1. Quality assessment of the raw data - Raw data generated by sequencing platforms are compromised by sequence artifacts such as base-calling errors, poor-quality reads and adaptor contamination [18]. It is therefore necessary to remove, trim or correct reads that do not meet the defined standards. Several tools have been developed for this purpose: FastQC [23], Galaxy [6], NGSQC Toolkit [60] and PRINSEQ [68].

2. Read alignment to a reference genome - The reads are usually aligned to an existing reference genome. Currently, there are two main sources for the human reference genome assembly: the University of California, Santa Cruz (UCSC), which also hosts the central repository of ENCODE data [64], and the Genome Reference Consortium (GRC) [28]. Both resources provide several versions of the human genome: UCSC offers versions hg18 and hg19, while GRC offers GRCh36 and GRCh37. Improvements in sequencing technologies are leading to longer reads, which require new algorithms. The available long-read alignment algorithms may be classified as either using hash table indexing (BLAST or SSAHA2) or using some sort of compressed tree indexing based on the Burrows–Wheeler transform [8], an algorithm also used in data compression. Most alignment algorithms are based on the seed-and-extend paradigm, in which one or more so-called seeds are searched for and then extended to cover the whole query sequence [45]. Over the years, many alignment programs have been developed to process millions of short and long reads, including Bowtie/Bowtie2 [42], MAQ [46], Mosaik [53], Novoalign [55], SSAHA2 [54] and Stampy [47].

3. Variant identification - This is the crucial part of next-generation genome sequencing data analysis. Tools for genome-wide variant identification can be grouped into four categories: germline callers, somatic callers, CNV identification and SV identification. The detection of germline mutations is central to finding the causes of rare diseases. The tools for identifying large structural modifications can be divided into those which find CNVs and those which find other SVs such as inversions, translocations or large INDELs. A list of tools is reported in Table 1.

4. Annotation of the variants - An automated system that predicts the functional impact of variants enables research groups to filter and prioritize, for further analysis, the potential mutations that confer risk. Most of the available tools focus on the annotation of SNPs, since these can be easily identified and analysed. INDELs are also covered by some tools, whereas annotation of structural variants is limited to CNVs and is performed by only a few applications. The main approach to annotation is to provide links to public variant databases such as dbSNP. To predict the variant impact there are different approaches, ranging from simple sequence-based analysis through region-based analysis to the evaluation of the structural impact on proteins. The result of functional analysis is a classification into accepted and deleterious mutations, ranked by scores or risk classes.

5. Data visualization - Visual representation of data is useful for interpretation. Visualization tools can be divided into three types: finishing tools that support the interpretation of sequence data from de novo or re-sequencing experiments, genome browsers that allow users to browse mapped experimental data in combination with different types of annotation, and comparative viewers that facilitate the comparison of sequences from multiple organisms or individuals. Many genome browsers have been developed, both web-based and stand-alone, such as Ensembl [74], GenomeView [2], NGSView [3] and the UCSC Genome Browser [21]. In addition, there are visualization suites for CNVs and SVs, such as Circos [39] or Gremlin [56].
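The seed-and-extend paradigm with hash table indexing mentioned in the alignment step can be sketched in a few lines. This is a deliberately naive illustration, not the algorithm of any of the aligners listed: the function names, the reference string, the seed length and the mismatch limit are hypothetical, only the first k-mer of the read is used as a seed, and the extension allows no gaps.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash table mapping every k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k, max_mismatches=1):
    """Seed with the first k-mer of the read, then extend without gaps."""
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "ACGTTGCATGACCGTAGGCTA"  # hypothetical reference
index = build_index(reference, k=4)
print(align("CATGACGG", reference, index, k=4))  # [(6, 1)]
```

The index lookup restricts the expensive comparison to the few positions where the seed matches exactly; the extension step then tolerates the single mismatch that a sequencing error or a SNP would introduce.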


4.2 VCF as CNV file format

Variant Call Format (VCF) is a text file format which allows the storage of DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project [19]. An example is shown in Figure 15.

Figure 15: VCF example [19]

A VCF file contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The header section describes the tags and annotations used in the data section, arranged into an arbitrary number of meta-information lines, which start with the characters "##". The mandatory header lines are the fileformat line and a TAB-delimited field definition line which starts with a single "#" character and defines the column names of the data section. Annotations may apply to the variant as a whole (the INFO column) or to each genotype (the FORMAT column). The 8 mandatory fixed fields of each record are:

1. #CHROM - alphanumeric string, required - chromosome: an identifier from the reference genome.

2. POS - integer, required - position: the reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM.

3. ID - alphanumeric string - identifier: semicolon-separated list of unique identifiers where available.

4. REF - string, required - reference base(s): each base must be one of A, C, G, T, N. Bases should be in uppercase. Multiple bases are permitted.

5. ALT - alphanumeric string - alternate base(s): comma-separated list of alternate non-reference alleles called on at least one of the samples.

6. QUAL - numeric - quality: Phred-scaled quality score for the assertion made in ALT.

7. FILTER - alphanumeric string - filter status: PASS if this position has passed all filters, i.e. a call is made at this position.

8. INFO - additional information: INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format <key>=<data>[,data]. The exact format of each INFO sub-field should be specified in the meta-information.
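The eight fixed fields listed above map naturally onto a small parser. The following is a minimal sketch that assumes a well-formed, TAB-delimited data line; the function names are hypothetical, the example line is illustrative, and a production pipeline would normally rely on an established VCF library instead.

```python
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_info(info):
    """Turn 'K1=v1;K2=a,b;FLAG' into a dict; keys without a value map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value.split(",") if sep else True
    return result

def parse_record(line):
    """Split one VCF data line into its eight fixed fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_FIELDS, values[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # comma-separated alleles
    record["INFO"] = parse_info(record["INFO"])
    return record

# Illustrative data line (values chosen for the example).
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
record = parse_record(line)
print(record["CHROM"], record["POS"], record["ALT"], record["INFO"]["AF"])
```

Values in INFO are kept as lists of strings because the type of each key is only known from the meta-information lines, which this sketch does not read.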

In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. At present there is no standard format for storing CNV data: our VCF files are produced by the EXCAVATOR tool [49]. In these files some columns have a different meaning:

• no data in the ID column

• the ALT column represents the type of variation: deletion (DEL) or duplication (DUP)

• no data in the QUAL column

Our files storing CNV data also use different keys, especially in the FORMAT part. As reported in Figure 16, CN specifies the integer copy number of the variant in the sample, CNF reports the copy number fraction, and FCL and FCP report the label and posterior probability inferred by the FastCall algorithm.
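Reading the per-sample CNV keys amounts to pairing the colon-separated FORMAT column with each sample column. The sketch below assumes this standard VCF pairing; the example line, its values, the `<DEL>` notation in ALT and the function names are hypothetical and are not taken from an actual EXCAVATOR output.

```python
def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def cnv_call(line):
    """Extract the CNV-specific information from one data line."""
    cols = line.rstrip("\n").split("\t")
    sample = parse_sample(cols[8], cols[9])  # FORMAT, first sample
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "type": cols[4].strip("<>"),  # DEL or DUP
        "copy_number": int(sample["CN"]),
        "copy_number_fraction": float(sample["CNF"]),
        "fastcall_label": sample["FCL"],
        "fastcall_posterior": float(sample["FCP"]),
    }

# Hypothetical record: ID and QUAL are empty ("."), as described above.
example = "1\t1000000\t.\tN\t<DEL>\t.\tPASS\tEND=1050000\tCN:CNF:FCL:FCP\t1:1.12:-1:0.97"
call = cnv_call(example)
print(call["type"], call["copy_number"], call["fastcall_posterior"])
```

Because the FORMAT column names the keys explicitly, the same function keeps working if the keys appear in a different order.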


Figure 16: VCF storing CNV

4.3 Database of Genomic Variants - DGV

The Database of Genomic Variants [48] is a publicly accessible, comprehensive, curated catalogue of copy number variations (CNVs) and structural variations (SVs) found in the genomes of control individuals from worldwide populations.

The DGV project was launched following the publication of the inaugural CNV articles, which described the genome-wide prevalence of CNVs in the genomes of healthy, clinically unaffected individuals [34, 69]. As reported in Figure 17, over the years data production has moved massively from low-resolution microarrays to NGS technologies, which has significantly improved the accuracy of the curated SV catalogue.

SV data are made available in multiple formats: graphical (gbrowse), tabular (query tool) and text-based (downloads). The text-based data contain a copy of all the information held in the database, with variants mapped to multiple assemblies (NCBI36/hg18 and GRCh37/hg19 where applicable).


Copy number variation measures | All variants, inclusive map | All variants, stringent map | Gains, inclusive map | Gains, stringent map | Losses, inclusive map | Losses, stringent map
Total genome variable (%) | 9.5 | 4.8 | 3.9 | 2.3 | 7.5 | 3.6
Total genome variable (Mb) | 273 | 136.6 | 111.5 | 64.7 | 215 | 102.4
Mean interval length of CNVRs (bp) | 11,362 | 11,647 | 35,581 | 55,370 | 9,181 | 8,883
Number of CNVRs | 24,032 | 11,732 | 3,132 | 1,169 | 23,438 | 11,530

Table 4: Summary of CNVs in the genome based on the inclusive and stringent maps [81]

From DGV a human CNV map has been extracted, which catalogues benign CNVs among presumably healthy individuals of various ethnicities. This map includes microscopic and submicroscopic variants from 50 bp to 3 Mb. Its aim is to chart the variations of the human genome that are not associated with adverse phenotypes [81]. The CNV map data are split into an inclusive and a stringent set: the first includes variants found in one or more samples in at least 2 different studies; the second contains variants found in 2 or more unique samples in at least 2 unique studies. Each data set is available in three groups, as reported in Table 4: gains or duplications, losses or deletions, and gain+loss.

Each record of the DGV map is identified by its chromosome, start, end and id fields. In addition to other fields, it lists the variants that support the specific call, named with the following accession prefixes:

• nsv - NCBI structural variant region

• nssv - NCBI ssv variant call

• esv - EBI structural variant region

• essv - EBI ssv variant call

• dgv - DGV merged variant, generated if two or more variant regions share >70% reciprocal overlap within a study
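The reciprocal overlap criterion used for merged variants can be made concrete with a short function: two regions qualify only if the shared segment covers more than 70% of each of them. This is a sketch of the criterion alone, assuming half-open coordinates [start, end); the function names and the example coordinates are hypothetical, and it does not reproduce DGV's full merging procedure.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two overlap fractions for intervals [start, end)."""
    overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))

def should_merge(region_a, region_b, threshold=0.70):
    """True if the regions share more than `threshold` reciprocal overlap."""
    return reciprocal_overlap(*region_a, *region_b) > threshold

# 80 shared bases: 80% of the first region, about 89% of the second.
print(should_merge((100, 200), (120, 210)))  # True
# 50 shared bases: 50% of the first region but only 20% of the second.
print(should_merge((100, 200), (150, 400)))  # False
```

Taking the minimum of the two fractions is what makes the measure reciprocal: a small region nested inside a much larger one overlaps fully from its own point of view, but not from that of the larger region.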

Moreover, there is a variant type field (CNV or OTHER) and a variant subtype field (CNV, complex, deletion, duplication, gain, gain+loss, insertion, loss, OTHER). The terms gain and duplication are equivalent, as are loss and deletion. The frequency of a variation is defined by the authors of each study and can be a relative measure compared to the number of samples tested or, if genotype data are available, an allele frequency.

4.4 pybedtools

The BEDTools package is a collection of utilities for genomic interval manipulation and genome arithmetic, developed in the spring of 2009 by the Quinlan Lab of the University of Utah [5]. pybedtools is a Python package that wraps and extends BEDTools [63]. The primary challenge in genomics is increasingly data analysis and interpretation rather than data generation, so flexible genome arithmetic tools were developed to interrogate and compare diverse datasets of genome features. pybedtools can analyze datasets in BED, VCF, GFF, BEDGRAPH and SAM/BAM formats without the need for format conversion.

The core of pybedtools is the BedTool class. A BedTool object is typically created from a file name, and the BEDTools programs are then accessible as methods of the BedTool object, with arguments identical to those of the user’s installed version of BEDTools. It is also possible to pass collections of interval objects, which can be manipulated in Python. The wrapped methods of BedTool adhere to the pybedtools design principles:

• Temporary files are created, and deleted, automatically - Every operation results in a new temporary file, and at exit all temporary files created during the session are deleted.

• Names and arguments are as similar as possible to BEDTools - As far as possible, BEDTools programs and BedTool methods share the same names and arguments.

• Indifference to BEDTools version - Since BedTool methods just wrap BEDTools programs, they are as up to date as the version of BEDTools installed on disk.

• Sensible default args - The main input arguments (-i, -a, -b) are given sensible defaults in BedTool methods.


Utility | BedTool wrapper | Description
annotate | pybedtools.bedtool.BedTool.annotate(*args, ...) | Annotate coverage of features from multiple files.
bamtobed | pybedtools.bedtool.BedTool.bam_to_bed(*args, ...) | Convert BAM alignments to BED (and other) formats.
bamtofastq | pybedtools.bedtool.BedTool.bam_to_fastq(...) | Convert BAM records to FASTQ records.
bedtobam | pybedtools.bedtool.BedTool.to_bam(*args, ...) | Convert intervals to BAM records.
closest | pybedtools.bedtool.BedTool.closest(*args, ...) | Find the closest, potentially non-overlapping interval.
cluster | pybedtools.bedtool.BedTool.cluster(*args, ...) | Cluster (but don’t merge) overlapping/nearby intervals.
coverage | pybedtools.bedtool.BedTool.coverage(*args, ...) | Compute the coverage over defined intervals.
jaccard | pybedtools.bedtool.BedTool.jaccard(*args, ...) | Calculate the Jaccard statistic b/w two sets of intervals.
genomecov | pybedtools.bedtool.BedTool.genome_coverage(...) | Compute the coverage over an entire genome.
intersect | pybedtools.bedtool.BedTool.intersect(*args, ...) | Find overlapping intervals in various ways.
map | pybedtools.bedtool.BedTool.map(*args, **kwargs) | Apply a function to a column for each overlapping interval.
merge | pybedtools.bedtool.BedTool.merge(*args, **kwargs) | Combine overlapping/nearby intervals into a single interval.
overlap | pybedtools.bedtool.BedTool.overlap(*args, ...) | Computes the amount of overlap from two intervals.
random | pybedtools.bedtool.BedTool.random(*args, ...) | Generate random intervals in a genome.
reldist | pybedtools.bedtool.BedTool.reldist(*args, ...) | Calculate the distribution of relative distances b/w two files.
shuffle | pybedtools.bedtool.BedTool.shuffle(*args, ...) | Randomly redistribute intervals in a genome.
sort | pybedtools.bedtool.BedTool.sort(*args, **kwargs) | Order the intervals in a file.
subtract | pybedtools.bedtool.BedTool.subtract(*args, ...) | Remove intervals based on overlaps b/w two files.
tag | pybedtools.bedtool.BedTool.tag_bam(*args, ...) | Tag BAM alignments based on overlaps with interval files.
unionbedg | pybedtools.bedtool.BedTool.union_bedgraphs(...) | Combines coverage intervals from multiple BEDGRAPH files.
window | pybedtools.bedtool.BedTool.window(*args, ...) | Find overlapping intervals within a window around an interval.

Table 5: Summary of available wrapped BEDTools programs [5]

• Other arguments have no defaults - Apart from these default arguments, no other argument has a default specified by pybedtools; they are passed through unchanged to the BEDTools program.

• Chaining together commands - Most methods return new BedTool objects, allowing calls to be chained just like piping commands together on the command line.

A collection of the most important BEDTools programs wrapped by pybedtools is reported in Table 5.
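To make the semantics of chained interval operations concrete without requiring a BEDTools installation, the sketch below re-implements two of the listed operations in plain Python. It is an illustration of what intersect (reporting each overlapping interval once) and merge compute, not pybedtools code; the function names, the tuple representation and the coordinates are hypothetical.

```python
def intersect_u(a, b):
    """Intervals of `a` that overlap at least one interval of `b`."""
    return [x for x in a
            if any(x[0] == y[0] and x[1] < y[2] and y[1] < x[2] for y in b)]

def merge(intervals):
    """Combine overlapping or adjacent intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical CNV calls and one gene, as (chrom, start, end) tuples.
cnvs = [("chr1", 100, 500), ("chr1", 400, 900), ("chr2", 50, 80)]
genes = [("chr1", 450, 600)]

# Chained like a command-line pipe: first filter, then merge.
result = merge(intersect_u(cnvs, genes))
print(result)  # [('chr1', 100, 900)]
```

With pybedtools itself, the same chain would presumably be written along the lines of `BedTool("cnvs.bed").intersect("genes.bed", u=True).sort().merge()`, where the file names are placeholders and each method returns a new BedTool object.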


4.5 Gemini

GEMINI (GEnome MINIng) is a free and open-source software package, based on Python, that integrates genetic variation in the VCF format with both automatically installed and researcher-defined genome annotations into a unified database framework. Gemini aims to address the challenge of managing and interpreting genome-scale variation in the context of a disease phenotype. Solving this problem raises two main issues: genome annotation datasets are often quite large, and they are described in myriad file formats [58].

The basic idea of Gemini’s developers is to load a VCF file, with an optional PED file that describes the relationships among the samples in the study, and populate an SQLite database. SQLite was chosen because of its speed and portability: a given GEMINI database can easily be shared as a single file among laboratory members and collaborators without a dedicated database server or additional configuration. A relational database was also preferred to alternative “NoSQL” approaches (e.g., Redis, MongoDB) because of the expressive power that SQL provides for constructing data exploration queries, its intuitive syntax, and its familiarity to many researchers [58].

$ gemini load -v file.vcf file.db

Loading is by far the slowest aspect of GEMINI. Using multiple CPUs can greatly speed up this process.

$ gemini load -v file.vcf --cores 8 file.db

Gemini allows you to directly query the database in search of interesting variants via the -q option.

$ gemini query -q " select . . . from . . . where . . . " file.db
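The kind of query that the SQL interface makes possible can be tried on a toy database with Python's built-in sqlite3 module. The sketch below is not the Gemini schema: the table layout, the column names (start_pos, end_pos, sv_type, gene) and the rows are hypothetical and serve only to show how a filter on variant type and length is expressed in SQL.

```python
import sqlite3

# An in-memory database; a file path would give a single shareable file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, chrom TEXT, "
    "start_pos INTEGER, end_pos INTEGER, sv_type TEXT, gene TEXT)"
)
rows = [
    (1, "chr1", 1000, 51000, "DEL", "GENE_A"),
    (2, "chr1", 70000, 90000, "DUP", "GENE_B"),
    (3, "chr2", 500, 2500, "DEL", "GENE_C"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)

# Deletions longer than 10 kb.
query = (
    "SELECT chrom, start_pos, end_pos, gene FROM variants "
    "WHERE sv_type = 'DEL' AND end_pos - start_pos > 10000"
)
deletions = conn.execute(query).fetchall()
print(deletions)  # [('chr1', 1000, 51000, 'GENE_A')]
```

The same SELECT statement, adapted to the real column names, is what would be passed to the -q option on the command line.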

As reported in Figure 18, each variant of the VCF file is extensively annotated through automatic comparisons with a set of genomic annotation files including dbSNP [72], ENCODE [15], ClinVar [41], 1000 Genomes [14], the Exome Sequencing Project [76], KEGG [37], GERP scores [20], and HPRD [62]. This process relies on Tabix [44]. Further annotations include CpG islands, regions under evolutionary constraint, RepeatMasker annotations [1], segmental duplications, mappability scores [43], and regional recombination rate. Annotated variants are loaded as rows in the variants database table.

Figure 18: Gemini overview of workflow [58].

As reported in Figure 19, Gemini essentially creates a database with the variants and variant_impacts tables, to store the annotated variants; the samples table, to store the pedigree and sample information; the resources table, to track which versions of the built-in annotations were used to create the database; and the version table, to store which version of Gemini was used to create the database. Gemini also stores the header of the VCF file in the vcf_header table. There are also two important tables built on version 75 of the Ensembl genes: gene_summary and gene_detailed. The two tables differ in level of detail: gene_detailed holds the detailed information that gene_summary summarizes. The chrom, gene and transcript columns of the gene tables may be used to join on the variants and variant_impacts tables.

Figure 19: Gemini database scheme [58].

Studies of human disease require the ability to compare the genotypes of individual samples (e.g., cases vs controls) for each observed variant. To accelerate this process, Gemini represents the genotype information (genotype, phase, depth, etc.) of all samples as compressed arrays, each stored as a single column of the variant row. This strategy provides both query performance and scalability while still providing the necessary access to the genotype information of individual samples.

$ gemini query -q "select gt_copy_number.NA07000 from variants" --gt-filter "gt_copy_number >= 2" file.db
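The storage strategy described above, one compressed per-sample array per variant row, can be illustrated with the standard library alone. This is a sketch under stated assumptions: the encoding (a C integer array compressed with zlib) is chosen for illustration and may differ from the one Gemini actually uses, the sample names other than NA07000 are hypothetical, and the values are made up.

```python
import array
import sqlite3
import zlib

SAMPLES = ["NA07000", "NA07001", "NA07002"]  # last two names are hypothetical


def pack(values):
    """Encode one integer per sample and compress the resulting bytes."""
    return zlib.compress(array.array("i", values).tobytes())


def unpack(blob):
    """Inverse of pack(): decompress and decode back to a list of ints."""
    values = array.array("i")
    values.frombytes(zlib.decompress(blob))
    return values.tolist()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (variant_id INTEGER, gt_copy_number BLOB)")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?)",
    [(1, pack([2, 1, 3])), (2, pack([1, 2, 2])), (3, pack([4, 0, 2]))],
)

# Per-sample filter: decode the array of each row and test one position.
idx = SAMPLES.index("NA07000")
selected = [
    variant_id
    for variant_id, blob in conn.execute(
        "SELECT variant_id, gt_copy_number FROM variants ORDER BY variant_id"
    )
    if unpack(blob)[idx] >= 2
]
print(selected)
```

Because the array is opaque to SQL, a per-sample condition has to be evaluated in Python after the rows are fetched, which is the role played by a genotype filter applied on top of the SQL query.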

In addition to the basic SQL query tool, Gemini provides several other tools, including:

• region: extract variants from specific genomic intervals or genes

• stats: compute variant statistics (SFS, Ts/Tv, counts, etc.)

• annotate: add new columns based on custom annotations

• windower: compute variant statistics across genome "windows"

• pathways: map genes and variants to KEGG pathways

• lof_sieve: prioritize candidate loss-of-function variants

• interact: find protein interactions for genes/variants/samples

• auto_rec: identify variants meeting an autosomal recessive model

• auto_dom: identify variants meeting an autosomal dominant model

• de_novo: identify candidate de novo mutations

• browser: launch the interactive Gemini web browser interface

Parallelization

Because of the considerable amount of data to process, the loading step can be parallelized on a single machine with multiple CPUs. In addition, through the use of the IPython.parallel library, loading can be parallelized on computing clusters supporting the LSF, Sun Grid Engine, or Torque load management systems [58].
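A common way to parallelize this kind of loading on a single machine is to split the input into contiguous chunks, process each chunk in a separate worker, and concatenate the results in order. The sketch below shows that generic pattern with multiprocessing.Pool. It is not Gemini's implementation: the function names are hypothetical and annotate_chunk is a placeholder for the real per-variant work.

```python
from multiprocessing import Pool


def chunk_bounds(n_records, n_chunks):
    """Split range(n_records) into n_chunks contiguous (start, stop) slices."""
    base, extra = divmod(n_records, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional record each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def annotate_chunk(records):
    """Placeholder for per-variant work; here it only upper-cases strings."""
    return [record.upper() for record in records]


def load_parallel(records, cores):
    """Process the chunks in `cores` worker processes, preserving input order."""
    chunks = [records[a:b] for a, b in chunk_bounds(len(records), cores)]
    pool = Pool(cores)
    try:
        parts = pool.map(annotate_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    return [item for part in parts for item in part]


if __name__ == "__main__":
    print(chunk_bounds(10, 4))
    print(load_parallel(["a", "c", "g", "t", "n"], 2))
```

Contiguous chunks keep the merged output in the original record order, since Pool.map returns results in the order of its inputs.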

Storage requirements

Since sample genotype information is stored as compressed binary arrays in the variants table, and many annotations are stored more efficiently in an SQLite database than in the text-based VCF format, the resulting GEMINI database, complete with annotations, requires just over half the space of the corresponding VCF file [58].

Gemini API

Importantly, Gemini allows researchers to extend the database with genome annotations that are relevant to their own research and, above all, to create and integrate new analysis tools that leverage the GEMINI framework via Python scripts. To support this, Gemini provides a useful API for creating custom queries or annotations, as reported in Listing 1.

Listing 1: Using the Gemini API to create custom scripts in Python [26]

#!/usr/bin/env python
import sys

from gemini import GeminiQuery

# Path of the Gemini database, given as the first command-line argument.
database = sys.argv[1]
gq = GeminiQuery(database)

query = "SELECT variant_id, chrom, start, end, ref, alt, info FROM variants"
gq.run(query)

for row in gq:
    try:
        print("\t".join([str(row['chrom']), str(row['start']), str(row['end']),
                         str(row['ref']), str(row['alt']),
                         str(row.info['dbNSFP_SIFT_pred'])]))
    except KeyError:
        # Skip variants whose INFO field has no dbNSFP_SIFT_pred tag.
        pass

# yields
# chr1 906272 906273 C T P|D|P
# chr1 906273 906274 C A D|D|D
# chr1 906276 906277 T C D|D|D
# chr1 906297 906298 G T B|B|B
# chr1 1959074 1959075 A C D
# chr1 1959698 1959699 G A B
# chr1 1961452 1961453 C T P
# chr1 2337953 2337954 C T D


Gemini browser

Currently, the majority of GEMINI's functionality is available via a command-line interface. However, a browser interface was developed for easier exploration of GEMINI databases created with the gemini load command, and is based on the Bottlepy framework A. The following command launches the GEMINI browser, by default on localhost, port 8088:

gemini browser [--use use] [--host host] [--port port] db

The web app is defined by a single script file called gemini_browser.py, which launches the server, defines all routes and manages all templates. The command accepts the following options:

• --use use: which browser to use, builtin or puzzle

• --host host: hostname, default: localhost

• --port port: port, default: 8088
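The option handling described above can be sketched with Python's argparse module. This is an illustrative reconstruction rather than the actual Gemini source; in particular, using builtin as the default value of --use is an assumption, since only the defaults for host and port are stated here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented browser options."""
    parser = argparse.ArgumentParser(prog="gemini browser")
    # Assumption: "builtin" as default; only the two allowed values are documented.
    parser.add_argument("--use", default="builtin", choices=["builtin", "puzzle"],
                        help="which browser to use")
    parser.add_argument("--host", default="localhost", help="hostname")
    parser.add_argument("--port", type=int, default=8088, help="port")
    parser.add_argument("db", help="path to the Gemini database")
    return parser


# Only the port is overridden; the other options fall back to their defaults.
args = build_parser().parse_args(["--port", "9000", "file.db"])
print(args.use, args.host, args.port, args.db)
```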


Gemini installation

Gemini contains an automated installation script which installs Gemini along with required Python dependencies, third party software and data files.

After downloading the installation script, it can be run as follows.

$ wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py

$ python gemini_install.py /usr/local /usr/local/share/gemini

$ export PATH=$PATH:/usr/local/gemini/bin

The installer requires Python 2.7.x, git, wget, a working C/C++ compiler such as gcc, and zlib. It installs the Gemini executable as /usr/local/bin/gemini, other required third-party dependencies in /usr/local/bin, and the associated data files in /usr/local/share/gemini; the code directory is located in /usr/local/share/gemini/anaconda/lib/python2.7/site-packages/gemini.


5

DEVELOPMENT OF GEMINI TOOLS FOR CNV TREATMENT

Gemini is based on the idea of converting VCF variant data into an SQLite database enriched with other annotation resources. The aim of this thesis is to develop tools that expand the features of Gemini, in particular tools for the treatment of CNV data. Our project is based on Gemini version 0.19.1 and, to keep both versions of the software available side by side, I called our software gemini_cnv. So, first I created a command file, /usr/local/bin/gemini_cnv, that calls our release of Gemini. Its content is reported below.

#!/usr/local/share/gemini/anaconda/bin/python
from __future__ import absolute_import
from gemini_cnv import scripts
from gemini_cnv import gemini_main
from gemini_cnv.gemini_constants import *
gemini_main.main()

In this way it is possible to load the gemini_cnv module, whose code directory is located in /usr/local/share/gemini/anaconda/lib/python2.7/site-packages/, like the Gemini installation directory. The first change is to record our version of gemini_cnv in the version.py file and to import it into gemini_main.py, in order to have:

$ gemini_cnv --version

gemini_cnv 0.19.1.1.0

Here the last two digits are our internal release number, and the leading part is the official Gemini release number.
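The versioning convention can be made explicit with a small helper that separates the two parts of the version string. This is only an illustrative sketch: split_version is a hypothetical name and is not part of gemini_cnv.

```python
def split_version(version, official_parts=3):
    """Split a dotted version into (official release, internal release)."""
    fields = version.split(".")
    official = ".".join(fields[:official_parts])
    internal = ".".join(fields[official_parts:])
    return official, internal


official, internal = split_version("0.19.1.1.0")
print(official)  # official Gemini release
print(internal)  # internal gemini_cnv release
```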

Multisample vs. singlesample

Another prerequisite is to understand what kind of files we are going to work with. EXCAVATOR, the software for detecting copy number variants from whole-exome sequencing data, returns single-sample VCF files. Gemini, instead, works with VCF file