• Non ci sono risultati.

MOLECULAR ALTERATIONS IN HUMAN GENETIC DISEASES THROUGH NEXT GENERATION SEQUENCING TECHNOLOGIES

N/A
N/A
Protected

Academic year: 2023

Condividi "MOLECULAR ALTERATIONS IN HUMAN GENETIC DISEASES THROUGH NEXT GENERATION SEQUENCING TECHNOLOGIES"

Copied!
109
0
0

Testo completo

(1)

EUROPEAN SCHOOL OF MOLECULAR MEDICINE (SEMM)

UNIVERSITA’ DEGLI STUDI DI NAPOLI FEDERICO II

PhD in Molecula Medicine – XXVI cycle Human Genetics

MOLECULAR ALTERATIONS IN HUMAN GENETIC DISEASES THROUGH NEXT

GENERATION SEQUENCING TECHNOLOGIES

Dr. Valeria D’Argenio

Academic years: 2011-2014

(2)

EUROPEAN SCHOOL OF MOLECULAR MEDICINE (SEMM)

UNIVERSITA’ DEGLI STUDI DI NAPOLI FEDERICO II

PhD in Molecula Medicine – XXVI cycle Human Genetics

MOLECULAR ALTERATIONS IN HUMAN GENETIC DISEASES THROUGH NEXT GENERATION

SEQUENCING TECHNOLOGIES

Tutor

Prof. Francesco Salvatore PhD Student

Internal Supervisor Dr. Valeria D’Argenio

Prof . Giuseppe Castaldo Esternal Supervisor Dr. Penelope Bonnen

Academic years: 2011-2014

(3)

I

TABLE OF CONTENTS

LIST OF ABBREVIATIONS P.1

FIGURES INDEX P.2

TABLES INDEX P.3

ABSTRACT P.4

I. INTRODUCTION P.6

1.1 Next Generation Sequencing Technologies P.7 1.2 Next Generation Sequencing Technologies applications

P.15

1.3 Targeted DNA sequence capture P.17 1.3.1 Inherited cardiomiopathies P.21

1.4 Metagenomics P.23

1.4.1 Gut microbiome and inflammatory bowel diseases

P.27

1.4.2 Gut microbiome and celiac disease P.29

II. AIMS P.31

III. MATERIALS AND METHODS P.32

3.1 Patients selection and biological samples collection P.32 3.1.1 Targeted DNA Sequence Capture P.32

3.1.2 Metagenomics P.34

3.2 NGS Library Preparation P.35

3.2.1 Targeted DNA Sequence Capture P.35

3.2.2 Metagenomics P.37

3.3 NGS Library Amplification and sequencing P.38

3.4 Bioinformatics P.39

3.4.1 Targeted DNA Sequence Capture P.39

3.4.2 Metagenomics P.43

IV. RESULTS P.45

(4)

II

4.1 Targeted DNA Sequence Capture P.45

4.2 Metagenomics P.62

4.2.1 Crohn disease P.62

4.2.2 Celiac disease P.64

V. DISCUSSION P.70

5.1 Targeted DNA Sequence Capture P.70

5.2 Metagenomics P.73

5.2.1 Crohn disease P.73

5.2.2 Celiac disease P.74

V. CONCLUSIONS P.75

VI. REFERENCES P.76

APPENDIX 1 P.91

APPENDIX 2 P.99

(5)

1

LIST OF ABBREVIATIONS

NGS Next Generation Sequencing sstDNA single stranded DNA

emPCR emulsion PCR

PPi pyrophosphate

ATP adenosine triphosphate CCD Charge-Coupled Device PTP picotiterplate

WES Whole exome sequencing HCM hypertrophic cardiomyopathy DCM dilated cardiomyopathy

ARVC arrhythmogenic right ventricular cardiomyopathy LVNC left ventricular noncompaction

RCM restrictive cardiomyopathy LQTS long QT syndrome

SQTS short QT syndrome

CPVT catecholaminergic polymorphic ventricular tachycardia SCD sudden cardiac death

IBD Inflammatory bowel disease

CD Crohn disease

UC ulcerative colitis CD Celiac disease

MWT Maximal ventricular wall thickness ECG electrocardiographic

LVH left ventricular hypertrophy

PCDAI Pediatric Crohn’s Disease Activity Index

BT before therapy

AT after therapy

GFD gluten free diet

HCDiffs high confidence nucleotide differences AllDiffs all nucleotide differences

P probability

TP true positive

TN true negative

FP false positive FN false negative

OTUs operational taxonomic units

(6)

2

FIGURES INDEX

Figure 1. Impressive reduction of DNA sequencing costs. P.6 Figure 2. Overview of the sample preparation workflow through NGS

platforms. P.7

Figure 3. Library amplification’s strategies. P.9 Figure 4. NGS sequencing chemistries. P.11 Figure 5. IonTorrent sequencing chemistry. P.13 Figure 6. Schematic view of the DNA sequence capture procedure. P.18 Figure 7. Circularization-based procedure for selective DNA enrichment.

P.19 Figure 8. Next Generation sequencing-based approaches for metagenomics.

P.25 Figure 9. Coverage of targeted bases and regions. P.47 Figure 10. Variants prioritization pipeline. P.50 Figure 11. Detection pattern of the KCNQ1 sequence duplication. P.55 Figure 12. Pedigree of patient 2’s family. P.56 Figure 13. Detection of a missense mutation in the CACNA1C gene. P.58 Figure 14. Composition of the ileum microbiome characterized in the control subject and in the Crohn patient BT and AT by NGS. P.62 Figure 15. The mean Shannon Diversity Index Score. P.63 Figure 16. Duodenal microbiome taxonomic composition (from phylum to genus level) in controls, active and GFD CD patients. P.66 Figure 17. Bacterial diversity analysis. P.68

(7)

3

TABLES INDEX

Table 1. Comparison of the currently available NGS platforms features.

P.12 Table 2. Comparison of the Third generation sequencers features. P.14 Table 3. NGS-based enrichment strategies for DNA sequence variants

identification. P.17

Table 4. Human microbiota composition across the five most extensively

studied body sites. P.23

Table 5. List of the primers’ sequences used to amplify, by NGS methodology, the bacterial 16S V4-V6 region. P.36 Table 6. Summary of the features of the genomic regions analyzed by both

NGS and DHPLC/Sanger methods. P.40

Table 7. Overview of the entire sequencing and annotation procedure.

P.46

Table 8. Genotype assignment. P.48

Table 9. Evaluation of common allele frequency. P.49 Table 10. Number of variants in the final set annotated according to their

predicted features. P.51

Table 11. Performance Indexes of DHPLC/Sanger and NGS methods in nucleotide sequence-variants detection. P.52 Table 12. Mutations most likely to exert a pathogenic role in the patients

analyzed. P.59

Table 13. ‘HC’ and ‘Final’ variants identified for each subject in the pool.

P.60 Table 14. Analysis of non-reference (variant) alleles found in single subject sequencing experiments and also identified in the pool. P.61 Table 15. 16S bacterial RNA samples metadata, globally considered in the

study population. P.64

(8)

4 ABSTRACT

Next Generation Sequencing (NGS) technologies have greatly impacted every field of molecular research, reducing costs and simultaneously increasing throughput of DNA sequencing. These features, together with technology’s flexibility, have open the way to a variety of applications, especially for the study of the molecular basis of human diseases.

So far, several analytical approaches have been developed to selectively enrich regions of interest from the whole genome, both to identify germinal and/or somatic sequence variants. All of these have assessed their potential in research area and are now being improved also in routine molecular diagnostics. Thanks to the improvement due to NGS methods introduction, also the metagenomic field has achieved very exciting results, increasing our knowledge about the microbiome and its mutually beneficial relationships with the human host. If microbiome plays a role in the maintenance of a healthy status, it is conceivable to suppose that its quantitative and/or qualitative alterations could lead to pathological dysbiosis, as shown in an increasing number of intestinal and extra- intestinal diseases.

The aim of this project was to use NGS-based strategies to study the molecular basis of human diseases. In particular, two analytic approaches were used: DNA sequence capture and metagenomics.

A DNA sequence capture approach was used to analyze a large panel of genes possibly related to inherited cardiomyopathies. The obtained results indicate that this approach is useful to analyze, in a time and cost effective manner, heterogeneous diseases allowing the identification not only of the disease-causing mutation, but also of other variants involved in

(9)

5

disease-phenotypic expression. Finally, methods reliability was higher than traditional, currently used techniques. All the above data indicate that this validated NGS-based approach can be used to improve the molecular analysis of inherited cardiomyopathies, such as of other inherited diseases, also in routine diagnostic settings.

With regard to metagenomics, a 16S rRNA pyrotag analysis was carried out to deeply investigate the gut microbiome composition of Crohn and celiac diseases. Specific microbial signatures were identified in the patients. Moreover, the effects of Crohn nutritional therapy on gut microbial composition were also verified. These results suggest a role of gut microbiome in diseases pathogenesis and could, in turn, make possible to develop novel diagnostic, prognostic and, most important, therapeutic strategies.

Taken together, all the above results indicate that the used NGS- based procedures can be easily applied to increase our understanding of the molecular basis of human diseases and that they can be useful also for routine diagnostic purposes.

Keywords

Next generation sequencing, molecular diagnostics, metagenomics, inherited cardiomyopathies, Crohn disease, celiac disease.

(10)

6

I. INTRODUCTION

The process of determining the exact order of the nucleotides in a DNA molecule or in a genome is defined as “DNA sequencing”. This simple definition is enough to understand how, techniques able to perform this operation have radically changed the course of molecular research in all its fields of application.

In the last 30 years, the so-called Sanger sequencing has been the most widely used sequencing technology worldwide [1]. This method, developed in the ‘70s by Frederick Sanger, thanks to continuous improvements in technology performances, reached its peak with the Human Genome Project (HGP), which, in 2001, elucidated the first entire human genome [2,3]. Now, the Sanger sequencing procedure is completely automated; however, it is a single amplicon method and, as such, it is expensive and time consuming. Starting from these points, there was the need to develop novel sequencing methods able to overcome Sanger limits and answer to the increasing requests of great amounts of high quality sequencing data, both in a faster and cheaper fashion.

To address the above mentioned issues, in the last ten years, novel technologies, called “next generation sequencing” (NGS), have become available. NGS methods have dramatically increased the throughput of DNA sequencing, simultaneously reducing its costs [4]. Just to give the idea of what this means, it took more than 10 years to elucidate the first human genome sequence and it cost 3 billion $. Using NGS instruments, the entire genome sequence of an individual has been elucidated in only 1 year and at a much lower cost [5]. It is expected that the sequencing of the entire

(11)

7

genome of an individual will cost about 1,000 $ in a near future (Figure 1) [6].

Figure 1. Impressive reduction of DNA sequencing costs. The graph shows that the introduction of NGS dramatically dropped sequencing costs respect to the hypothetical prevision of the Moore’s Law, based on the prevision of exponential computing costs reduction [6].

These aspects, coupled to the technologies versatility, have open the way to the massive NGS diffusion in every field of molecular research and to the begin of the so called “-Omics” era.

1.1 Next Generation Sequencing Technologies

Next Generation Sequencing technologies (NGS) have been defined as the next phase of DNA sequencing evolution since they allow single laboratories to sequence entire genomes in a few time and at very competitive costs [7].

(12)

8

Different NGS platforms have been developed so far, each one showing specific advantages and limitations. However, until now three of them has had the largest diffusion: i) the Roche 454 Genome Sequencer FLX (http://www.my454.com/); ii) the Illumina HiSeq (http://www.illumina.com); and iii) the Life SOLiD (http://www.lifetechnologies.com/). In general, samples’ preparation through these platforms is made up of three main analytic steps: i) the generation of a single stranded DNA (sstDNA) library; ii) the library amplification; and iii) the sequencing reactions. In addition, data analysis using specific bionformatic pipelines has to be carried out (Figure 2).

Figure 2. Overview of the sample preparation workflow through NGS platforms.

The library preparation is the most flexible step in the entire procedure. It depends on the kind of project, on the nature of the biological

(13)

9

samples used as source, and on the question/s you are looking to address.

Several methods have been used for NGS library preparation. Traditionally, the analysis of an entire genome is based on the isolation of the DNA and its fragmentation using different methods. Otherwise, to analyze selected genomic regions, a number of enrichment procedure, both PCR-based and PCR-independent, are available [8]. Finally, it is also possible to analyze the whole RNA or RNA subpopulations [9]. At the end of the library preparation (independently from the used protocol) a population of sstDNA fragments, homogeneous in size and compatible with sequence reads length, is obtained. These fragments are modified by the ligation of specific adapters (different for each NGS platform), which are required as primers for the subsequent amplification and sequencing reactions. The chemistry used to carry out these two steps are different for each NGS platform. In particular, the library amplification can be obtained by emulsion PCR (emPCR) or by solid-phase amplification. In the emPCR, after adapters ligation, DNA fragments are captured on the surface of specific beads and amplified into the droplets of a water-in-oil emulsion. The ratio between library fragments and capture beads is carefully evaluated (titration assay) to allow the capture of one DNA molecule per bead. In this way, at the end of the amplification cycle, each bead will carry million of copy of the same DNA fragment clonally amplified. The emulsion increase the amplification specificity since each bead is amplified in its own droplet, reducing cross contamination risks. After the amplification, the emulsion is chemically broken to recover and enrich the DNA carrying beads (Figure 3, panel A) [10]. In the solid-phase amplification, the adapted DNA fragments are hybridized to a glass slide to obtain clonally amplified clusters. In this case, high density forward and reverse primers are immobilized on the slide and

(14)

10

the ratio between them and the library fragments is crucial for cluster density and, consequently, for sequencing throughput [8]. In brief, ssDNA fragments are hybridized to the glass slide surface, extended by polymerase and denatured to obtain ssDNA fragments covalently ligated to the slide surface. These obtained ssDNA fragments flip over to form a bridge by hybridizing to the primer on the slide and are used as template by the polymerase (bridge amplification) (Figure 3, panel B).

Figure 3. Library amplification’s strategies. DNA library fragments are immobilized on the surface of DNA capture beads and amplified into an emulsion (A), or hybridized on the surface of a glass slide to do a bridge amplification (B) [8].

After the amplification, the libraries are ready to be sequenced.

Different sequencing chemistries have been developed and optimized by each NGS platform, such as pyrosequencing, reversible terminator strategy and sequencing by ligation. In the pyrosequencing chemistry the four

(15)

11

nucleotides are eluted one at time in a fixed order. When a nucleotide complimentary to the template is added, it is incorporated by the polymerase in the elongation strand, releasing a molecule of PPi. The latter is used by the sulphurylase to convert the adenosine phosphosulfate in ATP which is used by the luciferase to oxidate luciferin. The emitted light is registered by a CCD camera and converted in sequence data (Figure 4, panel A) [10]. In the reversible terminator strategy, the four nucleotides are labeled each with a different fluorescent dye, so that they can be eluted all together during each sequencing cycle. The nucleotide complimentary to the template is incorporated and, after fluorescence signals registration, it is cleaved together with the terminating group becoming available for nucleotide adding in the next sequencing step (Figure 4, panel B) [11]. Finally, the sequencing by ligation use a mixture of fluorescently labeled octamer oligonucleotides that hybridize to the sequence adjacent to the primed template. These octamers are two-base-encoded probes in which the combination of the first two bases is assigned to a specific dye. So, after the hybridization and fluorescence registration, the last three bases of the octamer are cleaved to start again with the octamers elution. In this way, at the end of the sequencing cycle, two bases each five nucleotides have been read on the template strand. At this point, the primer is removed and a second ligation cycle is performed using a “n-1 primer”. This primer resetting procedure is repeated more time to fill in the gaps in the template.

This allows each base to be read twice, in two independent ligation reactions and using two different primers, ensuring a great accuracy in base calling (Figure 4, panel C) [12].

(16)

12

Figure 4. NGS sequencing chemistries. The pyrosequencing process is based on the production of light when a nucleotide complimentary to the template is added. Since the four nucleotides are not differentially labeled, they are eluted one at time in a fixed order to ensure correct base calling. In the 454 Roche procedure, the amplified library fragments are annealed on the surface of a DNA capture bead. These beads (each carrying million of copy of the same DNA library fragment clonally amplified) are deposited into the wells of a fiber optic slide (PTP) together with beads carrying the pyrosequencing enzymes. In this way, sequencing reactions occur simultaneously in all the PTP wells (A). The reversible terminator strategy uses four differentially labeled nucleotides eluted together in each sequencing cycle. In the Illumina system, the amplified library fragments are annealed on the surface of a glass slide (flow cell) to obtain clusters simultaneously sequenced (B). Finally, the sequencing by ligation procedure use a mixture of fluorescently labeled octamers coupled to a dual color code. In the SOLiD procedure the library amplification has been optimized from a bead to a slide system (C) [8, 10-12].

(17)

13

The different technical features of each NGS platform account for their own strengths and pitfalls, especially in terms of sequencing reads length, base calling accuracy and sequencing throughput (Table 1).

Table 1. Comparison of the currently available NGS platforms features.

NGS Platform

Library amplification

NGS chemistry Read lenght (bp)*

Run throughput (Gb)*

Run time(Days)

Roche 454 GS FLX

emPCR Pyrosequencing 700 0.7 1

Illumina HiSeq

Solid-phase Reversible terminator

2x125§ 600 10

Life SOLiD Solid-phase Sequencing by ligation

2x50§ 160 8

Roche 454 Junior

emPCR Pyrosequencing 700 0.07 0.5 Illumina

MiSeq

Solid-phase Reversible terminator

2x300§ 15 2.2

Life Ion Torrent

emPCR H+ Ion

semiconductor

400 2 0.34

NGS, Next generation Sequencing; bp, base pair; Gb, gigabase.

*Reported considering the highest performances actually available for each platform. Average read’s length (up to 1,000 bp). §Paired-end sequencing.

The progressive optimization and standardization of NGS procedure has lead to a continuous increase of sequencing productivity/run and a reduction of sequencing costs, as mentioned above [6]. This, in turn, has increased the request for routine NGS-based applications also in clinical settings for diagnostic purposes [7]. In this view, the so called NGS “bench- top” instruments, such as the Roche 454 Junior, the Illumina MiSeq and the Life IonTorrent, have been launched on the market promising the same sensitivity and accuracy of the biggest instruments, but with instruments costs and sequencing run time markedly reduced. From a technical point of view, while the Junior and the MiSeq systems use the same chemistries of the larger versions, the IonTorrent is completely different from its

(18)

14

corresponding larger platform. In fact, it couples the emPCR protocol for library amplification to a hydrogen ion semiconductor chip for sequencing reactions. In particular, the clonally amplified DNA beads are loaded into the wells of this semiconductor device, where the four unlabeled nucleotides are loaded. When there is an incorporation event, a hydrogen ion is released and the consequent voltage difference is recorded (Figure 5).

Figure 5. IonTorrent sequencing chemistry. A nucleotide incorporation into the strand of DNA in elongation produce the release of a hydrogen ion.

The ion charge modify the pH, which is detected by ion sensor and converted in sequence data.

Taken together, these bench-top platforms are featured by a fast turnaround time and a great flexibility in sequencing throughput, representing a real chance to combine NGS advantages to a routine laboratory use. Their main features in terms of productivity are summarized in Table 1.

In addition to all the above mentioned sequencers, usually defined as

“second generation”, novel technologies also called “third generation”

(19)

15

sequencers are being developed. The main feature of these novel sequencing technologies is that they avoid the library amplification step. Therefore, library fragments are directly sequenced at a single molecule level with the great advantage to avoid PCR biases. Different third generation platforms are available, such as the SeqLL HeliScope (http://seqll.com/), the Pacific Biosciences PacBio (http://www.pacificbiosciences.com) and the Oxford Nanopore (http://www.nanoporetech.com) [13-16]. Also in this case, different chemistries have been developed and characterize each platform also in terms of capability (Table 2). Even if the potential of these novel technologies is really exciting, to date they are not yet currently and massively used. It is to be expected that, once the protocols will be standardized and the analytic performances improved, they will overwhelm the market.

Table 2. Comparison of the Third generation sequencers features.

Platform Sequencing chemistry

Read lenght (bp)*

Run throughput (Gb)*

Run

time(Days) SeqLL

HeliScope

Reversible terminator

32 28 1.2

Pacific Bioscience PacBio

Real time 10,000 0.5 80.2

Oxford Nanopore

Electronic Sensing

NA NA NA

bp, base pair; Gb, gigabase; NA, not available.

*Reported considering the highest performances actually available for each platform. Average value. §Paired-end sequencing.

1.2 Next Generation Sequencing Technologies applications

The great technological advances following the diffusion of NGS- based approaches gave new impetus to several research topics. It is

(20)

16

important to underline that the NGS platforms are featured by a large flexibility. Several biological samples can be used as source and different kind of projects can be realized in a way and in time not easily imaginable before NGS revolution. This novel awareness has involved also the biomedical research area with the aims to markedly accelerate the search for genetic causes of human diseases and to answer previously difficult-to- answer questions regarding disease pathogenesis. In recent years, several NGS-based approaches have been described and validated to improve the study of the molecular basis of human diseases with the aim to: (i) develop novel, sensitive, accurate, cost- and time-effective pipelines for molecular diagnostics; and (ii) highlight the mechanisms involved in diseases development to identify novel diagnostic, prognostic and therapeutic markers [7]. In fact, NGS technologies can be easily applied to the study of the human genome through a variety of approaches and until now, they have been successfully used to analyze target regions of the human genome, ranging in size from the entire exome to a restricted number of genes or a single amplicon [17-19]. In addition to nucleotide variants detection, NGS- based strategies are useful also to study the DNA methylation status, both at single gene or at genome-wide level [20]. Finally, metagenomics has been really improved by NGS advent [21].This phenomenon is so diffused that is conceivable to suppose that novel NGS-based strategies are still developing and that these technologies will became even more routinely, especially for diagnostic purposes, considering the progressive protocol simplification, the operator “hand on” work reduction, and the advantages of the “bench-top”

NGS platforms. In addition, the integration of data obtained using several NGS-based strategies could represent an additional advantage to better understand the mechanisms involved in diseases development and, in turn to

(21)

17

identify actionable target for a better patients identification, stratification and treatment.

Actually, two kinds of NGS-based approaches has having a huge diffusion and showing their potentialities for the study of the molecular basis of human diseases, targeted DNA sequence capture and metagenomics, and will be discussed more in detail.

1.3 Targeted DNA sequence capture

Targeted DNA sequence capture is a NGS-based approach for the simultaneous analysis of genomic target regions selectively enriched by the whole DNA.

NGS techniques allow the study of entire genomes faster and cheaper than conventional Sanger sequencing [22-23]. However, the entire sequencing of a large number of samples is not yet feasible for routine use due to the cost, time and infrastructures required. Thus, different approaches to specifically enrich target genomic regions, simultaneously allowing samples barcoding for sequence multiplexing, have been developed and can be classified in PCR-based and PCR-alternative strategies. Both of these can be used for NGS libraries preparation and the choose of the most appropriate one depend on the size of the target regions, the number of samples to be analyzed, the costs and time required and on the biological questions to be addressed (Table 3).

(22)

18

Table 3. NGS-based enrichment strategies for DNA sequence variants identification.

Enrichment System

DNA input

Required Time*

Sensitivity Specificity Max target size

Long PCR 5ng/

amplicon

4-5h High High Depend on

amplicon length Multiplex

PCR

5ng/

multiplex

3-4h High High Depend on

amplicon length and multiplexing Microdroplet

PCR

1.5ug 48h High High Up to 20,000

genomic loci

WES 500ng-

2ug

92h High >60%§ 50-75Mb§

Targeted capture

500ng- 2ug

92h High >60%§ Up to 50 Mb

of custom regions WES, whole exome sequencing; Mb, megabase.

PCR has been the most widely used pre-sequencing strategy to date, since it is perfectly compatible with Sanger sequencing and also with all the NGS instruments: at the end of the amplification, the resulting amplicons have NGS-platform-specific adapters ligated to their ends. This represents a library that is suitable for the downstream sequencing reactions [17]. Since barcode sequence-tags can be also added during this step, samples multiplexing is also allowed [24]. In general, PCR-based strategies, including also multiplex PCR and long range PCR, are useful to analyze one or a few genes [25-31]. Otherwise, since PCR amplification is too laborious for large scale NGS downstream applications, it risks to be a bottleneck in the sample preparation workflow.

The so called “DNA sequence capture” approach is a PCR alternative strategy able to overcome PCR limitations and is an excellent way to isolate large or highly dispersed regions from a pool of DNA molecules [32]. Sequence capture is essentially based on hybrid capture

(23)

19

reactions for the selective enrichment of targeted genomic regions. Specific capture probes can be synthesized to enrich the regions of interest from the whole genome, thus obtaining a captured, adapted and barcoded library for NGS applications [18,33,34]. More in details, DNA fragments hybridize to the capture probes synthesized on DNA microarray glass slides in array- based hybridization method [18], while, biotinylated DNA or RNA probes are used in liquid-phase hybridization. The non-targeted DNA fragments are washed away and the enriched DNA is recovered and used for high throughput sequencing (Figure 6).

Figure 6. Schematic view of the DNA sequence capture procedure.

An alternative technology (solution-based) is a noted example of enrichment system featured by selective circularization-based method which is a further development of the principle of selector probes used in several diagnostic approaches [35,36]. In brief, genomic DNA is fragmented by restriction enzyme digestion and circularized by hybridization to probes

(24)

20

whose ends are complementary to two non-contiguous stretches of a target region (Figure 7).

Figure 7. Circularization-based procedure for selective DNA enrichment.

Even if PCR-based enrichment methods have the benefit of even coverage and high specificity, DNA sequence capture has several advantages able to overcome it. Hybridization is less sensitive to contamination and also mismatches are less harmful. In addition, PCR specificity depends on reaction optimization and primer design: large rearrangements in genomes, for example, may be undetectable unless primer pairs in flanking regions. Unlike PCR, drawbacks of hybridization capture are the need of relatively large amount of high-quality DNA and the loss of target molecules during library preparation. Therefore, this approach is especially suitable for the study of large genomic regions, either contiguous or not contiguous, including the entire exome.

(25)

21

Whole exome sequencing (WES) is defined as the selective sequence the coding regions of a genome, to discover rare or common variants associated with a disorder or phenotype [37,38]. As a result, WES is an attractive and practical approach for the study of coding variants related to rare Mendelian disorders and of many disease-predisposing SNPs throughout the exome [39-42].

To reduce sequencing time and costs and avoid the drawbacks related to data analysis, the target enrichment of specific genomic regions of interest seems to be an attractive alternative. Until now, target enrichment- based strategies, have been used for sensitive nucleotide variants identification [43,44], validation of novel diagnostic tools [45-47] and drug resistance/sensitivity profiling [48-50]. This method is useful for the study of complex families with different genotype/phenotype correlation and to identify all the at risk subjects [51].

Considering all the above, and also that target enrichment technologies are easy to use, they appear really suitable for the study of the molecular basis of genetic diseases, both for research and diagnostic purposes.

1.3.1 Inherited cardiomiopathies

Inherited cardiomyopathies are a group of heterogeneous genetic diseases usually classified according to functional and morphology abnormalities of the cardiac muscle. It includes hypertrophic cardiomyopathy (HCM), dilated cardiomyopathy (DCM), arrhythmogenic right ventricular cardiomyopathy (ARVC), left ventricular noncompaction (LVNC), and restrictive cardiomyopathy (RCM). In addition, another group of inherited cardiomyopathies is featured by a primary involvement of

(26)

22

cardiac electric transmission (channelopathies) and includes long QT syndrome (LQTS), short QT syndrome (SQTS), Brugada syndrome, and catecholaminergic polymorphic ventricular tachycardia (CPVT). All these diseases are featured by a high clinical and genetic heterogeneity [52].

Clinical presentation ranges from asymptomatic to severe rapidly worsening forms and the same symptoms can be the expression of different diseases. The age of onset is also extremely variable: sometimes an inherited cardiac disease is diagnosed before birth, while other subjects can show clinical signs later in the adulthood [52]. Most relevant, all of these diseases can predispose to the development of malignant arrhythmias, heart failure and sudden cardiac death (SCD). It has been estimated that the majority of SCD in young individuals can be related to the presence of one of the above mentioned diseases [53]. Therefore, the correct identification of the molecular alterations responsible for inherited cardiomyopathies is crucial for the correct patient’s management and for the identification of all the at risk subjects within the affected families.

The cardiomyopathies’ clinical variability is reflected in the heterogeneity of their genetic basis. To date, tens of genes, showing different mechanisms of inheritance and incomplete penetrance, have been related to the onset of each inherited cardiomyopathy. However, they explain only a variable proportion of all cases suggesting the existence of other, still unknown disease-causative genes [52]. In addition, the highly variable phenotipic expression, also in the presence of the same causative mutation and in the same family, has strongly suggested a role for additional inherited variants, in the same gene or in independently inherited genes, able to act as phenotype-modifiers. Finally, molecular variants in the same gene have been related to different cardiomyopathies.

(27)

23

Taken together, all the above mentioned issues explain the difficulties in the correct molecular diagnosis of these overlapping diseases.

Therefore, NGS-based approaches, enabling the simultaneous analysis of large number of genes, promise to overcome these limitations. Increasing evidences are assessing the potentialities of NGS-based approaches for the study of inherited cardiomyopathies [52]. van de Meerakker et al identified, in a DCM family, a novel mutation in the alpha-tropomyosin gene through the targeted enrichment of a panel of 23 candidate genes (array-based enrichment) [54]. In another study, the custom enrichment of 16 genes was able to identify a compound heterozygosis in an infant affected by LVNC, confirming the potential of NGS as fast diagnostic tool [55]. Similar approaches have been used in different patient settings showing their reliability in terms of accuracy and sensitivity also for diagnostic pulposes [56-62] Finally, post-mortem WES was able to identify the causative mutation in a young women death for SCD [63]. Taken all together, these findings assess the feasibility of NGS for the study of complex inherited diseases, such as inherited cardiomyopathies and suggest their use as routine diagnostic procedure in a molecular biology laboratory.

1.4 Metagenomics

Metagenomics is defined as the field of molecular research that studies the complexity of microbiomes, i.e. the entire collection of all the genomic elements of a specific community of microrganisms, including bacteria, archaea, viruses, and some unicellular eukaryotes, living in a specific environment (microbiota).

In the last few years, metagenomics literature has grown exponentially, essentially due to technological improvements related to the

(28)

24

introduction of NGS. This has, in turn, greatly improved our understanding about the role of human microbiota in the healthy status and its possible implications in diseases pathogenesis [64].

If we consider the human body as an environment, the human microbiota is the entire collection of microorganisms living on the surface and inside our body (Table 4) [65-68].

Table 4. Human microbiota composition across the five most extensively studied body sites. Interestingly, the oral and gut microbiota have the highest microbial diversity, while the urogenital tract has the smallest bacterial diversity [from 64].

Human microbiota

(10 times more microbial than human cells: 1014 vs 1013)

Human Microbial Habitats

Most represented Phyla and their relative abundance (%)

Number of species Oral cavity Firmicutes (36.7), Bacteroidetes (17.3),

Proteobacteria (17.1), Actinobacteria (11.9), Fusobacteria (5.2)

>500

Skin Actinobacteria (52), Firmicutes (24.4), Proteobacteria (16.5), Bacteroidetes (6.3)

~300 Airways Actinobacteria (55), Firmicutes (15),

Proteobacteria (8), Bacteroidetes (3)

>500 Gut Firmicutes (38.8), Bacteroidetes (27.8),

Actinobacteria (8.2), Proteobacteria (2.1)

>1,000 Urogenital tracta Firmicutes (83), Bacteroidetes (3),

Actinobacteria (3)

~150

amainly female.

These communities are required for human physiology, immune system development, digestion and detoxification reactions [60,70]. In this view, humans can be defined as “superorganisms” made of two genomes, one inherited from parents and the other acquired, i.e., the microbiome [71].

Differently from the inherited genome, which is almost stable during lifetime, the microbiome is extremely dynamic and influenced by age, diet, hormonal cycles, travel, therapies, and illness [72-77].

(29)

25

Most of the human adult microbiota lives in the gut. Only in the human colon microbial cell density exceed 1011 cells/g contents, being equivalent to 1-2 kg of body weight [78]. In addition, it has been estimated that the human gut microbiome accounts for more than 5 million different genes [79]. Even if over 1,000 different species colonize the human gut [21], they belong to a small number of phyla: Firmicutes, Bacteroidetes and Actinobacteria, followed by the less represented Proteobacteria, Fusobacteria, Cyanobacteria and Verrucomicrobia [70]. To date, a number of functions have been associated to the gut microbiome, including polysaccharide digestion, immune system development, defense against infections, synthesis of vitamins, fat storage, angiogenesis regulation, and behavior development [69,70,80,81]. Therefore, it is conceivable to suppose that alterations of the human gut microbiome can play a role in disease development and more our understanding on its role will grow up, more it will became possible to use the human microbiome for diagnostic purposes or as target for novel therapies.

All the above, explain the boom of metagenomics worlwide. NGS plays its role in this scenario, since it allows the qualitative and quantitative analysis of a specific microbiome without selection biases and constraints associated with cultivation methods. Different NGS-based strategies can be used for metagenomic purposes, as shown in Figure 8.

(30)

26

Figure 8. Next Generation sequencing-based approaches for metagenomics. Starting from an environmental sample of interest (1), total DNA and/or RNA are extracted (2). Three different sample preparation strategies can be used depending on the project aims: 16s rRNA Sequencing, Shotgun Sequencing, and Metatranscriptomics (3). Usually, the 16S rRNA procedure allows sample multiplexing while a higher coverage is required for the others. After sequencing (4), specific bioinformatic pipelines are used for data analysis (5) [from 64].

(31)

27

Shotgun approaches allow a comprehensive view of entire microbial communities both at DNA or RNA level, with taxonomic assignment up to the species. However, they still have technical limitations, most of which related to sequence data analysis. Therefore, the most used approach is the targeted resequencing of specific sequence tags useful for phylogenetic purposes, like the bacterial 16S rRNA gene. The latter gene has a peculiar structure characterized by hypervariable regions spaced by ultra-conserved regions [82]. Universal primers which anneal on the conserved regions can be used to amplify, in a single PCR reaction, virtually all the bacteria present in a target environment, and to unequivocally identify them at the end of sequencing [83].

This approach has been successfully used to analyze the microbial community richness of the human gut microbiome and its relationship with specific diseases, such as obesity and immune-related and inflammatory diseases [84-93]. It is now well known that the gut microbiome of obese subjects differ from that of not obese and that such differences could play a role in altering gut permeability and induce inflammatory reactions [91-93].

Similar mechanisms seems to be applicable to an increasing number of intestinal and extra-intestinal diseases, including inflammatory bowel diseases and celiac disease [84-87].

1.4.1 Gut microbiome and inflammatory bowel diseases

Inflammatory bowel disease (IBD) are chronic inflammatory disorders of the gastrointestinal tract with increasing worldwide incidence [94]. IBD clinical features are severe, including diarrhea, weight loss and debilitating abdominal pain, and result in substantial morbidity and impairment in quality life [95]. IBD also increase the risk for colon cancer

(32)

28

development [96]. The affected intestinal areas are usually characterized by transmural inflammation associated to lymphoid hyperplasia, submucosal edema, ulcerative lesions and fibrosis which can be assessed by colonoscopy [97]. The two main forms of these disorders are Crohn disease (CD) and ulcerative colitis (UC). IBD causes are still unclear. Even if host genetics play a key role, mostly in CD than in UC onset [98], environmental factors are also involved [99]. Numerous evidences emphasize the role of gut microbiome in triggering and perpetuating typical IBD chronic inflammation. The most widely currently accepted hypothesis on IBD pathogenesis is that, in genetically predisposed subject, an altered host immune response against luminal agents, such as the gut microbiota, could result in the IBD chronic inflammation [97,100].

It has been established that microbes rapidly colonizes the gut after birth [101] and this process is influenced by different factors, such as the heredity of the mother, the immediate living environment, the feeding practices, microbial infections, and the host’s genetics [102]. As discussed above, this mutual relationship between the human host and its microbiota seems required for healthy status acquisition and maintenance.

Several studies have identified significant alterations in the gut microbiota composition of IBD patients with respect to healthy individuals [103,104]. Therefore, there is a growing interest in understanding whether interactions between intestinal microbes and innate immunity could influence IBD expression and act as triggers of the inflammation, since this could open the way to novel diagnostic and therapeutic opportunities. In addition, since nutritional therapy is effective in pediatric Crohn disease, it has been suggested that it may induce its beneficial effects by modifying the microbiome composition [105].

(33)

29 1.4.2 Gut microbiome and celiac disease

Celiac disease (CD) is a chronic, multi-factorial, inflammatory disorder of the small intestine that involves interactions between genetic and environmental factors [106]. In genetically susceptible individuals (HLA- DQ2/DQ8 carriers) ingestion of gluten leads to an abnormal intestinal immune response involving both adaptive and innate immunity, which is characterized by failure to establish and/or maintain tolerance to dietary peptides in wheat, barley, and rye (particularly to wheat gliadin) [107]. This abnormal immune activity damages the small intestine, which typically shows villous atrophy, crypt hyperplasia, and an increased number of lymphocytes within both the epithelium and the lamina propria [108].

However, more recent studies have challenged the gluten-hypersensitivity dogma and suggest that exposure to gluten may not be the only factor contributing to the onset of CD [109,110]. Indeed, traditional pathogenic mechanisms of gluten hypersensitivity do not explain why: i) the frequency of DQ2/DQ8 molecules in the general population is about 30%, but only 1to 3% of individuals develop CD [106]; ii) CD is increasingly diagnosed in adulthood, many years after introduction of gluten in the diet [111]; iii) the prevalence of CD prevalence is rapidly increasing in the western world (although consumption of gliadin-containing food has not increased) [112].

All these observations support the idea that other environmental factors, as the gut microbiome, could play a role in CD pathogenesis.

A number of studies have identified significant alterations in the composition of the gut microbiota in CD patients with respect to healthy individuals [113,114]. As for other diseases, also in this case the relationship between intestinal microbes and innate immunity through their

(34)

30

role in promoting inflammation and impairing mucosal barrier functions could clarify CD pathogenesis and give novel opportunities for patients care.

(35)

31 II. AIMS

The aim of this PhD project was to use next generation sequencing (NGS)-based strategies to study the molecular basis of human diseases with the specific aims to: i) increase diagnostic sensitivity, respect to commonly used techniques; ii) identify novel disease-causing genes; iii) identify phenotype modifier genes able to justify the heterogeneity of clinical signs;

and iv) clarify pathogenetic mechanisms involved in disease development.

To do these, we used two different NGS-based approaches: targeted DNA sequence capture and metagenomics.

Targeted DNA sequence capture was used to assess the feasibility of this approach for the study of a large panel of candidate target genes possibly related to complex (highly heterogeneous from both molecular and clinical points of view) inherited diseases. We studied, as model disease, the inherited cardiomyopathies. Particularly, we carefully evaluated results reliability to assess methods portability into routine diagnostic settings. In addition, we evaluated the feasibility of pooling together barcoded DNA samples from various patients in order to reduce the cost and time of the procedure.

Metagenomics, through the 16S bacterial rRNA analysis, was used to verify the presence of a specific dysbiosis into the composition of human gut microbiome in association to specific diseases. In this case, we used as model two inflammatory diseases: Crohn and celiac disease. In particular, in Crohn disease we aimed to assess the effects of nutritional therapy on the microbiome composition.

(36)

32

III. MATERIALS AND METHODS

3.1 Patients selection and biological samples collection

3.1.1 Targeted DNA Sequence Capture

Three unrelated individuals with a clinical diagnosis of hypertrophic cardiomyopathy (HCM) were selected for the present project with the aim to validate a targeted DNA sequence capture approaches for the study of inherited cardiomyopathies. HCM was defined as unexplained left or left and right ventricular hypertrophy in the absence of any potential cause of cardiac hypertrophy. It was diagnosed by echocardiographic evidence of increased wall thickness, two standard deviations or more above the upper reference limit of healthy individuals matched for age, sex, and body surface area. Maximal ventricular wall thickness (MWT) was defined as the greatest thickness in the various segments and measured as absolute value and as z- score for age and body surface area; z-score reference value < +2 [115].

Normal reference ranges of specific electrocardiographic (ECG) features were: PR: 0.12-0.20 sec; QRS: 0.06-0.10 sec; QTc: < 440 msec in men, <

460 msec in women.

Patient 1 was diagnosed with unexplained left ventricular hypertrophy (LVH) and cleft mitral valve during intrauterine life. HCM with asymmetrical LVH was confirmed at birth (MWT anterior interventricular [IV] septum was 8 mm at; z-score +7). This patient was followed-up at the Cardiomyopathy Clinic of the Monaldi Hospital (Naples, Italy) with regular ECG and echocardiographic evaluations. Verapamil therapy was started and the left ventricular wall thickness progressively normalized during childhood. No significant progression of LVH was seen

(37)

33

at follow-up. This is similar to a previous report [116]. At the last clinical evaluation, at 15 years of age, she showed the mitral cleft with concomitant mild-to-moderate regurgitation; her septal thickness was nearly normal (MWT at anterior IV septum was 12 mm; z-score +3.1). Patient 2 (now 8 years old) was diagnosed at birth with non-obstructive HCM and multiple septal ventricular defects (MWT at anterior IV septum at 4 months age was 8.3 mm; z-score +6.1). During follow-up, in the absence of cardiological therapy, LVH did not progress and all ventricular septal defects, except one, closed [116]. Her ECG showed signs of LVH and a long QTc (490 msec), which was considered “a secondary effect” of the cardiac hypertrophy. At last follow-up, MWT at the anterior IV septum was 8 mm (z-score +1.4).

Patient 3 was diagnosed at birth with a valvular pulmonary stenosis, and underwent a surgical valvulotomy at 14 years of age. He was diagnosed with diabetes mellitus and polycythemia at 44 years of age. One year later, he had the first episode of atrial fibrillation/flutter. He also showed a first- degree atrioventricular block, an incomplete right bundle branch block (QRS 110 msec), and a long QTc (QT 495 msec). Echocardiography showed non obstructive HCM (MWT at anterior IV was 15 mm; z-score +4.4). The patient and his younger daughter also suffer from minor depressive disorders. Patient 3 underwent biochemical investigation and a muscle biopsy to exclude that LVH was of a secondary nature (biochemical evaluation: normal lactate/pyruvate ratio; normal CK, AST, ALT; muscle biopsy: negative COX and SDH fibers, normal mitochondria, no inclusion bodies). During follow-up, recurrent episodes of atrial fibrillation/flutter and a progressive end-stage evolution (“burn out”) were observed; thus the patient underwent orthotropic heart transplant in 2011 at 51 years of age (MWT at anterior IV was 17 mm before heart transplantation; z-score +5.8).

(38)

34

Two of these patients (1 and 2) had previously been evaluated by using a DHPLC/Sanger test to search for causative mutations within the exons of 8 sarcomeric genes: MYH7, MYBPBC3, TNNI3, TNNT2, TPM1, ACTC, MYL2 and MYL3 [115].The molecular analysis was extended also to the patient’s relatives in order to study mutations segregation in each family. We obtained informed consent from each patient and family member, according to the procedures of the institutional review boards of the participating institutions.

Blood samples were obtained from each subject. Total DNA was isolated from peripheral blood using the Nucleon BACC3 Genomic DNA Extraction Kit (GE Healthcare, Life Sciences) according to the manufacturer’s instructions. Then, samples quantification was done through the NanoDrop 2000c Spectrophotometer (Thermo Scientific).

3.1.2 Metagenomics

With regard the gut microbiome characterization of Crohn disease, two child where enrolled for the project. In particular, one child was affected by Crohn disease and underwent nutritional therapy, the other was a sex- and age-matched not affected subject. The Crohn disease patient was a 14-year-old boy diagnosed with active Crohn disease (Pediatric Crohn’s Disease Activity Index, PCDAI=50). After colon endoscopy, he underwent nutritional therapy consisting of a daily powder constituted by proteins, antioxidants and anti-inflammatory fats (Alicalm formula, Nutricia Advanced Medical Nutrition) for 8 weeks. After this time, a clinical re- evaluation revealed disease remission (PCDAI= 0). Endoscopic ileum mucosal samples at diagnosis (BT-patient) and after therapy (AT-patient) were collected for DNA extraction. An ileum tissue sample was obtained

(39)

35

also from a 15-year-old boy affected by a gut polyp and without familiarity for Crohn disease.

To study the mucosal gut microbiome of celiac disease (CD), 15 active CD patients, 10 clinical controls and 6 at gluten free diet (GFD) CD patients were recruited among patients attending the Departments of Gastroenterology of the Universities of Salerno and of Roma-Tor Vergata, Italy. The exclusion criteria for enrolment were: any known food intolerance apart from gluten, IgA deficiency, treatment with antibiotics, proton pomp inhibitors, antiviral or corticosteroids, or assumption of probiotics in the 2 months before the sampling time. Duodenal biopsies from all the enrolled individuals were collected during diagnostic endoscopy procedures.

Total DNA was extracted from all collected biopsies (3 mg/sample, duodenum for CD and ileum for Crohn disease respectively) using the QIAamp DNA mini Kit (Qiagen,Venlo, Netherlands), following the manufacturer’s instructions.

3.2 NGS Library Preparation

3.2.1 Targeted DNA Sequence Capture

Gene selection and microarray design. A list of target candidate genes was produced by selecting all the genes related to cardiomyopathy onset and the genes coding for ionic channels, membrane receptors, growth factors and inflammatory and transcriptional factors [117-122]. Thus, 202 genes of interest were identified (Appendix 1), and a list with their refseq IDs was generated. Starting from this list, 2,250 genomic targets were selected, each corresponding to one or more exons, plus a flanking region of 500 bp at the 5’ and 3’ ends of each exon. These genomic coordinates,

(40)

36

including chromosome start and stop positions, were used to define unique probes for the final set of 4,790 probed regions (one or more for each target) for a total of about 3.9x106 bp. The microarray was provided by Roche Nimblegen Inc. (Madison, WI, 454 Optimized Sequence Capture 385K Array). Probe uniqueness was assessed by comparing the probe sequence to the genome [123].

Enrichment of target sequences. The selected regions were captured according to the manufacturers’ protocols using the NimbleGen Hybridization System (Nimblegen and Roche, NimbleGen Arrays User’s Guide: Sequence Capture Array Delivery v3.2) [18]. The procedures is based on four main steps: i) preparation of a library of adapted DNA fragments; ii) library hybridization on the custom capture array; iii) enriched library fragments recovery; and iv) enrichment assessment. In brief, 21 µg of genomic DNA from each sample were sheared into small fragments (range size: 300-500 bp) through nebulization. After fragments ends polishing and purification (AMPure Beads, Agencourt), specific adapters (Adaptor A and B) were ligated to each fragments. After quality (DNA chip1000, BioAnalyzer, Agilent) and quantity assessment (NanoDrop, Thermo Scientific), each library has been amplified and hybridazed on the custom array for 72h. After incubation, stringent washes were performed to remove the unbound fragments, while the enriched fragments were eluted and amplified. Sample enrichment was evaluated according to the manufacturer’s instructions, i.e. by measuring, through quantitative PCR analysis, the relative fold-enrichment of 4 control loci present in each capture array. The analysis was carried out on a LightCycler 480 real time PCR system (Roche).

(41)

37 3.2.2 Metagenomics

An aliquot of the tissue DNA was used for PCR amplification and sequencing of bacterial 16S rRNA gene. To deeply investigate the bacterial composition of gut mucosal samples, a 548 bp amplicon, spanning from V4 to V6 variable regions of the 16S rRNA gene, was amplified using the 519F and the 1067R primers [124]. Both primers were modified to obtain fusion primers so that each one contained at the 5’ end a universal 454 adaptor (adaptor A for forward primer and adaptor B for the reverse) and a specific 10 nucleotide tag/sample (Table 5).

Table 5. List of the primers’ sequences used to amplify, by NGS methodology, the bacterial 16S V4-V6 region. For each couple the sequences are reported from 5’ to 3’, both on forward and reverse primers.

Each primer is a fusion primer resulting from the following sequences (from left to right): i) the 454-Roche adaptor sequences (adaptors A and B) required for emulsion amplification and sequencing reactions (upper case characters); ii) the sample-specific 10 nucleotide tag sequences (MID 1-10) required to univocally tag each individual subject (underlined characters);

iii) the primers’ template-specific sequences (bold characters).

16S Primers for whole microbiota amplification

F (5’-3’) R (5’-3’)

Adaptor A MID (1-10) Template Specific Primer

Adaptor B MID (1-10) Template Specific Primer CGTATCGCCTCCCTCGCGCCATCAG-

ACGAGTGCGT-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- ACGAGTGCGT-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

ACGCTCGACA-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- ACGCTCGACA-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

AGACGCACTC-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- AGACGCACTC-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

AGCACTGTAG-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- AGCACTGTAG-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

ATCAGACACG-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- ATCAGACACG-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

ATATCGCGAG-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- ATATCGCGAG-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

CGTGTCTCTA-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- CGTGTCTCTA- TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

CTCGCGTGTC-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- CTCGCGTGTC- TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

TAGTATCAGC-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- TAGTATCAGC-TGACGACAGCCATGC CGTATCGCCTCCCTCGCGCCATCAG-

TCTCTATGCG-CAGCAGCCGCGGTAATAC

CTATGCGCCTTGCCAGCCCGCTCAG- TCTCTATGCG- TGACGACAGCCATGC

F: forward; R:reverse.

(42)

38

PCR reactions were carried out using 25 µl of H2O, 20 µl of 2.5X HotMaster PCR mix (Eppendorf), 1.5 µl of each primer 10 µM and 60 ng of DNA. The amplifications were performed on a DNA ENGINE Chassis (Biorad) at the following conditions: 2 min at 94°C, 30 cycles of 94°C for 40 s, 50°C for 40 s, and 65°C for 40 s, and a final extension of 70°C for 7 min. After visualization by agarose gel electrophoresis, each PCR products was individually purified using magnetic purification beads (AMPure Beads, Agencourt), assessed for quality on a Bioanalyzer 2100 (DNA 1000 chip, Agilent) and quantified using the Quant-it PicoGreen dsDNA kit (Invitrogen). Equimolar amounts of each amplicon were pooled together to obtain multiple amplicon libraries, each containing a total of 5 mixed subjects.

3.3 NGS Library Amplification and sequencing

Both the NGS-based approaches used in the present project were carried out using the GS FLX System (454 Roche). Therefore, the obtained DNA libraries were amplified through emPCR and sequenced using the pyrosequencing chemistry, according to the manufacturer’s specifications [10].

(43)

39 3.4 Bioinformatics

3.4.1 Targeted DNA Sequence Capture

Read mapping and variant detection. The obtained sequencing reads were mapped by using the GS Reference Mapper software package (454 Roche, version 2.6), with the default parameters. Variants were identified by comparing assembled vs reference genomic sequence (hg19, http://genome.ucsc.edu/cgi-bin/hgGateway). The procedure produces a list of high confidence nucleotide differences (HCDiffs), as well as a larger list of all nucleotide differences (AllDiffs), identified by a less stringent approach. The GS Reference Mapper uses a combination of flow signal and quality score information together with the type of nucleotide change to determine if a variant is to be treated as a high-confidence variant. The strategy used to select HCDiffs is based on the following criteria: 1) there must be at least 3 non-duplicate reads with the difference, with at least 5 bases on both sides of the difference, and a few other isolated sequence differences in the read; 2) there must be both forward and reverse reads showing the difference, unless there are at least 7 reads with quality scores over 20 (or 30 if the difference involves a homopolymer of 5 nt or more); 3) not all overcalls/undercalls are reported as insertions/deletions, but only those where the difference is the consensus of the sequenced reads. The AllDiff strategy is less stringent and requires at least 2 non-duplicate reads, without restrictions in terms of forward/reverse strands. To facilitate downstream analyses, the complete list of variants (AllDiffs/HCDiffs), together with relevant information concerning target coverage, were imported in a relational database based on PostgreSQL (http://www.postgresql.org).

Riferimenti

Documenti correlati

The goal of the case study is providing a software system (called the Science Cloud Platform, SCP) which will, installed on multiple virtual or non-virtual machines, form a

The pattern in the solid state is strictly dependent on electronic and steric properties of L, and a moltitude of weak forces like aurophilic interactions, hydrogen bonds or π –

thaliana flBid lines, after H 2 O 2 treatment, exhibit a reduced number of alive root hairs compared to wild type demonstrating that flBid leads to higher sensitivity to

Elena e Lila, rappresentate narrativamente come donne che hanno dovuto subire la violenza fisica e simbolica del dominio maschile, ma che hanno cercato, in un modo o

Data l’esperienza di questo progetto didattico, il corso di Knowledge organization and cultural heritage (KO and CH), una delle attività obbligatorie del primo anno del corso di

The paper presented and discussed QETAC, a method for optimising the energy efficiency and traffic offloading performance of a wireless access network with caching and mesh