Whole Genome Sequencing of Italian Isolate Populations to identify rare and characteristic variants and to generate a reference panel for imputation.

List of Abbreviations

1000GP3   1000 Genomes project phase 3
@OQ       Original Quality
@RG       Read group
AC        Alternative Allele count
AF        Alternative Allele frequency
BAM       Binary alignment/map
BCL       Base Call
BGI       Beijing Genomics Institute
CADD      Combined Annotation Dependent Depletion
CARL      Carlantino
CDCV      Common Disease Common Variant
CDRV      Common Disease Rare Variant
CNV       Copy Number Variants
DP        Read depth
EGA       European Genome-phenome Archive
ExAC      Exome Aggregation Consortium
FVG       Friuli Venezia Giulia
GIANT     Genetic Investigation of ANthropometric Traits
GWAS      Genome Wide Association Studies
HSR       San Raffaele Hospital
INGI      Italian Network of Genetic Isolates
LD        linkage disequilibrium
LOF       Loss Of Function
MCH       Mean Corpuscular Haemoglobin
NPG       New Pipeline Group
NRDR      Non Reference Discordance Rate
QC        Quality control
SAM       Sequence alignment/map
UK10K     the UK10K project
UKB       UK Biobank
VBI       Val Borbera
VQSR      Variant Quality Score Recalibrator
WES       Whole Exome Sequencing

Abstract

Background

The drop in Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) prices has started a race toward the generation of denser and more accurate maps of the human genome. Even with the contribution of large projects such as UK10K (The UK10K Consortium, 2015), however, the resources currently available for Genome Wide Association Studies (GWAS), in terms of sample size and power to detect associations, outdo those available for whole genome rare variant analyses (e.g. UKB (Sudlow et al., 2015), GIANT (Speliotes et al., 2010), etc.). GWAS is still the most widely used tool for discovering correlations between genotypes and phenotypes, partly thanks to the development of imputation algorithms, which infer missing genotypes in a sample using a scaffold of known haplotypes (Marchini and Howie, 2010). The release of the 1000 Genomes project data (1000 Genomes Project Consortium et al., 2012) allowed the creation of a reference panel, based on Next Generation Sequencing data, which comprises populations of different ancestry (Howie et al., 2011): this initial resource proved to be extremely valuable for the scientific community and has been recently updated (Sudmant et al., 2015). Moreover, it showed how useful it can be to include WGS data belonging to the population under study in a reference panel for imputation (Sidore et al., 2015). To date the race for the ‘best panel’ is still open, and many collaborations based on data sharing are arising to provide a ‘state of the art’ resource (McCarthy et al., 2016).

Research aims

With this work we aim to create a resource which can be used as a tool to improve imputation quality and increase the statistical power of the Italian Network of Genetic Isolates (INGI) cohorts and which, at the same time, will provide data giving us better insight into the structure and peculiar characteristics of our cohorts compared with outbred populations.

Methods

Introduction

1.1 Genetic architecture of complex traits and Genome Wide Association Studies

Population genetics is the study of the distribution of allele frequencies in a population and of how they change while the population is subject to evolutionary processes. Areas of study in population genetics include recombination, Mendelian inheritance, genetic linkage and linkage disequilibrium (LD), population stratification, etc. Allelic architecture refers to the number, frequencies and effect sizes of the susceptibility alleles underlying complex diseases. Diseases with high prevalence in the general population, such as type 2 diabetes (T2D) and hypertension, are polygenic, i.e. determined by multiple genetic variants together with lifestyle and environmental factors. This is also the case for complex quantitative traits, e.g. body mass index. Genetic research on complex traits began with surveys of candidate variants or regions of the genome, followed by analyses that scan the whole genome with limited resolution, and then by genome-wide association studies (GWAS) over the past ∼10 years. Due to their ‘hypothesis driven’ nature, candidate gene studies used a very liberal P value threshold (such as P<0.05) to claim significance, which could lead to a high level of false positives (Masicampo and Lalande, 2012). Indeed, less than 5% of the associations identified in candidate gene studies were replicated in larger GWAS (McCarthy et al., 2008).

Linkage analysis is suitable for detecting rare variants with high penetrance causative for rare diseases with classical Mendelian patterns of inheritance but in general, linkage analysis is not suitable for detecting common alleles of small to middling effects for complex diseases.

Before the GWAS approach was widely used, there were two theories for explaining the genetic underpinning of complex diseases with high prevalence: Common Disease Common Variant (CDCV) and Common Disease Rare Variant (CDRV). The CDCV theory hypothesised that a large proportion of phenotypic variation for common traits could be explained by common variants (Lander, 1996; Reich and Lander, 2001; Pritchard and Cox, 2002; Botstein and Risch, 2003). This theory has been well supported by GWAS, where many common variants have been identified as associated with common diseases and complex traits (Hindorff et al., 2009a). However, common variants did not fully explain common variation (Manolio et al., 2009), and this led researchers to look for new potential sources of the missing heritability. This model was also supported by GWAS, especially large scale meta-analyses with adequate power for both disease traits (International Schizophrenia Consortium et al., 2009) and quantitative traits (Yang et al., 2011).

In contrast to the CDCV model, the CDRV theory hypothesized that a large number of rare variants with large effects could explain a large proportion of heritability (Cirulli and Goldstein, 2010). Statistical simulations have shown that CDCV and CDRV are not necessarily mutually exclusive, with both rare and common variants underlying a polygenic genetic architecture for complex traits (Hemani et al., 2013). Other models such as the broad sense heritability model (Eichler et al., 2010) looked beyond genetic variants by considering the combined effects of genotypic, environmental and epigenetic interactions.

The completion of the human genome project (Lander et al., 2001; Venter et al., 2001) and the improvement of technologies for ascertaining and analysing the human genome set the stage for GWAS, changing the landscape of genetic study on complex diseases.

research projects with large sample size. The genetic polymorphism selection by major vendors was mainly based on data generated by the International HapMap project (International HapMap Consortium et al., 2007; International HapMap 3 Consortium et al., 2010). The early versions of SNP arrays usually included less than 1 million common variants, which could be imputed to up to 3 million variants discovered by the HapMap project, reaching up to 8 million with data from the pilot phase of the 1000G project.

Compared to candidate gene studies and linkage analysis, GWAS scan the whole genome in a systematic manner to detect genetic variants that confer susceptibility to diseases or alter quantitative traits. The conceptual basis of GWAS is straightforward (Balding, 2006). Although family based designs are sometimes adopted, most are essentially population-based case-control studies. The risk factor for the disease outcome (case versus control) is the genotype at a specific marker. The data for a SNP form a contingency table of disease outcome by SNP genotype. Evidence for association is typically based on simple statistical tests of single SNPs, such as a chi-square test based on genotype counts with two degrees of freedom (df), or based on allele counts with 1 df, although a range of other tests are possible.
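The two single-SNP tests mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of any specific GWAS package: the function names and the genotype counts in the usage note are hypothetical, and the p-values rely on the closed forms of the chi-square upper tail for 1 and 2 df only.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value(stat, df):
    """Upper-tail p-value, using the closed forms available for 1 and 2 df."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def genotype_test(cases, controls):
    """2 df test on the 2 x 3 table of genotype counts (AA, Aa, aa)."""
    stat = chi_square_statistic([list(cases), list(controls)])
    return stat, chi_square_p_value(stat, 2)


def allelic_test(cases, controls):
    """1 df test on the 2 x 2 table of allele counts derived from genotypes."""
    def allele_counts(genotypes):
        hom_ref, het, hom_alt = genotypes
        return [2 * hom_ref + het, 2 * hom_alt + het]

    stat = chi_square_statistic([allele_counts(cases), allele_counts(controls)])
    return stat, chi_square_p_value(stat, 1)
```

For example, `genotype_test((120, 60, 20), (150, 40, 10))` returns the statistic and p-value for made-up counts of the three genotypes in cases and controls, while `allelic_test` collapses the same counts to alleles before testing.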

When a common set of haplotype variants are analysed by most individual cohorts, results could be cross-examined and meta-analysed in large collaborative consortia.

Since GWAS became available, large advances have been made.

In 2007, a landmark GWAS study with ∼17,000 subjects typed on a half-million-variant SNP array (Wellcome Trust Case Control Consortium, 2007) identified 24 independent association signals for seven common diseases. This first WTCCC study was the largest set of GWAS of its time, costing a total of $9 million. It identified 21 loci, of which 14 were novel. All these associations have been confirmed in later meta-analyses. Later on, many other studies conducted extensive replication of suggestive signals coming from this WTCCC study and identified many more novel loci, for type 1 and type 2 diabetes (Todd et al., 2007; Zeggini et al., 2007), rheumatoid arthritis (Barton et al., 2008), and Crohn’s disease (Parkes et al., 2007).

diseases as opposed to hypertension or CAD.

Besides novel findings, a number of novel techniques and protocols used in this WTCCC1 study have since become standards in GWAS, for example systematically assessing and adjusting for population stratification, and using the HapMap reference panel for genotype imputation. This study also characterised other types of genomic variation, including Copy Number Variants (CNV) and large insertions and deletions. The second landmark genomic study from the WTCCC concluded that most common CNVs are well tagged by common SNPs and are unlikely to yield novel findings for common human diseases (Wellcome Trust Case Control Consortium et al., 2010). However, rare CNVs and large deletions have been reported to be associated with other categories of complex diseases, including autism and schizophrenia (International Schizophrenia Consortium, 2008). The subsequent widespread implementation of imputation analysis based on common reference maps (HapMap2 at first and, later, the 1000 Genomes Project phase 1 and 3) has been instrumental in the completion of well-powered meta-analyses of GWAS, reaching the sample sizes necessary for robust genetic discoveries. By the end of 2011, the EMBL-EBI and NHGRI GWAS catalogue had reported over 2,000 association signals for over 200 complex traits; as of 2016-12-12, the GWAS Catalog contained 2,670 studies and 30,321 unique SNP-trait associations (Welter et al., 2014). Figure 1.1 shows the updated picture of published GWAS results with p-values <= 5.0 x 10^-8.

1.2 From SNP array to Whole Genome Sequencing

Figure 1.1: The iconic GWAS diagram of all SNP-trait associations with p-values <= 5.0 x 10^-8, mapped onto the human genome by chromosomal location and displayed on the human karyotype, published by the EMBL-EBI and NHGRI GWAS catalogue.

effects but too low frequency to be detected by SNP array. This is also supported by the evolutionary theory that alleles susceptible to diseases and their risks are likely to be deleterious and could not reach high frequency due to purifying selection (Pritchard, 2001; Goldstein et al., 2013).

by fast development in sequencing technologies. By 2007, it was possible to sequence over 500Mb a day on a single machine, and that was when the 1000 Genomes Project (1000GP) was founded to perform low-coverage (2-4X) sequencing on up to 2,500 human genomes. Since 2008, more sequencing technologies have been developed, including Ion Torrent, Pacific Biosciences and Illumina’s MiSeq (Quail et al., 2012). In January 2010, Illumina unveiled the HiSeq 2000 sequencing system. It initially generated two billion paired-end reads and 200Gb of quality filtered data in a single run, which allows researchers to obtain 30-fold coverage of two human genomes in a single run. This is the sequencing technology adopted by the UK10K project and used to generate the data presented in this thesis.
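The relation between run yield and coverage quoted above is simple arithmetic, sketched below. The haploid genome size of about 3.1 Gb is our assumption for illustration, and the function name is hypothetical; the computed value is an upper bound, since not every sequenced base maps uniquely to the reference.

```python
def mean_coverage(yield_bases, genome_size_bases, n_genomes=1):
    """Average depth obtained when a run's yield is shared equally among genomes."""
    return yield_bases / (genome_size_bases * n_genomes)


# 200 Gb of filtered data split across two human genomes of ~3.1 Gb each
# (the genome size is an assumed round figure).
coverage = mean_coverage(200e9, 3.1e9, n_genomes=2)
print(f"{coverage:.1f}x")
```

With these inputs the result is slightly above 30x, consistent with the figure reported for a single HiSeq 2000 run.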

Figure 1.2: The allelic spectrum of human disease predisposition - This figure from Manolio et al. (2009) illustrates the relationship between frequency and effect size for genetic variants contributing to human disease, from common to rare. WGS based studies aim to capture the effect of low-frequency and rare alleles with modest effect sizes.

easily interpreted. WES became the dominant method for discovering causal variants for Mendelian diseases while WGS should discover a lot more biologically relevant variants for common complex traits. This is consistent with findings from the ENCODE project that most variants that control protein biochemistry are non-coding and are not within exons (Pennisi, 2012).

Recently a resource based on WES data was made available by the Exome Aggregation Consortium (ExAC). With DNA sequence data for 60,706 individuals of diverse ancestries, this catalogue of human genetic diversity provides one of the most complete resources based on next generation sequence data (Lek et al., 2016).

Since it is still prohibitive to conduct WGS on the large number of samples needed to study the effect of rare variants on phenotypic variation, one of the applications of WGS data is the creation of a reference panel for the imputation of existing datasets with genome-wide SNP array data. A first example was provided by the 1000 Genomes project (1000 Genomes Project Consortium et al., 2012; Sudmant et al., 2015), followed by the data generated by the UK10K project (Huang et al., 2015) and the Haplotype Reference Consortium (McCarthy et al., 2016): this boosted variant coverage and imputation quality, especially for variants with MAF of 1-5%. The combined use of the UK10K and 1000GP panels allowed the identification and replication of several novel loci across a broad range of phenotypes, proving the efficacy of this approach (The UK10K Consortium, 2015). Moreover, it showed how useful it can be to include WGS data belonging to the population under study in a reference panel for imputation (Sidore et al., 2015; Asimit and Zeggini, 2012).

1.3 Isolated cohorts

In genetic association studies, a large sample size is required to identify modest effects on phenotypes at low frequency and rare variants. The power of the analysis can be boosted by using approaches that aggregate rare variants across selected regions (i.e. genes) or a genome-wide scan using moving windows. Another way to increase the statistical power would be to leverage the unique characteristics of founder populations. We can define population isolates as subpopulations derived from small numbers of individuals who became isolated because of a single or a chain of founding events in the demographic history of the isolate. By definition they are characterized by small effective population size (N_e), which results in stronger effects of random genetic drift leading to decreased genetic variability (i.e. higher homogeneity). Another potentially advantageous property of population isolates is the environmental and cultural homogeneity: everyone is exposed to the same factors, decreasing the environmental variance and increasing the power to identify genetic effects.
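The link between a small N_e and reduced variability can be made concrete with the standard expectation for an idealised Wright-Fisher population, in which heterozygosity decays as H_t = H_0 (1 - 1/(2 N_e))^t. The sketch below only evaluates this formula; the N_e values and the number of generations are illustrative assumptions, not estimates for the cohorts studied here.

```python
def expected_heterozygosity(h0, effective_size, generations):
    """Expected heterozygosity after drift in an idealised Wright-Fisher population.

    Implements H_t = H_0 * (1 - 1 / (2 * N_e)) ** t, ignoring mutation,
    migration and selection.
    """
    return h0 * (1.0 - 1.0 / (2.0 * effective_size)) ** generations


# Illustrative values only: a small isolate versus a large outbred population.
for ne in (100, 10_000):
    h = expected_heterozygosity(0.5, ne, generations=40)
    print(f"N_e = {ne:>6}: H after 40 generations = {h:.3f}")
```

Under these assumptions the smaller population loses a visibly larger share of its initial heterozygosity over the same number of generations, which is the homogenising effect of drift described above.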

In population isolates we can see how certain alleles reach fixation or extinction at a particular locus, but some mutations that contribute to complex traits and are rare in outbred populations can be drifted to higher frequencies. This enrichment of low frequency alleles can empower the identification of these variants with a smaller discovery set.

The power of isolated populations in GWAS has been well documented and recently exemplified in the literature through the successful identification of complex trait loci that replicate in other populations, but not all novel discoveries in isolates can be easily replicated in other populations.

A perfect example is the APOC3 variant discovered in an isolated Greek population through the analysis of very low-coverage whole genome sequence data of 1,192 samples (Gilly et al., 2016). This signal could not be replicated through imputation of a wider cohort (∼5,000 samples) and required a gene level meta-analysis of ∼13,000 samples to be proven robust.

populations, coupled with the decreasing costs for deep whole genome sequencing, sets the scene for new discoveries in the near future.

In the following paragraphs we present a brief description of the Italian Isolated cohorts involved in this study.

1.3.1 The Val Borbera and Spinti project

The Val Borbera population is a collection of 1,785 genotyped samples collected in the Val Borbera Valley, a geographically isolated valley located within the Apennine Mountains in Northwest Italy (Traglia et al., 2009). The valley is inhabited by about 3,000 descendants of the original population, living in 7 villages along the valley and in the mountains. Participants were healthy people 18-102 years of age who had at least one grandparent living in the valley. A standard battery of tests was performed by the laboratory of ASL 22 - Novi Ligure (AL), on sera from fasting blood collected in the morning. The project was approved by the Ethical committee of the San Raffaele Hospital and of the Piemonte Region. All participants signed an informed consent.

1.3.2 The Friuli Venezia Giulia genetic park

The Friuli Venezia Giulia population represents a collection of six villages covering a total area of 7,858 km² in a hilly part of the Friuli-Venezia Giulia (FVG) region, located in north-eastern Italy. A recent study (Esko et al., 2013) characterized this population as a genetic isolate with a high level of genomic homozygosity and elevated linkage disequilibrium. Participants were randomly selected people 3-92 years of age. Genotyping and phenotypic data are available for 1,590 samples. People aged < 18 were excluded from analyses. A written informed consent for participation was obtained from all subjects. The project was approved by the Ethical committee of the IRCCS Burlo-Garofolo.

1.3.3 The Carlantino project

CHAPTER 2

WGS data generation and Quality Control

2.1 Introduction

For the purposes of this work, we generated low coverage Whole Genome Sequence data for our study populations. In order to create a reliable call set, due to the nature of the data, we performed an extensive Quality control (QC) process, keeping in mind the ‘Garbage in == Garbage out’ paradigm: the more we refine our data, the less we will have to worry about false positive results in downstream analyses.

There are already best practices in use among the scientific community which help in this process (e.g. GATK’s best practices (DePristo et al., 2011; Van der Auwera et al., 2013)), addressing the QC procedure with different tools and approaches. All of them agree on a three-step workflow:

1. Pre processing: this step is fundamental to perform a first clean-up of the data, removing all those samples which could compromise subsequent analyses due to poor DNA quality or technical issues arising during the sequencing step.

2. Variant discovery: the pre processed data is submitted to a ‘variant caller’ which will return a list of variant sites for each analysed sample.

3. Call set refinement: data from the previous step will be further refined in order to remove genotyping errors or artefacts by comparison with more ‘reliable’ resources. Additional information regarding each variant site (functional consequence annotations, conservation scores, names of genes overlapping the variant position) will be added in this step.

to 10x) and from different data centres, we had to pay particular attention to the pre-processing step, in order to correctly harmonize the data and avoid batch effects. In this section we will describe in detail all the procedures we adopted to create the final release of our data.

2.2 Pre-processing

2.2.1 Sample selection

The initial set of selected samples accounted for a total of 997 individuals across our three cohorts. All samples selected for sequencing already had genotype data from other platforms (SNP array and Exome chip). For Carlantino and Val Borbera cohorts, the sample selection was carried out randomly, while samples belonging to the Friuli Venezia Giulia cohort were selected to better represent each of the different villages which are part of the project. The sequencing was carried out at different sequencing centres: the Wellcome Trust Sanger Institute in Hinxton (UK), the BGI, Shenzhen (PRC) and the HSR in Milan. All the data was post processed at the Sanger Institute, in order to harmonize the quality check process.

2.2.2 Lane level QC

As noted above, although the sequencing was carried out in different centres, we processed all our samples at the Wellcome Trust Sanger Institute in Hinxton (UK): we therefore applied the pre-processing steps in use at the Sanger Institute to all our samples, to better harmonize the data.

The sequencing machines used for our samples were Illumina HiSeq 2000s or 2500s. Illumina’s Sequencing Control Software (SCS) and Real Time Analysis (RTA) programs control the machine and produce Base Call (BCL) files broken down by the lane, cycle and tile of the run. The machine’s runs are monitored by a team of technicians who ensure it is operating correctly and perform the quality control on the lane’s results, annotating or rejecting it if any abnormalities are noticed. The New Pipeline Group (NPG) at the Wellcome Trust Sanger Institute is responsible for the software that supports the run and processes the data up until submission to the European Genome-phenome Archive (EGA).

BAM format.

Where library preparation has produced short templates, these will occasionally be so short that the adaptors are accidentally sequenced. To deal with this, BamAdapterFinder, another tool written by NPG, is used. The BamIndexDecoder tool (part of the illumina2bam suite) is then used to assign to each read from the multiplexed libraries run on the lane a Read group (@RG) tag corresponding to the sample it matches.

The lane BAM is then split by @RG into the lanelets split by the DNA barcode they were given. The control library (PhiX) that was spiked into the run (usually with the tag #168) is separated out, mapped with BWA and fed into the spatial filter to calculate a filter for the lane. This filter is then applied to all the other lanelets in the lane in an attempt to remove any obviously spatially oriented artefacts, specifically INDELs that may have been caused by bubbles entering the flowcell during the run. Now the lanelets are ready to be mapped to the reference genome.

However, in order to eliminate any PhiX that may have contaminated the tag, all human samples are mapped twice, once to the Human reference and once to PhiX. Each sample is first fed to BWA for mapping to the appropriate reference. The BWA step strips out some tag information, so instead of directly using the Sequence alignment/map (SAM) file output by BWA, the alignment is merged into the BAM using an in-house BamMerger tool. The resulting SAM file is then fed to Picard SamFormatConverter, and the samtools fixmate command (Li, 2011) is then used to clean up the pairing and flag errors that are sometimes emitted by BWA.

The Human mapped and PhiX mapped results are then used as input to the AlignmentFilter tool, which removes the PhiX contamination by similarity: a read is removed if it is mapped to PhiX, and otherwise it is put into the lanelet Human BAM. For the next steps in processing, the BAM has to be sorted by coordinate, and this is done using the samtools sort command. Once the BAM has been coordinate sorted, Picard MarkDuplicates (https://broadinstitute.github.io/picard/) is run to mark any optical and PCR duplicates that may have occurred in the lane. This will need to be rerun at library level to eliminate any PCR duplicates, but it is useful at lanelet level for quality control purposes.
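The decision rule of the PhiX filtering step can be sketched in a few lines. This is only an illustration of the rule as described above, not the AlignmentFilter tool itself: reads are represented by their names and the set of PhiX-mapped names is a hypothetical input.

```python
def filter_phix(read_names, phix_mapped_names):
    """Split reads into those kept for the human lanelet and those removed.

    A read is removed if it mapped to PhiX; every other read is kept.
    """
    kept, removed = [], []
    for name in read_names:
        if name in phix_mapped_names:
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Hypothetical read names, for illustration only.
kept, removed = filter_phix(["read1", "read2", "read3"], {"read2"})
print(kept, removed)
```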

is uploaded into the sequencing server. The resulting BAM is also submitted to the EGA for permanent archival.

Once each sample has been processed from the Sanger sequencing pipeline as we described, or imported in the system if belonging to a different sequencing centre, various statistics and graphs based on them are calculated for each file, to perform a first step of QC at the lane level through an automated pipeline developed at the Sanger Institute, called ‘AutoQC’.

This tool is designed to catch errors common in sequencing data and to flag lanelets that are likely to contain poor quality data and increase the noise level. Table 2.1 reports the parameters used for the QC of human WES and WGS data.

After this first step, the files are merged to library level and Picard’s MarkDuplicates is rerun to mark PCR and optical duplicates.

Parameters                      WES                   WGS
duplicate read percentage       8                     3
error rate                      0.02                  0.02
gtype check                     passed or unchecked   passed or unchecked
insert peak reads*              80                    80
insert peak window*             25                    25
mapped base percentage          85                    85
mapped reads properly paired    80                    80
Max insert:deletion ratio       1.0                   1.0
overlapping base dup percent    4                     4

Table 2.1: Parameters used for human whole genome sequence and whole exome sequence in AutoQC - *At least 80% of the inserts must be within 25% of the maximum peak in the insert size graph.
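To illustrate how thresholds of this kind can be applied to a lanelet, we give a minimal sketch below. It is not the AutoQC code: the statistic names, the function names and, in particular, whether each value acts as an upper or a lower bound are our assumptions, and the categorical gtype check is left out. The insert peak check follows the footnote of Table 2.1.

```python
from collections import Counter

# WGS column of Table 2.1. The direction of each bound ("max" or "min")
# is an assumption of this sketch.
WGS_THRESHOLDS = {
    "duplicate_read_percentage": ("max", 3),
    "error_rate": ("max", 0.02),
    "mapped_base_percentage": ("min", 85),
    "mapped_reads_properly_paired": ("min", 80),
    "max_insert_deletion_ratio": ("max", 1.0),
    "overlapping_base_dup_percent": ("max", 4),
}


def insert_peak_fraction(insert_sizes, window=0.25):
    """Percentage of inserts within +/- window of the most frequent insert size."""
    counts = Counter(insert_sizes)
    peak = max(counts, key=counts.get)
    low, high = peak * (1 - window), peak * (1 + window)
    within = sum(1 for size in insert_sizes if low <= size <= high)
    return 100.0 * within / len(insert_sizes)


def failed_checks(stats, insert_sizes, thresholds=WGS_THRESHOLDS,
                  peak_reads=80.0, peak_window=25.0):
    """Return the names of the checks that a lanelet fails."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = stats[name]
        if kind == "max" and value > limit:
            failures.append(name)
        elif kind == "min" and value < limit:
            failures.append(name)
    if insert_peak_fraction(insert_sizes, peak_window / 100.0) < peak_reads:
        failures.append("insert_peak")
    return failures
```

A lanelet with a strongly bimodal insert size distribution would fail the `insert_peak` check in this sketch even when all the other statistics are within their limits.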

The outcome of this step is particularly useful to discover samples which present spatially oriented artefacts (e.g. fig. 2.1) that were not corrected by the first step of the filtering. A new filter file is created with the locations of the reads to be removed or marked as invalid, and is then applied to the selected BAM file. A tool internally developed at the Sanger Institute was used to apply the filter (wtsi-npg/pb_calibration).

2.2.3 Samples re-alignment

Figure 2.1: Two examples of INDELs spikes that need to be removed through spatial filtering

(chromosomal plus unlocalized and unplaced contigs), the rCRS mitochondrial sequence, Human herpesvirus 4 type 1 and the concatenated decoy sequences. BWA v.0.5.10 was used for the alignment step (Li and Durbin, 2010). 54 individuals belonging to the FVG cohort and sequenced at BGI had been aligned to a previous version of the GRCh37 build: we used the ‘Bridgebuilder’ system developed by the Human Genetics Informatics group at the Wellcome Trust Sanger Institute (https://github.com/wtsi-hgi/bridgebuilder) to realign the data. This tool was developed to remap BAM reads to a new reference by first building a "bridge" reference, then mapping to that bridge, and finally remapping only a subset of reads to the full new reference.

2.2.4 Bam improvement

Once we had all BAM files aligned to the same reference and merged at the sample level, they underwent a step of improvement consisting of:

1. Realignment around known and discovered INDELs using GATK’s RealignerTargetCreator and IndelRealigner: RealignerTargetCreator takes a list of known INDELs and the input BAMs and attempts to produce a list of intervals to feed to IndelRealigner.

3. Recalculation of the MD tag by samtools calmd.

4. BAM indexing.

After the BAM improvement process we visually checked all samples, taking into account a) Base content per cycle, b) Coverage, c) GC content, d) Mapped depth, e) Indel count per read cycle, f) Indel count by Indel length, g) Insert size distribution, h) Average quality distribution per cycle, to identify samples which passed the AutoQC pipeline but still had some quality issues and had to be removed from the final dataset.

2.2.5 Pre processing results

We processed each cohort separately. Table 2.2 summarizes the total number of samples sequenced and the number which passed this first QC step.

For the Carlantino cohort, sequencing was carried out using Illumina technology (Genome Analyzer and HiSeq 2000) at the Wellcome Trust Sanger Institute for 115 samples with an average coverage of 4x, while an additional batch of 40 samples was sequenced at Beijing Genomics Institute (BGI) with an average coverage of 10x, for a total of 155 samples.

Among the 115 samples sequenced at the Sanger Institute, 27 failed the quality check at the lane level: 5 were re-processed, since DNA was still available for those samples, while 22 were excluded from further analyses after the last visual QC step. The most common causes of failure were a high percentage of adapter contamination and a bimodal insert size distribution (fig. 2.2).

With respect to this last issue, we noticed that almost all samples presented a bimodal distribution for insert size: this could affect the mapping and calling steps in unpredictable ways, because the aligner and the caller expect an approximately normal distribution. Moreover, a bimodal distribution could affect the number of duplicated reads: this could result in a loss of data, because duplicated reads are marked and skipped during the calling step. We took these observations into account in each of the subsequent QC steps, but we did not notice any major flaw in the data.

Figure 2.2: Example of bimodal distribution of insert size in the Carlantino population.

lane level, thus were excluded from further analyses. Among the BGI set, we had 6 samples that were duplicated from the Sanger pool: we kept them since they did not present any quality issue and we merged the two sets of data to increase each sample’s coverage. We then removed 1 additional sample from this same set because of data corruption.

Figure 2.3: Two examples of samples with quality issues highlighted by the AutoQC pipeline from the Val Borbera cohort.

Cohort                  Sanger            BGI               HSR
Carlantino              115 (4x - 93)     40 (10x)          0
Friuli Venezia Giulia   200 (4x - 196)    192 (10x - 185)   0
Val Borbera             210 (6x - 208)    209 (6x - 208)    29 (6x - 17)

Table 2.2: Number of sequenced samples from each sequencing center in each cohort with mean coverage values and QC passed samples

Finally a total set of 947 samples was sent forward for the Variant Calling step.

2.3 Variant calling and genotype refinement

Genotype calls for autosomal chromosomes were produced for each population separately using the following pipeline. Samtools mpileup (v.1.2) (Li et al., 2009) was used for multisample genotype calling (parameter set: -E -t DP,DV,SP -C50 -pm3 -F0.2 -d 10000). The generated BCF files were converted to VCF format with bcftools call (v.1.2) (parameter set: -Nvm) and a series of filters was applied with bcftools filter (v.1.2) (parameter set: -m+ -sLowQual -e"%QUAL<=0" -g3 -G10 -Ov - ).
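The three commands above can be assembled as in the sketch below, which only builds the argument lists and does not run anything. The parameter sets are those reported in the text; the file names, the chaining of the steps through pipes, and the extra mpileup options used to pass the reference (-f), the BAM list (-b) and to request BCF output (-g) are our assumptions for illustration and may differ from the exact invocation used.

```python
import shlex

# Parameter sets as reported in the text.
MPILEUP_PARAMS = "-E -t DP,DV,SP -C50 -pm3 -F0.2 -d 10000"
CALL_PARAMS = "-Nvm"
FILTER_PARAMS = '-m+ -sLowQual -e"%QUAL<=0" -g3 -G10 -Ov -'


def build_pipeline(reference_fasta, bam_list_file):
    """Return the three commands as argument lists (nothing is executed)."""
    mpileup = ["samtools", "mpileup", *shlex.split(MPILEUP_PARAMS),
               "-g",                   # assumed: emit BCF for bcftools
               "-f", reference_fasta,  # assumed: indexed reference FASTA
               "-b", bam_list_file]    # assumed: file listing the cohort BAMs
    call = ["bcftools", "call", *shlex.split(CALL_PARAMS), "-"]
    filt = ["bcftools", "filter", *shlex.split(FILTER_PARAMS)]
    return [mpileup, call, filt]


def as_shell(commands):
    """Render the commands as a single shell pipeline string."""
    return " | ".join(shlex.join(command) for command in commands)


# Hypothetical file names, for illustration only.
print(as_shell(build_pipeline("reference.fa", "cohort_bams.list")))
```

Keeping the parameter sets as named constants makes it easy to check that every population was called with the same settings.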

Variant Quality Score Recalibrator (VQSR) filtering was applied to the raw call data with GATK v.3.3 (DePristo et al., 2011). First, the raw calls from samtools are used with the Unified Genotyper module in "Given allele mode" to generate a VCF file containing all the annotations needed to calculate the VQSLOD scores through the VariantRecalibrator module, separately for SNVs and INDELs. The filter creates a Gaussian Mixture Model by looking at annotation values over a high quality subset of the input call set and then evaluates all input variants.

For SNVs we used the following parameters: i) Annotations: QD, DP, FS, HaplotypeScore, MQRankSum, ReadPosRankSum, InbreedingCoeff; ii) Training set: HapMap 3.3, Omni 2.5M chip, 1000 Genomes Phase I; iii) Truth set: HapMap 3.3, Omni 2.5M chip; iv) Known set: dbSNP build 138. For INDELs we selected: i) Annotations: DP, FS, ReadPosRankSum, MQRankSum; ii) Training set: Mills-Devine, 1000 Genomes Phase I, dbSNP build 138; iii) Truth set: Mills-Devine; iv) Known set: Mills-Devine, dbSNP build 138.


Figure 2.4: Variant calling and refinement work-flow.

The VQSLOD threshold was chosen from the output produced by VariantRecalibrator, selecting the cut-off that gave the best trade-off between specificity and sensitivity of the trained model. We used the Transition/Transversion (Ti/Tv) ratio as the parameter to select the best threshold, taking as a reference the empirical value of ∼2 calculated by the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2012). For SNPs, the minimum VQSLOD values selected were -59.1994 (99.94% truth sensitivity threshold), -15.0283 (99.80% truth sensitivity threshold) and -22.6034 (99.9% truth sensitivity threshold) for the VBI, FVG and CARL cohorts respectively. Since INDEL calling and alignment are still more prone to error, we took a conservative approach, selecting a sensitivity threshold of 95% for each population. The filter was applied using GATK’s ApplyRecalibration module.
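The Ti/Tv ratio used to guide the threshold choice is simple to compute from a list of SNV calls. The sketch below is a generic illustration, not the procedure implemented in GATK: the function name and the (ref, alt) input format are assumptions.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# changes; every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return the Ti/Tv ratio for an iterable of (ref, alt) base pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return ti / tv


# Toy call set: two transitions and one transversion.
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2.0
```

Computing this ratio on the SNVs passing each candidate VQSLOD cut-off shows how far each filtered call set is from the reference value of about 2.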

Since we worked with low coverage sequence data, we performed several genotype refinement steps on the filtered data: 1. we used BEAGLEv4.r1230 (Browning and Browning, 2007) to assign posterior probabilities to all remaining genotypes; 2. we used SHAPEITv2 (Delaneau et al., 2013) to phase all genotype calls; 3. we used IMPUTEv2 (Howie et al., 2009) to perform internal imputation in order to correct genotyping errors.


Table 2.3 summarizes the number of sites called and refined at this stage for each population. The small discrepancy between the VQSR filtered and Refined set columns for FVG and CARL arises because a few multi-allelic sites were removed during the last refinement step.

Cohort                  Raw calls     VQSR filtered   Refined set
Carlantino              14,907,118    13,375,773      13,375,770
Friuli Venezia Giulia   19,371,926    17,004,558      17,004,556
Val Borbera             21,630,591    19,363,583      19,363,583

Table 2.3: Number of sites called and filtered in this data processing step, for each population.

2.4 Post calling QC

After the genotype calling and refinement step, we performed post-calling QC on samples and sites. First, we checked for batch effects due to the different sequencing centres used: we conducted an MDS analysis on each cohort, plotting the first two principal components stratified by a ‘sequencing centre’ variable (fig. 2.5). We then tested the first principal component for correlation with the sequencing centre variable using Pearson’s correlation test, obtaining a significant result only for the FVG cohort (p=0.001728). We compared this analysis for the FVG cohort with data available from a previous work (Esko et al., 2013), which showed that the pattern we see is consistent with the underlying population structure.
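The correlation underlying this batch-effect check can be sketched as follows. The example assumes that the sequencing centre is coded as a 0/1 indicator, which is only a natural choice when a cohort was sequenced at two centres; the function name and the toy values are illustrative, and the p-value of the test (which needs a t distribution) is left out.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Toy data: first component per sample and a 0/1 sequencing-centre indicator.
first_component = [1.0, 2.0, 3.0, 4.0]
centre = [0, 0, 1, 1]
print(pearson_r(first_component, centre))
```

A coefficient close to zero indicates that the main axis of variation is not aligned with the sequencing centre.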

We then generated a site exclusion list, focusing on: a) Hardy-Weinberg equilibrium; b) Heterozygosity rate; c) MAF mismatch when compared with SNP array data; d) Non Reference Discordance Rate (NRDR), defined as the ratio between the number of discordant calls and the sum of the discordant calls and the concordant calls carrying the alternative allele, when comparing WGS and GWAS data.
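A minimal sketch of the NRDR computation is given below. It assumes the commonly used definition, discordant calls divided by all calls except concordant homozygous-reference ones; the genotype coding (alternative-allele dosage 0, 1 or 2) and the input layout are illustrative choices, not taken from the thesis pipeline.

```python
def nrdr(counts):
    """Non Reference Discordance Rate from genotype-pair counts.

    counts maps (wgs_genotype, array_genotype) -> number of calls, with
    genotypes coded as alternative-allele dosage (0, 1 or 2).
    Concordant homozygous-reference pairs (0, 0) do not enter the rate.
    """
    discordant = sum(n for (wgs, arr), n in counts.items() if wgs != arr)
    concordant_alt = sum(
        n for (wgs, arr), n in counts.items() if wgs == arr and wgs > 0
    )
    denominator = discordant + concordant_alt
    if denominator == 0:
        raise ValueError("no non-reference calls to compare")
    return discordant / denominator


# Toy counts: 900 concordant hom-ref calls are ignored by construction.
toy = {(0, 0): 900, (1, 1): 60, (2, 2): 30,
       (0, 1): 4, (1, 0): 3, (1, 2): 2, (2, 1): 1}
print(nrdr(toy))  # 0.1
```

Excluding the concordant homozygous-reference calls prevents the very large number of reference genotypes from masking errors at variant sites.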


Figure 2.5: Plot of the first two principal components from MDS analysis for a) CARL b) FVG c) VBI. The structure we observe here is due to the structure of each population and not to batch effects introduced by the sequencing centre: we tested the first PCA component versus the Sequencing Centre variable with a Pearson’s correlation test and only for the FVG cohort we obtained a significant p-value (p=0.001728).

We selected, for each population, a subset of sites overlapping with already available genotype information and compared the Minor Allele Frequency distributions (fig. 2.6). On the same subset of sites we calculated the concordance and the Non Reference Discordance Rate (NRDR), removing from the WGS dataset all sites falling outside the boundaries of 3 standard deviations. As shown in table 2.4, discordance values were low; we nevertheless removed 5,552, 2,577 and 2,502 sites from CARL, FVG and VBI respectively.

NRDR (WGS vs SNP array)   CARL     FVG      VBI
By sample                 2.94%    0.46%    0.82%
By site                   2.47%    0.45%    0.72%

Table 2.4: Non Reference Discordance Rate by sample (first row) and by site (second row) for each cohort.

Figure 2.6: MAF comparison between GWAS and WGS data for a) CARL, b) FVG, c) VBI cohorts

Heterozygosity rate and c) Non Reference Discordance rate. We counted the number of singletons for each sample and tagged all samples with a suspicious excess of singletons, removing one sample from the FVG cohort (fig. 2.7). We also calculated the heterozygosity rate for each sample and removed all samples with values more than 3 SD from the average value for each population: one sample was removed from the CARL cohort, one from the FVG cohort and 4 from the VBI cohort. Finally, we calculated the non reference discordance rate for each sample and removed all individuals with an NRDR greater than 5%: 8 samples from the CARL cohort, 1 from the FVG cohort and 5 from the VBI cohort (tab. 2.4).
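The 3 SD rule applied to the per-sample heterozygosity rate can be sketched as below. The function name, the sample identifiers and the toy rates are hypothetical, and the sample standard deviation is used as an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, stdev


def flag_outliers(rate_by_sample, n_sd=3.0):
    """Return the sorted IDs of samples more than n_sd SDs from the mean."""
    values = list(rate_by_sample.values())
    centre, spread = mean(values), stdev(values)
    return sorted(
        sample for sample, rate in rate_by_sample.items()
        if abs(rate - centre) > n_sd * spread
    )


# Toy cohort: 19 samples with similar rates and one clear outlier.
rates = {"S{:02d}".format(i): (0.130 if i <= 10 else 0.132) for i in range(1, 20)}
rates["S20"] = 0.30
print(flag_outliers(rates))  # ['S20']
```

The same function can be applied to any per-sample metric, run separately for each population so that the mean and SD reflect that population only.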

Table 2.5 shows variant counts for each population after all the filtering steps, split into different categories.


Figure 2.7: Singleton count versus Average coverage per sample, for the INGI cohorts.

Figure 2.8: NRDR versus Heterozygosity rate per sample, for a) CARL b) FVG c) VBI.

2.5 Discussion

The extensive QC on the data allowed us to generate a data release for each cohort in which we minimised the genotyping error. In the following paragraphs we will provide some examples of the information we were able to extract from the generated data.

2.5.1 Variants distribution and comparison with outbred populations


                                          CARL          FVG           VBI           INGI (all samples)
Samples                                   124           378           424           926
Females                                   66            220           249           535
Males                                     58            158           175           391
Average coverage                          6.31          7.23          6.12          6.55
Sites                                     13,370,262    17,002,010    19,361,094    26,619,091
Multiallelic Sites                        248,638       356,599       393,328       560,918
SNPs                                      12,208,629    15,521,313    17,830,208    24,557,366
INDELs                                    1,161,633     1,480,697     1,530,886     2,061,725
Sites MAF <= 1%                           3,627,622     7,283,720     9,416,028     16,685,951
Sites 1% < MAF <= 5%                      3,007,162     3,069,534     3,121,545     3,125,971
Sites MAF > 5%                            6,735,478     6,648,756     6,823,521     7,123,064
Singletons SNPs                           2,061,824     2,784,746     3,554,744     6,193,486
Singletons INDELs                         92,372        131,275       133,156       273,679
Average Heterozygosity rate per sample    17.57%        13.27%        12.16%        13.34%
Average Derived allele count per sample   4,703,290     4,741,910     4,844,980     4,763,393.3
Average singleton per sample              17,285        7,671         8,646         6,925

Table 2.5: Results of variant calling for all the INGI cohorts.

Figure 2.9: Minor allele frequency spectrum of the final data set. The MAF spectrum is calculated separately for each population, on a subset of unrelated individuals. The y-axis shows the proportion of sites in each frequency bin relative to the total number of variants identified in each population.


INGI
          Private variants                        Shared variants
          CARL only    VBI only     FVG only      COMMON       VBI-CARL    VBI-FVG      FVG-CARL
SNPs      2,086,255    6,228,040    4,286,183     9,045,896    721,758     1,834,514    354,720
INDELs    151,822      359,316      325,779       886,471      69,417      213,110      53,191
TOT       2,238,077    6,587,356    4,611,962     9,932,367    791,175     2,047,624    407,911

Table 2.6: Private and shared variants between INGI cohorts.

Table 2.7 shows the overlap between the INGI cohorts and the outbred samples from the EUR superpopulation of the 1000G Project phase 3: there is a high overlap of sites common to all the INGI cohorts and to the outbred reference, and the amount of sharing of each single INGI cohort is increased. This highlights the differences between our cohorts. Looking at the private variants with respect to the EUR superpopulation, we see, as expected, a reduction in private sites, but a substantial number of variants remain private to our cohorts. Figure 2.10 shows how the overlap between the INGI cohorts and the EUR population is spread across the allele frequency spectrum: the majority of the private variants fall in the rare and low frequency range (MAF<2%), suggesting that those variants could increase the number of low frequency variants imputed once the data are included in a reference panel. To complete the comparison, we also checked the overlap with the complete 1000G Phase 3 resource: as presented in table 2.8, a substantial share of sites remains private to each of our cohorts.

INGI vs EUR from Phase 3
          CARL only    VBI only     FVG only     INGI only     COMMON       CARL-EUR      VBI-EUR       FVG-EUR
SNPs      1,707,580    4,984,114    3,425,000    11,259,369    8,781,522    9,838,929     11,824,435    11,230,368
INDELs    110,022      278,604      249,676      884,820       761,251      877,922       1,024,362     1,008,924
TOT       1,817,602    5,262,718    3,674,676    12,144,189    9,542,773    10,716,851    12,848,797    12,239,292

Table 2.7: Private and shared variants from the comparison of INGI cohorts and EUR samples from the 1000G project phase 3 data.

2.5.2 Insight on Loss Of function variants

Figure 2.10: Overlapping variants between a) CARL b) FVG c) VBI and the EUR samples from the 1000GP3 project. Blue colour indicates shared variants and red indicates private variants.

deleteriousness of single nucleotide variants as well as insertion/deletion variants (Kircher et al., 2014). We were able to stratify our data in different ways using this information. Figure 2.11 shows how the different Loss Of Function (LOF) categories are distributed with respect to the CADD score: the more deleterious classes clearly show higher CADD scores.


INGI vs 1000G Phase 3
          CARL only    VBI only     FVG only     INGI only    COMMON       CARL-1000GP   VBI-1000GP    FVG-1000GP
SNPs      1,347,687    4,097,163    2,858,434    9,154,172    8,818,753    10,361,586    12,958,501    12,007,860
INDELs    98,294       253,621      231,696      819,720      763,799      895,985       1,058,293     1,034,987
TOT       1,445,981    4,350,784    3,090,130    9,973,892    9,582,552    11,257,571    14,016,794    13,042,847

Table 2.8: Private and shared variants from the comparison of INGI cohorts and 1000G project phase 3 data.

Figure 2.11: Loss Of Function variants stratified by CADD score in the whole INGI dataset.

2.5.3 Variants enrichment

We also used the CADD score to select, in each population, a subset of sites to check for frequency enrichment with respect to the general EUR population from the 1000G project; to get a clearer picture we excluded samples of Italian origin (TSI). In table 2.10 we report the number of variants with a CADD score greater than 20 for each population and the number of sites we tested for frequency enrichment versus the EUR-TSI cohort. A Fisher’s Exact Test on Alternative Allele counts was performed in order to assess the significance of the enrichment. The comparison was carried out on the Alternative Allele frequency (AF), since the CADD score of a variant refers to the alternative allele.
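A minimal sketch of such a test on allele counts is shown below. It implements a one-sided Fisher’s Exact Test (enrichment in the isolate) directly from the hypergeometric distribution; the one-sided alternative, the function name and the toy counts are assumptions, since the text does not specify how the test was configured.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table

        [[a, b],   a = alt alleles in the isolate,   b = ref alleles in the isolate
         [c, d]]   c = alt alleles in the reference, d = ref alleles in the reference

    i.e. the hypergeometric probability of seeing at least `a` alternative
    alleles in the isolate given the table margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(total, row1)


# Toy table: 3 alt / 1 ref alleles in the isolate, 1 alt / 3 ref in the reference.
print(fisher_exact_greater(3, 1, 1, 3))  # 17/70, about 0.243
```

With many sites tested per cohort, the resulting p-values need a multiple-testing correction; table 2.10 reports a Bonferroni-corrected count.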


SNPs
Cohort   CADD     # Variants    # Singleton   %         Class
CARL     0-5      9,515,114     1,529,371     16.07%    splice_donor
CARL     5-15     2,477,035     473,424       19.11%    splice_donor
CARL     15-20    171,074       37,619        21.99%    splice_donor
CARL     >=20     48,158        13,180        27.37%    stop_gain
FVG      0-5      11,937,931    2,048,685     17.16%    splice_donor
FVG      5-15     3,277,258     648,866       19.80%    splice_donor
FVG      15-20    235,989       52,516        22.25%    splice_donor
FVG      >=20     74,228        20,471        27.58%    stop_gain
VBI      0-5      13,683,182    2,609,835     19.07%    splice_donor
VBI      5-15     3,793,973     831,647       21.92%    splice_donor
VBI      15-20    271,373       67,147        24.74%    splice_donor
VBI      >=20     87,036        26,526        30.48%    stop_gain

INDELs
Cohort   CADD     # Variants    # Singleton   %         Class
CARL     0-5      762,902       30,442        3.99%     frameshift
CARL     5-15     305,572       18,864        6.17%     frameshift
CARL     15-20    13,810        1,130         8.18%     frameshift
CARL     >=20     2,931         376           12.83%    frameshift
FVG      0-5      908,295       35,285        3.88%     frameshift
FVG      5-15     376,224       22,482        5.98%     frameshift
FVG      15-20    17,522        1,375         7.85%     frameshift
FVG      >=20     3,531         449           12.72%    frameshift
VBI      0-5      922,743       31,264        3.39%     frameshift
VBI      5-15     384,963       20,238        5.26%     frameshift
VBI      15-20    17,814        1,289         7.24%     frameshift
VBI      >=20     4,267         535           12.54%    frameshift

Table 2.9: SNPs and INDELs stratified by CADD score.

Cohort   AF bin              N             CADD >= 20   Fisher tested   # delta>0 sig.   # p-val. sig. (Bonferroni)
FVG      AF <= 0.02          8,257,252     53,809       20,907          987              31
FVG      0.02 < AF <= 0.05   1,538,631     6,525        6,132           1,175            63
FVG      AF > 0.05           6,849,512     17,539       17,143          1,744            59
FVG      tot                 16,645,395    77,873       44,182          3,906            153
VBI      AF <= 0.02          10,306,733    66,742       23,108          723              44
VBI      0.02 < AF <= 0.05   1,595,109     6,305        6,018           1,008            45
VBI      AF > 0.05           7,065,916     17,838       17,281          1,867            76
VBI      tot                 18,967,758    90,885       46,407          3,598            165
CARL     AF <= 0.02          4,744,815     27,694       12,704          505              13
CARL     0.02 < AF <= 0.05   1,403,135     5,624        5,058           966              22
CARL     AF > 0.05           6,973,667     17,531       17,094          1,804            54
CARL     tot                 13,121,617    50,849       34,856          3,275            89


that are not shared between the INGI cohorts.

It is interesting to see how the three populations share the same enrichment pattern in the lowest frequency bin, with the majority of variants showing a high level of enrichment (4 to 15 fold). A shared pattern is also visible in the ‘common variants’ bin, with variants enriched 2 to 4 fold. In the ‘low frequency’ bin (fig. 2.12 panels d, e, f) FVG and VBI again seem to share a similar enrichment pattern, with most of the sites enriched 2 to 4 fold, while CARL shows an increase in highly enriched variants (5 to 38 fold). The characterization of those enriched variants is the focus of an ongoing project, which will also investigate the causes of this enrichment (genetic drift or selection).

Figure 2.12, panels a) to i).


CHAPTER 3

Reference Panel

3.1 Introduction

As already pointed out, it is still prohibitively expensive to conduct WGS on a large number of samples to investigate the effect of rare variants on phenotypic variation, so a different solution was developed to exploit the WGS data available: the creation of reference panels from WGS data for genotype imputation.

Genotype imputation can be defined as "the prediction of genotypes at SNP that have not been assayed in an association study" (Marchini and Howie, 2008). The underlying mechanism of imputation is simple: we have data we are confident about but with poor resolution (i.e. study genotypes from SNP arrays), then we have datasets with high resolution and high confidence (i.e. reference panels), and we use the latter to make inferences on missing data and fill the ‘holes’ in our study genotypes.

We have already cited the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2012; Sudmant et al., 2015), the UK10K project (Huang et al., 2015) and the more recent Haplotype Reference Consortium (McCarthy et al., 2016). These projects aimed to provide the scientific community with a reliable resource to boost variant coverage and imputation quality for variants with MAF of 1-5%. They also made it possible to compare, through meta-analyses, different cohorts imputed with the same ‘standardized’ resource, increasing the sample size of a study and its power to detect associations. Moreover, the work of Sidore et al. (Sidore et al., 2015) shows how useful it can be to include WGS data from the population under study in a reference panel for imputation.


3.2 Methods

Once the WGS data from our INGI cohorts were deemed reliable enough to create a data release, we defined a work-flow to further select a ‘highly reliable’ subset of variants to include in our reference panel. The work-flow was implemented as a pipeline of custom scripts written in a mixture of Bash and Python.

A key requirement of the pipeline was the ability to compare variants across different resources. To avoid mismatches between datasets, we split all multi-allelic variant sites into separate VCF records and normalized all INDELs. We processed the data from the 1000G Project phase 3 and the UK10K project (The UK10K Consortium, 2015) in the same way. This allowed us to define a unique key for comparison across datasets: the combination of chromosome, position, reference allele and alternative allele.
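The idea of a per-allele comparison key can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the normalization shown only trims bases shared by the two alleles. Full INDEL normalization also left-aligns against the reference sequence, which this sketch does not do.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, first at the end and then at the start.

    Simplification: no left-alignment against the reference genome.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


def variant_keys(chrom, pos, ref, alts):
    """Split a (possibly multi-allelic) record into one key per ALT allele."""
    keys = []
    for alt in alts.split(","):
        new_pos, new_ref, new_alt = trim_alleles(pos, ref, alt)
        keys.append((chrom, new_pos, new_ref, new_alt))
    return keys


# A multi-allelic deletion site becomes two independent keys.
print(variant_keys("1", 100, "CAA", "CA,C"))
# [('1', 100, 'CA', 'C'), ('1', 100, 'CAA', 'C')]
```

Once every dataset is reduced to such keys, overlaps between cohorts and external resources become plain set intersections.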

Figure 3.1 shows the work-flow we defined to create the list of sites to be included in our panel. We worked separately with each INGI cohort. First, we created two separate sets for SNPs and INDELs, then selected all sites with Alternative Allele count (AC) >= 2 and Read depth (DP) >= 5. We then added all the singleton sites (AC = 1) that were shared by at least two INGI cohorts, were known sites, or were present in at least one of the selected external resources (UK10K and 1000G Project Phase 3).
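The selection rule can be written as a single predicate, as in the sketch below. The argument names are hypothetical, and the sketch assumes that rescued singletons are not subject to the DP >= 5 requirement, which the text does not state explicitly.

```python
def keep_site(ac, dp, n_ingi_cohorts, is_known, in_external):
    """Decide whether a site enters the reference panel.

    ac             -- alternative allele count in the cohort
    dp             -- read depth at the site
    n_ingi_cohorts -- number of INGI cohorts in which the site is observed
    is_known       -- True if the site is an already known variant
    in_external    -- True if the site is in UK10K or 1000G Project Phase 3
    Assumption: singletons are rescued regardless of depth.
    """
    if ac >= 2 and dp >= 5:
        return True
    if ac == 1:
        return n_ingi_cohorts >= 2 or is_known or in_external
    return False


print(keep_site(ac=3, dp=12, n_ingi_cohorts=1, is_known=False, in_external=False))  # True
print(keep_site(ac=1, dp=12, n_ingi_cohorts=1, is_known=False, in_external=False))  # False
```

Requiring independent support for singletons keeps rare variants that are likely real while discarding those most likely to be sequencing errors.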

As the output of these selection steps, we obtained for each cohort a VCF file containing the final set of variants to be used to generate the reference panel.

Figure 3.1: Work flow used to select variants to be included in the reference panel


merged panel: a) selecting only the sites overlapping between the references or b) creating a ‘union’ set of the data, which then has to be further processed to ‘fix’ any missing variants represented in only one of the selected cohorts.

Since we wanted to take advantage of the peculiar characteristics of our populations and use all the information we could, we chose the second option. In particular, we adopted the solution implemented in the IMPUTE2 software (Howie et al., 2011). Figure 3.2 shows how the different panels are combined by the software.

Figure 3.2: This figure, from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html, schematises the solution implemented in the IMPUTE2 software for reference panel merging. The top panel shows two reference panels and a GWAS cohort; rows represent individuals and columns represent positions along the genome. Each vertical line represents a genotyped variant, and each reference panel includes variants that are not found in the other. The untyped variants are imputed in three steps: 1. Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation. 2. Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation. 3. Now that we have imputed the two reference panels up to the union of their variants, take the best-guess haplotypes and impute the GWAS cohort.


input data from two reference panels at a time, we proceeded in two steps, first merging the Carlantino and the Friuli Venezia Giulia cohorts, then adding the Val Borbera cohort to the result of the first merge. A sample of the command line used to merge the Carlantino and Friuli Venezia Giulia cohorts is shown below:

    impute2 -allow_large_regions \
        -m genetic_map_chr1_combined_b37.txt \
        -h 1.INGI_REF.CARL.hap.gz 1.INGI_REF.FVG.hap.gz \
        -l 1.INGI_REF.CARL.legend.gz 1.INGI_REF.FVG.legend.gz \
        -k_hap 2000 2000 \
        -merge_ref_panels \
        -merge_ref_panels_output_ref 1.INGI_REF.CARL_FVG.008 \
        -int 21010286 24010286 \
        -Ne 20000 \
        -buffer 500 \
        -i 1.INGI_REF.CARL_FVG.008.info

We selected the fine-scale recombination map provided by the 1000G Phase 3 project for the region to be analysed (option -m), then specified the haplotype and legend files to be merged (options -h and -l). We split each chromosome into chunks of 3 megabases to reduce the computational time through parallelization (option -int). We set the k_hap parameter to 2000 for each cohort in order to use the maximum number of haplotypes provided by our samples to infer missing genotypes. The options -merge_ref_panels and -merge_ref_panels_output_ref were needed to instruct the software to save the merged reference panel as an output file.
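The chunking used for the -int option can be sketched as below. The function names are hypothetical, and the convention that each chunk starts one base after the previous one ends is an assumption: the text only states the 3 megabase chunk size.

```python
def chunk_intervals(start, end, size=3_000_000):
    """Yield consecutive (lower, upper) bounds covering [start, end].

    Each chunk spans at most `size` bases beyond its lower bound; the next
    chunk starts one base after the previous upper bound (an assumption).
    """
    lower = start
    while lower <= end:
        upper = min(lower + size, end)
        yield lower, upper
        lower = upper + 1


def int_option(lower, upper):
    """Format one chunk as the argument list of the IMPUTE2 -int option."""
    return ["-int", str(lower), str(upper)]


for bounds in chunk_intervals(21010286, 27010286):
    print(" ".join(int_option(*bounds)))
# -int 21010286 24010286
# -int 24010287 27010286
```

Each chunk can then be submitted as an independent job, which is what makes the merge parallel across a chromosome.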

In order to test the performance of the resource, we created different versions of the reference panel. The first version included the samples belonging to the INGI cohorts and those from the TSI cohort of the 1000G Project phase 3 (INGI+TSI). A second version consisted of the merge between the INGI panel and the whole 1000G Project phase 3 reference panel (INGI+1000G).

The imputation test was performed on the INGI cohorts as well as on a test cohort of 567 unselected samples collected in the North West of Italy (referred to as the NW-ITALY cohort henceforth).


Cohort      Samples   Males   Females   Genotyping platform
CARL        504       304     200       Humancnv370-quad V3.0, HumanExome-12 v1.2
FVG         1179      512     667       Humancnv370-quad V3.0, HumanOmniExpress-12v1 C, HumanExome-12 v1.2
VBI         1317      588     729       Humancnv370-quad V3.0, HumanOmniExpress-12v1 C, HumanExome-12 v1.2
NW-ITALY    567       363     204       Infinium OmniExpressExome-8

Table 3.1: Sample size for each imputed cohort and available genotyping platform.


3.3 Results

Table 3.2 shows, for each INGI cohort, the number of variants (SNPs and INDELs) which were selected at each step of the defined workflow.

Type     Cohort   AC >= 2 and DP >= 5   Singletons overlapping with   Singletons overlapping between   TOTAL
                                        1000GPh3 or UK10K             at least 2 INGI cohorts
SNPs     VBI      14,269,966            1,079,579                     639,777                          15,989,322
SNPs     FVG      12,750,969            801,378                       704,408                          14,256,755
SNPs     CARL     10,072,498            462,609                       874,559                          11,409,666
INDELs   VBI      1,778,080             36,801                        20,104                           1,834,985
INDELs   FVG      1,711,655             36,254                        27,318                           1,775,227
INDELs   CARL     1,318,278             19,636                        35,310                           1,373,224

Table 3.2: Variants selected from each INGI cohort to be included in the reference panel.

In table 3.3 we show the genome wide number of sites present in each panel used for testing and for subsequent imputation. As we can see in table 3.4, we contributed 7.8% to the merged reference panel INGI+1000GP3 with data belonging only to the INGI cohorts.

Panel           SNPs          INDELs       Multi-allelic SNPs   Multi-allelic INDELs
INGI            20,824,903    2,510,222    78,666               445,975
1000GP3         78,397,635    3,308,387    259,986              106,002
INGI+TSI        23,574,532    2,925,694    88,721               468,946
INGI+1000GP3    83,963,965    4,645,443    408,060              513,448

Table 3.3: Number of sites for each reference panel generated for testing purposes.

          Sites INGI    Sites INGI+1000GP3   Sites 1000GP3   Sites added by INGI
SNPs      20,824,903    83,963,965           78,397,635      5,566,330
INDELs    2,510,222     4,645,443            3,308,387       1,337,056
tot       23,335,125    88,609,408           81,706,022      6,903,386

Table 3.4: Comparison of sites added by merging INGI reference panel with 1000GP phase 3 reference.
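The figures in table 3.4 can be re-derived with a short calculation; the variable names below are illustrative, while the numbers are those reported in the table.

```python
# Site counts transcribed from table 3.4.
MERGED = {"SNPs": 83_963_965, "INDELs": 4_645_443}   # INGI+1000GP3
PHASE3 = {"SNPs": 78_397_635, "INDELs": 3_308_387}   # 1000GP3 alone

# Sites present in the merged panel but not in 1000GP3, per variant type.
added = {kind: MERGED[kind] - PHASE3[kind] for kind in MERGED}
total_added = sum(added.values())
share = total_added / sum(MERGED.values())

print(added)                  # {'SNPs': 5566330, 'INDELs': 1337056}
print(total_added)            # 6903386
print(round(100 * share, 1))  # 7.8
```

The result matches the 7.8% contribution of INGI-only sites to the merged panel quoted in the text.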


we show, for minor allele frequency smaller than 5%, for each test cohort, the average values of r2.

As we can see, the panels including our own data (red and green lines) outperform the ‘standard’ 1000GP phase 3 reference panel for the INGI populations. When we test the panels on an outbred population, the performances are comparable. For frequencies lower than 1%, the Carlantino cohort seems to behave in a similar way to the outbred population: this could be due to the small haplotype set included in the reference panel compared with the other two INGI cohorts, or to the underlying structure of the population (the data generated in this project are currently being used to investigate the structure of this isolated population).

Figure 3.3: Mean values of r2 stratified by minor allele frequency.

Table 3.5 shows the numbers underlying fig. 3.3: for the lower frequency bins, the mean value of r2 is generally higher when we include our WGS data in the reference panel for the INGI populations. We can also see that there is no appreciable gain in quality in the outbred (NW-ITA) population.

The second metric used to assess imputation quality is the ‘info score’. The info metric can be used to remove poorly imputed SNPs from association testing results. In figure 3.4 we show how the different references behave, at low frequency, in terms of info score.


Figure 3.4: Values of info scores stratified by minor allele frequency.


                        INGI+TSI            INGI+1000GP3        1000GP3
Cohort    MAF           N sites   mean      N sites   mean      N sites   mean
CARL      <= 0.5%       645       0.3417    777       0.4477    722       0.4620
CARL      0.5% - 1%     315       0.4827    344       0.5374    313       0.5451
CARL      1% - 5%       1,528     0.7226    1,540     0.7639    1,528     0.7325
CARL      >= 5%         25,578    0.8982    25,583    0.9069    25,557    0.8850
FVG       <= 0.5%       975       0.3763    1,079     0.4556    979       0.4016
FVG       0.5% - 1%     328       0.6041    372       0.6554    316       0.5360
FVG       1% - 5%       1,222     0.7898    1,223     0.8058    1,213     0.7195
FVG       >= 5%         15,791    0.8983    15,799    0.9071    15,779    0.8438
VBI       <= 0.5%       1,296     0.2417    1,820     0.2335    1,636     0.2087
VBI       0.5% - 1%     495       0.5051    509       0.5393    469       0.4682
VBI       1% - 5%       1,627     0.7818    1,629     0.8050    1,619     0.7461
VBI       >= 5%         26,808    0.9362    26,809    0.9422    26,778    0.9111
NW-ITA    <= 0.5%       2,694     0.3732    4,337     0.5812    4,183     0.6036
NW-ITA    0.5% - 1%     831       0.6293    845       0.7626    841       0.7742
NW-ITA    1% - 5%       4,382     0.8320    4,382     0.8712    4,375     0.8735
NW-ITA    >= 5%         46,368    0.9281    46,371    0.9385    46,335    0.9388

Table 3.5: Mean values of r2 and number of sites for each frequency bin in the different populations.

3.4 Discussion


                                  INGI+TSI             INGI+1000GP3         1000GP3
Cohort    MAF        INFO BIN     N sites    %         N sites    %         N sites    %
CARL      <= 0.5%    0.2          249,966    40.89%    384,735    35.27%    411,773    41.69%
CARL      <= 0.5%    0.4          141,388    23.13%    237,605    21.78%    205,986    20.86%
CARL      <= 0.5%    0.6          95,984     15.70%    169,309    15.52%    134,497    13.62%
CARL      <= 0.5%    0.8          68,179     11.15%    138,947    12.74%    106,712    10.80%
CARL      <= 0.5%    1            55,847     9.13%     160,120    14.68%    128,659    13.03%
FVG       <= 0.5%    0.2          138,918    24.34%    208,007    22.70%    384,192    39.64%
FVG       <= 0.5%    0.4          159,504    27.95%    238,894    26.07%    263,543    27.19%
FVG       <= 0.5%    0.6          150,002    26.28%    239,173    26.10%    183,333    18.92%
FVG       <= 0.5%    0.8          122,299    21.43%    230,332    25.13%    138,085    14.25%
FVG       <= 0.5%    1            74,111     11.49%    187,589    16.99%    96,939     9.09%
VBI       <= 0.5%    0.2          112,857    17.30%    180,980    15.92%    309,013    30.06%
VBI       <= 0.5%    0.4          138,370    21.22%    211,835    18.63%    241,107    23.46%
VBI       <= 0.5%    0.6          143,248    21.96%    224,152    19.72%    180,750    17.58%
VBI       <= 0.5%    0.8          136,646    20.95%    242,050    21.29%    148,491    14.45%
VBI       <= 0.5%    1            121,079    18.56%    277,924    24.44%    148,525    14.45%
NW-ITA    <= 0.5%    0.2          189,195    27.33%    313,018    21.77%    254,779    21.34%
NW-ITA    <= 0.5%    0.4          188,402    27.22%    323,960    22.53%    241,114    20.19%
NW-ITA    <= 0.5%    0.6          147,768    21.35%    279,392    19.43%    204,901    17.16%
NW-ITA    <= 0.5%    0.8          90,987     13.14%    246,843    17.17%    216,962    18.17%
NW-ITA    <= 0.5%    1            75,891     10.96%    274,486    19.09%    276,277    23.14%


combine our data together without losing information typical of each cohort. Merging with the data available from the 1000G Project phase 3 allowed us to exploit the information added by the genotypic variation available from cohorts with different ancestry and to greatly increase the number of variants imputed. Moreover, for the INGI cohorts we saw an improvement both in the ‘absolute’ quality of the imputation (the r2 metric) and in the ‘overall’ imputation quality (the info score parameter).


CHAPTER 4

Applications

4.1 Introduction

In this chapter we will show a proof of concept application of the work done on the reference panel creation.

One direct application for the data we generated would be the analysis of rare variants from Whole Genome Sequences using gene based tests (or window based tests, in the case of genome wide analyses) such as the one implemented in the SKAT package (Wu et al., 2011): unfortunately, the sample size of this first batch of the INGI cohorts does not allow such analyses.

The main focus of this work, though, was to create a resource to support GWAS analyses, increasing the power to identify new causative variants in both known and new loci. We show here how the resource created allowed us to carry out a GWAS analysis on our INGI cohorts and to perform a meta-analysis to identify putative causative loci. In addition, to compare the outcome, we carried out parallel analyses using data imputed with a previous reference panel (from the 1000 Genomes Project Phase 1), which does not include our WGS data and is widely used for imputation in different consortia.

4.2 Methods

For this test we selected the phenotype Mean Corpuscular Haemoglobin (MCH). We carried out imputation for each cohort using our custom reference panel (INGI + 1000GP3 version) with the IMPUTEv2 software. GWAS analysis was performed in the three populations (Carlantino, Friuli Venezia Giulia, Val Borbera) separately, using age and gender as covariates and an additive model. The analysis was carried out using mixed linear models as implemented in the R GenABEL and ProbABEL packages (Karssen et al., 2016). Genomic kinship was used to take relatedness into account. We selected only variants with MAF ≥ 5% and info score ≥ 0.3. The meta-analysis was performed using an inverse-variance weighting method. In a subsequent step we also investigated results from variants with MAF ≥ 1%. We compared our results with the published associated variants reported in http://www.ebi.ac.uk/gwas/. We also compared our new analysis with results from a previous GWAS on the same trait, performed after imputation with the 1000G project phase 1 panel. Regional plots were generated using the LocusZoom software (Pruim et al., 2010).
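The inverse-variance weighting step can be illustrated with a fixed-effect sketch. The function name and the toy inputs are hypothetical, and the two-sided p-value is obtained from a normal approximation; the software actually used for the meta-analysis may differ in details such as genomic control.

```python
from math import erfc, sqrt


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted meta-analysis of one variant.

    Returns the pooled effect, its standard error and a two-sided p-value
    based on the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    weight_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    se = sqrt(1.0 / weight_sum)
    z = beta / se
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, p_value


# Toy per-cohort estimates for a single variant (not taken from the thesis).
beta, se, p = inverse_variance_meta([0.30, 0.25, 0.33], [0.10, 0.12, 0.08])
print(round(beta, 3), round(se, 3), p)
```

Cohorts with smaller standard errors, typically the larger ones, contribute more to the pooled estimate.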

Figure 4.1: Distribution of the MCH phenotype in the three INGI cohorts.

4.3 Results

The GWAS was carried out on 436 individuals in CARL, 1232 in FVG and 1591 in VBI. The mean (standard deviation) of MCH was 29.30 (2.86), 29.82 (1.7) and 30 (1.93) in CARL, FVG and VBI respectively (fig. 4.1).

Figure 4.2 shows the Manhattan plots of the meta-analysis for the 1000 Genomes reference panel and the INGI+1000GP reference.

respectively.

Figure 4.2: Manhattan plot of the meta-analysis results on the INGI cohorts for the MCH phenotype. The results were filtered by MAF ≥ 5% and info score ≥ 0.3: a) using data imputed with the 1000G Phase 1 reference panel, b) using data imputed with the INGI+1000G Phase 3 reference panel. The highlighted loci represent known association signals from the GWAS catalog (Burdett T (EBI) et al.).

SNP          Chr   Position    Other allele   Refer allele   N      beta   se     Dir   p          gene      feature     left_gene   right_gene
rs9832259    3     45444357    G              A              3257   0.33   0.06   +++   4.75E-08   LARS2     intron      TMEM158     LOC100130135
rs11759553   6     135422296   A              T              3257   0.29   0.05   +++   3.76E-08   NA        NA          HBS1L       MYB
rs9373124    6     135423209   T              C              3257   0.29   0.05   +++   3.89E-08   NA        NA          HBS1L       MYB
rs35959442   6     135424179   C              G              3257   0.29   0.05   +++   2.99E-08   NA        NA          HBS1L       MYB
rs4895440    6     135426558   A              T              3257   0.29   0.05   +++   1.79E-08   NA        NA          HBS1L       MYB
rs4895441    6     135426573   A              G              3257   0.29   0.05   +++   3.06E-08   NA        NA          HBS1L       MYB
rs9376092    6     135427144   C              A              3253   0.29   0.05   +++   2.15E-08   NA        NA          HBS1L       MYB
rs9389269    6     135427159   T              C              3257   0.29   0.05   +++   4.55E-08   NA        NA          HBS1L       MYB
rs9402686    6     135427817   G              A              3257   0.29   0.05   +++   3.18E-08   NA        NA          HBS1L       MYB
rs7758845    6     135428537   A              C              3257   0.30   0.05   +++   1.58E-08   NA        NA          HBS1L       MYB
rs6920211    6     135431318   T              C              3257   0.31   0.05   +++   1.72E-08   NA        NA          HBS1L       MYB
rs9494142    6     135431640   T              C              3257   0.30   0.05   +++   2.82E-08   NA        NA          HBS1L       MYB
rs9494145    6     135432552   T              C              3255   0.31   0.06   +++   2.22E-08   NA        NA          HBS1L       MYB
rs6000553    22    37469192    A              G              3257   0.28   0.05   +++   1.22E-08   TMPRSS6   intron      KCTD17      IL2RB
rs4820268    22    37469591    G              A              3257   0.30   0.05   +++   4.54E-10   TMPRSS6   reference   KCTD17      IL2RB
rs2076085    22    37470041    C              A              3257   0.28   0.05   +++   9.06E-09   TMPRSS6   intron      KCTD17      IL2RB
rs2413450    22    37470224    T              C              3257   0.28   0.05   +++   1.19E-08   TMPRSS6   intron      KCTD17      IL2RB
rs2072860    22    37470604    G              A              3257   0.27   0.05   +++   2.87E-08   TMPRSS6   intron      KCTD17      IL2RB

Table 4.1: Significant results of meta-analysis with 1000 Genomes phase I imputation reference panel (populations order: FVG, CARL, VBI).


SNP         Chr   Position    Other allele   Refer allele   N      beta    se     Dir   p          gene      feature     left_gene   right_gene
rs854200    3     45460460    T              C              3257   0.31    0.05   +++   1.86E-08   LARS2     intron      TMEM158     LOC100130135
rs6769129   3     45463682    T              A              3257   0.31    0.05   +++   1.40E-08   LARS2     intron      TMEM158     LOC100130135
rs6769240   3     45463814    T              A              3257   0.31    0.05   +++   1.40E-08   LARS2     intron      TMEM158     LOC100130135
rs4895440   6     135426558   A              T              3257   0.28    0.05   +++   4.74E-08   NA        NA          HBS1L       MYB
rs9376092   6     135427144   C              A              3257   0.29    0.05   +++   2.53E-08   NA        NA          HBS1L       MYB
rs9389269   6     135427159   T              C              3257   0.29    0.05   +++   3.91E-08   NA        NA          HBS1L       MYB
rs6920211   6     135431318   T              C              3257   0.30    0.05   +++   2.30E-08   NA        NA          HBS1L       MYB
rs9494142   6     135431640   T              C              3257   0.30    0.05   +++   3.05E-08   NA        NA          HBS1L       MYB
rs9494145   6     135432552   T              C              3257   0.31    0.06   +++   2.43E-08   NA        NA          HBS1L       MYB
rs6000553   22    37469192    A              G              3257   0.29    0.05   +++   1.46E-09   TMPRSS6   intron      KCTD17      IL2RB
rs4820268   22    37469591    A              G              3257   -0.29   0.05   ---   7.03E-10   TMPRSS6   reference   KCTD17      IL2RB
rs2076085   22    37470041    C              A              3257   0.30    0.05   +++   4.65E-10   TMPRSS6   intron      KCTD17      IL2RB
rs2413450   22    37470224    C              T              3257   -0.30   0.05   ---   6.37E-10   TMPRSS6   intron      KCTD17      IL2RB
rs2072860   22    37470604    G              A              3257   0.30    0.05   +++   6.12E-10   TMPRSS6   intron      KCTD17      IL2RB

Table 4.2: Significant results of meta-analysis with INGI + 1000 Genomes phase 3 imputation reference panel (populations order: FVG, CARL, VBI).
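When comparing Tables 4.1 and 4.2 row by row, note that rs4820268 and rs2413450 are reported with swapped alleles and a negative beta in Table 4.2, so the two panels agree on the direction of the effect once the alleles are aligned. The following is a minimal sketch of that alignment, assuming that beta refers to the allele in the reference allele column; the `Hit` class and the `align_to` function are hypothetical names introduced for illustration, and strand ambiguity (A/T and C/G variants) is not handled.

```python
from dataclasses import dataclass, replace

# Sign flips for the per-cohort direction string; "?" (cohort missing) is kept.
_FLIP = {"+": "-", "-": "+"}


@dataclass(frozen=True)
class Hit:
    snp: str
    other_allele: str
    ref_allele: str
    beta: float      # assumption: effect of ref_allele
    direction: str   # one character per cohort, e.g. "+++"


def align_to(hit: Hit, ref_allele: str) -> Hit:
    """Return `hit` re-expressed with `ref_allele` as the reference allele."""
    if hit.ref_allele == ref_allele:
        return hit
    if hit.other_allele != ref_allele:
        raise ValueError(f"{hit.snp}: allele {ref_allele} not present")
    return replace(
        hit,
        other_allele=hit.ref_allele,
        ref_allele=hit.other_allele,
        beta=-hit.beta,
        direction="".join(_FLIP.get(c, c) for c in hit.direction),
    )


# rs4820268 as reported with each panel (values from Tables 4.1 and 4.2).
panel_1000g = Hit("rs4820268", "G", "A", 0.30, "+++")
panel_merged = Hit("rs4820268", "A", "G", -0.29, "---")
aligned = align_to(panel_merged, panel_1000g.ref_allele)
print(aligned.beta, aligned.direction)
```

After alignment the merged-panel estimate for rs4820268 is positive in all three cohorts, matching the sign reported with the 1000 Genomes Phase 1 panel.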

If we take into account a lower MAF threshold (MAF ≥ 1%), the number of significant signals (p-value < 5×10⁻⁸) increases for each population. With the 1000G Phase 1 imputation we had 2, 0 and 157 significant signals for CARL, FVG and VBI respectively, while the imputation with the INGI+1000G Phase 3 panel resulted in 6, 5 and 297 signals for CARL, FVG and VBI. We noticed an enrichment of significant signals in the VBI cohort when lowering the MAF threshold: a closer look at those signals highlighted the presence of a large number of sites in high linkage disequilibrium.
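The effect of the MAF threshold on the number of significant signals can be illustrated with a short sketch. The field names (`maf`, `info`, `p`) and the toy rows are assumptions made for illustration and are not taken from the actual result files; applying the info score ≥ 0.3 filter at the lower MAF threshold is also an assumption.

```python
from typing import Iterable, Mapping

GENOME_WIDE_P = 5e-8  # significance threshold used in the text


def count_significant(
    rows: Iterable[Mapping[str, float]],
    maf_min: float,
    info_min: float = 0.3,
    p_max: float = GENOME_WIDE_P,
) -> int:
    """Count variants passing the MAF and info score filters with p < p_max."""
    return sum(
        1
        for r in rows
        if r["maf"] >= maf_min and r["info"] >= info_min and r["p"] < p_max
    )


# Toy rows (made-up values), one dict per variant.
toy_rows = [
    {"maf": 0.20, "info": 0.95, "p": 1.2e-8},  # common, significant
    {"maf": 0.02, "info": 0.80, "p": 3.4e-8},  # low frequency, significant
    {"maf": 0.02, "info": 0.20, "p": 1.0e-9},  # fails the info score filter
    {"maf": 0.30, "info": 0.99, "p": 1.0e-3},  # not significant
]

print(count_significant(toy_rows, maf_min=0.05))
print(count_significant(toy_rows, maf_min=0.01))
```

Lowering `maf_min` from 0.05 to 0.01 admits the low-frequency variant, which is the mechanism behind the larger counts reported above.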

Figure 4.3: Manhattan plot of the meta-analysis results on the INGI cohorts for the MCH phenotype including variants with MAF ≥ 1%: a) using data imputed with the 1000G Phase 1 reference panel, b) using data imputed with the INGI+1000G Phase 3 reference panel.


highly significant values when we also take into account lower-frequency variants. We can clearly see, for example, how the signal on chromosome 11 is driven by low-frequency variants (Fig. 4.3a) and how the top SNP p-value is boosted in the results of the meta-analysis performed on the data imputed with the merged panel (Fig. 4.3b) (Tables 4.3 and 4.4).
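Because many significant variants cluster at the same locus, it can help to reduce a list of hits to one top SNP per locus before comparing panels. The sketch below groups hits by physical distance only, which is a crude stand-in for proper LD-based clumping; the function name `top_snp_per_locus` and the 500 kb gap are arbitrary assumptions, and the example rows are a subset of the 1000G Phase 1 meta-analysis results reported in this section.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, int, int, float]  # (snp, chromosome, position, p-value)


def top_snp_per_locus(hits: Iterable[Hit], max_gap: int = 500_000) -> List[Hit]:
    """Keep the lowest p-value hit in each run of nearby variants.

    Variants on the same chromosome whose positions are at most `max_gap`
    base pairs from the previous variant are treated as one locus.
    """
    leads: List[Hit] = []
    last_chrom, last_pos = None, None
    for hit in sorted(hits, key=lambda h: (h[1], h[2])):
        _, chrom, pos, p = hit
        if chrom == last_chrom and pos - last_pos <= max_gap:
            if p < leads[-1][3]:
                leads[-1] = hit  # better lead SNP for the current locus
        else:
            leads.append(hit)    # start a new locus
        last_chrom, last_pos = chrom, pos
    return leads


hits = [
    ("rs9832259", 3, 45444357, 4.75e-08),
    ("rs4895440", 6, 135426558, 1.79e-08),
    ("rs7758845", 6, 135428537, 1.58e-08),
    ("rs78270456", 11, 5229028, 3.40e-08),
    ("rs6000553", 22, 37469192, 1.22e-08),
    ("rs4820268", 22, 37469591, 4.54e-10),
]

for snp, chrom, pos, p in top_snp_per_locus(hits):
    print(snp, chrom, pos, p)
```

With these rows the function returns one lead variant each for chromosomes 3, 6, 11 and 22, so the top p-value at each locus can be compared directly between the two imputation panels.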

SNP         Chr  Position   Other allele  Reference allele  N     beta   se    Dir  p         gene     feature    left_gene  right_gene
rs9832259   3    45444357   G             A                 3257  0.33   0.06  +++  4.75E-08  LARS2    intron     TMEM158    LOC100130135

rs11759553  6    135422296  A             T                 3257  0.29   0.05  +++  3.76E-08  NA       NA         HBS1L      MYB
rs9373124   6    135423209  T             C                 3257  0.29   0.05  +++  3.89E-08  NA       NA         HBS1L      MYB
rs35959442  6    135424179  C             G                 3257  0.29   0.05  +++  2.99E-08  NA       NA         HBS1L      MYB
rs4895440   6    135426558  A             T                 3257  0.29   0.05  +++  1.79E-08  NA       NA         HBS1L      MYB
rs4895441   6    135426573  A             G                 3257  0.29   0.05  +++  3.06E-08  NA       NA         HBS1L      MYB
rs9376092   6    135427144  C             A                 3253  0.29   0.05  +++  2.15E-08  NA       NA         HBS1L      MYB
rs9389269   6    135427159  T             C                 3257  0.29   0.05  +++  4.55E-08  NA       NA         HBS1L      MYB
rs9402686   6    135427817  G             A                 3257  0.29   0.05  +++  3.18E-08  NA       NA         HBS1L      MYB
rs7758845   6    135428537  A             C                 3257  0.30   0.05  +++  1.58E-08  NA       NA         HBS1L      MYB
rs6920211   6    135431318  T             C                 3257  0.31   0.05  +++  1.72E-08  NA       NA         HBS1L      MYB
rs9494142   6    135431640  T             C                 3257  0.30   0.05  +++  2.82E-08  NA       NA         HBS1L      MYB
rs9494145   6    135432552  T             C                 3255  0.31   0.06  +++  2.22E-08  NA       NA         HBS1L      MYB

rs78270456  11   5229028    C             A                 2821  -1.09  0.20  -?-  3.40E-08  NA       NA         NA         NA

rs6000553   22   37469192   A             G                 3257  0.28   0.05  +++  1.22E-08  TMPRSS6  intron     KCTD17     IL2RB
rs4820268   22   37469591   G             A                 3257  0.30   0.05  +++  4.54E-10  TMPRSS6  reference  KCTD17     IL2RB
rs2076085   22   37470041   C             A                 3257  0.28   0.05  +++  9.06E-09  TMPRSS6  intron     KCTD17     IL2RB
rs2413450   22   37470224   T             C                 3257  0.28   0.05  +++  1.19E-08  TMPRSS6  intron     KCTD17     IL2RB
rs2072860   22   37470604   G             A                 3257  0.27   0.05  +++  2.87E-08  TMPRSS6  intron     KCTD17     IL2RB
