A bird’s-eye view of Italian genomic variation through whole-genome sequencing


AperTO - Archivio Istituzionale Open Access dell'Università di Torino

Original Citation:

A bird’s-eye view of Italian genomic variation through whole-genome sequencing

Published version:

DOI:10.1038/s41431-019-0551-x

Terms of use:

Open Access


Anyone can freely access the full text of works made available as "Open Access". Works made available under a Creative Commons license can be used according to the terms and conditions of said license. Use of all other works requires consent of the right holder (author or publisher) if not exempted from copyright protection by the applicable law.


This is a preprint version of the article cited above.


[Figure (1) panels: (a) map labels: FVG, CAR, VBI, SAU, ILL, RSI, ERT, CLZ, SMC; (b) x axis: MAF; y axis: Proportion of sites (%); series: CAR, FVG, VBI, TSI; (c) compared resources: 1000GPhase3, EUR-1000GPhase3, UK10K+1000GPhase3; y axis: Sites; legend: Sites MAF > 5%, Sites 1% < MAF <= 5%, Sites MAF <= 1%]

Figure (1) a) Geographic location of the three study cohorts. b) Minor allele frequency spectrum of the final INGI data set. For comparison, the minor allele frequency spectrum of the TSI cohort from the 1000G Phase 3 data is also shown. c) Novel sites identified in the whole INGI data set, compared to available resources. The majority of the private INGI sites are rare variants (singleton sites are included).
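The three frequency classes used in panel (c) can be illustrated with a minimal sketch. It assumes that each site is described by an alternate allele count and the total number of called alleles; the function names and the toy input are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def minor_allele_frequency(alt_count, total_alleles):
    """Return the minor allele frequency of a biallelic site."""
    # The minor allele is whichever of the two alleles is less common.
    minor_count = min(alt_count, total_alleles - alt_count)
    return minor_count / total_alleles


def maf_class(maf):
    """Assign a site to one of the three classes in the panel (c) legend."""
    if maf <= 0.01:
        return "MAF <= 1%"
    if maf <= 0.05:
        return "1% < MAF <= 5%"
    return "MAF > 5%"


def class_counts(sites):
    """Count sites per class; `sites` holds (alt_count, total_alleles) pairs."""
    return Counter(
        maf_class(minor_allele_frequency(alt, total)) for alt, total in sites
    )


# Hypothetical toy input: five sites typed in 100 diploid samples (200 alleles).
toy_sites = [(1, 200), (3, 200), (8, 200), (60, 200), (196, 200)]
counts = class_counts(toy_sites)
```

With real data the pairs would come from the AC and AN fields of a VCF, which is an assumption about the input format rather than a statement about how the figure was produced.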


[Figure (2) panels: NW-ITA, VBI, CAR, FVG; x axis: MAF bins (0.005, 0.01, 0.02, 0.05); y axes: Number of variants and Mean IMPUTE r^2; legend: REFERENCE (IGRP1.0, 1000GPhase3), Info score (0.2, 0.4, 1)]

Figure (2) Mean values of r² stratified by minor allele frequency (colored lines) and values of info scores stratified by minor allele frequency (bar plot) for the Italian cohorts. An outbred cohort from northern Italy (NW-ITA) was included for comparison.


[Figure (4) panels: (a) axes: PC1 (9.01 %), PC2 (7.82 %); (b) axes: PC3 (7.53 %), PC4 (6.98 %); populations: FIN, GBR, IBS, CAR, TSI, VBI, CLZ, SMC, ERT, SAU, ILG, RSI; (c) FST matrix over 1000G and INGI populations, color key: Value (0 to 0.15), Count; (d) x axis: Drift parameter; scale bar: 10 s.e.; (e) y axis: Total Homozygosity (Mb); (f) y axis: Inbreeding Coefficient]

Figure (4) Population structure of the Italian samples. a) PCA, first two PCs, and b) the 3rd PC versus the 4th PC. The variance explained by each axis is reported. Each population from FVG has its own axis of variation. The first axis separates ILG from all other Italian populations. The second separates SAU from RSI. c) Pairwise matrix of FST. d) TREEMIX graph analysis with 3 migration edges; links between North European populations and some isolates are shown. e) Distribution of the total amount of homozygosity in all cohorts; the minimum ROH length was set at 1 Mb. f) Beanplots of the inbreeding coefficient of the 1000G European populations and the Italian populations. All FVG populations have a higher inbreeding coefficient than the other Italian and European populations, with the exception of FIN. The plot shows that in the INGI populations the distribution of the inbreeding values is more dispersed than in the reference Italian population TSI from 1000G.
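The quantity plotted in panel (e) can be sketched as follows, assuming that runs of homozygosity (ROH) for one individual are already available as (start, end) coordinates in base pairs. The ROH calling step itself is not described in this section, so the input format, the function name and the toy segments are hypothetical.

```python
MIN_ROH_LENGTH_BP = 1_000_000  # 1 Mb minimum ROH length, as stated in the caption


def total_homozygosity_mb(roh_segments, min_length_bp=MIN_ROH_LENGTH_BP):
    """Sum the lengths of ROH segments that reach the minimum length, in Mb."""
    total_bp = sum(
        end - start
        for start, end in roh_segments
        if end - start >= min_length_bp
    )
    return total_bp / 1_000_000


# Hypothetical toy input: three segments for one individual.
# The 0.4 Mb segment is below the 1 Mb threshold and is ignored.
toy_segments = [
    (1_000_000, 2_500_000),    # 1.5 Mb
    (5_000_000, 5_400_000),    # 0.4 Mb
    (10_000_000, 13_000_000),  # 3.0 Mb
]
toy_total = total_homozygosity_mb(toy_segments)
```

Applying the function to each individual and grouping the results by population would give one distribution per cohort, which is the kind of summary panel (e) displays.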


[Figure (5) plot title: Genes under Selection |iHS| >= 2; x axis: Populations (TSI, CAR, VBI, CLZ, SMC, ERT, SAU, ILG, RSI); y axis: Number of genes (0 to 400); legend: Shared with TSI, Different from TSI]

Figure (5) Proportion of genes with signatures of selection in the INGI subpopulations with respect to TSI. Blue represents the fraction of genes shared with TSI; orange represents the fraction of genes that are private with respect to TSI.


[Figure (6) panels: CAR_TSI, VBI_TSI, CLZ_TSI, SMC_TSI, ERT_TSI, SAU_TSI, ILG_TSI, RSI_TSI; x axis: CADD scores bin (0_5, 5_15, 15_20, >20); y axis: DVxy, AC 3-5 (0.6 to 1.6)]

Figure (6) DVxy statistic for each INGI cohort, with the TSI population as reference, computed on variants with an allele count (AC) between 3 and 5 and binned by CADD score as shown on the x axis. Confidence intervals were built from the distribution over all 22 chromosomes and represent 1 standard deviation. A value of 1 means no enrichment, a value < 1 means depletion in the INGI cohort with respect to TSI, and a value > 1 means enrichment with respect to the reference.
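The reading rule given in the caption can be sketched in a few lines. The definition of DVxy is not given in this section, so the sketch starts from per-chromosome DVxy values that are assumed to be already computed. The use of the sample standard deviation and the rule "the interval excludes 1" are assumptions made for illustration, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev


def summarize_dvxy(per_chromosome_values):
    """Return (mean, lower, upper), where the interval is mean +/- 1 SD."""
    center = mean(per_chromosome_values)
    spread = stdev(per_chromosome_values)  # sample SD (assumption)
    return center, center - spread, center + spread


def interpret(per_chromosome_values):
    """Label a CADD bin relative to the value 1 described in the caption."""
    _, lower, upper = summarize_dvxy(per_chromosome_values)
    if lower > 1.0:
        return "enrichment"
    if upper < 1.0:
        return "depletion"
    return "no clear difference"


# Hypothetical toy values (four chromosomes only, for brevity).
toy_enriched = [1.20, 1.30, 1.10, 1.25]
toy_depleted = [0.80, 0.70, 0.75, 0.85]
toy_neutral = [0.95, 1.05, 1.00, 0.98]
```

In the figure there would be one such summary per cohort and per CADD bin, each based on 22 per-chromosome values.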


[Figure (7) heatmap values; rows and columns are INGI subpopulations]

      CAR  VBI  CLZ  SMC  ERT  SAU  ILG  RSI
RSI    23   17   16   19   23   31   22    0
ILG    24   22   20   23   29   24    0
SAU    28   17   21   22   21    0
ERT    20   17   19   19    0
SMC    19   15   16    0
CLZ    20   12    0
VBI    16    0
CAR     0

Figure (7) Heatmap of the ratio between the DV variants, relative to the Italian reference (AC 3-5, CADD ≥ 20, frequency fold enrichment ≥ 3), that differ between pairs of INGI subpopulations and the DV variants that are shared between pairs.


Figure (8) Venn diagram showing how the 133 genes harbouring HKO (with at least 1 homozygous individual) are distributed among the INGI cohorts.