3. RESULTS
3.1. Preliminary Results
3.1.1. Existing Data
All the Daghestani populations analyzed in this study had already been genotyped for a set of Y chromosome markers (20 Y-STRs and 6 Y-SNPs). The data obtained in previous studies (Tofanelli, Ferri et al. 2009; Caciagli et al., in publication) (Figure i2) show that the Daghestani populations cluster together, separated from the other populations carrying the J1 Y haplogroup, one of the most represented Y haplogroups in those populations.
3.1.2. DNA Quality
A subset of 4 samples from each ethnic group was initially amplified with several long-range PCRs to check whether the DNA quality was adequate for PCR amplification. The results (Figure r1) showed that more than 90% of the tested samples were suitable for long-range PCRs.
Figure r1: Agarose gel of long-PCR yield tests. Ladder 1 kb; 1, 2, 3, 4 = Avars; 5, 6, 7, 8 = Kubachians; 9, 10, 11, 12 = Laks; positive control; negative control. Samples amplified with a test primer pair yielding a product of about 6 kb (6000 bp).
3.2. Plate Design And PCR Yields
The 20 Avars, 14 Kubachians, 21 Laks, 20 CEU, 20 Adygei and 1 chimpanzee samples were dispensed in 96-well plates as shown in Figure r2.
1 2 3 4 5 6 7 8 9 10 11 12
a D083 D097 D098 D099 D100 D101 D102 D103 D104 D105 D106 D107
b D108 D109 D110 D111 D112 D113 D114 D115 D041 D042 D043 D044
c D045 D046 D047 D048 D049 D050 D051 D052 D053 D054 D061 D062
d D063 D064 D065 D066 D067 D068 D069 D070 D071 D072 D073 D074
e D075 D076 D078 D079 D080 D081 D091 NA11839 NA10851 NA12762 NA11881 NA10857
f NA12146 NA12264 NA12005 NA11829 NA11994 NA11992 NA10838 NA06993 NA12154 NA12801 NA12155 NA07022
g NA12003 NA11831 NA12891 JK3158 JK3159 JK3161 JK3170 JK3173 JK3174 JK3177 JK3182 JK3222
h JK3223 JK3224 JK3225 JK3227 JK3234 JK3235 JK3237 JK3238 JK3243 JK3245 JK3249 CHIMP
Figure r2: Position of each sample in the 96-well plate. The CEU samples shown in red are all included in the Encode3 project.
Amplicons in rows “b” and “e” of each plate were run on an agarose gel and given a score of 1, 0.5 or 0 according to the band intensity. Thus, the yield of each plate was estimated with a score that ranged from 0 to 24, i.e. zero indicating no PCR products and 24 representing good amplification in the two rows.
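The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (24 wells scored 1, 0.5 or 0, then rescaled to 100 as done for Figure r3); the function names are hypothetical and are not part of the scripts used in this study.

```python
def plate_yield_score(band_scores):
    """Sum the band-intensity scores of the 24 wells in rows b and e.

    Each well is scored 1 (strong band), 0.5 (weak band) or 0 (no band),
    so the plate score ranges from 0 to 24.
    """
    if len(band_scores) != 24:
        raise ValueError("expected 24 wells (rows b and e, 12 wells each)")
    if any(score not in (0, 0.5, 1) for score in band_scores):
        raise ValueError("each band score must be 0, 0.5 or 1")
    return sum(band_scores)


def normalized_yield(score, max_score=24):
    """Rescale a plate score to the 0-100 range."""
    return 100.0 * score / max_score


# Hypothetical plate: row b fully amplified, row e with weak bands only.
example_scores = [1] * 12 + [0.5] * 12
example_score = plate_yield_score(example_scores)
example_percent = normalized_yield(example_score)
```

With the hypothetical plate above, the score is 18 out of 24, i.e. 75 on the normalized scale.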
The plot summarizing the yield of every reaction is shown in Figure r3, with the score normalized to 100 and the width of each signal proportional to the length (in bp) of the amplified fragment. The amplicons displayed on the X axis are listed in Table r1 and appear in the same order as in the sequenced template (see Materials and Methods p.30). The amplicons shown in red in Table r1 were excluded from the analysis because either the relevant
Legend (Figure r3): Avars, Kubachians, Laks, CEU, Adygei.
AMPLICON LENGTH (bp) AMPLICON LENGTH (bp) AMPLICON LENGTH (bp)
10qMB119a 2378 5pMB4b 1363 HIF1A006 4834
10qMB119b 1866 5qMb128a 1487 HIF1A007 4324
10qMB119c 1617 5qMB128b 1626 HIF1A008 4562
10qMB128a 2921 5qMB128c 1774 HIF1A009 4766
10qMB128b 1675 6pMB14a 1710 HIF1A010 4928
10qMb128c 3009 6pMB14b 2000 HIF1A011 5003
12qMB46a 1374 6pMB14c 1524 HIF1A012 4724
12qMB46b 2257 6qMB164c 1814 HBA2_1 3912
12qMB46c 1426 7pMB8a 2349 mt HVSI&II 1438
13qMB107a 2375 7pMB8b 2475 NOS3001 6856
13qMB107b 2668 7pMB8c 2433 NOS3002 5310
13qMb107c 1931 8pMB5a 2418 NOS3003 6112
13qMB108a 1477 8pMB5b 1209 NOS3004 5115
13qMB108b 2112 8pMB5c 2757 NOS3005 2860
13qMB108c 3416 ACE001 5519 PHD1001 4893
16pMB17a 1695 ACE002 2536 PHD1002 5309
16pMB17b 1866 ACE003 4825 PHD2001 6323
16pMB17c 3393 ACE004 5518 PHD2002 5217
18qMb73a 2398 ACE005 5643 PHD2003 4925
18qMb73b 2362 ACE006 4382 PHD2004 4703
18qMb73c 2309 ACE007 5696 PHD2005 4925
18pMB7a 1593 ACE008 5382 PHD2006 5094
18pMB7b 1736 ACE009 4260 PHD2007 5632
18pMB7c 1624 HBB 4353 PHD2008 6176
19qMB35a 1820 Encode 12 4505 PHD2009 6959
19qMB35b 2462 Encode 18 4364 PHD2010 6152
19qMB35c 1769 Encode 21 6461 PHD2011 5616
1pMB4a 1627 Encode 5 4051 PHD2012 5938
1pMB4c 1431 Encode 7 4074 PHD3001 6418
20pMB7a 1348 Encode 8 5374 PHD3002 5675
20pMB7b 2160 Encode 9 5034 PHD3003 6109
20pMB7c 1795 EDN1001 3718 PHD3004 5276
4qMb105a 2001 EDN1002 4262 PHD3005 6306
4qMb105b 2091 EPO001 4559 HBG1 1991
4qMb105c 2058 EPOr001 3775 VEGF001 4743
4qMB181a 1644 EPOr002 4258 VEGF002 4938
4qMB181b 2179 HIF1A001 5290 VEGF003 5006
4qMB181c 2212 HIF1A002 5397 VHL1new 4850
5pMB10a 2144 HIF1A003 5476 VHL2new 2938
5pMB10c 1895 HIF1A004 6680 VHL003 5935
5pMB4a 1872 HIF1A005 5230 HBD001 2619
Table r1: List of all the amplicons included in the present study, together with their length in bp. The ones shown in red were not amplified at all.
primer pairs failed standardization or the DNA available was not sufficient to replicate the given plate. A yield of 0 (or a gap in Figure r3) is therefore to be expected for those primers.
There was an inverse relationship between the length of the amplified fragment and the yield of the reaction (Figure r4).
Figure r3: PCR yield. Each column is a PCR product (see Table r1 for a list of the amplicons); the width of each column is proportional to the length of the amplicon.
Figure r4: Each diamond is a PCR product. The short fragments consistently produced high scores, whereas the long fragments (>3000 bp) obtained both high and low ones.
Axis labels: PLATES SCORE (Figure r3); LENGTH (bp) and SCORE (Figure r4).
3.3. Whole Genome Amplification
The same whole-genome-amplified (WGA) samples produced very different yields when amplified with a ~2 kb primer pair (18pMB7a) and with a >5 kb primer pair (ACE007), as shown in Figure r5. WGA DNA was therefore used only in the extreme case of a lack of DNA, bearing in mind that PCRs longer than 2-3 kb would produce very low yields.
Figure r5: Ladder; 8 WGA samples amplified with the 18pMB7a and ACE007 primer pairs. The two WGA products of each sample were run in adjacent lanes. Size labels on the gel: 6 kb and 2 kb.
3.4. Samples
The following sections show the results of the re-sequencing and downstream analysis. At the current state of the research, only one half of the samples passed the quality criteria to be considered “reliable”. For technical reasons, the plate with the pooled PCR products was split into two halves (columns 1-6 and 7-12), which were processed separately. As stated above, the downstream quality analysis performed on the second half of the samples revealed an unexpectedly high error rate resulting from problems during the sequencing process. For this reason, the following sections show the results for the first half only, while the second half is currently being re-processed in order to decrease the error rate encountered during the first run.
For practical reasons the samples are named according to their position in the plate (Figure r2) rather than by their extended names. The reader should therefore expect to see results for the samples labelled a1-a6, b1-b6, c1-c6, d1-d6, e1-e6, f1-f6, g1-g6 and h1-h6.
3.5. Sequencing Output
Figure r6 shows what the Illumina output looks like and the most representative steps leading to the final genotype file.
B
@IL2_3254:1:1:3:989#0/1
CTACCTCGAACTCTATATTTCTATCAGGTGAGCCAACATCCTTCCAGCCACCCAACCGTGGAGTTACCTG
+
@@@@???>??@?@@>>>@@@@@@?=';3@=(6>=698@?8:9?=>5>;<>??>?>=>579185=:=-:'
Figure r6: A. Laser image of 9 consecutive bases sequenced in two different spots (or clusters). B. Fastq record showing the name of the read (line 1), the read sequence (line 2), the “+” separator (line 3) and the base qualities (line 4). C. CNS file showing template position, reference, consensus, q1, q2, q3, coverage, description of each call and quality. D. Genotype file obtained with the script provided in Appendix 7 (p.100).
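To make the structure of a Fastq record (panel B) concrete, the following minimal Python sketch parses one four-line record and converts the quality string into numeric scores. It assumes a Phred+33 (Sanger-style) quality encoding, which is an assumption of this example and not something stated for the files used here; the function name and the toy record are hypothetical.

```python
def parse_fastq_record(lines, phred_offset=33):
    """Parse one four-line Fastq record.

    Returns (read name, read sequence, list of per-base quality scores).
    phred_offset=33 is an ASSUMPTION (Sanger-style encoding); other
    pipelines use a different offset.
    """
    name, sequence, separator, quality = (line.strip() for line in lines[:4])
    if not name.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid Fastq record")
    scores = [ord(symbol) - phred_offset for symbol in quality]
    return name[1:], sequence, scores


# Toy record, not taken from the real data.
toy_record = ["@read1", "ACGT", "+", "@?>'"]
read_name, read_sequence, read_scores = parse_fastq_record(toy_record)
```

Under the stated encoding assumption, the symbol "@" corresponds to a quality of 31 and the symbol "'" to a quality of 6, which illustrates how quality typically decays towards the end of a read.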
3.6. Coverage
Coverage is the number of times a given position was sequenced. In general, the higher the coverage, the more accurate the base calling; although there are exceptions, coverage usually gives a good measure of the reliability of the sequencing at a given position. The coverage obtained in the present study is shown below in different ways. Figure r7 shows the coverage for all the 48 samples along the template sequence. This plot gives summary information, but it is not amplicon-specific, so further plots are provided. Tables sr2, sr3 and sr4 (see Supplementary Materials on CD) show, respectively, the average coverage for each sample on each amplicon, the percentage of bases with coverage above 10x and the percentage of bases with coverage above 20x. Figures r8, r9 and r10 show the histograms obtained from those tables. The results in the detailed tables exclude the primer regions and the overlapping areas between two amplicons, since a correct coverage estimate is difficult for the latter and useless for the former (the primer sequence being known). Overall, consistently with the PCR yields, the average coverage was extremely high for control regions and variable for genic regions.
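The three per-amplicon quantities reported in Tables sr2, sr3 and sr4 can be illustrated with a short Python sketch. It assumes that the per-base coverage of one sample on one amplicon is available as a list of integers and that the primers occupy a fixed number of bases at each end; both the function name and the `primer_len` parameter are hypothetical and only serve the example.

```python
def coverage_summary(depths, primer_len=0, thresholds=(10, 20)):
    """Summarize the per-base coverage of one amplicon in one sample.

    depths      per-base coverage along the amplicon
    primer_len  number of bases trimmed at each end (primer regions are
                excluded; a fixed length is an ASSUMPTION of this sketch)
    thresholds  coverage cut-offs; "above" is taken as strictly greater
    Returns (average coverage, {threshold: % of bases above it}).
    """
    core = list(depths)
    if primer_len:
        core = core[primer_len:-primer_len]
    if not core:
        raise ValueError("no positions left after trimming the primers")
    mean_coverage = sum(core) / len(core)
    percent_above = {
        t: 100.0 * sum(d > t for d in core) / len(core) for t in thresholds
    }
    return mean_coverage, percent_above


# Toy amplicon of 8 bases with 2-base primers at each end.
toy_mean, toy_percent = coverage_summary([0, 0, 5, 15, 25, 30, 0, 0], primer_len=2)
```

For the toy amplicon, the four retained positions have an average coverage of 18.75, with 75% of the bases above 10x and 50% above 20x.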
Figure r7: Global Coverage. The Y axis reports the cumulative coverage across the 48 samples, each one represented by a different color. The X axis reports all the bases of the template reference sequence (for computational reasons, each spot is the average of a window of 70 bp), ordered as shown in Table r1. As expected, some gaps are consistent with the presence of an amplicon that failed to amplify.
Figure r8: Average Coverage. Each bar is an amplicon as reported in Table r1 (completely failed amplicons are not shown). A single bar is the sum of the contributions, in terms of average coverage, provided by the 48 analyzed samples. X axis: amplicons; Y axis: cumulative average coverage.
Figure r9: 10x Coverage. Each bar is an amplicon as reported in Table r1 (red amplicons are not shown). A single bar counts the number of positions above 10x coverage in all the 48 samples for each amplicon. Max expected = 45 (three samples failed sequencing). X axis: amplicons; Y axis: cumulated frequency.
Figure r10: 20x Coverage. Each bar is an amplicon as reported in Table r1 (red amplicons are not shown). A single bar counts the number of positions above 20x coverage in all the 48 samples for each amplicon. Max expected = 45 (three samples failed sequencing). X axis: amplicons; Y axis: cumulated frequency.
3.7. Quality Controls
As explained in Materials and Methods (p.35), the quality control was performed using the data obtained for the 9 CEU samples (samples f1-f6 and g1-g3) and comparing them with the data available online for the same individuals.
The following tables (Tables r2, r3 and r4) display, respectively, the comparison between HapMap SNPs and SEQ SNPs, the comparison between Encode region SNPs and SEQ SNPs for the same regions, and the estimates of the False Negative (FN) and False Positive (FP) rates, as described in Materials and Methods (p.37).
Table r2: Discordance rate estimated comparing HapMap SNPs vs the SNPs called in this study (SEQ).
The 16-entry table shows the comparison between the two datasets, considering the four possible classes of SNPs (Href = Homozygous for the reference allele; Het = Heterozygous; Halt = Homozygous for an allele alternative to the reference; Missing = data not available either in the HapMap database or in my data).
The second table reports the miscalls (divided into two classes), the discordance rate and the proportion of missing data.
HAPMAP\SEQ Href Het Halt Missing
Href 2757 8 3 1073
Het 5 902 9 315
Halt 7 4 1065 398
Missing 29 12 6 30
Good 4724
Homo/Homo miscalls 10
Homo/Het miscalls 26
Total of matched and unmatched data 4760
Discordance rate (miscalls/total) 7.56 x 10^-3
Missing data (missing/Total table) 0.27
TOTAL TABLE 6623
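The summary values of Table r2 can be recomputed directly from its 16 entries, as in the following Python sketch. The classification of the off-diagonal cells into Homo/Homo and Homo/Het miscalls follows the table labels; the function and variable names are hypothetical.

```python
# Rows: HapMap genotype class; columns: SEQ genotype class.
# Class order: Href, Het, Halt, Missing (values copied from Table r2).
HAPMAP_VS_SEQ = [
    [2757, 8, 3, 1073],
    [5, 902, 9, 315],
    [7, 4, 1065, 398],
    [29, 12, 6, 30],
]


def discordance_summary(table):
    """Recompute concordant calls, miscalls and discordance rate."""
    called = range(3)  # Href, Het, Halt (the Missing class is excluded)
    good = sum(table[i][i] for i in called)
    # Homozygous reference called as homozygous alternative, or vice versa.
    homo_homo = table[0][2] + table[2][0]
    # Every other mismatch among called genotypes involves a heterozygote.
    mismatches = sum(table[i][j] for i in called for j in called if i != j)
    homo_het = mismatches - homo_homo
    compared = good + mismatches
    return {
        "good": good,
        "homo_homo": homo_homo,
        "homo_het": homo_het,
        "compared": compared,
        "discordance": mismatches / compared,
        "total_table": sum(sum(row) for row in table),
    }


summary_r2 = discordance_summary(HAPMAP_VS_SEQ)
```

The sketch returns 4724 concordant genotypes, 10 Homo/Homo and 26 Homo/Het miscalls, and a discordance rate of 36/4760, i.e. about 7.56 x 10^-3, in agreement with the table.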
ENCODE3\SEQ Href Het Halt Missing
Href 3167 0 0 1373
Het 13 83 0 42
Halt 3 0 90 30
Missing 56 10 5 18
Good 3340
Homo/Homo miscalls 3
Homo/Het miscalls 13
Total of matched and unmatched data 3356
Discordance rate (miscalls/total) 4.77 x 10^-3
Missing data (missing/Total table) 0.30
TOTAL TABLE 4890
Table r3: Discordance rate estimated comparing ENCODE3 region SNPs vs the subset of the SNPs called in this study (SEQ) for the same regions.
The 16-entry table shows the comparison between the two datasets, considering the four possible classes of SNPs (Href = Homozygous for the reference allele; Het = Heterozygous; Halt = Homozygous for an allele alternative to the reference; Missing = data not available either in the HapMap database or in my data).
The second table reports the miscalls (divided into two classes), the discordance rate and the proportion of missing data.
Table r4: False Negative and False Positive estimates based on HapMap and Encode3 data, calculated as explained in Materials and Methods (p.37).
While an FN rate of 16% is acceptable considering the high percentage of missing data (27%, see Table r2), the high FP rate of 35% is discussed at p.65 (Discussion).
Given the rather high FP value, a further filtering was attempted, using as an empirical filter the ranges of values of q1, q2 and q3 reported by the matching NR SNPs (i.e. Het/Het and Halt/Halt in the HapMap vs SEQ comparison).
The ranges obtained are shown in Figure r11; the new FN and FP estimates are shown in Table r5. As that table shows, although the new FP rate is one order of magnitude smaller than the previous one, the FN rate becomes very high. In the following analysis the q1, q2, q3 filters are therefore not taken into account.
SEQvsHapMap SEQvsEncode EncodeVsHapMap
HapMap NR(allTemplate) 438 Encode NR 60 Encode NR 60
My NR(allTemplate) 2239 MyNR in Encode 139 HapMap NR in Encode 49
shared Mine/HapMap 366 Shared Mine/ENCODE3 41 Shared HapMap/ENCODE3 32
HapMap "private" 72 ENCODE3 "private" 19 ENCODE3 "private" 28
My "private" 1873 My "private" 98 HapMap "private" 17
seq raw FP rate (myPRIV/myTOT) 0.70
sequencing FN rate (hmPRIV/hmTOT) 0.16
ENCODE3 FN rate (hmPRIV/hmTOT) 0.35
sequencing true FP rate 0.35
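The rates of Table r4 can be approximately recomputed from the counts of non-reference (NR) SNPs, as sketched below in Python. Two points are assumptions of this sketch rather than statements of the method (which is described in Materials and Methods p.37): the raw FP rate is taken from the SEQ vs Encode comparison, and the "true" FP rate is taken as the raw FP rate minus the ENCODE3 FN rate. Under these assumptions the tabulated values are reproduced to within about 0.01. All names are hypothetical.

```python
def private_rate(total, shared):
    """Fraction of the SNPs of one dataset that are absent from the other."""
    return (total - shared) / total


def error_rates(hapmap_nr, shared_seq_hapmap,
                seq_nr_encode, shared_seq_encode,
                hapmap_nr_encode, shared_hapmap_encode):
    """Estimate FN and FP rates from counts of non-reference SNPs."""
    fn_sequencing = private_rate(hapmap_nr, shared_seq_hapmap)
    raw_fp = private_rate(seq_nr_encode, shared_seq_encode)
    fn_encode = private_rate(hapmap_nr_encode, shared_hapmap_encode)
    # ASSUMPTION: SNPs missed by ENCODE3 inflate the raw FP rate, so the
    # ENCODE3 FN rate is subtracted to obtain the corrected FP rate.
    true_fp = raw_fp - fn_encode
    return {
        "fn_sequencing": fn_sequencing,
        "raw_fp": raw_fp,
        "fn_encode": fn_encode,
        "true_fp": true_fp,
    }


# Counts copied from Table r4.
rates_r4 = error_rates(438, 366, 139, 41, 49, 32)
```

With the counts of Table r4 the sketch gives an FN rate of 72/438 (about 0.16), a raw FP rate of 98/139 (about 0.705), an ENCODE3 FN rate of 17/49 (about 0.35) and a corrected FP rate of about 0.36, close to the tabulated 0.35.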
Panels A, B and C of Figure r11. X axis: CLASS; Y axis: SCORE.
Figure r11: A. Distribution of q1 values among the classes represented on the X axis. B. Distribution of q2 values. C. Distribution of q3 values. Each distribution shows the average value ± 95% C.I. The 9 classes shown in the three panels are the ones including all the genotyped data. From the empirical distribution of the q values that distinguish the “matching” SNPs (HRHR, HAHA and HH, i.e. the ones found equal in both the HapMap and SEQ datasets), a set of values is obtained for further SNP filtering.
Table r5: New FN and FP estimates using the q1, q2 and q3 ranges (derived from Figure r11) as a further filter. The set of values used for the further filtering was: q1 > 200; q2 > 200; 93 < q3 < 97. Since the FP rate was reduced but the FN rate increased dramatically (almost 2/3 of missing data), this further filtering was discarded.
SEQvsHapMap SEQvsEncode EncodeVsHapMap
HapMap NR(allTemplate) 438 Encode NR 60 Encode NR 60
My NR(allTemplate) 376 MyNR in Encode 27 HapMap NR in Encode 49
shared Mine/HapMap 176 Shared Mine/ENCODE3 17 Shared HapMap/ENCODE3 32
HapMap "private" 262 ENCODE3 "private" 43 ENCODE3 "private" 28
My "private" 200 My "private" 10 HapMap "private" 17
seq raw FP rate (myPRIV/myTOT) 0.37
sequencing FN rate (hmPRIV/hmTOT) 0.60
ENCODE3 FN rate (hmPRIV/hmTOT) 0.35
sequencing true FP rate 0.02
3.8. SNP calling
Once the parameters minimizing the discordance, FP and FN rates had been established, the VarFilter script (Li, Ruan et al. 2008) was used to extract all the non-reference positions from the CNS files of all the 48 individuals. The filtered CNS files were then processed as normal CNS files, and the genotype file produced with the best parameters found so far (i.e. 20x coverage and 0.20 allele ratio; see Materials and Methods p.33) gives the genotype of the given individual at each SNP position. The set of SNP positions found is different in each individual. It is therefore necessary to compile a list of all the SNP positions found at least once among the 48 filtered CNS files. This list is the SNP list.
Each SNP position is afterwards extracted from each unfiltered genotype file.
The table of genotypes containing an individual on each row and a SNP position on each column is called the SNP table.
A reduced version of the SNP table is shown in Table r6, while the complete version is available in Supplementary Materials on CD.
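The construction of the SNP list and of the SNP table can be sketched in Python as follows. The data structures (a set of filtered positions and a dictionary of unfiltered genotypes per individual) are assumptions made for the example and do not reproduce the CNS or genotype file formats; the genotypes used are toy values.

```python
def build_snp_table(filtered_positions, genotypes):
    """Build the SNP list and the SNP table.

    filtered_positions  {individual: set of SNP positions that passed
                         the filtering in that individual}
    genotypes           {individual: {position: (allele1, allele2)}},
                         taken from the unfiltered genotype files
    Returns (sorted SNP list, {individual: genotypes along the SNP list}).
    Positions with no genotype available are reported as ("?", "?").
    """
    # A position enters the SNP list if it was found at least once.
    snp_list = sorted(set().union(*filtered_positions.values()))
    table = {
        individual: [
            genotypes.get(individual, {}).get(position, ("?", "?"))
            for position in snp_list
        ]
        for individual in filtered_positions
    }
    return snp_list, table


# Toy data for two individuals (hypothetical genotypes).
toy_filtered = {"a1": {1582, 2063}, "a2": {1771}}
toy_genotypes = {
    "a1": {1582: ("T", "T"), 1771: ("T", "T"), 2063: ("A", "A")},
    "a2": {1582: ("T", "T"), 1771: ("T", "C")},
}
toy_snp_list, toy_table = build_snp_table(toy_filtered, toy_genotypes)
```

In the toy example the SNP list contains three positions; individual a2 has no genotype at the third one and therefore receives a "?" entry, which is the kind of gap later imputed with PHASE.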
SNP positions 1582 1771 2063 2201 2374 4444
Alleles all1 all2 all1 all2 all1 all2 all1 all2 all1 all2 all1 all2
Individuals
a1 T T T T A A G G A A G G
a2 T T T T A A G G A A ? ?
a3 T T T T A A G G A A G G
a4 G T T C A C G G G A G G
a5 T T T T A A G G A A G G
Table r6. SNP table. The “?” signs stand for unavailable information, either because the position did not pass the filtering or because the sequence was not available for that individual. For this reason the SNP table was subsequently PHASEd.
To impute the missing genotypes (“?”) and to infer the haplotypic phases, the SNP table was processed using the software PHASE (Stephens and Donnelly 2003). In order to get more reliable results, the original table was split into smaller ones. Each new table contains only the data for one of the five analyzed populations, for the SNPs included in one region. The total number of SNP tables to be PHASEd is therefore 5 populations x 42 regions (20 Hominid Project + 15 genes + 7 Encode3) = 210 PHASE input files.
For reasons of space neither the input nor the output PHASE files are shown, but they are available in Supplementary Materials on CD.
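The count of 210 input files follows from pairing each population with each region, as the following Python sketch shows. The region names are those listed in Table r8; the file-naming pattern is purely hypothetical and is not the one used for the files on the CD.

```python
from itertools import product

POPULATIONS = ["AVARS", "KUBACHIANS", "LAKS", "CEU", "ADYGEI"]

HOMINID_REGIONS = [
    "10qMB119", "10qMB128", "12qMB46", "13qMB107", "13qMB108",
    "16pMB17", "18pMB7", "18qMB73", "19qMB35", "1pMB4",
    "20pMB7", "4qMB105", "4qMB181", "5pMB10", "5pMB4",
    "5qMB128", "6pMB14", "6qMB164", "7pMB8", "8pMB5",
]
GENES = [
    "ACE", "EDN1", "EPO", "EPOr", "HBA", "HBB", "HBD", "HBG1",
    "HIF1A", "NOS3", "PHD1", "PHD2", "PHD3", "VEGF", "VHL",
]
ENCODE_REGIONS = [
    "Encode12", "Encode18", "Encode21", "Encode5",
    "Encode7", "Encode8", "Encode9",
]


def phase_input_names(populations, regions):
    """One PHASE input file per (population, region) pair.

    The "<population>_<region>.inp" pattern is a HYPOTHETICAL naming
    scheme used only for this illustration.
    """
    return [f"{pop}_{region}.inp" for pop, region in product(populations, regions)]


ALL_REGIONS = HOMINID_REGIONS + GENES + ENCODE_REGIONS
phase_files = phase_input_names(POPULATIONS, ALL_REGIONS)
```

The list contains 5 x 42 = 210 names, one for each SNP table to be PHASEd.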
3.9. Downstream Analysis
3.9.1. Summary statistics
Once obtained, the haplotypes generated with PHASE can be used for a set of downstream analyses. Given the reduced number of samples available at the moment, every statistical analysis performed so far produced unreliable results (very high p values). For the sake of completeness, a reduced version of the summary statistics obtained (see Materials and Methods p.39) is shown in Table r7. As already explained, those values must be understood only as an illustration of what can be done with re-sequencing output, and not as reliable scientific evidence of selection or demographic processes affecting the studied populations. Further and more reliable summary statistics will be generated as soon as the second half of the samples is available.
Columns: REGION; Pop; Sample size; Segregating sites; Singletons; Fay and Wu's H; p value; Fu and Li's D; p value; Fu and Li's F; p value; Fu's Fs; p value; Tajima's D; p value
10qMB119 ADYGEI 18 10 2 -2.693 0.13 0.482 0.44 0.632 0.68 1.065 0.55 0.633 0.43
10qMB119 AVARS 18 10 3 -5.725 0.026 -0.05 0.81 -0.15 0.98 0.336 0.78 -0.298 0.86
10qMB119 CEU 18 9 0 0.183 0.686 1.499 0.001 2.085 0.14 0.858 0.58 2.316 0.01
10qMB119 KUB 12 10 3 -4.091 0.05 0.15 0.66 0.133 0.89 -0.04 0.98 0.008 0.9
10qMB119 LAKS 24 10 1 -4.63 0.044 0.95 0.17 0.947 0.53 -0.38 0.96 0.442 0.6
Table r7. Summary statistics for 10qMB119 (Hominid region). The p values are generally too high for the values found to be considered reliable. A complete version of this table is available in Supplementary Materials on CD.
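As an illustration of how one of the statistics in Table r7 is obtained from phased haplotypes, the following Python sketch computes Tajima's D with the standard formula (Tajima 1989). It is not the software used to produce Table r7, it does not compute p values (which require simulations), and the haplotypes in the example are toy data.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length haplotype strings.

    Returns None when D is undefined (fewer than 4 sequences or no
    segregating sites).
    """
    n = len(haplotypes)
    sites = range(len(haplotypes[0]))
    segregating = [i for i in sites if len({h[i] for h in haplotypes}) > 1]
    s = len(segregating)
    if n < 4 or s == 0:
        return None
    # Average number of pairwise differences (pi).
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a[i] != b[i] for i in segregating) for a, b in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy haplotypes: a single singleton site versus a single balanced site.
d_singleton = tajimas_d(["A", "A", "A", "T"])
d_balanced = tajimas_d(["A", "A", "T", "T"])
```

A site carried by a single haplotype gives a negative D (excess of rare variants), whereas a site at intermediate frequency gives a positive D, which is the contrast visible between, for example, the AVARS and CEU rows of Table r7.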
3.9.2. SNPs Mapping
Given the present status of the research, a much more productive analysis of the results available so far is a functional investigation of the SNPs found. Dr Yuan Chen from EBI (The European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, UK) kindly mapped the SNPs found in this research (both novel and already annotated) to the annotated Human Genome, to see whether they modify the function of the genes in which they lie. Consequently, the SNPs mapped by Dr Chen are only the ones falling in a coding region (i.e. candidate genes and Encode regions). Table r8 and Table r9 summarize, respectively, all the SNPs found and the changes they produce in the coding regions. Extended versions of the two tables, with the details of each SNP at each position, are available in Supplementary Materials on CD.
A further analysis of the Synonymous and Non Synonymous SNPs was performed using the SIFT software (http://sift.jcvi.org/), to see whether the amino acid substitutions found would cause a functional modification in the protein.
The results of the SIFT analysis for 16 out of 20 NS SNPs and the population frequencies of the SNPs modifying the protein functions are displayed in Table r10 and Figure r12 respectively and discussed in the next section.
Region SNPs found Novel Region SNPs found Novel
10qMB119 16 4 EDN1 30 4
10qMB128 24 10 Encode12 13 4
12qMB46 29 11 Encode18 11 2
13qMB107 24 8 Encode21 11 11
13qMB108 18 9 Encode5 11 5
16pMB17 38 22 Encode7 5 1
18pMB7 24 10 Encode8 22 13
18qMB73 21 5 Encode9 21 2
19qMB35 17 4 EPO 15 8
1pMB4 25 25 EPOr 38 35
20pMB7 21 5 HBA 20 15
4qMB105 20 20 HBB 22 6
4qMB181 29 29 HBD 5 3
5pMB10 34 11 HBG1 44 23
5pMB4 15 6 HIF1A 285 219
5qMB128 45 20 NOS3 74 28
6pMB14 51 23 PHD1 15 9
6qMB164 10 10 PHD2 187 53
7pMB8 28 28 PHD3 135 55
8pMB5 36 13 VEGF 58 21
ACE 238 124 VHL 88 45
TOTAL 1873 959
Table r8. SNPs found in the analyzed regions (each neutral region is considered as a whole)
Table r9. SNPs distribution in coding regions and novel SNPs found in each class
Name Chr Position Alleles Codons dbSNP ID Type Prediction Gene
#HighAltitude559 17 58916386 C/G TCC-TgC novel NS TOL ACE
#HighAltitude591 17 58922309 C/T ACG-AtG rs3730043:T NS DAM ACE
#HighAltitude594 17 58924749 G/A GAC-aAC novel NS TOL ACE
#HighAltitude684 17 58938452 A/T AAA-AAt rs4459610:T NS DAM_LC ACE
#HighAltitude874 6 12400605 A/G GAG-GgG novel NS TOL EDN1
#HighAltitude895 6 12404241 G/T AAG-AAt rs5370:T NS DAM_LC EDN1
#HighAltitude900 7 100157569 G/A GAC-aAC novel NS TOL EPO
#HighAltitude917 19 11352566 T/G CAC-CcC novel NS DAM EPOR
#HighAltitude1135 14 61256965 G/C GTG-cTG novel NS DAM HIF1A
#HighAltitude1220 14 61277310 C/T CCA-tCA rs11549465:T NS TOL HIF1A
#HighAltitude1358 7 150327044 T/G GAT-GAg rs1799983:G NS TOL NOS3
#HighAltitude1366 7 150329336 G/T GGC-tGC novel NS DAM NOS3
#HighAltitude1794 11 5226148 A/C ATT-ATg rs11546324:C NS TOL HBG1
#HighAltitude1799 11 5226199 G/C GCA-GgA novel NS TOL HBG1
#HighAltitude1827 11 5227262 G/A ACA-AtA rs1061234:A NS TOL HBG1
#HighAltitude1980 11 5212158 C/A GCC-tCC rs35152987:A NS TOL HBD
Table r10. SIFT software predictions of the effect of the Non Synonymous SNPs on protein functionality. The SNPs that can affect (increase/decrease/modify) the physiological function of the protein in which they are found are reported in red. DAM = Damaging; TOL = Tolerated; LC = Low Confidence; NS = Non Synonymous
CLASS SNPs %of total Novel SNPs
UPSTREAM coding region 62 4.97 31
3’ UTR 29 2.32 11
SYNONYMOUS CODING 19 1.52 3
NON SYNONYMOUS CODING 20 1.60 5
INTRONIC 1065 85.34 551
SPLICE SITE 3 0.24 0
5’ UTR 10 0.80 7
DOWNSTREAM coding region 40 3.21 26
TOTAL 1248 100 634
Figure r12. Population frequencies of the 4 SNPs predicted to modify, with high confidence, the function of the protein in which they are found. Each row is a population, each column a SNP (with the affected gene). The frequency of the NS SNP is shown in green.
Columns (SNP, gene): #HighAltitude591 (ACE); #HighAltitude917 (EPOr); #HighAltitude1135 (HIF1a); #HighAltitude1366 (NOS3).
Rows (populations): Avars, Kubachians, Laks, CEU, Adygei.