2. MATERIALS AND METHODS

Academic year: 2021

2.1 Ethical Consent

The Daghestani DNA samples used in this study were previously extracted from blood samples obtained under an agreement between the University of Pisa and the Russian Academy of Sciences (see Appendix 1 p.72). The CEU and chimpanzee samples were obtained from cell cultures stored at the Coriell Institute (www.coriell.org), while the Adygei DNA samples (extracted from cell cultures) were kindly provided by Dr. Kenneth Kidd from Kidd Labs, Department of Genetics, Yale University School of Medicine.

To obtain approval to ship the Daghestani samples from Italy (Laboratory of Molecular Anthropology, Biology Department, University of Pisa), where they were stored after extraction, to the UK (The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK), an application for REC approval was submitted through IRAS (www.myresearchproject.org.uk) (see Appendix 1 p.72). The REC committee approved the documents supplied (see Appendix 1 p.72) and judged the DNA samples ethically suitable for use in the proposed project.

2.2 Control Regions

Together with the candidate genes (see Introduction p.19), twenty-seven genomic regions unrelated to high-altitude adaptation were sequenced to serve as a control when looking for signals of positive selection. In particular, the twenty autosomal non-coding regions already used in the Hominid Project (Wall, Cox et al. 2008) and seven random regions chosen among those of the Encode 3 Project (Birney, Stamatoyannopoulos et al. 2007) were included.

Following the published procedures (Frisse, Hudson et al. 2001; Wall, Cox et al. 2008), twenty (~20 kb each) primarily single-copy, non-coding (i.e. putatively non-functional) regions located in genomic areas of medium or high recombination (r ≥ 0.9 cM/Mb) (Kong et al. 2002) were re-sequenced in the whole set of samples. For each region, 4-6 kbp of sequence data from two or three discrete subsections spanning most of the region (locus trio) were selected (in this thesis, each trio will be reported as a “Hominid region”). The coordinates and sequences of each trio were kindly provided by Dr. Michael Hammer (The Hammer Lab, University of Arizona) through release from the website (http://hammerlab.biosci.arizona.edu) (Appendix 3 p.77), while the primers were independently designed (see Appendix 4 p.83).

The seven regions chosen among those already re-sequenced in the Encode 3 Project (Birney, Stamatoyannopoulos et al. 2007) were selected regardless of their location in the genome, considering only the quality of the sequences available in the database (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/HapMap3-ENCODE/ENCODE3/ENCODE3v1). In fact, the choice was oriented towards contiguous sequences (> 5 kbp) showing the lowest number of missing nucleotides or miscalls among the whole set of samples analysed in Encode. In this way seven regions on seven different chromosomes (chr: 5, 7, 8, 9, 12, 18 and 21) were selected (for details see Appendix 3 p.79) and the primers designed (Appendix 4 p.87).

2.3. Primer Design And Standardization

All the primers used in this study were designed using Primer3 (http://primer3.sourceforge.net/) and Blast (http://blast.ncbi.nlm.nih.gov/Blast.cgi). The actual primer design was performed using the Perl software Pfetch (available at the Sanger Institute) and the Perl scripts available at the Human Evolution Team of the Wellcome Trust Sanger Institute, while the Blast check was performed online.

The sequences of the primer pairs as ordered from Sigma-Aldrich (www.sigma-aldrich.com) are available in Appendix 4 (p.80).

Every primer pair was tested and the PCR conditions were standardized using a HapMap DNA (NA07029) as a test sample.

2.4. Whole Genome Amplification

Since the amount of DNA available from each Daghestani sample (3.5 µg) was barely sufficient to perform the amplifications required by the study design, and because the population lives in an area that is difficult to reach and close to war zones, a Whole Genome Amplification (WGA) (GE GenomiPhi HY DNA Amplification Kit) was performed in order to increase the amount of DNA available.

Because the use of Phi polymerase could have introduced additional errors in the final sequences and, in general, reduced the yield of long-range PCRs (see Results p.41 for a gel showing amplimers with different yields), the “WGA DNA” was used only to replicate some reactions involving regions smaller than 3 kbp (i.e. some of the twenty Hominid Regions) or to replace some samples that failed to amplify (i.e. samples D83, D97 and D107).

2.5. PCRs (Polymerase Chain Reactions)

The PCR amplifications were performed in 96-well plates using Platinum Taq Polymerase (Invitrogen), one plate for each of the 124 primers (see Appendix 5 p.88 for protocols). The final number of PCR reactions was:

N of PCRs: 124 x 96 = 11904

The amplimers were visualized by electrophoresis on agarose gels containing EtBr, and those belonging to the same individual were pooled together, in volumes proportional to each reaction yield, in order to reach equimolarity among all the amplimers. The pooled DNA was subsequently purified through post-PCR columns (Qiagen QIAquick PCR Purification Kit).

2.6. Illumina Flow-cell Platform

The Solexa/Illumina re-sequencing technology is, basically, a modified version of classic Sanger sequencing (sequencing by synthesis), which combines the advantages of the previous method with a massive yield and a drastic reduction in cost and time (Bentley, Balasubramanian et al. 2008).

The main features of this technology are the use of a reversible terminator (instead of ddNTPs) and the sequencing of two short stretches of DNA at the two ends of each fragment of sheared DNA. The reversible terminator, paired with a fluorescent dye, allows the last base added by the polymerase to be read without permanently stopping the polymerase. The presence of two sequences at the two ends of each fragment (of known length) gives, at the alignment step, additional information on the relative distance between the two reads. The principal problem of this technology, shared with all the other re-sequencing technologies, is that the individual reads are too short for de novo assembly and need to be aligned to a reference sequence to give the final information.

Briefly:

- The amplified DNA (PCR, Pull-down, WGA) is cut into small fragments of variable length (for this study 150-450 bp with an average of 200 bp);

- Each fragment is processed through the addition of two different linkers, one at each 3’ or 5’ end (total four linkers) (Figure m1);

- The processed fragments are immobilized on the flowcell, a special surface providing complementary sites where the linkers can anneal. Each flowcell is made of eight “lanes” and the sequencing of the whole flowcell is a “run”;

- The fragments bound to the surface of each lane are enriched. Using the known sequences of the two linkers, some rounds of PCR (called “bridged” because of the conformation assumed by the fragments in this process, Figure m1) are performed to increase the number of copies of each fragment bound to the surface. This process produces a “spot” or “cluster” of millions of copies of the same sequence, all bound in the same area of the lane. Each spot serves to amplify the fluorescent signal of every single base when the fluorescent dyes are added;

- Every fragment is now bound to the platform by just one linker (see Figure m1) and is ready to be sequenced. After denaturation, a sequencing primer complementary to the binding linker is added and the sequencing reaction is performed. Each nucleotide added carries a reversible terminator and a fluorescent dye. Therefore, if the dNTP is incorporated, after the first wash the fluorescent signal detected at a given spot identifies the base complementary to the nucleotide present at that position on the sequence. The second wash removes both the dye and the reversible terminator, and allows the DNA polymerase to add the following nucleotide.

This process can theoretically be repeated indefinitely but, to keep the error rate low, it is preferable not to extend it beyond a few tens of bases. In this study, 70 bases were sequenced each time.

- Once the end linked to the surface has been sequenced, it is possible to perform another denaturation and bridged amplification. By doing this, the complementary strand is generated and the opposite end is now linked to the platform (see Figure m1). The same processes described above will now produce the sequence of the second end (paired-end sequencing).

- The final result is a series of fluorescent images (laser-detected) representing the bases at the two ends (see Results p.47). The obtained sequence is a patchwork of two 70 bp sequences. It has to be borne in mind that between the two sequences there is a gap of ~60 bp (the average length of each DNA fragment minus 70x2 bp);

Figure m1 (Bentley, Balasubramanian et al. 2008). The various steps of Illumina sequencing are shown. A. DNA shearing and linker addition. B. Each fragment is bound to the surface of the flowcell and amplified through bridged PCR. C. Once sequenced, a fragment can generate its complementary strand and be sequenced starting from the other end.

2.7. Indexed Re-sequencing

To better exploit the huge yield of an Illumina run, the “Indexed re-sequencing” strategy was applied in the present study (Craig, Pearson et al. 2008). Developed at WTSI and recently optimized by Dr. Daniel Turner, Dr. Iwanka Kozarewa and Dr. Daniel MacArthur (manuscript in preparation), it allows the simultaneous sequencing, in the same run, of DNA belonging to different samples. To do that, a unique eight-nucleotide sequence or “tag” is added to one of the linkers of each genomic DNA fragment (after the DNA shearing step). All the “tagged” DNAs, belonging to different samples, are then pooled together, and the resulting mix is sequenced. Once obtained, the output of the run has to undergo an additional step called “splitting”, to divide the sequences of each sample and produce a single file for each individual (see below).
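As an illustration of how tagged reads could be assigned back to their samples, the following Python sketch expands each eight-nucleotide tag into its one-base variants and discards any variant that could belong to more than one tag. It is not the fastq2cns.pl script used in this study; the function names, the sample names and the tag sequences in the usage note are hypothetical, and the inclusion of "N" among the possible miscalls is an assumption.

```python
from typing import Dict, Optional, Set

# Assumption: a miscalled base may be any of A, C, G, T or the no-call symbol N.
BASES = "ACGTN"


def one_base_variants(tag: str) -> Set[str]:
    """Return the tag itself plus every sequence differing from it at one position."""
    variants = {tag}
    for i, original in enumerate(tag):
        for base in BASES:
            if base != original:
                variants.add(tag[:i] + base + tag[i + 1:])
    return variants


def build_lookup(sample_tags: Dict[str, str]) -> Dict[str, str]:
    """Map every unambiguous observed sequence to a sample name.

    A sequence reachable from two different tags is ambiguous and is dropped,
    so reads carrying it are discarded rather than assigned.
    """
    lookup: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for sample, tag in sample_tags.items():
        for variant in one_base_variants(tag):
            if variant in ambiguous:
                continue
            if variant in lookup and lookup[variant] != sample:
                del lookup[variant]
                ambiguous.add(variant)
            else:
                lookup[variant] = sample
    return lookup


def assign_sample(observed_tag: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the sample for an observed tag, or None if it must be discarded."""
    return lookup.get(observed_tag)
```

For example, with the hypothetical tags `{"sample_1": "AAAAAAAA", "sample_2": "AAAAAATT"}`, the observed sequence `CAAAAAAA` is assigned to `sample_1`, whereas `AAAAAAAT` is one mismatch away from both tags and is therefore discarded.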

2.8. Data Processing

The processed image from each one of the seven lanes of an Illumina paired-end run is released in the form of two “.fastq” files, one for each end, and a “tag” file containing the tag sequenced in each read. Such files can vary in format, but the main feature is a list of single reads with additional information structured as follows:

@IL2_3254:1:1:3:989#0/1

CTACCTCGAACTCTATATTTCTATCAGGTGAGCCAACATCCTTCCAGCCACCCAACCGTGGAGTTACCTG

+

@@@@???>??@?@@>>>@@@@@@?=';3@=(6>=698@?8:9?=>5>;<>??>?>=>579185=:=-:'

1st line: read name; 2nd line: read sequence or tag; 3rd line: separator (“+”); 4th line: quality score of each base in the read.
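The four-line record structure shown above can be read with a few lines of Python. This is a minimal sketch rather than the parser used in the study; the names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and the quality offset of 33 is an assumption (other encodings use a different offset).

```python
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read name, without the leading "@"
    sequence: str  # read sequence (or tag)
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a fastq file with exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.strip()
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed fastq record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integers (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]
```

A typical use would be `for record in read_fastq(open(path)): ...`, where `path` points to one of the per-end fastq files.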

To get the aligned consensus sequence for the re-sequenced region together with the quality and coverage scores, the fastq files must be processed to create a “.cns” file, containing all the above information.

The main steps to get a cns file from a fastq file, using the fastq2cns.pl Perl script kindly provided by Dr. Daniel MacArthur from WTSI, are:

- Get a list of all the possible variants of a tag which could unambiguously be recognized as belonging to the same class. In this way the 8 nucleotides of each tag are expanded into a list of one-base variants. This list is crucial to assess whether an 8-nucleotide sequence that differs from the expected one (e.g. because it is affected by a sequencing error) can still be considered a known tag or must be discarded because the errors introduce ambiguity with variants of other tags.

- Split the reads according to their identification tag. This step creates as many fastq files as there are different tags (samples) in the main fastq file. After this step, all the sequences belonging to a single sample will be stored in two fastq files, one for each end.

- Create a template file containing the reference sequence (Fasta format) of all the regions included in the study. This “patchwork” will be used as reference sequence by the aligning software MAQ (Li, Ruan et al. 2008) as described below.

- Align the split reads to the reference template file. In this study, the aligning algorithm used is MAQ. The aligning software compares the 70 bp sequence of each read to the reference. When the mismatch is below a certain threshold (proportional to the length of the read) and the distance between the two 70 bp ends is within the expected range (depending on the average length of the sheared fragment, in this case ~60 bp), the read is mapped to the reference sequence. If the mismatch is too high, or one of the two ends does not map or maps unexpectedly, the whole read is ignored.

- Merge reads. The mapped reads obtained from the two ends are merged together to create a single file containing all the information for each sample.

- Create the CNS file. This file format summarizes, as shown in Figure m2, the main features of each sequenced base. In particular:

  - Position on the template sequence (this position is arbitrary, since the total sequence results from a patchwork of genomic .fasta files);

  - Reference base at that position;

  - Consensus base obtained from the alignment;

  - Three quality scores (q1=Consensus score, q2=SNP score, q3=Mapping score) describing the quality of the sequenced base;

  - Coverage, the number of different reads mapping to that base. Usually, the higher the coverage, the higher the accuracy of the base calling (the positions reporting 0 coverage are automatically removed from a CNS file created with the MAQ scripts);

  - Description and quality of each base sequenced at that position. A base with 300x coverage will be represented with a string of 300 signs (for an interpretation, see Appendix 6 p.99) showing for every read whether that position was sequenced as the reference, as another base, as a deletion or as a mismatch. This string is followed by another string containing, for each base, a quality parameter.

- Create a conversion table (script and table available in Appendix 7 p.100) listing, for each position of the template sequence (and, therefore, of the CNS file), the corresponding genomic position and chromosome and a description of what the given base corresponds to (i.e. sequenced region, forward primer, reverse primer, gap between two sequenced regions).

- Create a genotype file from each CNS file (script available in Appendix 7 p.100). The consensus base calling provided in a CNS file results from standard estimations built into the MAQ algorithms. Especially for the positions where a known or a putative SNP is present, the built-in base calling is not totally reliable. Therefore it is worth producing an independent file called “genotype file” which lists, together with all the relevant parameters directly extracted from the CNS file, the putative genotype at each position. The first allele of the genotype specifies the most represented base at that position, whereas the second one depends on the ratio between the two most represented bases:

Allele ratio = 2nd most represented base / (1st + 2nd most represented bases)

The threshold applied to this ratio can be modified and is subject to feedback from the downstream quality controls (see 2.10. p.35). When the ratio is above the given threshold, the second base is different from the first one and, therefore, the site is recognized as a heterozygous SNP. Otherwise the two bases are the same, the site being either homozygous for the reference or a homozygous SNP. The other variable considered for the “genotype guess” is the coverage. When the coverage for the given position is below a certain threshold (subject to the same feedback specified above), the whole position is regarded as “impossible to be genotyped” and the genotype is assigned as “NN”. In the present study the best parameters found are a coverage of 20x and a ratio of 0.20.

Once obtained, the CNS and genotype files and the corresponding conversion table provide all the information necessary for the downstream coverage and quality control analysis, SNP calling and so on.

- Create a summary sheet for each CNS file: this step produces several files summarizing the coverage at each base for each sample and other parameters describing the performance of the run.

- Call all the positions different from the reference sequence: using the VarFilter option in the fastq2cns.pl script, each CNS file is reduced to just those positions which are either heterozygous or homozygous for a base different from the reference. Using the same ratio and coverage values used to create the genotype files, it is possible to create a “putative SNP genotype file”, listing the genotypes at all the filtered positions.
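The “genotype guess” and the non-reference filter described in the steps above can be sketched in Python as follows. This is an illustrative reading of the rules in the text, not the Appendix 7 script nor the VarFilter option itself; the function names are hypothetical, and the coverage is assumed here to be the sum of the per-base read counts at the position.

```python
from collections import Counter
from typing import Mapping


def guess_genotype(base_counts: Mapping[str, int],
                   min_coverage: int = 20,
                   min_ratio: float = 0.20) -> str:
    """Return a two-letter genotype, or "NN" when coverage is too low.

    base_counts maps each observed base to the number of reads showing it.
    Defaults are the best parameters reported in the text (20x, 0.20).
    """
    coverage = sum(base_counts.values())  # assumption: coverage = total read count
    if coverage < min_coverage:
        return "NN"
    ranked = Counter(base_counts).most_common(2)
    first_base, first_count = ranked[0]
    if len(ranked) == 1:
        return first_base * 2
    second_base, second_count = ranked[1]
    # Allele ratio = 2nd most represented / (1st + 2nd most represented)
    ratio = second_count / (first_count + second_count)
    if ratio > min_ratio:
        return first_base + second_base  # heterozygous site
    return first_base * 2                # homozygous site


def is_non_reference(genotype: str, reference_base: str) -> bool:
    """True for heterozygous sites or sites homozygous for a non-reference base."""
    return genotype != "NN" and genotype != reference_base * 2
```

With these defaults, 25 reads of A and 10 reads of G give a ratio of 10/35 and the genotype AG, whereas 30 reads of A and 5 of G give a ratio of 5/35 and the genotype AA.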

2.9. Coverage Analysis

The coverage for each base is the first parameter to be obtained from a re-sequencing study, because it gives a straightforward idea of the success or failure of all the sequencing steps (PCRs, PCR purification, library preparation and sequencing itself). Theoretically, the expected average coverage should be estimated from the yield of a lane divided by the number of samples and the length of the region sequenced:

Expected coverage = yield of a lane / (n. of samples x length of the region sequenced)

Of course a number of unpredictable factors concur to decrease that yield but, still, the range of coverage per base can estimate the yield of the above-mentioned processes. Furthermore, the average coverage for each sequenced region provides a measure of the reliability of the information that can be obtained for that portion of the data.

The traditional representation of the coverage is a plot showing the sequenced bases on the X axis and the depth of coverage on the Y axis. Due to the high number of regions analyzed in the present study, this kind of plot does not provide all the relevant information at a glance. Therefore this plot is accompanied, in the Results section (pp. 49-52), by three figures: one showing the mean coverage for each amplicon, the others counting the number of bases above a given coverage threshold (10x and 20x) in each amplicon. The aim of those three figures is to give a clear idea of the yield of a single amplicon for each individual (average) together with a count of the bases that can be effectively used for meaningful analysis. The scripts developed for the above are available in Appendix 7 p.100.
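A minimal Python sketch of the per-amplicon summaries described in this section is given below. It is not the Appendix 7 script; the function names are hypothetical, a base is counted when its depth is equal to or above the threshold (an assumption about how “above” is meant), and positions with zero coverage must be supplied explicitly by the caller because they are absent from the CNS files.

```python
from typing import Dict, Sequence


def expected_coverage(lane_yield_bases: float, n_samples: int,
                      region_length: int) -> float:
    """Lane yield divided by the number of samples and the sequenced length."""
    return lane_yield_bases / (n_samples * region_length)


def summarize_amplicon(depths: Sequence[int],
                       thresholds: Sequence[int] = (10, 20)) -> Dict[str, object]:
    """Mean depth of one amplicon and the number of bases at or above each threshold."""
    mean_depth = sum(depths) / len(depths) if depths else 0.0
    bases_at_or_above = {t: sum(1 for d in depths if d >= t) for t in thresholds}
    return {"mean": mean_depth, "bases_at_or_above": bases_at_or_above}
```

For instance, the depths `[0, 5, 10, 20, 40]` give a mean coverage of 15.0, with three bases at 10x or more and two bases at 20x or more.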

2.10. Quality Controls

The CEU individuals included in the present study served as a control population in two ways: besides being a European population not exposed to low oxygen pressure, their HapMap SNP genotypes are available in the public databases (www.hapmap.org) and, in addition, some of the samples are included in the Encode3 project (Birney, Stamatoyannopoulos et al. 2007).

Therefore, comparing the data obtained for those individuals with the data available online gives an accurate estimate of the quality and reliability of the re-sequenced data.

The procedures to obtain the false negative and false positive rates, together with the sequencing score, are listed below.

2.10.1. Discordance Rate

The genotypes at the known HapMap SNPs positions are extracted from the CEU genotype files (obtained with variable “ratio” and “coverage” values) and compared with the known HapMap genotypes for the same individuals (script available in Appendix 7 p.100). All the possible comparisons fall into the 16 different classes defined in the “discordance rate table” (see Results p.53). The discordance rate is defined as:

Discordance rate = n. of mismatches / all the genotyped SNPs

( 0 = all the SNPs genotyped match with the genotypes available on HapMap)

A “mismatch” is defined as a genotype that differs between the two datasets. Positions where HapMap, the re-sequenced dataset, or both have no information available (i.e. “NNs”) are not counted in the total number of SNPs.

The discordance rate estimation and the creation of the genotype files are part of a feedback process aiming to minimize the rate itself. As stated before, the genotype files are obtained in relation to the variable parameters “ratio” and “coverage”. To minimize the discordance, various comparisons are performed using different genotype files obtained with different sets of parameters. The set giving the lowest error rate in a CEU comparison is adopted to get the genotypes for all the other samples. In the present study, the best set of parameters found is a coverage of 20x and a ratio between alleles of 0.20, producing a discordance rate of 7.56 x 10^-3.
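The discordance rate defined above can be computed with a short Python function. This is a sketch, not the Appendix 7 script; the function name and the dictionary layout (position mapped to a two-letter genotype) are hypothetical, and genotypes are compared regardless of allele order (so “AG” equals “GA”), which is an assumption.

```python
from typing import Mapping


def discordance_rate(seq_genotypes: Mapping[int, str],
                     hapmap_genotypes: Mapping[int, str],
                     missing: str = "NN") -> float:
    """Mismatches divided by the SNPs genotyped in both datasets."""
    compared = 0
    mismatches = 0
    for position, hapmap_gt in hapmap_genotypes.items():
        seq_gt = seq_genotypes.get(position, missing)
        if missing in (seq_gt, hapmap_gt):
            continue  # no information in at least one dataset: not counted
        compared += 1
        if sorted(seq_gt) != sorted(hapmap_gt):  # allele order is ignored
            mismatches += 1
    if compared == 0:
        raise ValueError("no SNP is genotyped in both datasets")
    return mismatches / compared
```

In the feedback process, a function of this kind would be evaluated once for each candidate pair of “coverage” and “ratio” values, keeping the pair that gives the lowest rate.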

2.10.2. False Negative And False Positive Rates

Once the genotype files that minimize the discordance rate have been obtained, the list of all the “non-reference” (NR) SNPs in the CEU samples is produced (using the VarFilter option, as described on p.34). A non-reference SNP is defined as a position genotyped as either heterozygous or homozygous for an allele different from the reference sequence. The same is done with the data downloaded from the HapMap and Encode3 websites.

The three lists (non-reference positions found in the HapMap, Encode3 and re-sequenced (SEQ) CEU datasets) are compared to see how they overlap.

In particular, all the HapMap NR SNP positions are compared against all the SEQ NR positions to see how many sites are shared between the two lists and how many are not. From this comparison it is possible to estimate the False Negative rate (FN):

FN = private HapMap NR SNPs / all HapMap NR SNPs

where “private” SNPs are those present only in the HapMap database and, therefore, not included in the SEQ data.

The raw False Positive rate (rFP) is estimated by comparing the NR SNPs of the Encode3 dataset against the subset of the SEQ NR SNPs that falls within the Encode regions. This FP is reported as “raw” because it still includes the Encode FN rate.

rFP = private SEQ NR SNPs in Encode Regions / all SEQ SNPs

Finally, the comparison between the HapMap and Encode NR SNPs gives the Encode FN rate (eFN) and the true FP (FP) rate for SEQ SNPs:

eFN = private HapMap NR SNPs / all HapMap NR SNPs in Encode Regions

FP = rFP – eFN.
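The four rates can be expressed as set operations on lists of NR positions, as in the Python sketch below. It is only one possible reading of the formulas above: the function names are hypothetical, the denominator of rFP is taken to be the SEQ NR SNPs falling within the Encode regions, and eFN is computed from the HapMap NR SNPs in the Encode regions that are absent from the Encode3 list; both choices are assumptions.

```python
from typing import Dict, Set


def private_fraction(query: Set[int], other: Set[int]) -> float:
    """Fraction of positions in `query` that are absent from `other`."""
    if not query:
        raise ValueError("the query list of positions is empty")
    return len(query - other) / len(query)


def error_rates(hapmap_nr: Set[int], seq_nr: Set[int],
                encode_nr: Set[int], encode_positions: Set[int]) -> Dict[str, float]:
    """Return FN, rFP, eFN and FP from sets of non-reference positions.

    encode_positions is the set of all positions covered by the Encode regions.
    """
    fn = private_fraction(hapmap_nr, seq_nr)
    seq_in_encode = seq_nr & encode_positions
    hapmap_in_encode = hapmap_nr & encode_positions
    raw_fp = private_fraction(seq_in_encode, encode_nr)
    encode_fn = private_fraction(hapmap_in_encode, encode_nr)
    return {"FN": fn, "rFP": raw_fp, "eFN": encode_fn, "FP": raw_fp - encode_fn}
```

Positions are represented here as plain integers on the template sequence; chromosome and coordinate pairs would work equally well as set elements.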

2.10.3. Further Filtering

In order to decrease the slightly high False Positive rate, the following empirical approach was followed.

Each NR position comes in the genotype file together with its quality parameters (q1, q2 and q3, as previously described). Analyzing the distribution of those values among the “good” NR SNPs, a range of values was obtained. A “good NR SNP” is a NR SNP with the same genotype in both the SEQ and HapMap datasets. The ranges of q1, q2 and q3 values reported by those SNPs are therefore used to further filter the output of the VarFilter script. As a consequence of this step, the FP rate actually decreases by one order of magnitude, but the FN rate increases considerably (see Results p.57). This further filtering approach has been discarded at the present stage of the research.
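The empirical filter can be illustrated with the Python sketch below. The text does not specify how the range of each score was derived, so the minimum and maximum observed among the “good” NR SNPs are used here as an assumption; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Scores = Tuple[int, int, int]  # (q1, q2, q3) for one NR position


def quality_ranges(good_snps: Sequence[Scores]) -> List[Tuple[int, int]]:
    """Return the (min, max) of q1, q2 and q3 among the good NR SNPs."""
    if not good_snps:
        raise ValueError("at least one good NR SNP is required")
    return [(min(column), max(column)) for column in zip(*good_snps)]


def passes_filter(scores: Scores, ranges: Sequence[Tuple[int, int]]) -> bool:
    """True when every score lies within the range seen among good NR SNPs."""
    return all(low <= score <= high for score, (low, high) in zip(scores, ranges))
```

A candidate NR position would then be kept only if `passes_filter` returns True for its three scores.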

2.11. SNP Calling and Analysis

The parameters and quality score filters obtained in the previous steps are applied to generate the genotype and NR SNP files for every sample studied.

A list of all the different NR positions obtained is compiled and the genotype of each individual at each one of those positions is extracted.

The table reporting all the genotypes is shown in Results (p.58). This table includes, as expected, a series of “?” symbols where the information is unavailable for the given individual.

To fill the “?” gaps in the SNP table, the PHASE (Stephens and Donnelly 2003) software was used to impute missing genotypes using data from surrounding SNPs.

To get reliable results in the “PHASing process”, the SNP table is split into the various regions and each region is PHASEd separately for each single population.

Each single Population/Region PHASE output file can now be processed to get the summary statistics listed below, using a script developed by Ni Huang (WTSI) and available at the Evolutionary Team of the WTSI.

2.11.1. Summary Statistics

The following summary statistics are produced with the above-specified script: Fay and Wu's H; Fu and Li's D; Fu and Li's F; Fu's Fs; π; Tajima's D (Tajima 1989; Fu and Li 1993; Fu 1997; Fay and Wu 2000). All of them are different ways of testing for departure from neutrality. Indeed they can reject a neutral model, giving the direction (+ or -) of the deviation. Since such a deviation can be due either to a selective pressure or to demographic events, each case must be carefully discussed to disentangle the two components. For a complete description of these parameters see (McVean 2002; Schmegner, Hoegel et al. 2007) in addition to the relevant papers. For the purposes of this study, the summary statistics obtained are used just to ascertain the presence of those two phenomena. For a more detailed discussion, see Results and Discussion sections.
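As an example of how two of these statistics are obtained from phased haplotypes, the Python sketch below computes π (the average number of pairwise differences) and Tajima's D with the standard constants of Tajima (1989). It is not the script developed at WTSI; the function names are hypothetical, haplotypes are assumed to be equal-length strings without missing data, and D is returned as None when it is undefined (fewer than four haplotypes or no segregating sites).

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def pairwise_diversity(haplotypes: Sequence[str]) -> float:
    """Average number of differences over all pairs of haplotypes (pi)."""
    n = len(haplotypes)
    n_pairs = n * (n - 1) / 2
    total = sum(sum(a != b for a, b in zip(h1, h2))
                for h1, h2 in combinations(haplotypes, 2))
    return total / n_pairs


def segregating_sites(haplotypes: Sequence[str]) -> int:
    """Number of columns showing more than one allele (S)."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def tajimas_d(haplotypes: Sequence[str]) -> Optional[float]:
    """Tajima's D, or None when the statistic is undefined."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    return (pairwise_diversity(haplotypes) - s / a1) / sqrt(variance)
```

For the four toy haplotypes `AAAA`, `AAAT`, `AATA` and `ATAA`, every variant is a singleton, so π (1.5) is lower than S/a1 and D is negative.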

2.11.2. Synonymous And Non Synonymous SNPs

The SNP positions obtained can also be mapped to the genome to see whether they are Synonymous or Non Synonymous (NS) and whether or not they produce a relevant effect on protein function. The list of the SNPs found in the coding regions analyzed (namely, all the genes and some of the Encode regions) was kindly analyzed by Dr. Yuan Chen at EBI and mapped onto the annotated genome (exons, introns, UTRs, regulatory regions). For details, see Results (p.61-62).

Figure m2. Aspect of a CNS file line. For detailed description see above.
