1 Algorithms for biological sequence
Comparison and Alignment
Sara Brunetti,
Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche
University of Siena, Italy, sara.brunetti@unisi.it
2 Contents
Introduction and Motivations:
– What,why,when
Formulation of the problems and methods (how):
– Dot plots
– Global Sequence Alignment
– Semiglobal Sequence Alignment – Local Sequence Alignment
...
3
A piece of history…
1953: DNA structure, Watson e Crick
1975: development of the sequencing technique, Ranger, Maxam e Gilbert
1990: beginning of the Genome Project
https://www.genome.gov/25019885/online-education-kit-how-to-sequence-a-human- genome/
1. sequence the entire human genome producing the complete DNA transcript
2. produce maps of the genome showing locations of expressed sites
2000: Tony Blair and Bill Clinton announce the completion (2003) of the human genome sequencing
Cost: 3 10
9euros
(Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at:www.genome.gov/sequencingcostsdata)2002: High-throughput sequencing (HTS)
2008: 1000 genomes pilot project to study variation in the genome of individuals
2009: 1000 genomes phase 1 (1092 samples)
2011: 1000 genomes phase 2 (1700-2500 samples)
4 Genomic phase
– Amounts of data (Genome Project)
• Human genome: 3x10
9bp;
• Contained in 10
15cells
– Macromolecular structures: 77000 entries (2011, PDB)
– sequences: 333012760 records (2016, Genbank)
Bioinformatics:
study of problems of storage, organization and
distribution of large amounts of genomic data
5 Post-genomic phase
Computational biology: study of mathematical and combinatorial problems of modeling biological processes in the cell,
interpreting the data and providing theories about their biological relations
1. Data representation 2. Problem formulation
3. (Efficient) algorithm design
6 Data representation
Alphabet
– Italian:
– A B C D E F G H I L M N O P Q R S T U V Z
– English:
– A B C D E F G H I J K L M N O P Q R S T U V W Y Z
– DNA:
– A C G T (adenine, cytosine,guanine,thymine)
– Protein:
– A Q W E R T Y I P L K H F D S C V N M
– Binary:
– 0 1
7 Data representation: strings
DNA
String: ACCGTATATAAAAGGCCGGGTT Length: 22
suffix substring
prefix
8
Why? DNA information
9 Why sequences comparison?
• Learning about the functionality or structure of a protein without performing any experiments
Basic idea:
In biomolecular sequences (DNA, RNA, Aminoacid sequences) high similarity usually implies significant functional or structural similarity.
• Usually 25% sequence identity suffice two proteins to have
same 3-dim structure and almost identical function
10
• WARNING Sequence similarities implies functional similarities, but the reverse is not necessarily true!
• Beside sequences other levels to enquire: 3D protein
structure, cellular biochemistry or morphology etc., but
sequences are easier to study.
11
DNA Sequence Comparison: First Success Story
Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s
function
In 1984 Russell Doolittle and colleagues found similarities (87% identical) between cancer-causing gene and normal growth factor (PDGF) gene
Discovery: the cancer-causing gene works by stimulating
growth
12 Sequence similarity vs homology. Why?
The resemblance of two DNA sequences taken from different organisms can be explained by the theory that all contemporary genetic material has one ancestral ancient DNA.
According to this theory, during the course of evolution mutations occurred, creating differences between families of contemporary species.
Most of these changes are due to local mutations, each modifying the DNA sequence at a specific manner.
13 Similarities and differences?
Similarity: measures how much the strings are alike Its definition derives from the concept of one
ancestral ancient DNA
Distance: measures how much the strings differ
Its definition derives from the concept of mutations
-Differences between the human genome and the chimpanzee genome:
2%
-Differences betweeen human and worm: 50%
-Similarity between two humans: 99,9%
– But: genome length 3 109 bp
– They can differ into 3 106 positions
14 Sequence Comparison Problems
Informally: find which parts of sequences are alike and which parts are different.
1) Given two sequences over the same alphabet, about of the same length (10.000 char.), and almost equal, find the places where differences occur.
• Problem 1): the same gene is sequenced by two laboratories and they want to compare the results.
2) Given two sequences with a few hundred of char., find two similar sub-strings (one from each sequence).
3) Same as Problem 2), but one sequence is compared with thousand of others.
Problems 2), 3): in searching local similarities in large databases of bio-sequences (for ex.ignoring stretches of non-coding DNA).
15 Sequence Comparison Problems
4) Given two sequences with a few hundred of char., find a prefix of one similar to the suffix of the other.
• Problem 4): in the fragment assembly procedure in large scale DNA sequencing.
We introduce a single basic algorithmic idea to solve all the
above problems.
16 Pairwise alignment
How to compare two sequences?
•Alignment
•Similarity
17 Sequence alignment: an example
s: ATGCAGCTGAGCATCG
t: ATACAGCGAGTATCG
?
18
Perfect match
19
Mismatch
20
Delation/gap
21 Sequence alignment
Sequence 1 s=(s1,…,sm) of size m Sequence2 t=(t1,…,tn) of size n
An alignment (s’,t’) between s and t is obtained by insertion of spaces in arbitrary positions along the sequences so that they end up with the same size
s’1 s’2 … s’l t’1 t’2 … t’l
• (s’i,t’i) pair of characters in s and t or -
• Not allowed (-,-)
22 Number of alignments
How many ways s can be aligned with t?
s’1 s’2 … s’l t’1 t’2 … t’l
Max(n,m) <= l <= n+m:
s1 .. sm - - - - - - - t1 … tn
- s1 .-. sm - t1 ……….… tn
f(i,j)=#alignments of one sequence of i letters with another of j letters
f(n,m)=f(n-1,m)+f(n-1,m-1)+f(n,m-1) and f(n,n)(1+2)2n+1 n as n
23
Es. two sequences of length 1000 have the following number of possible alignments:
f(1000,1000)(1+2)
20011000=10
767,4..!!!!!!!!
(there are 10
80elementary particles in the universe)
24 Dot plot
Dot plot is a graphical way to visualize and identify regions of identity between two biosequences that allows their
comparison.
The dotplot capture in a single image the overall similarity not
concentrating on any one optimal region, but rather giving
you the “Gestalt” of the whole thing.
25
Dot plot
26
Dot plot
27 Dot plot
A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches.
Solution: use a window and a threshold
- compare character by character within a window (have to choose window size)
- require certain fraction of matches within window in order to
display it with a dot
28
Dot plot: window length 2
29
Dot plot: window length 3
30
Which do you think has a larger window?
31
Dot plot - path
32
More paths – Which one?
33
How to measure similarity or non-similarity?
V = ATATATAT W = TATATATA
distance: distance:
d(v, w)=8 d(v, w)=2
W = TATATATA
Just one shift Make it all line up
V = - ATATATAT
Hamming distance
always compares
i
-thletter of v with
i
-thletter of w
34 Edit distance
Mistakes in writing:
-a letter instead of another is written (subs);
-a letter more is inserted (ins);
-a letter is not written (del);
Distance:
d(a,b)=1; d(a,-)=d(-,a)=1; d(a,a)=0 -ATCCGAT
TG-CATAT
1 1 101 100=5
35 Global alignment
Given two sequences s and t of roughly the same length, determine the alignment of s and t with maximal (or minimal) score
AC –- GCTTTG - CATG –TAT-
(Needleman&Wunsch Algorithm '70s)
Motivation: the same gene is sequenced by two laboratories
and they want to compare the results
36
More about similarity and distance
37 Similarity and distance
Two approaches to comparing strings:
Similarity: measures how much the strings are alike
Its definition derives from the concept of one ancestral ancient DNA An alignment (s’,t’) of the strings s and t is obtained by inserting space
characters in them in such a way that:
1 |s’|=|t’|
2 Removal of - from s’ gives s 3 Removal of - from t’ gives t
4 For every i, either s’[i] or t’[i] is not – A scoring system (p,g) has members:
• p:AxA->R, g<0
• additive scoring
• sim(s,t)=max score(s’,t’)
38 Similarity and distance
Distance: measures how much the strings differ
Its definition derives from the concept of mutations A distance d on E is d:ExE->R:
1 d(x,x)=0 for all x in E and d(x,y)>0 for x<>y 2 d(x,y)=d(y,x) for all x,y in E
3 d(x,y)<=d(x,z)+d(y,z) for all x,y in E
An allignment is obtained by successive applications of a number of admissible operations transforming s into t
1 substitution a->b
2 insertion or deletion of any character (indel) A cost measure (c,h) has members:
c:AxA->R, h>0
39
When are similarity and distance algorithms equivalent?
When sequences are aligned by distance in global alignment, there is a similarity algorithm that gives the same set of optimal alignments, and vice versa
The measures are related by the formula:
p(a,b)=M-c(a,b) g=-h+M/2 dist(s,t)+sim(s,t)=M/2(|s|+|t|) Es.
Edit distance, M=0=> p(a,a)=0, p(a,b)=-1, g=-1;
M=2=> p(a,a)=2, p(a,b)=1, g=0;
Same set of optimal solutions, different scores.
Usually 0<=M<=max c(a’,b’)
40 Dynamic Programming Algorithm
Basic Idea of Dynamic Programming: a problem is solved taking advantage of the already solved sub-problems.
Each optimal alignment contains optimal alignments of the subproblems (example)
- GCTGATATAGCT GGGTGAT -TAGCT
– Additivity of the penalty function Three essential components:
– Recurrence relation – Tabular computation – Traceback
41
Dynamic Programming Algorithm –Recurrence relation
Sequence 1: s of size m Sequence 2: t of size n
s[i..j] sub-string from char i to char j of s.
• M(i,j) is the score of the best alignment between s[1..i] and t[1..j]
• M(j,0) = M(0,j)=-2j
• M(m,n) is computed by solving the more general problem of computing M(i,j) for all i,j
No top-down approach, but bottom up
The computation is arranged in a (m+1) (n+1) array M
M[i,j-1] - 2 +1, if s
i= t
jM[i,j]= max M[i-1,j-1] + p(i,j), p(i,j)=
M[i-1,j] -2 -1, if s
i t
j42
Dynamic Programming Algorithm –Tabular computation s t A G C
A A A C
0 -2 -4 -2
-4 -1 0 -2 -6 -3 -2 -1 -8 -5 -4 -1
1 -4 -3
-4 1 -3 -6 -1 –1
M
row 0: comparison between t and an empty sequence.
column 0: comparison between s and an empty sequence
M[i,j] is computed by observing the 3 previous entries M[i-1,j-1], M[i,j-1]
and M[i-1,j].
M[i-1,j-1]: a new char of s and a new char of t are considered;
+1 is added in case of match and –1 in case of mismatch.
Align s[1..i-1] with t[1..j-1]
M[i,j-1]: a new char of the sequence t is considered corresponding to a space in s (-2).
Align s[1..i] with t[1..j-1] and match a space with tj
M[i-1,j]: a new char of the sequence s is considered corresponding to a space in t (-2).
Align s[1..i-1] with t[1..j] and match a space with si -6
43 Best Score
Trace back to find the best alignment(s)
solution2
AG -C AAAC
A -GC
solution1
AAAC
solution3-AGC
AAAC
t A G C
A A A C
0 -2 -4 -2
-6 -3 -2 -1 -8 -5 -4 -1 -6 1 –1 -3 -4 -1 0 -2
-1
Dynamic Programming Algorithm- Traceback
s
44 Algorithm Similarity
input: S,T,m,n output: M
for i 1 to m do M(i,0) i g for j 0 to n do M(0,j) j g for i 1 to m do for j 1 to n do
M(i,j) max( M(i-1,j)+g
M(i-1,j-1)+p(i,j) M(i,j-1)+g )
return M
Complexity: O(nm)
45
Align (i, j, len)
input: i, j, array M obtained by Similarity Alg.
output: alignment in align-s align-t, vectors of length len if i = 0 and j =0 then len = 0;
else if i > 0 and M[i, j] = M[i-1, j] + c
sthen Align (i-1,j,len);
len = len+1; align-s = s[i]; align-t =‘ – ‘;(space)
else if i > 0 and j>0 and M[i, j] = M[i-1, j-1] + c(i,j) then Align (i-1,j-1,len);
len = len+1; align-s = s[i]; align-t = t[j];
else Align (i,j-1,len);
len = len+1; align-s =‘ – ‘(space); align-t = t[j];
•First call Align(m,n,len)
•max (|s|,|t|) len m + n
• Algorithm Align finds solution1.
• By inverting the order of the if statements it is possible to find
the other solutions.
46 Complexity
• Algorithm Similarity takes O( m n) time and space
• Algorithm Align takes O (m + n) time:
Let h=m+n
T(h) = k for h 2
T(h) = T(h-1) + k, for h > 2 (k constant) T(h) = O(h) = O(m+n)
• Algorithm Similarity can be refined to run with O(m+n) space.
In a “row by row” computation store the last and the current row only.
• Algorithm Align can be designed to run with O(m+n) space with a divide and conquer strategy.
It is not a trivial task!
The basic algorithm Similarity can be modified to solve a
variety of different problems!!
47
Semi-global alignment
48
Semi-global Comparison
Find the best fit of a short sequence t of size n into a larger sequence s of size m
s1 sk sl sm s:
t:
The solution to this problem as formulated above will take time proportional to
k=1..m
l=k..mn(l-k)=O(nm
3)
49
Semi-global Comparison
Ignore the spaces at the beginning and at the end of a sequence.
Problem: Find the highest score semi-global alignment between t and substring (prefix of a suffix) of s.
s: CAGCA -CTTGG ATTCTCGG t: - - - CAGCTTGG(- - - -
1. Ignore final spaces.
Find the best score between t and a prefix of s.
M[i,j] of problem1 contains the best score between s[1..i] and t[1..j], hence take the maximum value M[i,n] in the last column n.
There is no need to reach the last row.
50
Semi-global Comparison
s: CAGCA -CTTGG ATTCTCGG t: - - -)CAGCTTGG - - - -
2. Ignore initial spaces
Find the best alignment between t and a suffix of s.
M[i,j] now contains the best score between t[1..j] and a suffix of s[1..i], hence in the first column we have all zeroes.
C A G C ….
C
0A
0G
0C
0 1A
Initial char !!
Join solutions 1 and 2 to solve
semi-global comparison!!
51
Local Alignment
52
Local Alignment
Find the best fit between a sub-string of s and a sub-string of t.
s1 sk si sm s:
t:
t1 th tj tn
Motivation: Ignore streches of non-coding DNA
53
Local Alignment Algorithm – Smith&Waterman Global Alignment
Local Alignment—better alignment to find conserved segment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
tccCAGTTATGTCAGgggacacgagcatgcagagac ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
54 Local Alignment: Example
Global alignment
Local alignment
Compute a “mini”
Global Alignment to
get Local
55
Local Alignment Algorithm – Smith&Waterman '80
The LA problem is still solved computing M.
• M[i,j] holds the value of the best alignment between a suffix of s[1..i] and a suffix of t[1..j].
• The first row and the first column are initialized with zeros.
56
M[i,j-1] - 2 +1 if s
i= t
jM[i,j]= max M[i-1,j-1] + p(i,j), p(i,j)=
M[i-1,j] -2 -1 if s
i t
j0
Local Alignment
For any entry M[i,j] there exists always the alignment
between the empty suffixes of s[1..i] and of t[1..j] with score 0
At the end choose the entry M[i,j] with maximal score in any
position. Start align tracing back, as before, from there until
you find a value 0.
57 Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
HEGAWGHEE
PAW HEAE
58
End free-space alignment - Motivation
Find the best fit of substrings of s and t, where at least one of these substrings must be a prefix of the original string and one must be a suffix?
Motivation: in the shotgun sequence assembly procedure, one has a large set of partially overlapping substrings that come from many copies of one original but unknown DNA sequences.
The problem is to use comparisons of pairs of substrings to infer the correct original string.
59
End free-space alignment
) , (
) , max (
) , (
) , ( max )
, (
) , ( max )
, (
2 ) 1 ,
(
2 ) , 1 (
) , ( )
1 ,
1 (
max )
, (
0 ) , 0 (
0 ) 0 , (
*
* 1
* 1
*
j m V
n i T V
S V
j m V j
m V
n i V n
i V
j i V
j i
V
T S p j
i V j
i V
j V
i V
n j m i
j i
60 Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0
P 0 -2 -1 -1 -2 -1 -4 -2 -2 -1 -1
A 0 -2 -2 4 -1 3 -4 -4 -4 -3 -2
W 0 -3 -5 -4 1 -4 18 10 2 6 -6
H 0 10 2 6 -6 -1 10 16 20 12 4
E 0 2 16 8 0 7 2 8 16 26 18
A 0 -2 8 21 13 5 3 2 8 18 25
E 0 0 4 13 18 12 4 4 2 14 24
HEGAWGHEE
PAW HEAE
61 Kinds of Alignment
Global Alignment INPUT: Two strings S and T of roughly the same length.
QUESTION: What is the similarity between the two?
Semi-global Alignment INPUT: Two strings S and T .
QUESTION: What is the similarity between a substring of S and T?
Local Alignment INPUT: Two strings S and T.
QUESTION: What is the similarity (difference) between a substring of S and a substring of T? What are these most similar substrings?
Ends free-space alignment INPUT: Two strings S and T of different length.
QUESTION: What is the similarity between substrings of S and T, respectively? where at least one of these substrings must be a prefix of the original string and one (not necessary the other) must be a suffix?
62 Complexity of Alignments
Problem Time complexity Space complexity
Global Alignment O(nm) O(n+m) (O(nm) to bt)
Semi-global Alignment O(nm) O(nm)
Local Alignment O(nm) O(nm)
Ends free-space alignment O(nm) O(nm)
The space complexity could be a critical bottleneck.
How we can improve such a complexity?
Linear-Space Alignment
Hirschberg’s algorithm -- Miller and Myers algorithm
63 Extensions to the basic algorithm
Hirschberg’s linear space method for alignment uses
a divide-et-conquer strategy
64
Gap penalty
65
Gap penalty function Gap: consecutive number (k>1) of spaces.
From Biology we know that when mutations are involved, gap of k spaces are more probable than k isolated spaces.
One concrete example is given by the c-DNA matching.
In the previous problems the cost w(k) of k internal consecutive spaces was proportional to k, w(k) = k g.
Now w(k) = h +kg where h + g is the cost of the first space of a gap and g the cost of the following ones, k>1.
CA---CTTGG
h+g
g g g g
gap
w(k) = h +5g
66
Attention!
The scoring system is no more additive, i.e. we cannot break an alignment in two parts and expect the total score to be the sum of the partial scores
AAC|- - -|A|ATTC C G|ACT|AC ACT|ACC|T| - - - |CGC| - -
The scoring of an alignment is done at the block level
67
Similarities with gap
We need three matrices a, b, c, with the following meaning:
a[i,j] = maximum score of an alignment between s[1..i] and t[1..j]
where s[i] is matched with t[j].
b[i,j] = maximum score of an alignment between s[1..i] and t[1..j]
that ends in a - aligned with t[j].
c[i,j] = maximum score of an alignment between s[1..i] and t[1..j]
that ends in s[i] aligned with a - .
Where a[i-1,j-1]
a[i,j] =p(i,j) + max b[i-1,j-1]
c[i-1,j-1]
a[i,j-1] -(h+g) a[i-1,j] -(h+g) b[i,j] =max b[i,j-1]-g c[i,j] =max (b[i-1,j]-(h+g) ) (c[i,j-1] -(h+g)) c[i-1,j]-g
First space