1Algorithms for biological sequence Comparison and Alignment

(1)

1 Algorithms for biological sequence

Comparison and Alignment

Sara Brunetti,

Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche

University of Siena, Italy, sara.brunetti@unisi.it

(2)

2 Contents

Introduction and Motivations:

– What,why,when

Formulation of the problems and methods (how):

– Dot plots

– Global Sequence Alignment

– Semiglobal Sequence Alignment – Local Sequence Alignment

...

(3)

3 A piece of history…

1953: DNA structure, Watson e Crick

1975: development of the sequencing technique, Ranger, Maxam e Gilbert

1990: beginning of the Genome Project

https://www.genome.gov/25019885/online-education-kit-how-to-sequence-a-human- genome/

1. sequence the entire human genome producing the complete DNA transcript

2. produce maps of the genome showing locations of expressed sites

2000: Tony Blair and Bill Clinton announce the completion (2003) of the human genome sequencing

Cost: 3 10

⁹

euros

(Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at:www.genome.gov/sequencingcostsdata)

2002: High-throughput sequencing (HTS)

2008: 1000 genomes pilot project to study variation in the genome of individuals

2009: 1000 genomes phase 1 (1092 samples)

2011: 1000 genomes phase 2 (1700-2500 samples)

(4)

4 Genomic phase

– Amounts of data (Genome Project)

• Human genome: 3x10

⁹

bp;

• Contained in 10

¹⁵

cells

– Macromolecular structures: 77000 entries (2011, PDB)

– sequences: 333012760 records (2016, Genbank)

Bioinformatics:

study of problems of storage, organization and

distribution of large amounts of genomic data

(5)

5 Post-genomic phase

Computational biology: study of mathematical and combinatorial problems of modeling biological processes in the cell,

interpreting the data and providing theories about their biological relations

1. Data representation 2. Problem formulation

3. (Efficient) algorithm design

(6)

6 Data representation

Alphabet

– Italian:

– A B C D E F G H I L M N O P Q R S T U V Z

– English:

– A B C D E F G H I J K L M N O P Q R S T U V W Y Z

– DNA:

– A C G T (adenine, cytosine,guanine,thymine)

– Protein:

– A Q W E R T Y I P L K H F D S C V N M

– Binary:

– 0 1

(7)

7 Data representation: strings

DNA

String: ACCGTATATAAAAGGCCGGGTT Length: 22

suffix substring

prefix

(8)

8 Why? DNA information

(9)

9 Why sequences comparison?

• Learning about the functionality or structure of a protein without performing any experiments

Basic idea:

In biomolecular sequences (DNA, RNA, Aminoacid sequences) high similarity usually implies significant functional or structural similarity.

• Usually 25% sequence identity suffice two proteins to have

same 3-dim structure and almost identical function

(10)

10 • WARNING Sequence similarities implies functional similarities, but the reverse is not necessarily true!

• Beside sequences other levels to enquire: 3D protein

structure, cellular biochemistry or morphology etc., but

sequences are easier to study.

(11)

11

DNA Sequence Comparison: First Success Story

Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s

function

In 1984 Russell Doolittle and colleagues found similarities (87% identical) between cancer-causing gene and normal growth factor (PDGF) gene

Discovery: the cancer-causing gene works by stimulating

growth

(12)

12 Sequence similarity vs homology. Why?

The resemblance of two DNA sequences taken from different organisms can be explained by the theory that all contemporary genetic material has one ancestral ancient DNA.

According to this theory, during the course of evolution mutations occurred, creating differences between families of contemporary species.

Most of these changes are due to local mutations, each modifying the DNA sequence at a specific manner.

(13)

13 Similarities and differences?

Similarity: measures how much the strings are alike Its definition derives from the concept of one

ancestral ancient DNA

Distance: measures how much the strings differ

Its definition derives from the concept of mutations

-Differences between the human genome and the chimpanzee genome:

2%

-Differences betweeen human and worm: 50%

-Similarity between two humans: 99,9%

– But: genome length 3 10⁹ bp

– They can differ into 3 10⁶ positions

(14)

14 Sequence Comparison Problems

Informally: find which parts of sequences are alike and which parts are different.

1) Given two sequences over the same alphabet, about of the same length (10.000 char.), and almost equal, find the places where differences occur.

• Problem 1): the same gene is sequenced by two laboratories and they want to compare the results.

2) Given two sequences with a few hundred of char., find two similar sub-strings (one from each sequence).

3) Same as Problem 2), but one sequence is compared with thousand of others.

Problems 2), 3): in searching local similarities in large databases of bio-sequences (for ex.ignoring stretches of non-coding DNA).

(15)

15 Sequence Comparison Problems

4) Given two sequences with a few hundred of char., find a prefix of one similar to the suffix of the other.

• Problem 4): in the fragment assembly procedure in large scale DNA sequencing.

We introduce a single basic algorithmic idea to solve all the

above problems.

(16)

16 Pairwise alignment

How to compare two sequences?

•Alignment

•Similarity

(17)

17 Sequence alignment: an example

s: ATGCAGCTGAGCATCG

t: ATACAGCGAGTATCG

?

(18)

18 Perfect match

(19)

19 Mismatch

(20)

20 Delation/gap

(21)

21 Sequence alignment

Sequence 1 s=(s1,…,sm) of size m Sequence2 t=(t1,…,tn) of size n

An alignment (s’,t’) between s and t is obtained by insertion of spaces in arbitrary positions along the sequences so that they end up with the same size

s’1 s’2 … s’l t’1 t’2 … t’l

• (s’i,t’i) pair of characters in s and t or -

• Not allowed (-,-)

(22)

22 Number of alignments

How many ways s can be aligned with t?

s’1 s’2 … s’l t’1 t’2 … t’l

Max(n,m) <= l <= n+m:

s1 .. sm - - - - - - - t1 … tn

- s1 .-. sm - t1 ……….… tn

f(i,j)=#alignments of one sequence of i letters with another of j letters

f(n,m)=f(n-1,m)+f(n-1,m-1)+f(n,m-1) and f(n,n)(1+2)²ⁿ⁺¹ n as n

(23)

23 Es. two sequences of length 1000 have the following number of possible alignments:

f(1000,1000)(1+2)

²⁰⁰¹

1000=10

^767,4..

!!!!!!!!

(there are 10

⁸⁰

elementary particles in the universe)

(24)

24 Dot plot

Dot plot is a graphical way to visualize and identify regions of identity between two biosequences that allows their

comparison.

The dotplot capture in a single image the overall similarity not

concentrating on any one optimal region, but rather giving

you the “Gestalt” of the whole thing.

(25)

25 Dot plot

(26)

26 Dot plot

(27)

27 Dot plot

A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches.

Solution: use a window and a threshold

- compare character by character within a window (have to choose window size)

- require certain fraction of matches within window in order to

display it with a dot

(28)

28 Dot plot: window length 2

(29)

29 Dot plot: window length 3

(30)

30 Which do you think has a larger window?

(31)

31 Dot plot - path

(32)

32 More paths – Which one?

(33)

33 How to measure similarity or non-similarity?

V = ATATATAT W = TATATATA

distance: distance:

d(v, w)=8 d(v, w)=2

W = TATATATA

Just one shift Make it all line up

V = - ATATATAT

Hamming distance

always compares

i

^-th

letter of v with

i

^-th

letter of w

(34)

34 Edit distance

Mistakes in writing:

-a letter instead of another is written (subs);

-a letter more is inserted (ins);

-a letter is not written (del);

Distance:

d(a,b)=1; d(a,-)=d(-,a)=1; d(a,a)=0 -ATCCGAT

TG-CATAT

1 1 101 100=5

(35)

35 Global alignment

Given two sequences s and t of roughly the same length, determine the alignment of s and t with maximal (or minimal) score

AC –- GCTTTG - CATG –TAT-

(Needleman&Wunsch Algorithm '70s)

Motivation: the same gene is sequenced by two laboratories

and they want to compare the results

(36)

36 More about similarity and distance

(37)

37 Similarity and distance

Two approaches to comparing strings:

Similarity: measures how much the strings are alike

Its definition derives from the concept of one ancestral ancient DNA An alignment (s’,t’) of the strings s and t is obtained by inserting space

characters in them in such a way that:

1 |s’|=|t’|

2 Removal of - from s’ gives s 3 Removal of - from t’ gives t

4 For every i, either s’[i] or t’[i] is not – A scoring system (p,g) has members:

• p:AxA->R, g<0

• additive scoring

• sim(s,t)=max score(s’,t’)

(38)

38 Similarity and distance

Distance: measures how much the strings differ

Its definition derives from the concept of mutations A distance d on E is d:ExE->R:

1 d(x,x)=0 for all x in E and d(x,y)>0 for x<>y 2 d(x,y)=d(y,x) for all x,y in E

3 d(x,y)<=d(x,z)+d(y,z) for all x,y in E

An allignment is obtained by successive applications of a number of admissible operations transforming s into t

1 substitution a->b

2 insertion or deletion of any character (indel) A cost measure (c,h) has members:

c:AxA->R, h>0

(39)

39

When are similarity and distance algorithms equivalent?

When sequences are aligned by distance in global alignment, there is a similarity algorithm that gives the same set of optimal alignments, and vice versa

The measures are related by the formula:

p(a,b)=M-c(a,b) g=-h+M/2 dist(s,t)+sim(s,t)=M/2(|s|+|t|) Es.

Edit distance, M=0=> p(a,a)=0, p(a,b)=-1, g=-1;

M=2=> p(a,a)=2, p(a,b)=1, g=0;

Same set of optimal solutions, different scores.

Usually 0<=M<=max c(a’,b’)

(40)

40 Dynamic Programming Algorithm

Basic Idea of Dynamic Programming: a problem is solved taking advantage of the already solved sub-problems.

Each optimal alignment contains optimal alignments of the subproblems (example)

- GCTGATATAGCT GGGTGAT -TAGCT

– Additivity of the penalty function Three essential components:

– Recurrence relation – Tabular computation – Traceback

(41)

41 Dynamic Programming Algorithm –Recurrence relation

Sequence 1: s of size m Sequence 2: t of size n

s[i..j] sub-string from char i to char j of s.

• M(i,j) is the score of the best alignment between s[1..i] and t[1..j]

• M(j,0) = M(0,j)=-2j

• M(m,n) is computed by solving the more general problem of computing M(i,j) for all i,j

No top-down approach, but bottom up

The computation is arranged in a (m+1) (n+1) array M

M[i,j-1] - 2 +1, if s

_i

= t

_j

M[i,j]= max M[i-1,j-1] + p(i,j), p(i,j)=

M[i-1,j] -2 -1, if s

_i

 t

_j

(42)

42 Dynamic Programming Algorithm –Tabular computation s t A G C

A A A C

0 -2 -4 -2

-4 -1 0 -2 -6 -3 -2 -1 -8 -5 -4 -1

1 -4 -3

-4 1 ^{-3 -6}^-1 –1

M

row 0: comparison between t and an empty sequence.

column 0: comparison between s and an empty sequence

M[i,j] is computed by observing the 3 previous entries M[i-1,j-1], M[i,j-1]

and M[i-1,j].

M[i-1,j-1]: a new char of s and a new char of t are considered;

+1 is added in case of match and –1 in case of mismatch.

Align s[1..i-1] with t[1..j-1]

M[i,j-1]: a new char of the sequence t is considered corresponding to a space in s (-2).

Align s[1..i] with t[1..j-1] and match a space with tj

M[i-1,j]: a new char of the sequence s is considered corresponding to a space in t (-2).

Align s[1..i-1] with t[1..j] and match a space with si -6

(43)

43 Best Score

Trace back to find the best alignment(s)

solution2

AG -C AAAC

A -GC

solution1

AAAC

solution3

-AGC

AAAC

t A G C

A A A C

0 -2 -4 -2

-6 -3 -2 -1 -8 -5 -4 -1 -6 1 –1 -3 -4 -1 0 -2

-1

Dynamic Programming Algorithm- Traceback

s

(44)

44 Algorithm Similarity

input: S,T,m,n output: M

for i 1 to m do M(i,0) i g for j 0 to n do M(0,j)  j g for i  1 to m do for j 1 to n do

M(i,j)  max( M(i-1,j)+g

M(i-1,j-1)+p(i,j) M(i,j-1)+g )

return M

Complexity: O(nm)

(45)

45 Align (i, j, len)

input: i, j, array M obtained by Similarity Alg.

output: alignment in align-s align-t, vectors of length len if i = 0 and j =0 then len = 0;

else if i > 0 and M[i, j] = M[i-1, j] + c

_s

then Align (i-1,j,len);

len = len+1; align-s = s[i]; align-t =‘ – ‘;(space)

else if i > 0 and j>0 and M[i, j] = M[i-1, j-1] + c(i,j) then Align (i-1,j-1,len);

len = len+1; align-s = s[i]; align-t = t[j];

else Align (i,j-1,len);

len = len+1; align-s =‘ – ‘(space); align-t = t[j];

•First call Align(m,n,len)

•max (|s|,|t|)  len  m + n

• Algorithm Align finds solution1.

• By inverting the order of the if statements it is possible to find

the other solutions.

(46)

46 Complexity

• Algorithm Similarity takes O( m n) time and space

• Algorithm Align takes O (m + n) time:

Let h=m+n

T(h) = k for h  2

T(h) = T(h-1) + k, for h > 2 (k constant) T(h) = O(h) = O(m+n)

• Algorithm Similarity can be refined to run with O(m+n) space.

In a “row by row” computation store the last and the current row only.

• Algorithm Align can be designed to run with O(m+n) space with a divide and conquer strategy.

It is not a trivial task!

The basic algorithm Similarity can be modified to solve a

variety of different problems!!

(47)

47 Semi-global alignment

(48)

48 Semi-global Comparison

Find the best fit of a short sequence t of size n into a larger sequence s of size m

s1 sk sl sm s:

t:

The solution to this problem as formulated above will take time proportional to



_k=1..m



_l=k..m

n(l-k)=O(nm

³

)

(49)

49 Semi-global Comparison

Ignore the spaces at the beginning and at the end of a sequence.

Problem: Find the highest score semi-global alignment between t and substring (prefix of a suffix) of s.

s: CAGCA -CTTGG ATTCTCGG t: - - - CAGCTTGG(- - - -

1. Ignore final spaces.

Find the best score between t and a prefix of s.

M[i,j] of problem1 contains the best score between s[1..i] and t[1..j], hence take the maximum value M[i,n] in the last column n.

There is no need to reach the last row.

(50)

50 Semi-global Comparison

s: CAGCA -CTTGG ATTCTCGG t: - - -)CAGCTTGG - - - -

2. Ignore initial spaces

Find the best alignment between t and a suffix of s.

M[i,j] now contains the best score between t[1..j] and a suffix of s[1..i], hence in the first column we have all zeroes.

C A G C ….

C

0

A

0

G

0

C

0 1

A

Initial char !!

Join solutions 1 and 2 to solve

semi-global comparison!!

(51)

51 Local Alignment

(52)

52 Local Alignment

Find the best fit between a sub-string of s and a sub-string of t.

s1 sk si sm s:

t:

t1 th tj tn

Motivation: Ignore streches of non-coding DNA

(53)

53 Local Alignment Algorithm – Smith&Waterman Global Alignment

Local Alignment—better alignment to find conserved segment

--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C

tccCAGTTATGTCAGgggacacgagcatgcagagac ||||||||||||

aattgccgccgtcgttttcagCAGTTATGTCAGatc

(54)

54 Local Alignment: Example

Global alignment

Local alignment

Compute a “mini”

Global Alignment to

get Local

(55)

55 Local Alignment Algorithm – Smith&Waterman '80

The LA problem is still solved computing M.

• M[i,j] holds the value of the best alignment between a suffix of s[1..i] and a suffix of t[1..j].

• The first row and the first column are initialized with zeros.

(56)

56 M[i,j-1] - 2 +1 if s

_i

= t

_j

M[i,j]= max M[i-1,j-1] + p(i,j), p(i,j)=

M[i-1,j] -2 -1 if s

_i

 t

_j

0 Local Alignment

For any entry M[i,j] there exists always the alignment

between the empty suffixes of s[1..i] and of t[1..j] with score 0

At the end choose the entry M[i,j] with maximal score in any

position. Start align tracing back, as before, from there until

you find a value 0.

(57)

57 Example

H E A G A W G H E E

0 0 0 0 0 0 0 0 0 0 0

P 0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 5 0 5 0 0 0 0 0

W 0 0 0 0 2 0 20 12 4 0 0

H 0 10 2 0 0 0 12 18 22 14 6

E 0 2 16 8 0 0 4 10 18 28 20

A 0 0 8 21 13 5 0 4 10 20 27

E 0 0 6 13 18 12 4 0 4 16 26

HEGAWGHEE

PAW HEAE

(58)

58 End free-space alignment - Motivation

Find the best fit of substrings of s and t, where at least one of these substrings must be a prefix of the original string and one must be a suffix?

Motivation: in the shotgun sequence assembly procedure, one has a large set of partially overlapping substrings that come from many copies of one original but unknown DNA sequences.

The problem is to use comparisons of pairs of substrings to infer the correct original string.

(59)

59 End free-space alignment

 

 



 

 













) , (

) , max (

) , (

) , ( max )

, (

) , ( max )

, (

2 ) 1 ,

(

2 ) , 1 (

) , ( )

1 ,

1 (

max )

, (

0 ) , 0 (

0 ) 0 , (

*

* 1

*

j m V

n i T V

S V

j m V j

m V

n i V n

i V

j i V

j i

V

T S p j

i V j

i V

j V

i V

n j m i

j i

(60)

60 Example

H E A G A W G H E E

0 0 0 0 0 0 0 0 0 0

P 0 -2 -1 -1 -2 -1 -4 -2 -2 -1 -1

A 0 -2 -2 4 -1 3 -4 -4 -4 -3 -2

W 0 -3 -5 -4 1 -4 18 10 2 6 -6

H 0 10 2 6 -6 -1 10 16 20 12 4

E 0 2 16 8 0 7 2 8 16 26 18

A 0 -2 8 21 13 5 3 2 8 18 25

E 0 0 4 13 18 12 4 4 2 14 24

HEGAWGHEE

PAW HEAE

(61)

61 Kinds of Alignment

Global Alignment INPUT: Two strings S and T of roughly the same length.

QUESTION: What is the similarity between the two?

Semi-global Alignment INPUT: Two strings S and T .

QUESTION: What is the similarity between a substring of S and T?

Local Alignment INPUT: Two strings S and T.

QUESTION: What is the similarity (difference) between a substring of S and a substring of T? What are these most similar substrings?

Ends free-space alignment INPUT: Two strings S and T of different length.

QUESTION: What is the similarity between substrings of S and T, respectively? where at least one of these substrings must be a prefix of the original string and one (not necessary the other) must be a suffix?

(62)

62 Complexity of Alignments

Problem Time complexity Space complexity

Global Alignment O(nm) O(n+m) (O(nm) to bt)

Semi-global Alignment O(nm) O(nm)

Local Alignment O(nm) O(nm)

Ends free-space alignment O(nm) O(nm)

The space complexity could be a critical bottleneck.

How we can improve such a complexity?

Linear-Space Alignment

Hirschberg’s algorithm -- Miller and Myers algorithm

(63)

63 Extensions to the basic algorithm

Hirschberg’s linear space method for alignment uses

a divide-et-conquer strategy

(64)

64 Gap penalty

(65)

65 Gap penalty function Gap: consecutive number (k>1) of spaces.

From Biology we know that when mutations are involved, gap of k spaces are more probable than k isolated spaces.

One concrete example is given by the c-DNA matching.

In the previous problems the cost w(k) of k internal consecutive spaces was proportional to k, w(k) = k g.

Now w(k) = h +kg where h + g is the cost of the first space of a gap and g the cost of the following ones, k>1.

CA---CTTGG

h+g

g g g g

gap

w(k) = h +5g

(66)

66 Attention!

The scoring system is no more additive, i.e. we cannot break an alignment in two parts and expect the total score to be the sum of the partial scores

AAC|- - -|A|ATTC C G|ACT|AC ACT|ACC|T| - - - |CGC| - -

The scoring of an alignment is done at the block level

(67)

67 Similarities with gap

We need three matrices a, b, c, with the following meaning:

a[i,j] = maximum score of an alignment between s[1..i] and t[1..j]

where s[i] is matched with t[j].

b[i,j] = maximum score of an alignment between s[1..i] and t[1..j]

that ends in a - aligned with t[j].

c[i,j] = maximum score of an alignment between s[1..i] and t[1..j]

that ends in s[i] aligned with a - .

Where a[i-1,j-1]

a[i,j] =p(i,j) + max b[i-1,j-1]

c[i-1,j-1]

a[i,j-1] -(h+g) a[i-1,j] -(h+g) b[i,j] =max b[i,j-1]-g c[i,j] =max (b[i-1,j]-(h+g) ) (c[i,j-1] -(h+g)) c[i-1,j]-g

First space

(68)

68 Initialization:

a[0,0] = 0, a[i,0] =-  for 0  i  m, a[0,j] = -  for 0j  n b[i,0] = -  for 0  i  m

b[0,j] = -(h+gj) for 0  j  n c[i,0] = -(h+gi) for 0  i  m

c[0,j] = -  for 0  j  n

a[m,n]

Final result Get the maximum among b[m,n]

c[m,n]

Trace back to obtain the optimal alignment, remembering the current position and which array belongs to.

Time O(mn)

Space 3(mn) = O(mn)

(69)

69 Other gap penalty models

• Constant.

• Affine.

• Convex:

– each additional space in a gap contributes less to the gap weight than the previous space (ex. Log(q))

– the problem is solvable in O(nm log(m)) time

• Arbitrary:

– Any gap weight function is acceptable

– the problem is solvable in O(nm (m+n)) time

(70)

70 References

Michael Waterman: "Introduction to Computational Biology"

[Chapman & Hall/CRC Statistics and Mathematics; ISBN 0412993910]

Pavel Pevzner: "Computational Molecular Biology - An

Algorithmic Approach" [The MIT Press (A Bradford Book);

ISBN 0262161974]

Dan Gusfield: "Algorithms on Strings, Trees and Sequences"

[Cambridge, 1997; ISBN 0521585198]

Richard Durbin, Sean Eddy, A. Krogh, G. Mitchison "Biological

Sequence Analysis: Probabilistic Models of Proteins and

Nucleic Acids" [Cambridge, 1997; ISBN 0521629713]

(71)