The Population Haplotyping problem

(1)

The Population

Haplotyping problem

(2)

10

11 01 10

01

00 11 11 10

00 10 00 10

10 0*

**

10 1*

11 *0

*0 NOTATION

: each SNP only two values in a population (bio).

Call them

0

and

1

. Also, call

*

the fact that a site is heterozygous

HAPLOTYPE: string over 0, 1 GENOTYPE: string over 0, 1, *

**where 0={0}, 1={1}, *={0,1}**

(3)

10

11 01 10

01 00

11 00

00 10 10

10 0*

**

10 1*

**

*0

0 + 0 = --- 0

1 + 1 = --- 1

0 + 1 + 1 = 0 = --- --- * * ALGEBRA OF HAPLOTYPES:

Homozygous sites Heterozygous (ambiguous) sites

(4)

1**0*

11101 10000 11100 10001 11001 10100 11000 10101

Phasing the alleles

For k heterozygous (ambiguous) sites, there are 2^k-1 possible phasings

(5)

THE PHASING (or HAPLOTYPING) PROBLEM

Given genotypes of k individuals, determine the phasings of all heterozygous sites.

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.

(6)

The input is GENOTYPE data

00011

11011

*1**1

**1 111

INPUT: G = { 111, ****1, 11011, *11, 00011 }

(7)

The input is GENOTYPE data

11011 11101

00011

00011 11101

11011 01101

11011 11011 00011

00011

11011

*1**1

**1 111

OUTPUT: H = { 11011, 11101, 00011, 01101}

INPUT: G = { 111, ****1, 11011, *11, 00011 }

Each genotype is resolved by two haplotypes We will define some objectives for H

(8)

-without objectives/constraints, the haplotyping problem would be

(mathematically)trivial

OBJECTIVES

**0*1  00001 11011

E.g., always put 0 above and 1 below

**1*0  10000 11011

-the objectives/constraints must be

“driven by biology”

(9)

4°) (parsimony): minimize |H|

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) Disease Association

OBJECTIVES

(10)

Obj: Clark’s rule

1st

(11)

1011001011 +

?????????? = 1**10011

known haplotype

h

known (ambiguos) genotype

g

Inference Rule

for a

compatible

^pair

h , g

(12)

1011001011 + 1101001110 = 1**10011

known haplotype

h

g

Inference Rule

for a

compatible

^pair

h , g

new (derived) haplotype

h’

We write

h + h’ = g

(13)

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes

2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’

4. set H = H + {h’} and G = G - {g}

5. end while

(14)

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

(15)

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

0000 1000

**00 11**

(16)

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

1100 0000

1000

**00 11**

(17)

1100 1111

SUCCESS 1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

0000 1000

**00 11**

(18)

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

0000 1000

**00 11**

(19)

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

0100 0000

1000

**00 11**

(20)

0100 FAILURE (can’t resolve 1122 )

1st Objective (Clark, 1990)

4. set H = H + {h’} and G = G - {g}

5. end while

0000 1000

**00 11**

(21)

4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.

The number of genotypes solved depends on order of application.

1st Objective (Clark, 1990)

OBJ: find order of application rule that leaves the fewest elements in G

(22)

The problem was studied by Gusfield

(ISMB 2000, and Journal of Comp. Biol., 2001)

- problem is APX-hard

- it corresponds to finding largest forest in a graph with haplotypes as nodes and arcs for possible derivations -solved via ILP of exponential-size

(practical for small real instances)

(23)

Obj: Perfect Phylogeny

2nd

(24)

- Parsimony does not take into account mutations/evolution of haplotypes

- parsimony is very relialable on “small”

haplotype blocks

- when haplotypes are large (span several SNPs, we should consider evolutionionary events and recombination)

- the cleanest model for evolution is the

perfect phylogeny

(25)

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

3rd objective is based on perfect phylogeny

(26)

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

has 2 legs

3rd objective is based on perfect phylogeny

has tail

flies

(27)

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

has 2 legs

But…a new species may come along so that no Perfect phylogeny is possible…

has tail

flies

(28)

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor 10

01 11 Human 1 0 0 Mouse 0 1 0 Spider 0 0 0 Eagle 1 0 1

two legs

tail

flies

(29)

Theorem: such matrix has p.p. iff there is not a 00 4x2 minor 10

01 11 Human 1 0 0 Mouse 0 1 0 Spider 0 0 0 Eagle 1 0 1 Mickey mouse 1 1 0

two legs

tail

flies

(30)

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)

(31)

0 1 * 0

* 1 0 *

* 0 * 0

(32)

0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 * 0

* 1 0 *

* 0 * 0

(33)

0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0

NOT a perfect phylogeny solution ! 0 1 * 0

* 1 0 *

* 0 * 0

(34)

0 1 * 0 0 1 0 * 0 0 0 *

(35)

0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1

A perfect phylogeny 0 1 * 0

0 1 0 * 0 0 0 *

(36)

Theorem: The Perfect Phylogeny

Haplotyping problem is polynomial

(37)

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types (forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way **  00 + 11 or **  01 + 10

- a complex visit of the graph decides how to phase free SNPs

(38)

Obj: Disease Association

3rd

(39)

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

(40)

00011

**0*011**

*1**1

001 111

INPUT: GD = { 111,*1**1,0*011 }, GH = {111,001,00011}**

111**

(41)

11011 11101

01101 00001

11011 01101

01011 00011

00011 00011

OUTPUT: H = {

11011,01011,00001,11111,11101,00011,01101

}

H contains H^D, s.t. each diseased has >=1 haplotype in H^D and each healty none

11001 11111

00011

**0*011**

*1**1

001 111

INPUT: GD = { 111,*1**1,0*011 }, GH = {111,001,00011}**

111**

(42)

(43)

Theorem 1 is proved via a reduction from 3 SAT

Theorem 2 has a mathematical proof (coloring argument) with little relation to biology:

There is R (depending on input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

(44)

(45)

Obj: Max Parsimony

 See separate slides…

4th

(46)

Summary:

- haplotyping in-silico needed for economical reasons - several objectives, all biologically driven

- nice combinatorial problems

(mostly due to binary nature of SNPs)

- these problems are technology-dependant and may

become obsolete (hopefully after we have retired)

(47)

011 101

111 111

011 000

010

001 010

011

111 111

022

222

012

221

011 111 022

211 022 012

012 222

(48)

minimize |H|

2nd Objective (parsimony) :

(49)

1. The problem is APX-Hard

Reduction from VERTEX-COVER

(50)

A

B

C

D E

(51)

A

B

C

D E

A B C D E *

(52)

A

B

C

D E

A B C D E * AB

BC AE

DE AD

(53)

A

B

C

D E

A B C D E * AB

BC AE

DE AD A B C D E

(54)

A

B

C

D E

A B C D E * AB 2 2

BC 2 2

AE 2 2 DE 2 2 AD 2 2

A B C D E

(55)

A

B

C

D E

A B C D E * AB 2 2

BC 2 2

AE 2 2 DE 2 2 AD 2 2

A 0

B 0

C 0

D 0

E 0

(56)

A

B

C

D E

A B C D E *

AB 2 2 2 BC 2 2 2 AE 2 2 2 DE 2 2 2 AD 2 2 2 A 0 0 B 0 0 C 0 0 D 0 0 E 0 0

(57)

A

B

C

D E

A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0

(58)

A

B

C

D E

G = (V,E) has a node cover X of size k  there is a set H of |V | + k haplotypes that explain all genotypes

(59)

A

B

C

D E

(60)

A

B

C

D E

A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0 A’ 0 1 1 1 1 1 B’ 1 0 1 1 1 1 E’ 1 1 1 1 0 1

(61)

A basic ILP formulation

(62)

Expand your input G in all possible ways

220 120 022

(63)

010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010

220 120 022

(64)

010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010

220 120 022

xh

 ^h

₁

^{, h}

₂





x h

 

y

_h _h

2 1,

(65)

The resulting Integer Program (IP1):

(66)

(67)

Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations

(68)

(69)

The input is GENOTYPE data

11011 11101 00011 11101

11011 01101

11011 11011 00011

00011

OUTPUT: H = { 11011, 11101, 00011, 01101}

Each genotype is explained by two haplotypes We will define some objectives for H

INPUT: G = { 111, ****1, 11011, *11, 00011 }

**1 111

OOO11

11O11

*1**1

(70)

1st Objective (open research problem):

minimize |H|

2nd Objective based on inference rule:

(71)

1st Objective (parsimony) :

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)

(72)

2nd Objective based on inference rule:

(73)

xoxxooxoxx +

****** = x??xoox?x?

known haplotype

h

g

Inference Rule

(74)

xoxxooxoxx + xxoxooxxxo = x??xoox?x?

known haplotype

h

g

h’

Inference Rule

(75)

xoxxooxoxx + xxoxooxxxo = x??xoox?x?

known haplotype

h

g

h’

We write

h + h’ = g

g

^and

h

^{must be}

compatible

to derive

h’

Inference Rule

(76)

2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

(77)

2nd Objective (Clark, 1990)

5. end while

(78)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

(79)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

xxoo

(80)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

xxoo xxxx

SUCCESS

(81)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

(82)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

oxoo

(83)

2nd Objective (Clark, 1990)

5. end while

oooo xooo

??oo xx??

oxoo FAILURE (can’t resolve xx?? )

OBJ: find order of application rule that leaves the fewest elements in G

(84)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

(85)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

1. expand genotypes

(86)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

xxxox xxxoo xxoox xxooo xoxox xooox xoxoo

xoooo

1. expand genotypes

(87)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

xxxox xxxoo xxoox xxooo xoxox xooox xoxoo

xoooo

2. create (h, h’) if exists g s.t. h’ can be derived from g and h

1. expand

genotypes 3. Largest number of nodes in forest rooted at unambiguos genotpes =

= largest number of ambiguous genotypes resolved

Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

(88)

3rd Objective (open research problem) Disease Detection:

oooxx

??oxx

?x??x

????x xx??x

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

(89)

3rd Objective (open research problem) Disease Detection:

xxoxx xxxox

oooxx

oooxx xxxox

xxoxx oxxox

xxoxx oooxx oooxx

oooxx

??oxx

?x??x

????x xx??x

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased has one haplotype in H’ and each healty none

minimize | H|

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

(90)

Genome Rearrangements and

Evolutionary Distances

(91)

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………ttgatggattga

tcgattatggat………ttttgatatcca

Genomes evolve by means of

• Insertions

• Deletions

• Inversions

• Transpositions

• Translocations of DNA regions

(92)

(93)

deletion

(94)

deletion insertion

(95)

deletion insertion

translocation

(96)

deletion insertion

translocation

inversion

(97)

deletion insertion

translocation

inversion

transposition

(98)

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)

5 6 4 8 3 2 1 9 7

Example:

We focus on inversions, that are the most important in Nature

1 2 3 8 4 6 5 9 7

1 2 3 8 4 5 6 9 7

1 2 3 6 5 4 8 9 7

1 2 3 6 5 4 8 7 9

1 2 3 4 5 6 8 7 9

1 2 3 4 5 6 7 8 9

(99)

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transposition…)

+5 +6 -4 -8 -3 -2 -1 -9 +7

Example:

We focus on inversions, that are the most important in Nature

+1 +2 +3 +8 +4 -6 -5 -9 +7 +1 +2 +3 +8 +4 +5 +6 -9 +7 +1 +2 +3 -6 -5 -4 -8 -9 +7 +1 +2 +3 -6 -5 -4 -8 -7 +9 +1 +2 +3 +4 +5 +6 -8 -7 +9 +1 +2 +3 +4 +5 +6 +7 +8 +9

There is also a SIGNED VERSION of the problem !

(100)

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

(101)

5 7 8 2 1 4 3 6 9

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0 10

(102)

5 7 8 2 1 4 3 6 9

The concept of breakpoint

Breakpoint at position i if | p(i) - p(i+1) | > 1

0 10

d(p) = inversion distance b(p) = # breakpoints

TRIVIAL BOUND: d(p) >= b(p) / 2 Example: d(p) >= 6 / 2 = 3

(103)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(104)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(105)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(106)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(107)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

10 4 6

Each node has degree...

0 2 or 4 …

hence the graph can be decomposed in cycles!

(108)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

Alternating cycle decomposition

(109)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(110)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

c(p) = max # cycles in alternating decomposition

VERY STRONG BOUND : d (p) >= b(p) - c(p)

Example: c(p)= 2 and d (p) >= 6 - 2 = 4

(111)

The Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10 The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition

A variable

x

^C for each cycle (exponential # of vars…) A constraint

S

x

^C = 1 for each edge e Objective: maximize

S

^C

^x

^C

C containing e

(112)

max S ^x

^C

S ^x

^C = 1 for all edges eC C\ni e

x

^C\in {0,1} for all alt. cycles C PRIMAL

min S ^y

^e

S ^y

^e <= 1 for all alt. Cycles Ce

e\in C

y

^e\in R for all edges e DUAL

(113)

max S ^x

^C

S ^x

^C = 1 for all edges eC C\ni e

x

^C\in {0,1} for all alt. cycles C PRIMAL

min S ^y

^e

S ^y

^e <= 1 for all alt. Cycles Ce

e\in C

y

^e\in R for all edges e DUAL

(114)

5 7 8 2 1 4 3 6 9 0

10

Pricing out the cycles for which y*(C) < 1

(115)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Split the graph in two copies

(116)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Connect twins

(117)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles

(118)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

(119)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

(120)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

(121)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

(122)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

The weight of the matching is the y*-weight of the cycles

.2 .4

.5 1

.6

0

(123)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Forcing a cycle to use a certain node

.2 .4

.5 1

.6

100000

(124)

- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Take days for n=50

CLN can solve for n=200. Takes few seconds (say 5) for n=100 NP-hard problem practically solved to optimality!

(125)

Statistical view of evolution

• Genome evolve by random inversions

• It’s like a random walk on a huge graph with an edge for each permutation an edge for each inversion

• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)

• We want to find the most likely number of inversions that lead from (1 2 … n ) to p

• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions

(126)

Let B(k) be the (r.v.) number of breakpoint after k random inversions from (1..n) Given a p obtained by h random inversions from (1 … n ) we want to estimate h The inversion distance is only a lower bound: h >= d(p) but the gap could be big

We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as close as possible to b(p) (maximum likelihood). CL ,2000, have shown:

Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions

E[B(k)] = ( n - 1 ) ( ^{1 -} ⁽ ^{n - 3} _{n - 1} )

^k

)

(127)

Example: n = 200, k (u.a.r. in 1…n) inversions

8 8 8 16

19 19 19 34

68 67 67 98

69 73 68 104

73 79 73 109

85 91 83 120

86 85 83 115

87 90 84 119

118 117 109 138

184 184 135 168

k k’ d(p) b

(128)

Protein Structure Alignments: the Maximum Contact Map Overlap

Problem

(129)

A Protein is a complex molecule with a primary, linear structure (a sequence of aminoacids) and a 3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design^problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock^{onto it.}

(130)

Motivation:

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function) - Search in 3D data bases

- Protein Classification and Evolutionary Studies - ...

Problem: Align two 3D protein structures

(131)

Contact Maps

(132)

Unfolded protein

CONTACT MAPS

(133)

Folded protein = contacts

CONTACT MAPS

(134)

Contact map = graph

CONTACT MAPS

(135)

CONTACT MAPS

Contact map = graph

OBJECTIVE: align 3d folds of proteins = align contact maps

(136)

Contact Map Alignments

(137)

Non-crossing Alignments

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2

(138)

The value of an alignment

(139)

The value of an alignment

(140)

The value of an alignment

(141)

Value = 3

The value of an alignment

(142)

Value = 3

We want to maximize the value

The value of an alignment

(143)

NP-Hard

The value of an alignment

(144)

Integer Programming

Formulation

(145)

Integer Programming Formulation

0-1 VARIABLES

y_ef for e and f contacts

e

f

y_ef

(146)

Integer Programming Formulation

0-1 VARIABLES

y_ef + y_e’f’<= 1 y_ef for e and f contacts

e

f

y_ef

CONSTRAINTS e

f

e’

f’

(147)

Integer Programming Formulation

0-1 VARIABLES

y_ef + y_e’f’<= 1 y_ef for e and f contacts

e

f

y_ef

CONSTRAINTS e

f

e’

f’

OBJECTIVE max

S

^e

S

^f^yef

(148)

Independent Set Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing

• an edge for each pair of incompatible sharings

e

f

e’

f’’ f’

e’’

e f

e’

f’

e’’

f’’

(149)

Independent Set Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing

• an edge for each pair of incompatible sharings

e

f

e’

f’’ f’

e’’

e f

e’

f’

e’’

f’’

|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each) The best exact algorithm for independent set can solve for at most a few hundred nodes

(150)

Node to Node Variables

New variables x provide an easy check for the non-crossing conditions

NEW VARIABLES x_ij for i and j residues

e

f

y_ef i

j x_ij

(151)

Node to Node Variables

NEW VARIABLES x_ij for i and j residues

e

f

y_ef

NEW CONSTRAINTS i

j i’

j’

x_ij + x_i’j’<= 1

i j

x_ij

(152)

Node to Node Variables

NEW VARIABLES

y_(ip)(jq) <= x_ij and y_(ip)(jq)<= x_pq

x_ij for i and j residues

e

f

y_ef

NEW CONSTRAINTS i

j i’

j’

x_ij + x_i’j’<= 1

i j

x_ij

i j

p q

(153)

Clique Constraints

Variables x define a graph G^x:

• A node for each line

• An edge between each pair of crossing lines i

j i’

j’

i j

i’

j’

(154)

Clique Constraints

Variables x define a graph G^x:

• G^x is much smaller than G^y

• G^x has nice proprieties (it’s a perfect graph)

• It’s easier to find large independent sets in G^x

• A node for each line

• An edge between each pair of crossing lines i

j i’

j’

i j

i’

j’

(155)

Clique Constraints

Non-crossing constraints can be extended to CLIQUE CONSTRAINTS

S

^xij <= 1

[i,j] in M

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and G^x perfect) imply a strong bound!

(156)

Structure of Maximal cliques in G ^x

1. Pick two subsets of same size

(157)

Structure of Maximal cliques in G ^x

2. Connect them in a zig-zag fashion

(158)

Structure of Maximal cliques in

G ^x

(159)

Structure of Maximal cliques in

G ^x

(160)

Structure of Maximal cliques in

G ^x

(161)

Structure of Maximal cliques in

G ^x

(162)

Structure of Maximal cliques in

G ^x

(163)

Structure of Maximal cliques in

G ^x

(164)

Structure of Maximal cliques in

G ^x

(165)

Structure of Maximal cliques in

G ^x

(166)

Structure of Maximal cliques in

G ^x

(167)

Structure of Maximal cliques in G ^x

3. Throw in all lines included in a zig or a zag

(168)

Structure of Maximal cliques in G ^x

3. Throw in all lines included in a zig or a zag

(169)

Structure of Maximal cliques in G ^x

The result is a maximal clique in G^x

(170)

Separation of Clique Inequalities

(171)

Separation of Clique Inequalities

PROBLEM

There exist exponentially many such cliques (O(2²ⁿ) inequalities).

We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*

S

^x*ij > 1

[i,j] in M

THEOREM

We can find the most violated clique inequality in time O(n²)

(172)

Separation of Clique Inequalities

PROOF (sketch)

1) Clique = zigzag path

(173)

Separation of Clique Inequalities

PROOF (sketch)

1) Clique = zigzag path

1 2 3 4 5 6 7 8

(174)

Separation of Clique Inequalities

PROOF (sketch)

1) Clique = zigzag path 2) Flip one graph: zigzag leftright

1 2 3 4 5 6 7 8 8 7 6 5 4 3 2 1

(175)

Separation of Clique Inequalities

PROOF (sketch)

1) Clique = zigzag path 2) Flip one graph: zigzag leftright

1 2 3 4 5 6 7 8 8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.

to find longest path in grid, time O(n^2)

(176)

Separation of cliques

n2 1

1 2 n1 2

i

u

Create n1 x n2 grid

Orient all edges and give

weights

(177)

Separation of cliques

n2 1

1 2 n1 2

i

u

Create n1 x n2 grid

Orient all edges and give weights

x*iu

(178)

Separation of cliques

Create n1 x n2 grid

Orient all edges and give weights

There is violated clique iff longest A,B path has length > 1

A=(1,n2)

B=(n1,1)

(179)

Gx is a Perfect Graph

We show why polynomial separation is possible:

G

^x

is weakly triangulated (no chordless cycles >= 5 in G

^x

or G

^x)

=> Gx is perfect (Hayward, 1985)

(180)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

PROOF (Sketch, for G

^x

)

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

(181)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1 L3

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

(182)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1 L3

For i=4,5,… Li crosses Li-1 but not L1

=> RIGHT (Li, L1)

(183)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1 L3

=> RIGHT (Li, L1)

L4

(184)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

=> RIGHT (Li, L1)

L1

L4

L5

(185)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

=> RIGHT (Li, L1)

L1 L6 L5

(186)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1

We get LEFT(L1, {L3, L4, L5, L6})

L3, L4, L5 L6

L6

(187)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1

A symmetric argument started at L6, with LEFT(L1, L6) implies

LEFT(Li, L6) for i=2,3,4,5

L3, L4, L5 L6

L6

(188)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1

A symmetric argument started at L6, with LEFT(L1, L6) implies

LEFT(Li, L6) for i=2,3,4,5

L3, L4, L5 L6

L6

L2, L3, L4 L5

(189)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1

Then {L3, L4, L5} are between L1 and L6

L3, L4, L5 L6

L6

L2, L3, L4 L5

(190)

Gx is a Perfect Graph

L1

L2

L3

L4

L7

L6

L5

L1

Then {L3, L4, L5} are between L1 and L6

L3, L4, L5 L6

L6

L2, L3, L4 L5

But L

7

crosses L

1

and L

6

, and so should cross them all !

L7

(191)

The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001)

It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN RELAXATION. Approach borrowed from Quadratic Assignment. With new approach we can solve important proteins (up to 150 residues/300 contacts)

(192)

The Population Haplotyping problem

The Population

Haplotyping problem

10

11

01 10

01

00

11 11 10

00

10 00 10

10

0*

**

10 1*

11

*0

*0 NOTATION

0

1

*

HAPLOTYPE: string over 0, 1 GENOTYPE: string over 0, 1, *

where 0={0}, 1={1}, *={0,1}

10

11

01 10

01 00

11 00

00 10 10

10

0*

**

10 1*

**

*0

Phasing the alleles

THE PHASING (or HAPLOTYPING) PROBLEM

Given genotypes of k individuals, determine the phasings of all heterozygous sites.

The input is GENOTYPE data

00011

11011

*1**1

****1 11**1

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

The input is GENOTYPE data

00011

11011

*1**1

****1 11**1

OUTPUT: H = { 11011, 11101, 00011, 01101}

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }

-without objectives/constraints, the haplotyping problem would be

(mathematically)trivial

OBJECTIVES

**0*1  00001 11011

1*0**  10000 11011

-the objectives/constraints must be

“driven by biology”

4°) (parsimony): minimize |H|

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) Disease Association

OBJECTIVES

Obj: Clark’s rule

1st

1011001011 +

?????????? = 1**1001*1*

h

g

Inference Rule

compatible

h , g

1011001011 + 1101001110 = 1**1001*1*

h

g

Inference Rule

compatible

h , g

h’

h + h’ = g

**where 0={0}, 1={1}, *={0,1}**

**1 111

INPUT: G = { 111, ****1, 11011, *11, 00011 }

**1 111

INPUT: G = { 111, ****1, 11011, *11, 00011 }

**1*0  10000 11011

?????????? = 1**10011

1011001011 + 1101001110 = 1**10011

**0*011**

001 111

INPUT: GD = { 111,*1**1,0*011 }, GH = {111,001,00011}**

111**

**0*011**

001 111

INPUT: GD = { 111,*1**1,0*011 }, GH = {111,001,00011}**

111**

 ^h

^{, h}