The Population
Haplotyping problem
10
11
01 10
01
00
11 11 10
00
10 00 10
10
0*
**
10 1*
11
*0
*0 NOTATION
: each SNP only two values in a population (bio).Call them
0
and1
. Also, call*
the fact that a site is heterozygousHAPLOTYPE: string over 0, 1 GENOTYPE: string over 0, 1, *
where 0={0}, 1={1}, *={0,1}
10
11
01 10
01 00
11 00
00 10 10
10
0*
**
10 1*
**
*0
0 + 0 = --- 0
1 + 1 = --- 1
0 + 1 + 1 = 0 = --- --- * * ALGEBRA OF HAPLOTYPES:
Homozygous sites Heterozygous (ambiguous) sites
1**0*
11101 10000 11100 10001 11001 10100 11000 10101
Phasing the alleles
For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings
THE PHASING (or HAPLOTYPING) PROBLEM
Given genotypes of k individuals, determine the phasings of all heterozygous sites.
It is too expensive to determine haplotypes directly
Much cheaper to determine genotypes, and then infer haplotypes in silico:
This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.
The input is GENOTYPE data
00011
11011
*1**1
****1 11**1
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
The input is GENOTYPE data
11011 11101
00011
00011 11101
11011 01101
11011 11011 00011
00011
11011
*1**1
****1 11**1
OUTPUT: H = { 11011, 11101, 00011, 01101}
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
Each genotype is resolved by two haplotypes We will define some objectives for H
-without objectives/constraints, the haplotyping problem would be
(mathematically)trivial
OBJECTIVES
**0*1 00001 11011
E.g., always put 0 above and 1 below
1*0** 10000 11011
-the objectives/constraints must be
“driven by biology”
4°) (parsimony): minimize |H|
1°) Clark’s inference rule
2°) Perfect Phylogeny
3°) Disease Association
OBJECTIVES
Obj: Clark’s rule
1st
1011001011 +
?????????? = 1**1001*1*
known haplotype
h
known (ambiguos) genotype
g
Inference Rule
for a
compatible
pairh , g
1011001011 + 1101001110 = 1**1001*1*
known haplotype
h
known (ambiguos) genotype
g
Inference Rule
for a
compatible
pairh , g
new (derived) haplotype
h’
We write
h + h’ = g
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000
**00 11**
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
1100 0000
1000
**00 11**
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1100 1111
SUCCESS 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000
**00 11**
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000
**00 11**
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0100 0000
1000
**00 11**
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
0100 FAILURE (can’t resolve 1122 )
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000
**00 11**
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.
The number of genotypes solved depends on order of application.
1st Objective (Clark, 1990)
OBJ: find order of application rule that leaves the fewest elements in G
The problem was studied by Gusfield
(ISMB 2000, and Journal of Comp. Biol., 2001)
- problem is APX-hard
- it corresponds to finding largest forest in a graph with haplotypes as nodes and arcs for possible derivations -solved via ILP of exponential-size
(practical for small real instances)
Obj: Perfect Phylogeny
2nd
- Parsimony does not take into account mutations/evolution of haplotypes
- parsimony is very relialable on “small”
haplotype blocks
- when haplotypes are large (span several SNPs, we should consider evolutionionary events and recombination)
- the cleanest model for evolution is the
perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
3rd objective is based on perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
3rd objective is based on perfect phylogeny
has tail
flies
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
But…a new species may come along so that no Perfect phylogeny is possible…
has tail
flies
Theorem: such matrix has p.p. iff there is not a 00 4x2 minor 10
01 11 Human 1 0 0 Mouse 0 1 0 Spider 0 0 0 Eagle 1 0 1
two legs
tail
flies
Theorem: such matrix has p.p. iff there is not a 00 4x2 minor 10
01 11 Human 1 0 0 Mouse 0 1 0 Spider 0 0 0 Eagle 1 0 1 Mickey mouse 1 1 0
two legs
tail
flies
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 * 0
* 1 0 *
* 0 * 0
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 * 0
* 1 0 *
* 0 * 0
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0
NOT a perfect phylogeny solution ! 0 1 * 0
* 1 0 *
* 0 * 0
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 * 0 0 1 0 * 0 0 0 *
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1
A perfect phylogeny 0 1 * 0
0 1 0 * 0 0 0 *
Theorem: The Perfect Phylogeny
Haplotyping problem is polynomial
Theorem: The Perfect Phylogeny Haplotyping problem is polynomial
Algorithms are of combinatorial nature
- There is a graph for which SNPs are columns and edges are of two types (forced and free)
- forced edges connect pairs of SNPs that must be phased in the same way ** 00 + 11 or ** 01 + 10
- a complex visit of the graph decides how to phase free SNPs
Obj: Disease Association
3rd
Some diseases may be due to a gene which has “faulty” configurations
RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease
DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy
Two individuals of which one is healthy and the other diseased may have the same genotype.
The explanation of the disease lies in a difference in their haplotypes
00011
0*011
*1**1
0**01 11**1
INPUT: GD = { 11**1,*1**1,0*011 }, GH = {11**1,0**01,00011}
11**1
11011 11101
01101 00001
11011 01101
01011 00011
00011 00011
OUTPUT: H = {
11011,01011,00001,11111,11101,00011,01101}
H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none
11001 11111
00011
0*011
*1**1
0**01 11**1
INPUT: GD = { 11**1,*1**1,0*011 }, GH = {11**1,0**01,00011}
11**1
Theorem 1 is proved via a reduction from 3 SAT
Theorem 2 has a mathematical proof (coloring argument) with little relation to biology:
There is R (depending on input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3
This means the model must be refined!
Obj: Max Parsimony
See separate slides…
4th
Summary:
- haplotyping in-silico needed for economical reasons - several objectives, all biologically driven
- nice combinatorial problems
(mostly due to binary nature of SNPs)
- these problems are technology-dependant and may
become obsolete (hopefully after we have retired)
011 101
111 111
011 000
010
001 010
011
111 111
022
222
012
221
011 111 022
211 022 012
012 222
minimize |H|
2nd Objective (parsimony) :
1. The problem is APX-Hard
Reduction from VERTEX-COVER
A
B
C
D E
A
B
C
D E
A B C D E *
A
B
C
D E
A B C D E * AB
BC AE
DE AD
A
B
C
D E
A B C D E * AB
BC AE
DE AD A B C D E
A
B
C
D E
A B C D E * AB 2 2
BC 2 2
AE 2 2 DE 2 2 AD 2 2
A B C D E
A
B
C
D E
A B C D E * AB 2 2
BC 2 2
AE 2 2 DE 2 2 AD 2 2
A 0
B 0
C 0
D 0
E 0
A
B
C
D E
A B C D E *
AB 2 2 2 BC 2 2 2 AE 2 2 2 DE 2 2 2 AD 2 2 2 A 0 0 B 0 0 C 0 0 D 0 0 E 0 0
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0 A’ 0 1 1 1 1 1 B’ 1 0 1 1 1 1 E’ 1 1 1 1 0 1
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes
A basic ILP formulation
Expand your input G in all possible ways
220 120 022
A basic ILP formulation
Expand your input G in all possible ways
010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010
220 120 022
A basic ILP formulation
Expand your input G in all possible ways
010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010
220 120 022
A basic ILP formulation
xh
h
1, h
2
x h
y
h h2 1,
The resulting Integer Program (IP1):
Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations
The input is GENOTYPE data
11011 11101 00011 11101
11011 01101
11011 11011 00011
00011
OUTPUT: H = { 11011, 11101, 00011, 01101}
Each genotype is explained by two haplotypes We will define some objectives for H
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
****1 11**1
OOO11
11O11
*1**1
1st Objective (open research problem):
minimize |H|
2nd Objective based on inference rule:
1st Objective (parsimony) :
minimize |H|
An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.
BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB
It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)
2nd Objective based on inference rule:
xoxxooxoxx +
********** = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
Inference Rule
xoxxooxoxx + xxoxooxxxo = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
new (derived) haplotype
h’
Inference Rule
xoxxooxoxx + xxoxooxxxo = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
new (derived) haplotype
h’
We write
h + h’ = g
g
andh
must becompatible
to deriveh’
Inference Rule
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
xxoo
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
xxoo xxxx
SUCCESS
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
oxoo
2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
oxoo FAILURE (can’t resolve xx?? )
OBJ: find order of application rule that leaves the fewest elements in G
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
1. expand genotypes
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
xxxox xxxoo xxoox xxooo xoxox xooox xoxoo
xoooo
1. expand genotypes
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
xxxox xxxoo xxoox xxooo xoxox xooox xoxoo
xoooo
2. create (h, h’) if exists g s.t. h’ can be derived from g and h
1. expand
genotypes 3. Largest number of nodes in forest rooted at unambiguos genotpes =
= largest number of ambiguous genotypes resolved
Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).
This reduction is exponential. Is there a better practical approach?
3rd Objective (open research problem) Disease Detection:
oooxx
??oxx
?x??x
????x xx??x
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
3rd Objective (open research problem) Disease Detection:
xxoxx xxxox
oooxx
oooxx xxxox
xxoxx oxxox
xxoxx oooxx oooxx
oooxx
??oxx
?x??x
????x xx??x
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}
H contains H’, s.t. each diseased has one haplotype in H’ and each healty none
minimize | H|
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
Genome Rearrangements and
Evolutionary Distances
Each species has a genome (organized in pairs of chromosomes)
tcgtgatggat………ttgatggattga
tcgattatggat………ttttgatatcca
Genomes evolve by means of
• Insertions
• Deletions
• Inversions
• Transpositions
• Translocations of DNA regions
deletion
deletion insertion
deletion insertion
translocation
deletion insertion
translocation
inversion
deletion insertion
translocation
inversion
transposition
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)
5 6 4 8 3 2 1 9 7
Example:
We focus on inversions, that are the most important in Nature
1 2 3 8 4 6 5 9 7
1 2 3 8 4 5 6 9 7
1 2 3 6 5 4 8 9 7
1 2 3 6 5 4 8 7 9
1 2 3 4 5 6 8 7 9
1 2 3 4 5 6 7 8 9
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transposition…)
+5 +6 -4 -8 -3 -2 -1 -9 +7
Example:
We focus on inversions, that are the most important in Nature
+1 +2 +3 +8 +4 -6 -5 -9 +7 +1 +2 +3 +8 +4 +5 +6 -9 +7 +1 +2 +3 -6 -5 -4 -8 -9 +7 +1 +2 +3 -6 -5 -4 -8 -7 +9 +1 +2 +3 +4 +5 +6 -8 -7 +9 +1 +2 +3 +4 +5 +6 +7 +8 +9
There is also a SIGNED VERSION of the problem !
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
5 7 8 2 1 4 3 6 9
The concept of breakpoint
Breakpoint at position i if | p(i) - p(i+1) | > 1
0 10
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
5 7 8 2 1 4 3 6 9
The concept of breakpoint
Breakpoint at position i if | p(i) - p(i+1) | > 1
0 10
d(p) = inversion distance b(p) = # breakpoints
TRIVIAL BOUND: d(p) >= b(p) / 2 Example: d(p) >= 6 / 2 = 3
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
10 4 6
Each node has degree...
0 2 or 4 …
hence the graph can be decomposed in cycles!
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
c(p) = max # cycles in alternating decomposition
VERY STRONG BOUND : d (p) >= b(p) - c(p)
Example: c(p)= 2 and d (p) >= 6 - 2 = 4
The Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10 The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition
A variable
x
C for each cycle (exponential # of vars…) A constraintS
x
C = 1 for each edge e Objective: maximizeS
Cx
CC containing e
max S x
CS x
C = 1 for all edges eC C\ni ex
C \in {0,1} for all alt. cycles C PRIMALmin S y
eS y
e <= 1 for all alt. Cycles Cee\in C
y
e \in R for all edges e DUALmax S x
CS x
C = 1 for all edges eC C\ni ex
C \in {0,1} for all alt. cycles C PRIMALmin S y
eS y
e <= 1 for all alt. Cycles Cee\in C
y
e \in R for all edges e DUAL5 7 8 2 1 4 3 6 9 0
10
Pricing out the cycles for which y*(C) < 1
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Split the graph in two copies
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Connect twins
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
The weight of the matching is the y*-weight of the cycles
.2 .4
.5 1
.6
0
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Forcing a cycle to use a certain node
.2 .4
.5 1
.6
100000
- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)
BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)
BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)
KS can solve at most n=40. Take days for n=50
CLN can solve for n=200. Takes few seconds (say 5) for n=100 NP-hard problem practically solved to optimality!
Statistical view of evolution
• Genome evolve by random inversions
• It’s like a random walk on a huge graph with an edge for each permutation an edge for each inversion
• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)
• We want to find the most likely number of inversions that lead from (1 2 … n ) to p
• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions
Let B(k) be the (r.v.) number of breakpoint after k random inversions from (1..n) Given a p obtained by h random inversions from (1 … n ) we want to estimate h The inversion distance is only a lower bound: h >= d(p) but the gap could be big
We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as close as possible to b(p) (maximum likelihood). CL ,2000, have shown:
Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions
E[B(k)] = ( n - 1 ) ( 1 - ( n - 3 n - 1 )
k)
Example: n = 200, k (u.a.r. in 1…n) inversions
8 8 8 16
19 19 19 34
68 67 67 98
69 73 68 104
73 79 73 109
85 91 83 120
86 85 83 115
87 90 84 119
118 117 109 138
184 184 135 168
k k’ d(p) b
Protein Structure Alignments: the Maximum Contact Map Overlap
Problem
A Protein is a complex molecule with a primary, linear structure (a sequence of aminoacids) and a 3-Dimensional structure (the protein fold).
Protein STRUCTURE determines its FUNCTION
For instance, the Drug Design problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock onto it.
Motivation:
Structure Alignment is Important for:
- Discovery of Protein Function (shape determines function) - Search in 3D data bases
- Protein Classification and Evolutionary Studies - ...
Problem: Align two 3D protein structures
Contact Maps
Unfolded protein
CONTACT MAPS
Unfolded protein
Folded protein = contacts
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
CONTACT MAPS
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
OBJECTIVE: align 3d folds of proteins = align contact maps
Contact Map Alignments
Non-crossing Alignments
Protein 1
Protein 2
non-crossing map of residues in protein 1 and protein 2
The value of an alignment
The value of an alignment
The value of an alignment
Value = 3
The value of an alignment
Value = 3
We want to maximize the value
The value of an alignment
NP-Hard
The value of an alignment
Integer Programming
Formulation
Integer Programming Formulation
0-1 VARIABLES
yef for e and f contacts
e
f
yef
Integer Programming Formulation
0-1 VARIABLES
yef + ye’f’<= 1 yef for e and f contacts
e
f
yef
CONSTRAINTS e
f
e’
f’
Integer Programming Formulation
0-1 VARIABLES
yef + ye’f’<= 1 yef for e and f contacts
e
f
yef
CONSTRAINTS e
f
e’
f’
OBJECTIVE max
S
eS
f yefIndependent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
e
f
e’
f’’ f’
e’’
e f
e’
f’
e’’
f’’
Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
e
f
e’
f’’ f’
e’’
e f
e’
f’
e’’
f’’
|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each) The best exact algorithm for independent set can solve for at most a few hundred nodes
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES xij for i and j residues
e
f
yef i
j xij
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES xij for i and j residues
e
f
yef
NEW CONSTRAINTS i
j i’
j’
xij + xi’j’ <= 1
i j
xij
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES
y(ip)(jq) <= xij and y(ip)(jq)<= xpq
xij for i and j residues
e
f
yef
NEW CONSTRAINTS i
j i’
j’
xij + xi’j’<= 1
i j
xij
i j
p q
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines i
j i’
j’
i j
i’
j’
Clique Constraints
Variables x define a graph Gx:
• Gx is much smaller than Gy
• Gx has nice proprieties (it’s a perfect graph)
• It’s easier to find large independent sets in Gx
• A node for each line
• An edge between each pair of crossing lines i
j i’
j’
i j
i’
j’
Clique Constraints
Non-crossing constraints can be extended to CLIQUE CONSTRAINTS
S
xij <= 1[i,j] in M
For all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and Gx perfect) imply a strong bound!
Structure of Maximal cliques in G x
1. Pick two subsets of same size
Structure of Maximal cliques in G x
2. Connect them in a zig-zag fashion
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in
G x
Structure of Maximal cliques in G x
3. Throw in all lines included in a zig or a zag
Structure of Maximal cliques in G x
3. Throw in all lines included in a zig or a zag
Structure of Maximal cliques in G x
The result is a maximal clique in Gx
Separation of Clique Inequalities
Separation of Clique Inequalities
PROBLEM
There exist exponentially many such cliques (O(22n) inequalities).
We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*
S
x*ij > 1[i,j] in M
THEOREM
We can find the most violated clique inequality in time O(n2)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
1 2 3 4 5 6 7 8
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path 2) Flip one graph: zigzag leftright
1 2 3 4 5 6 7 8 8 7 6 5 4 3 2 1
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path 2) Flip one graph: zigzag leftright
1 2 3 4 5 6 7 8 8 7 6 5 4 3 2 1
3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr.
to find longest path in grid, time O(n^2)
Separation of cliques
n2 1
1 2 n1 2
i
u
Create n1 x n2 grid
Orient all edges and give
weights
Separation of cliques
n2 1
1 2 n1 2
i
u
Create n1 x n2 grid
Orient all edges and give weights
x*iu
x*iu
Separation of cliques
Create n1 x n2 grid
Orient all edges and give weights
There is violated clique iff longest A,B path has length > 1
A=(1,n2)
B=(n1,1)
Gx is a Perfect Graph
We show why polynomial separation is possible:
G
xis weakly triangulated (no chordless cycles >= 5 in G
xor G
x)
=> Gx is perfect (Hayward, 1985)
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
PROOF (Sketch, for G
x)
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1 L3
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1 L3
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1 L3
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L4
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1
L4
L5
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
For i=4,5,… Li crosses Li-1 but not L1
=> RIGHT (Li, L1)
L1 L6 L5
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1
We get LEFT(L1, {L3, L4, L5, L6})
L3, L4, L5 L6
L6
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1
A symmetric argument started at L6, with LEFT(L1, L6) implies
LEFT(Li, L6) for i=2,3,4,5
L3, L4, L5 L6
L6
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1
A symmetric argument started at L6, with LEFT(L1, L6) implies
LEFT(Li, L6) for i=2,3,4,5
L3, L4, L5 L6
L6
L2, L3, L4 L5
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1
Then {L3, L4, L5} are between L1 and L6
L3, L4, L5 L6
L6
L2, L3, L4 L5
Gx is a Perfect Graph
L1
L2
L3
L4
L7
L6
L5
L1
Then {L3, L4, L5} are between L1 and L6
L3, L4, L5 L6
L6
L2, L3, L4 L5
But L
7crosses L
1and L
6, and so should cross them all !
L7
The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001)
It can be applied to small or moderate proteins (up to 80 residues/150 contacts)
In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN RELAXATION. Approach borrowed from Quadratic Assignment. With new approach we can solve important proteins (up to 150 residues/300 contacts)