beamer-tu-logo
(a tour of combinatorial optimization
through a single problem)
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
How do we tackle problems?
When faced with a new problem we go through these steps:
1
Assess complexity: is it NP/APX-hard? (usually ...yes)
2
Exact approaches:
Combinatorial branch-and-bound
Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming
SAT/Boolean approaches ...
3
Approximation algorithms:
LP rounding
Randomized approximations ...
4
Heuristics:
(Sophisticated) local search Metaheuristics
...
All of the above have been tried on our problem!
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
Each number as a sum
Problem description
INPUT : a set G of numbers in {0, . . . , v }
OUTPUT: a set H of numbers in {0, . . . , dv /2e} such that for each g ∈ G there are h 0 , h 00 ∈ H with g = h 0 + h 00 GOAL: minimize |H|
Example
For instance, G = {2, 3, 4, 5, 6, 8} can be generated by
H = {1, 2, 4} as 2 = 1 + 1, 3 = 1 + 2, etc...
beamer-tu-logo
Each number as a sum
Problem description
INPUT : a set G of numbers in {0, . . . , v }
OUTPUT: a set H of numbers in {0, . . . , dv /2e} such that for each g ∈ G there are h 0 , h 00 ∈ H with g = h 0 + h 00 GOAL: minimize |H|
Example
For instance, G = {2, 3, 4, 5, 6, 8} can be generated by
H = {1, 2, 4} as 2 = 1 + 1, 3 = 1 + 2, etc...
beamer-tu-logo
Each vector as a sum
The above problem is particularly interesting for v = 2 and vectors instead of numbers.
G subset of {0, 1, 2} n H subset of {0, 1} n Example
For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0)
The star of the show
This is our problem! It’s as hard as it gets... Let’s give it a bio
motivation
beamer-tu-logo
Each vector as a sum
The above problem is particularly interesting for v = 2 and vectors instead of numbers.
G subset of {0, 1, 2} n H subset of {0, 1} n Example
For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0) The star of the show
This is our problem! It’s as hard as it gets... Let’s give it a bio
motivation
beamer-tu-logo
Each vector as a sum
The above problem is particularly interesting for v = 2 and vectors instead of numbers.
G subset of {0, 1, 2} n H subset of {0, 1} n Example
For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0) The star of the show
This is our problem! It’s as hard as it gets... Let’s give it a bio
motivation
beamer-tu-logo
The underlying biology problem
Single Nucleotide Polymorphisms
A SNP is a specific nucleotide at a certain position that can vary in a population
Its different values are called alleles and are typically just two
Different SNPs can have different alleles
SNPs are the predominant form of human polymorphism
beamer-tu-logo
The underlying biology problem
Single Nucleotide Polymorphisms
A SNP is a specific nucleotide at a certain position that can vary in a population
Its different values are called alleles and are typically just two
Different SNPs can have different alleles
SNPs are the predominant form of human polymorphism
beamer-tu-logo
The underlying biology problem
Single Nucleotide Polymorphisms
A SNP is a specific nucleotide at a certain position that can vary in a population
Its different values are called alleles and are typically just two
Different SNPs can have different alleles
SNPs are the predominant form of human polymorphism
beamer-tu-logo
The underlying biology problem
Single Nucleotide Polymorphisms
A SNP is a specific nucleotide at a certain position that can vary in a population
Its different values are called alleles and are typically just two
Different SNPs can have different alleles
SNPs are the predominant form of human polymorphism
beamer-tu-logo
Haplotypes
Hapl. 1, paternal: tagccCtatttCccaggcgcCgtatgacgggTcta Hapl. 1, maternal: tagccGtatttAccaggcgcGgtatgacgggTcta Hapl. 2, paternal: tagccCtatttAccaggcgcGgtatgacgggTcta Hapl. 2, maternal: tagccGtatttCccaggcgcGgtatgacgggCcta Hapl. 3, paternal: tagccCtatttAccaggcgcGgtatgacgggTcta Hapl. 3, maternal: tagccGtatttAccaggcgcCgtatgacgggCcta
We are diploid organisms
For each chromosome we have two haplotypes
In the above, the haplotypes for Individual 1 are {CCCT,
GAGT}
beamer-tu-logo
Haplotypes are expensive to get in lab exp Much cheaper experiment yields genotypes
Genotype is the “sum” of haplotypes, specifying the alleles at each position but not their phase (father, mother)
Example
h
0= G A T C
h
00= G T T A
g = {G} {A, T } {T } {A, C}
Cardinality one : homozygous ({G}, {T} )
Cardinality two : heterozygous ({A,T}, {A,C})
beamer-tu-logo
Haplotypes are expensive to get in lab exp Much cheaper experiment yields genotypes
Genotype is the “sum” of haplotypes, specifying the alleles at each position but not their phase (father, mother)
Example
h
0= G A T C
h
00= G T T A
g = {G} {A, T } {T } {A, C}
Cardinality one : homozygous ({G}, {T} )
Cardinality two : heterozygous ({A,T}, {A,C})
beamer-tu-logo
The phasing problem
Each genotype with k heterozygous SNPs can be resolved in 2 k −1 ways
Resolutions
A G A T A
T G T T C
{A, T } {G} {A, T } {T } {A, C}
A G A T C
T G T T A
{A, T } {G} {A, T } {T } {A, C}
A G T T A
T G A T C
{A, T } {G} {A, T } {T } {A, C}
A G T T C
T G A T A
{A, T } {G} {A, T } {T } {A, C}
How do we find the correct one?
beamer-tu-logo
From DNA to bits
we encode the alleles with 0/1. haplotypes = binary vectors we encode genotypes with ternary vectors over {0,1,*}
0 and 1 = homozygous, * = heterozygous
Haplotype 1, paternal: 0 1 0 1
* * * 1 Genotype 1 Haplotype 1, maternal: 1 0 1 1
Haplotype 2, paternal: 0 0 1 1
* * 1 * Genotype 2 Haplotype 2, maternal: 1 1 1 0
Haplotype 3, paternal: 0 0 1 1
* 0 * * Genotype 3 Haplotype 3, maternal: 1 0 0 0
Figure: Haplotypes and corresponding genotypes.
beamer-tu-logo
From DNA to bits
we encode the alleles with 0/1. haplotypes = binary vectors we encode genotypes with ternary vectors over {0,1,*}
0 and 1 = homozygous, * = heterozygous
Haplotype 1, paternal: 0 1 0 1
* * * 1 Genotype 1 Haplotype 1, maternal: 1 0 1 1
Haplotype 2, paternal: 0 0 1 1
* * 1 * Genotype 2 Haplotype 2, maternal: 1 1 1 0
Haplotype 3, paternal: 0 0 1 1
* 0 * * Genotype 3 Haplotype 3, maternal: 1 0 0 0
Figure: Haplotypes and corresponding genotypes.
beamer-tu-logo
math of haplotypes
Again, a genotype is the sum of two haplotypes g = h
0+ h
00.
0 +
0 =
0
1 +
1 =
1
0 +
1 =
∗
1 +
0 =
∗
Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}
Resolution of g: pair h
0, h
00∈ H(g) s.t. g = h
0+ h
00Ambiguous sites: positions g
i= ∗
A genotype w/k ambiguous sites has 2
kcompatible haplotypes and 2
k −1resolutions. (If k ≤ 1 genotype has only 1 resolution)
HAPLOTYPING PROBLEM INPUT: set of genotypes G
OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h
0, h
00∈ H with g = h
0+ h
00OBJECTIVE: ?
beamer-tu-logo
math of haplotypes
Again, a genotype is the sum of two haplotypes g = h
0+ h
00.
0 +
0 =
0
1 +
1 =
1
0 +
1 =
∗
1 +
0 =
∗
Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}
Resolution of g: pair h
0, h
00∈ H(g) s.t. g = h
0+ h
00Ambiguous sites: positions g
i= ∗
A genotype w/k ambiguous sites has 2
kcompatible haplotypes and 2
k −1resolutions. (If k ≤ 1 genotype has only 1 resolution)
HAPLOTYPING PROBLEM INPUT: set of genotypes G
OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h
0, h
00∈ H with g = h
0+ h
00OBJECTIVE: ?
beamer-tu-logo
math of haplotypes
Again, a genotype is the sum of two haplotypes g = h
0+ h
00.
0 +
0 =
0
1 +
1 =
1
0 +
1 =
∗
1 +
0 =
∗
Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}
Resolution of g: pair h
0, h
00∈ H(g) s.t. g = h
0+ h
00Ambiguous sites: positions g
i= ∗
A genotype w/k ambiguous sites has 2
kcompatible haplotypes and 2
k −1resolutions. (If k ≤ 1 genotype has only 1 resolution)
HAPLOTYPING PROBLEM INPUT: set of genotypes G
OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h
0, h
00∈ H with g = h
0+ h
00OBJECTIVE: ?
beamer-tu-logo
Max parsimony Haplotyping (MPH)
after some years...
Objective: parsimony
Haplotypes are very conserved. We seek the set H of smallest
cardinality (simplest explanation, Occam’s razor)
beamer-tu-logo
Max parsimony Haplotyping (MPH)
after some years...
Objective: parsimony
Haplotypes are very conserved. We seek the set H of smallest
cardinality (simplest explanation, Occam’s razor)
beamer-tu-logo
Max parsimony Haplotyping (MPH)
after some years...
Objective: parsimony
Haplotypes are very conserved. We seek the set H of smallest
cardinality (simplest explanation, Occam’s razor)
beamer-tu-logo
Max parsimony Haplotyping (MPH)
after some years...
Objective: parsimony
Haplotypes are very conserved. We seek the set H of smallest
cardinality (simplest explanation, Occam’s razor)
beamer-tu-logo
Our problem is exactly the math problem of beginning, only sum is written differently.
G =
g 1 = *01 g 2 = 00*
g 3 = *0*
g 4 = 0**
g 5 = 0*1
7→ H = {h 1 , h 2 , h 3 , h 4 } = {001, 101, 000, 011}
g 1 = h 1 + h 2 , g 2 = h 1 + h 3 , etc...
Important parameters
# of genotypes (rows) N g
# of SNPs (cols) N s (and, most important of all,)
max (or avg) # of ambiguous sites N ∗
beamer-tu-logo
Our problem is exactly the math problem of beginning, only sum is written differently.
G =
g 1 = *01 g 2 = 00*
g 3 = *0*
g 4 = 0**
g 5 = 0*1
7→ H = {h 1 , h 2 , h 3 , h 4 } = {001, 101, 000, 011}
g 1 = h 1 + h 2 , g 2 = h 1 + h 3 , etc...
Important parameters
# of genotypes (rows) N g
# of SNPs (cols) N s (and, most important of all,)
max (or avg) # of ambiguous sites N ∗
beamer-tu-logo
Now what?
Problem MPH introduced by Gusfield (2002)
It is “extremely exponential”: solution space has size 2 2
N∗It looks complex.
Let’s start by assessing its real complexity...
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
Theorem
The MPH problem is APX-hard, even if each genotype has ≤ 3 ambiguous sites (reduction from VC).
Graph has a cover size c iff there is a set H of |V | + c haplotypes that resolve all genotypes.
APX-hardness
Lancia, Pinotti and Rizzi,Haplotyping populations by pure parsimony: Complexity,
Exact and Approximation algorithms,INFORMS J. on Computing, 2004
beamer-tu-logo
When N ∗ = 1 problem is trivial and when N ∗ = 3 it is APX-hard.
What for N ∗ = 2 ? Theorem
The MPH problem is polynomially solvable if each genotype has ≤ 2 ambiguous sites (ILP formulation totally unimodular).
Polynomial 2-case
Lancia and Rizzi,A polynomial case of the parsimony
haplotyping problem,Operations Research Letters, 2006
beamer-tu-logo
When N ∗ = 1 problem is trivial and when N ∗ = 3 it is APX-hard.
What for N ∗ = 2 ? Theorem
The MPH problem is polynomially solvable if each genotype has ≤ 2 ambiguous sites (ILP formulation totally unimodular).
Polynomial 2-case
Lancia and Rizzi,A polynomial case of the parsimony
haplotyping problem,Operations Research Letters, 2006
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
Exact algorithms
1
(Naive) combinatorial branch-and-bound
2
Exponential ILP model (no row/col generation)
3
Polynomial and hybrid ILP models
4
Exponential ILP model (with row/col generation)
5
Semidefinite programming
6
SAT approaches
beamer-tu-logo
1. A (naive) combinatorial b&b
branch on all resolutions on g
trivial LB : used haplotypes so far plus...
...elementary bound on remaining needed haplotypes
HAPAR experiments N
g15
N
s20
N
∗very small (say ≤ 7)
Comments: Can take days even on “toy” problems with 12 genotypes
Combinatorial B& B
Wang and Xu,Haplotype inference by maximum parsimony,Bioinformatics, 2003
beamer-tu-logo
1. A (naive) combinatorial b&b
branch on all resolutions on g
trivial LB : used haplotypes so far plus...
...elementary bound on remaining needed haplotypes
HAPAR experiments N
g15
N
s20
N
∗very small (say ≤ 7)
Comments: Can take days even on “toy” problems with 12 genotypes
Combinatorial B& B
Wang and Xu,Haplotype inference by maximum parsimony,Bioinformatics, 2003
beamer-tu-logo
2. A (static) exponential ILP model
Variable x
hfor each haplotype h and y
h1,h2for each resolution h
1+ h
2min X
h∈H(G)
x
hsubject to
X
(h1,h2):g=h1+h2
y
h1,h2≥ 1 ∀g ∈ G
y
h1,h2≤ x
h1∀(h
1,h
2)
y
h1,h2≤ x
h1∀(h
1,h
2) where x , y are 0/1 variables.
EXP SIZE
Exponential no. of variables and of constraints (each resolution (h
1,h
2))
Some variables can be disposed of, but not many (still exp size, model RTIP,
Reduced Theoretical Integer Program)
beamer-tu-logo
2. A (static) exponential ILP model
Variable x
hfor each haplotype h and y
h1,h2for each resolution h
1+ h
2min X
h∈H(G)
x
hsubject to
X
(h1,h2):g=h1+h2
y
h1,h2≥ 1 ∀g ∈ G
y
h1,h2≤ x
h1∀(h
1,h
2)
y
h1,h2≤ x
h1∀(h
1,h
2) where x , y are 0/1 variables.
EXP SIZE
Exponential no. of variables and of constraints (each resolution (h
1,h
2))
Some variables can be disposed of, but not many (still exp size, model RTIP,
Reduced Theoretical Integer Program)
beamer-tu-logo
2. A (static) exponential ILP model
Variable x
hfor each haplotype h and y
h1,h2for each resolution h
1+ h
2min X
h∈H(G)
x
hsubject to
X
(h1,h2):g=h1+h2
y
h1,h2≥ 1 ∀g ∈ G
y
h1,h2≤ x
h1∀(h
1,h
2)
y
h1,h2≤ x
h1∀(h
1,h
2) where x , y are 0/1 variables.
EXP SIZE
Exponential no. of variables and of constraints (each resolution (h
1,h
2))
Some variables can be disposed of, but not many (still exp size, model RTIP,
Reduced Theoretical Integer Program)
beamer-tu-logo
RTIP
RTIP exp. IP model
Gusfield,Haplotype Inference by Pure
Parsimony,Combinatorial Pattern Matching, 2003 Model suitable only for small/moderate data sets Yields very strong lower bound
RTIP experiments N g ≤ 50 N s ≤ 30 N ∗ ≤ 10
Comments: CPLEX. Runtimes from secs to hrs. Solvable
models with about 100,000 rows and cols.
beamer-tu-logo
3. Polynomial and hybrid ILP models
Gusfield’s RTIP exponential model impractical for increasing N ∗
Polynomial ILP (both in rows and cols) are possible.
They are theoretically better than RTIP (in that they can run on problems on which RTIP cannot even create the LP) In practice, they are much worse, as they have very poor lower bound
Several authors independently came up with them
beamer-tu-logo
Idea for poly IP models: “bit” variables
G =
0 ∗ ∗ 1
0 0 1 1
∗ 1 0 ∗
∗ ∗ ∗ 0
1 2 3 4
H =
0 x
1x
21
0 1 − x
11 − x
21
0 0 1 1
0 0 1 1
x
31 0 x
41 − x
31 0 1 − x
4x
5x
6x
70
1 − x
51 − x
61 − x
70
1
01
002
02
003
03
004
04
00Vars z
uv= 1 if row u and v are different (imposed by linear constraints, e.g.
z
30,40≥ x
3− x
5and z
30,40≥ x
5− x
3z
30,40≥ x
4(because of last col, if x
4= 1) z
10,40= 1 (because of last col)
etc...)
Vars w
u= 1 if no row t > u is identical to row u (imposed by lin. constraints) w
u≥ P
t>u
z
ut− (t − u − 1) Objective: minimize P
u
w
ubeamer-tu-logo
Idea for poly IP models: “bit” variables
G =
0 ∗ ∗ 1
0 0 1 1
∗ 1 0 ∗
∗ ∗ ∗ 0
1 2 3 4
H =
0 x
1x
21
0 1 − x
11 − x
21
0 0 1 1
0 0 1 1
x
31 0 x
41 − x
31 0 1 − x
4x
5x
6x
70
1 − x
51 − x
61 − x
70
1
01
002
02
003
03
004
04
00Vars z
uv= 1 if row u and v are different (imposed by linear constraints, e.g.
z
30,40≥ x
3− x
5and z
30,40≥ x
5− x
3z
30,40≥ x
4(because of last col, if x
4= 1) z
10,40= 1 (because of last col)
etc...)
Vars w
u= 1 if no row t > u is identical to row u (imposed by lin. constraints) w
u≥ P
t>u
z
ut− (t − u − 1) Objective: minimize P
u
w
ubeamer-tu-logo
Idea for poly IP models: “bit” variables
G =
0 ∗ ∗ 1
0 0 1 1
∗ 1 0 ∗
∗ ∗ ∗ 0
1 2 3 4
H =
0 x
1x
21
0 1 − x
11 − x
21
0 0 1 1
0 0 1 1
x
31 0 x
41 − x
31 0 1 − x
4x
5x
6x
70
1 − x
51 − x
61 − x
70
1
01
002
02
003
03
004
04
00Vars z
uv= 1 if row u and v are different (imposed by linear constraints, e.g.
z
30,40≥ x
3− x
5and z
30,40≥ x
5− x
3z
30,40≥ x
4(because of last col, if x
4= 1) z
10,40= 1 (because of last col)
etc...)
Vars w
u= 1 if no row t > u is identical to row u (imposed by lin. constraints) w
u≥ P
t>u
z
ut− (t − u − 1) Objective: minimize P
u
w
ubeamer-tu-logo
Idea for poly IP models: “bit” variables
G =
0 ∗ ∗ 1
0 0 1 1
∗ 1 0 ∗
∗ ∗ ∗ 0
1 2 3 4
H =
0 x
1x
21
0 1 − x
11 − x
21
0 0 1 1
0 0 1 1
x
31 0 x
41 − x
31 0 1 − x
4x
5x
6x
70
1 − x
51 − x
61 − x
70
1
01
002
02
003
03
004
04
00Vars z
uv= 1 if row u and v are different (imposed by linear constraints, e.g.
z
30,40≥ x
3− x
5and z
30,40≥ x
5− x
3z
30,40≥ x
4(because of last col, if x
4= 1) z
10,40= 1 (because of last col)
etc...)
Vars w
u= 1 if no row t > u is identical to row u (imposed by lin. constraints) w
u≥ P
t>u
z
ut− (t − u − 1) Objective: minimize P
u
w
ubeamer-tu-logo
Similar (“in spirit”) models have been proposed:
Poly IP
Brown and Harrower,A new integer programming formulation for the pure parsimony problem in haplotype analysis,Lecture Notes in Computer Science (WABI), 2004
Poly IP
Halldorsson, Bafna, Edwards, Lippert and Istrail,A survey of computational methods for determining haplotypes,Lecture Notes in CS, 2004
Poly IP
Lancia, Pinotti and Rizzi,Haplotyping populations by pure parsimony: Complexity, Exact and Approximation algorithms,INFORMS J. on Computing, 2004
Poly IP
Bertolazzi, Godi, Labbe’ and Tininini,Solving haplotyping inference parsimony
problem using a new basic polynomial formulation,Computers & Math. with Applic.,
2008
beamer-tu-logo
Poly-IP Experiments
These poly IP models yield very weak lower bounds.
They have been strengthened by cuts: still no good They perform in practice much worse than RTIP IP model experiments
N g ≤ 30 N s ≤ 20 N ∗ ≤ 7, 8
Comments: Runtimes from mins to several hrs (CPLEX).
beamer-tu-logo
Hybrid poly model
Generate explicitly R haplotypes h
1, . . . , h
Rand associate variables to them x
h1, . . . , x
hRWrite model with these explicit haplotypes plus poly “bit”
variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h
1, . . . , h
Rit seems solution is always found within R Comp results similar to Gusfield’s RTIP
hybrid
Brown and Harrower,Integer programming approaches to haplotype
inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and
Bioinf., 2006
beamer-tu-logo
Hybrid poly model
Generate explicitly R haplotypes h
1, . . . , h
Rand associate variables to them x
h1, . . . , x
hRWrite model with these explicit haplotypes plus poly “bit”
variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h
1, . . . , h
Rit seems solution is always found within R Comp results similar to Gusfield’s RTIP
hybrid
Brown and Harrower,Integer programming approaches to haplotype
inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and
Bioinf., 2006
beamer-tu-logo
Hybrid poly model
Generate explicitly R haplotypes h
1, . . . , h
Rand associate variables to them x
h1, . . . , x
hRWrite model with these explicit haplotypes plus poly “bit”
variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h
1, . . . , h
Rit seems solution is always found within R Comp results similar to Gusfield’s RTIP
hybrid
Brown and Harrower,Integer programming approaches to haplotype
inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and
Bioinf., 2006
beamer-tu-logo
Hybrid poly model
Generate explicitly R haplotypes h
1, . . . , h
Rand associate variables to them x
h1, . . . , x
hRWrite model with these explicit haplotypes plus poly “bit”
variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h
1, . . . , h
Rit seems solution is always found within R Comp results similar to Gusfield’s RTIP
hybrid
Brown and Harrower,Integer programming approaches to haplotype
inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and
Bioinf., 2006
beamer-tu-logo
Hybrid poly model
Generate explicitly R haplotypes h
1, . . . , h
Rand associate variables to them x
h1, . . . , x
hRWrite model with these explicit haplotypes plus poly “bit”
variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h
1, . . . , h
Rit seems solution is always found within R Comp results similar to Gusfield’s RTIP
hybrid
Brown and Harrower,Integer programming approaches to haplotype
inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and
Bioinf., 2006
beamer-tu-logo
4. A (dynamic) exponential model
Exp models can yield strong bounds, but are unmanageable.
We need new ideas to move to real-life hard problems (N ∗ ≥ 20)
Solution: Column-generation (and/or Constraint generation) at run time
Must use a different model than RTIP: exploit relations to
set cover
beamer-tu-logo
A set-covering approach
Covering condition
For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.
This condition is only necessary, but not sufficient, for H to be a feasible for MPH.
Example:
G =
1 ∗ ∗∗
∗1 ∗ ∗
∗ ∗ 1∗
∗ ∗ ∗1
here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.
It is covering condition:
ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.
subsets of g.s. = haplotype h =
h ↔ {(g, i, a) | h ∈ H(g) , h
i= a}
then the covering condition requires that H covers the ground set.
beamer-tu-logo
A set-covering approach
Covering condition
For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.
This condition is only necessary, but not sufficient, for H to be a feasible for MPH.
Example:
G =
1 ∗ ∗∗
∗1 ∗ ∗
∗ ∗ 1∗
∗ ∗ ∗1
here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.
It is covering condition:
ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.
subsets of g.s. = haplotype h =
h ↔ {(g, i, a) | h ∈ H(g) , h
i= a}
then the covering condition requires that H covers the ground set.
beamer-tu-logo
A set-covering approach
Covering condition
For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.
This condition is only necessary, but not sufficient, for H to be a feasible for MPH.
Example:
G =
1 ∗ ∗∗
∗1 ∗ ∗
∗ ∗ 1∗
∗ ∗ ∗1
here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.
It is covering condition:
ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.
subsets of g.s. = haplotype h =
h ↔ {(g, i, a) | h ∈ H(g) , h
i= a}
then the covering condition requires that H covers the ground set.
beamer-tu-logo
The SC model
SC OPT := min X
h∈H(G)
x
hsubject to X
h∈H(g):h[i]=a
x
h≥ 1 ∀g ∈ G, i ∈ amb(g), a ∈ {0, 1}
x ∈ {0, 1}
H(G).
O(2
Ns) variables and O(N
gN
s) constraints.
Column generation via pricing is possible in linear time wrt N
∗if sol is feasible for MPH it is optimal
if not, a cut is added that impose at least a haplotype of those at zero
Cut generation is possible in poly-time, Col generation remains effective
beamer-tu-logo
The SC model
SC OPT := min X
h∈H(G)
x
hsubject to X
h∈H(g):h[i]=a
x
h≥ 1 ∀g ∈ G, i ∈ amb(g), a ∈ {0, 1}
x ∈ {0, 1}
H(G).
O(2
Ns) variables and O(N
gN
s) constraints.
Column generation via pricing is possible in linear time wrt N
∗if sol is feasible for MPH it is optimal
if not, a cut is added that impose at least a haplotype of those at zero
Cut generation is possible in poly-time, Col generation remains effective
beamer-tu-logo
SC experiments N g 50 N s 100 N ∗ 25
Comments: Optimal sols to real-life instances never solved before. Up to 10 10 haplotype variables (implicit).
Best exact approach so far Branch-and-Cut-and-Price
Lancia and Serafini,A set covering approach with column
generation for parsimony haplotyping,INFORMS J. on
Computing, 2009
beamer-tu-logo
5. Semidefinite programming
Still expon.: let possible haplotypes h
1, . . . , h
n, where n = O(2
Ns). Instead of {0, 1}
vars use {−1, 1} vars:
x
i:=
( +1 if haplotype h
iis in solution
−1 otherwise.
Integer Quadratic Program
Number of selected haplotypes (to min):
n
X
i=1
(1 + x
i)
24
since
(1+1)4 2= 1 and
(1−1)4 2= 0.
Constraints, for each g X
(i,j):hi+hj=g
(1 + x
i)(1 + x
j)
4 ≥ 1 ∀g ∈ G
beamer-tu-logo
5. Semidefinite programming
Still expon.: let possible haplotypes h
1, . . . , h
n, where n = O(2
Ns). Instead of {0, 1}
vars use {−1, 1} vars:
x
i:=
( +1 if haplotype h
iis in solution
−1 otherwise.
Integer Quadratic Program
Number of selected haplotypes (to min):
n
X
i=1
(1 + x
i)
24
since
(1+1)4 2= 1 and
(1−1)4 2= 0.
Constraints, for each g X
(i,j):hi+hj=g
(1 + x
i)(1 + x
j)
4 ≥ 1 ∀g ∈ G
beamer-tu-logo
5. Semidefinite programming
Relax each x
ito an (n + 1)-dimensional unit vector y
iand replace 1 with y
0= (1, 0, . . . , 0)
min
n
X
i=1
(1 + x
i)
24 7→ min
n
X
i=1
(y
0+ y
i)
24
X
hi+hj=g
(1 + x
i)(1 + x
j)
4 ≥ 1 7→ X
hi+hj=g
(y
0+ y
i)(y
0+ y
j)
4 ≥ 1 ∀g ∈ G
with |y
i| = 1 i = 1, . . . , n
Let Y = (y
0y
1. . . y
n)
T(y
0y
1. . . y
n)
Y is positive semidefinite, and the problem can be solved via semidefinite
programming.
beamer-tu-logo
5. Semidefinite programming
Relax each x
ito an (n + 1)-dimensional unit vector y
iand replace 1 with y
0= (1, 0, . . . , 0)
min
n
X
i=1
(1 + x
i)
24 7→ min
n
X
i=1
(y
0+ y
i)
24
X
hi+hj=g
(1 + x
i)(1 + x
j)
4 ≥ 1 7→ X
hi+hj=g
(y
0+ y
i)(y
0+ y
j)
4 ≥ 1 ∀g ∈ G
with |y
i| = 1 i = 1, . . . , n
Let Y = (y
0y
1. . . y
n)
T(y
0y
1. . . y
n)
Y is positive semidefinite, and the problem can be solved via semidefinite
programming.
beamer-tu-logo
SDP experiments N g ≤ 20 N s ≤ 20 N ∗ small
Comments: worse than RTIP, but probably better than Poly-IP
Semidefinite Progr.
Kalpakis and Namjoshi,Haplotype phasing using semidefinite
programming,IEEE Symp. on Bioinf. and Bioeng., 2005
beamer-tu-logo
6. Boolean/SAT approaches
For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”
can be expressed in the bit vars It is a SAT problem
Solved by SAT solvers via constraint programming Iterate on values of V
SAT approach
Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006
SAT experiments N
g50 N
s100 N
∗' 15
Comments: minisat solver. Good performance, better than RTIP
beamer-tu-logo
6. Boolean/SAT approaches
For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”
can be expressed in the bit vars It is a SAT problem
Solved by SAT solvers via constraint programming Iterate on values of V
SAT approach
Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006
SAT experiments N
g50 N
s100 N
∗' 15
Comments: minisat solver. Good performance, better than RTIP
beamer-tu-logo
6. Boolean/SAT approaches
For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”
can be expressed in the bit vars It is a SAT problem
Solved by SAT solvers via constraint programming Iterate on values of V
SAT approach
Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006
SAT experiments N
g50 N
s100 N
∗' 15
Comments: minisat solver. Good performance, better than RTIP
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
LP deterministic rounding
Theorem (LPN04)
Let ˆ x be optimal LP-sol of RTIP. Round ˆ x to z as
z i :=
( 1 if ˆ x i ≥ 2
N∗−11 0 otherwise.
Then z is a 2 N
∗−1 -approximate solution.
This is a constant factor approx when N ∗ bounded by constant.
beamer-tu-logo
QP randomized rounding
A (log N g )-approximation Uses SDP formulation
1
solve SDP and retrieve norm-1 vectors y (Cholesky decomposition of Y )
2
from vectors y retrieve integer vars for each g (used haplotypes) by
randomization (randomized rounding through projection of random unit vectors on y )
3
if all genotypes solved ok, else goto 1.
Fast on small instances (but not polynomial, size of SDP exponential)
Approx. algo
Huang, Chao and Chen,An approximation algorithm for
haplotype inference by maximum parsimony,ACM Symp. on
Appl. Computing, 2005
beamer-tu-logo
Outline
1 Combinatorial approaches to a problem
2 Our problem: math and bio
3 Complexity
4 Exact algorithms
5 Approximation algorithms
6 Heuristics
beamer-tu-logo
The CollHAPS heuristic
Symbolic haplotypes
A “haplotype” with bits and variables (negated or not).
e.g. (0 1 x 3 0 0 ¯ x 5 0 1)
Two symbolic haplotypes can be collapsed if some variables can be assigned so that they become identical:
0 1 x 3 0 0 x ¯ 5 0 1
0 1 x 6 0 x 9 1 0 1
become
0 1 x 3 0 0 1 0 1
by assigning x 3 = x 6 , x 9 = 0 and x 5 = 0.
beamer-tu-logo
The CollHAPS heuristic
Symbolic haplotypes
A “haplotype” with bits and variables (negated or not).
e.g. (0 1 x 3 0 0 ¯ x 5 0 1)
Two symbolic haplotypes can be collapsed if some variables can be assigned so that they become identical:
0 1 x 3 0 0 x ¯ 5 0 1 0 1 x 6 0 x 9 1 0 1 become
0 1 x 3 0 0 1 0 1
by assigning x 3 = x 6 , x 9 = 0 and x 5 = 0.
beamer-tu-logo
Heuristic CollHAPS
Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
4x
5x
6x
71 x ¯
5x ¯
6x ¯
71
1
01
002
02
003
03
00collapse 1’ and 3”
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
40 x
1x
21 1 x ¯
1x ¯
21
1
01
002
02
003
03
00collapse 1’ and 2”
0 1 0 1
0 0 1 1
0 1 0 1
1 1 0 0
0 1 0 1
1 0 1 1
1
01
002
02
003
03
00Key points
One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”
Two: efficient update of distances
beamer-tu-logo
Heuristic CollHAPS
Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
4x
5x
6x
71 x ¯
5x ¯
6x ¯
71
1
01
002
02
003
03
00collapse 1’ and 3”
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
40 x
1x
21 1 x ¯
1x ¯
21
1
01
002
02
003
03
00collapse 1’ and 2”
0 1 0 1
0 0 1 1
0 1 0 1
1 1 0 0
0 1 0 1
1 0 1 1
1
01
002
02
003
03
00Key points
One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”
Two: efficient update of distances
beamer-tu-logo
Heuristic CollHAPS
Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
4x
5x
6x
71 x ¯
5x ¯
6x ¯
71
1
01
002
02
003
03
00collapse 1’ and 3”
0 x
1x
21 0 x ¯
1x ¯
21 x
31 0 x
4x ¯
31 0 x ¯
40 x
1x
21 1 x ¯
1x ¯
21
1
01
002
02
003
03
00collapse 1’ and 2”
0 1 0 1
0 0 1 1
0 1 0 1
1 1 0 0
0 1 0 1
1 0 1 1
1
01
002
02
003
03
00Key points
One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”
Two: efficient update of distances
beamer-tu-logo