Max Parsimony Haplotyping (a tour of combinatorial optimization through a single problem)

(1)

beamer-tu-logo

(a tour of combinatorial optimization

through a single problem)

(2)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(3)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(4)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(5)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(6)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(7)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(8)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(9)

beamer-tu-logo

How do we tackle problems?

When faced with a new problem we go through these steps:

1

Assess complexity: is it NP/APX-hard? (usually ...yes)

2

Exact approaches:

Combinatorial branch-and-bound

Integer Linear Programming (possibly w/cuts& pricing) Quadratic/Semidefinite Programming

SAT/Boolean approaches ...

3

Approximation algorithms:

LP rounding

Randomized approximations ...

4

Heuristics:

(Sophisticated) local search Metaheuristics

...

All of the above have been tried on our problem!

(10)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(11)

beamer-tu-logo

Each number as a sum

Problem description

INPUT : a set G of numbers in {0, . . . , v }

OUTPUT: a set H of numbers in {0, . . . , dv /2e} such that for each g ∈ G there are h ⁰ , h ⁰⁰ ∈ H with g = h ⁰ + h ⁰⁰ GOAL: minimize |H|

Example

For instance, G = {2, 3, 4, 5, 6, 8} can be generated by

H = {1, 2, 4} as 2 = 1 + 1, 3 = 1 + 2, etc...

(12)

beamer-tu-logo

Each number as a sum

Problem description

INPUT : a set G of numbers in {0, . . . , v }

OUTPUT: a set H of numbers in {0, . . . , dv /2e} such that for each g ∈ G there are h ⁰ , h ⁰⁰ ∈ H with g = h ⁰ + h ⁰⁰ GOAL: minimize |H|

Example

For instance, G = {2, 3, 4, 5, 6, 8} can be generated by

H = {1, 2, 4} as 2 = 1 + 1, 3 = 1 + 2, etc...

(13)

beamer-tu-logo

Each vector as a sum

The above problem is particularly interesting for v = 2 and vectors instead of numbers.

G subset of {0, 1, 2} ⁿ H subset of {0, 1} ⁿ Example

For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0)

The star of the show

This is our problem! It’s as hard as it gets... Let’s give it a bio

motivation

(14)

beamer-tu-logo

Each vector as a sum

The above problem is particularly interesting for v = 2 and vectors instead of numbers.

G subset of {0, 1, 2} ⁿ H subset of {0, 1} ⁿ Example

For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0) The star of the show

This is our problem! It’s as hard as it gets... Let’s give it a bio

motivation

(15)

beamer-tu-logo

Each vector as a sum

The above problem is particularly interesting for v = 2 and vectors instead of numbers.

G subset of {0, 1, 2} ⁿ H subset of {0, 1} ⁿ Example

For instance, G = {(0, 2, 0), (1, 2, 0), (2, 2, 0)} can be generated by H = {(0, 1, 0), (1, 1, 0)} as (0, 2, 0) = (0, 1, 0) + (0, 1, 0), (1, 2, 0) = (0, 1, 0) + (1, 1, 0) and (2, 2, 0) = (1, 1, 0) + (1, 1, 0) The star of the show

This is our problem! It’s as hard as it gets... Let’s give it a bio

motivation

(16)

beamer-tu-logo

The underlying biology problem

Single Nucleotide Polymorphisms

A SNP is a specific nucleotide at a certain position that can vary in a population

Its different values are called alleles and are typically just two

Different SNPs can have different alleles

SNPs are the predominant form of human polymorphism

(17)

beamer-tu-logo

The underlying biology problem

Single Nucleotide Polymorphisms

A SNP is a specific nucleotide at a certain position that can vary in a population

Its different values are called alleles and are typically just two

Different SNPs can have different alleles

SNPs are the predominant form of human polymorphism

(18)

beamer-tu-logo

The underlying biology problem

Single Nucleotide Polymorphisms

A SNP is a specific nucleotide at a certain position that can vary in a population

Its different values are called alleles and are typically just two

Different SNPs can have different alleles

SNPs are the predominant form of human polymorphism

(19)

beamer-tu-logo

The underlying biology problem

Single Nucleotide Polymorphisms

A SNP is a specific nucleotide at a certain position that can vary in a population

Its different values are called alleles and are typically just two

Different SNPs can have different alleles

SNPs are the predominant form of human polymorphism

(20)

beamer-tu-logo

Haplotypes

Hapl. 1, paternal: tagccCtatttCccaggcgcCgtatgacgggTcta Hapl. 1, maternal: tagccGtatttAccaggcgcGgtatgacgggTcta Hapl. 2, paternal: tagccCtatttAccaggcgcGgtatgacgggTcta Hapl. 2, maternal: tagccGtatttCccaggcgcGgtatgacgggCcta Hapl. 3, paternal: tagccCtatttAccaggcgcGgtatgacgggTcta Hapl. 3, maternal: tagccGtatttAccaggcgcCgtatgacgggCcta

We are diploid organisms

For each chromosome we have two haplotypes

In the above, the haplotypes for Individual 1 are {CCCT,

GAGT}

(21)

beamer-tu-logo

Haplotypes are expensive to get in lab exp Much cheaper experiment yields genotypes

Genotype is the “sum” of haplotypes, specifying the alleles at each position but not their phase (father, mother)

Example

h

⁰

= G A T C

h

⁰⁰

= G T T A

g = {G} {A, T } {T } {A, C}

Cardinality one : homozygous ({G}, {T} )

Cardinality two : heterozygous ({A,T}, {A,C})

(22)

beamer-tu-logo

Haplotypes are expensive to get in lab exp Much cheaper experiment yields genotypes

Genotype is the “sum” of haplotypes, specifying the alleles at each position but not their phase (father, mother)

Example

h

⁰

= G A T C

h

⁰⁰

= G T T A

g = {G} {A, T } {T } {A, C}

Cardinality one : homozygous ({G}, {T} )

Cardinality two : heterozygous ({A,T}, {A,C})

(23)

beamer-tu-logo

The phasing problem

Each genotype with k heterozygous SNPs can be resolved in 2 ^{k −1} ways

Resolutions

A G A T A

T G T T C

{A, T } {G} {A, T } {T } {A, C}

A G A T C

T G T T A

{A, T } {G} {A, T } {T } {A, C}

A G T T A

T G A T C

{A, T } {G} {A, T } {T } {A, C}

A G T T C

T G A T A

{A, T } {G} {A, T } {T } {A, C}

How do we find the correct one?

(24)

beamer-tu-logo

From DNA to bits

we encode the alleles with 0/1. haplotypes = binary vectors we encode genotypes with ternary vectors over {0,1,*}

0 and 1 = homozygous, * = heterozygous

Haplotype 1, paternal: 0 1 0 1

* * * 1 Genotype 1 Haplotype 1, maternal: 1 0 1 1

Haplotype 2, paternal: 0 0 1 1

* * 1 * Genotype 2 Haplotype 2, maternal: 1 1 1 0

Haplotype 3, paternal: 0 0 1 1

* 0 * * Genotype 3 Haplotype 3, maternal: 1 0 0 0

Figure: Haplotypes and corresponding genotypes.

(25)

beamer-tu-logo

From DNA to bits

we encode the alleles with 0/1. haplotypes = binary vectors we encode genotypes with ternary vectors over {0,1,*}

0 and 1 = homozygous, * = heterozygous

Haplotype 1, paternal: 0 1 0 1

* * * 1 Genotype 1 Haplotype 1, maternal: 1 0 1 1

Haplotype 2, paternal: 0 0 1 1

* * 1 * Genotype 2 Haplotype 2, maternal: 1 1 1 0

Haplotype 3, paternal: 0 0 1 1

* 0 * * Genotype 3 Haplotype 3, maternal: 1 0 0 0

Figure: Haplotypes and corresponding genotypes.

(26)

beamer-tu-logo

math of haplotypes

Again, a genotype is the sum of two haplotypes g = h

⁰

+ h

⁰⁰

.

0 +

0 =

0 1 +

1 =

1 0 +

1 =

∗

1 +

0 =

∗

Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}

Resolution of g: pair h

⁰

, h

⁰⁰

∈ H(g) s.t. g = h

⁰

+ h

⁰⁰

Ambiguous sites: positions g

_i

= ∗

A genotype w/k ambiguous sites has 2

^k

compatible haplotypes and 2

^{k −1}

resolutions. (If k ≤ 1 genotype has only 1 resolution)

HAPLOTYPING PROBLEM INPUT: set of genotypes G

OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h

⁰

, h

⁰⁰

∈ H with g = h

⁰

+ h

⁰⁰

OBJECTIVE: ?

(27)

beamer-tu-logo

math of haplotypes

Again, a genotype is the sum of two haplotypes g = h

⁰

+ h

⁰⁰

.

0 +

0 =

0 1 +

1 =

1 0 +

1 =

∗

1 +

0 =

∗

Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}

Resolution of g: pair h

⁰

, h

⁰⁰

∈ H(g) s.t. g = h

⁰

+ h

⁰⁰

Ambiguous sites: positions g

_i

= ∗

A genotype w/k ambiguous sites has 2

^k

compatible haplotypes and 2

^{k −1}

resolutions. (If k ≤ 1 genotype has only 1 resolution)

HAPLOTYPING PROBLEM INPUT: set of genotypes G

OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h

⁰

, h

⁰⁰

∈ H with g = h

⁰

+ h

⁰⁰

OBJECTIVE: ?

(28)

beamer-tu-logo

math of haplotypes

Again, a genotype is the sum of two haplotypes g = h

⁰

+ h

⁰⁰

.

0 +

0 =

0 1 +

1 =

1 0 +

1 =

∗

1 +

0 =

∗

Denote H(g) = haplotypes compatible with g. H(0**) = {000, 001, 010, 011}

Resolution of g: pair h

⁰

, h

⁰⁰

∈ H(g) s.t. g = h

⁰

+ h

⁰⁰

Ambiguous sites: positions g

_i

= ∗

A genotype w/k ambiguous sites has 2

^k

compatible haplotypes and 2

^{k −1}

resolutions. (If k ≤ 1 genotype has only 1 resolution)

HAPLOTYPING PROBLEM INPUT: set of genotypes G

OUTPUT: set of haplotypes H s.t. ∀g ∈ G ∃h

⁰

, h

⁰⁰

∈ H with g = h

⁰

+ h

⁰⁰

OBJECTIVE: ?

(29)

beamer-tu-logo

Max parsimony Haplotyping (MPH)

after some years...

Objective: parsimony

Haplotypes are very conserved. We seek the set H of smallest

cardinality (simplest explanation, Occam’s razor)

(30)

beamer-tu-logo

Max parsimony Haplotyping (MPH)

after some years...

Objective: parsimony

Haplotypes are very conserved. We seek the set H of smallest

cardinality (simplest explanation, Occam’s razor)

(31)

beamer-tu-logo

Max parsimony Haplotyping (MPH)

after some years...

Objective: parsimony

Haplotypes are very conserved. We seek the set H of smallest

cardinality (simplest explanation, Occam’s razor)

(32)

beamer-tu-logo

Max parsimony Haplotyping (MPH)

after some years...

Objective: parsimony

Haplotypes are very conserved. We seek the set H of smallest

cardinality (simplest explanation, Occam’s razor)

(33)

beamer-tu-logo

Our problem is exactly the math problem of beginning, only sum is written differently.

G =







g ¹ = 01 g ² = 00

g ³ = 0

g ⁴ = 0**

g ⁵ = 0*1







7→ H = {h ¹ , h ² , h ³ , h ⁴ } = {001, 101, 000, 011}

g ¹ = h ¹ + h ² , g ² = h ¹ + h ³ , etc...

Important parameters

# of genotypes (rows) N _g

# of SNPs (cols) N _s (and, most important of all,)

max (or avg) # of ambiguous sites N ∗

(34)

beamer-tu-logo

Our problem is exactly the math problem of beginning, only sum is written differently.

G =







g ¹ = 01 g ² = 00

g ³ = 0

g ⁴ = 0**

g ⁵ = 0*1







7→ H = {h ¹ , h ² , h ³ , h ⁴ } = {001, 101, 000, 011}

g ¹ = h ¹ + h ² , g ² = h ¹ + h ³ , etc...

Important parameters

# of genotypes (rows) N _g

# of SNPs (cols) N _s (and, most important of all,)

max (or avg) # of ambiguous sites N ∗

(35)

beamer-tu-logo

Now what?

Problem MPH introduced by Gusfield (2002)

It is “extremely exponential”: solution space has size 2 ²

^N∗

It looks complex.

Let’s start by assessing its real complexity...

(36)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(37)

beamer-tu-logo

Theorem

The MPH problem is APX-hard, even if each genotype has ≤ 3 ambiguous sites (reduction from VC).

Graph has a cover size c iff there is a set H of |V | + c haplotypes that resolve all genotypes.

APX-hardness

Lancia, Pinotti and Rizzi,Haplotyping populations by pure parsimony: Complexity,

Exact and Approximation algorithms,INFORMS J. on Computing, 2004

(38)

beamer-tu-logo

When N _∗ = 1 problem is trivial and when N _∗ = 3 it is APX-hard.

What for N ∗ = 2 ? Theorem

The MPH problem is polynomially solvable if each genotype has ≤ 2 ambiguous sites (ILP formulation totally unimodular).

Polynomial 2-case

Lancia and Rizzi,A polynomial case of the parsimony

haplotyping problem,Operations Research Letters, 2006

(39)

beamer-tu-logo

When N _∗ = 1 problem is trivial and when N _∗ = 3 it is APX-hard.

What for N ∗ = 2 ? Theorem

The MPH problem is polynomially solvable if each genotype has ≤ 2 ambiguous sites (ILP formulation totally unimodular).

Polynomial 2-case

Lancia and Rizzi,A polynomial case of the parsimony

haplotyping problem,Operations Research Letters, 2006

(40)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(41)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(42)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(43)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(44)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(45)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(46)

beamer-tu-logo

Exact algorithms

1

(Naive) combinatorial branch-and-bound

2

Exponential ILP model (no row/col generation)

3

Polynomial and hybrid ILP models

4

Exponential ILP model (with row/col generation)

5

Semidefinite programming

6

SAT approaches

(47)

beamer-tu-logo

1. A (naive) combinatorial b&b

branch on all resolutions on g

trivial LB : used haplotypes so far plus...

...elementary bound on remaining needed haplotypes

HAPAR experiments N

g

15 N

s

20 N

∗

very small (say ≤ 7)

Comments: Can take days even on “toy” problems with 12 genotypes

Combinatorial B& B

Wang and Xu,Haplotype inference by maximum parsimony,Bioinformatics, 2003

(48)

beamer-tu-logo

1. A (naive) combinatorial b&b

branch on all resolutions on g

trivial LB : used haplotypes so far plus...

...elementary bound on remaining needed haplotypes

HAPAR experiments N

g

15 N

s

20 N

∗

very small (say ≤ 7)

Comments: Can take days even on “toy” problems with 12 genotypes

Combinatorial B& B

Wang and Xu,Haplotype inference by maximum parsimony,Bioinformatics, 2003

(49)

beamer-tu-logo

2. A (static) exponential ILP model

Variable x

_h

for each haplotype h and y

_h₁_,h₂

for each resolution h

₁

+ h

₂

min X

h∈H(G)

x

_h

subject to

X

(h₁,h₂):g=h₁+h₂

y

_h₁_,h₂

≥ 1 ∀g ∈ G

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

)

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

) where x , y are 0/1 variables.

EXP SIZE

Exponential no. of variables and of constraints (each resolution (h

₁

,h

₂

))

Some variables can be disposed of, but not many (still exp size, model RTIP,

Reduced Theoretical Integer Program)

(50)

beamer-tu-logo

2. A (static) exponential ILP model

Variable x

_h

for each haplotype h and y

_h₁_,h₂

for each resolution h

₁

+ h

₂

min X

h∈H(G)

x

_h

subject to

X

(h₁,h₂):g=h₁+h₂

y

_h₁_,h₂

≥ 1 ∀g ∈ G

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

)

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

) where x , y are 0/1 variables.

EXP SIZE

Exponential no. of variables and of constraints (each resolution (h

₁

,h

₂

))

Some variables can be disposed of, but not many (still exp size, model RTIP,

Reduced Theoretical Integer Program)

(51)

beamer-tu-logo

2. A (static) exponential ILP model

Variable x

_h

for each haplotype h and y

_h₁_,h₂

for each resolution h

₁

+ h

₂

min X

h∈H(G)

x

_h

subject to

X

(h₁,h₂):g=h₁+h₂

y

_h₁_,h₂

≥ 1 ∀g ∈ G

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

)

y

h₁,h₂

≤ x

h₁

∀(h

1

,h

2

) where x , y are 0/1 variables.

EXP SIZE

Exponential no. of variables and of constraints (each resolution (h

₁

,h

₂

))

Some variables can be disposed of, but not many (still exp size, model RTIP,

Reduced Theoretical Integer Program)

(52)

beamer-tu-logo

RTIP

RTIP exp. IP model

Gusfield,Haplotype Inference by Pure

Parsimony,Combinatorial Pattern Matching, 2003 Model suitable only for small/moderate data sets Yields very strong lower bound

RTIP experiments N g ≤ 50 N s ≤ 30 N _∗ ≤ 10

Comments: CPLEX. Runtimes from secs to hrs. Solvable

models with about 100,000 rows and cols.

(53)

beamer-tu-logo

3. Polynomial and hybrid ILP models

Gusfield’s RTIP exponential model impractical for increasing N ∗

Polynomial ILP (both in rows and cols) are possible.

They are theoretically better than RTIP (in that they can run on problems on which RTIP cannot even create the LP) In practice, they are much worse, as they have very poor lower bound

Several authors independently came up with them

(54)

beamer-tu-logo

Idea for poly IP models: “bit” variables

G =







0 ∗ ∗ 1

0 0 1 1

∗ 1 0 ∗

∗ ∗ ∗ 0





 1 2 3 4

H =







0 x

₁

x

₂

1 0 1 − x

₁

1 − x

₂

1 0 0 1 1

0 0 1 1

x

3

1 0 x

4

1 − x

₃

1 0 1 − x

₄

x

5

x

6

x

7

0 1 − x

₅

1 − x

₆

1 − x

₇

0 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

4

⁰

4

⁰⁰

Vars z

uv

= 1 if row u and v are different (imposed by linear constraints, e.g.

z

3⁰,4⁰

≥ x

3

− x

5

and z

3⁰,4⁰

≥ x

5

− x

3

z

₃0,4⁰

≥ x

4

(because of last col, if x

₄

= 1) z

₁0,4⁰

= 1 (because of last col)

etc...)

Vars w

u

= 1 if no row t > u is identical to row u (imposed by lin. constraints) w

u

≥ P

t>u

z

ut

− (t − u − 1) Objective: minimize P

u

w

u

(55)

beamer-tu-logo

Idea for poly IP models: “bit” variables

G =







0 ∗ ∗ 1

0 0 1 1

∗ 1 0 ∗

∗ ∗ ∗ 0





 1 2 3 4

H =







0 x

₁

x

₂

1 0 1 − x

₁

1 − x

₂

1 0 0 1 1

0 0 1 1

x

3

1 0 x

4

1 − x

₃

1 0 1 − x

₄

x

5

x

6

x

7

0 1 − x

₅

1 − x

₆

1 − x

₇

0 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

4

⁰

4

⁰⁰

Vars z

uv

= 1 if row u and v are different (imposed by linear constraints, e.g.

z

3⁰,4⁰

≥ x

3

− x

5

and z

3⁰,4⁰

≥ x

5

− x

3

z

₃0,4⁰

≥ x

4

(because of last col, if x

₄

= 1) z

₁0,4⁰

= 1 (because of last col)

etc...)

Vars w

u

= 1 if no row t > u is identical to row u (imposed by lin. constraints) w

u

≥ P

t>u

z

ut

− (t − u − 1) Objective: minimize P

u

w

u

(56)

beamer-tu-logo

Idea for poly IP models: “bit” variables

G =







0 ∗ ∗ 1

0 0 1 1

∗ 1 0 ∗

∗ ∗ ∗ 0





 1 2 3 4

H =







0 x

₁

x

₂

1 0 1 − x

₁

1 − x

₂

1 0 0 1 1

0 0 1 1

x

3

1 0 x

4

1 − x

₃

1 0 1 − x

₄

x

5

x

6

x

7

0 1 − x

₅

1 − x

₆

1 − x

₇

0 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

4

⁰

4

⁰⁰

Vars z

uv

= 1 if row u and v are different (imposed by linear constraints, e.g.

z

3⁰,4⁰

≥ x

3

− x

5

and z

3⁰,4⁰

≥ x

5

− x

3

z

₃0,4⁰

≥ x

4

(because of last col, if x

₄

= 1) z

₁0,4⁰

= 1 (because of last col)

etc...)

Vars w

u

= 1 if no row t > u is identical to row u (imposed by lin. constraints) w

u

≥ P

t>u

z

ut

− (t − u − 1) Objective: minimize P

u

w

u

(57)

beamer-tu-logo

Idea for poly IP models: “bit” variables

G =







0 ∗ ∗ 1

0 0 1 1

∗ 1 0 ∗

∗ ∗ ∗ 0





 1 2 3 4

H =







0 x

₁

x

₂

1 0 1 − x

₁

1 − x

₂

1 0 0 1 1

0 0 1 1

x

3

1 0 x

4

1 − x

₃

1 0 1 − x

₄

x

5

x

6

x

7

0 1 − x

₅

1 − x

₆

1 − x

₇

0 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

4

⁰

4

⁰⁰

Vars z

uv

= 1 if row u and v are different (imposed by linear constraints, e.g.

z

3⁰,4⁰

≥ x

3

− x

5

and z

3⁰,4⁰

≥ x

5

− x

3

z

₃0,4⁰

≥ x

4

(because of last col, if x

₄

= 1) z

₁0,4⁰

= 1 (because of last col)

etc...)

Vars w

u

= 1 if no row t > u is identical to row u (imposed by lin. constraints) w

u

≥ P

t>u

z

ut

− (t − u − 1) Objective: minimize P

u

w

u

(58)

beamer-tu-logo

Similar (“in spirit”) models have been proposed:

Poly IP

Brown and Harrower,A new integer programming formulation for the pure parsimony problem in haplotype analysis,Lecture Notes in Computer Science (WABI), 2004

Poly IP

Halldorsson, Bafna, Edwards, Lippert and Istrail,A survey of computational methods for determining haplotypes,Lecture Notes in CS, 2004

Poly IP

Lancia, Pinotti and Rizzi,Haplotyping populations by pure parsimony: Complexity, Exact and Approximation algorithms,INFORMS J. on Computing, 2004

Poly IP

Bertolazzi, Godi, Labbe’ and Tininini,Solving haplotyping inference parsimony

problem using a new basic polynomial formulation,Computers & Math. with Applic.,

2008

(59)

beamer-tu-logo

Poly-IP Experiments

These poly IP models yield very weak lower bounds.

They have been strengthened by cuts: still no good They perform in practice much worse than RTIP IP model experiments

N _g ≤ 30 N _s ≤ 20 N ∗ ≤ 7, 8

Comments: Runtimes from mins to several hrs (CPLEX).

(60)

beamer-tu-logo

Hybrid poly model

Generate explicitly R haplotypes h

1

, . . . , h

_R

and associate variables to them x

_h₁

, . . . , x

_h_R

Write model with these explicit haplotypes plus poly “bit”

variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h

1

, . . . , h

R

it seems solution is always found within R Comp results similar to Gusfield’s RTIP

hybrid

Brown and Harrower,Integer programming approaches to haplotype

inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and

Bioinf., 2006

(61)

beamer-tu-logo

Hybrid poly model

Generate explicitly R haplotypes h

1

, . . . , h

_R

and associate variables to them x

_h₁

, . . . , x

_h_R

Write model with these explicit haplotypes plus poly “bit”

variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h

1

, . . . , h

R

it seems solution is always found within R Comp results similar to Gusfield’s RTIP

hybrid

Brown and Harrower,Integer programming approaches to haplotype

inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and

Bioinf., 2006

(62)

beamer-tu-logo

Hybrid poly model

Generate explicitly R haplotypes h

1

, . . . , h

_R

and associate variables to them x

_h₁

, . . . , x

_h_R

Write model with these explicit haplotypes plus poly “bit”

variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h

1

, . . . , h

R

it seems solution is always found within R Comp results similar to Gusfield’s RTIP

hybrid

Brown and Harrower,Integer programming approaches to haplotype

inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and

Bioinf., 2006

(63)

beamer-tu-logo

Hybrid poly model

Generate explicitly R haplotypes h

1

, . . . , h

_R

and associate variables to them x

_h₁

, . . . , x

_h_R

Write model with these explicit haplotypes plus poly “bit”

variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h

1

, . . . , h

R

it seems solution is always found within R Comp results similar to Gusfield’s RTIP

hybrid

Brown and Harrower,Integer programming approaches to haplotype

inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and

Bioinf., 2006

(64)

beamer-tu-logo

Hybrid poly model

Generate explicitly R haplotypes h

1

, . . . , h

_R

and associate variables to them x

_h₁

, . . . , x

_h_R

Write model with these explicit haplotypes plus poly “bit”

variables for implicit haplotypes (model is still polynomial) Heavily depends on starting heur to get h

1

, . . . , h

R

it seems solution is always found within R Comp results similar to Gusfield’s RTIP

hybrid

Brown and Harrower,Integer programming approaches to haplotype

inference by pure parsimony,IEEE/ACM Trans. on Comp. Biol. and

Bioinf., 2006

(65)

beamer-tu-logo

4. A (dynamic) exponential model

Exp models can yield strong bounds, but are unmanageable.

We need new ideas to move to real-life hard problems (N ∗ ≥ 20)

Solution: Column-generation (and/or Constraint generation) at run time

Must use a different model than RTIP: exploit relations to

set cover

(66)

beamer-tu-logo

A set-covering approach

Covering condition

For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.

This condition is only necessary, but not sufficient, for H to be a feasible for MPH.

Example:

G =





 1 ∗ ∗∗

∗1 ∗ ∗

∗ ∗ 1∗

∗ ∗ ∗1







here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.

It is covering condition:

ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.

subsets of g.s. = haplotype h =

h ↔ {(g, i, a) | h ∈ H(g) , h

i

= a}

then the covering condition requires that H covers the ground set.

(67)

beamer-tu-logo

A set-covering approach

Covering condition

For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.

This condition is only necessary, but not sufficient, for H to be a feasible for MPH.

Example:

G =





 1 ∗ ∗∗

∗1 ∗ ∗

∗ ∗ 1∗

∗ ∗ ∗1







here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.

It is covering condition:

ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.

subsets of g.s. = haplotype h =

h ↔ {(g, i, a) | h ∈ H(g) , h

i

= a}

then the covering condition requires that H covers the ground set.

(68)

beamer-tu-logo

A set-covering approach

Covering condition

For each genotype g, ambiguous position i and value a ∈ {0, 1}, there must be a haplotype h in solution H s.t. h[i] = a.

This condition is only necessary, but not sufficient, for H to be a feasible for MPH.

Example:

G =





 1 ∗ ∗∗

∗1 ∗ ∗

∗ ∗ 1∗

∗ ∗ ∗1







here H = {0111, 1011, 1101, 1110} satisfies cov. cond. but does not resolve G.

It is covering condition:

ground set = triples (g, i, a) such that g ∈ G, i ambiguous for g and a ∈ {0, 1}.

subsets of g.s. = haplotype h =

h ↔ {(g, i, a) | h ∈ H(g) , h

i

= a}

then the covering condition requires that H covers the ground set.

(69)

beamer-tu-logo

The SC model

SC OPT := min X

h∈H(G)

x

h

subject to X

h∈H(g):h[i]=a

x

_h

≥ 1 ∀g ∈ G, i ∈ amb(g), a ∈ {0, 1}

x ∈ {0, 1}

^H(G)

.

O(2

^N^s

) variables and O(N

g

N

s

) constraints.

Column generation via pricing is possible in linear time wrt N

∗

if sol is feasible for MPH it is optimal

if not, a cut is added that impose at least a haplotype of those at zero

Cut generation is possible in poly-time, Col generation remains effective

(70)

beamer-tu-logo

The SC model

SC OPT := min X

h∈H(G)

x

h

subject to X

h∈H(g):h[i]=a

x

_h

≥ 1 ∀g ∈ G, i ∈ amb(g), a ∈ {0, 1}

x ∈ {0, 1}

^H(G)

.

O(2

^N^s

) variables and O(N

g

N

s

) constraints.

Column generation via pricing is possible in linear time wrt N

∗

if sol is feasible for MPH it is optimal

if not, a cut is added that impose at least a haplotype of those at zero

Cut generation is possible in poly-time, Col generation remains effective

(71)

beamer-tu-logo

SC experiments N g 50 N _s 100 N ∗ 25

Comments: Optimal sols to real-life instances never solved before. Up to 10 ¹⁰ haplotype variables (implicit).

Best exact approach so far Branch-and-Cut-and-Price

Lancia and Serafini,A set covering approach with column

generation for parsimony haplotyping,INFORMS J. on

Computing, 2009

(72)

beamer-tu-logo

5. Semidefinite programming

Still expon.: let possible haplotypes h

¹

, . . . , h

ⁿ

, where n = O(2

^N^s

). Instead of {0, 1}

vars use {−1, 1} vars:

x

i

:=

( +1 if haplotype h

ⁱ

is in solution

−1 otherwise.

Integer Quadratic Program

Number of selected haplotypes (to min):

n

X

i=1

(1 + x

_i

)

²

4 since

⁽¹⁺¹⁾₄ ²

= 1 and

⁽¹⁻¹⁾₄ ²

= 0.

Constraints, for each g X

(i,j):hⁱ+h^j=g

(1 + x

_i

)(1 + x

_j

)

4 ≥ 1 ∀g ∈ G

(73)

beamer-tu-logo

5. Semidefinite programming

Still expon.: let possible haplotypes h

¹

, . . . , h

ⁿ

, where n = O(2

^N^s

). Instead of {0, 1}

vars use {−1, 1} vars:

x

i

:=

( +1 if haplotype h

ⁱ

is in solution

−1 otherwise.

Integer Quadratic Program

Number of selected haplotypes (to min):

n

X

i=1

(1 + x

_i

)

²

4 since

⁽¹⁺¹⁾₄ ²

= 1 and

⁽¹⁻¹⁾₄ ²

= 0.

Constraints, for each g X

(i,j):hⁱ+h^j=g

(1 + x

_i

)(1 + x

_j

)

4 ≥ 1 ∀g ∈ G

(74)

beamer-tu-logo

5. Semidefinite programming

Relax each x

i

to an (n + 1)-dimensional unit vector y

i

and replace 1 with y

0

= (1, 0, . . . , 0)

min

n

X

i=1

(1 + x

i

)

²

4 7→ min

n

X

i=1

(y

0

+ y

i

)

²

4 X

hⁱ+h^j=g

(1 + x

i

)(1 + x

j

)

4 ≥ 1 7→ X

hⁱ+h^j=g

(y

0

+ y

i

)(y

0

+ y

j

)

4 ≥ 1 ∀g ∈ G

with |y

i

| = 1 i = 1, . . . , n

Let Y = (y

0

y

1

. . . y

n

)

^T

(y

0

y

1

. . . y

n

)

Y is positive semidefinite, and the problem can be solved via semidefinite

programming.

(75)

beamer-tu-logo

5. Semidefinite programming

Relax each x

i

to an (n + 1)-dimensional unit vector y

i

and replace 1 with y

0

= (1, 0, . . . , 0)

min

n

X

i=1

(1 + x

i

)

²

4 7→ min

n

X

i=1

(y

0

+ y

i

)

²

4 X

hⁱ+h^j=g

(1 + x

i

)(1 + x

j

)

4 ≥ 1 7→ X

hⁱ+h^j=g

(y

0

+ y

i

)(y

0

+ y

j

)

4 ≥ 1 ∀g ∈ G

with |y

i

| = 1 i = 1, . . . , n

Let Y = (y

0

y

1

. . . y

n

)

^T

(y

0

y

1

. . . y

n

)

Y is positive semidefinite, and the problem can be solved via semidefinite

programming.

(76)

beamer-tu-logo

SDP experiments N _g ≤ 20 N _s ≤ 20 N ∗ small

Comments: worse than RTIP, but probably better than Poly-IP

Semidefinite Progr.

Kalpakis and Namjoshi,Haplotype phasing using semidefinite

programming,IEEE Symp. on Bioinf. and Bioeng., 2005

(77)

beamer-tu-logo

6. Boolean/SAT approaches

For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”

can be expressed in the bit vars It is a SAT problem

Solved by SAT solvers via constraint programming Iterate on values of V

SAT approach

Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006

SAT experiments N

g

50 N

s

100 N

∗

' 15

Comments: minisat solver. Good performance, better than RTIP

(78)

beamer-tu-logo

6. Boolean/SAT approaches

For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”

can be expressed in the bit vars It is a SAT problem

Solved by SAT solvers via constraint programming Iterate on values of V

SAT approach

Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006

SAT experiments N

g

50 N

s

100 N

∗

' 15

Comments: minisat solver. Good performance, better than RTIP

(79)

beamer-tu-logo

6. Boolean/SAT approaches

For a fixed V , the predicate “There exist V haplotypes that resolve all genotypes”

can be expressed in the bit vars It is a SAT problem

Solved by SAT solvers via constraint programming Iterate on values of V

SAT approach

Lynce and Marques-Silva,SAT in Bioinformatics: Making the case with haplotype inference,Lecture Notes in Computer Science, 2006

SAT experiments N

g

50 N

s

100 N

∗

' 15

Comments: minisat solver. Good performance, better than RTIP

(80)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(81)

beamer-tu-logo

LP deterministic rounding

Theorem (LPN04)

Let ˆ x be optimal LP-sol of RTIP. Round ˆ x to z as

z _i :=

( 1 if ˆ x _i ≥ ₂

_N∗−1

¹ 0 otherwise.

Then z is a 2 ^N

^∗

⁻¹ -approximate solution.

This is a constant factor approx when N _∗ bounded by constant.

(82)

beamer-tu-logo

QP randomized rounding

A (log N _g )-approximation Uses SDP formulation

1

solve SDP and retrieve norm-1 vectors y (Cholesky decomposition of Y )

2

from vectors y retrieve integer vars for each g (used haplotypes) by

randomization (randomized rounding through projection of random unit vectors on y )

3

if all genotypes solved ok, else goto 1.

Fast on small instances (but not polynomial, size of SDP exponential)

Approx. algo

Huang, Chao and Chen,An approximation algorithm for

haplotype inference by maximum parsimony,ACM Symp. on

Appl. Computing, 2005

(83)

beamer-tu-logo

Outline

1 Combinatorial approaches to a problem

2 Our problem: math and bio

3 Complexity

4 Exact algorithms

5 Approximation algorithms

6 Heuristics

(84)

beamer-tu-logo

The CollHAPS heuristic

Symbolic haplotypes

A “haplotype” with bits and variables (negated or not).

e.g. (0 1 x ₃ 0 0 ¯ x ₅ 0 1)

Two symbolic haplotypes can be collapsed if some variables can be assigned so that they become identical:

0 1 x ₃ 0 0 x ¯ 5 0 1

0 1 x ₆ 0 x ₉ 1 0 1

become

0 1 x ₃ 0 0 1 0 1

by assigning x ₃ = x ₆ , x ₉ = 0 and x ₅ = 0.

(85)

beamer-tu-logo

The CollHAPS heuristic

Symbolic haplotypes

A “haplotype” with bits and variables (negated or not).

e.g. (0 1 x ₃ 0 0 ¯ x ₅ 0 1)

Two symbolic haplotypes can be collapsed if some variables can be assigned so that they become identical:

0 1 x ₃ 0 0 x ¯ 5 0 1 0 1 x ₆ 0 x ₉ 1 0 1 become

0 1 x ₃ 0 0 1 0 1

by assigning x ₃ = x ₆ , x ₉ = 0 and x ₅ = 0.

(86)

beamer-tu-logo

Heuristic CollHAPS

Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

x

5

x

6

x

7

1 x ¯

5

x ¯

6

x ¯

7

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 3”







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

0 x

1

x

2

1 1 x ¯

1

x ¯

2

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 2”







0 1 0 1

0 0 1 1

0 1 0 1

1 1 0 0

0 1 0 1

1 0 1 1





 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

Key points

One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”

Two: efficient update of distances

(87)

beamer-tu-logo

Heuristic CollHAPS

Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

x

5

x

6

x

7

1 x ¯

5

x ¯

6

x ¯

7

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 3”







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

0 x

1

x

2

1 1 x ¯

1

x ¯

2

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 2”







0 1 0 1

0 0 1 1

0 1 0 1

1 1 0 0

0 1 0 1

1 0 1 1





 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

Key points

One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”

Two: efficient update of distances

(88)

beamer-tu-logo

Heuristic CollHAPS

Start from matrix w/bit vars of Poly-IP. Collapse two rows that fix “few” variables to constants. Iterate until no collapse is possible. Perform reoptimization. Repeat.







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

x

5

x

6

x

7

1 x ¯

5

x ¯

6

x ¯

7

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 3”







0 x

1

x

2

1 0 x ¯

1

x ¯

2

1 x

3

1 0 x

4

x ¯

3

1 0 x ¯

4

0 x

1

x

2

1 1 x ¯

1

x ¯

2

1 



 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

collapse 1’ and 2”







0 1 0 1

0 0 1 1

0 1 0 1

1 1 0 0

0 1 0 1

1 0 1 1





 1

⁰

1

⁰⁰

2

⁰

2

⁰⁰

3

⁰

3

⁰⁰

Key points

One: Sophisticated choice of candidate pair for collapse, via a randomized rule depending on “distance”

Two: efficient update of distances

(89)

beamer-tu-logo

The CollHAPS heuristic

CollHAPS experiments N _g 1000

N _s 150 N ∗ 100

Comments: Runs in secs to mins. Optimal sols for 95% of instances on which OPT is known. Best known heuristic.

CollHAPS heuristic

Tininini, Bertolazzi, Godi, and Lancia,CollHaps: a heuristic

approach to haplotype inference by parsimony,IEEE/ACM

Trans. on Comp. Biology and Bioinf., 2009