
(1)

Optimization Problems for Polymorphisms of Single Nucleotides

(2)

Polymorphisms

A polymorphism is a feature

- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few

E.g. think of eye color, or of blood type for a feature not visible from outside

(8)

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

(9)

The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)

(10)

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac

(11)

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
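The four sequences above differ in only two columns. A minimal sketch of how such SNP sites could be located in already aligned sequences follows; the function name `snp_sites` and the use of 0-based positions are assumptions made for illustration, not part of the slides.

```python
def snp_sites(sequences):
    """Return the 0-based columns where aligned sequences are not all identical."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and of equal length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


# The four variants shown on the slide.
population = [
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacggac",
]

sites = snp_sites(population)                                 # [5, 25]
reduced = ["".join(s[i] for i in sites) for s in population]  # ['ag', 'at', 'ct', 'cg']
print(sites, reduced)
```

Keeping only the variable columns reduces each sequence to a two-letter string.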

(12)

- SNPs are the predominant form of human variation

- Used in drug design, disease studies, forensics, evolutionary studies...

- On average, one every 1,000 bases

(13)

- Multimillion-dollar SNP Consortium project

- Goal: associate SNPs (or groups of SNPs) with genetic diseases

- 1st step: build maps of several thousand SNPs

(14)

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

(17)

(18)

HAPLOTYPE: chromosome content at SNP sites

(19)

atcggattagttagggcacaggacgt

HAPLOTYPE

(20)

HAPLOTYPE

[Figure: the haplotypes in the population, two per individual: ag, at, ct, ag, ct, cg, at, at, ag, cg, ag, cg, ag, ag]

(21)

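GENOTYPE: “union” of 2 haplotypes

[Figure: the same haplotypes, two per individual, with the resulting genotypes O^c E, E E, O^a O^g, O^a E, O^a O^t, E O^g, O^g E, where O^x marks a site homozygous for allele x and E marks a heterozygous site]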

(22)

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact).

Call them 1 and O. Also, write * for a site that is heterozygous

HAPLOTYPE: string over {1, O}

GENOTYPE: string over {1, O, *}

(23)

[Figure: the population in the new symbols]

Haplotypes: 1O 11 O1 1O O1 OO 11 11 1O OO 1O OO 1O 1O

Genotypes: O* ** *O 1* 11 *O
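A minimal sketch of the "union" of two haplotypes in the 1/O/* symbols, together with the reverse question of which haplotype pairs can explain a given genotype. The function names `genotype` and `explanations` are hypothetical, chosen for illustration only.

```python
from itertools import product


def genotype(h1, h2):
    """Combine two haplotypes over {1, O} into a genotype over {1, O, *}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


def explanations(g):
    """Enumerate the unordered haplotype pairs whose genotype is g."""
    star = [i for i, c in enumerate(g) if c == "*"]
    pairs = set()
    for bits in product("1O", repeat=len(star)):
        h1, h2 = list(g), list(g)
        for i, b in zip(star, bits):
            h1[i] = b                          # one chromosome gets b ...
            h2[i] = "O" if b == "1" else "1"   # ... the other the opposite value
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs


print(genotype("O1", "OO"))      # O*
print(genotype("1O", "O1"))      # **
print(len(explanations("**")))   # 2: {11, OO} and {1O, O1}
```

A genotype with at most one heterozygous site has a single explanation. With more heterozygous sites the number of candidate pairs doubles with each one, which is why genotype data is ambiguous.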

(24)

THE HAPLOTYPING PROBLEM

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual)

For the individual problem, the input is erroneous haplotype data, from sequencing

For the population problem, the input is ambiguous genotype data, from screening

The objective is led by Occam’s razor: find a minimum explanation of the observed data under the given hypothesis (a.k.a. parsimony principle)

(25)

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)
- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)
- NP-hardness for general gapped haplotyping (LBILS 01)

Population

- APX-hardness (Gusfield 00)
- Reduction to graph-theoretic model and I.P. approach (Gusfield 01)
- New formulations and disease detection (L, Ravi, Rizzi 02)
- Exact algorithms for min-size solution (L, Serafini 2011)
- Heuristics (Tininini, L, Bertolazzi 2010)

(26)

The Single-Individual Haplotyping Problem

(27)

Shotgun Assembly of a Chromosome [Weber and Myers, 1997]

[Figure: copies of the chromosome are broken into fragments (fragmentation), the fragments are read (sequencing), and the overlapping reads are merged back into a single sequence (assembly)]

(28)

MAIN ERROR SOURCES

- Sequencing errors:

ACTGCCTGGCCAATGGAACGGACAAG
     CTGGCCAAT
          CATTGGAAC
           AATGGAACGGA

- Contaminants

(29)

Given errors, the data may be inconsistent with exactly 2 haplotypes

Hence, the assembler is unable to build 2 chromosomes

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes

(30)

The data: a SNP matrix

[Figure: sequenced fragments (ACTGATAGC, GTAGAGTCA, ACTG, TCGACTAGA, ...) laid out along the chromosome; each fragment is recorded only by its values at the SNP sites, written as 1 and O]

(31)

SNPs 1,..,n (columns), Fragments 1,..,m (rows)

     1 2 3 4 5 6 7 8 9
1    - - - O 1 1 O O -
2    - O - O 1 - - - 1
3    1 1 O 1 1 - - - -
4    O O 1 - - - - O -
5    - - - 1 O
6    - - - - O O O 1 -

(32)

Fragment conflict: two fragments that take different values at a SNP they both cover can’t be on the same haplotype

(33)

Fragment Conflict Graph GF(M): one node per fragment, one edge per pair of conflicting fragments

We have 2 haplotypes iff GF is BIPARTITE

[Figure: GF(M) drawn on the fragments 1,..,6 of the matrix M]
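A minimal sketch of the fragment conflict graph and of the bipartiteness test by breadth-first 2-coloring. The two small matrices are hypothetical examples (not the matrix of the slide), and the function names are assumptions made for illustration.

```python
from collections import deque


def conflict(f, g):
    """Two fragments conflict if they differ at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def conflict_graph(rows):
    """Adjacency sets of the conflict graph: node i is fragment rows[i]."""
    adj = {i: set() for i in range(len(rows))}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if conflict(rows[i], rows[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def two_coloring(adj):
    """Return a node -> 0/1 map if the graph is bipartite, else None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # an odd cycle: no split into 2 haplotypes
    return color


# Hypothetical error-free data: fragments 0, 1 come from one haplotype, 2, 3 from the other.
m_ok = ["11O-", "-1OO", "OO1-", "-O11"]
# Same data plus one fragment that conflicts with fragments from both groups.
m_bad = m_ok + ["1O1-"]

print(two_coloring(conflict_graph(m_ok)))   # a valid 2-coloring
print(two_coloring(conflict_graph(m_bad)))  # None
```

When a coloring exists, the two color classes are the two groups of fragments, and each group can be merged column by column into one haplotype.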

(34)

PROBLEM (Fragment Removal): remove the fewest fragments so that GF becomes bipartite

(35)

[Example: fragment 6 is removed; the remaining fragments split into two groups, and each group yields one haplotype (last row of each group)]

     1 2 3 4 5 6 7 8 9
1    - - - O 1 1 O O -
2    - O - O 1 - - - 1
4    O O 1 - - - - O -
     O O 1 O 1 1 O O 1

3    1 1 O 1 1 - - - -
5    - - - 1 O
     1 1 O 1 1 - - 1 O
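For very small instances, Minimum Fragment Removal can be solved by exhaustive search, which makes the problem statement concrete. The sketch below is a brute-force baseline under that assumption, exponential in the number of fragments, and is not one of the algorithms from the talk. The matrix is a hypothetical example and the function names are illustrative.

```python
from itertools import combinations


def conflict(f, g):
    """Two fragments conflict if they differ at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_bipartite(rows):
    """2-color the conflict graph of the given fragments by depth-first search."""
    color = {}
    for start in range(len(rows)):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(rows)):
                if v == u or not conflict(rows[u], rows[v]):
                    continue
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def min_fragment_removal(rows):
    """Try removal sets of increasing size; the first that works is optimal."""
    m = len(rows)
    for k in range(m + 1):
        for removed in combinations(range(m), k):
            kept = [rows[i] for i in range(m) if i not in removed]
            if is_bipartite(kept):
                return set(removed)


# Hypothetical matrix: the last fragment conflicts with fragments of both groups.
matrix = ["11O-", "-1OO", "OO1-", "-O11", "1O1-"]
removed = min_fragment_removal(matrix)
print(len(removed))  # 1: removing a single fragment suffices here
```

An optimal removal set need not be unique, so only its size is meaningful as a result.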

(36)

Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph

NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]

O(|V|(log log |V|/log |V|)²)-approximable [Halldórsson, 1999]

not O(|V|^ε)-approximable for some ε [Lund and Yannakakis, 1993]

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---O11OO---O1OO1---    1 gap

---O11OO1O1O1OO1---    gapless

---O11--1O----O1---    2 gaps
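A minimal sketch of how the gaps of a fragment could be counted: leading and trailing `-` are ignored, and each maximal run of `-` between two called SNPs counts as one gap. The helper names `count_gaps` and `gap_length` are assumptions made for illustration.

```python
import re


def count_gaps(fragment):
    """Number of maximal runs of '-' strictly inside the fragment."""
    core = fragment.strip("-")
    return len(re.findall(r"-+", core))


def gap_length(fragment):
    """Total number of '-' symbols strictly inside the fragment."""
    return fragment.strip("-").count("-")


for row in ("---O11OO---O1OO1---", "---O11OO1O1O1OO1---", "---O11--1O----O1---"):
    print(row, count_gaps(row), gap_length(row))
# ---O11OO---O1OO1--- 1 3
# ---O11OO1O1O1OO1--- 0 0
# ---O11--1O----O1--- 2 6
```

A matrix is gapless when `count_gaps` is 0 for every row.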

(37)

Why gaps?

- Sequencing errors (don’t call with low confidence): ---OO11?11--- ===> ---OO11-11---

- Celera’s mate pairs (two reads from the opposite ends of one clone; the unread middle shows up as a gap)

[Figure: the sequence attcgttgtagtggtagcctaaatgtcggtagaccttga, shown twice]

(39)

THEOREM: For a gapless M, the Min Fragment Removal Problem is polynomial

NOTE: M does not need to be gapless. It is enough that its columns can be sorted so that it becomes gapless (Consecutive Ones Property, Booth and Lueker, 1976)

(40)

An O(nm + n³) D.P. algo

1    - O O 1 1 O O - -
2    - - 1 O 1 1 O - -
3    - - - 1 1 O - - -
4    - - - - O O 1 O -
5    - - - 1 O 1 O

LFT(i), RGT(i): leftmost and rightmost SNP covered by row i

Sort the rows according to LFT

D(i; h, k) := min cost to solve up to row i, with rows h, k not removed and put in different haplotypes, and maximizing RGT(h), RGT(k)

\[
D(i; h, k) =
\begin{cases}
D(i-1; h, k) & \text{if } i, k \text{ compatible and } RGT(i) \le RGT(k), \text{ or } i, h \text{ compatible and } RGT(i) \le RGT(h) \\
1 + D(i-1; h, k) & \text{otherwise}
\end{cases}
\]

OPT is min_{h,k} D(n; h, k) and can be found in time O(nm + n³)

(43)

WITH GAPS…

Th: NP-hard with 2 gaps per fragment

proof: (simple) use the fact that for every graph G there is a matrix M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (each row has at most 3 non-gap entries, hence at most 2 gaps)

(44)

WITH GAPS…

Th: NP-hard even with 1 gap per fragment

proof: technical, by reduction from MAX2SAT

But gaps must be long for the problem to be difficult.

We have an O(2^{2L} mn + 2^{3L} n^3) D.P. for MFR on a matrix with total gap length L

(46)

What for MFR with gaps? Why not ILP...

\[
\begin{aligned}
\min \; & \sum_{f \in F} x_f \\
& \sum_{f \in C} x_f \ge 1 \quad \text{for all odd cycles } C \\
& x \in \{0,1\}^{F}
\end{aligned}
\]

[Figure: a 5-cycle on nodes 1,..,5 carrying the fractional values 1/2, 1/3, 1/2, 1/4, 0; two further copies of the cycle; the values 5/12, 5/12]

Randomized rounding heuristic: round and repeat. Worked well at Celera

(53)

Fragment removal is good for getting rid of contaminants.

However, we may want to keep all fragments and correct the errors in another way.

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes.

All fragments get assigned to one of the two haplotypes.

We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.
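The column version of the problem can be illustrated the same way on a tiny instance. The sketch below is an exhaustive search under that assumption, exponential in the number of SNPs, and is not the method of the talk. The matrix is a hypothetical example and the function names are illustrative.

```python
from itertools import combinations


def conflict(f, g):
    """Two fragments conflict if they differ at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))


def is_bipartite(rows):
    """2-color the conflict graph of the given fragments by depth-first search."""
    color = {}
    for start in range(len(rows)):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(rows)):
                if v == u or not conflict(rows[u], rows[v]):
                    continue
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def min_snp_removal(rows):
    """Try column sets of increasing size; keep every fragment, drop only SNPs."""
    n = len(rows[0])
    for k in range(n + 1):
        for removed in combinations(range(n), k):
            reduced = ["".join(c for j, c in enumerate(r) if j not in removed)
                       for r in rows]
            if is_bipartite(reduced):
                return set(removed)


# Hypothetical matrix: the last fragment is miscalled at the first SNP.
matrix = ["11O-", "-1OO", "OO1-", "-O11", "1O1-"]
print(min_snp_removal(matrix))  # {0}: dropping the first column is enough
```

All five fragments are kept; once the first column is discarded, fragments 0 and 1 form one group and fragments 2, 3 and 4 the other.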

(54)

SNP conflicts