• Non ci sono risultati.

A polymorphism is a feature

N/A
N/A
Protected

Academic year: 2021

Condividi "A polymorphism is a feature"

Copied!
227
0
0

Testo completo

(1)

Optimization Problems for Optimization Problems for

Polymorphisms of Single Nucleotides

Polymorphisms of Single Nucleotides

(2)

Polymorphisms Polymorphisms

A polymorphism is a feature

(3)

Polymorphisms Polymorphisms

A polymorphism is a feature

- common to everybody

(4)

Polymorphisms Polymorphisms

A polymorphism is a feature - common to everybody

- not identical in everybody

(5)

Polymorphisms Polymorphisms

A polymorphism is a feature - common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

(6)

Polymorphisms Polymorphisms

E.g. think of eye-color eye-color

A polymorphism is a feature - common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

(7)

Polymorphisms Polymorphisms

A polymorphism is a feature - common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

E.g. think of eye-color eye-color

Or blood-type for a feature not visible from outside blood-type

(8)

At DNA level, a polymorphism is a sequence of nucleotides

varying in a population.

(9)

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

S S ingle N N ucleotide P P olymorphism (SNP)

(10)

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

S S ingle N N ucleotide P P olymorphism (SNP)

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

(11)

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

S S ingle N N ucleotide P P olymorphism (SNP)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

(12)

- SNPs are predominant form of human variations

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

- Used for drug design, study disease, forensic, evolutionary...

- On average one every 1,000 bases

(13)

- Multimillion dollar SNP consortium project

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

- Goal: associate SNPs (or group of SNPs) to genetic diseases

- 1st step: build maps of several thousand SNPs

(14)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes

(15)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes

(16)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

(17)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

(18)

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

HAPLOTYPE

HAPLOTYPE : chromosome content at SNP sites

(19)

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

t

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

atcgg

a

ttagttagggcacaggacg

t

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

c

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

g

ac

HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

HAPLOTYPE

HAPLOTYPE : chromosome content at SNP sites

atcgg

a

ttagttagggcacaggacg

g

ac

atcgg

a

ttagttagggcacaggacg

t

ac

(20)

ag

at

ct ag

ct

cg

at at ag

cg

ag cg ag

ag HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

HAPLOTYPE

HAPLOTYPE : chromosome content at SNP sites

(21)

ag

at

ct ag

ct

cg

at at ag

cg

ag cg ag

ag HOMOZYGOUS

HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS

HETEROZYGOUS: different alleles

HAPLOTYPE

HAPLOTYPE : chromosome content at SNP sites

GENOTYPE

GENOTYPE : “union” of 2 haplotypes

O

c

E

EE

O

a

O

g

O

a

E

O

a

O

t

EO

g

O

g

E

(22)

ag

at

ct ag

ct

cg

at at ag

cg

ag cg ag

ag

O

c

E

EE

O

a

O

g

O

a

E

O

a

O

t

EO

g

O

g

E CHANGE OF SYMBOLS

CHANGE OF SYMBOLS

: each SNP only two values in a poplulation (bio).

Call them

1

and

O

. Also, call

*

the fact that a site is heterozygous

HAPLOTYPE

HAPLOTYPE: string over 1,O GENOTYPE

GENOTYPE: string over 1,O,*

(23)

1o

11

o1 1o

o1

oo

11 11 1o

oo

1o oo 1o

1o

o*

**

*o 1*

11

*o

*o CHANGE OF SYMBOLS

CHANGE OF SYMBOLS

: each SNP only two values in a poplulation (bio).

Call them

1

and

O

. Also, call

*

the fact that a site is heterozygous

HAPLOTYPE

HAPLOTYPE: string over 1,O GENOTYPE

GENOTYPE: string over 1,O,*

(24)

THE HAPLOTYPING PROBLEM THE HAPLOTYPING PROBLEM

Single Individual

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population

Population : Given genomic data of k individuals, determine

(at most) 2k haplotypes (one per chromosome/indiv.)

For the individual problem, input is erroneous haplotype data, from sequencing For the population problem, data is ambiguous genotype data, from screening

OBJ is lead by Occam’s razor: find minimum explanation of observed data under given hypothesis (a.k.a. parsimony principle)

(25)

Theory and Results Theory and Results

- Polynomial Algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)

- Polynomial Algorithms for bounded-length gapped haplotyping

(BLIR 02)

Single individual

- NP-hardness for general gapped haplotyping (LBILS 01)

- APX-hardness (Gusfield 00)

- Reduction to Graph-Theoretic model and I.P. approach (Gusfield 01)

Population

-New formulations and Disease Detection (L, Ravi, Rizzi, 02)

- Exact algorithms for min-size solution (L,Serafini 2011) - Heuristics (Tininini, L, Bertolazzi 2010)

(26)

The Single-Individual The Single-Individual

Haplotyping problem

Haplotyping problem

(27)

TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA

TAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation

sequencing

assembly

Shotgun Assembly of a Chromosome

[ Webber and Myers, 1997]

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

(28)

-Sequencing errors:

ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT

CATTGGAAC

AATGGAACGGA -Contaminants

MAIN ERROR SOURCES

MAIN ERROR SOURCES

(29)

Given errors, the data may be errors inconsistent

inconsistent with exactly 2 haplotypes

PROBLEM

PROBLEM : Find and remove the : Find and remove the errors so that the data becomes errors so that the data becomes

consistent with exactly 2 consistent with exactly 2

haplotypes haplotypes

Hence, assembler is unable to Hence, assembler is unable to

build 2 chromosomes

build 2 chromosomes

(30)

ACTGAAAGCGA ACTAGAGACAGCATG

ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC

AGCATG

ACTGAAAGCGA ACTAGAGACAGCATG

ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC

AGCATG

1 1 O O O 1 1 1 1 1 O

The data: a SNP matrix

(31)

Snips 1,..,n

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -

Fragments 1,..,m

(32)

Snips 1,..,n

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -

Fragments 1,..,m

Fragment conflict: can’t be on same haplotype

(33)

Snips 1,..,n

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X

X

O X X - - - - 4 O

O

X - - - - O - 5 - - - X O 6 - - - - O O O X -

Fragments 1,..,m

Fragment conflict: can’t be on same haplotype 1

6 2

3

4

5

Fragment Conflict Graph GF(M)

We have 2 haplotypes iff GF is

BIPARTITE

(34)

Snips 1,..,n

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X

X

O X X - - - - 4 O

O

X - - - - O - 5 - - - X O 6 - - - - O O O X -

Fragments 1,..,m

1

6 2

3

4

5

PROBLEM (Fragment Removal): make GF Bipartite

(35)

Snips 1,..,n

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -

Fragments 1,..,m

PROBLEM (Fragment Removal): make GF Bipartite 1

6 2

3

4

5

1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 4 O O X - - - - O -

3 X X O X X - - - - 5 - - - X O O O X O X X O O X

X X O X X - - X O

(36)

Removing fewest fragments is equivalent to maximum induced bipartite subgraph

NP-complete

[Yannakakis, 1978a, 1978b; Lewis, 1978]

O(|V|(log log |V|/log |V|)

2

)-approximable

[Halldórsson, 1999]

not O(|V|

)-approximable for some 

[Lund and Yannakakis, 1993]

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---OXXOO---OXOOX--- gap

---OXXOOXOXOXOOX--- gapless

---OXX--XO----OX--- 2 gaps

(37)

Why gaps?

Sequencing errors (don’t call with low confidence)

---OOXX?XX--- ===> ---OOXX-XX--- Celera’s mate pairs

attcgttgtagtggtagcctaaatgtcggtagaccttga

attcgttgtagtggtagcctaaatgtcggtagaccttga

(38)

THEOREM

For a gapless M, the Min Fragment Removal Problem is Polynomial

NOTE: Does not need to be gapless. Enough if it can be NOTE sorted to become such

(Consecutive Ones Property, Booth and Lueker, 1976)

(39)

An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo

3

1 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O

(40)

An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo

3

1 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O

LFT(i) RGT(i)

sort according to LFT

(41)

An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo

3

1 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O

LFT(i) RGT(i)

D(i;h,k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

sort according to LFT

D(i; h,k) =

D(i-1; h,k) if i, k compatible and RGT(i) <= RGT(k) or i, h compatible and RGT(i) <= RGT(h) 1 + D(i-1; h, k) otherwise

{

OPT is min h,k D( n; h, k ) and can be found in time O(nm + n^3)

(42)

Th: NP-Hard if 2 gaps per fragment

proof: (simple) use fact that for every G there is M s.t.

G = GF(M) and reduce from Max Bip. Induced Subgraph on 3-regular graphs

Th : NP-Hard if even 1 gap per fragment

proof: technical. reduction from MAX2SAT

WITH GAPS…..

WITH GAPS…..

But, gaps must be long for problem to be difficult.

We have O( 2 mn + 2 n ) D.P.

for MFR on matrix with total gaps length L

2L 3L 3

(43)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

(44)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

(45)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

1

5 2

4 3

1

5 2

4 3

(46)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

1

5 2

4 3

1

5 2

4 3

5/12 5/12

(47)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

1

5 2

4 3

1

5 2

4 3

5/12 5/12

(48)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

1

5 2

4 3

1

5 2

4 3

5/12 5/12

(49)

What for MFR with gaps? Why not ILP...

What for MFR with gaps? Why not ILP...

min  x

f

 x

f >= 1 for all odd cycles Cf f\in C

x

\in {0,1}^n

1

5 2

4 3

1/2

1/3

1/2 1/4 0

1

5 2

4 3

1

5 2

4 3

5/12 5/12

Randomized rounding heuristic: round and repeat. Worked well at Celera

(50)

The fragment removal is good to get rid of contaminants.

However, we may want to keep all fragments and correct errors otherwise

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes

All fragments get assigned to one of the two haplotypes.

We describe the min SNP removal problem: remove the

fewest number of columns from M so that the fragment

graph becomes bipartite.

(51)

- - - O X X O O -

- O X O X - - - X

X X

O X X - - - -

O O X - - - O O -

- - - X X O

- - - - O O O X -

SNP conflicts

(52)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

OK

(53)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

OK

(54)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

OK

(55)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

CONFLICT !

(56)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

CONFLICT !

(57)

- - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

SNP conflict graph GS(M)

1 node for each SNP (column)

edge between conflicting SNPs

(58)

1 2 3 4 5 6 7 8 9

- - - O X X O O -

- O X O X - - - X

X X

O X X - - - -

O O X - - - O O -

- - - X X O

- - - - O O O X -

SNP conflicts

(59)

1 2 3 4 5 6 7 8 9 - - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

1

6 2

3

4

5

8

9

7

(60)

1 2 3 4 5 6 7 8 9 - - - O X X O O - - O X O X - - - X X X

O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts

1

6 2

3

4

5

8

9

7

(61)

THEOREM 1

For a gapless M, GF(M) is bipartite

if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal

problem is polynomial

(62)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--OOXXOO--- ----OOXOOXOXXO--- ---XXOXOXXX- ----XXOOXOXXO---- ---XOOOX--- ---XXXXXO--- --XXOXXOXOO---

Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.

Take an odd cycle in GF

(63)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---

There is a generic structure of hor-vert cycle

(64)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---

“vertical lines”

There cannot be only one vertical line in odd cycle We merge rightmost and next to reduce them by 1

Hence, there cannot be a minimal (in n. of vertical lines) counterexample

(65)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---

“vertical lines”

Must be X

(66)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?X???--- ----O?????X??O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---

“vertical lines”

Must be X

Merge the rightmost lines

(67)

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?X???--- ----O?????X--- ---??O--- ----??????X--- ---???O--- ---????X--- --X???????O---

“vertical lines”

Still a counterexample!

Merge the rightmost lines

(68)

1 2 3 1 O - O 2 - O X 3 X X

-

Note: Theorem not true if there are gaps

1

2 3

1

2 3

GF(M) GS(M)

M

(69)

THEOREM 2

For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A Comparability graphs are perfect

Comparability Graphs: unoriented that can be oriented to become a partial order

(70)

LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,k) or (j,k) is also a SNP conflict

i j k - X O O ? X O X - - O X O ? X X X -

Equal:conflicts with i

O

O Different:conflicts with k

O X

i j k

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, also (i,k) is not a conflict

So (u,v) with u < v and u not a conflict with v is a comparability graph A and GS is A complement

NOTE: ind set on perfect graph is in P (Lovasz, Schrijvers, Groetschel, 84)

(71)

THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)

Again, gaps must be long for problem to be difficult.

We have O(mn + n ) D.P.

for MSR on matrix with total gaps length L

2L + 1 2L + 2

Hence gapless MSR is polynomial (max stable set on perfect graph).

There are better, D.P., algorithms, O(mn + m^2)

What if gaps ?

(72)

The Population The Population

Haplotyping problem

Haplotyping problem

(73)

The input is GENOTYPE data

oooxx

xxoxx

?x??x

????x xx??x

INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }

(74)

The input is GENOTYPE data

xxoxx xxxox

oooxx

oooxx xxxox

xxoxx oxxox

xxoxx xxoxx oooxx

oooxx

xxoxx

?x??x

????x xx??x

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }

Each genotype is explained by two haplotypes We will define some objectives for H

(75)

1st Objective

1st Objective (open research problem):

minimize |H|

2nd Objective

2nd Objective based on inference rule:

(76)

1st Objective (parsimony) 1st Objective (parsimony) :

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)

(77)

2nd Objective

2nd Objective based on inference rule:

(78)

xoxxooxoxx +

********** = x??xoox?x?

known haplotype

h

known (ambiguos) genotype

g

Inference Rule

Inference Rule

(79)

xoxxooxoxx + xxoxooxxxo = x??xoox?x?

known haplotype

h

known (ambiguos) genotype

g

new (derived) haplotype

h’

Inference Rule

Inference Rule

(80)

xoxxooxoxx + xxoxooxxxo = x??xoox?x?

known haplotype

h

known (ambiguos) genotype

g

new (derived) haplotype

h’

We write

h + h’ = g

g

and

h

must be

compatible

to derive

h’

Inference Rule

Inference Rule

(81)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

(82)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

(83)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

(84)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

xxoo

(85)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

xxoo xxxx

SUCCESS

(86)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

(87)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

oxoo

(88)

2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)

1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G

3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}

5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic

oooo xooo

??oo xx??

oxoo FAILURE (can’t resolve xx?? )

OBJ: find order of application rule that leaves the fewest elements in G OBJ: find order of application rule that leaves the fewest elements in G

(89)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

(90)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

1. expand genotypes

(91)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

xxxox xxxoo xxoox xxooo xoxox xooox xoxoo

xoooo

1. expand genotypes

(92)

- Problem is APX-hard (Gusfield,00)

-

Graph-Model + Integer Programming for practical solution (G.,01)

x??o?

xxxox xxxoo xxoox xxooo xoxox xooox xoxoo

xoooo

2. create (h, h’) if exists g s.t. h’ can be derived from g and h

1. expand

genotypes 3. Largest number of nodes in forest rooted at unambiguos genotpes =

= largest number of ambiguous genotypes resolved

Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

(93)

3rd Objective

3rd Objective (open research problem) Disease Detection:

oooxx

??oxx

?x??x

????x xx??x

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

(94)

3rd Objective

3rd Objective (open research problem) Disease Detection:

xxoxx xxxox

oooxx

oooxx xxxox

xxoxx oxxox

xxoxx oooxx oooxx

oooxx

??oxx

?x??x

????x xx??x

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased has one haplotype in H’ and each healty none

minimize | H|

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

(95)

Genome Rearrangements and Genome Rearrangements and

Evolutionary Distances

Evolutionary Distances

(96)

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………ttgatggattga

tcgattatggat………ttttgatatcca

Genomes evolve by means of

•Insertions

•Deletions

•Inversions

•Transpositions

•Translocations of DNA regions

(97)
(98)

deletion

(99)

deletion insertion

(100)

deletion insertion

translocation

(101)

deletion insertion

translocation

inversion

(102)

deletion insertion

translocation

inversion

transposition

(103)

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)

5 6 4 8 3 2 1 9 7

Example:

We focus on inversions, that are the most important in Nature

1 2 3 8 4 6 5 9 7

1 2 3 8 4 5 6 9 7

1 2 3 6 5 4 8 9 7

1 2 3 6 5 4 8 7 9

1 2 3 4 5 6 8 7 9

1 2 3 4 5 6 7 8 9

(104)

Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transposition…)

+5 +6 -4 -8 -3 -2 -1 -9 +7

Example:

We focus on inversions, that are the most important in Nature

+1 +2 +3 +8 +4 -6 -5 -9 +7 +1 +2 +3 +8 +4 +5 +6 -9 +7 +1 +2 +3 -6 -5 -4 -8 -9 +7 +1 +2 +3 -6 -5 -4 -8 -7 +9 +1 +2 +3 +4 +5 +6 -8 -7 +9 +1 +2 +3 +4 +5 +6 +7 +8 +9

There is also a SIGNED VERSION of the problem !

(105)

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

(106)

5 7 8 2 1 4 3 6 9

The concept of breakpoint

reakpoint at position i if(i) - (i+1) | > 1

0 10

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

(107)

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

5 7 8 2 1 4 3 6 9

The concept of breakpoint

reakpoint at position i if(i) - (i+1) | > 1

0 10

d() = inversion distance b() = # breakpoints

TRIVIAL BOUND: d() >= b() / 2 Example: d() >= 6 / 2 = 3

(108)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(109)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(110)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(111)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

(112)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

10 4 6

Each node has degree...

0 2 or 4 …

hence the graph can be decomposed in cycles!

(113)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

Alternating cycle decomposition

(114)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

Alternating cycle decomposition

(115)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10

Alternating cycle decomposition

c() = max # cycles in alternating decomposition

VERY STRONG BOUND : d () >= b() - c()

Example: c()= 2 and d () >= 6 - 2 = 4

(116)

The Breakpoint Graph Breakpoint Graph

5 7 8 2 1 4 3 6 9 0

10 The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition

A variable

x

C for each cycle (exponential # of vars…) A constraint

x

C = 1 for each edge e Objective: maximize

C

x

C

C containing e

(117)

max  x

C

 x

C = 1 for all edges eC C\ni e

x

C \in {0,1} for all alt. cycles C PRIMAL

min  y

e

 y

e <= 1 for all alt. Cycles Ce e\in C

y

e \in R for all edges e DUAL

(118)

max  x

C

 x

C = 1 for all edges eC C\ni e

x

C \in {0,1} for all alt. cycles C PRIMAL

min  y

e

 y

e <= 1 for all alt. Cycles Ce e\in C

y

e \in R for all edges e DUAL

(119)

5 7 8 2 1 4 3 6 9 0

10

Pricing out the cycles for which y*(C) < 1 Pricing out the cycles for which y*(C) < 1

(120)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Split the graph in two copies Split the graph in two copies

(121)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Connect twins Connect twins

(122)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles

(123)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles

(124)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles

(125)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles

(126)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles

(127)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

The weight of the matching is the y*-weight of the cycles The weight of the matching is the y*-weight of the cycles

.2 .4

.5 1

.6

0

(128)

5 7 8 2 1 4 3 6 9 0

10

5 7 8 2 1 4 3 6 9 0

10

Forcing a cycle to use a certain node Forcing a cycle to use a certain node

.2 .4

.5 1

.6

100000

(129)

- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Take days for n=50

CLN can solve for n=200. Takes few seconds (say 5) for n=100 NP-hard problem practically solved to optimality!

Riferimenti

Documenti correlati

In the present study, we adopted four different approaches (implemented in the computer programs ARLEQUIN, HELIxTREE, HAP and PHASE) to infer phase information

 “A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

 The DataFrameReader class (the same we used for reading a json file and store it in a DataFrame) provides other methods to read many standard (textual) formats and read data

 The DataFrameReader class (the same we used for reading a json file and store it in a DataFrame) provides other methods to read many standard (textual) formats and read data

In a sample of 1,111 publicly listed US firms, we find that the likelihood of CEO dismissal due to poor financial performance, as measured by firm stock returns, increases in

Recently Citigroup announced a series of “Zoom Free” Fridays to encourage workers to get away from their screens and LinkedIn said it was giving its employees a paid week off to

The results of the densitometric analysis (ImageJ software) of OsFK1 or OsFK2 western blot bands of embryo, coleoptile and root from seedlings sowed under anoxia or under

The PeakSum (PS) system is the stage of the neutral trigger pipeline chain where the total energy and the weighted energy sums (moments) are computed from the individual strip