Optimization Problems for Optimization Problems for
Polymorphisms of Single Nucleotides
Polymorphisms of Single Nucleotides
Polymorphisms Polymorphisms
A polymorphism is a feature
Polymorphisms Polymorphisms
A polymorphism is a feature
- common to everybody
Polymorphisms Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
Polymorphisms Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
Polymorphisms Polymorphisms
E.g. think of eye-color eye-color
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
Polymorphisms Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of eye-color eye-color
Or blood-type for a feature not visible from outside blood-type
At DNA level, a polymorphism is a sequence of nucleotides
varying in a population.
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S S ingle N N ucleotide P P olymorphism (SNP)
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S S ingle N N ucleotide P P olymorphism (SNP)
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S S ingle N N ucleotide P P olymorphism (SNP)
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
ac- SNPs are predominant form of human variations
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
ac- Used for drug design, study disease, forensic, evolutionary...
- On average one every 1,000 bases
- Multimillion dollar SNP consortium project
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
ac- Goal: associate SNPs (or group of SNPs) to genetic diseases
- 1st step: build maps of several thousand SNPs
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
atcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
atcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acHOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
ac
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag HOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag HOMOZYGOUS
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS
HETEROZYGOUS: different alleles
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
GENOTYPE
GENOTYPE : “union” of 2 haplotypes
O
cE
EE
O
aO
gO
aE
O
aO
tEO
gO
gE
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
O
cE
EE
O
aO
gO
aE
O
aO
tEO
gO
gE CHANGE OF SYMBOLS
CHANGE OF SYMBOLS
: each SNP only two values in a poplulation (bio).Call them
1
andO
. Also, call*
the fact that a site is heterozygousHAPLOTYPE
HAPLOTYPE: string over 1,O GENOTYPE
GENOTYPE: string over 1,O,*
1o
11
o1 1o
o1
oo
11 11 1o
oo
1o oo 1o
1o
o*
**
*o 1*
11
*o
*o CHANGE OF SYMBOLS
CHANGE OF SYMBOLS
: each SNP only two values in a poplulation (bio).Call them
1
andO
. Also, call*
the fact that a site is heterozygousHAPLOTYPE
HAPLOTYPE: string over 1,O GENOTYPE
GENOTYPE: string over 1,O,*
THE HAPLOTYPING PROBLEM THE HAPLOTYPING PROBLEM
Single Individual
Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)
Population
Population : Given genomic data of k individuals, determine
(at most) 2k haplotypes (one per chromosome/indiv.)
For the individual problem, input is erroneous haplotype data, from sequencing For the population problem, data is ambiguous genotype data, from screening
OBJ is lead by Occam’s razor: find minimum explanation of observed data under given hypothesis (a.k.a. parsimony principle)
Theory and Results Theory and Results
- Polynomial Algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)
- Polynomial Algorithms for bounded-length gapped haplotyping
(BLIR 02)
Single individual
- NP-hardness for general gapped haplotyping (LBILS 01)
- APX-hardness (Gusfield 00)
- Reduction to Graph-Theoretic model and I.P. approach (Gusfield 01)
Population
-New formulations and Disease Detection (L, Ravi, Rizzi, 02)
- Exact algorithms for min-size solution (L,Serafini 2011) - Heuristics (Tininini, L, Bertolazzi 2010)
The Single-Individual The Single-Individual
Haplotyping problem
Haplotyping problem
TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA
TAGAGATTTC TCCTAAAGAT CGCATAGATA
fragmentation
sequencing
assembly
Shotgun Assembly of a Chromosome
[ Webber and Myers, 1997]
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
-Sequencing errors:
ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT
CATTGGAAC
AATGGAACGGA -Contaminants
MAIN ERROR SOURCES
MAIN ERROR SOURCES
Given errors, the data may be errors inconsistent
inconsistent with exactly 2 haplotypes
PROBLEM
PROBLEM : Find and remove the : Find and remove the errors so that the data becomes errors so that the data becomes
consistent with exactly 2 consistent with exactly 2
haplotypes haplotypes
Hence, assembler is unable to Hence, assembler is unable to
build 2 chromosomes
build 2 chromosomes
ACTGAAAGCGA ACTAGAGACAGCATG
ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC
AGCATG
ACTGAAAGCGA ACTAGAGACAGCATG
ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC
AGCATG
1 1 O O O 1 1 1 1 1 O
The data: a SNP matrix
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -
Fragments 1,..,m
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -
Fragments 1,..,m
Fragment conflict: can’t be on same haplotype
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X
XO X X - - - - 4 O
OX - - - - O - 5 - - - X O 6 - - - - O O O X -
Fragments 1,..,m
Fragment conflict: can’t be on same haplotype 1
6 2
3
4
5
Fragment Conflict Graph GF(M)
We have 2 haplotypes iff GF is
BIPARTITE
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X
XO X X - - - - 4 O
OX - - - - O - 5 - - - X O 6 - - - - O O O X -
Fragments 1,..,m
1
6 2
3
4
5
PROBLEM (Fragment Removal): make GF Bipartite
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 3 X X O X X - - - - 4 O O X - - - - O - 5 - - - X O 6 - - - - O O O X -
Fragments 1,..,m
PROBLEM (Fragment Removal): make GF Bipartite 1
6 2
3
4
5
1 2 3 4 5 6 7 8 9 1 - - - O X X O O - 2 - O - O X - - - X 4 O O X - - - - O -
3 X X O X X - - - - 5 - - - X O O O X O X X O O X
X X O X X - - X O
Removing fewest fragments is equivalent to maximum induced bipartite subgraph
NP-complete
[Yannakakis, 1978a, 1978b; Lewis, 1978]
O(|V|(log log |V|/log |V|)
2)-approximable
[Halldórsson, 1999]
not O(|V|
)-approximable for some
[Lund and Yannakakis, 1993]Are there cases of M for which GF(M) is easier?
YES: the gapless M
---OXXOO---OXOOX--- gap
---OXXOOXOXOXOOX--- gapless
---OXX--XO----OX--- 2 gaps
Why gaps?
Sequencing errors (don’t call with low confidence)
---OOXX?XX--- ===> ---OOXX-XX--- Celera’s mate pairs
attcgttgtagtggtagcctaaatgtcggtagaccttga
attcgttgtagtggtagcctaaatgtcggtagaccttga
THEOREM
For a gapless M, the Min Fragment Removal Problem is Polynomial
NOTE: Does not need to be gapless. Enough if it can be NOTE sorted to become such
(Consecutive Ones Property, Booth and Lueker, 1976)
An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo
31 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O
An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo
31 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O
LFT(i) RGT(i)
sort according to LFT
An O(nm + n ) D.P. algo An O(nm + n ) D.P. algo
31 - O O X X O O - - 2 - - X O X X O - - 3 - - - X X O - - - 4 - - - - O O X O - 5 - - - X O X O
LFT(i) RGT(i)
D(i;h,k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)
sort according to LFT
D(i; h,k) =
D(i-1; h,k) if i, k compatible and RGT(i) <= RGT(k) or i, h compatible and RGT(i) <= RGT(h) 1 + D(i-1; h, k) otherwise
{
OPT is min h,k D( n; h, k ) and can be found in time O(nm + n^3)
Th: NP-Hard if 2 gaps per fragment
proof: (simple) use fact that for every G there is M s.t.
G = GF(M) and reduce from Max Bip. Induced Subgraph on 3-regular graphs
Th : NP-Hard if even 1 gap per fragment
proof: technical. reduction from MAX2SAT
WITH GAPS…..
WITH GAPS…..
But, gaps must be long for problem to be difficult.
We have O( 2 mn + 2 n ) D.P.
for MFR on matrix with total gaps length L
2L 3L 3
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^nWhat for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
What for MFR with gaps? Why not ILP...
What for MFR with gaps? Why not ILP...
min x
f x
f >= 1 for all odd cycles Cf f\in Cx
\in {0,1}^n1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
Randomized rounding heuristic: round and repeat. Worked well at Celera
The fragment removal is good to get rid of contaminants.
However, we may want to keep all fragments and correct errors otherwise
A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes
All fragments get assigned to one of the two haplotypes.
We describe the min SNP removal problem: remove the
fewest number of columns from M so that the fragment
graph becomes bipartite.
- - - O X X O O -
- O X O X - - - X
X X
O X X - - - -
O O X - - - O O -
- - - X X O
- - - - O O O X -
SNP conflicts
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
OK
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
OK
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
OK
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
CONFLICT !
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
CONFLICT !
- - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
SNP conflict graph GS(M)
1 node for each SNP (column)
edge between conflicting SNPs
1 2 3 4 5 6 7 8 9
- - - O X X O O -
- O X O X - - - X
X X
O X X - - - -
O O X - - - O O -
- - - X X O
- - - - O O O X -
SNP conflicts
1 2 3 4 5 6 7 8 9 - - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
1
6 2
3
4
5
8
9
7
1 2 3 4 5 6 7 8 9 - - - O X X O O - - O X O X - - - X X X
O X X - - - - O O X - - - O O - - - - X X O - - - - O O O X - SNP conflicts
1
6 2
3
4
5
8
9
7
THEOREM 1
For a gapless M, GF(M) is bipartite
if and only if GS(M) is an independent set
THEOREM 2
For a gapless M, GS(M) is a perfect graph
COROLLARY
For a gapless M, the min SNP removal
problem is polynomial
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--OOXXOO--- ----OOXOOXOXXO--- ---XXOXOXXX- ----XXOOXOXXO---- ---XOOOX--- ---XXXXXO--- --XXOXXOXOO---
Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.
Take an odd cycle in GF
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---
There is a generic structure of hor-vert cycle
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---
“vertical lines”
There cannot be only one vertical line in odd cycle We merge rightmost and next to reduce them by 1
Hence, there cannot be a minimal (in n. of vertical lines) counterexample
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?X???--- ----O????????O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---
“vertical lines”
Must be X
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?X???--- ----O?????X??O--- ---??O??X??- ----??????X??---- ---???O?--- ---????X?--- --X???????O---
“vertical lines”
Must be X
Merge the rightmost lines
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?X???--- ----O?????X--- ---??O--- ----??????X--- ---???O--- ---????X--- --X???????O---
“vertical lines”
Still a counterexample!
Merge the rightmost lines
1 2 3 1 O - O 2 - O X 3 X X
-
Note: Theorem not true if there are gaps
1
2 3
1
2 3
GF(M) GS(M)
M
THEOREM 2
For a gapless M, GS(M) is a perfect graph
PROOF: GS(M) is the complement of a comparability graph A Comparability graphs are perfect
Comparability Graphs: unoriented that can be oriented to become a partial order
LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,k) or (j,k) is also a SNP conflict
i j k - X O O ? X O X - - O X O ? X X X -
Equal:conflicts with i
O
O Different:conflicts with k
O X
i j k
I.e. if (i,j) is not a conflict and (j,k) is not a conflict, also (i,k) is not a conflict
So (u,v) with u < v and u not a conflict with v is a comparability graph A and GS is A complement
NOTE: ind set on perfect graph is in P (Lovasz, Schrijvers, Groetschel, 84)
THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)
Again, gaps must be long for problem to be difficult.
We have O(mn + n ) D.P.
for MSR on matrix with total gaps length L
2L + 1 2L + 2
Hence gapless MSR is polynomial (max stable set on perfect graph).
There are better, D.P., algorithms, O(mn + m^2)
What if gaps ?
The Population The Population
Haplotyping problem
Haplotyping problem
The input is GENOTYPE data
oooxx
xxoxx
?x??x
????x xx??x
INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }
The input is GENOTYPE data
xxoxx xxxox
oooxx
oooxx xxxox
xxoxx oxxox
xxoxx xxoxx oooxx
oooxx
xxoxx
?x??x
????x xx??x
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}
INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }
Each genotype is explained by two haplotypes We will define some objectives for H
1st Objective
1st Objective (open research problem):
minimize |H|
2nd Objective
2nd Objective based on inference rule:
1st Objective (parsimony) 1st Objective (parsimony) :
minimize |H|
An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence, we need at least LB = SQRT(n) haplotypes.
BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= SQRT(n) * LB
It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)
2nd Objective
2nd Objective based on inference rule:
xoxxooxoxx +
********** = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
Inference Rule
Inference Rule
xoxxooxoxx + xxoxooxxxo = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
new (derived) haplotype
h’
Inference Rule
Inference Rule
xoxxooxoxx + xxoxooxxxo = x??xoox?x?
known haplotype
h
known (ambiguos) genotype
g
new (derived) haplotype
h’
We write
h + h’ = g
g
andh
must becompatible
to deriveh’
Inference Rule
Inference Rule
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
xxoo
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
xxoo xxxx
SUCCESS
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
oxoo
2nd Objective (Clark, 1990) 2nd Objective (Clark, 1990)
1. Start with H = nonambiguos genotypes 2. while exists ambiguos genotype g in G
3. take h in H compatible with g and let h + h’ = g 4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
oooo xooo
??oo xx??
oxoo FAILURE (can’t resolve xx?? )
OBJ: find order of application rule that leaves the fewest elements in G OBJ: find order of application rule that leaves the fewest elements in G
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
1. expand genotypes
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
xxxox xxxoo xxoox xxooo xoxox xooox xoxoo
xoooo
1. expand genotypes
- Problem is APX-hard (Gusfield,00)
-
Graph-Model + Integer Programming for practical solution (G.,01)
x??o?
xxxox xxxoo xxoox xxooo xoxox xooox xoxoo
xoooo
2. create (h, h’) if exists g s.t. h’ can be derived from g and h
1. expand
genotypes 3. Largest number of nodes in forest rooted at unambiguos genotpes =
= largest number of ambiguous genotypes resolved
Hence, find largest number of nodes in forest rooted at unambiguos genotpes. Use I.P. model with vars x(ij).
This reduction is exponential. Is there a better practical approach?
3rd Objective
3rd Objective (open research problem) Disease Detection:
oooxx
??oxx
?x??x
????x xx??x
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
3rd Objective
3rd Objective (open research problem) Disease Detection:
xxoxx xxxox
oooxx
oooxx xxxox
xxoxx oxxox
xxoxx oooxx oooxx
oooxx
??oxx
?x??x
????x xx??x
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}
H contains H’, s.t. each diseased has one haplotype in H’ and each healty none
minimize | H|
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
Genome Rearrangements and Genome Rearrangements and
Evolutionary Distances
Evolutionary Distances
Each species has a genome (organized in pairs of chromosomes)
tcgtgatggat………ttgatggattga
tcgattatggat………ttttgatatcca
Genomes evolve by means of
•Insertions
•Deletions
•Inversions
•Transpositions
•Translocations of DNA regions
deletion
deletion insertion
deletion insertion
translocation
deletion insertion
translocation
inversion
deletion insertion
translocation
inversion
transposition
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)
5 6 4 8 3 2 1 9 7
Example:
We focus on inversions, that are the most important in Nature
1 2 3 8 4 6 5 9 7
1 2 3 8 4 5 6 9 7
1 2 3 6 5 4 8 9 7
1 2 3 6 5 4 8 7 9
1 2 3 4 5 6 8 7 9
1 2 3 4 5 6 7 8 9
Combinatorial problem: given 2 permutations P, Q and operators in a set F find a shortest sequence f1, ..fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transposition…)
+5 +6 -4 -8 -3 -2 -1 -9 +7
Example:
We focus on inversions, that are the most important in Nature
+1 +2 +3 +8 +4 -6 -5 -9 +7 +1 +2 +3 +8 +4 +5 +6 -9 +7 +1 +2 +3 -6 -5 -4 -8 -9 +7 +1 +2 +3 -6 -5 -4 -8 -7 +9 +1 +2 +3 +4 +5 +6 -8 -7 +9 +1 +2 +3 +4 +5 +6 +7 +8 +9
There is also a SIGNED VERSION of the problem !
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
5 7 8 2 1 4 3 6 9
The concept of breakpoint
reakpoint at position i if(i) - (i+1) | > 1
0 10
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98) Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
5 7 8 2 1 4 3 6 9
The concept of breakpoint
reakpoint at position i if(i) - (i+1) | > 1
0 10
d() = inversion distance b() = # breakpoints
TRIVIAL BOUND: d() >= b() / 2 Example: d() >= 6 / 2 = 3
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
10 4 6
Each node has degree...
0 2 or 4 …
hence the graph can be decomposed in cycles!
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10
Alternating cycle decomposition
c() = max # cycles in alternating decomposition
VERY STRONG BOUND : d () >= b() - c()
Example: c()= 2 and d () >= 6 - 2 = 4
The Breakpoint Graph Breakpoint Graph
5 7 8 2 1 4 3 6 9 0
10 The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition
A variable
x
C for each cycle (exponential # of vars…) A constraint
x
C = 1 for each edge e Objective: maximize
Cx
CC containing e
max x
C x
C = 1 for all edges eC C\ni ex
C \in {0,1} for all alt. cycles C PRIMALmin y
e y
e <= 1 for all alt. Cycles Ce e\in Cy
e \in R for all edges e DUALmax x
C x
C = 1 for all edges eC C\ni ex
C \in {0,1} for all alt. cycles C PRIMALmin y
e y
e <= 1 for all alt. Cycles Ce e\in Cy
e \in R for all edges e DUAL5 7 8 2 1 4 3 6 9 0
10
Pricing out the cycles for which y*(C) < 1 Pricing out the cycles for which y*(C) < 1
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Split the graph in two copies Split the graph in two copies
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Connect twins Connect twins
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
A perfect matching corresponds to (a set of) alternating cycles A perfect matching corresponds to (a set of) alternating cycles
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
The weight of the matching is the y*-weight of the cycles The weight of the matching is the y*-weight of the cycles
.2 .4
.5 1
.6
0
5 7 8 2 1 4 3 6 9 0
10
5 7 8 2 1 4 3 6 9 0
10
Forcing a cycle to use a certain node Forcing a cycle to use a certain node
.2 .4
.5 1
.6
100000
- These cycles would not use the same node twice, but with simple trick is possible to model (OMISSIS)
BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)
BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)
KS can solve at most n=40. Take days for n=50
CLN can solve for n=200. Takes few seconds (say 5) for n=100 NP-hard problem practically solved to optimality!