Optimization Problems for
Polymorphisms of Single Nucleotides
Polymorphisms
A polymorphism is a feature
Polymorphisms
A polymorphism is a feature - common to everybody
Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
Polymorphisms
E.g. think of
eye-color
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
Polymorphisms
A polymorphism is a feature - common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of
eye-color
Or blood-type for a feature not visible from outside
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S
ingleN
ucleotideP
olymorphism (SNP)At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S
ingleN
ucleotideP
olymorphism (SNP)atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S
ingleN
ucleotideP
olymorphism (SNP)atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
- SNPs are predominant form of human variations
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
- Used for drug design, study disease, forensic, evolutionary...
- On average one every 1,000 bases
- Multimillion dollar SNP consortium project
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
- Goal: associate SNPs (or group of SNPs) to genetic diseases - 1st step: build maps of several thousand SNPs
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
HAPLOTYPE
: chromosome content at SNP sitesatcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgt
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
HAPLOTYPE
: chromosome content at SNP sitesatcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
HAPLOTYPE
: chromosome content at SNP sitesag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
HOMOZYGOUS: same allele on both chromosomes HETEROZYGOUS: different alleles
HAPLOTYPE
: chromosome content at SNP sitesGENOTYPE
: “union” of 2 haplotypesOcE
EE
OaOg OaE
OaOt EOg
OgE
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
OcE
EE
OaOg OaE
OaOt EOg
OgE
CHANGE OF SYMBOLS: each SNP only two values in a poplulation (bio).
Call them 1 and O. Also, call * the fact that a site is heterozygous
HAPLOTYPE: string over 1,O GENOTYPE: string over 1,O,*
1o
11
o1 1o
o1
oo
11 11 1o
oo
1o oo 1o
1o
o*
**
*o 1*
11
*o
*o
CHANGE OF SYMBOLS: each SNP only two values in a poplulation (bio).
Call them 1 and O. Also, call * the fact that a site is heterozygous
HAPLOTYPE: string over 1,O GENOTYPE: string over 1,O,*
THE HAPLOTYPING PROBLEM
Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)
Population : Given genomic data of k individuals, determine
(at most) 2k haplotypes (one per chromosome/indiv.)
For the individual problem, input is erroneous haplotype data, from sequencing For the population problem, data is ambiguous genotype data, from screening
OBJ is lead by Occam’s razor: find minimum explanation of observed data under given hypothesis (a.k.a. parsimony principle)
Theory and Results
- Polynomial Algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)
- Polynomial Algorithms for bounded-length gapped haplotyping
(BLIR 02)
Single individual
- NP-hardness for general gapped haplotyping (LBILS 01)
- APX-hardness (Gusfield 00)
- Reduction to Graph-Theoretic model and I.P. approach (Gusfield 01)
Population
-New formulations and Disease Detection (L, Ravi, Rizzi, 02)
- Exact algorithms for min-size solution (L,Serafini 2011) - Heuristics (Tininini, L, Bertolazzi 2010)
The Single-Individual
Haplotyping problem
ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGA GATTT GCCTAG CTATCTT
ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA
TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA
TAGAGATTTC TCCTAAAGAT CGCATAGATA
fragmentation
sequencing
assembly
Shotgun Assembly of a Chromosome
[ Webber and Myers, 1997]
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
-Sequencing errors:
ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT
CATTGGAAC
AATGGAACGGA
-Contaminants
MAIN ERROR SOURCES
Given errors, the data may be
inconsistent with exactly 2 haplotypes
PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2
haplotypes
Hence, assembler is unable to
build 2 chromosomes
ACTGAAAGCGA ACTAGAGACAGCATG
ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC
AGCATG
ACTGAAAGCGA ACTAGAGACAGCATG
ACTGATAGC GTAGAGTCA ACTG TCGACTAGA CATG ACTGA CGATCCATCG TCAGC ACTGAAA ATCGATC
AGCATG
1 1 O O O 1 1 1 1 1 O
The data: a SNP matrix
Snips 1,..,n
Fragments 1,..,m 1 2 3 4 5 6 7 8 9
1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 3 1 1 O 1 1 - - - - 4 O O 1 - - - - O - 5 - - - 1 O 6 - - - - O O O 1 -
Snips 1,..,n
Fragments 1,..,m
Fragment conflict: can’t be on same haplotype 1 2 3 4 5 6 7 8 9
1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 3 1 1 O 1 1 - - - - 4 O O 1 - - - - O - 5 - - - 1 O 6 - - - - O O O 1 -
Snips 1,..,n
Fragments 1,..,m
Fragment conflict: can’t be on same haplotype 1
6 2
3
4
5
Fragment Conflict Graph GF(M) We have 2 haplotypes iff GF is BIPARTITE
1 2 3 4 5 6 7 8 9 1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 3 1 1 O 1 1 - - - - 4 O O 1 - - - - O - 5 - - - 1 O 6 - - - - O O O 1 -
Snips 1,..,n
Fragments 1,..,m
1
6 2
3
4
5
PROBLEM (Fragment Removal): make GF Bipartite 1 2 3 4 5 6 7 8 9
1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 3 1 1 O 1 1 - - - - 4 O O 1 - - - - O - 5 - - - 1 O 6 - - - - O O O 1 -
Snips 1,..,n
1 2 3 4 5 6 7 8 9 1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 3 1 1 O 1 1 - - - - 4 O O 1 - - - - O - 5 - - - 1 O 6 - - - - O O O 1 -
Fragments 1,..,m
PROBLEM (Fragment Removal): make GF Bipartite 1
6 2
3
4
5
1 2 3 4 5 6 7 8 9 1 - - - O 1 1 O O - 2 - O - O 1 - - - 1 4 O O 1 - - - - O -
3 1 1 O 1 1 - - - - 5 - - - 1 O O O 1 O 1 1 O O 1
1 1 O 1 1 - - 1 O
Removing fewest fragments is equivalent to maximum induced bipartite subgraph
NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
O(|V|(log log |V|/log |V|)2)-approximable [Halldórsson, 1999]
not O(|V|)-approximable for some [Lund and Yannakakis, 1993]
Are there cases of M for which GF(M) is easier?
YES: the gapless M
---O11OO---O1OO1--- gap
---O11OO1O1O1OO1--- gapless
---O11--1O----O1--- 2 gaps
Why gaps?
Sequencing errors (don’t call with low confidence) ---OO11?11--- ===> ---OO11-11---
Why gaps?
Sequencing errors (don’t call with low confidence) ---OO11?11--- ===> ---OO11-11---
Celera’s mate pairs
attcgttgtagtggtagcctaaatgtcggtagaccttga attcgttgtagtggtagcctaaatgtcggtagaccttga
THEOREM
For a gapless M, the Min Fragment Removal Problem is Polynomial
NOTE: Does not need to be gapless. Enough if it can be sorted to become such
(Consecutive Ones Property, Booth and Lueker, 1976)
An O(nm + n ) D.P. algo
31 - O O 1 1 O O - - 2 - - 1 O 1 1 O - - 3 - - - 1 1 O - - - 4 - - - - O O 1 O - 5 - - - 1 O 1 O
An O(nm + n ) D.P. algo
3LFT(i) RGT(i)
sort according to LFT
1 - O O 1 1 O O - - 2 - - 1 O 1 1 O - - 3 - - - 1 1 O - - - 4 - - - - O O 1 O - 5 - - - 1 O 1 O
An O(nm + n ) D.P. algo
31 - O O 1 1 O O - - 2 - - 1 O 1 1 O - - 3 - - - 1 1 O - - - 4 - - - - O O 1 O - 5 - - - 1 O 1 O
LFT(i) RGT(i)
D(i;h,k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)
sort according to LFT
D(i; h,k) =
D(i-1; h,k) if i, k compatible and RGT(i) <= RGT(k) or i, h compatible and RGT(i) <= RGT(h) 1 + D(i-1; h, k) otherwise
{
OPT is min h,k D( n; h, k ) and can be found in time O(nm + n^3)
Th: NP-Hard if 2 gaps per fragment
proof: (simple) use fact that for every G there is M s.t. G = GF(M) and reduce from Max Bip. Induced Subgraph on 3-regular graphs (in each row, max 3 non-bit, hence max 2 gaps)
WITH GAPS…..
Th : NP-Hard if even 1 gap per fragment
proof: technical. reduction from MAX2SAT
WITH GAPS…..
Th: NP-Hard if 2 gaps per fragment
proof: (simple) use fact that for every G there is M s.t. G = GF(M) and reduce from Max Bip. Induced Subgraph on 3-regular graphs (in each row, max 3 non-bit, hence max 2 gaps)
Th : NP-Hard if even 1 gap per fragment
proof: technical. reduction from MAX2SAT
WITH GAPS…..
But, gaps must be long for problem to be difficult.
We have O( 2 mn + 2 n ) D.P.
for MFR on matrix with total gaps length L
2L 3L 3
Th: NP-Hard if 2 gaps per fragment
proof: (simple) use fact that for every G there is M s.t. G = GF(M) and reduce from Max Bip. Induced Subgraph on 3-regular graphs (in each row, max 3 non-bit, hence max 2 gaps)
What for MFR with gaps? Why not ILP...
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
What for MFR with gaps? Why not ILP...
1
5 2
4 3
1/2
1/3
1/2 1/4 0
1
5 2
4 3
1
5 2
4 3
5/12 5/12
Randomized rounding heuristic: round and repeat. Worked well at Celera
min ∑
�
�
�� ∈ �
∑
�
�≥1for all oddcycles �
� ∈{0,1}�
The fragment removal is good to get rid of contaminants.
However, we may want to keep all fragments and correct errors otherwise
A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes
All fragments get assigned to one of the two haplotypes.
We describe the min SNP removal problem: remove the fewest number of columns from M so that the fragment graph becomes bipartite.
SNP conflicts
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
OK
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
OK
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
OK
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
CONFLICT !
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
CONFLICT !
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
- - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 - SNP conflicts
SNP conflict graph GS(M)
1 node for each SNP (column) edge between conflicting SNPs
SNP conflicts 1 2 3 4 5 6 7 8 9 - - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
SNP conflicts
1
6 2
3
4
5
8
9 7
1 2 3 4 5 6 7 8 9 - - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 -
1 2 3 4 5 6 7 8 9 - - - O 1 1 O O - - O 1 O 1 - - - 1 1 1 O 1 1 - - - - O O 1 - - - O O - - - - 1 1 O - - - - O O O 1 - SNP conflicts
1
6 2
3
4
5
8
9 7
THEOREM 1
For a gapless M, GF(M) is bipartite
if and only if GS(M) is an independent set
THEOREM 2
For a gapless M, GS(M) is a perfect graph
COROLLARY
For a gapless M, the min SNP removal problem is polynomial
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--OO11OO--- ----OO1OO1O11O--- ---11O1O111- ----11OO1O11O---- ---1OOO1--- ---11111O--- --11O11O1OO---
Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.
Take an odd cycle in GF
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?1???--- ----O????????O--- ---??O??1??- ----??????1??---- ---???O?--- ---????1?--- --1???????O---
There is a generic structure of hor-vert cycle
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?1???--- ----O????????O--- ---??O??1??- ----??????1??---- ---???O?--- ---????1?--- --1???????O---
“vertical lines”
There cannot be only one vertical line in odd cycle We merge rightmost and next to reduce them by 1
Hence, there cannot be a minimal (in n. of vertical lines) counterexample
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?1???--- ----O????????O--- ---??O??1??- ----??????1??---- ---???O?--- ---????1?--- --1???????O---
“vertical lines”
Must be 1
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?1???--- ----O?????1??O--- ---??O??1??- ----??????1??---- ---???O?--- ---????1?--- --1???????O---
“vertical lines”
Must be 1
Merge the rightmost lines
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
PROOF (sketch): by minimal counterexample
--O?1???--- ----O?????1--- ---??O--- ----??????1--- ---???O--- ---????1--- --1???????O---
“vertical lines”
Still a counterexample!
Merge the rightmost lines
1 2 3 1 O - O 2 - O 1 3 1 1 -
Note: Theorem not true if there are gaps
1
2 3
1
2 3
GF(M) GS(M)
M
THEOREM 2
For a gapless M, GS(M) is a perfect graph
PROOF: GS(M) is the complement of a comparability graph A
Comparability graphs are perfect
Comparability Graphs: unoriented that can be oriented to become a partial order
LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,k) or (j,k) is also a SNP conflict
i j k - 1 O O ? 1 O 1 - - O 1 O ? 1 1 1 -
Equal:conflicts with i
O
O Different:conflicts with k
O 1
i j k
I.e. if (i,j) is not a conflict and (j,k) is not a conflict, also (i,k) is not a conflict
So (u,v) with u < v and u not a conflict with v is a comparability graph A and GS is A complement
NOTE: ind set on perfect graph is in P (Lovasz, Schrijvers, Groetschel, 84)
THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)
Again, gaps must be long for problem to be difficult.
We have O(mn + n ) D.P.
for MSR on matrix with total gaps length L
2L + 1 2L + 2
Hence gapless MSR is polynomial (max stable set on perfect graph).
There are better, D.P., algorithms, O(mn + m^2)
What if gaps ?