Giuseppe Lancia
University of Udine
The phasing of heterozygous The phasing of heterozygous
traits:
traits:
Algorithms and Complexity
Algorithms and Complexity
-The genomic age has allowed to look at ourselves in a detailed, comparative way
-All humans are >99% identical at genome level
-Small changes in a genome can make a big
difference in how we look and who we are
What makes us different from each other?
The answer is
POLYMORPHISMS
POLYMORPHISMS
This is true for humans
as well as for other species
Polymorphisms are features existing in different
“flavours”, that make us all look (and be) different
Examples can be eye-color, blood type, hair, etc…
In fact, polymorphisms in the way we look (phenotyes) are
determined by polymorphisms in our genome
For a given polymorhism, say the eye-color, the possible forms are called alleles
We all inherit two alleles (paternal and maternal)
identical HOMOZYGOUS If they are
different HETEROZYGOUS
{
mother father
child Homozygous
mother father
child Homozygous
mother father
child Heterozygous
Dominant Recessive
mother father
child Homozygous
mother father
child Heterozygous
mother father
child Homozygous
Dominant Recessive
mother father
child Homozygous
mother father
child Heterozygous
mother father
child Homozygous
Dominant Recessive
mother father
child
mother father
child
mother father
child
?
?
?
?
?
?
?
?
?
?
?
?
mother father
child
mother father
child
mother father
child
?
?
?
?
?
?
?
?
?
?
?
?
Single Single
Nucleotide Nucleotide
Polymorphisms
Polymorphisms
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S S ingle N N ucleotide P P olymorphism (SNP)
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
S S ingle N N ucleotide P P olymorphism (SNP)
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acc g
atcgg
a
ttagttagggcacaggacgg
aca g
- SNPs are predominant form of human variations
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
ac- Used for drug design, study disease, forensic, evolutionary...
- On average one every 1,000 bases
atcgg
c
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgt
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
acatcgg
a
ttagttagggcacaggacgt
atcgg
a
ttagttagggcacaggacgg
acatcgg
c
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgg
acc g
atcgg
a
ttagttagggcacaggacgg
aca g
atcgg
a
ttagttagggcacaggacgg
acatcgg
a
ttagttagggcacaggacgt
ac- SNPs are predominant form of human variations
- Used for drug design, study disease, forensic, evolutionary...
- On average one every 1,000 bases
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
- SNPs are predominant form of human variations
- Used for drug design, study disease, forensic, evolutionary...
- On average one every 1,000 bases
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
HAPLOTYPE
HAPLOTYPE : chromosome content at SNP sites
GENOTYPE
GENOTYPE : “union” of 2 haplotypes
{c}{g,t}
{a,c}{g,t}
{a}{g}
{a}{g,t}
{a}{t}
{a,c}{g}
{a,c}{g}
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
{a,c}{g,t}
{a}{g,t}
{c}{g,t}
{a}{g}
{a}{t}
{a,c}{g}
{a,c}{g}
CHANGE OF SYMBOLS
CHANGE OF SYMBOLS
: each SNP only two values in a population (bio).Call them
0
and1
. Also, call2
the fact that a site is heterozygousHAPLOTYPE
HAPLOTYPE: string over 0, 1 GENOTYPE
GENOTYPE: string over 0, 1, 2
ag
at
ct ag
ct
cg
at at ag
cg
ag cg ag
ag
{a,c}{g,t}
{a}{g,t}
{c}{g,t}
{a}{g}
{a}{t}
{a,c}{g}
{a,c}{g}
CHANGE OF SYMBOLS
CHANGE OF SYMBOLS
: each SNP only two values in a population (bio).Call them
0
and1
. Also, call2
the fact that a site is heterozygousHAPLOTYPE
HAPLOTYPE: string over 0, 1 GENOTYPE
GENOTYPE: string over 0, 1, 2
where 0={0}, 1={1}, 2={0,1}
10
11
01 10
01
00
11 11 10
00
10 00 10
10
02
22
10 12
11 20
20 CHANGE OF SYMBOLS
CHANGE OF SYMBOLS
: each SNP only two values in a population (bio).Call them
0
and1
. Also, call2
the fact that a site is heterozygousHAPLOTYPE
HAPLOTYPE: string over 0, 1 GENOTYPE
GENOTYPE: string over 0, 1, 2
where 0={0}, 1={1}, 2={0,1}
10
11
01 10
01 00
11 00
00 10 10
10
02
22
10 12
22
20
0 + 0 = --- 0
1 + 1 = --- 1
0 + 1 + 1 = 0 = --- --- 2 2 ALGEBRA OF HAPLOTYPES:
Homozygous sites Heterozygous (ambiguous) sites
12202
11101 10000 11100 10001 11001 10100 11000 10101
Phasing the alleles Phasing the alleles
For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings
THE PHASING (or HAPLOTYPING) PROBLEM THE PHASING (or HAPLOTYPING) PROBLEM
Given genotypes of k individuals, determine the phasings of all heterozygous sites.
It is too expensive to determine haplotypes directly
Much cheaper to determine genotypes, and then infer haplotypes in silico:
This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.
The input is GENOTYPE data
00011
11011 21221
22221 11221
INPUT: G = { 11221, 22221, 11011, 21221, 00011 }
The input is GENOTYPE data
11011 11101
00011
00011 11101
11011 01101
11011 11011 00011
00011
11011 21221
22221 11221
OUTPUT: H = { 11011, 11101, 00011, 01101}
INPUT: G = { 11221, 22221, 11011, 21221, 00011 }
Each genotype is resolved by two haplotypes We will define some objectives for H
-without objectives/constraints, the - haplotyping problem would be
(mathematically)trivial
OBJECTIVES
22021 00001 11011
E.g., always put 0 above and 1 below
12022 10000 11011
-the objectives/constraints must be -
“driven by biology”
2°) (parsimony): minimize |H| 2°) 1°)
1°) Clark’s inference rule
3°) Perfect Phylogeny 3°) Perfect Phylogeny
4°) Disease Association 4°) Disease Association
OBJECTIVES
Obj: Clark’s rule Obj: Clark’s rule
1st 1st
1011001011 +
********** = 1221001212
known haplotype
h
known (ambiguos) genotype
g
Inference Rule Inference Rule
for a
compatible
pairh , g
1011001011 + 1101001110 = 1221001212
known haplotype
h
known (ambiguos) genotype
g
Inference Rule Inference Rule
for a
compatible
pairh , g
new (derived) haplotype
h’
We write
h + h’ = g
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000 2200 1122
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000 2200 1122
1100
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
0000 1000 2200 1122
1100 1111
SUCCESS 1st Objective (Clark, 1990)
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000 2200 1122
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
0000 1000 2200 1122
0100
If, at end, G is empty, SUCCESS, otherwise FAILURE Step 3 is non-deterministic
0000 1000 2200 1122
0100 FAILURE (can’t resolve 1122 )
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G 3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.
The number of genotypes solved depends on order of application.
1st Objective (Clark, 1990) 1st Objective (Clark, 1990)
OBJ: find order of application rule that leaves the fewest elements in G OBJ: find order of application rule that leaves the fewest elements in G
The problem was studied by Gusfield
(ISMB 2000, and Journal of Comp. Biol., 2001)
- problem is APX-hard
- it corresponds to finding largest forest in a graph with haplotypes as nodes and arcs for possible derivations -solved via ILP of exponential-size
(practical for small real instances)
Obj: Max Parsimony Obj: Max Parsimony
2nd 2nd
- Clark conjectured solution (when found) uses min # of haplotypes
- this is clearly false
- solution with few haplotypes is biologically relevant (as we all descend from a
small set of ancestors)
011 101
111 111
011 000 010
001 010
011
111 111
011 101
111 111
011 000 010
001 010
011
111 111
022
222
012
221
011 111 022
211 022 012
012 222
minimize |H|
2nd Objective (parsimony)
2nd Objective (parsimony) :
1. The problem is APX-Hard 1. The problem is APX-Hard
Reduction from VERTEX-COVER
A
B
C
D E
A
B
C
D E
A B C D E *
A
B
C
D E
A B C D E * AB
BC AE
DE AD
A
B
C
D E
A B C D E * AB
BC AE
DE AD A B C D E
A
B
C
D E
A B C D E * AB 2 2
BC 2 2
AE 2 2 DE 2 2 AD 2 2
A B C D E
A
B
C
D E
A B C D E * AB 2 2
BC 2 2
AE 2 2 DE 2 2 AD 2 2
A 0
B 0
C 0
D 0
E 0
A
B
C
D E
A B C D E *
AB 2 2 2 BC 2 2 2 AE 2 2 2 DE 2 2 2 AD 2 2 2 A 0 0 B 0 0 C 0 0 D 0 0 E 0 0
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that
A
B
C
D E
A B C D E * AB 2 2 1 1 1 2 BC 1 2 2 1 1 2 AE 2 1 1 1 2 2 DE 1 1 1 2 2 2 AD 2 1 1 2 1 2 A 0 1 1 1 1 0 B 1 0 1 1 1 0 C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0 A’ 0 1 1 1 1 1 B’ 1 0 1 1 1 1 E’ 1 1 1 1 0 1
G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes
A basic ILP formulation
Expand your input G in all possible ways
220 120 022
A basic ILP formulation
Expand your input G in all possible ways
010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010
220 120 022
A basic ILP formulation
xh
h
1, h
2
x h
y
h h2 1,
Expand your input G in all possible ways
010 + 100, 000 + 110 100 + 110 000 + 011, 001 + 010
220 120 022
A basic ILP formulation
The resulting Integer Program (IP1):
Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations
Obj: Perfect Phylogeny Obj: Perfect Phylogeny
3rd 3rd
- Parsimony does not take into account mutations/evolution of haplotypes
- parsimony is very relialable on “small”
haplotype blocks
- when haplotypes are large (span several SNPs, we should consider evolutionionary events and recombination)
- the cleanest model for evolution is the
perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
3rd objective is based on perfect phylogeny perfect phylogeny
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
3rd objective is based on perfect phylogeny perfect phylogeny
has tail
flies
- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
has 2 legs
But…a new species may come along so that no Perfect phylogeny is possible…
has tail
flies
Theorem
Theorem: such matrix has p.p. iff there is not a 00 4x2 minor
10
01
11
Human 1 0 0
Mouse 0 1 0
Spider 0 0 0
Eagle 1 0 1 two legs
tail
flies
Theorem
Theorem: such matrix has p.p. iff there is not a 00 4x2 minor
10
01
11
Human 1 0 0
Mouse 0 1 0
Spider 0 0 0
Eagle 1 0 1
Mickey mouse 1 1 0 two legs
tail
flies
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 2 0 2 1 0 2 2 0 2 0
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 2 0
2 1 0 2 2 0 2 0
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 2 0 2 1 0 2 2 0 2 0
0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0
NOT a perfect phylogeny solution !
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 2 0 0 1 0 2 0 0 0 2
We can consider each SNP as a binary feature Objective:
Objective: We want the solution to admit a perfect phylogeny (Rationale : we assume haplotypes have evolved independently along a tree)
0 1 2 0 0 1 0 2 0 0 0 2
0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1
A perfect phylogeny
Theorem: The Perfect Phylogeny
Haplotyping problem is polynomial
Theorem: The Perfect Phylogeny Haplotyping problem is polynomial
Algorithms are of combinatorial nature
- There is a graph for which SNPs are columns and edges are of two types (forced and free)
- forced edges connect pairs of SNPs that must be phased in the same way 22 00 + 11 or 22 01 + 10
- a complex visit of the graph decides how to phase free SNPs
Obj: Disease Association Obj: Disease Association
4th 4th
Some diseases may be due to a gene which has “faulty” configurations
RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease
DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy
Two individuals of which one is healthy and the other diseased may have the same genotype.
The explanation of the disease lies in a difference in their haplotypes
00011
02011 21221
02201 11221
INPUT: GD = { 11221,21221,02011 }, GH = {11221,02201,00011}
11221
11011 11101
00011
01101 00001
11011 01101
01011 00011
00011 00011
02011 21221
02201 11221
OUTPUT: H = {
11011,01011,00001,11111,11101,00011,01101} INPUT: GD = { 11221,21221,02011 },
GH = {11221,02201,00011}
11001 11111