Integer Programming Models for Computational Biology Prob-
lems
GiuseppeLancia
D.I.M.I,UniversitadiUdine,VialedelleScienze206,33100 Udine,Italy
E-mail: lancia@dimi.uniud.it
ReceivedMay26,2003;revisedAugust21,2003.
Abstract TherecentyearshaveseenanimpressiveincreaseintheuseofIntegerProgram-
mingmodels forthesolutionof optimization problemsoriginatinginMolecularBiology. Inthis
survey,we describe someofthemost successfulIntegerProgrammingapproaches, whilegivinga
broadoverviewofapplicationareasinmodernComputationalMolecularBiology.
Keywords integerprogramming,computationalbiology,sequencealignment,genomerear-
rangements,proteinstructures
1 Introduction
Computational (Molecular) Biology is a rel-
atively young science, which has experienced a
tremendous growth in the last decade. A sign
of this growth is the fact that a steadily increas-
ingnumber ofpeoplewithdierentscienticback-
groundshavebeenapproachingthiseldinthere-
centyears. ItisnowcommontoseepapersonCom-
putational Biology problems written not only by
computer scientistsand biologists,but also statis-
ticians, physicists and mathematicians, pure and
applied. Inparticular,theapparentlyendlesscapa-
bilityofComputationalBiologytoprovideexciting
combinatorial problems, has become very appeal-
ingtotheMathematicalProgramming/Operations
Research community. In fact, many of the prob-
lems consideredcan becastas optimization prob-
lems,tobeattackedbystandardoptimizationtech-
niques.
Consider thetypical approachtotackleacom-
putational biology problem, which is as follows.
First,a (ratherdiÆcult)modelinganalysis is per-
formed, inorderto mapthebiological processun-
der investigation into one or more mathematical
objects (such as graphs, sets, arrays:::). After
thistransformation,theoriginalquestion concern-
ing the biological data becomes a mathematical
question abouttheobjects ofthechosen represen-
tation. Sometimesthisquestioncanbesimplycast
as \does there exist at least one objectwith such
andsuchmathematicalproperties?". Thisisafea-
sibility,orexistenceproblem(whichcan besolved,
e.g., by using powerful arguments from combina-
torics -such as the pigeonhole principle-, or from
analgorithmwhichbuildsa solution,ifoneexists,
andprovingitscorrectness).
However, many other times (most times, we
wouldsay),eachobject,representingatentativeso-
lution,hasassociatedanumericalweight,intended
tomeasureitsquality. Theweightisproportional,
loosely speaking, to the probability that the solu-
tionisthe\true" answer to theoriginalbiological
problem. Thehighertheweight,themorereliablea
solutionis. Sincetheproblemwasmappedintothe
domainofdiscretemathematics,alsotheweightis
represented by some mathematical function f(x).
Inordertondthebestanswer totheoriginalbi-
ologicalproblem, onehasthento solvethefollow-
ingoptimizationproblem: nda solutionx
which
maximizesf(x)over allpossible solutions.
Optimization problems have been a subject of
studyfordecadesinthemathematicalcommunity.
In particular, the whole eld of Operations Re-
search is devoted to the subject, and many very
successful techniques have been developed over
thepastyearsfordiÆcult(NP-hard)optimization
problems. For such diÆcult problems, very rigor-
ousmethodshavebeendevisedthatarepractically
(i.e.,withina\small" time)capableofndingthe
optimalsolutionfordataderivedfromreal-life ap-
plications.
The mostsuccessful approach to the exact so-
lution of hard optimization problems is proba-
bly Integer Linear Programming (for brevity, In-
teger Programminghereafter), which hasbeen ap-
pliedprotablyforhundreds ofproblemsfromreal
domains [1 3]
. TheInteger Programmingapproach
consistsinformulatingaproblemasthemaximiza-
tion(minimization)ofalinearfunctionofsomein-
bound,wheretheupper(lower)boundcomesfrom
theLinearProgramming (LP)relaxation. TheLP
relaxation,inwhichthesamefunctionisoptimized
but thevariables are notrestricted to be integer,
is polynomially solvable. A formulation is as suc-
cessfulasthestrengthofitsLPbound. That is,if
thevalue ofthe objectivefunction over the relax-
ationisclosetothevalueovertheintegers,thenthe
bound,andhence thepruningofthesearch space,
is eective. It is often the case that in order to
obtainbetterbounds,theformulationisreinforced
by the use of additional constraints, called cuts,
and the resulting approach is known as Branch-
and-Price. Cuts areconstraintsthat do notelim-
inate any feasible integer solution, but make the
space of fractional solutions smaller, this way im-
provingthevalueoftheLPbound.
Besides its use to solvea problem exactly, the
IntegerProgrammingapproachcanalsobeapplied
to compute provably good feasible solutions. In
fact, at any point of the search process, by using
the LP bound, we have an estimate of the maxi-
mumerrorbetweenthevalueofthecurrentsolution
andtheoptimum. Thesearchcanthenbestopped
assoonasthemaximumerrorbecomesacceptably
small. Furthermore, approximation-guarantee al-
gorithmswhicharebasedonIntegerProgramming
formulation are also possible. These are polyno-
mialalgorithmswhichexploittheLPrelaxationof
aproblemtondafeasiblesolutionwithinaguar-
anteedmaximumerrorfromtheoptimum. Wewill
give examples of results of all these types in the
followingsections.
The use of Integer Programming approaches
to Computational Biology problems has been in-
troduced, starting about seven years ago, by sev-
eralpeoplewithMathematicalProgrammingback-
ground (among which A. Caprara, B. Carr, D.
Guseld, R. Karp, J. Kececioglu, G. Lancia, H.
P. Lenhof,P. Mutzel, R.Ravi, K.Reinert and M.
Vingron)andhastodaybecomeacommontoolfor
approachinga newproblem.
Inthis survey wewill reviewsome ofthemain
results that wereobtained withthe useof Integer
Programming approaches. Theproblems, coming
from dierent areas, touch almost every aspect of
modernComputationalBiology.
The material is organized in sections. In each
section, we describe one or more problems of the
same nature, and the Integer Programming ap-
proaches that were proposed in the literature for
theirsolution. Manyoftheseapproachesarebased
ponential number of inequalities (or variables),
that aresolved by Branch-and-Price(respectively,
Branch-and-Price). Section 2is devoted to a brief
description of the general principles of the theory
ofInteger Programming.
The remainder of the paper is then organized
as follows. In Section 3 we outline some Integer
Programming approaches to Alignment Problems.
Wereviewseveralproblemsrelatedwiththealign-
mentofgenomicobjects,suchasDNAorRNAse-
quences and secondary or tertiary structures. We
distinguishthreetypes ofalignment:
(i) Sequencevs SequenceAlignment: Thissub-
section is devoted to the problem of aligning ge-
nomic sequences. First, we discuss the classical
multiple alignment problem, studied by Reinert
et al.
[4]
and Kececioglu et al.
[5]
, whodeveloped a
Branch-and-Pricealgorithmforitssolution. Then,
wedescribeaheuristicapproachforthesameprob-
lem,byFischettietal.
[6]
. Theapproachgeneralizes
anapproximationalgorithmbyGuseld [7]
anduses
a reduction to the Minimum Routing Cost Tree
problem, solved by Branch-and-Price. Closely re-
lated to the alignment problem, are the so called
\center"problems,inwhichonetriestodetermine
a sequenceascloseoras faraspossiblefrom aset
of inputsequences. For this problem, wedescribe
an approximation algorithm based on LP relax-
ation and Randomized Rounding, by Ben-Dor et
al.
[8]
. Wealsodiscusssomeworksthat,developing
similar ideas, obtain PTAS forthe consensus and
thefarthest-sequenceproblems(Lietal.
[9]
,Ma [10]
,
Lanctotetal.
[11]
).
(ii) Sequence vs Structure Alignment: In this
subsectionwedescribetheproblemofaligningtwo
RNA sequences, when for one of them the sec-
ondary structureis alreadyknown. For this prob-
lem, wedescribe a Branch-and-Price approach by
Lenhofetal.
[12]
andKececioglu etal.
[5]
.
(iii) Structure vs Structure Alignment: Inthis
subsection we consider the problem of comparing
two protein tertiary structures. For this problem,
wereview aBranch-and-Priceapproach, proposed
byLanciaetal.
[13]
.
Section4isdevotedtoGenomeRearrangements
problems. In this section we describe a Branch-
and-Price approach, by Caprara et al.
[14;15]
for
computing the evolutionary distance between two
genomes evolved from a common ancestor by in-
versions oflongDNAregions.
In Section5 wediversities(polymorphisms)at
the genomic level. The SNP (Single Nucleotide
determiningthevaluesofasetofpolymorphicsites
in a genome. First, wediscuss the single individ-
ual haplotyping problem, for which some Integer
Programming techniques were employed by Lan-
cia et al.
[16]
and Lippert et al.
[17]
. Then, we re-
view two versions of the problem relative to hap-
lotyping entire populations, studied primarily by
Guseld [18 20]
. Theapproachesarebasedemploys
a reduction to a graph problem, which is then
solvedbyIntergerProgramming.
InSection 6 wementionsome approachesthat
were left out for space reasons [21;22]
. Finally, we
show how theInteger Programming approach can
also be applied indirectly, i.e., when a very well
known optimization problem (solved by Integer
Programming techniques) isused to solve a prob-
lem in Computational Biology. This is the case,
e.g.,fortheTSPproblem,usedtosolvetheRadia-
tionHybridMappingproblem(Agarwalaetal.
[23]
).
2 Integer Linear ProgrammingBasics
An Integer Linear Programming (IP) problem
is dened by an m n real constraint matrix
A = (a
ij
), a real 1n cost vector c and a real
m1rhsvectorb. Theproblemconsistsinnding
the values for n variables x = (x
1
;:::;x
n ) which
optimize(e.g.,minimize)
min n
X
j=1 c
j x
j
(1)
subjectto
n
X
j=1 a
ij x
j
>b
i
8i=1;:::;m (2)
andwitheachx
i
beinganon-negative,integernum-
ber. Thislastconstraintiswrittenas
x>0; x2Z n
: (3)
It is easyto show that the IPproblem is NP-
hard. However,itsLinearProgrammingrelaxation,
obtainedbyreplacingtheintegralityconstraints(3)
with
x>0 (4)
ispolynomiallysolvable [24;25]
. Letz
IP
betheopti-
mumvalueof(1),(2)and(3)andz
LP
betheopti-
mumvalueof(1),(2)and(3). Thevaluez
IP z
LP
is calledthe integrality gap oftherelaxation. The
successofanIPmodelrelies intheintegralitygap
To formulate a combinatorial optimization
problemas anIPproblem, onehasto denea set
of variables, linear constraints anda linear objec-
tivefunction,suchthattheoptimalsolutiontothe
problem of minimizing (1) subject to (2) and (3)
identiestheoptimalsolution oftheoriginalprob-
lem. Manytimesthenaturalchoicefortheinteger
variablesasksthemtotakeonlythevalues0and1
(binaryvariables). Thesevalues areusedtorepre-
sentthedecisionoftakingornotacertainelement
aspart oftheoptimalsolution.
SometimesthetightestIPformulationsrequire
a very large (exponential with respect to the size
of the original problem) number of constraints.
A fundamental paper by Grotschel, Lovasz and
Schrijver [26]
states that even exponential-size LP
relaxationscan besolvedin polynomial time,pro-
vided that separation can be done in polynomial
time. Separation consists in solving the LP with
onlyasubsetoftheconstraints(2),andthencheck-
ing if any of the constraints that were left out is
violatedbytheoptimalsolution. Ifthisisthecase,
theseparationprocedurendsonesuchconstraint,
whichisthenaddedtothecurrentLP,tobesolved
again. By using the ellipsoid method as the LP
solver,Grotschel,LovaszandSchrijvershowedthat
theaboveprocedureconverges inpolynomialtime.
The Branch-and-Bound approach for IP formula-
tionsin which inequalities areadded through sep-
aration is called Branch-and-Price. In fact, each
added inequality iscalled a \cut"since it cutso
an LP solution which is not feasible for all con-
straints.
An alternativeway toobtain tightIPformula-
tion is sometimes through the use of a very large
(exponential with respect to the size of theorigi-
nalproblem)numberofvariables [27]
. Variablesre-
late to constraints through duality. The dual of
a Linear Program (called the primal) dened by
(1),(2)and(4)isthefollowingLPinthevariables
y=(y
1
;:::;y
m ):
max m
X
i=1 b
i y
i
(5)
subjectto
m
X
i=1 a
ij y
i 6c
j
8j=1;:::;n (6)
y>0: (7)
Note howthe role of variablesand constraints
InordertosolveanLPformulationwithanex-
ponentialnumber ofvariables in polynomial time,
itis suÆcientto designa polynomial timepricing
procedure. Pricingconsistsin solvingtheLPwith
onlyasubsetofthevariables,andthencheckingif
anyofthevariablesthatwereleftoutcanbeused
to improvethe currentoptimalsolution. If this is
thecase,thepricingprocedurendsonesuchvari-
able,which isthenaddedtothecurrentLP,tobe
solved again. Using duality, it can beshown that
nding a variable to add to thecurrent LPis the
same as nding a violated inequality for thedual
ofthecurrentLP(i.e.,thepricingproblemforthe
primalistheseparation problemforthedual).
TheBranch-and-BoundapproachforIPformu-
lationsinwhichvariablesareaddedthroughpricing
iscalledBranch-and-Price.
3 AlignmentProblems
Oftentimes,incomputationalbiology,one must
compareobjectswhichconsistofa setof elements
arrangedin alinearly orderedstructure. A typical
example is the genomic sequence, which is equiv-
alent to a string over an alphabet (the4-letter
nucletotide alphabet fA;T;C;Gg, or the 20-letter
aminoacidalphabet). Anotherexampleisthepro-
tein, when this is regarded as a linear chain of
residues(andhencewitharstandalastresidue).
Aligning two (or more) such objects consists
in determining subsets of corresponding elements
in each. The correspondence must be order-
preserving,i.e.,ifthei-thelementofobject1corre-
spondsto thek-thelementofobject2,noelement
followingiinobject1cancorrespondtoanelement
precedingk inobject2.
The situation can be described graphically as
follows. Given two objects, where the rst has n
elements,numbered1;:::;nandthesecondhasm
elements,numbered1;:::;m,weconsiderthecom-
plete bipartitegraph
W
nm
:=([n];[m];L) (8)
whereL=[n][m]and[k]:=f1;:::;kg8k2Z +
.
We hereafter call a pair(i;j) with i2 [n] and
j2[m] aline(sinceitaligns iwithj),andwede-
noteit by [i;j]. Lis thesetofalllines. Two lines
[i;j]and[i 0
;j 0
],with[i;j]6=[i 0
;j 0
],aresaidtocross
ifeither i 0
>iand j 0
6j or i 0
6iand j 0
>j (by
denition,alinedoesnotcrossitself). Graphically,
crossing lines correspond to lines that intersect in
a single point. A matching isa subsetof lines no
identiedbyanoncrossingmatching,i.e.,amatch-
ingforwhichnotwolines cross,inW
nm
. Fig.1(a)
shows(inbold)analignmentoftwoobjects.
Fig.1. (a) Anoncrossing matching(alignment). (b) The
directedgrid.
In the following Subsections 3.1, 3.2 and 3.3,
we will describe Integer Programming approaches
for three alignment problems, i.e., Sequences vs.
Sequences (Reinert et al.
[4]
, Kececioglu et al.
[5]
,
Sequences vs. Structures (Lenhof et al.
[12]
, Ke-
cecioglu et al.
[5]
), and Structures vs. Structures
(Lancia et al.
[13]
, Caprara and Lancia [31]
) respec-
tively. All Integer Programming formulations for
these problems contain binary variables x
l to se-
lect which lines l 2 L dene the alignment, and
constraintstoinsurethattheselectedlinesarenon-
crossing. Therefore, it is appropriate to discuss
here this part ofthemodels, since theresults will
apply to all the following formulations. We will
consider here the case of two objects only. This
case can be generalized to more objects, as men-
tionedinSubsection3.1.
Let X
nm
be theset of incidence vectors of all
noncrossing matchings in W
nm
. A noncrossing
matching inW
nm
corresponds toa stable set in a
newgraph, G
L
(theLine Con ict Graph),dened
as follows. Each line l 2L isa vertexof G
L ,and
twoverticeslandhareconnectedbyanedgeifthe
linesl andhcross.
The graph G
L
has been studied in Lancia et
al.
[13]
,where thefollowingtheoremisproved:
Theorem 1.
[13]
The graphG
L
isperfect.
Since a stable set can intersect a clique in at
most1node,thefollowingfamilyofcliqueinequal-
ities can be used as constraintsfor the alignment
problem:
X
l 2Q x
l
61 8Q2clique (L): (9)
where clique (L) denotes the set of all cliques in
G
L
. Given weights x
l
for each line l, the sepa-
ration problem for theclique inequalities (9) con-
sistsin nding a clique Q in G
L with x
(Q)>1.
solvedinpolynomialtimebyndingthemaximum
x
-weightcliqueinG
L
. Insteadofresortingtoslow
andcomplexalgorithmsforperfectgraphs,thesep-
arationproblemcanbesolveddirectlybyDynamic
Programming,asdescribedinLenhofetal.
[12]
and,
independentlyandwithadierentconstruction,in
Lanciaet al.
[13]
. Bothalgorithmshavecomplexity
O(nm). Wedescribeheretheconstructionof[12].
Consider Lasthevertexsetofadirected grid,
in which vertex [1;n] is put in the lower left cor-
nerandvertex[m;1]intheupperrightcorner(see
Fig.1(b)).
At each internal node l = [i;j], there are two
incoming arcs, one directed from the node below,
i.e. [i;j+1], and one directed from the node on
theleft,i.e. [i 1;j]. Eachsucharchasassociated
alengthequal tox
l .
Theorem2.
[12;13]
Thenodesonadirectedpath
P from [1;n] to [m;1] in the grid correspondto a
clique Q, of maximal cardinality, in G
L
and vice
versa.
Thenthemostviolatedcliqueinequalitycanbe
foundbytakingthelongest[1;n]-[m;1]pathinthe
grid. There is a violated clique inequality if and
onlyifthelengthofthepath(plus x
1n
)isgreater
than1.
3.1 Sequence vsSequence Alignments
Comparing genomic sequences drawn from in-
dividuals of thesame or dierentspeciesis one of
thefundamentalproblemsofmolecularbiology. In
fact,suchcomparisonsareneededtoidentifyhighly
conserved (and therefore presumably functionally
relevant)DNA regions, spot fatal mutations, sug-
gestevolutionaryrelationships,andhelpincorrect-
ingsequencingerrors.
Aligning a set of sequences (i.e., computing a
Multiple Sequence Alignment) consists in arrang-
ing them in a matrix having each sequence in a
row. This is obtained by possibly inserting gaps
(representedbythe`-'character)ineachsequence
sothat theyallresultofthesamelength. Thefol-
lowing isa simple example of analignmentof the
sequencesATTCCGAC,TTCCCTGandATCCTC.
A T T C C G A - C
- T T C C C - T G
A - T C C - - T C
By looking at a column of the alignment, we
can reconstruct the events that havehappenedat
thecorrespondingposition inthegiven sequences.
mutatedinto a Cin sequence2, and deletedin se-
quence3.
The multiple sequence alignment problem has
been formalizedas an optimization problem. The
mostpopularobjectivefunctionformultiplealign-
mentgeneralizesideasfromoptimallyaligningtwo
sequences. This problem, called pairwise align-
ment, is formulated as follows: Given symmetric
costs (a;b) for replacing a letter a with a letter
b and costs (a; ) for deleting or inserting a let-
ter a, nd a minimum-cost set of symbol opera-
tionsthat turn a sequence S 0
into a sequence S 00
.
For genomicsequences,thecosts (;)areusually
speciedbysomewidelyusedsubstitutionmatrices
(likePAM and BLOSUM), which score the likeli-
hoodofspeciclettermutations, deletionsandin-
sertions. The pairwise alignment problem can be
solvedbyDynamicProgrammingintimeandspace
O(l 2
), where l is the maximum lengthof thetwo
sequences(Needleman andWunsch [32]
). Thevalue
ofanoptimalsolution iscalledtheeditdistanceof
S 0
andS 00
andisdenotedbyd(S 0
;S 00
).
Formally, an alignment A of two or more
sequences is a bi-dimensional array having the
(gapped)sequencesasrows. Thevalue d
A (S
0
;S 00
)
ofanalignmentoftwo sequencesS 0
and S 00
isob-
tainedbyaddingupthecostsforthepairsofchar-
actersin correspondingpositions,anditiseasyto
see that d(S 0
;S 00
)=min
A d
A (S
0
;S 00
). Thisobjec-
tiveisgeneralized,forksequencesfS
1
;:::;S
k g,by
theSum{of{Pairs(SP) score, inwhichthecost of
an alignment is obtained by adding up the costs
of thesymbols matched upat the same positions,
over allthepairsofsequences:
SP(A):= X
16i<j6k d
A (S
i
;S
j )
= X
16i<j6k jAj
X
l =1
(A[i][l];A[j][l])
(10)
wherejAjdenotesthelengthofthealignment,i.e.,
thenumberofitscolumns.
Finding the optimal SP alignment was shown
NP-hardbyWangandJiang [33]
. Astraightforward
generalizationoftheDynamicProgrammingfrom2
toksequences,oflengthl,leadstoanexponential-
time algorithmof complexity O(2 k
l k
), whichis in
practiceuseless forallbuttinyproblems [34]
.
3.3.1 Trace andInteger ProgrammingApproach
AsanalternativeapproachtotheineectiveDy-
[5]
(seealsoReinertetal.
[4]
forapreliminaryversion)
proposed anapproach based on Integer Program-
ming, to optimizeanobjectivefunction called the
Maximum Weight Trace (MWT) Problem. The
MWTgeneralizestheSPobjectiveaswellasmany
other alignment problems in the literature. The
problemwasintroduced in1993byKececioglu [35]
,
whodescribeda combinatorial Branch-and-Bound
algorithmforitssolution.
The trace is a graph theoretic generalization
of the noncrossing matching to the case of the
alignment of more than two objects. Suppose
we want to align k objects, of n
1
;n
2
;:::;n
k el-
ements. We dene a complete k-partite graph
W
n1;:::;nk
= (V;L), where V = S
k
i=1 V
i
, the set
V
i
= fv
i1
;:::;v
ini
g denotes the nodes of level i
(correspondingtotheelementsofthei-thobject),
andL= S
16i<j6k V
i
V
j
. Eachedge [i;j]2L is
stillcalledaline. GivenanalignmentAoftheob-
jects (i.e.,anarrangementin anarray, withgaps),
we say that the alignment realizes a line [i;j] if i
and j are put in thesame column of A. A trace
T isa set of lines such that there exists analign-
mentAthat realizesallthelinesin T. TheMaxi-
mumWeightTrace (MWT)Problemisthefollow-
ing: given weights w
[i;j]
for each line [i;j], nd a
traceT ofmaximumtotalweight.
LetA bethesetofdirectedarcspointingfrom
each element to the elements that follow it in an
object(i.e., arcs of type (v
is
;v
it
) fori = 1;:::;k,
1 6s<t 6n
i
). Then a trace T is characterized
asfollows:
Proposition 3.
[14]
T is a trace if and only if
thereisnodirectedmixedcycleinthemixedgraph
(V;T[A).
Adirectedmixedcycleisacyclecontainingboth
arcs and edges, in which thearcs (elements of A)
are traversed according to their orientation, while
theundirected edges(lines ofT)can betraversed
ineither direction.
Usingbinaryvariablesx
l
forthelinesinL,the
MWTproblemcanbemodeledas follows [5]
:
max X
l 2L w
l x
l
(11)
subjectto
X
l 2C\L x
l
6jC\Lj 1 8directed mixedcycles
C in (V;L[A) (12)
x
l
2f0;1g 8l2L: (13)
Theformulation(11){(13)can bestrengthened
byusingthecliqueinequalities(9),relativetoeach
subgraphW
n
i
;n
j :=(V
i
;V
j
;V
i
V
j )ofW
n
1
;:::;n
k .
The mixed-cycle inequalities (12)can besepa-
ratedinpolynomialtime,viathecomputationofat
mostO(
P
k
i=1 n
i
)shortestpathsin(V;L[A) [5]
. A
Branch-and-Price algorithm based on the mixed-
cycle inequalities and the clique inequalities (as
wellas othercuts,separatedheuristically)allowed
Kececioglu et al.
[5]
to compute, for the rsttime,
an optimal alignment for a set of 15 proteins of
about300aminoacidseachfromtheSWISSPROT
database, whereas the previous limit, achieved by
a combinatorial Branch-and-Bound algorithm for
this problem, was 6 sequences of size 200 We re-
mark the the optimal alignment of 6 sequences
of size 200 is already out of reach for Dynamic
Programming-based algorithms. We must also
pointout,however,thatthelengthofthesequences
isstillamajorlimitationtotheapplicabilityofthe
Integer Programming solution, and themethod is
notsuitable for sequences longer than a few hun-
dredletters.
3.1.2 AnIP-Based HeuristicforSPAlignment
Due to the complexity of the alignment prob-
lem, many heuristic algorithms have been de-
veloped over time, some of which provide an
approximation-guarantee onthequalityof theso-
lution found. Inthis sectionwedescribeone such
heuristicprocedure,whichusesideasfromNetwork
DesignandIntegerProgrammingtechniques. This
heuristiciswellsuitedforlongsequences.
A popular approach formany heuristics is the
socalled \progressive"alignment, inwhich theso-
lutionisincrementallybuiltbyconsideringthese-
quences one ata time. One suchalgorithm isdue
to Guseld [7]
, who suggested to use a star (i.e.,
a tree in which at most one node {the center{ is
nota leaf), to align thesequences as follows. Let
S = fS
1
;:::;S
k
g be the set of input sequences.
FixasequenceS
c
asthecenter,andcomputek 1
pairwise optimalalignmentsA
i
, one foreach each
S
i
aligned with S
c
. Thesealignmentscan thenbe
mergedintoasingle alignmentA(c),byputtingin
thesamecolumnallthelettersthatarealignedto
thesameletterofS
c .
Note that each A
i
optimally aligns S
i to S
c
(therebyrealizingtheireditdistance),anditistrue
alsoinA(c)thateachsequenceisalignedoptimally
withS
c ,i.e.
d
A(c) (S
i
;S
c )=d
Ai (S
i
;S
c )=d(S
i
;S
c
) 8i: (14)
Furthermore, for all pairs i;j, the triangle in-
equalityd
A(c) (S
i
;S
j )6d(S
i
;S
c )+d(S
j
;S
c )holds,
since thedistance inanalignmentis ametric over
the sequences (assuming the cost function is a
metric over ). Let c
be such that SP(A(c
))=
min
r
SP(A(r)). Call OPT be the optimal SP
value, and HEUR := SP(A(c
)). By compar-
ing P
16i<j6k d(S
i
;S
j
)(alowerboundtoOPT)to
P
16i<j6k (d(S
i
;S
c
)+d(S
j
;S
c
))
(anupperbound
to HEUR), Guseld proved thefollowing approxi-
mationresult:
Theorem 4.
[7]
HEUR6
2 2
k
OPT.
Guseld'salgorithmcanbegeneralizedsoasto
useanytree,instead ofjusta star,andstillmain-
tainthe2-approximationguarantee. Theapproach
isbased ona connectionto thefollowingNetwork
Designproblem. Givenacompleteweightedgraph
and a spanning tree T, the routing costof T, de-
notedbyr(T),isdenedasthesumoftheweights
of the unique paths, between all pairs of nodes,
identied by T. Computing theminimumrouting
costtree is NP-hard, but a Polynomial Time Ap-
proximationScheme(PTAS)ispossible [36]
.
Torelate routingcostandalignments,consider
theset of sequencesS as thevertexset of a com-
plete weighted graph G, where the weight of an
edge S
i S
j is d(S
i
;S
j
). The following procedure
showsthat,foranyspanningtreeT,onecanbuild
analignmentA(T)whichisoptimalfork 1ofthe
the k
2
pairsofsequences: (i)pickanedgeS
i S
j of
thetreeandalignrecursivelythesequencesonboth
sidesofthecutdenedbytheedge,obtainingtwo
alignments, sayA
i and A
j
; (ii)align optimallyS
i
withS
j
;(iii)mergeA
i andA
j
into acompleteso-
lution,byaligningthecolumnsofA
i andA
j inthe
samewayasthelettersofS
i
areoptimallyaligned
tothelettersofS
j .
Theorem5. ForanyspanningtreeT =(S;E)
of G, there exists an alignment A(T) such that
d
A(T) (S
i
;S
j
)=d(S
i
;S
j
) forall pairsof sequences
S
i S
j 2E.
Furthermore,thealignmentA(T)canbeeasily
computed,as outlinedabove. Bytriangleinequal-
ity,itfollows
Corollary6. ForanyspanningtreeT =(S;E)
of G, there exists an alignment A(T) such that
SP(A(T))6r(T).
FromCorollary6,itfollowsthatagoodtreeto
useforaligningthesequencesisatreeT
suchthat
r(T
) = min
T
r(T). This tree does not guaran-
tee anoptimal alignment, but guarantees the best
possible upper bound to the alignment value, for
eld's alignment). In [6],Fischetti et al. describe
aBranch-and-Pricealgorithmforndingthemini-
mumroutingcosttreeinagraph,whichiseective
for up to k = 30 nodes (a large enough value for
alignmentproblems). Thealgorithmisbased ona
formulationwithbinaryvariablesx
P
foreachpath
P in the graph (an exponential number of vari-
ables), andconstraints thatforce thevariablesset
to1 to betheset ofpathsinducedbyatree. The
pricingproblemissolvedbycomputingO(k 2
)non-
negative shortest paths. Fischetti et al.
[6]
report
on computational experiments of their algorithm
appliedtoalignmentproblems,showingthatusing
thebest routing cost tree instead of the best star
producesalignmentsthatarewithin6%ofoptimal
onaverage.
3.1.3 ConsensusSequences
Given a set S =fS
1
;:::;S
k
g of sequences, an
important problem consists in determining their
consensusi.e., anewsequencewhichrepresentsall
the sequences in the set. Let C be the consen-
sus. To avoid a consensus bias (e.g., due to over-
representedsequencesinthesetS),anappropriate
objectiveisto minimizethe maximumdistance of
C from any sequence in S. This objectiveis NP-
hard [37]
.
Assume allinputsequenceshavelengthn. For
theconsensusprobleminsteadofusingtheeditdis-
tance, the Hamming distance is preferable (some
biological reasons for this can be found in [11]).
Thisdistance isdened,forsequencesAandB of
thesame length, as P
n
i=1
(A[i];B[i]), where X[i]
denotesthei{thsymbolofasequenceX.
The Integer Programming formulation for the
consensus problem has a binary variable x
i
, for
everysymbol2,andevery position 16i6n,
to indicate whether C[i] = (x
i
= 1) or not
(x
i
=0). Themodelreads:
min r (15)
subjectto
X
2
x
i
=1 8i=1;:::;n (16)
n
X
i=1 X
2
(;S
j [i])x
i
6r 8j=1;:::;k
(17)
x
i
2f0;1g 8i=1;:::;n; 82: (18)
Let OPT denote the value of the optimum so-
lution to the program above. An approxima-
Rounding [38]
. Consider the Linear Programming
relaxationof(15){(18),obtainedbyreplacing(18)
withx
i
>0. Letr
betheLPvalueandx
i
bethe
LPoptimal solution,possiblyfractional. An inte-
gersolutionxcanbeobtainedfromx
byrandom-
ized rounding: independently, at each position i,
choosea letter2(i.e.,setx
i
=1andx
i
=0
for 6= ) with probability of x
i
= 1 given by
x
i
. LetHEURbethevalueofthesolutionx, and
Gamma= max
a;b2
(a;b). The rst approxima-
tionresultfortheconsensuswasgivenbyBen-Dor
etal.
[8]
,andstatesthat,foranyconstant>0,
Pr
HEUR>OPT+ r
3OPTlog k
<: (19)
This probabilistic algorithm can be de-
randomized using standard techniques of condi-
tionalprobabilities [39]
.
TheIPmodel(15){(18)wasutilizedinastring
of papers on consensus as well as other, related,
problems [9 11]
, such as nding the farthest se-
quence (i.e.,asequence \asdissimilaras possible"
fromeach sequenceofagivenset S).
The main results obtained on these problems
arePTAS based ontheaboveIPformulation and
RandomizedRounding,coupledwithrandomsam-
pling. In[9,10],Lietal. andMarespectively, de-
scribePTASforndingtheconsensus,oraconsen-
sussubstring. Themain ideabehindtheapproach
isthefollowing. Givenasubsetofrsequencesfrom
S,linethemupinanrnarray,andconsiderthe
positions where they all agree. Intuitively, there
isa highlikelihood thattheconsensus should also
agreewiththem atthesepositions. Hence, allone
needs to do is to optimize on the positions (sym-
bols)wheretheydonotagree,whichisdonebyLP
relaxationandrandomized rounding.
A PTASfor thefarthest sequenceproblem, by
Lanctot et al., is described in [11]. Also this re-
sultisobtainedbyLPrelaxationandRandomized
RoundingofanIPsimilarto(15){(18).
3.2 Sequence vsStructure Alignments
Inordertobuildaprotein,theDNAfromagene
isrstcopiedontothemessengerRNA,whichwill
serveas atemplate fortheprotein synthesis. This
process is called transcription. In RNA, the nu-
cleotideThymine(T)isreplacedbytheequivalent
nucleotideUracile (U). RNA moleculesare single-
stranded and the exposed bases tend to form hy-
drogenbondswithinthesame molecule,according
to Watson-Crick pairing rules (A $ U, C $ G).
Thisbondsleadtostructureformation,alsocalled
secondarystructureoftheRNAsequence.
Determining thesecondary structure from the
nucleotidesequenceisnotaneasyproblem. Forin-
stance,complementary basesmaynothybridize if
theyarenotsuÆcientlyapartinthemolecule. Itis
alsopossiblethatnotallthefeasibleWatson-Crick
pairingscanberealizedatthesametime,andNa-
turechooses themost favorableones according to
an energy minimization which is notyet well un-
derstood.
Allpotentialpairingscanberepresentedbythe
edge set of a graph G
S
= (V
S
;E
S
). The vertex
set V
S
= fv
1
;:::;v
n
g represents the nucleotides
ofS, withv
i
correspondingto thei-thnucleotide.
Each edge e 2 E
S
connects two nucleotides that
may possibly hybridize to each other in the sec-
ondarystructure. For instance, thegraphof Fig.2
describes the possible pairings for the sequence
S = UCGUGCGGUAACUUCCACGA, assuming a pairing
can be achieved only if the bases are at least 8
positions apart from each other. Only a subset
E 0
S
E
S
ofpairings is realizedby theactualsec-
ondary structure. This subsetis drawn in boldin
Fig.2. ThegraphG 0
S
=(V
S
;E 0
S
)describesthesec-
ondary structure. Notethat each node in G 0
S has
degree61.
Fig.2.GraphofRNAsecondarystructure.
TwoRNAsequencesthatlookquitedierentas
strings ofnucleotides, may havesimilar secondary
structure. Agenericsequencealignmentalgorithm,
such as those described in the previous section,
would notspot thestructuresimilarity, andhence
adierentmodelforcomparisonisneededforthis
case.
One possible modelwasproposed in Lenhofet
al.
[12]
(seealsoKececioglu etal.
[5]
). Inthismodel,
a sequence U of unknown secondary structure is
compared to a sequence K of known secondary
structure. This is done by comparing the graph
G
U
=(V
U
;E
U )toG
0
K
=(V
K
;E 0
K ).
Assume V
U
= [n] and V
K
= [m]. An align-
ment of G
U
and G 0
K
is dened by a noncross-
ing matching of W
nm
(which, recalling (8), is the
complete bipartite graph (V
U
;V
K
;L)). Given an
edge e = ij 2 E
U
and two noncrossing lines
l = [i;u] 2 L, h = [j;v] 2 L, we say that l and
h generatee if uv 2 E 0
K
. Hence, two lines gener-
ateapossiblepairing(edgeofE
U
)iftheyalignits
endpointstoelementsthathybridizetoeach other
inthesecondary structureofK.
For each line l = [i;j] 2 L let w
l
be the
protofaligningthetwoendpointsoftheline(e.g.,
w
l
= (U[i];K[j])). Furthermore, for each edge
e 2 E
U , let w
e
be a nonnegative weight associ-
ated to theedge, measuring the \strength"of the
correspondingbond. TheobjectiveoftheRNASe-
quence/Structure Alignment (RSA) problem is to
determineanalignmentwhichhasmaximumcom-
bined value: the value is given by the sum of the
weights of the alignment lines and the weights of
theedgesinE
U
that theselinesgenerate.
Let bean arbitrarytotal orderdened over
theset ofall lines L. LetG betheset of pairs of
lines(p;q)withpq,suchthateachpair(p;q)2G
generatesan edge in E
U
. For each (p;q)2G, de-
ne w
pq := w
e
, where p = [i;u], q = [j;v] and
e=ij2E
U .
TheIntegerProgrammingmodelin[12]forthis
alignment problem hasvariablesx
l
forlines l 2L
andy
pq
forpairsoflines(p;q)2G:
max X
l 2L w
l x
l +
X
(p;q)2G w
pq y
pq
(20)
subjectto
x
l +x
h
61 8l;h2Ljlandhcross
(21)
y
pq 6x
p
8(p;q)2G (22)
y
pq 6x
q
8(p;q)2G (23)
x
l
;y
pq
2f0;1g 8l2L; 8(p;q)2G:
(24)
Themodelcan bereadilytightenedasfollows.
First,thecliqueinequalities(9)canbereplacedfor
l 2 L, let G
l
= f(p;q) 2 Gjp =l_q = lg. Each
element(i;j)ofG
l
identiesa pairinge2E
U such
thatlisoneofthegeneratinglinesofe. Notethat
ifanedgee2E
U
canbegeneratedbylanda2L
and,alternatively,bylandb6=a2L,thenaandb
mustcross(inparticular,theyshareanendpoint).
Hence, of all the generating pairs in G
l
, at most
onecanbeselectedinafeasiblesolution. Thenat-
uralgeneralization forconstraints(22)and (23)is
therefore
X
(a;b)2Gl y
ab 6x
l
8l2L: (25)
A Branch-and-Price based on the above in-
equalities is described in [5, 12]. The method
wasvalidatedbyaligningRNAsequencesofknown
structure to RNA sequences of known structure,
butwithoutusingthestructureinformationforone
ofthetwo sequences. Inallcases tested,thealgo-
rithmretrievedthecorrectalignment,ascomputed
byhandbythebiologists. Forcomparison,a\stan-
dard"sequencealignmentalgorithm(whichignores
structure in both sequences) was used, and failed
toretrievethecorrectalignments.
Inasecondexperiment,themethodwasusedto
alignRNA sequencesofunknownsecondarystruc-
ture, of upto 1,400 nucleotides each, to RNA se-
quences of known secondary structure. The opti-
mal solutions found were compared with alterna-
tivesolutions foundby\standard" alignment, and
their merits were discussed. The results showed
thatstructureinformationisessentialinretrieving
biologicallyrelevantalignments.
3.3 Structure vsStructure Alignments
Aproteinisachainofmoleculesknownasamino
acids, or residues, which folds into a peculiar 3-D
shape (called its tertiary structure) under several
molecular forces (entropic, hydrophobic, thermo-
dynamic). A protein's fold is perhaps the most
important of all protein's features, since it deter-
mineshowtheproteinfunctionsandinteractswith
other molecules (in fact, most biological mecha-
nisms{e.g.,binding{attheproteinlevelarebased
onshape-complementarity).
Thecomparisonofproteinstructuresisaprob-
lem of paramount importance in structural ge-
nomics, with applications to protein clustering,
function determination and assessment of pro-
tein fold predictions. An increasing number of
approaches for structure comparison have been
proposed over the past years (see Lemmen and
[40]
structureclassicationservers(themostimportant
of which is the Protein Data Bank, PDB [41]
have
beendesignedandareextensivelyusedinpractice.
Loosely speaking, the structure comparison
problem is the following: Given two 3-D protein
structures, determine how similar they are. The
comparison of two structures requires to \align"
them in some way. Since, by theirnature, three-
dimensionalcomputationalproblemsareinherently
more complex than the similar one-dimensional
ones, there is a dramatic dierence between the
complexity of sequence alignment and structure
alignment. Notsurprisingly,varioussimpliedver-
sions of the structure comparison problems were
shownNP-hard [42]
.
A popular structure comparison measure is
based on the comparison of the proteins contact
maps. Acontactmap isa 2-D representation ofa
3-Dprotein structure. When a proteins folds,two
residuesthatwerenotadjacentintheprotein'slin-
earsequence,mayendupclosetoeachotherinthe
3-D space. The contact map of a protein with n
residues is dened as an nn 0-1 matrix, whose
1{elementscorrespondtopairsofresiduesthatare
withina \small"distance(typicallyaround5
A)in
theprotein'sfold,butarenotadjacentinthelinear
sequence. Wesay that such a pair ofresidues are
in contact. Acontact map can alsobeconsidered
the adjacencymatrix of a graph G. Each residue
is a node of G, and there is an edge between two
nodesifthecorresponding residuesarein contact.
TheContactMapOverlap(CMO)problemtries
to evaluate the similarity in the 3-D folds of two
proteinsbydeterminingthesimilarityoftheircon-
tactmaps(therationale being thata highcontact
mapsimilarityisagoodindicatorofhigh3-Dsim-
ilarity). This measurewas introduced in [43], and
its optimization was provedNP-hard in [42], thus
justifyingtheuseofsophisticatedheuristicsorenu-
merativemethods.
Given two folded proteins, the CMO problem
calls for determining an alignment between the
residues oftherstproteinand ofthesecondpro-
tein. Thegoalistondthealignmentwhich high-
lights the largest set of common contacts. The
value of an alignment is given by the number of
pairs of residues in contact in the rst protein
which are aligned with pairs of residues that are
also in contact in the second protein. Given the
graph representation for contact maps, we can
phrase the CMO problem in graph{theoretic lan-
guage. Theinputconsistsoftwoundirectedgraphs
G = (V ;E ) and G = (V ;E ), with V = [n]
and V
2
= [m]. For each edge ij, with i < j, we
distinguishaleftendpoint(i)andarightendpoint
(j), and therefore we denote the edge by the or-
dered pair (i;j). A solution of the problem is an
alignment ofV
1 andV
2
, i.e., anoncrossingmatch-
ing L 0
L in W
nm
. Two edges e = (i;j) 2 E
1
and f = (i 0
;j 0
) 2 E
2
contribute a sharingto the
value of an alignment L 0
if l = [i;i 0
] 2 L 0
and
h = [j;j 0
] 2 L 0
. In this case, we say that l and
h generate thesharing (e;f). TheCMO problem
consists in nding analignment which maximizes
the number of sharings. Fig.3 shows two contact
mapsandanalignmentwith5 sharings.
Fig.3. Analignmentofvalue5.
AsuccessfulapproachfortheCMOsolutionre-
liesonformulatingtheproblemasanInteger Pro-
gramandsolvingitbyBranch-and-Price. Thefor-
mulation is based on binary variables x
l
for each
linel,andy
pq
forpairsoflinesp;qwhich generate
asharing.
The model, proposed by Lancia et al.
[13]
, is
very similar to theformulationfortheRSA prob-
lem, described in Section 3.2. We denote with G
the set of all pairs of lines that generate a shar-
ing. For each pair (l;t) 2 G, where l = [i;i 0
];
and t = [j;j 0
], the edge (i;j) 2 E
1
and the edge
(i 0
;j 0
)2E
2
. Foralinel2L, letG
l
=f(p;l)2Gg
andG +
l
=f(l;p)2Gg. ThemodelforCMOis
max X
(l ;t)2G y
l t
(26)
subjectto theclique inequalities (9) and thecon-
straints
X
(p;l )2G
l y
pl 6x
l
8l2L (27)
X
(l ;p)2G +
l y
l p 6x
l
8l2L (28)
x
l
;y
pq
2f0;1g 8l2L; 8(p;q)2G:
(29)
A Branch-and-Price based on the above con-
straints, as well as other, weaker, cuts, was used
in [13] to optimally align, for the rst time, real
proteinsfromthePDB.Thealgorithmwasrun on
597 protein pairs, with sizes ranging from 64 to
72 residues. Within the time limit of 1 hour per
instance, 55 problems were solved optimally and
for421problems(70percent)thegapbetweenthe
bestsolutionfoundandtheupper bound was65,
therebyprovidinga certicateofnear-optimality.
Inasecondexperiment,theCMOmeasurewas
validated from a biological point of view. A set
of 33 proteins, known to belong to four dierent
families, were pairwise compared. The proteins
werethenre-clusteredaccordingtotheircomputed
CMOvalue. Inthenewclustering,notwoproteins
wereputinthesameclusterwhichwerenotinthe
same familybefore(a 0% falsepositives rate)and
onlyfewpairsofproteinsthatbelongedtothesame
family were put in dierent clusters (a 1.3% false
negativesrate).
4 GenomeRearrangements
Thanks to the large amount of genomic data
that has become available in the past years, it is
now possible to compare the genomes of dierent
species,in orderto ndtheirdierencesandsimi-
larities. An importantapplication ofthis problem
is,e.g.,thecomparisonofmouseandmangenomes,
sincenewdrugsaretypicallytestedonmicebefore
humans.
In principle, a genome can be thought of as a
(very long) sequence and hence one may want to
comparetwogenomesviaasequencealignmental-
gorithm. However, sequence alignment is not the
right model for large-scale genome comparisons,
and,atthegenomelevel,dierencesshouldbemea-
suredin termsofrearrangementsoflong DNAre-
gionswhichoccurredduring evolution.
Amongthemainrearrangementsknownarein-
versions, transpositions, and translocations. Each
oftheseeventsaectsalongfragmentofDNAona
chromosome. Whenaninversionoratransposition
occurs,a DNAfragmentisdetachedfrom itsorig-
inal position and then is reinserted, on the same
chromosome. In an inversion, it is reinserted at
thesameplace,butwithoppositeorientationthan
it originally had. In a transposition, it keeps the
original orientation but ends upin a dierentpo-
sition. A translocation causes a pair of fragments
to beexchanged between theends oftwo chromo-
somes. Fig.4 illustrates these events, where each
stringrepresentsachromosome.
ATTGTTataggttagAATTG ATTgtttataGGCTAGATCCGCCAGA CTGGATgcaggcat TCATTGAaata
# # #
ATTGTTgattggataAATTG ATTGGCTAGATCCGCgtttataCAGA CTGGATaata TCATTGAgcaggcat
(Inversion) (Transposition) (Translocation)
Fig.4.Evolutionaryevents.
Since evolutionary events aect long DNA re-
gions (several thousand bases) the basic unit for
comparison is not the nucleotide, but rather the
gene.
The generalGenome Comparison Problemcan
beinformally stated as: Given two genomes (rep-
resentedby two sets of ordered listsof genes, one
listper chromosome)nda sequenceofevolution-
aryeventsthat,appliedtotherstgenome,turnit
intothesecond. Undera commonparsimonyprin-
ciple, the solution sought is the one requiring the
minimumpossiblenumberof events.
In the past decade, people have concentrated
onevolutionbymeansofsomespeciceventalone,
and haveshown that even these special cases can
bealreadyveryhardtosolve. Sinceinversions are
considered the predominant of all types of rear-
rangements,theyhavereceivedthemostattention.
Forhistoricalreasons, theyhavebecomeknownas
reversalsin thecomputer sciencecommunity.
In the remainder of this section, we will out-
lineasuccessfulIntegerProgrammingapproachfor
thesolutionofcomputingtheevolutionarydistance
between two genomes evolved by reversalsonly, a
problem known as Sorting by Reversals. The ap-
proach isduetoCapraraet al [14;15]
.
Twogenomesarecomparedbylookingattheir
common genes. After numbering each of n com-
mon geneswith a unique labelin f1;:::;ng,each
genomeisapermutationoftheelementsf1;:::;ng.
Let = (
1 :::
n
) and = (
1 :::
n ) be
twogenomes. Bypossiblyrelabelingthegenes,we
can alwaysassume that =:=(1 2 ::: n), the
identitypermutation. Hence,theproblembecomes
turninginto, i.e.,sorting.
A reversal is a permutation , with 1 6 i <
j6n,dened as
ij=(1 ::: i 1 jj 1 ::: i+1i j+1 ::: n):
reversed
By applying
ij
to , one obtains
(
1 :::
i 1
;
j
j 1 :::
i
;
j+1 :::
n
), i.e., the
orderoftheelements
i
;:::;
j
hasbeen reversed.
Let G =f
ij
;1 6i <j 6ng. Each permutation
can beexpressed as a product
1
2
::: D
with
i
2 G for i = 1:::D. The minimum value D
such that
1
2
::: D
= is called the reversal
distance of , and denoted by d(). Sorting by
Reversals(SBR)istheproblemofndingd()and
a sequence 1
2
::: d()
satisfying the previous
equation.
A majorstep towards thepractical solution of
the problem was made by Bafna and Pevzner [44]
,
who investigated the relations between SBR and
the breakpoints of . A breakpoint is given by a
pair of adjacent elements in that are not adja-
centin , thatis, thereis a breakpointat position
i,ifj
i
i 1
j>1. Letb()denotethenumber of
breakpoints. A trivial bound is d() > db()=2e,
since a reversal can remove at most two break-
points, and hasnobreakpoints. However,Bafna
and Pevzner showed how to obtain a much bet-
ter bound from the breakpoint graph G(). G()
hasanodeforeach elementof andedgesoftwo
colors, say red and blue. There is a red edge be-
tween
i and
i 1
foreachpositioniatwhichthere
is a breakpoint, and a blue edge between each h
and k such that jh kj = 1 and h;k are not ad-
jacent in . Each node has the same number of
red andblue edges incidenton it (0, 1 or 2), and
G()canbedecomposedintoasetofedge-disjoint
color-alternating cycles. Note that this cycles are
notnecessarilysimple,i.e., theycan repeatnodes.
Letc()bethemaximumnumber ofedge-disjoint
alternating cycles in G(). Bafna andPevzner [44]
provedthefollowing
Theorem 7.
[44]
For every permutation ,
d()>b() c().
The lower bound b() c() turns out to be
verytight,asobservedrstexperimentally byvar-
iousauthors andthen proved tobealmost always
the case by Caprara [45]
who showed in [46] that
determining c() is essentially the same problem
as determining d(), and that both problems are
NP-hard.
Since computingc()is hard,thelowerbound
b() c()shouldnotbeuseddirectlyinaBranch-
and-Bound procedure for SBR. However, for any
upper bound c 0
() to c(), also the value b()
0
easier to compute. Givenaneective Integer Lin-
earProgramming(ILP)formulationtondc(),a
good upper bound to c()can beobtained byLP
relaxation. LetCdenotethesetofallthealternat-
ing cycles of G() = (V;E), and for each C 2 C
dene a binary variable x
C
. The following is the
ILPformulationofthemaximumcycledecomposi-
tion:
c():=max X
C2C x
C
(30)
subjectto
X
C3e x
C
61 8e2E (31)
x
C
2f0;1g 8C2C: (32)
Caprara et al showed in [14] how to solve the
exponentially large LP relaxation of (30){(32) in
polynomialtimebycolumn-generationtechniques.
ConsiderthedualoftheLPrelaxationof(30){
(32),whichreads
c 0
():=min X
e2E y
e
(33)
subjectto
X
e2C y
e
>1 8C2C (34)
y
e
>0; 8e2E: (35)
Theseparation problemfor(33){(35),whichis
the same as the pricing problem for (30){(32), is
the following: given a weight y
e
for each e 2 E,
ndanalternating cycleC2C suchthat
X
e2C y
e
<1; (36)
orprovethatnoneexists. Callanalternatingcycle
satisfying (36)aviolatedcycle. To solve(33){(35)
inpolynomial time,onemustbeableto identifya
violatedcycleinpolynomialtime,i.e.,analternat-
ing cycle of G() having weight < 1, where each
edge e 2 E is giventhe weight y
e
. This problem
canbesolvedviaareductiontothesolutionofsome
non-bipartite min-costperfect matching problems
in a suitable graph. The reduction is similar to
oneused byGrotschel andPulleyblank fornding
minimum-weight odd and even paths in an undi-
rectedgraph [47]
.
Although polynomial, the solution of the
weighted matching problem on general graphs is
et al. describe a weaker bound, obtained by en-
larging the set C in (30){(32) to include also the
pseudo-alternatingcycles,i.e.,cyclesthatalternate
red and blue edges but may possibly use an edge
twice. With this new set of variables, the pricing
becomesmuchfaster,sinceonlybipartitematching
problemsmustbesolved.
Although weaker than the bound based on
alternating cycles, the bound based on pseudo-
alternating cyclesturns outto be still very tight.
Based on this bound, Caprara et al. developed
a branch-and-price algorithm [15]
which can rou-
tinelysolve,inamatterofseconds, instanceswith
n=200 elements, a large enough size forall real-
life instances available so far. Note that the pre-
viouslimitfor anoptimization algorithmfor SBR
was n = 30. The algorithm in [15] was used to
nd, for the rst time, the optimal solution to a
man vs mouse instance of SBR, and to compare
the mitochondrial genomes of 20 species, nding
theoptimalsolutioninallcases.
5 Single NucleotidePolymorphisms
ADNAregionwhosecontentvariesinastatisti-
callysignicantwaywithinapopulationiscalleda
geneticpolymorphism. Thesmallestpossibleregion
consists ofa single nucleotide, andhence iscalled
a Single Nucleotide Polymorphism, or SNP (pro-
nounced\snip"). ASNPisanucleotidesite,inthe
middleofa DNAregionwhich isotherwise identi-
calforeverybody,atwhichweobserveavariability
inapopulation. Furthermore,thevariabilityisre-
stricted to only two values (called alleles) out of
the four possible (A, G, C, G). It is believed that
SNPsarethe predominant form ofhuman genetic
variation, and their importance cannot beoveres-
timated for medical, drug-design, diagnostic and
forensic applications.
Chrom. c, paternal: ataggtccCtatttccaggcgcCgtatacttcgacgggActata
Chrom. c, maternal: ataggtccGtatttccaggcgcCgtatacttcgacgggTctata
Haplotype 1 ! C C A
Haplotype 2 ! G C T
Fig.5.Achromosomeandthetwohaplotypes.
SinceDNAofdiploidorganismsisorganizedin
pairsofchromosomes,foreachSNPonecaneither
behomozygous(samealleleonbothchromosomes)
or heterozygous(dierentalleles). The values ofa
set of SNPs on a particular chromosome copy is
called a haplotype. Haplotypingconsists in deter-
miningapairofhaplotypes,oneforeachcopyofa
givenchromosome. InFig.5wegiveasimplisticex-
ampleofachromosomewiththreeSNPsites. This
individualisheterozygousatSNPs1and3andho-
mozygous at SNP2. Thehaplotypes areCCA and
GCT.
In recent years, several optimization problems
have been dened for SNP data. Some of these
were tackled by Integer Programming techniques,
which werecallinthissection.
We consider thehaplotyping problems relative
to a single individual and to a population. In
the rst case, the input is inconsistent haplotype
data, coming from sequencing, with unavoidable
sequencing errors. In the lattercase, the input is
ambiguous genotypedata,which species onlythe
multiplicity of each allele foreach individual (i.e.,
erozygous atSNPj,foreachiandj).
5.1 Haplotypinga Single Individual
The process of reading the content of a DNA
moleculeiscalledsequencing. Fortechnicallimita-
tions, only short DNA fragments (a couple thou-
sand bases)can be sequenced. Furthermore, even
with today's best possible instruments, sequenc-
ingerrors are unavoidable. These consistin bases
which have been misread or skipped altogether.
Anothersourceofproblemscanbethepresenceof
contaminants, i.e. DNAcoming from another or-
ganism which was wrongly mixed with the DNA
to be sequenced. In this framework, the Single
Individual Haplotyping Problem can beinformally
statedas: \giveninconsistenthaplotypedatacom-
ing from fragment sequencing, remove the errors
so as to retrieve a consistent pair of haplotypes."
Themathematicalframeworkforthisproblemisas
follows. Denote thetwovalues thateachSNPcan
takeby0 and 1. A haplotype, i.e. a chromosome
contentprojectedonasetofSNPs,isthenabinary
vector. Let S = f1;:::;ng be a set of SNPs and
is covered by some of the fragments, on which it
can take the values 0 or 1. A fragment f can be
represented by a vector in f0;1; g n
. For a pair
(f;s)2FS,f[s]denotes thes-thcomponentof
f. When f[s] = , it meansthat the fragment f
does notcover theSNPs.
Agaplessfragmentisonecovering asetofcon-
secutiveSNPs(i.e.,betweenanytwocomponentsin
f0;1g,thereareonlycomponentsinf0;1g),other-
wise,thefragmenthasgaps. Therecanbegapsfor
two reasons: (i) thresholding of low-quality reads
(i.e.,whenthesequencercannotcall aSNP0 or1
withenough condence, thecomponentis marked
witha "-"); (ii)mate-pairing inshotgun sequenc-
ing(amatepairisapairoffragmentscomingfrom
thesamechromosomecopy. Thispairisequivalent
toa singlefragmentwith onegapin themiddle).
Fig.6. (a)AsetoffragmentsF. (b)Thefragmentcon ict
graphG
F .
Two fragments f and g are said to bein con-
ictifthereexists aSNPksuchthatf[k]=0 and
g[k] =1 or f[k] = 1 and g[k] = 0. The fragment
con ict graph is the graph G
F
= (F;E
F
) which
has an edge for each pairof fragments in con ict
(seeFig.6). Thedataareconsistentwiththeexis-
tenceoftwohaplotypes (oneforeachchromosome
copy)ifandonlyifG
F
isabipartite graph. Inthe
presence of contaminants, or of some \bad" frag-
ments(fragmentswithmanysequencingerrors),to
correct thedata one must face theproblem of re-
movingthefewestnumberoffragmentssothatG
F
becomesbipartite. ThisproblemiscalledtheMin-
imum Fragment Removal (MFR) problem. When
thefragmentscancontaingaps,theproblemisNP-
hard [16]
. Forthisproblem,thefollowingIPformu-
lationcanbeadoptedsoastominimizethenumber
offragments(nodes)removedfromG
F
tomakeG
F
bipartite [17]
. Introducea 0-1 variable x
f
for each
fragment. Thevariablesforwhichx
f
=1inanop-
timalsolutiondenethenodestoremove. LetCbe
theset ofalloddcyclesin G
F
(M). Sincea graph
is bipartite if and onlyif there are noodd cycles,
one must removeat least one node from each cy-
Programming(IP)formulation:
min X
f2F x
f
(37)
s.t.
X
f2C x
f
>1 8C2C (38)
x
f
2f0;1g 8f2F: (39)
The LP relaxation of (37){(39) can be solved
in polynomial time, by solving the following sep-
aration problem: Given fractional weights x
for
the fragments, nd an odd cycle C for which
P
f2C x
f
< 1. This can be obtained as follows.
First,buildanewgraphQ=(W;F)made,loosely
speaking,oftwocopiesofG
F
. Foreachnodei2V
there are two nodes i 0
;i 00
2W, andfor each edge
ij2E,therearetwoedgesi 0
j 00
;i 00
j 0
2F. Edgesin
F areweighted,withweightsw(i 0
j 00
)=w(i 00
j 0
):=
x
i
=2+x
j
=2. Byaparityargument,apathP inQ
ofweightw(P),fromi 0
toi 00
,correspondstoanodd
cycle C through i in G
F , with
P
j2C x
j
=w(P).
Hence,theshortestpathcorrespondstothe\small-
est"oddcycle,andiftheshortestpathhasweight
> 1 then there are no violated inequalities (38).
Otherwise, the shortest path individuates a vio-
latedinequality.
Tondgoodsolutions ofMFR inashorttime,
Lippert et al.
[17]
have also experimented with a
heuristicprocedure, based on theaboveLPrelax-
ation andrandomized rounding. First,theLPre-
laxation of (37){(39)is solved, obtaining an opti-
mal solution x
. If x
is fractional, each variable
x
f
is rounded-up to 1 with probabilityx
f
. If the
solution is feasible, stop. Otherwise, x to 1 all
thevariablesthatwererounded-upsofar(i.e. add
theconstraintsx
f
=1)anditerateanewLP.This
methodconvergedvery quicklyonalltestsonreal
data,withasmallgapinthenalsolution(anup-
perboundto theoptimum)andtherstLPvalue
(a lower bound to the optimum). On real data
comingfromshotgunsequencing, outofabout800
fragments, only less than 5 fragments had to be
thrownouttoretrievethecorrecthaplotypes.
5.2 Haplotypinga Population
Becauseof economical reasons, when polymor-
phism screens areconducted onlarge populations
it is not feasible to examine the two copies of
each chromosome separately, andgenotype, rather
thanhaplotype,datais usuallyobtained. A geno-
Foranyindividual,ateachSNP,threepossibilities
arise: homozygousfortheallele0,homozygousfor
the allele 1, or heterozygous (a situation denoted
by the symbol 2). Hence a genotype can be rep-
resented by a vector in f0;1;2g n
, where each po-
sitionwith a 2is called anambiguousposition. A
genotypeg2f0;1;2g n
isresolvedbyapairofhap-
lotypes h;q 2 f0;1g n
, written g = hq, if for
eachSNPj,g[j]2f0;1gimpliesh[j]=q[j]=g[j],
and g[j] = 2 implies h[j] 6= q[j]. A genotype is
called ambiguous if it has at least two ambiguous
positions(agenotypewithatmostoneambiguous
positionscanberesolveduniquely).
ThePopulationHaplotypingProblemisthefol-
lowing: GivenasetGofmgenotypesovernSNPs,
nd aset Hof haplotypessuchthat each genotype
is resolved by one pair of haplotypes in H . To
turn this problem into an optimization problem,
one hasto specify theobjectivefunction. One ob-
jectivewhichhasbeenstudiedbyGuseld [18;19]
,is
based on a greedy algorithm for haplotype infer-
ence, known asClark's rule [48]
. Thegoal isto de-
rivethehaplotypesinHbysuccessiveapplications
ofthefollowinginferencerule.
Inference Rule. Given a genotype g and a
compatible haplotype h(i.e.,a haplotypehwhich
agreeswith gat allnon-ambiguous positions), ob-
tain a new haplotype q, such that g = hq, by
setting q[j] = 1 h[j] at all ambiguous positions
andq[j]=h[j]at theremainingpositions.
Inordertoresolvethegenotypes inG,Clarkin
[48]proposedthefollowing,nondeterministic,pro-
cedure: LetG 0
be theset of non-ambiguous geno-
types. StartwithsettingG :=G G 0
andHtobe
theset of haplotypes obtained from G 0
. Then, re-
peatthefollowing: takea g2G and acompatible
h 2 H and apply the inference rule, obtaining q.
SetG:=G fgg,H:=H[fqganditerate. When
nosuchg andhexist, thealgorithmhassucceeded
ifG=; andfailedotherwise.
The non-determinism in thechoice of the pair
g,htowhichtheinferenceruleisappliedcanlead
to unsuccessful runs even for instances in which
a there exists an order of application of the rule
which would resolve all genotypes in G. To nd
thebest possible order ofapplication of theinfer-
ence rule, Guseld considered the following opti-
mizationproblem: ndtheorderingofapplication
oftheinferencerulethatleaves thefewestnumber
ofunresolvedgenotypesin theend.
To solve the problem, Guseld [18;19]
proposed
anIntegerProgrammingapproach. Theproblemis
(N;A), dened as follows. Let N =
g2G N(g),
where N(g) := f(h;g)jhiscompatiblewithgg.
N(g) corresponds to the set of possible haplo-
types obtainable by setting each ambiguous posi-
tion of a genotype g to one of the 2 possible val-
ues. Let N 0
= S
g2G 0
N(g)correspond to thesub-
set ofhaplotypes unambiguously determinedfrom
the set G 0
of unambiguous genotypes. For each
pair v = (h;g 0
), w = (q;g) in N, there is an arc
(v;w)2Aifg isambiguous,g 0
6=gandg=hq.
Any directed tree rooted at a node v 2 N 0
speci-
es a feasible history of successiveapplications of
theinferencerulestartingatnodev2N 0
,andthe
problem can be stated as: Findthe largest num-
ber of nodes in N N 0
that can be reached by
a set of node{disjoint directed trees, where each
treeisrootedat anodein N 0
andwhereforevery
ambiguous genotypeg, at most one nodein N(g)
is reached. For solving this problem, Guseld [18]
proposed the following formulation, based on 0{1
variablesx
v
associatedtothenodesinN:
max X
v2N N 0
x
v
(40)
subjectto
X
v2N(g) x
v
61 8g2G G 0
(41)
x
w 6
X
v2Æ (w) x
v
8w2N N 0
(42)
withx
v
2f0;1gforv2N.
Guseld observedthattheLPsolution,onreal
and simulated data, was almost always integer,
thus requiring no branching in the branch and
boundsearch. Also,although thereisapossibility
of an integer solution containing directed cycles,
this situation never occurred in the experiments,
and hence there was no not need to add subtour
elimination-type inequalities to the model. The
abovemodelwasusedtoresolverealandrandomly
generatedinputinstancesofupto80genotypes,in
amatterofseconds.
OneofthereasonsforthepopularityofClark's
rule, is a common belief among biologists that
this ruleoften produces thesmallestset of haplo-
types foragivenset ofgenotypes. However,when
the objective is to minimize the number of dier-
ent haplotypes, itis easy to show examples of in-
stances on which Clark's rule behaves extremely
poorly. This new objective for population haplo-
typing is called parsimony, and requires to deter-