In fact, many of the prob- lems consideredcan becastas optimization prob- lems,tobeattackedbystandardoptimizationtech- niques

(1)

Integer Programming Models for Computational Biology Prob-

lems

GiuseppeLancia

D.I.M.I,UniversitadiUdine,VialedelleScienze206,33100 Udine,Italy

E-mail: lancia@dimi.uniud.it

ReceivedMay26,2003;revisedAugust21,2003.

Abstract TherecentyearshaveseenanimpressiveincreaseintheuseofIntegerProgram-

mingmodels forthesolutionof optimization problemsoriginatinginMolecularBiology. Inthis

survey,we describe someofthemost successfulIntegerProgrammingapproaches, whilegivinga

broadoverviewofapplicationareasinmodernComputationalMolecularBiology.

Keywords integerprogramming,computationalbiology,sequencealignment,genomerear-

rangements,proteinstructures

1 Introduction

Computational (Molecular) Biology is a rel-

atively young science, which has experienced a

tremendous growth in the last decade. A sign

of this growth is the fact that a steadily increas-

ingnumber ofpeoplewithdierentscienticback-

groundshavebeenapproachingthiseldinthere-

centyears. ItisnowcommontoseepapersonCom-

putational Biology problems written not only by

computer scientistsand biologists,but also statis-

ticians, physicists and mathematicians, pure and

applied. Inparticular,theapparentlyendlesscapa-

bilityofComputationalBiologytoprovideexciting

combinatorial problems, has become very appeal-

ingtotheMathematicalProgramming/Operations

Research community. In fact, many of the prob-

lems consideredcan becastas optimization prob-

lems,tobeattackedbystandardoptimizationtech-

niques.

Consider thetypical approachtotackleacom-

putational biology problem, which is as follows.

First,a (ratherdiÆcult)modelinganalysis is per-

formed, inorderto mapthebiological processun-

der investigation into one or more mathematical

objects (such as graphs, sets, arrays:::). After

thistransformation,theoriginalquestion concern-

ing the biological data becomes a mathematical

question abouttheobjects ofthechosen represen-

tation. Sometimesthisquestioncanbesimplycast

as \does there exist at least one objectwith such

andsuchmathematicalproperties?". Thisisafea-

sibility,orexistenceproblem(whichcan besolved,

e.g., by using powerful arguments from combina-

torics -such as the pigeonhole principle-, or from

analgorithmwhichbuildsa solution,ifoneexists,

andprovingitscorrectness).

However, many other times (most times, we

wouldsay),eachobject,representingatentativeso-

lution,hasassociatedanumericalweight,intended

tomeasureitsquality. Theweightisproportional,

loosely speaking, to the probability that the solu-

tionisthe\true" answer to theoriginalbiological

problem. Thehighertheweight,themorereliablea

solutionis. Sincetheproblemwasmappedintothe

domainofdiscretemathematics,alsotheweightis

represented by some mathematical function f(x).

Inordertondthebestanswer totheoriginalbi-

ologicalproblem, onehasthento solvethefollow-

ingoptimizationproblem: nda solutionx

which

maximizesf(x)over allpossible solutions.

Optimization problems have been a subject of

studyfordecadesinthemathematicalcommunity.

In particular, the whole eld of Operations Re-

search is devoted to the subject, and many very

successful techniques have been developed over

thepastyearsfordiÆcult(NP-hard)optimization

problems. For such diÆcult problems, very rigor-

ousmethodshavebeendevisedthatarepractically

(i.e.,withina\small" time)capableofndingthe

optimalsolutionfordataderivedfromreal-life ap-

plications.

The mostsuccessful approach to the exact so-

lution of hard optimization problems is proba-

bly Integer Linear Programming (for brevity, In-

teger Programminghereafter), which hasbeen ap-

pliedprotablyforhundreds ofproblemsfromreal

domains [1 3]

. TheInteger Programmingapproach

consistsinformulatingaproblemasthemaximiza-

tion(minimization)ofalinearfunctionofsomein-

(2)

bound,wheretheupper(lower)boundcomesfrom

theLinearProgramming (LP)relaxation. TheLP

relaxation,inwhichthesamefunctionisoptimized

but thevariables are notrestricted to be integer,

is polynomially solvable. A formulation is as suc-

cessfulasthestrengthofitsLPbound. That is,if

thevalue ofthe objectivefunction over the relax-

ationisclosetothevalueovertheintegers,thenthe

bound,andhence thepruningofthesearch space,

is eective. It is often the case that in order to

obtainbetterbounds,theformulationisreinforced

by the use of additional constraints, called cuts,

and the resulting approach is known as Branch-

and-Price. Cuts areconstraintsthat do notelim-

inate any feasible integer solution, but make the

space of fractional solutions smaller, this way im-

provingthevalueoftheLPbound.

Besides its use to solvea problem exactly, the

IntegerProgrammingapproachcanalsobeapplied

to compute provably good feasible solutions. In

fact, at any point of the search process, by using

the LP bound, we have an estimate of the maxi-

mumerrorbetweenthevalueofthecurrentsolution

andtheoptimum. Thesearchcanthenbestopped

assoonasthemaximumerrorbecomesacceptably

small. Furthermore, approximation-guarantee al-

gorithmswhicharebasedonIntegerProgramming

formulation are also possible. These are polyno-

mialalgorithmswhichexploittheLPrelaxationof

aproblemtondafeasiblesolutionwithinaguar-

anteedmaximumerrorfromtheoptimum. Wewill

give examples of results of all these types in the

followingsections.

The use of Integer Programming approaches

to Computational Biology problems has been in-

troduced, starting about seven years ago, by sev-

eralpeoplewithMathematicalProgrammingback-

ground (among which A. Caprara, B. Carr, D.

Guseld, R. Karp, J. Kececioglu, G. Lancia, H.

P. Lenhof,P. Mutzel, R.Ravi, K.Reinert and M.

Vingron)andhastodaybecomeacommontoolfor

approachinga newproblem.

Inthis survey wewill reviewsome ofthemain

results that wereobtained withthe useof Integer

Programming approaches. Theproblems, coming

from dierent areas, touch almost every aspect of

modernComputationalBiology.

The material is organized in sections. In each

section, we describe one or more problems of the

same nature, and the Integer Programming ap-

proaches that were proposed in the literature for

theirsolution. Manyoftheseapproachesarebased

ponential number of inequalities (or variables),

that aresolved by Branch-and-Price(respectively,

Branch-and-Price). Section 2is devoted to a brief

description of the general principles of the theory

ofInteger Programming.

The remainder of the paper is then organized

as follows. In Section 3 we outline some Integer

Programming approaches to Alignment Problems.

Wereviewseveralproblemsrelatedwiththealign-

mentofgenomicobjects,suchasDNAorRNAse-

quences and secondary or tertiary structures. We

distinguishthreetypes ofalignment:

(i) Sequencevs SequenceAlignment: Thissub-

section is devoted to the problem of aligning ge-

nomic sequences. First, we discuss the classical

multiple alignment problem, studied by Reinert

et al.

[4]

and Kececioglu et al.

[5]

, whodeveloped a

Branch-and-Pricealgorithmforitssolution. Then,

wedescribeaheuristicapproachforthesameprob-

lem,byFischettietal.

[6]

. Theapproachgeneralizes

anapproximationalgorithmbyGuseld [7]

anduses

a reduction to the Minimum Routing Cost Tree

problem, solved by Branch-and-Price. Closely re-

lated to the alignment problem, are the so called

\center"problems,inwhichonetriestodetermine

a sequenceascloseoras faraspossiblefrom aset

of inputsequences. For this problem, wedescribe

an approximation algorithm based on LP relax-

ation and Randomized Rounding, by Ben-Dor et

al.

[8]

. Wealsodiscusssomeworksthat,developing

similar ideas, obtain PTAS forthe consensus and

thefarthest-sequenceproblems(Lietal.

[9]

,Ma [10]

,

Lanctotetal.

[11]

).

(ii) Sequence vs Structure Alignment: In this

subsectionwedescribetheproblemofaligningtwo

RNA sequences, when for one of them the sec-

ondary structureis alreadyknown. For this prob-

lem, wedescribe a Branch-and-Price approach by

Lenhofetal.

[12]

andKececioglu etal.

[5]

.

(iii) Structure vs Structure Alignment: Inthis

subsection we consider the problem of comparing

two protein tertiary structures. For this problem,

wereview aBranch-and-Priceapproach, proposed

byLanciaetal.

[13]

.

Section4isdevotedtoGenomeRearrangements

problems. In this section we describe a Branch-

and-Price approach, by Caprara et al.

[14;15]

for

computing the evolutionary distance between two

genomes evolved from a common ancestor by in-

versions oflongDNAregions.

In Section5 wediversities(polymorphisms)at

the genomic level. The SNP (Single Nucleotide

(3)

determiningthevaluesofasetofpolymorphicsites

in a genome. First, wediscuss the single individ-

ual haplotyping problem, for which some Integer

Programming techniques were employed by Lan-

cia et al.

[16]

and Lippert et al.

[17]

. Then, we re-

view two versions of the problem relative to hap-

lotyping entire populations, studied primarily by

Guseld [18 20]

. Theapproachesarebasedemploys

a reduction to a graph problem, which is then

solvedbyIntergerProgramming.

InSection 6 wementionsome approachesthat

were left out for space reasons [21;22]

. Finally, we

show how theInteger Programming approach can

also be applied indirectly, i.e., when a very well

known optimization problem (solved by Integer

Programming techniques) isused to solve a prob-

lem in Computational Biology. This is the case,

e.g.,fortheTSPproblem,usedtosolvetheRadia-

tionHybridMappingproblem(Agarwalaetal.

[23]

).

2 Integer Linear ProgrammingBasics

An Integer Linear Programming (IP) problem

is dened by an m n real constraint matrix

A = (a

ij

), a real 1n cost vector c and a real

m1rhsvectorb. Theproblemconsistsinnding

the values for n variables x = (x

1

;:::;x

n ) which

optimize(e.g.,minimize)

min n

X

j=1 c

j x

j

(1)

subjectto

n

X

j=1 a

ij x

j

>b

i

8i=1;:::;m (2)

andwitheachx

i

beinganon-negative,integernum-

ber. Thislastconstraintiswrittenas

x>0; x2Z n

: (3)

It is easyto show that the IPproblem is NP-

hard. However,itsLinearProgrammingrelaxation,

obtainedbyreplacingtheintegralityconstraints(3)

with

x>0 (4)

ispolynomiallysolvable [24;25]

. Letz

IP

betheopti-

mumvalueof(1),(2)and(3)andz

LP

betheopti-

mumvalueof(1),(2)and(3). Thevaluez

IP z

LP

is calledthe integrality gap oftherelaxation. The

successofanIPmodelrelies intheintegralitygap

To formulate a combinatorial optimization

problemas anIPproblem, onehasto denea set

of variables, linear constraints anda linear objec-

tivefunction,suchthattheoptimalsolutiontothe

problem of minimizing (1) subject to (2) and (3)

identiestheoptimalsolution oftheoriginalprob-

lem. Manytimesthenaturalchoicefortheinteger

variablesasksthemtotakeonlythevalues0and1

(binaryvariables). Thesevalues areusedtorepre-

sentthedecisionoftakingornotacertainelement

aspart oftheoptimalsolution.

SometimesthetightestIPformulationsrequire

a very large (exponential with respect to the size

of the original problem) number of constraints.

A fundamental paper by Grotschel, Lovasz and

Schrijver [26]

states that even exponential-size LP

relaxationscan besolvedin polynomial time,pro-

vided that separation can be done in polynomial

time. Separation consists in solving the LP with

onlyasubsetoftheconstraints(2),andthencheck-

ing if any of the constraints that were left out is

violatedbytheoptimalsolution. Ifthisisthecase,

theseparationprocedurendsonesuchconstraint,

whichisthenaddedtothecurrentLP,tobesolved

again. By using the ellipsoid method as the LP

solver,Grotschel,LovaszandSchrijvershowedthat

theaboveprocedureconverges inpolynomialtime.

The Branch-and-Bound approach for IP formula-

tionsin which inequalities areadded through sep-

aration is called Branch-and-Price. In fact, each

added inequality iscalled a \cut"since it cutso

an LP solution which is not feasible for all con-

straints.

An alternativeway toobtain tightIPformula-

tion is sometimes through the use of a very large

(exponential with respect to the size of theorigi-

nalproblem)numberofvariables [27]

. Variablesre-

late to constraints through duality. The dual of

a Linear Program (called the primal) dened by

(1),(2)and(4)isthefollowingLPinthevariables

y=(y

1

;:::;y

m ):

max m

X

i=1 b

i y

i

(5)

subjectto

m

X

i=1 a

ij y

i 6c

j

8j=1;:::;n (6)

y>0: (7)

Note howthe role of variablesand constraints

(4)

InordertosolveanLPformulationwithanex-

ponentialnumber ofvariables in polynomial time,

itis suÆcientto designa polynomial timepricing

procedure. Pricingconsistsin solvingtheLPwith

onlyasubsetofthevariables,andthencheckingif

anyofthevariablesthatwereleftoutcanbeused

to improvethe currentoptimalsolution. If this is

thecase,thepricingprocedurendsonesuchvari-

able,which isthenaddedtothecurrentLP,tobe

solved again. Using duality, it can beshown that

nding a variable to add to thecurrent LPis the

same as nding a violated inequality for thedual

ofthecurrentLP(i.e.,thepricingproblemforthe

primalistheseparation problemforthedual).

TheBranch-and-BoundapproachforIPformu-

lationsinwhichvariablesareaddedthroughpricing

iscalledBranch-and-Price.

3 AlignmentProblems

Oftentimes,incomputationalbiology,one must

compareobjectswhichconsistofa setof elements

arrangedin alinearly orderedstructure. A typical

example is the genomic sequence, which is equiv-

alent to a string over an alphabet (the4-letter

nucletotide alphabet fA;T;C;Gg, or the 20-letter

aminoacidalphabet). Anotherexampleisthepro-

tein, when this is regarded as a linear chain of

residues(andhencewitharstandalastresidue).

Aligning two (or more) such objects consists

in determining subsets of corresponding elements

in each. The correspondence must be order-

preserving,i.e.,ifthei-thelementofobject1corre-

spondsto thek-thelementofobject2,noelement

followingiinobject1cancorrespondtoanelement

precedingk inobject2.

The situation can be described graphically as

follows. Given two objects, where the rst has n

elements,numbered1;:::;nandthesecondhasm

elements,numbered1;:::;m,weconsiderthecom-

plete bipartitegraph

W

nm

:=([n];[m];L) (8)

whereL=[n][m]and[k]:=f1;:::;kg8k2Z +

.

We hereafter call a pair(i;j) with i2 [n] and

j2[m] aline(sinceitaligns iwithj),andwede-

noteit by [i;j]. Lis thesetofalllines. Two lines

[i;j]and[i 0

;j 0

],with[i;j]6=[i 0

;j 0

],aresaidtocross

ifeither i 0

>iand j 0

6j or i 0

6iand j 0

>j (by

denition,alinedoesnotcrossitself). Graphically,

crossing lines correspond to lines that intersect in

a single point. A matching isa subsetof lines no

identiedbyanoncrossingmatching,i.e.,amatch-

ingforwhichnotwolines cross,inW

nm

. Fig.1(a)

shows(inbold)analignmentoftwoobjects.

Fig.1. (a) Anoncrossing matching(alignment). (b) The

directedgrid.

In the following Subsections 3.1, 3.2 and 3.3,

we will describe Integer Programming approaches

for three alignment problems, i.e., Sequences vs.

Sequences (Reinert et al.

[4]

, Kececioglu et al.

[5]

,

Sequences vs. Structures (Lenhof et al.

[12]

, Ke-

cecioglu et al.

[5]

), and Structures vs. Structures

(Lancia et al.

[13]

, Caprara and Lancia [31]

) respec-

tively. All Integer Programming formulations for

these problems contain binary variables x

l to se-

lect which lines l 2 L dene the alignment, and

constraintstoinsurethattheselectedlinesarenon-

crossing. Therefore, it is appropriate to discuss

here this part ofthemodels, since theresults will

apply to all the following formulations. We will

consider here the case of two objects only. This

case can be generalized to more objects, as men-

tionedinSubsection3.1.

Let X

nm

be theset of incidence vectors of all

noncrossing matchings in W

nm

. A noncrossing

matching inW

nm

corresponds toa stable set in a

newgraph, G

L

(theLine Con ict Graph),dened

as follows. Each line l 2L isa vertexof G

L ,and

twoverticeslandhareconnectedbyanedgeifthe

linesl andhcross.

The graph G

L

has been studied in Lancia et

al.

[13]

,where thefollowingtheoremisproved:

Theorem 1.

[13]

The graphG

L

isperfect.

Since a stable set can intersect a clique in at

most1node,thefollowingfamilyofcliqueinequal-

ities can be used as constraintsfor the alignment

problem:

X

l 2Q x

l

61 8Q2clique (L): (9)

where clique (L) denotes the set of all cliques in

G

L

. Given weights x

l

for each line l, the sepa-

ration problem for theclique inequalities (9) con-

sistsin nding a clique Q in G

L with x

(Q)>1.

(5)

solvedinpolynomialtimebyndingthemaximum

x

-weightcliqueinG

L

. Insteadofresortingtoslow

andcomplexalgorithmsforperfectgraphs,thesep-

arationproblemcanbesolveddirectlybyDynamic

Programming,asdescribedinLenhofetal.

[12]

and,

independentlyandwithadierentconstruction,in

Lanciaet al.

[13]

. Bothalgorithmshavecomplexity

O(nm). Wedescribeheretheconstructionof[12].

Consider Lasthevertexsetofadirected grid,

in which vertex [1;n] is put in the lower left cor-

nerandvertex[m;1]intheupperrightcorner(see

Fig.1(b)).

At each internal node l = [i;j], there are two

incoming arcs, one directed from the node below,

i.e. [i;j+1], and one directed from the node on

theleft,i.e. [i 1;j]. Eachsucharchasassociated

alengthequal tox

l .

Theorem2.

[12;13]

Thenodesonadirectedpath

P from [1;n] to [m;1] in the grid correspondto a

clique Q, of maximal cardinality, in G

L

and vice

versa.

Thenthemostviolatedcliqueinequalitycanbe

foundbytakingthelongest[1;n]-[m;1]pathinthe

grid. There is a violated clique inequality if and

onlyifthelengthofthepath(plus x

1n

)isgreater

than1.

3.1 Sequence vsSequence Alignments

Comparing genomic sequences drawn from in-

dividuals of thesame or dierentspeciesis one of

thefundamentalproblemsofmolecularbiology. In

fact,suchcomparisonsareneededtoidentifyhighly

conserved (and therefore presumably functionally

relevant)DNA regions, spot fatal mutations, sug-

gestevolutionaryrelationships,andhelpincorrect-

ingsequencingerrors.

Aligning a set of sequences (i.e., computing a

Multiple Sequence Alignment) consists in arrang-

ing them in a matrix having each sequence in a

row. This is obtained by possibly inserting gaps

(representedbythe`-'character)ineachsequence

sothat theyallresultofthesamelength. Thefol-

lowing isa simple example of analignmentof the

sequencesATTCCGAC,TTCCCTGandATCCTC.

A T T C C G A - C

- T T C C C - T G

A - T C C - - T C

By looking at a column of the alignment, we

can reconstruct the events that havehappenedat

thecorrespondingposition inthegiven sequences.

mutatedinto a Cin sequence2, and deletedin se-

quence3.

The multiple sequence alignment problem has

been formalizedas an optimization problem. The

mostpopularobjectivefunctionformultiplealign-

mentgeneralizesideasfromoptimallyaligningtwo

sequences. This problem, called pairwise align-

ment, is formulated as follows: Given symmetric

costs (a;b) for replacing a letter a with a letter

b and costs (a; ) for deleting or inserting a let-

ter a, nd a minimum-cost set of symbol opera-

tionsthat turn a sequence S 0

into a sequence S 00

.

For genomicsequences,thecosts (;)areusually

speciedbysomewidelyusedsubstitutionmatrices

(likePAM and BLOSUM), which score the likeli-

hoodofspeciclettermutations, deletionsandin-

sertions. The pairwise alignment problem can be

solvedbyDynamicProgrammingintimeandspace

O(l 2

), where l is the maximum lengthof thetwo

sequences(Needleman andWunsch [32]

). Thevalue

ofanoptimalsolution iscalledtheeditdistanceof

S 0

andS 00

andisdenotedbyd(S 0

;S 00

).

Formally, an alignment A of two or more

sequences is a bi-dimensional array having the

(gapped)sequencesasrows. Thevalue d

A (S

0

;S 00

)

ofanalignmentoftwo sequencesS 0

and S 00

isob-

tainedbyaddingupthecostsforthepairsofchar-

actersin correspondingpositions,anditiseasyto

see that d(S 0

;S 00

)=min

A d

A (S

0

;S 00

). Thisobjec-

tiveisgeneralized,forksequencesfS

1

;:::;S

k g,by

theSum{of{Pairs(SP) score, inwhichthecost of

an alignment is obtained by adding up the costs

of thesymbols matched upat the same positions,

over allthepairsofsequences:

SP(A):= X

16i<j6k d

A (S

i

;S

j )

= X

16i<j6k jAj

X

l =1

(A[i][l];A[j][l])

(10)

wherejAjdenotesthelengthofthealignment,i.e.,

thenumberofitscolumns.

Finding the optimal SP alignment was shown

NP-hardbyWangandJiang [33]

. Astraightforward

generalizationoftheDynamicProgrammingfrom2

toksequences,oflengthl,leadstoanexponential-

time algorithmof complexity O(2 k

l k

), whichis in

practiceuseless forallbuttinyproblems [34]

.

3.3.1 Trace andInteger ProgrammingApproach

AsanalternativeapproachtotheineectiveDy-

[5]

(6)

(seealsoReinertetal.

[4]

forapreliminaryversion)

proposed anapproach based on Integer Program-

ming, to optimizeanobjectivefunction called the

Maximum Weight Trace (MWT) Problem. The

MWTgeneralizestheSPobjectiveaswellasmany

other alignment problems in the literature. The

problemwasintroduced in1993byKececioglu [35]

,

whodescribeda combinatorial Branch-and-Bound

algorithmforitssolution.

The trace is a graph theoretic generalization

of the noncrossing matching to the case of the

alignment of more than two objects. Suppose

we want to align k objects, of n

1

;n

2

;:::;n

k el-

ements. We dene a complete k-partite graph

W

n1;:::;nk

= (V;L), where V = S

k

i=1 V

i

, the set

V

i

= fv

i1

;:::;v

ini

g denotes the nodes of level i

(correspondingtotheelementsofthei-thobject),

andL= S

16i<j6k V

i

V

j

. Eachedge [i;j]2L is

stillcalledaline. GivenanalignmentAoftheob-

jects (i.e.,anarrangementin anarray, withgaps),

we say that the alignment realizes a line [i;j] if i

and j are put in thesame column of A. A trace

T isa set of lines such that there exists analign-

mentAthat realizesallthelinesin T. TheMaxi-

mumWeightTrace (MWT)Problemisthefollow-

ing: given weights w

[i;j]

for each line [i;j], nd a

traceT ofmaximumtotalweight.

LetA bethesetofdirectedarcspointingfrom

each element to the elements that follow it in an

object(i.e., arcs of type (v

is

;v

it

) fori = 1;:::;k,

1 6s<t 6n

i

). Then a trace T is characterized

asfollows:

Proposition 3.

[14]

T is a trace if and only if

thereisnodirectedmixedcycleinthemixedgraph

(V;T[A).

Adirectedmixedcycleisacyclecontainingboth

arcs and edges, in which thearcs (elements of A)

are traversed according to their orientation, while

theundirected edges(lines ofT)can betraversed

ineither direction.

Usingbinaryvariablesx

l

forthelinesinL,the

MWTproblemcanbemodeledas follows [5]

:

max X

l 2L w

l x

l

(11)

subjectto

X

l 2C\L x

l

6jC\Lj 1 8directed mixedcycles

C in (V;L[A) (12)

x

l

2f0;1g 8l2L: (13)

Theformulation(11){(13)can bestrengthened

byusingthecliqueinequalities(9),relativetoeach

subgraphW

n

i

;n

j :=(V

i

;V

j

;V

i

V

j )ofW

n

1

;:::;n

k .

The mixed-cycle inequalities (12)can besepa-

ratedinpolynomialtime,viathecomputationofat

mostO(

P

k

i=1 n

i

)shortestpathsin(V;L[A) [5]

. A

Branch-and-Price algorithm based on the mixed-

cycle inequalities and the clique inequalities (as

wellas othercuts,separatedheuristically)allowed

Kececioglu et al.

[5]

to compute, for the rsttime,

an optimal alignment for a set of 15 proteins of

about300aminoacidseachfromtheSWISSPROT

database, whereas the previous limit, achieved by

a combinatorial Branch-and-Bound algorithm for

this problem, was 6 sequences of size 200 We re-

mark the the optimal alignment of 6 sequences

of size 200 is already out of reach for Dynamic

Programming-based algorithms. We must also

pointout,however,thatthelengthofthesequences

isstillamajorlimitationtotheapplicabilityofthe

Integer Programming solution, and themethod is

notsuitable for sequences longer than a few hun-

dredletters.

3.1.2 AnIP-Based HeuristicforSPAlignment

Due to the complexity of the alignment prob-

lem, many heuristic algorithms have been de-

veloped over time, some of which provide an

approximation-guarantee onthequalityof theso-

lution found. Inthis sectionwedescribeone such

heuristicprocedure,whichusesideasfromNetwork

DesignandIntegerProgrammingtechniques. This

heuristiciswellsuitedforlongsequences.

A popular approach formany heuristics is the

socalled \progressive"alignment, inwhich theso-

lutionisincrementallybuiltbyconsideringthese-

quences one ata time. One suchalgorithm isdue

to Guseld [7]

, who suggested to use a star (i.e.,

a tree in which at most one node {the center{ is

nota leaf), to align thesequences as follows. Let

S = fS

1

;:::;S

k

g be the set of input sequences.

FixasequenceS

c

asthecenter,andcomputek 1

pairwise optimalalignmentsA

i

, one foreach each

S

i

aligned with S

c

. Thesealignmentscan thenbe

mergedintoasingle alignmentA(c),byputtingin

thesamecolumnallthelettersthatarealignedto

thesameletterofS

c .

Note that each A

i

optimally aligns S

i to S

c

(therebyrealizingtheireditdistance),anditistrue

alsoinA(c)thateachsequenceisalignedoptimally

withS

c ,i.e.

d

A(c) (S

i

;S

c )=d

Ai (S

i

;S

c )=d(S

i

;S

c

) 8i: (14)

(7)

Furthermore, for all pairs i;j, the triangle in-

equalityd

A(c) (S

i

;S

j )6d(S

i

;S

c )+d(S

j

;S

c )holds,

since thedistance inanalignmentis ametric over

the sequences (assuming the cost function is a

metric over ). Let c

be such that SP(A(c

))=

min

r

SP(A(r)). Call OPT be the optimal SP

value, and HEUR := SP(A(c

)). By compar-

ing P

16i<j6k d(S

i

;S

j

)(alowerboundtoOPT)to

P

16i<j6k (d(S

i

;S

c

)+d(S

j

;S

c

))

(anupperbound

to HEUR), Guseld proved thefollowing approxi-

mationresult:

Theorem 4.

[7]

HEUR6

2 2

k

OPT.

Guseld'salgorithmcanbegeneralizedsoasto

useanytree,instead ofjusta star,andstillmain-

tainthe2-approximationguarantee. Theapproach

isbased ona connectionto thefollowingNetwork

Designproblem. Givenacompleteweightedgraph

and a spanning tree T, the routing costof T, de-

notedbyr(T),isdenedasthesumoftheweights

of the unique paths, between all pairs of nodes,

identied by T. Computing theminimumrouting

costtree is NP-hard, but a Polynomial Time Ap-

proximationScheme(PTAS)ispossible [36]

.

Torelate routingcostandalignments,consider

theset of sequencesS as thevertexset of a com-

plete weighted graph G, where the weight of an

edge S

i S

j is d(S

i

;S

j

). The following procedure

showsthat,foranyspanningtreeT,onecanbuild

analignmentA(T)whichisoptimalfork 1ofthe

the k

2

pairsofsequences: (i)pickanedgeS

i S

j of

thetreeandalignrecursivelythesequencesonboth

sidesofthecutdenedbytheedge,obtainingtwo

alignments, sayA

i and A

j

; (ii)align optimallyS

i

withS

j

;(iii)mergeA

i andA

j

into acompleteso-

lution,byaligningthecolumnsofA

i andA

j inthe

samewayasthelettersofS

i

areoptimallyaligned

tothelettersofS

j .

Theorem5. ForanyspanningtreeT =(S;E)

of G, there exists an alignment A(T) such that

d

A(T) (S

i

;S

j

)=d(S

i

;S

j

) forall pairsof sequences

S

i S

j 2E.

Furthermore,thealignmentA(T)canbeeasily

computed,as outlinedabove. Bytriangleinequal-

ity,itfollows

Corollary6. ForanyspanningtreeT =(S;E)

of G, there exists an alignment A(T) such that

SP(A(T))6r(T).

FromCorollary6,itfollowsthatagoodtreeto

useforaligningthesequencesisatreeT

suchthat

r(T

) = min

T

r(T). This tree does not guaran-

tee anoptimal alignment, but guarantees the best

possible upper bound to the alignment value, for

eld's alignment). In [6],Fischetti et al. describe

aBranch-and-Pricealgorithmforndingthemini-

mumroutingcosttreeinagraph,whichiseective

for up to k = 30 nodes (a large enough value for

alignmentproblems). Thealgorithmisbased ona

formulationwithbinaryvariablesx

P

foreachpath

P in the graph (an exponential number of vari-

ables), andconstraints thatforce thevariablesset

to1 to betheset ofpathsinducedbyatree. The

pricingproblemissolvedbycomputingO(k 2

)non-

negative shortest paths. Fischetti et al.

[6]

report

on computational experiments of their algorithm

appliedtoalignmentproblems,showingthatusing

thebest routing cost tree instead of the best star

producesalignmentsthatarewithin6%ofoptimal

onaverage.

3.1.3 ConsensusSequences

Given a set S =fS

1

;:::;S

k

g of sequences, an

important problem consists in determining their

consensusi.e., anewsequencewhichrepresentsall

the sequences in the set. Let C be the consen-

sus. To avoid a consensus bias (e.g., due to over-

representedsequencesinthesetS),anappropriate

objectiveisto minimizethe maximumdistance of

C from any sequence in S. This objectiveis NP-

hard [37]

.

Assume allinputsequenceshavelengthn. For

theconsensusprobleminsteadofusingtheeditdis-

tance, the Hamming distance is preferable (some

biological reasons for this can be found in [11]).

Thisdistance isdened,forsequencesAandB of

thesame length, as P

n

i=1

(A[i];B[i]), where X[i]

denotesthei{thsymbolofasequenceX.

The Integer Programming formulation for the

consensus problem has a binary variable x

i

, for

everysymbol2,andevery position 16i6n,

to indicate whether C[i] = (x

i

= 1) or not

(x

i

=0). Themodelreads:

min r (15)

subjectto

X

2

x

i

=1 8i=1;:::;n (16)

n

X

i=1 X

2

(;S

j [i])x

i

6r 8j=1;:::;k

(17)

x

i

2f0;1g 8i=1;:::;n; 82: (18)

Let OPT denote the value of the optimum so-

lution to the program above. An approxima-

(8)

Rounding [38]

. Consider the Linear Programming

relaxationof(15){(18),obtainedbyreplacing(18)

withx

i

>0. Letr

betheLPvalueandx

i

bethe

LPoptimal solution,possiblyfractional. An inte-

gersolutionxcanbeobtainedfromx

byrandom-

ized rounding: independently, at each position i,

choosea letter2(i.e.,setx

i

=1andx

i

=0

for 6= ) with probability of x

i

= 1 given by

x

i

. LetHEURbethevalueofthesolutionx, and

Gamma= max

a;b2

(a;b). The rst approxima-

tionresultfortheconsensuswasgivenbyBen-Dor

etal.

[8]

,andstatesthat,foranyconstant>0,

Pr

HEUR>OPT+ r

3OPTlog k

<: (19)

This probabilistic algorithm can be de-

randomized using standard techniques of condi-

tionalprobabilities [39]

.

TheIPmodel(15){(18)wasutilizedinastring

of papers on consensus as well as other, related,

problems [9 11]

, such as nding the farthest se-

quence (i.e.,asequence \asdissimilaras possible"

fromeach sequenceofagivenset S).

The main results obtained on these problems

arePTAS based ontheaboveIPformulation and

RandomizedRounding,coupledwithrandomsam-

pling. In[9,10],Lietal. andMarespectively, de-

scribePTASforndingtheconsensus,oraconsen-

sussubstring. Themain ideabehindtheapproach

isthefollowing. Givenasubsetofrsequencesfrom

S,linethemupinanrnarray,andconsiderthe

positions where they all agree. Intuitively, there

isa highlikelihood thattheconsensus should also

agreewiththem atthesepositions. Hence, allone

needs to do is to optimize on the positions (sym-

bols)wheretheydonotagree,whichisdonebyLP

relaxationandrandomized rounding.

A PTASfor thefarthest sequenceproblem, by

Lanctot et al., is described in [11]. Also this re-

sultisobtainedbyLPrelaxationandRandomized

RoundingofanIPsimilarto(15){(18).

3.2 Sequence vsStructure Alignments

Inordertobuildaprotein,theDNAfromagene

isrstcopiedontothemessengerRNA,whichwill

serveas atemplate fortheprotein synthesis. This

process is called transcription. In RNA, the nu-

cleotideThymine(T)isreplacedbytheequivalent

nucleotideUracile (U). RNA moleculesare single-

stranded and the exposed bases tend to form hy-

drogenbondswithinthesame molecule,according

to Watson-Crick pairing rules (A $ U, C $ G).

Thisbondsleadtostructureformation,alsocalled

secondarystructureoftheRNAsequence.

Determining thesecondary structure from the

nucleotidesequenceisnotaneasyproblem. Forin-

stance,complementary basesmaynothybridize if

theyarenotsuÆcientlyapartinthemolecule. Itis

alsopossiblethatnotallthefeasibleWatson-Crick

pairingscanberealizedatthesametime,andNa-

turechooses themost favorableones according to

an energy minimization which is notyet well un-

derstood.

Allpotentialpairingscanberepresentedbythe

edge set of a graph G

S

= (V

S

;E

S

). The vertex

set V

S

= fv

1

;:::;v

n

g represents the nucleotides

ofS, withv

i

correspondingto thei-thnucleotide.

Each edge e 2 E

S

connects two nucleotides that

may possibly hybridize to each other in the sec-

ondarystructure. For instance, thegraphof Fig.2

describes the possible pairings for the sequence

S = UCGUGCGGUAACUUCCACGA, assuming a pairing

can be achieved only if the bases are at least 8

positions apart from each other. Only a subset

E 0

S

E

S

ofpairings is realizedby theactualsec-

ondary structure. This subsetis drawn in boldin

Fig.2. ThegraphG 0

S

=(V

S

;E 0

S

)describesthesec-

ondary structure. Notethat each node in G 0

S has

degree61.

Fig.2.GraphofRNAsecondarystructure.

TwoRNAsequencesthatlookquitedierentas

strings ofnucleotides, may havesimilar secondary

structure. Agenericsequencealignmentalgorithm,

such as those described in the previous section,

would notspot thestructuresimilarity, andhence

adierentmodelforcomparisonisneededforthis

(9)

case.

One possible modelwasproposed in Lenhofet

al.

[12]

(seealsoKececioglu etal.

[5]

). Inthismodel,

a sequence U of unknown secondary structure is

compared to a sequence K of known secondary

structure. This is done by comparing the graph

G

U

=(V

U

;E

U )toG

0

K

=(V

K

;E 0

K ).

Assume V

U

= [n] and V

K

= [m]. An align-

ment of G

U

and G 0

K

is dened by a noncross-

ing matching of W

nm

(which, recalling (8), is the

complete bipartite graph (V

U

;V

K

;L)). Given an

edge e = ij 2 E

U

and two noncrossing lines

l = [i;u] 2 L, h = [j;v] 2 L, we say that l and

h generatee if uv 2 E 0

K

. Hence, two lines gener-

ateapossiblepairing(edgeofE

U

)iftheyalignits

endpointstoelementsthathybridizetoeach other

inthesecondary structureofK.

For each line l = [i;j] 2 L let w

l

be the

protofaligningthetwoendpointsoftheline(e.g.,

w

l

= (U[i];K[j])). Furthermore, for each edge

e 2 E

U , let w

e

be a nonnegative weight associ-

ated to theedge, measuring the \strength"of the

correspondingbond. TheobjectiveoftheRNASe-

quence/Structure Alignment (RSA) problem is to

determineanalignmentwhichhasmaximumcom-

bined value: the value is given by the sum of the

weights of the alignment lines and the weights of

theedgesinE

U

that theselinesgenerate.

Let bean arbitrarytotal orderdened over

theset ofall lines L. LetG betheset of pairs of

lines(p;q)withpq,suchthateachpair(p;q)2G

generatesan edge in E

U

. For each (p;q)2G, de-

ne w

pq := w

e

, where p = [i;u], q = [j;v] and

e=ij2E

U .

TheIntegerProgrammingmodelin[12]forthis

alignment problem hasvariablesx

l

forlines l 2L

andy

pq

forpairsoflines(p;q)2G:

max X

l 2L w

l x

l +

X

(p;q)2G w

pq y

pq

(20)

subjectto

x

l +x

h

61 8l;h2Ljlandhcross

(21)

y

pq 6x

p

8(p;q)2G (22)

y

pq 6x

q

8(p;q)2G (23)

x

l

;y

pq

2f0;1g 8l2L; 8(p;q)2G:

(24)

Themodelcan bereadilytightenedasfollows.

First,thecliqueinequalities(9)canbereplacedfor

l 2 L, let G

l

= f(p;q) 2 Gjp =l_q = lg. Each

element(i;j)ofG

l

identiesa pairinge2E

U such

thatlisoneofthegeneratinglinesofe. Notethat

ifanedgee2E

U

canbegeneratedbylanda2L

and,alternatively,bylandb6=a2L,thenaandb

mustcross(inparticular,theyshareanendpoint).

Hence, of all the generating pairs in G

l

, at most

onecanbeselectedinafeasiblesolution. Thenat-

uralgeneralization forconstraints(22)and (23)is

therefore

X

(a;b)2Gl y

ab 6x

l

8l2L: (25)

A Branch-and-Price based on the above in-

equalities is described in [5, 12]. The method

wasvalidatedbyaligningRNAsequencesofknown

structure to RNA sequences of known structure,

butwithoutusingthestructureinformationforone

ofthetwo sequences. Inallcases tested,thealgo-

rithmretrievedthecorrectalignment,ascomputed

byhandbythebiologists. Forcomparison,a\stan-

dard"sequencealignmentalgorithm(whichignores

structure in both sequences) was used, and failed

toretrievethecorrectalignments.

Inasecondexperiment,themethodwasusedto

alignRNA sequencesofunknownsecondarystruc-

ture, of upto 1,400 nucleotides each, to RNA se-

quences of known secondary structure. The opti-

mal solutions found were compared with alterna-

tivesolutions foundby\standard" alignment, and

their merits were discussed. The results showed

thatstructureinformationisessentialinretrieving

biologicallyrelevantalignments.

3.3 Structure vsStructure Alignments

Aproteinisachainofmoleculesknownasamino

acids, or residues, which folds into a peculiar 3-D

shape (called its tertiary structure) under several

molecular forces (entropic, hydrophobic, thermo-

dynamic). A protein's fold is perhaps the most

important of all protein's features, since it deter-

mineshowtheproteinfunctionsandinteractswith

other molecules (in fact, most biological mecha-

nisms{e.g.,binding{attheproteinlevelarebased

onshape-complementarity).

Thecomparisonofproteinstructuresisaprob-

lem of paramount importance in structural ge-

nomics, with applications to protein clustering,

function determination and assessment of pro-

tein fold predictions. An increasing number of

approaches for structure comparison have been

proposed over the past years (see Lemmen and

[40]

(10)

structureclassicationservers(themostimportant

of which is the Protein Data Bank, PDB [41]

have

beendesignedandareextensivelyusedinpractice.

Loosely speaking, the structure comparison

problem is the following: Given two 3-D protein

structures, determine how similar they are. The

comparison of two structures requires to \align"

them in some way. Since, by theirnature, three-

dimensionalcomputationalproblemsareinherently

more complex than the similar one-dimensional

ones, there is a dramatic dierence between the

complexity of sequence alignment and structure

alignment. Notsurprisingly,varioussimpliedver-

sions of the structure comparison problems were

shownNP-hard [42]

.

A popular structure comparison measure is

based on the comparison of the proteins contact

maps. Acontactmap isa 2-D representation ofa

3-Dprotein structure. When a proteins folds,two

residuesthatwerenotadjacentintheprotein'slin-

earsequence,mayendupclosetoeachotherinthe

3-D space. The contact map of a protein with n

residues is dened as an nn 0-1 matrix, whose

1{elementscorrespondtopairsofresiduesthatare

withina \small"distance(typicallyaround5

A)in

theprotein'sfold,butarenotadjacentinthelinear

sequence. Wesay that such a pair ofresidues are

in contact. Acontact map can alsobeconsidered

the adjacencymatrix of a graph G. Each residue

is a node of G, and there is an edge between two

nodesifthecorresponding residuesarein contact.

TheContactMapOverlap(CMO)problemtries

to evaluate the similarity in the 3-D folds of two

proteinsbydeterminingthesimilarityoftheircon-

tactmaps(therationale being thata highcontact

mapsimilarityisagoodindicatorofhigh3-Dsim-

ilarity). This measurewas introduced in [43], and

its optimization was provedNP-hard in [42], thus

justifyingtheuseofsophisticatedheuristicsorenu-

merativemethods.

Given two folded proteins, the CMO problem

calls for determining an alignment between the

residues oftherstproteinand ofthesecondpro-

tein. Thegoalistondthealignmentwhich high-

lights the largest set of common contacts. The

value of an alignment is given by the number of

pairs of residues in contact in the rst protein

which are aligned with pairs of residues that are

also in contact in the second protein. Given the

graph representation for contact maps, we can

phrase the CMO problem in graph{theoretic lan-

guage. Theinputconsistsoftwoundirectedgraphs

G = (V ;E ) and G = (V ;E ), with V = [n]

and V

2

= [m]. For each edge ij, with i < j, we

distinguishaleftendpoint(i)andarightendpoint

(j), and therefore we denote the edge by the or-

dered pair (i;j). A solution of the problem is an

alignment ofV

1 andV

2

, i.e., anoncrossingmatch-

ing L 0

L in W

nm

. Two edges e = (i;j) 2 E

1

and f = (i 0

;j 0

) 2 E

2

contribute a sharingto the

value of an alignment L 0

if l = [i;i 0

] 2 L 0

and

h = [j;j 0

] 2 L 0

. In this case, we say that l and

h generate thesharing (e;f). TheCMO problem

consists in nding analignment which maximizes

the number of sharings. Fig.3 shows two contact

mapsandanalignmentwith5 sharings.

Fig.3. Analignmentofvalue5.

AsuccessfulapproachfortheCMOsolutionre-

liesonformulatingtheproblemasanInteger Pro-

gramandsolvingitbyBranch-and-Price. Thefor-

mulation is based on binary variables x

l

for each

linel,andy

pq

forpairsoflinesp;qwhich generate

asharing.

The model, proposed by Lancia et al.

[13]

, is

very similar to theformulationfortheRSA prob-

lem, described in Section 3.2. We denote with G

the set of all pairs of lines that generate a shar-

ing. For each pair (l;t) 2 G, where l = [i;i 0

];

and t = [j;j 0

], the edge (i;j) 2 E

1

and the edge

(i 0

;j 0

)2E

2

. Foralinel2L, letG

l

=f(p;l)2Gg

andG +

l

=f(l;p)2Gg. ThemodelforCMOis

max X

(l ;t)2G y

l t

(26)

subjectto theclique inequalities (9) and thecon-

straints

X

(p;l )2G

l y

pl 6x

l

8l2L (27)

X

(l ;p)2G +

l y

l p 6x

l

8l2L (28)

x

l

;y

pq

2f0;1g 8l2L; 8(p;q)2G:

(29)

A Branch-and-Price based on the above con-

straints, as well as other, weaker, cuts, was used

in [13] to optimally align, for the rst time, real

proteinsfromthePDB.Thealgorithmwasrun on

(11)

597 protein pairs, with sizes ranging from 64 to

72 residues. Within the time limit of 1 hour per

instance, 55 problems were solved optimally and

for421problems(70percent)thegapbetweenthe

bestsolutionfoundandtheupper bound was65,

therebyprovidinga certicateofnear-optimality.

Inasecondexperiment,theCMOmeasurewas

validated from a biological point of view. A set

of 33 proteins, known to belong to four dierent

families, were pairwise compared. The proteins

werethenre-clusteredaccordingtotheircomputed

CMOvalue. Inthenewclustering,notwoproteins

wereputinthesameclusterwhichwerenotinthe

same familybefore(a 0% falsepositives rate)and

onlyfewpairsofproteinsthatbelongedtothesame

family were put in dierent clusters (a 1.3% false

negativesrate).

4 GenomeRearrangements

Thanks to the large amount of genomic data

that has become available in the past years, it is

now possible to compare the genomes of dierent

species,in orderto ndtheirdierencesandsimi-

larities. An importantapplication ofthis problem

is,e.g.,thecomparisonofmouseandmangenomes,

sincenewdrugsaretypicallytestedonmicebefore

humans.

In principle, a genome can be thought of as a

(very long) sequence and hence one may want to

comparetwogenomesviaasequencealignmental-

gorithm. However, sequence alignment is not the

right model for large-scale genome comparisons,

and,atthegenomelevel,dierencesshouldbemea-

suredin termsofrearrangementsoflong DNAre-

gionswhichoccurredduring evolution.

Amongthemainrearrangementsknownarein-

versions, transpositions, and translocations. Each

oftheseeventsaectsalongfragmentofDNAona

chromosome. Whenaninversionoratransposition

occurs,a DNAfragmentisdetachedfrom itsorig-

inal position and then is reinserted, on the same

chromosome. In an inversion, it is reinserted at

thesameplace,butwithoppositeorientationthan

it originally had. In a transposition, it keeps the

original orientation but ends upin a dierentpo-

sition. A translocation causes a pair of fragments

to beexchanged between theends oftwo chromo-

somes. Fig.4 illustrates these events, where each

stringrepresentsachromosome.

ATTGTTataggttagAATTG ATTgtttataGGCTAGATCCGCCAGA CTGGATgcaggcat TCATTGAaata

# # #

ATTGTTgattggataAATTG ATTGGCTAGATCCGCgtttataCAGA CTGGATaata TCATTGAgcaggcat

(Inversion) (Transposition) (Translocation)

Fig.4.Evolutionaryevents.

Since evolutionary events aect long DNA re-

gions (several thousand bases) the basic unit for

comparison is not the nucleotide, but rather the

gene.

The generalGenome Comparison Problemcan

beinformally stated as: Given two genomes (rep-

resentedby two sets of ordered listsof genes, one

listper chromosome)nda sequenceofevolution-

aryeventsthat,appliedtotherstgenome,turnit

intothesecond. Undera commonparsimonyprin-

ciple, the solution sought is the one requiring the

minimumpossiblenumberof events.

In the past decade, people have concentrated

onevolutionbymeansofsomespeciceventalone,

and haveshown that even these special cases can

bealreadyveryhardtosolve. Sinceinversions are

considered the predominant of all types of rear-

rangements,theyhavereceivedthemostattention.

Forhistoricalreasons, theyhavebecomeknownas

reversalsin thecomputer sciencecommunity.

In the remainder of this section, we will out-

lineasuccessfulIntegerProgrammingapproachfor

thesolutionofcomputingtheevolutionarydistance

between two genomes evolved by reversalsonly, a

problem known as Sorting by Reversals. The ap-

proach isduetoCapraraet al [14;15]

.

Twogenomesarecomparedbylookingattheir

common genes. After numbering each of n com-

mon geneswith a unique labelin f1;:::;ng,each

genomeisapermutationoftheelementsf1;:::;ng.

Let = (

1 :::

n

) and = (

1 :::

n ) be

twogenomes. Bypossiblyrelabelingthegenes,we

can alwaysassume that =:=(1 2 ::: n), the

identitypermutation. Hence,theproblembecomes

turninginto, i.e.,sorting.

A reversal is a permutation , with 1 6 i <

(12)

j6n,dened as

ij=(1 ::: i 1 jj 1 ::: i+1i j+1 ::: n):

reversed

By applying

ij

to , one obtains

(

1 :::

i 1

;

j

j 1 :::

i

;

j+1 :::

n

), i.e., the

orderoftheelements

i

;:::;

j

hasbeen reversed.

Let G =f

ij

;1 6i <j 6ng. Each permutation

can beexpressed as a product

1

2

::: D

with

i

2 G for i = 1:::D. The minimum value D

such that

1

2

::: D

= is called the reversal

distance of , and denoted by d(). Sorting by

Reversals(SBR)istheproblemofndingd()and

a sequence 1

2

::: d()

satisfying the previous

equation.

A majorstep towards thepractical solution of

the problem was made by Bafna and Pevzner [44]

,

who investigated the relations between SBR and

the breakpoints of . A breakpoint is given by a

pair of adjacent elements in that are not adja-

centin , thatis, thereis a breakpointat position

i,ifj

i

i 1

j>1. Letb()denotethenumber of

breakpoints. A trivial bound is d() > db()=2e,

since a reversal can remove at most two break-

points, and hasnobreakpoints. However,Bafna

and Pevzner showed how to obtain a much bet-

ter bound from the breakpoint graph G(). G()

hasanodeforeach elementof andedgesoftwo

colors, say red and blue. There is a red edge be-

tween

i and

i 1

foreachpositioniatwhichthere

is a breakpoint, and a blue edge between each h

and k such that jh kj = 1 and h;k are not ad-

jacent in . Each node has the same number of

red andblue edges incidenton it (0, 1 or 2), and

G()canbedecomposedintoasetofedge-disjoint

color-alternating cycles. Note that this cycles are

notnecessarilysimple,i.e., theycan repeatnodes.

Letc()bethemaximumnumber ofedge-disjoint

alternating cycles in G(). Bafna andPevzner [44]

provedthefollowing

Theorem 7.

[44]

For every permutation ,

d()>b() c().

The lower bound b() c() turns out to be

verytight,asobservedrstexperimentally byvar-

iousauthors andthen proved tobealmost always

the case by Caprara [45]

who showed in [46] that

determining c() is essentially the same problem

as determining d(), and that both problems are

NP-hard.

Since computingc()is hard,thelowerbound

b() c()shouldnotbeuseddirectlyinaBranch-

and-Bound procedure for SBR. However, for any

upper bound c 0

() to c(), also the value b()

0

easier to compute. Givenaneective Integer Lin-

earProgramming(ILP)formulationtondc(),a

good upper bound to c()can beobtained byLP

relaxation. LetCdenotethesetofallthealternat-

ing cycles of G() = (V;E), and for each C 2 C

dene a binary variable x

C

. The following is the

ILPformulationofthemaximumcycledecomposi-

tion:

c():=max X

C2C x

C

(30)

subjectto

X

C3e x

C

61 8e2E (31)

x

C

2f0;1g 8C2C: (32)

Caprara et al showed in [14] how to solve the

exponentially large LP relaxation of (30){(32) in

polynomialtimebycolumn-generationtechniques.

ConsiderthedualoftheLPrelaxationof(30){

(32),whichreads

c 0

():=min X

e2E y

e

(33)

subjectto

X

e2C y

e

>1 8C2C (34)

y

e

>0; 8e2E: (35)

Theseparation problemfor(33){(35),whichis

the same as the pricing problem for (30){(32), is

the following: given a weight y

e

for each e 2 E,

ndanalternating cycleC2C suchthat

X

e2C y

e

<1; (36)

orprovethatnoneexists. Callanalternatingcycle

satisfying (36)aviolatedcycle. To solve(33){(35)

inpolynomial time,onemustbeableto identifya

violatedcycleinpolynomialtime,i.e.,analternat-

ing cycle of G() having weight < 1, where each

edge e 2 E is giventhe weight y

e

. This problem

canbesolvedviaareductiontothesolutionofsome

non-bipartite min-costperfect matching problems

in a suitable graph. The reduction is similar to

oneused byGrotschel andPulleyblank fornding

minimum-weight odd and even paths in an undi-

rectedgraph [47]

.

Although polynomial, the solution of the

weighted matching problem on general graphs is

(13)

et al. describe a weaker bound, obtained by en-

larging the set C in (30){(32) to include also the

pseudo-alternatingcycles,i.e.,cyclesthatalternate

red and blue edges but may possibly use an edge

twice. With this new set of variables, the pricing

becomesmuchfaster,sinceonlybipartitematching

problemsmustbesolved.

Although weaker than the bound based on

alternating cycles, the bound based on pseudo-

alternating cyclesturns outto be still very tight.

Based on this bound, Caprara et al. developed

a branch-and-price algorithm [15]

which can rou-

tinelysolve,inamatterofseconds, instanceswith

n=200 elements, a large enough size forall real-

life instances available so far. Note that the pre-

viouslimitfor anoptimization algorithmfor SBR

was n = 30. The algorithm in [15] was used to

nd, for the rst time, the optimal solution to a

man vs mouse instance of SBR, and to compare

the mitochondrial genomes of 20 species, nding

theoptimalsolutioninallcases.

5 Single NucleotidePolymorphisms

ADNAregionwhosecontentvariesinastatisti-

callysignicantwaywithinapopulationiscalleda

geneticpolymorphism. Thesmallestpossibleregion

consists ofa single nucleotide, andhence iscalled

a Single Nucleotide Polymorphism, or SNP (pro-

nounced\snip"). ASNPisanucleotidesite,inthe

middleofa DNAregionwhich isotherwise identi-

calforeverybody,atwhichweobserveavariability

inapopulation. Furthermore,thevariabilityisre-

stricted to only two values (called alleles) out of

the four possible (A, G, C, G). It is believed that

SNPsarethe predominant form ofhuman genetic

variation, and their importance cannot beoveres-

timated for medical, drug-design, diagnostic and

forensic applications.

Chrom. c, paternal: ataggtccCtatttccaggcgcCgtatacttcgacgggActata

Chrom. c, maternal: ataggtccGtatttccaggcgcCgtatacttcgacgggTctata

Haplotype 1 ! C C A

Haplotype 2 ! G C T

Fig.5.Achromosomeandthetwohaplotypes.

SinceDNAofdiploidorganismsisorganizedin

pairsofchromosomes,foreachSNPonecaneither

behomozygous(samealleleonbothchromosomes)

or heterozygous(dierentalleles). The values ofa

set of SNPs on a particular chromosome copy is

called a haplotype. Haplotypingconsists in deter-

miningapairofhaplotypes,oneforeachcopyofa

givenchromosome. InFig.5wegiveasimplisticex-

ampleofachromosomewiththreeSNPsites. This

individualisheterozygousatSNPs1and3andho-

mozygous at SNP2. Thehaplotypes areCCA and

GCT.

In recent years, several optimization problems

have been dened for SNP data. Some of these

were tackled by Integer Programming techniques,

which werecallinthissection.

We consider thehaplotyping problems relative

to a single individual and to a population. In

the rst case, the input is inconsistent haplotype

data, coming from sequencing, with unavoidable

sequencing errors. In the lattercase, the input is

ambiguous genotypedata,which species onlythe

multiplicity of each allele foreach individual (i.e.,

erozygous atSNPj,foreachiandj).

5.1 Haplotypinga Single Individual

The process of reading the content of a DNA

moleculeiscalledsequencing. Fortechnicallimita-

tions, only short DNA fragments (a couple thou-

sand bases)can be sequenced. Furthermore, even

with today's best possible instruments, sequenc-

ingerrors are unavoidable. These consistin bases

which have been misread or skipped altogether.

Anothersourceofproblemscanbethepresenceof

contaminants, i.e. DNAcoming from another or-

ganism which was wrongly mixed with the DNA

to be sequenced. In this framework, the Single

Individual Haplotyping Problem can beinformally

statedas: \giveninconsistenthaplotypedatacom-

ing from fragment sequencing, remove the errors

so as to retrieve a consistent pair of haplotypes."

Themathematicalframeworkforthisproblemisas

follows. Denote thetwovalues thateachSNPcan

takeby0 and 1. A haplotype, i.e. a chromosome

contentprojectedonasetofSNPs,isthenabinary

vector. Let S = f1;:::;ng be a set of SNPs and

(14)

is covered by some of the fragments, on which it

can take the values 0 or 1. A fragment f can be

represented by a vector in f0;1; g n

. For a pair

(f;s)2FS,f[s]denotes thes-thcomponentof

f. When f[s] = , it meansthat the fragment f

does notcover theSNPs.

Agaplessfragmentisonecovering asetofcon-

secutiveSNPs(i.e.,betweenanytwocomponentsin

f0;1g,thereareonlycomponentsinf0;1g),other-

wise,thefragmenthasgaps. Therecanbegapsfor

two reasons: (i) thresholding of low-quality reads

(i.e.,whenthesequencercannotcall aSNP0 or1

withenough condence, thecomponentis marked

witha "-"); (ii)mate-pairing inshotgun sequenc-

ing(amatepairisapairoffragmentscomingfrom

thesamechromosomecopy. Thispairisequivalent

toa singlefragmentwith onegapin themiddle).

Fig.6. (a)AsetoffragmentsF. (b)Thefragmentcon ict

graphG

F .

Two fragments f and g are said to bein con-

ictifthereexists aSNPksuchthatf[k]=0 and

g[k] =1 or f[k] = 1 and g[k] = 0. The fragment

con ict graph is the graph G

F

= (F;E

F

) which

has an edge for each pairof fragments in con ict

(seeFig.6). Thedataareconsistentwiththeexis-

tenceoftwohaplotypes (oneforeachchromosome

copy)ifandonlyifG

F

isabipartite graph. Inthe

presence of contaminants, or of some \bad" frag-

ments(fragmentswithmanysequencingerrors),to

correct thedata one must face theproblem of re-

movingthefewestnumberoffragmentssothatG

F

becomesbipartite. ThisproblemiscalledtheMin-

imum Fragment Removal (MFR) problem. When

thefragmentscancontaingaps,theproblemisNP-

hard [16]

. Forthisproblem,thefollowingIPformu-

lationcanbeadoptedsoastominimizethenumber

offragments(nodes)removedfromG

F

tomakeG

F

bipartite [17]

. Introducea 0-1 variable x

f

for each

fragment. Thevariablesforwhichx

f

=1inanop-

timalsolutiondenethenodestoremove. LetCbe

theset ofalloddcyclesin G

F

(M). Sincea graph

is bipartite if and onlyif there are noodd cycles,

one must removeat least one node from each cy-

Programming(IP)formulation:

min X

f2F x

f

(37)

s.t.

X

f2C x

f

>1 8C2C (38)

x

f

2f0;1g 8f2F: (39)

The LP relaxation of (37){(39) can be solved

in polynomial time, by solving the following sep-

aration problem: Given fractional weights x

for

the fragments, nd an odd cycle C for which

P

f2C x

f

< 1. This can be obtained as follows.

First,buildanewgraphQ=(W;F)made,loosely

speaking,oftwocopiesofG

F

. Foreachnodei2V

there are two nodes i 0

;i 00

2W, andfor each edge

ij2E,therearetwoedgesi 0

j 00

;i 00

j 0

2F. Edgesin

F areweighted,withweightsw(i 0

j 00

)=w(i 00

j 0

):=

x

i

=2+x

j

=2. Byaparityargument,apathP inQ

ofweightw(P),fromi 0

toi 00

,correspondstoanodd

cycle C through i in G

F , with

P

j2C x

j

=w(P).

Hence,theshortestpathcorrespondstothe\small-

est"oddcycle,andiftheshortestpathhasweight

> 1 then there are no violated inequalities (38).

Otherwise, the shortest path individuates a vio-

latedinequality.

Tondgoodsolutions ofMFR inashorttime,

Lippert et al.

[17]

have also experimented with a

heuristicprocedure, based on theaboveLPrelax-

ation andrandomized rounding. First,theLPre-

laxation of (37){(39)is solved, obtaining an opti-

mal solution x

. If x

is fractional, each variable

x

f

is rounded-up to 1 with probabilityx

f

. If the

solution is feasible, stop. Otherwise, x to 1 all

thevariablesthatwererounded-upsofar(i.e. add

theconstraintsx

f

=1)anditerateanewLP.This

methodconvergedvery quicklyonalltestsonreal

data,withasmallgapinthenalsolution(anup-

perboundto theoptimum)andtherstLPvalue

(a lower bound to the optimum). On real data

comingfromshotgunsequencing, outofabout800

fragments, only less than 5 fragments had to be

thrownouttoretrievethecorrecthaplotypes.

5.2 Haplotypinga Population

Becauseof economical reasons, when polymor-

phism screens areconducted onlarge populations

it is not feasible to examine the two copies of

each chromosome separately, andgenotype, rather

thanhaplotype,datais usuallyobtained. A geno-

(15)

Foranyindividual,ateachSNP,threepossibilities

arise: homozygousfortheallele0,homozygousfor

the allele 1, or heterozygous (a situation denoted

by the symbol 2). Hence a genotype can be rep-

resented by a vector in f0;1;2g n

, where each po-

sitionwith a 2is called anambiguousposition. A

genotypeg2f0;1;2g n

isresolvedbyapairofhap-

lotypes h;q 2 f0;1g n

, written g = hq, if for

eachSNPj,g[j]2f0;1gimpliesh[j]=q[j]=g[j],

and g[j] = 2 implies h[j] 6= q[j]. A genotype is

called ambiguous if it has at least two ambiguous

positions(agenotypewithatmostoneambiguous

positionscanberesolveduniquely).

ThePopulationHaplotypingProblemisthefol-

lowing: GivenasetGofmgenotypesovernSNPs,

nd aset Hof haplotypessuchthat each genotype

is resolved by one pair of haplotypes in H . To

turn this problem into an optimization problem,

one hasto specify theobjectivefunction. One ob-

jectivewhichhasbeenstudiedbyGuseld [18;19]

,is

based on a greedy algorithm for haplotype infer-

ence, known asClark's rule [48]

. Thegoal isto de-

rivethehaplotypesinHbysuccessiveapplications

ofthefollowinginferencerule.

Inference Rule. Given a genotype g and a

compatible haplotype h(i.e.,a haplotypehwhich

agreeswith gat allnon-ambiguous positions), ob-

tain a new haplotype q, such that g = hq, by

setting q[j] = 1 h[j] at all ambiguous positions

andq[j]=h[j]at theremainingpositions.

Inordertoresolvethegenotypes inG,Clarkin

[48]proposedthefollowing,nondeterministic,pro-

cedure: LetG 0

be theset of non-ambiguous geno-

types. StartwithsettingG :=G G 0

andHtobe

theset of haplotypes obtained from G 0

. Then, re-

peatthefollowing: takea g2G and acompatible

h 2 H and apply the inference rule, obtaining q.

SetG:=G fgg,H:=H[fqganditerate. When

nosuchg andhexist, thealgorithmhassucceeded

ifG=; andfailedotherwise.

The non-determinism in thechoice of the pair

g,htowhichtheinferenceruleisappliedcanlead

to unsuccessful runs even for instances in which

a there exists an order of application of the rule

which would resolve all genotypes in G. To nd

thebest possible order ofapplication of theinfer-

ence rule, Guseld considered the following opti-

mizationproblem: ndtheorderingofapplication

oftheinferencerulethatleaves thefewestnumber

ofunresolvedgenotypesin theend.

To solve the problem, Guseld [18;19]

proposed

anIntegerProgrammingapproach. Theproblemis

(N;A), dened as follows. Let N =

g2G N(g),

where N(g) := f(h;g)jhiscompatiblewithgg.

N(g) corresponds to the set of possible haplo-

types obtainable by setting each ambiguous posi-

tion of a genotype g to one of the 2 possible val-

ues. Let N 0

= S

g2G 0

N(g)correspond to thesub-

set ofhaplotypes unambiguously determinedfrom

the set G 0

of unambiguous genotypes. For each

pair v = (h;g 0

), w = (q;g) in N, there is an arc

(v;w)2Aifg isambiguous,g 0

6=gandg=hq.

Any directed tree rooted at a node v 2 N 0

speci-

es a feasible history of successiveapplications of

theinferencerulestartingatnodev2N 0

,andthe

problem can be stated as: Findthe largest num-

ber of nodes in N N 0

that can be reached by

a set of node{disjoint directed trees, where each

treeisrootedat anodein N 0

andwhereforevery

ambiguous genotypeg, at most one nodein N(g)

is reached. For solving this problem, Guseld [18]

proposed the following formulation, based on 0{1

variablesx

v

associatedtothenodesinN:

max X

v2N N 0

x

v

(40)

subjectto

X

v2N(g) x

v

61 8g2G G 0

(41)

x

w 6

X

v2Æ (w) x

v

8w2N N 0

(42)

withx

v

2f0;1gforv2N.

Guseld observedthattheLPsolution,onreal

and simulated data, was almost always integer,

thus requiring no branching in the branch and

boundsearch. Also,although thereisapossibility

of an integer solution containing directed cycles,

this situation never occurred in the experiments,

and hence there was no not need to add subtour

elimination-type inequalities to the model. The

abovemodelwasusedtoresolverealandrandomly

generatedinputinstancesofupto80genotypes,in

amatterofseconds.

OneofthereasonsforthepopularityofClark's

rule, is a common belief among biologists that

this ruleoften produces thesmallestset of haplo-

types foragivenset ofgenotypes. However,when

the objective is to minimize the number of dier-

ent haplotypes, itis easy to show examples of in-

stances on which Clark's rule behaves extremely

poorly. This new objective for population haplo-

typing is called parsimony, and requires to deter-