Effective Exponential-Size Integer Programming Models for Bioinformatics Problems

Giuseppe Lancia

1 Introduction

Over the past two decades, the field of bioinformatics (oftentimes also referred to as Computational Biology) has experienced a tremendous growth. Computational biology problems are mostly concerned with the interpretation and analysis of (possibly large volumes of) genomic data. More precisely, the simulation of some biological processes in the cell, as well as the discovery of important signals in genomic data, or even the detection and removal of experimental errors, are cast as computational problems over some suitable mathematical models. Once the genomic data have been translated into mathematical objects (e.g., graphs, strings, permutations, etc.) the original biology questions become computational problems, to be solved by standard techniques. Moreover, most of the time, these problems turn out to be very hard optimization problems, whose hardness stems not only from their theoretical complexity but also from the large size of the instances of interest in real-life bioinformatics applications.

The availability of genomic data has increased at an almost exponential rate over the past thirty years (although, recently, this increase has been slowing down). For example, the EMBL repository for nucleotide sequences has grown from roughly 1,000 entries in 1982 to about 100,000,000 today [41]. Notice that each entry is itself a sequence that can range from a few hundred to several hundred thousand nucleotides. Similarly, the PDB [10] (the main data bank for 3D protein structures) has grown from 100 to almost 100,000 entries, where each entry is the 3D representation of a protein which can contain several thousand amino acids. This increase in both volume and size of genomic data has posed new challenging problems which require sophisticated lines of attack.

Without any doubt, mathematical programming techniques rank amongst the most powerful approaches for hard optimization problems [52, 58, 28], and hence their use in bioinformatics applications has also steadily increased over the past years. In the mathematical programming approach, a computational biology problem is typically modeled as the search for the optimal values of a set of integer and/or continuous variables satisfying a set of linear (or, more rarely, non-linear) constraints. The objective function to optimize can be either linear or non-linear (e.g., quadratic). The model is then solved by Branch-and-Bound, sometimes employing Lagrangian Relaxation in order to obtain good bounds quickly.

Dipartimento di Matematica e Informatica, University of Udine, Via delle Scienze 206, 33100 Udine (Italy), e-mail: [email protected]

Furthermore, in some cases, the mathematical modeling highlights how a bioinformatics problem can be reduced to some well-known optimization problem. For example, in what is perhaps the first application of mathematical programming to bioinformatics, Alizadeh, Karp, Weisser and Zweig reduced the problem of building a physical map for a set of probes, given the results of probe/clone hybridization experiments, to a "standard" TSP [1].

Oftentimes, the most effective mathematical programming model for an optimization problem employs an exponential number of constraints and/or variables. The solution of an exponential-size model requires the generation at run time of constraints and/or variables, and the corresponding approaches are known as Branch-and-Cut [53, 33] and Branch-and-Price [8]. In this chapter we survey some of our works in which mathematical programming techniques were applied to several computational biology problems, and the most effective models turned out to be of exponential size. The problems chosen span a large portion of the main areas of interest for computational biology. In particular, we will address (i) genome rearrangement problems, arising in the computation of evolutionary distances between genomes, (ii) haplotyping problems, arising in the analysis of the genotypes shared by the individuals of a population, and (iii) alignment problems, for both protein sequences and protein structures.

Besides describing our works in some detail, we will also mention other relevant works in related areas which also adopt mathematical programming formulations of exponential size, and hint at some open problems or research areas that could also benefit from this type of approach. For a more detailed and comprehensive description of mathematical programming approaches to a large number of computational biology problems, we refer the reader to [43].

The remainder of the chapter is organized as follows. In Section 2 we describe an algorithm for the problem of Sorting By Reversals, arising in the study of genome rearrangements and evolutionary distances between genomes. The algorithm makes use of an auxiliary problem, namely, the decomposition of a graph into a maximum set of alternating cycles, which is solved via column-generation. In Section 3 we describe a Branch-and-Cut-and-Price for the Maximum Parsimony Haplotyping problem. The model is based on a particular Set Covering formulation with an exponential number of both variables and constraints. In Section 4 we describe a heuristic procedure for the Multiple Sequence Alignment problem, arising in the context of the analysis and comparison of genomic sequences. The procedure makes use of an auxiliary problem from Network Design, namely, the Minimum Routing Cost Tree problem, which is solved via column-generation. In Section 5 we describe an algorithm for the Maximum Contact Map Overlap problem, arising in the context of the analysis and comparison of 3-dimensional protein structures. First we describe a model with an exponential number of constraints, solved by Branch-and-Cut. Then, we describe a model based on the Quadratic Assignment Problem, still with an exponential number of constraints, and solved by Lagrangian Relaxation. Some conclusions are drawn in Section 6.

(a) ATTCTTGTTttataggttagAATTTG → ATTCTTGTTgattggatattAATTTG
(b) ATTCTTgttttataGGCTAGATCCGCCATGGA → ATTCTTGGCTAGATCCGCgttttataCATGGA

Fig. 1 Evolutionary events: (a) inversion; (b) transposition.

2 Genome rearrangements

The large amount of available genomic data makes it now possible to compare the genomes of different species, in order to find their differences and similarities. This is a very important problem because new drugs are typically tested on animals before humans. But how close is, e.g., a lab mouse to a man? How much evolution separates the two species?

When comparing genomes of different species, the differences should be measured not in terms of insertions/deletions/mutations of single nucleotides, but rather by the number of evolutionary macro-events that have happened since the time when the two species diverged from their closest common ancestor. Among the main evolutionary events known we recall inversions and transpositions of genomic regions.

These events affect a long fragment of DNA on a chromosome. When an inversion or a transposition occurs, the fragment is detached from its original position and then reinserted on the same chromosome. In an inversion, it is reinserted in the same place, but with the opposite orientation to the one it originally had. In a transposition, it keeps the original orientation but ends up in a different position. Figure 1 illustrates these events, where each string represents a chromosome.

Since evolutionary events affect long DNA regions (several thousand bases, comprising possibly many genes), the basic unit for genome comparison is not the nucleotide, but rather the gene. As a matter of fact, the computational study of rearrangement problems started after it was observed that several species possess the same genes but in a different order (e.g., most genes of the mitochondrial genome of cabbage are identical in turnip, but their ordering is quite different in the two species). Inversions are considered by far the predominant type of evolutionary event, and hence their study has received the most attention in the literature. For historical reasons, they are called reversals in the computer science community.

Two genomes are compared by looking at their common genes. After numbering each of the n common genes with a unique label in [n] := {1, . . . , n}, each genome becomes a permutation of the elements in [n]. Given two genomes, the goal of the genome comparison problem is then to identify a set of evolutionary events (e.g., a set of reversals) that, when applied to the first genome (a permutation π = (π_1 . . . π_n)), yields the second genome (a permutation σ = (σ_1 . . . σ_n)). Note that, by possibly relabeling the genes, we can always assume that σ is the identity permutation ι := (1 2 . . . n). Hence, the problem becomes that of turning π into ι, i.e., sorting π.

A reversal is a permutation ρ_{ij}, with 1 ≤ i < j ≤ n, defined as

ρ_{ij} = (1 . . . i−1  j  j−1 . . . i+1  i  j+1 . . . n),

where the elements in positions i through j appear in reversed order. Note that by applying (multiplying) ρ_{ij} to a permutation π, one obtains the permutation (π_1 . . . π_{i−1}, π_j, π_{j−1}, . . . , π_i, π_{j+1}, . . . , π_n), i.e., the order of the elements π_i, . . . , π_j has been reversed.
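In code, applying ρ_{ij} amounts to a single slice reversal. The following minimal sketch (the helper name apply_reversal is ours, not from the chapter; indices i, j are 1-based as in the text) illustrates the definition:

```python
def apply_reversal(perm, i, j):
    """Apply the reversal rho_{ij} (1-based, inclusive) to a permutation."""
    p = list(perm)
    p[i - 1:j] = reversed(p[i - 1:j])  # reverse the block in positions i..j
    return p

# rho_{2,4} applied to (4 2 1 3) reverses the elements in positions 2..4
print(apply_reversal([4, 2, 1, 3], 2, 4))  # [4, 3, 1, 2]
```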

Let R = {ρ_{ij} | 1 ≤ i < j ≤ n}. R is a set of generators of S_n, the set of the n! permutations of [n]. That is, each permutation π can be expressed (non-uniquely) as a product ι ρ_1 ρ_2 . . . ρ_D with ρ_i ∈ R for i = 1, . . . , D. The minimum value D such that ι ρ_1 ρ_2 . . . ρ_D = π is called the reversal distance of π, and is denoted by d_R(π). Sorting by reversals (SBR) is the problem of finding d_R(π) and a sequence ρ_1 ρ_2 . . . ρ_{d_R(π)} that satisfies ι ρ_1 ρ_2 . . . ρ_{d_R(π)} = π. The solution of SBR yields a possible scenario to explain how an organism evolved from another, under the simplifying assumptions that inversions were the only rearrangements to occur, and that evolution required the minimum number of rearrangements. Even if these assumptions lead to some approximation, both are well-motivated. Indeed, on the one hand inversions are the most frequent type of rearrangement, and on the other rearrangements are very rare events. If the evolutionary distance is needed within a framework aimed at reconstructing an evolutionary tree for a set of species (see e.g. [57]), SBR may be the subproblem to be solved to evaluate the length of the tree edges. In this respect, it is important to have very fast algorithms to solve it.

The formulation of SBR that we just gave does not take into account the fact that a reversal not only changes the order of some genes, but also inverts the nucleotide sequence of each reversed gene. When the orientation of the nucleotide sequence of a gene is known, we gain some extra information. In this case, a genome can be represented by a signed permutation, i.e., a permutation in which each element is signed either '+' or '−'. For a signed permutation, the effect of a reversal is not only to flip the order of some consecutive elements, but also to complement their signs. For instance, the reversal ρ_{24} applied to the signed permutation (+1 −4 +3 −5 +2) yields the signed permutation (+1 +5 −3 +4 +2). When signs are taken into consideration, the resulting problem is called Signed sorting by reversals (SSBR), and its goal is to determine the minimum number of reversals that turn a signed permutation π into (+1 +2 . . . +n).
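The signed case can be sketched the same way, flipping both the order and the signs of the reversed block (again an illustrative helper of ours, reproducing the ρ_{24} example from the text):

```python
def apply_signed_reversal(perm, i, j):
    """Apply rho_{ij} to a signed permutation: reverse positions i..j and flip signs."""
    p = list(perm)
    p[i - 1:j] = [-x for x in reversed(p[i - 1:j])]
    return p

# rho_{24} on (+1 -4 +3 -5 +2), as in the text
print(apply_signed_reversal([1, -4, 3, -5, 2], 2, 4))  # [1, 5, -3, 4, 2]
```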

After developing a beautiful combinatorial theory of genome rearrangements, Hannenhalli and Pevzner showed in 1995 that (against all beliefs at the time) SSBR is polynomial [38]. On the other hand, the unsigned version is NP-hard, as proved in 1997 by Caprara, who closed a longstanding open question about the computational complexity of sorting by reversals [15]. For this reason, the solution of SBR appears to call for the use of sophisticated and powerful optimization techniques, such as those employed by the mathematical programming approach.

SBR has been widely studied by, among others, Kececioglu and Sankoff [42], Bafna and Pevzner [7], Caprara, Lancia and Ng [22], Tran [60], Caprara [16, 17], Christie [27], Berman, Hannenhalli and Karpinski [9]. Most of these papers deal with the complexity and theoretical approximability of the problem. References [42, 7, 27, 9] present, respectively, polynomial-time 2-approximation, 7/4-approximation, 3/2-approximation, and 1.375-approximation algorithms.

2.1 A Branch-and-Price approach

In this section we describe an ILP model that has been successfully used for the effective solution of SBR. The model has an exponential number of variables, and therefore it requires column-generation at run time. A peculiar characteristic of the approach is that the model does not represent SBR directly, but rather it solves a different problem which is closely related to SBR, i.e., the decomposition of a bicolored graph into a maximum set of edge-disjoint alternating cycles.

In order to describe the model in some detail, we need to introduce the fundamental concepts of breakpoint and breakpoint graph. Let π = (π_1 . . . π_n) be the permutation of which we seek to compute the reversal distance. The breakpoint graph G(π) = (V, B ∪ Y) of π is defined as follows. Add to π the elements π_0 := 0 and π_{n+1} := n + 1, re-defining π := (0 π_1 . . . π_n n+1). The node set V := {0, . . . , n + 1} has a node for each element of π. The graph G(π) is bicolored, i.e., its edge set is partitioned into two subsets, each represented by a different color. B is the set of black edges, each of the form (π_i, π_{i+1}), for all i ∈ {0, . . . , n} such that |π_i − π_{i+1}| ≠ 1, i.e., elements which are in consecutive positions in π but not in the identity permutation ι. Such a pair π_i, π_{i+1} is called a breakpoint of π, and we denote by b(π) := |B| the number of breakpoints of π. Denote by π^{-1} the inverse permutation of π, defined by π_{π^{-1}_i} := i for i = 0, . . . , n + 1. Y is the set of gray edges, each of the form (i, i + 1), for all i ∈ {0, . . . , n} such that |π^{-1}_i − π^{-1}_{i+1}| ≠ 1, i.e., elements which are in consecutive positions in ι but not in π. Note that each node i ∈ V has either degree 0, 2 or 4, and has the same number of incident gray and black edges. Therefore, |B| = |Y| (= b(π)).

Figure 2 depicts the breakpoint graph associated with the permutation (4 2 1 3).
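The definitions above translate directly into code. This minimal sketch (the helper breakpoint_graph is ours; edges are stored as tuples) builds B and Y for the running example π = (4 2 1 3):

```python
def breakpoint_graph(perm):
    """Build the black (B) and gray (Y) edge sets of the breakpoint graph G(pi)."""
    n = len(perm)
    pi = [0] + list(perm) + [n + 1]          # extend with pi_0 = 0 and pi_{n+1} = n+1
    pos = {v: i for i, v in enumerate(pi)}   # inverse permutation (value -> position)
    # black edges: consecutive in pi, not consecutive values (breakpoints)
    B = {(pi[i], pi[i + 1]) for i in range(n + 1) if abs(pi[i] - pi[i + 1]) != 1}
    # gray edges: consecutive values, not consecutive in pi
    Y = {(i, i + 1) for i in range(n + 1) if abs(pos[i] - pos[i + 1]) != 1}
    return B, Y

B, Y = breakpoint_graph([4, 2, 1, 3])
print(sorted(B))        # [(0, 4), (1, 3), (3, 5), (4, 2)]
print(sorted(Y))        # [(0, 1), (2, 3), (3, 4), (4, 5)]
print(len(B), len(Y))   # 4 4  -- |B| = |Y| = b(pi)
```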

Fig. 2 The breakpoint graph G(π) for π = (4 2 1 3). Gray edges are thin, black edges are thick.

An alternating cycle of G(π) is a cycle whose edges are alternately gray and black, without edge repetitions but possibly with node repetitions. For example, edges (0, 4), (4, 3), (3, 1), (1, 0) and (4, 2), (2, 3), (3, 5), (5, 4) form alternating cycles in the graph of Figure 2. An alternating-cycle decomposition of G(π) is a collection of edge-disjoint alternating cycles such that every edge of G(π) is contained in exactly one cycle of the collection. It is easy to see that G(π) always admits an alternating-cycle decomposition. In the graph of Figure 2, the alternating cycles (0, 4), (4, 3), (3, 1), (1, 0) and (4, 2), (2, 3), (3, 5), (5, 4) form an alternating-cycle decomposition. Let c(π) be the maximum cardinality of an alternating-cycle decomposition of G(π). Bafna and Pevzner [7] proved the following property, which turns out to be instrumental for the practical solution of SBR:

Theorem 1. ([7]) For every permutation π, d(π) ≥ b(π) − c(π).

For x ∈ ℝ, define LB(x) := b(π) − ⌊x⌋. From the theorem it follows that, whenever x ≥ c(π), LB(x) is a valid lower bound which can be used in a Branch-and-Bound approach for SBR. Furthermore, the lower bound LB(c(π)) turns out to be very strong, and most of the time its value matches the optimal value of SBR. In particular, Caprara proved in [17] that the probability that d(π) > LB(c(π)) for a random permutation π with n elements is Θ(1/n^5).
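On very small permutations the bound can be checked directly. The following brute-force breadth-first search (our own illustration; exponential, usable only for tiny n) computes d(π) exactly. For π = (4 2 1 3) we have b(π) = 4 and, since the decomposition of Figure 2 uses all eight edges in two cycles, c(π) = 2; the bound d(π) ≥ 4 − 2 = 2 is met with equality:

```python
from collections import deque

def reversal_distance(perm):
    """Exact d(pi) by BFS over permutations; feasible only for very small n."""
    n = len(perm)
    target = tuple(range(1, n + 1))
    dist = {tuple(perm): 0}
    queue = deque([tuple(perm)])
    while queue:
        p = queue.popleft()
        if p == target:
            return dist[p]
        for i in range(n):                 # try every reversal rho_{i+1, j+1}
            for j in range(i + 1, n):
                q = p[:i] + tuple(reversed(p[i:j + 1])) + p[j + 1:]
                if q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)
    return None

# d((4 2 1 3)) = 2: e.g. rho_{13} gives (1 2 4 3), then rho_{34} gives (1 2 3 4)
print(reversal_distance([4, 2, 1, 3]))  # 2
```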

The computation of the lower bound motivates the study of the Alternating-Cycle Decomposition (ACD) problem: given the breakpoint graph G(π) of a permutation π, find a maximum-cardinality alternating-cycle decomposition of G(π). Computing c(π) is still NP-hard but, from a practical point of view, there is a great advantage in dealing with ACD instead of SBR. Namely, while there seems to be no easy and/or clean ILP formulation for SBR, there is a strong, natural ILP formulation for ACD, with one variable for each alternating cycle of G(π). Furthermore, the LP relaxation of the model yields a tight upper bound on c(π) which, in turn, translates into a tight lower bound for d(π). Let 𝒞 denote the set of all the alternating cycles of G(π), and for each C ∈ 𝒞 introduce a binary variable x_C. A natural ILP model for ACD (see [22]) is

max ∑_{C ∈ 𝒞} x_C    (1)

subject to

∑_{C ∋ e} x_C ≤ 1,   e ∈ E,    (2)

x_C ∈ {0, 1},   C ∈ 𝒞.    (3)

If c̄(π) is the optimal value of the LP relaxation of (1)-(3), obtained by replacing constraints (3) with

x_C ≥ 0,   C ∈ 𝒞,    (4)

then c̄(π) is a valid upper bound on c(π), and hence LB(c̄(π)) is a valid lower bound on d(π).

Solving the LP relaxation (1), (2) and (4) amounts to solving an LP with |𝒞| = O(2^n) variables. This is done by column generation techniques (see e.g. [8]), starting from a "small" LP with a restricted subset of the variables and iteratively solving the small LP, testing whether the solution is optimal for the overall LP, and, if not, adding a few variables with positive reduced cost to the small LP. The optimality test, namely the detection of variables with positive reduced cost, is called the column generation (or pricing) problem.

For this LP, the pricing problem is the following: given a weight u_e associated with each edge e ∈ E, find an alternating cycle C ∈ 𝒞 such that

∑_{e ∈ C} u_e < 1,    (5)

or prove that none exists. In [22] a polynomial-time algorithm for the pricing problem is described. The algorithm reduces the pricing problem to the solution of up to n + 1 min-cost perfect matching problems in a suitably-defined (non-bipartite) graph.

Although polynomial, the pricing algorithm, requiring the solution of min-cost general matching problems, can be computationally quite expensive. To overcome this drawback, it is possible to utilize an alternative LP relaxation. In the new model, the set of variables is enlarged so as to also include alternating cycles that go through the same edge twice. Let us then call pseudo alternating cycle an alternating cycle in which at least one edge is traversed twice, but no edge is traversed more than twice. An example of a pseudo alternating cycle is given in Figure 3. Let 𝒫 be the set of pseudo alternating cycles of G(π), and let 𝒮 := 𝒞 ∪ 𝒫 denote the set of surrogate alternating cycles of G(π), i.e., the set of all alternating cycles that may possibly pass through an edge more than once, but no more than twice.

Fig. 3 Pseudo alternating cycle C = b_1, y_1, b_2, y_2, b_3, y_3, b_2, y_4, where µ_{b_2,C} = 2 and µ_{e,C} = 1 for e ∈ C \ {b_2}.

The new LP relaxation (proposed in [23]) is the counterpart of (1), (2) and (4), where the set 𝒞 of alternating cycles is replaced by the set 𝒮 of surrogate alternating cycles. In other words, the LP model reads

max ∑_{C ∈ 𝒮} x_C    (6)

subject to

∑_{C ∋ e} µ_{e,C} x_C ≤ 1,   e ∈ E,    (7)

x_C ≥ 0,   C ∈ 𝒮,    (8)

where, for each C ∈ 𝒮 and e ∈ C, µ_{e,C} is the number of times that edge e appears in the sequence defining the surrogate alternating cycle C. Obviously, µ_{e,C} = 1 for each alternating cycle C and edge e ∈ C, while µ_{e,C} = 2 for at least one edge e of a pseudo alternating cycle C. It can be shown that in every integer solution of (6)-(8), x_C = 0 for all C ∈ 𝒫; therefore, by solving LP (6)-(8) instead of (1), (2) and (4), the bound obtained may be weaker, but if the optimal LP solution is integer then it is an optimal ACD solution.

The column generation problem for the new model requires finding a surrogate alternating cycle C ∈ 𝒮 such that ∑_{e ∈ C} µ_{e,C} u_e < 1 for a given vector u. Call ∑_{e ∈ C} µ_{e,C} u_e the weight of C ∈ 𝒮.

Fig. 4 Graph G(π) and the associated graph D(π).

In order to solve the pricing problem, construct an arc-weighted directed graph D(π) = (V, A) from G(π) and u as follows. D(π) has the same node set V as G(π) and, roughly speaking, each path consisting of a black and a gray edge in G(π) is replaced by an arc in D(π). Formally, for each node pair i, j ∈ V, D(π) has an arc (i, j) ∈ A if there exists k ∈ V such that (i, k) ∈ B and (k, j) ∈ Y. The arc weight of (i, j) is given by u_{(i,k)} + u_{(k,j)}. Figure 4 shows the graph D(π) associated with the graph G(π) of Figure 2.

It is not difficult to see that each surrogate alternating cycle of G(π) corresponds to a simple directed cycle (a dicycle) of D(π) of the same weight, and vice versa. Hence, G(π) contains a surrogate alternating cycle C ∈ 𝒮 of weight < 1 if and only if D(π) contains a dicycle of weight < 1. Therefore, the pricing problem is basically reduced to the search for a minimum-weight dicycle in D(π), and this problem can be solved via the solution of up to n weighted bipartite matching problems (i.e., assignment problems) on the weight matrix of D(π) (see [23] for details). By using efficient implementations of the Hungarian method (see e.g. [31]), the overall computing time is O(n² log n).
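The construction of D(π) can be sketched as follows. Here the assignment-based dicycle search of [23] is replaced by brute-force enumeration (fine only for toy instances), the arc weight is taken as the sum of the duals of the two underlying edges so that a dicycle weighs as much as the cycle it encodes, and the dual values u_e = 0.2 are made up for illustration:

```python
from itertools import permutations

def build_D(B, Y, u):
    """Arcs of D(pi): (i, j) whenever some k gives a black edge {i, k} and a
    gray edge {k, j}; the edges of G(pi) are treated as undirected."""
    arcs = {}
    for be in B:
        for ge in Y:
            for (i, k) in (be, be[::-1]):
                for (kk, j) in (ge, ge[::-1]):
                    if k == kk:
                        w = u[be] + u[ge]
                        if w < arcs.get((i, j), float("inf")):
                            arcs[(i, j)] = w
    return arcs

def min_weight_dicycle(arcs):
    """Minimum-weight simple dicycle by exhaustive search (tiny graphs only)."""
    nodes = sorted({v for a in arcs for v in a})
    best = float("inf")
    for r in range(1, len(nodes) + 1):
        for cyc in permutations(nodes, r):
            if cyc[0] != min(cyc):          # fix the rotation to avoid duplicates
                continue
            w = 0.0
            for a, b in zip(cyc, cyc[1:] + cyc[:1]):
                if (a, b) not in arcs:
                    break
                w += arcs[(a, b)]
            else:
                best = min(best, w)
    return best

# Breakpoint graph of pi = (4 2 1 3), with a made-up dual value of 0.2 per edge
B = [(0, 4), (4, 2), (1, 3), (3, 5)]
Y = [(0, 1), (2, 3), (3, 4), (4, 5)]
u = {e: 0.2 for e in B + Y}
arcs = build_D(B, Y, u)
w = min_weight_dicycle(arcs)
print(w < 1)  # True: a surrogate cycle of weight < 1 (an improving column) exists
```

The minimum here is the 2-arc dicycle (0, 3), (3, 0), which encodes the 4-edge alternating cycle (0, 4), (4, 3), (3, 1), (1, 0) of Figure 2.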


Recall that c̄(π) is the optimal value of the original LP, and let c′(π) be the optimal value of LP (6)-(8). The adoption of the new model in place of the original LP (1), (2) and (4) involves a trade-off. In particular, while c′(π) can be obtained much more quickly than c̄(π), the bound LB(c′(π)) could potentially be quite a bit weaker than LB(c̄(π)), thus being less effective when used within a Branch-and-Bound search for SBR. Luckily, it turns out that c′(π) is never too far from c̄(π). As a matter of fact, extensive computational tests have shown that the two bounds are almost always identical. Furthermore, it can be proved that c′(π) ≥ (3/4) c̄(π) always.

Coming back to SBR, the final procedure is basically a Branch-and-Price algorithm, in which the lower bound at each node of the search tree is LB(c′(π)), and the branching is done by making some suitable changes to the permutation π. The details of the Branch-and-Price procedure can be found in [23].

2.2 Computational results

Prior to the Branch-and-Price model that we just discussed, the best exact algorithm for SBR was a combinatorial Branch-and-Bound by Kececioglu and Sankoff [42], capable of solving, within minutes of computing time, random instances with up to about n = 30 elements. An instance of 40 elements would take hours, and the procedure would quickly collapse for instances with a larger n.

The use of the mathematical programming approach dramatically changed the size of solvable instances. In particular, the approach was able to solve within seconds random instances of 100 elements, and within minutes random instances of 200 elements. The approach was also utilized for the solution of what is now the largest hard instance known to have been solved to optimality (a 1000-element permutation obtained by applying 500 random reversals to the identity [20]). Notice that for SBR, random permutations, either drawn uniformly at random or obtained from the application of a large number of random reversals to the identity, represent the hardest instances, while real-life instances are typically much easier to solve.

With respect to the latter, the algorithm was tested on the largest real-world data set available, i.e., 138 common genes between the human and mouse genomes. The input is a partially signed permutation, namely, the orientation of only 47 genes is known.

From this input two instances were derived, the first by ignoring the signs of the elements and the second by considering the actual partially signed permutation. The instances were solved in less than 10 seconds each. The results showed that the minimum number of reversals needed to turn the mouse genome into the human genome is 106 if signs are ignored and 118 if they are taken into account. The Branch-and-Price for SBR was the first algorithm capable of finding a provably optimal solution for these two instances.

Finally, the algorithm was run on a number of smaller permutations associated with mitochondrial genomes. The size of these permutations is much smaller than the man-mouse one above, namely each permutation has 37 elements. The code was used to compute the distances between each pair of permutations. The overall time for computing the whole table was 4.5 seconds. This shows that the reversal distance for permutations of this size can be computed very fast, and therefore one may afford to compute this distance several times within algorithms that try to reconstruct evolutionary trees (see, e.g., [57]).

2.3 Related works and open problems

Exponential-size ILP models have been used for some variants of SBR and other genome rearrangement problems. One of these problems is the Reversal Median Problem (RMP), which calls for a signed permutation minimizing the total reversal distance from a set of given signed permutations π_1, . . . , π_q. In [18] Caprara showed that RMP is NP-hard, and that the same complexity holds for the Steiner Tree problem (called Tree SBR, or TSBR) with centers in the given set of signed permutations. For both RMP and TSBR he described a (2 − 2/q)-approximation algorithm and gave exponential-size ILP formulations. Median problems are generally aimed at the reconstruction of evolutionary trees. In particular, the Breakpoint Median Problem (BMP), introduced by Sankoff and Blanchette in [56], calls for determining a permutation which minimizes the total breakpoint distance from a set of given permutations, and was widely used as a subroutine by algorithms reconstructing evolutionary trees. In [19] Caprara proposed effective algorithms (both exact and heuristic) for the BMP based on an exponential-size formulation, and showed that the problem can be essentially reduced to finding a perfect matching in a graph that forms the maximum number of cycles jointly with q given perfect matchings.

The approach is currently used within many state-of-the-art software packages for evolutionary trees (such as the program GRAPPA by Moret et al. [50]).

With respect to open problems, there are two worth mentioning in the field of genome rearrangements. The first concerns Sorting By Transpositions (SBT), which is the analogue of SBR when the evolutionary events are transpositions only. The complexity of SBT remained open for more than 20 years, but recently it was shown that SBT is NP-hard [14]. There is no exact algorithm for SBT based on mathematical programming techniques (as a matter of fact, to the best of our knowledge, no papers describing exact algorithms for SBT have been published yet). The second problem would be to give an effective algorithm (perhaps based on a mathematical programming formulation) for the general scenario in which many types of events are considered together (e.g., reversals, transpositions, reversed transpositions, deletions, etc.) and where each event may also have an associated cost (e.g., the cost of a reversal could depend on the length of the inverted fragment).


Hapl. 1, paternal: taggtccCtatttCccaggcgcCgtatacttcgacgggTctata
Hapl. 1, maternal: taggtccGtatttAccaggcgcGgtatacttcgacgggTctata
Hapl. 2, paternal: taggtccCtatttAccaggcgcGgtatacttcgacgggTctata
Hapl. 2, maternal: taggtccGtatttCccaggcgcGgtatacttcgacgggCctata
Hapl. 3, paternal: taggtccCtatttAccaggcgcGgtatacttcgacgggTctata
Hapl. 3, maternal: taggtccGtatttAccaggcgcCgtatacttcgacgggCctata

Fig. 5 The haplotypes of 3 individuals, with 4 SNPs.

3 Haplotyping of populations

A single nucleotide polymorphism (SNP, pronounced "snip") is a specific nucleotide in the genome showing a statistically significant variability within a population. Apart from very rare exceptions, at each SNP site only two bases (out of A, T, C and G) are observed, and they are called the SNP alleles. Being the predominant form of human polymorphism, SNPs are widely used in therapeutic, diagnostic, and forensic applications, and their importance can hardly be overestimated.

Humans are diploid organisms, i.e., their DNA is organized in pairs of chromosomes. For each pair of chromosomes, one chromosome copy is inherited from the father and the other copy is inherited from the mother. For a given SNP, an individual can be either homozygous (i.e., possess the same allele on both chromosomes) or heterozygous (i.e., possess two different alleles). The values of a set of SNPs on a particular chromosome copy define a haplotype. In Fig. 5, we illustrate a simplistic example of three individuals and four SNPs. The alleles for SNP 1, in this example, are C and G. Individual 1, in this example, is heterozygous for SNPs 1, 2 and 3, and homozygous for SNP 4. His haplotypes are CCCT and GAGT.

Haplotyping an individual consists in determining his two haplotypes, for a given chromosome. With the larger availability of SNP genomic data, recent years have seen the birth of many new computational problems related to haplotyping (see [36, 44]). These problems are motivated by the fact that it is very expensive to determine the haplotypes experimentally, while there is a cheap experiment that determines the (less informative and often ambiguous) genotypes, from which the haplotypes must then be retrieved computationally.

A genotype provides information about the multiplicity of each SNP allele: i.e., for each SNP site, the genotype specifies whether the individual is heterozygous or homozygous (in the latter case, it also specifies the allele). The ambiguity comes from heterozygous sites, since, to retrieve the haplotypes, one has to decide how to distribute the two allele values on the two chromosome copies. Resolving a genotype requires determining two haplotypes such that, if they are assumed to be the two chromosome copies, then the multiplicity of each SNP allele yields exactly that genotype. Note that, for a genotype with k heterozygous sites, there are 2^{k−1} pairs of distinct haplotypes that could resolve it. Given a set of genotypes, the general haplotyping problem requires determining a set of haplotypes such that each genotype is resolved by two of them. In the Parsimony Haplotyping Problem (PHP), one is interested in finding a smallest-size set of resolving haplotypes.
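The 2^{k−1} count is easy to verify by enumeration. The following sketch (the helper resolutions is ours; genotypes encoded over {0, 1, 2} as introduced below, with 2 marking a heterozygous site) generates all distinct resolving pairs of a genotype:

```python
from itertools import product

def resolutions(g):
    """All distinct unordered pairs {h', h''} resolving genotype g
    (2^(k-1) of them when g has k >= 1 heterozygous sites)."""
    amb = [i for i, x in enumerate(g) if x == 2]   # heterozygous positions
    seen = set()
    for bits in product((0, 1), repeat=len(amb)):
        h1, h2 = list(g), list(g)
        for i, b in zip(amb, bits):                # split each '2' across the pair
            h1[i], h2[i] = b, 1 - b
        seen.add(frozenset((tuple(h1), tuple(h2))))
    return seen

g = (2, 2, 2, 1)            # genotype 1 of Fig. 6: k = 3 heterozygous sites
print(len(resolutions(g)))  # 4, i.e. 2^(k-1)
```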

Among the several objective functions proposed for haplotyping, parsimony is a model whose importance is now recognized as crucial, and it is instrumental also in the solution of more complex haplotyping problems.

The computational study of the PHP was first investigated by Gusfield [35], who adopted an Integer Programming formulation for its practical solution. The problem is APX-hard in general [46], but there are special cases that are polynomially solvable [37, 47].

3.1 A Branch-and-Cut-and-Price approach

Many optimization approaches have been tried for the solution of the PHP but, generally speaking, these models run into trouble when trying to solve "large" instances (where the most critical parameter is the number of heterozygous sites per genotype). In this section we survey a successful ILP approach which was able to outperform previous optimization approaches and increase the size of solvable instances by almost an order of magnitude [48].

The approach is based on formulating the PHP as a particular Set Covering (SC) problem, in which each possible haplotype corresponds to a set. The SC formulation has an exponential number of variables and constraints, and it employs effective pricing and separation procedures that generate variables and violated inequalities at run-time. The resulting algorithm is able to tackle instances that could not even be input, let alone solved, with the software based on previous formulations. The good performance of the approach is due to two main reasons: on one hand, the LP relaxation bound of the SC model is quite strong; on the other hand, the method includes an effective heuristic, based on the LP solution, to find good PHP feasible solutions.

Haplotype 1, paternal: 0 1 0 1
Haplotype 1, maternal: 1 0 1 1    Genotype 1: 2 2 2 1

Haplotype 2, paternal: 0 0 1 1
Haplotype 2, maternal: 1 1 1 0    Genotype 2: 2 2 1 2

Haplotype 3, paternal: 0 0 1 1
Haplotype 3, maternal: 1 0 0 0    Genotype 3: 2 0 2 2

Fig. 6 Haplotypes and corresponding genotypes.

Let us introduce some basic notation and definitions. Given a set of n SNPs, fix arbitrarily a binary encoding of the two alleles for each SNP (i.e., call one of the two alleles '0' and the other '1'). Once the encoding has been fixed, each haplotype becomes a binary vector of length n.

For a haplotype h, denote by h_i the value of its i-th component, with i = 1, . . . , n.

Given two haplotypes h' and h'', their sum is the vector h' ⊕ h'', where the binary operator ⊕ is defined, component-wise, as

(h' ⊕ h'')_i :=  0 if h'_i = h''_i = 0,  1 if h'_i = h''_i = 1,  2 if h'_i ≠ h''_i.

A vector g = h' ⊕ h'' is called a genotype. In general, any vector g ∈ {0, 1, 2}^n is a genotype. In Fig. 6, we illustrate a case of three individuals, showing their haplotypes and genotypes.
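As a concrete check, the ⊕ operator is easy to sketch in code; the following Python fragment (the function name hsum is ours, not from the original paper) reproduces the three genotypes of Fig. 6:

```python
def hsum(h1, h2):
    """Component-wise sum h1 (+) h2 of two binary haplotypes: the result
    has 0/1 at sites where the haplotypes agree and 2 where they differ."""
    return [0 if a == b == 0 else 1 if a == b == 1 else 2
            for a, b in zip(h1, h2)]

# the three individuals of Fig. 6
print(hsum([0, 1, 0, 1], [1, 0, 1, 1]))  # [2, 2, 2, 1]
print(hsum([0, 0, 1, 1], [1, 1, 1, 0]))  # [2, 2, 1, 2]
print(hsum([0, 0, 1, 1], [1, 0, 0, 0]))  # [2, 0, 2, 2]
```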

• Resolutions and ambiguity: Let g be a genotype. A pair of haplotypes {h', h''} such that g = h' ⊕ h'' is a resolution of g. The haplotypes h' and h'' are said to resolve g. Let G be a set of genotypes and H a set of haplotypes such that each g ∈ G has a resolution in H. Then H resolves G, and is called a resolving set for G. The problem (PHP) is then formally defined as follows: Given a set G of genotypes, find a resolving set Ĥ for G of minimum cardinality.

For a genotype g, each position i such that g_i = 2 is called an ambiguous position.

Let A(g) denote the set of ambiguous positions of g. A genotype is ambiguous if it has more than one resolution, i.e., if |A(g)| ≥ 2. Note that genotype entries with value 0 or 1 correspond to homozygous SNP sites, while ambiguous positions correspond to heterozygous sites.

• Compatibility: A haplotype h is compatible with a genotype g if there exists a haplotype h' such that g = h ⊕ h' (this is possible if and only if g_i = h_i whenever g_i ≠ 2). Clearly, a genotype can be resolved only by compatible haplotypes. For a genotype g, denote by H(g) the set of haplotypes that are compatible with g.

Given a set of genotypes G, let H_G = ∪_{g∈G} H(g). The set H_G contains all possible haplotypes that can be used to resolve G. Conversely, for a haplotype h, denote by G(h) the set of all genotypes g ∈ G such that g is compatible with h.

Two genotypes g and g' are compatible if there exists a haplotype h compatible with both of them. Given a set G of genotypes, the compatibility graph over G is the graph which has vertex set G and an edge between each pair of compatible genotypes.
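A minimal sketch of compatibility and of the enumeration of H(g) (the names compatible and compatible_set are ours); it checks the claim, made in the example below, that |H(g_1)| = 8 for g_1 = 22021:

```python
from itertools import product

def compatible(h, g):
    """h is compatible with g iff they agree at every non-ambiguous site."""
    return all(gi == 2 or hi == gi for hi, gi in zip(h, g))

def compatible_set(g):
    """H(g): the ambiguous sites range freely over {0,1}, the rest are fixed."""
    choices = [(0, 1) if gi == 2 else (gi,) for gi in g]
    return [list(t) for t in product(*choices)]

g1 = [2, 2, 0, 2, 1]
H_g1 = compatible_set(g1)
assert len(H_g1) == 8                      # 2^3 choices at the three free sites
assert all(compatible(h, g1) for h in H_g1)
assert [1, 0, 0, 1, 1] in H_g1             # the haplotype 10011
```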

• Cliques and s-cliques: A set K of genotypes is a clique of genotypes (or, for short, a clique) if its compatibility graph is a clique. The definition implies that K is a clique if and only if there exists at least one haplotype compatible with each g in K. If we represent K as a |K| × n matrix A over {0, 1, 2}, where rows correspond to elements of K, then K is a clique if and only if no column of A has both a 0 and a 1 on two rows.

Given a set G of genotypes, a set K ⊆ G is a selectable clique (or, for short, an s-clique) if there exists a haplotype h such that K = G(h). Such an h is called a selector of K. Note that G(h) is a clique, but not all cliques are selectable.


Denote by H(K) := {h | G(h) = K}, and let 𝒦 := {G(h) | h ∈ H_G} be the family of all s-cliques. The family 𝒦 induces a partition of H_G into the sets {H(K) | K ∈ 𝒦}, i.e., H_G is partitioned into the sets of selectors of the same s-clique.

For example, let G = {g_1, . . . , g_5} with g_1 = 22021, g_2 = 12222, g_3 = 12211, g_4 = 20120 and g_5 = 00102. The compatibility graph has edges g_1g_2, g_1g_3, g_2g_3, g_2g_4 and g_4g_5. It can be checked that H_G consists of 23 haplotypes. For instance, H(g_1) = {00001, 00011, 01001, 01011, 10001, 10011, 11001, 11011}. There are 7 s-cliques. Two of them are K_{123} = {g_1, g_2, g_3} and K_{24} = {g_2, g_4}. The selectors of K_{123} are 10011 and 11011, while H(K_{24}) = {10100, 10110}.

The partitioning of the space of all haplotypes into s-cliques will be used by the pricing procedure, which goes through each s-clique of G to check whether it is profitable to price-in one of its selectors.

The main idea on which the set covering formulation is based is the following observation: if H is a set of haplotypes resolving G, then a strong covering condition is implied, namely: for each genotype g, position i ∈ A(g) and value a ∈ {0, 1}, there is a haplotype h ∈ H ∩ H(g) such that h_i = a.

This condition is only necessary, but not sufficient, for H to be a feasible solution of the PHP. Consider, for example, the following "diagonal" instance G = {1222, 2122, 2212, 2221}. The set {0111, 1011, 1101, 1110} satisfies the covering condition but does not resolve G.
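The counterexample can be verified mechanically; the sketch below (helper names ours) confirms that the set above covers every triple (g, i, a) but resolves no genotype of the diagonal instance:

```python
def hsum(h1, h2):
    # component-wise (+) of two haplotypes, as defined earlier
    return [0 if a == b == 0 else 1 if a == b == 1 else 2
            for a, b in zip(h1, h2)]

def compatible(h, g):
    return all(gi == 2 or hi == gi for hi, gi in zip(h, g))

def covering_holds(H, G):
    """Strong covering condition: for every g, ambiguous position i and
    value a, some compatible h in H has h_i = a."""
    return all(any(compatible(h, g) and h[i] == a for h in H)
               for g in G for i, gi in enumerate(g) if gi == 2
               for a in (0, 1))

def resolved(H, g):
    return any(hsum(h1, h2) == g for h1 in H for h2 in H)

G = [[1, 2, 2, 2], [2, 1, 2, 2], [2, 2, 1, 2], [2, 2, 2, 1]]
H = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
assert covering_holds(H, G)
assert not any(resolved(H, g) for g in G)
```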

This condition is a covering condition because it represents the covering constraint for a set cover problem in which the universe is the set of all triples (g, i, a), for g ∈ G, i ∈ A(g) and a ∈ {0, 1}, and each haplotype h represents a subset of the universe, namely h ↔ {(g, i, a) | g ∈ G(h), i ∈ A(g), h_i = a}. This SC is in fact a relaxation of the PHP. It is possible that the optimal solution of SC resolves G. If not, one may still obtain a good feasible PHP solution from the optimal cover by adding only a small number of haplotypes, which is the idea exploited by the effective heuristic presented in [48].

We now describe the exponential ILP formulation based on the above SC. The formulation has a binary variable x_h for each possible haplotype h ∈ H_G. The main constraints correspond to the covering condition outlined above. However, as we observed, these alone are not sufficient to guarantee that the optimal solution is indeed feasible for the PHP. In order to ensure feasibility, some suitable cuts are introduced. Notice that this is not customary, since usually cuts are employed to eliminate fractional solutions, while integer solutions are expected to be feasible. In particular, let us call a set H' of haplotypes insufficient if H' does not resolve G. For an insufficient set H', let U(H') be the set of unresolved genotypes, i.e., genotypes which have no resolution in H'. Let g ∈ U(H') be an unresolved genotype, and let C(g, H') := H(g) − H'. An insufficient set H' and an unresolved genotype g ∈ U(H') then give rise to the following cut:

x(C(g, H')) ≥ 1. (9)


Let us call 𝒩' the set of all pairs (g, H') with H' an insufficient set of haplotypes and g ∈ U(H'). By using the above cuts, the PHP can be formulated as:

min ∑_{h ∈ H_G} x_h    (10)

subject to

∑_{h ∈ H_i^a(g)} x_h ≥ 1    ∀ g ∈ G, i ∈ A(g), a ∈ {0, 1}    (11)

x(C(g, H')) ≥ 1    ∀ (g, H') ∈ 𝒩'    (12)

x ∈ {0, 1}^{H_G}    (13)

where, for g a genotype, i ∈ A(g) and a = 0, 1, the set H_i^a(g) is defined as the set of all haplotypes compatible with g that have the value a in position i. Notice that the sets H_i^0(g) and H_i^1(g) are a partition of H(g), and if g has k ≥ 1 ambiguous positions, then |H_i^0(g)| = |H_i^1(g)| = 2^{k-1}.

The formulation (10)-(13) was introduced in [48] for the solution of the PHP. The formulation has an exponential number of variables and constraints. The procedure keeps one global set 𝒳 ⊂ H_G of generated variables and one global set 𝒩 ⊂ 𝒩' of generated cuts (from the family of constraints (12), while constraints (11) are always present in the model) and, during the solution process, these sets are updated by adding, when needed, new variables or new cuts. The overall procedure is, therefore, a Branch-and-Cut-and-Price approach.

As we already mentioned, the pricing of the variables is done by going through all s-cliques, one at a time. The algorithm then maintains a compact data structure in the form of a binary tree, called the s-clique tree, representing the s-cliques of G at the leaves.

The data structure allows also for a compact representation of the selectors of each s-clique, by means of its pattern table. A pattern p is an n-string over {0, 1, ∗}. In a natural way, a pattern represents the hypercube Q(p) := {b ∈ {0, 1}^n | b_i = p_i ∀ i : p_i ≠ ∗}, i.e., the set of 0-1 vectors that can be obtained by replacing each "∗" of p with 0 or 1 in all possible ways. If we associate to a pattern p the virtual genotype g^p defined by setting g^p_i = p_i if p_i ∈ {0, 1} and g^p_i = 2 if p_i = ∗, then p represents the set of haplotypes H(g^p). A pattern table P for an s-clique K is a set of patterns such that ∪_{p∈P} Q(p) = H(K). A pattern table for the s-clique K_{24} of the example consists of the single pattern 101∗0, since H(K_{24}) = {10100, 10110}. The s-clique tree provides compact representations of the s-cliques in terms of pattern tables. In particular, the ratio between |H(K)| and |P| can be of several orders of magnitude (in the section on open problems, we will discuss the problem of representing a set of haplotypes by its minimum-size pattern table).
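Expanding a pattern into its hypercube Q(p) is a one-liner; this sketch (the function name expand is ours) checks the pattern table of K_{24} from the example:

```python
from itertools import product

def expand(p):
    """Q(p): all 0-1 strings obtained by replacing each '*' of p
    with 0 or 1 in all possible ways."""
    choices = [('0', '1') if c == '*' else (c,) for c in p]
    return {''.join(t) for t in product(*choices)}

assert expand('101*0') == {'10100', '10110'}  # H(K_24) from the example
assert len(expand('***')) == 8                # a pattern with k stars has 2^k points
```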

The pricing problem for the LP relaxation of (10)-(13) is solved as follows. Given non-negative weights α_{gi} and β_{gi}, corresponding to constraints (11) for a = 0 and a = 1 respectively, and γ_{gH'} corresponding to constraints (12), find, if any, a haplotype h ∈ H_G such that


∑_{g ∈ G(h)} ∑_{i ∈ A(g)} [(1 − h_i) α_{gi} + h_i β_{gi}] + ∑_{(g,H') ∈ 𝒩 : g ∈ G(h)} γ_{gH'} > 1.    (14)

The pricing problem can be solved by maximizing the lhs of (14) over all haplotypes. This, in turn, can be done by iterating over all s-cliques K and maximizing the lhs of (14) within the selectors of K.

For an s-clique K ∈ 𝒦, define

α(K, i) := ∑_{g ∈ K : i ∈ A(g)} α_{gi},    i = 1, . . . , n
β(K, i) := ∑_{g ∈ K : i ∈ A(g)} β_{gi},    i = 1, . . . , n
γ(K) := ∑_{(g,H') ∈ 𝒩 : g ∈ K} γ_{gH'}.

By using the above definitions, for each h ∈ H(K) the lhs of (14) can be rewritten as

λ(K, h) := ∑_i [(1 − h_i) α(K, i) + h_i β(K, i)] + γ(K)

and the pricing problem can be rephrased as: find ĥ ∈ H_G − 𝒳 such that λ(G(ĥ), ĥ) is maximum. If λ(G(ĥ), ĥ) > 1, then ĥ is a variable with negative reduced cost that should be added to the current LP. Otherwise, there are no variables that should be added to the current LP.

The haplotype ĥ can be found by computing max{λ(K, h) | K ∈ 𝒦, h ∈ H(K)}, which can be accomplished by going through all s-cliques one at a time. For each s-clique K, the optimal haplotype in H(K) is found by considering the pattern table P(K) representing K. For each pattern p ∈ P(K), the best haplotype in Q(p) is found by a greedy algorithm in time O(n), so that the pricing problem can be solved effectively overall. The computational cost of solving the pricing problem for one s-clique K whose pattern table has q patterns is O(∑_{g∈K} |A(g)| + qn + |𝒩|) (for further details, see [48]).
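The greedy step for a single pattern can be sketched as follows (names ours, a plausible reading of the O(n) routine of [48]): positions fixed by the pattern contribute a forced term, while each '∗' position independently takes the larger of α(K, i) and β(K, i), which maximizes λ(K, h) over Q(p):

```python
def best_in_pattern(p, alpha, beta, gamma):
    """Maximize lambda(K, h) = sum_i [(1-h_i)*alpha[i] + h_i*beta[i]] + gamma
    over h in Q(p); the choice at each '*' is independent of the others."""
    h, val = [], gamma
    for i, c in enumerate(p):
        if c == '0':
            h.append(0); val += alpha[i]
        elif c == '1':
            h.append(1); val += beta[i]
        elif beta[i] >= alpha[i]:   # free position: pick the better allele
            h.append(1); val += beta[i]
        else:
            h.append(0); val += alpha[i]
    return h, val

h, val = best_in_pattern('1*0*', [0.2, 0.5, 0.1, 0.3], [0.4, 0.1, 0.9, 0.7], 0.0)
assert h == [1, 0, 0, 1] and abs(val - 1.7) < 1e-9
```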

The overall Branch-and-Cut-and-Price procedure works as follows. At each node of the search tree a column-generation phase is carried out until no further haplotypes are priced-in. At that point, if the solution is feasible for the PHP, the node has been solved to optimality. Otherwise, a phase of cutting planes is carried out, which adds inequalities for all currently unresolved genotypes, and then the procedure reverts to the pricing phase. Each time an integer solution satisfying the covering condition but not feasible for the PHP is encountered, it is used to produce a heuristic solution for the PHP. This is done by first adding a "small" set of haplotypes (at most one for each unresolved genotype) and then removing possibly redundant haplotypes from the resulting set.


3.2 Computational results

The Branch-and-Cut-and-Price SC approach was compared to some of the existing methods, such as RTIP, the ILP proposed by Gusfield in [35], and the polynomial-size ILP formulation proposed by Brown and Harrower in [12]. The experiments were run on an "average" personal computer, with a time limit of two hours per instance. The formulation of [12] turned out to be too weak for the test instances (none of the instances could be solved within the time limit).

In a first set of computational experiments, the approach was tested on some real-data instances, taken from the SeattleSNPs database [59]. The database contains data for 47 individuals (24 African-American and 23 European subjects) and about 100 genes. For each gene, 47 genotypes are reported. A total of 15 random instances (corresponding to 15 random genes) were chosen for the tests. All instances were solved, within a maximum running time of 196 seconds. The number of branch-and-bound nodes ranged from a minimum of 1 to a maximum of 126, with an average of 10.

With some exceptions, these instances are relatively small, and so RTIP was able to tackle them effectively. Sixteen instances were solved by RTIP within a maximum time of 202 seconds, but four instances could not be solved within two hours. By using a much more powerful computer on the unsolved instances, three of them could eventually be solved by RTIP within two hours (with an average of more than one hour each), while one instance remained unsolved. Note that the four instances could be solved by SC on the slow computer with an average time of less than thirty seconds each.

In a second set of experiments, the approach was tested on instances generated by ms, a program developed by R. Hudson [39], which is regarded as the standard code for simulating haplotype populations. The haplotypes are generated according to coalescent theory, i.e., representing the evolution of the sampled haplotypes. By setting the recombination parameter of ms to ρ = 0 and ρ = 16 respectively, two sets of instances were generated. Each set contains six classes of ten instances each.

Each class is characterized by the number of SNPs, set to either 30 or 50, and by a target number of genotypes, set to either 30, 40 or 50.

The search space for the instances with ρ = 0 ranges from 10^4 to 10^9 haplotypes.

The hardest instances were for m = 40 and n = 50, requiring an average solution time of about 18 seconds for SC versus about 1 hour for RTIP. No instance with n = m = 50 could be solved by RTIP, while SC solved all of them with an average time of about 10 seconds. Overall, RTIP could not solve 28 instances out of 60, while SC left 7 unsolved instances. Over the instances which were solved by both methods, SC was about two orders of magnitude faster than RTIP.

The search space for the instances with ρ = 16 ranges from 10^3 to 10^7 haplotypes. The hardest instances were for n = m = 50. None of them could be solved by RTIP, and five could be solved by SC, taking an average time of about 1 minute each. Overall, RTIP could not solve 29 instances out of 60, while SC left 14 unsolved instances. Over the instances which were solved by both methods, SC was about one order of magnitude faster than RTIP.

3.3 Related works and open problems

Prior to the SC model, the PHP had been widely studied and many approaches of a different nature had been tried for its solution. The ILP proposed by Gusfield in [35] has an exponential number of variables (corresponding to H_G) and constraints, but no column-generation/row-generation techniques are used, and actually there seem to be no suitable pricing and separation procedures for the proposed model. Because of the model size, Gusfield's approach can only be used on "small" instances, with low levels of heterozygosity.

Several authors (see, e.g., [12, 46, 11]) have proposed polynomially-sized integer programming formulations for the PHP. The LP relaxations of these formulations are quite weak (even after the addition of some valid cuts), so that these formulations turn out to be less effective than exponential ones. In [13], Brown and Harrower propose a hybrid model, in which variables for a fixed subset of haplotypes in H_G are explicitly present (as in the exponential formulation of Gusfield), while the others are implicitly represented by a polynomial set of variables and constraints.

The quality of the bound depends on the choice of the explicit haplotypes, but it is still inferior to that of the exponential model. The polynomial/hybrid formulations were used for the solution of problems of similar size as those solved by Gusfield's exponential model.

A quadratic formulation, solved by semi-definite programming, was proposed by Kalpakis and Namjoshi [40]. Similarly to the exponential IP, the formulation has a variable for each possible haplotype and hence it cannot be used to tackle instances for which the set of possible haplotypes is too large. The size of the problems solved is comparable to that of the other cited methods.

The study of the SC formulation for the PHP left an interesting open problem, which can be stated as follows. Given a set H of haplotypes, what is the smallest set of genotypes G such that H_G = H? (The problem can also be stated in an "implicit" form, i.e., given a set G' of genotypes, what is the smallest set of genotypes G such that H_G = H_{G'}?) For example, if H = {001, 101, 000, 100, 010, 011} an optimal solution is G = {022, 102}. This is also the optimal solution for the instance G' = {201, 200, 012}, since H_{G'} = H. In a sense, this is a data-compression problem, since the information in G is sufficient to represent the whole information in H. We call this the Minimum Pattern Representation problem. To the best of our knowledge, neither the complexity of the problem nor algorithms for its solution have been studied yet.


4 Sequence alignment

A genomic sequence is a string over the 4-symbol alphabet of nucleotides or the 20-symbol alphabet of amino acids. Genomic sequences are routinely compared with each other, with the goal of identifying highly conserved DNA regions, spotting fatal mutations, or suggesting evolutionary relationships. In order to compare a set of sequences, the sequences must be aligned, i.e., arranged in a matrix, called a multiple alignment, one sequence per row. This is obtained by possibly inserting gaps (represented by the '-' character) in each sequence so that they all end up with the same length. The goal of identifying important patterns is pursued by attempting as much as possible to place the same character in every column. The following is a simple example of an alignment of the sequences ATTCGAC, TTCCGTG and ATCGTC:

A T T - C G A - C
- T T C C G - T G
A - T - C G - T C

The multiple sequence alignment problem has been formalized as an optimization problem. The most popular objective function for multiple alignment generalizes ideas from optimally aligning two sequences. This problem, called pairwise alignment, is formulated as follows: Given symmetric costs c(a, b) for replacing a symbol a with a symbol b and costs c(a, −) for deleting/inserting symbol a, find a minimum-cost set of operations that turn a sequence s' into a sequence s''. It is well known that this problem can be solved by dynamic programming in time and space O(l^2), where l is the maximum length of the two sequences. The value of an optimal solution is called the edit distance of s' and s'' and is denoted by d(s', s'').
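The O(l^2) dynamic program can be sketched as follows (a standard textbook recurrence, not code from the surveyed papers); with unit substitution and indel costs it computes the Levenshtein distance:

```python
def edit_distance(s1, s2, c):
    """O(l^2) dynamic program for the pairwise alignment value d(s1, s2):
    c(a, b) is the substitution cost, c(a, '-') the indel cost."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + c(s1[i - 1], '-')
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + c('-', s2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + c(s1[i - 1], '-'),        # delete
                          D[i][j - 1] + c('-', s2[j - 1]),        # insert
                          D[i - 1][j - 1] + c(s1[i - 1], s2[j - 1]))  # match/sub
    return D[m][n]

unit = lambda a, b: 0 if a == b else 1
# two sequences from the alignment example: one deletion plus one substitution
assert edit_distance("ATTCGAC", "ATCGTC", unit) == 2
```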

An alignment 𝒜 of two or more sequences is an array having the (gapped) sequences as rows. The value d_𝒜(s', s'') of an alignment 𝒜 of two sequences s' and s'' is obtained by adding up the costs for the pairs of characters in each column. It is easy to see that d(s', s'') = min_𝒜 d_𝒜(s', s''). More generally, the Sum-of-Pairs (SP) score of an alignment of a set of sequences {s_1, . . . , s_n} is obtained by adding the costs of the symbols matched up at the same positions, over all the pairs of sequences, i.e., SP(𝒜) := ∑_{1≤i<j≤n} d_𝒜(s_i, s_j) (where it is assumed that c(−, −) = 0).
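The SP score itself is straightforward to compute; the sketch below (function name ours) scores the example alignment of the three sequences above under unit substitution/indel costs:

```python
def sp_score(rows, c):
    """Sum-of-Pairs score: add the column costs over all pairs of rows;
    c('-', '-') = 0 is handled by the cost function itself."""
    n, L = len(rows), len(rows[0])
    return sum(c(rows[i][k], rows[j][k])
               for i in range(n) for j in range(i + 1, n) for k in range(L))

unit = lambda a, b: 0 if a == b else 1   # implies c('-', '-') = 0
A = ["ATT-CGA-C",
     "-TTCCG-TG",
     "A-T-CG-TC"]
assert sp_score(A, unit) == 12
```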

The dynamic programming algorithm for pairwise alignment can be generalized to multiple alignment, and has time complexity proportional to 2^n l^n for a problem with n sequences each of length at most l. In typical real-life instances, while n is usually fairly small, l is in the order of several hundreds, and the dynamic programming approach turns out to be infeasible for all but tiny problems. In fact, constructing optimal alignments was shown to be computationally expensive since the problem is NP-hard (Wang and Jiang [61]).


4.1 A Branch-and-Price approach

Due to the complexity of the alignment problem, most existing algorithms are heuristics based on the so-called "progressive" approach: the alignment is incrementally built by considering the sequences one at a time. Let {s_1, . . . , s_n} be the set of input sequences. One way to build a multiple alignment is to choose one of the sequences, say s_c, as the "center" and then align each of the remaining sequences to it, thus obtaining n − 1 pairwise alignments. The pairwise alignments are incrementally put together, in a consistent way, until a complete alignment is built. More precisely, let us assume that s_c = s_1. The first partial alignment 𝒜 has two rows and is identical to the optimal alignment of s_1 and s_2. Each column of 𝒜 has, in the first row, either a gap or a symbol of s_1. At this point a new row corresponding to s_3 is inserted, by using the optimal alignment of s_1 and s_3 (let us call it 𝒜_{1,3}) as a guide. Each time that in 𝒜_{1,3} a symbol α of s_3 is aligned with a symbol β of s_1, it will be put in the column of 𝒜 that starts with β. Each time that in 𝒜_{1,3} a symbol of s_3 is aligned with a gap, we insert a column of gaps into 𝒜, in the corresponding position. At this point we have a partial alignment with three rows. We then proceed in a similar way until all sequences have been inserted in 𝒜.

Suppose, for example, that the following are the optimal alignments of s_1 = ATTCG with s_2 = TTCG, s_3 = ATGT and s_4 = CCACG:

𝒜_{1,2} = A T T C G    𝒜_{1,3} = A T - T C G    𝒜_{1,4} = - - A T T C G
          - T T C G               A T G T - -               C C A - - C G

We then obtain the following sequence of partial multiple alignments:

A T T C G      A T - T C G      - - A T - T C G
- T T C G  →   - T - T C G  →   - - - T - T C G
               A T G T - -      - - A T G T - -
                                C C A - - - C G

The multiple alignment determined in this way is called a star alignment, since it can be represented by a star with center s_c and leaves s_i for i ≠ c. Let us denote by 𝒜_c the star alignment of center s_c. 𝒜_c has the property that, for each i ≠ c, d_{𝒜_c}(s_c, s_i) = d(s_c, s_i). Star alignments were proposed by Gusfield [34], who showed that, by choosing s_c so as to minimize ∑_i d(s_c, s_i), the value of the alignment never exceeds (2 − 2/n) times the optimum. Gusfield's algorithm became a popular heuristic for multiple alignment and its performance, in practice, is much better than the theoretical worst-case approximation ratio.
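The center choice in Gusfield's heuristic is just an argmin over the row sums of the pairwise edit-distance matrix; a minimal sketch (function name and matrix are ours):

```python
def best_center(dist):
    """Pick the center c minimizing sum_i d(c, i), given a symmetric
    matrix of pairwise edit distances."""
    return min(range(len(dist)), key=lambda c: sum(dist[c]))

# a hypothetical 4-sequence distance matrix; row sums are 9, 15, 17, 19
D = [[0, 2, 3, 4],
     [2, 0, 6, 7],
     [3, 6, 0, 8],
     [4, 7, 8, 0]]
assert best_center(D) == 0
```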

The star alignment can be considered as a progressive alignment which proceeds by first finding a tree spanning the sequences (in this particular case, a star), and then by using the tree as a guide for aligning them iteratively. As a matter of fact, given any spanning tree T over a set of sequences (viewed as vertices of a complete graph), it is possible to build a multiple alignment 𝒜_T of the sequences such that d_{𝒜_T}(s_u, s_v) = d(s_u, s_v) for all the pairs of sequences (s_u, s_v) connected by an edge of T (see [29]). By the triangle inequality, for such an alignment it is


SP(𝒜_T) = ∑_{1≤i<j≤n} d_{𝒜_T}(s_i, s_j) ≤ ∑_{1≤i<j≤n} d(P_{ij})    (15)

where P_{ij} is the unique path between s_i and s_j in T, and d(P_{ij}) is the sum of the edit distances for the edges along this path.

In general, it is not easy to come up with a sensible phylogenetic tree relating the sequences that can be used to compute a good multiple alignment. In particular, a star is hardly an interesting tree from an evolutionary point of view, while unrestricted trees may be more relevant and lead to better alignments. In order to obtain a heuristic more effective than Gusfield's star alignment, one can exploit a non-immediate connection with the area of Network Design and, in particular, with the Minimum Routing Cost Tree (MRCT) problem. The MRCT problem is defined as follows. We are given a complete graph G = (V, E), with edge weights d_e for e ∈ E. Let T be a spanning tree of G and let ≺ represent a total ordering of the nodes in V. For each pair of vertices i, j ∈ V let d(i, j, T) denote the length of the unique i-j path in T. The routing cost of T is defined as

r(T ) :=

(i, j) : i≺ j

d(i, j, T )

and the MRCT problem calls for determining a spanning tree of minimum routing cost. In network design, this tree represents the solution to the problem of connecting a set of points by a network so as to minimize the average distance between any two points. The MRCT is NP-hard to determine, but it has an effective ILP formulation, with an exponential number of variables.
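Evaluating r(T) for a given tree is a simple exercise; a sketch with names of our choosing (for small n, a traversal from every vertex suffices):

```python
from itertools import combinations

def routing_cost(n, tree_edges):
    """r(T): sum of tree path lengths over all unordered vertex pairs.
    tree_edges is a list of (u, v, weight) triples on vertices 0..n-1."""
    adj = {v: [] for v in range(n)}
    for u, v, w in tree_edges:
        adj[u].append((v, w)); adj[v].append((u, w))
    def dists(src):
        # path lengths from src; a tree has unique paths, so DFS is enough
        d, stack = {src: 0}, [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in d:
                    d[v] = d[u] + w; stack.append(v)
        return d
    return sum(dists(i)[j] for i, j in combinations(range(n), 2))

# a star with center 0 and three unit-length edges:
# pairs through the center cost 1 (x3) or 2 (x3), so r(T) = 9
assert routing_cost(4, [(0, 1, 1), (0, 2, 1), (0, 3, 1)]) == 9
```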

With the notion of routing cost at hand, we can reinterpret (15) as follows: for any tree T , it is SP(AT) ≤ r(T ). Therefore, one way to obtain a low-score alignment would be to use a tree which minimizes the right-hand side, i.e., a minimum routing costtree.

The MRCT problem can be formulated as the following integer linear program [30]. Let 𝒫 be the set of paths in G and let Π be the set of pairs of vertices. For each pair h = {i, j} ∈ Π, let 𝒫_h be the set of paths with endpoints i and j. For each P ∈ 𝒫, let d_P be the length of the path P. The model has decision variables x_P, for P ∈ 𝒫, used to select a path between each pair of vertices, and variables y_e, for e ∈ E, used to select the tree edges. The model is

min ∑_{P ∈ 𝒫} d_P x_P    (16)

subject to

∑_{P ∈ 𝒫_h} x_P = 1    h ∈ Π    (17)

∑_{P ∈ 𝒫_h : P ∋ e} x_P ≤ y_e    e ∈ E, h ∈ Π    (18)

∑_{e ∈ E} y_e = n − 1    (19)


y_e ≥ 0, integer, e ∈ E;  x_P ≥ 0, P ∈ 𝒫.    (20)

The model has exponentially many variables, but its LP relaxation can be solved in polynomial time by column-generation techniques. The pricing problem is the following: given real values u_h for each pair h ∈ Π and non-negative real values v_{eh} for each e ∈ E and h ∈ Π, find, if any, a pair h and a path P ∈ 𝒫_h such that

∑_{e ∈ P} (v_{eh} + d_e) < u_h.    (21)

If, for a fixed h = {i, j}, we define d'_e = v_{eh} + d_e for all e ∈ E, the pricing problem consists in finding an i-j path whose length, with respect to the costs d', is shorter than u_h. This problem can be solved by computing the shortest i-j path with respect to the costs d'. Note that the shortest path can be found in polynomial time, e.g., by Dijkstra's algorithm.
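A sketch of the pricing step for one fixed pair h = {i, j} (helper names and the toy data are ours): run Dijkstra on the adjusted costs d'_e = v_{eh} + d_e and compare the resulting path length with u_h:

```python
import heapq

def shortest_path_len(n, cost, src, dst):
    """Dijkstra on a complete graph with vertices 0..n-1; cost(u, v)
    returns the adjusted cost d'_uv = v_uv + d_uv of edge {u, v}."""
    dist = [float('inf')] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist[u]:
            continue  # stale queue entry
        for v in range(n):
            if v != u and du + cost(u, v) < dist[v]:
                dist[v] = du + cost(u, v)
                heapq.heappush(pq, (dist[v], v))
    return dist[dst]

# 4 vertices: the direct edge 0-3 is long, but the path 0-1-2-3 costs 3
d = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 2): 5, (1, 3): 5, (0, 3): 10}
cost = lambda u, v: d[(min(u, v), max(u, v))]
assert shortest_path_len(4, cost, 0, 3) == 3
# the column x_P of that path prices in whenever u_h > 3
```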

4.2 Computational results

The multiple alignment heuristic which uses the MRCT on the graph of sequences weighted by the edit distances was proposed in [30]. The heuristic has been tested on several families of proteins obtained from the public-domain database PRINTS [6]. Each family contains from 10 to 20 sequences, each with 200 to 600 amino acids. Each MRCT instance was solved within a time limit of 10 minutes. The results show that the alignment based on the MRCT improves over the best star alignment by about 7% (and up to 20% on some instances). Furthermore, by comparison with the lower bound given by ∑_{1≤i<j≤n} d(s_i, s_j), it was observed that the heuristic returns solutions which are within 8% of the optimum, on average, while star alignments were, on average, within 15% of the optimum. Finally, Gusfield reports that the star alignment algorithm is not suited for families of very dissimilar sequences. In these cases, the MRCT-based alignment leads to much better solutions, since for graphs with very dissimilar edge lengths, the minimum routing cost tree is almost never a star.

4.3 Related works

The linear gap model adopted by the SP objective considers a gap of length k (i.e., a run of k consecutive "-" within a row of the alignment) to be identical to k gaps of length one each. In particular, if the cost of deleting a single character is δ (called the indel cost), the cost of a gap of length k is equal to kδ. From a biological point of view, the linear model is somewhat unsatisfactory: in the linear model, a gap of length k is seen as k events of deletion of one nucleotide each, while it should more likely correspond to a unique event deleting k nucleotides at once. A more
