• Non ci sono risultati.

Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina

N/A
N/A
Protected

Academic year: 2021

Condividi "Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina"

Copied!
82
0
0

Testo completo

(1)

Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e

la Medicina

Giuseppe Lancia Università di Udine

(2)

MOLECULAR BIOLOGY

Human Genome Project (1990): read

and understand human (and other species) genome (DNA)

Scientific and practical applications (medicine, agriculture, forensic, ….)

multi-million project involving several countries

(3)

COMPUTER SCIENCE

To meet the goal we need computers and programs to deal with problems such as

- Huge data sets (billions of informations to be analized)

- Data Interpretation (contradictory, erroneous or inconsistent data)

- Data sharing and networks (online genomic data banks)

(4)

Computational (molecular) Biology

• Combinatorics, Discrete Math

• Combinatorial Optimization

• Integer Programming

• Complexity theory (Approximations and Hardness)

• Graph theory

• (but also Stringology, Data Bases, Neural Networks....)

“Optimization problems arising in the analysis, interpretation and management of large sets of

genomic data”

(5)

FIRST PHASE COMPLETION :

HUMAN GENOME SEQUENCE - 2001

World Consortium universities and labs

Celera Genomics (Craig Venter)

(6)

FIRST PHASE COMPLETION :

HUMAN GENOME SEQUENCE - 2001

(7)

Computational Biology born around ’80-’90

Algorithmic approaches (e.g. Dynamic Programming for alignment)

Computational complexity (e.g. NP-hardness of folding)

String-related problems, Information retrieval, Genomic data base….

……

mostly computer scientists dominated the field

(8)

Some NP-hard problems in C.B. are OPTIMIZATION PROBLEMS

These can be solved via mathematical programming techniques

INTEGER LINEAR PROGRAMMING

LAGRANGIAN RELAXATION

SEMIDEFINITE PROGRAMMING

QUADRATIC PROGRAMMING

O.R. people (and O.R. techniques) entered the field

(9)

The most important approach of this type is probably Integer Linear Programming

It allows the solution of NP-hard problems via specialized Branch and Bound algorithms

The lower bound comes from the Linear Programming relaxation of the model

(and can be computed in polynomial time)

(10)

The I.P. approach

(11)

1. Define integer (usually binary) variables

The I.P. approach

(12)

1. Define integer (usually binary) variables

The I.P. approach

2. Define linear constraints that feasible solutions must satisfy

(13)

1. Define integer (usually binary) variables

The I.P. approach

2. Define linear constraints that feasible solutions must satisfy

3. Define linear objective function that optimal solution must mini(maxi)mize

(14)

1. Define integer (usually binary) variables

The I.P. approach

2. Define linear constraints that feasible solutions must satisfy

3. Define linear objective function that optimal solution must mini(maxi)mize

4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound

(15)

1. Define integer (usually binary) variables

The I.P. approach

2. Define linear constraints that feasible solutions must satisfy

3. Define linear objective function that optimal solution must mini(maxi)mize

4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound

There can be an exponential number of variables. We need to make them implicit  BRANCH-AND-PRICE

(16)

1. Define integer (usually binary) variables

The I.P. approach

2. Define linear constraints that feasible solutions must satisfy

3. Define linear objective function that optimal solution must mini(maxi)mize

4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound

There can be an exponential number of variables. We need to make them implicit  BRANCH-AND-PRICE

There can be an exponential number of constraints. We need to make them implicit  BRANCH-AND-CUT

(17)

Some Integer Programming models in C.B.

Haplotyping

Clark’s rule (Gusfield )

Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower) Fragment assembly (Lancia+Bafna+Schwartz+Istrail)

Protein folding

Energy potentials (Wagner+Meller+Elber) Folding ab initio (Carr+Hart+Greenberg) Threading (RAPTOR, Xu+Li+Kim+Xu)

Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller ) Fold comparison (Carr+Lancia+Istrail+Walenz,Caprara+Lancia)

Sequence Alignment and consensus Lenhof+Vingron+Reinert

Althaus+Caprara+Lenhof+Reinert Fischetti+Lancia+Serafini

Meneses+Lu+Oliveira+Pardalos

Physical Mapping

Alizadeh+Karp+Weisser+Zweig

Genome Rearrangements Caprara+Lancia

(18)

Haplotyping

Clark’s rule (Gusfield )

Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower) Fragment assembly (Lancia+Bafna+Schwartz+Istrail)

Protein folding

Energy potentials (Wagner+Meller+Elber) Folding ab initio (Carr+Hart+Greenberg) Threading (RAPTOR, Xu+Li+Kim+Xu)

Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller ) Fold comparison (Carr+Lancia+Istrail+Walenz,Caprara+Lancia)

Sequence Alignment and consensus Lenhof+Vingron+Reinert

Althaus+Caprara+Lenhof+Reinert Fischetti+Lancia+Serafini

Meneses+Lu+Oliveira+Pardalos

Physical Mapping

Alizadeh+Karp+Weisser+Zweig

Genome Rearrangements Caprara+Lancia

We’ll see some examples

(19)

BIOLOGY 101

(20)

A genome is a long string over the DNA alphabet {A,C,G,T}

In man it is some 3.000.000.000 letters

DNA is responsible for our diversity as well as our similarity

Small changes in a genome can make a big difference, like from... to...

(21)

CELL

Nucleus

Chromosomes (pairs)

TCATCGA AGTAGCT

Eukariotic diploid organisms

(22)

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY

attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns

exons

attag|ctatatcgttgatg|tatatgcta|cga|aatta A GENE

(23)

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY

attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns

exons

attag|ctatatcgttgatg|tatatgcta|cga|aatta

att agc tat atc gtt gat gta tat gct acg aaa tta codon triplets R N C A S S F C W Y Q V

A GENE

amino acids:

a PROTEIN

(24)

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY

attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns

exons

attag|ctatatcgttgatg|tatatgcta|cga|aatta

att agc tat atc gtt gat gta tat gct acg aaa tta codon triplets R N C A S S F C W Y Q V

A GENE

amino acids:

a PROTEIN

The protein folds to a 3D shape

to perform its function CENTRAL DOGMA:

1 gene 1 protein

(25)

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

From DNA to strings (i.e. reading our genome): Shotgun sequencing

(26)

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

From DNA to strings (i.e. reading our genome): Shotgun sequencing

amplification

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

(27)

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGA GATTT GCCTAG CTATCTT

ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation

From DNA to strings (i.e. reading our genome): Shotgun sequencing

amplification

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

(28)

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGA GATTT GCCTAG CTATCTT

ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA

TGAGCCTAG GATTT GCCTAG CTATCTT

ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation

sequencing

From DNA to strings (i.e. reading our genome): Shotgun sequencing

amplification

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

(29)

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGA GATTT GCCTAG CTATCTT

ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA

TGAGCCTAG GATTT GCCTAG CTATCTT

ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation

sequencing

From DNA to strings (i.e. reading our genome): Shotgun sequencing

amplification

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA

assembly

(30)

We need to merge the fragments in order to retrieve the original DNA sequence

-50,000,000 fragments -1000 char each…

We better use computers!

assembly

ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG

(31)

- Take 10 copies of «Corriere della Sera»

- Cut each in tiny pieces (1cm2)

- Put the pieces in a bag and shuffle

- Grab five handful of pieces and throw them away

Problem: retrieve the newspaper from the remaining pieces

Let’s make the problem harder:

- Ultra-tiny pieces (1mm2)

Difficulties : Ripeated words (e.g., «ha», «dopo», «quando», «governo»…)

- It is not the CDS, but the encyclopedia Treccani (20 volumes) - It is written in chinese!!

Still thr problem would be easier than sequencing the human genome

Understanding the problem

(32)

Assembly

• Repeated words and missing words create problems to be solved by sophisticated programs, based on statistics and mathematical models.

• The basic underlying problem (notwithstanding the above complications) is called Shortest Superstring Problem

(SSP)

(33)

Shortest superstring:

• Given a set of strings s1, .., sn, find a string s that contains each si as a substring..

(34)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(35)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(36)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(37)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(38)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(39)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(40)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(41)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(42)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

Shortest superstring:

(43)

cane nerone onere rene neon

reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg

The space of potential (non-redundant) solutions has size O(n!) The problem is NP-Hard

There is an effective greedy algorithm

Shortest superstring:

(44)

cane nerone onere rene neon

Greedy algorithm:

- Merge the two strings with max overlap. Replace them with their fusion.

- Repeat until only one string is left.

(45)

cane nerone onere rene neon

neronere

Greedy algorithm:

- Merge the two strings with max overlap. Replace them with their fusion.

- Repeat until only one string is left.

(46)

cane nerone onere rene neon

neronere

neronerene

Greedy algorithm:

- Merge the two strings with max overlap. Replace them with their fusion.

- Repeat until only one string is left.

(47)

cane nerone onere rene neon

neronere

neronerene

neronereneon

Greedy algorithm:

- Merge the two strings with max overlap. Replace them with their fusion.

- Repeat until only one string is left.

(48)

cane nerone onere rene neon

neronere

neronerene

neronereneon caneronereneon

Greedy algorithm:

- Merge the two strings with max overlap. Replace them with their fusion.

- Repeat until only one string is left.

(49)

• In this case it found the optimum, but there is no guarantee

• There’s, however, a guarantee that v(GREEDY) <= 2.5v(OPT) (i.e., it is a 2.5-APPROX ALGORITHM (Sweedyk))

• OPEN PROBLEM (Conjecture holding from > 20 years):

Prove that GREEDY is a 2-APPROX ALGORITHM

For more info see

http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf

Greedy algorithm:

(50)

Sequence Alignments

• Sequences evolve and change

• E.g.: deletion, insertion, mutation

attcgattgat  attcggat deletion attcgattgat  attcggat insertion attcg  atgcg mutation

Given two genomic sequences (e..g., man and mouse) we want to compare them

(51)

Languages evolve as well

• Mater

• Madre

• Mother

• Matre

• Mutter

(52)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Languages evolve as well

(53)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Bad alignment:

- - M A T - E R M O T H - E R -

Languages evolve as well

(54)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Bad alignment:

- - M A T - E R M O T H - E R -

Multiple alignment:

M A T - E R - M O T H E R - M A D - - R E M A T - - R E M U T T E R -

Languages evolve as well

(55)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Bad alignment:

- - M A T - E R M O T H - E R -

Multiple alignment:

M A T - E R - M O T H E R - M A D - - R E M A T - - R E M U T T E R -

If DNA sequences:

A C T - G G - A C T C G G - A G T - - C T C C T - - G T A - T T C G -

Languages evolve as well

(56)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Bad alignment:

- - M A T - E R M O T H - E R -

To see if an alignment is good: score 0 identical (or similar) letters (e.g. T-D, B-P, S-F) and score 1 different or deleted letters.

Cost: 1 Cost: 8

Languages evolve as well

(57)

• Mater

• Madre

• Mother

• Matre

• Mutter

Good alignment:

M A T - E R M O T H E R

Bad alignment:

- - M A T - E R M O T H - E R -

To see if an alignment is good: score 0 identical (or similar) letters (e.g. T-D, B-P, S-F) and score 1 different or deleted letters.

Cost: 1 Cost: 8

There is an exponential number of possible alignments of two strings, but finding the optimal one is easy (i.e. polynomial). We’ll see it next.

For many strings, there is no polynomial algorithm (if P =/= NP), but there are effective heuristics and approximation algorithms.

Languages evolve as well

(58)

The basic problem

: pairwise sequence alignment:

Seq. 1: ATCGGCTTGTTATTG Seq. 2: ATGGGATTTAATTGCCC

ATCGGCTTGTTA-TTG--- ATGGGAT--TTAATTGCCC

Alignment:

(59)

The basic problem

: pairwise sequence alignment:

Seq. 1: ATCGGCTTGTTATTG Seq. 2: ATGGGATTTAATTGCCC

ATCGGCTTGTTA-TTG--- ATGGGAT--TTAATTGCCC

Alignment:

Cost c(x,y) for each pair x, y in {A, T, C, G, -}

Problem: find the arrangement minimizing

S

i c(AL[1,i], AL[2,i]) This value is the edit distance d(S’, S”)

(60)

Simple Dynamic Programming Recursion:

d(S’[1…n], S”[1..m]) = min

{

d(S’[1..n-1],S”[1…m]) + c(S’[n],-) d(S’[1..n],S”[1…m-1]) + c(-, S”[m])

d(S’[1..n-1],S”[1…m-1]) + c(S’[n],S”[m])

d(ATCCG, ATTC) = min

{

d(ATCC, ATTC) + c(G,-) d(ATCCG, ATT) + c(-, C)

d(ATCC, ATT) + c(G,C)

Cost: O(n m) (Smith & Waterman, ’86)

(61)

This can be extended to Multiple Sequence Alignment

Useful for:

• Finding conserved patterns (functionally relevant)

• Clustering genomic sequences (e.g. protein families)

• Evolutionary studies (e.g. intra-species comparisons)

• ….

ATCCGTTAGTTA---GATTG--- ATT--TTAGTTA---GATTGCAC ATTCGTTAGTTAGTAGATTGCA- TTTCGTTA--TA-TACATTGCAC

(62)

This can be extended to Multiple Sequence Alignment

Useful for:

• Finding conserved patterns (functionally relevant)

• Clustering genomic sequences (e.g. protein families)

• Evolutionary studies (e.g. intra-species comparisons)

• ….

Unfortunately…it’s NP-hard (Wang & Jiang, ’94)

(Dynamic Programming costs O(2 n ) for k sequences of length n) k k

ATCCGTTAGTTA---GATTG--- ATT--TTAGTTA---GATTGCAC ATTCGTTAGTTAGTAGATTGCA- TTTCGTTA--TA-TACATTGCAC

The distance in the alignment A is denoted dA(S’,S”). We wish to minimize SP(A) = S{S’,S”} dA(S’,S”) (sum-of-pairs objective)

(63)

a:humanafrican cox1 ;----accatccatcaatctgcgatccccctccaataccaaccataccactcctggccttccagtt b:humprobeuro cox1 ;gtacaccatccatcaa--tgcgatccccctccaataccaaccataccactcctggccttccagtt c:gorilla cox1 ;acatattacccatcaacttataatctctttccagcatcagtcatgctattcccgaccccttagtt

d:gibbon cox1 ;gcgtattactcgctggcccatacctcttcctagaccccaactgtgtcgcctctgattccctagtt

e:chimpanzee cox1 ;acacactacctatcgacttacaattccctcccaacaccagtcatgctgttcccgacccctcaact f:pigmychimp cox1 ;---cactacccattgacttacaatcccttctcagcactaatcacgctgttcccgatccctcaact g:baboon cox1 ;atgcaccgtttactagctcataccttccccta--ccctgatcgcatcatctttaatcccctaacc

h:orangutan cox1 ;gcgcgctgtccatcatcccacacttctctcccaacattggctataccgtccttgactccctcatc i:sumorang cox1 ;gcgcgccgtccgtcatcccacacctctctcccaacatcaactataccacccctaactccctcgtc agccccccccctagctgcctcatctccacccgcaccctcttctcccccccccaaacctcctctttac---

agccccccccctagctgcctcatctccacccgcaccctcttctcccccccccaaacctcctctttacttt aatcctccttccaatcgccctacctccacccgcactttctttctctccccccaaattcctttctcacctt gaccctcctttcggccgttctcccccctctcgcatcccaccttcccccttttaaactctccccttactcc Aacgtcccccctgacca--ccattttcacccgcactcccttcctttcctcccgaacttttttcttatctt aatgtcccccctggccactccactttcacccgcattcccttcctttcctcctgaacttttttctcatctt ggtccccccctcaaccaccctcccccatttaatatctcaccctctcctttccaagcttccccctcatccc Aactcttacctcaatcgtctta— tca-tcaacgcccccctcccccattattatgttccttctccccccc aactcttaccttaaccatctcatttcatccagtgcccccctccctcaccatcatatcctctttccccccc

Example: Same protein, different species

(64)

Sometimes the sequences can be related via a Phylogenetic Tree

A tree can be used to align sequences. Here is how:

(65)

Theorem 1: given a fully labeled tree, there is a multiple alignment A(T) such that dA(T) (S(i),S(j)) = d(S(i), S(j)) for all edges (i,j) of T

Proof (by 1 drawing):

ATGGC TTGGAC

ATG-GC

TTGGAC-- -TTGGAC

AT-GG-C

-TTG-GAC-- AT-G-G-C--

(66)

A 2-approximation algorithm for multiple alignment and a good heuristic borrow concepts from Network Design

3

2 4

10

5 9 2

7

6 3

The Routing Cost of a tree is the sum of distances in the tree between all pairs

X Y

E.g. rc(X,Y) = 3 + 5 + 9 + 2 = 19. rc(T) = 1616

1

(67)

If we have k sequences, and a tree connecting them, we can align them using Theorem 1

tccat attcgg ttg

ttatcg

ttga

attcg

ttgat

tattc ctt

tcct

ctggat aatgt

The cost of edges {S’,S”} is the edit distance d(S’S”)

(68)

If we have k sequences, and a tree connecting them, we can align them using Theorem 1

tccat attcgg ttg

ttatcg

ttga

attcg

ttgat

tattc ctt

tcct

ctggat aatgt

By triangle inequality, dA(S’,S”) £

S

{i,j}in path S’,..,S” dA(i,j)

dA(ttatcg, ctt) £ dA(ttatcg, ttga) + dA(ttga, attcg) + dA(attcg,ctt) = = d(ttatcg, ttga) + d(ttga, attcg) + d(attcg,ctt)

Hence SP(A(T)) £ rc(T)

Stars already yield a 2-approximation (next).

If T* minimizes rc(T), we can expect a low SP(A(T*)).

Finding T* is NP-hard, but “easier” than solving min SP

The cost of edges {S’,S”} is the edit distance d(S’S”)

(69)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

(70)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

(71)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

So, also SP

(72)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

So, also SP But, for all A, SP

(73)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

So, also SP But, for all A, SP

(74)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

So, also SP But, for all A, SP

And so 2 OPT

(75)

Some particular trees (STARS) give a 2-approximation to SP

c

��(� ( ) ) ¿=(�−1) ∑

�(�,�)

)) =

Hence ,��

(

(

) )

���( ��)=1

(

(�−1)

�(�,�)

)

So, also SP But, for all A, SP

And so 2 OPT

In conclusion, SP

(76)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(77)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

min

�∈ �

(78)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(i)

min

�∈ �

=1 for all�, �∈�

(79)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(i)

(ii)

min

�∈ �

=1

�∈ �(�, � ):�∈�

for all�, �∈�

e

(80)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(i)

(ii)

(iii)

min

�∈ �

=1

�∈ �(�, � ):�∈�

�∈�

=�−1

for all�, �∈�

e

(81)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(i)

(ii)

(iii)

Pricing becomes O(n^2) shortest path problems with lengths de + ve

(ve dual vars for (ii))

min

�∈ �

=1

�∈ �(�, � ):�∈�

�∈�

=�−1

for all�, �∈�

e

(82)

ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini)

Given G=(V,E), edge lengths d. Use 0-1 variables for each path

(i)

(ii)

(iii)

min

�∈ �

=1

�∈ �(�, � ):�∈�

�∈�

=�−1

for all�, �∈�

e

Best tree has RC 10% less than RC of best star “ “ “ SP 6% “ “ SP “ “, SP of star is experimentally 10-15% of optimal SP of best tree is 4-10% of optimal

Riferimenti

Documenti correlati

Metodi e Modelli per il Supporto alle

“Date due sequenze relativamente lunghe, determinare la sottosequenza della prima sequenza che realizza il maggior grado di similarità con una sottosequenza della

Ricerca (I) : Ogni SUBJECT della banca dati viene consultata allo stesso modo (anche queste sono indicizzate) e si confronta l’indice della ktup della query con l’indice della

La misura più semplice della distanza tra due sequenze nucleotidiche è contare il numero di siti nucleotidici che differiscono tra le due sequenze. Quando confrontiamo siti

Come nel caso di UPGMA ad ogni ciclo viene aggiunto un ramo interno … ma adesso è sempre il ramo più corto possibile !... (Saitou and

Quindi per valutare la verosimiglianza di un albero filogenetico mediante la tecnica della massima verosimiglianza (maximum likelihood) abbiamo bisogno innanzitutto di un

L’interprete PERL si occupa di tradurre una istruzione scritta in un linguaggio comprensibile per un essere umano in una corrispondente istruzione scritta in un

[r]