contact maps and

(1)

Protein structure comparison

contact maps and

(2)

A Protein is a complex molecule with a primary, Protein linear structure (a sequence of aminoacids) and a 3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design^problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock^{onto it.}

(3)

Motivation:

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function) - Search in 3D data bases

- Protein Classification and Evolutionary Studies - Assessment of Fold Prediction quality (e.g. CASP) -…..

Problem: Align two 3D protein structures

(4)

Contact Maps

(5)

CONTACT MAPS

Unfolded protein

(6)

Folded protein = contacts

CONTACT MAPS

(7)

Contact map = graph

CONTACT MAPS

(8)

Contact map = graph

OBJECTIVE: align 3d folds of proteins = align contact maps

CONTACT MAPS

(9)

Contact Maps are related to fold:

Similar folds  similar contact maps

We studied the problem of determining contact map similarity

(10)

Contact Maps are related to fold:

Similar folds  similar contact maps

We studied the problem of determining contact map similarity We studied the problem of determining contact map similarity

In the period 2001-2004 ---

-I.P. formulation via Branch & Cut (RECOMB)

-Use of Compact Optimization instead of separation (AIRO) -Lagrangian Relaxation (RECOMB)

(Pubblications: RECOMB proceedings, AIRO proceedings,

OR Letters, Journal of Comp. Bio., 4OR)

(11)

The Contact Map Alignment

Problem

(12)

Non-crossing Alignments

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2

(13)

The value of an alignment

(14)

The value of an alignment

(15)

The value of an alignment

(16)

Value = 3

The value of an alignment

(17)

Value = 3

The value of an alignment

We want to maximize the value

(18)

The value of an alignment

NP-Hard (Goldman, Istrail, Papadimitriou, 1999)

(19)

Integer Programming Formulation

(5 ^th RECOMB conference)

(20)

The use of Integer Linear Programming

Integer Programming Formulation

•

Model a difficult problem by 0-1 variables, linear objective function and linear constraints

• Can find optimal solution by branch and bound

• Bound comes from LP relaxation (polynomial)

• Bound can be used to access quality of any feasible sol

(21)

CONTACT-CONTACT VARS

y

_ef for e and f contacts

f

y_ef

RESIDUE-RESIDUE VARS

x

_ij for i and j residues

y_ef i

j x_ij

(i) 0-1 VARIABLES (i) 0-1 VARIABLES

e

(22)

maximize 

^e



^f

y

_ef

(ii) OBJECTIVE (ii) OBJECTIVE

over all feasible x and y

(23)

(iii) CONSTRAINTS (FEASIBILITY) (iii) CONSTRAINTS (FEASIBILITY)

y

(ip)(jq)

<= x

ij

and y

(ip)(jq)

<= x

pq

non-crossing i

j i’

j’

x

ij

+ x

i’j’

<= 1

i j

p q

activation

(24)

Non-crossing clique Constraints

Variables x define a graph Gx:

• A node for each line

• An edge between each pair of crossing lines i

j i’

j’

i j

i’

j’

(25)

Variables x define a graph Gx:

• An independent set corresponds to a noncrossing alignment

• G^x has nice proprieties (it’s a perfect graph)

• It’s easy (poly) to find large independent sets in G^x

• A node for each line

• An edge between each pair of crossing lines i

j i’

j’

i j

i’

j’

Clique Constraints

(26)

Non-crossing constraints can be extended to CLIQUE CONSTRAINTS



^xij <= 1

[i,j] in M

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied imply a strong bound!

Clique Constraints

(27)

Maximal cliques in G ^x

(28)

Structure of Maximal cliques in G ^x

1. Pick two subsets of same size

(29)

Structure of Maximal cliques in G ^x

2. Connect them in a zig-zag fashion

(30)

Structure of Maximal cliques in

G ^x

(31)

Structure of Maximal cliques in

G ^x

(32)

Structure of Maximal cliques in

G ^x

(33)

Structure of Maximal cliques in

G ^x

(34)

Structure of Maximal cliques in

G ^x

(35)

Structure of Maximal cliques in

G ^x

(36)

Structure of Maximal cliques in

G ^x

(37)

Structure of Maximal cliques in

G ^x

(38)

Structure of Maximal cliques in

G ^x

(39)

Structure of Maximal cliques in G ^x

3. Throw in all lines included in a zig or a zag

(40)

Structure of Maximal cliques in G ^x

3. Throw in all lines included in a zig or a zag

(41)

Structure of Maximal cliques in G ^x

The result is a maximal clique in G^x

(42)

Separation of Clique Inequalities

(43)

Separation of Clique Inequalities

PROBLEM

There exist exponentially many such cliques (O(2²ⁿ) inequalities).

How do we add them ?

(44)

PROBLEM

SOLUTION

We don’t add them in the original LP, but only when needed at run time. Not all of them will be needed, so we are fine as long as…

(45)

PROBLEM

SOLUTION

SEPARATION

…we can generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*



^x*ij > 1

[i,j] in M

(46)

PROBLEM

SOLUTION

SEPARATION

…we can generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*



^x*ij > 1

[i,j] in M

THEOREM

We can find the most violated clique inequality in time O(n²)

(47)

n2 1

1 2 n1 2

i

u

Separation of Clique Inequalities

Create n

¹

x n

²

grid

(48)

n2 1

1 2 n1 2

i

u

Orient all edges and give weights

x*iu

Separation of Clique Inequalities

Create n

¹

x n

²

grid

(49)

Create n

¹

x n

²

grid

Orient all edges and give weights

There is violated clique iff longest A,B path has length > 1

A=(1,n2)

B=(n1,1)

Separation of Clique Inequalities

.25 .20 .30

0

.35 0 .15

.20 0

(50)

• The method which adds violated inequalities by separation is called BRANCH-and-CUT

• The method can get stuck in long runs of cut additions each of which “cuts very little”

• There is an alternative to this, called

COMPACT OPTIMIZATION

(51)

Detour:

COMPACT OPTIMIZATION COMPACT OPTIMIZATION

vs vs

BRANCH and CUT

(52)

Polynomial number of “simple” constraints

LP Separation Paradigm

Exponential number of “combinatorial” constraints

TSP, O(n) Degree constraints

TSP, O(2^n) Subtour Elimination

(53)

SEPARATION ALGORITHM:

Look for a (the most) violated constraint Cast as an OPTIMIZATION problem

A violated combinatorial constraint exists

+ some of the combinatorial ones

Add inequalities Exponential number of “combinatorial” constraints

END

YESYES NONO

LP Separation Paradigm

(54)

END

YESYES NONO

E.g. SHORTEST PATH, MAX FLOW,E.g. SHORTEST PATH, MAX FLOW, BIP MATCHING...BIP MATCHING...

A violated combinatorial constraint exists

LP Separation Paradigm

(55)

END

YESYES

NONO A violated combinatorial

constraint exists

...IT’S AN LP ITSELF !!

LP Separation Paradigm

(56)

Polynomial number of “simple” constraints + some of the combinatorial ones

Add inequalities

END

YESYES

NONO A violated combinatorial

constraint exists

Polynomial number of constraints Polynomial number of constraints

for the separation problem as an LP for the separation problem as an LP

LP Separation Paradigm

(57)

Polynomial number of “simple” constraints + some of the combinatorial ones

Add inequalities

END

YESYES NONO

Polynomial number of constraints Polynomial number of constraints

for the separation problem as an LP for the separation problem as an LP

Polynomial number of constraints Polynomial number of constraints

to force that no violated combinatorial to force that no violated combinatorial constraint exists

constraint exists

LP Separation Paradigm

(58)

Polynomial number of constraints Polynomial number of constraints

for the separation problem as an LP for the separation problem as an LP

Polynomial number of constraints Polynomial number of constraints

to force that no violated combinatorial to force that no violated combinatorial constraint exists

constraint exists

Org variables x

Org variables x + new variables y

Variables x, y

LP Separation Paradigm

(59)

Th Th : Optimization iff Separation (Grotschel, Lovasz, Schrijver, 1981) : Optimization iff Separation Compact Optimization

Compact Optimization : solve an LP with an exponential n. of : solve an LP with an exponential n. of inequalities by lifting to a space with a polynomial n. of

inequalities by lifting to a space with a polynomial n. of inequalities, solving, and projecting back

inequalities, solving, and projecting back

Th Th : Compact Optimization iff Compact Separation (Carr, Lancia, 2002) : Compact Optimization iff Compact Separation Somewhat known (Maculan, Pulleyblank)

Somewhat known (Maculan, Pulleyblank) rediscovered (Carr, Lancia, 00)

rediscovered (Carr, Lancia, 00)

(60)

the application to contact maps

comparison

(61)



 ₁  ₂

max

E f

ef E

e

y

iu i

j

v u j

i

x

y 





 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



iv i

j

v u i

j

x

y 



_

 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



 1



 ij Q

ij

x

for all cliques Q of mutually intersecting alignment lines

(62)

A=(1,n2)

B=(n1,1)

Separation of cliques: there is a Q such that x*(Q) > 1 if and only if the longest A-B path in this grid is > 1

x

^*ij

x

^*ij

2 i n1

1

j

Note: Longest path on GRID can be cast as LP...

Vars: z

^p

= longest up to p.

Constraints: z

^p

>= z

^q

+ len(q,p) for arc (q,p)

(63)



 ₁  ₂

max

E f

ef E

e

y

iu i

j

v u j

i

x

y 





 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



iv i

j

v u i

j

x

y 



_

 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



 1



 ij Q

ij

x

(64)



 ₁  ₂

max

E f

ef E

e

y

iu i

j

v u j

i

x

y 





 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



iv i

j

v u i

j

x

y 



_

 ( )

) , )(

, (



1 2

2

1,(u,v) E i V ,(u,v) E V

i    



 1



 ij Q

ij

x

j i ij

j

i

x z

z

_, _₁

 

_,

j i ij

j

i

x z

z

_₁_,

 

_,

2 1

, j V V

i  



2

0

,

1 _n



z

2 1

, j V V

i  



1

1,

 z

n

1 ,j

z

i j

z

i_₁_,

z

_i_,_j

x

ij

x

ij

(65)

What about Heuristics?

Genetic algorithms

(66)

Genetic Algorithm Overview

• A Population of candidate solutions that evolve (improve) over time

• Recombination creates new candidate solutions via crossover and mutation

Population at time t

Population at time t+1 Recombination

operators

Evaluation function

(67)

Blue Parent

Offspring

Red Parent

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(68)

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(69)

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(70)

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(71)

These edges conflict with existing edges and are not copied

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(72)

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child

– Copy as many edges as possible from the other parent

– Add random edges to fill any remaining space

(73)

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

– Select a set of edges in one parent to copy to the child – Copy as many edges as possible from the other parent – Add random edges to fill any remaining space

(74)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

(75)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

– Select a set of endpoints to shift

(76)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

– Select a set of endpoints to shift

(77)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

– Select a set of endpoints to shift

This edge “fell off” the end of the contact map

and is removed

(78)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

– Select a set of endpoints to shift

– Randomly add new edges

(79)

Mutation

• Mutation introduces small changes to existing solutions by shifting edge

endpoints

– Select a set of endpoints to shift

– Randomly add new edges

(80)

Computational Results

(81)

Compact Optimization vs Separation Compact Optimization vs Separation

INSTANCE SEPARATION COM PACT OPTIM IZATION

PROT1 PROT2 n m cols rows nLPs time cols rows nLPs time speedup

1b3c 1svf 92 101 4359 6085 1944 28072 6474 10225 1 207 136x 1nmg 1svf 92 112 4722 6932 2371 38877 6837 11072 1 314 124x 1svf 2b3c 92 101 4359 6085 2118 26996 6474 10225 1 232 116x

1bw5 1svf 96 91 4209 5393 1776 14010 6504 9889 1 136 103x

1bct 1hlh 105 104 5354 7196 1477 18159 8104 12593 1 186 98x 1bw5 1joy 100 115 5805 8055 2434 62426 8304 12955 1 663 94x 1svf 1szt 97 101 4584 6349 1118 9744 6924 10934 1 142 69x

1svf 2new 93 91 4074 5624 1349 10147 6234 9853 1 149 68x

1joy 1svf 94 90 4086 5667 1174 6719 6291 9985 1 115 58x

1f22 1svf 93 88 3975 5423 827 4937 6135 9652 1 110 45x

1hlh 2new 98 120 5996 10658 1126 28941 8396 13112 1 829 35x 1qr9 1svf 100 105 4851 6738 454 3328 7326 11590 1 116 29x

1mdy 1svf 100 89 4323 5545 558 2382 6798 10397 1 83 29x

1bhb 1svf 97 104 4683 6352 442 2860 7023 10937 1 165 17x

1bct 1svf 100 75 3861 4662 316 739 6336 9514 1 54 14x

1tn9 1bmr 109 191 12068 15432 328 73449 15036 21164 1 5185 14x

1sfc 1svf 101 86 4269 5773 173 513 6789 10714 1 49 10x

1svf 1wdc 99 82 4047 5318 170 427 6477 10081 1 61 7x

1fza 1fzb 145 194 14664 21142 144 7811 19920 31511 1 1471 5x

1ehj 1f22 100 116 5851 8346 65 284 8347 13240 1 96 3x

(82)

Computational Results

• 269 proteins

– 64 to 72 residues – 80 to 140 contacts

• Selected 597 pairs of proteins out of 36046 possible

– roughly as many similar pairs as dissimilar pairs

(83)

Optimality Gap

0 1 2 3 4 5 > 5

Number of Instances

42 48 72 71 76 95 193

Average/Max Num. Residues

66.4/69 66.8/72 66.7/71 67.0/72 67.0/71 66.8/72 66.8/72 Average/Max

Num. Contacts

61.1/92 56.3/89 57.3/93 59.7/95 61.5/88 64.7/89 71.4/133

Num. GA Best 38 44 63 61 64 74 155

Num. LS1 Best 25 20 35 31 33 35 82

Num. LS2 Best 5 0 0 1 5 12 53

(84)

Skolnick Clustering Test

(85)

Skolnick Results Skolnick Results

• Four Families

1 Flavodoxin-like fold Che-Y related 2 Plastocyanin

3 TIM Barrel 4 Ferratin

• alpha-beta

• 8 structures

• up to 124 residues

• 15-30% sequence similarity

• < 3Å RMSD

(86)

• Four Families

• beta

• 8 structures

• up to 99 residues

• 35-90% sequence similarity

• < 2Å RMSD

Skolnick Results

(87)

• Four Families

• alpha-beta

• 11 structures

• up to 250 residues

• 30-90% sequence similarity

• < 2Å RMSD

Skolnick Results

(88)

• Four Families

• alpha

• 6 structures

• up to 170 residues

• 7-70% sequence similarity

• < 4Å RMSD

Skolnick Results

(89)

Family Style Residues Seq. Sim. RMSD Proteins

1 alpha-beta 124 15-30% < 3A 1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy

2 beta 99 35-90% < 2A 1baw, 1byo, 1kdi, 1nin,

1pla, 3b3i, 2pcy, 2plt 3 alpha-beta 250 30-90% < 2A 1amk, 1aw2, 1b9b, 1btm,

1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim

4 170 7-70% < 4A 1b71, 1bcf, 1dps, 1fha,

1ier, 1rcd

• Four Families

Skolnick Results

(90)

Clustering Clustering

Define score(P¹, P²) as

0 <= # shared contacts

Min # of contacts of P¹,P²

<= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

(91)

Clustering validation

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each other and against the remaining.

(92)

We got some known families from biologists, PDB.

0.05 MISMATCH 0.1 MISMATCH 0.15 MISMATCH 0.2 MISMATCH 0.25 MISMATCH 0.3 MISMATCH 0.35 MATCH

…… ……

1.0 MATCH

score proteins were…

Experiment: Take a family F of proteins and align them against each other and against the remaining.

TYPICAL BEHAVIOUR

Clustering validation

(93)

• Performance

– 528 alignments

– 1.3% false negative – 0.0% false positive

Skolnick Results

(94)

The Lagrangian Relaxation The Lagrangian Relaxation

Approach (RECOMB)

(95)

- RECOMB-1 main meritsRECOMB-1 main merits:

showing optimality is feasible for a rigorous and well defined accepted similarity measure

providing a way to obtain bounds to optimal value

- RECOMB-1 main drawbacks:RECOMB-1 main drawbacks:

works only for small proteins (60 residues, 90 contacts) can be slow and involved: relies on LP

(96)

The RECOMB-2 approach:

Can solve larger instanceslarger instances Easier

Easier to implement (no LP) Faster

Faster (no LP)

Provides good heuristicgood heuristic solutions Provides boundsbounds from optimum

(97)

LP Lagrangian

60 resid., 80

contacts 150 resid., 250 contacts

100 res., 200 cont. 1000 res., 2000 cont.

2 hrs 5 min

Bound type

< 10% in < 1 hr

Max proved optimal

Max B&B root time

OLD NEW

Side by side

(98)

Integer Quadratic Programming

Formulation

(99)

0-1 VARIABLES: For each possible line l =(i,j), a variable xl (i and j residues)

i j

x_l (i) x>=0, integer

Node to Node Variables

(100)

i j

Node to Node Variables

(101)

i j

Node to Node Variables

(102)

i j

x_l

CONSTRAINTS: The lines must not cross

(i) x>=0, integer

Node to Node Variables

(103)

i j

x_l

(i) x>=0, integer

Node to Node Variables

(104)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

Node to Node Variables

(105)

Node to Node Variables

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

OBJECTIVE: Pick pairs of lines aligning residues in contact

(106)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

l m

Node to Node Variables

(107)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

l m

b(l,m) = 1

define constant coefficients b

Node to Node Variables

(108)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

l m b(l,m) =

0

Node to Node Variables

(109)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

b(l,m) x

^l

x

^m is 1 iff alignment lines m and l give a sharng

Node to Node Variables

(110)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

(iii) max

 ^

^{m b(l, m)}

^x

^l

^x

^m

Node to Node Variables

 ^

^l

(111)

i j

x_l



^xl <= 1

l in C

(i) x>=0, integer

for each clique C

(ii)

Node to Node Variables

(iii) max

 ^

^l

 ^

^{m b(l, m)}

^x

^l

^x

^m

(112)

The main idea

(113)

We are looking for a noncrossing matching max a quadratic function A = max



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

is a noncrossing matching

THE MAIN IDEA

(114)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

THE MAIN IDEA

(115)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

THE MAIN IDEA

(116)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

THE MAIN IDEA

Hard!

Hard! What if the function was linear?

(117)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

THE MAIN IDEA

Hard!

(118)

We are looking for a noncrossing matching max a quadratic function

3 2 1 2

A = max



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

THE MAIN IDEA

Hard!

(119)

We are looking for a noncrossing matching max a quadratic function

5 2 1

A = max



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

THE MAIN IDEA

Hard!

(120)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

Hard!

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

3 1 4 1

THE MAIN IDEA

(121)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

Hard!

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

3 1 4 1

Easy!

Easy! It’s same as sequence alignment problem. DP,

O

( n1 x n2 )

a t c t c g

c g t c

THE MAIN IDEA

(122)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

Hard!

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

3 1 4 1

Easy!

O

( n1 x n2 )

a t c t c g

c g t c

The idea is then to find b’ such that A  B

THE MAIN IDEA

(123)



ⁱ



^j

^b

^ij

^x

ⁱ

x

^j

s.t.

x

Hard!

B = max



ⁱ

^b’

ⁱ

^x

ⁱ

s.t.

x

3 1 4 1

Easy!

O

( n1 x n2 )

a t c t c g

c g t c

The idea is then to find b’ such that A  B In fact, it is always A B

THE MAIN IDEA

(124)

This b’ is found by Lagrangian Relaxation of a formulation of the model.

L.R. consists in removing constraints and put them in the objective function, weighted by some penalties

It’s a successful technique for large optimization problems We skip the technical details.

Lagrangian Relaxation

(125)

The approach yields GOOD FEASIBLE HEURISTIC SOLUTIONS The approach yields GOOD FEASIBLE HEURISTIC SOLUTIONS (best than previous methods), i.e. the noncrossing matching of (best than previous methods), i.e. the noncrossing matching of the lagrangian subproblems

the lagrangian subproblems

Hence,we get LOWER BOUNDS –as well as UPPER BOUNDS- to the optimum

Lagrangian Relaxation

(126)

Computational Results

(127)

Computational Results

• Branch-and-Bound Results

• 269 proteins

– 70 -100 residues – 80 to 140 contacts

• Picked 597 (REC’01) and 10,000 new pairs of proteins out of 36046 possible (would

have taken up to few months with old

method, took a weekend on PC)

(128)

RECOMB 01

Optimality Gap

0 1 2 3 4 5 > 5

Number of Instances

42 48 72 71 76 95 193

Average/Max Num. Residues

66.4/69 66.8/72 66.7/71 67.0/72 67.0/71 66.8/72 66.8/72 Average/Max

Num. Contacts

61.1/92 56.3/89 57.3/93 59.7/95 61.5/88 64.7/89 71.4/133

Num. GA Best 38 44 63 61 64 74 155

Num. LS1 Best 25 20 35 31 33 35 82

Num. LS2 Best 5 0 0 1 5 12 53

(129)

Lagrangian

(130)

We pushed our algorithm to its limit by optimally aligning very large proteins which are known to be very similar.

For instance, we optimally aligned a protein of 891 residues and 1944 contacts to a protein with 887 residues and 1937 contacts

Further use of Lagrangian

(131)

Skolnick Clustering Test

(132)

Family Style Residues Seq. Sim. RMSD Proteins

1 alpha-beta 124 15-30% < 3A 1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy

2 beta 99 35-90% < 2A 1baw, 1byo, 1kdi, 1nin,

1pla, 3b3i, 2pcy, 2plt 3 alpha-beta 250 30-90% < 2A 1amk, 1aw2, 1b9b, 1btm,

1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim

4 170 7-70% < 4A 1b71, 1bcf, 1dps, 1fha,

1ier, 1rcd

• Four Families

1 Flavodoxin-like fold Che-Y related 2 Plastocyanin

3 TIM Barrel 4 Ferratin

- Fix similarity level , define P i and P j in same familiy iff score(P i , P j ) >= 

Clustering validation

(133)

INVESTIGATING THE CONTACT THRESHOLD

At low threshold there are no contacts At high threshold, all pairs are in contact

Hence, interesting contact maps (that highlight similarity) lie somewhere in between

We performed experiment: vary contact threshold  and similarity level  and align, until can retrieve the clusters

Clustering a 0-1 matrix of similarity amounts to find a block-diagonal structure (done by TSP)

xij

We found best results for threshold  =7-8 angstrom and similarity

 about 65%

(134)

We find something like this

(135)

A server for contact map alignment Possible practical project:

User inputs PDB proteins and threshold and retrieves alignment

Also, can compute contact maps from PDB files

Joint work with

-Sorin Istrail (Caltech)

-Bob Carr (Sandia Natl Labs) -Brian Walenz (Celera genomics)

-Alberto Caprara (University of Bologna)

contact maps and

Protein structure comparison

contact maps and

Motivation:

Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function) - Search in 3D data bases

- Protein Classification and Evolutionary Studies - Assessment of Fold Prediction quality (e.g. CASP) -…..

Problem: Align two 3D protein structures

Contact Maps

CONTACT MAPS

CONTACT MAPS

CONTACT MAPS

CONTACT MAPS

Contact Maps are related to fold:

Similar folds  similar contact maps

We studied the problem of determining contact map similarity

We studied the problem of determining contact map similarity

Contact Maps are related to fold:

Similar folds  similar contact maps

We studied the problem of determining contact map similarity We studied the problem of determining contact map similarity

In the period 2001-2004 ---

-I.P. formulation via Branch & Cut (RECOMB)

-Use of Compact Optimization instead of separation (AIRO) -Lagrangian Relaxation (RECOMB)

(Pubblications: RECOMB proceedings, AIRO proceedings,

OR Letters, Journal of Comp. Bio., 4OR)

The Contact Map Alignment

Problem

Non-crossing Alignments

The value of an alignment

The value of an alignment

The value of an alignment

Value = 3

The value of an alignment

Value = 3

The value of an alignment

The value of an alignment

Integer Programming Formulation

(5 th RECOMB conference)

Integer Programming Formulation

•

y

x

(i) 0-1 VARIABLES (i) 0-1 VARIABLES

maximize 



y

(ii) OBJECTIVE (ii) OBJECTIVE

over all feasible x and y

(iii) CONSTRAINTS (FEASIBILITY) (iii) CONSTRAINTS (FEASIBILITY)

y

<= x

and y

<= x

x

+ x

<= 1

Non-crossing clique Constraints

Clique Constraints



Clique Constraints

Maximal cliques in G x

Structure of Maximal cliques in G x

Structure of Maximal cliques in G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

G x

Structure of Maximal cliques in

(5 ^th RECOMB conference)

Maximal cliques in G ^x

Structure of Maximal cliques in G ^x

Structure of Maximal cliques in G ^x

G ^x

G ^x

G ^x

G ^x

G ^x

G ^x

G ^x

G ^x

G ^x

Structure of Maximal cliques in G ^x

Structure of Maximal cliques in G ^x

Structure of Maximal cliques in G ^x