Implementation, validation and application of a new software for automatic Molecular Perception

(1)

Università di Pisa

Dipartimento di Chimica e Chimica Industriale

Corso di Laurea in Chimica

Implementation, Validation and

Application of a new Software

for automatic Molecular

Perception

Relatore:

Prof. Vincenzo Barone

Controrelatore:

Prof. Giovanni Granucci

Candidato:

Federico Lazzari

(2)

(3)

Acknowledgements

I would like to thank all those who supported me, guided me and advised me during this period of work, and helped me get results of better quality.

In particular, I would like to thank Prof. Vincenzo Barone for the given opportunity to work in a prestigious research institution, such as the SMART laboratory at Scuola Normale Superiore, and for the proven condence in me and my abilities. I would also like to thank Prof. Giovanni Granucci for his availability and for his advices. Special thanks to Dr. Giordano Mancini, and Andrea Salvadori for the incredible support and for their continued commitment and patience, and to Andrea dal Pino for his important suggestions.

Finally, thanks to my family who raised me and love me, and prompted me to pursue my passions despite the various daily diculties, and to Giuseppe Madaro for his aection.

(4)

CONTENTS CONTENTS 6 Proxima Implementation 45 6.1 Parser . . . 48 6.1.1 MolecularBuilder . . . 52 6.1.2 FirstModelState . . . 53 6.1.3 ModelsState . . . 54 6.1.4 AltLocState . . . 54 6.1.5 ResidueBuilderState . . . 55 6.2 Fragment . . . 58 6.3 Residue . . . 58 6.3.1 AminoResidue . . . 59 6.3.2 NucleotideResidue . . . 60 6.4 RingGenerator . . . 62 6.5 SSSRGenerator . . . 64 6.6 GasteigerParameters . . . 69 6.7 Utilities . . . 69 6.8 Atom . . . 71 6.9 Bond . . . 72 6.10 HBond . . . 72 6.11 Element . . . 73 6.12 AtomsGrid . . . 73 6.13 HBInteraction . . . 74 6.14 MolecularSystem . . . 75 6.14.1 computeBonds . . . 76 6.14.2 computeHBonds . . . 76 6.14.3 computeGasteigerCharges . . . 77 6.14.4 computeHybridisation . . . 78

7 Validation Tests and Benchmarks 81 7.1 Ring Perception . . . 81 7.1.1 Decalin . . . 81 7.1.2 Ingenol Mebutate . . . 82 7.1.3 Longifolene . . . 83 7.1.4 DNA Fragment . . . 83 7.1.5 (0,4) nanotube . . . 84 7.1.6 Cubane . . . 86 7.1.7 Bullvalene . . . 87 7.1.8 Cuneane . . . 88 7.1.9 Dodecahedrane . . . 89

7.1.10 Molecular Borromean Ring . . . 90

7.1.11 Pagodane . . . 91 7.1.12 Benchmarks . . . 92 7.2 Hybridisation Perception . . . 94 7.2.1 Benzene . . . 94 7.2.2 Pyridine . . . 95 7.2.3 Ottatetraene . . . 95 7.2.4 Cellocidin . . . 96 7.2.5 1-Penten-4-yne . . . 96

(6)

CONTENTS CONTENTS

7.2.6 Methylenemalonaldehyde . . . 97

7.2.7 XeF4 . . . 98

7.2.8 Carbonic Anhydrase . . . 98

7.2.9 Tungsten complex . . . 99

7.2.10 Porphyrin iron complex . . . 99

7.2.11 Cubane . . . 100 7.2.12 Ruthenium complex . . . 101 7.3 Charge Perception . . . 101 7.3.1 Formamide . . . 102 7.3.2 Acetone . . . 102 7.3.3 Water . . . 103 7.3.4 Diethylether . . . 104 7.3.5 Decalin . . . 105 7.3.6 Naphthalene . . . 105 7.3.7 Pyrimidine . . . 107 7.3.8 Pyridine . . . 107 7.3.9 Oxime . . . 108 7.3.10 Benzammide . . . 109 7.4 PDB Tests . . . 109

7.4.1 Alternate Locations and Frames . . . 110

7.4.2 Amino Residues . . . 111

7.4.3 Nucleotide Residues . . . 112

7.4.4 Generic Residues . . . 114

8 Conclusions 116

A Object-Oriented Programming summary 117

(7)

Chapter 1

Introduction

Perception (from Latin perceptio) can be dened as:

The organization, identication, and interpretation of sensory information in order to represent and understand the presented information, or the environment. [1]

The Perception process is about the simple and almost automatic conversion of sensory information into mental representations. Rational thinking is then applied to such mental representations, thus allowing the possibility of science and new discoveries. The process of human Perception has been widely studied by psychologists and philosophers. As an example of human Perception, the following optical illusion is based on the Kanizsa triangle [2] (gure 1.0.1). These spatially separate fragments give the impression of a bright white triangle, dened by a sharp illusory contour, occluding three black circles and a black-outlined triangle. The triangle is not explicitly represented, but it is perceived. The same analogy can be made when dealing with molecular structures.

Molecular Perception is the set of theories and methods for the automatic determination of those fundamental properties that uniquely identify the system under investigation. To better highlight the similarity between Molecular Perception and human Perception, lets consider the assembling of atoms reported in gure 1.0.2. There is a convention in Molecular Visualization software to represent atoms as spheres, whose colors and radii are assigned on the base of the type of elements being represented. In particular, it is common to employ radii proportional to the Van der Waals radii while, for colors, conventions are used (e.g. red is used for oxygen and white for hydrogen). By observing gure 1.0.2, it is evident that this particular set of atoms is representing a water molecule. In this case, the information about the covalent bond connectivity is not explicitly included in the representation but is easily perceptible. Molecular Perception algorithms automatically detects such relevant chemical

(8)

CHAPTER 1. INTRODUCTION

Figure 1.0.2: Water molecule properties.

Durig the last century, huge amount of data sets of molecular structures, of increasing size and complexity, has been accumulated. Some examples are the Protein Data Bank (PDB) [3] and the Cambridge Structural Database (CSD) [4]. Most of such data sets contain reliable information only about the coordinates of the atoms in the molecular system. Thus, one of the goal of Molecular Perception algorithms is to take the initial structure, which is just a set of atoms sparse in space, and convert it to a more chemically meaningful structure. It is important to note that Molecular Perception does not necessarily requires high level computational chemistry calculations. In fact, the goal is just to highlight those information that are implicitly present in the molecular system, to represent them in a way that is more easily interpretable by a human being. Moreover, Molecular Perception can still complement computational studies. In fact, in molecular mechanics (MM) studies, the energy of a Molecular System is expressed as a function of topological variables such as torsion angles and bond distances. Such functions are well dened within a given set of parameters that are specic for the atom types and/or bond types of the given system. Thus, atom and bond types need to be determined prior to energy calculations [5]. Molecular Perception is also fundamental in interactive applications, such as molecular viewers and editors. Most molecular graphics software makes use of dedicated software modules or external libraries in order to detect the correct topology and the most relevant chemical properties of molecular systems.

The aim of the following thesis is the continuation of a previous work started during the course of my bachelor thesis. This is the development of Proxima, a Molecular Perception library written in C++ language. At the end of my bachelor thesis, Proxima was already capable to detect properties such as the covalent bond connectivity and the hydrogen bond intensities, on the base of atomic coordinates and atom types as structural information. The library also includes a parser for the PDB le format. Molecular Perception of covalent bond connectivity is traditionally performed by comparing the distance between atoms with a theoretical distance resulting from the sum of the relative covalent radii. In Proxima, we also included the additional contribution of the electronegativity (∆χ in the formula below), as proposed by Schomaker and Stevenson [6].

d < Racov+ Rbcov− 0.07(∆χ)2+ tolerance

The contribution of the electronegativity has been proven useful especially in the case of ionic in-teractions, such as in the potassium channel (PDB ID: 1BL8), where most other software (such as OpenBabel [7]) overestimate the number of bonds by adding covalent bonds to the potassium atoms. After establishing a rst connectivity between the atoms of the system, Proxima checks that every single atom respects the rules of valence, which are applied in the form of the following constrain:

(9)

CHAPTER 1. INTRODUCTION the total number of bonds can not exceed the maximum coordination number of that element (as suggested by [8]). The intensity of hydrogen bonds between a triad of atoms, instead, have been computed according to the model proposed by Dr. Pagliai [9, 10, 11].

The development of Proxima is continued in the context of my master's thesis. The initial moti-vation was to implement continuous Bond Orders perception. However, the implementation of such a method was proven to be harder than thought. Traditional software tend to represent bond orders as a discrete quantity that quanties the strength of the bond on an arbitrary scale. This approach could easily fail when trying to relate such structural properties to physical data or when there are more possible structures for the same molecule, as in the case of resonance structures. This issue could be theoretically solved by abandoning the traditional discrete bond orders for a much more physically based continuous description of the chemical bond. This new approach to the continuous bond order perception problem is still in its early stages and a complete solid algorithm has not been found yet. One of the main diculties in the development of an algorithm of this kind, whatever denition for the bond order we are going to use, is that it should be very fast. Indeed, perception of bond orders is meant to be used in interactive software, therefore the time of execution of such a computation is critical.

It is also necessary to face the problem of missing hydrogen atoms. Since state of the art bond order percetion algorithms are based on atomic valences, the information about the total number of bonds of an atom is relevant. However, most chemical structures are derived from crystallographic experiments where hydrogen atoms are invisible since are poor in electron density. As a consequence, it is important to detect the total number of hydrogen atoms that should be added to every atom. This operation can be performed by using the knowledge of the hybridisation of the element that can be evaluated by observing the bond angles. However, bond angles are not always very reliable especially in small rings. Thus, the rst perception method to be implemented in this new version of Proxima was the perception of chemical rings in molecular structure. Then, the hybridisation perception method and nally the perception of partial charges. Moreover, we also had to re-implement the PDB parser in order to better allow easier implementation of new features.

This thesis is organized in eight chapters. At rst, a short review on the theoretical background for molecular rings perception algorithms is illustrated in chapter 2. In chapter 3, the theoretical background behind the perception of the partial charges will be illustrated, while chapter 4 describes the state of the art of bond order perception methods. Then, the changes made on the PDB parser are illustrated in chapter 5 and the structure of the software Proxima is summarized in chapter 6. Finally, some benchmarks and test cases are reported in chapter 7. The nal chapter contains a concluding discussion about the results and the future improvements of this work.

(10)

Chapter 2

Ring Perception

The perception of rings is a complex task. Such complexity arises from the exponential number of possible ring combinations and the inherent diculties in dening rings that are chemically relevant (in the sense of being meaningful for a chemical interpretation of the system). In fact, dierent chemists look for dierent characteristics depending on their task (e.g. whether they are interested in the chemical synthesis of the structure, or in doing computational studies, etc.). Moreover, the concept itself of chemically relevant is not well dened. As an example, consider the structure represented in gure 2.0.1, a Decalin molecule (decahydronaphthalene, also known as bicyclo[4.4.0]decane). Any

Figure 2.0.1: Decalin molecule

Chemist would say the the Decalin molecule is composed of two cyclohexane rings condensed together. This is because of the background knowledge in chemistry that lead to interpret a complex structure as the result of the union of smallest pieces due to chemical synthesis routes [12]. However, a computer is not aware of such reasonings and, as a result, it would simply observe the presence not only of the smallest cyclohexane rings but also of the outer ring represented in gure 2.0.2. Unfortunately, computing all of the possible cycles is not only useless from a chemical point of view, but could result in an exponential amount of detected cycles in the number of vertices [13], which is computationally infeasible in interactive applications. At this point, it seems reasonable to think that only the smallest cycles are the chemically relevant ones and that the cycles with a length greater than some threshold

(11)

2.1. GRAPH THEORY PRELIMINARIES CHAPTER 2. RING PERCEPTION

Figure 2.0.2: The outer ring of the Decalin molecule

could be ignored. However, Molecules in nature can reach high level of complexity and thus the criterion of just take the smallest cycles can easily fail. One example comes from the inorganic chemistry with the Molecular Borromean rings [14], a mechanically-interlocked molecular architecture in which three macrocycles are interlocked in a way that breaking any macrocycle allows the others to disassociate (gure 2.0.3). The most meaningful cycles are the three macrocycles and not the smallest

Figure 2.0.3: Molecular Borromean rings [14]

aromatic rings. We will show later how the criterion of taking the smallest cycles is not completely wrong, but has to be better motivated and explained. The following chapter is articulated in 5 dierent sections in which the relevant concepts about chemical ring detection will be illustrated. In chapter 6, the implementation of a modication of the Horton's algorithm (introduced in section 2.3.1) for the perception of the Smallest Set of Smallest Rings will be illustrated in details.

2.1 Graph Theory Preliminaries

In order to better understand the concepts of the next sections, a brief introduction on graph theory is needed. [13] We dene a graph G(V, E) as a mathematical object composed of a vertex set V and an edge set E. An edge is dened as a pair of distinct vertices, i.e., E ⊆ V XV . A Graph can be additionally dened as a Directed Graph or an Undirected Graph. In a directed graph, each edge has a xed orientation, whereas in an undirected graph edges are not oriented. Examples of a directed and of an undirected graphs are represented in gure 2.1.1. A subgraph H(V0_{, E}0₎ _{of G(V, E) is a}

graph satisfying V0 _{⊆ V} _{and E}0 _{⊆ E}_{. For a set H of subgraphs of G we write E[H] for the union}

of all their edges. We say that H covers a subgraph F(V',E') of G if E0 _{⊆ E[H]}_{. Let v ∈ V be}

a vertex. The number of edges incident with V is called degree of V. A path P in G is a sequence of edges (e1, e2, ..., ek−1) with ei = {νi, νi+1}, i=1,...,k-1 such that the vertices νi, i=1,...,k are all

dierent, except possibly ν1 and νk. A graph G is connected if for any two vertices v, w ∈ V , there

is a path P such that ν1 = ν and νk = w. The connected components of a graph G are its maximal

(12)

2.1. GRAPH THEORY PRELIMINARIES CHAPTER 2. RING PERCEPTION

a) b)

Figure 2.1.1: a) Undirected Graph b) Directed Graph

elementary if it is connected and each vertex has degree two. Elementary cycles meet our expectation of rings in a molecular graphs. A chord of the cycle C is an edge e = {x, y} ∈ E such that e /∈ C, but both x and y are vertices of C. A cycle C is simple or chordless if it is elementary and has no chord. A cycle C is tied if it is elementary and has exactly one chord. The length l(C) of a cycle C is the number of its edges. A tree is a connected undirected graph with no simple cycles. A forest is an undirected disconnected graph, whose connected components are trees. It is convenient to identify a cycle with a vector of numbers for better algebric manipulations. As an example, we will now take as a reference

Figure 2.1.2: Decalin Molecule (Labeled edges)

gure 2.1.2 where each edge has been labeled for the Decalin molecule. Given the following ordering of the reference bonds: (a,b,c,d,e,f,g,h,i,l,m), the smallest cyclohexane ring on the left is described by the vector (1,1,0,0,0,0,0,1,1,1,1), the cyclohexane ring on the right by (0,0,1,1,1,1,1,0,0,0,1) and the bigger ring of gure 2.0.2 by (1,1,1,1,1,1,1,1,1,1,0). Each component of these vectors determines whether the related bond is absent or present in the cycle. These incidence vectors (and we can hence say, the cycles in G) form a vector space where the vector addition of two cycles C´, C´´ corresponds to their symmetric dierence [13]

C0⊕ C00= (C0∪ C00)\(C0∩ C00)

This vector space is known as the cycle space C(G), and it has dimension µ(G) = |E| − |V | + c(G) [13], where c(G) is the number of connected components of G. The parameter µ(G) is called the cyclomatic number or rst Betti number of the graph G. We are now going to proof such statement by rst computing the total number of edges in trees and forests.

Lemma 1. A tree with |V| vertices has |V|-1 edges

Proof. Choose the vertex r as the root of the tree. We set up a one-to-one correspondence between the edges and the vertices other than r by associating the terminal vertex of an edge to that edge. Since there are |V|-1 vertices other than r, there are |V|-1 edges in the tree. [15]

Lemma 2. A forest with |V| vertices and C components has |V|-C edges

(13)

2.1. GRAPH THEORY PRELIMINARIES CHAPTER 2. RING PERCEPTION by subtracting the dierence between total vertices (|TV|) and total edges (|TE|): |TV|-|TE|=C, thus |TE|=|TV|-C. [16]

Theorem 3. The cyclomatic number is equal to the dimension of the cycles vector space for a given graph G

Proof. Knowing that a |V|-vertex tree has |V|-1 edges and that a forest of |V| vertices and C compo-nents (that is the union of C trees) has |V|-C edges, we need to consider the minimum number of edges that must be deleted to create an acyclic subgraph from G, that is: |E|-(|V|-C)=|E|-|V|+C. Now, in order to prove that this is also the maximum number of linearly independent cycles, let's imagine to add edges to the now acyclic graph (the forest) obtained from the original graph. In particular, suppose that before adding the edge uk we had obtained a fundamental basis containing the cycles:

µ1, µ2, ...; and that after the edge uk has been added we have formed ther new cycles: ν1, ν2, ....

Clearly ν1 cannot be expressed linearly in terms of the µi (since the k-th component ν1k 6= 0whereas

µk₁= µk₂ = ... = 0); on the other hand, the vectors ν2, ν3, ...can be expressed linearly in terms of the

µiand ν1. To sum up, in order to create an acyclic subgraph from a cyclic graph, we need to remove

a cyclomatic number of edges. Each time we add one of the previously removed edges to the current graph, we are building up a new set of linearly independent cycles. Thus the cyclomatic number is equal to the maximum number of linearly independent cycles. [17]

A basis B of the cycle space C(G) is called a cycle basis. Given a cycle basis B there is a one-to-one relation between the cycles C and subsets Bc ⊆ B such that

C = ⊕Z∈BcZ

A graph G is 2-connected if any two vertices are contained in a common elementary cycle. The cycle space of G is the direct sum of the cycle spaces of the 2-connected components of G. Hence we restrict our attention to 2-connected graphs from now on. The length of a cycle basis B is the total length of its cycles, l(B) = PC∈B|C|. A minimum cycle basis is a cycle basis of minimal length. A

cycle is relevant if it cannot be written as ⊕ − sum of shorter cycles. Equivalently, a cycle is relevant if it is contained in at least one minimum cycle basis. We write R(G) for the set of relevant cycles. The number of relevant cycles may grow exponentially with the number of vertices. Each cycle in a minimum cycle basis, i.e., each cycle in R(G), is simple.

For each edge e ∈ E we write S(G) for the set of shortest cycles that contain e. The set of shortest cycles through the edges of G is S(G) = ∪e∈ES(e). For all graphs holds S(G) ⊆ R(G). The set S(G)

may also grow exponentially since it could happend that S(G) = R(G). A spanning tree of G(V, E) is a subgraph of G that is a tree containing every vertex of G. A spanning tree can be easily obtained by removing edges in order to remove cycles from the graph as illustrated in gure 2.1.3. It is evident

Figure 2.1.3: The spanning tree of the directed graph of gure 2.1.1.

from the above considerations that the computation of a spanning tree for a given graph can help in highlighting the presence of rings in the structure of the graph. However, removing edges for obtaining

(14)

2.1. GRAPH THEORY PRELIMINARIES CHAPTER 2. RING PERCEPTION spanning trees is algorithmically inecient. Thus, an alternative approach to generate a Spanning Tree is to add edges instead of removing edges. This is typically done by exploring the graph through a Depth-First or a Breadth-First search.

2.1.1 DFS

The depth-rst search strategy starts by selecting some arbitrary node as the root node, in the case of a graph, and explores as far as possible along each branch before backtracking. Of course, the algorithm avoids visiting the same vertex twice and when it backtracks, it only explores not fully explored vertices. It is called Depth-First because this algorithm will rst go the further away from the root node in the graph before exploring the other vertices directly connected to the root node. In other words, it goes in depth in the graph. As an example, taking the vertex labeled A as the root

Figure 2.1.4: A directed graph

node in gure 2.1.4, the DFS procedure visits vertices in the given order: ABCDEFGH. The DFS rst go down the graph until it reaches the deepest vertices (The D vertex). Then it goes up back to the rst vertex not fully explored (The B vertex) and visits its unexplored childrens (The E vertex). This sequence is repeated until all vertices are explored. [15] A more formal description of the DFS in pseudo-code is given in 2.1.

Algorithm 2.1 DFS search [18]

1 procedure DFS(G, v ) : 2 l a b e l v as d i s c o v e r e d

3 f o r a l l edges from v to w in G. adjacentEdges ( v ) do 4 i f vertex w i s not l a b e l e d as d i s c o v e r e d then 5 r e c u r s i v e l y c a l l DFS(G,w)

A typical DFS search in a graph has an algorithmic complexity of Θ(|V | + |E|) in time since it has to visit every edge and vertex [18].

2.1.2 BFS

An alternative to the DFS search is the Breadth-First Search. The BFS is conceptually similar to the DFS. During the BFS, a spanning tree for the graph will be constructed, and the the algorithm proceeds as follows: At rst, an arbitrary vertex is choosed as the root vertex. Then all the edges that are incident to this vertex are added. These new added vertices become the vertices at the level 1 in the spanning tree and are abritrarily ordered. Next, for each vertex at level 1, visited in order, each edge incident to this vertex is added to the tree as long as it does not produce a simple ring. The children of each vertex of the previous level are arbitrarily ordered thus producing the vertices at level 2 in the spanning tree. The same procedure is followed until all the vertices in the tree have been added. The procedure ends since there are only a nite number of edges in the graph. A spanning tree is produced since we have produced a tree containing every vertex of the graph. As an example, taking

(15)

2.2. KIRCHHOFF - FUNDAMENTAL CHAPTER 2. RING PERCEPTION as a reference the same graph of gure 2.1.4, the BFS procedure explores the vertices in the order: ABFCEGHD. It is called Breadth-First because instead of going down in the graph, this algorithm rst searches vertices that are all at the same level. [15] A typical implementation of the BFS requires a queue to stores the vertices to be visited. A more formal description of the DFS in pseudo-code is given in 2.2.

Algorithm 2.2 BFS search

1 procedure BFS( Graph G) :

2 l a b e l the root as d i s c o v e r e d and add i t to the queue q 3 while q i s not empty :

4 s e t the var currentNode as the node e x t r a c t e d from the f r o n t o f q 5 f o r each c h i l d o f currentNode :

6 i f the c h i l d hasn ' t been already d i s c o v e r e d : 7 add the c h i l d at the end o f q .

8 l a b e l each c h i l d o f currentNode as d i s c o v e r e d

This algorithm has complexity of O(|V | + |E|) [18].

2.2 Kirchho - Fundamental

The previous section ends with the presentation of DFS and BFS as methods to compute a Spanning Tree of a given graph. Moreover, it has been noted how the process of building a spanning tree can help highlighting the presence of rings in the graph. Thus, it is no surprise that the rst type of cycle basis we are going to deal with is intrinsically based on the concept of Spanning Tree of a graph. In 1847 Kirchho [19] gave the following denition: Suppose T is a Spanning Tree of a graph G, for each edge e /∈ T there is a unique cycle in T ∪ {e} which is called a fundamental cycle w.r.t. T. The set of fundamental cycles belonging to a given spanning tree forms a basis of the cycle space which is called the fundamental basis w.r.t. T. Thus, a cycle basis B that is fundamental w.r.t. to some spanning tree T of G is called Kirchho-fundamental. In the example of Fig. 2.1.1 (with the spanning tree in Fig. 2.1.3), the only fundamental cycle is the ABC cycle. It is possible to generalize such concept by taking a collection B of µ(G) cycles in G in order to have an ordering of these cycles such that:

Cj\(C1∪ C2∪ ...Cj−1) 6= ∅ (2.2.1)

for 2 ≤ j ≤ µ(G). If the above equation is satised by every ordering of B then the cycle basis is called strictly fundamental. It is possible to prove that a cycle basis is strictly fundamental if and only if it is Kirchho-fundamental. [20]

Theorem 4. A cycle basis B is strictly fundamental if and only if is Kirchho-fundamental. [21] Proof. Assume B is Kirchho-fundamental and more precisely B = {Ce|e ∈ E\T }, where T is a

spanning tree and Ce denotes the unique cycle in T ∪ {e} for every e in E\T . Then, for every cycle

Ce in B , we immediately have

{e} ∈ Ce\ ∪ C0∈B\{Ce}

C0 (2.2.2)

Conversely, assume B is strictly fundamental. For every ring C in B, let eC be any edge contained

in C but in no other ring of B. Notice that T := E\{eC: C ∈ B}has precisely |E|-(|E|-|V|)-1=|V|-1

(16)

2.3. SMALLEST SET OF SMALLEST RINGS (SSSR) CHAPTER 2. RING PERCEPTION

2.3 Smallest Set of Smallest Rings (SSSR)

Originally, the Smallest Set of Smallest Rings was dened as a minimum length Kirchho-fundamental basis [22]. It was later shown in 1982 that the problem of nding a Kirchho-fundamental cycle basis with minimum length is NP-hard [23]. As noted by Berger et al. [13], in the more recent chemistry literature, the term SSSR is also used as a synonym for minimum cycle basis. This tacit shift of denition reects the wide-spread misconception that every cycle basis or at least every minimum cycle basis is strictly fundamental, i.e., derivable from a suitable spanning tree. The point is that a minimum Kirchho-fundamental basis, besides being hard to compute, often does not contain all smallest cycles of the graph. An example can be seen in gure 2.3.1, where the desired minimum cycle

Figure 2.3.1: The minimum cycle basis of this planar graph consists of the four triangles and the central square. It is easy to check directly that no spanning tree generates the minimum cycle basis. [13]

basis is made by the four triangles and the central square, yet is not strictly fundamental in the sense of equation (2.2.1). Consequently, to correctly compute an SSSR, an algorithm that generates a minimum cycle basis is needed. A number of polynomial-time algorithms for this task have been published in recent years. Although their worst-case complexity is rather high (e.g. the Horton's algorithm[24] described below has a worst-case complexity of O(n7₎_{, where n is the number of vertices), their average}

performance on problems of practical interest seem to be much more favorable. The rst algorithm ever introduced to compute the SSSR is the Horton's algorithm [24], which is also at the base of the algorithm implemented in Proxima.

2.3.1 Horton's Algorithm

Before introducing the nal formulation of the Horton's algorithm, a series of lemmas and theorems will be discussed and proven, so to better understand the inner logic of the algorithm.

Lemma 5. If B is a cycle basis for a graph, C ∈ B, and C = C1⊕ C2, then B − {C} ∪ {Ci} is a

cycle basis for one of i=1 or 2.

Proof. Both C1 and C2are generated by B, and each in a unique way. Hence for both i=1 and 2, Ci

is the sum of a subset of the cycles in B, say Ai, and Aiis unique. Since C1= (the sum of the cycles in

A1) = C ⊕ C2= C ⊕ (the sum of the cycles in A2), A1and A2dier only in that one of them includes

C and the other does not. Assume C ∈ A1. Then B − {C} generates C2, hence B0= B ∪ {C1} − {C}

generates all cycles in B and is a basis. [24]

Theorem 6. A minimum cycle basis always consists of simple cycles.

Proof. Assume B is a cycle basis, C is in B, and C is not a simple cycle. Then C contains a simple cycle C1, and C = C1⊕ C2, where C2 is a cycle. If all simple cycles have length, 0 < l(C1) < l(C),

(17)

2.3. SMALLEST SET OF SMALLEST RINGS (SSSR) CHAPTER 2. RING PERCEPTION and 0 < l(C2) = l(C) − l(C1) < l(C). By lemma 5, B0= B − {C} ∪ {Ci} is a basis for one of i=1 or

2. But B0 _{has less length than B. [24]}

Theorem 7. Let x and y be two vertices of the graph G, and let B be a cycle basis of G. Let P be a path from x to y. Then any cycle C of B that contains x and y can be substituted in B for either a cycle that includes the path P, or a cycle that excludes one of x and y.

Proof. Let P1 and P2 be the two paths in C joining x and y. Assume P 6= P1 and P 6= P2. Let be

C1= P ⊕ P1and C2= P ⊕ P2. Then C1 and C2are cycles, and C = C1⊕ C2. By lemma 5, C can be

exchanged for either C1 or C2. If the exchanged Ci is not a simple cycle, then Ci can be exchanged

for one of its simple cycles, and that simple cycle cannot include both x and y. [24]

Corollary 8. If B is a minimum cycle basis, and P is the unique shortest path from x to y, then every cycle of B that contains x and y must contain the path P.

Proof. In the proof of theorem 7, l(Ci) 5 l(P ) + l(Pi) < l(P1) + l(P2) = l(C). Hence if C does not

contain P, it can be exchanged for a shorter cycle, and so B would not be minimum. [24]

Note that the corollary assumed that the shortest path from x to y was unique. In general, there can be several paths between any two vertices that are shortest. What corollary 8 implies is that in any minimum cycle basis, any two vertices of a cycle must be joined by a shortest path. Moreover, if there is a preferred shortest path from x to y, one can exchange all cycles in B for cycles that contain this preferred shortest path, and still retain a minimum cycle basis.

Let be d(x,y) the length of a shortest path from x to y. We are now going to prove the fundamental theorem that is at the basis of the Horton's algorithm.

Theorem 9. Let x be any vertex of any cycle C in a minimum cycle basis. Then there is an edge {y, z} in C such that C consists of a shortest path from x to y, a shortest path from x to z, and the edge {y, z}.

Proof. Let C = (x0, x1, ..., xn−1, xn), where x = x0 = xn. Dene Pi = (x, x1, x2, ..., xi) and Qi =

(x, xn−1, ..., xi). By the corollary 8, d(x, xi) = l(Pi) or l(Qi). Let y = xi, where i is the largest

subscript such that d(x, xi) = l(Pi). Let z = xi+1. Then C = Pi+ {xi, xi+1} + Qi+1. [24]

The theorem 9 gives a strong condition on the cycles in a minimum cycle basis, especially if all the shortest paths are unique. In the latter case all basis cycles can be found by considering for each vertex-edge pair (x,{y,z}) the cycle C(x, y, z) = P (x, y)+P (x, z)+{y, z}, where P(v,u) is the shortest path from v to u. Obviously there are at most nm, of such cycles (here n is the number of vertices and m is the number of edges). A greedy algorithm can be used to extract the minimum cycle basis from the set of cycles obtained so far. Unfortunately, this last greedy step is necessary since the countrary of theorem 9 does not hold. Thus, the set of cycles obtained so far is not guaranteed to be a set of independent cycles.

The outline of the Horton's algorithm is the following:

1. Find a minimum path P(x,y) between each pair of vertices x, y.

2. For each vertex v and edge {x, y} in the graph, create the cycle C(v, x, y) = P (v, x) + P (v, y) + {x, y}and calculate its length. Degenerate cases in which P(v,x) and P(v,y) have vertices other than v in common can be omitted.

3. Order the cycles by length.

(18)

2.3. SMALLEST SET OF SMALLEST RINGS (SSSR) CHAPTER 2. RING PERCEPTION Each of the steps above requires the following number of operations:

1. Takes Θ(n3₎_operations

2. Takes Θ(mn2₎_operations

3. Takes Θ(a log(a)) operations 4. Takes Θ(nk) operations

Where m is the number of edges, n the number of vertices, a is the number of cycles found in step (2), and k is the number of operations to decide whether a cycle is independent from another given set of independent cycles or not. The last task is typically performed by applying a Gaussian elimination on the matrix whose rows are the incident vectors representing the cycles in the current bond space. Each row should be processed in turn, in the order of the weights of the cycles. When enough independent cycles have been found, the process can stop. As noted by Horton [24], for this type of process the worst-case complexity becomes O(n7₎_.

2.3.2 Proxima Implementation

Despite the polynomial cost of the Horton's algorithm, it is still high in the total number of vertices. In order to reduce such cost, it is necessary to apply the algorithm on subsets of the overall molecular graph. Thus, the original Horton's algorithm is implemented in Proxima with the optimization pro-posed by Bo Tao Fan et al. [25]. The idea of Bo Tao Fan is to rst separate the molecular system in blocks of cyclic atoms. A cyclic atom is an atom that is part of at least one cycle. An acyclic atom, on the other end, is an atom that is not part of any cycle. The original article introduces the concept of Open Acyclic Node, dened as an acyclic atom that is not located between two blocks of cyclic atoms, and Closed Acyclic Node, dened as an acyclic atom located between two blocks of cyclic atoms. In

Figure 2.3.2: In this reference Molecular System there are two blocks of cyclic atoms: (1,2,3,4,5,6,7,8,9,10,11,12,13) and (20,21,22,23,24). There are also ve open acyclic nodes: 14,15,16,17,18 and one closed acyclic node: 19.

gure 2.3.2 this concepts are better highlighted. It is evident that the Horton's algorithm could easily be performed just on the blocks of cyclic atoms. A pre-processing step is then performed in order to detect the blocks of atoms, so to later apply Horton to each block individually. Moreover, by doing

(19)

2.4. ALTERNATIVES TO THE SSSR CHAPTER 2. RING PERCEPTION

a) b)

Figure 2.4.1: a) Cubane molecule b) Planar embedding of the cubane molecule

so, the computation of the SSSR can now be parallelized, since each block is independent from the others.

The procedure starts by removing Open Acyclic Nodes rst, then the Closed Acyclic Nodes and nally assigns atoms to their relative block. In order to remove Open Acyclic Atoms, the procedure begins by terminal nodes, that for sure can't be part of any cycle. This is carried out by means of a recursive function since, once a terminal atom is deleted, new terminal atoms are generated. As an example, consider what happens in gure 2.3.2 when the atom number 18 is erased: the atom 17 now becomes a terminal atom and it is also removed. Once all of the Open Acyclic Nodes have been removed, the algorithm starts removing the Closed ones. The procedure is relatively simple. Starting from a particular atom (root), for example atom 20 in gure 2.3.2, if a closed path can be found this atom is surely a ring member and it is labeled as a Cyclic Atom. It is important to note that the closed path found may or may not be a ring of the SSSR. This does not matter since the cycle basis will be extracted in future steps. In contrast, a closed acyclic atom can never be found in a closed path. The algorithm performs a BFS in order to check for closed paths. As an example, in gure 2.3.2, the atom number 19 is not recognized as being part of any closed path and therefore is marked as a Closed Acyclic Node and removed. At the end of the above procedures, only cyclic atoms remain.

At this point, atoms are assigned to blocks. Let's start from an arbitrary root atom, another BFS is performed and every atom that is encountered during this search is attribuited to the same block of the root atom. The BFS is then permormed iterativetly until all of the atoms are assigned to some block.

Such procedure has been proven to be ecient especially when dealing with proteins, where cycles are sparse in the structure. Of course, one of the biggest limitation of such algorithm is when dealing with structures composed by a very limited amount of huge blocks (blocks made by a huge number of atoms). An example of such structures are nanotubes. For more information see chapter 7.

2.4 Alternatives to the SSSR

To conclude this chapter, it is important to note how the problem of computing chemically relevant rings in molecular graphs is still an open problem to be solved. In particular, even if the SSSR approach is the most solid and widespread denition, it should also be taken with a grain of salt. Let's take as an example the cubane molecule represented in gure 2.4.1. The chemically relevant cycles are the six square faces. However, the SSSR basis is made only of ve squares. The reason is, simply, because it is a basis and as such it has to be made of linearly independent cycles (in the sense of the operation described at Eq. 2.1); given ve squares, the sixth square can be computed as the ⊕ sum of the other cycles. It could help to look at gure 2.4.1 b) so as to better understand this concept.

(20)

2.4. ALTERNATIVES TO THE SSSR CHAPTER 2. RING PERCEPTION The gure 2.4.1 b) is a planar embedding of the cubane molecule, i.e. a plane graph without self-intersecting edges that is isomorphic to the original graph [13]. At this point it should be easy to see why the SSSR is made only of ve quadrilateral rings by looking at gure 2.4.1 b). There are several other possible denitions of chemically relevant cycles in the literature that try to face such problematics. Each of these denitions has some advantages but also some drawbacks. In particular, most of such denitions tend to lose the requirement of being a basis for the cycle space just for the sake of chemical intuition. Anyway, the SSSR is still a very robust method for computing rings and, being a cycle basis, has a set of mathematical properties that makes it easy to be used to compute other derived properties. Some possible alternatives presented in literature to the SSSR are briey discussed in the following.

2.4.1 Essential Set of Essential Rings (ESER)

The Essential Set of Essential Rings was introduced by Fujita [26]. Depending on atom types, cycles are classied according to the atoms that they contain, such as carbon, heteroatom (N, O, S, P), and abnormal (all other atoms). For simplicity we assume here that all atoms are of the same type or at least comparable in the sense of Fujita. For a simple cycle C, let be T∗_[C]_{the set of all tied cycles C'}

that satisfy: l(C0_{) ≤ l(C)}

2l(C0_{∩ C) ≥ l(C}0₎_{, i.e. at least half of the edges of C' are in C}

A simple cycle C is ESER-dependent if there is a subset T0 _{⊆ T}∗_[C]_{such that C ⊆ E[T}∗_]_(remember

the denition of E at 2.1). ESER(G) is the set of all simple cycles in G that are not ESER-dependent.

2.4.2 Extended Set of Smallest Rings (ESSR)

The Extended Set of Smallest Rings was introduced by Downs et al. [27] as an approach to design an optimal ring set for retrieval purposes. ESSR is by denition limited to planar graphs. Paraphrasing the original denition, a cycle C is in ESSR(G) provided it satises at least one of the following conditions:

There is a planar embedding of G such that C is a chordless face. (Simple faces) C is a shortest cycle through at least one of its edges. (Class I cut face)

There is a planar embedding of G such that l(C) ≥ l(C0₎ _{for all faces C´ adjacent to C, and}

there is at least one adjacent face C´´ for which l(C) = l(C00₎_{. (Class II cut face)}

Here, a face, is an elementary cycle in G that delimits a unique region of G (if G is 2-connected)[13].

2.4.3 Set of Smallest Cycles at Edges (SSCE)

Let G(V,E) be a 2-connected graph. Set G0:= G(V, E0)with E0:= Eand dene Gk(V, Ek)recursively

as the graph with vertex set V and edge set Ek = {e ∈ E|e is contained in exactly one cycle C ∈

S(Gk−1)}. In other words, in each step we remove all edges that are covered at least twice by S(Gk−1).

Let be ξ(G) the smallest number k such that Ek = Ek+1. Clearly, ξ(G) = 0 if G is a cycle. Dury et

al. [28] dene the Set of Smallest Cycles at Edges (SSCE) as the union SSCE(G) =ξ(G)∪

(21)

2.4. ALTERNATIVES TO THE SSSR CHAPTER 2. RING PERCEPTION

Figure 2.4.2: Norbornan molecule

2.4.4 Beta Rings

The set of β-rings [29] is one of the earliest attempts to extend the SSSR to include the hexagon of norbornane, depicted in gure 2.4.2, but without including much larger envelope rings in other molecules such as naphthalene. A β-set is obtained from the length-sorted list of all chordless faces of a plane embedding: The algorithm processes the cycles by increasing length and includes all cycles of a given length that are linearly independent from all shorter cycles already contained in the set. In case of the β-rings:

all three and four-cycles are automatically included in the set of beta rings

instead of testing linear independence from all shorter cycles, it is possible to only check whether a cycle is linearly independent from all subsets containing at most three shorter rings.

In general, the set of β-rings does not contain a minimum cycle basis, because not all planar graphs have a minimum cycle basis consisting of faces. Furthermore, the denition is not unique since it depends on a particular planar embedding.

(22)

Chapter 3

Partial Charge Perception

In 1926, Schrödinger introduced the famous equation [30]: ˆ

Hψ = Eψ

The Schrödinger equation is at the foundation of modern quantum chemistry. It allows the de-velopment of advanced algorithms for the computation of electronic properties in molecules. One of the biggest challenges of Molecular Perception is to translate such advanced quantum mechanical concepts into human understandable representations. The problem of representing and visualizing microscopic properties is at the foundation of chemistry. Structural formulas are an example of one of the many ways chemists usually deal with microscopic structures. However, with the advancements in knowledge and research, more complex concepts are now required to be represented. As an example, the nal result of the Schrödinger equation is the so well known wave function ψ. This mathematical object is a function of the coordinates of the particles that compose the system under investigation and describes their distribution in space. However, our perception of the world is sensory dependent and we live in a macroscopic world. The goal is to ll the gap between quantum mechanical concepts and semi-classical representations that are more immediate to understand.

One of the perception techniques of this kind, that have been implemented in Proxima, is the automatic perception of partial charges. The most physically-based view on charges is to look at them as clouds of charge density distributed in space. However, as chemists, we tend to think at a molecule as a set of atoms bonded together with a set of xed charges attributed to each one of these atoms. The problem of localizing charges is thus a mathematical artice. One common way to localize charges is to start from the wave function itself. The Mulliken approach is one of the most famous methods for such computations [31]. Having a set of molecular orbitals ψµ each one dened

as a linear combination over a basis-set of atomic functions (ψµ=PiCµiφi), it denes the charge of

an atom A as: QA= ZA− X µ X ν Pµν

Where ZA is the atomic number, P is the density matrix:

Pµν = 2

X

i

CµiCνi∗Sµν

Here, S is the overlap matrix (Sµν =< ψµ|ψν >) and C is the matrix of the coecients for the given

(23)

3.1. GASTEIGER CHAPTER 3. PARTIAL CHARGE PERCEPTION to molecular perception algorithms. In fact, these algorithms are meant to be used in real-time applications and performance is evaluated more than accuracy.

Several other methods in the literature tries to nd ways to compute charges in molecular systems based on semi-classical methods that are parameter-dependent. These methods may be suitable for Molecular Perception applications since are fast enough and mostly accurate. Of course, one of the disadvantages of such methods is that they loose generality, thus they can't be used as computational tools to investigate new materials.

In this chapter, two methods are reported: The Gasteiger method [32], currently implemented in Proxima and Open Babel [7], and the Qeq method [33], currently part of the Open Babel software as a plugin.

3.1 Gasteiger

The Gasteiger method [32] was rst introduced in 1979. The idea of the Gasteiger method is to use electronegativities to quantify the polarity of bonds. At an intuitive level, it is clear that the polarity of a bond inuences the partial charge on each of the atoms involved. In fact, it is possible to think at the partial charge as the dierence between the mean number of electrons localized around the nucleus in the molecule and the total number of electron assigned to the same nucleus in vacuum. The starting point of the Gasteiger method is to relate the electronegativity of an atom in a molecule to its partial charge.

χiν = aiν+ biνQi+ ciνQ2i (3.1.1)

In equation (3.1.1), the electronegativity for an atom i in the valence state ν is written as a second order expansion over the partial charge Qi. To determine the three coecients (a,b and c), Gasteiger

uses ionization potentials and electronic anities.         

aiν= 1₂(Iiν0 + Eiν0)

biν =1₄(Iiν++ E + iν− E 0 iν) ciν = 1₄(Iiν+− 2I 0 iν+ E + iν− E 0 iν) Here, I0

iν and Eiν0 are the Ionization Potential and the Electronic Anity for the atom i in the

valence state ν and for the neutral form. I+ iν and E

+

iν are for the positive ion and I − iν and E

− iν are for

the negative ion. In table 3.1, values for a, b and c are reported. These values are in eV units and are used within Proxima (see section 6.6 for the implementation).

The total electronegativity of an atom Ai in the +1 state is given by setting the charge to +1 in

3.1.1:

χ+_iν = aiν+ biν+ ciν

Hydrogen poses a special problem as there is no second ionization possible. However, a value of 20.02eV is used in the original article and this value is used in Proxima. In order to maintain a state of +1 for an atom Ai in a molecule, an atom Aj bonded to Ai must have an electronegativity

which is at least as high as χ+

iν. Thus, χ +

iν is used to scale any electronegativity dierence between

two bonded atoms to a charge transfer in electron units. In fact, the driving force for the transfer of charge between two bonded atoms A and B (χB > χA) is the gain in electrostatic energy. To allow

(24)

3.2. QEQ CHAPTER 3. PARTIAL CHARGE PERCEPTION Atom Valence State a b c

H 7.17 6.24 -0.56 C sp3 7.98 9.18 1.88 C sp2 8.79 9.32 1.51 C sp 10.39 9.45 0.73 N sp3 11.54 10.82 1.36 N sp2 12.87 11.15 0.85 N sp 15.68 11.7 -0.27 O sp3 14.18 12.92 1.39 O sp2 17.07 13.79 0.47 F 14.66 13.85 2.31 Cl 11.00 9.69 1.35 Br 10.08 8.47 1.16 S 10.14 9.13 1.38

Table 3.1: Gasteiger parameters. Values are in eV [32]

q<α>= χ <α> B − χ <α> A χ+_A ( 1 2) α _(3.1.2)

In this equation, it is important to note the denominator χ+

A and the factor ( 1 2)

α_{. Here, α gives}

the number of the iteration step. In the rst iteration step only part (l/2) of the electronegativity of atom B is allowed to exert its inuence on atom A and transfer charge. This is done to consider the electrostatic eld generated through charge transfer in the following iteration steps. With the new computed charges from the equation above, new electronegativities can be computed (now having a damping factor of 1/4).

For poly atomic molecules, equation (3.1.2) must be adapted. In particular, all of the neighbors directly bonded to an atom Ai must be simultaneously taken into account at an iteration step. For

neighbors Aj of atom Aithat are more electronegative of Ai the value χ+i is taken, whether for those

atoms Ak that are less electronegative than Ai the value χ+k must be considered as the electrons are

taken away from the electropositive atom. The resulting equation for the charge of the atom Ai at

the step α is:

q_i<α>=   X j 1 χ+_iν(χ <α> jµ − χ <α> iν ) + X k 1 χ+_kλ(χ <α> kλ − χ <α> iν )   1 2 α (3.1.3) This model has proven to be correctly applicable to σ −bonded and to nonconjugated π −systems. Several other methods have been proposed that extends the Gasteiger method to treat conjugated systems. However, the Gasteiger method is currently implemented as the default method for the charge computation in Open Babel and, as a starting point, in Proxima.

3.2 Qeq

The Qeq method is another method for the computation of partial charges in molecular systems. The main dierence with the Gaisteiger method is that it is more physically based and the assignation of charges is also done considering the relative distances between atoms. However, this method is still based on the concept of electronegativity and, also, hardness.

The original formulation of the Qeq method is the one of Rappé and Goddard [34]. The starting point is to consider how the energy of an isolated atom changes when changes its charge. To do so, the energy is written as a power series of the charge:

(25)

3.2. QEQ CHAPTER 3. PARTIAL CHARGE PERCEPTION EA0= EA0+ QA ∂E ∂Q A0 +1 2Q 2 A ∂2_E ∂Q2 A0 + ... (3.2.1) Including only terms through the second order in equation (3.2.1), it is possible to write:

EA(+1) = EA0+ ∂E ∂Q A0 +1 2 ∂2_E ∂Q2 A0 EA(0) = EA0 EA(−1) = EA0− ∂E ∂Q A0 +1 2 ∂2_E ∂Q2 A0 so that ∂E ∂Q A0 = 1 2(IP + EA) = χ 0 A ∂2_E ∂Q2 A0 = IP − EA = J_AA0

where IP and EA denote the ionization potential and electron anity and χ is referred to as the electronegativity. The J0

AAterm is twice the chemical hardness. It represents a resistance to electron

ow to/from an atom. The key observation is that the electronegativity is somehow related to the chemical potential of an electron gas surrounding the nucleus. In fact:

µi= ∂E ∂N = −χ 0 i = −e ∂E ∂Qi (3.2.2)

At equilibrium, the chemical potential assumes the same value everywhere. Thus, one of the assumptions of the Qeq method is to require all of the atomic electronegativities to be equal with each other. This is like assuming that the charge equilibrates itself instantaneously as soon as a geometrical change occurs. These electronegativity and harndess values are used as parameters in this method. That means that their numerical values are optimized so as to compute the correct results.

Now, when considering a molecule, the energy must be written as a sum of single atom (such as the one of equation (3.2.1)) and couple contributions, since each atom is subjected to the electric eld generated by the surrounding charges. Thus, the energy for a molecular system is now written in the following: E(Q1, ..., QN) = X A (EA0+χ0AQA+ 1 2Q 2 AJ 0 AA)+ X A<B QAQBJAB= X A (EA0+χ0AQA)+ 1 2 X A,B QAQBJAB (3.2.3) In this equation, new parameters arise such as the JAB term. This term express the coulomb

interactions between the two charges QAand QB. Values for JAA0 and χ 0

A can be obtained as

param-eters by tting known results. However, the JAB terms must be computed on-the-y since is distance

dependent. One way to compute such term is to describe it in terms of the Coulomb interaction between two charge distributions represented by spherical (s) Slater orbitals [35]:

JAB(RAB) = Z R3 dr Z R3 dr0|φA(r − rA)| 2_|φ B(r0− rB)|2 |r − r0_| (3.2.4)

(26)

3.2. QEQ CHAPTER 3. PARTIAL CHARGE PERCEPTION real time applications. As a consequence, several empirical Coulombic interaction formula have been proposed to correctly t the JAB terms in function of the relative distance between the atoms and

the J0

AAand J 0

BBparameters. One of these expressions is the following introduced by Patél [33]:

JAB(RAB) = 1 2(J 0 AA+ JBB0 ) q 1 +1₄(J0 AA+ J 0 BB)2R 2 AB (3.2.5) Notice how this expression mimics the correct behavior of 1/r for large distance separations. The Qeq method now solves the following equations:

   χ1= χ2= ... = χN PN i=1Qi= Qtot (3.2.6) By remembering that χi= ∂E

∂Qi, it is possible to write the system of equations above as Cq = D:

        1 1 ... 1 1 J2,1− J1,1 J2,2− J1,2 ... J2,N −1− J1,N −1 J2,N− J1,N ... ... Ji,j− J1,j ... ... JN −1,1− J1,1 JN −1,2− J1,2 ... JN −1,N −1− J1,N −1 JN −1,N− J1,N JN,1− J11 JN,2− J1,2 ... JN,N −1− J1,N −1 JN,N− J1,N                 Q1 Q2 ... QN −1 QN         =         Qtot χ1− χ2 ... χ1− χN −1 χ1− χN        

This system can be solved with the use of traditional algebraic libraries. One of the biggest advantages of this method is that it can correctly treat the problem of charge transfer. In fact, instead of imposing the total charge constrain PN

i=1Qi= Qtot, it is possible to impose charge constraints on

single molecules/portions of the system (such as PNmol

(27)

Chapter 4

Bond Order Perception

Bond orders were hystorically introduced as a graphical tool to represent microscopic structures, to relate structures to single atom properties (such as the valence that determines the total number of bonds that each atom is allowed to form) and to quantify the strength of dierent bonds. Thus, the Molecular Perception of bond orders can help in highlighting some relevant properties of the system under investigation. An example of primitive representations of bond orders is given by the structure of ethylene and acetylene made by Joseph Loschmidt (1861) [36] reproduced in gure 4.0.1.

Figure 4.0.1: Structures of ethylene and acetylene made by Joseph Loschmidt (1861) [36] Further developments in chemical structure theory have better investigated the nature of such concept. In Molecular Orbital theory we now look at a molecule as a set of nuclei xed in some given positions (within the adiabatic Born-Oppenheimer approximation) and at the electrons as quantum particles delocalized around the nuclei in the form of orbitals. Bond Order can now be thought as a sort of mean number related to the number of electrons localized in between a given couple of atoms. There is not a unique way of localizing the electron density in space and thus the denition itself of a bond order parameter is not unique. For diatomic molecules, for instance, it is possible to dene the Bond Order as half the dierence between the number of bond electrons and anti-bonding electrons. For polyatomic molecules, however, other denitions must be used. A very popular one has been formulated by Mulliken [31] and denes the bond order between atoms A and B such as:

qAB= 2 A X µ B X ν PµνSµν

Where Pµν is the density matrix for a LCAO-MO treatment of the molecule, and Sµν is the overlap

matrix. Such denitions, although very useful in the computational analysis of molecules, are not very helpful in solving the Molecular Perception problematic. In fact, quantum calculations are not suited for interactive applications, therefore, molecular perception must rely on approximated methods and estimations. Indeed, a good Molecular Perception algorithm should:

Be able to determine those structural characteristics that are immediately interpretable by any chemists

(28)

4.1. PAULING METHOD CHAPTER 4. BOND ORDER PERCEPTION As a matter of fact, a more traditional vision of bond orders, such as the Lewis structures, could be more useful than advanced quantum chemicals denitions, since many chemists are already condent with such representations. Many molecular properties, especially in the Organic Chemistry eld, can be qualitatively evaluated by simply observing the molecular structure if it is represented in a meaningful way.

In this chapter several methods for the state-of-the-art determination of bond orders are presented. Most of these methods treat bond orders as a discrete quantity, since they are easier to implement and easier to understand. However, there are still some advantages in working with continuous bond orders. As an example, consider the resonance structures of benzene (represented in gure 4.0.2 a). There are two structures allowed, and neither of them correctly represents the real structure by itself. It is the ensemble of the resonance structures that correctly describes the overall structure. Ring and functional group detection can solve this problem by detecting known portions of the structure, but at the cost of additional steps that can complicate the software even further. By allowing bond orders to have continue values, however, it becomes possible to represent the actual molecule with a single structure. In the benzene example, a bond order of 1.5 can be assigned to each bond in the aromatic ring, as shown in gure 4.0.2 b). However, despite some ideas being reported in section 4.8, a complete and solid algorithm for the computation of such continuous bond orders has not been found yet.

a) b)

Figure 4.0.2: a) The resonance structures of benzene b) Single representation of the resonance structure

4.1 Pauling method

Linus Pauling, in 1947, proposed an equation that directly relates the bond order with the bond length [37].

sij = exp[

d1− dij

b ]

In this equation, b is a constant that depends on the considered pair of atoms, dij is the distance

between the atoms and d1 is the bond distance for the same pair of atoms but with a single bond

order. Such expression is much easier to handle than a density matrix but has some drawbacks: This expression computes the bond order as a two atoms dependent parameter and, as such,

it is only valuable for diatomic molecules.

Distances retrieved from 3D molecular structure les are usually not very reliable.

However, relative distances can still be used for computation of the covalent bond connectivity of molecular structures. With the covalent bond connectivity computed, the problem of assigning bond

(29)

4.2. THE I-INTERPRET CHAPTER 4. BOND ORDER PERCEPTION orders is now the problem of assigning the right weight to the edges of a molecular graph. This task can be approached in many dierent ways.

4.2 The I-interpret

A conceptually simple method is the one proposed by Yuan Zhao in 2007 [38], consisting of the following steps (summarized in gure 4.2.1):

1. Perception of the connectivity (panel 2 ingure 4.2.1).

2. Determination of the hybridization states of atoms (panel 3). Note that terminal atoms remain undetermined in this step.

3. Recognition of functional groups and aromatic rings (panel 4). 4. Bond order assignment as a function of distances (panel 5). 5. Resolution of the conicts (panel 6).

Figure 4.2.1: Interpreting the chemical structure of an organic molecule from nude model.[38] The rst two steps have already been discussed in previous sections. The third step requires a large dataset of known functional groups: using molecular ngerprints it is then possible to perform a sub-graph search within the dataset (considering the hybridization states of the atoms as a vertex weighting function). The fourth step is performed with the distance evaluations reported in Alg. 4.1. dij is the distance between the atoms in the three-dimensional structure and Lij is the single bond length for the pair of atoms i and j.

Algorithm 4.1 The algorithm for the detection of the bond types.

IF ( i s the c e n t e r o f a d i h e d r a l angle g r a t e r than 30 d e g r e e s ) THEN SINGLE

ELSE IF ( i t connects at l e a s t one sp2−h y b r i d i z e d atom AND d i j < L i j − 0 . 1 0 ) THEN DOUBLE ELSE IF ( i t connects at l e a s t one sp−h y b r i d i z e d atom AND d i j < L i j − 0 . 2 5 ) THEN TRIPLE

The nal step is required since dierent methods are used simultaneously in the detection of bond orders. As a consequence, conicts between bond order assignments can arise. In order to resolve these conicts, some priority rule must be dened. In the original paper, this general priority rule is used:

(30)

4.3. MAXIMUM WEIGHTED MATCHING PROBLEMCHAPTER 4. BOND ORDER PERCEPTION

4.3 Maximum Weighted Matching Problem

A similar approach is the one proposed by Labute [39], in which a Maximum Weighted Matching algorithm for nonbipartite graphs is used to assign bond orders by means of weights derived from statistics made on a large collection of organic molecules (instead of using ad hoc chemical group patterns or ring analyses). The edge weight is computed using the following formula:

wij = ui+ uj+ 2δ(dij < Lij− 0.11) + δ(dij< Lij− 0.25)

Where uiis the weight scoring atom i's participation in a higher order bond, while the delta terms

add additional preference for higher order bonding for short bond lengths. The atom weights are statistically derived according to the formula:

u = log[ P r(atom has a double bond) P r(atom has no double bond)]

where the probabilities (Pr) are estimated by counting frequencies of single and double bonding in a collection of 200 000+ organic compounds from vendor catalogs.

4.4 Branch-and-Bound

Recently, new approaches based on heuristics have been proposed, such as the one introduced by Artemova et al. [8]. This article extends the original idea proposed by Wang et al. [5] of associating a penalty to each atom type, and then exploring the possible combinations of bond orders that minimize the total penalty. The penalty associated to an atom depends on its element type, valence, and possibly the functional group it belongs to. A set of atomic penalties based on valence states are reported in gure 4.4.1. Precisely, the atomic valence va(i) of atom i is dened as the sum of all bond orders boij

over the bonds formed by atom i:

va(i) =

co(i)

X

j

boij

where co(i) is the atom's coordination, that is, the number of atoms bonded to i. The distance of the computed va from the most desirable valence value is measured by an atomic penalty score (aps). The total penalty score (tps) is then computed as:

tps =

N

X

i=1

apsi

The bond order assignment problem may be represented by a decision tree with height NB+ 1,

where NBis the number of unassigned bonds. Each non-root level i represents a bond order assignment

for bond i, so that each internal node at level i has three children at level i + 1 corresponding to the three possible bond orders, with such a representation, the root node can be associated to a fully unassigned connected component, whereas each leaf node represents one of the 3N _{possible combinations of bond}

order assignments. A branch-and-bound algorithm with a depth-rst search strategy may be used to enumerate the possible bond order assignments and, thus, evaluate possible scores. A branch traversal is stopped as soon as the partial score (cumulated from the tree root to the considered node) is higher than the minimal score found so far. In addition, the branch-and-bound search stops if the penalty for an entire branch equals to 0, since no better solution can be found and no distinction is made between dierent structures having the same optimal score. This approach is computationally demanding if

(31)

4.5. A* CHAPTER 4. BOND ORDER PERCEPTION

Figure 4.4.1: The types are functions of the element involved, the coordination co and possibly the belonging to some functional group. [8]

the number of bonds NB is large since the number of possible decisions grows exponentialy with the

number of bonds (As stated before, there are 3N _{possible combinations of bond order assignments). In}

order to limit the computational eort, a local bond order estimation strategy, performing the branch-and-bound algorithm on the regions with controlled sizes, has been proposed. The method starts by randomly selecting one of the atoms. Then, it considers a local region in a neighborhood closer than distance dneighfrom this root atom. Here distances between two atoms are computed as the minimum

number of bonds required to reach one atom from the other. It also temporarily assigns a bond order of 1 to all bonds at a distance dneigh+ 1from the root atom. Such temporary assignment (illustrated

in gure 4.4.2) is necessary to estimate possible valences of the atoms at the extremities of the local region. Then, the method searches for the best bond order assignment for this local region based on the branch-and-bound algorithm, as previously described. Once the assignment is performed on the local region, assigned bond orders are only kept for bonds at a distance dsaf e from the root node,

with dsaf e< dneigh. Hence, even if the assignment is incorrect at the border of the neighboring region

(since an arbitrary bond order of 1 has been set close to these regions), the risk to aect the orders of bonds within the safe region is low. In the next step, atoms surrounded by assigned bonds are tagged as already treated and a new root atom is taken at the frontier of the assigned nodes. The process is then repeated on the new root atom exploring another region of the structure. By propagation, such a mechanism results in assigning bonds for the entire structure. However, an important assumption of the following method to keep in mind is that the bond order problem can be decoupled spatially. dsaf e= 5 and dneigh= 8 have been proven to oer a good balance between eciency and quality of the

bond order assignment.

4.5 A*

There are other approaches based on the original heuristic formulation proposed by Wang [5]. As an example, it is possible to explore the decision tree by using an A* algorithm [40]. The algorithm

(32)

4.5. A* CHAPTER 4. BOND ORDER PERCEPTION

Figure 4.4.2: Local bond order assignment scheme shown for the trans-stilbene molecule. This method spatially decouples the bond order assignment problem to decrease the inherent complexity of the problem. [8]

associates with each node w a score computed by the following function: f (w) = g∗(w) + h∗(w)

where g∗_(w) _{describes the score corresponding to the decisions already made and h}∗_(w) _{is the}

so-called search heuristic that eastimates the cost of following the path from the current node to the goal. A possible denition for the function g∗, as reported by the original paper [40], is:

g∗= X

a∈K

P (a, νw(a))

Here, K is the set of atoms for which all bonds are already assigned with a bond order. νw(a)is

the valence of the atom a at the w node. P (a, νw(a)) is the penalty associated with the current

atom (a) in the current valence state (νw(a)).

For the heuristic function h∗_(w)_{the original paper propose instead three possible implementations:}

h∗₌ P

a∈A\K

min

i∈V (a){P (a, i)}

h∗₌ P a∈\K min lo(a)≤i≤size(P (a)) {P (a, i)} h∗₌ P a∈A\K min lo(a)≤i≤up(a) {P (a, i)}

Here, V (a) ⊂ N contains the possible valences of atom a ∈ A according to the penalty table P. Instead, lo(a)and up(a) are a lower and an upper bound to the possible valence state of the atom a.

The lower bound for the valence is obtained by assuming a single bond for each unassigned bond of atom a. A tighter lower bound for the valence is then:

lo(a) := νw(a) +

X

b∈B(a)\W (a)

1 = νw(a) + |B(a)\W (a)|

Here, W(a) is the set of assigned bonds connected to the atom a in the node w. This even tighter version of the heuristic takes into account the already assigned bond orders of the neighboring atoms

Implementation, validation and application of a new software for automatic Molecular Perception

Università di Pisa

Dipartimento di Chimica e Chimica Industriale

Corso di Laurea in Chimica

Implementation, Validation and

Application of a new Software

for automatic Molecular

Perception

Relatore:

Prof. Vincenzo Barone

Controrelatore:

Prof. Giovanni Granucci

Candidato:

Federico Lazzari

Acknowledgements

Contents

Chapter 1

Introduction

Chapter 2

Ring Perception

2.1 Graph Theory Preliminaries

2.1.1 DFS

2.1.2 BFS

2.2 Kirchho - Fundamental

2.3 Smallest Set of Smallest Rings (SSSR)

2.3.1 Horton's Algorithm

2.3.2 Proxima Implementation

2.4 Alternatives to the SSSR

2.4.1 Essential Set of Essential Rings (ESER)

2.4.2 Extended Set of Smallest Rings (ESSR)

2.4.3 Set of Smallest Cycles at Edges (SSCE)

2.4.4 Beta Rings

Chapter 3

Partial Charge Perception

3.1 Gasteiger

3.2 Qeq

Chapter 4

Bond Order Perception

4.1 Pauling method

4.2 The I-interpret

4.3 Maximum Weighted Matching Problem

4.4 Branch-and-Bound

4.5 A*

2.2 Kirchho - Fundamental