
2. Theoretical Basis of Quantitative Structure-Activity/Property Relationship. Recursive Neural Network and Multi Linear Regression Methods.


2.1. General principles of Quantitative Structure-Activity(Property) Relationship

The basic idea of Quantitative Structure-Activity(Property) Relationship (QSAR/QSPR) is to find an appropriate function F that predicts a molecular property or biological activity, using only information related to the molecular structure:

Property = F(Structure)   (1)

The input domain of F is a set of molecular features, where the term “feature” refers to global information characterizing the molecule (molecular shape, chemical functionalities, etc.). In the standard approaches, this information is represented by numbers called molecular descriptors, which are parameters that can be quantitatively defined from the molecular structure only. The output domain is typically the set of real numbers, which are used to quantify the property of interest, or target property. Hence, the function F can be seen as a functional transduction defined on a structured domain. The function F can be described as the sequential solution to two main problems:1,2 (i) the representation problem, i.e., how to encode molecules through the extraction and selection of structural features and (ii) the mapping problem, usually faced by linear or non-linear regression tools.

According to this view, F can be decomposed as follows

F(·) = g(f(·))   (2)

where f is the encoding or feature representation function, from the domain of the chemical structures to the descriptors' space, and g is the mapping function, from the descriptors' space to the physico-chemical property and/or bioactivity space. The choice of f and g is the discriminating aspect among the different methods, with a major role played by the issues related to the f function. In more detail, the encoding process requires the solution of two subtasks, the first of which aims at explicitly representing the significant structural information carried by molecules and the second at codifying this structural information into a numerical representation. Thus, the function f(·) can be understood as the following composition

f(·) = fE(fR(·))   (3)

where fR(·) extracts specific structural aspects from the molecule, and fE(·) computes a numerical value from the structure returned by fR(·). Various approaches and numerous molecular descriptors are used for the representation and encoding of chemical compounds. The most common types of descriptors are: constitutional (derived from the type of atoms, bonds or functional groups of the compound), topological (describing the bonding among individual atoms of a compound), geometric (related to the overall molecular geometry), electrostatic (related to the charge distribution) and quantum chemical (obtained by quantum mechanics from the structure). Other more sophisticated categories include 3D-MoRSE (derived from infrared spectra simulation), WHIM (based on statistical indices calculated on the projections of atoms along principal axes), and GETAWAY descriptors (summarizing information on geometry, topology and atomic weights). Detailed and exhaustive lists of descriptors are contained in the manuals of widely employed software packages, such as DRAGON,3 CODESSA4 and PETRA.5
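As a purely illustrative sketch of this decomposition (the fragment-counting "descriptors", the formula-string input and the coefficients below are invented for the example and are not part of any actual QSPR model), a prediction can be written as the composition of an encoding step f and a mapping step g:

```python
# Toy sketch of the decomposition F = g(f(molecule)).
# Descriptor definitions and coefficients are invented for illustration only.

def f_encode(molecule):
    """Constitutional-style toy descriptors: counts of selected symbols."""
    return [molecule.count("C"), molecule.count("O"), molecule.count("N")]

def g_map(descriptors, coefficients, intercept):
    """Linear mapping from descriptor space to the target property."""
    return intercept + sum(c * d for c, d in zip(coefficients, descriptors))

def predict_property(molecule, coefficients, intercept):
    """F = g o f : structure -> descriptors -> property."""
    return g_map(f_encode(molecule), coefficients, intercept)

# A crude formula string standing in for a molecular structure.
print(predict_property("CCCC(=O)C", coefficients=[0.5, -1.2, 0.3], intercept=2.0))
```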

Different statistical methods are employed to select descriptors and build the QSARs/QSPRs, ranging from Multi Linear Regression (MLR), Principal Component Analysis (PCA),6 Partial Least Squares (PLS)7,8 and Genetic Algorithms (GA)9 to non-linear methods, such as Neural Networks (NNs).10,11 Although most applications of MLR, PCA, PLS and GA methods have been based on linear (initial) molecular descriptors, it is also possible to apply these methods to non-linearly transformed initial descriptors and obtain Multivariate Non-linear Regression models, as well as non-linear PCA, PLS or GA models. NN models, instead, are intrinsically non-linear, i.e. non-linearities are included in each NN. NN models usually use a relatively small number of input descriptors (e.g. 4-50, but in most cases below 10), depending on the number of molecules in the data set used for developing the model. However, because the number of initial molecular descriptors in a standard QSAR/QSPR study usually exceeds 100, almost all applications of NNs depend on a descriptor selection (reduction) method. Considerable interest is also growing in hybrid approaches, like GA-NN,12 in which a genetic algorithm is used to select a small subset of the most relevant descriptors that are then used as inputs for the NN. Figure 2.1 illustrates the general scheme of the QSAR/QSPR (only QSPR in the following) approach.

Figure 2.1. Outline of the traditional QSPR/QSAR approach. Structural features of the molecule (e.g. 2-butanone) are represented by a proper set of numerical descriptors obtained by using different encoding approaches (f function). The target property is then obtained by applying a regression procedure (g function).

A common limitation of all traditional QSPR methodologies mentioned above is that both fR(·) and fE(·) are defined a priori, i.e., they do not depend on the regression procedure. In particular, the number and type of molecular descriptors used to represent chemical compounds depend on the specific QSPR problem to be solved. In some cases they even need to be estimated and selected by an expert in the field through a lengthy trial and error procedure, in order to adapt them to the required task on the basis of background knowledge of the problem. The selection of descriptors carries the risk of a feature selection bias,13-16 which occurs when the data are split into training and test sets only after completing the feature selection process on the whole data set. This can lead to an optimistic estimation of the prediction accuracy. When the number of initial descriptors is high (> 1000) there is also a risk of obtaining chance correlations. Moreover, appropriate descriptors are not yet available for some types of chemical compounds, e.g. various kinds of macromolecules.
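The effect of such a bias can be illustrated with standard tools. The following sketch (synthetic random data and generic scikit-learn estimators, not part of the original work) contrasts selecting descriptors on the whole data set with repeating the selection inside each cross-validation training fold:

```python
# Sketch of feature selection bias on synthetic, purely random data.
# With no real structure in the data, an honest R^2 should be near zero
# (or negative); selecting descriptors before the split inflates it.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))   # 50 "molecules", 1000 random descriptors
y = rng.normal(size=50)           # random target property

# Wrong order: descriptors selected on the whole data set, then cross-validated.
X_sel = SelectKBest(f_regression, k=10).fit_transform(X, y)
biased = cross_val_score(LinearRegression(), X_sel, y, cv=5, scoring="r2")

# Correct order: selection refitted inside each training fold.
pipe = Pipeline([("select", SelectKBest(f_regression, k=10)),
                 ("regress", LinearRegression())])
honest = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print("biased R2:", biased.mean(), " honest R2:", honest.mean())
```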

An alternative to the use of fixed-size vectors of descriptors consists in deriving the QSPR directly from the molecular graph, which is a much more general and flexible tool to represent chemical compounds. A methodology1,2,17 developed at the Department of Computer Science of the University of Pisa exploits Recursive Neural Networks18 (RNNs) to infer the fE(·) function by learning from a structured representation of molecules generated by fR(·). The RNN is able to adaptively implement both fE(·), through the automatic generation of numerical descriptors for the specific regression task, and g. As a result, no a priori selection and/or extraction of features or properties by an expert is required.


In the following sections the structured representation and the RNN method, which was used throughout the whole Ph.D. thesis, will be described in detail. The description is based on the one provided in ref. 22. The last section will be devoted to an MLR technique employed in the last chapter to build property-toxicity relationships. It is based on the use of a descriptor-selection and model development procedure, named CROMRsel, developed at the Ruđer Bošković Institute in Zagreb.

2.2. The Recursive Neural Network Methodology

2.2.1. Basic definitions

Neural Networks (NNs) are powerful data modelling tools able to approximate complex input/output relationships. The input pattern for a standard neural network is a fixed-size vector. Recursive Neural Networks (RNNs) are an extension of standard NNs able to deal with structured domains, i.e. input patterns can be a class of graphs, trees, sequences, etc. Hence, RNNs allow for directly learning the mapping between molecular graphs and their target properties. In this paragraph, the approach based on RNNs for the processing of structured domains is presented by first providing a proper instantiation of the input and output domains of the functions fE(·) and g(·) implemented by the RNN.

Let the structured input domain for fE(·), denoted by G, be a set of labelled directed positional acyclic graphs (DPAGs). In a DPAG, a total order on the edges leaving from each vertex (or node) is defined and a position is assigned to each edge. Moreover, let us assume that G has a bounded out-degree for each node and that each DPAG possesses a super-source, i.e. a vertex s such that every vertex in the graph can be reached by a directed path starting from s. Labels, which are tuples of variables, are attached to vertexes. Let ℜ^n denote the label space.

Here, we consider a subclass of DPAGs formed by prohibiting cycles in the undirected skeleton of the graph, the set of the k-ary trees. In the case of trees the super-source is defined by its root node. k-ary trees (trees in the following) are rooted positional trees with finite out-degree k. This class of structures includes the class of rooted ordered trees and clearly sequential and vectorial data.

Figure 2.2. k-ary tree T rooted in the node root, with subtrees T^(1), ..., T^(k).

Given a node v in the tree T∈G, we give the following definitions

• The children of v are the node successors of v, each with a position j = 1, ..., k;
• k is the maximum out-degree over G, i.e. the maximum number of children for each node;
• L(v) ∈ ℜ^n is the input label associated with v, and L_i(v) is the i-th element of the label;
• The subtree T^(j) is the tree rooted at the j-th child of v.

Vertexes with zero out-degree are the leaves of the tree. The set of external vertexes is the frontier. Traversal of a tree allows one to systematically visit (and process) all nodes of the tree in some order: in particular, by processing all subtrees in a recursive manner and finally the root, we define a postorder traversal. The scheme of a k-ary tree is reported in Figure 2.2.
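A minimal sketch of these definitions follows; the labels, the tree shape and the out-degree are arbitrary illustrative choices.

```python
# Minimal sketch of a labelled k-ary tree and its postorder traversal.
# Labels and tree shape are illustrative placeholders.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: List[float]                                      # numerical label L(v) in R^n
    children: List["Node"] = field(default_factory=list)    # positions 1..k

def postorder(node):
    """Visit all subtrees recursively, then the node itself."""
    for child in node.children:
        yield from postorder(child)
    yield node

# A tiny tree: a root with two leaves (out-degree k = 2).
leaf1 = Node(label=[1.0, 0.0])
leaf2 = Node(label=[0.0, 1.0])
root = Node(label=[1.0, 1.0], children=[leaf1, leaf2])

print([n.label for n in postorder(root)])  # leaves first, root last
```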

The descriptor (or code) space is chosen as ℜ^m, while the output space, for our purpose, is defined as ℜ. Finally, the class of functions which can be realized by a RNN can be characterized as the class of functional graph transductions described in the form g(fE(·)), where fE(·): G → ℜ^m is the encoding function, and g: ℜ^m → ℜ the output function.

The functions and domains involved in the definition of the RNN are shown in Eq. (4):

Structure --fR--> G --fE--> ℜ^m --g--> ℜ   (4)

2.2.2. Encoding and mapping functions

In our approach, the function fE(·) is defined in a way that allows for progressively encoding an input structure, e.g. a tree, using at each step a neural computational model τNN. The function τNN is used to process each node of a given structure. Given a node in a tree T, τNN uses the information available at the current node:

• the numerical label attached to the node (in ℜ^n),
• the numerical code for each subgraph of the node (in ℜ^m).


As a result, if k is the maximum out-degree of T in G, τNN is defined as

τNN: ℜ^n × ℜ^m × ... × ℜ^m → ℜ^m   (5)

The information coded in the k spaces ℜ^m is the new part introduced in the RNN model. In the following, we show how these extra inputs to the model allow for memorizing structural information and making the model responsive to the topological structure of the input compound.

Let us first consider, for example, the simplest non-linear realization for τNN that uses a single recursive neural unit (m = 1). Given a node v of T, the output X in ℜ of the recursive neuron (i.e., the code of v) is computed as follows:

X = τNN( L(v), X^(1), ..., X^(k) ) = φ( Σ_{i=1..n} w_i·L_i(v) + Σ_{j=1..k} ŵ_j·X^(j) + θ ) = φ( W·L(v) + Σ_{j=1..k} ŵ_j·X^(j) + θ )   (6)

where φ is a non-linear sigmoidal function, L(v) ∈ ℜ^n is the label of v, θ is the bias term, W ∈ ℜ^n is the weight vector associated with the label space, X^(j) ∈ ℜ is the code of the j-th subgraph (subtree) of v, and ŵ_j ∈ ℜ is the weight associated with the j-th subgraph space.

When m > 1 we obtain a neural network (with m units):

x = τNN( L(v), x^(1), ..., x^(k) ) = Φ( W·L(v) + Σ_{j=1..k} Ŵ_j·x^(j) + θ )   (7)

where x is a vector in ℜ^m, Φ is a set of m sigmoidal functions, θ ∈ ℜ^m is the bias vector, W ∈ ℜ^{m×n} is the weight matrix associated with the label space, the x^(j) ∈ ℜ^m are the vectorial codes for each subgraph (subtree) of the current node, and Ŵ_j ∈ ℜ^{m×m} is the weight matrix associated with the j-th subgraph space.
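A minimal numerical sketch of the recursive unit of Eq. (7) is given below; the dimensions, weights and labels are arbitrary illustrative values, not parameters of any actual model.

```python
# Sketch of the recursive unit tau_NN of Eq. (7) for one node.
# n = label size, m = number of recursive units, k = out-degree.
import numpy as np

def tau_nn(label, child_codes, W, W_hat, theta):
    """x = Phi( W L(v) + sum_j W_hat[j] x^(j) + theta ), Phi = sigmoid."""
    s = W @ label + theta                    # contribution of the label
    for j, x_j in enumerate(child_codes):    # contribution of each subtree code
        s += W_hat[j] @ x_j
    return 1.0 / (1.0 + np.exp(-s))          # elementwise sigmoid

n, m, k = 4, 3, 2
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(m, n))                           # label weights
W_hat = [rng.normal(scale=0.1, size=(m, m)) for _ in range(k)]   # subtree weights
theta = np.zeros(m)

label = np.array([1.0, 0.0, 0.0, 0.0])       # 1-of-n style label
children = [np.zeros(m), np.zeros(m)]        # leaves: null codes
print(tau_nn(label, children, W, W_hat, theta))
```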

The composition of τNN used to encode a structured set of nodes, e.g. a tree T as shown in Figure 2.2, is defined by the following recursive definition of fE(·) (Eq. 8):

fE(T) = 0, if T is empty
fE(T) = τNN( L(root), fE(T^(1)), ..., fE(T^(k)) ), otherwise   (8)

where 0 is the null vector in ℜ^m, root is the root node (or super-source) of the tree T, L(root) is the label attached to the root, and T^(1), ..., T^(k) are the subtrees pointed by root.

Note that the same definition may be applied to DPAGs once the super-source s corresponds to the root of the tree.

Eq. (8) comprehensively defines the functionality of the recursive neural network. The recursive definition of fE(·) determines a systematic visit of the input tree. It guides the application of τNN to each node of the structure, from the frontier to the root of the input tree, allowing the neural model to incrementally compute a numerical code for the whole structure. The process corresponds to a postorder traversal of the input tree T, which entails the recursive processing of all subtrees and, finally, of the root of T.
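The recursion of Eq. (8) can be sketched directly on top of such a unit. In the fragment below (weights and the toy tree are arbitrary illustrative values), an empty subtree is encoded as the null vector and the code of the root is returned as the code of the whole structure:

```python
# Sketch of the recursive encoding function f_E of Eq. (8).
# Trees are nested tuples (label, [children]); weights are illustrative.
import numpy as np

n, m, k = 3, 4, 2
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(m, n))
W_hat = [rng.normal(scale=0.1, size=(m, m)) for _ in range(k)]
theta = np.zeros(m)

def tau_nn(label, child_codes):
    s = W @ label + theta
    for j, x_j in enumerate(child_codes):
        s += W_hat[j] @ x_j
    return np.tanh(s)

def f_e(tree):
    """Postorder (bottom-up) encoding: children first, then the node."""
    if tree is None:                                    # empty subtree -> null code
        return np.zeros(m)
    label, children = tree
    child_codes = [f_e(c) for c in children]
    child_codes += [np.zeros(m)] * (k - len(children))  # pad to out-degree k
    return tau_nn(np.asarray(label, float), child_codes)

# Toy 3-node tree standing in for a fragment graph.
tree = ([1, 0, 0], [([0, 1, 0], []), ([0, 0, 1], [])])
print(f_e(tree))   # code of the root = code of the whole structure
```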

Let us consider an example that allows us to grasp the recursive encoding of structures. In Figure 2.3 we describe the encoding process, visualizing through boxes the progressive processing of the input tree performed by the RNN. This corresponds to the "unfolding" of Eq. (8) through the current input structure (in the example, the molecular graphs of 2-pentanone and 3-chloro-2-propanol).

Figure 2.3 allows us to illustrate the dynamics of the encoding process:

1. The encoding is a bottom-up step-by-step process following the inverse of a topological order (visit of the input graph). The beginning of the process is shown by arrows in Figure 2.3.

2. For each node we apply a neural model τNN with auxiliary inputs, i.e. the codes of the substructures.

The encoding process terminates at the root of the tree and the numerical code computed for the root is used as the numerical code for the whole molecular structure.

As a result we can observe that:

• the encoding process is modelled on the morphology of each input compound;
• the encoding process is adaptive via the free parameters of τNN.

With respect to the neural realization of τNN (Eqs. 6 and 7) we can observe that the connections Ŵ_j and the internal state variables x are the machinery able to store information from the structure of the inputs and to use it together with the current input (the label of the node). The internal state of the RNN encodes a representation sensitive to the context. Moreover, due to the learning capabilities, the memory of a recursive neural network is a dynamic memory that is dependent on the task. The state units discover adaptive abstract representations of structural data containing the information relevant to the prediction.

Figure 2.3. Unfolding the encoding process through the structures of 2-pentanone (left) and 3-chloro-2-propanol (right). Each box includes the subtree progressively encoded by the RNN.

In order to produce the final prediction value, the RNN model is completed by the output function g which can be realized by choosing any known mathematical model. In the class of neurocomputing models the function g may be obtained using a multilayer network to perform regression or classification tasks.

Here we use a single linear output neuron to realize a regression model:

y = g(x) = Aᵀ·x + θ   (9)

where A ∈ ℜ^m and θ is the output threshold.
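A corresponding sketch of the output unit (with arbitrary illustrative weights) applies Eq. (9) to a code vector such as the root code produced by the encoder sketched above:

```python
# Sketch of the linear output unit g of Eq. (9): y = A^T x + theta.
import numpy as np

def g_out(code, A, theta_out):
    return float(A @ code) + theta_out

m = 4
A = np.array([0.3, -0.7, 0.1, 0.5])        # illustrative output weights in R^m
x_root = np.array([0.2, -0.1, 0.4, 0.0])   # e.g. the root code returned by f_E
print(g_out(x_root, A, theta_out=1.5))     # predicted target property
```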

Figure 2.4. Outline of the RNN approach to QSPR/QSAR. RNNs take directly as input the molecular graph generated by fR(·), encoding it through the adaptive function fE(·) (recursive hidden units). The function g, implemented by the output part of the RNN, uses this internal representation to produce the final prediction value (target property).


In analogy with the scheme of standard QSPR presented in Fig. 2.1, Fig. 2.4 shows the outline of the QSPR with the RNN approach. It is worth stressing that the ability of the RNN to treat variable-size representations eliminates the need to have a fixed-size numerical vector for each input graph, as happens with standard neural networks typically used in QSPR studies.

2.2.3. Training algorithm and Recursive Cascade Correlation (RCC)

The error function, as in standard feed-forward neural networks, is differentiable and the weights of the model define a continuously parameterized space of functions. Thus, the learning algorithm of the RNN can still be based on a gradient descent technique.

However, the learning algorithm must be adapted to face the contextual nature of the approximation task, in this case a map from a set of trees to output values. It must account for the set of encoding transitions developed by the model for each step of the inputs, i.e. the computation of the gradient must take into consideration all input nodes seen by the RNN in the encoding process, not just the current input node.

Standard supervised algorithms for feed-forward neural networks have been extended to deal with structures, i.e. to RNNs.18 In the learning algorithm of recursive neural networks, the encoding of structures and the iterative weight updates are interleaved. For RNNs of the type described so far we can outline a general learning algorithm as in Scheme 3.1.

There are different ways to realize the recursive neural network.18 In the present work we choose to use a constructive approach that allows the training algorithm to progressively add the hidden recursive neurons during the training phase. The method is a (recursive) extension of Cascade Correlation based algorithms19,20 named Recursive Cascade Correlation (RCC). The built NN has a hidden layer composed of recursive Hidden Units (HUs). The recursive HUs compute the values of fE (in ℜ^m) for each input tree. The number of HUs, i.e. the dimension m of the descriptor space, is automatically computed by the training algorithm, thus allowing an adaptive computation of the number and type of (numerical) descriptors needed for a specific QSPR task.

The learning algorithm of the RCC is still based on gradient computation. However, the sequence of steps of the general learning algorithm presented above is changed to consider one HU of the growing network at a time: learning is performed as in standard Cascade Correlation by interleaving the minimization of the total error function (LMS) by a simple backpropagation training of the output layer (g function), and the maximization of the (non-normalized) correlation, i.e. the covariance, of the newly inserted hidden unit with the residual output error (where the hidden units realize fE).

Scheme 3.1. Outline of the learning algorithm for a training set constituted by couples (T, d), where T is a tree and d the corresponding target property value.

Repeat
For each example (T, d) of the training set:
• Encode the input tree T: compute fE through T by applying τNN to each node of T following a postorder traversal of the tree
• Compute the output value for T by g, i.e. g(fE(T))
• Compute the error by evaluating the difference between the target value d and g(fE(T))
• Compute the gradient of the error for each weight of the model, accounting for the influence of the weight at each step of the encoding process: the error is back-propagated by unfolding the encoding process through the structure of T
• Update the weights W of g and τNN according to the value of the gradient computed over the structures of the training set
Until convergence.
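The general loop of Scheme 3.1 can be sketched with automatic differentiation, which unfolds the gradient computation through each input structure. The fragment below is a plain gradient-descent illustration with made-up data, written with PyTorch; it is not the constructive RCC procedure itself, which instead adds hidden units one at a time.

```python
# Sketch of the general learning loop of Scheme 3.1 via autograd
# (backpropagation through structure). Sizes, weights and data are
# illustrative; the actual RCC algorithm is constructive, not shown here.
import torch

n, m, k = 3, 4, 2

def par(*shape):
    return (0.1 * torch.randn(*shape)).requires_grad_()   # small initial weights

W, theta = par(m, n), torch.zeros(m, requires_grad=True)
W_hat = [par(m, m) for _ in range(k)]
A, theta_out = par(m), torch.tensor(0.0, requires_grad=True)
params = [W, theta, *W_hat, A, theta_out]

def f_e(tree):
    # Eq. (8): empty subtree -> null code; otherwise tau_NN on label and child codes
    if tree is None:
        return torch.zeros(m)
    label, children = tree
    x = W @ torch.tensor(label, dtype=torch.float32) + theta
    children = list(children) + [None] * (k - len(children))
    for j, child in enumerate(children):
        x = x + W_hat[j] @ f_e(child)
    return torch.tanh(x)

def predict(tree):                       # g(f_E(T)) with a linear output unit
    return A @ f_e(tree) + theta_out

# Toy training couples (T, d); trees are nested tuples (label, [children]).
data = [(([1, 0, 0], [([0, 1, 0], []), ([0, 0, 1], [])]), 1.2),
        (([0, 1, 0], [([1, 0, 0], [])]), -0.4)]

opt = torch.optim.SGD(params, lr=0.05)
for epoch in range(200):
    opt.zero_grad()
    loss = sum((predict(T) - d) ** 2 for T, d in data)
    loss.backward()                      # gradient unfolds through each tree
    opt.step()
print(float(loss))
```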

If training data are noisy and/or their number is rather low, special care has to be taken to avoid overfitting. When overfitting occurs, the RNN learns the individual data points but produces a QSPR that does not generalize sufficiently. Several expedients can be used for this purpose. First of all, a reduced RCC architecture is built, since no connections between HUs are allowed. Then the gain of the sigmoids of the HUs is set to 0.4. Specifically, an incremental strategy (i-strategy)17 on the number of training epochs can be adopted for each newly inserted HU of the RCC model. By allowing only a few epochs to the first units we avoid the growth of the weight values. The advantages of this strategy have been presented in ref. 1. The work of Bartlett21 gives theoretical support for techniques, like the i-strategy, that allow networks with small weights to be produced. As a result we can continue the learning, adding new HUs to the RCC, without overtraining the model. A complete description of the RCC algorithm and a formulation of the learning method and equations can be found in ref. 17.

Summarizing, the fE function is realized through an adaptive mapping using the RNN. The process can consider both the graph topology (connectivity) and the atom types (or the chemical functionalities). Since the encoding function (fE) is learnt by the NN together with the mapping function (g), we allow the automatic generation of numerical descriptors that are specific for the regression (QSPR) task to be solved. As a consequence, no a priori selection and/or extraction of features or properties by an expert or algorithm is needed in the new scheme for fE. These characteristics embody the main difference between RNN and traditional QSPRs.

2.2.4. Molecular representation

Here we illustrate the main points of the molecular representation used in RNN-QSPR. Its details are provided in the following chapters together with the description of all investigated data sets. As mentioned above, molecules are represented by a 2-D graph (labelled rooted ordered tree) easily obtained from the structural formula. Each compound is partitioned into a number of atomic groups (or fragments); each group corresponds to a node of the tree and each bond between them corresponds to an edge. The criteria guiding this fragmentation are basic notions of chemical reactivity and structural sampling. When dealing with structured domains, the latter consists in the coverage of the input space, i.e. the occurrence of the different fragments/components in their possible topologies in the given data set. The sampling issue suggests choosing the smallest number of atomic groups able to build the greatest number of molecules in a reasonably compact form. It must be stressed that these atomic groups do not necessarily coincide with the typical functional groups identifying the different classes of chemical compounds.

To have a unique structured representation, a set of rules is formulated to decide the tree root and the ordering of the subtrees for each node. A commonly used set of rules,22 reported in Appendix I, defines a priority scale among groups: subtrees are ordered according to this scale and the root is placed on the molecular fragment with the highest priority. Another way to ensure the uniqueness of the structured representation is to follow the graphical choices of other unique molecular representation systems, such as UniqueSMILES and InChI,23,24 see Chapter 3. The ordering among atomic groups can also be exploited to recover 3-D information lost when converting the tridimensional molecular structure into a bidimensional tree. For example, group ordering can be used to convey information on chirality (matching the well known Newman projections) or cis/trans configurations.22

Every vertex of a tree is associated with a numerical label, i.e. a vector that discriminates among different atomic groups. Labels are determined by a 1-of-n coding scheme for categorical variables, with n the number of different molecular fragments in the considered data set. In this way each label results in an n-bit vector, with one or a few specific bits turned on (+1) and all the others turned off (0). Sharing bits between different labels allows for representing the similarity between chemical groups. On the other hand, two orthonormal vectors (i.e. bits turned on in different positions within the vector) represent groups of different chemical nature. Except where otherwise specified, the similarity or the orthonormality between groups, assigned according to their chemical features, is the only basic chemical information transferred to the RNN as input data. As an example, Figure 2.5 presents the representation of N-ethyl-N-methylamine as a chemical tree and its conversion into the RNN input data file. This file contains the dimension of the tree (number of nodes), the value of the target property, and the connection table of the structure. In this table the first column reports the order number identifying the specific group indicated in column 2; columns 3 to 5 indicate for each node the presence of a "child" identified by its order number, i.e. a pointer to the substructure rooted in the child (−1 means the absence of a child); column 6 reports the number identifying the numerical label associated with the group. In the same figure, the numerical labels corresponding to the groups present in N-ethyl-N-methylamine are also reported. This representation of the input data is used by the RNN to recursively read the structure as previously shown.

Figure 2.5. Chemical tree representation of N-ethyl-N-methylamine and relevant input data file containing the dimension of the tree, the target property value, the connection table of the structure and the index of the respective vertex label. The numerical labels for the CH2, CH3, H and N groups are also reported. Note that the labels of CH2 and CH3 are not orthogonal to each other, reflecting the similarity between these chemical groups.
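A sketch of such a representation is given below. The group alphabet, the strict 1-of-n bit patterns and the choice of the nitrogen as the root are assumptions made for the illustration: the actual coding shares bits between similar groups (e.g. CH2 and CH3) and follows the priority rules of Appendix I, which are not reproduced here.

```python
# Illustrative labelling and tree for CH3-NH-CH2-CH3 (N-ethyl-N-methylamine).
# Group alphabet, bit patterns and root choice are assumptions for this sketch.
GROUPS = ["N", "CH2", "CH3", "H"]

def one_of_n(group):
    """Label vector with a single bit turned on for the given group."""
    label = [0] * len(GROUPS)
    label[GROUPS.index(group)] = 1
    return label

# Nested-tuple tree (group, [children]); the amine nitrogen is taken as root.
tree = ("N", [("CH3", []),                # N-methyl branch
              ("CH2", [("CH3", [])]),     # N-ethyl branch
              ("H", [])])                 # amine hydrogen

def to_records(node, records):
    """Flatten the tree into connection-table-like rows (id, group, child ids, label)."""
    group, children = node
    row = {"id": len(records), "group": group, "children": [], "label": one_of_n(group)}
    records.append(row)
    for child in children:
        row["children"].append(to_records(child, records))
    return row["id"]

records = []
to_records(tree, records)
for row in records:
    print(row)
```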

To represent certain chemical structures, it may be necessary to introduce fictitious fragments that do not correspond to any real atomic group. A significant example of this concept is given by the tree representation of polymers.25 They are described by using the 2-D graph of the repeating unit as if it were a small molecule. The difference is that the tree root is now positioned on an additional super-source vertex called Start, used to close the side with the highest priority, while the other end of the structure is capped by a Stop group, see Figure 2.6 (in this and the following chapters, fragment names in the text will be indicated with bold characters). The Start label is a vector orthonormal to all other labels, while the Stop label is a null vector and only has the purpose of closing the molecule. Start and Stop do not represent any actual molecular moiety, nor do they affect the chemical features of the groups they are linked to. Other important fictitious groups are used for the representation of cyclic structures, as will be shown in the next chapter.

Figure 2.6. Chemical tree representation of poly(methyl acrylate) and relevant input data file with vertex labels. The label of Start is also used to indicate average polymer characteristics, such as stereoregularity, here expressed as the molar fraction of syndiotactic dyads. Numbers on edges indicate the children order.


A vertex label may also be used to allocate certain structural information that cannot be conveyed by the graph topology alone. In the above-mentioned example of polymers, the label of Start can express some average macromolecular characteristics (molecular weight, stereoregularity, polydispersity index, composition, etc.) that have a significant influence on the target property. Such information allows the model to account for both the detailed 2-D structure of the repeating unit and the overall features of the macromolecule.

2.3. CROMRsel technique for the stepwise and optimal selection of descriptors in Multi Linear Regression models

In MLR models, the molecular representation (f function) is carried out by descriptors and the regression equation (g function) expressing the QSPR is in the form of a linear combination of descriptors:

Pest = (c0 ± ∆c0) + (c1 ± ∆c1)·d1 + (c2 ± ∆c2)·d2 + ... + (ci ± ∆ci)·di + ...   (10)

where Pest is the estimated value of the target property, di (i = 1, ..., I) the molecular descriptors, ci their coefficients of contribution and ∆ci the associated errors. The di can also include higher order terms or cross-products of initial descriptors, which correspond to physico-chemical interactions between substituent groups. ci and ∆ci are determined by the least-squares method. A crucial aspect to be taken into account is the number of descriptors used to build the MLR model. If too many of them are introduced, the fitting of the training data will be very accurate, but the prediction or cross-validated statistical parameters will be poor. In practice, it is better to obtain simple models by selecting I descriptors, usually with I ≤ 10, out of a larger set of N descriptors, where N can be more than 400.26,27 This is a very delicate and computationally demanding task, as the number of such models is N!/[(N − I)!·I!].
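A small sketch (synthetic data and arbitrary sizes) of fitting one such linear combination by least squares, together with the count of candidate I-descriptor models, gives an idea of the size of the selection problem:

```python
# Sketch: least-squares fit of Eq. (10) for one choice of descriptors,
# and the number of possible I-descriptor models out of N candidates.
# Data and dimensions are synthetic and purely illustrative.
import numpy as np
from math import comb

rng = np.random.default_rng(0)
M, N, I = 40, 400, 4                    # molecules, candidate descriptors, model size
D = rng.normal(size=(M, N))             # descriptor matrix
P = rng.normal(size=M)                  # target property values

chosen = [0, 1, 2, 3]                   # one particular I-tuple of descriptors
X = np.column_stack([np.ones(M), D[:, chosen]])   # intercept c0 plus the chosen d_i
coeffs, *_ = np.linalg.lstsq(X, P, rcond=None)    # c0, c1, ..., cI
P_est = X @ coeffs

print("coefficients:", coeffs)
print("number of candidate %d-descriptor models: %d" % (I, comb(N, I)))
```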

For this purpose a computer program named CROMRsel28 (CROatian MultiRegression selection of descriptors) was developed at the Ruđer Bošković Institute in Zagreb for a more efficient stepwise (one-by-one) model selection. A further version of CROMRsel28 was developed for optimal selection, i.e. selecting the best possible descriptors into the MLR models. This version of CROMRsel is used whenever the number of compounds and descriptors is not too large, because of the substantial computation time required for the optimal selection of descriptors in the best models (for example, it is possible to select optimal models having 1, ..., 6 descriptors from an initial set of 100 descriptors).


The best models are selected in the orthogonal basis in order to have a simpler and faster procedure. To illustrate this point, let us consider the following example, taken from ref. 28, of an MLR modelling of a target property P with two descriptors, d1 and d2. The correlation coefficient (R) and the standard error of estimate (S) between the experimental property P and the estimated property Pest calculated by MLR from descriptors d1 and d2 are given as follows:29

R(P, Pest) = [(R²_{P,d1} + R²_{P,d2} − 2·R_{P,d1}·R_{P,d2}·R_{d1,d2}) / (1 − R²_{d1,d2})]^{1/2}   (11)

S(P, Pest) = [M / (M − I − 1)]^{1/2} · SP · [1 − R(P, Pest)²]^{1/2}   (12)

where M is the number of molecules in the set for which the model is derived, I denotes the number of descriptors used in the model (in the above case I = 2), and SP is the standard deviation of P.
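Eqs. (11) and (12) can be checked numerically. In the sketch below (random illustrative data), SP is assumed to be computed with denominator M, which is the convention under which Eq. (12) reproduces the usual residual standard error; the values from the equations coincide with those obtained from an explicit two-descriptor least-squares fit:

```python
# Numerical check of Eqs. (11) and (12) on random illustrative data.
# S_P is taken as the standard deviation of P with denominator M.
import numpy as np

rng = np.random.default_rng(3)
M, I = 30, 2
d1, d2 = rng.normal(size=M), rng.normal(size=M)
P = 1.5 * d1 - 0.8 * d2 + rng.normal(scale=0.5, size=M)

r = lambda a, b: np.corrcoef(a, b)[0, 1]
R_P1, R_P2, r_12 = r(P, d1), r(P, d2), r(d1, d2)

# Eq. (11): multiple correlation from pairwise correlation coefficients
R_eq11 = np.sqrt((R_P1**2 + R_P2**2 - 2 * R_P1 * R_P2 * r_12) / (1 - r_12**2))

# Direct two-descriptor least-squares fit (with intercept)
X = np.column_stack([np.ones(M), d1, d2])
P_est = X @ np.linalg.lstsq(X, P, rcond=None)[0]
R_fit = r(P, P_est)

# Eq. (12): standard error of estimate vs. its direct computation
S_eq12 = np.sqrt(M / (M - I - 1)) * np.std(P) * np.sqrt(1 - R_fit**2)
S_direct = np.sqrt(np.sum((P - P_est) ** 2) / (M - I - 1))

print(R_eq11, R_fit)     # the two R values agree
print(S_eq12, S_direct)  # the two S values agree
```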

Because in the orthogonal basis R_{d1,d2} = 0, eq. (11) reduces to the much simpler expression:

R(P, Pest) = [R²_{P,d1} + R²_{P,d2}]^{1/2}   (13)

This simplification, already evident in the case with two descriptors, becomes more and more considerable as the number of descriptors involved in the model increases. It substantially accelerates the execution of CROMRsel and allows for taking into account many higher order and cross-product terms, which enhance the capabilities of MLR equations but greatly increase the computational load of the descriptors' selection task. The orthogonalization of descriptors is carried out using the Gram-Schmidt algorithm for vectors, as described by Randić30 and Szabo and Ostlund.31 Using their notation, vectors |1〉 and |2〉 can be orthogonalized as

|II〉 = (|2〉 − r12·|1〉) / (1 − r12²)^{1/2}   (14)

where |II〉 is descriptor |2〉 orthogonalized against descriptor |1〉, while r12 is the correlation coefficient between the nonorthogonal descriptors |1〉 and |2〉. After this transformation, descriptor |1〉 remains unchanged, as does the correlation coefficient between this descriptor and the target property P, because it is the first descriptor against which the second descriptor was orthogonalized. This procedure is repeated for all of the remaining descriptors in the order in which they are to be orthogonalized against the first descriptor.

As a result of the orthogonalization, the correlation coefficient between P and the orthogonalized descriptor |II〉 has changed. The new correlation coefficient R′2 can be computed by

R′2 = 〈P|II〉 = (R2 − r12·R1) / (1 − r12²)^{1/2}   (15)

where 〈P| is a symbolic representation of the target property P as a bra vector, R1 and R2 stand for the correlation coefficients between 〈P| and the nonorthogonal descriptors |1〉 and |2〉, while r12 is, as in eq. 14, the correlation coefficient between the nonorthogonal descriptors |1〉 and |2〉. Likewise, the correlation coefficients between P and the remaining descriptors orthogonalized against the first one must be recalculated.

The new correlation coefficients between all descriptors orthogonalized against the first descriptor can be computed by using only the initial correlation coefficients (calculated between the nonorthogonalized descriptors) in the following way. Let us denote the third descriptor in the orthogonalization order, after orthogonalization against the first descriptor |1〉, as |III1〉. When this descriptor is orthogonalized against descriptor |II〉, it will be denoted as |III〉. To carry out the orthogonalization, it is necessary to calculate the correlation coefficient r′23 between |II〉 and |III1〉. This coefficient is defined by

r′23 = 〈III1|II〉 = (r23 − r12·r13) / [(1 − r12²)^{1/2}·(1 − r13²)^{1/2}]   (16)

In the same manner, the correlation coefficients between descriptor |II〉 and all the remaining descriptors that have been orthogonalized against the first descriptor can be recalculated. Afterwards, each of the remaining descriptors is orthogonalized against descriptor |II〉 according to eq. 14. The procedure is repeated until all available descriptors are mutually orthogonalized in an a priori defined order. Finally, the correlation coefficients between the orthogonalized descriptors and the target property 〈P| are used for the evaluation of the correlation coefficient of the MLR model, using an extension of eq. 13 that takes into account all descriptors selected in the final model.
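The vector-level content of this procedure can be sketched as follows (random illustrative data). The sketch performs the Schmidt orthogonalization of Eq. (14), checks Eq. (15) for the second descriptor and verifies the extension of Eq. (13) against a direct MLR fit; CROMRsel itself avoids the explicit vector operations and works only with correlation coefficients, via recursions such as Eq. (16).

```python
# Vector-level sketch of descriptor orthogonalization (Eq. 14), with
# numerical checks of Eq. (15) and of the orthogonal-basis expression
# for the model correlation (extension of Eq. 13). Illustrative data.
import numpy as np

rng = np.random.default_rng(4)
M = 25
D = rng.normal(size=(M, 3))
D[:, 1] += 0.6 * D[:, 0]                      # make descriptors correlated
P = D @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.3, size=M)

std = lambda v: (v - v.mean()) / v.std()
r = lambda a, b: np.corrcoef(a, b)[0, 1]

# Sequential orthogonalization: strip the projections on the previously
# orthogonalized descriptors, then re-standardize (as in Eq. 14).
orth = []
for i in range(D.shape[1]):
    v = std(D[:, i])
    for u in orth:
        v = v - (v @ u / M) * u
    orth.append(std(v))

# Eq. (15): correlation of P with the second orthogonalized descriptor
R1, R2 = r(P, D[:, 0]), r(P, D[:, 1])
r12 = r(D[:, 0], D[:, 1])
print((R2 - r12 * R1) / np.sqrt(1 - r12**2), r(P, orth[1]))   # agree

# Extension of Eq. (13): model R from the orthogonal basis vs. a direct fit
R_orth = np.sqrt(sum(r(P, u) ** 2 for u in orth))
X = np.column_stack([np.ones(M), D])
P_est = X @ np.linalg.lstsq(X, P, rcond=None)[0]
print(R_orth, r(P, P_est))                                    # agree
```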


The algorithm on which the computer program is based can be schematically presented as follows:32

1. Initial data: the set of N descriptors and target property values.

2. Two descriptors are selected in an unbiased manner from the initial set of descriptors. They are orthogonalized, and after that the correlation coefficient between the property P and each individual orthogonalized descriptor is computed. From these individual values, the total correlation coefficient, which is equal to the correlation coefficient between the property P and its computed value P', is obtained. Subsequently, the same procedure is repeated for three, four, ..., n descriptors, where n ≤ N and n < m (m = the number of molecules). The value of n is fixed at the beginning of the computation.

3. For every I-tuple of descriptors (I = 2, 3, ..., n), the combination with the highest R is singled out; this combination necessarily possesses the smallest value of S among all possibilities generated by use of an I-tuple of the same class (there are exactly N!/[(N − I)!·I!] of them). In this way, the n best I-tuples are obtained.

4. Among the n best I-tuples (I = 2, 3, ..., n), the one that gives the smallest S, which is at the same time the best total solution, is selected. This solves the problem of selecting the optimum number of descriptors and detecting the optimum I-tuple of descriptors to produce the best estimation of the property P.
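For small N and I, the optimal selection of steps 2-4 can be sketched by brute force (synthetic data below); the actual CROMRsel program performs this search far more efficiently in the orthogonalized-descriptor basis.

```python
# Brute-force sketch of optimal descriptor selection: for each model size I,
# keep the I-tuple with the highest R (equivalently, the smallest S).
# Synthetic data; not the CROMRsel implementation.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
M, N, n_max = 30, 12, 4                  # molecules, descriptors, largest model size
D = rng.normal(size=(M, N))
P = D[:, 0] - 0.7 * D[:, 3] + rng.normal(scale=0.5, size=M)

def fit_R_S(cols):
    X = np.column_stack([np.ones(M), D[:, list(cols)]])
    P_est = X @ np.linalg.lstsq(X, P, rcond=None)[0]
    R = np.corrcoef(P, P_est)[0, 1]
    S = np.sqrt(np.sum((P - P_est) ** 2) / (M - len(cols) - 1))
    return R, S

best = {}
for I in range(2, n_max + 1):
    best[I] = max(((fit_R_S(c), c) for c in combinations(range(N), I)),
                  key=lambda t: t[0][0])          # highest R for this I
    (R, S), cols = best[I]
    print(f"I={I}: descriptors {cols}, R={R:.3f}, S={S:.3f}")

# Step 4: among the best I-tuples, choose the one with the smallest S.
(R, S), cols = min(best.values(), key=lambda t: t[0][1])
print("selected model:", cols)
```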

Previous applications of the CROMRsel program investigated various physical, chemical and biological properties of different molecules, including the biological activities of carboquinones33,34 and benzodiazepine derivatives,34 and compared the results with NN models. The MLR models or Multivariate Non-linear Regression models (including initial linear descriptors as well as their squares, cross-products or higher order terms) developed by this technique in many cases outperformed other models built using highly non-linear methods, such as NNs.28,35,36


2.4. References

1. Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. M. Analysis of the Internal Representations Developed by Neural Networks for Structures Applied to QSAR Studies of Benzodiazepines. J. Chem. Inf. Comput. Sci. 2001, 41, 202−218.
2. Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. M. A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures. In: Soft Computing Approaches in Chemistry; Sztandera, L. M.; Cartwright, H. M., Eds.; Springer-Verlag: Heidelberg, 2003; pp. 265−296.
3. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. E-DRAGON User Manual. University of Milan, 2005. http://michem.disat.unimib.it/chm/Help/edragon/index.html
4. Katritzky, A. R.; Karelson, M.; Petrukhin, R. The CODESSA PRO Project. University of Florida, 2001−2005. http://www.codessa-pro.com
5. Gasteiger, J. PETRA 2.6 Manual. University of Erlangen, 1998. http://www2.chemie.uni-erlangen.de/software/petra/manual/
6. Camilleri, P.; Livingstone, D. J.; Murphy, J. A.; Manallack, D. T. Chiral Chromatography and Multivariate Quantitative Structure-Property Relationships of Benzimidazole Sulphoxides. J. Comput.-Aided Mol. Des. 1993, 7, 61−69.
7. Cramer, R. D., III; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am. Chem. Soc. 1988, 110, 5959−5967.
8. Kubinyi, H. Evolutionary Variable Selection in Regression and PLS Analyses. J. Chemometrics 1996, 10, 119−133.
9. Rogers, D.; Hopfinger, A. J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Structure-Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 854−866.
10. Burns, J. A.; Whitesides, G. M. Feed-Forward Neural Networks in Chemistry: Mathematical Systems for Classification and Pattern Recognition. Chem. Rev. 1993, 93, 2583−2601.
11. Zupan, J.; Gasteiger, J. Neural Networks for Chemists: An Introduction; VCH Publishers: Weinheim, FRG, 1993.
12. So, S. S.; Karplus, M. Evolutionary Optimization in Quantitative Structure-Activity Relationship: An Application of Genetic Neural Network. J. Med. Chem. 1996, 39, 1521−1530.
13. Miller, A. Subset Selection in Regression, 2nd ed.; Chapman & Hall/CRC, 2002.
14. Zhang, P. Inference after Variable Selection in Linear Regression Models.


15. Ambroise, C.; McLachlan, G. J. Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. Proc. Natl. Acad. Sci. USA 2002, 99, 6562−6566.
16. Varnek, A.; Kireeva, N.; Tetko, I. V.; Baskin, I. I.; Solov'ev, V. P. Exhaustive QSPR Studies of a Large Diverse Set of Ionic Liquids: How Accurately Can We Predict Melting Points? J. Chem. Inf. Model. 2007, 47, 1111−1122.
17. Bianucci, A. M.; Micheli, A.; Sperduti, A.; Starita, A. Application of Cascade Correlation Networks for Structures to Chemistry. Appl. Intell. 2000, 12, 117−146.
18. Sperduti, A.; Starita, A. Supervised Neural Networks for the Classification of Structures. IEEE Transactions on Neural Networks 1997, 8, 714−735.
19. Fahlman, S. E.; Lebiere, C. The Cascade-Correlation Learning Architecture. In: Advances in Neural Information Processing Systems 2; Touretzky, D. S., Ed.; Morgan Kaufmann Publishers: San Mateo, CA, 1990; pp. 524−532.
20. Fahlman, S. E. In: Advances in Neural Information Processing Systems 3; Lippmann, R. P.; Moody, J. E.; Touretzky, D. S., Eds.; Morgan Kaufmann Publishers: San Mateo, CA, 1991; pp. 190−196.
21. Bartlett, P. L. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights Is More Important than the Size of the Network. IEEE Trans. Inform. Theory 1998, 44, 525−536.
22. Bernazzani, L.; Duce, C.; Micheli, A.; Mollica, V.; Sperduti, A.; Starita, A.; Tiné, M. R. Predicting Physical-Chemical Properties of Compounds from Molecular Structures by Recursive Neural Networks. J. Chem. Inf. Model. 2006, 46, 2030−2042.
23. Bertinetto, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Prediction of the Glass Transition Temperature of (Meth)acrylic Polymers Containing Phenyl Groups by Recursive Neural Network. Polymer 2007, 48, 7121−7129.
24. Bertinetto, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Evaluation of Hierarchical Structured Representations for QSPR Studies of Small Molecules and Polymers by Recursive Neural Networks. J. Mol. Graph. Model. 2009, 27, 797−802.
25. Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Recursive Neural Networks Prediction of Glass Transition Temperature from Monomer Structure. An Application to Acrylic and Methacrylic Polymers. J. Math. Chem. 2009, 46, 729−755.

26. Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Quantum-Chemical Descriptors in QSAR/QSPR Studies. Chem. Rev. 1996, 96, 1027−1043.


27. Katritzky, A. R.; Ignatchenko, E. S.; Barcock, R. A.; Lobanov, V. S.; Karelson, M. Prediction of Gas Chromatographic Retention Times and Response Factors Using a General Quantitative Structure-Property Relationship Treatment. Anal. Chem. 1994, 66, 1799−1807.
28. Lučić, B.; Trinajstić, N. Multivariate Regression Outperforms Several Robust Architectures of Neural Networks in QSAR Modeling. J. Chem. Inf. Comput. Sci. 1999, 39, 121−132.
29. Lučić, B.; Nikolić, S.; Trinajstić, N.; Jurić, A.; Mihalić, Z. A Structure-Property Study of the Solubility of Aliphatic Alcohols in Water. Croat. Chem. Acta 1995, 68, 417−434.
30. Randić, M. Orthogonal Molecular Descriptors. New J. Chem. 1991, 15, 517−525.
31. Szabo, A.; Ostlund, N. Modern Quantum Chemistry; McGraw-Hill: New York, 1989; pp. 15−21.
32. Lučić, B.; Nikolić, S.; Trinajstić, N.; Juretić, D. The Structure-Property Models Can Be Improved Using the Orthogonalized Descriptors. J. Chem. Inf. Comput. Sci. 1995, 35, 532−538.
33. Yoshimoto, M.; Miyazawa, H.; Nakano, H.; Shinkai, K.; Arakawa, M. Quantitative Structure-Activity Relationships in 2,5-Bis(1-aziridinyl)-p-benzoquinone Derivatives against Leukemia L-1210. J. Med. Chem. 1979, 22, 491−496.
34. Aoyama, T.; Suzuki, Y.; Ichikawa, H. Neural Networks Applied to Quantitative Structure-Activity Relationship Analysis. J. Med. Chem. 1990, 33, 2583−2590.
35. Lučić, B.; Amić, D.; Trinajstić, N. Nonlinear Multivariate Regression Outperforms Several Concisely Designed Neural Networks on Three QSPR Data Sets. J. Chem. Inf. Comput. Sci. 2000, 40, 403−413.
36. Lučić, B.; Nadramija, D.; Bašic, I.; Trinajstić, N. Toward Generating Simpler QSAR Models: Nonlinear Multivariate Regression versus Several Neural Network Ensembles and Some Related Methods. J. Chem. Inf. Comput. Sci. 2003, 43, 1094−1102.
