3. Evaluation of Hierarchical Structured Representations of Cyclic Moieties for Recursive Neural Network QSPR


3.1. Introduction

The main aspects of the molecular representation used in RNN-QSPR were explained in Chapter 2. Our aim here was to explore its characteristics and possibilities further by carrying out and comparing experiments that employ different hierarchical tree structures for the treated compounds [1]. This investigation was also worthwhile because the input space of molecular structures is more complex than that of descriptors. The results were useful both for assessing the effectiveness of each representation method and for probing the ability of structure-based QSPR to adapt to the various problems studied.

A structured molecular representation is constituted by chemical fragments (groups) connected by a topological structure and shared by all or some of the data set compounds. Current well-known standard notation systems (such as SMILES [2-5] and InChI [6-8]) show that a general representation of chemical compounds can be based on hierarchical (acyclic) structures. The use of hierarchical labelled structures as a class of data introduces both constraints and flexibility to the molecular representation. In particular, the choice of fragments, i.e. the level of detail by which chemical groups in the structures are represented, concurrently determines the level of chemical information, the fragment sampling in the data set, and the structure size and complexity. A successful representation seeks a good balance among these often conflicting issues. In this respect, the occurrence of cyclic structures constitutes an important test bed for our method. Cyclic moieties can be represented as hierarchical structures according to different solutions to the above-indicated balancing issue. In this chapter, the results from RNN-QSPR experiments employing some of these solutions were compared to evaluate the effectiveness of each representation method for the relevant data set. These experiments involved different types of molecules and two physical properties: the melting point (Tm) of substituted ionic liquids and the glass transition temperature (Tg) of (meth)acrylic polymers. All investigated data sets contained various types of cyclic moieties.

3.2. Data sets and molecular representation

The experiments presented in this chapter involved four different data sets, two for the melting point (Tm) of pyridinium bromides and two for the glass transition temperature (Tg) of (meth)acrylic polymers. The source of Tm data was the Beilstein database [9], whereas Tg data were either taken from the Polymer Handbook [10] or, for the most part, collected from various sources. These data had to be carefully selected because the measurement of these properties can be affected by a variety of factors that limit the accuracy and reproducibility. For ionic liquids, the main issue is their high hygroscopicity and the effect of the consequent absorption of water. The Tm of samples at different conditions of dryness can vary greatly, by 50 K or more [11]. The measurement of Tg can be an even more delicate matter. Although often cited as a single temperature value, the glass transition actually takes place over a rather broad range of temperatures [12]. The Tg is typically measured as the inflection point of the range over which the discontinuity of other observable properties takes place, although some authors report the onset of the discontinuity as the Tg value. Moreover, this value depends on the measurement technique and experimental conditions; the variance among different sources can be as high as 60-70 K.

For these reasons, special care had to be taken to discard very noisy values and avoid mixing together incomparable measurements in the same data set. For the set of ionic liquids, we employed the selection made by Katritzky et al. [11]. For the polymer set, we used the following criteria to choose among conflicting sources:

The Tg must be taken at the inflection point of the transition range, not at the onset.

Preferably, the Tg must be measured by Differential Scanning Calorimetry (DSC) [13], with the lowest possible heating rate (usually 10 K/min). In the absence of DSC data, other techniques may be considered, such as Dynamic Mechanical Analysis (DMA) [14] or dilatometry.

The molecular weight (MW) must be about 200,000 MU, or as high as possible (the Tg is influenced by this parameter at low MW values).

From the same sources we also searched for NMR data on polymer tacticity, which is a factor that can significantly influence the Tg. Whenever this information was available, we expressed it in the Start label as molar fraction of r (syndiotactic) dyads. In all other cases, we set this fraction as 0 and 1 for all homopolymers indicated as isotactic or syndiotactic, respectively. Atactic polymers were assigned a value of 0.6 if acrylic and 0.7 if methacrylic. A more extensive survey on the background and issues of the prediction of polymer Tg is provided in Chapter 4.
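The assignment rule described above can be summarised in a few lines of code. The following is only an illustrative sketch: the function name, the argument names and the string values used to identify backbone and tacticity are assumptions made here for the example, whereas the numerical values are those quoted in the text.

```python
from typing import Optional

# Default r-dyad fractions quoted in the text for atactic polymers.
ATACTIC_DEFAULTS = {"acrylic": 0.6, "methacrylic": 0.7}


def r_dyad_fraction(backbone: str, tacticity: str,
                    nmr_fraction: Optional[float] = None) -> float:
    """Molar fraction of r (syndiotactic) dyads to be stored in the Start label.

    `backbone`, `tacticity` and `nmr_fraction` are hypothetical argument names.
    """
    # An experimental NMR value, when available, takes precedence.
    if nmr_fraction is not None:
        if not 0.0 <= nmr_fraction <= 1.0:
            raise ValueError("a molar fraction must lie between 0 and 1")
        return nmr_fraction
    if tacticity == "isotactic":
        return 0.0
    if tacticity == "syndiotactic":
        return 1.0
    if tacticity == "atactic":
        return ATACTIC_DEFAULTS[backbone]
    raise ValueError(f"unknown tacticity: {tacticity!r}")
```

For instance, an atactic methacrylic polymer without NMR data would receive the value 0.7, while a measured fraction is returned unchanged.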

The sources for all employed experimental data are listed in Appendix II. The size, characteristics and conventional names of every data set used in this chapter are given in Table 3.1.

Table 3.1. Summary of the data sets used for the validation of cyclic compound representations.

| Data set | Chemical compounds | Target property | Number of compounds (a) | Cyclic moieties (b) |
|---|---|---|---|---|
| IL1 | Pyridinium bromides | Tm | 117 (117) | Phenyl (11), pyridinium (117) |
| PA1 | (Meth)acrylic polymers | Tg | 271 (110) | Phenyl (110) |
| IL2 | Pyridinium bromides | Tm | 126 (126) | Phenyl (11), pyridinium (126), pyridine (4), morpholine (1), cyclohexene (2) |
| PA2 | (Meth)acrylic polymers | Tg | 337 (176) | Phenyl (129), cycloalkyl (20), naphthyl (2), bornyl (4), adamantyl (2), other condensed cycles (2), dioxane (21), other heterocycles (8) |

(a) The number in parentheses indicates the number of compounds in which a cyclic moiety is present. (b) The number in parentheses indicates the number of compounds in which the indicated moiety is present.

Data set IL1 contained the Tm of 117 pyridinium bromides, in which the only cyclic structures were the pyridinium ion and differently substituted phenyl rings. They were represented in the prediction experiment by rooting the chemical tree in the pyridinium ring, and by describing each cycle as a single vertex with either six (Pyridinyl) or five (Phenyl) children, in order to account for all possible substitutions [15]. It must be stressed that the children were the chemical groups directly bound to the atoms constituting the ring skeleton (see for example Fig. 3.1A).

Data set PA1 dealt with the Tg of 271 acrylic and methacrylic polymers, in which the only occurring cyclic moiety was the phenyl ring, either mono- or di-substituted at the 2-4 positions. As in the previous case, we represented this group as a single vertex (Phenyl) with three children (see for example Fig. 3.2A). The ordering of each child corresponded to the ring position of the attached group (1 = ortho, 2 = meta, 3 = para) [16,17].
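As an illustration of the data structure implied by this description, the sketch below encodes a Phenyl vertex with three ordered child positions and computes the size and depth of the resulting tree. The class and function names, the use of None for an unsubstituted position and the label "Cl" are assumptions introduced for this example; depth is counted in vertices along the longest root-to-leaf path, which is assumed to be the convention adopted for the averages quoted in this chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vertex:
    """A labelled tree vertex with ordered children (None = empty position)."""
    label: str
    children: List[Optional["Vertex"]] = field(default_factory=list)


def size(v: Optional[Vertex]) -> int:
    """Number of vertices in the subtree rooted at v."""
    if v is None:
        return 0
    return 1 + sum(size(c) for c in v.children)


def depth(v: Optional[Vertex]) -> int:
    """Number of vertices on the longest path from v down to a leaf."""
    if v is None:
        return 0
    return 1 + max((depth(c) for c in v.children), default=0)


# Phenyl ring of a 2,4-dichloro substituted compound:
# child 1 = ortho, child 2 = meta (empty), child 3 = para.
phenyl = Vertex("Phenyl", [Vertex("Cl"), None, Vertex("Cl")])
```

In this encoding the whole ring costs a single vertex, so the subtree above contains three vertices and has a depth of two.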

The procedure of describing the whole cyclic structure as a single group was named group representation. This approach had the computational advantage of generating rather compact trees: for instance, phenyl compounds of PA1 were, on average, made of 14.7 vertices and spanned 8.4 vertices in depth. On the other hand, each cycle type had a different label, which needed enough sampling for the RNN to learn it. It was therefore only suitable for data sets containing largely sampled cycle types. When a greater variety of rings is present, more general representation methods are needed. They all tackle the sampling issue by decomposing each cycle into atom groups occurring in most structures in order to limit the number of group labels with poor sampling.

Data set IL2 was made of the same compounds as IL1, with the addition of 9 molecules that also contained other cycles, such as pyridine, morpholine and cyclohexene. Although these cycles were set apart by the content of double bonds and/or by the presence of atoms different from carbon (oxygen, nitrogen), they all had the same size (6 vertices). This common feature was exploited by representing all of them with a generic “six-edged-cycle template” group (Hexa) with 6 children, which were the atom groups constituting the ring skeleton (see for example Fig. 3.1B). This representation avoided the introduction of a great number of labels with poor sampling by composing the cycle with atom groups occurring in almost all structures. The tree root was positioned on the Hexa corresponding to the pyridinium ion; its first child was the nitrogen atom (N), while the first child of the other rings was the atom closest to the root. The other positions were ordered clockwise, the highest priority group having the lowest possible position number [18]. In this approach the cycle skeleton formed a template used to represent various types of cycles. The same template could be used for all cycles sharing matching skeleton geometry. Hence, the template approach was a compromise between the very specific group representation and a more general one, which is presented below.

Figure 3.1. Chemical structure (center) and the corresponding chemical trees of N-(2,5-dimethoxyphenethyl)pyridinium bromide: A) each ring structure is represented as a single specific label (group representation); B) all cycles are represented by a vertex (Hexa) with six children, which are the atom groups of the ring skeleton (template representation). Numbers indicate the children order.
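The effect of the template on the label vocabulary can be made concrete with a small sketch. The ring skeletons and atom-group labels listed below (N, Car, CH2, CH, O) are hypothetical placeholders chosen for the example, and substituents attached to the skeleton atoms are omitted for brevity; only the idea of a shared Hexa vertex with six ordered children is taken from the text.

```python
from typing import Dict, List, Tuple

Tree = Tuple[str, Tuple[str, ...]]  # (vertex label, ordered child labels)


def hexa(skeleton: List[str]) -> Tree:
    """Wrap the six skeleton atom groups of a ring under a generic Hexa vertex."""
    if len(skeleton) != 6:
        raise ValueError("the Hexa template only fits six-membered rings")
    return ("Hexa", tuple(skeleton))


# Hypothetical skeletons, listed starting from the first child.
RINGS: Dict[str, List[str]] = {
    "pyridinium": ["N", "Car", "Car", "Car", "Car", "Car"],
    "phenyl": ["Car"] * 6,
    "morpholine": ["N", "CH2", "CH2", "O", "CH2", "CH2"],
    "cyclohexene": ["CH", "CH", "CH2", "CH2", "CH2", "CH2"],
}

templates = {name: hexa(skeleton) for name, skeleton in RINGS.items()}

# Group representation: one dedicated label per ring type.
group_ring_labels = set(RINGS)
# Template representation: a single ring-specific label shared by all rings.
template_ring_labels = {tree[0] for tree in templates.values()}
```

With these four rings the group representation would need four ring-specific labels, whereas the template introduces only one; the remaining labels are ordinary atom groups that may also occur outside the rings.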

Data set PA2 included, in addition to the compounds of PA1, 67 polymers containing cyclic moieties differing not only in the constituent atoms, but also in size and shape, such as various cycloalkyls and condensed cycles (see Table 3.1), with a number of ring vertices ranging from 3 to 17. They were represented by breaking the cycle and by adding a cut1 vertex at both ends of the broken bond (see for instance Fig. 3.2B). In the case of condensed cycles, more than one bond had to be cut and other labels were used (cut2, cut3, etc.). Identical labels matched atom groups connected by the same broken bond.

Figure 3.2. Chemical structure and corresponding chemical tree of poly(2,4-dichlorobenzyl acrylate): A) each cycle is represented as a single specific label (group representation); B) cycles are opened and a cut1 vertex is placed in correspondence of the broken bond, indicated by the red dashed line (cycle break representation). Numbers indicate the children order. Car stands for

When rings were not condensed and no spiro moiety was present, the same “cut” number could be used repeatedly within a single molecule. This method, named cycle break representation, was flexible and allowed for describing any cycle regardless of its sampling, provided that its constituting fragments were present in the data set. Its generality was supported by the existence of standard molecular representation systems such as UniqueSMILES [2-5] and InChI [6-8], which can describe any type of cyclic molecular structure by breaking some edges in a cyclic graph while maintaining the topological information. UniqueSMILES, which generates a unique hierarchical structure for each compound, was also exploited to obtain a unique tree by following its choices on bond breaking and group ordering [17]. Considering that molecular graphs can become very complicated with cycle break, this was a simpler and more reliable way to obtain a unique representation than deciding ad hoc priority rules, such as the ones listed in Appendix I.

To give a measure of the increased size and depth of trees generated by cycle break, if this representation is applied to the phenyl compounds of PA1, the resulting graphs contain an average number of 28 vertices, with an average depth of 14.2 vertices. This is considerably more than what resulted with the group representation (14.7 vertices and a depth of 8.4) and requires a greater computational effort.

To better observe the effect of the representation method on the prediction results and computational efficiency, we carried out two additional experiments [17]. One was performed on the PA1 data set using cycle break instead of group. The other employed the PA2 data set and the cycle break representation, with the difference that a few fragments were unified to reduce the number of vertices and the computational load. The fragment unifications are pictured in Fig. 3.3.

It is worth noting that the separation between the cycle break and group methods is not so wide: a further fragment unification on cycle break yields the same complexity as in group. It is therefore more correct to think of a gradual range of options to describe the molecular detail, rather than a few separated representations.
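The principle of the cycle break representation can be sketched as a depth-first traversal of the molecular graph in which every ring-closing bond is replaced by a pair of identically labelled cut vertices. This is only a minimal illustration under stated assumptions: the bond to be broken and the children order are decided here by the traversal order, not by the UniqueSMILES rules followed in this work, and each broken bond receives its own cut number even where the same number could be reused. All names in the code are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Tree = Tuple[str, list]  # (label, ordered list of child trees)


def cycle_break(graph: Dict[int, List[int]], labels: Dict[int, str],
                root: int) -> Tree:
    """Turn a (possibly cyclic) molecular graph into a tree with cut vertices."""
    visited: Set[int] = set()
    cut_labels: Dict[FrozenSet[int], str] = {}

    def visit(atom: int, parent: Optional[int]) -> Tree:
        visited.add(atom)
        children = []
        for neighbour in graph[atom]:
            if neighbour == parent:
                continue
            if neighbour in visited:
                # Ring-closing bond: both of its ends get the same cut label.
                bond = frozenset((atom, neighbour))
                if bond not in cut_labels:
                    cut_labels[bond] = f"cut{len(cut_labels) + 1}"
                children.append((cut_labels[bond], []))
            else:
                children.append(visit(neighbour, atom))
        return (labels[atom], children)

    return visit(root, None)


def flatten(tree: Tree) -> List[str]:
    """List all vertex labels of the tree in pre-order."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(flatten(child))
    return out


def depth(tree: Tree) -> int:
    """Number of vertices on the longest root-to-leaf path."""
    return 1 + max((depth(child) for child in tree[1]), default=0)


# An unsubstituted six-membered ring of aromatic carbons (label "Car").
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
ring_tree = cycle_break(ring, {i: "Car" for i in range(6)}, root=0)
```

For this single ring the tree contains eight vertices (six atoms plus two cut1 vertices) and is seven vertices deep, whereas the group representation describes the same ring with one vertex. This illustrates why cycle break produces larger and deeper trees.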


The complete list of molecular fragments used in each experiment is included in Appendix III.

3.3. Experiments and Discussion

Six experiments were performed to investigate the prediction of two physical properties, i.e. the melting point (Tm) and the glass transition temperature (Tg), over four data sets by using different representation methods. The overall results are summarized in Table 3.2, indicating the number of recursive Hidden Units (HU), the Mean Absolute Residual (MAR), the standard error of estimate/prediction (S) and the squared correlation coefficient (R2) for both training and test set. The precise definition of these statistical parameters is provided in Appendix IV. The test set data that were used for validation were never used for the development and training of the model. The detailed results for every compound are provided in Appendix V.
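For orientation, the three statistics could be computed as in the following sketch. The exact definitions adopted in this work are those of Appendix IV; the formulas below are common textbook forms and are assumptions made for this example, in particular the use of the plain number of compounds n in the denominator of S and the computation of R2 as the squared Pearson correlation between experimental and predicted values.

```python
import math
from typing import Sequence, Tuple


def regression_stats(y_exp: Sequence[float],
                     y_pred: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MAR, S, R2) for experimental and predicted values in kelvin."""
    n = len(y_exp)
    residuals = [p - e for e, p in zip(y_exp, y_pred)]

    # Mean Absolute Residual.
    mar = sum(abs(r) for r in residuals) / n
    # Standard error, assumed here to be the root mean square residual.
    s = math.sqrt(sum(r * r for r in residuals) / n)

    # Squared Pearson correlation coefficient.
    mean_e = sum(y_exp) / n
    mean_p = sum(y_pred) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(y_exp, y_pred))
    var_e = sum((e - mean_e) ** 2 for e in y_exp)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov * cov / (var_e * var_p)
    return mar, s, r2
```

Note that, with this definition, R2 measures only the linear association between experimental and predicted values, so it must be read together with MAR and S.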

For each experiment, sixteen trials (i.e. trainings of the model), all with the same training/test split, were carried out for the RNN simulation and the results were averaged over the different trials. The reason for this procedure was that the connection weights of the RNN model were initialized at random because of the use of a stochastic gradient-based technique to solve a least mean square problem. Consequently, diverse outcomes could be achieved during the training of the network by starting from different initial conditions. The use of a basic ensemble method avoided these problems while offering an improved regression estimate. It is worth noting that a naive approach based on the selection of the best outcome over the various trials could lead to an unsatisfactory and unreliable estimation of the model performance. Moreover, this practice discarded potentially useful information on the model behaviour, which was stored in the discarded regression estimates.

Table 3.2. Average results obtained by different representations on the investigated data sets.

| Exp. | Data set | Cycle representation | Compounds (TR) | Compounds (TS) | HU | MAR (TR) | S (TR) | R2 (TR) | MAR (TS) | S (TS) | R2 (TS) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Tm1 | IL1 | Group | 80 | 37 | 6 | 11.0 | 13.9 | 0.92 | 25.0 | 29.6 | 0.62 |
| Tm2 | IL2 | Template | 84 | 42 | 14 | 10.4 | 12.9 | 0.93 | 25.1 | 30.4 | 0.60 |
| Tg1 | PA1 | Group | 217 | 54 | 17 | 8.3 | 11.2 | 0.97 | 15.6 | 21.1 | 0.84 |
| Tg2 | PA2 | Cycle break | 272 | 66 | 32 | 7.1 | 9.5 | 0.98 | 18.2 | 23.7 | 0.80 |
| Tg1b | PA1 | Cycle break | 217 | 54 | 21 | 8.3 | 11.0 | 0.97 | 15.8 | 20.4 | 0.85 |
| Tg2u | PA2 | Cycle break (a) | 272 | 65 | 21 | 7.9 | 10.3 | 0.97 | 17.7 | 22.8 | 0.81 |

TR = training set; TS = test set; HU = number of (recursive) hidden units; MAR = Mean Absolute Residual (kelvin); S = standard error of estimate/prediction (kelvin); R2 = squared correlation coefficient. (a) Cycle break with unified fragments (see Fig. 3.3).

Figure 3.4. Distribution of training and test target values in data sets IL2 and PA2.
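The basic ensemble procedure amounts to averaging, compound by compound, the outputs of the independently initialized trials. A minimal sketch is given below; the function names and the layout of the input (one list of outputs per trial, all in the same compound order) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def ensemble_average(trial_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Average the predictions of all trials for each compound."""
    n_trials = len(trial_outputs)
    return [sum(column) / n_trials for column in zip(*trial_outputs)]


def output_range(trial_outputs: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Minimum and maximum prediction over the trials for each compound."""
    return [(min(column), max(column)) for column in zip(*trial_outputs)]


# Two hypothetical trials predicting the same two compounds (values in kelvin).
trials = [[300.0, 350.0], [310.0, 340.0]]
averaged = ensemble_average(trials)
spread = output_range(trials)
```

The averaged values are the ones on which the statistics are computed, while the minimum-maximum range gives an indication of how strongly the outcome depends on the initial conditions.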

The training/test partition for data sets IL1 and IL2 was the same as in the work from which these data are taken [11]; the size of the test set was about one third of the total. In data sets PA1 and PA2 the test set was built through a random selection of about one fifth of all compounds, though being representative of the chemical moieties present in the whole data set. The distribution of training and test target values throughout data sets IL2 and PA2 is plotted in Fig. 3.4 (IL1 and PA1 are not plotted because they are subsets of IL2 and PA2).

Learning was stopped when the maximum error for each compound of the training set was below a preset threshold value, named Training Error Tolerance (TET). The selection of this value was an important step to be taken with care if one wants to model a factual relationship according to a suitable tolerance for the noisiness and uncertainty of the data. A general rule suggests choosing a TET equal to or near the experimental uncertainty. In our experiments it was set at 50 K for pyridinium bromides and 60 K for poly(meth)acrylates, which were values close to the literature data spread of the most uncertain compounds in the respective data sets.
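The stopping rule can be written as a simple check on the largest absolute error in the training set. The sketch below is illustrative only: the function name and the dictionary keys are hypothetical, while the two tolerance values are those stated above.

```python
from typing import Sequence

# Training Error Tolerance values (kelvin) used in the experiments.
TET_K = {"pyridinium bromides": 50.0, "poly(meth)acrylates": 60.0}


def training_converged(targets: Sequence[float], outputs: Sequence[float],
                       tet: float) -> bool:
    """True when every training compound is predicted within the tolerance."""
    max_error = max(abs(o - t) for t, o in zip(targets, outputs))
    return max_error < tet
```

Because the criterion is based on the worst-fitted compound rather than on the mean error, a single noisy target value can prolong training; this is one reason why the tolerance is chosen close to the experimental uncertainty.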

In experiment Tm1 [18], the IL1 data set was split into training and test sets of 80 and 37 molecules, respectively. The occurring cyclic structures, pyridinium ion and variously substituted phenyl rings, were described by the group representation as explained in the previous section. Experiment Tm2 [18] was carried out on the IL2 data set, divided into 84 and 42 molecules for training and testing, respectively. The template representation was used for cyclic moieties. For both data sets, the RNN model achieved almost the same performance, and the recorded MAR, S and R2 values were similar to those obtained by other literature QSPR methods that employ predefined molecular descriptors [11,19]. The experimental vs. predicted values for the training and test sets in the Tm1 and Tm2 runs are plotted in Figs. 3.5 and 3.6, respectively. The more detailed structural information introduced in Tm2 by the template representation brought about an extension of the input space that was not compensated by a proportional increase in data set size. Nonetheless, the accuracy recorded for Tm2 was not worse than that of Tm1. This explicit decomposition of cycles had the clear advantage of conveying more useful information while enlarging the class of molecules that could be represented in the data set.

Figure 3.5. Plots of the experimental vs. predicted values for Exp. Tm1. Training set: bars encompass the minimum and maximum outputs from the 16 trials, circles indicate the average outputs. Test set: small dots indicate the outputs of each trial, circles indicate the average outputs.

Figure 3.6. Plots of the experimental vs. predicted values for Exp. Tm2.

Experiment Tg1 [17] was performed on data set PA1, partitioned into 217 and 54 polymers for training and testing, respectively. The phenyl rings contained in 110 of the data set compounds were described with the group representation. Run Tg2 [1] was carried out on data set PA2, split into 272 training set and 65 test set compounds. The many different cycle types were represented by the cycle break method in accordance with the “cuts” and children order of UniqueSMILES. In either case, the recorded MAR, S and R2 values were again comparable to those obtained by most ad hoc literature methods for polymer property prediction [20-26]. The experimental vs. predicted values for the training and test sets in the Tg1 and Tg2 runs are plotted in Figs. 3.7 and 3.8, respectively. Tg2 resulted in an increase of both MAR and S values on the test set, as compared with Tg1, of almost 3 kelvin (K). This behaviour could be attributed to either the representation change or the greater complexity of the PA2 data set.

Figure 3.7. Plots of the experimental vs. predicted values for Exp. Tg1.

Figure 3.8. Plots of the experimental vs. predicted values for Exp. Tg2.

To shed light on this point, a fifth experiment, named Tg1b, was run on data set PA1 by using the cycle break representation (again according to UniqueSMILES) [17]. Only a slight change of test MAR, S and R2 was detected as compared to Tg1. This result suggested that the observed error increase should be attributed to the greater complexity of the PA2 data set as compared to PA1, rather than to the more demanding prediction task required by a more general representation design. However, a separate comparison of the results for cyclic and acyclic compounds indicated that the type of representation did affect the RNN performance. The behaviour of the group representation (Tg1) was very similar for the two classes, their MAR and S values in the test set being set apart by only 1 K. Instead, when using the cycle break representation (Tg1b), both MAR and S values of cyclic compounds in the test set were more than 4 K greater than those of acyclic ones. This behaviour may be explained by considering that the structures generated for both cyclic and acyclic compounds had more or less the same size when using the group representation; on the other hand, cycle break gave rise to much deeper trees for cyclic structures.

To better understand the influence of the tree size and depth on the results, we carried out a sixth experiment, named Tg2u in Table 3.2. It was performed on the PA2 data set using the cycle break representation, with the difference that some atom groups were unified to lower the number of fragments. The unifications, mostly concerning cyclic structures, are illustrated in Fig. 3.3 and were chosen so that the new ‘merged’ groups would be adequately sampled in the data set. The results of this experiment showed a slight improvement of the test set prediction in terms of MAR, S and R2, as compared to Tg2. The number of HU decreased from 32 to 21, indicating a significantly lower computational load. Although there was no particular need to reduce the calculation time when dealing with such small data sets, this fact could become useful when investigating larger sets. It must however be noted that part of this HU reduction was to be attributed to the less accurate fitting of the training set. Considering cyclic and acyclic compounds separately, in Tg2 cyclic ones showed a higher MAR and S in the test set by around 4 and 5 K, respectively, as compared to acyclic compounds. In Tg2u, this difference was reduced to about 2 and 3 K, respectively. This result suggested that the hypothesis relating the difference in performance between cyclic and acyclic polymers in Tg1b to the disparity of their graph size might be correct. Indeed, the fragment unifications in Tg2u reduced the size of cyclic graphs more than that of acyclic ones, thus making the trees more homogeneous throughout the whole data set.

3.4. Conclusions

The reported experiments demonstrate the RNN flexibility in the treatment of structured molecular data of different types. In particular, the labelled tree representation can be exploited to deal with very different molecular structures by finding a balance between structural detail and molecular sampling in each investigated data set. The designed cycle representations span from a very specific one (group), dedicated to a restricted class of molecules, to renditions of increasing generality (template and cycle break). Other representations are also possible, as a nearly continuous range of options is available for detailing the molecular structure.

Comparison of the results obtained with different cycle representations on identical or similar data sets highlights a very limited error increase in going from a specially tailored technique to more general ones, despite the very different graphical form and sampling requirements of the proposed descriptions. This is another indication of the flexibility of the method and its robustness with respect to the introduction of different typologies of data. The more specific representations are useful for restricted classes of compounds, as they can provide more accurate results at less computational effort, due to the use of simpler structures. This effect is well shown by the lower number of recursive hidden units used by the RNN in experiments using trees of smaller size and depth.

This variety of helpful choices emphasizes the adaptability introduced by the RNN approach to structure-based QSPR. This flexibility allows the designer to tune the level of structural detail to the characteristics of the investigated molecular data set while using the same computational approach.


3.5. References

[1] Bertinetto, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Evaluation of hierarchical structured representations for QSPR studies of small molecules and polymers by recursive neural networks. J. Mol. Graph. Model. 2009, 27, 797–802.

[2] Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.

[3] SMILES (Simplified Molecular Input Line Entry System). http://www.daylight.com/smiles/index.html

[4] Morgan, H. L. Generation of a Unique Machine Description for Chemical Structures - a Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113.

[5] Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97–101.

[6] McKay, B. D. Practical Graph Isomorphism. Congressus Numerantium 1981, 30, 45–87.

[7] McNaught, A. The IUPAC International Chemical Identifier: InChI – A New Standard for Molecular Informatics. Chem. Int. 2006, 28, 12–14.

[8] Stein, S. E.; Heller, H. R.; Tchekhovskoi, D. V. The IUPAC Chemical Identifier – Technical Manual. http://www.iupac.org/inchi/

[9] The Beilstein Database. MDL Information System GmbH. http://www.Beilstein.com.

[10] Brandrup, J.; Immergut, E. H. Polymer Handbook. 3rd ed. Wiley: New York, 1990.

[11] Katritzky, A. R.; Lomaka, A.; Petrukhin, R.; Jain, R.; Karelson, M.; Visser, A. E.; Rogers, R. D. QSPR Correlation of the Melting Point for Pyridinium Bromides, Potential Ionic Liquids. J. Chem. Inf. Comput. Sci. 2002, 42, 71–74.

[12] Riddle, E. H. Monomeric Acrylic Esters. Reinhold Publishing Corp.: New York, 1954.

[13] Bair, H. E. Glass transition measurements by DSC. ASTM Spec. Tech. Publ., STP 1249 1994, pp. 50–74.

[14] Chartoff, R. P.; Weissman, P. T.; Sircar, A. The application of dynamic mechanical methods to Tg determination in polymers: An overview. ASTM Spec.

[15] Bertinetto, C.; Bini, R.; Chiappe, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Recent Advances in the Representation of Molecular Structures for RecNN–QSPR Analysis. In: Lecture Series on Computer and Computational Sciences, Vol. 7, Simos, T.; Maroulis, G. Eds., Brill Academic Publishers, Leiden, 2006, pp. 1352–1355.

[16] Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. Design of New Biologically Active Molecules by Recursive Neural Networks. Proc. Int. Joint Conference on Neural Networks 2001, 4, 2732–2737.

[17] Bertinetto, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Prediction of the Glass Transition Temperature of (Meth)Acrylic Polymers Containing Phenyl Groups by Recursive Neural Network. Polymer 2007, 48, 7121–7129.

[18] Bini, R.; Chiappe, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Ionic Liquids: Prediction of their Melting Points by a Recursive Neural Network Model. Green Chem. 2008, 10, 306–309.

[19] Carrera, G.; Aires-de-Sousa, J. Estimation of Melting Points of Pyridinium Bromide Ionic Liquids with Decision Trees and Neural Networks. Green Chem. 2005, 7, 20–27.

[20] Bicerano, J. Prediction of polymer properties. Marcel Dekker: New York, 2002.

[21] Katritzky, A. R.; Sild, S.; Lobanov, V. S.; Karelson, M. Quantitative Structure–Property Relationship (QSPR) Correlation of Glass Transition Temperatures of High Molecular Weight Polymers. J. Chem. Inf. Comput. Sci. 1998, 38, 300–304.

[22] Garcia-Domenech, R.; de Julián-Ortiz, J. V. Prediction of Indices of Refraction and Glass Transition Temperatures of Linear Polymers by Using Graph Theoretical Indices. J. Phys. Chem. B 2002, 106, 1501–1507.

[23] Joyce, S. I.; Osguthorpe, D. J.; Padgett, J. A.; Price, G. J. Neural Network Prediction of Glass-Transition Temperatures from Monomer Structure. J. Chem. Soc., Faraday Trans. 1995, 91, 2491–2496.

[24] Sumpter, B. G.; Noid, D. W. On the Use of Computational Neural Networks for the Prediction of Polymer Properties. J. Thermal. Anal. 1996, 46, 833–851.

[25] Ulmer, C. W., II; Smith, D. A.; Sumpter, B. G.; Noid, D. I. Computational Neural Networks and the Rational Design of Polymeric Materials: the Next Generation Polycarbonates. Comput. Theor. Polym. Sci. 1998, 8, 311–321.

[26] Mattioni, B. E.; Jurs, P. C. Prediction of Glass Transition Temperatures from Monomer and Repeat Unit Structure Using Computational Neural Networks. J. Chem. Inf. Comput. Sci. 2002, 42, 232–24.