1. Introduction and Aim of the Work

1.1. Introduction

The prediction of the physical, chemical and biological properties of molecules and polymers is being intensively investigated because of its scientific importance and technological applications. In many research fields, experimental techniques are successfully combined with theoretical predictive methods. The latter are a helpful tool for selecting and cleaning empirical data, as well as for interpreting results. They also make it possible to integrate data by providing information on substances or conditions for which the experimental measurement is not feasible or was not performed. Their use in industrial activities saves time and money by reducing the number of empirical assessments needed to determine the characteristics of a substance.

In particular, predictive methods are of fundamental importance in the design of new compounds and materials. Molecular design generally consists of combining relatively small chemical groups or structures into larger systems. Nowadays, this field involves increasingly complex substructures that give rise to an overwhelming number of combinations. The experimental examination of all of them is usually too costly and time-consuming to be feasible; it is therefore necessary to acquire information beforehand in order to direct the molecular synthesis toward the most promising chemical compounds.

The predictive models found in the literature can roughly be classified, according to the approach used for their derivation, into two general categories: deductive (from “general” to “particular”) and inductive (from “particular” to “general”). The first approach is followed by all methods that apply the fundamental laws of classical and quantum physics to the appropriate molecular systems, using suitable approximations or integration with experimental data if necessary. Such methods mainly include ab initio quantum chemical calculations and molecular dynamics simulations. The second approach is used by all empirical or partially empirical methods that derive the law or relationship governing a molecular property from experimentally determined data or characteristics. This Ph.D. thesis focuses on methods belonging to the second category, in particular those involving Quantitative Structure-Activity Relationships (QSARs).

1.2. Historical Background of Quantitative Structure-Activity Relationships

A QSAR model seeks a function able to predict a biological activity (or, in general, any property of interest) using only information related to the molecular structure. Other names are sometimes used to indicate substantially the same thing, such as Quantitative Structure-Property Relationship (QSPR) or Quantitative Structure-Toxicity Relationship (QSTR), depending on the investigated issue. In this thesis, the QSPR acronym will be used most often. This approach dates back more than 100 years: the first pioneering works appeared at the end of the 19th century to predict the melting and boiling points of homologous series,1 the potency of anaesthetics from the oil/water partition coefficient2 and narcosis from chain length.3 In 1925, Langmuir proposed linking intermolecular interactions in the liquid state to the surface energy.4 Great progress in the discipline was achieved by the formulation of theoretical whole-molecule descriptors, starting from the Wiener index5 and the Platt number,6 proposed in 1947 to model the boiling points of hydrocarbons. This methodology was further improved in the 1960s by Hansch and Fujita,7 who developed models connecting biological activities and the hydrophobic, electronic and steric properties of compounds, and by Free and Wilson in their models of additive group contributions to biological activity.8 In the past 40 years, QSAR techniques have expanded greatly thanks to the development of molecular-structure-based descriptors,9,10 which allowed an increasingly accurate description of molecules. The major contributions to the theoretical basis of QSAR came from the groups of, among others, Abraham,11,12 Balaban,13 Hilal,14 Jurs,15 Katritzky and Karelson,16 Kier and Hall,17 Politzer,18 Randić19 and Trinajstić.20
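As an illustration of what a whole-molecule descriptor looks like in practice, the Wiener index mentioned above can be computed as the sum of the shortest-path distances (counted in bonds) between all pairs of carbon atoms in the hydrogen-suppressed molecular graph. The following is a minimal sketch; the function name and the adjacency-dictionary encoding of the molecules are choices made here for illustration and do not come from the cited works.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of atoms."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives bond-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


# Hydrogen-suppressed carbon skeletons (atom numbering is arbitrary).
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers share the same formula but receive different index values, which is the kind of structural sensitivity that makes such indices usable as regression variables.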

QSAR analysis is now a well-established and highly respected technique to correlate diverse simple and complex physico-chemical properties of a compound with its molecular structure. Its most common applications are in the fields of pharmaceutical chemistry and computer-assisted drug design.21-27 In the standard formulation, QSARs utilize what are known as molecular descriptors, i.e. any physico-chemical property or any parameter that can be defined quantitatively from the chemical structure alone. Typical descriptors are obtained by the selection of physico-chemical, geometrical and electronic properties, the calculation of topological or connectivity indices, or the occurrence of each group in the molecular structure. Once a suitable number of descriptors is computed, a statistical procedure extracts the optimal equation relating the property of interest (or target property) to a few selected descriptors for the studied set of compounds. Such an equation can then be used to predict that same property for other structures not yet measured or even not yet prepared. There are some restrictions to its use: (i) the compounds used to derive the QSAR (the training set) should be chemically similar, and (ii) realistic predictions can only be made for compounds that are chemically related to some of those from which the QSAR model was derived; i.e., predictions should be interpolations or short extrapolations.28 The number of descriptors contained in a QSAR is usually small, around four or five. As a general rule, the higher the number of compounds employed in the training set, the higher the acceptable number of descriptors in the proposed model. Moreover, these descriptors should not be highly intercorrelated. Various mathematical approaches are used in QSAR, and the resulting models may be quite different in complexity, accuracy, stability and prediction ability. Originally, only linear regression methods were used, such as Multiple Linear Regression (MLR),29 but the last 30 years saw the introduction of a wide range of techniques, of which the most commonly used are Principal Component Analysis (PCA),30 Partial Least Squares (PLS),31,32 Genetic Algorithms (GA)33 and Neural Networks (NN).34,35
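The descriptor-based workflow described above (check that descriptors are not strongly intercorrelated, fit a linear equation on a training set, predict by interpolation) can be sketched as follows. This is only an illustrative example under stated assumptions: the two descriptors and the target values are synthetic numbers invented here, the helper names are hypothetical, and the fit uses plain ordinary least squares rather than any specific procedure from the cited literature.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_mlr(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, targets))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coeffs, row):
    """Evaluate the fitted equation: intercept plus weighted descriptors."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))


# Synthetic training set: two hypothetical descriptors per compound.
descriptors = [(1, 2), (2, 5), (3, 1), (4, 4), (5, 0), (6, 3)]
# Synthetic targets generated from y = 2 + 3*d1 - d2 (no noise).
targets = [3, 3, 10, 10, 17, 17]

d1, d2 = zip(*descriptors)
print("descriptor intercorrelation:", round(pearson(d1, d2), 3))

coeffs = fit_mlr(descriptors, targets)
print("fitted coefficients:", [round(c, 6) for c in coeffs])

# Prediction for a new point inside the training range (interpolation).
print("prediction:", round(predict(coeffs, (3.5, 2)), 6))
```

With real data the targets would contain experimental noise, and the number of descriptors kept in the equation would be limited by the size of the training set, as noted in the text.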

The major problem associated with standard QSAR methods is that important parts of the solution of the predictive problem need to be defined a priori, e.g. the representation of molecules through molecular descriptors and the regression algorithm. Sometimes they even need to be decided by an expert in the field on the basis of background knowledge of the specific predictive task at hand. Machine-learning techniques, such as NN, partially overcome this issue by capturing the underlying functional relationship in the data via a process of training from examples. Consequently, no a priori correlation is imposed between the molecular descriptors and the target property. However, as in equation-based approaches, NNs represent molecules by a fixed-size numerical array, thus using a predetermined number of molecular descriptors. Moreover, appropriate descriptors are not yet available for some types of chemical compounds, e.g. various kinds of macromolecules.

In recent years, research groups from the Department of Computer Science and the Department of Chemistry at the University of Pisa developed a method that avoids the use of descriptors altogether.36,37 Its main characteristic is that it deals directly with the chemical structure of compounds by taking as input their structured hierarchical representation instead of the traditional arrays of descriptors.38,39 The representation of molecules through graphs is not only an intuitive operation commonly performed by any chemist, but also a much richer and more flexible vehicle of information than standard vectors. These graphical representations are processed by means of a Recursive Neural Network (RNN), a particular type of NN able to treat structured domains (graphs, sequences, etc.). The RNN models a direct and adaptive relationship between molecular structures and target properties. In particular, it can encode the structured representations according to the given QSAR task by learning their specific structural descriptors from the training examples. As a result, no a priori definition and/or selection of properties by an expert is required. This method is general and flexible enough to deal with different kinds of compounds or predictive problems by using the same approach. As a matter of fact, it was successfully applied to the prediction of physical, chemical and biological properties of both low and high molecular weight compounds, including the boiling points of linear and branched alkanes,36,40 the pharmacological activity of substituted benzodiazepines36,37,40 and 8-azaadenine derivatives,41 the solvation free energy of monofunctional compounds,38 the glass transition temperature of acrylic and methacrylic polymers39,42,43 and the melting point of pyridinium bromides.44 On the other hand, a weakness of the structure-based approach with respect to descriptor-based ones is that it produces relationships that are much more difficult to interpret in terms of molecular features. It also deals with a more complex input space, which, depending on the task, may need a greater number of training examples.
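The idea of encoding a hierarchical structure recursively, rather than through a fixed-size descriptor array, can be illustrated with a toy example. The sketch below is not the model used in this thesis: the tuple-based tree encoding, the group labels, the state size and the random (untrained) weights are all assumptions introduced here only to show how a node's encoded state can be computed from its own label and from the states of its children, so that trees of any size map to a vector of fixed length.

```python
import math
import random

random.seed(0)

STATE = 3          # size of the encoded state vector (hypothetical choice)
MAX_CHILDREN = 2   # maximum number of children per node in this toy example
LABELS = {"CH3": [1.0, 0.0], "CH2": [0.0, 1.0]}  # one-hot group labels


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained random weights; in a real model these are learned from examples.
W_label = rand_matrix(STATE, 2)
W_child = [rand_matrix(STATE, STATE) for _ in range(MAX_CHILDREN)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(STATE)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def encode(node):
    """Bottom-up encoding: state = tanh(W_label*label + sum_k W_child[k]*state(child_k))."""
    label, children = node
    pre = matvec(W_label, LABELS[label])
    for k, child in enumerate(children):
        contribution = matvec(W_child[k], encode(child))
        pre = [a + b for a, b in zip(pre, contribution)]
    return [math.tanh(a) for a in pre]


def predict(tree):
    """Linear read-out of the root state gives the (untrained) property estimate."""
    return sum(w * s for w, s in zip(w_out, encode(tree)))


# Each tree node is (label, [children]); chains of different length are both valid inputs.
propane = ("CH3", [("CH2", [("CH3", [])])])
butane = ("CH3", [("CH2", [("CH2", [("CH3", [])])])])

print(len(encode(propane)), len(encode(butane)))
print(predict(propane), predict(butane))
```

Both molecules are mapped to a state vector of the same length even though their trees differ in size. Training would adjust the weights so that the encoding becomes specific to the target property.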

1.3. Aim of the work

The main purpose of the present Ph.D. thesis was to explore new possibilities given by RNN-QSAR techniques in the treatment of important predictive problems in various fields. First, we discussed the characteristics and possibilities of the hierarchical structured representation used in the RNN method by comparing results obtained on various data sets with different structural characteristics and by making different choices of molecular fragmentation. In particular, we focused on the representation of cyclic moieties and the level of detail of the atomic groups used to describe the compound. The studied data sets concerned the melting point (Tm) of ionic liquids and the glass transition temperature (Tg) of acrylic and methacrylic homopolymers. We highlighted the generality of our method and its flexibility in adapting to the particular investigated problem.

The subsequent part of the thesis specifically concerned the glass transition temperatures of acrylic and methacrylic copolymers. The prediction of the Tg, which is a property of paramount importance as it determines the technological applicability of many polymeric materials, is not an easy task and has been tackled by many different methods in the literature without always achieving satisfactory results. In the case of QSAR, the main issue originates from the difficulty in finding suitable descriptors to represent polymers. The structure-based RNN-QSAR instead proved able to satisfactorily model and predict the Tg of a set of (meth)acrylic homopolymers.41,43 Starting from these results, the analysis was expanded to copolymers, a much less studied category, and a suitable structured representation was devised to treat data sets of different size and level of information.

In the last part of the Ph.D. thesis, the RNN-QSAR was applied to the prediction of the toxicological properties of chemical compounds. This is a field of high scientific and practical interest because of its medical and environmental relevance. Owing to the large number of new chemicals continuously produced by human activities, a great amount of data is needed to provide comprehensive toxicological information. For this reason, considerable effort has been devoted in recent years to developing reliable predictive methods able to fill the data gaps and to indicate which compounds are the best candidates for empirical testing. RNN-QSAR appears particularly appropriate for tasks regarding the toxicity of unknown substances, because usually there is no background knowledge available for such problems and our technique does not need it. To assess its effectiveness on the issue, we investigated a data set of Growth Impairment Concentration (IGC50) for substituted phenols towards the freshwater ciliate Tetrahymena pyriformis and a data set of median Lethal Concentration (LC50) for substituted benzenes towards the fathead minnow Pimephales promelas, a species of temperate freshwater fish. The investigation of the LC50 data set was also carried out by a descriptor-based technique that builds linear models through MLR and selects descriptors by a methodology developed at the Ruđer Bošković Institute in Zagreb.20 The results were compared with those from RNN and the different, often complementary characteristics of the two methods were discussed in the framework of a future combined application.

1.4. Outline of the thesis

In this thesis, the previously mentioned topics will be presented in the following order.

Chapter 2 will provide the general principles underlying QSAR approaches. The methods used in our studies (RNN and MLR) will be described in detail, with particular attention to the RNN technique.

In Chapter 3, the characteristics and possibilities of the structured representation used in the RNN method will be discussed by comparing experiments on data sets with different molecular structural characteristics. The studied properties are the Tm of pyridinium bromides and the Tg of poly(meth)acrylates.

Chapter 4 will focus on the prediction of the glass transition temperature of polymers. A wide background of the previous methods will be presented and the RNN-QSAR will be applied to data sets containing the Tg values of (meth)acrylic copolymers, either alone or intermixed with homopolymers. The results will be discussed in detail.

Chapter 5 will concern the application of QSAR analysis to the toxicological field, also named QSTR. Two problems will be tackled with the RNN-QSTR method: the Growth Impairment Concentration (IGC50) of a set of substituted phenols towards the ciliate Tetrahymena pyriformis and the Lethal Concentration (LC50) of a set of substituted benzenes towards the fathead minnow (Pimephales promelas). The second problem will also be investigated by means of the MLR method and its results will be compared with those obtained by RNN.

1.5. References

1. Mills, E. J. On melting point and boiling point as related to composition. Philos. Mag. 1884, 17, 173–187.

2. Meyer, H. Zur Theorie der Alkoholnarkose: I. Welche Eigenschaft der anaesthetika bedingt ihre narkotische Wirkung? Arch. Exp. Pathol. Pharmakol. 1899, 42, 109–118.

3. Overton, E. Studien über die Narkose zugleich ein Beitrag zur allgemeinen Pharmacologie. Verlag Gustav Fischer: Jena, Germany, 1901; p. 141.

4. Langmuir, I. The Distribution and Orientation of Molecules. Colloid Symp. Monogr. 1925, 3, 48–75.

5. Wiener, H. Structural determination of Paraffin Boiling Points. J. Am. Chem. Soc. 1947, 69, 17–20.

6. Platt, J. R. Influence of Neighbor Bonds on Additive Bond Properties in Paraffins. J. Chem. Phys. 1947, 15, 419–420.

7. Hansch, C.; Fujita, T. ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure. J. Am. Chem. Soc. 1964, 86, 1616–1626.

8. Free, S. M.; Wilson, J. W. A Mathematical Contribution to Structure-Activity Studies. J. Med. Chem. 1964, 7, 395.

9. Karelson, M. Molecular Descriptors in QSAR/QSPR. John Wiley & Sons: New York, 2000.

10. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors. Wiley-VCH: Weinheim, Germany, 2000.

11. Abraham, M. H. New Solute Descriptors for Linear Free Energy Relationships and Quantitative Structure–Activity Relationships. In: Quantitative Treatments of Solute/Solvent Interactions; Politzer, P.; Murray, J. S., Eds.; Elsevier: Amsterdam, 1994; pp. 83–133.

12. Abraham, M. H.; Chadha, H. S.; Dixon, J. P.; Rafols, C.; Treiner, C. Hydrogen bonding. Part 40. Factors that influence the distribution of solutes between water and sodium dodecylsulfate micelles. J. Chem. Soc., Perkin Trans. 2 1995, 887–895.

13. Balaban, A. T. From Chemical Topology to 3D Geometry. J. Chem. Inf. Comput. Sci. 1997, 37, 645–650.

14. Hilal, S. H.; Carreira, L. A.; Karickhoff, S. W. Estimation of Chemical Reactivity Parameter and Physical Properties of Organic Molecules Using SPARC. In: Quantitative Treatments of Solute/Solvent Interactions; Politzer, P.; Murray, J. S., Eds.; Elsevier: Amsterdam, 1994; pp. 291–353.

15. Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer-Assisted Studies of Chemical Structure and Biological Function. John Wiley & Sons: New York, 1979.

16. Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Quantum-Chemical Descriptors in QSAR/QSPR Studies. Chem. Rev. 1996, 96, 1027–1044.

17. Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis. John Wiley & Sons: New York, 1986.

18. Murray, J. S.; Politzer, P. A General Interaction Properties Function (GIPF): An Approach to Understanding and Predicting Molecular Interactions. In: Quantitative Treatments of Solute/Solvent Interactions; Politzer, P.; Murray, J. S., Eds.; Elsevier: Amsterdam, 1994; pp. 243–289.

19. Randić, M.; Razinger, M. On the Characterization of Three-Dimensional Molecular Structure. In: From Chemical Topology to Three-Dimensional Geometry; Balaban, A. T., Ed.; Plenum Press: New York, 1996; pp. 159–236.

20. Lučić, B.; Trinajstić, N. Multivariate Regression Outperforms Several Robust Architectures of Neural Networks in QSAR Modeling. J. Chem. Inf. Comput. Sci. 1999, 39, 121–132.

21. Martin, Y. C. 3D QSAR: current state, scope, and limitations. Perspect. Drug Discovery 1998, 12, 3–23.

22. Norinder, U. Recent progress in CoMFA methodology and related techniques. Perspect. Drug Discovery 1998, 12, 25–39.

23. Maddalena, D. J. Applications of soft computing in drug design. Expert Opin. Ther. Pat. 1998, 8, 249–258.

24. Kubinyi, H. QSAR and 3D QSAR in Drug Design. II. Applications and Problems. Drug Discovery Today 1997, 2, 538–546.

25. Hansch, C.; Fujita, T. Status of QSAR at the End of the Twentieth Century. In: Classical and Three-Dimensional QSAR in Agrochemistry; Hansch, C.; Fujita, T., Eds.; American Chemical Society: Washington, DC, 1995; pp. 1–12.

26. Hansch, C.; Leo, A. Exploring QSAR, Fundamentals and Applications in Chemistry and Biology. American Chemical Society: Washington, DC, 1995.

27. Katritzky, A. R.; Fara, D. C.; Petrukhin, R. O.; Tatham, D. B.; Maran, U.; Lomaka, A.; Karelson, M. The Present Utility and Future Potential for Medicinal Chemistry of QSAR/QSPR with Whole Molecule Descriptors. Curr. Top. Med. Chem. 2002, 2, 1333–1356.

28. Katritzky, A. R.; Fara, D. C. How Chemical Structure Determines Physical, Chemical, and Technological Properties: An Overview Illustrating the Potential of Quantitative Structure–Property Relationships for Fuels Science. Energy & Fuels 2005, 19, 922–935.

29. So, S. S.; Karplus, M. Evolutionary Optimization in Quantitative Structure-Activity Relationship: An Application of Genetic Neural Network. J. Med. Chem. 1996, 39, 1521–1530.

30. Camilleri, P.; Livingstone, D. J.; Murphy, J. A.; Manallack, D. T. Chiral Chromatography and Multivariate Quantitative Structure-Property Relationships of Benzimidazole Sulphoxides. J. Comput.-Aided Mol. Des. 1993, 7, 61–69.

31. Cramer, R. D. I.; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am. Chem. Soc. 1988, 110, 5959–5967.

32. Kubinyi, H. Evolutionary Variable Selection in Regression and PLS Analyses. J. Chemometrics 1996, 10, 119–133.

33. Rogers, D.; Hopfinger, A. J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Structure-Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 854–866.

34. Burns, J. A.; Whitesides, G. M. Feed-Forward Neural Networks in Chemistry: Mathematical Systems for Classification and Pattern Recognition. Chem. Rev. 1993, 93, 2583–2601.

35. Zupan, J.; Gasteiger, J. Neural Networks for Chemists: An Introduction. VCH Publishers: Weinheim, FRG, 1993.

36. Bianucci, A. M.; Micheli, A.; Sperduti, A.; Starita, A. Application of the Cascade Correlation Networks for Structures to Chemistry. Appl. Int. J. 2000, 12, 117–146.

37. Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. M. Analysis of the Internal Representations Developed By Neural Networks for Structures Applied to QSAR Studies of Benzodiazepines. J. Chem. Inf. Comput. Sci. 2001, 41, 202–218.

38. Bernazzani, L.; Duce, C.; Micheli, A.; Mollica, V.; Sperduti, A.; Starita, A.; Tiné, M. R. Predicting physical chemical properties of compounds from molecular structures by recursive neural networks. J. Chem. Inf. Model. 2006, 46, 2030–2042.

39. Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Recursive neural networks prediction of glass transition temperature from monomer structure. An application to acrylic and methacrylic polymers. J. Math. Chem. 2009, 46, 729–755.

40. Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. M. A novel approach to QSPR/QSAR based on neural networks for structures. Stud. Fuzz. Soft Comp. 2003, 120, 265–296.

41. Micheli, A.; Sperduti, A.; Starita, A.; Bianucci, A. M. Design of new biologically active molecules by recursive neural networks. Proc. Int. Joint Conference on Neural Networks 2001, 4, 2732–2737.

42. Duce, C.; Micheli, A.; Starita, A.; Tiné, M. R.; Solaro, R. Prediction of polymer properties from their structure by recursive neural networks. Macromol. Rapid Commun. 2006, 27, 712–716.

43. Bertinetto, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Prediction of the glass transition temperature of (meth)acrylic polymers containing phenyl groups by recursive neural network. Polymer 2007, 48, 7121–7129.

44. Bini, R.; Chiappe, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tiné, M. R. Ionic liquids: prediction of their melting points by a recursive neural network model. Green Chem. 2008, 10, 306–309.