Prediction of the Effects of Mutations on the Stability and Interactions of Proteins


Declaration of Authorship

I, Ornela MALOKU, declare that this thesis titled, “Prediction of the Effects of Mutations on the Stability and Interactions of Proteins” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.


“We do not belong to this material world that science constructs for us. We are not in it; we are outside. We are only spectators. The reason why we believe that we are in it, that we belong to the picture, is that our bodies are in the picture. Our bodies belong to it. Not only my own body, but those of my friends, also of my dog and cat and horse, and of all the other people and animals. And this is my only means of communicating with them.”


Abstract

Doctor of Philosophy

Prediction of the Effects of Mutations on the Stability and Interactions of Proteins

by Ornela MALOKU

In this study I investigated a number of protocols for predicting the effect of rare protein-coding variants related to human health, both at the genetic level and in protein structures, in particular with respect to amino acid sequence change, stability and function. For the analyses of genomic variants we applied computational tools and calculated ROC curves and their AUC values. Furthermore, we selected two variants of uncertain significance (VUS), ATOH1 with the mutation R161G and NOTCH3 with the mutation G289C, and applied homology modeling in order to study the arrangement of the mutation and the protein structures. In the second part of the work, we developed a protocol to calculate the thermodynamic stability of proteins with amino acid mutations. We applied molecular dynamics simulations in Generalised Born Surface Area implicit solvent to a benchmark of 134 mutations in the microbial ribonuclease barnase. Moreover, I contributed to the development of PDB2ENTROPY, a tool used for the calculation of entropy contributions. In particular, I implemented Kruskal’s algorithm for the Maximum Information Spanning Tree (MIST) approximation, a method to compute molecular entropies, in order to cope with a large number of dimensions. Our results suggest that the accuracy of the method is negatively affected by the treatment of electrostatic contributions in the implicit solvent model. Scaling different contributions improves the accuracy and suggests that intrasolute van der Waals interactions and non-polar solvation energy are overestimated in the implicit solvent model. Taking into account entropy improves the quality of the predictions.


Acknowledgements


List of Figures

2.1 frameworks

2.2 Prioritization of NGS variants

2.3 Procedure for deleterious nsSNPs detection

2.4 Schematic flow diagram for prediction of nsSNVs

2.5 Submission of Variants from the Centre for Mendelian Genomics

2.6 Schematic flow diagram for prediction of nsSNVs

2.7 Algorithms

2.8 MH plots

2.9 MH2

2.10 ATOH1 sequence analysis

2.11 ATOH1 expression in mouse cochlea

2.12 ATOH1 modulation of transcription factors

2.13 ATOH1 sequence analysis

2.14 ATOH1 alignment analysis

2.15 HM of chain B of 2QL2

2.16 HM of ATOH1

2.17 HM of ATOH1 detail

2.18 Superposition of the starting model and proteins at the end of simulation

2.19 Simulation at 50ns of the complex wild-type and mutant ATOH1-DNA (R161G)

2.20 notch receptor

2.21 notch3 role

2.22 NOTCH3 sequence analysis

2.23 NOTCH3 alignment analysis

2.24 HM on the chain A of 5UK5

3.11 Native hydrogen bonds in the three helices and the β-sheet as a function of time


List of Tables

2.1 Some examples of Mendelian diseases from OMIM, Online Mendelian Inheritance in Man

2.2 Overview of exome sequence strategies

2.3 Top 25 SNPs Analysed

2.4 A list of VUS of B. Peterlin’s group to be analyzed

3.1 Correlation analyses of the Parameters with the ∆Gexp

3.2 Scaling the Non Polar Solvation energy contribution

3.3 Scaling Non Polar Solvation and Electrostatic contributions

3.4 Scaling Non Polar Solvation and Lennard-Jones contributions

3.5 Scaling the Non Polar Solvation, Lennard-Jones and Electrostatic

3.6 Validation of the multilinear regression model with non-polar solvation, Lennard-Jones and electrostatic contributions scaled

3.7 Experimental and calculated folding free energy differences in kJ/mol for barnase (PDB id: 1bni) mutations (last 50ns out of 60ns MD run)

3.8 Experimental and calculated folding free energy differences in kJ/mol for PDB 1bn mutations at 60ns

3.9 Experimental and calculated fold free energy differences in kJ/mol for PDB 1bn mutations at 60ns


List of Abbreviations

WGS     Whole-genome Sequencing
WES     Whole-exome Sequencing
WGA     Whole-genome Analysis
WEA     Whole-exome Analysis
NGS     Next-Generation Sequencing
GWAS    Genome-Wide Association Studies
nsSNV   non-synonymous Single-Nucleotide Variants
SNV     Single-Nucleotide Variants
NSV     Non Synonymous Variant
MD      Molecular Dynamics
VUS     Variants of Uncertain Significance
HTS     High-Throughput Sequencing
AAS     Amino Acid Substitution
OMIM    Online Mendelian Inheritance in Man
HGMD    Human Gene Mutation Database
ACMG    American College of Medical Genetics
ESHG    European Society of Human Genetics
HM      Homology Modelling
ML      Machine Learning
AI      Artificial Intelligence
BLAST   Basic Local Alignment Search Tool
CES     Clinical Exome Sequencing
TPR     True Positive Rate
TNR     True Negative Rate
FPR     False Positive Rate
ROC     Receiver Operating Characteristic
AUC     Area Under Curve


Chapter 1

Overview

1.1 Structure of the Thesis

The main topic of my PhD project is the study of prediction models employed to predict the effect of non-synonymous Single-Nucleotide Variants and the role that they play both in Mendelian disorders, at the genetic level, and in protein structure, stability and function.

The thesis is composed of two parts. The first one involves the work performed during my six-month internship in Slovenia at the University Medical Centre Ljubljana, in the group of Dr. B. Peterlin, Professor and Head of the Clinical Institute of Medical Genetics, and Dr. A. Maver, Head of the group for NGS and bioinformatics analyses.

The medical genetics part involves the role of Next-Generation Sequencing (NGS) methods and Whole-exome Sequencing in medical research and diagnosis, and the use of prediction scoring algorithms to evaluate the effect of non-synonymous Single Nucleotide Variants (nsSNVs) in Mendelian genetic disorders. In this project about 1000 variants, in particular nsSNVs, were tested using some of the main prediction tools in dbNSFP v2.0, in order to assess which algorithms have the highest accuracy and predictive power. Two interesting nsSNVs were further analysed using Homology Modeling (HM) to provide working hypotheses on the molecular mechanisms underlying genetic disorders.


Chapter 2

NGS and Analyses of Variants in Medical Genetics

2.1 Whole-exome sequencing in Diagnostics

Over recent years, High-throughput sequencing (HTS) has transformed the speed with which genetic information is obtained (Service, 2006; Niroula, 2016).

For some health-care systems, Next-Generation Sequencing methods give the opportunity to sequence the whole exome or genome of a person at low cost. Depending on the techniques used to detect mutations throughout the genome, the focus of the analyses may be the entire genome (whole-genome analysis, WGA) or the exome (whole-exome analysis, WEA). Either choice results in a plethora of raw data, requiring powerful bioinformatic tools and analyses to extract useful information (El, 2013).

Although thousands of individual genomes have already been sequenced by Whole-genome Sequencing (WGS) (Wheeler, 2008) for specific projects, such as the Personal Genome Project (Lunshof, 2010), Whole-exome Sequencing (WES) is by far more widely used in diagnostic laboratories.

With the introduction of Next-Generation DNA Sequencing (NGS) in diagnostics, clinical genetics has changed. Whole-genome techniques are now applied in different settings (diagnosis in patients with symptoms, research, pharmacogenomics, presymptomatic testing) (Saunders, 2012).

Instead of a gene-by-gene Sanger sequencing approach, large sets of genes can be analyzed in a single test. Sanger sequencing, also known as the dideoxy method, has been the standard approach applied in molecular diagnostics and in particular for the diagnosis of genetic disorders (Vrijenhoek, 2015).

Compared with the use of microarrays as diagnostic tools, NGS methods address many more aspects of routine diagnostics and moreover are widely used in many areas of clinical genetic research, including Genome-Wide Association Studies (GWAS) for common diseases (Bras, 2012). On the other hand, the introduction and application of NGS technologies has created difficulties for laboratories because of the work that follows sequencing, such as interpreting and exchanging data, informing patients appropriately and managing the increasing interdependence of the people involved in genome diagnostics (Vrijenhoek, 2015).

A major challenge that clinicians have to face is the choice between targeted and whole-exome sequencing. Thanks to the reduction of sequencing costs, WES appears to be a good approach even though the overall coverage tends to be between 85–95% only, which does not allow the identification of genes of interest that lie outside the covered regions. On the other hand, the targeted approach has a higher coverage of the phenotype-specific genes, allowing for deeper coverage of these genes compared to WES (MacArthur, 2015).

2.1.1 Application of NGS and WES for Mendelian Disorders


TABLE 2.1: Some examples of Mendelian diseases from OMIM, Online Mendelian Inheritance in Man

Disease                                  Type of Inheritance
Phenylketonuria (PKU)                    Autosomal recessive
Cystic fibrosis                          Autosomal recessive
Sickle-cell anemia                       Autosomal recessive
Albinism, oculocutaneous, type II        Autosomal recessive
Huntington’s disease                     Autosomal dominant
Myotonic dystrophy type 1                Autosomal dominant
Neurofibromatosis, type 1                Autosomal dominant
Polycystic kidney disease 1 and 2        Autosomal dominant
Hemophilia A                             X-linked recessive
Muscular dystrophy, Duchenne type        X-linked recessive
Hypophosphatemic rickets                 X-linked dominant
Rett’s syndrome                          X-linked dominant
Spermatogenic failure, nonobstructive    Y-dominant

Dominant, Autosomal Recessive, Sex-linked Dominant, Sex-linked Recessive. These kinds of disorders tend to be inherited among the members of a family (Baird, 1988). Pedigree analyses of affected members of large families are very useful for the determination of the inheritance pattern of Mendelian diseases. Table 2.1 includes some examples of single-gene diseases.

NGS is applied to detect rare variants and non-synonymous Single Nucleotide Variants (nsSNVs) in patients with a phenotype suspected to be due to a Mendelian disease. NGS is used in several ways to identify causal gene variants in rare diseases: in addition to WGS and WES, there are also transcriptome sequencing, methylome sequencing and other approaches. Thanks to WES, more than 100 genes have been implicated in rare Mendelian disorders; an important example is Miller syndrome, the first rare Mendelian disorder identified by WES (Rabbani, 2012).


2.1.2 Some Basics of WES

WES is based on the probe-hybridization approach to capture entire exons. This approach, described by Goh (2012), involves three steps:

• Preparation of the sample, in which the DNA is sheared by nebulization or sonication to obtain fragments of about 250 bp, and repairing of fragment ends by T4 DNA ligase;

• Amplification of the prepared library and hybridization with a biotinylated oligo library;

• Sequencing of the exome library in paired-end reads to obtain 75–100 bases per read, amplification of the individual fragments and processing of data including mapping, variant calling and annotation steps.

2.1.3 Clinical Exome Sequencing and Strategies for Identification of causal Genes

Thanks to the application of exome sequencing in research settings, Clinical exome sequencing (CES) is becoming a molecular test for patients with suspected genetic diseases (Lee, 2014). The number of variants identified with exome sequencing varies greatly. Between 20,000 and 50,000 variants are identified per sequenced exome, and because of this large number, different strategies have to be applied in order to reduce the number of false positives. First, variants are filtered based on quality criteria; then synonymous coding variants and variants outside coding regions are filtered out. The most important filtering step follows from excluding known variants present in dbSNP or in in-house databases. In this step, the number of potential candidates is reduced by 90%-95% (Figure 2.2) (Gilissen, 2012).
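The filtering cascade described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the pipeline used in the clinical workflow: the `Variant` record, its field names, the quality cut-off of 30 and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical minimal variant record; field names are illustrative only.
@dataclass(frozen=True)
class Variant:
    variant_id: str    # e.g. "chromosome:position:ref>alt"
    quality: float     # quality score assigned by the variant caller
    consequence: str   # "missense", "nonsense", "synonymous", "intronic", ...
    coding: bool       # True if the variant lies in a coding region


def prioritize(variants, known_ids, min_quality=30.0):
    """Apply the three filtering steps in order and return the candidates."""
    # Step 1: drop low-quality calls (the cut-off is an assumed value).
    passed = [v for v in variants if v.quality >= min_quality]
    # Step 2: drop synonymous variants and variants outside coding regions.
    passed = [v for v in passed if v.coding and v.consequence != "synonymous"]
    # Step 3: drop variants already present in dbSNP or in-house databases.
    return [v for v in passed if v.variant_id not in known_ids]


calls = [
    Variant("1:100:A>G", 55.0, "missense", True),
    Variant("1:200:C>T", 12.0, "missense", True),    # fails the quality filter
    Variant("2:300:G>A", 60.0, "synonymous", True),  # synonymous
    Variant("3:400:T>C", 70.0, "intronic", False),   # outside coding regions
    Variant("4:500:G>T", 80.0, "nonsense", True),    # already known
]
known = {"4:500:G>T"}
candidates = prioritize(calls, known)
print([v.variant_id for v in candidates])
```

In this toy input only one of the five calls survives all three filters, which mirrors the strong reduction in candidates reported for the last step.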

Prioritization steps alone are not enough to identify pathogenic variants: CES should follow additional strategies to find the causative mutation among the list of private variants, such as those shown in Table 2.2 (Gilissen, 2012).

It is worth highlighting that additional methods and further information are important for the prioritization step, in particular the gene function in relation to the phenotype and its pathophysiology. Most importantly, the sensitivity and specificity of these methods need to be very high, given the number of predictions performed for a single exome. A clear example is Miller syndrome, in which one of the two causal mutations was initially missed because it was predicted to be benign (Ng, 2010).

2.1.4 Pros and Cons of NGS in Clinical Genetics


TABLE 2.2: Overview of exome sequence strategies

Strategy       Applicability                                                              Assumptions
Linkage        Multiple affected within a single family                                   Fully penetrant mutation segregating with the disorder
Homozygosity   Single affected from consanguineous parents                                Homozygous mutation within a homozygous stretch
Double-hit     Single affected with a recessive disorder                                  A single rare homozygous or two rare compound heterozygous mutations
Overlap        Multiple affected with a dominant disorder                                 The disorder is completely monogenic and all patients suffer from the same disorder
De novo        Single sporadic affected
Candidate      Single affected with dominant disorder without additional family member


Genomes Project) as well as in-house control datasets are discarded (Miao, 2012). Finding the variants of interest for a Mendelian disease among thousands of variants is like looking for a needle in a haystack; furthermore, most of them are neutral and do not cause severe disorders (Majewski, 2011). In recent years many tools have been developed to select the list of candidate nsSNVs in WES analyses. Some are statistical tools, such as PLINK (Shaun, 2007) and identity by descent (IBD) analysis (Thompson, 2013), which focus on genomic regions that share ancestral polymorphisms and on genetic linkage co-segregation. Others predict the degree of deleteriousness of nsSNVs in a protein by using algorithms which take into consideration different features such as the physico-chemical properties of amino acids, sequence conservation and protein structure (Ng, 2006).

2.1.6 Effect of nsSNVs on Protein Structure

The effects of non-synonymous Single-Nucleotide Polymorphisms (nsSNPs) in genes which have a relevant impact on Mendelian diseases are well known among geneticists and clinicians. These variants are base changes in a coding region that cause an amino acid substitution (AAS) in the protein sequence. If the nsSNP alters the protein function, the change can lead to radical phenotypic consequences. A significant part of these deleterious mutations is eliminated through purifying selection. The importance of non-synonymous variants in humans is reflected in the Human Gene Mutation Database (HGMD) and the Online Mendelian Inheritance in Man (OMIM), two databases which contain information on Mendelian diseases, but also on complex diseases due to nsSNPs in the genome (Hamoshi, 2005; Stenson, 2003). Since researchers have observed that throughout evolution these mutations are more likely to occur at conserved positions than at non-conserved positions, it is suggested that prediction could be based on interspecific sequence homology (Miller, 2001).
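As a small worked illustration of how a single base change in a coding region becomes an amino acid substitution, the sketch below maps the coding position of the PTPN11 variant c.923A>G (Asn308Ser, listed in Table 2.3) onto its codon. The actual reference codon is not given in the text, so the sketch tries both asparagine codons; the helper names `codon_position` and `substitute` are hypothetical, and only the four codons needed here are included from the standard genetic code.

```python
# Subset of the standard genetic code needed for this example.
CODON_TABLE = {"AAT": "Asn", "AAC": "Asn", "AGT": "Ser", "AGC": "Ser"}


def codon_position(cdna_pos):
    """Map a 1-based coding position to (codon number, 0-based offset in codon)."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3


def substitute(codon, offset, alt_base):
    """Return the codon with the base at the given offset replaced."""
    return codon[:offset] + alt_base + codon[offset + 1:]


codon_number, offset = codon_position(923)
# The reference codon is an assumption: both Asn codons are tried.
for ref_codon in ("AAT", "AAC"):
    alt_codon = substitute(ref_codon, offset, "G")
    print(codon_number, ref_codon, CODON_TABLE[ref_codon],
          "->", alt_codon, CODON_TABLE[alt_codon])
```

Position 923 falls on the middle base of codon 308, and replacing the middle A of either asparagine codon with G yields a serine codon, consistent with the reported Asn308Ser change.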

How and when nsSNPs cause disease is one of the major challenges encountered when studying Mendelian disorders. Variants affect protein functions in different ways (Wang, 2001). Some of the effects of nsSNVs on molecular functions are listed below:

• Protein Stability is affected by the loss of hydrogen bonds and salt bridges, reduction of hydrophobic interactions, disruption of metal binding, breakage of disulfide bonds, destabilization of protein multimers and backbone strains.

• Ligand binding is affected by changes in the interaction of ligand atoms with the side chain when the side chain is mutated.

• Catalysis is affected when the mutated residue is directly involved in a catalytic process.


• Post-translational modification refers to the disruption of a sequence pattern for modifications (e.g. N-glycosylation).

Additionally, and less obviously, entropic changes can affect most of the foregoing effects.

2.1.7 Predictor Tools and their Characteristics for the Analysis of Variants

NGS methods produce a plethora of information. Every human being carries about 3 million single nucleotide substitutions when compared with a reference genome. The decreasing cost of sequencing has put research groups in a difficult situation, since the analysis of data and the interpretation and validation of variants are the most time-consuming steps in a sequencing project. In recent years attention has moved to the study and development of software and prediction methods, recommended by both the European Society of Human Genetics (ESHG) and the American College of Medical Genetics (ACMG). Several tools have been developed for the analysis and interpretation of variations and their effects.

When studying the effects of causal variants, it is important to verify whether the mutation has been previously found in patients or in healthy individuals. If a variant is frequent in healthy individuals, it is probably not pathogenic. Several databases contain protein amino acid sequence mutations and SNPs related to both Mendelian disorders and complex disorders: general variation databases such as ClinVar (Landrum, 2014), Database of immunodeficiency-causing variations (IDbases) (Pirila, 2006), MITOMAP (Lott, 2013), HGMD (Stenson, 2003), OMIM (Hamoshi, 2005), Universal Mutation Database (UMD) (Beroud, 2000) and Database of short genetic variations (dbSNP) (Sherry, 2001).

Another important aspect concerning the analyses of variants is the implementation of new methods to improve performance. Most of the methods used to analyse variants are based on Machine Learning (ML), a form of Artificial Intelligence (AI) in which a computer learns from given data (Niroula, 2016). The most widely used ML algorithms for this kind of data are based on neural networks, support vector machines and random forests, and in particular on supervised ML classification for variant interpretation (Figure 2.3).


which variants are pathogenic or deleterious (Tang, 2016). In order to facilitate the process of filtering and annotating for researchers, the dbNSFP v1.0 database was developed (Liu, 2011) with a collection of 75,931,005 nsSNVs in the human genome, and to meet the request for better functional annotations a new version, dbNSFP v2.0, was developed. Figure 2.4 shows a schematic flow diagram illustrating the filtering and analysis of nsSNPs (Adzhubei, 2010).

Some of the main prediction methods in dbNSFP v2.0 listed below, which use either sequence or structural information, are also implemented as Web servers:

• PolyPhen-2 (Polymorphism Phenotyping v2): based on both HDIV (mutations causing diseases) and HVAR (mutations putatively neutral) training sets, with the same nsSNV having multiple scores (HVAR and HDIV) depending on different amino acid positions, and with score thresholds for probably damaging, possibly damaging and benign of 0.956 and 0.453 for HDIV, and 0.908 and 0.44 for HVAR (Adzhubei, 2010).

• Mutation Assessor: score ranges from -5.545 to 5.975, with four types of predictions: high, medium, low and neutral (Gnad, 2013).

• FATHMM (Functional Analysis through Hidden Markov Models): score ranges from -18.09 to 11.0 (the smaller the score, the more likely the variant is deleterious), with a threshold of -1.5 separating deleterious from tolerated (Shihab, 2013).

• GERP (Genomic Evolutionary Rate Profiling): GERP++_RS (rejected substitution) is used to measure the conservation of a nucleotide site; an alternative is GERP++_NR, which measures the neutral rate (NR) of the site. For both NR and RS, the larger the score, the more conserved the site (Davydov, 2010).

• SIFT (Sorting Intolerant From Tolerant): predicts whether an amino acid substitution affects protein function (Ng, 2003), with score ranges from 0.0 to 0.05 (deleterious) and from 0.05 to 1.0 (tolerated).

• CADD (Combined Annotation Dependent Depletion): based on a support vector machine (SVM), with Phred-like scores (scaled C scores) ranging from 1 to 99 (Nagakomi, 2018). The top 10% in the ranking of CADD scores are assigned C score 10, the top 1% C score 20, the top 0.1% C score 30 and so on.

• Mutation Taster: evaluates the disease-causing potential of a DNA sequence alteration. In particular it analyzes evolutionary conservation, splice-site changes, loss of protein features and changes that might affect the amount of mRNA (Schwarz, 2010), and then evaluates the mutation with a naive Bayes classifier.
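The score thresholds quoted in the list above can be turned into simple classification rules, sketched here in Python. The cut-off values are the ones given in the text; the function names are hypothetical, and the treatment of scores lying exactly on a threshold (for example a SIFT score of exactly 0.05) is an assumption, since the text does not specify it. The last function illustrates the Phred-like scaling of CADD, in which the top 10%, 1% and 0.1% of ranked variants map to scores of 10, 20 and 30.

```python
import math

# PolyPhen-2 cut-offs from the text: (probably damaging, possibly damaging).
POLYPHEN2_CUTOFFS = {"HDIV": (0.956, 0.453), "HVAR": (0.908, 0.44)}


def polyphen2_class(score, model="HDIV"):
    """Classify a PolyPhen-2 score; boundary handling is an assumption."""
    probably, possibly = POLYPHEN2_CUTOFFS[model]
    if score >= probably:
        return "probably damaging"
    if score >= possibly:
        return "possibly damaging"
    return "benign"


def sift_class(score):
    """SIFT: low scores are deleterious (0.05 itself is assumed deleterious)."""
    return "deleterious" if score <= 0.05 else "tolerated"


def fathmm_class(score):
    """FATHMM: scores below -1.5 are deleterious."""
    return "deleterious" if score < -1.5 else "tolerated"


def cadd_scaled(rank, total):
    """Phred-like scaling of a rank (1 = most deleterious) among `total` variants."""
    return -10.0 * math.log10(rank / total)


print(polyphen2_class(0.97, "HDIV"), polyphen2_class(0.5, "HVAR"))
print(sift_class(0.01), fathmm_class(-3.2))
print(cadd_scaled(100, 1000), cadd_scaled(10, 1000), cadd_scaled(1, 1000))
```

Note that the direction of the scale differs between tools: high PolyPhen-2 and CADD scores point to damaging variants, whereas low SIFT and FATHMM scores do.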


clinical assertion reported in the genetic testing report. The algorithms applied for the analyses of variants are PROVEAN, SIFT, MetaSVM, VEST3, CADD, GERP++_RS, GERP++_NR and MetaLR. Once the scores had been obtained from the in silico analyses, the second step involved the calculation of the sensitivity of the methods for clinical use, to determine the proportion of reported pathogenic variants that were correctly asserted by the prediction algorithms.

Furthermore, we selected variants with theoretical predictions inconsistent with the clinical assertion and studied them in depth by protein modelling, as outlined in the sections below. Here, we studied the reliability of the algorithms and performed homology modeling in order to better understand and visualize the effect of the mutation on the protein function. Table 2.3 reports the top 25 SNPs as an example of the data used to obtain the prediction scores. Some of them are VUS predicted to be pathogenic. These data were kindly provided by Prof. B. Peterlin for the thesis.

2.2.2 Receiver Operating Characteristic (ROC) curves and Area Under Curve (AUC)

Most predictors depend on numerical threshold values which are used to classify the sample into a number of possible classes. For binary predictors a useful aid for assessing their performance is the Receiver Operating Characteristic (ROC) curve, which is a plot of the true positive rate versus the false positive rate as the threshold value for the predictor is varied.

The curve is built by classifying predictions in four classes:

• TP: true positives (prediction: positive - true value: positive)

• FP: false positives (prediction: positive - true value: negative)

• FN: false negatives (prediction: negative - true value: positive)

• TN: true negatives (prediction: negative - true value: negative)

The sensitivity of the predictor is given by the true positive rate (TPR), i.e. the ratio of true positives to all actual positives:

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{2.1} \]

The specificity of the predictor is related to the false positive rate (FPR), i.e. the ratio of false positives to all actual negatives, through specificity = 1 - FPR:

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{2.2} \]

2.2. Materials and Methods 17

TABLE 2.3: Top 25 SNPs Analysed

Chr   Gene       Nucleotide Mutation   Protein Mutation   Pathogenicity Class
10    FGFR2      c.758C>G              Pro253Arg          Pathogenic
8     CYP7B1     c.1162C>T             Arg388Ter          Pathogenic
17    SCN4A      c.3917G>T             Gly1306Val         Pathogenic
8     EXT1       c.1551G>A             Trp517Ter          Pathogenic
21    CSTB       c.202C>T              p.Arg68X           Pathogenic
11    TYR        c.1A>G                p.Met1Val          Pathogenic
2     FSHR       c.2T>C                p.Met1?            Pathogenic
17    TSEN54     c.919G>T              Ala307Ser          Pathogenic
3     PIK3CA     c.3140A>T             His1047Leu         Pathogenic
16    EARS2      c.322C>T              Arg108Trp          Pathogenic
19    FKRP       c.826C>A              Leu276Ile          Pathogenic
15    FBN1       c.5183C>T             Arg2685Ter         Pathogenic
4     DSPP       c.-21T>C              Ala1728Val         Pathogenic
17    NF1        c.4812C>G             Tyr1604X           Pathogenic
13    FREM2      c.6727C>T             Arg2243X           Pathogenic
1     GLMN       c.108C>A              Cys36X             Pathogenic
16    PKD1       c.12035G>A            Trp4012X           Pathogenic
10    ACTA2      c.536G>A              Arg179His          Pathogenic
1     ABCA4      c.67-2A>G                                Pathogenic
19    RYR1       c.7268T>A             Met2423Lys         Pathogenic
15    FBN1       c.5788+5G>A                              Pathogenic
12    PTPN11     c.923A>G              Asn308Ser          Pathogenic
15    FBN1       c.6800A>T             Asn2267Ile         Pathogenic
19    C19orf12   c.424A>G              Lys142Glu          Pathogenic
12    CEP290     c.4723A>T             Lys1575Ter         Pathogenic

An ideal predictor would maximize the TPR and minimize the FPR. For a random predictor the expected TPR equals the FPR for all threshold values.
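To make the construction of the curve concrete, the following sketch computes ROC points and the area under the curve in plain Python. It is a minimal illustration, not the code used for the analyses in this chapter: the scores and labels are made-up toy values, the function names are hypothetical, and it assumes that higher scores indicate pathogenicity and that a variant is predicted positive when its score is at or above the threshold.

```python
def roc_curve(scores, labels):
    """Return ROC points (FPR, TPR), one per distinct threshold.

    labels: 1 = pathogenic (positive), 0 = benign (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: three pathogenic and three benign variants.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_curve(scores, labels)
print(points)
print(round(auc(points), 3))
```

For this toy data the area equals 8/9, i.e. the fraction of (pathogenic, benign) pairs in which the pathogenic variant receives the higher score; a random predictor would give a value close to 0.5.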

FIGURE 2.6: Example of a ROC curve. The diagonal curve

2.3 Results

2.3.1 Data obtained from prediction algorithms on nsSNVs

Although many models have been developed for predicting the effects of genetic variants, the prediction of pathogenic variants remains a problem in NGS studies (Eva, 2016). We evaluated 1000 variants with gene-specific features using eight prediction algorithms: VEST3, SIFT, PROVEAN, MetaSVM, MetaLR, GERP++_RS, GERP++_NR and CADD.

In order to evaluate the performance of the prediction scores obtained from our data, Receiver Operating Characteristic (ROC) curves were computed and the Area Under Curve (AUC) was calculated, as shown in Figure 2.7. The analysis was restricted to those missense SNPs classified as Benign, Likely Benign, Pathogenic or Likely Pathogenic. We calculated the corresponding sensitivity and specificity for them. The vertical yellow lines in the plots of Figure 2.7 correspond to the confidence intervals of the curve.
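The text does not state how the confidence intervals were obtained. One common choice is the percentile bootstrap, sketched below purely as an illustration and not as the method used for Figure 2.7. The function names, the number of resamples, the seed and the toy data are all assumptions.

```python
import random


def auc_rank(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to define an AUC
            stats.append(auc_rank([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): higher score = more likely pathogenic.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc_rank(scores, labels))
print(bootstrap_ci(scores, labels))
```

With few variants per class the interval becomes wide, which is worth keeping in mind when comparing tools whose intervals overlap.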

2.3.2 Variants of Uncertain Significance (VUS)

One of the major challenges in clinical high-throughput next-generation sequencing (NGS) is the large number of variants identified (Belkadi, 2015). Despite growing knowledge and the number of databases, such as ClinVar, that link variants with clinical phenotypes to aid classification, many of these variants remain difficult to classify as either (likely) benign or (likely) pathogenic in relation to the genetic condition for which testing has been sought (Vears, 2017). These variants may be related to the clinical question but lack sufficient evidence to confirm pathogenicity, may lie in genes whose function is uncertain, or may lie in so-called disease-causing genes not related to the clinical question (Bertier, 2017).

Table 2.4 shows some of the VUS present in our dataset, which are not used in clinical decision making and which cause problems for clinicians and uncertainty on how to advise patients.

For all of them the three-dimensional structure of the given protein can be predicted by homology modeling. Two of them have been subjected to structural analysis, described in the following sections, motivated by the clinical cases under study in the Ljubljana Clinical Genetics Institute.

2.3.3 Homology Modeling of ATOH1 and NOTCH3 protein mutation

Homology Modeling is a method used to generate in silico protein models, in order to obtain from a sequence a structure comparable to experimental results (Krieger, 2003). HM is based on two observations:


TABLE 2.4: A list of VUS of B. Peterlin’s group to be analyzed

Name                                                        VUS
desmin                                                      DES-Arg375Trp
notch 3                                                     NOTCH3-Gly289Cys
ryanodine receptor 2                                        RYR2-Gln8His
ubiquitin protein ligase                                    PARK2-Arg275Gln
potassium voltage-gated channel subfamily Q member 2        KCNQ2-Pro666Leu
atonal bHLH transcription factor 1                          ATOH1-Arg161Gly
polycystin 2, transient receptor potential cation channel   PKD2-Gln557X
pre-mRNA processing factor 8                                PRPF8-Asn270Ser
inosine monophosphate dehydrogenase 1                       IMPDH1-Thr310Pro
ryanodine receptor 2                                        RYR2-Arg4608Trp


FIGURE 2.18: The wild-type (in red) and mutated (in cyan)

(A) Snapshot of the MD simulation at 50 ns of the complex of wild-type ATOH1 with the DNA. The position of residue Arg 161 is represented in yellow on the ribbon, and its sidechain in licorice.

(B) Snapshot of the MD simulation at 50 ns of the complex of mutated ATOH1 with the DNA. The position of residue Gly 161 is represented in yellow on the ribbon, and its sidechain in licorice.

FIGURE 2.19: Simulation at 50 ns of the complex wild-type and mutant ATOH1-DNA (R161G)

2.3.5 Analysis of the mutation in Protein neurogenic locus notch homolog protein 3 (gene NOTCH3) G289C

NOTCH3. The Notch3 gene was identified in 1990 as the third mammalian Notch, expressed in proliferating neuroepithelium, but different studies have demonstrated functional and structural differences among Notch3, Notch1 and Notch2 (Bellavia, 2008).

Notch receptors are type I transmembrane glycoproteins involved in cell fate determination (Artavanis-Tsakonas, 1999). All Notch proteins share a similar structure. The extracellular domain is composed of 29-36 epidermal growth factor (EGF)-like repeats, 3 Lin-Notch repeats and 1 transmembrane region, while the intracellular region contains at least three conserved domains: the membrane-proximal RAM (RBP-jk-associated molecule) domain, seven consecutive ankyrin repeats (ANK domain) and a C-terminal PEST (proline-glutamic acid-serine-threonine) sequence. A transactivation domain (TAD) is present only in Notch1 and Notch2 (Kurooka, 1998; Beatus, 2001). Figure 2.20 shows the organization of Notch3 compared to Notch1 and Notch2.

Regarding its role in disease, Notch3 has a restricted tissue distribution: it is expressed in vascular smooth muscle (Joutel, 2000) and in the central nervous system (Lardelli, 1994), and it is involved in the regulation of T cells (Anastasi, 2003). Studies of Notch3 mutations have shown an association with ovarian high-grade serous carcinomas (Park, 2006) and with cerebral autosomal-dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), an inherited small vessel disease causing stroke and dementia (Joutel, 2000).


Homology Modeling of Notch3 From the sequence analysis of Notch3 (Liu, 2009), shown in Figure 2.22, it can be seen that position 289 lies within the third of the calcium-binding EGF-like domain motifs (residues 274-311).

The alignment of the most similar sequences from the RefSeq database shows that G289 is part of an absolutely conserved region. Sequences are from taxa annotated as primates, rodents, bats, placentals, even-toed ungulates, carnivores, whales and dolphins, odd-toed ungulates, rabbits and hares, marsupials, coelacanths, bony fishes, turtles, birds, snakes, frogs and toads, and lizards, as shown in Figure 2.23.

A homolog for the region 277-467 is found in the Protein Data Bank with PDB id 5UK5 (Luca, 2017), with 75% identical amino acids over this region and no gaps. The homology model (HM) was built on chain A of this structure (complex of Notch1 (EGF8-12) bound to Jagged1 (N-EGF3) from rat) based on the alignment shown in Figure 2.24. As for ATOH1, the sidechains of non-identical regions were fixed using DeepView 4.10.


2.4 Conclusion

This work focused on the reliability of the prediction tools included in the dbNSFP v2.0 database (Liu, 2011), on the description of the different types of Mendelian mutations classified in medical genetics, and on the impact of NGS analyses in medical diagnostics.

Around 1000 variants kindly provided by Prof. B. Peterlin were tested, in particular VUS, i.e. variants for which the genetic data are insufficient to confirm definitively that the variant is associated with the risk of developing the disease. From the outputs of some of the predictive tools we obtained ROC curves with their AUC values. MetaLR (AUC: 0.78, 95% CI: 0.6745-0.8995), PROVEAN (AUC: 0.77, 95% CI: 0.5996-0.953) and CADD (AUC: 0.62, 95% CI: 0.4985-0.7484) achieved the highest discriminative power as evaluated by the AUC of the ROC curve, while the other tools showed lower performance: VEST3 (AUC: 0.55, 95% CI: 0.4468-0.6648), SIFT (AUC: 0.56, 95% CI: 0.5996-0.953), GERP++_RS (AUC: 0.51, 95% CI: 0.3599-0.6679), MetaSVM (AUC: 0.57, 95% CI: 0.6745-0.8995) and GERP++_NR (AUC: 0.53, 95% CI: 0.37-0.6832).

As regards the algorithms with a high predictive power, the AUC results are comparable with those obtained in other works (Choi, 2012; Dong, 2015). MetaLR uses logistic regression to integrate nine scoring methods in order to predict the deleteriousness of nsSNVs; the logistic regression was applied by Dong et al. (2015) to 87,347,044 possible variants, obtaining an AUC of 0.92 on testing dataset I and an AUC of 0.94 on testing dataset II (Dong, 2015). In Choi's work, the PROVEAN algorithm was applied to 57,646 human and 30,615 non-human single amino acid substitutions from the UniProtKB/Swiss-Prot database, obtaining an AUC of 0.85. These results were confirmed by comparing the predictive ability of PROVEAN with that of existing tools such as SIFT, PolyPhen-2 and Mutation Assessor (Choi, 2012). CADD is a machine learning-based tool consisting of two steps: a model-fitting phase, followed by a variant-scoring phase (Rentzsch, 2019). In the same work of Dong et al. (2015), the CADD algorithm gave an AUC of 0.83 on testing dataset I and an AUC of 0.76 on dataset II. These results suggest that our outputs reach AUC values similar to those reported in the literature. An important aspect to be considered is the number of variants and datasets used in these works. In our case the number of variants analyzed may not be sufficient to assess the accuracy of the predictive tools with precision.
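To make the meaning of the AUC values and their confidence intervals concrete, the following minimal Python sketch computes the AUC as the probability that a pathogenic variant receives a higher score than a benign one, and estimates a percentile bootstrap interval. It is only an illustration: the function names, the bootstrap procedure and the toy scores are assumptions and do not reproduce the software or the data used for the results above.

```python
import random


def roc_auc(scores, labels):
    """AUC as the fraction of (pathogenic, benign) pairs ranked correctly.

    labels: 1 = pathogenic, 0 = benign. Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (illustrative choice)."""
    roc_auc(scores, labels)  # raises early if one class is missing
    rng = random.Random(seed)
    idx = range(len(scores))
    aucs = []
    while len(aucs) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples containing a single class
            aucs.append(roc_auc([scores[i] for i in sample], ys))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical predictor scores for six variants (not thesis data).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
print(roc_auc(toy_scores, toy_labels))
print(bootstrap_ci(toy_scores, toy_labels, n_boot=200))
```

With few variants the bootstrap interval becomes wide, which is the practical meaning of the remark that a small number of variants limits the precision with which the tools can be assessed.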

Regarding the pathogenic VUS, we selected ATOH1 R161G and NOTCH3 G289C in order to study the protein structure and the mutation arrangement by Homology Modeling (HM).


structure and function and not by changes in the DNA sequence (Strokach, 2019).


Chapter 3

Mutations effect on the Thermodynamic Stability of Proteins

Certainly no subject or field is making more progress on so many fronts at the present moment, than biology, and if we were to name the most powerful assumption of all, which leads one on and on in an attempt to understand life, it is that all things are made of atoms, and that everything that living things do can be understood in terms of the jigglings and wigglings of atoms.

Richard Feynman

(Feynman Lectures on Physics, vol. 1, Ch. 3, 1963)

3.1 Structure of Proteins

Proteins, like other biological macromolecules such as carbohydrates and nucleic acids, are fundamental in regulating the molecular mechanisms of life. They are synthesized on the ribosome as linear chains of amino acids, as shown in Figure 3.1, and their capability to perform a wide range of cellular functions makes them the most versatile macromolecules.

The amino acid residues are drawn from the 20 naturally occurring ones (Schulz, 1979). Each of them has two essential groups, the amino (H2N-) and carboxylic acid (HOOC-) groups; the differences lie in the nature of the side-chain group (R) attached to the alpha carbon (Cα), as shown in Figure 3.2.

The composition of the R groups ranges from a single hydrogen atom, as in glycine, to aromatic rings and aliphatic moieties.

As Figure 3.3 shows, amino acids are usually subdivided into four groups depending on their chemical affinity with water: hydrophobic, polar uncharged, negatively charged and positively charged.

Amino acids in solution are found as dipolar ions, with the NH2 and COOH groups ionized to NH3+ and COO−, respectively. The primary structure of


Long chains of amino acids constituting proteins fold in a specific way, conferring the functional arrangement of the chemical groups. The structure of a protein is described at different levels (Schulz, 1979):

• Primary structure: the sequence of amino acids;

• Secondary structure: the local arrangement of amino acids in helices or β-sheets whose geometry is well defined and characterized. Secondary structure elements involve ca. 60% of all the amino acids in proteins and depend on the propensities of amino acids to adopt the corresponding conformations;

• Tertiary structure: the folded structure of the protein, with all secondary structure elements properly arranged;

• Quaternary structure: the spatial arrangement in complexes of more than one protein.

A general principle is that the compact globular state of most proteins is due to hydrophobic interactions among inner residues, whereas polar and charged residues are mostly exposed to the solvent. As described in Chapter 2, where genetic mutations that change the amino acid sequence were discussed, disease-causing variants frequently involve drastic changes in the physico-chemical properties of amino acids, such as charge, hydrophobicity and geometry (Petukh, 2015).

Other significant changes may involve the secondary structure or local geometry propensity of the amino acid.

In order to understand the functional effects of SNPs, it is necessary to understand the effect of amino acid mutations on protein structure and stability. Different groups have worked on this topic and numerous studies have been carried out on small specific sets of proteins (Stewart, 2003; Nakken, 2007). Blundell and co-workers found that the environment around an amino acid plays an important role in the effect that selection has on a mutation at a specific position (Gong, 2010), and Subramanian analysed a set of 8,627 disease-associated mutations and found that disease-associated variants tend to occur at residues conserved across species (Subramanian, 2006). Beer and co-workers (Beer, 2013) observed that mutations to Arg residues are more than twice as common as any other mutation, owing to the properties of the DNA mutation mechanism, which favours mutations at codons containing a CpG dinucleotide (Figure 3.5).


3.2 Introduction to Molecular Dynamics

This section focuses on the theoretical background of classical molecular dynamics simulations, in particular on the algorithms and the force fields (ff) used in this thesis, and on the implicit solvent model used, i.e. the Generalized Born/Surface Area (GB/SA) continuum solvation model as implemented by Onufriev, Bashford and Case (OBC) (Onufriev, 2000), together with the use of this model with the amber99sb-ildn ff (Lindorff-Larsen, 2010) to reproduce folding free energy changes upon mutation. Particular attention will be devoted to entropy calculation using the k-nearest neighbour (kNN) method and the Maximum Information Spanning Tree (MIST) model.

3.2.1 Theoretical background

MD simulation is a computational method that calculates the time-dependent behavior of a molecular system and generates information at the microscopic level, in particular atomic positions and velocities. It is based on Newton's second law, or the equation of motion:

$$ F_i = m_i a_i = m_i \frac{d^2 r_i}{dt^2} \tag{3.1} $$

with the subscript i = 1, 2, ..., N indicating the i-th atom.

The force acting on atoms can also be expressed as the negative gradient of the potential energy function:

$$ F_i = -\nabla_i U \tag{3.2} $$

$$ m_i \frac{d^2 r_i}{dt^2} = -\nabla_i U(r_1, r_2, \ldots, r_N) = -\frac{\partial}{\partial r_i} U(r_1, r_2, \ldots, r_N) \tag{3.3} $$

where F_i is the force acting on the i-th particle of the system, and m_i, r_i and a_i are its mass, spatial coordinates and acceleration. U(r_1, r_2, ..., r_N) is the system potential energy and N the total number of atoms.

(67)

A crucial element of MD simulations is the functional form and the parameters which express the potential energy as a function of atomic coordinates, i.e. the force field, which is described in the next section.

3.2.2 Force Fields in Molecular Dynamics

A force field is a mathematical expression describing the dependence of the energy of a system on the coordinates of its particles. It consists of an analytical form of the interatomic potential energy, U(r_1, r_2, ..., r_N), and a set of parameters entering into this form. Parameters are obtained either from ab initio or semi-empirical quantum mechanical calculations or by fitting to experimental data such as neutron, X-ray and electron diffraction, NMR, Raman and neutron spectroscopy, etc.

Molecules are defined as sets of atoms held together by simple elastic forces, and the force field replaces the true potential with a simplified model valid in the region simulated (Gonzalez, 2011). A typical expression of a force field looks like this:

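$$ U = \sum_{bonds} \frac{1}{2} k_b (r - r_0)^2 + \sum_{angles} \frac{1}{2} k_a (\theta - \theta_0)^2 + \sum_{torsions} \frac{V_n}{2} \left[ 1 - \cos(n\phi - \delta) \right] + \sum_{imp} V_{imp} + \sum_{LJ} 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] + \sum_{elec} \frac{q_i q_j}{r_{ij}} \tag{3.4} $$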

In Eq. 3.4 the first four terms refer to intramolecular contributions to the total energy (bond stretching, angle bending, dihedral and improper torsions, i.e. dihedral terms among non-sequentially bonded atoms, used to maintain planarity at certain moieties), and the last two terms describe the van der Waals interactions and the Coulombic interactions.

The last two terms run over all pairs of atoms i and j separated from each other by the distance r_ij. Bond stretching and angle bending change the bond lengths r and bond angles θ from their equilibrium values r_0 and θ_0. This means that a high potential energy is associated with the deformation of the equilibrium bond and angle geometry through the force constants k_b and k_a. The third and fourth terms account for the energetics linked with rotations around bonds and with planarity. The fifth term describes the pairwise non-polar interactions between atoms i and j through a Lennard-Jones 12-6 potential, which accounts for van der Waals forces: repulsive at short distance (the r^-12 term) and attractive at long distance (the r^-6 term). The last term describes the electrostatic interactions according to Coulomb's law. The variables q_i and q_j are the partial charges on atoms i and j and r_ij is their distance; σ_ij is the distance at which the Lennard-Jones potential is zero and ε_ij is the well depth.
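As an illustration of the two non-bonded terms of Eq. 3.4, the following Python sketch evaluates the Lennard-Jones and Coulomb energies over all atom pairs. It is a sketch under stated assumptions, not the implementation of any MD package: the Lorentz-Berthelot combination rules, the conversion constant that expresses the Coulomb term in kcal/mol, the absence of bonded-pair exclusions and cut-offs, and the toy parameters are all choices made for the example.

```python
import math

# Assumed conversion factor so that q_i*q_j/r (charges in e, r in Angstrom)
# is returned in kcal/mol; Eq. 3.4 leaves the unit system implicit.
COULOMB_KCAL = 332.0637


def lennard_jones(r, sigma, epsilon):
    """12-6 Lennard-Jones energy: zero at r = sigma, minimum -epsilon."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, qi, qj):
    """Coulomb energy between two partial charges in vacuum."""
    return COULOMB_KCAL * qi * qj / r


def nonbonded_energy(coords, charges, sigmas, epsilons):
    """Sum of LJ and Coulomb terms over all pairs i < j.

    Pair parameters use Lorentz-Berthelot rules (an assumption of this sketch);
    bonded-pair exclusions applied by real force fields are ignored.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            sigma_ij = 0.5 * (sigmas[i] + sigmas[j])
            eps_ij = math.sqrt(epsilons[i] * epsilons[j])
            total += lennard_jones(r, sigma_ij, eps_ij)
            total += coulomb(r, charges[i], charges[j])
    return total


# Toy system: two opposite partial charges 4 Angstrom apart (hypothetical values).
print(nonbonded_energy([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
                       [0.4, -0.4], [3.4, 3.4], [0.1, 0.1]))
```

The double loop makes the quadratic cost of the non-bonded terms explicit, which is why their evaluation dominates the computation of forces.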


AMBER (Cornell, 1995), GROMOS (Oostenbrink, 2004), OPLS (Jorgensen, 1996), COMPASS (Sun, 1998) and Amberff99SB (Lindorff-Larsen, 2010). All these force fields are continuously expanded and improved. At present Amberff99SB-ILDN (Lindorff-Larsen, 2010) is one of the best performing as far as protein simulations are concerned.

3.2.3 Protocol for MD simulations

The first step of an MD simulation consists in defining the starting coordinates and velocities (providing initial conditions), the system topology, the interaction potential, temperature, integration time-step, etc. In the second step the forces exerted in the system are computed and Newton's equations of motion are integrated. The last step is the analysis step, in which the averages of the computed observables are estimated.

Definition of the initial set of coordinates and velocities The coordinates of protein structures are often taken from the Protein Data Bank or obtained in silico by homology modeling. Since there may be steric clashes or other experimental or modeling inaccuracies, the starting structure is typically energy minimized before starting MD protocols. Typically a few thousand minimization steps using conjugate gradients or even steepest descent are sufficient to remove minor steric hindrances.

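Atomic velocities v_i are usually assigned randomly based on the system temperature T, according to the Maxwell-Boltzmann distribution function P(v_i) for each velocity component:

$$ P(v_i) = \sqrt{\frac{m_i}{2\pi k_B T}} \exp\left( -\frac{m_i v_i^2}{2 k_B T} \right) \tag{3.5} $$

where k_B is Boltzmann's constant. The temperature can be calculated from the velocities through the equipartition theorem:

$$ T = \frac{2}{3 N k_B} \sum_{i=1}^{N} \frac{|p_i|^2}{2 m_i} \tag{3.6} $$

with p_i = m_i v_i the momentum of atom i and N the number of atoms in the system.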
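The assignment of initial velocities and the calculation of the instantaneous temperature can be sketched in a few lines of Python. The example draws each Cartesian velocity component from a Gaussian of variance k_B T / m_i and recovers the temperature from equipartition, T = Σ m_i |v_i|² / (3 N k_B), with k_B written out explicitly. The unit system (kcal/mol, amu, K), the function names and the number of atoms are assumptions of this sketch; production codes also remove the centre-of-mass motion and correct the number of degrees of freedom, which is omitted here.

```python
import math
import random

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def draw_velocities(masses, temperature, rng):
    """Draw Maxwell-Boltzmann velocities: each component ~ N(0, kB*T/m)."""
    velocities = []
    for m in masses:
        sd = math.sqrt(KB * temperature / m)
        velocities.append([rng.gauss(0.0, sd) for _ in range(3)])
    return velocities


def kinetic_temperature(masses, velocities):
    """Instantaneous temperature from equipartition: sum(m*v^2) / (3*N*kB)."""
    twice_kinetic = sum(m * sum(c * c for c in v)
                        for m, v in zip(masses, velocities))
    return twice_kinetic / (3.0 * len(masses) * KB)


rng = random.Random(0)
masses = [12.011] * 5000          # hypothetical system of 5000 carbon-like atoms
vel = draw_velocities(masses, 300.0, rng)
print(kinetic_temperature(masses, vel))  # fluctuates around the target of 300 K
```

For a finite number of atoms the instantaneous temperature deviates from the target value, with relative fluctuations that decrease as the square root of the number of degrees of freedom.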

Computation of forces The computation of forces in MD is a very


Numerical Integration It is difficult to obtain an analytical solution for Eq. 3.1, except for simple systems; therefore the second-order differential term is approximated using the Taylor series expansion in Eq. 3.8:

$$ x(t+\delta t) = x(t) + \delta t \frac{dx(t)}{dt} + \frac{1}{2!} \delta t^2 \frac{d^2 x(t)}{dt^2} + \frac{1}{3!} \delta t^3 \frac{d^3 x(t)}{dt^3} + \ldots \tag{3.8} $$

A second Taylor series expansion, backwards in time, is required to approximate the second-order differential term up to fourth order in time:

$$ x(t-\delta t) = x(t) - \delta t \frac{dx(t)}{dt} + \frac{1}{2!} \delta t^2 \frac{d^2 x(t)}{dt^2} - \frac{1}{3!} \delta t^3 \frac{d^3 x(t)}{dt^3} + \ldots \tag{3.9} $$

By summing Eq. 3.8 and Eq. 3.9, the second-order differential term is given by:

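$$ \frac{d^2 x(t)}{dt^2} = \frac{x(t+\delta t) - 2x(t) + x(t-\delta t)}{\delta t^2} + O(\delta t^2) \tag{3.10} $$

where the remainder, which comes from the terms of fourth order and higher in δt in the summed expansions, defines the accuracy of the approximation. Neglecting these terms, the equation can be written as:

$$ \frac{d^2 x(t)}{dt^2} = \frac{x(t+\delta t) - 2x(t) + x(t-\delta t)}{\delta t^2} \tag{3.11} $$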

Eq. 3.11 is known as the central difference approximation. Accordingly, the x-component of Newton's law of motion becomes:

$$ x_i(t+\delta t) = 2 x_i(t) - x_i(t-\delta t) + \frac{\delta t^2}{m_i} F_{x_i}(t) \tag{3.12} $$

where x_i and F_xi are the x-components of the position r_i of particle i and of the force F_i acting on it. This equation does not require the velocities for computing the atomic positions at the next time-step. It is termed the Verlet method (Verlet, 1967).
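The position update of Eq. 3.12 can be illustrated with a one-dimensional example. The sketch below propagates a particle with the Verlet recursion, using only the two most recent positions and the current force; the harmonic force, the time step and the function name are hypothetical choices made for the illustration.

```python
def verlet_trajectory(force, mass, x0, x_prev, dt, n_steps):
    """Propagate x(t) with the Verlet recursion of Eq. 3.12.

    x_prev is the position at t = -dt and x0 the position at t = 0.
    Returns the list [x(-dt), x(0), x(dt), ..., x(n_steps*dt)].
    """
    xs = [x_prev, x0]
    for _ in range(n_steps):
        x_new = 2.0 * xs[-1] - xs[-2] + dt * dt / mass * force(xs[-1])
        xs.append(x_new)
    return xs


def harmonic_force(x):
    """Toy force F = -k*x with k = 1 (hypothetical test system)."""
    return -x


# Harmonic oscillator with exact solution x(t) = cos(t); cos(0.01) ~ 0.99995.
dt = 0.01
traj = verlet_trajectory(harmonic_force, 1.0, 1.0, 0.99995, dt, 628)
print(traj[-1])  # close to cos(6.28), i.e. about one full period
```

No velocity appears in the recursion; if velocities are needed, for instance to compute the kinetic energy, they have to be estimated afterwards from differences of positions.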

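The most popular algorithms that calculate the velocities in addition to the positions are the velocity Verlet algorithm:

$$ r_i(t+\delta t) = r_i(t) + \delta t\, v_i(t) + \frac{\delta t^2}{2 m_i} F_i(t) \tag{3.13} $$

$$ v_i(t+\delta t) = v_i(t) + \frac{\delta t}{2 m_i} \left[ F_i(t) + F_i(t+\delta t) \right] \tag{3.14} $$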

and the Leap Frog algorithm, which evaluates the velocities at half-integer time steps and uses these velocities to compute new positions:

$$ r_i(t+\delta t) = r_i(t) + \delta t\, v_i\left(t + \frac{\delta t}{2}\right) \tag{3.15} $$

$$ v_i\left(t + \frac{\delta t}{2}\right) = v_i\left(t - \frac{\delta t}{2}\right) + \frac{\delta t}{m_i} F_i(r(t)) \tag{3.16} $$

In the latter scheme kinetic and potential energy are not defined at the same time, and hence it is not possible to compute the total energy directly.
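A minimal one-dimensional sketch of the Leap Frog update of Eqs. 3.15 and 3.16 is given below. The half-step velocity is advanced first and then used to move the position; the harmonic force, the starting values and the function names are assumptions chosen only for the illustration.

```python
def leapfrog(force, mass, x0, v_half_prev, dt, n_steps):
    """Leap Frog integration in one dimension.

    v_half_prev is the velocity at t = -dt/2. Returns the positions
    [x(0), x(dt), ...] and the last half-step velocity.
    """
    x, v_half = x0, v_half_prev
    positions = [x]
    for _ in range(n_steps):
        v_half = v_half + dt / mass * force(x)  # Eq. 3.16: v(t - dt/2) -> v(t + dt/2)
        x = x + dt * v_half                     # Eq. 3.15: x(t) -> x(t + dt)
        positions.append(x)
    return positions, v_half


def full_step_velocity(v_half_before, v_half_after):
    """Estimate v(t) as the average of the two neighbouring half-step velocities."""
    return 0.5 * (v_half_before + v_half_after)


def spring_force(x):
    """Toy force F = -k*x with k = 1 (hypothetical test system)."""
    return -x


# Harmonic oscillator x(t) = cos(t), v(t) = -sin(t), so v(-dt/2) ~ +dt/2.
positions, v_last = leapfrog(spring_force, 1.0, 1.0, 0.005, 0.01, 628)
print(positions[-1])  # close to cos(6.28)
```

Eliminating the half-step velocities from the two updates gives back the Verlet position recursion, so both schemes generate the same positions; the averaging in `full_step_velocity` is one common way to obtain a velocity, and hence a kinetic energy, at the same time as the positions.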

Langevin dynamics For simulations where the environment is not explicitly represented, as for the implicit solvent simulations described hereafter, the temperature coupling to the environment is obtained by adding a random noise term and a term that dissipates the energy introduced in the system.

The noise and dissipation terms are linked in such a way that the temperature of the system is kept at the desired value. Since the degrees of freedom of the solvent are not represented in implicit solvent models, it is important to provide proper energy fluctuations to the system.

Stochastic or velocity Langevin dynamics is performed by adding a friction term and a noise term to Newton's equations of motion:

$$ m_i \frac{d^2 r_i}{dt^2} = -m_i \gamma \frac{dr_i}{dt} + F_i(r) + \sqrt{2 m_i \gamma k_B T}\, R_i(t) \tag{3.17} $$

where γ is the friction constant, k_B is Boltzmann's constant and R_i(t) is a random, Gaussian, uncorrelated noise process with ⟨R_i(t) R_j(t+s)⟩ = δ(s) δ_ij.

The friction coefficient γ sets the dynamic regime. In particular, if 1/γ is large compared to the time scales of the system's motion, the Langevin terms simply provide coupling with a temperature bath. All processes on time scales longer than 1/γ will be damped. 1/γ is typically chosen in the range of 0.1-10 ps.
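How the friction and noise terms together maintain the target temperature can be checked numerically. The Python sketch below integrates one-dimensional Langevin dynamics with a splitting scheme in which the friction and noise part is solved exactly over each time step (often referred to as BAOAB); this integrator is chosen for the illustration and is not necessarily the one used by the MD engine employed in this work. Reduced units, the harmonic force and all parameter values are assumptions.

```python
import math
import random


def langevin_baoab(force, mass, kT, gamma, dt, n_steps, x0=0.0, v0=0.0, seed=0):
    """One-dimensional Langevin dynamics with a BAOAB-type splitting.

    Returns the final position, the final velocity and the time average of
    m*v^2, which should approach kT at equilibrium (equipartition in 1D).
    """
    rng = random.Random(seed)
    c = math.exp(-gamma * dt)                      # velocity damping over one step
    noise_sd = math.sqrt((1.0 - c * c) * kT / mass)  # matching noise amplitude
    x, v = x0, v0
    f = force(x)
    mv2_sum = 0.0
    for _ in range(n_steps):
        v += 0.5 * dt * f / mass                   # half kick from the force
        x += 0.5 * dt * v                          # half drift
        v = c * v + noise_sd * rng.gauss(0.0, 1.0)  # friction + noise, exact update
        x += 0.5 * dt * v                          # half drift
        f = force(x)
        v += 0.5 * dt * f / mass                   # half kick with the new force
        mv2_sum += mass * v * v
    return x, v, mv2_sum / n_steps


def well_force(x):
    """Toy harmonic force F = -k*x with k = 1 (hypothetical)."""
    return -x


_, _, mean_mv2 = langevin_baoab(well_force, 1.0, 1.0, 1.0, 0.05, 200000)
print(mean_mv2)  # close to kT = 1 in reduced units
```

The damping factor exp(-γδt) and the noise amplitude are tied together through k_B T, which is the discrete counterpart of the link between dissipation and noise described above.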

Temperature control in MD Simulations In MD the energy of a system coupled to a thermal bath fluctuates throughout the simulation path, with values that lead to temperature fluctuations. Thus, the probability of locating the system at a given microstate kinetic energy follows a Maxwell-Boltzmann distribution function, as in Eq. 3.18:

$$ P(p) = \left( \frac{\beta}{2\pi m} \right)^{3/2} \exp\left( -\frac{\beta p^2}{2m} \right) \tag{3.18} $$


3.2.4 Implicit Solvent

Although an explicit representation of the solvent provides a more realistic description of the environment surrounding biomolecules, it significantly increases the number of degrees of freedom and thereby the computational demand. One way to accelerate the sampling and reduce the number of degrees of freedom in MD is the use of an implicit solvent (Onufriev, 2008; Fogolari, 2002; Roux, 1999).

The main advantages of using an implicit solvent model in MD simulations can be summarized as follows (Onufriev, 2019):

– No need for the lengthy equilibration necessary in explicit water simulations.

– Better sampling, due to the absence of the viscosity associated with the explicit water environment.

– No artifacts of periodic boundary conditions.

– Since solvent degrees of freedom are taken into account implicitly, the calculation of the free energy is simpler.

The goal of implicit solvent models is to find potential functions that better account for solvation effects and for the effective change in solute conformational free energy. In MD this is achieved by the efficient generalized Born (GB) formalism (Onufriev, 2000; Still, 1990).

The potential of mean force of a solvated system can be written as in Eq. 3.19:

$$ E_{tot} = E_{vac} + \Delta G_{solv} \tag{3.19} $$

with E_vac the potential energy of the molecule in vacuum and ∆G_solv the free energy change required for transferring the molecule from vacuum into solvent. Most implicit models split the free energy of solvation into apolar and electrostatic solvation terms (Eq. 3.20):

$$ \Delta G_{solv} = \Delta G_{apolar} + \Delta G_{el} \tag{3.20} $$

∆G_apolar is assumed to be proportional to the solvent accessible surface area and is often neglected because it is much smaller than the polar term for non-dramatic conformational changes, while ∆G_el gives the electrostatic contribution.

Electrostatic solvation free energy


$$ \nabla \cdot \left[ \epsilon_0 \epsilon(r) \nabla \phi(r) \right] = -\bar{\rho}(r) \tag{3.21} $$

with ε the local dielectric constant, ρ̄ the charge density due to the solute and the salts and φ the electrostatic potential. Because of the complexity of describing the ionic charges explicitly, only the average ionic density is considered, and it is assumed to follow a Boltzmann distribution. It is convenient to linearize the PB equation as shown in Eq. 3.22. This can be done for most proteins without significant loss in accuracy (Fogolari, 1999; Fogolari, 2002; Onufriev, 2008).

$$ \nabla \cdot \left[ \epsilon_0 \epsilon(r) \nabla \phi(r) \right] = -\rho(r) + \sum_i C_i^b z_i^2 q^2 \frac{\phi(r)}{k_B T} \tag{3.22} $$

In Eq. 3.22 φ is the electric potential, ρ is the solute charge density, z_i and C_i^b are the valence and bulk concentration of ion i, k_B is Boltzmann's constant, T the temperature, q the unit charge, ε the local dielectric constant and ε_0 the vacuum permittivity.

The numerical solution of the linearized PB equation provides the electrostatic potential throughout space, so that energies and forces can be computed numerically. GB models approximate the electrostatic term of ∆G_solv as a pairwise summation of interaction terms between atomic charges i and j, each of which depends also on the surrounding atoms through the Born radii (Kleinjung, 2014):

$$ \Delta G_{el} \simeq -\frac{1}{2} \left( \frac{1}{\epsilon_{in}} - \frac{1}{\epsilon_{out}} \right) \sum_{i,j} \frac{q_i q_j}{\sqrt{r_{ij}^2 + \alpha_i \alpha_j \exp\left( -\frac{r_{ij}^2}{4 \alpha_i \alpha_j} \right)}} \tag{3.23} $$

with ε_in and ε_out the dielectric constants of the solute and of the solvent, respectively. α_i and α_j are the generalized Born radii of the interacting sites i and j. The generalized Born radius α_i is defined as the radius of a sphere for which a charge embedded at the center would have the same self energy as the same charge embedded in the molecule at site i. As such, generalized Born radii measure the screening provided on each site by the molecular environment.
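The pairwise generalized Born expression can be sketched as follows. The Python example sums the terms over all pairs of sites, including i = j (the self terms), and takes the Born radii as given input; in the OBC model they are computed from the molecular geometry, which is not reproduced here. The conversion constant to kcal/mol, the default dielectric constants and the function name are assumptions of this sketch.

```python
import math

COULOMB_KCAL = 332.0637  # assumed conversion: e^2/Angstrom -> kcal/mol


def gb_energy(coords, charges, born_radii, eps_in=1.0, eps_out=78.5):
    """Generalized Born electrostatic solvation energy (Still-type formula).

    The double sum runs over all i and j, so each distinct pair is counted
    twice and the self terms (i == j) reduce to Born ion energies.
    Born radii are inputs here; computing them is outside this sketch.
    """
    prefactor = -0.5 * COULOMB_KCAL * (1.0 / eps_in - 1.0 / eps_out)
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(n):
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            aa = born_radii[i] * born_radii[j]
            f_gb = math.sqrt(r2 + aa * math.exp(-r2 / (4.0 * aa)))
            total += charges[i] * charges[j] / f_gb
    return prefactor * total


# Single ion of charge +1 and Born radius 2 Angstrom (hypothetical values):
# the result is the Born solvation energy -(1/2)(1/eps_in - 1/eps_out) q^2 / alpha.
print(gb_energy([(0.0, 0.0, 0.0)], [1.0], [2.0]))
```

For a single site the function f_GB reduces to the Born radius, while for two sites far apart it tends to their distance, so the formula interpolates between the Born self energy and a dielectrically screened Coulomb interaction.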


Electrostatic and non polar free energy

The solvent-accessible surface area (SASA) (Durham, 2009) is a geometric measure of the exposure of a molecule to the environment. SASA is calculated by methods that approximate a water molecule by a probe sphere placed around a model of the protein. The first algorithm was developed by Lee and Richards (Lee, 1971). In their method the van der Waals radius of each atom is extended by 1.4 Å and the surface area of the expanded-radius atoms is calculated. The Shrake and Rupley algorithm (Shrake, 1973) considers the overlap of points on an atom's van der Waals surface with points on the van der Waals surface of neighboring atoms, whereas the SASA algorithm of Wodak and co-workers considers only interatomic distances and approximates each amino acid by one sphere at the center of mass (Wodak, 1980).

Other SASA algorithms take advantage of approximation methods such as spline approximations (Colloch, 1990) and boolean logic with look-up tables (Grand, 1992), or use a lattice model surrounding the protein to approximate SASA (Perl, 1983).

Notwithstanding all the attention that has been paid to the electrostatic contribution in the GB implicit solvent models (Chen, 2008), the non polar solvation free energy has been either described by simplistic SA models or ignored (Feig, 2004; Baker, 2005). The SA-based nonpolar solvation models are described by Eq.3.24:

$$ \Delta G_{np} = \sum_i \gamma_i A_i \tag{3.24} $$

where ∆G_np is the non-polar solvation free energy, γ_i are the atomic effective surface tension coefficients and A_i the atomic solvent-accessible surface areas. In most GB/SA models Eq. 3.24 is reduced to ∆G_np = γA, where A is the total solvent-accessible surface area and γ has the same value for all atom types. Following a recent literature survey (Knight and Brooks, 2011), a value of 0.0054 kcal/mol/Å² is used in the present work.
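The surface-area term can be illustrated with a small numerical SASA estimate in the spirit of the Shrake and Rupley point-counting approach, followed by ∆G_np = γA with γ = 0.0054 kcal/mol/Å². The sketch is an assumption-laden illustration rather than the algorithm used by the simulation software: the golden-spiral placement of test points, the number of points, the probe radius of 1.4 Å and the atomic radii are choices made for the example.

```python
import math

GAMMA = 0.0054  # kcal/mol/A^2, the surface tension value quoted in the text


def sphere_points(n):
    """Return n roughly uniform points on the unit sphere (golden spiral)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        rho = math.sqrt(1.0 - z * z)
        points.append((rho * math.cos(golden_angle * k),
                       rho * math.sin(golden_angle * k), z))
    return points


def atomic_sasa(coords, radii, probe=1.4, n_points=200):
    """Per-atom solvent-accessible areas by counting exposed test points.

    A test point on the probe-expanded sphere of atom i is buried if it lies
    inside the probe-expanded sphere of any other atom.
    """
    unit = sphere_points(n_points)
    areas = []
    for i, (ci, ri) in enumerate(zip(coords, radii)):
        big_r = ri + probe
        exposed = 0
        for u in unit:
            q = tuple(c + big_r * x for c, x in zip(ci, u))
            buried = any(j != i and math.dist(q, cj) < rj + probe
                         for j, (cj, rj) in enumerate(zip(coords, radii)))
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r * big_r * exposed / n_points)
    return areas


def nonpolar_energy(coords, radii):
    """Non-polar solvation free energy with a single surface tension: gamma * A."""
    return GAMMA * sum(atomic_sasa(coords, radii))


# Two hypothetical carbon-like atoms (radius 1.7 A) at bonding distance.
print(nonpolar_energy([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], [1.7, 1.7]))
```

Because part of each expanded sphere is buried by the neighbouring atom, the energy of the pair is lower than twice that of an isolated atom, which is how burial of surface upon folding enters the non-polar term.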

3.2.5 MD: Advances and Applications

Molecular Dynamics (MD) simulations are powerful tools for understanding the physical principles underlying the structure and function of proteins.


is possible to remove or alter specific contributions in order to understand the role of a given property (Simonson, 2002).

There are three ways of applying simulations: in the first, simulation is used as a means of sampling configuration space; in the second, simulations are used to obtain information about the system at equilibrium, including structural and motional properties and thermodynamic parameters; in the third, simulations are used to examine the actual dynamics. In the latter case an appropriate sampling of configuration space with Boltzmann weighting is required, while for the first two Monte Carlo simulations can be used as well as molecular dynamics (Karplus, 2002).

Thanks to the broad range of programs and tools and to the increasing computing power available for simulation studies, the number of simulations used to study biomolecular properties has increased in recent years. The first simulations were less than 10 ps in length, but current simulations are often 50,000 times as long (500 ns).

MD simulations are often used in conjunction with experimental techniques in order to provide a mechanistic explanation, down to the atomic scale, of experimental results.

Perhaps the most powerful technique for the study of the dynamics and thermodynamics of biomolecules in solution is Nuclear Magnetic Resonance (NMR), which is in a symbiotic relation with MD simulations (Fischer, 1999).

As an example, in recent years simulations have shown the importance of residual motions to the entropy of binding of internal waters, using BPTI as a model (Wrabl, 2000; Karplus, 2002).

One of the most relevant uses of MD simulations is the study of the effect of variants on the thermodynamic stability of proteins. Different studies have focused not only on the protein folding problem and conformational analyses but also on molecular dynamics simulations of wild-type and mutant proteins, e.g. for β2-microglobulin (Brancolini, 2018; Ma, 2003). Simulations can reveal how variants lead to functional, structural and dynamic alterations of biomolecules. In addition, MD can provide details in the clinical and diagnostic field, where the prediction and interpretation of novel variants is studied using prediction algorithms such as CADD, MetaLR and SIFT. In this case, thanks to molecular dynamics simulations, it is possible to estimate ∆∆G_fold, the change in folding stability upon mutation (Zimmermann, 2017).


It is worth noting that MD methods have helped to improve the accuracy of the results obtained with the classical prediction algorithms used in medical genetics and diagnostics. Thanks to the amount of data and variants (SNPs) obtained by NGS methods and to the approaches developed to differentiate pathogenic from neutral variants, it is possible to improve target-based therapies, as described in the literature (Kumar, 2014). Molecular dynamics plays an important role in this field because it can determine the molecular phenotypic effects of point mutations (Purohit, 2011; Rajendran, 2012), as illustrated in Figure 3.9, where the effect of the G325W mutation in the protein Aurora-A, which was predicted to be associated with multiple cancers such as neurofibrosarcoma, pancreatic cancer and Li-Fraumeni syndrome, was analysed by MD simulations with a 200 ns trajectory to rationalize results obtained from computational SNP prediction techniques (Kumar, 2014).

FIGURE 3.8: In Silico Modeling of the wild-type and p.Arg217Cys and p.Arg27His mutation with Significant Structural Changes within the SLC25A24 Transmembrane Domain
