Long-Short Term Memories for the
identification of helical moieties in proteins
Anna Visibelli
March 27, 2019
Department of Information Engineering and Mathematics, University of Siena
Introduction-1
● Changes in the structure lead to changes in the function.
Introduction-2
Less than 0.2% of the sequenced proteins have been resolved by
X-ray crystallography and NMR spectroscopy.
Introduction-3
Efficiently predicting the occurrence of secondary structure motifs can constitute an alternative way towards the prediction of the 3D native structure.
RECURRENT NEURAL NETWORK
LSTM NEURAL NETWORK
Proteins
Examples: hormones, enzymes, transport proteins, receptors, motor proteins
Biological Background: Proteins composition
Biological Background: Peptide chain
● Each protein consists of one or more polypeptide chains made up of amino acids.
● The chemical properties and order of the amino acids are fundamental in determining the structure and the
function of the protein.
Biological Background: Protein structures
Primary structure:
Sequence of amino acids
Secondary structure:
Motif formed from the interactions between atoms of the backbone
Tertiary structure:
Motif formed from the interactions between
the side chains
Quaternary structure:
Protein consisting of more than one amino
acid chain
Biological Background: Alpha-Helix
● In an α-helix, the carbonyl group of one amino acid is hydrogen-bonded to the amino group of the amino acid four residues downstream in the chain.
● This bonding pattern pulls the polypeptide chain into a helical structure that resembles a curled ribbon, with each turn of the helix containing 3.6 amino acids.
Statistical analysis
● Dataset taken from the CATH database
● Three sets of protein domains:
○ Mainly alpha
○ Mainly beta
○ Alpha-beta
● DSSP algorithm
○ Each amino acid is assigned to a specific secondary structure, based on backbone dihedral angles and hydrogen bonds.
Statistical analysis: Dataset creation
Number of extracted helices per domain class:

                Alpha domains   Beta domains   Alpha-Beta domains     Tot
Total Helices           11592           1864                25776   39347
>8, (1)                  8343            631                16846   25873
>8, (2)                  3818            328                 8945   13094
>8, (3)                  1220            121                 2906    4249

(1)(2)(3) are the number of residues outside the helix.
Statistical analysis: Length distribution
The total numbers of helices extracted were 4126 and 12763 for the 3H and 2H data, respectively.
Statistical analysis: Residue propensity value
Residue propensity value: P(a, s) = a_s / n_p, where
● a_s = number of residues of type a in position s
● n_p = dataset cardinality
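The propensity computation above can be sketched as follows; the function name and the toy fragments are hypothetical, used only to illustrate the a_s / n_p ratio over an aligned dataset:

```python
def propensity(sequences, position, residue):
    """Residue propensity: the count of `residue` at `position` across the
    aligned sequences (a_s), divided by the dataset cardinality (n_p)."""
    n_p = len(sequences)                                      # n_p
    a_s = sum(1 for seq in sequences if seq[position] == residue)  # a_s
    return a_s / n_p

# Toy aligned dataset of helix fragments (hypothetical sequences)
data = ["AELKA", "AGLKV", "SELKA", "AELRA"]
print(propensity(data, 0, "A"))  # 0.75: Ala appears at position 0 in 3 of 4
```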
Artificial Neural Network
Artificial Neural Network: structure
(Diagrams: artificial neural network architecture and single-neuron computation)
Activation Functions
● Sigmoid: takes a real-valued input and squashes it to range between 0 and 1
σ(x) = 1 / (1 + exp(−x))
● tanh: takes a real-valued input and squashes it to the range [-1, 1]
tanh(x) = 2σ(2x) − 1
● ReLU: takes a real-valued input and thresholds it at zero
f(x) = max(0, x)
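The three activation functions above, including the tanh identity tanh(x) = 2σ(2x) − 1, can be written directly from their definitions (a minimal sketch, not tied to any particular framework):

```python
import math

def sigmoid(x):
    """Squashes a real-valued input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes a real-valued input into (-1, 1), via 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    """Thresholds a real-valued input at zero."""
    return max(0.0, x)

print(sigmoid(0.0))  # 0.5
print(relu(-3.0))    # 0.0
```

The tanh definition here matches `math.tanh` up to floating-point error, confirming the identity.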
Predictive Model: Recurrent Neural Network
● Suited to deal with sequential data.
● Store the information related to past elements of a sequence in a local memory, called a state.
● Tend to forget long-term dependencies
○ LSTMs were introduced to overcome this problem.
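The "state" idea can be illustrated with a single vanilla-RNN update step; all dimensions and weights here are toy values chosen for the sketch, not the model used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 5-dim inputs (e.g. residue embeddings), 8-dim state
n_in, n_state = 5, 8
W_x = rng.normal(scale=0.1, size=(n_state, n_in))     # input weights
W_h = rng.normal(scale=0.1, size=(n_state, n_state))  # recurrent weights
b = np.zeros(n_state)

def rnn_step(h, x):
    """One recurrent update: the state h carries information from the past."""
    return np.tanh(W_x @ x + W_h @ h + b)

h = np.zeros(n_state)                   # initial (empty) state
sequence = rng.normal(size=(14, n_in))  # e.g. a 14-residue embedded sequence
for x in sequence:
    h = rnn_step(h, x)
print(h.shape)  # (8,): the final state summarises the whole sequence
```

Because the same W_h is multiplied in at every step, gradients through long sequences shrink or explode, which is what motivates the gated LSTM cell.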
Predictive Model: Implementation
● INPUT: how to represent amino acid sequences?
● A traditional way of representing categorical data is one-hot encoding, i.e. a vector with all elements equal to 0 except a single element equal to 1.
Word2Vec embedding
● One-hot encoder → too sparse (dimension 20)
● Word2Vec word embedding → dense representation (dimension 5)
● Amino acids that frequently appear close together in the sequence are projected to nearby points in the reduced space.
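The dimensionality gap between the two representations can be made concrete; the random table here is only a stand-in for a trained Word2Vec model:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """One-hot vector: 20 entries, a single 1 at the residue's index."""
    v = np.zeros(len(AMINO_ACIDS))
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

print(one_hot("A").shape)  # (20,): sparse, one non-zero entry
print(one_hot("A").sum())  # 1.0

# A trained Word2Vec model would instead provide a dense lookup table of
# shape (20, 5); random values stand in for the learned embeddings here.
embedding = np.random.default_rng(0).normal(size=(20, 5))
print(embedding[AMINO_ACIDS.index("A")].shape)  # (5,): dense
```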
● We applied Word2vec to our dataset.
● Because sequences vary in length from 14 to 32 residues, each was represented by a vector of 70 to 160 entries (from 14×5 to 32×5 real values).
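The flattening from a sequence of 5-dimensional embeddings to a single vector is a one-line reshape; the `encode` helper and its random embeddings are hypothetical stand-ins for the trained Word2Vec lookup:

```python
import numpy as np

dim = 5  # embedding dimension used in this work
rng = np.random.default_rng(0)

def encode(n_residues):
    """Stand-in for Word2Vec: embed each residue into `dim` values,
    then flatten the whole sequence into one vector."""
    embedded = rng.normal(size=(n_residues, dim))  # hypothetical embeddings
    return embedded.reshape(-1)

print(encode(14).shape)  # (70,): shortest sequences
print(encode(32).shape)  # (160,): longest sequences
```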
Dataset
● The dataset is composed of 3967 3H helices and 7090 non-helix sequences.
● The aim is to demonstrate that elements at the extremes of a motif are sufficient to predict the secondary structure of an amino acid sequence.
(Results shown for both the W2V encoding and the one-hot encoding)
Predictive Model: MultiLayer Perceptron
3D input reshape
● LSTMs need a three-dimensional input. Therefore, a reshape function was used to cast the input dataset into a three-dimensional array:
○ Samples: cardinality of the training set
○ Timesteps: length of each sequence
○ Features: dimensionality of each sequence element
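The reshape described above can be sketched with NumPy; the sizes (100 sequences of 14 residues with 5-dimensional embeddings) are illustrative, not the actual dataset:

```python
import numpy as np

# Toy flat dataset: 100 sequences, each 14 residues x 5 embedding values
n_samples, timesteps, features = 100, 14, 5
flat = np.zeros((n_samples, timesteps * features))  # shape (100, 70)

# LSTMs expect (samples, timesteps, features)
lstm_input = flat.reshape(n_samples, timesteps, features)
print(lstm_input.shape)  # (100, 14, 5)
```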
Predictive Model: LSTM Neural Network
Results comparison
The performances of the two approaches are compared:
The LSTM gives more accurate results than the MLP model for the prediction of a complete helix sequence.
The accuracy worsens when only the external parts of the helices are considered, but the predictions still remain close to the state of the art, suggesting that there are specific patterns which influence the formation and the interruption of a helix.
Discussion & Conclusions
● The obtained experimental results demonstrate the power of machine learning techniques for predicting protein structural features.
● LSTMs can well interpret the sequential nature of protein data and focus on particularly conserved pieces of information which are fundamental in the formation of secondary structures.
● Future works:
○ Predict helices by analysing a complete protein sequence.
○ Extend the method to the prediction of β-sheets and U-turns.
○ https://tedxbeaconstreet.com/videos/cancer-alzheimers-protein-origami/