Long-Short Term Memories for the
identification of helical moieties in proteins
Anna Visibelli
March 27, 2019
Department of Information Engineering and Mathematics, University of Siena
Introduction-1
● Changes in the structure lead to changes in the function.
Introduction-2
Less than 0.2% of the sequenced proteins have been resolved by
X-ray crystallography and NMR spectroscopy.
Introduction-3
Efficiently predicting the occurrence of secondary structure motifs can constitute an alternative way towards the prediction of the 3D native structure.
RECURRENT NEURAL NETWORK
LSTM NEURAL NETWORK
Proteins
Examples: hormones, enzymes, transport proteins, receptors, motor proteins
Biological Background: Proteins composition
Biological Background: Peptide chain
● Each protein consists of one or more polypeptide chains made up of amino acids.
● The chemical properties and order of the amino acids are fundamental in determining the structure and the
function of the protein.
Biological Background: Protein structures
Primary structure:
Sequence of amino acids
Secondary structure:
Motif formed from the interactions between atoms of the backbone
Tertiary structure:
Motif formed from the interactions between
the side chains
Quaternary structure:
Protein consisting of more than one amino
acid chain
Biological Background: Alpha-Helix
● In an α-helix, the carbonyl group of one amino acid is hydrogen-bonded to the amino group of the amino acid four residues downstream in the chain.
● This bonding pattern pulls the polypeptide chain into a helical structure that resembles a curled ribbon, with each turn of the helix containing 3.6 amino acids.
Statistical analysis
● Dataset taken from the CATH database
● Three sets of protein domains:
○ Mainly alpha
○ Mainly beta
○ Alpha-beta
● DSSP algorithm
○ Each amino acid is assigned to a specific secondary structure, based on backbone dihedral angles and hydrogen bonds.
Statistical analysis: Dataset creation
Number of extracted helices per domain class:

                Alpha domains   Beta domains   Alpha-Beta domains     Tot
Total Helices           11592           1864                25776   39347
>8, (1)                  8343            631                16846   25873
>8, (2)                  3818            328                 8945   13094
>8, (3)                  1220            121                 2906    4249

(1)(2)(3) are the number of residues outside the helix.
Statistical analysis: Length distribution
The total numbers of helices extracted were 4126 and 12763 for the 3H and 2H data, respectively.
Statistical analysis: Residue propensity value
Residue propensity value: P(a, s) = a_s / n_p, where
● a_s = number of residues of type a in position s
● n_p = dataset cardinality
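The propensity computation above can be sketched as follows; the function name and the toy fragments are hypothetical, used only to illustrate the a_s / n_p ratio over an aligned dataset:

```python
def propensity(sequences, position, residue):
    """Residue propensity: the count of `residue` at `position` across the
    aligned sequences (a_s), divided by the dataset cardinality (n_p)."""
    n_p = len(sequences)                                      # n_p
    a_s = sum(1 for seq in sequences if seq[position] == residue)  # a_s
    return a_s / n_p

# Toy aligned dataset of helix fragments (hypothetical sequences)
data = ["AELKA", "AGLKV", "SELKA", "AELRA"]
print(propensity(data, 0, "A"))  # 0.75: Ala appears at position 0 in 3 of 4
```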
Artificial Neural Network
Artificial Neural Network: structure
(Diagrams: artificial neural network architecture and single-neuron computation)
Activation Functions
● Sigmoid: takes a real-valued input and squashes it to range between 0 and 1
σ(x) = 1 / (1 + exp(−x))
● tanh: takes a real-valued input and squashes it to the range [-1, 1]
tanh(x) = 2σ(2x) − 1
● ReLU: takes a real-valued input and thresholds it at zero
f(x) = max(0, x)
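The three activation functions above, including the tanh identity tanh(x) = 2σ(2x) − 1, can be written directly from their definitions (a minimal sketch, not tied to any particular framework):

```python
import math

def sigmoid(x):
    """Squashes a real-valued input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes a real-valued input into (-1, 1), via 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    """Thresholds a real-valued input at zero."""
    return max(0.0, x)

print(sigmoid(0.0))  # 0.5
print(relu(-3.0))    # 0.0
```

The tanh definition here matches `math.tanh` up to floating-point error, confirming the identity.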
Predictive Model: Recurrent Neural Network
● Suited to deal with sequential data.
● Store the information related to past elements of a sequence in a local memory, called a state.
● Tend to forget long-term dependencies
○ LSTMs were introduced to overcome this problem.
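The "state" idea can be illustrated with a single vanilla-RNN update step; all dimensions and weights here are toy values chosen for the sketch, not the model used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 5-dim inputs (e.g. residue embeddings), 8-dim state
n_in, n_state = 5, 8
W_x = rng.normal(scale=0.1, size=(n_state, n_in))     # input weights
W_h = rng.normal(scale=0.1, size=(n_state, n_state))  # recurrent weights
b = np.zeros(n_state)

def rnn_step(h, x):
    """One recurrent update: the state h carries information from the past."""
    return np.tanh(W_x @ x + W_h @ h + b)

h = np.zeros(n_state)                   # initial (empty) state
sequence = rng.normal(size=(14, n_in))  # e.g. a 14-residue embedded sequence
for x in sequence:
    h = rnn_step(h, x)
print(h.shape)  # (8,): the final state summarises the whole sequence
```

Because the same W_h is multiplied in at every step, gradients through long sequences shrink or explode, which is what motivates the gated LSTM cell.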
Predictive Model: Implementation
● INPUT: how to represent amino acid sequences?
● A traditional way of representing categorical data is one-hot encoding, i.e. a vector with all elements equal to 0 except a single element equal to 1.
Word2Vec embedding
● One-hot encoder → too sparse (dimension 20)
● Word2Vec word embedding → dense representation (dimension 5)
● Amino acids that frequently appear close together in the sequence are projected to nearby points in the reduced space.
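The dimensionality gap between the two representations can be made concrete; the random table here is only a stand-in for a trained Word2Vec model:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """One-hot vector: 20 entries, a single 1 at the residue's index."""
    v = np.zeros(len(AMINO_ACIDS))
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

print(one_hot("A").shape)  # (20,): sparse, one non-zero entry
print(one_hot("A").sum())  # 1.0

# A trained Word2Vec model would instead provide a dense lookup table of
# shape (20, 5); random values stand in for the learned embeddings here.
embedding = np.random.default_rng(0).normal(size=(20, 5))
print(embedding[AMINO_ACIDS.index("A")].shape)  # (5,): dense
```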
● We applied Word2vec to our dataset.
● Because sequences vary in length from 14 to 32 residues, each was represented by a vector of 70 to 160 entries (from 14×5 to 32×5 real values).
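The flattening from a sequence of 5-dimensional embeddings to a single vector is a one-line reshape; the `encode` helper and its random embeddings are hypothetical stand-ins for the trained Word2Vec lookup:

```python
import numpy as np

dim = 5  # embedding dimension used in this work
rng = np.random.default_rng(0)

def encode(n_residues):
    """Stand-in for Word2Vec: embed each residue into `dim` values,
    then flatten the whole sequence into one vector."""
    embedded = rng.normal(size=(n_residues, dim))  # hypothetical embeddings
    return embedded.reshape(-1)

print(encode(14).shape)  # (70,): shortest sequences
print(encode(32).shape)  # (160,): longest sequences
```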
Dataset
● The dataset is composed of 3967 3H helices and 7090 non-helix sequences.
● The aim is to demonstrate that elements at the extremes of a motif are sufficient to predict the secondary structure of an amino acid sequence.
(Results shown for both the W2V encoding and the one-hot encoding)
Predictive Model: MultiLayer Perceptron
3D input reshape
● LSTMs need a three-dimensional input. Therefore, a reshape function was used to cast the input dataset into a three-dimensional array:
○ Samples: cardinality of the training set
○ Timesteps: length of each sequence
○ Features: dimensionality of each sequence element
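The reshape described above can be sketched with NumPy; the sizes (100 sequences of 14 residues with 5-dimensional embeddings) are illustrative, not the actual dataset:

```python
import numpy as np

# Toy flat dataset: 100 sequences, each 14 residues x 5 embedding values
n_samples, timesteps, features = 100, 14, 5
flat = np.zeros((n_samples, timesteps * features))  # shape (100, 70)

# LSTMs expect (samples, timesteps, features)
lstm_input = flat.reshape(n_samples, timesteps, features)
print(lstm_input.shape)  # (100, 14, 5)
```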
Predictive Model: LSTM Neural Network
Results comparison
The performances of the two approaches are compared:
The LSTM gives more accurate results than the MLP model for the prediction of a complete helix sequence.
The accuracy worsens when only the external parts of the helices are considered, but the predictions still remain close to the state of the art, suggesting that there are specific patterns which influence the formation and the interruption of a helix.
Discussion & Conclusions
● The obtained experimental results demonstrate the power of machine learning techniques for predicting protein structural features.
● LSTMs can well interpret the sequential nature of protein data and focus on particularly conserved pieces of information which are fundamental in the formation of secondary structures.
● Future works:
○ Predict helices by analysing a complete protein sequence.
○ Extend the method to the prediction of β-sheets and U-turns.
○ https://tedxbeaconstreet.com/videos/cancer-alzheimers-protein-origami/