
Long-Short Term Memories for the identification of helical moieties in proteins


Academic year: 2021


Anna Visibelli

March 27, 2019

Department of Information Engineering and Mathematics, University of Siena

(2)

Introduction-1


● Changes in the structure lead to changes in the function.

(4)

Introduction-2

Less than 0.2% of the sequenced proteins have been resolved by X-ray crystallography and NMR spectroscopy.

[Figure panels: NMR spectroscopy; X-ray crystallography]

(5)

Introduction-3

Efficiently predicting the occurrence of secondary structure motifs can constitute an alternative way towards the prediction of the 3D native structure.

(6)

Introduction-3

RECURRENT NEURAL NETWORK

LSTM NEURAL NETWORK

(8)

Proteins

[Figure panels: hormone; enzyme; transport protein; receptor; motor protein]

(9)

Biological Background: Proteins composition

(10)

Biological Background: Peptide chain

Each protein consists of one or more polypeptide chains made up of amino acids.

The chemical properties and order of the amino acids are fundamental in determining the structure and the function of the protein.

(11)

Biological Background: Protein structures

Primary structure: sequence of amino acids

Secondary structure: motif formed by the interactions between atoms of the backbone

Tertiary structure: motif formed by the interactions between the side chains

Quaternary structure: protein consisting of more than one amino acid chain

(12)

Biological Background: Alpha-Helix

In an α-helix, the carbonyl group of one amino acid is hydrogen bonded to the amino group of the amino acid four positions downstream in the chain.

This pattern of bonding pulls the polypeptide chain into a helical structure that resembles a curled ribbon, with each turn of the helix containing 3.6 amino acids.
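The bonding pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names and the 0-based residue indexing are assumptions made for the example, not part of the original slides.

```python
def helix_hbond_pairs(n_residues, offset=4):
    """List (i, i + offset) residue index pairs (0-based).

    The carbonyl group of residue i is bonded to the amino group of
    residue i + offset; offset=4 is the distance given in the text.
    """
    return [(i, i + offset) for i in range(n_residues - offset)]


def helix_turns(n_residues, residues_per_turn=3.6):
    """Approximate number of turns, using 3.6 residues per turn."""
    return n_residues / residues_per_turn


pairs = helix_hbond_pairs(10)   # bonds inside a 10-residue helix
turns = helix_turns(18)         # an 18-residue helix spans about 5 turns
```

A helix of n residues therefore carries n − 4 such backbone hydrogen bonds in this idealised picture.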

(13)

Statistical analysis

● Take a dataset (from the CATH database)

● Three sets of proteins, one from each class:

○ Mainly alpha

○ Mainly beta

○ Alpha-beta

● DSSP algorithm

○ Amino acids are assigned to a specific secondary structure based on backbone dihedral angles and hydrogen bonds.

(14)

Statistical analysis: Dataset creation

Number of extracted helices per domain

                 Alpha domains   Beta domains   Alpha-Beta domains   Tot
Total helices    11592           1864           25776                39347
>8, (1)          8343            631            16846                25873
>8, (2)          3818            328            8945                 13094
>8, (3)          1220            121            2906                 4249

(1), (2), (3) are the number of residues outside the helix
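One way such helices could be pulled out of a DSSP assignment is sketched below. It rests on several assumptions that the slides do not confirm: the DSSP output is taken as a per-residue string in which `H` marks an α-helix, ">8" is read as "longer than 8 residues", and the 1, 2 or 3 residues outside the helix are taken on both sides. The function name `extract_helices` and the toy sequence are hypothetical.

```python
import re


def extract_helices(sequence, dssp, min_len=9, flank=1):
    """Return helix segments with `flank` extra residues on each side.

    sequence: amino acid string; dssp: per-residue DSSP codes of the
    same length (assumed format, 'H' = alpha-helix).
    Helices shorter than min_len, or too close to a chain end to have
    `flank` residues on both sides, are skipped.
    """
    if len(sequence) != len(dssp):
        raise ValueError("sequence and DSSP string must have equal length")
    segments = []
    for match in re.finditer(r"H+", dssp):   # maximal runs of 'H'
        start, end = match.span()
        if end - start < min_len:
            continue
        lo, hi = start - flank, end + flank
        if lo < 0 or hi > len(sequence):
            continue
        segments.append(sequence[lo:hi])
    return segments


# Toy data, invented for the example.
seq = "GSAEELLKKAEELGGPAAAG"
ss = "---HHHHHHHHHH---HHH-"
one_flank = extract_helices(seq, ss, flank=1)
three_flank = extract_helices(seq, ss, flank=3)
```

Here only the first helix (10 residues) passes the length filter; the second one (3 residues) is discarded.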

(15)

Statistical analysis: Length distribution

The total number of helices extracted was 4126 and 12763, for 3H and 2H data respectively.

(16)

Statistical analysis: Residue propensity value

Residue propensity value:

● a_s = number of residues of type a in position s

● n_p = dataset cardinality
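The formula combining these two quantities did not survive in the text. The sketch below assumes the simplest reading, a per-position frequency a_s / n_p; that ratio is an assumption and may differ from the definition actually used. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def residue_propensity(sequences, position):
    """Assumed propensity: a_s / n_p for every residue type a.

    a_s = number of sequences with residue a at `position`
    n_p = number of sequences in the dataset
    """
    n_p = len(sequences)
    counts = Counter(seq[position] for seq in sequences if position < len(seq))
    return {aa: count / n_p for aa, count in counts.items()}


# Toy dataset, invented for the example.
toy = ["AEL", "AKL", "GEL", "AEV"]
prop_first = residue_propensity(toy, 0)
```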

(21)

Artificial Neural Network

(22)

Artificial Neural Network: structure

● Artificial Neural Network

● Neuron computation

(23)

Activation Functions

Sigmoid: takes a real-valued input and squashes it to the range between 0 and 1

σ(x) = 1 / (1 + exp(−x))

tanh: takes a real-valued input and squashes it to the range between −1 and 1

tanh(x) = 2σ(2x) − 1

ReLU: takes a real-valued input and thresholds it at zero

f(x) = max(0, x)
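The three activation functions can be written directly from their formulas. The sketch below also checks numerically that the identity tanh(x) = 2σ(2x) − 1 agrees with the standard library's `math.tanh`. It is illustrative only and not numerically hardened (very large negative inputs would overflow `math.exp`).

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh(x) = 2 * sigma(2x) - 1, output between -1 and 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """f(x) = max(0, x): negative inputs are thresholded at zero."""
    return max(0.0, x)


# Largest gap between the identity and math.tanh on a few sample points.
max_gap = max(abs(tanh_via_sigmoid(x) - math.tanh(x)) for x in (-2.0, -0.5, 0.0, 0.7, 3.0))
```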

(24)

Predictive Model: Recurrent Neural Network

● Suited to dealing with sequential data.

● Store the information related to past elements of a sequence in a local memory, called a state.

● Tend to forget long-term dependencies

○ LSTMs were introduced to overcome this problem.
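The idea of a state that carries information from past elements can be illustrated with a plain (non-LSTM) recurrent update. This is a generic textbook-style sketch using NumPy, not the model from the slides: the update rule h_t = tanh(W_x x_t + W_h h_{t-1} + b), the sizes and the random weights are all assumptions made for the example.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple recurrent layer over a sequence.

    inputs: array of shape (timesteps, n_in)
    Returns the state after every timestep, shape (timesteps, n_hidden).
    """
    h = np.zeros(W_h.shape[0])          # initial state: no memory yet
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
timesteps, n_in, n_hidden = 14, 5, 8     # illustrative sizes only
inputs = rng.normal(size=(timesteps, n_in))
W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)
states = rnn_forward(inputs, W_x, W_h, b)
```

Because the state is repeatedly passed through the same weights and a squashing function, the influence of early inputs tends to fade, which is the long-term dependency problem that LSTMs address with gated memory cells.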

(25)

Predictive Model: Implementation

● INPUT: how to represent amino acid sequences?

● A traditional way of representing categorical data is the one-hot encoding, i.e. by a vector with all elements equal to 0 except one, of value 1.

One-hot encoder → Too sparse (Dimension 20)
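A minimal sketch of one-hot encoding for an amino acid sequence follows. The alphabet ordering (the 20 standard one-letter codes in alphabetical order) and the function name are assumptions; any fixed ordering would work.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues, assumed order


def one_hot(sequence):
    """Encode each residue as a 20-entry vector with a single 1."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[index[aa]] = 1
        vectors.append(vec)
    return vectors


encoded = one_hot("ACD")
```

Each residue costs 20 entries, 19 of which are zero, which is the sparsity noted above.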

(26)

Word2Vec embedding

● Word embedding technique: Word2Vec

→ Dense representation (Dimension 5)

Amino acids that frequently appear close together in the sequences are projected to nearby points in the reduced space.

(27)

Word2Vec embedding

● We applied Word2Vec to our dataset.

● Because the sequences vary in length from 14 to 32 residues, each of them was represented as a vector of 70 to 160 entries (from 14x5 to 32x5 real values).
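The size arithmetic above (14x5 = 70, 32x5 = 160) can be checked with a small sketch. The embedding table here is filled with random numbers as a stand-in for trained Word2Vec vectors, so the values are purely hypothetical; only the shapes are meaningful.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 5                      # embedding dimension stated in the slides

random.seed(0)
# Placeholder table: random vectors standing in for learned embeddings.
embedding = {
    aa: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for aa in AMINO_ACIDS
}


def embed_flat(sequence):
    """Concatenate the 5-dimensional vectors of all residues."""
    flat = []
    for aa in sequence:
        flat.extend(embedding[aa])
    return flat


shortest = embed_flat("A" * 14)                              # 14 residues
longest = embed_flat("ACDEFGHIKLMNPQRSTVWY" + "ACDEFGHIKLMN")  # 32 residues
```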

(28)

Dataset

The dataset is composed of 3967 3H helices and 7090 non-helix sequences.

The aim is to demonstrate that the elements at the extremes of a motif are sufficient to predict the secondary structure of an amino acid sequence.

[Figure panels: W2V encoding; one-hot encoding]

(29)

Predictive Model: MultiLayer Perceptron

(30)

3D input reshape

LSTMs need a three-dimensional input.

Therefore, a reshape function was used to turn the input dataset into a three-dimensional array with the following axes:

Sample: cardinality of the training set

Timestep: length of each sequence

Feature: dimensionality of each sequence element
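A reshape of this kind might look as follows in NumPy. The sketch assumes that every sequence has already been brought to a common length of 32 timesteps (for example by padding), which the slides do not state, and uses a tiny made-up dataset of 4 samples with 5 features per residue.

```python
import numpy as np

n_samples, timesteps, features = 4, 32, 5    # illustrative sizes only

# Flat dataset: one row of timesteps * features values per sequence.
flat = np.arange(n_samples * timesteps * features, dtype=float)
flat = flat.reshape(n_samples, timesteps * features)

# 3D array expected by an LSTM layer: (samples, timesteps, features).
cube = flat.reshape(n_samples, timesteps, features)

# With NumPy's default row-major order, the second residue of the first
# sequence corresponds to entries 5..9 of the first flat row.
second_residue = cube[0, 1]
```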

(31)

Predictive Model: LSTM Neural Network

(32)

Results comparison

The performance of the two approaches was compared.

The LSTM gives more accurate results than the MLP model in predicting a complete helix sequence.

Accuracy worsens when only the external parts of the helices are considered, but the predictions remain close to the state of the art and suggest that specific patterns influence the formation and the interruption of a helix.

(33)

Discussion & Conclusions

● The experimental results demonstrate the power of machine learning techniques for predicting protein structural features.

● LSTMs capture the sequential nature of protein data well and focus on particularly conserved pieces of information that are fundamental to the formation of secondary structures.

● Future work:

○ Predict helices by analysing a complete protein sequence.

○ Extend the method to the prediction of β-sheets and U-turns.

○ https://tedxbeaconstreet.com/videos/cancer-alzheimers-protein-origami/
