Siena, April 2016
Machine learning applied to vaccine research
Alessandro Brozzi, PhD
Exploratory Data Analytics
Data Science & Clinical Systems
GSK in a nutshell
Prescription medicines
Main areas:
• Respiratory
• Cardiovascular – metabolic
• Immunology and infectious diseases
• Central nervous system
• Dermatology
• Urology
• HIV
• Rare diseases
Vaccines
World leader in disease prevention in childhood and adulthood.
Over 30 vaccines, including:
• diphtheria, tetanus, pertussis, hepatitis A+B, polio, Haemophilus influenzae, mumps, meningitis, rotavirus, HPV and flu
Consumer Healthcare
Main areas:
• dermatology (e.g. Physiogel)
• oral hygiene (e.g. Iodosan, Aquafresh)
• nutrition (e.g. Horlicks)
• OTC (e.g. Zovirax, NiQuitin)
4 billion
packs in 2014
860 million
doses of vaccines in 2014
18 billion
packs in 2014
Our global presence
Hamilton
Marietta
Ste Foy
Rixensart
Wavre
Marburg
Rosia/Siena
Singapore
Shanghai Tian Yuan (JV)
Nashik
Ankleshwar
Gödöllő
Saint-Amand-Les-Eaux
Dresden
Moscow
Rockville
R&D Hubs
Manufacturing Facilities
Research and Development
Siena
75 clinical
studies in 2014
R&D Center
generated some of the most innovative vaccines, including MenB
€1.3 billion
investment in R&D managed from
Siena between 2006 and 2015
182 people in
Research
163 in
Development
Introduction
In scope:
- present the biological problem to be addressed by ML
- present results of a case study
Out of scope:
- the mathematical and theoretical aspects behind ML
- the formal comparisons between ML models
What is in scope and out of scope of this presentation
Essential bibliography
A vaccine convinces our immune system to treat a harmless substance as an invading pathogen
What is a vaccine?
Microorganisms: bacteria and viruses
What is an invading pathogen?
Penetration and multiplication
How can such a small organism cause harm?
Staphylococcus aureus
tonsillitis
Immune system cells and antibodies
How our body defends itself
Four most common types of vaccines
subunit vaccines
attenuated microorganisms
killed microorganisms
fractions of microorganisms
pathogen
harmless
Yes ok, but which subunit?
Car metaphor
mechanical pieces = proteins
> 2000 subunits
Experimental procedure to select subunit candidates
time: 5-15 years
other assays
Experimental standard procedure
experimental result
etc…
candidate
Main issues:
Time consuming
High costs
Pathogen specific
The advent of genomics
DNA sequence and protein information
In-silico pipeline
length
number of structures
localization
ATFLPRYNDIRQQFYHNFRGKW WCFCQNDMVQMEYRALIKSVAD YDMGLRSFKKTRGMHPMKQYYG LMEVMQQAYDAIECTSPSRDFG GFDICVRFAWEYKADAYMYAPK TEQIVLPTFN
hydrophobicity
other features
Bioinformatic programs
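As a rough illustration of how a program can turn a raw sequence into numeric features, the sketch below computes two of the simplest ones for the first fragment of the sequence shown above: its length and its mean hydropathy. The Kyte-Doolittle scale is an assumption chosen for the example, since the slides do not say which hydrophobicity measure or which programs were used. The function name `sequence_features` is likewise hypothetical.

```python
# Kyte-Doolittle hydropathy value for each amino acid (one-letter code).
# Assumption: the slides do not name the hydrophobicity scale actually used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(sequence):
    """Return length and mean hydropathy for one protein sequence."""
    # Ignore whitespace so that sequences wrapped over several blocks work.
    residues = [r for r in sequence.upper() if not r.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    unknown = sorted(set(residues) - set(KYTE_DOOLITTLE))
    if unknown:
        raise ValueError(f"unknown residues: {unknown}")
    mean_hydropathy = sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)
    return {"length": len(residues), "hydrophobicity": mean_hydropathy}


# First block of the sequence shown on the slide.
features = sequence_features("ATFLPRYNDIRQQFYHNFRGKW")
print(features)
```

Features such as localization or the number of structures need dedicated prediction tools and are not covered by this sketch.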
Data matrix
length   # helices   localization
100      3           membrane
150      0           membrane
30       1           nuclear
20       3           nuclear
experimental outcome
Independent variables
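A minimal sketch of how such a data matrix could be represented in code, assuming that the values on the slide pair up in the order they are listed (first length with first helix count and first localization, and so on). The categorical localization column is one-hot encoded so that every column is numeric. The outcome column is left out because its values are not shown on the slide. The names `proteins` and `build_matrix` are illustrative only.

```python
# Rows taken from the slide, assuming values pair up in listed order.
# The experimental outcome is not shown on the slide, so it is omitted.
proteins = [
    {"length": 100, "helices": 3, "localization": "membrane"},
    {"length": 150, "helices": 0, "localization": "membrane"},
    {"length": 30, "helices": 1, "localization": "nuclear"},
    {"length": 20, "helices": 3, "localization": "nuclear"},
]


def build_matrix(rows):
    """Turn feature dictionaries into column names plus a numeric matrix."""
    # One indicator column per localization category, in sorted order.
    categories = sorted({row["localization"] for row in rows})
    columns = ["length", "helices"] + [f"localization={c}" for c in categories]
    matrix = []
    for row in rows:
        one_hot = [1 if row["localization"] == c else 0 for c in categories]
        matrix.append([row["length"], row["helices"]] + one_hot)
    return columns, matrix


columns, X = build_matrix(proteins)
print(columns)
for row in X:
    print(row)
```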
Machine learning
Breiman, 2001
X (independent variables) → nature → Y (outcome)
X → unknown → Y
f(x): neural networks, SVM, random forests, naïve Bayes
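To make the idea of learning f(x) from examples concrete, here is a small k-nearest-neighbor classifier in plain Python. k-nearest neighbor is swapped in for brevity: it is not one of the four methods named on this slide, although it appears among the methods compared in the results. The training points are hypothetical toy values, not data from the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Approximate f(query) by majority vote among the k closest training rows."""
    # Pair each training row's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per protein, binary outcome.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
           [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, [0.82, 0.8], k=3)
print(prediction)
```

With binary labels, an odd k avoids tied votes.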
Case study
In silico study of vaccine candidates
University of Technology, Sydney
Dataset
• organisms of 4 different species
• 923 proteins with known experimental results to train the models
• 140 proteins with known experimental results to test the models
• 7 protein features
General characteristics
Data matrix
923 proteins (rows) × 7 features (columns)
Results
Single rule
Only numerical features
feature exp.
protein
Double rule
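The slides do not spell out how the two rules are defined. One plausible reading, taken here as an assumption, is that the single rule thresholds one numeric feature and the double rule requires two thresholds to be met at the same time. Requiring both conditions would select fewer proteins, which is at least consistent with the lower sensitivity and higher specificity reported for the double rule. The function names and thresholds below are hypothetical.

```python
def single_rule(value, threshold):
    """Call a protein a candidate when one numeric feature reaches a threshold."""
    return value >= threshold


def double_rule(value_a, threshold_a, value_b, threshold_b):
    """Call a protein a candidate only when two features both reach their thresholds."""
    return value_a >= threshold_a and value_b >= threshold_b


# Hypothetical feature values and thresholds, for illustration only.
print(single_rule(5.0, 3.0))
print(double_rule(5.0, 3.0, 1.0, 2.0))
```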
Machine learning algorithms
Results
method               sensitivity   specificity
single rule          0.96          0.73
double rule          0.43          0.97

method               sensitivity   specificity
neural networks      0.97          0.97
naïve Bayes          1.00          0.98
k-nearest neighbor   0.92          0.97
random forest        1.00          0.99
adaptive boosting    1.00          0.98
decision tree        1.00          0.97
SVM                  0.93          0.98
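For reference, sensitivity is the fraction of true positives that a method recovers, TP / (TP + FN), and specificity is the fraction of true negatives it correctly rejects, TN / (TN + FP). The sketch below computes both from binary labels. The label vectors are hypothetical toy values, not the study's predictions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```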
General overview
Conclusions
• Strong need to use information gathered in the past to guide future experiments
• Need for a general procedure applicable to every pathogen
• ML outperforms the basic rules: every ML method reached at least 0.92 sensitivity and 0.97 specificity, while each rule scored well on only one of the two
• A pool of algorithms might be a solution to increase efficiency
Issues:
• Commonly dealing with very wide matrices, with far more features than samples (p >> n)
• Heterogeneous input data: categorical and numerical
• Effect of noise and feature selection
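One simple way to pool several algorithms is majority voting over their predictions. This is only one possible interpretation of "a pool of algorithms", since the slides do not say how the pooling would be done. The model outputs below are hypothetical, and `majority_vote` is an illustrative name.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one voted label per protein."""
    n = len(predictions_per_model[0])
    if any(len(p) != n for p in predictions_per_model):
        raise ValueError("all models must predict the same number of proteins")
    combined = []
    # zip(*...) groups the labels that every model gave to the same protein.
    for labels in zip(*predictions_per_model):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical binary predictions from three models for four proteins.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]

pooled = majority_vote([model_a, model_b, model_c])
print(pooled)
```

Using an odd number of models with binary labels avoids tied votes.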
In the future
• Make the programs that infer protein features more precise
• Unify all available experimental data in a single repository