
Informed by Informatics?
Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.
Modelling in Chemistry
[Diagram: modelling methods arranged along two axes, from PHYSICS-BASED to EMPIRICAL and from ATOMISTIC to NON-ATOMISTIC, running from LOW THROUGHPUT theoretical chemistry to HIGH THROUGHPUT informatics. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, AM1/PM3 etc., Molecular Dynamics, Fluid Dynamics, DPD, Monte Carlo, Docking, CoMFA, 2-D QSAR/QSPR, Machine Learning. There are NO FIRM BOUNDARIES between theoretical chemistry and informatics!]
Informatics and Empirical Models
• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.
• The model relating inputs to outputs is an empirically parameterised equation or a more elaborate mathematical model.
• Do not attempt to simulate reality.
• Usually High Throughput.
QSPR
• Quantitative Structure–Property Relationship
• A physical property is related to more than one other variable
• Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s)
• Property–property relationships date from the 1860s
• General form (for non-linear relationships): y = f(descriptors)
QSPR
[Table: molecules 1–7, each with a measured property Y (Property 1–7) and descriptor values X1–X6.]
Y = f(X1, X2, ..., XN)
• Optimisation of Y = f(X1, X2, ..., XN) is called regression.
• The model is optimised on N "training" molecules and then tested on M "test" molecules.
QSPR
[Table: molecules 1–7 with property Y and descriptors X1–X6, as above.]
• Quality of the model is judged by three parameters:

$$ r^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right)^2}{\sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - \bar{y} \right)^2} $$

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right)^2 } $$

$$ \mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right) $$
QSPR
[Table: molecules 1–7 with property Y and descriptors X1–X6, as above.]
• Different methods for carrying out regression (a minimal contrast of the two families is sketched below):
• LINEAR – Multi-Linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.
• NON-LINEAR – Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
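A hedged sketch of the two families, assuming scikit-learn is available; the random descriptor matrix is purely illustrative, standing in for real molecular descriptors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # linear (MLR)
from sklearn.ensemble import RandomForestRegressor  # non-linear

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(670, 52)), rng.normal(size=670)
X_test = rng.normal(size=(330, 52))

mlr = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
y_mlr, y_rf = mlr.predict(X_test), rf.predict(X_test)
```

The 670/330 split and 52 descriptors echo the solubility study described later.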
QSPR
[Table: molecules 1–7 with property Y and descriptors X1–X6, as above.]
• However, this does not guarantee a good predictive model…
QSPR
[Table: molecules 1–7 with property Y and descriptors X1–X6, as above.]
• Problems with experimental error.
• QSPR is only as accurate as the data it is trained upon.
• Therefore, we need accurate experimental data.
QSPR
[Table: molecules 1–7 with property Y and descriptors X1–X6, as above.]
• Problems with "chemical space".
• "Sample" molecules must be representative of the "Population".
• Prediction results will be most accurate for molecules similar to the training set.
• Global or Local models?
1. Solubility
Dave Palmer
• Pfizer Institute for Pharmaceutical Materials Science
• http://www.msm.cam.ac.uk/pfizer
D.S. Palmer et al., J. Chem. Inf. Model. 47, 150-158 (2007)
D.S. Palmer et al., Molecular Pharmaceutics, ASAP article (2008)
Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry, so a good model for predicting the solubility of drug-like molecules would be very valuable.
Datasets
Phase 1 – Literature Data
• Compiled from the Huuskonen dataset and the AquaSol database – pharmaceutically relevant molecules
• All molecules solid at room temperature
• n = 1000 molecules
Phase 2 – Our Own Experimental Data
• Measured by Toni Llinàs using the CheqSol machine
• Pharmaceutically relevant molecules
• n = 135 molecules
Aqueous solubility: the thermodynamic solubility in unbuffered water at 25 ºC.
Diversity-Conserving Partitioning
• Full dataset: n = 1000 molecules
• Training set: n = 670 molecules
• Test set: n = 330 molecules
Partitioning used MACCS Structural Key fingerprints, the Tanimoto coefficient and the MaxMin algorithm (a sketch follows).
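A hedged sketch of this kind of diversity-conserving selection, assuming RDKit; the toy SMILES list is illustrative only (the real dataset had n = 1000 molecules, split 670/330):

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC",
          "CC(C)Cc1ccc(C)cc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]      # MACCS structural keys

# MaxMin picks a maximally diverse subset; for bit vectors the distance
# used is 1 - Tanimoto similarity.
picker = MaxMinPicker()
picked = picker.LazyBitVectorPick(fps, len(fps), 3)  # pick 3 of 5 here
print(list(picked))
```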
Diversity Analysis
Structures & Descriptors
• 3D structures from Concord
• Minimised with MMFF94
• MOE descriptors, 2D and 3D
• Separate analysis of 2D and 3D descriptors
• QuaSAR Contingency module (MOE)
• 52 descriptors selected
Random Forest
Machine Learning Method
Random Forest
• Introduced by Breiman and Cutler (2001)
• Development of Decision Trees (Recursive Partitioning):
  • Dataset is partitioned into consecutively smaller subsets
  • Each partition is based upon the value of one descriptor
  • The descriptor used at each split is selected so as to optimise splitting
  • Bootstrap sample of N objects chosen from the N available objects, with replacement
Random Forest
• Random Forest is a collection of decision or regression trees grown with the CART algorithm.
• Standard parameters:
  • 500 decision trees
  • No pruning back: minimum node size = 5
  • "mtry" descriptors (the square root of the total number of descriptors) tried at each split
• Important features:
  • Incorporates descriptor selection
  • Incorporates "out-of-bag" validation – using those molecules not in the bootstrap samples
(These parameters are sketched in code below.)
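A hedged sketch of those standard parameters, assuming scikit-learn rather than the original implementation; min_samples_leaf is used here as the closest analogue of the minimum node size:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,     # 500 decision trees
    min_samples_leaf=5,   # no pruning back; minimum node size = 5
    max_features="sqrt",  # "mtry" = sqrt(total number of descriptors)
    oob_score=True,       # out-of-bag validation on molecules left out
                          # of each bootstrap sample
)
# After rf.fit(X_train, y_train), rf.oob_score_ gives the OOB R^2.
```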
Random Forest for Solubility Prediction
A Forest of Regression Trees
• Dataset is partitioned into consecutively smaller subsets (of similar solubility)
• Each partition is based upon the value of one descriptor
• The descriptor used at each split is selected so as to minimise the MSE
Leo Breiman, "Random Forests", Machine Learning 45, 5-32 (2001).
Random Forest for Predicting Solubility
A Forest of Regression Trees
• Each tree is grown until terminal nodes contain a specified number of molecules
• No need to prune back
• High predictive accuracy
• Includes a method for descriptor selection
• No training problems – largely immune from overfitting
Random Forest: Solubility Results
• Training: RMSE(tr) = 0.27, r²(tr) = 0.98, Bias(tr) = 0.005
• Out-of-bag: RMSE(oob) = 0.68, r²(oob) = 0.90, Bias(oob) = 0.01
• Test: RMSE(te) = 0.69, r²(te) = 0.89, Bias(te) = −0.04
D.S. Palmer et al., J. Chem. Inf. Model. 47, 150-158 (2007)
Drug Discovery Today 10(4), 289 (2005)
Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?
• ΔGsub from lattice energy & an entropy term
• ΔGhydr from a semi-empirical solvation model
(i.e., different kinds of theoretical/computational methods)
… or this alternative cycle?
• ΔGsub from lattice energy (DMAREL) plus entropy
• ΔGsolv from SCRF using the B3LYP DFT functional
• ΔGtr from ClogP
(i.e., different kinds of theoretical/computational methods)
• Nice idea, but it didn't quite work – errors were larger than for QSPR
• "Why not add a correction factor to account for the difference between the theoretical methods?"
…
• Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors
• This regression equation gave r² = 0.77 and RMSE = 0.71
Solubility by Thermodynamic Cycle: Conclusions
• We have a hybrid part-theoretical, part-empirical method.
• An interesting idea, but relatively low throughput, as a crystal structure is needed.
• Slightly more accurate than pure QSPR for a drug-like set.
• Instructive to compare with the literature on theoretical solubility studies.
2. Bioactivity
Florian Nigsch
F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)
Feature Space – Chemical Space
[Diagram: each molecule is represented as a feature vector m = (f1, f2, …, fn); protein targets such as CDK1, CDK2, COX2 and DHFR occupy different regions of these feature spaces of high dimensionality.]
Properties of Drugs
• High affinity to protein target
• Soluble
• Permeable
• Absorbable
• High bioavailability
• Specific rate of metabolism
• Renal/hepatic clearance?
• Volume of distribution?
• Low toxicity
• Plasma protein binding?
• Blood-brain-barrier penetration?
• Dosage (once/twice daily?)
• Synthetic accessibility
• Formulation (important in development)
Multiobjective Optimisation
[Diagram: a drug must simultaneously satisfy multiple objectives – bioactivity, toxicity, synthetic accessibility, solubility, metabolism and permeability.]
Huge number of candidates, most of which are useless!
Spam
• Unsolicited (commercial) email
• Approx. 90% of all email traffic is spam
• Where are the legitimate messages?
• Filtering
Analogy to Drug Discovery
• Huge number of possible candidates
• Virtual screening to help in the selection process
Winnow Algorithm
• Invented in the late 1980s by Nick Littlestone to learn Boolean functions
• Name from the verb "to winnow"
  – High-dimensional input data
• Used in Natural Language Processing (NLP), text classification, bioinformatics
• Different varieties (regularised, Sparse Network Of Winnow – SNOW, …)
• Error-driven, linear threshold, online algorithm (a minimal sketch follows)
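A minimal sketch of the classic Winnow update rule (the promotion/demotion variant), written from the textbook description rather than the speaker's implementation; all names are illustrative:

```python
def winnow_train(examples, n_features, alpha=2.0, epochs=10):
    """examples: iterable of (active_feature_indices, bool_label)."""
    w = [1.0] * n_features      # all weights start at 1
    theta = float(n_features)   # linear threshold
    for _ in range(epochs):
        for active, label in examples:
            pred = sum(w[i] for i in active) >= theta
            if pred and not label:      # false positive: demote
                for i in active:
                    w[i] /= alpha
            elif not pred and label:    # false negative: promote
                for i in active:
                    w[i] *= alpha
    return w, theta
```

Mistake-driven multiplicative updates are what make Winnow well suited to very high-dimensional, sparse feature spaces such as these molecular fingerprints.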
Winnow ("Molecular Spam Filter")
Machine Learning Methods
Training Example
Features of Molecules
• Based on circular fingerprints
Combinations of Features
• Combinations of molecular features to account for synergies
Pairing Molecular Features Gives Bigrams
Orthogonal Sparse Bigrams
• Technique used in text classification/spam filtering
• Sliding-window process
• Sparse – not all combinations
• Orthogonal – non-redundancy
(A toy sketch follows.)
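A toy sketch of sparse bigram generation over a sliding window, assuming the gap between paired features is recorded to keep pairs distinct; this illustrates the idea rather than reproducing the exact OSB scheme used in the study:

```python
def sparse_bigrams(tokens, window=4):
    """Pair each token with each later token inside the window."""
    pairs = []
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            if i + gap < len(tokens):
                pairs.append((first, gap, tokens[i + gap]))
    return pairs

print(sparse_bigrams(["f1", "f2", "f3", "f4"], window=3))
# [('f1', 1, 'f2'), ('f1', 2, 'f3'), ('f2', 1, 'f3'), ('f2', 2, 'f4'),
#  ('f3', 1, 'f4')]
```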
Workflow
Protein Target Prediction
• Which proteins does a given molecule bind to?
• Virtual screening
• Multiple-endpoint drugs – polypharmacology
• New targets for existing drugs
• Prediction of adverse drug reactions (ADR)
  – Computational toxicology
Predicted Protein Targets
• Selection of ~230 classes from the MDL Drug Data Report
• ~90,000 molecules
• 15 independent 50%/50% splits into training/test sets
Predicted Protein Targets
Cumulative probability of correct prediction within
the three top-ranking predictions: 82.1% (±0.5%)
3. Melting Points
Florian Nigsch
F. Nigsch, et al., J. Chem. Inf. Model., 46, 2412-2422 (2006)
Interest in Modelling
Interest in modelling the properties of molecules – solubility, toxicity, drug-drug interactions, …
Virtual screening has become a necessity in terms of time and cost efficiency.
ADMET integration into the drug development process:
- many fewer late-stage failures
- continuous refinement with new experimental data
ADMET: Absorption, Distribution, Metabolism, Excretion and Toxicity
Melting Points are Important
The melting point facilitates prediction of solubility
- General Solubility Equation (Yalkowsky)
A notorious challenge (unknown crystal packing, interactions in the liquid state)
Goal: a stepwise model for bioavailability, from solid compound to systemic availability
Can Current Models be Improved?
• Bergström: 277 total compounds, PLS, RMSE = 49.8 ºC, q² = 0.54; ntest:ntrain = 185:92
• Karthikeyan: 4173 total compounds, ANN, RMSE = 49.3 ºC, q² = 0.66; ntest:ntrain = 1042:2089
• Jain: 1215 total compounds, MLR, AAE = 33.2 ºC, no q² reported; ntest:ntrain = 119:1094
RMSE slightly below 50 ºC! q² just above 0.50!
k-Nearest Neighbours
Machine Learning Method
k-Nearest Neighbours
• Used for classification and regression
• Suitable for large and diverse datasets
• Local (tuneable via k)
• Non-linear
• Calculation of a distance matrix instead of a training step
• The distance matrix depends on the descriptors used
[Diagrams: kNN for classification and for regression.]
Molecules and Numbers
Datasets
• 4119 organic molecules of diverse complexity, MPs from 14 to 392.5 ºC (dataset 1)
• 277 drug-like molecules, MPs from 40 to 345 ºC (dataset 2)
• Both datasets available in the recent literature
Descriptors
• QuaSAR descriptors from MOE (Molecular Operating Environment): 146 2D and 57 3D descriptors
• Sybyl atom type counts (53 different types, including information about hybridisation state)
Finding the Neighbours
Distances: the Euclidean distance between point i and point j in r dimensions:

$$ d_{ij} = \left[ \sum_{m=1}^{r} \left( X_m^{i} - X_m^{j} \right)^2 \right]^{1/2} $$

Set of k nearest neighbours: N = {Ti, di}, with k = 1, 5, 10, 15.
Calculation of Predictions
Predictions: Tpred = f(N), unweighted or weighted:

(1) Arithmetic average (unweighted):
$$ T^{a}_{\mathrm{pred}} = \frac{1}{k} \sum_{n=1}^{k} T_n $$

(2) Geometric average:
$$ T^{g}_{\mathrm{pred}} = \prod_{n=1}^{k} T_n^{1/k} $$

(3) Inverse distance weighting:
$$ T^{i}_{\mathrm{pred}} = \sum_{n=1}^{k} \frac{T_n}{d_n} \Bigg/ \sum_{n=1}^{k} \frac{1}{d_n} $$

(4) Exponential decrease weighting:
$$ T^{e}_{\mathrm{pred}} = \sum_{n=1}^{k} T_n e^{-d_n} \Bigg/ \sum_{n=1}^{k} e^{-d_n} $$
Assessment by Cross-validation
[Workflow: randomly select m molecules as a test set; the N−m remaining molecules form the training set; predict the test set from the training set; repeat 25 times, computing the RMSEP each time.]

        Organics   Drugs
N       4119       277
m       1000       80
m/N     24.3%      28.9%

$$ \mathrm{RMSECV} = \sqrt{ \frac{1}{25} \sum_{i=1}^{25} \mathrm{RMSEP}_i^2 } $$

Cross-validated root-mean-square error.
Assessment by Cross-validation
This is Monte Carlo cross-validation – prediction of 25 random test sets (of 1000 molecules for the organics) – not "25-fold CV", where each 1/25th would be predicted from the rest. Split sizes and the RMSECV definition are as above; a minimal sketch follows.
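A hedged sketch of this Monte Carlo cross-validation loop; `fit_predict` is a hypothetical callable standing in for the kNN model:

```python
import numpy as np

def monte_carlo_cv(X, y, m, fit_predict, repeats=25, seed=0):
    """25 random train/test splits; returns the RMSECV."""
    rng = np.random.default_rng(seed)
    rmseps = []
    for _ in range(repeats):
        test = rng.choice(len(y), size=m, replace=False)
        train = np.setdiff1d(np.arange(len(y)), test)
        y_pred = fit_predict(X[train], y[train], X[test])
        rmseps.append(np.sqrt(np.mean((y[test] - y_pred) ** 2)))
    # RMSECV: quadratic mean of the per-split RMSEP values
    return np.sqrt(np.mean(np.square(rmseps)))
```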
Results
Training set: 3119 molecules (organic); test set: 1000 molecules (organic).
Using a combination of 2D and 3D descriptors:

k  | A: RMSE / r² / q²  | G: RMSE / r² / q²  | I: RMSE / r² / q²  | E: RMSE / r² / q²
1  | 57.6 / 0.36 / 0.21 | –                  | –                  | –
5  | 48.9 / 0.44 / 0.43 | 50.0 / 0.43 / 0.40 | 48.2 / 0.45 / 0.45 | 48.8 / 0.45 / 0.43
10 | 48.8 / 0.44 / 0.43 | 50.2 / 0.43 / 0.40 | 47.7 / 0.46 / 0.46 | 47.6 / 0.47 / 0.46
15 | 49.3 / 0.43 / 0.42 | 50.9 / 0.42 / 0.38 | 48.1 / 0.46 / 0.45 | 47.3 / 0.48 / 0.47

(Weighting methods: A = arithmetic, G = geometric, I = inverse distance, E = exponential; RMSE in ºC.)
Results: Organic Dataset
Using 15 nearest neighbours and exponential weighting:
• 2D only: RMSE = 47.2 ºC, r² = 0.47, q² = 0.47
• 3D only: RMSE = 50.7 ºC, r² = 0.39, q² = 0.39
Exponential weighting performs best (see the table above).
Results: Drugs Dataset
Using 10 nearest neighbours and inverse distance weighting:
• 2D only: RMSE = 46.5 ºC, r² = 0.30, q² = 0.29
• 3D only: RMSE = 48.4 ºC, r² = 0.24, q² = 0.23
Inverse distance weighting performs best!

k  | A: RMSE / r² / q²  | G: RMSE / r² / q²  | I: RMSE / r² / q²  | E: RMSE / r² / q²
1  | 57.1 / 0.19 / –    | –                  | –                  | –
5  | 47.7 / 0.27 / 0.25 | 48.8 / 0.26 / 0.22 | 47.1 / 0.29 / 0.27 | 48.5 / 0.28 / 0.22
10 | 46.9 / 0.29 / 0.28 | 47.8 / 0.30 / 0.25 | 46.3 / 0.30 / 0.29 | 47.4 / 0.29 / 0.26
15 | 48.0 / 0.25 / 0.24 | 49.0 / 0.26 / 0.21 | 47.2 / 0.28 / 0.27 | 47.2 / 0.29 / 0.27

(Weighting methods: A = arithmetic, G = geometric, I = inverse distance, E = exponential; RMSE in ºC.)
Summary of Results
Organics: RMSE = 46.2 ºC, r² = 0.49, ntest:ntrain = 1000:3119
Drugs: RMSE = 42.2 ºC, r² = 0.42, ntest:ntrain = 80:197
Difficulties at the extremities of the temperature scale, due to distances in descriptor space and in temperature.
Summary of Results
Nonetheless, comparison with other models in the literature shows that the presented model:
- performs better in terms of RMSE under more stringent conditions (even unoptimised)
- is parameterless; the optimised model has many fewer parameters than, for example, an ANN
Conclusions
• Information content of descriptors
• Exponential weighting performs best
• A genetic algorithm is able to improve performance, especially for atom type counts as descriptors
• Remaining error:
  - interactions in the liquid and/or solid state are not captured (by single-molecule descriptors)
  - the kNN method in this setting (datasets + descriptors)
Thanks
• Pfizer & PIPMS
• Unilever
• Dave Palmer
• Florian Nigsch
All slides after here are for
information only
The Two Faces of Computational Chemistry
Theory
Empiricism
Informatics and Empirical Models
• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.
• The model relating inputs to outputs is an empirically parameterised equation or a more elaborate mathematical model.
• Do not attempt to simulate reality.
• Usually High Throughput.
Theoretical Chemistry
• Calculations and simulations based on real physics.
• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.
• Attempt to model or simulate reality.
• Usually Low Throughput.
Optimisation with Genetic Algorithm
Transformation of the data to yield a lower global RMSEP
- global means: for the whole data set
Three operations to transform the columns of the data matrix for optimal weighting.
One bit per descriptor on the chromosome, (op; param), consisting of an operation and a parameter:
• multiplication: (a, b) = (0, 30)
• power: (a, b) = (0, 3)
• loge: no parameter
(A toy sketch of applying such a chromosome follows.)
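A toy sketch of applying one chromosome of (op; param) genes column-by-column to a descriptor matrix, under the three operations above; the absolute values and the small offset are our own safeguards against negative or zero descriptor values, not part of the original scheme:

```python
import numpy as np

def apply_chromosome(X, chromosome):
    """chromosome: list of (op, param) pairs, one per descriptor column."""
    X = X.astype(float).copy()
    for j, (op, param) in enumerate(chromosome):
        if op == "mult":      # parameter drawn from (0, 30)
            X[:, j] *= param
        elif op == "power":   # parameter drawn from (0, 3)
            X[:, j] = np.abs(X[:, j]) ** param
        elif op == "log":     # natural log, no parameter
            X[:, j] = np.log(np.abs(X[:, j]) + 1e-9)
    return X
```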
Iterative Genetic Optimisation
[Flowchart: initial data and initial chromosomes → transformation of data → calculation of distance matrix → evaluation of models → survival of the fittest and genetic operations → repeat, stopping when the criterion is reached → set of optimal transformations.]
Genetic Algorithm Able to Improve Results
Application of the optimal transformations to the initial data, and reassessment of the models (averages over 25 runs!):

Data set | Descriptors | RMSE (ºC) | r²   | q²
1        | 2D          | 46.2      | 0.49 | 0.49
2        | 2D          | 42.2      | 0.42 | 0.40
1        | SAC         | 49.1      | 0.40 | 0.39
2        | SAC         | 46.7      | 0.29 | 0.27

SAC: Sybyl Atom Counts
Errors at Low/High-Melting Ends…
Signed RMSE analysis. Definition of the signed RMSE:

$$ e_i = x_i - y_i $$
$$ A = \sum_i e_i^2 \, \mathrm{sgn}(e_i) $$
$$ \mathrm{SRMSE} = \mathrm{sgn}(A) \sqrt{\frac{|A|}{N}} $$

Details: in 30 ºC steps across the whole range; all 4119 molecules of data set 1. (A small sketch follows.)
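A small NumPy sketch computing the signed RMSE for one bin of predictions, assuming SRMSE = sgn(A)·√(|A|/N) as defined above:

```python
import numpy as np

def signed_rmse(x_pred, y_obs):
    """Signed RMSE: keeps the sign of the dominant error direction."""
    e = np.asarray(x_pred) - np.asarray(y_obs)
    A = np.sum(e**2 * np.sign(e))
    return np.sign(A) * np.sqrt(np.abs(A) / len(e))
```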
… Result From Large Distances
• Distance in descriptor space: the NNs of high-melting compounds are typically further away.
• Distance in temperature: Ttest − TNN < 0 indicates a negative neighbour; > 0 a positive neighbour.

Over- & Underpredictions at Extremities
[Plots: the same distance analysis, showing underprediction at the high-melting end and overprediction at the low-melting end.]