Machine Learning in Drug Design
David Page
Dept. of Biostatistics and Medical
Informatics and Dept. of Computer
Sciences
Collaborators
Michael Waddell
Paul Finn
Ashwin Srinivasan
John Shaughnessy
Bart Barlogie
Frank Zhan
Stephen Muggleton
Arno Spatola
Sean McIlwain
Brian Kay
Outline
Overview of Drug Design
How Machine Learning Fits Into the Process
Target Search: Single Nucleotide Polymorphisms
(SNPs)
Machine Learning from Feature Vectors
Decision Trees
Support Vector Machines
Voting/Ensembles
Predicting Molecular Activity: Learning from Structure
Drugs Typically Are…
Small organic molecules that…
Modulate disease by binding to some target
protein…
At a location that alters the protein’s behavior
(e.g., antagonist or agonist).
Target protein might be human (e.g., ACE for
blood pressure) or belong to invading organism
(e.g., surface protein of a bacterium).
Example of Binding
So To Design a Drug:
Identify Target Protein
  Knowledge of proteome/genome
  Relevant biochemical pathways
Determine Target Site Structure
  Crystallography, NMR
  Difficult if Membrane-Bound
Synthesize a Molecule that Will Bind
  Imperfect modeling of structure
  Structures may change at binding
And even then…
Molecule Binds Target But May:
Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the body.
Break down as soon as it gets into the body, or
not leave the body soon enough.
Not get to where it should in the body
(e.g., fail to cross the blood-brain barrier).
Not diffuse from gut to bloodstream.
And Every Body is Different:
Even if a molecule works in the test tube and
works in animal studies, it may not work in
people (will fail in clinical trials).
A molecule may work for some people but not
others.
A molecule may cause harmful side-effects in
some people but not others.
Places to use Machine Learning
Finding target proteins.
Inferring target site structure.
Predicting who will respond positively/negatively.
Healthy vs. Disease
[Figure panels: Healthy, Diseased]
If We Could Sequence DNA
Quickly and Cheaply, We Could:
Sequence DNA of people taking a drug, and use ML to
identify consistent differences between those who
respond well and those who do not.
Sequence DNA of cancer cells and healthy cells, and
use ML to detect dangerous mutations… proteins these
genes code for may be useful targets.
Sequence DNA of people who get a disease and those
who don’t, and use ML to determine genes related to
susceptibility… proteins these genes code for may be
useful targets.
Problem: Can’t Sequence Quickly
Can quickly test single positions where variation
is common: Single Nucleotide Polymorphisms
(SNPs).
Can quickly test degree to which every gene is
being transcribed: Gene Expression Microarrays
(e.g., Affymetrix Gene Chips™).
Can (moderately) quickly test which proteins are
present in a sample (Proteomics).
Example of SNP Data
Person     SNP 1   SNP 2   SNP 3   ...   CLASS
Person 1   C T     A G     T T     ...   old
Person 2   C C     A G     C T     ...   young
Person 3   T T     A A     C C     ...   old
Person 4   C T     G G     T T     ...   young
...        ...     ...     ...     ...   ...
Problem: SNPs are not Genes
If we find a predictive SNP, it may not be part of
a gene… we can only infer that the SNP is “near”
a gene that may be involved in the disease.
Even if the SNP is part of a gene, it may be
another nearby gene that is the key gene.
Problem: Even SNPs are Costly
Typically cannot use all known SNPs.
Can focus on a particular chromosome and area
if knowledge permits that.
Can use a scattering of SNPs, since SNPs that
are very close together may be redundant… use
one SNP per haplotype block, or region where
recombination is rare.
Why Machine Learning?
There may be no single SNP in our data that
distinguishes disease vs. healthy.
A combination of SNPs may still be predictive,
and we can gain insight from that combination.
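A toy illustration of this point, using invented genotypes (not the study's data): each SNP alone does no better than chance, but the pair separates the classes. The helper name best_accuracy is hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (SNP 1 allele, SNP 2 allele) -> class label.
# Invented for illustration; it is not from the study.
people = [
    (("A", "C"), "disease"),
    (("G", "T"), "disease"),
    (("A", "T"), "healthy"),
    (("G", "C"), "healthy"),
] * 5

def best_accuracy(examples, positions):
    """Accuracy of the best lookup-table classifier that sees only `positions`."""
    counts = {}
    for snps, label in examples:
        key = tuple(snps[i] for i in positions)
        counts.setdefault(key, Counter())[label] += 1
    # For each observed value combination, predict its most common class.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(examples)

acc_snp1 = best_accuracy(people, [0])     # SNP 1 alone
acc_snp2 = best_accuracy(people, [1])     # SNP 2 alone
acc_both = best_accuracy(people, [0, 1])  # the two SNPs together
```

On this toy set, acc_snp1 and acc_snp2 are both 0.5, while acc_both is 1.0.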
Decision Trees in One Picture
[Figure: decision tree with root test “SNP1 has A”, branches Yes and No, and leaves Young and Old]
Naïve Bayes in One Picture
[Figure: naïve Bayes network with nodes Age, SNP 1, SNP 2, ..., SNP 3000]
Voting Approach
Score SNPs using information gain.
Choose top 1% scoring SNPs.
To classify a new case, let these SNPs vote
(majority or weighted majority vote).
We use majority vote here.
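The steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes each selected SNP votes for the class that was most common among training cases sharing its genotype, which the slides do not specify. The function names and the toy genotypes are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy from splitting on one SNP's values."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def train_voter(X, y, fraction=0.01):
    """Keep the top-scoring fraction of SNPs and record each one's votes."""
    n_snps = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_snps)]
    ranked = sorted(range(n_snps),
                    key=lambda j: information_gain(columns[j], y),
                    reverse=True)
    keep = ranked[:max(1, round(fraction * n_snps))]
    voters = {}
    for j in keep:
        by_value = defaultdict(Counter)
        for v, label in zip(columns[j], y):
            by_value[v][label] += 1
        # Assumed voting rule: majority training class for each genotype.
        voters[j] = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return voters

def predict(voters, row, default):
    """Unweighted majority vote; genotypes unseen in training abstain."""
    votes = Counter(voters[j][row[j]] for j in voters if row[j] in voters[j])
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical toy genotypes: SNP 0 tracks the class, SNP 1 does not.
X = [("CT", "AA"), ("CT", "AG"), ("CC", "AA"), ("CC", "AG")]
y = ["old", "old", "young", "young"]
voters = train_voter(X, y, fraction=0.5)
```

With fraction=0.5 and two SNPs, only SNP 0 is kept, so predict(voters, ("CT", "AG"), "young") returns "old".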
Task: Predict Early Onset Disease
From SNP Data
Only 3000 SNPs, coarsely sampled over entire
genome.
80 patients (examples), 40 with early onset.
Using technology from Orchid.
Can a predictor be learned that performs
significantly better than chance on unseen data?
Results
Use all data, only top 1% of features, or only top
10% of features (according to decision tree’s
purity measure).
Use Trees, SVMs, Voting.
SVMs with top 10% achieve 71% accuracy.
Significantly better than chance (50%).
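One way to check a claim like this is a one-sided binomial tail. The sketch below rests on two assumptions that are not in the slides: that 71% corresponds to 57 of the 80 patients classified correctly, and that the predictions behave like independent coin flips under the null hypothesis (cross-validated predictions satisfy this only approximately).

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_patients = 80
n_correct = 57  # assumed: 57/80 = 71.25%, the nearest count to the reported 71%
p_value = binomial_tail(n_correct, n_patients)
```

Under these assumptions p_value comes out below 0.001.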
Lessons
Feature selection is important for performance.
Methodology note for machine learning
specialists: must repeat this entire process
(feature selection included) on the training data of
each cross-validation fold, or results will be
overly optimistic.
SNP approach is promising… get funding to
measure more SNPs.
More work on SVM comprehensibility.
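The methodology note on cross-validation can be sketched as below: feature selection runs inside the loop, on each fold's training cases only. This is a minimal sketch with a stand-in scoring rule and a toy lookup learner. All function names and the toy data are hypothetical, and none of it is the study's code.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]

def value_class_counts(X, y, j):
    """Count class labels for each value of feature j."""
    counts = {}
    for row, label in zip(X, y):
        counts.setdefault(row[j], {}).setdefault(label, 0)
        counts[row[j]][label] += 1
    return counts

def select_features(X, y, n_keep):
    """Stand-in scorer: rank features by how well per-value majority fits (X, y)."""
    def fit_score(j):
        return sum(max(c.values()) for c in value_class_counts(X, y, j).values())
    return sorted(range(len(X[0])), key=fit_score, reverse=True)[:n_keep]

def fit_lookup(X, y):
    """Toy learner: majority class for each value of the first selected feature."""
    table = {v: max(c, key=c.get) for v, c in value_class_counts(X, y, 0).items()}
    return table, max(set(y), key=y.count)

def predict_lookup(model, row):
    table, default = model
    return table.get(row[0], default)

def cross_validated_accuracy(X, y, k, n_keep, fit, predict):
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Feature selection sees only the training fold, never the held-out cases.
        keep = select_features(X_train, y_train, n_keep)
        model = fit([[row[j] for j in keep] for row in X_train], y_train)
        for i in test_idx:
            correct += predict(model, [X[i][j] for j in keep]) == y[i]
    return correct / len(X)

# Hypothetical toy data: feature 0 tracks the class, feature 1 is constant.
X = [("A", "C") if i % 2 == 0 else ("G", "C") for i in range(20)]
y = ["early" if i % 2 == 0 else "late" for i in range(20)]
accuracy = cross_validated_accuracy(X, y, k=5, n_keep=1,
                                    fit=fit_lookup, predict=predict_lookup)
```

Selecting features once on all the data before splitting would let the held-out cases influence which features are kept, which is the source of the optimistic bias.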
Typical Practice when Target
Structure is Unknown
Test many molecules (1,000,000) to find some
that bind to target (ligands).
Infer (induce) shape of target site from 3D
structural similarities.
Shared 3D substructure is called a
pharmacophore.
Perfect example of a machine learning task with
spatial target.
[Figure: example molecules labeled Inactive and Active]
An Example of Structure Learning
Inductive Logic Programming
Represents data points in mathematical logic
Uses Background Knowledge
Returns results in logic
The Logical Representation of a
Pharmacophore
Active(X) if:
has-conformation(X,Conf),
has-hydrophobic(X,A),
has-hydrophobic(X,B),
distance(X,Conf,A,B,3.0,1.0),
has-acceptor(X,C),
distance(X,Conf,A,C,4.0,1.0),
distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has some
conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C
such that the following holds: in conformation Conf of molecule X, the
distance between A and B is 3 Angstroms (plus or minus 1), the distance
between A and C is 4, and the distance between B and C is 5.
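The clause above can be read as a test applied to every conformation of a molecule. A minimal sketch of that test follows. It assumes a hypothetical encoding in which a molecule maps each conformation name to a list of (feature type, coordinates) pairs; this is not the data format used in the slides.

```python
from math import dist

def within(p, q, target, tol):
    """True if the distance between points p and q is target +/- tol."""
    return abs(dist(p, q) - target) <= tol

def is_active(molecule):
    """True if some conformation has hydrophobes A, B and acceptor C with
    |AB| = 3.0, |AC| = 4.0, |BC| = 5.0, each within 1.0 Angstrom."""
    for sites in molecule.values():
        hydrophobes = [xyz for kind, xyz in sites if kind == "hydrophobic"]
        acceptors = [xyz for kind, xyz in sites if kind == "acceptor"]
        for a in hydrophobes:
            for b in hydrophobes:
                if not within(a, b, 3.0, 1.0):
                    continue
                for c in acceptors:
                    if within(a, c, 4.0, 1.0) and within(b, c, 5.0, 1.0):
                        return True
    return False

# Hypothetical molecules with invented coordinates (Angstroms).
m1 = {"conf1": [("hydrophobic", (0, 0, 0)),
                ("hydrophobic", (3, 0, 0)),
                ("acceptor", (0, 4, 0))]}
m2 = {"conf1": [("hydrophobic", (0, 0, 0)),
                ("hydrophobic", (3, 0, 0)),
                ("acceptor", (0, 9, 0))]}
m3 = {"conf1": m2["conf1"], "conf2": m1["conf1"]}
```

Here is_active(m1) and is_active(m3) are True, and is_active(m2) is False: one matching conformation is enough, as in the clause.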
Background Knowledge I
Information about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
bond(m1,a1,a2,1).
bond(m1,a2,a3,1).
Background Knowledge II
Definition of distance equivalence
% dist/5 holds when Atom1 and Atom2 are Dist apart, to within Error.
dist(Drug,Atom1,Atom2,Dist,Error) :-
    number(Error),
    coord(Drug,Atom1,X1,Y1,Z1),
    coord(Drug,Atom2,X2,Y2,Z2),
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
    Diff is Dist1-Dist,
    absolute_value(Diff,E1),
    E1 =< Error.
% Euclidean distance between two 3D points.
euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
    Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
    D is sqrt(Dsq).
Central Idea: Generalize by
searching a lattice
Lattice of Hypotheses
active(X)
active(X) if
has-hydrophobic(X,A)
active(X) if
has-hydrophobic(X,A),
has-donor(X,B),
distance(X,A,B,5.0)
active(X) if
has-donor(X,A)
active(X) if
has-acceptor(X,A)
active(X) if
has-donor(X,A),
has-donor(X,B),
distance(X,A,B,4.0)
active(X) if
has-acceptor(X,A),
has-donor(X,B),
distance(X,A,B,6.0)
etc.
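A general-to-specific search over such a lattice can be sketched as below. This is a heavily simplified propositional stand-in, not Progol and not the system used in this work. Hypotheses are sets of required properties with no variables or distances, and the score (positives covered minus negatives covered) is an assumed choice. All names and toy examples are hypothetical.

```python
def covers(hypothesis, example):
    """A hypothesis (set of required properties) covers an example that has them all."""
    return hypothesis <= example

def refine(hypothesis, literals):
    """Children in the lattice: the hypothesis plus one more condition."""
    return [hypothesis | {lit} for lit in literals if lit not in hypothesis]

def search(positives, negatives, literals, max_conditions=3):
    """Breadth-first, general-to-specific search for the best-scoring hypothesis."""
    def score(h):
        return (sum(covers(h, e) for e in positives)
                - sum(covers(h, e) for e in negatives))
    top = frozenset()  # the most general clause, "active(X)", covers everything
    best, frontier, seen = top, [top], {top}
    for _ in range(max_conditions):
        children = []
        for h in frontier:
            for child in refine(h, literals):
                # Prune: a clause covering no positives cannot recover by specializing.
                if child not in seen and any(covers(child, e) for e in positives):
                    seen.add(child)
                    children.append(child)
                    if score(child) > score(best):
                        best = child
        frontier = children
    return best

# Hypothetical toy examples: each molecule is the set of properties it has.
literals = ["hydrophobic", "donor", "acceptor"]
positives = [frozenset({"hydrophobic", "donor", "acceptor"}),
             frozenset({"donor", "acceptor"})]
negatives = [frozenset({"hydrophobic", "donor"}),
             frozenset({"acceptor"})]
best = search(positives, negatives, literals)
```

On this toy set the search returns the hypothesis requiring both "donor" and "acceptor", which covers both positives and neither negative.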
Conformational model
Conformational flexibility modelled as multiple
conformations:
Sybyl RandomSearch
Catalyst
Pharmacophore description
Atom and site centred
Hydrogen bond donor
Hydrogen bond acceptor
Hydrophobe
Site points (limited at present)
User definable
Distance based
Example I: Dopamine agonists
Agonists taken from Martin data set on QSAR society
web pages
Examples (5-50 conformations/molecule)
[Figure: chemical structures of example dopamine agonists]
Pharmacophore identified
Molecule A has the desired activity if:
in conformation B molecule A contains a hydrogen acceptor at C, and
in conformation B molecule A contains a basic nitrogen group at D, and
the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrogen acceptor at E, and
the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrophobic group at F, and
the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
the distance between E and F is 2.74602 +/- 0.75 Angstroms.
Example II: ACE inhibitors
28 angiotensin converting enzyme inhibitors
taken from literature
D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 316, (1987)
[Figure: chemical structures of example ACE inhibitors]
Experiment 1
Attempt to identify pharmacophore using original
Mayer et al. data (final conformations).
Initial failed attempt traced to “bugs” in
background knowledge definition.
4 pharmacophores found with corrected code
(variations on common theme)
ACE pharmacophore
Molecule A is an ACE inhibitor if:
molecule A contains a zinc-site B,
molecule A contains a hydrogen acceptor C,
the distance between B and C is 7.899 +/- 0.750 A,
molecule A contains a hydrogen acceptor D,
the distance between B and D is 8.475 +/- 0.750 A,
the distance between C and D is 2.133 +/- 0.750 A,
molecule A contains a hydrogen acceptor E,
the distance between B and E is 4.891 +/- 0.750 A,
the distance between C and E is 3.114 +/- 0.750 A,
the distance between D and E is 3.753 +/- 0.750 A.
Pharmacophore discovered
Distance   Progol   Mayer
A          4.9      5.0
B          3.8      3.8
C          8.5      8.6
[Figure: pharmacophore diagram with distances A, B, and C marked; legend: Zinc site, H-bond acceptor]
Experiment 2
Definition of “zinc ligand” added to background
knowledge
based on crystallographic data
Multiple conformations
Sybyl RandomSearch
Experiment 2
Original pharmacophore rediscovered plus one
other
different zinc ligand position
similar to alternative proposed by Ciba-Geigy
[Figure: pharmacophore diagram with distances 4.0, 3.9, and 7.3]
Example III: Thermolysin inhibitors
10 inhibitors for which crystallographic data is
available in PDB
Conformationally challenging molecules
Experimentally observed superposition
Key binding site interactions
[Figure: binding-site schematic showing inhibitor interactions with Asn112-NH, Asn112 C=O, Arg203-NH, Ala113 C=O, Zn, and the S1’ and S2’ subsites]
Interactions made by inhibitors
[Table: rows are the interactions Asn112-NH, S2’, Asn112 C=O, Arg203 NH, S1’, Ala113-C=O, and Zn; columns are the inhibitor structures 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, and 6TMN]
Pharmacophore Identification
Structures considered 1HYT 1THL 1TLP 1TMN
2TMN 4TLN 4TMN 5TLN 5TMN 6TMN
Conformational analysis using “Best” conformer
generation in Catalyst
98-251 conformations/molecule
Thermolysin Results
10 five-point pharmacophores identified, falling into
2 groups (7/10 molecules)
3 “acceptors”, 1 hydrophobe, 1 donor
4 “acceptors”, 1 donor
Common core of Zn ligands, Arg203 and Asn112
interactions identified
Correct assignments of functional groups
Correct geometry to 1 Angstrom tolerance
Thermolysin results
Increasing tolerance to 1.5 Angstroms finds
common 6-point pharmacophore including one
extra interaction
Example IV: Antibacterial peptides
Dataset of 11 pentapeptides showing activity
against Pseudomonas aeruginosa
6 actives <64mg/ml IC50
5 inactives
Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa
if it has a conformation B such that:
M has a hydrophobic group C,
M has a hydrogen acceptor D,
the distance between C and D in conformation B is 11.7 Angstroms,
M has a positively charged atom E,
the distance between C and E in conformation B is 4 Angstroms,
the distance between D and E in conformation B is 9.4 Angstroms,
M has a positively charged atom F,
the distance between C and F in conformation B is 11.1 Angstroms,
the distance between D and F in conformation B is 12.6 Angstroms,
the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance 1.5 Angstroms
Ongoing ILP developments
(pharmacophores)
Continue to extend method validation
Extending to combinatorial mixtures
Quantitative models
Mixing different datatypes in background
knowledge
Developing graphical front-end
Ongoing developments (Other)
Analysis of HTS datasets
Analysis of “drug-likeness”
Derivation of new descriptors
e.g., empirical binding functions