single-nucleotide polymorphism data and molecular


Machine Learning in Drug Design
David Page
Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences
Collaborators
• Michael Waddell
• Paul Finn
• Ashwin Srinivasan
• John Shaughnessy
• Bart Barlogie
• Frank Zhan
• Stephen Muggleton
• Arno Spatola
• Sean McIlwain
• Brian Kay

Outline
• Overview of Drug Design
• How Machine Learning Fits Into the Process
• Target Search: Single Nucleotide Polymorphisms (SNPs)
• Machine Learning from Feature Vectors
  - Decision Trees
  - Support Vector Machines
  - Voting/Ensembles
• Predicting Molecular Activity: Learning from Structure
Drugs Typically Are…
• Small organic molecules that…
• Modulate disease by binding to some target protein…
• At a location that alters the protein’s behavior (e.g., antagonist or agonist).
• Target protein might be human (e.g., ACE for blood pressure) or belong to an invading organism (e.g., surface protein of a bacterium).
Example of Binding
So To Design a Drug:
1. Identify Target Protein
  - Knowledge of proteome/genome
  - Relevant biochemical pathways
2. Determine Target Site Structure
  - Crystallography, NMR
  - Difficult if membrane-bound
3. Synthesize a Molecule that Will Bind
  - Imperfect modeling of structure
  - Structures may change at binding
And even then…
Molecule Binds Target But May:
• Bind too tightly or not tightly enough.
• Be toxic.
• Have other effects (side effects) in the body.
• Break down as soon as it gets into the body, or not leave the body soon enough.
• Not get to where it should in the body (e.g., fail to cross the blood-brain barrier).
• Not diffuse from gut to bloodstream.
And Every Body is Different:
• Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
• A molecule may work for some people but not others.
• A molecule may cause harmful side effects in some people but not others.
Places to use Machine Learning
• Finding target proteins.
• Inferring target site structure.
• Predicting who will respond positively/negatively.
Healthy vs. Disease
[Figure: two panels, Healthy and Diseased]
If We Could Sequence DNA Quickly and Cheaply, We Could:
• Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not.
• Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations… proteins these genes code for may be useful targets.
• Sequence DNA of people who get a disease and those who don’t, and use ML to determine genes related to susceptibility… proteins these genes code for may be useful targets.

Problem: Can’t Sequence Quickly
• Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs).
• Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips™).
• Can (moderately) quickly test which proteins are present in a sample (Proteomics).
Example of SNP Data
Person     SNP 1   SNP 2   SNP 3   ...   CLASS
Person 1   C T     A G     T T     ...   old
Person 2   C C     A G     C T     ...   young
Person 3   T T     A A     C C     ...   old
Person 4   C T     G G     T T     ...   young
...        ...     ...     ...     ...   ...
Problem: SNPs are not Genes
• If we find a predictive SNP, it may not be part of a gene… we can only infer that the SNP is “near” a gene that may be involved in the disease.
• Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.
Problem: Even SNPs are Costly
• Typically cannot use all known SNPs.
• Can focus on a particular chromosome and area if knowledge permits that.
• Can use a scattering of SNPs, since SNPs that are very close together may be redundant… use one SNP per haplotype block, or region where recombination is rare.
Why Machine Learning?
• There may be no single SNP in our data that distinguishes disease vs. healthy.
• Some combination of SNPs may still be predictive, and we can gain insight from that combination.
Decision Trees in One Picture
[Figure: decision tree with a root test "SNP1 has A", Yes/No branches, and leaves Young and Old]
Naïve Bayes in One Picture
[Figure: class node Age connected to feature nodes SNP 1, SNP 2, ..., SNP 3000]
Voting Approach
• Score SNPs using information gain.
• Choose top 1% scoring SNPs.
• To classify a new case, let these SNPs vote (majority or weighted majority vote).
• We use majority vote here.
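The voting approach above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names, the way each selected SNP votes (for the majority training class seen with its genotype), and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one SNP's genotypes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def train_voters(rows, labels, fraction=0.01):
    """Keep the top-scoring fraction of SNPs (at least one).

    Assumption: each kept SNP votes for the majority training class
    observed with each of its genotypes.
    """
    n_snps = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_snps)]
    ranked = sorted(range(n_snps),
                    key=lambda j: information_gain(columns[j], labels),
                    reverse=True)
    keep = ranked[:max(1, round(fraction * n_snps))]
    voters = {}
    for j in keep:
        by_value = {}
        for v, y in zip(columns[j], labels):
            by_value.setdefault(v, Counter())[y] += 1
        voters[j] = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return voters

def predict(voters, row):
    """Unweighted majority vote; SNPs showing an unseen genotype abstain."""
    votes = Counter(table[row[j]] for j, table in voters.items()
                    if row[j] in table)
    return votes.most_common(1)[0][0] if votes else None

# Made-up toy data: 4 people, 3 SNPs (genotypes as two-letter strings).
demo_rows = [("AA", "CT", "GG"), ("AA", "CC", "GG"),
             ("AG", "CT", "GT"), ("AG", "CC", "GG")]
demo_labels = ["old", "old", "young", "young"]
```

With the default `fraction=0.01`, a 3000-SNP data set would keep 30 voting SNPs. A weighted vote could be obtained by adding each SNP's information gain to the tally instead of 1.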
Task: Predict Early Onset Disease From SNP Data
• Only 3000 SNPs, coarsely sampled over entire genome.
• 80 patients (examples), 40 with early onset.
• Using technology from Orchid.
• Can a predictor be learned that performs significantly better than chance on unseen data?
Results
• Use all features, only the top 1% of features, or only the top 10% of features (according to decision tree’s purity measure).
• Use Trees, SVMs, Voting.
• SVMs with the top 10% of features achieve 71% accuracy, significantly better than chance (50%).
Lessons
• Feature selection is important for performance.
• Methodology note for machine learning specialists: must repeat this entire process (including feature selection) on each fold of cross-validation, or results will be overly optimistic.
• SNP approach is promising… get funding to measure more SNPs.
• More work on SVM comprehensibility.
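The methodology note can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the study's pipeline: it uses a simple per-genotype purity score as a stand-in for the decision tree's purity measure, a majority-vote classifier as the learner, and hypothetical function names. The `select_inside_folds` flag switches between selecting features inside each training fold and selecting them once on all the data, which lets the held-out labels leak into the model.

```python
import random
from collections import Counter

def score_feature(values, labels):
    """Training accuracy of predicting the majority class per genotype
    (a simple purity score, used here as a stand-in)."""
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    return sum(max(c.values()) for c in by_value.values()) / len(labels)

def select_features(rows, labels, n_keep):
    """Indices of the n_keep highest-scoring features."""
    n = len(rows[0])
    scores = [score_feature([r[j] for r in rows], labels) for j in range(n)]
    return sorted(range(n), key=lambda j: scores[j], reverse=True)[:n_keep]

def fit_vote(rows, labels, features):
    """Majority-vote classifier over the chosen features."""
    tables = {}
    for j in features:
        by_value = {}
        for r, y in zip(rows, labels):
            by_value.setdefault(r[j], Counter())[y] += 1
        tables[j] = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    default = Counter(labels).most_common(1)[0][0]

    def predict(row):
        votes = Counter(t[row[j]] for j, t in tables.items() if row[j] in t)
        return votes.most_common(1)[0][0] if votes else default
    return predict

def cross_validate(rows, labels, k, n_keep, select_inside_folds=True):
    """k-fold accuracy; selection is redone per fold unless the flag is off."""
    idx = list(range(len(rows)))
    folds = [idx[i::k] for i in range(k)]
    leaky = None if select_inside_folds else select_features(rows, labels, n_keep)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        feats = leaky if leaky is not None else select_features(tr_rows, tr_labels, n_keep)
        predict = fit_vote(tr_rows, tr_labels, feats)
        correct += sum(predict(rows[i]) == labels[i] for i in test)
    return correct / len(rows)

def noise_demo(seed=0, n_people=40, n_snps=500):
    """Accuracies (inside-fold, leaky) on pure-noise data with random labels."""
    rng = random.Random(seed)
    rows = [tuple(rng.choice(["AA", "AG", "GG"]) for _ in range(n_snps))
            for _ in range(n_people)]
    labels = [rng.choice(["early", "late"]) for _ in range(n_people)]
    return (cross_validate(rows, labels, 5, 10, True),
            cross_validate(rows, labels, 5, 10, False))

if __name__ == "__main__":
    print(noise_demo())
```

On pure noise there is nothing to learn, so an honest estimate should hover around chance; selecting features on all the data first tends to report a higher, misleading accuracy.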
Typical Practice when Target Structure is Unknown
• Test many molecules (1,000,000) to find some that bind to target (ligands).
• Infer (induce) shape of target site from 3D structural similarities.
• Shared 3D substructure is called a pharmacophore.
• Perfect example of a machine learning task with spatial target.
An Example of Structure Learning
[Figure: example molecules labeled Active and Inactive]
Inductive Logic Programming
• Represents data points in mathematical logic
• Uses Background Knowledge
• Returns results in logic
The Logical Representation of a Pharmacophore
Active(X) if:
has-conformation(X,Conf),
has-hydrophobic(X,A),
has-hydrophobic(X,B),
distance(X,Conf,A,B,3.0,1.0),
has-acceptor(X,C),
distance(X,Conf,A,C,4.0,1.0),
distance(X,Conf,B,C,5.0,1.0).
This logical clause states that a molecule X is active if it has some
conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C
such that the following holds: in conformation Conf of molecule X, the
distance between A and B is 3 Angstroms, the distance between A and C is
4 Angstroms, and the distance between B and C is 5 Angstroms, each plus
or minus 1 Angstrom.
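To make the clause concrete, here is a minimal Python sketch of checking it against 3D coordinates. The data layout is an assumption made for illustration (a molecule is a list of conformations, and each conformation is a dictionary listing the coordinates of its hydrophobic groups and hydrogen acceptors); it is not the representation used by the ILP system.

```python
import math
from itertools import permutations

def within(p, q, target, tol):
    """True if the Euclidean distance between points p and q is target +/- tol."""
    return abs(math.dist(p, q) - target) <= tol

def matches(conformation):
    """Check one conformation against the three-point pharmacophore.

    conformation: dict with 'hydrophobic' and 'acceptor' lists of (x, y, z)
    points, in Angstroms (hypothetical layout).
    """
    for a, b in permutations(conformation["hydrophobic"], 2):
        if not within(a, b, 3.0, 1.0):       # distance(X,Conf,A,B,3.0,1.0)
            continue
        for c in conformation["acceptor"]:
            if within(a, c, 4.0, 1.0) and within(b, c, 5.0, 1.0):
                return True
    return False

def active(molecule):
    """Predicted active if any one conformation satisfies every condition."""
    return any(matches(conf) for conf in molecule)

# Made-up coordinates forming a 3-4-5 triangle, plus one that does not fit.
good = {"hydrophobic": [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
        "acceptor": [(0.0, 4.0, 0.0)]}
bad = {"hydrophobic": [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
       "acceptor": [(0.0, 10.0, 0.0)]}
```

Note that all distance conditions must hold within the same conformation, which is why the conformation is an explicit argument in the logical clause.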
Background Knowledge I
• Information about atoms and bonds in the molecules
atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
bond(m1,a1,a2,1).
bond(m1,a2,a3,1).
Background Knowledge II
• Definition of distance equivalence

% Atom1 and Atom2 of Drug are Dist apart, to within Error.
dist(Drug,Atom1,Atom2,Dist,Error) :-
    number(Error),
    coord(Drug,Atom1,X1,Y1,Z1),
    coord(Drug,Atom2,X2,Y2,Z2),
    euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
    Diff is Dist1-Dist,
    absolute_value(Diff,E1),
    E1 =< Error.

% Euclidean distance between two 3D points.
euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
    Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
    D is sqrt(Dsq).

Central Idea: Generalize by searching a lattice
Lattice of Hypotheses (most general at the top; each level adds conditions)
Level 0:
  active(X)
Level 1:
  active(X) if has-hydrophobic(X,A)
  active(X) if has-donor(X,A)
  active(X) if has-acceptor(X,A)
Level 2:
  active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0)
  active(X) if has-donor(X,A), has-donor(X,B), distance(X,A,B,4.0)
  active(X) if has-acceptor(X,A), has-donor(X,B), distance(X,A,B,6.0)
etc.
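The general-to-specific search through such a lattice can be sketched as follows. This is a deliberately simplified, propositional illustration: hypotheses are sets of required properties rather than first-order clauses with variables, a hypothesis must cover every positive example, and the property names and toy examples are invented. A real ILP system such as Progol searches a much richer space and tolerates imperfect coverage.

```python
def covers(hypothesis, example):
    """A hypothesis covers an example if every required property holds for it."""
    return hypothesis <= example

def refinements(hypothesis, literals):
    """One step down the lattice: add a single extra condition."""
    return [hypothesis | {lit} for lit in literals if lit not in hypothesis]

def search(positives, negatives, literals, max_literals=3):
    """Breadth-first, general-to-specific search from the empty body."""
    frontier = [frozenset()]          # top of the lattice: active(X)
    seen = set(frontier)
    solutions = []
    while frontier:
        next_frontier = []
        for h in frontier:
            if not all(covers(h, e) for e in positives):
                continue              # too specific; refinements cover even less
            if not any(covers(h, e) for e in negatives):
                solutions.append(h)   # consistent; no need to specialize further
                continue
            if len(h) < max_literals:
                for r in refinements(h, literals):
                    if r not in seen:
                        seen.add(r)
                        next_frontier.append(r)
        frontier = next_frontier
    return solutions

# Invented toy examples: each molecule is the set of properties it satisfies.
literals = ["has_hydrophobic", "has_donor", "has_acceptor", "hydrophobic_donor_5A"]
positives = [frozenset({"has_hydrophobic", "has_donor", "hydrophobic_donor_5A"}),
             frozenset({"has_hydrophobic", "has_donor", "has_acceptor",
                        "hydrophobic_donor_5A"})]
negatives = [frozenset({"has_hydrophobic", "has_donor"}),
             frozenset({"has_donor", "has_acceptor"})]
solutions = search(positives, negatives, literals)
```

Pruning works because adding a condition can only shrink the set of covered molecules: once a hypothesis misses a positive example, nothing below it in the lattice can recover it.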
Conformational model
• Conformational flexibility modelled as multiple conformations:
  - Sybyl RandomSearch
  - Catalyst
Pharmacophore description
• Atom and site centred
  - Hydrogen bond donor
  - Hydrogen bond acceptor
  - Hydrophobe
  - Site points (limited at present)
  - User definable
• Distance based
Example I: Dopamine agonists
• Agonists taken from Martin data set on QSAR society web pages
• Examples (5-50 conformations/molecule)
[Figure: chemical structures of example dopamine agonists]
Pharmacophore identified
Molecule A has the desired activity if:
in conformation B molecule A contains a hydrogen acceptor at C, and
in conformation B molecule A contains a basic nitrogen group at D, and
the distance between C and D is 7.05966 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrogen acceptor at E, and
the distance between C and E is 2.80871 +/- 0.75 Angstroms, and
the distance between D and E is 6.36846 +/- 0.75 Angstroms, and
in conformation B molecule A contains a hydrophobic group at F, and
the distance between C and F is 2.68136 +/- 0.75 Angstroms, and
the distance between D and F is 4.80399 +/- 0.75 Angstroms, and
the distance between E and F is 2.74602 +/- 0.75 Angstroms.
Example II: ACE inhibitors
• 28 angiotensin converting enzyme inhibitors taken from literature
  - D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 316, (1987)
[Figure: chemical structures of example ACE inhibitors]
Experiment 1
• Attempt to identify pharmacophore using original Mayer et al. data (final conformations).
• Initial failed attempt traced to “bugs” in background knowledge definition.
• 4 pharmacophores found with corrected code (variations on common theme)
ACE pharmacophore
Molecule A is an ACE inhibitor if:
molecule A contains a zinc-site B,
molecule A contains a hydrogen acceptor C,
the distance between B and C is 7.899 +/- 0.750 Angstroms,
molecule A contains a hydrogen acceptor D,
the distance between B and D is 8.475 +/- 0.750 Angstroms,
the distance between C and D is 2.133 +/- 0.750 Angstroms,
molecule A contains a hydrogen acceptor E,
the distance between B and E is 4.891 +/- 0.750 Angstroms,
the distance between C and E is 3.114 +/- 0.750 Angstroms,
the distance between D and E is 3.753 +/- 0.750 Angstroms.
Pharmacophore discovered
Distance   Progol   Mayer
A          4.9      5.0
B          3.8      3.8
C          8.5      8.6
[Figure: pharmacophore diagram showing the zinc site and H-bond acceptor, with distances A, B, and C labeled]
Experiment 2
• Definition of “zinc ligand” added to background knowledge
  - Based on crystallographic data
• Multiple conformations
  - Sybyl RandomSearch
Experiment 2
• Original pharmacophore rediscovered, plus one other
  - Different zinc ligand position
  - Similar to alternative proposed by Ciba-Geigy
[Figure: alternative pharmacophore with labeled distances 4.0, 3.9, and 7.3]
Example III: Thermolysin inhibitors
• 10 inhibitors for which crystallographic data is available in PDB
• Conformationally challenging molecules
• Experimentally observed superposition
Key binding site interactions
[Figure: inhibitor in the thermolysin binding site, showing interactions with Asn112-NH, Asn112 C=O, Arg203-NH, Ala113 C=O, Zn, and the S1’ and S2’ pockets]
Interactions made by inhibitors
[Table: rows are the inhibitor structures 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN; columns are the interactions Asn112-NH, S2’, Asn112 C=O, Arg203 NH, S1’, Ala113 C=O, Zn; cell entries not recoverable]
Pharmacophore Identification
• Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN
• Conformational analysis using “Best” conformer generation in Catalyst
• 98-251 conformations/molecule
Thermolysin Results
• 10 five-point pharmacophores identified, falling into 2 groups (7/10 molecules)
  - 3 “acceptors”, 1 hydrophobe, 1 donor
  - 4 “acceptors”, 1 donor
• Common core of Zn ligands, Arg203 and Asn112 interactions identified
• Correct assignments of functional groups
• Correct geometry to 1 Angstrom tolerance
Thermolysin Results
• Increasing tolerance to 1.5 Angstroms finds a common 6-point pharmacophore including one extra interaction
Example IV: Antibacterial peptides
• Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa
  - 6 actives (IC50 < 64 mg/ml)
  - 5 inactives
Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa
if it has a conformation B such that:
M has a hydrophobic group C,
M has a hydrogen acceptor D,
the distance between C and D in conformation B is 11.7 Angstroms,
M has a positively-charged atom E,
the distance between C and E in conformation B is 4 Angstroms,
the distance between D and E in conformation B is 9.4 Angstroms,
M has a positively-charged atom F,
the distance between C and F in conformation B is 11.1 Angstroms,
the distance between D and F in conformation B is 12.6 Angstroms,
the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance: 1.5 Angstroms
Ongoing ILP developments (pharmacophores)
• Continue to extend method validation
• Extending to combinatorial mixtures
• Quantitative models
• Mixing different datatypes in background knowledge
• Developing graphical front-end
Ongoing developments (Other)
• Analysis of HTS datasets
• Analysis of “drug-likeness”
• Derivation of new descriptors
  - e.g., empirical binding functions