Transcript Docking
Exploring Chemical Space with
Computers—Challenges and
Opportunities
Pierre Baldi
UCI
Chemical Informatics
Historical perspective: physics,
chemistry and biology
Understanding chemical space
Small molecules (systems biology,
chemical synthesis, drug design,
nanotechnology)
Chemical Space
              Stars        Small Mol.
Existing      10^22        10^7
Virtual       0            10^60 (?)
Access        Difficult    “Easy”
Mode          Individual   Combinatorial
Chemical Space
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties
(classification/regression)
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Methods
Spectrum:
Schrödinger Equation
Molecular Dynamics
Machine Learning (e.g. SS prediction)
Chemical Informatics
Informatics must be able to deal with
variable-size structured data
Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels
Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:
Data (GenBank, Swissprot, PDB)
Similarity (BLAST)
Data
Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points): all 150 non-cyclic alkanes ($C_nH_{2n+2}$) with n<11 and their boiling points ([-164, 174])
Benzodiazepines (QSAR): 79 1,4-benzodiazepines-2-one, affinity towards GABA_A
ChemDB: 7M compounds
Similarity
Rapid Search of Large Databases
Protein Receptor (Docking)
Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless
Linear Classifiers
Classification
Learning to Classify
Limited number of training
examples (molecules, patients,
sequences, etc.)
Learning algorithm (how to
build the classifier?)
Generalization: should correctly
classify test data.
Formalization
X is the input space
Y (e.g. toxic/non-toxic, or {-1,+1}) is the target class
f: X→Y is the classifier.
Classification
Fundamental Point:
f is entirely determined
by the dot products $\langle x_i, x_j \rangle$
measuring the similarity
between pairs of data
points
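Why the dot products suffice can be made explicit. The following is a sketch, assuming the classifier is a linear threshold function whose weight vector is a combination of the training points (as in the perceptron or a support vector machine); the coefficients $\alpha_i$, the bias $b$, and the training-set size $n$ are notation introduced here, not taken from the slides:

```latex
\begin{aligned}
w &= \sum_{i=1}^{n} \alpha_i y_i x_i, \\
f(x) &= \operatorname{sign}\bigl(\langle w, x \rangle + b\bigr)
      = \operatorname{sign}\Bigl(\sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b\Bigr).
\end{aligned}
```

Under this assumption the inputs enter the classifier only through dot products with the training points, so any similarity measure that behaves like a dot product can be substituted.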
Non Linear Classification
(Kernel Methods)
We can transform a nonlinear problem
into a linear one using a kernel K.
Fundamental property: the linear
decision surface depends on
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity
matrix K. K defines the local metric of
the embedding space.
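A small numerical check can make the kernel property concrete. The sketch below assumes a homogeneous quadratic kernel on 2D points as the example; the kernel choice, the feature map `phi`, and the sample points are illustrative and do not come from the slides. It shows that the Gram matrix computed from the kernel alone equals the one computed from explicit dot products in the embedding space.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map whose dot product is the quadratic kernel (2D inputs)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def quadratic_kernel(x, z):
    """K(x, z) = <x, z>^2, computed without building phi(x)."""
    return dot(x, z) ** 2

def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

# Illustrative points (hypothetical data).
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, quadratic_kernel)
K_explicit = gram_matrix(points, lambda p, q: dot(phi(p), phi(q)))
```

Both matrices agree up to floating-point rounding, so a linear method that only reads `K` never needs the embedding itself.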
Similarity: Data Representations
[Figure: 2D structure diagram of a molecule, shown with its SMILES string]
NC(O)C(=O)O
Molecular Representations
1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution
1D SMILES Kernel
[Figure: 2D structure diagrams of the two molecules]
Molecule 1: CCCCCCc1ccc(cc1O)O
Kmer    Count
CCCC    3
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1O    1
c1O)    1
1O)O    1
Molecule 2: CCCCCc1ccc(cc1)CO
Kmer    Count
CCCC    2
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1)    1
c1)C    1
1)CO    1
Count1 = Molecule 1, Count2 = Molecule 2
Kmer    Count1  Count2  Product
(cc1    1       1       1
1)CO    0       1       0
1O)O    1       0       0
1ccc    1       1       1
CCCC    3       2       6
CCCc    1       1       1
CCc1    1       1       1
Cc1c    1       1       1
c(cc    1       1       1
c1)C    0       1       0
c1O)    1       0       0
c1cc    1       1       1
cc(c    1       1       1
cc1)    0       1       0
cc1O    1       0       0
ccc(    1       1       1
Total: 15
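The k-mer counting on this slide can be sketched in a few lines. This is a minimal illustration, assuming the kernel value is the sum over shared 4-mers of the product of their counts, which is what the Product column and the total suggest; the function names are introduced here for illustration.

```python
from collections import Counter

def kmer_counts(smiles, k=4):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=4):
    """Sum, over k-mers present in both strings, of the product of their counts."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())

# The two SMILES strings from the slide.
mol_1 = "CCCCCCc1ccc(cc1O)O"
mol_2 = "CCCCCc1ccc(cc1)CO"
print(spectrum_kernel(mol_1, mol_2))  # 15, the total shown on the slide
```

Only the `CCCC` row contributes more than 1 to the sum (3 x 2 = 6); the other nine shared 4-mers contribute 1 each.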
2D Molecule Graph Kernel
For chemical compounds
atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
Count labeled paths
(CsNsCdO)
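Counting labeled paths can be sketched with a depth-first walk over the bond graph. This is an illustration under stated assumptions, not the authors' implementation: paths are simple (no atom repeated), each path is counted once per direction, and the four-atom toy molecule is hypothetical, chosen so that the path `CsNsCdO` from the slide appears.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds):
    """Count label strings of all simple paths with at most max_bonds bonds."""
    adjacency = {a: [] for a in atoms}
    for (a, b), bond_label in bonds.items():
        adjacency[a].append((b, bond_label))
        adjacency[b].append((a, bond_label))
    counts = Counter()

    def extend(node, visited, text, depth):
        counts[text] += 1  # every prefix of a walk is itself a path
        if depth == max_bonds:
            return
        for nxt, bond_label in adjacency[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, text + bond_label + atoms[nxt], depth + 1)

    for start in atoms:
        extend(start, {start}, atoms[start], 0)
    return counts

# Hypothetical toy molecule: C-N-C=O (s = single bond, d = double bond).
atoms = {0: "C", 1: "N", 2: "C", 3: "O"}
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "d"}
paths = labeled_paths(atoms, bonds, max_bonds=3)
```

The resulting count vector can be compared between molecules with a dot product or a normalized similarity; plain string concatenation is adequate here because every atom label is a single character.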
Fingerprints
Similarity Measures
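Two similarity measures named in the results table, Tanimoto and MinMax, can be sketched as follows. The definitions used are the common ones (Tanimoto on the sets of features present, MinMax on feature counts) and are an assumption about what the slides intend; the two toy fingerprints are hypothetical.

```python
def tanimoto(features_a, features_b):
    """Shared features divided by features present in either fingerprint."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minmax(counts_a, counts_b):
    """Sum of per-feature minimum counts divided by sum of maximum counts."""
    keys = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    den = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

# Hypothetical path-count fingerprints for two molecules.
fp_a = {"CsN": 2, "NsC": 2, "CdO": 1}
fp_b = {"CsN": 1, "CdO": 1, "CsO": 1}
print(tanimoto(fp_a, fp_b))  # 0.5: 2 shared features out of 4 distinct
print(minmax(fp_a, fp_b))    # 2/6: minima sum to 2, maxima sum to 6
```

Tanimoto ignores how often a feature occurs, while MinMax uses the counts, so the two can rank the same pair of molecules differently.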
3D Coordinate Kernel
[Figure: molecule with pairwise atom distances labeled 2.8 Å, 2.0 Å, 1.4 Å, 4.2 Å, 3.4 Å]
Atom Distance Histogram
[Figure: bar chart of Count against Distance (Angstroms)]
Distance  Count
0         0
1         5
2         7
3         3
4         1
5         0
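The histogram construction can be sketched from raw 3D coordinates. The binning rule is an assumption: the slides do not say how distances are assigned to bins, so this version floors each pairwise distance into 1 Å bins and compares two histograms with a dot product. The coordinates are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

def distance_histogram(coords, n_bins=6, bin_width=1.0):
    """Histogram of all pairwise atom distances, floored into fixed-width bins."""
    hist = [0] * n_bins
    for p, q in combinations(coords, 2):
        bin_index = int(math.dist(p, q) // bin_width)
        if bin_index < n_bins:  # distances beyond the last bin are ignored
            hist[bin_index] += 1
    return hist

def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(a * b for a, b in zip(h1, h2))

# Hypothetical coordinates (in Angstroms) for a four-atom molecule.
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.4)]
hist = distance_histogram(coords)

# Histogram read off the slide's Distance/Count table.
slide_hist = [0, 5, 7, 3, 1, 0]
similarity = histogram_kernel(hist, slide_hist)
```

Because the histogram depends only on interatomic distances, it is unchanged by rotating or translating the molecule.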
Example of Results
Summary
Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc…
Need for larger data sets and new models of cooperation in the
chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7
compounds, intelligent recognition of useful molecules,
information retrieval from literature, docking, prediction of
reaction rates, matching table of all proteins against all known
compounds, origin of life)
Chemistry version of the Turing test
ChemDB
7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental,
Computational)
Searchable
Web interface
Similarity, in silico reactions
Acknowledgements
Informatics
Liva Ralaivola
J. Chen
S. J. Swamidass
Yimeng Dou
Peter Phung
Jocelyne Bruand
Funding
NIH
NSF
IGB
Pharmacology
Daniele Piomelli
Chemistry
G. Weiss
J. S. Nowick
R. Chamberlin
New Questions
Predict drug-like molecules? toxicity?
How can we search efficiently? Intelligently?
New data structures and algorithms
Optimizing old structures
How can we understand this much data?
New Strategies
Cluster and visualize millions of data points
Define commercially accessible space.
Are there other useful things we can do with this?
Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.
Acknowledgements
Jocelyne Bruand
Peter Phung
Liva Ralaivola
S. Joshua Swamidass
Yimeng Dou
NIH/NSF/IGB
Questions
Docking
Query:
Binding Site of Protein
Scoring Function & Efficient Minimizer
…
Some Targets
P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)
P53
Drug Rescue of P53 Mutants
Docking → ChemDB
~6 million commercially available
compounds
Searchable, annotated, downloadable.
Other Databases:
Cambridge Structural Database
ChemBank
PubChem
Chemical Toxicity Prediction
By Kernel Methods
Jonathan Chen
S Joshua Swamidass
The Baldi Lab
Data Flow
Molecules (structure diagrams omitted) → Kernel → Gram Matrix
Gram Matrix + Toxicity State List → Linear Classifier → Predictions
Toxicity State List
ID    Toxic?
1     No
2     No
3     Yes
4     Yes
…     …
Gram Matrix
ID    1     2     3     4     …
1     21    4     5     10    …
2     4     14    5     3     …
3     5     5     15    6     …
4     10    3     6     23    …
…     …     …     …     …     …
Predictions
…
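The last step of this data flow, training a linear classifier from the Gram matrix and the toxicity labels alone, can be sketched with a kernel perceptron. The perceptron is a stand-in chosen for brevity; the slides say only "Linear Classifier" and do not name the algorithm. The 4×4 matrix and the four labels are the values shown on this slide (No = -1, Yes = +1); the function names are illustrative.

```python
def train_kernel_perceptron(K, y, epochs=100):
    """Learn mistake counts alpha using only Gram matrix entries K[i][j]."""
    n = len(y)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for j in range(n):
            score = sum(alpha[i] * y[i] * K[i][j] for i in range(n))
            if y[j] * score <= 0:  # misclassified (or undecided): reinforce example j
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop
            break
    return alpha

def predict(K, y, alpha, j):
    """Predicted class (+1 or -1) for the molecule in column j of K."""
    score = sum(alpha[i] * y[i] * K[i][j] for i in range(len(y)))
    return 1 if score > 0 else -1

# Gram matrix and labels for molecules 1 to 4 on the slide.
K = [[21, 4, 5, 10],
     [4, 14, 5, 3],
     [5, 5, 15, 6],
     [10, 3, 6, 23]]
y = [-1, -1, 1, 1]

alpha = train_kernel_perceptron(K, y)
predictions = [predict(K, y, alpha, j) for j in range(len(y))]
```

On these four molecules the loop stops after three passes and reproduces the training labels. To classify a new molecule, its kernel values against the training molecules would be used in place of a column of `K`.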
Results
Example of Results
Kernel/Method                    Mutag   MM      FM      MR      FR
Kashima (2003)                   89.1    61.0    61.0    62.8    66.7
Kashima (2003)                   85.1    64.3    63.4    58.4    66.1
1D SMILES spec.                  84.0    66.1    61.3    57.3    66.1
1D SMILES spec+                  85.6    66.4    63.0    57.6    67.0
2D Tanimoto                      87.8    66.4    64.2    63.7    66.7
2D MinMax                        86.2    64.0    64.5    64.5    66.4
2D Tanimoto, l = 1024, b = 1     87.2    66.1    62.4    65.7    66.9
2D Hybrid l = 1024, b = 1        87.2    65.2    61.9    64.2    65.8
2D Tanimoto, l = 512, b = 1      84.6    66.4    59.9    59.9    66.1
2D Hybrid l = 512, b = 1         86.7    65.2    61.0    60.7    64.7
2D Tanimoto, l = 1024 + MI       84.6    63.1    63.0    61.9    66.7
2D Hybrid l = 1024 + MI          84.6    62.8    63.7    61.9    65.5
2D Tanimoto, l = 512 + MI        85.6    60.1    61.0    61.3    62.4
2D Hybrid l = 512 + MI           86.2    63.7    62.7    62.2    64.4
3D Histogram                     81.9    59.8    61.0    60.8    64.4
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Datasets
Small Molecules as Undirected Labeled
Graphs of Bonds
atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Bioinformatics analogy:
Catalog (GenBank)
Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.