Transcript Docking

Exploring Chemical Space with
Computers—Challenges and
Opportunities
Pierre Baldi
UCI
Chemical Informatics



Historical perspective: physics,
chemistry and biology
Understanding chemical space
Small molecules (systems biology,
chemical synthesis, drug design,
nanotechnology)
Chemical Space
            Stars         Small Mol.
Existing    10^22         10^7
Virtual     0             10^60 (?)
Access      Difficult     “Easy”
Mode        Individual    Combinatorial
Chemical Space
Chemical Informatics





Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties
(classification/regression)
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Methods

Spectrum:

Schrödinger Equation


Molecular Dynamics


Machine Learning (e.g. SS prediction)
Chemical Informatics

Informatics must be able to deal with
variable-size structured data






Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels
Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:


Data (GenBank, Swissprot, PDB)
Similarity (BLAST)
Data

Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points): all 150 non-cyclic alkanes (CnH2n+2) with n < 11 and their boiling points ([-164, 174] °C)
Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A
ChemDB: 7M compounds
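The figure of 150 alkanes can be checked with a little arithmetic. The sketch below sums the number of constitutional isomers of CnH2n+2 for n = 1 to 10. The isomer counts are standard textbook values and are an assumption brought in for illustration; they do not appear on the slide, and stereoisomers are ignored.

```python
# Constitutional isomer counts for acyclic alkanes CnH2n+2, n = 1..10.
# Assumption: standard textbook values, not taken from the slides.
ALKANE_ISOMERS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 5, 7: 9, 8: 18, 9: 35, 10: 75}

def dataset_size(max_carbons: int) -> int:
    """Total number of non-cyclic alkanes with at most max_carbons carbon atoms."""
    return sum(count for n, count in ALKANE_ISOMERS.items() if n <= max_carbons)

total = dataset_size(10)  # n < 11
print(total)
```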
Similarity



Rapid Search of Large Databases

Protein Receptor (Docking)

Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless
Linear Classifiers
Classification

Learning to Classify




Limited number of training
examples (molecules, patients,
sequences, etc.)
Learning algorithm (how to
build the classifier?)
Generalization: should correctly
classify test data.
Formalization



X is the input space
Y (e.g. toxic/non-toxic, or {-1, +1}) is the target class
f: X→Y is the classifier.
Classification
Fundamental Point:
f is entirely determined by the dot products $\langle x_i, x_j \rangle$
measuring the similarity between pairs of data points
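A minimal sketch of this point, assuming the weight vector of the linear classifier is a combination of the training points, w = sum_i alpha_i y_i x_i (as is the case for the perceptron and the SVM). The toy data, the coefficients alpha and the function names are hypothetical. The score computed from w and the score computed only from dot products with the training points agree.

```python
import numpy as np

def primal_score(w, x):
    """Score of a linear classifier written with an explicit weight vector."""
    return float(np.dot(w, x))

def dual_score(alpha, y, X, x):
    """Same score, using only the dot products <x_i, x> with the training points."""
    return float(np.sum(alpha * y * (X @ x)))

# Hypothetical training points, labels in {-1, +1}, and coefficients.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([1.0, 0.5, 0.0, 2.0])

w = (alpha * y) @ X              # w = sum_i alpha_i * y_i * x_i
x_new = np.array([0.5, -1.5])    # an unseen point

print(primal_score(w, x_new), dual_score(alpha, y, X, x_new))
```

Because only dot products appear in the second form, they can later be swapped for any other similarity function that behaves like a dot product.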
Non Linear Classification
(Kernel Methods)
We can transform a nonlinear problem
into a linear one using a kernel K.
Fundamental property: the linear
decision surface depends on
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity
matrix K. K defines the local metric of
the embedding space.
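As an illustration, here is a small sketch using the degree-2 polynomial kernel K(x, z) = (x . z)^2 on two-dimensional inputs. The choice of kernel, the XOR-like toy data and the function names are assumptions made for the example, not something taken from the slides. The Gram matrix computed from K matches the dot products of the explicit feature map phi, and the classes become linearly separable along one coordinate of the embedding.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def gram_matrix(points, kernel):
    """Matrix of kernel values for all pairs of points."""
    return np.array([[kernel(a, b) for b in points] for a in points])

# XOR-like data: not linearly separable in the original 2D space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

K = gram_matrix(X, poly2_kernel)
K_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])

# In the embedding space the middle coordinate alone separates the classes.
separating_coordinate = np.array([phi(x)[1] for x in X])
```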

Similarity: Data Representations
[Figure: structure drawing of a small molecule]
SMILES: NC(O)C(=O)O
Molecular Representations





1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution
1D SMILES Kernel
[Figure: structure drawings of the two molecules]

CCCCCc1ccc(cc1)CO

Kmer    Count
CCCC    2
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1)    1
c1)C    1
1)CO    1

Kmer    Count1    Count2    Product
(cc1    1         1         1
1)CO    0         1         0
1O)O    1         0         0
1ccc    1         1         1
CCCC    3         2         6
CCCc    1         1         1
CCc1    1         1         1
Cc1c    1         1         1
c(cc    1         1         1
c1)C    0         1         0
c1O)    1         0         0
c1cc    1         1         1
cc(c    1         1         1
cc1)    0         1         0
cc1O    1         0         0
ccc(    1         1         1
Total: 15

CCCCCCc1ccc(cc1O)O

Kmer    Count
CCCC    3
CCCc    1
CCc1    1
Cc1c    1
c1cc    1
1ccc    1
ccc(    1
cc(c    1
c(cc    1
(cc1    1
cc1O    1
c1O)    1
1O)O    1
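The k-mer counting on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the kernel is the sum over shared 4-mers of the product of their counts and that the SMILES string is treated as a raw character sequence (so a two-character atom symbol such as Cl would be split). The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(smiles: str, k: int = 4) -> Counter:
    """Count every substring of length k in the SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1: str, s2: str, k: int = 4) -> int:
    """Sum over shared k-mers of the product of their counts."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())

mol_a = "CCCCCc1ccc(cc1)CO"
mol_b = "CCCCCCc1ccc(cc1O)O"
similarity = spectrum_kernel(mol_a, mol_b)
print(similarity)
```

With the two SMILES strings from the slide this gives the same total of 15 as the product column.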
2D Molecule Graph Kernel

For chemical compounds




atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
Count labeled paths
(CsNsCdO)
Fingerprints
Similarity Measures
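A rough sketch of how labeled paths could be counted and compared. Assumptions: paths are simple (no atom visited twice), limited to three bonds, and counted from every starting atom; the two four-atom molecules are hypothetical and written by hand. The fingerprints in the slides are presumably fixed-length bit vectors; this sketch skips the hashing step and compares the path sets and counts directly with the Tanimoto and MinMax measures.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds=3):
    """Count label strings of all simple paths with at most max_bonds bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))
    counts = Counter()

    def extend(node, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return counts

def tanimoto(c1, c2):
    """Shared paths divided by all paths, ignoring counts."""
    a, b = set(c1), set(c2)
    return len(a & b) / len(a | b)

def minmax(c1, c2):
    """Sum of minimum counts divided by sum of maximum counts."""
    keys = set(c1) | set(c2)
    return sum(min(c1[k], c2[k]) for k in keys) / sum(max(c1[k], c2[k]) for k in keys)

atoms = ["C", "N", "C", "O"]
paths_1 = labeled_paths(atoms, {(0, 1): "s", (1, 2): "s", (2, 3): "d"})  # C-N-C=O
paths_2 = labeled_paths(atoms, {(0, 1): "s", (1, 2): "s", (2, 3): "s"})  # C-N-C-O
print(paths_1["CsNsCdO"], tanimoto(paths_1, paths_2), minmax(paths_1, paths_2))
```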
3D Coordinate Kernel
[Figure: 3D structure annotated with interatomic distances of 2.8 Å, 2.0 Å, 1.4 Å, 4.2 Å and 3.4 Å]

Atom Distance Histogram
[Figure: bar chart of Count versus Distance (Angstroms)]

Distance (Angstroms)    Count
0                       0
1                       5
2                       7
3                       3
4                       1
5                       0
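A minimal sketch of a distance-histogram kernel, under stated assumptions: distances are taken over all atom pairs, binned into unit-width bins starting at 0 Å, and two molecules are compared by the dot product of their histograms. The binning rule, the comparison rule, the function names and the toy coordinates are all hypothetical; the slides only show the histogram itself.

```python
import numpy as np
from itertools import combinations

def distance_histogram(coords, bin_width=1.0, n_bins=6):
    """Histogram of all pairwise atom distances in bins of bin_width Angstroms."""
    coords = np.asarray(coords, dtype=float)
    dists = [np.linalg.norm(coords[i] - coords[j])
             for i, j in combinations(range(len(coords)), 2)]
    edges = np.arange(n_bins + 1) * bin_width
    hist, _ = np.histogram(dists, bins=edges)
    return hist

def histogram_kernel(coords_1, coords_2, **kwargs):
    """Dot product of the two distance histograms."""
    h1 = distance_histogram(coords_1, **kwargs)
    h2 = distance_histogram(coords_2, **kwargs)
    return int(np.dot(h1, h2))

# Toy example: three atoms on a line, 1.5 Angstroms apart.
line = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]]
print(distance_histogram(line), histogram_kernel(line, line))
```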
Example of Results
Summary







Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc…
Need for larger data sets and new models of cooperation in the
chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7
compounds, intelligent recognition of useful molecules,
information retrieval from literature, docking, prediction of
reaction rates, matching table of all proteins against all known
compounds, origin of life)
Chemistry version of the Turing test
ChemDB







7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental,
Computational)
Searchable
Web interface
Similarity, in silico reactions
Acknowledgements

Informatics







Liva Ralaivola
J. Chen
S. J. Swamidass
Yimeng Dou
Peter Phung
Jocelyne Bruand
Funding



NIH
NSF
IGB

Pharmacology


Daniele Piomelli
Chemistry



G. Weiss
J. S. Nowick
R. Chamberlin
New Questions

Predict drug-like molecules? toxicity?


How can we search efficiently? Intelligently?



New data structures and algorithms
Optimizing old structures
How can we understand this much data?



New Strategies
Cluster and visualize millions of data points
Define commercially accessible space.
Are there other useful things we can do with this?



Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.
Acknowledgements






Jocelyne Bruand
Peter Phung
Liva Ralaivola
S. Joshua Swamidass
Yimeng Dou
NIH/NSF/IGB
Questions
Docking
Query:
Binding Site of Protein
Scoring Function & Efficient Minimizer
…
Some Targets




P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)
P53
Drug Rescue of P53 Mutants
Docking → ChemDB



~6 million commercially available
compounds
Searchable, annotated, downloadable.
Other Databases:



Cambridge Structural Database
ChemBank
PubChem
Chemical Toxicity Prediction
By Kernel Methods
Jonathan Chen
S Joshua Swamidass
The Baldi Lab
Data Flow
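[Figure: labeled training molecules are passed through a kernel to give a Gram matrix; the Gram matrix and the toxicity state list feed a linear classifier, which outputs predictions. Structure drawings omitted.]

ID    Toxic?
1     No
2     No
3     Yes
4     Yes
…     …

Gram Matrix

ID    1     2     3     4     …
1     21    4     5     10    …
2     4     14    5     3     …
3     5     5     15    6     …
4     10    3     6     23    …
…     …     …     …     …     …

Kernel → Gram Matrix
Gram Matrix + Toxicity State List → Linear Classifier → Predictions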
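The flow on this slide can be sketched end to end using the 4×4 Gram matrix and the No/No/Yes/Yes labels as read from the slide. The slide only says "Linear Classifier"; the kernel perceptron used here is an assumption chosen for brevity, and the function names are hypothetical.

```python
import numpy as np

# Gram matrix and toxicity labels for molecules 1-4, as read from the slide.
K = np.array([[21, 4, 5, 10],
              [4, 14, 5, 3],
              [5, 5, 15, 6],
              [10, 3, 6, 23]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)  # No, No, Yes, Yes

def fit_kernel_perceptron(K, y, epochs=100):
    """Learn one coefficient per training molecule from the Gram matrix alone."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        updated = False
        for i in range(len(y)):
            if y[i] * np.dot(alpha * y, K[:, i]) <= 0:  # molecule i misclassified
                alpha[i] += 1
                updated = True
        if not updated:
            break
    return alpha

def predict(alpha, y, k_column):
    """k_column holds the kernel values between the query and each training molecule."""
    return "Yes" if np.dot(alpha * y, k_column) > 0 else "No"

alpha = fit_kernel_perceptron(K, y)
predictions = [predict(alpha, y, K[:, i]) for i in range(len(y))]
print(predictions)
```

To classify a new molecule, only its kernel values against the training molecules are needed; the molecule itself never has to be turned into a fixed-length vector.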
Results
Example of Results
Kernel/Method                   Mutag   MM      FM      MR      FR
Kashima (2003)                  89.1    61.0    61.0    62.8    66.7
Kashima (2003)                  85.1    64.3    63.4    58.4    66.1
1D SMILES spec.                 84.0    66.1    61.3    57.3    66.1
1D SMILES spec+                 85.6    66.4    63.0    57.6    67.0
2D Tanimoto                     87.8    66.4    64.2    63.7    66.7
2D MinMax                       86.2    64.0    64.5    64.5    66.4
2D Tanimoto, l = 1024, b = 1    87.2    66.1    62.4    65.7    66.9
2D Hybrid l = 1024, b = 1       87.2    65.2    61.9    64.2    65.8
2D Tanimoto, l = 512, b = 1     84.6    66.4    59.9    59.9    66.1
2D Hybrid l = 512, b = 1        86.7    65.2    61.0    60.7    64.7
2D Tanimoto, l = 1024 + MI      84.6    63.1    63.0    61.9    66.7
2D Hybrid l = 1024 + MI         84.6    62.8    63.7    61.9    65.5
2D Tanimoto, l = 512 + MI       85.6    60.1    61.0    61.3    62.4
2D Hybrid l = 512 + MI          86.2    63.7    62.7    62.2    64.4
3D Histogram                    81.9    59.8    61.0    60.8    64.4
Chemical Informatics






Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.
Datasets
Small Molecules as Undirected Labeled
Graphs of Bonds


atom/node labels:
A = {C,N,O,H, … }
bond/edge labels:
B = {s, d, t, ar, … }
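One possible data structure for such a graph, sketched under stated assumptions: atoms are a list of labels from A, bonds are stored once per unordered atom pair with a label from B, and hydrogens are left implicit. The class and method names are hypothetical, and the example graph is a hand transcription of the SMILES string NC(O)C(=O)O shown earlier in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

@dataclass
class MoleculeGraph:
    atoms: List[str]                                   # node labels from A
    bonds: Dict[FrozenSet[int], str] = field(default_factory=dict)  # edge labels from B

    def add_bond(self, i: int, j: int, label: str) -> None:
        # A frozenset key makes the edge undirected: {i, j} == {j, i}.
        self.bonds[frozenset((i, j))] = label

    def bond_label(self, i: int, j: int) -> Optional[str]:
        return self.bonds.get(frozenset((i, j)))

    def neighbors(self, i: int) -> List[int]:
        return sorted(next(iter(pair - {i})) for pair in self.bonds if i in pair)

# Heavy atoms of NC(O)C(=O)O, indexed in SMILES order.
mol = MoleculeGraph(atoms=["N", "C", "O", "C", "O", "O"])
for i, j, label in [(0, 1, "s"), (1, 2, "s"), (1, 3, "s"), (3, 4, "d"), (3, 5, "s")]:
    mol.add_bond(i, j, label)

print(mol.neighbors(1), mol.bond_label(4, 3))
```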
Chemical Informatics




Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical
synthesis, drug design, nanotechnology)
Bioinformatics analogy:




Catalog (GenBank)
Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical
space to discover new drugs, new galaxies, etc.