Transcript Powerpoint

MDL Keys Revisited
Joseph L. Durant, Burton A. Leland,
Douglas R. Henry and
James G. Nourse
MDL Information Systems
Overview
• What are MDL Keys?
• Constructing better keys
– metrics
– optimization by "educated guesswork"
– optimization by Genetic Algorithms
• Conclusions
What are MDL Keys
• a.k.a. SSKeys
• Originally designed to support sub-structure
searching
• Bits encoding molecular features
• Most follow the structure of:
–
–
–
–
a property on atom A
a property on atom B
A and B are separated by N bonds (0<=N<=4)
this pattern is encountered M or more times
MDL Keys - What Are They?
• Some keys code for specific bonds (C-Cl,
S-P)
• Other keys code for a property in an atomic
neighborhood
(C-CCO, Q-OO)
• Still others are custom properties
– Sgroup properties
– rings
– atom types
MDL Keys - Standard
Implementation
• MDL’s SSKeys are encountered in 2
flavors:
– a 960 keybitset
– a 166 keybitset (Subset or User Keys)
The 960 Keybitset
• Created to support substructure searching
• Encodes 1387 molecule features
• Encodes features with >0, >1, >2 and >4
occurrences
• Features can turn on 1, 2 or 3 keybits
• many of the keybits can be set by multiple
features
The 166 Keybitset
• Originally created to embody an earlier
MDL keybitset
• Largely correspond to “chemistmeaningful” features
166 Keybitset Definitions
1 - isotope
2 - 103<atomic number<256
.
84 - NH2
85 - CN(C)C
86 - CH2QCH2
.
165 - ring
166 - fragments
Current Uses for MDL Keys
• Clustering/diversity
– Brown & Martin, JCICS, 1996, 36, 572-584.
– McGregor & Pallai, JCICS, 1997, 37, 443-448.
• Library generation/evaluation
–
–
–
–
Brown & Martin, J. Med. Chem., 1997, 40, 2304-2313.
Koehler, Dixon, & Villar, J. Med. Chem.,1999, 42, 4695-4704.
Ajay, Bemis, & Murcko, J. Med. Chem., 1999, 42, 4942-4951.
Koehler & Villar, J. Comp. Chem., 2000, 21, 1145-1152.
• Information content/comparison
– Brown & Martin, JCICS, 1997, 37, 1-9.
– Jamois, Hassan, & Waldman, JCICS, 2000, 40, 63-70.
– Briem & Lessel, Perspect. Drug Disc. Des., 2000, 20, 231-244.
Can We Construct Better Keys?
• Keybitsets optimized for substructure
searching
• Keybitsets constructed to minimize
memory/storage footprint
• But they work remarkably well already
But...
• bit-setting algorithm has untapped power
• algorithm defines ~3200 unique features
• algorithm allows keybit to be set for "N or
more occurrences"
Find a Metric
Success Measure
• Defined by Briem and Lessel, Perspect.
Drug Disc. Des., 20, 231 (2000).
• Modified to account for ties
• Evaluates the ability to differentiate classes
of activity
Test Set
•
•
•
•
•
•
134 PAF antagonists
49 5-HT3 antagonists
49 TXA2 antagonists
40 ACE inhibitors
111 HMG-CoA reductase inhibitors
574 "random" MDDR compounds
Success Measure - Evaluation
• Calculate the 10 nearest neighbors for each
"active" molecule
• Calculate the fraction of nearest neighbors
in the same activity class as the target
• Allow for ties; expand the number of
neighbors until the tie is broken
Success Measure
Starting Points...
• 166 keybitset
• 960 keybitset
• 3234 keybitset
Modifying the 960 Keybitset
• all the "singly mapped" keybits
– 726 keybitset
• all the 960 keybitset features, one feature
per bit
– 1387 keybitset
Initial Success Measures
Optimization?
Results of Random Pruning
Intelligent Selection
(Educated Guesswork)
• Differentiating compounds
– active from inactive
– active from other actives
Surprisal Analysis
• Surprisal =
log ( probability 1 / probability 2)
probability 1 = "active" molecules
probability 2 = "inactive" molecules
• assume Poisson-distributed errors
• | Surprisal S/N | =
| Surprisal / ssurprisal |
Surprisals for 166 Keybitset
Surprisal S/N for 166 Keybitset
Surprisal Pruning
Success Measure vs.
Surprisal S/N
Success Measure vs. # of Keys
Success Measure vs. # of Keys
Success Measures
Success Measures
What About Multiple
Occurrences?
• Keybits can be set for >0, >1, >2,...
occurrences of features
• Inclusion of multiple occurrence keybits
enhances performance for substructure
searching
Assembling a Composite
Keybitset
• Construct keybitsets for >0, >1, >2, >3...
occurrences
• Surprisal prune to the 2-sigma level
• Concatenate the resulting keybitsets
– only add keybits for new features
Success Measure
• Success Measure increases until
"7 or more" occurrences
• 1283 keybits in final set
• Final success measure = 71.26%
Genetic Algorithm
• We used the SUGAL genetic algorithm
package
– written by Dr. Andrew Hunter at University of
Sunderland, UK
• Identification of local minima is
straightforward
• Small keybitsets with good performance can
be identified
• The global minimum is elusive
Final Success Measures
Conclusions
• Key performance can be substantially
improved by reoptimizing keybitsets
• Key performance is not substantially
improved for MDL keybitsets longer than
~500 bits
Acknowledgements
• use of SUGAL Genetic Algorithm Package,
written by Dr. Andrew Hunter at University
of Sunderland, UK
• correspondence with and MDDR extregs
from Dr. Hans Briem, Boehringer
Ingelheim Pharma KG