
Active Learning Strategies for Drug Screening
Megon Walker1 and Simon Kasif1,2
1Bioinformatics Program, Boston University, Boston, MA
2Department of Biomedical Engineering, Boston University, Boston, MA
1. Introduction
2. Objectives
At the intersection of drug discovery and experimental design, active learning algorithms guide selection of successive compound batches for biological assays when screening a chemical library, in order to identify many target binding compounds with minimal screening iterations.1-3
• exploitation: optimize the number of target binding (active) drugs retrieved with each batch
• exploration: optimize the prediction accuracy of the committee during each iteration of querying
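The two objectives above drive a query-retrain cycle. The loop below is a minimal illustrative sketch, not the authors' pipeline: the retraining step is stubbed out and the selection strategy is random, where the real pipeline would rank candidates by uncertainty, density, or P(active).

```python
import random

def run_active_learning(features, labels, batch_frac=0.05, rounds=6, seed=0):
    """Toy batch active learning loop: each round, select a batch of
    unlabeled compounds, 'screen' them (reveal their labels), and retrain."""
    rng = random.Random(seed)
    n = len(features)
    batch = max(1, int(batch_frac * n))
    unlabeled = set(range(n))
    labeled = []                        # indices whose labels are known
    hits = 0                            # actives retrieved so far
    for _ in range(rounds):
        if not unlabeled:
            break
        # selection strategy: random here; uncertainty/density/P(active)
        # would rank `unlabeled` with the current committee instead
        chosen = rng.sample(sorted(unlabeled), min(batch, len(unlabeled)))
        for i in chosen:
            unlabeled.discard(i)
            labeled.append(i)
            hits += labels[i] == "A"    # exploitation: count actives found
        # exploration: retrain the committee on all labeled examples here
        # (omitted in this stub)
    return hits, len(labeled)
```

With a 5% batch size, six rounds label 30% of the library, matching the screening fraction reported in the results.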
[Figure 2 flowchart, start: input data files with compound descriptors; designate training and testing sets for this round of cross validation.]
3. Methods
Datasets
• a binary feature vector for each compound indicates the presence or absence of structural fragments
• 200 features with highest feature-activity mutual information (MI) selected for each dataset
• retrospective data: labels provided with the features
• labels: target binding or active (A); non-binding or inactive (I)
• 632 DuPont thrombin-targeting compounds4 (149 A, 483 I, mean MI = 0.126)
• 1346 Abbott monoamine oxidase inhibitors5 (221 A, 1125 I, mean MI = 0.006)
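The MI ranking step can be computed directly from co-occurrence counts of each binary feature with the activity label. A small sketch, with illustrative function names (the poster does not specify its implementation):

```python
from math import log2

def mutual_information(feature_col, labels):
    """MI (in bits) between one binary (0/1) feature and activity labels."""
    n = len(labels)
    joint = {}
    for f, y in zip(feature_col, labels):
        joint[(f, y)] = joint.get((f, y), 0) + 1
    # marginal distributions of the feature and of the label
    pf = {f: sum(c for (ff, _), c in joint.items() if ff == f) / n for f in {0, 1}}
    py = {y: sum(c for (_, yy), c in joint.items() if yy == y) / n for y in set(labels)}
    mi = 0.0
    for (f, y), c in joint.items():
        pxy = c / n
        mi += pxy * log2(pxy / (pf[f] * py[y]))
    return mi

def top_k_features(X, labels, k=200):
    """Indices of the k features with highest feature-activity MI."""
    scores = [(mutual_information([row[j] for row in X], labels), j)
              for j in range(len(X[0]))]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```

A feature that perfectly predicts a balanced binary label scores 1 bit; an independent feature scores 0, which is why the thrombin set (mean MI = 0.126) is described as having higher feature information content than the MAOI set (mean MI = 0.006).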
Pipeline
• 5-fold cross validation
• 5% batch size
• 5 classifiers in the committee (Figure 2)
• perceptron classifier data shown
Classifier committees
• bagging: samples from the labeled training data with uniform distribution
• boosting: samples from the labeled training data with a varied sampling distribution such that compounds misclassified by the previously obtained hypothesis are more likely to be sampled again
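The two committee-construction schemes differ only in how the training sub-samples are drawn. A sketch of both, plus the weighted majority vote used to combine members; the down-weighting constant `beta` and the function names are illustrative, not the poster's exact scheme:

```python
import random

def bagging_samples(data, n_classifiers=5, seed=0):
    """Bagging: each committee member gets a bootstrap sample drawn
    uniformly, with replacement, from the labeled training data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_classifiers)]

def boosting_weights(weights, misclassified, beta=0.5):
    """Boosting-style reweighting: down-weight correctly classified
    examples so compounds misclassified by the previous hypothesis
    are more likely to be sampled for the next classifier."""
    new = [w if miss else w * beta
           for w, miss in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]       # renormalize to a distribution

def committee_vote(predictions, weights=None):
    """Weighted majority vote over committee members' A/I predictions."""
    weights = weights or [1.0] * len(predictions)
    score = {}
    for p, w in zip(predictions, weights):
        score[p] = score.get(p, 0.0) + w
    return max(score, key=score.get)
```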
Sample selection strategies
• random
• uncertainty: compounds on which the committee disagrees most strongly are selected
• density with respect to actives: compounds most similar to previously labeled or predicted actives are selected (Tanimoto similarity metric)
• P(active): compounds predicted active with highest probability by the committee are selected
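Each non-random strategy reduces to ranking the unlabeled pool by a different score (random selection simply samples uniformly). A sketch under the assumption that per-compound committee vote fractions and activity probabilities are available; the argument names are illustrative:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints (0/1 lists)."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0

def select_batch(strategy, candidates, batch, committee_votes=None,
                 known_actives=None, fingerprints=None, p_active=None):
    """Rank unlabeled candidate indices and return the top `batch` of them.
    committee_votes[i]: fraction of committee members predicting active.
    p_active[i]: committee probability that compound i is active."""
    if strategy == "uncertainty":
        # strongest disagreement: vote fraction closest to 0.5
        key = lambda i: -abs(committee_votes[i] - 0.5)
    elif strategy == "density":
        # most similar to any previously labeled or predicted active
        key = lambda i: max(tanimoto(fingerprints[i], fingerprints[a])
                            for a in known_actives)
    elif strategy == "p_active":
        key = lambda i: p_active[i]
    else:
        raise ValueError(strategy)
    return sorted(candidates, key=key, reverse=True)[:batch]
```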
The drug screening pipeline proposed here combines committee-based active learning with bagging and boosting techniques and several options for sample selection. Our best strategy retrieves up to 87% of the active compounds after screening only 30% of the chemical datasets analyzed.
Figure 1: The Drug Discovery Cycle
[Diagram: a drugs × features binary descriptor matrix feeds sample selection (random, uncertainty, density, P(active), domain knowledge); the selected compounds go to screening, and their labels return to the model.]
[Figure 2 flowchart, continued: the 1st batch is selected by the chemist's domain knowledge; labels for a batch from the unlabeled training set are queried; a committee of classifiers (naïve Bayes or perceptron components, combined by bagging or boosting) is trained on sub-samples from the labeled training set drugs; unlabeled testing set and training set drugs are classified by the committee (weighted majority vote).]
Figure 3: Querying for labels & training classifiers on sub-samples
[Diagram: after the 1st query, each committee classifier trains on a sub-sample of the drugs labeled A/I so far, while the remaining training and testing set drugs stay unlabeled (?); after the 2nd query, the labeled pool grows and the classifiers are retrained.]
[Figure 2 flowchart, end: the querying loop repeats until all training set labels are queried; when cross validation is completed, accuracy and performance statistics are computed.]

The active learning paradigm refers to the ability of the learner to modify the sampling strategy of data chosen for training based on previously seen data. During each round of screening, the active learning algorithm selects a batch of unlabeled compounds to be tested for target binding activity and added to the training set. Once the labels for this batch are known, the model of activity is recomputed on all examples labeled so far, and a new chemical set for screening is selected (Figure 1).

Figure 2: Pipeline Flowchart

4. Results
• exploitation: number of active drugs retrieved with each batch queried
• P(active) sample selection shows best hit performance when feature information content is higher (Figure 4a)
-after 30% of drugs are labeled (cross validation averages):
1. P(active) retrieves 84% actives
2. density retrieves 77% actives
3. uncertainty retrieves 65% actives
4. random retrieves 42% actives
• density sample selection strategy shows best initial hit performance when feature information content is lower (Figure 4b)
-classifier sensitivity is compromised
-linear hit performance for all strategies after 20% of drugs labeled
• exploration: the prediction accuracy of the committee on the testing data set during each iteration of querying
• uncertainty sample selection shows best testing set sensitivity
• increases in the labeled training set size during progressive rounds of querying result in no significant increase in testing set sensitivity (Figure 4c)
-labeled training set ratio actives:inactives biases the classifier?
-multiple modes of drug activity present in datasets?
• tradeoff: sample selection methods resulting in the best hit performance display the lowest testing set sensitivity (Figure 4c)
• bagging and boosting methods do not result in significantly different hit performance for any sample selection strategy on these datasets
• bagging and boosting techniques significantly enhance the testing set sensitivity of the component learning algorithm (Figure 4d)

Figure 4: Hit Performance and Sensitivity
[Panels a. Thrombin Hit Performance and b. MAOI Hit Performance plot the total number of hits (actives queried) against the fraction of examples labeled; panels c. Bagged Committee Sensitivity and d. Single Classifier vs. Bagged Committee Sensitivity plot thrombin testing set sensitivity against the fraction of examples labeled.]

5. Discussion
Future work will involve ROC and precision-recall analysis, along with comparison of various classifiers and feature descriptors.
6. References
1. N. Abe, and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. ICML 1998, 1-9.
2. G. Forman. Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery.
BIOKDD 2002, 33-36.
3. M. Warmuth, G. Ratsch, M. Mathieson, J. Liao, C. Lemmen. Active Learning in the Drug Discovery Process. NIPS
2001, 1449-1456.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
5. R. Brown and Y. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. Journal of Chemical Information and Computer Sciences, 1996, 36, 572-584.