NaiveBayesian


Transcript NaiveBayesian

Bayesian Learning
• Build a model which estimates the likelihood that a
given data sample is from a "good" subset of a larger
set of samples (classification learning)
• SciTegic uses modified Naïve Bayesian statistics
– Efficient:
• scales linearly with large data sets
– Robust:
• works for a few as well as many ‘good’ examples
– Unsupervised:
• no tuning parameters needed
– Multimodal:
• can model broad classes of compounds
• multiple modes of action represented in a single model
Learn Good from Bad
• “Learn Good from Bad” examines what distinguishes
“good” from “baseline” compounds
– Molecular properties (molecular weight, AlogP, etc.)
– Molecular fingerprints
[Figure: an example “good” structure alongside baseline compounds]
Learning: “Learn Good From Bad”
• User provides name for new
component and a “Test for
good”, e.g.:
– Activity > 0.5
– Conclusion EQ ‘CA’
• User specifies properties
– Typical: fingerprints, alogp,
donors/acceptors, number of
rotatable bonds, etc.
• Model is new component
• Component calculates a
number
– The larger the number, the
more likely a sample is “good”
Using the model
• Model can be used to prioritize samples for screening, or search
vendor libraries for new candidates for testing
• Quality of model can be evaluated:
– Split data into training and test sets
– Build model using training set
– Sort test set using model value
– Plot how rapidly hits are found in sorted list
Using a Learned Model
• Model appears on
your tab in
LearnedProperties
– Drag it into a protocol
to use it “by value”
– Refer to it by name to
use it “by reference”
Fingerprints
ECFP: Extended Connectivity
Fingerprints
• New class of fingerprints for molecular characterization
– Each bit represents the presence of a structural (not
substructural) feature
– 4 Billion different bits
– Multiple levels of abstraction contained in single FP
– Different starting atom codes lead to different fingerprints
(ECFP, FCFP, ...)
– Typical molecule generates 100s - 1000s of bits
– Typical library generates 100K - 10M different bits.
Advantages
• Fast to calculate
• Represents much larger number of features
• Features not "pre-selected"
• Represents tertiary/quaternary information
– As opposed to path-based fingerprints
• Bits can be “interpreted”
FCFP: Initial Atom Codes
[Figure: an example molecule labeled with its initial FCFP atom codes, e.g. aromatic carbons = 16, the N = 3, the O = 1, an aliphatic carbon = 0]
FCFP Atom code bits from:
1: Has lone pairs
2: Is H-bond donor
4: Is negative ionizable
8: Is positive ionizable
16: Is aromatic
32: Is halogen
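As a small illustration (my own Python sketch, not the SciTegic implementation), an initial FCFP atom code is just the bitwise OR of the property flag bits listed above:

# Sketch (not SciTegic's code): combine the binary atom properties
# into one small integer code by OR-ing their flag bits together.
LONE_PAIRS, H_DONOR, NEG_IONIZABLE, POS_IONIZABLE, AROMATIC, HALOGEN = 1, 2, 4, 8, 16, 32

def fcfp_atom_code(lone_pairs=False, donor=False, neg=False, pos=False,
                   aromatic=False, halogen=False):
    code = 0
    if lone_pairs: code |= LONE_PAIRS
    if donor:      code |= H_DONOR
    if neg:        code |= NEG_IONIZABLE
    if pos:        code |= POS_IONIZABLE
    if aromatic:   code |= AROMATIC
    if halogen:    code |= HALOGEN
    return code

print(fcfp_atom_code(lone_pairs=True, donor=True))   # 3, like the N in the figure
print(fcfp_atom_code(aromatic=True))                 # 16, an aromatic carbon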
ECFP: Generating the Fingerprint
• Iteration is repeated desired number of times
– Each iteration extends the diameter by two bonds
• Codes from all iterations are collected
• Duplicate bits may be removed
Example bit lists for one molecule, written to SD-file tags after 0, 2, and 4 iterations (each later list contains all the earlier bits plus new, larger-environment bits):

> <FCFP_0#S>
16  0  1  3  ...

> <FCFP_2#S>
16  0  1  3  1618154665  203677720  -1549103449  1872154524  1070061035  ...

> <FCFP_4#S>
16  0  1  3  1618154665  203677720  -1549103449  1872154524  1070061035
991735244  -453677277  -581879738  -1094243697  690083042  -975279903  ...
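As a conceptual sketch only (my own Python illustration; the real ECFP algorithm also folds in bond information and uses a specific 32-bit hash), the iteration can be pictured like this:

# Conceptual sketch of the ECFP/FCFP iteration: each atom starts from its
# initial code; every iteration hashes the atom's code together with its
# neighbours' codes, growing the covered diameter by two bonds; codes from
# all iterations are pooled, with duplicates removed.
def ecfp_bits(atoms, neighbors, initial_codes, n_iterations=2):
    # atoms: list of atom indices; neighbors: dict atom -> list of atoms;
    # initial_codes: dict atom -> int (e.g. FCFP atom codes)
    codes = dict(initial_codes)
    bits = set(codes.values())            # iteration-0 bits
    for _ in range(n_iterations):
        new_codes = {}
        for a in atoms:
            env = (codes[a],) + tuple(sorted(codes[n] for n in neighbors[a]))
            new_codes[a] = hash(env) & 0xFFFFFFFF   # fold into a 32-bit code
        codes = new_codes
        bits |= set(codes.values())       # collect this iteration's codes
    return bits                           # the set removes duplicate bits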
ECFP: Extending the Initial Atom
Codes
• Fingerprint bits
indicate presence and
absence of certain
structural features
• Fingerprints do not
depend on a
predefined set of
substructural features
Each iteration adds bits that represent larger and larger structures.
[Figure: the atom environment at iteration 0, iteration 1, and iteration 2, growing outward from the starting atom]
The Statistics Table: Features
• A feature is a binary attribute of a data record
– For molecules, it may be derived from a property range or a fingerprint
bit
• A molecule typically contains a few hundred features
• A count of each feature is kept:
– Over all the samples
– Over all samples that pass the test for good
• The Normalized Probability is log(Laplacian-corrected
probability)
• The normalized probabilities are summed over all features to
give the relative score.
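A minimal sketch of such a statistics table (my own Python illustration, not the actual component):

from collections import defaultdict

def build_feature_table(samples):
    # samples: iterable of (features, is_good) pairs, where features is a set
    # of feature ids (fingerprint bits, property-range bins, ...) and is_good
    # is the result of the "test for good".
    table = defaultdict(lambda: [0, 0])   # feature -> [N_F, A_F]
    n_total = n_good = 0
    for features, is_good in samples:
        n_total += 1
        n_good += int(is_good)
        for f in features:
            table[f][0] += 1              # count over all samples
            table[f][1] += int(is_good)   # count over "good" samples
    return table, n_total, n_good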
Normalized Probability
• Given a set of N samples
• Given that some subset A of them are good (‘active’)
– Then we estimate for a new compound: P(good) ~ A / N
• Given a set of binary features Fi
– For a given feature F:
• It appears in NF samples
• It appears in AF good samples
– Can we estimate P(good | F) ~ AF / NF?
• (Problem: the error gets worse as NF becomes small)
Quiz Time
• Have an HTS screen with 1% actives
• Have two new samples X and Y to test
• For each sample, we are given the results from one
feature (FX and FY)
• Which one is most likely to be active?
Question 1
• Sample X:
– AFx: 0
– NFx: 100
• Sample Y:
– AFy: 100
– NFy: 100
Question 2
• Sample X:
– AFx: 0
– NFx: 100
• Sample Y:
– AFy: 1
– NFy: 100
Question 3
• Sample X:
– AFx: 0
– NFx: 100
• Sample Y:
– AFy: 0
– NFy: 0
Question 4
• Sample X:
– AFx: 2
– NFx: 100
• Sample Y:
– AFy: 0
– NFy: 0
Question 5
• Sample X:
– AFx: 2
– NFx: 4
• Sample Y:
– AFy: 200
– NFy: 400
Question 6
• Sample X:
– AFx: 0
– NFx: 100
• Sample Y:
– AFy: 0
– NFy: 1,000,000
Normalized Probability
• Thought experiment:
– What is the probability of a feature which we have seen in
NO samples? (i.e., a novel feature)
– Hint: assume most features have no connection to the reason
for “goodness”…
Normalized Probability
• Thought experiment:
– What is the probability of a feature which we have seen in
NO samples? (i.e., a novel feature)
– The best guess would be P(good)
• Conclusion:
– Want an estimator where P(good | F) → P(good) as NF becomes small
• Add some “virtual” samples (with prob P(good)) to every bin
Normalized Probability
Our new estimate (after adding K virtual samples)
• P’(good | F) = (AF + P(good)K) / (NF + K)
– P’(good | F) → P(good) as NF → 0
– P’(good | F) → AF / NF as NF becomes large
• (If K = 1/P(good) this is the Laplacian correction)
• K is the duplication factor in our data
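As an illustration (my own numbers, not from the slides): with P(good) = 0.01 and the Laplacian choice K = 1/P(good) = 100, a feature never seen before (AF = 0, NF = 0) scores P’ = (0 + 0.01·100)/(0 + 100) = 0.01 = P(good), while a feature seen in 100 samples and never in an active (AF = 0, NF = 100) scores P’ = 1/200 = 0.005, i.e. genuine evidence against activity.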
Normalized Probability
• Final issue: How do I combine multiple features?
– Assumption: number of features doesn’t matter
– Want to limit contribution from random features
• P’’’(good | F) = ((AF + P(good)K) / (NF + K)) / P(good)
• Pfinal = P’’’(good|F1) * P’’’(good|F2) * …
• Phew!
• (The good news: for most real-world data, default value of K is
quite satisfactory…)
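Putting the formulas together, a minimal scoring sketch (my own Python illustration built on the table sketch earlier, not SciTegic's implementation):

import math

def relative_score(features, table, n_total, n_good):
    # Sum of log-normalized probabilities over the sample's features;
    # larger score = more likely "good". table maps feature -> [N_F, A_F].
    p_good = n_good / n_total
    k = 1.0 / p_good                      # Laplacian choice of K
    score = 0.0
    for f in features:
        n_f, a_f = table.get(f, [0, 0])   # an unseen feature is neutral
        p_corr = (a_f + p_good * k) / (n_f + k)   # Laplacian-corrected probability
        score += math.log(p_corr / p_good)        # normalized probability
    return score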
Validation of the Model
Generating Enrichment Plots
• “If I prioritized my testing using this model, how well
would I do?”
• Graph shows % actives (“good”) found vs % tested
• Use it on a test dataset:
– That was not part of the training data
– That you already have results for
Modeling Known Activity Classes from
the World Drug Index
• Training set
– 25,000 randomly selected compounds from WDI
• Test set
– 25,000 remaining compounds from WDI + 25,000 compounds from Maybridge
• Descriptors
– fingerprints, AlogP, molecular properties
• Build models for each activity class: progestogen, estrogen, etc.
[Diagram: the 50K-compound WDI set is split into a 25K training set and a 25K test set; 25K Maybridge compounds are added to the test set]
Enrichment Plots
• Apply activity model to compounds in test set
• Order compounds from ‘best’ to ‘worst’
• Plot cumulative distribution of known actives (see the sketch below)
• Do this for each activity class
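A minimal sketch of that procedure (my own Python illustration, not the Pipeline Pilot component; scores and is_active are assumed parallel lists for the test set):

# Enrichment curve: sort the test set by model score and track the
# cumulative fraction of actives found versus fraction of compounds tested.
def enrichment_curve(scores, is_active):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n, n_act = len(order), sum(is_active)
    found = 0
    xs, ys = [], []
    for rank, i in enumerate(order, start=1):
        found += is_active[i]
        xs.append(rank / n)           # % of test set screened
        ys.append(found / n_act)      # % of actives recovered
    return xs, ys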
Enrichment Plot for High Actives
Choosing a Cutoff Value
• Models are relative predictors
– Suggest which to test first
– Not a classifier (threshold independent)
• To make it a classifier, need to choose a cutoff
– Balance between
• sensitivity (True Positive rate)
• specificity (1 - False Positive rate)
– Requires human judgment
• Two useful views
– Histogram plots
– ROC (Receiver Operating Characteristic) plots
Choosing a Cutoff Value: Histograms
• A histogram can visually show the separation of actives
and nonactives using a model
Choosing a Cutoff Value: ROC Plots
• Derived from clinical medicine
• Shows balance of costs of missing a
true positive versus falsely accepting a
negative
• Area under the curve is a measure of quality:
– 0.90-1.00 = excellent (A)
– 0.80-0.90 = good (B)
– 0.70-0.80 = fair (C)
– 0.60-0.70 = poor (D)
– 0.50-0.60 = fail (F)
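For illustration, the ROC points and the trapezoidal area under the curve can be computed as below (a sketch that ignores tied scores; in practice a library such as scikit-learn's roc_curve / roc_auc_score would be used):

# Sweep the score cutoff from highest to lowest, recording the false
# positive rate and true positive rate at each step.
def roc_points(scores, is_active):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(is_active)
    n_neg = len(is_active) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_active[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    # Trapezoidal area under the ROC curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))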
ROC Plot for MAO
Postscript: non-FP Descriptors
• AlogP
– A measure of the octanol/water partition coefficient
– High value means molecule "prefers" to be in octanol rather
than water – i.e., is nonpolar
– A real number
• Molecular Weight
– Total mass of all of the atoms making up the molecule
– Units are atomic mass units (a.m.u.) in which the mass of
each proton or neutron is approximately 1
– A positive real number
Postscript: non-FP Descriptors
• Num H Acceptors, Num H Donors
– Molecules may link to each other via hydrogen bonds
– H-bonds are weaker than true chemical bonds
– H-bonds play a role in drug activity
– H donors are polar atoms such as N and O with an attached H (can "donate" a hydrogen to form H-bond)
– H acceptors are polar atoms lacking an attached H (can
"accept" a hydrogen to form H-bond)
– Num H Acceptors, Num H Donors are counts of atoms
meeting the above criteria
– Non-negative integers
Postscript: non-FP Descriptors
• Num Rotatable Bonds
– Certain bonds between atoms are rigid
• Bonds within rings
• Double and triple bonds
– Others are rotatable
• Attached parts of molecule can freely pivot around bond
– Num Rotatable Bonds is the count of rotatable bonds in the molecule
– A non-negative integer
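The slides assume these descriptors are calculated by SciTegic components; purely as an illustration, analogous values can be computed with the open-source RDKit toolkit (note that RDKit's MolLogP is a Crippen estimate, related to but not identical with AlogP):

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
print(Descriptors.MolWt(mol))                # molecular weight (a.m.u.)
print(Descriptors.MolLogP(mol))              # Crippen logP (AlogP-like)
print(Lipinski.NumHDonors(mol))              # H-bond donor count
print(Lipinski.NumHAcceptors(mol))           # H-bond acceptor count
print(Descriptors.NumRotatableBonds(mol))    # rotatable bond count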