Faculty of Computer Science - Department of Computing Science
Download
Report
Transcript Faculty of Computer Science - Department of Computing Science
Faculty of Computer Science
Novel Approaches for Small Bio-molecule
Classification and Structural Similarity
Search
Karakoc E, Cherkasov A., and Sahinalp S.C.
Amit Satsangi
[email protected]
CMPUT 605
February 04, 2008
© 2006
Department of Computing Science
Background and Focus
Identification of molecules that play an active role
in regulation of biological processes or disease
states (Aspirin)
Structural similarity Similar biological and/or
physico-chemical properties (Maggiora et al.)
Classification of probe compound (unknown
bioactivity)
Similarity search amongst compounds with known
bioactivity
CMPUT 605
© 2006
Department of Computing Science
Background and Focus
Determining similarity distance measures (SDM)
Using SDM for classification of compounds—k-NN
classification
Efficient data structures for fast similarity search—
DMVP trees (an improvement over SCVP trees used
previously)
CMPUT 605
© 2006
Department of Computing Science
Outline
Similarity measures
Classification techniques
k-NN classifier
DMVP tree
Results, Observations and Conclusion
CMPUT 605
© 2006
Department of Computing Science
Similarity between Molecules
Structural Similarity—doubly bonded C pair,
existence of aromatic atom etc. (Used in structural
similarity search engines)
Similarity of chemical descriptors—atomic wt.,
hydrophobicity, charge, density etc. (Used in
QSAR* tools)
*Quantitative Structure-Activity Relationship
CMPUT 605
© 2006
Department of Computing Science
Similarity Measures
Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y:
X & Y: n-dimensional bit-vectors (representation
used by PubChem & some other databases)
Range of Tanimoto coefficient: [0, 1]
CMPUT 605
© 2006
Department of Computing Science
Similarity measures
Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y)
Minkowski distance (LP):
Real valued data possible
CMPUT 605
© 2006
Department of Computing Science
Classification Techniques
Multiple Linear Regression (MLR)
Linear Discriminant Analysis (LDA)
Artifical Neural Networks (ANN)
Support Vector Machines (SVM)
k-nearest Neighbor (k-NN) classification not used
previously.
CMPUT 605
© 2006
Department of Computing Science
Distance-based Classification
Compounds—s & r
S & R respective descriptor arrays
If D(S,R) is small then bioactivity levels of s & r are
similar
Notion of distance classification of new
compounds
Distance measure == metric (conditions) e.g.
Hamming Distance, Tanimoto distance etc.
CMPUT 605
© 2006
Department of Computing Science
k-nn Classification
Given Bioactivity
To Find Distance measure that separates active
and inactive compounds for the training set Ndimensional plane
Problem Easy
CMPUT 605
© 2006
Department of Computing Science
k-nn Classification
Given Bioactivity
To Find Distance measure that separates active
and inactive compounds for the training set Ndimensional plane
Problem NP-hard
Solution Use Genetic Algorithms, heuristic linear
search to find the plane
CMPUT 605
© 2006
Department of Computing Science
QSAR approach
• Uses a linear combination of descriptors
• Assigns a weight to each dimension
, W [0,1]
• Weighted Minkowski distance of order 1
• Only binary classification considered (A/I)
• Methods are general
CMPUT 605
© 2006
Department of Computing Science
Parameter Optimization
CMPUT 605
© 2006
Department of Computing Science
k-NN Classifier
Set of data elements: {X1, … Xn}
Query element: Y
Range query Find Xi such that D(Y,Xi) < R1 (user
defined)
k-nn query Find k items such that their distance
to Y is as small as possible
CMPUT 605
© 2006
Department of Computing Science
Data structures: VP-Trees
Vantage Point (VP) tree
Choose an arbitrary data point (called Vantage
Point)
Binary tree—recursively partitions the dataset into
two equal sized subsets
Zero in on the nearest neighbor
CMPUT 605
© 2006
Department of Computing Science
Efficient data structures: SCVP Trees
Space Covering Vantage Point tree
Multiple vantage points chosen at each level
No more a binary tree—multiple branches at each
internal node
Multiple inner partitions—hope is that each data
point lies in atleast one inner partition
CMPUT 605
© 2006
Department of Computing Science
DMVP Tree
Memory requirements of SCVP tree can be large—redundancy
of data elements
Deterministic selection of Vantage points
VP minimization—NP-Hard
Minimization == Weighted set cover problem
Use of greedy Algorithm: O(log l); l<n
Approximates the min number of VP’s
CMPUT 605
© 2006
Department of Computing Science
Experiments
Five types of bioactivities viz. being antibiotic (520), bacterial
metabolite (562), human metabolite(1104), drug(958), druglike(1202)
62 dimensional descriptor array (30 QSAR & 32 physicochemical properties)
k=1 i.e. one NN
Comparison with LDA, MLR, ANN
70% data used for training
wL1 distance is calculated in all cases
CMPUT 605
© 2006
Department of Computing Science
Experimental Results
Table 1 shows that in almost all cases in terms of accuracy,
and T_P, T_N, F_P etc. k-NN does better than LDA and MLR
ANN beats k-NN on almost all counts
Pruning—more than 80% in each kind of bioactivity (over
brute-force search)
Key point – k-NN classifier is faster
More than 100 times faster than ANN
CMPUT 605
© 2006
Department of Computing Science
Experimental Results
Can calculate the level of bioactivity instead of a
YES/NO
The value of the weights provides insights into the
importance of descriptors for each bioactivity
CMPUT 605
© 2006
Department of Computing Science
Observations & Conclusion
Bacterial metabolites & antimicrobial drugs overlap
(confirmation)
Human metabolites display distinctive properties
QSAR models for drugs + human metabolites
dominated by few descriptors
These descriptors favored by drug developers and
natural evolution
CMPUT 605
© 2006
Department of Computing Science
Observations & Conclusion
Classification results from k-NN can help rationalize the
design and discovery of drugs
DMVP tree improves the space utilization of the program
Provides a means for fast similarity search
Data structure can be applied to any metric distance like wLp
and Tanimoto distance
CMPUT 605
© 2006
Department of Computing Science
Thank You For Your Attention!
CMPUT 605
© 2006