Faculty of Computer Science - Department of Computing Science

Download Report

Transcript Faculty of Computer Science - Department of Computing Science

Faculty of Computer Science
Novel Approaches for Small Bio-molecule
Classification and Structural Similarity
Search
Karakoc E, Cherkasov A., and Sahinalp S.C.
Amit Satsangi
[email protected]
CMPUT 605
February 04, 2008
© 2006
Department of Computing Science
Background and Focus
 Identification of molecules that play an active role
in regulation of biological processes or disease
states (Aspirin)
 Structural similarity  Similar biological and/or
physico-chemical properties (Maggiora et al.)
 Classification of probe compound (unknown
bioactivity)
 Similarity search amongst compounds with known
bioactivity
CMPUT 605
© 2006
Department of Computing Science
Background and Focus
 Determining similarity distance measures (SDM)
 Using SDM for classification of compounds—k-NN
classification
 Efficient data structures for fast similarity search—
DMVP trees (an improvement over SCVP trees used
previously)
CMPUT 605
© 2006
Department of Computing Science
Outline
 Similarity measures
 Classification techniques
 k-NN classifier
 DMVP tree
 Results, Observations and Conclusion
CMPUT 605
© 2006
Department of Computing Science
Similarity between Molecules
 Structural Similarity—doubly bonded C pair,
existence of aromatic atom etc. (Used in structural
similarity search engines)
 Similarity of chemical descriptors—atomic wt.,
hydrophobicity, charge, density etc. (Used in
QSAR* tools)
*Quantitative Structure-Activity Relationship
CMPUT 605
© 2006
Department of Computing Science
Similarity Measures
 Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y:
 X & Y: n-dimensional bit-vectors (representation
used by PubChem & some other databases)
 Range of Tanimoto coefficient: [0, 1]
CMPUT 605
© 2006
Department of Computing Science
Similarity measures
 Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y)
 Minkowski distance (LP):
 Real valued data possible
CMPUT 605
© 2006
Department of Computing Science
Classification Techniques
 Multiple Linear Regression (MLR)
 Linear Discriminant Analysis (LDA)
 Artifical Neural Networks (ANN)
 Support Vector Machines (SVM)
 k-nearest Neighbor (k-NN) classification not used
previously.
CMPUT 605
© 2006
Department of Computing Science
Distance-based Classification
 Compounds—s & r
 S & R respective descriptor arrays
 If D(S,R) is small then bioactivity levels of s & r are
similar
 Notion of distance  classification of new
compounds
 Distance measure == metric (conditions) e.g.
Hamming Distance, Tanimoto distance etc.
CMPUT 605
© 2006
Department of Computing Science
k-nn Classification
 Given  Bioactivity
 To Find  Distance measure that separates active
and inactive compounds for the training set Ndimensional plane
 Problem  Easy
CMPUT 605
© 2006
Department of Computing Science
k-nn Classification
 Given  Bioactivity
 To Find  Distance measure that separates active
and inactive compounds for the training set Ndimensional plane
 Problem  NP-hard
 Solution  Use Genetic Algorithms, heuristic linear
search to find the plane
CMPUT 605
© 2006
Department of Computing Science
QSAR approach
• Uses a linear combination of descriptors
• Assigns a weight to each dimension
, W [0,1]
• Weighted Minkowski distance of order 1
• Only binary classification considered (A/I)
• Methods are general
CMPUT 605
© 2006
Department of Computing Science
Parameter Optimization
CMPUT 605
© 2006
Department of Computing Science
k-NN Classifier
 Set of data elements: {X1, … Xn}
 Query element: Y
 Range query  Find Xi such that D(Y,Xi) < R1 (user
defined)
 k-nn query  Find k items such that their distance
to Y is as small as possible
CMPUT 605
© 2006
Department of Computing Science
Data structures: VP-Trees
 Vantage Point (VP) tree
 Choose an arbitrary data point (called Vantage
Point)
 Binary tree—recursively partitions the dataset into
two equal sized subsets
 Zero in on the nearest neighbor
CMPUT 605
© 2006
Department of Computing Science
Efficient data structures: SCVP Trees
 Space Covering Vantage Point tree
 Multiple vantage points chosen at each level
 No more a binary tree—multiple branches at each
internal node
 Multiple inner partitions—hope is that each data
point lies in atleast one inner partition
CMPUT 605
© 2006
Department of Computing Science
DMVP Tree
 Memory requirements of SCVP tree can be large—redundancy
of data elements
 Deterministic selection of Vantage points
 VP minimization—NP-Hard
 Minimization == Weighted set cover problem
 Use of greedy Algorithm: O(log l); l<n
 Approximates the min number of VP’s
CMPUT 605
© 2006
Department of Computing Science
Experiments
 Five types of bioactivities viz. being antibiotic (520), bacterial
metabolite (562), human metabolite(1104), drug(958), druglike(1202)
 62 dimensional descriptor array (30 QSAR & 32 physicochemical properties)
 k=1 i.e. one NN
 Comparison with LDA, MLR, ANN
 70% data used for training
 wL1 distance is calculated in all cases
CMPUT 605
© 2006
Department of Computing Science
Experimental Results
 Table 1 shows that in almost all cases in terms of accuracy,
and T_P, T_N, F_P etc. k-NN does better than LDA and MLR
 ANN beats k-NN on almost all counts
 Pruning—more than 80% in each kind of bioactivity (over
brute-force search)
 Key point – k-NN classifier is faster
 More than 100 times faster than ANN
CMPUT 605
© 2006
Department of Computing Science
Experimental Results
 Can calculate the level of bioactivity instead of a
YES/NO
 The value of the weights provides insights into the
importance of descriptors for each bioactivity
CMPUT 605
© 2006
Department of Computing Science
Observations & Conclusion
 Bacterial metabolites & antimicrobial drugs overlap
(confirmation)
 Human metabolites display distinctive properties
 QSAR models for drugs + human metabolites
dominated by few descriptors
 These descriptors favored by drug developers and
natural evolution
CMPUT 605
© 2006
Department of Computing Science
Observations & Conclusion
 Classification results from k-NN can help rationalize the
design and discovery of drugs
 DMVP tree improves the space utilization of the program
 Provides a means for fast similarity search
 Data structure can be applied to any metric distance like wLp
and Tanimoto distance
CMPUT 605
© 2006
Department of Computing Science
Thank You For Your Attention!
CMPUT 605
© 2006