EXTENDED NEAREST NEIGHBOR CLASSIFICATION
METHODS FOR PREDICTING SMALL MOLECULE ACTIVITY
Farhad Hormozdiari
Lab for Computational Biology, Simon Fraser University
Outline
Small Molecule
Similarity Measure
Classification
Kernel
Methods
Nearest Neighbor classifier
Centroid based Nearest Neighbor
Distance / Metric Learning
Results
What are small molecules?
Chemical compounds with small molecular mass
Important in the synthesis and maintenance of larger
molecules (DNA, RNA and proteins).
High potential as medicine.
Increasing number of databases: PubChem, ChemDB,
ChemBank…
Standard task in in silico drug discovery:
Classifying a compound with unknown activity
Representation of small molecules
Chemical (Conventional) Descriptors:
A(x)=(25, 0.24, 1, 12.3,….., 5, 2.12,……..)
Chemical structures represented by labeled graphs
Classification methods for small
molecules
Artificial Neural Networks (ANN)
Support Vector Machine (SVM)
K-Nearest Neighbor Classification
Recent works focused on Kernel Methods
SVM (Support Vector Machine)
Φ(x): fixed feature transformation
t_n ∈ {1, -1}
Find a decision boundary
y(x) = w^T Φ(x) + b
Goal: maximize the distance of the closest training point to the boundary
Dist = t_n y(x_n) / ||w|| = t_n (w^T Φ(x_n) + b) / ||w||
arg max_{w,b} { (1/||w||) min_n [ t_n (w^T Φ(x_n) + b) ] }
Solved as a quadratic program
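A minimal sketch of the linear decision function and geometric margin above, assuming scikit-learn; the descriptor matrix and labels are random placeholders, not the datasets used in this work:

```python
# Minimal sketch (not from the slides): fit a linear SVM on descriptor vectors
# and recover the geometric margin t * y(x) / ||w|| of each training point.
import numpy as np
from sklearn.svm import SVC

# Hypothetical descriptor matrix X (one row per compound) and labels t in {1, -1}
X = np.random.rand(100, 32)          # e.g. 32 conventional QSAR descriptors
t = np.random.choice([-1, 1], 100)   # bioactivity labels

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, t)

# Decision function y(x) = w^T x + b
w, b = clf.coef_[0], clf.intercept_[0]
margins = t * (X @ w + b) / np.linalg.norm(w)
print("smallest margin:", margins.min())
```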
Recent works on small molecule
classification
Marginalized Kernel (MK)
Tsuda et al. 2002, Kashima et al. (ICML 2003)
Features are the numbers of labeled paths of random walks
Improved Marginalized Kernel
Mahe et al. (ICML 2005)
Avoid totters (walks that return to the node visited two steps earlier)
Recent works on small molecule
classification
Swamidass et al. (Bioinformatics 2003)
Kernels based on 3D Euclidean coordinates of atoms
One histogram per pair of atom labels
Similarity between histograms
Cao et al. (ISMB 2008, Bioinformatics 2010)
Use Maximum Common Substructure (MCS) as a measure of similarity
Randomly pick "basis" compounds
Features of a molecule are the MCS between that molecule and all basis compounds
Nearest Neighbor Classification
Nearest Neighbor (NN) Classification
The label of a molecule is predicted from the labels of its nearest neighbors
NN error < 2 × Bayes error (Cover and Hart, 1967)
One of the most widely used classifiers in small molecule classification because of its simplicity
Nearest Neighbor Classification
Drawbacks
Speed/Memory
Distances to all training set points must be computed
The entire training set is stored in memory
Overfitting
Centroid based Nearest Neighbor
(CBNN) Classification
CBNN Classification
Centroids are picked from each class
Bioactivity of a small molecule is predicted based on its nearest centroids
CBNN tackles the drawbacks of NN
Centroid Selection
Hart et al., 1968 introduced Condensed NN
Classification
Initially, the set of centroids S includes one point
Iteratively go through each remaining point p; if its nearest neighbor in S has the opposite class, p is added to S (a sketch follows below)
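A minimal sketch of the condensed NN selection rule described above, assuming Euclidean distances over descriptor vectors; an illustration, not the authors' implementation:

```python
# Minimal sketch (assumption: Euclidean distance, binary labels) of Hart's
# condensed NN centroid selection; passes repeat until no more points are added.
import numpy as np

def condensed_nn(X, y):
    """Return indices of centroids selected by the condensed NN rule."""
    S = [0]  # start with a single point
    changed = True
    while changed:
        changed = False
        for p in range(len(X)):
            if p in S:
                continue
            # nearest centroid currently in S
            d = np.linalg.norm(X[S] - X[p], axis=1)
            nearest = S[int(np.argmin(d))]
            # if p would be misclassified, add it to the centroid set
            if y[nearest] != y[p]:
                S.append(p)
                changed = True
    return S

# Usage with a toy descriptor matrix
X = np.random.rand(50, 5)
y = np.random.choice([0, 1], 50)
print("centroids:", condensed_nn(X, y))
```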
Fast condensed NN Classification (Angiulli et al.,
ICML 2005)
S is initialized to the medoids of each class
For each point in S its Voronoi cell is built
In each Voronoi cell, if there exists a point from a different class, it is added to S
Centroid Selection
Gabriel Graph (Gabriel et al. 1969,1980)
There exists an edge between two points u, v
if for every other point w, dist(u,v)² < dist(u,w)² + dist(w,v)² (no third point lies inside the circle with u and v as diameter)
After the graph is built, connected nodes from different classes are selected
[Figure: points u, v and w illustrating an edge (u, v) removed because w violates the condition]
Centroid Selection
Relative Neighborhood Graph (Toussaint, 1980)
There exists an edge between two points u, v if
for every other point w, dist(u,v) < max{dist(u,w), dist(w,v)}
After the graph is built, connected nodes from different classes are selected (see the sketch below)
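A minimal sketch of graph-based centroid selection using the RNG rule above (brute-force O(n³) edge test, Euclidean distance assumed); the Gabriel graph variant only changes the edge condition:

```python
# Minimal sketch (not the authors' code) of RNG-based centroid selection:
# build the relative neighborhood graph, then keep endpoints of edges that
# connect points of different classes.
import numpy as np
from itertools import combinations

def rng_centroids(X, y):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    centroids = set()
    for u, v in combinations(range(n), 2):
        # RNG edge: no third point w is closer to both u and v than they are to each other
        if all(D[u, v] < max(D[u, w], D[w, v]) for w in range(n) if w not in (u, v)):
            if y[u] != y[v]:  # edge crosses the class boundary
                centroids.update((u, v))
    return sorted(centroids)

X = np.random.rand(40, 3)
y = np.random.choice([0, 1], 40)
print("RNG centroids:", rng_centroids(X, y))
```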
Combinatorial Centroid Selection
Combinatorial Centroid Selection(CCS)
Given a training set of points (compounds) whose distances satisfy the triangle inequality,
find the minimum number of centroids (selected compounds) such that for each point, its nearest centroid is from the same class
For simplicity, we only deal with binary classification, i.e., C1 is the first class and C2 is the second class
CCS Complexity
k-CCS problem
Asked to select a set of points with cardinality at most k such that for each point, its nearest centroid is from the same class
k-CCS is NP-Complete
k-Dominating Set (k-DS): given a graph G(V,E), ask whether there exists V' ⊆ V, |V'| ≤ k, such that each node v ∈ V is either in V' or adjacent to a node in V'
k-DS ≤p k-CCS
This reduction implies that no approximation better than O(log n) exists for CCS unless P = NP
Integer Linear Program Solution
Notation:
C1_i's are the points in class C1
C2_j's are the points in class C2
δ(x) = 1 if x is a centroid and 0 otherwise
d(X, Y): a metric distance between two points X and Y
Objective: minimize the number of chosen points or compounds (the centroids)
minimize   Σ_{C1_i ∈ C1} δ(C1_i)  +  Σ_{C2_j ∈ C2} δ(C2_j)
Integer Linear Program Solution
s.t.
Σ_{C1_k ∈ C1 : d(C1_i, C1_k) ≤ d(C1_i, C2_j)} δ(C1_k)  ≥  δ(C2_j)    for all C1_i ∈ C1, C2_j ∈ C2
Σ_{C2_l ∈ C2 : d(C2_p, C2_l) ≤ d(C2_p, C1_q)} δ(C2_l)  ≥  δ(C1_q)    for all C2_p ∈ C2, C1_q ∈ C1
Ensure that for every pair of compounds i of class 1 and j of class 2, if j is chosen as a centroid, then a compound k of class 1 within radius d(i, j) of i must be chosen as a centroid as well (and symmetrically for class 2).
Integer Linear Program Solution
s.t.
Σ_{C1_i ∈ C1} δ(C1_i) ≥ 1
Σ_{C2_i ∈ C2} δ(C2_i) ≥ 1
δ(X) ∈ {0, 1}  for all X ∈ C1 ∪ C2
Ensure that for each class there is at least one compound chosen as a centroid
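A minimal sketch of the CCS integer program above, assuming the PuLP modeling library with its default CBC solver and Euclidean distances; a toy illustration rather than the authors' implementation:

```python
# Minimal sketch of the CCS ILP: binary variable delta_i per compound,
# minimize the number of centroids subject to the pairwise coverage constraints.
import numpy as np
import pulp

def ccs_ilp(X, y):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    delta = [pulp.LpVariable(f"delta_{i}", cat="Binary") for i in range(n)]
    prob = pulp.LpProblem("CCS", pulp.LpMinimize)
    prob += pulp.lpSum(delta)                       # minimize number of centroids
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                continue
            # if the opposite-class point j is a centroid, some same-class point
            # within radius d(i, j) of i must also be a centroid
            same = [delta[k] for k in range(n) if y[k] == y[i] and D[i, k] <= D[i, j]]
            prob += pulp.lpSum(same) >= delta[j]
    for c in set(y):                                # at least one centroid per class
        prob += pulp.lpSum(delta[i] for i in range(n) if y[i] == c) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if pulp.value(delta[i]) > 0.5]

X = np.random.rand(20, 3)
y = list(np.random.choice([0, 1], 20))
print("ILP centroids:", ccs_ilp(X, y))
```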
Fixed Size Neighborhood Solution
The ILP solution suffers from
Huge size due to pairwise constraints among points
A potential trivial solution
Propose a relaxed version of the ILP
Reduce the number of constraints:
for each point p, within the radius equal to the distance from p to its k-th nearest neighbor of the different class, there must be one centroid of the same class as p
We will call this method CCNN1
Special case of CCS
When the majority of the compounds do not exhibit the bioactivity of interest:
All compounds that exhibit the bioactivity of interest are picked as centroids
We minimize the number of compounds chosen from those that do not exhibit the activity of interest
Special case of CCS
It can be reduced to Set Cover
O(log n)-approximation algorithm
Set Cover problem:
Given a universal set U and a collection C of subsets of U, the goal is to pick the minimum number of sets from C that cover all the elements of U
NP-Complete
Greedy algorithm:
Pick the set that covers the maximum number of uncovered elements of the universal set (sketched below)
We will call this method CCNN2
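A minimal sketch of the greedy set-cover heuristic behind CCNN2; how the universe and subsets are built from compounds is omitted here:

```python
# Minimal sketch (not from the slides) of greedy set cover: repeatedly pick the
# subset that covers the most still-uncovered elements.
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until every element of `universe` is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the subset covering the most still-uncovered elements
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            break  # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# Toy usage
universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(universe, subsets))  # [0, 3]
```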
Experimental Results - Datasets
Mutagenicity dataset
includes aromatic and hetero-aromatic nitro compounds that are
tested for mutagenicity on Salmonella
188 compounds with positive levels of log mutagenicity
63 negative examples
Drug dataset includes
958 drug compounds
6550 non-drug compounds including antibiotics, human,
bacterial, plant, fungal metabolites and drug-like compounds
Experimental Results - Descriptors
The structures of the compounds have been used
30 3D inductive QSAR descriptors by Cherkasov et
al. 2005
32 conventional QSAR descriptors by MOE:
Number of basic atoms
Number of bonds
….
Comparison with other CBNN
based methods
Drug dataset
Method   #Centroids   % of Training Set   Accuracy (%)
RNG      1705         28.39               89.00
GG       4804         79.99               92.00
CCNN     1489         24.79               89.89
CCNN2    1052         17.51               92.17
NN       6006         100                 91.02
Comparison with small molecule
classification methods
Mutag dataset

Method               Precision   Recall   Accuracy   Running Time (min)
NN                   87.80       92.00    86.17      1(?)
CCNN                 92.00       92.74    89.94      6
CCNN2                92.13       94.35    90.91      6
SVM-Linear           92.00       92.00    89.36      6
SVM-Poly             91.30       92.00    88.83      6
SVM-Radial           86.60       92.80    85.63      6
Cao et al.           88.2        77.8     82.35      20
MK, Kashima et al.   94.4        88.7     89.10      6
Comparison with small molecule
classification methods
Drug dataset

Method               Precision   Recall   Accuracy   Running Time (min)
NN                   64.70       65.30    91.02      45(?)
CCNN                 56.36       61.18    89.89      181
CCNN2                69.12       69.70    92.17      150
SVM-Linear           76.10       8.70     87.89      121
SVM-Poly             77.10       38.30    90.17      180
SVM-Radial           80.10       35.00    90.60      121
Cao et al.           81.20       56.20    92.00      ~5 days
MK, Kashima et al.   53.70       57.00    89.10      ~1 day
Learning the Metric Space
Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp (ISMB 2006)
Quantitative Structure-Activity Relationship (QSAR)
Similarity measure
Minkowski distance:
L_p(X, Y) = ( Σ_{i=1}^{n} |X[i] − Y[i]|^p )^(1/p)
Each feature is equally significant
But some features should be more significant and some less
Weighted Minkowski distance:
wL_1(X, Y) = Σ_{i=1}^{n} w_i · |X[i] − Y[i]|
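A minimal sketch of the weighted Minkowski distance, assuming p = 1 (the weighted L1 form shown above) and hypothetical feature weights:

```python
# Minimal sketch of the weighted Minkowski distance between descriptor vectors.
import numpy as np

def weighted_minkowski(x, y, w, p=1):
    """(sum_i w_i * |x_i - y_i|^p) ** (1/p)"""
    return np.sum(w * np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([25.0, 0.24, 1.0, 12.3])
y = np.array([20.0, 0.30, 0.0, 10.0])
w = np.array([0.1, 2.0, 1.0, 0.5])   # learned feature weights (hypothetical)
print(weighted_minkowski(x, y, w))
```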
Main Idea
Can the weighted Minkowski distance be useful?
Reduce the number of features (cf. PCA)
Increase the accuracy
How to learn the right W?
Decrease the within-class distance
Increase the between-class distance
Learn the optimal W
Given the training set T, let
T^A = {T^A_1, T^A_2, ..., T^A_m} be the active set
T^I = {T^I_1, T^I_2, ..., T^I_{l−m}} be the inactive set
Min f(T), where
f(T) =  Σ_{h=1}^{m} Σ_{j=1}^{m} Σ_{i=1}^{n} w_i |T^A_h[i] − T^A_j[i]| / m²
      + Σ_{h=1}^{l−m} Σ_{j=1}^{l−m} Σ_{i=1}^{n} w_i |T^I_h[i] − T^I_j[i]| / (l−m)²
      − Σ_{h=1}^{m} Σ_{j=1}^{l−m} Σ_{i=1}^{n} w_i |T^A_h[i] − T^I_j[i]| / (m(l−m))
(the average within-class weighted distances minus the average between-class weighted distance)
Learn the optimal W (cont.)
Min f(T)
s.t.
Σ_{h=1}^{m} Σ_{j=1}^{m} Σ_{i=1}^{n} w_i |T^A_h[i] − T^A_j[i]| / m²  ≤  Σ_{h=1}^{m} Σ_{j=1}^{l−m} Σ_{i=1}^{n} w_i |T^A_h[i] − T^I_j[i]| / (m(l−m))
w_i ∈ [0, 1]
(the average within-class distance must not exceed the average between-class distance, and each feature weight is bounded between 0 and 1)
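A minimal sketch of the weight-learning step, under the simplifying assumption that f(T) is linear in w and is minimized over w ∈ [0,1]^n with scipy's linprog; this is a simplified reading of the formulation above, not the authors' exact program:

```python
# Minimal sketch: f(T) is linear in w, so minimizing it over box constraints
# is a small linear program. XA: active compounds, XI: inactive compounds.
import numpy as np
from scipy.optimize import linprog

def learn_weights(XA, XI):
    m, n = XA.shape
    k = XI.shape[0]
    # per-feature average within-class and between-class absolute differences
    within_A = np.abs(XA[:, None, :] - XA[None, :, :]).sum(axis=(0, 1)) / m**2
    within_I = np.abs(XI[:, None, :] - XI[None, :, :]).sum(axis=(0, 1)) / k**2
    between = np.abs(XA[:, None, :] - XI[None, :, :]).sum(axis=(0, 1)) / (m * k)
    c = within_A + within_I - between          # f(T) = c . w
    res = linprog(c, bounds=[(0, 1)] * n, method="highs")
    return res.x

XA = np.random.rand(10, 6)
XI = np.random.rand(15, 6)
print("learned feature weights:", learn_weights(XA, XI))
```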
Metric Learning
Weinberger et al., NIPS 2006
Semidefinite program
D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j), where M = L^T L
s.t. M ⪰ 0 (positive semidefinite)
The difference between between-class and within-class distances is fixed in advance
It aims to compute the "best" M
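A minimal sketch of the Mahalanobis-style distance D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j) with M = L^T L; the transform L here is a random placeholder, not one learned by the LMNN solver:

```python
# Minimal sketch of evaluating a Mahalanobis-style learned metric.
import numpy as np

L = np.random.rand(4, 4)        # placeholder linear transform (not learned)
M = L.T @ L                     # guarantees M is positive semidefinite

def mahalanobis_sq(xi, xj, M):
    d = xi - xj
    return float(d @ M @ d)

xi, xj = np.random.rand(4), np.random.rand(4)
print(mahalanobis_sq(xi, xj, M))
```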
Classification of new compounds
Input:
Distances of the new compound Q to the compounds in the dataset
Assumption:
The bioactivity level of Q is likely to be similar to that of its close neighbors
A kNN classifier estimates the bioactivity of Q as
the majority bioactivity among its k nearest neighbors (see the sketch below)
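A minimal sketch of kNN bioactivity prediction for a query compound Q, assuming a weighted L1 distance and majority vote; data and weights are placeholders:

```python
# Minimal sketch: predict the bioactivity of Q from its k nearest neighbors.
import numpy as np
from collections import Counter

def knn_predict(Q, X, y, w, k=5):
    d = np.sum(w * np.abs(X - Q), axis=1)   # weighted L1 distance to every compound
    nearest = np.argsort(d)[:k]             # indices of the k closest compounds
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

X = np.random.rand(100, 8)
y = np.random.choice([0, 1], 100)
w = np.ones(8)
print(knn_predict(np.random.rand(8), X, y, w, k=5))
```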
Querying a compound
Naïve method:
O(|S|), where S is the set of compounds in the database
Binary search tree:
Vantage Point (VP) tree (Uhlmann, 1991)
A binary tree that recursively partitions the data space using
distances of data points to a randomly picked vantage point
VP-Tree
Internal nodes: (Xvp, M, Rptr, Lptr)
M: the median of the distances d(Xvp, Xi) over all Xi in the partitioned space
Xvp: the vantage point
Leaves: references to data points
Proximity search in VP-tree
Given a query point q, metric distance d(.,.) and a
proximity radius r
Goal is to find all points x where d(x,q) < r
If d(q, Xvp) − r < M, recursively search the inner partition
If d(q, Xvp) + r > M, recursively search the outer partition
If both conditions hold, search both partitions (see the sketch below)
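A minimal sketch of a VP-tree with the range search described above, assuming Euclidean distance and random vantage points; an illustration, not a tuned implementation:

```python
# Minimal sketch of a vantage-point tree: each node splits its points by the
# median distance M to a random vantage point; range search prunes partitions
# the query ball cannot reach.
import numpy as np, random

class VPNode:
    def __init__(self, points):
        self.vp = points[random.randrange(len(points))]          # vantage point
        rest = [p for p in points if p is not self.vp]
        if not rest:
            self.median, self.inner, self.outer = 0.0, None, None
            return
        d = [np.linalg.norm(p - self.vp) for p in rest]
        self.median = float(np.median(d))                        # M
        inner = [p for p, di in zip(rest, d) if di <= self.median]
        outer = [p for p, di in zip(rest, d) if di > self.median]
        self.inner = VPNode(inner) if inner else None
        self.outer = VPNode(outer) if outer else None

    def range_search(self, q, r, out):
        d = np.linalg.norm(q - self.vp)
        if d <= r:
            out.append(self.vp)
        if self.inner and d - r <= self.median:    # query ball may reach inner partition
            self.inner.range_search(q, r, out)
        if self.outer and d + r >= self.median:    # query ball may reach outer partition
            self.outer.range_search(q, r, out)
        return out

points = [np.random.rand(4) for _ in range(200)]
tree = VPNode(points)
print(len(tree.range_search(np.random.rand(4), 0.4, [])))
```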
Can we do better?
Select multiple vantage points at each level
Space Covering VP (SCVP) trees (Sahinalp et al., 2003)
Increases the chance of including the query in one of the inner partitions
Can we do much better?
Instead of selecting random vantage points, select them more intelligently
Deterministic Multiple Vantage Point (DMVP) tree
Select the minimum number of vantage points that cover the entire data collection (the OVPS problem)
Better space utilization (optimal redundancy)
The OVPS problem is NP-hard for any wLp distance
Conclusion
NN is a powerful classifier
Small molecule classification
NN problem
CBNN
CCNN1 and CCNN2
Distance learning
Accuracy
DMVP tree
Future work
Further investigation of possible approximation
algorithms for selecting centroids
Combining CCNN (selecting centroids) with metric
learning
Ideally the problem formulation should ensure that the nearest neighbor of each point in the training set is in the same class as that point
Adapt CCNN to work with regression datasets
References
Phuong Dao*, Farhad Hormozdiari*, Hossein Jowhari, Kendall Byler, Artem Cherkasov, S. Cenk Sahinalp. Improved Small Molecule Activity Determination via Centroid Nearest Neighbors Classification. CSB 2008.
Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp. Distance Based Algorithms for Small Biomolecule Classification and Structural Similarity Search. ISMB 2006.
Iurii Sushko et al. Applicability domains for classification problems: benchmarking of distance to models for AMES mutagenicity set. Journal of Chemical Information and Modeling, 2010.
Acknowledgments
Cenk Sahinalp
Artem Cherkasov
Zehra Cataltepe
Emre Karakoc
Phuong Dao
Hossein Jowhari
Kendall Byler
All members of the lab
Questions