Acyclic Subgraph Based Descriptor Spaces for Chemical
Download
Report
Transcript Acyclic Subgraph Based Descriptor Spaces for Chemical
Methods for Virtual
Screening and Scaffold
Hopping for Chemical
Compounds
Nikil Wale1, George Karypis1, Ian A. Watson2
1Department of Computer Science &
Engineering, University of Minnesota.
[email protected], [email protected]
2Eli
Lilly Inc. Lilly Research Labs, Indianapolis.
[email protected]
Drug Development Pipeline
Choose a drug
target
Drug screening/
lead development/
lead optimization
Small scale
production
106 → 1
File IND
Production for
clinical trials
Laboratory and
animal testing
Drug Screening
Experimental
High Throughput Screening
Target specific assays
.....
InSilico
clustering
ranked-retrieval
classification
docking
.....
Ranked-Retrieval Problem
Given a query compound, rank
the compounds in the database
based on how similar they are
to the query in terms of their
bioactivity.
O
H
H
H
H
H
1
H
H
H
H
N
H
O
H
H
H
H
H
2
NH2
H
N
H
O
H
HO
H
3
Cl
query
O
NH2
H
H
H
H
H
H
Database of Compounds
H
Bioactivity
Ranking
Knowledge / Information Retrieval / Datamining Based Approaches
Key Principle: Structure Activity Relationship
The properties of a chemical compound
largely depend on its structure
(structure-activity relationship or SAR).
Exploit structural similarity
Capturing structure → structural descriptors (descriptor-space)
Drawbacks of Structural Descriptor-Space
based Ranked-Retrieval
Too much emphasis on structural similarity
Query and subsequently the hits (top ranked
compounds) may have bad ADME properties.
Structures may be toxic or promiscuous.
Retrieve compounds that are structurally
diverse and different from the query and yet
bioactive to avoid the above drawback
Scaffold-Hopping Problem
Given a query compound, rank
the compounds in the database
based on how similar they are to
the query in terms of their
bioactivity but as dissimilar as
possible in terms of their structure
to that of the query.
H
NH2
H
H
H
1
N
O
H
NH2
H
H
H
2
H
H
N
O
HO
H
H
O
Cl
O
query
3
H
H
H
H
H
H
H
Database of Compounds
with a descriptor-space
representation
H
H
H
H
Diverse
structure
but same
Bioactivity
H
H
H
Hard Problem
Runs counter to SAR
For a query its hard to distinguish a
genuinely structurally different but active
compound from an inactive compound.
Definition of a scaffold-hop for a query is not
clear or objective in many cases.
Examples
Celebrex
Vioxx
Bextra
COX2 (cyclooxygenase-2) inhibitors
Viagra
Levitra
Cialis
PDE5 (phosphodiesterase type 5) inhibitors
Methods for Scaffold-Hopping
Based on Indirect measures of similarity.
High
structural
(direct)
similarity
High
structural
(direct)
similarity
CH3
O
O
CH3
NH2
Low structural similarity but similarity by association
NH2
CH3
CH3
Includes information beyond structural similarity →
allows structural diversity.
Automatic Relevance-Feedback based methods
Graph analysis based methods.
Automatic Relevance Feedback based
Methods
Based on Indirect Similarity derived from by
automatic relevance feedback mechanism
Set of three most similar compounds to the query in terms of structure
Query
CH3
CH3
CH3
NH2
O
Compound with low structural
similarity to the query but high
similarity to set of most similar
compounds to the query.
Automatic Relevance Feedback based
Methods
q
top-k
TopKAvg
q+A
c39
{sim(q, c39)}
c9
c2
{sim(q, c2)}
c13 {simA(q, c13)}
c13
{sim(q, c13)}
cj
Ranked Database
A = {c39, c2, c13}
{simA(q, c9)}
c40 {simA(q, c40)}
cj
Re-ranked Database
simA(q,c) = α sim(q,c) + (1-α) simavg(c,A)
Automatic Relevance Feedback based
Methods
ClustWt
q
q+A
q
c39 {sim(q, c39)}
c2 {sim(q, c2)}
top-k
c13
Cluster
{sim(q, c13)} into m
clusters
1
c8
c8
A
c10 {simA(q, c10)}
c13 c
10
A = {c8, c10, c13}
2
{simA(q, c8)}
c1
{simA(q, c1)}
c40
c1
cj
cj
Ranked Database
Clusters ranked with
respect to their
similarity to the query
Re-ranked Database
simA(q,c) = α sim(q,c) + (1-α) simavg(c,A)
Automatic Relevance Feedback based
Methods
q
c1
c2
c3
A = {}
BestSumDescSim
c next arg max {
c i D
sim( c ,c
c j A { q }
j
)}
BestMaxDescSim
cnext argmax { max {sim( ci , c j )}}
ci D
c j A { q }
ci
Database (D)
i
D = D – {cnext}
A = A U {cnext }
Methods for Scaffold-Hopping
Based on Indirect measure of similarity derived
from a nearest-neighbor graph for compounds.
c8
c5
c6
c1
c7
c4
c2
c3
Graph formed using information on the
neighborhood of each chemical compound
c2 and c4 are strongly related by
the metric of the number of paths
of size 2 that connect them, even
though they do not have a direct
relation (path of size 1).
Network Analysis based Methods
top-k
q
c1
c2
ci
c44
c5
c18
c2
cj
c1
q
q
c1
c14
c2
c9
ci
ci
ck
ck
D U {q} – {q}
D U {q} – {c1}
D U {q} – {c2}
D U {q} – {cj}
Rank every compound in the database and the query with respect to every other compound
Network Analysis based Methods
Nearest Neighbor graph (NG)
There exist an edge between two
nodes (compounds) ci and cj if ci
occurs in the list of top-k neighbors
of cj or vice-versa.
Mutual Nearest Neighbor graph (MG)
There exist an edge between two
nodes (compounds) ci and cj if ci
occurs in the list of top-k neighbors
of cj and vice-versa.
Example – NG and MG
top-3
q
c44
c5
c1
c18
c2
c1
q
c44
c2
q
c1
cj
c5
cj
c14
c2
c9
cj
q
adjG(q) = {c44, c5, c1}
adjG(c2) = {cj, q, c1}
c44
c5
c1
cj
c2
adjG(q) = {c1}
adjG(c2) = {cj, c1}
c18
cj
q
c1
NG
c2
q
c1
c18
MG
c2
Network Analysis based Methods
The graph based similarity is given by
simG (ci , c j )
adjG (ci ) adjG (c j )
adjG (ci ) adjG (c j )
This similarity can be used in conjunction with Sum or Max
search schemes.
c next arg max {
c i D
sim
c j A { q }
G
(c i ,c j )}
cnext argmax { max {simG (ci , c j )}}
ci D
c j A{ q }
Four methods derived from graph based similarity –
BestSumNG, BestMaxNG, BestSumMG, BestMaxMG.
Experimental Methodology
A combination of 6 target specific
datasets and 3 descriptor-spaces
(GF, ECFP, ErG) used resulting in a Standard Retrieval
total of 18 problems.
q
Tanimoto similarity used to measure
c39 {sim(q, c39)}
all direct similarities.
c2 {sim(q, c2)}
Standard Retrieval used as baseline
c13 {sim(q, c13)}
for comparison.
Two related schemes, Turbo
cj
Max/Sum, are also compared.
Ranked Database
Related Schemes
Turbo SumFusion/MaxFusion
q
top-k
q+A
c39
{sim(q, c39)}
c9
c2
{sim(q, c2)}
c13 {simA(q, c13)}
c13
{sim(q, c13)}
cj
Ranked Database
A = {c39, c2, c13}
{simA(q, c9)}
c40 {simA(q, c40)}
cj
Re-ranked Database
TurboSumFusion: simA(q,c) = sim(q,c39) + sim(q,c2) + sim(q,c13)
TurboMaxFusion: simA(q,c) = max{sim(c,c39), sim(q,c2), sim(q,c13)}
Definition of Scaffold-Hops
For every active (ai) in a
dataset scaffold-hops are
defined.
For every active, all other
actives are ranked
against it using pathbased fingerprints. The
lowest 50% actives in this
list form scaffold- hops for
a i.
q = ai
Top 50% of
total actives
Bottom 50%
of total actives
Scaffold-hops for ai
a4
a10
a6
ak
Performance Evaluation
For each problem, every active in it
was used as query exactly once.
The ranked-retrieval and scaffoldhopping performance is measured
using uninterpolated precision in the
top 50.
Two methods compared using the
average of log2 ratios of their
uninterpolated precision for the 18
problems.
q
rank
bioactivity
precision
active
1/1
2
c44
c5
active
2/2
3
c1
inactive
11
c18
inactive
12
ci
active
50
ck
inactive
1
id
Uninterpolated precision =
(1/1 + 2/2+ 3/12) / 50 = 0.045
3/12
BestMaxMG
BestSumMG
BestMaxNG
BestSumNG
BestMaxDS
BestSumDS
ClustWt
TopkAvg
TurboMax
TurboSum
StdRet
Results – Scaffold-Hopping Performance
Row method
statistically better
Row method
statistically worse
StdRet
TurboSum
TurboMax
TopkAvg
ClustWt
BestSumDS
BestMaxDS
BestSumNG
BestMaxNG
BestSumMG
BestMaxMG
Row and column
methods
statistically same
BestMaxMG
BestSumMG
BestMaxNG
BestSumNG
BestMaxDS
BestSumDS
ClustWt
TopkAvg
TurboMax
TurboSum
StdRet
Results – Ranked-Retrieval Performance
Row method
statistically better
Row method
statistically worse
StdRet
TurboSum
TurboMax
TopkAvg
ClustWt
BestSumDS
BestMaxDS
BestSumNG
BestMaxNG
BestSumMG
BestMaxMG
Row and column
methods
statistically same
Results - Problem Specific
Conclusions & Future Work
Automatic relevance feedback mechanism
inspired and Indirect similarity improves scaffoldhopping performance.
Indirect similarity based methods are more
powerful that direct similarity based measures and
show significant improvements over the state of
the art.
Problem of selecting the right value for the
parameter ‘k’.
Selecting the right descriptor space.
Thanks to –
• Karypis Research
• Kevin DeRonne
• Christopher Kauffman
• Huzefa Rangwala
• Xia Ning
•Yevgeniy Podoylan
THANK YOU!
Questions?
www.cs.umn.edu/~karypis