Chemical similarity with Toxmatch 1.03 and Ambit


Chemical similarity with Toxmatch 1.03 and Ambit Discovery
Nina Jeliazkova, Joanna Jaworska
2
Chemical similarity assessment using
Ambit Database
•Exact substructure search based on 2D
•Structural Similarity search (various methods)
•Criteria on descriptors
•Based on mechanistic understanding
• Verhaar scheme
3
Another view on Similarity assessments
with Toxmatch and Ambit Discovery
Ambit Discovery
• Similarity to a set
− The query compound is compared to a summary representation
of a set:
− Descriptor space – center of the data cloud
− Fingerprints – a consensus fingerprint
Toxmatch
• Pairwise similarities
• Similarity to a set
− The query compound is compared to its nearest neighbours,
selected by various similarity measures
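The two notions of "similarity to a set" above can be contrasted with a minimal sketch. It assumes Euclidean distance, takes the centre of the data cloud as the arithmetic mean of the descriptors, and uses hypothetical two-descriptor data; the function names are illustrative and are not taken from Ambit Discovery or Toxmatch.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_center(query, dataset):
    """Distance from the query to the centroid (mean vector) of the set."""
    n = len(dataset)
    center = [sum(col) / n for col in zip(*dataset)]
    return euclidean(query, center)

def mean_knn_distance(query, dataset, k=3):
    """Average distance from the query to its k nearest neighbours in the set."""
    nearest = sorted(euclidean(query, row) for row in dataset)[:k]
    return sum(nearest) / len(nearest)

# Hypothetical data: two well-separated clusters of three compounds each
dataset = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
query = [5.0, 5.0]  # sits in the empty gap between the clusters

to_center = distance_to_center(query, dataset)
to_neighbours = mean_knn_distance(query, dataset, k=3)
```

In this toy case the query is close to the centre of the data cloud yet far from every actual compound, so the two views can disagree when the set is not a single compact cluster.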
4
Similarity searching
Derived with expert knowledge
• SAR
Derived with similarity searching methods
5
Similarity searching
•Rationale
• Based on the Similar Property Principle:
“Structurally similar compounds tend to exhibit similar properties”
•Application 1
• Calculate the pairwise similarity between a compound with known activity and each
compound in the database
• Rank the database on similarity to the known active
• Select top n% for further biological testing
•Application 2
• Calculate the pairwise similarity between a compound with unknown activity and each
compound from a set with known activity
• Rank the dataset on similarity to the query compound
• Infer the query compound's activity from the activities of the top n% compounds
•Application 3
• Calculate the similarity between a compound with unknown activity and two or more sets
of compounds with known activity
• Decide which set is more similar to the query compound
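Application 1 can be sketched in a few lines. This is only an illustration: fingerprints are represented as sets of "on" bit positions, the Tanimoto coefficient is used as the similarity measure, and the compound names and bits are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_by_similarity(query_fp, database):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def top_fraction(ranked, fraction):
    """Keep the top `fraction` of a ranked list (at least one entry)."""
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n]

# Hypothetical database of fingerprints
database = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 9},
    "cmpd_C": {7, 8, 9},
    "cmpd_D": {1, 2, 3, 4, 5},
}
known_active = {1, 2, 3, 4}

ranked = rank_by_similarity(known_active, database)
selected = top_fraction(ranked, 0.5)  # top 50% go forward for testing
```

Application 2 uses the same ranking with the roles swapped: the query has unknown activity and the ranked compounds carry the known activities.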
6
Chemical similarity quantified
Numerical representation of chemical structure
• Structural similarity 2D, 3D
• Descriptor-based similarity
• Field-based
• Spectral
• Quantum mechanics
• More…
Comparison between numerical representations
• Distance-like
• Association
• Correlation
7
Structural similarity
Substructure searching, Maximum Common Substructure: slow!
Fragment approach: fast
• Atom, bond or ring counts, degree of connectivity
• Atom-centred, bond-centred, ring-centred fragments
• Fingerprints, molecular holograms, atom environments
8
Fingerprints with Tanimoto distance
Tanimoto coefficient
Partial list of fingerprints for 3
compounds
Pairwise Tanimoto similarity
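As a small worked illustration, the Tanimoto coefficient for two bit-string fingerprints is c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. The 8-bit fingerprints below are hypothetical and far shorter than real ones.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two equal-length bit strings."""
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    return c / (a + b - c) if (a + b - c) else 0.0

# Hypothetical 8-bit fingerprints for three compounds
fingerprints = {
    "compound_1": "11010010",
    "compound_2": "11010000",
    "compound_3": "00101101",
}

# Pairwise Tanimoto similarity for every pair of compounds
pairwise = {
    (m, n): tanimoto(fingerprints[m], fingerprints[n])
    for m, n in combinations(fingerprints, 2)
}
for pair, value in pairwise.items():
    print(pair, round(value, 2))
```

Compounds 1 and 2 share three of their four distinct set bits, giving 0.75, while compound 3 shares no bits with either and scores 0.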
9
2D: Atom Environment (AE)
E.g. 6-aminoquinoline
•Assign Sybyl mol2 atom types
•find connections
•find connections to connections
•create a tree down to n levels
•‘bin’ the atom types for each level
•create a ‘fingerprint’ for this atom
Example for a nitrogen atom:
Level 0: N2
Level 1: Car, Car
Level 2: Car, Car, Car
These features are created for every (heavy) atom in the molecule
(J. Chem. Inf. Comput. Sci. 2004, 44, 170-178; 2004, 44, 1710-1718)
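The steps listed above can be sketched as a breadth-first walk over the molecular graph. This is a simplified illustration and not the published implementation: atom types are assigned by hand rather than by a typing routine, each atom is counted only at its shortest bond distance from the start atom (an assumption), and the toy molecule is pyridine rather than 6-aminoquinoline.

```python
from collections import Counter

def atom_environment(atom_types, bonds, start, max_level=2):
    """Count atom types at each bond distance (level) from `start`.

    Returns a list of Counters: index 0 is the start atom itself,
    index 1 its direct neighbours, and so on up to `max_level`.
    """
    neighbours = {i: set() for i in range(len(atom_types))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    visited = {start}
    frontier = [start]
    levels = [Counter([atom_types[start]])]
    for _ in range(max_level):
        next_frontier = []
        for atom in frontier:
            for nb in sorted(neighbours[atom]):
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append(nb)
        # 'Bin' the atom types found at this level
        levels.append(Counter(atom_types[a] for a in next_frontier))
        frontier = next_frontier
    return levels

# Toy input: pyridine heavy atoms with hand-assigned Sybyl-style type labels
atom_types = ["N.ar", "C.ar", "C.ar", "C.ar", "C.ar", "C.ar"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

env_n = atom_environment(atom_types, bonds, start=0, max_level=2)
# One environment per heavy atom, as on the slide
all_envs = [atom_environment(atom_types, bonds, start=i) for i in range(len(atom_types))]
```

For the nitrogen this gives one N.ar at level 0 and two C.ar at each of levels 1 and 2; the per-atom environments together describe the molecule.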
10
What do we measure
We compare numerical representations of chemical
compounds
• The numerical representation is not unique
• The numerical representation includes only part of all the information
about the compound
• A distance measure reflects “closeness” only if the data satisfy
specific assumptions
• Statistical assumptions and statistical error are involved
11
Fingerprint similarity:
Tanimoto coefficient specifics
Information loss – fragment presence and absence instead of counts
Bit string saturation – within a large database almost all bits are set
Can give nonintuitive results:
• The average similarity appears to increase with the complexity of the query compound
• Queries for large molecules are more discriminating (flatter curve, Tanimoto values spread wider)
• Queries for small molecules give a sharp peak and cannot distinguish between molecules
[Figure: the distribution of Tanimoto values found in database searches with a range of query molecules]
Flower D., On the Properties of Bit String-Based Measures of Chemical Similarity, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998
12
Descriptor similarity: Distances specifics
Euclidean distance
City-block distance
Mahalanobis distance
Equidistant contours = points at equal distance from the query point
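A minimal sketch of the three distances, restricted to two descriptors so that the covariance matrix can be inverted by hand. The data are hypothetical, and the sample covariance (dividing by n - 1) is an assumption about how the Mahalanobis distance would be set up.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def covariance_2d(data):
    """Sample covariance entries (sxx, sxy, syy) for a list of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(x, y, cov):
    """sqrt(d^T S^-1 d) for a 2x2 covariance matrix given as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

# Hypothetical, positively correlated two-descriptor data with mean (3, 3)
data = [(1.0, 1.0), (2.0, 2.5), (3.0, 2.5), (4.0, 4.5), (5.0, 4.5)]
cov = covariance_2d(data)
center = (3.0, 3.0)
along = (4.0, 4.0)   # displaced along the trend of the data
across = (4.0, 2.0)  # displaced against the trend of the data
```

Both test points are equally far from the centre by Euclidean and city-block distance, but the Mahalanobis distance of `across` is several times larger because it accounts for the correlation between the descriptors. This is why the equidistant contours are circles, diamonds and ellipses respectively.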
13
Chemical similarity by distance in relevant descriptor space
[Figure: three panels plotting pKa against E_HOMO. (a) Probabilistic classification; (b) classification by Euclidean distance; (c) classification by Mahalanobis distance]
14
Do structurally similar molecules have similar biological activity?
Set of 1645 chemicals with IC50s for monoamine oxidase inhibition
Daylight fingerprints, 1024 bits long (0-7 bonds)
When using the Tanimoto coefficient with a cutoff value of 0.85, only 30% of actives were detected
[Figure: cutoff values, % of actives detected, % false positives]
Y. Martin et al., J. Med. Chem. 2002, 45, 4350-4358
15
Structurally similar compounds can have
very different 3D properties
Kubinyi, H., Chemical Similarity and Biological activity
16
Back to the meaning of chemical
similarity
• Structural similarity (the available methods)
• Similarity in activity (the need)
Reconciliation: Similarity in “tailored” space
• Proximity in relevant descriptor space
• Structural similarity based on mechanism of action
• Weighted structural similarity
Data fusion – combine different methods
17
What kind of searches are desirable?
Detailed analysis for pairwise similarity
• Toxmatch
Similarity of a compound to compounds in the database
• Ambit Database Tools
Similarity of a compound to a reference set
• Toxmatch
Similarity of a set of compounds to compounds in the database
• Ambit Discovery
• Toxmatch
Grouping based on chemical class
• Toxtree
• Ambit db
18
AMBIT Discovery
Software for applicability domain and similarity assessment
Methods:
• Ranges
• Euclidean distance
• City-block Distance
• Probability Density
• Fingerprints
− Consensus fingerprint + Tanimoto distance
− Consensus fingerprint + Missing fragments
• Atom environments
− Consensus atom environments + Hellinger
distance
− kNN + Tanimoto distance
− Ranking
More options
• Threshold
• Preprocessing (e.g. PCA)
• Center
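Two of the simpler methods in this list, ranges and distance with a threshold, can be sketched as follows. This is an illustration under stated assumptions, not AMBIT Discovery's implementation: the range check is a plain min/max bounding box, the distance check uses Euclidean distance to the mean of the training set, and the descriptor values and threshold are hypothetical.

```python
def descriptor_ranges(training):
    """Per-descriptor (min, max) over the training set."""
    return [(min(col), max(col)) for col in zip(*training)]

def in_range_domain(query, ranges):
    """True if every descriptor of the query lies within the training range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(query, ranges))

def in_distance_domain(query, training, threshold):
    """True if the Euclidean distance to the training centroid is within threshold."""
    n = len(training)
    center = [sum(col) / n for col in zip(*training)]
    dist = sum((q - c) ** 2 for q, c in zip(query, center)) ** 0.5
    return dist <= threshold

# Hypothetical training set with two descriptors per compound
training = [[1.0, -9.5], [2.0, -10.0], [3.0, -9.0], [4.0, -10.5]]
ranges = descriptor_ranges(training)

query_inside = [2.5, -9.8]
query_outside = [6.0, -9.8]

inside_flags = (in_range_domain(query_inside, ranges),
                in_distance_domain(query_inside, training, threshold=2.0))
outside_flags = (in_range_domain(query_outside, ranges),
                 in_distance_domain(query_outside, training, threshold=2.0))
```

Combining several such checks and requiring agreement is one simple way to read the "consensus" idea, though how the software actually combines methods is not specified here.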
19
AMBIT Discovery
Data visualisation
20
AMBIT Discovery
Results
[Figure: residual plot for the consensus domain; compounds are marked as in domain or out of domain]
21
AMBIT Discovery
Results (exported to an MS Excel file)
22
Ambit Discovery applications
A variety of ways to explore similarity
• Based on properties
• Based on structural similarity
Application
• Consensus domain for a robust assessment
• SAR difficult due to multiple functional groups, multi target toxicity
• Assigning a chemical to a particular group defined by an expert
− Mechanistic reactivity domains for skin sensitization
J. Jaworska, N. Nikolova-Jeliazkova, How can structural similarity analysis help in category
formation?, SAR and QSAR in Environmental Research, Vol. 18, No. 3-4, June 2007, 1-13
23
What can Toxmatch do for you?
Assess pairwise similarity
Classify into groups
Predict activity
Pairwise similarities between each analyzed compound and the compounds in the
training/test sets are calculated.
A composite (averaged) similarity measure between a compound and a user-selected
subset of compounds is computed. Besides predefined subsets for certain training
data sets, subsets can be selected manually or automatically by a clustering algorithm.
The software allows compounds to be ranked according to a selected similarity index.
24
Similarity
What is implemented
• Structural – fingerprints (Tanimoto), atom environments, MCCS
• Descriptor – Euclidean distance, Hodgkin-Richards, Cosine, Tanimoto
• Set of rules for specific activities (e.g. BfR rules)
How to use similarity
• Similarity values say nothing about biological activity
• Similarity calculation has to be linked to specific activity
− Prediction – predict dependent variable (activity)
− Classification – classify into groups of activity
25
Similarity indices:
Distance-like similarity indices:
• General definition:
$$D_{AB}(k,x) = \left[ k\,(Z_{AA} + Z_{BB})/2 - x\,Z_{AB} \right]^{1/2}, \quad D_{AB} \in [0, \infty)$$
• Euclidean distance index (k = x = 2):
$$D_{AB} = \left[ Z_{AA} + Z_{BB} - 2 Z_{AB} \right]^{1/2}$$
Correlation-like similarity indices:
• General definition:
$$V_{AB}(k,x) = (k - x)\, Z_{AB}\, D_{AB}^{-2}(k,x), \quad V_{AB} \in [0, 1]$$
• Hodgkin-Richards index (k = 1, x = 0):
$$H_{AB} = 2 Z_{AB} \left[ Z_{AA} + Z_{BB} \right]^{-1}$$
• Tanimoto index (k = 2, x = 1):
$$T_{AB} = Z_{AB} \left[ Z_{AA} + Z_{BB} - Z_{AB} \right]^{-1}$$
• Cosine-like similarity index or Carbó index:
$$C_{AB} = Z_{AB} \left[ Z_{AA} Z_{BB} \right]^{-1/2}$$
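The indices on this slide can be checked numerically with a short sketch. It assumes that Z_AB is the dot product of two descriptor vectors, which is only one possible choice of similarity measure, and it uses the standard forms of the Hodgkin-Richards, Tanimoto and Carbó indices; the vectors are hypothetical.

```python
import math

def overlap(a, b):
    """Z_AB, taken here as the dot product of two descriptor vectors (an assumption)."""
    return sum(x * y for x, y in zip(a, b))

def distance_index(a, b, k=2, x=2):
    """D_AB(k, x) = [k (Z_AA + Z_BB) / 2 - x Z_AB]^(1/2); k = x = 2 is Euclidean."""
    return math.sqrt(k * (overlap(a, a) + overlap(b, b)) / 2 - x * overlap(a, b))

def correlation_index(a, b, k, x):
    """V_AB(k, x) = (k - x) Z_AB / D_AB(k, x)^2."""
    return (k - x) * overlap(a, b) / distance_index(a, b, k, x) ** 2

def hodgkin_richards(a, b):
    return 2 * overlap(a, b) / (overlap(a, a) + overlap(b, b))

def tanimoto(a, b):
    return overlap(a, b) / (overlap(a, a) + overlap(b, b) - overlap(a, b))

def carbo(a, b):
    return overlap(a, b) / math.sqrt(overlap(a, a) * overlap(b, b))

# Hypothetical descriptor vectors: Z_AA = 14, Z_BB = 9, Z_AB = 9
A = [1.0, 2.0, 3.0]
B = [2.0, 2.0, 1.0]

euclid = distance_index(A, B)   # sqrt(14 + 9 - 18) = sqrt(5)
h = hodgkin_richards(A, B)      # 18 / 23
t = tanimoto(A, B)              # 9 / 14
c = carbo(A, B)                 # 9 / sqrt(126)
```

Under these definitions the general correlation-like form reproduces the named indices: `correlation_index(A, B, 1, 0)` equals the Hodgkin-Richards value and `correlation_index(A, B, 2, 1)` equals the Tanimoto value.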
26
Toxmatch - main screen
Similarity
Controls which group will be displayed on the chart below
Click on a point to view the compound
Controls what will be displayed on the X/Y axes
27
Similarity and activity
Background:
• Prediction – predict dependent variable (activity)
• Classification – classify into groups of activity
Implementation:
• Prediction – find k most similar compounds and predict
activity based on activities of those compounds (weighted
average)
• Classification – classify the query compound into the group
where most of the k most similar compounds belong
28
Activity prediction by similarity
Predict dependent variable (activity)
• Measured activity values should be available for the
training set
Find k most similar compounds and predict activity based on
activities of these compounds
• The actual set of k most similar compounds depends on
similarity measure
• The predicted value is weighted average of k activities
Reported values:
• Similarity to the entire data set
• Predicted activity value
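The prediction scheme above amounts to a k-nearest-neighbour estimate. The sketch below is a minimal illustration, assuming Tanimoto similarity on set-based fingerprints and using the similarity values themselves as weights; the slide does not say which weights Toxmatch uses, and the data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def predict_activity(query_fp, training, k=3):
    """Similarity-weighted average activity of the k most similar compounds.

    `training` is a list of (fingerprint, measured_activity) pairs.
    """
    scored = sorted(
        ((tanimoto(query_fp, fp), activity) for fp, activity in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    if total_weight == 0:
        # No neighbour shares any bit with the query: fall back to a plain mean
        return sum(act for _, act in scored) / len(scored)
    return sum(sim * act for sim, act in scored) / total_weight

# Hypothetical training set: (fingerprint, measured activity)
training = [
    ({1, 2, 3, 4}, 5.0),
    ({1, 2, 3}, 4.0),
    ({1, 2}, 2.0),
    ({8, 9}, 9.0),
]
query = {1, 2, 3, 4}

predicted = predict_activity(query, training, k=3)
```

Here the three nearest neighbours have similarities 1.0, 0.75 and 0.5, so the prediction is (5.0 + 3.0 + 1.0) / 2.25 = 4.0; the dissimilar fourth compound does not contribute. Changing the similarity measure can change which compounds make up the k neighbours.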
29
Classification by similarity
Classify into groups of activity
• Activity groups should be available for the training set (e.g. potency
classes or other grouping)
Find k most similar compounds and classify the query
compound into the group where most of these compounds
belong
• The actual set of k most similar compounds depends on similarity
measure
• The predicted value is
− The probability of belonging to a group (m/k, where m is the number of
the k most similar compounds that belong to that group)
− The group predicted (the one with the highest probability)
Reported values:
• Similarity to each group (Dataset.distance.group)
• The group predicted
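Classification by similarity can be sketched the same way: take the k most similar compounds and report m/k for each group. The example assumes Tanimoto similarity on set-based fingerprints; the group labels and fingerprints are hypothetical.

```python
from collections import Counter

def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def classify(query_fp, training, k=3):
    """Return (predicted_group, {group: m/k}) from the k most similar compounds.

    `training` is a list of (fingerprint, group_label) pairs.
    """
    neighbours = sorted(
        training,
        key=lambda item: tanimoto(query_fp, item[0]),
        reverse=True,
    )[:k]
    counts = Counter(group for _, group in neighbours)
    probabilities = {group: m / len(neighbours) for group, m in counts.items()}
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities

# Hypothetical training set with two potency classes
training = [
    ({1, 2, 3, 4}, "strong"),
    ({1, 2, 3}, "strong"),
    ({1, 2}, "weak"),
    ({8, 9}, "weak"),
    ({7, 8, 9}, "weak"),
]
query = {1, 2, 3, 4}

group, probabilities = classify(query, training, k=3)
```

With k = 3 the neighbours are two "strong" compounds and one "weak" compound, so the query is assigned to "strong" with probability 2/3, even though "weak" compounds are the majority in the full training set.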
30
More in Toxmatch
•Predefined training sets for 4 endpoints
•Similarity matrix
•Importing descriptors
•Calculating descriptors
•Comparing training and test sets
•BfR rules for predicting skin irritation potential
Thank you!