Chemical Similarity – An overview
Toxmatch - a tool to assess
chemical similarity
Nina Jeliazkova - Ideaconsult Ltd., Sofia, Bulgaria
Ana Gallegos Saliner, Grace Patlewicz - European Chemicals
Bureau, Ispra, Italy
Joanna Jaworska - Procter & Gamble, Brussels, Belgium
Introduction
Chemical similarity is a widely used concept in toxicology, based on the
hypothesis that similar compounds have similar biological activities.
Toxmatch software
• Full-featured and flexible user-friendly open source software,
• Encodes several chemical similarity indices in order to facilitate systematic
approaches of classifying chemicals into categories.
• The core functionalities include:
• Comparing datasets, based on various structural and descriptor-based
similarity indices;
• Calculating pairwise similarity between compounds and the aggregated similarity of a
compound to a set;
• Several graphical displays that highlight the closeness of chemicals between
datasets.
Toxmatch has been commissioned by the European Chemicals Bureau
(ECB) and will be made available as a free download from its website
[http://ecb.jrc.it/qsar/qsar-tools/].
Ideaconsult Ltd.,
Sofia, Bulgaria
SETAC Europe 17th Annual
Meeting (20-24 May 2007)
Similarity: philosophers’ view
“It is ill defined to say ‘A is similar to B’ and it
is only meaningful to say ‘A is similar to B with
respect to C’.”
A chemical “A” cannot be similar to a chemical “B”
in absolute terms,
but only with respect to some measurable key feature.
Similarity : chemists’ view
Intuitively, based on expert judgment
A chemist would describe “similar” compounds in
terms of “approximately similar backbone and
almost the same functional groups”.
Chemists have different views on similarity
(based on experience or context)
Lajiness et al. (2004). Assessment of the Consistency of Medicinal Chemists in
Reviewing Sets of Compounds, J. Med. Chem., 47(20), 4891-4896.
Similarity by computers
Computerized chemical similarity assessment needs
unambiguous definitions
Chemicals of interest are encoded numerically (using
graphs, descriptors, wave functions, etc.).
The measure between the numerical representations
is called a similarity index.
The variety of numerical representations and
ways to define a comparative measure has
resulted in a plethora of approaches to measuring
similarity between chemical compounds.
How to select the proper one?
From similarity to
toxicological activity
• The aim in assessing chemical similarity in toxicology
is to systematically identify chemicals with similar
biological activities.
• The similarity hypothesis is well substantiated but
there are also many contradictory examples, where a
small change in chemical structure has led to a
dramatic change in the biological response.
• This “similarity paradox” suggests that no single and
universally applicable similarity measure exists, but
the choice depends on the particular endpoint.
Tailored similarity
A “tailored” similarity space is a space comprising specifically
selected descriptors or structural patterns. The process within
Toxmatch is as follows:
1. Train similarity measures for specific activities (a training set is
required)
2. Select relevant features by supervised learning methods (e.g.
Weka data mining library)
3. Calculate similarity to the set
4. Calculate pairwise similarity
5. Use pairwise similarity values of the k most similar chemicals to make
a decision on toxicity:
• Predict toxicological activity, or
• Classify into groups of similar toxicity
6. Group similar chemicals (through unsupervised clustering based on pairwise
similarity). These groups will not necessarily coincide with groups of
similar toxicity.
Tailored similarity in Toxmatch
Select training set (4 predefined datasets available)
Encode chemicals into descriptors, fingerprints or
atom environments
Perform data preprocessing
Perform feature selection
Calculate similarity
Use similarity values for:
classification into categories
toxicity prediction, read across
clustering
dataset comparison
Visualization, interpretation
Toxmatch main screen
Training sets
Toxmatch comes with the
following training sets:
•Aquatic toxicity
•Bioconcentration factor
•Skin sensitization
•Skin irritation
Structure representation
Descriptors
• Imported (e.g. calculated by third-party software)
• Calculated (> 100 descriptors available from the open
source software The Chemistry Development Kit
(CDK), http://cdk.sf.net)
Fingerprints
• Daylight-style hashed fingerprints, 1024 bit length
• CDK implementation
Atom environments (circular fingerprints)
• Ambit implementation
Data preprocessing
and feature selection
Data preprocessing:
Data standardization
Principal Component Analysis (PCA)
Missing values processing
Feature selection
Purpose: select most relevant descriptors or structural patterns (tailored similarity
space)
Algorithms:
ReliefF (finds a subset of descriptors that will give the best results in prediction or classification)
InfoGain (finds the descriptors that discriminate best between groups)
Implementation
Toxmatch makes use of open source Waikato Environment for Knowledge
Analysis (WEKA) [http://www.cs.waikato.ac.nz/ml/weka/ ]
Developed by the Department of Computer Science, University of Waikato, New
Zealand
Machine learning/data mining software written in Java (distributed under the GNU
General Public License)
Used for research, education, and applications worldwide
Implementation of hundreds of published data mining algorithms (regression,
classification, clustering, evaluation and validation)
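To make the information gain criterion named on this slide concrete, here is a minimal Python sketch that ranks descriptors by how well they separate activity groups. It is not the WEKA implementation used by Toxmatch: the median split is a simplifying assumption standing in for a proper discretization step, and the descriptor names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one descriptor after a binary split at its median.

    Assumption: the median split replaces a real discretization method.
    """
    median = sorted(values)[len(values) // 2]
    low = [lab for v, lab in zip(values, labels) if v < median]
    high = [lab for v, lab in zip(values, labels) if v >= median]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


# Hypothetical training data: two descriptors and one activity group per chemical.
descriptors = {
    "logP": [0.5, 0.8, 1.1, 3.2, 3.9, 4.4],
    "mol_weight": [120.0, 310.0, 150.0, 300.0, 130.0, 320.0],
}
classes = ["low", "low", "low", "high", "high", "high"]

# Descriptors sorted from most to least discriminating.
ranking = sorted(
    descriptors,
    key=lambda name: info_gain(descriptors[name], classes),
    reverse=True,
)
print(ranking)
```

In this toy data set the hypothetical logP descriptor separates the two groups completely, so it ranks first; descriptors kept after such a ranking would form the tailored similarity space.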
Similarity indices in Toxmatch
•Euclidean distance
•Cosine similarity
•Hodgkin-Richards index
•Tanimoto distance on
descriptors
•Tanimoto distance on
fingerprints
•Hellinger distance on atom
environments
•Maximum Common
Structure similarity
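As an illustration of the descriptor-based indices listed above, the following Python sketch uses their textbook vector forms. It is not taken from the Toxmatch source code; the function names are hypothetical, and applying the Hellinger distance to normalised atom environment counts is an assumption about how such counts could be compared.

```python
import math


def dot(a, b):
    """Inner product of two equally long descriptor vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def hodgkin_richards(a, b):
    """Hodgkin-Richards index: 2ab / (aa + bb)."""
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))


def tanimoto_descriptors(a, b):
    """Tanimoto coefficient for real-valued vectors: ab / (aa + bb - ab)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


def hellinger_distance(p, q):
    """Hellinger distance between two count vectors (e.g. atom environment
    counts), each normalised to relative frequencies first (assumption)."""
    sp, sq = sum(p), sum(q)
    return math.sqrt(
        0.5 * sum((math.sqrt(x / sp) - math.sqrt(y / sq)) ** 2 for x, y in zip(p, q))
    )


a, b = [1.0, 2.0, 2.0], [2.0, 1.0, 2.0]
print(euclidean_distance(a, b), cosine_similarity(a, b))
print(hodgkin_richards(a, b), tanimoto_descriptors(a, b))
print(hellinger_distance([1, 0], [0, 1]))
```

The Tanimoto and cosine functions return similarities; where a distance is needed, one common convention (assumed here, not stated in the slides) is to use 1 minus the similarity.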
Figure: similarity matrices for structures with Reactive Mode of Action, EPA Fathead Minnow dataset (DSSTox). Panels: descriptors with Euclidean distance; fingerprints with Tanimoto distance; atom environments with Hellinger distance. Similarity values are colour coded from 0 to 1.
Pairwise similarity - visualization
•Similarity matrix
•Compare chemicals in:
•the training set
•subsets of the training set
•the test set
•subsets of the test set
•training set and test set
•subsets of training and test set
•Clicking on a matrix cell displays the pair
of chemicals and their similarity value
•Retrieve most similar chemicals to a
query one (user selected threshold)
Similarity values are colour coded from 0 to 1.
Pairwise similarity - visualization
•Retrieve most similar
chemicals to a query one
•Load training set
•Select similarity measure
•Load your query compounds (draw them
or enter them by IUPAC name or
SMILES)
•Switch to similarity matrix tab
•Specify similarity threshold
•Press <Show> button to
display most similar chemicals
•The results can be browsed
or exported to a file
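The retrieval step described above can be sketched in Python as follows. This is an illustrative stand-in for what the tool does, not its actual code: fingerprints are represented as sets of on-bit positions, and the chemical names, bit positions and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, training_set, threshold):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in training_set.items())
    hits = [(name, s) for name, s in scored if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical training set: chemical name -> on-bit positions of its fingerprint.
training = {
    "chem_A": {1, 4, 9, 17},
    "chem_B": {1, 4, 20},
    "chem_C": {2, 3},
}
query = {1, 4, 9}

hits = most_similar(query, training, threshold=0.4)
print(hits)
```

Raising the threshold shortens the list of retrieved neighbours; with the values above, chem_C shares no bits with the query and is filtered out.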
Similarity to nearest neighbors, by
Fingerprints and Tanimoto distance
Example: most similar compounds to n-hexanal (query) from the BCF training set.
Figure: n-hexanal at the centre, linked to its nearest neighbours; the Tanimoto values shown on the links are 0.47, 0.46, 0.43, 0.43 and 0.375 (four links).
Nearest neighbours shown:
•Cyclohexane, CAS: 110-82-7, LogBCF=1.84
•2-Butanone oxime, CAS: 96-29-7, LogBCF=0.58
•Isophorone, CAS: 78-59-1, LogBCF=0.84
•Decalin, CAS: 91-17-8, LogBCF=3.25
•Pentadecane, CAS: 4390-04-9, LogBCF=1.21
•Pentadecane, CAS: 629-62-9, LogBCF=1.21
•Methylcyclohexane, CAS: 108-87-2, LogBCF=2.25
•CAS: 544-76-3, LogBCF=1.31
Pairwise similarity is not everything!
Similarity to a set:
•Similarity between a query
structure and a representative
point of the set:
•dataset centre (descriptor
space)
•weighted fingerprint
•Doesn’t work well for diverse data sets
•Average similarity between a query
structure and the nearest k
chemicals
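The two ways of measuring similarity to a set can be compared with a small Python sketch. It is illustrative only: converting a Euclidean distance into a similarity with 1 / (1 + d) is an assumption made here, and the two-cluster data set is hypothetical.

```python
import math


def similarity(a, b):
    """Similarity in (0, 1] derived from Euclidean distance (assumed mapping)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def centroid(vectors):
    """Dataset centre in descriptor space: the mean of each descriptor."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def similarity_to_centre(query, dataset):
    return similarity(query, centroid(dataset))


def mean_knn_similarity(query, dataset, k):
    """Average similarity between the query and its k most similar chemicals."""
    top = sorted((similarity(query, v) for v in dataset), reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical diverse data set made of two well separated clusters.
dataset = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
query = [0.0, 0.5]

centre_score = similarity_to_centre(query, dataset)
knn_score = mean_knn_similarity(query, dataset, k=2)
print(centre_score, knn_score)
```

The query sits inside one cluster, yet its similarity to the dataset centre is low because the centre falls between the clusters. The k nearest neighbour average does not have this problem, which illustrates why a single representative point works poorly for diverse data sets.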
How to use similarity values
•Similarity to the
set vs. activity
(figure: Similarity vs. Activity plot)
•Similarity values
per se are not
correlated with
toxicity values
•How to make use
of similarity
values?
Toxicity prediction by similarity (1)
Toxicity prediction in Toxmatch is based on the weighted average of the activity
values of the k nearest neighbours.
The actual set of k most similar compounds depends
on the similarity measure.
The weights are proportional to the pairwise
similarities (i.e. the activity value of the most similar
compound has the largest weight, and vice versa).
In order to predict the dependent variable (activity),
measured activity values should be available for the
training set.
Two values are reported for each compound:
the averaged similarity to the k nearest neighbours and
the predicted activity value.
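A minimal Python sketch of this similarity-weighted k nearest neighbour prediction is given below. It follows the description on this slide but is not the Toxmatch code; the function name and the similarity and activity values are hypothetical.

```python
def predict_activity(similarities, activities, k):
    """Predict activity as the similarity-weighted mean over the k nearest neighbours.

    similarities[i] is the pairwise similarity between the query and training
    compound i; activities[i] is that compound's measured activity.
    Returns (average similarity to the k neighbours, predicted activity).
    """
    # Keep the k training compounds most similar to the query.
    neighbours = sorted(zip(similarities, activities), reverse=True)[:k]
    total = sum(s for s, _ in neighbours)
    if total == 0:
        raise ValueError("query has zero similarity to all selected neighbours")
    predicted = sum(s * a for s, a in neighbours) / total
    return total / len(neighbours), predicted


# Hypothetical similarities and measured activities for four training compounds.
sims = [0.8, 0.6, 0.2, 0.1]
acts = [2.0, 3.0, 5.0, 1.0]

mean_similarity, predicted = predict_activity(sims, acts, k=2)
print(mean_similarity, predicted)
```

Reporting the average similarity next to the prediction gives a rough indication of how much support the predicted value has: a prediction built on distant neighbours deserves less trust.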
Toxicity prediction by similarity (2)
The procedure
Predicted vs. Observed plot
See also poster Ana Gallegos Saliner et al. THE USE OF A DESCRIPTOR-BASED
APPROACH TO PREDICT SKIN IRRITATION USING READILY ACCESSIBLE SOFTWARE TOOLS
Read across
Read-across is the process by which endpoint
information for one chemical is used to make a
prediction of the endpoint for another chemical, which
is considered to be similar in some way. Read-across
can be either qualitative or quantitative, though in both
cases a common substructure is required.
Toxmatch: prediction by the weighted average of the toxicity
values of the most similar chemicals (nearest neighbours)
is essentially quantitative read-across.
Classification into toxicity classes
Toxmatch classifies chemical compounds into groups
of toxicological activity, based on similarity values of
k nearest neighbours (k most similar compounds)
The query compound is classified into the group
where most of the k nearest neighbours belong.
For this purpose activity groups should be available
for the training set (e.g. potency classes or other
grouping).
The values reported are:
The group predicted.
Probability to belong to a group (m/k, where m is the
number of the k nearest neighbours that belong to the group).
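The majority vote and the m/k probability can be sketched in Python as follows. The sketch is illustrative rather than the Toxmatch implementation; the group labels and similarity values are hypothetical, and ties between groups are resolved by whichever group `Counter.most_common` returns first, since the slides do not say how ties are handled.

```python
from collections import Counter


def classify(similarities, groups, k):
    """Assign the query to the group most common among its k nearest neighbours.

    Returns (predicted group, {group: m / k}) where m is the number of the
    k nearest neighbours that belong to the group.
    """
    neighbours = sorted(zip(similarities, groups), reverse=True)[:k]
    votes = Counter(group for _, group in neighbours)
    predicted, _ = votes.most_common(1)[0]
    probabilities = {group: m / len(neighbours) for group, m in votes.items()}
    return predicted, probabilities


# Hypothetical similarities of a query to six training compounds and their classes.
sims = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]
groups = ["moderate", "weak", "moderate", "nonsensitizer", "moderate", "strong"]

predicted, probabilities = classify(sims, groups, k=5)
print(predicted, probabilities)
```

With k = 5, three of the five neighbours are in the hypothetical "moderate" class, so that class is predicted with probability 3/5.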
Classification into toxicity classes
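Example: skin sensitization training set, 5 potency classes.
Figure: a set of query chemicals; for the selected query, the most similar chemicals are shown together with the resulting class probabilities: Moderate 40%, Weak 40%, Nonsensitizers 20%.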
Categories
“Traditional organic chemical categories do not
encompass groups of chemicals that are
predominately either toxic or nontoxic across a
number of toxicological endpoints or even for specific
toxic activities”
Rosenkranz H.S., Cunningham A.R. (2001) Chemical Categories for
Health Hazard Identification: A feasibility Study, Regulatory
Toxicology and Pharmacology 33, 313-318.
Conclusion:
use expert knowledge
or machine learning methods to develop categories
Toxmatch provides both options
Classification by expert
defined rules
Implementation of
structure and
physicochemical
property rules (BfR)
for skin irritation
prediction
Available also as a
Toxtree plug-in
Datasets comparison in Toxmatch
Example: Comparison of EINECS database and LLNA
skin sensitization dataset in fingerprint similarity space
Theory (plot with axes "Distance to the training set" and "Distance to the test set"); the regions are labelled:
•Test set far from the training set
•Test set close to the training set
•Training set far from Test set
•Training set close to Test set
Conclusions
One approach to comparing the similarity between two or more
chemicals is through the use of similarity indices.
This relies on the chemicals of interest being encoded numerically
(using graphs, descriptors, wave functions, etc.) and then using a
measure, the similarity index, to make the comparison.
To facilitate similarity assessment, such indices can be readily encoded
into software tools.
Toxmatch is a new program that facilitates the systematic
assessment of chemical similarity, which is a key component in the
development and evaluation of grouping approaches such as read-across and chemical categories.
Why Toxmatch
Open source
Toxmatch core relies on actively developed
and widely used open source software
Chemoinformatics (The CDK)
Data mining (WEKA)
Scientifically transparent (there are many CDK and
WEKA related peer reviewed publications)
Easily extendable
Platform independent
Acknowledgments
ECB contract #CCR.IHCP.C431607.X0 / 22.12.2005
“DEVELOPMENT OF A SOFTWARE TOOL TO ENCODE
AND APPLY CHEMICAL SIMILARITY INDICES”
Various open source software packages:
The Chemistry Development Kit (cheminformatics)
WEKA data mining library (data mining algorithms)
Ambit (structural similarity and data management)
Toxtree (Verhaar rules for toxicity MOA and implementation of
BfR rules for skin irritation prediction)
JFreechart (visualization)
JAMA (matrix operations)
Many more
Thank you!
What do we measure
We compare numerical
representations of chemical
compounds
The numerical representation
is not unique
The numerical representation
includes only part of all the
information about the
compound
A distance measure reflects
“closeness” only if the data
satisfy specific assumptions
ReliefF
•An instance based method that involves finding nearest neighbours
•Supervised (makes use of the class attribute, i.e. makes use of
available grouping)
•References
•Kira, K. and Rendell, L. A. (1992). A practical approach to
feature selection. In D. Sleeman and P. Edwards, editors,
Proceedings of the International Conference on Machine
Learning, pages 249-256. Morgan Kaufmann.
•Kononenko, I. (1994). Estimating attributes: analysis and
extensions of Relief. In De Raedt, L. and Bergadano, F., editors,
Machine Learning: ECML-94, pages 171-182. Springer Verlag.
•Robnik-Šikonja, M. and Kononenko, I. (1997). An adaptation of Relief
for attribute estimation in regression. In D. Fisher, editor, Machine
Learning: Proceedings of the 14th International Conference on
Machine Learning (ICML'97), Nashville, TN.