Machine Learning for QSAR Development
A Data-Driven Approach for
Improved Effective Classification
in Predictive Toxicology
ICCC 2006, Tallinn
Dr. Daniel NEAGU, Dr. Gongde GUO
Bradford, UK
Bradford,
West Yorkshire
National Museum of Film and Television
School of Informatics,
University of Bradford
Overview
Short Introduction to Predictive Toxicology Data
and Models
The Current Context on Interspecies Data
Extrapolation
Our Motivation and Approach
Algorithm for Data-driven Hybrid Classification
Model development
Case studies
Results and Conclusions
Predictive Data Mining
The processes of data classification and regression
whose goal is to obtain predictive models for a
specific target, based on predictive relationships
among a large number of input variables.
Classification defines characteristics of data and
identifies a data item as a member of one of several
predefined categorical classes.
Regression uses the existing numerical data values
and maps them to a real-valued prediction (target)
variable.
Predictive Toxicology
Predictive Toxicology:
a multi-disciplinary science
requires close collaboration among toxicologists,
chemists, biologists, statisticians and AI/ML
researchers.
The goal of toxicity prediction is to describe the
relationship between chemical properties,
biological and toxicological processes:
relates features of a chemical structure to a property,
effect or biological activity associated with the
chemical
Data in Predictive Toxicology
ML applications for Predictive Toxicology
The EC proposal for the REACH regulation indicates that the
information requirements under REACH can be (partially)
fulfilled by using scientifically valid (Q)SAR models.
To guide the validation of computer-based methods, five
OECD principles for the validation of (Quantitative)
Structure-Activity Relationships were adopted:
a defined endpoint
an unambiguous algorithm
a defined domain of applicability
appropriate measures of goodness-of-fit, robustness and
predictivity
a mechanistic interpretation, if possible
The Context for our Approach
[Diagram: In Vivo Data, In Vitro Data, In Silico (Algorithms)]
Data from In Vivo experiments:
increased laboratory standards
financial and social costs
questionable outputs given different initial
conditions for tests and also the definition of the
output between various experiments
In Vitro generated data:
reduces the costs of in vivo experiments
dependent on artificial conditions
focused on particular output measurements,
without an integrated biological dependency and
reaction
In Silico data:
depends on the computing and modelling
resources
far less expensive than the previous two
one might define an inversely proportional
relationship between data quality and data
quantity
Our Approach
Data availability: different chemical compounds are chosen and tested
on different species for different purposes, and some of them are tested
on more than one species for various experimental reasons
Sparse data sets
Copyrighted
Not homogeneous (endpoint, laboratory conditions, standards, measurement
units)
Distributed in time and sources
Further supporting experimental data for training classifiers are
frequently limited and expensive.
Some endpoints show good correlations (e.g. aquatic toxicity measured
for various fish species, daphnia etc.)
Consequently, extrapolation methods can be used in regulatory
toxicology to overcome these drawbacks
The goal is to predict the toxic effects of different chemical compounds on
particular species by considering both the toxicity values/classes of
chemical compounds which have been tested on these species and those
obtained on other species with correlated toxicity values/classes.
Multi-Classifier Systems
Different classifiers potentially offer complementary
or at least additional information about patterns to
be classified
Various approaches to classifier combinations:
majority voting
entropy-based combination
Dempster-Shafer theory-based combination
Bayesian classifier combination
similarity-based classifier combination
fuzzy inference
gating networks
statistical models
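As a minimal illustration of the simplest combination scheme listed above, majority voting can be sketched as follows. The function name and the tie-breaking rule are assumptions made for this sketch; the slides do not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the most classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Three hypothetical classifiers vote on one compound.
print(majority_vote(["C1", "C2", "C2"]))
```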
We propose a Data-driven Multi-Classifier
Model for correlated PT Data Sets
Step 1: for each dataset, build a model on all instances with a
predefined class label, and then use this model to predict any
unclassified instances.
Step 2: for every two datasets, count the number of instances for which
both have a predefined class label and, among them, the numbers of exact
matches, matches with distance 1 and matches with distance 2.
Step 3: find potential pairs of datasets from different endpoints with
highly correlated toxicity classes, i.e. pairs for which the rate of
matches with distance ≤ 1 is greater than or equal to 90%.
Based on previous investigations, and under the assumption that for two
datasets a high correlation exists between the classes of the same chemical
compounds, a hybrid integration scheme is proposed:
Step 4: for each dataset, build a model based on the training set
and then use it to classify new instances. If the distance
between the predicted class and the class of the same chemical
compound in its most correlated dataset with a different endpoint is 2,
give the class label of the latter to the new instance.
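Steps 2 and 3 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: each dataset is represented as a dictionary from a compound identifier to a class index (1 = most toxic), the function names are hypothetical, and the toy labels are invented.

```python
from collections import Counter
from itertools import combinations


def distance(ci, cj):
    """Distance between two class indices (1 = most toxic, m = least toxic)."""
    return abs(ci - cj)


def match_counts(labels_a, labels_b):
    """Step 2: count shared labelled compounds and their class distances.

    Each argument maps a compound identifier to its class index.
    Returns (number of shared compounds, Counter of distances).
    """
    shared = labels_a.keys() & labels_b.keys()
    by_distance = Counter(distance(labels_a[c], labels_b[c]) for c in shared)
    return len(shared), by_distance


def match_rate_within_one(labels_a, labels_b):
    """Fraction of shared compounds whose classes differ by at most 1."""
    n_shared, by_distance = match_counts(labels_a, labels_b)
    if n_shared == 0:
        return 0.0
    return (by_distance[0] + by_distance[1]) / n_shared


def correlated_pairs(datasets, threshold=0.90):
    """Step 3: dataset pairs whose distance <= 1 match rate reaches the threshold."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(datasets.items(), 2)
        if match_rate_within_one(a, b) >= threshold
    ]


# Toy labels, invented for illustration only (not DEMETRA data).
toy = {
    "Trout": {"a": 1, "b": 2, "c": 3, "d": 1},
    "Daphnia": {"a": 1, "b": 3, "c": 3, "e": 2},
    "Bee": {"a": 3, "b": 2, "c": 1},
}
print(correlated_pairs(toy))
```

Only compounds labelled in both datasets contribute to the match rate, which mirrors the wording of Step 2.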
The Architecture of the Data-driven Multiple
Classifier System for PT Interspecies Extrapolation
[Diagram: Descriptors and Class data feed Training (Endpoint1) and Training (Endpoint2); each training stage produces a Model that is applied in Testing]
The predicted class of an instance t: Ci
The class of an instance t: Cj
If d(Ci, Cj) ≤ δ: C=Ci
Otherwise: C=f(Ci, Cj)
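The decision rule in the diagram can be sketched as a small function. The defaults are assumptions chosen to agree with Step 4 for three classes: δ = 1, and f(Ci, Cj) returning Cj, the class from the most correlated dataset. Returning the predicted class when the compound is absent from the correlated dataset is also an assumption, and the function and parameter names are hypothetical.

```python
def hybrid_class(predicted, correlated, delta=1, combine=None):
    """Apply C = Ci if d(Ci, Cj) <= delta, otherwise C = f(Ci, Cj).

    predicted  -- Ci, class index predicted by the model for instance t
    correlated -- Cj, class index of the same compound in the most
                  correlated dataset, or None if it is not available
    delta      -- largest distance at which the prediction is kept
    combine    -- f(Ci, Cj); defaults to returning Cj (assumption
                  based on Step 4 of the algorithm)
    """
    if correlated is None:
        # Assumption: with no correlated label, keep the model's prediction.
        return predicted
    if abs(predicted - correlated) <= delta:
        return predicted
    if combine is None:
        return correlated
    return combine(predicted, correlated)


print(hybrid_class(1, 2))  # distance 1: the prediction is kept
print(hybrid_class(3, 1))  # distance 2: the correlated class is used
```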
Datasets
DEMETRA*
1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds
*http://www.demetra-tox.net
Descriptors
Multiple descriptor types
Various software packages to calculate 2D and
3D attributes*
*http://www.demetra-tox.net
Model Development
Algorithms were chosen for being representative and
diverse, and for being easy, simple and fast to access:
Bayes Networks (BN)
Instance-Based Learning algorithm (IBL)
Decision Tree learning algorithm (DT)
Repeated Incremental Pruning to Produce Error
Reduction (RIPPER)
Multi-Layer Perceptrons (MLPs)
Support Vector Machine (SVM)
Experiments
1. For each dataset the most relevant descriptors were selected
by considering the individual predictive ability of each descriptor
along with the degree of redundancy between them:
Subsets of descriptors that are highly correlated with the class
while having low intercorrelation were preferred.
2. A model based on all available training instances with
predefined classes was built for each dataset and then used to
predict unclassified instances.
3. Comparison of the differences of toxicity classes of the same
chemical compounds for two different endpoints.
The difference between toxicity classes is measured by a distance
function: for class labels in descending order of toxicity
(C={c1, c2,.., cm}, Toxicity(c1)≥Toxicity(c2) ≥ … ≥ Toxicity(cm))
\[
\mathrm{Distance}(c_i, c_j) =
\begin{cases}
j - i, & \text{if } i \le j \\
i - j, & \text{otherwise}
\end{cases}
\]
The pairs (Trout, Daphnia), (Bee, Dietary_Quail), (Dietary_Quail,
Oral_Quail) are significantly correlated
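The descriptor selection in experiment 1 prefers subsets that correlate strongly with the class and weakly with each other. The slides do not name the method used; one well-known heuristic that matches this description is the correlation-based feature selection merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean descriptor-class correlation and r_ff the mean descriptor-descriptor correlation. The sketch below assumes Pearson correlation and numeric class codes, both of which are simplifications, and uses invented toy values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def subset_merit(columns, target):
    """Merit of a descriptor subset: high class correlation, low redundancy."""
    k = len(columns)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy values, invented for illustration only.
target = [1, 2, 3, 4]
a = [1, 1, 2, 2]  # correlated with the target
b = [1, 2, 1, 2]  # weakly correlated with the target, uncorrelated with a

# A complementary pair scores higher than a fully redundant pair.
print(subset_merit([a, b], target) > subset_merit([a, a], target))
```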
Results
C1 stands for the highly toxic class; C2 stands for the medium toxic class; C3
stands for the non-toxic class; PTN is the percentage of toxic chemical
compounds being classified as non-toxic chemical compounds
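The PTN measure defined above can be computed as in the following sketch. It assumes that the denominator is the number of truly toxic compounds (C1 or C2) rather than the total number of compounds, which the slides do not state explicitly; the function name and the toy labels are hypothetical.

```python
def ptn(true_classes, predicted_classes, non_toxic="C3"):
    """Percentage of toxic compounds (any class other than `non_toxic`)
    that were predicted as non-toxic."""
    toxic_total = 0
    toxic_as_non_toxic = 0
    for true, predicted in zip(true_classes, predicted_classes):
        if true != non_toxic:
            toxic_total += 1
            if predicted == non_toxic:
                toxic_as_non_toxic += 1
    if toxic_total == 0:
        return 0.0
    return 100.0 * toxic_as_non_toxic / toxic_total


# Toy labels, invented for illustration only.
true_labels = ["C1", "C2", "C2", "C1", "C3"]
predicted_labels = ["C1", "C3", "C2", "C2", "C1"]
print(ptn(true_labels, predicted_labels))
```

Misclassifying a non-toxic compound as toxic does not affect PTN, which is why the measure is reported separately from overall accuracy.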
Conclusions
regardless of whether the performance of each original
classification method is good or bad, its counterpart
that integrates the available correlative information
obtained better performance.
in experiments on five toxicity datasets, the proposed
hybrid classification system obtained better
performance than each single
classifier-based model.
hybrid integration systems (IBL-HIS) reduced the
percentage of toxic chemicals being classified as
non-toxic chemicals
Acknowledgements
This work is part-funded by:
EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and
Processing Tool based on a Hybrid Intelligent Systems Approach
EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of
Environmental Modules for Evaluation of Toxicity of pesticide Residues in
Agriculture
http://www.demetra-tox.net
Special thanks also to:
http://pythia.inf.brad.ac.uk/
Dr. Q. Chaudhry (CSL York)
Dr. Mark Cronin (LJMU)
and PhD students:
Ms. Ladan Malazizi, BSc, PhD student
Mr. Paul Trundle, BSc, PhD student
Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
Ms. Areej Shhab, BEng, MPhil
Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in
pesticide risk assessment
Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
Mr. M. Craciun (University of Galati), BSc, MSc