Discovering Substructures in
Chemical Toxicity Domain
Master's Project Defense
by
Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering
University of Texas at Arlington
Outline
Chemical Toxicity Database
Motivation and Goal
Knowledge Discovery in Databases (KDD)
SUBDUE Knowledge Discovery System
Experiments with Unsupervised SUBDUE
Experiments with Supervised SUBDUE
Discussion of Results
Conclusions
Future Work
Chemical Toxicity Database
Carcinogenesis Prediction Problem
Toxicology Evaluation Challenge
Domain:
Compounds            +      -    Total
Training set       162    136      298
Experimental set    27     25       69
Motivation and Goal
Ever-increasing number of chemical compounds
Analysis is needed to obtain the structure-activity relationships of compounds
Determine SUBDUE’s applicability to the chemical toxicity domain
Knowledge Discovery in Databases (KDD)
Process of identifying valid, novel, potentially useful and understandable patterns in data
Goal of Knowledge Discovery:
Verification
Discovery
Data mining methods
Model Representation, Evaluation and Search
Steps in KDD
Identify the goal of the process
Collect, create and prepare the dataset
Select the data mining method
Select the data mining algorithm
Transform the data
Execute the algorithm
Interpret/evaluate the discovered patterns
Consolidate the knowledge discovered
SUBDUE Knowledge Discovery System
SUBDUE discovers patterns (substructures) in structural data sets
Vertices: objects or attributes
Edges: relationships
[Figure: example substructure in which an object with shape "triangle" is "on" an object with shape "square"; the input graph contains 4 instances of it]
SUBDUE - Input Representation
Each atom is represented as a vertex with directed edges to the name, type and partial charge of the atom
Bonds are represented as undirected edges
Each group is represented as a vertex having a string label specifying the group name, with directed edges to all participating atom vertices
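The encoding above can be sketched in a few lines of Python. This is an illustrative sketch only, not SUBDUE's actual input format: the `Graph` class and the helpers `add_atom`, `add_bond` and `add_group` are hypothetical names, the edge labels `n`, `t`, `p` and `gr` follow the legend used in the example figures, and giving each atom its own attribute-value vertices is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Labeled graph: vertex id -> label, plus (src, dst, label, directed) edges."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(atom_type), "t")
    g.add_edge(atom, g.add_vertex(charge), "p")
    return atom


def add_bond(g, atom_a, atom_b, bond_type):
    """Bond as an undirected edge between two atom vertices."""
    g.add_edge(atom_a, atom_b, bond_type, directed=False)


def add_group(g, group_name, atoms):
    """Group vertex with directed edges to all participating atom vertices."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


# Two bonded carbon atoms that take part in a methyl group.
g = Graph()
c1 = add_atom(g, "c", 10, 0.062)
c2 = add_atom(g, "c", 10, 0.063)
add_bond(g, c1, c2, 1)
methyl = add_group(g, "methyl", [c1, c2])
```

Keeping the directed flag on each edge lets bonds stay undirected while the attribute and group edges stay directed, as the slide describes.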
SUBDUE - Input Representation
Representation used in Unsupervised SUBDUE
A vertex with a string label specifying the alert, with directed edges to all the atoms in the compound
Representation used in Supervised SUBDUE
A vertex for each compound, with the string label "compound"
The compound vertex has directed edges to the vertices representing the activity of each alert on the compound
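Unsupervised SUBDUE Input Representation Example
[Figure: two Atom vertices joined by a bond edge labeled 1; each atom has n, t and p edges to its name (C), type (10) and partial charge (0.062, 0.063); a Methyl vertex has gr edges to the atoms and an Ames vertex has po edges to the atoms]
n - Name
t - Type
p - Partial charge
po - Positive
gr - group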
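Supervised SUBDUE Input Representation Example
[Figure: a Com vertex with "contains" edges to two Atom vertices joined by a bond edge labeled 1; each atom has n, t and p edges to its name (C), type (10) and partial charge (0.062, 0.063); a Methyl vertex has gr edges to the atoms; the Com vertex is also linked to an Ames vertex with value Positive]
n - Name
t - Type
p - Partial charge
gr - group
Com - Compound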
SUBDUE - Model Evaluation
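Minimum Description Length Principle
The best theory to describe a graph is the one that minimizes its description length
Minimize I(S) + I(G|S), where I(S) is the number of bits needed to encode substructure S and I(G|S) is the number of bits needed to encode graph G once it is compressed using S
Graph Compression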
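A rough sketch of how such an evaluation ranks candidate substructures is given below. It is not SUBDUE's bit-level encoding: as a simplifying assumption, description length is approximated by counting vertices plus edges, instances are assumed not to overlap, and each instance is assumed to collapse into a single vertex when the graph is compressed. All function names and the candidate numbers are hypothetical.

```python
def size(num_vertices, num_edges):
    """Stand-in for description length: vertices plus edges (not bits)."""
    return num_vertices + num_edges


def compressed_size(g_v, g_e, s_v, s_e, instances):
    """Approximate I(G|S): each instance of S collapses to one vertex."""
    vertices = g_v - instances * s_v + instances
    edges = g_e - instances * s_e
    return size(vertices, edges)


def total_description(g_v, g_e, s_v, s_e, instances):
    """Approximate I(S) + I(G|S) for one candidate substructure."""
    return size(s_v, s_e) + compressed_size(g_v, g_e, s_v, s_e, instances)


def best_substructure(g_v, g_e, candidates):
    """Pick the candidate that minimizes I(S) + I(G|S)."""
    return min(
        candidates,
        key=lambda c: total_description(g_v, g_e, c["v"], c["e"], c["instances"]),
    )


# Hypothetical graph with 40 vertices and 50 edges, and two candidates.
candidates = [
    {"name": "A", "v": 4, "e": 3, "instances": 4},
    {"name": "B", "v": 2, "e": 1, "instances": 6},
]
total_a = total_description(40, 50, 4, 3, 4)
total_b = total_description(40, 50, 2, 1, 6)
best = best_substructure(40, 50, candidates)
```

Under these assumptions the larger substructure A wins even though B has more instances, because replacing A removes more of the original graph per instance.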
Other important Concepts of SUBDUE
Inexact Graph Match Approach
Concept Learning
Predefined Substructures
Unsupervised SUBDUE Methodology
Training set further divided
3 approaches to determine carcinogenicity of compounds in experimental set
-- Apply SUBDUE individually to the compounds
-- Inclusion of pre-defined substructures
-- Check for matching of substructure in the compound to be classified
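The third approach amounts to testing whether a discovered substructure occurs in the graph of the compound to be classified. A minimal sketch of such a test follows, using exact label matching with simple backtracking. This is a stand-in: SUBDUE itself supports inexact graph matching, which is not modeled here. The function names, the `(labels, edges)` graph layout and the example compounds are all hypothetical.

```python
def bond(u, v, label):
    """An undirected edge, stored as two directed edges."""
    return {(u, v, label), (v, u, label)}


def contains_substructure(pattern, compound):
    """True if every pattern vertex and edge maps into the compound graph."""
    p_labels, p_edges = pattern
    c_labels, c_edges = compound
    order = list(p_labels)

    def consistent(mapping):
        # Every pattern edge with both endpoints mapped must exist in the compound.
        return all(
            (mapping[u], mapping[v], lbl) in c_edges
            for u, v, lbl in p_edges
            if u in mapping and v in mapping
        )

    def extend(i, mapping):
        if i == len(order):
            return True
        pv = order[i]
        for cv, c_label in c_labels.items():
            if c_label != p_labels[pv] or cv in mapping.values():
                continue
            mapping[pv] = cv
            if consistent(mapping) and extend(i + 1, mapping):
                return True
            del mapping[pv]
        return False

    return extend(0, {})


# Hypothetical substructure: a carbon atom bonded to a bromine atom.
pattern = (
    {1: "atom", 2: "c", 3: "atom", 4: "br"},
    {(1, 2, "n"), (3, 4, "n")} | bond(1, 3, 1),
)
compound_a = (
    {1: "atom", 2: "c", 3: "atom", 4: "br", 5: "atom", 6: "c"},
    {(1, 2, "n"), (3, 4, "n"), (5, 6, "n")} | bond(1, 3, 1) | bond(1, 5, 1),
)
compound_b = (
    {1: "atom", 2: "c", 3: "atom", 4: "c"},
    {(1, 2, "n"), (3, 4, "n")} | bond(1, 3, 1),
)
matches_a = contains_substructure(pattern, compound_a)
matches_b = contains_substructure(pattern, compound_b)
```

Backtracking of this kind is exponential in the worst case, which is acceptable for substructures of a handful of vertices but would need pruning for larger patterns.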
Unsupervised SUBDUE - Results
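[Figure: discovered substructure with two bonded atom vertices, a carbon (c) and a bromine (br), each with n, t and p edges; labels shown include types 10 and 3, partial charges 0.062 and 0.057, and bond label 1]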
Third approach used to classify compounds in experimental set
Accuracy level: 0.322
Cyanate and ether groups were also discovered to be indicators of carcinogenic activity
Supervised SUBDUE - Methodology
Create set of indicators of carcinogenic activity
Create set of indicators of noncarcinogenic activity
Calculate value of substructures discovered in carcinogenic and noncarcinogenic set
Select a set of substructures to be used in classifying compounds in experimental set
Supervised SUBDUE - Methodology
Check for the existence of these substructures in the compound to be classified
Calculate the Carcinogenic Activity Value of the compound
Calculate the NonCarcinogenic Activity Value of the compound
Determine the activity of the compound
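The slides do not give the formulas for the two activity values, so the following is only one plausible reading, sketched under stated assumptions: each selected substructure carries a numeric value, an activity value is the sum of the values of the substructures found in the compound, and the compound is labeled carcinogenic when the carcinogenic value is strictly larger. The substructure values, the tie-breaking rule and the function names are all hypothetical; only the group names come from the results slides.

```python
def activity_value(present, substructure_values):
    """Sum the values of the selected substructures found in the compound."""
    return sum(
        value for name, value in substructure_values.items() if name in present
    )


def classify(present, carcinogenic_values, noncarcinogenic_values):
    """Return the predicted label plus both activity values."""
    cav = activity_value(present, carcinogenic_values)
    ncav = activity_value(present, noncarcinogenic_values)
    # Assumption: ties are resolved as noncarcinogenic.
    label = "+" if cav > ncav else "-"
    return label, cav, ncav


# Hypothetical values for a few of the selected substructures.
carcinogenic_values = {"halide10": 2, "ames_positive": 3, "amino": 1}
noncarcinogenic_values = {"alkyl_halide": 2, "ames_negative": 3, "methoxy": 1}

first = classify(
    {"halide10", "ames_positive", "methoxy"},
    carcinogenic_values,
    noncarcinogenic_values,
)
second = classify(
    {"alkyl_halide", "ames_negative"},
    carcinogenic_values,
    noncarcinogenic_values,
)
```

The set of substructures present in a compound would come from a graph-matching step; here it is passed in directly to keep the scoring logic separate.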
Supervised SUBDUE - Results
A set of 12 substructures discovered by SUBDUE was used to classify compounds in the experimental set
6 substructures from the carcinogenic set: parts of groups such as amino, di10, methyl, ether and halide10, plus a substructure indicating that the compound tested positive on AMES, Salmonella, etc.
6 substructures from the noncarcinogenic set: parts of groups such as methoxy, Ar_Halide, di64, nitro and alkyl_halide, plus a substructure indicating that the compound tested negative on AMES, Salmonella, etc.
Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: a Compound vertex linked to Ames, Salmonella and Salmonella_n vertices, each with value positive]
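Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: two Atom vertices, a chlorine (Cl) and a carbon (C), each with n, t and p edges; labels shown include types 93 and 10 and partial charges -0.024 and -0.123; a Halide10 vertex has gr edges to both atoms]
n - Name
t - Type
p - Partial charge
gr - group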
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: a Compound vertex linked to Ames, Salmonella and Cytogen_ca vertices, each with value negative]
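Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: two Atom vertices, a chlorine (Cl) and a carbon (C), each with n, t and p edges; labels shown include types 93 and 10 and partial charges 0.477 and -0.124; an A-H vertex has gr edges to both atoms]
n - Name
t - Type
p - Partial charge
gr - group
A-H - Alkyl Halide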
Supervised SUBDUE - Results
PTE-1 Results:
Compounds               +     -    Total
PTE-1                  20    19       39
Correct Prediction     12     6       18
Incorrect Prediction    8    13       21
Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)
Supervised SUBDUE - Results
PTE-2 Results:
Compounds               +     -    Total*
PTE-2                   7     6       13
Correct Prediction      4     3        7
Incorrect Prediction    3     3        6
* : # of compounds whose activity is known
Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
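The accuracy figures for both challenges follow directly from the counts of correct predictions and compounds in each class. The short sketch below recomputes them to three decimals; the counts come from the PTE-1 and PTE-2 results tables, while the function and variable names are illustrative.

```python
def accuracy(correct, total):
    """Fraction of correct predictions, rounded to three decimals."""
    return round(correct / total, 3)


def summarize(per_class):
    """Per-class and overall accuracy from (correct, total) pairs."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    summary = {label: accuracy(c, t) for label, (c, t) in per_class.items()}
    summary["total"] = accuracy(correct, total)
    return summary


# (correct predictions, number of compounds) per class.
results = {
    "PTE-1": {"+": (12, 20), "-": (6, 19)},
    "PTE-2": {"+": (4, 7), "-": (3, 6)},
}

pte1 = summarize(results["PTE-1"])
pte2 = summarize(results["PTE-2"])
```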
Results - Discussion
Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity
Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity
ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
Ashby, TOPKAT are other toxicity prediction methods
Conclusions
Results are consistent with those obtained by logic-based systems such as PROGOL
Prefer the Concept Learner when positive and negative examples of the target concept are available
SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain.
Future Work
PTE-3 Evaluation Challenge
Trimmed Data Sets (Partial Charge)
Newer Version of Concept Learning SUBDUE being developed
Reference
http://cygnus.uta.edu/subdue