Toxicology application


Discovering Substructures in
Chemical Toxicity Domain
Masters Project Defense
by
Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering
University of Texas at Arlington
Outline
 Chemical Toxicity Database
 Motivation and Goal
 Knowledge Discovery in Databases (KDD)
 SUBDUE Knowledge Discovery System
 Experiments with Unsupervised SUBDUE
 Experiments with Supervised SUBDUE
 Discussion of Results
 Conclusions
 Future Work
Chemical Toxicity Database
 Carcinogenesis Prediction Problem
 Toxicology Evaluation Challenge
 Domain:
Compounds          +     -     Total
Training set       162   136   298
Experimental set   27    25    69
Motivation and Goal
 Ever-increasing number of chemical compounds
 Analysis is needed to obtain the structure-activity
relationships of a compound
 Determine SUBDUE’s applicability to chemical
toxicity domain
Knowledge Discovery in Databases (KDD)
 Process of identifying valid, novel, potentially
useful and understandable patterns in data
 Goal of Knowledge Discovery:
Verification
Discovery
 Data mining methods
 Model Representation, Evaluation and Search
Steps in KDD
 Identify the goal of the process
 Collect, create and prepare the dataset
 Select the data mining method
 Select the data mining algorithm
 Transform the data
 Execute the algorithm
 Interpret/evaluate the discovered patterns
 Consolidate the knowledge discovered
SUBDUE Knowledge Discovery System
SUBDUE discovers patterns (substructures) in
structural data sets
Vertices: objects or attributes
Edges: relationships
[Figure: example substructure in which an object with shape triangle is "on" an object with shape square; the figure notes 4 instances]
SUBDUE - Input Representation
 Each atom is represented as a vertex with
directed edges to the name, type and the partial
charge of the atom
 Bonds are represented as undirected edges
 Each group is represented as a vertex having a
string label specifying the group name with
directed edges to all participating atom
vertices
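As a minimal sketch of the representation described above, the following Python builds a small labeled graph for two bonded atoms that belong to one group. The `Graph` class, the helper names, and the edge labels `n`, `t`, `p`, `bond` and `gr` are illustrative assumptions, not SUBDUE's actual input file format.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical in-memory labeled graph (not SUBDUE's file format)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(atom_type), "t")
    g.add_edge(atom, g.add_vertex(charge), "p")
    return atom


def add_group(g, group_name, atoms):
    """Group vertex with directed edges to every participating atom vertex."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


g = Graph()
c1 = add_atom(g, "c", "10", "0.062")
c2 = add_atom(g, "c", "10", "0.063")
g.add_edge(c1, c2, "bond", directed=False)  # bonds are undirected edges
methyl = add_group(g, "methyl", [c1, c2])
```

Attribute values are stored as separate vertices so that a discovered substructure can include or omit any single attribute of an atom.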
SUBDUE - Input Representation
 Representation used in Unsupervised SUBDUE
A vertex having a string label specifying the
alert with directed edges to all the atoms in
the compound
 Representation used in Supervised SUBDUE
A vertex for each compound, with the string label
compound
The compound vertex has directed edges to all
the vertices representing the activity of an
alert on the compound
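Unsupervised SUBDUE Input Representation Example
[Figure: two Atom vertices, each with n, t and p edges (name C, type 10, partial charges 0.062 and 0.063), joined by a bond edge labeled 1; a Methyl vertex with gr edges to the atoms; an Ames vertex with po edges to the atoms]
Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - group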
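Supervised SUBDUE Input Representation Example
[Figure: two Atom vertices, each with n, t and p edges (name C, type 10, partial charges 0.062 and 0.063), joined by a bond edge labeled 1; a Methyl vertex with gr edges to the atoms; a Com vertex with contains edges to the atoms and an Ames edge to a Positive vertex]
Legend: n - Name, t - Type, p - Partial charge, gr - group, Com - Compound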
SUBDUE - Model Evaluation
 Minimum Description Length Principle
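Best theory to describe a graph is the one that minimizes its description length
Minimize I(S) + I(G|S), where I(S) is the number of bits needed to encode substructure S and I(G|S) is the number of bits needed to encode graph G given S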
 Graph Compression
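The compression idea can be illustrated with a rough sketch. It assumes that description length is approximated by graph size (vertices plus edges) rather than by a bit-level encoding, that instances of the substructure do not overlap, and that each instance collapses to a single vertex. The function names and the example numbers are hypothetical.

```python
def size(num_vertices, num_edges):
    """Size-based stand-in for description length (an assumption)."""
    return num_vertices + num_edges


def compressed_size(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Size of G after replacing each instance of S with one vertex."""
    vertices = g_vertices - instances * (s_vertices - 1)
    edges = g_edges - instances * s_edges  # edges inside instances disappear
    return size(vertices, edges)


def compression_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Ratio of the original size to size(S) + size(G given S)."""
    described = size(s_vertices, s_edges) + compressed_size(
        g_vertices, g_edges, s_vertices, s_edges, instances
    )
    return size(g_vertices, g_edges) / described


# Hypothetical graph: 20 vertices, 25 edges; S has 3 vertices, 2 edges, 4 instances.
value = compression_value(20, 25, 3, 2, 4)
```

Under these assumptions a value above 1 means the substructure shortens the description of the graph, so larger values indicate better substructures.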
Other important Concepts of SUBDUE
 Inexact Graph Match Approach
 Concept - Learning
 Predefined Substructures
Unsupervised SUBDUE Methodology

Training set further divided

3 approaches to determine carcinogenicity of
compounds in experimental set
-- Apply SUBDUE individually to the compounds
-- Inclusion of pre-defined substructures
-- Check for matching of substructure in the
compound to be classified
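The third approach, checking whether a discovered substructure occurs in a compound, can be sketched as a brute-force exact subgraph match. This is an illustration only: SUBDUE itself uses an inexact graph match, and the tuple-based graph encoding and the example compound below are assumptions.

```python
from itertools import permutations


def contains_substructure(graph, sub):
    """Return True if `sub` occurs in `graph` with matching labels.

    Each graph is (labels, edges): labels maps vertex id -> label and
    edges is a set of directed (src, dst, label) triples. An undirected
    bond would have to be stored in both directions.
    """
    g_labels, g_edges = graph
    s_labels, s_edges = sub
    s_ids = list(s_labels)
    # Try every injective assignment of substructure vertices to graph vertices.
    for candidate in permutations(g_labels, len(s_ids)):
        mapping = dict(zip(s_ids, candidate))
        labels_ok = all(g_labels[mapping[v]] == s_labels[v] for v in s_ids)
        edges_ok = all(
            (mapping[a], mapping[b], lab) in g_edges for a, b, lab in s_edges
        )
        if labels_ok and edges_ok:
            return True
    return False


# Hypothetical compound: an atom named c bonded to an atom named br.
compound = (
    {1: "atom", 2: "c", 3: "atom", 4: "br"},
    {(1, 2, "n"), (3, 4, "n"), (1, 3, "bond")},
)
bromine = ({"a": "atom", "b": "br"}, {("a", "b", "n")})
chlorine = ({"a": "atom", "b": "cl"}, {("a", "b", "n")})

found = contains_substructure(compound, bromine)     # True
missing = contains_substructure(compound, chlorine)  # False
```

The search is exponential in the size of the substructure, which is acceptable for the small substructures shown here but not for large ones.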
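Unsupervised SUBDUE - Results
[Figure: discovered substructure with two atom vertices, named c and br, carrying type values 10 and 3 and partial charges 0.062 and 0.057, joined by a bond edge labeled 1]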
 Third approach used to classify compounds in
experimental set
 Accuracy Level -> 0.322
 Cyanate & ether groups are also discovered to
be indicators of carcinogenic activity
Supervised SUBDUE - Methodology
 Create set of indicators of carcinogenic activity
 Create set of indicators of noncarcinogenic
activity
 Calculate value of substructures discovered in
carcinogenic and noncarcinogenic set
 Select a set of substructures to be used in
classifying compounds in experimental set
Supervised SUBDUE - Methodology
 Check for the existence of these substructures in
the compound to be classified
 Calculate the Carcinogenic Activity Value of the
compound
 Calculate the NonCarcinogenic Activity Value of
the compound
 Determine the activity of the compound
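The classification steps can be sketched as follows. The slides do not say how the two activity values are computed, so this sketch assumes each value is the sum of the values of the matched substructures from the corresponding set, and that the larger value decides the class. The substructure names and their values are hypothetical.

```python
def activity_value(matched, substructure_values):
    """Sum the values of the substructures found in the compound (assumed rule)."""
    return sum(
        value for name, value in substructure_values.items() if name in matched
    )


def classify(matched, carcinogenic_values, noncarcinogenic_values):
    """Compare the two activity values; ties go to 'negative' (arbitrary choice)."""
    carcinogenic = activity_value(matched, carcinogenic_values)
    noncarcinogenic = activity_value(matched, noncarcinogenic_values)
    return "positive" if carcinogenic > noncarcinogenic else "negative"


# Hypothetical substructure values for the two indicator sets.
carcinogenic_set = {"halide10": 1.2, "ames_positive": 1.5}
noncarcinogenic_set = {"alkyl_halide": 1.1, "ames_negative": 1.4}

# Substructures assumed to have been found in two compounds.
first = classify({"halide10", "ames_positive", "alkyl_halide"},
                 carcinogenic_set, noncarcinogenic_set)
second = classify({"ames_negative"}, carcinogenic_set, noncarcinogenic_set)
```

Here `first` is classified as positive because its carcinogenic indicators outweigh the single noncarcinogenic one, and `second` as negative.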
Supervised SUBDUE - Results
 A set of 12 substructures discovered by SUBDUE is used to
classify compounds in the experimental set
 6 substructures from the carcinogenic set: substructures
that form part of groups such as amino, di10, methyl, ether
and halide10, and a substructure indicating that the
compound tested positive on Ames, Salmonella, etc.
 6 substructures from the noncarcinogenic set: substructures
that form part of groups such as methoxy, Ar_Halide, di64,
nitro and alkyl_halide, and a substructure indicating that
the compound tested negative on Ames, Salmonella, etc.
Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: a Compound vertex with Ames, Salmonella and Salmonella_n edges, each leading to a positive vertex]
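Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: two Atom vertices, named Cl and C, carrying type values 10 and 93 and partial charges -0.024 and -0.123; a Halide10 vertex with gr edges to the atoms]
Legend: n - Name, t - Type, p - Partial charge, gr - group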
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: a Compound vertex with Ames, Salmonella and Cytogen_ca edges, each leading to a negative vertex]
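Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: two Atom vertices, named Cl and C, carrying type values 10 and 93 and partial charges 0.477 and -0.124; an A-H vertex with gr edges to the atoms]
Legend: n - Name, t - Type, p - Partial charge, gr - group, A-H - Alkyl Halide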
Supervised SUBDUE - Results
 PTE-1 Results:
Compounds              +     -     Total
PTE-1                  20    19    39
Correct Prediction     12    6     18
Incorrect Prediction   8     13    21
 Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)
Supervised SUBDUE - Results
 PTE-2 Results:
Compounds              +     -     Total*
PTE-2                  7     6     13
Correct Prediction     4     3     7
Incorrect Prediction   3     3     6
* : number of compounds whose activity is known
 Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
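The accuracy figures follow directly from the prediction counts. As a small check, the sketch below recomputes them from the correct-prediction and compound counts reported for PTE-1 and PTE-2, rounded to three decimals. The function and variable names are illustrative.

```python
def summarize(results):
    """Per-class and overall accuracy from (correct, total) counts."""
    summary = {
        label: round(correct / total, 3)
        for label, (correct, total) in results.items()
    }
    all_correct = sum(correct for correct, _ in results.values())
    all_total = sum(total for _, total in results.values())
    summary["total"] = round(all_correct / all_total, 3)
    return summary


# (correct predictions, number of compounds) per class, from the tables above.
pte1 = {"+": (12, 20), "-": (6, 19)}
pte2 = {"+": (4, 7), "-": (3, 6)}

pte1_accuracy = summarize(pte1)  # {'+': 0.6, '-': 0.316, 'total': 0.462}
pte2_accuracy = summarize(pte2)  # {'+': 0.571, '-': 0.5, 'total': 0.538}
```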
Results - Discussion
 Unsupervised SUBDUE successful in discovering
lead indicators of carcinogenic activity
 Supervised SUBDUE also successful in
discovering lead indicators of carcinogenic
activity
 ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
 Ashby, TOPKAT are other toxicity prediction
methods
Conclusions
 Consistent with results obtained by logic based
systems like PROGOL
 Prefer to use Concept Learner when positive and
negative examples of the target concept are available
 SUBDUE is capable of discovering lead
indicators of carcinogenic/noncarcinogenic
activity in the chemical toxicity domain
Future Work
 PTE-3 Evaluation Challenge
 Trimmed Data Sets (Partial Charge)
 Newer Version of Concept Learning SUBDUE
being developed
Reference
http://cygnus.uta.edu/subdue