Data Mining in DNA: Using the SUBDUE Knowledge Discovery

Transcript Data Mining in DNA: Using the SUBDUE Knowledge Discovery

Data Mining in DNA: Using
the SUBDUE Knowledge
Discovery System to Find
Potential Gene Regulatory
Sequences
by
Ronald K. Maglothin
Committee Members
• Dr. Lawrence B. Holder, Supervisor
• Dr. Diane J. Cook
• Dr. Lynn L. Peterson
Outline
• DNA Sequence Domain
• SUBDUE Knowledge Discovery System
• Experiments with Unsupervised
SUBDUE
• Experiments with Supervised SUBDUE
• Conclusion and Future Work
DNA Structure
• All cells use DNA to store their
genetic information.
• A DNA molecule is composed of two
linear strands coiled in a double
helix.
• Each strand is made of the bases
adenine (A), thymine (T), cytosine
(C), and guanine (G), joined in a
linear sequence.
DNA Sequence
• These four bases constitute a fourletter alphabet that cells use to store
genetic information.
• Molecular biologists can break up a
DNA molecule and determine its base
sequence, which can be stored as a
character string in a computer:
TTCAGCCGATATCCTGGTCAGATTCTCT
AAGTCGGCTATAGGACCAGTCTAAGAGA
Genes
• A gene is a DNA sequence that
encodes instructions for building a
protein.
• Gene expression is the process of
using a gene to make a protein:
DNA
gene
transcription
RNA
transcript
translation
Protein
product
Gene Regulation
• Primary mechanism is to control the
rate of DNA transcription:
– Faster transcription
more protein
– Slower transcription
less protein
• Transcription rate is controlled by
transcription factors, which are
proteins which bind to specific DNA
sequences.
Human Genome Project
• A U.S.-led, worldwide effort to
determine the complete DNA
sequence for humans, as well as
several other organisms.
• These sequences will be used to
study:
– mechanisms of disease
– growth and development
– evolutionary relationships
A Genome is a LOT of Data
• Raw sequence (text)
– Human (2005): 3 x 10 9 base pairs
– Yeast (finished): 1.2 x 107 base pairs
• Annotated sequence (Relational DB)
– Links to 3D structures of protein
products, other genes in family, known
transcription factors, journal references,
and other databases.
A Rich Domain for
Knowledge Discovery
• Most of the sequences (and genes)
have unknown function.
• Efficient algorithms are needed to:
– identify important patterns
– identify and classify possible genes
– infer relationships between genes
– predict protein structure
The SUBDUE Knowledge
Discovery System
• Input: A graph G
• Output: A list of substructures that
compress G well
• Uses a computationally-constrained
beam search and inexact graph
match
What is a substructure?
• A definition subgraph and a list of
subgraph instances :
Input Graph
A
1
next
T
next
2
A
3
next
C
next
4
A
next
5
T
6
Substructure
Definition
A
next
T
Instances
A
next
1
A
5
T
2
next
T
6
next
G
7
MDL Heuristic
• SUBDUE uses the Minimum
Description Length Principle to
evaluate substructures.
• Description Length of a graph is the
number of bits needed to send the
graph’s adjacency matrix to a remote
computer.
• Goal is to minimize DL(S) + DL(G|S).
SUBDUE Parameters
• Iterations: Graph is compressed using
the best substructure, discovery is
restarted
• Threshold: Controls how much two
subgraphs can differ to be considered
similar
• Beam Width: The number of
substructures in the expansion list
Graph Representations
• Simple linear
A
next
C
next
A
next
T
next
G
1
G
• Downstream edges
4
3
2
A
1
C
3
2
1
A
2
1
T
Graph Representations
• Start vertex
5
4
3
A
next
C
next
A
2
next
next
T
1
G
Start
• Backbone
base
next
name
A
base
next
name
C
base
next
name
A
base
next
name
T
base
name
G
Graph Representations
• Backbone-star
*
star
base
next
name
A
star
base
star
next
name
C
star
base
next
name
A
star
base
next
name
T
base
name
G
Unsupervised SUBDUE
• Input: An entire yeast chromosome
A
next
C
next
A
next
T
next
G
• Heuristic:
1
Value 
DL(S)  DL(G | S)
• Results: Not good; patterns with two
to three bases
Polynomial Heuristic
Value  SizeOf(Definition)2 
NumberOfInstances
Threshold  0.2
Pattern
Instances
TTTTTTTTTTTG
196
AAATTTTTTATT
158
TTTTTTTTTTGC
158
TTTTAATTTTTT
155
GAAATTTTTTAA
144
Unsupervised SUBDUE Discussion
• Random noise is not a meaningful
kind of pattern variation in DNA.
• Unsupervised SUBDUE finds DNA
patterns that are hard to evaluate
and that are not focused on any
target concept.
• We need to give SUBDUE more
targeted input data and to modify the
system to use it effectively.
Supervised SUBDUE
• Give SUBDUE two graphs: a graph of
positive instances of a target concept,
and a graph of negative instances.
• SUBDUE discovers substructures in
the positive graph, finds instances in
the negative graph, and bases the
overall heuristic value on the values
in both graphs.
New Data Sets
• Clusters of coexpressed yeast genes
compiled by Brazma et al., from
expression data generated by DeRisi
et al.
• The expression level of each gene in
a cluster changed at the same time
and by a similar degree during the
experiment; perhaps some genes in
a cluster are regulated by similar
mechanisms?
New Data Sets
• Positive examples:
– 300-bp upstream windows (both strands)
for all genes in a given cluster
• Negative examples:
– 300-bp upstream windows for genes not
in the cluster, OR
– 300-bp windows randomly selected from
the complete genome (probably not
involved in gene regulation)
Supervised Heuristic
• Based on the substructure’s values in
the positive and negative graphs
Value  Value  / Value 
1
1
/
DL(S)  DL(G  | S) DL(S)  DL(G - | S)
DL(S)  DL(G - | S)

DL(S)  DL(G  | S)
• Numerator set to 1 when no
negative instances
Compression Ratio
• Normalize the graph values by using
the inverse of the graph compression
Value 
DL(G  )
DL(S)  DL(G  | S)
DL(G  )
DL(S)  DL(G  | S)
Negative Graph Value
DL(S)  DL(G- | S)
Value 
DL(S)  DL(G  | S)
• When there are no negative instances,
setting numerator to 1 actually
penalizes such substructures.
• Using 2 x DL(G-) in this situation gave
better results.
Ratio Heuristic Results
Cluster
Best Pattern
Instances
c2_4.2222200.39
CCCCTTA
7
c2_4.2222201.41
ATATAATA
10
c2_4.2222210.37
GATATATA
6
cr2_4.222202.55 ATATATATATATAT
cr4.111101.77
CCCCTTA
6
10
Concept DL Heuristic
• Based on the size of a message
containing the compressed positive
graph, plus the errors (negative
instances).
DL(S)  DL(G | S)  DL(G ) - DL(G | S)
Value  - (DL(S)  DL(G | S) - DL(G- | S))
Concept DL Heuristic Results
• Relative graph size affected results
Cluster
Best Pattern
Instances
c2_4.2222200.39
AAAAAA
53
c2_4.2222201.41
AAAAAA
41
c2_4.2222210.37
AAAAAA
38
cr2_4.222202.55
AAAAAA
66
cr4.111101.77
ATATAA
57
Backbone Representation
base
next
name
A
base
next
name
C
base
next
name
A
base
next
name
T
base
name
G
• “Base” vertices allowed don’t-care
positions, but heuristic had to be
changed to accommodate them.
• Overlap became very important.
DL Equations
DL(G)  vbits  rbits  ebits
vbits  lgv  v lg lv
v
v
rbits  (v  1) lg(b  1)   lg
k
i

i 1
ebits  e (1  lg l e )  (K  1) lg m



Negative Graph Value
• Using 2 x DL(G-) for no negative
instances favored such substructures
too strongly.
DL(G- )
Value - 
DL(S)  DL(G- | S)
DL(G- )

DL(S)  DL(G- )
where lv   lv  1
Compression Difference
Heuristic
• Use subtraction with the compression
values instead of division.
Value  Value   Value DL(G  )


DL(S)  DL(G  | S)
DL(G  )
DL(S)  DL(G  | S)
Results
Cluster cr4.111101.77
Pattern
N+
N-
TRANSFAC matches
ATCCAT
16
12
GGGA.G.A
19
16
MOUSE(3), HS(4), RAT(2),
HCMV(1), RICE(1)
MOUSE(2), HS(3), RABBIT(1)
TCCCT
65
35
Y$G3PDH_01
AAGGG
95
37
CCCT
128
76
CAMV(2), RAT(9), AD(3),
DROME(8), MOUSE(14),
HS(30), DROOR(1), PCF(1),
HPV(1), CHICK(2),
RABBIT(1), EBV(1)
Y$BCY_01, Y$GAL1_04,
Y$X40_01, Y$CYC1_12,
Y$GAL1_14, Y$G3PDH_01,
Y$POX1_01, Y$DDR2_01,
Y$DDR2_02, Y$TPI_02
Results
Cluster c2_4.2222200.39
Pattern
N+
N-
TRANSFAC matches
CATCC.T
6
10
T.CTGCT
13
13
AGGGA
30
35
GCTGC.G
2
10
Y$RP51A_01, Y$RPL16A_01,
Y$FBP1_01
DROME(2), HS(8),
MOUSE(2), RAT(2)
Y$HIS3_03, Y$STE6_01,
Y$STE6_02, Y$DAL4_01
Y$CTT1_02
111
202
TTGC
Y$TEF2_01, Y$HMR_02,
Y$CUP1_01, Y$CUP1_02,
Y$MAL61_01, Y$URA3_04,
Y$CYB2_02, Y$ARS1_06,
Y$DAL7_01, Y$DAL7_02,
Y$PGK_03
Results of Brazma et al.
Cluster c2_4.2222200.39
Pattern
N+
TRANSFAC matches
CCCCT..T
27
A..AGGGG
27
Y$DDR2_01, Y$DDR2_02,
Y$TPI_02
none
GGGGC
27
GCCCC
27
G..GGGG
28
Y$GAL2_02, Y$SUC2_02,
Y$RRNA_01, Y$ERG11_01
Y$CYB2_02
Y$CYC1_04, Y$CYC1_05,
Y$CYC1_06
Brazma Heuristic
• Based on number of positive and
negative instances
N
Value 
N  4.667
SUBDUE Using Brazma
Heuristic
Cluster c2_4.2222200.39
Pattern
N+
N-
TRANSFAC matches
CCCCT.AT
10
0
Y$DDR2_02
AT.AGGGG
10
0
CHICK$VIT2_18
A…GGGGG
10
2
Y$SUC2_02
CCCC..CT
14
4
Y$GAL3_01, Y$MAL2R_01
CCGGG.T
5
0
Y$CYC1_04, Y$CYC1_05,
Y$CYC1_06, Y$MAL63_01
Conclusion
• SUBDUE can be used to discover
likely transcription factor binding
sites.
• Patterns found by SUBDUE are
different from those found by stringbased algorithms, due to the graph
representation, beam search, and
different search heuristic.
Conclusion
• Patterns found by unsupervised
SUBDUE in DNA are difficult to
evaluate.
• Using supervised SUBDUE can
greatly focus the search on the
target concept.
• Choosing the right graph
representation and heuristic are
critical to success.
Future Work
• Further refinement of the supervised
MDL heuristic.
• Application of graph grammar theory
to SUBDUE’s search.
• Close collaboration with molecular
biologists to select data sets and
evaluate results.

Data Mining in DNA: Using the SUBDUE Knowledge Discovery

Transcript Data Mining in DNA: Using the SUBDUE Knowledge Discovery

Directory