Efficient and Accura..
Download
Report
Transcript Efficient and Accura..
Efficient and Accurate Discovery of
Patterns in Sequence Datasets
Avrilia Floratou, SandeepTata, and Jignesh M. Patel
ICDE 2010
Outline
Motivation
A powerful new model
FLAME(FLexible and Accurate Motif Detector) algorithm
Experimental Results
Conclusion
2
2010/5/27
Motivation
Existing sequence mining algorithms mostly focus on mining
for subsequences(non-contiguous).
For instance, assume that a sequence is “a, b, a, c, b, a, c”, the
sequence “a, b, b, c” is a subsequence constructed by choosing
the 1st, 2nd, 5th, and 7th.
“Are there any frequently recurring patterns in this time
series dataset?”
The recurring subsequences are similar, but not identical.
It allow for some noise.
Approximate subsequence mining problem.(contiguous)
3
2010/5/27
(Cont.)
Computational biology
To detect short sequences, usually of length 6-15, that occur
frequently in a given set of DNA or protein sequences.
We call these frequently occurring patterns as motifs.
Goal
To present a new model for approximate motif mining in many
different domains.
To present FLAME algorithm to efficiently find motifs that
satisfy these model.
4
2010/5/27
The model
(L, M, s, k) model
L is the length of the motif.
M is a distance matrix that is used to compute the similarity
between two strings.
s is the maximum distance threshold within which two strings
are considered similar.
k is the min_sup.
If
at least k strings T1,…, Tk in the DB each of them is
length L, and d(S,Ti) s, where d ( A , B ) M ( a , b ) is a
distance function. Then, a string S is an (L, M, s, k) motif.
n
i 1
5
i
i
2010/5/27
(Cont.)
Protein motif mining
Some amino acids are very similar to each other, while some are
very different.
Because Alanine and Valine are both hydrophobic, the matrix can be used
to award a small penalty for M(X,Y) when X and Y are similar.
When Alanine is matched to Glycine, the large penalty is awarded.
Because Glycine is a hydrophilic amino acid.
(L, d, k) model
(L, f, d, k) model
6
2010/5/27
(L, d, k) model
(L, d, k) model is a mismatch based model for finding DNA
motifs.
The distance measure between two strings is the Hamming
distance, M(X,Y)=1 if X Y and M(X,Y)=0 if X=Y
The signature is usually a short string of DNA 6-15 bases long.
These signatures are seldom identical and differ in a few
positions.
For instance
{ABCD, ACCD, ABCA} if ABCD is the model sequence, the other two
sequences are within one mismatch of the motif, so these sequences
would constitute a (4,1,3) motif.
7
2010/5/27
(L, f, d, k) model
This model
builds
on themotif
(L, d, k) model to include positional
(8,1,0,5)
motif
or (8,1,5)
constraints
onk)the
mismatches.
Use
the (L, d,
model
to retrieve the pattern, we will end up
Tomany
specify
the number
of fixed-position
mismatches.
with
extraneous
hits that
might not be meningful.
{ABCD, ACCD, ADCD}, this set forms a (4,1,3) motif, but the
mismatches, whenever they occur, are always in position two
(AcCD, AdCD).
The advantage of this model
Consider a DNA DB consisting of 5 sequences, each of length
500.
Assume that each sequence has in it the motif GTGAACAC,
and each instance of the motif has a mismatch at the fifth
position.
8
2010/5/27
FLAME algorithm
It first construct a data suffix tree and model suffix tree.
To traversing the nodes of the model space in depth first
order.
Using two subroutines
Evaluate_support: to compute the list of matches and the
support.
Expand_Matches: to ensure that the number of mismatches to
the model string does not exceed d.
9
2010/5/27
(Cont.)
The data suffix tree:
The model suffix tree:
On the set of all possible model srtings.
To help guide the exploration of the model space in a way that
avoids redundant work.
10
2010/5/27
(Cont.)the strategy of pruning the
model suffix tree
Assume that the dataset consists of sequences over the
alphabet {A, B, C, D, E}
All the strings of length L starting with the symbol A form a
subset of the model space.
…
A
AA
AB
AC
AD
AE
…
11
2010/5/27
(Cont.)
No mismatches are allowed.
AA
AB
…
k
A
AC
AD
AE
…
12
2010/5/27
(Cont.)
As mismatches are allowed.
A
AA
AB
…
B
AC
AD
AE
…
13
2010/5/27
Experimental results
14
2010/5/27
(Cont.)
15
2010/5/27
Conclusion
This paper presented a powerful new model: (L, M, s, k)
It also presented FLAME to find (L, M, s, k) motifs
16
2010/5/27