CSE591 Data Mining

Download Report

Transcript CSE591 Data Mining

7. Sequence Mining
Sequences and Strings
Recognition with Strings
MM & HMM
Sequence Association Rules
Spring 2003
Data Mining by H. Liu, ASU
1
Sequences and Strings
• A sequence x is an ordered list of discrete items, such
as a sequence of letters or a gene sequence
– Sequences and strings are often used interchangeably
– String elements (characters, letters, or symbols) are nominal
– A particularly long string is called text
• |x| denotes the length of sequence x
– |AGCTTC| is 6
• Any contiguous string that is part of x is called a
substring, segment, or factor of x
– GCT is a factor of AGCTTC
Spring 2003
Data Mining by H. Liu, ASU
2
Recognition with Strings
• String matching
– Given x and text, determine whether x is a factor of text
• Edit distance
– Given two strings x and y, compute the min number of
basic operations (character insertions, deletions and
exchanges) needed to transform x into y
Spring 2003
Data Mining by H. Liu, ASU
3
String Matching
• Given |text| >> |x|, each discrete character is taken
from an alphabet A
– A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…}
• A shift s is an offset needed to align the first
character of x with character number s+1 in text
• Find if there exists a valid shift where there is a
perfect match between each character in x and the
corresponding one in text
Spring 2003
Data Mining by H. Liu, ASU
4
Naïve String Matching
• Given alphabet A, x, text, n = |text|, m = |x|
s=0
while s ≤ n-m
if x[1 …m] = text [s+1 … s+m]
then print “pattern occurs at shift” s
s=s+1
• Time complexity (worst case): O((n-m+1)m)
• One character shift at a time is not necessary
Spring 2003
Data Mining by H. Liu, ASU
5
Boyer-Moore String Matching
• Given A, x, text, n = |text|, m = |x|
F(x) = last-occurrence function
G(x) = good-suffix function; s = 0
while s ≤ n-m
j=m
while j>0 and x[j] = text [s+j]
j = j-1
if j = 0
then print “pattern occurs at shift” s
s = s + G(0)
else s = s + max[G(j), j-F(text[s+j0])]
Spring 2003
Data Mining by H. Liu, ASU
6
Edit Distance
• ED between x and y describes how many fundamental
operations are required to transform x to y.
• Fundamental operations (x=‘excused’, y=‘exhausted’)
– Substitutions, ‘c’ is replaced by ‘h’
– Insertions, ‘a’ is inserted into x after ‘h’
– Deletions, a character in x is deleted
• ED is one way of measuring similarity between two
strings
Spring 2003
Data Mining by H. Liu, ASU
7
Classification using ED
• Nearest-neighbor algorithm can be applied for
pattern recognition.
– Training: data of strings with their class labels stored
– Classification (testing): a test string is compared to each
stored string and an ED is computed; the nearest stored
string’s label is assigned to the test string.
• The key is how to calculate ED.
• An example of calculating ED
Spring 2003
Data Mining by H. Liu, ASU
8
Hidden Markov Model
•
•
•
•
•
Markov Model: transitional states
Hidden Markov Model: additional visible states
Evaluation
Decoding
Learning
Spring 2003
Data Mining by H. Liu, ASU
9
Markov Model
• The Markov property:
– given the current state, the transition
probability is independent of any
previous states.
• A simple Markov Model
– State ω(t) at time t
– Sequence of length T:
• ωT = {ω(1), ω(2), …, ω(T)}
– Transition probability
• P(ω j(t+1)| ω i(t)) = aij
– It’s not required that aij = aji
Spring 2003
Data Mining by H. Liu, ASU
10
Hidden Markov Model
• Visible states
– VT = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t)
– P(v k(t)| ω j(t)) = bjk
• Only visible states vk (t) are
accessible and states ωi (t) are
unobservable.
• A Markov model is ergodic if
every state has a nonzero prob of
occuring give some starting state.
Spring 2003
Data Mining by H. Liu, ASU
11
Three Key Issues with HMM
• Evaluation
– Given an HMM, complete with transition probabilities aij
and bjk. Determine the probability that a particular sequence
of visible states VT was generated by that model
• Decoding
– Given an HMM and a set of observations VT. Determine
the most likely sequence of hidden states ωT that led to VT.
• Learning
– Given the number of states and visible states and a set of
training observations of visible symbols, determine the
probabilities aij and bjk.
Spring 2003
Data Mining by H. Liu, ASU
12
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using
Equivalence classes)
• Constrained sequence mining (SPIRIT)
Spring 2003
Data Mining by H. Liu, ASU
13
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork, 2001.
Pattern Classification. 2nd Edition. Wiley
Interscience.
Spring 2003
Data Mining by H. Liu, ASU
14