Finding Frequent Patterns


Algorithms for Discovering Patterns in Sequences
Raj Bhatnagar
University of Cincinnati
Outline
• Applications of Sequence Mining
– Genomic Sequences
– Engineering/Scientific Data Analysis
– Market Basket Analysis
• Algorithm Goals
– Temporal Patterns
– Interesting Frequent Subsequences
– Interesting Frequent Substrings (with mutations)
– Detection of Outlier Sequences
Why Mine a Dataset?
Discover Patterns:
– Associations
• A subsequence pattern follows another
• Help make a prediction
– Clusters
• Typical modes of operation
– Temporal Dependencies
• Events related in time
– Spatial Dependencies
• Events related in space
Example Problems
From various domains:
– Engineering
– Scientific
– Genomic
– Business
(Example figures: Energy Consumption; Gene Expression over Time)
• Multivariate time series data is a series of multiple attributes
observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor
monitoring and bioinformatics
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Problem: Phenomena in space-time
Issues: Simultaneous handling of space and time dimensions; how to handle the large complexity of the algorithms
Discover Interesting Subsequences
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
The blue pattern sequence (CTGATTCCAATACAG) may have up to k substitutions
Main Sequence Mining Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Finding Frequent Substrings
Two Main Approaches:
• Generalized Suffix Trees (Linear Time)
• Find Longest Common Substrings (Linear Time)
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
(Character positions 0-39 run consecutively across the five strings.)
The GST generated considering the substrings starting at location #0 of each string is:
(0) <root> - (8) ab - (10) c - (11) dobd - (15) 8
                    - (35) eoad - (39) 32
           - (18) a - (19) aebd - (23) 16
- (3) cebc - (7) 0
- (24) eoaoobd - (31) 24
s      size(s)   pos   set P          |P|
aba    3         0     {0, 2}         2
abc    3         0     {1, 4}         2
ab     2         0     {0, 1, 2, 4}   4
Finding Frequent Substrings
• Recursively generate the GST with each string
truncated by removing the first character
Original strings:
abacebc . abcdobd . abaaebd . eoaoobd . abceoad .
Truncated strings:
bacebc . bcdobd . baaebd . oaoobd . bceoad .
(0) <root> - (8) b - (10) dobd - (14) 8
                   - (31) eoad - (35) 29
          - (16) a - (17) aebd - (21) 15
- (3) cebc - (7) 1
- (22) oaoobd - (28) 24
s     |s|   pos   set P          |P|
ba    2     1     {0, 2}         2
bc    2     1     {1, 4}         2
b     1     1     {0, 1, 2, 4}   4
Results after Phase-I
• Substrings generated after Phase-I are:
Phase 2: Subsequence Hypotheses
Result of Phase 1:
string    size(s)   profiles(G)   size(G)
ABa---    3         {0, 2}        2
AB----    2         {0, 1, 2}     3
--a---    1         {0, 2, 3}     3
---*bd    3         {1, 3}        2
----bd    2         {1, 2, 3}     3

Result after Phase 2:
subsequence   size(T)   profiles(R)   size(R)
AB---bd       4         {1, 2}        2
--a--bd       3         {2, 3}        2
Merging Profile Sets
• Two sets of profiles are similar and will be
merged together if:
Size(Intersection(P1, P2)) / Size(Union(P1, P2)) ≥ threshold
where P1 and P2 are any two profile sets and the threshold is user-defined
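A minimal sketch of this merge test (Python; the greedy one-pass merge strategy and the example threshold are illustrative assumptions, not prescribed by the lecture):

def jaccard(p1, p2):
    # Ratio of shared profiles to all profiles in either set
    return len(p1 & p2) / len(p1 | p2)

def merge_similar(profile_sets, threshold=0.5):
    # Greedy one-pass merge: fold each set into the first existing
    # group whose Jaccard similarity meets the threshold
    merged = []
    for p in profile_sets:
        for i, q in enumerate(merged):
            if jaccard(p, q) >= threshold:
                merged[i] = q | p
                break
        else:
            merged.append(set(p))
    return merged

# The Phase-2 profile sets {1, 2} and {2, 3} share one of three profiles:
print(merge_similar([{1, 2}, {2, 3}], threshold=0.3))  # -> [{1, 2, 3}]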
Merged Substrings
Generalizing Substring Patterns
• Core subsequences can be written in the form of a regular expression (regex)
• Each symbol is considered replaceable by the preceding or following letter of the alphabet
E.g., for the substring aba-eb--, the regex will be:
[ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{2}
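A sketch of that generalization step (Python, assuming a lowercase alphabet; collapsing the trailing .{1}.{1} into .{2} is omitted for brevity):

import string

def generalize(pattern):
    # Each letter may also match its alphabetic neighbours;
    # '-' (don't-care) positions become '.'
    letters = string.ascii_lowercase
    parts = []
    for ch in pattern:
        if ch == '-':
            parts.append('.{1}')
        else:
            i = letters.index(ch)
            cls = letters[max(i - 1, 0):i + 2]  # preceding, self, following
            parts.append('[%s]{1}' % cls)
    return ''.join(parts)

print(generalize('aba-eb--'))
# -> [ab]{1}[abc]{1}[ab]{1}.{1}[def]{1}[abc]{1}.{1}.{1}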
Generalization of Sequence Patterns
• Hypothesis T specifies a core subsequence shared by the set of profiles R
• Core subsequences and the set R are affected by various factors, such as loss of information
• T is considered the seed of the hypothesis
Cluster #1 from Utility Dataset
Cluster #2 from Utility Dataset
What is Achieved
• Discover:
– Partial-sequence temporal hypotheses
– Identities of the profiles for each temporal hypothesis
Histone Cluster
ORF      Process              Function     Peak   Phase Order   Cluster Order
YBL002W  chromatin structure  histone H2B  S      424           789
YBL003C  chromatin structure  histone H2A  S      441           790
YBR009C  chromatin structure  histone H4   S      417           791
YBR010W  chromatin structure  histone H3   S      420           795
YDR224C  chromatin structure  histone H2B  S      432           794
YDR225W  chromatin structure  histone H2A  S      437           796
YNL030W  chromatin structure  histone H4   S      418           792
YNL031C  chromatin structure  histone H3   S      430           793
YPL127C  chromatin structure  histone H1   S      448           788
Fourier Results
ORF       Symbolic representation
YBL002W   CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBa
YBL003C   CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAa
YBR009C   xC*xddcaCCBBaab*BCCB*axcCCBbccccbCBCBBa*bbxaABbaB*AaA*bxdBaBBBB
YBR010W   BCaAccc*CBBAbbc*BCCBAabbCCB*ccdbbCBCCCAAxbcaABbbA*Ab*bbAAxaBBB*
YDR224C   BCaaccc*CCB*abbaBBCA*bcbCCB*cccbbCCCCCAAcccaABcaBAAaabbAAxaBBBA
YDR225W   CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBB
YNL030W   CCbbdccACCBAabb*CBCBAacbCCBacccbbCCCCB*AbbcaABb*AaAbabbBxBaA*BA
YNL031C   CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*
YPL127C   DabbdcaBBCBAabb*BBBBAacbBCABbcdbaBABBCBBbcbaABccBBBB*bcbABBxAAb
String-Based Results
YBL002W   cccbABCBBBBBab
YBL003C   cb*baBBCBBBB*b
YBR009C   cccbABCCCBABAb
YBR010W   bccbaBBBBB*Bab
YDR224C   cccb*BBCBBBBAb
YDR225W   bccbaBCBBBBB*b
YNL030W   cccb*BCCBBBB*b
YNL031C   bcbb*BBCBAA**b
YPL127C   bccbABBCBBBA*a
Regular Expression
^.{4}[cdx]{2}.{3}[BCx]{1}.{3}[abx]{1}.{3}[BCDx].{7}[BCx]{1}.{3}[bcdx]{2}.{5}[BCx]{1}.{33}[BCx]{1}[ABCx]{1}.{5}$
Generalization
ORFs: YBL002W, YBL003C, YDR225W, YNL031C

Generalized pattern:
CC-b--c---------BCC-----C---bcc----C-------b----A-----b-----------b-B--B----b

Aligned profiles:
CCabdccaCBBBabc*BCCCA*xcCBCbbcccACACCBBbaabbb*dcABCaaabbBCxBBBacccbABCBBBBBab
CCabcdcbBCBBbacaBCCC**cbCCCAbcccaCCCCBA*bcbbaAbcABBb*ab*AxABBAacb*baBBCBBBB*b
CC*bcdcaCCC*abbbBCCBAacbCCC*bccbaCCCBCBAbccbaBcbAA*a*bbAAxaBBBBbccbaBCBBBBB*b
CCabcccABCB*abb*BCCAAabbCCB*bccbbBBCBC*Abbbb*BbAAAAbabbBAxaA*B*bcbb*BBCBAA**b

(Match markers on the slide highlight positions 5, 6, 10, 14, 18, 26, 30, 31, 37, 71, 72.)
Localized focus of generalization = more genes included
Example 2 with Yeast Cell Data
Original Hypothesis
ORF      Process               Function                Peak   Phase Order   Cluster Order
YBR009C  chromatin structure   histone H4              S      417           791
YBR010W  chromatin structure   histone H3              S      420           795
YDR224C  chromatin structure   histone H2B             S      432           794
YDR225W  chromatin structure   histone H2A             S      437           796
YLR300W  cell wall biogenesis  exo-beta-1,3-glucanase  G1     381           756
YNL030W  chromatin structure   histone H4              S      418           792
YNL031C  chromatin structure   histone H3              S      430           793
New Genes Included After Localized Generalization
Not classified as cell cycle related:
YBR106W, YBR118W, YBR189W, YBR206W, YCLX11W, YDL014W, YDL213C,
YDR037W, YDR134C, YGL148W, YKL009W, YLR449W, YNL110C, YPR163C
New Regular Expression
^.{46}[bcx]{1}.{1}[ABx]{1}[aA*x]{1}[AB*]{1}[abx]{1}[aA*x]{1}.{1}[abcx]{1}[ABx]{1}.{1}[ABCx]{1}[ab*x]{1}[ABx]{1}.{1}[ABCx]{1}.{15}$
Find Longest Common Substrings
Algorithms for LCS
• Substring: a contiguous sequence of characters in a string
• Subsequence: obtained by deleting zero or more symbols from a given string
abcdefghia
Substrings: cdefg, efgh, abcd
Subsequences: ade, cefhi, abc, aia
Longest Common Subsequence
• The LCS is a common subsequence of maximal length between two strings
String 1: abcdabcefghijk
String 2: xbcaghaehijk
LCS = bcaehijk, length of LCS = 8
Finding LCS
• Brute force has exponential time complexity in the length of the strings
• Dynamic programming can find the LCS in O(mn) time and space
• The length of the LCS can be found in O(min(m,n)) space and O(mn) time (see the sketch below)
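A sketch of that space-saving variant (Python; keeping only one previous row of the DP table gives the O(min(m,n)) space bound):

def lcs_length(a, b):
    # O(mn) time, O(min(m,n)) space: keep only one row of the DP table
    if len(b) > len(a):
        a, b = b, a                       # make b the shorter string
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# The example from the previous slide:
print(lcs_length("abcdabcefghijk", "xbcaghaehijk"))  # -> 8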
Main Sequence Mining Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Summarization
• Prediction
Finding the LCS Length
Recursive formulation:

LCS[i, j] = 0                                  if i = 0 or j = 0
LCS[i, j] = LCS[i-1, j-1] + 1                  if i, j > 0 and ai = bj
LCS[i, j] = max(LCS[i, j-1], LCS[i-1, j])      if i, j > 0 and ai ≠ bj

An iterative solution is more efficient than the recursive one.
Finding the LCS Length: Algorithm
lcs_length(A, B)
{
  // A is a string of length m; B is a string of length n, m >= n
  // A[m] and B[n] hold the end-of-string marker '$'
  // L is an (m+1) x (n+1) table of intermediate values for dynamic programming
  for (i = m; i >= 0; i--)
    for (j = n; j >= 0; j--)
    {
      if (A[i] == '$' || B[j] == '$') L[i,j] = 0;          // end of strings
      else if (A[i] == B[j]) L[i,j] = 1 + L[i+1, j+1];     // symbols match
      else L[i,j] = max(L[i+1, j], L[i, j+1]);             // skip one symbol
    }
  return L[0,0];
}
Sequential Clustering
• Clustering partitions the data into equivalence classes
• Data is read in one or a few passes
• The resulting classification depends on the input order
• Simple and fast for large datasets (see the sketch below)
Basic Sequential Clustering
Sequential Clustering with Buckets
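A minimal sketch of the basic sequential scheme (Python; the distance function, threshold, and choice of the first item as cluster representative are assumptions, and practical variants also bound the number of clusters):

def sequential_cluster(items, dist, threshold):
    # One pass over the data: assign each item to the nearest existing
    # cluster representative, or open a new cluster if none is close enough.
    # The outcome depends on the order in which items arrive.
    reps, clusters = [], []
    for x in items:
        if reps:
            d, i = min((dist(x, r), i) for i, r in enumerate(reps))
            if d <= threshold:
                clusters[i].append(x)
                continue
        reps.append(x)            # x becomes the representative of a new cluster
        clusters.append([x])
    return clusters

# Toy run on 1-D points with absolute difference as the distance:
print(sequential_cluster([1, 2, 9, 10, 2], lambda a, b: abs(a - b), 3))
# -> [[1, 2, 2], [9, 10]]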
Main Sequence Mining Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Biological sequence problems
Multivariate Time Series
• Multivariate time series data is a series of multiple attributes
observed over a period of time at equal intervals
• Examples: stocks, weather, utility, sales, scientific, sensor
monitoring and bioinformatics
Time Series Analysis Tasks
• Finding frequent patterns
• Doing similarity search
• Clustering of time series
• Periodicity detection
• Finding temporal associations
• Prediction
Why temporal association rules?
• Give more information about correlations between frequent patterns
• Contain richer information than the frequent patterns alone
• Help to build diagnostic and prediction tools
Finding temporal associations: recent work
• Mannila - Discovery of frequent episodes in event sequences [2], 1997
• Das - Rule Discovery from Time Series [1], 1998
• Kam - Discovering temporal patterns for interval-based
events [3], 2000
• Roddick - Discovering Richer Temporal Association
Rules from Interval-based Data [4], 2004
• Mörchen - Discovering Temporal Knowledge in
Multivariate Time Series [5], 2004
Research Issues
• Finding a richer set of temporal relationships {contains, follows, overlaps, meets, equals, …} than sequence mining's {follows}
• Robustness of rules - room for noise in patterns
• Understanding temporal relationships at different levels of abstraction
• Efficient algorithms to find patterns with noise and temporal associations
Temporal Association Rules
Frameworks
• Kam [3] uses Allen's temporal relations to find rules
• Roddick [4] uses a state-sequence framework similar to Höppner's [6]
• Mörchen [5] is based on a Unification-based Temporal Grammar
Allen’s relationships
The figure above is reproduced from [3]
What are A1 rules?
...  A1 rel1
A2  rel 2 A3  rel 3 A4 ....rel k 1 Ak

Problem
Given a multivariate time series, minimum
support for number of occurrences, minimum
pattern length find all the temporal association
rules (similar to A1) less than size k
Mining pipeline:

1. Dimensionality reduction, discretization, and symbolic representation (multivariate time series → symbol sequences):
aaaeffdaaaaaaaaacccaaaaaaaaaaedefggcbabaacfgfc…
dccbbccdccdeedcdcdeecccdddeeecdeccccddegedbcdc…
dddcbbcfffeeegffcbcdeeefffffecbbaaaaaacffecbbb…

2. Frequent pattern enumeration (sequences → frequent patterns):
{aaaaaaaa, bbbaaa, bbaaa, eeef, aaac, eaaa, …}
{cdee, dccde, deed, ddeeec, eeec, ccccc, edddd, …}
{fff, fffe, aaaaa, fecbb, ddcb, bbbbc, bbbbc, cbbb, …}

3. Clustering (frequent patterns → clusters):
{aaaaaaaa} {bbba, bba, …} {eeef, …} {aaac, eaaa, …} …
{cdee} {dccde, …} {deed, ddeeec, eeec, …} {ccccc, …} {edddd, …} …
{fff, fffe, …} {fecbb, …} {ddcb, …} {bbbbc, bbbbc, cbbb, …} …

4. Temporal association rule discovery (clusters → temporal rules):
1. {aaaa, aaac, aaae, abaa, …} followed by {cbb, ccbb, cccbb, …}
2. {bbbbb, cbbbbbc, …} followed by {baaaa, aaaaaaa, aaaaaa, …} overlaps {aaaa, aaac, aaae, …}
3. {dccccdd, ccccdd} contains {aaaaa, aaaa}
…

5. Summarization and visualization (temporal rules → summarized rules)
Mining in 3 steps
1. Find all frequent patterns in each dimension, along with all their occurrences
2. Cluster the similar patterns to form equivalence classes
3. Find temporal associations between these equivalence classes using an iterative algorithm
Step 1: Finding Frequent Patterns
1. The data in each dimension is quantized to form a string (equal frequency, equal interval, SAX, Persist, etc.) [8].
2. An enhanced suffix tree is constructed for this string using an O(n) algorithm [7].
3. All frequent patterns, along with their locations, are enumerated in linear time by a complete traversal of the tree.
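A sketch of the quantization in step 1 (Python; equal-interval binning with an assumed bin count, where equal-frequency, SAX, or Persist [8] would be drop-in alternatives):

def discretize(series, n_bins=5, symbols="abcde"):
    # Map a numeric series onto a symbol string using equal-interval bins
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0     # avoid zero width on a constant series
    out = []
    for v in series:
        b = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
        out.append(symbols[b])
    return "".join(out)

print(discretize([0.1, 0.2, 2.5, 2.6, 4.9, 0.15]))  # -> 'aaccea'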
Step 1: Enhanced Suffix Tree
Enhanced Suffix Tree Construction Algorithm
Step 2: Clustering frequent patterns
• The frequent patterns are clustered together using string similarity measures
• Any reasonable string similarity measure and algorithm can be chosen
• An LCS-based similarity measure with a sequential clustering algorithm is chosen because of its robustness and efficiency
Step 2: string similarity measure
• Similarity measure: Sim(s1, s2) = 2 · LCS(s1, s2) / (|s1| + |s2|)
• Distance measure: Dist(s1, s2) = 1 - Sim(s1, s2)
• LCS(s1, s2) is the length of the longest common subsequence
• |s| is the length of string s
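Reusing the lcs_length sketch from the LCS section above, both measures translate directly (Python):

def similarity(s1, s2):
    # Sim(s1, s2) = 2 * LCS(s1, s2) / (|s1| + |s2|), in [0, 1]
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

def distance(s1, s2):
    # Dist(s1, s2) = 1 - Sim(s1, s2)
    return 1 - similarity(s1, s2)

# Two patterns from the example cluster below:
print(round(similarity("ccccdcccccccc", "cccccddbccccc"), 3))  # -> 0.769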
Step 2. Clustering
• Sequential clustering algorithm
• Frequent patterns are divided into overlapping buckets based upon their length
• Clustering is done for the patterns in each bucket, to reduce the clustering complexity
• Finally, a pattern that is a member of two clusters is assigned to the closest one
Step 2. Clustering
An Example cluster with cluster center ccccddccccccc
ccccdcccccccc
cccccddbccccc
ccccdcccccccd
cccccdcddcccc
cccccdcdccccc
ccccddccccccc
cccccddcccccc
dccccdccccccc
ccccccddccccc
cccccdccccccc
ccccccdccdccc
ccccccdcccccc
ccccccdcdcccc
Step 2. Clustering Algorithm
Step 3. Temporal Relations
• Only three relations are explored here
• "Overlap" here also covers meets, starts with, overlaps, and overlapped by

Temporal Relations
• The temporal-relationship concept is extended to interval sequences by including support

Temporal Relations: Examples
Example of terminology
Step 3. Algorithm
Step 3. Example
Step 3. Example
Summarization
• Inference of higher-level or more general knowledge from lower-level temporal dependencies
• Summarization generalizes the temporal association rules by identifying the time windows in which a rule is applicable
• Measures for summarization are coverage, average length of coverage, and maximum coverage length
Results: Utility Dataset
Results: EEG Dataset
Scalable up to 1 million rows
Results: Great Lakes dataset
Results: Power Consumption dataset
(Figures: daily patterns; weekly patterns after summarization)
Spatio-Temporal Clusters
Temporal Patterns
(Axes: Observed Value vs. Time)
Goal: Cluster together profiles with similar behavior
Problem: How to define similar behavior? Similar over the complete period, one interval, or multiple intervals?
Example Domains for Temporal Profiles:
– Gene Expression: each gene's expression level (1000s of genes)
– Utility Consumption: each day's/week's utility consumption (daily profiles for multiple years)
– Social Data: number of crime incidents per month
Spatial Patterns
(Scatter plot of profiles in the x-y plane)
Goal: Cluster together profiles close to each other
Example Domains for Spatial Profiles:
– Social Data: number of crime incidents per area
– Scientific Data: pollution level in land
Spatial Closeness:
– Euclidean distance
– Chain connectivity
Spatio-Temporal Patterns
Goal: Find patterns in space-time dimensions for phenomena
Problem: Phenomena in space-time
Issues: Simultaneous handling of space and time dimensions; how to handle the large complexity of the algorithms
Examples (layers of x-y grids stacked along the time axis, e.g. energy consumption):
– Crime data: each layer represents a month (the t-axis); the x and y axes represent space coordinates
– Brewery data: the t-axis represents weeks; the x and y axes represent the average production and average temperature
Clustering Algorithms
Temporal Clustering:
• Discrete Fourier Transform [Agarwal]
– Pros: Indexes time sequences for similarity searching
– Cons: Not able to represent subsequence temporal concepts
• Rule Discovery from Time Series [Das1998]
– Pros: Windows are clustered according to their similarity to determine temporal rules
– Cons: Unable to cluster profiles according to subsequences of similarity

Spatial Clustering:
• PAM
– Pros: Works well for small datasets
– Cons: Expensive
• CLARA
– Pros: Works well for large datasets
– Cons: Expensive
• CLARANS
– Motivated by PAM and CLARA
• BIRCH & CURE
– Suitable for large datasets; clusters are found with one scan

Spatio-Temporal Clustering:
• R-Trees
– Often used for indexing spatio-temporal data
– Overlapping causes backtracking in search
• MR-Tree & HR-Tree
– Use overlapping of R-Trees to represent successive states of the database
– If the number of moving objects from one time instant to another is large, the approach degenerates to independent tree structures and thus no paths are common
Generalizing to multiple dimensions
• Two different approaches:
– Phenomenon moving in the same direction, i.e., parallel to the time axis
– Phenomenon moving in different directions, i.e., N, NW, NE, S, SW, SE, W, E
• Task: classify the phenomena in time, then track them in space
– STEP I: Discovery of temporal clusters
– STEP II: Finding spatial clusters
Phenomenon moving in different directions
Goal: To discover subsequences of the complete temporal profile that are shared by a large number of profiles and are neighbors spatially
Modified Sequences Generated
Example layer-to-layer transitions: ab, bc, ac, cc, cd, dc, bc, db
There are 4 layers, so the final sequences formed are of length 7, i.e., 2n - 1, where n is the number of layers.
Complexity
• The number of strings generated moving from Layer 1 to Layer 2 is:
• For L layers, the number of strings generated when the phenomenon moves from layer 1 to layer L is:
Results
- Test data with 20 profiles
- Each profile observed at 30 time points
(Axes: Profile vs. Time)
Results (Brewery Data 1)
- Energy consumption data of size 51x14
- 51 profiles, each observed over a time period of 14 days
(Axes: Energy Consumption vs. Time)
Results (Brewery Data 2)
- Production data of size 51x14
- 51 profiles, each observed over a time period of 14 days
Results
- Sequences generated from test data with 7 layers, each layer of size 10x10
- 4480 strings generated
(Axes: Observed Values vs. Time; Profile Index)
Sequences and Bioinformatics
Importance of Sequences
• DNA: sequence of letters A,C,G,T
• Proteins: sequence of 20 amino acids
• Central premise of bioinformatics
– Sequence → structure → function
• Ultimate goal: predict function based on
sequence
Strings and Sequences
Exact String Matching
Solutions to Exact Matching
• O(n+m) upper bound for worst case
• Preprocess the pattern P
• Preprocess the text T
– Suffix tree method: Weiner, Ukkonen and
others
• Building the tree: O(m)
• Searching: O(n)
Suffix Trees
• T = cabca
• Suffixes are
– cabca
– abca
– bca
– ca
–a
• P = ca
(Figure: the suffix tree for cabca$, with leaves numbered 1-5 by suffix start position.)
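The tree above supports O(n) search after O(m) construction. As a compact, runnable stand-in, here is a sketch using a suffix array with binary search (Python); this is not Weiner's or Ukkonen's linear-time tree construction, and the naive build below is quadratic:

import bisect

def suffix_array(text):
    # Start positions of all suffixes, sorted by the suffixes themselves
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    # Binary search over the sorted suffixes for those starting with pattern
    suffixes = [text[i:] for i in sa]          # materialized for clarity only
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

text = "cabca"
sa = suffix_array(text)
print(occurrences(text, sa, "ca"))  # -> [0, 3]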
Approximate String Matching
• Error in the experiment or observation
• Redundancy in biological system
– Insertion/deletions
– Mutations
The Task: Best Alignment
• Given Strings
– HEAGAWGHE
– PAWHE
• Many alignments are possible:
HEAGAWGHE
P-A--W-HE

HEAGAWGHE
--P-AW-HE

HEAGAWGHE
-PA--W-HE
Scoring Function
• A score for a match
HEAG…
XXAX…
• A score for a mismatch
HEAG…
XXPX…
• A score for alignment against a gap
HEAG…
XX_X…
Algorithms
• Needleman-Wunsch (NW): Global
Alignment
• Smith-Waterman (SW): Local Alignment
• Both are based on the idea of dynamic programming
All Possible Alignments
• Represented using a table
– The first string, a1a2, is aligned along the top
– The second string, b1b2, is aligned along the left side
• Each monotone path from the upper-left to the lower-right corner is one alignment, e.g. a1b1a2b2, b1a1b2a2, a1a2b1b2, b1a1a2b2, b1b2a1a2, a1b1b2a2, …
• Counting the paths into each cell (1, 1, 3, 5, 5, 13, …) shows there are 13 possible alignments of two two-symbol strings
Main idea of DP
Consider the corner of the table for strings a1a2a3 and b1b2b3, where F(a2, b2) = 5, F(a3, b2) = 12, and F(a2, b3) = -3.
• Choices:
– Align a3 to b3: s(a3, b3) = -5, score: 5 - 5 = 0
– Align a3 to a gap: s(a3, -) = -8, score: -3 - 8 = -11
– Align b3 to a gap: s(-, b3) = -8, score: 12 - 8 = 4
• The cell takes the maximum of the three choices: F(a3, b3) = 4
Main idea of DP
• The score is a sum of independent piecewise scores
• Each cell F(i, j) is computed from its three neighbors F(i-1, j-1), F(i-1, j), and F(i, j-1)
• Global alignment (Needleman-Wunsch):
– F(0,0) = 0; F(k,0) = F(0,k) = -k·d
– F(i,j) = max{ F(i-1,j-1) + s(ai,bj); F(i-1,j) - d; F(i,j-1) - d }
• Local alignment (Smith-Waterman):
– F(0,0) = 0; F(k,0) = F(0,k) = 0
– F(i,j) = max{ 0; F(i-1,j-1) + s(ai,bj); F(i-1,j) - d; F(i,j-1) - d }
Overview of DP
• Start from the upper-left corner
• Assign scores for the leading gaps
• Iteratively fill in the scores in the table
• Find the maximum score, at the bottom-right for global alignment
• Back-track to the upper-left to find the best overall alignment
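A runnable sketch of the global procedure (Python), using the Blosum50 entries from the table below and the worked example's gap penalty d = 8:

# Blosum50 scores for just the residue pairs this example needs
S = {'P': {'H': -2, 'E': -1, 'A': -1, 'G': -2, 'W': -4},
     'A': {'H': -2, 'E': -1, 'A': 5, 'G': 0, 'W': -3},
     'W': {'H': -3, 'E': -3, 'A': -3, 'G': -3, 'W': 15},
     'H': {'H': 10, 'E': 0, 'A': -2, 'G': -2, 'W': -3},
     'E': {'H': 0, 'E': 6, 'A': -1, 'G': -3, 'W': -3}}
d = 8  # linear gap penalty

def needleman_wunsch(a, b):
    # Fill the global-alignment table F and return the optimal score F(m, n)
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d                      # leading gaps in b
    for j in range(1, n + 1):
        F[0][j] = -j * d                      # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i-1][j-1] + S[a[i-1]][b[j-1]],  # align ai to bj
                          F[i-1][j] - d,                     # ai against a gap
                          F[i][j-1] - d)                     # bj against a gap
    return F[m][n]

print(needleman_wunsch("PAWHE", "HEAGAWGHE"))  # -> 3, the corner cell of the table below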
Pairwise score from Blosum50
      H    E    A    G    A    W    G    H    E
P    -2   -1   -1   -2   -1   -4   -2   -2   -1
A    -2   -1    5    0    5   -3    0   -2   -1
W    -3   -3   -3   -3   -3   15   -3   -3   -3
H    10    0   -2   -2   -2   -3   -2   10    0
E     0    6   -1   -3   -1   -3   -3    0    6
DP Table for global alignment
         H      E      A      G      A      W      G      H      E
    0    <-8    <-16   <-24   <-32   <-40   <-48   <-56   <-64   <-72
P   ^-8  *-2    *-9    *-17   <-25   *-33   <-41   <-49   <-57   *-65
A   ^-16 ^-10   *-3    *-4    <-12   *-20   <-28   <-36   <-44   <-52
W   ^-24 ^-18   ^-11   *-6    *-7    *-15   *-5    <-13   <-21   <-29
H   ^-32 *-14   *-18   *-13   *-8    *-9    ^-13   *-7    *-3    <-11
E   ^-40 ^-22   *-8    <-16   ^-16   *-9    *-12   ^-15   *-7    *3

(* = diagonal, ^ = up, < = left; back-tracking from the bottom-right gives:)
HEAGAWGHE
--P-AW-HE
DP Table for local alignment
        H     E     A     G     A     W     G     H     E
    0   0     0     0     0     0     0     0     0     0
P   0   0     0     0     0     0     0     0     0     0
A   0   0     0     *5    0     *5    0     0     0     0
W   0   0     0     0     *2    0     *20   <12   <4    0
H   0   *10   <2    0     0     0     ^12   *18   *22   <14
E   0   ^2    *16   <8    0     0     ^4    ^10   *18   *28

(Back-tracking from the maximum, 28, gives:)
AWGHE
AW-HE
Multiple Alignment
• Complexity of DP for two strings: O(n^2)
• For k strings: O(n^k)
• Explore other options
– Hidden Markov Models
Markov Chain Models
• Similar to finite automata
• Emits sequences with certain probability
(State diagram: four states A, C, G, T, emitting sequences such as ATGT…)
P(ATGT) = P(A|TGT) P(T|GT) P(G|T) P(T)
P(ATGT) = P(A|T) P(T|G) P(G|T) P(T)
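A sketch of the first-order computation (Python; the probability values are made-up placeholders, since the slide gives none, and the factorization follows the slide by conditioning each symbol on the one after it):

# Illustrative (made-up) parameters: P(last symbol) and P(symbol | next symbol)
P_last = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
P_cond = {('A', 'T'): 0.3, ('T', 'G'): 0.2, ('G', 'T'): 0.4}

def chain_probability(seq):
    # P(seq) = P(seq[-1]) * product of P(seq[i] | seq[i+1]),
    # i.e. P(ATGT) = P(A|T) P(T|G) P(G|T) P(T) as on the slide
    p = P_last[seq[-1]]
    for cur, nxt in zip(seq, seq[1:]):
        p *= P_cond[(cur, nxt)]
    return p

print(chain_probability("ATGT"))  # -> 0.3 * 0.2 * 0.4 * 0.25 = 0.006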
Hidden Markov Models
• Generalization of Markov Models
• Hidden states that emit sequences
(State diagram: island states A*, C*, G*, T* alongside non-island states A, C, G, T)
Adding four more states (A*, C*, T*, G*) to represent the "island" model, as opposed to the non-island model, with unlikely transitions between the two models, one obtains a "hidden" Markov model for CpG islands.
HMMs for Gene Prediction
HMMs & Supervised Learning
• Input: a training set of aligned sequences
• Find optimal transition and emission
probabilities
• Criteria: maximize probability of observing
the training sequences
• Algorithms
– Baum-Welch (Expectation Maximization)
– Viterbi training algorithm
Recognition Phase
• We have optimized probabilities
• Predict the likelihood that a sequence belongs to a family
– What's the probability that the sequence was generated by the HMM?
A Simple HMM Model
Each blue square represents a match state that "emits" each letter with a certain probability ej(a), defined by the frequency of a at position j:
(Model: Beg → M1 → M2 → … → M7 → End)

Example training sequences:
AGAAACT
AGGAATT
TGAATCT

Position:   1     2     3     4     5     6     7
A          2/3    0    2/3    1    2/3    0     0
T          1/3    0     0     0    1/3   1/3    1
C           0     0     0     0     0    2/3    0
G           0     1    1/3    0     0     0     0

P(AGAAACT) = 16/81
P(TGGATTT) = 1/81
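A sketch that reproduces both probabilities from the emission table (Python, using exact fractions):

from fractions import Fraction as Fr

# Position-specific emission probabilities e_j(a) from the table above
emit = [{'A': Fr(2, 3), 'T': Fr(1, 3)},        # position 1
        {'G': Fr(1)},                           # position 2
        {'A': Fr(2, 3), 'G': Fr(1, 3)},         # position 3
        {'A': Fr(1)},                           # position 4
        {'A': Fr(2, 3), 'T': Fr(1, 3)},         # position 5
        {'C': Fr(2, 3), 'T': Fr(1, 3)},         # position 6
        {'T': Fr(1)}]                           # position 7

def sequence_probability(seq):
    # Product of the per-position match-state emission probabilities
    p = Fr(1)
    for e, ch in zip(emit, seq):
        p *= e.get(ch, Fr(0))
    return p

print(sequence_probability("AGAAACT"))  # -> 16/81
print(sequence_probability("TGGATTT"))  # -> 1/81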
Insertions…
(Model: Beg → Mj → End, with insert states Ij above the match states)
Insert states emit symbols just like the match states, however, the
emission probabilities are typically assumed to follow the background
distribution and thus do not contribute to log-odds scores.
Transitions Ij -> Ij are allowed and account for an arbitrary number
of inserted residues that are effectively unaligned (their order within
an inserted region is arbitrary).
… and Deletions
(Model: Beg → Mj → End, with silent delete states Dj above the match states)
Deletions are represented by silent states which do not emit any letters.
A sequence of deletions (with D -> D transitions) may be used to connect
any two match states, accounting for segments of the multiple alignment
that are not aligned to any symbol in a query sequence (string).
The total cost of a deletion is the sum of the costs of the individual transitions (M→D, D→D, D→M) that define it. As in the case of insertions, both linear and affine gap penalties can easily be incorporated in this scheme.
HMMs for Multiple Alignment
(Model: Beg → Mj → End, with insert states Ij and delete states Dj)
Example (the first two columns and the last column are match states):
AG---C
A-AG-C
AG-AA--AAAC
AG---C
Emission and Transition Counts

Alignment:
AG...C
A-AG.C
AGAA.--AAAC
AG...C

Transition counts: Beg→M1: 4; M1→M2: 3; M1→D2: 1; M2→M3: 2; M2→I2: 1; D2→I2: 1; I2→I2: 5; I2→M3: 2; M3→End: 4

Match emissions:
     M1   M2   M3
A    4    0    0
C    0    0    4
G    0    3    0
T    0    0    0

Insert emissions (I2):
A    6
C    0
G    1
T    0
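Turning such counts into emission probabilities is one line of maximum-likelihood estimation (a sketch in Python; real profile HMMs add pseudocounts to avoid zero probabilities):

def counts_to_probs(counts):
    # Maximum-likelihood estimate: normalize raw emission counts
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# Match state M2 from the table above (G observed 3 times):
print(counts_to_probs({'A': 0, 'C': 0, 'G': 3, 'T': 0}))
# -> {'A': 0.0, 'C': 0.0, 'G': 1.0, 'T': 0.0}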
Results
(Figures: Training File 2; Validation File 2)
Bibliography
1. G. Das, K. Lin, H. Mannila, G. Renganathan and P. Smyth, "Rule Discovery from Time Series," in Knowledge Discovery and Data Mining, 1998, pp. 16-22.
2. H. Mannila, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259-289, 1997.
3. P. Kam and A. W. Fu, "Discovering Temporal Patterns for Interval-Based Events," in Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), Lecture Notes in Computer Science, 2000, pp. 317-326.
4. J. F. Roddick and E. Winarko, "Discovering Richer Temporal Association Rules from Interval-based Data: Extended Report," School of Informatics and Engineering, Flinders University, Adelaide, Australia, Tech. Rep. SIE-05-003, March 2005.
5. F. Mörchen and A. Ultsch, "Discovering Temporal Knowledge in Multivariate Time Series," in GfKl Dortmund, 2004.
6. F. Höppner, "Learning Temporal Rules from State Sequences," in Proceedings of the IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 2001, pp. 25-31.
7. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, pp. 249-260, 1995.
8. C. S. Daw, C. E. A. Finney and E. R. Tracy, "A Review of Symbolic Analysis of Experimental Data," Rev. Sci. Instrum., vol. 74, pp. 915-930, Feb. 2003.
9. S. Theodoridis and K. Koutroumbas, Pattern Recognition, San Diego: Elsevier Academic Press, 2003.