single period
Download
Report
Transcript single period
Finding Periodic Partial Patterns
in Time Series Database
Huiping Cao
Apr. 30, 2003
1
Outline
Problem Definition
Mining partial periodicity for some given
period(s)
–
–
2
single period
multi periods
Mining partial periodicity when no period length
is given in advance
Conclusion & Future work
Problem definition
Time series S = D1, D2, ..., Dn, where Di is a set of
features for time instant i.
Partial pattern s = s1 ... sp . Here, si is defined over (2L{}{*}) where L is the underlying set of features and *
refers to the “don’t care” character.
–
–
–
–
|s|: pattern length
L-length of s: number of si which contains letters from L.
subpattern of a pattern s: is a pattern s’ = s’1 ... s’p such that |s|
= |s’| and s’i si for every position i where s’i *.
E.g.: s = a*{a,c}de
3
|s|=5,
L-length is 4(also called 4-pattern)
a*{a,c}** and **cde are all its subpatterns.
Problem definition
frequency_count(s) in sequence S=D1, D2, ..., Dn
–
confidence(s) = frequency_count(s)/m
–
–
A patterns s = s1 ... sp is true in some period segment means: for each
position i, either si is * or all the letters in si occur in the ith set of the features
in the segment. Pattern “a*b” is true in segment “acb”, but not true in “bcb”
frequent partial periodic pattern s:
–
4
m: maximum number of periods of length |s| contained in the time
series.(m|s| n<(m+1)|s|).
E.g.: In a{b,c}baebaced, freq_count(a*b) =2, conf(a*b) =2/3
period segment: segment in form of Di|s|+1, Di|s|+s, ..., Di|s|+|s| where 0i<m.
–
frequency_count(s) = |{i|0i<m, and string s is true in Di|s|+1, Di|s|+s, ...,
Di|s|+|s|}|.
confidence(s) min_conf, which is a user specified threshold
Problem definition
Input:
–
–
–
–
Goal:
–
5
A time series S
Specified period(s)
m: indicating the ratio of the lengths of S and the
patterns must be at least m
min_conf
Discover all the frequent patterns for one period or
some periods
Mining partial periodicity for some
given period(s)
For single period
For multi periods
Deviation of all partial patterns
–
–
6
Max-subpattern tree
Deviation of frequent patterns from max-subpattern
tree
Mining partial periodicity for single
period
Notation:
–
F1: the set of frequent 1-patterns of period p. For example,
p=3, a**, *{b,c}*, **g are all in F1.
Single-period Apriori
–
Find frequent F1.
Find all frequent i-patterns of period p(2ip) using the Apriori
property. Terminate when the candidate i-pattern set is empty.
Step 1 scan source data once, and step 2 need scan source
data up to p-1 times in the worst case.
–
7
Accumulate the frequency count for each 1-pattern in each whole
period segment;
select those F1 whose frequency count min_conf*m
Mining partial periodicity for single
period
The advantage of Apriori property in mining
partial patterns is not as obvious as that in
mining association rule.
–
–
8
In mining association rule, the number of frequent iitemsets shrinks quickly as i increase because of
the sparsity of frequent i-itemsets.
In mining periodic patterns, the number of frequent
i-patterns does not shrink quickly as i increase
because of strong correlation between frequencies
of patterns and their subpatterns.
Max-subpattern hit set method
-single period
Candidate max-pattern, Cmax, is the maximal pattern
generated from F1.
–
A subpattern of Cmax is hit in a period segment Si of S if
it is the maximal subpattern of Cmax.
–
9
E.g., F1={a****,*b***,**c**}, Cmax =abc**
E.g., Cmax =abc**, Si = abdef, then ab*** is its hit subpattern.
The complete set of partial periodic patterns can be
derived from the frequency counts of all the hit maximal
subpatterns of Cmax.
Max-subpattern hit set method
-single period(algorithm2)
Scan S once to find frequent F1. Form the
candidate Cmax.
Scan S once again. For each period segment,
if it’s nonempty, add it to the max-subpattern
tree(introduced later).
Derive frequent patterns from the maxsubpattern tree(introduced later).
Only two scans on source data.
10
Max-subpattern tree
11
Max-subpattern tree is used to facilitate the process of
deriving the set of frequent patterns, which is the step 2
of the algorithm 2.
Rooted at Cmax
Each subpattern of Cmax with one non-* letter missing is
a direct child node of the root.
A node w with one more non-* letters may have a set of
children, each of which is a subpattern of w with one
more non-* letter missing.
Each node has a “count” field.
Max-subpattern tree
a{b1,b2}*d* 10
~a
~b1
0
*{b1,b2}*d*
~b1
~b2
50
~a
ab2*d*
~a
40
ab1*d*
~b1
~b2
~b2
~d
*b2*d*
8
~d
*b1*d*
18
*{b1,b2}***
5
32
a{b1,b2}***
~a
~d
a**d*
19
~b1
~b2
~d
ab2***
0
One node is linked to only one parent, e.g., a**d* is linked to
ab2*d*,but not linked to ab1*d*(missing link is marked by a green
dash line.)
12
ab1***
2
Insertion in the max-subpattern tree
Insert a max-subpattern w found during the
scan of S into the max-subpattern tree T.
Step1: Starting from the root of the tree, find
the corresponding node by checking the
missing non-* letter in order.
–
13
E.g., To max-pattern node *b1*d* , it has two letters,
a and b2 missing from the Cmax =a{b1,b2}*d*. The
node can be found follong ~a link to *{b1,b2}*d*,
then following ~b2 link to *b1*d*.
Insertion in the max-subpattern tree
Step2: If the node W is found, increase its count by 1.
Otherwise, create a new node w with count 1 and its
missing ancestor nodes(only those on the path to w,
with count 0), if any, and insert it/them into the
corresponding place(s) of the tree.
–
14
E.g., If the first max-subpattern node found is *b1*d*, we will
create the node *b1*d* with count 1 and create two ancestor
nodes with count 0: w1=a{b1,b2}*d*(root) and w2= *{b1,b2}*d*
following ~a link of w1. The node *b1*d* is w2’s child, following
the ~b2 link.
Derivation of frequent patterns
from max-subpattern tree
The set frequent F1 is derived in the first scan.
The set of frequent k-pattern(k>1) is derived
from the max-subpattern tree as follows:
–
for i=2 to |F1|
15
derive candidate patterns with L-length i from frequent (L1)-length patterns by “(i+1)-way join”
scan tree T to find frequency counts of these candidate
patterns and eliminate the non-frequent once.(note: the
frequency count of a node is the sum of count(itself) and
count(all of its reachable ancestors)). IF the derived
frequent i-pattern set is empty, return.
Example:
16
Set of reachable ancestors of a node w in a
max-subpattern tree T is the set of all the nodes
in T, which are proper super-patterns of w.
Let min_conf*m=45.
We check level-2 node, w = *b2*d*
Example:
a{b1,b2}*d* 10
~a
~b1
0
*{b1,b2}*d*
~b1
~b2
50
~a
ab2*d*
~a
~b2
~d
–
–
17
–
40
ab1*d*
~b1
~b2
*b2*d*
8
~d
*b1*d*
18
*{b1,b2}***
5
32
a{b1,b2}***
~a
~d
~b1
~b2
~d
a**d*
ab2***
19
0
reachable ancestor set(w) = {root, *{b1,b2}*d*,ab2*d*}
frequency_count(w) = 10+0+50+8 =68 >45
it’s frequent!
ab1***
2
Max-subpattern hit set method
-multi period
18
Scan S once, for all periods pj. Find F1(pj) and
form Cmax(pj).
Scan S again, for all periods pj do the same as
step2 in the former algorithm.
Only two scans on source data.
Experiment
Synthetic data
–
Performance gain when L-length increase
19
100k, 550k
|F1|=12
p=50
HitSet method
outperforms
Apriori method
Mining partial periodicity when no
period is given in advance
In the algorithms provided before, the period
length is given in advance.
–
PPD(Partial Periodic Detection) algorithm
–
–
20
How to find frequent partial patterns when periods
are unknown?
Filter step: scan the data once to find those possible
periods automatically(introduce later).
Mining step:Apply Han’s algorithm to discover
frequent partial patterns for the periods gotten in the
first step.
Filter step
Assume the size of a time series is N
Step1: Create a binary vector of size N for
every letter in the alphabet.
–
–
1 for every occurrence of the corresponding letter
0 for every other letter
21
e.g.: abcdabebadfcacdcfcaa, N=20,
binary vector(a) = 10001000100010000011
binary vector(b) = 01000101000000000000
...
Filter step
Step2: calculate the Circular Autocorrelation Function
for every binary vector.
–
–
Autocorrelation means self-correlation. I.e., discovering
correlations among the elements of the same vector v for
every possible period length(1,...,N).
The function value is the dot products between v and v shifted
circularly by a lag k. We denote v(k) as vector shifted by k.
–
22
Circular means that if one point will be out of N after shifting k, it
will be moved to the beginning of the vector.
E.g., assume v =1001, then v shifted by 2 is 0110
For each k from 1 to N, compute autocorrelation function r(k)
r(k) = v.v(k)
E.g., r(2) =(1001)
0
1
1
0
.
= 0 for v=1001
–
Example:abcdabebadfcacdcfcaa, N=20,
23
binary vector(a) = 10001000100010000011
first value of the autocorrelation vector is the dot product of the binary
vector with itself. It’s 6.
The peak identified at position 5 implies that there is probably a period
of length 4 and the value of 3 at this position is an estimate of the
frequency count of this period.
Identifying all those peaks get a set of candidate periods.
Given min-conf c, extract frequent periods. If the value for period p is
greater than or equal to cN/p, then p is frequent period.
Exprement:
Test data:
–
–
24
Real data: Wal-Mart stores and power consumption data.
Synthetic data: Machine Learning Repository.
Different runs over different portions of the data sets showed that
the execution time is linearly proportional to the size of the time
series as well as the size of the alphabet.
Conclusion & future work
Introduce algorithms used to mine frequent
partial patterns in time series database
–
–
How to solve the similar problem in data
stream setting considering some constraint:
–
–
25
given periods
periods are unknown
one-pass
memory limitation
References:
26
J. Han, G. Dong, Y. Yin. Efficient Mining of
Partial Periodic Patterns in Time Series
Database. In ICDE99.
C.Berberidis, I. Vlahavas, W. G. Aref. etc. On
the Discovery of Weak Periodicities in Large
Time Series. In PKDD02.