Advanced_time_series


Transcript: Advanced_time_series

We have seen the Classic Time
Series Data Mining Tasks…
Clustering
Classification
Query by Content
Let's look at some problems which are
more interesting…
Novelty Detection
Rule Discovery

Motif Discovery
Novelty Detection
Fault detection
Interestingness detection
Anomaly detection
Surprisingness detection
…note that this problem should not be
confused with the relatively simple problem
of outlier detection. Remember Hawkins'
famous definition of an outlier...
... an outlier is an observation
that deviates so much from
other observations as to arouse
suspicion that it was generated
from a different mechanism...
Thanks Doug, the check is in the
mail.
We are not interested in finding
individually surprising
datapoints, we are interested in
finding surprising patterns.
Douglas M. Hawkins
Lots of good folks have worked on
this, and closely related problems.
It is referred to as the detection of
"Aberrant Behavior" [1], "Novelties" [2],
"Anomalies" [3], "Faults" [4], "Surprises" [5],
"Deviants" [6], "Temporal Change" [7], and
"Outliers" [8].
1. Brutlag, Kotsakis et al.
2. Dasgupta et al., Borisyuk et al.
3. Whitehead et al., Decoste
4. Yairi et al.
5. Shahabi, Chakrabarti
6. Jagadish et al.
7. Blockeel et al., Fawcett et al.
8. Hawkins
Arrr... what be wrong with
current approaches?
The blue time series at the top is a normal
healthy human electrocardiogram with an
artificial “flatline” added. The sequence in
red at the bottom indicates how surprising
local subsections of the time series are
under the measure introduced in Shahabi et al.
Simple Approaches I
Limit Checking
[Figure: a 1,000-point time series plotted with fixed limits, illustrating limit checking]
Simple Approaches II
Discrepancy Checking
[Figure: the same 1,000-point time series, illustrating discrepancy checking]
Our Solution
Based on the following intuition, a
pattern is surprising if its frequency of
occurrence is greatly different from
that which we expected, given
previous experience…
This is a nice intuition, but useless unless we can
more formally define it, and calculate it efficiently
Note that unlike all previous attempts to solve this
problem, our notion of the surprisingness of a pattern is not
tied exclusively to its shape. Instead it depends on the
difference between the shape's expected frequency and
its observed frequency.
For example consider the familiar head and shoulders
pattern shown below...
The existence of this pattern in a stock market time series
should not be considered surprising, since such patterns are known
to occur (even if only by chance). However, if it occurred ten times this
year, as opposed to occurring an average of twice a year in
previous years, our measure of surprise will flag the shape as
being surprising. Cool eh?
The pattern would also be surprising if its frequency of
occurrence is less than expected. Once again our definition
would flag such patterns.
We call our algorithm… Tarzan!
"Tarzan" is not an
acronym. It is a pun on
the fact that the heart
of the algorithm relies
on comparing two suffix
trees, "tree to tree"!
Homer, I hate to be a fuddy-duddy, but could you put on
some pants?
Tarzan (R) is a registered
trademark of Edgar Rice
Burroughs, Inc.
We begin by defining some
terms… Professor Frink?
Definition 1: A time series pattern P,
extracted from database X is surprising
relative to a database R, if the
probability of its occurrence is greatly
different to that expected by chance,
assuming that R and X are created by the
same underlying process.
But you can never know the
probability of a pattern you have
never seen!
And probability isn’t even defined
for real valued time series!
We need to discretize the time series into
symbolic strings…
UUFDUUFDDUUD
DQ, C  
1.5
 qi  ci 
n
2
i 1
MINDIST (Qˆ , Cˆ ) 
n
w
 dist (qˆ , cˆ )
w
i 1
C
1
0.5
0
Ĉ =
baabccbc
Q̂ =
babcacca
-0.5
-1
Q
-1.5
0
20
40
60
80
100
120
2
i
i
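As a concrete illustration of the discretization step, here is a minimal Python sketch (z-normalize, piecewise aggregate averaging, then equiprobable breakpoints), producing short words like the Ĉ = baabccbc example above. The function name, defaults, and breakpoint choice are illustrative assumptions, not the exact procedure used in the talk.

```python
import numpy as np
from scipy.stats import norm

def discretize(ts, word_length=8, alphabet_size=3):
    """Turn a real-valued series into a short symbolic word such as 'baabccbc'.

    Sketch only: z-normalize, average the series into `word_length` frames
    (piecewise aggregate approximation), then label each frame average with
    a letter using breakpoints that make the symbols roughly equiprobable."""
    ts = np.asarray(ts, dtype=float)
    ts = (ts - ts.mean()) / (ts.std() + 1e-12)                   # z-normalize
    paa = np.array([f.mean() for f in np.array_split(ts, word_length)])
    cuts = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])  # N(0,1) quantiles
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[np.searchsorted(cuts, v)] for v in paa)

# Example: a smooth rising ramp maps to a word of non-decreasing symbols
print(discretize(np.linspace(-1, 1, 128)))   # 'aaabbccc'
```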
I've prepared a little math background...
• We use Σ to denote a nonempty alphabet of symbols.
• A string over Σ is an ordered sequence of symbols from the alphabet.
• Given a string x, the number of symbols in x defines the length |x| of x. We assume |x| = n.
Let us decompose a text x into uvw, i.e., x = uvw, where u, v, and w are strings over Σ.
• Strings u, v, and w are substrings, or words, of x.
• u is called a prefix of x.
• w is called a suffix of x.
• We write x[i], 1 ≤ i ≤ |x|, to indicate the ith symbol in x.
• We use x[i,j] as shorthand for the substring x[i] x[i+1] … x[j], where 1 ≤ i ≤ j ≤ n, with the convention that x[i,i] = x[i].
• Substrings of the form x[1,j] correspond to the prefixes of x.
• Substrings of the form x[i,n] correspond to the suffixes of x.
• We say that a string y has an occurrence at position i of a text x if y[1] = x[i], y[2] = x[i+1], …, y[m] = x[i+m-1], where m = |y|.
• For any substring y of x, we denote by fx(y) the number of occurrences of y in x.
If x = principalskinner
 is
{a,c,e,i,k,l,n,p,r,s}
|x| is 16
skin is a substring of x
prin is a prefix of x
ner is a suffix of x
If y = in, then fx(y) = 2
If y = pal, then fx(y) = 1
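To make the fx(y) notation concrete, here is a tiny Python sketch (overlapping occurrences are counted, matching the definition above; the function name is illustrative):

```python
def occurrences(y, x):
    """f_x(y): the number of positions i at which the word y occurs in x
    (overlapping occurrences are counted)."""
    m = len(y)
    return sum(1 for i in range(len(x) - m + 1) if x[i:i + m] == y)

x = "principalskinner"
print(occurrences("in", x))    # 2, as in the example above
print(occurrences("pal", x))   # 1
```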
We consider a string generated by a stationary Markov chain of order M ≥ 1 on
the finite alphabet Σ. Let x = x[1]x[2] … x[n] be an observation of the random
process and y = y[1]y[2] … y[m] an arbitrary but fixed pattern over Σ with m < n.
The stationary Markov chain is completely determined by its transition matrix
Π = (π(y[1,M], c)), y[1],…,y[M], c ∈ Σ, where

π(y[1,M], c) = P(X_{i+1} = c | X[i-M+1, i] = y[1,M])

are called transition probabilities, with y[1],…,y[M], c ∈ Σ and M ≤ i ≤ n-1. The
vector ν of the stationary probabilities of a stationary Markov chain with
transition matrix Π is defined as the solution of ν = νΠ.

We now introduce the random variable that describes the occurrences of the word
y. We define Z_{i,y}, 1 ≤ i ≤ n-m+1, to be 1 if y occurs in x starting at position i,
and 0 otherwise. We set

Z_y = \sum_{i=1}^{n-m+1} Z_{i,y}

so that Z_y is the random variable for the total number of occurrences fx(y).
We can now estimate the probability of seeing a
pattern, even if we have never seen it before!
John Smith
Cen Lee
John Jones
Mike Lee
John Smith
Bill Chu
Selina Chu
Bill Smith
David Gu
Peggy Gu
John Lee
Susan Chu
E(John Smith) = 2/12
E(Selina Chu) = 1/12
E(John Chu) = 0?  (the name was never observed)
E(John Chu) = (4/12) × (3/12), estimated from the component frequencies:
f(John) = 4/12
f(Chu) = 3/12
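The same idea in a few lines of Python: if the full name was observed, use its empirical frequency; if not, fall back to the product of the first-name and last-name frequencies (an independence assumption). The numbers reproduce the toy example above; the function name is illustrative.

```python
names = ["John Smith", "Cen Lee", "John Jones", "Mike Lee",
         "John Smith", "Bill Chu", "Selina Chu", "Bill Smith",
         "David Gu", "Peggy Gu", "John Lee", "Susan Chu"]

def expected_probability(full_name, corpus):
    """Empirical frequency if the full name was seen; otherwise the
    product of the component (first/last name) frequencies."""
    n = len(corpus)
    if full_name in corpus:
        return corpus.count(full_name) / n
    first, last = full_name.split()
    f_first = sum(name.split()[0] == first for name in corpus) / n
    f_last = sum(name.split()[1] == last for name in corpus) / n
    return f_first * f_last

print(expected_probability("John Smith", names))   # 2/12
print(expected_probability("John Chu", names))     # (4/12) * (3/12)
```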
suffix_tree Preprocess(string r, string x)
  let Tr = Suffix_tree(r)
  let Tx = Suffix_tree(x)
  let α = (|x| - m + 1) / (|r| - m + 1)
  Annotate_f(w)(Tr)
  Annotate_f(w)(Tx)
  visit Tx in breadth-first traversal, for each node u do
    let w = L(u), m = |w|
    if w occurs in Tr then
      let Ê(w) = α · fr(w)
    else
      find the largest 1 < l < m-1 such that ∏_{j=1}^{m-l} fr(w[j, j+l]) > 0, using the suffix tree Tr
      if such l exists then
        let Ê(w) = α · ( ∏_{j=1}^{m-l} fr(w[j, j+l]) ) / ( ∏_{j=2}^{m-l} fr(w[j, j+l-1]) )
      else
        let Ê(w) = (|x| - m + 1) · ∏_{i=1}^{m} p̂(w[i])
    let z(w) = fx(w) - Ê(w)
    store z(w) in the node u
  return Tx
void Tarzan(time_series R, time_series X, int l1, int a, int l2, real c)
  let x = Discretize_time_series(X, l1, a)
  let r = Discretize_time_series(R, l1, a)
  let Tx = Preprocess(r, x)
  for i = 1 to |x| - l2 + 1
    let w = x[i, i+l2-1]
    retrieve z(w) from Tx
    if |z(w)| > c then print i, z(w)
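As a rough end-to-end illustration, here is a heavily simplified Python sketch of the Tarzan idea. It replaces the suffix trees with plain dictionaries of substring counts (so it does not have the linear time and space behavior of the real algorithm), assumes the inputs are already discretized strings, and omits the Markov fallback estimate for unseen words; all names and the toy threshold are illustrative.

```python
from collections import Counter

def substring_counts(s, length):
    """Count every substring of the given length (overlaps allowed)."""
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))

def tarzan_like(r, x, l2, c):
    """Flag positions of the (already discretized) string x whose word of
    length l2 occurs surprisingly more or less often than the reference
    string r suggests."""
    f_r = substring_counts(r, l2)
    f_x = substring_counts(x, l2)
    alpha = (len(x) - l2 + 1) / (len(r) - l2 + 1)  # scale r's counts to x's length
    hits = []
    for i in range(len(x) - l2 + 1):
        w = x[i:i + l2]
        expected = alpha * f_r.get(w, 0)   # real Tarzan uses a Markov estimate
                                           # when w was never seen in r
        z = f_x[w] - expected
        if abs(z) > c:
            hits.append((i, w, round(z, 2)))
    return hits

# Toy usage: words overlapping the injected 'dd' (and the now-rarer 'ca')
# stand out against the purely periodic reference string.
print(tarzan_like("abcabcabcabc", "abcabcddabc", l2=2, c=0.9))
```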
Experimental Evaluation
We would like to demonstrate two features of our proposed
approach: that it is Sensitive and Selective (just like me).
• Sensitivity (High True Positive Rate): The algorithm can find
truly surprising patterns in a time series.
• Selectivity (Low False Positive Rate): The algorithm will not
find spurious "surprising" patterns in a time series.
We compare our work to two obvious rivals…
• TSA-tree: A wavelet-based approach by Shahabi et al.
• The authors suggest a method to find both trends and "surprises" in large
time series datasets. They achieve this using a wavelet-based tree
structure that can represent the data at different scales.
• They define "surprise" in time series as "…sudden changes in the original
time series data, which are captured by local maximums of the absolute values
of (wavelet detail coefficients)". (A rough sketch of this measure appears after this list.)
• IMM: An immunology-inspired approach by Dasgupta et al.
• This work is inspired by the negative selection mechanism of the immune
system, which discriminates between self and non-self.
• In this case self is the model of the time series learned from the reference
dataset, and non-self is any observed pattern in the new dataset that does not
conform to the model within some tolerance.
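For intuition, here is a crude numpy stand-in for the wavelet-detail notion of surprise described in the TSA-tree bullet above: compute Haar detail coefficients at a few scales and report where their absolute values peak. This only sketches the surprise measure, not the TSA-tree structure itself (which also stores trend coefficients and supports multi-scale queries); all names are illustrative.

```python
import numpy as np

def haar_details(x, levels=3):
    """Haar wavelet detail coefficients, one array per scale."""
    x = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if len(x) < 2:
            break
        x = x[: (len(x) // 2) * 2]                                  # trim to even length
        pairs = x.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))    # detail coefficients
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)                # smoothed signal
    return details

def wavelet_surprises(x, levels=3, top=1):
    """Indices (per scale) of the largest |detail| coefficients, i.e. the
    'sudden changes' a TSA-tree style measure would flag."""
    return [np.argsort(np.abs(d))[::-1][:top] for d in haar_details(x, levels)]

# A single spike dominates the detail coefficients at every scale
signal = np.zeros(128)
signal[60] = 5.0
print(wavelet_surprises(signal))
```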
Experiment 1
Training data R
Test data X
I created an anomaly by
halving the period of the
sine wave in the green
section here
We constructed a reference dataset
R by creating a sine wave with 800
datapoints and adding some
Gaussian noise. We then built a test
dataset X using the same
parameters as the reference set,
however we also inserted an
artificial anomaly by halving the
period of the sine wave in the
green section.
The test
data has an
anomaly, a
subsection
where the
period was
halved
The training
data is just a
noisy sine wave
The IMM
algorithm
failed to find
the anomaly
But Tarzan
detects the
anomaly!
So did
the TSA
Wavelet
tree!
Experiment 2
We consider a dataset that contains the power
demand for a Dutch research facility for the entire
year of 1997. The data is sampled over 15 minute
averages, and thus contains 35,040 points.
Demand for
Power?
Excellent!
[Figure] The first 3 weeks of the power demand dataset. Note the
repeating pattern of a strong peak for each of the five
weekdays, followed by relatively quiet weekends.
We used the period from Monday, January 6th to Sunday,
March 23rd as reference data. This time
period is devoid of national holidays. We
tested on the remainder of the year.
Mmm..
anomalous..
We will just show the 3 most surprising
subsequences found by each algorithm. For
each of the 3 approaches we show the entire
week (beginning Monday) in which the 3
largest values of surprise fell.
Both TSA-tree and IMM returned sequences
that appear to be normal workweeks, however
Tarzan returned 3 sequences that correspond
to the weeks that contain national holidays in
the Netherlands. In particular, from top to
bottom, the week spanning both December
25th and 26th and the weeks containing
Wednesday April 30th (Koninginnedag,
“Queen's Day”) and May 19th (Whit
Monday).
[Figure: the three most surprising weeks returned by each method — Tarzan, TSA Tree, and IMM]
Experiment 3
The previous experiments demonstrate the ability of Tarzan to
find surprising patterns, however we also need to consider Tarzan's
selectivity. If even a small fraction of patterns flagged by our
approach are false alarms, then, as we attempt to scale to massive
datasets, we can expect to be overwhelmed by innumerable
spurious "surprising" patterns.
Shut up Flanders!! In designing an experiment to show
selectivity we are faced with the problem of finding a
database guaranteed to be free of surprising patterns.
Because using a real data set for this task would always be
open to subjective post-hoc explanations of results, we will
conduct the experiment on random walk data
Och aye! By definition, random walk data can contain any
possible pattern. In fact, as the size of a random walk
dataset goes to infinity, we should expect to see every
pattern repeated an infinite number of times. We can exploit
this property to test our wee algorithm
If we train Tarzan on another short random walk dataset, we
should expect that the test data would be found surprising, since
the chance that similar patterns exist in the short training
database is very small. However, as we increase the size of the
training data, the surprisingness of the test data should decrease,
since it is more likely that similar data was encountered. To
restate, the intuition is this: the more experience our algorithm
has seeing random walk data, the less surprising our particular
section of random walk XRW should appear.
Future Work or
Tarzan’s Revenge!
Although we see Tarzan's ability to find
surprising patterns without user
intervention as a great advantage, we
intend to investigate the possibility of
incorporating user feedback and
domain-based constraints.
We have concentrated solely on the intricacies
of finding the surprising patterns, without
addressing the many meta-questions that arise.
For example, the possible asymmetric costs of
false alarms and false dismissals, and the
actionability of discovered knowledge. We intend
to address these issues in future work.
Motif Discovery
Winding Dataset (the angular speed of reel 2)
[Figure: the winding dataset, about 2,500 datapoints]
Informally, motifs are recurring patterns…
Motif Discovery
To find these 3 motifs would require about
6,250,000 calls to the Euclidean distance function.
[Figure: the winding dataset (the angular speed of reel 2), about 2,500 points, with three motif occurrences labeled A, B, and C; each occurrence is roughly 140 points long]
Why Find Motifs?
· Mining association rules in time series requires the discovery of motifs.
These are referred to as primitive shapes and frequent patterns.
· Several time series classification algorithms work by constructing typical
prototypes of each class. These prototypes may be considered motifs.
· Many time series anomaly/interestingness detection algorithms essentially
consist of modeling normal behavior with a set of typical shapes (which we see
as motifs), and detecting future patterns that are dissimilar to all typical shapes.
· In robotics, Oates et al., have introduced a method to allow an autonomous
agent to generalize from a set of qualitatively different experiences gleaned
from sensors. We see these “experiences” as motifs.
· In medical data mining, Caraca-Valente and Lopez-Chavarrias have
introduced a method for characterizing a physiotherapy patient's recovery
based on the discovery of similar patterns. Once again, we see these "similar
patterns" as motifs.
[Figure: Space Shuttle STS-57 telemetry (Inertial Sensor), about 1,000 points; a subsequence C of the time series T is shown together with its trivial matches]
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a
subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is
called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position
p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q
or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p
or p < q' < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most
significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count
of non-trivial matches (ties are broken by choosing the motif whose matches have the lower
variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence
CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
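To make Definitions 1-3 concrete, here is a quadratic-time brute-force Python sketch of counting non-trivial matches and picking the 1-Motif. It follows the definitions above but omits the variance tie-break, and it is far slower than the projection algorithm described next; all function names are illustrative.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def count_nontrivial_matches(T, n, R, p):
    """Count non-trivial matches (Definitions 1 and 2) to the subsequence C
    of length n starting at position p: positions q within range R of C,
    excluding q = p and any q connected to p by an unbroken run of matches."""
    T = np.asarray(T, dtype=float)
    C = T[p:p + n]
    match = [euclidean(C, T[q:q + n]) <= R for q in range(len(T) - n + 1)]
    lo, hi = p, p
    while lo > 0 and match[lo - 1]:
        lo -= 1
    while hi < len(match) - 1 and match[hi + 1]:
        hi += 1
    trivial = set(range(lo, hi + 1))          # the run of matches around p
    return sum(1 for q, ok in enumerate(match) if ok and q not in trivial)

def one_motif(T, n, R):
    """Brute-force 1-Motif(n, R): the start position with the most
    non-trivial matches (the variance tie-break is omitted here)."""
    return max(range(len(T) - n + 1),
               key=lambda p: count_nontrivial_matches(T, n, R, p))
```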
OK, we can define motifs, but
how do we find them?
The obvious brute force search algorithm is just
too slow…
Our algorithm is based on a “hot” idea from
bioinformatics, random projection*
* J Buhler and M Tompa. Finding motifs using random
projections. In RECOMB'01. 2001.
A simple worked example of our motif discovery algorithm (the next 4 slides)
[Figure: a time series T of length m = 1,000, with the length-16 subsequence C1 highlighted]

Ĉ1 = a c b a

The matrix Ŝ of discretized subsequences, one length-w word per row:

    1:  a c b a
    2:  b c a b
    :   : : : :
   58:  a c c a
    :   : : : :
  985:  b c c c

a = 3, alphabet {a,b,c}
n = 16
w = 4
Assume that we have a
time series T of length
1,000, and a motif of
length 16, which occurs
twice, at time T1 and
time T58.
A mask {1,2} was randomly chosen,
so the values in columns {1,2} were
used to project the matrix into buckets.
Collisions are recorded by
incrementing the appropriate
location in the collision matrix.
A mask {2,4} was randomly chosen,
so the values in columns {2,4} were
used to project the matrix into buckets.
Once again, collisions are
recorded by incrementing the
appropriate location in the
collision matrix.
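The same projection step in a minimal Python sketch: each iteration picks a random mask of columns, buckets the rows of the word matrix by the masked symbols, and increments the collision matrix for every pair of rows that share a bucket. The words would come from sliding a window over the discretized series; names and parameter choices are illustrative, not the authors' implementation.

```python
import random
from collections import defaultdict

import numpy as np

def random_projection_collisions(words, mask_size=2, iterations=10, seed=0):
    """words: list of equal-length symbolic words (one per subsequence).
    Returns a (lower-triangular) collision-count matrix."""
    rng = random.Random(seed)
    k, w = len(words), len(words[0])
    collisions = np.zeros((k, k), dtype=int)
    for _ in range(iterations):
        mask = sorted(rng.sample(range(w), mask_size))   # e.g. columns {1, 2}
        buckets = defaultdict(list)
        for row, word in enumerate(words):
            buckets[tuple(word[j] for j in mask)].append(row)
        for rows in buckets.values():                    # record every colliding pair
            for a in range(len(rows)):
                for b in range(a + 1, len(rows)):
                    collisions[rows[b], rows[a]] += 1
    return collisions

# Rows 0 and 2 share the same word, so their cell accumulates collisions
print(random_projection_collisions(["acba", "bcab", "acba", "bccc"]))
```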
We can calculate the expected values in the matrix,
assuming there are NO patterns…

E(k, a, w, d, t) = \binom{k}{2} \sum_{i=0}^{d} \left(1 - \frac{i}{w}\right)^{t} \binom{w}{i} \left(\frac{a-1}{a}\right)^{i} \left(\frac{1}{a}\right)^{w-i}
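For reference, a direct Python evaluation of the expected-collision expression as reconstructed above; treat it as a sketch of that expression rather than a verified reimplementation of the published formula.

```python
from math import comb

def expected_collisions(k, a, w, d, t):
    """Evaluate the reconstructed formula: k words of length w over an
    alphabet of size a, t random projections, up to d mismatches allowed."""
    total = 0.0
    for i in range(d + 1):
        p_i = comb(w, i) * ((a - 1) / a) ** i * (1 / a) ** (w - i)  # P(i mismatches)
        total += (1 - i / w) ** t * p_i
    return comb(k, 2) * total

print(expected_collisions(k=985, a=3, w=4, d=1, t=10))   # example invocation
```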
Suppose E(k, a, w, d, t) = 2

[Figure: the collision matrix after the iterations; most entries are small (0 to 3), but the entry for the pair of subsequences starting at 1 and 58 is 27, far above the expected value]
A Simple Experiment
Let's embed two motifs into a random walk time
series, and see if we can recover them.
[Figure: a 1,200-point random walk with four subsequences labeled A, B, C, and D; the "Planted Motifs" and the "Real" Motifs are shown below, each about 120 points long]
Some Examples of Real Motifs
[Figure: discovered motifs in Motor 1 (DC Current) data, about 2,000 points, and in Astrophysics (Photon Count) data]
How Fast can we find Motifs?
[Figure: running time in seconds (0 to 10k) versus length of time series (1,000 to 5,000) for Brute Force, TS-P i = 100, and TS-P Exp]
Future Work on Motif Detection
• A more detailed theoretical analysis.
• Significance testing.
• Discovery of motifs in multidimensional
time series
• Discovery of motifs under different
distance measures such as Dynamic Time
Warping
Rule Finding in Time Series
[Figure: an example rule — a pattern about 20 points long (left) implies a pattern about 30 points long (right)]
Support = 9.1
Confidence = 68.1
Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series.
In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY,
Aug 27-31. pp 16-22.
Papers Based on Rule Discovery
from Time Series
• Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of
Association Rules from Human Motion.
• Cotofrei, P. & Stoffel, K (2002). Classification Rules + Time = Temporal Rules.
• Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock
Time Series Using Self-Organizing Maps.
• Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential
Association Rules with Constraints and Time Lags in Multiple Sequences.
• Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic
Programming and Specialized Hardware.
• Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of
Temporal Rules.
• Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules
in House-keeping Data.
• Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in
Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams.
and many more…
All these people are
fooling themselves!
They are not finding rules in time
series, and it is easy to prove this!
A Simple Experiment...
w    d    Rule        Sup %   Conf %   J-Mea.   Fig
20   5.5  7 15 8      8.3     73.0     0.0036   (a)
30   5.5  18 20 21    1.3     62.7     0.0039   (b)
"If a stock rises then falls greatly, followed by a smaller rise,
then we can expect to see, within 20 time units, a
pattern of rapid decrease followed by a leveling out."
[Figure: the rule patterns for the table below — panel (a), about 20 points, and panel (b), about 30 points]

w    d    Rule        Sup %   Conf %   J-Mea.   Fig
20   5.5  11 15 3     6.9     71.2     0.0042   (a)
30   5.5  24 20 19    2.1     74.7     0.0035   (b)

The punch line is…
If the top data miners in the world
can fool themselves into finding
patterns that don’t exist, what does
that say about the field?
Conclusions Part I
• We have reviewed the state of the art in
classic time series data mining (classification,
clustering, and indexing).
• We have seen that the choice of
representation and similarity measure is
critical to the success of any data mining
project.
Conclusions Part II
• We have seen some cutting-edge time series
data mining tasks (novelty detection, motif
discovery, rule finding).
• In every case, we have seen initial
promising results, but further work remains to
be done…
Questions?
All datasets and code used in this talk can be found at
www.cs.ucr.edu/~eamonn/TSDMA/index.html