lowerbounding_search3 - Computer Science and Engineering


[Figure: two sets of bit strings, shown side by side.]
Here are two sets of bit strings. Which set is generated by a human and which one is generated by a computer?
VizTree
[Figure: the two sets of bit strings, each concatenated into one long bit sequence.]
Let's put the sequences into a depth-limited suffix tree, such that the frequencies of all triplets are encoded in the thickness of the branches…
“humans usually try to fake randomness by alternating patterns”
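The depth-limited suffix tree is essentially a frequency table over all length-3 substrings. As a minimal sketch (not the VizTree implementation itself), the triplet counts that would set the branch thicknesses can be computed like this, for a hypothetical bit sequence:

```python
from collections import Counter

def triplet_frequencies(bits: str, depth: int = 3) -> Counter:
    """Count every overlapping substring of the given depth (here: triplets)."""
    return Counter(bits[i:i + depth] for i in range(len(bits) - depth + 1))

# Hypothetical sequence; a human "faking" randomness tends to over-use
# alternations such as 010 and 101, which shows up as thick branches.
seq = "0101101001011010"
for triplet, count in triplet_frequencies(seq).most_common():
    print(triplet, count)
```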
Motivating example
You go to the doctor
because of chest pains.
Your ECG looks
strange…
Your doctor wants to
search a database to find
similar ECGs, in the
hope that they will offer
clues about your
condition...
Two questions:
•How do we define similar?
•How do we search quickly?
Indexing Time Series
We have seen techniques for assessing the similarity of
two time series.
However, we have not addressed the problem of finding the best match to a query in a large database (other than the lower bounding trick).
The obvious solution, to retrieve and examine every item (sequential scanning), simply does not scale to large datasets.
[Figure: a query Q compared against every item (1-10) of Database C.]
We need some way to index the data...
We can project time series of length n into n-dimensional space.
The first value in C is the
X-axis, the second value in
C is the Y-axis etc.
One advantage of doing
this is that we have
abstracted away the
details of “time series”,
now all query processing
can be imagined as finding
points in space...
…these nested MBRs are organized
as a tree (called a spatial access tree
or a multidimensional tree). Examples
include R-tree, Hybrid-Tree etc.
[Figure: an R-tree. Internal nodes R10, R11, R12 contain the nested MBRs R1-R9; the leaf (data) nodes contain the points.]
We can use MINDIST(point, MBR) to do fast search…
[Figure: the same R-tree, with MINDIST computed from the query point to each MBR.]
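MINDIST(point, MBR) is the smallest possible Euclidean distance from a query point to any point inside a minimum bounding rectangle; it is zero when the point lies inside the MBR. A minimal sketch of the standard formula (not tied to any particular R-tree library):

```python
import math

def mindist(point, mbr_low, mbr_high):
    """Smallest Euclidean distance from `point` to the box [mbr_low, mbr_high].

    Per dimension: 0 if the coordinate lies inside the box,
    otherwise the distance to the nearer face.
    """
    total = 0.0
    for p, lo, hi in zip(point, mbr_low, mbr_high):
        if p < lo:
            total += (lo - p) ** 2
        elif p > hi:
            total += (p - hi) ** 2
    return math.sqrt(total)

# A node whose MINDIST already exceeds the best-so-far distance can be pruned.
print(mindist([0.2, 0.9], [0.4, 0.1], [0.6, 0.5]))  # 0.447...
```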
If we project a query into n-dimensional space, how many additional (non-empty) MBRs must we examine before we are guaranteed to find the best match?
For the one-dimensional case, the answer is clearly 2...
For the two-dimensional case, the answer is 8...
For the three-dimensional case, the answer is 26...
More generally, in n-dimensional space we must examine 3^n - 1 MBRs. For n = 21, that is 3^21 - 1 = 10,460,353,202 MBRs.
This is known as the curse of dimensionality.
[Figure: a column of raw data (0.4995, 0.5264, 0.5523, ...) shown next to a mysterious "Magic LB data" column that contains just two numbers: 1.5698 and 1.0485.]
Motivating example revisited…
You go to the doctor
because of chest pains.
Your ECG looks
strange…
Your doctor wants to
search a database to find
similar ECGs, in the
hope that they will offer
clues about your
condition...
Two questions:
•How do we define similar?
•How do we search quickly?
The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data
But which approximation
should we use?
Time Series Representations
[Figure: a taxonomy of time series representations.]
• Model Based: Hidden Markov Models, Statistical Models
• Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Natural Language; Strings: Symbolic Aggregate Approximation, Non Lower Bounding (Value Based, Slope Based)); Trees
• Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn n > 1; Bi-Orthonormal: Coiflets, Symlets); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform, Chebyshev Polynomials); Piecewise Aggregate Approximation; Grid
• Data Dictated: Clipped Data
The Generic Data Mining Algorithm (revisited)
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data
This only works if the
approximation allows lower
bounding
What is Lower Bounding?
• Recall that we have seen lower bounding for distance measures (DTW and uniform scaling). Lower bounding for representations is a similar idea…
Raw data Q, S:
D(Q,S) \equiv \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}
Approximation (or "representation") Q', S':
D_{LB}(Q',S') \equiv \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1})(qv_i - sv_i)^2}
Lower bounding means that for all Q and S, we have:
D_{LB}(Q',S') \le D(Q,S)
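To see why the inequality matters for search, here is a minimal sketch of the classic lower-bounding (GEMINI-style) scan: the cheap approximate distance is used to skip most objects, and the exact distance is computed only when the lower bound cannot rule a candidate out. The function names are placeholders, not a specific library API.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, query_approx, database, lower_bound_dist):
    """Exact nearest neighbor under Euclidean distance, using a lower bound to prune.

    `database` yields (approximation, raw_series) pairs;
    `lower_bound_dist(q_approx, s_approx)` must never exceed the true distance.
    """
    best, best_dist = None, float("inf")
    for approx, raw in database:
        if lower_bound_dist(query_approx, approx) >= best_dist:
            continue                      # safe to skip: the true distance is at least this big
        d = euclidean(query, raw)         # expensive exact computation
        if d < best_dist:
            best, best_dist = raw, d
    return best, best_dist
```

Because D_LB never exceeds D, pruning this way can never cause a false dismissal.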
In a seminal* paper in SIGMOD 93, I showed that lower bounding of a representation is a necessary and sufficient condition to allow time series indexing, with the guarantee of no false dismissals.
Christos' work was originally on indexing time series with the Fourier representation. Since then, there have been hundreds of follow-up papers on other data types, tasks and representations.
An Example of a Dimensionality Reduction Technique I
[Figure: a time series C with n = 128 points, plotted from 0 to 140, with its raw data (0.4995, 0.5264, 0.5523, ...) listed as a column of numbers.]
The graphic shows a time series with 128 points.
The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).
An Example of a Dimensionality Reduction Technique II
[Figure: the time series C and the first few sine waves of its Fourier decomposition; alongside, the raw data column (0.4995, 0.5264, ...) and the Fourier coefficients column (1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928, ...).]
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown).
The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).
Note that at this stage we have not done dimensionality reduction, we have merely changed the representation...
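As a minimal sketch of this change of representation (the coefficient values above come from the original slides' data, so the toy series and numbers below are only illustrative), the Fourier coefficients of a 128-point series can be obtained with an off-the-shelf FFT:

```python
import numpy as np

t = np.linspace(0, 1, 128, endpoint=False)
c = np.sin(2 * np.pi * 3 * t) + 0.4 * np.sin(2 * np.pi * 9 * t)  # toy 128-point series

coeffs = np.fft.rfft(c)      # 65 complex coefficients for a real-valued 128-point series
print(np.abs(coeffs)[:8])    # magnitudes of the first few coefficients
```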
An Example of a Dimensionality Reduction Technique III
[Figure: the time series C and its approximation C' reconstructed from the truncated coefficients; the raw data and full Fourier coefficient columns; and the truncated Fourier coefficients: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928.]
We have discarded 15/16 of the data.
n = 128, N = 8, Cratio = 1/16
… however, note that the first few sine waves tend to be the largest (equivalently, the magnitudes of the Fourier coefficients tend to decrease as you move down the column).
We can therefore truncate most of the small coefficients with little effect.
[Figure: the raw data column shown next to its truncated Fourier coefficients (1.5698, 1.0485), the "Magic LB data" from the earlier slide.]
An Example of a Dimensionality Reduction Technique IIII
D(Q,C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}
[Figure: two raw time series (Raw Data 1, Raw Data 2) and their truncated Fourier coefficient vectors.
Truncated Fourier Coefficients 1: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928.
Truncated Fourier Coefficients 2: 1.1198, 1.4322, 1.0100, 0.4326, 0.5609, 0.8770, 0.1557, 0.4528.]
The Euclidean distance between
the two truncated Fourier
coefficient vectors is always less
than or equal to the Euclidean
distance between the two raw
data vectors*.
So DFT allows lower bounding!
*Parseval's Theorem
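A minimal sketch of that claim, using an orthonormal FFT so that Parseval's theorem makes the full-coefficient distance equal to the raw Euclidean distance, and truncation can only shrink it (the random-walk data here is made up for illustration):

```python
import numpy as np

def dft_truncate(x, n_coeffs):
    """Keep only the first n_coeffs complex DFT coefficients (orthonormal scaling)."""
    return np.fft.fft(x, norm="ortho")[:n_coeffs]

def lb_dist(Xp, Yp):
    """Euclidean distance between two truncated coefficient vectors."""
    return np.sqrt(np.sum(np.abs(Xp - Yp) ** 2))

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(128))    # random-walk time series
y = np.cumsum(rng.standard_normal(128))

true_dist = np.linalg.norm(x - y)
lower     = lb_dist(dft_truncate(x, 8), dft_truncate(y, 8))
assert lower <= true_dist + 1e-9            # Parseval guarantees the lower bound
print(f"D(Q,C) = {true_dist:.3f}   D_LB = {lower:.3f}")
```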
Mini Review for the Generic Data Mining Algorithm
We cannot fit all that raw data in main memory.
We can fit the dimensionally reduced data in main memory.
[Figure: columns of raw data (Raw Data 1, Raw Data 2, ..., Raw Data n), far too much to fit in main memory.]
So we will solve the problem at hand on the dimensionally reduced data, making a few accesses to the raw data where necessary, and, if we are careful, the lower bounding property will ensure that we get the right answer!
[Figure: the raw data lives on Disk; Main Memory holds only the truncated Fourier coefficients (Truncated Fourier Coefficients 1, 2, ..., n; eight numbers each, e.g. 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928).]
[Figure: the same time series approximated by DFT, DWT, SVD, APCA, PAA, PLA and SAX; the SAX word for this series is aabbbccb.]
Discrete Fourier Transform I
[Figure: time series X and its DFT approximation X', with the first ten sine/cosine basis waves (0-9). Portrait: Jean Fourier, 1768-1830.]
Basic Idea: Represent the time series as a linear combination of sines and cosines, but keep only the first n/2 coefficients.
Why n/2 coefficients? Because each sine wave requires 2 numbers, for the phase (w) and amplitude (A, B).
C(t) = \sum_{k=1}^{n} \left( A_k \cos(2\pi w_k t) + B_k \sin(2\pi w_k t) \right)
Excellent free Fourier Primer
Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS95-37, Department of Computer Science, Brown University, 1995.
http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/
Discrete Fourier Transform II
[Figure: time series X and its DFT approximation X', with the basis waves 0-9.]
Pros and Cons of DFT as a time series representation.
• Good ability to compress most natural signals.
• Fast, off-the-shelf DFT algorithms exist: O(n log(n)).
• (Weakly) able to support time-warped queries.
• Difficult to deal with sequences of different lengths.
• Cannot support weighted distance measures.
Note: The related transform DCT uses only cosine basis functions. It does not seem to offer any particular advantages over DFT.
Discrete Wavelet Transform I
[Figure: time series X and its DWT approximation X', with the basis functions Haar 0 - Haar 7. Portrait: Alfred Haar, 1885-1933.]
Basic Idea: Represent the time series as a linear combination of wavelet basis functions, but keep only the first N coefficients.
Although there are many different types of wavelets, researchers in time series mining/indexing generally use Haar wavelets.
Haar wavelets seem to be as powerful as the other wavelets for most problems and are very easy to code.
Excellent free Wavelets Primer
Stollnitz, E., DeRose, T., & Salesin, D. (1995). Wavelets for computer graphics: A primer. IEEE Computer Graphics and Applications.
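A minimal sketch of the Haar transform (the ordered, unnormalized averages-and-differences version; the length must be a power of two, as noted later in these slides). Keeping only the first N coefficients gives the DWT approximation:

```python
def haar_transform(x):
    """Ordered Haar transform: repeatedly replace the series by pairwise
    averages (coarse part) followed by pairwise differences (detail part)."""
    out = list(x)
    length = len(out)
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details  = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out

coeffs = haar_transform([8, 6, 2, 4, 7, 9, 3, 5])
print(coeffs[:4])   # keep only the first N = 4 coefficients as the representation
```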
Discrete Wavelet Transform II
[Figure: time series X and its DWT approximation X', with Haar basis functions. Portrait: Ingrid Daubechies, 1954-.]
We have only considered one type of wavelet; there are many others.
Are the other wavelets better for indexing?
YES: I. Popivanov, R. Miller. Similarity Search Over Time Series Data Using Wavelets. ICDE 2002.
NO: K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. ICDE 1999.
Later in this tutorial I will answer this question.
Discrete Wavelet Transform III
[Figure: time series X and its DWT approximation X', with the basis functions Haar 0 - Haar 7.]
Pros and Cons of Wavelets as a time series representation.
• Good ability to compress stationary signals.
• Fast linear time algorithms for DWT exist.
• Able to support some interesting non-Euclidean similarity measures.
• Signals must have a length n = 2^some_integer.
• Works best if N = 2^some_integer. Otherwise wavelets approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.
Singular Value Decomposition I
[Figure: time series X and its SVD approximation X', with the basis shapes eigenwave 0 - eigenwave 7. Portraits: James Joseph Sylvester, 1814-1897; Camille Jordan, 1838-1921; Eugenio Beltrami, 1835-1899.]
Basic Idea: Represent the time series as a linear combination of eigenwaves, but keep only the first N coefficients.
SVD is similar to the Fourier and wavelet approaches in that we represent the data in terms of a linear combination of shapes (in this case eigenwaves).
SVD differs in that the eigenwaves are data dependent.
SVD has been successfully used in the text processing community (where it is known as Latent Semantic Indexing) for many years.
Good free SVD Primer
Singular Value Decomposition - A Primer. Sonia Leach.
Singular Value Decomposition II
How do we create the eigenwaves?
[Figure: time series X and its SVD approximation X', with eigenwaves 0-7.]
We have previously seen that we can regard time series as points in high dimensional space.
We can rotate the axes such that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1, etc.
Since the first few eigenwaves contain most of the variance of the signal, the rest can be truncated with little loss.
A = U \Sigma V^{T}
This process can be achieved by factoring an M by n matrix of time series into 3 other matrices, and truncating the new matrices at size N.
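A minimal sketch of that factor-and-truncate step with numpy (the data matrix here is synthetic; in practice the rows would be your time series):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.cumsum(rng.standard_normal((50, 128)), axis=1)   # M = 50 series of length n = 128

# Factor A = U * diag(s) * Vt, then keep only the first N components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
N = 8
eigenwaves = Vt[:N]                  # the data-dependent basis shapes
reduced    = A @ eigenwaves.T        # each series is now described by N numbers
reconstructed = reduced @ eigenwaves
print(np.linalg.norm(A - reconstructed) / np.linalg.norm(A))  # relative reconstruction error
```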
Singular Value Decomposition III
[Figure: time series X and its SVD approximation X', with eigenwaves 0-7.]
Pros and Cons of SVD as a time series representation.
• Optimal linear dimensionality reduction technique.
• The eigenvalues tell us something about the underlying structure of the data.
• Computationally very expensive: Time O(Mn^2), Space O(Mn).
• An insertion into the database requires recomputing the SVD.
• Cannot support weighted distance measures or non-Euclidean measures.
Note: There has been some promising research into mitigating SVD's time and space complexity.
Chebyshev Polynomials
[Figure: time series X and its Chebyshev approximation X'. Portrait: Pafnuty Chebyshev, 1821-1894.]
Basic Idea: Represent the time series as a linear combination of Chebyshev polynomials:
T_i(x) = 1, x, 2x^2-1, 4x^3-3x, 8x^4-8x^2+1, 16x^5-20x^3+5x, 32x^6-48x^4+18x^2-1, 64x^7-112x^5+56x^3-7x, 128x^8-256x^6+160x^4-32x^2+1, …
Pros and Cons of Chebyshev Polynomials as a time series representation.
• Time series can be of arbitrary length.
• Only O(n) time complexity.
• Is able to support multi-dimensional time series*.
• The time axis must be renormalized to lie between -1 and 1.
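A minimal sketch of the representation using numpy's Chebyshev helpers (hypothetical toy data; rescaling the time axis to [-1, 1] is exactly the renormalization the last bullet refers to):

```python
import numpy as np

y = np.cumsum(np.random.default_rng(2).standard_normal(140))  # toy time series
t = np.linspace(-1, 1, len(y))        # rescale the time axis to [-1, 1]

N = 8
coeffs = np.polynomial.chebyshev.chebfit(t, y, deg=N - 1)  # N Chebyshev coefficients
approx = np.polynomial.chebyshev.chebval(t, coeffs)        # the reconstruction X'
print(coeffs)
```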
Chebyshev Polynomials
[Figure: time series X and its Chebyshev approximation X'. Portrait: Pafnuty Chebyshev, 1821-1894.]
In 2006, Dr. Ng published a "note of caution" on his webpage, noting that the results in the paper are not reliable:
"…Thus, it is clear that the C++ version contained a bug. We apologize for any inconvenience caused…"
Both Dr. Keogh and Dr. Michail Vlachos independently found Chebyshev Polynomials are slightly worse than other methods on more than 80 different datasets.
Piecewise Linear Approximation I
[Figure: time series X and its PLA approximation X'. Portrait: Karl Friedrich Gauss, 1777-1855.]
Basic Idea: Represent the time series as a sequence of straight lines.
Lines could be connected, in which case we are allowed N/2 lines. Each line segment has
• length
• left_height
(right_height can be inferred by looking at the next segment)
If lines are disconnected, we are allowed only N/3 lines. Each line segment has
• length
• left_height
• right_height
Personal experience on dozens of datasets suggests disconnected is better. Also, only disconnected allows a lower bounding Euclidean approximation.
How do we obtain the Piecewise Linear Approximation?
Piecewise Linear Approximation II
[Figure: time series X and its PLA approximation X'.]
The optimal solution is O(n^2 N), which is too slow for data mining.
A vast body of work on faster heuristic solutions to the problem can be classified into the following classes:
• Top-Down
• Bottom-Up
• Sliding Window
• Other (genetic algorithms, randomized algorithms, B-spline wavelets, MDL etc.)
Extensive empirical evaluation* of all approaches suggests that Bottom-Up is the best approach overall.
Piecewise Linear Approximation III
[Figure: time series X and its PLA approximation X'.]
Pros and Cons of PLA as a time series representation.
• Good ability to compress natural signals.
• Fast linear time algorithms for PLA exist.
• Able to support some interesting non-Euclidean similarity measures, including weighted measures, relevance feedback, fuzzy queries…
• Already widely accepted in some communities (i.e., biomedical).
• Not (currently) indexable by any data structure (but does allow fast sequential scanning).
Piecewise Aggregate Approximation I
[Figure: time series X and its PAA approximation X', drawn as N = 8 equal-length boxes with values \bar{x}_1, …, \bar{x}_8.]
Basic Idea: Represent the time series as a sequence of box basis functions. Note that each box is the same length.
\bar{x}_i = \frac{N}{n} \sum_{j = \frac{n}{N}(i-1)+1}^{\frac{n}{N} i} x_j
Given the reduced dimensionality representation we can calculate the approximate Euclidean distance as...
DR(X,Y) \equiv \sqrt{\frac{n}{N}} \sqrt{\sum_{i=1}^{N} (\bar{x}_i - \bar{y}_i)^2}
This measure is provably lower bounding.
Independently introduced by two authors:
• Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000) / Keogh & Pazzani PAKDD April 2000
• Byoung-Kee Yi, Christos Faloutsos, VLDB September 2000
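A minimal sketch of PAA and its lower-bounding distance DR, assuming for simplicity that the series length n is an exact multiple of the number of segments N:

```python
import numpy as np

def paa(x, n_segments):
    """Piecewise Aggregate Approximation: the mean of each of N equal-length segments."""
    x = np.asarray(x, dtype=float)
    return x.reshape(n_segments, -1).mean(axis=1)

def dr_distance(x_bar, y_bar, n):
    """Lower-bounding distance DR(X, Y) = sqrt(n/N) * Euclidean distance of the PAA vectors."""
    N = len(x_bar)
    return np.sqrt(n / N) * np.linalg.norm(x_bar - y_bar)

rng = np.random.default_rng(3)
x, y = rng.standard_normal(128), rng.standard_normal(128)
lb   = dr_distance(paa(x, 8), paa(y, 8), n=128)
true = np.linalg.norm(x - y)
print(lb <= true, lb, true)   # the lower bound always holds
```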
Piecewise Aggregate Approximation II
[Figure: time series X and its PAA approximation X' (segments X1-X8).]
Pros and Cons of PAA as a time series representation.
• Extremely fast to calculate.
• As efficient as other approaches (empirically).
• Supports queries of arbitrary lengths.
• Can support any Minkowski metric.
• Supports non-Euclidean measures.
• Supports weighted Euclidean distance.
• Can be used to allow indexing of DTW and uniform scaling*.
• Simple! Intuitive!
• If visualized directly, looks aesthetically unpleasing.
Adaptive Piecewise Constant Approximation I
Basic Idea: Generalize PAA to allow the piecewise constant segments to have arbitrary lengths. Note that we now need 2 coefficients to represent each segment: its value and its length.
[Figure: a raw electrocardiogram and its approximations. Adaptive Representation (APCA): reconstruction error 2.61. Haar Wavelet or PAA: reconstruction error 3.27. DFT: reconstruction error 3.11. The APCA segments are labeled <cv1,cr1>, <cv2,cr2>, <cv3,cr3>, <cv4,cr4>.]
The intuition is this: many signals have little detail in some places, and high detail in other places. APCA can adaptively fit itself to the data, achieving better approximation.
Adaptive Piecewise Constant Approximation II
[Figure: time series X and its APCA approximation, with segments <cv1,cr1> … <cv4,cr4>.]
The high quality of the APCA had been noted by many researchers.
However, it was believed that the representation could not be indexed because some coefficients represent values, and some represent lengths.
However, an indexing method was discovered! (SIGMOD 2001 best paper award)
Unfortunately, it is non-trivial to understand and implement and thus has only been reimplemented once or twice (in contrast, more than 50 people have reimplemented PAA).
Adaptive Piecewise Constant Approximation III
[Figure: time series X and its APCA approximation, with segments <cv1,cr1> … <cv4,cr4>.]
Pros and Cons of APCA as a time series representation.
• Fast to calculate: O(n).
• More efficient than other approaches (on some datasets).
• Supports queries of arbitrary lengths.
• Supports non-Euclidean measures.
• Supports weighted Euclidean distance.
• Supports fast exact queries, and even faster approximate queries on the same data structure.
• Somewhat complex implementation.
• If visualized directly, looks aesthetically unpleasing.
Clipped Data
[Figure: time series X and its clipped representation X':
00000000000000000000000000000000000000000000111111111111111111111110000110000001111111111111111111111111111111111111111111111111
…110000110000001111…
44 Zeros | 23 Ones | 4 Zeros | 2 Ones | 6 Zeros | 49 Ones. Photo: Tony Bagnall.]
Basic Idea: Find the mean of a time series, convert values above the line to "1" and values below the line to "0". Runs of "1"s and "0"s can be further compressed with run length encoding if desired.
This representation does allow lower bounding!
Pros and Cons of Clipped as a time series representation.
• Ultra compact representation, which may be particularly useful for specialized hardware.
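A minimal sketch of clipping and run-length encoding (toy data; not Bagnall et al.'s exact implementation):

```python
import numpy as np
from itertools import groupby

def clip(x):
    """Convert a time series to bits: 1 above the mean, 0 at or below it."""
    x = np.asarray(x, dtype=float)
    return (x > x.mean()).astype(int)

def run_length_encode(bits):
    """Compress runs of identical bits into (bit, run_length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]

series = np.sin(np.linspace(0, 4 * np.pi, 32))
print(run_length_encode(clip(series)))
```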
Natural Language
[Figure: time series X described as "rise, plateau, followed by a rounded peak".]
Pros and Cons of natural language as a time series representation.
• The most intuitive representation!
• Potentially a good representation for low bandwidth devices like text-messengers.
• Difficult to evaluate.
To the best of my knowledge only one group is working seriously on this representation. They are the University of Aberdeen SUMTIME group, headed by Prof. Jim Hunter.
Symbolic Approximation I
[Figure: time series X and its symbolic approximation X' = a a b b b c c b, using the alphabet {a, b, c} over regions 0-7.]
Basic Idea: Convert the time series into an alphabet of discrete symbols. Use string indexing techniques to manage the data.
Potentially an interesting idea, but all work thus far is very ad hoc.
Pros and Cons of Symbolic Approximation as a time series representation.
• Potentially, we could take advantage of a wealth of techniques from the very mature fields of string processing and bioinformatics.
• It is not clear how we should discretize the time series (discretize the values, the slope, shapes? How big an alphabet? etc.).
• There are more than 210 different variants of this, at least 35 in data mining conferences.
SAX: Symbolic Aggregate approXimation
SAX allows (for the first time) a symbolic representation that allows
• Lower bounding of Euclidean distance
• Dimensionality Reduction
• Numerosity Reduction
[Figure: a time series X converted to its SAX word. Discretized pointwise the series reads aaaaaabbbbbccccccbbccccdddddddd; the final SAX word (after PAA) is baabccbc. Photo: Jessica Lin, 1976-.]
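A minimal sketch of SAX: z-normalize, apply PAA, then map each segment mean to a symbol using breakpoints that divide the standard normal distribution into equal-probability regions (the breakpoints below are for an alphabet of size 4; other sizes would use other Gaussian quantiles):

```python
import numpy as np

BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]   # N(0,1) quartile boundaries for alphabet {a,b,c,d}

def sax(x, n_segments, breakpoints=BREAKPOINTS_4):
    """Convert a time series to a SAX word (assumes len(x) is a multiple of n_segments)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                       # z-normalize
    means = x.reshape(n_segments, -1).mean(axis=1)     # PAA step
    symbols = "abcdefghijklmnopqrstuvwxyz"
    return "".join(symbols[np.searchsorted(breakpoints, m)] for m in means)

print(sax(np.sin(np.linspace(0, 2 * np.pi, 64)), 8))   # e.g. 'cddcbaab'
```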
Comparison of all Dimensionality Reduction Techniques
• We have already compared features (does representation X allow weighted queries, queries of arbitrary lengths, is it simple to implement…).
• We can compare the indexing efficiency: how long does it take to find the best answer to our query?
• It turns out that the fairest way to measure this is to measure the number of times we have to retrieve an item from disk.
Data Bias
Definition: Data bias is the conscious or unconscious use of a particular set of testing data to confirm a desired finding.
Example: Suppose you are comparing Wavelets to Fourier methods; the following datasets will produce drastically different results…
[Figure: two time series of length about 600, one labeled "Good for wavelets, bad for Fourier", the other "Good for Fourier, bad for wavelets".]
Example of Data Bias: Whom to Believe?
For the task of indexing time series for similarity search, which representation is best, the Discrete Fourier Transform (DFT), or the Discrete Wavelet Transform (Haar)?
• "Several wavelets outperform the DFT".
• "DFT-based and DWT-based techniques yield comparable results".
• "Haar wavelets perform slightly better than DFT".
• "DFT filtering performance is superior to DWT".
Example of Data Bias: Whom to Believe II?
To find out whom to believe (if anyone) we performed an extraordinarily careful and comprehensive set of experiments. For example…
• We used a quantum mechanical device to generate random numbers.
• We averaged results over 100,000 experiments!
• For fairness, we used the same (randomly chosen) subsequences for both approaches.
"I tested on the Powerplant, Infrasound and Attas datasets, and I know DFT outperforms the Haar wavelet."
[Figure: bar chart of pruning power, DFT vs. Haar, on the Powerplant, Infrasound and Attas (Aerospace) datasets.]
"Stupid Flanders! I tested on the Network, ERPdata and Fetal EEG datasets and I know that there is no real difference between DFT and Haar."
[Figure: bar chart of pruning power, DFT vs. Haar, on the Network, ERPdata and Fetal EEG datasets.]
"Those two clowns are both wrong! I tested on the Chaotic, Earthquake and Wind datasets, and I am sure that the Haar wavelet outperforms the DFT."
[Figure: bar chart of pruning power, DFT vs. Haar, on the Chaotic, Earthquake and Wind (3) datasets.]
The Bottom Line
Any claim about the relative performance of time series indexing schemes that is empirically demonstrated on only 2 or 3 datasets is worthless.
[Figure: the three pruning-power bar charts shown side by side (Powerplant/Infrasound/Attas, Network/ERPdata/Fetal EEG, Chaotic/Earthquake/Wind), each telling a different story about DFT vs. Haar.]
So which is really the best technique?
I experimented with all the techniques (DFT, DCT, Chebyshev, PAA, PLA, PQA, APCA, DWT (most wavelet types), SVD) on 65 datasets, and as a sanity check, Michail Vlachos independently implemented and tested on the same 65 datasets.
On average, they are all about the same. In particular, on 80% of the datasets they are all within 10% of each other.
If you want to pick a representation, don't do so based on the reconstruction error; do so based on the features the representation has.
On bursty datasets* APCA can be significantly better.
Let's take a tour of other time series problems
• Before we do, let us briefly revisit SAX, since it has some implications for the other problems…
Exploiting Symbolic Representations of Time Series
• One central theme of this tutorial is that lower bounding is a very useful property (recall the lower bounds of DTW/uniform scaling, and the importance of lower bounding dimensionality reduction techniques).
• Another central theme is that dimensionality reduction is very important. That's why we spent so long discussing DFT, DWT, SVD, PAA etc.
• Until last year there was no lower bounding, dimensionality reducing representation of time series. In the next slide, let us think about what it means to have such a representation…
Exploiting Symbolic Representations of Time Series
• If we had a lower bounding, dimensionality reducing representation of time series, we could…
• Use data structures that are only defined for discrete data, such as suffix trees.
• Use algorithms that are only defined for discrete data, such as hashing, association rules etc.
• Use definitions that are only defined for discrete data, such as Markov models, probability theory.
• More generally, we could utilize the vast body of research in text processing and bioinformatics.
Exploiting Symbolic Representations of Time Series
There is now a lower bounding, dimensionality reducing time series representation! It is called SAX (Symbolic Aggregate ApproXimation).
SAX has had a major impact on time series data mining in the last few years…
[Figure: a time series (values roughly -3 to 3) approximated by DFT, PLA, Haar, APCA, and by SAX with the word ffffffeeeddcbaabceedcbaaaaacddee.]
Time Series Motif Discovery (finding repeated patterns)
[Figure: the Winding Dataset (the angular speed of reel 2), about 2,500 points long.]
Are there any repeated patterns, of about this length, in the above time series?
Time Series Motif Discovery (finding repeated patterns)
[Figure: the same Winding Dataset with three occurrences of a motif marked A, B and C; the three subsequences (each about 140 points long) are overlaid and are nearly identical.]
Why Find Motifs?
• Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.
• Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.
• Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.
• In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these "experiences" as motifs.
• In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these "similar patterns" as motifs.
[Figure: a Space Shuttle STS-57 telemetry time series (Inertial Sensor), about 1,000 points long, with a subsequence C and its trivial matches in T highlighted.]
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
OK, we can define motifs, but how do we find them?
The obvious brute force search algorithm is just too slow…
The most referenced algorithm is based on a hot idea from bioinformatics, random projection*, and the fact that SAX allows us to lower bound discrete representations of time series.
* J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB'01. 2001.
A simple worked example of the motif discovery algorithm (the next 4 slides)
Assume that we have a time series T of length 1,000, and a motif of length 16, which occurs twice, at time T1 and time T58.
[Figure: the time series T (m = 1,000) and the SAX matrix S. Every subsequence of length n = 16 is converted to a SAX word of length w = 4 over the alphabet {a, b, c} (a = 3); for example row 1 is a c b a, row 2 is b c a b, row 58 is a c c a, and row 985 is b c c c.]
A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project the matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix.
A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project the matrix into buckets. Once again, collisions are recorded by incrementing the appropriate location in the collision matrix.
We can calculate the expected values in the matrix, assuming there are NO patterns…
E(k, a, w, d, t) = \binom{t}{2} \left[ \sum_{i=0}^{d} \binom{w}{i} \left(\frac{1}{a}\right)^{w-i} \left(1 - \frac{1}{a}\right)^{i} \right]^{k}
This is a good example of the Generic Data Mining Algorithm…
The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
2. Approximately solve the problem at hand in main memory
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data
But which approximation should we use?
[Figure: the collision matrix after several iterations. The entry for the pair (1, 58) has accumulated a large count (27), while the other entries remain small (0-3).]
A Simple Experiment
Let us embed two motifs into a random walk time series, and see if we can recover them.
[Figure: a random walk time series of length about 1,200 with two planted motif pairs. The planted motifs (A, B and C, D, each about 128 points long) and the "real" motifs recovered by the algorithm are shown side by side.]
Some Examples of Real Motifs
[Figure: motifs discovered in Motor 1 (DC current) data and in astrophysics (photon count) data.]
Motif Discovery Challenges
How can we find motifs…
• Without having to specify the length/other parameters
• In massive datasets
• While ignoring "background" motifs (ECG example)
• Under time warping, or uniform scaling
• While assessing their significance
[Figure: the Winding Dataset (the angular speed of reel 2) with motifs A, B and C. Finding these 3 motifs requires about 6,250,000 calls to the Euclidean distance function.]
The DNA of two species…
[Figure: two DNA sequences (fragments beginning TGGCCGTGCTAGGCC… and CCGTGCTAGGGCCACC…). Each sequence is summarized by recursively counting substring frequencies: the four bases C, T, A, G define the quadrants of a square (e.g. frequencies 0.20, 0.24, 0.26, 0.30), each quadrant is subdivided for pairs (CC, CT, TC, TT, …) and again for triplets (CCC, CCT, …), and the normalized counts are rendered as a colored bitmap.]
OK. Given any DNA string I can make a colored bitmap, so what?
Two Questions
• Can we do something
similar for time series?
• Would it be useful?
Can we make bitmaps for time series?
Yes, with SAX!
[Figure: a time series is converted to the SAX string accbabcdbcabdbcadbacbdbdcadbaacb…; the four SAX symbols a, b, c, d define the quadrants of a square, which are recursively subdivided for pairs (aa, ab, …, dd) and triplets (aaa, aab, …), exactly as in the DNA example.]
Time Series Bitmaps
Imagine we have the following SAX strings…
abcdba  bdbadb  cbabca
There are 5 "a", 7 "b", 3 "c" and 3 "d".
We can paint the pixels based on these frequencies.
[Figure: a 2x2 bitmap whose cells a, b, c, d are colored according to the counts 5, 7, 3, 3.]
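A minimal sketch of building a (level-1) time series bitmap from SAX strings: count symbol frequencies, normalize, and lay them out on the a/b/c/d grid. Deeper levels would count pairs, triplets, and so on, just as in the DNA example:

```python
from collections import Counter

def bitmap_level1(sax_strings):
    """Return the 2x2 grid of normalized symbol frequencies [[a, b], [c, d]]."""
    counts = Counter("".join(sax_strings))
    top = max(counts.values())
    freq = {s: counts[s] / top for s in "abcd"}   # scale so the largest cell is 1.0
    return [[freq["a"], freq["b"]], [freq["c"], freq["d"]]]

print(bitmap_level1(["abcdba", "bdbadb", "cbabca"]))
# [[0.714..., 1.0], [0.428..., 0.428...]]  -- the pixel intensities of the bitmap
```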
While they are all examples of EEGs, example_a.dat is from a normal trace, whereas the others contain examples of spike-wave discharges.
We can further enhance
the time series bitmaps
by arranging the
thumbnails by “cluster”,
instead of arranging by
date, size, name etc
We can achieve this with
MDS.
[Figure: bitmap thumbnails for twelve months of Italian power demand (Jan.txt-Dec.txt), arranged by cluster via MDS, next to a plot of one year of Italian power demand with January, August and December marked.]
A well known dataset, Kalpakis_ECG, allegedly contains 70 ECGs.
If we view them as time series bitmaps, a handful stand out…
[Figure: bitmap thumbnails of normal1.txt through normal18.txt; a few look markedly different from the rest.]
[Figure: the odd traces are annotated with the phases of a cardiac action potential: ventricular depolarization, the "plateau" stage, initial rapid repolarization, and the repolarization/recovery phase.]
Some of the data are not heartbeats! They are the action potential of a normal pacemaker cell.
We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection.
[Figure: a dendrogram clustering 20 ECG datasets using only their bitmaps; the four source databases separate cleanly.]
Data Key
Cluster 1 (datasets 1 ~ 5): BIDMC Congestive Heart Failure Database (chfdb): record chf02. Start times at 0, 82, 150, 200, 250, respectively.
Cluster 2 (datasets 6 ~ 10): BIDMC Congestive Heart Failure Database (chfdb): record chf15. Start times at 0, 82, 150, 200, 250, respectively.
Cluster 3 (datasets 11 ~ 15): Long Term ST Database (ltstdb): record 20021. Start times at 0, 50, 100, 150, 200, respectively.
Cluster 4 (datasets 16 ~ 20): MIT-BIH Noise Stress Test Database (nstdb): record 118e6. Start times at 0, 50, 100, 150, 200, respectively.
[Figure: another dendrogram, over 30 datasets, again built using only the bitmaps.]
We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection.
Here is a Premature Ventricular Contraction (PVC).
[Figure: bitmaps computed over a sliding window of an ECG. In most places consecutive bitmaps are almost the same; at one point they are very different. This is the most unusual section of the time series, and it coincides with the PVC. Annotations by a cardiologist: Premature ventricular contraction; Supraventricular escape beat; Premature ventricular contraction.]
The Last Word
The sun is setting on all other symbolic representations of time series; SAX is the only way to go.