Transcript sax

0
20 40 60 80 100 120
DFT
0
20 40 60 80 100 120
DWT
0
20 40 60 80 100 120
SVD
0
20 40 60 80 100 120
APCA
0
20 40 60 80 100120
PAA
0
20 40 60 80 100 120
PLA
0
20 40 60 80 100 120
CHEB
A Different Approach…
All the previous representations have been real valued,
but think of what you can do with discrete data that you
cannot do (or do easily) with real valued data…
Markov Models, Suffix Trees, Hashing, Relevance
Feedback, Kolmogorov Complexity etc
There are many symbolic representations in the literature,
but none lower bound, and they are typically ad hoc, high
dimensionally and generally not useful for data mining.
There is now a symbolic
representation of time series that
allows…
• Lower bounding of Euclidean distance
• Dimensionality Reduction
• Numerosity Reduction
We call our representation SAX
Symbolic Aggregate ApproXimation
baabccbc
How do we obtain SAX?
C
C
0
40
60
80
100
120
c
First convert the time
series to PAA
representation, then
convert the PAA to
symbols
It takes linear time
20
c
c
b
b
a
0
20
b
a
40
60
80
100
baabccbc
120
Visual Comparison
3
2
1
0
-1
-2
-3
DFT
f
e
d
c
b
a
PLA
Haar
APCA
A raw time series of length 128 is transformed into the
word “ffffffeeeddcbaabceedcbaaaaacddee.”
– We can use more symbols to represent the time series since each symbol
requires fewer bits than real-numbers (float, double)
SAX is Good!
For classification, clustering and indexing of time
series, SAX is as good or better than…
• Fourier Transforms
• Wavelets
• The raw data!
But I am not going to show you this today!
(See Jessica Lin’s DMKD 2003 paper…)
SAX is Great!
SAX lets us do things that are difficult or impossible
with other representations.
• Finding motifs in time series (ICDM 02, SIGKDD 03)
• Visualizing massive time series (SIGKDD04, VLDB 04)
• Cluster from streams (ICDM 03, KAIS 04)
• Kolmogorov complexity data mining (SIGKDD 04)
• The papers above are just from my group, there are now a few dozen groups around
the world using SAX….
The Joy of SAX
SAX Ideas
Idea I:
A lite-weight, but incredibly useful tool call time
series bitmaps.
To explain time series bitmaps, we begin with a
digression into DNA…
The DNA of two species…
TGGCCGTGCTAGGCCCCACCCCTACCTTGC
GTCCCCGCAAGCTCATCTGCGCGAACCAGA
ACGCCCACCACCCTTGGGTTGAAATTAAGG
GGCGGTTGGCAGCTTCCCAGGCGCACGTA
CTGCGAATAAATAACTGTCCGCACAAGGAG
CCGACGATAGTCGACCCTCTCTAGTCACGA
CTACACACAGAACCTGTGCTAGACGCCATG
GATAAGCTAACACAAAAACATTTCCCACTAC
TGCTGCCCGCGGGCTACCGGCCACCCCTG
CTCAGCCTGGCGAAGCCGCCCTTCA
CCGTGCTAGGGCCACCTACCTTGGTCC
CCGCAAGCTCATCTGCGCGAACCAGAA
GCCACCACCTTGGGTTGAAATTAAGGA
GCGGTTGGCAGCTTCCAGGCGCACGTA
CTGCGAATAAATAACTGTCCGCACAAG
AGCCGACGATAAAGAAGAGAGTCGACC
CTCTAGTCACGACCTACACACAGAACC
GTGCTAGACGCCATGAGATAAGCTAAC
C
T
A
G
0.20 0.24
0.26 0.30
CCGTGCTAGGGCCACCTACCTTGGTCCG
CCGCAAGCTCATCTGCGCGAACCAGAA
GCCACCACCTTGGGTTGAAATTAAGGAG
GCGGTTGGCAGCTTCCAGGCGCACGTA
CTGCGAATAAATAACTGTCCGCACAAGG
AGCCGACGATAAAGAAGAGAGTCGACC
CTCTAGTCACGACCTACACACAGAACCT
GTGCTAGACGCCATGAGATAAGCTAACA
CC CT TC TT
C
T
CA CG TA TG
TC
CCC CCT CTC
CCA CCG CTA
CAC CAT
CAA
AC AT GC GT
A
G
AA AG GA GG
CCGTGCTAGGGCCACCTACCTTGGTCC
CCGCAAGCTCATCTGCGCGAACCAGAA
GCCACCACCTTGGGTTGAAATTAAGGA
GCGGTTGGCAGCTTCCAGGCGCACGT
CTGCGAATAAATAACTGTCCGCACAAG
AGCCGACGATAAAGAAGAGAGTCGAC
CTCTAGTCACGACCTACACACAGAACC
GTGCTAGACGCCATGAGATAAGCTAAC
1
0.02 0.04 0.09 0.04
CA 0.03 0.07 0.02
AC AT 0.11 0.03
AA AG
0
CCGTGCTAGGCCCCACCCCTACCTTGC
GTCCCCGCAAGCTCATCTGCGCGAACC
GAACGCCCACCACCCTTGGGTTGAAAT
AAGGAGGCGGTTGGCAGCTTCCCAGG
CACGTACCTGCGAATAAATAACTGTCC
ACAAGGAGCCCGACGATAGTCGACCCT
TCTAGTCACGACCTACACACAGAACCT
TGCTAGACGCCATGAGATAAGCTAACA
OK. Given any DNA
string I can make a
colored bitmap, so what?
CCGTGCTAGGCCCCACCCCTACCTTGC
GTCCCCGCAAGCTCATCTGCGCGAACC
GAACGCCCACCACCCTTGGGTTGAAAT
AAGGAGGCGGTTGGCAGCTTCCCAGG
CACGTACCTGCGAATAAATAACTGTCC
ACAAGGAGCCCGACGATAGTCGACCCT
TCTAGTCACGACCTACACACAGAACCT
TGCTAGACGCCATGAGATAAGCTAACA
Two Questions
Indian
rhinoceros.dna
• Can we do
something similar
for time series?
white
rhinoceros.dna
rhesus
monkey.dna
pygmy
chimpanzee.dna
sperm
whale.dna
• Would it be useful?
Indian
elephant.dna
hippopotamus.dna
chimpanzee.dna
Human.dna
African
elephant.dna
orangutan.dna
pygmy
sperm whale.dna
Can we do make bitmaps for time series?
Yes, with SAX!
accbabcdbcabdbcadbacbdbdcadbaacb…
a
c
b
d
aa
ac
ca
cc
ab
ad
cb
cd
ba
bc
da
dc
bb
bd
db
dd
aaa
aab
aba
aac
aad
abc
aca
acb
acc
Time Series Bitmap
While they are all example of EEGs, example_a.dat is from a normal trace,
whereas the others contain examples of spike-wave discharges.
We can further enhance
the time series bitmaps by
arranging the thumbnails
by “cluster”, instead of
arranging by date, size,
name etc
We can achieve this with
MDS.
normal9.txt
normal8.txt normal5.txt
normal1.txt normal10.txt normal11.txt
normal15.txt normal14.txt
normal13.txt normal7.txt normal2.txt
normal16.txt normal18.txt
normal4.txt normal3.txt normal12.txt
normal6.txt
normal17.txt
ventricular depolarization
“plateau” stage
repolarizatio
nrecovery
initial rapid
repolarization
0
100
200
300
phase
400
normal9.txt
500
normal8.txt normal5.txt
normal1.txt normal10.txtnormal11.txt
normal15.txtnormal14.txt
normal13.txtnormal7.txt normal2.txt
normal16.txtnormal18.txt
normal4.txt normal3.txt normal12.txt
normal17.txt
normal6.txt
0
100
200
300
400
500
Some of the data are not
heartbeats! They are the
action potential of a
normal pacemaker cell
We can test how much useful
information is retained in the bitmaps
by using only the bitmaps for
clustering/classification/anomaly
detection
20
19
17
18
16
8
7
We can test how much useful
information is retained in the bitmaps
by using only the bitmaps for
clustering/classification/anomaly
detection
10
9
6
15
14
12
13
Data Key
Cluster 1 (datasets 1 ~ 5):
BIDMC Congestive Heart Failure Database (chfdb): record chf02
Start times at 0, 82, 150, 200, 250, respectively
11
5
Cluster 2 (datasets 6 ~ 10):
BIDMC Congestive Heart Failure Database (chfdb): record chf15
Start times at 0, 82, 150, 200, 250, respectively
4
Cluster 3 (datasets 11 ~ 15):
3
2
Long Term ST Database (ltstdb): record 20021
Start times at 0, 50, 100, 150, 200, respectively
Cluster 4 (datasets 16 ~ 20):
1
MIT-BIH Noise Stress Test Database (nstdb): record 118e6
Start times at 0, 50, 100, 150, 200, respectively
24
23
25
26
20
19
16
15
14
13
12
11
10
9
8
7
22
21
6
5
4
3
30
29
28
27
18
17
2
1
We can test how much useful
information is retained in the bitmaps
by using only the bitmaps for
clustering/classification/anomaly
detection
Here is a Premature Ventricular
Contraction (PVC)
Here the bitmaps are almost the same.
Here the bitmaps are very
different. This is the most
unusual section of the time
series, and it coincidences
with the PVC.
Annotations by
a cardiologist
Premature ventricular contraction
Supraventricular escape beat
Premature ventricular contraction
Time Series Bitmaps Summary
The first paper to describe Time Series Bitmaps appeared in SDM 05.
There are lots of possible ideas for extensions/ commercialization.
Time series bitmaps could be one of the few contributions of data
mining to make a real world impact, because there is essentially no
barrier to adoption.
“The greatest value of a picture is when it forces us
to notice what we never expected to see”
John Turkey
Exploring data analysis. Addison-Wesley, Reading MA, 1977.
Using SAX to Visualize Time Series
Motivation of VizTree
010110010111100110100100001000
101001101101011100001010101110
1111100011011011011111101001100
100100011010001111001101101000
101111000101101001101100110100
000010011000100111000001110100
1100101100001010010
10001000101001000101010100001
010100010101110111101011010010
11101001010100111010101010010
10010101011101010100101010101
101010100101100101110111101000
111000010100001001110101000111
00001010101100101110101
Here are two sets of bit
strings. Which set is
generated by a human and
which one is generated by
a computer?
VizTree
010110010111100110100100001000101
001101101011100001010101110111110
001101101101111110100110010010001
101000111100110110100010111100010
110100110110011010000001001100010
011100000111010011001011000010100
10
0
1
10001000101001000101010100001010
100010101110111101011010010111010
010101001110101010100101001010101
110101010010101010110101010010110
010111011110100011100001010000100
111010100011100001010101100101110
101
0
1
0
0
1
1
Lets put the sequences into a depth limited tree, such
that the frequencies of all triplets are encoded in the
thickness of branches…
“humans usually try to fake randomness by alternating patterns”
VizTree
The “trick” on the previous slide
only works for discrete data, but
time series are real valued.
Details 2
Zoom in
But we can
SAX up a time
series to make
it discrete!
Overview
VisTree
• Convert the time series to SAX
• Push the data in a depth-limited
suffix tree
• Encode the frequencies as the
line thickness
Overview, zoom
& filter, details
on demand
Details 1
SAX for Motif Discovery
SAX allows Motif
Discovery!
Winding Dataset
( The angular speed of reel 2 )
0
50 0
1000
150 0
2000
Informally, motifs are reoccurring patterns…
2500
Motif Discovery
To find these 3 motifs would require about
6,250,000 calls to the Euclidean distance function.
A
0
500
20
1500
2000
B
40
60
80
100
120
140
0
20
C
(The angular speed of reel 2)
1000
A
0
Winding Dataset
B
2500
C
40
60
80
100
120
140
0
20
40
60
80
100
120
140
Why Find Motifs?
· Mining association rules in time series requires the discovery of motifs.
These are referred to as primitive shapes and frequent patterns.
· Several time series classification algorithms work by constructing typical
prototypes of each class. These prototypes may be considered motifs.
· Many time series anomaly/interestingness detection algorithms essentially
consist of modeling normal behavior with a set of typical shapes (which we see
as motifs), and detecting future patterns that are dissimilar to all typical shapes.
· In robotics, Oates et al., have introduced a method to allow an autonomous
agent to generalize from a set of qualitatively different experiences gleaned
from sensors. We see these “experiences” as motifs.
· In medical data mining, Caraca-Valente and Lopez-Chavarrias have
introduced a method for characterizing a physiotherapy patient’s recovery
based of the discovery of similar patterns. Once again, we see these “similar
patterns” as motifs.
• Animation and video capture… (Tanaka and Uehara, Zordan and Celly)
T
Trivial
Matches
Space Shuttle STS - 57 Telemetry
C
0
100
200
3 00
400
( Inertial Sensor )
500
600
70 0
800
900
100 0
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a
subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M)  R, then M is
called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position
p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q
or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p
or p < q’< q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most
significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has highest count
of non-trivial matches (ties are broken by choosing the motif whose matches have the lower
variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R) ) is the subsequence
CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1  i < K.
OK, we can define motifs, but
how do we find them?
The obvious brute force search algorithm is just too slow…
Our algorithm is based on a hot idea from bioinformatics,
random projection* and the fact that SAX allows use to
lower bound discrete representations of time series.
* J Buhler and M Tompa. Finding motifs using random projections. In
RECOMB'01. 2001.
A simple worked example of our motif discovery algorithm
The next 4 slides
T
0
500
( m= 1000)
1000
C1
C^1 a c b a
^
S
1 a
2 b
: :
: :
58 a
: :
985 b
c
c
:
:
c
:
c
b
a
:
:
c
:
c
a
b
:
:
a
:
c
a = 3 {a,b,c}
n = 16
w=4
Assume that we have a
time series T of length
1,000, and a motif of
length 16, which occurs
twice, at time T1 and
time T58.
A mask {1,2} was randomly chosen,
so the values in columns {1,2} were
used to project matrix into buckets.
Collisions are recorded by
incrementing the appropriate
location in the collision matrix
A mask {2,4} was randomly chosen,
so the values in columns {2,4} were
used to project matrix into buckets.
Once again, collisions are
recorded by incrementing the
appropriate location in the
collision matrix
We can calculate the expected values in the
matrix, assuming there are NO patterns…
k   i 
E( k , a, w, d , t )      1- 
 2  i 0  w 
d
t
 w a 1
 i  a 

 
i
1
 
a
w i
1
Suppose
E(k,a,w,d,t) = 2
2
2
:
1
3
58 27
: 3
2
1
2
2
1
985 0
1
2
1
1
2
:
58
3
: 985
A Simple Experiment
Lets imbed two motifs into a random walk time
series, and see if we can recover them
C
A
D
B
0
0
20
40
60
200
80
100
120
400
0
20
40
600
60
80
100
800
120
1000
1200
Planted Motifs
C
A
B
D
“Real” Motifs
0
20
40
60
80
100
120
0
20
40
60
80
100
120
Some Examples of Real Motifs
Motor 1 (DC Current)
0
500
1000
1500
2000
Astrophysics (Photon Count)
250
0
350
0
450
0
550
0
650
0
Motifs in Music
jingle
•
•
•
•
Single channel (mono) 225000 samples at sample rate of 6000
samples/sec, 32bits per sample.
Pre-processing: Absolute-valued and down-sampled to total of
600 samples and new sample rate of 16 samples/sec.
400 projections with instance length equal to 2 seconds of
sample. w=16, a=8.
Jingle is highly repetitive, these motifs were found:
How Fast can we find Motifs?
Seconds
10k
8k
Brute Force
6k
TS-P
4k
2k
0
1000
2000
3000
4000
Length of Time Series
5000
The sun is setting on all other
symbolic representations of
time series, we have seen
SAX for discord discovery,
anomaly detection, clustering
and visualization