robbie williams speech


All data files and formats can be imagined as a single row of data.
The number of columns is the dimensionality of the data.
The number of rows is the numerosity.
Customer 1   12   23   12   43   25   Yes
Customer 2   34   56   12   76   23   Yes
Customer 3   45   23   45   34   23   No
Customer 4   54   23   23   55   54   Yes
25.1750
25.2250
25.2500
25.2500
25.2750
25.3250
25.3500
25.3500
25.4000
25.4000
25.3250
25.2250
25.2000
25.1750
..
..
24.6250
24.6750
24.6750
24.6250
24.6250
24.6250
24.6750
24.7500
What are Time Series?
A time series is a collection of observations
made sequentially in time.
[Figure: plot of the time series; Y-axis from 23 to 29, X-axis from 0 to 500]
Note that virtually all similarity measurements, indexing
and dimensionality reduction techniques discussed in
this tutorial can be used with other data types.
Time Series are Ubiquitous! I
People measure things...
• George Bush's popularity rating.
• Their blood pressure.
• The annual rainfall in Dubrovnik.
• The value of their Yahoo stock.
• The number of web hits per second.
… and things change over time.
Thus time series occur in virtually every medical, scientific and
business domain.
Time Series are Ubiquitous! II
A random sample of 4,000 graphics from 15
of the world’s newspapers published from
1974 to 1980 found that more than 75% of all
graphics were time series (Tufte, 1983).
Image data may best be thought of as time series…
Video data may best be thought of as time series…
[Figure: time series extracted from video of two gestures, X-axis from 0 to 90.
Point: hand at rest, hand moving to shoulder level, steady pointing.
Gun-Draw: hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing.]
Handwriting data may best be thought of as time series…
George Washington manuscript
Time Series are Everywhere…
Bioinformatics: Aach, J. and Church, G. (2001). Aligning gene
expression time series with time warping algorithms. Bioinformatics.
Volume 17, pp 495-508.
Robotics: Schmill, M., Oates, T. & Cohen, P. (1999). Learned models for
continuous planning. In 7th International Workshop on Artificial
Intelligence and Statistics.
Medicine: Caiani, E.G., et al. (1998) Warped-average template technique
to track on a cycle-by-cycle basis the cardiac filling phases on left
ventricular volume. IEEE Computers in Cardiology.
Gesture Recognition: Gavrila, D. M. & Davis, L. S. (1995). Towards 3-d
model-based tracking and recognition of human movement: a multi-view
approach. In IEEE IWAFGR
Chemistry: Gollmer, K., & Posten, C. (1995) Detection of distorted
pattern using dynamic time warping algorithm and application for
supervision of bioprocesses. IFAC CHEMFAS-4
Meteorology / Tracking / Biometrics / Astronomy / Finance / Manufacturing …
Why is Working With Time Series so
Difficult? Part I
Answer: How do we work with very large databases?
• 1 Hour of EKG data: 1 Gigabyte.
• Typical Weblog: 5 Gigabytes per week.
• Space Shuttle Database: 200 Gigabytes and growing.
• Macho Database: 3 Terabytes, updated with 3 gigabytes a day.
Since most of the data lives on disk (or tape), we need a
representation of the data we can efficiently manipulate.
Why is Working With Time Series so
Difficult? Part II
Answer: We are dealing with subjectivity
The definition of similarity depends on the user, the domain and
the task at hand. We need to be able to handle this subjectivity.
Why is working with time series so
difficult? Part III
Answer: Miscellaneous data handling problems.
• Differing data formats.
• Differing sampling rates.
• Noise, missing values, etc.
We will not focus on these issues in this tutorial.
What do we want to do with the time series data?
• Clustering
• Classification
• Motif Discovery
• Rule Discovery
• Query by Content
• Novelty Detection
All these problems require similarity matching
Here is a simple motivation….
You go to the doctor because of chest pains.
Your ECG looks strange…
Your doctor wants to search a database to find similar ECGs, in the
hope that they will offer clues about your condition...
Two questions:
• How do we define similar?
• How do we search quickly?
The similarity matching problem can come in two flavors I
1: Whole Matching
Given a Query Q, a reference database C and a
distance measure, find the Ci that best matches Q.
[Figure: Query Q (template) compared against Database C, sequences 1 to 10. C6 is the best match.]
The similarity matching problem can come in two flavors II
2: Subsequence Matching
Given a Query Q, a reference database C and a distance measure, find the
location that best matches Q.
[Figure: Query Q (template) compared against the long sequence in Database C, with the best matching subsection marked]
Note that we can always convert subsequence matching to whole matching by
sliding a window across the long sequence, and copying the window contents.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) is denoted by D(O1,O2)
What properties should a distance measure have?
• D(A,B) = D(B,A)               Symmetry
• D(A,A) = 0                    Constancy of Self-Similarity
• D(A,B) = 0 iff A = B          Positivity
• D(A,B) ≤ D(A,C) + D(B,C)      Triangular Inequality
Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold.
Suppose I am looking for the closest point to Q, in a database
of 3 objects.
Further suppose that the triangular inequality holds, and that
we have precompiled a table of distances between all the
items in the database.
[Figure: the query Q and the database objects a, b and c]
Precompiled distances:
      a      b
b    6.70
c    7.07   2.30
I find a and calculate that it is 2 units from Q, it becomes my
best-so-far. I find b and calculate that it is 7.81 units away
from Q.
I don’t have to calculate the distance from Q to c! I know:
D(Q,b) ≤ D(Q,c) + D(b,c)
D(Q,b) - D(b,c) ≤ D(Q,c)
7.81 - 2.30 ≤ D(Q,c)
5.51 ≤ D(Q,c)
So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
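The pruning argument on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the numbers are the ones quoted on the slide (best-so-far of 2, D(Q,b) = 7.81, D(b,c) = 2.30).

```python
def triangle_lower_bound(d_query_known, d_known_target):
    """Lower bound on D(Q, target), given D(Q, known) and D(known, target).

    Follows from the triangular inequality; taking the absolute value
    covers both orderings of the two distances.
    """
    return abs(d_query_known - d_known_target)


def can_skip(best_so_far, d_query_known, d_known_target):
    """True if the target cannot beat the current best-so-far."""
    return triangle_lower_bound(d_query_known, d_known_target) >= best_so_far


best_so_far = 2.0                               # D(Q, a)
bound_c = triangle_lower_bound(7.81, 2.30)      # uses D(Q, b) and D(b, c)
skip_c = can_skip(best_so_far, 7.81, 2.30)      # c can be skipped
```

The only distances touched for c are one measured distance and one table lookup, which is the whole point of precompiling the table.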
A Final Thought on the Triangular Inequality I
Sometimes the triangular inequality requirement maps
nicely onto human intuitions.
Consider the similarity between the horse, the zebra and the lion.
The horse and the zebra are very similar, and both are very
unlike the lion.
A Final Thought on the Triangular Inequality II
Sometimes the triangular inequality requirement fails to
map onto human intuition.
Consider the similarity between the horse, a man and the centaur.
The horse and the man are very
different, but both share many
features with the centaur. This
relationship does not obey the
triangular inequality.
The centaur example is due to Remco Veltkamp
The Minkowski Metrics
$$D(Q,C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^p}$$
p = 1 Manhattan (Rectilinear, City Block)
p = 2 Euclidean
p = ∞ Max (Supremum, “sup”)
[Figure: two time series Q and C, with D(Q,C) marked]
Euclidean Distance Metric
Given two time series
Q = q1…qn
and
C = c1…cn
their Euclidean distance is defined as:
$$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
Ninety percent of all work on
time series uses the Euclidean
distance measure.
[Figure: two time series Q and C, with D(Q,C) marked]
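As a small illustration of the Minkowski family, here is a sketch in plain Python. The function name `minkowski` and the toy sequences are assumptions made for this example; p = 1, p = 2 and p = infinity give the Manhattan, Euclidean and Max distances listed above.

```python
import math


def minkowski(q, c, p):
    """Minkowski distance between two equal-length sequences."""
    if len(q) != len(c):
        raise ValueError("sequences must have the same length")
    diffs = [abs(qi - ci) for qi, ci in zip(q, c)]
    if math.isinf(p):
        # p = infinity: the largest single coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


q = [0.0, 0.0, 0.0]
c = [3.0, 4.0, 0.0]
manhattan = minkowski(q, c, 1)
euclidean = minkowski(q, c, 2)
maximum = minkowski(q, c, math.inf)
```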
Optimizing the Euclidean Distance Calculation
$$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
Instead of using the Euclidean distance we can use the
Squared Euclidean distance
$$D_{squared}(Q,C) = \sum_{i=1}^{n} (q_i - c_i)^2$$
Euclidean distance and Squared Euclidean
distance are equivalent in the sense that they
return the same rankings, clusterings and
classifications.
This optimization helps with CPU time, but most
problems are I/O bound.
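A quick way to see the equivalence is to rank a toy database under both measures. The data below are made up for this sketch; because the square root is monotonic, the two rankings should agree.

```python
import math


def squared_euclidean(q, c):
    return sum((qi - ci) ** 2 for qi, ci in zip(q, c))


def euclidean(q, c):
    return math.sqrt(squared_euclidean(q, c))


query = [1.0, 2.0, 3.0]
database = [[1.0, 2.0, 4.0], [4.0, 6.0, 3.0], [1.5, 2.0, 3.0]]

# Indices of the database items, nearest first, under each measure.
rank_by_euclidean = sorted(range(len(database)),
                           key=lambda i: euclidean(query, database[i]))
rank_by_squared = sorted(range(len(database)),
                         key=lambda i: squared_euclidean(query, database[i]))
```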
Preprocessing the data before
distance calculations
If we naively try to measure the distance between two “raw” time
series, we may get very unintuitive results.
This is because Euclidean distance is very sensitive to some
distortions in the data. For most problems these distortions are not
meaningful, and thus we can and should remove them.
In the next 4 slides I will discuss the 4 most common distortions, and
how to remove them.
• Offset Translation
• Amplitude Scaling
• Linear Trend
• Noise
Transformation I: Offset Translation
[Figure: Q and C before and after removing the offset, with D(Q,C) marked]
Q = Q - mean(Q)
C = C - mean(C)
Transformation II: Amplitude Scaling
[Figure: Q and C before and after amplitude scaling, with D(Q,C) marked]
Q = (Q - mean(Q)) / std(Q)
C = (C - mean(C)) / std(C)
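The two transformations so far can be sketched together. This is a minimal version with two assumptions that the slides do not state: the population standard deviation is used, and a constant series is mapped to all zeros.

```python
import math


def z_normalize(series):
    """Subtract the mean (offset translation), then divide by the
    standard deviation (amplitude scaling)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # Assumption: a constant series has no shape, return zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]


q = [1.0, 2.0, 3.0, 4.0, 5.0]
c = [10.0 * x + 100.0 for x in q]   # same shape, different offset and scale
q_norm = z_normalize(q)
c_norm = z_normalize(c)
```

After normalization the two series coincide (up to floating point error), so their Euclidean distance is essentially zero.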
Transformation III: Linear Trend
[Figure: a time series after each step: removed offset translation, removed amplitude scaling, removed linear trend]
The intuition behind removing linear trend is this.
Fit the best fitting straight line to the time series, then subtract that line
from the time series.
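One possible reading of "best fitting straight line" is an ordinary least-squares fit against the time index 0, 1, 2, ... The sketch below makes that assumption; the function name is hypothetical.

```python
def remove_linear_trend(series):
    """Subtract the least-squares straight line from a series."""
    n = len(series)
    if n < 2:
        return [0.0] * n
    mean_x = (n - 1) / 2.0
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in enumerate(series)]


line = [2.0 * x + 1.0 for x in range(10)]
flat = remove_linear_trend(line)          # a pure line leaves (near) zeros
bumpy = remove_linear_trend([0.0, 2.0, 1.0, 4.0, 3.0, 6.0])
```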
Transformation IV: Noise
[Figure: Q and C before and after smoothing, with D(Q,C) marked]
Q = smooth(Q)
C = smooth(C)
The intuition behind removing noise is this.
Average each datapoint's value with its neighbors.
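The slides do not say how `smooth` is defined. A centered moving average is one simple choice; the `half_window` parameter and the shrinking window at the two ends are assumptions of this sketch.

```python
def smooth(series, half_window=1):
    """Average each point with up to `half_window` neighbors on each side."""
    n = len(series)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


noisy = [0.0, 3.0, 0.0, 3.0, 0.0]
smoothed = smooth(noisy)
```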
A Quick Experiment to Demonstrate the
Utility of Preprocessing the Data
Instances from Cylinder-Bell-Funnel with small, random amounts of trend, offset
and scaling added.
[Figure: dendrogram of instances 1 to 9] Clustered using Euclidean distance on the raw data.
[Figure: dendrogram of instances 1 to 9] Clustered using Euclidean distance on the raw data,
after removing noise, linear trend, offset translation and amplitude scaling.
Summary of Preprocessing
The “raw” time series may have distortions which we
should remove before clustering, classification etc.
Of course, sometimes the distortions are the most
interesting thing about the data, the above is only a
general rule.
We should keep in mind these problems as we consider
the high level representations of time series which we
will encounter later (Fourier transforms, Wavelets etc),
since these representations often allow us to handle
distortions in elegant ways.
Dynamic Time Warping
Fixed Time Axis (Euclidean): Sequences are aligned “one to one”.
“Warped” Time Axis (Dynamic Time Warping): Nonlinear alignments are possible.
Note: We will first see the utility of DTW, then see how it is calculated.
Let us compare Euclidean Distance and DTW on 4 problems
[Figure: example time series from the four problems: GUN (classes Point and GunDraw), Sign language, Nuclear Trace and Word Spotting]
Results: Error Rate
DataSet          Euclidean   DTW
Word Spotting    4.78        1.10
Sign language    28.70       25.93
GUN              5.50        1.00
Nuclear Trace    11.00       0.00
We use 1-nearest-neighbor, leaving one out evaluation
Results: Time (msec)
DataSet          Euclidean   DTW        Ratio (DTW / Euclidean)
Word Spotting    40          8,600      215
Sign language    10          1,110      110
GUN              60          11,820     197
Nuclear Trace    210         144,470    687
DTW is hundreds to thousands of times slower than
Euclidean distance.
Utility of Dynamic Time Warping: Example II, Clustering
Power-Demand Time Series.
Each sequence corresponds to
a week’s demand for power in
a Dutch research facility in
1997 [van Selow 1999].
[Figure annotation: Wednesday was a national holiday]
For both dendrograms the two 5-day weeks are correctly grouped.
Note, however, that for Euclidean distance the three 4-day weeks are not
clustered together, and the two 3-day weeks are also not clustered together.
In contrast, Dynamic Time Warping clusters the three 4-day weeks together, and
the two 3-day weeks together.
[Figure: two dendrograms, Euclidean and Dynamic Time Warping]
Time taken to create hierarchical
clustering of power-demand time series.
• Time to create dendrogram
using Euclidean Distance
0.012 seconds
• Time to create dendrogram
using Dynamic Time Warping
2.0 minutes
How is DTW Calculated? I
We create a matrix the size of
|Q| by |C|, then fill it in with the
distance between every pair of
points in our two time series.
[Figure: the time series Q and C along the two edges of the matrix]
How is DTW Calculated? II
Every possible warping between two time
series is a path through the matrix. We
want the best one…
$$DTW(Q,C) = \min\left\{ \frac{\sqrt{\sum_{k=1}^{K} w_k}}{K} \right\}$$
[Figure: warping path w through the matrix of Q and C]
How is DTW Calculated? III
This recursive function gives us the
minimum cost path.
γ(i,j) = d(qi,cj) + min{ γ(i-1,j-1) , γ(i-1,j) , γ(i,j-1) }
[Figure: warping path w through the matrix of Q and C]
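The recursion can be filled in with dynamic programming. The sketch below is one common formulation, not necessarily the exact variant used for the experiments in these slides: it assumes d(qi,cj) is the squared difference, returns the square root of the final cumulative cost, and does not normalize by the path length K.

```python
import math


def dtw(q, c):
    """Dynamic time warping distance via the cumulative-cost recursion."""
    n, m = len(q), len(c)
    # gamma[i][j] = cost of the best path aligning q[:i] with c[:j].
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = cost + min(gamma[i - 1][j - 1],
                                     gamma[i - 1][j],
                                     gamma[i][j - 1])
    return math.sqrt(gamma[n][m])


q = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]      # same bump, one step earlier
warped = dtw(q, c)
rigid = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
```

On this toy pair the warped distance is zero while the one-to-one (Euclidean) distance is 2, which is the behavior the earlier slides describe. The nested loops also make the quadratic cost visible.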
Let us visualize the cumulative matrix on a real world problem I
This example shows 2
one-week periods from
the power demand time
series.
Note that although they
both describe 4-day work
weeks, the blue sequence
had Monday as a holiday,
and the red sequence had
Wednesday as a holiday.
Let us visualize the cumulative matrix on a real world problem II
What we have seen so far…
• Dynamic time warping gives much
better results than Euclidean distance on
virtually all problems (recall the
classification example, and the
clustering example)
• Dynamic time warping is very very
slow to calculate!
Is there anything we can do to speed up similarity
search under DTW?
Fast Approximations to Dynamic Time Warp Distance I
Simple Idea: Approximate the time series with some compressed or downsampled
representation, and do DTW on the new representation.
How well does this work...
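To make the simple idea concrete, here is a hedged sketch. Averaging consecutive blocks of points is only one possible "compressed or downsampled representation", and the DTW used here assumes a squared point distance; both choices, and all names, are assumptions of this example rather than the method behind the timings on the next slide.

```python
import math


def dtw(q, c):
    """Full DTW via the cumulative-cost recursion (squared point distance assumed)."""
    n, m = len(q), len(c)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = cost + min(gamma[i - 1][j - 1],
                                     gamma[i - 1][j],
                                     gamma[i][j - 1])
    return math.sqrt(gamma[n][m])


def downsample(series, factor):
    """Replace each block of `factor` consecutive points by its mean."""
    blocks = [series[i:i + factor] for i in range(0, len(series), factor)]
    return [sum(block) / len(block) for block in blocks]


def approximate_dtw(q, c, factor):
    """DTW on the downsampled series; the matrix shrinks by about factor squared."""
    return dtw(downsample(q, factor), downsample(c, factor))


series = [float(i % 7) for i in range(64)]
reduced = downsample(series, 4)
```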
Fast Approximations to Dynamic Time Warp Distance II
22.7 sec
1.3 sec
… there is strong visual evidence to suggest it works well.
Good experimental evidence for the utility of the approach on clustering,
classification and query by content problems has also been demonstrated.
In general, it is hard to speed up a single DTW calculation.
However, if we have to make many DTW calculations
(which is almost always the case), we can potentially speed
up the whole process by lower bounding.
Keep in mind that the lower bounding trick works for any
situation where you have an expensive calculation that can
be lower bounded (string edit distance, graph edit distance etc).
I will explain how lower bounding works in a generic
fashion in the next two slides, then show concretely how
lower bounding makes dealing with massive time series
under DTW possible…
Lower Bounding I
Assume that we have two functions:
• DTW(A,B): The true DTW function is very slow…
• lower_bound_distance(A,B): The lower bound function is very fast…
By definition, for all A, B, we have
lower_bound_distance(A,B) ≤ DTW(A,B)
Lower Bounding II
We can speed up similarity search under DTW
by using a lower bounding function.
Intuition
Try to use a cheap lower
bounding calculation as
often as possible.
Only do the expensive,
full calculations when it is
absolutely necessary.
Algorithm Lower_Bounding_Sequential_Scan(Q)
    best_so_far = infinity;
    for all sequences in database
        LB_dist = lower_bound_distance(Ci, Q);
        if LB_dist < best_so_far
            true_dist = DTW(Ci, Q);
            if true_dist < best_so_far
                best_so_far = true_dist;
                index_of_best_match = i;
            endif
        endif
    endfor
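The scan translates almost line for line into Python. To keep the sketch self-contained and generic, the expensive and cheap functions are passed in as parameters, and the demonstration uses Euclidean distance with a deliberately crude bound (the absolute difference of the first points, which can never exceed the Euclidean distance). These stand-ins are assumptions of the example; with DTW one would pass a DTW function and a DTW lower bound instead.

```python
import math


def lower_bounding_sequential_scan(query, database, true_distance,
                                   lower_bound_distance):
    """Return (index_of_best_match, best_so_far, full_calculations)."""
    best_so_far = math.inf
    index_of_best_match = None
    full_calculations = 0
    for i, candidate in enumerate(database):
        # Cheap test first: skip candidates that cannot beat best_so_far.
        if lower_bound_distance(candidate, query) < best_so_far:
            full_calculations += 1
            true_dist = true_distance(candidate, query)
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far, full_calculations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_point_bound(a, b):
    # |a0 - b0| <= Euclidean distance, so this is a valid lower bound.
    return abs(a[0] - b[0])


query = [0.0, 0.0, 0.0]
database = [[1.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.5, 0.0, 0.0], [9.0, 1.0, 1.0]]
result = lower_bounding_sequential_scan(query, database, euclidean,
                                        first_point_bound)
```

Here only two of the four candidates need the full calculation; the other two are dismissed by the bound alone.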
Lower Bound of Yi et al.
LB_Yi
[Figure: gray lines drawn relative to the levels max(Q) and min(Q)]
The sum of the squared lengths of the gray
lines represents the minimum contribution of the
corresponding points to the
overall DTW distance, and thus can be
returned as the lower bounding measure.
Yi, B, Jagadish, H & Faloutsos,
C. Efficient retrieval of similar
time sequences under time
warping. ICDE 98, pp 23-27.
Lower Bound of Kim et al.
LB_Kim
[Figure: two sequences with the points A, B, C and D marked]
The squared difference between the two
sequences' first (A), last (D), minimum
(B) and maximum points (C) is returned
as the lower bound.
Kim, S, Park, S, & Chu, W. An
index-based approach for
similarity search supporting time
warping in large sequence
databases. ICDE 01, pp 607-614
Global Constraints
• Slightly speed up the calculations
• Prevent pathological warpings
A global constraint constrains the indices of the
warping path wk = (i,j)k such that j-r ≤ i ≤ j+r
Where r is a term defining allowed range of
warping for a given point in a sequence.
[Figure: warping matrices for Q and C showing the Sakoe-Chiba Band and the Itakura Parallelogram, with r marked]
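A constant r corresponds to the Sakoe-Chiba Band. The sketch below restricts the cumulative-cost matrix to cells with j-r ≤ i ≤ j+r; as before, the squared point distance and the final square root are assumptions of this example, and the function name is hypothetical.

```python
import math


def dtw_band(q, c, r):
    """DTW restricted to a Sakoe-Chiba band of half-width r."""
    n, m = len(q), len(c)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= r are filled; the rest stay infinite.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = cost + min(gamma[i - 1][j - 1],
                                     gamma[i - 1][j],
                                     gamma[i][j - 1])
    return math.sqrt(gamma[n][m])


q = [0.0, 0.0, 0.0, 1.0, 0.0]
c = [0.0, 1.0, 0.0, 0.0, 0.0]       # the peak sits two steps earlier
no_warping = dtw_band(q, c, 0)      # r = 0 forces a one-to-one alignment
narrow = dtw_band(q, c, 1)          # too narrow to align the peaks
wide = dtw_band(q, c, 2)            # wide enough to align the peaks
```

Only about (2r + 1) cells per row are visited, which is where the speed up comes from.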
A Novel Lower Bounding Technique I
$$U_i = \max(q_{i-r} : q_{i+r})$$
$$L_i = \min(q_{i-r} : q_{i+r})$$
[Figure: the upper envelope U and lower envelope L around Q, shown for the Sakoe-Chiba Band and the Itakura Parallelogram]
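A Novel Lower Bounding Technique II
$$LB\_Keogh(Q,C) = \sqrt{\sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & \text{if } c_i > U_i \\ (c_i - L_i)^2 & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$
[Figure: LB_Keogh shown as the gray lines between C and the envelope (U, L) of Q, for the Sakoe-Chiba Band and the Itakura Parallelogram]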
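The envelope and the bound can be sketched as follows for the Sakoe-Chiba case (constant r). Truncating the window at the two ends of Q and taking the square root of the sum are assumptions of this sketch; the function names are hypothetical.

```python
import math


def envelope(q, r):
    """Upper and lower envelopes: max and min of q over i-r .. i+r."""
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def lb_keogh(q, c, r):
    """Sum the squared distances from c to the envelope of q, where c
    falls outside it, and take the square root."""
    upper, lower = envelope(q, r)
    total = 0.0
    for ci, ui, li in zip(c, upper, lower):
        if ci > ui:
            total += (ci - ui) ** 2
        elif ci < li:
            total += (ci - li) ** 2
    return math.sqrt(total)


q = [0.0, 0.0, 0.0, 1.0, 0.0]
c = [0.0, 1.0, 0.0, 0.0, 0.0]
upper_1, lower_1 = envelope(q, 1)
bound_r1 = lb_keogh(q, c, 1)
bound_r2 = lb_keogh(q, c, 2)
```

The cost is linear in the sequence length, compared with the quadratic (or banded) cost of the full DTW calculation it is meant to avoid.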
The tightness of the lower bound for each technique is proportional
to the length of gray lines used in the illustrations
[Figure panels: LB_Kim, LB_Yi, LB_Keogh (Sakoe-Chiba), LB_Keogh (Itakura)]
Let us empirically evaluate the quality
of the lower bounding techniques.
This is a good idea, since it is an
implementation free measure of quality.
First we must discuss our experimental
philosophy…
Experimental Philosophy
• We tested on 32 datasets from such diverse fields as
finance, medicine, biometrics, chemistry, astronomy,
robotics, networking and industry. The datasets cover the complete
spectrum of stationary/ non-stationary, noisy/ smooth, cyclical/ non-cyclical, symmetric/ asymmetric etc
• Our experiments are completely reproducible. We
saved every random number, every setting and all data.
• To ensure true randomness, we use random numbers
created by a quantum mechanical process.
• We test with the Sakoe-Chiba Band, which is the
worst case (the Itakura Parallelogram would give us
much better results).
32 datasets from such diverse fields as finance, medicine, biometrics,
chemistry, astronomy, robotics, networking and industry
 1  Sunspot        2,899      17  Evaporator      37,830
 2  Power          35,040     18  Ballbeam        2,000
 3  ERP data       198,400    19  Tongue          700
 4  Spot Exrates   2,567      20  Fetal ECG       22,500
 5  Shuttle        6,000      21  Balloon         4,002
 6  Water          6,573      22  Stand’ & Poor   17,610
 7  Chaotic        1,800      23  Speech          1,020
 8  Steamgen       38,400     24  Soil Temp       2,304
 9  Ocean          4,096      25  Wool            2,790
10  Tide           8,746      26  Infrasound      8,192
11  CSTR           22,500     27  Network         18,000
12  Winding        17,500     28  EEG             11,264
13  Dryer2         5,202      29  Koski EEG       144,002
14  Robot Arm      2,048      30  Buoy Sensor     55,964
15  Ph Data        6,003      31  Burst           9,382
16  Power Plant    2,400      32  Random Walk I   65,536
Tightness of Lower Bound Experiment
• We measured T:
$$T = \frac{\text{Lower Bound Estimate of Dynamic Time Warp Distance}}{\text{True Dynamic Time Warp Distance}}, \qquad 0 \le T \le 1$$
The larger the better.
• For each dataset, we randomly extracted 50
sequences of length 256. We compared each
sequence to the 49 others.
(Query length of 256 is about the mean in the literature.)
• For each dataset we report T as the average ratio
from the 1,225 (50*49/2) comparisons made.
[Figure: tightness of lower bound T (0 to 1.0) for LB_Keogh, LB_Yi and LB_Kim on each of the 32 datasets]
Effect of Query Length on Tightness of Lower Bounds
[Figure: Tightness of Lower Bound T (0 to 1.0) against Query Length (16, 32, 64, 128, 256, 512, 1024) for LB_Keogh, LB_Yi and LB_Kim]
How does the tightness of lower bounds translate to speed up?
[Figure: fraction of the database on which we must do full DTW calculation (0 to 1) for Sequential Scan and LB_Keogh on Random Walk II, with the X-axis running from 2^10 to 2^20. Note that the X-axis is logarithmic.]
These experiments suggest
we can use the new lower
bounding technique to speed
up sequential search.
That’s super!
Excellent!
But what we really need
is a technique to index
the time series
According to the most referenced paper
on time series similarity searching
“dynamic time warping cannot be speeded
up by indexing *”,
As we noted in an earlier slide, virtually
all indexing techniques require the
triangular inequality to hold.
DTW does NOT obey the
triangular inequality!
* Agrawal, R., Lin, K. I., Sawhney, H.
S., & Shim, K. (1995). Fast similarity
search in the presence of noise, scaling,
and translation in times-series
databases. VLDB pp. 490-501.
In fact, it was shown that DTW can be
indexed! (VLDB02)
We won’t give details here, other than
to note that the technique is based on
the lowerbounding technique we have
just seen
Let us quickly see some success
stories, problems that we now solve,
given that we can index DTW
Success Story I
The lower bounding
technique has been used
to support indexing of
massive archives of
handwritten text.
Surprisingly, DTW works
better on this problem than
more sophisticated approaches
like Markov models
R. Manmatha, T. M. Rath: Indexing of Handwritten Historical Documents - Recent Progress. In:
Proc. of the 2003 Symposium on Document Image Understanding Technology (SDIUT), Greenbelt, MD, April 9-11, 2003, pp. 77-85.
T. M. Rath and R. Manmatha (2002): Lower-Bounding of Dynamic Time Warping Distances for
Multivariate Time Series. Technical Report MM-40, Center for Intelligent Information Retrieval, University of Massachusetts Amherst.
Grease is
the word…
Success Story II
The lower bounding
technique has been used
to support “query by
humming”, by several
groups of researchers
Best 3 Matches
1) Bee Gees: Grease
2) Robbie Williams: Grease
3) Sarah Black: Heatwave
Yunyue Zhu, Dennis Shasha (2003). Query by Humming: a Time Series Database Approach, SIGMOD.
Ning Hu, Roger B. Dannenberg (2003). Polyphonic Audio Matching and Alignment for Music Retrieval
Success Story III
The lower bounding
technique is being used
by ChevronTexaco for
comparing seismic data
Weighted Distance Measures I
Intuition: For some queries
different parts of the sequence
are more important.
Weighting features is a well known technique in the machine learning community to improve classification
and the quality of clustering.
Weighted Distance Measures II
DQ, C  
DQ, C ,W  
 qi  ci 
n
D(Q,C)
2
i 1
 wi qi  ci 
n
2
D(Q,C,W)
i 1
The height of this histogram
indicates the relative importance
of that part of the query
W
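A direct transcription of the weighted measure, as a sketch. The toy values are made up; setting every weight to 1 recovers the ordinary Euclidean distance, and a weight of 0 makes that part of the query irrelevant.

```python
import math


def weighted_euclidean(q, c, w):
    """Euclidean distance with a per-point weight w_i."""
    if not (len(q) == len(c) == len(w)):
        raise ValueError("q, c and w must have the same length")
    return math.sqrt(sum(wi * (qi - ci) ** 2
                         for qi, ci, wi in zip(q, c, w)))


q = [0.0, 0.0]
c = [3.0, 4.0]
unweighted = weighted_euclidean(q, c, [1.0, 1.0])
first_only = weighted_euclidean(q, c, [1.0, 0.0])
second_only = weighted_euclidean(q, c, [0.0, 1.0])
```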
Weighted Distance Measures III
How do we set the weights?
One Possibility: Relevance Feedback
Definition: Relevance Feedback is the reformulation of a search query in response
to feedback provided by the user for the results of previous versions of the query.
Term Vector:  [Jordan, Cow, Bull, River]
Term Weights: [1, 1, 1, 1]
Search → Display Results → Gather Feedback → Update Weights
Term Vector:  [Jordan, Cow, Bull, River]
Term Weights: [1.1, 1.7, 0.3, 0.9]
Relevance Feedback for Time Series
The original query
The weight vector. Initially, all weights are the same.
Note: In this example we are using a piecewise linear
approximation of the data. We will learn more about this
representation later.
The initial query is executed, and the five best matches are
shown (in the dendrogram).
One by one the 5 best matching sequences will appear, and the
user will rank them from very bad (-3) to very good (+3).
Based on the user feedback, both the shape and the weight
vector of the query are changed.
The new query can be executed.
The hope is that the query shape and weights will converge
to the optimal query.
Similarity Measures
Many papers have introduced a new
similarity measure for time series, do
they make a contribution?
• 61.9% of the papers fail to show a single example
of a matching time series under the measure.
• 52.3% of the papers fail to compare their new
measure to a single strawman.
• 85.7% of the papers fail to demonstrate an
objective test of their new measure.
38.1% of the papers do show an example of a
query and a match, but what does that mean if we
don’t see all the non-matches? … it means little
Query
Best Match
2nd Best Match
3rd Best Match
Database
Subjective Evaluation of Similarity Measures
We believe that one of the best (subjective) ways
to evaluate a proposed similarity measure is to use it
to create a dendrogram of several time series
from the domain of interest.
[Figure: two dendrograms of the same 8 time series, one built with Euclidean Distance and one with a distance measure introduced in one of the papers in the survey]
Objective Evaluation of Similarity Measures
We can use nearest neighbor
classification to evaluate the
usefulness of proposed distance
measures.
We compared 11 different measures to
Euclidean distance, using 1-NN.
Some of the measures require the user to set
some parameters. In these cases we wrapped
the classification algorithm in a loop for each
parameter, searched over all possible
parameters and reported only the best result.
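The evaluation protocol can be sketched as leave-one-out 1-NN with a pluggable distance function. The tiny labelled dataset below is made up for illustration and is not one of the benchmark datasets; any distance measure with the same call signature could be passed in.

```python
import math


def leave_one_out_error(series, labels, distance):
    """Fraction of items whose nearest other item has a different label."""
    errors = 0
    for i, (item, label) in enumerate(zip(series, labels)):
        others = [j for j in range(len(series)) if j != i]
        nearest = min(others, key=lambda j: distance(item, series[j]))
        if labels[nearest] != label:
            errors += 1
    return errors / len(series)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


series = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 5.0]]
labels = ["a", "a", "b", "b", "b"]
error_rate = leave_one_out_error(series, labels, euclidean)
```

For a measure with parameters, this call would sit inside a loop over the parameter settings, as described above.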
Cylinder-Bell-Funnel: This
synthetic dataset has been in
the literature for 8 years, and
has been cited at least a
dozen times. It is a 3-class
problem; we create 128
examples of each class for
these experiments.
Control-Chart: This
synthetic dataset has been
freely available from the UCI
Data Archive since June
1998. It is a 6-class problem,
with 100 examples of each
class.
Results: Classification Error Rates
Approach                     Cylinder-Bell-F’   Control-Chart
Euclidean Distance           0.003              0.013
Aligned Subsequence          0.451              0.623
Piecewise Normalization      0.130              0.321
Autocorrelation Functions    0.380              0.116
Cepstrum                     0.570              0.458
String (Suffix Tree)         0.206              0.578
Important Points             0.387              0.478
Edit Distance                0.603              0.622
String Signature             0.444              0.695
Cosine Wavelets              0.130              0.371
Hölder                       0.331              0.593
Piecewise Probabilistic      0.202              0.321
Let's summarize, then take a break!
• Almost all time series problems require similarity calculations.
• Euclidean Distance accounts for 90% of the literature.
• DTW outperforms Euclidean distance for classification, clustering,
indexing etc.
• DTW is very slow compared to Euclidean distance, but recent work
on tight lower bounds has opened up the possibility of tackling
massive datasets that would have been unthinkable just a few years
ago…
• More generally, lower bounding is useful.
• Weighted distance measures may be useful for some problems.
Motivating example revisited…
You go to the doctor because of chest pains.
Your ECG looks strange…
Your doctor wants to search a database to find similar ECGs, in the
hope that they will offer clues about your condition...
Two questions:
•How do we define similar?
•How do we search quickly?
Indexing Time Series
We have seen techniques for assessing the similarity of
two time series.
However we have not addressed the problem of finding
the best match to a query in a large database (other than
the lower bounding trick)
The obvious solution, to retrieve and
examine every item (sequential
scanning), simply does not scale to
large datasets.
[Figure: Query Q compared against Database C, sequences 1 to 10]
We need some way to index the data...
We can project time series
of length n into n-dimensional space.
The first value in C is the
X-axis, the second value in
C is the Y-axis etc.
One advantage of doing
this is that we have
abstracted away the details
of “time series”, now all
query processing can be
imagined as finding points
in space...
…we can project the query time
series Q into the same n-dimensional
space and simply look for the nearest
points.
Interesting Sidebar
The Minkowski Metrics have
simple geometric
interpretations...
Euclidean
Weighted Euclidean
Manhattan
Max
Mahalanobis
…the problem is that we have to look at
every point to find the nearest neighbor..
We can group clusters of datapoints
with “boxes”, called Minimum
Bounding Rectangles (MBR).
[Figure: datapoints grouped into MBRs R1 to R9]
We can further recursively group
MBRs into larger MBRs….
…these nested MBRs are organized
as a tree (called a spatial access tree
or a multidimensional tree). Examples
include R-tree, Hybrid-Tree etc.
[Figure: tree of nested MBRs. The root holds R10, R11 and R12; below them are R1 R2 R3, R4 R5 R6 and R7 R8 R9, which point to data nodes containing points.]
We can define a function, MINDIST(point, MBR), which tells us
the minimum possible distance between any point and any MBR, at
any level of the tree.
MINDIST(point, MBR) = 5
MINDIST(point, MBR) = 0
Note that this is another example of a useful lower bound
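For Euclidean distance and an axis-aligned rectangle, MINDIST can be sketched as below. Representing an MBR by its low and high corners is an assumption of this example, as are the particular coordinates.

```python
import math


def mindist(point, mbr_low, mbr_high):
    """Smallest possible Euclidean distance from a point to any location
    inside an axis-aligned MBR given by its low and high corners."""
    total = 0.0
    for p, low, high in zip(point, mbr_low, mbr_high):
        if p < low:
            total += (low - p) ** 2
        elif p > high:
            total += (p - high) ** 2
        # Otherwise the point lies within the MBR's extent on this axis.
    return math.sqrt(total)


outside = mindist([0.0, 0.0], [3.0, 4.0], [6.0, 8.0])
inside = mindist([4.0, 5.0], [3.0, 4.0], [6.0, 8.0])
```

No object inside the MBR can be closer to the point than this value, which is what makes it usable for pruning.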
We can use the MINDIST(point, MBR), to do fast search..
[Figure: the tree of MBRs, with MINDIST from the query annotated on the root entries: R10 = 0, R11 = 10, R12 = 17. The leaves are data nodes containing points.]
We now go to disk, and retrieve all the data objects whose pointers are in the green
node. We measure the true distance between our query and those objects. Let us
imagine two scenarios, the closest object, the “best-so-far” has a value of..
• 1.5 units
(we are done searching!)
• 4.0 units
(we have to look in R2, but then we are done)
[Figure: the tree annotated with MINDIST values. Root entries: R10 = 0, R11 = 10, R12 = 17. Entries below R10: R1 = 0, R2 = 2, R3 = 10.]
If we project a query into n-dimensional space, how many
additional (nonempty) MBRs
must we examine before we are
guaranteed to find the best match?
For the one dimensional case, the answer is clearly 2...
For the two dimensional case, the answer is 8...
For the three dimensional case, the answer is 26...
More generally, in n-dimensional space we must examine $3^n - 1$ MBRs.
n = 21 → 10,460,353,202 MBRs
This is known as the curse of dimensionality.
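The count follows from treating the query's cell as the center of a block of 3 cells per dimension and excluding the center itself. A short sketch, with a hypothetical function name, evaluates the formula for a few dimensionalities:

```python
def neighbouring_mbrs(n):
    """Cells adjacent to one cell of an n-dimensional grid: 3**n - 1."""
    return 3 ** n - 1


counts = {n: neighbouring_mbrs(n) for n in (1, 2, 3, 21)}
for n, count in counts.items():
    print(f"n = {n}: {count:,} MBRs")
```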
Spatial Access Methods
We can use Spatial Access Methods like the R-Tree to index our
data, but…
The performance of R-Trees degrades exponentially with the
number of dimensions. Somewhere above 6-20 dimensions the R-Tree degrades to linear scanning.
Often we want to index time series with hundreds, perhaps even
thousands of features….
GEMINI GEneric Multimedia INdexIng
{Christos Faloutsos}
• Establish a distance metric from a domain expert.
• Produce a dimensionality reduction technique that
reduces the dimensionality of the data from n to N,
where N can be efficiently handled by your
favorite SAM.
• Produce a distance measure defined on the N
dimensional representation of the data, and prove
that it obeys Dindexspace(A,B) ≤ Dtrue(A,B),
i.e. the lower bounding lemma.
• Plug into an off-the-shelf SAM.
We have 6 objects in 3-D space. We issue a query to find all
objects within 1 unit of the point (-3, 0, -2)...
[Figure: 3-D scatter plot of the six objects A, B, C, D, E and F]
Consider what would happen if
we issued the same query after
reducing the dimensionality to
2, assuming the dimensionality
technique obeys the lower
bounding lemma...
The query successfully
finds the object E.
[Figure: the same 3-D scatter plot of objects A to F]
Example of a dimensionality reduction technique
in which the lower bounding lemma is satisfied
Informally, it’s OK if objects appear
closer in the dimensionality reduced
space than in the true space.
[Figure: 2-D scatter plot of objects A to F after dimensionality reduction]
Note that because of the
dimensionality reduction, object F
appears to be less than one unit from
the query (it is a false alarm).
This is OK so long as it does not
happen too much, since we can
always retrieve it, then test it in
the true, 3-dimensional space. This
would leave us with just E, the
correct answer.
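The filter-then-verify idea can be sketched as follows. The coordinates of the six objects are invented for this example (the slides only show them in a plot), and the reduction simply drops the third coordinate, which can only shrink Euclidean distances and therefore obeys the lower bounding lemma.

```python
import math

# Hypothetical coordinates; only the query point comes from the slides.
objects = {
    "A": (2.0, 3.0, 1.0),
    "B": (1.0, 0.0, 0.5),
    "C": (0.0, 1.0, 1.0),
    "D": (1.0, -1.0, 2.0),
    "E": (-3.2, 0.3, -2.4),
    "F": (-2.5, 0.5, 1.0),
}
query = (-3.0, 0.0, -2.0)
radius = 1.0


def true_distance(a, b):
    return math.dist(a, b)


def index_distance(a, b):
    # Distance in the reduced space: the first two coordinates only.
    return math.dist(a[:2], b[:2])


# Filter in the cheap, reduced space (may contain false alarms) ...
candidates = [name for name, point in objects.items()
              if index_distance(point, query) <= radius]
# ... then verify the survivors in the true, 3-dimensional space.
answers = [name for name in candidates
           if true_distance(objects[name], query) <= radius]
```

With these made-up coordinates the filter step returns E and F, and the verification step discards the false alarm F.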
Example of a dimensionality reduction technique
in which the lower bounding lemma is not satisfied
Informally, some objects appear
further apart in the dimensionality
reduced space than in the true space.
[Figure: 2-D scatter plot of objects A to F after dimensionality reduction]
Note that because of the
dimensionality reduction, object E
appears to be more than one unit
from the query (it is a false
dismissal).
This is unacceptable.
We have failed to find the true
answer set to our query.
The examples on the previous
slides illustrate why the lower
bounding lemma is so
important.
Now all we have to do is to
find a dimensionality
reduction technique that obeys
the lower bounding lemma,
and we can index our time
series!
Notation for Dimensionality Reduction
For the future discussion of dimensionality reduction
we will assume that
M is the number of time series in our database.
n is the original dimensionality of the data.
(i.e. the length of the time series)
N is the reduced dimensionality of the data.
CRatio = N/n is the compression ratio.
An Example of a
Dimensionality Reduction
Technique I
[Figure: time series C; x-axis 0 to 140]
n = 128
Raw
Data
0.4995
0.5264
0.5523
0.5761
0.5973
0.6153
0.6301
0.6420
0.6515
0.6596
0.6672
0.6751
0.6843
0.6954
0.7086
0.7240
0.7412
0.7595
0.7780
0.7956
0.8115
0.8247
0.8345
0.8407
0.8431
0.8423
0.8387
…
The graphic shows a
time series with 128
points.
The raw data used to
produce the graphic is
also reproduced as a
column of numbers (just
the first 30 or so points are
shown).
An Example of a
Dimensionality Reduction
Technique II
[Figure: time series C and the first few sine waves of its decomposition; x-axis 0 to 140]
Raw
Data
Fourier
Coefficients
0.4995
0.5264
0.5523
0.5761
0.5973
0.6153
0.6301
0.6420
0.6515
0.6596
0.6672
0.6751
0.6843
0.6954
0.7086
0.7240
0.7412
0.7595
0.7780
0.7956
0.8115
0.8247
0.8345
0.8407
0.8431
0.8423
0.8387
…
1.5698
1.0485
0.7160
0.8406
0.3709
0.4670
0.2667
0.1928
0.1635
0.1602
0.0992
0.1282
0.1438
0.1416
0.1400
0.1412
0.1530
0.0795
0.1013
0.1150
0.1801
0.1082
0.0812
0.0347
0.0052
0.0017
0.0002
...
We can decompose the
data into 64 pure sine
waves using the Discrete
Fourier Transform (just the
first few sine waves are
shown).
The Fourier Coefficients
are reproduced as a
column of numbers (just
the first 30 or so
coefficients are shown).
Note that at this stage we
have not done
dimensionality reduction,
we have merely changed
the representation...
An Example of a
Dimensionality Reduction
Technique III
[Figure: time series C and its approximation C'; x-axis 0 to 140]
We have discarded 15/16 of the data.
Raw
Data
0.4995
0.5264
0.5523
0.5761
0.5973
0.6153
0.6301
0.6420
0.6515
0.6596
0.6672
0.6751
0.6843
0.6954
0.7086
0.7240
0.7412
0.7595
0.7780
0.7956
0.8115
0.8247
0.8345
0.8407
0.8431
0.8423
0.8387
…
Fourier Coefficients (full list), then Truncated Fourier Coefficients (the 8 values after the "...")
1.5698
1.0485
0.7160
0.8406
0.3709
0.4670
0.2667
0.1928
0.1635
0.1602
0.0992
0.1282
0.1438
0.1416
0.1400
0.1412
0.1530
0.0795
0.1013
0.1150
0.1801
0.1082
0.0812
0.0347
0.0052
0.0017
0.0002
...
1.5698
1.0485
0.7160
0.8406
0.3709
0.4670
0.2667
0.1928
n = 128
N = 8
CRatio = 1/16
… however, note that the first
few sine waves tend to be the
largest (equivalently, the
magnitudes of the Fourier
coefficients tend to decrease
as you move down the
column).
We can therefore truncate
most of the small coefficients
with little effect.
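A minimal sketch of the truncation step, assuming NumPy is available. The function names are illustrative, and the sketch keeps the first N complex coefficients of an orthonormal DFT (so 2N real numbers), which may differ from how the slide counts coefficients. By Parseval's theorem the distance between truncated coefficient vectors cannot exceed the true Euclidean distance.

```python
import numpy as np

def dft_reduce(x, N):
    """Return the first N complex coefficients of the orthonormal DFT of x."""
    return np.fft.fft(x, norm="ortho")[:N]

def dft_reconstruct(coeffs, n):
    """Rebuild a length-n real series from its first N coefficients (N <= n/2)."""
    N = len(coeffs)
    full = np.zeros(n, dtype=complex)
    full[:N] = coeffs
    # Restore the conjugate-symmetric half so that the inverse is real valued.
    full[n - N + 1:] = np.conj(coeffs[1:][::-1])
    return np.fft.ifft(full, norm="ortho").real

# Two synthetic random walks stand in for real time series.
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(128))
y = np.cumsum(rng.standard_normal(128))

d_true = np.linalg.norm(x - y)
d_reduced = np.linalg.norm(dft_reduce(x, 8) - dft_reduce(y, 8))  # never above d_true
x_approx = dft_reconstruct(dft_reduce(x, 8), len(x))
```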
An Example of a
Dimensionality Reduction
Technique IIII
[Figure: time series C and its approximation C'; x-axis 0 to 140]
Raw
Data
0.4995
0.5264
0.5523
0.5761
0.5973
0.6153
0.6301
0.6420
0.6515
0.6596
0.6672
0.6751
0.6843
0.6954
0.7086
0.7240
0.7412
0.7595
0.7780
0.7956
0.8115
0.8247
0.8345
0.8407
0.8431
0.8423
0.8387
…
Fourier Coefficients (full list), then Sorted Truncated Fourier Coefficients (the 8 values after the "...")
1.5698
1.0485
0.7160
0.8406
0.3709
0.1670
0.4667
0.1928
0.1635
0.1302
0.0992
0.1282
0.2438
0.2316
0.1400
0.1412
0.1530
0.0795
0.1013
0.1150
0.1801
0.1082
0.0812
0.0347
0.0052
0.0017
0.0002
...
1.5698
1.0485
0.7160
0.8406
0.2667
0.1928
0.1438
0.1416
Instead of taking the first few
coefficients, we could take
the best coefficients (those with the largest magnitude).
This can help greatly in terms
of approximation quality, but
makes indexing hard
(impossible?).
Note that this also applies to wavelets.
[Figure: one time series approximated by DFT, DWT, SVD, APCA, PAA, PLA and SYM (symbolic string "aabbbccb"); x-axes 0 to 120]
Time Series Representations
  Data Adaptive
    Sorted Coefficients
    Piecewise Polynomial
      Piecewise Linear Approximation
        Interpolation
        Regression
      Adaptive Piecewise Constant Approximation
    Singular Value Decomposition
    Symbolic
      Natural Language
      Strings
    Trees
  Non Data Adaptive
    Wavelets
      Orthonormal
        Haar
        Daubechies dbn n > 1
      Bi-Orthonormal
        Coiflets
        Symlets
    Random Mappings
    Spectral
      Discrete Fourier Transform
      Discrete Cosine Transform
    Piecewise Aggregate Approximation
[Figure: one time series approximated by DFT, DWT, SVD, APCA, PAA, PLA and SYM (symbolic string "UUCUCUCD"); x-axes 0 to 120]
Discrete Fourier
Transform I
Basic Idea: Represent the time
series as a linear combination of
sines and cosines, but keep only the
first n/2 coefficients.
[Figure: time series X, its approximation X', and the sine waves 0 to 9 that compose it; x-axis 0 to 140]
Why n/2 coefficients? Because each
sine wave requires 2 numbers, for the
phase (w) and amplitude (A,B).
Jean Fourier
1768-1830
$$C(t) = \sum_{k=1}^{n} \bigl( A_k \cos(2\pi w_k t) + B_k \sin(2\pi w_k t) \bigr)$$
Excellent free Fourier Primer
Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS95-37, Department of Computer Science, Brown University, 1995.
http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/
Discrete Fourier
Transform II
[Figure: time series X, its DFT approximation X', and the sine waves 0 to 9; x-axis 0 to 140]
Pros and Cons of DFT as a time series
representation.
• Good ability to compress most natural signals.
• Fast, off the shelf DFT algorithms exist. O(n log(n)).
• (Weakly) able to support time warped queries.
• Difficult to deal with sequences of different lengths.
• Cannot support weighted distance measures.
Note: The related transform DCT, uses only cosine
basis functions. It does not seem to offer any
particular advantages over DFT.
Discrete Wavelet
Transform I
Basic Idea: Represent the time
series as a linear combination of
Wavelet basis functions, but keep
only the first N coefficients.
[Figure: time series X, its DWT approximation X', and the basis functions Haar 0 to Haar 7; x-axis 0 to 140]
Although there are many different
types of wavelets, researchers in
time series mining/indexing
generally use Haar wavelets.
Alfred Haar
1885-1933
Haar wavelets seem to be as
powerful as the other wavelets for
most problems and are very easy to
code.
Excellent free Wavelets Primer
Stollnitz, E., DeRose, T., & Salesin, D. (1995). Wavelets for
computer graphics: A primer. IEEE Computer Graphics and
Applications.
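To illustrate how little code the Haar transform needs, here is a minimal sketch assuming NumPy and a series whose length is a power of two. The function name and the orthonormal scaling (dividing by the square root of 2 at each level) are choices made for this sketch; with that scaling the transform preserves Euclidean distance, so keeping only the first N coefficients can only shrink distances.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a series whose length is a power of two."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # scaled pairwise averages
        diff = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # scaled pairwise differences
        levels.append(diff)                        # detail at this resolution
        x = avg                                    # recurse on the averages
    levels.append(x)                               # overall (scaled) average
    # Order coarsest first, so truncation keeps the low-resolution shape.
    return np.concatenate(levels[::-1])

series = np.arange(1.0, 9.0)       # 1, 2, ..., 8
coeffs = haar_transform(series)
first_four = coeffs[:4]            # a reduced representation with N = 4
```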
Discrete Wavelet
Transform II
[Figure: time series X, its DWT approximation X', and the basis functions Haar 0 and Haar 1; x-axis 0 to 140]
Ingrid Daubechies
1954 -
We have only considered one type of wavelet, there
are many others.
Are the other wavelets better for indexing?
YES: I. Popivanov, R. Miller. Similarity Search Over Time
Series Data Using Wavelets. ICDE 2002.
NO: K. Chan and A. Fu. Efficient Time Series Matching by
Wavelets. ICDE 1999
Later in this tutorial I will answer
this question.
Discrete Wavelet
Transform III
Pros and Cons of Wavelets as a time series
representation.
[Figure: time series X, its DWT approximation X', and the basis functions Haar 0 to Haar 7; x-axis 0 to 140]
• Good ability to compress stationary signals.
• Fast linear time algorithms for DWT exist.
• Able to support some interesting non-Euclidean
similarity measures.
• Signals must have a length n = 2^some_integer.
• Works best if N = 2^some_integer. Otherwise wavelets
approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.
Singular Value
Decomposition I
Basic Idea: Represent the time
series as a linear combination of
eigenwaves but keep only the first
N coefficients.
[Figure: time series X, its SVD approximation X', and eigenwave 0 to eigenwave 7; x-axis 0 to 140]
SVD is similar to Fourier and
Wavelet approaches in that we
represent the data in terms of a
linear combination of shapes (in
this case eigenwaves).
SVD differs in that the eigenwaves
are data dependent.
SVD has been successfully used in the text
processing community (where it is known as
Latent Semantic Indexing) for many years.
Good free SVD Primer
Singular Value Decomposition - A Primer.
Sonia Leach
James Joseph Sylvester
1814-1897
Camille Jordan
1838-1921
Eugenio Beltrami
1835-1899
Singular Value
Decomposition II
How do we create the eigenwaves?
We have previously seen that
we can regard time series as
points in high dimensional
space.
[Figure: time series X, its SVD approximation X', and eigenwave 0 to eigenwave 7; x-axis 0 to 140]
We can rotate the axes such
that axis 1 is aligned with the
direction of maximum
variance, axis 2 is aligned with
the direction of maximum
variance orthogonal to axis 1,
etc.
Since the first few eigenwaves
contain most of the variance of
the signal, the rest can be
truncated with little loss.
$$A = U \Sigma V^{T}$$
This process can be achieved by factoring an M
by n matrix of time series into 3 other matrices,
and truncating the new matrices at size N.
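The factor-and-truncate step can be sketched with NumPy's `numpy.linalg.svd`. The data here are synthetic random walks, and the variable names (`eigenwaves`, `reduced`) are illustrative only. Because the rows of V^T are orthonormal, projecting onto the first N of them can only shrink Euclidean distances.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, N = 50, 128, 8
# M synthetic random walks of length n stand in for a real database.
A = np.cumsum(rng.standard_normal((M, n)), axis=1)

# Factor A into three matrices: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigenwaves = Vt[:N]              # the first N eigenwaves (rows of V^T)
reduced = A @ eigenwaves.T       # N coefficients per time series, shape (M, N)
approx = reduced @ eigenwaves    # rank-N reconstruction, shape (M, n)

d_true = np.linalg.norm(A[0] - A[1])
d_reduced = np.linalg.norm(reduced[0] - reduced[1])   # never above d_true
```

Note that the eigenwaves depend on every row of A, which is why an insertion into the database in principle requires recomputing the decomposition.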
Singular Value
Decomposition III
Pros and Cons of SVD as a time series
representation.
[Figure: time series X, its SVD approximation X', and eigenwave 0 to eigenwave 7; x-axis 0 to 140]
• Optimal linear dimensionality reduction technique.
• The eigenvalues tell us something about the
underlying structure of the data.
• Computationally very expensive.
• Time: O(Mn^2)
• Space: O(Mn)
• An insertion into the database requires recomputing
the SVD.
• Cannot support weighted distance measures or non
Euclidean measures.
Note: There has been some promising research into
mitigating SVD's time and space complexity.
Chebyshev
Polynomials
Basic Idea: Represent the time series
as a linear combination of
Chebyshev Polynomials.
[Figure: time series X and its Chebyshev approximation X'; x-axis 0 to 140]
The first Chebyshev polynomials T_i(x):
T_0(x) = 1
T_1(x) = x
T_2(x) = 2x^2 − 1
T_3(x) = 4x^3 − 3x
T_4(x) = 8x^4 − 8x^2 + 1
T_5(x) = 16x^5 − 20x^3 + 5x
T_6(x) = 32x^6 − 48x^4 + 18x^2 − 1
T_7(x) = 64x^7 − 112x^5 + 56x^3 − 7x
T_8(x) = 128x^8 − 256x^6 + 160x^4 − 32x^2 + 1
Pros and Cons of Chebyshev Polynomials as a time
series representation.
• Time series can be of arbitrary length
• Only O(n) time complexity
• Is able to support multi-dimensional
time series*.
Pafnuty Chebyshev
1821-1894
* Raymond T. Ng, Yuhan Cai: Indexing Spatio-Temporal
Trajectories with Chebyshev Polynomials. SIGMOD 2004
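The polynomials listed on this slide follow the standard recurrence T_{k+1}(x) = 2x T_k(x) − T_{k−1}(x). The short sketch below (the function name is illustrative) regenerates their coefficients, which is a quick way to check the list.

```python
def chebyshev_coefficients(k):
    """Coefficients of T_k(x), lowest power first, via T_{k+1} = 2x*T_k - T_{k-1}."""
    prev, cur = [1], [0, 1]                  # T_0 = 1 and T_1 = x
    if k == 0:
        return prev
    for _ in range(k - 1):
        nxt = [0] + [2 * c for c in cur]     # multiply T_k by 2x
        for i, c in enumerate(prev):
            nxt[i] -= c                      # subtract T_{k-1}
        prev, cur = cur, nxt
    return cur

t3 = chebyshev_coefficients(3)   # coefficients of 4x^3 - 3x
t8 = chebyshev_coefficients(8)   # coefficients of the degree-8 polynomial
```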
Piecewise Linear
Approximation I
Basic Idea: Represent the time
series as a sequence of straight
lines.
[Figure: time series X and its piecewise linear approximation X'; x-axis 0 to 140]
Karl Friedrich Gauss
1777 - 1855
Lines could be connected, in
which case we are allowed
N/2 lines.
Each line segment has
• length
• left_height
(right_height can
be inferred by looking at
the next segment)
If lines are disconnected, we
are allowed only N/3 lines.
Each line segment has
• length
• left_height
• right_height
Personal experience on dozens of datasets
suggests disconnected is better. Also only
disconnected allows a lower bounding
Euclidean approximation.
How do we obtain the Piecewise Linear
Approximation?
Piecewise Linear
Approximation II
[Figure: time series X and its piecewise linear approximation X'; x-axis 0 to 140]
Optimal Solution is O(n^2 N), which is too
slow for data mining.
A vast body of work on faster heuristic
solutions to the problem can be classified
into the following classes:
• Top-Down
• Bottom-Up
• Sliding Window
• Other (genetic algorithms, randomized algorithms,
Bspline wavelets, MDL etc)
Recent extensive empirical evaluation of all
approaches suggests that Bottom-Up is the best
approach overall.
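A minimal sketch of the Bottom-Up idea, assuming NumPy: start from the finest possible segmentation and repeatedly merge the adjacent pair whose merged least-squares line has the lowest error. The function names are illustrative, the stopping rule (a target number of segments) is one of several possible choices, and this naive version recomputes every merge cost on each pass, so it is slower than a careful implementation.

```python
import numpy as np

def fit_error(x, start, end):
    """Sum of squared residuals of the least-squares line through x[start:end]."""
    t = np.arange(start, end)
    slope, intercept = np.polyfit(t, x[start:end], 1)
    return float(np.sum((x[start:end] - (slope * t + intercept)) ** 2))

def bottom_up(x, num_segments):
    """Greedily merge adjacent segments until only num_segments remain."""
    x = np.asarray(x, dtype=float)
    # Finest segmentation: consecutive pairs of points, as (start, end) ranges.
    bounds = list(range(0, len(x), 2)) + [len(x)]
    segments = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    while len(segments) > num_segments:
        # Cost of merging each adjacent pair into a single line segment.
        costs = [fit_error(x, segments[i][0], segments[i + 1][1])
                 for i in range(len(segments) - 1)]
        i = int(np.argmin(costs))
        segments[i:i + 2] = [(segments[i][0], segments[i + 1][1])]
    return segments

# A synthetic "tent" shape: one rising line followed by one falling line.
tent = np.concatenate([np.linspace(0, 1, 20), np.linspace(1, 0, 20)])
segments = bottom_up(tent, 2)
```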
Piecewise Linear
Approximation III
Pros and Cons of PLA as a time series
representation.
[Figure: time series X and its piecewise linear approximation X'; x-axis 0 to 140]
• Good ability to compress natural signals.
• Fast linear time algorithms for PLA exist.
• Able to support some interesting non-Euclidean
similarity measures, including weighted measures,
relevance feedback, fuzzy queries…
• Already widely accepted in some communities (e.g.,
biomedical).
• Not (currently) indexable by any data structure (but
does allow fast sequential scanning).
Symbolic
Approximation I
Basic Idea: Convert the time series into an alphabet
of discrete symbols. Use string indexing techniques
to manage the data.
[Figure: time series X and its symbolic approximation X', the string "a a b b b c c b"; x-axis 0 to 140]
Potentially an interesting idea, but all work thus far
is very ad hoc.
Pros and Cons of Symbolic Approximation
as a time series representation.
• Potentially, we could take advantage of a wealth of
techniques from the very mature field of string
processing and bioinformatics.
• It is not clear how we should discretize the time
series (discretize the values, the slope, shapes? How
big of an alphabet? etc)
• There is no known technique to allow the support of
Euclidean queries. (Breaking news, later in tutorial)
Symbolic
Approximation II
SAX
[Figure: time series X and its SAX approximation X'; x-axis 0 to 140]
Clipped Data
[Figure: time series X and its clipped approximation X'; x-axis 0 to 140]
00000000000000000000000000000000000000000000111111111111111111111110000110000001111111111111111111111111111111111111111111111111
…110000110000001111….
Run lengths: 44 Zeros, 23 Ones, 4 Zeros, 2 Ones, 6 Zeros, 49 Ones
Encoded as: 44 Zeros|23|4|2|6|49
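The run-length encoding shown above can be reproduced with a short sketch. The clipping rule used here (1 when a value is above the series mean, otherwise 0) is an assumption, since the slide does not state the threshold, and the function names are illustrative.

```python
from itertools import groupby

def clip(series):
    """Return 1 where a value is above the series mean, else 0 (assumed rule)."""
    mean = sum(series) / len(series)
    return [1 if v > mean else 0 for v in series]

def run_lengths(bits):
    """Return the first bit and the lengths of the alternating runs that follow."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0], runs

# The 128-bit clipped sequence from the slide, rebuilt from its run lengths.
bits = [0] * 44 + [1] * 23 + [0] * 4 + [1] * 2 + [0] * 6 + [1] * 49
first_bit, runs = run_lengths(bits)
```

Since the runs alternate, storing the first bit and six integers is enough to rebuild all 128 bits.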
Piecewise Aggregate
Approximation I
Basic Idea: Represent the time series as a
sequence of box basis functions.
Note that each box is the same length.
[Figure: time series X and its PAA approximation X', made of eight equal-length segments x1 to x8; x-axis 0 to 140]
$$\bar{x}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N} i} x_j$$
Given the reduced dimensionality representation
we can calculate the approximate Euclidean
distance as...
$$D_R(\bar{X}, \bar{Y}) = \sqrt{\frac{n}{N}} \sqrt{\sum_{i=1}^{N} (\bar{x}_i - \bar{y}_i)^2}$$
This measure is provably lower bounding.
Independently introduced by two authors
Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000)
Byoung-Kee Yi, Christos Faloutsos, VLDB (2000)
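The two PAA formulas translate almost directly into NumPy. This is a minimal sketch that assumes N divides n exactly; the function names are illustrative.

```python
import numpy as np

def paa(x, N):
    """Mean of each of N equal-length segments of x (assumes N divides len(x))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n % N:
        raise ValueError("this sketch assumes N divides n")
    return x.reshape(N, n // N).mean(axis=1)

def paa_distance(x_bar, y_bar, n):
    """Lower-bounding distance between two PAA representations of length-n series."""
    N = len(x_bar)
    return np.sqrt(n / N) * np.sqrt(np.sum((x_bar - y_bar) ** 2))

example = paa([1, 2, 3, 4, 5, 6, 7, 8], 4)   # segment means of four pairs
```

For any two series of equal length, `paa_distance` on their reduced forms should never exceed their true Euclidean distance.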
Piecewise Aggregate
Approximation II
[Figure: time series X and its PAA approximation X', made of segments X1 to X8; x-axis 0 to 140]
Pros and Cons of PAA as a time series
representation.
• Extremely fast to calculate
• As efficient as other approaches (empirically)
• Supports queries of arbitrary lengths
• Can support any Minkowski metric
• Supports non Euclidean measures
• Supports weighted Euclidean distance
• Simple! Intuitive!
• If visualized directly, looks aesthetically unpleasing.
Adaptive Piecewise
Constant
Approximation I
Basic Idea: Generalize PAA to allow the
piecewise constant segments to have arbitrary
lengths.
Note that we now need 2 coefficients to represent
each segment, its value and its length.
[Figure: Raw Data X (Electrocardiogram), x-axis 0 to 250, with three approximations: Adaptive Representation (APCA) with segments <cv1,cr1> to <cv4,cr4>, Reconstruction Error 2.61; Haar Wavelet or PAA, Reconstruction Error 3.27; DFT, Reconstruction Error 3.11]
The intuition is this: many signals have little detail in some
places, and high detail in other places. APCA can adaptively fit
itself to the data, achieving better approximation.
Adaptive Piecewise
Constant
Approximation II
The high quality of the APCA had been noted by
many researchers.
However it was believed that the representation
could not be indexed because some coefficients
represent values, and some represent lengths.
However an indexing method was discovered!
(SIGMOD 2001 best paper award)
[Figure: time series X and its APCA approximation with segments <cv1,cr1> to <cv4,cr4>; x-axis 0 to 140]
Unfortunately, it is non-trivial to understand and
implement….
Adaptive Piecewise
Constant
Approximation III
Pros and Cons of APCA as a time
series representation.
[Figure: time series X and its APCA approximation with segments <cv1,cr1> to <cv4,cr4>; x-axis 0 to 140]
• Fast to calculate O(n).
• More efficient than other approaches (on some
datasets).
• Supports queries of arbitrary lengths.
• Supports non Euclidean measures.
• Supports weighted Euclidean distance.
• Supports fast exact queries, and even faster
approximate queries on the same data structure.
• Somewhat complex implementation.
• If visualized directly, looks aesthetically
unpleasing.
Natural Language
Pros and Cons of natural language as
a time series representation.
[Figure: time series X annotated with the description "rise, plateau, followed by a rounded peak"; x-axis 0 to 140]
• The most intuitive representation!
• Potentially a good representation for low
bandwidth devices like text-messengers.
• Difficult to evaluate.
To the best of my knowledge only one group is
working seriously on this representation. They
are the University of Aberdeen SUMTIME
group, headed by Prof. Jim Hunter.
Comparison of all dimensionality reduction techniques
• We can compare the time it takes to build the index, for
different sizes of databases, and different query lengths.
• We can compare the indexing efficiency. How long
does it take to find the best answer to our query? It turns out
that the fairest way to measure this is to measure the number of times we have
to retrieve an item from disk.
• We can simply compare features. Does approach X
allow weighted queries, queries of arbitrary lengths, is it
simple to implement…
The time needed
to build the
index
Black topped histogram
bars (in SVD) indicate
experiments abandoned
at 1,000 seconds.
The fraction of the data
that must be retrieved from
disk to answer a one
nearest neighbor query
[Figure: four 3-D bar charts, one each for SVD, DWT, DFT and PAA; vertical axis 0 to 1; horizontal axes 64 to 1024 and 8 to 20]
Dataset is Stock Market Data
The fraction of the data
that must be retrieved from
disk to answer a one
nearest neighbor query
[Figure: four 3-D bar charts, one each for DFT, DWT, PAA and APCA; vertical axis 0 to 0.5; horizontal axes 16 to 64 and 256 to 1024]
Dataset is mixture of many “structured” datasets, like ECGs
Summary of Results
On most datasets, it does not matter too much which
representation you choose. Wavelets, PAA, DFT all seem
to do about the same.
However, on some datasets, in particular those datasets
where the level of detail varies over time, APCA can do
much better.
Raw Data (Electrocardiogram)
Haar Wavelet
Adaptive Representation (APCA)
A Surprising Claim…
The claim I made on the previous slide might
surprise some people…
I said that for the most part it does not matter
which representation we use (in terms of
indexing efficiency), but many people have
written papers claiming otherwise.
Who is right?
More than 95% of all the time series
database/data mining papers published
do not make any contribution!
In the next 20 minutes I will justify the surprising
claim above.
Outline of the next 20 minutes of the
Tutorial
• My Claim
• Results of a Survey
• Size of test datasets
• Number of rival methods considered
• Diversity of test datasets
• Implementation Bias
• What it is, why it matters
• Data Bias
• What it is, why it matters
• Case Study: Similarity Measures
• Subjective testing
• Objective testing
• Concrete Suggestions
Important Note
• I am anxious that this work should not be taken
as being critical of the database/data mining
community.
• Note that several of my papers are among the
worst offenders in terms of weak experimental
evaluation!!!
• My goal is simply to demonstrate that empirical
evaluations in the past have often been
inadequate, and I hope this section of the tutorial
will encourage more extensive experimental
evaluations in the future.
The Claim
Much of the work in the time series data
mining literature suffers from two types
of experimental flaws, implementation
bias and data bias (defined later).
Because of these flaws, much of the work
has very little generalizability to real
world problems.
In More Detail…
Many of the contributions offer an
amount of “improvement” that would
have been completely dwarfed by the
variance that would have been observed
by testing on many real world datasets,
or the variance that would have been
observed by changing minor (unstated)
implementation details.
Time Series Data Mining Tasks
Indexing (Query by Content): Given a query time series
Q, and some similarity/dissimilarity measure D(Q,C), find
the nearest matching time series in database DB.
Clustering: Find natural groupings of the time series in
database DB under some similarity/dissimilarity measure
D(Q,C).
Classification: Given an unlabeled time series Q, assign it
to one of two or more predefined classes.
Literature Survey
We read more than 370 papers, but we only included
the subset of 58 papers actually cited in our paper
when assessing statistics.
The subset was chosen based on the following criteria.
• Was the paper ever referenced?
• Was the paper published in a conference or journal
likely to be read by a data miner?
In general the papers come from high quality
conferences (SIG)KDD (11), ICDE (11), VLDB (5),
SIGMOD/PODS (5), and CIKM (6).
A Cautionary Note
In presenting the results of the survey, we
echo the caution of Prechelt, that “while high
numbers resulting from such counting
cannot prove that the evaluation has high
quality, low numbers (suggest) that the
quality is low”.
Prechelt. L. (1995). A quantitative study of neural network
learning algorithm evaluation practices. In proceedings of the
4th Int’l Conference on Artificial Neural Networks.
Finding 1: Size of Test Datasets
We recorded the size of the test dataset for each paper.
Where two or more datasets are used, we considered
only the size of the largest.
The median size of the test databases was
only 10,000 objects. Approximately 84%
of the test databases are less than one
megabyte in size
This number only reflects the indexing papers, the
other papers have even smaller sizes.
Finding 2: Number of Rival Methods
We recorded the number of rival methods to which
the contribution of each paper is compared.
The median number is 1 (average is 0.91)
This number is even worse than it seems, because many of
the strawmen are very unrealistic. More about this later…
This number reflects all papers included in the survey.
Finding 3: Number of Test Datasets
We recorded the number of different datasets used
in the experimental evaluation.
The average is 1.85 datasets
(1.26 real and 0.59 synthetic)
In fact, this number may be optimistic, if you
count stock market data as being the same as
random walk data, then…
The average is 1.28 datasets
Having seen the statistics,
let us see why these low
numbers are a real
problem…
Data Bias
Definition: Data bias is the conscious or
unconscious use of a particular set of testing
data to confirm a desired finding.
Example: Suppose you are comparing Wavelets to Fourier methods,
the following datasets will produce drastically different results…
[Figure: two example time series, x-axes 0 to 600. One is good for wavelets and bad for Fourier; the other is good for Fourier and bad for wavelets]
Example of Data Bias: Who to Believe?
For the task of indexing time series for similarity search, which
representation is best, the Discrete Fourier Transform (DFT), or
the Discrete Wavelet Transform (Haar)?
• “Several wavelets outperform the DFT”.
• “DFT-based and DWT-based techniques yield
comparable results”.
• “Haar wavelets perform slightly better that DFT”
• “DFT filtering performance is superior to DWT*”
Example of Data Bias: Who to Believe II?
To find out who to believe (if anyone) we
performed an extraordinarily careful and
comprehensive set of experiments. For example…
• We used a quantum mechanical device to generate
random numbers.
• We averaged results over 100,000 experiments!
• For fairness, we use the same (randomly chosen)
subsequences for both approaches.
Take another quick look at the
conflicting claims, the next slide will
tell us who was correct…
• “Several wavelets outperform the DFT”.
• “DFT-based and DWT-based techniques yield
comparable results”.
• “Haar wavelets perform slightly better that DFT”
• “DFT filtering performance is superior to DWT*”
I tested on the Powerplant,
Infrasound and Attas datasets,
and I know DFT outperforms
the Haar wavelet
[Figure: bar chart of Pruning Power (0 to 1) for DFT and HAAR on the Powerplant, Infrasound and Attas (Aerospace) datasets]
Stupid Flanders! I tested on the
Network, ERPdata and Fetal EEG
datasets and I know that there
is no real difference between
DFT and Haar
[Figure: bar chart of Pruning Power (0 to 0.8) for DFT and HAAR on the Network, ERPdata and Fetal EEG datasets]
Those two clowns are both wrong!
I tested on the Chaotic,
Earthquake and Wind datasets,
and I am sure that the Haar
wavelet outperforms the DFT
[Figure: bar chart of Pruning Power (0 to 0.5) for DFT and HAAR on the Chaotic, Earthquake and Wind (3) datasets]
The Bottom Line
Any claims about the relative performance
of a time series indexing scheme that is
empirically demonstrated on only 2 or 3
datasets should be viewed with suspicion.
[Figure: three bar charts of Pruning Power for DFT and HAAR shown side by side: Powerplant, Infrasound, Attas (Aerospace); Network, ERPdata, Fetal EEG; Chaotic, Earthquake, Wind (3)]
Implementation Bias
Definition: Implementation bias is the
conscious or unconscious disparity in the
quality of implementation of a proposed
approach, vs. the quality of implementation
of the competing approaches.
Example: Suppose you want to compare your new representation
to DFT. You might use the simple O(n^2) DFT algorithm rather
than spend the time to code the more complex O(n log n) radix 2
algorithm. This would make your algorithm run relatively faster.
Example of Implementation Bias:
Similarity Searching
Algorithm sequential_scan(data, query)
  best_so_far = inf;
  for every item i in the database
    if euclidean_dist(data[i], query) < best_so_far
      pointer_to_best_match = i;
      best_so_far = euclidean_dist(data[i], query);
    end;
  end;

Algorithm accum = euclidean_dist(d, q)
  accum = 0;
  for i = 1 to length(d)
    accum = accum + (d[i] - q[i])^2;
  end;
  accum = sqrt(accum);

Optimization 1:
Neglect to take the
square root.
Optimization 2:
Pass the best_so_far
into the euclidean_dist
function, and abandon
the calculation if accum
ever gets larger than
best_so_far.
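Both optimizations fit in a short Python sketch. The function names are illustrative; the scan compares squared distances throughout (Optimization 1) and takes a single square root at the end only to report the result in the original units.

```python
import math

def squared_dist_early_abandon(d, q, best_so_far):
    """Squared Euclidean distance, or inf once the running sum exceeds best_so_far."""
    accum = 0.0
    for di, qi in zip(d, q):
        accum += (di - qi) ** 2
        if accum > best_so_far:      # Optimization 2: abandon early
            return math.inf
    return accum                     # Optimization 1: no square root here

def sequential_scan(data, query):
    """Return (index of the nearest item, its Euclidean distance to the query)."""
    best_so_far = math.inf           # best squared distance seen so far
    best_index = None
    for i, item in enumerate(data):
        dist = squared_dist_early_abandon(item, query, best_so_far)
        if dist < best_so_far:
            best_so_far = dist
            best_index = i
    return best_index, math.sqrt(best_so_far)

data = [[0, 0], [3, 4], [1, 1]]
nearest_index, nearest_dist = sequential_scan(data, [1, 2])
```

Skipping the square root is safe because it is monotonic: the item with the smallest squared distance also has the smallest distance.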
Results of a Similarity Searching Experiment
on Increasingly Large Datasets
These trivial optimizations
can have differences
which are larger than
many of the speedups
claimed for new
techniques.
[Figure: running time in Seconds (0 to 5) against Number of Objects (10,000, 50,000, 100,000) for Euclid, Opt1 and Opt2]
This experiment only
considers the main
memory case; disk based
techniques offer many
other possibilities for
implementation bias.
Another Example of Implementation Bias I
Researchers have found a
difference between the indexing
ability of Haar and PAA…
[Figure: the same time series approximated by Haar and by PAA; x-axes 0 to 120]
We believe that the
reported result might be the
result of an (unstated)
implementation detail.
For normalized data, the
first Haar coefficient is
always zero.
What kind of difference
would it make if we forgot
that fact?
Let's find out!
Another Example of Implementation Bias II
We re-implemented the experiments, averaging the results
over 100 experiments as the original authors did.
We compared the results of two implementations, one that
takes advantage of the “1st coefficient is zero”, and one
that does not.
We tested on 50 datasets, and plotted
the ratio of the results of both
implementations as a histogram.
This minor (unstated)
implementation detail can have
effects larger than the claimed
improvement!
[Figure: histogram of the ratio of the results of the two implementations; x-axis 0.95 to 1.2, y-axis 0 to 10]
The Bottom Line
• There are many many possibilities for
implementation bias when implementing a
strawman.
• Minor changes in these minor details can have
effects as large as the claimed improvement.
• Experiments should be done in a manner which
is completely free of implementation bias.
The Bottom Line
Almost all papers that introduced a new
distance measure as their primary
contribution, failed to demonstrate the
utility of their measure on a single
objective or subjective test.
To the best of our knowledge there are no distance
measures in the literature that are better than the decades
old Euclidean Distance and Dynamic Time Warping.
The Overall Bottom Line
The time series data mining/
database community is generally
doing very poor quality work.
Concrete Suggestion I
Algorithms should be tested on a
wide range of datasets, unless the
utility of the approach is only being
claimed for a particular type of data.
If possible, one subset of the datasets should be used to
fine tune the approach, then a different subset of the
datasets should be used to do the actual testing. This
methodology is widely used in the machine learning
community to help prevent implementation and data bias.
Concrete Suggestion II
Where possible, experiments should
be designed to be free of the
possibility of implementation bias
Note that this does not preclude the addition of
extensive implementation testing.
Concrete Suggestion III
Novel similarity measures should be
compared to simple strawmen, such as
Euclidean distance or Dynamic Time
Warping. Some subjective visualization,
or objective experiments, should justify
their introduction.
Concrete Suggestion IIII
Where possible, all data and code
used in experiments should be made
freely available to allow independent
duplication of findings
Overall Conclusion
• Sorry that the last 20 minutes of this tutorial
have been on a pessimistic note!
• I should emphasize that there have been
some really clever ideas introduced in the
last decade, and we have seen many of them
in the first few hours of this tutorial.
Hot Topics in Time Series Research
• In my (subjective) opinion, time series
similarity search is dead (or at least dying) as a
research area.
• However, there are lots of interesting
problems left.
• In the last 10 to 15 minutes of this tutorial, I
will discuss the most interesting open
problems…
Hot Topics in Time Series Research
• Exploiting symbolic representations of time series.
• Anomaly (interestingness) detection.
• Motif discovery (finding repeated patterns).
• Rule discovery in time series.
• Visualizing massive time series databases.
Exploiting Symbolic Representations of Time Series
• One central theme of this tutorial is that lowerbounding is
a very useful property (recall the lower bounds of DTW, and that R-trees
are based on the MinDist function, which lower bounds the distance between a
point and an MBR).
• Another central theme is that dimensionality reduction is
very important. That’s why we spent so long discussing
DFT, DWT, SVD, PAA etc.
• There does not currently exist a lowerbounding,
dimensionality reducing representation of time series. In
the next slide, let us think about what it would mean if we
had such a representation…
Exploiting Symbolic Representations of Time Series
• If we had a lowerbounding, dimensionality reducing
representation of time series, we could…
• Use data structures that are only defined for discrete data,
such as suffix trees.
• Use algorithms that are only defined for discrete data,
such as hashing, association rules etc
• Use definitions that are only defined for discrete data,
such as Markov models, probability theory
• More generally, we could utilize the vast body of
research in text processing and bioinformatics
Exploiting Symbolic Representations of Time Series
There is now a lower bounding dimensionality
reducing time series representation! It is called
SAX (Symbolic Aggregate ApproXimation)
I expect SAX to have a major impact on time
series data mining in the coming years…
[Figure: a time series (y-axis -3 to 3) discretized into the SAX symbols a to f, shown alongside its DFT, PLA, Haar and APCA approximations]
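The slide names SAX without giving its steps. As SAX is commonly described, a series is z-normalized, reduced with PAA, and each segment mean is mapped to a symbol using breakpoints that cut a standard normal distribution into equally probable regions. The sketch below follows that description, assuming NumPy; the function name and the four-letter alphabet are choices made for this example, and it assumes the number of segments divides the series length and that the series is not constant.

```python
from statistics import NormalDist

import numpy as np

def sax(x, num_segments, alphabet="abcd"):
    """Convert a series to a SAX-style string (sketch; see assumptions above)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                        # z-normalize
    means = x.reshape(num_segments, -1).mean(axis=1)    # PAA segment means
    a = len(alphabet)
    # Breakpoints that split a standard normal into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    indices = np.searchsorted(breakpoints, means)       # region of each mean
    return "".join(alphabet[i] for i in indices)

word = sax(np.arange(16.0), 4)   # a steadily rising series
```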
Anomaly (interestingness) detection
We would like to be able to discover surprising (unusual, interesting,
anomalous) patterns in time series.
Note that we don’t know in advance in what way the time series
might be surprising
Also note that “surprising” is very context dependent, application
dependent, subjective etc.
Arrr... what be wrong with
current approaches?
The blue time series at the top is a normal healthy
human electrocardiogram with an artificial
“flatline” added. The sequence in red at the
bottom indicates how surprising local subsections
of the time series are under the measure
introduced in Shahabi et al.
Note that the beginning of each normal heartbeat
is very surprising, but the “flatline” is the least
surprising part of the time series!!!
• Note that this problem has been solved for text strings
• You take a set of text which has been labeled
“normal”, you learn a Markov model for it.
• Then, any future data that is not modeled well by the
Markov model you annotate as surprising.
• Since we have just seen that we can convert time
series to text (i.e., SAX), let us quickly see if we can
use Markov models to find surprises in time series…
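The recipe in the bullets above can be sketched as follows, assuming the time series have already been converted to strings over a small alphabet. This is an illustrative first-order Markov model with add-one smoothing, not the exact model used to produce the figures in this tutorial; the function names and the choice of negative log probability (in bits) as the surprise score are assumptions.

```python
import math
from collections import defaultdict


def train_markov(text, alphabet):
    """Estimate first-order transition probabilities from 'normal' text."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    k = len(alphabet)
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values())
        # Add-one smoothing so unseen transitions keep a small probability.
        model[prev] = {
            nxt: (counts[prev][nxt] + 1) / (total + k) for nxt in alphabet
        }
    return model


def surprise(model, text):
    """Score each transition in `text` by -log2 of its modeled probability."""
    return [-math.log2(model[p][n]) for p, n in zip(text, text[1:])]


# Train on a 'normal' string, then score a test string with one odd symbol.
model = train_markov("abababababab", "abc")
scores = surprise(model, "ababcab")
```

Transitions that were common in the training text get low scores, while the transition into the unexpected symbol gets the highest score.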
[Figure: three aligned panels over time points 0 to 12000: Training data; Test data (subset); Markov model surprise.]
These were converted to the symbolic representation. I am showing the
original data for simplicity.
In the next slide we will zoom in on
this subsection, to try to understand
why it is surprising
[Figure: the same three panels (Training data; Test data (subset); Markov model surprise), with a zoomed comparison of a Normal Time Series and a Surprising Time Series over time points 0 to 700. Annotations in the zoomed view: Normal sequence; Actor misses holster; Laughing and flailing hand; Briefly swings gun at target, but does not aim; Normal sequence.]
Anomaly (interestingness) detection
In spite of the nice example in the previous slide, the
anomaly detection problem is wide open.
How can we find interesting patterns…
• Without (or with very few) false positives…
• In truly massive datasets...
• In the face of concept drift…
• With human input/feedback…
• With annotated data…
Motif discovery (finding repeated patterns)
Winding Dataset (The angular speed of reel 2)
[Figure: the time series, plotted over time points 0 to 2500.]
Are there any repeated patterns, of about this length, in
the above time series?
Motif discovery (finding repeated patterns)
[Figure: Winding Dataset (The angular speed of reel 2), time points 0 to 2500, with three occurrences of a repeated pattern marked A, B, and C; each occurrence is also shown enlarged over time points 0 to 140.]
Why Find Motifs?
• Mining association rules in time series requires the discovery of motifs.
These are referred to as primitive shapes and frequent patterns.
• Several time series classification algorithms work by constructing typical
prototypes of each class. These prototypes may be considered motifs.
• Many time series anomaly/interestingness detection algorithms essentially
consist of modeling normal behavior with a set of typical shapes (which we see
as motifs), and detecting future patterns that are dissimilar to all typical shapes.
• In robotics, Oates et al. have introduced a method to allow an autonomous
agent to generalize from a set of qualitatively different experiences gleaned
from sensors. We see these “experiences” as motifs.
• In medical data mining, Caraca-Valente and Lopez-Chavarrias have
introduced a method for characterizing a physiotherapy patient’s recovery
based on the discovery of similar patterns. Once again, we see these “similar
patterns” as motifs.
• Animation and video capture… (Tanaka and Uehara, Zordan and Celly)
Motif Discovery Challenges
How can we find motifs…
• Without having to specify the length/other parameters
• In massive datasets
• While ignoring “background” motifs (ECG example)
• Under time warping, or uniform scaling
• While assessing their significance
[Figure: Winding Dataset (The angular speed of reel 2), time points 0 to 2500, with motif occurrences A, B, and C marked.]
Finding these 3 motifs requires about 6,250,000 calls to the Euclidean distance function.
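The figure of 6,250,000 equals 2,500 × 2,500, which is what a naive search would cost if each of roughly 2,500 subsequences were compared against every other. That reading is an assumption, since the slide does not show the computation. The sketch below is a brute-force search for the closest pair of non-overlapping subsequences that counts its own distance calls; the function names and the rule for skipping overlapping pairs are choices made for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_motif(series, length):
    """Find the closest pair of non-overlapping subsequences of `length`.

    Returns (distance, i, j, calls), where `calls` counts how many times
    the distance function was invoked. The cost grows quadratically with
    the number of subsequences.
    """
    m = len(series) - length + 1  # number of subsequences
    best = (math.inf, -1, -1)
    calls = 0
    for i in range(m):
        for j in range(m):
            if abs(i - j) < length:
                continue  # skip a subsequence matched against itself or an overlap
            calls += 1
            d = euclidean(series[i:i + length], series[j:j + length])
            if d < best[0]:
                best = (d, i, j)
    return best[0], best[1], best[2], calls


# A toy series in which the shape 0, 1, 0 occurs twice.
toy = [0, 1, 0, 5, 5, 5, 0, 1, 0, 9, 9, 9]
distance, first, second, calls = brute_force_motif(toy, 3)
```

Even on this 12-point toy series the search makes dozens of distance calls, which is why the quadratic cost becomes a concern for massive datasets.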
Rule Discovery in Time Series
[Figure: two example shape panels, with time axes 0 to 20 and 0 to 30 and values from -2 to 2.]
Support = 9.1
Confidence = 68.1
Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series.
In proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining. New York, NY,
Aug 27-31. pp 16-22.
Papers Based on Rule Discovery
from Time Series
• Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of
Association Rules from Human Motion.
• Cotofrei, P. & Stoffel, K (2002). Classification Rules + Time = Temporal Rules.
• Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock
Time Series Using Self-Organizing Maps.
• Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential
Association Rules with Constraints and Time Lags in Multiple Sequences.
• Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic
Programming and Specialized Hardware.
• Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of
Temporal Rules.
• Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules
in House-keeping Data.
• Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in
Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams.
and many more…
All these people are
fooling themselves!
They are not finding rules in time
series (and, given their
representation of time series, they
cannot), and it is easy to prove this!
A Simple Experiment...
w    d     Rule       Sup %   Conf %   J-Mea.   Fig
20   5.5   7 15 8     8.3     73.0     0.0036   (a)
30   5.5   18 20 21   1.3     62.7     0.0039   (b)
“if stock rises then falls greatly, followed by a smaller rise,
then we can expect to see within 20 time units, a
pattern of rapid decrease followed by a leveling out.”
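To make the Sup %, Conf %, and J-Mea. columns concrete, here is a sketch of how such scores could be computed for a rule of the form "if symbol A occurs, then symbol B occurs within T time units" over a discretized series. The exact definitions used in the cited papers may differ. Here support is assumed to be the fraction of positions where the rule fires, confidence the fraction of A occurrences followed by B within T, and the J-measure the standard Smyth and Goodman form, using the assumed estimates of p(A) and p(B) noted in the comments.

```python
import math


def _term(p, q):
    """One term of the J-measure, p * log2(p / q), taken as 0 when undefined."""
    return p * math.log2(p / q) if p > 0 and q > 0 else 0.0


def rule_measures(symbols, a, b, t):
    """Score the rule 'a is followed by b within t steps' (illustrative only)."""
    n = len(symbols)

    def b_follows(i):
        return b in symbols[i + 1:i + 1 + t]

    count_a = sum(1 for s in symbols if s == a)
    count_ab = sum(1 for i, s in enumerate(symbols) if s == a and b_follows(i))
    count_b = sum(1 for i in range(n) if b_follows(i))

    support = count_ab / n  # assumed definition
    confidence = count_ab / count_a if count_a else 0.0
    p_a = count_a / n  # assumed estimate of p(A)
    p_b = count_b / n  # assumed estimate of p(B within t)
    j_measure = p_a * (_term(confidence, p_b) + _term(1 - confidence, 1 - p_b))
    return support, confidence, j_measure


# Cluster labels standing in for discretized subsequences.
labels = [1, 2, 3, 1, 2, 3, 1, 3, 2, 1]
support, confidence, j_measure = rule_measures(labels, a=1, b=2, t=1)
```

Note that a high confidence on its own says little. The score only becomes meaningful when compared with what the same procedure produces on data that should contain no rules.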
[Figure: panels (a) and (b), with time axes 0 to 20 and 0 to 30 and values from -2 to 2.]
w    d     Rule       Sup %   Conf %   J-Mea.   Fig
20   5.5   11 15 3    6.9     71.2     0.0042   (a)
30   5.5   24 20 19   2.1     74.7     0.0035   (b)
The punch line is…
Rule Discovery in Time Series
A paper that will appear in ICDM 2003 (you have a
preview copy on your CD-ROM) completely invalidates 50
or so papers on rule discovery in time series (i.e., the vast
majority of the work).
So the good news is that the problem is wide open again!
The first challenge is to show that it makes sense as a
problem…
Visualizing Massive Time Series Databases
The best data mining/pattern recognition tool is the human eye.
Can we exploit this fact?
How can we visually summarize massive time series, such that regularities,
outliers, anomalies, etc. become visible?
Beginning of Mary Had a Little Lamb, patterns of 3 or more notes: Figure by Martin Wattenberg
Conclusions
• As with all computer science problems, the right
representation is the key to an effective and efficient
solution. With time series, there are many choices.
• The evaluation of work in time series data mining has
been rather sloppy.
• There are lots of interesting problems left to tackle.
Thanks!
Thanks to the people with whom I have co-authored
Time Series Papers
• Wagner Truppel
• Marios Hadjieleftheriou
• Dimitrios Gunopulos
• Michail Vlachos
• Marc Cardle
• Stephen Brooks
• Bhrigu Celly
• Jiyuan An, H. Chen, K. Furuse and N. Ohbo
• Michael Pazzani
• Sharad Mehrotra
• Kaushik Chakrabarti
• Selina Chu
• David Hart
• Padhraic Smyth
• Jessica Lin
• Bill 'Yuan-chi' Chiu
• Stefano Lonardi
• Shruti Kasetty
• Chotirat (Ann) Ratanamahatana
• Pranav Patel
• Harry Hochheiser
• Ben Shneiderman
• Victor Zordan
• Your name here! (I welcome collaborators)
Questions?
All datasets and code used in this tutorial can be found at
www.cs.ucr.edu/~eamonn/TSDMA/index.html