Querying and Mining of Time Series Data: Experimental Comparison
Querying and Mining of Time Series Data:
Experimental Comparison of Representations
and Distance Measures
Hui Ding, Goce Trajcevski, Peter Scheuermann
Dept. of EECS, Northwestern University
hdi117,goce,[email protected]
Xiaoyue Wang, Eamonn Keogh
Dept. of CS, U. of California, Riverside
xwang,[email protected]
34th VLDB Conference, Auckland, New Zealand
August 26, 2008
Motivation and Summary of
Findings
Motivation:
Time series are ubiquitous
Key aspects for achieving effectiveness and efficiency: representation methods and similarity measures
Consolidate the large amount of existing research efforts
We conducted the largest (by a huge margin) set of time series data mining experiments
Summary of findings:
The tightness of lower bounding (and thus the pruning power and indexing effectiveness) of different representation methods for time series data, for the most part, makes very little difference on various data sets
Elastic measures, e.g., DTW, LCSS, EDR and ERP, can be significantly more accurate for classification than other measures
With a large training data set, Euclidean distance is competitive with elastic measures such as DTW (thus getting more data helps more than fussing with distance measures in most cases)
Comparison of Time Series
Representation Methods
[Figure: TLB on an ECG data set (foetal_ecg excerpt) for SAX, DCT, APCA, DFT, PAA/DWT, CHEB and IPLA]
8 representation methods:
SAX, DFT, DWT, DCT, PAA, CHEB, APCA, IPLA
Use tightness of lower bounds (TLB) as a metric for comparison:
TLB = LowerBoundDist / TrueEuclideanDist
The tightness of lower bounding (and thus the pruning power and indexing effectiveness) of different representation methods, for the most part, makes little difference on various data sets
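The TLB ratio defined above can be sketched for one of the listed representations. The example below assumes PAA with its standard lower bound (segment means scaled by the square root of the segment length) and uses random-walk series invented purely for illustration; the function names and the restriction to lengths divisible by the number of segments are assumptions of this sketch, not the setup used in the experiments.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    series = np.asarray(series, dtype=float)
    # Assumption: len(series) is divisible by n_segments (keeps the sketch short).
    return series.reshape(n_segments, -1).mean(axis=1)

def paa_lower_bound(q, c, n_segments):
    """Lower bound on the Euclidean distance, computed from PAA coefficients only."""
    seg_len = len(q) // n_segments
    diff = paa(q, n_segments) - paa(c, n_segments)
    return float(np.sqrt(seg_len) * np.linalg.norm(diff))

def tlb(q, c, n_segments):
    """Tightness of lower bound: LowerBoundDist / TrueEuclideanDist."""
    true_dist = np.linalg.norm(np.asarray(q, dtype=float) - np.asarray(c, dtype=float))
    return paa_lower_bound(q, c, n_segments) / float(true_dist)

# Hypothetical data: two random walks of length 480.
rng = np.random.default_rng(0)
q = np.cumsum(rng.standard_normal(480))
c = np.cumsum(rng.standard_normal(480))
for n_coeffs in (4, 6, 8, 10):
    print(n_coeffs, round(tlb(q, c, n_coeffs), 3))
```

A TLB of 1 would mean the reduced representation loses nothing for pruning purposes; values near 0 mean almost every candidate must be checked against the raw data.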
[Figures: TLB on a bursty data set and TLB on a periodic data set for SAX, DCT, APCA, DFT, PAA/DWT, CHEB and IPLA]
Comparison of Time Series Similarity
Measures - Findings
Compared 9 similarity measures:
Euclidean, L1, Linf, DISSIM,
TQuEST, DTW, EDR, ERP, LCSS,
Swale and SpADe
on 38 diverse data sets
Used 1-Nearest Neighbor
Classification for evaluating the
accuracy of underlying measures
Used stratified cross-validation to
minimize the impact of class
distribution of the data sets
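The evaluation setup described above (1-nearest-neighbor classification scored with stratified cross-validation) might look roughly like the following. This is a minimal sketch assuming a conventional k-fold scheme, which may differ from the exact protocol used in the experiments; the helper names and the toy sine-wave data are invented for illustration.

```python
import numpy as np

def stratified_folds(labels, n_folds, seed=0):
    """Assign samples to folds so each fold keeps roughly the class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_folds  # deal class members round-robin
    return fold_of

def one_nn_error(train_X, train_y, test_X, test_y, dist):
    """Error rate of 1-NN classification under the given distance function."""
    errors = 0
    for x, y in zip(test_X, test_y):
        nearest = min(range(len(train_X)), key=lambda i: dist(x, train_X[i]))
        errors += int(train_y[nearest] != y)
    return errors / len(test_X)

def cross_validated_error(X, y, dist, n_folds=5):
    """Average 1-NN error over stratified folds (one fold held out per round)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    fold_of = stratified_folds(y, n_folds)
    rates = []
    for f in range(n_folds):
        test = fold_of == f
        rates.append(one_nn_error(X[~test], y[~test], X[test], y[test], dist))
    return float(np.mean(rates))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Toy data (made up): two classes of noisy sine waves with different frequencies.
rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 64)
X = np.array([np.sin(f * t) + 0.1 * rng.standard_normal(64) for f in [1] * 20 + [2] * 20])
y = np.array([0] * 20 + [1] * 20)
print(cross_validated_error(X, y, euclidean))
```

Passing the distance as a function makes it straightforward to swap Euclidean distance for any of the other measures while keeping the evaluation identical.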
As training set size increases,
Euclidean distance quickly
becomes as effective as elastic
measures (e.g., DTW, EDR)
Edit-distance based measures
are, for the most part, as
effective as DTW (but require
more effort for tuning). However,
they are not vastly superior, as
some have suggested
Some measures (e.g., DISSIM,
TQuEST) which were claimed to
be vastly superior to simpler
methods are in fact no better,
or even worse
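To make the contrast between Euclidean distance and an elastic measure concrete, here is a minimal DTW sketch. It assumes squared point-wise costs and an optional Sakoe-Chiba band (the `window` parameter, a name chosen for this example); it is a textbook formulation and not necessarily the implementation used in the experiments.

```python
import numpy as np

def dtw_distance(a, b, window=None):
    """Dynamic Time Warping distance with an optional warping-window half-width."""
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)            # unconstrained warping
    window = max(window, abs(n - m))  # the band must at least reach the corner
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# Illustrative series: the same shape, shifted in time.
t = np.linspace(0, 2 * np.pi, 100)
a = np.sin(t)
b = np.sin(t - 0.5)
print(euclidean(a, b), dtw_distance(a, b))
```

With `window=0` the alignment is forced onto the diagonal and the result equals the Euclidean distance for equal-length series; the window width is the kind of parameter that has to be tuned for elastic measures.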
Example: Impact of Training Data
Set Size
[Figure: out-of-sample error rate of Euclidean distance and DTW on increasingly large training sets; panels: CBF data set and Two-Pat data set]
If a large training set is available, Euclidean distance may be as
good as DTW, and it is the fastest one can get…
Visualizing Classification Accuracy
Using Scatter Plot (1)
Euclidean Distance vs.
L1 Norm and Linf Norm
DTW distance vs.
Euclidean distance
Visualizing Classification Accuracy
Using Scatter Plot (2)
LCSS distance vs.
Euclidean and
DTW distance
ERP distance vs.
Euclidean and DTW
distance
Visualizing Classification Accuracy
Using Scatter Plot (3)
DISSIM distance vs. Euclidean
and DTW distance
It has been claimed that DISSIM
“efficiently retrieves similar
trajectories in cases where related
work fails”. However, on average it is
no better than DTW
TQuEST distance vs. Euclidean
and DTW distance
It has been claimed that “DTW is the
only competitor that achieves roughly
similar accuracy (to TQuEST)”. However,
DTW and even Euclidean distance are
significantly better than TQuEST on
average
Visualizing Classification Accuracy
Using Scatter Plot (4)
Both SpADe and Swale have been proposed
as being significantly better than Euclidean
distance and DTW.
However, they are both about as good as
Euclidean distance on average (shown to the
left), and slightly worse than DTW on
average.
Conclusions & Future Work
We attempted to consolidate
existing works on representation methods and similarity measures
for time series data
Future extensions include:
Conducting statistical analysis to investigate relationships among different
similarity measures, and presenting a correlation-based comparison
Investigating (meta) properties of the data sets that could favor the
effectiveness of one similarity measure over another
Anything else you suggest!