Transcript Poster

NSF Career Award 0237918. IIS-0237918-001
University of California Riverside
Eamonn Keogh
Efficient Discovery of Previously Unknown Patterns and Relationships
in Massive Time Series Databases
Research Objectives:
To date, the vast majority of research on time series data
mining has focused on similarity search, and to a lesser extent
on clustering. We believe that these problems should now be
regarded as essentially solved. In particular, there are now
fast exact techniques for searching and clustering patterns
under both the Euclidean distance and Dynamic Time
Warping, the two most useful distance measures. However,
from a knowledge discovery viewpoint, there are several
important unsolved problems in time series data mining that
are more interesting, important, and challenging. In this
project, we are to addressing these problems. Our long-term
goal is the creation of efficient algorithms to allow the
extraction of knowledge in the form of patterns, anomalies,
regularities and rules, from massive time series dataset
Significant Results:
Our work has had a large impact on the state of the art of time series indexing and time
series data mining. Some concrete examples include:
• Our time series motif finding algorithm is being used to find video textures by Celly
and Zordan, , and to find repeated motions in motion capture data by Tanaka and
Uehara .
• Our time series anomaly detection algorithm is being used by the Aerospace Corp to
monitor spacecraft telemetry, and by a joint Berkeley/Stanford group to monitor
computer systems. A NASA white paper noted that "it has great promise for the
future".
• Our time series indexing technique (LB_Keogh indexing) has been expanded by us,
and by many others, including groups that use it for Euclidean indexing, for
subsequence matching, for indexing handwriting, for indexing multidimensional
sequences, for indexing music and for indexing motion capture data.
• Our papers in the area have been referenced more than a 1,000 times, see
www.cs.ucr.edu/~eamonn/selected_publications.htm
• We have being maintaining the UCR time series data mining archive, we have given
test datasets and code to more than 400 research groups and individuals.
Approach:
Time Series Representations
Many high level representations of time series have been proposed for data
mining. See the figure to the right for a hierarchy of all the various time series
representations in the literature. One representation that the data mining
community has not considered in detail is the discretization of the original data
into symbolic strings. At first glance this seems a surprising oversight. There is
an enormous wealth of existing algorithms and data structures that allow the
efficient manipulations of strings. Such algorithms have received decades of
attention in the text retrieval community, and more recent attention from the
bioinformatics community. Some simple examples of “tools” that are not defined
for real-valued sequences but are defined for symbolic approaches include
hashing, Markov models, suffix trees, decision trees etc.
The core of our contributions are based on a new symbolic representation of time
series, called SAX (Symbolic Aggregate ApproXimation ). Our representation is
unique in that it allows dimensionality/numerosity reduction, and it also allows
distance measures to be defined on the symbolic representation that lower bound
corresponding popular distance measures defined on the original data. As we
have demonstrated, the latter feature is particularly exciting because it allows
one to run certain data mining algorithms on the efficiently manipulated
symbolic representation.
Model Based
Hidden
Markov
Models
Data Adaptive
Grid
Statistical
Models
Sorted
Coefficients
Piecewise
Polynomial
Piecewise
Linear
Approximation
Interpolation
Singular
Symbolic
Value
Approximation
Adaptive
Piecewise
Constant
Approximation
Natural
Language
Trees
Wavelets
Strings
Symbolic
Aggregate
Approximation
Regression
Orthonormal
Non
Lower
Bounding
Value
Based
Haar
Random
Mappings
Daubechies
dbn n > 1
Coiflets
0
B
500
1000
Discrete
Chebyshev
Cosine
Polynomials
Transform
Symlets
Slope Based
Winding Dataset
…Indexing,
Classification and
Clustering (DMKD
2003)
C
( The angular speed of reel 2 )
1500
Clipped
Data
Piecewise
Aggregate
Approximation
Spectral
Discrete
Fourier
Transform
Bi-Orthonormal
SAX has made contributions to…
A
Broader Impact:
Data Dictated
Non Data Adaptive
2000
2500
…Motif Discovery (SIGKDD 2003)
It is expected that this work will have a broad impact, because:
Details 2
• Time series are ubiquitous, occurring in virtually every human
endeavor, including medicine, finance, entertainment etc.
Zoom in
• The proposed work is very general, and has already made
contributions to virtually every time series problem, including
Visualization (VLDB 2004 and SIGKDD 2004a), Motif Discovery
(SIGKDD 2003), Anomaly Detection (SIGKDD 2004a) and
Indexing (DMKD 2003).
Normal
Time Series
Surprising
Time Series
Overview
Details 1
…Visualization ( VLDB
2004 and SIGKDD 2004a)
Normal
sequence
0
10
0
Actor
misses
holster
20
0
30
0
Laughing and
flailing hand
Normal
sequence
Briefly swings gun at
target, but does not
aim
40
0
50
0
600
70
0
…Anomaly Detection (SIGKDD 2004b)