Time Series Data Mining Group - University of California, Riverside

Download Report

Transcript Time Series Data Mining Group - University of California, Riverside

iSAX: Indexing and Mining Terabyte
Sized Time Series
Jin Shieh, Eamonn Keogh
Computer Science & Eng. Dept.
University of California, Riverside
Time Series Data Mining Group
Outline
• Introduction
– Motivating example
• iSAX representation
– Indexing time series
• Experimental evaluation
• Conclusion
Time Series Data Mining Group
Introduction
• Our work extends a popular symbolic representation of
time series to allow for the indexing and retrieval of
millions of time series
• Symbolic Aggregate approXimation (SAX)
– Represent a time series T of length n in w-dimensional
i
space using PAA
ti  wn  T j
• Where the ith element of is:
n
w
j  wn ( i 1) 1
– Then discretize into a vector of symbols
• Breakpoints map to a small alphabet a of symbols
3
3
2
3
A time series T
2
2
PAA(T,4)
iSAX(T,4,4)
00
1
1
1
01
0
0
0
-1
-1
-2
-2
10
11
-1
-3
0
4
8
12
16
Time Series Data Mining Group
-3
-2
-3
0
4
8
12
16
0
4
8
12
16
Introduction (cont.)
• SAX is lower bounding
– Given a SAX representations Ta, Sa a lower bound
to the Euclidean distance is:
MINDIST(Ta,
Sa)

n
w
 dist (t , s )
w
i 1
2
i
i
– dist(ti,si) is the smallest distance between the
breakpoints that characterize each symbol, 0 if
they overlap
Time Series Data Mining Group
Motivating Example
• Why not just index using SAX?
• For example: index 1,000,000 time series using SAX
– Choose SAX parameters
• cardinality = 8, wordlength = 4
• 84 = 4,096 possible SAX word labels
– Place time series which map to the same label in the same
file on disk
– Compute label for query and retrieve matching file
• Time series in file likely to be good approximate matches
– Average label occupancy 1,000,000/4,096 = ~244
(reasonable)
Time Series Data Mining Group
Motivating Example (cont.)
• In practice, the distribution of time series to SAX word
labels is not uniform!
– Empty
– Disproportionate percentage of the dataset
• Ideal condition: We want to give a threshold th, and
have the number of entries n mapped to a label to be 1 ≤
n ≤ th
– Favor larger n
• How can we achieve this? We need to make SAX more
flexible
Time Series Data Mining Group
iSAX Representation
• SAX uses a single hard-coded cardinality
– Unable to differentiate only on dimensions of
interest
• We will show that the indexing problem can be
solved if we extend SAX to allow:
– Different cardinalities within a single word
– Comparison of words with different cardinalities
• We call this extension indexable SAX (iSAX)
Time Series Data Mining Group
iSAX Representation (cont.)
• Multi-resolution property
– Readily convert to any lower resolution that differs by a
power of two
{12,13, 6, 1} = {1100,1101,0110,0001}
{ 6, 6, 3, 0} = {110 ,110 ,011 ,000 }
{ 3, 3, 1, 0} = {11 ,11 ,01 ,00 }
{ 1, 1, 0, 0} = {1
,1
,0
,0
}
• Lower bounding distance between iSAX words enforced
through examination of both sets of breakpoints
• iSAX offers a bit aware, quantized, multi-resolution
representation with variable granularity
Time Series Data Mining Group
Indexing with iSAX
• Split a set of time series represented by a common iSAX
word into mutually exclusive subsets
(using multi-resolution property):
– Increase cardinality along dimensions d, word length w,
1≤d≤w
– Fan-out rate bound by 2d
• Iterative doubling
– Given a base cardinality b, cardinality at i-th increase is b*2i
– Alignment of breakpoints overlap
• Allows for index structures which are hierarchical, with nonoverlapping regions, and a controlled fan-out rate
Time Series Data Mining Group
Indexing with iSAX (cont.)
• Simple tree-based index (base cardinality b, word length w, threshold th)
– Hierarchically subdivides SAX space until entries in each subspace falls
within th
• Leaf nodes point to index files on disk
• Internal nodes designate a split in SAX space
• Approximate Search
– Similar time series often represented by same iSAX word
– Traverse index until leaf
• Match iSAX representation at each level
• Apply heuristics if no match
• Exact Search
– Leverage approximate search
– Prune search space
• Lower bounding distance
Time Series Data Mining Group
Experimental Evaluation
• We conduct experiments to identify
characteristics of the iSAX representation:
– Tightness of the lower bound
– Indexing performance on massive datasets
– Applicability to data-mining algorithms
Time Series Data Mining Group
Tightness of Lower Bounds
• TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S)
• For a given dataset
– Time series length [480, 960, 1440, 1920]
– Bytes available for representation [16, 24, 32, 40]
– Results similar across thirty datasets
iSAX, DCT, ACPA, DFT, PAA/DWT, CHEB, IPLA
Koski ECG
0.8
0
500
1000
0.4
0.2
480
960
1440
0
1920
TLB
0.6
Time series Length
Koski ECG dataset
Time Series Data Mining Group
40 bytes
32 bytes
24 bytes
16 bytes
Indexing Performance on Massive Datasets
• Indexed random walk datasets of [1, 2, 4, 8] million time series of length
256
– Parameters: b = 4, w = 8, th = 100
– Generated [39,255, 57,365, 92,209, 162,340] index files
Percentage of Queries
• Approximate Search (1000 queries):
At least 1 from top 100
100
80
At least 1 from top 10
60
40
Outside top 1000
1 from top 1 (true nearest neighbor)
20
0
1m
2m
Size of Random Walk Database
4m
8m
Exact Search (100 queries):
Avg. Disk Accesses/Query
Avg. Time/Query (min)
1M
2M
4M
8M
Exact Search
3.8
5.8
9.0
14.1
Sequential Scan
71.5
104.8
168.8
297.6
Time Series Data Mining Group
1M
2M
4M
8M
Exact Search
2115.3
3172.5
4925.3
7719.1
Sequential Scan
39255
57365
92209
162340
Data Mining
• Definition: Time Series Set Difference (TSSD) (A,B).
Given two collections of time series A and B, the time
series set difference is the subsequence in A whose
distance from its nearest neighbor in B is maximal
• Electrocardiogram dataset from a 45 year old male
subject with suspected sleep-disordered breathing
– 7.2 hours as reference set B (1,000,000 time series)
– 8 minutes 39 seconds as “novel” set A (20,000 time series)
where the patient woke up
4
0
90
-4
0
50
100
109
150
200
250
The Time Series Set Difference discovered between ECGs recorded during a
waking cycle and the previous 7.2 hours (respiration pattern change in
accordance with change in sleep stages)
Time Series Data Mining Group
Data Mining (cont.)
•
Solutions:
1) Sequential scan A across B
2) Exact search each entry in A using index on B
3) Leverage approximate and exact search
•
•
Order A by approximate search distance in a queue
Perform exact search using index on B in descending distance
– Suspend if distance becomes lower than next entry in the
queue
– If search completes, return as TSSD
Distance Computations
Disk Accesses
Est. Time
1) Sequential Scan
20,000,000,000
31,196
6.25 days
2) Exact Search
325,604,200
5,676,400
1.04 days
3) Leveraged
2,365,553
43,779
34 minutes
Time Series Data Mining Group
Conclusion
• Introduced the iSAX representation and shown
how it can be used for indexing time series
• Demonstrated scalability and efficacy on massive
datasets
• Showed how approximate and exact search can
be used in conjunction to produce exact results on
data mining problems
Time Series Data Mining Group
THANK YOU!
Time Series Data Mining Group