Time Series II
Syllabus
Nov 4       Introduction to data mining
Nov 5       Association Rules
Nov 10, 14  Clustering and Data Representation
Nov 17      Exercise session 1 (Homework 1 due)
Nov 19      Classification
Nov 24, 26  Similarity Matching and Model Evaluation
Dec 1       Exercise session 2 (Homework 2 due)
Dec 3       Combining Models
Dec 8, 10   Time Series Analysis
Dec 15      Exercise session 3 (Homework 3 due)
Dec 17      Ranking
Jan 13      Review
Jan 14      EXAM
Feb 23      Re-EXAM
Last time…
• What is a time series?
• How do we compare time series data?
Today…
• What is the structure of time series data?
• Can we represent this structure compactly and accurately?
• How can we search streaming time series?
Time series summarization
[figure: a raw time series and its summarizations under DFT, DWT, PAA, APCA, PLA, and SAX (the SAX string "aabbbccb")]
Why Summarization?
• We can reduce the length of the time series
• We want to lose as little information as possible
• We can process it faster
Discrete Fourier Transform (DFT)
[figure: a time series X (0–140) and its DFT-based approximation X']
Basic Idea: Represent the time series as a linear combination of sines and cosines. Transform the data from the time domain to the frequency domain. (Jean Fourier, 1768-1830)
Highlight the periodicities, but keep only the first n/2 coefficients.
Why n/2 coefficients? Because they are symmetric.

Excellent free Fourier Primer:
Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS95-37, Department of Computer Science, Brown University, 1995.
http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/
Why DFT?
A: Several real sequences are periodic.
Q: Such as?
A: Sales patterns follow seasons;
   the economy follows a 50-year cycle (or 10?);
   temperature follows daily and yearly cycles.
Many real signals follow (multiple) cycles.
How does it work?
• Decomposes the signal into a sum of sine and cosine waves
• How to assess the 'similarity' of x with a (discrete) wave?
[figure: series x = {x0, x1, ..., xn-1} and wave s = {s0, s1, ..., sn-1}, plotted over time 0, 1, ..., n-1]
How does it work?
• Consider the waves with frequency 0, 1, … (frequency = 1/period)
• Use the inner product (~cosine similarity)
[figure: the constant wave (freq. f=0) and one period of sin(t · 2π/n) (freq. f=1), over time 0, 1, ..., n-1]
How does it work?
• Consider the waves with frequency 0, 1, …
• Use the inner product (~cosine similarity)
[figure: a wave with freq. f=2 over time 0, 1, ..., n-1]
How does it work?
'basis' functions
[figure: the cosine and sine basis functions for f=1 and f=2, over time 0, 1, ..., n-1]
How does it work?
• Basis functions are actually n-dimensional vectors, orthogonal to each other
• 'similarity' of x with each of them: inner product
• DFT: ~ all the similarities of x with the basis functions
How does it work?
Since e^(jφ) = cos(φ) + j·sin(φ), we finally have:

X_f = (1/√n) · Σ_{t=0}^{n-1} x_t · exp(−j2πtf/n)      (with j = √−1)

inverse DFT:

x_t = (1/√n) · Σ_{f=0}^{n-1} X_f · exp(+j2πtf/n)
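These two formulas are exactly the orthonormal FFT pair. As a quick sanity check, here is a minimal numpy sketch (the toy signal is an assumption) verifying the inverse transform and the symmetry property discussed below:

```python
import numpy as np

n = 128
t = np.arange(n)
x = np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.random.randn(n)  # toy signal

X = np.fft.fft(x, norm="ortho")          # forward DFT with the 1/sqrt(n) factor
x_back = np.fft.ifft(X, norm="ortho")    # inverse DFT

assert np.allclose(x, x_back.real)       # perfect reconstruction

# Symmetry for a real signal: X_f equals the complex conjugate of X_{n-f}
assert np.allclose(X[1:], np.conj(X[::-1][:-1]))
```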
How does it work?
Each X_f is a complex number:
X_f = a + b·j
• a is the real part
• b is the imaginary part
• Examples:
– 10 + 5j
– 4.5 – 4j
How does it work?
SYMMETRY property of the DFT of a real signal:
X_f = (X_{n-f})*
( "*": complex conjugate: (a + b·j)* = a − b·j )
Thus: we use only the first n/2 numbers
DFT: Amplitude spectrum
• Amplitude: A_f² = Re(X_f)² + Im(X_f)²
• Intuition: strength of frequency 'f'
[figure: a periodic series in the time domain and its amplitude spectrum A_f, with a peak at freq. 12]
Example
Reconstruction using 1 coefficient
[figure: original series vs. reconstruction, x in 0–250, y in −5..5]

Example
Reconstruction using 2 coefficients
[figure: original series vs. reconstruction]

Example
Reconstruction using 7 coefficients
[figure: original series vs. reconstruction]

Example
Reconstruction using 20 coefficients
[figure: original series vs. reconstruction]
DFT: Amplitude spectrum
• Can achieve excellent approximations with only very few frequencies!
• We can reduce the dimensionality of each time series by representing it with its k most dominant frequencies
• Each frequency needs two numbers (real part and imaginary part)
• Hence, a time series of length n can be represented using 2·k real numbers, where k << n
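A small sketch of this reduction, assuming numpy; dft_reduce and dft_reconstruct are hypothetical helper names, and the random-walk series is made up:

```python
import numpy as np

def dft_reduce(x, k):
    """Keep the first k complex DFT coefficients (2*k real numbers)."""
    X = np.fft.fft(x, norm="ortho")
    return X[:k]

def dft_reconstruct(Xk, n):
    """Rebuild a length-n series, restoring the symmetric coefficients."""
    X = np.zeros(n, dtype=complex)
    k = len(Xk)
    X[:k] = Xk
    if k > 1:
        X[n - k + 1:] = np.conj(Xk[1:][::-1])   # X_{n-f} = (X_f)*
    return np.fft.ifft(X, norm="ortho").real

n, k = 128, 8
x = np.cumsum(np.random.randn(n))               # random-walk toy series
x_approx = dft_reconstruct(dft_reduce(x, k), n)
```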
Raw Data
[figure: a time series C with n = 128 points, plotted over 0–140]
C = 0.4995, 0.5264, 0.5523, 0.5761, 0.5973, 0.6153, 0.6301, 0.6420, 0.6515, 0.6596, …
The graphic shows a time series with 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (only the first few points are shown here).
Raw Data / Fourier Coefficients
[figure: the series C and the first few of the 64 pure sine waves it decomposes into]
C = 0.4995, 0.5264, 0.5523, 0.5761, 0.5973, 0.6153, …
Fourier coefficients = 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928, …
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier Coefficients are reproduced as a column of numbers (only the first few coefficients are shown here).
Raw Data / Truncated Fourier Coefficients
[figure: the series C and its approximation C' rebuilt from the truncated coefficients]
Truncated Fourier coefficients = 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928
n = 128, N = 8, Cratio = 1/16
We have discarded 15/16 of the data.
Raw Data / Sorted Truncated Fourier Coefficients
[figure: the series C and its approximation C' rebuilt from the best coefficients]
Sorted truncated Fourier coefficients = 1.5698, 1.0485, 0.7160, 0.8406, 0.2667, 0.1928, 0.1438, 0.1416
Instead of taking the first few coefficients, we could take the best coefficients.
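A hedged sketch of picking the best (largest-amplitude) coefficients instead of the first ones, again assuming numpy; dft_best_k is a hypothetical helper:

```python
import numpy as np

def dft_best_k(x, k):
    """Keep the k largest-amplitude DFT coefficients and their positions."""
    X = np.fft.fft(x, norm="ortho")
    half = X[: len(x) // 2 + 1]             # the symmetric half is redundant
    idx = np.argsort(np.abs(half))[-k:]     # positions of the k best
    return idx, half[idx]

# Usage: store (idx, coeffs) instead of the raw series; reconstruction
# zero-fills the discarded coefficients before the inverse FFT.
```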
Discrete Fourier Transform…recap
Pros and Cons of DFT as a time series representation
[figure: X and its DFT approximation X']
Pros:
• Good ability to compress most natural signals
• Fast, off-the-shelf DFT algorithms exist: O(n·log(n))
Cons:
• Difficult to deal with sequences of different lengths
Piecewise Aggregate Approximation (PAA)
Basic Idea: Represent the time series as a sequence of box basis functions, each box being of the same length
[figure: X and its PAA approximation X' with segment means x̄1, …, x̄8]
Computation:
• X: time series of length n
• Can be represented in the N-dimensional space as:

x̄_i = (N/n) · Σ_{j = (n/N)(i−1)+1}^{(n/N)·i} x_j

Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000)
Byoung-Kee Yi, Christos Faloutsos, VLDB (2000)
Piecewise Aggregate Approximation (PAA)
Example
Let X = [1 3 -1 4 4 4 5 3 7]
• X can be mapped from its original dimension n = 9 to a lower dimension, e.g., N = 3, as follows:
[1 3 -1 | 4 4 4 | 5 3 7]  →  [1 4 5]
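A minimal PAA sketch, assuming numpy and that N divides n evenly, as in the example above:

```python
import numpy as np

def paa(x, N):
    """Piecewise Aggregate Approximation: mean of each of N equal segments."""
    x = np.asarray(x, dtype=float)
    return x.reshape(N, len(x) // N).mean(axis=1)

print(paa([1, 3, -1, 4, 4, 4, 5, 3, 7], N=3))   # -> [1. 4. 5.]
```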
Piecewise Aggregate Approximation (PAA)
Pros and Cons of PAA as a time series representation
Pros:
• Extremely fast to calculate
• As efficient as other approaches (empirically)
• Supports queries of arbitrary lengths
• Can support any Minkowski metric
• Supports non-Euclidean measures
• Simple! Intuitive!
Cons:
• If visualized directly, looks aesthetically unpleasing
Symbolic Aggregate approXimation (SAX)
• similar in principle to PAA
– uses segments to represent data series
• represents segments with symbols (rather than real numbers)
– small memory footprint
Creating SAX
• Input
– A time series (blue curve)
• Output
– SAX representation of the input time series (red string)
[figure: input series, its PAA, and the resulting SAX string "baabccbc"]
The Process (STEP 1)
• Represent time series T of length n with w segments using Piecewise Aggregate Approximation (PAA)
• PAA(T, w) = T̄ = (t̄_1, …, t̄_w), where

t̄_i = (w/n) · Σ_{j = (n/w)(i−1)+1}^{(n/w)·i} T_j

[figure: a time series T of length 16 and PAA(T,4), values in −3..3]
The Process (STEP 2)
• Discretize into a vector of symbols
• Use breakpoints to map to a small alphabet α of symbols
[figure: PAA(T,4) discretized via breakpoints into iSAX(T,4,4), with binary symbols 00, 01, 10, 11]
Symbol Mapping
• Each average value from the PAA vector is replaced by a symbol from an alphabet
• An alphabet size a of 5 to 8 is recommended
– a,b,c,d,e
– a,b,c,d,e,f
– a,b,c,d,e,f,g
– a,b,c,d,e,f,g,h
• Given an average value, we need a symbol
Symbol Mapping
This is achieved by using the normal distribution from statistics:
– Assuming our input series is normalized, we can use the normal distribution as the data model
– We divide the area under the normal distribution into 'a' equal-sized areas, where a is the alphabet size
– Each such area is bounded by breakpoints
SAX Computation – in pictures
[figure: a series C, its PAA, and the symbol assignment via breakpoints, yielding the SAX word "baabccbc"; slide taken from Eamonn's tutorial on SAX]
Finding the Breakpoints
• Breakpoints for different alphabet sizes can be structured as a lookup table
• When a=3:
– Average values below -0.43 are replaced by 'a'
– Average values between -0.43 and 0.43 are replaced by 'b'
– Average values above 0.43 are replaced by 'c'

      a=3     a=4     a=5
b1   -0.43   -0.67   -0.84
b2    0.43    0.00   -0.25
b3            0.67    0.25
b4                    0.84
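A compact sketch of the whole pipeline, assuming numpy and scipy, an already z-normalized input, and a length divisible by N; note that norm.ppf(i/a) reproduces the breakpoint table above:

```python
import numpy as np
from scipy.stats import norm

def sax(x, N, a):
    """SAX word: PAA segments mapped to symbols via normal breakpoints."""
    x = np.asarray(x, dtype=float)
    segments = x.reshape(N, len(x) // N).mean(axis=1)   # PAA step
    breakpoints = norm.ppf(np.arange(1, a) / a)         # a=3 -> [-0.43, 0.43]
    symbols = np.searchsorted(breakpoints, segments)    # indices 0..a-1
    return "".join(chr(ord("a") + s) for s in symbols)
```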
The GEMINI Framework
• Raw data: original full-dimensional space
• Summarization: reduced dimensionality space
• Searching in the original space is costly
• Searching in the reduced space is faster:
– Less data, indexing techniques available, lower bounding
• Lower bounding enables us to
– prune the search space: throw away data series based on the reduced dimensionality representation
– guarantee correctness of the answer
• no false negatives
• false positives: filtered out based on raw data
GEMINI
Solution: Quick filter-and-refine:
• extract m features (numbers, e.g., average)
• map each series to a point in m-dimensional feature space
• organize the points
• retrieve the answer using a NN query
• discard false alarms
Generic Search using Lower Bounding
[figure: the simplified query is run against the simplified DB, producing an answer superset (no false negatives!); the superset is verified against the original DB to remove false positives, yielding the final answer set]
GEMINI: contractiveness
• GEMINI works when:
Dfeature(F(x), F(y)) <= D(x, y)
• Note that the closer the feature distance is to the actual one, the better
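To make filter-and-refine concrete, here is a sketch under stated assumptions: numpy, the paa helper sketched earlier, and the standard result that sqrt(n/N) times the Euclidean distance between PAA representations lower-bounds the true Euclidean distance:

```python
import numpy as np

def range_search(query, database, radius, N=8):
    """Filter with the PAA lower bound, refine with the true distance."""
    n = len(query)
    q_feat = paa(query, N)
    answers = []
    for series in database:
        lb = np.sqrt(n / N) * np.linalg.norm(q_feat - paa(series, N))
        if lb > radius:
            continue                       # safely pruned: no false negative
        if np.linalg.norm(query - series) <= radius:
            answers.append(series)         # verified against the raw data
    return answers
```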
Streaming Algorithms
• Similarity search is the bottleneck for most time series data mining algorithms, including streaming algorithms
• Scaling such algorithms can be tedious when the target time series length becomes very large!
• Fast similarity search allows us to solve higher-level time series data mining problems, e.g., similarity search in data streams and motif discovery, at scales that would otherwise be untenable
Fast Serial Scan
• A streaming algorithm for fast and exact search in very large data streams
[figure: a query being slid along a data stream]
Z-normalization
• Needed when interested in detecting trends and not absolute values
[figure: three subsequences A, B, C at different offsets but with similar shape]
• For streaming data:
– each subsequence of interest should be z-normalized before being compared to the z-normalized query
– otherwise the trends are lost
• Z-normalization guarantees:
– offset invariance
– scale/amplitude invariance
Pre-Processing
z-Normalization
• data series encode trends
• usually interested in identifying similar trends
• but absolute values may mask this similarity
Pre-Processing
z-Normalization

z_i = (x_i − μ) / σ

[figure: two data series v1 and v2]
• two data series with similar trends
• but large distance…
Pre-Processing
z-Normalization
• zero mean
– compute the mean of the sequence
– subtract the mean from every value of the sequence
• standard deviation one
– compute the standard deviation of the sequence
– divide every value of the sequence by the stddev
[figure: v1 and v2 after each step, ending aligned]
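The two steps in code (a minimal sketch, assuming numpy; the guard for near-constant sequences is an addition, not from the slides):

```python
import numpy as np

def z_normalize(x):
    """Zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma < 1e-8:                 # guard: near-constant sequence
        return x - x.mean()
    return (x - x.mean()) / sigma
```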
Pre-Processing
z-Normalization
• when to z-normalize
– interested in trends
• when not to z-normalize
– interested in absolute values
Proposed Method: UCR Suite
• An algorithm for similarity search in large data streams
• Supports both ED and DTW search
• Works for both z-normalized and un-normalized data
series
• Combination of various optimizations
Squared Distance + LB
• Using the Squared Distance:

ED²(Q, C) = Σ_{i=1}^{n} (q_i − c_i)²

(skipping the final square root preserves the ranking and is cheaper)
• Lower Bounding:
– LB_Yi
– LB_Kim
– LB_Keogh
[figure: LB_Keogh — envelope (U, L) around the query Q, and a candidate C]
Lower Bounds
• Lower Bounding
– LB_Yi: uses max(Q) and min(Q)
– LB_Kim: uses the first, last, minimum, and maximum points (A, B, C, D)
– LB_Keogh: uses an envelope (U, L) around the query Q
[figure: the three lower bounds illustrated]
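A hedged sketch of LB_Keogh, assuming numpy arrays of equal length and a warping-window half-width r:

```python
import numpy as np

def lb_keogh(q, c, r):
    """LB_Keogh: distance from C to the envelope (U, L) built around Q."""
    n = len(q)
    lb_sq = 0.0
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        u, l = q[lo:hi].max(), q[lo:hi].min()    # envelope at position i
        if c[i] > u:
            lb_sq += (c[i] - u) ** 2
        elif c[i] < l:
            lb_sq += (c[i] - l) ** 2
    return np.sqrt(lb_sq)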
Early Abandoning
• Early Abandoning of ED:

ED(Q, C) = √( Σ_{i=1}^{n} (q_i − c_i)² ) ≥ bsf

– as soon as the partial sum of squares exceeds bsf², the candidate cannot beat the best-so-far, so we can abandon
[figure: ED computation abandoned partway along Q and C]
• Early Abandoning of LB_Keogh
[figure: (U, L) is an envelope of Q; the LB_Keogh sum can be abandoned the same way]
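A minimal sketch of early-abandoning ED; bsf is the best-so-far distance (the comparison is done on squared values, matching the squared-distance optimization above):

```python
def early_abandon_ed(q, c, bsf):
    """Return ED(q, c) if it can beat bsf, else float('inf') early."""
    bsf_sq = bsf * bsf
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf_sq:          # partial sum already too large
            return float("inf")      # abandon: cannot beat best-so-far
    return total ** 0.5
```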
Early Abandoning
• Early Abandoning of DTW: stop if the partial dtw_dist ≥ bsf
• Earlier Early Abandoning of DTW using LB_Keogh: stop if (partial) dtw_dist + (partial) lb_keogh ≥ bsf
– the LB_Keogh contribution of the not-yet-processed suffix still lower-bounds the remaining DTW cost
[figure: DTW partially calculated for the first K points (K = 0, then K = 11, with warping window R); LB_Keogh partially truncated and kept for the remaining points]
Z-normalization
• Early Abandoning Z-Normalization
– Do the normalization only when needed (just in time)
– Every subsequence needs to be normalized before it is compared to the query
– Online mean and std calculation is needed
– Keep a buffer of size m and compute a running mean and standard deviation

z_i = (x_i − μ) / σ
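One way to keep the running mean and standard deviation over a buffer of size m (a sketch using running sums of x and x²; for very long streams, periodic recomputation may be needed for numerical stability):

```python
from collections import deque

class RunningStats:
    """Sliding-window mean/std via running sums of x and x^2."""
    def __init__(self, m):
        self.buf = deque(maxlen=m)
        self.s = self.s2 = 0.0

    def push(self, x):
        if len(self.buf) == self.buf.maxlen:   # window full: drop the oldest
            old = self.buf[0]
            self.s -= old
            self.s2 -= old * old
        self.buf.append(x)
        self.s += x
        self.s2 += x * x

    def mean_std(self):
        n = len(self.buf)
        mu = self.s / n
        var = max(self.s2 / n - mu * mu, 0.0)  # clamp tiny negatives
        return mu, var ** 0.5
```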
The Pseudocode
[figure: the UCR Suite pseudocode]
Reordering
• Reordering Early Abandoning
– We don't have to compute ED or LB from left to right
– Order points by expected contribution
[figure: standard (left-to-right) early-abandon ordering vs. optimized ordering of the comparison points on Q and C]
Idea
- Order by the absolute height of the query point
- This step is performed only once for the query and can save about 30%-50% of the calculations
Reordering
• Reordering Early Abandoning
– We don't have to compute ED or LB from left to right
– Order points by expected contribution
Intuition
- The query will be compared to many data stream points during a search
- Candidates are z-normalized: the distribution of many candidates will be Gaussian, with zero mean
- The sections of the query that are farthest from the mean (zero) will on average have the largest contributions to the distance measure
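A sketch of reordered early abandoning, assuming numpy; the ordering is computed once per query:

```python
import numpy as np

def reorder_indices(q):
    """Visit points with the largest |q_i| first (computed once per query)."""
    return np.argsort(-np.abs(q))

def reordered_early_abandon_ed(q, c, bsf, order):
    total, bsf_sq = 0.0, bsf * bsf
    for i in order:                  # largest expected contribution first
        total += (q[i] - c[i]) ** 2
        if total >= bsf_sq:
            return float("inf")      # abandons earlier than left-to-right
    return total ** 0.5
```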
Different Envelopes
• Reversing the Query/Data Role in LB_Keogh
– Makes LB_Keogh tighter
– Much cheaper than DTW
– Online envelope calculation
[figure: envelope (U, L) built on Q vs. envelope built on C]
Lower bounds
• Cascading Lower Bounds
– At least 18 lower bounds of DTW have been proposed
– Use only the lower bounds that lie on the tightness/cost skyline
[figure: tightness of lower bound (LB/DTW, from 0 to 1) vs. computation cost, from O(1) through O(n) to O(nR): LB_KimFL, LB_Kim, LB_Yi, LB_Keogh EQ, max(LB_Keogh EQ, LB_Keogh EC), LB_PAA, LB_FTW, LB_Ecorner, Early_abandoning_DTW, DTW]
Experimental Result: Random Walk
• Random Walk: Varying size of the data

            Million (Seconds)   Billion (Minutes)   Trillion (Hours)
UCR-ED           0.034               0.22                 3.16
SOTA-ED          0.243               2.40                39.80
UCR-DTW          0.159               1.83                34.09
SOTA-DTW         2.447              38.14               472.80

Code and data are available at:
www.cs.ucr.edu/~eamonn/UCRsuite.html
Experimental Result: ECG
• Data: one year of electrocardiograms, 8.5 billion data points
• Query: an idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (R = 21 = 5%)
[figure: the PVC query and an ECG snippet]

UCR-ED        4.1 minutes
SOTA-ED      66.6 minutes
UCR-DTW      18.0 minutes
SOTA-DTW     49.2 hours

~30,000X faster than real time!
Up next…
Nov 4       Introduction to data mining
Nov 5       Association Rules
Nov 10, 14  Clustering and Data Representation
Nov 17      Exercise session 1 (Homework 1 due)
Nov 19      Classification
Nov 24, 26  Similarity Matching and Model Evaluation
Dec 1       Exercise session 2 (Homework 2 due)
Dec 3       Combining Models
Dec 8, 10   Time Series Analysis
Dec 15      Exercise session 3 (Homework 3 due)
Dec 17      Ranking
Jan 13      No Lecture
Jan 14      EXAM
Feb 23      Re-EXAM