Transcript Document

A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases
Presented by Stavros Papadopoulos

Time Series and Similarity Search

Time Series
A sequence (ordered collection) of real values, X = (x1, x2, ..., xn), where n can be very large

Similarity
Given a query sequence Q, a database S of N sequences S1, S2, ..., SN, a distance measure D and a tolerance threshold ε, two time series X and Y are regarded as similar within range ε when D(X, Y) ≤ ε
Time Series and Similarity Search

Whole matching
Given a collection of N data sequences of real numbers S1, S2, ..., SN and a query sequence Q, we want to find those data sequences that are within distance ε from Q. The data and query sequences must have the same length

Subsequence matching
Given N data sequences S1, S2, ..., SN of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences Si that contain matching subsequences. Report those data sequences, along with the correct offsets within the data sequences that best match the query sequence
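
As a point of reference before indexing is introduced, here is a minimal sketch of whole matching by sequential scan, assuming Euclidean distance as D (all names are illustrative):

import numpy as np

def whole_matching(database, query, epsilon):
    # database: list of 1-D numpy arrays, each the same length as query
    # returns the indices of the sequences within distance epsilon of query
    matches = []
    for i, seq in enumerate(database):
        # Euclidean distance between two equal-length sequences
        if np.linalg.norm(seq - query) <= epsilon:
            matches.append(i)
    return matches

This is the O(N · n) baseline that the indexing methods below try to beat.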
Dimensionality Reduction

Instead of sequentially scanning the database in order to find similar sequences, an indexing method is required to reduce the query time

Every time series of length n is regarded as a point in the n-dimensional space

Indexing structures are used: R-trees and their variants (the R*-tree is the most commonly used data structure)

R-trees scale badly as dimensionality increases

To efficiently search time series databases, each sequence is represented as a multidimensional vector and dimensionality is reduced to a degree at which index structures can be applied efficiently
Dimensionality Reduction

The following dimensionality reduction techniques are well known:

Discrete Fourier Transform (DFT)
Discrete Wavelet Transform (DWT)
Piecewise Aggregate Approximation (PAA)
Singular Value Decomposition (SVD)
Chebyshev Polynomials (Cheb)
Piecewise Linear Approximation (PLA)
Adaptive Piecewise Constant Approximation (APCA)
Symbolic Aggregate Approximation (SAX)
Dimensionality Reduction using DFT

Let X = (x1, x2, ..., xn) be a time series

We take the Fourier transform: FFT(X) = X̂ = (X̂1, X̂2, ..., X̂n)

We keep only the first fc coefficients
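
A minimal sketch of this step with NumPy, assuming the real-input FFT and an illustrative fc = 8 (normalization conventions are not specified on the slide, so this is only a sketch):

import numpy as np

def dft_reduce(x, fc):
    # keep only the first fc DFT coefficients of the series x
    return np.fft.rfft(x)[:fc]

def dft_reconstruct(coeffs, n):
    # rebuild a length-n series from the truncated coefficients
    full = np.zeros(n // 2 + 1, dtype=complex)
    full[:len(coeffs)] = coeffs
    return np.fft.irfft(full, n)

x = np.cumsum(np.random.randn(128))   # a random-walk series (stock-like)
x_hat = dft_reconstruct(dft_reduce(x, 8), len(x))
error = np.linalg.norm(x - x_hat)     # reconstruction error with 8 coefficients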
Dimensionality Reduction using DFT

Motivation behind using DFT:

In most applications the data does not exhibit rapid changes (e.g. stock data). Therefore, the energy of a time series vector is concentrated in the lower frequencies

The first DFT coefficients capture this low-frequency information

The reconstruction error using only a few coefficients is small
Dimensionality Reduction using DFT

[Figure: a time series X and its reconstruction X' from the first DFT coefficients]

Dimensionality Reduction Techniques

[Figure: an original series X and its approximation X' under DFT, DWT and SVD]
Dimensionality Reduction Techniques

[Figure: an original series X and its approximation X' under Cheb, PLA and PAA]
Dimensionality Reduction Techniques

[Figure: an original series X and its approximations under APCA and SAX; the SAX panel discretizes X over the alphabet {a, b, c, d} into the string aaaaaabbbbbccccccbbccccdddddddd]
Dimensionality Reduction

Basically, dimensionality reduction is a technique for approximating the original sequence of length n by another sequence of much smaller length

Any dimensionality reduction technique potentially suffers from two problems:

False alarms
Occur when objects that appear to be close in the index space are actually distant. False alarms are removed in a post-processing stage.

False dismissals
Occur when qualifying objects are missed because they appear distant in the index space. They cannot be tolerated by the system.
Evaluating the different techniques

The aforementioned techniques guarantee no false dismissals

False alarms occur because every mapping of an n-dimensional point to a point in a space of reduced dimensionality only approximates the true location of the transformed point.
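
The no-false-dismissals guarantee rests on the standard lower-bounding (GEMINI) condition: if F maps a sequence to its reduced representation, each of these techniques satisfies

D_feature(F(X), F(Y)) ≤ D(X, Y)

so any sequence within ε of the query in the original space is also within ε in the index space and can never be pruned away.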

CPU time is highly dependent on the implementation.

A more accurate and objective measure of the effectiveness of a dimensionality reduction technique is its pruning power
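
One common way to measure it, sketched here under the assumption that pruning power is the fraction of the database surviving the index-space filter (exact definitions vary across papers; dft_reduce is the sketch from the DFT slide):

import numpy as np

def pruning_power(database, query, epsilon, reduce):
    # fraction of the database that cannot be pruned in the reduced space;
    # lower values mean a tighter approximation
    q_hat = reduce(query)
    candidates = sum(1 for seq in database
                     if np.linalg.norm(reduce(seq) - q_hat) <= epsilon)
    return candidates / len(database)

# e.g. pruning_power(db, q, eps, reduce=lambda x: dft_reduce(x, 8))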
Subsequence similarity search: FRM

• Allows similarity matching between time series of variable size
• Predefines a window length ω
• Divides every time series in the database into n-ω sliding windows
• Indexes all these subsequences, with pointers back to the original sequence each belongs to
• Divides the query sequence into n/ω disjoint windows
• Then searches the index for subsequences similar to each query window
• The candidate set is then examined in order to remove any false alarms
Subsequence similarity search: FRM

Disadvantage:
When an MBR is found, we include all the points in this MBR in the candidate set
Subsequence similarity search: DUAL MATCH

• It is the dual of FRM
• Predefines a window length ω
• Divides every time series in the database into n/ω disjoint windows
• Indexes all these subsequences, with pointers back to the original sequence each belongs to
• Divides the query sequence into n-ω sliding windows
• Then searches the index for subsequences similar to each query window
• The candidate set is then examined in order to remove any false alarms
• The authors’ experiments show that DUAL MATCH outperforms FRM
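
A minimal sketch contrasting how the two methods cut a sequence into windows (ω and the offsets follow the descriptions above; the names are illustrative):

def sliding_windows(s, w):
    # every length-w window at every offset: FRM applies this to the data,
    # DUAL MATCH applies it to the query
    return [(i, s[i:i + w]) for i in range(len(s) - w + 1)]

def disjoint_windows(s, w):
    # non-overlapping length-w windows: FRM applies this to the query,
    # DUAL MATCH applies it to the data
    return [(i, s[i:i + w]) for i in range(0, len(s) - w + 1, w)]

Indexing disjoint windows instead of sliding ones is what lets DUAL MATCH store far fewer points in the index.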
Dimensionality reduction and query time

Two factors affect similarity search queries:

Index search
The larger the preselected dimensionality, the larger the search time in the R*-tree

False alarms
The smaller the preselected dimensionality, the lower the accuracy and the larger the number of false alarms. Note that a false alarm is expensive, since we have to fetch the ‘false’ time series from the database and sequentially compare it with the query sequence.
Question?

Is there an optimal dimensionality and, if there is, how do we select it?
The authors’ point of view

The problem of finding the optimal trade-off between accuracy and index search time has not been addressed yet

The authors who proposed the various dimensionality reduction techniques conduct experiments and test the pruning power and CPU time when using different dimensionalities

They empirically find the optimal dimensionality by counting the number of page accesses after issuing all the queries

But what do we do when we have an application that supports online queries, or when we do not have all the queries in advance?

A dynamic solution must be found!
A naïve solution

We keep multiple R*-trees with different dimensionalities, e.g. 2, 4, 6, ..., 16

Each one indexes the whole database

We adopt a heuristic function to count the page accesses

We define proper thresholds

We start by using the R*-tree with the lowest dimensionality, e.g. the 2-dimensional one

As queries arrive, if the thresholds are exceeded, we change the tree we use, e.g. from the 2-dimensional to the 4-dimensional
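
A minimal sketch of this switching logic, assuming hypothetical index objects whose range_query method also reports its page accesses (all names and the threshold are illustrative):

DIMS = [2, 4, 6, 8, 10, 12, 14, 16]
PAGE_ACCESS_THRESHOLD = 1000   # illustrative threshold

class NaiveMultiIndex:
    def __init__(self, trees):
        # trees: one R*-tree per dimensionality, each indexing the whole database
        self.trees = dict(zip(DIMS, trees))
        self.current = DIMS[0]
        self.page_accesses = 0

    def query(self, q, epsilon):
        result, accesses = self.trees[self.current].range_query(q, epsilon)
        self.page_accesses += accesses
        # heuristic: too many page accesses -> move to the next dimensionality
        if self.page_accesses > PAGE_ACCESS_THRESHOLD and self.current < DIMS[-1]:
            self.current = DIMS[DIMS.index(self.current) + 1]
            self.page_accesses = 0
        return result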
A naïve solution (cont.)

Advantages
• We can dynamically adjust the dimensionality of our index as queries arrive
• We don’t have to decide the dimensionality in advance
• It can work well with online queries

Disadvantages
• Increased space
• Suppose that only a fraction of the database is affected by the queries and results in false alarms; then the whole tree will unnecessarily upgrade its dimensionality
• The unaffected fraction of the database will then be indexed with a tree of higher dimensionality than is actually needed
Convention for the purposes of our discussion

‘Simple’ time series
A time series that concentrates its energy in its low frequencies

‘Complicated’ time series
A time series that concentrates its energy in its high frequencies
Observations

In a dataset, there can exist simple time series as well as complicated time series

A single time series can contain simple subsequences as well as complicated subsequences; e.g. there might be some ‘white’ noise in a time series

The query could be simple or complicated

The query determines where we will have the false alarms, since the query is what falls into an MBR and causes false alarms there
Conclusions

It is necessary for our solution to be query adaptive, since we search the index according to the queries and the queries are the ones that cause the false alarms

Since not all the indexed subsequences are of the same complexity, our structure must support different dimensionalities for different portions of the database

This means that our structure should support MBRs of different dimensionalities
Proposed solution

• We will use a modification of the R*-tree
• We will also use the DUAL MATCH technique for similarity matching
• We do not have any knowledge about our dataset, nor about the nature of the queries
• The queries can be offline or online
• Note that for online queries, most of the time, applications just use an active window, which can be regarded as a sliding window
• The structure is adjusted over time, as new queries arrive
The structure

• The structure starts as a 2-dimensional R*-tree
• Every fc-dimensional point has a pointer to its original time series, as well as to its complete transformation vector
• With every node/MBR we associate a variable dim that holds the dimensionality of the current node
[Figure: a 2-dimensional R*-tree indexing points a-m; leaf MBRs E3-E7 hang under directory entries E1 and E2 below the root; every node carries dim=2; a window query is shown over the minimum bounding rectangles (MBRs)]
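
A minimal sketch of the per-node bookkeeping described above (the R*-tree machinery itself is assumed; all names are illustrative):

class Node:
    def __init__(self, dim=2):
        self.dim = dim        # dimensionality of this node's MBR
        self.mbr = None       # bounds over the first dim coefficients
        self.children = []    # child nodes, or leaf entries

class LeafEntry:
    def __init__(self, point, series_ptr, full_vector_ptr):
        self.point = point                      # the first fc coefficients
        self.series_ptr = series_ptr            # pointer to the original time series
        self.full_vector_ptr = full_vector_ptr  # pointer to the complete transformation vector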
The structure

• We search the index based on the dimensionality of every node
• We keep a heuristic function for the false alarms in every leaf MBR, and we define a threshold
• If the threshold is exceeded, we increase the dimensionality of the corresponding leaf MBR
• This is achieved by retrieving more coefficients from the transformation vector pointed to by the pointer
[Figure: the same R*-tree after the threshold is exceeded in one leaf; that leaf MBR now carries dim=4 while the remaining nodes stay at dim=2]
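
A minimal sketch of the upgrade step on a leaf, assuming a simple false-alarm counter as the heuristic (the threshold, the step size, and all names are illustrative):

import numpy as np

FALSE_ALARM_THRESHOLD = 50   # illustrative
MAX_DIM = 16

def record_false_alarms(leaf, count, fetch_full_vector):
    # bump the leaf's counter; once the threshold is exceeded, widen its
    # dimensionality by retrieving extra coefficients from each entry's
    # complete transformation vector and re-tightening the MBR
    leaf.false_alarms = getattr(leaf, 'false_alarms', 0) + count
    if leaf.false_alarms > FALSE_ALARM_THRESHOLD and leaf.dim < MAX_DIM:
        leaf.dim += 2
        pts = np.array([fetch_full_vector(e.full_vector_ptr)[:leaf.dim]
                        for e in leaf.children])
        leaf.mbr = (pts.min(axis=0), pts.max(axis=0))
        leaf.false_alarms = 0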
The structure

• At some point, all (or most) of the leaf MBRs of a particular internal MBR may have increased their dimensionality
• At that point we have to increase the dimensionality of the internal MBR as well → this is not trivial (it will be discussed later)
• This can propagate up to the root
[Figure: the R*-tree after the increase has propagated; its MBRs now carry mixed dimensionalities (dim=2, dim=4, dim=6) across different subtrees]
The structure

At some point we will end up with an R*-tree which has different dimensionalities for different subtrees

Benefit
• The subsequences that were creating the false alarms in the beginning have now increased their dimensionality and do not cause false alarms any more → the index search time for them has increased
• The subsequences that didn’t cause false alarms in the beginning haven’t changed their dimensionality, and therefore the index search time for these subsequences has remained unchanged
The structure

Worst case
There are two possibilities:
• The dataset subsequences are all complicated
• The query sequences are all complicated

In the above cases the whole R*-tree may end up with a uniform dimensionality over its MBRs

It may even reach the maximum dimensionality (i.e. 16)

Even in the worst case we have the contribution that we can dynamically determine the optimal dimensionality for our R*-tree
Concerns

How do we merge subtrees when the dimensionality of their children MBRs has increased?

Updates: insertions, deletions

What percentage of the children MBRs should decide when to upgrade the dimensionality of their parent MBR?

What will the heuristic function for false alarms be?

How about increasing the dimensionality from 2 to 6 instead of from 2 to 4?

There should be special handling for PLA, SAX, PAA and APCA

We should be careful with merges and splits with respect to the fan-out, since increasing the dimensionality increases the space as well

How do we decide if we want to decrease the dimensionality?

If we show that we outperform the original R*-tree for all the dimensionality reduction techniques, then we have a good contribution (the experimental section can be large)
Thank You!

For those who are interested, I have an extended bibliography for the issues covered