Transcript t 1
DB Seminar Presentation
Top-k Queries on Temporal Data (VLDB J., to appear)
Authors
Presenter
Supervisors
Date
Feifei Li, Ke Yi, & Wangchao Le
Hao Wang
Nikos Mamoulis & David Cheung
16 Nov 2010
Content
I. The problem & baseline solutions
II. The index: SEB-tree (the sampled envelope B-tree)
III. Extensions
IV. Experiments
Appendix
Time series segmentation
Upper envelope: the complexity
Configuration space & Lemma 1
2
Part I
The Problem & Baseline Solutions
Top-k(t) query
Naïve solutions
R-tree-based solutions
3
The problem setting
Data
Piece-wise linear functions
How to approximate time series data is beyond
the scope of the paper. See Appendix A for the
method used in the experiments.
Score
o1
o2
o3
o4
t1
t2
Time
t1
t2
o3
o2
o1
o4
o1
o3
o4
o2
Task
Top-k(t) query: Given any time instance t, find top-k objects at time t.
How to answer the query for any t and k ≤ kmax efficiently?
4
Baseline: Naïve solutions
Naïve solution 1 (based on raw data)
Sort all objects at each time instance
Organize the sorted lists in a B-tree
Indexing attribute: time instances
Indexing record: sorted lists
Expensive storage cost
Overhead query performance
Naïve solution 2 (based on piece-wise linear representation)
Pre-compute all intersections of line segments, which partition the time
axis into disjoint intervals
Score
Sort all objects in each interval
Organize the sorted lists in a B-tree
Indexing attribute: time intervals
Indexing record: sorted lists
Quadratic space
Time
5
Baseline: R-tree-based solutions
Framework
Break up each trajectory into a number of MBRs
Build an R-tree on top of these MBRs
Applying generic kNN algorithm to answer top-k(t)
[Anagnostopoulos et al., 2006]
Filtering & Refinement
Not tailored for top-k(t) queries
Current commercial DBMSs have limited supports on R-trees
6
Part II The Index: SEB-tree
The design principle
The index
Analysis
Comparison of different solutions
7
The design principle
Consider the dataset as a set of line segments, S
p-Sample
Sample each line segment from S randomly with probability p
Upper envelope of a set of line segments X, U(X)
(point-wise maximum)
Also piece-wise linear
U(X)
e1
e5
e2
e3
e6
e7
e4
8
The design principle (cont’d)
Trapezoidal decomposition (also: vertical visibility map)
Conflict list
For ei ∈ U(X), C(ei) is the set of line segments from S that ever locate
above ei (i.e. intersect Δi)
Δ1
Δ2
Δ3
Δ5
Δ4
Δ6
Δ7
U(X)
e1
C(e3)
e5
e2
e3
e6
e7
e4
9
The design principle (cont’d)
How can this help top-k(t) queries?
Intuition: The larger the probability p, the smaller the size of C(e).
If we are lucky…
C(e) is almost the result set
Δ1
Δ2
Δ3
Δ5
Δ4
Δ6
Δ7
U(X)
e1
C(e3)
e5
e2
e3
t
e6
e7
e4
10
How to calculate the upper envelope?
– O(nlogn) time. (Appendix B)
SEB-tree
How to get the conflict lists efficiently?
(1) Geometrically
decreasing samples
(2) Upper envelope,
Decomposition, &
Conflict lists
(3) B-trees on
conflict lists
S0
U(S0), D(S0), C(S0)
T0
S1
U(S1), D(S1), C(S1)
T1
S
SEB-tree
Sl
U(Sl), D(Sl), C(Sl)
Tl
11
Building the conflict lists
Keep on sampling within Si = L0
Si = L0 ⊇ L1 ⊇ … ⊇ Lλ
Li +1 is a 1/2-sample of Li
Compute trapezoidal decomposition D(L0), D(L1), …, D(Lλ)
L0 ( = Si )
L1
12
Answering top-k(t) queries
T0
S
T1
Correctness.
Tl
t
13
Analysis: Lemmas
Lemma 1. If X is a p-sample of S, then for any e ∈ U(X), E[|C(e)|] = O(1/p).
Involves some well-established computational geometry tools
See Appendix C for more explanation
U(X)
C(e)
Lemma 2. There are O(nα(n)) vertices on the upper envelope of n
segments, where α(n) is the inverse Ackermann function.
α(n) is a very slowly growing function; α(n) < 4 even if n = 2↑↑65536
See Appendix B for more explanation
14
Analysis: Space
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
Lemma 2. O(nα(n)) envelope complexity.
Expected size of Si
S0
Expected total size of Ti
T0
S1
Index size
Index size (occupied disk blocks)
T1
Sl
Tl
15
Analysis: Query performance
Lemma 3. The probability that the algorithm queries Ti is at most
−Ω( (i − l0)2 ) , for any i > l .
0
2
Step 1. Prob. of “fails when querying Ti”
At least one segment in top-k(t) is sampled into Si
S0
T0
Step 2. Prob. of “queries Ti”
Tlo, Tlo+1, …, Ti-1 have been queried and all failed
S1
T1
Sl
Tl
16
Analysis: Query performance (cont’d)
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
−Ω( (i − l0)2 )
Lemma 3. Queries Ti w. prob. 2
.
Visiting Tl0
S0
T0
Visiting Ti’s (i > l0)
S1
T1
Total:
Sl
Tl
17
Analysis: Query performance (cont’d)
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
−Ω( (i − l0)2 )
Lemma 3. Queries Ti w. prob. 2
Visiting Tl0
Visiting Ti’s (i > l0)
Scanning the entire S
.
S0
T0
S1
T1
Query performance (number of I/Os)
Sl
Tl
18
Analysis: Construction time
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
Lemma 2. O(nα(n)) envelope complexity.
Generating samples S0, S1, …, Sl
Building the upper envelopes
Lj
Geom.
Lj+1
Constructing C(Si)
Construction time
19
Comparison of different solutions
20
Part III Extensions
Space-query tradeoff
Aggregation of multiple attributes
21
Space-query tradeoff
T0
T1
S
“Bad things happen.”
What if C(Δ) get peculiarly large?
Tl
SEB-treeλ (λ = 3, 4)
22
Aggregation of multiple attributes
Multiple temporal attributes
a1(t), a2(t), …, ad(t)
Aggregation scoring function f ( a1(t), a2(t), …, ad(t) )
Extended solution
Build a SEB-tree for each of the temporal attributes
Visit trees Ti0, Ti1, … in order to get a sorted list of attribute i
Apply top-k algorithms such as TA to answer the extended top-k(t) query
Analysis
Unknown
Probably NOT instance-optimal
23
Part IV Experiments
Setup
Index size and construction cost
Query cost
24
Setup
Programming
language: C++
Upper envelope: CGAL library [CGAL]
B-trees: TPIE library [Arge et al., 2002]
R*-tree, MVR-tree, TPR-tree: The spatial index library [SIL]
Machine: 64-bit Linux, 4GB RAM, 2GHz Intel Xeon CPU
Datasets
UCR (Keogh’s CD ROM) [UCR]
Mallat Techno-metrics (2400 time series * 1024 time instances)
Lightcurve (5000 time series * 1024 time instances)
Segmentation: 200 line segments / time series [Keogh et al., 2001]
25
Setup (cont’d)
Datasets (cont’d)
UCR (Keogh’s CD ROM) [UCR]
Synthetic data
Pick m the average number of line segments in each fi
For fi, pick mean μi ∈ [100, 300] and number of line segments mi ∈ [1, 2m]
Generate mi time instances t1 < t2 < … < tmi from [0, 1000]
Set fi (tj) according to N(μi, σ2); if fi (tj) falls out of [0, 400], reset
σ is a system parameter used to control the degree of intersections as well as
the stableness of top-k distribution
Metrics
Index size, construction cost, query cost, update cost
Sensitivity analysis of data characteristics (σ, n, m, kmax, …)
26
Datasets (rand. sample of 5 objects)
More “vertical”.
More intersections.
27
Index size and construction cost (σ)
SEB-tree
R-tree
fi (tj) ∈ [0, 400]
kmax=200
28
Index size and construction cost (n)
SEB-tree
R-tree
fi (tj) ∈ [0, 400]
kmax=200
29
Query cost (σ)
SEB-tree
R-tree
MVB-tree
30
Query cost (n)
SEB-tree
R-tree
MVB-tree
31
Conclusion
Propose an index to answer top-k(t) queries
Near-linear construction time; near-linear space
Easy to deploy (using B-trees only)
Interesting and important directions
Ranking queries within a temporal range
Ranking queries on original time series without segmentation
32
Appendix A
Time Series Segmentation
33
Time series segmentation
Purpose
For efficient storage, transmission, indexing, and querying
Given a time series T, produce the best representation
using only K segments;
s.t. the maximum error for any segment does not exceed some threshold;
s.t. the combined error of all segments does not exceed some threshold.
Algorithms
Sliding Window. A segment is grown until it exceeds some error bound.
Bottom-Up. Starting from the finest possible approximation, segments
are merged until some stopping criteria is met.
Top-Down. The time series is recursively partitioned until some stopping
criteria is met.
34
Sliding window algorithm
Linear interpolation
Linear regression
Online
Poor quality
35
Bottom-up algorithm
Offline
Good quality
36
SWAB
[Keogh et al., 2001]
Sliding Window And Bottom-up
Applying Bottom-up
Read in one more segment (obtained
by applying Sliding-window)
37
Appendix B
Upper Envelope: The Complexity
38
From envelopes to sequences
5
3
4
2
1
5
1 2 1
3
1
3
4
5
4
5
2
3
4
The length of any alternating subsequence is at most 4.
39
Davenport-Schinzel sequence
A sequence U = (u1, u2, …, um) is an (n, s) Davenport-Schinzel
sequence, if it satisfies the following conditions
1 ≤ ui ≤ n for each i;
for each i < m we have ui ≠ ui+1; and
if a and b are two distinct values, then there is no subsequence …, a, …,
b, …, a, …, b, … consisting of s + 2 values alternating between a and b.
An envelope is in general a DS(n, 3), sometimes a DS(n, 2).
The number λs(n)
λs(n) = max{|U| : U is a DS(n, s) sequence}.
Theorem [Hart & Sharir, 1986]
λ2(n) = Θ(n);
λ3(n) = Θ(nα(n)).
40
Finding the envelope: D&C
Divide the set S into two subsets of equal size
Find the upper envelope for each subset
Merge the two upper envelopes
41
A better division
[Hershberger, 1989]
Interval tree
DIVIDE
O(logn) layers;
CONQUER
O(n) worst case to find
the envelope of each layer;
MERGE
O(nα(n)loglogn) using
the line sweep method
TOTAL: O(nlogn)
42
Appendix C
Configuration Space & Lemma 1
43
Configuration space
[Clarkson & Shor, 1989][Mulmuley, 1994]
The framework
A configuration space is a 4-tuple (X, Π, D, C), where X is the set of objects and
Π is the set of configurations.
A configuration σ over X is a pair (D, C) = (D(σ), C(σ)), where D and C are
disjoint subsets of X. The objects in D are said to define σ, whereas the objects in
C are said to conflict with σ.
Given a subset Y ⊆ X, the induced configuration space Π(Y) is the set of all
configurations σ such that D(σ) ⊆ Y and C(σ) ∩ Y = ∅.
In our case…
X ---- The set of all line segments.
Π ---- All possible top-open trapezoids that are determined by two vertices.
D(σ) ---- The at most 4 segments that determine σ.
C(σ) ---- All the segments that intersect σ.
Π(Y) ---- Trapezoids defined by Y ⊆ X.
44
More notations and a lemma
More notations
d(σ), c(σ) ---- The sizes of D(σ) and C(σ), respectively.
Πc(Y) ---- The set of configurations in Π(Y) with conflict size c.
πc(Y) ---- The size of Πc(Y).
Lemma C1. If Y is a p-sample of X, then for any a ≥ 0,
Proof.
Note that σ ∈ Πa(Y) if and only if D(σ) ⊆ Y and |C(σ) ∩ Y| = a.
Since Y is a p-sample of X, the probability of this event can be calculated
in the standard Bernoulli way.
45
Bounding the expected total size
Lemma C2. If Y is a p-sample of X, then for σ ∈ Π0(Y),
Proof.
Lemma C1
46
Bounding π1(Y)
Lemma C3. Let Y be a p-sample of X and Q be a 1/2-sample of Y, then
Proof.
For σ ∈ Π1(Y), σ ∈ Π0(Q) if and only if Q contains all d(σ) elements in
D(σ) but not the element in C(σ)∩Y. Since Q is a 1/2-sample of Y, the
probability of σ ∈ Π0(Q) is thus
It follows that
d is the maximum degree
of σ. In our case, d = 4.
47
Average conflict size
Theorem. Let Y be a p-sample of X, then for any σ ∈ Π(Y),
Proof.
Directly deduced from Lemma C2 and Lemma C3.
Note that
and
48
References
[Anagnostopoulos et al., 2006] A. Anagnostopoulos, M. Vlachos, M. Hadjieleftheriou, E. Keogh,
and P. S. Yu. Global distance-based segmentation of trajectories. In: Proc. 12th KDD, 2006.
[Arge et al., 2002] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures
using TPIE. In: Proc. Euro. Symp. Algo., 2002.
[CGAL] Computational geometry algorithms library. http://www.cgal.org.
[Clarkson & Shor, 1989] K. L. Clarkson and P. W. Shor. Applications of random sampling in
computational geometry, II. Discrete Comput. Geom., 4(1): 387-421, 1989.
[Hart & Sharir, 1986] S. Hart and M. Sharir. Nonlinearity of Davenport-Schinzel sequences and of
a generalized path compression scheme. Combinatorica, 6(2): 151-178, 1986.
[Hershberger, 1989] J. Hershberger. Finding the upper envelope of n line segments in O(nlogn)
time. Inform. Process. Lett., 33(4): 169-174, 1989.
[Keogh et al., 2001] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting
time series. In: ICDM, 2001.
[Mulmuley, 1994] K. Mulmuley. Computational Geometry: An Introduction through Randomized
Algorithms. Prentice Hall, Englewood Cliffs, N. J., USA, 1994.
[SIL] M. Hadjieleftheriou. http://www2.research.att.com/~marioh/spatialindex/index.html.
[UCR] E. Keogh, X. Li, and C. Ratanamahatana. The UCR time series dataset.
http://www.cs.ucr.edu/~eamonn/time_series_data/.
49