Transcript t 1

DB Seminar Presentation
Top-k Queries on Temporal Data (VLDB J., to appear)
Authors
Presenter
Supervisors
Date
Feifei Li, Ke Yi, & Wangchao Le
Hao Wang
Nikos Mamoulis & David Cheung
16 Nov 2010
Content
 I. The problem & baseline solutions
 II. The index: SEB-tree (the sampled envelope B-tree)
 III. Extensions
 IV. Experiments
 Appendix
 Time series segmentation
 Upper envelope: the complexity
 Configuration space & Lemma 1
2
Part I
The Problem & Baseline Solutions
 Top-k(t) query
 Naïve solutions
 R-tree-based solutions
3
The problem setting
 Data
 Piece-wise linear functions
How to approximate time series data is beyond
the scope of the paper. See Appendix A for the
method used in the experiments.
Score
o1
o2
o3
o4
t1
t2
Time
t1
t2
o3
o2
o1
o4
o1
o3
o4
o2
 Task
 Top-k(t) query: Given any time instance t, find top-k objects at time t.
 How to answer the query for any t and k ≤ kmax efficiently?
4
Baseline: Naïve solutions
 Naïve solution 1 (based on raw data)
 Sort all objects at each time instance
 Organize the sorted lists in a B-tree
 Indexing attribute: time instances
 Indexing record: sorted lists
Expensive storage cost
Overhead query performance
 Naïve solution 2 (based on piece-wise linear representation)
 Pre-compute all intersections of line segments, which partition the time
axis into disjoint intervals
Score
 Sort all objects in each interval
 Organize the sorted lists in a B-tree
 Indexing attribute: time intervals
 Indexing record: sorted lists
Quadratic space
Time
5
Baseline: R-tree-based solutions
 Framework
 Break up each trajectory into a number of MBRs
 Build an R-tree on top of these MBRs
 Applying generic kNN algorithm to answer top-k(t)
[Anagnostopoulos et al., 2006]
Filtering & Refinement
Not tailored for top-k(t) queries
Current commercial DBMSs have limited supports on R-trees
6
Part II The Index: SEB-tree




The design principle
The index
Analysis
Comparison of different solutions
7
The design principle
 Consider the dataset as a set of line segments, S
 p-Sample
 Sample each line segment from S randomly with probability p
 Upper envelope of a set of line segments X, U(X)
(point-wise maximum)
 Also piece-wise linear
U(X)
e1
e5
e2
e3
e6
e7
e4
8
The design principle (cont’d)
 Trapezoidal decomposition (also: vertical visibility map)
 Conflict list
 For ei ∈ U(X), C(ei) is the set of line segments from S that ever locate
above ei (i.e. intersect Δi)
Δ1
Δ2
Δ3
Δ5
Δ4
Δ6
Δ7
U(X)
e1
C(e3)
e5
e2
e3
e6
e7
e4
9
The design principle (cont’d)
 How can this help top-k(t) queries?
 Intuition: The larger the probability p, the smaller the size of C(e).
 If we are lucky…
 C(e) is almost the result set
Δ1
Δ2
Δ3
Δ5
Δ4
Δ6
Δ7
U(X)
e1
C(e3)
e5
e2
e3
t
e6
e7
e4
10
How to calculate the upper envelope?
– O(nlogn) time. (Appendix B)
SEB-tree
How to get the conflict lists efficiently?
(1) Geometrically
decreasing samples
(2) Upper envelope,
Decomposition, &
Conflict lists
(3) B-trees on
conflict lists
S0
U(S0), D(S0), C(S0)
T0
S1
U(S1), D(S1), C(S1)
T1
S
SEB-tree
Sl
U(Sl), D(Sl), C(Sl)
Tl
11
Building the conflict lists
 Keep on sampling within Si = L0
 Si = L0 ⊇ L1 ⊇ … ⊇ Lλ
 Li +1 is a 1/2-sample of Li
 Compute trapezoidal decomposition D(L0), D(L1), …, D(Lλ)
L0 ( = Si )
L1
12
Answering top-k(t) queries
T0
S
T1
Correctness.
Tl
t
13
Analysis: Lemmas

Lemma 1. If X is a p-sample of S, then for any e ∈ U(X), E[|C(e)|] = O(1/p).
 Involves some well-established computational geometry tools
 See Appendix C for more explanation
U(X)
C(e)

Lemma 2. There are O(nα(n)) vertices on the upper envelope of n
segments, where α(n) is the inverse Ackermann function.
 α(n) is a very slowly growing function; α(n) < 4 even if n = 2↑↑65536
 See Appendix B for more explanation
14
Analysis: Space
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
Lemma 2. O(nα(n)) envelope complexity.


Expected size of Si
S0
Expected total size of Ti
T0
S1

Index size

Index size (occupied disk blocks)
T1
Sl
Tl
15
Analysis: Query performance

Lemma 3. The probability that the algorithm queries Ti is at most
−Ω( (i − l0)2 ) , for any i > l .
0
2

Step 1. Prob. of “fails when querying Ti”
 At least one segment in top-k(t) is sampled into Si
S0
T0

Step 2. Prob. of “queries Ti”
 Tlo, Tlo+1, …, Ti-1 have been queried and all failed
S1
T1
Sl
Tl
16
Analysis: Query performance (cont’d)
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
−Ω( (i − l0)2 )
Lemma 3. Queries Ti w. prob. 2

.
Visiting Tl0
S0

T0
Visiting Ti’s (i > l0)
S1
T1
Total:
Sl
Tl
17
Analysis: Query performance (cont’d)
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
−Ω( (i − l0)2 )
Lemma 3. Queries Ti w. prob. 2

Visiting Tl0

Visiting Ti’s (i > l0)

Scanning the entire S
.
S0
T0
S1
T1

Query performance (number of I/Os)
Sl
Tl
18
Analysis: Construction time
Lemma 1. If p-sample then E[|C(e)|] = O(1/p).
Lemma 2. O(nα(n)) envelope complexity.

Generating samples S0, S1, …, Sl

Building the upper envelopes
Lj
Geom.
Lj+1

Constructing C(Si)

Construction time
19
Comparison of different solutions
20
Part III Extensions
 Space-query tradeoff
 Aggregation of multiple attributes
21
Space-query tradeoff
T0
T1
S
“Bad things happen.”
What if C(Δ) get peculiarly large?
Tl
SEB-treeλ (λ = 3, 4)
22
Aggregation of multiple attributes
 Multiple temporal attributes
 a1(t), a2(t), …, ad(t)
 Aggregation scoring function f ( a1(t), a2(t), …, ad(t) )
 Extended solution
 Build a SEB-tree for each of the temporal attributes
 Visit trees Ti0, Ti1, … in order to get a sorted list of attribute i
 Apply top-k algorithms such as TA to answer the extended top-k(t) query
 Analysis
 Unknown
 Probably NOT instance-optimal
23
Part IV Experiments
 Setup
 Index size and construction cost
 Query cost
24
Setup
 Programming





language: C++
Upper envelope: CGAL library [CGAL]
B-trees: TPIE library [Arge et al., 2002]
R*-tree, MVR-tree, TPR-tree: The spatial index library [SIL]
Machine: 64-bit Linux, 4GB RAM, 2GHz Intel Xeon CPU
 Datasets
 UCR (Keogh’s CD ROM) [UCR]
 Mallat Techno-metrics (2400 time series * 1024 time instances)
 Lightcurve (5000 time series * 1024 time instances)
 Segmentation: 200 line segments / time series [Keogh et al., 2001]
25
Setup (cont’d)
 Datasets (cont’d)
 UCR (Keogh’s CD ROM) [UCR]
 Synthetic data





Pick m the average number of line segments in each fi
For fi, pick mean μi ∈ [100, 300] and number of line segments mi ∈ [1, 2m]
Generate mi time instances t1 < t2 < … < tmi from [0, 1000]
Set fi (tj) according to N(μi, σ2); if fi (tj) falls out of [0, 400], reset
σ is a system parameter used to control the degree of intersections as well as
the stableness of top-k distribution
 Metrics
 Index size, construction cost, query cost, update cost
 Sensitivity analysis of data characteristics (σ, n, m, kmax, …)
26
Datasets (rand. sample of 5 objects)
More “vertical”.
More intersections.
27
Index size and construction cost (σ)
SEB-tree
R-tree
fi (tj) ∈ [0, 400]
kmax=200
28
Index size and construction cost (n)
SEB-tree
R-tree
fi (tj) ∈ [0, 400]
kmax=200
29
Query cost (σ)
SEB-tree
R-tree
MVB-tree
30
Query cost (n)
SEB-tree
R-tree
MVB-tree
31
Conclusion

Propose an index to answer top-k(t) queries

Near-linear construction time; near-linear space

Easy to deploy (using B-trees only)

Interesting and important directions
 Ranking queries within a temporal range
 Ranking queries on original time series without segmentation
32
Appendix A
Time Series Segmentation
33
Time series segmentation
 Purpose
 For efficient storage, transmission, indexing, and querying
 Given a time series T, produce the best representation
 using only K segments;
 s.t. the maximum error for any segment does not exceed some threshold;
 s.t. the combined error of all segments does not exceed some threshold.
 Algorithms
 Sliding Window. A segment is grown until it exceeds some error bound.
 Bottom-Up. Starting from the finest possible approximation, segments
are merged until some stopping criteria is met.
 Top-Down. The time series is recursively partitioned until some stopping
criteria is met.
34
Sliding window algorithm
Linear interpolation
Linear regression
Online
Poor quality
35
Bottom-up algorithm
Offline
Good quality
36
SWAB
[Keogh et al., 2001]
 Sliding Window And Bottom-up
Applying Bottom-up
Read in one more segment (obtained
by applying Sliding-window)
37
Appendix B
Upper Envelope: The Complexity
38
From envelopes to sequences
5
3
4
2
1
5
1 2 1
3
1
3
4
5
4
5
2
3
4
The length of any alternating subsequence is at most 4.
39
Davenport-Schinzel sequence
 A sequence U = (u1, u2, …, um) is an (n, s) Davenport-Schinzel
sequence, if it satisfies the following conditions
 1 ≤ ui ≤ n for each i;
 for each i < m we have ui ≠ ui+1; and
 if a and b are two distinct values, then there is no subsequence …, a, …,
b, …, a, …, b, … consisting of s + 2 values alternating between a and b.
 An envelope is in general a DS(n, 3), sometimes a DS(n, 2).
 The number λs(n)
 λs(n) = max{|U| : U is a DS(n, s) sequence}.
 Theorem [Hart & Sharir, 1986]
 λ2(n) = Θ(n);
 λ3(n) = Θ(nα(n)).
40
Finding the envelope: D&C
 Divide the set S into two subsets of equal size
 Find the upper envelope for each subset
 Merge the two upper envelopes
41
A better division
[Hershberger, 1989]
 Interval tree
DIVIDE
O(logn) layers;
CONQUER
O(n) worst case to find
the envelope of each layer;
MERGE
O(nα(n)loglogn) using
the line sweep method
TOTAL: O(nlogn)
42
Appendix C
Configuration Space & Lemma 1
43
Configuration space

[Clarkson & Shor, 1989][Mulmuley, 1994]
The framework
 A configuration space is a 4-tuple (X, Π, D, C), where X is the set of objects and
Π is the set of configurations.
 A configuration σ over X is a pair (D, C) = (D(σ), C(σ)), where D and C are
disjoint subsets of X. The objects in D are said to define σ, whereas the objects in
C are said to conflict with σ.
 Given a subset Y ⊆ X, the induced configuration space Π(Y) is the set of all
configurations σ such that D(σ) ⊆ Y and C(σ) ∩ Y = ∅.

In our case…





X ---- The set of all line segments.
Π ---- All possible top-open trapezoids that are determined by two vertices.
D(σ) ---- The at most 4 segments that determine σ.
C(σ) ---- All the segments that intersect σ.
Π(Y) ---- Trapezoids defined by Y ⊆ X.
44
More notations and a lemma

More notations
 d(σ), c(σ) ---- The sizes of D(σ) and C(σ), respectively.
 Πc(Y) ---- The set of configurations in Π(Y) with conflict size c.
 πc(Y) ---- The size of Πc(Y).

Lemma C1. If Y is a p-sample of X, then for any a ≥ 0,

Proof.
 Note that σ ∈ Πa(Y) if and only if D(σ) ⊆ Y and |C(σ) ∩ Y| = a.
 Since Y is a p-sample of X, the probability of this event can be calculated
in the standard Bernoulli way.
45
Bounding the expected total size

Lemma C2. If Y is a p-sample of X, then for σ ∈ Π0(Y),

Proof.
Lemma C1
46
Bounding π1(Y)

Lemma C3. Let Y be a p-sample of X and Q be a 1/2-sample of Y, then

Proof.
 For σ ∈ Π1(Y), σ ∈ Π0(Q) if and only if Q contains all d(σ) elements in
D(σ) but not the element in C(σ)∩Y. Since Q is a 1/2-sample of Y, the
probability of σ ∈ Π0(Q) is thus
 It follows that
d is the maximum degree
of σ. In our case, d = 4.
47
Average conflict size

Theorem. Let Y be a p-sample of X, then for any σ ∈ Π(Y),

Proof.
 Directly deduced from Lemma C2 and Lemma C3.
 Note that
 and
48
References










[Anagnostopoulos et al., 2006] A. Anagnostopoulos, M. Vlachos, M. Hadjieleftheriou, E. Keogh,
and P. S. Yu. Global distance-based segmentation of trajectories. In: Proc. 12th KDD, 2006.
[Arge et al., 2002] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient data structures
using TPIE. In: Proc. Euro. Symp. Algo., 2002.
[CGAL] Computational geometry algorithms library. http://www.cgal.org.
[Clarkson & Shor, 1989] K. L. Clarkson and P. W. Shor. Applications of random sampling in
computational geometry, II. Discrete Comput. Geom., 4(1): 387-421, 1989.
[Hart & Sharir, 1986] S. Hart and M. Sharir. Nonlinearity of Davenport-Schinzel sequences and of
a generalized path compression scheme. Combinatorica, 6(2): 151-178, 1986.
[Hershberger, 1989] J. Hershberger. Finding the upper envelope of n line segments in O(nlogn)
time. Inform. Process. Lett., 33(4): 169-174, 1989.
[Keogh et al., 2001] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting
time series. In: ICDM, 2001.
[Mulmuley, 1994] K. Mulmuley. Computational Geometry: An Introduction through Randomized
Algorithms. Prentice Hall, Englewood Cliffs, N. J., USA, 1994.
[SIL] M. Hadjieleftheriou. http://www2.research.att.com/~marioh/spatialindex/index.html.
[UCR] E. Keogh, X. Li, and C. Ratanamahatana. The UCR time series dataset.
http://www.cs.ucr.edu/~eamonn/time_series_data/.
49