Transcript Slides
Towards Robust Indexing for
Ranked Queries
Dong Xin, Chen Chen, Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
VLDB 2006
Outline
• Introduction
• Robust Index
• Compute Robust Index
– Exact Solution
– Approximate Solution
– Multiple Indices
• Performance Study
• Discussion and Conclusions
2
Introduction
Sample Database R
Ranked Query
Tid
A1
A2
t1
0.10
1.00
Select top 3 * from R
t2
0.15
0.80
order by A1+A2 asc
t3
0.25
0.55
t4
0.40
0.35
t5
0.80
0.25
t6
0.30
0.70
t7
0.35
0.50
t8
0.75
0.45
Linear Ranking
Functions
Query Results
Tid
A1
A2
A1+A2
t4
0.40
0.35
0.75
t3
0.25
0.55
0.80
t7
0.35
0.50
0.85
3
Efficient Processing of Ranked Queries
•
Naïve Solution: scan the whole database and evaluate all tuples
•
Using indices or materialized views
– Distributed Indexing
• Sort each attribute individually and merge attributes by a threshold algorithm
(TA) [Fagin et al, PODS’96,’99,’01]
– Spatial Indexing
• Organize tuples into R-tree and determine a threshold to prune the search
space [Goldstain et al, PODS’97]
• Organize tuples into R-tree and retrieve data progressively [Papadias et al,
SIGMOD’03]
– Sequential Indexing
• Organize tuples into convex hulls [Chang et al, SIGMOD’00]
• Materialize ranked views according to the preference functions [Hristidis et al,
SIGMOD’01]
– And More…
4
Sequential Indexing
• Sequential Index (ranked view)
– Linearly sort tuples
– No sophisticated data structures
– Sequential data access (good for database I/O)
• Representative work
– Onion [Chang et al, SIGMOD’00]
– PREFER [Hristidis et al, SIGMOD’01]
• Our proposal: Robust Index
5
Review: Onion Technique
Sample Database R
A1
Tid
A1
A2
t1
0.10
1.00
t2
0.15
0.80
t3
0.25
0.55
t4
0.40
0.35
t5
0.80
0.25
t6
0.30
0.70
t7
0.35
0.50
t8
0.75
0.45
Ranked Query
Select top 3 * from R
Index by Convex hull
t1
Second
layer
t2
t6
t3 t7
Retrieve data layer by
layer until the top-k
results are found
t8
t4
In worst case, retrieve
top-k layers of tuples
t5
First
layer
A1
A2
If a and b are non-negative
(a, b are weighing
parameters in linear
ranking function)
t1
t2
t6
t3 t7
order by aA1+bA2 asc
Second
layer
t8
Index by Convex Shell
t4
Expect less number of
tuples in each layer
t5
First layer
A2
6
Review: PREFER System
Sample Database R
Tid
A1
A2
t1
0.10
1.00
t2
0.15
0.80
t3
0.25
0.55
t4
0.40
0.35
t5
0.80
0.25
t6
0.30
0.70
t7
0.35
0.50
t8
0.75
0.45
Index by the ranking
function: A1+A2
Index on preference
ranking function
A1
t1
Given query ranking
function: A1+2A2
t2
t6
t3
t7
t4
t5
A2
Ranked Query
Map query ranking
function to index
ranking function
Will retrieve t1, t2, t3,
t4, t6, t7
Select top 3 * from R
order by w1A1+w2A2 asc
t8
Query ranking
function
Map from query
to preference
7
Comments on Sequential Indexing
• PREFER
– Works extremely well when query functions are close to the
index function; Sensitive to query weights
• Onion
– Less sensitive to query weights; Can we do better?
• Both methods
– Require considerable online computation
• Motivation for robust indexing
– Not sensitive to query weights
– Push most computation to index building phase
Average
#tuples
retrieved for 10
random queries
asking for top50 answers
Query weights
are randomly
selected from
1,2,3,4
8
Outline
• Introduction
• Robust Index
• Compute Robust Index
– Exact Solution
– Approximate Solution
– Multiple Indices
• Performance Study
• Discussion and Conclusions
9
Robust Indexing: Motivating Example
A1
t1
Index by Convex hull (shell)
Second
layer
t2
t6
t3 t7
Organize data layer by layer
In order to keep the convexity, each
layer is built conservatively
t8
t4
t5
Robust Index
First layer
A2
A1
Organize data layer by layer
Layer 4
t1
Layer 3
t2
t6
t3 t7
Layer 3
Exploit dominating properties between
data and push a tuple as deep as possible
t8
t7: dominated by t2 and t4 (for any query,
at least one of t2 and t4 ranks before t7)
t4
t5
First layer
A2
t7: dominated by t3 and t5
10
Robust Indexing: Formal Definition
• How does it work?
– Offline phase
• Put each tuple in its deepest layer: the minimal
(best) rank of all possible linear queries
– Online phase
• Retrieve tuples in top-k layers
• Evaluate all of them, and report top-k
• What are expected?
– Correctness
– Less tuples in each layer than convex hull
• If a tuple does not belong to top-k for any query, it
will not be retrieved
11
Robust Indexing: Appealing Properties
• Database Friendly
– No online algorithm required
– Simply use the following SQL statement
Select top k * from R
where layer <=k
order by Frank
• Space efficient
– Suppose the upper bound of the value k is given (e.g.
k<=100)
– Only need to index those tuples in top 100 layers
– Robust indexing uses the minimal space comparing
with other alternatives
12
Outline
• Introduction
• Robust Index
• Compute Robust Index
– Exact Solution
– Approximate Solution
– Multiple Indices
• Performance Study
• Discussion and Conclusions
13
Robust Indexing: Algorithm Highlights
• Exact Solution
– Compute the deepest layer for each tuple
– Complexity:
• n: number of tuples
• d: number of dimensions
• Approximate Solution
– Compute the lower bound layer for each tuple
– Complexity:
• Multiple Indices
– Transform R to different subspaces by linear
transformation
– Build an index in each subspace
14
Exact Solution
Task: to compute the minimal rank over all
possible linear queries for tuple t
L1
A1
L2
L3
L4
t1
t6
t2
t
t5
t3
t4
Given a query Q, with ranking function
F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1
Q is one-to-one mapped to a line L
e.g. A1+2A2 maps to L1
A2
Naïve Proposal:
Enumerate all possible
combinations of (w1,w2)
Not feasible since the
enumerating space is infinite
Alternative Solution:
Only enumerate (w1,w2) whose
corresponding line passes t and
another tuple, e.g., L1, … ,L4
Do not consider t3 and t6 because
the corresponding weights does
not satisfy 0<=w1,w2<=1
15
Exact Solution, cont.
L1
Task: to compute the minimal rank over
all possible linear queries for tuple t
LV
A1
L2
L3
L4
t1
t6
t2
LH
t
Given a query Q, with ranking function
F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1
t5
Lv=>L1: minimal rank is 4 (after t1, t2, t3)
t3
t4
L1=>L2: minimal rank is 3 (after t2, t3)
A2
Complexity: to sort all
lines takes O(n log n), to
compute minimal rank
for all t,
In general,
L2=>L3: minimal rank is 4 (after t2, t3, t4)
L3=>L4: minimal rank is 3 (after t3, t4)
L4=>LH: minimal rank is 4 (after t3, t4, t5)
Minimal rank (the deepest layer) of t is 3
16
Approximate Solution
A1
I4
I
Task: to compute the lower bould of
the minimal rank of tuple t
IV
I3
I2
I1
Four regions
II: dominating region, data ranked before t
IV: dominated region, data ranked after t
I and III?
t
III4
II
III3
III1
III2
III
A2
Lower Bounding Theorem
[Minimal ranking of t]
>= card(II) + min(
card(I3+I2+I1),
card(I2+I1+III1),
card(I1+III1+III2),
card(III1+III2+III3))
Step 1: Partition regions I and III
Step 2: Count cardinalities of region
II and sub-regions I1,…,I4, III1,…,III4
Step 3: Match the cardinalities of the
sub-regions and compute the lower
bound
17
Approximate Solution, Cont.
A1
I4
I
Step 2: Count cardinalities of region II
and sub-regions I1,…,I4, III1,…,III4
IV
I3
L
I2
I1
Count the cardinality of region II?
1. All tuples in region II dominate t
2. A reversed version of skyline
problem
3. Standard divide and conquer
solution (details in the paper)
t
III4
III3
II
III1
III2
III
A2
Tid
A1
A2
Tid
A1
A2
t
0.50
0.50
t
-0.50
0.63
t1
0.15
0.80
t1
-0.15
0.35
t2
0.25
0.55
t2
-0.25
0.39
t3
0.40
0.35
t3
-0.40
0.49
A1=-A1
A2=A1+0.25A2
Count the cardinality of region I1?
Suppose t: (a1,a2)
Line L: A1 + 0.25A2=a1 + 0.25a2
Tuples in region I1 satisfy
-A1 <= -a1
A1+0.25A2 <= a1 + 0.25 a2
18
Quality of the Approximate Solution
• Complexity:
– B: number of partitions in each subspace
– n: number of tuples
– d: number of dimensions
• Approximate quality:
– Assume data forms a uniform distribution
– Each subspace is partitioned evenly
– Partitioning according to the data distribution
is an important and interesting future topic
19
Multiple Indices
• Why?
Ranking function:
F=w1A1+w2A2
– To relax the constraint
– To decompose and strengthen
the constraints
• How? (e.g., for w1<=w2)
– Linearly transform R to R’, and
build index on R’
(A1,A2) => (A1+A2, A2)
– Rewrite query weights
(w1,w2) => (w1,w2-w1)
Relax
Ranking function:
F=w1A1+w2A2
Where 0<=w1,w2<=1
Data are projected to a smaller subspace
(e.g., A1’ >=A2’ in the transformed subspace)
Tuples can be pushed deeper
since more domination can be found
Strengthen
Ranking function:
F=w1A1+w2A2
Where 0<=w1<=w2<=1,
or 0<=w2<=w1<=1
20
Multiple Indices, Cont.
Number of tuples in top-k layers
Top-k
Convex
Shell
Robust
Indexing
5
329
148
10
823
262
20
2064
427
50
6130
813
100
9965
1271
150
10000
1618
200
10000
1922
Using the same
index space, robust
indexing can build
8 indices
(if the value of k is
up bounded by 100)
Synthetic Data: 10K tuples
21
Outline
• Introduction
• Robust Index
• Compute Robust Index
– Exact Solution
– Approximate Solution
– Multiple Indices
• Performance Study
• Discussion and Conclusions
22
Performance Study
• Data
– Synthetic data
– Real dataset (abalone3D, cover3D)
• Measure
– Number of tuples retrieved
– Execution time not reported, but the robust indexing is
expected to be even better
• Approaches for comparison
– Onion (convex shell)
– PREFER
– Approximate Robust Indexing (AppRI), #partition=10
23
Index Construction Time
Convex Shell, Convex Hull and
AppRI are implemented by C++
Construction time on PREFER
is not included since it is
implemented in Java
Using the system default
parameter, PREFER takes more
than 1200 seconds on the 50k
data set
24
Query Performance
Average Number of tuples
retrieved on synthetic data
Average Number of tuples
retrieved on Cover3D data set
25
Multiple Indices (Views)
Synthetic Data, 3
dimensions
Build 3 robust indices
by decompose the
weighting parameters:
w1=max(w1,w2,w3)
w2=max(w1,w2,w3)
w2=max(w1,w2,w3)
26
Discussion and Conclusions
• Strength
– Easy to integrate with current DBMS
– Good query performance
– Practical construction complexity
• Limitation
– Online index maintenance is expensive (some
weaker maintaining strategies available)
– Indexing high dimensional data remains a
challenging problem
27