Progressive Merge - University of Illinois Urbana
Download
Report
Transcript Progressive Merge - University of Illinois Urbana
Progressive and Selective
Merge: Computing Top-K with
ad-hoc Ranking Functions
Dong Xin, Jiawei Han, Kevin C.-C. Chang
Department of Computer Science
University of Illinois at Urbana-Champaign
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Experimental Results
Conclusions
2
Motivation: Smart Search in Massive
Databases
Search is still a problem in
massive databases!
Challenges on data
Apartment Search on www.rent.com
similar web database search on
www.cars.com
www.bizrate.com
Gigabytes of data
Hundreds of dimensions
Challenges on query processing
Flood of results: A query may
return too many answers
Quality of answers: Users are
often interested in only a few but
high-quality results
Efficiency: It is slow to process
and return so many answers!
3
Lesson from the Web Search: Ranked
Queries
Ranking is the key
in successful Web
Search
All Two bedroom
Apartments
Ranking Apts by
a|price-1000| +
[distance to shopping]
Top-K Results
4
Ranked Query: More Example
Stock Price Predication
Stock Value: V
Earnings: E
Research Expenditure: R
Train a Predication Model from history data
f(E,R)
Top-k stocks which are most predicable
Select top k from Stock S
order by (S.V-f(S.E, S.R))^2 asc
5
Data and Query Model
Data: Relation R
Indices are built on attributes
B-tree index on single attribute
R-tree index on multiple attributes
Query with ad hoc Ranking Function F
Attributes from multiple indices
F is lower-boundable (assume minimal k)
Given
a domain D
One
can derive the lower bound value of F(D)
The
lower bound value need not be tight
6
Example Ranking Functions
Typical functions
All linear functions: x+y
All quadratic functions: (x-100)^2+(y-100)^2
All monotone functions: x*y
Some special functions
Min square error: (x-y)^2
Median, min, max of input variables
7
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Experimental Results
Conclusions
8
Previous Study: Sort-Merge
Sort-Merge
Assume
Many
[Fagin et al, 2001]
monotone functions
variations
tid
Price
Sq feet
t1
500
600
t2
700
800
t3
800
900
t4
1000
1000
t5
1100
200
t6
1200
500
t7
1200
560
t8
1350
1120
Select top 1 from Apartment
order by [price] / [sq feet] asc
tid
t1
t2
t3
t4
t5
t6
t7
t8
Price
500
700
800
1000
1100
1200
1200
1350
tid
t8
t4
t3
t2
t1
t7
t6
t5
Sq feet
1120
1000
900
800
600
560
500
200
Intuition: Data appearing top on
both lists likely has higher score
9
Our Solution: Index-Merge
The
function involves both [price] and [sq feet]
[Price]
and [Sq feet] are indexed individually
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3 10, t4 11, t5 12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
6, t1
8, t4
9, t2 10, t8 11,t3
10
Search over Joint Space
Merge
both indices online
Search
top results over the joint Space
The joint state (P1, S2)
Boundary: [5 - 8] X [6 - 8]
Join results: { t1 }
(P1,S1)
Join [Price] and [Sq feet]
during online computation
(P1,S2)
(Price, Sq feet)
(P1,S3)
(P2,S1)
…
(P3,S3)
Cartesian product of partitions from [price] and [sq feet]
11
Sort-Merge vs. Index-Merge
Sort-Merge
Index-Merge
Data
Organization
Each attribute is
sorted in a list
Each attribute or multiple
attribute is indexed by
hierarchical tree structures
Data Access
Methods
Sorted Access (SA)
Random Access (RA)
Node Access (following tree
structures)
Ranking
Functions
Optimization
Opportunities
Monotone functions
Ad-hoc functions (lowerbound-able)
SA only
Combine SA and RA
Progressive Merge
Selective Merge
12
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Problem Analysis
Progressive Merge
Selective Merge
Experimental Results
Conclusions
13
Challenge 1: Complexity of Join
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
…
(P3,S3)
Number of children by joining two index node: Cartesian Product
Complexity of Join
B-tree with page size 4kB
The fan-out of a node is 204
Join 2 B-trees: 40k children
Join 3 B-trees: 8M children
14
Challenge 2: Empty State
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3
t1,t2,t3
2, t5 5, t6 5,t7
t5,t6,t7
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
…
Empty join state
should be pruned
Probability of Empty State
Join B-trees on 1M data
Up to 1M non-empty Joint leaf states
Each B-tree has 5140 leaf nodes
Join 2 B-trees: 25.2M joint leaf (96% are empty)
Join 3 B-trees: 126.5G joint leaf (99.999% are empty)
15
Efficient Search Over The Joint Space
Straight forward search is expensive
Assemble the complete joint space first
Search over the joint space
Smart search: join as necessary
Progressive Merge
Identify which joint states have top scores
Assemble “good” joint states first
Selective Merge
Identify which joint states contain non-empty result
Avoid generating joint states that are empty
16
Problem Analysis
Minimize
CPU cost: Number of joint states to be assembled
I/O cost: Number of joint states to be retrieved
Performance At a Glance
Ranking Function: f=(A-B)^2
A, B are indexed by B-trees, data has 1M records
17
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Problem Analysis
Progressive Merge
Selective Merge
Experimental Results
Conclusions
18
Progressive Merge (1)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
1000-1100
1200-1350
Sq feet (S1, S2, S3)
200 - 500
600 - 800
900 -1120
Special Case: convex function
Find the best joint state by the extreme point, e.g.,
best state (P2, S2)
Progressive assemble neighboring states, e.g., (P2, S1) (P2,
S3) (P1, S2) (P3, S2)
19
Progressive Merge (2)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
0
Sq feet (S1, S2, S3)
1200-1350
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
General function: not convex, not monotone…?
Step 1: Sort Pi by f(Pi, S), Sort Si by f(P, Si)
P2 S2
P1 S3
f(Pi, S) = min f(x) (x in Pi and S)
i.e., price in Pi node, and sq_feet in
S node
P3 S1
Sorted Lists of P and S
20
Progressive Merge (3)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
Sq feet (S1, S2, S3)
1200-1350
0
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
General function: not convex, not monotone…?
Step 2: Merge Pi, Sj progressively
Sorted Lists of P
and S
P2 S2
(P2, S2)
P1 S3
(P2, S3) (S2, P1) (P1, S3)
P3 S1
(P2, S1) (P1, S1) (P3, S2) (P3, S3)
(P3, S1)
Progressive merged results
21
Progressive Merge (4)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
0
Sq feet (S1, S2, S3)
1200-1350
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
General function: not convex, not monotone…?
Step 3: Stop when the best seen is no worse than the unseen
(P2, S2)
Best seen: f(P2, S2)=0
(P2, S3) (S2, P1) (P1, S3)
f(P2,S2) ≤ min (f(P3,S), f(P, S3))
(P2, S1) (P1, S1) (P3, S2) (P3, S3)
(P3, S1)
Best possible of unseen:
min ( f(P1, S), f(P, S3) ) = 10,000
22
How Good is the Progressive Merge?
Optimal case
Suppose Sk is the kth best score in the final results
Optimal algorithm only examines all joint state s, such that f(s) > Sk
Convex functions: N ≤ mN*
N: number of joint states examined by neighborhood expansion
N*: optimal number
m: number of indices to be merged
General functions: N ≤ 2mN’
N: number of joint states examined by threshold expansion
N’: any other algorithm in the same category
m: number of indices to be merged
23
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Problem Analysis
Progressive Merge
Selective Merge
Experimental Results
Conclusions
24
Selective State
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
…
(P3,S3)
P1 contains (t1, t2, t3)
S1 contains (t5, t6, t7)
Join P1 and S1: Empty state
How to identify empty states before the
joint states are retrieved?
25
Join Signature
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3 10, t4 11, t5 12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
S1 S2 S3
P1 P2 P3
Data and their location
in the index
0
1
1
1
1
0
1
0
1
6, t1
8, t4
9, t2 10, t8 11,t3
Join Signature indicates
Empty States
Compress the join
signature by bloom filter
Check the Join Signature
before assembling states
Join Signature
26
Revisit: Comparing to Sort-merge
Progressive Merge
Corresponds to Sorted Access scheduling in sortmerge
Node access scheduling in index-merge is more
complicated
Selective Merge
Corresponds to Random Access in Sort-merge
Random access dose not help in index-merge (details
in paper)
The join signature is also applicable to Sort-merge
27
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Problem Analysis
Progressive Merge
Selective Merge
Experimental Results
Conclusions
28
Execution Time w.r.t. Different
Functions
Join two B-tree indices
TS: Table Scan, BL: Baseline, PE: Progressive Merge,
SIG: Signature-based Selective Merge
F=(x-a)^2+(y-b)^2
F=(x-y^2)^2
29
Disk Access and Memory Usage
Join two B-tree indices, Top 100 query
BL: Baseline, PE: Progressive Merge, SIG: Signature
F=(x-y^2)^2
30
Road Map
Introduction
Sort-Merge vs. Index-Merge
Proposed Solutions
Problem Analysis
Progressive Merge
Selective Merge
Experimental Results
Conclusions
31
Discussion and Conclusions
The Index-Merge Framework
Extends the sort-merge framework
Does not confine to monotone functions
Efficient Computation
Progressive Merge: reduce the CPU and memory
overhead
Selective Merge: reduce the disk I/O cost
32
Thank You!