Progressive Merge - University of Illinois Urbana

Download Report

Transcript Progressive Merge - University of Illinois Urbana

Progressive and Selective
Merge: Computing Top-K with
ad-hoc Ranking Functions
Dong Xin, Jiawei Han, Kevin C.-C. Chang
Department of Computer Science
University of Illinois at Urbana-Champaign
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Experimental Results

Conclusions
2
Motivation: Smart Search in Massive
Databases
 Search is still a problem in
massive databases!
 Challenges on data
Apartment Search on www.rent.com
similar web database search on
www.cars.com
www.bizrate.com
 Gigabytes of data
 Hundreds of dimensions
 Challenges on query processing
 Flood of results: A query may
return too many answers
 Quality of answers: Users are
often interested in only a few but
high-quality results
 Efficiency: It is slow to process
and return so many answers!
3
Lesson from the Web Search: Ranked
Queries
Ranking is the key
in successful Web
Search
All Two bedroom
Apartments
Ranking Apts by
a|price-1000| +
[distance to shopping]
Top-K Results
4
Ranked Query: More Example
 Stock Price Predication
 Stock Value: V
 Earnings: E
 Research Expenditure: R
Train a Predication Model from history data
f(E,R)
Top-k stocks which are most predicable
Select top k from Stock S
order by (S.V-f(S.E, S.R))^2 asc
5
Data and Query Model

Data: Relation R
 Indices are built on attributes
 B-tree index on single attribute
 R-tree index on multiple attributes

Query with ad hoc Ranking Function F
 Attributes from multiple indices

F is lower-boundable (assume minimal k)
 Given
a domain D
 One
can derive the lower bound value of F(D)
 The
lower bound value need not be tight
6
Example Ranking Functions
 Typical functions
 All linear functions: x+y
 All quadratic functions: (x-100)^2+(y-100)^2
 All monotone functions: x*y
 Some special functions
 Min square error: (x-y)^2
 Median, min, max of input variables
7
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Experimental Results

Conclusions
8
Previous Study: Sort-Merge
Sort-Merge
Assume
Many
[Fagin et al, 2001]
monotone functions
variations
tid
Price
Sq feet
t1
500
600
t2
700
800
t3
800
900
t4
1000
1000
t5
1100
200
t6
1200
500
t7
1200
560
t8
1350
1120
Select top 1 from Apartment
order by [price] / [sq feet] asc
tid
t1
t2
t3
t4
t5
t6
t7
t8
Price
500
700
800
1000
1100
1200
1200
1350
tid
t8
t4
t3
t2
t1
t7
t6
t5
Sq feet
1120
1000
900
800
600
560
500
200
Intuition: Data appearing top on
both lists likely has higher score
9
Our Solution: Index-Merge
The
function involves both [price] and [sq feet]
[Price]
and [Sq feet] are indexed individually
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3 10, t4 11, t5 12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
6, t1
8, t4
9, t2 10, t8 11,t3
10
Search over Joint Space
Merge
both indices online
Search
top results over the joint Space
The joint state (P1, S2)
Boundary: [5 - 8] X [6 - 8]
Join results: { t1 }
(P1,S1)
Join [Price] and [Sq feet]
during online computation
(P1,S2)
(Price, Sq feet)
(P1,S3)
(P2,S1)
…
(P3,S3)
Cartesian product of partitions from [price] and [sq feet]
11
Sort-Merge vs. Index-Merge
Sort-Merge
Index-Merge
Data
Organization
Each attribute is
sorted in a list
Each attribute or multiple
attribute is indexed by
hierarchical tree structures
Data Access
Methods
Sorted Access (SA)
Random Access (RA)
Node Access (following tree
structures)
Ranking
Functions
Optimization
Opportunities
Monotone functions
Ad-hoc functions (lowerbound-able)
SA only
Combine SA and RA
Progressive Merge
Selective Merge
12
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Problem Analysis

Progressive Merge

Selective Merge

Experimental Results

Conclusions
13
Challenge 1: Complexity of Join
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
…
(P3,S3)
Number of children by joining two index node: Cartesian Product
Complexity of Join
B-tree with page size 4kB
The fan-out of a node is 204
Join 2 B-trees: 40k children
Join 3 B-trees: 8M children
14
Challenge 2: Empty State
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3
t1,t2,t3
2, t5 5, t6 5,t7
t5,t6,t7
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
…
Empty join state
should be pruned
Probability of Empty State
Join B-trees on 1M data
Up to 1M non-empty Joint leaf states
Each B-tree has 5140 leaf nodes
Join 2 B-trees: 25.2M joint leaf (96% are empty)
Join 3 B-trees: 126.5G joint leaf (99.999% are empty)
15
Efficient Search Over The Joint Space

Straight forward search is expensive
 Assemble the complete joint space first
 Search over the joint space

Smart search: join as necessary
 Progressive Merge
 Identify which joint states have top scores
 Assemble “good” joint states first

Selective Merge
 Identify which joint states contain non-empty result
 Avoid generating joint states that are empty
16
Problem Analysis
 Minimize
 CPU cost: Number of joint states to be assembled
 I/O cost: Number of joint states to be retrieved
 Performance At a Glance
 Ranking Function: f=(A-B)^2
 A, B are indexed by B-trees, data has 1M records
17
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Problem Analysis

Progressive Merge

Selective Merge

Experimental Results

Conclusions
18
Progressive Merge (1)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
1000-1100
1200-1350
Sq feet (S1, S2, S3)
200 - 500
600 - 800
900 -1120
 Special Case: convex function
 Find the best joint state by the extreme point, e.g.,
best state (P2, S2)
 Progressive assemble neighboring states, e.g., (P2, S1) (P2,
S3) (P1, S2) (P3, S2)
19
Progressive Merge (2)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
0
Sq feet (S1, S2, S3)
1200-1350
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
 General function: not convex, not monotone…?
 Step 1: Sort Pi by f(Pi, S), Sort Si by f(P, Si)
P2 S2
P1 S3
f(Pi, S) = min f(x) (x in Pi and S)
i.e., price in Pi node, and sq_feet in
S node
P3 S1
Sorted Lists of P and S
20
Progressive Merge (3)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
Sq feet (S1, S2, S3)
1200-1350
0
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
 General function: not convex, not monotone…?
 Step 2: Merge Pi, Sj progressively
Sorted Lists of P
and S
P2 S2
(P2, S2)
P1 S3
(P2, S3) (S2, P1) (P1, S3)
P3 S1
(P2, S1) (P1, S1) (P3, S2) (P3, S3)
(P3, S1)
Progressive merged results
21
Progressive Merge (4)
Ranking Function : [price – 1k]^2 + [sq feet - 800]^2
Price (P1, P2, P3)
500 - 800
f(Pi, S)
40,000
1000-1100
0
Sq feet (S1, S2, S3)
1200-1350
40,000
200 - 500
f(P, Si)
90,000
600 - 800
0
900 -1120
10,000
 General function: not convex, not monotone…?
 Step 3: Stop when the best seen is no worse than the unseen
(P2, S2)
Best seen: f(P2, S2)=0
(P2, S3) (S2, P1) (P1, S3)
f(P2,S2) ≤ min (f(P3,S), f(P, S3))
(P2, S1) (P1, S1) (P3, S2) (P3, S3)
(P3, S1)
Best possible of unseen:
min ( f(P1, S), f(P, S3) ) = 10,000
22
How Good is the Progressive Merge?



Optimal case

Suppose Sk is the kth best score in the final results

Optimal algorithm only examines all joint state s, such that f(s) > Sk
Convex functions: N ≤ mN*

N: number of joint states examined by neighborhood expansion

N*: optimal number

m: number of indices to be merged
General functions: N ≤ 2mN’

N: number of joint states examined by threshold expansion

N’: any other algorithm in the same category

m: number of indices to be merged
23
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Problem Analysis

Progressive Merge

Selective Merge

Experimental Results

Conclusions
24
Selective State
Join [Price] and [Sq feet]
(Price, Sq feet)
(P1,S1)
(P1,S2)
(P1,S3)
(P2,S1)
…
(P3,S3)
P1 contains (t1, t2, t3)
S1 contains (t5, t6, t7)
Join P1 and S1: Empty state
How to identify empty states before the
joint states are retrieved?
25
Join Signature
Price (P1, P2, P3)
Sq feet (S1, S2, S3)
5 - 8 10-11 12-13
2 - 5 6 - 8 9 -11
5, t1 7, t2 8,t3 10, t4 11, t5 12, t6 12, t7 13,t8
2, t5 5, t6 5,t7
S1 S2 S3
P1 P2 P3
Data and their location
in the index
0
1
1
1
1
0
1
0
1
6, t1
8, t4
9, t2 10, t8 11,t3
Join Signature indicates
Empty States
Compress the join
signature by bloom filter
Check the Join Signature
before assembling states
Join Signature
26
Revisit: Comparing to Sort-merge


Progressive Merge

Corresponds to Sorted Access scheduling in sortmerge

Node access scheduling in index-merge is more
complicated
Selective Merge

Corresponds to Random Access in Sort-merge

Random access dose not help in index-merge (details
in paper)

The join signature is also applicable to Sort-merge
27
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Problem Analysis

Progressive Merge

Selective Merge

Experimental Results

Conclusions
28
Execution Time w.r.t. Different
Functions
 Join two B-tree indices
 TS: Table Scan, BL: Baseline, PE: Progressive Merge,
SIG: Signature-based Selective Merge
F=(x-a)^2+(y-b)^2
F=(x-y^2)^2
29
Disk Access and Memory Usage
 Join two B-tree indices, Top 100 query
 BL: Baseline, PE: Progressive Merge, SIG: Signature
F=(x-y^2)^2
30
Road Map

Introduction

Sort-Merge vs. Index-Merge

Proposed Solutions

Problem Analysis

Progressive Merge

Selective Merge

Experimental Results

Conclusions
31
Discussion and Conclusions


The Index-Merge Framework

Extends the sort-merge framework

Does not confine to monotone functions
Efficient Computation

Progressive Merge: reduce the CPU and memory
overhead

Selective Merge: reduce the disk I/O cost
32
Thank You!