slides - Cornell University

Download Report

Transcript slides - Cornell University

Query Optimization
In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke
Cornell University
Flip Korn
AT&T Labs
1
Why Compression?

CPU speed outpaces Disk speed exponentially!
– x10 / decade (bandwidth), x100 / decade (latency)

Trade CPU for I/O: improve query performance
+ Save bandwidth for sequential I/O
+ Improve buffer pool hit ratio
- Pay decompression cost

Environment
– Decision support queries
– Lossless compression
2
Issues

Database compression methods
 Efficient
query processing
3
Database Compression Methods
General-purpose
compression
Database compression

Only compression ratio
matters


Large decompression unit
(whole file)

Both compression ratio
and decompression cost
matter
Small decompression unit
(attribute or tuple)
Our setting: allow to decompress a single attribute
4
Efficient Query Processing

Compared to uncompressed DB
– When to decompress
– Assumption: no compression in query processing

Our story
– Different strategies of when to decompress
– None of them is always optimal
– Combined optimization problem:
Query plan + decompression placement
– Solutions
– Experiments
5
Different Decompression Strategies
Eager
Lazy
All uncompressed
R.A = S.B
D(R)
R
Transient
AB uncompressed
R.A = S.B
D(S)
S
D(R.A)
R
All compressed
d(R.A) = d(S.B)
D(S.B)
S
Mem
R
Disk
S 6
Which Strategy Is Optimal?

Lazy vs. eager
– Lazy is always better

Transient vs. Lazy
– Transient: more I/O savings
– Lazy: lower decompression cost

In practice
– Numerical attributes: transient is always better
– String attributes: no clear winner
• Expensive to decompress
• High I/O savings if compressed
7
An Example With TPCH Data
Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME
Sort(S_N, C_N)
S_A = C_A
Supplier
Customer
8
Transient vs. Lazy
Lazy sort (7s)
1 attribute compressed
Lazy BNL (2s)
Transient sort (3s) Transient sort (0.5s)
3 attributes compressed
Lazy BNL (2s)
An optimization problem!
All attributes
compressed
Transient BNL (42s)
9
Interactions With Traditional Optimization
Algorithm: run System R, then decide when to decompress.
Transient sort (3s)
3 attributes compressed
Transient sort (0.5s)
All attributes compressed
Pruned by System R
Lazy BNL (2s)
Transient SM (2.5s)
Optimal plan returned by System R is no longer optimal!
10
Compression Aware Optimization

Given a query and a compressed DB:
Find the optimal query plan
 New operators
– Explicit decompression operators
– Transient versions of existing relational operators

Search space: O (nm) factor over old search space
–
–
–
–
n is the depth of the plan
m is the number of attributes
Each attribute explicitly decompressed at most once
For each attribute, n places to decompress explicitly
11
Dynamic Programming - OPT
Extend system R optimizer
– Bottom up, one minimal plan per interesting property
– What attributes remain compressed as a new property
Lazy BNL (2s)
Transient SM join (2.5s)
Property: S_A, C_A uncompressed Property: all compressed
Customer Supplier
Customer
Supplier
Blowup reduced from nm to 2m
12
Min-K Heuristic Algorithm

Store plans for k rather than 2m properties
– The k properties whose plans are cheapest
Storage blowup reduced from 2m to k
 Time: still exponential blowup in the worst case

Join on
Lazy: S_A, C_A
S_A, C_A Transient: S_A, C_A
Lazy: S_A, transient: C_A
S_A,… C_A,…
Transient: S_A, Lazy: C_A
Stored plans:
13
Min-K Heuristics (2)

If transient decompression is bad for
one join attribute, often so for the other
– BNL join: both S_A and C_A decompressed N2 times

Only consider two cases
Stored plans:
Join on
S_A, C_A
S_A,… C_A,…

Lazy: S_A, C_A
Transient: S_A, C_A
Time blowup is 2k
14
Experiments

Setup
– Modify Predator query engine & optimizer
– Algorithms
• Uncompressed, Eager, Lazy, Transient-Only,
Two-Step, OPT, Min-1, Min-2
– 100 MB TPCH data
– 50% compression ratio
– Pentium III 550 Mhz, vary buffer pool size
15
Experimental Setup (2)

Randomly add join conditions on string attributes
 Divide queries into workloads
– Number of string join conditions, number of join tables

Metrics: for algorithm X
– Average relative-cost:
Average(cost of plan returned by X / cost of opt plan)
– Average blowup factor:
Average(# plans searched by X / # plans by System R)
16
Average Relative Cost
Queries with 3-4 join tables, buffer pool 10% of compressed DB
14
OPT
Min-2
Min-1
Two-Step
Eager
Lazy
Transient-Only
Uncompressed
12
Rel-cost
10
8
6
4
2
0
0
1
2
3
Number of join conditions on string
attributes
17
Distribution of Query Performance
Percentage of Good plans (cost within twice of OPT) for all queries
Percentage of good plans
100%
Min-2
90%
80%
Min-1
Two-Step
70%
Lazy
60%
TransientOnly
Not
Compressed
50%
40%
30%
20%
Eager
10%
0%
18
Optimization Cost
Queries with 3-4 join tables
Blowup Factor
60
OPT
50
40
30
Min-2
20
10
4
0
0
1
2
Number of join conditions on string
attributes
3
19
Related Work

How to compress
– Roth&Horn93, Iyer&Wilhite94, Goldstein98

How to query
– Graefe&Shapiro91, Westmann00, Greer99

Query optimization
– Compressed MOLAP aggregates: Li99
– Compressed Bitmap indices:Amer-Yahia&Johnson00
– Expensive predicates:
• Chaudhuri&Shim99, Hellerstein93
20
Conclusions & Future Work

Novel optimization problem
–
–
–
–

Search for regular query plan + when to decompress
Separate search sub-optimal
OPT and Min-K heuristic
Up to an order improvement in experiments
Future work
– Caching decompressed values
– Updates
21
Search Space
Regular sort
Sort(S_A)
3 places to place D(S_A)
Before: convert to transient
After: as it is
D(S_A)
3 extended plans (3 is depth)
Transient join
S_A = C_A
S_A, …
nm blow up over old space
-n: depth of plan
-m: number of attributes
22
Relative-Cost
- Varying Buffer Pool Size
Queries with 3- 4 join tables, 2 additional string joins
14
OPT
12
Min-2
Min-1
Rel-cost
10
Two-Step
8
Eager
6
Lazy
4
Transient-Only
Uncompressed
2
0
10%
40%
200%
Buffer Pool Size (% of compressed DB)
23
Relative Performance (2)
Queries with more than 5 join tables
12
OPT
Min-2
Min-1
Two-Step
Eager
Lazy
Transient-Only
Uncompressed
Rel-Cost
10
8
6
4
2
0
0
1
2
3
Number of join conditions on string
attributes
24