Slides - Cornell University

Download Report

Transcript Slides - Cornell University

Compressing Query Results for
Mobile Clients
Zhiyuan Chen and Praveen Seshadri
Cornell University
1
Motivation(1)
Results
Database
Server
Slow network!!!
9.6 - 170 kbps, 1 - 10 minutes for 1 MB
Compress results on the server to save network bandwidth.
Concern: compress as much as possible.
2
Motivation(2)
Database
Server
Results
I am a PDA, I can’t store that much for later use!
–Often work offline. Need to store results for later use.
–Severe storage constraints.
–Usually only a small portion of results will be accessed, random access
possible.
Store the result compressed, decompress on demand.
Both space & decompression cost matters!
3
Why not just GZip or WinZip?
• Small decompression unit for PDA
reduce decompression cost
• Utilize information of the query result:
– Choose a combination of compression methods
based on semantic and statistical information of
the result.
– Because different attributes have different characteristics,
there is no unique winner.
– The choice is made by cost-based optimization.
More compression
4
An Example
• Query
Select Year, Month, Day, Ticker, Low, High
From Quotes where Year is between 1998 and 1999
Ordered by Year, Month, Day
Year
99
99
99
Semantic
compression
Universal
compression
Month
03
03
03
Grouping
Day
01
01
01
Ticker Low
High
T
40 1/16
42
IBM 100 3/8 103
DIS 30 1/16 32 1/2
Dictionary Use 4 bits for
each digit
Ziv-Lempel on each field
5
Outline
• Related work
• An algebraic framework
– To represent the valid combinations.
• Compression optimization
– To choose the best combination.
• Experiments
– Chosen combinations beat universal compression tools like
WinZip.
6
Related Work
• Compression community
– Different compression methods. Run-length encoding, ZivLempel, differential, etc.
• Database community – compress tables or indices
– Specific compression methods
• Iyer & Wilhite - tuple level Ziv-Lempel.
• Goldstein etc.- per-page offset-encoding.
• Ng & Ravishankar - tuple differential coding.
• Theodore Johnson - compressing bitmap indices.
– Impact on query processing
• (Roth&Horn, Graefe & Shapiro, Theodore)
7
Compression Framework
• Compressed results as a compressed table.
• A compression method as a compression operator:
Input table -> output table.
• A combination of compressing methods as a
compression plan
• Decompression operators and plans.
8
Compressed Table
•Field and tuple boundaries may blur. - Compressed data blocks
•Extra information enables decompression. - Compression schema
Customer ID
Order #
Customer_1
Customer_1
Customer_2
Customer_2
10000
10001
10003
10011
Grouping
Customer_1
(Block 1)
Customer_2
(Block 2)
10000
10001
10003
10011
A compressed block = a value compressed from some cells in the uncompressed table.
Also the unit of decompression.
A compression schema includes:
Things compressed to a compressed block, compression method, relational schema.
9
Compression Operator
Defines how a compression method is applied on what
part of the input table.
•The fields the operator is applied on.
•What input blocks will be compressed together.
•The compression method and information used in compression.
•Applied on Customer ID
Customer_1
Customer_1
Customer_1
(Block 1)
•Same Customer IDs compressed
together.
•Method = grouping.
10
Compression Plan
A sequence of compression operators applied on the original
result table, each takes the output of the previous operator as
the input.
Grouping
Customer ID
Offset-encoding
Order #
10000
Customer_1
10000
Customer_1
10001
10001
Customer_2
10003
10003
Customer_2
10011
Customer_1
Customer_2
10011
0
Customer_1
1
3
Customer_2
11
11
Optimization - Cost Model
• Formula
– Cost = w1 * compression cost
+ w2 * decompression cost
- w3 * saving on network transfer
- w4 * saving on client side storage.
– Adjust weight based on goal of compression.
• Compression cost
– CPU speed, compression plans, results size.
• Decompression cost
– Client processor speed, access pattern, etc. Provided by clients!
• Network transfer saving - compressed results size.
• Client side storage saving - If decompress on demand.
12
Searching
• Naïve algorithm has exponential search space.
• Heuristics:
• Consider semantic compression first.
– For each field, use the naïve algorithm to find the best plan
compressing this field only.
– Combine these plans.
• Consider universal compression methods.
– Add Ziv-Lempel applied on each field only if this will reduce
the overall cost.
• Not complete but polynomial search space
– O(#of fields * #of valid plans on each field)
13
Experiments
• Data:
– TPCD
• Queries:
– Adapted from TPCD queries by deleting aggregates.
• Experiment 1: To save network bandwidth.
– Measure: the overall end-to-end time.
• Compression time + transfer time + decompression time.
• Experiment 2: To save PDA clients’ storage.
– Measure: the space and random access time.
• Compare: Common tools v.s. chosen combinations.
14
–Semantic compression - allow decompress individual attribute values. (S)
–WinZip on the whole table. (W)
–WinZip applied on each field. (PW)
–Semantic compression + WinZip applied on each field. (SPW)
W
S: 3.1
W: 3.4
PW: 4.4
SPW:6.5
%of no compression
Un/compressed
PW
SPW
40%
30%
20%
10%
0%
7.075
20
40
60
80
100
network bandwidth (KB/s)
Modem
Wireless
15
Internet
Compression plans:
– Semantic compression (S) - allow to decompress an individual tuple.
– Windows CE’s default compressor. (D)
– Ziv-Lempel applied on each page. (Z)
Result size: 3.7 MB. 2 MB Data storage, 2MB program storage.
50(+1 for semantic compression) KB program size.
Storage
Accessing
Time
S+D
0.75 MB
2.85 second
D
1.42 MB
2.48 second
Z+D
1.43 MB
23.1 second
Storage usage and time to randomly
access 1000 tuples
16
Summary
• A combination of compression methods based on
semantic and statistical information of the result.
• Choice made by cost-based optimization.
• A framework to model combinations of compression
methods.
• Future work:
– Apply the methodology to compress data tables.
– Joint optimization of result compression & query processing.
17