Q 3 - Data Stream Mining group!
Download
Report
Transcript Q 3 - Data Stream Mining group!
Fine-grained Partitioning for
Aggressive Data Skipping
SIGMOD 2014
UC Berkeley
Calvin
2015-06-03
Contents
• Background
• Contribution
• Overview
• Algorithm
• Data skipping
• Experiment
Background
• How to get insights of enormous datasets interactively ?
• How to shorten query response time on huge datasets ?
• Block / Partition
• Oracle / Hbase / Hive / LogBase
Prune data block(partition) according to metadata
• Drawbacks
1.
2.
3.
4.
5.
Coarse-grained block(partition)s
Not balance
The remaining block(partition)s still contain many tuples
Blocks do not match the workload skew
Data and query filter correlation
Goals
• Workload-driven blocking technique
Fined-grained
Balance-sized
Offline
Re-executable
Co-exists with original partitioning techniques
Example
Condition
Skip
F3
P1, P3
F1^F2
P2, P3
Vectorization
Extract features
1.Split block
2.Storage
How to choose
How to split
Contribution
• Feature selection
Identity representative filters
Modeled as Frequent itemset mining
• Optimal partitioning
Balanced-Max-Skip partitioning problem – NP Hard
A bottom-up framework for approximate solution
Overview
(1)extract features from workload
(2)scan table and transform tuple to (vector,
tuple)-pair
(3)count by vector to reduce partitioner input
(4)generate blocking map(vector -> blockId)
(5)route each tuple to its destination block
(6)update union block feature to catalog
Workload Assumptions
• Filters in query of the workload have commonality and stability
Scheduled or reporting queries
Template query with different value range
Workload Modeling
• Q={Q1,Q2,…Qm}
• Examples:
Q1: product=‘shoes’
Q2: product in (‘shoes’, ‘shirts’), revenue > 32
Q3: product=‘shirts’, revenue > 21
• F: All predicates in Q
• Fi: Qi’s predicates
• fij: Each item in Fi
product in (‘shoes’, ‘shirts’) vs product= ‘shoes’
product in (‘shoes’, ‘shirts’) vs revenue > 21
Filter augmentation
• Examples:
Q1: product=‘shoes’
Q2: product in (‘shoes’, ‘shirts’), revenue > 32
Q3: product=‘shirts’, revenue > 21
• Examples:
Q1: product=‘shoes’, product in (‘shoes’, ‘shirts’)
Q2: product in (‘shoes’, ‘shirts’), revenue > 32, revenue > 21
Q3: product=‘shirts’, revenue > 21, product in (‘shoes’, ‘shirts’)
Frequent itemset mining with threshold T(=2)
numFeat
Partitioning problem modeling
• ={F1,F2,…Fm} as features, weight wi
• V={v1,v2,…vn} as transformed tuple
Vij indicates whether vi satisfies Fj
• P={P1,P2,P3} as a partition
𝑣(𝑃𝑖 ) = 𝑂𝑅𝑣𝑗 ∈𝑃𝑖 𝑣𝑗
• Cost function C(Pi) as sum of tuples
that Pi can skip for all queries in
workload :
Max(C(P))
NP-Hard
The bottom up framework
n2log(n)
R: {vector -> blockId, …}
Ward’s method: Hierarchical grouping to optimize an objective function
Data skipping
1. Generate vector
2. OR with each partition vector
3. Block with at least one 0 bit can be skipped
Experiment
• Environment
Amazon Spark EC2 cluster with 25 instances
8*2.66GHz CPU cores
64 GB RAM
2*840 GB disk storage
• Implement and experiment on Shark (SQL on spark)
Datasets
• TPC-H
•
•
•
•
600 million rows, 700GB in size
Query templates (q3,q5,q6,q8,q10,q12,q14,q19)
800 queries as training workload, 100 from each
80 testing queries, 10 from each
• TPC-H Skewed
• TPC-H query generator has a uniform distribution
• 800 queries as training workload, 100 from each under Zipf distribution
• Conviva
•
•
•
•
User access log of video streams
104 columns: customerId, city, mediaUrl, genre, date, time, responseTime, …
674 training queries and 61 testing queries
680 million tuples, 1TB in size
TPC-H相关说明: http://blog.csdn.net/fivedoumi/article/details/12356807
TPC-H results
• Query performance
• Measure number of tuples scanned and response time for different blocking
and skipping schemas
• Full scan: no data skipping, baseline
• Range1: filter on o_orderdate, about 2300 partitions. Shark’s data skipping used
• Range2: filter on {o_orderdate, r_name, c_mkt_segment, quantity}, about 9000
partitions. Shark’s data skipping used
• Fineblock: numFeature=15 features from 800 training queries, minSize=50k,
Shark’s data skipping and feature-based data skipping are used
TPC-H results - efficiency
TPC-H results – effect of minSize
• The smaller the block size is, the more chance we can skip data
• numFeature=15 and various minSize
Y-value : ratio of number
scanned to number must
be scanned
TPC-H results – effect of numFeat
TPC-H results – blocking time
• A month partition in TPC-H
• 7.7 million tuples, 8GB in size
• 1000 blocks
• numFeat=15,minSize=50
• One minute
Convia results
• Query performance
• Fullscan: no data skipping
• Range: partition on date and a frequently queried column, Shark’s skipping used
• Fineblock: first partition on date, numFeature=40, minSize=50k, Shark’s skipping
and feature-based skipping used