Dense-Region Based Compact Data Cube
Download
Report
Transcript Dense-Region Based Compact Data Cube
Dense-Region Based Compact
Data Cube
Presented by: Kan Kin Fai
Outline
Background
Introduction to Compact Data Cube
Pros and cons of the Compact Data Cube
method
Dense-Region Based Compact Data Cube
Background
Why is a data cube?
– Some pre-computed aggregates on the underlying data
warehouse.
System constraints on materializing data cube(s)
– Disk space, maintenance cost, etc.
Common approach: materialize parts of a data cube.
Alternative: use approximation technique
– Reason: OLAP applications accept approximate answers in
many scenarios.
Introduction to Compact Data Cube
Compact Data Cube was proposed by Vitter and Wang in
Approximate Computation of Multidimensional Aggregates of
Sparse Data Using Wavelets (SIGMOD 99).
Main Ideas
– Offline phase: perform Haar wavelet transform on the
underlying data (i.e. the base cuboid) and store the k most
significant coefficients.
– Online phase: process any given query based on the k most
significant coefficients.
Introduction to Compact Data Cube
Basics of Haar wavelet transform
Building Compact Data Cube
Thresholding and Ranking
Answering On-Line Queries
Introduction to Compact Data Cube
Basics of Haar wavelet transform
– e.g. S = [2, 2, 0, 2, 3, 5, 4, 4]
Resolution
8
4
2
1
Averages
[2, 2, 0, 2, 3, 5, 4, 4]
[2, 1, 4, 4]
[1.5, 4]
[2.75]
Detail Coefficients
[0, -1, -1, 0]
[0.5, 0]
[-1.25]
Introduction to Compact Data Cube
Basics of Haar wavelet transform
– For compression reasons, the detail coefficients are
normalized.
– The coefficients at the lower resolutions are weighted more
heavily.
– Approximates the original signal by keeping only the most
significant coefficients.
– Requires only O(N) CPU time and O(N/B) I/Os to compute for
a signal of N values.
– Multidimensional wavelet transform: a series of onedimensional wavelet transforms.
Introduction to Compact Data Cube
Building the Compact Data Cube
– Problem 1: the size of the multidimensional array
representing the underlying data is too large
(assume the data are very sparse).
– Solution: Divide the wavelet transform process into
multiple passes.
Introduction to Compact Data Cube
Building the Compact Data Cube
– Problem 2: The density of the intermediate results
would increase from pass to pass.
– Solution: truncate the intermediate multidimensional
array by cutting off entries with small magnitude.
– I/O complexity:
Nz
N
O(
log M / B
)
B
B
Introduction to Compact Data Cube
Thresholding and Ranking
– Choice 1: keep the C largest (in absolute value)
wavelet coefficients.
– Choice 2: keep the C wavelet coefficients with the
largest weights among the C’ largest coefficients
(C < C’).
• The weight of a coefficient equals to the number of its
dimensions with value zero.
Introduction to Compact Data Cube
Answering On-Line Queries
– Space: ((d+1)k), CPU time:
2d ' min k ,2
1 i d'
log Di
Pros and cons of the compact data
cube method
Pros:
– Requires little disk spaces (a small number of disk
blocks).
– Responds to on-line query fast.
– Answers OLAP queries more accurately than other
approximation techniques like histogram and random
sampling.
– Can progressively refine the approximate answer
with no added overhead.
Pros and cons of the compact data
cube method
Cons:
– Approximates a vast amount of useless empty cells
in base cuboid together with useful non-empty cells
in base cuboid.
– Needs to cut off entries with small magnitude at the
end of each pass in order to maintain a constant
amount of I/O operations from pass to pass.
Dense-Region Based Compact Data
Cube
Aim:
– To enhance the ability of the compact data cube method to
handle datasets having dense-regions-in-sparse-cube
property.
Main Idea:
– To exclude empty cells in base cuboid from approximation.
Two-phase approach
– Compute dense regions in base cuboid.
– Approximate each dense region independently.
Dense-Region Based Compact Data
Cube
Question 1: how can we find the dense regions
efficiently?
–
Efficient DEnse region Mining (EDEM) algorithm
proposed by Cheung et al. in DROLAP -- A DenseRegion-Based Approach to On-line Analytical
Processing (DEXA99)
Dense-Region Based Compact Data
Cube
– Basic ideas of EDEM:
• Build a k-d tree to store the valid cells.
• Grow dense region covers along boundaries.
• Search dense regions among the covers.
– Complexity of EDEM: linear to the number of
dimensions and sub-quadratic to the number of data
points.
Dense-Region Based Compact Data
Cube
Question 2: how should we allocate disk space
in approximating the dense regions?
– Choice 1: allocate disk space equally to each dense
regions.
– Choice 2: allocate disk space according to the sizes
of dense regions.
– Choice 3: order the wavelet coefficients of all the
dense regions and keep the most significant ones (in
absolute value).
Dense-Region Based Compact Data
Cube
Question 3: how should we treat the data points outside
the dense regions?
– Keep all or keep only significant ones.
Question 4: how do we answer on-line queries using the
dense-region based approach?
– Check if a dense region covered by the given query.
– Check if the stored coefficients contribute to the range sum
and compute the amount of contribution if needed.
Dense-Region Based Compact Data
Cube
One favorable side effect:
– we may parallelize the construction of compact data cube.
More questions:
– How can we handle updates to the underlying data?
– How can we approximate iceberg cube? Can we apply the
idea of compact data cube to iceberg cube?
– Can compact data cube be used to answer other types of
OLAP queries besides range-sum?