Data-Streams and Histograms

Download Report

Transcript Data-Streams and Histograms

Data-Streams and Histograms
Sudipto Guha, Nick Koudas &
Kyuseok Shim
Background
• Histogram
– Captures distribution statistics in an efficient
manner
– Applications
•
•
•
•
Query optimization
Approximate query answering
Data mining (time series in particular)
Piecewise transmission of data
• EquiWidth, EquiDepth, MHIST, MaxDiff, VOPT
Background
• Data Stream
– An ordered sequence of points that can be
read only once or a small number of times
– Applications
• Mission critical network components
• Dynamic traffic configuration, fault identification,
troubleshooting
– Performance of algorithm measured by
number of passes algorithm must make over
the stream
Motivation
• Since the end use of a histogram is to
approximate a data distribution, why not
use a near-optimal approximation of the
best histogram if it means linear time
computation?
Motivation
• Approximate V-OPT histograms by
improving the dynamic programming
solution from quadratic to linear time
• Revised algorithm uses little space, hence
suitable for data stream model
• Assumes cost of interval is monotonic
under inclusion
Problem Statement
• Given:
– non-negative integers v1, ..., vn
– k intervals or buckets to partition the index 1..n
• Constraint:
– Minimize k VARk where is the variance of values in
the kth bucket
• Dynamic Programming solution:
– OPT[k, n] = min {OPT[k-1, x] + VAR[(x+1)..n]}
x<n
– Runs in O(n2k) time with O(n) space
Intuition of Improvement
• For a  x  b,
– VAR[a..n]  VAR[x..n]  VAR [b..n] (1)
– OPT[a..n]  OPT[x..n]  OPT[b..n] (2)
• Use this monotonicity property to reduce
the search space by settling for an
approximation
• Instead of storing the whole OPT function,
approximate it by a histogram!
Intuition of Improvement
•
•
•
•
For all 1  p  k, maintain intervals (a1,b1),…, (al, bl)
Value of bi  (1+)ai
The number of intervals l depends on p
The value for each interval substitutes for each value in
the interval reducing space and time complexity
Results
• Theorem: A (1+) approximation for V-OPT
runs in O((k2/)log n) space and time
O((nk2/)log n) in the data stream model
Advantages and Disadvantages
• Accuracy/runtime tradeoff can be
controlled by the parameter 
• For data-stream model, alternatives
abound:
– Random sampling (simple, assumption of
distribution)
– Other histogram techniques (faster, less
optimal)
– Wavelet (flexibility)
– Sliding Windows (later paper)
Improvements
Conclusion
• The authors provided an algorithm for
approximating a distribution that runs
reasonably fast and with small space
requirements
• Proposed solution can be applied to datastream model because values are not
referred to unless they are stored