Document 16980
Download
Report
Transcript Document 16980
Efficient Processing of
Massive Data Streams for
Mining and Monitoring
Mirek Riedewald
Department of Computer Science
Cornell University
Acknowledgements
Al Demers
Abhinandan Das
Alin Dobra
Sasha Evfimievski
Johannes Gehrke
KD-D initiative (Art Becker et al.)
Introduction
Data streams versus databases
Network monitoring
High arrival rates, approximation [CGJSS02]
Stock trading
Infinite stream, continuous queries
Limited resources
Complex computation [ZS02]
Retail, E-business, Intelligence, Medical
Surveillance
Identify relevant information on-the-fly, archive for
data mining
Exact results, error guarantees
Information Spheres
Local Information Sphere
Within
each organization
Continuous processing of distributed data
streams
Online evaluation of thousands of triggers
Storage/archival of important data
Global Information Sphere
Between
organizations
Share data in privacy preserving way
Local Information Sphere
Distributed data stream event processing
and online data mining
Technical challenges
Blocking
operators, unbounded state
Graceful degradation under increasing load
Integration with archive
Processing of physically distributed streams
Event Matching, Correlation
Join of data streams
Brand
Mpix
Canon 3.0
Price
Mpix
Price
200
>2.0
<250
Event Matching, Correlation
Join of data streams
Price
Mpix
Price
Canon 3.0
200
>2.0
<250
Fuji
100
>4.0
<400
Brand
Mpix
3.0
Event Matching, Correlation
Join of data streams
Price
Mpix
Price
Canon 3.0
180
> 2.0
< 250
Fuji
3.0
220
> 4.0
< 400
Kodak
4.0
340
= 3.0
< 200
Brand
Mpix
Equi-join, text similarity, geographical
proximity,…
Problem: unbounded state, computation
Window Joins
Restrict join to window of most recent
records (tuples)
Landmark
window
Sliding window based on time or number of
records
Problem definition
Window
based on time: size w
Synchronous record arrival
Equi-join
Abstract Model
Data streams R(A,…), S(A,…)
Compute equi-join on A
Match
all r and s of streams R, S such that
r.A=s.A
Sliding window of size w
R
1
1
1
S
2
3
1
(r0,s2), (r1,s2), (r2,s2)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
Compute equi-join on A
Match
all r and s of streams R, S such that
r.A=s.A
Sliding window of size w
R
1
1
1
3
S
2
3
1
1
(r0,s2), (r1,s2), (r2,s2)
(r3,s1), (r1,s3), (r2,s3)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
Compute equi-join on A
Match
all r and s of streams R, S such that
r.A=s.A
Sliding window of size w
R
1
1
1
3
2
S
2
3
1
1
4
(r0,s2), (r1,s2), (r2,s2)
(r3,s1), (r1,s3), (r2,s3)
No new output
Limited Resources
Focus on limited memory M<2w
State of the art: random load shedding
[KNV03]
Random
sample of streams
Desired approach: semantic load shedding
Goal: graceful degradation
Approximation
Set-valued
result: Error measure?
Set-Approximation Error
What is a good error measure?
Information Retrieval, Statistics, Data Mining
Matching coefficient
Dice coefficient
Jaccard coefficient
Cosine coefficient
Overlap coefficient
| A B |
2 | A B | /(| A | | B |)
| A B | / | A B |
| A B | / | A| | B |
| A B | / min{| A |, | B |}
Earth Mover’s Distance (EMD) [RTG98]
Match And Compare (MAC) [IP99]
Join: subset of output result
EMD, Overlap coefficient trivially 0 or 1
Others (except MAC) reduce to MAX-subset error
measure
Optimization Problem
Select records to be kept in memory such
that the result size is maximized subject
to memory constraints
Lightweight online technique
Adaptivity in presence of memory
fluctuations
Optimal Offline Algorithm
What is the best possible that can be
achieved?
Optimal
sampling strategy for MAX-subset
Bottom-line for evaluation of any online
algorithm
Same optimization problem, but knows future
Finite subsets of input streams
Formulate as linear flow problem
Generation of Flow Model
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Fixed memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
-1
-1
-1
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
-1
-1
-1
-1
-1
S=2,3,1,1
-1
Complexity
Integer solution exists
Optimal solution found in O(n2 m log n)
N
input size of single stream
#nodes: n < 2wN + N + 2
#arcs: m < 2n + M + 1
Reasonable costs for benchmarking
Approx.
1GB memory (w=800, M=800)
Approx. 1h computation time
Optimal Flow
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Fixed memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Easy to Extend
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Variable memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Online Heuristics
Maximize expected output
PROB:
sort tuples by join partner arrival
probability
LIFE: sort tuples by product of partner arrival
probability and remaining lifetime
Maintain stream statistics
Histograms
(DGIM02, TGIK02), wavelets
(GKMS01), quantiles (GKMS02, GK01)
Approximation Quality
Effect of Skew
Summary
Information sphere architecture
Optimal algorithm and fast efficient heuristic for
sliding window joins
Open problems
Other set error measures, resource models
Other joins: compress records
Complex queries
Distributed processing
Integration with other techniques into local
information sphere
Related Work
Aurora (Brown, MIT), STREAM (Stanford),
Telegraph (Berkeley), NiagaraCQ
(Wisconsin, OGI)
Memory requirements [ABBMW02,TM02]
Aggregation
Alon,
Bar-Yossef, Datar, Dobra, Garofalakis,
Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis,
Koudas, Matias, Motwani, Muthukrishnan,
Rastogi, Srivastava, Strauss, Szegedy
Other Results
[DGR03]
Integration with archive
Load smoothing, not shedding
Novel “error” measure: archive access cost
Static join for sensor networks
Maximize result size subject to constraints on energy
consumption
Polynomial dynamic programming solution
Fast 2-approximation algorithms
NP-hardness proof for join of 3 or more streams
Other Results (cont.)
[DGGR02]
Computation of aggregates over streams
for multiple joins
Small
pseudo-random sketch synopses
(randomized linear projections)
Explicit, tunable error guarantees
Sketch partitioning to boost accuracy
(intelligently partition join attribute space)
Thanks!
?
?
?
Questions?
?
?
?
?