Document 16980

Download Report

Transcript Document 16980

Efficient Processing of
Massive Data Streams for
Mining and Monitoring
Mirek Riedewald
Department of Computer Science
Cornell University
Acknowledgements
Al Demers
 Abhinandan Das
 Alin Dobra
 Sasha Evfimievski
 Johannes Gehrke
 KD-D initiative (Art Becker et al.)

Introduction

Data streams versus databases



Network monitoring


High arrival rates, approximation [CGJSS02]
Stock trading


Infinite stream, continuous queries
Limited resources
Complex computation [ZS02]
Retail, E-business, Intelligence, Medical
Surveillance


Identify relevant information on-the-fly, archive for
data mining
Exact results, error guarantees
Information Spheres

Local Information Sphere
 Within
each organization
 Continuous processing of distributed data
streams
 Online evaluation of thousands of triggers
 Storage/archival of important data

Global Information Sphere
 Between
organizations
 Share data in privacy preserving way
Local Information Sphere
Distributed data stream event processing
and online data mining

Technical challenges
 Blocking
operators, unbounded state
 Graceful degradation under increasing load
 Integration with archive
 Processing of physically distributed streams
Event Matching, Correlation

Join of data streams
Brand
Mpix
Canon 3.0
Price
Mpix
Price
200
>2.0
<250
Event Matching, Correlation

Join of data streams
Price
Mpix
Price
Canon 3.0
200
>2.0
<250
Fuji
100
>4.0
<400
Brand
Mpix
3.0
Event Matching, Correlation

Join of data streams
Price
Mpix
Price
Canon 3.0
180
> 2.0
< 250
Fuji
3.0
220
> 4.0
< 400
Kodak
4.0
340
= 3.0
< 200
Brand
Mpix
Equi-join, text similarity, geographical
proximity,…
 Problem: unbounded state, computation

Window Joins

Restrict join to window of most recent
records (tuples)
 Landmark
window
 Sliding window based on time or number of
records

Problem definition
 Window
based on time: size w
 Synchronous record arrival
 Equi-join
Abstract Model
Data streams R(A,…), S(A,…)
 Compute equi-join on A

 Match
all r and s of streams R, S such that
r.A=s.A
 Sliding window of size w
R
1
1
1
S
2
3
1
(r0,s2), (r1,s2), (r2,s2)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
 Compute equi-join on A

 Match
all r and s of streams R, S such that
r.A=s.A
 Sliding window of size w
R
1
1
1
3
S
2
3
1
1
(r0,s2), (r1,s2), (r2,s2)
(r3,s1), (r1,s3), (r2,s3)
Abstract Model (cont.)
Data streams R(A,…), S(A,…)
 Compute equi-join on A

 Match
all r and s of streams R, S such that
r.A=s.A
 Sliding window of size w
R
1
1
1
3
2
S
2
3
1
1
4
(r0,s2), (r1,s2), (r2,s2)
(r3,s1), (r1,s3), (r2,s3)
No new output
Limited Resources
Focus on limited memory M<2w
 State of the art: random load shedding
[KNV03]

 Random
sample of streams
 Desired approach: semantic load shedding

Goal: graceful degradation
 Approximation
 Set-valued
result: Error measure?
Set-Approximation Error

What is a good error measure?

Information Retrieval, Statistics, Data Mining








Matching coefficient
Dice coefficient
Jaccard coefficient
Cosine coefficient
Overlap coefficient
| A B |
2 | A  B | /(| A |  | B |)
| A B | / | A B |
| A B | / | A|  | B |
| A B | / min{| A |, | B |}
Earth Mover’s Distance (EMD) [RTG98]
Match And Compare (MAC) [IP99]
Join: subset of output result


EMD, Overlap coefficient trivially 0 or 1
Others (except MAC) reduce to MAX-subset error
measure
Optimization Problem
Select records to be kept in memory such
that the result size is maximized subject
to memory constraints
Lightweight online technique
 Adaptivity in presence of memory
fluctuations

Optimal Offline Algorithm

What is the best possible that can be
achieved?
 Optimal
sampling strategy for MAX-subset
 Bottom-line for evaluation of any online
algorithm
 Same optimization problem, but knows future
 Finite subsets of input streams

Formulate as linear flow problem
Generation of Flow Model
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Fixed memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
-1
-1
-1
S=2,3,1,1
Correspondence to Windows
R=1,1,1,3
-1
-1
-1
-1
-1
S=2,3,1,1
-1
Complexity
Integer solution exists
 Optimal solution found in O(n2 m log n)

N
input size of single stream
 #nodes: n < 2wN + N + 2
 #arcs: m < 2n + M + 1

Reasonable costs for benchmarking
 Approx.
1GB memory (w=800, M=800)
 Approx. 1h computation time
Optimal Flow
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Fixed memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Easy to Extend
M=2, w=3
-1
R=1,1,1,3
-1
-1
-1
Variable memory allocation
S=2,3,1,1
cost
Capacity: 0..1, linear cost
-1
3
-3
-1
Keep in memory
Replace
Online Heuristics

Maximize expected output
 PROB:
sort tuples by join partner arrival
probability
 LIFE: sort tuples by product of partner arrival
probability and remaining lifetime

Maintain stream statistics
 Histograms
(DGIM02, TGIK02), wavelets
(GKMS01), quantiles (GKMS02, GK01)
Approximation Quality
Effect of Skew
Summary



Information sphere architecture
Optimal algorithm and fast efficient heuristic for
sliding window joins
Open problems





Other set error measures, resource models
Other joins: compress records
Complex queries
Distributed processing
Integration with other techniques into local
information sphere
Related Work
Aurora (Brown, MIT), STREAM (Stanford),
Telegraph (Berkeley), NiagaraCQ
(Wisconsin, OGI)
 Memory requirements [ABBMW02,TM02]
 Aggregation

 Alon,
Bar-Yossef, Datar, Dobra, Garofalakis,
Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis,
Koudas, Matias, Motwani, Muthukrishnan,
Rastogi, Srivastava, Strauss, Szegedy
Other Results
[DGR03]
 Integration with archive



Load smoothing, not shedding
Novel “error” measure: archive access cost
Static join for sensor networks




Maximize result size subject to constraints on energy
consumption
Polynomial dynamic programming solution
Fast 2-approximation algorithms
NP-hardness proof for join of 3 or more streams
Other Results (cont.)
[DGGR02]
 Computation of aggregates over streams
for multiple joins
 Small
pseudo-random sketch synopses
(randomized linear projections)
 Explicit, tunable error guarantees
 Sketch partitioning to boost accuracy
(intelligently partition join attribute space)
Thanks!
?
?
?
Questions?
?
?
?
?