
Data Streams
Topics in Data Mining
Fall 2015
Bruno Ribeiro
© 2015 Bruno Ribeiro
Data Streams Applications

Stream item counting

Stream statistics

Stream classification

Stream matching
What are Data Streams?

Data Streams
◦ Data streams: continuous, ordered, changing, fast, huge in volume
◦ Traditional DBMS: data stored in finite, persistent data sets

Characteristics
◦ Huge volumes of continuous data, possibly infinite
◦ Fast changing; requires fast, real-time responses
◦ Random access is expensive: single-scan (one-pass) algorithms only
◦ Store only a summary of the data seen so far
◦ Most stream data are low-level or multi-dimensional in nature and
need multi-level, multi-dimensional processing
Ack. From Jiawei Han
Examples

Telecommunication calling records

Business: credit card transaction flows

Network monitoring and traffic engineering

Financial market: stock exchange

Engineering & industrial processes: power supply & manufacturing

Sensor, monitoring & surveillance: video streams, RFIDs

Security monitoring

Web logs and Web page click streams

Massive data sets (even when stored, random access is too expensive)
Ack. From Jiawei Han
DBMS versus DSMS

◦ Persistent relations vs. transient streams
◦ One-time queries vs. continuous queries
◦ Random access vs. sequential access
◦ "Unbounded" disk store vs. bounded main memory
◦ Only the current state matters vs. historical data is important
◦ No real-time services vs. real-time requirements
◦ Relatively low update rate vs. possibly multi-GB arrival rates
◦ Data at any granularity vs. data at fine granularity
◦ Assumes precise data vs. stale/imprecise data
◦ Access plan determined by the query processor and physical DB design
vs. unpredictable/variable data arrival and characteristics
Ack. From Motwani’s PODS tutorial slides
In General: Streaming algorithm
[Schematic: a continuous data stream X1, X2, ..., Xn (terabytes) flows
through a stream processing engine that maintains, via hashing, a small
in-memory summary (gigabytes). A query Q is answered from the summary
with an estimate of θ, where θ = g(X1,...,Xn): an "indirect" observation
of the stream.]
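As a concrete instance of this pattern, here is a minimal single-pass summary (my example, not from the slides): Welford's algorithm keeps O(1) state from which the mean and variance of everything seen so far can be queried at any time.

```python
class RunningStats:
    """Constant-memory, single-pass summary of a numeric stream
    (Welford's algorithm): answers mean/variance queries at any time."""

    def __init__(self):
        self.n = 0        # items seen
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Usage: the stream is consumed once; only O(1) state is kept.
stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.variance())  # 5.0 4.571...
```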
Querying

Query types
◦ One-time query vs. continuous query (evaluated continuously as the
stream continues to arrive)
◦ Predefined query vs. ad-hoc query (issued on-line)

Unbounded memory requirements
◦ For real-time responses, main-memory algorithms should be used
◦ Memory requirements are unbounded if a query must join against future tuples

Approximate query answering
◦ With bounded memory, it is not always possible to produce exact
answers
◦ High-quality approximate answers are desired
◦ Data reduction and synopsis construction methods
 Sketches, random sampling, histograms, wavelets, etc.
Ack. From Jiawei Han
Synopses/Approximate Answers

Major challenges
◦ Keep track of a large universe, e.g., pairs of IP addresses, not ages

Methodology
◦ Synopses (trade-off between accuracy and storage): a summary given in
brief terms that covers the major points of a subject matter
◦ Use synopsis data structures, much smaller (O(log^k N) space) than
their base data set (O(N) space)
◦ Compute an approximate answer within a small error range
(a factor ε of the actual answer)

Major methods
◦ Random sampling
◦ Histograms
◦ Sliding windows
◦ Multi-resolution models
◦ Sketches
◦ Randomized algorithms
Ack. From Jiawei Han
Types of Streaming Algorithms

Sliding windows
◦ Operate only over a sliding window of recent stream data
◦ An approximation, but often more desirable in applications

Batched processing, sampling, and synopses
◦ Batched if updates are fast but computation is slow
 Compute periodically; not very timely
◦ Sampling if updates are slow but computation is fast
 Compute using sample data
◦ Synopsis data structures
 Maintain a small synopsis or sketch of the data
 Good for querying historical data

Blocking operators, e.g., sorting, avg, min, etc.
◦ Blocking if unable to produce the first output until seeing the entire input
Ack. From Jiawei Han
Stream Processing

Random sampling (without knowing the total stream length in advance;
see the reservoir-sampling sketch after this list)

Sliding windows
◦ Make decisions based only on recent data of sliding window size w
◦ An element arriving at time t expires at time t + w

Histograms
◦ Approximate the frequency distribution of element values in a stream
◦ Partition data into a set of contiguous buckets
◦ Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)

Multi-resolution models
◦ Popular models: balanced binary trees, micro-clusters, and wavelets
Ack. From Jiawei Han
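The first bullet above is exactly the setting of reservoir sampling; here is a minimal sketch (my example, not from the slides) that keeps a uniform random sample of k items from a stream of unknown length in one pass.

```python
import random

def reservoir_sample(stream, k):
    """Return k items drawn uniformly at random from `stream`,
    in one pass and O(k) memory, without knowing its length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = random.randint(0, i)   # item i+1 kept w.p. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 elements from a stream we can only scan once.
print(reservoir_sample(iter(range(10**6)), 5))
```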
Random Sampling:
A Simple Approach to Item Counts
Random Sampling: Packet sampling
[Schematic: packets entering a router from the Internet are sampled at
random (Bernoulli sampling), producing a traffic summary.]

Bernoulli sampling
◦ Widely used: processing overhead controlled by the sampling rate (e.g., 1/200)
◦ Traffic summary: estimate packet-level statistics
◦ Example: find the % of traffic from Netflix @ Purdue
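A minimal sketch of this idea (my example; the trace and flow labels are made up): sample each packet independently with probability p, then estimate a source's share of the traffic directly from the sample.

```python
import random

def bernoulli_sample(packets, p=1/200):
    """Keep each packet independently with probability p
    (one pass, constant work per packet)."""
    return [pkt for pkt in packets if random.random() < p]

# Usage on a hypothetical packet trace: the fraction of sampled
# packets from a source is an unbiased, if noisy, estimate of that
# source's true share of the traffic.
trace = ["netflix"] * 6_000 + ["other"] * 4_000
random.shuffle(trace)
sample = bernoulli_sample(trace, p=1 / 200)
share = sum(pkt == "netflix" for pkt in sample) / max(len(sample), 1)
print(f"estimated Netflix share: {share:.0%} (true: 60%)")
```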
A Fair Measure: Flow-level Statistics
Find the % of connections from Netflix @ Purdue
◦ Estimate flow-level statistics
◦ Estimate the flow size distribution
Flow-level Statistics from Sampled Packets?

Reverse problem (inference problem)

[Schematic: a stream of packets from interleaved flows, each packet
labeled by its flow, is thinned by packet sampling; the task is to
infer flow-level statistics from the sampled packets.]
Finding estimates – schematic view
[Schematic: the stream passes through a Sampling stage; an Estimator is
then applied to the sampled data to produce the estimate.]
Flow size distribution: maximum likelihood estimation

◦ Sampling rate = 1/200
◦ 128,000 sampled flows
◦ EM algorithm with 2 initializations

[Plot: cumulative % of flows (30%-100%) vs. flow size (1-19), comparing
Estimate 1, Estimate 2, and the original distribution; the two
estimates diverge noticeably from each other and from the original.]

Estimates are highly sensitive to initialization
MLE: more samples
◦ Packet sampling rate = 1/200, 1 trillion sampled flows

[Plot: cumulative % of flows vs. flow size; with this many samples the
estimate closely tracks the original distribution.]
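The estimation step can be sketched under a standard binomial-thinning model (my reconstruction, not the authors' code; numpy/scipy and all parameters here are assumptions): a flow of true size j yields a Binomial(j, p) number of sampled packets and is observed only if that number is nonzero, and EM then recovers the flow size distribution from the observed sampled sizes.

```python
import numpy as np
from scipy.stats import binom

def em_flow_sizes(sampled_counts, p, W, iters=200):
    """EM estimate of the flow size distribution theta[1..W] from
    sampled flow sizes, assuming each packet survives independently
    with probability p (binomial thinning) and a flow is observed
    only if at least one of its packets is sampled."""
    i = np.arange(1, W + 1)                  # observed sampled sizes
    j = np.arange(1, W + 1)                  # candidate true sizes
    # Q[i-1, j-1] = P(sampled size i | true size j, flow observed)
    Q = binom.pmf(i[:, None], j[None, :], p)
    detect = 1.0 - (1.0 - p) ** j            # P(flow observed | size j)
    Q = Q / detect[None, :]
    c = np.bincount(sampled_counts, minlength=W + 1)[1:].astype(float)

    phi = np.full(W, 1.0 / W)                # size dist. of observed flows
    for _ in range(iters):
        post = Q * phi[None, :]              # E-step: P(j | i)
        post /= post.sum(axis=1, keepdims=True)
        phi = c @ post                       # M-step: expected counts
        phi /= phi.sum()

    theta = phi / detect                     # undo the detection bias
    return theta / theta.sum()

# Usage on synthetic data: heavy-tailed true sizes, p = 1/200.
rng = np.random.default_rng(0)
true_sizes = rng.zipf(1.8, size=200_000)
true_sizes = true_sizes[true_sizes <= 2000]
sampled = rng.binomial(true_sizes, 1 / 200)
theta = em_flow_sizes(sampled[sampled > 0], p=1 / 200, W=2000)
```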
Problem: Uniform sampling
[Image: Earth (Wikipedia). 71% of the surface is water, so uniformly
sampled locations are mostly water: uniform sampling is dominated by
the common class and can easily miss the rare one.]
Importance Sampling

Dedicates precious memory only to "important" observations

Sample flows, rather than packets
◦ Problem? Will likely miss large flows

Sample flows ∝ flow size
◦ Problem? In a streaming setting, we don't yet know the size

Example of compromise: Sample and Hold (sketched below)
◦ Sample packets; once a packet is sampled, keep all remaining packets
of the same flow
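A minimal sketch of Sample and Hold (my example; the sampling probability and trace are made up): each packet of an untracked flow is sampled with some probability q, and once a flow is sampled every later packet of that flow is counted.

```python
import random

def sample_and_hold(packets, q=0.01):
    """Sample-and-hold: a packet from an untracked flow enters the
    table with probability q; packets of tracked flows are always
    counted. Large flows are caught almost surely; a tracked flow's
    count misses only the packets seen before it was sampled."""
    counts = {}                    # flow id -> packets counted
    for flow in packets:           # each packet tagged with its flow id
        if flow in counts:
            counts[flow] += 1      # hold: flow already tracked
        elif random.random() < q:
            counts[flow] = 1       # sample: start tracking this flow
    return counts

# Usage: one elephant flow among many mice.
trace = ["elephant"] * 5_000 + ["mouse%d" % i for i in range(5_000)]
random.shuffle(trace)
print(sample_and_hold(trace).get("elephant", 0))  # close to 5,000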
Different Sampling Designs
Seeing the traffic as a stream of elements, each colored by its flow:

Packet sampling: sample elements with probability p

Flow sampling: sample sets (flows) with probability q

Sample & Hold: randomly sample elements with probability q' from the
stream, but collect all future elements of the same color

Dual sampling: sample the first element with high probability; sample
following elements with low probability and use "sequence numbers" to
account for elements lost "in the middle"
Results: Different Sampling Designs

◦ FS = flow sampling, DS = dual sampling, SH = sample and hold,
PS = packet sampling

[Plot: estimation accuracy of the four sampling designs; see Tune &
Veitch, 2014.]
Sketches
Note that not every problem can be solved well with sampling
◦ Example: flow size estimation

"Sketch": a linear transformation of the input
◦ Model the stream as defining a vector; the sketch is the result of
multiplying the stream vector by an (implicit) matrix

[Schematic: the stream X1, X2, ..., Xn, viewed as a vector, is mapped
by a linear projection to the sketch.]
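One standard example of a linear sketch (my illustration; this particular structure is not named on these slides) is the Count-Min sketch: each update adds to one counter per row, so the map from the stream's frequency vector to the counters is linear, and sketches of two streams can simply be added.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters. Each update adds to one
    counter per row (a linear map of the frequency vector); a point
    query takes the minimum over rows, which overestimates the true
    count by only a small additive error with high probability."""

    def __init__(self, w=2048, d=4, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _cell(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._cell(r, item)] += count

    def query(self, item):
        return min(self.table[r][self._cell(r, item)] for r in range(self.d))

# Usage: approximate per-item counts in bounded memory.
cms = CountMinSketch()
for x in ["a"] * 100 + ["b"] * 5:
    cms.update(x)
print(cms.query("a"), cms.query("b"))  # ≈ 100, ≈ 5
```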

Definitions
◦ N = number of flows
◦ W = maximum flow size
◦ M = memory size

Space complexity
◦ Available memory: M = k N log W, with k < 1
Hash function

Motivation: estimate the flow size distribution

A hash function f uniformly at random associates a newly arrived flow
with a counter

[Schematic: elements of the blue, red, and green flows are hashed by f
into a counter array; two of the flows collide in the same counter.
The counter values feed the flow size distribution estimate.]

Uses precious memory for the counters
Data Streaming on Flow Size Estimation

[Schematic, sketch phase: at the router, a universal hash function maps
each packet to a counter in an array (e.g., 0, 0, 12, 120, 0, 0),
producing a compact summary; collisions can occur. Estimation phase: a
powerful back-end server disambiguates the collisions and produces the
flow size distribution estimate.]
Issues with Kumar et al.

Effectively only works if counter load < 2

In practice reduces required memory by 1/2

Very resource-intensive estimation procedure
Eviction Sketch

Ribeiro et al. 2008
Eviction Sketch: Probabilistic collision avoidance

Setup: hash values range up to a maximum of M, but only M/2 counters
are kept
◦ If hash(packet) < M/2, the flow is "red" and uses counter hash(packet)
◦ Otherwise the flow is "blue" and uses counter hash(packet) mod M/2
◦ A blue-red collision is detectable with 1 extra bit per counter; a
collision between two flows of the same color remains undetectable

Eviction collision policy (using the extra color bit per counter):
◦ A "red flow cannot increment a blue counter"
◦ A "blue flow overwrites a red counter"
◦ Counters with value 0 are red
◦ Every red counter thus also acts as a blue counter of value 0

Result (e.g., with 1 counter per flow):
◦ Virtually expands the hash table by ≈ 50% (virtually 2 counters per flow)
◦ Blue counters evict red counters
◦ Flow sampling effect: discards about 15% of flows at random

Folding (an interesting fact): as the number of eviction classes → ∞,
the policy evicts a random flow, i.e., it approaches flow sampling.
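A minimal code reading of the two-color scheme as I understand it from the slide (my reconstruction; the hash, table size, and details of the real design are assumptions):

```python
import random

class EvictionSketch:
    """Two-color counter array, as read from the slide: hash values in
    [0, M) share M/2 counters; 'blue' flows (upper half of the hash
    range) may evict 'red' flows (lower half), and one extra bit
    records each counter's current color."""

    def __init__(self, M=16, seed=0):
        assert M % 2 == 0
        self.M = M
        self.counts = [0] * (M // 2)
        self.blue = [False] * (M // 2)   # extra color bit per counter
        self.salt = random.Random(seed).getrandbits(64)

    def add(self, flow_id):
        h = hash((self.salt, flow_id)) % self.M
        if h < self.M // 2:              # red flow
            if not self.blue[h]:         # red cannot touch a blue counter
                self.counts[h] += 1
        else:                            # blue flow
            i = h % (self.M // 2)
            if not self.blue[i]:         # blue sees a red counter as
                self.counts[i] = 0       # empty: evict the red flow
                self.blue[i] = True
            self.counts[i] += 1
```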
Reduce counter size: Probabilistic counter increments

◦ The hashed counter is incremented probabilistically: the jump from
value k to k+1 occurs with probability p = 1/m_i for the i-th jump
◦ Counter value k → average flow sizes in [k, k + m_1 − 1]
◦ Counter value k+1 → average flow sizes in [k + m_1, k + m_1 + m_2 − 1]
◦ With m_a = 2^a, a 6-bit counter bins flows up to an average size of
about 10^14
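With m_a = 2^a this is essentially the classic Morris approximate counter; a minimal sketch follows (my example, using the pure doubling scheme rather than the slide's exact binning).

```python
import random

class MorrisCounter:
    """Approximate counter with probabilistic increments: the stored
    value c is incremented with probability 2^-c, and 2^c - 1
    estimates the number of increments. Under this pure doubling
    scheme a 6-bit counter (c <= 63) reaches counts around 2^63."""

    def __init__(self):
        self.c = 0

    def increment(self):
        if random.random() < 2.0 ** (-self.c):
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1

# Usage: count a million events in a handful of bits.
mc = MorrisCounter()
for _ in range(1_000_000):
    mc.increment()
print(mc.c, mc.estimate())   # c around 20, estimate around 10^6
```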
Experiment

Evaluated with simulations

Our worst result, with Internet core traces:
◦ 9.5 million flows
◦ 8 MB of memory
◦ k = 16
◦ W = 10^14

Same accuracy without counter folding requires 13 MB of memory

[Plot: estimated vs. original flow size distribution.]

Input: 10^6 flows with 250 KB of memory
© 2015 Bruno Ribeiro
Good Tutorial: Andrei Broder and Michael Mitzenmacher, Network Applications of
Bloom Filters: A Survey, Internet Mathematics Vol. 1, No. 4: 485-509, 2003
© 2015 Bruno Ribeiro
34
How Bloom Filters Work

A Bloom filter example (three hash functions f1, f2, f3):
◦ Insert X1 and X2: set the filter bits at positions f1(x), f2(x), f3(x)
◦ Check Y1 and Y2: test whether all three corresponding bit positions
are set

Ack. Bloom Filters (Simon S. Lam)
Why Bloom Filters Work

S = set of items, n = |S|
k = number of hash functions
m = number of bits in the filter

Assume kn < m

To check membership y ∊ S: check whether f_i(y), 1 ≤ i ≤ k, are all set to 1
◦ If not, y ∉ S
◦ Else, we conclude that y ∊ S, but sometimes y ∉ S (a false positive)

In many applications, false positives are OK as long as they happen
with small probability
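A minimal Bloom filter sketch (my example; salted hashing stands in for the k independent hash functions):

```python
import random

class BloomFilter:
    """Bloom filter with m bits and k salted hash functions. No false
    negatives; false positives occur at a rate that depends on m, n,
    and k (analyzed on the next slide)."""

    def __init__(self, m=8192, k=6, seed=0):
        rng = random.Random(seed)
        self.m, self.k = m, k
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.bits = [False] * m

    def _positions(self, item):
        return [hash((s, item)) % self.m for s in self.salts]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # All k bits set => "probably in S"; any bit 0 => definitely not.
        return all(self.bits[pos] for pos in self._positions(item))

# Usage: insert X1, X2; check Y1 as in the example above.
bf = BloomFilter()
bf.insert("X1"); bf.insert("X2")
print("X1" in bf, "Y1" in bf)   # True, almost surely False
```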
Assumption: hash functions look random

Given m bits for the filter and n elements, choose the number k of hash
functions to minimize false positives:
◦ Let p = P(a given bit is still 0) = (1 − 1/m)^(kn) ≈ e^(−kn/m)
◦ Then the false positive rate is f = (1 − p)^k ≈ (1 − e^(−kn/m))^k

As k increases, there are more chances to find at least one 0 bit, but
we also insert more 1's into the bit vector; the optimum is
k = (m/n) ln 2
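A quick numeric check of this trade-off (my example) reproduces the curve on the next slide:

```python
import math

def false_positive_rate(m, n, k):
    """Approximate Bloom filter false positive rate (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# m/n = 8: the rate is minimized near k = (m/n) ln 2.
for k in range(1, 11):
    print(k, round(false_positive_rate(m=8, n=1, k=k), 4))
print("optimal k:", 8 * math.log(2))   # 5.545...
```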
[Plot: false positive rate vs. number of hash functions, for m/n = 8;
the curve falls to its minimum near the optimal k = 8 ln 2 ≈ 5.545 and
rises again as k grows.]

Ack. Mitzenmacher