Finding the Frequent Items in Streams of Data


By Graham Cormode and Marios Hadjieleftheriou
Presented by
Ankur Agrawal (09305002)


Many data generation processes produce huge
numbers of pieces of data, each of which is
simple in isolation, but which taken together lead
to a complex whole.
Examples
◦ Sequence of queries posed to an Internet search engine.
◦ Collection of transactions across all branches of a
supermarket chain.

Data can arrive at enormous rates - hundreds of
gigabytes per day or higher.

Storing and indexing such large amounts of data
is costly.

Important to process the data as it happens.

Provide up-to-the-minute analysis and statistics.

Algorithms take only a single pass over their
input.

Compute various functions using resources that
are sublinear in the size of the input.
Problems range from simple to very complex.

Given a stream of transactions, finding the mean
and standard deviation of the bill totals.
◦ Requires only a few sufficient statistics to be stored.

Determining whether a search query has already
appeared in the stream of queries.
◦ Requires a large amount of information to be stored.

Algorithms must
◦ Respond quickly to new information.
◦ Use very few resources in comparison to the total
quantity of data.



Given a stream of items, the problem is
simply to find those items which occur most
frequently.
Formalized as finding all items whose
frequency exceeds a specified fraction of the
total number of items.
Variations arise when the items are given
weights, and further when these weights can
also be negative.

The problem is important both in itself and as a
subroutine in more advanced computations.

For example,

Algorithms for the problem have been applied by
large corporations such as AT&T and Google.
◦ It can help in routing decisions, in-network caching,
etc. (if items represent packets on the Internet).
◦ Can help in finding popular terms if items represent
queries made to an Internet search engine.
◦ Mining frequent itemsets inherently builds on this
problem as a basic building block.



Given a stream S of n items t1…tn, the exact
ɸ-frequent items comprise the set {i | fi>ɸn},
where fi is the frequency of item i.
Solving the exact frequent items problem
requires Ω(n) space.
Approximate version is defined based on
tolerance for error parameterized by ɛ.

Given a stream S of n items, the ɛ-approximate
frequent items problem is to return a set of items F
so that for all items i ∈ F, fi > (ɸ−ɛ)n, and there is
no i ∉ F such that fi > ɸn.

Given a stream S of n items, the frequency
estimation problem is to process a stream so
that, given any i, an fi* is returned satisfying
fi≤ fi* ≤ fi+ɛn.

Two main classes of algorithms:
◦ Counter-based Algorithms
◦ Sketch Algorithms

Other Solutions:
◦ Quantiles: based on various notions of randomly sampling
items from the input, and of summarizing the distribution
of items.
◦ Less effective and have attracted less interest.


Track a subset of items from the input and
monitor their counts.
Decide for each new arrival whether to store
or not.

Decide what count to associate with it.

Cannot handle negative weights.

The Majority problem was posed by J. S. Moore in the
Journal of Algorithms, in 1981.

Start with a counter set to zero. For each item:
◦ If counter is zero, store the item, set counter to 1.
◦ Else, if item is same as item stored, increment counter.
◦ Else, decrement counter.

If there is a majority item, it is the item stored.

Proof outline:
◦ Each decrement pairs up two different items and cancels
them out.
◦ Since majority occurs > N/2 times, not all of its
occurrences can be canceled out.
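The steps above can be sketched in Python (a minimal illustration; as noted, a second verification pass is required to confirm that a majority item actually exists):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one stored item, one counter."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # Counter is zero: store the new item.
            candidate, count = item, 1
        elif item == candidate:
            # Same as the stored item: increment.
            count += 1
        else:
            # Different item: decrement, cancelling a pair.
            count -= 1
    return candidate

# If a majority item exists, it is the candidate returned;
# a second pass is needed to verify it truly occurs > N/2 times.
```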


First proposed by Misra and Gries in 1982.
Finds all items in a sequence whose
frequency exceeds a 1/k fraction of the total
count.

Stores k-1 (item, counter) pairs.

A generalization of the Majority algorithm.

For each new item:
◦ Increment the counter if the item is already stored.
◦ If fewer than k−1 items are stored, then store the new
item with its counter set to 1.
◦ Otherwise decrement all the counters.
◦ If any counter equals 0, then delete the
corresponding item.
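The update rule above can be sketched in Python (a minimal illustration; the batch decrement here costs O(k), whereas the O(1) decrement mentioned later needs a more careful data structure):

```python
def misra_gries(stream, k):
    """Misra-Gries: k-1 counters; returns candidates for items
    occurring more than n/k times in the stream."""
    counters = {}
    for item in stream:
        if item in counters:
            # Item already stored: increment its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Room left: store the new item with counter 1.
            counters[item] = 1
        else:
            # Structure full: decrement all counters,
            # deleting any item whose counter reaches zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```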
[Slides 16-17: worked example of the Misra-Gries algorithm on a stream of items (3, 2, 6, 5, 1, ...), showing counters incremented (+1) when an arriving item is stored or matched, and all counters decremented (-1) when the structure is full.]

Time cost involves:
◦ O(1) dictionary operations per update.
◦ Cost of decrementing counts.
 Can be performed in O(1) time.

Also solves the frequency estimation problem
if executed with k=1/ɛ.

Proposed by Manku and Motwani in 2002.

Stores tuples comprising
◦ An item
◦ A lower bound on its count
◦ A delta (△) value which records the difference
between the upper bound and the lower bound.

For processing the i-th item:
◦ Increment the lower bound if it is already stored.
◦ Else create a new tuple for the item with lower bound set
to 1 and △ set to ⌊i/k⌋.



Periodically delete tuples with upper bound less
than ⌊i/k⌋.
Uses O(k log(n/k)) space in worst case.
Also solves frequency estimation problem with
k=1/ɛ.
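The steps above can be sketched in Python (a minimal illustration; pruning at every k-th item is one possible choice of "periodically"):

```python
def lossy_counting(stream, k):
    """Lossy Counting with frequency threshold 1/k.

    Each stored tuple is (lower_bound, delta); the implied upper
    bound on an item's count is lower_bound + delta."""
    entries = {}
    for i, item in enumerate(stream, start=1):
        bucket = i // k  # current value of floor(i/k)
        if item in entries:
            # Item already stored: increment its lower bound.
            low, delta = entries[item]
            entries[item] = (low + 1, delta)
        else:
            # New tuple: lower bound 1, delta = floor(i/k).
            entries[item] = (1, bucket)
        if i % k == 0:
            # Prune tuples whose upper bound has fallen behind.
            for key, (low, delta) in list(entries.items()):
                if low + delta <= bucket:
                    del entries[key]
    return entries
```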
* The CACM 2009 version of the paper erroneously shows the if block (line 8)
outside the for loop.

Introduced by Metwally et al. in 2005.

Store k (item, count) pairs.

Initialize with the first k distinct items and their exact
counts.


If a new item is not already stored, replace the item
with the least count and set the new item's counter to one
more than that least count.
Items which are stored by the algorithm early in the
stream and are not removed have very accurate
estimated counts.
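The replacement rule above can be sketched in Python (a minimal illustration; the linear-time `min` scan would be replaced by the heap mentioned later to get O(log k) updates):

```python
def space_saving(stream, k):
    """SpaceSaving: maintain k (item, count) pairs.
    Stored counts are upper bounds on true frequencies."""
    counts = {}
    for item in stream:
        if item in counts:
            # Item already stored: increment its count.
            counts[item] += 1
        elif len(counts) < k:
            # Still filling the first k distinct items.
            counts[item] = 1
        else:
            # Evict the item with the least count; the newcomer
            # inherits that count plus one.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```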

The space required is O(k).

The time cost involves cost of:
◦ the dictionary operation of finding if an item is stored.
◦ finding and maintaining the item with minimum count.


Simple heap implementations can track the
smallest count item in O(log k) time per update.
As in other counter-based algorithms, it also
solves the frequency estimation problem with k=1/ɛ.

The term “sketch” denotes

Sketch is the product of frequency vector with a matrix.

Updates
with
accommodated.
◦ a data structure.
◦ a linear projection of the input frequency vector.
negative
values
can
easily
be
◦ Allows us to model the removal of items.

Primarily solve the frequency estimation problem.

Algorithms are randomized.
◦ Also take a failure probability δ so that the probability of failure is
at most δ.

Introduced by Charikar et al. in 2002.

Each update only affects a small subset of the sketch.

The sketch consists of a two-dimensional array C
with d rows of w counters each.

Uses two hash functions for each row:
◦ hj, which maps input items onto [w].
◦ gj, which maps input items onto {−1, +1}.

For each input item i, add gj(i) to entry C[ j, hj(i)] in row j, for 1 ≤ j ≤ d.
• For any row j, the value gj(i)·C[j, hj(i)] is an unbiased
estimator for fi.
• The estimate fi* is the median of these estimates over
the d rows.
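The data structure just described can be sketched in Python (an illustration only: the salted-md5 hash functions below are stand-ins for the four-wise independent families the analysis assumes):

```python
import hashlib
import statistics

class CountSketch:
    """CountSketch: d rows of w counters with hash pairs (h_j, g_j)."""

    def __init__(self, d, w):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]

    def _hash(self, item, j, salt):
        # Deterministic stand-in for a random hash function.
        digest = hashlib.md5(f"{salt}:{j}:{item}".encode()).hexdigest()
        return int(digest, 16)

    def update(self, item, weight=1):
        for j in range(self.d):
            h = self._hash(item, j, "h") % self.w          # bucket in [w]
            g = 1 if self._hash(item, j, "g") % 2 == 0 else -1  # sign
            self.C[j][h] += g * weight

    def estimate(self, item):
        # Median over rows of the per-row unbiased estimates.
        estimates = []
        for j in range(self.d):
            h = self._hash(item, j, "h") % self.w
            g = 1 if self._hash(item, j, "g") % 2 == 0 else -1
            estimates.append(g * self.C[j][h])
        return statistics.median(estimates)
```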

Setting d=log(4/δ) and w=O(1/ɛ²) ensures that fi* has
error at most ɛn with probability at least 1−δ.
◦ Requires the hash functions to be chosen randomly from a
family of "four-wise independent" hash functions.

The total space used is O((1/ɛ²) log(1/δ)).

The time per update is O(log(1/δ)) in the worst case.

Introduced by Cormode and Muthukrishnan in
2005.

Similarly uses a sketch consisting of 2-D array of
d rows of w counters each.

Uses d hash functions hj, one for each row.

Each update is mapped onto d entries in the
array, each of which is incremented.

Now frequency estimate for any item is
fi* = min 1≤j≤d C[ j, hj(i)]
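The structure above can be sketched in Python (an illustration only: the salted-md5 hashes below are stand-ins for the pairwise independent hash functions the analysis assumes):

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one hash per row."""

    def __init__(self, d, w):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]

    def _hash(self, item, j):
        # Deterministic stand-in for a random hash function.
        digest = hashlib.md5(f"{j}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, item, weight=1):
        # Each update increments d entries, one per row.
        for j in range(self.d):
            self.C[j][self._hash(item, j)] += weight

    def estimate(self, item):
        # Minimum over rows: with non-negative updates this
        # never underestimates the true frequency.
        return min(self.C[j][self._hash(item, j)] for j in range(self.d))
```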

The estimate for an item in each row overestimates
by less than n/w.
◦ Can be shown using Markov inequality.

Setting d=log(1/δ) and w=O(1/ɛ) ensures that fi* has
error at most ɛn with probability at least 1−δ.

Similar to CountSketch but with gj(i) always equal to 1.

The total space used is O((1/ɛ) log(1/δ)).

The time per update is O(log(1/δ)) in the worst case.

Sketches estimate the frequency of a single item.
◦ How to find frequent items without explicitly storing sketch for all
items?

Divide-and-conquer approach limits search cost.
◦ Impose a binary tree over the domain
◦ Keep a sketch of each level of the tree
◦ Descend if a node is heavy, else stop

Correctness: all ancestors of a frequent item are also
frequent.

Assumes that negative updates are possible but no negative
frequencies are allowed.

Updates take O(log U · log(1/δ)) hashing operations (one per
row of the sketch at each of the log U levels of the tree, for a
domain of size U), and O(1) counter updates for each hash.
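The descend logic can be illustrated with exact per-level counts standing in for the per-level sketches (an illustrative sketch over an assumed 8-bit integer domain; a real implementation replaces each level's dictionary with, e.g., a Count-Min sketch):

```python
def hierarchical_heavy_hitters(stream, phi, bits=8):
    """Divide-and-conquer search for items with frequency > phi*n
    over the integer domain [0, 2^bits)."""
    n = len(stream)
    # One frequency map per prefix length (level of the binary tree).
    levels = [{} for _ in range(bits + 1)]
    for x in stream:
        for level in range(bits + 1):
            prefix = x >> (bits - level)
            levels[level][prefix] = levels[level].get(prefix, 0) + 1

    # Descend from the root, expanding only heavy prefixes:
    # every ancestor of a frequent item is itself frequent.
    frontier = [0]  # the root prefix at level 0
    for level in range(1, bits + 1):
        next_frontier = []
        for p in frontier:
            for child in (2 * p, 2 * p + 1):
                if levels[level].get(child, 0) > phi * n:
                    next_frontier.append(child)
        frontier = next_frontier
    return frontier  # heavy items at the leaf level
```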

Randomly divides the input into buckets.

Expect at most one frequent item in each bucket.


Within each bucket, the items are divided into
subgroups so that the “weight” of each group indicates
the identity of the frequent item.
Repeating this for all bit positions reveals the full
identity of the item.
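One way to make the bit-position idea concrete (an illustrative sketch: a real group-testing structure keeps these counters per bucket inside a randomized sketch, and the function names are hypothetical):

```python
def bucket_counts(items, bits=8):
    """Per-bucket counters: a total, plus one counter per bit
    position counting updates whose item has that bit set."""
    total, bit_counts = 0, [0] * bits
    for x in items:
        total += 1
        for b in range(bits):
            if (x >> b) & 1:
                bit_counts[b] += 1
    return total, bit_counts

def decode_bucket(total, bit_counts, bits=8):
    """If one item holds a majority of the bucket's weight, read its
    identity off bit by bit: bit b is 1 iff a majority of the
    bucket's updates have bit b set."""
    item = 0
    for b in range(bits):
        if bit_counts[b] > total / 2:
            item |= 1 << b
    return item
```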

Experiments performed using 10 million packets of HTTP
traffic, representing 24 hours of traffic from a backbone
router in a major network.

The frequency threshold ɸ varied from 0.0001 to 0.01 and
the error guarantee ɛ set to ɸ.

Compare the efficiency of the algorithms with respect to:
◦ Update throughput, measured in updates per
millisecond.
◦ Space consumed, measured in bytes.
◦ Recall, measured as the number of true heavy hitters
reported over the number of true heavy hitters given by an exact
algorithm.
◦ Precision, measured as the number of true heavy hitters reported
over the total number of answers reported.



Overall, the SpaceSaving algorithm appears
conclusively better than other counter-based
algorithms.
SSH yields very good estimates, typically
achieving 100% recall and precision, consumes
very little space, and is fairly fast to update.
SSL is the fastest algorithm with all the good
characteristics of SSH but consumes twice as
much space on average.

No clear winner among the sketch algorithms.

CMH has small size and high update throughput, but
is only accurate for highly skewed distributions.


CGT consumes a lot of space but it is the fastest
sketch and is very accurate in all cases, with high
precision and good frequency estimation accuracy.
CS has low space consumption and is very accurate in
most cases, but has slow update rate and exhibits
some random behavior.