CS206 --- Electronic Commerce
Download
Report
Transcript CS206 --- Electronic Commerce
More Stream-Mining
Counting How Many Elements
Computing “Moments”
1
Counting Distinct Elements
Problem: a data stream consists of
elements chosen from a set of size n.
Maintain a count of the number of
distinct elements seen so far.
Obvious approach: maintain the set of
elements seen.
2
Applications
How many different words are found
among the Web pages being crawled at
a site?
Unusually low or high numbers could
indicate artificial pages (spam?).
How many different Web pages does
each customer request in a week?
3
Using Small Storage
Real Problem: what if we do not have
space to store the complete set?
Estimate the count in an unbiased way.
Accept that the count may be in error,
but limit the probability that the error is
large.
4
Flajolet-Martin* Approach
Pick a hash function h that maps each of
the n elements to log2n bits, uniformly.
Important that the hash function be (almost)
a random permutation of the elements.
For each stream element a, let r (a ) be
the number of trailing 0’s in h (a ).
Record R = the maximum r (a ) seen.
Estimate = 2R.
* Really based on a variant due to AMS (Alon, Matias, and Szegedy)
5
Why It Works
The probability that a given h (a ) ends in
at least r 0’s is 2-r.
If there are m different elements, the
probability that R ≥ r is 1 – (1 - 2-r )m.
Prob. all h(a)’s
end in fewer than
r 0’s.
Prob. a given h(a)
ends in fewer than
r 0’s.
6
Why It Works --- (2)
-r
Since 2-r is small, (1-2-r)m ≈ e -m2 .
If 2r >> m, 1 – (1 - 2-r )m ≈ 1 – (1 - m2-r)
First 2 terms of the
≈ m /2r ≈ 0.
Taylor expansion of e
-r
r
r
m
-m2
If 2 << m, 1 – (1 - 2 ) ≈ 1 – e
≈
1.
Thus, 2R will almost always be around m.
x
7
Why It Doesn’t Work
E(2R ) is actually infinite.
Probability halves when R -> R +1, but
value doubles.
Workaround involves using many hash
functions and getting many samples.
How are samples combined?
Average? What if one very large value?
Median? All values are a power of 2.
8
Solution
Partition your samples into small
groups.
Take the average of groups.
Then take the median of the averages.
9
Moments (New Topic)
Suppose a stream has elements chosen
from a set of n values.
Let mi be the number of times value i
occurs.
The k th moment is the sum of (mi )k
over all i.
10
Special Cases
0th moment = number of different
elements in the stream.
The problem just considered.
1st moment = sum of the numbers of
elements = length of the stream.
Easy to compute.
2nd moment = surprise number = a
measure of how uneven the distribution is.
11
Example: Surprise Number
Stream of length 100; 11 values
appear.
Unsurprising: 10, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9. Surprise # = 910.
Surprising: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1,
1. Surprise # = 8,110.
12
AMS Method
Works for all moments; gives an
unbiased estimate.
We’ll just concentrate on 2nd moment.
Based on calculation of many random
variables X.
Each requires a count in main memory, so
number is limited.
13
One Random Variable
Assume stream has length n.
Pick a random time to start, so that any
time is equally likely.
Let the chosen time have element a in
the stream.
X = n * ((twice the number of a ’s in the
stream starting at the chosen time) – 1).
Note: just store the count.
14
Expected Value of X
2nd moment is Σa (ma )2.
E(X ) = (1/n )(Σall times t n * (twice the
number of times the stream element at
time t appears from that time on) – 1).
= Σa (1/n)(n )(1+3+5+…+2ma-1) .
= Σa (ma )2.
15
Combining Samples
Compute as many variables X as can fit in
available memory.
Average them in groups.
Take median of averages.
Proper balance of group sizes and number
of groups assures not only correct
expected value, but expected error goes
to 0 as number of samples gets large.
16
Problem: Streams Never End
We assumed there was a number n,
the number of positions in the stream.
But real streams go on forever, so n is
a variable --- the number of inputs seen
so far.
17
Fixups
1. The variables X have n as a factor --keep n separately; just hold the count
in X.
2. Suppose we can only store k counts.
We must throw some X ’s out as time
goes on.
Objective: each starting time t is
selected with probability k / n.
18
Solution to (2)
Choose the first k times for k
variables.
When the n th element arrives (n > k ),
choose it with probability k / n.
If you choose it, throw one of the
previously stored variables out, with
equal probability.
19