Estimating Entropy and Entropy Norm on Data Streams

Amit Chakrabarti†, Khanh Do Ba†,*, S. Muthukrishnan‡

† Department of Computer Science, Dartmouth College
‡ Department of Computer Science, Rutgers University
* Work done during DIMACS REU 2005; supported by a Dean of Faculty Undergraduate Research Grant
ABSTRACT
We consider the problem of computing information-theoretic functions such as entropy on a data stream, using sublinear space.

Our first result deals with a measure we call the "entropy norm" of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic-space, one-pass algorithm for estimating this norm under certain conditions on the input stream, and we prove a lower bound that rules out such an algorithm if these conditions do not hold.

Our second group of results is for estimating the empirical entropy of an input stream. We first present a sublinear-space, one-pass algorithm for this problem. For a stream of m items and a given real parameter α, our algorithm uses space Õ(m^{2α}) and provides a 1/α-approximation in the worst case and a (1 + ε)-approximation in "most" cases. We then present a two-pass, polylogarithmic-space, (1 + ε)-approximation algorithm.
PROBLEM STATEMENTS

We want to estimate the following statistics on the input stream in o(n), possibly Õ(1), space and a single pass.

1) Entropy norm: F_H = \sum_{i=1}^{n} m_i \lg m_i

2) Empirical entropy: H = \sum_{i=1}^{n} \frac{m_i}{m} \lg \frac{m}{m_i}
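To make these definitions concrete, here is a minimal offline reference computation (our own illustration; `entropy_norm` and `empirical_entropy` are hypothetical names, and the poster's algorithms never store per-item counts like this):

```python
from collections import Counter
from math import log2

def entropy_norm(stream):
    # F_H = sum over items i of m_i * lg(m_i), where m_i = count of item i.
    return sum(mi * log2(mi) for mi in Counter(stream).values())

def empirical_entropy(stream):
    # H = sum over items i of (m_i / m) * lg(m / m_i), where m = stream length.
    m = len(stream)
    return sum((mi / m) * log2(m / mi) for mi in Counter(stream).values())

print(entropy_norm([1, 1, 2, 3]))       # 2*lg 2 + 0 + 0 = 2.0
print(empirical_entropy([1, 1, 2, 3]))  # (1/2)*lg 2 + 2*(1/4)*lg 4 = 1.5
```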
BACKGROUND AND MOTIVATIONS

Already the focus of much recent research, algorithms for computational problems on data streams grow increasingly essential in today's highly connected world. In this model, the input is a stream of "items" too long to be stored completely in memory, and a typical problem involves computing some statistic on this stream. The challenge is to design algorithms efficient not only in terms of running time but also in terms of space: sublinear is a must, and polylogarithmic is often the goal.
The quintessential need for such algorithms arises in analyzing IP network traffic at the packet level on high-speed routers. In monitoring IP traffic, one cares about anomalies, which are in general hard to define and detect because of subtle intrusions and sophisticated dependences amongst network events and agents. An early attempt to capture the overall behavior of a data stream, now a classical problem in the area, was the family of statistics called frequency moments. If a stream of length m contains m_i occurrences of item i (1 ≤ i ≤ n), then its k-th frequency moment, denoted F_k, is defined by F_k = \sum_{i=1}^{n} m_i^k. In their seminal paper, Alon et al. [1] showed that F_k can be estimated arbitrarily well in o(n) space for all integers k ≥ 0 and in Õ(1) space for k ≤ 2. Their results were later improved by Coppersmith and Kumar [2] and Indyk and Woodruff [4].
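As a quick sanity check of the definition (again our own illustration, not part of the poster):

```python
from collections import Counter

def frequency_moment(stream, k):
    # F_k = sum over distinct items i of m_i^k.
    return sum(mi ** k for mi in Counter(stream).values())

# For the stream <a, b, a, c>: F_2 = 2^2 + 1^2 + 1^2 = 6.
assert frequency_moment("abac", 2) == 6
```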
SUMMARY OF RESULTS

Result 1.1: We designed a one-pass algorithm that estimates F_H to a (1 + ε)-factor in Õ(1) space, assuming F_H is sufficiently large.

Result 1.2: If F_H is too small to satisfy the assumption above, then we proved that no one-pass, Õ(1)-space algorithm can approximate it to even a constant factor.

Result 2.1: We designed a one-pass algorithm that estimates H to a (1 + ε)-factor in Õ(1) space if H is sufficiently large, or in Õ(m^{2/3}) space in general.

Result 2.2: We designed a two-pass algorithm that estimates H to a (1 + ε)-factor in Õ(1) space for any H.
OUTLINE OF ALGORITHMS
A subroutine computes the basic estimator: a random variable X whose mean is the target quantity and whose variance is small. The algorithm uses this subroutine to maintain s_1 s_2 independent basic estimators X_{ij}, for 1 ≤ i ≤ s_1 and 1 ≤ j ≤ s_2. It outputs a final estimator Y, defined by

Y = \operatorname{median}_{1 \le j \le s_2} \left( \frac{1}{s_1} \sum_{i=1}^{s_1} X_{ij} \right).
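In code, this standard average-then-median combiner might look as follows (a sketch under our own naming; `groups` holds s_2 lists of s_1 basic estimates each):

```python
import statistics

def combine(groups):
    # Average within each group of s1 estimates to shrink the variance,
    # then take the median across the s2 groups to boost the success
    # probability; the result is the final estimator Y.
    return statistics.median(sum(g) / len(g) for g in groups)
```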
For entropy norm, our basic estimator subroutine is

Input stream: A = a_1, a_2, …, a_m, where each a_i ∈ {1, …, n}.
1. Choose p uniformly at random from {1, …, m}.
2. Let r = |{q : a_q = a_p, p ≤ q ≤ m}|.
3. Let X = m[r lg r − (r − 1) lg(r − 1)], where 0 lg 0 = 0.
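A direct transcription of this subroutine (our sketch; a genuine one-pass implementation would draw p up front and count r on the fly instead of storing the stream as a list):

```python
import random
from math import log2

def basic_estimator_entropy_norm(stream):
    # One sample of the basic estimator X for the entropy norm F_H.
    m = len(stream)
    p = random.randrange(m)          # step 1: uniform random position
    r = stream[p:].count(stream[p])  # step 2: occurrences of a_p at or after p
    f = lambda x: x * log2(x) if x > 0 else 0.0  # convention: 0 lg 0 = 0
    return m * (f(r) - f(r - 1))     # step 3
```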
For empirical entropy, the subroutine is identical except for line 3, where we now have

X = m \left[ \frac{r}{m} \lg \frac{m}{r} - \frac{r - 1}{m} \lg \frac{m}{r - 1} \right].
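The same transcription for the empirical entropy variant, reusing the sampling from the previous sketch and swapping only the final formula (again our own illustration):

```python
import random
from math import log2

def basic_estimator_entropy(stream):
    # One sample of the basic estimator X for the empirical entropy H.
    m = len(stream)
    p = random.randrange(m)
    r = stream[p:].count(stream[p])
    g = lambda x: (x / m) * log2(m / x) if x > 0 else 0.0  # (x/m) lg(m/x), 0 at x = 0
    return m * (g(r) - g(r - 1))

# Hypothetical end-to-end use with the combiner sketched earlier:
#   groups = [[basic_estimator_entropy(A) for _ in range(s1)] for _ in range(s2)]
#   Y = combine(groups)
```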
CONCLUSIONS
Our complete results have been published as a technical report (DIMACS-TR-2005-33). We believe our algorithms will be of practical interest in data stream systems, as recent work in the networking community appears to be converging on entropy as a reasonable approach to anomaly detection [3, 5].
As for future work, although we have proved our entropy norm algorithm to be optimal, it appears feasible to improve our last algorithm for estimating empirical entropy (Result 2.2) so that it completes in a single pass. It would also be of interest to study these problems on streams in the presence of deletions as well as insertions.
REFERENCES
[1] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the
frequency moments. Proc. ACM STOC, 20-29, 1996.
[2] D. Coppersmith and R. Kumar. An improved data stream algorithm for
frequency moments. Proc. ACM-SIAM SODA, 151-156, 2004.
[3] Y. Gu, A. McCallum, D. Towsley. Detecting anomalies in network traffic
using maximum entropy estimation. Proc. Internet Measurement
Conference, 2005.
[4] P. Indyk and D. Woodruff. Optimal approximations of the frequency
moments of data streams. Proc. ACM STOC, 202-208, 2005.
[5] K. Xu, Z. Zhang, S. Bhattacharya. Profiling internet backbone traffic:
behavior models and applications. Proc. ACM SIGCOMM, 2005.