Data Aggregation
Download
Report
Transcript Data Aggregation
CPSC 689: Discrete Algorithms
for Mobile and Wireless Systems
Spring 2009
Prof. Jennifer Welch
Lecture 33
Topic:
Data Aggregation in Sensor Networks
Sources:
Nath, Gibbons, Seshan & Anderson
Shrivastava, Buragohain, Agrawal & Suri
Discrete Algs for Mobile Wireless Sys
2
Aggregation Problem
How to compute the answer to a query in a sensor
network that requires aggregating data from all (or
many) sensors?
Example: Suppose the nodes take temperature
readings and queries ask for min/max/average
temperature
Data has to flow through the network to the node that
issues the query
In some cases, data can be aggregated on the way
save bandwidth and energy
Example: to find max temp., each node
propagates largest temp. it has learned about
Discrete Algs for Mobile Wireless Sys
3
Communication to Support
Aggregation
Need to propagate sensor readings in some
orderly way
Example: send data over a spanning tree rooted
at the querying node
not robust: link or node failure will partition the tree,
lose contact with sensors in subtree
Prefer to use multipath routing (message is sent
on several paths)
redundancy provides more resilience
But duplication causes problems for aggregation
OK for max, but what about average?
Discrete Algs for Mobile Wireless Sys
4
Overview of Algorithm
Provides framework for synopses of the data to be sent
over multiple paths and then reconstructing correct answer
Phase 1: aggregate query is flooded through the network
and an aggregation topology is constructed
Phase 2: aggregate values are continually routed toward
the querying node:
each node converts its sensor data to a synopsis (SG
function)
nodes merge two synopses into one (SF function)
querying node converts synopsis back to final answer (SE
function)
Discrete Algs for Mobile Wireless Sys
5
Specific Aggregation Topology
Rings:
kind of like levels in breadth-first search
Nodes are partitioned into rings during Phase 1:
querying node q is in ring 0
a node is in ring i if it receives the query first from a node in ring i–1
Phase 2 is divided into epochs, one aggregate answer per
epoch
each node in outer ring (farthest distance from q) computes s :=
SG(r), where r is its sensor reading, and broadcasts s
each node in ring i computes s := SG(r), where r is its sensor
reading, and updates s := SF(s,s'), where s' is each synopsis
received from a neighbor, then broadcasts s
querying node computes SE(s)
Synchronous algorithm
Discrete Algs for Mobile Wireless Sys
6
Figure
q
R0
R1
R2
A
C
B
Discrete Algs for Mobile Wireless Sys
7
Analysis of Framework
Complexity: each node broadcasts once
per epoch
Same as spanning-tree-based approach
More resilient than spanning-tree-based
approach
Discrete Algs for Mobile Wireless Sys
8
The Functions
What should SG, SF, and SE be in order to
give the "correct" answer?
First, give a condition on the functions that
is intuitive
Then show there are 4 simple checks that
can be done on proposed functions
These conditions are necessary and
sufficient to preserve correctness
Discrete Algs for Mobile Wireless Sys
9
ODI-Correctness
Final result should be independent of how the
data was routed to querier:
same no matter in which order the readings are
combined and how many times they are included
(duplicated) during the routing
Sensor reading r : <measurement, metadata>
assumed to be unique
Suppose we have SG, SF and SE
Define synopsis label SL(s) = {r} if s = SG(r ) and
SL(s) = SL(s1) Ums SL(s2) if s = SF(s1,s2)
Discrete Algs for Mobile Wireless Sys
10
ODI-Correctness (cont'd)
What constitutes a "duplicate" depends on
what is being computed
Ex: average temp vs. number of distinct temps
q : multiset of sensor readings set of
(unique) values
q(SL(s)) = set of unique values in all the
sensor readings that formed the synopsis
Discrete Algs for Mobile Wireless Sys
11
ODI-Correctness Definition
Let {v1,…,vk} be set of values in the label of s, i.e.,
q(SL(s)).
Then s must be same as computation on
"canonical left-deep tree":
s := SG(v1)
for i = 2 to k do
s := SF(s,SG(vi))
I.e., regardless of redundancy caused by
multipath routing, the final synopsis is the same
as if each distinct value is included just once
Discrete Algs for Mobile Wireless Sys
12
ODI-Correctness Figure
s
s
SF
SF
SF
SF
SF
SF
SF
SF
SF
SF
SF
SG
SG
SG
SG
SG
SG
r1
r2
r3
r4
r5
r1
r2
Aggregation DAG
SG
SG
SG
SG
r5
r4
r3
Canonical left-deep tree
Discrete Algs for Mobile Wireless Sys
13
A Simple Test for ODI-Correctness
duplicate preservation: q({r1}) = q({r2})
SG(r1) = SG(r2)
if two readings are considered duplicates, then
the same synopsis is generated
commutativity: SF(s1,s2) = SF(s2,s1)
associativity: SF(s1,SF(s2,s3)) =
SF(SF(s1,s2),s3)
idempotence: SF(s,s) = s
Discrete Algs for Mobile Wireless Sys
14
More About the Conditions
Theorem: The previous 4 conditions are
necessary and sufficient for the SG and SF
functions to ensure ODI-correctness.
Proof Sketch:
sufficiency: If SG and SF satisfy the 4 conditions, then
show that any computation DAG can be transformed
into a canonical left-deep binary tree that produces the
same output
necessity: Argue that the 4 conditions follow from the
definition of ODI-correctness.
Discrete Algs for Mobile Wireless Sys
15
Count Example
Query: What is the (approximate) total
number of sensor nodes in the network?
Synopsis: a bit vector of length k > log N,
where N is an upper bound on the number
of nodes
N could be original number of nodes deployed,
or some function of the size of the id space
Discrete Algs for Mobile Wireless Sys
16
SG for Count Example
No sensor is actually read for this example.
Let SG return vector s[1..k], where
a certain entry is 1
rest of the entries are 0
How to decide which entry should be 1:
entry CT(k), where CT(k) is a random variable that
returns value i with probability 1/2i, 1 ≤ i < k.
How to compute CT(k):
Toss a fair coin until either the first head occurs or k
coin tosses have occurred with no heads; return
number of tosses
Discrete Algs for Mobile Wireless Sys
17
Computation of CT(k)
Why does the coin-tossing protocol give the
desired random variable?
Proof by Example: Suppose k = 4.
First toss is H, and 1 is returned, with probability 1/2
Otherwise, second toss is H, and 2 is returned, with
probability 1/4
Otherwise third toss is H and 3 is returned, with
probability 1/8
(and then 4 is returned with probability 1/8, but the
definition of CT(4) only cares about 1 through 3)
Discrete Algs for Mobile Wireless Sys
18
SF and SE for Count Example
SF(s,s'):
s[i] := s[i] OR s'[i], 1 ≤ i ≤ k
return s
SE(s):
return 2i-1/.77351, where i is the minimum
index such that s[i] = 0
Discrete Algs for Mobile Wireless Sys
19
Intuition for Count Synopsis
Functions
Suppose all (live) sensors have a failure-free path
to the querier.
The final bit vector to which SE is applied
indicates which bit positions have been set by at
least one node
The probability of n nodes failing to set the i-th bit
is (1–2i)n by definition of SG
Thus the number of (live) nodes is proportional to
2i–1
constant of proportionality is 1/.77351
Discrete Algs for Mobile Wireless Sys
20
Intuition for Count Synopsis
Functions
Alternatively…
We expect half the nodes to set the 1st bit,
a quarter of the nodes to set the 2nd bit, an
eighth of the nodes to set the 3rd bit, etc.
If there are n distinct nodes, then we might
expect log n bits to be set
I.e., if log n = i bits are set, then we might
expect there to be about n = 2i nodes
Discrete Algs for Mobile Wireless Sys
21
Count Algorithm is ODI-Correct
Note that ODI-correctness says nothing
about the SE function, only that SE will
return the same result as in the canonical
tree.
"Clever algorithms are still required to get
provably good approximations, although the
task has been simplified…"
Commutativity, associativity, and
idempotence follow from properties of
Boolean OR
Discrete Algs for Mobile Wireless Sys
22
Count Algorithm is ODI-Correct
Why does SG preserve duplicates?
Assume each node calls SG only once.
Show that if sensor readings are
considered duplicates, then the synopsis
generated by SG is the same.
Since there is no actual sensor reading for this
algorithm, we just use ids for the readings.
Assumption that each node calls SG only once
ensures the property.
Discrete Algs for Mobile Wireless Sys
23
Implicit Acknowledgments
When a node broadcasts a synopsis, avoid
overhead of explicit acknowledgments from
receivers this way:
node u broadcasts its synopsis
node u snoops (listens to) subsequent
broadcasts by its parent nodes (nodes closer
to the querying node)
if the synopsis broadcast by a parent
"effectively includes" u's synopsis, u does not
need to rebroadcast, otherwise rebroadcast (or
adapt the topology)
Discrete Algs for Mobile Wireless Sys
24
Implicit Acknowledgments (cont'd)
How can u accurately infer if its broadcasts was
"effectively included"?
Suppose u's synopsis was x and the parent's was
z.
If SF(x,z) = z, then x is effectively included.
Why? Since SF is commutative, associative, and
idempotent, it is a "semi-lattice".
in a semi-lattice, every 2 elements x and y have a least
upper bound z, and SF(x,z) = z = SF(y,z)
Count example: check if appropriate bits are set
Discrete Algs for Mobile Wireless Sys
25
Error Bounds of Approximate
Answers
Sources of error:
communication error: some nodes have no
failure-free propagation path to querier
approximation error: introduced by SG, SF
and SE functions.
defined as relative error of computed answer w.r.t.
exact algorithm using the same readings
Argue that communication error can be
made negligible by deploying sensor nodes
sufficiently densely
Discrete Algs for Mobile Wireless Sys
26
Error Bounds of Approximate
Answers (cont'd)
Approximation error analysis for the
centralized data stream model work in this
model, since synposis is ODI-correct
canonical left-deep tree corresponds to
processing a data stream of sensor readings in
a centralized location
Thus, e.g., Count algorithm has same
approximation error guarantees as
computed by Flajolet & Martin
Discrete Algs for Mobile Wireless Sys
27
More Examples
Max and Min: easy.
SG is the value, SF takes larger/smaller, SE is
identity
Sum: cf. paper by Considine et al. which
adapts Count algorithm
Average, Standard deviation, Second
Moment: cf. paper by Considine et al.
which uses Sum
Count Distinct: modification of Count
Discrete Algs for Mobile Wireless Sys
28
Uniform Sample Example
Compute a uniform sample of a given size K of the values
occurring at all nodes in the network
Synopsis: a sample of size K tuples (or fewer initially)
SG: output (val,r,id) where
val is the sensor reading of the node
r is a random number drawn uniformly from [0,1]
id is the node's id
SF(s,s'): list the tuples in s U s' in decreasing order of rvalue, and output the first K (or all, if less than K total)
U is set union, removes duplicates
SE(s): output the set of values in the tuples of s
Discrete Algs for Mobile Wireless Sys
29
Uniform Sample Example (cont'd)
SG labels each reading with a random
number, thus placing it in a random position
in the global ordering of all readings
So taking first K in the ordering gives a
uniform sample.
Uniform sample can then be used…
Discrete Algs for Mobile Wireless Sys
30
More Examples
Use uniform samples to compute these
aggregates:
k-th statistical moment (k = 1 is the mean)
k-th percentile value (k = 50 is the median)
with certain error and probability, by choosing the
sample size appropriately (cf. Bar-Yossef et al.)
Compute the k most frequent values (k = 1 is the
mode): run an ODI-correct Count algorithm for
each value
Discrete Algs for Mobile Wireless Sys
31
Adapting the Topology
If message loss is detected as occurring "too
frequently", nodes can adapt the Ring topology
Idea: use a heuristic that tries to assign a node u
to a ring so that there are plenty of ndoes in the
next ring to forward u's synopsis to the querier
ODI-correct synopses are helpful:
implicit acks are used to detect message loss energyefficiently
duplicates that occur during the adaptation of the
topology are not a problem
Discrete Algs for Mobile Wireless Sys
32
Simulation Results
Extensive!
Synopsis diffusion
reduces answer errors in lossy environments
helps address challenges from correlated node failures
does not use significantly more power
What topology to use?
Adaptive Rings has same overhead as Rings but much
better accuracy
Adaptive Rings gets about 90% of the sensor readings
most of the time vs. 100% with Flooding, but uses
much less power
Discrete Algs for Mobile Wireless Sys
33
Medians and Beyond [SBAS]
Extend beyond min/max/sum the class of queries
that can be answered in sensor networks to
include
approximate quantiles (including median)
most frequent data values (including consensus)
histogram of data distribution
range queries
Provide strict theoretical guarantees on the
approximation quality of the answers in terms of
message size
Discrete Algs for Mobile Wireless Sys
34
Comparison with Nath Paper
Some of the same problems are
considered
"Medians and Beyond" is concerned with
efficiency of message size and its tradeoff
with quality of approximation
Nath paper was concerned with handling
arbitrary ordering and duplicates
"Medians and Beyond" assumes no duplicates
Discrete Algs for Mobile Wireless Sys
35
Overview
Assume we have a tree rooted at the querying
node
To compute Average: each node sends to its
parent the sum of thedata values of its
descendants and its number of descendants
constant size messages
To compute Median, need to keep track of all
distinct values
size of messages, and memory, grows linearly
Trade off memory and bandwidth with accuracy of
approximations
Discrete Algs for Mobile Wireless Sys
36
Q-Digests
Assume sensor readings are integers in the range
[1,s]
Introduce q-digest data structure to answer
quantile queries with
messages of size m
error O((log s)/m)
Users specify message size vs. error tradeoff
q-digest measures maximum error accumulated
so far
Once q -digest query is done, use it to compute
quantiles, data distribution,…
Discrete Algs for Mobile Wireless Sys
37
More on q-Digest
Compute a compressed view of the complete
distribution of values (instead of just a function of
the values)
Use this view of the distribution to compute
approximations of various functions
Basic idea: Essentially compute a histogram, but
equally large, instead of equally spaced, buckets
buckets can overlap
size of buckets gives accuracy vs. communication
tradeoff
Discrete Algs for Mobile Wireless Sys
38
Definition of q-Digest
Group values into variable-sized buckets of
almost equal weights
size refers to range
weight refers to number of elements
q-digest consists of a set of buckets
Build a complete binary tree
1,…,s at the leaves
every tree node is a bucket, its range is all the leaves in
its subtree
At any given point, only some of the buckets are
being used
Discrete Algs for Mobile Wireless Sys
39
Example
1
data range 1-8
15 data items
5 buckets
2
4
1
2
3
2
6
4
5
6
Discrete Algs for Mobile Wireless Sys
7
8
40
Definition of q-Digest
Given compression parameter k and number of
data items n, a (tree) node v is in the q-digest iff:
count(v) ≤ n/k
node should not have a high count
count(v) + count(parent(v)) + count(sibling(v)) > n/k
if a node and its children have low total count then
combine using Compress algorithm
For a leaf node, if count > n/k, then it is in the qdigest
Root only needs to satisfy first condition
Discrete Algs for Mobile Wireless Sys
41
Example
1
check that this has k = 5;
n/k = 3
2
4
1
2
3
2
6
4
5
6
Discrete Algs for Mobile Wireless Sys
7
8
42
Centralized Construction of q-Digest
Go through all the tree nodes bottom up
Check which ones satisfy the 2 properties.
If a node v has a child that violates 2nd
property then merge v with both its children
Detailed info about values which occur
frequently is preserved, while less
frequently occurring values are lumped into
larger buckets resulting in info loss
Discrete Algs for Mobile Wireless Sys
43
1
1
4
1
2
3
6 1
1
4
6
5
1
1
8
7
4
1
2
2
2
3
6
4
5
6
7
8
1
1
2
4
1
2
2
2
4
6
3
4
5
6
7
8
1
2
3
2
6
4
5
6
7
8
Distributed Construction of q-Digest
Represent a q-digest by numbering the nodes of
the digest tree and sending a set of (node id,
count) pairs
q-digests move up the spanning tree, being
merged as they go.
To merge 2 q-digests:
take their union
add the counts of buckets with the same range
compress the result
Merging can cause information loss.
Discrete Algs for Mobile Wireless Sys
45
Analysis of Q-Digest
Lemma 1: A q-digest with parameter k has size (number of
buckets) at most 3k.
because the count of a node and its children can't be too
small
Lemma 2: In a q-digest with parameter k, the maximum
error in the count of any node is n(log s)/k.
because in the worst case the count of a node can deviate
from the actual value by the sum of the counts of its
ancestors
Lemma 3: Merging multiple q-digests gives the same error
as in Lemma 2.
Discrete Algs for Mobile Wireless Sys
46
Quantile Queries
Problem Statement: Given a fraction q
between 0 and 1, find the value whose rank
in sorted sequence of the n values is qn.
Median is when q = 1/2
Relative error is defined to be |r – qn|/n,
where r is the true rank of the returned
value
Discrete Algs for Mobile Wireless Sys
47
Using Q-Digest to Answer a
Quantile Query
Goal: find q-th quantile
Sort the nodes of the q-digest in increasing
order of max values (right endpoints);
break ties by putting smaller ranges first
this gives post-order traversal of the tree
Scan sorted list and add up the counts
Let v be the first node at which the running
sum exceeds qn
Return the max value of node v
Discrete Algs for Mobile Wireless Sys
48
Error Analysis
Answer returned is v.max
There are at least qn values less than or
equal to v.max, by choice of v
Error comes from values that are less than
v.max but are stored in ancestors of v
(these buckets are listed after v)
But this error is at most n(log s)/k
Note that estimate is always at least as
great as the eact answer
Discrete Algs for Mobile Wireless Sys
49
Example
a
1
b
c
d
e
i
h
1
j
2
a through o are the
ids of the digest tree
nodes:
j = [3:3]
k = [4:4]
f = [5:6]
g = [7:8]
a = [1:8]
2
6
k
l
3
4
g
f
4
m
5
2
o
n
6
7
8
Find Median (q = 1/2); recall n = 15 so look for 7.5
Sorted list is (j,4), (k,6), (f,2), (g,2), (a,1)
Running sums of counts are 4, 10 - done!
Return max value in tree node k, which is 4
Error is at most sum of counts on path from k to root, which is 1
Discrete Algs for Mobile Wireless Sys
50
Trading Off Error and Message Size
Memory and message size are controlled
by the compression factor k:
If k is small, then fewer buckets but wider
range of values are lumped together
If k is large, then more buckets but more finegrained distribution of values to buckets
If the maximum number of buckets you can
afford is m, then set k = m/3 (by Lemma 1)
and get error at most = 3(log s)/m (by
Lemmas 2 and 3)
Discrete Algs for Mobile Wireless Sys
51
Other Queries
Inverse Quantile: given a value x, determine its
rank in the sorted sequence of input values
Algorithm:
construct same sorted list
traverse list from beginning to end
return as the answer the sum of the counts of buckets v
for which x > v.max.
Reported rank is between
rank(x) and rank(x) + n
Discrete Algs for Mobile Wireless Sys
52
Other Queries
Range Query: find the number of values in
the range [low,high].
Algorithm:
perform inverse quantile queries to get the
ranks of low and high
return the difference in their ranks
Maximum error is 2n
Discrete Algs for Mobile Wireless Sys
53
Other Queries
Consensus Query: Given a fraction f between 0
and 1, find all values that are reported by more
than fn sensors
Algorithm:
Find all unit-width (leaf) buckets with count > (f–)n and
return their values
Since a leaf bucket's count has error at most n,
this finds all values with frequency more than fn
There may be some false positives: some values
with count between (f–)n and fn may also be
reported
Discrete Algs for Mobile Wireless Sys
54
Confidence Factor
Worst-case error is 3 (log s)/m, but it is unlikely that an
execution will be this bad
choosing message size m according to this constraint will be
overkill and waste bandwidth
Instead set m to a value for which it is expected that the
error bound will be met
Need to calculate the actual error in each q-digest: called
confidence factor
Define weight of a path: sum of counts of the nodes in the
path
Define confidence factor: maximum weight of any root-toleaf path, divided by n
Discrete Algs for Mobile Wireless Sys
55
Simulation Results
Compared against simple scheme of
keeping track of every distinct value
together with its count
q-digest scheme works well
Discrete Algs for Mobile Wireless Sys
56
Open Questions
Continuous queries?
Lost messages?
Duplicate invariance?
Include spatial information?
Optimality of results?
Discrete Algs for Mobile Wireless Sys
57