Data Aggregation

Download Report

Transcript Data Aggregation

CPSC 689: Discrete Algorithms
for Mobile and Wireless Systems
Spring 2009
Prof. Jennifer Welch
Lecture 33
 Topic:
 Data Aggregation in Sensor Networks
 Sources:
 Nath, Gibbons, Seshan & Anderson
 Shrivastava, Buragohain, Agrawal & Suri
Discrete Algs for Mobile Wireless Sys
2
Aggregation Problem
 How to compute the answer to a query in a sensor
network that requires aggregating data from all (or
many) sensors?
 Example: Suppose the nodes take temperature
readings and queries ask for min/max/average
temperature
 Data has to flow through the network to the node that
issues the query
 In some cases, data can be aggregated on the way
 save bandwidth and energy
 Example: to find max temp., each node
propagates largest temp. it has learned about
Discrete Algs for Mobile Wireless Sys
3
Communication to Support
Aggregation
 Need to propagate sensor readings in some
orderly way
 Example: send data over a spanning tree rooted
at the querying node
 not robust: link or node failure will partition the tree,
lose contact with sensors in subtree
 Prefer to use multipath routing (message is sent
on several paths)
 redundancy provides more resilience
 But duplication causes problems for aggregation
 OK for max, but what about average?
Discrete Algs for Mobile Wireless Sys
4
Overview of Algorithm
 Provides framework for synopses of the data to be sent
over multiple paths and then reconstructing correct answer
 Phase 1: aggregate query is flooded through the network
and an aggregation topology is constructed
 Phase 2: aggregate values are continually routed toward
the querying node:
 each node converts its sensor data to a synopsis (SG
function)
 nodes merge two synopses into one (SF function)
 querying node converts synopsis back to final answer (SE
function)
Discrete Algs for Mobile Wireless Sys
5
Specific Aggregation Topology
 Rings:
 kind of like levels in breadth-first search
 Nodes are partitioned into rings during Phase 1:
 querying node q is in ring 0
 a node is in ring i if it receives the query first from a node in ring i–1
 Phase 2 is divided into epochs, one aggregate answer per
epoch
 each node in outer ring (farthest distance from q) computes s :=
SG(r), where r is its sensor reading, and broadcasts s
 each node in ring i computes s := SG(r), where r is its sensor
reading, and updates s := SF(s,s'), where s' is each synopsis
received from a neighbor, then broadcasts s
 querying node computes SE(s)
 Synchronous algorithm
Discrete Algs for Mobile Wireless Sys
6
Figure
q
R0
R1
R2
A
C
B
Discrete Algs for Mobile Wireless Sys
7
Analysis of Framework
 Complexity: each node broadcasts once
per epoch
 Same as spanning-tree-based approach
 More resilient than spanning-tree-based
approach
Discrete Algs for Mobile Wireless Sys
8
The Functions
 What should SG, SF, and SE be in order to
give the "correct" answer?
 First, give a condition on the functions that
is intuitive
 Then show there are 4 simple checks that
can be done on proposed functions
 These conditions are necessary and
sufficient to preserve correctness
Discrete Algs for Mobile Wireless Sys
9
ODI-Correctness
 Final result should be independent of how the
data was routed to querier:
 same no matter in which order the readings are
combined and how many times they are included
(duplicated) during the routing
 Sensor reading r : <measurement, metadata>
 assumed to be unique
 Suppose we have SG, SF and SE
 Define synopsis label SL(s) = {r} if s = SG(r ) and
SL(s) = SL(s1) Ums SL(s2) if s = SF(s1,s2)
Discrete Algs for Mobile Wireless Sys
10
ODI-Correctness (cont'd)
 What constitutes a "duplicate" depends on
what is being computed
 Ex: average temp vs. number of distinct temps
 q : multiset of sensor readings  set of
(unique) values
 q(SL(s)) = set of unique values in all the
sensor readings that formed the synopsis
Discrete Algs for Mobile Wireless Sys
11
ODI-Correctness Definition
 Let {v1,…,vk} be set of values in the label of s, i.e.,
q(SL(s)).
 Then s must be same as computation on
"canonical left-deep tree":
 s := SG(v1)
 for i = 2 to k do
 s := SF(s,SG(vi))
 I.e., regardless of redundancy caused by
multipath routing, the final synopsis is the same
as if each distinct value is included just once
Discrete Algs for Mobile Wireless Sys
12
ODI-Correctness Figure
s
s
SF
SF
SF
SF
SF
SF
SF
SF
SF
SF
SF
SG
SG
SG
SG
SG
SG
r1
r2
r3
r4
r5
r1
r2
Aggregation DAG
SG
SG
SG
SG
r5
r4
r3
Canonical left-deep tree
Discrete Algs for Mobile Wireless Sys
13
A Simple Test for ODI-Correctness
 duplicate preservation: q({r1}) = q({r2}) 
SG(r1) = SG(r2)
 if two readings are considered duplicates, then
the same synopsis is generated
 commutativity: SF(s1,s2) = SF(s2,s1)
 associativity: SF(s1,SF(s2,s3)) =
SF(SF(s1,s2),s3)
 idempotence: SF(s,s) = s
Discrete Algs for Mobile Wireless Sys
14
More About the Conditions
 Theorem: The previous 4 conditions are
necessary and sufficient for the SG and SF
functions to ensure ODI-correctness.
 Proof Sketch:
 sufficiency: If SG and SF satisfy the 4 conditions, then
show that any computation DAG can be transformed
into a canonical left-deep binary tree that produces the
same output
 necessity: Argue that the 4 conditions follow from the
definition of ODI-correctness.
Discrete Algs for Mobile Wireless Sys
15
Count Example
 Query: What is the (approximate) total
number of sensor nodes in the network?
 Synopsis: a bit vector of length k > log N,
where N is an upper bound on the number
of nodes
 N could be original number of nodes deployed,
or some function of the size of the id space
Discrete Algs for Mobile Wireless Sys
16
SG for Count Example
 No sensor is actually read for this example.
 Let SG return vector s[1..k], where
 a certain entry is 1
 rest of the entries are 0
 How to decide which entry should be 1:
 entry CT(k), where CT(k) is a random variable that
returns value i with probability 1/2i, 1 ≤ i < k.
 How to compute CT(k):
 Toss a fair coin until either the first head occurs or k
coin tosses have occurred with no heads; return
number of tosses
Discrete Algs for Mobile Wireless Sys
17
Computation of CT(k)
 Why does the coin-tossing protocol give the
desired random variable?
 Proof by Example: Suppose k = 4.
 First toss is H, and 1 is returned, with probability 1/2
 Otherwise, second toss is H, and 2 is returned, with
probability 1/4
 Otherwise third toss is H and 3 is returned, with
probability 1/8
 (and then 4 is returned with probability 1/8, but the
definition of CT(4) only cares about 1 through 3)
Discrete Algs for Mobile Wireless Sys
18
SF and SE for Count Example
 SF(s,s'):
 s[i] := s[i] OR s'[i], 1 ≤ i ≤ k
 return s
 SE(s):
 return 2i-1/.77351, where i is the minimum
index such that s[i] = 0
Discrete Algs for Mobile Wireless Sys
19
Intuition for Count Synopsis
Functions
 Suppose all (live) sensors have a failure-free path
to the querier.
 The final bit vector to which SE is applied
indicates which bit positions have been set by at
least one node
 The probability of n nodes failing to set the i-th bit
is (1–2i)n by definition of SG
 Thus the number of (live) nodes is proportional to
2i–1
 constant of proportionality is 1/.77351
Discrete Algs for Mobile Wireless Sys
20
Intuition for Count Synopsis
Functions
 Alternatively…
 We expect half the nodes to set the 1st bit,
a quarter of the nodes to set the 2nd bit, an
eighth of the nodes to set the 3rd bit, etc.
 If there are n distinct nodes, then we might
expect log n bits to be set
 I.e., if log n = i bits are set, then we might
expect there to be about n = 2i nodes
Discrete Algs for Mobile Wireless Sys
21
Count Algorithm is ODI-Correct
 Note that ODI-correctness says nothing
about the SE function, only that SE will
return the same result as in the canonical
tree.
 "Clever algorithms are still required to get
provably good approximations, although the
task has been simplified…"
 Commutativity, associativity, and
idempotence follow from properties of
Boolean OR
Discrete Algs for Mobile Wireless Sys
22
Count Algorithm is ODI-Correct
 Why does SG preserve duplicates?
 Assume each node calls SG only once.
 Show that if sensor readings are
considered duplicates, then the synopsis
generated by SG is the same.
 Since there is no actual sensor reading for this
algorithm, we just use ids for the readings.
 Assumption that each node calls SG only once
ensures the property.
Discrete Algs for Mobile Wireless Sys
23
Implicit Acknowledgments
 When a node broadcasts a synopsis, avoid
overhead of explicit acknowledgments from
receivers this way:
 node u broadcasts its synopsis
 node u snoops (listens to) subsequent
broadcasts by its parent nodes (nodes closer
to the querying node)
 if the synopsis broadcast by a parent
"effectively includes" u's synopsis, u does not
need to rebroadcast, otherwise rebroadcast (or
adapt the topology)
Discrete Algs for Mobile Wireless Sys
24
Implicit Acknowledgments (cont'd)
 How can u accurately infer if its broadcasts was
"effectively included"?
 Suppose u's synopsis was x and the parent's was
z.
 If SF(x,z) = z, then x is effectively included.
 Why? Since SF is commutative, associative, and
idempotent, it is a "semi-lattice".
 in a semi-lattice, every 2 elements x and y have a least
upper bound z, and SF(x,z) = z = SF(y,z)
 Count example: check if appropriate bits are set
Discrete Algs for Mobile Wireless Sys
25
Error Bounds of Approximate
Answers
 Sources of error:
 communication error: some nodes have no
failure-free propagation path to querier
 approximation error: introduced by SG, SF
and SE functions.
 defined as relative error of computed answer w.r.t.
exact algorithm using the same readings
 Argue that communication error can be
made negligible by deploying sensor nodes
sufficiently densely
Discrete Algs for Mobile Wireless Sys
26
Error Bounds of Approximate
Answers (cont'd)
 Approximation error analysis for the
centralized data stream model work in this
model, since synposis is ODI-correct
 canonical left-deep tree corresponds to
processing a data stream of sensor readings in
a centralized location
 Thus, e.g., Count algorithm has same
approximation error guarantees as
computed by Flajolet & Martin
Discrete Algs for Mobile Wireless Sys
27
More Examples
 Max and Min: easy.
 SG is the value, SF takes larger/smaller, SE is
identity
 Sum: cf. paper by Considine et al. which
adapts Count algorithm
 Average, Standard deviation, Second
Moment: cf. paper by Considine et al.
which uses Sum
 Count Distinct: modification of Count
Discrete Algs for Mobile Wireless Sys
28
Uniform Sample Example
 Compute a uniform sample of a given size K of the values
occurring at all nodes in the network
 Synopsis: a sample of size K tuples (or fewer initially)
 SG: output (val,r,id) where
 val is the sensor reading of the node
 r is a random number drawn uniformly from [0,1]
 id is the node's id
 SF(s,s'): list the tuples in s U s' in decreasing order of rvalue, and output the first K (or all, if less than K total)
 U is set union, removes duplicates
 SE(s): output the set of values in the tuples of s
Discrete Algs for Mobile Wireless Sys
29
Uniform Sample Example (cont'd)
 SG labels each reading with a random
number, thus placing it in a random position
in the global ordering of all readings
 So taking first K in the ordering gives a
uniform sample.
 Uniform sample can then be used…
Discrete Algs for Mobile Wireless Sys
30
More Examples
 Use uniform samples to compute these
aggregates:
 k-th statistical moment (k = 1 is the mean)
 k-th percentile value (k = 50 is the median)
with certain error and probability, by choosing the
sample size appropriately (cf. Bar-Yossef et al.)
 Compute the k most frequent values (k = 1 is the
mode): run an ODI-correct Count algorithm for
each value
Discrete Algs for Mobile Wireless Sys
31
Adapting the Topology
 If message loss is detected as occurring "too
frequently", nodes can adapt the Ring topology
 Idea: use a heuristic that tries to assign a node u
to a ring so that there are plenty of ndoes in the
next ring to forward u's synopsis to the querier
 ODI-correct synopses are helpful:
 implicit acks are used to detect message loss energyefficiently
 duplicates that occur during the adaptation of the
topology are not a problem
Discrete Algs for Mobile Wireless Sys
32
Simulation Results
 Extensive!
 Synopsis diffusion
 reduces answer errors in lossy environments
 helps address challenges from correlated node failures
 does not use significantly more power
 What topology to use?
 Adaptive Rings has same overhead as Rings but much
better accuracy
 Adaptive Rings gets about 90% of the sensor readings
most of the time vs. 100% with Flooding, but uses
much less power
Discrete Algs for Mobile Wireless Sys
33
Medians and Beyond [SBAS]
 Extend beyond min/max/sum the class of queries
that can be answered in sensor networks to
include




approximate quantiles (including median)
most frequent data values (including consensus)
histogram of data distribution
range queries
 Provide strict theoretical guarantees on the
approximation quality of the answers in terms of
message size
Discrete Algs for Mobile Wireless Sys
34
Comparison with Nath Paper
 Some of the same problems are
considered
 "Medians and Beyond" is concerned with
efficiency of message size and its tradeoff
with quality of approximation
 Nath paper was concerned with handling
arbitrary ordering and duplicates
 "Medians and Beyond" assumes no duplicates
Discrete Algs for Mobile Wireless Sys
35
Overview
 Assume we have a tree rooted at the querying
node
 To compute Average: each node sends to its
parent the sum of thedata values of its
descendants and its number of descendants
 constant size messages
 To compute Median, need to keep track of all
distinct values
 size of messages, and memory, grows linearly
 Trade off memory and bandwidth with accuracy of
approximations
Discrete Algs for Mobile Wireless Sys
36
Q-Digests
 Assume sensor readings are integers in the range
[1,s]
 Introduce q-digest data structure to answer
quantile queries with
 messages of size m
 error O((log s)/m)
 Users specify message size vs. error tradeoff
 q-digest measures maximum error accumulated
so far
 Once q -digest query is done, use it to compute
quantiles, data distribution,…
Discrete Algs for Mobile Wireless Sys
37
More on q-Digest
 Compute a compressed view of the complete
distribution of values (instead of just a function of
the values)
 Use this view of the distribution to compute
approximations of various functions
 Basic idea: Essentially compute a histogram, but
 equally large, instead of equally spaced, buckets
 buckets can overlap
 size of buckets gives accuracy vs. communication
tradeoff
Discrete Algs for Mobile Wireless Sys
38
Definition of q-Digest
 Group values into variable-sized buckets of
almost equal weights
 size refers to range
 weight refers to number of elements
 q-digest consists of a set of buckets
 Build a complete binary tree
 1,…,s at the leaves
 every tree node is a bucket, its range is all the leaves in
its subtree
 At any given point, only some of the buckets are
being used
Discrete Algs for Mobile Wireless Sys
39
Example
1
data range 1-8
15 data items
5 buckets
2
4
1
2
3
2
6
4
5
6
Discrete Algs for Mobile Wireless Sys
7
8
40
Definition of q-Digest
 Given compression parameter k and number of
data items n, a (tree) node v is in the q-digest iff:
 count(v) ≤ n/k
 node should not have a high count
 count(v) + count(parent(v)) + count(sibling(v)) > n/k
 if a node and its children have low total count then
combine using Compress algorithm
 For a leaf node, if count > n/k, then it is in the qdigest
 Root only needs to satisfy first condition
Discrete Algs for Mobile Wireless Sys
41
Example
1
check that this has k = 5;
n/k = 3
2
4
1
2
3
2
6
4
5
6
Discrete Algs for Mobile Wireless Sys
7
8
42
Centralized Construction of q-Digest
 Go through all the tree nodes bottom up
 Check which ones satisfy the 2 properties.
 If a node v has a child that violates 2nd
property then merge v with both its children
 Detailed info about values which occur
frequently is preserved, while less
frequently occurring values are lumped into
larger buckets resulting in info loss
Discrete Algs for Mobile Wireless Sys
43
1
1
4
1
2
3
6 1
1
4
6
5
1
1
8
7
4
1
2
2
2
3
6
4
5
6
7
8
1
1
2
4
1
2
2
2
4
6
3
4
5
6
7
8
1
2
3
2
6
4
5
6
7
8
Distributed Construction of q-Digest
 Represent a q-digest by numbering the nodes of
the digest tree and sending a set of (node id,
count) pairs
 q-digests move up the spanning tree, being
merged as they go.
 To merge 2 q-digests:
 take their union
 add the counts of buckets with the same range
 compress the result
 Merging can cause information loss.
Discrete Algs for Mobile Wireless Sys
45
Analysis of Q-Digest
 Lemma 1: A q-digest with parameter k has size (number of
buckets) at most 3k.
 because the count of a node and its children can't be too
small
 Lemma 2: In a q-digest with parameter k, the maximum
error in the count of any node is n(log s)/k.
 because in the worst case the count of a node can deviate
from the actual value by the sum of the counts of its
ancestors
 Lemma 3: Merging multiple q-digests gives the same error
as in Lemma 2.
Discrete Algs for Mobile Wireless Sys
46
Quantile Queries
 Problem Statement: Given a fraction q
between 0 and 1, find the value whose rank
in sorted sequence of the n values is qn.
 Median is when q = 1/2
 Relative error is defined to be |r – qn|/n,
where r is the true rank of the returned
value
Discrete Algs for Mobile Wireless Sys
47
Using Q-Digest to Answer a
Quantile Query
 Goal: find q-th quantile
 Sort the nodes of the q-digest in increasing
order of max values (right endpoints);
break ties by putting smaller ranges first
 this gives post-order traversal of the tree
 Scan sorted list and add up the counts
 Let v be the first node at which the running
sum exceeds qn
 Return the max value of node v
Discrete Algs for Mobile Wireless Sys
48
Error Analysis
 Answer returned is v.max
 There are at least qn values less than or
equal to v.max, by choice of v
 Error comes from values that are less than
v.max but are stored in ancestors of v
(these buckets are listed after v)
 But this error is at most n(log s)/k
 Note that estimate is always at least as
great as the eact answer
Discrete Algs for Mobile Wireless Sys
49
Example
a
1
b
c
d
e
i
h
1





j
2
a through o are the
ids of the digest tree
nodes:
j = [3:3]
k = [4:4]
f = [5:6]
g = [7:8]
a = [1:8]
2
6
k
l
3
4
g
f
4
m
5
2
o
n
6
7
8
Find Median (q = 1/2); recall n = 15 so look for 7.5
Sorted list is (j,4), (k,6), (f,2), (g,2), (a,1)
Running sums of counts are 4, 10 - done!
Return max value in tree node k, which is 4
Error is at most sum of counts on path from k to root, which is 1
Discrete Algs for Mobile Wireless Sys
50
Trading Off Error and Message Size
 Memory and message size are controlled
by the compression factor k:
 If k is small, then fewer buckets but wider
range of values are lumped together
 If k is large, then more buckets but more finegrained distribution of values to buckets
 If the maximum number of buckets you can
afford is m, then set k = m/3 (by Lemma 1)
and get error at most  = 3(log s)/m (by
Lemmas 2 and 3)
Discrete Algs for Mobile Wireless Sys
51
Other Queries
 Inverse Quantile: given a value x, determine its
rank in the sorted sequence of input values
 Algorithm:
 construct same sorted list
 traverse list from beginning to end
 return as the answer the sum of the counts of buckets v
for which x > v.max.
 Reported rank is between
rank(x) and rank(x) + n
Discrete Algs for Mobile Wireless Sys
52
Other Queries
 Range Query: find the number of values in
the range [low,high].
 Algorithm:
 perform inverse quantile queries to get the
ranks of low and high
 return the difference in their ranks
 Maximum error is 2n
Discrete Algs for Mobile Wireless Sys
53
Other Queries
 Consensus Query: Given a fraction f between 0
and 1, find all values that are reported by more
than fn sensors
 Algorithm:
 Find all unit-width (leaf) buckets with count > (f–)n and
return their values
 Since a leaf bucket's count has error at most n,
this finds all values with frequency more than fn
 There may be some false positives: some values
with count between (f–)n and fn may also be
reported
Discrete Algs for Mobile Wireless Sys
54
Confidence Factor
 Worst-case error is 3 (log s)/m, but it is unlikely that an
execution will be this bad
 choosing message size m according to this constraint will be
overkill and waste bandwidth
 Instead set m to a value for which it is expected that the
error bound will be met
 Need to calculate the actual error in each q-digest: called
confidence factor
 Define weight of a path: sum of counts of the nodes in the
path
 Define confidence factor: maximum weight of any root-toleaf path, divided by n
Discrete Algs for Mobile Wireless Sys
55
Simulation Results
 Compared against simple scheme of
keeping track of every distinct value
together with its count
 q-digest scheme works well
Discrete Algs for Mobile Wireless Sys
56
Open Questions





Continuous queries?
Lost messages?
Duplicate invariance?
Include spatial information?
Optimality of results?
Discrete Algs for Mobile Wireless Sys
57