Counting at Large: Efficient Cardinality Estimation in

Download Report

Transcript Counting at Large: Efficient Cardinality Estimation in

Counting at Large: Efficient
Cardinality Estimation in
Internet-Scale Data Networks
By Nikon Ntarmos, Peter Triantafillou, and
Gerhard Weikum
Anthony Okorodudu
CSE 6392
2006-4-25
Outline








Introduction
Motivation
Related Work
Distributed Hash Tables (DHT)
Hash Sketches
Distributed Hash Sketches (DHS)
Counting with DHS
Conclusion
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
2
Introduction


Peer-to-peer (P2P) started as a way of
sharing files/CPU cycles among end-users
Evolved into cutting networks of today



Distributed Hash Tables (DHT) made this feasible
Probabilistic guarantees for degree of
efficiency, fault tolerance, and availability
Data management systems of huge scale
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
3
Motivation

Need for distributed counting
mechanisms



2006/4/25
File-sharing P2P systems: total number of
documents shared by users
Sensor networks: compute aggregates in a
duplicate-insensitive manner
Internet-scale DB system: build histograms
for query access plans
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
4
Central Goals
1.
2.
3.
2006/4/25
Efficiency: number of nodes contacted for
counting must be small
Scalability and availability: large numbers of
nodes may need to add elements to a
(multi-) set
Access and storage load balancing: counting
and related overheads should be fairly
distributed across all nodes
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
5
Central Goals (continued)
4.
5.
6.
2006/4/25
Accuracy: tunable, robust, and highly
accurate cardinality estimation
Simplicity and ease of integration:
special, solution-based indexing
structures should be avoided
Duplicate (in)sensitivity: count total
number of items as well as the
number of unique items in multi-sets
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
6
Distributed Counting Protocols




One-node-per-counter protocols
Gossip-based protocols
Broadcast/convergecast-type protocols
Sampling-based protocols
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
7
One-node-per-counter



Select a node in the overlay of the DHT
and use it to maintain counter value
Poor scalability
Resembles a centralized system
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
8
Gossip-based





Provide weak probabilistic semantics of
“eventual consistency” for outcome
Every node exchanges information with
a set of nodes
Low bandwidth
Not efficient in terms of number of
nodes to be contacted
Low accuracy
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
9
Broadcast/Convergecast-type
Broadcast phase
1.

Querying node broadcasts query through
network, creating tree of nodes as query
propagates the overlay
Convergecast phase
2.


2006/4/25
Node sends its local part of the answer
along with answers received from nodes
deeper down the tree to “parent” node
Similar to gossip-based
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
10
Sampling-based




Estimate the value of the counter by
selectively querying a set of nodes in the
network
Sampling based techniques suffer from
accuracy issues
Large samples lead to higher accuracy but
more nodes need to be contacted
Sampling based techniques are usually
duplicate-sensitive
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
11
Distributed Hash Tables (DHT)
Family of structured P2P network
overlays exposing hash-table like
interface

1.
2.
insert(key, value)
lookup(key)

Highly efficient for point queries
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
12
Hash Sketches


First proposed as a means of estimating
the cardinality of a multiset in a
database
Used in many application domains for
counting distinct elements in multi-sets

2006/4/25
Approximate query answering in very large
DBs, data mining on the internet graph,
stream processing
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
13
Hash Sketches (continued)


PCSA (Probabilistic Counting with
Stochastic Averaging) algorithm
assumes of a pseudo-uniform hash
function
Super-LogLog algorithm relaxes
pseudo-uniform hash function
constraints of PCSA
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
14
Distributed Hash Sketches
(DHS)



Fully decentralized, scalable, and
efficient mechanism capable of
providing estimates on the cardinality of
multi-sets
Satisfy all the central goals
Implemented using PCSA (DHS-PCSA)
or super-LogLog (DHS-sLL) hash
sketches
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
15
DHS

O(log N) cost to insert object in an Nnode DHS


O(b * log N) bandwidth consumption if
size of data is b bytes
Data items are deleted if not updated
within time-to-live so deleting an item
incurs no extra cost
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
16
DHS (continued)


Accuracy of hash sketches increases
with multiple bitmap vectors
Either PCSA or super-LogLog algorithm
is applied for counting
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
17
Counting with DHS
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
18
Conclusion



Distributed Hash Sketches is a fully
decentralized, scalable, and efficient
mechanism for providing estimates on the
cardinality of multi-sets in internet-scale
information systems
DHS implemented using either PCSA or the
super-LogLog hash sketches
DHS histograms can introduce great
performance savings during query
optimization
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
19
References

N. Ntarmos, P. Triantafillou, and G.
Weikum. Counting at Large: Efficient
Cardinality Estimation in Internet-Scale
Data Networks. ICDE 2006.
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
20
Thanks
2006/4/25
Counting at Large: Efficient
Cardinality Estimation in InternetScale Data Networks
21