Bloom-Based Filters for Hierarchical Data

Download Report

Transcript Bloom-Based Filters for Hierarchical Data

Bloom Based Filters for
Hierarchical Data
Georgia Koloniari and Evaggelia Pitoura
University of Ioannina, Greece
Outline
•
•
•
•
•
•
•
•
•
Motivation
Problem Description
Related Work
Our approach: Multi-Level Bloom Filters
Performance Evaluation
Hierarchical Distribution of Filters
Experimental Results
Conclusions
Future Work
2
Motivation
• Evolution of peer-to-peer systems as an effective
way of sharing data
• Wide use of XML for data representation and
exchange in the Internet
• Service Descriptions in XML-based languages
• Growing interest in content-based routing of data
Challenge: How to efficiently discover the appropriate
data based on their content?
3
The Problem
• A peer-to-peer system where each node stores a
set of XML documents
• A query issued at a node may need results from
multiple nodes in the system
• Use data summaries at each node to assist query
routing
B
SumB
A
C
SumC
4
Summaries Requirements
• Scalability: summaries should be able to scale to
a large number of users and shared documents.
• Distribution: should be distributed across the
nodes of the peer-to-peer system without requiring
any central point of control.
• Dynamic: should support updates, since in a
peer-to-peer system, users join and leave the
system at will.
5
Related Work
• XML Indices
– The Index Fabric [Cooper & Shadmon, RightOrder Inc
2001]
– XSKETCH Synopsis [Polyzotis & Garofalakis, VLDB
2002]
– APEX [Chung, Min & Chim, ACM SIGMOD 2002 ]
– Path Tree [Aboulnaga, Alameldeen & Naughton,
VLDB 2001]
– Signature-based Indices [Park & Kim, DASFAA 2001]
• Routing in P2P
– Secure Service Discovery [Hodes et al, Mobicom ’99]
– Routing indices [Crespo & Garcia-Molina, ICDCS
6
2002]
Data Model
<xml>
<device>
<printer>
<color></color>
<postscript></postscript>
</printer>
<camera>
<digital></digital>
</camera>
</device>
</xml>
device
printer
color
postscript
camera
digital
7
Querying
• XML-based data or service descriptions
• Find the documents that satisfy a given query
• Queries that exploit content and structure of the
data
• Membership Queries: “Is element X in set Y?”
• Path Queries: consisting of regular path
expressions, i.e. device/*/camera
8
Bloom Filters
• Compact data structures for a probabilistic
representation of a set
• Appropriate to answer membership queries
9
Bloom Filters (cont’d)
Bit vector v
Element a
1
H1(a) = P1
H2(a) = P2
1
H3(a) = P3
1
H4(a) = P4
1
m bits
Query for b: check the bits at positions H1(b),
H2(b), ..., H4(b).
10
Bloom Filters (cont’d)
• Appearance of false positives.
False positive: the probabilty that the filter
recognizes an elemnt as belonging to the set
although it does not.
P = (1 - e-kn/m)k
• Ease of updates with the use of an array of
counters
• Unable to represent relationships between
elements
11
Our approach:
• Bloom filters suitable for distributed environments
• Main drawback: Unable to represent hierarchies
• Extend to multi-level Bloom Filters in order to
support path queries
• Two approaches:
– Breadth Bloom Filters
– Depth Bloom Filters
12
Breadth Bloom Filters
• One Bloom Filter BBFi for each level of the tree i
• In each filter BBFi we insert the elements of all the
nodes of level i.
• An additional BBF0 with all the elements to
improve performance
• Different sizes of the filter for each filter
Look-up:
– check BBF0 for all elements of the path
– check each element ai of the path to the
corresponding level
13
device
Breadth Bloom Filters
printer
color
BBF0
BBF1
1 1 1 1 1 0 1 0 1 1 1 1
camera
postscript digital
(deviceprintercamera
colorpostscriptdigital)
device
1 0 0 0 1 0 1 0 0 0 0 0
BBF2
0 1 1 1 1 0 0 0 1 0 0 1
printer  camera
BBF3
0 0 0 1 0 0 1 0 0 1 1 1
(colorpostscriptdigital)
Queries: $device/printer/color
/printer/postscript
14
Depth Bloom Filters
• One Bloom Filter DBFi for each path of the tree
with length i, i.e. each path with i+1 nodes
• In each DBFi we insert all paths of the tree with
length i.
Look-up for path of length p:
– Check all elements of the query in DBF
– Check for every sub-path of length 2 to p
– For * split the path at the positition of * and
check each sub-path seperately
15
device
Depth Bloom Filters
printer
color
DBF0
(deviceprintercamera
colorpostscriptdigital)
Paths of length 1
0 1 1 1 0 0 1 0 0 1 0 1
DBF2
postscript digital
Paths of length 0
1 0 1 1 0 0 1 0 1 1 00
DBF1
camera
(device/printerdevice/camera
camera/digitalprinter/color
printer/postscript)
Paths of length 2
1 0 0 1 1 0 0 1 0 1 1 0
(device/camera/digital
device/printer/color
device/printer/postscript)
Queries: /device/printer/color
16
/device/*/postscript
Experimental Evaluation
• 200 XML documents produced by the Niagara Generator
(www.cs.wisc.edu/niagara)
• 4 hash functions using the MD5 message digest algorithm
(RFC1321)
• Size of the filter: 78000 bits, about 2% of the size of the
documents
• Levels of the documents: 4
• Elements per document: 50
• No repetition between element names
• Length of queries: 3 (e.g. /device/camera/digital)
• 90% of the elements forming the queries were contained in
the documents
• Metric: Percentage of false positives
17
Influence of filter size
false positives
percentage
sim ple
breadth
depth
100
50
0
30000 50000 78000 96000 150000
size of the filter
18
Influence of the number of
elements per document
false positives
percentage
simple
breadth
depth
100
50
0
10
25
50
100
150
elements per document
19
Influence of the levels of the
document
false positives
percentage
sim ple
breadth
depth
100
50
0
2
3
4
5
6
num ber of levels
20
Influence of the length of the
queries
false positives
percentage
sim ple
breadth
depth
100
80
60
40
20
0
2
3
4
5
6
length of queries
21
Varying the query workload
sim ple
breadth
depth
false positives
percentage
120
100
80
60
40
20
0
0%
10%
20%
50%
75%
100%
percentage of queries
Workload type: /printer/digital
22
Summary of Results
• Multi-level Bloom filters outperform Simple
Bloom filters in evaluating path queries.
• For 2% of the total size of the data, multi-level
Bloom filters evaluate path queries for a false
positives ratio below 3%, while Simple Blooms
fail to recognize the correct paths, no matter how
much the filter size increases.
• Breadth Blooms work better than Depth Blooms.
• Depth Blooms require more space but are suitable
for handling queries for which Breadth Blooms
present a high ratio of false positives (exp. 5)
23
Distribution
• Each node stores:
– local summary
– merged summary of neighbours
– merged summary constructed by applying the
bit-wise OR per level
• Nodes organized according to topological
proximity
• Two organizations of nodes:
– hierarchical
– horizons
24
Distribution: Hierarchical
Organization
A
root peer
B
C
E
peer
main channel
D
F
Node C:
G
H
Local filter
Merged filter :E F  G  H
Root filters: A, B, D
25
Bloom Filter Similarity
• Nodes organized according to Bloom Filter
Similarity
• Measure: similarity measure based on the
Manhattan distance metric.
Let two filters B and C of size m
d(B, C) = |B[1] – C[1]| + |B[2] – C[2]| + … |B[m] –
C[m]|.
similarity(B, C) = m – d(B, C).
26
Bloom Filter Similarity (cont’d)
B
C
1
0
1
0
0
0
1
1
1
1
0
0
1
0
0
1
similarity(B, C) =8 - (1 + 0 + 0 + 1 + 0 + 1+ 0 + 1) = 4
For multi-level Bloom filters similarity is defined
as the sum of each pair of corresponding levels
27
Content-Based Organization
• When a node joins the system:
– it broadcasts its local summary and attaches to the most
«similar» node available
28
Performance in Distributed
Setting
• Hierarchical organization of nodes
• Metric: Number of hops
• Parameters:
–
–
–
–
–
–
–
Variable number of nodes
Number of hierarchies: 5
Maximum out-degree: 5
Every 10% of all docs 70% similar
Length of queries: 2
10% of the documents have results
70% of the documents contain the elements of the path
query
29
– One document per node
Finding the first result with
respect to the nodes
30
Finding all the results with
respect to the nodes
31
Finding the first result with
varying number of results
32
Finding the first result with
respect to the nodes
33
number of hops
Finding all the results with
respect to the nodes
400
350
300
250
200
150
100
50
0
simple-proximity
simple-content
breadth-proximity
breadth-content
depth-proximity
depth-content
20
50
100
150
200
number of nodes
34
Summary of Results
• The content-based organization is much more
efficient in finding all the results for a query, than
the proximity organization.
• They both perform similarly in discovering the
first result.
• The content-based organization outperforms the
proximity one when the nodes that satisfy a given
query are limited.
• Both Simple and multi-level Blooms can be
efficiently used as distributed filters.
• For path queries, multi-level Blooms outperform 35
Simple ones.
Conclusions
• We introduced two novel data structures: Breadth
and Depth Bloom Filters that exploit both the
content and structure of the XML documents
given a small space overhead.
• The new data structures outperform simple Bloom
Filters with respect to false positives when
addresing regular path expression queries
• Distributed in large-scale systems to support
efficient service discovery
• Extended the use of Bloom filters to organize the
nodes according to their content.
36
Future Work
• Explore different policies for the filters
distribution.
• Explore different types of data summaries
(e.g. Signatures)
• Extend the data model to XML graphs and
incorporate values into the indexes
37
Thank you
38