Transcript new1
Querying the Internet with PIER
(PIER = Peer-to-peer Information Exchange and
Retrieval)
What is PIER?
Peer-to-Peer Information Exchange and Retrieval
Query engine that runs on top of P2P network
• step to the distributed query processing at a
larger scale
• way for massive distribution: querying
heterogeneous data
Architecture meets traditional database query
processing with recent peer-to-peer technologies
Key goal is scalable indexing system for largescale decentralized storage applications on the
Internet
P2P, Large scale storage management
systems (OceanStore, Publius), wide-area
name resolution services
What is Very Large?
Depends on Who You Are
Internet scale systems vs. hundred node systems
Single Site
Clusters
Distributed
10’s – 100’s
Database Community
Internet Scale
1000’s – Millions
Network Community
How to run DB style queries at Internet
Scale!
What are the Key Properties?
Lots of data that is:
1. Naturally distributed (where it’s
generated)
2. Centralized collection undesirable
3. Homogeneous in schema
4. Data is more useful when viewed as
a whole
Who Needs Internet Scale?
Example 1: Filenames
Simple ubiquitous schemas:
• Filenames, Sizes, ID3 tags
Born from early P2P systems such as
Napster, Gnutella etc.
Content is shared by “normal” non-expert
users… home users
Systems were built by a few individuals ‘in
their garages’ Low barrier to entry
Example 2: Network Traces
Schemas are mostly standardized:
• IP, SMTP, HTTP, SNMP log formats
Network administrators are looking for
patterns within their site AND with other sites:
• DoS attacks cross administrative boundaries
• Tracking virus/worm infections
• Timeliness is very helpful
Might surprise you how useful it is:
• Network bandwidth on PlanetLab (world-wide
distributed research test bed) is mostly filled with
people monitoring the network status
Our Challenge
Our focus is on the challenge of
scale:
• Applications are homogeneous and
distributed
Already have significant interest
• Provide a flexible framework for a
wide variety of applications
Four Design Principles (I)
Relaxed Consistency
• ACID transactions severely limits the
scalability and availability of distributed
databases
• We provide best-effort results
Organic Scaling
• Applications may start small, without
a priori knowledge of size
Four Design Principles (II)
Natural habitat
• No CREATE TABLE/INSERT
• No “publish to web server”
• Wrappers or gateways allow the information to
be accessed where it is created
Standard Schemas via Grassroots
software
• Data is produced by widespread software
providing a de-facto schema to utilize
Declarative
Queries
Query Plan
Overlay Network
Physical Network
>>based on Can
Network
Monitoring
Other User
Apps
Applications
Query
Optimizer
Catalog
Manager
Core
Relational
Execution
Engine
PIER
DHT
Wrapper
Storage
Manager
IP
Network
Overlay
Routing
DHT
Network
Applications
P2P Databases
Highly distributed and
available data
Network Monitoring
Intrusion detection
Fingerprint queries
DHTs
Implemented with CAN (Content
Addressable Network).
Node identified by hyper-rectangle in
d-dimensional space
Key hashed to a point, stored in
corresponding node.
Routing Table of neighbours is
maintained. O(d)
Given a message with an ID, route the
message to the computer currently
responsible for that ID
(16,16)
(16,0)
Key = (15,14)
Data
(0,0)
(0,16)
DHT Design
Routing Layer
Mapping for keys
(-- dynamic as nodes leave and join)
Storage Manager
DHT based data
Provider
Storage access interface for higher
levels
DHT – Routing
Routing layer
maps a key into the IP address of the node currently
responsible for that key. Provides exact lookups,
callbacks higher levels when the set of keys has
changed
Routing layer API
lookup(key) ipaddr (Asynchronous Fnc)
join(landmarkNode)
leave()
locationMapChange()
DHT – Storage
Storage Manager
stores and retrieves records, which consist
of key/value pairs. Keys are used to locate
items and can be any data type or structure
supported
Storage Manager API
store(key, item)
retrieve(key) item
remove(key)
DHT – Provider (1)
Storage
Manager
Provider
ties routing and storage manager layers
and provides an interface
Each object in the DHT has a
namespace, resourceID and instanceID
DHT key =
hash(namespace,resourceID)
namespace - application or group of object, table or relation
resourceID – primary key or any attribute(Object)
instanceID – integer, to separate items with the same namespace
and resourceID
Lifetime - item storage duration
CAN’s mapping of resourceID/Object is equivalent to an index
Provider
Overlay
Routing
Provider
Storage
Manager
DHT – Provider (2)
Provider API
Overlay
get(namespace, resourceID) item
Routing
put(namespace, resourceID, item, lifetime)
renew(namespace, resourceID, instanceID, lifetime)
bool
multicast(namespace, resourceID, item)
lscan(namespace) items
newData(namespace, item)
rID3
Node R1
Table R (namespace)
(1..n) tuples
(n+1..m) tuples
item
(1..n)
Node R2
(n+1..m)
rID2
rID1
item
item
Query Processor
How it works?
• performs selection, projection, joins, grouping,
aggregation ->Operators
• Operators push and pull data
• simultaneous execution of multiple operators pipelined
together
• results are produced and queued as quick as possible
How it modifies data?
• insert, update and delete different items via DHT
interface
How it selects data to process?
• dilated-reachable snapshot – data, published by
reachable nodes at the query arrival time
Join Algorithms
Limited Bandwidth
Symmetric Hash Join:
- Rehashes both tables
Semi Joins:
- Transfer only matching tuples
At 40% selectivity, bottleneck switches from
computation nodes to query sites
Future Research
Routing, Storage and Layering
Catalogs and Query Optimization
Hierarchical Aggregations
Range Predicates
Continuous Queries over Streams
Sharing between Queries
Semi-structured Data
Distributed Hash Tables (DHTs)
What is a DHT?
• Take an abstract ID space, and partition
among a changing set of computers (nodes)
• Given a message with an ID, route the
message to the computer currently
responsible for that ID
• Can store messages at the nodes
• This is like a “distributed hash table”
Provides a put()/get() API
• Cheap maintenance when nodes come and
go
Distributed Hash Tables (DHTs)
Lots of effort is put into making DHTs
better:
•
•
•
•
•
•
Scalable (thousands millions of nodes)
Resilient to failure
Secure (anonymity, encryption, etc.)
Efficient (fast access with minimal state)
Load balanced
etc.
PIER’s Three Uses for DHTs
Single elegant mechanism with many
uses:
• Search: Index
Like a hash index
• Partitioning: Value (key)-based routing
Like Gamma/Volcano
• Routing: Network routing for QP messages
Query dissemination
Bloom filters
Hierarchical QP operators (aggregation, join, etc)
Not clear there’s another substrate that
supports all these uses
Metrics
We are primarily interested in 3 metrics:
• Answer quality (recall and precision)
• Bandwidth utilization
• Latency
Different DHTs provide different properties:
• Resilience to failures (recovery time) answer quality
• Path length bandwidth & latency
• Path convergence bandwidth & latency
Different QP Join Strategies:
• Symmetric Hash Join, Fetch Matches, Symmetric SemiJoin, Bloom Filters, etc.
• Big Picture: Tradeoff bandwidth (extra rehashing) and
latency
Symmetric Hash Join (SHJ)
r.c > s.c
NS=temp
r.a = s.a
PUT r.a
PUT s.a
r.b=constant s.b=constant
R
S
NS=r
NS=s
Fetch Matches (FM)
s.b=constant AND r.c > s.c
r.a = s.a
r.b=constant
GETs.a
R
S
NS=r
NS=s
Symmetric Semi Join (SSJ)
NS=temp
r.c > s.c
r.a = s.a
r.a = r.a
s.a = s.a
GET s.key
S
r.a = s.a
PUT
r.a
GET r.key
PUT s.a
NS=s
R
NS=r
r.a, r.key
s.a, s.key
r.b=constant s.b=constant
R
S
NS=r
NS=s
Both R and S are
projected to save
bandwidth
The complete R
and S tuples are
fetched in parallel
to improve latency
Overview
CAN is a distributed system that
maps keys onto values
Keys hashed into d dimensional
space
Interface:
• insert(key, value)
• retrieve(key)
Overview
y
State of the system at time t
Peer
Resource
Zone
x
In this 2 dimensional space a key is mapped to a point (x,y)
DESIGN
D-dimensional Cartesian coordinate
space (d-torus)
Every Node owns a distinct Zone
Map Key k1 onto a point p1 using a
Uniform Hash function
(k1,v1) is stored at the node Nx
that owns the zone with p1
• Node Maintains routing
table with neighbors
Ex: A Node holds{B,C,E,D}
• Follow the straight line path through
the Cartesian space
Routing
y
d-dimensional space
with n zones
(x,y)
Peer
Q(x,y) Query/
Resource
2 zones are neighbor
if d-1 dim overlap
Routing path of
length:
Q(x,y)
Algorithm:
Choose the
neighbor nearest
to the destination
key
CAN: construction*
Bootstrap
node
new node
CAN: construction
Bootstrap
node
I
new node
1) Discover some node “I” already in CAN
CAN: construction
(x,y)
I
new node
2) Pick random point in space
CAN: construction
(x,y)
J
I
new node
3) I routes to (x,y), discovers node J
CAN: construction
J
new
4) split J’s zone in half… new owns one half
Maintenance
Use zone takeover in case of failure
or leaving of a node
Send your neighbor table to
neighbors to inform that you are
alive at discrete time interval t
If your neighbor does not send alive
in time t, takeover its zone
Zone reassignment is needed
Node Departure
Some one has to take over the Zone
Explicit hand over of the zone to one of its
Neighbors
Merge to valid Zone if ”possible”
If not Possible ”then to Zones are temporary
handled by the smallest neighbor
Zone reassignment
1
3
1
3
2
Zoning
4
2
Partition tree
4
Zone reassignment
3
1
1
3
4
4
Partition tree
Zoning
Design Improvements
• Multi-Dimension
• Multi-Coordinate Spaces
• Overloading the Zones
• Multiple Hash Functions
• Topologically Sensitive Construction
• Uniform Partitioning
• Caching
Multi-Dimension
Increase in the dimension reduces
the path length
Multi-Coordinate Spaces
Multiple coordinate
spaces
Each node is assigned
different zone in each
of them.
Increases the
availability and reduces
the path length
Overloading the Zones
More than one peer are assigned to
one zone.
Increases availability
Reduces path length
Reduce per-hop latency
Uniform Partitioning
Instead of splitting directly splitting
the node occupant node
• Compare the volume of its zone with
neighbors
• The one to split is the one having
biggest volume