PIER (Peer-to-Peer Information
Exchange and Retrieval)
30 March 07
Neha Singh
Outline
Motivation
Introduction
Architecture
Join Algorithms
Experimental Results
Conclusion
Motivation: What is Very Large?
Database community: single site, clusters, distributed systems of 10's – 100's of nodes
Network community: Internet scale, 1000's – millions of nodes

"The database research community prides itself on scalable technologies. Yet database systems traditionally do not excel on one important scalability dimension: the degree of distribution. This limitation has hampered the impact of database technologies on massively distributed systems like the Internet."
Motivation
Databases:
  powerful query facilities
  potential to scale up to a few hundred computers
For querying the Internet, a widely distributed system is needed that has:
  query facilities (SQL)
  fault tolerance
  flexibility (in terms of the nature of the data)
Databases + P2P DHTs
Databases carry a lot of other expensive baggage:
  ACID transactions
  consistency above all else
Need to unite the query processor with DHTs:
  DHTs + Relational Query Processing = PIER
  Bringing complex queries to DHTs
Outline
Motivation
Introduction
  What is PIER
  Design Principles
Architecture
Join Algorithms
Experimental Results
Conclusion
What is PIER?
Peer-to-Peer Information Exchange and Retrieval
A query engine that scales up to thousands of participating nodes and can work on various kinds of data
It runs on top of a P2P network:
  a step toward distributed query processing at a larger scale
  a way toward massive distribution: querying heterogeneous data
The architecture marries traditional database query processing with recent peer-to-peer technologies
Design Principles
Relaxed Consistency
  Brewer states that a distributed data system can have only two of the following three properties:
  1. (C)onsistency
  2. (A)vailability
  3. Tolerance of network (P)artitions
  PIER prioritizes A and P and sacrifices C (sacrificing consistency in the face of availability and partition tolerance), i.e. "best effort results"
Design Principles
Organic Scaling
  Growth with deployment
  Does not require a priori allocation
Natural Habitats for Data
  Data remains in its original format, with a DB interface
Standard Schemas
  Achieved through common software
Initial Design Assumptions
Overlay Network
  DHTs are highly scalable and resilient to network failures
  But DHTs provide only limited functionality
  Design challenge: get lots of functionality from this simple interface
Decouple Storage
  PIER is just the query engine, no storage
  Query data that is in situ
  Give up ACID guarantees
Why not design a p2p storage manager too?
  Not needed for many applications
  Hard problem in itself – leave for others to solve (or not)
Outline
Motivation
Introduction
Architecture
Join Algorithms
Experimental Results
Conclusion
PIER Architecture
[Architecture diagram] Applications such as network monitoring and other user apps issue declarative queries, e.g.
  SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr
which are compiled into query plans. PIER itself contains the query optimizer, catalog manager and core relational execution engine. It sits on a DHT layer (the DHT wrapper) consisting of the provider, storage manager and overlay routing, which runs on the overlay network over the physical IP network.
Architecture: DHT: Routing
The DHT is divided into 3 modules: overlay routing, storage manager, and provider
Very simple interface
Any routing algorithm can be used here: CAN, Chord, Pastry, etc.
Overlay Routing API (a sketch of this interface in code follows below):
  lookup(key) → ipaddr
  join(landmarkNode)
  leave()
  CALLBACK: locationMapChange()
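As a rough illustration, a minimal Python sketch of what this routing interface could look like. Only the call names (lookup, join, leave, locationMapChange) come from the slide; the signatures, types and callback registration are assumptions.

    # Hypothetical sketch of the DHT overlay-routing interface from the slide.
    from abc import ABC, abstractmethod

    class OverlayRouting(ABC):
        @abstractmethod
        def lookup(self, key: bytes) -> str:
            """Route a key through the overlay and return the IP address
            of the node currently responsible for it."""

        @abstractmethod
        def join(self, landmark_node: str) -> None:
            """Join the overlay by contacting a known landmark node."""

        @abstractmethod
        def leave(self) -> None:
            """Leave the overlay gracefully."""

        @abstractmethod
        def on_location_map_change(self, callback) -> None:
            """Register the locationMapChange callback, invoked when the
            set of keys this node is responsible for changes."""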
Architecture: DHT: Storage
The storage manager stores and retrieves records, which consist of key/value pairs. Keys are used to locate items and can be any data type or structure supported.
A complex storage system could be used, but here a simple in-memory storage system is used.
Storage Manager API (a minimal sketch follows below):
  store(key, item)
  retrieve(key) → item
  remove(key)
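A minimal sketch of such an in-memory storage manager in Python. Only store/retrieve/remove come from the slide; the dictionary-of-lists layout is an illustrative assumption.

    # Simple in-memory storage manager, as suggested by the slide.
    class StorageManager:
        def __init__(self):
            self._items = {}          # key -> list of stored items

        def store(self, key, item):
            self._items.setdefault(key, []).append(item)

        def retrieve(self, key):
            return list(self._items.get(key, []))

        def remove(self, key):
            self._items.pop(key, None)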
DHT – Provider (1/2)
The provider ties the routing and storage manager layers together and provides an interface
Each object in the DHT has a namespace, resourceID and instanceID
DHT key = hash(namespace, resourceID)
  namespace – the application or group of objects, e.g. a table
  resourceID – identifies the object, e.g. the primary key or any other attribute
  instanceID – an integer, used to separate items with the same namespace and resourceID
CAN's mapping of resourceID to object is equivalent to an index
DHT – Provider (2/2)
Provider API (a combined sketch follows below):
  get(namespace, resourceID) → item
  put(namespace, resourceID, item, lifetime)
  renew(namespace, resourceID, instanceID, lifetime) → bool
  multicast(namespace, resourceID, item)
  unicast(ns, rid, item)
  lscan(namespace) → items
  CALLBACK: newData(ns, item)
[Figure: the tuples of table R (its namespace) are partitioned across nodes by resourceID, e.g. node R1 holds tuples 1..n and node R2 holds tuples n+1..m.]
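A combined Python sketch of how the provider could tie these pieces together. Only the call names and the key = hash(namespace, resourceID) rule come from the slides; the SHA-1 hash, lifetime handling and local-only storage are simplifying assumptions.

    # Hypothetical sketch of the provider layer, delegating to the routing and
    # storage sketches above.
    import hashlib
    import time

    class Provider:
        def __init__(self, routing, storage):
            self.routing = routing            # OverlayRouting sketch above
            self.storage = storage            # StorageManager sketch above

        @staticmethod
        def dht_key(namespace, resource_id):
            # DHT key = hash(namespace, resourceID), as on the previous slide
            return hashlib.sha1(f"{namespace}/{resource_id}".encode()).digest()

        def put(self, namespace, resource_id, item, lifetime):
            key = self.dht_key(namespace, resource_id)
            # In the real system the item would be sent to the node returned by
            # lookup(key); here we simply store it locally with an expiry time.
            self.storage.store(key, (item, time.time() + lifetime))

        def get(self, namespace, resource_id):
            key = self.dht_key(namespace, resource_id)
            return [item for item, expiry in self.storage.retrieve(key)
                    if expiry > time.time()]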
Architecture: PIER
Consists only of the relational execution engine
Executes a pre-optimized query plan
A query plan is a box-and-arrow description of how to connect basic operators together:
  selection, projection, join, group-by/aggregation, and some DHT-specific operators such as rehash
(A toy illustration of such a plan follows below.)
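To make the box-and-arrow idea concrete, a toy Python sketch of operators wired into a small plan. The generator-style operators and sample data are invented for illustration; they are not PIER's actual classes.

    # Toy box-and-arrow plan: each operator pulls tuples from its input.
    def selection(tuples, predicate):
        for t in tuples:
            if predicate(t):
                yield t

    def projection(tuples, attrs):
        for t in tuples:
            yield {a: t[a] for a in attrs}

    # A small plan: scan -> select -> project, wired like boxes joined by arrows.
    R = [{"cpr": 1, "name": "a"}, {"cpr": 2, "name": "b"}]
    plan = projection(selection(R, lambda t: t["cpr"] > 1), ["name"])
    print(list(plan))   # [{'name': 'b'}]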
Query Processor
How does it work?
  Performs selection, projection, joins, grouping and aggregation
  Multiple operators pipelined together execute simultaneously
  Results are produced and queued as quickly as possible
How does it modify data?
  Inserts, updates and deletes items via the DHT interface
How does it select data to process?
  Dilated-reachable snapshot – the data published by reachable nodes at the time the query arrives
Outline
Motivation
Introduction
Architecture
Join Algorithms
Experimental Results
Conclusion
Joins: The Core of Query Processing
Goal:
  Get tuples that have the same value for a particular attribute or attributes (the join attributes) to the same site, then append the tuples together.
The algorithms come from the existing database literature, with minor adaptations to use the DHT.
DHT based Distributed Joins
DHTs are used both
  as a content addressable network
  and for routing tuples by value
Two binary equi-joins and two bandwidth-reducing rewrite schemes have been implemented
Joins: Symmetric Hash Join (SHJ)
Algorithm, for each site (a sketch in code follows below):
  (Scan) Use two lscan calls to retrieve all data stored at that site from the source tables
  (Rehash) put a copy of each eligible tuple with the hash key based on the value of the join attribute
  (Listen) use newData to see the rehashed tuples
  (Compute) Run a standard one-site join algorithm on the tuples as they arrive
Scan/Rehash steps must be run on all sites that store source data
Listen/Compute steps can be run on fewer nodes by choosing the hash key differently
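A compact Python sketch of these SHJ steps at a single site. The dht object is assumed to expose lscan/put and a newData callback registration as on the provider slides; emit, the namespaces and the in-memory hash tables are illustrative assumptions, not PIER's code.

    # Sketch of Symmetric Hash Join at one site (illustrative).
    def symmetric_hash_join(dht, emit, ns_r, ns_s, ns_join, join_attr, lifetime=60):
        # (Scan) read the local fragments of both source tables via lscan
        for ns in (ns_r, ns_s):
            for tup in dht.lscan(ns):
                # (Rehash) republish each tuple keyed by its join-attribute value
                dht.put(ns_join, tup[join_attr], (ns, tup), lifetime)

        # (Listen/Compute) build a hash table per source table and emit a joined
        # tuple whenever a newly arrived tuple matches one from the other table
        seen = {ns_r: {}, ns_s: {}}

        def on_new_data(namespace, item):
            src_ns, tup = item
            key = tup[join_attr]
            seen[src_ns].setdefault(key, []).append(tup)
            other_ns = ns_s if src_ns == ns_r else ns_r
            for match in seen[other_ns].get(key, []):
                emit({**tup, **match})      # append the two tuples together

        dht.on_new_data(ns_join, on_new_data)   # assumed wrapper for the newData callback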
Joins: Fetch Matches (FM)
Algorithm, for each site (a sketch follows below):
  (Scan) Use lscan to retrieve all data from ONE table
  (Get) Based on the value of the join attribute, issue a get for the possible matching tuples from the other table
Note: one table (the one we issue the gets for) must already be hashed/stored on the join attribute
Big picture:
  SHJ is put based
  FM is get based
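A matching Python sketch for Fetch Matches, reusing the same assumed dht interface; it presumes S is already stored in the DHT hashed on the join attribute, as the slide requires.

    # Sketch of the Fetch Matches join at one site (illustrative).
    def fetch_matches_join(dht, emit, ns_r, ns_s, join_attr):
        # (Scan) read only the local fragment of R
        for r_tup in dht.lscan(ns_r):
            # (Get) fetch the S tuples stored under the same join-attribute value
            for s_tup in dht.get(ns_s, r_tup[join_attr]):
                emit({**r_tup, **s_tup})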
Query Processor – Join Rewriting
Symmetric semi-join
  (Project) both R and S to their resourceIDs and join keys
  (Small rehash) Perform an SHJ on the two projections
  (Compute) Send the results into an FM join for each of the tables
  * Minimizes initial communication
Bloom joins (a rough sketch of the filtering step follows below)
  (Scan) create a Bloom filter over a local fragment of each relation
  (Put) Publish the filters for R and S
  (Multicast) Distribute the filters
  (Rehash) rehash only the tuples that match the filter
  (Compute) Run SHJ
  * Reduces rehashing
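As a rough illustration of the Bloom-join rewrite, a small Python sketch of building a filter over the local join keys and rehashing only the tuples that pass the other relation's filter. The filter size, hash construction and dht calls are assumptions carried over from the earlier sketches, not PIER's implementation.

    # Rough sketch of the Bloom-join rewrite (illustrative parameters).
    import hashlib

    def bloom_add(bits, key, m=1024, k=3):
        for i in range(k):
            h = int(hashlib.sha1(f"{i}:{key}".encode()).hexdigest(), 16) % m
            bits.add(h)

    def bloom_contains(bits, key, m=1024, k=3):
        return all(
            int(hashlib.sha1(f"{i}:{key}".encode()).hexdigest(), 16) % m in bits
            for i in range(k)
        )

    def local_filter(tuples, join_attr):
        # (Scan) summarize the local fragment's join keys in a Bloom filter
        bits = set()
        for t in tuples:
            bloom_add(bits, t[join_attr])
        return bits

    def filtered_rehash(dht, ns_join, r_tuples, s_filter, join_attr, lifetime=60):
        # (Rehash) only R tuples matching S's published filter are rehashed;
        # the SHJ then runs on the reduced set.
        for t in r_tuples:
            if bloom_contains(s_filter, t[join_attr]):
                dht.put(ns_join, t[join_attr], ("R", t), lifetime)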
Joins: Additional Strategies
Bloom Filters
  Bloom filters can be used to reduce the amount of data rehashed in the SHJ
Symmetric Semi-Join
  Run an SHJ on the source data projected to only the hash key and join attributes
  Use the results of this mini-join as the source for two FM joins that retrieve the other attributes of the tuples likely to be in the answer set
Big Picture:
  Trade off bandwidth (extra rehashing) against latency (time to exchange filters)
Outline
Motivation
Introduction
Architecture
Join Algorithms
Experimental Results
Conclusion
Workload Parameters
CAN configuration: d = 4
|R| = 10 |S|
Constants provide 50% selectivity
f(x,y) evaluated after the join
90% of R tuples match a tuple in S
Result tuples are 1 KB each
Symmetric hash join used
Simulation Setup
Up to 10,000 nodes
Network cross-traffic, CPU and memory utilizations ignored
[Figure: Scalability – simulations of one SHJ join, comparing a warehousing approach with full parallelization]
Result 1 (Scalability)
The system scales well as long as the number of computation nodes is large enough to avoid network congestion at those nodes.
Join Algorithms (1/2)
Infinite bandwidth
1024 data and computation nodes
Core join algorithms
  perform faster
Rewrites
  Bloom filter: two multicasts
  Semi-join: two CAN lookups
Performance: Join Algorithms
[Figure: average network traffic of SHJ, FM, SSJ and BF versus the selectivity of the predicate on relation S (0–100%); R + S = 25 GB, n = m = 1024, inbound capacity = 10 Mbps, hop latency = 100 ms]
At 40% selectivity, the bottleneck switches from the computation nodes to the query sites.
Soft State
Failure detection and recovery
  15 second failure detection
  4096 nodes
  Time to reinsert lost tuples
Refresh period
  Average recall decreases as the failure rate increases, and increases as the refresh period decreases.
Some real-world results
1 SHJ Join on Millennium Cluster
Outline
Motivation
Introduction
Architecture
Join Algorithms
Experimental Results
Conclusion
Applications
P2P Databases
  Highly distributed and available data
Network Monitoring
  Distributed intrusion detection
  Fingerprint queries
Conclusion
Distributed data needs a distributed query processor
DHTs are too simple, databases too complex
  PIER occupies a point in the middle of this new design space
The primary technical contributions of the paper are architectural and evaluative, rather than algorithmic
It presents the first serious treatment of scalability issues in a P2P-style relational query engine
References
Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica. Querying the Internet with PIER. VLDB 2003.
Matthew Harren, Joseph M. Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, Ion Stoica. Complex Queries in DHT-based Peer-to-Peer Networks. IPTPS, March 2002.
Thank you
Extra Slides
Transit Stub Topology
GT-ITM
  4 domains, 10 nodes per domain, 3 stubs per node
  50 ms, 10 ms, 2 ms latency
  10 Mbps inbound links
Similar trends as the fully connected topology
  Slightly longer end-to-end delays