Pastry
Peter Druschel, Rice University
Antony Rowstron, Microsoft Research UK
Some slides are borrowed from the original presentation by the authors
Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions
Common issues
• Organize, maintain overlay network
• Resource allocation / load balancing
• Object / resource location
• Network proximity routing
Pastry provides a generic p2p substrate
Architecture
[Figure: layered architecture. P2P applications such as event notification and network storage sit on the p2p application layer; below them is the p2p substrate (Pastry, a self-organizing overlay network), which runs over TCP/IP on the Internet.]
Pastry: Object distribution
Consistent hashing [Karger et al. '97]
128-bit circular id space: 0 to 2^128 - 1
• nodeIds: uniform random
• objIds: uniform random
Invariant: the node with the numerically closest nodeId maintains the object (recall Chord).
[Figure: objIds and nodeIds placed on the circular id space.]
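A minimal sketch of this invariant, assuming nodeIds and objIds are plain Python integers in the 128-bit circular space (helper names are illustrative, not from the paper):

```python
ID_SPACE = 2 ** 128          # ids live in [0, 2^128 - 1]

def circular_distance(a: int, b: int) -> int:
    """Shortest distance between two ids on the circular id space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)

def responsible_node(obj_id: int, node_ids: list[int]) -> int:
    """nodeId numerically closest to obj_id; that node maintains the object."""
    return min(node_ids, key=lambda n: circular_distance(n, obj_id))
```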
Routing table for node 65a1fc (b = 4)
[Figure: log16 N rows of routing table entries, one column per hex digit in each row, shown together with the node's leaf set.]
Pastry: Leaf sets
Each node maintains IP addresses of the
nodes with the L/2 numerically closest
larger and smaller nodeIds, respectively.
• routing efficiency/robustness
• fault detection (keep-alive)
• application-specific local coordination
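A hedged sketch of how this state might be computed from the set of nodeIds a node currently knows, using integer ids as in the earlier sketch (the layout and names are illustrative):

```python
# Sketch: the leaf set holds the L/2 numerically closest smaller and L/2
# numerically closest larger nodeIds, wrapping around the circular id space.

def leaf_set(local_id: int, known_ids: list[int], l: int = 16) -> dict:
    others = sorted(set(known_ids) - {local_id})
    # nodeIds in increasing order starting just above local_id (with wrap) ...
    larger = [i for i in others if i > local_id] + [i for i in others if i < local_id]
    # ... and in decreasing order starting just below local_id (with wrap).
    smaller = list(reversed([i for i in others if i < local_id])) + \
              list(reversed([i for i in others if i > local_id]))
    return {"smaller": smaller[:l // 2], "larger": larger[:l // 2]}
```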
Pastry: Routing
[Figure: prefix routing of Route(d46a1c) from node 65a1fc via d13da3, d4213f, d462ba, d467c4; each hop shares a longer nodeId prefix with the key d46a1c.]
Properties: log16 N steps, O(log N) state
Pastry: Routing procedure
if destination D is within the range of the leaf set
then forward to the numerically closest leaf set member
else
  let l = length of the prefix shared with D
  let d = value of the l-th digit of D's address
  if R[l][d] exists (the entry at row l, column d of the routing table)
  then forward to R[l][d]
  else {rare case} forward to a known node that
    (a) shares at least as long a prefix with D, and
    (b) is numerically closer to D than this node
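The same procedure as a hedged Python sketch, assuming b = 4 (ids as 32-character hex strings), the routing table as a dict keyed by (row, digit), and the leaf set as a flat list of ids; wrap-around at the ends of the id space is ignored for brevity:

```python
B = 4                        # bits per digit -> hex digits
DIGITS = 128 // B            # 32 digits per id

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits shared by two 32-digit id strings."""
    n = 0
    while n < DIGITS and a[n] == b[n]:
        n += 1
    return n

def route_next_hop(local_id, key, leaf_set, table, known_nodes):
    """One routing step: return the nodeId the message is forwarded to;
    returning local_id means this node delivers the message itself."""
    num = lambda x: int(x, 16)

    # Case 1: the key lies within the range of the leaf set -> deliver to the
    # numerically closest node among the leaf set and the local node.
    if leaf_set and min(map(num, leaf_set)) <= num(key) <= max(map(num, leaf_set)):
        return min(leaf_set + [local_id], key=lambda n: abs(num(n) - num(key)))

    # Case 2: use the routing table entry at row l (shared prefix length),
    # column d (the next digit of the key).
    l = shared_prefix_len(local_id, key)
    if l == DIGITS:                       # key equals the local id
        return local_id
    d = key[l]
    entry = table.get((l, d))
    if entry is not None:
        return entry

    # Rare case: forward to any known node that shares at least as long a
    # prefix with the key and is numerically closer to it than this node.
    for n in known_nodes:
        if (shared_prefix_len(n, key) >= l and
                abs(num(n) - num(key)) < abs(num(local_id) - num(key))):
            return n
    return local_id                       # no better node known
```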
Pastry: Performance
Integrity of overlay / message delivery:
• guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
• no failures: < log16 N expected, 128/b + 1 maximum
• during failure recovery: O(N) worst case (a loose upper bound; the average case is much better)
Self-organization
How are the routing tables and leaf sets
initialized and maintained?
• Node addition
• Node departure (failure)
Pastry: Node addition
New node: d46a1c. The new node X asks node 65a1fc to route a message to X's own nodeId; the nodes along the route share their routing tables with X.
[Figure: Route(d46a1c) from 65a1fc via d13da3, d4213f, d462ba toward d467c4, as in the routing figure.]
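A hedged sketch of how X might seed its routing table from those replies, reusing the dict-keyed table layout of the routing sketch above; the shape of `route_nodes` is assumed for illustration:

```python
# Sketch: row i of the new node's table comes from the i-th node on the join
# route, since that node shares (at least) the first i digits with the new id.

def build_routing_table(x_id: str, route_nodes: list) -> dict:
    """route_nodes: [(node_id, its_table), ...] in the order the join message
    visited them, starting at the bootstrap node (which contributes row 0)."""
    table = {}
    for row, (_node_id, node_table) in enumerate(route_nodes):
        for (r, digit), entry in node_table.items():
            if r == row and (row, digit) not in table:
                table[(row, digit)] = entry
    return table
```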
Node departure (failure)
Leaf set members exchange heartbeat messages.
• Leaf set repair (eager): request the set from the farthest live node on the failed node's side
• Routing table repair (lazy): get replacement entries from peers in the same row, then from higher rows
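A hedged sketch of the eager leaf-set repair step, reusing the {"smaller", "larger"} layout from the leaf-set sketch; `fetch_leaf_set` stands in for whatever remote query is actually used, and the merge logic is deliberately simplified:

```python
# Sketch: on detecting a failed leaf-set member, drop it, ask the farthest
# live node on the same side for its leaf set, and refill from that reply.

def repair_leaf_set(local_id, leaf, failed_id, fetch_leaf_set, l=16):
    side = "smaller" if failed_id in leaf["smaller"] else "larger"
    leaf[side] = [n for n in leaf[side] if n != failed_id]
    if leaf[side]:
        donor = leaf[side][-1]            # farthest live node on that side
        # The donor's leaf set is relative to its own id; taking its same-side
        # members is a simplification that extends ours in the right direction.
        for cand in fetch_leaf_set(donor)[side]:
            if cand != local_id and cand not in leaf[side] and len(leaf[side]) < l // 2:
                leaf[side].append(cand)
    return leaf
```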
Pastry: Average # of hops
[Plot: average number of hops vs. number of nodes (1,000 to 100,000); Pastry compared with log(N). L = 16, 100k random queries.]
Pastry: Proximity routing
Proximity metric = time delay, estimated by ping; a node can probe its distance to any other node.
Each routing table entry uses a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
Pastry: Routes in proximity space
[Figure: the route Route(d46a1c) from 65a1fc via d13da3, d4213f, d462ba, d467c4 shown both in the proximity space and in the nodeId space.]
Pastry: Proximity routing
Assumption: scalar proximity metric
• e.g. ping delay, # IP hops
• a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
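A hedged sketch of that invariant, with ids as hex strings (as in the routing sketch) and `distance` standing in for the proximity probe; names are illustrative:

```python
# Sketch: for a routing table slot (row, digit), keep the proximally closest
# node among all known nodes whose ids carry the required prefix.

def best_entry_for_slot(local_id, row, digit, known_nodes, distance):
    """`distance(node_id)` is whatever proximity probe is used (e.g. ping delay)."""
    required_prefix = local_id[:row] + digit
    candidates = [n for n in known_nodes if n.startswith(required_prefix)]
    return min(candidates, key=distance) if candidates else None
```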
Pastry: Distance traveled
[Plot: relative distance traveled vs. number of nodes (1,000 to 100,000); Pastry compared with a complete routing table. L = 16, 100k random queries, Euclidean proximity space.]
PAST: File storage
Storage invariant: file "replicas" are stored on the k nodes with nodeIds closest to the fileId (k is bounded by the leaf set size).
[Figure: Insert(fileId) with k = 4 replicas placed on the nodes closest to the fileId.]
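A minimal sketch of the invariant, reusing `circular_distance` from the object-distribution sketch; integer ids and the helper name are illustrative:

```python
# Sketch: PAST places the k file replicas on the k nodes whose nodeIds are
# numerically closest to the fileId (k bounded by the leaf set size).

def replica_nodes(file_id: int, node_ids: list[int], k: int = 4) -> list[int]:
    return sorted(node_ids, key=lambda n: circular_distance(n, file_id))[:k]
```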
PAST: File Retrieval
Lookup(fileId):
• the file is located in log16 N steps (expected)
• the lookup usually locates the replica nearest to the client C
[Figure: client C retrieving one of the k replicas.]
SCRIBE: Large-scale,
decentralized multicast
• Infrastructure to support topic-based publish-subscribe applications
• Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
• Efficient: low delay, low link stress, low node overhead
SCRIBE: Large-scale multicast
[Figure: Publish(topicId) and Subscribe(topicId) messages converging on the topicId in the id space.]
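The arrows in the figure can be sketched in code under the usual Scribe construction: a Subscribe is routed toward the node whose nodeId is closest to the topicId, each node on the path records the previous hop as a child, and a later Publish is pushed down the resulting tree. Class, method, and message names below are illustrative:

```python
# Sketch: per-node Scribe state built on top of a Pastry routing step.

class ScribeNode:
    def __init__(self, node_id, route_toward):
        self.node_id = node_id
        self.route_toward = route_toward   # key -> next-hop nodeId (Pastry route)
        self.children = {}                 # topic_id -> set of child nodeIds

    def on_subscribe(self, topic_id, child_id):
        """Handle a Subscribe for topic_id arriving from child_id; return the
        (message, next_hop) to send, or None if the subscription stops here."""
        first_time = topic_id not in self.children
        self.children.setdefault(topic_id, set()).add(child_id)
        next_hop = self.route_toward(topic_id)
        # Propagate toward the topic's root only the first time this node
        # joins the tree; otherwise just record the new child locally.
        if first_time and next_hop != self.node_id:
            return ("Subscribe", next_hop)
        return None

    def on_publish(self, topic_id, msg, send):
        """Forward a published message down the tree (invoked at the root first)."""
        for child in self.children.get(topic_id, ()):
            send(child, topic_id, msg)
```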
Scribe: Results
• Simulation results
• Comparison with IP multicast: delay, node stress, and link stress
• Experimental setup
  – 100,000 nodes randomly selected out of 0.5M
  – Zipf-like subscription distribution, 1,500 topics
Summary
Self-configuring P2P framework for topic-based
publish-subscribe
• Scribe achieves reasonable performance when
compared to IP multicast
– Scales to a large number of subscribers
– Scales to a large number of topics
– Good distribution of load