Decentralized Location Services


Tapestry: Scalable and Fault-tolerant Routing and Location
Stanford Networking Seminar, October 2001
Ben Y. Zhao
[email protected]
Challenges in the Wide-area
Trends:
- Exponential growth in CPU, storage
- Network expanding in reach and bandwidth
Can applications leverage these new resources?
- Scalability: increasing users, requests, traffic
- Resilience: more components → lower MTBF
- Management: intermittent resource availability → complex management schemes
Proposal: an infrastructure that solves these issues and passes the benefits on to applications
Driving Applications
Leverage cheap and plentiful resources: CPU cycles, storage, network bandwidth
Global applications share distributed resources:
- Shared computation: SETI, Entropia
- Shared storage: OceanStore, Gnutella, Scale-8
- Shared bandwidth: application-level multicast, content distribution networks
Key: Location and Routing
Hard problem: locating and messaging to resources and data
Goals for a wide-area overlay infrastructure:
- Easy to deploy
- Scalable to millions of nodes, billions of objects
- Available in the presence of routine faults
- Self-configuring, adaptive to network changes
- Localize effects of operations/failures
Talk Outline
Motivation
Tapestry overview
Fault-tolerant operation
Deployment / evaluation
Related / ongoing work
What is Tapestry?
A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure
(Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley)
Network layer of OceanStore
Routing: suffix-based hypercube
- Similar to Plaxton, Rajaraman, Richa (SPAA '97)
Decentralized location:
- Virtual hierarchy per object with cached location references
Core API:
- publishObject(ObjectID, [serverID])
- routeMsgToObject(ObjectID)
- routeMsgToNode(NodeID)
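
As a rough illustration of this API's shape, here is a hedged Java sketch. The deck names the three operations but not their signatures, so the interface name, parameter types (hex-string IDs, byte[] payloads), and comments below are assumptions, not the actual OceanStore/Tapestry code.

// Hypothetical sketch of the Tapestry core API as a Java interface.
public interface TapestryApi {

    // Announce that serverId stores a replica of the object; location
    // pointers are cached along the route toward the object's root.
    void publishObject(String objectId, String serverId);

    // Route msg toward some server that has published objectId, stopping at
    // the first hop holding a cached location pointer.
    void routeMsgToObject(String objectId, byte[] msg);

    // Route msg to the live node whose ID most closely matches nodeId.
    void routeMsgToNode(String nodeId, byte[] msg);
}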
Routing and Location
Namespace (nodes and objects):
- 160 bits → 2^80 names before name collision
- Each object has its own hierarchy rooted at its Root
- f(ObjectID) = RootID, via a dynamic mapping function
Suffix routing from A to B:
- At the hth hop, arrive at the nearest node hop(h) such that hop(h) shares a suffix of length h digits with B
- Example: 5324 routes to 0629 via 5324 → 2349 → 1429 → 7629 → 0629 (see the sketch below)
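
A minimal sketch of this digit-at-a-time rule, assuming a toy global list of node IDs in place of real per-node neighbor maps; the class and method names are illustrative only, and the node set is chosen so a next hop always exists.

import java.util.List;

// Toy illustration of suffix routing (not the real Tapestry neighbor-map code):
// at each hop, pick any known node whose ID shares one more trailing digit
// with the destination than the current node does.
public class SuffixRoutingSketch {

    static int sharedSuffixLength(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length()
                && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    // Returns a node from 'known' that matches the destination suffix of
    // length (matched + 1); assumes such a node exists in this toy set.
    static String nextHop(String dest, int matched, List<String> known) {
        for (String candidate : known) {
            if (sharedSuffixLength(candidate, dest) >= matched + 1) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical global view of nodes, standing in for per-node tables.
        List<String> nodes = List.of("5324", "2349", "1429", "7629", "0629");
        String current = "5324", dest = "0629";
        int matched = sharedSuffixLength(current, dest);
        while (!current.equals(dest)) {
            current = nextHop(dest, matched, nodes);   // resolve one more digit
            matched++;
            System.out.println(" -> " + current);
        }
        // Prints the example route: 2349, 1429, 7629, 0629
    }
}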
Object location:
- The root is responsible for storing the object's location
- Publish and search both route incrementally toward the root
Publish / Lookup
Publish object with ObjectID:
// route toward the "virtual root", whose ID = ObjectID
For (i = 0; i < Log2(N); i += j) {   // define hierarchy; j = # of bits per digit (e.g. j = 4 for hex digits)
  Insert a location entry into the nearest node that matches ObjectID on its last i bits
  If no match is found, deterministically choose an alternative node
  // the real root node is reached when no external routes toward the root remain
}
Lookup object with ObjectID:
// traverse the same path toward the root as publish, but search for an entry at each node
For (i = 0; i < Log2(N); i += j) {
  Search the node for a cached object location entry
  Once found, route to the object via IP or Tapestry
}
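
The same publish/lookup walk as a small self-contained Java sketch; the node names, the fixed path used as a stand-in for Tapestry routing toward the root, and the per-node pointer maps are assumptions for illustration only.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative publish/lookup over a toy overlay. Each node on the path from
// the publisher toward the object's root caches a location pointer; a lookup
// walks the same path and stops at the first cached pointer it finds.
public class PublishLookupSketch {

    // nodeId -> (objectId -> serverId holding a replica)
    static Map<String, Map<String, String>> pointers = new HashMap<>();

    // Stand-in for Tapestry routing: the sequence of nodes from 'start'
    // toward the node acting as the object's root (here, chosen trivially).
    static List<String> pathTowardRoot(String start, String objectId) {
        return List.of(start, "node-B", "node-C", "root-" + objectId);
    }

    static void publish(String objectId, String serverId, String publisher) {
        for (String node : pathTowardRoot(publisher, objectId)) {
            pointers.computeIfAbsent(node, k -> new HashMap<>())
                    .put(objectId, serverId);       // cache a pointer at each hop
        }
    }

    static String lookup(String objectId, String client) {
        for (String node : pathTowardRoot(client, objectId)) {
            Map<String, String> cache = pointers.get(node);
            if (cache != null && cache.containsKey(objectId)) {
                return cache.get(objectId);         // first cached hit wins
            }
        }
        return null;                                // not published
    }

    public static void main(String[] args) {
        publish("7510", "server-A", "5712");
        System.out.println(lookup("7510", "0880")); // prints server-A
    }
}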
Tapestry Mesh
Incremental suffix-based routing

[Figure: Tapestry mesh. Nodes 0x79FE, 0x23FE, 0x993E, 0x035E, 0x43FE, 0x73FE, 0x44FE, 0xF990, 0x555E, 0x73FF, 0xABFE, 0x04FE, 0x13FE, 0x9990, 0x423E, 0x239E, and 0x1290 are connected by overlay links labeled with their routing level (1-4); each successive hop resolves one more suffix digit.]
Routing in Detail
Example: octal digits, 2^12 namespace, routing 5712 → 7510

Neighbor map for node 5712 (octal), one routing level per digit position; "x"
marks a wildcard digit, and slots that 5712 fills itself hold its own ID:

  Level 1 (last digit):    xxx0  xxx1  xxx2  xxx3  xxx4  xxx5  xxx6  xxx7
  Level 2 (suffix "2"):    xx02  xx12  xx22  xx32  xx42  xx52  xx62  xx72
  Level 3 (suffix "12"):   x012  x112  x212  x312  x412  x512  x612  x712
  Level 4 (suffix "712"):  0712  1712  2712  3712  4712  5712  6712  7712

Routing resolves one destination digit per hop, right to left:
  5712 → 0880 (matches ...0) → 3210 (matches ...10) → 4510 (matches ...510) → 7510
Object Location
Randomization and Locality
Talk Outline
Motivation
Tapestry overview
Fault-tolerant operation
Deployment / evaluation
Related / ongoing work
Fault-tolerant Location
Minimized soft-state vs. explicit fault-recovery
Redundant roots:
- Object names hashed with small salts → multiple names/roots per object
- Queries and publishing use all roots in parallel
- P(finding a reference under partition) = 1 - (1/2)^n, where n = # of roots (see the sketch below)
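
A small sketch of the salted-roots idea under stated assumptions: the salt scheme, the use of SHA-1 to derive 160-bit root IDs, and the independence assumption behind 1 - (1/2)^n are illustrative, not details taken from the Tapestry implementation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: derive several root IDs for one object by hashing the
// object name with small salts, and compute the availability estimate
// 1 - (1/2)^n, assuming each root independently survives a partition with
// probability 1/2.
public class SaltedRootsSketch {

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    static List<String> rootIds(String objectName, int numRoots) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");   // 160-bit IDs
        List<String> roots = new ArrayList<>();
        for (int salt = 0; salt < numRoots; salt++) {
            byte[] digest = sha1.digest(
                    (objectName + ":" + salt).getBytes(StandardCharsets.UTF_8));
            roots.add(toHex(digest));
        }
        return roots;
    }

    public static void main(String[] args) throws Exception {
        int n = 4;                                   // number of salted roots
        System.out.println(rootIds("my-object", n)); // four distinct root IDs
        double pFind = 1.0 - Math.pow(0.5, n);       // = 0.9375 for n = 4
        System.out.println("P(find reference under partition) = " + pFind);
    }
}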
Soft-state periodic republish:
- 50 million files/node, daily republish, b = 16, N = 2^160, 40 B/msg
- Worst-case update traffic: 156 kb/s
- Expected traffic with 2^40 real nodes: 39 kb/s
Fault-tolerant Routing
Strategy:
- Detect failures via soft-state probe packets
- Route around a problematic hop via backup pointers
Handling:
- 3 forward pointers per outgoing route (primary + 2 backups)
- Second-chance algorithm for intermittent failures
- Promote backup pointers and replace failed primaries (see the sketch after this list)
Protocols:
- First Reachable Link Selection (FRLS)
- Proactive Duplicate Packet Routing
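
A rough sketch of a route entry with a primary plus two backups and a second-chance policy; the two-miss threshold, the demotion behavior, and the names are assumptions for illustration, not Tapestry's exact algorithm.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative route entry with a primary next hop and two backups.
// A single missed probe gives the primary a "second chance"; a second
// consecutive miss demotes it and promotes the first backup.
public class RouteEntrySketch {

    private final Deque<String> nextHops = new ArrayDeque<>(); // primary first
    private int consecutiveMisses = 0;

    public RouteEntrySketch(String primary, String backup1, String backup2) {
        nextHops.addLast(primary);
        nextHops.addLast(backup1);
        nextHops.addLast(backup2);
    }

    /** Called for each periodic probe of the current primary. */
    public void onProbe(boolean primaryResponded) {
        if (primaryResponded) {
            consecutiveMisses = 0;               // healthy again
            return;
        }
        consecutiveMisses++;
        if (consecutiveMisses >= 2) {            // second chance exhausted
            String demoted = nextHops.pollFirst();
            nextHops.addLast(demoted);           // keep as last-resort backup
            consecutiveMisses = 0;
        }
    }

    /** Next hop used when forwarding traffic for this route. */
    public String forwardingHop() {
        return nextHops.peekFirst();
    }

    public static void main(String[] args) {
        RouteEntrySketch entry = new RouteEntrySketch("0x43FE", "0x73FE", "0x23FE");
        entry.onProbe(false);                      // first miss: second chance
        System.out.println(entry.forwardingHop()); // still 0x43FE
        entry.onProbe(false);                      // second miss: promote backup
        System.out.println(entry.forwardingHop()); // now 0x73FE
    }
}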
Summary
Decentralized location and routing infrastructure
- Core routing similar to PRR97
- Distributed algorithms for object-root mapping, node insertion / deletion
- Fault handling with redundancy, soft-state beacons, self-repair
- Decentralized and scalable, with locality
Analytical properties:
- Per-node routing table size: b · log_b(N), where N = size of the namespace
- Find an object in log_b(n) overlay hops, where n = # of physical nodes (worked example below)
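
As a quick worked check on these formulas, here is a tiny calculation assuming the talk's parameters b = 16 and a 160-bit namespace; the physical node count used for the hop estimate is an arbitrary example.

// Worked example of the analytical properties, using b = 16 (hex digits) and a
// 160-bit namespace; the physical node count n is an arbitrary illustration.
public class TapestryCostSketch {
    public static void main(String[] args) {
        int b = 16;                       // digit base
        int namespaceBits = 160;          // N = 2^160
        int digits = namespaceBits / 4;   // log_b(N) = 40 hex digits
        long routingTableEntries = (long) b * digits;
        System.out.println("Routing table size: " + routingTableEntries + " entries"); // 640

        long n = 1L << 20;                // example: about one million physical nodes
        double hops = Math.log(n) / Math.log(b);
        System.out.printf("Expected overlay hops: about %.0f%n", hops);        // log_16(2^20) = 5
    }
}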
Talk Outline
Motivation
Tapestry overview
Fault-tolerant operation
Deployment / evaluation
Related / ongoing work
Deployment Status
Java implementation in OceanStore
- Running static Tapestry
- Deploying dynamic Tapestry with fault-tolerant routing
Packet-level simulator
- Delay measured in network hops
- No cross traffic or queuing delays
- Topologies: AS, MBone, GT-ITM, TIERS
ns2 simulations
Evaluation Results
Cached object pointers
- Efficient lookup for nearby objects
- Reasonable storage overhead
Multiple object roots
- Improves availability under attack
- Improves performance and performance stability
Reliable packet delivery
- Redundant pointers approximate optimal reachability
- FRLS, a simple fault-tolerant UDP protocol
First Reachable Link Selection
Use periodic UDP packets to gauge link condition
Packets routed to the shortest "good" link
Assumes IP cannot correct its routing table in time for packet delivery
[Figure: nodes A-E connected by both IP and Tapestry paths; when no IP path exists to the destination, Tapestry delivers the packet over an alternative overlay link.]
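
A minimal sketch of FRLS-style selection under assumed bookkeeping: each outgoing link, kept in preference order, remembers whether its most recent probe was answered, and traffic goes to the first link still marked good. The structure and names are illustrative, not the actual protocol state.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative FRLS-style selection: links are kept in order of preference
// (shortest first), each link remembers whether its latest UDP probe was
// acknowledged, and traffic goes to the first link still considered "good".
public class FrlsSketch {

    // link name -> true if the latest probe round was acknowledged
    private final Map<String, Boolean> linkGood = new LinkedHashMap<>();

    public void addLink(String link) {
        linkGood.put(link, true);                 // optimistically good
    }

    /** Record the outcome of one periodic probe on a link. */
    public void onProbeResult(String link, boolean acked) {
        linkGood.put(link, acked);
    }

    /** First link, in preference order, whose last probe succeeded. */
    public String selectLink() {
        for (Map.Entry<String, Boolean> e : linkGood.entrySet()) {
            if (e.getValue()) {
                return e.getKey();
            }
        }
        return null;                              // no reachable link
    }

    public static void main(String[] args) {
        FrlsSketch frls = new FrlsSketch();
        frls.addLink("primary");
        frls.addLink("backup-1");
        frls.addLink("backup-2");
        System.out.println(frls.selectLink());    // primary
        frls.onProbeResult("primary", false);     // probe loss detected
        System.out.println(frls.selectLink());    // backup-1
    }
}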
Talk Outline
Motivation
Tapestry overview
Fault-tolerant operation
Deployment / evaluation
Related / ongoing work
Bayeux
Global-scale application-level multicast (NOSSDAV 2001)
Scalability:
- Scales to > 10^5 nodes
- Self-forming member group partitions
Fault tolerance:
- Multicast root replication
- FRLS for resilient packet delivery
More optimizations:
- Group ID clustering for better bandwidth utilization
Bayeux: Multicast
[Figure: Bayeux multicast example. Packets flow from the multicast root along Tapestry routes, through intermediate nodes such as 79FE, 23FE, 993E, 43FE, F990, and 793E, toward the receivers.]
Bayeux: Tree Partitioning
[Figure: Bayeux tree partitioning. Receivers send JOIN messages that Tapestry routes toward one of several replicated multicast roots, partitioning the receiver set across the roots.]
Overlay Routing Networks
CAN: Ratnasamy et al. (ACIRI / UCB)
- Uses a d-dimensional coordinate space to implement a distributed hash table
- Routes to the neighbor closest to the destination coordinate
- Pros: fast insertion / deletion; constant-sized routing state
- Cons: unconstrained # of hops; overlay distance not proportional to physical distance

Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
- Linear namespace modeled as a circular address space
- "Finger table" points to a logarithmic # of remote hosts
- Pros: simplicity in algorithms; fast fault-recovery; log2(N) hops and routing state
- Cons: overlay distance not proportional to physical distance

Pastry: Rowstron and Druschel (Microsoft / Rice)
- Hypercube routing similar to PRR97
- Objects replicated to servers by name
- Pros: fast fault-recovery; log(N) hops and routing state
- Cons: data replication required for fault-tolerance
Ongoing Research
Fault-tolerant routing:
- Reliable Overlay Networks (MIT)
- Fault-tolerant Overlay Routing (UCB)
Application-level multicast:
- Bayeux (UCB), CAN (AT&T), Scribe and Herald (Microsoft)
File systems:
- OceanStore (UCB)
- PAST (Microsoft / Rice)
- Cooperative File System (MIT)
For More Information
Tapestry:
http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore:
http://oceanstore.cs.berkeley.edu
Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications
[email protected]