Tapestry: A Resilient Global-Scale Overlay for Service Deployment

Zhao, Huang, Stribling, Rhea, Joseph, Kubiatowicz
Presented by Rebecca Longmuir
What is Tapestry?

A peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources.

An extensible infrastructure that provides decentralized object location and routing (DOLR)
What is DOLR?

An interface focused on routing of messages to endpoints such as nodes or object replicas.

Virtualizes resources, since endpoints are named by opaque identifiers encoding nothing about physical location.

Allows for message delivery over an unstable underlying infrastructure; developers only need to think about the dynamics of the network for optimization.
Some background

Nodes participate in the overlay and are
assigned nodeIDs uniformly at random from a
large identifier space.

More than one node may be on one physical host.

Application-specific endpoints are assigned globally unique identifiers (GUIDs) selected from the same identifier space.
More background

Node N has nodeID Nid and an object O
has GUID OG.

Every message contains an application-specific identifier Aid.
Four-part DOLR networking API

PUBLISHOBJECT(OG, Aid): Publish, or make available, object O on the local node. This is a best-effort call and receives no confirmation.

UNPUBLISHOBJECT(OG, Aid): Best-effort attempt to remove location mappings for O.

ROUTETOOBJECT(OG, Aid): Routes a message to the location of an object with GUID OG.

ROUTETONODE(N, Aid, Exact): Routes a message to application Aid on node N. “Exact” specifies whether the destination ID must be matched exactly to deliver the payload.
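
A minimal sketch of what this four-part interface could look like (Python, used purely for exposition; the class and method names are assumptions, not the actual Tapestry implementation):

    # Illustrative sketch of the four-part DOLR API; names and types are assumed.
    class DOLR:
        def publish_object(self, object_guid, app_id):
            """Best effort: advertise that the object is available on the local node; no confirmation."""

        def unpublish_object(self, object_guid, app_id):
            """Best effort: remove location mappings for the object."""

        def route_to_object(self, object_guid, app_id, msg):
            """Route msg to a (typically nearby) replica of the object with this GUID."""

        def route_to_node(self, node_id, app_id, exact, msg):
            """Route msg to application app_id on node_id; 'exact' requires an exact ID match."""
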
Routing and Object Location

Tapestry dynamically maps each identifier G to a unique live node called the identifier's root, GR

If a node N exists with Nid = G, then N is the root of G

Messages are delivered using a routing table consisting of the nodeIDs and IP addresses of the nodes a node communicates with, referred to as its neighbors.

Routing involves forwarding messages across neighbor links to nodes whose nodeIDs are progressively closer to the destination ID in the identifier space.
Routing Mesh

Tapestry uses local tables at each node,
called neighbor maps, to route overlay
messages to the destination ID digit by digit

A node N has a neighbor map with multiple
levels, where each level contains links to
nodes matching a prefix up to a digit position
in the ID and contains a number of entries
equal to the ID’s base
Routing Mesh

The primary ith entry in the jth level is the ID and location of the closest node whose ID begins with prefix(N, j-1) + i; for example, the ninth entry of the fourth level for node 325AE is the closest node with an ID beginning with 3259.

It is this notion of the closest node that provides the locality properties of Tapestry.
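
A rough sketch of that neighbor-map layout, using the slide's 325AE example (Python; hexadecimal IDs with base B = 16 and 0-based level indexing are conventions chosen here, not Tapestry's code):

    # Sketch of a neighbor map for node 325AE (hex digits, base B = 16).
    # neighbor_map[j][d] holds the ID and address of the closest node whose ID
    # starts with this node's first j digits followed by digit d (None if unknown).
    BASE = 16
    my_id = "325AE"
    neighbor_map = [[None] * BASE for _ in range(len(my_id))]

    def required_prefix(level, digit):
        """The ID prefix an entry at (level, digit) must match."""
        return my_id[:level] + format(digit, "X")

    # The slide's example: the ninth entry of the fourth level (0-based: level 3,
    # digit 9) points to the closest node whose ID begins with "3259".
    assert required_prefix(3, 9) == "3259"
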
Routing Mesh

Since the nth hop shares a prefix of
length >=n with the destination
node, Tapestry looks in the (n+1)th
level map for the entry matching the
next digit in the destination ID

This guarantees that any existing node is reached in at most logB N logical hops, where N is the namespace size, IDs are of base B, and neighbor maps are assumed consistent.

If a digit cannot be matched, then Tapestry looks for a “close” digit using surrogate routing, whereby each nonexistent ID is mapped to some live node with a similar ID.
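
A simplified sketch of this digit-by-digit next-hop selection, including a crude surrogate fallback (Python; the fallback rule shown, trying successively higher digits, is a simplification of Tapestry's actual surrogate routing):

    # Sketch: pick the next hop toward dest_id, given a neighbor_map like the one above.
    BASE = 16

    def next_hop(neighbor_map, my_id, dest_id):
        # n = length of the prefix my_id already shares with dest_id
        n = 0
        while n < len(my_id) and my_id[n] == dest_id[n]:
            n += 1
        if n == len(my_id):
            return None                      # IDs match: this node is the destination (or root)
        wanted = int(dest_id[n], BASE)
        # Try the exact next digit first; if that entry is empty, fall back to
        # nearby digits (surrogate routing, heavily simplified here).
        for offset in range(BASE):
            entry = neighbor_map[n][(wanted + offset) % BASE]
            if entry is not None:
                return entry
        return None                          # no neighbor at this level: act as the surrogate root

    # Each hop resolves at least one more digit, so an existing node is reached
    # in at most log_B(N) hops when neighbor maps are consistent.
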
Providing Resilience in the Routing Mesh

Make use of redundant routing paths

Each primary neighbor is augmented by backup links sharing the same prefix

There are c × B pointers on a level (c is the number of neighbor links that differ only in the nth digit at the nth routing level)

The total size of the neighbor maps is c × B × logB N

The expected total number of backpointers is c × B × logB N
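
As a rough worked example with illustrative numbers (assuming 160-bit identifiers written as 40 hex digits, so B = 16 and logB N = 40, and c = 3 links per entry, i.e. one primary plus two backups): each level holds c × B = 48 pointers, and both the routing table and the expected number of backpointers come to c × B × logB N = 3 × 16 × 40 = 1920 entries per node.
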
Object Publication and Location

Each root node inherits a unique
spanning tree for routing

This is utilized to locate objects by
sending out soft-state directory
information across nodes
Object Publication and Location

A server S storing an object O periodically publishes the object by routing a publish message toward OR (the object's root).

Each node along the publication path stores a pointer mapping <OG, S>.

For replicas of an object stored on separate servers, each server publishes its copy.

Location mappings for replicas are stored in sorted order of network latency.
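
A simplified sketch of this publish step (Python; the node object, its object_pointers map, and the latency_to()/next_hop_toward() helpers are hypothetical names invented for illustration):

    # Sketch: server publishes object_guid by walking toward the object's root;
    # every node on the path caches a soft-state pointer back to the server.
    def publish(server, object_guid):
        node = server
        while node is not None:
            pointers = node.object_pointers.setdefault(object_guid, [])
            if server not in pointers:
                pointers.append(server)
                # keep replica pointers sorted by network latency from this node
                pointers.sort(key=node.latency_to)
            node = node.next_hop_toward(object_guid)   # None once the root has been reached
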
Object Publication and Location

A client locates an object by routing a message toward OR.

Each node along the path checks whether it has a location mapping for the object.

If a node has a location mapping, it redirects the message to the servicing node S; otherwise, it forwards the message onward toward OR.
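
And the matching location step, continuing the same hypothetical helpers used in the publish sketch above:

    # Sketch: route a query toward the object's root; the first node on the path
    # that holds a location mapping redirects the message to a nearby server.
    def route_to_object(client, object_guid, msg):
        node = client
        while node is not None:
            pointers = node.object_pointers.get(object_guid)
            if pointers:
                nearest_server = pointers[0]        # pointers are kept sorted by latency
                nearest_server.deliver(object_guid, msg)
                return True
            node = node.next_hop_toward(object_guid)
        return False                                # reached the root without finding a mapping
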
Object Publication and Location

Each hop toward the root reduces the number of
nodes satisfying the next hop prefix constraint by a
factor of the identifier base

The path to the root is a function of the destination ID only, not the source ID

The closer a query is to the object, the more likely its path is to intersect the object's publication path, and the sooner it will reach the object
Dynamic Node Algorithms

There are many mechanisms in place to
maintain consistency and to make sure
objects are not lost

Many of the control messages require
acknowledgments and are retransmitted
as needed
Node Insertion (four components)

Need-to-know nodes are notified of N, because N fills
a null entry in their routing tables

N might become the new object root for existing
objects. References to those objects must be moved
to N to maintain object availability

The algorithms must construct a near-optimal routing table for N

Nodes near N are notified and may consider using N
in their routing tables as an optimization
Node Insertion

Begins at N’s surrogate S (the “root” node that Nid
maps to in the existing network)

S determines the length p of the longest prefix it shares with Nid

S sends out an acknowledged multicast message to all existing nodes sharing the same prefix

The nodes that get the message add N to their
routing tables and transfer references of locally
rooted pointers as needed
Node Insertion

N’s initial neighbor set for its routing table is the set of nodes reached by the multicast

Beginning at level p, N does an iterative nearest neighbor search

N uses the neighbor set to fill level p, trims this set to size k, and requests that the k nodes send their backpointers

N decrements p and repeats the process until all
levels are filled
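
A very rough sketch of this iterative fill, from level p down to level 0 (Python; the helper names, the trim size k, and latency_to() are assumptions made for illustration):

    # Sketch: the joining node fills routing-table levels from p downward, using
    # each level's candidate set (plus backpointers of the k nearest) for the next level.
    def fill_routing_table(new_node, multicast_neighbors, p, k=16):
        candidates = set(multicast_neighbors)                # nodes reached by the multicast
        for level in range(p, -1, -1):
            new_node.fill_level(level, candidates)           # pick the closest node per digit slot
            nearest_k = sorted(candidates, key=new_node.latency_to)[:k]
            candidates = set(nearest_k)
            for neighbor in nearest_k:
                candidates.update(neighbor.backpointers())   # widen the search for lower levels
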
Multiple nodes inserting at once

Every node A in the multicast keeps track of every
node B that is still multicasting down to its neighbors

This lets any node C that is multicasting to A know of
B’s existence

The multicast also keeps track of the holes in the new node's routing table, and nodes check their tables for any entries that can fill these holes
Voluntary Node Deletion

When a node N leaves, it tells the set D of nodes that hold backpointers to it that it is leaving, along with replacement nodes at each level taken from N's routing table

The notified nodes send republish-object traffic to both N and its replacement

N routes references to locally rooted objects to their new roots and lets D know when it is done
Involuntary Node Deletion

Tapestry improves object availability and routing in failure-prone networks by building redundancy into the routing tables and object location references

Nodes send periodic beacons to detect outgoing link and node failures.

When these problems are noticed, repair of the routing mesh starts, and redistribution and replication of the object location references begins

This is also helped by soft-state republishing of object
references
Component Architecture

Transport layer provides the
abstraction of communication
channels from one overlay node to
another

Neighbor Link provides secure but
unreliable datagram facilities to
layers above, including the
fragmentation and reassembly of
large messages

Neighbor links also provide fault detection through keep-alive messages, plus latency and loss rate estimation

Neighbor Link optimizes message
processing by parsing the message
headers and only deserializing the
message contents when required
Component Architecture

The router implements functionality unique to Tapestry.

This layer includes the routing table and local object pointers.

The router examines the destination GUID of each message it receives and determines the next hop using the routing table and local object pointers.

Messages are passed back to the neighbor link layer for delivery.
Component Architecture

(Figure: flow chart of the object location process.)

Keep in mind that the routing table and object pointer database are constantly changing as the network changes, because of nodes entering or leaving and varying network latency.
Tapestry Upcall Interface

To support functions that require greater control over details than the DOLR API provides, Tapestry supports an extensible upcall mechanism.

Three primary calls provide the interaction between Tapestry and application handlers (G is a generic ID):
– Deliver(G, Aid, Msg): Invoked on incoming messages destined for the local node
– Forward(G, Aid, Msg): Invoked on incoming upcall-enabled messages
– Route(G, Aid, Msg, NextHopNode): Invoked by the application handler to forward a message on to NextHopNode
Tapestry Upcall Interface

Tapestry sends upcall-enabled messages to the application via Forward(); the handler is then responsible for calling Route() to continue forwarding the message. Finally, Tapestry invokes Deliver() on messages destined for the local node to complete routing.
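
A sketch of how an application handler could sit on top of these three upcalls (Python; the handler class and the next_hop()/route() calls on the tapestry object are illustrative assumptions, only the upcall names come from the slide):

    # Sketch: an application handler driven by the Forward/Route/Deliver upcalls.
    class AppHandler:
        def __init__(self, tapestry, app_id):
            self.tapestry = tapestry
            self.app_id = app_id

        def forward(self, guid, app_id, msg):
            # Called on upcall-enabled messages passing through this node.
            # After inspecting (or modifying) msg, keep it moving via Route().
            next_hop = self.tapestry.next_hop(guid)
            self.tapestry.route(guid, app_id, msg, next_hop)

        def deliver(self, guid, app_id, msg):
            # Called once the message reaches its destination (the local node).
            print("delivered at", guid, ":", msg)
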
Implementation of a Tapestry Node
Tapestry is implemented as an event-driven system for high throughput and scalability.

This requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components.

Implementation of a Tapestry Node

Network stage – a combination of part of the transport layer and part of the neighbor link layer, providing neighbor communication not provided by the operating system. It also works with the Patchwork monitoring stage to measure loss rates and latency.

Core router – uses the routing and object reference tables to handle application-driven messages; it is the critical path for all messages entering or exiting the system.
Implementation of a Tapestry Node

Node membership – responsible for handling the integration of new nodes into the Tapestry mesh and the voluntary exit of nodes.

Mesh repair – responsible for adapting the mesh as the environment changes, including recovering from network failures and updating the routing table as network latencies change.

Patchwork – uses soft-state beacons to probe outgoing links for reliability and performance, allowing Tapestry to respond to failures and changes in topology.
Evaluation

Performed on several platforms

Microbenchmarks were run on a local cluster; the large-scale performance of a deployed Tapestry was measured on the PlanetLab global testbed; and a local network simulation layer was used to support controlled, repeatable experiments.

It should also be noted that, to enable a wider variety of experiments, multiple Tapestry node instances were placed on each physical machine. Each instance shares only code, not data. In some cases this decreases the time to exchange messages, but it is more demanding on the processor, so the system is slowed in other ways.
Performance in a Stable Network

Used microbenchmarks on a network of two nodes to isolate Tapestry's message processing overhead.

The sender establishes a binary network with the receiver and sends 10,001 messages for each message size.

The receiver measures the latency for each size using the interarrival time between the first and last messages.
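
The per-message cost can be recovered from those interarrival times roughly as follows (a sketch with invented variable names, not the actual benchmark code):

    # Sketch: estimate per-message latency and throughput from the arrival times
    # of the first and last of n messages of a given size.
    def per_message_latency(t_first, t_last, n=10_001):
        return (t_last - t_first) / (n - 1)      # seconds per message

    def throughput(t_first, t_last, n=10_001):
        return 1.0 / per_message_latency(t_first, t_last, n)   # messages per second
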
Microbenchmarks on Stable Tapestry

First they eliminated network
delay to measure raw message
processing by placing both
nodes on different ports on the
same machine.

To see how performance scaled with processor speed, the test was performed on different machines.

For very small messages there is a dominant constant processing time. For messages larger than 2 kB, the cost of copying the data dominates and processing time becomes linear.
Microbenchmarks on Stable Tapestry

Measurements of the routing
throughput show that
throughput is low for small
messages where processing
dominates but quickly
increases as the messages
increase in size

For the average 4 kB message, the P-IV can process 7,100 messages/s, whereas the P-III can process 3,200 messages/s.
Routing Overhead to Nodes and Objects

The relative delay penalty (RDP) for node routing is computed by measuring all-pairs round-trip routing latencies between the 400 Tapestry instances used and dividing each by the corresponding ping round-trip time.

Figure 13 shows that the median value of the node-to-node routing RDP starts at ~3 and slowly decreases to ~1.

The object RDP is measured as the ratio of the one-way Tapestry route-to-object latency versus the one-way network latency.

Figure 14 shows the RDP values sorted by their ping values and collected into 5 ms bins, with the 90th percentile and median values calculated per bin.
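
Written out (in notation chosen here), both metrics are simply ratios of overlay latency to raw IP latency:

    RDP_node(A, B)   = RTT_Tapestry(A, B) / RTT_ping(A, B)
    RDP_object(C, O) = T_Tapestry(C -> O) / T_IP(C -> O's server)    (one-way latencies)

so an RDP near 1 means routing through the overlay costs roughly the same as routing directly over IP.
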
Object Location Optimization

This figure demonstrates that optimization can significantly lower the RDP observed by the bulk of all requesters at local-area network distances.

Their technique trades extra storage space in the network for faster routing.
Single Node Insertion

Measured the overhead required for a single node to join the Tapestry network, in terms of the time required for the network to stabilize and the control message bandwidth used during insertion.

Figure 16 shows insertion time as a function of the network size. It shows that latencies scale sublinearly with the size of the network.

For each data point, a Tapestry network of size N is constructed, and a single node is repeatedly inserted and deleted.
Single Node Insertion

Figure 17 shows
that the total
bandwidth for a
single node insertion
scales
logarithmically with
the network size
Parallel Node Insertion

Started with a stable network of 200 nodes, then repeated each parallel insert 20 times.

They plotted the minimum, median, and 90th percentile.

There is significant variation in the 90th percentile, which is attributed to the effects of node virtualization.
Continuous Convergence and Self-Repair

Instead of measuring latency, these tests focused on large-scale behavior under failures.

The routing-to-nodes test measures the success rate of sending requests to random keys in the namespace.

The routing-to-objects test sends messages to previously published objects, located at servers which were guaranteed to stay alive in the network.

Performance metrics include bandwidth and the rate of requests successfully reaching their destinations.
Continuous Convergence and Self-Repair

For both figures, 20% of the existing network is killed, and after 15 minutes new nodes equal to 50% of the existing network are inserted.

Only a small fraction of requests are affected when large portions of the network fail.

Massive failures and insertions lead to a small dip in success rate, which quickly returns to 100%.
Continuous Convergence and Self-Repair

Each test included two churns of different levels of dynamicity.

Constant change has little effect on Tapestry performance, as success rates rarely fall even slightly below 100%.

These dips happen independently of the parameters given to the churn.
Comparison with other peer-to-peer systems

Unlike Gnutella, Tapestry guarantees that queries will find existing objects.

Similar to Chord and CAN in that they all scale well and guarantee that queries find existing objects under non-failure conditions.

Differs from Chord and CAN, which do not take network distance into account when constructing their routing overlays; Tapestry instead constructs locally optimal routing tables from initialization and maintains them in order to reduce routing stretch.

Tapestry allows applications to place objects according to their needs.
Conclusion

Tapestry provides efficient and scalable routing of
messages directly to nodes and objects in a large,
sparse address space

Simulations show Tapestry performs near-optimally under faults, while a small portion (~5%) of queries fail on the faulty wide-area deployment.
Reference: Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John Kubiatowicz, “Tapestry: A Resilient Global-Scale Overlay for Service Deployment,” IEEE Journal on Selected Areas in Communications, Vol. 22, No. 1, January 2004.