Transcript: CAN

Distributed Hash Tables and
Structured P2P Systems
Ningfang Mi
September 27, 2004
1
Outline
- The lookup problem
  - Definition
  - Approaches
  - Issues
- Case study: CAN
  - Overview
  - Basic design
  - More design improvements (performance, robustness)
  - Topology-aware routing in CAN
- Conclusion
2
The Lookup Problem -- Definition
Given a data item X stored at some dynamic set of nodes, find it.
This is an important and critical problem common to all P2P systems: how can we find X efficiently?
3
The Lookup Problem -- Approaches
- Central database: Napster
  - Central point of failure
- Hierarchy: DNS-style name lookup
  - Failure or removal of the root, or of nodes high in the hierarchy
  - The root and nodes high in the hierarchy can become overloaded
- Symmetric lookup algorithm: Gnutella
  - No node is more important than any other
  - "Flooding" requests does not scale well
- "Superpeers" in a hierarchical structure: KaZaA
  - Failure of the superpeers
- Innovative symmetric lookup strategy: Freenet
  - May need to visit a large number of nodes, with no guarantee of finding the data
4
The Lookup Problem -- Approaches
Structured and symmetric algorithms:
- Implement a DHT (distributed hash table)
- Convert a name to a key using a hash function such as SHA-1
- One operation: lookup(key)
- Examples:
  - CAN [2] -- ACM SIGCOMM 2001
  - Chord [3] -- ACM SIGCOMM 2001
  - Pastry [4] -- Middleware 2001
  - Tapestry [5] -- U.C. Berkeley 2001
5
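The name-to-key conversion can be sketched in a few lines of Python; the function name and the interpretation of the digest as a 160-bit integer are illustrative, not part of any specific DHT implementation:

```python
import hashlib

def name_to_key(name: str) -> int:
    """Map an arbitrary name to a fixed-size DHT key using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")  # 160-bit integer key

key = name_to_key("my-file.mp3")
print(key.bit_length() <= 160)  # SHA-1 digests are 160 bits
```

Because the hash is deterministic, every node that hashes the same name obtains the same key, which is what lets lookup(key) work without any central directory.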
The Lookup Problem -- Issues
- Mapping keys to nodes in a load-balanced way
- Forwarding a lookup for a key to an appropriate node
- Choosing a distance function
- Building routing tables adaptively
We discuss these issues through a case study: CAN.
6
"What is a scalable content-addressable network?" --- CAN's Design Overview
Features:
- Completely distributed: no centralized control, coordination, or configuration
- Scalable: each node maintains only a small amount of control state, independent of the number of nodes
- Content-addressable: resembles a hash table mapping keys to values
- Fault-tolerant: can route around failures
7
"How does a scalable CAN work?" --- CAN's Design Overview
- CAN uses a virtual d-dimensional Cartesian coordinate space, partitioned into zones (hyper-rectangles)
- CAN is composed of individual nodes
  - Each node "owns" its own distinct zone
  - Each node "holds" state information about its neighbors' zones
- Three operations on (key, value) pairs: insertion, lookup, and deletion
- Requests for a key are routed using a greedy routing algorithm
8
- Map key k1 onto a point p1 using a uniform hash function
- Store (k1, v1) at the node that owns the zone containing p1
- To retrieve (k1, v1), apply the same hash function to map k1 onto point p1 and get the value from the node that owns p1
[Figure: a 2-d coordinate space with 5 nodes; (k1, v1) stored at p1 and (k2, v2) at p2]
9
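The store/retrieve symmetry above can be sketched as follows; the digest-slicing scheme for producing coordinates and the two-node zone layout are illustrative assumptions, not the paper's exact construction:

```python
import hashlib

def key_to_point(key: str, d: int = 2) -> tuple:
    """Hash a key onto a point in the d-dimensional unit coordinate
    space by slicing the SHA-1 digest into d coordinates (a sketch)."""
    digest = hashlib.sha1(key.encode()).digest()
    chunk = len(digest) // d
    coords = []
    for i in range(d):
        part = digest[i * chunk:(i + 1) * chunk]
        coords.append(int.from_bytes(part, "big") / 2 ** (8 * chunk))
    return tuple(coords)

def owner(zones, point):
    """Return the node whose zone (an axis-aligned box) contains the point."""
    for node, (lo, hi) in zones.items():
        if all(l <= c < h for c, l, h in zip(point, lo, hi)):
            return node

# two hypothetical nodes splitting the unit square along x
zones = {"A": ((0.0, 0.0), (0.5, 1.0)), "B": ((0.5, 0.0), (1.0, 1.0))}
p1 = key_to_point("k1")
store_node = owner(zones, p1)                  # (k1, v1) is stored here
lookup_node = owner(zones, key_to_point("k1"))
print(store_node == lookup_node)               # same hash -> same node
```

Insertion and lookup both end at the same node precisely because they apply the same deterministic hash to the key.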
Basic design for a CAN
The three most basic pieces of the design:
- CAN routing
- Construction of the CAN overlay
- Maintenance of the CAN overlay
10
"How can we find a point (x/y)?" --- Routing in a CAN
- Each node maintains a routing table holding the IP address and virtual coordinate zone of each of its neighbors
- Greedy forwarding: a CAN message is forwarded to the neighbor closest to the destination
- For a d-dimensional space partitioned into n equal zones:
  - Each node maintains 2d neighbors
  - Average routing path length: (d/4)(n^(1/d))
[Figure: a 2-d coordinate space (0.0-1.0 on each axis) with nodes A-G and a route toward point (0.2/0.4); D's neighbor set = {C,E}, C's neighbor set = {B,D,E,F}, B's neighbor set = {A,C,G}]
11
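The greedy forwarding rule can be sketched like this, comparing zone centers by Cartesian distance; the function names and the neighbor-table layout are hypothetical:

```python
import math

def greedy_next_hop(current_center, neighbors, dest):
    """Pick the neighbor whose zone center is closest (Cartesian
    distance) to the destination -- the basic CAN routing metric."""
    best = min(neighbors.items(), key=lambda kv: math.dist(kv[1], dest))
    # forward only if the neighbor makes progress toward the destination
    if math.dist(best[1], dest) < math.dist(current_center, dest):
        return best[0]
    return None  # we already own the closest zone

neighbors = {"C": (0.4, 0.5), "E": (0.6, 0.9)}  # neighbor -> zone center
print(greedy_next_hop((0.2, 0.9), neighbors, (0.5, 0.4)))  # -> "C"
```

Each hop strictly decreases the distance to the destination point, so the message eventually reaches the zone that owns it.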
“I want to join!”
--- CAN Construction
- Find a node already in the CAN
  - Retrieve a bootstrap node's IP address by looking up the CAN domain name in DNS
  - The bootstrap node provides IP addresses of random nodes already in the CAN
- Find a zone in the space
  - Randomly choose a point P in the space
  - Send a JOIN request to P via an existing CAN node A
  - The node B whose zone contains P splits the zone and allocates "half", with its (key, value) pairs, to the new node
- Join the routing by notifying the neighbors
  - The new node learns the IP addresses of its neighbor set from the previous occupant
  - The previous occupant updates its own neighbor set
  - Both new and old neighbors are informed of the reallocation
12
[Figure: node H sends JOIN (0.2/0.1); A's zone in the 2-d space is split between A and H]
Before: A's neighbor set = {B,G}; B's neighbor set = {A,C,G}; G's neighbor set = {A,B,F}
After: A's neighbor set = {G,H}; B's neighbor set = {C,G,H}; G's neighbor set = {A,B,F,H}; H's neighbor set = {A,B,G}
13
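The zone split performed at join time can be sketched as follows; splitting along the longest side and the point-keyed data map are simplifying assumptions:

```python
def split_zone(zone, data):
    """Split an axis-aligned zone in half along its longest side and
    partition the (key, value) pairs by which half owns each point.
    `data` maps hashed point -> (key, value); a simplified sketch."""
    lo, hi = zone
    axis = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
    mid = (lo[axis] + hi[axis]) / 2
    old_hi = list(hi); old_hi[axis] = mid   # occupant keeps the lower half
    new_lo = list(lo); new_lo[axis] = mid   # joiner takes the upper half
    old_zone = (lo, tuple(old_hi))
    new_zone = (tuple(new_lo), hi)
    old_data = {p: kv for p, kv in data.items() if p[axis] < mid}
    new_data = {p: kv for p, kv in data.items() if p[axis] >= mid}
    return old_zone, old_data, new_zone, new_data

zone = ((0.0, 0.0), (0.5, 1.0))  # hypothetical zone before a join
data = {(0.2, 0.1): ("k1", "v1"), (0.3, 0.9): ("k2", "v2")}
oz, od, nz, nd = split_zone(zone, data)
```

After the split, each (key, value) pair still lives in the zone that contains its hashed point, so no lookup breaks.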
“I want to leave!”
--- CAN Maintenance
- The departing node's zone must be taken over by some node
- Normal procedure:
  - Explicitly hand over the zone and its (key, value) pairs to one neighbor
  - Merge into a valid zone if possible; otherwise the neighbor with the smallest zone handles both zones temporarily
- Failure: immediate takeover algorithm
  - Neighbors initialize takeover timers
  - The live neighbor with the smallest zone takes over the failed node's zone
14
[Figure: a 2-d CAN with nodes A-H illustrating three departures]
Scenario 1: H leaves explicitly
Scenario 2: C leaves explicitly
Scenario 3: G dies
15
“Can we do Better???”
--- Design Improvements (1)
- The basic design balances low per-node state, O(d), against short path lengths, O(d n^(1/d))
- These are application-level hops, not IP-level hops: neighboring nodes may be geographically distant, many IP hops apart
- Average total latency of a lookup = (avg. number of CAN hops) x (avg. latency of each CAN hop)
16
“Can we do Better???”
--- Design Improvements (2)
- Primary goal: reduce the latency of CAN routing
  - Path length
  - Per-CAN-hop latency
- Other achievements
  - Robustness: routing and data availability
  - Load balancing
- The CAN design was simulated on Transit-Stub (TS) topologies using the GT-ITM topology generator
17
Multi-Dimensioned Coordinate Spaces
(1)
- Increasing the number of dimensions d:
  - Reduces path length (latency)
  - Increases the size of the coordinate routing table (which remains small)
  - Improves fault tolerance
[Figure: effect of dimensions on path length]
18
Realities: Multiple Coordinate Spaces
(2)
- Maintain multiple independent coordinate spaces (realities)
- For a CAN with r realities:
  - Each node is assigned r zones and holds r independent neighbor sets
  - The contents of the hash table are replicated in each reality
[Figure: a CAN with r = 2 realities; (k1, v1) is stored at a different node (near (0.6/0.8)) in each reality]
19
Realities: Multiple Coordinate Spaces
(2)
- Multiple realities improve data availability and fault tolerance, and reduce path length (latency)
[Figure: effect of multiple realities on path length]
20
Dimensions vs. Realities
- More dimensions have the greater effect on path length
- More realities provide stronger fault tolerance and increased data availability
[Figure: path length with increasing neighbor state]
21
Better CAN Routing Metrics
(3)
- (Basic CAN routing metric: Cartesian distance in the virtual coordinate space)
- RTT-weighted routing (RTT = round-trip time)
  - Reflects the underlying IP topology
  - Each node measures the RTT to each of its neighbors
  - A message is forwarded to the neighbor with the maximum ratio of progress to RTT
  - Reduces the latency of individual hops
22
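The progress-to-RTT rule can be sketched like this; the neighbor-table layout and the RTT values are hypothetical:

```python
import math

def rtt_weighted_next_hop(current_center, neighbors, dest):
    """Among neighbors that make progress toward the destination,
    pick the one with the highest ratio of progress to measured RTT.
    `neighbors` maps node -> (zone center, rtt_ms); a sketch."""
    here = math.dist(current_center, dest)
    best, best_ratio = None, 0.0
    for node, (center, rtt) in neighbors.items():
        progress = here - math.dist(center, dest)
        if progress > 0 and progress / rtt > best_ratio:
            best, best_ratio = node, progress / rtt
    return best

neighbors = {
    "C": ((0.4, 0.5), 10.0),  # (zone center, measured RTT in ms)
    "E": ((0.6, 0.9), 80.0),
}
print(rtt_weighted_next_hop((0.2, 0.9), neighbors, (0.5, 0.4)))  # -> "C"
```

Compared with the pure Cartesian metric, this rule may accept a slightly longer path in the coordinate space if each hop is much cheaper in real network latency.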
Overloading Coordinate Zones
(4)
- Multiple nodes share the same zone, up to a threshold MAXPEERS
- Each node knows all peers in its own zone and one node in each neighboring zone
- How are zones overloaded? When node A sends a JOIN into node B's zone:
  - If B's zone holds fewer than MAXPEERS peers: A joins B's zone, and the key-value pairs are replicated to A
  - If B's zone already holds MAXPEERS peers: split B's zone, divide B's peer list, and divide the key-value pairs
23
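The MAXPEERS decision can be sketched as follows; the threshold value and the list-based peer bookkeeping are illustrative, not the paper's data structures:

```python
MAXPEERS = 4  # threshold on peers per zone (value here is illustrative)

def handle_join(zone_peers, new_node):
    """When a node joins an occupied zone: share the zone if below the
    MAXPEERS threshold, otherwise split it (a simplified sketch)."""
    if len(zone_peers) < MAXPEERS:
        zone_peers.append(new_node)  # key-value pairs replicated to new_node
        return "shared"
    half = len(zone_peers) // 2      # split the zone, divide the peer list
    old_half, new_half = zone_peers[:half], zone_peers[half:] + [new_node]
    return "split", old_half, new_half

peers = ["B"]
print(handle_join(peers, "A"))  # -> "shared"
```

Deferring splits until a zone is "full" keeps the total number of zones, and hence the path length, lower than in the basic design.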
Overloading Coordinate Zones
(4)
- Advantages:
  - Reduces path length (number of hops)
  - Reduces per-hop latency
  - Improves fault tolerance
- Disadvantage:
  - Adds somewhat to system complexity
24
Multiple Hash Functions
(5)
- Use k different hash functions to map a key onto k points and replicate (key, value) at k distinct nodes
  - Improves data availability
  - Reduces average query latency (queries can be sent in parallel)
  - Increases the size of the (key, value) database and the query traffic by a factor of k
25
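Salting a single hash function with the function index is one simple way to realize k hash functions; this sketch assumes that scheme rather than the paper's exact choice:

```python
import hashlib

def k_points(key: str, k: int = 3, d: int = 2):
    """Map a key onto k points by salting the hash with the function
    index, so (key, value) can be replicated at k distinct nodes."""
    points = []
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
        chunk = len(digest) // d
        coords = tuple(
            int.from_bytes(digest[j * chunk:(j + 1) * chunk], "big")
            / 2 ** (8 * chunk)
            for j in range(d)
        )
        points.append(coords)
    return points

# a query can be routed toward all k points in parallel; the first
# answer wins, which reduces the average query latency
print(len(set(k_points("k1"))))  # k distinct points (with high probability)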
Topologically-Sensitive Construction of
the CAN Overlay Network (6)
- Problem with randomly allocating nodes to zones: strange routing scenarios and inefficient routing
- Goal: construct CAN overlays that are congruent with the underlying IP topology
- Landmarks: a well-known set of machines, such as DNS servers
  - Each node measures its RTT to each landmark and sorts the landmarks in order of increasing RTT
  - For m landmarks there are m! possible orderings
  - The coordinate space is partitioned into m! equal-size partitions
  - A node joins the CAN at a random point in the partition corresponding to its landmark ordering
26
[Figure: m = 3 landmarks, m! = 6 partitions; each node A-H joins the partition matching its landmark ordering (l1l2l3, l1l3l2, l2l1l3, l2l3l1, l3l1l2, l3l2l1)]
27
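The ordering-to-partition mapping can be sketched as follows; representing a partition by the lexicographic index of its permutation is an illustrative choice, not the paper's:

```python
from itertools import permutations

def landmark_ordering(rtts):
    """Order landmark indices by increasing measured RTT."""
    return tuple(sorted(range(len(rtts)), key=lambda i: rtts[i]))

def partition_index(ordering, m):
    """Map one of the m! orderings to a partition of the space."""
    return list(permutations(range(m))).index(ordering)

rtts = [42.0, 11.0, 27.0]        # hypothetical RTTs to l1, l2, l3 (ms)
order = landmark_ordering(rtts)  # l2 < l3 < l1 -> (1, 2, 0)
print(partition_index(order, 3)) # one of the m! = 6 partitions
```

Nodes with similar landmark RTTs tend to be near each other in the underlying network, so placing them in the same partition makes overlay neighbors likely to be topological neighbors too.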
Topologically-Sensitive Construction of
the CAN Overlay Network (6)
- Evaluation metric:
  latency stretch = (latency on the CAN network) / (average latency on the IP network)
- The coordinate space is no longer uniformly populated, so background load-balancing techniques are needed
28
Load Balancing Techniques
(7)
- More uniform partitioning: volume balancing
  - When a JOIN is received, a node compares its own zone volume with its neighbors' zone volumes
  - The zone with the largest volume is split
  - Results (V = V_T / n, where V_T is the total volume):

    Type     | % of nodes with volume V | Largest volume
    without  | 40%                      | 8V
    with     | 90%                      | 4V

- Caching and replication for "hot spots"
  - A node caches the data keys it has recently accessed
  - "Hot" data keys are replicated at neighboring nodes
29
Design Review
d : dimensions
r : number of realities
p : number of peers per zone
k : number of hash functions
n = 2^18 nodes
30
Topology-Aware Routing in CAN
 “It is critical for overlay routing to be aware of the
network topology” [6]
31
[Figure: an overlay path among nodes A, B, C, D mapped onto the underlying network, with and without awareness of the topology]
32
Approaches to Topology-Aware Routing in
CAN
- Proximity routing
- Topology-based nodeId assignment
- Proximity neighbor selection (does not work for CAN)
33
Proximity Routing
- The overlay is constructed without regard for the physical network topology
- Among a set of possible next hops, select the one that
  - is closest in the physical network, or
  - represents a good compromise between progress in the id space and proximity
- Example: the RTT-weighted CAN routing metric described above
34
Topology-based NodeId Assignment
- Map the overlay's logical id space onto the physical network
- Example 1: landmark binning [5][7]
  - Destroys the uniform population of the id space
  - Landmark sites can become overloaded
  - Coarse-grained; difficult to distinguish relatively close nodes
- Example 2: SAT-Match [8]
35
Self-Adaptive Topology (SAT) Matching
- Two phases:
  - Probing phase: probe nearby nodes for distance measurements as soon as the node joins the system
  - Jumping phase: pick the closest zone accordingly and jump to it
- The process iterates until the node is close enough to the zone where all its physically close neighbors are located
- A global topology-matching optimization is thereby achieved
36
SAT-Matching
-- Probing Phase (1)
- Effective flooding: flooding with a low TTL (number of hops) is highly effective and produces few redundant messages
- Having joined the system based on the DHT assignment, the source node floods a message to its neighbors:
  (source IP address, source timestamp, small TTL k)
- A node that receives the message responds to the source with its IP address, and re-floods the message to its own neighbors with TTL k-1 if k-1 > 0
37
SAT-Matching
-- Probing Phase (2)
Example:
[Figure: TTL-2 flooding in a 14-node overlay, showing the TTL-2 neighborhoods of the source node, node 2, and node 13]
38
SAT-Matching
-- Probing Phase (3)
- After collecting a list of IP addresses, the source node uses a ping facility to measure the RTT (round-trip time) to each node that has responded
- It sorts these RTTs and selects the two nodes with the smallest RTTs
- It then selects a zone associated with one of these two nodes to jump into
39
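The two probing steps can be sketched together; the neighbor graph, the TTL flood model, and the RTT values are hypothetical:

```python
def ttl_flood(graph, source, ttl):
    """Collect the nodes reachable from `source` within `ttl` overlay
    hops -- a model of the SAT-Match probing flood (a sketch)."""
    frontier, seen = {source}, {source}
    for _ in range(ttl):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {source}

def closest_two(responders, rtt):
    """Sort responders by measured RTT and keep the two closest."""
    return sorted(responders, key=lambda n: rtt[n])[:2]

graph = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}   # overlay neighbor graph
responders = ttl_flood(graph, 1, ttl=2)          # {2, 3, 4}
rtt = {2: 30.0, 3: 5.0, 4: 12.0}                 # hypothetical ping results
print(closest_two(responders, rtt))              # [3, 4]
```

In the real protocol the RTTs come from pinging the responders over the IP network, not from a table; the sorted result determines which zone the jumping phase considers.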
SAT-Matching
-- Jumping Phase
- If the node simply jumps to the zone, the stretch reduction could be offset by latency increases from the other new connections
- Jumping criterion: jump only when the local stretch of the source's and the sink's (the two closest nodes') TTL-1 neighborhoods is reduced
- How to jump? (X to Y)
  - X returns its zone and (key, value) pairs to its neighbor
  - Y allocates half of its zone and pairs to X
40
Discussion
- Is there a way to make structured P2P more practical?
  - It is not free for peers: in CAN, peers are required to follow the protocol and be responsible for some data keys
  - Only exact-key search is supported
- Can topology-aware routing be made cost-effective in a highly dynamic environment?
41
Conclusion
- CAN, Chord, Pastry and Tapestry are representative structured P2P systems:
  - Self-organizing
  - Scalable
  - Distributed DHT
  - Fault-tolerant

  Structured P2P | Size of routing table | Number of hops
  CAN            | O(d)                  | O(n^(1/d))
  Chord          | O(log n)              | O(log n)
  Pastry         | O(log n)              | O(log n)
  Tapestry       | O(log n)              | O(log n)
42
References
[1] Hari Balakrishnan et al., "Looking Up Data in P2P Systems", Communications of the ACM, Vol. 46, No. 2, 2003.
[2] Sylvia Ratnasamy, Paul Francis, Mark Handley, and Richard Karp, "A Scalable Content-Addressable Network", ACM SIGCOMM 2001.
[3] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", ACM SIGCOMM 2001.
[4] Antony Rowstron and Peter Druschel, "Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems", Middleware 2001.
[5] B. Zhao, J. Kubiatowicz, and A. Joseph, "Tapestry: An infrastructure for fault-tolerant wide-area location and routing", U.C. Berkeley, 2001.
[6] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron, "Topology-aware routing in structured peer-to-peer overlay networks", presented at the Intl. Workshop on Future Directions in Distributed Computing, June 2002.
[7] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker, "Topologically-aware overlay construction and server selection", INFOCOM 2002.
[8] Shansi Ren, Lei Guo, Song Jiang, and Xiaodong Zhang, "SAT-Match: a self-adaptive topology matching method to achieve low lookup latency in structured P2P overlay networks", Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), April 26-30, 2004.
43