
Distributed hash tables
Protocols and applications
Jinyang Li
Peer to peer systems
• Symmetric in node functionalities
– Anybody can join and leave
– Everybody gives and takes
• Great potentials
– Huge aggregate network capacity, storage etc.
– Resilient to single point of failure and attack
• E.g. Napster, Gnutella, BitTorrent, Skype, Joost
Motivations: the lookup problem
• How to locate something given its name?
• Examples:
– File sharing
– VoIP
– Content distribution networks (CDN)
Strawman #1:
a central search service
• Central service poses a single point of
failure or attack
• Not scalable(?): central service needs $$$
Improved #1: Hierarchical lookup
• Performs hierarchical lookup (like DNS)
• More scalable than a central site
• Root of the hierarchy is still a single
point of vulnerability
Strawman #2: Flooding
[Figure: a query “Where is item K?” is flooded to all peers]
• Symmetric design: no special servers needed
• Too much overhead traffic
• Limited in scale
• No guarantee for results
Improved #2: super peers
• E.g. Kazaa, Skype
• Lookup only floods super peers
• Still no guarantees for lookup results
DHT approach: routing
• Structured:
– put(key, value); get(key) → value (sketched below)
– maintain routing state for fast lookup
• Symmetric:
– all nodes have identical roles
• Many protocols:
– Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy
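A minimal sketch of this put/get interface, assuming a hypothetical client whose lookup routine routes a flat key to its responsible node (names here are illustrative, not any specific protocol's API):

import hashlib

def flat_id(name, bits=160):
    # Hash an arbitrary name (file name, node address) into the flat key space.
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** bits)

class DHTClient:
    def __init__(self, lookup):
        self.lookup = lookup               # routes a key ID to its responsible node

    def put(self, key, value):
        node = self.lookup(flat_id(key))   # find the node responsible for this key
        node.store[flat_id(key)] = value

    def get(self, key):
        node = self.lookup(flat_id(key))
        return node.store.get(flat_id(key))

class FakeNode:
    def __init__(self):
        self.store = {}

node = FakeNode()                          # a one-node "ring" for the demo
client = DHTClient(lookup=lambda _id: node)
client.put("movie.mp4", "stored-value")
print(client.get("movie.mp4"))             # -> "stored-value"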
DHT ideas
• Scalable routing needs a notion of “closeness”
• Ring structure [Chord]
Different notions of “closeness”
• Tree-like structure [Pastry, Kademlia, Tapestry]
[Figure: node IDs arranged in a binary prefix tree: 011000**, 011001**, 011010**, 011011**, with leaves such as 01100100, 01100101, 01100110, 01100111]
Distance is measured as the length of the longest matching prefix with the lookup key
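A small sketch of this notion of closeness: the more leading bits two IDs share, the closer they are (pure illustration, not a full Pastry/Kademlia routing table):

def matching_prefix_len(a, b, bits=8):
    # Count how many leading bits two IDs share; a longer match means "closer".
    for i in range(bits - 1, -1, -1):
        if ((a >> i) & 1) != ((b >> i) & 1):
            return bits - 1 - i
    return bits

# With the 8-bit IDs from the figure above:
print(matching_prefix_len(0b01100101, 0b01100110))   # 6 shared bits: very close
print(matching_prefix_len(0b01100101, 0b01101100))   # 4 shared bits: farther away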
Different notions of “closeness”
• Cartesian space [CAN]
[Figure: the unit square from (0,0) to (1,1) partitioned into rectangular zones, e.g. (0, 0.5)–(0.5, 1), (0.5, 0.5)–(1, 1), (0, 0)–(0.5, 0.5), (0.5, 0.25)–(0.75, 0.5), (0.5, 0)–(0.75, 0.25), (0.75, 0)–(1, 0.5)]
Distance is measured as geometric distance
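A sketch of the Cartesian notion of closeness, including a zone-ownership check in the spirit of CAN (the coordinates and zone bounds below are illustrative):

import math

def distance(p, q):
    # Geometric (Euclidean) distance between two points in the coordinate space.
    return math.dist(p, q)

def owns(zone, point):
    # A node owns a rectangular zone and is responsible for keys hashed into it.
    x0, y0, x1, y1 = zone
    x, y = point
    return x0 <= x < x1 and y0 <= y < y1

print(distance((0.2, 0.3), (0.6, 0.8)))           # route greedily toward the key
print(owns((0.5, 0.25, 0.75, 0.5), (0.6, 0.3)))   # True: this zone holds the key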
DHT: complete protocol design
1. Name nodes and data items
2. Define data item to node assignment
3. Define per-node routing table contents
4. Handle churn
• How do new nodes join?
• How do nodes handle others’ failures?
Chord: Naming
• Each node has a unique flat ID
– E.g. SHA-1(IP address)
• Each data item has a unique flat ID (key)
– E.g. SHA-1(data content)
Data to node assignment
• Consistent hashing
[Figure: a key on the identifier ring; the key’s predecessor node is numerically “closest” to it, but its successor node is the one responsible for it]
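A sketch of consistent hashing on a tiny ring, assuming node IDs are drawn from the same space as keys; the successor of a key (the first node clockwise from it) is responsible for storing it:

import bisect

def successor(node_ids, key):
    # The first node at or past the key (clockwise) stores it; wrap around at the top.
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, key)
    return ids[i % len(ids)]

nodes = [10, 70, 130, 200]          # node IDs on a small ring
print(successor(nodes, 100))        # -> 130 is responsible for key 100
print(successor(nodes, 250))        # -> 10 (wraps around the ring)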
Key-based lookup (routing)
• Correctness
– Each node must know its correct successor
Key-based lookup
• Fast lookup
– A node has O(log n) fingers exponentially “far” away
Node x’s i-th finger is the first node at or past x + 2^i
Why is lookup fast?
• Each forwarding step via the fingers at least halves the remaining distance to the key in ID space, so a lookup finishes in O(log n) hops
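A rough sketch of finger-based lookup under these assumptions: a toy in-process ring stands in for real messages, and in_interval handles wraparound on the ring (illustrative only, not the full Chord pseudocode):

M = 8                                  # bits in the ID space (2^M IDs)

def in_interval(x, a, b):
    # True if x lies in the half-open ring interval (a, b].
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.fingers = []              # successor of id + 2^i, for i = 0..M-1

    def lookup(self, key):
        if in_interval(key, self.id, self.fingers[0].id):
            return self.fingers[0]     # first finger is the immediate successor
        for f in reversed(self.fingers):
            if in_interval(f.id, self.id, key):
                return f.lookup(key)   # each hop roughly halves the distance
        return self.fingers[0].lookup(key)

def build(ids):
    # Build finger tables centrally, just for the demo.
    nodes, ring = {i: Node(i) for i in ids}, sorted(ids)
    succ = lambda k: next((nodes[i] for i in ring if i >= k % 2 ** M), nodes[ring[0]])
    for n in nodes.values():
        n.fingers = [succ(n.id + 2 ** i) for i in range(M)]
    return nodes

nodes = build([5, 30, 80, 140, 200])
print(nodes[5].lookup(150).id)         # -> 200, reached in O(log n) hops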
Handling churn: join
• New node w joins an existing network (sketched below)
– Issue a lookup to find its successor x
– Set its successor: w.succ = x
– Set its predecessor: w.pred = x.pred
– Obtain the subset of items it is now responsible for from x
– Notify x of its new predecessor: x.pred = w
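A minimal sketch of these join steps, assuming a toy Node with id / succ / pred / items fields and a lookup helper provided by any existing node (illustrative, not the full protocol):

def between(x, a, b):
    # x in the ring interval (a, b], with wraparound.
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id, self.succ, self.pred, self.items = ident, self, self, {}

def join(w, lookup):
    x = lookup(w.id)                          # 1. find w's successor
    w.succ, w.pred = x, x.pred                # 2./3. set successor and predecessor
    for k in [k for k in x.items if between(k, w.pred.id, w.id)]:
        w.items[k] = x.items.pop(k)           # 4. take over keys now in (w.pred, w]
    x.pred = w                                # 5. notify the successor

# Example: a two-node ring {10, 200}; node 80 joins between them.
a, b = Node(10), Node(200)
a.succ = a.pred = b
b.succ = b.pred = a
b.items = {50: "v50", 120: "v120"}
w = Node(80)
join(w, lambda key: b if between(key, 10, 200) else a)
print(sorted(w.items))                        # [50]: key 50 moved to the new node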
Handling churn: stabilization
• Each node periodically fixes its state
– Node w asks its successor x for x.pred
– If x.pred is “closer” to w than x is, set w.succ = x.pred
Starting from any connected graph, stabilization eventually makes every node find its correct successor
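A sketch of this periodic repair rule, assuming a toy Node with id, succ, and pred fields (kept self-contained here); a real node would run stabilize on a timer and over RPC:

class Node:
    def __init__(self, ident):
        self.id, self.succ, self.pred = ident, self, None

def strictly_between(x, a, b):
    # x strictly inside the ring interval (a, b), with wraparound.
    return (a < x < b) if a < b else (x > a or x < b)

def stabilize(w):
    # Ask my successor for its predecessor; adopt it if it sits between us.
    p = w.succ.pred
    if p is not None and strictly_between(p.id, w.id, w.succ.id):
        w.succ = p
    notify(w.succ, w)                # then tell my (possibly new) successor about me

def notify(x, w):
    # x adopts w as its predecessor if w is a better (closer) predecessor.
    if x.pred is None or strictly_between(w.id, x.pred.id, x.id):
        x.pred = w

# Example: 80 joined knowing only its successor; stabilization repairs 10's successor.
a, b, w = Node(10), Node(200), Node(80)
a.succ, b.succ, w.succ = b, a, b
a.pred, b.pred = b, a
notify(b, w)                         # w announced itself when it joined
stabilize(a)                         # a learns that 80 is its true successor
print(a.succ.id)                     # -> 80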
Handling churn: failures
• Nodes leave without notice
– Each node keeps s (e.g. 16) successors
– Replicate keys on s successors
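A tiny sketch of this failure handling: keep a list of s backup successors and store each key on them too, so the data survives a responsible node leaving without notice (the fields below are illustrative):

S = 16                                   # backup successors per node, as on the slide

class Node:
    def __init__(self):
        self.store, self.successor_list = {}, []

def replicate(node, key, value):
    # Store the key at its responsible node and at its next S successors.
    for n in [node] + node.successor_list[:S]:
        n.store[key] = value

a = Node()
a.successor_list = [Node() for _ in range(S)]
replicate(a, 42, "value")
print(sum(42 in n.store for n in [a] + a.successor_list))   # 17 copies survive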
Handling churn: fixing fingers
• Much easier to maintain fingers
– Any node at distance [2^i, 2^(i+1)] away will do
– Geographically closer fingers → lower latency
– Periodically flush dead fingers
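A sketch of this periodic maintenance, assuming two helpers that are not shown: lookup(id) finds the current first node at or past an ID, and alive(node) is a hypothetical liveness probe:

import random

M = 8                                    # bits in the ID space

def fix_fingers(node, lookup, alive):
    # Flush fingers that no longer respond, and refresh one entry per round;
    # any live node in the right range works, so a fresh lookup is enough.
    for i, f in enumerate(node.fingers):
        if f is None or not alive(f):
            node.fingers[i] = lookup((node.id + 2 ** i) % 2 ** M)
    i = random.randrange(len(node.fingers))
    node.fingers[i] = lookup((node.id + 2 ** i) % 2 ** M)

class N:
    def __init__(self, ident):
        self.id, self.fingers = ident, []

n = N(5)
n.fingers = [None] * M
fix_fingers(n, lookup=lambda k: N(k), alive=lambda f: True)   # stand-in helpers
print([f.id for f in n.fingers])         # fingers at IDs 6, 7, 9, 13, 21, 37, 69, 133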
Using DHTs in real systems
• Amazon’s Dynamo key-value storage [SOSP’07]
• Serves as persistent state for applications
– shopping carts, user preferences
• How does it apply DHT ideas?
– Assign key-value pairs to responsible nodes
– Automatic recovery from churn
• New challenges?
– Manage read-write data consistency
Using DHTs in real systems
• Keyword-based searches for file sharing
– eMule, Azureus
• How to apply a DHT?
– User has file 1f3d… with name jingle bell Britney
– Insert mappings: SHA-1(jingle) → 1f3d, SHA-1(bell) → 1f3d, SHA-1(britney) → 1f3d
– How to answer query “jingle bell”, “Britney”?
• Challenges?
– Some keywords are much more popular than others
– What if the RIAA inserts a billion “Britney spear xyz” crap entries?
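A sketch of keyword search layered on a DHT: each keyword hashes to a key whose value is the set of matching file hashes, and a multi-keyword query intersects those sets. The dht dict below stands in for real put/get calls:

import hashlib

def kid(word):
    # Key for a keyword: hash it into the DHT's ID space.
    return hashlib.sha1(word.lower().encode()).hexdigest()

dht = {}                                 # stand-in for the DHT: key -> set of file hashes

def publish(file_hash, name):
    for word in name.split():
        dht.setdefault(kid(word), set()).add(file_hash)

def search(query):
    sets = [dht.get(kid(w), set()) for w in query.split()]
    return set.intersection(*sets) if sets else set()

publish("1f3d", "jingle bell britney")
print(search("jingle bell"))             # -> {'1f3d'}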
Using DHTs in real systems
CoralCDN
Motivation: alleviating flash crowds
[Figure: many browsers fetch through nearby HTTP proxies instead of hitting the origin server directly]
• Proxies handle most client requests
• Proxies cooperate to fetch content from each other
Getting content with CoralCDN
[Figure: fetching a file through CoralCDN rather than the origin server]
1. Server selection: what CDN node should I use?
2. Lookup(URL): what nodes are caching the URL?
3. Content transmission: from which caching nodes should I download the file?
• Clients use CoralCDN via modified domain name
nytimes.com/file → nytimes.com.nyud.net/file
Coral design goals
• Don’t control data placement
– Nodes cache based on access patterns
• Reduce load at origin server
– Serve files from cached copies whenever possible
• Low end-to-end latency
– Lookup and cache download optimize for locality
• Scalable and robust lookups
Lookup for cache locations
• Given a URL:
– Where is the data cached?
– Map name to locations: URL → {IP1, IP2, IP3, IP4}
– lookup(URL) → get IPs of nearby caching nodes
– insert(URL, myIP, TTL) → add me as a node caching URL for TTL seconds
Isn’t this what a distributed hash table is for?
A straightforward use of DHT
[Figure: insert(URL1, myIP) sends every location for URL1 to the single node whose ID is closest to SHA-1(URL1), which ends up holding URL1 = {IP1, IP2, IP3, IP4}]
• Problems
– No load balance for a single URL
– All inserts and lookups go to the same node (which cannot be close to everyone)
#1 Solve load balance problem:
relax hash table semantics
• DHTs designed for hash-table semantics
– Insert and replace: URL → IP_last
– Insert and append: URL → {IP1, IP2, IP3, IP4}
• Each Coral lookup needs only a few values
– lookup(URL) → {IP2, IP4}
– Preferably ones close in the network
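A sketch of this relaxed, append-style store: insert appends a location with a TTL, and lookup returns just a few unexpired values, preferring nearby ones if RTT estimates are available. Names and the nearness heuristic are illustrative:

import time

index = {}                               # URL -> list of (ip, expiry time)

def insert(url, ip, ttl):
    index.setdefault(url, []).append((ip, time.time() + ttl))

def lookup(url, want=2, rtt=None):
    fresh = [(ip, exp) for ip, exp in index.get(url, []) if exp > time.time()]
    if rtt:                              # prefer proxies that look close in the network
        fresh.sort(key=lambda e: rtt.get(e[0], float("inf")))
    return [ip for ip, _ in fresh[:want]]

insert("http://example.com/f", "10.0.0.1", 60)
insert("http://example.com/f", "10.0.0.2", 60)
print(lookup("http://example.com/f"))    # a couple of caching proxies, not all of them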
Prevent hotspots in index
[Figure: lookups from leaf nodes (distant IDs) converge toward the root node (closest ID) within 1, 2, 3 hops]
• Route convergence
– O(b) nodes are 1 hop from the root
– O(b²) nodes are 2 hops from the root, …
Prevent hotspots in index
[Figure: all inserts converge on the root node (closest ID), which ends up holding URL = {IP1, IP2, IP3, IP4}]
• Request load increases exponentially towards the root
Rate-limiting requests
[Figure: the root node holds URL = {IP1, IP2, IP3, IP4}; nodes further down the tree hold the overflow, e.g. {IP5} and {IP3, IP4}]
• Refuse to cache if the node already has the max # of “fresh” IPs for a URL
– Locations of popular items are pushed down the tree
Rate-limiting requests
[Figure: a leaked insert refreshes the root, which now holds URL = {IP1, IP2, IP6, IP4}; lower nodes still hold {IP5} and {IP3, IP4}]
• Refuse to cache if the node already has the max # of “fresh” IPs for a URL
– Locations of popular items are pushed down the tree
• Except: nodes leak through at most β inserts / min / URL
– Bounds the rate of inserts towards the root, yet the root stays fresh
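A much-simplified sketch of this rate limiting: each node stores only a few fresh locations per URL and forwards the overflow toward the root at no more than β inserts per minute, so popular URLs are mostly held below the root while the root still gets a trickle of fresh entries. The constants and parent-chain structure are illustrative, not Coral's exact design:

import time

MAX_FRESH = 4                            # max fresh locations stored per URL at a node
BETA = 12                                # inserts per minute per URL allowed upward

class IndexNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.values = {}                 # URL -> list of (ip, expiry)
        self.passed = {}                 # URL -> timestamps of inserts forwarded upward

    def insert(self, url, ip, ttl):
        fresh = [v for v in self.values.get(url, []) if v[1] > time.time()]
        if len(fresh) < MAX_FRESH:
            self.values[url] = fresh + [(ip, time.time() + ttl)]
            return                       # stored here: popular URLs fill lower nodes
        # Full: forward toward the root, but at most BETA times per minute per URL.
        recent = [t for t in self.passed.get(url, []) if t > time.time() - 60]
        if self.parent is not None and len(recent) < BETA:
            self.passed[url] = recent + [time.time()]
            self.parent.insert(url, ip, ttl)

root = IndexNode()
leaf = IndexNode(parent=root)
for i in range(6):
    leaf.insert("http://example.com/f", f"10.0.0.{i}", 60)
print(len(leaf.values["http://example.com/f"]),
      len(root.values["http://example.com/f"]))   # 4 2: only the overflow reached the root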
Load balance results
[Figure: per-node request rates range from about 1β up to roughly 7β at the root]
• 494 nodes on PlanetLab
• Aggregate request rate: ~12 million / min
• Rate limit per node (β): 12 / min
• Root has fan-in from 7 others
#2 Solve lookup locality problem
• Cluster nodes hierarchically based on RTT
• Lookup traverses up hierarchy
– Route to “closest” node at each level
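A sketch of this locality-aware lookup: try the index of the fastest (lowest-RTT) cluster first and widen only on a miss. The cluster levels and thresholds below mirror the next slide but are otherwise illustrative:

# Each level is a separate index restricted to nodes within an RTT threshold.
clusters = [
    {"rtt_ms": 20,   "index": {}},       # fastest, most local cluster
    {"rtt_ms": 60,   "index": {}},       # regional cluster
    {"rtt_ms": None, "index": {}},       # global level, no threshold
]

def insert(url, ip):
    for level in clusters:               # advertise at every level we belong to
        level["index"].setdefault(url, []).append(ip)

def lookup(url):
    # Search the fastest cluster first; fall back to wider ones only on a miss.
    for level in clusters:
        if url in level["index"]:
            return level["index"][url]
    return None

insert("http://example.com/f", "10.0.0.1")
print(lookup("http://example.com/f"))    # found in the most local cluster first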
Preserve locality through hierarchy
[Figure: three cluster levels over the ID space (000… to 111…, distance to key), with RTT thresholds: none, < 60 ms, < 20 ms]
• Minimizes lookup latency
• Prefer values stored by nodes within faster clusters
Clustering reduces lookup latency
• Reduces median lookup latency by a factor of 4
Putting it all together: Coral reduces server load
[Figure: where client requests are served over time]
• Most hits are served within the 20-ms Coral cluster
• Local disk caches begin to handle most requests
• Few hits reach the origin
• 400+ nodes provide 32 Mbps, about 100x the capacity of the origin