UCDavis, ecs251
Spring 2007
ecs251 Spring 2007:
Operating System Models
#3: Peer-to-Peer Systems
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
05/03/2007
P2P
1
UCDavis, ecs251
Spring 2007
The role of the service provider…
Centralized management of services
– DNS, Google, www.cnn.com, Blockbuster,
SBC/Sprint/AT&T, cable service, Grid
computing, AFS, bank transactions…
Information, Computing, & Network
resources owned by one or very few
administrative domains.
– Some with SLA (Service Level Agreement)
Interacting with the “SP”
Service providers are the owners of the
information and the interactions
– Some enhance/establish the interactions
Let’s compare …
Google
Blockbuster
CNN
MLB/NBA
LinkedIn
e-Bay
Skype
Bittorrent
Blog
Youtube
BotNet
Cyber-Paparazzi
Toward P2P
More participation of the end nodes (or their
users)
– More decentralized Computing/Network
resources available
– End-user controllability and interactions
– Security/robustness concerns
Service Providers in P2P
We might not like SPs, but we still cannot
avoid them entirely.
– Who is going to lay the fiber and switch?
– Can we avoid DNS?
– How can we stop “cyber-bullying” and
similar abuse?
– Copyright enforcement?
– Internet becomes a junkyard?
We will discuss…
P2P system examples
– Unstructured, structured, incentive
Architectural analysis and issues
Future P2P applications and why?
Challenge to you…
Define a new P2P-related application,
service, or architecture.
Justify why it is practical, useful and will
scale well.
– Example: sharing cooking recipes, experiences
& recommendations about restaurants and
hotels
Napster
P2P File sharing
“Unstructured”
Napster
[Figure: peers and the Napster server with its index.]
1. File location request (peer → Napster server)
2. List of peers offering the file (server → peer)
3. File request (peer → peer)
4. File delivered (peer → peer)
5. Index update
Napster
Advantages?
Disadvantages?
Gnutella
Originally conceived of by Justin Frankel, the 21-year-old founder of Nullsoft
March 2000, Nullsoft posts Gnutella to the web
A day later AOL removes Gnutella at the behest of Time Warner
The Gnutella protocol version 0.4
http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf
and version 0.6
http://rfc-gnutella.sourceforge.net/Proposals/Ultrapeer/Ultrapeers.htm
there are multiple open source implementations at http://sourceforge.net/
including:
– Jtella
– Gnucleus
Software released under the GNU Lesser General Public License (LGPL)
the Gnutella protocol has been widely analyzed
Gnutella Protocol Messages
Broadcast Messages
– Ping: initiating message (“I’m here”)
– Query: search pattern and TTL (time-to-live)
Back-Propagated Messages
– Pong: reply to a ping, contains information about the
peer
– Query response: contains information about the
computer that has the needed file
Node-to-Node Messages
– GET: return the requested file
– PUSH: push the file to me
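A minimal sketch of how these broadcast messages propagate (the class and field names below are illustrative, not the Gnutella wire format): each servent decrements the TTL and suppresses duplicates by message GUID.

```python
import uuid

# A sketch (not the Gnutella wire format) of how a servent forwards
# broadcast messages: decrement the TTL and drop duplicates by GUID.
class Servent:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # directly connected servents
        self.seen = set()     # GUIDs already handled (duplicate suppression)
        self.received = []    # query patterns that reached this servent

    def query(self, pattern, ttl=7):
        self.forward({"guid": uuid.uuid4().hex, "pattern": pattern, "ttl": ttl})

    def forward(self, msg):
        if msg["guid"] in self.seen or msg["ttl"] <= 0:
            return                      # already seen, or TTL expired
        self.seen.add(msg["guid"])
        self.received.append(msg["pattern"])
        for peer in self.neighbors:
            peer.forward({**msg, "ttl": msg["ttl"] - 1})

# Tiny triangle topology: the duplicate copy of the query is suppressed.
a, b, c = Servent("a"), Servent("b"), Servent("c")
a.neighbors, b.neighbors, c.neighbors = [b, c], [a, c], [a, b]
a.query("file A")
print(b.received, c.received)   # each servent handles the query exactly once
```

Even on a triangle, GUID suppression matters: without it, the query would circulate forever.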
Limited Scope Flooding / Reverse Path Forwarding
[Figure: a seven-node overlay; a query for file A floods outward from node 2, and replies retrace the reverse path.]
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
• File download
• Note: file transfer between clients behind firewalls is not possible; if only one client, X, is behind a firewall, Y can request that X push the file to Y
Gnutella
Advantages?
Disadvantages?
GUID:
Short for Globally Unique Identifier, a randomized string
that is used to uniquely identify a host or message on the
Gnutella Network. This prevents duplicate messages from
being sent on the network.
GWebCache:
a distributed system for helping servents connect to the
Gnutella network, thus solving the "bootstrapping"
problem. Servents query any of several hundred
GWebCache servers to find the addresses of other servents.
GWebCache servers are typically web servers running a
special module.
Host Catcher:
Pong responses allow servents to keep track of active
Gnutella hosts.
On most servents, the default port for Gnutella is 6346.
Gnutella Network Growth
[Chart: number of nodes in the largest network component (in thousands), sampled from 11/20/00 through 05/29/01, growing to roughly 50,000.]
“Limited Scope Flooding”
Ripeanu reported that Gnutella traffic totals 1Gbps (or
330TB/month).
– Compare to 15,000TB/month in US Internet backbone
(December 2000)
– this estimate excludes actual file transfers
Reasoning:
– QUERY and PING messages are flooded; they form more than 90% of generated traffic
– predominant TTL = 7
– >95% of nodes are less than 7 hops away
– measured traffic at each link: about 6 kbps
– network with 50k nodes and 170k links
Perfect Mapping
[Figure: an overlay of eight nodes (A–H) whose overlay links coincide with the underlying physical links.]
Inefficient Mapping
[Figure: the same eight nodes (A–H), wired so that overlay neighbors are physically distant.]
Link D–E needs to support six times higher traffic.
Topology mismatch
The overlay network topology doesn’t match
the underlying Internet infrastructure
topology!
40% of all nodes are in the 10 largest Autonomous
Systems (AS)
Only 2-4% of all TCP connections link nodes
within the same AS
Largely ‘random wiring’
Most Gnutella generated traffic crosses AS border,
making the traffic more expensive
May cause ISPs to change their pricing scheme
Scalability
Whenever a node receives a message
(ping/query), it sends copies out to all of its
other connections.
existing mechanisms to reduce traffic:
– TTL counter
– Cache information about messages they
received, so that they don't forward duplicated
messages.
70% of Gnutella users share no files
90% of users answer no queries
Those who have files to share may limit number of connections or
upload speed, resulting in a high download failure rate.
If only a few individuals contribute to the public good, these few
peers effectively act as centralized servers.
Anonymity
Gnutella provides for anonymity by
masking the identity of the peer that
generated a query.
However, IP addresses are revealed at
various points in its operation: HITS
packets include the URL for each file,
revealing the IP addresses of the responding hosts.
Query Expressiveness
Format of query not standardized
No standard format or matching semantics for the
QUERY string. Its interpretation is completely
determined by each node that receives it.
String literal vs. regular expression
Directory name, filename, or file contents
Malicious users may even return files unrelated to
the query
Superpeers
Cooperative, long-lived peers, typically with
significant resources, that handle a very high
volume of query-resolution traffic.
Gnutella is a self-organizing, large-scale, P2P application
that produces an overlay network on top of the Internet; it
appears to work
Growth is hindered by the volume of generated traffic and
inefficient resource use
Since there is no central authority, the open-source
community must commit to making any changes
Suggested changes have been made by
– Peer-to-Peer Architecture Case Study: Gnutella Network, by Matei
Ripeanu
– Improving Gnutella Protocol: Protocol Analysis and Research
Proposals by Igor Ivkovic
Freenet
Essentially the same as Gnutella:
– Limited-scope flooding
– Reverse-path forwarding
Difference:
– Data objects (i.e., files) are also delivered
via “reverse-path forwarding”
P2P Issues
Scalability & Load Balancing
Anonymity
Fairness, Incentives & Trust
Security and Robustness
Efficiency
Mobility
Incentive-driven Fairness
P2P means we all should contribute..
– Hopefully fair, but the majority is selfish…
“Incentive for people to contribute…”
Bittorrent: “Tit for Tat”
Equivalent Retaliation (Game theory)
– A peer will “initially” cooperate, then respond
in kind to an opponent's previous action. If the
opponent previously was cooperative, the agent
is cooperative. If not, the agent is not.
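The rule can be written down in a few lines (a minimal sketch, with True standing in for "cooperate"):

```python
# Tit-for-tat / equivalent retaliation: cooperate first, then mirror the
# opponent's previous move.
def tit_for_tat(opponent_history):
    if not opponent_history:
        return True                  # initially cooperate
    return opponent_history[-1]      # then copy the opponent's last action

moves = []
opponent = [True, True, False, False, True]
history = []
for move in opponent:
    moves.append(tit_for_tat(history))  # our move, based on what we saw so far
    history.append(move)                # record the opponent's actual move
print(moves)   # [True, True, True, False, False]
```

Note the one-round lag: the agent punishes a defection only on the following round.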
Bittorrent
Fairness of download and upload between a
pair of peers
Every 10 seconds, estimate the download
bandwidth from the other peer
– Based on this estimate, decide whether to
continue uploading to the other peer
Client & its Peers
Client
– Download rate (from the peers)
Peers
– Upload rate (to the client)
BT Choking by Client
By default, every peer is “choked”
– stop “uploading” to them, but the TCP connection is
still there.
Select four peers to “unchoke”
– Best “upload rates” and “interested”.
– Uploading to the unchoked ones and monitor the
download rate for all the peers
– “Re-choke” every 30 seconds
Optimistic Unchoking
– Randomly select a choked peer to unchoke
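One re-choke round can be sketched as follows. The split into 3 regular slots plus 1 optimistic slot, and the `rate`/`interested` fields, are assumed simplifications of the algorithm above, not the exact client logic:

```python
import random

# A sketch of one choking round: unchoke the best-rated interested peers,
# plus one randomly chosen choked peer (the "optimistic unchoke").
def rechoke(peers, regular_slots=3, optimistic_slots=1):
    # peers: {name: {"rate": observed transfer rate, "interested": bool}}
    interested = [p for p, info in peers.items() if info["interested"]]
    best = sorted(interested, key=lambda p: peers[p]["rate"], reverse=True)
    unchoked = set(best[:regular_slots])
    choked = [p for p in interested if p not in unchoked]
    if choked:
        # optimistic unchoke: give a choked peer a chance to prove itself
        unchoked.update(random.sample(choked, optimistic_slots))
    return unchoked

peers = {f"p{i}": {"rate": i * 10, "interested": True} for i in range(6)}
print(sorted(rechoke(peers)))   # the three fastest peers plus one random pick
```

The optimistic slot is what lets a newcomer with no history ever get unchoked at all.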
“Interested”
A request for a piece (or its sub-pieces)
Becoming “seed”
Use “upload” rate to the peers to decide
which peers to unchoke.
Bittorrent Wiki
[Screenshot omitted.]
BT Peer Selection
From the “Tracker”
– We receive a partial list of all active peers for
the same file
– We can get another 50 from the tracker if we
want
Piece Selection
Pieces (64 KB~1 MB) are divided into sub-pieces (16 KB)
– Piece size: a trade-off between performance and the size
of the torrent file itself
– A client might request different sub-pieces of the same
piece from different peers.
Strict priority: the remaining sub-pieces of a started piece are requested before any new piece
Rarest First
– Exception: “random first”
– Get the stuff out of Seed(s) as soon as possible..
Rarest First
Exchanging bitmaps with 20+ peers
– Initial messages
– “have” messages
Array of buckets
– The ith bucket contains “pieces” with i known
instances
– Within the same bucket, the client randomly
selects one piece.
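A sketch of this bucket-based selection, assuming each peer's holdings arrive as a simple bitmap (the function and variable names are mine):

```python
import random
from collections import Counter

# Rarest-first sketch: count how many peers advertise each piece (via their
# bitmaps), then pick randomly inside the rarest (emptiest) bucket.
def rarest_first(peer_bitmaps, have_already):
    counts = Counter()
    for bitmap in peer_bitmaps:               # one "have" bitmap per peer
        for piece, held in enumerate(bitmap):
            if held and piece not in have_already:
                counts[piece] += 1
    if not counts:
        return None                           # nothing left to request
    rarest = min(counts.values())             # smallest known-instance count
    bucket = [p for p, c in counts.items() if c == rarest]
    return random.choice(bucket)              # random choice within the bucket

bitmaps = [
    [1, 1, 0, 1],   # peer 1
    [1, 0, 0, 1],   # peer 2
    [1, 1, 0, 1],   # peer 3
]
print(rarest_first(bitmaps, have_already=set()))   # 1: held by only two peers
```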
Random-First
Usually, the rarest pieces really are rare: the
client would have to get all of their sub-pieces
from one or very few peers.
For the first 4~5 pieces, the client instead requests
random pieces, so that it quickly has a few pieces
to upload.
BitTorrent
Connect to the Tracker
Connect to 20+ peers
Random-first or Rarest-first
Monitoring the download rate from the
peers (or upload rate to the client)
Unchoke and Optimistic Unchoke
Bittorrent
Advantages
Disadvantages
Trackerless Bittorrent
Every BT peer is a tracker!
But, how would they share and exchange
information regarding other peers?
Similar to Napster’s index server or DNS
Pure P2P
Every peer is a tracker
Every peer is a DNS server
Every peer is a Napster Index server
How can this be done?
– We try to remove/reduce the role of “special
servers”!
Peer
What are the requirements of a peer?
Structured Peering
Peer identity and routability
Key/content assignment
– Which identity owns what? (Google Search?)
Napster: centralized index service
Skype/Kazaa: login-server & super peers
DNS: hierarchical DNS servers
Two problems:
(1). How to connect to the “ring”?
(2). How to prevent failures/changes?
DHT
Distributed hash tables (DHTs)
– decentralized lookup service of a hash table
– (name, value) pairs stored in the DHT
– any peer can efficiently retrieve the value
associated with a given name
– the mapping from names to values is distributed
among peers
HT as a search table
Information/content is distributed, and we need to know where.
Index key:
– Where is this piece of music?
– What is the location of this type of content?
– What is the current IP address of this Skype user?
DHT as a search table
Index key → ??? (which peer holds the entry for this key?)
DHT
Scalable
Peer arrivals, departures, and failures
Unstructured versus structured
DHT (Name, Value)
How to utilize DHT to avoid Trackers in
Bittorrent?
DHT-based Tracker
– Index key: e.g., a hash for the FreeBSD 5.4 CD images; publish the key on the class web site.
– Value: the seed’s IP address.
– Whoever owns this hash entry is the tracker for the corresponding key!
– Operations: PUT & GET
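A toy, single-process stand-in for this scheme (the torrent name and addresses below are made up; in a real DHT the table is spread across the peers rather than held in one dict):

```python
import hashlib

# Sketch of a DHT-based tracker: the torrent's name hashes to an index key,
# and whichever peer owns that key acts as the tracker for it, storing the
# addresses of seeds/peers under PUT and answering GET.
class ToyDHT:
    def __init__(self):
        self.table = {}                      # in reality: spread across peers

    def key_for(self, torrent_name):
        return hashlib.sha1(torrent_name.encode()).hexdigest()

    def put(self, key, address):             # a seed registers itself
        self.table.setdefault(key, set()).add(address)

    def get(self, key):                      # a downloader asks the "tracker"
        return self.table.get(key, set())

dht = ToyDHT()
key = dht.key_for("FreeBSD 5.4 CD images")   # the key published on a web site
dht.put(key, "10.0.0.1:6881")                # the seed's (made-up) address
print(dht.get(key))
```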
Chord
Consistent Hashing
A Simple Key Lookup Algorithm
Scalable Key Lookup Algorithm
Node Joins and Stabilization
Node Failures
Chord
Given a key (data item), it maps the key
onto a peer.
Uses consistent hashing to assign keys to
peers.
Solves problem of locating key in a
collection of distributed peers.
Maintains routing information as peers join
and leave the system
Issues
Load balance: the distributed hash function spreads keys
evenly over peers
Decentralization: Chord is fully distributed; no node is
more important than any other, which improves robustness
Scalability: lookup costs grow logarithmically with the
number of peers in the network, so even very large systems
are feasible
Availability: Chord automatically adjusts its internal
tables to ensure that the peer responsible for a key can
always be found
Example Application
[Figure: a three-layer stack — File System over Block Store over Chord — spanning one client and two servers.]
The highest layer provides a file-like interface to the user, including user-friendly naming and authentication
This file system maps operations to lower-level block operations
Block storage uses Chord to identify the node responsible for storing a block and then talks to the block storage server on that node
Consistent Hashing
Consistent hash function assigns each peer and
key an m-bit identifier.
SHA-1 is used as a base hash function.
A peer’s identifier is defined by hashing the peer’s
IP address (and port).
A key identifier is produced by hashing the key
(chord doesn’t define this. Depends on the
application).
– ID(peer) = hash(IP, Port)
– ID(key) = hash(key)
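A small sketch of this ID assignment, using SHA-1 truncated to an assumed m = 8 bits for readability (Chord itself uses m = 160; the address and key strings are made up):

```python
import hashlib

# Derive m-bit Chord identifiers from SHA-1, as on the slide:
# ID(peer) = hash(IP, Port), ID(key) = hash(key), both taken modulo 2^m.
M = 8

def chord_id(data, m=M):
    digest = hashlib.sha1(data.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << m)   # identifier mod 2^m

peer_id = chord_id("10.0.0.1:6881")   # ID(peer) = hash(IP, Port)
key_id = chord_id("song.mp3")         # ID(key)  = hash(key)
print(0 <= peer_id < 2**M, 0 <= key_id < 2**M)   # True True
```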
Consistent Hashing
In an m-bit identifier space, there are 2^m identifiers.
Identifiers are ordered on an identifier circle modulo 2^m.
The identifier ring is called the Chord ring.
Key k is assigned to the first peer whose identifier
is equal to or follows (the identifier of) k in the
identifier space.
This peer is the successor peer of key k, denoted
by successor(k).
Consistent Hashing – Successor
[Figure: an identifier circle with identifiers 0–7 and peers at 0, 1, and 3. Key 1 → successor(1) = 1; key 2 → successor(2) = 3; key 6 → successor(6) = 0.]
Consistent Hashing – Join and
Departure
When a node n joins the network, certain
keys previously assigned to n’s successor
now become assigned to n.
When node n leaves the network, all of its
assigned keys are reassigned to n’s
successor.
Node Join
[Figure: node 6 joins the ring {0, 1, 3}; the keys in (3, 6] that its successor (node 0) used to hold now become assigned to node 6.]
Node Departure
[Figure: node 6 leaves the ring {0, 1, 3, 6}; all of its keys are reassigned to its successor, node 0.]
Technical Issues
???
A Simple Key Lookup
A very small amount of routing information suffices
to implement consistent hashing in a distributed
environment
If each node knows only how to contact its current
successor node on the identifier circle, all nodes can
be visited in linear order.
Queries for a given identifier could be passed
around the circle via these successor pointers until
they encounter the node that contains the key.
A Simple Key Lookup
Pseudo code for finding successor:
// ask node n to find the successor of id
n.find_successor(id)
if (id ∈ (n, successor])
return successor;
else
// forward the query around the circle
return successor.find_successor(id);
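The pseudocode above can be run as a sketch on a sorted list of node IDs (the node IDs below follow the N8, N14, N21… example used in these slides; the helper names are mine):

```python
# Linear Chord lookup: forward the query around the circle via successor
# pointers until the key falls into (n, successor].
def in_half_open(x, a, b):
    """x ∈ (a, b] on the identifier circle."""
    if a < b:
        return a < x <= b
    return x > a or x <= b          # the interval wraps past zero

def find_successor(nodes, start, ident):
    succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}
    n, hops = start, 0
    while not in_half_open(ident, n, succ[n]):
        n = succ[n]                 # forward the query around the circle
        hops += 1
    return succ[n], hops

nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # example ring
print(find_successor(nodes, 8, 54))              # (56, 7): linear, O(N) hops
```

Starting at node 8, the query for key 54 crawls through 14, 21, 32, 38, 42, 48, and 51 before reaching 56, which is exactly the linear-order behavior the text describes.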
A Simple Key Lookup
The path taken by a query from node 8 for
key 54:
Successor
Each active node MUST know the IP
address of its successor!
– N8 has to know that the next node on the ring is
N14.
Departure: if N14 leaves, N8’s successor becomes N21
But, how about failure or crash?
Robustness
Successors within R hops
– N8 ⇒ N14, N21, N32, N38 (R = 4)
– Periodic pinging along the path to check
liveness, and also to discover “new members”
that may have joined in between
Is that good enough?
Complexity of the search
Time/messages: O(N)
– N: # of nodes on the Ring
Space: O(1)
– We only need to remember R IP addresses
Stabilization depends on the “period”.
Scalable Key Location
To accelerate lookups, Chord maintains
additional routing information.
This additional information is not essential
for correctness, which is achieved as long as
each node knows its correct successor.
Scalable Key Location – Finger Tables
Each node n maintains a routing table with up to m
entries (m is the number of bits in identifiers),
called the finger table.
The ith entry in the table at node n contains the
identity of the first node s that succeeds n by at least
2^(i-1) on the identifier circle:
s = successor(n + 2^(i-1)).
s is called the ith finger of node n, denoted by
n.finger(i)
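This definition can be checked on a tiny ring (a sketch with m = 3 and nodes {0, 1, 3}; the helper names are mine):

```python
# Build finger tables on a 3-bit ring, per s = successor(n + 2^(i-1)).
m = 3
nodes = sorted([0, 1, 3])
ring = 1 << m

def successor(ident):
    ident %= ring
    for node in nodes:
        if node >= ident:
            return node
    return nodes[0]          # wrap around past the highest node

def finger_table(n):
    return [successor(n + (1 << (i - 1))) for i in range(1, m + 1)]

print(finger_table(0))   # [1, 3, 0]: successors of 1, 2, and 4
print(finger_table(1))   # [3, 3, 0]: successors of 2, 3, and 5
print(finger_table(3))   # [0, 0, 0]: successors of 4, 5, and 7
```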
Scalable Key Location – Finger Tables
[Figure: a 3-bit ring with nodes 0, 1, and 3 and keys 1, 2, and 6.
Node 0’s finger table: start 1 → succ. 1; start 2 → succ. 3; start 4 → succ. 0 (holds key 6).
Node 1’s finger table: start 2 → succ. 3; start 3 → succ. 3; start 5 → succ. 0 (holds key 1).
Node 3’s finger table: start 4 → succ. 0; start 5 → succ. 0; start 7 → succ. 0 (holds key 2).]
Finger Tables
A finger table entry includes both the Chord
identifier and the IP address (and port
number) of the relevant node.
The first finger of n is the immediate
successor of n on the circle.
Scalable Key Location – Example Query
The path taken by a query for key 54 starting at node 8:
[Figure omitted: each finger hop at least halves the remaining distance to the target.]
Scalable Key Location – A
characteristic
Since each node has finger entries at power of two
intervals around the identifier circle, each node
can forward a query at least halfway along the
remaining distance between the node and the
target identifier. From this intuition follows a
theorem:
Theorem: With high probability, the number of nodes
that must be contacted to find a successor in an N-node
network is O(logN).
Complexity of the Search
Time/messages: O(log N)
– N: # of nodes on the Ring
Space: O(log N)
– We need to remember R IP addresses
– We need to remember log N Fingers
Stabilization depends on the “period”.
An Example
M = 4096 (identifier size): the identifier space has 2^4096 identifiers.
N = 2^16 (# of nodes)
How many entries do we need in the Finger Table?
Recall: each node maintains a routing table with up to M entries
(the number of bits in identifiers), called the finger table.
The ith entry in the table at node n contains the identity of
the first node s that succeeds n by at least 2^(i-1) on the
identifier circle:
s = successor(n + 2^(i-1)).
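The answer follows directly from the definition: the finger table has M = 4096 entries, one per identifier bit, even though with only N = 2^16 nodes most entries end up pointing at the same few successors (about log2 N = 16 distinct ones). A scaled-down analogue (an assumed 16-bit ring with 64 random nodes, standing in for M = 4096 and N = 2^16) illustrates the duplication:

```python
import random

# Scaled-down analogue: m-bit ring with n random nodes; count how many of
# node 0's m finger entries actually point to distinct successors.
m, n = 16, 64
ring = 1 << m
random.seed(1)
nodes = sorted(random.sample(range(ring), n))

def successor(ident):
    ident %= ring
    for node in nodes:
        if node >= ident:        # first node at or past this identifier
            return node
    return nodes[0]              # wrap around the ring

node = nodes[0]
fingers = [successor(node + (1 << (i - 1))) for i in range(1, m + 1)]
print(len(fingers))        # m = 16 entries, one per identifier bit
print(len(set(fingers)))   # ...but only a handful of distinct successors
```

With M = 4096 the same effect is far more extreme: thousands of the short-distance fingers collapse onto the node's immediate successor.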
Complexity of the Search
Time/messages: O(M)
– M: # of bits of the identifier
Space: O(M)
– We need to remember R IP addresses
– We need to remember M Fingers
Stabilization depends on the “period”.
Structured Peering
Peer identity and routability
– 2^M identifiers, Finger Table routing
Key/content assignment
– Hashing
Dynamics/Failures
– Inconsistency??
Node Joins and
Stabilizations
The most important thing is the successor pointer.
If the successor pointer is kept up to date (which is
sufficient to guarantee correct lookups), then the
finger tables can always be verified and repaired.
Each node runs a “stabilization” protocol
periodically in the background to update successor
pointer and finger table.
Node Joins and
Stabilizations
The “stabilization” protocol contains 6
functions:
– create( )
– join( )
– stabilize( )
– notify( )
– fix_fingers( )
– check_predecessor( )
Node Joins – join()
When node n first starts, it calls n.join(n’),
where n’ is any known Chord node.
The join() function asks n’ to find the
immediate successor of n.
join() does not make the rest of the network
aware of n.
Node Joins – join()
// create a new Chord ring.
n.create()
predecessor = nil;
successor = n;
// join a Chord ring containing node n’.
n.join(n’)
predecessor = nil;
successor = n’.find_successor(n);
Node Joins – stabilize()
Each time node n runs stabilize(), it asks its
successor for its predecessor p, and decides
whether p should be n’s successor instead.
stabilize() notifies node n’s successor of n’s
existence, giving the successor the chance to
change its predecessor to n.
The successor does this only if it knows of no
closer predecessor than n.
Node Joins – stabilize()
// called periodically. verifies n’s immediate
// successor, and tells the successor about n.
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);
// n’ thinks it might be our predecessor.
n.notify(n’)
if (predecessor is nil or n’ ∈ (predecessor, n))
predecessor = n’;
Node Joins – Join and Stabilization
[Figure: node n joins the ring between np and ns; succ(np) changes from ns to n, and pred(ns) changes from np to n.]
• n joins:
– predecessor = nil
– n acquires ns as successor via some n’
• n runs stabilize:
– n notifies ns of being the new predecessor
– ns acquires n as its predecessor
• np runs stabilize:
– np asks ns for its predecessor (now n)
– np acquires n as its successor
– np notifies n
– n will acquire np as its predecessor
• All predecessor and successor pointers are now correct; fingers still need to be fixed, but old fingers will still work.
Node Joins – fix_fingers()
Each node periodically calls fix fingers to
make sure its finger table entries are correct.
It is how new nodes initialize their finger
tables
It is how existing nodes incorporate new
nodes into their finger tables.
Node Joins – fix_fingers()
// called periodically. refreshes finger table entries.
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2^(next-1));
// checks whether predecessor has failed.
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;
Node Failures
Key step in failure recovery is maintaining correct successor pointers
To help achieve this, each node maintains a successor-list of its r nearest
successors on the ring
If node n notices that its successor has failed, it replaces it with the first
live entry in the list
Successor lists are stabilized as follows:
– node n reconciles its list with its successor s by copying s’s successor list,
removing its last entry, and prepending s to it.
– If node n notices that its successor has failed, it replaces it with the first
live entry in its successor list and reconciles its successor list with its new
successor.
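A sketch of this list maintenance (r = 3, with node IDs following the N8 ⇒ N14, N21, N32 example from earlier slides; the function names are mine):

```python
# Successor-list maintenance: reconcile by copying the successor's list,
# dropping its last entry, and prepending the successor itself; on failure,
# fall back to the first live entry.
R = 3

def reconcile(successor, successors_list_of_successor):
    return [successor] + successors_list_of_successor[:R - 1]

def on_successor_failure(successor_list, alive):
    # replace a failed successor with the first live entry in the list
    for node in successor_list:
        if node in alive:
            return node
    raise RuntimeError("all r successors failed")

# N8 reconciles with its successor N14, whose own list is [21, 32, 38]:
n8_list = reconcile(14, [21, 32, 38])
print(n8_list)                                         # [14, 21, 32]
print(on_successor_failure(n8_list, alive={21, 32}))   # 14 failed -> 21
```

The ring stays connected as long as at least one of the r listed successors survives between stabilization rounds.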
Chord – The Math
Every node is responsible for about K/N keys (N nodes,
K keys)
When a node joins or leaves an N-node network, only
O(K/N) keys change hands (and only to and from
joining or leaving node)
Lookups need O(log N) messages
To reestablish routing invariants and finger tables after
a node joins or leaves, only O(log²N) messages are
required