Transcript Slides

CS522: Algorithmic and Economic
Aspects of the Internet
Instructors:
Nicole Immorlica ([email protected])
Mohammad Mahdian ([email protected])
Peer-To-Peer Networks



A network in which nodes employ distributed
resources to accomplish critical task
Nodes are typically equals, i.e.
(approximately) indistinguishable in
functionality
System is highly dynamic, nodes frequently
come and go
P2P Definition
Distributed systems consisting of interconnected
nodes able to self-organize into network topologies
with the purpose of sharing resources such as
content, CPU cycles, storage and bandwidth,
capable of adapting to failures and accommodating
transient populations of nodes while maintaining
acceptable connectivity and performance, without
requiring the intermediation or support of a global
centralized server or authority.
– A Survey of Peer-To-Peer Content Distribution Technologies,
Androutsellis-Theotokis and Spinellis
Peer-To-Peer Applications



Direct real-time communication: instant
messaging
Combine processing power of multiple
distributed machines to perform complex
computations: analysis of SETI data, prime
computation
Store and distribute digital content: mp3 file
sharing
Peer-To-Peer Benefits




Self-organized and adaptive
Easily scalable
Fault-tolerant and load balanced
Resistant to censorship
P2P Construction
Clients
Servers
P2P overlay network
SPRINT
AOL
MIT
UUNET
P2P Classification
Data organization
Centralization
Unstructured
Hybrid
Napster, IM
Partial
Kazaa, Gia
None
Gnutella
Loosely
Structured
Highly
Structured
Freenet
Chord, CAN
Napster, IM
2
3
1
Server
7
File Transfer
6



4
5
Centralized servers maintain list of files and
peer at which file is stored
Peers join, leave, and query network via
direct communication with servers
File transfers occur directly between peers
Napster, IM

Advantages:



Highly efficient data lookup
Rapidly adapts to changes in network
Disadvantages:


Questionable scalability
Vulnerable to censorship, failure, attack
Gnutella




All peers, called servents, are identical and
function as both servers and clients
A peer joins network by contacting existing
servents (chosen from online databases)
using PING messages
A servent receiving a PING message replies
with a PONG message and forwards PING to
other servents
Peer connects to servents who send PONG
Gnutella


A servent queries network by sending a
QUERY message
A servent receiving a QUERY message
replies with a QUERYHIT message if he can
answer the query. If not, he forwards
QUERY message to other servents
Routing in Gnutella


How PING/QUERY messages are forwarded
affects network topology, search
efficiency/accuracy, and scalability
Proposals


Breadth-First-Search: flooding, iterative
deepening, modified random BFS
Depth-First-Search: random walk, k-walker
random walks, two-level random walk, dominating
set based search
Hybrid Search


Random walk with lookahead: short random
walks with shallow local flooding
Takes advantage of “supernodes” or nodes of
high degree


Stationary dist. of random walk is naturally biased
towards supernodes
Lookahead allows search to quickly discover
content stored at all neighbors of these highdegree nodes
Supernodes




Improve scalability and performance of Gnutella-like
systems via supernodes
Supernodes are special peers with high degree,
elected dynamically according to bandwidth and
other considerations
Supernodes maintain a list of content stored at
peers
Advantages:


Searches propagate on supernodes, 3 to 5 times faster
Takes advantage of heterogeneity in network
Gnutella

Advantages



Entirely decentralized, pure P2P network
Highly resistant to failure
Disadvantages


Search is time-consuming
Network typically scales poorly
Chord





Distributed hash table (DHT) implementation
Each node/piece of content has an ID
Content IDs are deterministically mapped to
node IDs so a searcher knows exactly where
data is located, a content addressable
network
Efficient: O(log n) messages per lookup
Scalable: O(log n) state per node
Keys in Chord





m bit identifier space for both nodes and
content keys
Content ID = hash(content)
Node ID = hash(IP address)
Both are uniformly distributed
How to map content IDs to node IDs?
Mapping Content to Nodes
0 K5
IP=“198.10.10.1”
N123
K101
N90
K20
Circular 7-bit
ID space
N32
Content = “U2”
K60
Figure adapted from Stoica et al.
Content is stored at successor node, node with next higher ID
Routing

Every node knows of every other node

Routing tables O(n), lookup O(1)
N10
Where is “U2”?
Hash(“U2”) = K60
N123
N32
“N90 has K60”
K60
N90
N55
Figure adapted from Stoica et al.
Routing

Every node knows its successor in ring

Routing tables O(1), lookup O(n)
N10
Where is “U2”?
Hash(“U2”) = K60
N123
N32
“N90 has K60”
K60
N90
N55
Figure adapted from Stoica et al.
Routing



Every node knows m others
Distances increase exponentially, node i
points to node whose ID is successor of i + 2j
for j from 1 to m. These pointers are called
fingers.
The finger (routing) table and search time are
both O(log n)
Finger Tables
N112
80 + 25
N16
80 + 26
N96
80 + 24
80 + 23
80 + 22
80 + 21
80 + 20
N80
Figure adapted from Stoica et al.
Routing with Finger Tables
N5
N10
N110
N20 K19
N99
N32 Lookup(K19)
N80
N60
Figure adapted from Stoica et al.
Chord Dynamics

When a node joins




Initialize all fingers of new node
Update fingers of existing nodes
Transfer content from successor to new node
When a node leaves

Transfer content to successor
Chord Failures



Churn rate is very high (on average, nodes
are in system for only 60 minutes) and events
happen concurrently
Churn (esp. ungraceful departures or
simultaneous joins/departures) can failure
states, e.g. inconsistencies in successor
relationships or, worse, loopy states
Requires a lot of maintenance messages to
preserve ideal state
Maintenance in P2P

Maintenance protocol ensures global connectivity
and efficient lookup by continuously repairing
overlay network and routing tables

Maintenance is essential, e.g. when a node




Joins and announces presence
Updates routing table to ensure efficient search
Monitors neighbors for failures/departures
Cost of maintenance protocol can be measured in
terms of the rate of maintenance messages
Half Life

Defn. Suppose there are Nt nodes at time t.
Let the doubling time t be such that at time t
+ t, Nt new nodes have arrived. Similarly let
halving time t be such that at time t + t, Nt/2
nodes have departed. Then the half life of a
system is mint(t, t).


The half life is the average amount of time until
half the system has been replaced.
Measures rate of change of system.
Example




Nodes arrive according to Poisson with rate
: prob. k arrivals in time t proportional to e-t
Nodes remain for duration exponential rate :
prob. node stays for  amount of time is e-
If system in steady state, then arrival rate 
must equal departure rate  N, so N = / .
Doubling time  = N/ = 1/, halving time 
= (ln 2)/, and so half life = (ln 2)/.
Bounding Maintenance Costs


Thm. There exists a sequence of joins and
leaves such that any node that, at any time,
has received an average of fewer than k
notifications per half-life will be disconnected
from the network with prob. at least (1 – 1/e)k.
Cor. Any N-node P2P network that remains
connected with prob. at least 1 – 1/N must
generate an average of (log N) notifications
per node per half life.
Proof of Bound



Consider Poisson arrival rate , exponential
waiting time =1 in system.
Suppose node n averages fewer than k
notifications per half life and so there is a
minimum time t such that at time t, n has
received less than kt notifications.
Observe n is isolated at time t with probability
at least (1 – 1/(e-1))k.
Maintenance in Chord

Liben-Nowell, Balakrishnan, Karger:
(Modified) Chord requires only O(log2 n)
maintenance messages per half life to
maintain efficiency and correctness of
search.
Chord

Advantages:



Highly efficient search
Good load balancing
Disadvantages:



Locality of data is destroyed
Only handles exact match queries, but keyword
queries are more prevalent
Most requests are for highly replicated files
(needles vs haystack)
Conclusion


Saw several representative P2P systems,
each with advantages and disadvantages
Many important issues





Efficiency of search
Ability to adapt to dynamics of system
Security: Malicious peers, Spread of worms
Free riding: Reputation mechanisms, Micropayment mechanisms
Legality