Network Applications
and Network Programming:
Web and P2P
9/28/2009
1
Admin.
 Submit programming assignment 0 using the class server
 if you want, you can also send an email attachment to the TA

2
Recap: FTP, HTTP
 FTP: file transfer
 ASCII (human-readable format) requests and responses
 stateful server
 one data channel and one control channel
 HTTP
 extensibility: ASCII request line, header lines, entity body, and response status line
 scalability/robustness
• stateless server (each request should contain the full information); DNS load balancing
• client caching, Web caches
 one channel for both control and data
3
Recap: WebServer Flow
Create ServerSocket(6789)
connSocket = accept()
read request from connSocket
read local file
write file to connSocket
close connSocket
[Figure: the TCP socket space alongside the flow — a listening socket with state: listening, address {*.6789, *.*} and a completed-connection queue; one established socket per accepted connection, e.g. address {128.36.232.5:6789, 198.69.10.10:1500}, each with its own sendbuf and recvbuf; another listening socket, e.g. {*.25, *.*}, belongs to a different service]
Recap: Writing High Performance Servers
 Major issue: many socket/IO operations can cause processing to block, e.g.,
 accept: waiting for a new connection
 read on a socket: waiting for data or close
 write on a socket: waiting for buffer space
 I/O read/write waiting for the disk to finish
 Thus a crucial aspect of network server design is the concurrency design (non-blocking)
 for high performance
 to avoid denial of service
 A technique to avoid blocking: threads
Multi-Threaded Web Server
Create ServerSocket(6789)
connSocket = accept()
Create thread for connSocket
Each thread then independently: reads the request from connSocket, reads the local file, writes the file to connSocket, and closes connSocket. (A Java sketch of this structure follows.)
6
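To make the thread-per-connection structure above concrete, here is a minimal Java sketch (an illustration under simplifying assumptions, not the course's reference server; HTTP parsing and error handling are reduced to the bare minimum, and port 6789 follows the slides):

import java.io.*;
import java.net.*;
import java.nio.file.*;

public class MultiThreadedWebServer {
    public static void main(String[] args) throws IOException {
        ServerSocket welcomeSocket = new ServerSocket(6789);   // listening socket
        while (true) {
            Socket connSocket = welcomeSocket.accept();        // blocks until a client connects
            new Thread(() -> handle(connSocket)).start();      // one thread per connection
        }
    }

    private static void handle(Socket connSocket) {
        try (connSocket;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(connSocket.getInputStream()));
             OutputStream out = connSocket.getOutputStream()) {
            String requestLine = in.readLine();                // e.g. "GET /index.html HTTP/1.0"
            if (requestLine == null) return;
            String[] parts = requestLine.split(" ");
            if (parts.length < 2) return;
            byte[] body = Files.readAllBytes(Paths.get(parts[1].substring(1))); // read the local file
            out.write("HTTP/1.0 200 OK\r\n\r\n".getBytes());
            out.write(body);                                   // write the file to connSocket
        } catch (IOException e) {
            // on any error (missing file, broken connection) just drop the connection
        }
    }
}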
Recap: Writing High Performance Servers
 Problems of multiple threads
 too many threads → throughput meltdown, response time explosion
Event-Driven Programming
 Event-driven programming, also called asynchronous I/O
 Tell the OS not to block when accepting/reading/writing on sockets
 Java: asynchronous I/O
 for an example see: http://www.cafeaulait.org/books/jnp3/examples/12/
 Yields efficient and scalable concurrency
 Many examples: Click router, Flash web server, TP Monitors, etc.
Web Server
If the OS will not block on sockets, what might the program structure look like?
[Figure: the same per-connection flow as above — accept, read request, read local file, write file, close connSocket — but without creating one thread per connection]
9
Typical Structure of Async I/O
 Typically, async I/O programs use Finite State Machines (FSMs) to monitor the progress of requests
 the state info keeps track of the execution stage of processing each request, e.g., reading request, writing reply, …
 The program has a loop to check potential events at each state
10
Async I/O in Java
 An important class is Selector, which supports the event loop
 A Selector is a multiplexer of selectable channel objects
 example channels: DatagramChannel, ServerSocketChannel, SocketChannel
 use configureBlocking(false) to make a channel non-blocking
 A selector may be created by invoking the open method of this class
Async I/O in Java
 A selectable channel registers events (represented by a SelectionKey) with a selector via the register method
 A SelectionKey object contains two operation sets: the interest set and the ready set
 A SelectionKey object has an attachment which can store data
 often the attachment is a buffer
[Figure: a SelectableChannel registers with a Selector, producing a SelectionKey]
Async I/O in Java
 Call select() (or selectNow(), or select(int timeout)) to check for ready events, called the selected key set
 Iterate over the set to process all ready events (see the sketch below)
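A minimal sketch of the Selector-based event loop just described (an illustrative echo-style server, not the cafeaulait example linked above; a real web server would also register OP_WRITE interest and keep per-request FSM state in the attachment):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();                       // create the selector
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(6789));
        server.configureBlocking(false);                           // non-blocking accept
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                     // wait until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();                                       // keys are not removed automatically
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    // the attachment holds per-connection state (here, a read buffer)
                    client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(4096));
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = (ByteBuffer) key.attachment();
                    int n = client.read(buf);
                    if (n == -1) { client.close(); continue; }     // peer closed the connection
                    buf.flip();
                    client.write(buf);                             // echo back what was read
                    buf.compact();
                }
            }
        }
    }
}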
Problems of Event-Driven Server
 Difficult to engineer, modularize, and tune
 No performance/failure isolation between Finite-State Machines (FSMs)
 FSM code can never block (but page faults, I/O, garbage collection may still force a block)
 thus still need multiple threads
Summary of Traditional C-S Web Servers
 Is the application extensible, scalable, robust, secure?
[Figure: DNS plus a single app. server with uplink C0 serving clients 1 through n]
15
Content Distribution History...
“With 25 years of Internet experience,
we’ve learned exactly one way to deal with
the exponential growth: Caching”.
(1997, Van Jacobson)
16
Web Caches (Proxy)
 A Web cache/proxy is placed at the entrance of an ISP
 The client sends all HTTP requests to the web cache
 if the object is at the web cache, the web cache immediately returns the object in an HTTP response
 else it requests the object from the origin server, then returns the HTTP response to the client
[Figure: clients send requests to a proxy server, which fetches from the origin servers on a miss]
17
Web Proxy/Cache
 Web caches give good performance because very often
 a single client repeatedly accesses the same document
 a nearby client also accesses the same document
 The cache hit ratio increases logarithmically with the number of users
[Figure: an app. server with uplink C0 feeding two ISP caches, each serving several clients]
18
Benefits of Web Caching
Assume: the cache is “close” to the client (e.g., in the same network)
 smaller response time: the cache is “closer” to the client
 decreased traffic to distant servers
 the link out of the institutional/local ISP network is often the bottleneck
[Figure: an institutional network with a 10 Mbps LAN and an institutional cache, reaching origin servers across the public Internet over a 1.5 Mbps access link]
19
What went wrong with Web Caches?
 Web protocols evolved extensively to accommodate caching, e.g., HTTP/1.1
 However, Web caching was developed with a strong ISP perspective, leaving content providers out of the picture
 it is the ISP who places a cache and controls it
 an ISP's only interest in using Web caches is to reduce bandwidth
 In the USA: bandwidth is relatively cheap
 In Europe, there were many more Web caches
 However, ISPs can arbitrarily tune Web caches to deliver stale content
20
Content Provider Perspective
 Content providers care about
 user experience (latency)
 content freshness
 accurate access statistics
 avoiding flash crowds
 minimizing bandwidth usage on their access link
21
Content Distribution Networks
 Content Distribution Networks (CDNs) build an overlay network of caches to provide fast, cost-effective, and reliable content delivery, while working tightly with content providers.
 Example:
 Akamai – the original and largest commercial CDN; operates over 25,000 servers in over 1,000 networks
 Akamai (AH kuh my) is Hawaiian for intelligent, clever and informally “cool”. Founded Apr 99, Boston MA by MIT students
22
Basics of Akamai Operation
 Content provider server
 provides the base HTML document
 Akamai caches embedded objects at a set of its cache servers (called edge servers)
 Akamaization of embedded content, e.g.,
<IMG SRC="http://www.provider.com/image.gif">
changed to
<IMG SRC="http://a661.g.akamai.net/hash/image.gif">
 Akamai customizes DNS to select serving edge servers based on
 closeness to the client browser
 server load
23
More Akamai information
 URL akamaization is becoming obsolete and is only supported for legacy reasons
 currently most content providers prefer to use DNS CNAME techniques to get all their content served from the Akamai servers
 content providers still need to run their origin servers
 Akamai evolution:
 files/streaming
 secure pages and whole pages
 dynamic page assembly at the edge (ESI)
 distributed applications
24
Discussion: Problems of Traditional Content Distribution
[Figure: DNS plus a single app. server with uplink C0 serving clients 1 through n]
25
Objectives of P2P
 Share the resources (storage and bandwidth) of individual clients to improve scalability/robustness
 Bypass DNS to find clients with resources!
 examples: instant messaging, Skype
26
P2P
 But P2P is not new
 The original Internet was a p2p system:
 the original ARPANET connected UCLA, Stanford Research Institute, UCSB, and Univ. of Utah
 no DNS or routing infrastructure, just connected by phone lines
 computers also served as routers
 P2P is simply an iteration of scalable distributed systems
P2P Systems
 File sharing: BitTorrent, LimeWire
 Streaming: PPLive, PPStream, Zattoo, …
 Research systems
 Collaborative computing:
 SETI@Home project
• human genome mapping
• Intel NetBatch: 10,000 computers in 25 worldwide sites for simulations, saved about $500 million
Peer-to-Peer Computing
- 40-70% of total traffic in many networks
- has upset the music industry and drawn college students, web developers, recording artists and universities into court
Source: ipoque Internet Study 2008/2009
Recap: P2P Objectives
 Bypass DNS to locate clients with resources!
 examples: instant messaging, Skype
 Share the storage and bandwidth of individual clients to improve scalability/robustness
30
The Lookup Problem
[Figure: a publisher holding Key=“title”, Value=MP3 data… and a client issuing Lookup(“title”) across nodes N1-N6 on the Internet]
find where a particular file is stored; pay particular attention to its equivalence to DNS
Outline
 Recap
 P2P
 the lookup problem
 Napster
32
Centralized Database: Napster
 Program for sharing music over the Internet
 History:
 5/99: Shawn Fanning (freshman, Northeastern U.) founded the Napster Online music service; wrote the program in 60 hours
 12/99: first lawsuit
 3/00: 25% of UWisc traffic was Napster
 2000: est. 60M users
 2/01: US Circuit Court of Appeals: Napster knew users were violating copyright laws
 7/01: # of simultaneous online users: Napster 160K
 9/02: bankruptcy
We are referring to the Napster before its closure.
[Figure: Napster usage, 03/2000]
33
Napster: How Does it Work?
Application-level, client-server protocol over TCP
A centralized index system maps files (songs) to machines that are alive and have the files
Steps (a toy index sketch follows this slide):
 Connect to the Napster server
 Upload your list of files (push) to the server
 Give the server keywords to search the full list with
 Select the “best” of the hosts that answer
34
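The central index can be pictured as little more than a map from file names to live hosts. The following toy sketch is hypothetical (class and method names are invented for illustration; this is not Napster's actual wire protocol):

import java.util.*;
import java.util.concurrent.*;

// A toy centralized index in the spirit of Napster's server:
// peers publish their file names, clients search by keyword.
public class CentralIndex {
    // file name -> set of "host:port" strings that claim to have it
    private final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    public void publish(String host, List<String> fileNames) {
        for (String name : fileNames) {
            index.computeIfAbsent(name.toLowerCase(), k -> ConcurrentHashMap.newKeySet())
                 .add(host);
        }
    }

    public Map<String, Set<String>> search(String keyword) {
        Map<String, Set<String>> hits = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            if (e.getKey().contains(keyword.toLowerCase())) {
                hits.put(e.getKey(), e.getValue());
            }
        }
        return hits;   // the client then pings the hosts and fetches from the "best" one
    }

    public static void main(String[] args) {
        CentralIndex server = new CentralIndex();
        server.publish("123.2.21.23:6699", List.of("songX.mp3", "songY.mp3"));
        System.out.println(server.search("songx"));   // -> {songx.mp3=[123.2.21.23:6699]}
    }
}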
Napster Architecture
[Figure: clients connect to the central napster.com index server; file transfers happen peer to peer]
35
Napster: Publish
[Figure: the peer at 123.2.21.23 announces “Publish: I have X, Y, and Z!”; the server records insert(X, 123.2.21.23), …]
Napster: Search
[Figure: a client asks “Query: Where is file A?”; the server’s search(A) returns hosts such as 123.2.0.18 and 124.1.0.1 in the Reply]
Napster: Ping
[Figure: the client pings the candidate hosts 123.2.0.18 and 124.1.0.1 to pick the best one]
Napster: Fetch
[Figure: the client fetches the file directly from the chosen peer]
Napster Messages
General packet format: [chunksize] [chunkinfo] [data...]
CHUNKSIZE: little-endian (“Intel-endian”) 16-bit integer; the size of [data...] in bytes
CHUNKINFO: little-endian 16-bit integer giving the message type (hex), e.g.:
00 - login rejected            5B - whois query
02 - login requested           5C - whois result
03 - login accepted            5D - whois: user is offline!
0D - challenge? (nuprin1715)   69 - list all channels
2D - added to hotlist          6A - channel info
2E - browse error (user isn't online!)   90 - join channel
2F - user offline              91 - leave channel
…
(a header-parsing sketch follows)
40
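Because CHUNKSIZE and CHUNKINFO are little-endian 16-bit integers, a client must read them with an explicit byte order. A small illustrative sketch (field names follow the slide, not an official specification):

import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NapsterHeader {
    // Reads the 4-byte Napster-style header: [chunksize][chunkinfo], both
    // little-endian ("Intel-endian") unsigned 16-bit integers.
    static int[] readHeader(DataInputStream in) throws IOException {
        byte[] raw = new byte[4];
        in.readFully(raw);
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        int chunkSize = buf.getShort() & 0xFFFF;   // size of [data...] in bytes
        int chunkInfo = buf.getShort() & 0xFFFF;   // message type, e.g. 0x02 = login requested (see table above)
        return new int[] { chunkSize, chunkInfo };
    }
}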
Centralized Database: Napster
 Summary of features: a hybrid design
 control: client-server (akin to a special-purpose DNS) for files
 data: peer to peer
 Advantages
 simplicity; easy to implement sophisticated search engines on top of the index system
 Disadvantages
 application specific (compared with DNS)
 lack of robustness, scalability: the central search server is a single point of bottleneck/failure
 easy to sue!
41
Variation: BitTorrent
 The global central index server is replaced by one tracker per file (per swarm)
 reduces centralization, but needs other means to locate trackers
 The bandwidth scalability management technique is more interesting
 more later
42
Outline
 Recap
 P2P
 the lookup problem
 Napster (central query server; distributed
data servers)
 Gnutella
43
Gnutella
 On March 14th 2000, J. Frankel and T.
Pepper from AOL’s Nullsoft division (also
the developers of the popular Winamp mp3
player) released Gnutella
 Within hours, AOL pulled the plug on it
 Quickly reverse-engineered and soon many
other clients became available: Bearshare,
Morpheus, LimeWire, etc.
44
Decentralized Flooding: Gnutella
 On startup, a client contacts other servents (server + client) in the network to form interconnection/peering relationships
 servent interconnections are used to forward control messages (queries, hits, etc.)
 How to find a resource record: decentralized flooding
 send requests to neighbors
 neighbors recursively forward the requests
45
Decentralized Flooding
[Figure: an unstructured overlay of nodes A through N and S]
46
Decentralized Flooding
send query to neighbors
[Figure: node S sends the query to its neighbors, who flood it onward]
Each node forwards the query to its neighbors other than the one that forwarded it the query.
47
Background: Decentralized Flooding
[Figure: the query propagates through the overlay; cycles are possible]
Each node should keep track of forwarded queries to avoid loops!
 nodes keep state (which will time out --- soft state)
 or carry the state in the query, i.e., carry a list of visited nodes
48
Decentralized Flooding: Gnutella
 Basic message header
 Unique ID, TTL, Hops
 Message types
 Ping – probes the network for other servents
 Pong – response to a Ping; contains IP addr, # of files, etc.
 Query – search criteria + speed requirement of the servent
 QueryHit – successful response to a Query; contains addr + port to transfer from, speed of the servent, etc.
 Pings and Queries are flooded (a forwarding sketch follows this slide)
 QueryHits and Pongs travel the reverse path of the previous message
49
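A sketch of the flooding rule for Pings and Queries, assuming each node remembers recently seen message IDs (soft state) and relays to every neighbor except the one the message arrived from; this is illustrative code, not an actual Gnutella implementation:

import java.util.*;

public class FloodingNode {
    static class Message {
        final UUID id;      // unique message ID from the Gnutella header
        int ttl;            // remaining time-to-live
        int hops;           // hops taken so far
        Message(UUID id, int ttl, int hops) { this.id = id; this.ttl = ttl; this.hops = hops; }
    }

    private final Set<UUID> seen = new HashSet<>();       // IDs already forwarded (soft state)
    private final List<String> neighbors = new ArrayList<>();

    void onMessage(Message m, String fromNeighbor) {
        if (!seen.add(m.id)) return;        // duplicate: already forwarded, avoid loops
        m.ttl--;
        m.hops++;
        if (m.ttl <= 0) return;             // TTL exhausted: drop (limits the flood's scope)
        for (String n : neighbors) {
            if (!n.equals(fromNeighbor)) {
                send(n, m);                 // relay to every neighbor except the sender
            }
        }
    }

    void send(String neighbor, Message m) {
        /* transport omitted in this sketch */
    }
}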
Advantages and Disadvantages of Gnutella
 Advantages:
 totally decentralized, highly robust
 Disadvantages:
 not scalable; the entire network can be swamped with flood requests
• especially hard on slow clients; at some point broadcast traffic on Gnutella exceeded 56 kbps
 to alleviate this problem, each request has a TTL to limit the scope
• each query has an initial TTL, and each node forwarding it reduces it by one; if the TTL reaches 0, the query is dropped (consequence?)
50
Flooding: FastTrack (aka Kazaa)
 Modifies the Gnutella protocol into a two-level hierarchy
 Supernodes
 nodes that have a better connection to the Internet
 act as temporary indexing servers for other nodes
 help improve the stability of the network
 Standard nodes
 connect to supernodes and report their list of files
 Search
 broadcast (Gnutella-style) search across supernodes
 Disadvantages
 kept a centralized registration → prone to lawsuits
Optional Slides
52
Optional Slides
53
Aside: Search Time?
Aside: All Peers Equal?
[Figure: measured peer access links range from 56 kbps modems through 1.5 Mbps DSL up to 10 Mbps LAN]
Aside: Network Resilience
[Figure: partial topology; a random 30% of nodes dying vs. a targeted 4% dying]
from Saroiu et al., MMCN 2002
Asynchronous Network Programming (C/C++)
57
A Relay TCP Client: telnet-like Program
[Figure: stdin/stdout (fgets/fputs) are relayed by the TCP client (writen/readn) to and from the TCP server]
http://zoo.cs.yale.edu/classes/cs433/programming/examples-c-socket/tcpclient
58
Method 1: Process and Thread
 Process
 fork()
 waitpid()
 Thread: lightweight process
 pthread_create()
 pthread_exit()
59
pthread
int main() {
  char recvline[MAXLINE + 1];
  ss = new socketstream(sockfd);
  pthread_t tid;
  if (pthread_create(&tid, NULL, copy_to, NULL)) {
    err_quit("pthread_create()");
  }
  // main thread: copy the socket to stdout
  while (ss->read_line(recvline, MAXLINE) > 0) {
    fprintf(stdout, "%s\n", recvline);
  }
  return 0;
}

// runs in a separate thread: copy stdin to the socket
void *copy_to(void *arg) {
  char sendline[MAXLINE];
  if (debug) cout << "Thread created!" << endl;
  while (fgets(sendline, sizeof(sendline), stdin))
    ss->writen_socket(sendline, strlen(sendline));
  shutdown(sockfd, SHUT_WR);
  if (debug) cout << "Thread done!" << endl;
  pthread_exit(0);
}
60
Method 2: Asynchronous I/O
(Select)
 select: deal with blocking system call
int select(int n, fd_set *readfds, fd_set
*writefds, fd_set *exceptfds, struct
timeval *timeout);
FD_CLR(int fd, fd_set *set);
FD_ZERO(fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
61
Method 3: Signal and Select
 signal: events such as timeout
62
Examples of Network Programming
 Library to make life easier
 Four design examples
 TCP Client
 TCP server using select
 TCP server using process and thread
 Reliable UDP
 Warning: It will be hard to listen to me
reading through the code. Read the code.
63
Example 2: A Concurrent TCP
Server Using Process or Thread
 Get a line, and echo it back
 Use select()
 For how to use process or thread, see later
 Check the code at:
http://zoo.cs.yale.edu/classes/cs433/programming/examples-c-socket/tcpserver
 Are there potential denial of service problems with the
code?
64
Example 3: A Concurrent HTTP
TCP Server Using Process/Thread
 Use process-per-request or thread-per-
request
 Check the code at:
http://zoo.cs.yale.edu/classes/cs433/programming/examples-c-socket/simple_httpd
65
Peer-to-Peer Systems:
Unstructured
9/30/2009
66
Admin.
 Programming assignment 1 linked on the
schedule page
67
Recap: Objectives of P2P
 Share the resources (storage and bandwidth) of individual clients to improve scalability/robustness
 Find clients with resources!
68
Recap: The Lookup Problem
[Figure: a publisher holding Key=“title”, Value=data… and a client issuing Lookup(“title”) across nodes N1-N6]
find where a particular document is stored
Recap
 Napster (central query server)
 Gnutella (decentralized, flooding)
 FastTrack: modifies the Gnutella protocol into a two-level hierarchy
 Supernodes
• nodes that have a better connection to the Internet
• act as temporary indexing servers for other nodes
• help improve the stability of the network
 Standard nodes
• connect to supernodes and report their list of files
 Search
• broadcast (Gnutella-style) search across supernodes
70
Outline
 Recap
 P2P
 the lookup problem
 Napster (central query server; distributed
data server)
 Gnutella (decentralized, flooding)
 Freenet
71
Freenet
 History
 final-year project of Ian Clarke, Edinburgh University, Scotland, June 1999
 Goals:
 totally distributed system without a centralized index or broadcast (flooding); instead, search by routing
 the routing/storing system responds adaptively to usage patterns, transparently moving and replicating files as necessary to provide efficient service
 provide publisher anonymity, security
 free speech: resistant to attacks – a third party shouldn’t be able to deny (e.g., by deleting) access to a particular file (data item, object)
72
Basic Structure of Freenet
 Each machine stores a set of files; each file is identified by a unique identifier (called key or id)
 Each node maintains a “routing table” whose entries have the form (id, next_hop, file):
 id – file id, key
 next_hop node – where to search for a file with an id’ that is similar to id
 file – local copy, if it exists, of the file with id
 API: file = query(id);
73
Freenet Query
 Upon receiving a query for file id, a node:
 checks whether the queried file is stored locally
 checks the TTL to limit the search scope
• each query is associated with a TTL that is decremented each time the query message is forwarded
• when TTL=1, the query is forwarded only with some probability
• the TTL can be initialized to a random value (why a random value?)
 looks for the “closest” id in the routing table with an unvisited next_hop node
• if it finds one, it forwards the query to the corresponding next_hop
• otherwise, it backtracks
– ends up performing a Depth-First-Search (DFS)-like traversal
– search direction ordered by closeness to the target
 When the file is returned, it is cached along the reverse path (any advantage?); a routing sketch follows this slide
74
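A sketch of the per-node query handling described above, under simplifying assumptions: keys are integers, “closeness” is absolute difference, backtracking is expressed as recursion, and the probabilistic TTL=1 forwarding is omitted. Purely illustrative:

import java.util.*;

public class FreenetNode {
    record RouteEntry(long id, FreenetNode nextHop) {}      // one row of the routing table

    private final Map<Long, byte[]> localFiles = new HashMap<>();
    private final List<RouteEntry> routingTable = new ArrayList<>();

    // Returns the file if this node (or a node reachable by routing) has it, else null.
    byte[] query(long id, int ttl, Set<FreenetNode> visited) {
        byte[] local = localFiles.get(id);
        if (local != null) return local;                    // stored locally
        if (ttl <= 0) return null;                          // search scope exhausted
        visited.add(this);

        // Try next hops in order of closeness of their id to the requested id (DFS order).
        List<RouteEntry> candidates = new ArrayList<>(routingTable);
        candidates.sort(Comparator.comparingLong((RouteEntry e) -> Math.abs(e.id() - id)));
        for (RouteEntry e : candidates) {
            if (visited.contains(e.nextHop())) continue;    // skip visited nodes (backtrack point)
            byte[] result = e.nextHop().query(id, ttl - 1, visited);
            if (result != null) {
                localFiles.put(id, result);                 // cache along the reverse path
                return result;
            }
        }
        return null;                                        // all branches failed: backtrack
    }
}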
Query Example
[Figure: query(10) issued at n1 follows the closest-id routing-table entries through n2, n3, n4, …, backtracking at dead ends, until it reaches the node holding f10]
Besides the routing table, each node also maintains a query table containing the state of all outstanding queries that have traversed it, in order to backtrack.
75
Insert
 API: insert(id, file);
 Two steps
 first attempt a “search” for the file to be inserted
 if found, report a collision
 if not found, insert the file by sending it along the query path (why?)
• a node probabilistically replaces the originator with itself (why?)
76
Insert Example
 Assume the query returned failure along the shown path (backtracking slightly complicates things); insert f10
[Figure: insert(10, f10) is issued at n1 along the failed query path toward n2, n3, …]
77
Insert Example
[Figure: n1 adds the entry 10 n1 f10 to its own routing table and forwards the insert with orig=n1 to n2]
78
Insert Example
[Figure: n2 adds the entry 10 n1 f10]
79
Insert Example
 n2 replaces the originator (n1) with itself
[Figure: the insert is forwarded onward with orig=n2]
80
Insert Example
[Figure: n3 adds the entry 10 n2 f10]
81
Insert Example
[Figure: the remaining nodes on the path add entries for key 10 as well (e.g., 10 n2 f10 and 10 n4 f10 further along), so pointers and copies of f10 now lie along the whole insert path]
82
Freenet Analysis
 Authors claim the following effects:
 nodes eventually specialize in locating similar keys
• if a node is listed in a routing table, it will get queries for related keys
• thus it will gain “experience” answering those queries
 popular data will be transparently replicated and will exist closer to requestors
 as nodes process queries, connectivity increases
• nodes will discover other nodes in the network
 Caveat: lexicographic closeness of file names/keys may not imply content similarity
83
Understanding Freenet Self-Organization: Freenet Graph
 We create a Freenet reference graph by
 creating a vertex for each Freenet node
 adding a directed link from A to B if A refers to an item stored at B
84
Experiment: Freenet Graph: Init
[Figure: nodes …, i-2, i-1, i, i+1, i+2, … with routing-table references to their four nearest neighbors]
- Assume a network of 1000 nodes, with node ids 0 to 999
- Each node can store 50 data items and 200 references
- Assume initially each node i has item i, and knows the storage of i - 2, i - 1, i + 1, i + 2 (all mod 1000)
- thus a regular, locally-clustered graph with avg path length ~ 1000 / 8 = 125
85
Experiment: Evolution of Freenet Graph
 At each step
 pick a node randomly
 flip a coin to determine search or insert
• if search, randomly pick a key in the network
• if insert, pick a random key
[Figure: evolution of path length and clustering; clustering is defined as the percentage of local links]
86
Freenet Evolves to Small-World Network
 With usage, the regular, highly localized Freenet network evolved into an irregular graph
 A high percentage of highly connected nodes provide shortcuts/bridges
 they make the world a “small world”
 most queries only traverse a small number of hops to find the file
87
Small-World
 First discovered by Milgram
 in 1967, Milgram mailed 160 letters to a set of randomly chosen people in Omaha, Nebraska
 goal: pass the letters to a given person in Boston
• each person could only pass the letter to an intermediary known on a first-name basis
• pick the person who may make the best progress
 result: 42 letters made it through!
 the median number of intermediaries was 5.5 --- thus "six degrees of separation"
 a potential explanation: highly connected people with non-local links in mostly locally connected communities improve search performance!
88
Kleinberg’s Result on Distributed Search
 Question: how many long-distance links should one maintain so that distributed (greedy) search is effective?
 Assume that the probability of a long link is some inverse power of the number of lattice steps
 Kleinberg’s result: a distributed (greedy) algorithm works well only when the probability is proportional to (lattice steps)^-d, where d is the dimension of the space (restated formally below)
89
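For reference, the result can be stated as follows (standard form of Kleinberg's theorem; r denotes the exponent of the inverse power law):

\[
  \Pr[\,u \to v\,] \;=\;
  \frac{\operatorname{dist}(u,v)^{-r}}{\sum_{w \neq u} \operatorname{dist}(u,w)^{-r}},
  \qquad
  \text{greedy routing takes } O(\log^2 n) \text{ expected hops when } r = d,
\]
\[
  \text{and } n^{\Omega(1)} \text{ expected hops for any fixed } r \neq d .
\]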
Distributed Search
 In other words, when the link probability is proportional to (lattice steps)^-d, doubling the distance increases the number of neighbors by only a constant
 -> see Chord
90
Small World
Saul Steinberg; View of World from 9th Ave
91
Freenet: Properties
 Query using intelligent routing
 decentralized architecture → robust
 avoids flooding → low overhead
 DFS search guided by closeness to the target
 Integration of query and caching makes it
 adaptive to usage patterns: reorganizes the network reference structure
 free-speech friendly: attempts to discover/supplant existing files will just spread the files!
 Provides publisher anonymity, security
 each node probabilistically replaces the originator with itself
92
Freenet: Issues
 Does not always guarantee that a file is
found, even if the file is in the network
 Good average-case performance, but a
potentially long search path in a large
network

approaching small-world…
93
Summary
 All of the previous p2p systems are called
unstructured p2p systems
 Advantages of unstructured p2p
 algorithms tend to be simple
 can optimize for properties such as locality
 Disadvantages
 hard to make performance guarantee
 failure even when files exist
94
Distributed Hash Tables (DHT): History
 In 2000-2001, academic researchers jumped onto the P2P bandwagon
 Motivation:
 frustrated by the popularity of all these “half-baked” P2P apps. We can do better! (so they said)
 guaranteed lookup success for data in the system
 provable bounds on search time
 provable scalability to millions of nodes
DHT: Overview
 Abstraction: a distributed “hash-table” (DHT) data structure
 put(key, value) and get(key) → value
 the DHT imposes no structure/meaning on keys
 one can build complex data structures using a DHT
 Implementation:
 nodes in the system form an interconnection network: ring, zone, tree, hypercube, butterfly network, ...
[Figure: a distributed application calls put(key, data) / get(key) on the DHT layer, which spans many nodes]
96
DHT Applications
 File sharing and backup [CFS, Ivy, OceanStore, PAST, Pastiche …]
 Web cache and replica [Squirrel, Croquet Media Player]
 Censor-resistant stores [Eternity]
 DB query and indexing [PIER, Place Lab, VPN Index]
 Event notification [Scribe]
 Naming systems [ChordDNS, Twine, INS, HIP]
 Communication primitives [I3, …]
 Host mobility [DTN Tetherless Architecture]
97
DHT: Basic Idea
98
DHT: Basic Idea (2)
99
DHT: Basic Idea (3)
100
DHT: Basic Idea (4)
101
DHT: Basic Idea (5)
102
Peer-to-Peer Systems:
DHT and Swarming
10/5/2009
103
Admin.
 Programming assignment 1

New due date: Oct. 12 11:59 pm EDT
104
Recap: Objectives of P2P
 Share the resources (storage and bandwidth) of individual clients to improve scalability/robustness
 Find clients with resources!
105
Recap: The Lookup Problem
[Figure: a publisher holding Key=“title”, Value=data… and a client issuing Lookup(“title”) across nodes N1-N6]
find where a particular document is stored
Recap: The Lookup Problem
 Napster (central query server)
 Gnutella (decentralized, flooding)
 Freenet (search by routing)
107
Recap: Kleinberg’s Result on Distributed Search
 Question: how many long-distance links should one maintain so that distributed (greedy) search is effective?
 Assume that the probability of a long link is some inverse power of the number of lattice steps
 A distributed algorithm exists only when the probability is proportional to (lattice steps)^-d, where d is the dimension of the space
108
Recap: Distributed Search
 In other words, a node connects to roughly the same
number of nodes in each region A1, A2, A4, A8,
where A1 is the set of nodes who are one lattice
step away, A2 is those two steps away, …
109
DHT: API
110
DHT: API
111
Key Issues in Understanding a DHT Design
 How does the design map keys to an internal representation (typically a metric space)?
 Which part of the space is a node responsible for?
 How are the nodes linked?
112
Outline
 Admin. and review
 P2P
 the lookup problem
 Napster (central query server; distributed
data server)
 Gnutella (decentralized, flooding)
 Freenet (search by routing)
 Content Addressable Network
113
CAN
 Abstraction
 map a key to a “point” in a multi-dimensional Cartesian space
 a node “owns” a zone in the overall space
 route from one “point” to another
114
CAN Example: Two Dimensional Space
 Space is divided among the nodes
 Each node covers either a square or a rectangular area of ratio 1:2 or 2:1
[Figure sequence, slides 115-119: nodes 1 through 5 join one at a time; each join splits an existing zone in half, leaving the 2-D space partitioned into five rectangular zones]
CAN Insert: Example (1)
node I::insert(K,V)
[Figure: node I initiates the insert]
120
CAN Insert: Example (2)
node I::insert(K,V)
(1) a = hx(K); b = hy(K)
[Figure: the key is hashed to the point (a,b); e.g., Key=“Matrix3”, h(Key)=60]
121
CAN Insert: Example (3)
node I::insert(K,V)
(1) a = hx(K); b = hy(K)
(2) route(K,V) -> (a,b)
[Figure: the insert is routed from I toward the point (a,b)]
122
CAN Insert: Routing
 A node maintains state only for its immediate neighboring nodes
 Forward to the neighbor which is closest to the target point
 a type of greedy, local routing scheme (a sketch follows)
123
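A sketch of the greedy forwarding decision, assuming a 2-D space, Euclidean distance, and that each neighbor is summarized by its zone center (illustrative only, not the CAN paper's code):

import java.util.List;

public class CanRouting {
    record Point(double x, double y) {}
    record Neighbor(String nodeId, Point zoneCenter) {}

    // Pick the immediate neighbor whose zone center is closest to the target point (a, b).
    static Neighbor nextHop(List<Neighbor> neighbors, Point target) {
        Neighbor best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Neighbor n : neighbors) {
            double dx = n.zoneCenter().x() - target.x();
            double dy = n.zoneCenter().y() - target.y();
            double dist = Math.hypot(dx, dy);      // Euclidean distance in the Cartesian space
            if (dist < bestDist) {
                bestDist = dist;
                best = n;                          // greedy, purely local choice
            }
        }
        return best;                               // forward (K,V) or retrieve(K) to this neighbor
    }
}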
CAN Insert: Example (4)
node I::insert(K,V)
(1) a = hx(K); b = hy(K)
(2) route(K,V) -> (a,b)
(3) (K,V) is stored at the node owning (a,b)
124
CAN Retrieve: Example
node J::retrieve(K)
(1) a = hx(K); b = hy(K)
(2) route “retrieve(K)” to (a,b)
[Figure: node J routes the retrieve toward the point (a,b)]
126
CAN Insert: Join (1)
1) Discover some node “J” already in CAN
2) The new node picks a random point (p,q) in the space
[Figure: the new node contacts J]
127
CAN Insert: Join (2)
3) J routes to (p,q), discovering node N
[Figure: the route from J to the zone owned by N]
128
CAN Insert: Join (3)
4) Split N’s zone in half… the new node owns one half
 Inserting a new node affects only a single other node and its immediate neighbors
129
CAN Evaluations
 Guarantees to find an item if it is in the network
 Load balancing
 hashing achieves some load balancing
 an overloaded node replicates popular entries at neighbors
 Scalability
 for a uniformly (regularly) partitioned space with n nodes and d dimensions
 storage:
• per node, the number of neighbors is 2d
 routing:
• the average routing path is (d·n^(1/d))/3 hops (with Manhattan-distance routing, the expected number of hops in each dimension is the dimension length × 1/3)
• a fixed d can scale the network without increasing per-node state
130
Outline
 Admin. and review
 P2P
 the lookup problem
 Napster (central query server; distributed
data server)
 Gnutella (decentralized, flooding)
 Freenet (search by routing)
 Content addressable networks
 Chord (search by routing/consistent hashing)
131
Chord
 Space is a ring
 Consistent hashing: an m-bit identifier space for both keys and nodes
 key identifier = SHA-1(key), where SHA-1() is a popular hash function; e.g., Key=“Matrix3” → ID=60
 node identifier = SHA-1(IP address)
• e.g., IP=“198.10.10.1” → ID=123
(a hashing sketch follows)
132
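A sketch of the consistent-hashing step: hash a key (or an IP address) with SHA-1, then keep m bits so the identifier lands on the 2^m ring. Note the toy IDs in the slides (60, 123) come from a 7-bit example ring, so a real SHA-1 hash will not reproduce them:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChordId {
    // Map a string (key or IP address) onto the m-bit Chord identifier ring.
    static BigInteger id(String s, int m) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                                         .digest(s.getBytes(StandardCharsets.UTF_8));
            BigInteger h = new BigInteger(1, digest);          // interpret the digest as unsigned
            return h.mod(BigInteger.ONE.shiftLeft(m));         // keep m bits: h mod 2^m
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is always available", e);
        }
    }

    public static void main(String[] args) {
        int m = 7;                                             // 7-bit ring, as in the slides
        System.out.println("key Matrix3 -> " + id("Matrix3", m));
        System.out.println("node 198.10.10.1 -> " + id("198.10.10.1", m));
    }
}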
Chord: Storage using a Ring
[Figure: a circular 7-bit ID space with nodes N10, N32, N55, N90, N123 (IP=“198.10.10.1”); keys K125, K5, K10 are stored at N10, K11 and K20 at N32, K60 (Key=“Matrix 3”) at N90, and K101 at N123]
 A key is stored at its successor: the node with the next higher or equal ID
133
How to Search: One Extreme
 Every node knows of every other node
 Routing tables are large O(N)
 Lookups are fast O(1)
134
How to Search: the Other Extreme
 Every node knows its successor in the ring
 Routing tables are small O(1)
 Lookups are slow O(N)
135
Chord Solution: “Finger Tables”
 Node K knows the node that is responsible for K + 2^i, where K is the mapped id of the current node (a sketch follows)
 distances increase exponentially
[Figure: node N80’s fingers point at the nodes responsible for K+1, K+2, K+4, K+8, K+16, K+32, K+64]
136
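A sketch of computing a node's finger table from a global view of the node ids (for illustration only; a real Chord node discovers its fingers through join and stabilization rather than a global list):

import java.util.*;

public class FingerTable {
    // Successor of id on the ring, given the sorted set of live node ids (mod 2^m arithmetic).
    static long successor(long id, TreeSet<Long> nodes, int m) {
        Long s = nodes.ceiling(Math.floorMod(id, 1L << m));
        return (s != null) ? s : nodes.first();                 // wrap around the ring
    }

    // Finger i of node k points at successor(k + 2^i).
    static long[] fingers(long k, TreeSet<Long> nodes, int m) {
        long[] f = new long[m];
        for (int i = 0; i < m; i++) {
            f[i] = successor(k + (1L << i), nodes, m);
        }
        return f;
    }

    public static void main(String[] args) {
        int m = 7;                                              // 7-bit ring as in the slides
        TreeSet<Long> nodes = new TreeSet<>(List.of(10L, 32L, 55L, 80L, 90L, 123L));
        System.out.println(Arrays.toString(fingers(80, nodes, m)));
        // N80's targets are 81, 82, 84, 88, 96, 112, 16 (mod 128); the output lists their successors
    }
}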
Joining the Ring
 use a contact node to obtain info
 transfer keys from successor node to new node
 updating fingers of existing nodes
137
DHT: Chord Node Join
 Assume an identifier space [0..8]
 Node n1 joins
[Figure: n1 alone on the ring 0-7; its successor table for id+2^i = 2, 3, 5 lists successor 1 for every entry]
DHT: Chord Node Join
 Node n2 joins
[Figure: n1’s entry for id 2 now points to node 2; n2’s successor table for id+2^i = 3, 4, 6 still lists successor 1 everywhere]
DHT: Chord Node Join
 Node n6 joins
[Figure: n1’s entries for 3 and 5 now point to 6; n2’s entries for 3, 4, 6 point to 6; n6’s table for 7, 0, 2 lists successors 1, 1, 2]
DHT: Chord Node Join
 Node n0 starts its join
[Figure: the successor tables are unchanged while n0 contacts the ring]
DHT: Chord Node Join
 After node n0 joins
[Figure: n0’s table for 1, 2, 4 lists successors 1, 2, 6; n6’s entries for 7 and 0 now point to 0; the other tables are unchanged]
DHT: Chord Insert Items
 Nodes: n1, n2, n0, n6
 Items: f7, f1
[Figure: each item is stored at the successor of its id — item 7 at n0 and item 1 at n1; the successor tables are as in the previous slide]
DHT: Chord Routing
 Upon receiving a query for item id, a node:
 checks whether it stores the item locally
 if not, forwards the query to the largest node in its successor table that does not exceed id
[Figure: query(7) is forwarded along successor-table entries until it reaches n0, which stores item 7]
Chord/CAN Summary
 Each node “owns” some portion of the key space
 in CAN, it is a multi-dimensional “zone”
 in Chord, it is the key-id space between two nodes on the 1-D ring
 Files and nodes are assigned random locations in the key space
 provides some load balancing
• probabilistically equal division of keys among nodes
 Routing/search is local (distributed) and greedy
 node X does not know a path to a key Z
 but if node Y appears to be the closest to Z among all of the nodes known to X
 then route to Y
145
There are other DHT algorithms
 Kademlia
 Tapestry (Zhao et al)
– keys interpreted as a sequence of digits
– incremental suffix routing
» the source-to-target route is accomplished by correcting one digit at a time
» for instance, to route from 0312 to 1643:
» 0312 → 2173 → 3243 → 2643 → 1643
– each node has a routing table
 Skip Graphs (Aspnes and Shah)
146
Summary: DHT
 Underlying metric space.
 Nodes embedded in metric space
 Location determined by key
 Hashing to balance load
 Greedy routing
 Typically
 O(log n) space at each node
 O(log n) routing time
147
Summary: the Lookup Problem
 Unstructured P2P design
 Napster
(central query server)
 Gnutella (decentralized, flooding)
 Freenet (search by routing)
 Structured P2P design (search by routing)
 Chord (a ring)
 CAN (zones)
148
Outline
 Recap
 P2P
 the lookup problem
 Napster (central query server; distributed data server)
 Gnutella (decentralized, flooding)
 Freenet (search by routing)
 Content Addressable Network (virtual zones)
 Chord (search by routing on a virtual ring)
 the scalability problem
149
An Upper Bound on Scalability
 Assume
 we need to achieve the same rate to all clients
 only uplinks can be bottlenecks
 What is an upper bound on scalability?
[Figure: a server with uplink C0 and clients 1 through n with uplinks C1, C2, C3, …, Cn]
150
The Scalability Problem
 Maximum throughput R = min{C0, (C0 + C1 + … + Cn)/n}
 The bound is theoretically approachable
[Figure: the same server/client topology with uplinks C0, C1, …, Cn]
151
Theoretical Capacity: upload is the bottleneck
 Assume c0 > (C0 + C1 + … + Cn)/n
 Tree i (one per client i):
 server → client i at rate ci/(n-1)
 client i → each of the other n-1 clients at rate ci/(n-1)
 Tree 0:
 the server has remaining capacity cm = c0 – (c1 + c2 + … + cn)/(n-1)
 it sends to each client at rate cm/n
 R = min{C0, (C0 + C1 + … + Cn)/n} (a worked example follows)
[Figure: tree i reaches every client through client i; tree 0 goes from the server directly to all clients]
152
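Restating the bound with a small worked example (the numbers are assumed, purely for illustration):

\[
  R_{\max} \;=\; \min\Bigl\{\, C_0,\; \frac{C_0 + \sum_{i=1}^{n} C_i}{n} \,\Bigr\}
\]
% Example: n = 3 clients, C_0 = 6 Mbps, C_1 = C_2 = C_3 = 2 Mbps:
%   (C_0 + \sum_i C_i)/n = (6 + 6)/3 = 4 Mbps < C_0 = 6 Mbps,
% so every client can receive at most R = 4 Mbps, and the two-tree
% construction above approaches that rate.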
Why not Build the Trees?
 Clients come and go (churn): maintaining the trees is too expensive
[Figure: servers with uplink C0 and clients 1 through n with uplinks C1…Cn]
153
Key Design Issues
 Robustness
 resistant to churn and failures
 Efficiency
 a client has content that others need; otherwise, its upload capacity may not be utilized
 Incentive: clients are willing to upload
 70% of Gnutella users share no files
 nearly 50% of all responses are returned by the top 1% of sharing hosts
 Lookup problem
[Figure: servers with uplink C0 and clients 1 through n]
154
Discussion: How to handle the issues?
 Robustness
 Efficiency
 Incentive
[Figure: servers/seeds with uplink C0 and clients 1 through n]
155
Outline
 Recap
 P2P
 the lookup problem
 the scalability problem
 BitTorrent
156
BitTorrent
 A P2P file-sharing protocol
 Created by Bram Cohen (first released in 2001)
 A peer can download pieces concurrently from multiple locations
157
File Organization
 Piece: unit of information exchange among peers (256 KB)
 Block: unit of download (16 KB)
[Figure: a file divided into pieces 1-4; each piece divided into blocks; a piece with only some of its blocks downloaded is an incomplete piece]
158
Outline
 Recap
 P2P
 the lookup problem
 the scalability problem
 BitTorrent
 Lookup
159
BitTorrent
 Mostly tracker based
 Tracker-less mode; based on the Kademlia
DHT
160
BitTorrent: Initialization
[Figure: the user does an HTTP GET of MYFILE.torrent from a webserver; the .torrent lists the tracker http://mytracker.com:6969/ and piece hashes such as S3F5YHG6FEB, FG5467HGF367, F456JI9N5FF4E, …]
161
Metadata File Structure
 The metainfo contains the information necessary to contact the tracker and describes the files in the torrent
 announce URL of the tracker
 file name
 file length
 piece length (typically 256KB)
 SHA-1 hashes of the pieces, for verification
 also creation date, comment, creator, …
Tracker Protocol
[Figure: the user “registers” with the tracker named in the .torrent and receives a list of peers, e.g. ID1 169.237.234.1:6881, ID2 190.50.34.6:5692, ID3 34.275.89.143:4545, …, ID50 231.456.31.95:6882; the client then contacts peers 1, 2, …, 40 directly]
163
Tracker Protocol
 Communicates with clients via HTTP/HTTPS
 Client GET request (an announce sketch follows this list)
 info_hash: uniquely identifies the file
 peer_id: chosen by and uniquely identifies the client
 client IP and port
 numwant: how many peers to return (defaults to 50)
 stats: e.g., bytes uploaded, downloaded
 Tracker GET response
 interval: how often to contact the tracker
 list of peers, containing peer id, IP and port
 stats
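A sketch of issuing the announce request as an HTTP GET (parameter names follow common tracker conventions; the tracker URL, hash, and peer id below are placeholders, and a real client must URL-encode the raw 20-byte info_hash):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TrackerAnnounce {
    public static void main(String[] args) throws Exception {
        String announce = "http://mytracker.com:6969/announce";   // from the .torrent metainfo
        String infoHash = URLEncoder.encode("<20-byte SHA-1 of the info dict>", StandardCharsets.ISO_8859_1);
        String peerId   = URLEncoder.encode("-XX0001-123456789012", StandardCharsets.ISO_8859_1);

        String url = announce
                + "?info_hash=" + infoHash                      // identifies the torrent
                + "&peer_id="   + peerId                        // identifies this client
                + "&port=6881"                                  // where this client accepts peer connections
                + "&uploaded=0&downloaded=0&left=1048576"       // transfer stats
                + "&numwant=50";                                // how many peers to ask for

        HttpResponse<byte[]> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());
        // The response body is a bencoded dictionary with "interval" and the peer list.
        System.out.println("tracker returned " + resp.body().length + " bytes");
    }
}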
Outline
 Recap
 P2P
 the lookup problem
 the scalability problem
 BitTorrent
 Lookup
 Robustness
165
Robustness
 A swarming protocol
 peers exchange info about other peers in the system
 peers exchange piece availability and request blocks from peers
166
Peer Protocol (over TCP)
[Figure: the local and remote peers exchange an ID/Infohash handshake, then exchange BitFields indicating which pieces (0/1) each one has]
167
Peer Protocol
 Unchoke: a indicates whether it allows b to download
 Interest/request: indicates which block b should send to a
[Figure: each pair of connected peers exchanges requests/interests in both directions]
168
Optional Slides
169
Skip List
170