Transcript file

Peer to Peer (1)
References
This discussion is based on the papers and
websites in the P2P File Sharing section
 Acknowledgements: Many of the figures are taken from other presentations, especially those of the original authors.

Client-Server Model
Let’s look at the client-server model:
 Servers are centrally maintained and administered
 A client has fewer computing resources than a server
 This is the way the web works
 No interaction between clients

Disadvantages of the client-server model
o Reliability
—The network depends on a possibly highly loaded server
to function properly.
—Server needs to be replicated to some extent to provide
better reliability.
o Scalability
—More users imply more demand for computing power,
storage space and bandwidth
Peer-to-Peer Model
 All nodes have the same functional capabilities and responsibilities
 No reliance on central services or resources
 A node acts as both a “server” and a client
 Considered more scalable
Peer-to-Peer Model
 Peer-to-peer systems provide access to information resources located on computers throughout the network.
 Algorithms for the placement and subsequent retrieval of information are a key aspect of the design of P2P systems.

The Promise of P2P Computing

High capacity through parallelism:
o Many disks
o Many network connections
o Many CPUs

Reliability:
o Many replicas
o Geographic distribution
Automatic configuration
 Useful in public and proprietary settings

Why P2P?

 P2P can potentially eliminate the single-source bottleneck
 P2P can be used to distribute data and control, and to load-balance requests across the Net
 P2P potentially eliminates the risk of a single point of failure
 P2P infrastructure allows direct access and shared space, and this can enable remote maintenance capability
Brief History

Generation 1 of P2P Systems
o Napster music exchange

Generation 2
o Freenet, Gnutella, Kazaa, BitTorrent

Generation 3
o Characterized by the emergence of middleware
layers for the application-independent
management of distributed resources on a
global scale
—Pastry, Tapestry, Chord, Kademlia
Environment Characteristics for Peer-to-Peer Systems
 Unreliable environments
 Peers are constantly connecting and disconnecting – from network failures to changes in participation
 Random failures, e.g. power outages, cable and DSL failures, hackers
 Personal machines are much more vulnerable than servers

Evaluating Peer-to-Peer Systems

 A node’s database:
o What does a node need to store in order to operate properly/efficiently?
 Success rate (if the file is in the network, what are the chances that a search will find it?)
 Lookup cost:
o Time
o Communication (bandwidth usage)
 Join/departure cost
 Fault tolerance – resilience to faults
 Resilience to denial-of-service attacks; security
Issues in File Sharing Services
 Publish – how to insert a new file into the network
 Lookup – find a specific file
 Retrieval – get a copy of a file
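The three operations above can be expressed as a minimal interface. This is an illustrative sketch, not from the source: the class and method names are invented, and the in-memory implementation is a single-node stand-in for a real networked service.

```python
from abc import ABC, abstractmethod

class FileSharingService(ABC):
    """The three core operations any P2P file sharing design must provide."""

    @abstractmethod
    def publish(self, name: str, data: bytes) -> None:
        """Insert a new file into the network."""

    @abstractmethod
    def lookup(self, name: str) -> list[str]:
        """Find which nodes hold a specific file."""

    @abstractmethod
    def retrieve(self, name: str) -> bytes:
        """Get a copy of the file."""

class InMemoryService(FileSharingService):
    """Trivial single-node stand-in, used only to exercise the interface."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def publish(self, name: str, data: bytes) -> None:
        self._store[name] = data

    def lookup(self, name: str) -> list[str]:
        # A real system would return peer addresses; "localhost" is a placeholder.
        return ["localhost"] if name in self._store else []

    def retrieve(self, name: str) -> bytes:
        return self._store[name]
```

Napster, Gnutella, KaZaA, and Freenet can all be read as different answers to how `lookup` is distributed across nodes.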
P2P File Sharing Software

Allows a user to open up a directory in their
file system
o Anyone can retrieve a file from the directory
o Like a Web server

Allows the user to copy files from other
users’ open directories:
o Like a Web client

Allows users to search nodes for content
based on keyword matches:
o Like Google
Napster: How Did It Work

 Application-level, client-server protocol over point-to-point TCP
 Centralized directory server
 Steps:
o Connect to the Napster server
o Give the server keywords to search the full list with
o Select the “best” of the correct answers
—One approach is to select based on the response time of pings
– The host with the shortest response time is chosen
Napster: How Did It Work
[Figure: four-step interaction with the napster.com centralized directory]
1. File list and IP address are uploaded to the napster.com centralized directory.
2. User sends a search request to the server; the query and results pass between user and directory.
3. User pings the hosts that apparently have the data, looking for the best transfer rate.
4. User chooses a server and retrieves the file directly from it.
Napster’s centralized server farm had a difficult time keeping up with traffic.
Napster
 There were centralized indexes, but users supplied the files, which were stored on and accessed from their personal computers
 Napster became very popular for music exchange
Napster

History:
o 5/99: Shawn Fanning (freshman, Northeastern U.) founds the Napster online music service
o 12/99: first lawsuit
o 3/00: 25% of UWisc traffic is Napster
o 2/01: US Circuit Court of Appeals: Napster knew its users were violating copyright laws
o 7/01: # of simultaneous online users: Napster 160K, Gnutella 40K, Morpheus (KaZaA) 300K
Napster
 Judge orders Napster to pull the plug in July ’01
 Other file sharing apps take over!
[Chart: traffic in bits per second (0–8M) over time for gnutella, napster, and fastrack (KaZaA)]
Napster’s Downfall

 Napster’s developers argued they were not liable for infringement of the copyrights
o Why? They were not participating in the copying process, which was performed entirely between users’ machines
 This argument was not accepted by the courts
o Why? The index servers were deemed an essential part of the process
 Since the index servers were located at well-known addresses, their operators were unable to remain anonymous
o Makes for an easy lawsuit target
Napster’s Downfall

A more fully distributed file sharing
service spreads the responsibility across
all of the users
o Makes the pursuit of legal remedies difficult
Napster: Discussion
 Locates files quickly
 Vulnerable to censorship and technical failure
 Popular data becomes less accessible because of the load of requests on a central server
 People started to look for more distributed solutions to file sharing as a result of Napster’s failure
Gnutella

 Napster’s legal problems motivated Gnutella, which does not use centralized indexes
 The focus is on a decentralized method of searching for files
o The central directory server is no longer the bottleneck
o More difficult to “pull the plug”
 Each application instance serves to:
o Store selected files
o Route queries from and to its neighboring peers
o Respond to queries if the file is stored locally
o Serve files
Gnutella

 Gnutella history:
o 3/14/00: release by AOL, almost immediately withdrawn
o Became open source
o Many iterations to fix the poor initial design (the poor design turned many people off)
 Issues:
o How much traffic does one query generate?
o How many hosts can it support at once?
o What is the latency associated with querying?
o Is there a bottleneck?
Gnutella: Searching

 Searching by flooding:
• A Query packet might ask, “Do you have any content that matches the string ‘Homer’?”
o If a node does not have the requested file, then 7 (the default set by Gnutella) of its neighbors are queried
o If the neighbors do not have it, they contact 7 of their neighbors
o Maximum hop count: 10 (this is called the time-to-live, TTL)
o Reverse path forwarding for responses (not files)
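The flooding rule above (forward to up to 7 neighbors only when the local node has no match, stop when the TTL runs out) can be sketched as a graph search. This is a simulation with invented data structures, not the real Gnutella wire protocol; loop suppression here uses a simple visited set standing in for the query-ID lists real nodes keep.

```python
def flood_query(graph, files, start, keyword, ttl=10, fanout=7):
    """Simulate a flooded Gnutella-style search.

    graph: node -> list of neighbor nodes
    files: node -> list of filenames stored at that node
    Returns the nodes holding a match (QueryHits travel back on the
    reverse path, which is not modeled here).
    """
    seen, hits = set(), []
    frontier = [(start, ttl)]
    while frontier:
        node, t = frontier.pop()
        if node in seen:           # drop already-seen queries (loop detection)
            continue
        seen.add(node)
        if any(keyword in name for name in files.get(node, [])):
            hits.append(node)      # node has the file: answer, don't forward
        elif t > 1:                # otherwise forward to up to `fanout` neighbors
            for neighbor in graph.get(node, [])[:fanout]:
                frontier.append((neighbor, t - 1))
    return hits
```

Even this toy version makes the scaling problem visible: with fanout 7 and TTL 10, one query can in principle touch on the order of 7^10 nodes.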
Gnutella: Searching

 Downloading:
• Peers respond with a “QueryHit” (contains contact info)
• File transfers use a direct connection via the HTTP protocol’s GET method
• When there is a firewall, a “Push” packet is used – the request is rerouted via the Push path
Gnutella: Searching
[Figure: examples of a flooded query propagating through the Gnutella overlay]
Gnutella: Discovering Peers
A peer has to know at least one other peer to
send requests to.
 Addresses of some peers have been published
on a website.
 When a peer enters the network, it contacts
a designated peer and receives a list of other
peers that have recently entered the
network.

Gnutella: Discussion
 Robust: the failure of a peer is not a failure of Gnutella
 Performance: flooding leads to poor performance
 Free riders: those who get data but do not share data
 The model of Gnutella just presented was found not to be workable
 This led to models in which some peer nodes have indexes
KaZaA: The Service
 More than 3 million peers up, sharing over 3,000 terabytes of content
 More popular than Napster ever was
 MP3s & entire albums, videos, games
 Optional parallel downloading of files
 Automatically switches to a new download server when the current server becomes unavailable
 Provides estimated download times
KaZaA: The Service


A user can configure the maximum number of
simultaneous uploads and maximum number of
simultaneous downloads
Queue management at server and client
o Frequent uploaders can get priority in server queue

Keyword search
o User can configure “up to x” responses to keywords

Responses to keyword queries come in waves; the search stops when x responses are found
KaZaA: The Technology
Proprietary
 Control data encrypted
 Everything in HTTP request and response
messages

KaZaA: Architecture
 Each peer is either a supernode or is assigned to a supernode
o 56 min average connection time
o Each SN has about 100-150 children
o Roughly 30,000 SNs
 Each supernode has TCP connections with 30-50 other supernodes
o 23 min average connection time
KaZaA: Architecture
Nodes that have more connection bandwidth
and are more available are designated as
supernodes
 Each supernode acts as a mini-Napster hub,
tracking the content and IP addresses of its
descendants
 A supernode tracks only the content of its
children.
 Considered a cross between Napster and
Gnutella

KaZaA: Finding Supernodes

 A list of potential supernodes is included within the software download
 A new peer goes through the list until it finds an operational supernode
o Connects, obtains a more up-to-date list with 200 entries
o Nodes in the list are “close” to ON
o The node then pings 5 nodes on the list and connects with one of them
 If its supernode goes down, a node obtains an updated list and chooses a new supernode
KaZaA Queries

Node first sends query to supernode
o Supernode responds with matches
o If x matches found, done.

Otherwise, supernode forwards query to
subset of supernodes
o If total of x matches found, done.

Otherwise, query further forwarded
o Probably by original supernode rather than
recursively
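The escalating query above can be sketched as follows. `SuperNode` and its child-file map are invented stand-ins; KaZaA's actual protocol is proprietary and encrypted, so this only illustrates the widen-until-x-matches control flow.

```python
class SuperNode:
    """Invented stand-in: a supernode tracks only the content of its children."""

    def __init__(self, child_files):
        self.child_files = child_files   # child IP -> list of filenames

    def search(self, keyword):
        return [(ip, f) for ip, fs in self.child_files.items()
                for f in fs if keyword in f]

def kazaa_query(keyword, my_supernode, peer_supernodes, x=20):
    """Ask the node's own supernode first; widen to peer supernodes
    only while fewer than x matches have been found."""
    matches = my_supernode.search(keyword)
    for sn in peer_supernodes:
        if len(matches) >= x:      # enough matches: stop escalating
            break
        matches.extend(sn.search(keyword))
    return matches[:x]
```

The point of the escalation is that most queries are satisfied by the first hop, so the supernode overlay sees far less traffic than Gnutella-style flooding.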
Freenet: History
Final year (undergraduate) project at
Edinburgh University, Scotland, 1999.
 Around the time of Napster
 One of the goals was to provide users with anonymity and circumvent censorship
 Other goals include file sharing, high availability and reliability (through decentralization), efficiency, and scalability
Freenet: Key Management

 Files are stored according to an associated key
o Cluster information about similar keys
 Keys are used to locate a document anywhere
 Keys are used to form a URI
 Two similar keys don’t mean the subjects of the files are similar
Freenet: Routing Tables
 Each node maintains a routing table whose entries contain:
o id – a file identifier
o next_hop – another node that stores the file id
o file – the file identified by id, if it is stored on the local node
 Forwarding of a query for file id:
o If the file id is stored locally, then stop
o If not, search for the “closest” id in the table and forward the message to the corresponding next_hop
o If the data is not found, failure is reported back
— The requestor then tries the next closest match in its routing table
o When the data is found, it is forwarded back to the upstream requestor
— Each requestor adds the file to its cache and adds an entry in its routing table
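The forwarding rule above (stop if stored locally, otherwise follow the next_hop of the closest id) can be sketched as a small simulation. Backtracking to the next-closest match on failure is omitted for brevity, and all data structures are invented; this is not Freenet's actual implementation.

```python
def route_query(node_tables, files, start, key, max_hops=10):
    """Greedy Freenet-style lookup over simulated routing tables.

    node_tables: node -> {id: next_hop} routing entries
    files: node -> set of file ids stored locally
    Returns (node holding the file or None, path taken).
    """
    node, path = start, [start]
    for _ in range(max_hops):
        if key in files.get(node, set()):
            return node, path            # found: data returns along reverse path
        table = node_tables.get(node, {})
        if not table:
            return None, path            # dead end (no backtracking modeled)
        closest = min(table, key=lambda k: abs(k - key))
        node = table[closest]            # forward toward the closest known id
        path.append(node)
    return None, path
```

Because each node on the successful path caches the returned file and adds a routing entry for its key, later queries for similar keys converge faster, which is what clusters similar keys together.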
Freenet: Messages

 A random 64-bit ID is used for loop detection
—Each node maintains the list of query IDs that have traversed it, which helps avoid looping messages
 TTL
—The TTL is decremented each time the query message is forwarded
Freenet: Query

 API: file = query(id);
 Upon receiving a query for document id:
o Check whether the queried file is stored locally
— If yes, return it
— If not, forward the query message
 Notes:
o Each query is associated with a TTL that is decremented each time the query message is forwarded; to obscure the distance to the originator:
— The TTL can be initialized to a random value within some bounds
— When TTL = 1, the query is forwarded with a finite probability
o Each node maintains state for all outstanding queries that have traversed it, which helps avoid cycles
o When the file is returned, it is cached along the reverse path
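The two TTL-obfuscation tricks above can be sketched directly. The bounds (10-20) and the forwarding probability `p` are assumed values chosen for illustration, not Freenet's actual parameters.

```python
import random

def initial_ttl(lo=10, hi=20):
    """Start the TTL at a random value within some bounds so the hop count
    does not reveal the distance to the originator (bounds are assumed)."""
    return random.randint(lo, hi)

def should_forward(ttl, p=0.5):
    """Decrement-and-forward rule: always forward while TTL > 1; at TTL == 1
    forward anyway with probability p. Because TTL = 1 queries sometimes
    continue, receiving one does not prove the previous hop originated it."""
    if ttl > 1:
        return True
    return random.random() < p
```

Together these mean no node on the path can tell from the TTL alone whether its upstream neighbor is the requestor or just another forwarder.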
Freenet: Routing Example
[Figure: query(10) issued at n1 is forwarded hop by hop among nodes n1–n5; at each node the query follows the next_hop of the routing-table id closest to 10 until a node storing f10 is found, and the file returns along the reverse path]
Copies of the requested file may be cached along the request path for scalability and robustness.
Freenet: Insert
 API: insert(id, file);
 Two steps:
o Search for the file to be inserted
— If found, report a collision
— If the number of nodes is exhausted, report all clear
o If not found, insert the file
Freenet: Insert
 Searching: like a query
 Insertion:
o Follow the forward path; insert the file at all nodes along the path
o A node probabilistically replaces the originator with itself; this obscures the true originator
Freenet: Insert Example
 Assume a query for id 10 returned failure along the “gray” path; insert f10
[Figure: insert(10, f10) issued at n1 follows the failed-query path through n2, n3, and onward]
Freenet: Insert Example
[Figure: the insert message initially carries orig=n1; each node on the path stores f10 and adds a routing-table entry for id 10]
Freenet: Insert Example
 n2 probabilistically replaces the originator (n1) with itself
[Figure: downstream of n2 the message carries orig=n2, so later nodes record n2 as the source and the true originator is obscured]
Insert Conflict
 If the same GUID already exists, reject the insert
 Propagation of the previous file prevents attempts to supplant a file already in the network
Freenet: Ensuring Anonymity
 Nodes can lie randomly about requests and claim to be the origin or the destination of a request
 TTL values are fuzzy
 It is difficult to trace a document back to its original node
 Similarly, it is difficult to discover which node inserted a given document
Freenet: Discussion

What’s wrong with Freenet?
o Not well tested in the wild – scalability, resilience.
Insertion flooding is one way to take out the network.
o Anonymity guarantees not that strong – “Most non-trivial
attacks would probably be successful in identifying
someone making requests on Freenet.”
o No search mechanism – a standard search would allow
attacks to take out specific content holders
o Suffers from problems of establishing initial network
connection.
o Does not always guarantee that a file is found, even if
the file is in the network
Freenet: Discussion

Primary Points
o Prevention of censorship and protection of
privacy is an important and active field of
research.
o Freenet is a (successful?) implementation of a
system that resists information censorship
o Freenet is an ongoing project that still has
plenty of flaws
o There may be a tradeoff between network efficiency and anonymity or robustness
BitTorrent

 Problems:
o Traditional client/server sharing
—Performance deteriorates rapidly as the number of clients increases
o Free-riding in P2P networks
—Free riders only download without contributing to the network
 BitTorrent: started in 2001 by Bram Cohen
 Focus: efficient fetching, not searching
BitTorrent
[Figure: a downloader (1) GETs file.torrent from a web server url, (2) GETs the tracker, (3) receives a list of peers from the tracker (which tracks peers), and (4) exchanges pieces with those peers]
file.torrent info:
• length
• name
• hash
• url of tracker
BitTorrent: Pieces

 The file is broken into pieces
o Typically a piece is 256 KBytes
o Pieces are uploaded while other pieces are being downloaded
 Tit-for-tat
o BitTorrent uploads to at most four peers
o Among the peers, it uploads to the four that are uploading to you at the highest rates
o Sometimes it allows for free loaders
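The tit-for-tat rule above reduces to ranking peers by the rate at which they are currently sending to us. A minimal sketch with invented names; the periodic "optimistic unchoke" that real clients use (and that occasionally lets free riders in) is noted in the comment but not implemented.

```python
def choose_unchoked(download_rates, k=4):
    """Tit-for-tat choking sketch.

    download_rates: peer -> measured rate at which that peer sends to us
    Returns the set of up to k peers we will upload to (unchoke); all
    other peers stay choked. Real clients also periodically unchoke one
    random extra peer ("optimistic unchoke") to discover better partners.
    """
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    return set(ranked[:k])
```

Re-running this every choking interval means a peer that stops uploading to us soon drops out of our top four and gets choked, which is what makes free riding costly.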
Summary

 P2P is a major portion of Internet traffic
o Has exceeded web traffic
 There are different approaches to structuring P2P applications
 There is a good deal of concern about future scalability and traffic, which is leading to a lot of research from academia and industry
Interesting Notes
 P2P traffic significantly outweighs web traffic
 P2P traffic continues to grow
 The focus of P2P file sharing is NOT MP3s
o Many free software projects use BitTorrent to
download software

BitTorrent is the dominant P2P application
in use.
o Microsoft is jealous – they are developing their
own version called Avalanche