P2P File Sharing Systems - Computer Science Department

P2P File Sharing Systems
Johnny Wong
Note: These slides are based on those from the textbook
"Computer Networking: A Top-Down Approach Featuring the Internet"
by J.F. Kurose and K.W. Ross
P2P file sharing
Example
• Alice runs a P2P client application on her notebook computer
• Intermittently connects to the Internet; gets a new IP address for each connection
• Asks for "Hey Jude"
• Application displays other peers that have a copy of "Hey Jude"
• Alice chooses one of the peers, Bob
• File is copied from Bob's PC to Alice's notebook: HTTP
• While Alice downloads, other users upload from Alice
• Alice's peer is both a Web client and a transient Web server
All peers are servers = highly scalable!
P2P: centralized directory
original "Napster" design: centralized directory server
1) when a peer connects, it informs the central server of its:
   – IP address
   – content
2) Alice queries the server for "Hey Jude"
3) Alice requests the file from Bob
[Figure: Alice, Bob, and other peers connected to the centralized directory server, with steps 1–3 annotated.]
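The three steps above can be sketched as a toy in-memory directory. This is an illustrative model only (class and method names are hypothetical); the real Napster protocol ran its own message format over TCP.

```python
# Toy Napster-style centralized directory (hypothetical in-memory model).

class DirectoryServer:
    def __init__(self):
        self.index = {}  # file name -> set of peer IP addresses

    def register(self, peer_ip, files):
        # Step 1: a connecting peer informs the server of its IP and content.
        for name in files:
            self.index.setdefault(name, set()).add(peer_ip)

    def query(self, name):
        # Step 2: a peer asks which peers hold a given file.
        return sorted(self.index.get(name, set()))

server = DirectoryServer()
server.register("10.0.0.2", ["Hey Jude"])   # Bob
server.register("10.0.0.3", ["Yesterday"])
peers = server.query("Hey Jude")            # Alice's query
# Step 3: Alice contacts one returned peer (e.g. Bob) directly and
# transfers the file peer-to-peer, e.g. over HTTP.
```

Note how only the *lookup* goes through the server; the file transfer itself is peer-to-peer, which is exactly why the directory, not the transfer, is the single point of failure discussed next.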
P2P: problems with centralized directory
• Single point of failure
• Performance bottleneck
• Copyright infringement
file transfer is decentralized, but locating content is highly centralized
P2P: Query flooding
Gnutella
• no hierarchy
• use bootstrap node to learn about others
• join message
• send query to neighbors
• neighbors forward query
• if queried peer has object, it sends message back to querying peer
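A minimal sketch of query flooding over an overlay graph, assuming an adjacency-list representation (real Gnutella uses TCP links, TTL/hop fields, and duplicate detection by message ID; the function below is illustrative):

```python
# Toy Gnutella-style query flooding over an overlay adjacency list.

def flood_query(overlay, start, wanted, holders, max_hops):
    """Return peers holding `wanted` reachable within `max_hops` of `start`."""
    found = set()
    visited = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for peer in frontier:
            for neighbor in overlay.get(peer, []):
                if neighbor in visited:
                    continue  # skip already-flooded peers (loop avoidance)
                visited.add(neighbor)
                if wanted in holders.get(neighbor, set()):
                    # in real Gnutella, the hit message travels back along
                    # the reverse query path to the querying peer
                    found.add(neighbor)
                next_frontier.append(neighbor)
        frontier = next_frontier
    return found
```

The `visited` set stands in for Gnutella's duplicate-message detection; without it, cycles in the overlay would re-flood the same peers indefinitely.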
P2P: more on query flooding
Pros
• peers have similar responsibilities: no group leaders
• highly decentralized
• no peer maintains directory info
Cons
• excessive query traffic
• limited query radius: may miss content even when it is present
• bootstrap node needed
• maintenance of overlay network
P2P: decentralized directory
• Each peer is either a group leader or assigned to a group leader.
• Group leader tracks the content in all its children.
• Peer queries its group leader; the group leader may query other group leaders.
[Figure: ordinary peers, group-leader peers, and neighboring relationships in the overlay network.]
More about decentralized directory
overlay network
• peers are nodes
• edges between peers and their group leaders
• edges between some pairs of group leaders
• virtual neighbors
bootstrap node
• connecting peer is either assigned to a group leader or designated as a leader
advantages of approach
• no centralized directory server
  – location service distributed over peers
  – more difficult to shut down
disadvantages of approach
• bootstrap node needed
• group leaders can get overloaded
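The two-tier lookup described above can be sketched as follows. All names and structures here are illustrative stand-ins, not an actual protocol:

```python
# Toy two-tier (group-leader) directory lookup.

class GroupLeader:
    def __init__(self):
        self.children = {}   # ordinary peer -> set of file names it stores
        self.neighbors = []  # other group leaders in the overlay

    def add_child(self, peer, files):
        self.children[peer] = set(files)

    def local_matches(self, name):
        # Which of my own children hold this file?
        return [p for p, fs in self.children.items() if name in fs]

    def lookup(self, name):
        # First check my own children, then ask neighboring group leaders.
        hits = self.local_matches(name)
        for leader in self.neighbors:
            hits.extend(leader.local_matches(name))
        return hits
```

Queries thus touch only group leaders, never ordinary peers, which is what keeps flooding traffic far below plain Gnutella while avoiding a single central server.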
Unstructured P2P File Sharing
• Centralized
– Napster
• Distributed
– Gnutella
– KaZaA
Napster: how does it work
• Application-level, client-server protocol over point-to-point TCP
• Centralized directory server
• Steps:
  – connect to the Napster server
  – upload your list of files to the server
  – give the server keywords to search the full list with
  – select the "best" of the matching answers
    • ping each candidate server and choose by RTT
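The final step, picking the "best" candidate by RTT, amounts to a minimum over measured round-trip times. A hedged sketch, where `probe_rtt` is a caller-supplied stand-in for real pings:

```python
# Pick the candidate peer with the smallest measured round-trip time.

def best_peer(candidates, probe_rtt):
    """probe_rtt(peer) returns that peer's measured RTT; pick the minimum."""
    return min(candidates, key=probe_rtt)
```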
Gnutella
• Open source
• Links are TCP connections
• Each peer forwards a query to each of its neighbors; once the file is found, the response follows the search path backward to the querier; the file transfer itself is point-to-point
Gnutella (con't)
• decentralized searching for files
• central directory server no longer the bottleneck
  – more difficult to "pull the plug"
• each application instance serves to:
  – store selected files
  – route queries from and to its neighboring peers
  – respond to queries if the file is stored locally
  – serve files
Gnutella history
• 3/14/00: release by AOL, almost immediately withdrawn
• became open source
• many iterations to fix poor initial design (poor design turned many people off)
• issues:
  – how much traffic does one query generate?
  – how many hosts can it support at once?
  – what is the latency associated with querying?
  – is there a bottleneck?
Gnutella: limited scope query
• Searching by flooding:
  – if you don't have the file you want, query 7 of your neighbors
  – if they don't have it, they contact 7 of their neighbors, up to a maximum hop count of 10
  – reverse path forwarding for responses (not files)
    • useful for saving TCP connections
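These two parameters (fan-out 7, max hop count 10) give a quick way to bound the "how much traffic does one query generate?" question from the previous slide. A naive worst-case count, ignoring duplicate suppression (real traffic is far lower because peers drop queries they have already seen):

```python
# Naive upper bound on query messages for limited-scope flooding:
# each peer forwards to up to 7 neighbors, for up to 10 hops.
fanout, max_hops = 7, 10
upper_bound = sum(fanout ** h for h in range(1, max_hops + 1))
# geometric series 7 + 7^2 + ... + 7^10, roughly 3.3e8 messages worst case
```

The point of the bound is qualitative: even modest fan-out and hop limits permit enormous message counts, which is why duplicate detection and the hop limit itself are essential.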
Gnutella in practice
• Gnutella traffic << KaZaA traffic
• Anecdotal:
  – couldn't find anything
  – downloads wouldn't complete
• Fixes: do things KaZaA is doing: hierarchy, queue management, parallel download, …
KaZaA: Technology
Software
• Proprietary
• control data encrypted (including queries/responses)
  – KaZaA Web site gives a few hints
  – some studies described on the Web
• everything carried in HTTP request and response messages
Architecture
• hierarchical
• cross between Napster and Gnutella
KaZaA: The service (2)
• User can configure max number of simultaneous uploads and max number of simultaneous downloads
  – queue management at server and client
• Frequent uploaders can get priority in the server queue
• Keyword search
  – user can configure "up to x" responses to keywords
  – responses to keyword queries come in waves; stops when x responses are found
• From the user's perspective, the service resembles Google, but provides links to MP3s and videos rather than Web pages
KaZaA: Architecture
• Each peer is either a supernode or an ordinary node (assigned to one supernode)
  – each supernode connected to many other supernodes (supernode overlay)
• Nodes that have more connection bandwidth and are more available are designated as supernodes
KaZaa Supernode
• Each supernode acts as a mini-Napster hub, tracking the content and IP addresses of its descendants
• Guess: a supernode has (on average) 200–500 descendants; roughly 10,000 supernodes
KaZaA: Overlay maintenance
• List of potential supernodes included with the software download
  – new peer goes through the list until it finds an operational supernode
  – connects, obtains a more up-to-date list
• Node then pings nodes on the list and connects with the one with the smallest RTT
• If its supernode goes down, the node obtains an updated list and chooses a new supernode
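The selection logic above can be sketched as: walk the cached list, skip unreachable supernodes, and connect to the responsive one with the smallest RTT. `ping` here is a hypothetical stand-in probe, not KaZaA's actual (proprietary) mechanism:

```python
# Toy KaZaA-style supernode selection from a cached candidate list.

def choose_supernode(candidates, ping):
    """ping(node) returns an RTT in ms, or None if the node is unreachable."""
    reachable = {n: rtt for n in candidates
                 if (rtt := ping(n)) is not None}
    if not reachable:
        return None  # caller should fetch a fresher list and retry
    return min(reachable, key=reachable.get)
```

Returning `None` when every candidate is down mirrors the slide's recovery path: obtain an updated list and choose again.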
KaZaA Queries
• Node first sends query (keyword) to its supernode
• Supernode responds with matches
  – if x matches found, done
• Otherwise, supernode forwards query to a subset of supernodes
  – if a total of x matches found, done
• Otherwise, query is forwarded further
  – probably by the original supernode rather than recursively?
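This staged expansion, answering from the supernode's own index first and widening only while fewer than x matches exist, can be sketched as follows (indexes are modeled as plain lists of file names; the real protocol is proprietary):

```python
# Toy KaZaA-style staged query expansion, stopping at x matches.

def staged_search(own_index, other_supernodes, keyword, x):
    # Stage 1: answer from this supernode's own index.
    matches = [f for f in own_index if keyword in f]
    # Stage 2+: widen to other supernodes only while short of x matches.
    for sn_index in other_supernodes:
        if len(matches) >= x:
            break
        matches.extend(f for f in sn_index if keyword in f)
    return matches[:x]
```

This also explains the "responses come in waves" behavior from the service slide: each widening stage produces one wave, and the search stops as soon as x responses have accumulated.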