Peer to Peer Network with Application

Download Report

Transcript Peer to Peer Network with Application

Peer-to-Peer (P2P) networks
and applications
What is P2P?
 “the sharing of computer resources and
services by direct exchange of
information”
What is P2P?
 “P2P is a class of applications that take
advantage of resources – storage, cycles,
content, human presence – available at the
edges of the Internet. Because accessing
these decentralized resources means
operating in an environment of unstable and
unpredictable IP addresses P2P nodes must
operate outside the DNS system and have
significant, or total autonomy from central
servers”
What is P2P?
 “A distributed network architecture may
be called a P2P network if the participants
share a part of their own resources. These
shared resources are necessary to provide
the service offered by the network. The
participants of such a network are both
resource providers and resource
consumers”
What is P2P?
 Various definitions seem to agree on
sharing of resources
 direct communication between equals (peers)
 no centralized control

What is a peer?
 “…an entity with capabilities similar
to other entities in the system.”
Client/Server Architecture
 Well known,
powerful, reliable
server is a data
source
 Clients request data
from server
Server
Internet
 Very successful
model

WWW (HTTP), FTP,
Web services, etc.
Client
Client
Client
* Figure from http://project-iris.net/talks/dht-toronto-03.ppt
Client
Client/Server Limitations
 Scalability is hard to achieve
 Presents a single point of failure
 Requires administration
 Unused resources at the network edge
 P2P systems try to address these
limitations
P2P Architecture
 All nodes are both
clients and servers


Provide and consume
data
Any node can initiate a
connection
Node
 No centralized data
Internet
source


“The ultimate form of
democracy on the
Internet”
“The ultimate threat to
copy-right protection
on the Internet”
Node
Node
Node
* Content from http://project-iris.net/talks/dht-toronto-03.ppt
Node
P2P Network Characteristics
 Clients are also servers and routers
 Nodes contribute content, storage, memory, CPU
 Nodes are autonomous (no administrative
authority)
 Network is dynamic: nodes enter and leave the
network “frequently”
 Nodes collaborate directly with each other (not
through well-known servers)
 Nodes have widely varying capabilities
P2P Goals and Benefits

Efficient use of resources


Scalability






Nodes self-organize
Built-in fault tolerance, replication, and load balancing
Increased autonomy

not easy in a centralized system


Replicas
Geographic distribution
No single point of failure
Ease of administration


No central information, communication and computation bottleneck
Aggregate resources grow naturally with utilization
Reliability


Unused bandwidth, storage, processing power at the “edge of the network”
Anonymity – Privacy
Dynamism


highly dynamic environment
ad-hoc communication and collaboration
What is Peer-to-Peer (P2P)?
P2P Applications
 File sharing (Napster, Gnutella, Kazaa)
 Multiplayer games (Unreal Tournament, DOOM)
 Collaborative applications (ICQ, shared whiteboard)
 Distributed computation (Seti@home)
 Ad-hoc networks
P2P System Taxonomy
 Historic
 Data-centric
 Computation-centric
 User-centric
 Network-centric
 Platforms
Computation-centric
User-centric
sendMessage
receiveMessage
sendMessage
receiveMessage
User-centric (Common
implementation)
sendMessage
receiveMessage
sendMessage
receiveMessage
Network-centric
Network-centric
Platforms
Gnutella
Find Peers
Instant Messaging
…
Send Messages
P2P Goals/Benefits
 Cost sharing
 Resource aggregation
 Improved scalability/reliability
 Increased autonomy
 Anonymity/privacy
 Dynamism
 Ad-hoc communication
P2P Challenges
 Decentralization
 Scalability and Performance
 Anonymity
 Fairness
 Dynamism
 Security
 Transparency
 Fault Resilience and Robustness
Peer-to-Peer Content Sharing
Popular file sharing P2P Systems
 Napster, Gnutella, Kazaa, Freenet
 Large scale sharing of files.
 User A makes files (music, video, etc.) on their
computer available to others
 User B connects to the network, searches for
files and downloads files directly from user A
 Issues of copyright infringement
Research Areas
 Peer discovery and group management
 Data placement and searching
 Reliable and efficient file exchange
 Security/privacy/anonymity/trust
Design Concerns
 Group Management
 Per-node state
 Load balancing
 Fault tolerance/resiliency
 Search
 Bandwidth usage
 Time to locate item
 Success rate
 Fault tolerance/resiliency
Approaches
 Centralized
 Unstructured
 Structured (Distributed Hash Tables)
Centralized index
original “Napster” design
1) when peer connects, it
informs central server:


Bob
centralized
directory server
1
peers
IP address
content
2) Alice queries for “Hey
Jude”
3) Alice requests file from
Bob
1
3
1
2
1
Alice
Centralized model
Bob
Alice
file transfer is
decentralized, but
locating content is
highly centralized
Judy
Jane
Centralized
 Benefits:
 Low per-node state
 Limited bandwidth
usage
 Short location time
 High success rate
 Fault tolerant
 Drawbacks:
 Single point of failure
 Limited scale
 Possibly unbalanced
load
 copyright infringement
Bob
Judy
Alice
Jane
Napster
 program for sharing files over the Internet
 a “disruptive” application/technology?
 history:






5/99: Shawn Fanning (freshman, Northeasten U.) founds
Napster Online music service
12/99: first lawsuit
3/00: 25% UWisc traffic Napster
2000: est. 60M users
2/01: US Circuit Court of
Appeals: Napster knew users
violating copyright laws
7/01: # simultaneous online users:
Napster 160K, Gnutella: 40K, Morpheus: 300K
Napster: how does it work
Application-level, client-server protocol over point-topoint TCP
Four steps:
 Connect to Napster server
 Upload your list of files (push) to server.
 Give server keywords to search the full list with.
 Select “best” of correct answers. (pings)
Napster
1. File list is
uploaded
napster.com
users
Napster
2. User
requests
search at
server.
napster.com
Request
and
results
user
Napster
3. User pings
hosts that
apparently
have data.
Looks for best
transfer rate.
napster.com
pings
pings
user
Napster
napster.com
4. User retrieves
file
Retrieves
file
user
Napster
 Central Napster server





Can ensure correct results
Fast search
Bottleneck for scalability
Single point of failure
Susceptible to denial of service
• Malicious users
• Lawsuits, legislation
 Hybrid P2P system – “all peers are equal but some are more equal
than others”


Search is centralized
File transfer is direct (peer-to-peer)
Unstructured
 fully distributed
 no central server
 used by Gnutella
 Each peer indexes the
files it makes available
for sharing (and no
other files)
overlay network: graph
 edge between peer X
and Y if there’s a TCP
connection
 all active peers and
edges form overlay net
 edge: virtual (not
physical) link
 given peer typically
connected with < 10
overlay neighbors
Gnutella: Query flooding
 Query message
sent over existing TCP
connections
 peers forward
Query message
 QueryHit
sent over
reverse
Query
path
QueryHit
File transfer:
HTTP
Query
QueryHit
Gnutella: Peer joining
joining peer Alice must find another peer in
Gnutella network: use list of candidate peers
2. Alice sequentially attempts TCP connections with
candidate peers until connection setup with Bob
3. Flooding: Alice sends Ping message to Bob; Bob
forwards Ping message to his overlay neighbors
(who then forward to their neighbors….)
 peers receiving Ping message respond to Alice
with Pong message
4. Alice receives many Pong messages, and can then
setup additional TCP connections
1.
Unstructured
Carl
Jane
Scalability:
limited scope
flooding
Bob
Alice
Judy
Unstructured
Carl
Jane
 Gnutella model
 Benefits:
 Limited per-node state
 Fault tolerant
 Drawbacks:
 High bandwidth usage
 Long time to locate item
 No guarantee on success rate
 Possibly unbalanced load
Bob
Alice
Judy
Gnutella
Searching by flooding:
 If you don’t have the file
you want, query 7 of your
neighbors.
 If they don’t have it, they
contact 7 of their
neighbors, for a maximum
hop count of 10.
 Requests are flooded, but
there is no tree structure.
 No looping but packets may
be received twice.
 Reverse path forwarding
* Figure from http://computer.howstuffworks.com/file-sharing.htm
Gnutella
fool.* ?
TTL = 2
Gnutella
X
fool.her
IPX:fool.her
TTL = 1
TTL = 1
TTL = 1
Gnutella
Y
fool.me
fool.you
IPY:fool.me
fool.you
Gnutella
IPY:fool.me
fool.you
Gnutella: strengths and weaknesses
 pros:
 flexibility in query processing
 complete decentralization
 simplicity
 fault tolerance/self-organization
 cons:
 severe scalability problems
 susceptible to attacks
 Pure P2P system
Gnutella: initial problems and fixes
 2000: avg size of reachable network only 400-800
hosts. Why so small?

modem users: not enough bandwidth to provide search
routing capabilities: routing black holes
 Fix: create peer hierarchy based on capabilities
 previously: all peers identical, most modem black holes
 preferential connection:
• favors routing to well-connected peers
• favors reply to clients that themselves serve large number
of files: prevent freeloading
Structured
001
012
 FreeNet, Chord, CAN,
Tapestry, Pastry model
212 ?
212 ?
332
212
305
Structured
001
012
 FreeNet, Chord, CAN,
Tapestry, Pastry model
 Benefits:



Manageable per-node state
Manageable bandwidth usage
and time to locate item
Guaranteed success
 Drawbacks:
 Possibly unbalanced load
 Harder to support fault
tolerance
212 ?
212 ?
332
212
305
Unstructured vs Structured
P2P
 The systems we described do not offer any
guarantees about their performance (or even
correctness)
 Structured P2P
Scalable guarantees on numbers of hops to answer
a query
 Maintain all other P2P properties (load balance,
self-organization, dynamic nature)

 Approach: Distributed Hash Tables (DHT)
Distributed Hash Tables (DHT)
 Distributed version of a hash table data structure
 Stores (key, value) pairs


The key is like a filename
The value can be file contents, or pointer to location
 Goal: Efficiently insert/lookup/delete (key, value) pairs
 Each peer stores a subset of (key, value) pairs in the system
 Core operation: Find node responsible for a key


Map key to node
Efficiently route insert/lookup/delete request to this node
 Allow for frequent node arrivals/departures
DHT Desirable Properties
 Keys should mapped evenly to all nodes in
the network (load balance)
 Each node should maintain information
about only a few other nodes (scalability,
low update cost)
 Messages should be routed to a node
efficiently (small number of hops)
 Node arrival/departures should only affect
a few nodes
DHT Routing Protocols
 DHT is a generic interface
 There are several implementations of this interface









Chord [MIT]
Pastry [Microsoft Research UK, Rice University]
Tapestry [UC Berkeley]
Content Addressable Network (CAN) [UC Berkeley]
SkipNet [Microsoft Research US, Univ. of Washington]
Kademlia [New York University]
Viceroy [Israel, UC Berkeley]
P-Grid [EPFL Switzerland]
Freenet [Ian Clarke]
Basic Approach
In all approaches:
 keys are associated with globally unique IDs

integers of size m (for large m)
 key ID space (search space) is uniformly populated
- mapping of keys to IDs using (consistent) hashing
 a node is responsible for indexing all the keys in a
certain subspace (zone) of the ID space
 nodes have only partial knowledge of other node’s
responsibilities
Improvements: SuperPeers
 KaZaA model
 Hybrid centralized and unstructured
 Advantages and disadvantages?
Hierarchical Overlay
 between centralized
index, query flooding
approaches
 each peer is either a
super node or assigned to
a super node


TCP connection between
peer and its super node.
TCP connections between
some pairs of super nodes.
 Super node tracks content
in its children
ordinary peer
group-leader peer
neighoring relationships
in overlay network
Kazaa (Fasttrack network)
 Hybrid of centralized Napster and decentralized Gnutella

hybrid P2P system
 Super-peers act as local search hubs


Each super-peer is similar to a Napster server for a small
portion of the network
Super-peers are automatically chosen by the system based on
their capacities (storage, bandwidth, etc.) and availability
(connection time)
 Users upload their list of files to a super-peer
 Super-peers periodically exchange file lists
 You send queries to a super-peer for files of interest
File Distribution: Server-Client vs P2P
Question : How much time to distribute file
from one server to N peers?
us: server upload
bandwidth
Server
us
File, size F
dN
uN
u1
d1
u2
ui: peer i upload
bandwidth
d2
Network (with
abundant bandwidth)
di: peer i download
bandwidth
File distribution time: server-client
 server sequentially
sends N copies:

NF/us time
 client i takes F/di
time to download
Server
F
us
dN
u1 d1 u2
d2
Network (with
abundant bandwidth)
uN
Time to distribute F
to N clients using = dcs = max { NF/us, F/min(di) }
i
client/server approach
increases linearly in N
(for large N)
File distribution time: P2P
 server must send one
Server
F
u1 d1 u2
d2
copy: F/us time
us
 client i takes F/di time
Network (with
dN
to download
abundant bandwidth)
uN
 NF bits must be
downloaded (aggregate)
 fastest possible upload rate: us + Sui
dP2P = max { F/us, F/min(di) , NF/(us + Sui) }
i
Server-client vs. P2P: example
Client upload rate = u, F/u = 1 hour, us = 10u, dmin ≥ us
Minimum Distribution Time
3.5
P2P
Client-Server
3
2.5
2
1.5
1
0.5
0
0
5
10
15
20
N
25
30
35
File distribution: BitTorrent
 P2P file distribution
tracker: tracks peers
participating in torrent
obtain list
of peers
trading
chunks
peer
torrent: group of
peers exchanging
chunks of a file
BitTorrent (1)
 file divided into 256KB chunks.
 peer joining torrent:
has no chunks, but will accumulate them over time
 registers with tracker to get list of peers,
connects to subset of peers (“neighbors”)
 while downloading, peer uploads chunks to other
peers.
 peers may come and go
 once peer has entire file, it may (selfishly) leave or
(altruistically) remain

BitTorrent (2)
Pulling Chunks
 at any given time,
different peers have
different subsets of
file chunks
 periodically, a peer
(Alice) asks each
neighbor for list of
chunks that they have.
 Alice sends requests
for her missing chunks
 rarest first
Sending Chunks: tit-for-tat
 Alice sends chunks to four
neighbors currently
sending her chunks at the
highest rate
 re-evaluate top 4 every
10 secs
 every 30 secs: randomly
select another peer,
starts sending chunks
 newly chosen peer may
join top 4
 “optimistically unchoke”
BitTorrent: Tit-for-tat
(1) Alice “optimistically unchokes” Bob
(2) Alice becomes one of Bob’s top-four providers; Bob reciprocates
(3) Bob becomes one of Alice’s top-four providers
With higher upload rate,
can find better trading
partners & get file faster!
P2P: searching for information
Index in P2P system: maps information to peer location
(location = IP address & port number)
.
Instant messaging
File sharing (eg e-mule)
 Index maps user
 Index dynamically
names to locations.
tracks the locations of
files that peers share.
 When user starts IM
application, it needs to
 Peers need to tell
inform index of its
index what they have.
location
 Peers search index to
 Peers search index to
determine where files
determine IP address
can be found
of user.
P2P Case study: Skype
Skype clients (SC)
 inherently P2P: pairs
of users communicate.
 proprietary
Skype
login server
application-layer
protocol (inferred via
reverse engineering)
 hierarchical overlay
with SNs
 Index maps usernames
to IP addresses;
distributed over SNs
Supernode
(SN)
Anonymity
 Napster, Gnutella, Kazaa don’t provide
anonymity
Users know who they are downloading from
 Others know who sent a query

 Freenet
 Designed to provide anonymity among other
features
Freenet
 Keys are mapped to IDs
 Each node stores a cache of keys, and a routing table
for some keys

routing table defines the overlay network
 Queries are routed to the node with the most similar
key
 Data flows in reverse path of query



Impossible to know if a user is initiating or forwarding a query
Impossible to know if a user is consuming or forwarding data
Keys replicated in (some) of the nodes along the path
Freenet routing
N02
N03
K117 ?
N04
TTL = 8
N01
N05
K124 N02
K317 N08
K613 N05
N06
N07
N08
Freenet routing
K514 N01
sorry...
N02
N03
TTL = 7
K117 ?
N04
N01
N05
TTL = 6
K124 N02
K317 N08
K613 N05
N06
N07
N08
Freenet routing
K514 N01
N02
N03
TTL = 7
K117 ?
N04
N01
N05
TTL = 6
K124 N02
K317 N08
K613 N05
N06
K100 N07
K222 N06
K617 N05
N07
N08
TTL = 5
Freenet routing
K514 N01
N02
N03
TTL = 7
K117 ?
N04
N01
N05
TTL = 6
K124 N02
K317 N08
K613 N05
N06
117
K100 N07
K222 N06
K617 N05
N07
N08
TTL = 5
117
Freenet routing
K514 N01
N02
N03
TTL = 7
K117 ?
N04
N01
N05
TTL = 6
K124 N02
K317 N08
K613 N05
117
N06
117
K117
K100
K222
K617
N07
N07
N06
N05
117
N08
TTL = 5
N07
Freenet routing
K514 N01
N02
N03
N04
N01
K117
K124
K317
K613
117
N05
N07
N02
N08
N05
N06
117
K117
K100
K222
K617
N07
N07
N06
N05
117
N08
N07
Freenet routing
K514 N01
N02
N03
N04
N01
K117
K124
K317
K613
117
N05
N08
N02
N08
N05
N06
117
K117
K100
K222
K617
N08
N07
N06
N05
117
N07
N08
Inserts are performed similarly – they are unsuccessful queries
Freenet strengths and weaknesses
 pros:
 complete decentralization
 fault tolerance/self-organization
 anonymity
 scalability (to some degree)
 cons:
 questionable efficiency & performance
 rare keys disappear from the system
 improvement of performance requires high overhead
(maintenance of additional information for routing)
P2P Review
 Two key functions of P2P systems


Sharing content
Finding content
 Sharing content



Direct transfer between peers
• All systems do this
Structured vs. unstructured placement of data
Automatic replication of data
 Finding content



Centralized (Napster)
Decentralized (Gnutella)
Probabilistic guarantees (DHTs)
Issues with P2P
 Free Riding (Free Loading)
 Two types of free riding
• Downloading but not sharing any data
• Not sharing any interesting data

On Gnutella
• 15% of users contribute 94% of content
• 63% of users never responded to a query
– Didn’t have “interesting” data
 No ranking: what is a trusted source?
 “spoofing”