p2p-unstructured - Computer Science and Engineering

Download Report

Transcript p2p-unstructured - Computer Science and Engineering

Overview of Peer-to-Peer Systems,
Grids, and Clouds (Part 1)
Unstructured P2P Systems
COP 6390.all
02/21/2011
Anda Iamnitchi
1
Google trends: Grid computing (blue), peer-to-peer (red), cloud computing (orange)
2
What they have in common
 Real instances of distributed systems
 Passed the test of deployment and usability
 Large-scale distributed systems
 Interesting phenomena at large scale
 Scale: users, resources, geographical
3
Why Peer-to-Peer Systems?
 Wide-spread user experience
 Large-scale distributed application,
unprecedented growth and
popularity
 KaZaA – 389 millions downloads
(1M/week) one of the most
popular applications ever!
 Heavily researched in the last 10
years with results in:
 User behavior characterization
 Scalability
 Novel problems: reputation,
trust, incentives for fairness
 Commercial impact
eDonkey
3,108,909
FastTrack (Kazaa)
2,114,120
Gnutella
2,899,788
Cvernet
691,750
Filetopia
3,405
Number of users for file-sharing applications
(estimate www.slyck.com, Sept ‘06)
What Is a P2P System?
Node
Node
Node
Internet
Node
Node
 A distributed system architecture:
 No centralized control (debatable: Napster?)
 Nodes are symmetric in function (debatable: New Gnutella
5
protocol?)
 Large number of unreliable nodes
 Initially identified with music file sharing
P2P Definition(s)
A number of definitions coexist:
 Def 1:“A class of applications that takes advantage of resources —
storage, cycles, content, human presence — available at the edges of
the Internet.”
 Edges often turned off, without permanent IP addresses
 Def 2: “A class of decentralized, self-organizing distributed systems, in
which all or most communication is symmetric.”
6
Lots of other definitions that fit in between
The Promise of P2P Computing
 High capacity through parallelism:
 Many disks
 Many network connections
 Many CPUs
 Reliability:
 Many replicas:
 of data
 of network paths
 Geographic distribution
 Automatic configuration
 Useful in public and proprietary settings
7
Popularity since 2004
Britney Spears 1.00
p2p
0.40
Normalized and compared to the popularity of Britney Spears as shown
by Google Trends
8
Napster: History
 Program for sharing files over the Internet
 History:
 5/99: Shawn Fanning (freshman, Northeasten U.)
founds Napster Online music service
 12/99: first lawsuit
 3/00: 25% UWisc traffic Napster
 2000: est. 60M users
 2/01: US Circuit Court of Appeals:
Napster knew users violating
copyright laws
 7/01: # simultaneous online users:
Napster 160K, Gnutella: 40K, Morpheus: 300K
9
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
10
Napster: How It Works
napster.com
• Client-Server: Use central server to locate files
• Peer-to-Peer: Download files directly from peers
11
Napster
1. File list is uploaded
(Join and Publish)
napster.com
users
12
Napster
2. User requests search
at server (Search).
napster.com
Request
and
results
user
13
Napster
napster.com
3. User pings hosts
that apparently
have data.
Looks for best
transfer rate.
ping
ping
user
14
Napster
4. User retrieves file
(Fetch)
napster.com
Download
file
user
15
Lessons Learned from Napster
 Strengths: Decentralization of Storage
 Every node “pays” its participation by providing access to its resources
 physical resources (disk, network), knowledge (annotations), ownership (files)
 Every participating node acts as both a client and a server (“servent”): P2P
 Decentralization of cost and administration = avoiding resource bottlenecks
 Weaknesses: Centralization of Data Access Structures (Index)
 Server is single point of failure
 Unique entity required for controlling the system = design bottleneck
 Copying copyrighted material made Napster target of legal attack
increasing degree of resource sharing and decentralization
16
Centralized
System
Decentralized
System
Gnutella: File-Sharing with No Central
Server
17
Gnutella: History
 Developed in a 14 days “quick hack” by Nullsoft (winamp)
 Originally intended for “exchange of cooking recipes”
 Evolution of Gnutella
 Published under GNU General Public License on the Nullsoft web server
 Taken off after a couple of hours by AOL (owner of Nullsoft): too late
 Gnutella protocol was reverse engineered from the original
 Protocol published
 Third-party clients were published and Gnutella started to spread
 Many iterations to fix poor initial design
 High impact:
 Many versions implemented
 Many different designs
18
 Lots of research papers/ideas
Gnutella: Search in an Unstructured
Overlay
I have file A.
I have file A.
Reply
Flooding
Query
Where is file A?
19
Gnutella: Overview
 Join: on startup, client contacts a few other nodes; these become its “neighbors”
 Initial list of contacts published as gnutellahosts.com:6346
 Outside the Gnutella protocol specification
 Default value for number of open connections (neighbors): C = 4
 Publish: no need
 Search:
 Flooding: ask neighbors, who ask their neighbors, and so on...
 Each forwarding of requests decreases a TTL. Default: TTL = 7
 When/if found, reply to sender
 Drop forwarding requests when TTL expires
 One request leads to
TTL
2 * i 0 C * (C  1)i  26,240
 Back-propagation in case of success (Why?)
 Fetch: get the file directly from peer (HTTP)
messages
Gnutella: Protocol Message Types
Type
Ping
Pong
Query
QueryHit
Push
Description
Contained Information
Announce availability and probe for None
other servents
Response to a ping
IP address and port# of responding servent;
number and total kb of files shared
Search request
Minimum network bandwidth of responding
servent; search criteria
Returned by servents that have
IP address, port# and network bandwidth of
the requested file
responding servent; number of results and
result set
File download requests for
Servent identifier; index of requested file; IP
servents behind a firewall
address and port to send file to
What would you ask about
a Gnutella network?
22
Gnutella: Tools for Network Exploration
 Eavesdrop traffic - insert modified node into the network and log traffic.
 Crawler - connect to active nodes and use the membership protocol to discover
membership and topology.
23
Gnutella: Heterogeneity
All Peers Equal? (1)
1.5Mbps DSL
1.5Mbps DSL
56kbps Modem
1.5Mbps DSL
10Mbps LAN
1.5Mbps DSL
56kbps Modem
56kbps Modem
24
Gnutella Network Structure: Improvement
Gnutella Protocol 0.6
Two tier architectures of ultrapeers and leaves
Ultrapeers
Leaves
Data transfer (file download)
Control messages (search, join, etc)
25
Déjà vu?
Gnutella Protocol 0.6
Two tier architectures of ultrapeers and leaves
Ultrapeers
Leaves
Data transfer (file download)
Control messages (search, join, etc)
26
Gnutella: Free Riding
All Peers Equal? (2)
 More than 25% of Gnutella clients
share no files; 75% share 100 files
or less
 Conclusion: Gnutella has a high
percentage of free riders
 If only a few individuals
contribute to the public good,
these few peers effectively act as
centralized servers.
 Outcome:
 Significant efforts in building
incentive-based systems
27  BitTorrent?
Adar and Huberman (Aug ’00)
Flooding in Gnutella: Loops?
Seen request already
28
Improvements of Message Flooding
 Expanding Ring
 start search with small TTL (e.g. TTL = 1)
 if no success iteratively increase TTL (e.g. TTL = TTL +2)
 k-Random Walkers
 forward query to one randomly chosen neighbor only, with large TTL
 start k random walkers
 random walker periodically checks
with requester whether to continue
 Experiences (from simulation)
 adaptive TTL is useful
 message duplication should be avoided
 flooding should be controlled at fine granularity
29
Gnutella Topology (Mis)match?
30
Gnutella: Network Size?
Explosive growth in 2001, slowly shrinking thereafter
•
High user interest
– Users tolerate high latency, low
quality results
•
Better resources
– DSL and cable modem nodes grew
from 24% to 41% over first 6
months.
31
Is Gnutella a Power-Law Network?
Power-law networks: the number of links per node
follows a power-law distribution
N = L-k
Num. of nodes (log scale)
10000
Examples of power-law networks:
November 2000
1000
–
–
–
–
–
100
10
The Internet at AS level
In/out links to/from HTML pages
Airports
US power grid
Social networks
1
1
10
100
Number of links (log scale)
Implications: High tolerance to random node failure but low reliability
32
when facing of an ‘intelligent’ adversary
Network Resilience
Partial Topology
Random 30% die
Targeted 4% die
from Saroiu et al., MMCN 2002
33
Is Gnutella a Power-Law Network?
(Later Data)
Later, larger networks display a bimodal distribution
Implications:
– High tolerance to random node failures preserved
– Increased reliability when facing an attack.
Number of nodes
(log scale)
10000
May 2001
1000
From Ripeanu, Iamnitchi, Foster, 2002
34
100
10
1
1
10
100
Number of links (log scale)
Discussion on Unstructured P2P
Networks
 Performance
 Search latency: low (graph properties)
 Message Bandwidth: high
 improvements through random walkers, but essentially the
35
whole network needs to be explored
 Storage cost: low (only local neighborhood)
 Update cost: low (only local updates)
 Resilience to failures good: multiple paths are explored and data
is replicated
 Qualitative Criteria
 search predicates: very flexible, any predicate is possible
 global knowledge: none required
 peer autonomy: high
BitTorrent
36
BitTorrent Components
 Torrent File
 Metadata of file to be shared
 Address of tracker
 List of pieces and their checksums
 Tracker
 Lists peers interested in the distribution of the file
 Peers
 Clients interested in the distribution of the file
 Can be “seeds” or “leachers”
37
A BitTorrent Swarm
• A “seed” node has the file
• A “tracker” associated with the file
• A “.torrent” meta-file is built for the file:
identifies the address of the tracker node
• The .torrent file is published on web
• File is split into fixed-size segments (e.g.,
256KB)
38
Choking Algorithm
 Each connected peer is in one of two states
 Choked: Download requests by a choked peer are ignored
 Unchoked: Download requests by an unchoked peer are honored
 Choking occurs at the peer level
 Each peer has a certain number of unchoke slots
 4 regular unchoke slots (per BitTorrent standard)
 1 optimistic unchoke slot (per BitTorrent standard)
 Choking Algorithm
 Peers unchoke connected peers with best service rate
 Service rate = rolling 20 second average of its upload bandwidth
 Optimistically unchoking peers prevents a static set of unchoked
39
peers
 The choking algorithm runs every 10 seconds
 Peers optimistically unchoked every 30 seconds
 New peers are 3 times more likely to be optimistically unchoked
Piece Selection
 Random First Piece
 Piece download at random
 Algorithm used by new peers
 Rarest Piece First
 Ensures > 1 distributed copies of a piece
 Increases interest of connected peers
 Increases scalability
 Random Piece vs. Rarest Piece
 Rarest has probabilistically high download time
 New peers want to reduce download time but also increase their
interest
40
BitTorrent: Overview
 Join: nothing
 Just find that there is a community ready to host your tracker
 Publish: Create tracker, upload .torrent metadata file
 Search:
 For file: nothing
 the community is supposed to provide search tools
 For segments: exchange segment IDs maps with other peers.
 Fetch: exchange segments with other peers (HTTP)
41
Gnutella vs. BitTorrent: Discussion
 Architecture
 Decentralization?
 System properties
 Reliability?
 Scalability?
 Fairness?
 Overheads?
 Quality of Service
 Search coverage for content?
 Ability to download content fast?
42