p2p-unstructured - Computer Science and Engineering
Download
Report
Transcript p2p-unstructured - Computer Science and Engineering
Overview of Peer-to-Peer Systems,
Grids, and Clouds (Part 1)
Unstructured P2P Systems
COP 6390.all
02/21/2011
Anda Iamnitchi
1
Google trends: Grid computing (blue), peer-to-peer (red), cloud computing (orange)
2
What they have in common
Real instances of distributed systems
Passed the test of deployment and usability
Large-scale distributed systems
Interesting phenomena at large scale
Scale: users, resources, geographical
3
Why Peer-to-Peer Systems?
Wide-spread user experience
Large-scale distributed application,
unprecedented growth and
popularity
KaZaA – 389 millions downloads
(1M/week) one of the most
popular applications ever!
Heavily researched in the last 10
years with results in:
User behavior characterization
Scalability
Novel problems: reputation,
trust, incentives for fairness
Commercial impact
eDonkey
3,108,909
FastTrack (Kazaa)
2,114,120
Gnutella
2,899,788
Cvernet
691,750
Filetopia
3,405
Number of users for file-sharing applications
(estimate www.slyck.com, Sept ‘06)
What Is a P2P System?
Node
Node
Node
Internet
Node
Node
A distributed system architecture:
No centralized control (debatable: Napster?)
Nodes are symmetric in function (debatable: New Gnutella
5
protocol?)
Large number of unreliable nodes
Initially identified with music file sharing
P2P Definition(s)
A number of definitions coexist:
Def 1:“A class of applications that takes advantage of resources —
storage, cycles, content, human presence — available at the edges of
the Internet.”
Edges often turned off, without permanent IP addresses
Def 2: “A class of decentralized, self-organizing distributed systems, in
which all or most communication is symmetric.”
6
Lots of other definitions that fit in between
The Promise of P2P Computing
High capacity through parallelism:
Many disks
Many network connections
Many CPUs
Reliability:
Many replicas:
of data
of network paths
Geographic distribution
Automatic configuration
Useful in public and proprietary settings
7
Popularity since 2004
Britney Spears 1.00
p2p
0.40
Normalized and compared to the popularity of Britney Spears as shown
by Google Trends
8
Napster: History
Program for sharing files over the Internet
History:
5/99: Shawn Fanning (freshman, Northeasten U.)
founds Napster Online music service
12/99: first lawsuit
3/00: 25% UWisc traffic Napster
2000: est. 60M users
2/01: US Circuit Court of Appeals:
Napster knew users violating
copyright laws
7/01: # simultaneous online users:
Napster 160K, Gnutella: 40K, Morpheus: 300K
9
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
10
Napster: How It Works
napster.com
• Client-Server: Use central server to locate files
• Peer-to-Peer: Download files directly from peers
11
Napster
1. File list is uploaded
(Join and Publish)
napster.com
users
12
Napster
2. User requests search
at server (Search).
napster.com
Request
and
results
user
13
Napster
napster.com
3. User pings hosts
that apparently
have data.
Looks for best
transfer rate.
ping
ping
user
14
Napster
4. User retrieves file
(Fetch)
napster.com
Download
file
user
15
Lessons Learned from Napster
Strengths: Decentralization of Storage
Every node “pays” its participation by providing access to its resources
physical resources (disk, network), knowledge (annotations), ownership (files)
Every participating node acts as both a client and a server (“servent”): P2P
Decentralization of cost and administration = avoiding resource bottlenecks
Weaknesses: Centralization of Data Access Structures (Index)
Server is single point of failure
Unique entity required for controlling the system = design bottleneck
Copying copyrighted material made Napster target of legal attack
increasing degree of resource sharing and decentralization
16
Centralized
System
Decentralized
System
Gnutella: File-Sharing with No Central
Server
17
Gnutella: History
Developed in a 14 days “quick hack” by Nullsoft (winamp)
Originally intended for “exchange of cooking recipes”
Evolution of Gnutella
Published under GNU General Public License on the Nullsoft web server
Taken off after a couple of hours by AOL (owner of Nullsoft): too late
Gnutella protocol was reverse engineered from the original
Protocol published
Third-party clients were published and Gnutella started to spread
Many iterations to fix poor initial design
High impact:
Many versions implemented
Many different designs
18
Lots of research papers/ideas
Gnutella: Search in an Unstructured
Overlay
I have file A.
I have file A.
Reply
Flooding
Query
Where is file A?
19
Gnutella: Overview
Join: on startup, client contacts a few other nodes; these become its “neighbors”
Initial list of contacts published as gnutellahosts.com:6346
Outside the Gnutella protocol specification
Default value for number of open connections (neighbors): C = 4
Publish: no need
Search:
Flooding: ask neighbors, who ask their neighbors, and so on...
Each forwarding of requests decreases a TTL. Default: TTL = 7
When/if found, reply to sender
Drop forwarding requests when TTL expires
One request leads to
TTL
2 * i 0 C * (C 1)i 26,240
Back-propagation in case of success (Why?)
Fetch: get the file directly from peer (HTTP)
messages
Gnutella: Protocol Message Types
Type
Ping
Pong
Query
QueryHit
Push
Description
Contained Information
Announce availability and probe for None
other servents
Response to a ping
IP address and port# of responding servent;
number and total kb of files shared
Search request
Minimum network bandwidth of responding
servent; search criteria
Returned by servents that have
IP address, port# and network bandwidth of
the requested file
responding servent; number of results and
result set
File download requests for
Servent identifier; index of requested file; IP
servents behind a firewall
address and port to send file to
What would you ask about
a Gnutella network?
22
Gnutella: Tools for Network Exploration
Eavesdrop traffic - insert modified node into the network and log traffic.
Crawler - connect to active nodes and use the membership protocol to discover
membership and topology.
23
Gnutella: Heterogeneity
All Peers Equal? (1)
1.5Mbps DSL
1.5Mbps DSL
56kbps Modem
1.5Mbps DSL
10Mbps LAN
1.5Mbps DSL
56kbps Modem
56kbps Modem
24
Gnutella Network Structure: Improvement
Gnutella Protocol 0.6
Two tier architectures of ultrapeers and leaves
Ultrapeers
Leaves
Data transfer (file download)
Control messages (search, join, etc)
25
Déjà vu?
Gnutella Protocol 0.6
Two tier architectures of ultrapeers and leaves
Ultrapeers
Leaves
Data transfer (file download)
Control messages (search, join, etc)
26
Gnutella: Free Riding
All Peers Equal? (2)
More than 25% of Gnutella clients
share no files; 75% share 100 files
or less
Conclusion: Gnutella has a high
percentage of free riders
If only a few individuals
contribute to the public good,
these few peers effectively act as
centralized servers.
Outcome:
Significant efforts in building
incentive-based systems
27 BitTorrent?
Adar and Huberman (Aug ’00)
Flooding in Gnutella: Loops?
Seen request already
28
Improvements of Message Flooding
Expanding Ring
start search with small TTL (e.g. TTL = 1)
if no success iteratively increase TTL (e.g. TTL = TTL +2)
k-Random Walkers
forward query to one randomly chosen neighbor only, with large TTL
start k random walkers
random walker periodically checks
with requester whether to continue
Experiences (from simulation)
adaptive TTL is useful
message duplication should be avoided
flooding should be controlled at fine granularity
29
Gnutella Topology (Mis)match?
30
Gnutella: Network Size?
Explosive growth in 2001, slowly shrinking thereafter
•
High user interest
– Users tolerate high latency, low
quality results
•
Better resources
– DSL and cable modem nodes grew
from 24% to 41% over first 6
months.
31
Is Gnutella a Power-Law Network?
Power-law networks: the number of links per node
follows a power-law distribution
N = L-k
Num. of nodes (log scale)
10000
Examples of power-law networks:
November 2000
1000
–
–
–
–
–
100
10
The Internet at AS level
In/out links to/from HTML pages
Airports
US power grid
Social networks
1
1
10
100
Number of links (log scale)
Implications: High tolerance to random node failure but low reliability
32
when facing of an ‘intelligent’ adversary
Network Resilience
Partial Topology
Random 30% die
Targeted 4% die
from Saroiu et al., MMCN 2002
33
Is Gnutella a Power-Law Network?
(Later Data)
Later, larger networks display a bimodal distribution
Implications:
– High tolerance to random node failures preserved
– Increased reliability when facing an attack.
Number of nodes
(log scale)
10000
May 2001
1000
From Ripeanu, Iamnitchi, Foster, 2002
34
100
10
1
1
10
100
Number of links (log scale)
Discussion on Unstructured P2P
Networks
Performance
Search latency: low (graph properties)
Message Bandwidth: high
improvements through random walkers, but essentially the
35
whole network needs to be explored
Storage cost: low (only local neighborhood)
Update cost: low (only local updates)
Resilience to failures good: multiple paths are explored and data
is replicated
Qualitative Criteria
search predicates: very flexible, any predicate is possible
global knowledge: none required
peer autonomy: high
BitTorrent
36
BitTorrent Components
Torrent File
Metadata of file to be shared
Address of tracker
List of pieces and their checksums
Tracker
Lists peers interested in the distribution of the file
Peers
Clients interested in the distribution of the file
Can be “seeds” or “leachers”
37
A BitTorrent Swarm
• A “seed” node has the file
• A “tracker” associated with the file
• A “.torrent” meta-file is built for the file:
identifies the address of the tracker node
• The .torrent file is published on web
• File is split into fixed-size segments (e.g.,
256KB)
38
Choking Algorithm
Each connected peer is in one of two states
Choked: Download requests by a choked peer are ignored
Unchoked: Download requests by an unchoked peer are honored
Choking occurs at the peer level
Each peer has a certain number of unchoke slots
4 regular unchoke slots (per BitTorrent standard)
1 optimistic unchoke slot (per BitTorrent standard)
Choking Algorithm
Peers unchoke connected peers with best service rate
Service rate = rolling 20 second average of its upload bandwidth
Optimistically unchoking peers prevents a static set of unchoked
39
peers
The choking algorithm runs every 10 seconds
Peers optimistically unchoked every 30 seconds
New peers are 3 times more likely to be optimistically unchoked
Piece Selection
Random First Piece
Piece download at random
Algorithm used by new peers
Rarest Piece First
Ensures > 1 distributed copies of a piece
Increases interest of connected peers
Increases scalability
Random Piece vs. Rarest Piece
Rarest has probabilistically high download time
New peers want to reduce download time but also increase their
interest
40
BitTorrent: Overview
Join: nothing
Just find that there is a community ready to host your tracker
Publish: Create tracker, upload .torrent metadata file
Search:
For file: nothing
the community is supposed to provide search tools
For segments: exchange segment IDs maps with other peers.
Fetch: exchange segments with other peers (HTTP)
41
Gnutella vs. BitTorrent: Discussion
Architecture
Decentralization?
System properties
Reliability?
Scalability?
Fairness?
Overheads?
Quality of Service
Search coverage for content?
Ability to download content fast?
42