Transcript Group 9

"A Measurement Study of
Peer-to-Peer File Sharing
Systems"
Stefan Saroiu, P. Krishna Gummadi Steven D. Gribble,
"A Measurement Study of Peer-to-Peer File Sharing
Systems", Proceedings of the Multimedia Computing
and Networking (MMCN), San Jose, January, 2002.
Peer-to-Peer
• Membership
  – Ad-hoc
  – Dynamic
• Goals
  – Design an architecture that encourages cooperation
• Examples
  – Napster
  – Gnutella
  – BitTorrent
Locating Files (Napster)
• Server maintains an index of the files held by connected peers
• A peer queries the server for a file
• The server returns peers that have the file (and may also query other servers)
• The file is transferred over a direct link between the two peers (a minimal sketch follows)
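The central-index flow above can be captured in a minimal sketch. The IndexServer class and its register/query methods are hypothetical stand-ins, not the actual Napster protocol:

```python
class IndexServer:
    """Toy central index: filename -> set of peer addresses."""

    def __init__(self):
        self.index = {}

    def register(self, peer, filenames):
        # peers report their shared files when they connect
        for name in filenames:
            self.index.setdefault(name, set()).add(peer)

    def unregister(self, peer):
        # drop a departing peer from every entry
        for peers in self.index.values():
            peers.discard(peer)

    def query(self, filename):
        # the server only answers the lookup; the transfer itself
        # happens over a direct peer-to-peer connection
        return sorted(self.index.get(filename, ()))


server = IndexServer()
server.register("10.0.0.1:6699", ["song.mp3", "talk.mp3"])
server.register("10.0.0.2:6699", ["song.mp3"])
print(server.query("song.mp3"))  # ['10.0.0.1:6699', '10.0.0.2:6699']
```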
Locating Files (Gnutella)
• Queries are performed by flooding the network
• Responses are returned only if the peer has the file; queries are forwarded to neighbors either way
• Files are downloaded via a direct link between the two peers (a toy flood is sketched below)
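A toy rendition of the TTL-scoped flood, assuming a static adjacency-list graph; real Gnutella deduplicates repeated queries by message GUID, which the seen set approximates here:

```python
from collections import deque

def flood_query(graph, origin, have_file, ttl):
    """Return peers that answered, plus the number of messages sent."""
    seen, hits, messages = {origin}, [], 0
    frontier = deque([(origin, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if have_file(node):
            hits.append(node)      # respond only if this peer has the file
        if t == 0:
            continue
        for nbr in graph[node]:
            messages += 1          # forwarded to every neighbor, duplicates included
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, t - 1))
    return hits, messages

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(flood_query(graph, 0, lambda n: n == 3, ttl=2))  # ([3], 7)
```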
Collecting Metrics (Napster)
• Napster crawler
  – Queries for “popular” files over multiple simultaneous connections and maintains the list of peers returned by the server
  – For each peer, collects metadata from the server:
    • Peer's reported bandwidth
    • Number of files shared
    • Current number of uploads/downloads
    • Names and sizes of all shared files
    • IP address
Collecting Metrics (Gnutella)
• Gnutella crawler (sketched below)
  – Connects to popular peers
  – Sends ping messages with large TTLs to known peers
  – Adds newly discovered peers based on pong messages
  – Pong messages include metadata about the peer:
    • Number of files
    • Total size of files
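A rough sketch of that crawl loop; send_ping is a stand-in for the real ping/pong exchange and simply reads a static topology and fabricates the pong metadata:

```python
from collections import deque

def send_ping(topology, peer):
    """Stand-in for the network exchange: the 'pong' carries the peer's
    neighbor list plus fake metadata (file count, total bytes)."""
    nbrs = topology[peer]
    return nbrs, len(nbrs) * 10, len(nbrs) * 10_000

def crawl(topology, seeds):
    """BFS over the overlay: every pong reveals new peers to contact."""
    discovered = set(seeds)
    frontier = deque(seeds)
    metadata = {}
    while frontier:
        peer = frontier.popleft()
        neighbors, num_files, total_size = send_ping(topology, peer)
        metadata[peer] = {"files": num_files, "bytes": total_size}
        for nbr in neighbors:
            if nbr not in discovered:
                discovered.add(nbr)
                frontier.append(nbr)
    return discovered, metadata

topology = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
print(crawl(topology, ["a"])[0])  # {'a', 'b', 'c', 'd'}
```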
Metadata collected
• Bottleneck bandwidth
• Latency
• Number of shared files
• Lifetime/uptime
• Distribution across DNS domains
• Approximately 60% of peers have a session duration shorter than 1 hour
• Napster peers are up a larger percentage of the time than Gnutella peers
  – Gnutella: 20% of hosts are available >45% of the time
  – Napster: 20% of hosts are available >80% of the time
• Approximately 30% of Napster peers misreport their bandwidth (modems + ISDN, i.e., <64 Kbps)
• The authors argue that misreporting is one indication that many users in a P2P system are not willing to cooperate.

• Gnutella
  – Approximately 25% of peers share no files
  – 75% of peers share fewer than 100 files
  – 7% of peers share more than 1000 files (more than the other 93% combined)
• Napster
  – 40%-60% of users share only 5-20% of files
• (a) Gnutella network of 1771 peers
• (b) 30% of peers removed at random: 1106 of the remaining 1300 nodes are still connected
• (c) 4% of peers removed selectively (the 63 best-connected peers): the network becomes highly fragmented (a rough reproduction is sketched below)
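A quick way to reproduce the flavor of this experiment with networkx, using a synthetic preferential-attachment graph in place of the measured 1771-peer snapshot, so the exact counts will differ:

```python
import random
import networkx as nx

G = nx.barabasi_albert_graph(1771, 2, seed=1)  # skewed-degree stand-in

def largest_component(g):
    return max(len(c) for c in nx.connected_components(g))

# (b) remove 30% of peers at random
Gr = G.copy()
Gr.remove_nodes_from(random.sample(list(Gr), int(0.30 * len(Gr))))
print("random removal:  ", largest_component(Gr), "of", len(Gr), "still connected")

# (c) remove the ~4% best-connected peers
Gt = G.copy()
top = sorted(Gt.degree, key=lambda nd: nd[1], reverse=True)[: int(0.04 * len(Gt))]
Gt.remove_nodes_from(n for n, _ in top)
print("targeted removal:", largest_component(Gt), "of", len(Gt), "still connected")
```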
"Characterizing Unstructured
Overlay Topologies in Modern
P2P File-Sharing Systems"
Daniel Stutzbach, Reza Rejaie, and Subhabrata Sen,
"Characterizing Unstructured Overlay Topologies in
Modern P2P File-Sharing Systems," Networking, 2008.
Gnutella Characteristics in Depth
• Previous studies, as characterized in the paper:
  – Lack accuracy in their captured snapshots
  – The crawlers used were not fast enough
    • resulting in distorted snapshots
    • and a partial view of the topology
  – Simulations rest on invalid assumptions (e.g., a power-law degree distribution)
  – "Finally, to our knowledge, the dynamics of unstructured P2P overlay topologies have not been studied in detail in any prior work."
Cruiser: Fast and Accurate Crawler
• "Cruiser can accurately capture a complete snapshot of the Gnutella network with more than one million peers in just a few minutes."
• Claimed to be faster than any crawler built before it.
• Enabled capturing the overlay dynamics, which led to a more accurate characterization.
• Data set: 18,000 snapshots captured over an 11-month period, at weekly intervals plus daily random captures.
Some of the Findings
• As mentioned, the study refutes the power-law distribution of node degree.
• Onion-like architecture.
• LimeWire vs. BearShare.
• Reachability: 30-38% of peers are unreachable. Previous studies ignored this factor, which forms a non-negligible fraction of the peers.
Two-Tier Topology
[Figure: ultrapeers form the top-level overlay; leaves attach to ultrapeers below it]
Fast Crawler vs. Slow Crawler
• Why is the power-law degree distribution a measurement artifact?
• Slow crawler: Cruiser run with fewer concurrent connections.
  – Snapshots from the slow crawl form a two-piece power-law distribution.
Top-Level Overlay Analysis
• Accurate snapshots show the actual distribution is not the two-piece power-law; it peaks at a degree of 30.
  – Due to the preconfigured defaults of LimeWire and BearShare.
  – Peers below 30 are new peers whose degree is still approaching 30.
  – Peers with higher degree have been reconfigured, or run other implementations.
Leaves Overlay Analysis
• Maximum leaves per ultrapeer: LimeWire 30, BearShare 45, other implementations 75.
• The majority of leaves have 3 or fewer parents.
• Fewer than 0.02% have 100-3000 parents.
Reachability
• Flood-based query: new peers are discovered exponentially, up to a certain point.
• Pair-wise distance: 60% of peer pairs have a shortest-path distance of 4 (see the sketch after this list).
• Effect of the two-tier topology on leaves:
  – One parent: the distribution is similar to the ultrapeers', shifted by 2.
  – More than one parent:
    • 50% -> length of 5 (+1)
    • 50% -> length of 6 (+2)
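A sketch of how such a pair-wise distance distribution can be computed from a snapshot; a synthetic 30-regular graph stands in for a Cruiser top-level snapshot (it is far smaller than the measured network, so its typical distances come out shorter), and sources are sampled for speed:

```python
from collections import Counter
import networkx as nx

G = nx.random_regular_graph(30, 1000, seed=1)  # ~30 neighbors per ultrapeer
dist = Counter()
for src in list(G)[:100]:                      # sample sources for speed
    for d in nx.single_source_shortest_path_length(G, src).values():
        if d > 0:
            dist[d] += 1

total = sum(dist.values())
for d in sorted(dist):
    print(f"distance {d}: {dist[d] / total:.1%}")
```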
Overlay Dynamics
• The number of peers with a given uptime decreases exponentially as uptime (in hours) increases.
Overlay Dynamics-2
• What are the causes?
  – Protocol-driven: peers select their neighbors according to the protocol.
  – User-driven: peer participation (joining and leaving).
• Definition of a stable peer:
  – A peer is stable if it maintains a connection for a duration of at least t; the study uses t = 48 hours.
Internal Connectivity of Stable Core
• Stable core: peers with t >= 48 hours, considering only the connections among stable peers.
• 88-94% of the stable core remained connected (a minimal check is sketched below).
• Observations:
  – Stable-core peers are clustered together.
  – Peers with higher uptime are more biased toward establishing connections with each other.
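A minimal sketch of that connectivity check, assuming a hypothetical snapshot graph plus per-peer uptimes: induce the subgraph of stable peers and measure how much of it sits in one connected component. The random toy below lacks the real network's uptime bias, so it will come out lower than 88-94%:

```python
import random
import networkx as nx

def stable_core_connectivity(G, uptime, t=48):
    """Fraction of stable peers (uptime >= t hours) that sit in the
    largest connected component of the stable-core subgraph."""
    core = G.subgraph(n for n in G if uptime[n] >= t)
    if core.number_of_nodes() == 0:
        return 0.0
    largest = max(len(c) for c in nx.connected_components(core))
    return largest / core.number_of_nodes()

# toy usage with random uptimes (hours, mean 24)
G = nx.gnm_random_graph(2000, 10000, seed=1)
uptime = {n: random.expovariate(1 / 24) for n in G}
print(f"{stable_core_connectivity(G, uptime):.0%} of the stable core connected")
```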
External Connectivity of Stable Core
• Peers outside the stable core follow the same behavior (i.e., they are biased toward connecting with peers of equal or higher uptime).
• This behavior leads to onion-like connectivity, with the stable core at the center of the onion:
  – SC(t) < P1(t) < … < Pn-1(t) < Pn(t)
Overlay Dynamics: The Main Cause
• User-driven dynamics are the major factor behind the overlay dynamics.
"Making Gnutella-like P2P
Systems Scalable"
Y. Chawathe, S. Ratnaswamy, L. Breslau, N. Lanham,
S. Shenker, "Making Gnutella-like P2P Systems
Scalable," ACM Sigcomm 2003,Dongyu Qiu and R.
Srikant, "Modeling and performance analysis of
BitTorrent-like peer-to-peer networks", Proceedings of
ACM Sigcomm, 2004.
Review of Gnutella
• Uses an unstructured overlay network
• Distributed download AND search
• Floods queries across this overlay with a limited scope
• Notorious for poor scaling
Scaling Gnutella
• Previously proposed solution: distributed hash tables for wide-area file search
• The paper advocates maintaining the simplicity of the unstructured system, but with new mechanisms
  – proposes a solution using aspects of a system similar to KaZaA and its supernode model
    • supernodes have higher-bandwidth connectivity
    • searches are routed to supernodes, which hold pointers to peers' data
Distributed Hash Tables (structured overlay)
• Pros:
  – A lookup using a DHT requires O(log n) steps vs. Gnutella's O(n) steps
• Cons: doesn't deal well with…
  – Transience of nodes in a P2P network: DHTs require repair operations to preserve the efficiency and correctness of routing
  – Situations where keyword searches are more prevalent, and more important, than exact-match queries (see the sketch after this list)
  – Situations where most queries are for relatively well-replicated files, not single copies of files
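To see why exact-match lookups suit a DHT while keyword searches do not, here is a toy consistent-hashing ring (node names are made up, and a real DHT such as Chord would reach the owner in O(log n) routing hops rather than computing it locally):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

ring = sorted((h(f"node-{i}"), f"node-{i}") for i in range(64))
ids = [i for i, _ in ring]

def owner(key: str) -> str:
    # the first node clockwise from the key's hash is responsible for it
    i = bisect.bisect_left(ids, h(key)) % len(ring)
    return ring[i][1]

# exact-match lookups route cleanly...
print(owner("led zeppelin - iv.mp3"))
# ...but near-identical keywords hash to unrelated ring positions,
# so there is no single place to route a substring query to
print(owner("led zeppelin - iv (live).mp3"))
```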
Gia's Improvements
• All messages exchanged by clients are tagged at their origin with a globally unique identifier
• Explicitly accounts for node heterogeneity and capacity constraints
• Replaces flooding with "biased" random walks
• It is not any single one of the following components, but in fact the combination of them all, that provides Gia's large performance advantage…
• Dynamic topology adaptation:
  – Puts most nodes within short reach of high-capacity nodes, and makes sure the well-connected, high-degree nodes can actually handle the large number of queries, by calculating a "satisfaction level" (discussed later)
• Active flow control:
  – Avoids overloading by assigning flow-control tokens based on "available capacity"
  – Dropping queries is not an option in Gia, where random walks are used instead of floods
  – A node that advertises high capacity to handle incoming queries is in turn assigned more tokens for its own outgoing queries
• One-hop replication: pointers to the content offered by immediate neighbors
  – This way a node can also answer queries on behalf of its neighbors (resulting in fewer hops)
  – When a node goes offline, its information is flushed from its neighbors to maintain consistency and accuracy
• Search protocol:
  – Based on biased random walks that direct queries toward high-capacity nodes (instead of purely at random)
  – TTLs (and MAX_RESPONSES) bound the duration of the biased random walks, and book-keeping avoids redundant paths
  – (a combined sketch of the walk and one-hop replication follows)
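A sketch of how the biased walk and one-hop replication might fit together in one step loop; the field names (capacity, neighbors, index) are illustrative rather than Gia's actual data structures, and the token-based flow control is omitted:

```python
def biased_walk(peers, start, wanted, ttl=10, max_responses=3):
    """One-hop replication: each peer's 'index' also covers its neighbors'
    files, so a hit can be answered without visiting the holder itself."""
    responses, visited, node = [], set(), start
    while ttl > 0 and len(responses) < max_responses:
        entry = peers[node]
        for holder in entry["index"].get(wanted, ()):
            if holder not in responses:
                responses.append(holder)
        visited.add(node)
        candidates = [n for n in entry["neighbors"] if n not in visited]
        if not candidates:
            candidates = entry["neighbors"]  # fall back to allowing revisits
        # the bias: step to the highest-capacity candidate
        node = max(candidates, key=lambda n: peers[n]["capacity"])
        ttl -= 1
    return responses

peers = {
    "a": {"capacity": 10,  "neighbors": ["b"],      "index": {}},
    "b": {"capacity": 100, "neighbors": ["a", "c"], "index": {"x.mp3": ["c"]}},
    "c": {"capacity": 50,  "neighbors": ["b"],      "index": {"x.mp3": ["c"]}},
}
print(biased_walk(peers, "a", "x.mp3", max_responses=1))  # ['c']
```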
“Satisfaction Level”
• A measure of how close the sum of the capacities of all of a node's neighbors (normalized by their degrees) is to the node's own capacity.
• To add a new neighbor, a node randomly selects a small number of candidate entries. From these randomly chosen entries, it selects the node with the maximum capacity greater than its own; if no such candidate exists, it selects one at random. (see Algorithm 1)
• Nodes with low satisfaction levels perform topology adaptation more frequently than satisfied nodes.
I = T × K^-(1-S)

  S: satisfaction level (see Algorithm 2)
  I: adaptation interval
  T: maximum interval between adaptation iterations
  K: aggressiveness of the adaptation

• With the negative exponent, nodes with low satisfaction adapt more frequently, as stated above.
• After each interval I, if a node's S < 1.0, it attempts to add a new neighbor. Even if S = 1.0, it continues to iterate through the adaptation process, checking its satisfaction level every T seconds. (A direct transcription of the formula follows.)
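The formula can be transcribed directly; the T and K values below are arbitrary placeholders, and the neighbor list is assumed to carry each neighbor's (capacity, degree) pair:

```python
def satisfaction(capacity, neighbors):
    """neighbors: list of (neighbor_capacity, neighbor_degree) pairs."""
    total = sum(c / d for c, d in neighbors)
    return min(1.0, total / capacity)

def adaptation_interval(S, T=10.0, K=256):
    # I = T * K^-(1-S): a fully unsatisfied node (S = 0) adapts every T/K
    # seconds, while a satisfied one (S = 1) waits the maximum interval T
    return T * K ** -(1.0 - S)

S = satisfaction(capacity=100, neighbors=[(50, 5), (200, 10)])  # -> 0.3
print(S, adaptation_interval(S))
```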
Algorithms
(regarding Satisfaction)
Gia, Flood, Random Walk over Random Topology, and Supernode Models
• As the query load increases, we notice a sharp "knee" in the curves, beyond which the success rate drops sharply and delays increase rapidly. The hop-count holds steady until the knee point and then decreases.
Gia, Flood, Random Walk over Random Topology, and Supernode Models
• Compared by observing the collapse point (CP) and the hop-count before the collapse point (CP-HC)
Single Search Responses
• The aggregate system capacity of Gia is 3 to 5 orders of magnitude higher than that of Flood and Random Walk over Random Topology (RWRT).
• RWRT typically performs better than Flood, but can be about the same when there are fewer nodes, since RWRT may end up visiting practically all the nodes anyway.
• Flood and Supernode models retain low hop counts, whereas Gia may sometimes need to traverse quite a bit. RWRT, being random, has hop counts that are inversely proportional to the replication factor.
• Flood and Supernode models have a collapse point that falls as the number of nodes increases, due to the greater query load at each node.
Multiple Search Responses
• The MAX_RESPONSES parameter only affects Gia and RWRT, since Flood and Supernode flood throughout the network.
• Naturally, a higher MAX_RESPONSES leads to higher hop counts in the Gia and RWRT models. The collapse point also drops when the replication factor is low.
• Note: a search for k responses at r% replication is equivalent to one for a single answer at r/k% replication (checked in the sketch below).
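That equivalence is easy to sanity-check with a Monte Carlo model of a memoryless walk, where each hop finds an answer with probability equal to the replication factor; collecting k answers at replication r then takes about k/r hops, the same as one answer at r/k:

```python
import random

def mean_hops(r, k, trials=5000):
    """Average hops for a walk to collect k answers at replication r."""
    total = 0
    for _ in range(trials):
        hops = found = 0
        while found < k:
            hops += 1
            if random.random() < r:  # this hop lands on a replica
                found += 1
        total += hops
    return total / trials

print(mean_hops(0.01, 5))      # ~500 hops: 5 answers at 1% replication
print(mean_hops(0.01 / 5, 1))  # ~500 hops: 1 answer at 0.2% replication
```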
Node failure
• When a node leaves the network, its queued queries are resumed by the nodes that originally generated them.
• How do we make sure a query isn't lost?
  – Keep-alive messages: query responses sent back to the originator of the query act as implicit keep-alives. If no actual matches have been found yet, a dummy query response message is sent instead.
Next bottleneck: Download process?
• The authors believe that the technique of directing queries toward higher-capacity nodes helps alleviate this issue.
• To make this benefit more significant, it would be better to have the higher-capacity nodes store more files (rather than simply pointers).
• A simple solution would be to replicate popular files at low-capacity nodes onto the higher-capacity nodes.