A Measurement Study of Peer-to-Peer
File Sharing Systems
Presented by
Cristina Abad
Motivation
In a P2P file-sharing system, peers are
usually at the "edge" of the network
Does this affect/limit the quality of the
infrastructure?
What are the characteristics of hosts that
choose to participate?
Solution: Measure Gnutella and Napster
traffic to help understand these issues
Napster
Gnutella
Methodology
Crawler periodically takes “snapshot” of
Napster/Gnutella
– capture basic info (peers, files shared, …)
For peers discovered
– measure bottleneck bandwidth
– measure latency
– track content and degree of sharing
Measure lifetime
– track availability of peers (at P2P and IP level)
Crawling Napster
Peers can only be discovered by querying
index
Crawler issues queries with names of
popular song artists
Query responses contain
– IP, reported bandwidth, files shared (number,
names and sizes)
Results:
– Captured 40-60% of Napster hosts
(contributing to 80-95% of total files)
– Could not capture peers that do not share files
Crawling Gnutella
Crawler uses ping/pong to discover peers
Each crawl captured approx. 10,000 peers
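As a rough illustration of what ping/pong discovery involves, the sketch below speaks the original Gnutella 0.4 protocol of the paper's era. It is not the authors' crawler: the host and port are placeholders, the Pong field layout follows the public 0.4 specification, and modern servents use a different (0.6) handshake.

```python
# Rough sketch of ping/pong peer discovery over the Gnutella 0.4 protocol.
# HOST/PORT are placeholders; this only illustrates the mechanism.
import os
import socket
import struct

HOST, PORT = "192.0.2.10", 6346   # hypothetical Gnutella servent
PING, PONG = 0x00, 0x01

def descriptor(payload_type, payload=b"", ttl=4):
    # 16-byte GUID, payload type, TTL, hops, little-endian payload length.
    return os.urandom(16) + struct.pack("<BBBI", payload_type, ttl, 0, len(payload)) + payload

def read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def crawl_once(host, port, max_descriptors=50):
    """Handshake, send one Ping, and harvest peer info from returned Pongs."""
    peers = []
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(b"GNUTELLA CONNECT/0.4\n\n")
        if b"GNUTELLA OK" not in sock.recv(128):
            return peers
        sock.sendall(descriptor(PING))
        try:
            for _ in range(max_descriptors):
                header = read_exact(sock, 23)
                ptype, _ttl, _hops, length = struct.unpack("<BBBI", header[16:])
                payload = read_exact(sock, length)
                if ptype == PONG and length >= 14:
                    peer_port = struct.unpack("<H", payload[:2])[0]
                    ip = socket.inet_ntoa(payload[2:6])
                    files, kbytes = struct.unpack("<II", payload[6:14])
                    peers.append((ip, peer_port, files, kbytes))
        except (socket.timeout, ConnectionError):
            pass
    return peers

if __name__ == "__main__":
    for ip, port, files, kbytes in crawl_once(HOST, PORT):
        print(f"{ip}:{port} shares {files} files ({kbytes} KB)")
```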
Measuring bandwidth
Reported bandwidth may not be accurate
(ignorance or lies)
Use bottleneck bandwidth as approximation
to available bandwidth
– capacity of the slowest link along the path
between two hosts
Used SProbe to actively measure both
upstream and downstream bottleneck
bandwidth
– Similar to “packet pair” technique
Packet Pair Technique
Two packets queued next to each other at the
bottleneck link exit the link t seconds apart,
where t = s2 / bbnl
– s2: size of the second packet
– bbnl: bottleneck bandwidth
Then, bbnl = s2 / t
Kevin Lai and Mary Baker. “Measuring bandwidth”. In Proceedings of IEEE INFOCOM '99. 1999.
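As a back-of-the-envelope illustration of this formula (not part of SProbe itself; the packet size and dispersion below are made-up example values):

```python
# Packet-pair estimate: bbnl = s2 / t. The inputs are invented example
# values, not measurements.

def bottleneck_bandwidth_bps(s2_bytes, t_seconds):
    """Bottleneck bandwidth in bits/s from the second packet's size (s2)
    and the measured dispersion of the pair (t)."""
    return s2_bytes * 8 / t_seconds

# A 1500-byte packet arriving 1.2 ms behind its twin implies ~10 Mbps.
print(bottleneck_bandwidth_bps(1500, 0.0012) / 1e6, "Mbps")
```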
How many peers are server-like?
High bandwidth, low latency, high availability
– 8% have upstream bottleneck bandwidth of at least 10 Mbps
Availability – Host uptimes
Availability – Session duration
Free-riders
Is Gnutella robust?
Measurement, Modeling, and
Analysis of a Peer-to-Peer
File-Sharing Workload
Presented by
Cristina Abad
Three-tiered approach
1. Analyze a 200-day trace of Kazaa traffic
– Considered only traffic going from U. Washington to the outside
2. Develop a model of multimedia workloads
– Analyze and confirm hypotheses
3. Explore the potential impact of locality awareness in Kazaa
Contributions
Obtained some useful characterizations of
Kazaa’s traffic
Showed that Kazaa’s workload is not Zipf
– Showed that other workloads (multimedia) may
not be Zipf either
Presented a model of P2P file-sharing
workloads based on their trace results
– Validated the model through simulations that
yielded results very similar to those from traces
Demonstrated the usefulness of exploiting locality-aware request routing
Measurement results
Users are patient
Users slow down as they age
Kazaa is not one workload
Kazaa clients fetch objects at-most-once
Popularity of objects is often short-lived
Kazaa is not Zipf
User characteristics (1)
Users are patient
User characteristics (2)
Users slow down as they age
– clients "die"
– older clients ask for less each time they use
the system
User characteristics (3)
Client activity
– The tracing method used could only detect users
when their clients transfer data
– Thus, they only report statistics on client
activity, which is a lower bound on availability
– Avg session lengths are typically small
(median: 2.4 mins)
• Many transactions fail
• Periods of inactivity may occur during a request if
client cannot find an available server with the object
Object characteristics (1)
Kazaa is not one workload
Object characteristics (2)
Kazaa object dynamics
– Kazaa clients fetch objects at most once
– Popularity of objects is often short-lived
– Most popular objects tend to be recently born
objects
– Most requests are for old objects
Object characteristics (3)
Kazaa is not Zipf
Web access patterns are Zipf: a small number of
objects are extremely popular, but there is a long
tail of unpopular requests
Zipf's law: the popularity of the i-th most popular object is
proportional to i^(-α) (α: Zipf coefficient)
A Zipf distribution looks linear on a log-log scale
Model of P2P file-sharing
workloads
On average, a client requests 2 objects/day
P(x): probability that a user requests an
object of popularity rank x; follows Zipf(1)
– Adjusted so that each client requests a given
object at most once
A(x): probability that a newly arrived object
is inserted at popularity rank x; follows Zipf(1)
All objects are assumed to have the same size
Use caching to observe performance
changes (effectiveness measured as cache hit rate)
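A toy simulation of this model might look like the sketch below. The object count, client population, arrival rate, and trace length are arbitrary placeholders rather than the paper's parameters; the point is only to show the fetch-at-most-once adjustment and the shared cache used to observe hit rate.

```python
# Toy sketch of the fetch-at-most-once workload model. All sizes and rates
# below are arbitrary placeholders, not the paper's parameters.
import numpy as np

rng = np.random.default_rng(0)

N_OBJECTS, N_CLIENTS, DAYS = 2000, 100, 100
REQS_PER_CLIENT_PER_DAY = 2      # the model's average request rate
ARRIVALS_PER_DAY = 15            # arbitrary new-object arrival rate for this sketch

objects = list(range(N_OBJECTS))            # position in this list = popularity rank
fetched = [set() for _ in range(N_CLIENTS)] # objects each client has already fetched
cache, hits, requests = set(), 0, 0         # unbounded shared cache to observe hit rate
next_id = N_OBJECTS

for day in range(DAYS):
    weights = 1.0 / np.arange(1, len(objects) + 1)   # Zipf(1) weights over ranks
    probs = weights / weights.sum()
    for client in range(N_CLIENTS):
        for _ in range(REQS_PER_CLIENT_PER_DAY):
            # Zipf(1) draw, adjusted so a client fetches a given object at most once.
            for _ in range(50):
                obj = objects[rng.choice(len(probs), p=probs)]
                if obj not in fetched[client]:
                    fetched[client].add(obj)
                    requests += 1
                    hits += obj in cache
                    cache.add(obj)
                    break
    # New objects arrive and are inserted at a popularity rank drawn from Zipf(1).
    for _ in range(ARRIVALS_PER_DAY):
        objects.insert(int(rng.choice(len(probs), p=probs)), next_id)
        next_id += 1
    if (day + 1) % 25 == 0:
        print(f"day {day + 1}: cumulative cache hit rate = {hits / requests:.2f}")
```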
Model – Simulation results
File-sharing effectiveness diminishes with
client age
– System evolves towards one with no locality
and objects chosen at random from large space
New object arrivals improve performance
– Arrivals replenish supply of popular objects
New clients cannot stabilize performance
– Can’t compensate for increasing number of old
clients
– Overall bandwidth increases in proportion to
population size
Model validation
By tweaking the arrival rate of new
objects, the authors were able to match the trace
results (with 5475 new arrivals per year)
Exploring locality-awareness
Currently, organizations shape or filter P2P
traffic
Alternative strategy: exploit locality in the file-sharing workload
– Caching; or,
– Use content available within the organization to
substantially decrease external bandwidth usage
– Result: 86% of externally downloaded bytes
could be avoided by using an organizational
proxy
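The arithmetic behind this result can be sketched as follows: with an ideal organizational proxy, only the first download of each object has to leave the organization. The request log below is invented for illustration; it is not data from the trace.

```python
# Back-of-the-envelope sketch of the "organizational proxy" idea: with an
# ideal (infinite) cache, only the first external download of each object
# costs external bandwidth. The request log is made up for illustration.
from collections import namedtuple

Request = namedtuple("Request", "client object_id size_bytes")

requests = [
    Request("host-a", "video-1", 700_000_000),
    Request("host-b", "video-1", 700_000_000),
    Request("host-c", "song-42", 5_000_000),
    Request("host-a", "song-42", 5_000_000),
    Request("host-b", "video-2", 700_000_000),
]

total_bytes = sum(r.size_bytes for r in requests)
seen, external_bytes = set(), 0
for r in requests:
    if r.object_id not in seen:       # proxy miss: fetched from outside once
        seen.add(r.object_id)
        external_bytes += r.size_bytes

saved = 1 - external_bytes / total_bytes
print(f"externally downloaded bytes avoided: {saved:.0%}")
```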
Questions?
Analysis
How can the results obtained be used when
evaluating P2P schemes?
Are any of the measurements obtained
biased?
Peers are heterogeneous
– Incentives
– Enforcement (e.g. super-peers in Kazaa)
SProbe
Works in uncooperative environments
Works on asymmetric network paths
Exploits properties of the TCP protocol
– Send SYN packets with large payloads; then
measure the time dispersion of the received RST
packets
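A hedged sketch of that SYN/RST packet-pair trick, built with scapy rather than the authors' SProbe implementation; the target host, port, and payload size are placeholders, and sending raw packets requires root privileges.

```python
# Sketch of the SYN/RST packet-pair trick (not the SProbe tool itself).
# TARGET, CLOSED_PORT, and the payload size are placeholders.
from scapy.all import IP, TCP, Raw, sr

TARGET = "192.0.2.1"        # hypothetical remote peer
CLOSED_PORT = 33434         # hypothetical port assumed not to be listening
PAYLOAD = b"\x00" * 1400    # large payload so the pair spreads out at the bottleneck

# Two back-to-back SYN packets carrying large payloads. A host that is not
# listening on CLOSED_PORT answers each with a small RST, so the dispersion
# of the RSTs approximates the dispersion created on the forward path.
pair = [IP(dst=TARGET) / TCP(dport=CLOSED_PORT, sport=40000 + i, flags="S") / Raw(load=PAYLOAD)
        for i in range(2)]

answered, _ = sr(pair, timeout=3, verbose=False)
if len(answered) == 2:
    rst_times = sorted(reply.time for _, reply in answered)
    dispersion = rst_times[1] - rst_times[0]
    if dispersion > 0:
        print(f"estimated bottleneck bandwidth: {len(PAYLOAD) * 8 / dispersion / 1e6:.1f} Mbps")
else:
    print("did not receive both RSTs; cannot estimate")
```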
Zipf
Linguist George Kingsley Zipf observed that for
many frequency distributions, the n-th largest
frequency is proportional to a negative power of
the rank order n
"Zipf's law" is also sometimes used to refer to the
corresponding probability distribution
It is an instance of a power law
Zipf's law is often demonstrated by plotting the
data, with the axes being log(rank order) and
log(frequency). If the points are close to a single
straight line, the distribution follows Zipf's law.
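One way to make this check concrete is to fit a line to log(frequency) versus log(rank). The sketch below uses synthetic counts generated to follow Zipf with α = 1, so the data and the fitted coefficient are purely illustrative.

```python
# Fit log(frequency) vs. log(rank) to read off the Zipf coefficient alpha.
# The request counts are synthetic; with real data, substitute per-object
# request counts sorted by popularity.
import numpy as np

rng = np.random.default_rng(1)

ranks = np.arange(1, 1001)
# Expected count of the i-th most popular object is proportional to 1/i.
counts = np.sort(rng.poisson(100_000 / ranks))[::-1] + 1

slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"fitted Zipf coefficient alpha ~ {-slope:.2f}")
# A Zipf workload yields a roughly straight line on the log-log plot; the
# paper's Kazaa trace instead flattens out at the most popular ranks because
# clients fetch each object at most once.
```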