
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 12, NO. 2, APRIL 2004
Analyzing Peer-To-Peer Traffic
Across Large Networks
Subhabrata Sen, Member, IEEE, and Jia Wang, Member, IEEE
Group members: 李英宗 (d96725004)
林慶和 (d95725005)
June 15, 2009
Authors
Subhabrata Sen received the B.Eng. degree in
computer science from Jadavpur University, India, in
1992, and the M.S. and Ph.D. degrees in computer
science from the University of Massachusetts,
Amherst, in 1997 and 2001, respectively.
Jia Wang received the B.S. degree in computer
science from the State University of New York,
Binghamton, in 1996, and the M.S. and Ph.D. degrees
in computer science from Cornell University, Ithaca,
NY, in 1999 and 2001, respectively.
Both are currently members of the Internet and Networking Systems Research
Center at AT&T Labs–Research in Florham Park, NJ. Their research interests include
network measurement, routing and topology analysis, traffic flow measurement, overlay
networks and applications, network security and anomaly detection, Web performance,
content distribution networks, and other Internet-related research. Dr. Sen and
Dr. Wang are members of the Association for Computing Machinery (ACM).
ACN 2009
Introduction
• Motivation & Goals
  - P2P applications are widely used for distributed file sharing.
  - Their large and growing traffic volume impacts the underlying network.
  - Goal: characterize P2P behavior to understand how these systems impact the network,
    and to gain insights into developing P2P systems with superior performance.
• Previous research
  - Focused almost exclusively on P2P signaling traffic
  - Set up P2P crawlers on the Internet, using an “active probing” approach
• Early version
  - Based on data from edge networks
  - Provides a view of local P2P usage
• This work provides a complementary “backbone view”
  - From a large tier-1 ISP
  - Data gathered at multiple border routers across the ISP
Outline
Methodology
Characterization Metrics
View and Analysis results
P2P vs Web
Methodology
• Popular P2P Applications
  - Three systems: Gnutella, FastTrack, DirectConnect
  - All decentralized and self-organizing
  - Data and index information distributed over peers
  - Transient peer membership
• Measurement Approach
  - Large-scale passive measurement
  - Flow-level data gathered from routers across a large tier-1 ISP’s backbone
  - Analyzes both signaling and data traffic
  - Three levels of granularity: IP address, network prefix, autonomous system (AS)
  - Data collected using Cisco’s NetFlow
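The three levels of granularity can be illustrated with a minimal Python sketch that rolls per-IP byte counts up to prefixes and ASes using a longest-prefix match. The routing table, the AS numbers, and the `(src_ip, byte_count)` record shape are hypothetical and chosen only for illustration; they are not the authors' data or tooling.

```python
import ipaddress
from collections import defaultdict

# Hypothetical routing table: prefix -> origin AS (documentation ranges, not real data).
ROUTES = {
    "198.51.100.0/24": 64500,
    "203.0.113.0/24": 64501,
    "203.0.113.128/25": 64502,
}

# Sort so that the longest (most specific) prefixes are tried first.
_NETS = sorted(
    ((ipaddress.ip_network(p), asn) for p, asn in ROUTES.items()),
    key=lambda item: item[0].prefixlen,
    reverse=True,
)


def lookup(ip):
    """Return (prefix, asn) for the longest matching prefix, or None if no match."""
    addr = ipaddress.ip_address(ip)
    for net, asn in _NETS:
        if addr in net:
            return str(net), asn
    return None


def aggregate(flows):
    """Sum bytes per source IP, per prefix, and per AS.

    flows: iterable of (src_ip, byte_count) pairs (assumed record shape).
    """
    by_ip = defaultdict(int)
    by_prefix = defaultdict(int)
    by_as = defaultdict(int)
    for src_ip, nbytes in flows:
        match = lookup(src_ip)
        if match is None:
            continue  # no matching prefix in the routing table
        prefix, asn = match
        by_ip[src_ip] += nbytes
        by_prefix[prefix] += nbytes
        by_as[asn] += nbytes
    return dict(by_ip), dict(by_prefix), dict(by_as)
```

A linear scan is enough for a toy table; a real routing table would call for a trie or similar structure.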
Methodology
• Advantages
  - Requires little knowledge of the P2P protocols: only port numbers
  - Non-intrusive measurement
  - Easier than running a crawler
  - More complete view of P2P traffic
  - Allows localized analysis
• Limitations
  - Flow-level data only; no application-level details
  - May not capture the complete flow
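Port-based identification can be sketched as follows. The port numbers are the commonly cited default ports for the three systems and are an assumption here, since the slides do not list them; peers configured to use other ports would be missed, which is consistent with the "may not capture the complete flow" limitation.

```python
# Assumed default TCP ports (not listed on the slides); applications can be
# reconfigured to use other ports, in which case this lookup will miss them.
P2P_PORTS = {
    1214: "FastTrack",
    6346: "Gnutella",
    6347: "Gnutella",
    411: "DirectConnect",
    412: "DirectConnect",
}


def classify_flow(src_port, dst_port):
    """Return the P2P system name if either endpoint uses a known port, else None."""
    for port in (src_port, dst_port):
        if port in P2P_PORTS:
            return P2P_PORTS[port]
    return None
```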

Characterization Metrics
• Characterization
  - Topology: host distribution, application-level overlay
  - Traffic distribution: downstream and upstream
  - Dynamic behavior: how frequently hosts join and leave the system, how long a host stays

Characterization Metrics
• Metrics
  - Host distribution
  - Traffic volume
  - Host connectivity
  - Traffic pattern over time
  - Connection duration and on-time
• Data cleaning
  - Invalid IP addresses: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
  - No matching prefix in the routing tables
  - Invalid AS numbers (> 64512)
  - Removes 4% of flow records
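The cleaning rules above can be sketched as a record filter. The record layout (a dict with `src_ip`, `dst_ip`, `src_prefix`, `dst_prefix`, `src_as`, `dst_as`) is a hypothetical shape chosen for illustration, and the AS cutoff simply mirrors the "> 64512" rule on the slide.

```python
import ipaddress

# Private address ranges (RFC 1918) that should not appear on a backbone.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
MAX_VALID_ASN = 64512  # AS numbers above this are treated as invalid, per the slide


def is_valid_ip(ip):
    """True if the address is outside the private ranges listed above."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in PRIVATE_NETS)


def keep_record(rec):
    """Apply the three cleaning rules to one flow record (hypothetical dict layout)."""
    if not (is_valid_ip(rec["src_ip"]) and is_valid_ip(rec["dst_ip"])):
        return False
    if rec["src_prefix"] is None or rec["dst_prefix"] is None:
        return False  # no matching prefix in the routing tables
    if rec["src_as"] > MAX_VALID_ASN or rec["dst_as"] > MAX_VALID_ASN:
        return False
    return True


def clean(records):
    """Return only the records that pass every rule."""
    return [rec for rec in records if keep_record(rec)]
```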
Overview of P2P traffic
About 800 million flow records in total
TABLE I Netflow DATA SET OF P2P TRAFFIC OVER TCP
Host distribution
Fig. 2. Host density: the distribution of the hosts participating in three
P2P systems per day (y-axis is in logscale).
Traffic volume distribution
• Significant skew in traffic volume at all granularities
• A few entities source or receive most of the traffic
Fig. 3. Cumulative distribution of traffic volume associated with IP
addresses ranked in decreasing order of volume, for September 14, 2001
(x-axis is in logscale). Aggregate traffic observed for FastTrack on this day
was 960 GB.
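A cumulative distribution over entities ranked by decreasing volume, as plotted in Fig. 3, can be computed along these lines. This is a minimal sketch; the function names and the plain list of per-host volumes are assumptions for illustration.

```python
def cumulative_distribution(volumes):
    """Cumulative fraction of total volume, with entities ranked by decreasing volume."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    if total == 0:
        return [0.0 for _ in ranked]
    out = []
    running = 0
    for v in ranked:
        running += v
        out.append(running / total)
    return out


def top_share(volumes, fraction):
    """Share of total volume contributed by the top `fraction` of entities."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    k = max(1, int(len(ranked) * fraction))  # always include at least one entity
    return sum(ranked[:k]) / total
```

The same functions apply unchanged at the prefix and AS levels once volumes have been aggregated to those granularities.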
Host connectivity
• Connectivity is very low for most hosts and very high for a few hosts
• The distribution is less skewed at the prefix and AS levels
Fig. 5. Cumulative distribution of network connectivity at the IP and network
prefix (PR) levels, for hosts participating in FastTrack on September 14, 2001.
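The slides do not define connectivity precisely; the sketch below assumes it means the number of distinct peers an entity exchanges traffic with, counted from `(src, dst)` flow endpoints. Passing prefixes or AS numbers instead of IP addresses gives the coarser-grained counts.

```python
from collections import defaultdict


def connectivity(flows):
    """Map each endpoint to the number of distinct peers it communicates with.

    flows: iterable of (src, dst) pairs; repeated pairs are counted once.
    Assumes connectivity = count of unique peers, which is an interpretation.
    """
    peers = defaultdict(set)
    for src, dst in flows:
        peers[src].add(dst)
        peers[dst].add(src)
    return {host: len(others) for host, others in peers.items()}
```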
Time of day effect
Fig. 6. Distribution of number of IP addresses and traffic volume across hours in
FastTrack on September 14, 2001 (GMT). (a) The traffic volume transferred in each bin. (b)
The number of unique IP addresses, network prefixes, and ASes that are active in each bin.
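Hourly binning of the kind shown in Fig. 6 can be sketched as below, assuming each flow record carries a Unix timestamp, two endpoint addresses, and a byte count. That record shape is hypothetical; hours are computed in GMT/UTC to match the figure.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_bins(flows):
    """Bin flows by hour of day (UTC).

    flows: iterable of (unix_ts, src_ip, dst_ip, nbytes) tuples (assumed shape).
    Returns (bytes_per_hour, unique_ips_per_hour), both keyed by hour 0-23.
    """
    volume = defaultdict(int)
    hosts = defaultdict(set)
    for ts, src, dst, nbytes in flows:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        volume[hour] += nbytes
        hosts[hour].update((src, dst))
    return dict(volume), {h: len(ips) for h, ips in hosts.items()}
```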
Host connection duration & on-time
• Substantial transience: most hosts stay in the system for a short time
• The distribution is less skewed at the prefix and AS levels
FastTrack (September 14, 2001), threshold = 30 min
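The slide gives only the threshold value, so the sketch below rests on an assumed reading: a host is treated as continuously present as long as the idle gap between consecutive flows does not exceed the threshold, and on-time is the total length of the resulting sessions. The function name and the `(start, end)` interval format are illustrative.

```python
def on_time(intervals, threshold=30 * 60):
    """Total on-time in seconds for one host.

    intervals: list of (start, end) flow times in seconds.
    Flows separated by an idle gap of at most `threshold` seconds are merged
    into one session (the gap counts as on-time); longer gaps end the session.
    This merging rule is an assumption, not a definition taken from the slides.
    """
    if not intervals:
        return 0
    ordered = sorted(intervals)
    total = 0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start - cur_end <= threshold:
            cur_end = max(cur_end, end)  # extend the current session
        else:
            total += cur_end - cur_start  # close the session and start a new one
            cur_start, cur_end = start, end
    total += cur_end - cur_start
    return total
```

A larger threshold merges more flows into one session, so the reported on-times grow with the threshold.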
Mean bandwidth usage
Upstream bandwidth is lower than downstream, e.g., because of ADSL links and rate limiting
Fig. 9. Cumulative distribution of the mean upstream and downstream bandwidth
usage of hosts participating in FastTrack and DirectConnect on September 14, 2001
(x-axis is in logscale). (a) FastTrack. (b) DirectConnect.
Traffic Characterization
• P2P traffic does not fit power-law distributions well.
• Relationships between measures
  - Traffic volume
  - Number of IP addresses
  - On-time
  - Mean bandwidth usage

The power laws
Fig. 10. Rank-frequency plots of the P2P metrics for FastTrack on September 14, 2001: (a)
overall host connectivity; (b) host connectivity for the top 10% IP addresses; (c) traffic volume of
the top 10% IP addresses; (d) on-time of the top 10% IP addresses (both x-axis and y-axis are
labeled in logscale).
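One common way to check a rank-frequency plot against a power law is to fit a straight line in log-log space and inspect the goodness of fit. The sketch below uses an ordinary least-squares fit; this is an illustrative technique and not necessarily the test the authors applied.

```python
import math


def loglog_fit(values):
    """Least-squares line through (log10 rank, log10 value).

    Values are ranked in decreasing order; non-positive values are dropped.
    Returns (slope, intercept, r_squared). A power law would give a near-straight
    line, i.e. r_squared close to 1.
    """
    ranked = sorted((v for v in values if v > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive values")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(v) for v in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

A high r-squared alone does not prove a power law, so a visual check of the plot remains useful.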
Relationships: Traffic volume vs. on-time, connectivity, and bandwidth
• Volume heavy hitters are likely to have long on-times; hosts with short
  on-times contribute small traffic volumes.
• A host communicating with many others may transmit only a small amount of
  traffic; a host communicating with few others can still source significant traffic.
• Volume heavy hitters are likely to have high bandwidths; hosts with low
  bandwidths contribute small traffic volumes.
Traffic volume vs. on-time, connectivity, and bandwidth
Fig. 11. FastTrack data set for September 14, 2001: top 1% IP addresses ranked by volume of data sent
out. Scatter plots (log-log scale): (a) upstream volume versus upstream on-time; (b) upstream volume
versus number of unique upstream IP addresses that an IP address connects to; (c) upstream volume
versus average upstream bandwidth of an IP address.
Connectivity, on-time, and bandwidth
• Hosts with high connectivity have long on-times; hosts with short on-times
  communicate with few other hosts.
• Hosts with high upstream bandwidths have low connectivity counts; hosts that
  send traffic to many others span a range of bandwidths, but none of them
  reaches the highest bandwidths.
• Hosts with low upstream bandwidths can have very long on-times (possibly
  large file transfers or SuperNodes).
Connectivity, on-time, and bandwidth
Fig. 12. FastTrack data set for September 14, 2001—top 1% IP addresses ranked by volume of data
sent out. Scatter plots (log-log scale): (a) number of unique upstream IP addresses that a host connects
to versus total upstream on-time of the IP address; (b) number of unique upstream IP addresses versus
average upstream bandwidth; (c) average upstream bandwidth versus total upstream on-time.
P2P vs Web
• 97% of prefixes contributing P2P traffic also contribute Web traffic.
• Heavy-hitter prefixes for P2P traffic tend to be heavy hitters for Web traffic as well.
• P2P traffic contributed by the top heavy-hitter prefixes is more stable than
  either Web or total traffic.
• The top 0.01%, 0.1%, 1%, and 10% heavy hitters contribute 10%, 30%, 50%, and 90%
  of the traffic volume, respectively.
P2P vs Web
Fig. 13. Cumulative distribution of the traffic volume changes for top heavy
hitter prefixes. (a) Top 0.01% prefixes. (b) Top 1% prefixes.
Summary
• The analysis covers both signaling and data traffic.
  - Complements previous work on Gnutella.
• Significant increase in both traffic volume and number of users.
• The traffic volume generated by individual hosts is extremely variable.
  - Fewer than 10% of IP addresses account for 99% of the traffic volume.
• Traffic distributions are extremely skewed.
  - This holds for traffic volume, connectivity, on-time, and average bandwidth usage.
  - However, they do not strictly obey power laws.

Summary
• All three P2P systems exhibit a high level of system dynamics.
  - Only a small fraction of hosts are persistent over long time periods.
• P2P is a significant but stable component of Internet traffic.
  - More stable than Web traffic or overall traffic
  - Application-specific layer-3 traffic engineering is a promising way to
    manage the P2P workload in an ISP’s network.