A Quantitative Analysis of the Gnutella Network Traffic
University of California – Riverside
Department of Computer Science & Engineering
“A Quantitative Analysis of the Gnutella Network Traffic”
cs204 Final Project by
Demetris Zeinalipour & Theodoros Folias
<[email protected], [email protected]>
Advisor: Michalis Faloutsos
Online Resources:
http://www.cs.ucr.edu/~csyiazti/cs204.html
Presentation Outline
1. Motivation.
2. Gnutella Protocol in a nutshell.
3. Related Work.
4. gnuDC – Gnutella Distributed Crawler.
5. Experiments.
6. Conclusions & Future Work.
1. Motivation
• P2P file-sharing systems, such as Gnutella, Napster and Freenet, realized a distributed infrastructure for sharing files.
• Traditionally, files were shared using the client-server model (e.g. HTTP, FTP), which is centralized and therefore not scalable.
• P2P systems have shown that distributed file-searching is feasible, and that these systems may change the way we interact on the Internet.
• Why Gnutella?
– It is a pure P2P protocol, in contrast with, e.g., Napster.
– It is an open protocol, which allows its investigation.
– It is a large community, with some 250,000 peers at any given moment.
– It is a truly international phenomenon with a world-wide community.
– It is still not clear what kind of traffic is traversing the network.
1. Motivation
• Questions:
– What do these systems really look like?
– What kind of traffic are these systems carrying?
– What is the communication overhead of P2P?
– Where are file-searchers coming from, and what are they looking for?
• Our contribution:
– We perform a quantitative analysis of the Gnutella network traffic at a large scale (17 machines, 85 nodes, 700MB of log traces in 5 hours).
– To our knowledge, such a large-scale measurement has not been presented in any publication.
– We describe design and implementation issues of a large-scale distributed Gnutella crawler.
2. Gnutella Protocol v0.4 (1/5)
• One of the most popular file-sharing protocols.
• Operates without a central index server (such as Napster's).
• Clients (downloaders) are also servers => servents.
• Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
• Searches are done within the virtual network, while actual downloads are done offline (over HTTP).
• The core of the protocol consists of 5 descriptors (PING, PONG, QUERY, QUERYHIT and PUSH), whose common header is sketched below.
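For reference, every descriptor starts with a fixed 23-byte header: a 16-byte descriptor ID, a payload-descriptor byte, TTL, hops, and a little-endian payload length. The minimal Java sketch below is our own rendering of that layout from the v0.4 specification; the class and field names are ours, not part of any client.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Descriptor {
    // Payload descriptor values from the v0.4 specification.
    public static final byte PING     = 0x00;
    public static final byte PONG     = 0x01;
    public static final byte PUSH     = 0x40;
    public static final byte QUERY    = (byte) 0x80;
    public static final byte QUERYHIT = (byte) 0x81;

    public final byte[] guid;        // 16-byte unique descriptor ID
    public final byte type;          // one of the constants above
    public final byte ttl;           // hops the descriptor may still travel
    public final byte hops;          // hops it has already travelled
    public final int payloadLength;  // bytes that follow this header

    public Descriptor(byte[] guid, byte type, byte ttl, byte hops, int payloadLength) {
        this.guid = guid; this.type = type; this.ttl = ttl;
        this.hops = hops; this.payloadLength = payloadLength;
    }

    // Parse the 23-byte header as it appears on the wire.
    public static Descriptor parse(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        byte[] guid = new byte[16];
        buf.get(guid);
        return new Descriptor(guid, buf.get(), buf.get(), buf.get(), buf.getInt());
    }
}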
2. Gnutella Protocol (2/5)
• It is important to understand how the protocol works in order to understand our framework.
• A peer p needs to connect to 1 or more other Gnutella peers in order to participate in the virtual network.
• p initially doesn't know the IPs of its fellow file-sharers.
[Figure: Servent p looking for an entry point into the Gnutella Network N]
2. Gnutella Protocol (3/5)
a. HostCaches – The initial connection
• p connects to a HostCache H to obtain a set of IP addresses of active peers.
• p might alternatively probe its local cache to find peers it was connected to in the past. The handshake itself is sketched after the figure.
[Figure: (1) Servent p requests/receives a set of active peers from HostCache server H, e.g. connect1.gnutellahosts.com:6346; (2) p connects to the Gnutella Network N]
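Once p has an address to try, joining the network is a short textual exchange on a plain TCP socket: the v0.4 greeting "GNUTELLA CONNECT/0.4" is answered by "GNUTELLA OK", after which binary descriptors flow on the same connection. A minimal sketch, with error handling reduced to the essentials (class and method names are ours):

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Handshake {
    // Attempt the v0.4 handshake with a peer address obtained from a
    // host cache or from our local cache of past neighbours.
    public static Socket connect(String host, int port) throws IOException {
        Socket s = new Socket(host, port);
        s.getOutputStream().write(
                "GNUTELLA CONNECT/0.4\n\n".getBytes(StandardCharsets.US_ASCII));
        s.getOutputStream().flush();

        // The peer accepts us by replying "GNUTELLA OK".
        BufferedReader in = new BufferedReader(
                new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
        String reply = in.readLine();
        if (reply == null || !reply.startsWith("GNUTELLA OK")) {
            s.close();
            throw new IOException("handshake refused: " + reply);
        }
        return s;  // binary descriptor traffic follows on this socket
    }
}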
2. Gnutella Protocol (4/5)
b. Ping/Pong – The communication overhead
• Although p is already connected, it must keep discovering new peers, since its current connections may break.
• Thus, it periodically sends PING messages, which are broadcast (message flooding).
• If a host, e.g. p2, is available, it will respond with a PONG (routed back along the same path the PING came from).
• p might use this response and attempt a connection to p2 in order to increase its degree. A sketch of the flood-and-route logic follows the figure.

[Figure: (1) Servent p floods a PING into the Gnutella Network N; (2) Servent p2 answers with a PONG]
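The forwarding rules behind this exchange can be sketched as follows. This is our reading of the v0.4 rules, not any client's source; it reuses the Descriptor sketch from above, and Connection stands in for a live peer socket.

import java.util.*;

public class Router {
    interface Connection { void send(Descriptor d); }

    private final Map<String, Connection> pingRoutes = new HashMap<>();
    private final List<Connection> neighbours = new ArrayList<>();

    void onPing(Descriptor ping, Connection from) {
        String key = Arrays.toString(ping.guid);
        if (pingRoutes.containsKey(key)) return;       // drop duplicates
        pingRoutes.put(key, from);                     // remember the reverse path
        if (ping.ttl <= 1) return;                     // expired: do not forward
        Descriptor fwd = new Descriptor(ping.guid, ping.type,
                (byte) (ping.ttl - 1), (byte) (ping.hops + 1), ping.payloadLength);
        for (Connection c : neighbours)
            if (c != from) c.send(fwd);                // flood to all other peers
    }

    void onPong(Descriptor pong, Connection from) {
        // Route the PONG back along the connection its PING arrived on.
        Connection back = pingRoutes.get(Arrays.toString(pong.guid));
        if (back != null) back.send(pong);
    }
}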
2. Gnutella Protocol (5/5)
c. Query/QueryHit – The utilization
• Query descriptors contain unstructured queries, e.g. "celine dion mp3".
• Like PINGs, they are broadcast, with a typical TTL=7.
• If a host, e.g. p2, matches the query, it will respond with a QueryHit descriptor; a sketch of the matching step follows the figure.
[Figure: (1) Servent p floods a QUERY into the Gnutella Network N; (2) Servent p2 answers with a QUERYHIT]
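For illustration, the matching step could look roughly like the sketch below: the servent tests the search terms against its shared file names and, on a hit, would wrap the matches (together with its IP, port and speed) into a QUERYHIT payload. The class is a simplification of ours; forwarding of queries follows the same flooding rules as PING.

import java.util.*;

public class QueryHandler {
    private final List<String> sharedFiles;

    public QueryHandler(List<String> sharedFiles) { this.sharedFiles = sharedFiles; }

    // Return the shared file names containing every term of the query.
    public List<String> match(String query) {
        String[] terms = query.toLowerCase().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (String file : sharedFiles) {
            String name = file.toLowerCase();
            boolean all = true;
            for (String t : terms)
                if (!name.contains(t)) { all = false; break; }
            if (all) hits.add(file);
        }
        return hits;
    }
}

For example, match("celine dion mp3") on a share containing a (hypothetical) file "Celine_Dion_-_My_Heart.mp3" returns that file, so a QUERYHIT would be sent back along the query's reverse path.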
d. Push – Enables downloads from peers that are firewalled.
• If a peer is firewalled, we can't connect to it. Hence we ask it to establish a connection to us and to send us the file.
3. Related Work (1/3)
a. Simulating Peer-to-Peer Systems
• Most researchers use simulation testbeds (e.g. Anthill) to validate the performance improvement they gain from new ideas (routing algorithms etc.).
• The initial assumptions (e.g. degree of nodes; graph type: "random", "power-law", "tree") might be wrong, though!
• Visualizations might also not be very helpful.
• What we would need instead are real network metrics from a large P2P network such as Gnutella.
3. Related Work (2/3)
b. Obtaining data from different physical locations
“Tracing a large-scale Peer to Peer System : An hour in the life of Gnutella”, E. Markatos, CCGrid 2002
• They obtained traffic log traces from 3 different physical locations (Norway, Greece, USA).
• The collected data from all three locations are almost identical.
• They found that Gnutella traffic is bursty and remains bursty over several time scales.
• The results also show that there are high locality patterns in QUERY messages. This observation might lead to better caching policies at peers.
• Their study also reveals a topology mismatch between the physical topology and the virtual Gnutella topology, since the collected data are identical among their 3 different crawlers.
3. Related Work (3/3)
c. Obtaining real network data
• LimeWire shows that there are on average 250,000 hosts at any given moment.
• They also show that only a small fraction of these hosts accept incoming connections.
• GnutellaMeter.com also monitors the network by attaching itself to well-positioned (i.e. high-degree) peers in the network. They present the top queries.
• Clip2 showed that the network diameter in 2000 was 22, indicating that some regions of the network were not communicating with others.
• Clip2 also showed that most Gnutella searchers are seeking video/audio media.
How have these trends changed?
4. gnuDC – Gnutella Distributed Crawler (1/6)
• gnuDC is a large-scale distributed Gnutella crawler which monitors the network by attaching itself to it with large numbers of peers.
• A key difference between a WWW crawler and a P2P crawler is that the latter needs to obtain its results (a snapshot) in a relatively short amount of time.
• Design issues and an architecture for a distributed P2P crawler have not been described in any other publication.
• What should the responsibilities of a P2P crawler be, and how should we design it?
4. gnuDC – Gnutella Distributed Crawler (2/6)
Design Issues of a Distributed P2P Crawler
1. Obtain network statistics in a small interval.
• A P2P network might be very large, which implies that sequential discovery won't return the expected results.
• Parallelizing the discovery process might be easy by partitioning the hosts to be discovered among K parallel crawlers.
2. Scale with the network size.
• A few years ago Gnutella had a few thousand hosts; today it has 250,000 at any given moment. Distributed discovery is a must.
• What is desirable?
  1. Purely distributed approaches?
  2. Distributed approaches with centralized indexes (e.g. SETI@Home)?
• gnuDC is based on a hybrid approach where each crawler runs in its own memory space, logs information on local disks and notifies a central index when new IPs are found, as sketched below.
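A minimal sketch of that hybrid scheme: each crawler keeps a local set of the peers it has already seen and pushes only newly discovered addresses to the central index. The one-address-per-line wire format and the class names are our assumptions, not gnuDC internals.

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class IndexClient {
    private final PrintWriter out;
    private final Set<String> seen = new HashSet<>();  // peers this crawler knows

    public IndexClient(String indexHost, int indexPort) throws IOException {
        Socket s = new Socket(indexHost, indexPort);
        out = new PrintWriter(new OutputStreamWriter(
                s.getOutputStream(), StandardCharsets.US_ASCII), true);
    }

    // Called whenever a PONG or QUERYHIT reveals a peer address; local
    // logging is assumed to happen elsewhere, only novelty is reported.
    public synchronized void report(String ip, int port) {
        String addr = ip + ":" + port;
        if (seen.add(addr)) out.println(addr);  // notify the index of new IPs only
    }
}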
4. gnuDC – Gnutella Distributed Crawler (3/6)
Design Issues of a Large Distributed P2P Crawler (cont'd)
3. Maintain network health.
• The crawlers should not affect the regular operation of the network.
• Typically a message's TTL is decreased when it traverses a crawler. This shouldn't happen!
4. Platform independence.
• Our distributed crawler is aimed to run on a NOW (Network of Workstations).
• Networks of workstations are typically heterogeneous (Linux, SunOS, Unix).
• Java is based on a "write once, run anywhere" philosophy.
• It also provides a strong core for networking, threads, RMI and more.
• This makes it an ideal language for our purpose.
4. gnuDC – Gnutella Distributed Crawler (4/6)
gnuDC Architecture.
• It consists of an IP Index Server, several distributed gnuDC Bricks, a Log Aggregator and a Log Analyzer.
• The components operate asynchronously and independently.
• The whole system is bootstrapped by 1 Unix script.
[Figure: gnuDC – gnutella Distributed Crawler. Several gnuDC Bricks, each running a P2P Network Module (configured via config.txt) attached to the Gnutella Network and writing Local Logs; (1) the Bricks report to the IP Index Server, which keeps a Local Repository; (2) the Logs Aggregator, driven by a Log Manager, harvests the Local Logs into its own Local Repository; (3) the Logs Analyzer processes them and (4) produces the Results]
4. gnuDC – Gnutella Distributed Crawler (5/6)
a) IP Index Server
• A multi-threaded engine which maintains and indexes IP addresses of active Gnutella peers.
• Uses double buffering for flushing results to secondary storage, as sketched below.
• Sustains high loads, indexing 2,500 IPs/sec on average with a peak of 5,000 IPs/sec.
• The cost of the in-memory data structures is 300MB for 240,000 IPs.
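The double-buffering idea can be sketched as follows: indexing threads always append to the active buffer, while a background flush swaps the two buffers under the lock and drains the inactive one to disk, so accepting a new IP never blocks on I/O. Class names and flush policy are our assumptions.

import java.io.*;
import java.util.*;

public class DoubleBufferedLog {
    private List<String> active = new ArrayList<>();    // filled by indexing threads
    private List<String> inactive = new ArrayList<>();  // drained to disk
    private final PrintWriter disk;

    public DoubleBufferedLog(String file) throws IOException {
        disk = new PrintWriter(new BufferedWriter(new FileWriter(file)));
    }

    // Hot path: O(1) under a short lock, independent of disk speed.
    public synchronized void add(String ip) { active.add(ip); }

    // Runs periodically on a background thread.
    public void flush() {
        List<String> toWrite;
        synchronized (this) {
            toWrite = active;       // swap buffers: writers now fill the other one
            active = inactive;
            inactive = toWrite;
        }
        for (String ip : toWrite) disk.println(ip);  // slow I/O outside the lock
        disk.flush();
        toWrite.clear();
    }
}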
4. gnuDC – Gnutella Distributed Crawler (6/6)
b) gnuDC Bricks
• Configurable and self-adaptive Gnutella clients.
• Implementation based on the JTella API.
• gnuDC Bricks are independent from each other and run in different memory spaces.
c) Log Aggregator
• Collects and aggregates data that is dispersed on the remote disks of the gnuDC Bricks.
• Uses ssh along with bash scripts to make the harvesting process easy.
d) Log Analyzer
• A combination of bash scripts, C++ routines and Java programs for analyzing the harvested data based on various criteria.
• Aggregating and analyzing takes approximately 7-10 minutes for 700MB of log traces.
5. Experiments 1/6
• We deployed gnuDC on 85 nodes running on 17 AMD Athlon 4 machines (1.4 GHz, 1GB RAM) running Mandrake Linux 8.0, interconnected with a 10/100 LAN.
• On the 1st of June 2002, we performed our first "long" crawl.
• We also performed several other small-scale experiments to gather data on specific issues.
Technical Difficulties.
• We were crawling only during early morning hours (i.e. 1:30 a.m. - 6:30 a.m.), because during weekdays the machines were used by students.
• Huge amounts of log traces (e.g. 700MB in 5 hours) caused problems due to quota limitations.
• The department's administrators blocked any remote access (i.e. establishing a TCP connection on any port number of a lab machine), so:
– the crawler couldn't accept any incoming connections;
– the degree of a gnuDC Brick consequently decreased from 100 to 30 connections.
5. Experiments 2/6
a) Analysis of Gnutella Messages (ALL)
• Our sample includes 56 million messages.
• The communication overhead (ping/pong) of Gnutella is 63%.
• The utilization of the network (query/queryhit) is only 37%.
• The huge communication overhead might be due to the fact that Gnutella network connections are highly unstable.
• The proportion of queries with queryhits is satisfactory, although we can't say whether users are satisfied by the actual query results.
• General queries such as "mpeg video" may increase this number.
5. Experiments 2/6
a) Analysis of Gnutella Messages (ALL)
• We observed a correlation between the flows of Ping/Pong and Query/QueryHit pairs, although there is formally no relation between them; it would be interesting to investigate this further.
• Ironically, although a Ping message generates many Pong messages (~4x), a Query message generates a QueryHit only 1/8 of the time.

[Figure: Ping, Pong, Query and QueryHit message flows]
5. Experiments 3/6
b) Analysis of Query Messages
• We analyzed 15,153,524 query messages.
• High locality of specific queries, which might enable better caching policies.
• Gnutella users are looking for Video > Audio > Images > Documents.
• We observed three classes of searchers:
– Seasonal-Content Searchers – search patterns depend on the time of crawling.
– Adult-Content Searchers – constant search patterns over time.
– File-Extension Searchers – constant search patterns over time.

[Figures: (a) ranking by query; (b) ranking by file extension]
5. Experiments 4/6
c) Analysis of Gnutella IP Addresses
• We analyzed 294,000 unique IP addresses (the initial number was larger, but we filtered out IP addresses designated for private networks, i.e. 192.168.x.x, 172.16.x.x and 10.x.x.x).
• We implemented MRDL – a Multithreaded Reverse DNS Lookup Engine which resolves 100 IPs/second in parallel; a sketch of both the filtering and the parallel lookup follows.
• MRDL resolved 244,522 IP addresses; 16.92% were not resolvable.
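Both steps can be sketched as below, assuming a fixed pool of 100 worker threads to match the reported rate; the filter covers the private ranges listed above, and the actual MRDL implementation may well differ.

import java.net.InetAddress;
import java.util.*;
import java.util.concurrent.*;

public class ReverseLookup {
    // RFC 1918 private ranges: 10/8, 172.16/12, 192.168/16.
    static boolean isPrivate(String ip) {
        return ip.startsWith("10.") || ip.startsWith("192.168.")
            || ip.matches("172\\.(1[6-9]|2[0-9]|3[01])\\..*");
    }

    public static Map<String, String> resolveAll(List<String> ips)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(100);
        Map<String, String> names = new ConcurrentHashMap<>();
        for (String ip : ips) {
            if (isPrivate(ip)) continue;  // skip unroutable addresses
            pool.execute(() -> {
                try {
                    // Reverse lookup; returns the IP itself when unresolvable.
                    String host = InetAddress.getByName(ip).getCanonicalHostName();
                    if (!host.equals(ip)) names.put(ip, host);
                } catch (Exception ignored) { /* count as unresolvable */ }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return names;
    }
}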
From which domains are Gnutella users coming?
5. Experiments 5/6
c) Analysis of Gnutella IP Addresses (cont'd)
Which ISPs are paying the price of the Gnutella infrastructure?
• US, German, Canadian, French and English ISPs dominate.
• We haven't validated whether this ranking reflects the actual size of each ISP.
• Interestingly, Asian ISPs (e.g. from Japan) are listed very low in this ranking, although they are technologically advanced.
5. Experiments 6/6
d) Analysis of the Hop Count found in Query Messages
Gnutella clients conform to the protocol specifications:
• Only a few queries are coming from farther than 7 hops.
• The protocol thwarts excessive consumption of network resources.
• The bar graph presents a bimodal distribution with 2 peaks, at hop counts 1 and 7. It would be interesting to investigate why so many queries come from so close (i.e. 1 hop); probably because connections are weak.

[Figures: hop-count distribution on log/normal and normal/normal scales]
6. Conclusions and Future Work
Summary of main observations
1. The Gnutella communication overhead is huge: Ping/Pong 63% vs. Query/QueryHit 37%.
2. Gnutella users can be classified into three main categories: Seasonal-Content, Adult-Content and File-Extension Searchers.
3. Gnutella users are mainly interested in video > audio > images > documents.
4. Although Gnutella is a truly international phenomenon, its largest segment is contributed by only a few countries.
5. The clients have started conforming to the specifications of the protocol, thereby thwarting excessive consumption of network resources.

We are interested in examining more carefully other data (e.g. User-Agents) that we have collected but haven't analyzed due to time shortage.
Our metrics might facilitate the development of more advanced P2P protocols which take into consideration various bottlenecks and characteristics of current solutions, such as Gnutella.
Online Resources: http://www.cs.ucr.edu/~csyiazti/cs204.html