Presentation PPT
Download
Report
Transcript Presentation PPT
Mapping the Gnutella Network
Presented By:
Tony Young
M.Math Candidate
October 7th, 2004
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
Introduction
Peer to peer systems have recently
exploded onto the internet scene
Two main contributing factors:
Low cost and high availability of resources
(computing and storage)
Increased network connectivity (proliferation
of “always on” connections)
Introduction
Peer systems build a virtual topology (overlay)
with its own routing mechanisms
The topology of the overlay and routing
protocols directly affects
Performance: Number of physical hops to send a
message through virtual overlay
Reliability: Will a message actually reach the other
end
Scalability: Can other nodes be added while keeping
performance good
Anonymity: Can we protect the identity of nodes in
the network
Introduction
Gnutella is studied in depth and analysis is
performed to determine how the overlay affects
the four characteristics previously mentioned
Started by capturing the network topology and
behaviour
Performed a macroscopic analysis of the
network to evaluate costs and benefits
Investigated possible improvements
Introduction
Two questions drive analysis
What is the connectivity structure of
Gnutella?
How well does the Gnutella overlay map to
the actual network topology?
Introduction
Connectivity Structure
Networks as diverse as natural networks
usually have a few well connected nodes and
many poorly connected nodes
• I.e. Power Law Networks
We will see Gnutella is not a pure power law
network, but still has good fault tolerance and
is less resistant to DoS attacks
Introduction
Overlay Topology
Important for ISP’s: overlays that don’t map
closely to the physical topology adds
additional stress on the infrastructure and
costs ISP’s more money
Scalability is directly linked to efficient use of
network resources
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Gnutella in Depth
Gnutella is an open protocol
It is decentralized and unstructured
Allows group membership and searching of
available files for download
Gnutella should operate in a dynamic environment
where hosts can join/leave at any time
Gnutella should experience good performance and
scalability
External attacks should not cause data loss or
performance degradation
Users seeking or providing unpopular material
should stay anonymous
Gnutella in Depth
Gnutella nodes are called “servents”
(SERVer-cliENTS)
Provide a client-side interface to allow
searching of file base
Provide server-side storage, routing and
response to network messages and requests
Gnutella in Depth
To connect, a node contacts an “always
on” host (I.e. gnutella.com) and sends a
PING
Node replies with a PONG and forwards
the PING on to other nodes in the network
who reply with PONG messages and
forward the PING on
PING stops after TTL hops
Gnutella in Depth
To find files, users submit QUERY
messages to other nodes
Messages are broadcast to all neighbours
who forward them on to other neighbours,
etc. for TTL hops
QUERY RESPONSE messages are returned
to the querying node
Gnutella in Depth
To download a file, nodes send GET and
PUSH messages to individual hosts
holding a file
I.e. transfer requests and transfers are routed
directly between communicating hosts, and
not back-propagated
Gnutella in Depth
Messaging protocol has three important
features
TTL and “hops passed” fields are attached to each
message
Randomly generated message ID is attached to
each message
Each node keeps track of recently routed messages
to prevent re-broadcasting and to implement backpropagation
Gnutella in Depth
PING message contains the host address
and name, number of files and size of
data store
PONG message contains the same
information from the host that received
the PING
Gnutella in Depth
PING messages propagate until TTL has
expired
Hop count incremented at each servent
receiving the PING
Message propagates until hop count = TTL
PONG messages are back-propagated
(I.e. sent on the reverse path that the
original message followed) to the host
initiating the PING
Gnutella in Depth
QUERY messages are sent the same
way as a PING message
Nodes check the search string requested
against the names of their locally stored files
QUERY RESPONSE messages are backpropagated to the querying node and
include information necessary to
download the file
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
The Crawler
In order to conduct the network tests, a crawler
was developed to gather information about the
virtual topology
Crawler starts with a list of active nodes and
sends a PING message to each of them
PONG messages are received and the IP, port,
number of stored files and size of archive are
stored in a table
PING propagates to other nodes and PONG
back propagates to crawler
The Crawler
A sequential version of the crawler was initially
developed
I.e. send a PING with an empirically determined
optimal TTL to a set of nodes; resend to the nodes
where the PING stops, etc.
Proved to be very slow: 50 hours to collect data
from a 4 000 node network
Slowness means two things:
Not scalable: Will get slower as we add more nodes
Does not give an accurate network snapshot:
network changes drastically over 50 hours!
The Crawler
A distributed crawler was developed next
Client-Server architecture
Server maintains node list and creates a network
graph
Clients receive a list of nodes to contact and
discover neighbours for
Decided to use only 50 clients at once
Reduces invasiveness of search and consumption of
network resources
Reduced crawling time to a couple of hours for a
large initial list and a network of 30 000 nodes
The Crawler
Network membership is defined as follows
A node is a member of the network if the
crawler is able to connect to it
A node might be excluded from network
membership if it was reported as active by a
server or other node, but the crawler could
not contact it
• This might happen if nodes go offline before the
crawler can contact them
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Growth Trends
Traffic Estimates
Connectivity and Reliability
Overlay vs. Topology
Summary and Improvements
Paper Review
Analysis of Network
Data was collected over a 6 month period
Data shows:
Overhead traffic is reducing
Traffic volume is a significant barrier to
growth
Growth Trends
Size of network is growing rapidly
Largest connected component in November
2000 had 2 063 neighbours
Largest connected component in May 2001
had 48 195 neighbours!
Number of neighbours for the largest
connected component has grown 25 times!
Growth Trends
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Growth Trends
Despite the explosive growth, most nodes
are not connected long
Successive crawls of the network found:
40% of nodes leave the network in less than
4 hours
25% of nodes are alive for more than 24
hours
Traffic Estimates
A modified version of the crawler recorded
traffic generated across one randomly
chosen link
36% of total traffic (in bytes) is user
generated QUERY messages
55% is group membership (PING/PONG)
messages
9% is non-standard or malformed messages
N.B. File transfer traffic is excluded
Traffic Estimates
After June 2001 (when new Gnutella
implementation was released)
92% of total traffic (in bytes) was QUERY
messages
8% is group membership (PING/PONG)
messages
N.B. File transfer traffic is excluded
Traffic Estimates
95% of all nodes are reachable within 7 hops.
Thus, each message typically uses a TTL = 7
Most links are expected to support similar amounts
of traffic for these reasons
As verified empirically, the total Gnutella
generated traffic is proportional to the number
of connections in the network
However, the average number of connections per
node stays relatively constant as the network grows
Traffic Estimates
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Traffic Estimates
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Traffic Estimates
The total traffic estimate for the Gnutella
network is 1 Gbps
I.e. 170 000 connections for a 50 000 node
network times 6 kbps per connection
This is approximately 330 TB/month!
Excluding file transfers!
Traffic Estimates
This total is 1.7% of the total internet
traffic in US backbones in December
2000
This volume of traffic is believed to be an
obstacle to further growth
The underlying network topology must be
used more efficiently to allow scaling and
wider deployment
Connectivity and Reliability
Note: Nodes decide locally:
How many connections to support
When to add or drop a connection
Recent research shows that many natural
systems organize themselves into “power
law networks”
I.e. networks where a few nodes are well
connected and most nodes have very few
connections
Connectivity and Reliability
Power law networks:
Number of nodes with L links (connections) is
proportional to L-k where k is systemdependent
Resilient to losing many poorly connected
nodes
Falls apart quickly if only a few well
connected nodes are lost
Extremely robust to random failures, but
vulnerable to targeted attacks
Connectivity and Reliability
Power law networks appear as a linear
system on a log-log plot
Data for December 2000 shows that early
Gnutella networks were power law
Data for March 2001 shows that later
Gnutella networks are a mixture
• There are a constant number of nodes with fewer
than 10 links
• Above 10 links, nodes follow a power law
structure
Connectivity and Reliability
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Connectivity and Reliability
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Connectivity and Reliability
Why did the distribution change?
Two possible reasons:
About 20% of Gnutella users have modem
connections - DSL and up can support more
connections
Gnutella users run as many connections as
their network can support - perception is that
more connections = better query results
Connectivity and Reliability
Does the change in distribution affect
reliability? Yes!
Preserves resilience to random failures
Makes network less dependent on well
connected nodes and hence less prone to
DoS attacks
Overlay vs. Topology
Peer systems change the way bandwidth
is used on the internet
Servers are at the edge of the network now,
and peers are constantly downloading
Most ISP’s use flat-rate billing
Peer systems may break this model!
Overlay vs. Topology
Due to the amount of traffic peer systems
generate, efficient use of resources is
important
The greater the mismatch between the
overlay and the physical network topology,
the more messages need to be transmitted to
route information from A to B
This means more stress on the network
resources
Overlay vs. Topology
Communication from A to all other nodes
requires one message over the D - E link
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Overlay vs. Topology
Communication from A to all other nodes
requires six messages over the D - E link
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Overlay vs. Topology
How well does Gnutella map to the
topology?
Assume that domain names are roughly
evident of the hierarchy of the internet
Check how well generated traffic maps to the
cluster of domain names found by the
crawler
Overlay vs. Topology
After analysis of 10 overlays, it was found
that Gnutella nodes often connect to
peers outside of their respective domains
Thus, it appears that Gnutella does not
make efficient use of the underlying
topology
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
Summary and Improvements
Gnutella has a multimodal connectivity
distribution that is partially constant and partially
power law
Network is resilient to random failures
Network is harder to attack by malicious parties, but
not immune to DoS attacks
Gnutella makes little effort to ward off attackers
E.g. topology, connectivity and traffic information is
easy to obtain and can be used to plan attacks
Summary and Improvements
Gnutella’s traffic volume is a significant
fraction of all internet traffic
Makes the future growth of the network
reliant on efficient use of the topology
Gnutella’s overlay does not match the
network topology very well
This increases quite substantially the number
of messages and the amount of network
traffic generated
Summary and Improvements
Necessary improvements
Make efforts to hide overlay and connectivity
information (encryption?)
Match overlay more closely with topology
Limits to growth must be solved first and fast
at the rate that Gnutella is growing
Summary and Improvements
Suggested Improvements
Exploit locality of files and query distribution
(I.e. caching and localized queries)
Replace query flooding strategy with
something more efficient (I.e. superpeer
routing and group communication)
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
Paper Review
Organization
Some discussions of the Gnutella
architecture and protocols were scattered
throughout the paper
Should have combined everything into a
more logical order inside the protocol section
Writing Style
Generally very good. Some missing words
and poor grammar
Paper Review
Novel Ideas
Presented a qualitative and quantitative
analysis of the Gnutella network, and some
important points for P2P as a whole
Content
Some backing information was missing
Some claims were made without supporting
evidence, or just referring the reader to
another paper
Questions?