What is Gnutella?

Download Report

Transcript What is Gnutella?

An overview of Gnutella
1
What is Gnutella?
Gnutella is a protocol for
distributed search
• peer-to-peer comm
• decentralized model
• No third party lookup
Two stages:
1. Join Network … later
2. Use Network, I.e discover / search
other peers
2
Gnutella Jargon
Servent: A Gnutella node.
Each servent is both a
server and a client.
2 Hops
Hops: a hop is a
pass through an
intermediate node
1 Hop
client
TTL: how many hops a packet can
go before it dies
(default setting is 7 in Gnutella)
3
Gnutella Scenario
Step 0: Join the network
Step 1: Determining who is on the network
• "Ping" packet is used to announce your presence on the network.
• Other peers respond with a "Pong" packet.
• Also forwards your Ping to other connected peers
• A Pong packet also contains:
• an IP address
• port number
• amount of data that peer is sharing
• Pong packets come back via same route
Step 2: Searching
•Gnutella "Query" ask other peers if they have the file you desire A Query packet might
ask, "Do you have any content that matches the string ‘Homer"?
• Peers check to see if they have matches & respond (if they have any matches) & send
packet to connected peers
• Continues for TTL
Step 3: Downloading
• Peers respond with a “QueryHit” (contains contact info)
4
• File transfers use direct connection using HTTP protocol’s GET method
• When there is a firewall a "Push" packet is used – reroutes via Push path
Remarks
Simple idea , but lacks scalability, since bandwidth is
wasted.
Sometimes, existing objects may not be located due to
limited TTL.
Various improved search strategies have been
proposed.
5
Searching in Gnutella
The topology is dynamic, I.e. constantly changing. How do
we model a constantly changing topology? Usually, we begin
with a static topology, and later account for the effect of churn.
Modeling Static topology (measurements provide useful inputs)
Random graph
Power law graph
(Q. Is Gnutella topology a random graph?)
Search strategies
Flooding
Random walk / Biased random walk / multiple walker
one-hop replication / two-hop replication
6
Gnutella topology
Gnutella topology is a power-law graph. (Also called scale-free graph)
What is a power-law graph?
The number of nodes with degree k = c.k - r
Contrast this with Gaussian distribution where the number of nodes
with degree k = c. 2 - k.
Many graphs in the nature exhibit power-law characteristics.
Examples, world-wide web (the number of pages that have k in-links
Is proportional to k - 2), fraction of scientic papers that receive k
citations is k -3 etc.
7
# of telephone numbers
from which calls were made
AT&T Call Graph
How many telephone
numbers receive calls from k
different telephone numbers?
# of telephone numbers called
8
4
Gnutella network
power-law link distribution
proportion of nodes
10
10
10
data
power-law fit
t = 2.07
2
1
0
10
0
1
10
number of neighbors
summer 2000,
data provided by Clip2
9
5
A possible explanation
Nodes join at different times.
The more connections a node has, the more likely
it is to acquire new connections (Rich gets richer).
Popular webpages attract new pointers.
Such a growth process produces power-law network
10
7
Search via random walk
Existence of a path does
not necessarily mean that
such a path can be discovered
11
Search via Random Walk
Search metrics
Discovery time in hops (also called delay)
Distance covered (overhead) by the walker
Both should be as small as possible.
For a single random walker, these are equal.
For search by flooding, if delay = h but
distance = d + d2 + … + dh where d = degree of a node.
K random walker (fixed K) is a compromise.
12
A simple analysis of random walk
Let p = Population of the
object. i.e. the fraction of
nodes hosting the object
T = TTL (time to live)
Hop count h
probability
1
p
2
(1-p).p
3
(1-p)2.p
T
(1-p)T-1.p
13
A simple analysis of random walk
Expected hop count E(h) =
1.p + 2.(1-p).p + 3(1-p)2.p + …+ T.(1-p)T-1.p
=
1/p. (1-(1-p)T) - T(1-p)T
With a large TTL, E(h) = 1/p, which is intuitive.
With a small TTL, there is a risk that search will time out before an
existing object is located.
14
K random walkers
Assume they all k walkers start in unison. Probability that none could
find the object after one hop = (1-p)k. Prob. that none succeeded
after k hops = (1-p)kT. So the probability that at least one walker
succeeded is 1-(1-p)kT. A typical assumption is that the search is
abandoned as soon as at least one walker succeeds
Expected overhead =
1-(1-p)T-1 - T. (1-p)T-1 . k
p
As k increases, the probability of success increases, the overhead
increases, but the delay decreases. There is a tradeoff here.
15
Increasing search efficiency
Major strategies
1. Biased walk utilizing node degree heterogeneity.
2. Utilizing structural properties like random graph,
power-law graphs, or small-world properties
3. Topology adaptation for faster search
4. Introducing two layers in the graph structure using
supernodes
16
Dynamic Topology Adaptation
• Make high-capacity nodes have high degree (i.e.,
more neighbors)
• Per-node level of satisfaction, S:
– 0  no neighbors, 1  enough neighbors
– Function of:
• Node’s capacity
• Neighbors’ degrees
Neighbors’ capacities
Their age
– When S << 1, look for neighbors aggressively
17
One hop replication
Each node keeps track of the indices of the files belonging to its
immediate neighbors. As a result, high capacity / high degree nodes
can provide useful clues to a large number of search queries.
Where is
18
Biased random walk
P=2/10
P=5/10
P=3/10
Each node records the degree of the neighboring nodes. Search
easily gravitates towards high degree nodes that hold more clues.
19
power-law graph
number of
nodes found
94
67
63
54
2
6
1
Deterministic biased walk
20
9
The KaZaA approach
Where is
ABC?
ABC
Supernode
Supernode
Powerful nodes (supernodes) act as local index servers, and
client queries are propagated to other supernodes. Two-layered.
21
Case Study: Gia
(Chawathe et al. Making Gnutella-like P2P systems scalable. SIGCOMM ‘03)
Three major decisions
1. Flooding replaced by (biased) random walk
2. Topology adaptation: high-capacity nodes maintain
more clues via one hop replication. They should be
easily reachable
3. Token based flow control to prevent overloading of
high-degree nodes.
22
Topology Control
Steps for a node X when Y wants to join
{X has spare capacity}
If # of neighbors + 1 < max then accept Y as a neighbor
{when X has no spare capacity, it has to drop a neighbor}
If C(Y) > C(i: i is a current neighbor of X) then accept Y as neighbor
and drop any one of the existing neighbors.
{Otherwise, X has to pick a Z that can be dropped in favor Y}
pick the highest degree current neighbor Z
If C(Y) > C(Z) or Y has fewer neighbors than Z
then
drop Z, accept Y
else
reject Y
23
{This last step ensures that we do not drop poorly connected neighbors}
Simulation Results
• Compare four systems
–
–
–
–
FLOOD: TTL-scoped, random topologies
RWRT: Random walks, random topologies
SUPER: Supernode-based search
GIA: search using GIA protocol suite
• Metric:
– Collapse point: aggregate throughput that the
system can sustain. The success rate drops to
90%.
24
Collapse Point (qps/node)
System Performance
1000
GIA: N=10,000
SUPER: N=10,000
10
RW RT: N=10,000
FLOOD: N=10,000
0.1
0.001
0.00001
0.01
0.1 %
Replication Rate (percentage)
%
1%
population of
the object
GIA outperforms SUPER, RWRT & FLOOD by many
orders of magnitude in terms of aggregate query load
25
Factor Analysis
Algorithm
Collapse
point
Algorithm
Collapse
point
RWRT
0.0005
GIA
7
RWRT+OHR
0.005
GIA – OHR
RWRT+BIAS
0.0015
GIA – BIAS
0.004
6
RWRT+TADAPT
0.001
0.0006
GIA – TADAPT
0.2
GIA – FLWCTL
2
RWRT+FLWCTL
Topology
adaptation
Flow control
26