Peer-to-peer and probabilistic protocols


Reliable Distributed Systems
Peer to Peer
Peer-to-Peer (p2p) Systems


The term refers to a kind of distributed
computing system in which the “main”
service is provided by having the client
systems talk directly to one another
In contrast, traditional systems are
structured with servers at the core and
clients around the edges
p2p systems
Standard systems:
Client/Server structured
P2P systems: Clients help
one another out
An “important” topic

… or at least, it gets a lot of press

Recording industry claims that p2p downloads are
killing profits!


Used to be mostly file sharing, but now online radio
feeds (RSS feeds) are a big deal too
U. Wash. study showed that 80% of their network
bandwidth was spent on music/video downloads!




DVDs are largest, and accounted for the lion’s share
A great many objects were downloaded many times
Strangely, many downloads took months to complete…
Most went to a tiny handful of machines in dorm rooms
Where has all the bandwidth gone?
[Figure: breakdown of UW TCP bandwidth into HTTP components (May 2002) — Mbps over one week, split into P2P, WWW, Akamai, and non-HTTP TCP traffic]
• WWW = 14% of TCP traffic; P2P = 43% of TCP traffic
• P2P dominates WWW in bandwidth consumed!!
Source: Hank Levy. See
http://www.cs.washington.edu/research/networking/websys/pubs/osdi_2002/osdi.pdf
Bandwidth consumed by UW servers (outbound traffic)
[Figure: bandwidth consumed by UW servers in Mbps over one week — Kazaa dominates, well above WWW and Gnutella]
Source: Hank Levy. See
http://www.cs.washington.edu/research/networking/websys/pubs/osdi_2002/osdi.pdf
Object type for different systems
[Figure: byte breakdown per content delivery system (WWW, Akamai, Gnutella, Kazaa) across TEXT, IMAGES, AUDIO, VIDEO, and OTHER — video and audio bytes dominate the P2P systems]
Source: Hank Levy. See
http://www.cs.washington.edu/research/networking/websys/pubs/osdi_2002/osdi.pdf
Today: An Overview

Today we’ll look at the area as a whole




Origins: Illegal file sharing
Early academic work: “Distributed hash tables”
Subsequent spread of field into many other areas:
steganographic storage, erasure codes, gossip
protocols and epidemic data dissemination, etc
In upcoming lectures we’ll look at details of
some research systems
An old idea…

If you think about it, most of the protocols we’ve
discussed are “peer to peer” in a broad sense



Pretty much everything Lamport was interested in uses
direct client-to-client communication
Group communication systems often do have servers, but
not all need them…
But the term really has a stronger meaning



Denotes systems where the “data that matters” is passed
among cooperating client systems
And there may be huge numbers of clients
Evokes image of resistance fighters working to overthrow an
evil IP empire
Attributes of p2p systems

They can be enormous



We often talk about hundreds of thousands or
millions of client nodes, coming and going rapidly
If there are servers, they are small in number and
have limited roles
These clients are everywhere



Even in Kenya or Nepal… places with lousy
network connectivity
Often behind firewalls or NAT boxes
Some are supercomputers. But many are slow
The issue with NAT boxes

When a system uses firewalls or NAT boxes

Client systems inside the network can usually talk
to servers outside it



The NAT knows about the TCP 3-way handshake and
“creates a tunnel” on the fly
It remaps the (IP address, port) pair as packets pass by,
so it looks as if the NAT (not the client) is making the
connection and receiving the replies…
But connectivity from outside to inside is blocked

In fact, because client IP address is mapped, the client
simply can’t be addressed other than through the NAT!
The first peer-to-peer system

The term, and the intuition, emerged
from the Napster file sharing service




In fact Napster has a set of servers
But these just keep a directory on behalf of
clients and orchestrate publicity inserts
Servers build the web pages users see
Actual music and DVD downloads are done
from client to client
Napster
[Figure: the Napster data center builds the pages users see; the actual download runs client-to-client]
Having obtained a top-level page listing peers with
copies of the desired music or other content, a client
can download the files directly from a peer.
Client to directory: “Where can I find a copy of
‘Sting: Fields of Barley’?” … “try 167.26.16.89
or 221.18.71.36”
Client to peer: “Got ‘Sting’?… Can I have a copy?”
… “no problem, dude”
Quick aside

Should “intellectual property” be free?


Topic of much debate right now
Lessig: “East Code vs West Code”



East Code is a term for “laws on the books”
West Code is a term for software
His point?


We need to evolve a balance between what we demand
(law), what we can implement (code), and what will
promote the general welfare
What regime gives the most benefit for the most people?
Why did Napster go this route?

When service launched, developers hoped to work
around legal limits on sharing media





They reasoned: let client systems advertise “stuff”
If some of that stuff happens to be music, that’s the
responsibility of the person who does it
The directory system “helps clients advertise wares” but
doesn’t “endorse” the sharing of protected intellectual
property. Client who chooses to do so is violating the law
They make their money on advertising they insert
Judges saw it differently…

“Napster’s clear purpose is to facilitate theft of IP…”
Characteristics of big populations

With huge numbers of users



Surprisingly many “come and go” on short
time scales
One study: mean residence time in Freenet
was just a few seconds… and many clients
were never heard of again!
British Telecom reassigns IP addresses for all
its networked users every few hours!
List of (technical) issues with Napster

Many clients just aren’t accessible





Firewalls can limit incoming connections to clients
Many client systems come and go (churn)
Round trip times to Nepal are slow…
Slow “upload” speeds are common on client connections
Clients might withdraw a file unexpectedly

E.g. if low on disk space, or if they overwrite a
song they aren’t listening to anymore with a
new download
More (technical) issues with Napster

Industry has been attacking the service… and
not just in court of law




Denial of service assaults on core servers
Some clients lie about content (e.g. serve Frank
Sinatra in response to download for Eminem)
Hacking Napster “clients” to run the protocol in
various broken (disruptive) ways
And trying to figure out who is serving which files,
in order to sue those people
What problems are “fundamental”?



If we assume clients serve up the same stuff
people download, the number of sources for
a less popular item will be very small
Under assumption that churn is a constant,
these less popular items will generally not be
accessible.
But experiments show that clients fall into
two categories:



Well-connected clients that hang around
Poorly-connected clients that also churn
… this confuses the question
What problems are fundamental?


One can have, some claim, as many
electronic personas as one has the time and
energy to create. – Judith S. Donath.
So-called “Sybil attack”…




Attacker buys a high performance computer cluster
It registers many times with Napster using a variety of IP
addresses (maybe 10’s of thousands of times)
Thinking these are real, Napster lists them in download
pages. Real clients get poor service or even get snared
Studies show that no p2p system can easily defend
against Sybil attacks!
Refined Napster structure

Early Napster just listed anything. Later:


Enhanced directory servers to probe clients, track their
health. Uses an automated reporting of download problems
to trim “bad sources” from list
Ranks data sources to preferentially list clients who…




Have been up for a long time, and
Seem to have fast connections, and
Appear to be “close” to the client doing the download (uses
notion of “Internet distance”)
Implement parallel downloads and even an experimental
method for doing “striped” downloads (first block from
source A, second from source B, third from C, etc)

Leverages asymmetric download/uplink speeds
Meanwhile, p2p took off


By the time Napster was ruled illegal, it had
15 million users. 5 million of them joined in
just a few months!
With Napster out of business, a vacuum arose


Some users teamed up to define an open standard
called “Gnutella” and to develop many protocol
implementations
Gnutella eliminates the server



Judge singled it out in deciding that Napster was illegal
Also, a true peer-to-peer network seems harder to defeat
than one that is only partly peer-to-peer
Credo: “All information should be free”
How Gnutella works

Rough outline

User joins the network using a broadcast with
increasing TTL values



“Is anyone out there?”
Links itself to the first Gnutella node to respond
To find content, protocol searches in a similar way



Broadcasts “I’m looking for Eminem:WhackHer”
Keeps increasing the TTL value… eventually gives up if no
systems respond
Hopefully, popular content will turn up nearby
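To make the flooding rule concrete, here is a toy sketch of a TTL-bounded search over an overlay graph. The graph, TTL values, and lookup predicate are illustrative assumptions, not Gnutella’s actual wire protocol:

```python
from collections import deque

def flooded_search(graph, start, have_item, ttl):
    """Expanding-ring flood: forward the query to all neighbors,
    decrementing TTL per hop; on failure, retry with a larger TTL."""
    seen, queue, hits = {start}, deque([(start, ttl)]), []
    while queue:
        node, t = queue.popleft()
        if have_item(node):
            hits.append(node)
        if t == 0:
            continue
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, t - 1))
    return hits

overlay = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2, 5], 5: [4]}
print(flooded_search(overlay, 0, lambda n: n == 5, ttl=2))  # [] -> retry
print(flooded_search(overlay, 0, lambda n: n == 5, ttl=3))  # [5]
```

Note the cost built into this design: every retry re-visits the nodes the previous, smaller TTL already covered.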
Self-organized “overlay” network
[Figure: a node floods “I’m looking for Sting: Fields…” through the overlay]
TTL determines how far the search will “flood” in the
network. Here, a TTL of 2 reached 10 nodes.
Nodes with a copy send back a message offering it —
basically a URL for the file. Download from the first
node that offers a copy; hopefully this is a nearby
source with good connectivity…
Gnutella has “issues”

In experimental studies of the system



Very high rates of join requests and
queries are sometimes observed
Departures (churn) found to disrupt the
Gnutella communication graph
Requests for rare or misspelled content
turn into world-wide broadcasts

Rare is… um… rare. Misspellings are common.
Berkeley, MIT research in p2p

Universities were first to view p2p as an
interesting research area



CAN: “Content addressable network”
proposed by Berkeley
Chord: MIT “distributed hash table”
Both systems separate the “indexing”
problem from actual storage
Distributed hash tables (DHTs)

Idea is to support a simple index with API:



Insert(key, value) – saves (key,value) tuple
Lookup(key) – looks up key and returns value
Implement it in a p2p network, not a server…


Exactly how we implement it varies
Normally, each p2p client has just part of the
tuples, hence must route query to the right place
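As a concrete picture of the API, here is a minimal single-process sketch. A real DHT routes between peers rather than indexing into a local list, and the peer count and hash function here are illustrative assumptions:

```python
import hashlib
from collections import defaultdict

class ToyDHT:
    """Sketch of the DHT abstraction: each "peer" owns a slice of the
    key space; insert/lookup go to the peer responsible for the key."""

    def __init__(self, n_peers=8):
        self.peers = [defaultdict(list) for _ in range(n_peers)]

    def _owner(self, key):
        # Hash the key into the peer space (real DHTs use consistent hashing)
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.peers[h % len(self.peers)]

    def insert(self, key, value):
        self._owner(key)[key].append(value)  # many tuples may share a key

    def lookup(self, key):
        return self._owner(key)[key]

dht = ToyDHT()
dht.insert("Sting:Fields", "128.64.72.13")
print(dht.lookup("Sting:Fields"))  # ['128.64.72.13']
```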
Distributed indexing
Lookup(“Sting:Fields”)  128.64.72.13
Abstraction of an index makes it look like a big server.
Implementation spreads the index over many peers.
But we can implement this one abstraction in many ways.
Insert(“Sting:Fields”, 128.64.72.13);
Some details

Keep in mind

There are lots of protocols that can solve this
problem: the protocol used is not part of the
problem statement


Some DHTs allow updates (e.g. if data moves, or
nodes crash). Others are write once.
Most DHTs allow many tuples with the same key
and can return the whole list, or a random subset
of size k, etc
So what can we insert?

Normally, we want to keep the values
small… like an IP address

So the (key,value) pairs might tell us where
to look for something but probably not the
actual thing


Value could be (and often is) a URL
Once we have the DHT running we can
use it to build a p2p file system
DHTs: Area quickly took off




CAN, Chord: DHTs, already mentioned
Pastry: From Rice and MSR, uses
“Plaxton trees” (a kind of lookup tree)
Tapestry: Berkeley (similar to Pastry)
Kelips, Beehive: Cornell (use replication
to get much faster responses)
… and too many more to list!
Representative research topics

Can we make a DHT…

… “resilient” to churn?
… hide content and guarantee anonymity?
… secure and robust against attack?
… support high quality parallel striped downloads?
Can we use a DHT…


To support scalable content distribution (IP
multicast isn’t popular with ISPs)?
To implement a new style of Internet addressing
(i.e. replace IP routing or multicast)?
Are there legitimate uses of p2p
file systems?



One thought: corporations might want to index
“everything in their file store” or to archive stuff
Digital libraries might use p2p to avoid keeping extra
copies of special or extremely big objects
Risk of “bit rot” is a big concern
 Suppose some huge set of PCs collaborates to
preserve important documents



Might also encrypt them – various options exist…
How many replicas needed to avoid risk that “rare
events” will destroy all copies simultaneously?
A topic of study in Oceanstore and at UCSD
Are there legitimate uses of
p2p file systems?

p2p could be a great way to legally share information
within a team of collaborators at work, or some other
“interest group”



Think of these as little groups superimposed on a massive
p2p network using the same technology
Idea would be: “We help each other out”
Some argue that p2p systems could be valuable in
resisting repressive political regimes


Like “coffee house” meetings in pre-revolutionary Russia
Can repressive regimes survive if they can’t control the flow
of information?
Spyware: The real thing

Imagine a popular p2p system that


Encrypts content: need key to make sense of it
Achieves a high degree of anonymity




Pretty much everyone helps to serve each request, but
nobody actually has a copy of the whole file on their
drive – e.g. I have a few bits, you have a few bits
Real sources and nodes accessing content concealed
from intruders
Robust against disruptive attack
Needs to be popular: Spies hide in crowds
Philosophical debate

Is technology “political”?

Here we have a technology invented to





Rip off IP from owners
Conceal crime from law enforcement
Pretty much unstoppable without incredibly intrusive
oversight mechanisms
What’s the story here? Are we all anarchists?
Some people believe technology is negative,
some positive, some neutral


What about p2p technology?
Are we allowed to answer “all of the above”?
p2p outside of file sharing

Key idea was that p2p systems could
“gossip” about replicated data


Now and then, each node picks some
“peer” (at random, more or less)
Sends it a snapshot of its own data
— called “push” gossip
Or asks for a snapshot of the peer’s data
— “pull” gossip
Or both: a push-pull interaction
Gossip “epidemics”




[t=0] Suppose that I know something
[t=1] I pick you… Now two of us know it.
[t=2] We each pick … now 4 know it…
Information spread: exponential rate.


Due to re-infection (gossip to an already-infected
node) it spreads as roughly 1.8^k after k rounds
But in O(log(N)) time, N nodes are infected
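A toy simulation of push gossip illustrates the exponential spread. The node count, seed, and one-target-per-round rule are illustrative assumptions:

```python
import math, random

def rounds_to_infect(n=10_000, seed=42):
    """Push gossip: each round, every infected node tells one random peer.
    Infection roughly triples early on; a short tail catches stragglers."""
    random.seed(seed)
    infected = {0}
    rounds = 0
    while len(infected) < n:
        for _ in range(len(infected)):   # snapshot of current infected count
            infected.add(random.randrange(n))
        rounds += 1
    return rounds

print(rounds_to_infect(), "rounds;", "log2(10000) =", round(math.log2(1e4), 1))
```

The tail is the push-only weakness the next slides mention: the last few nodes take extra rounds to get lucky.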
Gossip epidemics
An unlucky node may
just “miss” the gossip
for a long time
Gossip scales very nicely


Participants’ loads independent of size
Network load linear in system size
Data spreads in log(system size) time
[Figure: % infected (0.0 to 1.0) vs. time — an S-curve; time to full infection is O(log n)]
Facts about gossip epidemics

Extremely robust


Data travels on exponentially many paths!
Hard to even slow it down…




Suppose 50% of our packets are simply lost…
… we’ll need 1 additional round: a trivial delay!
Push-pull works best. For push-only/pull-only a
few nodes can remain uninfected for a long time
Later we’ll see that many optimizations are
needed in practice… but the approach works!
Uses of gossip epidemics

To robustly multicast data




Slow, but very sure of getting through
To repair inconsistency in replicas
To support “all to all” monitoring and
distributed management
For distributed data mining and
discovery
A contemporary perspective

p2p computing has many pros and
many cons, and for most purposes the
cons outweigh the pros




A “hard to control” technology
Firewalls cause many annoyances
Rather slow to propagate updates
But at the same time

Incredibly robust against disruption
Contemporary response?

So… use p2p techniques, but mostly



In data centers or LANs where there are no
firewalls
In uses where slow update times aren’t an
issue
Often means that we need to marry
p2p mechanism to a more “urgent”
protocol like our multicast protocols
Peek ahead

We’ll look at several p2p technologies



Chord, Pastry, Kelips: three DHTs
Bimodal Multicast: Uses gossip in a
multicast protocol to get superior scalability
Astrolabe: Uses gossip to implement a
scalable monitoring, management and
control infrastructure (also great for data
mining)
Reliable Distributed Systems
Distributed Hash Tables
Reminder: Distributed Hash Table (DHT)

A service, distributed over multiple machines,
with hash table semantics

Insert(key, value), Value(s) = Lookup(key)

Designed to work in a peer-to-peer (P2P)
environment

No central control
Nodes under different administrative control

But of course can operate in an “infrastructure”
sense
P2P “environment”


Nodes come and go at will (possibly
quite frequently---a few minutes)
Nodes have heterogeneous capacities


Bandwidth, processing, and storage
Nodes may behave badly


Promise to do something (store a file) and
not do it (free-loaders)
Attack the system
Several flavors, each with
variants

Tapestry (Berkeley)




Based on Plaxton trees---similar to hypercube
routing
The first* DHT
Complex and hard to maintain (hard to
understand too!)
CAN (ACIRI), Chord (MIT), and Pastry
(Rice/MSR Cambridge)

Second wave of DHTs (contemporary with and
independent of each other)
* Landmark Routing, 1988, used a form of DHT
called Assured Destination Binding (ADB)
Basics of all DHTs
[Figure: nodes 13, 33, 58, 81, 97, 111, 127 arranged on a ring]

Goal is to build some “structured”
overlay network with the following
characteristics:

Node IDs can be mapped to the hash key
space
Given a hash key as a “destination
address”, you can route through the
network to a given node
Always route to the same node no matter
where you start from
Simple example (doesn’t scale)
[Figure: the same ring of nodes 13, 33, 58, 81, 97, 111, 127]

Circular number space 0 to 127
Routing rule is to move clockwise until
current node ID ≥ key, and last hop
node ID < key
Example: key = 42
Obviously you will route to node 58
no matter where you start
Node 58 “owns” keys in [34,58]
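The routing rule fits in a few lines of Python; the ring below uses the slide’s toy node IDs:

```python
def route(nodes, key, space=128):
    """Clockwise successor routing on a circular ID space: the key is
    owned by the first node whose ID is >= the key, wrapping at the top."""
    for n in sorted(nodes):
        if n >= key % space:
            return n
    return min(nodes)            # wrapped all the way around the ring

ring = [13, 33, 58, 81, 97, 111, 127]
print(route(ring, 42))           # 58 -- node 58 owns keys in [34, 58]
print(route(ring, 130))          # 13 -- 130 wraps to 2, owned by node 13
```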
Building any DHT
[Figure sequence: newcomer 24 joins the ring of nodes 13, 33, 58, 81, 97, 111, 127]

Newcomer always starts with at
least one known member
Newcomer searches for “self” in
the network

hash key = newcomer’s node ID
Search results in a node in the vicinity
where newcomer needs to be

Links are added/removed to satisfy
properties of network
Objects that now hash to new node are
transferred to new node
Insertion/lookup for any DHT
[Figure: foo.htm hashes to key 93, which is routed around the ring to its target node]

Hash name of object to produce key

Well-known way to do this

Use key as destination address to
route through network

Routes to the target node

Insert object, or retrieve object,
at the target node
Properties of all DHTs




Memory requirements grow (something like)
logarithmically with N (exception: Kelips)
Routing path length grows (something like)
logarithmically with N (several exceptions)
Cost of adding or removing a node grows
(something like) logarithmically with N
Has caching, replication, etc…
DHT Issues


Resilience to failures
Load Balance







Heterogeneity
Number of objects at each node
Routing hot spots
Lookup hot spots
Locality (performance issue)
Churn (performance and correctness issue)
Security
We’re going to look at four DHTs

At varying levels of detail…

Chord
MIT (Stoica et al)

Pastry
Rice/Microsoft Cambridge (Druschel, Rowstron)

Kelips
Cornell (Gupta, Linga, Birman)

Beehive
Cornell (Ramasubramanian, Sirer)
Things we’re going to look at






What is the structure?
How does routing work in the structure?
How does it deal with node departures?
How does it scale?
How does it deal with locality?
What are the security issues?
Chord uses a circular ID space
[Figure: circular ID space with nodes N10, N32, N60, N80, N100; keys K5, K10 stored at N10; K11, K30 at N32; K33, K40, K52 at N60; K65, K70 at N80; K100 at N100]
• Successor: node with next highest ID
Chord slides care of Robert Morris, MIT
Basic Lookup
[Figure: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; the query “Where is key 50?” is forwarded around the ring until the answer comes back: “Key 50 is at N60”]
• Lookups find the ID’s successor
• Correct if predecessor is correct
Successor Lists Ensure Robust Lookup
[Figure: the same ring, with each node holding its r = 3 successors, e.g. N5 → (10, 20, 32), N80 → (99, 110, 5)]
• Each node remembers r successors
• Lookup can skip over dead nodes to find blocks
• Periodic check of successor and predecessor links
Chord “Finger Table” Accelerates Lookups
[Figure: N80’s fingers point ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring]
To build finger tables, a new
node searches for the key
values for each finger.
To do it efficiently, new
nodes obtain the successor’s
finger table, and use it as a
hint to optimize the search.
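A sketch of how fingers are laid out — finger i of node n targets successor(n + 2^i). The ring below reuses the node IDs from the lookup figures; the helper is illustrative, not Chord’s actual code:

```python
from bisect import bisect_left

def successor(nodes, key, space=128):
    """First node at or clockwise-after the key (nodes must be sorted)."""
    i = bisect_left(nodes, key % space)
    return nodes[i % len(nodes)]

def finger_table(n, nodes, bits=7):
    """Finger i of node n targets successor(n + 2^i), i = 0..bits-1; the
    largest finger jumps half the ring, which makes lookups O(log N)."""
    return [successor(nodes, n + 2 ** i) for i in range(bits)]

ring = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])
print(finger_table(80, ring))  # N80's fingers: [99, 99, 99, 99, 99, 5, 20]
```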
Chord lookups take O(log N) hops
[Figure: Lookup(K19) issued at N32 hops across the ring via finger pointers, roughly halving the remaining distance each hop, ending at N20, the successor of K19]
Drill down on Chord reliability


Interested in maintaining a correct routing
table (successors, predecessors, and fingers)
Primary invariant: correctness of successor
pointers



Fingers, while important for performance, do not
have to be exactly correct for routing to work
Algorithm is to “get closer” to the target
Successor nodes always do this
Maintaining successor pointers

Periodically run “stabilize” algorithm





Finds successor’s predecessor
Repair if this isn’t self
This algorithm is also run at join
Eventually routing will repair itself
Fix_finger also periodically run

For randomly selected finger
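A compact sketch of the stabilize/notify repair loop, following the Chord paper’s pseudocode; fingers and failure handling are omitted, and the demo replays the join example on the next slide:

```python
def between(a, b, c, space=128):
    """True if b lies strictly on the clockwise arc from a to c."""
    return 0 < (b - a) % space < (c - a) % space

class ChordNode:
    """Stabilize/notify repair loop (fingers and failure handling omitted)."""
    def __init__(self, ident):
        self.id, self.successor, self.predecessor = ident, self, None

    def stabilize(self):
        x = self.successor.predecessor      # ask successor for its predecessor
        if x is not None and between(self.id, x.id, self.successor.id):
            self.successor = x              # repair if it isn't us
        self.successor.notify(self)

    def notify(self, n):                    # n thinks it may be our predecessor
        if self.predecessor is None or between(self.predecessor.id, n.id, self.id):
            self.predecessor = n

# 25 joins between 20 and 30
n20, n25, n30 = ChordNode(20), ChordNode(25), ChordNode(30)
n20.successor, n30.predecessor = n30, n20
n25.successor = n30                          # 25 finds its successor...
n30.notify(n25)                              # ...and tells 30 of itself
n20.stabilize()                              # 20 learns of 25 via 30
print(n20.successor.id, n25.predecessor.id)  # 25 20
```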
Initial: 25 wants to join correct ring (between 20 and 30)
[Figure: three snapshots of the ring segment 20 – 25 – 30]
• 25 finds its successor, and tells that successor (30) of itself
• 20 runs “stabilize”: 20 asks 30 for 30’s predecessor; 30 returns 25; 20 tells 25 of itself
This time, 28 joins before 20 runs “stabilize”
[Figure: successive snapshots of the ring segment 20 – 25 – 28 – 30]
• 28 finds its successor, and tells that successor (30) of itself
• 20 runs “stabilize”: 20 asks 30 for 30’s predecessor; 30 returns 28; 20 tells 28 of itself
• 25 runs “stabilize”; then 20 runs “stabilize” again, and the ring repairs itself
Chord problems?

With intense “churn” ring may be very
disrupted





Worst case: partition can provoke formation of
two distinct rings (think “Cornell ring” and “MIT
ring”)
And they could have finger pointers to each other,
but not link pointers
This might never heal itself…
But scenario would be hard to induce.
Unable to resist Sybil attacks
Pastry also uses a circular number space
[Figure: Route(d46a1c) from node 65a1fc hops through d13da3, d4213f, d462ba, d467c4 toward d471f1, matching a longer prefix of the key at each hop]

Difference is in how
the “fingers” are
created
Pastry uses prefix
match overlap
rather than binary
splitting
More flexibility in
neighbor selection
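One hop of prefix routing, sketched; leaf-set handling and Pastry’s numeric tie-breaking are omitted, and the table entries are simply the node IDs from the figure:

```python
def shared_prefix(a, b):
    """Number of leading hex digits a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def prefix_route_hop(current, key, known):
    """One Pastry-style hop: forward to a known node sharing a longer
    digit prefix with the key than the current node does."""
    best = max(known, key=lambda n: shared_prefix(n, key))
    return best if shared_prefix(best, key) > shared_prefix(current, key) else current

known = ["d13da3", "d4213f", "d462ba", "d467c4"]
print(prefix_route_hop("65a1fc", "d46a1c", known))  # a node matching "d46..."
```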
Pastry routing table (for node
65a1fc)
Pastry nodes also
have a “leaf set” of
immediate neighbors
up and down the ring
Similar to Chord’s list
of successors
Pastry join




X = new node, A = bootstrap, Z = nearest node
A finds Z for X
In process, A, Z, and all nodes in path send state
tables to X
X settles on own table



Possibly after contacting other nodes
X tells everyone who needs to know about itself
Pastry paper doesn’t give enough information to
understand how concurrent joins work

18th IFIP/ACM, Nov 2001
Pastry leave

Noticed by leaf set neighbors when leaving
node doesn’t respond


Neighbors ask highest and lowest nodes in leaf set
for new leaf set
Noticed by routing neighbors when message
forward fails



Immediately can route to another neighbor
Fix entry by asking another neighbor in the same
“row” for its neighbor
If this fails, ask somebody a level up
For instance, this neighbor
fails
Ask other neighbors
Try asking some
neighbor in the same
row for its 655x entry
If it doesn’t have one, try
asking some neighbor in
the row below, etc.
Kelips takes a different approach

Network partitioned into √N “affinity groups”
Hash of node ID determines which affinity
group a node is in
Each node knows:

One or more nodes in each group
All objects and nodes in own group

But this knowledge is soft state, spread
through peer-to-peer “gossip” (epidemic
multicast)!
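The group mapping itself is just a hash. A sketch with the √N sizing rule stated above; the node names and hash choice are illustrative assumptions:

```python
import hashlib, math

def affinity_group(node_id, n_groups):
    """Hash the node ID into one of the affinity groups."""
    digest = hashlib.sha1(node_id.encode()).hexdigest()
    return int(digest, 16) % n_groups

n_nodes = 10_000
n_groups = round(math.sqrt(n_nodes))        # ~100 groups of ~100 members each
print(affinity_group("node-110", n_groups))
print(affinity_group("cnn.com", n_groups))  # objects hash the same way
```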
Kelips
[Figure: a collection of nodes (110, 230, 30, 202) is mapped to affinity groups 0, 1, 2, …, √N−1 by consistent hashing, with about √N members per group]
Kelips
[Figure: node 110’s affinity group view lists the other members of its own group — e.g. (id 30, hbeat 234, rtt 90ms), (id 230, hbeat 322, rtt 30ms) — so 110 knows about 230, 30, …]
Kelips
[Figure: besides its affinity group view, node 110 keeps a contacts table with an entry per foreign group — e.g. group 2 → contact node 202. In practice, keep k contacts per group, for some small constant k]
Kelips
[Figure: node 110 also holds resource tuples — e.g. “cnn.com” → 110. “cnn.com” maps to group 2, so 110 tells group 2 to “route” inquiries about cnn.com to it. A gossip protocol replicates this data cheaply]
Kelips
[Figure: to look up “cnn.com”, just ask some contact in group 2; it returns “110” (or forwards your request)]
Kelips paper: IPTPS; ACM TOIS (submitted)
Kelips gossip

Operates at constant “background” rate



Independent of frequency of changes in
the system
Average overhead may be higher than
other DHTs, but not bursty
If churn too high, system performs
poorly (failed lookups), but does not
collapse…
Beehive

A DHT intended for supporting high-performance
infrastructure services with proactive caching

Focus of “real system” is on DNS but has
other applications
Proactive caching: a form of replication.
DNS already caches… Beehive pushes
updates to improve hit rates
Domain Name Service

Translates textual names to internet
addresses

“www.cnn.com” → 1.2.3.4

Relies on a static hierarchy of servers

Prone to DoS attacks
Fragile, expensive to maintain
Slow and sluggish

In 2002, a DDoS attack almost disabled the
root name servers
A Self-Organizing Solution…
[Figure: a DHT lookup for oid = 0122 (www.cnn.com, IP = 1.2.3.4) hops through nodes 2012, 0021, 0112 to reach 0122]
Great idea, but O(log N) hops is too slow on the
Internet. By replicating a (key,value) tuple we can
shorten the worst-case search by one hop; with a
greater degree of replication the search goes even
faster.
Beehive Intuition
[Figure: replicas of the tuple are placed along the lookup path through 2012, 0021, 0112, 0122]
Optimization problem: minimize the total number of
replicas s.t. average lookup performance ≤ C
Beehive Summary


general replication framework
suitable for structured DHTs


decentralization, self-organization, resilience
properties



high performance: O(1) average lookup time
scalable: minimize number of replicas and reduce
storage, bandwidth, and network load
adaptive: promptly respond to changes in
popularity – flash crowds
To finish up

Various applications have been
designed over DHTs



File system, DNS-like service, pub/sub
system
DHTs are elegant and promising tools
Concerns about churn and security
Reliable Distributed Systems
Scalability
Scalability

Today we’ll focus on how things scale


Basically: look at a property that matters
Make something “bigger”



Like the network, the number of groups, the
number of members, the data rate
Then measure the property and see impact
Often we can “hope” that no slowdown
would occur. But what really happens?
Stock Exchange Problem:
Sometimes, someone is slow…
[Figure: most members are healthy… but one is slow — something is contending with the red process, delaying its handling of incoming messages]

With a slow receiver, throughput collapses as the system scales up
[Figure: virtually synchronous Ensemble multicast protocols — average throughput on non-perturbed members vs. perturb rate (0 to 0.9) for group sizes 32, 64, and 96; throughput falls off sooner as the group grows]
Why does this happen?


Superficially, because data for the slow
process piles up in the sender’s buffer,
causing flow control to kick in
(prematurely)
But why does the problem grow worse as
a function of group size, with just one
“red” process?

Small perturbations happen all the time
Broad picture?


Virtual synchrony works well under bursty
loads
And it scales to fairly large systems (SWX
uses a hierarchy to reach ~500 users)



From what we’ve seen so far, this is about as good
as it gets for reliability
Recall that stronger reliability models like Paxos
are costly and scale far worse
Desired: steady throughput under heavy load
and stress
Protocols famous for
scalability

Scalable Reliable Multicast (SRM)
Reliable Multicast Transport Protocol (RMTP)
On-Tree Efficient Recovery using Subcasting
(OTERS)
Several others: TMP, MFTP, MFTP/EC...

But when stability is tested under stress, every
one of these protocols collapses just like virtual
synchrony!
Example: Scalable Reliable
Multicast (SRM)




Originated in work on Wb and Mbone
Idea is to do “local repair” if messages
are lost, various optimizations keep load
low and repair costs localized
Wildly popular for internet “push,” seen
as solution for Internet radio and TV
But receiver-driven reliability model
lacks “strong” reliability guarantees
Local Repair Concept
[Figure sequence: a packet is lost partway through the multicast tree; receipt of a subsequent packet triggers NACKs for the missing packet; a nearby node retransmits the repair; meanwhile, other nodes receive useless NAKs and duplicate repairs]
Limitations?



SRM runs in application, not router, hence IP
multicast of nack’s and retransmissions tend
to reach many or all processes
Lacking knowledge of who should receive
each message, SRM has no simple way to
know when a message can be garbage
collected at the application layer
Probabilistic rules to suppress duplicates
In practice?

As the system grows large the
“probabilistic suppression” fails



More and more NAKs are sent in duplicate
And more and more duplicate data
messages are sent as multiple receivers
respond to the same NAK
Why does this happen?
Visualizing how SRM collapses

Think of sender as the hub of a wheel


Messages depart in all directions
Loss can occur at many places “out there” and they could be
far apart…




Hence NAK suppression won’t work
Causing multiple NAKS
And the same reasoning explains why any one NAK is likely
to trigger multiple retransmissions!
Experiments have confirmed that SRM overheads
soar with deployment size

Every message triggers many NAKs and many
retransmissions until the network finally melts down
Dilemma confronting
developers



Application is extremely critical: stock
market, air traffic control, medical
system
Hence need a strong model, guarantees
But these applications often have a
soft-realtime subsystem


Steady data generation
May need to deliver over a large scale
Today we introduce a new design point

Bimodal multicast (pbcast) is reliable in a sense
that can be formalized, at least for some
networks



Generalization for larger class of networks should be
possible but maybe not easy
Protocol is also very stable under steady load
even if 25% of processes are perturbed
Scalable in much the same way as SRM
Environment



Will assume that most links have known
throughput and loss properties
Also assume that most processes are
responsive to messages in bounded
time
But can tolerate some flakey links and
some crashed or slow processes.
Start by using unreliable multicast to rapidly
distribute the message. But some messages
may not get through, and some processes may
be faulty. So initial state involves partial
distribution of multicast(s)
Periodically (e.g. every 100ms) each process
sends a digest describing its state to some
randomly selected group member. The digest
identifies messages. It doesn’t include them.
Recipient checks the gossip digest against its
own history and solicits a copy of any missing
message from the process that sent the gossip
Processes respond to solicitations received
during a round of gossip by retransmitting the
requested message. The round lasts much longer
than a typical RPC time.
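A sketch of the anti-entropy exchange just described. Real pbcast separates the digest, the solicitation, and the retransmission into distinct messages; this toy version collapses them into one step, and all names and parameters are illustrative:

```python
import random

class PbcastNode:
    """Sketch of pbcast's anti-entropy phase: compare digests of message
    IDs, then retransmit whatever the peer is missing."""

    def __init__(self):
        self.store = {}                    # msg_id -> payload

    def gossip_with(self, peer):
        # Peer checks our "digest" against its history and solicits the rest
        for msg_id in self.store.keys() - peer.store.keys():
            peer.store[msg_id] = self.store[msg_id]

# An unreliable multicast reached only 60% of the group...
random.seed(0)
nodes = [PbcastNode() for _ in range(50)]
for n in random.sample(nodes, 30):
    n.store["m1"] = "payload"

# ...a few rounds of random gossip fill in the gaps w.h.p.
for _ in range(5):
    for n in nodes:
        n.gossip_with(random.choice([p for p in nodes if p is not n]))
print(sum("m1" in n.store for n in nodes), "of", len(nodes), "delivered")
```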
Delivery? Garbage Collection?



Deliver a message when it is in FIFO order
Garbage collect a message when you believe
that no “healthy” process could still need a
copy (we used to wait 10 rounds, but now
are using gossip to detect this condition)
Match parameters to intended environment
Need to bound costs

Worries:



Someone could fall behind and never catch
up, endlessly loading everyone else
What if some process has lots of stuff
others want and they bombard him with
requests?
What about scalability in buffering and in
list of members of the system, or costs of
updating that list?
Optimizations


Request retransmissions most recent
multicast first
Idea is to “catch up quickly” leaving at
most one gap in the retrieved sequence
Optimizations

Participants bound the amount of data
they will retransmit during any given
round of gossip. If too much is solicited
they ignore the excess requests
Optimizations


Label each gossip message with the
sender’s gossip round number
Ignore solicitations that have expired
round number, reasoning that they
arrived very late hence are probably no
longer correct
Optimizations

Don’t retransmit same message twice in
a row to any given destination (the
copy may still be in transit hence
request may be redundant)
Optimizations

Use IP multicast when retransmitting a
message if several processes lack a copy




For example, if solicited twice
Also, if a retransmission is received from “far
away”
Tradeoff: excess messages versus low latency
Use regional TTL to restrict multicast scope
Scalability




Protocol is scalable except for its use of
the membership of the full process
group
Updates could be costly
Size of list could be costly
In large groups, would also prefer not
to gossip over long high-latency links
Can extend pbcast to solve both



Could use IP multicast to send the initial
message. (Right now, we have a tree-structured
alternative, but to use it,
need to know the membership)
Tell each process only about some
subset k of the processes, k << N
Keeps costs constant.
Router overload problem



Random gossip can overload a central
router
Yet information flowing through this
router is of diminishing quality as rate
of gossip rises
Insight: constant rate of gossip is
achievable and adequate
Hierarchical Gossip



Weight gossip so that probability of
gossip to a remote cluster is smaller
Can adjust weight to have constant load
on router
Now propagation delays rise… but just
increase rate of gossip to compensate
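The weighting itself can be as simple as a biased coin; p_remote here is an illustrative knob, tuned in practice so the router sees a constant load:

```python
import random

def pick_gossip_target(local_peers, remote_peers, p_remote=0.05):
    """Hierarchical gossip: gossip mostly within the local cluster; cross
    the core router only with small probability p_remote."""
    if remote_peers and random.random() < p_remote:
        return random.choice(remote_peers)
    return random.choice(local_peers)

print(pick_gossip_target(["a", "b", "c"], ["x", "y"]))
```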
Remainder of talk

Show results of formal analysis




We developed a model (won’t do the math
here -- nothing very fancy)
Used model to solve for expected reliability
Then show more experimental data
Real question: what would pbcast “do”
in the Internet? Our experience: it
works!
Idea behind analysis



Can use the mathematics of epidemic
theory to predict reliability of the
protocol
Assume an initial state
Now look at result of running B rounds
of gossip: converges exponentially
quickly towards atomic delivery
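To give the flavor of the math: a standard mean-field epidemic recurrence (an approximation, not the exact model used in the paper) already shows the exponential convergence:

```python
import math

def infected_fraction(rounds, fanout=2, pi0=1e-4):
    """Mean-field recurrence: an uninfected node stays uninfected only if
    every gossip message misses it, giving
        pi_{t+1} = 1 - (1 - pi_t) * exp(-fanout * pi_t)."""
    pi = pi0
    for _ in range(rounds):
        pi = 1 - (1 - pi) * math.exp(-fanout * pi)
    return pi

for t in (5, 10, 15, 20):
    print(t, f"{infected_fraction(t):.9f}")  # races toward 1.0 after ~10 rounds
```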
Pbcast bimodal delivery distribution
[Figure: p{#processes=k} vs. number of processes to deliver pbcast (0–50), log scale from 1 down to 1e-30; the probability mass sits at the two extremes — either the sender fails and almost nobody delivers, or the data gets through to almost everyone w.h.p.]
Failure analysis



Suppose someone tells me what they
hope to “avoid”
Model as a predicate on final system
state
Can compute the probability that pbcast
would terminate in that state, again
from the model
Two predicates

Predicate I: A faulty outcome is one
where more than 10% but less than
90% of the processes get the multicast
… Think of a probabilistic Byzantine
Generals problem: a disaster if many
but not most troops attack
Two predicates

Predicate II: A faulty outcome is one where
roughly half get the multicast and failures
might “conceal” true outcome
… this would make sense if using pbcast to
distribute quorum-style updates to replicated
data. The costly hence undesired outcome is
the one where we need to rollback because
outcome is “uncertain”
Two predicates



Predicate I: More than 10% but less than
90% of the processes get the multicast
Predicate II: Roughly half get the multicast
but crash failures might “conceal” outcome
Easy to add your own predicate. Our
methodology supports any predicate over
final system state
Scalability of Pbcast reliability
[Figure: P{failure} vs. number of processes in the system (10–60), log scale down to 1e-35, for Predicate I and Predicate II; failure probability drops sharply as the system grows]
Effects of fanout on reliability
[Figure: P{failure} vs. fanout (1–10), log scale down to 1e-16, for Predicate I and Predicate II; reliability improves rapidly with increasing fanout]
Fanout required for a specified reliability
[Figure: required fanout (4–9) vs. number of processes in the system (20–50), for Predicate I at 1e-8 reliability and Predicate II at 1e-12 reliability; the needed fanout grows only slowly with system size]
Figure 5: Graphs of analytical results (composite of the four plots above)
Discussion



We see that pbcast is indeed bimodal
even in worst case, when initial
multicast fails
Can easily tune parameters to obtain
desired guarantees of reliability
Protocol is suitable for use in
applications where bounded risk of
undesired outcome is sufficient
Model makes assumptions...




These are rather simplistic
Yet the model seems to predict behavior in
real networks, anyhow
In effect, the protocol is not merely robust to
process perturbation and message loss, but
also to perturbation of the model itself
Speculate that this is due to the incredible
power of exponential convergence...
Experimental work

SP2 is a large network





Nodes are basically UNIX workstations
Interconnect is basically an ATM network
Software is standard Internet stack (TCP, UDP)
We obtained access to as many as 128 nodes
on Cornell SP2 in Theory Center
Ran pbcast on this, and also ran a second
implementation on a real network
Example of a question



Create a group of 8 members
Perturb one member in style of Figure 1
Now look at “stability” of throughput


Measure rate of received messages during
periods of 100ms each
Plot histogram over life of experiment
Now revisit Figure 1 in detail



Take 8 machines
Perturb 1
Pump data in at varying rates, look at
rate of received messages
Revisit our original scenario with
perturbations (32 processes)
[Figure (left): high-bandwidth comparison of pbcast performance at faulty and correct hosts — average throughput vs. perturb rate (0.1–0.9) for traditional multicast and pbcast, measured at perturbed and unperturbed hosts]
[Figure (right): low-bandwidth comparison — traditional vs. pbcast with one perturbed member; throughput for each, measured at the perturbed host]
Discussion




Saw that stability of protocol is exceptional
even under heavy perturbation
Overhead is low and stays low with system
size, bounded even for heavy perturbation
Throughput is extremely steady
In contrast, virtual synchrony and SRM both
are fragile under this sort of attack
Programming with pbcast?

Most often would want to split
application into multiple subsystems


Use pbcast for subsystems that generate
regular flow of data and can tolerate
infrequent loss if risk is bounded
Use stronger properties for subsystems
with less load and that need high
availability and consistency at all times
Programming with pbcast?



In stock exchange, use pbcast for pricing but
abcast for “control” operations
In hospital use pbcast for telemetry data but
use abcast when changing medication
In air traffic system use pbcast for routine
radar track updates but abcast when pilot
registers a flight plan change
Our vision: One protocol side-by-side with the other


Use virtual synchrony for replicated
data and control actions, where strong
guarantees are needed for safety
Use pbcast for high data rates, steady
flows of information, where longer term
properties are critical but individual
multicast is of less critical importance
Summary

New data point in a familiar spectrum





Virtual synchrony
Bimodal probabilistic multicast
Scalable reliable multicast
Demonstrated that pbcast is suitable for
analytic work
Saw that it has exceptional stability