Transcript Epidemic

Epidemic Techniques
Algorithms and Implementations
Agenda
Consistency issues
Epidemic algorithms
Astrolabe
Conclusion
Databases replicated at many sites
need to maintain consistency
• Relaxed consistency problem:
– Database is changed at one site
– Change must propagate to all other sites
– All copies must eventually agree
– Copies should be mostly current
• Important factors
– Propagation time
– Network traffic (ideally proportional to size of the update × number of servers)
Epidemic algorithms help spread
updates and maintain consistency
• Epidemic terminology
– A site with an update it is willing to share is infective
– A site which has yet to receive an update is susceptible
– A site with an update it is no longer willing to share is removed
[Figure: site states Susceptible, Infective, Removed]
Xerox has three algorithms to
create database consistency
• Direct mail
– Updates are mailed from originating site to all other sites
• Anti-entropy (epidemic)
– All sites regularly choose other sites and exchange database contents
• Rumor mongering (epidemic)
– Updates become “hot rumors” which are periodically sent to other sites until most sites contacted are “infective”
Direct mail is almost, but not
completely reliable
• Queues
– Updates are queued to prevent delays
– Queue located in stable storage
• Failures
– Queues overflow
– Destinations inaccessible for long periods of time
– Source lacks accurate knowledge of all other sites
• Traffic
– n messages per update
– Each message traverses all links from source to destination
– Traffic proportional to the number of sites × distance between sites
[Figure: an update at Server 1 is queued and mailed to each other server, up to Server n]
Anti-entropy is reliable but costly
• Site A chooses site B at random
• The databases are compared
– Pull: A gets database from B
– Push: A sends database to B
– Push-pull: A and B exchange databases
• When used as backup, pull or push-pull is preferable
[Figure: pull, push, and push-pull exchanges between sites A and B]
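The three exchange modes can be sketched as follows. This is a minimal illustration, assuming each database is a dict that maps a key to a (value, timestamp) pair and that the newer timestamp wins; the names merge_into, anti_entropy, and run_cycle are hypothetical and not taken from the Xerox system.

```python
import random

def merge_into(dst, src):
    """Copy every entry from src that dst lacks or holds in an older version."""
    for key, (value, ts) in src.items():
        if key not in dst or dst[key][1] < ts:
            dst[key] = (value, ts)

def anti_entropy(a, b, mode="push-pull"):
    """Resolve differences between databases a and b; a initiated the exchange."""
    if mode in ("push", "push-pull"):
        merge_into(b, a)   # push: A's newer entries go to B
    if mode in ("pull", "push-pull"):
        merge_into(a, b)   # pull: B's newer entries go to A

def run_cycle(sites, mode="push-pull", rng=random):
    """One cycle: every site picks one other site at random and exchanges."""
    for i, a in enumerate(sites):
        j = rng.choice([k for k in range(len(sites)) if k != i])
        anti_entropy(a, sites[j], mode)
```

Repeating run_cycle until no database changes gives the eventual agreement the slide asks for, at the cost of comparing whole databases on every exchange.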
Checksums can be used with anti-entropy to improve performance
• Comparing databases is expensive
• A recent update list can be kept
• Recent updates are exchanged
• Updates applied
• Checksums of database contents exchanged
• Databases compared only if checksums disagree
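A rough sketch of that sequence, assuming databases are dicts of key to (value, timestamp) with string keys, and that "recent" means a timestamp within a fixed window of the current time. The function names and the use of SHA-256 are illustrative choices, not details from the source.

```python
import hashlib

def checksum(db):
    """Digest of the database contents, independent of insertion order."""
    h = hashlib.sha256()
    for key in sorted(db):
        h.update(repr((key, db[key])).encode())
    return h.hexdigest()

def apply_updates(db, updates):
    """Keep the newer version of every entry found in updates."""
    for key, (value, ts) in updates.items():
        if key not in db or db[key][1] < ts:
            db[key] = (value, ts)

def recent(db, now, window):
    """Entries whose timestamp lies within `window` of `now`."""
    return {k: v for k, v in db.items() if now - v[1] <= window}

def exchange(a, b, now, window):
    """Return True if a full database comparison was needed."""
    ra, rb = recent(a, now, window), recent(b, now, window)
    apply_updates(a, rb)
    apply_updates(b, ra)
    if checksum(a) == checksum(b):
        return False              # cheap path: recent lists were enough
    apply_updates(a, dict(b))     # expensive path: compare everything
    apply_updates(b, dict(a))
    return True
```

The window trades list size against how often the expensive path is taken: it should be long enough that most updates reach both sites while still on the recent list.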
Rumor mongering is less costly but
can be inconsistent
• n individuals initially susceptible
• Rumor planted, making A infective
• A contacts others at random to share the rumor
• Everyone who hears the rumor becomes infective
• When A unnecessarily contacts someone (a site that already knows the rumor), A will become inactive (removed) with probability 1/k
• Increasing k ensures almost everyone will hear the rumor
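The process above can be simulated in a few lines. This is a sketch under stated assumptions: time advances in synchronous rounds, each infective site contacts one other site per round, and the function name rumor_mongering is hypothetical.

```python
import random

def rumor_mongering(n, k, rng):
    """Simulate push rumor mongering where an unnecessary contact removes
    the sender with probability 1/k.

    Returns the number of sites that never heard the rumor.
    """
    SUSCEPTIBLE, INFECTIVE, REMOVED = 0, 1, 2
    state = [SUSCEPTIBLE] * n
    state[0] = INFECTIVE                    # the rumor is planted at site 0
    infective = [0]
    while infective:
        still_infective = []
        for site in infective:
            other = rng.randrange(n - 1)
            if other >= site:
                other += 1                  # uniform choice among the other sites
            if state[other] == SUSCEPTIBLE:
                state[other] = INFECTIVE    # the rumor spreads
                still_infective.append(other)
                still_infective.append(site)
            elif rng.random() < 1.0 / k:
                state[site] = REMOVED       # unnecessary contact: lose interest
            else:
                still_infective.append(site)
        infective = still_infective
    return state.count(SUSCEPTIBLE)
```

Calling rumor_mongering(1000, k, random.Random(seed)) for a few values of k shows the trade-off on the slide: a larger k leaves fewer sites uninformed but keeps sites sending for longer.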
Some variations on rumor
mongering exist
[Figure: Convergence Time Comparison. Convergence time (0 to 40) against counter k (1 to 5); series: Avg. and Last for Response and Counters, Avg. and Last for Blind and Probabilistic]
• Blind vs. Feedback
– Feedback can tell when a recipient has already heard a rumor
– Blind stops spreading the rumor with probability 1/k regardless of whether the recipient has already heard the rumor
• Counter vs. Coin
– Coin loses interest with probability 1/k
– Counter loses interest after k unnecessary contacts
• Simulations indicate that counter and feedback used in combination have the least delay
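The four combinations differ only in which contacts count against the sender and in how interest is lost. A minimal sketch, assuming one RumorSender object per (site, rumor) pair; the class and its field names are hypothetical.

```python
import random

class RumorSender:
    """Tracks when a sender stops spreading one rumor.

    feedback=True: only contacts where the recipient already knew count.
    feedback=False (blind): every contact counts.
    counter=True: stop after k counted contacts.
    counter=False (coin): stop with probability 1/k on each counted contact.
    """
    def __init__(self, k, feedback, counter, rng=random):
        self.k, self.feedback, self.counter, self.rng = k, feedback, counter, rng
        self.count = 0
        self.active = True

    def after_contact(self, recipient_already_knew):
        if self.feedback and not recipient_already_knew:
            return                      # useful contact: interest unchanged
        if self.counter:
            self.count += 1
            if self.count >= self.k:
                self.active = False
        elif self.rng.random() < 1.0 / self.k:
            self.active = False
```

Feedback needs a reply from the recipient, while blind does not; counter needs a small amount of per-rumor state, while coin needs none.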
Pull performs better when updates
are frequent
• Push vs. pull
– Up until now, we have assumed that updates are pushed
– When a database has a high rate of rumor injection
• Pull is more likely to find non-empty rumor lists
– When the database is mostly quiescent
• Push will cease to introduce traffic
– Choice is based on the rate of updates
– Connection limits help push but hinder pull
Rumor mongering a better choice
when using anti-entropy as backup
Direct mail
Rumor mongering
• What happens when anti-entropy detects inconsistency?
– Nothing. Anti-entropy makes the databases consistent
• Ok when only a few sites were missed
– Update redistributed
• Better in the event of a complete failure
• Worst case: distribution reached half the sites
Deletion is more complicated than
simply removing a file
• In anti-entropy and rumor mongering an absent file will be replaced by an old version
• Solution
– File replaced with death certificate
– Death certificates spread, removing old copies of deleted items
– When and how do death certificates get deleted?
[Figure: File A replaced by death certificate A]
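A small sketch of why a marker is needed, assuming a dict of key to (value, timestamp) where the newer timestamp wins. The sentinel DEAD and the function names are hypothetical.

```python
DEAD = object()   # sentinel value marking a death certificate

def merge_entry(local, key, value, ts):
    """Apply an incoming (value, ts) for key; the newer timestamp wins."""
    if key not in local or local[key][1] < ts:
        local[key] = (value, ts)

def delete(local, key, ts):
    """Delete by writing a death certificate instead of removing the key."""
    merge_entry(local, key, DEAD, ts)

def lookup(local, key):
    """Return the live value for key, or None if absent or deleted."""
    entry = local.get(key)
    return None if entry is None or entry[0] is DEAD else entry[0]
```

If the key were simply removed, a later exchange with a site still holding the old copy would find nothing to compare against and would copy the old version back. The death certificate carries a newer timestamp, so it wins the comparison and spreads instead.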
Death certificates become dormant
but can be resurrected
• Death certificates are stamped with two timestamps T1 and T2
• When T1 is reached, most servers delete the certificate
• Servers on death certificate’s retention site list keep a dormant copy
• Dormant copies discarded when T1+T2 is reached
• Dormant death certificates are resurrected if an obsolete copy of the data is encountered
[Figure: one certificate (T1 1:00, T2 2:00, retention list A, B, D) shown at three times. 12:00: death certificate; 1:00: dormant certificate kept on A, B, and D; 2:00: all certificates deleted]
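The life cycle can be written as a small status function. This sketch follows the bullet wording and assumes T1 is an absolute time and T2 an additional retention interval, so that dormant copies are discarded at T1 + T2; that reading, and all names below, are assumptions rather than details confirmed by the source.

```python
from dataclasses import dataclass

@dataclass
class DeathCertificate:
    key: str
    t1: float                 # time at which ordinary sites drop the certificate
    t2: float                 # extra time for which dormant copies are retained
    retention: frozenset      # sites that keep a dormant copy

def status(cert, site, now):
    """Return 'active', 'dormant', or 'discarded' for this site at time now."""
    if now < cert.t1:
        return "active"
    if site in cert.retention and now < cert.t1 + cert.t2:
        return "dormant"
    return "discarded"
```

Keeping dormant copies on only a few sites lets the retention period be long without every server paying the storage cost.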
Timestamps tricky when
reactivating a death certificate
• Setting the timestamp forward to current clock value reactivates the death certificate
• Problem: legitimate updates made between death certificate and current time will be erased erroneously
• An activation timestamp must be added to prevent the deletion of changes more current than the death certificate
[Figure: one certificate (activation time 1:00, retention list A, B, D) shown at three times. 12:00: death certificate, T1 1:00, T2 2:00; 1:00: dormant certificate kept on A, B, and D, T1 1:00, T2 2:00; 2:00: certificate reactivated, T1 3:00, T2 4:00]
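One way to picture the fix is to keep two separate fields, so that reactivation restarts the expiry clock without widening what the certificate may delete. A minimal sketch; the field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DeathCertificate:
    deleted_at: float     # ordinary timestamp: when the item was deleted
    activated_at: float   # activation timestamp: when the certificate last became active

def reactivate(cert, now):
    """Wake a dormant certificate without changing what it is allowed to cancel."""
    cert.activated_at = now        # restarts the expiry clock only

def cancels(cert, item_ts):
    """A certificate removes only copies older than the original deletion."""
    return item_ts < cert.deleted_at
```

With a single timestamp moved forward to the current time, cancels would also return True for an update written after the deletion, which is the erroneous erasure described above.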
Distance between nodes can affect
traffic overhead
• Updates cost less to send when the source and destination are close
• Assume a worst case linear network
• Nearest neighbor selection results in high convergence time
– Traffic per link per cycle would be O(1)
– O(n) cycles would be needed
• Uniform random connections result in high traffic overhead
– Average connection distance of O(n)
– Convergence O(log n)
– Traffic per link per cycle is O(n)
• Nonuniform distribution reduces traffic and has acceptable convergence time
[Figure: linear network of sites 1, 2, …, n]
Spatial distribution can improve
traffic in anti-entropy
[Figure: Convergence time comparison for anti-entropy, no connection limit. Avg. and last convergence time against spatial distribution a (uniform, 1.2, 1.4, 1.6, 1.8, 2)]
[Figure: Traffic comparison for anti-entropy, no connection limit. Average traffic and Bushey traffic against spatial distribution a (uniform, 1.2, 1.4, 1.6, 1.8, 2)]
• Each site builds a list of sites sorted by distance
• An anti-entropy exchange partner is selected from the list with probability given by a function such as f(i) = i^(-a), where i is the site’s position in the list
• Spatial distribution significantly reduces traffic on critical links
• Convergence time is not significantly worse with a higher spatial distribution
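Partner selection with this kind of weighting can be sketched as below. It assumes the list is already sorted nearest first and takes f(i) = i^(-a) literally as an unnormalized weight for the site at 1-based position i; the function name choose_partner is hypothetical.

```python
import random

def choose_partner(sites_by_distance, a, rng=random):
    """Pick a partner with probability proportional to i**(-a),
    where i is the 1-based rank in the distance-sorted list."""
    weights = [i ** (-a) for i in range(1, len(sites_by_distance) + 1)]
    return rng.choices(sites_by_distance, weights=weights, k=1)[0]
```

With a = 0 every site is equally likely (the uniform case); as a grows, nearby sites are chosen far more often, which is what keeps traffic off long or critical links.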
Push and pull rumors more
sensitive to spatial distribution
[Figure: sites S and T close to each other, with distant sites U1, U2, …, Um]
• There is a high probability that S and T will choose each other
• If an update is introduced at S or T, it will be pushed to the other
• The rumor can then die without reaching all other nodes
Xerox chose to implement a
randomized anti-entropy algorithm
• Anti-entropy guarantees consistency
• Well chosen spatial distribution algorithm reduced link traffic by a factor of 4 and critical link traffic by a factor of 30
• Xerox experienced improvement in consistency and network traffic overhead with implementation
Astrolabe provides fast, dynamic
management of large stores of information
DNS
• A directory service
• Organizes machines into domains
• Associates attributes with each domain
• Designed to map domain names to IP addresses and mail servers
• Changes rare
• Updates are slow to propagate
Astrolabe
• An information management service
• Organizes resources into a hierarchy of zones, like domains
• Attributes associated with each zone
• Zones not bound to specific servers
• Attributes can be very dynamic
• Updates propagate quickly
Astrolabe can be used in p2p
systems to cache large objects
• Problem
– Infeasible to keep large objects on a central database and copy on every access
– Load time and network load too high
• Solution
– Store copies on different hosts
– Use Astrolabe to find a nearby, fresh copy
[Figure: copies of object A held on several hosts]
Astrolabe strives to satisfy four
basic principles
• Scalability through hierarchy
– Maintains consistent overhead
• Flexibility through mobile code
– SQL queries allow different applications to
communicate
• Robustness through a randomized peer-to-peer
protocol
– Communicate by running a process on each host
– Epidemic protocol used
• Security through certificates
– Digital signatures used to allow or deny access to
data, operations, etc.
Zone hierarchy makes Astrolabe
scalable
• A zone is
– A host or a set of nonoverlapping zones (no hosts in common)
• Tree structure
– Leaves are hosts
– Each zone (except root) has a local zone identifier
– Each zone has an attribute list (MIB)
– Attributes are generated by aggregation functions, which summarize the children’s attributes
– Leaf zones have writable virtual child zones used to populate attributes for that zone
[Figure: zone tree with hosts as leaves and a MIB attached to each zone]
Aggregate functions are used to
query the tree
• Aggregate functions summarize and are bounded in size
• Aggregate functions are programmable
• Code embedded in timestamped aggregate function certificates (AFCs)
• AFCs stored as attributes in MIBs
• For every zone an agent is in, it scans hosts looking for children’s attributes, then aggregates results
• Zones learn about other zones through gossip protocol
• Applications invoke Astrolabe through calls to library
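Bottom-up aggregation over the zone tree can be sketched as follows. Astrolabe expresses aggregation as SQL queries carried in AFCs; this illustration substitutes plain Python callables, and the Zone class, the aggregate function, and the attribute names in the usage note are hypothetical.

```python
class Zone:
    """A zone is either a host (leaf) or a set of child zones."""
    def __init__(self, name, children=(), attributes=None):
        self.name = name
        self.children = list(children)
        self.mib = dict(attributes or {})   # attribute list for this zone

def aggregate(zone, functions):
    """Recompute each internal zone's MIB from its children's MIBs, bottom up."""
    if not zone.children:
        return zone.mib                     # leaf: attributes are written locally
    child_mibs = [aggregate(c, functions) for c in zone.children]
    zone.mib = {}
    for attr, fn in functions.items():
        values = [m[attr] for m in child_mibs if attr in m]
        if values:                          # skip attributes no child reports
            zone.mib[attr] = fn(values)
    return zone.mib
```

For example, aggregate(root, {"load": min, "hosts": sum}) leaves every internal zone with one "load" and one "hosts" value regardless of how many hosts sit below it, which is the bounded-size property that keeps the hierarchy scalable.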
Agents on each host maintain a
database of the zone hierarchy
• Astrolabe agent runs on each host
• Each agent stores a subset of MIBs in the Astrolabe tree
– A copy of root MIB
– A copy of all MIBs of the root’s children
• For each level a list of child zones (and attributes) is kept along with which child represents its own zone
[Figure: an agent’s view of the zone tree. Labels: Asia, Europe, USA; Cornell, MIT; pc1 to pc4; system, inventory, monitor; “self” marks the agent’s own zone at each level]
Gossiping is an epidemic protocol
used to propagate information
[Figure: gossip among pc1, pc2, pc3, and pc4 in zone Cornell; each host has virtual child zones system, inventory, and monitor]
• Periodically, an agent selects a zone in which to gossip
• Agent picks some child at random (other than its own) within that zone to gossip with
• Agent sends chosen child the id, rep, and issued attributes of all MIBs of all children at that level and up to the root
• Recipient can then tell which entries are out of date
• Updates are passed back and forth
• Note: timestamps can be compared only if the attribute is issued by the same rep
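The exchange can be sketched with a digest of (id, rep, issued) triples. This is a simplified illustration, assuming each agent’s table is a dict keyed by (zone id, rep) whose values are MIB dicts with an "issued" field; the function names are hypothetical and the choice of zone and partner is left out.

```python
def digest(mibs):
    """Summarize a MIB table as {(zone_id, rep): issued}."""
    return {key: mib["issued"] for key, mib in mibs.items()}

def newer_than(mibs, remote_digest):
    """MIBs we hold that the remote side lacks or holds in an older version.

    Keys include the rep, so issued times are only compared for the same rep.
    """
    return {key: mib for key, mib in mibs.items()
            if key not in remote_digest or remote_digest[key] < mib["issued"]}

def gossip(a, b):
    """One exchange between two agents' MIB tables a and b."""
    to_a = newer_than(b, digest(a))     # what a's digest shows a is missing
    to_b = newer_than(a, digest(b))     # what b's digest shows b is missing
    a.update(to_a)
    b.update(to_b)
```

Only the digests and the out-of-date entries cross the network, so the cost of an exchange depends on how much has changed rather than on the size of the tables.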
Astrolabe allows members to be
added or removed
Member removal
• Each MIB knows which rep (agent) created it and when it was last updated
• When an agent has not seen an update for some zone from a rep for time Tfail, the MIB is removed
• When the last MIB for a zone is removed, the zone is also removed
Member integration
• IP multicast sets up initial contact
• When two trees join, each tree multicasts a gossip message at a fixed rate
• Broadcasting gossip on local LAN is also used
• Astrolabe agents maintain a set of relatives who should be contacted on occasion
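The member removal rule amounts to a timeout sweep over the MIB table. A minimal sketch, assuming the table maps (zone id, rep) to the time of the last update seen from that rep; the function name expire and the parameter t_fail (standing for Tfail) are hypothetical.

```python
def expire(mibs, now, t_fail):
    """Drop MIBs not refreshed within t_fail; return zones with no MIBs left.

    mibs maps (zone_id, rep) -> time of the last update seen from that rep.
    """
    before = {zone for zone, _ in mibs}
    for key in [k for k, seen in mibs.items() if now - seen > t_fail]:
        del mibs[key]
    return before - {zone for zone, _ in mibs}
```

A zone disappears only when every rep that reported on it has timed out, so the failure of a single agent does not remove a zone that other agents still represent.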
Certificates are used to guarantee
security
• Each zone is allowed to override the security requirements of its parent zone
– Control zone creation, gossip rate, failure detection time-outs, introducing new AFCs, etc.
• Each zone has a Certificate Authority (CA) which issues certificates for that zone
– Zone certificate: binds zoneID to its public key
– MIB certificate: gossiped with zone certificate to propagate data between hosts
– Aggregate function certificate (AFC): contains code and other info for aggregation functions. Agent will only install AFCs issued by ancestor zones or by one of their clients.
– Client certificate: authenticates a client. Astrolabe agents do not maintain a client database for scalability. If an ancestor signs the client certificate with its CA key, the client is trusted.
An AFC is introduced into the
system through the virtual children
[Figure: zone Cornell with hosts pc1 to pc4, each with virtual child zones system, inventory, and monitor; one element is labelled “new”]
• AFCs can be introduced by adding an attribute to the virtual child zone
• The agent will automatically evaluate the attribute
• AFCs can propagate by copying into the parent MIBs until they reach the root
• Adoption is used to propagate back down the tree
– Agents scan ancestor for new attributes
– New AFCs automatically copied
• For garbage collection, an expiration time can be specified
An AFC must meet certain security
requirements to propagate
[Figure: zone Cornell with hosts pc1 to pc4, each with virtual child zones system, inventory, and monitor; one element is labelled “new”]
• AFC must be signed by ancestor zone, or a client of that zone
– A client must have permission to propagate
• AFC cannot have expired
• The name of the AFC attribute and the category attribute must match
– Prevents a malicious client from introducing an AFC for a purpose other than advertised
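The three checks can be combined into one predicate. This is a simplified sketch: signature verification is reduced to membership in a set of trusted issuers, and the AFC fields and the function name may_propagate are hypothetical rather than Astrolabe’s actual certificate layout.

```python
from dataclasses import dataclass

@dataclass
class AFC:
    attribute_name: str   # name of the MIB attribute holding the certificate
    category: str         # category declared inside the certificate
    issuer: str           # zone (or client) that signed it
    expires: float        # expiration time

def may_propagate(afc, now, trusted_issuers):
    """Apply the three checks from the slide; signature verification is
    abstracted as membership in trusted_issuers."""
    return (afc.issuer in trusted_issuers
            and now < afc.expires
            and afc.attribute_name == afc.category)
```

An agent would evaluate this before installing an AFC or copying it into a parent MIB, with trusted_issuers holding its ancestor zones and the clients those zones have authorized.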
Experiments demonstrate
Astrolabe’s scalability
[Figure: Average Number of Rounds Needed to Infect All Participants with Different Branching Factors. Expected number of rounds (0 to 30) against number of members; series: bf = 5, bf = 25, bf = 125, flat]
[Figure: Number of Rounds Needed to Infect All Participants at Different Loss Rates. Expected number of rounds (0 to 30) against number of members; series: loss = .00, loss = .05, loss = .10, loss = .15]
• Branch factor increases
– A higher branching factor leads to larger messages and more
traffic
– Astrolabe remains scalable even with a high branch factor
• Loss rates
– A higher loss rate does not seriously affect scalability
– Due to the randomization algorithm
Conclusion
• Scalability through hierarchy
– Zones enable scalability
• Flexibility through mobile code
– AFCs can be generated by one agent and then propagated throughout the system to learn the attributes on a variety of hosts
• Robustness through a randomized p2p protocol
– Zones select other zones at random and propagate MIB of least
common ancestor
– Guarantees changes will eventually reach the entire system
• Security through certificates
– Certificates authenticate every level of communication
• Conclusion: Astrolabe is a scalable, robust system which
allows changes to propagate quickly and guarantees
eventual consistency
Backup
Astrolabe improves upon several
previous systems