Modeling and Analyzing Fault-Tolerant Real


Theory of Distributed Systems
(TDS)
Group Leaders: Nancy Lynch, Idit Keidar
PhD students: Victor Luchangco, Josh Tauber, Roger
Khazan, Carl Livadas, Rui Fan
MEng and UROP students: Laura Dean, Andrej
Bogdanov, Michael Tsai, Chris Luhrs
Affiliates: Alex Shvartsman (U. Conn.), Alan Fekete (U.
Sydney), Paul Attie (Northeastern), Ziv Bar-Joseph
1
Research Overview
Math Models,
Proof and Analysis Methods
Building blocks
Project (AFOSR)
Algorithms
Applications
2
Math Models,
Proof and Analysis Methods
• I/O Automata
– Mathematical model for asynchronously interacting system
components.
• Timed I/O Automata, Hybrid I/O Automata [Lynch,
Segala, Vaandrager], Probabilistic [Segala],
Dynamic [Lynch, Attie]
• Invariant assertions
• Levels of abstraction (simulation relations)
• Compositional reasoning
• Conditional performance analysis
• Object inheritance [Keidar, Khazan, Shvartsman]
3
IOA Language and Toolset
• Language design [Garland, Lynch]
• Front end [Garland]
• Simulator [Chefter, Ramirez, Dean]
• Code generator [Tauber, Tsai]
• Theorem prover [Garland, Fan, Bogdanov]
• Test examples [Luhrs, Dean, Rosen, Karlovich]
4
Algorithms
• Communication [Keidar, Khazan, Livadas, Bar-Joseph,
Lynch]
• Distributed data management [Keidar, Shvartsman, De
Prisco, Lynch]
• Consensus [Bar-Joseph, Keidar]
• Load balancing [Shvartsman, Khazan, Fekete]
• Lower bounds, impossibility results
• Distributed algorithms => Distributed systems
– Generating code from models [Tauber, Tsai]
5
Applications
• Multiprocessor memories [Luchangco]
• Hybrid systems
– Automated transportation: Planes (TCAS), Trains (Raytheon
PRT), and Automobiles (California PATH) [Livadas]
– Mobile computing
• Middleware (building blocks) for fault-tolerant
distributed computing [Keidar, Shvartsman, Fekete, De
Prisco, Khazan]
• Communication [Livadas, Bar-Joseph]
• Object-oriented programming [Fekete, Lynch]
• Agents [Attie, Luchangco, Lynch, Keidar]
6
7
Reliable Group Communication:
a Mathematical Approach
Nancy Lynch
Theory of Distributed Systems
MIT LCS
AFOSR Review
November 15, 2000
8
Dynamic Distributed Systems
• Modern distributed systems are dynamic.
• Set of clients participating in an application
changes, because of:
– Network, processor failure, recovery
– Changing client requirements
• To cope with changes:
– Use abstract groups of client processes
with changing membership sets.
– Processes communicate with group members by
sending messages to the group as a whole.
9
Group Communication Services
• Support management of groups
• Maintain membership info
• Manage communication
• Make guarantees about ordering, reliability of
message delivery, e.g.:
– Best-effort: IP Multicast
– Strong consistency guarantees: Isis, Transis, Ensemble
• Hide complexity of coping with changes
10
This Talk
• Describe
– Group communication systems
– A mathematical approach to designing, modeling,
analyzing GC systems.
– Our accomplishments and ideas for future work.
• Collaborators: Idit Keidar, Alan Fekete, Alex
Shvartsman, Roger Khazan, Roberto De Prisco, Jason
Hickey, Robert van Renesse, Carl Livadas,
Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy
11
Talk Outline
I. Background: Group Communication
II. Our Approach
III. Projects and Results
1. View Synchrony
2. Ensemble
3. Dynamic Views
4. Scalable Group Communication
IV. Conclusions
12
I. Background:
Group Communication
13
The Setting
• Dynamic distributed system,
changing set of participating
clients.
• Applications:
– Replicated databases, file systems
– Distributed interactive games
– Multi-media conferencing, collaborative work
– …
14
Groups
• Abstract, named groups of client processes,
changing membership.
• Client processes send messages to the group
(multicast).
• Early 80s: Group idea used in replicated data
management system designs
• Late 80s: Separate group communication
services.
15
Group Communication Service
• Communication middleware
• Manages group membership, current views
View = membership set + identifier
• Manages multicast communication
among group members
– Multicasts respect views
– Guarantees within each view:
• Reliability constraints
• Ordering constraints, e.g., FIFO from each sender, causal,
common total order
• Global service
16
Group Communication Service
[Diagram: GCS interface, with mcast, receive, and new-view actions between each client and the GCS]
17
Isis
[Birman, Joseph 87]
• Primary component group membership
• Several reliable multicast services, different ordering
guarantees, e.g.:
– Atomic Broadcast: Common total order, no gaps
– Causal Broadcast
• When partition is repaired, primary processes send state
information to rejoining processes.
• Virtually Synchronous message delivery
18
Example: Interactive Game
• Alice, Bob, Carol, Dan in view {A,B,C,D}
• Primary component membership
– {A}{B,C,D} split;
only {B,C,D} may continue.
• Atomic Broadcast
– A fires, B moves away;
need consistent order
19
Interactive Game
• Causal Broadcast
– C sees A enter a room; locks door.
• Virtual Synchrony
– {A}{BCD} split; B sees A shoot; so do C, D.
20
Applications
• Replicated data management
– State machine replication [Lamport 78], [Schneider 90]
– Atomic Broadcast provides support
– Same sequence of actions performed everywhere.
– Example: Interactive game state machine
• Stock market
• Air-traffic control
21
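The state machine replication idea above can be sketched in a few lines of Python. This is an illustrative toy, not code from any of the cited systems; names like `Replica` and `atomic_broadcast` are hypothetical, and the "broadcast" is simulated by a loop that delivers the same gap-free total order to every replica.

```python
# Toy sketch of state machine replication over Atomic Broadcast.
class Replica:
    def __init__(self):
        self.state = 0  # example state: a counter

    def apply(self, command):
        # Deterministic transition: identical command sequences
        # yield identical states at every replica.
        if command == "inc":
            self.state += 1
        elif command == "double":
            self.state *= 2

def atomic_broadcast(replicas, commands):
    # Atomic Broadcast guarantees every replica sees the same
    # total order of commands, with no gaps.
    for cmd in commands:
        for r in replicas:
            r.apply(cmd)

replicas = [Replica(), Replica(), Replica()]
atomic_broadcast(replicas, ["inc", "double", "inc"])
assert len({r.state for r in replicas}) == 1  # all replicas agree
```

Because every `apply` is deterministic and the order is common, the replicas cannot diverge; this is exactly why Atomic Broadcast "provides support" for replication.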
Transis
[Amir, Dolev, Kramer, Malkhi 92]
• Partitionable group membership
• When components merge, processes exchange state
information.
• Virtual synchrony reduces amount of data exchanged.
• Applications
– Highly available servers
– Collaborative computing, e.g. shared whiteboard
– Video, audio conferences
– Distributed jam sessions
– Replicated data management [Keidar, Dolev 96]
22
Other Systems
• Totem [Amir, Melliar-Smith, Moser, et al., 95]
– Transitional views, useful with virtual synchrony
• Horus [Birman, van Renesse, Maffeis 96]
• Ensemble [Birman, Hayden 97]
– Layered architecture
– Composable building blocks
• Phoenix, Consul, RMP, Newtop, RELACS,…
• Partitionable
23
Service Specifications
• Precise specifications needed for GC services
– Help application programmers write programs that use the
services correctly, effectively
– Help system maintainers make changes correctly
– Safety, performance, fault-tolerance
• But difficult:
– Many different services; different guarantees about
membership, reliability, ordering
– Complicated
– Specs based on implementations might not be optimal for
application programmers.
24
Early Work on GC Service Specs
• [Ricciardi 92]
• [Jahanian, Fakhouri, Rajkumar 93]
• [Moser, Amir, Melliar-Smith, Agrawal 94]
• [Babaoglu et al. 95, 98]
• [Friedman, van Renesse 95]
• [Hiltunen, Schlichting 95]
• [Dolev, Malkhi, Strong 96]
• [Cristian 96]
• [Neiger 96]
• Impossibility results [Chandra, Hadzilacos, et al. 96]
• But still difficult…
25
II. Our Approach
26
Approach
• Model everything:
– Applications
• Requirements, algorithms
– Service specifications
• Work backwards, see what
the applications need
– Implementations of the services
• State, prove correctness theorems:
– For applications, implementations.
– Methods: Composition, invariants, simulation relations
• Analyze performance, fault-tolerance.
• Layered proofs, analyses
27
Math Foundation: I/O Automata
• Nondeterministic state machines
• Not necessarily finite-state
• Input/output/internal actions (signature)
• Transitions, executions, traces
• System modularity:
– Composition, respecting traces
– Levels of abstraction, respecting traces
• Language-independent, math model
28
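To make the I/O automaton vocabulary concrete, here is a minimal sketch of a one-place channel automaton in Python. The class and action names are invented for this example; a real IOA model would be written in the IOA language, not Python.

```python
# Toy I/O automaton: a one-place message channel.
# send is an input action (always enabled, controlled by the
# environment); receive is an output action with a precondition.
class Channel:
    def __init__(self):
        self.buf = None                 # state

    def send(self, m):                  # input action: always enabled
        self.buf = m

    def receive_enabled(self):          # precondition of the output action
        return self.buf is not None

    def receive(self):                  # output action
        m, self.buf = self.buf, None
        return m

ch = Channel()
ch.send("hello")                        # environment performs the input
assert ch.receive_enabled()
assert ch.receive() == "hello"          # trace: send("hello"), receive("hello")
```

The trace of this execution, the sequence of external actions `send("hello"), receive("hello")`, is the object that composition and simulation relations are defined over.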
Modeling Style
• Describe interfaces, behavior
• Program-like behavior descriptions:
– Precondition/effect style
– Pseudocode or IOA language
• Abstract models for algorithms, services
• Model several levels of abstraction,
– High-level, global service specs
…
– Detailed distributed algorithms
29
Modeling Style
• Very nondeterministic:
– Constrain only what must be constrained.
– Simpler
– Allows alternative implementations
30
Describing Timing Features
• TIOAs [Lynch, Vaandrager 93]
– For describing:
• Timeout-based algorithms.
• Clocks, clock synchronization
• Performance properties
31
Describing Failures
• Basic or timed I/O automata, with fail, recover input
actions.
• Included in traces, can use them in specs.
32
Using I/O Automata for Group
Communication Systems
• Use for global services + distributed algorithms
• Define safety properties separately from
performance/fault-tolerance properties.
– Safety:
• Basic I/O automata; trace properties
– Performance/fault-tolerance:
• Timed I/O automata with failure actions; timed
trace properties
33
III. Projects and Results
34
Projects
1. View Synchrony
2. Ensemble
3. Dynamic Views
4. Scalable Group Communication
35
1. View Synchrony (VS)
[Fekete, Lynch, Shvartsman 97, 00]
Goals:
• Develop prototypes:
– Specifications for typical GC services
– Descriptions for typical GC algorithms
– Correctness proofs
– Performance analyses
• Design simple math foundation for the area.
• Try out, evaluate our approach.
36
View Synchrony
What we did:
• Talked with system developers (Isis, Transis)
• Defined I/O automaton models for:
– VS, prototype partitionable GC service
– TO, non-view-oriented totally ordered bcast service
– VStoTO, application algorithm based on
[Amir, Dolev, Keidar, Melliar-Smith, Moser]
• Proved correctness
• Analyzed performance/fault-tolerance.
37
VStoTO Architecture
[Diagram: clients call bcast/brcv on TO, which is implemented by VStoTO processes using the gpsnd/gprcv/newview actions of VS]
38
TO Broadcast Specification
Delivers messages to everyone, in the same order.
Safety: TO-Machine
Signature:
input: bcast(a,p)
output: brcv(a,p,q)
internal: to-order(a,p)
State:
queue, sequence of (a,p), initially empty
for each p:
pending[p], sequence of a, initially empty
next[p], positive integer, initially 1
39
TO-Machine
Transitions:
bcast(a,p)
Effect:
append a to pending[p]
to-order(a,p)
Precondition:
a is head of pending[p]
Effect:
remove head of pending[p]
append (a,p) to queue
brcv(a,p,q)
Precondition:
queue[next[q]] = (a,p)
Effect:
next[q] := next[q] + 1
40
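The TO-Machine transitions above translate almost directly into executable Python. This is a sketch of the spec, not production code; the `to_order` step here always orders the head of `pending[p]`, matching the precondition, and `queue` is indexed from 1 as in the spec.

```python
# Executable rendering of the TO-Machine IOA pseudocode (a sketch).
from collections import defaultdict

class TOMachine:
    def __init__(self):
        self.queue = []                     # global total order of (a, p)
        self.pending = defaultdict(list)    # per-sender unordered messages
        self.next = defaultdict(lambda: 1)  # per-receiver cursor into queue

    def bcast(self, a, p):                  # input action
        self.pending[p].append(a)

    def to_order(self, p):                  # internal action
        a = self.pending[p].pop(0)          # precondition: pending[p] nonempty
        self.queue.append((a, p))

    def brcv(self, q):                      # output action, deliver next to q
        a, p = self.queue[self.next[q] - 1] # precondition: queue[next[q]] defined
        self.next[q] += 1
        return (a, p)

to = TOMachine()
to.bcast("m1", "p"); to.bcast("m2", "p")
to.to_order("p"); to.to_order("p")
assert to.brcv("q") == ("m1", "p")          # same order at every receiver
assert to.brcv("r") == ("m1", "p")
```

Note how the spec's nondeterminism survives: nothing forces `to_order` or `brcv` to run at any particular time, which is why the performance guarantee is stated separately as TO-Property.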
Performance/Fault-Tolerance
TO-Property(b,d,C):
If C stabilizes, then within time b, a point is reached
after which any message sent or received anywhere in
C is received everywhere in C, within time d.
41
VS Specification
• Partitionable view-oriented service
• Safety: VS-Machine
– Views presented in consistent order, possible gaps
– Messages respect views
– Messages in consistent order
– Causality
– Prefix property
– Safe indication
• Doesn’t guarantee Virtual Synchrony
• Like TO-Machine, but per view
42
Performance/Fault-Tolerance
VS-Property(b,d,C):
If C stabilizes, then within time b, a point is reached
where views known within C are consistent, and
messages sent in the final view v are delivered
everywhere in C, within time d.
43
VStoTO Algorithm
• TO must deliver messages in order, no gaps.
• VS delivers messages in order per view.
• Problems arise from view changes:
– Processes moving between views could have different prefixes.
– Processes could skip views.
• Algorithm:
– Real work done in majority views only
– Processes in majority views totally order messages, and deliver
to clients messages that VS has said are safe.
– At start of new view, processes exchange state, to reconcile
progress made in different majority views.
44
Correctness (Safety) Proof
• Show composition of VS-Machine and VStoTO
machines implements TO-Machine.
• Trace inclusion
• Use simulation relation proof:
– Relate start states, steps of composition
to those of TO-Machine
– Invariants, e.g.:
Once a message is ordered everywhere in some
majority view, its order is determined forever.
• Checked using PVS theorem-prover, TAME [Archer]
45
Performance Analysis
• Assume VS satisfies VS-Property(b,d,C):
– If C stabilizes, then within time b, views known
within C become consistent, and messages sent in
the final view are delivered everywhere in C, within
time d.
• And VStoTO satisfies:
– Simple timing and fault-tolerance assumptions.
• Then TO satisfies TO-Property(b+d,d,C):
– If C stabilizes, then within time b+d, any message
sent or delivered anywhere in C is delivered
everywhere in C, within time d.
46
Conclusions: VS
• Models for VS, TO, VStoTO
• Proofs, performance analysis
• Tractable, understandable, modular
• [PODC 97], [TOCS 00]
• Follow-on work:
– Algorithm for VS [Fekete, Lesley]
– Load balancing using VS [Khazan]
– Models for other Transis algorithms [Chockler]
• But: VS is only a prototype; lacks some key features, like
Virtual Synchrony
• Next: Try a real system!
47
2. Ensemble
[Hickey, Lynch, van Renesse 99]
Goals:
• Try, evaluate our approach on a real system
• Develop techniques for modeling, verifying,
analyzing more features of GC systems,
including Virtual Synchrony
• Improve on prior system validation methods
48
Ensemble
• Ensemble system [Birman, Hayden 97]
– Virtual Synchrony
– Layered design, building blocks
– Coded in ML [Hayden]
• Prior verification work for
Ensemble and predecessors:
– Proving local properties
using Nuprl [Hickey]
– [Ricciardi], [Friedman]
49
Ensemble
• What we did:
– Worked with developers
– Followed VS example
– Developed global specs for key layers:
• Virtual Synchrony
• Total Order with Virtual Synchrony
– Modeled Ensemble algorithm spanning between layers
– Attempted proof; found logical error in state exchange
algorithm (repaired)
– Developed models, proofs for repaired system
50
Conclusions: Ensemble
• Models for two layers, algorithm
• Tractable, easily understandable by developers
• Error, proofs
• Low-level models similar to actual ML code (4 to 1)
• [TACAS 99]
• Follow-on:
– Same error found in Horus.
– Incremental models, proofs [Hickey]
• Next: Use our approach to design new services.
51
3. Dynamic Views
[De Prisco, Fekete, Lynch, Shvartsman 98]
Goals:
• Define GC services that cope with both:
– Long-term changes:
• Permanent failures, new joins
• Changes in the “universe” of processes
– Transient changes
• Use these to design consistent total order and
consistent replicated data algorithms that
tolerate both long-term and transient changes.
52
Dynamic Views
• Many applications with strong consistency
requirements make progress only in primary views:
– Consistent replicated data management
– Totally ordered broadcast
• Can use static notion of allowable primaries, e.g.,
majorities of universe, quorums
– All intersect.
– Only one exists at a time.
– Information flows from each to the next.
• But: Static notion not good for
long-term changes
53
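The "all intersect" property of static majorities is easy to check mechanically. The following sketch (with an illustrative five-process universe) enumerates every majority and verifies pairwise intersection, which is what guarantees at most one primary at a time.

```python
# Static majority quorums over a fixed universe: any two majorities
# must share a process, since 2 * majority_size > |universe|.
from itertools import combinations

universe = {"A", "B", "C", "D", "E"}
majority_size = len(universe) // 2 + 1          # 3 of 5

majorities = [set(c) for c in combinations(universe, majority_size)]

# Every pair of majorities intersects.
assert all(q1 & q2 for q1, q2 in combinations(majorities, 2))
```

This is the static notion the slide describes; the dynamic-view problem on the next slides arises precisely because successive primaries drawn from a changing universe need not satisfy this check.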
Dynamic Views
• For long-term changes, want dynamic notion of
allowable primaries.
• E.g., each primary might contain majority of previous:
• But: Some might not intersect.
Makes it hard to maintain consistency.
54
Dynamic Views
• Key problem:
– Processes may have different opinions about which
is the previous primary
– Could even be disjoint.
• [Yeger-Lotem, Keidar, Dolev 97] algorithm
– Keeps track of all possible previous primaries.
– Ensures intersection with all of them.
55
Dynamic Views
What we did:
• Defined Dynamic View Service, DVS, based on [YKD]
• Designed to tolerate long-term failures
• Membership:
– Views delivered in consistent order, with possible gaps.
– Ensures new primary intersects all possible previous primaries.
• Communication: Similar to VS
– Messages delivered within views,
– Prefix property, safe notifications.
56
Dynamic Views
• What we did:
– Modeled, proved implementing algorithm
– Modeled, proved TO-Broadcast application
– Distributed implementation [Ingols 00]
57
Handling Transient Failures:
Dynamic Configurations
• Configuration = Set of processes plus structure,
e.g., set of quorums, leader,…
• Application: Highly available consistent
replicated data management:
– Paxos [Lamport], uses leader, quorums
– [Attiya, Bar-Noy, Dolev], uses read quorums and
write quorums
• Quorums allow flexibility, availability in the
face of transient failures.
58
Dynamic Configurations
[De Prisco, Fekete, Lynch, Shvartsman 99, 00]
• Combine ideas/benefits of
– Dynamic views, for long-term failures, and
– Static configurations, for transient failures
• Idea:
– Allow configuration to change (reconfiguration).
– Each configuration satisfies intersection properties with
previous configuration
• Example:
– Config = (membership set, read quorums, write quorums)
– Membership set of new configuration contains read quorum
and write quorum of previous configuration
59
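The reconfiguration condition in the example can be sketched as a simple check: the new configuration's membership must contain at least one read quorum and one write quorum of the previous configuration. The function and field names below are illustrative; the DCS papers state the precise intersection properties.

```python
# Sketch of the dynamic-configuration intersection check.
def valid_reconfiguration(prev, new_members):
    # New membership must carry over a full read quorum and a
    # full write quorum of the previous configuration.
    has_read = any(q <= new_members for q in prev["read_quorums"])
    has_write = any(q <= new_members for q in prev["write_quorums"])
    return has_read and has_write

prev = {
    "members": {"A", "B", "C", "D"},
    "read_quorums": [{"A", "B"}, {"C", "D"}],
    "write_quorums": [{"A", "B", "C"}, {"B", "C", "D"}],
}

assert valid_reconfiguration(prev, {"B", "C", "D", "E"})  # keeps {C,D}, {B,C,D}
assert not valid_reconfiguration(prev, {"A", "E"})        # no full quorum survives
```

Carrying a read quorum and a write quorum forward is what lets state written under the old configuration be recovered in the new one.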
Dynamic Configurations
What we did:
• Defined dynamic configuration service DCS,
guaranteeing intersection properties w.r.t. all possible
previous configurations.
• Designed implementing algorithm, extending [YKD]
• Developed application: Replicated data
– Dynamic version of [Paxos]
– Dynamic version of [Attiya, Bar-Noy, Dolev]
– Tolerate
• Transient failures, using quorums
• Longer-term failures, using reconfiguration
60
Conclusions: Dynamic Views
• New DVS, DCS services for long-term changes in set of
processes
• Applications, implementations
• Decomposed complex algorithms into tractable pieces:
– Service specification, implementation, application
– Static algorithm vs. reconfiguration
• Couldn’t have done it without the formal framework.
• [PODC 98], [DISC 99]
61
4. Scalable Group Communication
[Keidar, Khazan 99],
[Keidar, Khazan, Lynch, Shvartsman 00]
Goal:
• Make GC work in wide area networks
What we did:
• Defined desired properties for GC services
• Defined spec for scalable group membership service
[Keidar, Sussman, Marzullo, Dolev 00],
implemented on small set of membership servers
62
Scalable Group Communication
What we did:
• Developed new, scalable GC algorithms:
– Use scalable GM service
– Multicast implemented on clients
– Efficient: Algorithm for virtual synchrony uses only one round
for state exchange, in parallel with membership service’s
agreement on views.
– Processes can join during reconfiguration.
• Distributed implementation [Tarashchanskiy]
63
Scalable GC
What we did:
• Developed new incremental modeling, proof methods
[Keidar, Khazan, Lynch, Shvartsman 00]
– Proof Extension Theorem
• Developed models, proofs (safety and liveness), using
the new methods.
64
Conclusions: Scalable GC
• Specs, new algorithms, proofs
• New incremental proof methods
• Couldn’t have done it without the formal framework.
• [ICDCS 99], [ICSE 00]
65
IV. Conclusions
66
Practical GC Systems: Current Status
[Birman 99]
• Commercial successes:
– Stock exchange (Zurich, New York)
– Air-traffic control (France)
• Problems:
– Performance, for strong guarantees like Virtual Synchrony
– Not integrated with OO programming technologies.
• Trends:
– Flexible services
– Weaker guarantees; better performance
– Integration with OO technologies
67
Summary
• GC services help in programming dynamic
distributed systems, though scalability, integration
problems remain.
• Our contributions:
– Modeling style: Automata + performance properties
– Techniques: Conditional performance analysis, incremental
modeling/proof
– Models, proofs for key services
– Discovered errors
– New services: Dynamic views, scalable GC
• Mathematical framework makes it possible to design
more complex systems correctly.
68
Conditional Performance Analysis
• Make conditional claims about system behavior, under
various assumptions about behavior of environment,
network.
• Include timing, performance, failures.
• Benefits:
– Formal performance predictions
– Says when system makes specific guarantees
• Normal case + failure cases
• Parameters, sensitivity analysis
– Composable
– Get probabilistic claims as corollaries
69
CP Analysis: Typical Hypotheses
• Stabilization of underlying network.
• Limited rate of change.
• Bounds on message delay.
• Limited amount of failure (number, density).
• Limited input arrivals (number, density).
• Method allows focus on tractable cases.
70
Future Work
• Consider more dynamic settings:
– Join, leave requests
– Processor failures: Volatile memory, Byzantine behavior
– Communication failures
– Changing network topology
– Mobility
• Design, model, analyze more services, applications
– Communication, data sharing services
– Other services
• Lower bounds, impossibility results
• Math foundations
71
Math Foundations
• Models:
– General models for timing behavior, hybrid
continuous/discrete behavior, failures, probabilistic
behavior, process creation,…
– Models for mobility (based on Hybrid I/O Automata)
– Combined models
• Proof and performance analysis methods:
– Measures
– Conditional performance analysis
– Incremental modeling, proof
72
Communication: View-oriented GC
• Scalable GC
– Experiments
– CP analysis
– Compare predicted, observed performance
– Applications
• Dynamic views, dynamic configurations
– Experiments, Analysis, …
• Self-stabilization
• Lower bounds, impossibility results
73
Scalable Reliable Multicast
[Floyd, Jacobson, et al. 95]
• IP Multicast + retransmission protocol
• Modeling, analysis [Livadas, Keidar, Lynch]
[Diagram: SRM layered over IPMcast]
74
Other communication services
• Bimodal Multicast [Birman, Hayden, et al. 99]
• Compare GC services with SRM, Bimodal Multicast
• Study tradeoffs between strength of ordering and
reliability guarantees vs. performance
• Multicast with QoS [Bar-Joseph, Keidar, Anker, Lynch 00]
– Models for:
• Bandwidth reservation service
• TO Multicast service with QoS (latency, bandwidth)
– Algorithms implementing TO-QoS using reservation service
75
Data Sharing
• Coherent
– Analyze, compare implementations
– Lower bounds, impossibility results
• Less coherent
• Tradeoffs between coherence guarantees and
performance
76
Other services for dynamic systems:
• Registry/location services
• Leader election
• Consensus
• Resource allocation
• Establishing spanning trees, overlay networks
• Load-balancing
• …
77
Mobility
• Mobile system models
– Ad hoc, cellular, mixed (Oxygen)
– Failure models
– Formalized using Hybrid I/O Automata
• Mobile system services
– Location services
– Group membership, group communication
– Data sharing
– Applications: Games, military operations
78