Modeling and Analyzing Fault-Tolerant Real

Download Report

Transcript Modeling and Analyzing Fault-Tolerant Real

Building Blocks for
High-Performance, Fault-Tolerant
Distributed Systems
Nancy Lynch
Theory of Distributed Systems
MIT
AFOSR project review
Cornell University
May, 2001
1
Project Participants
• Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman,
Steve Garland
• PhD students: Victor Luchangco, Roger Khazan, Carl
Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan
• MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy,
Andrej Bogdanov, Michael Tsai, Laura Dean
• Collaborators: Roberto De Prisco, Jeremy Sussman,
Keith Marzullo, Danny Dolev, Alan Fekete, Gregory
Chockler, Roman Vitenberg
2
Project Scope
• Define services to support high-performance
distributed computing in dynamic environments:
Failures, changing participants.
• Design algorithms to implement the services.
• Analyze algorithms: Correctness, performance, faulttolerance.
• Develop necessary mathematical foundations: State
machine models, analysis methods.
• Develop supporting languages and tools: IOA.
3
Talk Outline
I. View-oriented group communication services
II. Non-view-oriented group communication
III. Mathematical foundations
IV. IOA language and tools
V. Memory consistency models
VI. Plans
4
I. View-Oriented Group
Communication Services
5
View-Oriented Group Communication
Services
• Cope with changing participants using abstract groups
of client processes with changing membership sets.
• Processes communicate with group members by
sending messages to the group as a whole.
• GC services support management of groups:
– Maintain membership information, form views.
– Manage communication.
– Make guarantees about ordering,
reliability of message delivery.
• Isis, Transis, Totem, Ensemble,…
GC
6
Using View-Oriented GC Services
• Advantages:
– High-level programming abstraction
– Hides complexity of coping with changes
• Disadvantages:
– Can be costly, especially when forming new views.
– May have problems scaling to large networks.
• Applications:
– Managing replicated data
– Distributed interactive games
– Multi-media conferencing, collaborative work
7
Our Approach
• Mathematical, using state machines
(I/O automata)
Application
• Model everything:
– Applications
– Service specifications
– Implementations of the services
Service
• Prove correctness
• Analyze performance, fault-tolerance
Application
Algorithm
8
Our Earlier Work: VS
[Fekete, Lynch, Shvartsman 97, 01]
• Defined automaton models for:
– VS, partitionable GC service, based on Transis
– TO, non-view-oriented totally ordered bcast service
– VStoTO, algorithm based on
[Amir, Dolev, Keidar, Melliar-Smith, Moser]
• Proved correctness
• Analyzed performance,
bcast
fault-tolerance: conditional
performance analysis
brcv
TO
VStoTO
VStoTO
gprcv
newview
gpsnd
VS
9
Conditional Performance Analysis
• Assume VS satisfies:
– If a network component C stabilizes, then soon
thereafter, views known within C become consistent,
and messages sent in the final view are delivered
everywhere in C, within bounded time.
• And VStoTO satisfies:
– Simple timing and fault-tolerance assumptions.
• Then TO satisfies:
– If C stabilizes, then soon thereafter, any message
sent or delivered anywhere in C is delivered
everywhere in C, within bounded time.
10
Ensemble
[Hickey, Lynch, van Renesse 99]
• Ensemble system [Birman, Hayden],
layered design:
• Worked with developers, following VS.
• Developed global specs for key layers.
• Modeled Ensemble algorithm spanning between layers.
• Tried proof; found algorithmic error.
• Modeled, analyzed repaired system
• Same error found in Horus.
11
More Recent Progress
1. GC with unique primary views
2. Scalable GC
3. Optimistic Virtual Synchrony
4. GC service specifications
12
1. GC With Unique Primary Views
13
GC With Unique Primaries
• Dynamic View Service
[De Prisco, Fekete, Lynch, Shvartsman 98]
– Produces unique primary views
– Copes with long-term changes.
• Dynamic Configuration Service [DFLS 99]
– Adds quorums.
– Copes with long-term and transient changes.
• Dynamic Leader Configuration Service [D 99], [DL 01]
– Adds leaders.
14
GC With Unique Primaries
• Algorithms to implement the services
– Based on dynamic voting algorithm of
[YegerLotem, Keidar, Dolev 97].
– Each primary needs majority of all possible
previous primaries.
– Models, proofs,…
• Applications
– Consistent total order and consistent replicated data
algorithms that tolerate both long-term and
transient changes.
– Models, proofs,…
15
Availability of Unique Primary Algorithms
[Ingols, Keidar 01]
• Simulation study comparing unique primary algorithms:
– [Yeger-Lotem, Keidar, Dolev], [DFLS]
– 1-pending, like [Jajodia, Mutchler]
– Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev]
• Simulate repeated view changes, interrupting other view
changes.
• Availability shown to depend heavily on:
– Number of processes from previous view needed to form new view.
– Number of message rounds needed to form a view.
• [YKD], [DFLS] have highest availability.
16
2. Scalable Group Communication
[Keidar, Khazan 00]
17
Group Communication Service
• Manages group membership, current view.
• Multicast communication among group members,
with ordering, reliability guarantees.
• Virtual Synchrony [Birman, Joseph 87]
– Integrates group membership and group communication.
– Processes that move together from one view to another deliver
the same messages in the first view.
– Useful for replicated data management.
– Before announcing new view, processes must synchronize,
exchange messages.
18
Example: Virtual Synchrony
i
j
3: i,j,k
k
3: i,j,k
3: i,j,k
mcast(m)
rcv(m)
4: i, j
rcv(m)
4: i, j
VS algorithm supplies missing m19
Group Communication in WANs
• Difficulties:
– High message latency, message exchanges are expensive
– Frequent connectivity changes
• New, scalable GC algorithm:
– Uses scalable GM service of [Keidar, Sussman, et al. 00],
implemented on a small set of membership servers.
– GC (with virtual synchrony) implemented on clients.
VS
Net
VSGC
GM
20
Group Communication in WANs
• Try to minimize time from when network stabilizes
until GC delivers new views to clients.
• After stabilization: GM forms view, VSGC algorithm
synchronizes.
Net event
VSGC Algorithm
GM Algorithm
view(v)
• Existing systems (LANs):
– GM, VSGC uses several message exchange rounds
– Continue in spite of new network events
• Inappropriate for WANs
21
New Algorithm
• VSGC uses one message exchange round, in parallel with
GM’s agreement on views.
• GM usually delivers views in one message exchange.
Net event
VSGC Algorithm
GM Algorithm
view(v)
• Responds to new network events during reconfiguration:
– GM produces new membership sets
– VSGC responds to membership changes
• Distributed implementation [Tarashchanskiy 00]
22
Correctness Proofs
• Models, proofs (safety and liveness)
• Developed new incremental modeling, proof methods
[Keidar, Khazan, Lynch, Shvartsman 00]
• Proof Extension Theorem:
S
S’
A
A’
• Used new methods for the safety proofs.
23
Performance Analysis
• Analyze time from when network stabilizes until GC
delivers new views to clients.
• System is a composition:
– Network service, GM services, VSGC processes
• Compositional analysis:
– Analyze the VSGC algorithm alone, in terms of its inputs and
timing assumptions.
– State reasonable performance guarantees for GM, Network.
– Combine to get conditional performance properties for the
system as a whole.
24
Analysis of VSGC algorithm
• Assume component C stabilizes:
– GM delivers same views to VSGC processes
– Net provides reliable communication with latency .
• Let
– T[start], T[view] be times of last GM events for C
–  be upper bound on local step time.
• Then VSGC outputs new views by time
max (T[start] +  + x, T[view]) + 
25
Analysis of VSGC Algorithm
 + x
VS Algorithm
view(v)
Net Event
start
start
view(v)
GM algorithm
T[start]
T[view]
26
Assumed Bounds for GM
T[start]
start
T[view]
start

GM
view(v)

• Bounds for “Fast Path” of [Keidar, et al. 00],
observed empirically in almost all cases.
27
Combining VSGC and GM Bounds
• Bounds for system, conditional on GM bounds.
 + x
VSGC
start
view(v)
start
view(v)
T[start]

T[view]
GM

28
3. Optimistic Virtual Synchrony
[Sussman, Keidar, Marzullo 00]
• Most GC algorithms block sending during
reconfiguration.
• OVS service provides:
– Optimistic view proposal, before reconfiguration.
– Optimistic sends after proposal, during reconfiguration.
– Deliveries of optimistic messages in next view, subject to
application policy.
• Useful for applications:
– Replicated data management
– State transfer
– Sending vectors of data
29
4. GC Service Specifications
[Chockler, Keidar, Vitenberg 01]
• Comprehensive set of specifications for properties
guaranteed by GC services.
• Unifying framework.
• Safety properties
–
–
–
–
Membership: View order, partitionable, primary component
Multicast: Sending view delivery, virtual synchrony
Safe notifications
Ordering, reliability: FIFO, causal, totally ordered, atomic
• Liveness properties
– For eventually stable components: View stability, multicast
delivery, safe notification liveness
– For eventually stable pairs
30
II. Non-View-Oriented Group
Communication
1. Totally Ordered Multicast with QoS
[Bar-Joseph, Keidar, Anker, Lynch 00, 01]
31
Totally Ordered Multicast with QoS
• Multicast to dynamic group, subject to joins, leaves,
and failures.
• Global total ordering of messages
• QoS: Message delivery latency
• Built on reliable network with latency guarantees
• Add ordering guarantees, preserve latency bounds.
• Applications
– State machine replication
– Distributed games
– Shared editing
32
Two Algorithms
• Algorithm 1: Basic Totally Ordered Multicast
– Sends, receives consistent with total ordering of messages.
– Non-failing processes agree on messages from non-failing
processes.
– Latency: Constant, even with joins, leaves, failures.
• Algorithm 2: Atomic Multicast
– Non-failing processes agree on all messages.
– Latency:
• Joins, leaves only: Constant
fail_i
• With failures: Linear in f
TOM
Net
fail_j
33
Local Node Process
rcv(m)
join
leave
mcast(m)
FrontEnd_i
joiners(s,J),
leavers(s,J)
end-slot(s)
members(s,J)
Ord_i
Memb_i
mcast(m)
join
leave
mcast(join)
mcast(leave)
progress(s,j)
Sniffer_i
rcv(m)
Net
34
Local Algorithm Operation
• FrontEnd divides time into slots, tags messages with slots.
• Ord delivers messages by slot, in order of process indices.
• Memb determines slot membership.
– Join, leave messages
– Failures:
• Algorithm 1 uses local failure detector.
• Algorithm 2 uses consensus on failures.
– Requires new dynamic version of consensus.
• Timing-dependent
35
Architecture for Algorithm 2
TO-QoS
Net
GM
36
2. Scalable Reliable Multicast Services
[Livadas 01]
37
SRM [Floyd, et al.]
•
•
•
•
Reliable multicast to dynamic group.
Built over IP multicast
Based on requests (NACKs) and retransmissions
Limits duplicate requests/retransmissions using:
– Deterministic suppression: Ancestors suppress
descendants, by scheduling requests/replies based on
distance to source.
– Probabilistic suppression: Siblings suppress each other,
by spreading out requests/replies.
38
SRM Architecture
SRM
IPMcast
39
New Protocol
•
•
•
•
Inspired by SRM
Assume future losses occur on same link (locality).
Uses deterministic suppression for siblings
Elects, caches best requestor and retransmitter
– Chooses requestor closest to source.
– Chooses retransmitter closest to requestor.
– Break ties with processor ids.
40
Best Requestor and Retransmitter
S
Retransmitter
Requestor
41
Performance Analysis
• Metrics:
– Loss recovery latency: Time from detection of packet loss to
receipt of first retransmission
– Loss recovery overhead: Number of messages multicast to
recover from a message loss
• Protocol performance benefits:
– Removes delays caused by probabilistic suppression
– Following election of requestor and retransmitter:
• Reduces latency by using best requestor and retransmitter.
• Reduces overhead by using single requestor and
retransmitter.
42
III. Mathematical Foundations
• Incremental modeling and proof methods
Khazan, Lynch, Shvartsman 00]
[Keidar,
– Proof Extension Theorem
– Arose in Scalable GC work [Keidar, Khazan 00]
• Hybrid Input/Output Automata
[Lynch, Segala, Vaandrager 01]
– Model for continuous and discrete system behavior
– Useful for mobile computing?
• Conditional performance analysis methods
– For analyzing communication protocols
– AFOSR MURI project (Berkeley)
43
IV. IOA Language and Tools
I
O A
44
IOA Language and Tools
• Language for describing I/O automata:
Garland, Lynch
– Use to describe services and algorithms.
• Front end: Garland
I
O A
– Translates to Java objects
– Completely rewritten this year.
– Still needs support for composition.
• Theorem-prover connection: Garland, Bogdanov
– Connection with LP
– Seeking connections: SAL, Isabelle, STeP, NuPRL
45
IOA Language and Tools
• Simulator: Chefter, Ramirez, Dean
– Has support for paired simulation.
– Needs additions.
– Being instrumented for invariant discovery using
Ernst’s Daikon tool
• Code generator: Tauber, Tsai
– Local code-gen (translation to Java) running.
– Needs composition, communication service calls,
correctness proof.
• Challenge examples
46
V. Multiprocessor Memory Models
[Luchangco 01]
47
Memory Models
• Establishes a general mathematical framework for
specifying and reasoning about multiprocessor
memories and the programs that use them.
P1
P2
Pn
read
write
Memory
• Also applies to distributed shared memory.
48
Memory Models
• Sequentially consistent memory:
– Operations appear to happen in some sequential order.
– Read operation returns latest value written to the location.
• Processor consistent memory:
– Reads overtake writes to other locations.
– SPARC TSO, IBM 370
• Coherent memory
• Memory with synchronization commands:
– Fences, barriers, acquire/release,…
– Release consistency, weak ordering, locking
• Transactional memory
49
Programming restrictions:
• Data-race-free (for use with weak ordering)
• Properly labelled (for use with release
consistency)
• Two-phase locking
50
Formal Modeling Framework
read(x)
• Computation DAG
– Partial order model for
individual executions
– Describes dependencies
between operations
– Doesn’t model entire
programs.
write(y,1)
read(y)
write(x,3)
write(x,2)
read(x)
• Memory = set of computations with return values
51
Results
Results I: Uses the framework to model, classify,
compare memories and programming disciplines.
Results II: Programming discipline + memory model 
stronger memory model:
– Completely race-free + any memory  sequential consistency
– Race-free under locking + locking memory  seq. consistency
– 2-phase locking + locking memory  serializable transactions
Results III: Extend results to automaton models for
programs and memory.
52
VI. Plans
• Finish Scalable GC, TO Mcast with QoS, SRM.
• More dynamic services:
– Resource allocation, consensus, communication, distributed
data management, location services,…
– Services for mobile computing systems
– Theory: Algorithms and lower bounds
• Foundations:
– Hybrid automata, add control theory and probability
– Conditional performance analysis methods
• IOA:
– Solidify front end, simulator, theorem-prover connections
– Finish code generation.
53