EduPar-2015-v2 - Home | Georgia State University

Download Report

Transcript EduPar-2015-v2 - Home | Georgia State University

Fault-Tolerant Parallel and Distributed
Computing for Software Engineering
Undergraduates
Ali Ebnenasir and Jean Mayo
{aebnenas, jmayo}@mtu.edu
Department of Computer Science
Michigan Technological University
Houghton MI 49931, USA
Problem
Teaching Parallel and Distributed Computing (PDC)
is hard
Even more challenging when teaching how PDC
should be designed in the presence of faults
Objective
Educate undergraduate students on the design,
verification, and implementation of Fault-Tolerant
Parallel and Distributed Computing (FTPDC).
Outline
• Three courses
– Two core courses (geared towards design and
verification)
• Model-Driven Software Development (CS4710)
• Software Quality Assurance (CS4712)
– One elective (focused on implementation)
• Operating Systems (CS4411)
• Contributions:
– Three course modules including lectures,
demonstrations and assignments
DESIGN AND VERIFICATION
MODULES
Courses in This Module
• Model-Driven Software Development
(CS4710)
– Lectures
– Demos
– Assignments
• Software Quality Assurance (CS4712)
– Lectures
– Demos
– Assignments
CS4710 Lectures
• Models: basic concepts of parallel and distributed
architectures (e.g., physical shared memory
multiprocessors and computer clusters)
– Inter-process communication mechanisms
– Non-determinism
– Fairness.
• Formal Representation: an automata-theoretic
formalization of the basic concepts of PDC
• Formal Modeling in Promela:
– The Process Meta Language (Promela) is the modeling
language of the SPIN model checker
– refines concepts in Promela
CS4710 Lectures - Continued
• Properties and Specifications: specify system
requirements
– in Hoare logic (e.g., pre-post conditions, loop invariants)
– In Temporal logic (e.g., temporal operators “always” and
“eventually” for safety and liveness specification)
• Concurrency in Promela: necessary skills for
modeling and analyzing concurrency in
– The shared memory model, and
– The message-passing model (synchronous and
asynchronous)
CS4710 Lectures - Continued
• Fault Modeling: introduces different types of faults
(e.g., transient, permanent)
– Model and analyze the impact of faults on system executions
– Notions of Fault, Error and Failure (FEF) and their
interdependencies.
• Levels of Fault tolerance: define three levels
depending on the extent to which safety and liveness
are met
– Failsafe: always satisfy safety, even when faults occur
– Nonmasking: guarantee recovery; safety is not a priority
– Masking: both failsafe and nonmasking
CS4712 Lectures
• Self-Stabilization: basics of designing and verifying
self-stabilization
• Design and Verification of Self-Stabilization: history
of existing methods for design and verification of
self-stabilizing system
• Algorithmic Design of Self-Stabilization: a
preliminary background on automated design of
stabilization using examples and tools developed by
the Software Design Group (http://asd.cs.mtu.edu)
at Michigan Tech
Assignments
• CS4710
– Non-determinism: how non-determinism could result in
different paths of execution
– Fairness: how fairness can simplify design
– Safety and liveness: how to specify safety/liveness
requirements in LTL and how to verify them in SPIN
– Fault modeling: How to model faults in Promela
– Levels of fault tolerance: How to verify fault tolerance
• CS4712
– How to test that a program is correct from any initial state?
• A self-stabilizing game
• Maximal matching network protocol that provides a matching from any
arbitrary configuration
Demonstrations
• Used the model checker SPIN to give demos
on the basic concepts
– Non-determinism
– Fairness
– Inter-process communication in shared memory
and message-passing models
– Impact of faults
– Fault tolerance
IMPLEMENTATION MODULE
Operating Systems Lectures
• Introduction to Distributed Systems
– Models, faults, consistent global states
– Consensus in a synchronous system with fail-stop failures
• Common Approaches
– Isolation: approaches, VM example
– Restart: checkpointing, rollback in distributed systems
– Replication: approaches, Triple Modular Redundancy,
stateless service and idempotent operations
• Reliability Block Diagrams
– Constant component reliability
OS Assignments
• Checkpoint
– Students already complete Nachos projects
– Added implementation of checkpoint system call and test
• Reliability Block Diagrams
– Traditional problems; developed RBDTool for demonstration
and exploration
– Source (and binaries for Windows, MacOS, Linux) at:
http://www.cs.mtu.edu/~pjbonamy/rbdtool.html
• Shared Memory Server
– Support operations against a shared memory
– Server must survive one crash failure
OS Experience
• Nachos checkpoint assignment very effective
– Understanding return into ckpt required “breakthrough”
– Enhanced understanding of binary format, process state,
difficulty of process migration
• Difficult to fit topics into the course
– Insufficient time for other meaningful programming
assignments
– Familiarity with reliability calculations; only simple systems
• Otherwise OS was a good fit
– Enhanced coverage of RAID, virtual machines and
implementation of processes
Conclusions
• Considered this pilot project a success
• All curriculum revisions will be retained
• Considering how to extend, reorganize the material
– Design and implementation of modules for other courses:
Team Software Project, Parallel Programming and
Concurrent Computing
– New course on modeling and analyzing fault-tolerant
parallel/distributed computing systems