Checkpoint and Restart Mechanisms (CPR)

Yavor Todorov
Contents
Introduction
How it works
OS level checkpointing
Application level checkpointing
CPR for parallel programs
CPR functionality
References
Introduction
• Basically, checkpoint/restart mechanisms allow a machine that crashes
and is subsequently restarted to continue from the checkpoint with no
loss of data, just as if no failure had occurred.
• At some firms and supercomputing centers, it's common practice to
break up long-running computational programs into several batches.
Programs such as gene-sequencing applications search through
enormous databases and execute complex algorithms that can take
several weeks to complete.
• But while the concept is easy to understand, the technical mechanism to
checkpoint and restart an operating system or application is quite
complex.
How it works
• Checkpointing can occur either within the operating system or at the
application level.
• Most high-end mainframes have automated CPR utilities.
• CPR at the operating system level saves the state of everything that's
being done within a given application at periodic checkpoints and allows
the system to restart from the last checkpoint.
• On very large computers with hundreds or thousands of processes
running, saving the entire state of an operating system can take a long
time.
How it works (cont’d)
• It also takes a long time to later restart the machine at that state; on
large jobs, it could take several hours.
• The recovery is delayed because a large amount of data must be stored,
whether or not the application requires that information to fully restart it.
• When a process is checkpointed, the saved state includes its process
image, register set, and open file handles.
Process image
• Text
• Data
• Stack
• Heap
• Register set values
• Status of open files
• Sockets
• Signals.
Checkpoint at OS level
• "Checkpointing at the operating system is useful but very costly, in that
the operating system does not know what data the application really
needs to restore it later, so it blindly saves everything," according to
James Kasdorf, director of a supercomputing center.
• At the OS level, CPR saves the full system state, including unneeded
copies of data, program code, and system libraries.
• Most supercomputing centers try to avoid CPR at the OS level.
Checkpointing at OS level drawbacks
• Since the whole process image is saved, it's an expensive operation.
• Takes more time.
• Usually needs kernel modifications, since most operating systems, such
as Linux, were not built with CPR functionality.
Checkpoint at application level
• The application uses OS hooks to save the information needed for restart.
• More efficient: it takes less time to checkpoint, and it is faster to
restart the application.
• It lets you choose the optimal checkpoint location, which is typically at
the end of a loop.
• Only the needed data gets saved.
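The idea of checkpointing only the needed data at the end of a loop can be sketched in Python as follows. This is a minimal illustration, not taken from any of the cited systems: the file format (JSON), the function names save_checkpoint, load_checkpoint and run, and the crash_at parameter used to simulate a failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk
    # The rename is atomic on POSIX, so a crash never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n, crash_at=None):
    """Sum the squares of 0..n-1, checkpointing at the end of each iteration."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        state["total"] += i * i
        state["next_i"] = i + 1
        save_checkpoint(path, state)  # end of loop body: the checkpoint location
    return state["total"]
```

Calling run() again after a failure resumes from the last completed iteration, because only the loop index and the running total are needed to restart.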
CPR at app level drawbacks
• Difficult in some cases, e.g. when the application has an open
communications channel to an external device or runs on a clustered
computer.
• A distributed application's state is hard to save because the program's
state is changing across multiple nodes.
• CPR for apps with large memory buffers takes longer.
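The effect of large buffers on checkpoint size can be illustrated with a small Python sketch. The Simulation class, its fields, and the split into "needed" state and a recomputable scratch buffer are hypothetical choices made for this example; pickle is used only as a convenient serializer.

```python
import pickle


class Simulation:
    """Toy application with a small essential state and a large work buffer."""

    def __init__(self, buffer_len):
        self.step = 0
        self.positions = [0.0] * 8           # needed to resume
        self.scratch = [0.0] * buffer_len    # large buffer, can be rebuilt

    def advance(self):
        self.step += 1
        self.positions = [p + 1.0 for p in self.positions]

    def full_snapshot(self):
        # Saves everything, including the scratch buffer.
        return pickle.dumps(self.__dict__)

    def minimal_snapshot(self):
        # Saves only what is needed to restart the computation.
        return pickle.dumps({"step": self.step, "positions": self.positions})

    @classmethod
    def restore(cls, blob, buffer_len):
        sim = cls(buffer_len)                # scratch buffer is re-created
        sim.__dict__.update(pickle.loads(blob))
        return sim
```

With a buffer of 100,000 entries, len(sim.full_snapshot()) is far larger than len(sim.minimal_snapshot()), which is the gap between saving everything and saving only the needed data.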
CPR for parallel programs
• Each process is responsible for taking its own checkpoint.
• Checkpoint timing is the responsibility of a coordinating process.
• CPR data includes: in-transit message data, data section, file offsets,
signal state, executable information, stack contents and register
contents, CPU state, info about open files, pending signals.
• A checkpoint file can be stored on either local or global storage.
• When the program is restarted, each process initiates its own restart.
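The division of labor described above (a coordinator decides when, each process saves and restores its own state) can be sketched in Python. This is a simplified model under several assumptions: threads stand in for processes, a threading.Barrier stands in for the coordination protocol, in-transit messages are not modeled, and the names Coordinator, rank_main, run_job and saved_round are invented for the example.

```python
import json
import os
import threading


class Coordinator:
    """Decides, once per round, whether every rank should checkpoint now."""

    def __init__(self, n_ranks, ckpt_every, start_round=0):
        self.round = start_round
        self.ckpt_every = ckpt_every
        self.take_checkpoint = False
        # The action runs in exactly one thread once all ranks have arrived.
        self.barrier = threading.Barrier(n_ranks, action=self._decide)

    def _decide(self):
        self.round += 1
        self.take_checkpoint = self.round % self.ckpt_every == 0


def rank_main(rank, total_rounds, coord, ckpt_dir, results):
    path = os.path.join(ckpt_dir, f"rank{rank}.json")
    state = {"round": 0, "value": 0}
    if os.path.exists(path):            # each rank initiates its own restart
        with open(path) as f:
            state = json.load(f)
    while state["round"] < total_rounds:
        state["value"] += rank + 1      # stand-in for real work
        state["round"] += 1
        coord.barrier.wait()            # all ranks are stopped at the same point
        if coord.take_checkpoint:       # each rank writes its own checkpoint
            with open(path, "w") as f:
                json.dump(state, f)
    results[rank] = state


def saved_round(ckpt_dir):
    """Round stored in rank 0's checkpoint, or 0 if there is none."""
    path = os.path.join(ckpt_dir, "rank0.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["round"]
    return 0


def run_job(n_ranks, total_rounds, ckpt_every, ckpt_dir):
    coord = Coordinator(n_ranks, ckpt_every, saved_round(ckpt_dir))
    results = {}
    threads = [
        threading.Thread(target=rank_main,
                         args=(r, total_rounds, coord, ckpt_dir, results))
        for r in range(n_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Running run_job a second time with the same checkpoint directory resumes every rank from the last coordinated checkpoint. Because all ranks save at the same barrier, the per-rank files always describe the same round.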
CPR for parallel programs (cont’d)
• All migrating processes have to be stopped at the same time to avoid
losing a signal.
• The socket IP address space has to be taken into consideration (it can
be virtualized).
• I/O speed is pivotal for any CPR process (the faster the better).
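The influence of I/O speed can be estimated with simple arithmetic: the time to write a checkpoint is roughly the size of the saved state divided by the storage bandwidth. The Python sketch below uses made-up sizes and bandwidths purely for illustration and ignores latency, compression, and contention.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def checkpoint_write_seconds(state_bytes, bandwidth_bytes_per_s):
    """Rough lower bound on the time needed to write one checkpoint."""
    return state_bytes / bandwidth_bytes_per_s


# Hypothetical job with 512 GiB of state to save.
state = 512 * GIB
slow = checkpoint_write_seconds(state, 1 * GIB)    # 1 GiB/s storage
fast = checkpoint_write_seconds(state, 16 * GIB)   # 16 GiB/s storage
print(f"slow storage: {slow:.0f} s, fast storage: {fast:.0f} s")
```

Under these assumed numbers the same checkpoint takes 512 seconds on the slower storage and 32 seconds on the faster one, and restart time scales the same way for reads.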
CPR functionality
• Process migration
• Load balancing
• Crash recovery
• Rollback transaction
• Job control
References
Duell, J. (2005). The Design and Implementation of Berkeley Lab's Linux
Checkpoint/Restart. Berkeley: Lawrence Berkeley National Lab.
Litzkow, M., Tannenbaum, T., Basney, J., & Livny, M. (1997). Checkpoint and
Migration of UNIX Processes in the Condor Distributed Processing System.
Zhong, H., & Nieh, J. (2001). CRAK: Linux Checkpoint/Restart As a Kernel
Module. Department of Computer Science, Columbia University.
http://www.computerworld.com/s/article/68930/Checkpoint_and_Restart