Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner
HP Labs
§: Currently at Meiosys, Inc.

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Broad Opportunity for Checkpoint-Restart in Server Management

• Fault tolerance (minimize unplanned downtime)
− Recover by restarting from a checkpoint
• Minimize planned downtime
− Migrate the application before hardware/OS maintenance
• Resource management
− Manage resource allocation in shared computing environments by migrating applications
Need for General-Purpose Checkpoint-Restart

• Existing checkpoint-restart methods are too limited:
− No support for many OS resources that commercial applications use (e.g., sockets)
− Limited to applications using specific libraries
− Require application source and recompilation
− Require use of specialized operating systems
• Need a practical checkpoint-restart mechanism capable of supporting a broad class of applications
Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux

• Application-transparent: supports applications without modification or recompilation
• Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
− Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
• Runs on an unmodified Linux base kernel – checkpoint-restart is integrated via a kernel module
Cruz Overview

Builds on Columbia University's Zap process migration.

Our key extensions:
• Support for migrating networked applications, transparent to communicating peers
− Enables a role in managing servers running commercial applications (e.g., databases)
• A general method for checkpoint-restart of TCP/IP-based distributed applications
− Also enables efficiencies compared to library-specific approaches
Outline

• Zap (Background)
• Migrating Networked Applications
− Network Address Migration
− Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
Zap (Background)

• Process migration mechanism
− Kernel module implementation
• A virtualization layer groups processes into pods with a private virtual name space
− Intercepts system calls to expose only virtual identifiers (e.g., vpid)
− Preserves resource names and dependencies across migration
• Mechanism to checkpoint and restart pods
− User- and kernel-level state
− Primarily uses system call handlers
− File system not saved or restored (assumes a network file system)

[Figure: Zap architecture – applications grouped into pods running above the Zap virtualization layer, which interposes on Linux system calls.]
Migrating Networked Applications

• Migration must be transparent to remote peers to be useful in server management scenarios
− Peers, including unmodified clients, must not perceive any change in the IP address of the application
− Communication state of live connections must be preserved
• No prior solution provides these (including the original Zap)
• Our solution:
− Provide each pod a unique IP address that persists across migration
− Checkpoint and restore the socket control state and socket data buffer state of all live sockets
Network Address Migration

[Figure: a DHCP client inside the pod (1) issues an ioctl() and (2) obtains the pod's MAC address MAC-p1, then (3) sends dhcprequest(MAC-p1) to the DHCP server on the network and (4) receives dhcpack(IP-p1). The pod's virtual interface eth0:1 coexists with the host's physical eth0 [IP-1, MAC-h1].]

• Each pod is attached to a virtual interface with its own IP and MAC address (see the sketch below)
− Implemented using Linux's virtual interfaces (VIFs)
• The IP address is assigned statically or through a DHCP client running inside the pod (using the pod's MAC address)
• bind() and connect() are intercepted to ensure pod processes use the pod's IP address
• Migration: delete the VIF on the source host and create it on the new host
− Migration is limited to the subnet
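For concreteness, here is a minimal user-space sketch of the VIF mechanism: on Linux, assigning an address to an alias name such as eth0:1 via ioctl(SIOCSIFADDR) brings up a virtual interface distinct from the physical eth0. The interface name and address are illustrative, not Cruz's actual values; Cruz performs the equivalent steps from its kernel module, and the per-pod MAC address handling described above is not covered by this sketch.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    /* Control socket for interface ioctls. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;
    struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;

    memset(&ifr, 0, sizeof(ifr));
    /* "eth0:1" and 10.0.0.42 are illustrative: assigning an address
     * to an alias name brings up a VIF distinct from eth0 itself. */
    strncpy(ifr.ifr_name, "eth0:1", IFNAMSIZ - 1);
    sin->sin_family = AF_INET;
    inet_pton(AF_INET, "10.0.0.42", &sin->sin_addr);

    if (ioctl(fd, SIOCSIFADDR, &ifr) < 0)  /* requires root */
        perror("SIOCSIFADDR");
    close(fd);
    return 0;
}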
Communication State Checkpoint and Restore

Communication state:
− Control: socket data structure, TCP connection state
− Data: contents of send and receive socket buffers

Challenges in communication state checkpoint and restore:
• The network stack continues to execute even after application processes are stopped
• No system call interface to read or write control state (see the note below)
• No system call interface to read send socket buffers
• No system call interface to write receive socket buffers
• Consistency of control state and socket buffer state
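On the control-state point: Linux does expose a read-only snapshot of some TCP control state via getsockopt(TCP_INFO), but no comparable interface exists for writing that state back, which is why Cruz reads and restores kernel structures directly. A standard-Linux sketch (not Cruz code):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Read-only snapshot of a connection's TCP control state.
 * There is no write-side counterpart to this interface. */
void dump_tcp_state(int fd) {
    struct tcp_info ti;
    socklen_t len = sizeof(ti);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        printf("state=%u snd_cwnd=%u rtt=%u us\n",
               ti.tcpi_state, ti.tcpi_snd_cwnd, ti.tcpi_rtt);
}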
Communication State Checkpoint

• Acquire network stack locks to freeze TCP processing
• Save receive buffers using the socket receive system call in peek mode (see the sketch below)
• Save send buffers by walking kernel structures
• Copy control state from kernel structures
• Modify two sequence numbers in the saved state to reflect empty socket buffers
− copied_seq is set to rcv_nxt, indicating that the current receive buffers have all been consumed by the application
− write_seq is set to snd_una, indicating that the current send buffers have not yet been written by the application

[Figure: state for one socket. The live communication state (control block with timers, options, etc.; receive buffers Rh…Rt; send buffers Sh…St) is copied into the checkpoint: receive buffers via receive() in peek mode, send buffers and control state by direct access. In the live state, copied_seq = Rh, rcv_nxt = Rt+1, snd_una = Sh, write_seq = St+1; in the checkpointed control block, copied_seq = Rt+1 and write_seq = Sh. Note: the checkpoint does not change the live communication state.]
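A minimal sketch of the peek-mode save from user space, assuming FIONREAD for the queued byte count; Cruz performs the equivalent inside the kernel with the stack locks already held:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Sketch of the "peek mode" receive-buffer save: FIONREAD reports how
 * many bytes are queued, and MSG_PEEK copies them out while leaving
 * them in the kernel buffer, so the live connection is unchanged.
 * Error handling and partial reads are elided. */
ssize_t save_recv_buffer(int fd, char *out, size_t out_len) {
    int queued = 0;
    if (ioctl(fd, FIONREAD, &queued) < 0 || queued <= 0)
        return 0;
    if ((size_t)queued > out_len)
        queued = (int)out_len;
    return recv(fd, out, (size_t)queued, MSG_PEEK | MSG_DONTWAIT);
}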
Communication State Restore

• Create a new socket
• Copy the control state in the checkpoint to the socket structure
• Restore checkpointed send buffer data using the socket write call
• Deliver checkpointed receive buffer data to the application on demand (see the sketch below)
− Copy checkpointed receive buffer data to a special buffer
− Intercept the receive system call to deliver data from the special buffer until the buffer is emptied

[Figure: state for one socket. Checkpointed send data (Sh…St) is re-injected into the live socket with write(); checkpointed receive data (Rh…Rt) is held in the special buffer and delivered to the application by the intercepted receive system call. The restored control block starts with copied_seq = rcv_nxt = Rt+1 and snd_una = write_seq = Sh, matching the empty socket buffers.]
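A minimal sketch of the on-demand delivery path, with hypothetical names (struct staged_recv, pod_recv); Cruz implements this at its system call interception layer rather than as a user-space wrapper:

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical staging buffer holding a socket's checkpointed receive
 * data (bytes the application had not yet consumed at checkpoint). */
struct staged_recv {
    const char *data;
    size_t len;
    size_t off;   /* bytes already delivered to the application */
};

/* Sketch of the intercepted receive path: staged checkpoint data is
 * delivered first; once the staging buffer is emptied, calls fall
 * through to the real socket, which holds only post-restart data. */
ssize_t pod_recv(struct staged_recv *s, int fd, void *buf, size_t len, int flags) {
    if (s->off < s->len) {
        size_t n = s->len - s->off;
        if (n > len)
            n = len;
        memcpy(buf, s->data + s->off, n);
        s->off += n;
        return (ssize_t)n;
    }
    return recv(fd, buf, len, flags);
}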
Checkpoint-Restart of Distributed Applications

[Figure: three nodes, each running processes over a library over TCP/IP, joined by a communication channel; the checkpoint must capture all of them together.]

• The state of processes and of messages in the channel must be checkpointed and restored consistently
• Prior approaches are specific to a particular library – e.g., they modify the library to capture and restore messages in the channel
• Cruz preserves the TCP connection state and IP addresses of each pod, implicitly preserving the global communication state
− Transparently supports TCP/IP-based distributed applications
− Enables efficiencies compared to library-based implementations
Checkpoint-Restart of Distributed Applications in Cruz

[Figure: the same nodes with processes grouped into pods; the checkpoint captures each pod together with its TCP/IP state.]

• Global communication state is saved and restored by saving and restoring the TCP communication state of each pod
− Messages in flight need not be saved, since the TCP state will trigger retransmission of these messages at restart
• Eliminates the O(N²) step of flushing channels to capture messages in flight
− Eliminates the need to re-establish connections at restart
• Preserving the pod's IP address across restart eliminates the need to re-discover process locations in the library at restart
Consistent Checkpoint Algorithm in Cruz (Illustrative)

[Figure: a Coordinator exchanges messages with a Pod Agent on each node. On <checkpoint>, each agent disables pod communication (using netfilter rules in Linux) and saves the pod state, then replies <done>. When all agents have replied, the Coordinator sends <continue>; each agent re-enables pod communication and resumes the pod, then replies <continue-done>.]

• The algorithm has O(N) complexity (a blocking version is shown for simplicity; see the sketch below)
• It can be extended to improve robustness and performance, e.g.:
− Tolerate Agent and Coordinator failures
− Overlap computation and checkpointing using copy-on-write
− Allow nodes to continue without blocking for all nodes to complete the checkpoint
− Reduce checkpoint size with incremental checkpoints
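A minimal sketch of the blocking coordination from the Coordinator's side, assuming one-byte message framing and pre-established connections to each node's Pod Agent; the helpers send_msg/wait_for and the agent_fd array are hypothetical, not Cruz's actual interface:

#include <sys/types.h>
#include <unistd.h>

/* One-byte message codes mirroring the slide's protocol. */
enum msg { MSG_CHECKPOINT = 1, MSG_DONE, MSG_CONTINUE, MSG_CONTINUE_DONE };

static void send_msg(int fd, enum msg m) {
    char c = (char)m;
    ssize_t r = write(fd, &c, 1);
    (void)r;  /* error handling elided in this sketch */
}

static void wait_for(int fd, enum msg m) {
    char c = 0;
    while (read(fd, &c, 1) == 1 && c != (char)m)
        ;  /* ignore unexpected bytes in this sketch */
}

/* Blocking coordination, as on the slide: two message rounds per
 * node, hence O(N) messages for N nodes. */
void coordinate_checkpoint(const int agent_fd[], int n) {
    /* Phase 1: each agent disables pod communication (netfilter rules
     * in Linux) and saves its pod's state, then replies <done>. */
    for (int i = 0; i < n; i++) send_msg(agent_fd[i], MSG_CHECKPOINT);
    for (int i = 0; i < n; i++) wait_for(agent_fd[i], MSG_DONE);

    /* Phase 2: each agent re-enables communication and resumes its
     * pod, then replies <continue-done>. */
    for (int i = 0; i < n; i++) send_msg(agent_fd[i], MSG_CONTINUE);
    for (int i = 0; i < n; i++) wait_for(agent_fd[i], MSG_CONTINUE_DONE);
}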
Evaluation

• Cruz implemented for Linux 2.4.x on x86
• Functionality verified on several applications, e.g., MySQL, the K Desktop Environment, and a multi-node MPI benchmark
• Cruz incurs negligible runtime overhead (less than 0.5%)
• An initial study shows the performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
Performance Result – Negligible Coordination Overhead

• Checkpoint behavior measured for a Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
• Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests the scheme is scalable
− A coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
Related Work

• MetaCluster product from Meiosys
− Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
• Berkeley Labs Checkpoint Restart (BLCR)
− Kernel-module-based checkpoint-restart for a single node
− No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict
− No support for handling communication state – relies on application or library changes
• MPVM, CoCheck, LAM-MPI
− Library-specific implementations of parallel application checkpoint-restart, with the disadvantages described earlier
Future Work

Many areas for future work, e.g.:
• Improve portability across kernel versions by minimizing direct access to kernel structures
− Recommend additional kernel interfaces where advantageous (e.g., for accessing socket attributes)
• Implement performance optimizations to the coordinated checkpoint-restart algorithm
− Evaluate performance on a wide range of applications and cluster configurations
• Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)
Summary

• Cruz, a practical checkpoint-restart system for Linux
− No changes to applications or to the base OS kernel needed
• Novel mechanisms support checkpoint-restart of a broader class of applications
− Migration of networked applications transparent to communicating peers
− Consistent checkpoint-restart of general TCP/IP-based distributed applications
• Cruz's broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management

http://www.hpl.hp.com/research/dca