Transient Fault Tolerance via Dynamic
Process-Level Redundancy
Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors
University of Colorado at Boulder
Department of Electrical and Computer Engineering
DRACO Architecture Research Group
Workshop on Binary Instrumentation and Applications
San Jose, CA
10.22.2006
Outline
• Introduction
• Background/Terminology
• Software-centric Fault Detection
• Process-Level Redundancy
• Experimental Results
• Conclusion
Introduction
• Process technology trends
– Single transistor error rate expected to stay close to constant
– Number of transistors is increasing exponentially with each generation
• Transient faults will be a problem for microprocessors!
• Hardware Approaches
– Specialized redundant hardware, redundant multi-threading
• Software Approaches
– Compiler solutions: instruction duplication, control flow checking
– Low-cost, flexible alternative but higher overhead
• Goal: Leverage available hardware parallelism in SMT and CMP
machines to improve the performance of software transient fault
tolerance
Background/Terminology
• Types of transient faults (based upon outcome)
– Benign Faults
– Silent Data Corruption (SDC)
– Detected Unrecoverable Error (DUE)
• True DUE
• False DUE
• Sphere of Replication (SoR)
– Indicates the scope of fault detection and containment
• Input Replication
• Output Comparison
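The outcome taxonomy above can be written as a small decision function. This is a minimal sketch, not something from the slides: the names `Outcome` and `classify` are hypothetical, and it assumes the usual reading in which a false DUE is a detection of a fault that would otherwise have been benign.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault"
    SDC = "silent data corruption"
    TRUE_DUE = "true DUE"
    FALSE_DUE = "false DUE"


def classify(detected, would_corrupt_output):
    """Classify one transient fault by its outcome.

    detected: some detection mechanism flagged the fault.
    would_corrupt_output: left alone, the fault would have changed program output.
    """
    if detected:
        # Detected faults are DUEs; "false" when the fault was harmless anyway.
        return Outcome.TRUE_DUE if would_corrupt_output else Outcome.FALSE_DUE
    # Undetected faults either corrupt output silently or have no visible effect.
    return Outcome.SDC if would_corrupt_output else Outcome.BENIGN
```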
Software-centric Fault Detection
[Figure: system layers (Application, Libraries, Operating System, Processor, Cache, Memory, Devices) with a Hardware SoR for hardware-centric fault detection and a Software SoR for software-centric fault detection]
• Most previous approaches are hardware-centric
– Even compiler approaches (e.g. EDDI, SWIFT)
• Software-centric detection can leverage the strengths of a software approach
– Correctness is defined by software output
– Ability to see larger scope effect of a fault
– Ignore benign faults
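To illustrate why checking only software output lets benign faults pass, here is a toy sketch. The program `checksum` and its fault flags are invented for illustration: it flips one bit either in a dead value (never used for output) or in a live value, and the only correctness check is the returned result.

```python
def checksum(data, scratch_fault=False, data_fault=False):
    """Toy program whose only output is the returned sum."""
    scratch = 0  # dead value: written on every iteration, never read for output
    total = 0
    for i, x in enumerate(data):
        if data_fault and i == 1:
            x ^= 1 << 4        # bit flip in a live value
        scratch = x * 3
        if scratch_fault and i == 1:
            scratch ^= 1 << 4  # bit flip in the dead value
        total += x
    return total


golden = checksum([1, 2, 3])
benign = checksum([1, 2, 3], scratch_fault=True)   # output unchanged
corrupt = checksum([1, 2, 3], data_fault=True)     # output changed
```

A detector that compared the value of `scratch` would flag the first fault even though the output is still correct; comparing only outputs does not.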
Process-Level Redundancy (PLR)
[Figure: PLR overview showing three App/Libs process stacks, the SysCall Emulation Unit, the Watchdog Alarm, and the Operating System]
• Master Process
– Only process allowed to perform system I/O
• Redundant Processes
– Identical address space, file descriptors, etc.
– Not allowed to perform system I/O
• System Call Emulation Unit
– Creates redundant processes
– Barrier synchronizes at all system calls
– Enforces SoR with input replication and output comparison
– Emulates system calls to guarantee determinism among all processes
– Detects and recovers from transient faults
• Watchdog Alarm
– Occasionally a process will hang
– Set at beginning of barrier synchronization to ensure that all processes are alive
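The barrier-plus-watchdog idea can be sketched with threads standing in for the redundant processes. This is an illustration under stated assumptions, not the PLR implementation: `barrier_with_watchdog` is a hypothetical helper, the timeout on `threading.Barrier.wait` plays the role of the watchdog alarm, and a "hung" replica is simulated by never reaching the barrier.

```python
import threading


def barrier_with_watchdog(n_replicas, hung, timeout=0.5):
    """Return the ids of replicas that never reached the barrier.

    An empty list means every replica arrived before the watchdog fired.
    """
    barrier = threading.Barrier(n_replicas)
    arrived = set()
    lock = threading.Lock()

    def replica(rid):
        if rid in hung:
            return  # simulated hang: this replica never reaches the barrier
        with lock:
            arrived.add(rid)
        try:
            barrier.wait(timeout=timeout)  # watchdog armed on barrier entry
        except threading.BrokenBarrierError:
            pass  # watchdog fired; the missing replica is identified below

    threads = [threading.Thread(target=replica, args=(r,)) for r in range(n_replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if barrier.broken:
        return sorted(set(range(n_replicas)) - arrived)
    return []
```

Calling `barrier_with_watchdog(3, hung={2})` reports replica 2 as missing, which is the information a recovery step would need before creating a replacement.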
Enforcing SoR and Determinism
• Input Replication
– All read events: read(), gettimeofday(), getrusage(), etc.
– Return value from all system calls
• Output Comparison
– All write events: write(), msync(), etc.
– System call parameters
• Maintaining Determinism at System Calls
– Master process executes system call
– Redundant processes emulate it
• Ignore some: rename(), unlink()
• Execute similar/altered system call
– Identical address space: mmap()
– Process-specific data: open(), lseek()
[Figure: Example of handling a read() system call, with the Master Process and the Redundant Processes meeting at a Barrier. Steps shown, in order:]
– Write cmd line parameters and syscall type to shmem
– Compare syscall type and cmd line parameters
– read()
– Write resulting file offset and read buffer to shmem
– Copy the read buffer from shmem
– lseek() to correct file offset
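The data-replication part of the read() example can be sketched as follows. This is a simplified illustration: a single Python process holds one file descriptor per simulated PLR process, a plain dict stands in for the shared memory segment, and the names `emulated_read` and `demo` are hypothetical. The parameter comparison step is left out.

```python
import os
import tempfile


def emulated_read(master_fd, replica_fds, nbytes):
    """Master performs the real read(); replicas copy the data and fix their offsets."""
    shmem = {}  # stand-in for the shared memory segment
    shmem["buf"] = os.read(master_fd, nbytes)              # only the master does I/O
    shmem["offset"] = os.lseek(master_fd, 0, os.SEEK_CUR)  # resulting file offset
    results = [shmem["buf"]]
    for fd in replica_fds:
        results.append(bytes(shmem["buf"]))                # copy the read buffer
        os.lseek(fd, shmem["offset"], os.SEEK_SET)         # lseek() to correct offset
    return results


def demo():
    """Open one file three times and run a single emulated read of 5 bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello world")
        path = f.name
    fds = [os.open(path, os.O_RDONLY) for _ in range(3)]
    try:
        data = emulated_read(fds[0], fds[1:], 5)
        offsets = [os.lseek(fd, 0, os.SEEK_CUR) for fd in fds]
    finally:
        for fd in fds:
            os.close(fd)
        os.unlink(path)
    return data, offsets
```

After the call, every descriptor sits at the same offset and every process holds the same bytes, so the next read() behaves identically in all of them.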
Fault Detection and Recovery
• Output Mismatch
– Detection: detected as a mismatch of compare buffers on an output comparison
– Recovery: use majority vote to ensure correct data exists, kill the incorrect process, and fork() to create a new one
• Program Failure
– Detection: system call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc.
– Recovery: re-create the dead process by forking one of the existing processes
• Timeout
– Detection: watchdog alarm times out
– Recovery: determine the missing process and fork() to create a new one
• PLR supports detection/recovery from multiple faults by increasing the number of redundant processes and scaling the majority vote logic
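A majority vote over the processes' output buffers might look like the sketch below. The function names are hypothetical, and the `2f + 1` rule in `replicas_needed` is the standard requirement for simple majority voting rather than a figure taken from the slides.

```python
from collections import Counter


def majority_vote(buffers):
    """Return (winning_buffer, indices_of_disagreeing_processes).

    Raises ValueError when no buffer holds a strict majority.
    """
    winner, votes = Counter(buffers).most_common(1)[0]
    if votes * 2 <= len(buffers):
        raise ValueError("no majority: too many faulty processes")
    faulty = [i for i, b in enumerate(buffers) if b != winner]
    return winner, faulty


def replicas_needed(max_faults):
    """Smallest process count where a strict majority survives max_faults bad copies."""
    return 2 * max_faults + 1
```

For example, `majority_vote([b"ok", b"ok", b"oX"])` selects `b"ok"` and names process 2 as the one to kill and replace.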
Experimental Methodology
• Use a set of the SPEC2000 benchmarks
• PLR prototype developed with Pin
– Intercept system calls to implement PLR
• Fault Injection
– Gather an instruction count profile
– Use profile to generate a test case
• Test case: an instruction and a particular execution of the instruction to fault
– Run with Pin in JIT mode and use IARG_RETURN_REGS to alter a random bit of the instruction's source or destination registers
• Fault Coverage
– Use fault injector on test inputs, generating 1000 test cases per benchmark
– specdiff in SPEC2000 harness determines output correctness
• PLR Performance
– Run PLR (in Probe mode using Pin Probes) on reference inputs with two redundant processes
– 4-way SMP machine, each processor is hyper-threaded
– Use sched_setaffinity() to simulate various hardware platforms
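Test case generation from an instruction count profile can be sketched as below. This is not the Pin tool itself: the function names are hypothetical, the sketch assumes instructions are picked in proportion to how often they executed, and it assumes 32-bit registers.

```python
import random


def make_test_case(profile, rng):
    """Pick (instruction, execution_index, bit) from an instruction count profile.

    profile maps an instruction address to the number of times it executed.
    """
    addrs = list(profile)
    # Assumption: weight each instruction by its dynamic execution count.
    addr = rng.choices(addrs, weights=[profile[a] for a in addrs], k=1)[0]
    execution = rng.randrange(profile[addr])  # which dynamic execution to fault
    bit = rng.randrange(32)                   # assumed 32-bit registers
    return addr, execution, bit


def flip_bit(value, bit, width=32):
    """Flip one bit of a register value, keeping the result within width bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)
```

Seeding the generator, for example `make_test_case(profile, random.Random(0))`, keeps a batch of test cases reproducible across runs.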
Fault Coverage
[Figure: Fault Injection Results With and Without PLR. Legend: Failed, Incorrect, Correct, Detect SegFault, Detect Mismatch, No Fault Detected. Y-axis ticks from 0 to 100. Benchmarks: 164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf, 168.wupwise, 171.swim, 172.mgrid, 173.applu, 178.galgel, 183.equake, 187.facerec, 189.lucas, 191.fma3d]
• Watchdog timeout very rare so not shown
• PLR detects all Incorrect and Failed cases
• Effectively detects relevant faults and ignores benign faults
• Floating point correctness question (e.g. 168.wupwise, 172.mgrid)
– Actually different results but tolerable difference for specdiff
Performance
[Figure: PLR Slowdown. Legend: Native, PLR 1x1, PLR 2x2, PLR 4x1. Y-axis: Slowdown, ticks from 0 to 6. Benchmarks: 164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf, 168.wupwise, 171.swim, 172.mgrid, 173.applu, 178.galgel, 183.equake, 187.facerec, 189.lucas, 191.fma3d, Avg]
• Performance for single processor (PLR 1x1), 2 SMT processors
(PLR 2x1) and 4 way SMP (PLR 4x1)
• Slowdown for 4-way SMP only 1.26x
– Should be better on a CMP with faster processor interconnect
Conclusion
• Present a different way to use existing general purpose SMT and
CMP machines for transient fault tolerance
• Differentiate between hardware-centric and software-centric fault
detection models
– Show how software-centric can be effective in ignoring benign faults
• PLR on a 4-way SMP executes with only a 26% slowdown, a 36%
improvement over the fastest compiler technique
• Future Work
– Implementation in a run-time system allows for dynamically altering
amount of fault tolerance
– Simple PLR model is presented; work on handling interrupts, shared
memory, and threads (the tough one)
Questions?