DESIGN AND EVALUATION OF HYBRID
FAULT-DETECTION SYSTEMS
Qing Xu
Kevin Wang
OUTLINE
• Background
• Motivation
• Key Ideas
• Introduction to CRAFT
• Summary and Discussion Points
BACKGROUND
Smaller and Faster Transistors
• Lower threshold voltage
• Tighter noise margins
• Less reliable
Transient Faults
• Example: an alpha particle strike flips a stored bit (0 to 1)
Results
• Incorrect program execution
Recovery
• Hardware Only
• Software Only
REDUNDANCY
(Figure: two identical copies of the same program, executed redundantly)
#include <iostream>
int main()
{
    std::cout << "Hello\n";
}
MOTIVATION AND GOAL
• Lower hardware area and cost
• Better reliability and performance
Hybrid Solution
KEY IDEA:
COMPILER ASSISTED FAULT TOLERANCE (CRAFT)
Characteristics:
- Based on a software technique
- Minimal hardware adaptations
- Combines the advantages of software and hardware solutions
Benefits:
- Nearly perfect reliability
- Low performance degradation
- Low hardware cost
CRAFT: HYBRID OF EXISTING METHODS
Hardware Method
• Redundant Multithreading Technique (RMT)
• Error Correcting Codes (ECC)
Advantages
• Almost-perfect fault coverage
• Low performance cost
Software Method
• Software Implemented Fault Tolerance (SWIFT)
• Error Detection by Duplicating Instructions (EDDI)
Advantages
• High fault coverage
• Modest performance cost
• Zero hardware cost
EXISTING METHOD: HARDWARE
Redundant Multithreading (RMT)
• RMT uses SMT resources to run loosely synchronized redundant threads
• Components not covered by redundant execution must employ alternative techniques, such as Error Correcting Codes (ECC)
(Figure: an Original Thread paired with a Checker Thread)
EXISTING METHOD: SOFTWARE
SWIFT
• A compiler-based transformation
• The store instruction is the synchronization point
• Assumes that Error Correcting Codes (ECC) guard the correctness of the memory subsystem

(Original Code)
    ld r3 = [r4]
    add r1 = r2, r3
    st m[r1] = r2

(SWIFT Code)
    ld r3 = [r4]
    mov r3' = r3
    add r1 = r2, r3
    add r1' = r2', r3'
    br Fault, r1 != r1'
    br Fault, r2 != r2'
    br Fault, r3 != r3'
    st m[r1] = r2
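The SWIFT sequence above can be mimicked with a toy Python model. This is only a sketch under stated assumptions: registers live in a dict, shadow copies are keyed with a trailing apostrophe, and the names `FaultDetected`, `check`, `run_swift`, and the `flip` hook are hypothetical, invented here for illustration rather than taken from SWIFT.

```python
class FaultDetected(Exception):
    """Raised when a register and its shadow copy disagree."""


def check(regs, name):
    # Models `br Fault, rX != rX'`.
    if regs[name] != regs[name + "'"]:
        raise FaultDetected(name)


def run_swift(memory, r2, r4, flip=None):
    """Toy model of the SWIFT code sequence from the slide.

    flip: optional (register_name, bit) pair. That bit is flipped after
    the duplicated add to imitate one transient fault (hypothetical hook).
    """
    regs = {"r2": r2, "r2'": r2, "r4": r4, "r4'": r4}
    regs["r3"] = memory[regs["r4"]]            # ld  r3  = [r4]
    regs["r3'"] = regs["r3"]                   # mov r3' = r3
    regs["r1"] = regs["r2"] + regs["r3"]       # add r1  = r2, r3
    regs["r1'"] = regs["r2'"] + regs["r3'"]    # add r1' = r2', r3'
    if flip is not None:
        name, bit = flip
        regs[name] ^= 1 << bit                 # inject the bit flip
    for name in ("r1", "r2", "r3"):            # validate before the store
        check(regs, name)
    memory[regs["r1"]] = regs["r2"]            # st m[r1] = r2
    return memory
```

Calling `run_swift({4: 10}, r2=5, r4=4)` performs the store, while passing `flip=("r1", 0)` makes the validation branch fire before anything reaches memory. Note that a flip injected after the checks but before the store would go unnoticed in this model, which is the window the next slides address.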
CRAFT: SUITE OF THREE DETECTION SYSTEMS
Preliminaries
• Assume the Single Event Upset fault model
• Architecturally Correct Execution (ACE)
• Detected Unrecoverable Error (DUE)
• Silent Data Corruption (SDC)
List of the Suite:
1. Checking Store Buffer (CSB)
2. Load Value Queue (LVQ)
3. CSB + LVQ
SUITE 1: CHECKING STORE BUFFER (CSB)
Problem to Improve:
• SWIFT: Vulnerable to faults in the time interval between the validation and the use of a register value
(Figure: the window between "Validated values" and "Use of validated values" is marked "Vulnerable to Faults")
Solution:
• Add a Store Buffer to perform checks
CSB : IMPLEMENTATION
Basic Idea: Commit a store when two copies of store data match
Method : Create CSB to keep track of all original and duplicated instructions
Compiler duplicates stores:
    st [r1] = r2
becomes
    st1 [r1] = r2
    st2 [r1'] = r2'

CSB #   Address   Value   Validated
0       --        --      --
1       --        --      --
2       0xFF      0x8     N, then Y once the duplicate matches
3       0xEE      0x1     N

• Insn duplicate #1 (address 0xFF, value 0x8): store value checks out! Send to MEM.
• Insn duplicate #2 (address 0xEE, value 0x2): does not match, not OK to go to MEM.
• Unvalidated entries stay in the table, so the table will fill up and cause a structural hazard.
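The commit rule on this slide can be sketched in Python. This is a minimal model, not the hardware design: it assumes the buffer is a FIFO and that each duplicate store is compared against the oldest pending entry, and the class and method names (`CheckingStoreBuffer`, `st1`, `st2`) are hypothetical.

```python
from collections import deque


class CheckingStoreBuffer:
    """Toy CSB: a store reaches memory only after its duplicate matches."""

    def __init__(self, size, memory):
        self.size = size
        self.memory = memory
        self.pending = deque()  # unvalidated (address, value) pairs

    def st1(self, address, value):
        # Original store: allocate an unvalidated entry.
        if len(self.pending) >= self.size:
            raise BufferError("CSB full: structural hazard")
        self.pending.append((address, value))

    def st2(self, address, value):
        # Duplicate store: compare against the oldest pending entry
        # (FIFO matching is an assumption of this sketch).
        if not self.pending:
            raise ValueError("duplicate store without an original")
        if self.pending[0] != (address, value):
            return False            # mismatch: entry stays, nothing commits
        self.pending.popleft()
        self.memory[address] = value  # validated: send to memory
        return True
```

Replaying the slide's example, `st1(0xFF, 0x8)` followed by `st2(0xFF, 0x8)` commits to memory, while `st1(0xEE, 0x1)` followed by `st2(0xEE, 0x2)` returns `False` and leaves the entry occupying a slot, which is how the table eventually fills up.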
CSB : ADVANTAGES/ DISADVANTAGES
Advantages
• Checking is implemented at the hardware level
• Validation code is no longer needed, which reduces code size
• Store instructions are no longer synchronization points (as they are in SWIFT)
• Allows more dynamic scheduling to be exploited
Disadvantages
• Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB
SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to Improve:
• SWIFT: Window of vulnerability between the load instruction and the value duplication.
(Figure: the window between "Loading values" and "Copying values" is marked "Vulnerable to Faults")
Solution:
• Add a load value queue
LVQ : IMPLEMENTATION PROCEDURE
Basic Idea: Duplicate load to enable redundant computation
Method : LVQ provides redundant load instruction execution
Compiler duplicates loads:
    ld r2 = [r1]
becomes
    ld1 r2 = [r1]
    ld2 r2' = [r1']

LVQ #   Address   Value
0       --        --
1       --        --
2       0xAA      0x2
3       --        --

• ld insn: address 0xAA, value 0x2
• ld insn duplicate: address 0xAA, value 0x2
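A small Python model may help show why the duplicate load sees the same value as the original. It is a sketch under assumptions: the queue is treated as a FIFO, the address comparison on the duplicate load is an assumed detection check, and the names `LoadValueQueue`, `ld1`, and `ld2` are hypothetical.

```python
from collections import deque


class LoadValueQueue:
    """Toy LVQ: the duplicate load is served from the queue, not memory."""

    def __init__(self, memory):
        self.memory = memory
        self.queue = deque()  # (address, value) pairs from original loads

    def ld1(self, address):
        # Original load: read memory and remember the value.
        value = self.memory[address]
        self.queue.append((address, value))
        return value

    def ld2(self, address):
        # Duplicate load: bypass memory and reuse the queued value.
        queued_address, value = self.queue.popleft()
        if queued_address != address:
            # Assumed check: the two address copies must agree.
            raise RuntimeError("fault detected: load addresses differ")
        return value
```

Because `ld2` never touches memory, both copies of the computation receive 0x2 for address 0xAA even if the memory location changes between the two loads, and memory traffic is not doubled.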
LVQ : ADVANTAGES/ DISADVANTAGES
Advantages
• Reduces the window of vulnerability by issuing a duplicated load instruction
• Keeps memory traffic low by bypassing the load value to the duplicate
Disadvantages
• Extra hardware is needed to ensure that loads and their duplicates access the same entry in the LVQ
SUITE 3: CSB + LVQ
• Adds both the CSB and the LVQ simultaneously to software-only solutions like SWIFT
EXPERIMENTAL EVALUATION
Evaluation Method
• Performance vs. Reliability:
• Inject randomly chosen faults into a detailed microarchitectural simulation
• Track each chosen bit flip until the program completes
• Analyze the final result to determine:
- How much SDC is converted to DUE
- How much work (# of applications) the program completed before encountering an SDC
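The injection-and-classification loop described above can be illustrated with a toy Python campaign. This is not the detailed microarchitectural simulation used in the evaluation: the "program" is just a running sum whose low 8 bits are the output, the fault is one bit flip in the primary accumulator, and `run` and `campaign` are hypothetical names chosen for this sketch.

```python
import random
from collections import Counter


def run(values, protected, fault):
    """Sum `values` with a shadow copy; `fault` is a (step, bit) flip
    applied to the primary accumulator after that step."""
    total = shadow = 0
    for step, v in enumerate(values):
        total += v
        shadow += v
        if fault[0] == step:
            total ^= 1 << fault[1]   # single transient bit flip
    if protected and total != shadow:
        return "DUE", None           # detected before producing output
    return "OK", total & 0xFF        # output is the low 8 bits only


def campaign(values, protected, trials, seed=0):
    """Inject one random fault per trial and classify each outcome."""
    rng = random.Random(seed)
    golden = sum(values) & 0xFF
    outcomes = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(values)), rng.randrange(16))
        status, result = run(values, protected, fault)
        if status == "DUE":
            outcomes["DUE"] += 1
        elif result != golden:
            outcomes["SDC"] += 1     # wrong output, nobody noticed
        else:
            outcomes["correct"] += 1  # fault was masked
    return outcomes
```

Comparing `campaign(data, protected=False, trials=200)` with `campaign(data, protected=True, trials=200)` shows the effect being measured: in this toy, every outcome that was an SDC without checking becomes a DUE with checking, and masked faults are flagged as DUE as well.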
EXPERIMENTAL EVALUATION
Results:
(Figure: measures the # of applications the program completed before encountering an SDC)
Implementation   Performance
CSB              Enables better performance, as it eliminates scheduling constraints
LVQ              Impact varies by benchmark
SUMMARY AND CONCLUSION
CRAFT, as compared to:
Software-only Technique
• Execution time reduction by 5%
• SDC to DUE conversion rate increase by 75%
Hardware-only Technique
• Significantly reduced area overhead
• Maintains comparable reliability
Hybrid technique can provide better reliability with relatively low cost
DISCUSSION POINTS
• CRAFT detects a fault when the CSB is clogged. Is there a tradeoff between detection latency and more flexible scheduling?
• Recovery method?
• Evaluation in terms of coverage?