Transcript 7810-25

CS 7810
Lecture 25
DIVA: A Reliable Substrate for Deep
Submicron Microarchitecture Design
T. Austin
Proceedings of MICRO-32
November 1999
Redundancy
• If a processor’s output is error-prone, reliability
can be provided with redundancy
[Figure: block diagram with Input, Program, Primary Core, Checker Core, and Verify & Commit]
[Figure: the same block diagram with a second Checker Core added]
One checker can detect errors. For recovery, we may need another checker
or some other form of redundancy.
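The difference between detection and recovery can be sketched as follows. This is a minimal illustration, assuming the second checker is used for majority voting; the slide does not say which form of redundancy is used, and the function names `detect` and `vote` are hypothetical.

```python
from collections import Counter


def detect(primary, checker):
    """With one checker we only learn that the two results disagree."""
    return primary != checker


def vote(primary, checker_a, checker_b):
    """With a second checker, a majority vote can pick a value to commit.

    Returns the majority value, or None if all three results disagree.
    (Majority voting is an assumption, not something the slide specifies.)
    """
    value, count = Counter([primary, checker_a, checker_b]).most_common(1)[0]
    return value if count >= 2 else None
```

With two copies a mismatch says that something went wrong but not which copy is wrong; with three, the disagreeing copy can be outvoted.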
Why Redundancy?
• Soft Errors: A high energy particle can strike a device and
deposit enough charge to flip the value
  ➢ Cosmic rays
  ➢ Alpha particles
Why Redundancy?
• Soft Errors: voltage spikes or noise
  ➢ Crosstalk
  ➢ di/dt
  ➢ Lower voltages
Why Redundancy?
• Allows unverified or aggressively clocked primary cores
  ➢ Functionally incorrect core: some corner case slips through
  ➢ Electrically incorrect core: high temperature causes a circuit to not
    meet the timing constraint
DIVA Microarchitecture
[Figure: primary core pipeline (BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$) and a checker that uses the Arch Regs]
Example: LR3 + LR7 → LR15, with values 4 + 8 = 12
• Storage Check: Rd LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
• ALU Check: Add 4+8 and confirm it equals 12
• If both checks succeed, write 12 into LR15
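The two checks in the example can be sketched in a few lines. This is a simplified model under stated assumptions, not the DIVA implementation: the function `diva_check`, its parameters, and the dictionary used for the architected registers are all hypothetical.

```python
import operator


def diva_check(op, src_regs, primary_operands, primary_result, dest_reg, arch_regs):
    """Verify one instruction at commit; update arch_regs only if both checks pass."""
    # Storage check: re-read the source registers from architected state and
    # compare them with the operand values the primary core says it used.
    stored = [arch_regs[r] for r in src_regs]
    if stored != list(primary_operands):
        return False
    # ALU check: recompute the operation on those operand values and compare
    # with the result the primary core produced.
    if op(*primary_operands) != primary_result:
        return False
    # Both checks succeeded, so the result becomes architected state.
    arch_regs[dest_reg] = primary_result
    return True


# The slide's example: LR3 + LR7 -> LR15 with values 4, 8, and 12.
arch = {"LR3": 4, "LR7": 8}
ok = diva_check(operator.add, ["LR3", "LR7"], (4, 8), 12, "LR15", arch)
```

Because the checker receives the operands and the result from the primary core, the two checks do not depend on each other.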
Microarchitecture Details
• Instructions are fed to checker in order during commit
• The logic and storage checks detect errors in ALUs
and datapath
• The checker core is a simple in-order pipeline – easy to
design and verify
• An error in an earlier stage (e.g., reading LR3 instead of LR2) can be
detected by also adding a rename/decode stage to the checker
• The in-order checker has no stalls: it sees no data dependences, cache
misses, or branch mispredicts (it needs a bypass for the register file)
• Contention for the register file and data cache can degrade the
primary thread’s performance
Recovery
• The architected register file and data cache are ECC
protected – when an error is detected, it is assumed
that checker and architected state are correct
• Primary core is re-started from faulting instruction
• A fault in the primary core may result in deadlock:
e.g., the instruction that produces R5 is waiting for R5 to be
produced (instead of R4)
• A timeout in the checker signals an error in that case
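The timeout can be pictured as a watchdog counter in the checker. The sketch below is an assumption about how such a counter might behave; the names `checker_step` and `run` and the default timeout of 4 cycles are hypothetical and not taken from the lecture.

```python
def checker_step(committed_this_cycle, idle_cycles, timeout):
    """Advance the watchdog by one cycle.

    Returns (new_idle_cycles, error_signaled).
    """
    if committed_this_cycle:
        return 0, False  # forward progress resets the counter
    idle_cycles += 1
    return idle_cycles, idle_cycles >= timeout


def run(commit_trace, timeout=4):
    """Return the cycle index at which the watchdog fires, or None."""
    idle = 0
    for cycle, committed in enumerate(commit_trace):
        idle, error = checker_step(committed, idle, timeout)
        if error:
            return cycle  # the primary core would be restarted here
    return None
```

A deadlocked primary core stops delivering instructions to the checker, so the counter runs up to the timeout and the normal recovery path (restart from the faulting instruction) is taken.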
Redundant Multi-Threading
• Execute two threads in parallel (CMP or SMT) – each
thread maintains its own register state
• Threads execute as in a conventional processor, except
  ➢ trailing thread commits after verifying result
  ➢ leading thread commits stores to a buffer – these get written to
    cache/memory only after verification
  ➢ load values of the leading thread are sent to trailing thread, so
    trailing thread never accesses data cache
  ➢ branch outcomes are also sent to trailing thread
[Figure: Leading Thread and Trailing Thread, with arrows labeled "Reg results, load values, branch outcomes" and "Store values"]
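The leading/trailing arrangement can be illustrated with a toy model. This is a sketch under stated assumptions: the three-operation instruction format, the function names `leading` and `trailing`, and the choice to release all buffered stores at the end are hypothetical simplifications, and branch outcomes are left out.

```python
def leading(program, memory):
    """Execute the program and record what is forwarded to the trailing thread."""
    regs, log, store_buffer = {}, [], []
    for op, *args in program:
        if op == "load":                      # ("load", dest, addr)
            dest, addr = args
            regs[dest] = memory[addr]
            log.append(("load", dest, regs[dest]))
        elif op == "add":                     # ("add", dest, src1, src2)
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
            log.append(("add", dest, regs[dest]))
        elif op == "store":                   # ("store", addr, src)
            addr, src = args
            store_buffer.append((addr, regs[src]))  # buffered, not yet in memory
            log.append(("store", addr, regs[src]))
    return log, store_buffer


def trailing(program, log, store_buffer, memory):
    """Re-execute with forwarded load values; release stores only if all results match."""
    regs = {}
    for (op, *args), (_, _, forwarded) in zip(program, log):
        if op == "load":
            regs[args[0]] = forwarded         # no data-cache access
        elif op == "add":
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
            if regs[dest] != forwarded:
                return False                  # register result mismatch
        elif op == "store":
            if regs[args[1]] != forwarded:
                return False                  # store value mismatch
    for addr, value in store_buffer:
        memory[addr] = value                  # verified stores reach memory
    return True
```

Since the trailing thread takes load values from the leading thread rather than from the cache, an error in a loaded value is not caught by the comparison, which is why the load/store datapath needs separate protection.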
Fault Model
• A single error in either core can be detected
• Since loads are not replicated, the load/store datapath
must be ECC protected
• For recovery, a second checker thread is required in general
• Alternatively, ECC in the checker register file will enable recovery
in most cases without a second checker
RMT on SMT/CMP
+ SMT does not require inter-core traffic – values can be
read from shared register file/data cache
– Single thread performance may be degraded
– Each redundant instr executes on high-power pipeline
+ Trailing CMP core can be a simple in-order processor
  → low power/area overheads
+ Trailing core’s frequency can be independently
controlled
+ Heterogeneous CMP where cores can be dynamically
employed for throughput/reliability
+ Lower probability of errors
Parallelization of Trailing Thread
[Figure: one Sequential Thread alongside Parallel Threads 1 through 4]
Is it more power-efficient to execute the verification thread in parallel?
• If the trailing cores are frequency-scaled, dynamic power does not
change, but leakage power increases.
• If the trailing cores are frequency-and-voltage scaled, dynamic power
decreases, and leakage power increases.
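A back-of-the-envelope calculation shows why. The sketch below assumes the common first-order models (dynamic power proportional to C·V²·f per core, leakage proportional to V·I_leak per core) and normalized, made-up parameter values; none of the numbers come from the lecture, and real leakage current also depends on voltage and temperature.

```python
def dynamic_power(cores, cap, voltage, freq):
    """Total switching power, using the first-order model C * V^2 * f per core."""
    return cores * cap * voltage ** 2 * freq


def leakage_power(cores, voltage, leak_current):
    """Total leakage, approximated as V * I_leak per core (I_leak held constant)."""
    return cores * voltage * leak_current


# Hypothetical normalized units: one trailing core at full speed ...
base_dyn = dynamic_power(cores=1, cap=1.0, voltage=1.0, freq=1.0)
base_leak = leakage_power(cores=1, voltage=1.0, leak_current=1.0)

# ... versus four trailing cores, each at one quarter of the frequency.
fs_dyn = dynamic_power(cores=4, cap=1.0, voltage=1.0, freq=0.25)
fs_leak = leakage_power(cores=4, voltage=1.0, leak_current=1.0)

# Same four cores with the voltage also lowered (0.8 is an assumed value).
dvfs_dyn = dynamic_power(cores=4, cap=1.0, voltage=0.8, freq=0.25)
dvfs_leak = leakage_power(cores=4, voltage=0.8, leak_current=1.0)
```

Under these assumptions, four cores at a quarter of the frequency switch as much capacitance per second as one core at full speed, so dynamic power is unchanged, while four cores leak instead of one. Lowering the voltage reduces the V² term, but the core count still raises total leakage.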
Error Types
Acronyms!!
• MTTF & MTBF: Mean time to/between failures
• Errors are either SDC (silent data corruption) or DUE
(detected unrecoverable errors)
Many errors get masked:
• ACE bits: these bits are required for architecturally
correct execution
• un-ACE bits: these bits do not affect the final output
• AVF: architectural vulnerability factor (the percentage of
time/space that a structure holds ACE state)
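The AVF definition can be made concrete with a small calculation. This is a sketch that assumes AVF is measured as ACE bit-cycles divided by total bit-cycles; the function name `avf_from_trace` and the example occupancy trace are hypothetical.

```python
def avf_from_trace(ace_bits_per_cycle, total_bits):
    """Fraction of a structure's bit-cycles that hold ACE state.

    ace_bits_per_cycle: number of ACE bits resident in the structure each cycle.
    total_bits: capacity of the structure in bits.
    """
    total_bit_cycles = total_bits * len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / total_bit_cycles


# Hypothetical 64-bit structure observed for four cycles.
example_avf = avf_from_trace([0, 32, 64, 32], total_bits=64)
```

A structure that is mostly empty, or mostly holds un-ACE bits, has a low AVF, so a bit flip there is more likely to be masked.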
Partial Coverage
• RMT covers faults in the entire core (almost!)
• If that is too expensive, provide error coverage in
specific structures to reduce error probabilities
• Are there ways to ensure that an instruction spends less
time in architecturally vulnerable structures?