Transcript 7810-25
CS 7810
Lecture 25
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
T. Austin, Proceedings of MICRO-32, November 1999
Redundancy
• If a processor’s output is error-prone, reliability
can be provided with redundancy
[Diagram: Input, Program, Primary Core, Checker Core, Verify & Commit]
[Diagram: a second Checker Core is added]
One checker can detect errors. For
recovery, we may need another checker
or some other form of redundancy
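The difference between detection and recovery can be sketched in a few lines. This is only an illustration, not the mechanism of any particular design: the function names `detect` and `recover` are hypothetical, and majority voting is just one possible "other form of redundancy".

```python
from collections import Counter

def detect(primary, checker):
    """Two copies can show that they disagree, but not which one is wrong."""
    return primary != checker

def recover(results):
    """With three or more copies, a majority vote picks the value to commit.

    Returns None when no value has a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

print(detect(12, 13))          # True: a mismatch is flagged
print(recover([12, 13, 12]))   # 12: the two agreeing copies win
```

With only two results, a mismatch says that something went wrong but gives no basis for choosing a value, which is why recovery needs extra information.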
Why Redundancy?
• Soft Errors: A high energy particle can strike a device and
deposit enough charge to flip the value
Sources: cosmic rays, alpha particles
Why Redundancy?
• Soft Errors: voltage spikes or noise
Related factors: crosstalk, di/dt, lower voltages
Why Redundancy?
• Allows unverified or aggressively clocked primary cores
Functionally incorrect core: some corner case slips through
Electrically incorrect core: high temperature causes a circuit to not
meet the timing constraint
DIVA Microarchitecture
[Diagram: BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, Arch Regs]
Example: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12
Storage Check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
ALU Check: add 4 + 8 and confirm it equals 12
If both checks succeed, write 12 into LR15
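The two checks in the example can be sketched as follows. This is a minimal illustration under stated assumptions: the `Completed` record and `check_and_commit` function are hypothetical names, only addition is modeled, and the architected register file is a plain dictionary.

```python
from dataclasses import dataclass

@dataclass
class Completed:
    """What the primary core hands to the checker at commit (hypothetical layout)."""
    src1: str
    src2: str
    dst: str
    val1: int      # operand values the primary core used
    val2: int
    result: int    # result the primary core computed

def check_and_commit(inst, arch_regs):
    """Return True and update arch_regs if both checks pass, else False."""
    # Storage check: the operands used must match architected state.
    storage_ok = (arch_regs[inst.src1] == inst.val1 and
                  arch_regs[inst.src2] == inst.val2)
    # ALU check: recompute the operation (only add is modeled here).
    alu_ok = (inst.val1 + inst.val2 == inst.result)
    if storage_ok and alu_ok:
        arch_regs[inst.dst] = inst.result
        return True
    return False

arch_regs = {"LR3": 4, "LR7": 8, "LR15": 0}
good = Completed("LR3", "LR7", "LR15", 4, 8, 12)
bad = Completed("LR3", "LR7", "LR15", 4, 8, 13)   # injected ALU error
print(check_and_commit(good, arch_regs), arch_regs["LR15"])   # True 12
print(check_and_commit(bad, arch_regs), arch_regs["LR15"])    # False 12
```

Because the checker is handed the operand values together with the result, each check is a simple comparison against architected state or a recomputation.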
Microarchitecture Details
• Instructions are fed to checker in order during commit
• The logic and storage checks detect errors in ALUs
and datapath
• The checker core is a simple in-order pipeline – easy to
design and verify
• An error in an earlier stage (e.g. LR3 used instead of LR2) can be
detected by also adding a rename/decode stage to the checker
• The in-order checker core has no stalls: operands and results come from
the primary core, so it sees no data dependences, cache misses, or branch
mispredicts (a bypass is needed for the register file)
• Contention for register file and data cache can degrade
primary thread
Recovery
• The architected register file and data cache are ECC
protected – when an error is detected, it is assumed
that checker and architected state are correct
• Primary core is re-started from faulting instruction
• A fault in the primary core may result in deadlock:
e.g. the instruction that produces R5 is waiting for R5 to be
produced (instead of R4)
  – A timeout in the checker signals an error
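The timeout can be pictured as a watchdog counter in the checker. The sketch below is an assumption-laden illustration: `run_checker`, the per-cycle stream of instruction ids, and the `timeout_cycles` threshold are all hypothetical, not taken from the paper.

```python
def run_checker(commit_stream, timeout_cycles):
    """Scan a per-cycle stream; each entry is an instruction id, or None
    when nothing arrived from the primary core that cycle.

    Returns ("ok", committed) if the stream ends normally, or
    ("timeout", committed) when nothing arrives for timeout_cycles
    consecutive cycles, which is treated as an error.
    """
    idle = 0
    committed = []
    for entry in commit_stream:
        if entry is None:
            idle += 1
            if idle >= timeout_cycles:
                return "timeout", committed
        else:
            idle = 0
            committed.append(entry)
    return "ok", committed

# The primary core deadlocks after i2 and delivers nothing further.
stream = ["i0", "i1", None, "i2", None, None, None, None]
print(run_checker(stream, timeout_cycles=4))   # ('timeout', ['i0', 'i1', 'i2'])
```

On a timeout, the instructions already committed remain valid, and the primary core would be restarted from the point where progress stopped.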
Redundant Multi-Threading
• Execute two threads in parallel (CMP or SMT) – each
thread maintains its own register state
• Threads execute as in a conventional processor, except:
  – the trailing thread commits after verifying the result
  – the leading thread commits stores to a buffer; these
    get written to cache/memory only after verification
  – load values of the leading thread are sent to the trailing
    thread, so the trailing thread never accesses the data cache
  – branch outcomes are also sent to the trailing thread
[Diagram: Leading Thread and Trailing Thread, connected by arrows labeled
"Reg results, load values, branch outcomes" and "Store values"]
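The leading/trailing protocol above can be sketched with two queues. Everything here is a toy model under stated assumptions: the three-operation instruction set (`load`, `addi`, `store`), the function names, and running the leading thread to completion before the trailing thread starts are simplifications; real threads run concurrently with a small slack.

```python
from collections import deque

def step(regs, op, reg, arg, load_value=None):
    """Apply one toy instruction; the caller supplies load_value for loads."""
    if op == "load":
        regs[reg] = load_value
    elif op == "addi":
        regs[reg] += arg

def leading_thread(program, memory):
    """Loads read memory and are forwarded; stores go to a buffer, not memory."""
    regs, load_queue, store_buffer = {}, deque(), deque()
    for op, reg, arg in program:
        if op == "load":
            value = memory[arg]
            load_queue.append(value)
            step(regs, op, reg, arg, value)
        elif op == "store":
            store_buffer.append((arg, regs[reg]))
        else:
            step(regs, op, reg, arg)
    return load_queue, store_buffer

def trailing_thread(program, load_queue, store_buffer, memory):
    """Re-execute with forwarded load values; release a store only if it matches."""
    regs = {}
    for op, reg, arg in program:
        if op == "load":
            step(regs, op, reg, arg, load_queue.popleft())  # no data cache access
        elif op == "store":
            if store_buffer.popleft() != (arg, regs[reg]):
                return False                 # mismatch: error detected
            memory[arg] = regs[reg]          # verified, now written to memory
        else:
            step(regs, op, reg, arg)
    return True

program = [("load", "r1", 0), ("addi", "r1", 5), ("store", "r1", 1)]
memory = {0: 10, 1: 0}
loads, stores = leading_thread(program, memory)
print(trailing_thread(program, loads, stores, memory), memory[1])   # True 15
```

Corrupting a buffered store before the trailing thread runs makes `trailing_thread` return False and leaves memory untouched, which is the property the store buffer is there to provide.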
Fault Model
• A single error in either core can be detected
• Since loads are not replicated, the load/store datapath
must be ECC protected
• For recovery, a second checker thread is required
• ECC in the checker register file will enable recovery
in most cases without a second checker
RMT on SMT/CMP
SMT:
+ Does not require inter-core traffic: values can be
read from the shared register file/data cache
– Single-thread performance may be degraded
– Each redundant instruction executes on a high-power pipeline
CMP:
+ Trailing core can be a simple in-order processor:
low power/area overheads
+ Trailing core’s frequency can be independently
controlled
+ Heterogeneous CMP where cores can be dynamically
employed for throughput/reliability
+ Lower probability for errors
Parallelization of Trailing Thread
[Diagram: a Sequential Thread shown against Parallel Threads 1 to 4]
Is it more power-efficient to execute the verification thread in parallel?
If the trailing cores are frequency-scaled, dynamic power does not
change, but leakage power increases
If the trailing cores are frequency-and-voltage scaled, dynamic power
decreases, and leakage power increases
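The two statements above follow from a simple first-order power model, sketched here under loud assumptions: dynamic power per core scales as V² × f, leakage per core scales linearly with V, the verification work parallelizes perfectly across 4 cores, and the constant `leak_per_core` and the scaled voltage 0.7 are made-up values.

```python
def total_power(n_cores, voltage, frequency, leak_per_core=0.3):
    """Normalized (dynamic, leakage) power for n_cores identical cores.

    A baseline core runs at voltage = frequency = 1 and has dynamic power 1.
    leak_per_core is an arbitrary illustrative constant.
    """
    dynamic = n_cores * voltage ** 2 * frequency
    leakage = n_cores * leak_per_core * voltage
    return dynamic, leakage

baseline = total_power(1, 1.0, 1.0)       # one trailing core at full speed
freq_scaled = total_power(4, 1.0, 0.25)   # four cores at 1/4 frequency
dvfs = total_power(4, 0.7, 0.25)          # four cores, frequency and voltage scaled

for name, (dyn, leak) in [("baseline", baseline),
                          ("frequency-scaled", freq_scaled),
                          ("frequency-and-voltage-scaled", dvfs)]:
    print(f"{name}: dynamic={dyn:.2f} leakage={leak:.2f}")
```

Under this model, four cores at a quarter of the frequency burn the same dynamic power as one full-speed core but four times the leakage; lowering the voltage as well cuts dynamic power, while total leakage still exceeds the single-core baseline.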
Error Types
Acronyms!!
• MTTF & MTBF: Mean time to/between failures
• Errors are either SDC (silent data corruption) or DUE
(detected unrecoverable errors)
Many errors get masked:
• ACE bits: these bits are required for architecturally
correct execution
• un-ACE bits: these bits do not affect the final output
• AVF: architectural vulnerability factor (the percentage of
time/space that a structure holds ACE state)
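The AVF definition can be made concrete with a small calculation. The sketch assumes a hypothetical 4-entry, 8-bit-wide queue observed for 10 cycles, with an invented per-cycle count of entries holding ACE state; the function name `avf` is also an assumption.

```python
def avf(ace_bit_cycles, total_bits, total_cycles):
    """Fraction of a structure's bit-cycles that hold ACE state."""
    return ace_bit_cycles / (total_bits * total_cycles)

ENTRIES, WIDTH = 4, 8                        # hypothetical structure size
occupancy = [2, 2, 3, 1, 0, 0, 4, 4, 2, 2]   # ACE entries in each of 10 cycles
ace_bit_cycles = sum(n * WIDTH for n in occupancy)

print(avf(ace_bit_cycles, ENTRIES * WIDTH, len(occupancy)))   # 0.5
```

An AVF of 0.5 here means that a bit flip at a random place and time in this structure would matter to the final output about half the time; the rest is masked.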
Partial Coverage
• RMT covers faults in the entire core (almost!)
• If that is too expensive, provide error coverage in
specific structures to reduce error probabilities
• Are there ways to ensure that an instruction spends less
time in architecturally vulnerable structures?