slides - Oregon State University

Microprocessor Reliability
Robert Pawlowski
ECE 570 – 2/19/2013
1
Reliability
• Involves different aspects of a processor
that can affect performance and functionality.
– Ultimately can reduce the lifetime of the
processor.
• Issues typically manifest themselves at the
device level.
– Solutions can be implemented at multiple design
levels.
2
Why the concern?
• Operating at highest frequencies and/or lowest
power possible increases sensitivity to process-related variabilities.
– Gate length/doping concentration variations
– Temperature
– Supply voltage droops
• This decreases processor yield
• Decreasing device sizes → Increased effect of
external issues
3
Outline
• Error Classification
• Hard Errors
• Soft Errors
• Sources of radiation
• Device/Circuit approaches
• Architectural approaches
• Error detection
• Error correction
• System level impact
4
Processor Error Classification
• Hard Errors will result in permanent processor failure.
• Processor lifetime is inversely proportional to hard error rate.
• Soft Errors do not permanently damage the device.
5
Hard Errors
• Extrinsic failures
– Caused by process and manufacturing defects
– Occur with decreasing rate over time
– No impact from micro-architecture
• Intrinsic failures
– Related to processor wear-out
– Occur with increasing rate over time
– Related to wafer packaging, process parameters, and
processor design.
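A minimal sketch of the two failure-rate trends above, assuming a Weibull hazard model (the Weibull form and the shape values are illustrative assumptions, not taken from the slides). A shape parameter below 1 gives a falling failure rate, as with extrinsic failures; a shape above 1 gives a rising rate, as with intrinsic wear-out.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) of a Weibull distribution (assumed model)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical time points in arbitrary units.
times = [0.5, 1.0, 2.0, 4.0]

# shape < 1: failure rate decreases over time (extrinsic, defect-driven).
extrinsic = [weibull_hazard(t, shape=0.5) for t in times]

# shape > 1: failure rate increases over time (intrinsic, wear-out).
intrinsic = [weibull_hazard(t, shape=3.0) for t in times]

for t, e, i in zip(times, extrinsic, intrinsic):
    print(f"t={t:>4}: extrinsic rate={e:.3f}, intrinsic rate={i:.3f}")
```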
6
Hard Errors
7
Soft Errors
• Occur in both memory and logic
– External radiation main issue in memory
• Alpha particles
• High energy neutrons
• Thermal neutrons
• Different causes of transient errors in logic
– External radiation
– Supply voltage droop
• Power supply fluctuations
– Ground bounce, cross-talk
– Process variation, temperature
– Affect delay of computational paths
8
Radiation-Induced Soft Errors
• Ionized particle strike causing a state change
• No permanent damage (not a hard error)
• Combo logic – Single Event Transients (SET)
• Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU)
• Three causes of soft errors
– Alpha particles
– Thermal neutrons
– High-energy neutrons
10
Alpha-Particles
• Emitted from impurities in packaging materials.
• Create electron-hole pairs through direct ionization
• Range for a 10 MeV particle < 100um
– Typical energy 4-9MeV
• Improved manufacturing trends → Reduced effect
– Purified materials
– Shielding layers
11
Neutrons
• Result of cosmic ray reactions
with atmosphere
• High-Energy neutrons react
with chip materials.
• Concrete is the only shielding
material
– Flux drops 1.4x per foot of
thickness
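The 1.4x-per-foot figure can be turned into a quick estimate. This is a sketch that assumes the reduction compounds multiplicatively with each foot of concrete; the function name and the normalized flux value are hypothetical.

```python
def attenuated_flux(flux0, feet, factor_per_foot=1.4):
    """Neutron flux after `feet` of concrete, assuming a 1.4x drop per foot."""
    return flux0 / factor_per_foot ** feet


# Normalized incoming flux of 1.0 behind 0 to 3 feet of concrete.
for feet in range(4):
    print(feet, "ft:", round(attenuated_flux(1.0, feet), 3))
```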
12
Neutrons
• Thermal neutrons (<<< 1MeV) react with Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer.
– Produce ionized particles that can cause soft errors
• Solution → Remove BPSG from advanced processes
• Mostly solved, but SEUs are still found at 45nm and 90nm
13
Device-level solutions
• Larger device sizes → Larger capacitance
– Increase the amount of charge necessary to flip bit
(critical charge)
• Multiple VT design
– Sensitivity to variation at low-VDD may limit
effectiveness.
• Body biasing also common to both radiation
hardening and variation tolerance
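Why larger capacitance helps can be seen with a first-order estimate. This sketch assumes the common simplification that critical charge is roughly node capacitance times supply voltage; real cells also depend on restoring current and pulse shape, and the capacitance and voltage values here are hypothetical.

```python
def critical_charge_fc(node_cap_ff, vdd):
    """First-order estimate Q_crit ~ C_node * V_DD (fF * V gives fC)."""
    return node_cap_ff * vdd


# Hypothetical storage nodes: a small device and one with twice the capacitance.
small = critical_charge_fc(node_cap_ff=1.0, vdd=1.0)
large = critical_charge_fc(node_cap_ff=2.0, vdd=1.0)

# The same larger node at a reduced supply loses part of that margin.
large_low_vdd = critical_charge_fc(node_cap_ff=2.0, vdd=0.5)

print(small, large, large_low_vdd)
```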
15
Circuit-level solutions
• DICE cell
– Used for SRAM, FF’s, latches
• Built-in current sensors on supply lines of memory
cells.
16
Modular redundancy
• Dual Modular Redundancy
[Figure: data in drives a Main Module and a Replicated Module; outputs are data out and error]
• Triple Modular Redundancy
[Figure: data in drives a Main Module and two Replicated Modules; a Voter produces data out]
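A minimal behavioral sketch of the two schemes, modeling each module output as an integer. The function names are hypothetical. DMR can only flag a mismatch, while TMR masks a single faulty module through a bitwise majority vote.

```python
def dmr(main_out, replica_out):
    """Return (data_out, error). A mismatch is detected but not corrected."""
    return main_out, main_out != replica_out


def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


# One replica suffers a single bit flip (bit 3).
good = 0b1010
faulty = good ^ 0b1000

print(dmr(good, faulty))                 # error flag is raised
print(bin(tmr_vote(good, good, faulty)))  # voter recovers the good value
```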
18
Redundant Circuits
• Redundancy increases area/power
• DMR/TMR in sub/near-VT
– Timing variation between circuits increases
• Utilization of redundant lanes for parallel operation
can increase throughput at low-VDD
19
Self-Checking Circuits
• Partition circuit into smaller blocks
– Error checker for each block
• Use error detection codes
– Berger codes
– Arithmetic codes
• Increases circuit delay for error computation
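As one concrete case of an error detection code, here is a sketch of a Berger code check in software. The check symbol is the count of 0 bits in the data word, which lets a checker catch unidirectional errors (all flips going 1 to 0, or all going 0 to 1). Function names and the 4-bit width are illustrative assumptions.

```python
def berger_check(data, width):
    """Check symbol: number of 0 bits in the `width`-bit data word."""
    ones = bin(data & ((1 << width) - 1)).count("1")
    return width - ones


def berger_ok(data, check, width):
    """True if the stored check symbol still matches the data word."""
    return berger_check(data, width) == check


word = 0b1011
check = berger_check(word, 4)

# A 1 -> 0 flip raises the zero count, so the stored check no longer matches.
corrupted = word & ~0b1000 & 0xF
print(berger_ok(word, check, 4), berger_ok(corrupted, check, 4))
```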
20
Circuit-Level Speculation
• Uses approximated circuit implementation
– Goal is to reduce critical path
21
Tunable Replica Circuits
• Mirrors delay of critical path
• Monitors for errors over voltage/frequency
changes
22
Timing Speculation
[Figure: Razor flip-flop. A main DFF (clk) with a 0/1 input mux and a Shadow Latch (delayed clk) both sample data in; outputs are data out and error. A timing diagram shows data values D0, D1, D2.]
• Razor timing error detection
– Designed for transient faults
– Effective against SETs and SBUs on flip-flops
• Requires error recovery
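A behavioral sketch of the detection idea: the main flip-flop samples at the clock edge, the shadow latch samples on a delayed clock, and a mismatch raises the error signal. This is a simplified model under stated assumptions (a late-arriving value is modeled as the main flip-flop keeping the old value, and metastability is ignored); the names and timing numbers are hypothetical.

```python
def razor_stage(arrival_time, clk_period, shadow_delay, new_value, old_value):
    """Return (main_ff, shadow, error) for one pipeline stage.

    The main flip-flop captures new_value only if it arrives by the clock
    edge; the shadow latch gets an extra `shadow_delay` of slack.
    """
    main_ff = new_value if arrival_time <= clk_period else old_value
    shadow = new_value if arrival_time <= clk_period + shadow_delay else old_value
    error = main_ff != shadow
    return main_ff, shadow, error


# Data meets timing: both elements agree, no error.
print(razor_stage(0.8, 1.0, 0.3, new_value=1, old_value=0))

# Data arrives late but within the shadow window: error is flagged,
# and the shadow latch holds the value to restore during recovery.
print(razor_stage(1.2, 1.0, 0.3, new_value=1, old_value=0))
```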
23
Error Recovery Options in Scalar Processors
• Clock Gating:
– Global error signal
– Clock gating
– 1-cycle penalty
25
Error Recovery Options in Scalar Processors
• Multiple Issue:
– Error signals propagated to control unit
– Instructions must be flushed
– Error instruction then replayed
– 2N-cycle penalty
26
Error Recovery Options in Scalar Processors
• Counter-flow pipelining
• Micro-rollback
27
Error correcting codes for memories
• Most common is Hamming code
• Check bits stored when data written
• Identifies error and erroneous bit position
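The three points above can be sketched with a Hamming(7,4) code, chosen here only as the smallest standard example; real memories typically use wider codes. Check bits are computed on write, and on read the syndrome gives the position of a single flipped bit (0 means no error). Function names are hypothetical.

```python
def hamming74_encode(d):
    """4 data bits -> 7-bit codeword, positions 1..7 with parity at 1, 2, 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected_word, error_position); position 0 means no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome read as a 1-based bit position
    if pos:
        w[pos - 1] ^= 1
    return w, pos


stored = hamming74_encode([1, 0, 1, 1])  # check bits stored at write time
read_back = list(stored)
read_back[4] ^= 1                        # single-bit upset at position 5
fixed, position = hamming74_correct(read_back)
print(position, fixed == stored)
```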
28
Error correcting codes for memories
• Single-bit ECC adds area/power and delay
– Low-VDD → Increased delay
– Hybrid VDD operation will reduce delay
• Overhead increases for multi-bit ECC
– Increased memory density → higher probability of
MBU
– Current research shows an increase in the ratio of MBU to total
SER in sub-VT
29
System-Level Impact
• Soft errors can have a large effect on
processor functionality
– Increasing issue with further device scaling
• All methods of error detection/correction are
costly
– Need to be added to system blocks wisely
• SEU distribution
• Effects of process variation
31
System-Level Impact
• How to determine what blocks have the
highest system-level impact?
– Mostly through simulation
• For radiation: all-encompassing
– Includes fault injection @ circuit level
• Different models have been developed
– ReStore – University of Illinois at Urbana-Champaign
• Focuses on system level effect of radiation-induced errors
– RAMP – IBM
• Directed more towards hard-errors and processor failure.
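A toy sketch of simulation-based fault injection, to show how a block's system-level impact might be estimated. It is not ReStore or RAMP, and it works on a plain function rather than at circuit level: each bit of an input is flipped in turn, and the fraction of flips that change the output is counted. All names and values are hypothetical.

```python
def flip_bit(value, bit):
    """Inject a single-bit upset at position `bit`."""
    return value ^ (1 << bit)


def visible_error_fraction(func, value, width):
    """Fraction of single-bit flips in `value` that change func's output."""
    golden = func(value)
    visible = sum(func(flip_bit(value, b)) != golden for b in range(width))
    return visible / width


# Hypothetical block that only uses the low 4 bits of an 8-bit register:
# flips in the upper 4 bits are masked and never reach the output.
def low_nibble(x):
    return x & 0x0F


print(visible_error_fraction(low_nibble, 0xA5, 8))
```

Blocks where most injected faults are masked are weaker candidates for costly protection than blocks where nearly every flip is visible.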
32
Questions?
33