Transcript Slide 1
Overcoming Hard-Faults in
High-Performance Microprocessors
Presented by:
I2PC Talk
Amin Ansari
Sept 15, 2011
University of Michigan
Electrical Engineering and Computer Science
Significance of Reliability
Mission Critical Systems (e.g., Full Authority Digital Engine or Brake CU), equipped with:
• Triple Modular Redundancy
• Watchdog Timer
• Error Correction Code
• Fault-Tolerant Scheduling
Financial Analysis or Transactions
• RAID
• HP Tandem NonStop
• IBM z series
Commodity Systems (Desktop or Server Processor)
• ECC
• Hard-Faults: Core disabling
Hard-Faults
Main sources
o Manufacturing defects
o Process variation induced
o In-field wearout
o Ultra low-power operation
Have a direct impact on
o Manufacturing yield
o Performance
o Lifetime throughput
o Dependability of semiconductor parts
Manufacturing Defects
Happen due to
o Silicon crystal defects
o Random particles on the wafer
o Fabrication imprecision
ITRS: One defect per five 100 mm² dies expected
o A real threat for yield
Challenges with High-Performance µPs
Protecting HPµPs against hard-faults is more challenging
o Contain billions of transistors
o Complex connectivity + many stages
o Fine-grained redundancy is not cost-effective
o ↑ transistors per core (no core disabling)
o Higher clock frequency, voltage, temp.
o ↑ operational stress accelerates aging
o Operating at most aggressive V/F curve
o Usage of high V/F guard-bands
o Large on-chip caches
o Bit-cell with worst timing characteristics dictates the V&F of the SRAM array
Examples: [AMD, Phenom], [IBM, POWER7], [Intel, Nehalem]
Outline
Objective: overcome hard-faults in high-performance µPs with comprehensive, low-cost solutions for protecting the on-chip caches and also the non-cache parts of the core.
Archipelago [HPCA’11] protects on-chip caches against:
● Near-threshold failures
● Process variation
● Wearout and defects
Necromancer [ISCA’10, IEEE Micro’10] protects general core area (non-cache parts) against:
● Manufacturing defects
● Wearout failures
NT Operation: SRAM Bit-Error-Rate
Extremely fast growth in failure rate with decreasing Vdd
Our Goal
Enabling DVS to push core’s Vdd down to
o Ultra low voltage region ( < 650mV )
o While preserving correct functionality of on-chip caches
Proposing a highly flexible and FT cache architecture that can efficiently tolerate these SRAM failures
Minimizing our overheads in high-power mode
Archipelago (AP)
This particular cache has only a single functional line.
By forming autonomous islands, AP saves 6 out of 8 lines.
[Figure: eight cache lines (1 to 8), each divided into data chunks, grouped into Island 1 and Island 2; each island has one sacrificial line]
Baseline AP Architecture
Two lines have a collision if they have at least one faulty chunk in the same position (blue and orange are collision-free)
There should be no collision between lines within a group
[Group 3 (G3) contains green, blue, and orange lines]
[Figure labels: Input Address, Fault map address, Memory Map (10T), Fault Map (10T), First Bank, Second Bank, Data line, Sacrificial line (S), G3, MUXing layer, Functional Line]
Added modules:
● Memory map
● Fault map
● MUXing layer
Two types of lines:
● data line
● sacrificial line
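The collision rule and the role of the MUXing layer can be illustrated with a small Python sketch. It is only a sketch under assumptions: each line's fault-map entry is modeled as a bitmask with one bit per data chunk (1 = faulty chunk), and the names `collides`, `group_is_collision_free`, and `mux_select` are hypothetical, not taken from the talk.

```python
from itertools import combinations


def collides(fault_a: int, fault_b: int) -> bool:
    """Two lines collide if they have a faulty chunk in the same position."""
    return (fault_a & fault_b) != 0


def group_is_collision_free(fault_masks: list[int]) -> bool:
    """A valid group has no collision between any pair of its lines."""
    return not any(collides(a, b) for a, b in combinations(fault_masks, 2))


def mux_select(data_fault: int, num_chunks: int) -> list[str]:
    """Per chunk, name the source a MUXing layer would forward:
    the data line if its chunk is fault-free, else the sacrificial line."""
    return ["sacrificial" if (data_fault >> i) & 1 else "data"
            for i in range(num_chunks)]


# Hypothetical 4-chunk lines; bit i set means chunk i is faulty.
blue, orange, green = 0b0001, 0b0100, 0b1000
print(group_is_collision_free([blue, orange, green]))
print(mux_select(orange, num_chunks=4))
```

Because the lines of a group never share a faulty chunk position, the chunk that is faulty in the accessed data line is guaranteed to be fault-free in the group's sacrificial line, which is what lets the two be combined into one functional line.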
AP with Relaxed Group Formation
Sacrificial lines do not contribute to the effective capacity
o We want to minimize the total number of groups
[Figure: two arrangements of groups across the First Bank and Second Bank, with sacrificial lines marked S]
Semi-Sacrificial Lines
Semi-sacrificial line guarantees the parallel access
In contrast to a sacrificial line, it also contributes to the effective cache capacity
[Figure labels: First Bank, Second Bank, Accessed Line, Sacrificial line, Semi-sacrificial line, MUXing Layer]
AP with Semi-Sacrificial Lines
[Figure labels: Input Address, Memory Map, Fault Map, G3, First Bank (way0, way1), Second Bank (way0, way1), sacrificial line (S), semi-sacrificial line, MUXing layer, Functional Block]
AP Configuration
We model the problem as a graph:
o Each node is a line of the cache.
o Edge when there is no collision between nodes
A collision-free group forms a clique
o Group formation: finding the cliques
To maximize the number of functional lines, we need to minimize the number of groups.
o Minimum clique cover (MCC).
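As an illustration of group formation, the sketch below partitions lines into collision-free groups with a simple greedy heuristic. This is an assumption-laden example, not the configuration algorithm from the talk: minimum clique cover is NP-hard in general, the greedy pass does not guarantee a minimum, and the function name `greedy_groups` and the bitmask fault representation (one bit per chunk) are hypothetical. Bank placement and the choice of the sacrificial line within each group are left out.

```python
def greedy_groups(fault_masks: dict[int, int]) -> list[list[int]]:
    """Greedily partition lines into collision-free groups (cliques).

    fault_masks maps a line id to a bitmask of its faulty chunks.
    Heuristic only: the result is a valid clique cover, not
    necessarily a minimum one.
    """
    groups: list[list[int]] = []
    used: list[int] = []  # OR of the fault masks of each group's members

    # Place lines with the most faulty chunks first; they are hardest to fit.
    order = sorted(fault_masks,
                   key=lambda line: bin(fault_masks[line]).count("1"),
                   reverse=True)
    for line in order:
        mask = fault_masks[line]
        for g, taken in enumerate(used):
            if taken & mask == 0:  # no collision with any member of group g
                groups[g].append(line)
                used[g] = taken | mask
                break
        else:
            # The line collides with every existing group: open a new one.
            groups.append([line])
            used.append(mask)
    return groups


# Hypothetical fault map for five 4-chunk lines.
masks = {1: 0b0001, 2: 0b0010, 3: 0b0001, 4: 0b0100, 5: 0b0011}
print(greedy_groups(masks))
```

Checking a candidate line against the OR of a group's masks is equivalent to checking it pairwise against every member, since a collision with any member sets a shared bit.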
AP Configuration Example
[Figure: ten cache lines (1 to 10) across the First Bank and Second Bank, labeled G1(1), G1(2), G1(3), G1(S), G2(1), G2(2), G2(3), G2(4), G2(S), and D (Disabled); the corresponding graph partitions the lines into Island or Group 1, Island or Group 2, and one disabled line]
Operation Modes
High power mode (AP is turned off)
o There are no non-functional lines in this case
o Clock gating to reduce dynamic power of SRAM structures
Low power mode
o During the boot time in low-power mode
  BIST scans cache for potential faulty cells
  Processor switches back to high power mode
  Forms groups and configures the HW
Minimum Achievable Vdd
Performance Loss
One extra cycle of latency for L1 and two extra cycles for L2
Comparison with Alternative Methods
[Figure: Cache Area Overhead (%) (axis ticks at 1, 10, 100) versus Power (at minimum Vdd) Normalized to Archipelago (axis from 0.5 to 3); points for 10T, ZC, ECC-2, SECDED, Row Red, AP, and BF, marked as Conventional or Recently Proposed; annotations: 66% area overhead, Disabled: 25%, Disabled: 9%]
10T : [Verma, ISSCC’08]
ZC : [Ansari, MICRO’09]
BF : [Wilkerson, ISCA’08]
Archipelago: Summary
DVS is widely used to deal with high power dissipation
o Minimum achievable voltage is bounded by SRAM structures
We proposed a highly flexible cache architecture
o To tolerate failures when operating in near-threshold region
Using our approach
o Vdd of processor can be reduced to 375mV
o 79% dynamic power saving and 51% leakage power saving
o < 10% area and performance overheads
Outline
Archipelago [HPCA’11] protects on-chip caches against:
● Near-threshold failures
● Process variation
● Wearout and defects
Necromancer [ISCA’10, IEEE Micro’10] protects general core area (non-cache parts) against:
● Manufacturing defects
● Wearout failures
Necromancer (NM)
There are proper techniques to protect caches
To maintain an acceptable level of yield, the processing cores need to be protected
o More challenging due to inherent irregularity
Given a CMP system, Necromancer
o Utilizes a dead core (i.e., a core with a hard-fault) to do useful work
o Enhances system throughput
Impact of Hard-Faults on Program Execution
Distribution of injected hard-faults that manifest as architectural state mismatches across different latencies
o Based on the number of committed instructions before the mismatch happens, when starting from a valid architectural state
More than 40% of the injected faults cause an immediate (less than 10K instructions) architectural state mismatch.
Thus, a faulty core cannot be trusted to provide correct functionality even for short periods of program execution.
Relaxing Absolute Correctness Constraint
Distribution of injected faults resulting in similarity index mismatch across different latencies
Similarity Index: % of PCs matching between the faulty and golden execution (sample @1K instruction intervals)
For an SI threshold of 90%, in more than 85% of cases, the dead core can successfully commit at least 100K instructions before its execution differs by more than 10%
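The similarity index can be made concrete with a short Python sketch. It assumes one PC sample per 1K committed instructions, as stated on the slide; the cumulative (running) way of evaluating the index against the threshold is an assumption of this example, and the function names are hypothetical.

```python
def similarity_index(faulty_pcs: list[int], golden_pcs: list[int]) -> float:
    """Percentage of sampled PCs that match between the two executions."""
    if not golden_pcs or len(faulty_pcs) != len(golden_pcs):
        raise ValueError("expected two non-empty traces of equal length")
    matches = sum(f == g for f, g in zip(faulty_pcs, golden_pcs))
    return 100.0 * matches / len(golden_pcs)


def instructions_before_si_drop(faulty_pcs: list[int],
                                golden_pcs: list[int],
                                threshold: float = 90.0,
                                interval: int = 1000) -> int:
    """Committed instructions before the running SI first falls below
    `threshold`. Assumes one PC sample per `interval` instructions and a
    cumulative SI over all samples seen so far (an assumption of this sketch).
    """
    matches = 0
    n = 0
    for n, (f, g) in enumerate(zip(faulty_pcs, golden_pcs), start=1):
        matches += (f == g)
        if 100.0 * matches / n < threshold:
            return (n - 1) * interval
    return n * interval


# Hypothetical traces: ten sampled PCs, the faulty run diverges once.
golden = list(range(0, 40, 4))
faulty = golden[:4] + [0xDEAD] + golden[5:]
print(similarity_index(faulty, golden))
print(instructions_before_si_drop(faulty, golden))
```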
Using the Undead Core to Generate Hints
The execution behavior of a dead core coarsely matches the intact program execution for long time periods
How to exploit the program execution on the dead core?
o Accelerating the execution of another core!
o We extract useful information from the execution of the program on the dead core and send this information (hints) to the other core (the animator core), running the same program.
[Figure labels: Hard-fault, Undead Core, Hints, Animator Core, Performance]
Opportunities for Acceleration
IPC of several Alpha cores, normalized to EV4’s IPC.
Perfect hints:
o Perfect branch prediction
o No L1 cache miss
In most cases, by providing perfect hints for the simpler cores (EV4, EV5, and EV4 (OoO)), these cores can achieve a performance comparable to that achieved by a 6-issue OoO EV6.
[Figure: cores ordered by increasing complexity/resources]
Necromancer Architecture
A robust heterogeneous core coupling execution technique
The Undead Core
● Undead core executes the same program to provide hints for the AC.
● It works as “an external run-ahead engine for the AC”.
● A 6-issue OoO EV6 (evaluation)
The Animator Core
● Animator core is an older generation with the same ISA and fewer resources
● A 2-issue OoO EV4 (evaluation)
● Handles exceptions in NM coupled cores
● Treats $ hints as prefetching info
Hint Gathering
● I$ hints: PC of committed instructions
● D$ hints: address of committed ld/strs
● Branch prediction hints: BP updates
Hint Distribution
● A single queue for sending hints and cache fingerprints
● Most communications are from the undead core to the animator core, except resynchronization and hint disabling signals.
Hint Disabling
● Fuzzy hint disabling approach based on cont. monitoring of hints effectiveness
Resynchronization
● PC & arch. registers for resynch
Memory Hierarchy
● No communication for L2 warm-up
● D$ dirty lines are dropped when they are required to be replaced
● It can proceed on data L2 misses
[Figure labels: Queue (head, tail), Cache Fingerprint, Resynchronization signal and hint disabling information, pipeline stages FET DEC REN DIS EXE MEM COM (undead core) and FE DE RE DI EX ME CO (animator core), L1-Inst, L1-Data, Read-Only, Shared L2 cache]
Example: Branch Prediction Hints
Age tag ≤ num committed instructions + BP release window size
Hint Format: Type | Age | PC* | NPC
[Figure: the Necromancer architecture (The Undead Core, The Animator Core, Queue with head and tail, Hint Gathering, Hint Distribution, Hint Disabling, Cache Fingerprint, Resynchronization signal and hint disabling information, Memory Hierarchy with L1-Inst, L1-Data, Read-Only, and Shared L2 cache) with the branch prediction path shown. Labels: Original BP of AC, NM BP, NM Predictor, Tournament Predictor, Buffer, Counter, SC1, SC2, > Threshold, Disable Hint, Action, PC, NPC, Prediction Outcomes]
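The release condition (Age tag ≤ num committed instructions + BP release window size) and the "Counter > Threshold, Disable Hint" labels can be sketched in Python as follows. This is a loose software model under stated assumptions, not the hardware design from the talk: the age tag is taken to be the commit count at which the undead core produced the hint, the disabling counter is modeled as a simple up/down counter of unhelpful hints, and the names `BranchHint`, `HintBuffer`, `release`, and `record_outcome` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BranchHint:
    age: int  # age tag (assumed: commit count when the hint was produced)
    pc: int   # PC of the branch
    npc: int  # next PC observed on the undead core


class HintBuffer:
    """Holds branch hints until the animator core is close enough to use them."""

    def __init__(self, release_window: int, disable_threshold: int) -> None:
        self.queue: deque[BranchHint] = deque()
        self.release_window = release_window
        self.disable_threshold = disable_threshold
        self.bad_hints = 0  # assumed counter of recently unhelpful hints

    def push(self, hint: BranchHint) -> None:
        self.queue.append(hint)

    def release(self, committed: int) -> list[BranchHint]:
        """Pop hints with age tag <= committed instructions + release window."""
        ready = []
        while self.queue and self.queue[0].age <= committed + self.release_window:
            ready.append(self.queue.popleft())
        return ready

    def record_outcome(self, hint_was_correct: bool) -> None:
        """Track hint usefulness; the counter never goes below zero."""
        if hint_was_correct:
            self.bad_hints = max(0, self.bad_hints - 1)
        else:
            self.bad_hints += 1

    @property
    def hints_disabled(self) -> bool:
        return self.bad_hints > self.disable_threshold


buf = HintBuffer(release_window=2, disable_threshold=1)
for age in (1, 3, 10):
    buf.push(BranchHint(age=age, pc=0x400 + 4 * age, npc=0x404 + 4 * age))
print([h.age for h in buf.release(committed=1)])
```

Holding a hint back until the animator core's commit count is within the release window keeps the hint from updating prediction state too far ahead of the point where it applies.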
NM Design for CMP Systems
Impact of Hard-Fault Location
Overheads
[Figure: stacked bars of % Overhead (axis from 0% to 18%) in area and power for CMPs with 1, 2, 4, 8, and 16 cores; components: Necromancer Specific Structures in the Undead Core, Interconnection Wires + Hint Queue, Necromancer Specific Structures in the Animator Core, Animator Core (net overhead)]
Performance Gain
88%
71%
Necromancer: Summary
Enhancing system throughput by exploiting dead cores
Necromancer leverages a set of microarchitectural techniques to provide
o Intrinsically robust hints
o Fine and coarse-grained hint disabling
o Online monitoring of hints effectiveness
o Dynamic state resynchronization between cores
Applying Necromancer to a 4-core CMP
o On average, 88% of the original performance of the undead core can be retrieved
o Modest area and power overheads of 5.3% and 8.5%
Takeaways
Mission-critical and conventional reliability solutions are too expensive for modern high-perf. processors
AP: low-cost cache protection against major reliability threats in nanometer technologies
For processing core, redundancy
o NM an alternative to utilize dead cores
To achieve efficient, reliable solutions
o Runtime adaptability
o High degree of re-configurability
o Fine-grained spare substitution
Thank You