Hardware Reliability Margining for the
Dark Silicon Era
Liangzhen Lai and Puneet Gupta
Department of Electrical Engineering
University of California, Los Angeles
[email protected]
This work is supported in part by NSF Variability Expedition grant CCF-1029030
1
Outline

Overview

Accumulation Model and Management Policies

Problem Formulation

Experimental Results

Conclusion
2
Hardware Reliability Margin

Parametric margin
  Voltage/frequency or sign-off corners
  E.g., BTI, HCI

Physical margin
  Metal width, layout spacing
  E.g., current-dependent minimum metal width for EM

Typically worst-case driven
  Mostly derived at hardware design time
  Uncertainty in workload, circuit operating points, etc.
3
Reliability vs. Operating Points

Most reliability-related phenomena depend heavily on the circuit operating points
  Voltage, frequency, temperature, etc.
4
Dynamic Range of Operations

Efficiency needed for the Dark Silicon Era
  Multi/many-core design with less powerful cores
  Low voltage/current/power -> less margin

“Turbo X”: Turbo Boost (Intel), Turbo Core (AMD)
  Under certain conditions
  High voltage/current/power -> more margin

                     Low stress states     High stress states
Workload             Moderate, parallel    Intensive, single-thread
Reliability margin   Known optimistic      Known pessimistic
5
Dark Silicon Contexts

Pessimism depends on the difference between peak power/temperature and sustainable power/temperature
  Quantify silicon “darkness”
  Dark ratio:

Power constraint
  Limit on maximum instantaneous power

Thermal constraint
  Limit on maximum on-chip temperature
6
Margining Methodology

Formulate as workload optimization
  Maximize the reliability degradation
  While still meeting the power/thermal constraints
7
Dynamic Reliability Model

Most reliability models are static
  Derived for constant voltage/current/temperature

Need a highly dynamic model for optimization
  Comparing different degradation scenarios

[Figure: two voltage-versus-time (v, t) stress profiles built from power states P1, P2 and P3, compared side by side]
9
Accumulation Model
[Diagram: time spent in each power state -> accumulation model -> worst-case degradation at the end of lifetime]

Some can be derived from the model itself
  E.g., EM can be modeled by the effective current density Jeff

Others can be derived by simulator
  E.g., worst-case BTI degradation can be derived by simulating different power-state orderings and picking the worst case
  Fitting and interpolation can also be used
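
As a rough illustration of the interpolation idea, the sketch below looks up degradation between simulated sample points. It is only a sketch under stated assumptions: the function name, the single input variable (fraction of lifetime in a high-stress state), and the sample values are all hypothetical and do not come from the slides.

```python
from bisect import bisect_right


def interpolate_degradation(samples, fraction):
    """Linearly interpolate degradation between simulated sample points.

    samples: (fraction_in_high_stress, degradation) pairs sorted by fraction.
             The numbers used below are made up for illustration only.
    fraction: fraction of lifetime spent in the high-stress state.
    """
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    if not xs[0] <= fraction <= xs[-1]:
        raise ValueError("fraction outside the sampled range")
    i = bisect_right(xs, fraction)
    if i == len(xs):  # fraction equals the last sample point
        return ys[-1]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (fraction - x0) / (x1 - x0)


# Hypothetical simulator outputs (arbitrary degradation units).
SAMPLES = [(0.0, 10.0), (0.5, 30.0), (1.0, 40.0)]
estimate = interpolate_degradation(SAMPLES, 0.25)
```

In practice the accumulation model takes the full power-state distribution as input, so a real table would be multi-dimensional; the one-dimensional case is used here only to keep the example short.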
10
Spatial problem vs. Temporal problem

With the accumulation model, reliability degradation can be modeled as a temporal distribution problem
[Figure: voltage (v) versus time (t) through power states P1, P2, P3]

The workload and power/thermal constraints are spatial problems
[Figure: power states P1, P2, P3 assigned across cores]
11
System Management Policy

We assume a fair round-robin policy
  Rotate scheduling priorities among all processor cores
  Rotation period can range from hours to days

We assume this policy because it is:
  Simple: open-loop, reasonable to assume at hardware design time
  Effective: sufficient iterations to balance workload during a typical hardware lifetime of multiple years
  Pessimistic: more sophisticated policies are likely to perform better, i.e., the margin is pessimistic
12
Bridging Spatial and Temporal Problems

Management policy will rotate workload among all cores
  Spatial distribution is equivalent to temporal distribution

[Figure: spatial constraints (power states P1, P2, P3 assigned across cores) mapped to a temporal distribution (voltage v versus time t through P1, P2, P3)]
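
The equivalence between spatial and temporal distributions can be checked with a small simulation. This is a minimal sketch, assuming an idealized rotation in which the spatial assignment shifts by one core per step; the function name and the example assignment are hypothetical.

```python
from collections import Counter


def rotate_assignments(spatial_states, num_rotations):
    """Per-core fraction of time in each power state under fair rotation.

    spatial_states: power state of each core at one instant (hypothetical).
    num_rotations: number of round-robin steps; each step shifts the
                   assignment by one core.
    """
    n = len(spatial_states)
    per_core = [Counter() for _ in range(n)]
    for step in range(num_rotations):
        for core in range(n):
            state = spatial_states[(core + step) % n]
            per_core[core][state] += 1
    return [
        {state: count / num_rotations for state, count in counts.items()}
        for counts in per_core
    ]


# Four cores: two in P1, one in P2, one in P3 at any instant.
spatial = ["P1", "P1", "P2", "P3"]
temporal = rotate_assignments(spatial, num_rotations=400)
```

With the number of rotations a multiple of the core count, every core spends 50% of its time in P1 and 25% each in P2 and P3, which matches the instantaneous spatial mix.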
13
Optimization Under Power Constraints

x is the number of cores in each power state
  Also the input to the accumulation model f(x)

P is the power corresponding to each power state

Pmax is the power constraint

Formulated as an Integer Linear Programming (ILP) problem
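
The structure of the power-constrained problem can be illustrated with a brute-force search over integer core counts. This is a sketch under several assumptions that are not stated on the slide: the accumulation model f(x) is taken to be linear with hypothetical per-state stress weights, the core counts must sum to the total number of cores, and an "off" state with zero power is included. A real flow would hand the same formulation to an ILP solver.

```python
from itertools import product


def worst_case_distribution(powers, stress, num_cores, p_max):
    """Find the core-count vector x that maximizes a linear stress f(x).

    powers: per-core power of each state (hypothetical values).
    stress: per-core stress weight of each state; stands in for f(x).
    Assumed constraints: sum(x) == num_cores, sum(powers[i] * x[i]) <= p_max.
    """
    best_x, best_f = None, float("-inf")
    for x in product(range(num_cores + 1), repeat=len(powers)):
        if sum(x) != num_cores:
            continue  # every core must be in exactly one state
        if sum(p * n for p, n in zip(powers, x)) > p_max:
            continue  # violates the instantaneous power limit
        f = sum(w * n for w, n in zip(stress, x))
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f


# States: off, low, nominal, turbo (all numbers are illustrative).
POWERS = [0.0, 1.0, 2.0, 4.0]
STRESS = [0.0, 1.0, 3.0, 8.0]
best_x, best_f = worst_case_distribution(POWERS, STRESS, num_cores=4, p_max=8.0)
```

With these made-up numbers the worst case keeps two cores dark and runs two at turbo, which is the kind of distribution the margin would then be derived from.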
15
Thermal Problem

Thermal limit can be reached in two scenarios
  Heat up then cool down (left)
  Constant temperature (right)

The constant stress will result in worse degradation
  Higher average temperature
  More time in high power state
16
Optimization Under Thermal Constraints


S is the time spent in each power state for each core

A is the temperature sensitivity matrix
  Temperature increase per unit power

Tmax is the maximum temperature constraint

Tbak is the background power for each core

Formulated as a Linear Programming (LP) problem
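
The thermal constraint can be illustrated by evaluating one candidate S against the limit. This is a minimal sketch, not the LP itself: it assumes steady-state temperature is the sensitivity matrix applied to each core's time-averaged power, and it treats Tbak as a per-core temperature offset (an assumption; the slide describes it as background power). All names and numbers are hypothetical.

```python
def core_temperatures(sens, time_frac, state_power, t_bak):
    """Estimate per-core temperature from time fractions in each state.

    sens: A, temperature rise at core i per unit power at core j.
    time_frac: S, rows are cores, columns are power states.
    state_power: P, power of each state.
    t_bak: per-core background term, assumed here to be a temperature.
    """
    n = len(time_frac)
    avg_power = [
        sum(f * p for f, p in zip(time_frac[core], state_power))
        for core in range(n)
    ]
    return [
        t_bak[i] + sum(sens[i][j] * avg_power[j] for j in range(n))
        for i in range(n)
    ]


def meets_thermal_limit(temps, t_max):
    """True if every core stays at or below the temperature limit."""
    return all(t <= t_max for t in temps)


A = [[10.0, 2.0], [2.0, 10.0]]  # illustrative sensitivity values
S = [[0.5, 0.5], [1.0, 0.0]]    # core 0 splits time; core 1 stays low
P = [1.0, 3.0]                  # low and high state power
T_BAK = [45.0, 45.0]
temps = core_temperatures(A, S, P, T_BAK)
```

An LP solver would search over S for the distribution that maximizes degradation while this check still passes.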
17
Experimental Setup

Power model
  Based on a commercial processor benchmark
  Using libraries characterized at different supply voltages from 0.6 V to 0.9 V

Thermal model
  Using HotSpot simulator
  Consider the cases of 2x2, 4x4, 8x8 and 16x16 cores

BTI: both NBTI and PBTI

EM: metal sized to have the same current density (MTTF)
19
Local Power Network EM Results
[Figure: results under power constraint and under thermal constraint; annotation: 40% reduction]
20
Signal Wire EM Results
[Figure: results under power constraint and under thermal constraint; annotation: 60% reduction]
21
BTI Results
[Figure: results under power constraint and under thermal constraint; annotation: 20% reduction]
22
Conclusion



We propose a hardware reliability margining methodology for chips in the dark silicon era

We formulate the margining problem under power and thermal constraints

Experimental results show that at 60% dark ratio, our method can achieve 40%-60% reduction in metal width margin and 20% reduction in BTI delay margin
23
Backup slides
24
EM Accumulation Model

Effective current density:

For local power mesh:
  Jeff can be calculated from the average power consumed

For signal wires:
  Jeff is proportional to V * f
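
The two EM cases above can be sketched as time-weighted averages over power states. The functions return relative quantities only (proportionality constants are omitted), and the time weighting for signal wires is an assumption made for this sketch. Function names and the per-state power and frequency values are hypothetical; the voltages sit within the 0.6 V to 0.9 V range used in the experiments.

```python
def mesh_jeff_relative(time_frac, state_power):
    """Average power over the lifetime, a proxy for power-mesh Jeff."""
    return sum(f * p for f, p in zip(time_frac, state_power))


def signal_jeff_relative(time_frac, voltages, freqs):
    """Time-weighted V * f, a proxy for signal-wire Jeff (assumed form)."""
    return sum(t * v * f for t, v, f in zip(time_frac, voltages, freqs))


# Fraction of lifetime in each of three power states (illustrative).
TIME_FRAC = [0.5, 0.25, 0.25]
STATE_POWER = [1.0, 2.0, 4.0]   # arbitrary power units
VOLTAGES = [0.6, 0.75, 0.9]     # volts
FREQS = [1.0, 2.0, 3.0]         # arbitrary frequency units

mesh = mesh_jeff_relative(TIME_FRAC, STATE_POWER)
signal = signal_jeff_relative(TIME_FRAC, VOLTAGES, FREQS)
```

Because both proxies depend only on the time fractions, they need no simulation of state ordering, which is why EM is listed among the cases that can be derived from the model itself.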
25
BTI Accumulation Model

Two steps:

Identify the worst-case ordering by simulator
  Worst BTI degradation happens when power states are applied in increasing order of stress voltage

Fit the accumulation model
  First pick a set of power state distribution samples x
  Simulate the degradation g(x)
  Assuming the fitting function is
  Formulated as:
26