Hardware Reliability Margining for the Dark Silicon Era
Liangzhen Lai and Puneet Gupta
Department of Electrical Engineering
University of California, Los Angeles
[email protected]
This work is supported in part by NSF Variability Expedition grant CCF-1029030
1
Outline
Overview
Accumulation Model and Management Policies
Problem Formulation
Experimental Results
Conclusion
2
Hardware Reliability Margin
Parametric margin
Voltage/Frequency or sign-off corners
E.g., BTI, HCI
Physical margin
Metal width, layout spacing
E.g., current-dependent minimum metal width for EM
Typically worst-case driven
Mostly derived at hardware design time
Uncertainty in workload, circuit operating points etc.
3
Reliability vs. Operating Points
Most reliability-related phenomena depend heavily on the circuit operating points
Voltage, frequency, temperature, etc.
4
Dynamic Range of Operations
Efficiency needed for the Dark Silicon Era
Multi/Many-core design with less powerful cores
Low voltage/current/power -> less margin
“Turbo X”: Turbo Boost (Intel), Turbo Core (AMD)
Under certain conditions
High voltage/current/power -> more margin
Low stress states
Workload: moderate, parallel
Reliability margin: known optimistic
High stress states
Workload: intensive, single-thread
Reliability margin: known pessimistic
5
Dark Silicon Contexts
Pessimism depends on the difference between
peak power/temperature and sustainable
power/temperature
Power constraint
Limit on maximum instantaneous power
Quantify silicon “darkness”
Dark ratio:
Thermal constraint
Limit on maximum on-chip temperature
6
Margining Methodology
Formulate as a workload optimization problem
Maximize the reliability degradation
While still meeting the power/thermal constraints
7
Dynamic Reliability Model
Most reliability models are static
Derived for constant voltage/current/temperature
Need a highly dynamic model for optimization
Comparing different degradation scenarios
[Figure: two voltage (v) versus time (t) profiles built from power states P1, P2, and P3, shown side by side for comparison]
9
Accumulation Model
[Diagram: time spent in each power state -> accumulation model -> worst-case degradation at the end of lifetime]
Some can be derived from the model itself
E.g., EM can be modeled by effective current density Jeff
Others can be derived by simulation
E.g., worst-case BTI degradation can be derived by simulating different power state orderings and picking the worst case
Fitting and interpolation can also be used
10
Spatial problem vs. Temporal problem
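With the accumulation model, reliability degradation can be modeled as a temporal distribution problem
[Figure: voltage (v) versus time (t) for one core stepping through power states P1, P2, and P3]
The workload and power/thermal constraints are spatial problems
[Figure: grid of cores, each assigned to one of the power states P1, P2, or P3]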
11
System Management Policy
We assume a fair round-robin policy
Rotate scheduling priorities among all processor cores
The rotation period can range from hours to days
We assume this policy because it is:
Simple: open-loop, reasonable to assume at hardware design time
Effective: enough rotations to balance the workload over a typical hardware lifetime of multiple years
Pessimistic: more sophisticated policies are likely to perform better, i.e., the resulting margin is pessimistic
12
Bridging Spatial and Temporal Problems
Management policy will iterate workload among all
cores
Spatial distribution is equivalent to temporal distribution
[Figure: spatial constraints (cores assigned to power states P1, P2, P3) mapped to a temporal distribution (v versus t over P1, P2, P3)]
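The equivalence claimed on this slide can be checked with a small simulation. The sketch below is illustrative only: the function name, the six-core snapshot, and the assumption of equal-length scheduling intervals are hypothetical and not taken from the slides. It rotates a fixed spatial assignment across all cores and compares each core's time share per power state with the share of cores in that state.

```python
from collections import Counter

def round_robin_time_fractions(assignment, num_rotations):
    """Rotate a per-core power-state assignment and tally each core's time share.

    assignment: list of power-state labels, one per core (the spatial snapshot).
    num_rotations: number of equal-length scheduling intervals to simulate.
    Returns one dict per core mapping state label -> fraction of intervals.
    """
    n = len(assignment)
    tallies = [Counter() for _ in range(n)]
    for step in range(num_rotations):
        for core in range(n):
            # In interval `step`, this core takes the slot shifted by `step`.
            state = assignment[(core + step) % n]
            tallies[core][state] += 1
    return [{s: c / num_rotations for s, c in t.items()} for t in tallies]

# Hypothetical six-core snapshot (assumption, for illustration only).
snapshot = ["P1", "P3", "P1", "P2", "P1", "P2"]

# Spatial distribution: share of cores in each state at one instant.
spatial = {s: c / len(snapshot) for s, c in Counter(snapshot).items()}

# Temporal distribution: share of time each core spends in each state.
temporal = round_robin_time_fractions(snapshot, num_rotations=6 * 100)
```

When the number of intervals is a multiple of the core count, every core visits every slot equally often, so each entry of `temporal` matches `spatial`.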
13
Optimization Under Power Constraints
x is the number of cores in each power state
It is also the input to the accumulation model f(x)
P is the power corresponding to each power state
Pmax is the power constraint
Formulated as an Integer Linear Programming (ILP) problem
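The slide's equation did not survive in the transcript, so the following is only a sketch of what such a formulation could look like. It assumes a linear accumulation model f(x) (a weighted sum of core counts), that every core sits in exactly one state, and that the power constraint is the sum of per-state power times core count. The function name, state powers, and stress weights are hypothetical. A brute-force search stands in for an ILP solver to keep the example dependency-free.

```python
from itertools import product

def worst_case_distribution(num_cores, state_power, state_stress, p_max):
    """Find the integer core counts per power state that maximize degradation.

    state_power[i]: power drawn by one core in state i.
    state_stress[i]: assumed per-core degradation weight of state i,
                     i.e. f(x) = sum_i state_stress[i] * x[i] (an assumption).
    p_max: chip-level power constraint.
    Returns (x, f(x)) for the most stressful feasible distribution.
    """
    best_x, best_f = None, float("-inf")
    for x in product(range(num_cores + 1), repeat=len(state_power)):
        if sum(x) != num_cores:
            continue  # every core must be in exactly one state
        if sum(p * n for p, n in zip(state_power, x)) > p_max:
            continue  # violates the power constraint
        f = sum(w * n for w, n in zip(state_stress, x))
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Hypothetical states: off, low-power, high-power (integer units for exactness).
power = [0, 1, 3]
stress = [0, 1, 4]
x, f = worst_case_distribution(16, power, stress, p_max=19)
print(x, f)  # (9, 1, 6) 25
```

With these made-up numbers the worst case puts six cores in the high-power state, one in the low-power state, and leaves nine dark. Real instances would use an ILP solver rather than enumeration.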
15
Thermal Problem
The thermal limit can be reached in two scenarios
Heat up then cool down (left)
Constant temperature (right)
Constant stress results in worse degradation
Higher average temperature
More time in the high power state
16
Optimization Under Thermal Constraints
S is the time spent in each power state for each core
A is the temperature sensitivity matrix
Temperature increase per unit power
Tmax is the maximum temperature constraint
Tbak is the background power for each core
Formulated as a Linear Programming (LP) problem
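The constraint equation itself is missing from the transcript. As a hedged sketch, the code below assumes the constant-temperature scenario, so each core's temperature follows from its time-averaged power, and assumes the constraint has the form A * (average power) + Tbak <= Tmax, with Tbak treated as a per-core temperature offset. The function names and all numeric values are hypothetical. It only checks feasibility of a given S; an LP solver would search over S.

```python
def core_temperatures(S, state_power, A, t_bak):
    """Steady-state temperature of each core under time-averaged power.

    S[c][i]: fraction of time core c spends in power state i (rows sum to 1).
    state_power[i]: power of state i.
    A[c][d]: temperature rise at core c per unit power dissipated at core d.
    t_bak[c]: background term for core c, treated here as a temperature
              offset (an assumption about the missing equation).
    """
    avg_power = [sum(s * p for s, p in zip(row, state_power)) for row in S]
    return [
        t_bak[c] + sum(a * p for a, p in zip(A[c], avg_power))
        for c in range(len(S))
    ]

def meets_thermal_limit(S, state_power, A, t_bak, t_max):
    """True if no core exceeds t_max under the assumed linear thermal model."""
    return all(t <= t_max for t in core_temperatures(S, state_power, A, t_bak))

# Hypothetical two-core example with states (off, on).
state_power = [0.0, 2.0]
A = [[10.0, 2.0], [2.0, 10.0]]
t_bak = [45.0, 45.0]

half_duty = [[0.5, 0.5], [1.0, 0.0]]   # core 0 on half the time, core 1 off
always_on = [[0.0, 1.0], [0.0, 1.0]]   # both cores always on

temps = core_temperatures(half_duty, state_power, A, t_bak)
ok_half = meets_thermal_limit(half_duty, state_power, A, t_bak, t_max=60.0)
ok_full = meets_thermal_limit(always_on, state_power, A, t_bak, t_max=60.0)
```

Because the temperatures are linear in S, maximizing a linear degradation objective over S under this check is a linear program, which is consistent with the slide's LP formulation.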
17
Experimental Setup
Power model
Based on a commercial processor benchmark
Using libraries characterized at different supply voltages from 0.6 V to 0.9 V
Thermal model
Using HotSpot simulator
Consider the cases of 2x2, 4x4, 8x8 and 16x16 cores
BTI: both NBTI and PBTI
EM: metal sized to have the same current density (MTTF)
19
Local Power Network EM Results
Power constraint
40% reduction
Thermal constraint
20
Signal Wire EM Results
Power constraint
60% reduction
Thermal constraint
21
BTI Results
20% reduction
Power constraint
Thermal constraint
22
Conclusion
We propose a hardware reliability margining methodology for chips in the dark silicon era
We formulate the margining problem under power and thermal constraints
Experimental results show that at a 60% dark ratio, our method can achieve a 40%-60% reduction in metal width margin and a 20% reduction in BTI delay margin
23
Backup slides
24
EM Accumulation Model
Effective current density:
For local power mesh:
Jeff can be calculated from the average power consumed
For signal wires:
Jeff is proportional to V * f
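As a rough illustration of the signal-wire case, the sketch below assumes that the effective stress accumulates as a time-weighted average of V * f across power states, and it reports that average relative to the most stressful state. The averaging rule, the function name, and the frequency values are assumptions for illustration; the slide's Jeff equation is not in the transcript.

```python
def relative_signal_wire_jeff(time_fraction, voltage, frequency):
    """Time-weighted V*f across power states, normalized to the worst state.

    time_fraction[i]: share of lifetime spent in power state i (sums to 1).
    voltage[i], frequency[i]: supply voltage and clock frequency of state i.
    Returns a value in (0, 1]; 1.0 means the core always runs in the
    highest-stress state, which is what worst-case sizing would assume.
    """
    vf = [v * f for v, f in zip(voltage, frequency)]
    average = sum(t * s for t, s in zip(time_fraction, vf))
    return average / max(vf)

# Voltages span the characterized range; frequencies (GHz) are hypothetical.
ratio = relative_signal_wire_jeff(
    time_fraction=[0.5, 0.5],
    voltage=[0.6, 0.9],
    frequency=[1.0, 2.0],
)
```

Under these assumptions, a core that splits its lifetime evenly between the two states sees about two thirds of the worst-case effective stress.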
25
BTI Accumulation Model
Two steps:
Identify the worst-case ordering by simulation
Worst BTI degradation happens when power states are applied in increasing order of stress voltage
Fit the accumulation model
First pick a set of power state distribution samples x
Simulate the degradation g(x)
Assuming the fitting function is
Formulated as:
26