Olay: Combat the Signs of Aging with Introspective Reliability Management
Authors:
Shuguang Feng
Shantanu Gupta
Scott Mahlke
W-QUAD (ISCA-35)
June 21, 2008
University of Michigan
Electrical Engineering and Computer Science
Motivation
“Designing Reliable Systems from Unreliable Components…”
- Shekhar Borkar (Intel)
More failures to come
Failures will be wearout-induced
[Srinivasan, DSN'04]
[Borkar, MICRO'05]
Approaches to Reliability
Tolerate faults (reactive) or prevent faults (proactive)
Circuit-level
Margining
High-K dielectrics
Robust cell topologies
Passivation
Architecture-level
Detect
Diagnose
Repair/reconfigure/recover
Dynamic thermal management (DTM)
Introspective reliability management (IRM)
Targeted management based on wearout monitoring
Not All Cores Are Created Equal
Chip multiprocessors will be subject to severe process variation
Dynamic thermal/power budgeting can be suboptimal
Temperature is only part of the picture
Need low-level reliability awareness
Low-level sensors measure physical changes
Wearout-aware management improves reliability-enhancement techniques:
System reconfiguration
Dynamic voltage and frequency scaling (DVFS)
Job assignment
Introspective Reliability Management (IRM)
Low-level sensors
delay, leakage, temperature, etc.
WDU [MICRO'07]
measure propagation delay
track statistical trends
Olay
track the progression of wearout
profile workload behavior
generate wearout-aware job schedules
[Figure: IRM stack. OS (Scheduled Jobs, IRM Policy), Virtualization Layer (Reliability Assessment, Aggregate Analysis), and Low-level Sensors (Filtering and Analysis), connected by Raw Sensor Data, Processed Data, and Management Decisions.]
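The WDU items on this slide (measure propagation delay, track statistical trends) can be illustrated with a small sketch. It assumes an exponentially weighted moving average as the trend filter and a 5% drift threshold; the slides do not say which filter or threshold the WDU uses, and the names `DelayTrend`, `alpha`, and `threshold` are hypothetical.

```python
class DelayTrend:
    """Track a smoothed trend of propagation-delay samples (illustrative only)."""

    def __init__(self, baseline_ns, alpha=0.1, threshold=0.05):
        self.baseline_ns = baseline_ns  # delay of the fresh part (assumed known)
        self.alpha = alpha              # EWMA smoothing factor (hypothetical value)
        self.threshold = threshold      # relative drift treated as wearout (hypothetical)
        self.trend_ns = baseline_ns     # current smoothed delay

    def update(self, sample_ns):
        """Fold one raw sensor reading into the smoothed trend."""
        self.trend_ns = self.alpha * sample_ns + (1 - self.alpha) * self.trend_ns
        return self.trend_ns

    def drift(self):
        """Relative increase of the smoothed delay over the baseline."""
        return (self.trend_ns - self.baseline_ns) / self.baseline_ns

    def is_worn(self):
        """True once the smoothed delay has drifted past the threshold."""
        return self.drift() > self.threshold


# Usage: feed raw readings as they arrive, then query the trend.
trend = DelayTrend(baseline_ns=1.0)
for sample in [1.00, 1.02, 0.99, 1.10, 1.12, 1.15]:
    trend.update(sample)
print(round(trend.drift(), 4), trend.is_worn())
```

Smoothing is what lets a single noisy reading pass without raising an alarm, while a sustained rise in delay gradually moves the trend.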
Wearout-aware Scheduling
[Figure: Active Jobs (T0, T1, T2, T3, ..., Tn), each annotated with per-module activity percentages; Available Cores, each with a Per-module Reliability Profile; and the resulting Job Schedule of jobs and idle slots per core.]
[Figure: Job-to-Core Binding within the IRM stack (OS: Scheduled Jobs, IRM Policy; Virtualization Layer: Reliability Assessment, Aggregate Analysis; Processed Data; Filtering and Analysis; Raw Sensor Data). Applications T0 ... Tn are ranked from Lightweight to Heavyweight, and cores from Strong to Weak by Life Remaining (0% to 100%).]
Wearout-aware Policies
GreedyE
Optimizes for early life performance
Minimizes premature failures with wear-leveling
[Figure: GreedyE Schedule. Jobs (T0 ... Tn) ordered from Light to Heavy are bound to Cores (C0 ... Cn) ordered from Weak to Strong.]
GreedyL
Optimizes for end-of-life performance
Victimizes weak cores to maximize the life of stronger cores
GreedyA
Hybrid of GreedyE and GreedyL
Adapts behavior based on system utilization
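A minimal sketch of how these three policies might bind jobs to cores follows. It rests on several assumptions the slides do not confirm: GreedyE is read as pairing the heaviest jobs with the strongest cores (wear-leveling), GreedyL as pairing the heaviest jobs with the weakest cores (sparing the strong ones), and GreedyA as switching between the two at a utilization threshold. The direction of that switch, the threshold value, and the `weight` and `life remaining` metrics are all hypothetical.

```python
def bind(jobs, cores, heavy_to_strong):
    """Pair jobs with cores and return {job: core}.

    jobs:  dict of job name -> weight (higher means more stressful; assumed metric)
    cores: dict of core name -> life remaining in [0, 1] (assumed metric)
    Jobs beyond the number of cores are left unscheduled in this sketch.
    """
    by_weight = sorted(jobs, key=jobs.get, reverse=True)  # heaviest job first
    # Strongest core first when heavy_to_strong is True, weakest first otherwise.
    by_health = sorted(cores, key=cores.get, reverse=heavy_to_strong)
    return dict(zip(by_weight, by_health))


def greedy_e(jobs, cores):
    """Assumed reading of GreedyE: heavy jobs go to strong cores."""
    return bind(jobs, cores, heavy_to_strong=True)


def greedy_l(jobs, cores):
    """Assumed reading of GreedyL: heavy jobs go to weak cores."""
    return bind(jobs, cores, heavy_to_strong=False)


def greedy_a(jobs, cores, threshold=0.5):
    """Assumed reading of GreedyA: choose a policy from system utilization."""
    utilization = len(jobs) / len(cores)
    # Which policy applies on which side of the threshold is a guess.
    return greedy_e(jobs, cores) if utilization >= threshold else greedy_l(jobs, cores)


jobs = {"T0": 0.9, "T1": 0.2}
cores = {"C0": 0.3, "C1": 0.8, "C2": 0.6}
print(greedy_e(jobs, cores))  # {'T0': 'C1', 'T1': 'C2'}
print(greedy_l(jobs, cores))  # {'T0': 'C0', 'T1': 'C2'}
```

With fewer jobs than cores, the cores at the tail of the ordering stay idle, so under this reading GreedyE rests the weakest cores and GreedyL rests the strongest.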
Lifetime Reliability Simulation (FACE)
Offline Characterization
[Figure: the Benchmark Suite (SPEC2000 INT & FP) is run through SimAlpha, Wattch, and HotSpot to produce execution, power, and temperature traces, which form the Benchmark Profiles.]
Synthetic Benchmarks
representative of SPEC2000 suite
reduces online profiling complexity
Online Simulation
[Figure: Monte Carlo Engine with stages Parameter Specification, Workload Generation, Simulate Aging, and Reliability Management, built from a Workload Simulator, a CMP Simulator, and Olay, and fed by the Benchmark Profiles from offline characterization.]
Figure callouts: emulates OS scheduler; monitors CMP health; tracks progression of wearout mechanisms; wearout-aware scheduling; hierarchical design; profiling; intelligent heuristics; device lifetimes; temperature traces; power traces; utilization pattern
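To make the role of a Monte Carlo engine in lifetime simulation concrete, here is a minimal sketch. It is not the FACE implementation. It assumes each core's lifetime is an independent Weibull sample (the slides do not name a distribution) and that the system fails once fewer than `min_alive` cores still work. The function name and every parameter are hypothetical.

```python
import random


def simulate_system_lifetimes(n_cores, min_alive, trials,
                              scale=10.0, shape=2.0, seed=0):
    """Return one simulated system lifetime per trial (illustrative only)."""
    rng = random.Random(seed)
    # The system survives this many core failures before it is considered dead.
    failures_tolerated = n_cores - min_alive
    lifetimes = []
    for _ in range(trials):
        # Sample a failure time for every core, earliest failure first.
        core_lives = sorted(rng.weibullvariate(scale, shape)
                            for _ in range(n_cores))
        # The next failure after the tolerated ones ends the system's life.
        lifetimes.append(core_lives[failures_tolerated])
    return lifetimes


# Usage: compare a system that needs all 16 cores with one that needs only 8.
strict = simulate_system_lifetimes(16, 16, trials=1000)
relaxed = simulate_system_lifetimes(16, 8, trials=1000)
print(sum(strict) / len(strict), sum(relaxed) / len(relaxed))
```

Repeating the experiment many times yields a distribution of failure times rather than a single number, which is the kind of output a failure-distribution figure of merit needs.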
Wearout Modeling
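Mean time to failure (MTTF)
$MTTF_{TDDB} \propto \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{X + Y/T + ZT}{kT}}$
$MTTF_{NBTI} \propto \left(\frac{1}{V}\right)^{\gamma} e^{\frac{E_{aNBTI}}{kT}}$
defines distribution of device lifetimes
Damage accumulation
$D_n = (1 - \alpha_{n-1}) D_{n-1} = \prod_{i=0}^{n-1} (1 - \alpha_i) D_0$
where α is the degradation rate
$\alpha_i \propto \frac{MTTF_{qual}}{MTTF_i}$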
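As a worked illustration of the damage-accumulation step, the sketch below reads the recurrence on this slide as D_n = (1 - α_{n-1}) D_{n-1} and steps it over a sequence of per-interval degradation rates. The α values are made up for the example, and the function name is hypothetical.

```python
import math


def accumulate(d0, alphas):
    """Return [D_0, D_1, ..., D_n] with D_{k+1} = (1 - alpha_k) * D_k."""
    history = [d0]
    for alpha in alphas:
        history.append((1.0 - alpha) * history[-1])
    return history


# Usage: three intervals, the middle one more stressful (larger alpha).
alphas = [0.01, 0.05, 0.02]
history = accumulate(1.0, alphas)

# The last value matches the closed form: product of (1 - alpha_i) times D_0.
closed_form = math.prod(1.0 - a for a in alphas) * 1.0
print(history[-1], closed_form)
```

Because each interval multiplies the previous value, a hot or heavily loaded interval with a large α leaves a lasting mark on every later D, which is why the order and intensity of scheduled jobs matter.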
CMP Reliability Simulation
CMPs:
variable number of cores
model systematic variation
Cores:
Alpha 21264-type processor
Modules:
experience load-dependent stress
smallest granularity of temperature modeling
Transistors:
multiple mechanisms evolve independently
Evaluation
Policies
Random (baseline), GreedyE, GreedyL, GreedyA
Figures of merit
Failure distribution
Useful work performed prior to system failure
Varied system parameters
CMP size
System utilization
Sensor error
Failure Distribution
with 16 cores
Sensitivity to System Utilization
with 16 cores
Sensitivity to CMP Size
with 100% utilization and GreedyE
Sensitivity to Sensor Error
with 16 cores, 100% utilization, and GreedyE
Conclusions
Heterogeneity exists in both CMPs and their workloads
Wearout-aware job assignments effectively exploit this heterogeneity
Real-time health monitoring (low-level sensors)
CMPs augmented with Olay perform up to 20% more useful work
Proper high-level analysis and profiling are essential for enhancing lifetime reliability.
Questions?