Power Management for Real

10 years of research on Power Management
(now called green computing)
Rami Melhem
Daniel Mosse
Bruce Childers
• Introduction
• Power management in real-time systems
• Power management in multi-core processors
• Performance-Resilience-Power Tradeoff
• Management of memory power
• Phase Change Memory
Power management techniques
Two common techniques:
1) Throttling: turn off (or change the mode of) unused components. Usage patterns need to be predicted to avoid the time and energy overhead of on/off or mode switching.
2) Frequency and voltage scaling: scale down a core's speed (frequency and voltage).
Designing power-efficient components is orthogonal to power management.
Frequency/voltage scaling
• Gracefully reduce performance
• Dynamic power: P_d = C·f³ + P_ind
• Static power: independent of f
[Figure: power vs. time, with the power split into the dynamic term C·f³, the frequency-independent term P_ind, and static power; execution finishes early, leaving idle time.]
When frequency is halved:
• Time is doubled
• C·f³ is divided by 8
• Energy due to C·f³ is divided by 4
• Energy due to P_ind is doubled
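A quick numeric check of these rules, using the slide's power model; the values of C, P_IND, and the workload W (in cycles) are illustrative, not from the talk:

```python
# Energy of a workload of W cycles at frequency f, with the power model
# P_d = C*f^3 + P_ind (dynamic cubic term plus frequency-independent term).
C = 1.0e-27      # effective switching-capacitance coefficient (illustrative)
P_IND = 0.5      # frequency-independent power in watts (illustrative)
W = 2.0e9        # workload in cycles (illustrative)

def energy(f):
    """Energy = power * time, with execution time = W / f."""
    return (C * f**3 + P_IND) * (W / f)

for f in (2.0e9, 1.0e9):
    print(f"f = {f:.1e} Hz: time = {W/f:.1f} s, energy = {energy(f):.2f} J")
# Halving f doubles the time, divides the dynamic energy C*f^3*(W/f) by 4,
# and doubles the static energy P_IND*(W/f), as listed above.
```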
Different goals of power management
• Minimize total energy consumption
• Minimize the energy-delay product (takes performance into consideration)
• Minimize the maximum temperature
• Maximize performance given a power budget
• Minimize energy given a deadline
• Minimize energy given reliability constraints

[Figure: energy and energy·delay vs. speed f. Static energy (∝ P_ind/f) decreases with speed, dynamic energy (∝ C·f²) increases with speed, and the total has a minimum; the corresponding energy·delay terms scale as P_ind/f² and C·f.]
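The shapes of these curves follow directly from the power model; a short derivation, with W denoting the cycles of work (W is introduced here for the derivation, not in the slides):

```latex
% Energy and energy-delay for W cycles at speed f, with P = C f^3 + P_ind
% and execution time t = W / f:
E(f) = \left(C f^3 + P_{ind}\right)\frac{W}{f} = C W f^2 + \frac{P_{ind} W}{f}
E(f)\, t = C W^2 f + \frac{P_{ind} W^2}{f^2}
% Setting dE/df = 2 C W f - P_{ind} W / f^2 = 0 gives the energy-minimal speed:
f^{*} = \left(\frac{P_{ind}}{2C}\right)^{1/3}
```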
DVS in real-time systems

[Figure: CPU speed vs. time. Static scaling runs the worst-case execution at a constant speed between S_min and S_max chosen so it finishes at the deadline; dynamic scaling re-selects the speed at power management points based on the remaining time.]
• Utilize slack to slow down future tasks (Proportional, Greedy, Aggressive, …)
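A toy sketch of two of these slack-reclamation schemes; the task list, f_max, and function names are illustrative:

```python
# Dynamic slack reclamation at a power management point.
# `remaining` is a list of (task_name, worst_case_cycles); `time_left`
# is the time until the deadline.

def proportional_speeds(remaining, time_left, f_max):
    """Proportional scheme: spread the available time over all remaining
    tasks in proportion to their worst-case work, so they share one speed."""
    total_wc = sum(wc for _, wc in remaining)
    f = min(f_max, total_wc / time_left)
    return {name: f for name, _ in remaining}

def greedy_speed(remaining, time_left, f_max):
    """Greedy scheme: give all current slack to the next task; the rest
    are provisioned to run at f_max in the worst case."""
    (name, wc), rest = remaining[0], remaining[1:]
    reserved = sum(w for _, w in rest) / f_max   # worst-case time for the rest
    budget = max(time_left - reserved, 1e-9)
    return name, min(f_max, wc / budget)

tasks = [("t1", 2e9), ("t2", 1e9), ("t3", 1e9)]
print(proportional_speeds(tasks, 5.0, 1e9))   # all tasks share one speed
print(greedy_speed(tasks, 5.0, 1e9))          # t1 absorbs all the slack
```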
Implementation of Power Management Points
• Can be implemented as periodic OS interrupts
• Difficulty: the OS does not know how much execution remains

[Figure: control-flow graph with a branch and a loop; the minimum, average, and maximum remaining execution differ from point to point.]

• The compiler can insert code to provide hints to the OS
Example of compiler/OS collaboration
[Figure: control-flow graph annotated with minimum, average, and maximum remaining execution times.]

At a power management hint:
• The compiler records the WCET based on the longest remaining path
At a power management point:
• The OS uses knowledge about the current load to set the speed
Compiler/OS collaboration

[Diagram: the compiler (which knows the task) performs static analysis of the application source code and inserts PMHs (power management hints); the OS/HW (which knows the system) combines the hints with run-time information at PMPs (power management points), executed as periodic interrupts, to set the speed.]
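A minimal sketch of this split in code; latest_hint, pmh, and pmp are invented names, not an API from the talk:

```python
# Compiler-inserted PMHs record the worst-case remaining cycles at known
# program points; the periodic PMP interrupt reads the latest hint and
# picks the slowest frequency that still meets the deadline.
import time

latest_hint = {"wcet_cycles": 0}      # written by compiler-inserted code

def pmh(wcet_remaining_cycles):
    """PMH: longest remaining path (in cycles) from this program point."""
    latest_hint["wcet_cycles"] = wcet_remaining_cycles

def pmp(deadline, f_min, f_max):
    """PMP (periodic OS interrupt): slowest speed that is still safe if
    the task runs for its full remaining worst case."""
    time_left = max(deadline - time.monotonic(), 1e-9)
    f = latest_hint["wcet_cycles"] / time_left
    return min(f_max, max(f_min, f))
```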
DVS for multiple cores
To derive a simple analytical model, assume Amdahl's law: p% of the computation can be perfectly parallelized.

[Figure: a serial section s followed by a parallel section p, executed on one core vs. two cores.]

Manage energy by determining (see the sketch below):
• The speed for the serial section
• The number of cores used in the parallel section
• The speed in the parallel section

[Figure: the effect of slowing down the cores, slowing down the parallel section, and using more cores.]
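A sketch of this analytical model; the per-core power model is the talk's P = C·f³ + P_ind, and all constants are illustrative:

```python
# Three knobs: serial-section speed f_s, number of cores n used in the
# parallel section, and parallel-section speed f_p.
C, P_IND = 1.0, 0.1
W, PAR = 1.0, 0.9            # total work and parallelizable fraction p

def time_and_energy(f_s, n, f_p):
    t_ser = (1 - PAR) * W / f_s
    t_par = PAR * W / (n * f_p)               # perfect speedup on n cores
    e_ser = (C * f_s**3 + P_IND) * t_ser      # one core active
    e_par = n * (C * f_p**3 + P_IND) * t_par  # n cores active
    return t_ser + t_par, e_ser + e_par

# Using more cores lets the parallel section run slower for a similar
# finish time, trading dynamic energy against extra static power:
print(time_and_energy(1.0, 2, 0.45))
print(time_and_energy(1.0, 4, 0.225))
```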
Mapping streaming applications to CMPs
• Streaming applications are prevalent
– Audio, video, real-time tasks, cognitive applications
• Constraints:
– Inter-arrival time (T)
– End-to-end delay (D)
• Power-aware mapping to CMPs:
– Determine speeds
– Account for communication
– Exclude faulty cores
Mapping a linear task graph onto a linear pipeline

If the # of stages = # of cores:

[Figure: pipeline stages mapped one-to-one onto a chain of cores.]

Find the stage execution times t_i that

$$\min \sum_{i=0}^{n} \left( e_i + \epsilon_i \right)$$

subject to

$$\sum_{i=0}^{n} \left( t_i + \tau_i \right) \le D \qquad \text{and} \qquad \max_i \left( t_i + \tau_i \right) \le T$$

where
• e_i: energy for executing stage i
• ε_i: energy for moving data from stage i-1 to stage i
• t_i: time for executing stage i
• τ_i: time for moving data from stage i-1 to stage i
Mapping a linear task graph onto a linear pipeline

If the # of stages > # of cores:

[Figure: several consecutive pipeline stages grouped onto each core.]

1) Group the stages so that the number of groups equals the number of cores
2) Use a dynamic programming approach to explore possible groupings (a sketch follows)
3) A faster solution may guarantee optimality within a specified error bound
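A sketch of the dynamic-programming grouping of step 2, under simplifying assumptions: per-stage times and energies are fixed (speeds already chosen) and communication costs are folded into the stage costs:

```python
# Partition a chain of n stages into k contiguous groups (one per core)
# so that every group's total time fits the inter-arrival time T,
# minimizing total energy.
import functools

def best_grouping(stage_t, stage_e, k, T):
    n = len(stage_t)

    @functools.lru_cache(maxsize=None)
    def dp(i, cores):
        """Minimal energy for stages i..n-1 using `cores` cores."""
        if i == n:
            return 0.0 if cores == 0 else float("inf")
        if cores == 0:
            return float("inf")
        best, t, e = float("inf"), 0.0, 0.0
        for j in range(i, n):          # put stages i..j on one core
            t += stage_t[j]
            e += stage_e[j]
            if t > T:                  # group violates the period
                break
            best = min(best, e + dp(j + 1, cores - 1))
        return best

    return dp(0, k)

print(best_grouping([1, 2, 1, 3], [5, 4, 6, 2], k=3, T=3.5))
```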
Mapping a non-linear task graph onto CMP

[Figure: successive instances of a task graph (nodes A through K) mapped onto a CMP, with individual cores running at maximum, medium, or minimum speed.]

• Timing constraints are conventionally satisfied through load-balanced mapping
• Additional constraints:
– Minimize energy consumption
– Maximize performance for a given energy budget
– Avoid faulty cores
Turn OFF some PEs
[Figure: the same task-graph instances remapped onto fewer cores so that some PEs can be turned off; legend: maximum speed/voltage (f_max), medium speed/voltage, minimum speed/voltage (f_min), PE OFF.]
DVS using Machine Learning
Characterize the execution state of a core by:
• Rate of instruction execution (IPC)
• Number of memory accesses per instruction
• Average memory access time (depends on the other threads)
During training, record for each state:
• The core frequency
• The energy consumption
Then determine the optimal frequency for each state.
[Figure: a CMP with four cores, per-core L1 and L2 caches, and a shared memory controller and memory.]
During execution, periodically:
• Estimate the current state (through run-time measurements)
• Assume that the future is a continuation of the present
• Set the frequency to the best one recorded during training
Statistical learning applied to DVS in CMPs: in the training phase, a learning engine and an automatic policy generator produce an integrated DVS policy; at runtime, the policy determines frequencies and voltages.
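A minimal sketch of such a table-based policy; the state discretization and names are illustrative:

```python
# During training, remember the lowest-energy frequency observed for each
# discretized state (IPC, memory accesses per instruction, average memory
# access time); at runtime, look up the current state and reuse it.

def state(ipc, mem_per_inst, avg_mem_time_ns):
    """Coarsely discretize the three counters into a state id."""
    return (round(ipc, 1), round(mem_per_inst, 2), round(avg_mem_time_ns))

best = {}   # state -> (energy, frequency)

def train(s, freq, energy):
    if s not in best or energy < best[s][0]:
        best[s] = (energy, freq)

def runtime_policy(s, f_default):
    """Assume the near future resembles the present: reuse the best
    frequency recorded for this state, or fall back to a default."""
    return best.get(s, (None, f_default))[1]

train(state(1.2, 0.05, 80), freq=2.0e9, energy=3.1)
train(state(1.2, 0.05, 80), freq=1.5e9, energy=2.4)   # better: kept
print(runtime_policy(state(1.2, 0.05, 80), f_default=2.5e9))  # 1.5e9
```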
Energy-Reliability tradeoff
Using time redundancy (checkpointing and rollbacks). If you have time slack:
1) Add checkpoints
2) Reserve recovery time
3) Reduce the processing speed

[Figure: execution slowed below S_max, with checkpoints inserted and recovery time reserved before the deadline.]

For a given number of checkpoints, we can find the speed that minimizes energy consumption while guaranteeing recovery and timeliness.
Optimal number of checkpoints
More checkpoints = more overhead + less recovery slack.

[Figure: a computation of length C with per-checkpoint overhead r and deadline D; energy plotted against the number of checkpoints has a minimum.]

For a given slack (C/D) and checkpoint overhead (r/C), we can find the number of checkpoints that minimizes energy consumption while guaranteeing recovery and timeliness (see the sketch below).

Faults are rare events: if a fault occurs, execution may continue at S_max after recovery.
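A numeric sketch of this trade-off; the energy model and constants are illustrative rather than the talk's exact formulation:

```python
# With computation C, deadline D, and per-checkpoint overhead r, placing
# n checkpoints adds n*r of overhead but shrinks the recovery reserve to
# one segment (C/n, re-executed at f_max = 1). Sweep n for the minimum.

def energy_at(n, C, D, r, c_dyn=1.0, p_ind=0.1, f_max=1.0):
    work = C + n * r               # computation plus checkpoint overhead
    reserve = (C / n) / f_max      # must be able to redo one segment
    budget = D - reserve           # time left for the first pass
    if budget <= 0:
        return float("inf")
    f = work / budget              # slowest speed meeting the deadline
    if f > f_max:
        return float("inf")        # infeasible with this many checkpoints
    return (c_dyn * f**3 + p_ind) * (work / f)

C, D, r = 10.0, 14.0, 0.2
n_best = min(range(1, 50), key=lambda n: energy_at(n, C, D, r))
print(n_best, energy_at(n_best, C, D, r))   # few checkpoints: infeasible;
                                            # many: overhead dominates
```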
Non-uniform checkpointing

Observation: if a fault occurs, execution may continue at S_max after recovery.
• Disadvantage: increases energy consumption when a fault occurs (a rare event)
• Advantage: recovery in an early section can use the slack created by executing later sections at S_max
This requires non-uniform checkpoints.
Triple Modular Redundancy vs. Duplex
• Duplex: compare and roll back if the results differ
• TMR: vote and exclude the faulty result

[Figure: regions of the (static power l, checkpoint overhead) plane in which TMR or Duplex is the more energy-efficient choice, shown for load = 0.7.]

The relative efficiency of TMR vs. Duplex depends on:
• Static power (l)
• Checkpoint overhead
• Load
Add memory power to the mix
Example: DRAM and SRAM modules can be switched between different power states (modes), but not for free:
• Mode transition power overhead
• Mode transition time overhead

[State diagram: Active (779.1 mW) falls back automatically to Standby (275.0 mW); deeper states are Power-down (150 mW) and Self-refresh (20.87 mW); mode-transition latencies range from 5 ns to 1000 ns, growing with the depth of the state.]
OS assisted Memory Power Management?
• Keep a histogram of bank access patterns and idle-time distributions
• Use machine learning techniques to select the optimal "threshold" for turning banks off (sketched below)
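A sketch of what that threshold selection could look like, assuming a simple break-even model; the power numbers and overheads are illustrative (loosely in the spirit of the mode table above):

```python
# Turning a bank off pays only if the idle period outlasts the break-even
# time implied by the transition overhead. Pick the threshold that
# minimizes expected energy over the recorded idle-time histogram.
P_ON, P_OFF = 275.0e-3, 20.87e-3    # watts: shallow vs. deep mode
E_TRANS = 50e-9                     # joules per mode round-trip (illustrative)

def expected_energy(threshold, idle_histogram):
    """idle_histogram: list of (idle_duration_s, count); a bank is
    switched off after `threshold` seconds of idleness."""
    total = 0.0
    for dur, count in idle_histogram:
        if dur <= threshold:
            total += count * dur * P_ON                 # never switched off
        else:
            total += count * (threshold * P_ON +
                              (dur - threshold) * P_OFF + E_TRANS)
    return total

def best_threshold(idle_histogram, candidates):
    return min(candidates, key=lambda t: expected_energy(t, idle_histogram))

hist = [(100e-9, 500), (10e-6, 50), (1e-3, 5)]
print(best_threshold(hist, [0.0, 1e-6, 100e-6, float("inf")]))
```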
Example of compiler assisted Memory Power Management?

Before transformation:
  Load x … Store x … Load z … Load y … Store z … Store y …

After the compiler transformation:
  Load x, Load y … (long idle gap) … Store y, Load z … Store z …

Code transformations increase the memory idle time (the time between memory accesses).
Example of compiler assisted Memory Power Management?

Declare A[], B[], C[], D[]
… Access A … Access D … Access B … Access C … Access B …

Memory allocation, two candidate bank assignments:
  Bank 1: A[], B[]  Bank 2: C[], D[]
OR
  Bank 1: A[], D[]  Bank 2: C[], B[]

Algorithms use the access pattern to allocate arrays to banks in a way that maximizes bank idle times (see the sketch below).
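A brute-force sketch of such an allocation algorithm; the adjacency-based affinity metric and the two-bank setting are illustrative simplifications. On the trace above it picks the {A, D} / {C, B} assignment:

```python
# Place arrays that are accessed close together in time into the same
# bank, so the other bank sees long idle gaps.
from itertools import combinations

def allocate(arrays, trace, per_bank=2):
    # affinity[(x, y)] = how often x and y are adjacent in the trace
    affinity = {}
    for a, b in zip(trace, trace[1:]):
        if a != b:
            key = tuple(sorted((a, b)))
            affinity[key] = affinity.get(key, 0) + 1

    best, best_score = None, -1
    for group in combinations(arrays, per_bank):
        rest = tuple(x for x in arrays if x not in group)
        # reward keeping high-affinity pairs inside the same bank
        score = sum(affinity.get(tuple(sorted(p)), 0)
                    for g in (group, rest) for p in combinations(g, 2))
        if score > best_score:
            best, best_score = (group, rest), score
    return best

print(allocate(["A", "B", "C", "D"], ["A", "D", "B", "C", "B"]))
# (('A', 'D'), ('B', 'C'))
```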
Phase Change Memory (PCM)
A power-saving memory technology
• Solid-state memory made of a germanium-antimony alloy
• Switching between states is thermally based (not electrically based)
• Samsung, Intel, Hitachi and IBM have developed PCM prototypes (to replace Flash)
Properties of PCM
• Non-volatile, but faster than Flash
• Byte addressable, but denser and cheaper than DRAM
• No static power consumption and very low switching power
• Not susceptible to SEUs (single event upsets), hence does not need error detecting or correcting codes
  o Errors occur only during writes (not reads) – a simple read-after-write detects errors
So, where is the catch?
• Slower than DRAM
  • A factor of 2 for reads and 10 for writes
• Low endurance
  • A cell fails after 10^7 writes (as opposed to 10^15 for DRAM)
• Asymmetric energy consumption
  • A write is more expensive than a read
• Asymmetry in bit writing
  • Writing 0s is faster than writing 1s
Goal: use PCM as main memory
Traditional architecture: CPU → memory controller → DRAM
Proposed architecture: CPU → MM → AEB + PCM
• AEB: acceleration/endurance buffer
• MM: memory manager
Advantages: cheaper + denser + lower power consumption
Dealing with asymmetric read/write
• Use coherence algorithms in which "writes" are not on the critical path
• Design algorithms with "read rather than write" in mind
• Take advantage of the fact that writing 0s is faster than writing 1s:
  • Pre-write a block with 1s as soon as the block becomes dirty in the cache
  • On write back, only write the 0s (sketched below)
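A bit-level sketch of the pre-write idea, as a pure-Python illustration (the interface is invented):

```python
# When a cached block becomes dirty, pre-write the PCM block to all 1s in
# the background (the slow operation, off the critical path); the eviction
# then only clears bits, which is the fast operation.
WORD = 0xFF                      # 8-bit words for the example

def prewrite(pcm, addr):
    pcm[addr] = WORD             # slow all-ones write, done early

def writeback(pcm, addr, data):
    assert pcm[addr] == WORD, "block must be pre-written to 1s"
    pcm[addr] &= data            # fast: only 0 bits are actually written

pcm = {0: 0x00}
prewrite(pcm, 0)                 # as soon as the block is dirty in cache
writeback(pcm, 0, 0b1011_0010)   # on eviction: clear bits only
```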
Dealing with low write endurance
(write minimization)
• Block (or page) allocation algorithms should not be oblivious to the status of the block – for wear minimization
• Modify the cache replacement algorithm
  • e.g., LRR (least recently read) replacement
  • Give lower priority to dirty pages in cache replacement
• Use coherence algorithms that minimize writes (write-through is not good)
• Read/compare/write: write a bit only if it differs from the current content (sketched below)
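A sketch of read/compare/write as a pure-Python illustration:

```python
# Read the current content first and write only if something changed;
# counting the differing bits shows how much write traffic is avoided.
def differential_write(pcm, addr, new):
    old = pcm[addr]                    # read
    changed = old ^ new                # compare: 1s mark differing bits
    if changed:
        pcm[addr] = new                # write only when needed
    return bin(changed).count("1")     # number of bits actually flipped

pcm = {0: 0b1010_1010}
print(differential_write(pcm, 0, 0b1010_1110))  # 1 bit differs
print(differential_write(pcm, 0, 0b1010_1110))  # 0: write skipped
```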
Wear leveling
• Memory allocation decisions should consider the age of blocks (age = number of write cycles endured)
• Periodically change the physical location of a page (write it to a location different from the one it was read from), as sketched below
• Consider memory a consumable resource that can be periodically replaced
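A structural sketch combining age-aware allocation with periodic page rotation; all names are illustrative:

```python
# Allocation prefers the least-written physical frame, and a background
# task periodically re-homes pages instead of writing them back in place.
import heapq

class WearLeveler:
    def __init__(self, frames):
        # min-heap of (write_count, frame): youngest frame first
        self.heap = [(0, f) for f in frames]
        heapq.heapify(self.heap)

    def allocate(self):
        count, frame = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (count + 1, frame))  # count this write
        return frame

    def rotate(self, page_table, page):
        """Periodic wear leveling: write the page to a different (young)
        frame rather than back to the frame it was read from."""
        page_table[page] = self.allocate()

wl = WearLeveler(frames=[0, 1, 2, 3])
page_table = {"p0": wl.allocate()}
wl.rotate(page_table, "p0")     # p0 moves to the next-youngest frame
print(page_table)
```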
Detailed architecture
[Diagram: the memory manager comprises a request controller (CPU bus interface, a tag array with valid/dirty/address entries, a requests buffer with valid/state/CPU/read-write/size/tag/AEB/PCM fields, a busy bitmap, and an FSM), an in-flight buffer (SRAM), a DRAM controller/DMAC for the AEB, and a PCM controller/DMAC, connected by control and data buses; the PCM is organized into a page cache, a pages area, a spare table, and spares.]
Conclusion

It is essential to manage the tradeoffs between time constraints (deadlines or rates), energy constraints, and reliability constraints, spanning the compiler, the OS, and the hardware.