Introduction

Transcript Introduction

Power Management
(1)

Introduction to Basics

Background Reading
• http://en.wikipedia.org/wiki/CPU_power_dissipation
• http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
• http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390ti3-2100t-pentium-g620t.html
• http://www.cpu-world.com/info/charts.html

• Goal: Understand
 – The sources of power dissipation in combinational and sequential circuits
 – Power vs. energy
 – Options for controlling power/energy dissipation
(3)
Moore’s Law
Goal: Sustain Performance Scaling
• Performance scaled with number of transistors
• Dennard scaling*: power scaled with feature size
From wikipedia.org
*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid-State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
(4)
Where Does the Power Go in CMOS?
• Dynamic power consumption
 – Caused by switching transitions → the cost of switching state
• Static power consumption
 – Caused by leakage currents in the absence of any switching activity
• Power consumption per transistor changes with each technology generation
 – No longer reducing at the same rate
 – What happens to power density?
[Figure: CMOS inverter — Vdd, PMOS, Vin, Vout, NMOS, Ground; AMD Trinity APU die photo]
(5)
n-channel MOSFET
[Figure: n-channel MOSFET cross-section — gate, source, drain, body, oxide thickness tox, channel length L]
• Vgs < Vt: transistor off (Vt is the threshold voltage)
• Vgs > Vt: transistor on
• Impact of threshold voltage
 – Higher Vt: slower switching speed, lower leakage
 – Lower Vt: faster switching speed, higher leakage
• Actual physics is more complex, but this will do for now!
(6)
Charge as a State Variable
[Figure: logic network with nodes a, b, c, x, y]
For computation we should be able to identify whether each of the variables (a, b, c, x, y) is in a ‘1’ or a ‘0’ state.
We could have used any physical quantity to do that:
• Voltage
• Current
• Electron spin
• Orientation of magnetic field
• ...
All nodes have some capacitance associated with them.
We choose voltage to distinguish between a ‘0’ and a ‘1’.
Logic 1: cap is charged
Logic 0: cap is discharged
(7)
Abstracting Energy Behavior
• How can we abstract energy consumption for a digital device?
• Consider the energy cost of charge transfer
[Figure: CMOS inverter with input transitions 0→1 and 1→0 — the PMOS/NMOS devices are modeled as an on/off resistance and the output node as an output capacitance]
(8)
Switch from One State to Another
To perform computation, we need to switch from one state to another.
[Figure: CMOS inverter — Vdd, PMOS, Vin, Vout, NMOS, Ground]
Logic 1: cap is charged — connect the cap to Vdd through an ON PMOS
Logic 0: cap is discharged — connect the cap to GND through an ON NMOS
The logic dictates whether a node capacitor will be charged or discharged.
(9)
Power vs. Energy
[Figure: two power-vs-time profiles — a short burst at power P2 and a longer interval at lower power P1, both above a baseline P0; the area under each curve (the energy) is the same]
• Power is the rate of expenditure of energy
 – One joule/sec = one watt
• Both profiles use the same amount of energy, expended at different rates (power)
(10)
Dynamic Power vs. Dynamic Energy
• Dynamic power: consider the rate at which switching (energy dissipation) takes place
[Figure: CMOS inverter driving a load capacitance CL — the supply current iDD charges the output capacitor on one input transition and the capacitor discharges to ground on the other; the input switches between 0 and VDD with period T]
• Activity factor α = fraction of total capacitance that switches each cycle

Pdynamic = α × (CL / 2) × Vdd² × F

Delay = k × C × Vdd / (Vdd − Vt)²
(11)
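To make the two relations above concrete, here is a small Python sketch that evaluates them for a couple of operating points. All constants (activity factor, CL, k, Vt) are made-up placeholders for illustration, not device data.

```python
# Sketch: evaluate the dynamic-power and delay relations from this slide.
# All constants below are illustrative placeholders, not real device data.

def dynamic_power(alpha, c_load, vdd, freq):
    """P_dynamic = alpha * (C_L / 2) * Vdd^2 * F  (watts)"""
    return alpha * (c_load / 2.0) * vdd ** 2 * freq

def gate_delay(k, c_load, vdd, vt):
    """Delay = k * C * Vdd / (Vdd - Vt)^2  (arbitrary units for constant k)"""
    return k * c_load * vdd / (vdd - vt) ** 2

if __name__ == "__main__":
    alpha, c_load, k, vt = 0.1, 1e-9, 1.0, 0.3   # placeholder values
    for vdd, freq in [(1.0, 2.4e9), (0.8, 1.6e9)]:
        p = dynamic_power(alpha, c_load, vdd, freq)
        d = gate_delay(k, c_load, vdd, vt)
        print(f"Vdd={vdd} V, F={freq/1e9:.1f} GHz -> P={p:.3f} W, relative delay={d:.3e}")
```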
Energy-Delay Interaction
[Figure: energy and delay as a function of supply voltage VDD — delay falls as VDD rises while energy grows; the energy-delay product (EDP) is the target of optimization]
• Delay decreases with supply voltage, but energy/power increases

Pdynamic = α × (CL / 2) × Vdd² × F

Delay = k × C × Vdd / (Vdd − Vt)²
(12)
Static Power
• Technology scaling has caused transistors to become smaller and smaller. As a result, static power has become a substantial portion of the total power.
[Figure: MOSFET cross-section showing gate leakage, junction leakage, and sub-threshold leakage paths between gate, source, and drain]

Pstatic = Vdd × Istatic
(13)
Static Energy-Delay Interaction
[Figure: leakage and delay as a function of threshold voltage Vth — leakage falls and delay rises as Vth increases; inset: MOSFET cross-section with gate, source, drain, tox, L]

Delay = k × C × Vdd / (Vdd − Vt)²

• Static energy increases exponentially with a decrease in threshold voltage
• Delay increases with threshold voltage
(14)
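The exponential dependence on threshold voltage comes from sub-threshold conduction. The sketch below uses the standard textbook form I_sub ∝ exp(−Vth / (n·vT)), with thermal voltage vT = kT/q ≈ 26 mV at room temperature; the prefactor and the sub-threshold slope factor n are placeholders.

```python
# Sketch: sub-threshold leakage grows exponentially as Vth is lowered.
# Textbook form: I_sub ~ I0 * exp(-Vth / (n * vT)), vT = kT/q ~ 26 mV at 300 K.
# I0 and n below are placeholders, not device data.
import math

I0 = 1e-6           # placeholder prefactor (A)
N = 1.5             # placeholder sub-threshold slope factor
VT_THERMAL = 0.026  # thermal voltage at ~300 K (V)

def leakage(vth):
    return I0 * math.exp(-vth / (N * VT_THERMAL))

for vth in (0.40, 0.35, 0.30, 0.25, 0.20):
    print(f"Vth={vth:.2f} V -> relative leakage={leakage(vth)/leakage(0.40):8.1f}x")
```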
Higher Level Blocks
[Figure: CMOS gate schematics for larger logic blocks — PMOS pull-up and NMOS pull-down networks with inputs A, B, C]
(15)
Temperature Dependence
• As temperature increases, static power increases¹

Pstatic = Vdd × N × Kdesign × Ileakage

 – Vdd: supply voltage
 – N: number of transistors
 – Kdesign: technology-dependent design parameter
 – Ileakage: normalized leakage current, Ileakage = F(Temp)

¹J. Butts and G. Sohi, “A Static Power Model for Architects,” MICRO 2000.
(16)
The World Today
• Yesterday: scaling to minimize time (maximize F)

Pdynamic = α × (CL / 2) × Vdd² × F

Delay = k × C × Vdd / (Vdd − Vt)²

• Maximum performance (minimum time) is too expensive in terms of power
• Today: trade/balance performance for power efficiency
(17)
Technology Factors Affecting Power
• Transistor size
 – Affects capacitance (CL)
• Rise times and fall times (delay)
 – Affect short-circuit power (not in this course)
• Threshold voltage
 – Affects leakage power
• Temperature
 – Affects leakage power
• Switching activity
 – Frequency (F) and number of switching transistors (α)
[Figure: CMOS inverter — Vdd, PMOS, Vin, Vout, NMOS, Ground]

Pdynamic = α × (CL / 2) × Vdd² × F

Delay = k × C × Vdd / (Vdd − Vt)²
(18)
Low Power Design: Options?

Pdynamic = α × (CL / 2) × Vdd² × F

Delay = k × C × Vdd / (Vdd − Vt)²

• Reduce Vdd
 – Increases gate delay
 – Note that this means it reduces the frequency of operation of the processor!
• Compensate by reducing threshold voltage?
 – Increases leakage power
• Reduce frequency
 – Computation takes longer to complete
 – Consumes more energy (but less power) if voltage is not scaled
(19)
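The last bullet can be checked with a little arithmetic: dynamic power follows α·(CL/2)·Vdd²·F, a static term Vdd·Ileak accrues for as long as the job runs, and the energy of a fixed amount of work is power × run-time. The sketch below compares scaling F alone against scaling Vdd and F together; all values are in relative units and the constants are placeholders.

```python
# Sketch: effect of lowering frequency alone vs. lowering voltage and frequency
# together, for a fixed amount of work. Relative units; constants are placeholders.

ALPHA, C_L, I_LEAK = 0.1, 1.0, 0.005   # placeholder parameters

def power(vdd, f):
    dynamic = ALPHA * (C_L / 2.0) * vdd ** 2 * f
    static = vdd * I_LEAK
    return dynamic + static

def energy(vdd, f, work=1.0):
    runtime = work / f                   # fixed work -> slower clock, longer run-time
    return power(vdd, f) * runtime

for name, (v, f) in [("baseline (Vdd=1.0, F=1.0)", (1.0, 1.0)),
                     ("F halved, Vdd unchanged",   (1.0, 0.5)),
                     ("F halved, Vdd lowered",     (0.8, 0.5))]:
    print(f"{name:28s} power={power(v, f):.4f}  energy={energy(v, f):.4f}")
```

With frequency halved but voltage unchanged, power drops but energy rises (the static term is paid over a longer run-time), which is exactly the slide's caveat; lowering voltage along with frequency reduces both.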
Example
AMD Trinity A10-5800 APU: 100W TDP

  CPU P-state            Voltage (V)   Freq (MHz)
  Pb0 (HW-only boost)        1.000        2400
  Pb1 (HW-only boost)        0.875        1800
  P0  (SW-visible)           0.825        1600
  P1  (SW-visible)           0.812        1400
  P2  (SW-visible)           0.787        1300
  P3  (SW-visible)           0.762        1100
  P4  (SW-visible)           0.750         900
(20)
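As a rough illustration of how the table's entries relate, the sketch below scales dynamic power as Vdd²·F for each P-state relative to Pb0. This ignores leakage and any workload differences, so the ratios are indicative only.

```python
# Sketch: relative dynamic power (~ Vdd^2 * F) of the Trinity P-states in the
# table above, normalized to Pb0. Leakage and workload effects are ignored.

p_states = {            # name: (voltage V, frequency MHz) from the table
    "Pb0": (1.000, 2400),
    "Pb1": (0.875, 1800),
    "P0":  (0.825, 1600),
    "P1":  (0.812, 1400),
    "P2":  (0.787, 1300),
    "P3":  (0.762, 1100),
    "P4":  (0.750,  900),
}

base_v, base_f = p_states["Pb0"]
for name, (v, f) in p_states.items():
    rel = (v ** 2 * f) / (base_v ** 2 * base_f)
    print(f"{name}: ~{rel:.2f}x the dynamic power of Pb0")
```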
Optimizing Power vs. Energy
• Maximize battery life → minimize energy
• Thermal envelopes → minimize peak power
(21)
What About Wires?
Lumped RC Model
[Figure: wire of length l modeled as a lumped resistance Rline with half the capacitance (Cline/2) at each end]

Rline = r × l   (r = resistance per unit length)
Cline = c × l   (c = capacitance per unit length)
τ = (1/2) × r × c × l²

• We will not directly address delay or energy expended in the interconnect in this class
 – Simple architecture model: lump the energy/power with the source component
(22)
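For completeness, a tiny worked example of the lumped-RC delay formula above; the per-unit-length r and c values are placeholders, not process data, and the point is only the quadratic growth with wire length.

```python
# Sketch: lumped-RC wire delay tau = (1/2) * r * c * l^2.
# Per-unit-length values below are placeholders, not real process parameters.

r = 100.0    # placeholder resistance per unit length (ohm/mm)
c = 0.2e-12  # placeholder capacitance per unit length (F/mm)

for l_mm in (1.0, 2.0, 4.0):
    tau = 0.5 * r * c * l_mm ** 2
    print(f"l = {l_mm} mm -> tau ~ {tau * 1e12:.1f} ps")   # delay grows as l^2
```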
Power Management Basics

Parallelism and Power
[Figure: IBM Power5 die photo (Source: IBM); AMD Trinity die photo (Source: forwardthinking.pcmag.com)]
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
(24)
The Power Wall

P = α × C × Vdd² × f + Vdd × Ileak

• Power per transistor scales with frequency but also scales with Vdd
 – Lower Vdd can be compensated for with increased pipelining to keep throughput constant
 – Power per transistor is not the same as power per area → power density is the problem!
 – Multiple units can be run at lower frequencies to keep throughput constant, while saving power
(25)
What is the Problem?
[Figure: projected power density across technology generations, based on scaling Pentium-class cores — Mukhopadhyay and Yalamanchili (2009)]
 – While Moore’s Law continues, scaling phenomena have changed
 – Power densities are increasing with each generation
(26)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge et al., 2008
(27)
What are my Options?
1. Better technology
 – Manufacturing
 – Better devices (FinFET)
 – New devices → non-CMOS? → this is the future
2. Be more efficient – activity management
 – Clock gating – dynamic energy/power
 – Power gating – static energy/power
 – Power state management – both
3. Improved architecture
 – Simpler pipelines
4. Parallelism
(28)
Activity Management

Clock Gating
[Figure: combinational logic block whose clock is gated by a condition signal (cond) — the clock to the block toggles only when cond is asserted]
• Turn off the clock to a block of logic
• Eliminate unnecessary transitions/activity
• Saves clock distribution power

Power Gating
[Figure: power-gate transistor between Vdd and a block of logic (e.g., Core 0, Core 1) — the supply to an idle block can be switched off]
• Turn off power to a block of logic, e.g., a core
• No leakage
(29)
Multiple Voltage Frequency Domains
Intel Sandy Bridge Processor
[Figure: Sandy Bridge die photo showing cores, ring interconnect, cache, and graphics]
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and a portion of the cache can be gated off
From E. Rotem et al., HotChips 2011
(30)
Processor Power States
• Performance states – P-states
 – Operate at different voltages/frequencies
   o Recall the delay-voltage relationship
 – Lower voltage → lower leakage
 – Lower frequency → lower power (not the same as energy!)
 – Lower frequency → longer execution time
• Idle states – C-states
 – Sleep states
 – Differ in how much state is saved
• SW- or HW-managed transitions between states!
(31)
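On Linux, software-managed DVFS is exposed through the cpufreq sysfs interface, which gives one concrete view of these states. The sketch below only reads the current settings; which files exist depends on the cpufreq driver (e.g., acpi-cpufreq vs. intel_pstate), so treat the paths as typical rather than guaranteed.

```python
# Sketch: inspect per-core DVFS settings via the Linux cpufreq sysfs interface.
# Which files exist depends on the cpufreq driver; missing files print as "n/a".
from pathlib import Path

def read(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"

base = Path("/sys/devices/system/cpu")
for cpu_dir in sorted(base.glob("cpu[0-9]*")):
    cf = cpu_dir / "cpufreq"
    if not cf.is_dir():
        continue
    print(cpu_dir.name,
          "governor:", read(cf / "scaling_governor"),
          "cur(kHz):", read(cf / "scaling_cur_freq"),
          "range(kHz):", read(cf / "scaling_min_freq"), "-", read(cf / "scaling_max_freq"))
```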
Example of P-states
AMD Trinity A10-5800 APU: 100W TDP

  CPU P-state            Voltage (V)   Freq (MHz)
  Pb0 (HW-only boost)        1.000        2400
  Pb1 (HW-only boost)        0.875        1800
  P0  (SW-visible)           0.825        1600
  P1  (SW-visible)           0.812        1400
  P2  (SW-visible)           0.787        1300
  P3  (SW-visible)           0.762        1100
  P4  (SW-visible)           0.750         900

• Software-managed power states
• Changing power states is not free
(32)
Example of P-states
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
(33)
Management Knobs
• Each core can be in any one of multiple states
• How do I decide what state to set each core to?
 – Who decides? HW? SW?
• How do I decide when I can turn off a core?
• What am I saving? Static energy or dynamic energy?
(34)
Power Management
• Software-controlled power management
 – Optimize power and/or energy
 – Orchestrated by the operating system or application libraries
 – Industry-standard interfaces for power management
   o Advanced Configuration and Power Interface (ACPI)
     – https://www.acpica.org/
     – http://www.acpi.info/
• Hardware power management
 – Optimize power/energy
 – Failsafe operation, e.g., protect against thermal emergencies
(35)
Boosting
Intel Sandy Bridge
• Exploit package physics
 – Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: power vs. time — low-power periods build up thermal credits, which are then spent in the turbo boost region between TDP power and max power for tens of seconds]
(36)
Power Gating
• Turn off components that are not being used
 – Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
 – Models to guide decisions
[Figure: Intel Sandy Bridge processor die photo]
(37)
Parallelism
• Concurrency + lower frequency → greater energy efficiency

P = α × C × Vdd² × f + Vdd × Ileak

Example: replace one core + cache with four cores, each with its own cache
• 4X #cores
• 0.75x voltage
• 0.5x frequency
• 1X power
• 2X performance
(38)
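The example above can be checked against the dynamic-power relation: power scales roughly with (number of cores) × Vdd² × f and throughput with (number of cores) × f. A quick sketch, counting dynamic power only and assuming perfect parallel scaling:

```python
# Sketch: check the "4X cores, 0.75x voltage, 0.5x frequency" example.
# Dynamic power ~ n_cores * Vdd^2 * f; throughput ~ n_cores * f (assumes
# perfect parallel scaling and ignores leakage and shared resources).

def relative_power(n_cores, v_scale, f_scale):
    return n_cores * v_scale ** 2 * f_scale

def relative_perf(n_cores, f_scale):
    return n_cores * f_scale

p = relative_power(4, 0.75, 0.5) / relative_power(1, 1.0, 1.0)  # 4 * 0.5625 * 0.5
s = relative_perf(4, 0.5) / relative_perf(1, 1.0)               # 4 * 0.5
print(f"power ~ {p:.2f}x, performance ~ {s:.1f}x")  # roughly 1X power, 2X performance
```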
Simplify Core Design
[Figure: AMD Bulldozer core vs. ARM A7 core (arm.com) die photos]
• Support for branch prediction, schedulers, etc. consumes more energy per instruction
• Can fit many more simpler cores on a die
(39)
Metrics
• Power efficiency
 – MIPS/watt
 – Ops/watt
• Energy efficiency
 – Joules/instruction
 – Joules/op
• Composite
 – Energy-delay product
 – Energy-delay²
Why are these useful?
(40)
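A minimal sketch of how the metrics above are derived from measured run-time, instruction count, and average power; the sample numbers are invented purely for illustration.

```python
# Sketch: computing the efficiency metrics listed above from basic measurements.
# The sample numbers below are invented for illustration only.

instructions = 2.0e11     # instructions retired
runtime_s    = 40.0       # execution time (s)
avg_power_w  = 65.0       # average power (W)

energy_j = avg_power_w * runtime_s
mips_per_watt        = (instructions / runtime_s) / 1e6 / avg_power_w
joules_per_instr     = energy_j / instructions
energy_delay         = energy_j * runtime_s           # EDP
energy_delay_squared = energy_j * runtime_s ** 2      # ED^2

print(f"MIPS/watt           = {mips_per_watt:.1f}")
print(f"Joules/instruction  = {joules_per_instr:.2e}")
print(f"Energy-delay (EDP)  = {energy_delay:.3e} J*s")
print(f"Energy-delay^2      = {energy_delay_squared:.3e} J*s^2")
```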
Thermal Issues
• Heat can cause damage to the chip
 – Need failsafe operation
• Thermal fields change the physical characteristics
 – Leakage current, and therefore power, increases
 – Delay increases
 – Device degradation becomes worse
• The cooling solution determines the permitted power dissipation
(42)
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
 – Dictates the design of the cooling system
   o Max temperature → Tjmax
 – Typically fixed by the worst-case workload
• Parts typically operate below the TDP
• Opportunities for turbo mode?
[Figure: AMD Trinity APU with heat sink — http://ecs.vancouver.wsu.edu/thermofluids-research]
(43)
Heat Sink Limits on Performance
• Thermal design power (TDP)
 – Determines the cooling solution & package limits
• Performance depends on effective utilization of this thermal headroom
[Figure: instructions/cycle and power vs. time for a workload — the gap between workload power and the TDP limit is thermal headroom; HW boost states and SW-visible states convert thermal headroom to higher performance through boosting (www.legitreviews.com)]
(44)
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
(45)
Coordinated Energy Management in Heterogeneous Processors
SC13
Indrani Paul (1,2), Vignesh Ravi (1), Srilatha Manne (1), Manish Arora (1,3), Sudhakar Yalamanchili (2)
(1) Advanced Micro Devices, Inc.
(2) Georgia Institute of Technology
(3) University of California, San Diego
(46)
Goal
• Goal:
 – Optimize energy efficiency under power and performance constraints in a heterogeneous processor
• Outline:
 – Problem
 – State-of-the-Art Power Management
 – HPC Application Characteristics and Frequency Sensitivity
 – Run-time Coordinated Energy Management
 – Results
(47)
State-of-the-art Heterogeneous Processor
[Figure: accelerated processing unit (APU) die — multi-threaded CPU cores, a GPU with 384 AMD Radeon™ cores, and a shared Northbridge giving access to overlapping CPU/GPU physical address spaces]
• Many resources are shared between the CPU and GPU
 – For example, memory hierarchy, power, and thermal capacity
(48)
Programming Model
[Figure: software stack — a user application issues host tasks to the CPU and GPU tasks through OpenCL™ (or another software stack) and the operating system, down to the APU hardware]
• Each OpenCL kernel is an N-dimensional range: a grid of threads, each operating over a data partition
• Coupled programming model → offload compute-intensive tasks to the GPU
(49)
CPU-GPU Phase Behavior in an Exascale Proxy Application (Lulesh)
• CPU-GPU coupled execution → time-varying redistribution of compute intensity
• Energy-efficient operation → coordinated distribution of power to CPU vs. GPU
• Coordinated power states → sensitivity of performance to CPU and GPU power state (frequency)
 – Need to characterize ROI: return (performance) on investment (power)
(50)
Challenge: CPU-GPU Coupling Effects
[Figure: user application with host tasks and GPU tasks — direct performance coupling between CPU and GPU plus indirect performance coupling through shared resources; coordinated energy management must balance the performance constraint against power efficiency]
• HPC applications have uncompromising performance requirements!
• Need more efficient energy management
(51)
State of the Art Power Management

State-of-the-art: Bi-directional Application Power Management (BAPM)
[Figure: chip divided into BAPM-controlled thermal entities (TEs) — CU0, CU1, and the GPU]
Power management algorithm:
1. Calculate a digital estimate of power consumption
2. Convert power to temperature
 – RC network model for heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
(53)
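The four numbered steps can be read as a periodic control loop. Below is a minimal Python sketch of that loop; the power-estimation, thermal-RC, and budgeting functions are placeholders standing in for the vendor models, so this only illustrates the structure of BAPM, not the real algorithm.

```python
# Sketch of the BAPM control loop described above. The models are placeholders:
# real BAPM uses per-TE digital power estimates, an RC thermal network, and
# vendor-specific budgeting/boost policies.

T_LIMIT = 1.0   # normalized thermal limit per thermal entity (TE)

def estimate_power(te):            # step 1 (placeholder model)
    return te["activity"] * te["v"] ** 2 * te["f"]

def power_to_temperature(te, p):   # step 2: stand-in for the RC network model
    te["temp"] += 0.1 * (p - te["cooling"])
    return te["temp"]

def assign_budget(te, temp):       # step 3: budget from temperature headroom
    headroom = max(0.0, T_LIMIT - temp)
    return te["base_budget"] + headroom

def local_boost(te, budget):       # step 4: each TE picks its own DVFS state
    te["f"] = min(te["f_max"], budget / (te["activity"] * te["v"] ** 2))

def bapm_step(thermal_entities):
    for te in thermal_entities:
        p = estimate_power(te)
        temp = power_to_temperature(te, p)
        budget = assign_budget(te, temp)
        local_boost(te, budget)

tes = [{"name": n, "activity": 0.5, "v": 1.0, "f": 1.0, "f_max": 1.5,
        "temp": 0.7, "cooling": 0.4, "base_budget": 0.4} for n in ("CU0", "CU1", "GPU")]
for _ in range(3):
    bapm_step(tes)
print([(te["name"], round(te["f"], 2), round(te["temp"], 2)) for te in tes])
```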
Power Management
• Performance and energy efficiency depend on effective utilization of power and thermal headroom
[Figure: APU die temperature and performance (instructions/cycle) over time — thermal headroom is converted to higher performance through boost; CPU DVFS states range from HW-only boost states (Pb0, Pb1) through SW-visible states (P0, P1, P2, ... Pmin), and GPU DVFS states from high through medium to low]
(54)
Key Observations
• Overall application performance is a function of both the CPU and the GPU
• State of the practice: manage to thermal limits by locally boosting when power and thermal headroom are available → utilize all of the available headroom
• Pitfall: boosting may not lead to a proportional performance improvement → energy inefficient
• Need a concept of performance sensitivity to power states
(55)
Application Characteristics

Frequency Sensitivity of GPU Kernels
[Figure: % increase in run-time (0-160%) for the miniMD kernels (Total, Force, Neighbour, Comm, Other) as the GPU DVFS state is lowered from high to medium to low]
• Some kernels are more sensitive to GPU frequency than others → more power-efficient operating points exist
(57)
Sensitivity of GPU Kernel Execution to CPU Frequency
[Figure: % increase in run-time (0-50%) for the miniMD kernels (Total, Force, Neighbor, Comm, Other) as the CPU DVFS state is lowered from P0 to P4]
 – Some kernels are more tightly coupled to the CPU's performance
 – Smaller kernels such as Comm have high overheads in launching and feeding the GPU
(58)
Sensitivity to Shared Resource Interference
[Figure: miniMD Neighbor kernel — memory bandwidth breakdown (GPU_Mem_BW/Pb1 vs. CPU_Mem_BW/Pb0) and CPU DVFS residency; performance is actually limited by GPU memory demand, yet power management locally boosts the CPU to its highest DVFS states]
• Wasted energy → power inefficient
• Need online estimates of sensitivity to interference
(59)
Computation and Control Divergence
[Figure: graph algorithm (BFS) — measured GPU frequency sensitivity vs. GPU ALU busy percentage]
• GPU_freq_sensitivity: unit performance gain for unit frequency increase
• GPU_ALUBusy%: measured hardware compute utilization
• Control divergence → increased thread serialization → increased frequency sensitivity
(60)
Key Observations
• HPC applications exhibit varying degrees of CPU and GPU frequency sensitivity due to
 – Control divergence
 – Interference at shared resources
 – Performance coupling between CPU and GPU
• Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
• Sensitivity metrics drive the coordinated setting of CPU and GPU power states
(61)
Energy Management

Performance Metrics for APU Frequency Sensitivity
[Figure: metric categories feeding the model — GPU compute, CPU compute, interference, and performance coupling]
 – Linear regression model using the above metrics to compute measured frequency sensitivity
(63)
DynaCo: Run-time System for Coordinated Energy Management
[Figure: run-time loop — a performance metric monitor computes CPU-GPU frequency sensitivity, which drives the CPU-GPU power state decision]

  GPU Frequency Sensitivity   CPU Frequency Sensitivity   Decision
  High                        Low                         Shift power to GPU
  High                        High                        Proportional power allocation
  Low                         High                        Shift power to CPU
  Low                         Low                         Reduce power of both CPU and GPU

 – DynaCo-1levelTh: lowest CPU DVFS-state limited to P2
 – DynaCo-multilevelTh: lowest CPU DVFS-state allowed to use up to Pmin based on degree of performance coupling
(64)
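The decision table above can be written directly as a small function. The sketch below encodes the four cases; the threshold separating "high" from "low" sensitivity and the concrete power-shifting actions are placeholders, since the slide does not specify them.

```python
# Sketch: the DynaCo decision table above as code. The sensitivity cutoff and
# the concrete actions are placeholders; the slide only gives the four-way decision.

SENS_THRESHOLD = 0.5   # placeholder cutoff between "low" and "high" sensitivity

def dynaco_decision(gpu_sensitivity: float, cpu_sensitivity: float) -> str:
    gpu_high = gpu_sensitivity >= SENS_THRESHOLD
    cpu_high = cpu_sensitivity >= SENS_THRESHOLD
    if gpu_high and not cpu_high:
        return "shift power to GPU"
    if gpu_high and cpu_high:
        return "proportional power allocation"
    if not gpu_high and cpu_high:
        return "shift power to CPU"
    return "reduce power of both CPU and GPU"

for g, c in [(0.8, 0.2), (0.8, 0.8), (0.2, 0.8), (0.2, 0.2)]:
    print(f"GPU sens={g}, CPU sens={c} -> {dynaco_decision(g, c)}")
```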
Key Observations
• Coordinated CPU-GPU execution
• Linear combination of three key high-level performance metrics proposed to model APU frequency sensitivity behavior
• Run-time coordinated energy management scheme (DynaCo) to manage CPU and GPU DVFS states dynamically based on measured frequency sensitivities
(65)
Experimental Set-Up
 – Trinity A10-5800 APU: 100W TDP
 – CPU: managed by HW or SW

  CPU P-state            Voltage (V)   Freq (MHz)
  Pb0 (HW-only boost)        1.000        2400
  Pb1 (HW-only boost)        0.875        1800
  P0  (SW-visible)           0.825        1600
  P1  (SW-visible)           0.812        1400
  P2  (SW-visible)           0.787        1300
  P3  (SW-visible)           0.762        1100
  P4  (SW-visible)           0.750         900

 – GPU: managed by sending software messages through the driver layer

  GPU P-state   Freq (MHz)
  GPU-high         800
  GPU-med          633
  GPU-low          304

 – DynaCo implemented as a run-time software policy overlaid on top of BAPM in real hardware
(66)
Benchmarks

  BM (Description)   Problem Size
  miniMD             32 x 32 x 32 elements
  miniFE             100 x 100 x 100 elements
  Lulesh             100 x 100 x 100 elements
  Sort               2,097,152 elements
  Stencil2D          4,096 x 4,096 elements
  S3D                SHOC default for integrated GPU
  BFS                1,000,000 nodes
(67)
Energy Efficiency (ED² Product)
[Figure: normalized ED² product for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static across the benchmarks]
• Average energy-efficiency improvement of 24% and 30% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively
(68)
Execution Time Impact
[Figure: increase in run-time relative to the baseline for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
• Average performance slowdown of 0.78% and 1.61% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively
(69)
Power Savings
[Figure: % power savings over the baseline (0-60%) for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
• Average power savings of 24% and 31% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively
(70)
Conclusions
• Note the effects of shared-resource interference, control divergence, and performance coupling on energy management for HPC applications
• Importance and scope of frequency sensitivity in characterizing energy behavior in tightly coupled heterogeneous architectures
• Dynamic power shifting: power goes to the entity that can best utilize it
(71)
Cooperative Boosting: Needy versus Greedy Power Management
Indrani Paul (1,2), Srilatha Manne (1), Manish Arora (1,3), W. Lloyd Bircher (1), Sudhakar Yalamanchili (2)
June 2013
(1) Advanced Micro Devices, Inc.
(2) Georgia Institute of Technology
(3) University of California, San Diego
(72)
Goal & Outline
• Goal:
 – Optimize performance under power and thermal constraints in a heterogeneous architecture
• Outline:
 – State-of-the-Art Power and Thermal Management
 – Thermal Coupling
 – Performance Coupling
 – Cooperative Boosting
 – Results
(73)
State-of-the-art Heterogeneous Processor
[Figure: accelerated processing unit (APU) die — multi-threaded CPU cores, a GPU with 384 AMD Radeon™ cores, and a shared Northbridge giving access to overlapping CPU/GPU physical address spaces]
• Many resources are shared between the CPU and GPU
 – For example, memory hierarchy, power, and thermal capacity
(74)
What is Thermal Design Power?
• Thermal design power: TDP
 – Upper bound for the sustainable power draw
 – Determines the cooling solution and package limits
 – Usually set by determining the worst-case execution profile
• Performance depends on effective utilization of thermal headroom
[Figure: instructions/cycle vs. time under a TDP limit (www.legitreviews.com)]
(75)
Key Observations
• Power and thermals are shared resources in a heterogeneous processor → thermal coupling
• Overall application performance is a function of both the CPU and the GPU → performance coupling
• State of the practice: managing to thermal limits by locally boosting when thermal headroom is available → utilize all of the headroom!
(76)
Thermal Coupling

Thermal Signatures: CPU & GPU
[Figure: steady-state thermal fields produced by BAPM on a 19W AMD Trinity APU — a high-power GPU benchmark vs. a high-power CPU benchmark with the GPU idle]
 – Worst-case GPU: 19.7 W
 – Worst-case CPU: 18.8 W
 – Higher thermal density of CPUs → steeper thermal gradients
 – Faster consumption of thermal headroom on the CPU
(78)
Thermal Time Constant
[Figure: normalized peak CPU and GPU temperatures over ~300 seconds — left: running a 100% CPU workload with the GPU idle, the idle GPU temperature rose by ~20°C; right: running a 100% GPU workload with CPU cycles used only to feed the GPU]
• Significant rise in temperature of the idle component due to thermal coupling and pollution from the active components within a die
• The CPU consumes thermal headroom more rapidly (4X faster) → the GPU can sustain higher power boosts longer
(79)
Thermal Coupling: Headroom Availability
[Figure: CPU CU0/CU1 and GPU relative power and peak die temperature over time — CPU power is limited while the GPU runs at its max DVFS state; phases of thermal coupling and temperature throttling are marked]
(80)
Thermal Coupling: Consumption of Thermal Headroom
[Figure: same experiment with the CPU power limit removed — CPU CU0/CU1 and GPU relative power and peak die temperature over time]
• 6°C rise in GPU temperature once the CPU power limit was removed and both CUs were allowed to boost
(81)
Thermal Coupling: Thermal Throttling
[Figure: CPU CU0/CU1 and GPU relative power and peak die temperature over time, showing thermal coupling followed by temperature throttling]
 – Minimize the detrimental effects of thermal coupling by capping the maximum CPU P-state → P-state limiting
(82)
Residency in Different Power States
[Figure: GPU DVFS residency (GPU-low/med/high) and normalized peak temperature over time under three policies — baseline BAPM, capping the max CPU DVFS state at P2, and capping it at P4; the P-state caps increase GPU-high residency and lower peak temperature]
(83)
Key Observations
• Thermal signatures differ between CPU and GPU
 – Heterogeneity in physical properties
• High thermal density leads to faster consumption of thermal headroom in the CPU cores
• Significant thermal coupling from active to idle components
• Near the thermal limit, boosting based on available thermal headroom introduces inefficiencies
 – Reduce the CPU P-state limit
(84)
Performance Coupling

Programming Model
[Figure: software stack — a user application issues host tasks to the CPU and GPU tasks through OpenCL™ (or another software stack) and the operating system, down to the APU hardware]
• Each OpenCL kernel is an N-dimensional range: a grid of threads, each operating over a data partition
• Coupled programming model → offload compute-intensive tasks to the GPU
(86)
Managing Thermals for Performance-Coupled Applications
[Figure: Binary Search — GPU DVFS residency (GPU-low/med/high), speedup, and normalized GPU active time as a function of the CPU P-state limit]
(87)
Managing Thermals for Performance-Coupled Applications
[Figure: GPU DVFS residency, speedup, and normalized GPU active time as a function of the CPU P-state limit]
(88)
Managing Thermals for Performance-Coupled Applications
[Figure: GPU DVFS residency, speedup, and normalized GPU active time vs. CPU P-state limit — with aggressive CPU P-states the CPU is thermally limiting; with deep CPU P-state caps the CPU becomes performance limiting]
(89)
P-state Sensitivity
[Figure: Needle (NDL) — GPU DVFS residency and normalized speedup/GPU-active-time metrics as a function of the CPU P-state limit]
(90)
Determining Critical CPU P-state
[Figure: GPU DVFS residency, speedup, and normalized GPU active time vs. CPU P-state limit (Pb1 through P4); below, % change over the baseline in memory bandwidth and performance vs. CPU P-state limit, marking the critical CPU P-state limit]
• Find the inflection point in performance as a function of the CPU P-state → critical P-state
• The critical P-state is determined by interference (CPU vs. GPU) in the memory system
(91)
Key Observations
• Performance coupling – CPU-GPU performance dependency
• Balance between the detrimental effects of thermal coupling and the needs of performance coupling
• The critical CPU P-state limit is determined by performance coupling and thermal coupling
• GPU memory bandwidth gradients as a function of CPU frequency, along with CPU IPC, serve as a measure of performance coupling
(92)
Cooperative Boosting

Cooperative Boosting (CB)
• Overlaid on top of BAPM – invoked periodically when thermal coupling is detrimental, i.e., when the thermal limit is approached
(94)
Experimental Set-up
• Trinity A8-4555M APU: 19W TDP
• CPU: managed by HW or SW

  CPU P-state            Voltage (V)   Freq (MHz)
  Pb0 (HW-only boost)        1.000        2400
  Pb1 (HW-only boost)        0.875        1800
  P0  (SW-visible)           0.825        1600
  P1  (SW-visible)           0.812        1400
  P2  (SW-visible)           0.787        1300
  P3  (SW-visible)           0.762        1100
  P4  (SW-visible)           0.750         900

 – GPU: managed by HW only
   o GPU-high: 423 MHz
   o GPU-med: 320 MHz
 – Cooperative Boosting implemented as a system software policy overlaid on top of BAPM in real hardware
(95)
Benchmarks

  BM (Description)                    Problem Size                                    Type
  NDL (Needleman-Wunsch)              4096x4096 data points, 1K iterations            Performance-coupled
  HS (HotSpot)                        1024x1024 data points, 100K iterations          Performance-coupled
  BF (BoxFilter SAT)                  1Kx1K input image, 6x6 filter, 10K iterations   Performance-coupled
  FAH (Folding at Home)               Synthesis of large protein: spectrin            Performance-coupled
  BS (Binary Search)                  4096 inputs, 256 segments, 1M iterations        Performance-coupled
  Viewdle (Haar facial recognition)   Image 1920x1080, 2K iterations                  Performance-coupled
  Lbm (CPU2006)                       4 threads, Ref input                            CPU-centric
  Gcc (CPU2006)                       4 threads, Ref input                            CPU-centric
(96)
Performance Improvement with Cooperative Boosting
[Figure: speedup over the baseline with static P-state limits (P0, P2, P4) and with CB, per benchmark and mean]
• Static P-state limiting requires profiling and a priori information about the workload
• An average of 15% performance gain for performance-coupled applications with CB
(97)
Power Savings
[Figure: % power savings over the baseline (0-40%) with CB for NDL, HS, BF, FAH, BS, Viewdle, Lbm, Gcc, and the mean]
• Average 10% power savings across performance-coupled applications
• 5°C reduction in peak temperature for BS → a large percentage of leakage power savings
(98)
Energy × Delay²
[Figure: normalized energy-delay² for static P-state limits P0, P2, P4, CB, and the baseline across NDL, HS, BF, FAH, BS, Viewdle, Lbm, Gcc, and the mean]
• Average 33% energy-delay² savings across performance-coupled applications
(99)
Conclusions
• Effects of thermal and performance coupling on performance
 – Applications with a high GPU compute-to-load ratio are more susceptible to the detrimental effects of thermal coupling
 – Emerging balanced workloads with split CPU-GPU computation are tightly performance-coupled
• Cooperative Boosting (CB): balance the effects of thermal coupling with the needs of performance coupling
 – Shifts power to the CPU only when needed
(100)