Power Management
Lecture notes S. Yalamanchili and S. Mukhopadhyay
Technology Scaling
[Figure: MOSFET cross-sections labeling gate, source, drain, body, oxide thickness tox, and channel length L]
• 30% scaling down in dimensions → doubles transistor density
• Power per transistor: $P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
  - Vdd scaling → lower power
• Transistor delay = Cgate Vdd / ISAT
  - Cgate, Vdd scaling → lower delay
(2)
Moore’s Law
Goal: Sustain Performance Scaling
• Performance scaled with number of transistors
• Dennard scaling*: power scaled with feature size
[Figure from wikipedia.org]
*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State
Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
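The Dennard scaling claim can be checked with a few lines of arithmetic. The sketch below assumes ideal constant-field scaling, where capacitance and Vdd shrink by the same factor s = 0.7 as the dimensions and frequency rises by 1/s. These factors and the function name are illustrative assumptions, and only the dynamic term of the power equation is modeled.

```python
def dennard_scale(s=0.7):
    """Relative changes after one generation of ideal constant-field scaling.

    Assumptions (not from the slides): C and Vdd scale by s, f scales by 1/s,
    and only dynamic power (C * Vdd^2 * f) is considered.
    """
    density = 1.0 / (s * s)               # transistors per unit area
    cap = s                               # gate capacitance
    vdd = s                               # supply voltage
    freq = 1.0 / s                        # clock frequency
    power_per_transistor = cap * vdd**2 * freq
    power_density = power_per_transistor * density
    return {
        "density": density,
        "power_per_transistor": power_per_transistor,
        "power_density": power_density,
    }

result = dennard_scale(0.7)
for name, value in result.items():
    print(f"{name}: {value:.2f}x")
```

Under these assumptions density roughly doubles while power per transistor halves, so power per unit area stays flat. Once Vdd stops scaling, that balance no longer holds.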
(3)
Parallelism and Power
[Figure: IBM Power5 (source: IBM)]
[Figure: AMD Trinity (source: forwardthinking.pcmag.com)]
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
(4)
The Power Wall
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
• Power per transistor scales with frequency but also scales with Vdd
  - Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  - Power per transistor is not the same as power per area → power density is the problem!
  - Multiple units can be run at lower frequencies to keep throughput constant, while saving power
(5)
What is the Problem?
[Figure: Mukhopadhyay and Yalamanchili (2009), based on scaling using Pentium-class cores]
• While Moore’s Law continues, scaling phenomena have changed
• Power densities are increasing with each generation

(6)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge et al., 2008
(7)
Power Management Basics
What are my Options?
1. Better technology (not this course)
  - Manufacturing
  - Better devices (FinFET)
  - New devices → non-CMOS? → this is the future
2. Be more efficient – activity management
  - Clock gating – dynamic energy/power
  - Power gating – static energy/power
  - Power state management – both
3. Improved architecture
  - Simpler pipelines
4. Parallelism
(9)
Activity Management
Clock Gating
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Clock distribution power
[Figure: combinational logic block with input and a clk signal gated by cond]
Power Gating
• Turn off power to a block of logic, e.g., core
• No leakage
[Figure: Core 0 and Core 1 connected to Vdd through power gate transistors]
(10)
Multiple Voltage Frequency Domains
[Figure: Intel Sandy Bridge processor, from E. Rotem et al., Hot Chips 2011]
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and portion of cache can be gated off
(11)
Processor Power States
• Performance States – P-states
  - Operate at different voltages/frequencies
    o Recall delay-voltage relationship
  - Lower voltage → lower leakage
  - Lower frequency → lower power (not the same as energy!)
  - Lower frequency → longer execution time
• Idle States – C-states
  - Sleep states
  - Differ in how much state is saved
• SW or HW managed transitions between states!
(12)
Example of P-states
AMD Trinity A10-5800 APU: 100W TDP
CPU P-state   Voltage (V)   Freq (MHz)   Visibility
Pb0           1             2400         HW only (boost)
Pb1           0.875         1800         HW only (boost)
P0            0.825         1600         SW-visible
P1            0.812         1400         SW-visible
P2            0.787         1300         SW-visible
P3            0.762         1100         SW-visible
P4            0.75          900          SW-visible
• Software Managed Power States
• Changing Power States is not free
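To see how power, execution time, and energy move across these P-states, the following sketch uses the voltage/frequency pairs from the table above. It assumes dynamic power only (proportional to V² · f) and a workload with a fixed cycle count. The function name and the choice of P0 as the baseline are illustrative assumptions.

```python
# P-state table transcribed from the slide: (name, voltage in V, frequency in MHz).
P_STATES = [
    ("Pb0", 1.000, 2400),
    ("Pb1", 0.875, 1800),
    ("P0", 0.825, 1600),
    ("P1", 0.812, 1400),
    ("P2", 0.787, 1300),
    ("P3", 0.762, 1100),
    ("P4", 0.750, 900),
]

def relative_metrics(states, baseline="P0"):
    """Return {name: (power, time, energy)} relative to the baseline state.

    Assumes dynamic power only (P ~ V^2 * f) and a fixed number of cycles,
    so time ~ 1/f and energy = power * time ~ V^2.
    """
    table = {name: (v, f) for name, v, f in states}
    v0, f0 = table[baseline]
    out = {}
    for name, (v, f) in table.items():
        power = (v / v0) ** 2 * (f / f0)
        time = f0 / f
        out[name] = (power, time, power * time)
    return out

metrics = relative_metrics(P_STATES)
for name, (power, time, energy) in metrics.items():
    print(f"{name}: power {power:.2f}x, time {time:.2f}x, energy {energy:.2f}x")
```

In this simplified model power drops much faster than energy as the state is lowered, because the run takes longer. A model that included leakage would add static energy proportional to the longer run time, which narrows the energy savings further.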
(13)
Example of P-states
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
(14)
Management Knobs
• Each core can be in any one of multiple states
• How do I decide what state to set each core to?
  - Who decides? HW? SW?
• How do I decide when I can turn off a core?
• What am I saving? Static energy or dynamic energy?
(15)
Power Management
• Software controlled power management
  - Optimize power and/or energy
  - Orchestrated by the operating system or application libraries
  - Industry standard interfaces for power management
    o Advanced Configuration and Power Interface (ACPI)
      https://www.acpica.org/
      http://www.acpi.info/
• Hardware power management
  - Optimized power/energy
  - Failsafe operation, e.g., protect against thermal emergencies
(16)
Power Management
[Figure: CPU DVFS state vs. time across HW-only boost states (Pb0, Pb1) and SW-visible states (P0, P1, P2, ..., Pmin), plotted with die temperature, thermal headroom, and performance (instructions/cycle) vs. time]
• Convert thermal headroom to higher performance through boost
• Performance and energy efficiency depend on effective utilization of power and thermal headroom
(17)
Boosting
Intel Sandy Bridge
• Exploit package physics
  - Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: power vs. time, with labels: low power (build up thermal credits), turbo boost region, max power, TDP power, 10s of seconds]
(18)
Power Gating
• Turn off components that are not being used
  - Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
  - Models to guide decisions
[Figure: Intel Sandy Bridge processor]
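One simple model for guiding a shutdown decision is a break-even idle time: gating pays off only if the leakage energy saved while idle exceeds the energy spent powering down and back up. The sketch below is an illustrative assumption, not a model from the lecture; the function names and all numbers are hypothetical.

```python
def break_even_idle_time(leak_power_w, power_down_j, power_up_j):
    """Idle time (seconds) above which power gating saves energy.

    Gating saves leak_power_w * idle_time but costs the energy to power
    down and power back up (including saving/restoring lost state).
    """
    return (power_down_j + power_up_j) / leak_power_w

def should_gate(predicted_idle_s, leak_power_w, power_down_j, power_up_j):
    """Gate only if the predicted idle period exceeds the break-even time."""
    return predicted_idle_s > break_even_idle_time(
        leak_power_w, power_down_j, power_up_j)

# Hypothetical numbers: 0.5 W leakage, 20 uJ to power down, 30 uJ to power up.
t_be = break_even_idle_time(0.5, 20e-6, 30e-6)
print(f"break-even idle time: {t_be * 1e6:.0f} us")
print(should_gate(1e-3, 0.5, 20e-6, 30e-6))   # long idle period
print(should_gate(10e-6, 0.5, 20e-6, 30e-6))  # short idle period
```

The hard part in practice is predicting the idle period, and a fuller model would also charge for the wake-up latency.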
(19)
Parallelism
• Concurrency + lower frequency → greater energy efficiency
Example
[Figure: one core with its cache replaced by four cores, each with its own cache]
• 4X #cores
• 0.75x voltage
• 0.5x frequency
• 1X power
• 2X in performance
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
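The numbers in this example can be reproduced from the dynamic term of the power equation. The sketch below assumes identical cores, dynamic power only, and a perfectly parallel workload. The function name is hypothetical.

```python
def scaled_design(n_cores, v_scale, f_scale):
    """Relative dynamic power and throughput of n_cores scaled cores.

    Assumes dynamic power only (P ~ C * Vdd^2 * f per core), identical cores,
    and a perfectly parallel workload so throughput ~ n_cores * f.
    """
    power = n_cores * v_scale**2 * f_scale
    performance = n_cores * f_scale
    return power, performance

power, perf = scaled_design(n_cores=4, v_scale=0.75, f_scale=0.5)
print(f"power: {power:.3f}x, performance: {perf:.1f}x, "
      f"perf/watt: {perf / power:.2f}x")
```

With these inputs the dynamic power comes out at 1.125x, which the slide rounds to 1X, for 2X the throughput.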
(20)
Simplify Core Design
[Figure: AMD Bulldozer core]
• Support for branch prediction, schedulers, etc. consumes more energy per instruction
• Can fit many more simpler cores on a die
[Figure: ARM A7 core (arm.com)]
(21)
Metrics
• Power efficiency
  - MIPS/watt
  - Ops/watt
• Energy efficiency
  - Joules/instruction
  - Joules/op
• Composite
  - Energy-delay product
  - Energy-delay²
Why are these useful?
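A small calculation helps show why the composite metrics matter: a slower design can win on energy alone yet lose once delay is weighted in. The sketch below computes the metrics listed above for two hypothetical runs of the same workload. The function name and all numbers are illustrative assumptions.

```python
def efficiency_metrics(energy_j, delay_s, instructions):
    """Compute the metrics listed above for one run of a workload."""
    power_w = energy_j / delay_s
    mips = instructions / delay_s / 1e6
    return {
        "mips_per_watt": mips / power_w,
        "joules_per_instruction": energy_j / instructions,
        "edp": energy_j * delay_s,          # energy-delay product
        "ed2p": energy_j * delay_s ** 2,    # energy-delay^2 product
    }

# Hypothetical runs of the same 1e9-instruction workload.
fast = efficiency_metrics(energy_j=10.0, delay_s=1.0, instructions=1e9)
slow = efficiency_metrics(energy_j=6.0, delay_s=2.0, instructions=1e9)
print(fast)
print(slow)
```

Here the slow run uses less energy per instruction, but its energy-delay product is worse (12 vs. 10) and its energy-delay² is much worse (24 vs. 10), because those metrics penalize the longer execution time.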
(22)
Modeling
Microarchitectural Level Models
• How can we study power consumption without building circuits?
  - Models
• Models are available at multiple levels of abstraction.
We are interested in microarchitectural models
(24)
Processor Microarchitecture
[Figure: processor microarchitecture block diagram. Stage labels: Fetch, Decode, Execute/Writeback, Memory, Network. Blocks: instruction cache, instruction TLB, branch prediction, fetch queue, instruction decoder, instruction queue, register files, ALU, MUL, FPU, LD, ST, data TLB, L1 data cache, L2 data cache, NoC router, on-chip network]
(25)
Energy/Power Calculation
• How do we calculate energy or power dissipation for a given microarchitecture?
• Energy/power varies between:
  - Different ISAs: ARM vs. Intel x86
  - Different microarchitectures: in-order vs. out-of-order
  - Different applications: memory-bound vs. compute-bound
  - Different technologies: 90nm vs. 22nm
  - Different operating conditions: frequency, temperature
(26)
Architecture Activity (1)
Activity 1: Instruction Fetch
icache.read++; fbuffer.write++;
[Figure: processor microarchitecture block diagram, as on the Processor Microarchitecture slide]
• Collect activity counts of each architecture component (through simulation or measurement).
• The list of components differs between microarchitectures.
• Activity counts at each component differ between applications.
(27)
Architecture Activity (2)
Activity 2: Instruction Decode
fbuffer.read++; idecoder.logic++;
[Figure: processor microarchitecture block diagram, as on the Processor Microarchitecture slide]
• Read/write accesses to caches, buffers, etc.
• Logical accesses to logic blocks such as decoder, ALUs, etc.
• Tradeoff of differentiating more access types (accuracy) vs. simulation speed (complexity).
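The counter updates shown for the fetch and decode activities (icache.read++, fbuffer.write++, fbuffer.read++, idecoder.logic++) can be collected with a very small data structure. The sketch below is a hypothetical illustration of the bookkeeping only: the class and function names are assumptions, and a real simulator would drive the counters from its pipeline model.

```python
from collections import Counter

class ActivityCounters:
    """Per-component access counts, keyed by (component, access type)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, component, access):
        self.counts[(component, access)] += 1

def simulate_instruction(counters):
    # Activity 1: instruction fetch (icache.read++; fbuffer.write++;)
    counters.record("icache", "read")
    counters.record("fbuffer", "write")
    # Activity 2: instruction decode (fbuffer.read++; idecoder.logic++;)
    counters.record("fbuffer", "read")
    counters.record("idecoder", "logic")

counters = ActivityCounters()
for _ in range(3):  # three instructions through fetch and decode
    simulate_instruction(counters)
print(dict(counters.counts))
```

Keying on (component, access type) makes the accuracy vs. complexity tradeoff explicit: more access types mean more keys to track and more per-access energies to characterize.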
(28)
Power and Architecture Activity
• For example, at the nth clock cycle, the collected counters are:
  - Data cache:
    o read = 20, write = 12
    o per-read energy = 0.5 nJ; per-write energy = 0.6 nJ
    o Read energy = read × per-read energy = 10 nJ
    o Write energy = write × per-write energy = 7.2 nJ
    o Total activity energy = read + write energies = 17.2 nJ
    o If n = 50 (the 50th clock cycle) and clock frequency = 2 GHz, total activity power = energy × clock_freq / n = 688 mW
*Note: n/clock_freq = n clock periods in seconds
power = time average of energy
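The arithmetic in this example can be written as a short function. The sketch below uses the slide's numbers; the function and parameter names are illustrative assumptions.

```python
def activity_power_w(counts, energy_per_access_nj, cycles, clock_hz):
    """Average power = total activity energy / elapsed time.

    counts and energy_per_access_nj map access type -> count / energy (nJ).
    Returns (power in watts, energy in nJ).
    """
    energy_nj = sum(counts[k] * energy_per_access_nj[k] for k in counts)
    elapsed_s = cycles / clock_hz
    return energy_nj * 1e-9 / elapsed_s, energy_nj

power_w, energy_nj = activity_power_w(
    counts={"read": 20, "write": 12},
    energy_per_access_nj={"read": 0.5, "write": 0.6},
    cycles=50,
    clock_hz=2e9,
)
print(f"energy = {energy_nj:.1f} nJ, power = {power_w * 1e3:.0f} mW")
```

This reproduces the 17.2 nJ and 688 mW figures: 50 cycles at 2 GHz is 25 ns, and 17.2 nJ over 25 ns is 0.688 W.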
(29)
Things to consider (1)
1. How do we calculate per-read/write energies?
• Per-access energies can be estimated from circuit-level designs and analyses.
• There are various open-source tools for this.
[Figure: architecture specification + technology parameters → circuit-level estimation tool → estimation results (area, energy, timing, etc.)]
(30)
Things to consider (2)
2. Is per-access energy always the same?
• Per-access energy in fact depends on:
  • how many bits are switching
  • how they are switching (0→1 or 1→0)
• It is reasonable to assume constant per-access energy over long-term observation (e.g., n = 1M clock cycles); the number of switching bits is averaged (e.g., 50% of bits are switching).
• Most architecture simulators do not capture bit-level details due to simulation complexity.
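The dependence on switching bits, and the reason averaging is reasonable, can be illustrated by counting toggled bits between consecutive values. The sketch below is an assumption-laden illustration: the per-toggle energy is a hypothetical constant, the data is uniformly random, and 0→1 and 1→0 transitions are treated alike.

```python
import random

def toggled_bits(old, new):
    """Number of bit positions that differ between two values."""
    return bin(old ^ new).count("1")

def write_energy_pj(old, new, energy_per_toggle_pj=1.0):
    """Switching energy of overwriting old with new.

    energy_per_toggle_pj is a hypothetical constant; a real model would also
    distinguish 0->1 from 1->0 transitions.
    """
    return toggled_bits(old, new) * energy_per_toggle_pj

print(write_energy_pj(0b0000_0000, 0b1111_1111))  # all 8 bits toggle
print(write_energy_pj(0b1010_1010, 0b1010_1011))  # a single bit toggles

# Over many random 32-bit writes, the toggled fraction settles near 50%.
rng = random.Random(0)
width, n = 32, 10_000
avg_fraction = sum(
    toggled_bits(rng.getrandbits(width), rng.getrandbits(width))
    for _ in range(n)
) / (n * width)
print(f"average toggled fraction: {avg_fraction:.3f}")
```

Individual accesses vary widely, but the long-run average is stable, which is what justifies a constant per-access energy for uniformly random data like this. Real workloads with highly correlated data may average to a different fraction.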
(31)
Things to consider (3)
3. If a register file had no read/write accesses but held data, what is the energy dissipation?
• Energy (or power) consists largely of dynamic and static dissipation.
• Dynamic (or switching) energy refers to energy dissipation due to switching activities.
• Static (or leakage) energy is the dissipation needed to keep the electronic system turned on.
• In this case, the register file has no dynamic energy dissipation but consumes static energy.
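The split between dynamic and static energy can be made concrete by adding a leakage term to the per-access model. The sketch below is illustrative only: the function name, the 0.1 nJ per-access energy, and the 10 mW leakage power are hypothetical values.

```python
def total_energy_nj(accesses, energy_per_access_nj,
                    leakage_power_mw, cycles, clock_hz):
    """Total energy = dynamic (per-access) energy + static (leakage) energy."""
    dynamic_nj = accesses * energy_per_access_nj
    elapsed_s = cycles / clock_hz
    static_nj = leakage_power_mw * 1e-3 * elapsed_s * 1e9
    return dynamic_nj, static_nj, dynamic_nj + static_nj

# Hypothetical register file: idle for 1000 cycles at 2 GHz, 10 mW leakage.
dynamic, static, total = total_energy_nj(
    accesses=0, energy_per_access_nj=0.1,
    leakage_power_mw=10.0, cycles=1000, clock_hz=2e9)
print(f"dynamic = {dynamic} nJ, static = {static:.1f} nJ, total = {total:.1f} nJ")
```

With zero accesses the dynamic term vanishes, yet the idle block still dissipates leakage power multiplied by elapsed time.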
(32)
Thermal Issues
Thermal Issues
• Heat can cause damage to the chip
  - Need failsafe operation
• Thermal fields change the physical characteristics
  - Leakage current and therefore power increases
  - Delay increases
  - Device degradation becomes worse
• Cooling solution determines the permitted power dissipation
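One way to see why the cooling solution sets the permitted power is a lumped steady-state thermal model, T = T_ambient + P · R_th, where R_th is the junction-to-ambient thermal resistance. This model and all numbers below are assumptions for illustration, not values from the lecture.

```python
def steady_state_temp_c(power_w, r_th_c_per_w, t_ambient_c):
    """Steady-state die temperature for a lumped thermal resistance."""
    return t_ambient_c + power_w * r_th_c_per_w

def max_power_w(t_jmax_c, r_th_c_per_w, t_ambient_c):
    """Largest sustained power that keeps the die at or below t_jmax_c."""
    return (t_jmax_c - t_ambient_c) / r_th_c_per_w

# Hypothetical cooling solutions: junction-to-ambient resistance in C/W.
for name, r_th in [("small heat sink", 0.6), ("large heat sink", 0.3)]:
    limit = max_power_w(t_jmax_c=100.0, r_th_c_per_w=r_th, t_ambient_c=40.0)
    print(f"{name}: sustained power limit = {limit:.0f} W")
```

Halving the thermal resistance doubles the sustainable power for the same temperature limit. Transient behavior, such as short boosts above the sustained limit, needs a model with thermal capacitance as well.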
(34)
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
  - Dictates the design of the cooling system
    o Max temperature → Tjmax
  - Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
[Figure: AMD Trinity APU; http://ecs.vancouver.wsu.edu/thermofluids-research]
(35)
Heat Sink Limits on Performance
Thermal design power (TDP)
• Determines the cooling solution & package limits
• Performance depends on effective utilization of this thermal headroom
• Convert thermal headroom to higher performance through boosting
[Figure: power, temperature, and instructions/cycle vs. time for a workload, showing thermal headroom, boost power, HW boost states, and SW visible states; image credit www.legitreviews.com]
(36)
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
(37)
Issues
• Cooling chips is now an issue for computer architects!
• Co-design the cooling system and the processor
• Some very “cool” new technologies
  - E.g., microfluidics!
(38)
Electrical and Fluidic I/Os
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
• Fluid flow through the microchannels carries heat out to an external heat exchanger (e.g., heat sink)
(39)
Fabrication Examples
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
Micropin-fins (150 µm diameter and 225 µm diameter) and vias
Electrical and fluidic microbumps, fluidic vias and fine wires
(40)
Conclusions
• Power/energy is the leading driver of modern architecture design
• Power and energy management is key to scalability
• Need integrated power/energy, performance, thermal management in fielded systems
• What about energy/power efficient algorithms?
(41)
Study Guide
• Explain the difference between energy dissipation and power dissipation
• Distinguish between static power dissipation and dynamic power dissipation
• Explain dynamic voltage frequency scaling
  - What are power states?
  - Why is this an advantage?
  - What is the impact of DVFS on i) energy, ii) execution time, and iii) power?
• Distinguish between clock gating and power gating
(42)
Study Guide (cont.)
• Define thermal design power (TDP)
• Name two schemes for preventing the chip from exceeding TDP. Explain how they achieve this goal
• What does boosting achieve?
• What is the difference between C-states and P-states?
• Name one power management technique that will save static power
• How does using many slower simpler cores improve power efficiency?
(43)
Study Guide (cont.)
• How is thermal design power (TDP) calculated?
• When using boost algorithms, what determines the duration of the high frequency operation?
• How does a power virus work?
• Describe how throttling works
• Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high performance computing segments
(44)
Glossary
• Boosting
• C-states
• Dynamic Power and Energy
• Power Gating
• P-states
• Static Power and Energy
• Time constant
• Thermal Design Point
• Throttling
(45)