Transcript lec4-1

CS 152
Computer Architecture and Engineering
Lecture 7 -- Power and Energy
2014-2-11
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152: L7: Power and Energy
UC Regents Spring 2014 © UCB
Today: Power and Energy
Metrics: Power and energy (intro)
Short Break.
Metrics: Power and energy (technique)
Power and Energy
Universal: Power and energy units are comparable across all of applied physics.
So, we use automobiles to introduce terminology.
The Watt: Unit of power. A rate of energy (J/s).
1 W = 1 J/s.
A gas pump hose delivers 6 MW.
120 kW: The power delivered by a Tesla Supercharger.
The Joule: Unit of energy.
1 J = 1 W s.
A 1 gallon gas container holds 130 MJ of energy.
Tesla Model S has a 306 MJ battery (good for 265 miles).
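The automobile figures above can be tied together with one division, since 1 W = 1 J/s. The following is a minimal sketch, not part of the lecture: the function name `seconds_to_transfer` is hypothetical, and it assumes energy is delivered at a constant rate with no losses, which a real charger does not achieve.

```python
def seconds_to_transfer(energy_joules, power_watts):
    """Time needed to move energy_joules at a constant power_watts (1 W = 1 J/s)."""
    return energy_joules / power_watts

MJ = 1e6  # joules per megajoule
KW = 1e3  # watts per kilowatt
MW = 1e6  # watts per megawatt

# Figures quoted on the slide.
battery_j = 306 * MJ        # Tesla Model S battery
supercharger_w = 120 * KW   # Tesla Supercharger
gallon_j = 130 * MJ         # one gallon of gasoline
pump_w = 6 * MW             # gas pump hose

# Assumption: constant power for the whole transfer.
charge_minutes = seconds_to_transfer(battery_j, supercharger_w) / 60
gallon_seconds = seconds_to_transfer(gallon_j, pump_w)

print(f"Full battery at a constant 120 kW: {charge_minutes:.1f} minutes")
print(f"One gallon through the pump hose: {gallon_seconds:.1f} seconds")
```

Under these assumptions the pump hose moves energy about 50 times faster than the Supercharger, which is simply the ratio 6 MW / 120 kW.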
Energy and Performance
Sad fact: Computers turn electrical energy into heat.
Computation is a byproduct.
Air or water carries heat away, or chip melts.
The Joule: Unit of energy. Can also be expressed as Watt-Seconds.
Burning 1 Watt for 100 seconds uses 100 Watt-Seconds of energy.
[Figure: a 1 V source driving 1 A through a 1 Ohm resistor, producing 1 Joule of heat energy per second.]
1 Joule heats 1 gram of water 0.24 degree C.
This is how electric tea pots work ...
The Watt: Unit of power. The amount of energy burned in the resistor in 1 second.
20 W rating: Maximum power the package is able to transfer to the air.
Exceed rating and resistor
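The resistor numbers above follow from P = V^2 / R and energy = power x time. The sketch below is only an illustration: the function names are hypothetical, and the 0.24 degree C per joule per gram figure is taken directly from the slide.

```python
def resistor_power(volts, ohms):
    """Power burned in a resistor, P = V^2 / R, in watts."""
    return volts ** 2 / ohms

def water_temp_rise_c(energy_joules, grams, deg_c_per_joule_gram=0.24):
    """Temperature rise of water, using the slide's 0.24 C per joule per gram."""
    return energy_joules * deg_c_per_joule_gram / grams

# 1 V across 1 Ohm draws 1 A and burns 1 W.
power_w = resistor_power(volts=1.0, ohms=1.0)

# Burning 1 W for 100 seconds uses 100 Watt-Seconds (100 J).
energy_j = power_w * 100

# If all of that heat went into 1 gram of water (an idealization).
rise_c = water_temp_rise_c(energy_j, grams=1.0)

print(f"{power_w} W, {energy_j} J, {rise_c:.1f} C rise in 1 g of water")
```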
Cooling an iPod nano ...
Like resistor on last slide, iPod relies on passive transfer of heat from case to the air.
Why? Users don’t want fans in their pocket ...
To stay “cool to the touch” via passive cooling, power budget of 5 W.
If iPod nano used 5 W all the time, its battery would last 15 minutes ...
Powering an iPod nano (2005 edition)
1.2 W-hour battery: Can supply 1.2 watts of power for 1 hour.
1.2 W-hr / 5 W ≈ 15 minutes.
More W-hours require bigger battery and thus bigger “form factor” - it wouldn’t be “nano” anymore :-).
Real specs for iPod nano: 14 hours for music, 4 hours for slide shows.
85 mW for music. 300 mW for slides.
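The battery-life figures above are all the same calculation: capacity in W-hours divided by average load in watts. A minimal sketch, assuming a constant load and an ideal battery (the helper name `battery_life_hours` is hypothetical):

```python
def battery_life_hours(capacity_watt_hours, load_watts):
    """Hours of runtime for a constant load on an ideal battery."""
    return capacity_watt_hours / load_watts

CAPACITY_WH = 1.2  # 2005 iPod nano battery, from the slide

# Loads quoted on the slides.
loads_w = {
    "5 W thermal budget": 5.0,
    "music, 85 mW": 0.085,
    "slides, 300 mW": 0.300,
}

lives_h = {label: battery_life_hours(CAPACITY_WH, w) for label, w in loads_w.items()}

for label, hours in lives_h.items():
    print(f"{label}: {hours:.2f} hours ({hours * 60:.0f} minutes)")
```

The 85 mW and 300 mW loads reproduce the quoted 14-hour and 4-hour specs, and the 5 W case gives roughly the 15 minutes mentioned for the thermal budget.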
Finding the (2005) iPod nano CPU ...
A close relative ...
Two 80 MHz CPUs. One CPU used for audio, one for slides.
Low-power ARM: roughly 1 mW per MHz ... variable clock, sleep modes.
The CPU is only part of power budget!
“Amdahl’s Law for Power”
[Pie chart: power breakdown of a 2004-era notebook running a full workload. Slices: CPU, GPU, LCD, LCD Backlight, “other”.]
If our CPU took no power at all to run, that would only double battery life!
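The “Amdahl’s Law for Power” point can be written as a one-line formula. The sketch below is an illustration under a stated assumption: the statement that a zero-power CPU would double battery life implies the CPU draws about half of total power, so `CPU_FRACTION = 0.5` is inferred from the slide, not read off the chart. The function name is hypothetical.

```python
def battery_life_gain(component_fraction, component_reduction):
    """Battery-life multiplier when one component's power is cut.

    component_fraction: share of total power drawn by the component (0..1).
    component_reduction: fraction of that component's power removed (0..1).
    """
    remaining_power = 1.0 - component_fraction * component_reduction
    return 1.0 / remaining_power

CPU_FRACTION = 0.5  # assumption inferred from "would only double battery life"

ideal_cpu_gain = battery_life_gain(CPU_FRACTION, 1.0)    # CPU uses no power at all
halved_cpu_gain = battery_life_gain(CPU_FRACTION, 0.5)   # CPU power cut in half

print(f"Zero-power CPU: {ideal_cpu_gain:.2f}x battery life")
print(f"Half-power CPU: {halved_cpu_gain:.2f}x battery life")
```

As with Amdahl's Law for speedup, the untouched part of the budget (display, GPU, “other”) bounds the achievable gain.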
What’s happened since 2005?
2010 nano
0.74 ounces (50% of 2005 Nano)
“Up to” 24 hours audio playback. 70% improvement from 2005 nano.
0.39 W Hr (33% of 2005 Nano)
Processors and Energy
Moore’s Law
[Plot: transistor counts, with markers at 2 Thousand, 1 Million, and 2.6 Billion.]
Main driver: device scaling ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
1974: Dennard Scaling
If we scale the gate length by a factor κ, how should we scale other aspects of transistor to get the “best” results?
[Figure: transistor cross-sections, not scaled vs. scaled with κ=5.]
Dennard Scaling
Things we do: scale dimensions, doping, Vdd.
What we get: κ^2 as many transistors at the same power density! Whose gates switch κ times faster!
Power density scaling ended in 2003 (Pentium 4: 3.2GHz, 82W, 55M FETs).
Why? We could no longer scale Vdd.
The Power Wall
Why? We can no longer fully scale Vdd ... because MOS transistor leakage current is no longer a negligible part of the power budget.
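The “same power density” claim can be checked with simple bookkeeping. The sketch below is a hedged illustration of classical Dennard scaling, not the slide's own derivation: it assumes that gate capacitance scales as 1/κ along with the dimensions, and that per-gate switching power goes as C x Vdd^2 x f. The function name `dennard_scale` is hypothetical.

```python
from fractions import Fraction

def dennard_scale(kappa):
    """Relative quantities after scaling by kappa (all 1 before scaling)."""
    k = Fraction(kappa)
    voltage = 1 / k          # Vdd scaled down by kappa
    capacitance = 1 / k      # assumption: C per gate shrinks with dimensions
    frequency = k            # gates switch kappa times faster
    density = k ** 2         # kappa^2 as many transistors per unit area
    power_per_gate = capacitance * voltage ** 2 * frequency  # ~ C * Vdd^2 * f
    return {
        "density": density,
        "frequency": frequency,
        "power_per_gate": power_per_gate,
        "power_density": power_per_gate * density,
    }

scaled = dennard_scale(5)  # kappa = 5, as in the figure
for name, value in scaled.items():
    print(f"{name}: {value}")
```

If Vdd can no longer be scaled (voltage held at 1 in this model), the same bookkeeping gives a power density that grows as κ^2, which is one way to see why the end of voltage scaling produced a power wall.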
Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$
$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
Strong result: Independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do ...
(2) Reduce Vdd. But lowering Vdd limits the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale
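The four options above map onto the terms of a simple dynamic-power estimate: energy per transition is (1/2) C Vdd^2, and power is that energy times the number of transitions per second summed over nodes. The sketch below is illustrative only; the 1 fF node capacitance, the node count, and the transition rate are made-up values, and the function names are hypothetical.

```python
def switching_energy(c_farads, vdd):
    """Energy dissipated by one logic transition: (1/2) * C * Vdd^2, in joules."""
    return 0.5 * c_farads * vdd ** 2

def dynamic_power(c_farads, vdd, transitions_per_second, num_nodes):
    """Total switching power for num_nodes identical nodes, in watts."""
    return switching_energy(c_farads, vdd) * transitions_per_second * num_nodes

C_NODE = 1e-15     # hypothetical: 1 fF per node
RATE = 1e9         # hypothetical: 1e9 transitions per second per node
NODES = 1_000_000  # hypothetical: one million switching nodes

base = dynamic_power(C_NODE, 1.0, RATE, NODES)

# (1) fewer transitions, (2) lower Vdd, (3) fewer circuits, (4) less C per node.
half_rate = dynamic_power(C_NODE, 1.0, RATE / 2, NODES)
half_vdd = dynamic_power(C_NODE, 0.5, RATE, NODES)
half_nodes = dynamic_power(C_NODE, 1.0, RATE, NODES // 2)
half_c = dynamic_power(C_NODE / 2, 1.0, RATE, NODES)

for name, p in [("rate/2", half_rate), ("Vdd/2", half_vdd),
                ("nodes/2", half_nodes), ("C/2", half_c)]:
    print(f"{name}: {p / base:.2f} of baseline power")
```

Only Vdd enters quadratically, which is why halving it cuts power to a quarter while the other three knobs give a linear saving.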
Scaling switching energy per gate ...
[Plot: switching energy per gate vs. IC process scaling (“Moore’s Law”).]
Due to reducing V and C (length and width of Cs decrease, but plate distance gets smaller).
Recent slope more shallow because V is being scaled less.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Second Factor: Leakage Currents
Even when a logic gate isn’t switching, it burns power.
[Figure: nFET with its gate at 0V.]
Isub: Even when this nFet is off, it passes an Ioff leakage current.
We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
[Chart: Intel’s 2006 processor designs, leakage vs switching power. Bill Holt, Intel, Hot Chips 17.]
A lot of work was done to get a ratio this good ... 50/50 is common.
Engineering “On” Current at 25 nm ...
We can increase Ion by raising Vdd and/or lowering Vt.
[Figure: nFET with terminals Vg, Vd, Vs and current Ids; plot of Ids with Vdd = 0.7, Vt ≈ 0.25, Ion = 1.2 mA, Ioff = 0 ???]
Plot on a “Log” Scale to See “Off” Current
We can decrease Ioff by raising Vt - but that lowers Ion.
[Figure: nFET with terminals Vg, Vd, Vs and current Ids; log-scale plot of Ids with Vdd = 0.7, Vt ≈ 0.25, Ion = 1.2 mA, Ioff ≈ 10 nA.]
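To see why a 10 nA off current matters, it helps to multiply it out. The sketch below is a rough illustration, not a device model: it treats the plotted Ion and Ioff as per-device currents, assumes every off device sees the full Vdd, and picks a hypothetical count of 100 million off transistors.

```python
VDD = 0.7       # volts, from the plot
I_ON = 1.2e-3   # amps, from the plot
I_OFF = 10e-9   # amps, approximate value from the log-scale plot

def static_power(i_off_amps, vdd_volts, n_off_devices):
    """Leakage power of n_off_devices transistors, each leaking i_off at vdd."""
    return i_off_amps * vdd_volts * n_off_devices

on_off_ratio = I_ON / I_OFF

# Hypothetical: 100 million transistors sitting in the "off" state.
N_OFF = 100_000_000
leak_w = static_power(I_OFF, VDD, N_OFF)

print(f"Ion / Ioff ratio: {on_off_ratio:,.0f}")
print(f"Leakage for {N_OFF:,} off devices: {leak_w:.2f} W")
```

Under these assumptions a current that is invisible on a linear plot still adds up to a sizable fraction of a watt, drawn whether or not any useful switching happens.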
Ioff? Ion? Recall: Timing Lecture ...
An “off” n-FET (Ioff) while bucket fills. Why open? Gnd << Vt.
An “on” n-FET (Ion) empties the bucket. Why on? Vdd >> Vt.
Ids = Current through n-FET.
[Plot: Ids with Vdd = 0.7, Vt ≈ 0.25, Ion = 1.2 mA, Ioff ≈ 10 nA.]
Device engineers trade speed and power
We can reduce CV^2 (Pactive) by lowering Vdd.
We can increase speed by raising Vdd and lowering Vt.
We can reduce leakage (Pstandby) by raising Vt.
From: Silicon Device Scaling to the Sub-10-nm Regime, Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang
Customize processes for product types ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Intel 22nm Process
Transistor channel is a raised fin.
Gate controls channel from sides and top.
Channel depth is fin width. 12-15nm for L=22nm.
[Plot: Ids vs. Vgs.]
Clock rates have flattened out, but ...
Performance: Put more transistors to work
Moore’s Law
[Plot: transistor counts, with markers at 2 Thousand, 1 Million, and 2.6 Billion.]
We still scale to get more transistors per unit area ... but we use design techniques to reduce power.
Takeaway Abstractions From Part I ...
Small circuits can go very fast in standard CMOS ...
This oscillator runs at 210 GHz in a 32 nm
SOI CMOS logic process, and consumes 42 mW ...
But if we used these techniques for a CPU, our
150 W air-cooled power limit would limit our design to
using about ten thousand transistors ...
... but we are used to using 100s of millions!
The Power Wall
Dennard scaling stopped working in 2004.
Why? MOSFET off currents became non-negligible.
We limit clock speed to prevent chip from melting.
Dynamic Power: 4 ways to reduce it ...
Every logic transition dissipates energy.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$
$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
Strong result: Independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do ...
(2) Reduce Vdd. But lowering Vdd limits the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale
Static power: We trade off speed for power.
Even when a logic gate isn’t switching, it burns power.
[Figure: nFET with its gate at 0V.]
Isub: Even when this nFet is off, it passes an Ioff leakage current.
We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
[Chart: Intel’s 2006 processor designs, leakage vs switching power. Bill Holt, Intel, Hot Chips 17.]
A lot of work was done to get a ratio this good ... 50/50 is common.
Factor of 60 in leakage vs. 3.5 in speed
Chart shows 9 different NAND gates for an IC process,
each with a different speed vs. static power tradeoff.
(40, 45, 50) are transistor channel lengths (in nm).
FO4 (Delay, Power) vs Vdd -- 65nm
FO4 is the delay of one inverter driving four additional
inverters. Power in this plot includes dynamic and static.
Six low-power design techniques
Parallelism and pipelining
Power-down idle transistors
Slow down non-critical paths
Clock gating
Data-dependent processing
Thermal management
Design Technique #1 (of 6)
Trading Hardware for Power
via Parallelism and Pipelining ...
And so, we can transform this:
Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.
Gate delay roughly linear with Vdd.
$P \sim F \times V_{dd}^2$
$P \sim 1 \times 1^2$
Into this:
Top block processes “left”, bottom “right”.
$P \sim \#blks \times F \times V_{dd}^2$
$P \sim 2 \times \frac{1}{2} \times \frac{1}{4} = \frac{1}{4}$
CV^2 power only.
This magic trick brought to you by Cory Hall ...
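The arithmetic behind the transformation above can be spelled out in a few lines. This is a sketch of the slide's proportionality only (CV^2 power, no leakage). It assumes that halving the clock frequency lets Vdd be halved too, following the “gate delay roughly linear with Vdd” note. The function name `relative_power` is hypothetical.

```python
def relative_power(num_blocks, freq, vdd):
    """Relative CV^2 power: P ~ #blks * F * Vdd^2 (baseline values are 1)."""
    return num_blocks * freq * vdd ** 2

# One block, full clock rate, full Vdd: alternates between left and right.
single_block = relative_power(num_blocks=1, freq=1.0, vdd=1.0)

# Two blocks (left and right), each at half the clock rate.
# Assumption: the relaxed timing lets Vdd drop to half.
two_blocks = relative_power(num_blocks=2, freq=0.5, vdd=0.5)

print(f"single block: {single_block}")
print(f"two blocks:   {two_blocks}")
print(f"power ratio:  {two_blocks / single_block}")
```

Throughput is unchanged because each block handles one channel at half speed; the saving comes entirely from the quadratic Vdd term.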
Chandrakasan & Brodersen (UCB, 1992)
[Figure panels: Simple, Pipelined.]
Multiple Cores for Low Power
Trade hardware for
power, on a large scale ...
Cell: The PS3 chip (2006)
Cell (PS3 Chip): 1 CPU + 8 “SPUs”
[Figure labels: PowerPC, L2 Cache 512 KB, 8 Synergistic Processing Units (SPUs).]
One Synergistic Processing Unit (SPU)
SPU issues 2 inst/cycle (in order) to 7 execution units
256 KB Local Store, 128 128-bit Registers
SPU fills Local Store using DMA to DRAM and network
A “Shmoo” plot for a Cell SPU ...
The lower Vdd, the less dynamic energy consumption.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$
$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
The lower Vdd, the longer the maximum clock period, the slower the clock frequency.
Clock speed alone doesn’t help E/op ...
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$
$E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
But, lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so chip stays cooler ...
Scaling V and f does lower energy/op
1 W to get 2.2 GHz performance. 26 C die temp.
7 W to reliably get 4.4 GHz performance. 47 C die temp.
If a program that needs a 4.4 GHz CPU can be recoded to use two 2.2 GHz CPUs ... big win.
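The “big win” can be quantified from the two operating points quoted above. The sketch below is an illustration only: it assumes the recoded program gets perfect speedup from two cores, which real programs rarely do, and the function name is hypothetical.

```python
def energy_per_cycle_nj(power_watts, freq_ghz):
    """Energy per clock cycle in nanojoules (W / GHz = nJ per cycle)."""
    return power_watts / freq_ghz

# Operating points quoted on the slide.
slow_nj = energy_per_cycle_nj(power_watts=1.0, freq_ghz=2.2)
fast_nj = energy_per_cycle_nj(power_watts=7.0, freq_ghz=4.4)
energy_ratio = fast_nj / slow_nj

# Assumption: two 2.2 GHz cores match one 4.4 GHz core (perfect parallelism).
one_fast_core_w = 7.0
two_slow_cores_w = 2 * 1.0
power_saving = one_fast_core_w / two_slow_cores_w

print(f"Energy per cycle: {slow_nj:.3f} nJ at 2.2 GHz, {fast_nj:.3f} nJ at 4.4 GHz")
print(f"4.4 GHz costs {energy_ratio:.1f}x the energy per cycle")
print(f"Two slow cores use {power_saving:.1f}x less power than one fast core")
```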
How iPod nano 2005 puts its 2 cores to use ...
Two 80 MHz CPUs. Was used in several nano generations, with one CPU doing audio decoding, the other doing
2013 Macbook Air
Haswell CPU/GPU
Voltage range: 0.655V to 1.041V ... 2.5x in CV^2 energy
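The 2.5x figure follows directly from the quadratic dependence of switching energy on Vdd. A quick check, assuming capacitance is the same at both voltages (variable names are illustrative):

```python
V_MIN = 0.655  # volts, low end of the quoted range
V_MAX = 1.041  # volts, high end of the quoted range

# Switching energy scales as C * Vdd^2; with C fixed, the ratio is (Vmax/Vmin)^2.
cv2_energy_ratio = (V_MAX / V_MIN) ** 2

print(f"CV^2 energy ratio across the voltage range: {cv2_energy_ratio:.2f}x")
```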
Design Technique #2 (of 6)
Powering down idle circuits
Add “sleep” transistors to logic ...
Example: Floating point unit
logic.
When running fixed-point
instructions, put logic “to sleep”.
+++ When “asleep”, leakage
power is dramatically reduced.
--- Presence of sleep transistors
slows down the clock rate when
the logic block is in use.
Intel example: Sleeping cache blocks
A tiny current supplied in “sleep” maintains SRAM state.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Intel Medfield
Switches 45
power “islands.”
Fine-grained
control of
leakage power,
to track user
activity.
“Race to idle”
strategy -- finish
tasks quickly, to
get to power
down.
Playing a game ...
Watching a video ...
Looking at phone screen, not doing anything ...
Phone in your pocket, waiting for a call ...
Design Technique #3 (of 6)
Slow down “slack paths”
Fact: Most logic on a chip is “too fast”
The critical path
Most wires have hundreds
of picoseconds to spare.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J
Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
Use several supply voltages on a chip ...
Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.
What if we can’t do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the slow logic by using high-Vt transistors.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Logical partition into
0.8V and 1.0V nets done
manually to meet 350
MHz spec (90nm).
Level-shifter insertion
and placement done
automatically.
Dynamic power in 0.8V
section cut 50% below
baseline.
Leakage power in 1.0V
section cut 70% below
baseline.
From a chapter from new book on ASIC design by Chinnery and Keutzer (UCB).
Design Technique #4 (of 6)
Gating clocks to save power
On a CPU, where does the power go?
Half of the power goes to latches (Flip-Flops).
Most of the time,
the latches don’t
change state.
So (gasp) gated clocks are a big win.
But, done with CAD tools in a disciplined way.
From: Bose, Martonosi, Brooks: Sigmetrics-2001 Tutorial
Synopsys Design Compiler can do this ...
“Up to 70% power savings at the block level, for applicable circuits” (Synopsys Data Sheet)
Power Compiler also can do this ...
10-20%
push-button
power
savings,
using
techniques
like this one.
Design Technique #5 (of 6)
Data-Dependent Processing
Example: Video Decode Transform
Most of the time,
the inputs flip
between small
positive and
negative integers.
In 2’s complement, this wastes power because nearly every bit flips:
+1: 0b00001
-1: 0b11111
Solution: Add bias value to all inputs
30+% power reduction for a bias of 64. For this linear
transform, correcting the output for the bias is trivial.
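The effect of the bias can be seen by counting how many bits toggle when an input flips between +1 and -1. The sketch below is an illustration, not the decoder from the slide: the 16-bit word width is a hypothetical choice, the helper names are made up, and bit toggles are used only as a rough proxy for switching power.

```python
def to_twos_complement(value, bits):
    """Bit pattern of a signed integer in a bits-wide two's complement word."""
    return value & ((1 << bits) - 1)

def toggles(a, b, bits):
    """Number of bit positions that differ between a and b (Hamming distance)."""
    diff = to_twos_complement(a, bits) ^ to_twos_complement(b, bits)
    return bin(diff).count("1")

BITS = 16  # hypothetical word width
BIAS = 64  # bias value quoted on the slide

# Without bias: +1 is 0x0001 and -1 is 0xFFFF, so almost every bit flips.
plain_toggles = toggles(+1, -1, BITS)

# With bias: 65 and 63 differ only in the low-order bits.
biased_toggles = toggles(+1 + BIAS, -1 + BIAS, BITS)

print(f"toggles without bias: {plain_toggles}")
print(f"toggles with bias 64: {biased_toggles}")
```

With the bias applied, the upper bits of the word stay fixed as long as inputs remain small, so the saving grows with word width.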
Design Technique #6 (of 6)
Thermal Management
Keep chip cool to minimize leakage power
A recipe for thermal runaway
IBM Power 4: How does die heat up?
4 dies on a
multi-chip
module
2 CPUs
per die
115 Watts: Concentrated in “hot spots”
[Figure labels: Hot spots, Fixed point units, Cache logic.]
66.8 C == 152 F
82 C == 179.6 F
Idea: Monitor temperature, servo clock speed
TDP = Thermal Design Point
Repeatedly running the same
benchmark on three Apple products.
iPad Air, iPad
Mini retina, and
iPhone 5S all
use the A7
The TDP of each form factor dictates
how long it can run at “top speed”
On Thursday
Time to market via chip verification.
Have fun in section !