Transcript ppt

CS 15-447: Computer Architecture
Lecture 27
Power Aware Architecture Design
November 24, 2007
Nael Abu-Ghazaleh
[email protected]
http://www.qatar.cmu.edu/~msakr/15447-f08
15-447 Computer Architecture
Fall 2008 ©
Uniprocessor Performance (SPECint)
[Figure: performance (vs. VAX-11/780) on a log scale, 1978-2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Growth slows from 52%/year to ??%/year after 2002, roughly a 3X shortfall versus the old trend.]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
Sea change in chip design: what is emerging?
Three walls
1. ILP Wall:
• Not enough parallelism available in one thread
• Very costly to find more
• Implications: can't continue to grow IPC
• VLIW? SIMD ISA extensions?
2. Memory Wall:
• Growing gap between DRAM and processor speed
• Caching helps, but only so much
• Implications: cache misses are getting more expensive
• Multithreaded processors?
3. Physics/Power Wall:
• Can't continue to shrink devices; running into physical limits
• Power dissipation is also increasing (more today)
• Implications: can't rely on performance boost from shrinking transistors
• But we will continue to get more transistors
Multithreaded Processors
• What support is needed?
• I can use it to help ILP as well
– Which designs help ILP in the picture to the right?
Power-Efficient Processor Design
Goals:
1. Understand why energy efficiency is important
2. Learn the sources of energy dissipation
3. Survey a selection of approaches to reduce energy
Why Worry About Power?
• Embedded systems:
– Battery life
• High-end processors:
– Cooling (costs about $1 per chip per Watt when operating above 40W)
– Power cost: 15 cents per kilowatt-hour (kWh)
• A single 900-Watt server costs about 100 USD/month to run, not including cooling costs!
– Packaging
– Reliability
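The 100 USD/month claim above follows directly from the slide's rate of 15 cents/kWh; a minimal sketch of that arithmetic (the 900 W draw and the rate are the slide's numbers, the 30-day month is an assumption):

```python
# Monthly electricity cost of a server at a flat utility rate.
def monthly_cost_usd(power_watts, usd_per_kwh=0.15, hours=24 * 30):
    """Energy used over one month (kWh) times the rate."""
    kwh = power_watts / 1000 * hours
    return kwh * usd_per_kwh

print(round(monthly_cost_usd(900)))  # 97, i.e. roughly the 100 USD on the slide
```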
Why Worry About Power? Oak Ridge National Lab's Jaguar
• Current highest-performance supercomputer
– 1.3 sustained petaflops (quadrillion FP operations per second)
– 45,000 processors, each a quad-core AMD Opteron
• 180,000 cores!
– 362 Terabytes of memory; 10 petabytes disk space
– Check top500.org for a list of the most powerful
supercomputers
• Power consumption? (without cooling)
– 7 Megawatts!
– 0.75 million USD/month to power
– There is a green500.org that rates computers based on
flops/Watt
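The same cost arithmetic applied to Jaguar's numbers, plus the green500-style flops/Watt metric (7 MW and 1.3 petaflops are the slide's figures; the $0.15/kWh rate and 30-day month are assumptions carried over from the previous slide):

```python
# Jaguar, per the slide: 7 MW sustained draw, 1.3 sustained petaflops.
power_w = 7e6
flops = 1.3e15

kwh_per_month = power_w / 1000 * 24 * 30      # 5,040,000 kWh
cost_per_month = round(kwh_per_month * 0.15)  # at 15 cents/kWh
print(cost_per_month)        # 756000 USD, i.e. ~0.75 million/month
print(flops / power_w)       # ~1.86e8 flops/Watt (about 186 Mflops/W)
```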
Peak Power in Today's CPUs
• Alpha 21264: 95W
• AMD Athlon XP: 67W
• HP PA-8700: 75W
• IBM Power 4: 135W
• Intel Itanium: 130W
• Intel Xeon: 59W
Even worse when we consider power density (Watt/cm2)
Where is This Power Coming From?
• Sources of power consumption in CMOS:
– Dynamic or active power (due to the switching of
transistors)
– Short-circuit power
– Leakage power
• High temperature increases power consumption
– Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…
Power Consumption in CMOS
– Dynamic Power Consumption
• Charging and discharging capacitors
[Figure: two CMOS inverters between Vdd and ground. When the input falls (1 -> 0), the output load capacitance C charges to Vdd; when the input rises (0 -> 1), C discharges. Each transition dissipates E = C*V^2.]
P = E*f = C*V^2*f
Dynamic Power Consumption
Power = α * C * V^2 * f
• α (activity factor): how often do wires switch
• C (capacitance): function of wire length, transistor size
• V (supply voltage): has been dropping with successive process generations
• f (clock frequency): increasing
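The formula above can be sketched directly; the parameter values below are illustrative, not measurements of any real chip, and the key point is the quadratic effect of voltage:

```python
# Dynamic (switching) power: P = alpha * C * V^2 * f.
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Power dissipated charging/discharging switched capacitance."""
    return alpha * c_farads * v_volts**2 * f_hz

# Example: 100 nF total switched capacitance, 2 GHz, 10% activity factor.
p = dynamic_power(0.1, 100e-9, 1.2, 2e9)
p_half = dynamic_power(0.1, 100e-9, 0.6, 2e9)
print(round(p, 1))       # 28.8 (W) at 1.2 V
print(round(p_half, 1))  # 7.2 (W) at 0.6 V: halving V quarters the power
```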
Power Consumption in CMOS
– Short-circuit power
• While the input is in transition, both the PMOS and NMOS are briefly conducting, so a short-circuit current Isc flows from Vdd to ground
[Figure: CMOS inverter with the input near Vdd/2; both transistors partially on, current Isc flowing directly from Vdd to ground.]
About 2% of the overall power.
Power Consumption in CMOS
– Leakage power: transistors are not perfect switches, and they leak
[Figure: CMOS inverter with input 0 and output 1; a subthreshold leakage current Isub flows through the nominally-off NMOS transistor.]
20% now, expected to reach 40% in the next technology generation, and growing
Cooling
• All of the consumed power has to be dissipated
• Done by means of heat pipes, heat sinks, fans, etc.
• Different segments use different cooling mechanisms.
• Costs $1-$3 or more per chip per Watt when operating above 40W
• We may soon need budgets for liquid-cooling or refrigeration hardware.
Dynamic Power Consumption
Power = α * C * V^2 * f
• α (activity factor): how often do wires switch
• C (capacitance): function of wire length, transistor size
• V (supply voltage): has been dropping with successive process generations
• f (clock frequency): increasing
Voltage Scaling
• Transistors switch slower at lower supply voltage, so the threshold voltage must also be lowered to keep speed
• Leakage current grows exponentially with decreases in threshold voltage
• Leakage power goes through the roof
Technology Scaling: the Enabler
• New process generation every 2-3 years
• Ideal shrink for 30% reduction in size:
– Voltage scales down by 30%
– Gate delays are shortened by 30%
-> ~50% frequency gain (500ps cycle = 2GHz clock, 333ps cycle = 3GHz clock)
– Transistor density increases by 2X
• 0.7X shrink on a side, 2X area reduction
– Capacitance/transistor reduced by 30%
Ideal Process Shrink: the Results
• 2/3 reduction in energy/transition (C*V^2 -> 0.7 x 0.7^2 = 0.34X)
• 1/2 reduction in power (C*V^2*f -> 0.7 x 0.7^2 x 1.5 = 0.5X)
• But twice as many transistors, or more if area increases
• Power density unchanged
Looks good!
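The ideal-shrink arithmetic from the two slides above can be checked in a few lines (the 0.7 shrink factor and 1.5 frequency gain are the slide's assumptions):

```python
# Ideal process shrink: C and V scale by 0.7, f by ~1.5, density by 2X.
shrink = 0.7
energy_scale = shrink * shrink**2        # C*V^2 per transition
power_scale = energy_scale * 1.5         # times the frequency gain
density_scale = 1 / shrink**2            # 0.7X per side -> ~2X density

print(round(energy_scale, 2))                  # 0.34: the 2/3 reduction
print(round(power_scale, 2))                   # 0.51: the ~1/2 reduction
print(round(power_scale * density_scale, 2))   # ~1.05: power density roughly unchanged
```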
Process Technology – the Reality*
• Performance does not scale w/ frequency
– New designs increase frequency by 2X
– New designs use 2X-3X more transistors to get 1.4X-1.8X
performance*
• So, every new process generation:
– Power goes up by about 2X (3X transistors * 2X switches
* 1/3 energy)
– Leakage power is also increasing
– Power density goes up 30%~80% (2X power / 1.X area)
• Will get worse in future technologies, because voltage will scale down less
*Source: “Power – the Next Frontier: a Microarchitecture Perspective”,
Ronny Ronen, Keynote speech at PACS’02 Workshop.
Ugly Numbers*

                 i486 (0.8)    Pentium 4 (0.18)    Factor
Transistors      1.2M          42M                 35x
Frequency        50 MHz        2000 MHz            40x
Voltage          5V            1.65V               1/3x
Peak Power       5W            100W                20x
Die size         0.73 cm2      2.17 cm2            3x
Power density    6.8 W/cm2     46 W/cm2            7x
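The power-density row of the table is just peak power over die area; a quick check that the 7x factor is consistent with the other rows:

```python
# (peak Watts, die area in cm^2) for each chip, from the table above.
i486 = (5, 0.73)
p4 = (100, 2.17)

d_i486 = i486[0] / i486[1]   # ~6.8 W/cm^2
d_p4 = p4[0] / p4[1]         # ~46 W/cm^2
print(round(d_i486, 1), round(d_p4, 1), round(d_p4 / d_i486))  # 6.8 46.1 7
```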
The Bottom Line
• Circuits and process scaling alone can no longer
solve all power problems
• SYSTEMS must also be power-aware
– OS
– Compilers
– Architecture
• Techniques at the architectural level are needed to
reduce the absolute power dissipation as well as
the power density
Microarchitectural Techniques for Power Reduction
A Superscalar Datapath
Performance = N * f * IPC
[Figure: superscalar datapath. An in-order front end (fetch stages F1, F2 and decode/dispatch stages D1, D2) dispatches instructions into the issue queue (IQ) and load/store queue (LSQ). Instructions issue to function units FU1..FUm and the D-cache in the EX stage, and broadcast on the result/status forwarding buses; the reorder buffer (ROB) and architectural register file (ARF) handle retirement.]
Actually, it's the whole system, but we focus on the processor
Microarchitectural Techniques: General Approach
• Dynamic power:
– Reduce the activity factor
– Reduce the switching capacitance (usually not possible)
– Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600MHz, with the voltage dropped from 1.48V to 0.95V)
• Leakage power:
– Put some portions of the on-chip storage structures into a low-power stand-by mode, or even completely shut off the power supply to these partitions
– Resizing
• We usually give up some performance to save energy, but how much?
Guideline
• If we reduce voltage, we get a linear drop in maximum frequency (and performance)
• "The cube law": P = k*V^3 (~1% V = ~3% P)
– Dynamic power is proportional to V^2 * f, and f scales roughly linearly with V, hence the cube
– If we use voltage scaling, we can approximately trade 1% of performance loss for 3% of power reduction
• Any architectural technique that trades performance for power should do better than that (or at least as well). Otherwise, simple voltage scaling can be used to achieve better tradeoffs.
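A quick sanity check of the cube-law guideline (the normalization is illustrative; the point is the ~3:1 power-to-performance ratio for small voltage changes):

```python
# Cube law: P = k * V^3, normalized so V = 1 gives P = 1.
# Frequency (and thus performance) scales roughly linearly with V.
def relative_power(v_scale):
    return v_scale**3

perf_loss = 0.01                       # scale V (and f) down by 1%
saved = 1 - relative_power(1 - perf_loss)
print(round(saved, 3))  # 0.03: about 3% power saved per 1% performance lost
```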
Examples: Front-End Throttling
• Speculation is used to increase performance
• Wasted energy if it is wrong
• Can we speculate only when we think we'll be right?
• Gating: temporarily prevent new instructions from entering the pipeline
• Use gating to avoid speculating beyond branches with low prediction accuracy
– The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long
– Reported: 38% energy savings in wrong-path instructions with about 1% IPC loss
Front-End Throttling (continued)
• Just-in-Time Instruction Delivery
– The fetch stage is throttled based on the number of in-flight instructions
– If the number of in-flight instructions exceeds a predetermined threshold, fetch is throttled
– The threshold is adjusted through a "tuning cycle"
– Reasons for energy savings:
• Fewer instructions are processed along the mispredicted path
• Instructions spend fewer cycles in the issue queue
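A hypothetical sketch of that throttling decision (the class, its names, and the threshold value are invented for illustration; a real implementation lives in fetch-stage hardware, not software):

```python
# Gate fetch whenever the count of in-flight (fetched but not yet
# retired) instructions exceeds a tunable threshold.
class FetchThrottler:
    def __init__(self, threshold):
        self.threshold = threshold
        self.in_flight = 0

    def may_fetch(self):
        return self.in_flight < self.threshold

    def fetched(self, n):          # n instructions enter the pipeline
        self.in_flight += n

    def retired(self, n):          # n instructions leave the pipeline
        self.in_flight -= n

t = FetchThrottler(threshold=32)
t.fetched(32)
print(t.may_fetch())  # False: pipeline full, fetch is gated
t.retired(4)
print(t.may_fetch())  # True: room again, fetch resumes
```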
Energy Reduction in the Register Files
• General solutions:
– Use multi-banked RFs: each bank has fewer entries and fewer ports than the monolithic RF
• Problems: possible bank conflicts -> IPC loss; overhead of the port arbitration logic
– Use smaller cache-like structures to exploit access locality
Energy Reduction in the Register Files
• Value Aging Buffer (VAB)
– At writeback time, results are written into a FIFO-style cache called the VAB
– The RF is updated only when values are evicted from the VAB
– In many situations, even that can be avoided, because a register may be deallocated during its residency in the VAB
– If a register is read from the VAB, there is no need to access the RF
– Some performance loss due to the sequential access to the VAB and then the RF
Isolation of short-lived operands
Out-of-Order Execution and In-Order Retirement
[Figure: an in-order front end (F, D, R) feeds the instruction queue of the out-of-order core; instructions execute (Ex) and then retire in order through the ROB into the ARF.]
Register Renaming
• Used to cope with false data dependencies.
• A new physical register is allocated for EVERY
new result
• P6 style: ROB slots serve as physical registers
Original code:          Renamed code:
LOAD R1, R2, 100        LOAD P31, P2, 100
SUB  R5, R1, R3         SUB  P32, P31, P3
ADD  R1, R5, R4         ADD  P33, P32, P4
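The renaming step can be sketched in a few lines. This is an illustrative model, not the P6 hardware: it allocates a fresh physical register for every destination and reads sources through the current mapping, leaving never-mapped sources under their architectural names (as in the RAT walkthrough that follows):

```python
# Minimal register-renaming sketch: a dict plays the role of the RAT.
def rename(code, first_phys=31):
    rat = {}                   # logical reg -> current physical reg
    next_phys = first_phys
    out = []
    for op, dst, *srcs in code:
        srcs = [rat.get(s, s) for s in srcs]   # read through the mapping
        phys = f"P{next_phys}"                 # fresh reg for EVERY result
        next_phys += 1
        rat[dst] = phys
        out.append((op, phys, *srcs))
    return out

code = [("LOAD", "R1", "R2", "100"),
        ("SUB",  "R5", "R1", "R3"),
        ("ADD",  "R1", "R5", "R4")]
for inst in rename(code):
    print(inst)
# ('LOAD', 'P31', 'R2', '100')
# ('SUB', 'P32', 'P31', 'R3')
# ('ADD', 'P33', 'P32', 'R4')
```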
Register Renaming: the Implementation
– The Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg    Phys. Reg    Location (0-ROB, 1-ARF)
0            0            1
1            1            1
2            2            1
3            3            1
4            4            1
5            5            1

Original code:
LOAD R1, R2, 100
SUB  R5, R1, R3
ADD  R1, R5, R4
Register Renaming: the Implementation
– The Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg    Phys. Reg    Location (0-ROB, 1-ARF)
0            0            1
1            31           0
2            2            1
3            3            1
4            4            1
5            5            1

Original code:            Renamed code:
LOAD R1, R2, 100          LOAD P31, R2, 100
SUB  R5, R1, R3
ADD  R1, R5, R4
Register Renaming: the Implementation
– The Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg    Phys. Reg    Location (0-ROB, 1-ARF)
0            0            1
1            31           0
2            2            1
3            3            1
4            4            1
5            32           0

Original code:            Renamed code:
LOAD R1, R2, 100          LOAD P31, R2, 100
SUB  R5, R1, R3           SUB  P32, P31, R3
ADD  R1, R5, R4
Register Renaming: the Implementation
– The Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg    Phys. Reg    Location (0-ROB, 1-ARF)
0            0            1
1            33           0
2            2            1
3            3            1
4            4            1
5            32           0

Original code:            Renamed code:
LOAD R1, R2, 100          LOAD P31, R2, 100
SUB  R5, R1, R3           SUB  P32, P31, R3
ADD  R1, R5, R4           ADD  P33, P32, R4
Short-Lived Values
• Definition: a value is short-lived if its destination register is renamed again by the time of the result's generation
• Identified one cycle before the result's writeback
• A large percentage of all generated results are short-lived for SPEC 2000 benchmarks

RENAMER:
LOAD R1, R2, 100   ->   LOAD P31, R2, 100
SUB  R5, R1, R3    ->   SUB  P32, P31, R3
ADD  R1, R5, R4    ->   ADD  P33, P32, R4

(Here the LOAD's value of R1, living in P31, is short-lived if the ADD renames R1 to P33 before the LOAD's result is generated.)
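An illustrative sketch of the detection idea. This simplification only checks whether a destination is redefined later in the instruction stream; the real hardware condition also requires the rename to happen before the result is generated, which depends on timing not modeled here:

```python
# Return the indices of instructions whose destination register is
# redefined by a later instruction (candidates for short-lived values).
def redefined_producers(code):
    last_def = {}          # logical reg -> index of its latest definition
    result = []
    for i, (op, dst, *_) in enumerate(code):
        if dst in last_def:
            result.append(last_def[dst])   # earlier producer's value dies
        last_def[dst] = i
    return result

code = [("LOAD", "R1"), ("SUB", "R5"), ("ADD", "R1")]
print(redefined_producers(code))  # [0]: the LOAD's R1 value is a candidate
```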
Percentage of Short-Lived Values
[Figure: bar chart for a 96-entry ROB, 4-way processor. Percentage of short-lived values (0-100%) per SPEC 2000 benchmark: applu, apsi, art, equake, mesa, mgrid, swim, wupwise (FP) and vpr, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, gap, bzip2 (integer), with FP and integer averages.]
Why Keep Them?
• Reasons for maintaining short-lived values:
– Recovering from branch mispredictions
– Reconstructing precise state if interrupts or exceptions occur

LOAD R1, R2, 100   ->   LOAD P31, R2, 100
SUB  R5, R1, R3    ->   SUB  P32, P31, R3
ADD  R1, R5, R4    ->   ADD  P33, P32, R4
Energy-Dissipating Events
[Figure: the same pipeline (in-order front end F, D, R; instruction queue; Ex; ROB; ARF), with the register-file events marked: writes into the ROB at writeback, and the ROB read plus ARF write at in-order retirement.]
Isolating Short-Lived Values: the Idea
• Write short-lived values into a small dedicated RF (SRF)
[Figure: the same pipeline with an SRF added alongside the ARF; results identified as short-lived, such as the LOAD's R1 in the running example, are written into the SRF instead.]
Energy Reduction in Caches
• Dynamically resizable caches
– Dynamically estimate the program's requirements and adapt the cache size accordingly
– The cache is upsized or downsized at the end of periodic intervals, based on the value of a cache miss counter
– Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors
– A bit mask specifies the number of address bits used for indexing into the sets
– The cache size always changes by a factor of two
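A hypothetical sketch of the interval-based resize decision (the threshold values and set counts are invented for illustration; the factor-of-two rule is from the slide):

```python
# At each interval boundary, compare the miss count against thresholds
# and halve or double the number of active sets (always a power of two).
def resize(active_sets, misses, min_sets=64, max_sets=1024,
           upsize_at=1000, downsize_at=100):
    if misses > upsize_at and active_sets < max_sets:
        return active_sets * 2      # too many misses: grow the cache
    if misses < downsize_at and active_sets > min_sets:
        return active_sets // 2     # working set fits: shed leaky sets
    return active_sets              # within the band: no change

print(resize(256, misses=5000))  # 512
print(resize(256, misses=10))    # 128
print(resize(256, misses=500))   # 256
```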
Energy Reduction within the Execution Units
• Gating off portions of the execution units
– Disables the upper bits of the ALUs when they are not needed (for small operands)
– Energy can be reduced by 54% for integer programs
• Packing multiple narrow-width operations into a single ALU in the same cycle
• Steering instructions to FUs based on criticality information
– Critical instructions are steered to fast, power-hungry execution units; non-critical instructions are steered to slow, power-efficient units
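The narrow-operand check behind the first technique can be illustrated with a small predicate (the 16-bit cutoff and 64-bit ALU width are assumptions for the example):

```python
# If both operands fit in the low bits, the upper portion of the ALU
# could be gated off for this operation.
def is_narrow(a, b, bits=16):
    limit = 1 << bits
    return -limit <= a < limit and -limit <= b < limit

print(is_narrow(1000, -3))    # True: upper ALU bits could be gated
print(is_narrow(1 << 40, 2))  # False: full-width operation needed
```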
Encoding Addresses for Low Power
• Use Gray code for addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994)
– Exploits the observation that programs often generate consecutive addresses
– Gray code: there is only a single transition on the address bus when consecutive addresses are accessed
– A 37% reduction in switching activity is reported
– A Gray code encoder is placed at the transmitting end of the bus, and a decoder at the receiving end
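The standard binary-reflected Gray code makes the single-transition property easy to see; a minimal encoder/decoder pair:

```python
# Binary-reflected Gray code: consecutive integers differ in one bit.
def to_gray(n):                 # encoder at the transmitting end
    return n ^ (n >> 1)

def from_gray(g):               # decoder at the receiving end
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

for a in range(4):
    flips = bin(to_gray(a) ^ to_gray(a + 1)).count("1")
    print(a, "->", a + 1, ":", flips, "bus line toggles")   # always 1
```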
Encoding Data for Low Power
• Bus-invert encoding
– Uses redundancy to reduce the number of transitions
– Adds one line to the bus to indicate whether the actual data or its complement is transmitted
– If the Hamming distance between the current value and the previous one is at most n/2 (for n bits), the value is transmitted as-is and a 0 is sent on the extra line
– Otherwise, the complement of the value is transmitted and the extra line is set to 1
– The average number of bus transitions per clock cycle is lowered by 25% as a result
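A sketch of bus-invert encoding for an 8-bit bus (the bus width is an assumption; the rule is exactly the one described above):

```python
# Send the complement (with invert line = 1) whenever more than half
# the data lines would otherwise toggle relative to the previous value.
N = 8
MASK = (1 << N) - 1

def encode(prev_wire, value):
    """Return (wire_value, invert_bit) minimizing data-line transitions."""
    flips = bin(prev_wire ^ value).count("1")
    if flips > N // 2:
        return (~value) & MASK, 1   # complement flips fewer lines
    return value, 0

def decode(wire_value, invert_bit):
    return (~wire_value) & MASK if invert_bit else wire_value

prev = 0b00000000
wire, inv = encode(prev, 0b11111110)   # 7 of 8 bits would flip
print(inv, bin(wire))                  # 1 0b1: only one data line toggles
print(decode(wire, inv) == 0b11111110) # True: receiver recovers the value
```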
OS and Compiler Techniques
• Can compiler help?
• Can OS help?
– E.g., control voltage scaling
– Control turning off devices