Power Management in High
Performance Processors through
Dynamic Resource Adaptation and
Multiple Sleep Mode Assignments
Houman Homayoun
National Science Foundation Computing Innovation Fellow
Department of Computer Science
University of California San Diego
Outline – Multiple Sleep Mode





Brief overview of state-of-the-art superscalar processors
Introducing the idea of multiple sleep mode design
Architectural control of multiple sleep modes
Results
Conclusions
Copyright © 2010 Houman Homayoun
University of California San Diego
Superscalar Architecture
[Figure: superscalar pipeline block diagram with stages Fetch, Decode, Rename, Dispatch, Issue, Execute (functional units, F.U.), and Write-Back, and structures ROB, Reservation Station, Instruction Queue, Load Store Queue, Physical Register File, and Logical Register File]
On-chip SRAMs+CAMs and Power

On-chip SRAMs+CAMs in high-performance processors are large: more than 60% of chip budget
Branch Predictor
Reorder Buffer
Instruction Queue
Instruction/Data TLB
Load and Store Queue
L1 Data Cache
L1 Instruction Cache
L2 Cache
They dissipate a significant portion of power via leakage
[Figure: Pentium M processor die photo, courtesy of intel.com]
Techniques to Address Leakage in SRAM+CAM
Circuit:
Gated-Vdd, Gated-Vss
Voltage Scaling (DVFS)
ABB-MTCMOS
Forward Body Biasing (FBB), RBB
Sleepy Stack
Sleepy Keeper
Architecture:
Way Prediction, Way Caching, Phased Access: predict or cache recently accessed ways, read tag first
Drowsy Cache: keeps cache lines in a low-power state, with data retention
Cache Decay: evict lines not used for a while, then power them down
Applying DVS, Gated Vdd, Gated Vss to memory cells: much architectural support exists for doing that
Sleep Transistor Stacking Effect

Subthreshold current: inverse exponential function of threshold voltage
$V_T = V_{T0} + \gamma\left(\sqrt{\left|(-2)\phi_F + V_{SB}\right|} - \sqrt{\left|2\phi_F\right|}\right)$
Stacking transistor N with slpN:
The source-to-body voltage (VM) of transistor N increases, which reduces its subthreshold leakage current when both transistors are off
Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability
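As a rough illustration of the stacking effect, the sketch below evaluates the body-effect threshold voltage for a raised source-to-body voltage and the resulting change in subthreshold leakage. It models only the threshold-voltage shift, and every numeric parameter (V_T0, gamma, phi_F, the slope factor n) is an illustrative assumption, not a value from these slides.

```python
import math

def threshold_voltage(v_sb, v_t0=0.3, gamma=0.4, phi_f=-0.3):
    """Body-effect threshold voltage in volts.

    V_T = V_T0 + gamma * (sqrt(|-2*phi_F + V_SB|) - sqrt(|2*phi_F|))
    All default parameter values are assumed for illustration only.
    """
    return v_t0 + gamma * (
        math.sqrt(abs(-2 * phi_f + v_sb)) - math.sqrt(abs(2 * phi_f))
    )

def relative_leakage(v_sb, n=1.5, v_thermal=0.026):
    """Subthreshold leakage relative to V_SB = 0.

    Uses the proportionality I_sub ~ exp(-V_T / (n * kT/q)); n and the
    thermal voltage are assumed values.
    """
    delta_vt = threshold_voltage(v_sb) - threshold_voltage(0.0)
    return math.exp(-delta_vt / (n * v_thermal))

if __name__ == "__main__":
    for v_m in (0.0, 0.1, 0.2, 0.3):
        print(f"VM = {v_m:.1f} V -> V_T = {threshold_voltage(v_m):.3f} V, "
              f"relative leakage = {relative_leakage(v_m):.3f}")
```

A higher virtual-ground voltage VM raises V_T and so lowers the leakage of transistor N; in a real stack the negative gate-to-source voltage and the reduced drain-to-source voltage help as well, which this sketch leaves out.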
[Figure: transistor N stacked on footer sleep transistor slpN between vdd and vss; labeled nodes: VC, Vgn, CL, VM, Vgslpn]
Wakeup Latency

To benefit the most from the leakage savings of stacking sleep transistors:
keep the bias voltage of the NMOS sleep transistor as low as possible (and for PMOS as high as possible)
Drawback: impact on the wakeup latency of the circuit (sleep transistor wakeup delay + sleep signal propagation delay)
Control the gate voltage of the sleep transistors: trade a reduction in the leakage power savings for a reduction in the circuit wakeup delay overhead
Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (VM)
Wakeup Delay vs. Leakage Power Reduction
[Figure: normalized leakage power and normalized wake-up delay versus (footer, header) gate bias voltage pair, for the pairs (0, 1), (0.05, 0.96), (0.1, 0.93), (0.15, 0.89), (0.20, 0.85), (0.25, 0.80), (0.30, 0.75); annotation: trade-off between the wakeup overhead and leakage power saving]
Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead
Multiple Sleep Modes Specifications
On-chip SRAM multiple sleep mode normalized leakage power savings

Mode      BPRED  FRF   IRF   IL1   DL1   L2    DTLB  ITLB
basic-lp  0.29   0.21  0.21  --    --    --    0.25  0.25
lp        0.43   0.31  0.31  0.37  0.37  --    0.34  0.34
aggr-lp   0.55   0.58  0.58  0.48  0.48  0.44  0.49  0.49
ultra-lp  0.67   0.65  0.65  0.69  0.64  0.63  0.57  0.57
Wakeup delay varies from 1 to more than 10 processor cycles (2.2 GHz).
Large wakeup power overhead for large SRAMs.

Need to find periods of infrequent access
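One way to read the table is as a menu: pick the deepest mode whose wakeup delay can still be hidden. The sketch below does this for the branch predictor, using the BPRED savings from the table. The per-mode wakeup delays are hypothetical placeholders, since the slides only give the overall range (1 to more than 10 cycles).

```python
# Normalized leakage power savings for the branch predictor (BPRED column).
BPRED_SAVINGS = {"basic-lp": 0.29, "lp": 0.43, "aggr-lp": 0.55, "ultra-lp": 0.67}

# Hypothetical per-mode wakeup delays in cycles (assumed, not from the slides).
ASSUMED_WAKEUP_CYCLES = {"basic-lp": 1, "lp": 2, "aggr-lp": 5, "ultra-lp": 12}

def pick_mode(tolerable_wakeup_cycles,
              savings=BPRED_SAVINGS,
              wakeup=ASSUMED_WAKEUP_CYCLES):
    """Return the mode with the largest saving whose wakeup delay fits.

    Returns None when no sleep mode can wake up within the tolerated delay.
    """
    candidates = [m for m in savings if wakeup[m] <= tolerable_wakeup_cycles]
    if not candidates:
        return None
    return max(candidates, key=lambda m: savings[m])

if __name__ == "__main__":
    for budget in (0, 1, 6, 20):
        print(budget, "cycles ->", pick_mode(budget))
```

Periods of infrequent access are exactly the periods in which a larger wakeup budget is tolerable, so they are when the deeper modes become usable.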
Reducing Leakage in SRAM Peripherals

Maximize the leakage reduction: put SRAM into ultra low power mode
adds a few cycles to the SRAM access latency, which significantly reduces performance
Minimize performance degradation: put SRAM into the basic low power mode
requires near-zero wakeup overhead, but gives no noticeable leakage power reduction
Motivation for Dynamically Controlling Sleep Mode

Dynamically adjust the sleep power mode
Large leakage reduction benefit: ultra and aggressive low power modes
Low performance impact benefit: basic-lp mode
Periods of frequent access: basic-lp mode
Periods of infrequent access: ultra and aggressive low power modes
Architectural Motivations

A load miss in L1/L2 caches takes a long time to service and prevents dependent instructions from being issued
When dependent instructions cannot issue, performance is lost
At the same time, energy is lost as well!
This is an opportunity to save energy
Multiple Sleep Mode Control Mechanism
[Figure: state machine over the four modes basic-lp, lp, aggr-lp, and ultra-lp; transition labels: 3 pending DL1 misses, all pending DL1 misses serviced, L2 miss, L2 miss serviced/flushed, processor stall, processor continue; state annotations: pending DL1 misses, pending L2 miss/es]
General state machine to control power mode transitions
An L2 cache miss or multiple DL1 misses triggers a power mode transition.
The general algorithm may not deliver optimal results for all units.
The algorithm is therefore modified for individual on-chip SRAM-based units to maximize the leakage reduction at NO performance cost.
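The general controller can be sketched as a small table-driven state machine. The four modes and the event names come from the slide, but the exact edges are not recoverable from the transcript, so the transition table below is an assumed reading (in particular, where "processor continue" and "L2 miss serviced/flushed" lead is a guess).

```python
BASIC_LP, LP, AGGR_LP, ULTRA_LP = "basic-lp", "lp", "aggr-lp", "ultra-lp"

# Assumed transition table: (current mode, event) -> next mode.
TRANSITIONS = {
    (BASIC_LP, "three_pending_dl1_misses"): LP,
    (LP, "all_pending_dl1_misses_serviced"): BASIC_LP,
    (BASIC_LP, "l2_miss"): AGGR_LP,
    (LP, "l2_miss"): AGGR_LP,
    (AGGR_LP, "processor_stall"): ULTRA_LP,
    (ULTRA_LP, "processor_continue"): AGGR_LP,
    (AGGR_LP, "l2_miss_serviced_or_flushed"): BASIC_LP,
    (ULTRA_LP, "l2_miss_serviced_or_flushed"): BASIC_LP,
}

def step(mode, event):
    """Return the next mode; events with no entry leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)

def run(events, mode=BASIC_LP):
    """Apply a sequence of events and return the list of visited modes."""
    trace = [mode]
    for event in events:
        mode = step(mode, event)
        trace.append(mode)
    return trace

if __name__ == "__main__":
    print(run(["three_pending_dl1_misses", "l2_miss",
               "processor_stall", "l2_miss_serviced_or_flushed"]))
```

Keeping the edges in a dictionary makes per-unit variants cheap: a unit-specific controller would only swap in a different table.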
Branch Predictor
Instructions per branch (IPB)

Benchmark  IPB      Benchmark  IPB     Benchmark  IPB      Benchmark  IPB
ammp       4.5      equake     4.21    mcf        3.9      twolf      7.6
applu      324.1    facerec    20.0    mesa       11.0     vortex     5.7
apsi       28.9     galgel     14.3    mgrid      310.4    vpr        9.0
art        8.1      gap        14.2    parser     6.0      wupwise    8.7
bzip2      6.7      gcc        6.3     perlbmk    7.2      average    37.8
crafty     8.5      gzip       9.5     sixtrack   11.9
eon        8.2      lucas      25.6    swim       77.1
1 out of every 9 fetched instructions in integer benchmarks and 1 out of every 63 fetched instructions in floating-point benchmarks accesses the branch predictor.

Always putting the branch predictor in a deep low power mode (lp, aggr-lp or ultra-lp) and waking it up on access causes noticeable performance degradation for some benchmarks.
Observation: Branch Predictor Access Pattern
[Figure: two panels, swim and equake, plotting IPB every 512 cycles over 1M cycles]
Distribution of the number of branches per 512-instruction interval (over 1M cycles)

Within a benchmark there is significant variation in Instructions Per Branch (IPB).

Once the IPB drops (increases) significantly, it may remain low (high) for a long period of time.
Branch Predictor Peripherals Leakage Control

The high IPB period can be identified once the first low IPB period is detected.

The number of fetched branches is counted every 512 cycles. Once the number of branches is found to be less than a certain threshold (24 in this work), a high IPB period is identified. The IPB is then predicted to remain high for the next twenty 512-cycle intervals (10K cycles).
Branch predictor peripherals transition from basic-lp mode to lp mode when a high IPB period is identified.
During pre-stall and stall periods the branch predictor peripherals transition to aggr-lp and ultra-lp mode, respectively.
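The detection scheme above can be sketched as a per-interval counter check. The interval length (512 cycles), threshold (24 branches), and hold time (twenty intervals) are the values stated on the slide; the class and function names are hypothetical, and re-evaluating the branch count as soon as the hold expires is an assumption.

```python
INTERVAL_CYCLES = 512    # branches are counted over 512-cycle intervals
BRANCH_THRESHOLD = 24    # fewer branches than this marks a high IPB period
HOLD_INTERVALS = 20      # high IPB is predicted for the next 20 intervals

class HighIpbDetector:
    """Predicts high IPB periods from per-interval branch counts."""

    def __init__(self):
        self.remaining = 0  # intervals still covered by the prediction

    def end_of_interval(self, branches_fetched):
        """Call once per interval; returns the mode for the next interval."""
        if self.remaining > 0:
            self.remaining -= 1
        # Assumption: once the hold has expired, the count is checked again.
        if self.remaining == 0 and branches_fetched < BRANCH_THRESHOLD:
            self.remaining = HOLD_INTERVALS
        return "lp" if self.remaining > 0 else "basic-lp"

def branch_predictor_mode(ipb_mode, pre_stall, stalled):
    """Stall periods override the IPB-based mode, as described above."""
    if stalled:
        return "ultra-lp"
    if pre_stall:
        return "aggr-lp"
    return ipb_mode

if __name__ == "__main__":
    detector = HighIpbDetector()
    counts = [40, 10] + [40] * 21
    print([detector.end_of_interval(c) for c in counts])
```

With this sketch, a single interval with fewer than 24 branches keeps the peripherals in lp mode for twenty intervals, regardless of the branch counts seen during the hold.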
Leakage Power Reduction
[Figure: leakage power reduction per benchmark (ammp through wupwise, plus average), y-axis 0% to 40%, broken down by sleep mode: basic-lp, lp, aggr-lp, ultra-lp]
Noticeable contribution of the ultra and basic low power modes
Outline – Resource Adaptation





Why are the IQ, ROB, and RF major power dissipators?
Study processor resource utilization during L2/multiple L1 miss service time
Architectural approach to dynamically adjusting the size of resources during the cache miss period for power conservation
Results
Conclusions
Instruction Queue

The Instruction Queue is a CAM-like structure which holds instructions until they can be issued.

Set entries for newly dispatched instructions
Read entries to issue instructions to functional units
Wake up instructions waiting in the IQ once a result is ready
Select instructions for issue when the number of available instructions exceeds the processor issue limit (Issue Width)
Main complexity: wakeup logic
Logical View of Instruction Queue
[Figure: CAM wakeup logic with taglines tag00 to tag03 through tagIW0 to tagIW3, pre-charge to Vdd, and matchline1 to matchline4 ORed into the Ready Bit]




At each cycle, the match lines are pre-charged high to allow the individual bits associated with an instruction tag to be compared with the results broadcast on the taglines.
Upon a mismatch, the corresponding matchline is discharged. Otherwise, the match line stays at Vdd, which indicates a tag match.
At each cycle, up to 4 instructions are broadcast on the taglines, so four sets of one-bit comparators are needed for each one-bit cell.
All four matchlines must be ORed together to detect a match on any of the broadcast tags. The result of the OR sets the ready bit of the instruction source operand.
No need to always have such an aggressive wakeup/issue width!
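A behavioral sketch of the wakeup logic described above, assuming integer tags of a hypothetical 7-bit width: each matchline starts pre-charged, any mismatching bit discharges it, and the matchlines for all broadcast tags are ORed into the ready bit. The function names are illustrative, not from the slides.

```python
TAG_BITS = 7  # assumed tag width; the slides do not state it

def matchline(stored_tag, broadcast_tag, bits=TAG_BITS):
    """Return True if the matchline stays high (all tag bits match)."""
    line = True  # pre-charged high at the start of the cycle
    for i in range(bits):
        if ((stored_tag >> i) & 1) != ((broadcast_tag >> i) & 1):
            line = False  # a single mismatching bit discharges the line
    return line

def wakeup_cycle(source_tags, ready_bits, broadcast_tags):
    """One wakeup cycle: OR the matchlines of each source operand."""
    updated = []
    for tag, ready in zip(source_tags, ready_bits):
        lines = [matchline(tag, b) for b in broadcast_tags]
        updated.append(ready or any(lines))
    return updated

if __name__ == "__main__":
    # Three waiting source operands, two result tags broadcast this cycle.
    print(wakeup_cycle([5, 9, 12], [False, False, True], [9, 3]))
```

The number of comparisons per operand equals the number of broadcast tags, which is why narrowing the wakeup width directly reduces the comparator activity.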
ROB and Register File

The ROB and the register file are multi-ported SRAM
structures with several functionalities:



Setting entries for up to IW instructions in each cycle,
Releasing up to IW entries during commit stage in a cycle, and
Flushing entries during the branch recovery.
Power breakdown by component:

Component                Dynamic Power  Leakage Power
bitline and memory cell  58%            63%
data output driver       29%            15%
decode                   8%             11%
sense_amp                4%             3%
wordline                 1%             8%
Architectural Motivations

A load miss in L1/L2 caches takes a long time to service and prevents dependent instructions from being issued
When dependent instructions cannot issue, after a number of cycles the instruction window (ROB, Instruction Queue, Store Queue, Register Files) is full
The processor issue stalls and performance is lost
At the same time, energy is lost as well!
This is an opportunity to save energy
Scenario I: L2 cache miss period
Scenario II: three or more pending DL1 cache misses
How Architecture can help reduce power in ROB,
Register File and Instruction Queue
[Figure: issue rate decrease per benchmark (integer and floating-point, with INT and FP averages) for Scenario I and Scenario II, y-axis -10% to 100%]
Scenario I: the issue rate drops by more than 80%.
Scenario II: the issue rate drops by 22% for integer benchmarks and 32.6% for floating-point benchmarks.
Significant issue width decrease!
How Architecture can help reduce power in ROB,
Register File and Instruction Queue

ROB occupancy grows significantly during scenario I and II for integer benchmarks: 98% and 61% on average.
The increase in ROB occupancy for floating-point benchmarks is smaller: 30% and 25% on average for scenario I and II.

ROB occupancy increase (%):

Benchmark    Scenario I  Scenario II    Benchmark   Scenario I  Scenario II
bzip2        165.0       88.6           applu       13.8        -4.9
crafty       179.6       63.6           apsi        46.6        18.2
gap          6.6         61.7           art         31.7        56.9
gcc          97.7        43.9           equake      49.8        38.1
gzip         152.9       41.0           facerec     87.9        14.1
mcf          42.2        40.6           galgel      30.9        34.4
parser       31.3        102.3          lucas       -0.7        54.0
twolf        81.8        58.8           mgrid       8.8         5.6
vortex       118.7       57.8           swim        -4.3        11.4
vpr          96.6        55.7           wupwise     40.2        24.4
INT average  98.2        61.4           FP average  30.5        25.2
How Architecture can help reduce power in ROB,
Register File and Instruction Queue

Register file occupancy:

Benchmark    Scenario I  non-Scenario I  Scenario I  non-Scenario I  Scenario II  non-Scenario II  Scenario II  non-Scenario II
             IRF         IRF             FRF         FRF             IRF          IRF              FRF          FRF
bzip2        74.4        28.8            0.0         0.0             56.6         30.7             0.0          0.0
crafty       83.4        31.9            0.1         0.0             51.4         32.2             0.0          0.0
gap          46.2        41.1            0.1         0.7             65.8         42.9             0.6          0.5
gcc          46.3        21.2            0.2         0.1             28.7         24.0             0.0          0.1
gzip         45.1        27.2            0.0         0.0             39.8         27.2             0.0          0.0
mcf          40.8        29.3            1.0         1.1             46.8         36.4             3.2          0.1
parser       37.4        29.8            0.0         0.0             57.0         29.8             0.1          0.0
twolf        58.7        32.3            2.6         2.1             46.0         29.8             2.5          2.0
vortex       70.9        31.1            0.3         0.2             52.4         35.0             0.2          0.2
vpr          63.9        29.0            7.8         8.6             66.4         41.0             8.7          8.3
INT average  55.3        29.2            1.1         1.2             50.3         32.0             1.4          1.0
applu        6.0         5.6             76.6        64.8            1.7          6.2              77.3         73.7
apsi         16.1        18.3            65.7        37.6            15.8         17.9             58.8         43.6
art          35.4        25.0            36.2        30.7            23.0         29.0             42.9         6.3
equake       34.2        27.4            16.1        7.1             32.7         29.4             21.0         9.6
facerec      52.6        22.5            50.0        28.9            30.3         38.4             48.1         35.0
galgel       50.4        27.4            41.8        48.7            32.1         26.0             61.0         44.2
lucas        21.7        23.8            47.7        44.0            41.7         22.1             29.7         47.0
mgrid        5.9         6.2             90.0        80.7            1.9          6.4              96.7         87.2
swim         23.3        27.8            77.1        78.1            29.7         23.1             87.1         76.2
wupwise      26.3        28.8            53.5        28.7            40.5         26.9             38.0         42.2
FP average   26.6        20.9            56.5        44.7            24.0         22.1             56.2         46.0

IRF occupancy always grows in both scenarios for integer benchmarks. A similar case holds for the FRF when running floating-point benchmarks, and only during scenario II.
Proposed Architectural Approach

Adaptive resource resizing during the cache miss period

Reduce the issue and the wakeup width of the processor during L2 miss service time.
Increase the size of the ROB and RF during L2 miss service time or when at least three DL1 misses are pending.
Simple resizing scheme: reduce to half size. Not necessarily optimized for individual units, but a simple scheme to implement at the circuit level!
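A minimal sketch of the resizing policy, under one reading of the slide: the issue/wakeup width is halved while an L2 miss is being serviced, and the ROB and RF run at half size except during an L2 miss or when at least three DL1 misses are pending. The issue width of 4 follows the four broadcast tags mentioned earlier; the ROB and RF sizes and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    issue_width: int
    rob_entries: int
    rf_entries: int

# Hypothetical full-size configuration (sizes are assumptions).
FULL = Config(issue_width=4, rob_entries=128, rf_entries=128)

def active_config(l2_miss_pending, pending_dl1_misses, full=FULL):
    """Return the resource sizes to enable for the current miss state."""
    miss_period = l2_miss_pending or pending_dl1_misses >= 3
    return Config(
        # Issue and wakeup width are halved only during L2 miss service time.
        issue_width=full.issue_width // 2 if l2_miss_pending else full.issue_width,
        # ROB and RF grow to full size during miss periods, half size otherwise.
        rob_entries=full.rob_entries if miss_period else full.rob_entries // 2,
        rf_entries=full.rf_entries if miss_period else full.rf_entries // 2,
    )

if __name__ == "__main__":
    print(active_config(False, 0))
    print(active_config(True, 0))
    print(active_config(False, 3))
```

Halving is the only resize step here, mirroring the "reduce to half size" scheme; a finer-grained policy would need per-unit tuning that the slide explicitly sets aside.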
Results
Small performance loss: ~1%
15~30% dynamic and leakage power reduction
[Figure: power (dynamic/leakage) reduction per benchmark (integer and floating-point, with INT and FP averages), y-axis 0% to 50%; series: ROB Leakage, ROB Dynamic, Issue Queue]
[Figure: power (dynamic/leakage) reduction per benchmark, y-axis 0% to 40%; series: INT RF Leakage, INT RF Dynamic, FP RF Leakage, FP RF Dynamic]
[Figure: IPC degradation per benchmark, y-axis 0% to 6%]
Conclusions





Introduced the idea of multiple sleep mode design
Applied multiple sleep modes to on-chip SRAMs: find periods of low activity for state transitions
Introduced the idea of resource adaptation
Applied resource adaptation to on-chip SRAMs+CAMs: find periods of low activity for state transitions
Applying similar adaptive techniques to other energy-hungry resources in the processor: multiple sleep mode functional units