
Low Static-Power Frequent-Value Data Caches
Chuanjun Zhang*, Jun Yang, and Frank Vahid**
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Semiconductor Research Corporation.
Chuanjun Zhang, UC Riverside
Leakage Power Dominates

Growing impact of leakage power
- Leakage power increases as transistor lengths and threshold voltages scale down.
- The power budget limits the use of fast but leaky transistors.

Caches consume much static power
- Caches account for most of the transistors on a die.

Related work
- DRG: dynamically resizes the cache by monitoring the miss rate.
- Cache line decay: dynamically turns off individual cache lines.
- Drowsy cache: puts cache lines into a low-leakage mode.
Frequent Values in Data Cache (J. Yang and R. Gupta Micro 2002)
[Figure: a microprocessor reads data words from the L1 data cache; the stream of values read out is dominated by a few recurring values such as 00000000 and FFFFFFFF, illustrating the frequently accessed values behavior.]
Frequent Values in Data Cache (J. Yang and R. Gupta Micro 2002)



32 FVs account for around 36% of the total data cache accesses for 11 SPEC95 benchmarks.
FVs can be dynamically captured.
FVs are also widespread within the data cache
- Not just frequently accessed, but also stored throughout the cache.
FVs are stored in encoded form.
- 4 or 5 bits represent 16 or 32 FVs.
- Non-FVs are stored in unencoded form.
The set of frequent values remains fixed for a given program run.
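The encoding scheme can be sketched in software (a minimal illustration, not the paper's hardware; the FV table contents and all names here are hypothetical): each FV maps to a 5-bit code, a flag bit marks encoded words, and non-FVs keep their full 32 bits.

```python
# Sketch of FV encoding with a 32-entry FV table (hypothetical contents).
# A flag bit records whether a word is stored as a 5-bit code (FV)
# or as a full, unencoded 32-bit value (non-FV).
FV_TABLE = [0x00000000, 0xFFFFFFFF, 0x00100000, 0xFF000000] + list(range(4, 32))
FV_INDEX = {v: i for i, v in enumerate(FV_TABLE)}

def encode_word(word):
    """Return (flag, stored_bits); only 5 bits are live when flag is 1."""
    if word in FV_INDEX:
        return 1, FV_INDEX[word]       # flag=1: FV, store only a 5-bit code
    return 0, word                     # flag=0: non-FV, store all 32 bits

def decode_word(flag, stored_bits):
    return FV_TABLE[stored_bits] if flag else stored_bits

assert decode_word(*encode_word(0xFFFFFFFF)) == 0xFFFFFFFF
assert decode_word(*encode_word(0x12345678)) == 0x12345678
```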
[Figure: example showing the FV set (dominated by 00000000 and FFFFFFFF), the FVs among the accessed values, and the FVs stored in the L1 data cache.]
Original Frequent Value Data Cache Architecture







Data cache memory is separated into a low-bit array and a high-bit array.
5 bits encode 32 FVs.
The 27 high-order bits are not accessed for FVs.
A register file holds the decoded values.
Dynamic power is reduced.
Accessing a non-FV takes two cycles.
Flag bit: 1 = FV; 0 = non-FV.
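The two-cycle behavior can be sketched as a behavioral model (not RTL; the array layout and names are illustrative): the flag and 5-bit low array are read first, an FV is decoded from the register file in one cycle, and a non-FV needs a second cycle to fetch the 27 high-order bits.

```python
# Behavioral sketch of the original FV cache read path (cycle counts only).
# low_array holds (flag, low 5 bits) per word; high_array holds the 27
# high-order bits; fv_regfile stands in for the decoded-FV register file.
def fv_cache_read(low_array, high_array, fv_regfile, line, word):
    flag, low5 = low_array[line][word]
    if flag:                                  # FV: one-cycle decode
        return fv_regfile[low5], 1
    high27 = high_array[line][word]           # non-FV: extra array access
    return (high27 << 5) | low5, 2            # two cycles total

fv_regfile = [0x00000000, 0xFFFFFFFF]
low_array = [[(1, 1), (0, 0x234 & 0x1F)]]     # word 0 is the FV with code 1
high_array = [[0, 0x234 >> 5]]                # word 1 stores a non-FV
assert fv_cache_read(low_array, high_array, fv_regfile, 0, 0) == (0xFFFFFFFF, 1)
assert fv_cache_read(low_array, high_array, fv_regfile, 0, 1) == (0x234, 2)
```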
New FV Cache Design: One Cycle Access to Non-FVs

No extra delay in determining accesses of the 27-bit portion.
Leakage energy is proportional to program execution time.
The new word line driver is as fast as the original, achieved by tuning the NAND gate's transistor parameters.
Flag bit: 0 = FV; 1 = non-FV.

[Figure: (a) the original cache line architecture, with one word line driver, a decoder, and each 32-bit word split into a 27-bit and a 5-bit portion; (b) the new subbanked cache line architecture, where flag bits and a new word line driver gate each 27-bit portion separately.]
Low leakage SRAM Cell and Flag Bit
[Figure: the new subbanked cache line architecture; a flag bit SRAM cell whose output drives the gated-Vdd control; and an SRAM cell with a pMOS gated-Vdd control between Vdd and the cell.]
Experiments



SimpleScalar simulator.
Eleven SPEC2000 benchmarks.
Fast-forward the first 1 billion instructions, then execute 500M.

Configuration of the simulated processor:

Processor Core
- Instruction window: 80-entry RUU, 40-entry LSQ
- Issue width: 4 instructions per cycle

Memory Hierarchy
- L1 Dcache: 32KB, 4-way, 32B line size, write-back
- L1 Icache: 32KB, 4-way, 32B line size, write-back
- L2 unified cache: 128KB, 4-way, 64B line size, write-back
- Memory latency: 100 cycles
- TLB: 128-entry, 30-cycle miss penalty
Performance Improvement of One-Cycle Access to Non-FVs

Two-cycle non-FV accesses hurt performance, which lengthens execution time and hence increases leakage energy.
One-cycle access to non-FVs achieves a 5.5% average performance improvement, and hence reduces leakage energy correspondingly.

[Figure: hit rate of FVs in the data cache, and performance (IPC) improvement of the one-cycle FV cache vs. the two-cycle FV cache, for each benchmark (art, mcf, parser, vpr, gzip, ammp, mesa, vortex, equake, gcc, bzip2); average improvement 5.5%.]
Distribution of FVs in Data Cache
FVs are widely found in data cache memory: on average, 49.2% of data cache words are FVs.
Leakage power reduction is proportional to the percentage occurrence of FVs.

[Figure: percentage of data cache words (on average) that are FVs, per benchmark; average 49.2%.]
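As a back-of-the-envelope check (an illustrative model assuming leakage scales with the number of powered-on bit cells, which the slides do not state exactly): gating 27 of 32 bits in the 49.2% of words that are FVs bounds the savings at about 41%, and overheads such as the flag bits bring the realized figure toward the reported 33%.

```python
# Illustrative bound: fraction of data-array bit cells that can be gated
# off when fv_fraction of the cached words are FVs and 27 of each FV's
# 32 bits are unused.  Overheads (flag bits, drivers) are ignored here.
def max_leakage_savings(fv_fraction, gated_bits=27, word_bits=32):
    return fv_fraction * gated_bits / word_bits

bound = max_leakage_savings(0.492)        # average FV occurrence from the slides
assert abs(bound - 0.4151) < 0.0005       # ~41.5% upper bound vs. 33% reported
```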
Static Energy Reduction
[Figure: static energy savings of the data cache per benchmark, on a 0% to 60% scale.]

33% total static energy savings for data caches on average.
How to Determine the FVs

Application-specific processors
- The FVs can be identified offline through profiling, then synthesized into the cache so that power consumption is optimized for the hard-coded FVs.

Processors that run multiple applications
- The FVs can be held in a register file to which each application can write its own set of FVs.

Dynamically determined FVs
- Embed the process of identifying and updating FVs into registers, so that the design dynamically and transparently adapts to different workloads with different inputs.
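The dynamic option can be sketched as simple value profiling (a software analogue of the hardware FV registers; the function name and data are illustrative): count the values in the access stream and keep the most frequent 32 as the FV set.

```python
from collections import Counter

# Software analogue of dynamic FV identification: profile the stream of
# accessed data values and keep the num_fvs most frequent as the FV set.
def find_frequent_values(access_stream, num_fvs=32):
    counts = Counter(access_stream)
    return [value for value, _ in counts.most_common(num_fvs)]

stream = [0x00000000] * 50 + [0xFFFFFFFF] * 30 + [0x1234] * 5 + [0xBEEF]
assert find_frequent_values(stream, num_fvs=2) == [0x00000000, 0xFFFFFFFF]
```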
Conclusion

Two improvements to the original FV data cache:

One-cycle access to non-FVs
- Improves performance (5.5%) and hence reduces static leakage energy.

Shutting off the unused 27-bit portion of each FV
- Does not increase the data cache miss rate.
- Further reduces data cache static energy, by over 33% on average.