
Reconfigurable Caches and their
Application to Media Processing
Parthasarathy (Partha) Ranganathan
Dept. of Electrical and Computer Engineering
Rice University
Houston, Texas
Sarita Adve
Dept. of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois

Norman P. Jouppi
Western Research Laboratory
Compaq Computer Corporation
Palo Alto, California
Motivation (1 of 2)
Different workloads on general-purpose processors
Scientific/engineering, databases, media processing, …
Widely different characteristics
Challenge for future general-purpose systems
Use most transistors effectively for all workloads
Reconfigurable Caches
-2-
Partha Ranganathan
Motivation (2 of 2)
Challenge for future general-purpose systems
Use most transistors effectively for all workloads
50% to 80% of processor transistors devoted to cache
Very effective for engineering and database workloads
BUT large caches often ineffective for media workloads
Streaming data and large working sets [ISCA 1999]
Can we reuse cache transistors for other useful work?
Contributions
Reconfigurable Caches
Flexibility to reuse cache SRAM for other activities
Several applications possible
Simple organization and design changes
Small impact on cache access time
Application for media processing
e.g., instruction reuse – reuse memory for computation
1.04X to 1.20X performance improvement
Outline for Talk
✓ Motivation
Reconfigurable caches
Key idea
Organization
Implementation and timing analysis
Application for media processing
Summary and future work
Reconfigurable Caches: Key Idea
Key idea: reuse cache transistors!
[Figure: on-chip SRAM. Current use: a single monolithic cache. Proposed use: dynamically divided into Partition A (cache) and Partition B (lookup).]
Dynamically divide SRAM into multiple partitions
Use partitions for other useful activities
⇒ Cache SRAM useful for both conventional and media workloads
Reconfigurable Cache Uses
Number of different uses for reconfigurable caches
Optimizations using lookup tables to store patterns
Instruction reuse, value prediction, address prediction, …
Hardware and software prefetching
Caching of prefetched lines
Software-controlled memory
QoS guarantees, scratch memory area
⇒ Cache SRAM useful for both conventional and media workloads
Key Challenges
How to partition SRAM?
How to address the different partitions as they change?
Minimize impact on cache access (clock cycle) time
⇒ Associativity-based partitioning
Conventional Cache Organization
[Figure: two-way set-associative cache. The address splits into tag, index, and block offset; the index selects an entry (state, tag, data) in each of Way 1 and Way 2; the tag compares drive a select mux that produces the data out and hit/miss signals.]
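The conventional lookup on this slide can be written as a small behavioral model. This is my illustration, not the paper's implementation; the block and index widths are assumed values.

```python
# Behavioral sketch of a conventional two-way set-associative lookup.
# Field widths are assumptions for illustration.
BLOCK_BITS = 5    # 32-byte blocks (assumed)
INDEX_BITS = 9    # 512 sets (assumed)

def split_address(addr):
    """Split an address into (tag, index, block offset)."""
    block = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, block

def lookup(ways, addr):
    """Compare the stored tag in each way; select matching data on a hit."""
    tag, index, _ = split_address(addr)
    for way in ways:                      # Way 1, Way 2
        state, stored_tag, data = way[index]
        if state == 'valid' and stored_tag == tag:
            return True, data             # hit: data out
    return False, None                    # miss
```

Each "way" here is just a list of (state, tag, data) entries, standing in for the state, tag, and data arrays in the figure.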
Associativity-Based Partitioning
[Figure: the same two-way cache with each way treated as a partition. Each partition receives its own tag/index/block address, and a choose signal steers the compare/select logic to produce per-partition data out and hit/miss signals.]
Partition at granularity of “ways”
Multiple data paths and additional state/logic
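A minimal behavioral sketch of associativity-based partitioning (my illustration with assumed sizes, not the paper's circuit): ways remaining in the cache partition keep their tag compares, while a way handed to the lookup partition is addressed directly, with no tag compare.

```python
NUM_SETS = 512  # assumed

class ReconfigurableCache:
    """Two-way cache whose ways can be repartitioned at runtime."""
    def __init__(self):
        # Two ways, each NUM_SETS entries of (valid, tag, data).
        self.ways = [[(False, None, None)] * NUM_SETS for _ in range(2)]
        self.cache_ways = [0, 1]      # both ways serve as cache initially

    def repartition(self, cache_ways):
        """Ways not listed become the lookup partition."""
        self.cache_ways = cache_ways

    def cache_lookup(self, tag, index):
        """Tag compare only within the ways still owned by the cache."""
        for w in self.cache_ways:
            valid, t, data = self.ways[w][index]
            if valid and t == tag:
                return data
        return None

    def table_read(self, way, index):
        """Lookup partition is addressed directly: no tag compare."""
        return self.ways[way][index][2]

    def table_write(self, way, index, value):
        self.ways[way][index] = (True, None, value)
```

Because partitioning happens at the granularity of whole ways, the cache-side compare/select path is unchanged; only the set of ways it scans shrinks.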
Reconfigurable Cache Organization
Associativity-based partitioning
Simple - small changes to conventional caches
But the number and granularity of partitions depend on associativity
Alternate approach: Overlapped-wide-tag partitioning
More general, but slightly more complex
Details in paper
Other Organizational Choices (1 of 2)
Ensuring consistency of data at repartitioning
Cache scrubbing: flush data at repartitioning intervals
Lazy transitioning: Augment state with partition information
Addressing of partitions - software (ISA) vs. hardware
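The two consistency options above can be sketched behaviorally. The line format and field names are assumptions for illustration, not the paper's mechanism: scrubbing flushes dirty lines at the repartitioning point, while lazy transitioning tags each line with its owning partition and treats stale lines as misses on first touch.

```python
def scrub(lines, write_back):
    """Cache scrubbing: write back dirty lines, then invalidate all."""
    for i, line in enumerate(lines):
        if line['valid'] and line['dirty']:
            write_back(line)
        lines[i] = {'valid': False, 'dirty': False, 'part': None, 'data': None}

def lazy_read(lines, i, current_part, refill):
    """Lazy transitioning: a line tagged with a stale partition ID is
    treated as a miss and refilled for the current partition."""
    line = lines[i]
    if not line['valid'] or line['part'] != current_part:
        lines[i] = {'valid': True, 'dirty': False,
                    'part': current_part, 'data': refill(i)}
    return lines[i]['data']
```

Scrubbing pays its whole cost at the repartitioning interval; lazy transitioning spreads it across later accesses at the cost of extra state per line.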
Other Organizational Choices (2 of 2)
Method of partitioning - hardware vs. software control
Frequency of partitioning - frequent vs. infrequent
Level of partitioning - L1, L2, or lower levels
Tradeoffs based on application requirements
Outline for Talk
✓ Motivation
Reconfigurable caches
Key idea
Organization
Implementation and timing analysis
Application for media processing
Summary and future work
Conventional Cache Implementation
[Figure: cache implementation with separate tag and data arrays. The address feeds decoders driving word lines; bit lines feed column muxes and sense amps. On the tag side, comparators and mux drivers produce the valid output; on the data side, output drivers produce the data output.]
Tag and data arrays split into multiple sub-arrays
to reduce/balance length of word lines and bit lines
Changes for Reconfigurable Cache
[Figure: the same implementation with per-partition replication. The address, mux drivers, data outputs, output drivers, and valid outputs are each replicated [1:NP] for NP partitions; the tag and data sub-arrays themselves are unchanged.]
Associate sub-arrays with partitions
Constraint on minimum number of sub-arrays
Additional multiplexors, drivers, and wiring
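The constraint above amounts to: partitions are built from whole sub-arrays, so there must be at least as many sub-arrays as partitions. A toy assignment sketch (hypothetical helper; round-robin is my choice for illustration):

```python
def assign_subarrays(num_subarrays, num_partitions):
    """Assign whole sub-arrays to partitions, round-robin.

    Fails if there are more partitions than sub-arrays, since a
    partition cannot own less than one sub-array.
    """
    if num_partitions > num_subarrays:
        raise ValueError("need at least one sub-array per partition")
    # owner[s] = partition that owns sub-array s
    return [s % num_partitions for s in range(num_subarrays)]
```

This also shows why the minimum partition granularity is one sub-array's worth of SRAM.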
Impact on Cache Access Time
Sub-array-based partitioning
Multiple simultaneous accesses to SRAM array
No additional data ports
Timing analysis methodology
CACTI analytical timing model for cache time (Compaq WRL)
Extended to model reconfigurable caches
Experiments varying cache sizes, partitions, technology, …
Impact on Cache Access Time
Cache access time
Comparable to base (within 1-4%) for few partitions (2)
Higher for more partitions, especially with small caches
But still within 6% for large caches
Impact on clock frequency likely to be even lower
Outline for Talk
✓ Motivation
✓ Reconfigurable caches
Application for media processing
Instruction reuse with media processing
Simulation results
Summary and future work
Application for Media Processing
Instruction reuse/memoization [Sodani and Sohi, ISCA 1997]
Exploits value redundancy in programs
Store instruction operands and result in reuse buffer (here, a cache partition)
If later instruction and operands match in reuse buffer (cache partition),
skip execution; read answer from reuse buffer (cache partition)
Few changes for implementation with reconfigurable caches
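The scheme above is essentially memoization keyed on the instruction and its operand values. A behavioral sketch, with table layout and capacity handling assumed for illustration rather than taken from the paper:

```python
class ReuseBuffer:
    """Instruction reuse buffer: maps (pc, operands) to a cached result."""
    def __init__(self, num_entries=1024):
        self.entries = {}
        self.num_entries = num_entries

    def execute(self, pc, op, operands):
        key = (pc, operands)
        if key in self.entries:       # instruction and operands match:
            return self.entries[key]  # skip execution, read the answer
        result = op(*operands)        # otherwise compute the result
        if len(self.entries) < self.num_entries:
            self.entries[key] = result  # and remember it for reuse
        return result
```

With a reconfigurable cache, `entries` would live in a cache partition instead of a dedicated structure, which is why few implementation changes are needed.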
Simulation Methodology
Detailed simulation using RSIM (Rice)
User-level execution-driven simulator
Media processing benchmarks
JPEG image encoding/decoding
MPEG video encoding/decoding
GSM speech decoding and MPEG audio decoding
Speech recognition and synthesis
System Parameters
Modern general-purpose processor with ILP+media extensions
1 GHz, 8-way issue, OOO, VIS, prefetching
Multi-level memory hierarchy
128KB 4-way associative 2-cycle L1 data cache
1MB 4-way associative 20-cycle L2 cache
Simple reconfigurable cache organization
2 partitions at L1 data cache
64 KB data cache, 64KB instruction reuse buffer
Partitioning at start of application in software
Impact of Instruction Reuse
[Figure: normalized execution time (state-of-the-art = 100), each bar split into CPU and memory components, for state-of-the-art vs. with instruction reuse (IR): 84 for JPEG decode, 89 for MPEG decode, 92 for speech synthesis.]
Performance improvements for all applications (1.04X to 1.20X)
Use memory to reduce compute bottleneck
Greater potential with aggressive design [details in paper]
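The quoted speedups follow directly from the normalized execution times in the chart (baseline = 100): speedup = baseline time / new time, so the three bars shown fall inside the 1.04X-1.20X range reported across all benchmarks.

```python
def speedup(normalized_time, baseline=100):
    """Speedup implied by a normalized execution time (baseline = 100)."""
    return baseline / normalized_time

for name, t in [("JPEG decode", 84), ("MPEG decode", 89),
                ("speech synthesis", 92)]:
    print(f"{name}: {speedup(t):.2f}X")  # 1.19X, 1.12X, 1.09X
```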
Summary
Goal: Use cache transistors effectively for all workloads
Reconfigurable Caches: Flexibility to reuse cache SRAM
Simple organization and design changes
Small impact on cache access time
Several applications possible
Instruction reuse - reuse memory for computation
1.04X to 1.20X performance improvement
More aggressive reconfiguration currently under investigation
More information available at
http://www.ece.rice.edu/~parthas
[email protected]