ppt - Ann Gordon-Ross

Download Report

Transcript ppt - Ann Gordon-Ross

Configurable Cache Subsetting for
Fast Cache Tuning
Pablo Viana1, Ann Gordon-Ross2, Eamonn Keogh2,
Edna Barros1, Frank Vahid2
Science Institute – CIN
Federal University of Pernambuco - UFPE, Brazil
1Computer
2Department
of Computer Science and Engineering
University of California, Riverside
This work was supported in part by the Capes Foundation BEX 1366/04-1
Outline
 Introduction


Configurable cache tuning
Cache configuration space
 Configuration space subsetting problem

An exhaustive analysis
 A heuristic to subset the configuration space
 Applying the subsetting method to a 2-level cache
 Future work on other configurable platforms
1/25
Introduction
 Caches can improve system performance. However,
caches are power hungry.
 Thus, the cache is a good candidate for optimization
44%
Power analysis of ARM970T
S. Segars (ISSCC, 2001)
2/25
Motivation
Tuning cache parameters (Total size, Line size, Associativity) to an
application can reduce energy by 60% on average.
Choose lowest energy
configuration
Microprocessor
Cache
Tuning
Energy

Main Memory
Possible Cache Configurations
3/25
Cache Configuration Tuning
Each application has different cache requirements
4kB, 2-way
Application 1
2kB, 1-way
16B/line
16B/line
Application 2
4kB, 4-way
32B/line
Application 3
Application 4
2kB, 1-way
8kB, 2-way
Application 5
64B/line
16B/line
...
8kB, 4-way
32B/line
Application N
Cache Configuration Space
Configurable cache optimization
4/25
Related Work
 Configurable caches

Soft-core processors
ARM;
 MIPS;
 Tensillica
 etc.


Hard-core processors
Motorola M*Core (Malik, ISLPED’00);
 Albonesi (MICRO’00);
 Zhang (ISCA’03)

 Configurable cache tuning



Mostly done manually in practice
Sub-optimal
Time-consuming
5/25
Related Work
 Automated Methods for Cache Tuning

Methods and heuristics for Design Space Exploration

Single-level caches (Tens of configurations)
Platune (Givargis TCAD’02, Palesi CODES’02)
 Zhang (RSP’03)


Two-level caches (Hundreds to thousands of configurations)

Tcat (Gordon-Ross, DATE’04)
Level 2
Level 1
Total size
Line size
Associativity
Say 50 configs.
*
Total size
Line size
Associativity
= 2500 configs
Say 50 configs.
6/25
Problem Context
 Do we really need such a large number of cache
configurations?
 Could just few carefully-chosen configurations adequately
cover the large configuration space?
 A smaller configuration space would be easier to explore (via
simulation-based or dynamic tuning)
7/25
Problem Context
Potential scenario:
 A configurable microprocessor vendor pre-selects a subset of
configurations for each particular application domain.
 The user selects the most appropriate domain for a target
application and examine only the pre-selected subset.
Automotive
control
Image
Filtering
Network
protocol
Configuration space
Application domains
Configuration subset
8/25
Cache Configuration Tuning
Energy consumption
Energy to run a given application on different configurations
Not that bad.
Energy increase
Lower energy
Possible cache configurations
9/25
Cache Configuration Subsetting
Application 1
Application 2
subset
Application 3
Application 4
Application 5
...
Application N
Cache Configuration Space
Near optimal cache tuning: Average energy increases
10/25
Cache Configuration Subsetting
Problem Definition:
 Identify the subset of configurations that adequately
covers the configuration space for the application
domain.
 We state that p configurations are necessary to cover a
space of m points.
Exhaustive Approach:

Select the subset of p configurations from the space m;
Criterion for selection:
Choosing the subset which keep the average energy of the
tuned cache nearest to the optimal.
11/25
Experimental Setup
 Configurable cache architecture for initial experiments:

Our target configurable cache architecture is based on
Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture
for Embedded Systems,” ISCA 2003

We set the base cache to suport the following 18 configurations:
Instruction
cache
Memory
Processor
Data
cache
 Energy model for estimation:
e(cj, ai )  eCache  eMemory  eCPUstall
12/25
Exhaustive Approach
y
 Subsets of size p=18 down to 1 were chosen through the
exhaustive evaluation of the average energy increase.
50%
Instruction cache
45%
Energy Increase
40%
Data cache
35%
30%
25%
20%
15%
Energy increase is still under 5%.
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
Size of the subset
12
13
14
15
16
17
18
13/25
Exhaustive Approach
 Good results, but exhaustively determining the subsets requires too
much computation:
 For m=18 and 1 < p < 18 it gives us 262,143 possible combinations.
That’s too expensive!
 We need a more efficient way to find the subset of configurations...
14/25
Looking for a Heuristic
 First attempt:

Choose the p configurations that offered optimal energy for the
largest number of applications.
50%
Energy increase
45%
1st trial I$
1st trial D$
Exhaustive I$
Exhaustive D$
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
15/25
Looking for a Heuristic
 Second attempt:

Hierarchical clustering according to the similarity between
configurations on their average energy savings.
60%
2nd trial I$
2nd trial D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
16/25
Similar Problem
 We found a problem similar to our “subsetting puzzle”.
 The problem of segmenting time series.
Time series data set
Data set representation
Error (%) depends on average accuracy of the data set representation
17/25
Similar Problem
 The method proposed by Keogh’s can be applied, to reduce the
number of primary color for displaying an image

(IEEE Int. Conf. On Data Mining, 2001)
 Color/nuances of the image are merged according to the associated
error/difference in the final image.
36 colors
Merging and measuring error
Ex.: Lower average error?
by
or
8 colors
by
Error of replacing the neighboring colors by merging them
18/25
Adapting to the Subsetting Problem
 Similarly, we may iteratively discard configurations by merging them.
 By merging two configs cj and ck into ck means that all applications
which were tuned by cj, now use ck.
Application
ai
cj
ck
merging
e(cj,ai)
e(ck,ai)
energy increase
19/25
Keogh’s Heuristic
 All the possible merges of two adjacent (neighboring) configurations in
the space are evaluated to find the best merging.
c21
c71
cc13
7
20/25
Keogh’s Heuristic
 Accuracy of the results
60%
Keogh's I$
Keogh's D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
21/25
Comparing the Results
 We went on merging the configuration space while the energy
increase remains under 5% (4 configs).
Keogh’s heuristic
Instruction cache
2kB 32B/line 1-way
2kB 32B/line 1-way
4kB 32B/line 2-way
4kB 32B/line 1-way
4kB 64B/line 1-way
4kB 32B/line 2-way
8kB 32B/line 1-way
8kB 64B/line 1-way
Average Error: 0.67%
Average Error: 0.69%
Exhaustive
Data cache
2kB 16B/line 1-way
2kB 16B/line 1-way
2kB 64B/line 1-way
2kB 64B/line 1-way
4kB 16B/line 2-way
4kB 16B/line 2-way
8kB 16B/line 2-way
8kB 16B/line 2-way
Average Error: 4.75%
Average Error: 4.75%
Runtime: 835,000 ms
Runtime: 991 ms
Speed up: 800x
22/25
Two-level Configurable Cache
Instruction
cache
Processor
Data
cache
2nd Level
cache
Memory
Optimal cache
configuration
1.00
0.90
0.80
0.70
Tuning with 4 configs
Average energy increase: 3.36%
0.60
0.50
0.40
0.30
0.20
0.10
0.00
23/25
Where Else this Method Could be Applied?
 Other parameterized IP cores

Buses, I/O devices (Word size, bandwidth, etc).
 Parameterized Platforms,

Processor’s functional units, cache, bus, multiplier, IPs.
 Optimization for low energy, area, performance and others.
24/25
Conclusions
 The configurable cache subsetting problem was presented;
 A Data Mining algorithm for segmenting time series was
adapted to tackle the cache subsetting problem;
 Subsetting the Two-level cache configuration space, keeping
the average energy increase under 5% takes:


Keogh’s heuristic: less than 1 minute
Exhaustively: around 53 hours.
 Thanks to the versatility of the proposed method, our next
step is to apply it for Platform customization.
Thank you !
25/25