ppt - Ann Gordon-Ross
Download
Report
Transcript ppt - Ann Gordon-Ross
Configurable Cache Subsetting for
Fast Cache Tuning
Pablo Viana1, Ann Gordon-Ross2, Eamonn Keogh2,
Edna Barros1, Frank Vahid2
Science Institute – CIN
Federal University of Pernambuco - UFPE, Brazil
1Computer
2Department
of Computer Science and Engineering
University of California, Riverside
This work was supported in part by the Capes Foundation BEX 1366/04-1
Outline
Introduction
Configurable cache tuning
Cache configuration space
Configuration space subsetting problem
An exhaustive analysis
A heuristic to subset the configuration space
Applying the subsetting method to a 2-level cache
Future work on other configurable platforms
1/25
Introduction
Caches can improve system performance. However,
caches are power hungry.
Thus, the cache is a good candidate for optimization
44%
Power analysis of ARM970T
S. Segars (ISSCC, 2001)
2/25
Motivation
Tuning cache parameters (Total size, Line size, Associativity) to an
application can reduce energy by 60% on average.
Choose lowest energy
configuration
Microprocessor
Cache
Tuning
Energy
Main Memory
Possible Cache Configurations
3/25
Cache Configuration Tuning
Each application has different cache requirements
4kB, 2-way
Application 1
2kB, 1-way
16B/line
16B/line
Application 2
4kB, 4-way
32B/line
Application 3
Application 4
2kB, 1-way
8kB, 2-way
Application 5
64B/line
16B/line
...
8kB, 4-way
32B/line
Application N
Cache Configuration Space
Configurable cache optimization
4/25
Related Work
Configurable caches
Soft-core processors
ARM;
MIPS;
Tensillica
etc.
Hard-core processors
Motorola M*Core (Malik, ISLPED’00);
Albonesi (MICRO’00);
Zhang (ISCA’03)
Configurable cache tuning
Mostly done manually in practice
Sub-optimal
Time-consuming
5/25
Related Work
Automated Methods for Cache Tuning
Methods and heuristics for Design Space Exploration
Single-level caches (Tens of configurations)
Platune (Givargis TCAD’02, Palesi CODES’02)
Zhang (RSP’03)
Two-level caches (Hundreds to thousands of configurations)
Tcat (Gordon-Ross, DATE’04)
Level 2
Level 1
Total size
Line size
Associativity
Say 50 configs.
*
Total size
Line size
Associativity
= 2500 configs
Say 50 configs.
6/25
Problem Context
Do we really need such a large number of cache
configurations?
Could just few carefully-chosen configurations adequately
cover the large configuration space?
A smaller configuration space would be easier to explore (via
simulation-based or dynamic tuning)
7/25
Problem Context
Potential scenario:
A configurable microprocessor vendor pre-selects a subset of
configurations for each particular application domain.
The user selects the most appropriate domain for a target
application and examine only the pre-selected subset.
Automotive
control
Image
Filtering
Network
protocol
Configuration space
Application domains
Configuration subset
8/25
Cache Configuration Tuning
Energy consumption
Energy to run a given application on different configurations
Not that bad.
Energy increase
Lower energy
Possible cache configurations
9/25
Cache Configuration Subsetting
Application 1
Application 2
subset
Application 3
Application 4
Application 5
...
Application N
Cache Configuration Space
Near optimal cache tuning: Average energy increases
10/25
Cache Configuration Subsetting
Problem Definition:
Identify the subset of configurations that adequately
covers the configuration space for the application
domain.
We state that p configurations are necessary to cover a
space of m points.
Exhaustive Approach:
Select the subset of p configurations from the space m;
Criterion for selection:
Choosing the subset which keep the average energy of the
tuned cache nearest to the optimal.
11/25
Experimental Setup
Configurable cache architecture for initial experiments:
Our target configurable cache architecture is based on
Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture
for Embedded Systems,” ISCA 2003
We set the base cache to suport the following 18 configurations:
Instruction
cache
Memory
Processor
Data
cache
Energy model for estimation:
e(cj, ai ) eCache eMemory eCPUstall
12/25
Exhaustive Approach
y
Subsets of size p=18 down to 1 were chosen through the
exhaustive evaluation of the average energy increase.
50%
Instruction cache
45%
Energy Increase
40%
Data cache
35%
30%
25%
20%
15%
Energy increase is still under 5%.
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
Size of the subset
12
13
14
15
16
17
18
13/25
Exhaustive Approach
Good results, but exhaustively determining the subsets requires too
much computation:
For m=18 and 1 < p < 18 it gives us 262,143 possible combinations.
That’s too expensive!
We need a more efficient way to find the subset of configurations...
14/25
Looking for a Heuristic
First attempt:
Choose the p configurations that offered optimal energy for the
largest number of applications.
50%
Energy increase
45%
1st trial I$
1st trial D$
Exhaustive I$
Exhaustive D$
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
15/25
Looking for a Heuristic
Second attempt:
Hierarchical clustering according to the similarity between
configurations on their average energy savings.
60%
2nd trial I$
2nd trial D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
16/25
Similar Problem
We found a problem similar to our “subsetting puzzle”.
The problem of segmenting time series.
Time series data set
Data set representation
Error (%) depends on average accuracy of the data set representation
17/25
Similar Problem
The method proposed by Keogh’s can be applied, to reduce the
number of primary color for displaying an image
(IEEE Int. Conf. On Data Mining, 2001)
Color/nuances of the image are merged according to the associated
error/difference in the final image.
36 colors
Merging and measuring error
Ex.: Lower average error?
by
or
8 colors
by
Error of replacing the neighboring colors by merging them
18/25
Adapting to the Subsetting Problem
Similarly, we may iteratively discard configurations by merging them.
By merging two configs cj and ck into ck means that all applications
which were tuned by cj, now use ck.
Application
ai
cj
ck
merging
e(cj,ai)
e(ck,ai)
energy increase
19/25
Keogh’s Heuristic
All the possible merges of two adjacent (neighboring) configurations in
the space are evaluated to find the best merging.
c21
c71
cc13
7
20/25
Keogh’s Heuristic
Accuracy of the results
60%
Keogh's I$
Keogh's D$
Exhaustive I$
Exhaustive D$
55%
Energy increase
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Size of the subset
21/25
Comparing the Results
We went on merging the configuration space while the energy
increase remains under 5% (4 configs).
Keogh’s heuristic
Instruction cache
2kB 32B/line 1-way
2kB 32B/line 1-way
4kB 32B/line 2-way
4kB 32B/line 1-way
4kB 64B/line 1-way
4kB 32B/line 2-way
8kB 32B/line 1-way
8kB 64B/line 1-way
Average Error: 0.67%
Average Error: 0.69%
Exhaustive
Data cache
2kB 16B/line 1-way
2kB 16B/line 1-way
2kB 64B/line 1-way
2kB 64B/line 1-way
4kB 16B/line 2-way
4kB 16B/line 2-way
8kB 16B/line 2-way
8kB 16B/line 2-way
Average Error: 4.75%
Average Error: 4.75%
Runtime: 835,000 ms
Runtime: 991 ms
Speed up: 800x
22/25
Two-level Configurable Cache
Instruction
cache
Processor
Data
cache
2nd Level
cache
Memory
Optimal cache
configuration
1.00
0.90
0.80
0.70
Tuning with 4 configs
Average energy increase: 3.36%
0.60
0.50
0.40
0.30
0.20
0.10
0.00
23/25
Where Else this Method Could be Applied?
Other parameterized IP cores
Buses, I/O devices (Word size, bandwidth, etc).
Parameterized Platforms,
Processor’s functional units, cache, bus, multiplier, IPs.
Optimization for low energy, area, performance and others.
24/25
Conclusions
The configurable cache subsetting problem was presented;
A Data Mining algorithm for segmenting time series was
adapted to tackle the cache subsetting problem;
Subsetting the Two-level cache configuration space, keeping
the average energy increase under 5% takes:
Keogh’s heuristic: less than 1 minute
Exhaustively: around 53 hours.
Thanks to the versatility of the proposed method, our next
step is to apply it for Platform customization.
Thank you !
25/25