Analytical Model - University of Wisconsin

Download Report

Transcript Analytical Model - University of Wisconsin

An Analytical Model for
CMPs
Spring 2003 ECE/CS 757
University of Wisconsin - Madison
Peter McClone
Kim-Huei Low
Overview






Introduction
Simulation Environment
Analytical Model
Results
Future work
Conclusion
Introduction






Performance limits of superscalar processors
Relative chip area increases
Chip Multiprocessors (CMP)
Design by simulation data
Time constraint
Analytical Model
CMP System Model
…
CPU0
CPUN


iL1
dL1
iL1
dL1

L2
Memory
Like Piranha
Multiprogrammed
workload
SimpleMP?
CacheTracer





Trace fed CMP cache simulator
Fills gap in multiprocessor simulators
Created specifically for the project
Designed to track statistics need for the
model
Lots of work and complex data structures
Processor Model



Most areaefficient model
Area vs
performance
SimpleScalar
is used
SimpleScalar


Generate address traces
Ignore instruction addresses




~100% hit rate
Inclusion in L2 is not enforced
Very little interference in L2
Performance
Simulator Combination

1. SimpleScalar


2. CacheTracer




Generate address traces
Takes in address traces
Simulate the L2 cache interference
Generate statistics needed for model computation
3. Model Computation

Compute performance estimates for variations of
all of the cache parameters
Simulator Combination

3. Model Computation
bestIPC = 0
area = 100000
//area constraint
// for each processor type (in/out-order, issue width)
for(each processor size)
// for each number of processors that fit on the chip
for(i=0; i*proc size < area; i++)
for(each L1 associativity)
for(each L1 size available)
for(each L2 associativity)
for(each L2 size)
IPC = solveModel()
if(IPC > bestIPC)
bestIPC = IPC
Analytical Model




A mathematical equation for IPC
Combination of observations made in
research papers and from intuitive knowledge
Based mainly on a detailed cache model that
was the focus of the project
Processor model is very simple
Analytical Model

3 part miss rate model:
M(C,t) =
M(C,t)startup+ M(C,t)nonstationary + M(C,t)intrinsic


C is a cache configuration
t is a time granule of r references
Analytical Model



Startup effects are the number of unique
blocks accessed in the first time granule, u(B)
The miss rate is u(B) divided by the total
number of references.
M(C,t)startup = u(B)
rt
Analytical Model




Nonstationary misses are caused by the change in
the working set of a process
This is the difference between the total number of
unique blocks accessed to this point U(B) and the
blocks attributed to startup:
M(C,t)nonstationary = U(B) – u(B)
rT
The sum of the startup and nonstationary misses
are simply the number of unique blocks in the trace.
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
M(C,t)intrinsic =
c * [u(B) - ∑D S*d* P(B,d) ]
d=0
r
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
M(C,t)intrinsic =
S is the
D
c * [u(B) - ∑ S*d* P(B,d) ] number
d=0
r
of sets
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
P(B,d) is the probability
M(C,t)intrinsic =
S is the
that d blocks of size B
c * [u(B) - ∑D S*d* P(B,d) ] number
map into any cache set.
d=0
r
of sets
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
P(B,d) is the probability
M(C,t)intrinsic =
S is the
that that d blocks of size B
c * [u(B) - ∑D S*d* P(B,d) ] number
map into any cache set.
d=0
r
of sets
The sum computes the number of
cache sets that have fewer than D
blocks that map to it (D is set size)
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
P(B,d) is the probability
M(C,t)intrinsic =
S is the
that that d blocks of size B
c * [u(B) - ∑D S*d* P(B,d) ] number
map into any cache set.
d=0
r
of sets
This many blocks cannot cause
misses. Fewer blocks than the set
size map to the set.
Analytical Model



Intrinsic misses are caused inherently caused because of
program behavior
The natural sequence of accesses cause cache lines to displace
each other
P(B,d) is the probability
M(C,t)intrinsic =
S is the
that that d blocks of size B
c * [u(B) - ∑D S*d* P(B,d) ] number
map into any cache set.
d=0
r
of sets
This many blocks cannot cause
misses. Fewer blocks than the set
size map to the set.

However the other blocks (u(B) – the sum) may cause misses.
This is estimated by multiplying by the measure collision rate c
Analytical Model

Combining all three components of the miss
rate results in the equation:
M(C,t) =
u(B) + U(B)–u(B) + c * [u(B)-∑D S*d* P(B,d)]
d=0
rt
rt
r

Average memory access time can be trivially
computed:
AMA = (1 – ML1) * HitTimeL1
+ (ML1 * (1 - ML2) * HitTimeL2
+ (ML1* ML2) * HitTimemain
Analytical Model

The final model then incorporates the AMA into a
simple processor model based on:







Issue Width (IW)
Issue Capabilities (IC)
Average misprediction penalty (AMP)
Average memory access time (AMA)
Pi = IWi * ICi * [(AMAi + 2)/3 * AMPi ]
Total performance is the sum of each processor
Pcmp = ∑Nc Pi
i=1
Results: Example
IPC for different configurations

1.6
1.4
1.2
IPC
1
0.8

0.6
0.4
0.2
0
1
2
3
Configurations
4
5

In general, trade off
complex processors for
larger cache
Same can be done for
power constrained
Workload specific
Results
Hopeful Comparison
1.6
1.4
1.2
IPC
1
Values from Model
0.8
Values from [2]
0.6
0.4
0.2
0
1
2
3
Growing L2 size
4
We hope our
results are close
to those reported
in the papers
Future work

Modify SimpleMP or RSIM



Expand processor model



Modify SMP or DSM system to CMP system
Fix or create the multi-programmed loader
Add Instruction window size, branch predictor
policy, # of ALUs, etc.
More accurately evaluate current parameters
Model variations in the collision rate, c.
Conclusion

Simulation of CMP systems is long and
arduous
An analytical model allows virtually limitless
configurations to be analyzed
An analytical model can accurately predict
performance
More work in this area is required!

Questions?



References



[1] AGARWAL, A., HOROWITZ, M., AND HENNESSY, J. An
analytical cache model. Computer Systems Lab. Rep. TR 86-304,
Stanford Univ. Stanford, Calif., Sept. 1986
[2] J. Huh, D. Burger, and S. W. Keckler. Exploring the design
space of future CMPs. In The 10th International Conference on
Parallel Architectures and Compilation Techniques, pages 199210, September 2001.
[3] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk,
S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese.
Piranha: A scalable architecture based on single-chip
multiprocessing. In The 27th Annual International Symposium on
Computer Architecture, pages 282–293, June 2000.