Evaluation of Hardware Performance Counters on the R12000

Evaluation of Hardware Performance Counters on the R12000 Microprocessor
Wendy Korn, Senior
Mentor: Dr. Patricia Teller
The need for accurate performance counters became apparent when we began characterizing the resource usage of Sweep 3D, an ASCI benchmark from the DOE used to evaluate
high-performance computers. For years, computer scientists have used performance counters to help find problem areas in code. This study shows that performance counters on modern
microprocessors provide rudimentary performance measurements that may or may not be accurate. The methodology used to determine the accuracy of this hardware feature on
the R12000 is shown below, together with the results.
SSEAL, Computer Science
Methodology
 Experiments
 High Performance
Performance Counters are used mainly to optimize code.
For example, this piece of code
has a nested loop and accesses data
in a matrix. The way the matrix is
stored in memory determines the
number of cache misses. Cache
misses increase execution time.
for i = 1 to n do
    for j = 1 to n do
        a[i, j] := a[i, j] + 1
If this code were analyzed using
performance counters and the results showed
many cache misses during its execution,
the programmer could try to tune the code to
decrease the miss rate and, thus, decrease
execution time.
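The effect of storage order can be sketched in C, where two-dimensional arrays are stored row by row. This is a minimal illustration and not code from the study; the names N, increment_row_order, and increment_column_order are hypothetical. Both functions perform the same update, but the second steps through memory a full row at a time, which on most cache organizations would be expected to produce more data cache misses.

```c
#include <stddef.h>

#define N 256  /* hypothetical matrix dimension, chosen only for illustration */

/* Inner loop walks along a row: consecutive accesses are adjacent in memory. */
void increment_row_order(int a[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = a[i][j] + 1;
}

/* Inner loop walks down a column: consecutive accesses are N ints apart. */
void increment_column_order(int a[N][N]) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = a[i][j] + 1;
}
```

Running each version under a cache-miss counter and comparing the counts is the kind of tuning experiment described above.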
Based on results, conclusions are made about problem areas in code.
 Microbenchmarks
Two counters can count up to 30 total events; we studied nine.
To generate events, we use small programs called microbenchmarks.
To quantify the accuracy of performance counters, the number of events a program generates
must be known. Thus, microbenchmarks were designed to generate events for which we could
predict counts. For example, if we used the above code, we could measure the number of cache
misses generated by the code. Certain types of code generate certain events. Below is a diagram
of three types of microbenchmarks and the events they can generate.
1. Decoded instructions
2. Decoded loads
3. Decoded stores
4. Conditional resolved branches
5. Primary instruction cache misses
6. Translation Lookaside Buffer misses
7. Primary data cache misses
8. Secondary data cache misses
9. Secondary instruction cache misses
Loop
1. Use grep on the assembly file to find events such as loads, stores, and branches.
2. Validate the numbers from step one with a simulator. Use sim-outorder from the SimpleScalar simulation tool suite and an R12000 configuration file.
3. Counter Data: use the perfex and libperfex interfaces to access the counters, and compare the numbers they report with those from steps 1 and 2.
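One way to express the comparison between a predicted count and a count reported by a simulator or a counter is a relative error. The sketch below is illustrative only: the function name relative_error and the convention of returning -1.0 for a zero prediction are assumptions, not the metric used in the study.

```c
/* Relative error of a measured event count against the predicted count,
   returned as a fraction (0.25 means 25%).
   Returns -1.0 when the prediction is zero, since the ratio is undefined
   (a convention assumed for this sketch). */
double relative_error(long long predicted, long long measured) {
    long long diff;
    if (predicted == 0)
        return -1.0;
    diff = measured - predicted;
    if (diff < 0)
        diff = -diff;
    return (double)diff / (double)predicted;
}
```

A fixed number of extra events inflates this ratio far more when the predicted count is small, which matches the observation that accuracy depends on how many events a microbenchmark generates.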
the interface used
Per figures A-F below, counters accessed by perfex exhibit poorer
accuracy than those accessed by libperfex for microbenchmarks with
small numbers of events.
the event being measured
Per figures D and E, cache miss counts were not accurate using either
interface; per figure A, load counts were accurate when the number of
generated events was large enough.
the application run to generate the events
The linear microbenchmark neither generated enough data cache
misses to yield accurate counts nor yielded accurate
instruction counts.
Data
[Figures A-D]
Validate predictions
Simulations
Accuracy depends on:
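#include <stdlib.h>
#define MAXSIZE 1000000
int main (int argc, char *argv[]) {
    /* a is never initialized; only the memory accesses matter here */
    int a[MAXSIZE], ARRAYSIZE, i;
    if (argc < 2) return 1;                        /* array size is required */
    ARRAYSIZE = atoi(argv[1]);
    if (ARRAYSIZE > MAXSIZE) ARRAYSIZE = MAXSIZE;  /* stay within a */
    for (i = 0; i < ARRAYSIZE; i++)
        a[i] = a[i] + 1;
    return 0;
}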
Array
Predictions
 Conclusions
a = 1; b = 1; c = 1;
a = b + 1;
b = a + 1;
c = a + b;
a = b + c;
b = a + c;
c = a + b;
Linear
Calculate the number of events by searching for the event in the
assembly file or by using an analytical model.
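As an illustration of what a very simple analytical model might look like for a sequential pass over an array, the sketch below predicts compulsory data cache misses as the number of cache lines touched. It is not the model used in the study. It assumes a cold cache, an array that starts on a line boundary, and no prefetching; the function name predicted_line_misses and its parameters are hypothetical.

```c
/* Predicted compulsory misses for one sequential pass over an array:
   the number of cache lines spanned by n_elems elements of elem_size bytes.
   Assumes the array starts on a line boundary and the cache is initially
   empty (assumptions of this sketch). Returns 0 for non-positive input. */
long long predicted_line_misses(long long n_elems, long long elem_size,
                                long long line_size) {
    long long bytes;
    if (n_elems <= 0 || elem_size <= 0 || line_size <= 0)
        return 0;
    bytes = n_elems * elem_size;
    return (bytes + line_size - 1) / line_size;  /* ceiling division */
}
```

For example, predicted_line_misses(1000, 4, 32) gives 125. The 32-byte line size here is only a placeholder; substitute the line size of the cache being modelled.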
[Figures E-F]
Compare results