Understanding Performance Counter Data


Understanding Performance Counter Data - 1
• Methodology
– [Configuration micro-benchmark]
– Validation micro-benchmark: used to predict event count
– Prediction via tool, mathematical model, and/or simulation
– Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
– Comparison/analysis
– Report findings
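
The collection and comparison steps above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the per-run counts are assumed to have already been read from PAPI by the instrumented benchmark, and the numbers in the usage section are made up for the example.

```python
import statistics


def summarize_runs(counts):
    """Return (mean, sample standard deviation) of per-run event counts."""
    mean = statistics.mean(counts)
    # The sample standard deviation needs at least two runs.
    stdev = statistics.stdev(counts) if len(counts) > 1 else 0.0
    return mean, stdev


def percent_difference(measured_mean, predicted):
    """Signed difference of the measured mean from the prediction, in percent."""
    if predicted == 0:
        raise ValueError("predicted count must be nonzero")
    return 100.0 * (measured_mean - predicted) / predicted


if __name__ == "__main__":
    # Illustrative values only; these are not data from the study.
    predicted = 1000
    counts = [1010, 1012, 1008, 1011, 1009]
    mean, stdev = summarize_runs(counts)
    diff = percent_difference(mean, predicted)
    print(f"mean={mean:.1f} stdev={stdev:.2f} diff={diff:+.2f}%")
```

A small standard deviation across the runs suggests the hardware count is repeatable, so a persistent difference from the prediction is worth investigating rather than attributing to noise.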
Understanding Performance Counter Data - 2
• Can quantify PAPI overhead in some cases, e.g.,
– Loads and stores
– Floating-point operations (on some platforms)
• Can show that count is reasonable in others, e.g.,
– L1 Dcache misses
– DTLB misses (R10K)
– Multiprocessor cache consistency protocol-related events (R10K)
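
One possible way to quantify a fixed instrumentation overhead, offered as an assumption rather than the procedure used in this work: run the benchmark at several sizes, then fit a line to measured versus predicted counts. The helper name `fit_overhead` is hypothetical, and the counts in the usage section are invented for illustration.

```python
def fit_overhead(predicted, measured):
    """Least-squares fit of measured = slope * predicted + intercept.

    Under the assumption that instrumentation adds a constant number of
    events per run, the intercept estimates that overhead.
    """
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(predicted) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    if sxx == 0:
        raise ValueError("predicted counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, measured))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Illustrative values only: each run reports 86 events beyond the prediction.
    predicted = [1000, 2000, 4000]
    measured = [1086, 2086, 4086]
    slope, intercept = fit_overhead(predicted, measured)
    print(f"slope={slope:.3f} overhead={intercept:.1f}")
```

A slope close to 1 would indicate that the benchmark itself matches the prediction, leaving the intercept as the per-run overhead; a slope well away from 1 would mean the discrepancy grows with problem size and is not a fixed overhead.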
Understanding Performance Counter Data - 3
• Interesting facts
– Stream buffers are incredibly effective!
– Itanium has 17% more instructions retired and 17% more Icache misses than predicted; this is due to no-ops
– Itanium has 5x as many TLB misses as predicted; don't know why yet!
– Power3 has 5x (for smaller versions of benchmark) and 2x (for larger versions) as many TLB misses as predicted; don't know why yet!
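
The slides do not say how the TLB-miss predictions were formed. As a hedged illustration of the kind of mathematical model the methodology mentions, assume a benchmark that walks once, sequentially, through a contiguous array starting with a cold TLB, so that each distinct page touched costs one miss. The function names, the start address, and the page size below are all hypothetical.

```python
def pages_touched(start_address, num_bytes, page_size):
    """Count distinct pages covered by [start_address, start_address + num_bytes)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if num_bytes <= 0:
        return 0
    first_page = start_address // page_size
    last_page = (start_address + num_bytes - 1) // page_size
    return last_page - first_page + 1


def miss_ratio(measured_misses, predicted_misses):
    """Measured-to-predicted ratio, e.g. 5.0 for '5x as many as predicted'."""
    if predicted_misses == 0:
        raise ValueError("predicted_misses must be nonzero")
    return measured_misses / predicted_misses


if __name__ == "__main__":
    # Illustrative values only: a 40 KiB array with an assumed 4 KiB page size.
    predicted = pages_touched(start_address=100, num_bytes=40 * 1024, page_size=4096)
    print(f"predicted misses: {predicted}")
```

A model this simple ignores misses caused by the stack, the code, the operating system, and the instrumentation itself, which is one reason a measured count can exceed a prediction.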
Understanding Performance Counter Data - 4
• Interesting facts
– Power3 (gcc compiler): single-precision vs. double-precision floating-point add benchmark
• ½ the number of floating-point operations for the double-precision benchmark, due to rounding instructions needed for the single-precision benchmark
• 1.39x as many cycles for the single-precision benchmark as for the double-precision benchmark