Performance Monitor Workshop

Confessions of a Performance Monitor Hardware Designer
Workshop on Hardware Performance Monitor Design
HPCA-11
13 February 2005
Jim Callister
Intel Corporation
© Intel Corp. 2005
Why Include a PMU?
• Ya gotta do something with all those transistors!
• 'Cause my PAPI told me to
• To give competitors a fighting chance
• To show my boss how great my branch predictor is (i.e., get a raise)
• To improve the performance of current and future systems
13 February 2005
Itanium® Processor PMU
How much Performance would you give up for PMU Functionality?
• Transistors may be “free” but…
– Wires are not!
– Design time costs
– Validation costs
– Documentation costs
– Time to Market costs
• The answer is not 0%
– PMU proven to improve performance
• But it’s not 10% either!
The PMU Has Tentacles Everywhere!
[Diagram: one PMU Central block and five Collector blocks]
What to Architect in the PMU?
• “Machine Architecture is a contract between hardware and software”
• Architect too much…
– Lowers performance through design constraints
– Events don’t map well to hardware
• Architect too little…
– Jeopardizes Software Investment
– Discourages Software Support
Itanium® Architecture: PMU
• Architected
– Access & Management of PMU Resources
• PMD registers for Data, PMC registers to control PMU
– Counter Overflow Behavior and Interrupt Handling
– Only a few basic counter events
• Implementation Dependent
– Number of counters, width of counters
– Non-counter performance monitors
– Events: Encourage use of CPU-specific tables
• Itanium architecture protects OS and Tool infrastructure while promoting performance and full visibility
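To make the architected versus implementation-dependent split concrete, here is a toy Python model of a single counter. It is a sketch under stated assumptions, not the Itanium register layout: `ToyCounter`, `arm`, `observe`, and `count_overflows` are hypothetical names, the event names are made up, and the overflow "interrupt" is just a return value. It shows the common overflow-sampling idiom: preload the counter so it wraps after a chosen number of events, and let software re-arm it on each overflow. The width is a parameter because the slide lists counter width as implementation dependent.

```python
class ToyCounter:
    """Hypothetical counter: a control field (PMC-like) plus a data field (PMD-like)."""

    def __init__(self, width_bits, event):
        self.width_bits = width_bits   # implementation dependent, so a parameter
        self.event = event             # control: which event this counter counts
        self.value = 0                 # data: the running count
        self.overflows = 0             # stands in for overflow interrupts taken

    def arm(self, period):
        # Preload so that the counter wraps after exactly `period` events.
        self.value = (1 << self.width_bits) - period

    def observe(self, event):
        """Count one event; return True if the counter wrapped (overflow)."""
        if event != self.event:
            return False
        self.value = (self.value + 1) % (1 << self.width_bits)
        if self.value == 0:
            self.overflows += 1
            return True
        return False


def count_overflows(events, width_bits, event, period):
    """Run an event stream through one counter, re-arming on every overflow."""
    counter = ToyCounter(width_bits, event)
    counter.arm(period)
    for e in events:
        if counter.observe(e):
            counter.arm(period)        # what a software overflow handler would do
    return counter.overflows
```

Because software only sees "the counter overflowed" and re-arms it, the same tool logic works whether the hardware counter is 8 bits wide or 48.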
Performance Events – Let me count the ways…
• Which events are important?
– How will the events be used?
– Do you really care about a cache miss if it doesn’t cause any stalls?
• Mapping an event to signals
– Needed signal may not be available
• On critical path, lack of wires, no signal
– Combining signals is problematic
• Distance between signals, timing, logic
Itanium® 2 Processor PMU Events
Event Categories        Number of Events
Cycle Accounting        89
Instruction Execution   42
Branches                69
Caches & TLBs           150
Bus                     73
Misc                    20
Total                   443
Where are the Performance Problems?
• Counters only give type of problem and magnitude of the problem
• Use filters on counters (hunt & peck)
• Itanium® architecture currently includes:
– Opcode Filters
– Privilege Level Filters
– Instruction Address Range Filters
– Data Address Range Filters
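The "hunt & peck" workflow can be sketched in Python as a predicate applied to an event stream before counting. This is an illustration only: `Event`, `make_filter`, and `count_filtered` are hypothetical names, the field values are invented, and real hardware applies these filters in logic rather than in software. The four filter kinds mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    opcode: str            # illustrative mnemonic, e.g. "ld" or "st"
    privilege_level: int   # illustrative privilege level of the instruction
    instr_addr: int        # address of the instruction
    data_addr: int         # address of the data it touched


def make_filter(opcodes=None, privilege_levels=None,
                instr_range=None, data_range=None):
    """Build a predicate; any filter left as None matches everything."""
    def matches(ev):
        if opcodes is not None and ev.opcode not in opcodes:
            return False
        if privilege_levels is not None and ev.privilege_level not in privilege_levels:
            return False
        if instr_range is not None and not (instr_range[0] <= ev.instr_addr < instr_range[1]):
            return False
        if data_range is not None and not (data_range[0] <= ev.data_addr < data_range[1]):
            return False
        return True
    return matches


def count_filtered(events, predicate):
    """Count only the events that pass the filter."""
    return sum(1 for ev in events if predicate(ev))
```

Locating a problem this way means rerunning the workload with successively narrower filters, which is why the next slide looks for something better.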
A Better Way to Locate Performance Problems
• Event Address Registers (EARs)
– Log information about a single cache miss
– The logs are sampled by software
– Sampling creates a statistical profile of cache misses
• Branch Trace Buffer (BTB)
– Logs information about consecutive branches
– Logs are also sampled by software
Lend Me an EAR
• Instruction & Data EARs
– Log Instruction Address of Miss
• Data EAR also logs Data Address of Miss
– Log Latency of Miss
– Filter by latency bin
– Have an associated counter event
– Can also log TLB misses
• And where TLB miss was resolved
• Have proven to be extremely useful
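As a rough illustration of how sampled EAR logs become a statistical profile, the Python sketch below aggregates samples by instruction address and keeps only those at or above a latency threshold. The tuple layout `(instr_addr, data_addr, latency_cycles)` and the function name `profile_misses` are assumptions made for this example; the threshold only imitates the effect of a latency-bin filter and says nothing about the real register format.

```python
from collections import Counter


def profile_misses(samples, min_latency=0):
    """Rank instruction addresses by how many sampled misses they caused.

    samples: iterable of (instr_addr, data_addr, latency_cycles) tuples,
             a hypothetical decoding of what software read from a Data EAR.
    """
    by_instr = Counter()
    for instr_addr, _data_addr, latency in samples:
        if latency >= min_latency:     # imitate filtering by latency bin
            by_instr[instr_addr] += 1
    return by_instr.most_common()      # hottest instruction addresses first
```

The same aggregation keyed on `data_addr` instead would point at hot data structures rather than hot code.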
The D-EAR Shadow Effect
[Timeline diagram: a Recorded Miss starts a Latency Counter Busy window; Misses that occur while the latency counter is busy fall in its shadow and are not recorded]
Without extra hardware, these misses would never be recorded!
The Itanium® 2 Processor Solution
• Don’t track every opportunity: randomly pick misses to track
• Tradeoff: shadow mitigation versus sampling frequency
• Use an LFSR to decide which port to sample and whether to sample
• Every miss has a ~1 in 8 chance of being tracked
• This mitigates the shadow effect but does not totally eliminate it
• Customer feedback indicates it works very well
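The random-pick idea can be sketched in Python with a small LFSR. Everything specific here is an assumption for illustration: the slides do not give the LFSR width, taps, or port count, so this uses a textbook 16-bit maximal-length Fibonacci LFSR (taps x^16 + x^14 + x^13 + x^11 + 1), treats "low three bits are zero" as the roughly 1 in 8 decision, and uses one more bit to choose between two hypothetical ports. It models only the selection step, not the latency counter or the shadow itself.

```python
class Lfsr16:
    """Textbook 16-bit Fibonacci LFSR (an assumption, not the hardware's LFSR)."""

    def __init__(self, seed=0xACE1):
        if seed & 0xFFFF == 0:
            raise ValueError("seed must be non-zero")
        self.state = seed & 0xFFFF

    def step(self):
        s = self.state
        # Feedback taps for x^16 + x^14 + x^13 + x^11 + 1
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state


def choose_misses(num_misses, lfsr=None):
    """Return (miss_index, port) for each miss picked for tracking."""
    if lfsr is None:
        lfsr = Lfsr16()
    chosen = []
    for i in range(num_misses):
        state = lfsr.step()
        if state & 0b111 == 0:         # low three bits zero: about 1 in 8
            port = (state >> 3) & 1    # one more bit picks a hypothetical port
            chosen.append((i, port))
    return chosen
```

Lowering the pick rate shrinks the shadow further but yields fewer samples per unit time, which is the tradeoff the slide names.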
The Itanium® 2 Processor’s Branch Trace Buffer (BTB)
• An eight-entry circular buffer
• Each entry contains either:
– Address & Prediction Data of a branch, or
– Address of a branch target
• Uses of the BTB
– Mispredicted branch profiler
– An efficient Instruction Address Profiler
– Path Profiler
• Cool use: in conjunction with EARs
– Path leading up to sampled miss!
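A minimal Python sketch of the buffer's behavior, assuming nothing about the real entry encoding: `ToyBranchTraceBuffer` and its methods are hypothetical names, and entries are plain tuples. It shows the two entry kinds from the slide and how an eight-entry circular buffer always holds the most recent history, which is what makes the "path leading up to a sampled miss" use possible.

```python
from collections import deque


class ToyBranchTraceBuffer:
    """Hypothetical model of an eight-entry circular branch trace buffer."""

    def __init__(self, entries=8):
        # A bounded deque drops its oldest entry when a new one arrives.
        self.entries = deque(maxlen=entries)

    def log_branch(self, address, predicted_correctly):
        # Entry kind 1: branch address plus prediction data
        self.entries.append(("branch", address, predicted_correctly))

    def log_target(self, address):
        # Entry kind 2: branch target address
        self.entries.append(("target", address))

    def snapshot(self):
        """What sampling software would read: oldest entry first."""
        return list(self.entries)
```

Taking a `snapshot()` at the moment an EAR sample is collected would give the last few branches and targets executed before that miss.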
The Itanium® 2 processor’s PMU Helps Improve Performance
[Chart: Performance Improvement in Percent (0% to 100%) versus Tuning Time in Weeks (0 to 30) for App One through App Six; annotated “50% CAGR”]
Performance is measured using specific
computer systems and reflect the approximate
performance of Intel products as measured by
those tests. Any difference in system hardware
or software design or configuration may affect
actual performance.
Conclusions
• Walking a micron in HW design shoes
– Balancing PMU functionality & overall performance
• We need to move beyond counters!
– Itanium® 2 processors provide EARs and BTBs
– What’s next?
• The Itanium 2 processor’s PMU has much to offer
– Customers are making good use of it
– Would like to see more use – how do we do it?
• Discussion
– What is the long-term vision for the PMU?
– What can the PMU provide to improve current and future systems?
– Did anything “stick” or resonate?
Itanium® and Itanium® 2 are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
For More Information…
http://developer.intel.com/design/itanium/documentation.htm
Manuals
Intel Itanium Architecture Software Developer's Manuals Volume 1: Application Architecture
Part II: Optimization Guide
Intel Itanium Architecture Software Developer's Manuals Volume 2: System Architecture
Chapter 7: Debugging and Performance Monitoring
Chapter 12: Performance Monitoring Support
Intel Itanium 2 Processor Reference Manual for Software Development and Optimization
Chapter 10: Performance Monitoring
Chapter 11: Performance Monitor Events