Profiling and Instrumentation

Hardware and Software Tracing
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
Trace Collection Methodologies
• Hardware
– Monitors and instrumentation
– Microcode
• Software
– Trap-based system
– Emulators
– Code annotation (source, object,
executable)
– Direct execution
Metrics for Evaluating Trace Collection Methodologies
• Speed – trace capture rate
• Memory – extra memory used
• Accuracy – address perturbation
• Intrusiveness – tracing overhead
• Completeness – OS, interrupts, libraries
• Granularity – smallest traceable unit
• Flexibility – ease of use
• Portability – platform dependence
• Capacity – trace storage space
• Cost – $$, time
Hardware Monitors
• Capture trace at peak execution rates
• Challenge - match storage media speed to
tracing needs utilizing interleaving and
multiplexing
• Pros:
– Non-intrusive
– Accurate
– Complete
• Cons:
– Expensive
– Limited probeability
– Limited trace length
Examples of Hardware Monitors
• Monster – (U. of Michigan 1992) – R2000
traces using a DAS9200
• BACH (BYU, 1992) – i486, Pentium, SPARC,
68K – developed a customized pod – being
used by Intel today
• Real-time Tracer (IBM 1992) – Customized
SRAM array
• National Instruments (2006) – provides a
family of programmable instrumentation
monitors
Microcode-based Tracing
• Places hooks in microcode to capture
machine state
• Pros:
– Complete (OS, application)
– Minimal slowdown (2-10x)
• Cons:
– Microcode is dated technology
– Nonportable
Example Microcode-based Tracing
• ATUM (Stanford 1986) – VAX traces
• PatchWrx (DEC WRL 1995, NU 1996) –
Complete OS-rich traces on Alpha
running NT
Instrumenting NT-based Workloads
Participants
• Chakib Ouarraoui – EMC
• Jason Casmira – Intel
• John Fraser – US Air Force
• David Hunter – VMWare
• Sharon Smith – HP
• Richard Sites – Adobe Systems
Tracing tools that capture OS activity
Name       Avg. Slowdown    Addr. Perturb   OS Activity   Platform(s)
Pixie      10X - 100X       Y               N             MIPS
EEL        10X              Y               Y             SPARC Solaris
QPT        2X - 6X          Y               N             SPARC Solaris
Shade      6X               N               N             SPARC V8, V9
ATUM       20X              N               Y             DEC VAX
ATOM       10X - 100X       N               Y             DEC UNIX
SimOS      10X - 50,000X    N               Y             DEC UNIX, SGI IRIX, SPARC Solaris
Etch       35X              Y               N             ix86 Windows NT 4.0
NT-Atom    10X - 100X       N               N             Alpha Windows NT 4.0
PatchWrx   4X               N               Y             Alpha Windows NT 4.0
OS Rich and NT-based
Instrumentation Tools
• SimOS
– UNIX-based platforms – (basis for VMWare)
– OS, memory, I/O activity
– High overhead (10X - 50,000X)
• Etch
– Intel x86-based platform
– No OS activity
– 35X slowdown
PatchWrx Overview
• Dynamic execution tracing tool suite
• Captures full system workloads
• Traces branches executed by the processor
• Reconstructs full instruction stream
• DEC Alpha 21064 Windows NT 4.0 platforms
• Low overhead with minimum slowdown
– 2X while running
– 4X while tracing
PatchWrx Components
• PALcode – Alpha Privileged Architecture
Library
• Reserves trace buffer upon boot
• Captures trace info
• Facilitates long branches
• Patch – instrument all NT images
• Trace – collect runtime information
• Reconstruct – reconstitute the information
Patching an Image
• Instrument all WinNT binary image
types
– COM, EXE, DLL, SYS, DRV
• Replace branch-type instructions with
branches to PatchWrx PAL calls
• Log trace entry of branch type into
buffer
• Branch to original target
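Patching an Image
[Figure: an original image (blocks A, B) beside the patched image (A', PAL call, B) and its patch section (PWX PAL call, BR); numbered arrows 1 to 4 mark the control-flow steps]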
Patching Large Images
• Normal Alpha ISA branch instruction
– (PC+4) + SEXT(disp21) * 4
• New PatchWrx long branches
– LBR (PC+4) + SEXT(disp25) * 4
– LBSR (PC+4) + ZEXT(disp20) * 32
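The target formulas above can be checked with a few lines of arithmetic. This is a minimal sketch, not PatchWrx code: the function names are made up for illustration, and it only evaluates the three formulas as written on the slide.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of `value`."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def zext(value, bits):
    """Zero-extend: keep only the low `bits` bits of `value`."""
    return value & ((1 << bits) - 1)


def br_target(pc, disp21):
    """Normal Alpha branch: (PC+4) + SEXT(disp21) * 4."""
    return (pc + 4) + sext(disp21, 21) * 4


def lbr_target(pc, disp25):
    """PatchWrx long branch LBR: (PC+4) + SEXT(disp25) * 4."""
    return (pc + 4) + sext(disp25, 25) * 4


def lbsr_target(pc, disp20):
    """PatchWrx long branch LBSR: (PC+4) + ZEXT(disp20) * 32."""
    return (pc + 4) + zext(disp20, 20) * 32


def max_forward_reach(bits, scale, signed=True):
    """Largest positive byte offset from PC+4 for a displacement field."""
    largest = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return largest * scale
```

Under these formulas the normal branch reaches roughly 4 MB forward of PC+4, LBR roughly 64 MB, and LBSR roughly 32 MB (forward only, in 32-byte steps), which is why large images need the long forms.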
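Patching Large Images
[Figure: the long-branch variant of the patched image (A', PAL call, B) with its patch section (PWX PAL call, BR) and trace capture; numbered arrows 1 to 6 mark the control-flow steps]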
Tracing with PatchWrx
• Trace
• User controlled start/stop/dump
• Dumps captured trace to binary file
• Captures VA mapping snapshot of
active processes during trace capture
Reconstructing Execution
[Figure: the reconstruct tool takes the raw trace, the VA map, images 0..n and symbol tables 0..n as inputs and produces the I-stream and/or D-stream]
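The idea of rebuilding a full instruction stream from a branch-only trace can be sketched as follows. This is a simplified illustration under stated assumptions, not the PatchWrx reconstruct tool: the trace is assumed to hold one (branch PC, target PC) pair per taken branch, instructions are a fixed 4 bytes, and the VA map, images and symbol tables are left out.

```python
def reconstruct(start_pc, branch_trace, end_pc, step=4):
    """Expand taken-branch records into the list of executed PCs.

    branch_trace holds (branch_pc, target_pc) pairs in execution order
    (a hypothetical record layout chosen for this sketch).
    """
    stream = []
    pc = start_pc
    for branch_pc, target_pc in branch_trace:
        if pc > branch_pc:
            raise ValueError("trace record does not follow current PC")
        # Straight-line code executes up to and including the branch.
        while pc <= branch_pc:
            stream.append(pc)
            pc += step
        # Control then transfers to the recorded target.
        pc = target_pc
    # Fall through sequentially until the end of the traced region.
    while pc <= end_pc:
        stream.append(pc)
        pc += step
    return stream
```

For example, `reconstruct(0x100, [(0x108, 0x200), (0x204, 0x100)], 0x104)` expands two trace records into seven executed PCs, which shows why logging only branches keeps the raw trace small.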
OS-Rich Workload Characterization
• Execution domain analysis
• Hot EXEs / DLLs (system resources)
• Instruction mix
– Application-only
– Full system
• Branching behavior
– Branch frequency (average basic block size)
– Branch prediction in presence of OS
Workloads Investigated
Workload   Description
fourier    BYTEmark; numerical analysis routine for calculating series approximations of waveforms
li         SPEC95 Xlisp interpreter benchmark
go         SPEC95 Go! game benchmark
ie         Microsoft Internet Explorer V2.0 following a series of web page links
vc50       Microsoft Visual C++ 5.0 compiling a 3000 line C program
fx32       FX!32 V1.1 interpreting/translating included openGL sample Intel x86 application
word       Microsoft Word97 V7.0, spell-checking a 15 page document
Five most frequently used images in each benchmark or application
Workload   1st                    2nd                    3rd                    4th                    5th                   Other
fourier    bytecpu.exe (99.5%)    winsrv.dll (0.2%)      win32k.sys (0.1%)      ntoskrnl.exe (0.1%)    user32.dll (0.02%)    (0.8%)
li         li.exe (97.7%)         win32k.sys (1.0%)      ntoskrnl.exe (0.6%)    user32.dll (0.1%)      qv.dll (0.1%)         (0.5%)
go         go.exe (95.5%)         win32k.sys (2.0%)      ntoskrnl.exe (1.0%)    hal.dll (0.4%)         gv.dll (0.1%)         (1.0%)
ie         iexplore.exe (37.2%)   win32k.sys (19.3%)     ntoskrnl.exe (17.5%)   fastfat.sys (6.1%)     ntdll.dll (6.0%)      (13.9%)
vc50       c1.exe (83.1%)         ntoskrnl.exe (10.5%)   msvcrt.dll (2.8%)      nsfs.sys (1.2%)        win32k.sys (1.1%)     (1.3%)
word       mssp232.dll (36.4%)    msgren32.dll (34.0%)   ntoskrnl.exe (10.2%)   win32k.sys (7.7%)      hal.dll (4.0%)        (7.7%)
fx32       hal.dll (42.5%)        s3.dll (24.6%)         opengl32.dll (12.2%)   msvcrt.dll (11.7%)     glu32.dll (2.7%)      (6.3%)
Average basic block lengths
[Figure: bar chart of instruction count (y-axis, 0 to 14) per workload (Fourier, Go, vc50, word), with one bar each for All, OS, DLL, APPDLL and APP]
Conditional Branch Prediction
2-level BTB, 12-bit PHR, 4096 entries, gshare
[Figure: bar chart of misprediction ratio (y-axis, 0 to 35) per workload (Fourier, Go, vc50, word), with one bar each for All, OS, DLL, APPDLL and APP]
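A gshare predictor of the size named above (12-bit history, 4096 entries) can be sketched in a few lines. The slide gives only those parameters; the 2-bit saturating counters, the initial counter value and the `pc >> 2` indexing are assumptions made for this sketch, and the BTB itself is not modeled.

```python
class Gshare:
    """Minimal gshare direction predictor (illustrative, not the study's simulator)."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0  # global pattern history register
        # One 2-bit counter per entry, started at weakly not-taken (assumption).
        self.counters = [1] * (1 << history_bits)

    def _index(self, pc):
        # gshare: XOR the branch address with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def misprediction_ratio(branches):
    """branches: iterable of (pc, taken) pairs from a trace."""
    predictor = Gshare()
    total = wrong = 0
    for pc, taken in branches:
        total += 1
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong / total if total else 0.0
```

Feeding the predictor an application-only branch trace and then a full-system trace is one way to reproduce the kind of comparison the chart reports, since OS branches share the same history register and counter table.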
Summary of Results
• Benchmarks execute almost entirely within
the application domain
– Desktop applications execute across many images
and interact with the kernel and system DLLs
• Branch prediction accuracy can change
drastically (sometimes it can even improve)
when the operating system interaction is
considered
• The instruction mix in desktop applications
changes significantly in the presence of OS
– Increased number of indirect branches and
privileged instructions (e.g., PALcalls)
For Further Information
1. “Tracing and Characterization of Windows NT-based System Workloads,” J.P. Casmira, D.P.
Hunter and D.R. Kaeli, Digital Technical Journal, Vol.
10, No. 1, 1998, pp. 6-21
(www.digital.com/info/DTJ01/DTJ01HM.HTM).
2. “Operating System Impact on Trace-Driven
Simulation,” J.P. Casmira, J. Fraser and D.R. Kaeli,
Proceedings of the 31st Simulation Symposium,
Boston, MA, April 1998, pp. 76-82.
3. “A Code Annotation Tool for Capturing Operating
System Execution,” J. Fraser, Northeastern
University Technical Report, NUCAR_6-97-1, June
1997 (on the NUCAR website).
http://www.ece.neu.edu/groups/nucar
And now back to tracing……..
Trap Based
• Interrupt the application at selected points in
order to save trace records
• Pros:
– Available on many CPUs
– Portable
– Inexpensive
• Cons:
– Considerable slowdown (1000x)
– Intrusive (ISR), especially when considering realtime events
– How do we decide where to interrupt the processor
and still maintain a representative trace?
Example Trap Based Systems
• VAX-Tracer – Clark&Emer study on VAX
• OS2-Tracer – Intel 386
• Wisconsin Wind Tunnel – ECC error
trapping – CM5 (SPARC)
• Tapeworm II system – ECC error
trapping – OS trap handler
Emulators
• Simulate the target ISA using one or more
machine instructions on the host ISA
• Pros:
– Minimal slowdown (10-100x)
– Opportunity for JIT compilation
– Portable
– Flexible – software controlled
• Cons:
– Serious programming effort needed
– Extra memory needed
– Typically single process tracing
Emulators
• Shade (UW 1994) – dynamic translation
– Compiles emulated instructions to native instructions (many
elements of Shade have shown up in Transmeta products)
– Host – SPARC-V8
– Targets – SPARC-V8, SPARC-V9, MIPS
• Spa (Sun 1993) – Iterative interpretation
– Reinterprets instructions on each occurrence
– Host – MIPS-1
– Targets – MIPS-1, MIPS-2
• SPIM (U of Wisc 1991) – predecoded interpretation
– Provides pointers to instruction handler and operands to
speed decoding
– Hosts – SPARC, 680x0, MIPS, HP-PA
– Target – MIPS-1
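The difference between iterative and predecoded interpretation can be shown on a toy instruction set. This is a sketch only: the two-instruction text ISA and all names below are hypothetical and unrelated to the real Spa or SPIM implementations, and Shade-style dynamic translation is not shown.

```python
def op_addi(regs, rd, rs, imm):
    regs[rd] = regs[rs] + imm


def op_add(regs, rd, rs, rt):
    regs[rd] = regs[rs] + regs[rt]


HANDLERS = {"addi": op_addi, "add": op_add}


def decode(text):
    """Decode e.g. 'add 2 2 1' into (handler, operands)."""
    name, *operands = text.split()
    return HANDLERS[name], tuple(int(x) for x in operands)


def run_iterative(program, regs, repeat):
    """Iterative interpretation: re-decode on every occurrence."""
    for _ in range(repeat):
        for text in program:
            handler, operands = decode(text)
            handler(regs, *operands)
    return regs


def run_predecoded(program, regs, repeat):
    """Predecoded interpretation: decode once, keep handler and operand pointers."""
    decoded = [decode(text) for text in program]
    for _ in range(repeat):
        for handler, operands in decoded:
            handler(regs, *operands)
    return regs
```

Both loops compute the same register state; the predecoded version simply moves the decode cost out of the hot loop, which is the speedup the SPIM bullet describes.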
More Recent Emulators
• VisualDSP (Analog Devices 1995-present)
– Simulator for SHARC and BlackFin DSPs that runs on WinTel
and Linux-x86
– Provides C/C++ compilation environment
– Statistical profiling
– Cycle-accurate simulator
– Provides a full visualization environment for machine
performance
• AMD Opteron X86-64 (2003)
– Simulator for the new 64-bit X86 from AMD
– Runs on 32-bit Linux-x86
– Comes complete with an X86-64 version of gcc
– http://www.x86-64.org/
MP Emulators
• MINT (University of Rochester 1994)
– Predecoded interpretation – memory references
– Host – R3000 (SGI, DECstations)
– Target – R3000, (an Alpha-based derivative was
developed called AINT)
• RSim (Rice Univ 1997) – Simulator for high-ILP Multiprocessors
– Detailed cycle-based emulation
– Host – SPARC, SGI PowerChallenge
– Target – MIPS R10K
Machine Emulators
• Simics (1996-present) Virtutech
– Developed out of research work at SICS
– Provides a large number of CPU targets
• Alpha, ARM, Itanium, MIPS, Pentium, PowerPC, SPARC, X86-64
– Provides both detailed simulation/emulation and high
throughput
– http://www.simics.com/
• SimOS (1997) Stanford University
– Originally designed to run on an SGI platform
– Actually boots a full operating system (SGI IRIX and DEC
UNIX)
– Implementations on Alpha and MIPS platforms
– Designed around the operating system, emulating IO and
other system-related events
– Provided the base technology for VMWare products
Code Annotation
• Instrumented program produces trace while the
application is run
• Three levels of annotation
– Source code modification
– Object code modification
– Binary code modification
• Pros:
– Ease of implementation
– Small slowdown (10x)
– Inexpensive
• Cons:
– Limited completeness (OS, multiprocessing)
– May not capture DLLs
– Memory dilation
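Source-level annotation can be illustrated with a small rewriting pass. The sketch below uses Python's standard `ast` module to insert a trace call at the top of every function; the `record` routine and `TRACE` list are hypothetical names for this example, and real tools annotate at finer granularity (for instance per basic block).

```python
import ast

TRACE = []  # trace records produced while the instrumented code runs


def record(name):
    """Trace routine called by the inserted annotations."""
    TRACE.append(name)


class AddTraceCalls(ast.NodeTransformer):
    """Insert `record("<function name>")` at the start of each function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # also annotate nested functions
        call = ast.Expr(value=ast.Call(
            func=ast.Name(id="record", ctx=ast.Load()),
            args=[ast.Constant(value=node.name)],
            keywords=[]))
        node.body.insert(0, call)
        return node


def instrument(source):
    """Annotate `source`, execute it, and return its namespace."""
    tree = AddTraceCalls().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {"record": record}
    exec(compile(tree, "<instrumented>", "exec"), namespace)
    return namespace
```

The inserted calls lengthen the program, which is a small-scale picture of the memory dilation listed among the cons, and code that is never passed through `instrument` (the analogue of the OS or uninstrumented DLLs) produces no records.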
Source Code Annotation
• TRAPEDS (Univ. of Illinois 1989)
– Adds a call upon exit from a basic block
• MPTrace (Univ. of Washington 1990)
– I386, instruments only MP-relevant events
• Tangolite (Stanford 1993)
– Annotates all memory events in an MP
environment
Object Code Annotation
• Epoxie (DEC WRL 1989) – Titan MP
• Epoxie2 (DEC WRL 1993) – R3000
• ATOM (DEC WRL 1994) – Alpha
• Alto (Univ. of Arizona 1996) – Alpha
• PLTO (Univ. of Arizona 2001) – IA32
Binary Code Annotation
• Pixie (DEC 1991) – MIPS
• Goblin (IBM/CMU 1991) – RS/6000
• IDtrace (Univ. of Mich.) – i486
• QPT (Univ. of Wisc.) – MIPS, SPARC
• EEL (Univ. of Wisc.) – MIPS, SPARC
• DSPTune (NEU) – ADI SHARC DSP
• Pin (Intel 2005) – X86, XScale, Itanium
Embedded Systems Profiling Tools
• Enhance current embedded system
compilation environments, providing profile-driven analysis and feedback capabilities
• DSPTune - instrumentation and analysis
package for the SHARC family of DSPs
• Allows for full instrumentation of C and C++
codes at the source, assembly and ELF binary
levels
• Supported by Analog Devices and the NSF
The DSPTune Toolset
• A set of library routines that enable the user
to instrument C and assembly programs
• Function calls can be inserted at various
locations in the application code, enabling
execution driven simulation
• The user provides:
– instrumentation routines, which specify the
selected instrumentation events (e.g., loads,
branches, traps)
– analysis routines, which carry out the desired
simulation (e.g., caches, stacks, branch predictors)
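The split between instrumentation routines and analysis routines can be sketched as below. This is not the DSPTune API: the hook name `on_load`, the toy application and the direct-mapped cache parameters are all assumptions chosen to show execution-driven simulation in a few lines.

```python
class DirectMappedCache:
    """Analysis routine: a tiny direct-mapped cache model."""

    def __init__(self, num_lines=64, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag


def application(n, on_load):
    """Toy application that reads n 4-byte words starting at address 0.

    The on_load(addr) call stands in for the call an instrumentation
    tool would insert before each load instruction.
    """
    total = 0
    for i in range(n):
        on_load(4 * i)  # inserted instrumentation call
        total += i
    return total
```

A run such as `cache = DirectMappedCache(); application(64, cache.access)` drives the cache model directly from the executing program, so no trace file has to be stored.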
[Figure: DSPTune tool flow]
Step I: User application code → Parser → Intermediate Representation
Step II: Intermediate Representation + User instrumentation code → Instrumenting Tool → Instrumented IR
Step III: Instrumented IR → Code Generator → Instrumented application code
Step IV: Instrumented application code + User analysis code → Assembler → Linker → Instrumented application executable
BDSPTune
• Provides capabilities similar to DSPTune
• Allows ELF binaries to be instrumented
• Enables instrumentation and profiling to
include library routines
Summary of Tracing Methodologies
Method        Slowdown   OS coverage   Sample size   Cost
Source Code   10X        NO            >GB           LOW
Object Code   10X        SOME          >GB           LOW
Binary Code   10X        NO            >GB           LOW
Microcode     10X        YES           >GB           MEDIUM
I-Stepping    1000X      YES           unlimited     MEDIUM
Emulation     10-100X    YES           unlimited     MEDIUM
Real-time     1X         YES           <GB           HIGH
Counter-based
Profiling and Instrumentation
Counters are used to:
• Identify Performance Bottlenecks
– especially unpredictable dynamic stalls
e.g. cache misses, branch mispredicts, TLB
misses, etc.
– complex out-of-order processors make this difficult
• Guide Optimizations
– help programmers understand and improve code
– automatic, profile-driven optimizations
• Profile Production Workloads
– low overhead
– transparent
– profile whole system
Performance Counters
• Interfaced through a device driver and
supporting GUI (e.g., VTune)
• Counters increment based on a set of events
of interest (e.g., cache misses, pipeline stalls)
• An interrupt signals that a counter has
overflowed
• An interrupt service routine reads the counter
information and tags it with a program counter
(PC) value
• Information is then available for offline
analysis
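The overflow-and-sample cycle described above can be simulated in software. This is a sketch under simplifying assumptions: the "counter" is a plain integer, the interrupt is a function-local branch, and the sampled PC is exactly the PC of the event (real hardware often delivers a later PC).

```python
from collections import Counter


def sample_events(event_pcs, period):
    """Return a PC histogram built from counter-overflow samples.

    event_pcs: PCs at which the monitored event occurred, in order.
    period:    number of events between overflows (hypothetical parameter).
    """
    histogram = Counter()
    count = 0
    for pc in event_pcs:
        count += 1
        if count == period:       # counter overflow raises the interrupt
            histogram[pc] += 1    # the ISR tags the sample with the PC
            count = 0             # counter is re-armed
    return histogram
```

With 900 events at one PC followed by 100 at another and a period of 100, the histogram holds 9 and 1 samples, preserving the 9:1 ratio at one hundredth of the data volume. A fixed period can alias with loop behavior, so practical tools often vary it.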
Performance Counters
• Low overhead method for obtaining
performance and profiling information
– Typically less than 5% slowdown
• Requires no modification of the binary
• May require root-level access to the system
• Lacks precision in cause/effect analysis
• Comes for free on most ISAs
• Commonly used today to measure
performance and estimate power usage
Counter Library
• A number of counter libraries are available to
provide an API to program and access
common architectures
– Rabbit
• for Intel/AMD Processors and Linux
• URL: www.scl.ameslab.gov/Projects/Rabbit/
– PAPI
• Linux IA32, IA64
• Allows counters to be captured on a per thread basis
• URL: icl.cs.utk.edu/projects/papi/
Counters available on different ISAs
Category            PentiumII                    21064                21164                   IBM604e                   R10K           Ultra2
# counters          2                            2                    3                       4                         2              2
Counter Range       40                           8, 12, 16            14, 16                  32                        32             32
Variable Range      No                           Yes                  No                      No                        No             No
Sampling Freq       Variable                     Fixed                Fixed                   Variable                  Variable       Fixed
R/W Access          Yes                          No                   Yes                     Yes                       Yes            Yes
Duration Counting   Yes                          No                   No                      No                        No             No
Counting Modes      Different Privilege Levels   Selected Processes   User, Kernel, PALmode   User, Kernel, Processes   User, Kernel   User, Kernel
Events countable on different ISAs
Event                 PentiumII   21164   IBM604e   R10K   Ultra2
L1 data cache read    Y           N       Y         N      Y
L1 data cache write   Y           N       N         N      Y
L1 data cache r/w     N           Y       N         N      N
L1 data cache miss    Y           Y       Y         Y      Y
L1 inst cache read    Y           N       N         N      N
L1 inst cache r/w     N           Y       N         N      Y
L1 inst cache hit     N           Y       N         N      Y
L1 inst cache miss    Y           Y       Y         Y      Y
Events countable on different ISAs
Event                    PentiumII   21164   IBM604e   R10K   Ultra2
TLB miss                 N           N       Y         Y      N
Data TLB miss            N           Y       Y         N      N
Inst TLB miss            Y           Y       Y         N      N
Retired Branches         Y           N       N         Y      N
Mispredicted Branches    Y           Y       N         Y      N
Taken Branches           Y           N       N         N      N
Mispredicted Retired B   Y           N       N         N      N
Events countable on different ISAs
Event                   PentiumII   21164   IBM604e   R10K   Ultra2
Retired Instructions    Y           Y       Y         Y      Y
Issued Instructions     Y           N       Y         Y      N
Integer Inst Executed   N           Y       Y         N      N
FP Inst Executed        Y           Y       Y         Y      N
Load Inst Executed      N           Y       Y         Y      N
Store Inst Executed     N           Y       N         Y      N
Branch Inst Executed    Y           N       Y         N      N
Events countable on different ISAs
Event                PentiumII   21164   IBM604e   R10K   Ultra2
Total cycles         Y           Y       Y         Y      Y
Cycles BPU is idle   N           N       Y         N      N
Cycles IU is idle    N           N       Y         N      N
Cycles LSU is idle   N           N       Y         N      N
Cycles LSU stalls    N           N       Y         N      N
Cycles FPU stalls    Y           N       Y         N      N
Cycles BPU stalls    N           N       Y         N      N
Existing Instruction-Level Sampling
• Use Hardware Event Counters
– small set of software-loadable counters
– each counts a single event at a time, e.g. dcache miss
– counter overflow generates interrupt
• Advantages
– low overhead vs. simulation and instrumentation
– transparent vs. instrumentation
– complete coverage, e.g. kernel, shared libs, etc.
• Effective on In-Order Processors
– analysis computes execution frequency
– heuristics identify possible reasons for stalls
– example: DIGITAL’s Continuous Profiling Infrastructure
Problems with Event-Based Counters
• Cannot simultaneously monitor all events
• Limited information about events
– “event has occurred”, but no additional context
e.g. cache miss latencies, recent execution path,
...
• Blind spots in non-interruptible code
• Key problem: imprecise attribution
– interrupt delivers restart PC, not the PC that
caused event
– problem worse on out-of-order processors
Problem: Imprecise Attribution
Example: finding the single instruction that causes a
long-latency event (e.g., cache miss, TLB
miss, branch mispredict)
• Most counter-based schemes provide the PC at the
point a counter overflowed
• In-order processors – (Alpha 21164)
– Imprecise exceptions/interrupts hinder our ability to quickly
identify the cause of latencies during execution
– It is possible to post-analyze the problem to attempt to
identify the responsible instruction
• Out-of-order processors – (Alpha 21264, Pentium 4)
– Due to the lack of sequentiality in the execution, the
distance between the responsible instruction and the current
PC could be far
– It is nearly impossible to identify the cause of the latency
Profile-Me Profiling Strategy – (DEC 1998)
• PC + Retire Status → execution frequency
• PC + Cache Miss Flag → cache miss rates
• PC + Branch Mispredict → mispredict rates
• PC + Event Flag → event rates
• PC + Branch Direction → edge frequencies
• PC + Branch History → path execution rates
• PC + Latency → instruction stalls
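Turning per-instruction samples into per-PC rates is a simple aggregation. The record layout below is a hypothetical subset of what a ProfileMe-style sample might carry (PC, retire status, cache miss flag, latency); it is a sketch of the bookkeeping, not the actual hardware register format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    pc: int
    retired: bool
    cache_miss: bool
    latency: int  # cycles


def aggregate(samples):
    """Summarize samples per PC: counts, miss rate, average latency."""
    raw = defaultdict(lambda: {"samples": 0, "retired": 0, "misses": 0, "latency": 0})
    for s in samples:
        entry = raw[s.pc]
        entry["samples"] += 1
        entry["retired"] += int(s.retired)
        entry["misses"] += int(s.cache_miss)
        entry["latency"] += s.latency
    return {
        pc: {
            "samples": e["samples"],
            "retired": e["retired"],
            "miss_rate": e["misses"] / e["samples"],
            "avg_latency": e["latency"] / e["samples"],
        }
        for pc, e in raw.items()
    }
```

Because every field is attached to the sampled instruction's own PC, the resulting rates are attributed precisely, unlike counter-overflow samples that report a later restart PC.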
Identifying True Bottlenecks
• ProfileMe: Detailed Data for Single Instruction
• In-Order Processors
– ProfileMe PC + latency data identifies stalls
– stalled instructions back up pipeline
• Out-of-Order Processors
– explicitly designed to mask stall latency
e.g. dynamic reordering, speculative execution
– stall does not necessarily imply bottleneck
• Example: Does This Stall Matter?
load r1, …      average latency: 35.0 cycles
add  …,r1,…
… other instructions …
Example: Retire Count Convergence
[Figure: estimate / actual (y-axis, 0 to 2) versus number of retired samples N (x-axis, 0 to 500), annotated "Accuracy ∝ 1/√N"]
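How a sampled retire count converges on the true count can be explored with a small Monte Carlo experiment. This sketch assumes each retired instruction is sampled independently with probability 1/interval, in which case the relative error is expected to shrink roughly as 1/√N for N samples; the function names and parameters are illustrative only.

```python
import random
import statistics


def estimate_ratio(actual, interval, rng):
    """Sample `actual` retirements at rate 1/interval; return estimate / actual."""
    p = 1.0 / interval
    samples = sum(1 for _ in range(actual) if rng.random() < p)
    return samples * interval / actual


def spread(actual, interval, trials=100, seed=0):
    """Standard deviation of estimate / actual over repeated experiments."""
    rng = random.Random(seed)
    ratios = [estimate_ratio(actual, interval, rng) for _ in range(trials)]
    return statistics.pstdev(ratios)
```

Comparing `spread(4000, 160)` (about 25 samples per run) with `spread(4000, 10)` (about 400 samples per run) shows the estimate tightening around 1.0 as the sample count grows.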
How to handle concurrency and OOO?
Appropriate concurrency metrics
– retired instructions per cycle
– issue slots wasted while an instruction is in flight
– pipeline stage utilization
How to measure concurrency?
• Special-purpose hardware
– some metrics difficult to measure
e.g. need retire/abort status
• Sample potentially-concurrent instructions
– aggregate info from pairs of samples
– statistically estimate metrics
How to handle concurrency and OOO?
• Sample Two Instructions
– sample instructions, not events
– may be in-flight simultaneously
– replicate ProfileMe hardware, add intra-pair distance
• Nested Sampling
– sample window around first profiled instruction
– randomly select second profiled instruction
– statistically estimate frequency for F (first, second)
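[Figure: timeline showing the sampling window from -W to +W around the first profiled instruction; candidate second instructions are marked as overlap or no overlap]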
Other Uses of Paired Sampling
• Path Profiling
– two PCs close in time can identify
execution path
– identify control flow, e.g. indirect branches,
calls, traps
• Direct Latency Measurements
– data load-to-use
– loop iteration cost
VTune: IA32 Instrumentation and Profiling
• Supports all versions of IA32 Intel processors
• Provides a rich GUI to ease programming and
reading of hardware counters
• Features include:
– Time and event-based sampling
– Call graph profiling
– Provides source-level tuning advice
– Allows for integrated visualization of source and
counter information
– Supports C/C++, Fortran, Java and IA32 assembly
VTune Time Sample
VTune Call Graph
VTune Hot Spot Analyzer
VTune Tuning Assistant
Using Performance Counters for Power
Profiling/Estimation
• Profile power-consuming events
– Cache misses
– TLB misses
– Pipeline stalls
Opportunities to wait slower!
• How can we tie high counts to when to
adjust voltage/frequency? (more on this
later in the class….)
Summary
• Tracing/Instrumentation is still used today by industry
and academia
– The field has evolved significantly
– Industry uses software-based tools for performance and
hardware-based tools for power/energy
– Most performance studies today use some form of emulation or
virtualized execution to obtain trace data
• Counters can be used effectively to capture performance
data
– The entry cost for using counters is low
– Out-of-order (OOO) microarchitectures inhibit the use of counters
– Paired sampling can be an effective technique for handling
imprecision
• A number of high-quality free and commercial tools are
available (and we are going to use at least one of them)