Profiling and Instrumentation
Workload Characteristics and Representative Workloads
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
[email protected]
Overview
• When we want to collect profiles to be used
in the design of a next-generation computing
system, we need to be very careful that we
capture a representative sample in our profile
• Workload characteristics allow us to better
understand the content of the samples which
we collect
• We need to select programs to study which
represent the class of applications we will
eventually run on the target system
Examples of Workload Characteristics
• Instruction mix (see the profiling sketch after this list)
• Static vs. dynamic instruction count
• Working sets
• Control flow behavior
• Working set behavior
• Inter-reference gap model (temporal locality)
• Database size
• Address and value predictability
• Application/library/OS breakdown
• Heap vs. stack allocation
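As a small illustration, the sketch below computes two of these characteristics, the instruction mix and the static vs. dynamic instruction count, from a dynamic trace; the trace format and opcode classes are assumptions invented for this example.

```python
from collections import Counter

# Hypothetical dynamic trace: (pc, opcode_class) pairs, as might be
# collected with a binary instrumentation tool.
trace = [
    (0x400010, "alu"), (0x400014, "load"), (0x400018, "branch"),
    (0x400010, "alu"), (0x400014, "load"), (0x40001c, "store"),
]

# Instruction mix: fraction of dynamic instructions per opcode class.
mix = Counter(op for _, op in trace)
total = sum(mix.values())
for op, n in mix.most_common():
    print(f"{op:8s} {n / total:6.1%}")

# Static vs. dynamic instruction count: unique PCs vs. trace length.
print("static :", len({pc for pc, _ in trace}))
print("dynamic:", len(trace))
```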
Benchmarks
• Real or synthetic programs/applications used
to exercise a hardware system, and generally
representing a range of behaviors found in a
particular class of applications
• Benchmark classes include:
– Toy benchmarks: Towers of Hanoi, qsort, Fibonacci
– Synthetic benchmarks: Dhrystone, Whetstone
– Embedded: EEMBC, UTDSP
– CPU benchmarks: SPECint/fp
– Internet benchmarks: SPECjAppServer, SPECjvm
– Commercial benchmarks: TPC, SAP, SPECjAppServer
– Supercomputing: Perfect Club, SPLASH, Livermore Loops
SPEC Benchmarks Presentation
Is there a way to reduce the runtime of SPEC
while maintaining representativeness?
• MinneSPEC (U. of Minn.) – Using gprof
statistics about the runtime of SPEC and
various SimpleScalar simulation results (I-mix,
cache misses, etc.), we can capture
statistically similar, though significantly
shorter, runs of the programs
• Provides three input sets that will run in:
– A few minutes
– A few hours
– A few days
• IEEE Computer Architecture Letters paper
How many programs do we need?
• Does the set of applications capture
enough variation to be “representative”
of the entire class of workloads?
• Should we consider using multiple
benchmark suites to factor out
similarities in programming styles?
• Can we utilize workload characteristics
to identify the particular programs of
interest?
Example of capturing program behavior:
“Quantifying Behavioral Differences Between C and
C++ Programs” Calder, Grunwald and Zorn, 1995
• C++ is a programming language growing in
popularity
• We need to design tomorrow’s computer
architectures based on tomorrow’s software
paradigms
• How do workload characteristics change as
we move to new programming paradigms?
Example of capturing program behavior:
“Quantifying Behavioral Differences Between C and
C++ Programs” Calder, Grunwald and Zorn, 1995
• First problem – Find a set of representative
programs from both the procedural and OO domains
– Difficult for OO in 1995
• Differences between programming models
– OO relies heavily on messages and methods
– Data locality will change due to the bundling together of
data structures in Objects
– Size of functions will be reduced in OO
– Polymorphism allows for indirect function invocation and
runtime method selection
– OO programs will manage a larger number of dynamically
allocated objects
Address and Value Profiling
• Lipasti observed that profiled instructions
tend to repeat their behavior
• Many addresses are nearly constant
• Many values do not change between
successive executions of the same instruction
• Can we use profiles to better understand
some of these behaviors, and then use this
knowledge to optimize execution?
Address Profiling
• If an address remains unchanged, can we
issue loads and stores early (similar to
prefetching)?
• Do we even have to issue the load or store if
we have not modified memory?
• What are the implications if indirect
addressing is used?
• Can we detect patterns (i.e., strides) in the
address values? (see the sketch below)
• Can we do anything smart when we detect
pointer chasing?
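One common way to detect strides is a per-PC last-address/last-stride table; the sketch below is a minimal version of that idea. The table layout and the synthetic address stream are assumptions made for illustration, not the detector from any particular paper.

```python
from collections import defaultdict

# Per-PC state: last address, last stride, and how often the stride repeats.
state = defaultdict(lambda: {"last": None, "stride": None, "hits": 0, "refs": 0})

def observe(pc, addr):
    s = state[pc]
    if s["last"] is not None:
        stride = addr - s["last"]
        if stride == s["stride"]:
            s["hits"] += 1               # same stride again: predictable
        s["stride"] = stride
    s["last"] = addr
    s["refs"] += 1

# Example stream: one static load sweeping an array of 8-byte elements.
for i in range(100):
    observe(0x400100, 0x10000 + 8 * i)

s = state[0x400100]
print(f"stride {s['stride']}, repeated on {s['hits']}/{s['refs']} references")
```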
Data Value Profiling
• When we see that particular data values do
not change, how can we take advantage of
this?
• Lipasti noticed that a large percentage of
store instructions overwrite memory with the
value already stored there (a small sketch of
detecting such “silent” stores follows)
• Can we avoid computing new results if we
notice that our input operands have not
changed?
• What can we do if a particular operand
only takes on a small set of values?
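A minimal sketch of counting such silent stores with a shadow copy of memory; the store stream below is made up for illustration.

```python
# Shadow memory: the last value known to be stored at each address.
memory = {}
silent = 0
total = 0

def profile_store(addr, value):
    global silent, total
    total += 1
    if memory.get(addr) == value:
        silent += 1                      # the store rewrites the same value
    memory[addr] = value

# Illustrative store stream: a flag that is mostly rewritten unchanged.
for i in range(100):
    profile_store(0x2000, 1)             # silent after the first iteration
    profile_store(0x3000, i % 4)         # cycles through a small value set

print(f"silent stores: {silent}/{total} ({silent / total:.0%})")
```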
Parameter Value Profiling
• Profile the parameter values passed to
functions
• If these parameters are predictable, we
can exploit this fact during compilation
• We can study this on an individual
function basis or a call site basis
• Compiler optimizations such as code
specialization and function cloning can
be used (a small sketch follows)
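To make the idea concrete, here is a source-level sketch of cloning and specializing a function for a parameter value that a profile (hypothetically) shows to be passed on almost every call; scale() and the guard are invented for this example.

```python
# Generic version: always pays for the multiply.
def scale(values, factor):
    return [v * factor for v in values]

# Specialized clone for the hot parameter value seen in the profile.
def scale_factor1(values):
    return list(values)                  # factor == 1: the multiply disappears

# Dispatch inserted at the predictable call site: a cheap guard on the
# common case, with a fallback to the generic code.
def scale_dispatch(values, factor):
    if factor == 1:
        return scale_factor1(values)
    return scale(values, factor)

print(scale_dispatch([1, 2, 3], 1))      # fast specialized path
print(scale_dispatch([1, 2, 3], 5))      # generic fallback
```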
Parameter Value Profiling
• We have profiled a set of MS Windows NT 4.0
desktop applications
– Word97
– FoxPro 6.0
– SQL Server 7.0
– VC++ 6.0
– Excel97
– PowerPoint97
– Access97
• We measured the value predictability of
parameter values for all non-pointer based
parameters
Parameter Value Profiling
• We look for the predictability of parameters using:
– Invariance 1 – the probability that the most frequent value is
passed
– Invariance 4 – the probability that one of the 4 most frequent
values is passed (a small sketch of these metrics follows this slide)
• Parameter values are more predictable on a call-site
basis than on a function basis (e.g., for Word97, 8%
of the functions pass highly predictable parameters, whereas
when computed on individual call sites, over 16%
of the call sites pass highly predictable parameters)
• Highly predictable means that on over 90% of the
calls the same value is observed
• We will discuss how to clone and specialize
procedures when we discuss profile guided data
transformations
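A minimal sketch of computing Invariance 1 and Invariance 4 from a log of the values passed at a single call site; the log below is fabricated for illustration.

```python
from collections import Counter

def invariance(values, k):
    """Fraction of calls that pass one of the k most frequent values."""
    counts = Counter(values)
    top_k = counts.most_common(k)
    return sum(n for _, n in top_k) / len(values)

# Hypothetical log: values of one parameter observed at one call site.
site_log = [0] * 92 + [1, 2, 3, 4] * 2
print("Invariance 1:", invariance(site_log, 1))   # 0.92 -> highly predictable
print("Invariance 4:", invariance(site_log, 4))   # 0.98
```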
How can we reduce the runtime of a single
benchmark and still maintain accuracy?
• Simpoint – attempt to collect a set of trace
samples that best represents the whole
execution of the program
– Identifies phase behavior in programs
– Defines a metric that captures the difference
between two execution intervals
– Computes this difference for every pair of intervals
– Selects the interval(s) closest to all other
intervals
Simpoint (Calder ASPLOS 2002)
• Utilize basic block frequencies to build basic
block vectors (bbf_0, bbf_1, …, bbf_{n-1})
• Each block’s frequency is weighted by the block’s
length (its number of instructions)
• The entire vector is normalized by dividing by the
total number of basic blocks executed
• Take fixed-length samples (100M instructions)
• Compare BBVs using:
– Euclidean Distance:
• ED(a, b) = sqrt( Σ_{i=1..n} (a_i − b_i)² )
– Manhattan Distance:
• MD(a, b) = Σ_{i=1..n} |a_i − b_i|
(both metrics are sketched in code below)
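Both distance metrics are straightforward to compute over normalized BBVs; a minimal sketch, with tiny made-up vectors standing in for real BBVs:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Two normalized BBVs (each sums to 1); the values are made up.
bbv_a = [0.50, 0.30, 0.20]
bbv_b = [0.45, 0.35, 0.20]
print("ED:", euclidean(bbv_a, bbv_b))
print("MD:", manhattan(bbv_a, bbv_b))   # at most 2 for normalized vectors
```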
Simpoint (Calder ASPLOS 2002)
• Manhattan Distance is used to build a
similarity matrix
• N x N matrix, where N is the number of
sampling intervals in the program
• Element SM(x, y) is the Manhattan
Distance between the BBVs of the 100M-instruction
intervals at sample offsets x and y
• Plot the Similarity Matrix as an upper
triangle (see the sketch below)
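A sketch of building the upper-triangular similarity matrix from per-interval BBVs; the four small vectors below stand in for real 100M-instruction intervals.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# One normalized BBV per fixed-length sampling interval (made up).
bbvs = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [0.5, 0.25, 0.25]]
n = len(bbvs)

# Upper triangle only: SM[x][y] = MD(BBV_x, BBV_y) for y >= x.
sm = [[manhattan(bbvs[x], bbvs[y]) if y >= x else None for y in range(n)]
      for x in range(n)]

for row in sm:
    print(" ".join("    " if d is None else f"{d:.2f}" for d in row))
```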
Simpoint (Calder ASPLOS 2002)
• Basic Block Vectors can adequately
capture the necessary representative
characteristics of a program
• Distance metrics can help to identify the
most representative samples
• Cluster analysis (k-means) can improve
representativeness by selecting multiple
samples
Simpoint (Calder ISPASS 2004)
• Newer work on Simpoint considers
using register def-use behavior on an
interval basis
• Also, tracking loop behavior and
procedure call frequencies provides
similar accuracy to using basic block
vectors
Simpoint (Calder ASPLOS 2002)
• Algorithm overview
– Profile the program by dividing it into fixed-size
intervals (e.g., 1M, 10M, 100M insts)
– Collect frequency vectors (e.g., BBVs, def-use,
etc.) – compute normalized frequencies
– Run the k-means clustering algorithm to divide the set
of intervals into k partitions/sets, for values of k
from 1 to K
– Compute a “goodness-of-fit” of the data for each
value of k
– Select the clustering that keeps k small and still
provides a reasonable “goodness-of-fit” result
– The result is a selection of representative
simulation points that best “fit” the entire
application execution (a toy sketch of this
clustering step follows this slide)
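A toy sketch of the clustering and simulation-point selection steps, using a hand-rolled k-means for one fixed k. SimPoint itself searches over k and scores each clustering with a goodness-of-fit criterion; the BBVs below are fabricated, and the Manhattan-distance assignment is one possible choice.

```python
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over BBVs (Manhattan assignment, mean update)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: manhattan(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

# One normalized BBV per interval: a tiny program with two phases.
bbvs = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85], [0.9, 0.1]]
centroids, assign = kmeans(bbvs, k=2)

# The simulation point of each cluster is the interval nearest its centroid.
for c in range(2):
    members = [i for i, a in enumerate(assign) if a == c]
    if not members:
        continue
    best = min(members, key=lambda i: manhattan(bbvs[i], centroids[c]))
    print(f"cluster {c}: intervals {members}, simulation point = {best}")
```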
Simpoint Paper
Improvements to Simpoints
(KDD05, ISPASS06)
• Utilize a Mixture of Multinomials instead of K-means
• Assumes the data is generated by a mixture of K
component density functions
• We utilize Expectation-Maximization (EM) to find a
local maximum likelihood for the parameters of the
density functions – iterate on E and M steps until
convergence
• The number of clusters is selected using the Bayesian
Information Criterion (BIC) approach to judge
“goodness of fit” (a toy EM sketch follows the citation below)
“A multinomial clustering model for fast simulation of
computer architecture designs”, K. Sanghai et al.,
Proc. of KDD 2005, Chicago, IL., pp. 808-813.
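A toy sketch of the EM loop for a multinomial mixture over basic-block count vectors. This illustrates the idea only, not the implementation from the KDD paper: model selection with BIC is omitted, and the counts are fabricated.

```python
import math, random

def em_multinomial(counts, K, iters=50, seed=0):
    """EM for a mixture of K multinomials over per-interval BB count vectors."""
    rng = random.Random(seed)
    n = len(counts[0])
    pi = [1.0 / K] * K                                   # mixing weights
    theta = [[rng.random() + 0.1 for _ in range(n)] for _ in range(K)]
    theta = [[t / sum(row) for t in row] for row in theta]
    for _ in range(iters):
        # E-step: responsibility of each component for each interval (log space).
        resp = []
        for x in counts:
            logs = [math.log(pi[k]) +
                    sum(xj * math.log(theta[k][j]) for j, xj in enumerate(x))
                    for k in range(K)]
            m = max(logs)
            w = [math.exp(v - m) for v in logs]
            resp.append([wi / sum(w) for wi in w])
        # M-step: re-estimate mixing weights and multinomial parameters.
        for k in range(K):
            pi[k] = sum(r[k] for r in resp) / len(counts)
            tot = [sum(r[k] * x[j] for r, x in zip(resp, counts)) + 1e-9
                   for j in range(n)]
            theta[k] = [t / sum(tot) for t in tot]
    return pi, theta, resp

# Tiny example: four intervals over two basic blocks, two clear phases.
counts = [[90, 10], [85, 15], [10, 90], [15, 85]]
pi, theta, resp = em_multinomial(counts, K=2)
for i, r in enumerate(resp):
    print(f"interval {i} -> cluster {max(range(2), key=lambda k: r[k])}")
```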
Summary
• Before you begin studying a new architecture,
have a clear understanding of the target
workloads for this system
• Perform a significant amount of workload
characterization before you begin profiling
work
• Benchmarks are very useful tools, but must
be used properly to obtain meaningful results
• Value profiling is a rich area for future
research
• Simpoints can be used to reduce the runtime
of simulation and still maintain simulation
fidelity