Profiling and Instrumentation


Workload Characteristics
and Representative Workloads
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
[email protected]
Overview
• When we want to collect profiles to be used
in the design of a next-generation computing
system, we need to be very careful that we
capture a representative sample in our profile
• Workload characteristics allow us to better
understand the content of the samples which
we collect
• We need to select programs to study which
represent the class of applications we will
eventually run on the target system
Examples of Workload Characteristics
• Instruction mix
• Static vs. dynamic instruction count
• Working sets
• Control flow behavior
• Working set behavior
• Inter-reference gap model (temporal locality)
• Database size
• Address and value predictability
• Application/library/OS breakdown
• Heap vs. stack allocation
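As a concrete illustration of the first characteristic in this list, here is a minimal sketch of computing an instruction mix from a dynamic instruction trace. The trace format and opcode classes are hypothetical stand-ins for what a real profiling front end would produce.

# Minimal sketch: computing an instruction mix from a dynamic trace.
# The opcode classes and trace format here are hypothetical; a real
# instrumentation front end would supply the decoded instruction stream.
from collections import Counter

def instruction_mix(trace):
    """trace: iterable of opcode-class strings, one per dynamic instruction."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Example dynamic stream
trace = ["load", "alu", "alu", "store", "branch", "load", "alu", "branch"]
for cls, frac in sorted(instruction_mix(trace).items()):
    print(f"{cls:8s} {frac:6.1%}")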
Benchmarks
• Real or synthetic programs/applications used
to exercise a hardware system, and generally
representing a range of behaviors found in a
particular class of applications
• Benchmarks classes include:
– Toy benchmarks: Towers of Hanoi, qsort, Fibonacci
– Synthetic benchmarks: Dhrystone, Whetstone
– Embedded: EEMBC, UTDSP
– CPU benchmarks: SPECint/fp
– Internet benchmarks: SPECjAppServer, SPECjvm
– Commercial benchmarks: TPC, SAP, SPECjAppServer
– Supercomputing: Perfect Club, SPLASH, Livermore Loops
Is there a way to reduce the runtime of SPEC
while maintaining representativeness?
• MinneSPEC (U. of Minn.) – Using gprof
statistics about the runtime of SPEC and
various SimpleScalar simulation results (I-mix,
cache misses, etc.), we can capture
statistically similar, though significantly
shorter, runs of the programs
• Provides three input sets that will run in:
– A few minutes
– A few hours
– A few days
• Published as an IEEE Computer Architecture Letters paper
How many programs do we need?
• Does the set of applications capture
enough variation to be “representative”
of the entire workload class?
• Should we consider using multiple
benchmark suites to factor out
similarities in programming styles?
• Can we utilize workload characteristics
to identify the particular programs of
interest?
Example of capturing program behavior:
“Quantifying Behavioral Differences Between C and
C++ Programs” Calder, Grunwald and Zorn, 1995
• C++ is a programming language growing in
popularity
• We need to design tomorrow’s computer
architectures based on tomorrow’s software
paradigms
• How do workload characteristics change as
we move to new programming paradigms?
Example of capturing program behavior:
“Quantifying Behavioral Differences Between C and
C++ Programs” Calder, Grunwald and Zorn, 1995
• First problem – Find a set of representative
programs from both the procedural and OO domains
– Difficult for OO in 1995
• Differences between programming models
– OO relies heavily on messages and methods
– Data locality will change due to the bundling together of
data structures in Objects
– Size of functions will be reduced in OO
– Polymorphism allows for indirect function invocation and
runtime method selection
– OO programs will manage a larger number of dynamically
allocated objects
Address and Value Profiling
• Lipasti observed that profiled instructions
tend to repeat their behavior
• Many addresses are nearly constant
• Many values do not change between
instruction execution
• Can we use profiles to better understand
some of these behaviors, and then utilize this
knowledge to optimize execution?
Address Profiling
• If an address remains unchanged, can we
issue loads and stores early (similar to
prefetching)?
• Do we even have to issue the load or store if
we have not modified memory?
• What are the implications if indirect
addressing is used?
• Can we detect patterns (i.e., strides) in the
address values?
• Can we do anything smart when we detect
pointer chasing??
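To make the stride question concrete, here is a minimal sketch of per-instruction stride detection over a profiled (pc, address) stream. The repeat-confirmation scheme is an assumption for illustration, not any specific published predictor.

# Minimal sketch: per-instruction stride detection, in the spirit of a
# stride prefetcher's reference prediction table. The confirmation
# threshold is an illustrative choice, not a published design.
def detect_strides(refs):
    """refs: iterable of (pc, address) pairs from a load/store profile.
    Returns {pc: stride} for PCs whose stride repeated at least twice."""
    last_addr, last_stride, confirms = {}, {}, {}
    for pc, addr in refs:
        if pc in last_addr:
            stride = addr - last_addr[pc]
            if stride == last_stride.get(pc):
                confirms[pc] = confirms.get(pc, 0) + 1
            last_stride[pc] = stride
        last_addr[pc] = addr
    return {pc: last_stride[pc] for pc, c in confirms.items() if c >= 2}

# A load walking an array of 8-byte elements is detected with stride 8.
refs = [(0x400, 0x1000 + 8 * i) for i in range(6)]
print(detect_strides(refs))   # {1024: 8}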
Data Value Profiling
• When we see that particular data values do
not change, how can we take advantage of
this?
• Lipasti noticed that a large percentage of
store instructions overwrite memory with the
value already stored there
• Can we avoid computing new results if we
notice that our input operands have not
changed?
• What can we do if a particular operand
only takes on a small set of values?
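To make the “silent store” observation above concrete, here is a small sketch that counts stores overwriting memory with the value already present at the target address. The store-trace format is a hypothetical stand-in.

# Minimal sketch: counting "silent" stores, i.e., stores that write the
# value already present at the target address. The trace is hypothetical.
def silent_store_fraction(stores):
    """stores: iterable of (address, value) pairs in program order."""
    memory, silent, total = {}, 0, 0
    for addr, value in stores:
        total += 1
        if memory.get(addr) == value:
            silent += 1          # this store would not change memory
        memory[addr] = value
    return silent / total if total else 0.0

stores = [(0x10, 5), (0x10, 5), (0x20, 7), (0x10, 5), (0x20, 9)]
print(f"silent stores: {silent_store_fraction(stores):.0%}")  # 40%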
Parameter Value Profiling
• Profile the parameter values passed to
functions
• If these parameters are predictable, we
can exploit this fact during compilation
• We can study this on an individual
function basis or a call site basis
• Compiler optimizations such as code
specialization and function cloning can
be used
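A toy illustration of the specialization idea (all names are hypothetical): if profiling shows a parameter is almost always one value, a compiler can emit a specialized clone guarded by a cheap runtime test.

# Toy illustration of code specialization / function cloning (hypothetical
# names). Profiling suggests factor == 1 on most calls, so the specialized
# clone skips the multiply in that common case.
def scale(data, factor):
    return [x * factor for x in data]

def scale_specialized(data, factor):
    if factor == 1:               # the profiled-common value
        return list(data)         # specialized fast path: no multiply
    return scale(data, factor)    # general fallback

print(scale_specialized([1, 2, 3], 1))   # fast path
print(scale_specialized([1, 2, 3], 4))   # fallback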
Parameter Value Profiling
• We have profiled a set of MS Windows NT 4.0
desktop applications
– Word97
– FoxPro 6.0
– SQL Server 7.0
– VC++ 6.0
– Excel97
– PowerPoint97
– Access97
• We measured the value predictability of
parameter values for all non-pointer based
parameters
Parameter Value Profiling
• We look for the predictability of parameters using two metrics (see the sketch after this list):
– Invariance 1 – probability that the most frequent value is
passed
– Invariance 4 – probability that one of the 4 most frequent
values is passed
• Parameter values are more predictable on a call-site
basis than on a function basis (e.g., for Word97, 8%
of the functions pass highly predictable parms, whereas,
when computed on individual call sites, over 16%
of the call sites pass highly predictable parms)
• Highly predictable means that the same value is
observed on over 90% of the calls
• We will discuss how to clone and specialize
procedures when we discuss profile guided data
transformations
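Here is a small sketch of the Invariance-1 and Invariance-4 metrics above, computed from a hypothetical profile of values observed at one call site.

# Sketch of the Invariance-1 / Invariance-4 metrics, computed from a
# (hypothetical) profile of parameter values seen at one call site.
from collections import Counter

def invariance(values, k):
    """Fraction of calls passing one of the k most frequent values."""
    counts = Counter(values)
    top_k = counts.most_common(k)
    return sum(n for _, n in top_k) / len(values)

calls = [0, 0, 0, 0, 1, 0, 2, 0, 0, 0]   # profiled parameter values
print(f"Invariance-1: {invariance(calls, 1):.0%}")  # 80%
print(f"Invariance-4: {invariance(calls, 4):.0%}")  # 100%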
How can we reduce the runtime of a single
benchmark and still maintain accuracy?
• Simpoint – attempt to collect a set of trace
samples that best represents the whole
execution of the program
– Identifies phase behavior in programs
– Defines a metric that captures the difference
between two samples
– Computes this difference between every pair of
intervals
– Selects the interval that is closest to all other
intervals
Simpoint (Calder ASPLOS 2002)
• Utilize basic block frequencies to build basic
block vectors (bbf_0, bbf_1, …, bbf_{n-1})
• Each frequency is weighted by the length
(instruction count) of its basic block
• The entire vector is normalized by dividing by the total
number of basic blocks executed
• Take fixed-length samples (100M instructions)
• Compare BBVs using:
– Euclidean Distance:
• ED(a, b) = sqrt( Σ_{i=1..n} (a_i − b_i)² )
– Manhattan Distance:
• MD(a, b) = Σ_{i=1..n} |a_i − b_i|
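The following sketch builds weighted, normalized BBVs for two hypothetical intervals and compares them with the two distances above.

# Sketch: weighted, normalized basic block vectors and the two distance
# metrics above. The per-interval execution counts are hypothetical.
import math

def bbv(block_counts, block_lengths):
    """One interval's BBV: counts weighted by block length, then normalized."""
    weighted = [c * l for c, l in zip(block_counts, block_lengths)]
    total = sum(weighted)
    return [w / total for w in weighted]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

lengths = [4, 10, 2, 7]                  # instructions per basic block
v1 = bbv([100, 20, 300, 5], lengths)     # interval 1 execution counts
v2 = bbv([90, 25, 310, 0], lengths)      # interval 2 execution counts
print(euclidean(v1, v2), manhattan(v1, v2))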
Simpoint (Calder ASPLOS 2002)
• Manhattan Distance is used to build a
similarity matrix
• N x N matrix, where N is the number of
sampling intervals in the program
• Element SM(x, y) is the Manhattan
Distance between the BBVs of the 100M-instruction
intervals at sample offsets x and y
• Plot the Similarity Matrix as an upper
triangle
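A minimal sketch of the similarity-matrix construction, filling only the upper triangle; the example vectors are hypothetical two-element BBVs.

# Sketch: the upper-triangle similarity matrix built from pairwise
# Manhattan distances, as described above. `vectors` holds per-interval
# BBVs (tiny two-element vectors here for brevity).
def similarity_matrix(vectors):
    n = len(vectors)
    sm = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):            # upper triangle only
            sm[x][y] = sum(abs(a - b) for a, b in zip(vectors[x], vectors[y]))
    return sm

vectors = [[0.5, 0.5], [0.4, 0.6], [0.9, 0.1]]
for row in similarity_matrix(vectors):
    print(["%.2f" % v for v in row])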
Simpoint (Calder ASPLOS 2002)
• Basic Block Vectors can adequately
capture the necessary representative
characteristics of a program
• Distance metrics can help to identify the
most representative samples
• Cluster analysis (k-means) can improve
representativeness by selecting multiple
samples
Simpoint (Calder ISPASS 2004)
• Newer work on Simpoint considers
using register def-use behavior on an
interval basis
• Also, tracking of loop behavior and
procedure call frequencies provides
similar accuracy as using basic block
vectors
Simpoint (Calder ASPLOS 2002)
• Algorithm overview
– Profile the program by dividing it into fixed-size
intervals (e.g., 1M, 10M, 100M insts)
– Collect frequency vectors (e.g., BBVs, def-use,
etc.) – compute normalized frequencies
– Run k-means clustering algorithm to divide the set
of intervals into k partitions/sets, for values of k
from 1 to K
– Compute a “goodness-of-fit” of the data for each
value of k
– Select the clustering that keeps k small and
provides a reasonable “goodness-of-fit” result
– The result is a selection of representative
simulation points that best “fit” the entire
application execution
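A sketch of this selection loop using scikit-learn's k-means. The BIC-style score below is a simplified spherical-Gaussian form, not SimPoint's exact formulation, and bbvs is a hypothetical (intervals × blocks) array of normalized frequency vectors.

# Sketch of the SimPoint-style selection loop with scikit-learn's k-means.
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, km):
    """Simplified goodness-of-fit: a Gaussian log-likelihood proxy minus
    a penalty that grows with the number of clusters."""
    n, d = X.shape
    k = km.n_clusters
    variance = km.inertia_ / max(n - k, 1) + 1e-12
    loglik = -0.5 * n * np.log(2 * np.pi * variance) - km.inertia_ / (2 * variance)
    return loglik - 0.5 * np.log(n) * k * (d + 1)

def pick_simpoints(bbvs, max_k=10, threshold=0.9):
    ks = range(1, min(max_k, len(bbvs)) + 1)
    fits = [(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(bbvs))
            for k in ks]
    scores = np.array([bic_score(bbvs, km) for _, km in fits])
    # Normalize scores to [0, 1] and take the smallest k whose fit is
    # within `threshold` of the best, as in the algorithm above.
    norm = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    k, km = next(fit for fit, s in zip(fits, norm) if s >= threshold)
    # One representative interval per cluster: closest to its centroid.
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(bbvs[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dist)]))
    return sorted(reps)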
Improvements to Simpoints
(KDD05, ISPASS06)
• Utilize a Mixture of Multinomials instead of K-means
• Assumes data is generated by a mixture of K component density functions
• We utilize Expectation-Maximization (EM) to find a
local maximum likelihood for the parameters of the
density function – iterate on E and M steps until
convergence
• The number of clusters is selected using the Bayesian
Information Criteria (BIC) approach to judge
“goodness of fit”
K. Sanghai et al., “A multinomial clustering model for fast simulation of
computer architecture designs,” Proc. of KDD 2005, Chicago, IL,
pp. 808–813.
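A compact EM sketch for a mixture of multinomials over per-interval count data, in the spirit of the Sanghai et al. model; smoothing and the BIC-based choice of K are omitted, and the counts matrix is hypothetical.

# EM for a mixture of multinomials over BBV count data (simplified
# sketch; no smoothing schedule or BIC model selection shown here).
import numpy as np

def multinomial_mixture_em(counts, k, iters=100, seed=0):
    """counts: (n_intervals, n_blocks) nonnegative matrix. Returns the
    soft cluster assignment (responsibility) of each interval."""
    rng = np.random.default_rng(seed)
    n, d = counts.shape
    pi = np.full(k, 1.0 / k)                      # mixing weights
    theta = rng.dirichlet(np.ones(d), size=k)     # per-component block probs
    for _ in range(iters):
        # E step: log-responsibilities via the multinomial log-likelihood.
        logp = counts @ np.log(theta.T + 1e-12) + np.log(pi + 1e-12)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate weights and component distributions.
        pi = resp.mean(axis=0)
        weighted = resp.T @ counts + 1e-12
        theta = weighted / weighted.sum(axis=1, keepdims=True)
    return resp

# Example: soft-cluster 3 hypothetical intervals over 4 basic blocks.
counts = np.array([[90, 5, 5, 0], [85, 10, 5, 0], [2, 3, 45, 50]])
print(multinomial_mixture_em(counts, k=2).round(2))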
Summary
• Before you begin studying a new architecture,
have a clear understanding of the target
workloads for this system
• Perform a significant amount of workload
characterization before you begin profiling
work
• Benchmarks are very useful tools, but must
be used properly to obtain meaningful results
• Value profiling is a rich area for future
research
• Simpoints can be used to reduce the runtime
of simulation and still maintain simulation
fidelity