The TAU Performance System


Performance Technology for Productive,
High-End Parallel Computing
Allen D. Malony
[email protected]
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
Outline of Talk
- Research motivation
- Scalability, productivity, and performance technology
- Application-specific and autonomic performance tools
- TAU parallel performance system developments
- Application performance case studies
- New project directions
  - Performance data mining and knowledge discovery
- Concluding discussion
LLNL, Oct. 2004
Performance Technology for Productive, High-End Parallel Computing
Research Motivation
- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns

[Diagram: performance problem-solving cycle (Performance Observation → characterization → Performance Experimentation → properties → Performance Diagnosis → hypotheses → Performance Tuning), supported by performance technology for instrumentation, measurement, analysis, and visualization, and for experiment management and a performance database]
Large Scale Performance Problem Solving
- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
  - Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
- Nature of application development may change
- Paradigm shift in performance process and technology?
- What will enhance productive application development?
Scaling and Performance Observation
- Consider "traditional" measurement methods
  - Profiling: summary statistics calculated during execution
  - Tracing: time-stamped sequence of execution events
- More parallelism → more performance data overall
  - Performance specific to each thread of execution
  - Possible increase in the number of interactions between threads
  - Harder to manage the data (memory, transfer, storage)
  - How does per-thread profile size grow?
  - Instrumentation more difficult with greater parallelism?
- More parallelism / performance data → harder analysis
  - More time consuming to analyze and difficult to visualize
Concern for Performance Measurement Intrusion
- Performance measurement can affect the execution
  - Perturbation of "actual" performance behavior
  - Minor intrusion can lead to major execution effects
- Problems exist even with a small degree of parallelism
  - Intrusion is an accepted consequence of standard practice
  - Consider the intrusion (perturbation) of trace buffer overflow
- Scale exacerbates the problem … or does it?
  - Traditional measurement techniques tend to be localized
  - Suggests scale may not compound local intrusion globally
  - Measuring parallel interactions likely will be affected
- Use accepted measurement techniques intelligently
Role of Intelligence and Specificity
- How to make the process more effective (productive)?
- Scale forces performance observation to be intelligent
  - Standard approaches deliver a lot of data with little value
- What are the important performance events and data?
  - Tied to application structure and computational model
  - Tools have poor support for application-specific aspects
- Process and tools can be more application-aware
- Will allow scalability issues to be addressed in context
  - More control and precision of performance observation
  - More guided performance experimentation / exploration
  - Better integration with application development
Role of Automation and Knowledge Discovery
- Even with intelligent and application-specific tools, the decisions of what to analyze may become intractable
- Scale forces the process to become more automated
- Performance extrapolation must be part of the process
- Build autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
  - Address scale issues through increased expertise
TAU Parallel Performance System Goals
- Multi-level performance instrumentation
  - Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
- Support for performance mapping
- Support for object-oriented and generic programming
- Integration in complex software, systems, applications
TAU Parallel Performance System Architecture
Advances in TAU Instrumentation
- Source instrumentation
  - Program Database Toolkit (PDT)
    - Automated Fortran 90/95 support (Flint parser, very robust)
    - Statement-level support in C/C++ (Fortran soon)
  - TAU_COMPILER to automate instrumentation process
  - Automatic proxy generation for component applications
    - Automatic CCA component instrumentation
- Python instrumentation and automatic instrumentation
- Continued integration with dynamic instrumentation
- Update of OpenMP instrumentation (POMP2)
- Selective instrumentation and overhead reduction
- Improvements in performance mapping instrumentation
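The automatic Python instrumentation mentioned above can be illustrated generically. This is a hedged sketch built on Python's standard profile hook, not TAU's actual mechanism; the names `timings`, `_hook`, and `work` are invented for illustration.

```python
import sys
import time

# Sketch of automatic instrumentation: intercept every function entry/exit
# with a profile hook and accumulate per-function inclusive time.
timings = {}
_stack = []

def _hook(frame, event, arg):
    now = time.perf_counter()
    if event == "call":
        _stack.append((frame.f_code.co_name, now))
    elif event == "return" and _stack:
        name, start = _stack.pop()
        timings[name] = timings.get(name, 0.0) + (now - start)

def work():
    return sum(i * i for i in range(1000))

sys.setprofile(_hook)   # instrumentation is "automatic" from here on
work()
sys.setprofile(None)
```

No source modification is needed, which is the attraction of this style of instrumentation for interpreted languages.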
Advances in TAU Measurement
- Profiling
  - Memory profiling
    - Global heap memory tracking (several options)
  - Callpath profiling
    - User-controllable calling depth
  - Improved support for multiple counter profiling
  - Online profile access and sampling
- Tracing
  - Generation of VTF3 trace files (fully portable)
  - Inclusion of hardware performance counts in trace files
  - Hierarchical trace merging
- Online performance overhead compensation
- Component software proxy generation and monitoring
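The callpath-profiling idea above (with a user-controllable calling depth) can be sketched in a few lines. This is an illustrative toy, not TAU's implementation; `DEPTH`, `profiled`, `kernel`, and `solver` are invented names.

```python
import time

# Callpath profiling sketch: timings are keyed by the call stack truncated
# to DEPTH frames, so the same function called from different parents is
# kept distinct (e.g. ("solver", "kernel") vs. ("main", "kernel")).
DEPTH = 2
_stack = []
profile = {}  # callpath tuple -> accumulated seconds

def profiled(fn):
    def wrapper(*args, **kwargs):
        _stack.append(fn.__name__)
        key = tuple(_stack[-DEPTH:])
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            profile[key] = profile.get(key, 0.0) + time.perf_counter() - start
            _stack.pop()
    return wrapper

@profiled
def kernel():
    return sum(range(100))

@profiled
def solver():
    return kernel()

solver()
```

Raising `DEPTH` gives more context at the cost of more profile entries, which is exactly the tradeoff a user-controllable depth exposes.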
Advances in TAU Performance Analysis
- Enhanced parallel profile analysis (ParaProf)
  - Callpath analysis integration in ParaProf
  - Embedded Lisp interpreter
- Performance Data Management Framework (PerfDMF)
  - First release of prototype
  - In use by several groups
    - S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), …
- Integration with Vampir Next Generation (VNG)
  - Online trace analysis
- Performance visualization (ParaVis) prototype
- Component performance modeling and QoS
TAU Performance System Status
- Computing platforms (selected)
  - IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
- Programming languages
  - C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
- Thread libraries
  - pthreads, SGI sproc, Java, Windows, OpenMP
- Compilers (selected)
  - Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft
Component-Based Scientific Applications
- How to support the performance analysis and tuning process consistent with the application development methodology?
- Common Component Architecture (CCA) applications
- Performance tools should integrate with the software
- Design a performance observation component
  - Measurement port and measurement interfaces
- Build support for application component instrumentation
  - Interpose a proxy component for each port
  - Inside the proxy, track caller/callee invocations, timings
- Automate the process of proxy component creation
  - Using PDT for static analysis of components
  - Include support for selective instrumentation
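The interposed-proxy idea above can be sketched generically. This is a minimal illustration, not the output of the CCA proxy generator; `Proxy` and `Integrator` are invented names, and real proxies are generated from PDT static analysis rather than written by hand.

```python
import time

# Proxy sketch: wrap every method of a component "port" so each call is
# counted and timed, leaving the wrapped component itself unchanged.
class Proxy:
    def __init__(self, component):
        self._component = component
        self.calls = {}    # method name -> invocation count
        self.seconds = {}  # method name -> accumulated time

    def __getattr__(self, name):
        target = getattr(self._component, name)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                self.calls[name] = self.calls.get(name, 0) + 1
                self.seconds[name] = self.seconds.get(name, 0.0) + elapsed
        return timed

class Integrator:             # stand-in for a component implementation
    def step(self, dt):
        return dt * 2.0

port = Proxy(Integrator())    # callers use the proxy in place of the port
for _ in range(3):
    port.step(0.1)
```

Because the caller only sees the port interface, the proxy is non-intrusive: the component neither knows about nor depends on the measurement.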
Flame Reaction-Diffusion (Sandia, J. Ray)
CCAFFEINE
Component Modeling and Optimization
- Given a set of components, where each component has multiple implementations, what is the optimal subset of implementations that solves a given problem?
  - How to model a single component?
  - How to model a composition of components?
  - How to select the optimal subset of implementations?
- A component only has performance meaning in context
- Applications are dynamically composed at runtime
- Application developers use components from others
- Instrumentation may only be at component interfaces
- Performance measurements need to be non-intrusive
- Users are interested in coarse-grained performance
MasterMind Component (Trebon, IPDPS 2004)
Proxy Generator for other Applications
- TAU (PDT) proxy component for:
  - QoS tracking [Boyana, ANL]
  - Debugging Port Monitor for CCA (tracks arguments)
  - SCIRun2 Perfume components [Venkat, U. Utah]
- Exploring Babel for auto-generation of proxies:
  - Direct SIDL-to-proxy code generation
  - Generating client component interface in C++
  - Using PDT for generating proxies
Earth Systems Modeling Framework
- Coupled modeling with modular software framework
- Instrumentation for ESMF framework and applications
  - PDT automatic instrumentation
    - Fortran 95 code modules
    - C / C++ code modules
  - MPI wrapper library for MPI calls
  - ESMF component instrumentation (using CCA)
    - CCA measurement port manual instrumentation
    - Proxy generation using PDT and runtime interposition
- Significant callpath profiling used by ESMF team
Using TAU Component in ESMF/CCA
TAU’s Paraprof Profile Browser (ESMF Data)
Callpath profile
CUBE Browser (UTK, FZJ) (ESMF Data)

[Screenshot: metric, calltree, and location panes; TAU profile data converted to CUBE form]
TAU Traces with Counters (ESMF)
Visualizing TAU Traces with Counters/Samples
Uintah Computational Framework (UCF)
- University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center
- UCF analysis
  - Scheduling
  - MPI library
  - Components
  - Performance mapping
- Use for online and offline visualization
  - ParaVis tools

[Visualization: 500 processes]
Scatterplot Displays (UCF, 500 processes)
- Each point's coordinate determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
- Min/Max value range
- Effective for cluster analysis

[Plot: relation between MPI_Recv and MPI_Waitsome]
Online Uintah Performance Profiling
- Demonstration of online profiling capability
  - Multiple profile samples
  - Each profile taken at a major iteration (~60 seconds)
- Colliding elastic disks
  - Test material point method (MPM) code
  - Executed on 512 processors of ASCI Blue Pacific at LLNL
- Example
  - 3D bargraph visualization
  - MPI execution time
  - Performance mapping
  - Multiple time steps
Online Uintah Performance Profiling

[Animation: movie of the online profile visualization]
Miranda Performance Analysis (Miller, LLNL)
- Miranda is a research hydrodynamics code
  - Fortran 95, MPI
  - Mostly synchronous
    - MPI_ALLTOALL on Np x,y communicators
    - Some MPI reductions and broadcasts for statistics
- Good communications scaling
  - ACL and MCR Linux clusters
  - Up to 1728 CPUs
  - Fixed workload per CPU
- Ported to BlueGene/L
  - Breaking news! (see next slide)
Profiling of Miranda on BG/L (Miller, LLNL)
- Profile code performance (automatic instrumentation)
- Scaling studies (problem size, number of processors)

[Profiles shown for 128, 512, and 1024 nodes]
- Run on 8K and 16K processors this week!
Fine Grained Profiling via Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
  - Combines MPI calls with HW counter information
  - Detailed code behavior to focus optimization efforts
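The tracing approach can be sketched abstractly. This is a hedged illustration of the idea (time-stamped event records), not TAU's VTF3 format; `enter`, `leave`, and the region name are invented, and a real trace would also carry hardware counter values per event.

```python
import time

# Tracing sketch: record a time-stamped event at each region entry and
# exit, so a tool like Vampir can reconstruct detailed code behavior.
trace = []

def enter(region):
    trace.append(("enter", region, time.perf_counter()))

def leave(region):
    trace.append(("leave", region, time.perf_counter()))

enter("compute")
total = sum(range(1000))
leave("compute")
```

Unlike a profile, the trace preserves ordering and timing of individual events, which is what makes fine-grained analysis (at the cost of data volume) possible.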
Memory Usage Analysis
- BG/L will have limited memory per node (512 MB)
- Miranda uses TAU to profile memory usage
  - Streamlines code
  - Squeezes larger problems onto the machine
- TAU's footprint is small
  - Approximately 100 bytes per event per thread

[Figure: max heap memory (KB) used for a 128³ problem on 16 processors of ASC Frost at LLNL]
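The heap high-water tracking described above can be illustrated with Python's standard allocation tracer. This is only a sketch in the spirit of the slide; `measure_peak_heap` is an invented name, and TAU's C-level heap tracking works quite differently.

```python
import tracemalloc

# Sketch of heap high-water tracking: measure the peak number of bytes
# allocated while a region of interest runs.
def measure_peak_heap(fn):
    tracemalloc.start()
    fn()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak  # bytes allocated at the high-water mark

# 100 live 1 KB buffers -> peak of at least ~100 KB while the list exists.
peak = measure_peak_heap(lambda: [bytearray(1024) for _ in range(100)])
```

On a memory-constrained node, this kind of measurement is what tells you whether a larger problem will fit.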
Kull Performance Optimization (Miller, LLNL)
- Kull is a Lagrange hydrodynamics code
  - Physics packages written in C++ and Fortran
  - Parallel Python interpreter run-time environment!
- Scalar test problem analysis
  - Serial execution to identify performance factors
  - Original code profile indicated expensive functions
    - CCSubzonalEffects member functions
- Examination revealed optimization opportunities
  - Loop merging
  - Amortizing geometric lookup over more calculations
  - Apply to CSSubzonalEffects member functions
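The two optimizations named above can be shown on a toy example (invented code, not Kull's): merging loops that traverse the same data, and amortizing an expensive lookup over more calculations per iteration.

```python
def expensive_lookup(i):
    # Stand-in for a geometric lookup that is costly per call.
    return (i * 7) % 5

def separate_loops(data):
    # Before: two passes, each repeating the lookup per element.
    a = [x + expensive_lookup(i) for i, x in enumerate(data)]
    b = [x * expensive_lookup(i) for i, x in enumerate(data)]
    return a, b

def merged_loop(data):
    # After: one pass, one lookup per element, amortized over both results.
    a, b = [], []
    for i, x in enumerate(data):
        g = expensive_lookup(i)
        a.append(x + g)
        b.append(x * g)
    return a, b

data = [1.0, 2.0, 3.0]
assert separate_loops(data) == merged_loop(data)  # same results, fewer lookups
```

The merged version halves the lookup count and improves locality, which is the shape of the gain reported on the next slide.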
Kull Optimization
- CSSubzonalEffects member functions total time
  - Reduced from 5.80 seconds to 0.82 seconds
  - Overall run time reduced from 28.1 to 22.85 seconds

[Profiles: original exclusive profile vs. optimized exclusive profile]
Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
Multi-Level Performance Data Mining
- New (just forming) research project
  - PSU: Karen L. Karavanic
  - Cornell: Sally A. McKee
  - UO: Allen D. Malony and Sameer Shende
  - LLNL: John M. May and Bronis R. de Supinski
- Develop performance data mining technology
  - Scientific applications, benchmarks, other measurements
  - Systematic analysis for understanding and prediction
  - Better foundation for evaluation of leadership-class computer systems
- "Scalable, Interoperable Tools to Support Autonomic Optimization of High-End Applications," S. McKee, G. Tyson, A. Malony, begins Nov. 1, 2004.
General Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
  - Use to predict application performance
  - Data mining applied to optimize the data-gathering process
- High-level performance data spanning dimensions
  - Machine, applications, code revisions
  - Examine broad performance trends
- Needed technology
  - Performance instrumentation and measurement
  - Performance data management
  - Performance analysis and results presentation
  - Automated performance experimentation and exploration
Specific Goals
- Design, develop, and populate a performance database
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
- Performance data mining infrastructure is important for all of these goals
- Establish a more rational basis for evaluating the performance of leadership-class computers
PerfTrack: Performance DB and Analysis Tool
PSU: Kathryn Mohror, Karen Karavanic
UO: Kevin Huck
LLNL: John May, Brian Miller (CASC)
TAU Performance Data Management Framework
TAU Performance Regression (PerfRegress)
- Prototype developed by Alan Morris for Uintah
- Re-implement using PerfDMF
Background – Ahn & Vetter, 2002
- "Scalable Analysis Techniques for Microprocessor Performance Counter Metrics," SC2002
- Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
- Cluster analysis and F-ratio
  - Agglomerative hierarchical method: dendrogram identified groupings of master and slave threads in sPPM
  - K-means clustering and F-ratio: differences between master and slave related to communication and management
- Factor analysis
  - Shows highly correlated metrics fall into peer groups
- Combined techniques (applied recursively) lead to observations of application behavior hard to identify otherwise
Similarity Analysis
- Can we recreate Ahn and Vetter's results?
- Apply techniques from phase analysis (Sherwood)
  - Threads of execution can be compared for similarity
  - Threads with abnormal behavior show up as less similar
- Each thread is represented as a vector V of dimension n
  - n is the number of functions in the application
  - V = [f1, f2, …, fn]
  - Each value is the percentage of time spent in that function, normalized from 0.0 to 1.0 (represents the event mix)
- Distance calculated between vectors U and V:

  ManhattanDistance(U, V) = Σ_{i=1}^{n} |u_i - v_i|
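The similarity measure above can be sketched directly. The per-function times here are made up for illustration; only the normalization and Manhattan distance follow the slide.

```python
# Each "thread" is a vector of per-function time fractions (normalized to
# sum to 1); threads are compared by Manhattan distance.
def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical per-function times for three threads (n = 3 functions).
master = normalize([5.0, 1.0, 1.0])
worker1 = normalize([1.0, 5.0, 1.0])
worker2 = normalize([1.0, 5.2, 0.9])

# Workers have a similar event mix; the master stands apart.
assert manhattan(worker1, worker2) < manhattan(master, worker1)
```

Threads with an unusual event mix (like the master here) show up as large distances from the rest, which is what the following slides exploit.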
sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47
sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower
Same result as Ahn/Vetter
sPPM on Frost (PAPI_FP_INS, 256 threads)
- View of fewer than half of the threads of execution is possible on the screen at one time
- Three groups are obvious:
  - Lower ranking threads
  - One unique thread
  - Higher ranking threads (3% more FP)
- Finding subtle differences is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)
- Dendrogram shows 5 natural clusters:
  - Unique thread
  - High ranking master threads
  - Low ranking master threads
  - High ranking worker threads
  - Low ranking worker threads
• TAU profiles
• PerfDMF
• R direct access to DM
• R routine threads
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
masters
slaves
sPPM on Frost (PAPI_FP_INS, 256 threads)
- After K-means clustering into 5 clusters
  - Similar clusters are formed (seeded with group means)
  - Each cluster's performance characteristics analyzed
  - Dimensionality reduction (256 threads to 5 clusters!)

[Chart: per-cluster behavior for SPPM, INTERF, DIFUZE, DINTERF, and Barrier [OpenMP:runhyd3.F <604,0>]; cluster sizes 1, 6, 10, 119, and 120]
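The clustering step above can be sketched with a minimal K-means over thread vectors. The data is invented (not the sPPM profiles), and this toy uses Manhattan distance and mean-seeded centroids to mirror the slide's "seed with group means" note.

```python
# Minimal K-means sketch: many per-thread vectors collapse to a few
# cluster centroids, giving the dimensionality reduction described above.
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda i: manhattan(p, centroids[i]))
            clusters[best].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups of "threads"; seeds taken from the group means.
masters = [[0.7, 0.2, 0.1], [0.72, 0.18, 0.1]]
workers = [[0.1, 0.6, 0.3], [0.12, 0.58, 0.3], [0.1, 0.62, 0.28]]
seeds = [mean(masters), mean(workers)]
centroids, clusters = kmeans(masters + workers, seeds)
```

Seeding with the group means makes the toy deterministic; with 256 real thread vectors the same reduction yields the five clusters shown.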
PerfExplorer Design (K. Huck, UO)
- Performance knowledge discovery framework
  - Use the existing TAU infrastructure
    - TAU instrumentation data, PerfDMF
  - Client-server based system architecture
  - Data mining analysis applied to parallel performance data
- Technology integration
  - Relational Database Management Systems (RDBMS)
  - Java API and toolkit
  - R-project / Omegahat statistical analysis
  - Web-based client
    - Jakarta web server and Struts (for a thin web client)
PerfExplorer Architecture

[Architecture diagram, annotated:]
- Server accepts multiple client requests and returns results
- Server supports R data mining operations built using RSJava
- PerfDMF Java API used to access the DBMS via JDBC
- Client is a traditional Java application with a GUI (Swing)
- Analyses can be scripted, parameterized, and monitored
- Browsing of analysis results via automatic web page creation and thumbnails
ZeptoOS: Extreme Performance Scalable OS's
- DOE, Office of Science
  - OS / RTS for Extreme Scale Scientific Computation
  - Argonne National Lab and University of Oregon
- Investigate operating system and run-time (OS/R) functionality required for scalable components used in petascale architectures
  - Flexible OS/R functionality
  - Scalable OS/R system calls
  - Performance tools, monitoring, and metrics
  - Fault tolerance and resiliency
- Approach
  - Specify OS/R requirements across scalable components
  - Explore flexible functionality (Linux)
  - Hierarchical designs optimized with collective OS/R interfaces
  - Integrated (horizontal, vertical) performance measurement / analysis
  - Fault scenarios and injection to observe behavior
ZeptoOS Plans
- Explore Linux functionality for BG/L
- Explore efficiency for ultra-small kernels
  - Scheduler, memory, IO
  - Support for dynamic library loading, …
- Construct kernel-level collective operations
- Build Faulty Towers Linux kernel and system for replaying fault scenarios
- Extend TAU
  - Profiling OS suites
  - Benchmarking collective OS calls
  - Observing effects of faults
Concluding Discussion
- As high-end systems scale, it will be increasingly important that performance tools be used effectively
- Performance observation methods do not necessarily need to change in a fundamental sense
  - They just need to be controlled and used efficiently
  - Evolve toward application-specific performance technology
- More intelligent performance systems for productive use
  - Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Deliver next-generation tools to the community
Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- NSF
  - High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute
  - Dr. Bernd Mohr
- Los Alamos National Laboratory