The TAU Performance System
Performance Technology for Productive,
High-End Parallel Computing
Allen D. Malony
[email protected]
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
Outline of Talk
Performance problem solving
TAU parallel performance system and advances
Performance data management and data mining
Performance Data Management Framework (PerfDMF)
PerfExplorer
Multi-experiment case studies
Scalability, productivity, and performance technology
Application-specific and autonomic performance tools
Comparative analysis (PERC tool study)
Clustering analysis
Future work and concluding remarks
Research Motivation
Tools for performance problem solving
Empirical-based performance optimization process
Performance technology concerns
[Diagram: empirical performance optimization cycle]
Performance Observation → Performance Experimentation (characterization) → Performance Diagnosis (properties) → Performance Tuning (hypotheses)
Supported by performance technology: instrumentation, measurement, analysis, visualization, experiment management, performance storage
Challenges in Performance Problem Solving
How to make the process more effective (productive)?
Process may depend on scale of parallel system
What are the important events and performance metrics?
Process and tools can/must be more application-aware
Tied to application structure and computational model
Tied to application domain and algorithms
Tools have poor support for application-specific aspects
What are the significant issues that will affect the technology used to support the process?
Enhance application development and benchmarking
New paradigm in performance process and technology
Large Scale Performance Problem Solving
How does our view of this process change when we consider very large-scale parallel systems?
What are the significant issues that will affect the technology used to support the process?
Parallel performance observation is clearly needed
In general, there is the concern for intrusion
Scaling complicates observation and analysis
Seen as a tradeoff with performance diagnosis accuracy
Performance data size becomes a concern
Analysis complexity increases
Nature of application development may change
Role of Intelligence, Automation, and Knowledge
Scale forces the process to become more intelligent
Even with intelligent and application-specific tools, deciding what to analyze is difficult and can be intractable
More automation and knowledge-based decision making
Build autonomic capabilities into the tools
Support broader experimentation methods and refinement
Access and correlate data from several sources
Automate performance data analysis / mining / learning
Include predictive features and experiment refinement
Knowledge-driven adaptation and optimization guidance
Address scale issues through increased expertise
TAU Performance System
Tuning and Analysis Utilities (13+ year project effort)
Performance system framework for HPC systems
Targets a general complex system computation model
Entities: nodes / contexts / threads
Multi-level: system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance problem solving
Integrated, scalable, flexible, and parallel
Instrumentation, measurement, analysis, and visualization
Portable performance profiling and tracing facility
Performance data management and data mining
University of Oregon, Research Centre Jülich, LANL
TAU Parallel Performance System Goals
Multi-level performance instrumentation
  Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
  Computer system architectures and operating systems
  Different programming languages and compilers
Support for multiple parallel programming paradigms
  Multi-threading, message passing, mixed-mode, hybrid
Support for performance mapping
Support for object-oriented and generic programming
Integration in complex software, systems, applications
TAU Performance System Architecture
Advances in TAU Instrumentation
Source instrumentation
  Program Database Toolkit (PDT)
    automated Fortran 90/95 support (Flint parser, very robust)
    statement-level support in C/C++ (Fortran soon)
  TAU_COMPILER to automate instrumentation process
Automatic proxy generation for component applications
  automatic CCA component instrumentation
Python instrumentation and automatic instrumentation
Continued integration with dynamic instrumentation
Update of OpenMP instrumentation (POMP2)
Selective instrumentation and overhead reduction
Improvements in performance mapping instrumentation
Advances in TAU Measurement
Profiling
  Memory profiling
    global heap memory tracking (several options)
  Callpath profiling
    user-controllable calling depth
  Phase-based profiling
  Online profile access
Tracing
  Generation of VTF3 trace files (fully portable)
  Inclusion of hardware performance counts in trace files
  Hierarchical trace merging
Online performance overhead compensation
Component software proxy generation and monitoring
Profile Measurement – Three Flavors
Flat profiles
  Time (or counts) spent in each routine (nodes in callgraph)
  Exclusive/inclusive time, # of calls, child calls
Callpath profiles
  Time spent along a calling path (edges in callgraph)
  “main => f1 => f2 => MPI_Send”
  Depth controlled by the TAU_CALLPATH_DEPTH environment variable
Phase-based profiles
  Flat profiles under a phase (nested phases are allowed)
  Default “main” phase
  Supports static or dynamic (per-iteration) phases
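The effect of a callpath depth limit can be sketched outside TAU in a short Python fragment (function names and timings here are invented for illustration, not TAU's API): paths deeper than the limit are truncated and their times re-aggregated.

```python
from collections import defaultdict

def truncate_callpaths(callpath_times, depth):
    """Aggregate 'a => b => c' path timings after truncating each path to a depth."""
    agg = defaultdict(float)
    for path, t in callpath_times.items():
        frames = [f.strip() for f in path.split("=>")]
        agg[" => ".join(frames[:depth])] += t
    return dict(agg)

paths = {
    "main => f1 => f2 => MPI_Send": 4.0,
    "main => f1 => f3 => MPI_Send": 2.0,
}
print(truncate_callpaths(paths, 2))  # {'main => f1': 6.0}
```

With depth 2, both four-level paths collapse into the single prefix `main => f1`, which is the trade-off the environment variable controls: less detail, smaller profiles.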
Advances in TAU Performance Analysis
Enhanced parallel profile analysis (ParaProf)
Callpath analysis integration in ParaProf
Event callgraph view
Performance Data Management Framework (PerfDMF)
First release of prototype
In use by several groups
  S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), …
Integration with Vampir Next Generation (VNG)
  Online trace analysis
Performance visualization (ParaVis) prototype
Component performance modeling and QoS
Flat Profile – Pprof (NPB LU)
[Screenshot: pprof flat profile on an Intel Linux cluster (F90 + MPICH); profile per node / context / thread; events cover code and MPI routines]
Flat Profile – ParaProf (Miranda)
Callpath Profile (Flash)
Callpath Profile
21-level callpath
Phase Profile – Dynamic Phases
In the 51st iteration, time spent in MPI_Waitall was 85.81 secs
Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
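The two figures above imply an average of roughly 45 seconds per iteration, so iteration 51 is nearly twice the mean; a quick check using the values from the slide:

```python
total_waitall = 4137.9   # secs in MPI_Waitall across all iterations
iterations = 92
iter51 = 85.81           # secs in MPI_Waitall in the 51st iteration

avg = total_waitall / iterations
print(f"average per iteration: {avg:.2f} s")         # ~44.98 s
print(f"iteration 51 vs. mean: {iter51 / avg:.2f}x")  # ~1.91x
```

This kind of per-phase comparison against the mean is exactly what dynamic phase profiling makes possible.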
ParaProf – Manager
[Screenshot: ParaProf manager window showing the performance database and derived performance metrics]
ParaProf – Histogram View (Miranda)
8k processors
16k processors
ParaProf – Stacked View (Miranda)
ParaProf – Full Callgraph View (MFIX)
ParaProf – Callpath Highlighting (Flash)
ParaProf – Callgraph Zoom (Flash)
Profiling of Miranda on BG/L (Miller, LLNL)
Profile code performance (automatic instrumentation)
Scaling studies (problem size, number of processors)
[Plots: scaling profiles for 128, 512, and 1024 nodes]
Run on 8K and 16K processors!
Fine Grained Profiling via Tracing on Miranda
Use TAU to generate VTF3 traces for Vampir analysis
Combines MPI calls with HW counter information
Detailed code behavior to focus optimization efforts
Memory Usage Analysis
BG/L will have limited memory per node (512 MB)
Miranda uses TAU to profile memory usage
TAU’s footprint is small
  approximately 100 bytes per event per thread
Streamlines code
Squeeze larger problems on the machine
[Plot: max heap memory (KB) used for 128³ problem on 16 processors of ASC Frost at LLNL]
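Taking the ~100 bytes per event per thread figure at face value, a back-of-the-envelope footprint estimate is straightforward (the helper name and example counts are ours, not TAU's):

```python
def tau_footprint_bytes(num_events, num_threads, bytes_per_event=100):
    """Rough profile bookkeeping estimate: ~100 bytes/event/thread (assumed)."""
    return num_events * num_threads * bytes_per_event

# e.g. 1000 instrumented events across 16 threads -> ~1.6 MB total
print(tau_footprint_bytes(1000, 16))  # 1600000
```

Even thousands of events cost only a few megabytes in total, which is why the overhead matters little next to a 512 MB per-node budget.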
TAU Performance System Status
Computing platforms (selected)
  IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
Programming languages
  C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
Thread libraries
  pthreads, SGI sproc, Java, Windows, OpenMP
Compilers (selected)
  Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft
Important Questions for Application Developers
How does performance vary with different compilers?
Is poor performance correlated with certain OS features?
Has a recent change caused unanticipated performance effects?
How does performance vary with MPI variants?
Why is one application version faster than another?
What is the reason for the observed scaling behavior?
Did two runs exhibit similar performance?
How are performance data related to application events?
Which machines will run my code the fastest and why?
Which benchmarks predict my code performance best?
Performance Problem Solving Goals
Answer questions at multiple levels of interest
Data from low-level measurements and simulations
  use to predict application performance
High-level performance data spanning dimensions
  machine, applications, code revisions, data sets
  examine broad performance trends
Discover general correlations between application performance and features of the external environment
Develop methods to predict application performance from lower-level metrics
Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
Automatic Performance Analysis Tool (Concept)
[Diagram: PerfTrack performance database]
PSU: Kathryn Mohror, Karen Karavanic
UO: Kevin Huck
LLNL: John May, Brian Miller (CASC)
Performance Data Management Framework
TAU Performance Regression (PerfRegress)
Prototype developed by Alan Morris for Uintah
Re-implement using PerfDMF
ParaProf Performance Profile Analysis
[Diagram: profile data from TAU, HPMToolkit, and MpiP enters ParaProf as raw files or as PerfDMF-managed (database) data, organized by Application / Experiment / Trial with metadata]
PerfExplorer (K. Huck, UO)
Performance knowledge discovery framework
Use the existing TAU infrastructure
  TAU instrumentation data, PerfDMF
Client-server based system architecture
Data mining analysis applied to parallel performance data
Technology integration
  Relational Database Management Systems (RDBMS)
  Java API and toolkit
  R-project / Omegahat statistical analysis
  Web-based client
    Jakarta web server and Struts (for a thin web client)
PerfExplorer Architecture
Server accepts multiple client requests and returns results
Server supports R data mining operations built using RSJava
PerfDMF Java API used to access DBMS via JDBC
Client is a traditional Java application with GUI (Swing)
Analyses can be scripted, parameterized, and monitored
Browsing of analysis results via automatic web page creation and thumbnails
PERC Tool Requirements and Evaluation
Performance Evaluation Research Center (PERC)
  DOE SciDAC
  Evaluation methods/tools for high-end parallel systems
PERC tools study (led by ORNL, Pat Worley)
  In-depth performance analysis of select applications
  Evaluate performance analysis requirements
  Test tool functionality and ease of use
Applications
  Start with fusion code – GYRO
  Repeat with other PERC benchmarks
  Continue with SciDAC codes
GYRO Execution Parameters
Three benchmark problems
  B1-std: 16n processors, 500 timesteps
  B2-cy: 16n processors, 1000 timesteps
  B3-gtc: 64n processors, 100 timesteps (very large)
Test different methods to evaluate nonlinear terms:
  Direct method
  FFT (“nl2” for B1 and B2, “nl1” for B3)
Task affinity enabled/disabled (p690 only)
Memory affinity enabled/disabled (p690 only)
Filesystem location (Cray X1 only)
Primary Evaluation Machines
Phoenix (ORNL – Cray X1)
  512 multi-streaming vector processors
Ram (ORNL – SGI Altix (1.5 GHz Itanium2))
  256 total processors
Seaborg (NERSC – IBM SP3)
  6080 total processors on 380 compute nodes
Cheetah (ORNL – p690 cluster (1.3 GHz, HPS))
  864 total processors on 27 compute nodes
TeraGrid
  ~7,738 total processors on 15 machines at 9 sites
Region (Events) of Interest
Total program is measured, plus specific code regions
NL: nonlinear advance
NL_tr*: transposes before / after nonlinear advance
Coll: collisions
Coll_tr*: transposes before / after main collision routine
Lin_RHS: compute right-hand side of the electron and ion GKEs (GyroKinetic (Vlasov) Equations)
Field: explicit or implicit advance of fields and solution of explicit Maxwell equations
I/O, extras
Communication
Data Collected Thus Far…
User timer data
  Self-instrumentation in the GYRO application
  Outputs aggregate data per N timesteps
    N = 50 (B1, B3)
    N = 125 (B2)
HPM (Hardware Performance Monitor) data
  IBM platform (p690) only
MPICL profiling/tracing
  Cray X1 and IBM p690
TAU (all platforms, profiling/tracing, in progress)
Data processed by hand into Excel spreadsheets
PerfExplorer Analysis of Self-Instrumented Data
PerfExplorer
  Focus on comparative analysis
  Apply to PERC tool evaluation study
Look at user timer data
  Aggregate data
    no per-process data
    process clustering analysis is not applicable
  Timings output every N timesteps
    some phase analysis possible
Goal
  Recreate manually generated performance reports
Comparative Analysis
Supported analysis
  Timesteps per second
  Relative speedup and efficiency
    For the entire application (compare machines, parameters, etc.)
    For all events (on one machine, one set of parameters)
    For one event (compare machines, parameters, etc.)
  Initial analysis implemented as scalability study
Future analysis
  Fraction of total runtime for one group of events
  Runtime breakdown (as a percentage)
  Arbitrary organization
  Parametric studies
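The relative speedup and efficiency analyses above reduce to ratios against a base processor count; a minimal sketch with made-up timings (the helper names are ours, not PerfExplorer's):

```python
def relative_speedup(times, base_procs):
    """Speedup relative to a base processor count: S(p) = T(base) / T(p)."""
    t_base = times[base_procs]
    return {p: t_base / t for p, t in times.items()}

def relative_efficiency(times, base_procs):
    """Efficiency relative to the base: E(p) = S(p) * base / p."""
    return {p: s * base_procs / p
            for p, s in relative_speedup(times, base_procs).items()}

times = {16: 100.0, 32: 55.0, 64: 30.0}   # runtime (s) per processor count
print(relative_speedup(times, 16))        # 64 procs: ~3.33x over 16 procs
print(relative_efficiency(times, 16))     # 64 procs: ~0.83
```

Perfect scaling gives efficiency 1.0 at every processor count; the scalability study plots how far each event or experiment falls below that.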
PerfExplorer Interface
Experiment metadata
Select experiments and trials of interest
Data organized in application, experiment, trial structure (will allow arbitrary organization in future)
PerfExplorer Interface
Select analysis
Timesteps per Second
Cray X1 is the fastest to solution in all 3 tests
FFT (nl2) improves time for B3-gtc only
TeraGrid faster than p690 for B1-std?
Plots generated automatically
[Plots: timesteps per second for B1-std, B2-cy, and B3-gtc]
Relative Efficiency (B1-std)
By experiment (B1-std)
  Total runtime (Cheetah (red))
By event for one experiment
  Coll_tr (blue) is significant
By experiment for one event
  Shows how Coll_tr behaves for all experiments
[Plots: relative efficiency vs. the 16-processor base case]
Relative Speedup (B2-cy)
By experiment (B2-cy)
  Total runtime (X1 (blue))
By event for one experiment
  NL_tr (orange) is significant
By experiment for one event
  Shows how NL_tr behaves for all experiments
Fraction of Total Runtime (Communication)
IBM SP3 (cyan) has the highest fraction of total time spent in communication for all three benchmarks
Cray X1 has the lowest fraction in communication
[Plots: communication fraction for B1-std, B2-cy, and B3-gtc]
Runtime Breakdown on IBM SP3
Communication grows as a percentage of total as the application scales (colors match in graphs)
Both Coll_tr (blue) and NL_tr (orange) scale poorly
I/O (green) scales poorly, but its percentage of total runtime is small
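A runtime breakdown like the one plotted here is simply each event group's share of total time; a minimal sketch with hypothetical timings (the event names echo the GYRO regions, but the numbers are invented):

```python
def runtime_breakdown(event_times):
    """Percentage of total runtime per event group."""
    total = sum(event_times.values())
    return {event: 100.0 * t / total for event, t in event_times.items()}

# hypothetical per-event timings (seconds) at one processor count
breakdown = runtime_breakdown({"Coll_tr": 20.0, "NL_tr": 30.0, "I/O": 5.0, "other": 45.0})
print(breakdown)
```

Computing this at each processor count, then stacking the percentages, reproduces the scaling chart: a component whose share grows with scale (here, the transposes) is the one that scales poorly.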
Phase Analysis
Breakdown by phase shows variability from the beginning of the application to the final solution
Relative efficiency and runtime breakdown
Iteration 6 (cyan) has a big drop in efficiency for 128
Greater variability at higher processor counts
Clustering Analysis
“Scalable Analysis Techniques for Microprocessor Performance Counter Metrics,” Ahn and Vetter, SC2002
Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
Cluster analysis and F-ratio
  Agglomerative hierarchical method – dendrogram identified groupings of master, slave threads in sPPM
  K-means clustering and F-ratio – differences between master, slave related to communication and management
Factor analysis
  shows highly correlated metrics fall into peer groups
Combined techniques (applied recursively) lead to observations of application behavior hard to identify otherwise
Similarity Analysis
Can we recreate Ahn and Vetter’s results?
Apply techniques from the phase analysis (Sherwood)
Threads of execution can be compared for similarity
Threads with abnormal behavior show up as less similar
Each thread is represented as a vector V of dimension n
  n is the number of functions in the application
  V = [f1, f2, …, fn]
  each value is the percentage of time spent in that function, normalized from 0.0 to 1.0 (represents the event mix)
Distance calculated between vectors U and V:
  ManhattanDistance(U, V) = Σᵢ₌₁ⁿ |uᵢ − vᵢ|
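The distance computation above is a few lines of Python; the three per-thread event-mix vectors below are invented to show how an abnormal thread stands out:

```python
def manhattan_distance(u, v):
    """L1 distance between two per-thread event-mix vectors (fractions 0.0-1.0)."""
    assert len(u) == len(v)
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

# Hypothetical event mixes for three threads over n = 3 functions
t0 = [0.70, 0.20, 0.10]   # master-like thread
t1 = [0.68, 0.22, 0.10]   # similar to t0
t2 = [0.30, 0.10, 0.60]   # abnormal thread
print(manhattan_distance(t0, t1))  # ~0.04 (similar)
print(manhattan_distance(t0, t2))  # ~1.0  (dissimilar)
```

Small distances mark threads with similar behavior; a thread whose distance to all peers is large is the "less similar" outlier the analysis is looking for.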
sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47
sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower
Same result as Ahn/Vetter
sPPM on Frost (PAPI_FP_INS, 256 threads)
View of fewer than half of the threads of execution is possible on the screen at one time
Three groups are obvious:
  Lower ranking threads
  One unique thread
  Higher ranking threads (3% more FP)
Finding subtle differences is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)
Dendrogram shows 5 natural clusters:
Unique thread
High ranking master threads
Low ranking master threads
High ranking worker threads
Low ranking worker threads
• TAU profiles
• PerfDMF
• R direct access to DM
• R routine clusters threads
59
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
masters
slaves
sPPM on Frost (PAPI_FP_INS, 256 threads)
After K-means clustering into 5 clusters
Similar clusters are formed (seed with group means)
Each cluster’s performance characteristics analyzed
Dimensionality reduction (256 threads to 5 clusters!)
[Chart: five clusters of sizes 1, 6, 10, 119, and 120 threads; dominant events include SPPM, INTERF, DIFUZE, DINTERF, and Barrier [OpenMP:runhyd3.F <604,0>]]
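The K-means step can be sketched in plain Python, seeding with group means as described above (toy 1-D data and seeds are invented; the real study clustered 256 per-thread event-mix vectors into 5 clusters):

```python
def kmeans(vectors, means, iters=10):
    """Simple k-means: assign each vector to its nearest mean (L1), then update means."""
    clusters = [[] for _ in means]
    for _ in range(iters):
        clusters = [[] for _ in means]
        for v in vectors:
            d = [sum(abs(a - b) for a, b in zip(v, m)) for m in means]
            clusters[d.index(min(d))].append(v)
        means = [[sum(col) / len(c) for col in zip(*c)] if c else m
                 for c, m in zip(clusters, means)]
    return clusters, means

# Hypothetical 1-D event-mix features for 6 threads, seeded with 2 group means
vecs = [[0.1], [0.12], [0.11], [0.8], [0.82], [0.79]]
clusters, means = kmeans(vecs, means=[[0.1], [0.8]])
print([len(c) for c in clusters])  # [3, 3]
```

Each final cluster can then be characterized by its mean vector, which is the dimensionality reduction the slide refers to: 256 threads summarized by 5 representative profiles.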
Current and Future Work
ParaProf
  Developing 3D performance displays
PerfDMF
  Adding new database backends and distributed support
  Building support for user-created tables
PerfExplorer
  Extending comparative and clustering analysis
  Adding new data mining capabilities
  Building in scripting support
Performance regression testing tool (PerfRegress)
Integrate in Eclipse Parallel Tool Project (PTP)
Concluding Discussion
Performance tools must be used effectively
More intelligent performance systems for productive use
Performance observation methods do not necessarily need to change in a fundamental sense
Evolve to application-specific performance technology
Deal with scale by “full range” performance exploration
Autonomic and integrated tools
Knowledge-based and knowledge-driven process
More automatic control and efficient use
Develop next-generation tools and deliver to the community
Support Acknowledgements
Department of Energy (DOE)
  Office of Science contracts
  University of Utah ASCI Level 1 sub-contract
  ASC/NNSA Level 3 contract
NSF
  High-End Computing Grant
Research Centre Jülich
  John von Neumann Institute
  Dr. Bernd Mohr
Los Alamos National Laboratory