The TAU Performance System


Performance Technology for Productive,
High-End Parallel Computing
Allen D. Malony
[email protected]
Department of Computer and Information Science
Performance Research Laboratory
NeuroInformatics Center
University of Oregon
Research Motivation

- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns

[Diagram: the empirical optimization cycle. Performance Observation yields a characterization; Performance Experimentation yields properties; Performance Diagnosis yields hypotheses that drive Performance Tuning. Performance technology supports the cycle: instrumentation, measurement, analysis, and visualization on the observation side; experiment management and performance storage on the experimentation side.]

PSC
Performance Technology for Productive, High-End Parallel Computing
Challenges in Performance Problem Solving

- How to make the process more effective (productive)?
- Process may depend on scale of parallel system
- What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
- Process and tools can/must be more application-aware
  - Tools have poor support for application-specific aspects
- What are the significant issues that will affect the technology used to support the process?
  - Enhance application development and benchmarking
  - New paradigm in performance process and technology
Large Scale Performance Problem Solving

- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
  - Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
  - Performance data size becomes a concern
  - Analysis complexity increases
- Nature of application development may change
Role of Intelligence, Automation, and Knowledge

- Scale forces the process to become more intelligent
- Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
- Build automatic/autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise
Outline of Talk

- Performance problem solving
  - Scalability, productivity, and performance technology
  - Application-specific and autonomic performance tools
- TAU parallel performance system and advances
- Performance data management and data mining
  - Performance Data Management Framework (PerfDMF)
  - PerfExplorer
- Multi-experiment case studies
  - Clustering analysis
  - Comparative analysis (PERC tool study)
- Future work and concluding remarks
TAU Performance System

- Tuning and Analysis Utilities (13+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- University of Oregon, Research Centre Jülich, LANL
TAU Parallel Performance System Goals

- Multi-level performance instrumentation
  - Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
- Support for performance mapping
- Support for object-oriented and generic programming
- Integration in complex software, systems, applications
TAU Performance System Architecture

[Architecture diagram showing event selection]
TAU Performance System Architecture

[Second architecture diagram]
Advances in TAU Instrumentation

- Source instrumentation
  - Program Database Toolkit (PDT)
    - automated Fortran 90/95 support (Cleanscape Flint parser)
    - statement-level support in C/C++ (Fortran soon)
  - TAU_COMPILER to automate instrumentation process
  - Automatic proxy generation for component applications
    - automatic CCA component instrumentation
  - Python instrumentation and automatic instrumentation
- Continued integration with dynamic instrumentation
- Update of OpenMP instrumentation (POMP2)
- Selective instrumentation and overhead reduction
- Improvements in performance mapping instrumentation
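Selective instrumentation of the kind listed above is typically driven by a plain-text exclude/include file handed to TAU's instrumentor. A hedged sketch of such a file follows; the routine signature and file pattern are hypothetical, and the exact directive set should be confirmed against the TAU documentation:

```
# select.tau (hypothetical example)
BEGIN_EXCLUDE_LIST
# skip tiny, frequently called routines to reduce overhead
void sort_swap(int *, int, int)
END_EXCLUDE_LIST

BEGIN_FILE_EXCLUDE_LIST
# leave third-party sources uninstrumented
external/*.f90
END_FILE_EXCLUDE_LIST
```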
Program Database Toolkit (PDT)

[Toolchain diagram: application / library sources go through the C/C++ parser and the Fortran 77/90/95 parser to produce intermediate language (IL) files; the C/C++ and Fortran IL analyzers populate the program database files; the DUCTAPE library then serves PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / Fortran 90/95 interoperability), and TAU_instr (automatic source instrumentation).]
Advances in TAU Measurement

- Profiling (four types)
  - Memory profiling
    - global heap memory tracking (several options)
  - Callpath profiling and calldepth profiling
    - user-controllable callpath length and calling depth
  - Phase-based profiling
- Tracing
  - Generation of VTF3 / SLOG trace files (fully portable)
  - Inclusion of hardware performance counts in trace files
  - Hierarchical trace merging
- Online performance overhead compensation
- Component software proxy generation and monitoring
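The overhead-compensation idea can be sketched outside TAU: estimate the cost of one timer start/stop pair and subtract it from measured times. The numbers below are illustrative, not TAU's actual calibration:

```python
def compensate(measured_time, num_calls, per_call_overhead):
    """Subtract estimated instrumentation overhead from a measured time.

    measured_time: inclusive time recorded for an event (seconds)
    num_calls: how many times the event's timer fired
    per_call_overhead: calibrated cost of one start/stop pair (seconds)
    """
    corrected = measured_time - num_calls * per_call_overhead
    return max(corrected, 0.0)  # never report negative time

# An event called 1,000,000 times with a 0.5 us timer cost per call
# carries about 0.5 s of measurement overhead.
print(compensate(12.0, 1_000_000, 5e-7))  # roughly 11.5
```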
Profile Measurement

- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, # of calls, child calls
- Callpath profiles (calldepth profiles)
  - Time spent along a calling path (edges in callgraph)
  - "main => f1 => f2 => MPI_Send" (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase-based profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default "main" phase
  - Supports static or dynamic (per-iteration) phases
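The flat-profile quantities relate simply: an event's exclusive time is its inclusive time minus the inclusive time attributed to its direct callees. A small sketch over a toy callgraph (event names and times are invented):

```python
# Inclusive times (seconds) and direct-child relationships for a toy run.
inclusive = {"main": 10.0, "f1": 6.0, "f2": 4.0, "MPI_Send": 3.0}
children = {"main": ["f1"], "f1": ["f2"], "f2": ["MPI_Send"], "MPI_Send": []}

def exclusive(event):
    """Exclusive time = inclusive time minus time spent in direct callees."""
    return inclusive[event] - sum(inclusive[c] for c in children[event])

for e in inclusive:
    print(e, exclusive(e))
# main spends 4.0 s in its own code, f1 2.0 s, f2 1.0 s, MPI_Send 3.0 s
```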
Advances in TAU Performance Analysis

- Enhanced parallel profile analysis (ParaProf)
  - Callpath analysis integration in ParaProf
  - Event callgraph view
- Performance Data Management Framework (PerfDMF)
  - First release of prototype
  - In use by several groups
    - S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), ...
- Integration with Vampir Next Generation (VNG)
  - Online trace analysis
- 3D performance visualization prototype (ParaVis)
- Component performance modeling and QoS
Pprof – Flat Profile (NAS PB LU)

- Intel Linux cluster, F90 + MPICH
- Profile: node / context / thread
- Events: code, MPI
- Metric: time
- Text display
ParaProf – Manager Window

[Screenshot: performance database contents and derived performance metrics]
ParaProf – Full Profile (Miranda)

8K processors!
ParaProf – Flat Profile (Miranda)
ParaProf – Callpath Profile (Flash)
ParaProf – Callpath Profile (ESMF)

21-level callpath
ParaProf – Phase Profile (MFIX)

[Screenshot: dynamic phases, one per iteration. In the 51st iteration, time spent in MPI_Waitall was 85.81 secs; total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations.]
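The MFIX numbers illustrate why per-iteration (dynamic) phases matter: the mean MPI_Waitall time per iteration hides the spike that phase profiling exposes. Checking the arithmetic with the figures from the slide:

```python
total_waitall = 4137.9   # seconds across all iterations (from the slide)
iterations = 92
iter51_waitall = 85.81   # seconds in the 51st iteration (from the slide)

mean_per_iter = total_waitall / iterations
print(f"mean per iteration: {mean_per_iter:.2f} s")            # ~44.98 s
print(f"iteration 51 is {iter51_waitall / mean_per_iter:.1f}x the mean")
```

Iteration 51 takes nearly twice the average, a fact invisible in an aggregate flat profile.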
ParaProf – Histogram View (Miranda)

- Scalable 2D displays
- Shown for 8k and 16k processors
ParaProf – Callgraph View (MFIX)
ParaProf – Callpath Highlighting (Flash)

[Highlighted event: MODULEHYDRO_1D:HYDRO_1D]
Profiling of Miranda on BG/L (Miller, LLNL)

- Profile code performance (automatic instrumentation)
- Scaling studies (problem size, number of processors)
- Runs on 128, 512, and 1024 nodes; run on 8K and 16K processors!
ParaProf – 3D Full Profile (Miranda)

16k processors
ParaProf – 3D Scatterplot (Miranda)

- Each point is a "thread" of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library
  - JOGL
Performance Tracing on Miranda

- Use TAU to generate VTF3 traces for Vampir analysis
  - MPI calls with HW counter information (not shown)
  - Detailed code behavior to focus optimization efforts
S3D on Lemieux (TAU-to-VTF3, Vampir)
S3D on Lemieux (Zoomed)
TAU Performance System Status

- Computing platforms (selected)
  - IBM SP/pSeries, SGI Origin, Cray T3E/SV-1/X1/XT3, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
- Programming languages
  - C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
- Thread libraries (selected)
  - pthreads, SGI sproc, Java, Windows, OpenMP
- Compilers (selected)
  - Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, PathScale, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft
Project Affiliations (selected)

- Center for Simulation of Accidental Fires and Explosions
  - University of Utah, ASCI ASAP Center, C-SAFE
  - Uintah Computational Framework (UCF) (C++)
- Center for Simulation of Dynamic Response of Materials
  - California Institute of Technology, ASCI ASAP Center
  - Virtual Testshock Facility (VTF) (Python, Fortran 90)
- Earth Systems Modeling Framework (ESMF)
  - NSF, NOAA, DOE, NASA, ...
  - Instrumentation for ESMF framework and applications
  - C, C++, and Fortran 95 code modules
  - MPI wrapper library for MPI calls
Project Affiliations (selected) (continued)

- Lawrence Livermore National Lab
  - Hydrodynamics (Miranda), radiation diffusion (KULL)
- Sandia National Lab and Los Alamos National Lab
  - DOE CCTTSS SciDAC project
  - Common component architecture (CCA) integration
- Argonne National Lab
  - OS / RTS for Extreme Scale Scientific Computation
  - ZeptoOS - scalable components for petascale architectures
  - KTAU - integration of TAU infrastructure in Linux kernel
- Oak Ridge National Lab
  - Contribution to the Joule Report: S3D, AORSA3D
Important Questions for Application Developers

- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
Performance Problem Solving Goals

- Answer questions at multiple levels of interest
  - Data from low-level measurements and simulations
    - use to predict application performance
  - High-level performance data spanning dimensions
    - machine, applications, code revisions, data sets
    - examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
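The benchmark-to-application correlation goal above reduces, at its simplest, to computing a correlation coefficient across machines. A minimal sketch with made-up timings (the runtimes and machine count are invented for illustration):

```python
import statistics

# Hypothetical runtimes (seconds) of one benchmark and one
# application on the same four machines.
benchmark   = [10.0, 14.0, 9.0, 20.0]
application = [105.0, 150.0, 98.0, 240.0]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(benchmark, application)
print(f"r = {r:.3f}")  # close to 1: the benchmark tracks the application
```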
Automatic Performance Analysis Tool (Concept)

[Concept diagram: build the application (capturing build information), execute it (capturing environment and performance data), store the results in a performance database, run offline analysis, and return simple analysis feedback such as "105% faster!" or "72% faster!"]
Performance Data Management Framework
ParaProf Performance Profile Analysis

[Diagram: raw profile files from TAU, HPMToolkit, and MpiP, together with metadata, are loaded into PerfDMF-managed database storage, organized as Application / Experiment / Trial]
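PerfDMF's Application / Experiment / Trial organization can be mimicked with nested mappings. This sketch (application, experiment, and trial names are hypothetical) shows how a trial's profiles would be filed and then compared across an experiment:

```python
from collections import defaultdict

# application -> experiment -> trial -> per-event profile data
perfdmf = defaultdict(lambda: defaultdict(dict))

def load_trial(app, experiment, trial, profiles):
    """File one trial's profile data under its application and experiment."""
    perfdmf[app][experiment][trial] = profiles

load_trial("Miranda", "scaling-study", "8k-procs",  {"MPI_Barrier": 12.4})
load_trial("Miranda", "scaling-study", "16k-procs", {"MPI_Barrier": 31.9})

# Compare one event across all trials of an experiment.
for trial, prof in perfdmf["Miranda"]["scaling-study"].items():
    print(trial, prof["MPI_Barrier"])
```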
PerfExplorer (K. Huck, Ph.D. student, UO)

- Performance knowledge discovery framework
  - Use the existing TAU infrastructure
    - TAU instrumentation data, PerfDMF
  - Client-server based system architecture
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, ...
- Technology integration
  - Relational Database Management Systems (RDBMS)
  - Java API and toolkit
  - R-project / Omegahat statistical analysis
  - WEKA data mining package
  - Web-based client
PerfExplorer Architecture
PerfExplorer Client GUI
Hierarchical and K-means Clustering (sPPM)
Miranda Clustering on 16K Processors
PERC Tool Requirements and Evaluation

- Performance Evaluation Research Center (PERC)
  - DOE SciDAC
  - Evaluation methods/tools for high-end parallel systems
- PERC tools study (led by ORNL, Pat Worley)
  - In-depth performance analysis of select applications
  - Evaluate performance analysis requirements
  - Test tool functionality and ease of use
- Applications
  - Start with fusion code: GYRO
  - Repeat with other PERC benchmarks
  - Continue with SciDAC codes
Primary Evaluation Machines

- Phoenix (ORNL – Cray X1)
  - 512 multi-streaming vector processors
- Ram (ORNL – SGI Altix (1.5 GHz Itanium2))
  - 256 total processors
- TeraGrid
  - ~7,738 total processors on 15 machines at 9 sites
- Cheetah (ORNL – p690 cluster (1.3 GHz, HPS))
  - 864 total processors on 27 compute nodes
- Seaborg (NERSC – IBM SP3)
  - 6080 total processors on 380 compute nodes
GYRO Execution Parameters

- Three benchmark problems
  - B1-std: 16n processors, 500 timesteps
  - B2-cy: 16n processors, 1000 timesteps
  - B3-gtc: 64n processors, 100 timesteps (very large)
- Test different methods to evaluate nonlinear terms:
  - Direct method
  - FFT ("nl2" for B1 and B2, "nl1" for B3)
- Task affinity enabled/disabled (p690 only)
- Memory affinity enabled/disabled (p690 only)
- Filesystem location (Cray X1 only)
PerfExplorer Analysis of Self-Instrumented Data

- PerfExplorer
  - Focus on comparative analysis
  - Apply to PERC tool evaluation study
  - Look at user timer data
- Aggregate data
  - no per-process data
  - process clustering analysis is not applicable
- Timings output every N timesteps
  - some phase analysis possible
- Goal
  - Recreate manually generated performance reports
PerfExplorer Interface

[Screenshot: experiment metadata; select experiments and trials of interest; data organized in application / experiment / trial structure (will allow arbitrary organization in future)]
PerfExplorer Interface

[Screenshot: select analysis]
Timesteps per Second

- Cray X1 is the fastest to solution in all 3 tests
- FFT (nl2) improves time for B3-gtc only
- TeraGrid faster than p690 for B1-std?
- Plots generated automatically

[Plots for B1-std, B2-cy, and B3-gtc; TeraGrid highlighted]
Relative Efficiency (B1-std)

- By experiment (B1-std)
  - Total runtime (Cheetah (red))
- By event for one experiment
  - Coll_tr (blue) is significant
- By experiment for one event
  - Shows how Coll_tr behaves for all experiments
- 16 processor base case
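Relative efficiency against a 16-processor base case, as plotted above, is just T(base) * base_procs divided by T(p) * p. A sketch with invented runtimes (the values are illustrative, not GYRO measurements):

```python
def relative_efficiency(times, base_procs=16):
    """times: {processor_count: runtime_seconds}; efficiency relative
    to the base processor count (perfect scaling -> 1.0)."""
    base_cost = times[base_procs] * base_procs  # total CPU-seconds at base
    return {p: base_cost / (t * p) for p, t in sorted(times.items())}

# Hypothetical B1-std-style runtimes on one machine.
runtimes = {16: 100.0, 32: 55.0, 64: 30.0, 128: 18.0}
for p, eff in relative_efficiency(runtimes).items():
    print(f"{p:4d} procs: efficiency {eff:.2f}")
```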
Automated Parallel Performance Diagnosis
Current and Future Work

- ParaProf
  - Developing phase-based performance displays
- PerfDMF
  - Adding new database backends and distributed support
  - Building support for user-created tables
- PerfExplorer
  - Extending comparative and clustering analysis
  - Adding new data mining capabilities
  - Building in scripting support
- Performance regression testing tool (PerfRegress)
- Integrate in Eclipse Parallel Tool Project (PTP)
Concluding Discussion

- Performance tools must be used effectively
  - More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - More automatically controlled and efficiently used
- Develop next-generation tools and deliver to community
  - Open source with support by ParaTools, Inc.
Support Acknowledgements

- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- NSF
  - High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute
  - Dr. Bernd Mohr
- Los Alamos National Laboratory