Performance Technology for Productive,
High-End Parallel Computing
Allen D. Malony
[email protected]
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
Outline of Talk
Research motivation
Scalability, productivity, and performance technology
Application-specific and autonomic performance tools
TAU parallel performance system developments
Application performance case studies
New project directions
Discussion
LACSI 2004
Performance Technology for Productive, High-End Parallel Computing
2
Research Motivation
Tools for performance problem solving
Empirical-based performance optimization process
Performance technology concerns
[Diagram: empirical performance problem solving cycle]
Performance Observation → characterization → Performance Experimentation → properties → Performance Diagnosis → hypotheses → Performance Tuning
Performance technology: instrumentation, measurement, analysis, visualization
Performance technology: experiment management, performance database
Problem Description
How does our view of this process change when we
consider very large-scale parallel systems?
What are the significant issues that will affect the
technology used to support the process?
Parallel performance observation is clearly needed
In general, there is the concern for intrusion
Seen as a tradeoff with performance diagnosis accuracy
Scaling complicates observation and analysis
Nature of application development may change
Paradigm shift in performance process and technology?
What will enhance productive application development?
Scaling and Performance Observation
Consider “traditional” measurement methods
More parallelism → more performance data overall
Profiling: summary statistics calculated during execution
Tracing: time-stamped sequence of execution events
Performance specific to each thread of execution
Possible increase in number of interactions between threads
Harder to manage the data (memory, transfer, storage)
How does per thread profile size grow?
Instrumentation more difficult with greater parallelism?
More parallelism / performance data → harder analysis
More time consuming to analyze and more difficult to visualize
Concern for Performance Measurement Intrusion
Performance measurement can affect the execution
Problems exist even with small degree of parallelism
Intrusion is an accepted consequence of standard practice
Consider intrusion (perturbation) of trace buffer overflow
Scale exacerbates the problem … or does it?
Perturbation of “actual” performance behavior
Minor intrusion can lead to major execution effects
Traditional measurement techniques tend to be localized
Suggests scale may not compound local intrusion globally
Measuring parallel interactions likely will be affected
Use accepted measurement techniques intelligently
Role of Intelligence and Specificity
How to make the process more effective (productive)?
Scale forces performance observation to be intelligent
What are the important performance events and data?
Standard approaches deliver a lot of data with little value
Tied to application structure and computational model
Tools have poor support for application-specific aspects
Process and tools can be more application-aware
Will allow scalability issues to be addressed in context
More control and precision of performance observation
More guided performance experimentation / exploration
Better integration with application development
Role of Automation and Knowledge Discovery
Even with intelligent and application-specific tools, the
decisions of what to analyze may become intractable
Scale forces the process to become more automated
Performance extrapolation must be part of the process
Build autonomic capabilities into the tools
Support broader experimentation methods and refinement
Access and correlate data from several sources
Automate performance data analysis / mining / learning
Include predictive features and experiment refinement
Knowledge-driven adaptation and optimization guidance
Address scale issues through increased expertise
TAU Parallel Performance System Goals
Multi-level performance instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
Computer system architectures and operating systems
Different programming languages and compilers
Support for multiple parallel programming paradigms
Multi-threading, message passing, mixed-mode, hybrid
Multi-language automatic source instrumentation
Support for performance mapping
Support for object-oriented and generic programming
Integration in complex software, systems, applications
TAU Performance System Architecture
TAU Instrumentation Advances
Source instrumentation
Program Database Toolkit (PDT)
automated Fortran 90/95 support (Flint parser, very robust)
statement-level support in C/C++ (Fortran soon)
TAU_COMPILER to automate instrumentation process
Automatic proxy generation for component applications
automatic CCA component instrumentation
Python instrumentation and automatic instrumentation
Continued integration with dynamic instrumentation
Update of OpenMP instrumentation (POMP2)
Selective instrumentation and overhead reduction
Improvements in performance mapping instrumentation
TAU Measurement Advances
Profiling
Memory profiling
global heap memory tracking (several options)
Callpath profiling
user-controllable calling depth
Improved support for multiple counter profiling
Online profile access and sampling
Tracing
Generation of VTF3 trace files (portable)
Inclusion of hardware performance counts in trace files
Hierarchical trace merging
Online performance overhead compensation
Component software proxy generation and monitoring
TAU Performance Analysis Advances
Enhanced parallel profile analysis (ParaProf)
Performance Data Management Framework (PerfDMF)
Callpath analysis integration in ParaProf
Integration with Vampir Next Generation (VNG)
First release of prototype
Online trace analysis
Performance visualization (ParaVis) prototype
Component performance modeling and QoS
Component-Based Scientific Applications
How to support the performance analysis and tuning process
consistent with the application development methodology?
Common Component Architecture (CCA) applications
Performance tools should integrate with software
Design performance observation component
Measurement port and measurement interfaces
Build support for application component instrumentation
Interpose a proxy component for each port
Inside the proxy, track caller/callee invocations, timings
Automate the process of proxy component creation
using PDT for static analysis of components
include support for selective instrumentation
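The interposition idea above can be sketched in a few lines of Python. This is an illustrative stand-in, not TAU or CCA code: `MeasurementProxy` and `Solver` are hypothetical names, and the proxy simply exposes the same interface as the wrapped component while recording caller/callee invocation counts and timings before forwarding each call.

```python
import time

class MeasurementProxy:
    """Hypothetical sketch of a proxy component: same port interface as
    the real component, with timing recorded around each invocation."""
    def __init__(self, component):
        self._component = component
        self.timings = {}          # method name -> accumulated seconds
        self.calls = {}            # method name -> invocation count

    def __getattr__(self, name):
        target = getattr(self._component, name)
        if not callable(target):
            return target
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)   # forward to real component
            finally:
                elapsed = time.perf_counter() - start
                self.timings[name] = self.timings.get(name, 0.0) + elapsed
                self.calls[name] = self.calls.get(name, 0) + 1
        return timed

class Solver:                      # stand-in for a real component port
    def integrate(self, steps):
        return sum(i * i for i in range(steps))

proxy = MeasurementProxy(Solver())
result = proxy.integrate(1000)     # measured transparently
```

Because the proxy satisfies the same port interface, it can be interposed at composition time without touching the component's source, which is the property the automated PDT-based generation relies on.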
Flame Reaction-Diffusion (Sandia, J. Ray)
CCAFFEINE
Component Modeling and Optimization
Given a set of components, where each component has
multiple implementations, what is the optimal subset of
implementations that solve a given problem?
How to model a single component?
How to model a composition of components?
How to select optimal subset of implementations?
A component only has performance meaning in context
Applications are dynamically composed at runtime
Application developers use components from others
Instrumentation may only be at component interfaces
Performance measurements need to be non-intrusive
Users are interested in coarse-grained performance
MasterMind Component (Trebon, IPDPS 2004)
Proxy Generator for other Applications
PDT-based proxy component for:
QoS tracking [Boyana, ANL]
Debugging Port Monitor for CCA (tracks arguments)
SCIRun2 Perfume components [Venkat, U. Utah]
Exploring Babel for auto-generation of proxies:
Direct SIDL to proxy code generation
Generating client component interface in C++
Using PDT for generating proxies
Earth Systems Modeling Framework
Coupled modeling with modular software framework
Instrumentation for framework and applications
PDT automatic instrumentation
Fortran 95
C / C++
Component instrumentation (using CCA Components)
MPI wrapper library for MPI calls
CCA measurement port manual instrumentation
Proxy generation using PDT and runtime interposition
Significant callpath profiling use by ESMF team
Using TAU Component in ESMF/CCA
TAU’s Paraprof Profile Browser (ESMF Data)
Callpath profile
TAU Traces with Counters (ESMF)
Visualizing TAU Traces with Counters/Samples
Uintah Computational Framework (UCF)
University of Utah, Center for Simulation of Accidental
Fires and Explosions (C-SAFE), DOE ASCI Center
UCF analysis
Scheduling
MPI library
Components
Performance mapping
Use for online and offline visualization
ParaVis tools
500 processes
Scatterplot Displays
Each point coordinate determined by three values:
MPI_Reduce
MPI_Recv
MPI_Waitsome
Min/Max value range
Effective for cluster analysis
Relation between MPI_Recv and MPI_Waitsome
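A minimal sketch of how such scatterplot coordinates can be derived: each process becomes a 3-D point whose coordinates are its times in the three MPI events, normalized to each metric's min/max range. The per-rank timing data here is hypothetical, and this is not ParaVis code.

```python
# Hypothetical per-rank exclusive times (seconds) for three MPI events.
profiles = {
    0: {"MPI_Reduce": 1.2, "MPI_Recv": 0.4, "MPI_Waitsome": 2.0},
    1: {"MPI_Reduce": 1.1, "MPI_Recv": 0.6, "MPI_Waitsome": 1.8},
    2: {"MPI_Reduce": 3.0, "MPI_Recv": 2.5, "MPI_Waitsome": 0.2},
}
metrics = ["MPI_Reduce", "MPI_Recv", "MPI_Waitsome"]

def normalize(values):
    # rescale to [0, 1] over the observed min/max range
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0
    return [(v - lo) / span for v in values]

columns = [normalize([profiles[r][m] for r in sorted(profiles)]) for m in metrics]
points = list(zip(*columns))      # one (x, y, z) scatterplot point per rank
```

Ranks with similar event mixes land close together, which is why the display is effective for spotting clusters of like-behaving processes.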
Online Uintah Performance Profiling
Demonstration of online profiling capability
Colliding elastic disks
Test material point method (MPM) code
Executed on 512 processors ASCI Blue Pacific at LLNL
Example
Bargraph visualization
MPI execution time
Performance mapping
Multiple time steps
Miranda Performance Analysis (Miller, LLNL)
Miranda is a research hydrodynamics code
Mostly synchronous
Fortran 95, MPI
MPI_ALLTOALL on Np x,y communicators
Some MPI reductions and broadcasts for statistics
Good communications scaling
ACL and MCR, sibling Linux clusters
~1000 Intel P4 nodes, dual 2.4 GHz
Up to 1728 CPUs
Fixed workload per CPU
Ported to BlueGene/L
TAU Profiling of Miranda on BG/L
Miranda team is using TAU to profile code performance
Routinely runs on BG/L for 1000 CPUs for 10-20 minutes
Scaling studies (problem size, number of processors)
128 Nodes
512 Nodes
1024 Nodes
Fine-Grained Profiling via Tracing
Miranda uses TAU to generate traces
Combines MPI calls with HW counter information
Detailed code behavior to focus optimization efforts
Memory Usage Analysis
BG/L will have limited memory per node (512 MB)
Miranda uses TAU to profile memory usage
Streamlines code
Squeeze larger problems onto the machine
Max Heap Memory (KB) used for 128³ problem on 16 processors of ASC Frost at LLNL
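The heap-tracking idea can be illustrated outside TAU with Python's standard `tracemalloc` module. This is illustrative only (not TAU's mechanism): it attributes peak heap usage to a region of interest, the way a memory profile attributes max heap to a program phase.

```python
import tracemalloc

# Track allocations around a region of interest and report peak heap.
tracemalloc.start()
data = [bytearray(1024) for _ in range(1000)]   # ~1 MB of allocations
current, peak = tracemalloc.get_traced_memory()  # bytes (current, peak)
tracemalloc.stop()
peak_kb = peak / 1024
```

On a memory-constrained node (512 MB on BG/L), this kind of per-phase peak figure is what lets developers decide where to streamline and how large a problem still fits.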
Kull Performance Optimization (Miller, LLNL)
Kull is a Lagrange hydrodynamics code
Scalar test problem analysis
CSSubzonalEffects member functions
Examination revealed optimization opportunities
Serial execution to identify performance factors
Original code profile indicated expensive functions
Physics packages written in C++ and Fortran
Parallel Python interpreter run-time environment!
Loop merging
Amortizing geometric lookup over more calculations
Apply to CSSubzonalEffects member functions
Kull Optimization
CSSubzonalEffects member functions total time
Reduced from 5.80 seconds to 0.82 seconds
Overall run time reduced from 28.1 to 22.85 seconds
Original Exclusive Profile
Optimized Exclusive Profile
Important Questions for Application Developers
How does performance vary with different compilers?
Is poor performance correlated with certain OS features?
Has a recent change caused unanticipated performance?
How does performance vary with MPI variants?
Why is one application version faster than another?
What is the reason for the observed scaling behavior?
Did two runs exhibit similar performance?
How are performance data related to application events?
Which machines will run my code the fastest and why?
Which benchmarks predict my code performance best?
Multi-Level Performance Data Mining
New (just forming) research project
PSU: Karen L. Karavanic
Cornell: Sally A. McKee
UO: Allen D. Malony and Sameer Shende
LLNL: John M. May and Bronis R. de Supinski
Develop performance data mining technology
Scientific applications, benchmarks, other measurements
Systematic analysis for understanding and prediction
Better foundation for evaluation of leadership-class
computer systems
Goals
Answer questions at multiple levels of interest
Data from low-level measurements and simulations
used to predict application performance
data mining applied to optimize data gathering process
High-level performance data spanning dimensions
Machine, applications, code revisions
Examine broad performance trends
Need technology
Performance data instrumentation and measurement
Performance data management
Performance analysis and results presentation
Automated performance exploration
Specific Goals
Design, develop, and populate a performance database
Discover general correlations between application performance
and features of the external environment
Develop methods to predict application performance from
lower-level metrics
Discover performance correlations between a small set
of benchmarks and a collection of applications that
represent a typical workload for a given system
Performance data mining infrastructure is important for
all of these goals
Establish a more rational basis for evaluating the
performance of leadership-class computers
PerfTrack: Performance DB and Analysis Tool
PSU: Kathryn Mohror, Karen Karavanic
UO: Kevin Huck, Allen D. Malony
LLNL: John May, Brian Miller (CASC)
TAU Performance Data Management Framework
TAU Performance Regression (PerfRegress)
Background – Ahn & Vetter, 2002
“Scalable Analysis Techniques for Microprocessor
Performance Counter Metrics,” SC2002
Applied multivariate statistical analysis techniques to
large datasets of performance data (PAPI events)
Cluster Analysis and F-Ratio
Agglomerative Hierarchical Method - dendrogram
identified groupings of master, slave threads in sPPM
K-means clustering and F-ratio - differences between
master, slave related to communication and management
Factor Analysis - shows highly correlated metrics fall into peer groups
Combined techniques (recursively) leads to observations
of application behavior hard to identify otherwise
Thread Similarity Matrix
Apply techniques from phase analysis (Sherwood)
Threads of execution can be visually compared
Threads with abnormal behavior show up as less similar
than other threads
Each thread is represented as a vector (V) of dimension n
n is the number of functions in the application
V = [f1, f2, …, fn]
Each value is the percentage of time spent in that function
normalized from 0.0 to 1.0 (represents event mix)
Distance calculated between the vectors U and V:
ManhattanDistance(U, V) = ∑ i=1..n |ui − vi|
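A minimal sketch of the similarity computation just described: each thread is a vector of per-function time fractions, and pairwise Manhattan distance yields the similarity matrix. The thread data here is hypothetical.

```python
# Each row is a thread's vector of time fractions over n functions.
threads = [
    [0.50, 0.30, 0.20],   # thread 0: fraction of time in f1, f2, f3
    [0.48, 0.32, 0.20],   # thread 1: behaves like thread 0
    [0.10, 0.10, 0.80],   # thread 2: abnormal, dominated by f3
]

def manhattan(u, v):
    # sum of absolute coordinate differences
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

matrix = [[manhattan(u, v) for v in threads] for u in threads]
```

In the matrix, thread 2's rows and columns carry large distances, so the abnormal thread stands out visually exactly as the slide describes.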
sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47
sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower
sPPM on Frost (PAPI_FP_INS, 256 threads)
View of fewer than half of
the threads of execution is
possible on the screen at
one time
Three groups are obvious:
Lower ranking threads
One unique thread
Higher ranking threads
3% more FP
Finding subtle differences
is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)
Dendrogram shows 5 natural clusters:
Unique thread
High ranking master threads
Low ranking master threads
High ranking worker threads
Low ranking worker threads
• TAU profiles
• PerfDMF
• R access
threads
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
masters
slaves
sPPM on Frost (PAPI_FP_INS, 256 threads)
After k-means clustering into 5 clusters
Similar natural clusters are grouped
Each groups performance characteristics analyzed
256 threads of data has been reduced to 5 clusters!
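The reduction step can be sketched with a toy k-means in pure Python. This is illustrative, not the analysis tool's code: per-thread metric vectors (hypothetical data) are grouped so that many threads collapse into a few representative clusters.

```python
def kmeans(points, k, iters=20):
    # deterministic initialization for the sketch: spread centers across input
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[best].append(p)
        # move each center to the mean of its cluster (keep old if empty)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else list(centers[i])
                   for i, cl in enumerate(clusters)]
    return clusters

# two obvious groups of "threads" (vectors of per-event time fractions)
points = [(1.0, 0.0)] * 6 + [(0.0, 1.0)] * 6
clusters = kmeans(points, k=2)
```

Analyzing one representative per cluster instead of every thread is what makes 256-thread (and larger) data sets tractable.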
[Figure: 5 clusters of sizes 1, 6, 10, 119, and 120 threads; dominant events SPPM, INTERF, DIFUZE, DINTERF, Barrier [OpenMP:runhyd3.F <604,0>]]
Extreme Performance Scalable Operating Systems (ZeptoOS)
DOE, Office of Science
Investigate operating system and run-time (OS/R) functionality
required for scalable components used in petascale architectures
OS / RTS for Extreme Scale Scientific Computation
Argonne National Lab and University of Oregon
Flexible OS/R functionality
Scalable OS/R system calls
Performance tools, monitoring, and metrics
Fault tolerance and resiliency
Approach
Specify OS/R requirements across scalable components
Explore flexible functionality (Linux)
Hierarchical designs optimized with collective OS/R interfaces
Integrated (horizontal, vertical) performance measurement / analysis
Fault scenarios and injection to observe behavior
ZeptoOS Plans
Explore Linux functionality for BG/L
Explore efficiency for ultra-small kernels
Construct kernel-level collective operations
Scheduler, memory, IO
Support for dynamic library loading, …
Build Faulty Towers Linux kernel and system for
replaying fault scenarios
Extend TAU
Profiling OS suites
Benchmarking collective OS calls
Observing effects of faults
Discussion
As high-end systems scale, it will be increasingly
important that performance tools be used effectively
Performance observation methods do not necessarily
need to change in a fundamental sense
Just need to be controlled and used efficiently
More intelligent performance systems for productive use
Evolve to application-specific performance technology
Deal with scale by “full range” performance exploration
Autonomic and integrated tools
Knowledge-based and knowledge-driven process
Deliver next-generation performance technology to the community
Support Acknowledgements
Department of Energy (DOE)
Office of Science contracts
University of Utah ASCI Level 1
sub-contract
ASC/NNSA Level 3 contract
NSF
High-End Computing Grant
Research Centre Juelich
John von Neumann Institute
Dr. Bernd Mohr
Los Alamos National Laboratory