lacsi04 - Computer & Information Science Department

Performance Technology for Productive,
High-End Parallel Computing
Allen D. Malony
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
Outline of Talk







Research motivation
Scalability, productivity, and performance technology
Application-specific and autonomic performance tools
TAU parallel performance system developments
Application performance case studies
New project directions
Discussion
LACSI 2004
Performance Technology for Productive, High-End Parallel Computing
Research Motivation

Tools for performance problem solving


Empirical-based performance optimization process
Performance technology concerns
[Figure: empirical performance optimization process. Stages and the information exchanged between them: Performance Tuning, hypotheses, Performance Diagnosis, properties, Performance Experimentation, characterization, Performance Observation. Supporting performance technology: experiment management and performance database; instrumentation, measurement, analysis, and visualization.]
Problem Description




How does our view of this process change when we consider very large-scale parallel systems?
What are the significant issues that will affect the technology used to support the process?
Parallel performance observation is clearly needed
In general, there is the concern for intrusion
  • Seen as a tradeoff with performance diagnosis accuracy
Scaling complicates observation and analysis
Nature of application development may change
Paradigm shift in performance process and technology?
What will enhance productive application development?
Scaling and Performance Observation

Consider “traditional” measurement methods
  • Profiling: summary statistics calculated during execution
  • Tracing: time-stamped sequence of execution events
More parallelism ⇒ more performance data overall
  • Performance specific to each thread of execution
  • Possible increase in number of interactions between threads
  • Harder to manage the data (memory, transfer, storage)
  • How does per-thread profile size grow?
Instrumentation more difficult with greater parallelism?
More parallelism / performance data ⇒ harder analysis
  • More time-consuming to analyze and difficult to visualize
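The difference between profiling and tracing on this slide can be illustrated with a minimal sketch. It is not TAU code: the event tuples, routine names, and the `profile` and `trace` helpers are all hypothetical, and the events are assumed to arrive already ordered in time for a single thread.

```python
from collections import defaultdict

def profile(events):
    """Profiling: keep only per-routine summary statistics (calls, total time)."""
    summary = defaultdict(lambda: {"calls": 0, "time": 0.0})
    for name, start, end in events:
        summary[name]["calls"] += 1
        summary[name]["time"] += end - start
    return dict(summary)

def trace(events):
    """Tracing: keep every time-stamped enter/exit record, in arrival order."""
    records = []
    for name, start, end in events:
        records.append((start, "enter", name))
        records.append((end, "exit", name))
    return records

# Hypothetical event stream for one thread: (routine, start time, end time)
events = [("solve", 0.0, 1.5), ("exchange", 1.5, 2.0), ("solve", 2.0, 3.25)]

thread_profile = profile(events)  # size grows with the number of routines
thread_trace = trace(events)      # size grows with the number of events

print(thread_profile)
print(len(thread_trace), "trace records")
```

In this sketch the profile holds one entry per distinct routine while the trace holds two records per event, which is one way to see why trace volume is the harder data-management problem as thread counts and run lengths grow.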
Concern for Performance Measurement Intrusion

Performance measurement can affect the execution
  • Perturbation of “actual” performance behavior
  • Minor intrusion can lead to major execution effects
  • Problems exist even with small degree of parallelism
Intrusion is accepted consequence of standard practice
  • Consider intrusion (perturbation) of trace buffer overflow
Scale exacerbates the problem … or does it?
  • Traditional measurement techniques tend to be localized
  • Suggests scale may not compound local intrusion globally
  • Measuring parallel interactions likely will be affected
Use accepted measurement techniques intelligently
Role of Intelligence and Specificity


How to make the process more effective (productive)?
Scale forces performance observation to be intelligent
  • Standard approaches deliver a lot of data with little value
What are the important performance events and data?
  • Tied to application structure and computational model
  • Tools have poor support for application-specific aspects
Process and tools can be more application-aware
Will allow scalability issues to be addressed in context
  • More control and precision of performance observation
  • More guided performance experimentation / exploration
  • Better integration with application development
Role of Automation and Knowledge Discovery




Even with intelligent and application-specific tools, the decisions of what to analyze may become intractable
Scale forces the process to become more automated
Performance extrapolation must be part of the process
Build autonomic capabilities into the tools
  • Support broader experimentation methods and refinement
  • Access and correlate data from several sources
  • Automate performance data analysis / mining / learning
  • Include predictive features and experiment refinement
Knowledge-driven adaptation and optimization guidance
Address scale issues through increased expertise
TAU Parallel Performance System Goals

Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid
Support for performance mapping
Support for object-oriented and generic programming
Integration in complex software, systems, applications
TAU Performance System Architecture
TAU Instrumentation Advances

Source instrumentation
  • Program Database Toolkit (PDT)
      • automated Fortran 90/95 support (Flint parser, very robust)
      • statement level support in C/C++ (Fortran soon)
  • TAU_COMPILER to automate instrumentation process
  • Automatic proxy generation for component applications
      • automatic CCA component instrumentation
  • Python instrumentation and automatic instrumentation
  • Continued integration with dynamic instrumentation
  • Update of OpenMP instrumentation (POMP2)
Selective instrumentation and overhead reduction
Improvements in performance mapping instrumentation
TAU Measurement Advances

Profiling
  • Memory profiling
      • global heap memory tracking (several options)
  • Callpath profiling
      • user-controllable calling depth
  • Improved support for multiple counter profiling
  • Online profile access and sampling
Tracing
  • Generation of VTF3 trace files (portable)
  • Inclusion of hardware performance counts in trace files
  • Hierarchical trace merging
Online performance overhead compensation
Component software proxy generation and monitoring
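The idea of callpath profiling with a user-controllable calling depth can be sketched as follows. This is an illustration only, not TAU's implementation: the `callpath_profile` helper, the sample call stacks, the timings, and the " => " separator are all assumptions made for the example.

```python
from collections import defaultdict

def callpath_profile(samples, depth):
    """Aggregate time by the innermost `depth` routines of each call stack."""
    totals = defaultdict(float)
    for stack, seconds in samples:
        key = " => ".join(stack[-depth:])
        totals[key] += seconds
    return dict(totals)

# Hypothetical measurements: (call stack from outermost to innermost, seconds)
samples = [
    (["main", "solve", "MPI_Send"], 0.5),
    (["main", "exchange", "MPI_Send"], 0.25),
    (["main", "solve", "MPI_Recv"], 1.0),
]

flat = callpath_profile(samples, depth=1)   # flat profile: routine only
paths = callpath_profile(samples, depth=2)  # parent => routine

print(flat)
print(paths)
```

With depth 1 the two `MPI_Send` entries merge into one; with depth 2 they are separated by caller. A larger depth gives more context at the cost of more profile entries per thread.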
TAU Performance Analysis Advances


Enhanced parallel profile analysis (ParaProf)
Performance Data Management Framework (PerfDMF)



Callpath analysis integration in ParaProf
Integration with Vampir Next Generation (VNG)



First release of prototype
Online trace analysis
Performance visualization (ParaVis) prototype
Component performance modeling and QoS
Component-Based Scientific Applications
How to support performance analysis and tuning process consistent with application development methodology?
Common Component Architecture (CCA) applications
Performance tools should integrate with software
Design performance observation component
  • Measurement port and measurement interfaces
Build support for application component instrumentation
  • Interpose a proxy component for each port
  • Inside the proxy, track caller/callee invocations, timings
  • Automate the process of proxy component creation
      • using PDT for static analysis of components
      • include support for selective instrumentation
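The proxy idea on this slide (interpose on a port, then record caller/callee invocations and timings) can be sketched in a few lines. This is a plain Python illustration, not the CCA or TAU API: `PortProxy`, `Integrator`, and the "Driver" caller name are hypothetical.

```python
import time
from collections import defaultdict

class PortProxy:
    """Wraps a provider object and records call counts and elapsed time."""

    def __init__(self, provider, caller_name):
        self._provider = provider
        self._caller = caller_name
        self.stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

    def __getattr__(self, method_name):
        # Only reached for names not defined on the proxy itself.
        target = getattr(self._provider, method_name)
        if not callable(target):
            return target
        callee = type(self._provider).__name__

        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)
            finally:
                entry = self.stats[(self._caller, callee, method_name)]
                entry["calls"] += 1
                entry["seconds"] += time.perf_counter() - start

        return timed

class Integrator:
    """Hypothetical component providing an integration port."""

    def integrate(self, lo, hi, n):
        h = (hi - lo) / n
        return sum((lo + (i + 0.5) * h) ** 2 for i in range(n)) * h

proxy = PortProxy(Integrator(), caller_name="Driver")
area = proxy.integrate(0.0, 1.0, 1000)
print(area, dict(proxy.stats))
```

The caller uses the proxy exactly as it would use the original port, so the component code itself needs no instrumentation. The slide's point about automation is that such wrappers can be generated from interface information rather than written by hand.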
Flame Reaction-Diffusion (Sandia, J. Ray)
CCAFFEINE
Component Modeling and Optimization

Given a set of components, where each component has multiple implementations, what is the optimal subset of implementations that solves a given problem?
  • How to model a single component?
  • How to model a composition of components?
  • How to select optimal subset of implementations?
A component only has performance meaning in context
Applications are dynamically composed at runtime
Application developers use components from others
  • Instrumentation may only be at component interfaces
  • Performance measurements need to be non-intrusive
  • Users are interested in coarse-grained performance
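The selection question above can be made concrete with a small brute-force sketch. Everything here is hypothetical: the component names, the standalone costs, the interaction penalty, and the `best_assembly` and `modeled_time` helpers stand in for whatever performance models a real system would supply.

```python
from itertools import product

def best_assembly(candidates, cost):
    """Try one implementation per component and return the cheapest choice."""
    names = list(candidates)
    best = None
    for choice in product(*(candidates[n] for n in names)):
        assembly = dict(zip(names, choice))
        total = cost(assembly)
        if best is None or total < best[1]:
            best = (assembly, total)
    return best

# Hypothetical modeled times (seconds) for each implementation in isolation
standalone = {"SolverA": 4.0, "SolverB": 3.0, "MeshX": 2.0, "MeshY": 2.5}
# Hypothetical extra cost when two implementations are composed
penalty = {("SolverB", "MeshX"): 2.0}

def modeled_time(assembly):
    impls = tuple(assembly.values())
    return sum(standalone[i] for i in impls) + penalty.get(impls, 0.0)

candidates = {"solver": ["SolverA", "SolverB"], "mesh": ["MeshX", "MeshY"]}
assembly, total = best_assembly(candidates, modeled_time)
print(assembly, total)
```

In this made-up data the fastest solver and the fastest mesh interact badly, so the best assembly pairs the fastest solver with the slower mesh. That is the sense in which a component only has performance meaning in context. Exhaustive enumeration is used for clarity and grows exponentially with the number of components.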
MasterMind Component (Trebon, IPDPS 2004)
Proxy Generator for other Applications

PDT-based proxy component for:
  • QoS tracking [Boyana, ANL]
  • Debugging Port Monitor for CCA (tracks arguments)
  • SCIRun2 Perfume components [Venkat, U. Utah]
Exploring Babel for auto-generation of proxies:
  • Direct SIDL to proxy code generation
  • Generating client component interface in C++
      • Using PDT for generating proxies
Earth System Modeling Framework

Coupled modeling with modular software framework
Instrumentation for framework and applications
  • PDT automatic instrumentation
      • Fortran 95
      • C / C++
  • MPI wrapper library for MPI calls
Component instrumentation (using CCA Components)
  • CCA measurement port manual instrumentation
  • Proxy generation using PDT and runtime interposition
Significant callpath profiling use by ESMF team
Using TAU Component in ESMF/CCA
TAU’s ParaProf Profile Browser (ESMF Data)
Callpath profile
TAU Traces with Counters (ESMF)
Visualizing TAU Traces with Counters/Samples
Uintah Computational Framework (UCF)
University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center
UCF analysis
  • Scheduling
  • MPI library
  • Components
Performance mapping
Use for online and offline visualization
ParaVis tools
[Figure label: 500 processes]
Scatterplot Displays

Each point coordinate determined by three values:
  • MPI_Reduce
  • MPI_Recv
  • MPI_Waitsome
Min/Max value range
Effective for cluster analysis
  • Relation between MPI_Recv and MPI_Waitsome
Online Uintah Performance Profiling

Demonstration of online profiling capability
Colliding elastic disks
  • Test material point method (MPM) code
  • Executed on 512 processors of ASCI Blue Pacific at LLNL
Example
  • Bargraph visualization
  • MPI execution time
  • Performance mapping
  • Multiple time steps
Miranda Performance Analysis (Miller, LLNL)

Miranda is a research hydrodynamics code
  • Fortran 95, MPI
Mostly synchronous
  • MPI_ALLTOALL on Np x,y communicators
  • Some MPI reductions and broadcasts for statistics
Good communications scaling
  • ACL and MCR: sibling Linux clusters
      • ~1000 Intel P4 nodes, dual 2.4 GHz
  • Up to 1728 CPUs
  • Fixed workload per CPU
Ported to BlueGene/L
TAU Profiling of Miranda on BG/L

Miranda team is using TAU to profile code performance
  • Routinely runs on BG/L for 1000 CPUs for 10-20 minutes
  • Scaling studies (problem size, number of processors)
[Figure panels: 128 Nodes, 512 Nodes, 1024 Nodes]
Fine Grained Profiling via Tracing

Miranda uses TAU to generate traces
  • Combines MPI calls with HW counter information
  • Detailed code behavior to focus optimization efforts
Memory Usage Analysis


BG/L will have limited memory per node (512 MB)
Miranda uses TAU to profile memory usage
  • Streamlines code
  • Squeeze larger problems on the machine
[Figure: Max Heap Memory (KB) used for 128³ problem on 16 processors of ASC Frost at LLNL]
Kull Performance Optimization (Miller, LLNL)

Kull is a Lagrange hydrodynamics code
  • Physics packages written in C++ and Fortran
  • Parallel Python interpreter run-time environment!
Scalar test problem analysis
  • Serial execution to identify performance factors
  • Original code profile indicated expensive functions
      • CSSubzonalEffects member functions
Examination revealed optimization opportunities
  • Loop merging
  • Amortizing geometric lookup over more calculations
Apply to CSSubzonalEffects member functions
Kull Optimization

CSSubzonalEffects member functions total time
  • Reduced from 5.80 seconds to 0.82 seconds
  • Overall run time reduced from 28.1 to 22.85 seconds
[Figure panels: Original Exclusive Profile, Optimized Exclusive Profile]
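The reported timings can be turned into speedup factors with a short calculation. The four numbers come from the slide; the variable names are only for this illustration.

```python
# Timings reported on the slide, in seconds
func_before, func_after = 5.80, 0.82   # CSSubzonalEffects member functions
run_before, run_after = 28.1, 22.85    # overall run time

func_speedup = func_before / func_after
run_speedup = run_before / run_after
func_saved = func_before - func_after
run_saved = run_before - run_after

print(f"function speedup: {func_speedup:.2f}x, time saved: {func_saved:.2f} s")
print(f"overall speedup:  {run_speedup:.2f}x, time saved: {run_saved:.2f} s")
```

The optimized functions run roughly seven times faster, while the whole run improves by a factor of about 1.2, because those functions accounted for only part of the original run time.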
Important Questions for Application Developers










How does performance vary with different compilers?
Is poor performance correlated with certain OS features?
Has a recent change caused unanticipated performance?
How does performance vary with MPI variants?
Why is one application version faster than another?
What is the reason for the observed scaling behavior?
Did two runs exhibit similar performance?
How are performance data related to application events?
Which machines will run my code the fastest and why?
Which benchmarks predict my code performance best?
Multi-Level Performance Data Mining

New (just forming) research project
  • PSU: Karen L. Karavanic
  • Cornell: Sally A. McKee
  • UO: Allen D. Malony and Sameer Shende
  • LLNL: John M. May and Bronis R. de Supinski
Develop performance data mining technology
  • Scientific applications, benchmarks, other measurements
  • Systematic analysis for understanding and prediction
  • Better foundation for evaluation of leadership-class computer systems
Goals

Answer questions at multiple levels of interest
  • Data from low-level measurements and simulations
      • use to predict application performance
      • data mining applied to optimize data gathering process
  • High-level performance data spanning dimensions
      • machine, applications, code revisions
      • examine broad performance trends
Need technology
  • Performance data instrumentation and measurement
  • Performance data management
  • Performance analysis and results presentation
  • Automated performance exploration
Specific Goals
Design, develop, and populate a performance database
Discover general correlations between application performance and features of their external environment
Develop methods to predict application performance from lower-level metrics
Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
Performance data mining infrastructure is important for all of these goals
Establish a more rational basis for evaluating the performance of leadership-class computers
PerfTrack: Performance DB and Analysis Tool
PSU: Kathryn Mohror, Karen Karavanic
UO: Kevin Huck, Allen D. Malony
LLNL: John May, Brian Miller (CASC)
TAU Performance Data Management Framework
TAU Performance Regression (PerfRegress)
Background – Ahn & Vetter, 2002



“Scalable Analysis Techniques for Microprocessor Performance Counter Metrics,” SC2002
Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
Cluster Analysis and F-Ratio
  • Agglomerative Hierarchical Method: dendrogram identified groupings of master, slave threads in sPPM
  • K-means clustering and F-ratio: differences between master, slave related to communication and management
Factor Analysis
  • shows highly correlated metrics fall into peer groups
Combined techniques (recursively) lead to observations of application behavior hard to identify otherwise
Thread Similarity Matrix

Apply techniques from the phase analysis (Sherwood)
  • Threads of execution can be visually compared
  • Threads with abnormal behavior show up as less similar than other threads
Each thread is represented as a vector (V) of dimension n
  • n is the number of functions in the application
  • V = [f1, f2, …, fn] (represent event mix)
  • Each value is the percentage of time spent in that function
      • normalized from 0.0 to 1.0
Distance calculated between the vectors U and V:
$$\mathrm{ManhattanDistance}(U, V) = \sum_{i=1}^{n} |u_i - v_i|$$
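A minimal sketch of the thread similarity computation described above, assuming each thread is given as a list of per-function times. The `normalize`, `manhattan`, and `distance_matrix` helpers and the three sample threads are hypothetical.

```python
def normalize(times):
    """Convert per-function times into fractions of the thread's total."""
    total = sum(times)
    return [t / total for t in times]

def manhattan(u, v):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(threads):
    """Pairwise Manhattan distances between normalized thread vectors."""
    vectors = [normalize(t) for t in threads]
    return [[manhattan(u, v) for v in vectors] for u in vectors]

# Hypothetical per-function times (seconds) for three threads
threads = [
    [6.0, 3.0, 1.0],   # thread 0
    [12.0, 6.0, 2.0],  # thread 1: twice as slow, same event mix
    [1.0, 3.0, 6.0],   # thread 2: different event mix
]

matrix = distance_matrix(threads)
for row in matrix:
    print(row)
```

Because the vectors are normalized, threads 0 and 1 have distance zero even though their absolute times differ, while thread 2 stands out. For vectors of fractions that each sum to 1, the distance ranges from 0.0 (identical mix) to 2.0 (no overlap).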
sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47
sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower
sPPM on Frost (PAPI_FP_INS, 256 threads)


View of fewer than half of the threads of execution is possible on the screen at one time
Three groups are obvious:
  • Lower ranking threads
  • One unique thread
  • Higher ranking threads
      • 3% more FP
Finding subtle differences is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)

Dendrogram shows 5 natural clusters:
 Unique thread
 High ranking master threads
 Low ranking master threads
 High ranking worker threads
 Low ranking worker threads
• TAU profiles
• PerfDMF
• R access
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
masters
slaves
sPPM on Frost (PAPI_FP_INS, 256 threads)

After k-means clustering into 5 clusters
  • Similar natural clusters are grouped
  • Each group's performance characteristics analyzed
  • 256 threads of data have been reduced to 5 clusters!
[Figure labels: 10, 119, 1, 6, 120; SPPM, INTERF, DIFUZE, DINTERF, Barrier [OpenMP:runhyd3.F <604,0>]]
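The clustering step can be illustrated with a plain k-means (Lloyd's algorithm) sketch. The talk's analysis used R through PerfDMF on 256 threads; the pure-Python `kmeans` function below, its first-k-points initialization, and the six two-metric data points are assumptions made only to show the technique at a small scale.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical per-thread metrics (two scaled counter values per thread)
points = [
    (1.0, 0.1), (5.0, 0.9),
    (1.1, 0.1), (5.2, 1.0),
    (0.9, 0.2), (4.9, 1.1),
]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each cluster can then be summarized by its centroid, so the per-thread data reduces to one representative vector per cluster. The fixed iteration count and simple initialization keep the sketch short; production tools typically use better seeding and a convergence test.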
Extreme Performance Scalable OSs (ZeptoOS)

DOE, Office of Science
  • OS / RTS for Extreme Scale Scientific Computation
  • Argonne National Lab and University of Oregon
Investigate operating system and run-time (OS/R) functionality required for scalable components used in petascale architectures
  • Flexible OS/R functionality
  • Scalable OS/R system calls
  • Performance tools, monitoring, and metrics
  • Fault tolerance and resiliency
Approach
  • Specify OS/R requirements across scalable components
  • Explore flexible functionality (Linux)
  • Hierarchical designs optimized with collective OS/R interfaces
  • Integrated (horizontal, vertical) performance measurement / analysis
  • Fault scenarios and injection to observe behavior
ZeptoOS Plans


Explore Linux functionality for BG/L
Explore efficiency for ultra-small kernels
  • Scheduler, memory, IO
Construct kernel-level collective operations
  • Support for dynamic library loading, …
Build Faulty Towers Linux kernel and system for replaying fault scenarios
Extend TAU
  • Profiling OS suites
  • Benchmarking collective OS calls
  • Observing effects of faults
Discussion
As high-end systems scale, it will be increasingly important that performance tools be used effectively
Performance observation methods do not necessarily need to change in a fundamental sense
  • Just need to be controlled and used efficiently
More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
Deliver to community next-generation
Support Acknowledgements


Department of Energy (DOE)
  • Office of Science contracts
  • University of Utah ASCI Level 1 sub-contract
  • ASC/NNSA Level 3 contract
NSF
  • High-End Computing Grant
Research Centre Juelich
  • John von Neumann Institute
  • Dr. Bernd Mohr
Los Alamos National Laboratory