Allinea Tools - Computational Information Systems Laboratory

Programming weather, climate, and earth-system models
on heterogeneous multi-core platforms Conference Sept 7 & 8
Allinea DDT 3.0
For Debugging Challenge for Weather, Climate and Earth-systems models
David Maples
Allinea Software Inc
[email protected]
www.allinea.com
HPC World
• High Performance Computing needs ever-increasing compute power
• Performance improvements will come from:
– Concurrency and multi-core architectures
– Optimized software
• Writing or migrating software for concurrency is more complex and requires different tools and skills
[Chart: Systems in Top 500 with 8k - 32k cores and with 32k+ cores, by year (June & November lists), 2006 to 2010; vertical axis 0 to 180]
New Market Drivers
• “Software has become the #1 roadblock … Many applications will need a major redesign”
• IDC HPC Update, June 2010
– Most ISV codes do not scale
– High programming costs are delaying GPU usage
• Development tools are a vital part of the solution
Allinea Software
• HPC tools company since 2001
– Allinea DDT: Scalable parallel debugger
– Allinea OPT: Optimization tool for MPI and non-MPI
• Large U.S. and European customer base
– 12 of the top 20 systems in EMEA run Allinea DDT
– Most scalable and cost effective debugger for CUDA
– Users debug at all scales, from 1 to 100,000 cores and beyond, and it is also easy to use on small clusters!
– World's only Petascale debugger!
Clients and Partners
Aviation and Defence
Climate and Weather
Energy
Electronic Design Automation
Academic
Over 200 universities
Allinea Clients in Climate
• Weather and Climate are a great fit
– HLRS, our first user in Germany in 2004
– NERSC
– Met Office (UK)
– Proudman (UK)
– Irish Centre for High-End Computing (ICHEC)
– British Geological Survey (BGS) (UK)
– IFREMER (France)
– Meteo France
– NOAA (USA, Cray Linux)
– Mercator (France)
– US Navy – Fleet Numeric
– BoM (Australia)
– Royal Meteorological Institute of Belgium (IRM)
Collaborations
Partnership to develop Petascale debugger with NVIDIA support
Partnership to develop Petascale/Exascale tools and standards
Partnership on Full Scale debugging on IBM Blue Gene/P and /Q
Allinea DDT is “Debugger of Choice” on NERSC 5 and NERSC 6 and first implementation on Cray XE6
Partnership with CEA (French Atomic Energy Authority) on scalable programming and CUDA
Partnership on the Keeneland project to help solve software challenges introduced by mixed architectures
Allinea Software Collaborations
– Technical Collaboration Results - examples
• Cray
– Scalability - Most Scalable Debugger for Cray
– Fast Track support – Rapid Debugging exclusive from Cray
– UPC and CAF Support
– Cray User Group
– Titan Debugger Development collaboration
– In house expertise on Allinea Software
• SGI
– UPC and CAF Support for SGI Compiler
– SGI Training for Allinea Users
– In house Expertise on Allinea Software
• IBM
– Enhanced BlueGene Support for Scalable Debugging for BG/P and future
• Nvidia
– Allinea DDT with CUDA Support
Developed on Jaguar
Shipped commercially since April 2010
What is the value to your work?
Scalability, Ease of Use and Intuitive GUI
- Allinea DDT extended capabilities
- Allinea Joint Development deliverables are included in the Standard Product
- Allinea collaboration with you to build new capabilities for your market
- Allinea support for current and future architectures
- Large group of DDT users in Weather/Climate
Allinea DDT - Key capabilities for WC&E
Use a Parallel Debugger
• Many benefits to graphical parallel debuggers
– Large feature sets for common bugs
– Richness of user interface and real control of processes
• Historically all parallel debuggers hit scale problems
– Bottleneck at the front-end: Direct GUI → nodes architectures
• Linear performance in number of processes
– Human factors limit – mouse fatigue and brain overload
• Are tools ready for the task?
– Allinea DDT has changed the game
Achievements
• Allinea DDT: First debugger with MPI and CUDA debugging
– Simplifying hybrid debugging
– Strong partnership with NVIDIA enables support for latest toolkit
• Allinea DDT new releases support new capabilities
– June 2010: DDT 2.6, Nvidia Toolkit 3.0
– December 2010: DDT 2.6.1, Nvidia Toolkit 3.1 and 3.2
– April 2011: DDT 3.0, Scalability and More
• Allinea DDT smashes the Petascale barrier
– 220,000 core debugging delivered to Oak Ridge National Laboratory
– Full set of core capabilities with global ~100ms timings
Allinea DDT 3.0
• Petascale Architecture: Common collective process operations complete in a fraction of a second, even at over 200,000 cores!
• Smart Highlighting: Automated display of the differences between processes and the changing of variable values
• Visualization: New distributed multiple-dimensional array viewer with filtering
• Faster C++ debugging: Automatic display of STL, Boost and Qt variables
• Cross Process Comparison: Improved scalable cross process comparison
• Attaching to Jobs: Improved Attach window lets you easily find and select MPI jobs and attach to subsets
• HMPP Support: DDT 3.0 introduces support for CAPS HMPP
• Tracepoints: Intelligent logging and merging of variable history during program execution
DDT in a nutshell
• Scalar features
– Advanced C++ and STL
– Fortran 90, 95 and 2003: modules, allocatable
data, pointers, derived types
– Memory debugging
• Multithreading & OpenMP features
– Step, breakpoint etc. one or all threads
• MPI features
– Easy to manage groups
– Control processes by groups
– Compare data
– Visualize message queues
DDT Platforms
• x86, x86_64
– Operating System: RHEL 4, 5, 6; SLES 10, 11; Fedora 4 and above; Ubuntu 8.04 and above
– MPI: All known MPIs including SGI Altix, Bproc, Bull MPI 1 and 2, LAM-MPI, MPICH, Myricom MPICH-GM and MPICH-MX, Open MPI, Quadrics MPI, Platform (Scali) MPI, SCore, Scyld, Intel MPI, Slurm, MVAPICH
– Compilers: GNU, Absoft, Intel, Pathscale, PGI, Sun
• Cell BE
– Operating System: Fedora Core 7, Yellow Dog
– MPI: As above
– Compilers: Cell BE SDK 3.0
• IBM Power
– Operating System: AIX 5.3 and above
– MPI: IBM PE, MPICH
– Compilers: Native, GNU
• Sun Sparc
– Operating System: Solaris 9 and above
– MPI: Sun Clustertools 5 and above
– Compilers: Native - Studio 11
• Sun Solaris Opteron
– Operating System: Solaris 10 and above
– MPI: Sun Clustertools 6, 7 and MPICH
– Compilers: Native - Studio 11/12, GNU
• Cray XT/XE
– Operating System: SLES 10, 11 (frontend)
– MPI: Cray MPT (aprun) and Open MPI
– Compilers: Cray, PGI, Pathscale, Intel, GNU
• Blue Gene/P
– Operating System: SLES 10 (frontend)
– MPI: Native
– Compilers: Native, GNU
• NEC SX 9
– Operating System: SUPER-UX 15.1 (backend only)
– MPI: Native
– Compilers: Native
CAPS HMPP Support
● Automatic detection of HMPP code fragments and set breakpoints before/after kernel
● Step-over a kernel
– Ignore HMPP wrapper layers
● Suppress stack of HMPP internals to report only user code and high-level name of HMPP fragment
● Obtain error codes (if possible) from HMPP kernels
Handling Regular Bugs
• Immediate stop on crash
– Segmentation fault, or other
memory problems
– Abort, exit, error handlers
– CUDA errors
• Scalable handling of error
messages
• Leaps to the problem
– Source code highlighted
– Affected processes shown
– Process stacks displayed
clearly in parallel
Finding the cause
• Full class/structure browsing
– Locals and Current line(s)
• Show variables relevant to
current position
• Drag in the source code to see
more
– C, C++, F90: object members,
static members and derived
types
• Automatic comparison and
change detection
– Scalable and fast
Smart Highlighting
• Compare variables across processes and instantly detect changes:
Fast and
scalable!
−Blue: Value change
−Green: Different value on
other process(es)
• Full class/structure browsing
− Local variables and current line(s)
• Show variables relevant to current position
• Drag in the source code to see more
− C, C++, F90: object members, static members, derived types
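The two highlight colours can be read as two independent checks on a variable. The sketch below is only an illustration of that idea, not DDT's implementation: the function name `highlight_flags` and its arguments are hypothetical, and how the real tool combines the two states when both apply is not stated in these slides.

```python
def highlight_flags(previous, current_by_rank, rank):
    """Return (changed, differs) for the variable on one process.

    previous        -- value seen on this rank at the last stop (hypothetical input)
    current_by_rank -- dict mapping rank -> current value of the same variable
    changed         -- True if the value changed since the last stop ("blue")
    differs         -- True if some other process holds a different value ("green")
    """
    value = current_by_rank[rank]
    changed = value != previous
    differs = any(other != value for other in current_by_rank.values())
    return changed, differs


if __name__ == "__main__":
    values = {0: 3.0, 1: 3.0, 2: 4.0}
    print(highlight_flags(3.0, values, 2))  # rank 2 changed and disagrees with the others
```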
Finding rogue processes
• Easy to find where differences are:
– Cross process comparison of data
•Fetches values from every process,
compares and then groups by value
•Summary of NaN, Inf and statistics
– Easy to spot rogues
• Use to group processes
– Define a process group and control it en masse
Cross Process Comparison
• Analyse expressions calculated on each
process in the current process group
• Cross process comparison of data
• Fetches values from every process,
compares and then groups by value
• Summary of NaNs, Infs and statistics
• Easy to spot rogue processes!
• Use to group processes
–Define a process group
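The "fetch, compare, group by value" idea can be sketched in a few lines. This is a minimal illustration under the assumption that one floating-point value per process has already been gathered into a list indexed by rank; the function name `compare_across_processes` is hypothetical and is not part of DDT.

```python
import math
from collections import defaultdict


def compare_across_processes(values):
    """Group ranks by value and summarise NaNs, Infs and basic statistics.

    values[rank] is the value of the expression on that rank.
    """
    groups = defaultdict(list)
    for rank, value in enumerate(values):
        # NaN never compares equal to itself, so bucket all NaNs under one key.
        key = "nan" if math.isnan(value) else value
        groups[key].append(rank)

    finite = [v for v in values if math.isfinite(v)]
    summary = {
        "nan_count": sum(1 for v in values if math.isnan(v)),
        "inf_count": sum(1 for v in values if math.isinf(v)),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "mean": sum(finite) / len(finite) if finite else None,
    }
    return dict(groups), summary


if __name__ == "__main__":
    groups, summary = compare_across_processes(
        [1.0, 1.0, float("nan"), 1.0, float("inf"), 2.5]
    )
    # Small groups are the candidates for "rogue" processes.
    for key, ranks in sorted(groups.items(), key=lambda kv: len(kv[1])):
        print(key, ranks)
    print(summary)
```

Sorting the groups by size puts the smallest groups first, which is where a rogue process is most likely to show up.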
Visualization
3-D Visualization of distributed data using the Multi-Dimensional Array viewer
• Large Array Support
• Browse arrays
– 1, 2, 3… dimensions
– Table view
• Filtering
– Look for an outlying value
• Export
– Save to a spreadsheet
• View arrays from multiple processes
– Search through terabytes for rogue data
in parallel
Tracepoints
• Intelligent logging and merging of variable history during execution
• “Scalable printf”:
– No need to recompile your program
– Merging helps prevent information overload: Network traffic and user interface
– Add conditions to filter output
• Allows you to view both the data
and the lines of code your program is
executing without stopping
– View program flow and state
quickly over multiple iterations
•Save output for offline analysis – Free
up system resources
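Why merging prevents information overload can be shown with a small sketch: identical tracepoint records from many processes collapse into one line that lists the ranks. The record layout `(rank, location, text)` and the function names below are assumptions made for illustration only; they do not describe DDT's internal format.

```python
from collections import defaultdict


def compress(ranks):
    """Turn a sorted list of ranks into a compact range string, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = r
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def merge_tracepoints(records):
    """Merge records with the same location and text into one output line."""
    merged = defaultdict(set)
    for rank, location, text in records:
        merged[(location, text)].add(rank)
    return [
        f"{location}: {text} [ranks {compress(sorted(ranks))}]"
        for (location, text), ranks in merged.items()
    ]


if __name__ == "__main__":
    records = [(r, "solver.f90:42", "iter=1") for r in range(3)]
    records.append((3, "solver.f90:42", "iter=2"))
    for line in merge_tracepoints(records):
        print(line)
```

With this scheme the output grows with the number of distinct messages rather than with the number of processes.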
Improved C++ debugging
• Faster startup when debugging C++ codes
– Much improved performance for heavily templated code
• Edit Type Feature
– Helps viewing polymorphic types
• Automatic display of STL, Boost and Qt containers
– Easily view the contents of the data structure
• Easily de-reference pointers
[Figure: variable view before and after]
Attaching to Jobs
• Improved Attach window allows you to easily find and select MPI jobs and attach
to running processes
• Clicking the Attach to a Running Program button on the Welcome Screen will
show DDT's Attach Window:
– List of automatically detected MPI jobs: No need to select individual processes
– Or you can manually select from a list of processes if required
Memory Debugging
Find memory leaks
Or stop on read/write beyond end of array:
Debugging at Scale
Problems at Scale
• Increasing job sizes leads to unanticipated errors
– Regular bugs
• Data issues from larger data sets – eg. garbage in..., overflow
• Logic issues and control flow
– Increasing probability of independent random error
• Memory errors/exhaustion – “random” bugs!
• System problems – MPI and operating system
– Pushing coded boundaries
• Algorithmic (performance)
• Hard-wired limits (“magic numbers”)
– Unknown unknowns
• ....
Strategies for bug fixing I
• Improved coding standards – unit tests, assertions
– Good practice – but coverage is rarely perfect
• Random/system issues – often missed
– Combines well with debuggers
• Find why a failure occurs not just a pass/fail
• Logging – printf and write
– If you have good intuition into the problem
• Edit code, insert print, recompile and re-run
• Slow and iterative
– Post-mortem analysis only
• Hard to establish the real order of output from multiple processes
• Rapid growth in log output size
• Unscalable
Strategies for bug fixing II
• Reproduce at a smaller scale
– Attempt to make problem happen on fewer nodes
• Often requires reduced data set – the large one may not fit
– Smaller data set may not trigger the problem
• Does the bug even exist on smaller problems?
– Didn't you already try the code at small scale?
• Is it a system issue – eg. an MPI problem?
– Is probability stacking up against you?
• Unlikely to spot on smaller runs – without many many runs
• But near guaranteed to see it on a many-thousand core run
– What can a parallel debugger do to help?
• Debug at the scale of the problem - Now.
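The "probability stacking up" point is simple arithmetic: if each process independently hits a rare fault with probability p during a run, the chance that at least one of N processes hits it is 1 - (1 - p)^N. The sketch below works this through; the per-process probability of 1e-5 is a made-up figure chosen only to illustrate the effect, and independence between processes is an assumption.

```python
def failure_probability(p_per_process, n_processes):
    """Probability that at least one of n independent processes hits the fault."""
    return 1.0 - (1.0 - p_per_process) ** n_processes


if __name__ == "__main__":
    p = 1e-5  # hypothetical per-process, per-run fault probability
    for n in (8, 1_000, 200_000):
        print(f"{n:>7} processes: {failure_probability(p, n):.4f}")
```

Under these assumptions the fault is almost never seen on 8 processes but shows up in most runs at 200,000, which is why reproducing at small scale can take many runs.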
Scalable Process Control
•Parallel Stack View
• Finds rogue processes quickly
• Identify classes of process
behaviour
• Rapid grouping of processes
•Control Processes by Groups
• Set breakpoints, step, play, stop
for groups
• Scalable groups view: compact
group display
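A parallel stack view can be thought of as a prefix tree of call stacks, where each node records which processes are at that frame. The following is a minimal sketch of that idea, assuming each process's stack is available as a list of frame names with the outermost frame first; `merge_stacks` and `render` are hypothetical names and this is not DDT's implementation.

```python
def merge_stacks(stacks):
    """Merge per-rank call stacks into a tree.

    stacks maps rank -> list of frame names, outermost frame first.
    Each tree node holds the ranks that pass through it and its child frames.
    """
    root = {"ranks": [], "children": {}}
    for rank in sorted(stacks):
        node = root
        for frame in stacks[rank]:
            node = node["children"].setdefault(frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
    return root


def render(node, depth=0):
    """Return indented lines of the form 'frame [N procs]'."""
    lines = []
    for frame, child in node["children"].items():
        lines.append(f"{'  ' * depth}{frame} [{len(child['ranks'])} procs]")
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    stacks = {
        0: ["main", "solve", "MPI_Allreduce"],
        1: ["main", "solve", "MPI_Allreduce"],
        2: ["main", "solve", "MPI_Allreduce"],
        3: ["main", "io_write"],
    }
    print("\n".join(render(merge_stacks(stacks))))
```

In this example rank 3 stands out as the only process not waiting in the collective, and the rank list on its node could be used directly as a process group.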
Petascale Architecture
DDT 3.0 Performance Figures
• Logarithmic performance due to new tree architecture
• Many operations are now faster at 220,000 cores than previously at 1,000 cores
• ~1/10th of a second to step and gather all stacks at 220,000 cores
• Developed due to collaborations with ORNL on Jaguar Cray XT, ANL and CEA
• A massive performance revolution for every user’s benefit!
[Chart: Time (Seconds), 0 to 0.12, against MPI Processes, 0 to 200,000, for All Step and All Breakpoint]
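The difference between a direct front-end and a tree can be made concrete by counting how many levels a tree needs to reach every process. The sketch below is a back-of-envelope model only: the fan-out of 32 is a hypothetical value (the slides do not give DDT's actual tree shape), and the "critical path" figure assumes each node handles its children one after another.

```python
def tree_depth(n_processes, fan_out):
    """Number of tree levels needed so that fan_out ** depth >= n_processes."""
    depth, reach = 0, 1
    while reach < n_processes:
        reach *= fan_out
        depth += 1
    return depth


def critical_path_messages(n_processes, fan_out=None):
    """Messages handled one after another on the slowest path.

    Flat (fan_out=None): the front-end talks to every process directly.
    Tree: each level handles at most fan_out children.
    """
    if fan_out is None:
        return n_processes
    return tree_depth(n_processes, fan_out) * fan_out


if __name__ == "__main__":
    for n in (1_000, 220_000):
        print(n, "flat:", critical_path_messages(n),
              "tree:", critical_path_messages(n, fan_out=32))
```

In this model the flat design grows linearly with the process count while the tree grows with its depth, which is consistent with the logarithmic behaviour described on the slide.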
Debugging GPU Applications
CUDA Debugging Options
• Old world “printf”
– NVIDIA SDK 3.0 allows this but with limitations
• Fake it – Run the kernel on the host x86_64 processor
– Languages often support targeting host CPU instead of GPU
– Different numeric precision – different answer?
– Different scheduling – different answer?
– A reasonable option for some bugs
• Or run on the GPU with Allinea DDT...
GPUs Made Easy
• View all threads in parallel
stack view
• At one glance, see all GPU
and CPU threads together
• Links with thread selection
• Pick a tree node to select one
of the CUDA threads at that
location
• Full MPI support
• See GPU and CPU threads
from multiple nodes
Debugging Kernels
• Debugging CPU and GPU concurrently
– Browse source, examine variables, control processes and threads
• Set breakpoints
– Automatically stop on kernel launch
– Stop at a line of CUDA code
– Kernels stop when breakpoint reached
– Hover the mouse for more information
• Step a warp
- 32 CUDA threads
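Stepping a warp moves 32 CUDA threads together, so it helps to know which threads share a warp. The sketch below computes the warp and lane of a thread from its index within a block, using the usual CUDA linearisation (x fastest, then y, then z). It is plain Python arithmetic for illustration, not CUDA code, and the function name `warp_of` is hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide


def warp_of(thread_idx, block_dim):
    """Return (warp index, lane within warp) for a thread inside its block.

    thread_idx and block_dim are (x, y, z) tuples.
    """
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    linear = x + y * dim_x + z * dim_x * dim_y
    return linear // WARP_SIZE, linear % WARP_SIZE


if __name__ == "__main__":
    # In a 16x16 block, each warp covers two consecutive rows of threads.
    for idx in [(0, 0, 0), (15, 1, 0), (0, 2, 0)]:
        print(idx, warp_of(idx, (16, 16, 1)))
```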
Examine Thread Data
• At a glance display of variables
– Expressions, local variables, and current line
– Also possible to edit values
• Displays the memory types
– shared, parameter, constant, register, …
DDT CUDA Status
• NVIDIA SDK 3.1, SDK 3.2
• Allinea DDT 3.0
– Multi-device support
– Fermi and Tesla support
– CUDA Memcheck support for memory
errors
– MPI and CUDA support for GPU
clusters
– Breakpoints, thread control, and data
evaluation
– Stop on kernel launch
Summary
• Debuggers are the right tools to fix bugs quickly
– Other methods have limited success and issues at scale
• Allinea DDT scales in both performance and interface
– Breaking all records and making problems manageable
– Be sure to get DDT release 3.0!
• Allinea DDT supports NVIDIA CUDA with the ability to
debug code running on both the CPU and GPU
– NVIDIA SDK 3.1, SDK 3.2 and available for SDK 4.0
• Contact [email protected] or [email protected]
Thank You
David Maples
Allinea Software Inc.
2033 Gateway Pl.
San Jose, Ca.
408 884 0282
[email protected]
www.allinea.com