Runtime Optimizations - Parallel Programming Laboratory


Techniques for Developing Efficient
Petascale Applications
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Outline
• Basic techniques for attaining good performance
• Scalability analysis of algorithms
• Measurements and tools
• Communication optimizations:
  – Communication basics
  – Overlapping communication and computation
  – Alpha-beta optimizations
    • Combining and pipelining
  – (Topology awareness)
• Sequential optimizations
• (Load balancing)
Examples based on multiple applications:
• Quantum Chemistry (QM/MM)
• Protein Folding
• Molecular Dynamics
• Computational Cosmology
• Crack Propagation
• Space-time meshes
• Rocket Simulation
• Dendritic Growth
All built on Parallel Objects, an Adaptive Runtime System, and Libraries and Tools.
Analyze Performance with both:
Simple as well as
Sophisticated Tools
Simple techniques
• Timers: wall timer (time.h)
• Counters: use the PAPI library's raw counters, etc.
– Esp. useful:
• number of floating point operations,
• cache misses (L2, L1, ..)
• Memory accesses
• Output method:
– “printf” (or cout) can be expensive
– Store timer values into an array or buffer, and print at the end
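
A minimal sketch (not from the slides) of the two points above: a wall-clock timer plus buffered output, so printing happens once at the end and does not perturb the region being measured.

  // wall-clock timing with buffered output
  #include <chrono>
  #include <cstdio>
  #include <vector>

  static std::vector<double> samples;   // one entry per timed step

  template <typename F>
  void timed_step(F&& step) {
    auto t0 = std::chrono::steady_clock::now();
    step();                                            // region being measured
    std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - t0;
    samples.push_back(dt.count());                     // store, don't print yet
  }

  void report() {                                      // print once, at the end
    for (size_t i = 0; i < samples.size(); ++i)
      std::printf("step %zu: %.6f s\n", i, samples[i]);
  }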
Sophisticated Tools
• Many tools exist
• Have some learning curve, but can be beneficial
• Example tools:
– Jumpshot
– TAU
– Scalasca
– Projections
– Vampir ($$)
• PMPI interface:
– Allows you to intercept MPI calls
• So you can write your own tools
– PMPI interface for projections:
• git://charm.cs.uiuc.edu/PMPI_Projections.git
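
A minimal sketch of a PMPI interposition wrapper (the standard MPI profiling interface; this is not the PMPI_Projections code linked above): every MPI_Send is counted and then forwarded to PMPI_Send. MPI-3 const-correct signatures are assumed.

  #include <mpi.h>
  #include <cstdio>

  static long send_count = 0;
  static long send_bytes = 0;

  extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                          int dest, int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(type, &size);
    ++send_count;
    send_bytes += static_cast<long>(count) * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);  // real implementation
  }

  extern "C" int MPI_Finalize() {
    std::printf("sends: %ld, bytes: %ld\n", send_count, send_bytes);
    return PMPI_Finalize();
  }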
Example: Projections Performance Analysis Tool
• Automatic instrumentation via runtime
• Graphical visualizations
• More insightful feedback
  – because the runtime understands application events better
Exploit sophisticated Performance analysis tools
• We use a tool called “Projections”
• Many other tools exist
• Need for scalable analysis
• A not-so-recent example:
  – Trying to identify the next performance obstacle for NAMD
    • Running on 8,192 processors, with a 92,000-atom simulation
    • Test scenario: without PME
    • Time is 3 ms per step, but the lower bound is 1.6 ms or so
Performance Tuning with
Patience and Perseverance
Performance Tuning with Perseverance
• Recognize multi-dimensional nature of the
performance space
• Don’t stop optimizing until you know for sure why it cannot be sped up further
– Measure, measure, measure ...
Charm++’s “Projections” analysis tool: Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency.
Shallow valleys, high peaks, nicely overlapped PME.
[Timeline view: time intervals on the x axis, activity summed across processors on the y axis.
Red: integration; blue/purple: electrostatics; orange: PME; turquoise: angle/dihedral; green: communication.]
Cray XT3, 512 processors, initial runs: 76% efficiency.
Clearly needed further tuning, especially PME, but had more potential (much faster processors).
On Cray XT3, 512 processors, after optimizations: 96% efficiency.
Communication Issues
Recap: Communication Basics: Point-to-point

[Diagram: sending processor -> sending co-processor -> network -> receiving co-processor -> receiving processor]

Each component has a per-message cost and a per-byte cost; for an n-byte message, each cost has the form α + nβ.

Important metrics:
• Overhead at processor and co-processor
• Network latency
• Network bandwidth consumed
• Number of hops traversed
Communication Basics
• Message Latency: time between the application
sending the message and receiving it on the other
processor
• Send overhead: time for which the sending
processor was “occupied” with the message
• Receive overhead: the time for which the
receiving processor was “occupied” with the
message
• Network latency
Communication: Diagnostic Techniques
• A simple technique: (find “grainsize”)
– Count the number of messages per second of computation per processor!
(max, average)
– Count number of bytes
– Calculate: computation per message (and per byte)
• Use profiling tools:
– Identify time spent in different communication operations
– Classified by modules
• Examine idle time using time-line displays
– On important processors
– Determine the causes
• Be careful with “synchronization overhead”
  – It may be load imbalance masquerading as synchronization overhead
  – A common mistake
Communication: Problems and Issues
• Too small a Grainsize
– Total Computation time / total number of messages
– Separated by phases, modules, etc.
• Too many, but short messages
– α vs. β tradeoff
• Processors wait too long
• Later:
– Locality of communication
• Local vs. non-local
• How far is non-local? (Does that matter?)
– Synchronization
– Global (Collective) operations
• All-to-all operations, gather, scatter
• We will focus on communication cost (grainsize)
Communication: Solution Techniques
• Overview:
– Overlap with Computation
• Manual
• Automatic and adaptive, using virtualization
– Increasing grainsize
– α-reducing optimizations
• Message combining
• communication patterns
– Controlled Pipelining
– Locality enhancement: decomposition control
• Local-remote and bandwidth reduction
– Asynchronous reductions
– Improved Collective-operation implementations
Overlapping Communication-Computation
• Problem:
– Processors wait for too long at “receive” statements
• Idea:
– Instead of waiting for data, do useful work
– Issue: How to create such work?
• Can’t depend on the data to be received
• Routine communication optimizations in MPI
– Move sends up and receives down
• Keep data dependencies in mind..
– Moving receive down has a cost: system needs to buffer message
• Use irecvs, but be careful
• irecv allows you to post a buffer for a recv, but not wait for it
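
A minimal sketch of the manual overlap described above (buffer names and sizes are illustrative): post the receive early, do work that does not depend on the incoming data, and wait only when the data is actually needed.

  #include <mpi.h>
  #include <vector>

  void exchange_and_compute(int left, int right, MPI_Comm comm) {
    std::vector<double> halo_in(1024), halo_out(1024), interior(1 << 20);

    MPI_Request reqs[2];
    MPI_Irecv(halo_in.data(), static_cast<int>(halo_in.size()), MPI_DOUBLE,
              left, 0, comm, &reqs[0]);       // post buffer early, don't wait
    MPI_Isend(halo_out.data(), static_cast<int>(halo_out.size()), MPI_DOUBLE,
              right, 0, comm, &reqs[1]);

    for (double& x : interior) x *= 0.5;      // work independent of the
                                              // incoming message

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // wait only when halo_in
    // ... now it is safe to use halo_in ...    // is actually needed
  }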
Major analytical/theoretical techniques
• Typically involves simple algebraic formulas, and ratios
– Typical variables are:
• data size (N), number of processors (P), machine constants
– Model performance of individual operations, components, algorithms in
terms of the above
• Be careful to characterize variations across processors, and model them
with (typically) max operators
– E.g. max over processors of {Load_i}
– Remember that constants are important in practical parallel computing
• Be wary of asymptotic analysis: use it, but carefully
• Scalability analysis:
– Isoefficiency
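
For example, a hedged sketch of the kind of per-step model the max operator gives (W_i is the computation load on processor i, and each of its m_i messages costs α + nβ as before):

  T_{\mathrm{step}} \approx \max_{1 \le i \le P} \Big( W_i + \sum_{j=1}^{m_i} (\alpha + n_j \beta) \Big)

Constants matter here: the same asymptotic form can hide an α that differs by an order of magnitude between machines.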
Analyze Scalability of the Algorithm
(say via the iso-efficiency metric)
Scalability
• The Program should scale up to use a large
number of processors.
– But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
– If I double the number of processors, I should be able to retain
parallel efficiency by increasing the problem size
Isoefficiency Analysis
• An algorithm (*) is scalable if
– If you double the number of processors available to it, it can
retain the same parallel efficiency by increasing the size of the
problem by some amount
– Not all algorithms are scalable in this sense..
– Isoefficiency is the rate at which the problem size must be
increased, as a function of number of processors, to keep the
same efficiency
[Figure: equal-efficiency curves in the (processors, problem size) plane.]

Parallel efficiency: η = T1 / (Tp · P), where T1 is the time on one processor and Tp the time on P processors.
Use η(P, N) = η(x·P, y·N) to get the isoefficiency equation.
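
Restating the definition above as a formula (a sketch in the slide's notation):

  \eta(P, N) = \frac{T_1(N)}{P \, T_P(N)}, \qquad \text{isoefficiency: find } N(P) \text{ with } \eta(P, N(P)) = \text{constant}

The isoefficiency function is how fast N(P) must grow; slower growth means a more scalable algorithm.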
Gauss-Jacobi Relaxation

Sequential pseudocode:

  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|B[i,j] - A[i,j]| > maxError)
          maxError = |B[i,j] - A[i,j]|;
      }
    }
    swap B and A
  }

Decomposition by: row, column, or blocks.
Isoefficiency of Jacobi Relaxation

For each decomposition (row vs. block), work out: computation per processor, communication, comm-to-comp ratio, efficiency, and isoefficiency. (Worked out on the next slide.)
Isoefficiency of Jacobi Relaxation

Row decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 16 · N
• Comm-to-comp ratio: (16 · P) / (A · N) = γ
• Efficiency: 1 / (1 + γ)
• Isoefficiency: N^4; with problem-size = N^2, that is (problem-size)^2

Block decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 32 · N / P^(1/2)
• Comm-to-comp ratio: (32 · P^(1/2)) / (A · N)
• Efficiency: 1 / (1 + ratio)
• Isoefficiency: N^2, linear in problem size
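
A sketch of where the block-decomposition numbers come from (A is the per-point computation cost; 8-byte values are assumed, hence the factors 16 and 32):

  \gamma = \frac{\text{comm}}{\text{comp}} = \frac{32 N / \sqrt{P}}{A N^2 / P} = \frac{32 \sqrt{P}}{A N}, \qquad \eta = \frac{1}{1 + \gamma}

Holding η fixed requires N ∝ √P, so the number of grid points N^2 grows only linearly with P; the row decomposition needs N ∝ P, i.e. N^2 ∝ P^2, a much faster required growth.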
NAMD: A Production MD program
NAMD
• Fully featured program
• NIH-funded development
• Distributed free of charge
(~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations
Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands of atoms (10,000 – 5,000,000)
• At each time-step
– Calculate forces on each atom
• Bonds:
• Non-bonded: electrostatic and van der Waals
– Short-distance: every timestep
– Long-distance: using PME (3D FFT)
– Multiple Time Stepping : PME every 4 timesteps
– Calculate velocities and advance positions
• Challenge: femtosecond time-step, millions needed!
Collaboration with K. Schulten, R. Skeel, and coworkers
Traditional Approaches: not isoefficient
• Replicated Data:
– All atom coordinates stored on each processor
• Communication/Computation ratio: P log P
• Partition the Atoms array across processors
– Nearby atoms may not be on the same processor
– C/C ratio: O(P)
• Distribute force matrix to processors
– Matrix is sparse, non-uniform
– C/C Ratio: sqrt(P)
Spatial Decomposition Via Charm
•Atoms distributed to cubes based on
their location
• Size of each cube :
•Just a bit larger than cut-off radius
•Communicate only with neighbors
•Work: for each pair of nbr objects
•C/C ratio: O(1)
•However:
•Load Imbalance
•Limited Parallelism
Charm++ is useful to handle this
Cells, Cubes, or “Patches”
Object Based Parallelization for MD:
Force Decomposition + Spatial Decomposition
•Now, we have many objects
to load balance:
•Each diamond can be
assigned to any proc.
• Number of diamonds (3D):
–14·Number of Patches
–2-away variation:
–Half-size cubes
–5x5x5 interactions
–3-away interactions: 7x7x7
Strong Scaling on JaguarPF

[Figure: NAMD speedup vs. number of cores (6,720 up to 224,076 cores), comparing PME, cutoff with barrier, and cutoff without barrier against ideal speedup.]
Gauss-Seidel Relaxation

Sequential pseudocode (no old/new arrays):

  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        old = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|A[i,j] - old| > maxError)
          maxError = |A[i,j] - old|;
      }
    }
  }

Sequentially, how well does this work? It works much better!
How do we parallelize this?
How do we parallelize Gauss-Seidel?
• Visualize the flow of values
• Not the control flow:
– That goes row-by-row
• Flow of dependences: which values depend on
which values
• Does that give us a clue on how to parallelize?
Parallelizing Gauss-Seidel

• Some ideas:
  – Row decomposition, with pipelining (rows assigned to PE 0, PE 1, PE 2, ...)
  – Square over-decomposition
    • Assign many squares to a processor (essentially the same?)
Row decomposition, with pipelining

[Diagram: an N x N grid split into P row strips and N/W column blocks of width W; a block can be updated only after the blocks to its left and above it, so the computation sweeps diagonally across the grid.]

Number of phases: N/W + P - 1 (number of columns = N/W, number of rows = P).

[Plot: number of processors in use over time, ramping up to P and back down; the sweep completes after N/W + P - 1 phases.]
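
A sketch of the resulting pipeline efficiency (using the slide's N, W, P; each processor owns one row strip and processes its N/W blocks one phase apart):

  \text{phases} = \frac{N}{W} + P - 1, \qquad \eta \approx \frac{N/W}{N/W + P - 1}

Efficiency approaches 1 only when N/W is much larger than P: narrow blocks (small W) reduce idle time during ramp-up and ramp-down, although in practice W also trades off against the per-message α cost.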
Red-Black Squares Method

• Red squares calculate values based on the black squares
  – Then black squares use values from the red squares
  – Now all red squares can be done in parallel, and then all black squares in parallel
  – Each square can do Gauss-Seidel computation locally
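
A minimal sequential sketch of the red-black idea at the level of individual points (the slide applies the same idea at the granularity of squares, each of which runs plain Gauss-Seidel internally); the array A is assumed to be an N x N grid with fixed boundary values. Points of one color depend only on points of the other color, so each color sweep can be parallelized.

  #include <algorithm>
  #include <cmath>
  #include <vector>

  void red_black_sweep(std::vector<std::vector<double>>& A, double& maxError) {
    const int N = static_cast<int>(A.size());
    maxError = 0.0;
    for (int color = 0; color < 2; ++color) {      // 0 = "red", 1 = "black"
      for (int i = 1; i < N - 1; ++i) {
        for (int j = 1; j < N - 1; ++j) {
          if ((i + j) % 2 != color) continue;      // update only this color
          double old = A[i][j];
          A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i][j+1]
                           + A[i+1][j] + A[i-1][j]);   // same 5-point update
          maxError = std::max(maxError, std::fabs(A[i][j] - old));
        }
      }
    }
  }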
Communication: alpha-reducing optimizations

• When you are sending too many tiny messages:
  – The α cost is high (a microsecond per message, for example)
  – How to reduce it?
• Simple combining:
  – Combine messages going to the same destination
  – Cost: delay (less pipelining)
• More complex scenario:
  – All-to-all: everyone wants to send a short message to everyone else
  – Direct method: α·(P-1) + β·(P-1)·m
  – For small m, the α cost dominates
All-to-all via Mesh

Organize processors in a 2D (virtual) grid.

Phase 1: each processor sends √P - 1 messages within its row.
Phase 2: each processor sends √P - 1 messages within its column.

A message from (x1, y1) to (x2, y2) goes via (x1, y2).

2·(√P - 1) messages instead of P - 1. For us: 26 messages instead of 192.
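
A small illustrative sketch (not the actual library implementation) of the routing rule above: rank r sits at (row, col) of a rows x cols virtual grid, and a message for dest is first sent along the source's row to the processor in the destination's column, which forwards it along that column in phase 2.

  struct Coord { int row, col; };

  Coord coord_of(int rank, int cols) { return { rank / cols, rank % cols }; }
  int   rank_of(Coord c, int cols)   { return c.row * cols + c.col; }

  // Phase-1 intermediate for a message from 'src' to 'dest':
  // keep the source's row, take the destination's column.
  int mesh_intermediate(int src, int dest, int cols) {
    Coord s = coord_of(src, cols), d = coord_of(dest, cols);
    return rank_of({ s.row, d.col }, cols);
  }
  // In phase 1, each processor combines all messages headed for the same
  // column into one message (sqrt(P)-1 sends); in phase 2 it splits what it
  // received and forwards along its column (another sqrt(P)-1 sends).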
All-to-all on Lemieux for a 76-byte message

[Figure: completion time (ms, 0 to 60) vs. number of processors (16 to 2048), comparing MPI (direct), Mesh, Hypercube, and 3D Grid strategies.]
All-to-all on Lemieux, 1024 processors

Bigger benefit: the CPU is occupied for a much shorter time!

[Figure: time (ms) vs. message size (76 to 8076 bytes), comparing the full Mesh completion time with the CPU time it consumes (“Mesh Compute”).]
Impact on Application Performance

NAMD performance on Lemieux, with the transpose step implemented using different all-to-all algorithms.

[Figure: step time vs. processors (256, 512, 1024) for the Mesh, Direct, and MPI all-to-all implementations.]
Sequential Performance Issues
Example program
• Imagine a sequential program running over a large array, A
• For each i, A[i] = A[i] + A[some other index]
• How long should the program take, if each addition takes a nanosecond?
• What performance difference do you expect, depending on how the other index is chosen?

  for (i = 0; i < size-1; i++) {
    A[i] += A[i+1];
  }

  for (i = 0, index2 = 0; i < size; i++) {
    index2 += SOME_NUMBER;        // smaller than size
    if (index2 >= size)
      index2 -= size;
    A[i] += A[index2];
  }
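
A minimal, self-contained sketch (sizes and the stride constant are illustrative) that times the two loops above; on most machines the second loop is much slower because its accesses miss in the cache far more often.

  #include <chrono>
  #include <cstdio>
  #include <vector>

  int main() {
    const int size = 1 << 24;          // 16M doubles, far larger than cache
    const int SOME_NUMBER = 104729;    // stride, smaller than size
    std::vector<double> A(size, 1.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < size - 1; i++) A[i] += A[i + 1];      // contiguous

    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0, index2 = 0; i < size; i++) {              // strided
      index2 += SOME_NUMBER;
      if (index2 >= size) index2 -= size;
      A[i] += A[index2];
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("contiguous: %.3f s, strided: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count());
    return 0;
  }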
Caches and Cache Performance

• Remember the von Neumann model

[Diagram: plain CPU connected to Memory, vs. CPU with Registers and a Cache between it and Memory.]
Why and how does a cache help?
• Only because of the principle of locality
– Programs tend to access the same and/or nearby data repeatedly
– Temporal and spatial locality
• Size of cache
• Multiple levels of cache
• Performance impact of caches
– Designing programs for good sequential performance
Reality today: multi-level caches

• Remember the von Neumann model

[Diagram: CPU with multiple levels of cache between it and Memory.]
Example: Intel’s Nehalem
• Nehalem architecture, core i7 chip:
– 32 KB L1 instruction and 32 KB L1 data cache per core
– 256 KB L2 cache (combined instruction and data) per core
– 8 MB L3 (combined instruction and data) "inclusive", shared by
all cores
• Still, even an L1 access takes several cycles
  – (reportedly 4 cycles, increased from 3 before)
  – L2: 10 cycles
A little bit about microprocessor architecture
• Architecture over the last 2-3 decades was driven
by the need to make clock cycle go faster and
faster
– Pipelining developed as an essential technique early on.
– Each instruction execution is pipelined:
• Fetch, decode, execute, stages at least
• In addition, floating point operations, which take longer to
calculate, have their own separate pipeline
• L1 cache accesses in Nehalem are pipelined – so even
though it takes 4 cycles to get the result, you can keep
issuing a new load every cycle, and you wouldn’t notice a
difference (almost) if they are all found in L1 cache (i.e. are
“hit”s)
Another issue: SIMD vectors
• Hardware is capable of executing multiple floating point instructions per cycle
– Need to enable that, by using SIMD vector instructions
– Example: Intel’s SSE and IBM’s AltiVec
• Compilers try to automate it,
– but are not always successful
• Learn manual vectorization
– Or use libraries that help
Example from Wikipedia:
movaps xmm0, [v1]
;xmm0 = v1.w | v1.z | v1.y | v1.x
addps xmm0, [v2]
;xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0
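
The same four-wide single-precision add written with compiler intrinsics (a sketch; v1, v2, and vec_res mirror the assembly above, and unaligned loads/stores are used so no alignment assumption is needed).

  #include <xmmintrin.h>   // SSE

  void add4(const float* v1, const float* v2, float* vec_res) {
    __m128 a = _mm_loadu_ps(v1);               // load v1.x .. v1.w
    __m128 b = _mm_loadu_ps(v2);               // load v2.x .. v2.w
    _mm_storeu_ps(vec_res, _mm_add_ps(a, b));  // elementwise add, then store
  }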
Broad Approach To Performance Tuning
• Understand (for a given application)
– Fraction of peak performance
• (10% is good for many apps!)
– Parallel efficiency:
• Speedup plots
• Strong scaling: keep problem size fixed
• Weak scaling: increase problem size with processors
– These help you decide where to focus
• Sequential optimizations => fraction of peak
– Use right compiler flags (basic: -O3)
• Parallel inefficiency:
– Grainsize, Communication costs, idle time, critical paths,
– load imbalances
• One way to recognize it: wait time at barriers!
Decouple decomposition
from Physical Processors
Migratable Objects (aka Processor Virtualization)
Programmer: [Over] decomposition
into virtual processors
Runtime: Assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI
[Figure: user view (a collection of communicating objects) vs. system implementation (objects mapped onto physical processors).]
Benefits
• Software engineering
– Number of virtual processors can be
independently controlled
– Separate VPs for different modules
• Message driven execution
– Adaptive overlap of communication
– Predictability :
• Automatic out-of-core
– Asynchronous reductions
• Dynamic mapping
– Heterogeneous clusters
• Vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization
Migratable Objects (aka Processor Virtualization), continued

[Figure: in AMPI, MPI processes are implemented as virtual processors, i.e., user-level migratable threads, mapped onto the real processors. The rest of the slide repeats the benefits listed above.]
Parallel Decomposition and Processors
• MPI-style encourages
– Decomposition into P pieces, where P is the number of physical
processors available
– If your natural decomposition is a cube, then the number of
processors must be a cube
– …
• Charm++/AMPI style “virtual processors”
– Decompose into natural objects of the application
– Let the runtime map them to processors
– Decouple decomposition from load balancing
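
A minimal Charm++-style sketch of this idea (hypothetical module and method names; the .ci interface declarations are shown as comments): the program creates many more chares than processors and lets the runtime place them.

  // Corresponding .ci interface sketch (assumed, simplified):
  //   mainmodule jacobi {
  //     mainchare Main   { entry Main(CkArgMsg*); };
  //     array [1D] Block { entry Block(); entry void iterate(); };
  //   }
  #include "jacobi.decl.h"   // generated from the .ci file above

  class Block : public CBase_Block {      // one "natural object": a sub-block
   public:
    Block() {}
    void iterate() { /* update this block, exchange ghosts with neighbors */ }
  };

  class Main : public CBase_Main {
   public:
    Main(CkArgMsg* m) {
      int numBlocks = 16 * CkNumPes();    // over-decompose: many VPs per PE
      CProxy_Block blocks = CProxy_Block::ckNew(numBlocks);
      blocks.iterate();                   // runtime maps blocks to processors
    }
  };
  #include "jacobi.def.h"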
Decomposition independent of numCores

• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework

[Diagram: with MPI, each of the P processors holds one Solid piece and one Fluid piece (Solid_1/Fluid_1 ... Solid_P/Fluid_P); with Charm++/AMPI, the solid is decomposed into Solid_1 ... Solid_n and the fluid into Fluid_1 ... Fluid_m, independent of P.]

• Benefit: load balance, communication optimizations, modularity
OpenAtom
Car-Parrinello ab initio MD

Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, ..

Specific software project (leanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali)

Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM
New OpenAtom Collaboration

• Principal Investigators:
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – G. Martyna (IBM TJ Watson)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)
• Current effort is focused on:
  – QM/MM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
  – ALCF IBM Blue Gene/P (163,840 cores)
Ab initio molecular dynamics (electronic structure simulation) enables the study of many important systems:

• Molecular clusters
• Nanowires
• 3D solids/liquids
• Semiconductor surfaces
Quantum Chemistry
• Car-Parrinello Molecular Dynamics
– High precision: AIMD molecular dynamics uses
quantum mechanical descriptions of electronic structure
to determine forces between atoms. Thereby permitting
accurate atomistic descriptions of chemical reactions.
– PPL's OpenAtom project features a unique parallel decomposition of the Car-Parrinello method. Using Charm++ virtualization we can efficiently scale small (32-molecule) systems to thousands of processors.
Computation Flow
Torus Aware Object Mapping
OpenAtom Performance
Benefits of Topology Mapping
[Figures: Watson Blue Gene/L (CO mode); PSC BigBen XT3 (SN and VN modes).]
Use Dynamic Load Balancing Based on the Principle of Persistence

Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations. So the recent past is a good predictor of the near future.
Load Balancing Steps

[Diagram: regular timesteps interleaved with instrumented timesteps; a detailed, aggressive load balancing step is followed later by a refinement load balancing step.]
[Figure: processor utilization against time on 128 and 1024 processors, with the aggressive and refinement load balancing steps marked.]

On 128 processors, a single load balancing step suffices, but on 1024 processors we need a “refinement” step.
ChaNGa: Parallel Gravity
• Collaborative project (NSF ITR)
– with Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
– Oct tree is natural decomposition:
• Geometry has better aspect ratios, and so you “open” fewer
nodes up.
• But is not used because it leads to bad load balance
• Assumption: one-to-one map between sub-trees and
processors
• Binary trees are considered better load balanced
– With Charm++: Use Oct-Tree, and let Charm++ map subtrees
to processors
Load balancing with GreedyLB
dwarf 5M on 1,024 BlueGene/L processors

[Figure: messages (x1000) and bytes transferred (MB) before and after load balancing; step times of 5.6 s and 6.1 s are shown.]
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors

[Figure: step times of 5.6 s and 5.0 s are shown.]
ChaNGa Preliminary Performance
ChaNGa: Parallel Gravity Code
Developed in Collaboration with
Tom Quinn (Univ. Washington)
using Charm++
Summary
• Exciting times ahead
• Petascale computers
  – unprecedented opportunities for progress in science and engineering
  – Petascale applications will require a large toolbox, with
    • Algorithms, Adaptive Runtime System, Performance tools, …
• Object-based decomposition
• Dynamic Load balancing
• Scalable Performance analysis
• Early performance development via BigSim
My Research:http://charm.cs.uiuc.edu
Blue Waters: http://www.ncsa.uiuc.edu/BlueWaters/