Why parallel computing matters (and what makes it hard)
Challenges and opportunities for parallel computing
Moore’s Law
Amdahl’s Law
The Von Neumann Bottleneck
The Speed of Light
The Heat Wall
The 3rd Pillar of Science
Simulation
Technical challenges in parallel computing
Task decomposition
Data decomposition
Load Balancing
Applied to Sharks ‘n Fishes, solving systems of PDEs
PMaC
Performance Modeling and Characterization Lab
San Diego Supercomputer Center
Moore’s Law
[Figure: transistors per processor chip on a log scale (1,000 to 100,000,000) versus year (1970-2005), with points for the i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
Moore’s Law: the number of transistors per processor chip doubles roughly every 18 months.
Moore’s Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor
density of semiconductor chips would double roughly every 18 months.
Moore’s law has had a decidedly mixed impact, creating new
opportunities to tap into exponentially increasing computing power
while raising fundamental challenges as to how to harness it effectively.
The Red Queen Syndrome
“It takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”
Things Moore never said:
“computers double in speed every 18 months”
“cost of computing is halved every 18 months”
“CPU utilization is halved every 18 months”
Snavely’s Top500 Laptop?
Among other startling implications is the fact that the peak
performance of the typical laptop would have placed it as one of
the 500 fastest computers in the world as recently as 1995.
Shouldn’t we all just go find another job now?
No, because Moore’s Law has several more subtle implications, and these have raised a series of challenges to utilizing the apparently ever-increasing availability of compute power. These implications must be understood to see where we stand today in HPC.
Real Work (the fly in the ointment)
Scientific calculations involve operations upon large amounts of
data, and it is in moving data around within the computer that the
trouble begins. As a very simple pedagogical example, consider the expression
A + B = C
This generic expression takes two values (arguments) and performs an arithmetic operation to produce a third. Implicit in such a computation is that A and B must be loaded from where they are stored and that the value C must be stored after it has been calculated. In this example, there are three memory operations and one floating-point operation (Flop).
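To make the ratio concrete, here is a minimal illustrative C sketch (not from the original slides) of this computation applied across whole arrays; the loop does one flop per element against three memory operations, so its rate is governed by memory traffic rather than by the floating-point unit.

```c
/* Illustrative only: one flop (the add) per element, but two loads and one
 * store, i.e. three memory operations, so memory bandwidth sets the pace. */
void vector_add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```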
PMaC
Performance Modeling and Characterization Lab
San Diego Supercomputer Center
Real Work cont.
Obviously, if the rate at which the CPU can perform the one Flop
in this example increases, the time required to load and store the
three values in memory had better decrease by the same
proportion if the overall rate of the computation is to continue to
track Moore’s Law. As has been noted, the rates of improvement
in memory bandwidth have not in fact been keeping up with
Moore’s Law for some time [FoSC].
Amdahl’s Law
Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) < 1/s,
where P is the number of processors and s is the fraction of work done sequentially.
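As a hypothetical worked case: if s = 0.1 (10% of the work is inherently sequential), then even with P = 1000 processors the speedup is bounded by 1/(0.1 + 0.9/1000) ≈ 9.9, and no number of processors can ever push it past 1/s = 10.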
Amdahl’s Law
The law of diminishing returns
When a task has multiple parts, after you speed up one part
a lot, the other parts come to dominate the total time
An example from cycling:
On a hilly closed-loop course you cannot ever average more
than 2x your uphill speed even if you go downhill at the speed
of light!
For supercomputers this means that even though processors get faster, the overall time to solution is limited by memory and interconnect speeds (moving the data around)
The Problem of Latency
Since memory bandwidth is also increasing faster than memory latency, there is a slower but important overall trend towards an increase in the number of words of memory bandwidth that equate to one memory latency. For a processor to
avoid stalling and to realize peak bandwidths under these
conditions, it must be designed to operate on multiple
outstanding memory requests. The number of outstanding
requests that need to be processed in order to saturate memory
bandwidth is now approaching 100 to 1000 [FoSC].
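As a rough, purely illustrative calculation (the numbers are assumptions, not from the text): a memory system with 10 GB/s of bandwidth and 100 ns of latency has a bandwidth-delay product of 10 GB/s × 100 ns = 1000 bytes, or about 125 eight-byte words, so keeping the memory pipe full requires on the order of a hundred independent outstanding requests; this is the regime described above.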
Einstein’s Law
It is clear that latency tolerance mechanisms are an essential
component of future system design.
Hierarchical systems
Modern supercomputers are composed of a hierarchy of
subsystems:
CPU and registers
Cache (multiple levels)
Main (local) memory
System interconnect to other CPUs and memory
Disk
Archival
Latencies span 11 orders of magnitude! Nanoseconds to Minutes.
This is a solar-system scale. It is 794 million miles to Saturn.
Bandwidths taper
Cache bandwidths generally are at least four times larger than
local memory bandwidth, and this in turn exceeds interconnect
bandwidth by a comparable amount, and so on outward to the
disk and archive system.
“Red Shift”
While the absolute speeds of all computer subcomponents have been changing rapidly, they have not all been changing at the same rate. For example, the analysis in [FoSC] indicates that, while peak processor speeds increased at 59% per year between 1988 and 2004, memory speeds of commodity processors have been increasing at only 23% per year since 1995, and DRAM latency has been improving at a pitiful rate of only 5.5% per year. This “red shift” in the latency distance between
various components has produced a crisis of sorts for computer
architects, and a variety of complex latency-hiding mechanisms
currently drive the design of computers as a result.
3 ways of science
Experiment
Theory
Simulation
Parallel Models and Machines
Steps in creating a parallel program
decomposition
assignment
orchestration/coordination
mapping
Performance in parallel programs
try to minimize performance loss from
load imbalance
communication
synchronization
extra work
Some examples
Simulation models
A model problem: sharks and fish
Discrete event systems
Particle systems
Lumped systems (Ordinary Differential Equations,
ODEs)
(Next time: Partial Differential Equations, PDEs)
Sources of Parallelism and Locality in Simulation
Real world problems have parallelism and locality
Many objects do not depend on other objects
Objects often depend more on nearby than distant
objects
Dependence on distant objects often “simplifies”
Scientific models may introduce more parallelism
When continuous problem is discretized, may limit
effects to timesteps
Far-field effects may be ignored or approximated, if they
have little effect
Many problems exhibit parallelism at multiple levels
e.g., circuits can be simulated at many levels, and within each level there may be parallelism within and between subcircuits
Basic Kinds of Simulation
Discrete event systems
e.g., “Game of Life”, timing-level simulation for circuits
Particle systems
e.g., billiard balls, semiconductor device simulation, galaxies
Lumped variables depending on continuous parameters
ODEs, e.g., circuit simulation (Spice), structural mechanics,
chemical kinetics
Continuous variables depending on continuous parameters
PDEs, e.g., heat, elasticity, electrostatics
A given phenomenon can be modeled at multiple levels
Many simulations combine these modeling techniques
A Model Problem: Sharks and Fish
Illustration of parallel programming
Original version (discrete event only) proposed by
Geoffrey Fox
Called WATOR
Basic idea: sharks and fish living in an ocean
rules for movement (discrete and continuous)
breeding, eating, and death
forces in the ocean
forces between sea creatures
6 problems (S&F1 - S&F6)
different sets of rules, to illustrate different phenomena
http://www.cs.berkeley.edu/~demmel/cs267/Sharks_and_Fish/
Sharks and Fish 1. Fish alone move continuously subject to an external current and Newton's laws.
Sharks and Fish 2. Fish alone move continuously
subject to gravitational attraction and Newton's laws.
Sharks and Fish 3. Fish alone play the "Game of
Life" on a square grid.
Sharks and Fish 4. Fish alone move randomly on a
square grid, with at most one fish per grid point.
Sharks and Fish 5. Sharks and Fish both move
randomly on a square grid, with at most one fish or
shark per grid point, including rules for fish attracting
sharks, eating, breeding and dying.
Sharks and Fish 6. Like Sharks and Fish 5, but
continuous, subject to Newton's laws.
Discrete Event Systems
Systems are represented as
finite set of variables
each variable can take on a finite number of values
the set of all variable values at a given time is called the
state
each variable is updated by computing a transition function
depending on the other variables
System may be
synchronous: at each discrete timestep evaluate all
transition functions; also called a finite state machine
asynchronous: transition functions are evaluated only if the
inputs change, based on an “event” from another part of
the system; also called event driven simulation
E.g., functional level circuit simulation
Sharks and Fish as Discrete Event System
Ocean modeled as a 2D toroidal grid
Each cell occupied by at most one sea creature
The Game of Life (Sharks and Fish 3)
Fish only, no sharks
A new fish is born if
a cell is empty
exactly 3 (of 8) neighbors contain fish
A fish dies (of overcrowding) if
cell contains a fish
4 or more neighboring cells are full
A fish dies (of loneliness) if
cell contains a fish
less than 2 neighboring cells are full
Other configurations are stable
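A minimal sketch of these rules as a per-cell transition function (illustrative C, not from the original slides); `alive` is 1 if the cell currently contains a fish and `live_neighbors` counts how many of the 8 surrounding cells are full in the old grid.

```c
/* Sharks and Fish 3 ("Game of Life") update rule for one cell. */
int next_state(int alive, int live_neighbors) {
    if (!alive)
        return live_neighbors == 3;              /* birth                  */
    if (live_neighbors < 2 || live_neighbors >= 4)
        return 0;                                /* loneliness or crowding */
    return 1;                                    /* otherwise stable       */
}
```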
Parallelism in Sharks and Fishes
The simulation is synchronous
use two copies of the grid (old and new)
the value of each new grid cell depends only on 9 cells (itself plus 8
neighbors) in old grid
simulation proceeds in timesteps, where each cell is updated at
every timestep
Easy to parallelize using domain decomposition
[Figure: the ocean divided into a 3 x 3 grid of patches assigned to processors P1-P9.]
Repeat
compute locally to update local system
barrier()
exchange state info with neighbors
until done simulating
Locality is achieved by using large patches of the ocean
boundary values from neighboring patches are needed
Load balancing is more difficult. The activities in this system are
discrete events
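One way to flesh out the loop above (a minimal sketch, assuming for simplicity a 1D decomposition of the ocean into horizontal slabs rather than the 3 x 3 patches in the figure, and ignoring the toroidal wrap-around; `update_interior` and the grid layout are hypothetical):

```c
#include <mpi.h>

/* Hypothetical per-cell update over the local rows, reading old ghost rows. */
void update_interior(const int *old, int *new_, int local_rows, int ncols);

void simulate(int *grid, int *newgrid, int local_rows, int ncols,
              int steps, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int t = 0; t < steps; t++) {
        /* exchange state info with neighbors: matched Sendrecv calls supply
         * the synchronization that barrier() stands for in the pseudocode  */
        MPI_Sendrecv(&grid[1 * ncols], ncols, MPI_INT, up, 0,
                     &grid[(local_rows + 1) * ncols], ncols, MPI_INT, down, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&grid[local_rows * ncols], ncols, MPI_INT, down, 1,
                     &grid[0], ncols, MPI_INT, up, 1,
                     comm, MPI_STATUS_IGNORE);

        /* compute locally: each cell reads itself plus its 8 old neighbors */
        update_interior(grid, newgrid, local_rows, ncols);

        int *tmp = grid; grid = newgrid; newgrid = tmp;  /* swap old and new */
    }
}
```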
Parallelism in Circuit Simulation
Circuit is a graph made up of subcircuits connected by wires
component simulations need to interact if they share a wire
data structure is irregular (graph)
parallel algorithm is synchronous
compute subcircuit outputs
propagate outputs to other circuits
[Figure: two partitionings of the same circuit graph across processors, one with 6 edge crossings and one with 10.]
Graph partitioning assigns subgraphs to processors
Determines parallelism and locality
Want even distribution of nodes (load balance)
With minimum edge crossing (minimize communication)
Nodes and edges may both be weighted by cost
NP-complete to partition optimally, but many good heuristics (later
lectures)
Parallelism in Asynchronous Circuit Simulation
Synchronous simulations may waste time
simulating even when the inputs do not change and there is little internal activity
activity varies across circuit
Asynchronous simulations update only when an event arrives
from another component
no global timesteps, but events contain time stamp
Ex: Circuit simulation with delays (events are gates changing)
Ex: Traffic simulation (events are cars changing lanes, etc.)
Scheduling Asynch. Circuit Simulation
Conservative:
Only simulate up to (and including) the minimum time stamp of
inputs
May need deadlock detection if there are cycles in graph, or
else “null messages”
Speculative:
Assume no new inputs will arrive and keep simulating, instead
of waiting
May need to backup if assumption wrong
Ex: Parswec circuit simulator of Yelick/Wen
Ex: Standard technique for CPUs to execute instructions
Optimizing load balance and locality is difficult
Locality means putting a tightly coupled subcircuit on one processor; since the “active” part of the circuit is likely to lie in a tightly coupled subcircuit, this may be bad for load balance
Particle Systems
A particle system has
a finite number of particles
moving in space according to Newton’s Laws (i.e. F = ma)
time is continuous
Examples
stars in space with laws of gravity
electron beam semiconductor manufacturing
atoms in a molecule with electrostatic forces
neutrons in a fission reactor
cars on a freeway with Newton’s laws plus model of driver
and engine
Reminder: many simulations combine techniques
such as particle simulations with some discrete
events (Ex Sharks and Fish)
Forces in Particle Systems
Force on each particle can be subdivided
force = external_force + nearby_force + far_field_force
External force
ocean current to sharks and fish world (S&F 1)
externally imposed electric field in electron beam
Nearby force
sharks attracted to eat nearby fish (S&F 5)
balls on a billiard table bounce off of each other
Van der Waals forces in fluid (1/r^6)
Far-field force
fish attract other fish by gravity-like (1/r^2 ) force (S&F 2)
gravity, electrostatics, radiosity
forces governed by elliptic PDE
Parallelism in External Forces
These are the simplest
The force on each particle is independent
Called “embarrassingly parallel”
Evenly distribute particles on processors
Any distribution works
Locality is not an issue, no communication
For each particle on processor, apply the external
force
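A minimal sketch of this embarrassingly parallel case (illustrative only; the Particle type and the constant imposed field are assumptions): each particle is updated independently, so a plain parallel loop needs no communication.

```c
typedef struct { double x, y, fx, fy; } Particle;

/* Apply an externally imposed field (e.g., an ocean current) to every
 * particle; any even distribution of particles over threads/processors
 * works, and no particle depends on any other. Compile with -fopenmp.   */
void external_forces(Particle *p, int n, double field_x, double field_y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        p[i].fx = field_x;
        p[i].fy = field_y;
    }
}
```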
Parallelism in Nearby Forces
Nearby forces require interaction => communication
Force may depend on other nearby particles
Ex: collisions
simplest algorithm is O(n^2): look at all pairs to see if they collide
Usual parallel model is domain decomposition of physical domain
Challenge 1: interactions of particles near processor boundary
need to communicate particles near boundary to neighboring processors
surface to volume effect means low communication
Which communicates less: square patches or slabs? (see the comparison below)
Challenge 2: load imbalance, if particles cluster
galaxies, electrons hitting a device wall
Need to check for collisions between regions
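A back-of-the-envelope answer to the square-versus-slab question (assuming an n x n grid of cells divided evenly among p processors): a square patch is (n/sqrt(p)) x (n/sqrt(p)) and so has a boundary of about 4n/sqrt(p) cells, while a slab is n x (n/p) and has a boundary of about 2n cells. For large p, 4n/sqrt(p) is much smaller than 2n, so square patches communicate less per processor; this is the surface-to-volume effect mentioned above.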
Load balance via Tree Decomposition
To reduce load imbalance, divide space unevenly
Each region contains roughly equal number of particles
Quad tree in 2D, Oct-tree in 3D
Example: each square contains at most 3 particles
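A minimal sketch of building such an adaptive quadtree (illustrative C under the assumption that each leaf square holds at most 3 particles, matching the example above; the names, the memory handling, and the lack of a depth limit are simplifications):

```c
#include <stdlib.h>

#define LEAF_CAP 3                      /* at most 3 particles per square */

typedef struct { double x, y; } Particle;

typedef struct Node {
    double cx, cy, half;                /* square center and half-width   */
    int idx[LEAF_CAP], count;           /* particle indices, if a leaf    */
    struct Node *child[4];              /* NULL children mean "leaf"      */
} Node;

static Node *make_node(double cx, double cy, double half) {
    Node *n = calloc(1, sizeof(Node));
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

/* Insert particle i; a full leaf splits into four quadrants and pushes its
 * particles down, so dense regions end up with deeper, smaller squares.   */
static void insert(Node *n, const Particle *p, int i) {
    if (n->child[0] == NULL) {                         /* currently a leaf */
        if (n->count < LEAF_CAP) { n->idx[n->count++] = i; return; }
        double h = n->half / 2;
        for (int q = 0; q < 4; q++)
            n->child[q] = make_node(n->cx + ((q & 1) ? h : -h),
                                    n->cy + ((q & 2) ? h : -h), h);
        for (int k = 0; k < n->count; k++)             /* redistribute     */
            insert(n, p, n->idx[k]);
        n->count = 0;
    }
    int q = (p[i].x > n->cx ? 1 : 0) | (p[i].y > n->cy ? 2 : 0);
    insert(n->child[q], p, i);
}
```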
Parallelism in Far-Field Forces
Far-field forces involve all-to-all interaction =>
communication
Force depends on all other particles
Ex: gravity
Simplest algorithm is O(n^2) as in S&F 2, 4, 5
Just decomposing space does not help since every
particle apparently needs to “visit” every other particle
Use more clever algorithms to beat O(n^2)
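For reference, a sketch of that O(n^2) baseline (illustrative C; the Body layout, the constant G, and the softening term eps are assumptions), which the cleverer algorithms on the next slides try to beat:

```c
#include <math.h>

typedef struct { double x, y, m, fx, fy; } Body;

/* Direct all-pairs sum of a gravity-like 1/r^2 force: every particle
 * "visits" every other particle, hence O(n^2) work and all-to-all data. */
void direct_forces(Body *b, int n, double G, double eps) {
    for (int i = 0; i < n; i++) { b[i].fx = 0.0; b[i].fy = 0.0; }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
            double r2 = dx * dx + dy * dy + eps * eps;  /* avoid r = 0   */
            double invr = 1.0 / sqrt(r2);
            double f = G * b[i].m * b[j].m / r2;        /* |F| ~ 1/r^2   */
            b[i].fx += f * dx * invr;                   /* direction     */
            b[i].fy += f * dy * invr;
        }
}
```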
Far-field forces: Particle-Mesh Methods
Superimpose a regular mesh
“Move” particles to nearest grid point
Exploit fact that far-field satisfies a PDE that is easy to solve
on a regular mesh
FFT, Multigrid
Wait for next lecture
Accuracy depends on how fine the grid is and uniformity of
particles
Far-field forces: Tree Decomposition
Based on approximation
O(n log n) or O(n) instead of O(n^2)
Forces from group of far-away particles “simplifies”
They resemble a single larger particle
Use tree; each node contains an approximation of descendents
Several Algorithms
Barnes-Hut
Fast Multipole Method (FMM) of Greengard/Rokhlin
Anderson
Later lectures
Lumped Systems (ODEs)
System of Lumped Variables
Many systems approximated by
System of “lumped” variables
Each depends on continuous parameter (usually time)
Example: circuit
approximate as graph
wires are edges
nodes are connections between 2 or more wires
each edge has resistor, capacitor, inductor or voltage source
system is “lumped” because we are not computing the
voltage/current at every point in space along a wire
Variables related by Ohm’s Law, Kirchhoff’s Laws, etc.
Form a system of Ordinary Differential Equations, ODEs
We will review ODEs and PDEs in more detail later
See http://www.sosmath.com/diffeq/system/introduction/intro.html
Circuit physics
The sum of all currents into a node is 0 (Kirchhoff’s Current Law)
The sum of all voltages around a loop is 0 (Kirchhoff’s Voltage Law)
V = IR (Ohm’s Law)
Circuit Example
State of the system is represented by
v_n(t) node voltages
i_b(t) branch currents all at time t
v_b(t) branch voltages
Eqns. include (in block form):

  Kirchhoff’s current    [ 0      A        0      ]   [ v_n ]   [ 0 ]
  Kirchhoff’s voltage    [ A’     0       -I      ]   [ i_b ]   [ S ]
  Ohm’s law              [ 0      R       -I      ] * [ v_b ] = [ 0 ]
  Capacitance            [ 0     -I      C*d/dt   ]             [ 0 ]
  Inductance             [ 0   L*d/dt      I      ]             [ 0 ]

Write as single large system of ODEs
(possibly with constraints)
Systems of Lumped Variables
Another example is structural analysis in Civil Eng.
Variables are displacement of points in a building
Newton’s and Hooke’s (spring) laws apply
Static modeling: exert force and determine
displacement
Dynamic modeling: apply continuous force (earthquake)
The system in this case (and many others) will be sparse
i.e., most array elements are 0
neither store nor compute on these 0’s
Solving ODEs
Explicit methods to compute solution(t)
Ex: Euler’s method
Simple algorithm: sparse matrix vector multiply
May need to take very small timesteps, especially if system is stiff
(i.e. can change rapidly)
Implicit methods to compute solution(t)
Ex: Backward Euler’s Method
Larger timesteps, especially for stiff problems
More difficult algorithm: solve a sparse linear system
Computing modes of vibration
Finding eigenvalues and eigenvectors
Ex: do resonant modes of building match earthquakes?
All these reduce to sparse matrix problems
Explicit: sparse matrix-vector multiplication
Implicit: solve a sparse linear system
direct solvers (Gaussian elimination)
iterative solvers (use sparse matrix-vector multiplication)
Eigenvalue/vector algorithms may also be explicit or implicit
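A toy illustration of the explicit/implicit trade-off (not from the slides; the scalar test equation du/dt = -a*u stands in for a stiff system): forward Euler is only stable for small timesteps (dt < 2/a), while backward Euler solves a linear system each step, here just a scalar divide, and tolerates large dt.

```c
/* One timestep of each method for du/dt = -a*u. */
double forward_euler_step(double u, double a, double dt) {
    return u + dt * (-a * u);        /* explicit: blows up unless dt < 2/a   */
}

double backward_euler_step(double u, double a, double dt) {
    return u / (1.0 + a * dt);       /* implicit: solve (1 + a*dt)*u_new = u */
}
```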
Parallelism in Sparse Matrix-Vector Multiplication
y = A*x, where A is sparse and n x n
Questions
which processors store
y[i], x[i], and A[i,j]
which processors compute
y[i] = sum over j = 1 to n of A[i,j] * x[j]
Graph partitioning
Partition index set {1,…,n} = N1 u N2 u … u Np
for all i in Nk, store y[i], x[i], and row i of A on processor k
Processor k computes its own y[i]
Constraints
balance load
balance storage
minimize communication
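A minimal serial sketch of y = A*x with A in compressed sparse row (CSR) form (illustrative; in the parallel setting processor k would run this loop only over the rows i in its own set N_k, after obtaining the x[j] entries it needs from other processors):

```c
/* CSR sparse matrix-vector multiply: only stored nonzeros are touched. */
void spmv_csr(int nrows, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* A[i, colind[k]] * x[colind[k]] */
        y[i] = sum;
    }
}
```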
Graph Partitioning and Sparse Matrices
Relationship between matrix and graph
[Figure: a 6 x 6 sparse matrix of 0/1 entries alongside the corresponding graph on nodes 1-6; each off-diagonal nonzero A[i,j] corresponds to an edge between nodes i and j.]
A “good” partition of the graph has
equal number of (weighted) nodes in each part (load balance)
minimum number of edges crossing between parts
Can reorder the rows/columns of the matrix by putting all the
nodes in one partition together
More on Matrix Reordering via Graph Partitioning
Goal is to reorder rows and columns to
improve load balance
decrease communication
“Ideal” matrix structure for parallelism: (nearly) block diagonal
p (number of processors) blocks
few non-zeros outside these blocks, since these require
communication
[Figure: y = A*x with A reordered to be nearly block diagonal; the diagonal blocks P0-P4 correspond to the p processors, with few nonzeros outside the blocks.]
Summary
“Red Shift” means the promise implied by Moore’s Law is largely unrealized for scientific simulation, which by necessity operates on large data
Consider “The Butterfly Effect”
Computer Architecture is a hot field again
Large centralized, specialized compute engines are
vital
Grids, utility computing, SETI@home, etc. do not meet all the needs of large-scale scientific simulation, for reasons that should now be obvious
Consider a galactic scale