Transcript HPC

C Cox
P00603
High Performance Computing
1 What is HPC
2 Grid Computing
3 Example applications
4 Solving Linear systems
References
• B. Wilkinson and M. Allen (1999) Parallel Programming: techniques and applications using networked workstations and parallel computers
• G.H. Golub and C.F. Van Loan (1989) Matrix Computations, 2nd ed., London: Johns Hopkins University Press
• Scott, Clark and Bagheri (2005) Scientific Parallel Computing, Princeton University Press
• D.J. Kuck (1996) High Performance Computing, OUP
• W.H. Press et al. (1994) Numerical Recipes in C: the art of scientific computation, Cambridge University Press
What is HPC
• Use of (parallel) supercomputers, cluster and Grid computers
• Generally refers to computer systems of Teraflop and above performance
• Aims to increase the available computational power by exploiting parallelism, recognising machine, storage and network communication capacity and performance constraints
• Is concerned with large and complex computational problems, often requiring large amounts of data
• Set of hardware and software techniques for building systems capable of large amounts of computation
• Aim: increase available computational power
• Exploit parallelism
Why We Need HPC Systems
• Problems require the computation or manipulation of large amounts of real data in Business, Science and Engineering
• Image and Signal Processing
• Entertainment (Image Rendering)
• Database and Data Mining
• Seismic analysis
• Simulation/Modeling – data or computation based on
a model of a physical system
• Direct methods (predefined set of instructions)
• Iterative, based on successive approximations
Some Challenging Computational Problems

Science
• Global climate modeling
• Astrophysical modeling
• Computational Chemistry
• Biology: human genome; protein folding; drug design
Engineering
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics
• Image and signal processing
• Image rendering
• Seismic analysis
• Combustion (engine design)
Business
• Financial and economic modeling
• Data mining
• Transaction processing, web services and search engines
Defense
• Nuclear weapons -- test by simulations
• Cryptography
Why Grids?
• Large-scale science and engineering are done
through the interaction of people,
heterogeneous computing resources,
information systems, and instruments, all of
which are geographically and organizationally
dispersed.
• The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale science and engineering.
Grid Concept
• The initial vision: The Grid
• The present reality: many grids
• Each grid is an infrastructure enabling one or more virtual organisations (VO) to share computing resources
• What is a VO: individuals or groups seeking to cooperate and share resources across their organisational boundaries
• Why establish a Grid?
  – Share data
  – Share computers
  – Share instruments
  – Collaborate

(Diagram: a VO spanning groups A, B, C and D)
Key concepts

Virtual organisation
– people and resources collaborating across administrative and organisational boundaries

Grid middleware
– running on each resource to interface to the Grid
– providing specific services

Single Virtual Computer
– user just perceives “shared resources” without concern for location or owning organisation (location transparency)
– issues: heterogeneity, scalability, reliability, computing model, access control
Grid Applications
• Grand challenge problems
• Large Hadron Collider at CERN
• Distributed.net, NASA Advanced Supercomputing Facility
• Open source Berkeley Open Infrastructure for Network Computing (BOINC) – currently over 6.2 million machines connected
Grid Scales
• Department grid (cluster)
• Enterprise grid, e.g. CERN LHC
• Global grid, e.g. NGS
Types of Grid
• Computational – providing computational services
(most common)
• Data – providing computational access to large,
possibly distributed datasets from multiple storage
systems
• Collaboration – grids designed to facilitate
collaborative computation
• Network – or delivery grid. Its main service is to provide fault-tolerant, high-performance communications to facilitate computations
• Utility – computational resources accessed on
demand
• Cluster – smaller grid operating locally
Components of a Grid
Resources
– networking, computers, storage, data, instruments, …
Grid Middleware
– the “operating system of the grid”
Operations infrastructure
– Run enabling services (people + software)
Virtual Organization management
– Procedures for gaining access to resources
Grid Operation
• Users join VOs
• Virtual organisation negotiates with sites to agree access to resources
• Distributed services (both people and middleware) enable the grid to divide work uniformly
• Grid needs to deal with multiple administration domains

(Diagram: sites in multiple administration domains connected via the Internet)
Grid Operation
• Relies on middleware to divide processing and data between available processors
• Grid size can range from a single cluster or network of workstations to large collaborations across many companies and networks
• Need to be aware of trust issues when operating with multiple domains
• Need to be aware of resource availability issues since there is no central control
• Various projects have attempted to provide a transparent grid user interface, e.g. BOINC the desktop grid, see http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing
Grid computational hardware
• Supercluster, e.g. Europe’s Blue Gene (65536 dual processors in 64 cabinets) – see top500.org
• Clusters, using message passing libraries
• Symmetric Multiprocessors (SMP), using shared operating system and memory
• Vector (or Array) processors

Grid hardware performance requirements
• Gigabytes/second bandwidth
• Teraflops computation
• Petabytes data storage
Grid Architecture
Applications: climatology, cosmology, high energy physics, chemical engineering, computational fluid dynamics
Application toolkits: data grid, computation grid, collaboration grid, portals, remote sensors etc
Grid services: protocols, authentication, policy, resource management etc (e.g. Globus)
Grid fabric: disk, networks, processors etc
Globus
• Started 1996
• Allows the construction of computational grids
• Focus on the user execution environment for parallel and distributed applications
• Globus Toolkit – open source, now a de facto standard
• Globus Alliance – user community of individuals developing grid technologies
Condor (http://www.cs.wisc.edu/condor)
• Started 1988
• Tools to support high-throughput distributed computing

Harness
• Heterogeneous Adaptable Reconfigurable Networked System
• Successor to PVM (DVM)
Globus Toolkit™
• A software toolkit addressing key technical problems in the development of Grid-enabled tools, services, and applications
• Offers a modular set of orthogonal services
• Enables incremental development of grid-enabled tools and applications
• Implements standard Grid protocols and APIs
• Available under a liberal open source license
• Large community of developers & users
• Commercial support
National Grid Service (NGS)
• Aims to help UK academics and researchers by providing access to computation and data resources
• Funded by EPSRC and JISC
• Based on Globus Toolkit
• Offers training and grid support
• http://en.wikipedia.org/wiki/National_Grid_Service
gLite
• Middleware for grid computing
• Evolved from a collaboration in the EGEE project (2010)
• Currently gLite 2.2 (Nov 2012)
• Used by CERN LHC
• Development now with European Middleware Initiative (2012)
• User interface to list resources, submit and manage jobs, manage files etc.
Grid Standards and API’s
• GLUE – Grid Laboratory Uniform Environment
• SAGA – Simple API for Grid Applications (Open Grid
Forum)
• DRMAA – Distributed Resource Management
Application API (Open Grid Forum)
• GSI – Grid Security Infrastructure
• OGSA – Open Grid Services Architecture
• OGSI – Open Grid Services Infrastructure
• WSRF – Web Services Resource Framework
Grid Projects – Grand Challenge problems

Computational fluid dynamics for
• the design of aircraft, vehicle and submarine bodies
• weather forecasting for short and long term effects
• efficient recovery of oil, and many other applications

Electronic structure calculations for the understanding of materials such as
• chemical catalysts, immunological agents, superconductors etc

Plasma dynamics for fusion energy technology and for safe and efficient military technology

Calculations to understand the fundamental nature of matter, including quantum chromodynamics and condensed matter theory

Symbolic computations including speech recognition, computer vision, natural language understanding, automated reasoning, and tools for design, manufacturing and simulation of complex systems
Grid - OS

Grid middleware ideally creates the image of a single virtual computer.

Issues
• Heterogeneity – hardware, software, culture
• Scalability
• Reliability – tolerate permanent partial failure
• Viable computing model, e.g. batch processing
• Access control
  – Authentication
  – Authorisation
  – Single sign-on

Layered view:
Application Software
Interface between application and grid
Grid Middleware: “collective services”
Grid Middleware on each resource
Operating System on each resource
Resources connected by the Internet
Grid Middleware
• Allows the sharing of heterogeneous resources between virtual organisations
• Also called gridware
• E.g. Globus, Condor, Harness
The Grid World: Current Status
• Numerous Grid projects in scientific & technical
computing, research and education
• Considerable consensus on key concepts and
technologies
• Open source Globus Toolkit™ a de facto standard for
major protocols & services
• Industrial interest emerging rapidly: IBM, Microsoft,
Sun, Compaq, …
Grid Applications
Traditionally associated with computationally intensive problems.
Also used for large scale commercial applications.
E.g. SETI: 5 million+ users, 1500 million results, 2 million+ years of CPU time.

Grid summary
• Analogy with power grid (hence the name)
• Distributed supercomputing (e.g. teraflop computations, petabyte data storage)
• Transparent sharing of distributed computing resources
• Grid types:
  – Computational grid – for computation-intensive work
  – Data grid – for the sharing and management of large amounts of data
• Grid communities: collaborative computing, shared data, common code development, virtual organisations
• National grids:
  – Globus http://www.globus.org/
  – Condor http://www.cs.wisc.edu/condor/
  – NGS http://www.grid-support.ac.uk/
3 Example applications
Global Climate Modeling Problem
• Problem is to compute:
  f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
  – Discretize the domain, e.g. a measurement point every 10 km
  – Devise an algorithm to predict weather at time t+1 given t
• Uses:
  – Predict major events, e.g. El Niño
  – Use in setting air emissions standards
Global Climate Modeling Computation
• Computational requirements:
  – To match real time, need 5 x 10^11 flops in 60 seconds ≈ 8 Gflop/s
  – Weather prediction (7 days in 24 hours) → 56 Gflop/s
  – Climate prediction (50 years in 30 days) → 4.8 Tflop/s
  – To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
  – To double the grid resolution, computation is at least 8x
• State of the art models require integration of atmosphere, ocean, sea-ice and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
Heart Simulation
• Problem is to compute blood flow in the heart
• Approach:
  – Modeled as an elastic structure in an incompressible fluid
  – The “immersed boundary method” due to Peskin and McQueen
  – 20 years of development in the model
  – Many applications other than the heart: blood clotting, inner ear, paper making, embryo growth, and others
  – Use a regularly spaced mesh (set of points) for evaluating the fluid flow
• Uses:
  – Current model can be used to design artificial heart valves
  – Can help in understanding the effects of disease (leaky valves)
  – Related projects look at the behavior of the heart during a heart attack
  – Ultimately: real-time clinical work
Parallel computing: Web searching
• General themes of sifting through large, unstructured data sets:
  – Functional parallelism: crawling, indexing, sorting
  – Parallelism between queries: multiple users
  – Finding information amidst junk
  – Preprocessing of the web data set to help find information
Parallel Programming: Decomposition Techniques

Functional Decomposition (Functional Parallelism)
• Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution
• Good to use when there is no static structure or fixed determination of the number of calculations to be performed

Domain Decomposition (Data Parallelism)
• Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
• Good to use for problems where:
  – data is static (e.g. solving large matrix, finite difference or finite element calculations)
  – a dynamic data structure is tied to a single entity and that entity can be separated
  – the domain is fixed but computation within various regions of the domain is dynamic (fluid vortex models)
• A combination of functional and domain decomposition is also possible
Finite Element Method
• Divide domain into subdomains (divide and conquer), retaining connections between subdomains (to share boundary conditions)
• System defined exactly by PDE with boundary conditions, but approximated as a discretised mesh of points in 2-D, 3-D etc.

Property               Applied action   Behaviour (unknown)
Elastic stiffness      Force            Displacement
Thermal conductivity   Heat             Temperature change
Fluid viscosity        Force            Fluid velocity
Ohm's law              Voltage          Current flow

Matrix form: [Property][Behaviour] = [Action]
Finite Difference Method
• System defined exactly by the PDE with boundary conditions, but converted to a set of difference equations for numerical approximations to the derivatives – so approximates the operators
• Domain divided into a mesh of intervals
• Transform continuous domain to mesh of discrete points, e.g.

  ∂u/∂x ≈ (u_{i+1,j} - u_{i,j}) / h

  ∂²u/∂x² ≈ (u_{i+1,j} - 2u_{i,j} + u_{i-1,j}) / h²
Example: Heat Transfer
• Use of computers to solve/compute scientific models
• For instance, many natural phenomena can be modelled by established Partial Differential Equations (PDEs)
• Classic example: heat transfer (temperature vs. time)
• Consider a “1-D” material between 2 heat sources, held at T=H at one end and T=L at the other, over 0<x<X
• Problem: compute f(x,t), the temperature at location x at time t
• An example Boundary Value Problem (BVP)
Heat Transfer
• The laws of physics specify: ∂f/∂t = a ∂²f/∂x², where a is the material thermal conductance
• f(0,t)=H, f(X,t)=L and f(x,0)=L are all fixed – called the boundary conditions
• Simple case: Newton's law of cooling
• Question: how do we solve this PDE?
  – It does not have an analytical solution
  – Therefore it must be solved numerically
Heat Transfer – Finite Difference Approach
• Discretise the domain: decide that the values of f(x,t)
will only be known for some finite (but large) number
of values of x and t
• The discretised domain is called a mesh
• All x values are separated by ∆x
• All t values are separated by ∆t
• Then, one replaces partial derivatives by algebraic
differences
• In the limit, when ∆x and ∆t go to zero, we get close
to the real solution
• For practical purposes a sufficiently fine grid is used
to obtain the required level of solution accuracy
Bucket Sort – Parallel sorting by domain decomposition
• A basic sort is based on compare (if a>b) and exchange (temp=a; a=b; b=temp) steps
• Aim is to sort N numbers that occur in the range 0..M-1 (best if the N numbers are uniformly distributed over the range)
• Use P buckets (there will be 1 per processor in the final parallel version)
• With P buckets (index 0 to P-1) divide the numbers into separate ranges:
  bucket 0:    0 to (M/P)-1
  bucket 1:    (M/P) to (2M/P)-1
  bucket 2:    (2M/P) to (3M/P)-1
  ...
  bucket P-1:  ((P-1)M/P) to M-1
• E.g. for numbers in range 0 to 999 (so M=1000) with P=10:
  bucket 0:    0 to 99
  bucket 1:    100 to 199
  ...
  bucket 9:    900 to 999
Aside: Big ‘O’ Notation

The time taken for a program of job size n is f(n).
f(n) = O(g(n)) means there are positive constants c and k, such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ k.
The values of c and k must be fixed for the function f and must not depend on n.
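As a worked instance of the definition (the function chosen here is illustrative, not from the slides):

```latex
% f(n) = 3n + 5 is O(n): exhibit constants c and k from the definition.
f(n) = 3n + 5 \text{ is } O(n)\colon\quad
\text{take } c = 4,\ k = 5.\ \text{Then for all } n \ge 5\colon\quad
0 \le 3n + 5 \le 3n + n = 4n = c\,g(n), \quad\text{with } g(n) = n.
```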
Bucket Sort – Sequential Algorithm
• Determine the bucket for number K as K integer_divide (M/P), so K=0 will be for bucket 0 and K=M-1 will be for bucket P-1
• E.g. for range 0..999 (M=1000) and P=10, number K=351 will be for bucket 351/100 = 3
• (Note: having M/P as a power of 2 can make the division faster by bit shifting)
• Placing N numbers into buckets requires N computational steps
• If the numbers are uniformly distributed there will be around N/P per bucket
• Sorting n numbers, using Quicksort or similar, has a complexity O(n log n) at best
• Sorting N/P numbers will take (N/P)log(N/P) steps
• Sorting the P buckets will require P(N/P)log(N/P) = N log(N/P) steps
• Finally, the sorted numbers in each bucket can be merged in N steps
• So the total number of steps is 2N + N log(N/P)
• Hence sequential bucket sort complexity is O(N log(N/P))
Bucket Sort – Parallel Algorithm
• Assign 1 processor per bucket, so buckets 0..P-1
• Each processor examines all N numbers, ignoring numbers not in range for its bucket
• An average of N/P numbers per bucket are (Quick)sorted in parallel on each processor in time (N/P)log(N/P)
Parallel Numerical Integration
• To determine the numerical value of the definite integral: I = ∫[a,b] f(x) dx
• The integral value is equivalent to the area under the f(x) curve
• The region [a,b] is divided into P regions, with 1 region per worker process
• Each worker processor computes its region as: R = ∫[p,q] f(x) dx
• Approximating numerically (see later), with strip width Δ and n strips per region:

  R ≈ Δ [ (f(p) + f(p+Δ))/2 + (f(p+Δ) + f(p+2Δ))/2 + ... + (f(p+(n-1)Δ) + f(q))/2 ]

  R ≈ Δ [ f(p)/2 + f(p+Δ) + f(p+2Δ) + ... + f(p+(n-1)Δ) + f(q)/2 ]

• The number of rectangles per region, n, determines the final integral accuracy
• Worker returns its region area to the driver process
• Driver sums the areas for the P worker regions
Numerical approximation

  A = ∫[a,b] f(x) dx ≈ Σ_{i=1..n} f(x_i)·Δx_i

(Figure: area A under f(x) approximated by rectangles of width Δx_i)
Basic and Trapezoidal Rules

  Right rule:     A ≈ Σ_i Δ_i f(x_i)
  Left rule:      A ≈ Σ_i Δ_i f(x_{i-1})
  Midpoint rule:  A ≈ Σ_i Δ_i f((x_i + x_{i-1})/2)
  Trapezium rule: A ≈ Σ_i Δ_i (f(x_i) + f(x_{i-1})) / 2

Area approximations based on a linear (order 1) model.
Parallel Numerical Integration
• The integration domain, divided into a fixed number of strips, leads to a data parallel approach – since each strip can be computed separately from the others and the strip regions can be predefined
• The computation per processor will relate to a range of strips, such that all the strips are equally divided per processor
• The summation of areas from each processor needs to be returned to a driver or controller process
• The optimum speedup needs to balance the potential parallel speedup against the message passing overhead of using multiple processors
Pseudocode for parallel trapezoidal numerical integration
Integrate f(x) over region [a, b] using P worker processors

Driver:
  For id = 1..P:
    Send worker id
    Send range [a,b] and total number of intervals n
  Total_area = 0
  For id = 1..P:
    Receive area from worker id
    Total_area = Total_area + worker_area

Worker:
  Receive worker id
  Receive a, b, n
  Strip width d = (b-a)/n
  Region width r = (b-a)/P
  start = a + r*(id-1)
  end = start + r
  Compute region area:
    Area = 0
    For x = start to end-d in steps of d:
      Area = Area + 0.5*(f(x)+f(x+d))*d
  Send Area to driver
Numerical Integration – Adaptive Quadrature
• Fixed strip width schemes (trapezoidal, Simpson etc) can be inefficient, in not being able to focus on areas where the function varies most
• In Adaptive Quadrature the number of strips is successively doubled until the area approximation converges sufficiently
• The strip areas can be computed using the trapezoidal or Simpson's method
• Regions in which the function is smooth are computed quickly, while regions in which the function behaves more irregularly require more computational effort, subdividing successively to arrive at an acceptable level of convergence accuracy
• Checking the difference between the area computed at an upper layer and its lower layers forms an easy convergence accuracy criterion
• A binary tree structure can be used to manage the regions to be worked on
• The algorithm is Adaptive since it focuses on areas of irregular function behaviour where the strip area convergence needs improving
Adaptive Quadrature – recursive algorithm

Floating_point Function Qarea(a, b, width)
  Top_area = 0.5*width*(f(a)+f(b))
  c = (a+b)*0.5                                   // bisect region
  Larea = 0.25*width*(f(a)+f(c))
  Rarea = 0.25*width*(f(c)+f(b))
  If |Top_area - (Larea+Rarea)| < close_enough    // absolute value of the difference
    Return Larea+Rarea
  Else
    Return Qarea(a,c,0.5*width) + Qarea(c,b,0.5*width)
  EndIf

(Figure: f(x) over [a,b], with Top_area under the single trapezium and Larea, Rarea under the two half-width trapezia meeting at c)
Adaptive Quadrature – parallel implementation and performance
• Since the number of strips per processor cannot be predicted in advance, it is difficult to ensure load balancing with an initial static assignment of strips per processor.
• A dynamic arrangement, whereby additional strips are made available to a pool of workers, may be a solution here; however, the message passing required to balance the work load may erode the performance advantage of this method.
4 Linear Systems in Matrix Form

  a11 x1 + a12 x2 + ... + a1n xn = b1
  a21 x1 + a22 x2 + ... + a2n xn = b2
   :
  an1 x1 + an2 x2 + ... + ann xn = bn

or, in matrix form,

  [ a11 a12 ... a1n ] [ x1 ]   [ b1 ]
  [ a21 a22 ... a2n ] [ x2 ] = [ b2 ]
  [  :   :       :  ] [  : ]   [  : ]
  [ an1 an2 ... ann ] [ xn ]   [ bn ]

Linear equation means that every variable (x_i) appears to the power 1 only.
Solving Mx=b
• x = M⁻¹b is the obvious solution. But don't – inverting is more costly than solving the equation by other means
• Don't write your own solution – once you understand the nature of the problem, there are plenty of algorithms and indeed libraries available
• Understanding the basics through writing some code will, however, help in understanding how to solve new systems
Direct and Indirect Methods for Solving Linear Systems
• Direct – find an exact solution in a finite (and predetermined) number of computational steps, e.g. Gaussian elimination
• Iterative – produce a sequence of approximate solutions that converges to an exact solution, e.g. the Jacobi and Gauss-Seidel methods
Iterative Methods
• Based on the Newton-Raphson method
• Solve f(x) = 0

(Figure: tangent to f(x) at x_i crossing the x-axis at x_{i+1}; successive iterates x_i, x_{i+1}, x_{i+2} approach the root)

From the tangent triangle ABC at x_i:

  tan(a) = f'(x_i) = AB/AC = f(x_i) / (x_i - x_{i+1})

giving the iteration:

  x_{i+1} = x_i - f(x_i) / f'(x_i)
Newton-Raphson solution of x^4 = 81, i.e. solve f(x) = x^4 - 81 = 0:

$ cat nrp.c
/* c.cox 3.2.07 newton-raphson solver */
#include <stdio.h>
#include <math.h>

double f(double x)
{ return x*x*x*x - 81.0; }

int main(void)
{ double root, newroot;
  const double delta = 1.0e-6;     /* step for the numerical derivative */
  const double accuracy = 1.0e-7;
  double divide;
  int iterations;

  root = 1.0; /* first guess */
  newroot = 2.0;
  iterations = 0;
  while (fabs(newroot - root) > accuracy)
  { iterations = iterations + 1;
    root = newroot;
    divide = f(root + delta) - f(root);          /* f'(root) approximated as divide/delta */
    newroot = root - f(root) * (delta / divide);
    printf("root=%15.12f\n", newroot);
  }
  printf("iterations=%d\n", iterations);
  return 0;
}
$ gcc -o nrp nrp.c -lm
$ ./nrp
root= 4.129920000000
root= 3.365641722222
root= 3.055569547222
root= 3.001426781532
root= 3.000000996743
root= 3.000000000001
root= 3.000000000000
iterations=7
$
Newton-Raphson – known limitations

• Inflection point, e.g. f(x) = (x - 1)^3 = 0
• Division by zero, e.g. f(x) = x^3 - 0.03x^2 + 2.4x10^-6 = 0
• Oscillations near a local maximum or minimum, e.g. f(x) = x^2 + 2 = 0 (no real root)
• Root jumping, e.g. f(x) = sin x = 0

(Figures: plots of each function illustrating its failure mode)
Sparse Linear Systems
Finite Difference Method
• Numerical solution to a large class of physical problems that can be modelled as the solution to second order partial differential equations
• E.g. the temperature distribution on a 2-D surface satisfies the Poisson equation:

  ∂²u/∂x² + ∂²u/∂y² = f

  where u(x,y) is the temperature profile and f(x,y) is the applied heat flux
• Discretise the x, y coordinates onto an i, j grid, where h is the discretisation parameter (mesh spacing)
• Numerical approximation of the derivatives:

  ∂u/∂x ≈ (u(x+h) - u(x)) / h = (u_{i+1,j} - u_{i,j}) / h

  then

  ∂²u/∂x² ≈ (u_{i+2,j} - 2u_{i+1,j} + u_{i,j}) / h²  or, centred,  (u_{i+1,j} - 2u_{i,j} + u_{i-1,j}) / h²

  and

  ∂²u/∂y² ≈ (u_{i,j+1} - 2u_{i,j} + u_{i,j-1}) / h²

• So ∂²u/∂x² + ∂²u/∂y² = f leads to the 5 point discretised model:

  u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j} = h² f_{i,j}
Finite Difference Method

  u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j} = h² f_{i,j}

i.e., in compass notation, N + E + S + W - 4C = h² f_{i,j}

The complete system is described by the linear equation M u = h² f, where the vectors u, f represent the set of discretised u and f values.
Example system

Using N + E + S + W - 4C = h² f_i at each cell (N = u_{i,j+1}, S = u_{i,j-1}, E = u_{i+1,j}, W = u_{i-1,j}, C = u_{i,j}) on a 3x3 grid of unknowns U1..U9, with flux entering on the left hand edge: f1 = 1, f4 = 1, f7 = 1 and the remaining f_i = 0.

Linear equations:
  1: u2 + u4 - 4u1 = h²
  2: u3 + u5 + u1 - 4u2 = 0
  3: u6 + u2 - 4u3 = 0
  4: u1 + u5 + u7 - 4u4 = h²
  5: u2 + u6 + u8 + u4 - 4u5 = 0
  6: u3 + u9 + u5 - 4u6 = 0
  7: u4 + u8 - 4u7 = h²
  8: u5 + u9 + u7 - 4u8 = 0
  9: u6 + u8 - 4u9 = 0

Matrix form (zeros shown as dots):

  [ -4  1  .  1  .  .  .  .  . ] [u1]      [1]
  [  1 -4  1  .  1  .  .  .  . ] [u2]      [0]
  [  .  1 -4  .  .  1  .  .  . ] [u3]      [0]
  [  1  .  . -4  1  .  1  .  . ] [u4]      [1]
  [  .  1  .  1 -4  1  .  1  . ] [u5] = h² [0]
  [  .  .  1  .  1 -4  .  .  1 ] [u6]      [0]
  [  .  .  .  1  .  . -4  1  . ] [u7]      [1]
  [  .  .  .  .  1  .  1 -4  1 ] [u8]      [0]
  [  .  .  .  .  .  1  .  1 -4 ] [u9]      [0]
Jacobi Iteration Method - 1
• Solve: Mx = b (systems of linear equations)
• M = D + M_off (diagonal + off-diagonal elements), e.g.

        [  1  2  3  4 ]        [ 1 0  0  0 ]           [  0  2  3  4 ]
    M = [  5  6  7  8 ]    D = [ 0 6  0  0 ]   M_off = [  5  0  7  8 ]
        [  9 10 11 12 ]        [ 0 0 11  0 ]           [  9 10  0 12 ]
        [ 13 14 15 16 ]        [ 0 0  0 16 ]           [ 13 14 15  0 ]

• (D + M_off) x = b
• So D x = b - M_off x
• And x = D⁻¹ (b - M_off x)
• The inverse D⁻¹ of a diagonal matrix D is simply the matrix of inverted diagonal elements:

    [ m11  0    0    0   ]⁻¹   [ 1/m11  0      0      0     ]
    [ 0    m22  0    0   ]   = [ 0      1/m22  0      0     ]
    [ 0    0    m33  0   ]     [ 0      0      1/m33  0     ]
    [ 0    0    0    m44 ]     [ 0      0      0      1/m44 ]
Jacobi Iteration Method - 2

Algorithm
• Start with an initial solution guess, e.g. x = b
• Use the recurrence formula: x^(k+1) = D⁻¹ (b - M_off x^k)

Termination conditions
• Convergence: ||x^(k+1) - x^k||₂ < ε, using the 2-norm

  ||x^(k+1) - x^k||₂ = ( (x1^(k+1) - x1^k)² + (x2^(k+1) - x2^k)² + ... + (xn^(k+1) - xn^k)² )^(1/2)

• Iteration count does not exceed a set maximum

Requirements
• Diagonally dominant matrix: |m_ii| > Σ_{j≠i} |m_ij|
• Diagonal elements of M must be non-zero: m_ii ≠ 0
• Initial solution must be carefully chosen
Jacobi Iteration Method - 3

Serial algorithm, per element:

  x_i^(k+1) = ( b_i - Σ_{j≠i} m_ij x_j^k ) / m_ii

Pseudocode
  For i=0..n-1
    x[i]=b[i]                 // initialise unknown x
  For it=1..MaxIts            // fixed number of iterations
    for i=0..n-1              // for each unknown
      sum=0
      for j=0..n-1
        if(i<>j) sum=sum+m[i][j]*x[j]
      new_x[i]=(b[i]-sum)/m[i][i]
    for i=0..n-1
      x[i]=new_x[i]

Note: the if i<>j test can be avoided by initialising sum to -m[i][i]*x[i], or by having loops from j=0..i-1 then j=i+1..n-1.
1 
2 1 0
 4  Solution:
x  2 
n  3, A  1 3 2, b  13

3 

0 2 4
16
Example:
Jacobi
Iteration
Method - 4
Algorithm
C Cox
xik 1 
bi   mij x kj
i j
mii
Initially, k=0
k=1
k=2
k=3
k=4
k=5
k=6
x1  b1  a12 x2  a13 x3  a11  4  x2  2
x2  b2  a21x1  a23 x3  a22  13  x1  2x3  3
x3  b3  a31x1  a32 x2  a33  16  2x2  4
x1=1.0, x2=1.0, x3=1.0
x1=[4-1]/2=1.5
x1=[4-3.33]/2=0.333
x1=[4-1.5]/2=1.25
x1=[4-2.666]/2=0.666
x1=[4-1.75]/2=1.125
x1=[4-2.333]/2=0.833
x2=[13-1-2]/3=3.333
x2=[13-1.5-2*3.5]/3=1.5
x2=[13-0.333-2*2.333]/3=2.666
x2=[13-1.25-2*3.25]/3=1.75
x2=[13-0.666-2*2.666]/3=2.333
x2=[13-1.125-2*3.125]/3=1.875
Note slow convergence to eventual solution
x3=[16-2]/4 =3.5
x3=[16-2*3.333]/4=2.333
x3=[16-2*1.5]/4=3.25
x3=[16-2*2.666]/4=2.666
x3=[16-2*1.75]/4=3.125
x3=[16-2*2.333]/4=2.833
x1=1.0, x2=2.0, x3=3.0.
Hand
calculation
67
Jacobi Iteration Method - 5: sequential code and output

$ cat jacobi.c
/* c.cox 4.2.07 jacobi solver */
#include <stdio.h>
#define n 3

int main(void)
{ const double m[n][n]={{2.0,1.0,0.0},{1.0,3.0,2.0},{0.0,2.0,4.0}};
  const double b[n]={4.0,13.0,16.0};
  double x[n], new_x[n];
  double sum;
  const int MaxIts=30;
  int Its,i,j;

  for(i=0;i<n;i++) x[i]=1.0; /* initialise */
  for(Its=1;Its<=MaxIts;Its++)
  { printf("Its=%d",Its);
    for(i=0; i<n; i++)
    { sum=0.0;
      for(j=0;j<n;j++)
        if(i!=j) sum+=m[i][j]*x[j];
      new_x[i]=(b[i]-sum)/m[i][i];
    }
    for(i=0;i<n;i++)
    { x[i]=new_x[i];
      printf(" x[%d]=%12.8f",i,x[i]);
    }
    printf("\n");
  }
  return 0;
}
$ gcc -o jacobi jacobi.c
$ ./jacobi
Its=1 x[0]= 1.50000000 x[1]= 3.33333333 x[2]= 3.50000000
Its=2 x[0]= 0.33333333 x[1]= 1.50000000 x[2]= 2.33333333
Its=3 x[0]= 1.25000000 x[1]= 2.66666667 x[2]= 3.25000000
Its=4 x[0]= 0.66666667 x[1]= 1.75000000 x[2]= 2.66666667
Its=5 x[0]= 1.12500000 x[1]= 2.33333333 x[2]= 3.12500000
Its=6 x[0]= 0.83333333 x[1]= 1.87500000 x[2]= 2.83333333
Its=7 x[0]= 1.06250000 x[1]= 2.16666667 x[2]= 3.06250000
Its=8 x[0]= 0.91666667 x[1]= 1.93750000 x[2]= 2.91666667
Its=9 x[0]= 1.03125000 x[1]= 2.08333333 x[2]= 3.03125000
Its=10 x[0]= 0.95833333 x[1]= 1.96875000 x[2]= 2.95833333
Its=11 x[0]= 1.01562500 x[1]= 2.04166667 x[2]= 3.01562500
Its=12 x[0]= 0.97916667 x[1]= 1.98437500 x[2]= 2.97916667
Its=13 x[0]= 1.00781250 x[1]= 2.02083333 x[2]= 3.00781250
Its=14 x[0]= 0.98958333 x[1]= 1.99218750 x[2]= 2.98958333
Its=15 x[0]= 1.00390625 x[1]= 2.01041667 x[2]= 3.00390625
Its=16 x[0]= 0.99479167 x[1]= 1.99609375 x[2]= 2.99479167
Its=17 x[0]= 1.00195312 x[1]= 2.00520833 x[2]= 3.00195312
Its=18 x[0]= 0.99739583 x[1]= 1.99804688 x[2]= 2.99739583
Its=19 x[0]= 1.00097656 x[1]= 2.00260417 x[2]= 3.00097656
Its=20 x[0]= 0.99869792 x[1]= 1.99902344 x[2]= 2.99869792
Its=21 x[0]= 1.00048828 x[1]= 2.00130208 x[2]= 3.00048828
Its=22 x[0]= 0.99934896 x[1]= 1.99951172 x[2]= 2.99934896
Its=23 x[0]= 1.00024414 x[1]= 2.00065104 x[2]= 3.00024414
Its=24 x[0]= 0.99967448 x[1]= 1.99975586 x[2]= 2.99967448
Its=25 x[0]= 1.00012207 x[1]= 2.00032552 x[2]= 3.00012207
Its=26 x[0]= 0.99983724 x[1]= 1.99987793 x[2]= 2.99983724
Its=27 x[0]= 1.00006104 x[1]= 2.00016276 x[2]= 3.00006104
Its=28 x[0]= 0.99991862 x[1]= 1.99993896 x[2]= 2.99991862
Its=29 x[0]= 1.00003052 x[1]= 2.00008138 x[2]= 3.00003052
Its=30 x[0]= 0.99995931 x[1]= 1.99996948 x[2]= 2.99995931
$
Jacobi Iteration Method - 6

  x_i^(k+1) = ( b_i - Σ_{j≠i} m_ij x_j^k ) / m_ii
Jacobi Parallelisation
The calculation for each row (i-loop in sequential algorithm) can proceed
independently, so the parallel Jacobi algorithm can distribute the processing for each
row between the available processors.
e.g. with n=12 matrix rows and 3 workers allocate P1 to rows 1-4, P2 to rows 5-8 and
P3 to rows 9-12.
Message Passing
At the start of the program each worker process will require the matrix entries m_rc for c=1..n and the values b_r, for the range of rows r allocated to that worker.
During each iteration:
• Continue-iteration flag set false when converged, and sent to all worker(s)
• From Driver to Workers: send the last solution x^(k-1) for all rows 1..n to all workers
• From Workers to Driver: new solution elements x^k for the rows allocated to each worker
69
Jacobi Iteration Method - 7
Sequence Diagram for Message Passing Design
C Cox

Driver:
  Spawn worker(s)
  Send A[r,c]; r=range, c=1..n   (to each worker)
  Send b[r]; r=range             (to each worker)
  Start iterations:
  repeat
    Send continue iteration flag (to all workers)
    Send xprevious[1..n]         (to all workers)
    Rcv. x[r]; r=range           (from each worker)
  until converged
  Display solution results

Worker (pseudocode for worker row i):
  Receive m[0][0]..m[n-1][n-1] from driver
  Receive b[0]..b[n-1] from driver
  x[i] = initial_value
  for its = 1..MaxIts
    sum = 0.0
    for j = 0..n-1
      if (i != j)
        sum = sum + m[i][j]*x[j]
    new_x[i] = (b[i] - sum) / m[i][i]
    send x[i] to driver
    receive x[0..n-1] from driver
70
Jacobi and Gauss-Seidel Iteration
C Cox
• Jacobi algorithm per element:

    x_i^(k+1) = ( b_i - Σ_{j<i} m_ij x_j^k - Σ_{j>i} m_ij x_j^k ) / m_ii

• Gauss-Seidel algorithm per element:

    x_i^(k+1) = ( b_i - Σ_{j<i} m_ij x_j^(k+1) - Σ_{j>i} m_ij x_j^k ) / m_ii

Gauss-Seidel incorporates updated x_i values as soon as they become available;
in Jacobi, new values are only incorporated at the end of each sweep.
71
Sparse Linear Systems
C Cox
• Many problems in scientific and engineering computing can be reduced to the linear form Mx = b
• These linear systems arise from finite element and finite difference models of PDE systems
• Because of the physical arrangement and the modelling methods, we are mostly interested in close neighbours, i.e. short-range effects
• In matrix terms this leads to sparse systems, with matrices populated mainly about the centre diagonal: m_ij values are zero or very small where |i-j| > L for some constant L
72
C Cox
Example Sparse Linear Systems
• Heat flow: Temperature (position, time)
• Diffusion: Concentration (position, time)
• Current flow in resistor network: Voltage (position)
• Electrostatic or gravitational potential: Potential (position)
• Fluid flow: Velocity, Pressure, Density (position, time)
• Quantum mechanics: Wave-function (position, time)
• Elasticity: Stress, Strain (position, time)
73
Sparse Matrices – Resistor Line
C Cox

Tridiagonal Case: each node in the resistor line couples only to its two neighbours, so only the main diagonal and the diagonals immediately above and below it are non-zero (X marks a non-zero entry):

   row 1    [ X X             ]
   row 2    [ X X X           ]
   row 3    [   X X X         ]
   row 4    [     X X X       ]
   ...
   row m-1  [         X X X   ]
   row m    [           X X   ]
74
Resistor networks – Example 1
C Cox

VA --c12-- V2 --c23-- V3 --c34-- V4 --c45-- V5 --c56-- V6 --c67-- VB

• DC resistor network
• Current behaves according to Ohm's law V = IR
• Current through a resistor: (Vi-Vj)/Rij = (Vi-Vj)Cij (Rij is the resistance and Cij = 1/Rij the conductance)
• By Kirchhoff's current law the net current at a node is zero:

Node 2: C23(V3-V2) - C12(V2-VA) = 0
Node 3: C34(V4-V3) - C23(V3-V2) = 0
Node 4: C45(V5-V4) - C34(V4-V3) = 0
Node 5: C56(V6-V5) - C45(V5-V4) = 0
Node 6: C67(VB-V6) - C56(V6-V5) = 0

.. factorise

Node 2: V2(-C12-C23) + V3(C23)           = -C12VA
Node 3: V2(C23) + V3(-C23-C34) + V4(C34) = 0
Node 4: V3(C34) + V4(-C34-C45) + V5(C45) = 0
Node 5: V4(C45) + V5(-C45-C56) + V6(C56) = 0
Node 6: V5(C56) + V6(-C56-C67)           = -C67VB

VA, VB are at fixed voltage (boundary condition); V2..V6 are unknown.
5 equations, 5 unknowns.
75
Resistor networks – Example 1
C Cox

Node 2: V2(-C12-C23) + V3(C23)           = -C12VA
Node 3: V2(C23) + V3(-C23-C34) + V4(C34) = 0
Node 4: V3(C34) + V4(-C34-C45) + V5(C45) = 0
Node 5: V4(C45) + V5(-C45-C56) + V6(C56) = 0
Node 6: V5(C56) + V6(-C56-C67)           = -C67VB

sparse linear form to solve for node voltages:

[ -C12-C23   C23        0          0          0        ] [V2]   [ -C12VA ]
[  C23      -C23-C34    C34        0          0        ] [V3]   [  0     ]
[  0         C34       -C34-C45    C45        0        ] [V4] = [  0     ]
[  0         0          C45       -C45-C56    C56      ] [V5]   [  0     ]
[  0         0          0          C56       -C56-C67  ] [V6]   [ -C67VB ]

VA --c12-- V2 --c23-- V3 --c34-- V4 --c45-- V5 --c56-- V6 --c67-- VB
76
Resistor networks – Example 2
C Cox

More complex example: a 3x3 grid of nodes, with VA and VB at fixed voltage and V2..V8 unknown. Conductances: c12, c14, c23, c25, c36, c45, c47, c56, c58, c69, c78, c89, where cij links node i to node j (node 1 = VA, node 9 = VB).

Node 2: C12(VA-V2) + C25(V5-V2) - C23(V2-V3)              = 0
Node 3: C23(V2-V3) - C36(V3-V6)                           = 0
Node 4: C47(V7-V4) - C14(V4-VA) - C45(V4-V5)              = 0
Node 5: C45(V4-V5) + C56(V6-V5) - C25(V5-V2) - C58(V5-V8) = 0
Node 6: C36(V3-V6) - C56(V6-V5) - C69(V6-VB)              = 0
Node 7: C78(V8-V7) - C47(V7-V4)                           = 0
Node 8: C58(V5-V8) - C89(V8-VB) - C78(V8-V7)              = 0

.. factorise

Node 2: V2(-C12-C25-C23) + V3(C23) + V5(C25)                         = -C12VA
Node 3: V2(C23) + V3(-C23-C36) + V6(C36)                             = 0
Node 4: V4(-C47-C14-C45) + V5(C45) + V7(C47)                         = -C14VA
Node 5: V2(C25) + V4(C45) + V5(-C45-C56-C25-C58) + V6(C56) + V8(C58) = 0
Node 6: V3(C36) + V5(C56) + V6(-C36-C56-C69)                         = -C69VB
Node 7: V4(C47) + V7(-C78-C47) + V8(C78)                             = 0
Node 8: V5(C58) + V7(C78) + V8(-C58-C89-C78)                         = -C89VB

7 equations, 7 unknowns
77
Resistor networks – Example 2
C Cox

Node 2: V2(-C12-C25-C23) + V3(C23) + V5(C25)                         = -C12VA
Node 3: V2(C23) + V3(-C23-C36) + V6(C36)                             = 0
Node 4: V4(-C47-C14-C45) + V5(C45) + V7(C47)                         = -C14VA
Node 5: V2(C25) + V4(C45) + V5(-C45-C56-C25-C58) + V6(C56) + V8(C58) = 0
Node 6: V3(C36) + V5(C56) + V6(-C36-C56-C69)                         = -C69VB
Node 7: V4(C47) + V7(-C78-C47) + V8(C78)                             = 0
Node 8: V5(C58) + V7(C78) + V8(-C58-C89-C78)                         = -C89VB

sparse linear form to solve for node voltages (unknowns V2..V8):

[ -C12-C25-C23   C23        0               C25                 0              0          0            ] [V2]   [ -C12VA ]
[  C23          -C23-C36    0               0                   C36            0          0            ] [V3]   [  0     ]
[  0             0         -C47-C14-C45     C45                 0              C47        0            ] [V4]   [ -C14VA ]
[  C25           0          C45            -C45-C56-C25-C58     C56            0          C58          ] [V5] = [  0     ]
[  0             C36        0               C56                -C36-C56-C69    0          0            ] [V6]   [ -C69VB ]
[  0             0          C47             0                   0             -C78-C47    C78          ] [V7]   [  0     ]
[  0             0          0               C58                 0              C78       -C58-C89-C78  ] [V8]   [ -C89VB ]
78
C Cox
Sparse Matrices – Compressed Row Storage
Concept
• Only store non-zero values
Tradeoff
• Reduce storage dramatically {from O(n²) to O(n)}
• Increased cost and inconvenience of access and update
31 0 52 ...
 0 59 0 ...


41 26 0 ...


...
...
...
...


79
C Cox
Sparse data structure –
compressed row storage
[Figure: a sparse matrix alongside its compressed row storage representation. A "row pointer" array (1, 1, 4, 7, ... in the figure) gives, for each row, the index of that row's first entry in two parallel "column and values" arrays holding the column numbers and the non-zero values.]
80