Transcript Document
ME964
High Performance Computing
for Engineering Applications
Parallel Programming Patterns
Discussion of Midterm Project
Oct. 21, 2008
Before we get started…
Last Time
Parallel Programming patterns
Today
Parallel Programming patterns
Discuss the “Supporting Structures Design Space”
Discussion of the Midterm Project
The “Finding Concurrency Design Space”
The “Algorithm Structure Design Space”
Assigned today, due on 11/18 or 11/09 (based on level of difficulty)
Other issues:
I will be in Milwaukee on Th, Mik will be the lecturer - tracing the
evolution of the hardware support for the graphics pipeline
2
Key Parallel Programming Steps
1)
Find the concurrency in the problem
2)
Structure the algorithm so that concurrency can be exploited
3)
Implement the algorithm in a suitable programming
environment
4)
Execute and tune the performance of the code on a parallel
system
3
What’s Comes Next?
Finding Concurrency
Algorithm Structure
Focus on this
for a while
Supporting Structures
Implementation Mechanisms
4
Before We Dive In…
You hover in this design space when you ponder whether an MPI or an
SPMD or a event driven simulation environment is what it takes to
optimally design your algorithm
We are not going to focus on this process too much since it’s clear that
we are stuck with SPMD
All that we’ll be doing is to compare SPMD with other parallel programming
paradigms to understand when SPMD is good and when it is bad
Also, we will not cover the “Implementation Mechanisms” design space
since we don’t need to wrestle with how to spawn and join threads,
how to do synchronization, etc.
This would be Chapter 6 of the book
5
Supporting Structures
~Program and Data Models~
Program Models
Data Models
SPMD
Shared Data
Master/Worker
Shared Queue
Loop Parallelism
Distributed Array
Fork/Join
These are not necessarily
mutually exclusive.
6
Program Structuring Patterns
SPMD (Single Program, Multiple Data)
All PE’s (Processor Elements) execute the same program in
parallel
Each PE has its own data
Each PE uses a unique ID to access its portion of data
Different PE can follow different paths through the same code
This is essentially the CUDA Grid model
Master/Worker
Loop Parallelism
Fork/Join
7
Program Structuring Patterns
(the rest of the pack)
Master/Worker
Loop Parallelism
A Master thread sets up a pool of worker threads and a bag of tasks
Workers execute concurrently, removing tasks until done
Loop iterations execute in parallel
FORTRAN do-all (truly parallel), do-across (with dependence)
Very common in OpenMP
Fork/Join
Most general way of creation of threads (the POSIX standard)
Can be regarded as a very low level approach in which you use the OS
to manage parallelism
8
Review: Algorithm Structure
Algorithm Design
Organize
by Task
Organize by
Data
Organize by
Data Flow
Linear
Recursive
Linear
Recursive
Regular
Irregular
Task
Parallelism
Divide and
Conquer
Geometric
Decomposition
Recursive
Data
Pipeline
Event Driven
9
Algorithm Structures
vs.
Supporting Structures
Task Parallel.
Divide/Conquer
Geometric
Decomp.
Recursive
Data
Pipeline
Event-based
SPMD
☺☺☺☺
☺☺☺
☺☺☺☺
☺☺
☺☺☺
☺☺
Loop
Parallel
☺☺☺☺
☺☺
☺☺☺
Master/
Worker
☺☺☺☺
☺☺
☺
☺
☺
☺
Fork/
Join
☺☺
☺☺☺☺
☺☺
☺☺☺☺
☺☺☺☺
Four smilies is the best (Source: Mattson, et al.)
10
Supporting Structures vs. Program Models
OpenMP
MPI
CUDA
SPMD
☺☺☺
☺☺☺☺
☺☺☺☺
Loop
Parallel
Master/
Slave
☺☺☺☺
☺
☺☺
☺☺☺
Fork/Join
☺☺☺
11
More on SPMD
Dominant coding style of scalable parallel computing
MPI code is mostly developed in SPMD style
Many OpenMP code is also in SPMD (next to loop parallelism)
Particularly suitable for algorithms based on data parallelism,
geometric decomposition, divide and conquer.
Main advantage
Tasks and their interactions visible in one piece of source code,
no need to correlated multiple sources
SPMD is by far the most commonly used pattern for
structuring parallel programs.
12
Typical SPMD Program Phases
Initialize
Establish localized data structure and communication channels
Obtain a unique identifier
Each thread acquires a unique identifier, typically in the range from 0 to
N-1, where N is the number of threads.
Distribute Data
Decompose global data into chunks and localize them, or
Sharing/replicating major data structure using thread ID to associate
subset of the data to threads
Run the core computation
Both OpenMP and CUDA have built-in support for this.
More details in next slide…
Finalize
Reconcile global data structure, prepare for the next major iteration
13
Core Computation Phase
Thread IDs are used to differentiate behavior of threads
Use thread ID in loop index calculations to split loop iterations
among threads
Use conditions based on thread ID to branch to their specific
actions
Both can have very different performance results and code
complexity depending on the way they are done.
14
A Simple Example
Assume
The computation being parallelized has 1,000,000 iterations.
Sequential code:
Num_steps = 1000000;
for (i=0; i< num_steps, i++) {
…
}
15
SPMD Code Version 1
Assign a chunk of iterations to each thread
The last thread also finishes up the remainder iterations
num_steps = 1000000;
i_start = my_id * (num_steps/num_threads);
i_end = i_start + (num_steps/num_threads);
if (my_id == (num_threads-1)) i_end = num_steps;
for (i = i_start; i < i_end; i++) {
….
}
Reconciliation of results across threads if necessary.
16
Problems with Version 1
The last thread executes more iterations than others
The number of extra iterations is up to the total
number of threads – 1
This is not a big problem when the number of threads is
small
When there are thousands of threads, this can create
serious load imbalance problems.
Also, the extra if statement is a typical source of
“branch divergence” in CUDA programs.
17
SPMD Code Version 2
Assign one more iterations to some of the threads
int rem = num_steps % num_threads;
i_start = my_id * (num_steps/num_threads);
i_end = i_start + (num_steps/num_threads);
if (rem != 0) {
if (my_id < rem) {
i_start += my_id;
i_end += (my_id +1);
}
else {
i_start += rem;
i_end += rem;
}
Less load imbalance
More branch divergence.
.
18
SPMD Code Version 3
Use cyclic distribution of iteration
num_steps = 1000000;
for (i = my_id; i < num_steps; i+= num_threads)
{
….
}
Less load imbalance
Loop branch divergence in the last Warp
Data padding further eliminates divergence.
19
Comparing the Three Versions
Version 1
ID=0
ID=1
ID=2
ID=3
Version 2
ID=0
ID=1
ID=2
ID=3
ID=0
ID=1
ID=2
ID=3
Version 3
Padded version1 1 may be best
for some data access patterns.
20
Data Styles
Shared Data
All threads share a major data structure
This is what CUDA supports
Shared Queue
All threads see a “thread safe” queue that maintains ordering of
data communication
Very relevant in conjunction with the Master/Worker scenarios
Distributed Array
Decomposed and distributed among threads
Limited support in CUDA Shared Memory
Typically an issue in NUMA layouts
21
End: Parallel Programming Patterns
Begin: Overview Reminder of Semester
22
The Reminder of the Semester
There will be one more lecture on CUDA (in November)
There will be three lectures on MPI
There will be about four to five guest lectures
On massively parallel hardware design
On grid computing (the Condor project)
On the N-body problem
Midterm exam on 11/20
Midterm Project (assigned today, due on 11/17 or sooner)
I’ll explain why “sooner” shortly
Final Project (due on last Friday, Dec. 19, at 11:59 PM)
23
Midterm Project
The topic: Produce a Collision Detection “engine”
Deal with spheres or ellipsoids
There will be two levels of difficulty:
The entry level
You implement a simple “brute force” approach
Deal only with spheres of various radii
Due Date: November 9, 11:59 PM
The advanced level
Apply a sophisticated design to produce a parallel algorithm that is very efficient
Deal with ellipsoids
Note that spheres are a particular case of ellipsoids
Due Date: November 17, 11:59 PM
Most likely make this engine a part of your Final Project
Independent of the level embraced, you’ll be presenting your work on November
18, during regular class hours
You’ll have 10 minutes to make a PowerPoint presentation. The presentations are part
of the project (to be submitted for grading)
Please observe the Nov. 17 deadline for presentations for both levels
24
Final Project
Work on something related to your research
I can assist you with selecting a topic
I will help you with the design
There is a “default” Final Project, in case you can’t choose
The Final Project: Further Comments
26
There is a connection between the
Midterm and Final Projects:
It is up to you to decide what you want to
work on for the Final Project
However, if you don’t know what to work on,
I’ll assign you a Granular Dynamics
problem (this is the “default” Final Project)
An important part of this problem is the
collision detection stage
Granular Dynamics Problem
The Final Project: Concluding Remarks
To conclude, the default Final Project will look like this:
Final Project = Midterm Project + Some Extra Headache
Granular Material Dynamics
Keep in mind:
If you plan to do the default final project, think ahead and
implement a collision detection engine that is suitable to a scenario
where the bodies slightly change their location each time you have
to do the contact detection
This is because in granular flow the particles move in space
Maybe you can run a “cheaper” collision detection after an initial
“exhaustive” one
You’ll need to look at problems with millions of colliding bodies
27
End: Overview Reminder of Semester
Begin: Midterm Project
28
The Collision Problem
(The Midterm Project)
Problem Statement:
The bodies that you are dealing with are of spherical shape
A large set of spheres or ellipsoids is given to you
Provide collision state information for each pair of bodies that are colliding
Spheres are of arbitrary radius R, where Rlow < R < Rhigh
The bound values Rlow and Rhigh are given to you
Their location is given to you
NOTE: The ellipsoid case is similar (skipped here the discussion)
You will have to validate your results against “Bullet”, which is an open
source game development tool.
29
Proposed Workflow
The proposed workflow below will be captured in a Visual Studio
Solution (with three Projects) that will be provided to you (zipped file).
You are responsible for the part in red font below.
Generate an initial distribution of the spheres (part of Project 1)
Load data (part of Project 2)
Run collision detection on the device (part of Project 2)
Save results in a file (part of Project 2)
Run collision detection using the Bullet library (part of Project 3)
The utility will tell you if you “passed” or “failed” (part of Project 3)
30
Project template
Solution contains three projects
First project is used to generate the input data (distribution of the
spheres along with their radius)
Second project loads input data and checks for collisions
This is the project you modify
The main driver calls a function that invokes the collision
detection kernel
Pretty much like you’ve had to do in your HW
Third project does the Bullet part followed up by the validation
Uses ‘C’ function notation
Main file uses C++ classes and objects
Allows C++ and C code to work together
31
Spheres: Possible Contact Cases
Convention:
Color code for object A: bluish
Color code for object B: greenish
A
Sphere contacts sphere
One contact point
Normal - from center of B to center of A
Sphere interpenetrates sphere
A
Two contact points
B
Points on objects furthest in
Normal - from center of B to center of A
B
32
Possible Contact Cases (Contd.)
Sphere exactly on top of sphere
Two points returned
Points on surface that intersect the x-axis
Normal is on +x axis
Sphere inside sphere
B
A
Two points returned
Points on surface that intersect the x-axis
Normal is on +x axis
Spheres are not in contact
A
Nothing to report, ignore
B
A
B
33
Bullet Physics Library
Bullet is a cross platform open source physics library.
Can simulate many types of objects
Supports joints and constraints
Geared towards game development.
Rigid bodies (concave, convex, triangular mesh, etc.)
Cloth, soft bodies
Speed vs. accuracy: comes on the side of speed
Bullet’s collision detection will be used to verify your
results
34
Screenshot, Bullet
35
Relevant Variables
All variables retuned per contact by Bullet
btVector3 localPointA
btVector3 localPointB
btVector3 positionWorldOnB
btVector3 positionWorldOnA
btVector3 normalWorldOnB
btScalar distance1
btVector3 lateralFrictionDir1
btVector3 lateralFrictionDir2
int partId0
int partId1
int index0
int index1
btScalar combinedFriction
btScalar combinedRestitution
void * userPersistentData
btScalar appliedImpulse
bool lateralFrictionInitialized
int lifeTime
btScalar appliedImpulseLateral1
btScalar appliedImpulseLateral2
Variables you need to return and calculate
btVector3 positionWorldOnA
btVector3 positionWorldOnB
btVector3 normalWorldOnB
int objectIdA (note: this is the Id associated with A)
int objectIdB (note: this is the Id associated with B)
Additional parameters for Final Project*:
localPointA
localPointB
btVector3 lateralFrictionDir1
btVector3 lateralFrictionDir2
distance1
36
Variable Description
positionWorldOnB
positionWorldOnA
normalWorldOnB
objectIdA
objectIdB
(float3)
(float3)
(float3)
(int)
(int)
Absolute Collision location on object A
Absolute Collision location on object B
Collision Normal Vector from B to A
Object A ID
Object B ID
(Id starts from 0)
Other parameters that need to be implemented for Final Project (if you take this route):
localPointA
localPointB
distance1
lateralFrictionDir1
lateralFrictionDir2
(float3)
(float3)
(float)
(float3)
(float3)
Collision point w/ respect to center of object A
Collision point w/ respect to center of object B
Distance between collision points
Vector orthogonal to normal and Dir2
Vector orthogonal to normal and Dir1
37
The Passed/Failed Issue
The comparison is going to be made based on
The contact location expressed in global reference frame
There will be an epsilon value that controls this test
The contact normal
Your normal and Bullet’s normal should be within 10±
For comparison purposes, your contacts should be ordered
such that
I still owe you two things:
How are body A and body B chosen?
What to do with this collision case:
38
Common Sense Advice
Donald Knuth (1974): Premature optimization is the root of all evil
Get it to run first, add the bells and whistles later on…
Embrace a Crawl – Walk – Run approach
39
Collision Detection
~ Some Possible Approaches ~
Brute force – simplest and slowest
Ok to implement, unless you want to make this your final project, in
which case you’ll have to implement one of the other three methods
below
It relies on an exhaustive search: N(N-1)/2 tests (given N bodies)
Baraff – sorting based (the method is sometimes called “culling”
Harada – rely on voxel and texture, not recommended
Scott Le Grand – spatial subdivision method
40
References
The methods of Scott Le Grand and Harada are explained in
great detail in GPU Gems 3 (you can borrow it from me)
I also have the following references (found them on the web,
contact me if you need them):
Naga, K. G., R. Stephane, C. L. Ming, and M. Dinesh, 2003: CULLIDE: interactive collision detection between
complex models in large environments using graphics hardware. Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, Eurographics Association.
Baraff, D., 1997: An Introduction to Physically Based Modeling: Rigid
Body Simulation II—Nonpenetration Constraints. SIGGRAPH Course Notes.
41
Brute Force
Iterate over all combinations of object pairs and
perform distance checks on them
Store pairs that are colliding into list and compute
required information
Iterate over each pair and calculate additional
contact data.
42
Baraff – Sorting axis
p6
p5
p3
p4
p1
b3 b6
b1
e6
e1 b5
b2 e3 b4 e5
e4
e2
Every time an object p begins on the axis add p to stack
Every time an object p ends on the axis remove p from stack
Each time an object is added to stack compare it to those still
on stack
p2
p1 check with p3 and p6
p5 check with p3
An object is colliding with another if they overlap on all three
axes.
Multi GPU relevant: yes, it can speed up the problem by
simultaneously culling based on more axes
43
Example: Why this is tricky at times
(The issue of false positives)
44
Harada , GPU Gems 3
Mostly meant to deal with spheres, and of the same radii
“Voxel” collision method
Spheres are used to represent other objects
Voxel ´ 3d pixel
Fill object’s space using spheres
Use many 2D textures to store data.
Textures make up 3d grid
Optimize by using texture memory
Easier to coalesce reads and writes
45
Harada (Contd.)
Each voxel in texture contains up
to 4 object id’s
Stored in RGBA values
Objects and cells need to be sized so
that this is true
Constant radius is preferred but some
variation can exist
Too much variation could cause
instances with >4 objects in one cell’s
space
Test each cell with its 26
neighbors
Iterate over all cells in this way
46
Scott Le Grand, GPU Gems 3
Spatial subdivision method
Divide space into three dimensional grid
Collision state of spheres cannot be modified by different
threads
Cells numbered by processing order (1, 2,…)
First all #1 cells are processed, then #2 cells and so on
Insures that two threads do not overwrite collision data for one
sphere
Spheres get assigned a home cell id and overlapping
“phantom” cell id’s
Array gets sorted by cell number and type
47
Scott Le Grand (Contd.)
From this list determine which cells have multiple objects inside
These cells will be checked for collisions
Task is simplified because cell id list is already sorted by cell
number
By checking for a change in cell id the number of objects in that cell
can be determined.
If cell only has phantom objects it will not be checked
Phantom objects - objects whose centroid is not in that cell but still intersects it
Cell id must have at least one home id and >1 objects to be checked
Multi GPU relevant: yes, it can increase the size of the problem
being addressed
48