Parallel Computing and You:
A CME213 Review
June 24, 2013
Jason Su
Technologies for C/C++/Fortran
• Single machine, multi-core
– P(OSIX) threads: bare metal multi-threading
– OpenMP: compiler directives that implement various constructs
like parallel-for
• Single machine, GPU
– CUDA/OpenCL: bare metal GPU coding
– Thrust: algorithms library for CUDA inspired by C++ STL
• Like MATLAB, where you use a set of fast core functions and data
structures to implement your program
• Multi-machine
– Message Passing Interface (MPI): a language-independent
communication protocol to coordinate and program a cluster of
machines
Challenges of Parallel Programming
• Correctness
– Race conditions
– Synchronization/deadlock
– Floating point arithmetic is not associative or distributive
• Debugging
– How do you fix problems that are not reproducible?
– How do you assess the interaction of many threads
running simultaneously?
• Performance
– Management and algorithm overhead
– Amdahl’s Law: S(n_threads) = 1 / ((1 − P) + P / n_threads)
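As a quick numeric illustration (the parallel fractions and thread counts below are my own choices, not from the talk), Amdahl’s Law can be evaluated directly:

#include <cstdio>

// Amdahl's Law: S(n_threads) = 1 / ((1 - P) + P / n_threads),
// where P is the fraction of the program that can run in parallel.
double amdahl_speedup(double P, int n_threads) {
    return 1.0 / ((1.0 - P) + P / n_threads);
}

int main() {
    // Even with 90% of the work parallelized, speedup saturates well below
    // the thread count: ~4.7x on 8 threads, and never more than 10x.
    printf("P=0.90, n=8:   S=%.2f\n", amdahl_speedup(0.90, 8));
    printf("P=0.90, n=256: S=%.2f\n", amdahl_speedup(0.90, 256));
    return 0;
}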
Challenges: Race Conditions
As a simple example, let us assume that two threads each want
to increment the value of a global integer variable by one.
What we expected:

Thread 1         Thread 2         Integer Value
                                  0
Read ← 0                          0
Increment                         0
Write → 1                         1
                 Read ← 1         1
                 Increment        1
                 Write → 2        2

What we got:

Thread 1         Thread 2         Integer Value
                                  0
Read ← 0                          0
                 Read ← 0         0
Increment                         0
                 Increment        0
Write → 1                         1
                 Write → 1        1
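A minimal C++ sketch of the same race (std::thread and the atomic fix are my additions, not code from the slides): both threads do a read-modify-write on the shared counter, and making the increment atomic removes the race.

#include <atomic>
#include <cstdio>
#include <thread>

int counter = 0;                     // shared, unprotected: read-modify-write can interleave
std::atomic<int> atomic_counter{0};  // shared, atomic: increments cannot be lost

void increment() {
    counter++;          // may lose an update: both threads can read 0 and write 1
    atomic_counter++;   // always ends at 2
}

int main() {
    std::thread t1(increment), t2(increment);
    t1.join();
    t2.join();
    printf("plain:  %d (1 or 2, depending on timing)\n", counter);
    printf("atomic: %d (always 2)\n", atomic_counter.load());
    return 0;
}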
Challenges: Deadlock
A real world example would be an illogical statute passed by the Kansas
legislature in the early 20th century, which stated:
“When two trains approach each other at a crossing, both shall come to a
full stop and neither shall start up again until the other has gone.”
Challenges: Deadlock
• Consider a program that manages bank accounts:
BankAccount:
    string owner
    float balance
    withdraw(float amount)
    deposit(float amount)
    transfer(BankAccount to, float amount):
        lock(self)
        lock(to)
        self.withdraw(amount)
        to.deposit(amount)
        release(to)
        release(self)
Challenges: Deadlock
As a simple example, let us assume that two threads each want
to transfer between the same two accounts at the same time.
What we expected:

Thread 1: Bilbo        Thread 2: Frodo
Deposit: Ring          Deposit: $1000
Transfer: Ring →       ($1000 + Ring)
                       ← Transfer: $1000
$1000                  Ring

What we got:

Thread 1: Bilbo        Thread 2: Frodo
Deposit: Ring          Deposit: $1000
Transfer: Ring →       ← Transfer: $1000
Each thread locks its own account and waits forever for the other’s:
“No, you give me it first, then I’ll give it to you!”
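One standard fix, sketched in C++ rather than the course’s pseudocode: acquire both locks as a single operation (or always in a globally consistent order), so neither thread can end up holding one account while waiting for the other. std::scoped_lock (C++17) does the deadlock-free acquisition for us:

#include <mutex>
#include <string>

struct BankAccount {
    std::string owner;
    float balance = 0.0f;
    std::mutex m;
    void withdraw(float amount) { balance -= amount; }
    void deposit(float amount)  { balance += amount; }
};

// Locks both accounts atomically, so two opposing transfers
// (Bilbo -> Frodo and Frodo -> Bilbo) can no longer deadlock.
void transfer(BankAccount& from, BankAccount& to, float amount) {
    std::scoped_lock lock(from.m, to.m);
    from.withdraw(amount);
    to.deposit(amount);
}

int main() {
    BankAccount bilbo{"Bilbo"}, frodo{"Frodo"};
    frodo.deposit(1000.0f);
    transfer(frodo, bilbo, 1000.0f);   // safe even if another thread transfers the other way
    return 0;
}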
Challenges: Floating Point Arithmetic
Say we had a floating point format that has 1 base-10 exponent
and 1 decimal place:
□E□

What we expected:
1E1 + 3E0 + 4E0 = 2E1

What we got:
(1E1 + 3E0) + 4E0 = 1E1 + 4E0 = 1E1
Challenges: Floating Point Arithmetic
• Floating point numbers have limited precision
– Any operation implicitly rounds, which means that
order matters
• Float arithmetic is not associative or distributive
• Do operations on values of similar precision first
– Actually an issue in all code, but becomes
apparent when serial and parallel answers differ
“Many programmers steeped in sequential programming for so many
years make the assumption that there is only one right answer for their
algorithm. After all, their code has always delivered the same answer
every time it was run…however, all the answers are equally correct”
Challenges: Floating Point Arithmetic
• The answer changes significantly with the
number of threads
– Because the order of operations has changed
– Indicates an unstable numerical algorithm
– Parallel programming revealed a flaw in the
algorithm which is otherwise unapparent with
serial code
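The same effect shows up in ordinary double precision (a standard example, not one from the slides): the three summands are identical, only the grouping differs.

#include <cstdio>

int main() {
    // Rounding happens after every operation, so grouping changes the answer.
    double left_to_right = (0.1 + 0.2) + 0.3;   // 0.60000000000000009
    double regrouped     = 0.1 + (0.2 + 0.3);   // 0.59999999999999998
    printf("(0.1 + 0.2) + 0.3 = %.17g\n", left_to_right);
    printf("0.1 + (0.2 + 0.3) = %.17g\n", regrouped);
    printf("equal? %s\n", left_to_right == regrouped ? "yes" : "no");   // no
    return 0;
}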
OpenMP
• Uses a fork-join programming paradigm
OpenMP
• Declare private and shared variables for a thread
• Atomic operations
– Ensure that code is uninterrupted by other threads,
important for shared variables
• Parallel for-loop: see MATLAB’s parfor
> #pragma omp parallel for num_threads(nthreads)
for (int i=0; i < N; i++) { ... }
• Reduction: ℝ^n_threads → ℝ
> #pragma omp parallel for reduction(+:val)
for (uint i=0; i < v.size(); i++)
val += v[i];
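Putting the directives together, a minimal compilable sketch (compile with e.g. g++ -fopenmp; the vector and its contents are placeholders of my choosing):

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    std::vector<float> v(1 << 20, 1.0f);

    // Each thread accumulates into a private copy of val; the copies are
    // combined with + when the parallel for-loop ends.
    float val = 0.0f;
    #pragma omp parallel for reduction(+:val)
    for (int i = 0; i < (int)v.size(); i++)
        val += v[i];

    printf("sum = %.0f using up to %d threads\n", val, omp_get_max_threads());
    return 0;
}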
Moore’s Law
• Serial scaling
performance
has reached
its peak.
• Processors are
not getting
faster, but
wider
GPUs
CPU vs GPU
• GPU devotes more transistors to data
processing
CPU vs GPU
• CPU minimizes time to complete a given task:
latency of a thread
• GPU maximizes number of tasks in a fixed
time: throughput of all threads
• Which is better? It depends on the problem
“If you were plowing a field, which would you rather use:
two strong oxen or 1024 chickens?”
– Seymour Cray
NVIDIA CUDA
Compute Unified
Device Architecture
(CUDA)
• A set of libraries and extensions for C/C++ or Fortran
• Also 3rd party support in MATLAB, Python, Java, and
Mathematica
Language is evolving
with the hardware
• Newer GPU architectures use different CUDA versions
• Denoted by their Compute Capability
Hardware originates from the
desktop market
• NVIDIA offers their own expensive products designed for
scientific computing (Tesla and Quadro)
• But desktop options are affordable and effective
(GeForce)
NVIDIA CUDA
Extremely cheap thread
creation and switching
• Hiding latency with parallelism
Direct access to L1 cache
(shared memory,
up to 48KB)
• Fast but up to you to manage it
correctly
Single instruction
multiple thread (SIMT)
execution model
• May or may not fit your problem
NVIDIA GPU Architecture: Processors
• GPU contains many
streaming multiprocessors
(MP or SM)
– These do the work
– Groups threads into “warps”
and executes these in lockstep
– Different warps are swapped
in and out constantly to hide
latency = “parallelism”
What is a warp?
• A warp is a group of 32 threads that are executed in lockstep (SIMT)
• Threads in a warp demand obedience: each is either idle or executes the
same instruction as its siblings
– Divergent code (if/else, loops, return, etc.) can wastefully leave
some threads idle
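A hypothetical kernel pair to make this concrete (my own example, not from the slides): in the first kernel every warp contains threads on both sides of the branch, so it pays for both paths; in the second, whole warps take one branch each.

__global__ void divergent(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Odd/even split: every warp has threads on both sides of the branch,
    // so the warp executes both paths with part of its threads masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

__global__ void warp_uniform(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Branching on the warp index keeps each warp together:
    // no warp has to execute both paths.
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}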
NVIDIA GPU Architecture: Memory
NVIDIA GPU Architecture
• Imagine a pool of threads representing all the
work there is to do
– For example, a thread for each pixel in the image
– These threads need to be broken up into blocks that
fit neatly onto each MP; this split is chosen by you
• There are several limitations that affect how well
these blocks fill up an MP, called its occupancy
– Total threads < 2048
– Total blocks < 16
– Total shared memory < 48KB
– Total registers < 64K
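A small back-of-the-envelope check of those limits (the block size and shared-memory usage below are hypothetical, and registers are ignored for simplicity):

#include <algorithm>
#include <cstdio>

int main() {
    // Per-MP limits quoted on the slide.
    const int max_threads = 2048;
    const int max_blocks  = 16;
    const int max_smem    = 48 * 1024;          // bytes

    // Hypothetical kernel configuration.
    const int block_threads = 256;
    const int block_smem    = 10 * 1024;        // shared memory per block, in bytes

    int blocks = std::min({max_threads / block_threads,   // 8 blocks by the thread limit
                           max_smem / block_smem,         // 4 blocks by the shared-memory limit
                           max_blocks});                  // 16 blocks by the block limit
    printf("resident blocks/MP = %d, occupancy = %d%%\n",
           blocks, 100 * blocks * block_threads / max_threads);   // 4 blocks, 50%
    return 0;
}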
CUDA Program Structure
Host code
• void main()
• Transfers data to global memory
on GPU/device
• Determines how the data and
threads are divided to fit onto MPs
• Calls kernel functions on GPU
• kernel_name
<<< gridDim, blockDim >>>
(arg1, arg2, …);
CUDA Program Structure
• Memory and processors on the
GPU exist in big discrete
chunks
• Designing for GPU is not only
developing a parallel algorithm
• But also shaping it so that it
optimally maps to GPU
hardware structures
CUDA Program Structure
Global kernel functions
• Device code called by the host
• Access data based on threadIdx
location
• Do work, hopefully in a nondivergent way so all threads are
doing something
Device functions
• Device code called by the device
• Helper functions usually used in
kernels
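A minimal end-to-end sketch of that structure (array size, names, and the scaling kernel are mine): the host copies data to device global memory, picks the grid/block split, and launches a kernel that finds its element from blockIdx and threadIdx.

#include <cstdio>
#include <cuda_runtime.h>

// Global kernel function: device code called by the host.
__global__ void scale(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // which element is mine?
    if (i < n)                                       // guard the ragged last block
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float* h = new float[n];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    // Transfer data to global memory on the GPU/device.
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Decide how the threads are divided to fit onto the MPs, then call the kernel.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<numBlocks, threadsPerBlock>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %.1f\n", h[0]);   // 2.0

    cudaFree(d);
    delete[] h;
    return 0;
}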
Connected Component Labeling
Finally, we get to the problem.
Connected Component Labeling
• Say you had a segmented mask of lesions:
– How would you count the total number of
(connected) lesions in the image?
– How would you track an individual lesion over
time?
[Figure: segmented lesion mask with two connected regions, labeled 1 and 2]
• Connected labels share a similar property or state
• In this case color, where we’ve simplified to black and white
[Figure: stars labeled 1 through 5467]
Initialize with Raster-Scan Numbering
Input Image: [figure: binary mask of the regions to label]
Initial Labels (raster-scan numbering):

 1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64
• Goal is to label each connected region with a unique
number; min(a, b) is used to combine labels.
• That means labels 1, 7, 9, and 54 should remain.
Kernel A – Label Propagation
• Iterate until nothing changes:
1. Assign a pixel for each thread
2. For each thread:
a) Look at my neighboring pixels
b) If neighbor is smaller than me,
re-label myself to the lowest
adjacent label
K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs
and CUDA, Parallel Computing 36 (12) (2010) 655–678.
Kernel A – Label Propagation
• A label can only propagate itself by a maximum of
one cell per iteration
– Many iterations are required
• Very slow for large clusters in the image
– We’re getting killed by many accesses to global
memory, each taking O(100) cycles
– Even having many parallel threads is not enough to
hide it, need 32 cores/MP*400 cycles = 12,800
threads active/MP
K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs
and CUDA, Parallel Computing 36 (12) (2010) 655–678.
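A sketch of the per-iteration kernel described above (4-connectivity, binary foreground mask; the identifiers are mine, not the code of Hawick et al.). The host re-launches it, swapping the label buffers, until the changed flag stays zero.

__global__ void propagate_labels(const int* labels_in, int* labels_out,
                                 const unsigned char* mask,
                                 int width, int height, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    int label = labels_in[idx];
    if (!mask[idx]) {                        // background pixel: nothing to do
        labels_out[idx] = label;
        return;
    }

    // Look at my 4 neighbors; adopt the lowest label among foreground neighbors.
    if (x > 0          && mask[idx - 1])     label = min(label, labels_in[idx - 1]);
    if (x < width - 1  && mask[idx + 1])     label = min(label, labels_in[idx + 1]);
    if (y > 0          && mask[idx - width]) label = min(label, labels_in[idx - width]);
    if (y < height - 1 && mask[idx + width]) label = min(label, labels_in[idx + width]);

    labels_out[idx] = label;
    if (label < labels_in[idx])
        *changed = 1;                        // host keeps iterating while anything changed
}

Every pass reads up to five labels per pixel straight from global memory, which is why so many resident threads are needed to hide the latency.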
Kernel B – Local Label Propagation
• Take advantage of L1-speed
shared memory
• Iterate until global convergence:
1. Assign a pixel for each thread
2. For each thread:
a) Load myself and neighbor labels into shared memory
b) Iterate until local convergence:
i. Look at my neighboring pixels
ii. If neighbor is smaller than me, re-label myself to the lowest adjacent label
K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs
and CUDA, Parallel Computing 36 (12) (2010) 655–678.
Kernel B – Local Label Propagation
Initial Labels: [figure: the same 8×8 raster-numbered grid as above]
Kernel B – Local Label Propagation
• A label can only propagate itself by a maximum of one
block per iteration
– Iterations are reduced, but there can still be many
• Tiling the data into shared memory blocks is a very
common technique
– Appears in many applications, including FDTD
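A simplified sketch of the shared-memory version (my own rendering, not Hawick et al.’s code; it propagates only within a 16×16 tile and still relies on repeated launches to merge labels across tile borders). The inner do/while runs entirely out of L1-speed shared memory, which is what buys the roughly 10x improvement shown in the performance table later.

#define TILE 16

__global__ void propagate_labels_local(int* labels, const unsigned char* mask,
                                       int width, int height, int* changed)
{
    __shared__ int tile[TILE][TILE];
    __shared__ int block_changed;

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool in = (x < width) && (y < height);
    int idx = in ? y * width + x : 0;

    // Load my label into shared memory (out-of-range threads load a sentinel).
    tile[threadIdx.y][threadIdx.x] = in ? labels[idx] : 0x7FFFFFFF;
    __syncthreads();

    do {
        // Everyone has read block_changed from the previous pass; safe to reset it.
        __syncthreads();
        if (threadIdx.x == 0 && threadIdx.y == 0) block_changed = 0;
        __syncthreads();

        // Read phase: find my lowest foreground neighbor inside the tile.
        int label = tile[threadIdx.y][threadIdx.x];
        if (in && mask[idx]) {
            if (threadIdx.x > 0        && mask[idx - 1])
                label = min(label, tile[threadIdx.y][threadIdx.x - 1]);
            if (threadIdx.x < TILE - 1 && x < width - 1 && mask[idx + 1])
                label = min(label, tile[threadIdx.y][threadIdx.x + 1]);
            if (threadIdx.y > 0        && mask[idx - width])
                label = min(label, tile[threadIdx.y - 1][threadIdx.x]);
            if (threadIdx.y < TILE - 1 && y < height - 1 && mask[idx + width])
                label = min(label, tile[threadIdx.y + 1][threadIdx.x]);
        }
        __syncthreads();

        // Write phase: update shared memory and flag any change.
        if (label < tile[threadIdx.y][threadIdx.x]) {
            tile[threadIdx.y][threadIdx.x] = label;
            block_changed = 1;
            *changed = 1;              // the host keeps launching until this stays 0
        }
        __syncthreads();
    } while (block_changed);

    if (in)
        labels[idx] = tile[threadIdx.y][threadIdx.x];
}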
Kernel D – Label Equivalence
• Completely change the algorithm
– How do we make it so labels can propagate extensively in 1
iteration?
– Track which labels are equivalent and resolve equivalency chains
Iterate until convergence:
• Scanning: find the minimum neighbor for each pixel and record it in the equivalence list
• Analysis: traverse through the equivalence chain until you get a label that is equivalent to itself
Relabel the pixels
Kernel D – Label Equivalence
K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs
and CUDA, Parallel Computing 36 (12) (2010) 655–678.
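A sketch of the analysis step (the array name and layout are mine; it assumes the scanning step recorded, for every label, an equivalent label that is never larger, so every chain ends at a self-referencing root):

__global__ void resolve_equivalences(int* equiv, int n_labels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_labels) return;

    // Walk the chain i -> equiv[i] -> equiv[equiv[i]] -> ... until a label
    // refers to itself; that root becomes this label's final value.
    int label = i;
    while (equiv[label] != label)
        label = equiv[label];
    equiv[i] = label;      // collapse the whole chain to the root in one step
}

In this sketch, concurrent threads may overwrite entries while others are still traversing them, but since every write replaces a link with the root it already leads to, any traversal still ends at the same root; the relabel step then simply replaces each pixel’s label with equiv[label].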
CPU – MATLAB’s bwlabel
Run-length encode the image
• There are special hardware instructions that make this very fast
Scan the runs
• Assign initial labels and record label equivalences in an equivalence table
Analysis
• Here the chains are traversed to the actual minimum instead of a self-reference
• This is possible because the code is run in serial
Relabel the runs
Performance
Input       Solution    CPU: MATLAB's bwlabel   GPU: Kernel A   GPU: Kernel B   GPU: Kernel D
477x423     23 labels   1.23 ms                 92.9 ms         14.3 ms         1.05 ms
1463x2233   68 labels   15.2 ms                 5031.1 ms       491.3 ms        12.3 ms
Results: CPU vs GPU
GPU isn’t a cure-all
• A naïve algorithm will still lose to the CPU, badly
• Starting from the best serial algorithm is smart
• Shared memory can offer a 10x improvement
Connected component labeling
• Is actually a harder problem for the GPU to beat the CPU on
• It is a memory-access-bound problem
• It does not take advantage of the TFLOPs available
Figure Credits
• Christopher Cooper, Boston University
• Eric Darve, Stanford University
• Gordon Erlebacher, Florida State University
• NVIDIA CUDA C Programming Guide