
CSE-700 Parallel Programming
Introduction
박성우 (Sungwoo Park)
POSTECH
Sep 6, 2007
Common Features?
... runs faster on
Multi-core CPUs
• IBM Power4, dual-core, 2000
• Intel reaches thermal wall, 2004 ⇒ no more free lunch!
• Intel Xeon, quad-core, 2006
• Sony PlayStation 3 Cell, eight cores enabled, 2006
source: Herb Sutter - "Software and the concurrency revolution"
• Intel, 80 cores, 2011 (prototype finished)
Parallel Programming Models
• Posix threads (API)
• OpenMP (API)
• HPF (High Performance Fortran)
• Cray's Chapel
• Nesl
• Sun's Fortress
• IBM's X10
• ... and a lot more.
Parallelism
• Data parallelism
– ability to apply a function in parallel to each
element of a collection of data (see the OpenMP sketch below)
• Thread parallelism
– ability to run multiple threads concurrently
– Each thread uses its own local state.
• Shared memory parallelism
– multiple threads communicate by reading and writing a shared memory.
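For concreteness, here is a minimal OpenMP sketch of data parallelism in C. It is not from the slides; the array a and the function f are illustrative (build with gcc -std=c99 -fopenmp):

#include <omp.h>
#include <stdio.h>

#define N 1000

/* An illustrative per-element function. */
static double f(double x) { return 2.0 * x + 1.0; }

int main(void) {
  static double a[N];
  for (int i = 0; i < N; i++)
    a[i] = (double)i;

  /* Data parallelism: the same function is applied to every element;
     OpenMP divides the iterations among a team of threads. */
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] = f(a[i]);

  printf("a[%d] = %f\n", N - 1, a[N - 1]);
  return 0;
}

Each thread processes its own chunk of a, which matches the data-separation picture on a later slide.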
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
Data Parallelism = Data Separation
[Figure: the data a_1, a_2, ..., a_{n+m+l} are partitioned among the hardware threads: thread #1 operates on a_1, a_2, ...; thread #2 on a_n, a_{n+1}, a_{n+2}, ...; thread #3 on a_{n+m}, ..., a_{n+m+l}. Each thread owns its own separate chunk of the data.]
Data Parallelism in Hardware
• GeForce 8800
– 128 stream processors @ 1.3GHz, 500+ GFLOPS
Data Parallelism in Programming Languages
• Fortress
– parallelism is the default.
for i ← 1:m, j ← 1:n do
  a[i, j] := b[i] c[j]
end
// 1:n is a generator
• Nesl (1990s)
– supports nested data parallelism
• the function being applied itself can be parallel.
{sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
– evaluates to [5, 20, 7]; each inner sum may itself run in parallel.
Data Parallel Haskell (DAMP '07)
• Haskell + nested data parallelism
– flattening (vectorization)
• transforms a nested parallel program such that
it manipulates only flat arrays.
– fusion
• eliminate many intermediate arrays
• Ex: 10,000x10,000 sparse matrix multiplication with 1 million elements
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
Thread Parallelism
[Figure: two hardware threads, each with its own local state, exchanging messages via synchronous communication.]
Pure Functional Threads
• Purely functional threads can run concurrently.
– Effect-free computations can be executed in
parallel with any other effect-free computations.
• Example: collision-detection
[Figure: objects A and B move to new positions A' and B'; each collision test is an effect-free computation, so the tests can run in parallel.]
Manticore (DAMP '07)
• Three layers
– sequential base language
• functional language drawn from SML
• no mutable references and arrays!
– data-parallel programming
• Implicit:
– the compiler and runtime system manage thread
creation.
• e.g., parallel arrays of parallel arrays
[: 2 * n | n in nums where n > 0 :]
fun mapP f xs = [: f x | x in xs :]
– concurrent programming
Concurrent Programming in
Manticore (DAMP '07)
• Based on Concurrent ML
– threads and synchronous message passing
– Threads do not share mutable state.
• actually no mutable references and arrays
– explicit:
• The programmer manages thread creation.
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
(Shared State Concurrency)
Shared Memory Parallelism
[Figure: hardware threads #1, #2, and #3 all read and write a single shared memory.]
World War II
Company of Heroes
• Interaction of a LOT of objects:
– thousands of objects
– Each object has its own mutable state.
– Each object update affects several other objects.
– All objects are updated 30+ times per second.
• Problem:
– How do we handle simultaneous updates to the
same memory location?
Manual Lock-based Synchronization
pthread_mutex_lock(mutex);
mutate_variable();
pthread_mutex_unlock(mutex);
• Locks and condition variables ⇒ fundamentally flawed!
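For reference, a complete, runnable version of this pattern; the shared counter and all names are illustrative, not from the slides (build with cc -pthread):

#include <pthread.h>
#include <stdio.h>

static int counter = 0;                 /* shared mutable state */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock(&mutex);         /* enter critical section */
    counter++;                          /* the mutate_variable() step */
    pthread_mutex_unlock(&mutex);       /* leave critical section */
  }
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, worker, NULL);
  pthread_create(&t2, NULL, worker, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %d\n", counter);    /* 200000 with the lock;
                                           unpredictable without it */
  return 0;
}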
Bank Accounts
[Figure: threads #1 through #n each issue transfer requests against accounts A and B held in shared memory. (Beautiful Concurrency, Peyton Jones, 2007)]
• Invariant: atomicity
– no thread observes a state in which the money has left
one account, but has not arrived in the other.
Bank Accounts using Locks
• In an object-oriented language:
class Account {
  Int balance;
  synchronized void deposit (Int n) {
    balance = balance + n;
  }
}
• Code for transfer:
void transfer (Account from, Account to, Int amount) {
  from.withdraw (amount);
  // an intermediate state is visible here!
  to.deposit (amount);
}
A Quick Fix: Explicit Locking
void transfer (Account from, Account to, Int amount) {
  from.lock(); to.lock();
  from.withdraw (amount);
  to.deposit (amount);
  from.unlock(); to.unlock();
}
• Now the program is prone to deadlock: if one thread runs transfer(a, b, ...) while another runs transfer(b, a, ...), each can take its first lock and then wait forever for the other's (a runnable sketch follows).
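A sketch of that deadlock in C; the Account struct and all names are illustrative, not from the slides (build with cc -pthread):

#include <pthread.h>
#include <stdio.h>

typedef struct {
  int balance;
  pthread_mutex_t lock;
} Account;

static Account a = { 100, PTHREAD_MUTEX_INITIALIZER };
static Account b = { 100, PTHREAD_MUTEX_INITIALIZER };

/* Mirrors the slide: lock 'from' first, then 'to'. */
static void transfer(Account *from, Account *to, int amount) {
  pthread_mutex_lock(&from->lock);
  pthread_mutex_lock(&to->lock);    /* may block forever (see below) */
  from->balance -= amount;
  to->balance += amount;
  pthread_mutex_unlock(&to->lock);
  pthread_mutex_unlock(&from->lock);
}

static void *t1(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; i++) transfer(&a, &b, 1);
  return NULL;
}

static void *t2(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; i++) transfer(&b, &a, 1);
  return NULL;
}

int main(void) {
  pthread_t x, y;
  pthread_create(&x, NULL, t1, NULL);
  pthread_create(&y, NULL, t2, NULL);
  /* Sooner or later t1 holds a.lock and waits for b.lock while
     t2 holds b.lock and waits for a.lock: neither can proceed. */
  pthread_join(x, NULL);
  pthread_join(y, NULL);
  printf("finished: %d %d\n", a.balance, b.balance);
  return 0;
}

A standard workaround is to acquire locks in a fixed global order (for example, by address), but that again defeats modularity: the caller must know every lock its callees take.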
Locks are Bad
• Taking too few locks ⇒ simultaneous update
• Taking too many locks ⇒ no concurrency or deadlock
• Taking the wrong locks ⇒ error-prone programming
• Taking locks in the wrong order ⇒ error-prone programming
• ...
• Fundamental problem: no modular programming
– Correct implementations of withdraw and deposit do not
give a correct implementation of transfer.
Transactional Memory
• An alternative to lock-based synchronization
– eliminates many problems associated with lock-based synchronization
• no deadlock
• read sharing
• safe modular programming
• Hot research area
– hardware transactional memory
– software transactional memory
• C, Java, functional languages, ...
Transactions in Haskell
transfer :: Account -> Account -> Int -> IO ()
-- transfer 'amount' from account 'from' to account 'to'
transfer from to amount =
  atomically (do { deposit to amount
                 ; withdraw from amount })
• atomically act
– atomicity:
• the effects become visible to other threads all at once.
– isolation:
• the action act does not see any effects from other
threads.
Conclusion:
We need parallelism!
Tim Sweeney's POPL '06 Invited Talk
- Last Slide
CSE-700 Parallel Programming
Fall 2007
CSE-700 in a Nutshell
• Scope
– Parallel computing from the viewpoint of programmers
and language designers
– We will not talk about hardware for parallel computing
• Audience
– Anyone interested in learning parallel programming
• Prerequisite
– C programming
– Desire to learn new programming languages
Material
• Books
– Introduction to Parallel Computing (2nd).
Ananth Grama et al.
– Parallel Programming with MPI. Peter Pacheco.
– Parallel Programming in OpenMP. Rohit Chandra
et al.
• Any textbook on MPI and OpenMP is fine.
• Papers
Teaching Staff
• Instructors
– Gla
– Myson
– ...
– and YOU!
• We will lead this course TOGETHER.
Resources
• Plquad
– a quad-core Linux machine
– OpenMP and MPI already installed
• Ask for an account if you need one.
Basic Plan - First Half
• Goal
– learn the basics of parallel programming through
5+ assignments on OpenMP and MPI (a minimal MPI example appears after this list)
• Each lecture consists of:
– discussion on the previous assignment
• Each of you is expected to give a presentation.
– presentation on OpenMP and MPI by the
instructors
– discussion on the next assignment
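For orientation, the canonical first MPI program in C; all calls are standard MPI (build with mpicc, run with mpirun):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);                /* start the MPI runtime */
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
  MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
  printf("hello from process %d of %d\n", rank, size);
  MPI_Finalize();                        /* shut the runtime down */
  return 0;
}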
Basic Plan - Second Half
• Recent parallel languages
– learn a recent parallel language
– write a cool program in your parallel language
– give a presentation on your experience
• Topics in parallel language research
– choose a topic
– give a presentation on it
What Matters Most?
• Spirit of adventure
• Proactivity
• Desire to provoke Happy Chaos
– I want you to develop this course into a total,
complete, yet happy chaos.
– A truly inspirational course borders almost on
chaos.
Impact of Memory and Cache
on Performance
Impact of Memory Bandwidth [1]
Consider the following code fragment:
for (i = 0; i < 1000; i++) {
  column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
    column_sum[i] += b[j][i];
}
The code fragment sums columns of the matrix b into
a vector column_sum.
Impact of Memory Bandwidth [2]
• The vector column_sum is small and easily fits into the cache.
• The matrix b is accessed in column order.
• The strided access results in very poor performance.
[Figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.]
Impact of Memory Bandwidth [3]
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
  column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
  for (i = 0; i < 1000; i++)
    column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order, and
performance can be expected to be significantly better.
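The difference is easy to measure. Below is a small C harness, a sketch rather than a rigorous benchmark (the 1000 × 1000 size matches the fragments above; compile with mild optimization, e.g. -O1, so the loops are not restructured):

#include <stdio.h>
#include <time.h>

#define N 1000

static double b[N][N], column_sum[N];

int main(void) {
  clock_t t;

  /* Column-order traversal: strided access to b (the first fragment). */
  t = clock();
  for (int i = 0; i < N; i++) {
    column_sum[i] = 0.0;
    for (int j = 0; j < N; j++)
      column_sum[i] += b[j][i];
  }
  printf("column order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

  /* Row-order traversal: contiguous access to b (the fixed version). */
  t = clock();
  for (int i = 0; i < N; i++)
    column_sum[i] = 0.0;
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      column_sum[i] += b[j][i];
  printf("row order:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

  return 0;
}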
Lesson
• Memory layout and appropriately organized computation
can have a significant impact on spatial and temporal
locality.
Assignment 1
Cache & Matrix Multiplication
Typical Sequential Implementation
• A : n × n
• B : n × n
• C = A * B : n × n

for i = 1 to n
  for j = 1 to n
    C[i, j] = 0;
    for k = 1 to n
      C[i, j] += A[i, k] * B[k, j];
Using Submatrices
• Improves data locality significantly (see the sketch below).
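The slides do not show the code; here is a minimal sketch of the idea in C. The block size S, the flat row-major layout, and the assumption that n is divisible by S are all illustrative choices:

#define S 32  /* submatrix (block) size; tune to the cache */

/* C = A * B for n x n row-major matrices; n assumed divisible by S. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n * n; i++)
    C[i] = 0.0;
  /* Iterate over S x S blocks; each block of A, B, and C stays in
     cache while it is reused, unlike in the plain triple loop. */
  for (int ii = 0; ii < n; ii += S)
    for (int kk = 0; kk < n; kk += S)
      for (int jj = 0; jj < n; jj += S)
        for (int i = ii; i < ii + S; i++)
          for (int k = kk; k < kk + S; k++) {
            double aik = A[i * n + k];
            for (int j = jj; j < jj + S; j++)
              C[i * n + j] += aik * B[k * n + j];
          }
}

Each S × S tile of C is computed from tiles of A and B that stay resident in cache while they are reused, so each loaded cache line is used many times instead of once.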
Experimental Results
Assignment 1
• Machine
– the older, the better.
– Myson offers his ancient notebook for you.
• Pentium II, 600MHz
• no L1 cache
• 64KB L2 cache
• running Linux
• Prepare a presentation on your experimental results.