
CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory Processors and Programming
Jim Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
Recap of Last Lecture
° There are several standard programming models (plus
variations) that were developed to support particular kinds of
architectures
• shared memory
• message passing
• data parallel
° The programming models are no longer strictly tied to
particular architectures, and so offer portability of correctness
° Portability of performance still depends on tuning for each
architecture
° In each model, parallel programming has 4 phases
• decomposition into parallel tasks
• assignment of tasks to threads
• orchestration of communication and synchronization among threads
• mapping threads to processors
Outline
° Performance modeling and tradeoffs
° Shared memory architectures
° Shared memory programming
Cost Modeling and Performance Tradeoffs
Example
° s = f(A[1]) + … + f(A[n])
° Decomposition
• computing each f(A[j])
• n-fold parallelism, where n may be >> p
• computing sum s
° Assignment
• thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
• thread 1 sums s = s1+ … + sp
-
for simplicity of this example, will be improved
• thread 1 communicates s to other threads
° Orchestration
• starting up threads
• communicating, synchronizing with thread 1
° Mapping
• processor j runs thread j
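A minimal sketch of this decomposition using POSIX threads (the later slides use the Solaris thread library; pthreads is assumed here). The names f, A, n, and p follow the slide; everything else is illustrative.

  #include <pthread.h>
  #include <stdio.h>

  #define N 1000                           /* the slide's n            */
  #define P 4                              /* the slide's p            */

  double A[N];                             /* input array              */
  double partial[P];                       /* s_k computed by thread k */
  double f(double x) { return x * x; }     /* stand-in for the slide's f */

  /* Thread k evaluates f on its contiguous block and stores s_k. */
  void *work(void *arg) {
      long k = (long)arg;
      long lo = k * N / P, hi = (k + 1) * N / P;
      double s = 0.0;
      for (long i = lo; i < hi; i++)
          s += f(A[i]);
      partial[k] = s;
      return NULL;
  }

  int main(void) {
      pthread_t tid[P];
      for (long i = 0; i < N; i++) A[i] = i;

      /* Decomposition + assignment: one thread per block of A. */
      for (long k = 0; k < P; k++)
          pthread_create(&tid[k], NULL, work, (void *)k);

      /* Orchestration: one thread (here, main) waits, then sums s_1 .. s_p. */
      double s = 0.0;
      for (long k = 0; k < P; k++) {
          pthread_join(tid[k], NULL);
          s += partial[k];
      }
      printf("s = %g\n", s);
      return 0;
  }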
Identifying enough Concurrency
° Parallelism profile for the simple decomposition: f(A[i]) is the parallel task
[Figure: concurrency vs. time; the area under the curve is the total work done: n x time(f) of parallel work, then 1 x time(sum(n)) for the sequential sum.]
° Amdahl’s law bounds speedup
• let s = the fraction of total work done sequentially
° After mapping onto p processors, the profile flattens to concurrency p: p x n/p x time(f) of parallel work, followed by the sequential sum
° Speedup(P) <= 1 / ( s + (1-s)/P )
[Figure: speedup vs. number of processors (0 to 100) for S = 0%, 1%, 5%, and 10%; the larger the sequential fraction, the sooner the curve flattens.]
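For example, with s = 5% and P = 100 processors the bound is 1 / (0.05 + 0.95/100) ≈ 16.8, and no number of processors can push it past 1/s = 20; that is why the S=5% curve flattens out well below 20 in the plot.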
Algorithmic Trade-offs
° Parallelize partial sum of the f’s
[Concurrency profile: p x n/p x time(f) of parallel evaluation, p x time(sum(n/p)) of parallel partial summation, then 1 x time(sum(p)) of sequential combining.]
• what fraction of the computation is “sequential”?
• what does this do for communication? locality?
• what if you sum what you “own”?
Problem Size is Critical
° Total work = n + P
° Serial work: P
° Parallel work: n
° s = serial fraction = P / (n + P)
° Speedup(P) = n / (n/P + P)
° Speedup decreases for large P if n is small
[Figure: Amdahl’s law bounds; speedup vs. processors (0 to 100) for n = 1000, 10000, and 1000000.]
In general seek to exploit a fraction of the peak parallelism
in the problem.
Algorithmic Trade-offs
° Parallelize the final summation (tree sum)
[Concurrency profile: p x n/p x time(f) of parallel evaluation, p x time(sum(n/p)) of parallel partial summation, then log_2 p x time(sum(2)) for the tree combine.]
• Generalize Amdahl’s law for an arbitrary “ideal” parallelism profile
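A sketch of the log_2 p tree combine, assuming each thread k has already stored its partial sum in partial[k] (as in the earlier sketch). The barrier() routine is assumed to exist; one way to build it appears later, in the synchronization section.

  /* Assumed to exist: a global barrier across the p threads. */
  extern void barrier(void);

  /* Tree combine: after log2(p) rounds, partial[0] holds the total. */
  void tree_sum(long k, long p, double partial[]) {
      for (long d = 1; d < p; d *= 2) {
          barrier();                       /* previous round's sums are done */
          if (k % (2 * d) == 0 && k + d < p)
              partial[k] += partial[k + d];
      }
      barrier();                           /* partial[0] now holds the total */
  }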
Shared Memory Architectures
Recap Basic Shared Memory Architecture
° Processors all connected to a large shared memory
° Local caches for each processor
° Cost: much cheaper to access data in cache than in main memory
[Figure: processors P1, P2, …, Pn, each with its own cache ($), connected by a network to a shared memory.]
° Simplest to program, but hard to build with many processors
° Now take a closer look at structure, costs, limits
Limits of using Bus as Network
° Assume a 100 MB/s bus
[Figure: processors (PROC) with caches connected by the bus to memory (MEM) and I/O; each processor generates about 260 MB/s of traffic without a cache and about 16 MB/s through its cache.]
° 50 MIPS processor w/o cache
  => 200 MB/s inst BW per processor
  => 60 MB/s data BW at 30% load-store
° Suppose 98% inst hit rate and 95% data hit rate (16 byte block)
  => 4 MB/s inst BW per processor
  => 12 MB/s data BW per processor
  => 16 MB/s combined BW
  => 8 processors will saturate the bus
° Cache provides a bandwidth filter
  - as well as reducing average access time
Cache Coherence: The Semantic Problem
° p1 and p2 both have cached copies of x (as 0)
° p1 writes x=1 and then the flag, f=1, as a signal to other processors that it has
updated x
• writing f pulls it into p1’s cache
• both of these writes “write through” to memory
° p2 reads f (bringing it into p2’s cache) to see if it is 1, which it is
° p2 therefore reads x, expecting the value written by p1, but gets the “stale”
cached copy
[Figure: after the writes, p1’s cache holds x=1 and f=1, memory holds x=1 and f=1 (write-through), but p2’s cache still holds the stale x=0 alongside the newly fetched f=1.]
° SMPs have complicated caches to enforce coherence
Programming SMPs
° Coherent view of shared memory
° All addresses equidistant
• don’t worry about data partitioning
° Caches automatically replicate shared data close to processor
° If program concentrates on a block of the data set that no one
else updates => very fast
° Communication occurs only on cache misses
• cache misses are slow
° Processor cannot distinguish communication misses from
regular cache misses
° Cache block may introduce unnecessary communication
• two distinct variables in the same cache block
• false sharing
Where are things going
° High-end
• collections of almost complete workstations/SMP on high-speed
network (Millennium)
• with specialized communication assist integrated with memory
system to provide global access to shared data
° Mid-end
• almost all servers are bus-based CC SMPs
• high-end servers are replacing the bus with a network
- Sun Enterprise 10000, IBM J90, HP/Convex SPP
• volume approach is Pentium pro quadpack + SCI ring
- Sequent, Data General
° Low-end
• SMP desktop is here
° Major change ahead
• SMP on a chip as a building block
Programming Shared Memory Machines
° Creating parallelism in shared memory models
° Synchronization
° Building shared data structures
° Performance programming (throughout)
Programming with Threads
° Several Threads Libraries
° PTHREADS is the POSIX standard
• Solaris threads are very similar
• Relatively low level
• Portable but possibly slow
° P4 (Parmacs) is a widely used portable package
• Higher level than Pthreads
° OpenMP is a new proposed standard
• Support for scientific programming on shared memory
• Currently a Fortran interface
• Initiated by SGI; Sun is not currently supporting it
Creating Parallelism
Language Notions of Thread Creation
° cobegin/coend
cobegin
  job1(a1);
  job2(a2);
coend
• Statements in the block may run in parallel
• cobegins may be nested
• Scoped, so you cannot have a missing coend
° fork/join
tid1 = fork(job1, a1);
job2(a2);
join tid1;
• Forked function runs in parallel with the current one
• join waits for completion (may be in a different function)
° cobegin cleaner, but fork is more general
Forking Threads in Solaris
Signature:
int thr_create(void *stack_base, size_t stack_size,
void *(* start_func)(void *),
void *arg, long flags, thread_t *new_tid)
Example:
thr_create(NULL, NULL, start_func, arg, NULL, &tid)
° start_func defines the thread body
° start_func takes one argument of type void* and returns void*
° an argument can be passed as arg
• j-th thread gets arg=j so it knows who it is
° stack_base and stack_size give the stack
• standard default values
° flags controls various attributes
• standard default values for now
° new_tid thread id (for thread creator to identify threads)
Example Using Solaris Threads
main()
{
  thread_ptr = (thrinfo_t *) malloc(NTHREADS * sizeof(thrinfo_t));
  thread_ptr[0].chunk = 0;
  thread_ptr[0].tid = myID;
  for (i = 1; i < NTHREADS; i++) {
    thread_ptr[i].chunk = i;
    if (thr_create(0, 0, worker, (void*)&thread_ptr[i].chunk,
                   0, &thread_ptr[i].tid)) {
      perror("thr_create");
      exit(1);
    }
  }
  worker(0);                     /* the creator does its own share of the work */
  for (i = 1; i < NTHREADS; ++i)
    thr_join(thread_ptr[i].tid, NULL, NULL);
}
Synchronization
Basic Types of Synchronization: Barrier
Barrier -- global synchronization
• fork multiple copies of the same function “work”
- SPMD “Single Program Multiple Data”
• simple use of barriers -- all threads hit the same one
work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;
• more complicated -- barriers on branches
if (tid % 2 == 0) {
work1();
barrier
} else { barrier }
• or in loops -- all threads need to execute an equal number of barriers
• barriers are not provided in many thread libraries
  - need to build them (see the sketch below)
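Since many thread libraries of this era lack a barrier, here is one way to build a reusable one from a pthread mutex and condition variable. This is a sketch, not the course's own implementation; the names are chosen to avoid colliding with the barrier_init/barrier_wait calls used in the Solaris example later.

  #include <pthread.h>

  typedef struct {
      pthread_mutex_t lock;
      pthread_cond_t  all_here;
      int count;        /* threads that have arrived this generation */
      int nthreads;     /* total number of participating threads     */
      int generation;   /* lets the barrier be reused                */
  } barrier_s;

  void barrier_s_init(barrier_s *b, int nthreads) {
      pthread_mutex_init(&b->lock, NULL);
      pthread_cond_init(&b->all_here, NULL);
      b->count = 0;
      b->nthreads = nthreads;
      b->generation = 0;
  }

  void barrier_s_wait(barrier_s *b) {
      pthread_mutex_lock(&b->lock);
      int gen = b->generation;
      if (++b->count == b->nthreads) {      /* last one in: release everyone  */
          b->count = 0;
          b->generation++;
          pthread_cond_broadcast(&b->all_here);
      } else {
          while (gen == b->generation)      /* guard against spurious wakeups */
              pthread_cond_wait(&b->all_here, &b->lock);
      }
      pthread_mutex_unlock(&b->lock);
  }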
Basic Types of Synchronization: Mutexes
Mutexes -- mutual exclusion aka locks
• threads are working mostly independently
• need to access common data structure
lock *l = alloc_and_init();    /* shared */
acquire(l);
  … access data …
release(l);
• Java and other languages have lexically scoped synchronization
- similar to cobegin/coend vs. fork and join
• Semaphores give guarantees on “fairness” in getting the lock, but
the same idea of mutual exclusion
• Locks only affect processors using them:
- pair-wise synchronization
Basic Types of Synchronization: Post/Wait
Post/Wait -- producer consumer synchronization
lock *l = alloc_and_init();    /* shared */

P1:                            P2:
  big_data = new_value;          wait(l);
  post(l);                       use big_data;

• post/wait is not as common a term (a condition-variable sketch appears below)
• could be done with a generalization of locks to reader/writer locks
• sometimes done with a barrier, if there is global agreement
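Post/wait can be built from a flag protected by a pthread mutex and condition variable. A sketch follows; post mirrors the slide's name, while the waiting side is called wait_for here only to avoid the POSIX wait() call, and everything else is illustrative.

  #include <pthread.h>

  typedef struct {
      pthread_mutex_t lock;
      pthread_cond_t  posted_cv;
      int posted;
  } event_t;

  event_t ev = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

  /* P1: make big_data visible, then post the event. */
  void post(event_t *e) {
      pthread_mutex_lock(&e->lock);
      e->posted = 1;
      pthread_cond_broadcast(&e->posted_cv);
      pthread_mutex_unlock(&e->lock);
  }

  /* P2: block until the producer has posted. */
  void wait_for(event_t *e) {
      pthread_mutex_lock(&e->lock);
      while (!e->posted)
          pthread_cond_wait(&e->posted_cv, &e->lock);
      pthread_mutex_unlock(&e->lock);
  }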
Synchronization at Different Levels
° Can build it yourself out of flags
• while (!flag) {};
° Lock/Unlock primitives build in the waiting
• typically well tested
• system friendly
• sometimes optimized for the machine
• sometimes higher overhead than building your own
° Most systems provide higher level synchronization
primitives
• barrier - global synchronization
• semaphores
• monitors
Solaris Threads Example
mutex_t mul_lock;
barrier_t ba;
int sum;

main()
{
  sync_type = USYNC_PROCESS;
  mutex_init(&mul_lock, sync_type, NULL);
  barrier_init(&ba, NTHREADS, sync_type, NULL);
  /* … spawn all the threads as above … */
}

worker (int me)
{
  int x = all_do_work(me);
  barrier_wait(&ba);
  mutex_lock(&mul_lock);
  sum += x;
  mutex_unlock(&mul_lock);
}
Producer-Consumer Synchronization
° A very common pattern in parallel programs
[Figure: a producer writes x = 100; a consumer reads it.]
• special case is a write-once variable
° Motivated language constructs that combine
parallelism and synchronization
• future as in Multilisp
- next_job (future (job1 (x1)), future (job2(x2)))
- job1 and job2 will run in parallel, next_job will run until args
are needed, e.g., arg1+arg2 inside next_job
- implemented using presence bits (hardware?)
• promise:
- like future, but need to explicitly ask for a promise
- T and promise[T] are different types
- more efficient, but requires more programmer control
Rolling Your Own Synchronization
° Natural to use a variable for producer/consumer
P1:                            P2:
  big_data = new_value;          while (flag != 1) ;
  flag = 1;                      … = … big_data …
° This works as long as your compiler and machine
implement “sequential consistency” [Lamport]
• The parallel execution must behave as if it were an interleaving of
the sequences of memory operations from each of the
processors.
There must exist some:
• serial (total) order
• consistent with the partial order (the union of each processor’s program order forms a partial order)
• that is a correct sequential execution
[Figure: reads and writes (wx, rx, rz, wy) from two processors, their partial order, and a consistent serial order.]
But Machines aren’t Always Sequentially Consistent
° hardware does out-of-order execution
° hardware issues multiple writes
• placed into (mostly FIFO) write buffer
• second write to same location can be merged into earlier
° hardware reorders reads
• first misses in cache, second does not
° compiler puts something in a register, eliminating
some reads and writes
° compiler reorders reads or writes
° writes are going to physically remote locations
• first write is further away than second
Programming Solutions
° At the compiler level, the best defense is:
• declaring variables as volatile
° At the hardware level, there are instructions:
• memory barriers (bad term) or fences that force serialization (see the sketch below)
• only serialize operations executed by one processor
• different flavors, depending on the machine
° Sequential consistency is sufficient but not
necessary for hand-made synchronization
• many machines only reorder reads with respect to writes
• the flag example breaks only if read/read pairs reordered or
write/write pairs reordered
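A sketch of the flag idiom made safe on a machine that reorders memory operations. It uses C11 atomics (which postdate this lecture) purely to show where the fences go; big_data and flag are the variables from the earlier example, and the values are illustrative.

  #include <stdatomic.h>

  int big_data;                  /* ordinary shared data   */
  atomic_int flag = 0;           /* synchronization flag   */

  /* Producer (P1): write the data, then release-store the flag so the
   * data write cannot be reordered after it. */
  void producer(void) {
      big_data = 42;
      atomic_store_explicit(&flag, 1, memory_order_release);
  }

  /* Consumer (P2): acquire-load the flag; once it is 1, the data write
   * is guaranteed to be visible. */
  int consumer(void) {
      while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
          ;                      /* spin */
      return big_data;
  }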
Foundation Behind Sequential Consistency
° All violations of sequential consistency are figure 8s
P1:                            P2:
  big_data = new_value;          while (flag != 1) ;
  flag = 1;                      … = … big_data …
° Generalized to multiple processors (can be longer)
° Intuition: Time cannot move backwards
° All violations appear as “simple cycles” that visit
each processor at most twice [Shasha&Snir]
Building Shared Data Structures
Shared Address Allocation
° Some systems provide a special form of malloc/free
• p4_shmalloc, p4_shfree
• p4_malloc, p4_free are just basic malloc/free
- sharing unspecified
° Solaris threads
• malloc’d and static variables are shared
Building Parallel Data Structures
° Since data is shared, this is easy
° Some awkwardness comes in setup
• only 1 argument is passed to each thread
• need to package data (pointers to it) into a structure and pass that (see the sketch below)
• otherwise everything is global (hard to read for the usual reasons)
° Depends on type of data structure
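A sketch of the usual packaging trick: bundle everything a thread needs into one struct and pass its address as the single void* argument. Pthreads is assumed, and all names here are illustrative.

  #include <pthread.h>
  #include <stdlib.h>

  /* Everything a worker needs, bundled to fit in one void* argument. */
  typedef struct {
      int     id;         /* which thread am I            */
      int     nthreads;   /* how many threads in total    */
      double *data;       /* pointer to the shared array  */
      int     n;          /* its length                   */
  } work_args;

  void *worker(void *p) {
      work_args *a = (work_args *)p;
      /* each thread touches its own block of the shared array */
      for (int i = a->id * a->n / a->nthreads;
               i < (a->id + 1) * a->n / a->nthreads; i++)
          a->data[i] = 2 * a->data[i];
      return NULL;
  }

  void spawn(double *data, int n, int nthreads) {
      pthread_t *tid  = malloc(nthreads * sizeof(pthread_t));
      work_args *args = malloc(nthreads * sizeof(work_args));
      for (int k = 0; k < nthreads; k++) {
          args[k] = (work_args){ k, nthreads, data, n };
          pthread_create(&tid[k], NULL, worker, &args[k]);
      }
      for (int k = 0; k < nthreads; k++)
          pthread_join(tid[k], NULL);
      free(tid); free(args);
  }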
Data Structures
° For data structures that are static (regular in time)
• typically partition logically, although not physically
• need to divide work evenly, often done by dividing data
• use the “owner computes” rule
• true for irregular (unstructured) data like meshes too
• usually barriers are sufficient synchronization
° For Dynamic data structures
• need locking or other synchronization
° Example: tree-based particle computation
• parts of this computation are naturally partitioned
- each processor/thread has a set of particles
- each works on updating a part of the tree
• during a tree walk, need to look at other parts of the tree
- locking used (on nodes) to prevent simultaneous updates
Summary
Uniform Shared Address Space
° Programmer’s view is still processor-centric
° Specifies what each processor/thread does
° Global view implicit in pattern of data sharing
[Figure: a single shared data structure visible to all threads, plus per-thread local/stack data.]
Segmented Shared Address Space
° Programmer has local and global view
° Specifies what each processor/thread does
° Global data, operation, and synchronization
[Figure: the shared data structure divided into per-processor segments, plus per-thread local/stack data.]
Work vs. Data Assignment
for (I = MyProc; I < n; I += PROCS) {
  A[I] = f(I);
}
° Assignment of work is easier in a global address
space
° It is faster if it corresponds to the data placement (a blocked variant is sketched below)
° Hardware replication moves data to where it is
accessed
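The loop above assigns iterations to processors cyclically. A blocked assignment, sketched below under the same assumptions (A, f, n, MyProc, and PROCS are the slide's names), gives each processor one contiguous chunk, which often matches data placement better because consecutive elements share cache lines and pages.

  /* Blocked assignment: each processor gets one contiguous chunk of A. */
  int chunk = (n + PROCS - 1) / PROCS;
  for (int I = MyProc * chunk; I < (MyProc + 1) * chunk && I < n; I++) {
      A[I] = f(I);
  }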