Shared Memory Programming - Southern Oregon University

Multiprocessors and Multi-computers
• Multi-computers
– Distributed address space accessible by local processors
– Requires message passing
– Programming tends to be more difficult
• Multiprocessors
– Single address space accessible by all processors
– Simultaneous access to shared variables can produce
inconsistent results
– Generally programming is more convenient
– Doesn’t scale to more than about sixteen processors
Shared Memory Hardware
[Figure: processors connected to memory modules by a shared bus (bus configuration) and by a crossbar switch (crossbar switch configuration)]
Shared Memory Access
• Critical Section
– A section of code that needs to be protected from simultaneous access
• Mutual Exclusion
– The mechanism used to enforce a critical section
– Locks
– Semaphores
– Monitors
– Condition Variables
[Figure: Process 1 and Process 2 simultaneously assign a shared variable x (=1 and =2), illustrating why access must be protected]
Locks
Locks are the simplest mutual exclusion mechanism
• Single bit variable
– 1=locked, 0=unlocked
– “Enter door and lock the door at entry”
• Spin locks (busy wait locks)
– while (lock == 1) ;   // spin (busy wait)
lock = 1;
// Critical section
lock = 0;
• Advantages
– Simple and easy to understand
• Disadvantages
– Poor use of the CPU if process does not block while waiting
– It’s easy to skip the lock=0 statement
• Examples: Pthreads, openMP
Note: The while and lock setting must be atomic
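To make the atomicity note concrete, here is a minimal spin-lock sketch (not from the slides) using the C11 <stdatomic.h> test-and-set primitive; the acquire/release function names are illustrative.

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;      /* clear = unlocked */

void acquire(void)
{
    /* atomic_flag_test_and_set reads the old value and sets the flag in one
       atomic step, so the "while" test and "lock = 1" cannot be separated */
    while (atomic_flag_test_and_set(&lock_flag))
        ;                                             /* spin (busy wait) */
}

void release(void)
{
    atomic_flag_clear(&lock_flag);                    /* lock = 0 */
}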
Semaphores
• Allows limited concurrent access to a critical section
• An integer variable, s, controls the mechanism
• Operations
– P operation (passeren, Dutch for "to pass"):
s--;
while (s<0) wait();
– V operation (vrijgeven, Dutch for "to release"):
s++;
if (s<=0) unblock a waiting process;
– Usage:
p(s);
// Critical section
v(s);
• Notes
– Set s=1 initially for s to
be a binary semaphore
which acts like a lock.
– Set s=k>1 initially if k
simultaneous entries are
possible
– Set s=k<=0 for consumer
processes waiting to
consume data produced.
• Disadvantage
– Easy to skip the v
operation
• Example: UNIX
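For reference, a minimal sketch of the P/V pattern using the POSIX semaphore routines (an assumed example; the worker function name is illustrative):

#include <semaphore.h>

sem_t s;

void worker(void)
{
    sem_wait(&s);           /* P operation: decrement s, block if none available */
    /* Critical section code */
    sem_post(&s);           /* V operation: increment s, wake a waiting process */
}

int main(void)
{
    sem_init(&s, 0, 1);     /* s = 1: binary semaphore acting as a lock;
                               use k > 1 to allow k simultaneous entries */
    worker();
    sem_destroy(&s);
    return 0;
}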
Monitors
• A class mechanism that limits access to a shared resource
public class doIt
{ public doIt() { /* Constructor logic */ }
  public synchronized void critMethod() throws InterruptedException
  { wait();          // Wait until another thread signals
    // Critical section code
    notify();
  }
}
• Advantage: The most natural mutual exclusion mechanism
• Disadvantage: Requires a language that supports the construct
• Examples: Java, Ada, Modula-2
Condition Variables
Mechanism to guarantee a global condition before critical section entry
• Advantages:
– Reduce overhead
associated with checking
global conditions
– Avoids having to frequently
“poll” the global variable
• Disadvantage: It's easy to skip the unlock operation
• Example: Pthreads
• Notes:
– wait() unlocks the mutex while waiting and re-locks it automatically when it returns
– A thread must already be waiting on the condition variable when the signal is issued, or the signal is lost
Example
• Thread 1
lock(mutex);
while (c != VALUE)
wait(cVar, mutex);
// Critical section
unlock(mutex);
• Thread 2
lock(mutex);
if (c == VALUE)
signal(cVar);
unlock(mutex);
Deadlock
A deadlock occurs when resources needed by a process can never be allocated because they are held by other blocked processes
[Figure: processes P1 … Pn and resources R1 … Rn arranged in a circular wait]
• Necessary Conditions
– Circular Wait
– Limited Resource
– Non-preemptive
– Hold and Wait
[Figure: deadly embrace, a two-process deadlock in which P1 holds R1 and requests R2 while P2 holds R2 and requests R1]
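A minimal Pthreads sketch (not from the slides; thread and mutex names are illustrative) of how the deadly embrace arises:

#include <pthread.h>

pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;

void *p1(void *arg)                  /* holds R1, then requests R2 */
{
    pthread_mutex_lock(&r1);
    pthread_mutex_lock(&r2);         /* blocks forever if p2 already holds r2 */
    pthread_mutex_unlock(&r2);
    pthread_mutex_unlock(&r1);
    return NULL;
}

void *p2(void *arg)                  /* holds R2, then requests R1 */
{
    pthread_mutex_lock(&r2);
    pthread_mutex_lock(&r1);         /* circular wait: the deadly embrace */
    pthread_mutex_unlock(&r1);
    pthread_mutex_unlock(&r2);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);          /* may never return once the deadlock occurs */
    pthread_join(t2, NULL);
    return 0;
}

Acquiring the two locks in the same order in both threads removes the circular-wait condition.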
Cache Coherence
• Cache Coherence Protocol
– Update: All caches
immediately updated with
altered data
– Invalidate: Altered data is
invalidated in all caches.
Updates take place only if
subsequently referenced.
• False Sharing: Cache
updates take place because
multiple processes access the
same cache block but not the
same locations
[Figure: Processor 1 and Processor 2 each cache the same block from memory containing x and y]
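A small Pthreads sketch (an assumed example, not from the slides) of false sharing: each thread updates its own variable, but because x and y sit in the same cache block the block bounces between the two processors' caches.

#include <stdio.h>
#include <pthread.h>

struct { int x; int y; } shared;          /* x and y are adjacent: likely one cache block */

void *inc_x(void *arg)
{
    for (long i = 0; i < 10000000; i++)
        shared.x++;                       /* each write invalidates the other cache's copy */
    return NULL;
}

void *inc_y(void *arg)
{
    for (long i = 0; i < 10000000; i++)
        shared.y++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc_x, NULL);
    pthread_create(&t2, NULL, inc_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d y=%d\n", shared.x, shared.y);
    return 0;
}

Padding or aligning x and y so they fall in different cache blocks removes the false sharing.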
Multiprocessor Cache Considerations
• Cache coherence in multiprocessor systems
– Temporal locality is reduced
• Smart compiler optimizations require knowledge of cache to
alter data layout
• Optimizations might require special programming
• Multiple caches exist instead of one
• Shared memory distributed systems
– NUMA (SGI Origin), COMA (KSR)
– Hardware or software cache coherence algorithms
• Performance of software coherence can be inadequate
Shared Memory Programming
Alternatives
• Heavyweight processes
• Thread programming standard
– Pthreads
– Java Threads
• Programming language designed for parallel processing (Ada)
• Library routines used with an existing language
• Modified syntax of an existing language (High Performance Fortran)
• Compiler directives to specify parallel programs (OpenMP)
Heavyweight Processes
• Operating system switches context from one
process to another
• Advantage: A blocked process doesn’t block the
other processes
• Not popular because overhead is significant
– Separate process control blocks
[Figure: fork-join diagram; each FORK spawns a new process and each JOIN waits for a spawned process to complete]
UNIX
pid = fork();
if (pid==0)
{ //slave code }
else
{ //parent code }
if (pid==0)exit(0);
else
wait(0);
Using Heavyweight Processes
• Create new processes
pid= fork();
• Wait for any process to complete
pid=wait(&status);
• Wait for specific process to complete
pid = waitpid(pid, &status, options);   // options = 0 blocks until the process completes
• Exit process
exit(status);
• Attaching shared memory
id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
addr=shmat(id, NULL, 0);
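A minimal sketch (an assumed illustration, not the textbook's program) that ties these calls together: the parent creates and attaches a shared segment, forks a child that writes into it, and waits for the child to complete.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/shm.h>

int main(void)
{
    int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);  /* create segment */
    int *x = (int *) shmat(id, NULL, 0);                          /* attach it */

    pid_t pid = fork();
    if (pid == 0) {                    /* child (slave) code: write the shared variable */
        *x = 42;
        exit(0);
    }
    wait(NULL);                        /* parent waits for the child to complete */
    printf("shared x = %d\n", *x);

    shmdt(x);                          /* detach and remove the segment */
    shmctl(id, IPC_RMID, NULL);
    return 0;
}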
Processes and Threads
[Figure: a single-thread process (one IP and stack with its own code, heap, listeners, and resources) compared with a dual-thread process in which the threads share code, heap, listeners, and resources but each has its own IP and stack]
Notes:
• Threads can be three orders of magnitude faster than processes
• Thread-safe library routines can be used by multiple concurrent threads
• Synchronization uses shared variables
Pthreads – Create and Join
IEEE Portable Operating System Interface Standard (POSIX)
• Spawn an attached thread
pthread_create(&thread1, NULL, proc1, &arg);
.
.
.
pthread_join(thread1, &status);
• Thread execution
void *proc1(void *arg)
{
// Thread code
return NULL;   // or a pointer collected through pthread_join's status argument
}
• Detached threads
– Join is not needed
– The OS destroys thread
resources when they
terminate
– A parameter in the create
call indicates a detached
thread
Note: The Pthreads library must be available
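A complete, compilable version of the create/join pattern above (a sketch; proc1 and arg are illustrative names):

#include <stdio.h>
#include <pthread.h>

void *proc1(void *arg)                             /* thread body */
{
    printf("thread received %d\n", *(int *) arg);
    return NULL;
}

int main(void)
{
    pthread_t thread1;
    int arg = 7;

    pthread_create(&thread1, NULL, proc1, &arg);   /* spawn an attached thread */
    pthread_join(thread1, NULL);                   /* wait for it and reclaim its resources */
    return 0;
}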
Pthread Locks
• Mutex variables
– Wait until lock opens before proceeding
– Thread selection when lock opens is non-deterministic
– Only the thread owning the lock can unlock it
• Example: declaring a Pthread lock
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&mutex);
// Critical section code
pthread_mutex_unlock(&mutex);
• Example: test if a lock is open
– Returns ‘EBUSY’ if already locked or 0 if open
pthread_mutex_trylock(&mutex)
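A short sketch (the function name is an assumption) of using trylock to do other work instead of blocking:

#include <errno.h>
#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void enter_when_free(void)
{
    while (pthread_mutex_trylock(&mutex) == EBUSY) {
        /* lock is held by another thread: do other useful work here */
    }
    /* Critical section code */
    pthread_mutex_unlock(&mutex);
}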
Pthread Condition Variables
Example Code
• Declarations
pthread_cond_t cond1;
pthread_mutex_t mutex1;
int c;   // shared counter tested as the condition
• Thread 1
pthread_mutex_lock(&mutex1);
while (c != 0)
pthread_cond_wait(&cond1, &mutex1);
pthread_mutex_unlock(&mutex1);
// Code to take action
• Thread 2
pthread_mutex_lock(&mutex1);
c--;
if (c == 0) pthread_cond_signal(&cond1);
pthread_mutex_unlock(&mutex1);
Java Threads
• Instantiate and run a thread
ThreadClass t = new ThreadClass();
t.start();
• Thread class
class ThreadClass extends Thread
{ public ThreadClass() { /* Constructor */ }
  public void run()
  { while (true)
    { // yield or sleep periodically
      // thread code entered here
    } } }
Modify Existing Language Syntax
Example Constructs
• Declaration of a shared memory variable
shared int x;
• Specify statements to execute concurrently
par { s1(); s2(); s3(); … sn(); }
• Iterations assigned to different processors
forall (i=0; i<n; i++) { //code }
• Examples: High Performance Fortran and C
Smart Compilers
• Perform Bernstein’s Dependency Analysis
– Ii is the set of memory locations read by a block of code i.
– Oi is the set of memory locations written by a block of code i.
– If Ii ∩ Oj = ∅ and Oi ∩ Oj = ∅ where i ≠ j, the code can execute in parallel
– Questions: Consider instructions: ‘a=x+y and b=x+z’
• Can these instructions execute in parallel?
• What are I1, I2, O1, and O2?
• What are I1 ∩O2, I2 ∩O1, and O1 ∩O2 ?
• Other optimizations
– Relocate data to optimize cache utilization
– Reorder instructions to optimize register use
– Loop decomposition example:
Before: for (i=3;i<=20;i++) a[i] = a[i-2]+4;
After: for (i=3;i<=20;i+=2) a[i]=a[i-2]+4;
for (i=4;i<=20;i+=2) a[i]=a[i-2]+4;
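After the decomposition the two loops touch disjoint odd- and even-indexed chains, so Bernstein's conditions hold and they can run concurrently; a Pthreads sketch of this (an assumed example, not from the slides):

#include <pthread.h>

int a[21];

void *odd_chain(void *arg)           /* reads and writes only odd-indexed elements */
{
    for (int i = 3; i <= 20; i += 2) a[i] = a[i-2] + 4;
    return NULL;
}

void *even_chain(void *arg)          /* reads and writes only even-indexed elements */
{
    for (int i = 4; i <= 20; i += 2) a[i] = a[i-2] + 4;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, odd_chain, NULL);
    pthread_create(&t2, NULL, even_chain, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}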
Compiler Directives (openMP)
• Contains compiler directives and library routines
• Recognized industry standard developed in the late 1990s
• Designed for shared memory programming
• Interfaces to C/C++, Fortran, and Java (JOMP)
• Uses a fork-join model based on threads
• Parallel sections of code execute using "teams of threads"
• openMP statements have the form
– C: #pragma omp <directive>
– JOMP: //omp <directive>
OMP Parallel Directive
• Description
– All threads execute the structured block
– An implicit barrier is at the end of the block
#pragma omp parallel
{ // code here }
• Example
#pragma omp parallel private(x, num_threads)
{ x = omp_get_thread_num();
num_threads = omp_get_num_threads();
a[x] = 10*num_threads;
}
• Notes
– omp_get_thread_num() returns the rank of the current thread
– omp_get_num_threads() returns the total number of threads in the team
– a[] is global (shared); x and num_threads are local (private) to each thread
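A complete, compilable version of the example (a sketch; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp; the array size of 64 is an assumed upper bound on the team size):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int a[64] = {0};
    int x, num_threads;

    #pragma omp parallel private(x, num_threads)
    {
        x = omp_get_thread_num();               /* rank of the current thread */
        num_threads = omp_get_num_threads();    /* total number of threads in the team */
        a[x] = 10 * num_threads;                /* a[] is shared by the team */
    }                                           /* implicit barrier here */

    printf("a[0] = %d\n", a[0]);
    return 0;
}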
OMP: Establish Thread Team Size
• Techniques that programs can use
– Specify in the parallel directive
#pragma omp parallel num_threads(value)
– Call the library routine: omp_set_num_threads(value)
– Set the environment variable: OMP_NUM_THREADS
– Use system dependent value if not specified
• Notes:
– The parallel directive sets the team size for that directive
– Other approaches set the value which remains until it is changed
OMP: Section and For Constructs
• Parallel sections (teams of threads share structured blocks)
– Syntax
#pragma omp sections
{ #pragma omp section {//block of code}
#pragma omp section {//block of code}
.. . }
– Note: There is an implied barrier at the end of the construct unless
nowait appears on the pragma line
• Parallel for loop
– Syntax:
#pragma omp for
for (i=0; i<MAX; i++) {block of code}
– Note: There is an implied barrier at the end of the loop unless
nowait appears on the pragma line
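A short compilable sketch (function and variable names assumed) combining both constructs inside one parallel region:

#include <omp.h>

void work(double *a, double *b, int n)
{
    int i;
    #pragma omp parallel
    {
        #pragma omp sections                 /* each section is executed by one thread */
        {
            #pragma omp section
            { a[0] = 1.0; }
            #pragma omp section
            { b[0] = 2.0; }
        }                                    /* implied barrier at the end of sections */

        #pragma omp for                      /* iterations divided among the team;
                                                implied barrier at the end of the loop */
        for (i = 1; i < n; i++)
            a[i] = a[0] + b[0];
    }
}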
OMP: Sequential within Parallel Blocks
• Single: the block executed by one of the threads
– Syntax: #pragma omp single {//code}
– Note: There is an implied barrier at the end of the
construct unless nowait appears on the pragma line
• Master: The block executed by the master thread
– Syntax: #pragma omp master {//code}
– Note: There is no implied barrier in this construct
OMP: Critical Sections and Synchronization
• Critical Sections (one thread at a time)
– Syntax: #pragma omp critical (name) {//code}
– Notes
• The name ties all critical sections together with that name
• The system uses a global default if name isn’t specified
• Barrier
– Syntax: #pragma omp barrier
– Note: All threads must be able to reach the barrier
• Atomic expression
– Syntax: #pragma omp atomic <expressionStatement>
Note: Keep critical sections small; threads waiting to enter a critical section execute serially
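A small compilable sketch (the function name is assumed) of a shared sum protected by the atomic directive:

#include <omp.h>

double sum_array(const double *a, int n)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic                   /* one atomic update of the shared sum */
        sum += a[i];
    }
    return sum;
}

A named critical section (#pragma omp critical (sum_update)) would protect the same update, but it serializes a larger region than the single atomic statement.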
Example Program (summing 1000 numbers)
• Heavyweight UNIX processes (Section 8.7.1)
– Concurrency: fork, exit, and wait.
– Shared memory: shmget, shmat, shmctl.
– Data sharing: semget, semctl, semop to create P,V functions.
• Lightweight threads (Section 8.7.2)
– Concurrency: pthread create and join
– Shared Memory: automatic
– Data sharing: pthread mutex init, lock, and unlock.
• Java (Section 8.7.3)
– Concurrency: Thread extension class
– Shared Memory: automatic
– Data Sharing: synchronized methods provide monitor capability
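As a rough illustration of the lightweight-thread approach (an assumed sketch, not the textbook's Section 8.7.2 code), each thread sums its slice of the array and adds the partial result to a global sum inside a mutex-protected critical section:

#include <stdio.h>
#include <pthread.h>

#define N        1000
#define NTHREADS 4

int a[N];
long sum = 0;
pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

void *partial_sum(void *arg)
{
    int rank = *(int *) arg;
    long local = 0;

    for (int i = rank * (N / NTHREADS); i < (rank + 1) * (N / NTHREADS); i++)
        local += a[i];                      /* sum this thread's slice */

    pthread_mutex_lock(&sum_mutex);         /* critical section: update the global sum */
    sum += local;
    pthread_mutex_unlock(&sum_mutex);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int ranks[NTHREADS];

    for (int i = 0; i < N; i++) a[i] = 1;   /* sample data */

    for (int t = 0; t < NTHREADS; t++) {
        ranks[t] = t;
        pthread_create(&threads[t], NULL, partial_sum, &ranks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("sum = %ld\n", sum);
    return 0;
}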