Transcript: OpenMP Shared-Memory Programming with Threads
Adapted and edited by Aleksey Zimin from
http://navet.ics.hawaii.edu/~casanova/courses/ics632_fall07/slides/ics632_threads.ppt
http://users.actcom.co.il/~choo/lupg/tutorials/multiprocess/multiprocess.html#process_creation_fork_syscall
Parallel Applications
Modern computers have multiple CPU cores
(and/or multiple CPUs) on board
We have to be able to utilize the computing
power by parallelizing our tasks
CPU Information
Linux computer: /proc/cpuinfo
cat /proc/cpuinfo example:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8220
stepping        : 3
cpu MHz         : 2800.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 5625.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
Processes in UNIX
UNIX is natively a parallel operating system
A process is an instance of running a program
Each process has a unique process id
The shell command "ps" lists all running
processes
Using the shell commands
In any UNIX shell, "&" will run the command in the
background.
The command will run in its own shell, which is a child
of the current shell
[alekseyz@genome10]$ run_command.sh &
The "wait" command waits for all child processes of the
current shell to finish
Example of & and wait
In bash:
#!/bin/bash
let NUM_CPUS=`cat /proc/cpuinfo | grep processor | tail -1 | awk '{print $NF+1}'`
let counter=1
let cpu_counter=1
echo "Total processes to run:"$1
echo "Simultaneously running:"$NUM_CPUS
while [[ $counter -le $1 ]]; do
  while [[ $cpu_counter -le $NUM_CPUS && $counter -le $1 ]]; do
    ./echo_sleep_echo.sh &
    let counter=$counter+1
    let cpu_counter=$cpu_counter+1
  done
  let cpu_counter=1
  wait
done

echo_sleep_echo.sh:
#!/bin/bash
echo "Sleeping 10 seconds in shell "$$
sleep 10
echo "Done"
Using fork() and wait() in C
The fork() system call is the basic way to create a new
process; it is used to produce a child process.
It returns twice (!): once in the parent and once in the child.
fork() causes the current process to be split into two
processes - a parent process, and a child process.
All of the memory pages used by the original process
get duplicated during the fork() call, so both parent and
child process see the exact same memory image.
fork() continued
When fork() returns in the parent process, its return
value is the process ID (PID) of the child process.
When it returns inside the child process, its return value
is '0'.
If for some reason fork() failed (not enough memory,
too many processes, etc.), no new process is created,
and the return value of the call is '-1'.
Both child process and parent process continue from
the same place in the code where the fork() call was
used.
Child processes
When a child process exits, it sends a signal to its parent process,
which needs to acknowledge its child's death. Until it does, the
child process is in a state called "zombie".
When a process exits, if it had any children, they become orphans.
An orphan process is automatically inherited by the init process,
and becomes a child of this init process.
When the parent process is not properly coded (it never calls wait()),
the child remains in the zombie state for as long as the parent runs.
Such processes can be noticed by running the "ps" command and seeing processes
having the string "<defunct>" as their command name (a minimal demonstration follows below).
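In this sketch (not from the original slides), the child exits immediately
while the parent postpones wait(), so for about 30 seconds "ps" shows the
child as <defunct>:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();
    if (pid == -1) { perror("fork"); exit(1); }
    if (pid == 0)
        exit(0);                  /* the child terminates right away */
    printf("Child %d is now a zombie; run \"ps\" in another terminal\n", (int)pid);
    sleep(30);                    /* the parent does not reap the child yet */
    wait(NULL);                   /* finally acknowledge the child's death */
    return 0;
}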
Simple fork() and wait() example
#include <stdio.h>
#include <stdlib.h>     /* defines exit() */
#include <unistd.h>     /* defines fork(), sleep() and pid_t */
#include <sys/wait.h>   /* defines the wait() system call */

int main() {
    pid_t child_pid;
    int child_status;

    child_pid = fork();
    switch (child_pid) {
    case -1:
        perror("fork");
        exit(1);
    case 0:
        printf("I am the child, Hello world\n");
        sleep(10);
        exit(0);
    default:
        printf("I am the parent, waiting for the child process %d to exit...\n", child_pid);
        wait(&child_status);
        printf("I am the parent, child process %d exited with status %d\n", child_pid, child_status);
    }
    return 0;
}
InterProcess communication
One can prescribe what each child does after the
fork() call
It is helpful if the parent can communicate with the
child (e.g., to report progress or exchange data)
The easiest way for parent and child to
communicate is through a pipe
Using pipes
Anonymous pipe: A pipe is a one-way mechanism that
allows two related processes (i.e. one is an ancestor of
the other) to send a byte stream from one of them to
the other one.
The order in which data is written to the pipe is the
same order in which data is read from the pipe.
The system assures that data won't get lost in the
middle, unless one of the processes (the sender or the
receiver) exits prematurely.
pipe()
The pipe() system call is used to create a read-write pipe.
pipe() takes as an argument an array of 2 integers
that will be used to save the two file descriptors
used to access the pipe: the first to read from
the pipe, and the second to write to the pipe.
Using pipe()
/* first, define an array to store the two file descriptors */
int pipes[2];

/* now, create the pipe */
int rc = pipe(pipes);
if (rc == -1) {
    /* pipe() failed */
    perror("pipe");
    exit(1);
}
pipe() example -- main
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void do_child(int data_pipe[]);    /* defined on the following slides */
void do_parent(int data_pipe[]);

int main()
{
    int data_pipe[2]; /* an array to store the file descriptors of the pipe. */
    int pid;
    int rc;

    rc = pipe(data_pipe);
    if (rc == -1) { perror("pipe"); exit(1); }

    pid = fork();
    switch (pid)
    {
    case -1:
        perror("fork"); exit(1);
    case 0:
        do_child(data_pipe);    /* never returns (it calls exit) */
    default:
        do_parent(data_pipe);   /* never returns (it calls exit) */
    }
}
pipe() example -- parent
void do_parent(int data_pipe[]) {
    int c; /* data received from the user. */
    int rc;

    /* first, close the un-needed read-part of the pipe. */
    close(data_pipe[0]);

    while ((c = getchar()) > 0) {
        rc = write(data_pipe[1], &c, 1);
        if (rc == -1) {
            perror("Parent: write");
            close(data_pipe[1]);
            exit(1);
        }
    }
    close(data_pipe[1]);
    exit(0);
}
pipe() example -- child
void do_child(int data_pipe[]) {
    int c; /* data received from the parent. */
    int rc;

    /* first, close the un-needed write-part of the pipe. */
    close(data_pipe[1]);

    while ((rc = read(data_pipe[0], &c, 1)) > 0) {
        putchar(c);
    }
    exit(0);
}
Processes vs. threads
Each process owns:
a unit of resource ownership: a virtual address space, plus control of other
resources such as I/O, files, ...
a unit of dispatching: an execution path and state, and a dispatching priority,
controlled by the OS
Each thread owns:
only the unit of dispatching
Threads share the resources of their process
Comments
A traditional program has one thread per process.
The main thread starts with main()
Only one thread or program counter (PC) is
allowed to execute the code segment
To add another PC, you need to fork(), so that the
new PC executes in another process's
address space.
Key benefits of multithreading
Less time to create a thread than a process
Less time to terminate a thread than a process
Less time to switch between threads
More efficient communication: no need
for the kernel to intervene
Smaller chance of driving you crazy while
writing code / debugging
Shared memory programming
The “easiest” form of parallel programming
Can be used to parallelize a sequential code in an
incremental way:
take a sequential code
parallelize a small section
check that it works
check that it speeds things up a bit
move on to another section
Thread
A thread is a stream of instructions that can be
scheduled as an independent unit.
A process is created by an operating system
contains information about resources
process id, file descriptors, ...
contains information on the execution state
program counter, stack, ...
Schedulable Part of a Process
The concept of a thread requires that we make a
separation between these two kinds of information in a
process
resources available to the entire process
program instructions, global data, working
directory
schedulable entities
program counters and stacks.
A thread is an entity within a process which consists of
the schedulable part of the process.
Process is still there,
what’s new for thread?
Thread shares with process
Virtual address space (holding process image), access to
memory and resources of its process, shared with all other
threads in that process
Protected access to CPU, files, and I/O resources
Thread has its own
Thread execution state
Saved thread context (an independent PC within a process)
Execution stack
Per-thread static storage for local variables
Possible combinations of threads and processes
One process, one thread
Multiple processes, one thread per process
One process, multiple threads
Multiple processes, multiple threads per process
Parallelism with Threads
Create threads within a process
Each thread does something (hopefully) useful
Threads may be working truly concurrently
Multi-processor
Multi-core
Or just pseudo-concurrently
Single-proc, single-core
Example
Say I want to compute the sum of two arrays
I can just create N threads, each of which sums 1/Nth
of both arrays and then combine their results
I can also create N threads that each increment some
sum variable element-by-element, but then I’ve got to
make sure they don’t step on each other’s toes
The first version is a bit less "shared-memory", but is
probably more efficient (a sketch follows below)
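Jumping ahead to the Pthreads API covered later in these slides, here is a
minimal sketch of the first approach: each thread sums its own 1/Nth chunk of
both arrays into a private slot, and the main thread combines the partial
results. The array size, thread count, and all names are illustrative, not
part of the original example.

#include <pthread.h>
#include <stdio.h>

#define N            1000000
#define NUM_THREADS  4

long int a[N], b[N];              /* the two input arrays            */
long int part[NUM_THREADS];       /* one partial sum per thread      */
long int idx[NUM_THREADS];        /* per-thread index passed as arg  */

void *sum_chunk(void *arg) {
    long int t = *(long int *)arg;                    /* my thread index */
    long int lo = t * (N / NUM_THREADS);
    long int hi = (t == NUM_THREADS - 1) ? N : lo + N / NUM_THREADS;
    for (long int i = lo; i < hi; i++)
        part[t] += a[i] + b[i];                       /* writes only slot t: no sharing */
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    long int total = 0, t, i;

    for (i = 0; i < N; i++) { a[i] = 1; b[i] = 2; }   /* some test data */
    for (t = 0; t < NUM_THREADS; t++) {
        idx[t] = t;
        pthread_create(&tid[t], NULL, sum_chunk, &idx[t]);
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);                   /* wait for all workers */
    for (t = 0; t < NUM_THREADS; t++)
        total += part[t];                             /* combine the results  */
    printf("total = %ld\n", total);                   /* expect 3 * N         */
    return 0;
}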
Multi-threading issues
There are really two main issues when writing multi-threaded
code:
Issue #1: Load Balancing
Make sure that no processor/core is left idle when it could be doing
useful work
Issue #2: Correct access to shared variables
Implemented via mutual exclusion: create sections of code that only a
single thread can be in at a time
Called “critical sections”
Classical variable update example
Done via “locks” and “unlocks”
Warning: locks are NOT on variables, but on sections of code
User-level threads
(DOS, Windows 95)
User-level threads: Many-to-one thread mapping
Implemented by user-level runtime libraries
Create, schedule, synchronize threads at user-level
OS is not aware of user-level threads
OS thinks each process contains only a single
thread of control
User-level threads
Advantages
Does not require OS support; Portable
Can tune scheduling policy to meet application
demands
Lower overhead thread operations since no system
calls
Disadvantages
Cannot leverage multiprocessors
Entire process blocks when one thread blocks
Kernel-level threads
Kernel-level threads: One-to-one thread
mapping
OS provides each user-level thread with a kernel
thread
Each kernel thread scheduled independently
Thread operations (creation, scheduling,
synchronization) performed by OS
Kernel-level threads
Advantages
Each kernel-level thread can run in parallel on a
multiprocessor
When one thread blocks, other threads from process
can be scheduled
Disadvantages
Higher overhead for thread operations
OS must scale well with increasing number of
threads
Threads in Practice
Pthreads
Popular C library
Flexible
Will discuss these
OpenMP
Java Threads
Pthreads
A POSIX standard (IEEE 1003.1c) API for thread
creation and synchronization
The API specifies the standard behavior
Implementation choices are up to developers
Implementations vary, some better than others
Common in all UNIX operating systems
Some people have written it for Win32
The most portable threading library out there
What do threads look like in UNIX?
Using the Pthread Library
Pthread library typically uses kernel-threads
Programs must include the file pthread.h
Programs must be linked with the pthread library (-lpthread); an example compile command is shown below
The API contains functions to
create threads
control threads
manage threads
synchronize threads
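For example, assuming gcc on a Linux machine (file names are illustrative):

gcc -o my_program my_program.c -lpthread

gcc also accepts the -pthread flag, which sets both the compile-time and
link-time options in one go.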
pthread_self()
Returns the thread identifier for the calling
thread
At any point in its instruction stream a thread can
figure out which thread it is
Convenient to be able to write code that says: "If
you're thread 1 do this, otherwise do that" (see the sketch below)
#include <pthread.h>
pthread_t pthread_self(void);
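A small illustrative sketch (the names are made up): the main thread records
its own id, and any piece of code can later ask "am I the main thread?" with
pthread_equal().

#include <pthread.h>
#include <stdio.h>

pthread_t main_thread;            /* filled in once by main() */

void report(void) {
    if (pthread_equal(pthread_self(), main_thread))
        printf("running in the main thread\n");
    else
        printf("running in a worker thread\n");
}

int main(void) {
    main_thread = pthread_self();
    report();                     /* prints "running in the main thread" */
    return 0;
}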
pthread_create()
Creates a new thread of control
#include <pthread.h>
int pthread_create (
pthread_t *thread,
pthread_attr_t *attr,
void * (*start_routine) (void *),
void *arg);
Returns 0 to indicate success, otherwise returns error code
thread: output argument that will contain the thread id of the new thread
attr: input argument that specifies the attributes of the thread to be created (NULL =
default attributes)
start_routine: function to use as the start of the new thread; must have the prototype:
void * foo(void *)
arg: argument to pass to the new thread routine
If the thread routine requires multiple arguments, they must be passed bundled up in an array or a
structure
pthread_create() example
Want to create a thread to compute the sum of the elements of
an array
void *do_work(void *arg);
Needs three arguments
the array, its size, where to store the sum
we need to bundle them in a structure
struct arguments {
    long int *array;
    long int size;
    long int *sum;
};
pthread_create() example
int main(void) {
    long int array[ARRAY_SIZE], sum, i;
    pthread_t worker_thread;
    struct arguments *arg;

    for (i = 0; i < ARRAY_SIZE; i++) array[i] = 1;
    arg = calloc(1, sizeof(struct arguments));
    arg->array = array;
    arg->size = ARRAY_SIZE;
    arg->sum = &sum;
    if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
        fprintf(stderr, "Error while creating thread\n");
        exit(1);
    }
    ...
    exit(0);
}
pthread_create() example
void *do_work(void *arg) {
    long int i, size;
    long int *array;
    long int *sum;

    size = ((struct arguments *)arg)->size;
    array = ((struct arguments *)arg)->array;
    sum = ((struct arguments *)arg)->sum;

    *sum = 0;
    for (i = 0; i < size; i++)
        *sum += array[i];
    return NULL;
}
Comments about the example
The “parent thread” continues its normal execution
after creating the “child thread”
Memory is shared by the parent and the child (the array,
the location of the sum)
Nothing prevents the parent from doing something to
it while the child is still executing, which may lead to a
wrong computation
The bundling and unbundling of arguments is a bit
tedious, but nothing compared to what’s needed with
shared memory segments and processes
pthread_exit()
Terminates the calling thread
#include <pthread.h>
void pthread_exit(
void *retval);
The return value is made available to another thread calling a
pthread_join() (see later)
The previous example had the thread just return from
function do_work()
In this case the call to pthread_exit() is implicit
The return value of the function serves as the argument to the
(implicitly called) pthread_exit().
pthread_join()
Causes the calling thread to wait for another thread to terminate
#include <pthread.h>
int pthread_join(
pthread_t thread,
void **value_ptr);
thread: input parameter, id of the thread to wait on
value_ptr: output parameter, value given to pthread_exit() by
the terminating thread (which happens to always be a void *); a usage sketch follows below
returns 0 to indicate success, error code otherwise
multiple simultaneous calls for the same thread are not allowed
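A minimal sketch (made-up names) of how a value handed to pthread_exit() is
retrieved through pthread_join()'s second argument:

#include <pthread.h>
#include <stdio.h>

static long int answer = 42;          /* illustrative result value */

void *worker(void *arg) {
    (void)arg;                        /* unused */
    pthread_exit(&answer);            /* same effect as "return &answer;" */
}

int main(void) {
    pthread_t tid;
    void *retval;

    if (pthread_create(&tid, NULL, worker, NULL)) return 1;
    if (pthread_join(tid, &retval)) return 1;     /* retval now points to answer */
    printf("worker returned %ld\n", *(long int *)retval);
    return 0;
}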
pthread_kill()
Sends a signal to a thread (typically used to terminate it)
#include <pthread.h>
int pthread_kill(
pthread_t thread,
int sig);
thread: input parameter, id of the thread to terminate
sig: signal number
returns 0 to indicate success, error code otherwise
pthread_join() example
int main(void) {
    long int array[100];
    long int sum;
    pthread_t worker_thread;
    struct arguments *arg;

    arg = (struct arguments *)calloc(1, sizeof(struct arguments));
    arg->array = array;
    arg->size = 100;
    arg->sum = &sum;

    if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
        fprintf(stderr, "Error while creating thread\n");
        exit(1);
    }
    ...
    if (pthread_join(worker_thread, NULL)) {
        fprintf(stderr, "Error while waiting for thread\n");
        exit(1);
    }
}
Synchronizing pthreads
As we’ve seen earlier, we need a system to implement locks
to create mutual exclusion for variable access, via critical
sections
Lock creation
int pthread_mutex_init(
pthread_mutex_t *mutex,
const pthread_mutexattr_t *attr);
returns 0 on success, an error code otherwise
mutex: output parameter, lock
attr: input, lock attributes
NULL: default
There are functions to set the attribute (look at the man pages if you’re
interested)
Synchronizing pthreads
Locking a lock
If the lock is already locked, then the calling thread is blocked
If the lock is not locked, then the calling thread acquires it
int pthread_mutex_lock(
pthread_mutex_t *mutex);
returns 0 on success, an error code otherwise
mutex: input parameter, lock
Just checking: returns immediately instead of blocking (a usage sketch follows below)
int pthread_mutex_trylock(
pthread_mutex_t *mutex);
returns 0 on success, EBUSY if the lock is locked, an error code otherwise
mutex: input parameter, lock
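A small illustrative sketch of the "just checking" pattern (names are made up):
try the lock, do the update only if the lock was free, otherwise move on and
retry later.

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* statically initialized mutex */

void try_update(long int *count) {
    int rc = pthread_mutex_trylock(&lock);
    if (rc == 0) {                    /* got the lock: enter the critical section */
        (*count)++;
        pthread_mutex_unlock(&lock);
    } else if (rc == EBUSY) {         /* another thread holds the lock */
        printf("lock busy, will retry later\n");
    } else {
        fprintf(stderr, "trylock error %d\n", rc);
    }
}

int main(void) {
    long int count = 0;
    try_update(&count);               /* uncontended here, so the update succeeds */
    printf("count = %ld\n", count);
    return 0;
}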
Synchronizing pthreads
Releasing a lock
int pthread_mutex_unlock(
pthread_mutex_t *mutex);
returns 0 on success, an error code otherwise
mutex: input parameter, lock
With locking, trylocking, and unlocking, one can
avoid all race conditions and protect access to
shared variables
Mutex Example:
...
pthread_mutex_t mutex;
pthread_mutex_init(&mutex, NULL);
...
pthread_mutex_lock(&mutex);
/* critical section */
count++;
pthread_mutex_unlock(&mutex);
To “lock” variable count, just put a pthread_mutex_lock()
and pthread_mutex_unlock() around all sections of the code
that write to variable count
Again, you’re really locking code, not variables
Cleaning up memory
Releasing memory for a mutex
int pthread_mutex_destroy(
    pthread_mutex_t *mutex);
Releasing memory for a mutex attribute
int pthread_mutexattr_destroy(
    pthread_mutexattr_t *attr);
Signaling
Allows a thread to wait until another thread signals that some
condition is met
provides a more sophisticated way to synchronize threads than just mutex
locks
Done with “condition variables”
Example:
You have to implement a server with a main thread and many threads
that can be assigned work (e.g., an incoming request)
You want to be able to “tell” a thread: “there is work for you to do”
Inconvenient to do with mutex locks
the main thread must carefully manage a lock for each worker thread
everybody must constantly be polling locks
Condition Variables
Condition variables are used in conjunction with mutexes
Create a condition variable
Create an associated mutex (we will see why it's needed later)
Waiting on a condition
lock the mutex
wait on the condition variable
unlock the mutex
Signaling
lock the mutex
signal on the condition variable
unlock the mutex
pthread_cond_init()
Creating a condition variable
int pthread_cond_init(
pthread_cond_t *cond,
const pthread_condattr_t *attr);
returns 0 on success, an error code otherwise
cond: output parameter, condition
attr: input parameter, attributes (default = NULL)
pthread_cond_wait()
Waiting on a condition variable
int pthread_cond_wait(
pthread_cond_t *cond,
pthread_mutex_t *mutex);
returns 0 on success, an error code otherwise
cond: input parameter, condition
mutex: input parameter, associated mutex
pthread_cond_signal()
Signaling a condition variable
int pthread_cond_signal(
    pthread_cond_t *cond);
returns 0 on success, an error code otherwise
cond: input parameter, condition
“Wakes up” one thread out of the possibly
many threads waiting for the condition
The thread is chosen non-deterministically
pthread_cond_broadcast()
Signaling a condition variable
int pthread_cond_broadcast(
    pthread_cond_t *cond);
returns 0 on success, an error code otherwise
cond: input parameter, condition
“Wakes up” ALL threads waiting for the
condition
Condition Variable: example
Say I want to have multiple threads wait until a counter reaches a maximum
value and be awakened when it happens
pthread_mutex_lock(&lock);
while (count < MAX_COUNT) {
    pthread_cond_wait(&cond, &lock);
}
pthread_mutex_unlock(&lock);
Locking the lock so that we can read the value of count without the possibility of
a race condition
Calling pthread_cond_wait() in a loop to avoid "spurious wake-ups"
When going to sleep the pthread_cond_wait() function implicitly releases
the lock
When waking up the pthread_cond_wait() function implicitly acquires the
lock (and may thus sleep)
Unlocking the lock after exiting from the loop (the matching signaling side is sketched below)
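For completeness, a sketch of the matching signaling side, assuming the same
lock, cond, count, and MAX_COUNT as in the example above (this code is not in
the original slides):

pthread_mutex_lock(&lock);
count++;
if (count >= MAX_COUNT)
    pthread_cond_broadcast(&cond);   /* wake ALL waiters; pthread_cond_signal() would wake just one */
pthread_mutex_unlock(&lock);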
pthread_cond_timedwait()
Waiting on a condition variable with a
timeout
int pthread_cond_timedwait(
pthread_cond_t *cond,
pthread_mutex_t *mutex,
const struct timespec *delay);
returns 0 on success, an error code otherwise
cond: input parameter, condition
mutex: input parameter, associated mutex
delay: input parameter, the absolute time at which to give up waiting
(a struct timespec, with seconds and nanoseconds fields); a usage sketch follows below
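A usage sketch, assuming the lock, cond, count, and MAX_COUNT from the earlier
example (clock_gettime() needs <time.h>, ETIMEDOUT needs <errno.h>); note that
the third argument is an absolute deadline, not a relative delay:

struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);   /* current time */
deadline.tv_sec += 5;                       /* give up 5 seconds from now */

pthread_mutex_lock(&lock);
while (count < MAX_COUNT) {
    int rc = pthread_cond_timedwait(&cond, &lock, &deadline);
    if (rc == ETIMEDOUT)
        break;                              /* deadline passed, stop waiting */
}
pthread_mutex_unlock(&lock);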
PThreads: Conclusion
A popular way to write multi-threaded code
If you know pthreads, you’ll have no problem adapting
to other multi-threading techniques
Condition variables are a bit odd, but very useful
For your project you may want to use pthreads
More information
Man pages
PThread Tutorial:
http://www.llnl.gov/computing/tutorials/pthreads/