Shared Memory Programming

Download Report

Transcript Shared Memory Programming

Programming Shared Address
Space Platforms
Carl Tropper
Department of Computer Science
Threads
• On a shared memory architecture, processes
are expensive
• To maintain-they require that all data associated with a
process is private
• To manipulate-overhead for scheduling processes and
maintaining security is significant
• Enter threads, which assume that all memory is
globalf
• Except for stacks associated with the thread
• Threads are invoked via function calls and are
created by a [C] function create_thread
Threads-an example
Logical memory model-a picture
•Logical model treats thread stack as local data
Why threads?
• Portable
• Can develop application on a serial machine and then run it
on a parallel machine.
• Can migrate between different shared memory platforms.
• Latency
• While one thread waits for communications, another can use
the CPU
• I/O, memory access latencies can also be hidden
• Support for load balancing and scheduling in
API’s
• Result• easier to get good performance then message passing,
• development tools for POSIX threads are widespread and
stable
POSIX Thread API
• IEEE Standard, supported by most vendors
• Pthreads for short
• Other thread packages-NT threads, Solaris
threads, Java threads
• Example-compute π by
• Generating random numbers
• Counting how many fall within the largest circle which can be
inscribed in a unit square (area=π/4)
• Idea of program-a bunch of threads, each has a fixed number
of random points, each counts how many are in the circle
Threads-pthread_create
• Pthread_create creates threads
• Invokes thread function
include <pthread.h>
int pthread_create (
pthread_t *thread_handle, const pthread_attr_t *attribute,
void * (*thread_function)(void *),
void *arg);
• Thread_handle points to thread ID
• Thread has attributes specified by attribute argument (later).
If null, then get a default thread
• Arg passes important stuff to thread, e.g. integer which
serves as the random seed
• Pthread_create returns 0 after successful creation of thread
Threads-Pthread_join
• Pthread_join provides mechanism to
communicate results produced by individual
threads
int pthread_join (
pthread_t thread,
void **ptr);
• Suspends execution of the calling program until
thread with ID given by thread terminates
• Upon completion of pthread_join, the value
passed to pthread.exit is returned in the location
pointed to by ptr
• Pthread_join returns 0 upon successful
completion
Compute Pi 1
#include <pthread.h>
#include <stdlib.h>
#define MAX_THREADS 512
void *compute_pi (void *);
....
main() {
...
pthread_t p_threads[MAX_THREADS];
pthread_attr_t attr;
pthread_attr_init (&attr);
for (i=0; i< num_threads; i++) {
hits[i] = i;
pthread_create(&p_threads[i], &attr, compute_pi,
(void *) &hits[i]);
}
for (i=0; i< num_threads; i++) {
pthread_join(p_threads[i], NULL);
total_hits += hits[i];
}
...
}
Compute Pi 2
void *compute_pi (void *s) {
int seed, i, *hit_pointer;
double rand_no_x, rand_no_y;
int local_hits;
hit_pointer = (int *) s;
seed = *hit_pointer;
local_hits = 0;
for (i = 0; i < sample_points_per_thread; i++) {
rand_no_x =(double)(rand_r(&seed))/(double)((2<<14)-1);
rand_no_y =(double)(rand_r(&seed))/(double)((2<<14)-1);
if (((rand_no_x - 0.5) * (rand_no_x - 0.5) +
(rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25)
local_hits ++;
seed *= i;
}
*hit_pointer = local_hits;
pthread_exit(0);
}
Threads are so sensitive
• Could increase local_hits outside (instead of inside) the
main loop in compute_pi by
• changing localHits++ to *(hit_pointer)++ and
• deleting *hit_pointer=local_hits outside the loop
• But performance gets whacked because adjacent items in
a shared cache line of the hit array are getting written to,
causing invalidates to be issued (“False sharing”)
• Solution- change hits to a 2-D array. Arrays are stored row
major, so cache lines won’t be shared because we force
distance between the entries
False sharing and the solution
Spaced_1 is the original change
Spaced_16=16 integer 2nd column
Spaced_32 correspond to 32 integer 2nd column
Synchronization Primitives
• Mutual exclusion for shared variables
• Race conditions-for example
If (my_cost<best cost) best_cost=my_cost
If best cost is initially 100,
there are two threads with my_cost =50,75,
!
A race determines the value of best_cost
• So threaded API’s provide support for
we have a non-deterministic outcome
• critical sections and
• atomic operations
Mutex_locks
• Mutex_locks (mutual exclusion locks)
• Atomic operation associated with code or a critical
section
• Thread tries to get lock-if it is locked, the thread is
blocked
• Thread leaving the cs has to unlock the
mutex_lock
• P_threads provides functions for dealing
with locks
Mutex_locks-the 3 maidens
Int
Pthread_mutex_lock {
pthread_mutex_t *mutex_lock};
attempts a lock on mutex_lock, blocks if mutex_lock is blocked success returns 0
Int
Pthread_mutex_unlock{
pthread_mutex_t *mutex_lock};
Mutex_lock is released and a blocked thread is allowed in
Int
Pthread_mutex_init{
pthread_mutex_t *mutex_lock,
const pthread_mutexattr_t *lock_attr};
Unlocks mutex_lock.
Atributes of mutex_lock are specified in in *lock_attr. NULL supplies default values.
Compute the minimum of a list of
integers
•
We can now write our previously incorrect code segment as:
pthread_mutex_t minimum_value_lock;
...
main() {
....
pthread_mutex_init(&minimum_value_lock, NULL);
....
}
void *find_min(void *list_ptr) {
....
pthread_mutex_lock(&minimum_value_lock);
if (my_min < minimum_value)
minimum_value = my_min;
/* and unlock the mutex */
pthread_mutex_unlock(&minimum_value_lock);
}
Producer Consumer Work Queues
• Scenario-producer thread creates tasks and
puts them in work queues. Consumer threads
grab tasks and executes them
• Can use shared data structure for the tasks,
but……
• Producer thread must not overwrite buffer before consumer
thread picks up task
• Consumer threads must not pick up tasks which don’t exist
• Pick up one task at a time
Producer Consumer
• Task_available
=1 producer task has to wait to produce
=0 consumer thread can pick up task
• Need to protect operations on task
available variable
Producer Code
pthread_mutex_t task_queue_lock;
int task_available;
...
main() {
....
task_available = 0;
pthread_mutex_init(&task_queue_lock, NULL);
....
}
void *producer(void *producer_thread_data) {
....
while (!done()) {
inserted = 0;
create_task(&my_task);
while (inserted == 0) {
pthread_mutex_lock(&task_queue_lock);
if (task_available == 0) {
insert_into_queue(my_task);
task_available = 1;
inserted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
}
}
Consumer Code
void *consumer(void *consumer_thread_data) {
int extracted;
struct task my_task;
/* local data structure declarations */
while (!done()) {
extracted = 0;
while (extracted == 0) {
pthread_mutex_lock(&task_queue_lock);
if (task_available == 1) {
extract_from_queue(&my_task);
task_available = 0;
extracted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
process_task(my_task);
}
}
Overheads of locking
• Size of critical section matters-only one
thread at a time is allowed in the section,
i.e. the program becomes serial
• What to do? Instead of blocking a thread
when mutex is locked, allow thread to do
other work and poll mutex to see if it is
unlocked
Int
Pthread_mutex_trylock{
pthread_mutex_t *mutex_lock}
K matches in a list
/* Finding k matches in a list */
void *find_entries(void *start_pointer) {
/* This is the thread function */
struct database_record *next_record;
int count;
current_pointer = start_pointer;
do {
next_record = find_next_entry(current_pointer);
count = output_record(next_record);
} while (count < requested_number_of_records);
}
int output_record(struct database_record *record_ptr) {
int count;
pthread_mutex_lock(&output_count_lock);
output_count ++;
count = output_count;
pthread_mutex_unlock(&output_count_lock);
if (count <= requested_number_of_records)
print_record(record_ptr);
return (count);
}
K matches (continued)
/* rewritten output_record function */
int output_record(struct database_record *record_ptr) {
int count;
int lock_status;
lock_status=pthread_mutex_trylock(&output_count_lock);
if (lock_status == EBUSY) {
insert_into_local_list(record_ptr);
return(0);
}
else {
count = output_count;
output_count += number_on_local_list + 1;
pthread_mutex_unlock(&output_count_lock);
print_records(record_ptr, local_list,
requested_number_of_records - count);
return(count + number_on_local_list + 1);
}
}
Condition variables
• Polling takes time. In the producer-consumer
example, the producer and consumer threads
have to poll for a lock.
• An easier idea would be for the consumer thread
to signal the producer thread when buffer space
is available.
• Enter condition variables-a thread can block
itself if a predicate becomes true
• Task_available==1 in producer-consumer
• Thread locks mutex on condition variable, and
tests the predicate-if the predicate is false, the
thread waits on the condition variable
Condition variables
• Int pthread_cond_wait(pthread_cond_t *cond,pthread_mutex_t
*mutex);
is the function used to wait on the condition
variable
• The function blocks the thread until it receives a signal from
the OS or from another thread
• Releases the lock on mutex-have to do this in order for other
threads to be able to change the predicate
• Thread enters a queue
• When a condition becomes true, it is signaled using
pthread_cond_signal, unblocking one of the threads in the
queue
Producer Consumer
• Producer produces task, inserts task on queue,
sets task-available=1
• Sends signal to consumer thread via
Int pthread_cond_signal(pthread_cond_t *cond);
• Releases lock on mutex via pthread_mutex_unlock
• Initialize condition variable via
Int pthread_cond_init(pthread_cond_t *cond, const
pthread_condattr_t *attr);
• Destroy condition variable via
Int pthread_cond_destroy(pthread_cond_t *cond)
Producer Consumer Using Condition Variables
pthread_cond_t cond_queue_empty, cond_queue_full;
pthread_mutex_t task_queue_cond_lock;
int task_available;
/* other data structures here */
main() {
/* declarations and initializations */
task_available = 0;
pthread_init();
pthread_cond_init(&cond_queue_empty, NULL);
pthread_cond_init(&cond_queue_full, NULL);
pthread_mutex_init(&task_queue_cond_lock, NULL);
/* create and join producer and consumer threads */
}
Producer Consumer Using Condition Variables II
/*create and join producer and consumer threads*/
void *producer(void *producer_thread_data) {
int inserted;
while (!done()) {
create_task();
pthread_mutex_lock(&task_queue_cond_lock);
while (task_available == 1)
pthread_cond_wait(&cond_queue_empty,
&task_queue_cond_lock);
insert_into_queue();
task_available = 1;
pthread_cond_signal(&cond_queue_full);
pthread_mutex_unlock(&task_queue_cond_lock);
}
}
Producer Consumer Using Condition Variables II
void *consumer(void *consumer_thread_data) {
while (!done()) {
pthread_mutex_lock(&task_queue_cond_lock);
while (task_available == 0)
pthread_cond_wait(&cond_queue_full,
&task_queue_cond_lock);
my_task = extract_from_queue();
task_available = 0;
pthread_cond_signal(&cond_queue_empty);
pthread_mutex_unlock(&task_queue_cond_lock);
process_task(my_task);
}
}
Controlling Thread and Synchronization
Attributes
• Threads have different attributes (scheduling
algorithms, stack sizes….)
• Attributes object has default attributesprogrammer can modify them
• Advantage-system implements the attributes
• easier to read
• easier to change attributes
• easier to port to another OS
Attribute Objects for threads
• Use pthread_attr_init to create an
attributes object (pointed to by *attr)with default
values
• pthread_attr_destroy takes out attributes
object
• Individual properties associated with the
attributes object can be changed using the
following functions:
pthread_attr_setdetachstate,
pthread_attr_setguardsize_np,
pthread_attr_setstacksize,
pthread_attr_setinheritsched,
pthread_attr_setschedpolicy,
pthread_attr_setschedparam
Attribute objects for mutexes
• Pthreads supports 3 types of locks
• Normal-the default brand. Deadlocks if a thread with a lock
attempts to lock it again. Recursive code can cause this to
happen.
• Recursive mutex-single thread can lock a mutex multiple
times
» Each time it locks the mutex a counter is incremented
» each time it unlocks the mutex, a counter is decremented.
» Any other thread can lock only if the counter==0
• Error-check mutex-like a normal mutex, only it returns an
error if a thread tries to lock a mutex twice. Good for
debugging.
Attribute Objects for mutexes
• Initialize the attributes object using function:
pthread_mutexattr_init.
• The function pthread_mutexattr_settype_np can be used for
setting the type of mutex specified by the mutex attributes object.
pthread_mutexattr_settype_np (
pthread_mutexattr_t *attr,
int type);
• Here, type specifies the type of the mutex and can take one of:
– PTHREAD_MUTEX_NORMAL_NP
– PTHREAD_MUTEX_RECURSIVE_NP
– PTHREAD_MUTEX_ERRORCHECK_NP
Thread cancellation
• Sometimes it is necessary to cancel thread(s)threads which evaluate different strategies in
parallel need to go after best is chosen, e.g.
chess
• Use pthread_cancel (pthread_t thread);
• Idea-a thread can cancel itself or other threads
Composite Synchronization Structures
• Look at Useful “higher-level” constructs
• Useful because they meet commonly occurring
needs
• Constructs built by combining basic synch
structures
• Look at
– Read-write locks
– Barriers
Read-write locks
• Think of readers and writers in a critical
section
• Model-many readers, few writers
• Idea-let the readers read concurrently, but
serialize the writers
• Read lock is granted if there are only readers
• If there is a write lock (or queued writes), thread
does a condition_wait
Read write locks
• Read write locks based on a data structure
mylib_rwlock_t
containing the following
– a count of the number of readers,
– the writer (a 0/1 integer specifying whether a writer is
present),
– a condition variable readers_proceed that is
signaled when readers can proceed,
– a condition variable writer_proceed that is
signaled when one of the writers can proceed,
– a count pending_writers of pending writers, and
– a mutex read_write_lock associated with the
shared data structure
Read-write locks II
• Mylib_rwlock_rlock first checks for a write lock or
if there are writers queueing. If so, then it does
a condition_wait on readers_proceed, else it
grants read lock and increments the count of
readers
• Mylib_rwlock_wlock checks for readers or
writers-if so, it increments the count of writers
and does a condition_wait on writer_proceed,
else it grants the lock
• Mylib_rwlock_unlock
Read-write locks
typedef struct {
int readers;
int writer;
pthread_cond_t readers_proceed;
pthread_cond_t writer_proceed;
int pending_writers;
pthread_mutex_t read_write_lock;
} mylib_rwlock_t;
void mylib_rwlock_init (mylib_rwlock_t *l) {
l -> readers = l -> writer = l -> pending_writers
= 0;
pthread_mutex_init(&(l -> read_write_lock), NULL);
pthread_cond_init(&(l -> readers_proceed), NULL);
pthread_cond_init(&(l -> writer_proceed), NULL);
}
Read-write locks code II
void mylib_rwlock_rlock(mylib_rwlock_t *l) {
/* if there is a write lock or pending writers, perform
condition wait.. else increment count of readers and
grant read lock */
pthread_mutex_lock(&(l -> read_write_lock));
while ((l -> pending_writers > 0) || (l -> writer > 0))
pthread_cond_wait(&(l -> readers_proceed),
&(l -> read_write_lock));
l -> readers ++;
pthread_mutex_unlock(&(l -> read_write_lock));
}
Read-write locks code III
void mylib_rwlock_wlock(mylib_rwlock_t *l) {
/* if there are readers or writers, increment pending writers
count and wait. On being woken, decrement pending writers
count and increment writer count */
pthread_mutex_lock(&(l -> read_write_lock));
while ((l -> writer > 0) || (l -> readers > 0)) {
l -> pending_writers ++;
pthread_cond_wait(&(l -> writer_proceed),
&(l -> read_write_lock));
}
l -> pending_writers --;
l -> writer ++;
pthread_mutex_unlock(&(l -> read_write_lock));
}
Barriers
• Can be implemented using counter,
mutex,condition variable.
• Use an integer to keep track of the number
of threads which reach the barrier
• If count<number of threads, threads execute
condition wait
• If count==number of threads, last thread wakes up
all the threads using a condition broadcast
Barrier code
typedef struct {
pthread_mutex_t count_lock;
pthread_cond_t ok_to_proceed;
int count;
} mylib_barrier_t;
void mylib_init_barrier(mylib_barrier_t *b) {
b -> count = 0;
pthread_mutex_init(&(b -> count_lock), NULL);
pthread_cond_init(&(b -> ok_to_proceed), NULL);
}
Barrier code II
void mylib_barrier (mylib_barrier_t *b, int
num_threads) {
pthread_mutex_lock(&(b -> count_lock));
b -> count ++;
if (b -> count == num_threads) {
b -> count = 0;
pthread_cond_broadcast(&(b -> ok_to_proceed));
}
else
while (pthread_cond_wait(&(b -> ok_to_proceed),
&(b -> count_lock)) != 0);
pthread_mutex_unlock(&(b -> count_lock));
}
Barrier thoughts
• Lower bound on execution time is O(n), since
broadcast message follows queue
• Speed things up by using a tree to implement
the barriers
• Pair threads up so that each pair shares a single condition
variable mutex pair
• A designated member of the pair waits for both threads to
arrive at the barrier
• Continue until there is one thread
• Releasing threads requires signaling n/2 condition variables
Performance on SGI Origin 2000
Designing asynchronous programsthings that go wrong
• Thread T1 creates thread T2 .It wants to
put data to send to T2 in global memory
T1 gets switched after creating T2 but
before placing data.T2 might look for data
and get garbage.
• T1 creates T2, which is on T1’s stack. T1
passes data to T1, but finishes before T2
is scheduled. Stack is destroyed and T2
never gets its data.