Transcript Slides

User-level Techniques
on uniprocessors and
SMPs
Focusing on Thread
Scheduling
CSM211
Kurt Debattista
Literature
• Any good operating system book –
e.g. the “dinosaur book” (Silberschatz
et al.)
• Vahalia U. Unix Internals. Prentice
Hall, 1996
• Moores J. CCSP – A portable CSP-based
run-time system supporting C
and occam. Volume 57, Concurrent
Systems Engineering Series, pages
147-168, IOS Press, April 1999
• Systems software research group
http://www.cs.um.edu.mt/~ssrg
• Literature on thread scheduling on
uniprocessors, SMPs, and avoiding
blocking system calls (Vella, Borg,
Cordina, Debattista)
Overview
• Processes
• Threads
– Kernel threads
– User threads
• User-level Memory Management
• User-level Thread Scheduling
• Multiprocessor hardware
• SMP synchronisation
• SMP thread schedulers
Time
Unit          Seconds
Second        1
Millisecond   0.001
Microsecond   0.000001
Nanosecond    0.000000001
Processes
• UNIX-like operating systems
provide for concurrency by
means of time-slicing for long-lived,
infrequently communicating processes
– Concurrency is usually an illusion
(uniprocessor)
– Parallelism (SMPs, distributed
systems)
• Within a single application?
Processes – Vertical
switch
• Every time we enter the kernel
(system call) we incur a vertical
switch
– Save current context
– Change to kernel stack
– Change back
Processes – Horizontal
switch
• Context switch from one
process to another –
horizontal switch
– Enter kernel (vertical switch)
– Dispatch next process
– Change memory protection
boundaries
– Restore new process context
Processes - Creation
• Creating a new process
– Enter kernel (vertical switch)
– Allocate memory for all process
structures
– Init new tables
– Update file tables
– Copy parent process context
• All operations on processes are in
the order of hundreds of
microseconds, sometimes
milliseconds
Multithreading
• Programming style used to
represent concurrency within a
single application
• A thread is an independent
instance of execution of a
program represented by a PC, a
register set (context) and a stack
Multithreading (2)
• In multithreaded applications threads
co-exist in the same address space
• A traditional process could be
considered an application composed
of a single thread
• Example – web server
– Share the same data between threads
– Spawn threads
– Communicate through shared memory
Multithreading (3)
• Two types of threads
1. Kernel-level threads
2. User-level threads
Kernel threads
• Kernel threads are threads that
the kernel is aware of
• The kernel is responsible for the
creation of the threads and
schedules them just like any
other process
Kernel threads (2)
Advantages
1. Concurrency within a single
application
2. No memory boundaries
3. IPC through shared memory,
avoiding kernel access
4. Kernel interaction
Disadvantages
1. Thread creation (vertical
switch)
2. Context switch (horizontal
switch)
3. Kernel interaction
Kernel threads (3)
• Thread management in the
order of tens of microseconds
(sometimes hundreds)
– Creation
– Context switch
– Kernel based IPC
• Fast shared memory IPC in the
order of hundreds (even tens) of
nanoseconds
User threads
• Kernel is unaware of threads
• All scheduling takes place at the
user level
• All scheduler data structures
exist in the user-level address
space
User-level Thread
Scheduling
User-level thread
scheduling
• Library
• Pre-emption and cooperative
multithreading
– Pre-emption (like UNIX)
– Cooperative (think Windows 3.1)
• Performance in the order of tens
of nanoseconds
User-level thread library
• Application is linked with a
thread library that manages
threads at run-time
• Library launches the main()
function and is responsible for
all thread management
Fast multithreaded C
library
• Scheduler initialisation and
shutdown
• Structures for scheduler
– Run queue(s)
– Thread descriptor
– Communication constructs
• Provide functions for thread
creation, execution, IPC, yield,
destroy
Scheduler initialisation
• The traditional main() function is
part of the library; the application
programmer uses cthread_main()
as the main function
• Initialise all scheduler data
structures
• cthread_main() is in itself a
thread
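A minimal sketch of the library-owned main(), assuming
illustrative names (only cthread_main() comes from these
slides):

void scheduler_init(void);
void scheduler_shutdown(void);
void schedule(void);
void cthread_create(void (*fn)(int, char **), int argc, char **argv);
void cthread_main(int argc, char **argv); /* written by the application */

/* The library, not the application, owns main() */
int main(int argc, char **argv)
{
    scheduler_init();                         /* run queue(s), descriptor pools */
    cthread_create(cthread_main, argc, argv); /* cthread_main() is itself a thread */
    schedule();                               /* dispatch until no runnable threads */
    scheduler_shutdown();
    return 0;
}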
Scheduler structures
• Thread run queue
– Fast FIFO queue
– Priority based scheduling
• Multiple queues
• Priority queue
• Thread descriptor
– PC and set of registers (jmp_buf)
– Stack (pointer to chunk of
memory)
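A possible layout for these structures, as a sketch; only the
jmp_buf context and the stack pointer are stated on the slide,
the remaining field names are assumptions:

#include <setjmp.h>

typedef struct cthread {
    jmp_buf         context;  /* PC and register set, saved by setjmp() */
    char           *stack;    /* pointer to this thread's chunk of stack memory */
    int             priority; /* used when scheduling is priority-based */
    struct cthread *next;     /* link in the FIFO run queue */
} cthread_t;

typedef struct {
    cthread_t *head;          /* threads are dispatched from the head */
    cthread_t *tail;          /* new/yielding threads are enqueued at the tail */
} runqueue_t;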
Communication
structures
• Structures for communications
– Semaphores
– Channels
– Barrier
– others
Thread API
• Provide functions for threads
(usage sketched below)
– Initialisation – cthread_init()
– Execution – cthread_run()
– Yield – cthread_yield()
– Barrier – cthread_join()
– Termination – automatic
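Hypothetical usage of this API; the exact signatures are not
given on the slides, so those below are assumptions:

typedef struct cthread cthread_t;
cthread_t *cthread_init(void (*fn)(void *), void *arg);
void cthread_run(cthread_t *t);
void cthread_yield(void);
void cthread_join(void);

void worker(void *arg)
{
    /* ... some work ... */
    cthread_yield();          /* cooperatively hand over the CPU */
    /* ... more work; returning terminates the thread automatically */
}

void cthread_main(int argc, char **argv)
{
    cthread_t *t = cthread_init(worker, NULL); /* create */
    cthread_run(t);                            /* place on the run queue */
    cthread_join();                            /* barrier: wait for children */
}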
Context switch
• Use functions setjmp() and
longjmp() to save and restore
context (jmp_buf)
• setjmp() saves the current
context
• longjmp() restores the context
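A minimal sketch of such a switch; strictly speaking,
longjmp()ing into another thread's saved context relies on
implementation-specific jmp_buf behaviour, which is exactly
what these schedulers exploit:

#include <setjmp.h>

void switch_to(jmp_buf *current, jmp_buf *next)
{
    if (setjmp(*current) == 0) /* save our context; setjmp() returns 0 here */
        longjmp(*next, 1);     /* restore the next thread; does not return */
    /* a non-zero return means another thread longjmp()ed back to us */
}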
User-level Thread
Scheduling
• Thread scheduling is abstracted from
the kernel
• Thread management occurs at the
user level
– No expensive system calls
• Mini-kernel on top of the OS kernel
• Ideal for fine-grained multithreading
User-level Thread
Schedulers
• Successful user-level thread
schedulers exist in the form of
– CCSP
– KRoC
– MESH
– Mach Cthreads
– smash
– Sun OS threads
Thread Management Issues
• Interaction with the operating system
– Blocking kernel threads
– Multiplexing kernel threads
– Scheduler activations (Anderson)
– System call wrappers
– See Borg’s thesis for more info (SSRG
homepage)
• Thread co-operation
– Automatic yield insertion (Barnes)
• Active context switching (Moores)
Blocking Kernel Threads
• In single kernel-threaded
schedulers, when the kernel
thread blocks on a blocking
system call the entire scheduler
blocks
• Multiplexing kernel threads
– Reduces the problem (though it
does not eliminate it)
– Increases the amount of
horizontal switching
Blocking Kernel Threads
(2)
• Partial solutions
– Allocate kernel thread for particular
functions (e.g. Keyboard I/O)
– Horizontal switching
• System Call Wrappers
– Wrap all (or required) blocking system
calls with wrappers that launch a
separate kernel thread
– Require wrapper for each call
– Ideal when wrappers already exist (e.g.
occam)
– Incurs horizontal switching overhead
– Blocking system call might NOT block
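A sketch of the wrapper idea using read() and a POSIX kernel
thread; in a real scheduler the calling user thread would be
descheduled rather than joined, so this only illustrates the
structure:

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct read_args { int fd; void *buf; size_t count; ssize_t result; };

static void *read_worker(void *p)
{
    struct read_args *a = p;
    a->result = read(a->fd, a->buf, a->count); /* may block this kernel thread */
    return NULL;
}

ssize_t wrapped_read(int fd, void *buf, size_t count)
{
    struct read_args a = { fd, buf, count, -1 };
    pthread_t kt;
    pthread_create(&kt, NULL, read_worker, &a); /* separate kernel thread blocks */
    /* ...the user-level scheduler keeps running other user threads here... */
    pthread_join(kt, NULL);
    return a.result;
}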
Scheduler Activations
• Scheduler activations offer interaction
between user space and kernel space
• Scheduler activations are the executing
context on which threads run (like kernel
threads)
• When an activation blocks the kernel
creates a new activation and informs the
user space that an activation is blocked (an
upcall)
• Moreover a new activation is created to
continue execution
• One of the most effective solutions,
but it must be implemented at the
kernel level
• Also useful for removing the extended
spinning problem on multiprogrammed
multiprocessor systems
Scheduler Activations
(2)
Web Server Example
User-level Memory
Management
• Memory management can also
benefit from user-level
techniques
• Replace malloc()/free() with
faster user-level versions
– Remove system calls (e.g. sbrk())
– Remove page faults
User-level Memory
Management (2)
• Ideal for allocating/de-allocating
memory for similar data structures
– No fragmentation
– Pre-allocate a large chunk using
malloc() whenever required
– User-level data structures handle
allocation/de-allocation
• Results (see the sketch below)
– malloc()/free(): 323 ns
– ul_malloc()/ul_free(): 16 ns
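A minimal sketch of such an allocator for one fixed-size
structure: one malloc() up front, then a free list of
equal-sized slots, so the common path never enters the kernel
(sizes and names are illustrative):

#include <stdlib.h>

#define SLOT_SIZE  64
#define POOL_SLOTS 1024

static union slot { union slot *next; char bytes[SLOT_SIZE]; } *free_list;

void ul_pool_init(void)
{
    union slot *pool = malloc(POOL_SLOTS * sizeof *pool); /* one system allocation */
    for (int i = 0; i < POOL_SLOTS - 1; i++)
        pool[i].next = &pool[i + 1];                      /* chain the slots */
    pool[POOL_SLOTS - 1].next = NULL;
    free_list = pool;
}

void *ul_malloc(void)          /* pop a slot: a handful of instructions */
{
    union slot *s = free_list;
    if (s) free_list = s->next;
    return s;
}

void ul_free(void *p)          /* push the slot back onto the free list */
{
    union slot *s = p;
    s->next = free_list;
    free_list = s;
}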
User-level Memory
Management (3)
• More complex allocation/de-allocation
is possible by building a complete
user-level memory manager
– Avoid page faults (physical
memory allocation)
– Enable direct SMP support
• Results
– malloc()/free(): 323 ns
– ul_malloc()/ul_free(): ca. 100 ns
Multiprocessor Hardware
(Flynn)
• Single Instruction Single Data
(SISD)
– Uniprocessor machines
• Single Instruction Multiple Data
(SIMD)
– Array computers
• Multiple Instruction Single Data
(MISD)
– Pipelined vector processors (?)
• Multiple Instruction Multiple Data
(MIMD)
– General purpose parallel computers
Memory Models
• Uniform Memory Access (UMA)
– Each CPU has equal access to memory
and I/O devices
• Non-Uniform Memory Access
(NUMA)
– Each CPU has local memory and is
capable of accessing memory local to
other CPUs
• No Remote Memory Access
(NORMA)
– CPUs with local memory connected
over a high-speed network
UMA
NUMA
Hybrid NUMA
NORMA
Symmetric Multiprocessors
• The UMA memory model is
probably the most common
• The tightly-coupled, shared memory,
symmetric multiprocessor (SMP)
(Schimmel)
• CPUs, I/O and memory are
interconnected over a high speed bus
• All units are located at a close
physical distance from each other
Symmetric Multiprocessors
(2)
• Main memory consists of one single
global memory module
• Each CPU usually has access to
local memory in terms of a local
cache
• Memory access is symmetric
– Fair access is ensured
Caching
• Data and instruction buffer
• Compensates for the relatively slow
speed of memory compared to the processor
• Cache consistency on
multiprocessors
– Write through protocol
– Snoopy caches
• False sharing
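A sketch of the standard remedy for false sharing: pad per-CPU
data to the cache line size (64 bytes is assumed here) so that
updates by one processor do not invalidate lines used by the
others:

#define CACHE_LINE 64          /* assumed line size of the target machine */

struct padded_counter {
    volatile long count;
    char pad[CACHE_LINE - sizeof(long)]; /* keep each counter on its own line */
};

struct padded_counter per_cpu_count[8];  /* one slot per processor */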
Synchronisation
• Synchronisation primitives serve to
– provide access control to shared
resources
– provide event ordering
• Valois describes the relationship
between synchronisation methods:
– Mutual exclusion: blocking or
busy waiting
– Lock-free: non-blocking or
wait-free
Synchronisation Support
• Synchronisation on multiprocessors
relies on hardware support
• Atomic read and write instructions
• “Read-modify-write” instructions
– swap
– test and set
– compare and swap
– load linked / store conditional
– double compare and swap
• Herlihy’s hierarchy
Mutual Exclusion
• Blocking or busy-waiting
• The rule of thumb is to busy-wait if the
expected wait time is less than the time
required to block and resume a process
• In fine-grained multithreaded
environments critical sections are
small, so busy-waiting is usually
preferred
Spin locks
• Spin locks are most likely the
simplest locking primitives
• A spin lock is a variable that is in
one of two states (usually 1 or 0)
• Two operations act on a spin lock
1. Spin / acquire lock
2. Spin release
Spin Lock Implementation
Acquire spin lock
spin:
    lock               ; lock bus for btsl
    btsl lock, 0       ; bit test and set
    jnc cont           ; continue if carry is 0
    jmp spin           ; else go to spin
cont:

Release spin lock
    lock               ; lock bus for btrl
    btrl lock, 0       ; release lock
Test and Test and Set Lock
• Segall and Rudolph present a spin
lock that does not monopolise the
bus
Acquire spin lock – Segall and
Rudolph
spin:
    lock               ; lock bus for btsl
    btsl lock, 0       ; bit test and set
    jnc cont           ; continue if carry is 0
loop:
    btl lock, 0        ; test only
    jc loop            ; loop if carry is set
    jmp spin           ; else go to spin
cont:
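For comparison, a sketch of both locks in C using GCC's __sync
atomic builtins instead of hand-written btsl/btrl (not the
slides' own code):

typedef volatile int spinlock_t;

void spin_acquire(spinlock_t *l)
{
    for (;;) {
        if (__sync_lock_test_and_set(l, 1) == 0) /* atomic test-and-set */
            return;                              /* was 0: lock acquired */
        while (*l)                               /* test-and-test-and-set: */
            ;                                    /* spin on the cached copy */
    }
}

void spin_release(spinlock_t *l)
{
    __sync_lock_release(l);                      /* store 0, releasing the lock */
}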
Alternative Spin Lock
Techniques
• Anderson’s exponential back-off
algorithm
• Pre-emption safe algorithms for
multiprogrammed environments
– Elder
– Marsh
• Hybrid locking – blocking
techniques
– Ousterhout
– Lim
Lock-free Synchronisation
• Alternative form of synchronisation
which dispenses with serialising
concurrent tasks
• Lock-free algorithms rely on
– powerful atomic primitives
– careful ordering of instructions
Lock-free Properties
• Lock-free data structures may have
two properties
1. They may be non-blocking, in which
case some operation on a lock-free
structure is guaranteed to complete
in finite time
2. If all such operations are guaranteed
to complete in finite time, the data
structure is said to be wait-free
• Any structure that uses mutual
exclusion cannot be non-blocking
or wait-free
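A minimal lock-free example, assuming GCC's compare-and-swap
builtin: incrementing a shared counter without a lock. Some
thread always makes progress (non-blocking), but a given thread
may retry indefinitely, so this is not wait-free:

void lockfree_increment(volatile long *counter)
{
    long old;
    do {
        old = *counter;        /* read the current value */
    } while (!__sync_bool_compare_and_swap(counter, old, old + 1));
}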
Scheduling Strategies
• Scheduling strategies on SMPs can
be broadly divided into two
categories
1. Shared run queue schedulers, where
processors acquire threads by
accessing a shared run queue (usually
implemented as a FIFO queue)
2. Per-processor run queue schedulers,
where each processor acquires
threads from its own local run queue
(and in some cases from other run
queues)
Shared Run Queue Thread
Scheduling
Shared Run Queue Thread
Scheduling (2)
• Central run queues balance the work
load equally amongst all processors
• Threads are obtained from the shared
run queue and serviced
• Serviced threads are placed back
onto the run queue (unless they
terminate)
• Shared run queues favour the
principle of load balancing
Per Processor Run Queue
Thread Scheduling
Per Processor Run Queue
Thread Scheduling (2)
• Per processor run queue schedulers
maintain a local run queue for each
processor
• The principle of locality is favoured
• Contention is reduced because
processors reference data close to
them, thus reducing communication
(which can lead to contention)
Locality vs. Load
Balancing
• Both locality and load balancing are
intended to improve performance
• Both policies however interfere with
each other
• Per processor run queues best
represent locality based scheduling
• Shared run queues best represent
load balancing
Shared Run Queue
Scheduler Problems
• When using shared run queues
threads very often migrate across
processors, thus losing out on the
principle of locality
• Poor locality on shared run queue
schedulers also means increased
contention, since processors always
access the shared run queue
– especially for fine-grained multithreading
• Existing results (SMP KRoC and
SMP MESH) are disappointing for
fine-grained applications
Per Processor Run Queue
Scheduler Problems
• Per processor run queue schedulers
can run into load imbalance
problems
• Threads are usually placed close to
their parent
• Diagrams show the possible load
imbalances compared with a shared
run queue
Shared Run Queue Load
Balancing
Per Processor Run Queue
Possible Load Imbalances
Locality vs. Load
Balancing (2)
• On NORMAs, the high cost of
migrating threads for load balancing
resolves the conflict in favour of
locality (Eager et al.)
• On NUMAs, Bellosa also rules in
favour of locality
• On UMAs, due to the low cost of
thread migration, load balancing is
usually preferred
Locality vs. Load
Balancing (3)
• This might indeed be true for
coarse-grained threads (Gupta et al.)
– Less contention for shared resources
– Locality in terms of cache affinity
scheduling is still catered for (since
processes are long-lived and do not
synchronise)
• For fine-grained multithreading the
situation is different (Markatos and
LeBlanc, Anderson)
– Contention is increased
– Caching is not used
Locality vs. Load
Balancing (4)
• Results by Anderson indicate that
per-processor scheduling has its
advantages
• Results of shared run queue based
fine-grained thread schedulers are
relatively disappointing
• Markatos and LeBlanc point out that
the growing gap between CPU speeds
and memory speeds will favour
locality
Shared Run Queue
Scheduler’s Run Queues
• When scheduling using a shared run
queue, the choice of run queue
implementation is a fundamental
issue
• Michael and Scott survey the fastest
concurrent queues
– Michael and Scott’s non-blocking
queue
– Valois’ non-blocking queue
– Michael and Scott’s dual-lock queue
(including pre-emption safe locks)
– Mellor-Crummey’s lock-free queue
– Traditional single lock queue
Michael and Scott’s Dual-Lock Concurrent Queue
• Head and tail are protected by means
of separate locks
• Enqueue and dequeue operations can
occur concurrently
• The algorithm is possible because of
the use of a dummy head node, with
nodes acting as a vehicle to transport
data (avoiding head and tail interaction)
Michael and Scott’s Dual-Lock Concurrent Queue (2)
typedef struct {
    node *head;      /* protected by headlock */
    node *tail;      /* protected by taillock */
    int   taillock;  /* spin lock guarding the tail */
    int   headlock;  /* spin lock guarding the head */
} queue;

void enqueue(queue *q, data *d) {
    node *n = node_init(d);    /* new node holding d, next = NULL */

    spinlock(&q->taillock);
    q->tail->next = n;         /* link behind the current tail */
    q->tail = n;
    spinrelease(&q->taillock);
}
Michael and Scott’s Dual-Lock Concurrent Queue (3)
data *dequeue(queue *q) {
    node *n, *new_head;
    data *d;

    spinlock(&q->headlock);
    n = q->head;               /* n is the current dummy node */
    new_head = n->next;
    if (new_head == NULL) {    /* empty queue */
        spinrelease(&q->headlock);
        return NULL;
    }
    d = new_head->data;
    q->head = new_head;        /* new_head becomes the new dummy */
    spinrelease(&q->headlock);
    free(n);                   /* retire the old dummy node */
    return d;
}
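For completeness, a sketch of the pieces the slides leave out:
the node type and an initialiser that installs the dummy node
both locks start from (node_init() is assumed to allocate a
node holding d with next set to NULL; the data type is opaque):

#include <stdlib.h>

typedef struct data data;      /* opaque payload type */

typedef struct node {
    struct node *next;
    data        *data;         /* NULL in the dummy node */
} node;

queue *queue_init(void)
{
    queue *q     = malloc(sizeof *q);
    node  *dummy = calloc(1, sizeof *dummy); /* dummy carries no data */
    q->head = q->tail = dummy; /* both ends start at the dummy node */
    q->headlock = q->taillock = 0;
    return q;
}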
Per Processor Run Queue
Scheduler - Load
Balancing
• Per processor run queue schedulers
need methods of load balancing to
avoid strong load imbalances
• Threads could be mapped statically
to processors at compile-time (only
useful if all threads have the same
longevity)
• Threads can be migrated across run
queues or across common pools of
threads
Multiprocessor Batching