Transcript ppt

The Synergy Between Non-blocking
Synchronization and Operating System
Structure
By Michael Greenwald and David Cheriton
Presented by Jonathan Walpole
CS510 – Concurrent Systems
1
The Cache Kernel





Yet Another Attempt at a µ-kernel
Minimal privileged mode code: Caching model of
kernel functionality
OS functionality in user-mode class libraries
allowing application-customizability of OS
Shared memory and signal-based communication are
the only communication mechanisms
Goal: scalable, robust, flexible operating system
design
CS510 – Concurrent Systems
2
Synergy of NBS and kernel structure

Claim: NBS and Good Design go together in the
Cache Kernel design and implementation
o
o
NBS allows better OS structuring
Good OS structure simplifies NBS
CS510 – Concurrent Systems
3
Non-blocking Synchronization

Basic idea of NBS
o
o
o
o

Associate version number with data structure
Read version number at start of critical section
Atomically check and increment version number at end of
critical section at the same time as applying the update(s),
using a Double Compare and Swap (DCAS) instruction.
Roll back and retry on failure (i.e., if version number
changed while you were in the critical section).
An optimistic synchronization technique similar to
that of the Synthesis Kernel
CS510 – Concurrent Systems
4
Example: Deletion from Linked List
do {
retry:
backoffIfNeeded();
version = list->version;
/* Read version # */
for (p = list->head;(p->next != elt); p = p->next) {
if (p == NULL) {
/* Not found */
if (version != list->version)
{ goto retry; }
/* List changed */
return NULL;
/* Really not found */
}
}
} while( !DCAS(
&(list->version),
&(p->next),
version,
elt,
version+1,
elt->next
)
)
CS510 – Concurrent Systems
5
Double Compare and Swap
int DCAS(
int *addr1,
int old1,
int new1,
int *addr2,
int old2,
int new2)
{
<begin atomic>
if ((*addr1 == old1) && (*addr2 == old2)) {
*addr1 = new1;
*addr2 = new2;
return(TRUE);
} else {
return(FALSE);
}
<end atomic>
}
CS510 – Concurrent Systems
6
Hardware Support for DCAS


Could be supported through extension to LL/SC
(load-linked/store-conditional) sequence
LLP/SCP
o
LLP (load-linked-pipelined)
•
•
o
SCP (store-conditional-pipelined)
•
o
Load and link a second address after an earlier LL
LLP is linked to the following SCP
Store depends on both previous LLP and LL not having been
invalidated in this CPU’s cache
If LLP/SCP fails, so does enclosing LL/SC
CS510 – Concurrent Systems
7
Software Support for DCAS

Basic idea: make DCAS operations atomic by putting
them in a critical section protected by a lock!
o
o

Must make sure that delayed lock holders release
the lock and roll back
o

OS manages the lock (how many locks do you need?)
Presumably, DCAS is a system call
Ensures that DCAS still has non-blocking properties
Problems implementing this efficiently
o
o
How to support readers?
How to avoid lock contention?
CS510 – Concurrent Systems
8
Back to the List Example
do {
retry:
backoffIfNeeded();
version = list->version;
/* Read version # */
for (p = list->head;(p->next != elt); p = p->next) {
if (p == NULL) {
/* Not found */
if (version != list->version)
{ goto retry; }
/* List changed */
return NULL;
/* Really not found */
}
}
} while( !DCAS(
&(list->version),
&(p->next),
version,
elt,
version+1,
elt->next
)
)
CS510 – Concurrent Systems
9
Problem …

What happens if another thread deletes an element
while we are traversing the list?
o
o
o
o
o

We may end up with an invalid pointer in p
What if we traverse it into the free pool?
What if the memory has been reused?
What if we end up in a different data structure?
What if the memory is reused for a different type?
How can we prevent this kind of reader hijacking?
o
Or at least, how can we prevent it from crashing the
reader?
CS510 – Concurrent Systems
10
Type-stable memory management (TSM)

Descriptor that is allocated as type T is guaranteed
to remain of type T for at least tstable
o
A generalization of existing technique – allocation pools:
•
o
o
o
e.g. process-descriptors are statically allocated at system init,
and are type-stable for lifetime of system
But is this sufficient to ensure that read-side code will
reach the DCAS in the presence of concurrent
modification?
And how long is tstable ?
When is it safe to reuse memory?
CS510 – Concurrent Systems
11
Other solutions to reader hijacking

If memory is reclaimed immediately upon updates,
readers must protect themselves from hijacking
o
o
o
But how?
Must ensure that a pointer is valid before dereferencing it
Additional checks at the end of the read side critical
section are required to ensure no concurrent write access
took place
CS510 – Concurrent Systems
12
Claim: TSM aids NBS

Type stability ensures safety of pointer
dereferences.
o
Without TSM, delete example is too big to fit on slide
•
o
o
And very expensive to execute
Need to check for changes on each loop iteration
TSM makes NBS code simpler and faster
CS510 – Concurrent Systems
13
Other benefits of NBS in Cache Kernel


Signals are only form of IPC in Cache Kernel
NBS simplifies synchronization in signal handlers
o
Makes it easier to write efficient signal-safe code
CS510 – Concurrent Systems
14
Contention Minimizing Data Structures (CMDS)

Locality-based structuring used in Cache kernel:
o
o
o

Replication: Per-processor structures (ie. run queue)
Hierarchical data structures with read-mostly high levels
(ie. hash tables with per bucket synchronization)
Cache block alignment of descriptors to minimize false
sharing and improve cache efficiency
Well-known techniques used in other systems
(Tornado, Synthesis, …)
CS510 – Concurrent Systems
15
Benefits of CMDS

Minimizes logical and physical contention
o
o


Reduces lock conflicts/convoys in lock-based system
Reduces synchronization contention with NBS
o

Minimizes (memory) consistency overhead
Minimizes false sharing
fewer retries from conflict at point of DCAS
CMDS is good for locks, cache consistency and NBS!
o
NBS needs CMDS
CS510 – Concurrent Systems
16
Minimizing the Window of Inconsistency


Delaying writes, and grouping them together at the
end minimizes the window of inconsistency
Advantages of a small window of inconsistency
o
o
o

Less probability of failure leaving inconsistency
Preemption-safe: easy to back-out of critical section
Reduces lock hold time & contention (in lock-based systems)
Its good for system design whether you use blocking
or non-blocking synchronization!
CS510 – Concurrent Systems
17
How Does NBS Affect Contention?


What is the relationship between window of
inconsistency and probability of conflict
(contention)?
NBS approaches have a critical section just like
locking approaches
o
o
NBS critical section is defined by the scope of the version
number
This is NOT the same as the window of inconsistency!
CS510 – Concurrent Systems
18
Priority Inversion Issues

NBS allows synchronization to be subordinate to
scheduling
o
o

It avoids priority inversion problem of locks
Also, page fault,I/O, and other blocking operations
Does the highest priority process always make
progress?
o
Is there a retry-based equivalent to priority inversion?
CS510 – Concurrent Systems
19
Why NBS is good for OS structure

Fail-stop safe: OS class libraries tolerant of user
threads being terminated.
o


Most Cache Kernel OS functionality implemented in usermode class libraries
Portable: same code on uniprocessor, multiprocessor,
and signal handlers
Deadlock free (almost)
o
See examples of deletion and insertion without going
through reclamation
CS510 – Concurrent Systems
20
Implementing non-blocking Synchronization

Basic approach
o
o
o

Read version number
Rely on TSM for type safety of pointers
Increment version number and check with every
modification (abort if changed).
Straightforward transformation from locking
o
o
o
… so long as you have DCAS and can tolerate TSM
Replace acquire with read of version number
Replace release with DCAS …
CS510 – Concurrent Systems
21
Implementing non-blocking Synchronization (cont.)

Variants of the basic approach:
o
o

N reads, 1 write: No backout
2 reads, 2 writes: No version number
In the Cache kernel, every case of synchronization
falls into the special cases
CS510 – Concurrent Systems
22
Complexity and Correctness

DCAS reduces size of algorithms (lines of code) by
75% compared to CAS, for linked lists, queues and
stacks
o

Special case uses of DCAS reduce complexity further
Relatively straightforward transformation from
locking
o
Similar code size
CS510 – Concurrent Systems
23
Performance Issues


Simulation-based study
With non-preemptive scheduling:
o
o

With preemptive scheduling:
o

DCAS-based NBS almost as fast as spin-locks
CAS-based NBS slower
DCAS and CAS-based NBS better than spin-locks
Note, this was using mid 1990’s CPUs
o
o
The cost of CAS is much higher on today’s CPUs
What are the implications of this?
CS510 – Concurrent Systems
24
Hardware-Based Optimizations


A hardware-based optimization using advisory
locking
Cache based advisory locking can help avoid
contention in the form of “useless parallelism”
o
o
o
Instead of just checking at the end and forcing retries,
check at the beginning and back off if retry is inevitable
Uses Cload instruction (conditional load) which succeeds
only if the location does not have an advisory lock set, and
sets the lock on success
Initially load version # with Cload, wait and retry on failure
(seems like using TSL in a spin-lock?)
CS510 – Concurrent Systems
25
Conclusions

Claim: “Good OS structure” can support Non-blocking
synchronization
o
o
o

Claim: Non-blocking synchronization can support convenient OS
structure
o
o
o

Type-Stable Mem. Mgmt (TSM)
Data Structures that Minimize Contention (CMDS)
Minimizing the window of inconsistency
Avoids deadlock; allows signals as sole IPC mechanism
Fault tolerant; kernel functionality can be moved to user space
Performance; isolates scheduling from synchronization
Claim: Strong synergy between non-blocking synchronization &
good OS structure.
CS510 – Concurrent Systems
26
Advantages of Non-blocking Synchronization

Non-blocking:
o

Portability:
o

Minimizes interference between synchronization and
process scheduling
Recovery:
o

same code on uni and multiprocessors and in signal handlers
Performance:
o

No deadlock! Especially useful for signal handlers.
Insulation from process failures
So why isn’t it universally deployed?
CS510 – Concurrent Systems
27
Obstacles to deployment

Complexity:
o

Correctness:
o
o

confusing to design and write efficient algorithms,
especially in the absence of a DCAS primitive
Its hard to be convinced that there are no subtle bugs,
especially on weak memory consistency architectures.
Is TSM enough to preserve critical section invariants?
Performance:
o
o
Poor performance under contention due to excessive retry
high overhead due to expense of CAS on modern CPUs
CS510 – Concurrent Systems
28