Transcript ppt

A Methodology for Implementing
Highly Concurrent Data Objects
by Maurice Herlihy
Slides by Vincent Rayappa
Introduction
• Objective: Describe a methodology of
transforming sequential data structures into
concurrent ones.
– Use LL/SC instructions to accomplish this.
• Why LL/SC?
– Universally applicable to transform data structures
into concurrent ones
– Easier than using CAS.
• A lot of modern architectures support LL/SC:
PPC, ARM, MIPS etc.
– Not on x86?
2
Example of LL/SC
• PPC has load-linked (lwarx) and store cond (stwcx) instructions.
• Example from http://www.ibm.com/developerworks/library/pa-atom/
asm
long
// <- Zero on failure, one on success (r3).
AtomicStore(
long prev,
// -> Previous value (r3).
long next,
// -> New value (r4).
void *addr ) // -> Location to update (r5).
{
retry:
lwarx r6, 0, r5 // current = *addr;
cmpw
r6, r3
// if( current != prev )
bne
fail
// goto fail;
stwcx. r4, 0, r5 // if( reservation == addr ) *addr = next;
bneretry
// else goto retry;
li
r3, 1
// Return true.
blr
// We're outta here.
fail:
stwcx. r6, 0, r5 // Clear reservation.
li
r3, 0
// Return false.
blr
// We're outta here.
}
3
General Methodology
• Programmer provides a non-concurrent
implementation of data structure (with
restrictions)
• By applying transformation techniques and
memory management steps the data structure is
made concurrent.
• Small vs. big objects:
– Small: efficient to whole data structure from one
memory region to another
– Big: inefficient to copy whole data structure.
4
Small Object Transformation
M’ (copied by Process B & Modified)
Process B
Process A
SC
LL
SC
LL
Fails!
M
M’’ (copied by Process A & Modified)
• This synchronization is non-blocking because some
process makes progress at any given time.
– Spurious failures: no-progress
• Restrictions:
– Operations must be free of side-effects other than changing
memory block a process owns
– Operations must be well-defined for all legal states of object.
5
Small Object Memory Management
• Each process owns memory block big
enough to copy data structure
– When version pointer successfully updated,
process releases ownership of new block and
acquires ownership of old block.
• Since only once process can swing the
pointer each block has well defined owner.
6
Comparison to Type-stable
memory
• TSM: value for tstable was left undefined.
• Here we have clearer management (and
recycling of memory).
– Given up ownership of ‘new version’ when SC
succeeds
– Acquire ownership of ‘old version’ when SC
succeeds.
7
Race condition
• Stale read are still possible. In previous example:
– Processes A and B read pointer to M
– Process A updates versions from M to M’.
– Now A owns M and as part of next operation can use
it to copy/update contents from current version.
– If Process B (because it is slow) is still reading M, it
can see incomplete edits that A in now doing!
• Operations on stale data can cause unpredictable
behavior
– Violates the ‘Operations must be well-defined …’
restriction.
• Prevent operations based on stale data by
validating data before using it.
8
Validating data
• Use counters check[0] & check[1]
– use 32-bit integers, making validation robust
• Modify: check[0]++ -> modify -> check[1]++
• Copy: read check[1], copy, read check[0].
check[0] == check[1] ?.
– Yes: copy is consistent
– No: we are reading edits in progress i.e. edits in
progress
• Can also be implemented in hardware
– Not clear modern architectures support this.
9
Example: Priority Queue
• A binary tree where a node have a value
greater than both the subtrees (max.
priority queue)
• Operations supported
– Peak at Max (or min) value
– Extract max value
– Insert new value
• Priority queues implemented via heap
data-structure encoded as an array.
10
typedef struct {
pqueue_type version;
unsigned check[2];
} Pqueue_type;
red – sequential object
blue – concurrent object
static Pqueue_type *new_pqueue;
Q- Shared global
int Pqueue_deq(Pqueue_type **Q){
Pqueue_type *old_pqueue; /* concurrent object */
pqueue_type *old_version, *new_version; /* seq object */
int result;
unsigned first, last;
while (1) {
Calls LL
old_pqueue = load_linked(Q);
Concurrent read of old_pqueue/old_version possible
old_version = &old_pqueue>version;
new_version = &new_pqueue>version; A process ‘owns’ new_version; others can
still read new_version
first = old_pqueue>check[1];
copy(old_version, new_version);
Guard against concurrent read of new_version
last = old_pqueue>check[0];
Consistency check
if (first == last) {
result = pqueue_deq(new_version);
if (store_conditional(Q, new_version)) break;
Calls SC
} /* if */
} /* while */
new_pqueue = old_pqueue;
return result;
} /* Pqueue_deq */
11
Experimental Results
• Time to enqueue and dequeue
into 16-element p-queue.
• Naïve retry was caused a lot of
failure for enqueue since it was
slower than dequeue
• Added exponential back-off on
failure.
2 20
• Number of operations =
n
• n is # of processes
12
Large Objects
• Large objects cannot be copied fully
• Sequential operations create logically distinct
version of objects
– As opposed to changing them in place
– Logically – not physically – distinct since
programmer is free to share memory between old and
new versions.
• Concurrent operation:
– Read pointer using LL
– Use sequential operation to create new version
– Swing pointer to new version using SC
13
Large Object Memory Management
• Memory block required per sequential
operation not fixed
– Depends how much sharing happens between
old and new versions
• Processes own block of memory.
– If they run out of block, they might have to
borrow blocks from a common pool.
• Memory managed via recoverable set data
structure.
14
Recoverable Set
• Blocks in one of three states:
– Committed, Allocated, Freed
Block B1
Committed
set_alloc
set_commit
Block B1
Freed
set_free
Block B1
Allocated
15
Large Object Performance
• Same experiment as
with small objects
except that the heap
has 512 instead of 16
elements
16
Comments
• This paper shows an method for transforming sequential
data structures into concurrent ones with reasonable
performance.
– Transformed program performs with 2x of spin-lock with
backoff
– Reasonable depends on need. Massalin, Michael, Scott don’t
think performance is good enough.
• Reasoning about NBS not easy
– Reasoning about which memory locations are accessed
concurrently and which are not is difficult.
– With locks, inside critical sections you know you do not have
concurrent access.
• Automated transformation
– If the transformation is done automatically by compiler or preprocessor, it would be easy to use NBS.
– Perhaps it might even be worth the performance penalty.
17