Concurrent Executions on Relaxed Memory Models
Challenges & Opportunities for Software Model Checking
Rajeev Alur
University of Pennsylvania
Joint work with Sebastian Burckhardt and Milo Martin
SPIN Workshop, July 2007
Model Checking Concurrent Software
Multi-threaded software + shared-memory multiprocessor
  → concurrent executions → bugs
Concurrency on Multiprocessors
Initially x = y = 0

  thread 1        thread 2
  x = 1           print y    → 1
  y = 1           print x    → 0

Output not consistent with any interleaved execution!
  can be the result of out-of-order stores
  can be the result of out-of-order loads
  (out-of-order execution improves performance)
Architectures with Weak Memory Models
A modern multiprocessor does not enforce global ordering of
all instructions for performance reasons
Each processor has pipelined architecture and executes multiple
instructions simultaneously
Each processor has a local cache, and loads/stores to local cache
become visible to other processors and shared memory at
different times
Lamport (1979): Sequential consistency semantics for correctness of
multiprocessor shared memory (like interleaving)
Considered too limiting, and many “relaxations” proposed
In theory: TSO, RMO, Relaxed …
In practice: Alpha, Intel IA-32, IBM 370, Sun SPARC, PowerPC …
Active research area in computer architecture
Programming with Weak Memory Models
Concurrent programming is already hard, shouldn’t the effects
of weaker models be hidden from the programmer?
Mostly yes …
Safe programming using extensive use of synchronization
primitives
Use locks for every access to shared data
Compilers use memory fences to enforce ordering
Not always …
Non-blocking data structures
Highly optimized library code for concurrency
Code for lock/unlock instructions
Non-blocking Queue (MS’96)
boolean_t dequeue(queue_t *queue, value_t *pvalue)
{
    node_t *head;
    node_t *tail;
    node_t *next;

    while (true) {
        head = queue->head;
        tail = queue->tail;
        next = head->next;
        if (head == queue->head) {
            if (head == tail) {
                if (next == 0)
                    return false;
                cas(&queue->tail, (uint32) tail, (uint32) next);
            } else {
                *pvalue = next->value;
                if (cas(&queue->head, (uint32) head, (uint32) next))
                    break;
            }
        }
    }
    delete_node(head);
    return true;
}
Software Model Checking for
Concurrent Code on Multiprocessors
Why?: Real bugs in real code
Opportunities
10s to 100s of lines of low-level library C code
Hard to design and verify -> buggy
Effects of weak memory models, fences …
Challenges
Lots of behaviors possible: high level of concurrency
How to formalize and reason about weak memory models?
Talk Outline
Motivation
Relaxed Memory Models
CheckFence: Analysis tool for Concurrent Data Types
Hierarchy of Abstractions
Programs (multi-threaded)
-- Synchronization primitives: locks, atomic blocks
-- Library primitives (e.g. shared queues)
Language level memory model
Assembly code
-- Synchronization primitives: compare&swap, LL/SC
-- Loads/stores from shared memory
Architecture level memory model
Hardware
-- multiple processors
-- write buffers, caches, communication bus …
Shared Memory Consistency Models
Specifies restrictions on what values a read from shared
memory can return
Program Order: x <p y if x and y are instructions belonging to
the same thread and x appears before y
Sequential Consistency (Lamport 79): There exists a global
order < of all accesses such that
< is consistent with the program order <p
Each read returns the value of the most recent (according to <) write to
the same location (or the initial value, if no such write exists)
Clean abstraction for programmers, but high implementation cost
Effect of Memory Model
Initially flag1 = flag2 = 0

  thread 1               thread 2
  1. flag1 = 1;          1. flag2 = 1;
  2. if (flag2 == 0)     2. if (flag1 == 0)
         crit. sect.            crit. sect.

Ensures mutual exclusion if the architecture supports SC memory
Most architectures do not enforce ordering of accesses to different
memory locations
Does not ensure mutual exclusion under weaker models
Ordering can be enforced using “fence” instructions
  Insert a MEMBAR between lines 1 and 2 to ensure mutual exclusion
Relaxed Memory Models
A large variety of models exist; a good starting point:
“Shared Memory Consistency Models: A Tutorial”,
Adve & Gharachorloo, IEEE Computer 1996
How to relax memory order requirement?
Operations of same thread to different locations need not be
globally ordered
How to relax write atomicity requirement?
Read may return value of a write not yet globally visible
Uniprocessor semantics preserved
Typically defined in architecture manuals (e.g. SPARC manual)
Unusual Effects of Memory Models
Initially A = flag1 = flag2 = 0

  thread 1           thread 2
  flag1 = 1;         flag2 = 1;
  A = 1;             A = 2;
  reg1 = A;          reg3 = A;
  reg2 = flag2;      reg4 = flag1;

Result: reg1 = 1; reg3 = 2; reg2 = reg4 = 0

Possible on TSO/SPARC
  Write to A propagated only to local reads of A
  Reads of the flags can occur before the writes to the flags
Not allowed on IBM 370
  Read of A on a processor waits until the write to A is complete
Which Memory Model?
Memory models are platform dependent.

[Diagram: memory models ordered by relative strictness - SC, IBM 390,
TSO, PSO, RMO, Alpha, IA-32 - with “Relaxed” weaker than all of them]

We use a conservative approximation, “Relaxed”, to capture common effects.
Once code is correct for “Relaxed”, it is correct for many models.
Formalization of Relaxed
A : set of addresses
V : set of values
X : set of memory accesses; S are stores, L are loads
a(x) : memory address of access x
v(x) : value loaded or stored by x
<p : partial order over X (program order)

For a load l in L, define the set of stores that are “visible to l”:
  S(l) = { s in S | a(s) = a(l) and (s < l or s <p l) }

Correctness wrt Relaxed demands existence of a total order < over X s.t.
  1. If x <p y and a(x) = a(y) and y in S, then x < y
  2. For l in L and s in S(l), either v(l) = v(s) or there exists
     another store s' in S(l) such that s < s'
Talk Outline
Motivation
Relaxed memory models
CheckFence: Analysis tool for Concurrent Data Types
CheckFence Focus
concurrency libraries with lock-free synchronization
... are simple, fast, and safe to use
concurrent versions of queues, sets, maps, etc.
more concurrency, less waiting
fewer deadlocks
... are notoriously hard to design and verify
tricky interleavings routinely escape reasoning and testing
exposed to relaxed memory models
code needs to contain memory fences for correct operation!
CheckFence Tool

  Concurrent algorithms: lock-free queues, sets, lists
  Architecture: multiprocessors, relaxed memory models
  Computer-aided verification: model checking C code, counterexamples

References: [CAV 2006], [PLDI 2007], Burckhardt thesis
Non-blocking Queue
The client program, running on multiple processors, calls the operations:

  Processor 1          Processor 2
  ...                  ...
  enqueue(1)           a = dequeue()
  enqueue(2)           b = dequeue()
  ...                  ...

The implementation:

  void enqueue(int val) { ... }
  int dequeue() { ... }

optimized: no locks
not race-free
exposed to memory model
Non-blocking Queue (MS’96)
boolean_t dequeue(queue_t *queue, value_t *pvalue)
{
    node_t *head;
    node_t *tail;
    node_t *next;

    while (true) {
        head = queue->head;
        tail = queue->tail;
        next = head->next;
        if (head == queue->head) {
            if (head == tail) {
                if (next == 0)
                    return false;
                cas(&queue->tail, (uint32) tail, (uint32) next);
            } else {
                *pvalue = next->value;
                if (cas(&queue->head, (uint32) head, (uint32) next))
                    break;
            }
        }
    }
    delete_node(head);
    return true;
}
Correctness Condition
Data type implementations must appear
sequentially consistent to the client program:
the observed argument and return values must be consistent with some
interleaved, atomic execution of the operations.
Observation:
  thread 1: enqueue(1); enqueue(2)
  thread 2: dequeue() -> 1; dequeue() -> 2

Witness interleaving:
  enqueue(1); enqueue(2); dequeue() -> 1; dequeue() -> 2
How To Bound Executions
Verify individual “symbolic tests”
finite number of concurrent threads
finite number of operations/thread
nondeterministic input values
Example:
  thread 1: enqueue(X)
  thread 2: dequeue() → Y
User creates suite of tests of increasing size
Why symbolic test programs?
1) Make everything finite
State is unbounded (dynamic memory allocation)
... is bounded for individual test
Checking sequential consistency is undecidable (AMP 96)
... is decidable for individual test
2) Gives us finite instruction sequence to work with
State space too large for interleaved system model
.... can directly encode value flow between instructions
Memory model specified by axioms
.... can directly encode ordering axioms on instructions
Bounded Model Checker

Inputs: C code, a symbolic test, and the memory model axioms.
Possible outcomes:
  Pass: all executions of the test are observationally equivalent
        to a serial execution
  Fail: counterexample trace
  Inconclusive: runs out of time or memory

Tool Architecture

A symbolic test gives exponentially many executions
(symbolic inputs, dynamic memory allocation, ordering of instructions).
CheckFence constructs a CNF formula whose solutions correspond precisely
to the concurrent executions, then solves for “bad” executions.
  automatic, lazy loop unrolling
  automatic specification mining (enumerate correct observations)
Specification Mining
  thread 1: enqueue(X); enqueue(Y)
  thread 2: dequeue() → Z

Possible operation-level interleavings:

  enqueue(X)       enqueue(X)       dequeue() -> Z
  enqueue(Y)       dequeue() -> Z   enqueue(X)
  dequeue() -> Z   enqueue(Y)       enqueue(Y)

For each interleaving, obtain a symbolic constraint by encoding the
corresponding executions in a SAT solver.
Spec is the disjunction of all possibilities:
  Spec: (Z=X) | (Z=null)
To find bugs, check satisfiability of Phi & ~ Spec
where Phi encodes all possible concurrent executions
Encoding Memory Order
Example:
  thread 1: s1 (store), s2 (store)
  thread 2: l1 (load), l2 (load)

Encoding variables
  Use boolean variables for the relative order (x<y) of memory accesses
  Use bitvector variables Ax and Dx for the address and data values
  associated with memory access x

Encode constraints
  encode transitivity of the memory order
  encode ordering axioms of the memory model
    Example (for SC): (s1<s2) & (l1<l2)
  encode value flow
    “Loaded value must match last value stored to same address”
    Example: the value must flow from s1 to l1 under the following conditions:
      ((s1<l1) & (As1 = Al1) & ((s2<s1) | (l1<s2) | (As2 != Al1))) -> (Ds1 = Dl1)
Example: Memory Model Bug
Processor 1 links a new node into the list:
    ...
  3 node->value = 2;
    ...
  1 head = node;
    ...

Processor 2 reads the value at the head of the list:
    ...
  2 value = head->value;
    ...

Processor 1 reorders the stores!
If the memory accesses happen in the order 1, 2, 3,
Processor 2 loads an uninitialized value.
Adding a fence between the lines on the left side prevents the reordering.
Studied Algorithms
  Type      Description          LOC   Source
  Queue     Two-lock queue        80   M. Michael and L. Scott (PODC 1996)
  Queue     Non-blocking queue    98   M. Michael and L. Scott (PODC 1996)
  Set       Lazy list-based set  141   Heller et al. (OPODIS 2005)
  Set       Nonblocking list     174   T. Harris (DISC 2001)
  Deque     “snark” algorithm    159   D. Detlefs et al. (DISC 2000)
  LL/VL/SC  CAS-based             74   M. Moir (PODC 1997)
  LL/VL/SC  Bounded Tags         198   M. Moir (PODC 1997)
Results

snark algorithm has 2 known bugs
lazy list-based set had an unknown bug
(missing initialization; missed by the formal correctness proof
[CAV 2006] because of hand-translation of pseudocode)

  Type      Description          regular bugs
  Queue     Two-lock queue
  Queue     Non-blocking queue
  Set       Lazy list-based set  1 unknown
  Set       Nonblocking list
  Deque     original “snark”     2 known
  Deque     fixed “snark”
  LL/VL/SC  CAS-based
  LL/VL/SC  Bounded Tags
Results

Many failures on the relaxed memory model
  inserted fences by hand to fix them
  small testcases sufficient for this purpose

[Table: number of fences inserted per algorithm, broken down into
store-load fences, fences for dependent loads, and fences for aliased
loads, for the eight implementations listed above]
Typical Tool Performance
Very efficient on small testcases (< 100 memory accesses)
Example (nonblocking queue): T0 = i (e | d) T1 = i (e | e | d | d )
- find counterexamples within a few seconds
- verify within a few minutes
- enough to cover all 9 fences in nonblocking queue
Slows down with increasing number of memory accesses in test
Example (snark deque):
Dq = pop_l | pop_l | pop_r | pop_r | push_l | push_l | push_r | push_r
- has 134 memory accesses (77 loads, 57 stores)
- Dq finds second snark bug within ~1 hour
Does not scale past ~300 memory accesses
Conclusions
Software model checking of low-level concurrent
software requires encoding of memory models
Challenge for model checking due to high level of
concurrency and axiomatic specifications
Opportunity to find bugs in library code that’s hard to
design and verify
CheckFence project at Penn
SAT-based bounded model checking for concurrent data types
Real bugs in real code with fences
Research Challenges
What’s the best way to verify C code (on relaxed memory
models)?
SAT-based encoding seems suitable to capture specifications of
memory models, but many opportunities for improvement
Can one develop explicit-state methods for these applications?
Language-level memory models: Can we verify Java
concurrency libraries using the new Java memory
model?
Hardware support for transactional memory
Current interest in industry and architecture research
Can formal verification influence designs/standards?