Transcript slides

Compiler and Runtime Support
for Efficient
Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis,
Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: Lock-based Synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New Solution: Transactional Memory (TM)
– Simpler programming model: Atomicity, Consistency, Isolation
– No deadlocks
– Composability
– Optimistic concurrency
Analogy
• GC : Memory allocation ≈ TM : Mutual exclusion
Composability
class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5/JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
Transactions are Composable
Scalability - 10,000,000 operations
[Chart: scalability (speedup) vs. # of processors (0–16); lines: Synchronized, Transactional]
Scalability on 16-way 2.2 GHz Xeon System
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT
Novel Features
– Rich transactional language constructs in Java
– Efficient, first class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word and object level conflict detection
– Complete GC support
System Overview
Transactional Java
  → (Polyglot) → Java + STM API
  → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code
Runs on the ORP VM with the McRT STM runtime
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  atomic {S}
• Retry: block until alternate path possible
  atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  tryatomic {S} catch(TxnFailed e) {…}
• When: conditionally atomic region
  when (condition) {S}
Builds on prior research:
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10
Transactional Java → Java
Transactional Java:
atomic {
  S;
}

Standard Java + STM API:
while(true) {
  TxnHandle th = txnStart();
  try {
    S’;
    break;
  } finally {
    if(!txnCommit(th))
      continue;
  }
}

STM API:
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
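To make the control flow of this lowering concrete, here is a runnable toy version. TxnHandle, txnStart, and txnCommit here are stand-ins for the API the slide only names, not the real McRT-STM implementation: the fake commit fails on the first attempt so the retry path is exercised, and the undo-log rollback is simulated by hand.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of the atomic-block lowering above.
class AtomicLowering {
    static class TxnHandle {
        final int attempt;
        TxnHandle(int attempt) { this.attempt = attempt; }
    }

    static final AtomicInteger attempts = new AtomicInteger();
    static int counter = 0;

    static TxnHandle txnStart() {
        return new TxnHandle(attempts.incrementAndGet());
    }

    static boolean txnCommit(TxnHandle th) {
        return th.attempt > 1;     // pretend attempt 1 hits a conflict
    }

    // atomic { counter++; } lowered as on the slide, plus a hand-written
    // rollback where the real undo log would kick in.
    static void incrementOnce() {
        while (true) {
            TxnHandle th = txnStart();
            try {
                counter++;         // transactional body S'
                break;
            } finally {
                if (!txnCommit(th)) {
                    counter--;     // stand-in for undo-log rollback
                    continue;      // re-execute the whole block
                }
            }
        }
    }
}
```

Note the trick the lowering relies on: a `continue` in the `finally` clause supersedes the `break` in the `try` body, so a failed commit re-runs the block transparently.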
JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
• Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
• Supports both word & object granularity
Native method invocation throws exception inside transaction
• Some intrinsic functions allowed
Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)
• Writes:
– strict two-phase locking
– update in place
– undo on abort
• Reads:
– versioning
– validation before commit
• Granularity per type
– Object-level : small objects
– Word-level : large arrays
• Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
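The read/write policy above can be sketched as a runnable single-word STM cell. This is an illustration, not the McRT-STM code: writes lock the record, update memory in place, and log an undo value; reads just record the version they saw and validate it at commit. The boolean lock and integer versions are simplifying assumptions (real McRT-STM packs owner/version into one pointer-sized word).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of optimistic reads + pessimistic (in-place) writes.
class MiniStm {
    static class Cell {                 // one memory word + its txn record
        int value;
        int version = 1;                // odd version number (shared state)
        boolean locked = false;         // stand-in for the exclusive state
    }

    static class Txn {
        List<Cell> readCells = new ArrayList<>();
        List<Integer> readVersions = new ArrayList<>();  // read set
        List<Cell> undoCells = new ArrayList<>();
        List<Integer> undoValues = new ArrayList<>();    // undo log

        int read(Cell c) {              // optimistic: no lock, log the version
            readCells.add(c);
            readVersions.add(c.version);
            return c.value;
        }

        void write(Cell c, int v) {     // pessimistic: lock, log old value, update in place
            c.locked = true;
            undoCells.add(c);
            undoValues.add(c.value);
            c.value = v;
        }

        boolean commit() {              // validate reads, then release writes
            for (int i = 0; i < readCells.size(); i++) {
                if (readCells.get(i).version != readVersions.get(i)) {
                    abort();
                    return false;       // a read was invalidated
                }
            }
            for (Cell c : undoCells) {
                c.version += 2;         // bump version, stays odd
                c.locked = false;
            }
            return true;
        }

        void abort() {                  // roll back in-place updates in reverse order
            for (int i = undoCells.size() - 1; i >= 0; i--) {
                undoCells.get(i).value = undoValues.get(i);
                undoCells.get(i).locked = false;
            }
        }
    }
}
```

The sketch shows where the listed benefits come from: a read is a plain load plus one log append (no buffering), and a commit on the write side is just version bumps and lock releases (no copy-back).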
Ensuring Atomicity: Novel Combination

Mode ↓ \ Memory Ops →     Reads                      Writes
Pessimistic Concurrency                              + In place updates
                                                     + Fast commits
Optimistic Concurrency    + Fast reads
                          + Caching effects
                          + Avoids lock operations

Quantitative results in PPoPP’06
McRT-STM: Example

atomic {
  B = A + 5;
}

translates to:

stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();

STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
– Read-only / multiple readers
– Value is version number (odd)
• Exclusive
– Write-only / single owner
– Value is thread transaction descriptor (4-byte aligned)
Mapping
• Object : slot in object
• Field : hashed index into global record table
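The two-state encoding above fits in one pointer-sized word and can be sketched with bit operations. The low-bit convention and 4-byte alignment are from the slide; packing the version as (n << 1) | 1 is an assumed encoding for illustration.

```java
// Sketch of the shared/exclusive transaction-record states.
class TxRecord {
    // Shared: low bit is 1 (odd version number).
    static boolean isShared(long txr)    { return (txr & 1L) == 1L; }
    // Exclusive: a 4-byte-aligned descriptor pointer, so low two bits are 0.
    static boolean isExclusive(long txr) { return (txr & 3L) == 0L; }

    static long packVersion(long n)      { return (n << 1) | 1L; }  // always odd
    static long unpackVersion(long txr)  { return txr >>> 1; }
}
```

Because descriptor pointers are aligned, a single load plus a bit test tells a barrier whether the record holds a version number or an owning transaction.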
Transaction Record: Example
Every data item has an associated transaction record

Object granularity:
class Foo {
  int x;
  int y;
}
Layout: [vtbl | TxR | x | y]: an extra transaction record field in each object

Word granularity:
class Foo {
  int x;
  int y;
}
Layout: [vtbl | hash | x | y]: object words hash into a global table of TxRs (TxR1, TxR2, TxR3, …, TxRn)
Hash is f(obj.hash, offset)
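The word-granularity mapping can be sketched as a fixed global table indexed by a function of the object hash and field offset. Only the shape of the mapping comes from the slide; the table size and the concrete mixing function are assumptions for illustration.

```java
// Word-granularity sketch: each field hashes to a slot in a global
// table of transaction records via f(obj.hash, offset).
class TxRTable {
    static final int SIZE = 1 << 16;            // assumed power-of-two table size
    static final long[] records = new long[SIZE];

    static int indexFor(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;     // assumed f(obj.hash, offset)
        return h & (SIZE - 1);                  // mask keeps the index in range
    }
}
```

Distinct words may hash to the same record; that only causes false conflicts, never missed ones, which is why a fixed-size table is safe.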
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …
Read and Write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number
Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)
In atomic region
– Read operation appends to the read set
– Write operation appends to the write set and undo log
– GC enumerates read/write/undo logs
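The descriptor layout above can be sketched directly. The field names mirror the slide's tuples; the classes themselves and the 'L'-means-reference tag are illustrative assumptions. The enumeration method shows why GC integration is needed: the undo log holds heap references the collector must trace.

```java
import java.util.ArrayList;
import java.util.List;

// Data-layout sketch of the per-thread transaction descriptor.
class TxnDescriptor {
    static class SetEntry {                    // <Ti, Ni>
        final Object txnRecord; final long version;
        SetEntry(Object t, long n) { txnRecord = t; version = n; }
    }
    static class UndoEntry {                   // <Ai, Oi, Vi, Ki>
        final long address; final Object owner; final Object oldValue; final char typeTag;
        UndoEntry(long a, Object o, Object v, char k) {
            address = a; owner = o; oldValue = v; typeTag = k;
        }
    }

    final List<SetEntry> readSet = new ArrayList<>();
    final List<SetEntry> writeSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    // GC support: enumerate heap references held by the undo log.
    List<Object> enumerateUndoRoots() {
        List<Object> roots = new ArrayList<>();
        for (UndoEntry u : undoLog) {
            if (u.owner != null) roots.add(u.owner);      // Oi is null for statics
            if (u.typeTag == 'L') roots.add(u.oldValue);  // assumed: 'L' tags references
        }
        return roots;
    }
}
```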
McRT-STM: Example
class Foo {
  int x;
  int y;
};
Foo bar, foo;

T1:
atomic {
  t = foo.x;
  bar.x = t;
  t = foo.y;
  bar.y = t;
}

T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example (execution)

T1:                        T2:
stmStart();                stmStart();
t = stmRd(foo.x);          t1 = stmRd(bar.x);
stmWr(bar.x, t);           t2 = stmRd(bar.y);
t = stmRd(foo.y);          stmCommit();
stmWr(bar.y, t);
stmCommit();

Initially foo = [x=9, y=7] (version 3), bar = [x=0, y=0] (version 5).
T1 reads <foo, 3> <foo, 3>; writes <bar, 5>; undo log <bar.x, 0> <bar.y, 0>.
T2 reads <bar, 5> <bar, 7>; while T1 owns bar exclusively, T2 waits.
T1 commits, advancing bar's version (5 → 7), so T2's validation fails and T2 aborts.
• T2 should read [0, 0] or [9, 7], never a mix
Early Results: Overhead breakdown
[Chart: STM time breakdown (0–100%) on a single processor for Binary tree, Hashtable, Linked list, and Btree; components: Application, TLS access, STM write, STM commit, STM validate, STM read]
STM read & validation overheads dominate
→ Good optimization targets
System Overview
Transactional Java
  → (Polyglot) → Java + STM API
  → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code
Runs on the ORP VM with the McRT STM runtime
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation →
greater optimization opportunities
Representing Read/Write Barriers
Traditional barriers hide redundant locking/logging

atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}

becomes:

stmWr(&a.x, t1)
stmWr(&a.y, t2)
if(stmRd(&a.z) == 0) {
  stmWr(&a.x, 0)
  stmWr(&a.z, t3)
}
An STM IR for Optimization
Redundancies exposed:

atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}

becomes:

txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnOpenForWrite(a)
txnLogObjectInt(&a.y, a)
a.y = t2
txnOpenForRead(a)
if(a.z == 0) {
  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = 0
  txnOpenForWrite(a)
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
Optimized Code
Fewer & cheaper STM operations:

atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}

becomes:

txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnLogObjectInt(&a.y, a)
a.y = t2
if(a.z == 0) {
  a.x = 0
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Subtle in presence of nesting
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (shared per four processors)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations
[Chart: % overhead on 1P over a thread-unsafe baseline (0–90%) for HashMap and TreeMap; bars: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining]
Prior STMs typically incur ~2x overhead on 1P
With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe w/o concurrency control
Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (for sizing / locking)
Atomic
• Transactional version via “AtomicMap” wrapper
Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap
Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
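The per-segment size tracking attributed to Atomic Prime can be sketched in a few lines: a single size field makes every insert/remove transaction conflict on the same word, so the count is split across segments and summed on demand. The segment count and hash-to-segment mapping here are assumptions, not taken from the talk.

```java
// Sketch of striped size counting (the "Atomic Prime" tweak).
class StripedSize {
    private final int[] counts;

    StripedSize(int segments) { counts = new int[segments]; }

    // Each update touches only one segment's counter, so two inserts
    // into different segments no longer conflict on a shared size word.
    void add(int keyHash, int delta) {
        counts[Math.floorMod(keyHash, counts.length)] += delta;
    }

    // Reading the total sums all segments on demand.
    int size() {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }
}
```

Under an STM this turns one guaranteed conflict per update into a 1-in-N chance of conflict, at the cost of a slightly more expensive size() query.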
Scalability: 100% Gets
[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); lines: Unsafe, Synchronized, Atomic (No Opt), Atomic, Concurrent]
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales
Scalability: 20% Gets / 80% Updates
[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); lines: Synchronized, Concurrent, Atomic (No Opt), Atomic]
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
Scalability: 20% Inserts and Removes
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); lines: Synchronized, Concurrent, Atomic]
Atomic conflicts on entire bucket array
- The array is an object
Scalability: 20% Inserts and Removes: Word-Level
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); lines: Synchronized, Concurrent, Object Atomic, Word Atomic]
We still conflict on the single size field in java.util.HashMap
Scalability: 20% Inserts and Removes: Atomic Prime
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); lines: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime]
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
Scalability: 20% Inserts and Removes: Mixed-Level
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); lines: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime]
Mixed-level preserves wins & reduces overheads
- word-level for arrays
- object-level for non-arrays
Scalability: java.util.TreeMap
[Charts: scalability vs. # of processors (0–16) for 100% Gets (0–16 speedup) and 80% Gets (0–1.2 speedup); lines: Unsafe, Synchronized, Atomic, Atomic Prime]
Results similar to HashMap
Scalability: OO7 – 80% Reads
Operations & traversal over synthetic database
[Chart: scalability (0–6) vs. # of processors (0–16); lines: Atomic, Synch (Coarse), Synch (Med.), Synch (Fine)]
“Coarse” atomic is competitive with medium-grain synchronization
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
- 20-40% over thread-unsafe
- 10-30% over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides the best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention
Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support