Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Programming Systems Lab
Ali-Reza Adl-Tabatabai, Brian T. Lewis,
Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Locks are hard to get right
• Programmability vs scalability
Transactional memory is appealing alternative
• Simpler programming model
• Stronger guarantees
  • Atomicity, Consistency, Isolation
  • Deadlock avoidance
• Closer to programmer intent
• Scalable implementations
Questions
• How to lower TM overheads – particularly in software?
• How to balance granularity / scalability?
Our System
Java Software Transactional Memory (STM) System
– Pure software implementation (McRT-STM – PPoPP ’06)
– Language extensions in Java (Polyglot)
– Integrated with JVM & JIT (ORP & StarJIT)
Novel Features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– Complete GC support
– RISC-like STM API / IR
– Compiler optimizations
– Per-type word- and object-level conflict detection
Transactional Java → Java

Transactional Java:

    atomic {
        S;
    }

Standard Java + STM API:

    while(true) {
        TxnHandle th = txnStart();
        try {
            S';
            break;
        } finally {
            if(!txnCommit(th))
                continue;
        }
    }

Other language constructs, built on prior research:
• retry (STM Haskell, …)
• orelse (STM Haskell)
• tryatomic (Fortress)
• when (X10, …)
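The expansion above can be exercised with stub STM entry points. The stubs below (a `txnCommit` that fails once before succeeding) are illustrative stand-ins, not the real McRT-STM API; they only show how the `try`/`finally` retry loop re-executes the body after an abort.

```java
// Self-contained sketch of the atomic-block expansion shown above.
// txnStart/txnCommit are stand-ins (an assumption); commit fails once
// to exercise the retry path.
public class AtomicExpansion {
    static class TxnHandle { int attempt; }

    static int attempts = 0;

    static TxnHandle txnStart() {
        TxnHandle th = new TxnHandle();
        th.attempt = ++attempts;
        return th;
    }

    static boolean txnCommit(TxnHandle th) {
        return th.attempt > 1;     // simulate one conflict, then success
    }

    static int runAtomically() {
        int result = 0;
        while (true) {
            TxnHandle th = txnStart();
            try {
                result = 42;       // S': the instrumented transaction body
                break;
            } finally {
                if (!txnCommit(th))
                    continue;      // abort: the continue discards the break
            }                      // and re-executes the body
        }
        return result;
    }
}
```

Note the subtlety the translation relies on: a `continue` in the `finally` block supersedes the `break` from the `try` body, so the loop only exits when commit succeeds.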
Tight integration with JVM & JIT
StarJIT & ORP
• On-demand cloning of methods (Harris ’03)
• Identifies transactional regions in Java+STM code
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in StarJIT IR (STIR)
Good compiler representation →
greater optimization opportunities
Representing Read/Write Barriers

Traditional barriers hide redundant locking/logging

Source:

    atomic {
        …
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

With barriers:

    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
        stmWr(&a.x, 0)
        stmWr(&a.z, t3)
    }
An STM IR for Optimization

Redundancies exposed:

    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

expands to:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
        txnOpenForWrite(a)
        txnLogObjectInt(&a.x, a)
        a.x = 0
        txnOpenForWrite(a)
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Optimized Code

Fewer & cheaper STM operations:

    atomic {
        a.x = t1
        a.y = t2
        if(a.z == 0) {
            a.x = 0
            a.z = t3
        }
    }

becomes:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
        a.x = 0
        txnLogObjectInt(&a.z, a)
        a.z = t3
    }
Compiler Optimizations for Transactions

Standard optimizations
• CSE, dead-code elimination, …
• Subtle in the presence of nesting
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
McRT-STM
PPoPP 2006 (Saha et al.)
• C / C++ STM
• Pessimistic Writes:
– strict two-phase locking
– update in place
– undo on abort
• Optimistic Reads:
– versioning
– validation before commit
• Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
Similar STMs: Ennals (FastSTM); Harris et al. (PLDI ’06)
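The two policies above can be sketched in miniature. The code below is a single-threaded illustration, not the McRT-STM implementation: `MiniTxn`, `Cell`, and the log entry types are assumed names. Writes update in place and log old values for undo; reads log a version number and validate it at commit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pessimistic writes (undo on abort) + optimistic reads (validation).
public class MiniTxn {
    // Simplified per-data state: a value plus a version (odd = shared).
    static class Cell { int value; int version = 1; }

    record UndoEntry(Cell cell, int oldValue) {}
    record ReadEntry(Cell cell, int seenVersion) {}

    final List<UndoEntry> undoLog = new ArrayList<>();
    final List<ReadEntry> readLog = new ArrayList<>();

    int read(Cell c) {                  // optimistic: just log the version seen
        readLog.add(new ReadEntry(c, c.version));
        return c.value;
    }

    void write(Cell c, int v) {         // pessimistic: log undo, update in place
        undoLog.add(new UndoEntry(c, c.value));
        c.value = v;
    }

    boolean commit() {                  // validate reads, bump written versions
        for (ReadEntry r : readLog)
            if (r.cell().version != r.seenVersion()) { abort(); return false; }
        for (UndoEntry u : undoLog)
            u.cell().version += 2;      // stays odd: new shared version
        return true;
    }

    void abort() {                      // undo in reverse order
        for (int i = undoLog.size() - 1; i >= 0; i--)
            undoLog.get(i).cell().value = undoLog.get(i).oldValue();
    }
}
```

This mirrors the benefits listed above: reads and writes go straight to the field (no buffering), and only the undo log holds copies.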
STM Data Structures
Per-thread:
• Transaction Descriptor
– Per-thread info for version validation, acquired locks, rollback
– Maintained in Read / Write / Undo logs
• Transaction Memento
– Checkpoint of logs for nesting / partial rollback
Per-data:
• Transaction Record
– Pointer-sized field guarding a set of shared data
– Transactional state of data
• Shared: Version number (odd)
• Exclusive: Owner’s transaction descriptor (even / aligned)
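The memento can be sketched as a checkpoint of each log's length: a nested transaction records the lengths on entry, and a partial rollback truncates the logs back to that point. Names below are illustrative assumptions, not the actual McRT-STM structures.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a transaction descriptor's logs plus a memento for
// nesting / partial rollback.
public class TxnDescriptor {
    final List<String> readLog  = new ArrayList<>();
    final List<String> writeLog = new ArrayList<>();
    final List<String> undoLog  = new ArrayList<>();

    record Memento(int reads, int writes, int undos) {}

    Memento checkpoint() {              // taken when a nested txn starts
        return new Memento(readLog.size(), writeLog.size(), undoLog.size());
    }

    void rollbackTo(Memento m) {        // abort only the nested txn's work
        truncate(readLog,  m.reads());
        truncate(writeLog, m.writes());
        truncate(undoLog,  m.undos());
    }

    private static void truncate(List<String> log, int size) {
        while (log.size() > size) log.remove(log.size() - 1);
    }
}
```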
Mapping Data to Transaction Record

Every data item has an associated transaction record

Object granularity – transaction record embedded in the object:

    class Foo {          layout: vtbl | TxR | x | y
        int x;
        int y;
    }

Word granularity – object words hash into a table of TxRs:

    class Foo {          layout: vtbl | hash | x | y
        int x;           table:  TxR1 | TxR2 | TxR3 | … | TxRn
        int y;
    }

Hash is f(obj.hash, offset)
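A sketch of the word-granularity mapping follows. The talk does not give the actual hash function f, so the mixing below is an assumption; the point is only that the object's hash and the field offset are combined so different words of one object land on different transaction records.

```java
// Hypothetical word-granularity TxR table; the hash combiner is assumed.
public class TxRecordTable {
    // One lock word ("transaction record") per slot; odd values are version
    // numbers (shared), even values would hold an owner pointer (exclusive).
    private final long[] records;
    private final int mask;

    public TxRecordTable(int sizePow2) {        // size must be a power of two
        records = new long[sizePow2];
        mask = sizePow2 - 1;
        java.util.Arrays.fill(records, 1L);     // start at version 1 (odd)
    }

    // f(obj.hash, offset): mix the object's hash with the field offset.
    public int slot(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        h ^= (h >>> 16);                        // assumed mixing step
        return h & mask;
    }

    public long read(int objHash, int fieldOffset) {
        return records[slot(objHash, fieldOffset)];
    }
}
```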
Granularity of Conflict Detection

Object-level
• Cheaper operation
• Exposes CSE opportunities
• Lower overhead on 1P

Word-level
• Reduces false sharing
• Better scalability

Mix & match
• Per-type basis
• E.g., word-level for arrays, object-level for non-arrays

False-sharing example:

    // Thread 1        // Thread 2
    a.x = …            … = … a.z …
    a.y = …
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations

1P overheads over thread-unsafe baseline

[Bar chart: % overhead on 1P for HashMap and TreeMap; y-axis 0–90%;
bars for Synchronized, No STM Opt, +Base STM Opt, +Immutability,
+Txn Local, +Fast Path Inlining]

Prior STMs typically incur ~2x on 1P
With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization
Scalability: Java HashMap Shootout

Unsafe (java.util.HashMap)
• Thread-unsafe, without concurrency control

Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)

Atomic
• Transactional version via “AtomicMap” wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
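The per-segment size tracking used by Atomic Prime can be sketched as follows. Class and method names are illustrative, not from the talk: the idea is that a single shared size field serializes every transactional insert and remove, while per-segment counters let transactions on different segments commit without conflicting.

```java
// Sketch of per-segment size counters (the "Atomic Prime" hand optimization).
public class SegmentedSize {
    private final int[] counts;        // one counter per segment
    private final int segmentMask;

    public SegmentedSize(int segments) {   // segments must be a power of two
        counts = new int[segments];
        segmentMask = segments - 1;
    }

    private int segmentFor(int keyHash) {
        return (keyHash >>> 16) & segmentMask;   // assumed segment selector
    }

    // Inside a transaction, an insert/remove touches only its own segment's
    // counter, so updates to different segments no longer conflict on size.
    public void onInsert(int keyHash) { counts[segmentFor(keyHash)]++; }
    public void onRemove(int keyHash) { counts[segmentFor(keyHash)]--; }

    public int size() {                // reads every segment; rare operation
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }
}
```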
Scalability: 100% Gets

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Unsafe, Synchronized, Atomic (No Opt), Atomic, Concurrent]

Atomic wrapper is competitive with ConcurrentHashMap
The effect of compiler optimizations scales
Scalability: 20% Gets / 80% Updates

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Synchronized, Concurrent, Atomic (No Opt), Atomic]

ConcurrentHashMap thrashes on 16 segments
Atomic still scales
20% Inserts and Removes

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Synchronized, Concurrent, Atomic]

Atomic conflicts on the entire bucket array
- The array is a single object
20% Inserts and Removes: Word-Level

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Synchronized, Concurrent, Object Atomic, Word Atomic]

We still conflict on the single size field in java.util.HashMap
20% Inserts and Removes: Atomic Prime

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Synchronized, Concurrent, Object Atomic, Word Atomic,
Word Atomic Prime]

Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
20% Inserts and Removes: Mixed-Level

[Line chart: speedup over 1P Unsafe vs. # of processors (0–16);
series: Synchronized, Concurrent, Object Atomic, Word Atomic,
Word Atomic Prime, Mixed Atomic Prime]

Mixed-level preserves the wins & reduces overheads
- word-level for arrays
- object-level for non-arrays
Key Takeaways

Optimistic reads + pessimistic writes is a nice sweet spot

Compiler optimizations significantly reduce STM overhead
- 20–40% over thread-unsafe
- 10–30% over synchronized

Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization

Word-level contention is crucial for large arrays
Mixed contention provides the best of both
Novel Contributions

Rich transactional language constructs in Java
Efficient, first-class nested transactions
Complete GC support
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection