O-structures: Tracking Fine-grain, Inter
Download
Report
Transcript O-structures: Tracking Fine-grain, Inter
O-structures:
Tracking Fine-grain, Inter-task
Dependencies in Memory
(towards hardware support for tasks)
Yoav Etsion
Technion
In collaboration with: Eran Gilad, Tehila Mayzels, Elazar Raab (Technion), and Mark Oskin (UW)
Parallelizing irregular memory accesses
• Modern workloads (e.g., BigData, ML, analytics) make frequent use of
irregular data structures and algorithms
• The shift towards task-based programming simplifies the programmer’s
job, but offers limited support for irregular access patterns
• Primarily, tasking simplifies the way we express concurrency, but it does
not necessarily simplify data synchronization
• Especially irregular data accesses…
• Some examples:
Data synchronization in task-based programming
• Old school
• Examples: OpenMP 3.0, Grappa (UW)
• Similar to thread-based programming (e.g., barriers, mutexes, semaphores)
• Implicit in task hierarchy
• Examples: Cilk (MIT/Intel)
• Parent task waits for its children
• Coarse-grain dataflow:
• Examples: OpenMP 4.0, OmpSs (BSC), Legion/Realm (Stanford)
• Annotate task inputs and outputs #pragma omp task input(a, b)
void sgemm_t(float a[M][M],
• Task cannot have side-effects
float b[M][M],
float c[M][M]);
inout(c)
The problem:
Expressing dependencies for irregular data structures
• Example:
operating on a binary tree
// insert value to tree
#pragma task
void insert(int v);
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
// lookup value in tree
#pragma task
bool lookup(int v);
insert(3); // T7
lookup(5); // T8
}
Research hypothesis
• The simplistic von Neumann load/store model does not cut it anymore
• The memory system needs to support programs’ concurrency semantics
• Introducing O-structures…
O-structures:
Memory versioning and renaming
• O-structure: A memory element that stores ordered versions of its datum
• Resembles register renaming, but at the memory level
• A program can ask for a specific version or the “latest” version (in task order)
• Allows tasks to execute out of order, but guarantees that each task views
the same memory snapshot it would have viewed if executed sequentially
• The memory system maintains implicit memory ordering between tasks
• O-structures may be viewed as I-structures for imperative programs
• Each version is created once and is immutable (I-structure)
• O-structure semantics direct load to the correct I-structure (+garbage collection)
In-memory concurrency with O-structures
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
}
insert(3); // T7
lookup(5); // T8
Version T1
In-memory concurrency with O-structures
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
}
insert(3); // T7
lookup(5); // T8
Version T1
Version T5
In-memory concurrency with O-structures
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
Version T1
insert(8); // T5
lookup(5); // T6
}
Version T5
insert(3); // T7
lookup(5); // T8
Version T7
Memory order vs. execution order
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
}
insert(3); // T7
lookup(5); // T8
Memory order vs. execution order
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
}
insert(3); // T7
lookup(5); // T8
Memory order vs. execution order
int main()
{
insert(5);
lookup(5);
lookup(8);
lookup(3);
//
//
//
//
task T1
T2
T3
T4
insert(8); // T5
lookup(5); // T6
}
insert(3); // T7
lookup(5); // T8
Memory order is determined by
program semantics, and not by
execution order!
Status report
• Architecture implemented in gem5
• Heavy lifting in the cache system
• Parallelizing pass implemented in LLVM
• At the moment, we only support unipath algorithms
• Each node can only be reached through a unique path from an entry node
Architecture and microarchitecture
Architecture and microarchitecture
No time for that; Miquel will kill me if I
waste more time…
Preliminary results
Preliminary results
Preliminary results
Preliminary results
Conclusions
• The load/store von Neumann memory interface leaves the heavy lifting
of parallel programming to the programmer and runtime
• We need to revise the memory interface and semantics to aid
concurrency
• O-structures extend register renaming to the memory system
• Enables the memory system to enforce fine-grain data dependences
Thank you
Questions?