
JASM: A Java Library for the
Generation and Scheduling of
PTX Assembly
Christos Kartsaklis
[email protected]
ICHEC
Purpose
• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not tunable through nvcc.
• JASM
– Similar to a compiler back-end but programmable
itself.
– Different constructs to generate variants of the same
kernel.
– Explore the optimisation strategy space faster.
– Aimed at CUDA programmers tackling instruction bottlenecks.
Structure
1. Introduction
1. Features absent from nvcc
2. Code compilation
3. JASM
2. Library Features
1. Dependencies
2. Aliasing
3. Predication
4. Snippets
5. Reverse compilation
3. Summary
Features absent from nvcc – 1/3
• What is PTX?
– Virtual machine ISA for NVIDIA’s GPUs.
– Generated by nvcc (-ptx) and compiled by ptxas.
• nvcc limitation
– Cannot write inlined PTX in CUDA C/C++.
– Some extra #pragmas would be needed (next slides).
Features absent from nvcc – 2/3
• Predication
– Not exposed; would like to have:
– predicate p = ...; // some condition
– #pragma predicate using p
– if (p) { ...
• Address space derivation
– nvcc must be able to determine what space pointers
refer to; cannot do:
– *(__shared__ float*) someint = ...;
– double d = *(__constant__ double*) foo;
Features absent from nvcc – 3/3
• Latency Hints
– Hard for the compiler to determine (non)coalesced
accesses; ideally:
– #pragma !coalesced
foo = bar[i];
• Reflection
– __device__ float foo(...) { ... }
#pragma N=registersUsedBy(foo);
Code compilation – 1/2
• Instruction Generation
– Compiler does it for you
1. High-level code → intermediate form (I/F).
2. Transform the I/F.
3. Generate machine code from the transformed I/F.
– How good is the generated code?
• Need to manually inspect it.
Code compilation – 2/2
• Instruction Scheduling
– Core part of any compiler.
• Determines the order in which instructions execute.
– Purpose
• Correctness, latency hiding and ILP.
– Problems
• Hard to steer from a high-level language.
• The compiler often generates code of its own.
• Steering is limited to #pragma directives & compiler options.
JASM – 1/3
• Dedicated tools, such as BAGEL
– User selects instructions and BAGEL schedules them.
– The generator code is written in C using the BAGEL library.
– Uses the notion of “abstract instructions”.
• JASM
– Similar philosophy – enhanced functionality.
• Focus:
– Reflection.
– User-Programmable Instruction Scheduling.
JASM – 2/3
• Basic Block (BB)
– A straight-line sequence of instructions that does not change
the flow of execution.
• Control Flow Graph (CFG)
– A directed graph of BBs where edges indicate
changes in execution flow.
• Instruction Stream
– The order of instructions in memory.
– Each instruction is “notionally” part of a BB.
JASM – 3/3
• Examples
– Append an instruction to a basic block:
• lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
– Branching:
• lSource.branchTo(lTargetBB, false);
– Reorder:
• lBB = lBB.reorder(BASIC_COST_FUNCTION,
DA_PTX_DATA_REGISTER_HAZARDS,
DA_ALIASING);
– Predicate:
• lBB = lBB.predicate(lP, lRegisterFile);
– Obtain a macro:
• SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);
BasicBlock lNewBB = lSD.instantiate(lArg1, ...);
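– Putting it together (a sketch: the BasicBlock constructor and the
toPTX() emission call are assumptions, not APIs shown above):
• BasicBlock lBB = new BasicBlock();        // assumed constructor
  lBB.append(LDSHAREDS32, lArg1, 4, lArg2); // ld.shared.s32 ...
  lBB = lBB.reorder(BASIC_COST_FUNCTION,    // schedule within the block
      DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
  lBB = lBB.predicate(lP, lRegisterFile);   // guard every instruction
  String lPTX = lBB.toPTX();                // assumed emission step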
Structure
1. Introduction
1. Features absent from nvcc
2. Code compilation
3. JASM
2. Library Features
1. Dependencies
2. Aliasing
3. Predication
4. Snippets
5. Reverse compilation
3. Summary
Dependencies – 1/2
• All dependencies contribute to the final instruction stream.
– What is the ideal layout?
• What complicates the compiler
– Not enough information to distinguish between true
and false dependencies.
– Variable-latency instructions
• E.g. coalesced vs non-coalesced accesses.
Dependencies – 2/2
• JASM determines instruction order based on:
– Dependency Analysis (DA) modules & Cost Function.
• Full space exploration – no heuristics:
– DAs constrain instructions’ motion in the stream.
– Cost function estimates execution time for any
stream.
– Scheduling is done by an external constraint solver.
• Only applicable to basic blocks.
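– Sketch of a user-supplied cost function (the CostFunction interface
and its estimate() signature are assumptions; the slides only name
BASIC_COST_FUNCTION and the DA_* modules):
• import java.util.List;

  // Assumed interface; JASM's real cost-function contract may differ.
  class CoalescingAwareCost implements CostFunction {
      // Estimate the execution time of one candidate stream.
      public int estimate(List<Instruction> aStream) {
          int lCost = 0;
          for (Instruction lInstr : aStream)
              lCost += isNonCoalesced(lInstr) ? 400 : 4; // illustrative latencies
          return lCost;
      }
      // Placeholder: a real check would inspect the access pattern.
      private boolean isNonCoalesced(Instruction aInstr) { return false; }
  }
  // The DA modules constrain motion; the solver minimises the estimate:
  // lBB = lBB.reorder(new CoalescingAwareCost(),
  //                   DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);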
Aliasing – 1/3
• PTX is not the final code.
– Further optimised by ptxas before machine code
generation.
• Want to specify exactly what is and is not aliased.
– No #pragma aliased / disjoint in PTX.
• Goal:
– Simplify declaration of aliasing/disjoint accesses.
– Handle all memory spaces.
Aliasing – 2/3
• JASM Addressable Memory Regions (AMRs)
• An n-ary tree of AMRs where:
– Root nodes represent spaces (e.g. shared, global)
– Each node is a memory region and a sub-region of its
parent’s.
– Siblings are disjoint regions that collectively make up their
parent's region.
• Instructions are associated with AMRs.
– AMRs predefined for CUDA memory spaces.
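– Sketch of declaring disjoint sub-regions for two instructions lLoad
and lStore (subRegion() and associate() are assumed names; only the
AMR tree and the predefined per-space roots come from this slide):
• AMR lGlobal = AMR.GLOBAL;              // assumed handle to the global root
  AMR lSrc = lGlobal.subRegion("src");   // siblings: disjoint by construction
  AMR lDst = lGlobal.subRegion("dst");
  lLoad.associate(lSrc);                 // this ld.global reads only src
  lStore.associate(lDst);                // this st.global writes only dst
  // DA_ALIASING may now reorder lLoad past lStore: disjoint regions.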
Aliasing – 3/3
• Example
– All associated with the global mem AMR:
01: st.global.f32 [%r0], %f5;
02: ld.global.f64 %lf8, [%r1];
03: st.global.f32 [%r3], %f4;
04: st.global.s32 [%r8], %s1;
• Generally:
– No need for pointer grouping (a la "#pragma disjoint" etc.).
– We work with instructions, not pointers.
Predication – 1/3
• Conditional execution
– if (foo==0) bar=5;
– Thread divergence.
• Predicated execution
– setp.eq.s32 %p, foo, 0;
@%p mov.s32 bar, 5;
– Non-divergent cycle-burner.
• Fine line between the two.
• Cannot predicate code explicitly in CUDA.
Predication – 2/3
• Explicit
– Can allocate predicates, modify them and predicate
instructions.
• Example:
– Direct:
Register lP = lRegFile.allocate(BT.PRED, "%p1");
// @%p1 mov.s32 bar, 5
lBB.append(PMOVS32, lP, lBar, new Immediate(5));
– By reflection:
Instruction lT = new InstructionImpl(
MOVS32, lBar, new Immediate(5));
lBB.append(lT.descriptor().toPredicatedVersion(),
lP, lArg1, lArg2);
Predication – 3/3
• Any basic block can be predicated.
– Including already-predicated instructions.
• Example:
– @%p mov.s32 %d, %s; // if (%p) %d = %s;
– Predicate by %q
• Output:
– @ %q and.pred %t, %p, %q;
@!%q mov.pred %t, %q;
// i.e. %t = %q ? (%p && %q) : false;
@ %t mov.s32 %d, %s;
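– In JASM this is again a single call (lQ, the register holding %q, is
assumed to be allocated as on the previous slide):
• lBB = lBB.predicate(lQ, lRegisterFile); // %t is drawn from lRegisterFile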
Snippets – 1/3
• Problem:
– Certain tasks require knowing in advance how the
compiler treats a piece of code.
• Software pipelining
– template<typename T> void vmult(
T* aDst, T* aSrc1, T* aSrc2) {
for (int i = 0; i < N; i++)
aDst[i] = aSrc1[i] * aSrc2[i];
}
Snippets – 2/3
• Consider H/W vs S/W instructions
– Tradeoff between pipeline stall & register pressure.
• However, register pressure:
– Is also a function of the # of thread blocks.
• Ideally
– Want to generate pipelined code for a variable
number of dependencies.
• Solution:
– Encapsulate the function in a reflective macro
• Reflect instructions & dependencies (sketch below).
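– Sketch (numberOfRegisters() and instantiate() appear in these slides;
the depth heuristic, the concat() helper and the per-stage argument
arrays are assumptions):
• SnippetDescriptor lSD = CToHLA.obtain(
      "s1 * s2", "float", "d", "float", "s1", "float", "s2");
  int lBudget = 32;                      // illustrative register budget
  int lDepth = Math.max(1, lBudget / lSD.numberOfRegisters());
  for (int i = 0; i < lDepth; i++)       // pipeline lDepth copies
      lBB = lBB.concat(lSD.instantiate(lDst[i], lSrc1[i], lSrc2[i]));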
Snippets – 3/3
• Consider the complex "multiplication"
– (a+ib)*(c+id) → (ac+bd)+i(ad+bc)
– 2 stages: 2 muls, then 2 mads.
• Snippet descriptor organisation:
– Group 0:
• mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d
– Group 1:
• mad.f32 ?x, ?b, ?d, ?x; mad.f32 ?y, ?b, ?c, ?y
– ?* items are parameters.
• Any basic block can be "snippetised".
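– Instantiation sketch (argument order and the descriptor handle
lCplxSD are assumed; reorder() is from the earlier JASM slides):
• BasicBlock lCplx = lCplxSD.instantiate(lX, lY, lA, lB, lC, lD);
  // Group 0 (muls) feeds group 1 (mads); the scheduler may interleave
  // independent work between the stages to hide latency.
  lCplx = lCplx.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS);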
Reverse compilation – 1/3
• What to do with legacy CUDA code?
– Option: manually rewrite it in JASM.
• No. Any PTX file can be loaded in JASM.
– Not just loaded in.
– Organised into basic blocks within a Control Flow Graph.
– Malleable from then on, like any other JASM code (sketch below).
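– Sketch of the round trip (PTXReader.load(), basicBlocks() and
replace() are assumed names; the slide only states that PTX is loaded
and organised into a CFG of BBs):
• ControlFlowGraph lCFG = PTXReader.load("legacy_kernel.ptx");
  for (BasicBlock lBB : lCFG.basicBlocks()) {
      BasicBlock lNew = lBB.reorder(BASIC_COST_FUNCTION,
          DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
      lCFG.replace(lBB, lNew);           // swap in the rescheduled block
  }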
Reverse compilation – 2/3
• Inlined C in JASM
– Idea: obtain a snippet from a C function.
– Opposite of "inlined assembly in C".
– Why?
• Reuse what nvcc makes available.
• Enjoy the benefits that come with snippets.
Reverse compilation – 3/3
• At the moment, we can do the following:
– Code:
• SnippetDescriptor lSD = CToHLA.obtain(
"(x*y) % 3", "int", "r",
"int", "x", "int", "y");
– r is the return parameter; x & y are arguments.
– Equivalent to:
• int r = (x*y) % 3;
• Now we can write:
– if(lSD.numberOfRegisters() > 5) { ...
Summary
• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not tunable through nvcc.
• JASM
– Similar to a compiler back-end but programmable
itself.
– Different constructs to generate variants of the same
kernel.
– Explore the optimisation strategy space faster.
• The optimisations are expressed as a function of
the code.