Transcript Topic 4

Advanced Computer Architecture
5MD00
Exploiting ILP with SW approaches
Henk Corporaal
www.ics.ele.tue.nl/~heco
TU Eindhoven
December 2012
Topics
• Static branch prediction and speculation
• Basic compiler techniques
• Multiple issue architectures
• Advanced compiler support techniques
– Loop-level parallelism
– Software pipelining
• Hardware support for compile-time scheduling
4/3/2016
ACA H.Corporaal
2
We previously discussed dynamic branch prediction
• Dynamic prediction does not help the compiler!
• Should the compiler speculate operations (= move operations before a branch) from the target path or from the fall-through path?
• We need Static Branch Prediction
Static Branch Prediction and Speculation
• Static branch prediction is useful for code scheduling
• Example:

     ld   r1,0(r2)
     sub  r1,r1,r3    # hazard: uses r1 right after the load
     beqz r1,L
     or   r4,r5,r6
     addu r10,r4,r3
L:   addu r7,r8,r9

• If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
• If the branch is not taken most of the time, and assuming that r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld
Static Branch Prediction Methods
• Always predict taken
  – Average misprediction rate for SPEC: 34% (9%-59%)
• Backward branches predicted taken, forward branches predicted not taken
  – In SPEC, most forward branches are taken, so always predicting taken is better
• Profiling
  – Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
  – The behavior of a branch is often biased toward taken or not taken
  – Average misprediction rate for SPECint: 15% (11%-22%), SPECfp: 9% (5%-15%)
• Can we do better? YES: use control-flow restructuring to exploit correlation
Static exploitation of correlation

[Figure: a CFG with blocks a..g, where block a ends in "bez t,b,c" and block d ends in "bez t,e,f". If the branches are correlated, the branch direction in block d depends on the branch taken in block a. Control-flow restructuring duplicates block d (creating d') along one of the paths, so that the branch direction in each copy is statically known.]
Basic compiler techniques
• Dependencies limit ILP (Instruction-Level Parallelism)
  – We cannot always find sufficient independent operations to fill all the delay slots
  – This may result in pipeline stalls
• Scheduling to avoid stalls (= reordering instructions)
• (Source-)code transformations to create more exploitable parallelism
  – Loop Unrolling
  – Loop Merging (Fusion)
• See the online slide set about loop transformations!
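Loop merging (fusion) can be made concrete in C. The sketch below is illustrative and not from the slides (the function name is mine): two loops over the same array are combined so that one loop body offers the scheduler more independent operations per iteration.

```c
#include <assert.h>

/* Loop fusion sketch: two separate traversals of a[] are merged into one
   loop body. The two statements in the fused body are independent, giving
   the scheduler more ILP per iteration and halving the loop overhead. */
void fused(const double *a, double *b, double *c, int n) {
    /* before fusion this was:
         for (i = 0; i < n; i++) b[i] = a[i] + 1.0;
         for (i = 0; i < n; i++) c[i] = a[i] * 2.0;   */
    for (int i = 0; i < n; i++) {
        b[i] = a[i] + 1.0;   /* independent of the next statement */
        c[i] = a[i] * 2.0;
    }
}
```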
Dependencies Limit ILP: Example
C loop:

  for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

MIPS assembly code:

        ; R1 = &x[1]
        ; R2 = &x[1000]+8
        ; F2 = s
Loop:   L.D   F0,0(R1)     ; F0 = x[i]
        ADD.D F4,F0,F2     ; F4 = x[i]+s
        S.D   0(R1),F4     ; x[i] = F4
        ADDI  R1,R1,8      ; R1 = &x[i+1]
        BNE   R1,R2,Loop   ; branch if R1!=&x[1000]+8
Schedule this on a MIPS Pipeline
• FP operations are mostly multicycle
• The pipeline must be stalled if an instruction uses the result of a not-yet-finished multicycle operation
• We'll assume the following latencies:

  Producing instruction   Consuming instruction   Latency (clock cycles)
  FP ALU op               FP ALU op               3
  FP ALU op               Store double            2
  Load double             FP ALU op               1
  Load double             Store double            0
Where to Insert Stalls?
• How would this loop be executed on the MIPS FP pipeline?

Loop:   L.D   F0,0(R1)
        ADD.D F4,F0,F2
        S.D   F4,0(R1)
        ADDI  R1,R1,8      ; inter-iteration dependence (through R1)!
        BNE   R1,R2,Loop

• What are the true (flow) dependences?
Where to Insert Stalls
• How would this loop be executed on the MIPS FP pipeline?
• 10 cycles per iteration:

Loop:   L.D   F0,0(R1)
        stall
        ADD.D F4,F0,F2
        stall
        stall
        S.D   0(R1),F4
        ADDI  R1,R1,8
        stall
        BNE   R1,R2,Loop
        stall
Code Scheduling to Avoid Stalls
• Can we reorder the instructions to avoid stalls?
• Execution time is reduced from 10 to 6 cycles per iteration:

Loop:   L.D   F0,0(R1)
        ADDI  R1,R1,8
        ADD.D F4,F0,F2
        stall
        BNE   R1,R2,Loop
        S.D   -8(R1),F4    ; watch out: R1 was already incremented!

• But only 3 instructions perform useful work; the rest is loop overhead. How can we avoid this?
Loop Unrolling: increasing ILP
At source level:

  for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

becomes

  for (i=1; i<=1000; i=i+4) {
    x[i]   = x[i]  +s;
    x[i+1] = x[i+1]+s;
    x[i+2] = x[i+2]+s;
    x[i+3] = x[i+3]+s;
  }

MIPS code after scheduling:

Loop:   L.D   F0,0(R1)
        L.D   F6,8(R1)
        L.D   F10,16(R1)
        L.D   F14,24(R1)
        ADD.D F4,F0,F2
        ADD.D F8,F6,F2
        ADD.D F12,F10,F2
        ADD.D F16,F14,F2
        S.D   0(R1),F4
        S.D   8(R1),F8
        ADDI  R1,R1,32
        S.D   -16(R1),F12
        BNE   R1,R2,Loop
        S.D   -8(R1),F16

Any drawbacks?
– loop unrolling increases code size
– more registers are needed
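The source-level transformation above can be written as a small C helper. This is a sketch, assuming (as on the slide, with a trip count of 1000) that the number of elements is a multiple of 4; the function name is mine.

```c
#include <assert.h>

/* Unroll-by-4 sketch of x[i] = x[i] + s.
   Assumes n is a multiple of 4, like the slide's trip count of 1000. */
void add_s_unrolled(double *x, int n, double s) {
    for (int i = 0; i < n; i += 4) {
        x[i]     += s;   /* four independent add/store pairs per iteration: */
        x[i + 1] += s;   /* fewer branches and more scheduling freedom,     */
        x[i + 2] += s;   /* at the cost of code size and register pressure  */
        x[i + 3] += s;
    }
}
```

A compiler performing this unrolling must also generate a cleanup loop when the trip count is not a multiple of the unroll factor.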
Multiple issue architectures
How to get CPI < 1?
• Superscalar: multiple instructions issued per cycle
  – Statically scheduled
  – Dynamically scheduled (see previous lecture)
• VLIW?
  – single instruction issue, but multiple operations per instruction (so instruction CPI stays ≥ 1 while multiple operations complete per cycle)
• SIMD / Vector?
  – single instruction issue, single operation, but multiple data elements per operation (so instruction CPI stays ≥ 1 while multiple data elements are processed per cycle)
• Multi-threading? (e.g. x86 Hyper-Threading)
• Multi-processor? (e.g. x86 multi-core)
Instruction Parallel (ILP) Processors
The name ILP is used for:
• Multiple-Issue Processors
  – Superscalar: varying number of instructions/cycle (0 to 8), scheduled by HW (dynamic issue capability)
    • IBM PowerPC, Sun UltraSPARC, DEC Alpha, Pentium III/4, etc.
  – VLIW (very long instruction word): fixed number of instructions (4-16) scheduled by the compiler (static issue capability)
    • Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
• (Super-)pipelined processors
• The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI
Vector processors
• Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
  – Multimedia instructions are being added to many processors
• Different implementations:
  – real SIMD: e.g. 320 separate 32-bit ALUs + register files
  – (multiple) subword units: divide a single ALU into sub-ALUs
  – deeply pipelined units: aiming at very high frequency, with forwarding between units
Simple In-order Superscalar
• In-order superscalar 2-issue processor: 1 Integer & 1 FP instruction
  – Used in the first Pentium processor (also in Larrabee, but canceled!)
  – Fetch 64 bits/clock cycle; Int instruction on the left, FP on the right
  – Can only issue the 2nd instruction if the 1st instruction issues
  – More ports are needed on the FP register file to execute an FP load & an FP op in parallel

  Type              Pipe stages
  Int. instruction  IF  ID  EX  MEM WB
  FP instruction    IF  ID  EX  MEM WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  MEM WB
  Int. instruction          IF  ID  EX  MEM WB
  FP instruction            IF  ID  EX  MEM WB

• A 1-cycle load delay impacts the next 3 instructions!
Dynamic trace for unrolled code

  for (i=1; i<=1000; i++)
    a[i] = a[i]+s;

Load: 1 cycle latency; FP ALU op: 2 cycles latency

     Integer instruction    FP instruction      Cycle
L:   LD   F0,0(R1)                              1
     LD   F6,8(R1)                              2
     LD   F10,16(R1)        ADDD F4,F0,F2       3
     LD   F14,24(R1)        ADDD F8,F6,F2       4
     LD   F18,32(R1)        ADDD F12,F10,F2     5
     SD   0(R1),F4          ADDD F16,F14,F2     6
     SD   8(R1),F8          ADDD F20,F18,F2     7
     SD   16(R1),F12                            8
     ADDI R1,R1,40                              9
     SD   -16(R1),F16                           10
     BNE  R1,R2,L                               11
     SD   -8(R1),F20                            12

• 2.4 cycles per element vs. 3.5 for the ordinary MIPS pipeline
• Int and FP instructions are not perfectly balanced
Superscalar – Multi-issue Issues
• While the Integer/FP split is simple for the HW, we get an IPC of 2 only for programs with:
  – exactly 50% FP operations AND no hazards
• More complex decode and issue! E.g., already for 2-issue we need:
  – Issue logic: examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue (N-issue needs ~O(N^2) comparisons)
  – Register file complexity: a 2-issue superscalar needs 4 read and 2 write ports per cycle
  – Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

      add r1, r2, r3        add p11, p4, p7
      sub r4, r1, r2   →    sub p22, p11, p4
      lw  r1, 4(r4)         lw  p23, 4(p22)
      add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  – Bypassing / result buses: need to complete multiple instructions per cycle
    • Need multiple buses with associated matching logic at every reservation station.
Why not VLIW Processors?
• Superscalar HW is expensive to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
• Example: a VLIW processor with 2 ld/st units, two FP units, and one integer/branch unit; no branch delay

  Ld/st 1          Ld/st 2          FP 1             FP 2             Int
  LD F0,0(R1)      LD F6,8(R1)
  LD F10,16(R1)    LD F14,24(R1)
  LD F18,32(R1)    LD F22,40(R1)    ADDD F4,F0,F2    ADDD F8,F6,F2
  LD F26,48(R1)                     ADDD F12,F10,F2  ADDD F16,F14,F2
                                    ADDD F20,F18,F2  ADDD F24,F22,F2
  SD 0(R1),F4      SD 8(R1),F8      ADDD F28,F26,F2
  SD 16(R1),F12    SD 24(R1),F16
  SD 32(R1),F20    SD 40(R1),F24                                      ADDI R1,R1,56
  SD -8(R1),F28                                                       BNE R1,R2,L

• 9 cycles for 7 iterations: 9/7 ≈ 1.3 cycles per iteration!
Superscalar versus VLIW
VLIW advantages:
• Much simpler to build; potentially faster
VLIW disadvantages and proposed solutions:
• Binary code incompatibility
  – Object code translation or emulation
  – A less strict approach (EPIC, IA-64, Itanium)
• Increase in code size; unfilled slots are wasted bits
  – Use clever encodings, e.g. only one immediate field per instruction
  – Compress instructions in memory and decode them when they are fetched, or when put into the L1 cache
• Lockstep operation: if the operation in one instruction slot stalls, the entire processor stalls
  – A less strict approach
Use compressed instructions

[Figure: Memory → L1 Instruction Cache → CPU, with compressed instructions stored in memory. Decompression can happen either when instructions are moved from memory into the L1 cache, or when they are fetched from the L1 cache into the CPU.]

Q: What are the pros and cons of each option?
Advanced compiler support techniques
• Loop-level parallelism
• Software pipelining
• Global scheduling (across basic blocks)
Detecting Loop-Level Parallelism
• Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration
• If there is no loop-carried dependence, then the loop's iterations can be executed in parallel

  for (i=1; i<=100; i++){
    A[i+1] = A[i]+C[i];    /* S1 */
    B[i+1] = B[i]+A[i+1];  /* S2 */
  }

A loop is parallel ⇔ the corresponding dependence graph does not contain a cycle
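The distinction can be made concrete in C. The two functions below are an illustrative sketch (names are mine): the first loop has no loop-carried dependence and is parallel; the second carries a dependence through a[i-1] and must run serially.

```c
#include <assert.h>

/* No loop-carried dependence: each iteration writes only its own a[i]
   and reads only b[i] and c[i], so all iterations can run in parallel. */
void independent(double *a, const double *b, const double *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: iteration i reads a[i-1], which was written
   by iteration i-1, so the iterations must execute in order. */
void carried(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];
}
```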
Finding Dependences
• Is there a dependence in the following loop?

  for (i=1; i<=100; i++)
    A[2*i+3] = A[2*i] + 5.0;

• Affine expression: an expression of the form a*i + b (a, b constants, i the loop index variable)
• Does the following equation have a solution?
    a*i + b = c*j + d
• GCD test: if there is a solution, then GCD(a,c) must divide d-b
• Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there may be a solution" while in reality there isn't
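The GCD test is straightforward to implement. The sketch below is hypothetical (function names are mine) and assumes nonzero coefficients a and c:

```c
#include <assert.h>
#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for a write to A[a*i + b] vs. a read of A[c*j + d]:
   an integer solution of a*i + b = c*j + d can exist only if
   gcd(a,c) divides d-b. Returns 1 if a dependence is possible,
   0 if it is ruled out. Conservative: loop bounds are ignored,
   so a return of 1 may be a false positive. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(abs(a), abs(c)) == 0;
}
```

For the slide's loop, the write is A[2*i+3] (a=2, b=3) and the read is A[2*i] (c=2, d=0): gcd(2,2)=2 does not divide -3, so the test proves there is no dependence.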
Software Pipelining
• We have already seen loop unrolling
• Software pipelining is a related technique that consumes less code space: it interleaves instructions from different iterations
  – instructions within one iteration are often dependent on each other

[Figure: iterations 0, 1, and 2 overlap in time; a software-pipelined iteration takes one instruction from each original iteration, and in steady state these overlapped instructions form the loop kernel.]
Simple Software Pipelining Example

L:  l.d   f0,0(r1)    # load M[i]
    add.d f4,f0,f2    # compute M[i]
    s.d   f4,0(r1)    # store M[i]
    addi  r1,r1,-8    # i = i-1
    bne   r1,r2,L

• Software pipelined loop:

L:  s.d   f4,16(r1)   # store M[i]
    add.d f4,f0,f2    # compute M[i-1]
    l.d   f0,0(r1)    # load M[i-2]
    addi  r1,r1,-8
    bne   r1,r2,L

• Need hardware support to avoid the WAR hazards
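The same prologue/kernel/epilogue structure can be sketched in C. This is an illustrative model of the schedule (the function name is mine), not generated code: the kernel stores element i-2, computes element i-1, and loads element i, mirroring the software-pipelined loop above.

```c
#include <assert.h>

/* Software-pipelining sketch for x[i] = x[i] + s, pipeline depth 3:
   the load, compute, and store in the kernel come from three different
   iterations. Assumes n >= 3. */
void add_s_swp(double *x, int n, double s) {
    double loaded = x[0];           /* prologue: load M[0]    */
    double computed = loaded + s;   /* prologue: compute M[0] */
    loaded = x[1];                  /* prologue: load M[1]    */
    for (int i = 2; i < n; i++) {   /* steady-state kernel:   */
        x[i - 2] = computed;        /*   store M[i-2]         */
        computed = loaded + s;      /*   compute M[i-1]       */
        loaded = x[i];              /*   load M[i]            */
    }
    x[n - 2] = computed;            /* epilogue: drain the pipeline */
    x[n - 1] = loaded + s;
}
```

Note how the kernel body contains no dependence chain from load to store within one iteration; that is what allows the hardware to overlap them.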
Global code scheduling
• Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e. when the loop is a single basic block
• Global code scheduling: scheduling/moving code across branches gives a larger scheduling scope
• When can the assignments to B and C be moved before the test?

[Figure: block "A[i]=A[i]+B[i]" followed by the test "A[i]=0?"; the true (T) path executes "B[i]=", the false (F) path executes block X and then "C[i]=".]
Which scheduling scope?

[Figure: four scheduling scopes compared side by side: Trace, Superblock, Decision Tree, and Hyperblock/region.]
Comparing scheduling scopes

                           Trace  Sup.block  Hyp.block  Dec.Tree  Region
  Multiple exc. paths      No     No         Yes        Yes       Yes
  Side-entries allowed     Yes    No         No         No        No
  Join points allowed      Yes    No         Yes        No        Yes
  Code motion down joins   Yes    No         No         No        No
  Must be if-convertible   No     No         Yes        No        No
  Tail dup. before sched.  No     Yes        No         Yes       No
Scheduling scope creation (1)
Partitioning a CFG into scheduling scopes:

[Figure: a CFG with blocks A..G. On the left it is partitioned into traces. On the right, tail duplication (creating D', E', G') removes side entries, so each trace becomes a single-entry superblock.]
Trace Scheduling
• Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
• Optimize the trace as much as possible (trace compaction)
  – move operations as early as possible in the trace
  – pack the operations into as few VLIW instructions as possible
  – additional bookkeeping code may be necessary at the exit points of the trace
Scheduling scope creation (2)
Partitioning a CFG into scheduling scopes:

[Figure: the same CFG with blocks A..G is partitioned with tail duplication (creating D', E', F', G', G'') into a Decision Tree on one side and a Hyperblock/region on the other.]
Code movement (upwards) within regions

[Figure: an instruction I is moved upwards from a source block, past an intermediate block, into a destination block. Legend: a copy of I is needed on side paths that also reach I's original location, and moving past the intermediate block requires checking that I's result is not live on paths leaving the region ("off-liveness").]
Hardware support for compile-time scheduling
• Predication
– (discussed already)
– see also Itanium example
• Deferred exceptions
• Speculative loads
Predicated Instructions (discussed before)
• Avoid branch prediction by turning branches into conditional or predicated instructions:
  if false, then neither store the result nor cause an exception
  – The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
  – IA-64/Itanium: conditional execution of any instruction
• Examples:

  if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

  if (R1 < R2)               SLT    R9,R1,R2
    R3 = R1;                 CMOVNZ R3,R1,R9
  else                       CMOVZ  R3,R2,R9
    R3 = R2;
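The effect of the SLT + CMOV sequence can be mimicked in portable C with a branchless select. This is an illustrative sketch (not what any particular compiler emits): the condition becomes a data value, so there is no branch to mispredict.

```c
#include <assert.h>

/* Branchless minimum, mimicking the SLT + CMOVNZ/CMOVZ sequence:
   the comparison result is turned into an all-ones or all-zeros mask,
   and the mask selects between the two operands without a branch. */
int min_branchless(int r1, int r2) {
    int mask = -(r1 < r2);               /* all ones if r1 < r2, else 0 */
    return (r1 & mask) | (r2 & ~mask);   /* select r1 or r2 via the mask */
}
```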
Deferred Exceptions

  if (A==0)
    A = B;
  else
    A = A+4;

      ld    r1,0(r3)    # load A
      bnez  r1,L1       # test A
      ld    r1,0(r2)    # then part; load B
      j     L2
  L1: addi  r1,r1,4     # else part; inc A
  L2: st    r1,0(r3)    # store A

• How to optimize when the then-part is usually selected?

      ld    r1,0(r3)    # load A
      ld    r9,0(r2)    # speculative load B
      beqz  r1,L3       # test A
      addi  r9,r1,4     # else part
  L3: st    r9,0(r3)    # store A

• What if the speculative load generates a page fault?
• What if it generates an "index-out-of-bounds" exception?
HW supporting Speculative Loads
• Speculative load (sld): does not generate exceptions
• Speculation check instruction (speck): checks for an exception. The exception occurs when this instruction is executed.

      ld    r1,0(r3)    # load A
      sld   r9,0(r2)    # speculative load of B
      bnez  r1,L1       # test A
      speck 0(r2)       # perform exception check
      j     L2
  L1: addi  r9,r1,4     # else part
  L2: st    r9,0(r3)    # store A
Next?
• Example: Core i7, 3 GHz, 100 W
• Trends:
  – the number of transistors still follows Moore's law
  – but frequency and performance per core do not