Computer Architecture for Medical Applications
Pipelining & Single Instruction Multiple Data (SIMD) – two driving factors of single core performance

Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Dietmar Fey, Department for Computer Science

30. April 2013
A different view on computer architecture
From high level code to macro-/microcode execution

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Figure: the compiler maps the loop onto machine code: A(i) is loaded (incl. LD) into a register, sum is kept in register xmm1, i is the loop counter running up to N, and the ADD execution unit adds the 1st argument to the 2nd argument and stores the result in the 2nd argument.]
How does high level code interact with execution units?

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Many hardware execution units:
  LOAD (STORE) operands from L1 cache (register) to register (memory)
  Floating Point (FP) MULTIPLY and ADD
  Various integer units
Execution units may work in parallel: “superscalar” processor.
Two important concepts at the hardware level: Pipelining + SIMD.
Microprocessors – Pipelining
Introduction: Moore’s law

1965: G. Moore claimed that the number of transistors on a processor chip doubles every 12-24 months.
Examples: Intel Nehalem EX: 2.3 billion transistors; NVIDIA Fermi: 3 billion transistors.
Introduction: Moore’s law – faster cycles and beyond

Moore’s law: transistors are getting smaller – run them faster!
Faster clock speed requires reduced complexity of instruction execution: pipelining of instructions.

[Figure: Intel x86 clock speed; frequency in MHz (log scale, 0.1 to 10000) over the years 1971-2009.]

Increasing transistor count and clock speed allows / requires architectural changes:
• Pipelining
• Superscalarity
• SIMD / Vector ops
• Multi-Core/Threading
• Complex on-chip caches
Pipelining of arithmetic/functional units

Idea:
  Split a complex instruction into several simple / fast steps (stages)
  Each step takes the same amount of time, e.g. a single cycle
  Execute different steps of different instructions at the same time (in parallel)

Allows for shorter cycle times (simpler logic circuits), e.g.:
  a floating point multiplication takes 5 cycles, but
  the processor can work on 5 different multiplications simultaneously,
  delivering one result per cycle after the pipeline is full

Drawbacks:
  Pipeline must be filled – startup times (#instructions >> pipeline stages)
  Efficient use of pipelines requires a large number of independent instructions: instruction level parallelism
  Requires complex instruction scheduling by compiler/hardware – software pipelining / out-of-order execution

Pipelining is widely used in modern computer architectures.
Interlude: Possible stages for Multiply

Real numbers can be represented as mantissa and exponent in a “normalized” representation s * 0.m * 10^e with
  sign s = {-1, 1}
  mantissa m, which does not contain 0 in the leading digit
  exponent e, some positive or negative integer

Multiply two real numbers r1 * r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:
  (s1 * 0.m1 * 10^e1) * (s2 * 0.m2 * 10^e2) = (s1*s2) * (0.m1*0.m2) * 10^(e1+e2)
Normalize the result: s3 * 0.m3 * 10^e3
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i); i=1,...,N

Pipeline stages: Separate Mant./Exp. → Mult. Mantissa → Add Exponents → Normalize Result → Insert Sign

[Figure: in cycle 1 the pipeline starts on B(1),C(1); in each following cycle a new pair B(i),C(i) enters the first stage while the earlier pairs advance one stage; A(1) leaves the pipeline in cycle 5, and the last result A(N) is completed in cycle N+4.]

First result is available after 5 cycles (= latency of the pipeline)!
After that, one instruction is completed in each cycle.
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N
Wind-up/-down phases: Empty pipeline stages
Pipelining: Speed-Up and Throughput

Assume a general m-stage pipe, i.e. the pipeline depth is m.

Speed-up of pipelined vs. non-pipelined execution at the same clock speed:
  Tseq / Tpipe = (m*N) / (N+m) ≈ m for large N (N >> m)

Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:
  N / Tpipe(N) = N / (N+m) = 1 / (1 + m/N)
Throughput for large N: N / Tpipe(N) ≈ 1

Number of independent operations (Nc) required to achieve a throughput of Tp results per cycle:
  Tp = 1 / (1 + m/Nc)  ⇒  Nc = Tp * m / (1 - Tp)
For Tp = 0.5: Nc = m
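As a quick numeric check (not part of the slides), these formulas can be evaluated for a hypothetical 5-stage pipe:

#include <stdio.h>

/* Evaluate the speed-up and throughput formulas for a 5-stage pipeline. */
int main(void) {
    const double m = 5.0;                          /* pipeline depth */
    const double N[] = {5.0, 25.0, 100.0, 1000.0}; /* independent instructions */
    for (int k = 0; k < 4; ++k) {
        double speedup    = (m * N[k]) / (N[k] + m); /* Tseq / Tpipe */
        double throughput = N[k] / (N[k] + m);       /* results per cycle */
        printf("N=%6.0f  speed-up=%5.2f  throughput=%4.2f\n",
               N[k], speedup, throughput);
    }
    /* Nc for Tp = 0.5: Tp*m/(1-Tp) = m, i.e. 5 independent operations */
    return 0;
}

For N = 1000 the speed-up is already about 4.98 out of the ideal 5.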
Throughput as function of pipeline stages

[Figure: throughput 1/(1+m/Nc) vs. the number of independent operations, for several pipeline depths m; the 90% pipeline efficiency level is marked.]
Software pipelining

Example (Fortran code):
do i=1,N
  a(i) = a(i) * c
end do

Assumption: instructions block execution if operands are not available.
Latencies:
  Load operand to register (4 cycles)
  Multiply a(i) with c (2 cycles); a(i), c in registers
  Write back result from register to mem./cache (2 cycles)
  Increase loop counter as long as i is less equal N (0 cycles)

Simple Pseudo Code:
loop:
  load a[i]
  mult a[i] = c, a[i]
  store a[i]
  branch.loop

Optimized Pseudo Code:
loop:
  load a[i+6]
  mult a[i+2] = c, a[i+2]
  store a[i]
  branch.loop
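The optimized issue order can be sketched in C with an explicit prolog / kernel / epilog. This is only a conceptual model (the function name is illustrative, the temporaries stand in for registers, and real software pipelining is done by the compiler on machine instructions):

/* Software-pipelined sketch of a[i] = a[i]*c. t0/t1 model values
 * that are "in flight" between multiply and store. */
void scale(double *a, double c, int n) {
    int i;
    if (n < 2) {                     /* too short to pipeline */
        for (i = 0; i < n; ++i)
            a[i] = a[i] * c;
        return;
    }
    /* Prolog: fill the pipeline with the first two elements */
    double t0 = a[0] * c;
    double t1 = a[1] * c;
    /* Kernel: store one finished element while working two ahead */
    for (i = 0; i + 2 < n; ++i) {
        a[i] = t0;
        t0 = t1;
        t1 = a[i + 2] * c;
    }
    /* Epilog: drain the pipeline */
    a[i]     = t0;
    a[i + 1] = t1;
}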
Software pipelining: a[i]=a[i]*c; N=12

Naive instruction issue: load a[1]; mult a[1]=c,a[1]; store a[1]; load a[2]; mult a[2]=c,a[2]; store a[2]; … – each iteration waits for the previous one to complete: T = 96 cycles.

[Figure: optimized instruction issue over cycles 1-19. Prolog: the loads run ahead and the first mults are started before any store is due. Kernel: in each cycle one load, one mult and one store (on different elements a[1]…a[12]) are issued concurrently. Epilog: the remaining mults and stores drain the pipeline. T = 19 cycles.]
Efficient use of Pipelining

Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into the application (data dependencies) and the processor (latencies of functional units).
Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution.
(Potential) dependencies within the loop body may prevent efficient software pipelining or OOO execution, e.g.:

No dependency:
do i=1,N
  a(i) = a(i) * c
end do

Dependency:
do i=2,N
  a(i) = a(i-1) * c
end do

Pseudo-dependency:
do i=1,N-1
  a(i) = a(i+1) * c
end do
Pipelining: Data dependencies

a[i]=a[i-1]*c; N=12

Naive instruction issue: load a[1]; mult a[2]=c,a[1]; store a[2]; load a[2]; mult a[3]=c,a[2]; store a[3]; … – T = 96 cycles.

[Figure: optimized instruction issue over cycles 1-19. Prolog: load a[1]. Kernel: each mult a[i]=c,a[i-1] must wait for the previous mult to complete before it can start, so only one mult is in the pipeline at a time, with loads and stores overlapped: T = 26 cycles.]

The length of the MULT pipeline determines the throughput!
Fill the pipeline with independent recursive streams…

Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT
MULT pipeline depth: 5 stages → 1 F / 5 cycles for a single recursive update

Thread 0:
do i=1,N
  A(i)=A(i-1)*s
  B(i)=B(i-1)*s
  C(i)=C(i-1)*s
  D(i)=D(i-1)*s
  E(i)=E(i-1)*s
enddo

[Figure: the 5-stage MULT pipe holds one update from each stream at a time, e.g. A(2)*s, B(2)*s, C(1)*s, D(1)*s, E(1)*s.]

5 independent updates on a single core!
Pipelining: Beyond multiplication

Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs.
x86 processors (AMD, Intel): 1 MULT & ADD unit per processor core.
No hardware for div / sqrt / exp / sin … – these are expensive instructions.

“FP costs” in cycles per instruction for the Intel Core2 architecture:

Operation       Latency   Throughput   Cycles/Operation
y=a+y (y=a*y)   3 (5)     1 (1)        0.5*
y=a/y           32        31           15.5*
y=sqrt(y)       29        28           14*
y=sin(y)        >100      >100         >100

Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles).
Pipelining: Potential problems (1)

Hidden data dependencies:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

C/C++ allows “pointer aliasing”, i.e. A = &C[-1] and B = &C[-2] are legal, in which case
C[i] = C[i-1] + C[i-2] – a dependency!
The compiler cannot resolve potential pointer aliasing conflicts on its own!
If no pointer aliasing is used, tell the compiler, e.g.:
  use the –fno-alias switch of the Intel compiler
  pass arguments as (double *restrict A, …) (C99 standard only)
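A minimal sketch of the C99 variant; restrict is the programmer's promise that the three pointers never alias, so the compiler may pipeline and vectorize the loop freely:

/* Same function as above, with restrict-qualified parameters. */
void scale_shift(double *restrict A, double *restrict B,
                 double *restrict C, int n) {
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];   /* no aliasing: iterations are independent */
}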
Pipelining: Potential problems (2)

Simple subroutine/function calls within a loop:

do i=1, N
  call elementprod(A(i), B(i), psum)
  C(i)=psum
enddo
…
function elementprod(a, b, psum)
…
psum=a*b

Inline subroutines! (can be done by the compiler…)

do i=1, N
  psum=A(i)*B(i)
  C(i)=psum
enddo
…
Pipelining: Potential problems (3a)

Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?
Pipelining: Potential problems (3b)

More general – “reduction operations”?

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Figure: as on the earlier slide – A(i) (incl. LD), sum in register xmm1, i the loop counter running to N; ADD the 1st argument to the 2nd argument and store the result in the 2nd argument.]

Benchmark: run the above assembly language kernel with N=32,64,128,…,4096 on a processor with
  3.5 GHz clock speed (ClSp = 3500 Mcycle/s)
  1 pipelined ADD unit (latency 3 cycles)
  1 pipelined LOAD unit (latency 4 cycles)

Best case: 1 cycle per iteration (after 7 iterations).
Pipelining: Potential problems (4)

Expected performance: Throughput * ClockSpeed

Throughput: N/T(N) = N/(L+N)
Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup.
Total runtime: L+N cycles
Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles

[Figure: the LOAD pipeline works on A(i), A(i-1), A(i-2), A(i-3) while the ADD pipeline works on A(i-4), A(i-5), A(i-6).]

Performance for N iterations:
  3500 Mcycle/s * N/(L+N) iterations/cycle
Maximum performance (N → ∞):
  3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s
Pipelining: Potential problems (5)

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Throughput here: N/T(N) = N/(L+3*N)
Why? Dependency on sum: the next instruction needs to wait for the completion of the previous one, so only 1 out of 3 pipeline stages is active: 3 cycles per iteration.

[Figure: the ADD pipeline computes s+A(1) alone; A(2) and A(3) wait, and s = s+A(2) can only start once s+A(1) has left the pipeline.]
Pipelining: Potential problems (6)

Increase pipeline utilization by “loop unrolling”:
“2-way Modulo Variable Expansion” (N is even)

sum1=0.d0
sum2=0.d0
do i=1, N, 2
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
enddo
sum=sum1+sum2

2 out of 3 pipeline stages can be filled:
2 results every 3 cycles ⇒ 1.5 cycles/iteration
Pipelining: Potential problems (7)

4-way Modulo Variable Expansion (MVE) to get best performance (in principle 3-way should do as well).
The sum is split up into 4 independent partial sums (“4-way MVE”):

Nr=4*(N/4)
sum1=0.d0
sum2=0.d0
sum3=0.d0
sum4=0.d0
do i=1, Nr, 4
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
  sum4=sum4+A(i+3)
enddo
do i=Nr+1, N   ! remainder loop
  sum1=sum1+A(i)
enddo
sum=sum1+sum2+sum3+sum4

L=(7+3*3) cycles (cf. previous slides)

The compiler can do this transformation – if it is allowed to do so…
A computer’s floating point arithmetic is not associative!
  (a + b + c + d + e) + f ≠ a + b + c + d + (e + f)
If you require a binary exact result (-fp-model strict), the compiler is not allowed to do this transformation.
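The same 4-way MVE transformation written out in C, as a sketch (note that it changes the order of the floating point additions – exactly the reordering that -fp-model strict forbids):

/* 4-way modulo variable expansion of a sum reduction. */
double vector_sum(const double *A, int N) {
    int Nr = 4 * (N / 4);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < Nr; i += 4) {   /* 4 independent partial sums */
        s0 += A[i];
        s1 += A[i + 1];
        s2 += A[i + 2];
        s3 += A[i + 3];
    }
    for (int i = Nr; i < N; ++i)        /* remainder loop */
        s0 += A[i];
    return s0 + s1 + s2 + s3;           /* changes the summation order! */
}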
Pipelining: The Instruction pipeline

Besides arithmetic & functional units, instruction execution itself is also pipelined, e.g. one instruction performs at least 3 steps:
  Fetch instruction from L1I → Decode instruction → Execute instruction

Hardware pipelining on the processor (all units can run concurrently):

Cycle:  1                2                3                4
        Fetch Instr. 1   Fetch Instr. 2   Fetch Instr. 3   Fetch Instr. 4
                         Decode Instr. 1  Decode Instr. 2  Decode Instr. 3
                                          Execute Instr. 1 Execute Instr. 2  …

Branches can stall this pipeline! (Speculative execution, predication)
Each unit is pipelined itself (cf. Execute = Multiply pipeline).
Pipelining: The Instruction pipeline

Problem: unpredictable branches to other instructions.
Assume the result of instruction 1 determines the next instruction: fetching and decoding of instructions 2 and 3 cannot start until instruction 1 has been executed – the instruction pipeline stalls.

Cycle:  1                2                3                4
        Fetch Instr. 1   Decode Instr. 1  Execute Instr. 1
                                                           Fetch Instr. 2  …
Microprocessors – Superscalar
Superscalar Processors

Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!
Parallel hardware components / pipelines are available to
  fetch / decode / issue multiple instructions per cycle (typically 3-6 per cycle)
  perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
  load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
  perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)

On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware.
Superscalar Processors – Instruction Level Parallelism

Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly.

[Figure: 4-way “superscalar” execution – in every cycle four instructions are fetched from L1I, four are decoded and four are executed, all in parallel pipelines.]

Issuing m concurrent instructions per cycle: m-way superscalar.
Modern processors are 3- to 6-way superscalar and can perform 2 or 4 floating point operations per cycle.
Superscalar Processors – ILP in action

“3-way Modulo Variable Expansion” (N is a multiple of 3):

sum1=0.d0 ! reg. R11
sum2=0.d0 ! reg. R12
sum3=0.d0 ! reg. R13
do i=1, N, 3
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
enddo
sum=sum1+sum2+sum3

[Figure: the LOAD unit fetches A(i)→R0, A(i-1)→R1, A(i-2)→R2, A(i-3)→R3 into the register set while the ADD unit computes R11=R11+R4, R12=R12+R5, R13=R13+R6. Complex register management not shown (R4 contains A(i-4)).]

2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle.

Often cited metrics for superscalar processors:
  Instructions Per Cycle: IPC = 2 above
  Cycles Per Instruction: CPI = 0.5 above
Superscalar processor – Intel Nehalem design

Decode & issue a max. of 4 instructions per cycle: IPC=4, i.e. min. CPI=0.25 cycles/instruction.

Parallel units:
  FP ADD & FP MULT (work in parallel)
  LOAD + STORE (work in parallel)

Max. FP performance: 1 ADD + 1 MULT instruction per cycle
  Max. performance:            A(i) = r0 + r1 * B(i)
  1/2 of max. FP performance:  A(i) = r1 * B(i)
  1/3 of max. FP performance:  A(i) = A(i) + B(i) * C(i)
Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel.
SIMD-processing – Basics

Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers.
x86_64 SIMD instruction sets:
  SSE: register width = 128 bit → 2 double (4 single) precision FP operands
  AVX: register width = 256 bit → 4 double (8 single) precision FP operands
“Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the registers are used.
Integer operands: SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit.
AVX: no support for using the 256 bit register width for integer operations.
SIMD execution = vector execution: if the compiler has vectorized a loop, SIMD instructions are used.
SIMD-processing – Basics

Example: adding two registers holding double precision floating point operands, using a 256 bit register (AVX).

Scalar execution: R2 ← ADD [R0,R1] operates only on the lower 64 bit of the registers: C[0] = A[0] + B[0].

SIMD execution: V64ADD [R0,R1] → R2 adds all four 64 bit entries at once:
  C[0] = A[0] + B[0]; C[1] = A[1] + B[1]; C[2] = A[2] + B[2]; C[3] = A[3] + B[3]

If 128 bit SIMD instructions (SSE) are executed, only half of the register width is used.
SIMD-processing – Basics

Steps (done by the compiler) for “SIMD-processing”:

Original loop:
for(int i=0; i<n; i++)
  C[i]=A[i]+B[i];

“Loop unrolling”:
for(int i=0; i<n; i+=4){
  C[i]  =A[i]  +B[i];
  C[i+1]=A[i+1]+B[i+1];
  C[i+2]=A[i+2]+B[i+2];
  C[i+3]=A[i+3]+B[i+3];}
//remainder loop handling

“Pseudo-Assembler”:
LABEL1:
  VLOAD R0 ← A[i]       // load 256 bits starting at the address of A[i] into R0
  VLOAD R1 ← B[i]
  V64ADD [R0,R1] → R2   // add the corresponding 64 bit entries of R0 and R1; 4 results to R2
  VSTORE R2 → C[i]      // store R2 (256 bits) to the address starting at C[i]
  i ← i+4
  i<(n-4)? JMP LABEL1
//remainder loop handling
SIMD-processing – Basics

No SIMD-processing for loops with data dependencies:
for(int i=0; i<n; i++)
  A[i]=A[i-1]*s;

“Pointer aliasing” may prevent the compiler from SIMD-processing:
void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}
C/C++ allows that A = &C[-1] and B = &C[-2], so
C[i] = C[i-1] + C[i-2]: dependency → no SIMD-processing.
If no pointer aliasing is used, tell the compiler, e.g. use the –fno-alias switch of the Intel compiler → SIMD-processing.
SIMD-processing – Basics

SIMD-processing of a vector sum:
double s=0.0;
for(int i=0; i<n; i++)
  s = s + A[i];

The data dependency on s must be resolved for SIMD-processing (assume AVX). The compiler does the transformation (Modulo Variable Expansion) – if the programmer allows it to do so (e.g. use –O3 instead of –O1):

s0=0.0; s1=0.0; s2=0.0; s3=0.0;
for(int i=0; i<n; i+=4){
  s0 = s0 + A[i];
  s1 = s1 + A[i+1];
  s2 = s2 + A[i+2];
  s3 = s3 + A[i+3];
}
//remainder
s=s0+s1+s2+s3;

In registers: R0 ← (0.d0, 0.d0, 0.d0, 0.d0); each V64ADD(R0,R1) accumulates 4 partial sums in R0; a final “horizontal” ADD sums up the 4 64 bit entries of R0.
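For illustration, the same transformation written by hand with AVX intrinsics – a sketch, not from the slides; it assumes an AVX-capable CPU and compilation with e.g. -mavx, and the function name is hypothetical:

#include <immintrin.h>

/* Vector sum with one AVX accumulator and a scalar remainder loop. */
double vector_sum_avx(const double *A, int n) {
    __m256d acc = _mm256_setzero_pd();       /* R0 <- (0,0,0,0) */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(&A[i]);  /* load 4 doubles */
        acc = _mm256_add_pd(acc, v);         /* "vertical" V64ADD */
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);              /* "horizontal" add at the end */
    double s = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i)                       /* remainder */
        s += A[i];
    return s;
}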
SIMD-processing: What about pipelining?!

Vectorized vector sum (one AVX accumulator):
R0 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 4
  VLOAD A(i:i+3) → R1
  V64ADD(R0,R1) → R0
enddo
sum ← HorizontalADD(R0)

The V64ADD on R0 is still recursive (R0=R0+A(1:4), then R0=R0+A(5:8), while A(9:12) waits), so we need another MVE step to fill the pipeline stages:

R0 ← (0.0,0.0,0.0,0.0); R1 ← (0.0,0.0,0.0,0.0); R2 ← (0.0,0.0,0.0,0.0)
do i=1, N, 12
  VLOAD A(i:i+3)    → R3
  VLOAD A(i+4:i+7)  → R4
  VLOAD A(i+8:i+11) → R5
  V64ADD(R0,R3) → R0
  V64ADD(R1,R4) → R1
  V64ADD(R2,R5) → R2
enddo
…
“Vertical add”: V64ADD(R0,R1) → R0; V64ADD(R0,R2) → R0
“Horizontal add”: sum ← HorizontalADD(R0)
SIMD-processing: What about pipelining?!

[Figure: performance vs. unrolling factor of the vectorized code. One AVX iteration performs 4 (successive) scalar i-iterations; the double precision performance is 4x higher than the “scalar” version, but the start-up phase is much longer.]
Compiler generated AVX code (loop body)

Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration.

Compiler generated “AVX version” (-O3 –xAVX):
  SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
  Pipelining: 8-way MVE of the SIMD code
  32-way unrolling in total
→ 0.25 cycles / iteration
SIMD processing – Vector sum (double precision) – 1 core

SIMD has most impact if data is close to the core – other bottlenecks stay the same!

[Figure: performance vs. location of the “input data” (A[]) in the memory hierarchy. Scalar: code execution in the core is slower than any data transfer. Plain: no SIMD but 4-way MVE. AVX/SIMD: full benefit (close to peak) only if data is in the L1 cache.]

Data parallel SIMD processing:
  Requires independent vector-like operations (“data parallel”)
  The compiler is required to generate “vectorized” code → check the compiler output
  Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
  Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures (see the aligned-allocation sketch below)
  Check also for SIMD LOAD / STORE instructions
  Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle) → SIMD code may improve performance but reduce the measured IPC!
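One way to satisfy such alignment constraints is to allocate arrays on an aligned boundary. A minimal sketch using POSIX posix_memalign (the helper name is hypothetical; _mm_malloc is a common alternative):

#include <stdlib.h>

/* Allocate n doubles on a 32-byte boundary (enough for AVX;
 * 16 bytes would suffice for SSE). Returns NULL on failure. */
double *alloc_aligned(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;   /* release with free(p) as usual */
}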
Data parallel SIMD processing: Boosting performance

Putting it all together for a modern x86_64 based Intel / AMD processor:
  One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos) → maximum 2 FP instructions/cycle
  Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core):
  SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
  SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)
BUT for “scalar” code: 6 GFlop/s (double and single precision)!
There is no single driving force for single core performance!

Maximum floating point (FP) performance: Pcore = F * S * n
  F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S = FP ops / instruction: 4 (dp) / 8 (sp) (256 bit SIMD registers – “AVX”)
  n = clock speed: ~2.5 GHz
⇒ P = 20 GF/s (dp) / 40 GF/s (sp)

Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp / sp)
⇒ P = 5 GF/s (dp / sp)
SIMD registers: floating point (FP) data and beyond

Possible data types in an SSE (128 bit) register:
  integer: 16x 8 bit, 8x 16 bit, 4x 32 bit, 2x 64 bit, 1x 128 bit
  floating point: 4x 32 bit, 2x 64 bit

AVX (256 bit) only applies to FP data:
  floating point: 8x 32 bit, 4x 64 bit
Rules for vectorizable loops / SIMD processing

1. Countable
2. Single entry and single exit
3. Straight line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
1. Simple inner loops with unit stride
2. Minimize indirect addressing
3. Align data structures (SSE: 16 bytes, AVX: 32 bytes)
4. In C, use the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
  Non-contiguous memory access
  Data dependencies
How to leverage vectorization / SIMD

In order of increasing complexity and efficiency:
  The compiler does it for you (aliasing, alignment, language)
  Source code directives (pragmas) to ease the compiler’s job
  Alternative programming models for compute kernels (OpenCL, ispc)
  Intrinsics (restricted to C/C++)
  Implement directly in assembler
Vectorization and the Intel compiler

The Intel compiler will try to use SIMD instructions when enabled to do so (“poor man’s vector computing”).
The compiler can emit messages about vectorized loops (not by default):
  plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!).
Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. –fp-model strict for the vector sum, or pointer aliasing).
You can use source code directives to provide more information to the compiler.
Vectorization compiler options

The compiler will vectorize starting with –O2.
To enable specific SIMD extensions use the –x option:
  -xSSE2: vectorize for SSE2 capable machines
  (available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
  -xAVX: on Sandy Bridge processors
Recommended option:
  -xHost: will optimize for the architecture you compile on
Compiling for AMD Opteron: use plain –O3, as the -x options may involve CPU type checks at runtime!
Vectorization source code directives (pragmas)

Fine-grained control of loop vectorization. Use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive:
  #pragma vector always – vectorize even if it seems inefficient (hint!)
  #pragma novector – do not vectorize even if possible
  #pragma vector nontemporal – use NT stores when allowed (i.e. alignment conditions are met)
  #pragma vector aligned – specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!) – see the sketch below
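For illustration, a hypothetical loop annotated with one of these directives; the arrays really must be aligned (e.g. allocated as sketched earlier), otherwise the program may crash or compute garbage:

/* Assumes A, B, C point to 16-byte aligned allocations. */
void triad(double *A, double *B, double *C, double s, int n) {
    #pragma vector aligned   /* promise: all accesses below are aligned */
    for (int i = 0; i < n; ++i)
        A[i] = B[i] + s * C[i];
}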
User mandated vectorization (pragmas)

Starting with Intel Compiler 12.0, the simd pragma is available:
#pragma simd enforces vectorization where the other pragmas fail.
Prerequisites:
  Countable loop
  Innermost loop
  Must conform to the for-loop style of OpenMP worksharing constructs
There are additional clauses: reduction, vectorlength, private – refer to the compiler manual for further details.
NOTE: with #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
  x = x + A[i];
}
Basic approach to check the instruction code

Get the assembler code (Intel compiler):
  icc –S –O3 -xHost triad.c -o triad.s
Disassemble the executable:
  objdump –d ./cacheBench | less

Things to check for:
  Is the code vectorized? Search for the pd/ps suffix:
    mulpd, addpd, vaddpd, vmulpd
  Is the data loaded with 16 byte moves?
    movapd, movaps, vmovupd
  For memory-bound code: search for nontemporal stores:
    movntpd, movntps

The x86 ISA is documented in:
  Intel Software Development Manual (SDM) 2A and 2B
  AMD64 Architecture Programmer's Manual Vol. 1-5
Some basics of the x86-64 ISA

16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15; they alias the eight 32 bit registers eax, ebx, ecx, edx, esi, edi, esp, ebp.
Floating point SIMD registers:
  xmm0-xmm15: SSE (128 bit), alias with the 256 bit registers
  ymm0-ymm15: AVX (256 bit)

SIMD instructions are distinguished by:
  AVX (VEX) prefix: v
  Operation: mul, add, mov
  Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
  Data range: packed (p), scalar (s)
  Data type: single (s), double (d)
Some basic single core optimizations – warnings first

“Premature optimization is the root of all evil.” – Donald E. Knuth
“Parallel performance is easy, single node/core performance is difficult.” – Bill Gropp
Single core: Common sense optimizations (1)

Do less work! Reducing the work to be done is never a bad idea!

! Check if at least one element is below THRESHOLD
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then
    flag=.true.
  endif
enddo

! Check if at least one element is below THRESHOLD – and EXIT the do-loop
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then
    flag=.true.
    EXIT
  endif
enddo
Single core: Common sense optimizations (2)

Avoid expensive operations!
FP MULT & FP ADD are the fastest way to compute.
Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup.
Avoid a one-to-one implementation of the algorithm, e.g. A = A + B**2 (A, B float):
  A = A + B**2.0   (V1)
  A = A + B * B    (V2)
(V1) is not a good idea: B**2.0 → exp{2.0*ln(B)}
  1. Computing exp & ln is very expensive
  2. What if B < 0?!
Most useful if data is close to the CPU!
Single core: Common sense optimizations (3)

Shrink the working set!
Working on small data sets reduces the data transfer volume and increases the probability of cache hits.
Analyze whether appropriate data types are used, e.g. if 4 different particle species have to be distinguished:
  integer spec(1:N) with spec(i) = {0,1,2,3}: sizeof(spec(1:N)) = 4*N byte
  OR use a 1-byte integer datatype: integer*1 spec(1:N): sizeof(spec(1:N)) = N byte
  OR use 2 bits for each species: integer*1 spec(1:N/4): sizeof(spec(1:N)) = N/4 byte (see the packing sketch below)
Strongly depends on the application!
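A C sketch of the 2-bit variant (hypothetical helpers; four species codes 0-3 are packed into each byte):

#include <stdint.h>

/* Read species code i from the packed array (4 codes per byte). */
static inline int get_spec(const uint8_t *spec, int i) {
    return (spec[i / 4] >> (2 * (i % 4))) & 0x3;
}

/* Write species code v (0..3) at index i. */
static inline void set_spec(uint8_t *spec, int i, int v) {
    int shift = 2 * (i % 4);
    spec[i / 4] = (uint8_t)((spec[i / 4] & ~(0x3 << shift))
                            | ((v & 0x3) << shift));
}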
Single core: Common sense optimizations (4a)

Elimination of common subexpressions!
This often reduces the MFLOP/s rate but improves runtime.
In principle the compiler should do the job, but do not rely on it:
  Associativity rules may prevent the compiler from doing so
  The compiler may not recognize the subexpression (limited scope)
An example is sketched below.
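A hypothetical example of such a common subexpression; the compiler may or may not hoist it, so doing it by hand is safe:

#include <math.h>

/* Before: the common subexpression is evaluated twice */
void before(double x, double y, double a, double b,
            double *d1, double *d2) {
    *d1 = a / sqrt(x * x + y * y);
    *d2 = b / sqrt(x * x + y * y);
}

/* After: hoist it into a temporary -- fewer operations, same result */
void after(double x, double y, double a, double b,
           double *d1, double *d2) {
    double r = sqrt(x * x + y * y);
    *d1 = a / r;
    *d2 = b / r;
}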
Single core: Common sense optimizations (4b)

Replace expensive functions by table lookup:

do iter = …
  …
  do i=1,n
    …
    edelz=iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)   ! entries: -1,0,1
    BF= 0.5d0*(1.d0+tanh(edelz))                ! edelz = -6,-5,…,-1,0,1,…,5,6
    …
  enddo
  …
enddo

Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6).
Single core: Common sense optimizations (4c)

Replace expensive functions by table lookup:

do i=-6,6
  tanh_table(i) = tanh(dble(i))
enddo

do iter = …
  …
  do i=1,n
    …
    edelz=iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF= 0.5d0*(1.d0+ tanh_table(edelz))
    …
  enddo
  …
enddo
Single core: Common sense optimizations (5)

Avoid branches! Support the compiler in understanding and optimizing your code!
A code change may enable vectorization, SIMD & other optimizations.
BTW: software pipelining is also much easier, and no branch prediction is required.
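A hypothetical before/after example: the branch is replaced by a conditional assignment (a select), which compilers can map onto SIMD blend/mask instructions:

/* Branchy version: harder to software-pipeline / vectorize */
void clamp_branchy(double *A, int n) {
    for (int i = 0; i < n; ++i)
        if (A[i] < 0.0)
            A[i] = 0.0;
}

/* Branch-free version: a select instead of a conditional jump */
void clamp_branchless(double *A, int n) {
    for (int i = 0; i < n; ++i)
        A[i] = (A[i] < 0.0) ? 0.0 : A[i];
}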