
Computer Architecture for Medical Applications
Pipelining & Single Instruction Multiple Data – two driving factors of single core performance

Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Dietmar Fey, Department for Computer Science
30 April 2013

A different view on computer architecture
From high level code to macro-/microcode execution

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Figure: the compiler translates this loop into machine code – A(i) is loaded (incl. LD), sum is held in register xmm1, i is the loop counter running up to N, and the ADD execution unit adds the 1st argument to the 2nd argument and stores the result in the 2nd argument]

How does high level code interact with execution units?

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

• Many hardware execution units:
  • LOAD (STORE) operands from L1 cache (register) to register (memory)
  • Floating Point (FP) MULTIPLY and ADD
  • Various integer units
• Execution units may work in parallel → “superscalar” processor
• Two important concepts at the hardware level: pipelining + SIMD

Microprocessors – Pipelining
Introduction: Moore’s law

1965: G. Moore claimed that the number of transistors on a processor chip doubles every 12-24 months.

Transistor counts today: Intel Nehalem EX: 2.3 billion; nVIDIA FERMI: 3 billion.

Introduction: Moore’s law  faster cycles and beyond
 Moore’s law  transistors are getting smaller  run them faster
 Faster clock speed  Reduce complexity of instruction execution
 Pipelining of instructions
Frequency [MHz]
Increasing transistor count
and clock speed allows /
requires architectural
changes:
10000
1000
Intel x86 clock speed
• Pipelining
100
• Superscalarity
10
• SIMD / Vector ops
20
09
20
03
19
99
19
95
19
91
19
87
19
83
19
79
0,1
19
75
19
71
1
Year
30. April 2013
CAMA 2013 - D. Fey and G. Wellein
• Multi-Core/Threading
• Complex on chip
caches
7
Pipelining of arithmetic/functional units

• Idea:
  • Split a complex instruction into several simple / fast steps (stages)
  • Each step takes the same amount of time, e.g. a single cycle
  • Execute different steps of different instructions at the same time (in parallel)
• Allows for shorter cycle times (simpler logic circuits), e.g.:
  • a floating point multiplication takes 5 cycles, but
  • the processor can work on 5 different multiplications simultaneously
  • one result per cycle after the pipeline is full
• Drawbacks:
  • Pipeline must be filled → startup time (#instructions >> pipeline stages)
  • Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
  • Requires complex instruction scheduling by compiler/hardware → software pipelining / out-of-order execution
• Pipelining is widely used in modern computer architectures

Interlude: Possible stages for Multiply

• Real numbers can be represented as mantissa and exponent in a “normalized” representation, e.g. s * 0.m * 10^e with
  • sign s = {-1, 1}
  • mantissa m, which does not contain 0 in the leading digit
  • exponent e, some positive or negative integer
• Multiply two real numbers r1 * r2 = r3 with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:
  (s1 * 0.m1 * 10^e1) * (s2 * 0.m2 * 10^e2) = (s1*s2) * (0.m1*0.m2) * 10^(e1+e2)
• Normalize the result: s3 * 0.m3 * 10^e3

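To make the stage split concrete, here is a small C sketch (our illustration, not from the slides) that performs a multiplication in explicit mantissa/exponent stages, using the standard library functions frexp and ldexp (which work in base 2 rather than the base-10 notation above):

#include <math.h>
#include <stdio.h>

/* Multiply r1*r2 in explicit stages, mirroring the pipeline above.
   frexp/ldexp use base 2; the slide uses base 10 for illustration. */
double staged_mult(double r1, double r2) {
    int e1, e2, e3;
    double m1 = frexp(r1, &e1);   /* stage 1: separate mantissa and exponent */
    double m2 = frexp(r2, &e2);   /*          (the sign travels with the mantissa) */
    double m3 = m1 * m2;          /* stage 2: multiply mantissas */
    e3 = e1 + e2;                 /* stage 3: add exponents */
    return ldexp(m3, e3);         /* stages 4/5: normalize result, insert sign */
}

int main(void) {
    printf("%g\n", staged_mult(-1.5, 4.0));  /* prints -6 */
    return 0;
}
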
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

Stage                Cycle 1    Cycle 2    Cycle 3    Cycle 4    Cycle 5    Cycle 6    ...  Cycle N+4
Separate Mant./Exp.  B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  B(5),C(5)  B(6),C(6)  ...
Mult. Mantissas                 B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  B(5),C(5)  ...
Add Exponents                              B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  ...
Normalize Result                                      B(1),C(1)  B(2),C(2)  B(3),C(3)  ...
Insert Sign                                                      A(1)       A(2)       ...  A(N)

The first result is available after 5 cycles (= latency of the pipeline)!
After that, one result is completed in each cycle.

5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

[Figure: wind-up/-down phases – empty pipeline stages while the pipeline fills at the start of the loop and drains at the end]

Pipelining: Speed-up and throughput

• Assume a general m-stage pipe, i.e. the pipeline depth is m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:
  Tseq / Tpipe = (m*N) / (N+m) ~ m for large N (N >> m)
• Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:
  N / Tpipe(N) = N / (N+m) = 1 / (1 + m/N)
• Throughput for large N: N / Tpipe(N) ~ 1
• Number of independent operations NC required to achieve Tp results per cycle:
  Tp = 1 / (1 + m/NC)  <=>  NC = Tp * m / (1 - Tp)
• For Tp = 0.5: NC = m

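A minimal sketch (ours, directly evaluating the formulas above) showing how speed-up and throughput approach m and 1 as N grows:

#include <stdio.h>

int main(void) {
    int m = 5;  /* pipeline depth (stages) */
    for (int N = 8; N <= 4096; N *= 8) {
        double speedup    = (double)(m * N) / (N + m);  /* Tseq / Tpipe */
        double throughput = (double)N / (N + m);        /* results per cycle */
        printf("N=%4d  speed-up=%.2f  throughput=%.3f\n", N, speedup, throughput);
    }
    /* independent operations needed for Tp = 0.9 results/cycle: NC = Tp*m/(1-Tp) */
    double Tp = 0.9;
    printf("NC for Tp=%.1f: %.1f\n", Tp, Tp * m / (1.0 - Tp));
    return 0;
}
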
Throughput as a function of pipeline stages

[Figure: throughput curves for varying m = #pipeline stages; the 90% pipeline efficiency level is marked]

Software pipelining

• Example (Fortran code):

do i=1,N
  a(i) = a(i) * c
end do

• Assumption: instructions block execution if their operands are not available.
• Latencies:
  • Load operand from memory/cache to register: 4 cycles
  • Multiply a(i) with c (a(i), c in registers): 2 cycles
  • Write back result from register to memory/cache: 2 cycles
  • Increase loop counter while i <= N: 0 cycles

Simple pseudo code:
loop:
  load a[i]
  mult a[i] = c, a[i]
  store a[i]
  branch.loop

Optimized pseudo code (loads and multiplies run ahead of the stores):
loop:
  load a[i+6]
  mult a[i+2] = c, a[i+2]
  store a[i]
  branch.loop

Software pipelining: a[i]=a[i]*c; N=12

Naive instruction issue – each iteration runs to completion before the next one starts:
  load a[1] (4 cycles) → mult a[1]=c,a[1] (2 cycles) → store a[1] (2 cycles) → load a[2] → …
  8 cycles per iteration → T = 96 cycles

Optimized instruction issue:
  Prolog: the first loads and multiplies fill the pipeline
  Kernel: in each cycle one load a[i+6], one mult a[i+2]=c,a[i+2] and one store a[i] are issued
  Epilog: the remaining multiplies and stores drain the pipeline
  T = 19 cycles

Efficient use of pipelining

• Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into the application (data dependencies) and the processor (latencies of the functional units)
• Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution
• (Potential) dependencies within the loop body may prevent efficient software pipelining or OOO execution, e.g.:

No dependency:
do i=1,N
  a(i) = a(i) * c
end do

Dependency:
do i=2,N
  a(i) = a(i-1) * c
end do

Pseudo-dependency:
do i=1,N-1
  a(i) = a(i+1) * c
end do

Pipelining: Data dependencies

a[i]=a[i-1]*c; N=12

Naive instruction issue – as before, each iteration runs to completion: T = 96 cycles

Optimized instruction issue – the loads can still be hoisted, but every mult a[i+1]=c,a[i] needs the result of the previous multiply, so a new multiply can start only every 2 cycles (the MULT latency); only the stores overlap:
  Prolog: load a[1]
  Kernel: mult a[2]=c,a[1] … mult a[3]=c,a[2] | store a[2] … mult a[4]=c,a[3] | store a[3] …
  T = 26 cycles

The length of the MULT pipeline determines the throughput.

Fill the pipeline with independent recursive streams…

Intel Sandy Bridge (desktop), 4 cores, 3.5 GHz, SMT
MULT pipeline depth: 5 stages → 1 Flop / 5 cycles for a single recursive update

Thread 0:
do i=1,N
  A(i)=A(i-1)*s
  B(i)=B(i-1)*s
  C(i)=C(i-1)*s
  D(i)=D(i-1)*s
  E(i)=E(i-1)*s
enddo

[Figure: the MULT pipeline holds one update from each stream at a time, e.g. A(2)*s, B(2)*s, C(1)*s, D(1)*s, E(1)*s in flight simultaneously]

5 independent updates on a single core!

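A hedged C translation of the same trick (array and variable names are ours): five independent recurrences are interleaved so that the 5-stage MULT pipeline always has work in every stage:

/* Five independent recursive streams keep a 5-stage multiply pipeline full;
   a single stream alone completes one multiply only every 5 cycles. */
void five_streams(double *A, double *B, double *C, double *D, double *E,
                  double s, int N) {
    for (int i = 1; i < N; i++) {
        A[i] = A[i-1] * s;  /* each statement depends only on its own stream, */
        B[i] = B[i-1] * s;  /* so all five multiplies can be in flight in     */
        C[i] = C[i-1] * s;  /* different pipeline stages at the same time     */
        D[i] = D[i-1] * s;
        E[i] = E[i-1] * s;
    }
}
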
Pipelining: Beyond multiplication

• Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs
• x86 processors (AMD, Intel): 1 MULT & 1 ADD unit per processor core
• No dedicated hardware for div / sqrt / exp / sin … → expensive instructions
• “FP costs” in cycles per instruction for the Intel Core2 architecture:

Operation        Latency    Throughput    Cycles/Operation
y=a+y (y=a*y)    3 (5)      1 (1)         0.5*
y=a/y            32         31            15.5*
y=sqrt(y)        29         28            14*
y=sin(y)         >100       >100          >100

• Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)

Pipelining: Potential problems (1)

• Hidden data dependencies:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

• C/C++ allows “pointer aliasing”, i.e. A = &C[-1]; B = &C[-2] is legal
  → then C[i] = C[i-1] + C[i-2] → dependency!
• The compiler cannot resolve potential pointer aliasing conflicts on its own!
• If no “pointer aliasing” is used, tell the compiler, e.g.
  • use the –fno-alias switch of the Intel compiler
  • pass arguments as (double *restrict A, …) (C99 standard only)
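A sketch of the restrict variant mentioned above (C99): the qualifier is a promise by the programmer that the pointers do not overlap, which the compiler cannot verify itself:

/* C99: restrict promises that A, B and C never alias, so the compiler
   may software-pipeline and vectorize the loop freely. */
void scale_shift(double *restrict A, double *restrict B,
                 double *restrict C, int n) {
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
}
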
Pipelining: Potential problems (2)

• Simple subroutine/function calls within a loop:

do i=1, N
  call elementprod(A(i), B(i), psum)
  C(i)=psum
enddo
…
function elementprod(a, b, psum)
…
psum=a*b

• Inline subroutines! (can be done by the compiler…)

do i=1, N
  psum=A(i)*B(i)
  C(i)=psum
enddo
…

Pipelining: Potential problems (3a)

Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?

Pipelining: Potential problems (3b)

• More general – “reduction operations”?

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Figure: as before – A(i) is loaded (incl. LD), sum is held in register xmm1, i is the loop counter running to N; ADD adds the 1st argument to the 2nd argument and stores the result in the 2nd argument]

• Benchmark: run the above assembly language kernel with N=32, 64, 128, …, 4096 on a processor with
  • 3.5 GHz clock speed (ClSp = 3500 Mcycle/s)
  • 1 pipelined ADD unit (latency 3 cycles)
  • 1 pipelined LOAD unit (latency 4 cycles)
• Best case: 1 cycle per iteration (after 7 iterations)

Pipelining: Potential problems (4)

• Expected performance: throughput * clock speed
• Throughput: N/T(N) = N/(L+N)
  Assumption: L is the total latency of one iteration, and one result is delivered per cycle after pipeline startup.
• Total runtime: L+N cycles
• Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles

[Figure: the LOAD pipeline holds A(i), A(i-1), A(i-2), A(i-3); the ADD pipeline holds A(i-4), A(i-5), A(i-6)]

Performance for N iterations:
  3500 MHz * (N / (L+N)) iterations/cycle

Maximum performance (N → ∞):
  3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s

Pipelining: Potential problems (5)

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Throughput here: N/T(N) = N/(L+3*N). Why?

• Dependency on sum → the next instruction needs to wait for the completion of the previous one → only 1 out of 3 ADD pipeline stages is active (e.g. s+A(1) in flight while A(2), A(3) wait)
• → 3 cycles per iteration

Pipelining: Potential problems (6)

• Increase pipeline utilization by “loop unrolling”:

“2-way Modulo Variable Expansion” (N is even):
sum1=0.d0
sum2=0.d0
do i=1, N, 2
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
enddo
sum=sum1+sum2

• 2 out of 3 pipeline stages can be filled
• 2 results every 3 cycles → 1.5 cycles/iteration

Pipelining: Potential problems (7)

• 4-way Modulo Variable Expansion (MVE) to get the best performance (in principle 3-way should do as well)
• The sum is split up into 4 independent partial sums

“4-way MVE”:
Nr=4*(N/4)
sum1=0.d0
sum2=0.d0
sum3=0.d0
sum4=0.d0
do i=1, Nr, 4
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
  sum4=sum4+A(i+3)
enddo
do i=Nr+1, N     ! remainder loop
  sum1=sum1+A(i)
enddo
sum=sum1+sum2+sum3+sum4

• The compiler can do this transformation, if it is allowed to do so…
• Computer floating point arithmetic is not associative!
  ((((a + b) + c) + d) + e) + f ≠ (((a + b) + c) + d) + (e + f)
• If you require binary exact results (-fp-model strict), the compiler is not allowed to do this transformation
• L = (7 + 3*3) cycles (cf. previous slide)

Pipelining: The instruction pipeline

• Besides the arithmetic/functional units, instruction execution itself is also pipelined; e.g., one instruction performs at least 3 steps:
  Fetch instruction from L1I → Decode instruction → Execute instruction

• Hardware pipelining on the processor (all units can run concurrently):

Cycle 1: Fetch Instruction 1
Cycle 2: Fetch Instruction 2 | Decode Instruction 1
Cycle 3: Fetch Instruction 3 | Decode Instruction 2 | Execute Instruction 1
Cycle 4: Fetch Instruction 4 | Decode Instruction 3 | Execute Instruction 2
…

• Branches can stall this pipeline! (speculative execution, predication)
• Each unit is pipelined itself (cf. Execute = multiply pipeline)

Pipelining: The instruction pipeline

• Problem: unpredictable branches to other instructions

Assume the result of an instruction determines the next instruction:

Cycle 1: Fetch Instruction 1
Cycle 2: Decode Instruction 1
Cycle 3: Execute Instruction 1
Cycle 4: Fetch Instruction 2 (only now is the next instruction known)
…

Microprocessors – Superscalar
Superscalar Processors

• Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!
• Parallel hardware components / pipelines are available to
  • fetch / decode / issue multiple instructions per cycle (typically 3-6 per cycle)
  • perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
  • load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
  • perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)
• On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware

Superscalar Processors – Instruction Level Parallelism

• Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly

[Figure: 4-way “superscalar” pipeline – in every cycle, four instructions are fetched from L1I, four are decoded, and four are executed, with the fetch/decode/execute stages overlapping as in the scalar instruction pipeline]

• Issuing m concurrent instructions per cycle: m-way superscalar
• Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle

Superscalar Processors – ILP in action

“3-way Modulo Variable Expansion” (N is a multiple of 3):

sum1=0.d0 ! → reg. R11
sum2=0.d0 ! → reg. R12
sum3=0.d0 ! → reg. R13
do i=1, N, 3
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
enddo
sum=sum1+sum2+sum3

[Figure: the LOAD unit streams A(i)→R0, A(i-1)→R1, A(i-2)→R2, A(i-3)→R3 into the register set while the ADD unit executes R11=R11+R4, R12=R12+R5, R13=R13+R6]

• Complex register management is not shown (R4 contains A(i-4))
• 2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle
• Often cited metrics for superscalar processors:
  • Instructions Per Cycle: IPC = 2 above
  • Cycles Per Instruction: CPI = 0.5 above

Superscalar processor – Intel Nehalem design

• Decode & issue a max. of 4 instructions per cycle: IPC=4, i.e. min. CPI=0.25 cycles/instruction
• Parallel units:
  • FP ADD & FP MULT (work in parallel)
  • LOAD + STORE (work in parallel)
• Max. FP performance: 1 ADD + 1 MULT instruction per cycle
  • Max. performance: A(i) = r0 + r1 * B(i)
  • 1/2 of max. FP performance: A(i) = r1 * B(i)
  • 1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)

Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel

SIMD-processing – Basics

• Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers
• x86_64 SIMD instruction sets:
  • SSE: register width = 128 bit → 2 double (4 single) precision FP operands
  • AVX: register width = 256 bit → 4 double (8 single) precision FP operands
• “Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the register are used
• Integer operands: SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
  • AVX: no support for using the 256 bit register width for integer operations
• SIMD execution → vector execution
• If the compiler has vectorized a loop → SIMD instructions are used

SIMD-processing – Basics

• Example: adding two registers holding double precision floating point operands, using 256 bit registers (AVX):

Scalar execution: R2 ← ADD [R0,R1] operates only on the lower 64 bit of the 256 bit registers:
  C[0] = A[0] + B[0]

SIMD execution: V64ADD [R0,R1] → R2 adds all four 64 bit lanes at once:
  C[0] = A[0] + B[0]
  C[1] = A[1] + B[1]
  C[2] = A[2] + B[2]
  C[3] = A[3] + B[3]

• If 128 bit SIMD instructions (SSE) are executed → only half of the register width is used

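For illustration (not part of the slides), the V64ADD picture can be written in C with the real AVX compiler intrinsics:

#include <immintrin.h>

/* C[0:3] = A[0:3] + B[0:3] with one 256-bit AVX addition,
   mirroring V64ADD [R0,R1] -> R2 above. */
void add4(const double *A, const double *B, double *C) {
    __m256d r0 = _mm256_loadu_pd(A);    /* load 4 doubles into a ymm register */
    __m256d r1 = _mm256_loadu_pd(B);
    __m256d r2 = _mm256_add_pd(r0, r1); /* 4 additions with one instruction */
    _mm256_storeu_pd(C, r2);            /* store 256 bits back to memory */
}
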
SIMD-processing – Basics

• Steps (done by the compiler) for “SIMD processing”:

Original loop:
for(int i=0; i<n; i++)
  C[i]=A[i]+B[i];

“Loop unrolling”:
for(int i=0; i<n; i+=4){
  C[i]  =A[i]  +B[i];
  C[i+1]=A[i+1]+B[i+1];
  C[i+2]=A[i+2]+B[i+2];
  C[i+3]=A[i+3]+B[i+3];
}
//remainder loop handling

“Pseudo-assembler”:
LABEL1:
  VLOAD R0 ← A[i]       // load 256 bits starting at the address of A[i] into R0
  VLOAD R1 ← B[i]
  V64ADD [R0,R1] → R2   // add the corresponding 64 bit entries of R0 and R1, store the 4 results in R2
  VSTORE R2 → C[i]      // store R2 (256 bits) to the address starting at C[i]
  i ← i+4
  i<(n-4)? JMP LABEL1
//remainder loop handling

SIMD-processing – Basics

• No SIMD processing for loops with data dependencies:

for(int i=0; i<n; i++)
  A[i]=A[i-1]*s;

• “Pointer aliasing” may prevent the compiler from SIMD processing:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

• C/C++ allows that A = &C[-1] and B = &C[-2]
  → C[i] = C[i-1] + C[i-2]: dependency → no SIMD processing
• If no “pointer aliasing” is used, tell the compiler, e.g. use the –fno-alias switch of the Intel compiler → SIMD processing

SIMD-processing – Basics

• SIMD processing of a vector sum:

double s=0.0;
for(int i=0; i<n; i++)
  s = s + A[i];

The data dependency on s must be resolved for SIMD processing. The compiler does the transformation (Modulo Variable Expansion) – if the programmer allows it to do so (e.g. use –O3 instead of –O1):

s0=0.0; s1=0.0; s2=0.0; s3=0.0;
for(int i=0; i<n; i+=4){
  s0 = s0 + A[i];
  s1 = s1 + A[i+1];
  s2 = s2 + A[i+2];
  s3 = s3 + A[i+3];
}
//remainder
s=s0+s1+s2+s3;

With AVX the four partial sums live in the lanes of one register:
  R0 ← (0.d0, 0.d0, 0.d0, 0.d0)
  …
  V64ADD(R0,R1) → R0
  …
  “Horizontal” ADD: sum up the 4 64 bit entries of R0

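As a sketch (ours) of the final “horizontal” step, written with real AVX/SSE intrinsics:

#include <immintrin.h>

/* Sum the 4 double-precision lanes of a ymm register. */
static double horizontal_add(__m256d r0) {
    __m128d lo  = _mm256_castpd256_pd128(r0);   /* lower 128 bits */
    __m128d hi  = _mm256_extractf128_pd(r0, 1); /* upper 128 bits */
    __m128d s   = _mm_add_pd(lo, hi);           /* two partial sums */
    __m128d swp = _mm_unpackhi_pd(s, s);        /* move the high lane down */
    return _mm_cvtsd_f64(_mm_add_sd(s, swp));   /* final scalar sum */
}
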
SIMD-processing: What about pipelining?!

Vectorized sum (AVX – 4 partial sums in the lanes of R0):
R0 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 4
  VLOAD A(i:i+3) → R1
  V64ADD(R0,R1) → R0
enddo
sum ← HorizontalADD(R0)

The V64ADD still recurses on R0, so another MVE step is needed to fill the pipeline stages:

R0 ← (0.0,0.0,0.0,0.0)
R1 ← (0.0,0.0,0.0,0.0)
R2 ← (0.0,0.0,0.0,0.0)
do i=1, N, 12
  VLOAD A(i:i+3)   → R3
  VLOAD A(i+4:i+7) → R4
  VLOAD A(i+8:i+11) → R5
  V64ADD(R0,R3) → R0   ! “vertical adds”
  V64ADD(R1,R4) → R1
  V64ADD(R2,R5) → R2
enddo
V64ADD(R0,R1) → R0
V64ADD(R0,R2) → R0
sum ← HorizontalADD(R0)  ! “horizontal add”

SIMD-processing: What about pipelining?!

[Figure: performance vs. unrolling factor of the vectorized code – 1 AVX iteration performs 4 (successive) i-iterations; double precision performance is 4x higher than the “scalar” version, but the start-up phase is much longer]

Compiler generated AVX code (loop body)

• Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration
• Compiler generated “AVX version” (-O3 -xAVX):
  • SIMD processing: vaddpd on %ymm8 → 4 dp operands (4-way unrolling)
  • Pipelining: 8-way MVE of the SIMD code
  • 32-way unrolling in total → 0.25 cycles / iteration

SIMD processing – Vector sum (double precision) – 1 core

[Figure: performance vs. location of the input data A[] in the memory hierarchy. SIMD has the most impact if the data is close to the core – other bottlenecks stay the same. Scalar: code execution in the core is slower than any data transfer. Plain: no SIMD, but 4-way MVE. AVX/SIMD: full benefit only if the data is in the L1 cache. The peak level is marked.]

Data parallel SIMD processing

• Requires independent vector-like operations (“data parallel”)
• The compiler is required to generate “vectorized” code → check the compiler output
• Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
  • Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures
  • Check also for SIMD LOAD / STORE instructions
• Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle)
  • SIMD code may improve performance but reduce the IPC!

Data parallel SIMD processing: Boosting performance

• Putting it all together: a modern x86_64 based Intel / AMD processor:
  • One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos)
    → maximum 2 FP instructions/cycle
  • Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions
    → 2 (4) double precision FP operations per SSE (AVX) instruction
    → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core):
  SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
  SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)
BUT for “scalar” code: 6 GFlop/s (double and single precision)!

There is no single driving force for single core performance!

Maximum floating point (FP) performance: Pcore = F * S * n
  F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S = FP operations per instruction: 4 (dp) / 8 (sp) (256 bit SIMD registers – “AVX”)
  n = clock speed: ~2.5 GHz
  → P = 20 GF/s (dp) / 40 GF/s (sp)

Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp/sp)
  → P = 5 GF/s (dp/sp)

SIMD registers: floating point (FP) data and beyond

• Possible data types in an SSE (128 bit) register:
  • integer: 16 x 8 bit, 8 x 16 bit, 4 x 32 bit, 2 x 64 bit, 1 x 128 bit
  • floating point: 4 x 32 bit, 2 x 64 bit
• AVX (256 bit) only applies to FP data: 8 x 32 bit, 4 x 64 bit

Rules for vectorizable loops / SIMD processing

1. Countable
2. Single entry and single exit
3. Straight-line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
1. Simple inner loops with unit stride
2. Minimized indirect addressing
3. Aligned data structures (SSE: 16 bytes, AVX: 32 bytes)
4. In C, the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
• Non-contiguous memory access
• Data dependencies

How to leverage vectorization / SIMD

(complexity and efficiency increase from top to bottom)
• The compiler does it for you (aliasing, alignment, language)
• Source code directives (pragmas) to ease the compiler’s job
• Alternative programming models for compute kernels (OpenCL, ispc)
• Intrinsics (restricted to C/C++)
• Implement directly in assembler

Vectorization and the Intel compiler

• The Intel compiler will try to use SIMD instructions when enabled to do so → “poor man’s vector computing”
• The compiler can emit messages about vectorized loops (not by default):
  plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
• Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!)
• Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. –fp-model strict for the vector sum, or pointer aliasing)
• You can use source code directives to provide more information to the compiler

Vectorization compiler options

• The compiler will vectorize starting with –O2
• To enable specific SIMD extensions, use the –x option:
  • -xSSE2: vectorize for SSE2-capable machines
    (available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
  • -xAVX: on Sandy Bridge processors
• Recommended option: -xHost will optimize for the architecture you compile on
• Compiling for AMD Opteron: use plain –O3, as the -x options may involve CPU type checks at runtime!

Vectorization source code directives (pragmas)

• Fine-grained control of loop vectorization
• Use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive
  • #pragma vector always: vectorize even if it seems inefficient (hint!)
  • #pragma novector: do not vectorize even if possible
  • #pragma vector nontemporal: use NT stores when allowed (i.e. alignment conditions are met)
  • #pragma vector aligned: specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)
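
A small usage sketch (the loop is ours, not from the slides) applying one of the Intel-specific pragmas listed above:

/* Hypothetical loop: promise the compiler that all array accesses are
   aligned so it can use aligned SIMD loads/stores without peeling.
   DANGEROUS if a or b is in fact not aligned! */
void scale(double *a, const double *b, double s, int n) {
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}
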
User mandated vectorization (pragmas)

• Starting with Intel Compiler 12.0, the simd pragma is available
• #pragma simd enforces vectorization where the other pragmas fail
• Prerequisites:
  • Countable loop
  • Innermost loop
  • Must conform to the for-loop style of OpenMP worksharing constructs
• There are additional clauses: reduction, vectorlength, private
• Refer to the compiler manual for further details
• NOTE: with #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
  x = x + A[i];
}

Basic approach to check the instruction code

• Get the assembler code (Intel compiler):
  icc -S -O3 -xHost triad.c -o triad.s
• Disassemble the executable:
  objdump -d ./cacheBench | less
• Things to check for:
  • Is the code vectorized? Search for the pd/ps suffix:
    mulpd, addpd, vaddpd, vmulpd
  • Is the data loaded with 16-byte moves?
    movapd, movaps, vmovupd
  • For memory-bound code: search for nontemporal stores:
    movntpd, movntps

The x86 ISA is documented in:
  Intel Software Development Manual (SDM) 2A and 2B
  AMD64 Architecture Programmer's Manual Vol. 1-5

Some basics of the x86-64 ISA

16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15
  the first eight alias with the 32 bit register set: eax, ebx, ecx, edx, esi, edi, esp, ebp

Floating point SIMD registers:
  xmm0-xmm15: SSE (128 bit), alias with the lower halves of the 256 bit registers
  ymm0-ymm15: AVX (256 bit)

SIMD instructions are distinguished by:
  AVX (VEX) prefix: v
  Operation: mul, add, mov
  Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
  Data range: packed (p), scalar (s)
  Data type: single (s), double (d)
Example: vaddpd = v (VEX/AVX prefix) + add + p (packed) + d (double)

Some basic single core optimizations – warnings first

“Premature optimization is the root of all evil.” – Donald E. Knuth

“Parallel performance is easy, single node/core performance is difficult.” – Bill Gropp

Single core: Common sense optimizations (1)

• Do less work!
  Reducing the work to be done is never a bad idea:

logical :: flag
flag = .false.
do i=1,N
  if (complex_func(A(i)) < THRESHOLD) then
    flag = .true.        ! check if at least one is true
  endif
enddo

logical :: flag
flag = .false.
do i=1,N
  if (complex_func(A(i)) < THRESHOLD) then
    flag = .true.        ! check if at least one is true
    EXIT                 ! and EXIT the do-loop
  endif
enddo

Single core: Common sense optimizations (2)

• Avoid expensive operations!
  FP MULT & FP ADD are the fastest ways to compute
  • Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup
  • Avoid a one-to-one implementation of the algorithm, e.g. A = A + B**2 (A, B float):
    • A = A + B**2.0 (V1)
    • A = A + B * B (V2)
  (V1) is not a good idea: B**2.0 → exp{2.0*ln(B)}
  1. Computing exp & ln is very expensive
  2. What if B < 0?!
• Most useful if the data is close to the CPU!
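
One common instance of this rule (our sketch): if the divisor is loop-invariant, a single expensive divide plus cheap multiplies replaces one divide per iteration. Note the result may differ in the last bits, which is why compilers only do this with relaxed floating point models:

/* Before: one ~30-cycle divide per iteration.
   After: a single divide, then one ~5-cycle multiply per iteration. */
void scale_div(double *A, double y, int n) {
    double rcp = 1.0 / y;      /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        A[i] = A[i] * rcp;     /* instead of A[i] / y */
}
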
Single core: Common sense optimizations (3)

• Shrink the working set!
  Working on small data sets reduces the data transfer volume and increases the probability of cache hits
• Analyze whether appropriate data types are used, e.g. if 4 different particle species have to be distinguished:
  • integer spec(1:N) with spec(i) = {0,1,2,3} → sizeof(spec(1:N)) = 4*N byte
  • OR use a 1-byte integer datatype: integer*1 spec(1:N) → sizeof(spec(1:N)) = N byte
  • OR use 2 bits for each species: integer*1 spec(1:N/4) → sizeof(spec(1:N/4)) = N/4 byte
• Strongly depends on the application!
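
The same idea could look like this in C (types, names and the particle count are ours):

#include <stdint.h>

#define NPART 100000           /* illustrative particle count */

int     spec_int[NPART];       /* 4 bytes per particle: 400,000 bytes */
uint8_t spec_u8[NPART];        /* 1 byte per particle: 100,000 bytes  */
uint8_t spec_packed[NPART/4];  /* 2 bits per particle: 25,000 bytes,
                                  at the price of shift/mask operations
                                  on every access */
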
Single core: Common sense optimizations (4a)

• Eliminate common subexpressions!
  This often reduces the MFLOP/s rate but improves runtime
• In principle the compiler should do the job, but do not rely on it!
  • Associativity rules may prevent the compiler from doing so
  • The compiler may not recognize the subexpression (limited scope)
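
A minimal illustration (the expressions are ours) of a common subexpression that may be worth factoring out by hand:

#include <math.h>

/* Before: sqrt(dx*dx + dy*dy) is evaluated twice. */
double unit_sum_before(double dx, double dy) {
    return dx / sqrt(dx*dx + dy*dy) + dy / sqrt(dx*dx + dy*dy);
}

/* After: the common subexpression is computed once (and the divides
   are replaced by one reciprocal and two multiplies). */
double unit_sum_after(double dx, double dy) {
    double invr = 1.0 / sqrt(dx*dx + dy*dy);
    return dx * invr + dy * invr;
}
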
Single core: Common sense optimizations (4b)

• Replace expensive functions by table lookup:

do iter = …
  …
  do i=1,n
    …
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)   ! entries are -1, 0, 1
    BF = 0.5d0*(1.d0+tanh(edelz))                 ! edelz = -6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6
    …
  enddo
  …
enddo

• Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6)

Single core: Common sense optimizations (4c)

• Replace expensive functions by table lookup:

do i=-6,6
  tanh_table(i) = tanh(dble(i))   ! fill the table once
enddo

do iter = …
  …
  do i=1,n
    …
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF = 0.5d0*(1.d0 + tanh_table(edelz))
    …
  enddo
  …
enddo

Single core: Common sense optimizations (5)

• Avoid branches!
  Support the compiler in understanding and optimizing your code!
• A code change may enable vectorization, SIMD & other optimizations
• BTW: software pipelining also becomes much easier, and no branch prediction is required
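
A sketch of such a code change (our example, since the slide's code figure is not preserved): replacing an if inside the loop body by an expression the compiler can map to a conditional move or a SIMD mask:

/* Branchy variant: the if inside the loop can hinder software
   pipelining and vectorization. */
void clamp_branch(double *A, int n) {
    for (int i = 0; i < n; i++)
        if (A[i] < 0.0)
            A[i] = 0.0;
}

/* Branch-free variant: the ternary maps to a conditional move or a
   SIMD max, with no branch prediction needed in the loop body. */
void clamp_nobranch(double *A, int n) {
    for (int i = 0; i < n; i++)
        A[i] = (A[i] < 0.0) ? 0.0 : A[i];
}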