Impact of TigerSHARC pipeline 4


Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of IIR filter – Part 4
IIR operation with Memory
Understanding the TigerSHARC ALU pipeline

TigerSHARC has many pipelines
Review of how the COMPUTE pipeline works
Interaction of memory (data) operations with COMPUTE operations

What do we want to be able to do?
The problems we are expecting to have to solve
Using the pipeline viewer to see what really happens
Changing code practices to get better performance

Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
Optimized assembly code and optimized C++

4/5/2016
Speed IIR -- stage 4
M. Smith, ECE, University of Calgary, Canada
Processor Architecture

3 128-bit data busses
2 Integer ALU
2 Computational Blocks
  ALU (Float and integer)
  SHIFTER
  MULTIPLIER
  COMMUNICATIONS CLU
Simple Example
IIR -- Biquad

For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0

[Figure: biquad diagram with nodes labelled S0, S1, S2]

This code does not handle filtering of real data, but is good for discussing basic ALU pipeline issues without handling memory pipeline issues
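
A minimal C++ sketch of one pass of the recurrence above, useful as a reference when checking assembly results by hand. The struct `BiquadCoeffs` and the function `biquadStep` are hypothetical names, not taken from the course code; only the four assignment statements come from the slide.

```cpp
#include <cassert>

// Coefficients H0..H5 as named on the slide; the caller supplies the values.
struct BiquadCoeffs {
    float h0, h1, h2, h3, h4, h5;
};

// One pass of the slide's recurrence (hypothetical helper).
// s1 and s2 are the state values and are updated in place.
float biquadStep(float xin, const BiquadCoeffs& h, float& s1, float& s2) {
    float s0 = xin * h.h5 + s2 * h.h3 + s1 * h.h4;    // S0 = Xin*H5 + S2*H3 + S1*H4
    float yout = s0 * h.h0 + s1 * h.h1 + s2 * h.h2;   // Yout = S0*H0 + S1*H1 + S2*H2
    s2 = s1;                                          // S2 = S1
    s1 = s0;                                          // S1 = S0
    return yout;
}
```

Calling `biquadStep` once per stage inside a loop gives the "For (Stages = 0 to 3)" structure; how the stages are chained is not specified on the slide and is left open here.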
Rewrite tests so that the IIR( ) function can take parameters
Can repeatedly call this function to handle real data.
Do in C++ now to validate the logic of the design.
Use modified tests for C++ code to validate optimized assembly code
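
A sketch of what a parameterised version might look like, assuming the state is passed as a two-element array {S1, S2} and the coefficients as a six-element array {H0..H5}. The signature of `IIR` and the helper `filterBlock` are assumptions for illustration; the slides only show that the function takes parameters and is called repeatedly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical parameterised IIR stage.
// state  -> {S1, S2}, read on entry and written back on exit
// coeffs -> {H0, H1, H2, H3, H4, H5}
float IIR(float xin, float* state, const float* coeffs) {
    float* copyStateStartAddress = state;  // remember where the state began
    float s1 = *state++;
    float s2 = *state++;
    float s0 = xin * coeffs[5] + s2 * coeffs[3] + s1 * coeffs[4];
    float yout = s0 * coeffs[0] + s1 * coeffs[1] + s2 * coeffs[2];
    *copyStateStartAddress++ = s0;         // new S1 = S0
    *copyStateStartAddress++ = s1;         // new S2 = old S1
    return yout;
}

// Repeated calls handle a block of real data, one sample per call.
void filterBlock(const float* in, float* out, std::size_t n,
                 float* state, const float* coeffs) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = IIR(in[i], state, coeffs);
    }
}
```

Because the state lives in caller-owned memory, the same tests can later be pointed at the assembly version without changing the test data.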
Rewrite the “C++ code”
I leave the old “fixed” values in as comments until I can get the new code version to work.
Proved useful this time as the code failed
Why did it fail to return the correct value?
Explore design issues – memory ops
Probable memory stalls expected
XR0 = 0.0;            // Set Fsum = 0;
XR1 = [J1 += 1];      // Fetch a coefficient from memory
XFR2 = R1 * R4;       // Multiply by Xinput (XR4)
XFR0 = R0 + R2;       // Add to sum
XR3 = [J1 += 1];      // Fetch a coefficient from memory
XR5 = [J2 += 1];      // Fetch a state value from memory
XFR5 = R3 * R5;       // Multiply coeff and state
XFR0 = R0 + R5;       // Perform a sum
XR5 = XR12;           // Update a state variable (dummy)
XR12 = XR13;          // Update a state variable (dummy)
[J3 += 1] = XR12;     // Store state variable to memory
[J3 += 1] = XR5;      // Store state variable to memory
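
For readers more comfortable in C++, the dummy sequence above can be read roughly as follows. This is only a sketch of the data flow: the names `DummyRegs` and `memoryOpsSketch` are hypothetical, pointer post-increment stands in for `[Jn += 1]`, and nothing here models timing.

```cpp
#include <cassert>

// Stand-ins for XR12 and XR13, which the slide uses as dummy state registers.
struct DummyRegs {
    float r12;
    float r13;
};

// Hypothetical C++ reading of the twelve-instruction sequence.
float memoryOpsSketch(float xInput, const float* coeffs, const float* stateIn,
                      float* stateOut, DummyRegs& regs) {
    float sum = 0.0f;          // XR0 = 0.0
    float c0 = *coeffs++;      // XR1 = [J1 += 1]
    sum += c0 * xInput;        // XFR2 = R1 * R4;  XFR0 = R0 + R2
    float c1 = *coeffs++;      // XR3 = [J1 += 1]
    float s = *stateIn++;      // XR5 = [J2 += 1]
    sum += c1 * s;             // XFR5 = R3 * R5;  XFR0 = R0 + R5
    float r5 = regs.r12;       // XR5 = XR12
    regs.r12 = regs.r13;       // XR12 = XR13
    *stateOut++ = regs.r12;    // [J3 += 1] = XR12
    *stateOut++ = r5;          // [J3 += 1] = XR5
    return sum;
}
```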
Looking much better.
Use 10 nops to flush the instruction pipeline to get the pipeline into a state we can “understand”
May not be the “real” picture for working software
Pipeline performance predicted
When you start reading values from memory, expect a 1 cycle delay before the fetched value is available for use within the COMPUTE block
COMPUTE operations – 1 cycle delay expected if the next instruction needs the result of the previous instruction
When you have adjacent memory accesses (read or write), does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?
[J1 += 1];; works just fine here (no delay).
Worry about [J1 += J4];; another day
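
The prediction rules above can be turned into a small counting model. What follows is a sketch under loud assumptions, not a description of the real hardware: one instruction issues per cycle, every register-writing instruction (load, compute, or register move) makes its result usable two cycles after issue, stores wait for their source register in the same way, and J-register updates never stall. The names `Instr`, `countStalls` and `slideSequence` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One instruction: the data register it writes ("" for a store) and the
// data registers it reads. J-register updates are deliberately not tracked.
struct Instr {
    std::string dest;
    std::vector<std::string> srcs;
};

// Counts stall cycles for an in-order, one-instruction-per-cycle model in
// which a result written at cycle t is first usable at cycle t + resultLatency.
int countStalls(const std::vector<Instr>& program, int resultLatency = 2) {
    std::map<std::string, int> readyAt;  // register -> first cycle its value is usable
    int nextCycle = 0;                   // earliest cycle the next instruction may issue
    int stalls = 0;
    for (const Instr& instr : program) {
        int issue = nextCycle;
        for (const std::string& src : instr.srcs) {
            auto it = readyAt.find(src);
            if (it != readyAt.end()) issue = std::max(issue, it->second);
        }
        stalls += issue - nextCycle;
        if (!instr.dest.empty()) readyAt[instr.dest] = issue + resultLatency;
        nextCycle = issue + 1;
    }
    return stalls;
}

// The twelve instructions from the memory-ops slide, in order.
std::vector<Instr> slideSequence() {
    return {
        Instr{"R0", {}},              // XR0 = 0.0
        Instr{"R1", {}},              // XR1 = [J1 += 1]
        Instr{"R2", {"R1", "R4"}},    // XFR2 = R1 * R4
        Instr{"R0", {"R0", "R2"}},    // XFR0 = R0 + R2
        Instr{"R3", {}},              // XR3 = [J1 += 1]
        Instr{"R5", {}},              // XR5 = [J2 += 1]
        Instr{"R5", {"R3", "R5"}},    // XFR5 = R3 * R5
        Instr{"R0", {"R0", "R5"}},    // XFR0 = R0 + R5
        Instr{"R5", {"R12"}},         // XR5 = XR12
        Instr{"R12", {"R13"}},        // XR12 = XR13
        Instr{"", {"R12"}},           // [J3 += 1] = XR12
        Instr{"", {"R5"}},            // [J3 += 1] = XR5
    };
}
```

Under these assumptions `countStalls(slideSequence())` reports 5 stall cycles for the 12 instructions. The pipeline viewer remains the authority on what the processor actually does.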
Use C++ IIR code as comments when developing ASM
Things to think about “PREOPTIMIZATION”
Register name reorganization

Keep XR4 for xInput – save a cycle
Put S1 and S2 into XR0 and XR1
-- chance to fetch 2 memory values in one cycle using L[ ]

float *copyStateStartAddress = state;
S1 = *state++;
S2 = *state++;
*copyStateStartAddress++ = S1;
*copyStateStartAddress++ = S2;

Put H0 to H5 in XR12 to XR16
-- chance to fetch 4 memory values in one cycle using Q[ ] followed by one normal fetch
-- Problems – if more than one IIR stage then the second stage fetches are not quad aligned
There are two sets of multiplications using S1 and S2. Can these be done in X and Y compute blocks in one cycle?
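
The quad-alignment problem can be checked with a little arithmetic. The sketch below assumes the coefficients for all stages are packed back to back in one array that itself starts on a quad-word boundary, and that each stage uses five values (one Q[ ] fetch plus one normal fetch, as described above). The function names are hypothetical.

```cpp
#include <cassert>

constexpr int kWordsPerQuad = 4;  // a Q[ ] access moves four 32-bit words

// Word offset of the first coefficient of a given stage, assuming the
// coefficients of all stages are packed back to back (hypothetical layout).
int stageWordOffset(int stage, int coeffsPerStage) {
    return stage * coeffsPerStage;
}

// True when a Q[ ] fetch at this word offset would be quad aligned,
// assuming the array itself starts on a quad-word boundary.
bool isQuadAligned(int wordOffset) {
    return wordOffset % kWordsPerQuad == 0;
}
```

With five values per stage, stage 0 starts at word 0 (aligned) but stage 1 starts at word 5 (not aligned); the next aligned start is stage 4 at word 20.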
New assembly code – step 1
Things to think about for optimization
Register name reorganization

Make copy of COMPUTE optimized code
float IIRASM_Memory(void);
Change the register names and make sure that it still works

Keep XR4 for xInput – save a cycle
Put S1 and S2 into XR10 and XR11
-- chance to fetch 2 memory values in one cycle using L[ ]
Put H0 to H5 in XR12 to XR16
-- chance to fetch 4 memory values in one cycle using Q[ ] followed by one normal fetch
-- Problems – if more than one IIR stage then the second stage fetches are not quad aligned
There are two sets of multiplications using S1 and S2. Can these be done in X and Y compute blocks in one cycle?
Write new tests
NOTE: If the new register names don’t overlap with the old names, the name conversion is very straightforward
Register name conversion done in steps – test each time
Setting Xin = XR4 and Yout = XR8 saves one cycle
Bulk conversion of registers used with coefficients
with no error
So many errors were made during bulk conversion that I went to find/replace/test for each register individually
Update tests to use IIRASM_Memory( ) version with real memory access
Syntax error missed during code review indicates that logical errors were probably missed too.
Finding the required mangled name
Fix bringing state variables in
QUESTION
We have
XR18 = [J6 += 1] (load S1)
and
R19 = [J6 += 1] (load S2)
Both are valid syntax
What is the difference?
Send state variables out
Go for the gusto – use L[ ] (64-bit)

L[J7 += 2] = XR19:18;;

Need to recalculate the test result as state[1] (inner filter value) is NOT the same as Yout (output).
Design defect reflected in Test defect
Redo calculation for value stored as S1

S0 = Xin + S1 * H4 + S2 * H3
   = 5.5 + 2 * 5 + 3 * 4
S1 = S0

[Figure: biquad diagram with nodes labelled S0, S1, S2]

Expect stored value of 27.5
Need to fix test of state values after function
CHECK(state[0] == 27.5);
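
The hand calculation can be double-checked with a few lines of C++. This is a sketch that simply evaluates the expression shown on this slide with the slide's numbers (Xin = 5.5, S1 = 2, S2 = 3, H4 = 5, H3 = 4); the function name `storedS1` is hypothetical.

```cpp
#include <cassert>

// Value written back as the new S1, following the slide's expression
// S0 = Xin + S1 * H4 + S2 * H3 (hypothetical helper for checking the test).
float storedS1(float xin, float s1, float s2, float h3, float h4) {
    return xin + s1 * h4 + s2 * h3;
}
```

`storedS1(5.5f, 2.0f, 3.0f, 4.0f, 5.0f)` evaluates 5.5 + 10 + 12, which matches the 27.5 expected by the corrected CHECK. All the values involved are exactly representable as floats, so an exact comparison is safe for this particular test.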

Working solution -- Part 1
Working Solution -- Part 2
Working solution – Part 3
I could not spot where any extra stalls would occur because of memory pipeline reads and writes
All values were in place when needed
Need to check with pipeline viewer
Let's look at DATA MEMORY and COMPUTE pipeline issues -- 1
No problems here
Let's look at DATA MEMORY and COMPUTE pipeline issues -- 2
No problems here
Weird stuff happening with INSTRUCTION pipeline
Only 9 instructions being fetched but we are executing 21!
Why all these instruction stalls?
Adjust pipeline view for a closer look.
Adjust disassembler window
Analysis of where the “missing” TigerSHARC instructions went

We are seeing the impact of the processor doing high speed quad-fetches of instructions (128 bits) into the IAB (instruction alignment buffer)

Once in the IAB, the instructions (32 bits each) are issued to the various execution units as needed.
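
A quick calculation shows why the fetch count can be much smaller than the executed-instruction count. The sketch assumes the first instruction sits at the start of a 128-bit line and that every fetched word is a useful instruction; the function name `minQuadFetches` is hypothetical.

```cpp
#include <cassert>

constexpr int kFetchBits = 128;        // width of one quad-fetch, from the slide
constexpr int kInstructionBits = 32;   // width of one instruction, from the slide
constexpr int kInstructionsPerFetch = kFetchBits / kInstructionBits;  // 4

// Minimum number of quad-fetches needed to bring in n instructions,
// assuming the first instruction starts a 128-bit line (rounds up).
int minQuadFetches(int instructionCount) {
    return (instructionCount + kInstructionsPerFetch - 1) / kInstructionsPerFetch;
}
```

For the 21 executed instructions this gives a lower bound of 6 quad-fetches. The 9 seen in the pipeline viewer is above that bound; the sketch does not model code alignment or any fetching beyond the last useful instruction.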
Note the fetch into the next
subroutine despite return (CJMP)
Note that the processor continues to fetch “the wrong” instructions – how do we (can we) recover lost cycles?
Understanding the TigerSHARC ALU pipeline

TigerSHARC has many pipelines
Review of how the COMPUTE pipeline works
Interaction of memory (data) operations with COMPUTE operations

What do we want to be able to do?
The problems we are expecting to have to solve
Using the pipeline viewer to see what really happens
Changing code practices to get better performance

Specialized C++ compiler options and #pragmas (will be covered by individual student presentation)
Optimized assembly code and optimized C++