Transcript Document

Understanding the TigerSHARC ALU pipeline
Determining the speed of one stage of an IIR filter – Part 5
What syntax makes the code more parallel?

M. Smith, ECE, University of Calgary, Canada -- 4/7/2016
Understanding the TigerSHARC Parallel Operations

TigerSHARC has many pipelines
Review of how the COMPUTE pipeline works
Interaction of memory (data) operations with COMPUTE operations
Specialized C++ compiler options and #pragmas (will be covered by an individual student presentation)
Optimized assembly code and optimized C++
Processor Architecture

[Block diagram: 3 128-bit data busses, 2 integer ALUs, and 2 computational blocks, each containing an ALU (float and integer), a SHIFTER, a MULTIPLIER and a COMMUNICATIONS CLU.]
Use C++ IIR code as comments

Things to think about prior to code writing -- register name reorganization:

Keep XR4 for xInput – saves a cycle.

Put S1 and S2 into XR0 and XR1 -- a chance to fetch 2 memory values in one cycle using L[ ]:
float *copyStateStartAddress = state;  // remember where the state values start
S1 = *state++;                         // read S1 from state[0]
S2 = *state++;                         // read S2 from state[1]
*copyStateStartAddress++ = S1;         // write S1 back to state[0]
*copyStateStartAddress++ = S2;         // write S2 back to state[1]

Put H0 to H5 in XR12 to XR16 -- a chance to fetch 4 memory values in one cycle using Q[ ], followed by one normal fetch.
-- Problem – if there is more than one IIR stage, then the second stage's fetches are not quad aligned.

There are two sets of multiplications using S1 and S2. Can these be done in the X and Y compute blocks in one cycle?
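
For reference, the C++ code for the IIR stage itself is not reproduced in this transcript. Below is a minimal sketch of one plausible form of the stage being discussed: a Direct Form II biquad with the two state variables S1 and S2. The function and array names are illustrative assumptions, not the course code, and only five coefficients appear here even though the slides name H0 to H5.

// Hypothetical sketch of one second-order IIR (biquad) stage, Direct Form II.
// Names (iirStageSketch, xInput, state, coeff) are illustrative assumptions.
float iirStageSketch(float xInput, float *state, const float *coeff)
{
    float S1 = state[0];                    // candidate for XR0
    float S2 = state[1];                    // candidate for XR1

    // First set of multiplications using S1 and S2 (feedback taps)
    float w = xInput - coeff[0] * S1 - coeff[1] * S2;

    // Second set of multiplications using S1 and S2 (feedforward taps)
    float yOutput = coeff[2] * w + coeff[3] * S1 + coeff[4] * S2;

    state[0] = w;                           // new S1 written back to memory
    state[1] = S1;                          // new S2 written back to memory

    return yOutput;                         // candidate for XR8 (Yout)
}

The two product pairs involving S1 and S2 are the "two sets of multiplications" mentioned above, which is why issuing them to the X and Y compute blocks in the same cycle looks attractive.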
Register name conversion done in steps

Setting Xin = XR4 and Yout = XR8 saves one cycle.

Bulk conversion with no error.

So many errors were made during the bulk conversion that I went to find / replace / test for each register individually.
Fix bringing state variables in

QUESTION: We have
    XR18 = [J6 += 1]   (load S1)
and
    R19 = [J6 += 1]    (load S2)
Both are valid. What is the difference?
That difference – could it be used to our advantage?

XR18 = [J6 += 1];;
    Reads the value at memory location [J6], and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR18.

XYR19 = [J6 += 1];;
    Reads the value at memory location [J6], and updates J6 to J6 + 1 after the fetch. Stores the fetched value in XR19 AND YR19.

XYR19 = L[J6 += 2];;   -- concept correct – but executes faster
    Reads the value at [J6], updates J6 to J6 + 1, and stores it in XR19,
    AND reads the value at the (new) J6, updates J6 to J6 + 1, and stores it in XYR19,
    PROVIDED J6 was originally aligned on a 64-bit boundary.
Send state variables out

Go for the gusto – use L[ ] (64-bit).

Need to recalculate the test result: state[1] is NOT Yout.
Working solution – Part 1
Working solution – Part 2
Working solution – Part 3
I could not spot where any extra stalls would occur because of memory pipeline reads and writes. All values were in place when needed. Need to check with the pipeline viewer.
Let's look at DATA MEMORY and COMPUTE pipeline issues -- 1

No problems here.
Weird stuff happening with the INSTRUCTION pipeline

Only 9 instructions are being fetched, but we are executing 21! Why all these instruction stalls?
Analysis

We are seeing the impact of the processor doing quad-fetches of instructions (128 bits) into the IAB (instruction alignment buffer). Once in the IAB, the 32-bit instructions are issued to the various execution units as needed.
Before we do any further optimization, we need to understand processor parallelism

We already know about parallel multiplications and additions and their associated stalls. What about parallel memory fetches?
Parallel memory fetches

What is permissible? Can we do
    parallel fetches into XY at the same time?
    parallel fetches into an X and a Y register?
    parallel fetches into two X registers?
Parallel memory syntax – not too difficult

Only this syntax is illegal. Will need to do more research to discover whether "legal" means that the operation is performed without stalling the memory pipeline.

NOTE: Need to transfer INPAR3 (J6) into a K register (K6) in order to be able to use both the J and K data busses during the IIR operation.
Question: How do you (in C++) place IIR coefficients in one memory block and state values into another?
Question: How do you (in assembly code) place IIR coefficients in one memory block and state values into another?
The C++ manual talks about 2 data spaces (dm and pm) for STATIC or GLOBAL variables
BAD

You can use the VDSP C++ extension pm to specify a different memory space. HOWEVER, there is no such thing as a pm stack, so all pm variables must be declared "static" or "global".

dm arrays can be placed on the stack, but there may be alignment issues.
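
As a concrete illustration of these two points, here is a minimal sketch using the dm and pm keywords described above. The array names and sizes are illustrative assumptions, not the course code.

// Sketch only: coefficients placed in pm space, state values in dm space,
// so the two arrays live in different memory blocks.
static float pm iirCoeffs[6];   // H0 to H5 -- must be static or global: there is no pm stack
static float dm iirState[2];    // S1 and S2 -- dm space

void example(void)
{
    // float pm badLocal[4];    // ILLEGAL -- pm data cannot live on the stack
    float dm okLocal[2];        // legal -- dm arrays may go on the stack, but
                                // alignment for L[ ] / Q[ ] accesses is not guaranteed
    okLocal[0] = iirState[0] * iirCoeffs[0];
    okLocal[1] = iirState[1] * iirCoeffs[1];
    iirState[0] = okLocal[0];
    iirState[1] = okLocal[1];
}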
The assembler manual says something similar, but different
VDSP C++ extensions

dm and pm parameters are still being passed into functions via J5 and J6 as before.

Notice the very big difference in the "absolute addresses", indicating that the data blocks are in very different memory spaces. The data memory addresses are also very different from the instruction memory space, so the processor can do an instruction fetch and 2 data fetches at the same time.
IIR function using TigerSHARC C++ DSP extensions dm and pm
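
The function listing shown on this slide is not reproduced in the transcript. Below is a minimal sketch of what the dm/pm version of the stage might look like, repeating the biquad arithmetic sketched earlier with only the pointer qualifiers changed. All names are illustrative assumptions, not the course code.

// Sketch only: the coefficient pointer is qualified pm and the state pointer dm,
// so the compiler knows the two arrays sit in different memory blocks and can
// schedule their fetches independently.
float iirStage_dm_pm(float xInput, float dm *state, float pm *coeff)
{
    float S1 = state[0];
    float S2 = state[1];

    float w = xInput - coeff[0] * S1 - coeff[1] * S2;               // feedback taps
    float yOutput = coeff[2] * w + coeff[3] * S1 + coeff[4] * S2;   // feedforward taps

    state[0] = w;
    state[1] = S1;
    return yOutput;
}

As noted above, the dm and pm pointer parameters are still passed into the function via J5 and J6.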
Using dm and pm produces code that is a little more parallel than using dm only
From the TigerSHARC TS201 programming reference manual
Memory block operations will need to be explored in more detail later