TigerSHARC assembly code development using test driven

A first attempt at learning about optimizing the TigerSHARC code
TigerSHARC assembly syntax
What we NOW KNOW!
• Can we return from an assembly language routine without crashing the processor?
• Return a parameter from an assembly language routine
  (Is it the same for ints and floats?)
• Pass parameters into assembly language
  (Is it the same for ints and floats?)
• Do IF THEN ELSE statements
• Read and write values to memory
• Read and write values in a loop
• Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )
3/27/2016
TigerSHARC assemble code 3,
M. Smith, ECE, University of Calgary, Canada
2 / 28
Not bad for a first effort
Faster than compiler in debug mode
Need to learn from the compiler how to speed up the code
How does the compiler do it?
Look at the source code and use mixed mode to show the assembly the compiler generated
• Warning – the instructions are displayed out of order
Many new instructions. Many parallel instructions. The ones inside the loop are key.
How important is it to code whether a conditional jump is predicted or not predicted (NP)?
BIG – 25%
523 → 435
Many new instructions. Many parallel instructions. The ones inside the loop are key.
JMP (NP) 523 → 435
XR1 not J1 435 → 491
How important is not using J registers when reading from memory?
XR1 rather than J1
Now need
• Condition XALT rather than JLT
• XCOMP rather than COMP
Many new instructions. Many parallel instructions. The ones inside the loop are key.
JMP (NP) 523 → 435
XR1 not J1 435 → 491
and ++ operator 491 → 435
How important is not using J registers as a destination when reading from memory, and using pointers (*pt++) rather than array indexing (pt[count])?
XR1 rather than J1
Now need
• Condition XALT rather than JLT
• XCOMP rather than COMP
Redoing our code to this point.
Note the new instructions using XR2 and R2
Try a little thing. R2 = 0 is a constant – move it outside the loop
Found we had already set R2 = 0 outside the loop
Difference: the instruction ran on about half of the passes round the loop – expect an improvement of about 12 cycles
Got 491 → 476 = 15 – timing is only accurate to around 10 cycles
The IF THEN JUMPS in the loop are killing us.
Rewrite the C++ code into optimized form
• Reduce the loop size from 6 (if > 0) or 7 (if < 0) to 4 either way.
• Loop runs 24 times – expect an improvement of about 48 cycles
We go from 476 to 250 cycles
That’s 226 cycles, or roughly 9 cycles saved each time around the loop
The jumps were costing us about 9 cycles each by disrupting the TigerSHARC pipeline
Need to get rid of this jump and counter increment.
Blackfin has hardware loops
Does the TigerSHARC? – Duh!!
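A minimal C++ sketch of the kind of rewrite described on this slide, not the course's actual code. The function names, the `int` sample type and the argument order are assumptions made for illustration. The first version keeps the IF THEN ELSE inside the loop; the second does the same amount of work on every pass and walks the arrays with pointers, which gives the compiler a chance to avoid a jump in the loop body (whether it actually does so depends on the compiler and its settings).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "before" version: an IF THEN ELSE inside the loop.
void HalfWaveRectifyBranchy(const int* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > 0) {
            out[i] = in[i];
        } else {
            out[i] = 0;
        }
    }
}

// Hypothetical "after" version: the same work on every pass, written as a
// single select, using pointer post-increment instead of array indexing.
void HalfWaveRectifySelect(const int* in, int* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (; n > 0; --n) {
        const int v = *in++;
        *out++ = (v > 0) ? v : 0;
    }
}
```

Both versions produce identical output, so the same tests can be run against either one before comparing timings.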
Many new instructions. Many parallel instructions. The ones inside the loop are key.
JMP (NP) 523 → 435
XR1 not J1 491
and ++ operator 435
Remove inner jumps from loop 250
Hardware loop instructions
LC0 = loop counter 0 – there may be only a few hardware loops possible
SHARC ADSP-21061 allows 6, Blackfin ADSP-BF5XX allows 2, so we still need to understand software loops
IF LC0E → if hardware loop expired; IF NLC0E → if not expired – MM!!
With hardware loops – 166 cycles!
Are we cooking or what!
Fine tuning – can we save N cycles (1 each time round the loop) by merging instructions?
Merge those two instructions and use our fancy SIGN-BIT trick for float code
We are beating the optimized compiler on the float code by a factor of 2
We need 1 cycle to beat the compiler on the optimized int code
Find it for Assignment 1
I did 138 cycles
My code passes the tests in 138 cycles
Extra 11 cycles from outside the loop (not worth the time and effort to chase if the loop were larger, or there were more points to process)
• Does turning off the cache make any difference to our code?
• Find out in Assignment 1
What is the theoretical maximum speed?
• This is something I always work out BEFORE optimizing. I have a target to meet – normally, finish all processing before the next sample comes in.
• If my code (in theory) can’t meet that target, I need to find a different approach, not spend days optimizing useless code.
• In theory – if I have written the code with no hidden stalls – 1 cycle per instruction
• 6 instructions outside the loop
• 4 instructions inside the loop – N * 4 cycles
• Very short loop – I have read that getting out of a very short loop stalls the pipeline – let’s add 5 cycles for that
• 6 + 24 * 4 + 5 = 107 in theory, 138 in practice
• Difference 31 – in the region of 24, or roughly 1 stall each time round the loop
• Can use the pipeline viewer to find out where the problem is occurring. In a long loop, done 4096 times, it might be worth it.
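The estimate above is simple arithmetic, so it is easy to keep next to the timing tests. A small sketch follows; the struct and function names are invented for illustration, and the numbers are the ones quoted on this slide.

```cpp
#include <cassert>

// Hypothetical helper type for a back-of-envelope cycle budget.
struct LoopBudget {
    int instructions_outside_loop;  // executed once
    int instructions_inside_loop;   // executed on every pass
    int passes;                     // times round the loop (N)
    int loop_exit_stall;            // allowance for leaving a very short loop
};

// Best case: one cycle per instruction and no hidden stalls.
int theoretical_cycles(const LoopBudget& b) {
    assert(b.passes >= 0);
    return b.instructions_outside_loop
         + b.passes * b.instructions_inside_loop
         + b.loop_exit_stall;
}

// Cycles per pass that the best-case estimate does not account for.
double unexplained_cycles_per_pass(int measured, const LoopBudget& b) {
    assert(b.passes > 0);
    return static_cast<double>(measured - theoretical_cycles(b)) / b.passes;
}
```

With `LoopBudget{6, 4, 24, 5}` the estimate is 107 cycles. Against the measured 138 cycles that leaves 31 cycles unexplained, a little over one per pass.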

Trying to understand what we have done
• Most TigerSHARC instructions can be made conditional.
• WHY? Because doing a NOP instruction (if the condition is not met) is much less disruptive to the instruction pipeline than doing a JUMP (loss of 9 cycles if the jump is taken – probably more because of code format)
Why mostly conditional instructions?
• TigerSHARC has a very deep pipeline, so conditional jumps can cause a large disruption of the pipeline
• Better to use non-jump instructions, which don’t disrupt the pipeline even if the instruction is not executed (it acts as a nop)
If (N < 1) return_value = NULL;
else return_value = value;
Why mostly conditional instructions?
If (N < 1)
    return_value = NULL;
else return_value = value;

Using jumps:
COMP(N, 1);;
IF NJLT, JUMP _ELSE;;
J5 = NULL;;
JUMP _END_IF;;
_ELSE:
J5 = value;;
_END_IF:

Using conditional instructions:
COMP(N, 1);;
IF JLT; DO, J5 = NULL;;
IF NJLT; DO, J5 = value;;

Concept is there – we still need to check whether the syntax is correct
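The same idea in C++, as a sketch only. `select_predicated` imitates two conditional instructions by issuing both assignments and letting a flag decide which one takes effect, so there is no jump to disrupt the flow. The function names and the use of `int` for the values are assumptions made for illustration; this says nothing about the exact TigerSHARC syntax.

```cpp
#include <cassert>

// Jump-style version: control flow decides which assignment runs.
int select_with_jump(int n, int value, int null_value) {
    if (n < 1) {
        return null_value;
    }
    return value;
}

// Predicated-style version: both assignments are issued in order; each one
// only takes effect when its condition holds, otherwise it acts as a nop.
int select_predicated(int n, int value, int null_value) {
    const bool less_than = (n < 1);       // stands in for the flag set by the compare
    int result = 0;
    result = less_than ? null_value : result;   // takes effect only when N < 1
    result = !less_than ? value : result;       // takes effect only when N >= 1
    assert(result == null_value || result == value);
    return result;
}
```

For every `n` the two functions return the same result, which is the property a test would check before trusting the jump-free form.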
Trying to understand what we have done
• Use J registers for address operations, but store values from memory in XR1 and YR1
• WHY? Instructions like [J1] = XR1;; have the potential to be put in parallel with more operations
Hardware – zero overhead loop.
About 4 * N cycles better (N is the number of times round the loop)
LC0 = N;;                                // Load counter 0 with value N
Start_of_loop_LABEL:
    Loop code here ;;
    IF NLC0E, JUMP Start_of_loop_LABEL;;
NLC0E – Not LC0 expired – essentially compare LC0 with 2
If less than 2, continue (don’t jump)
If 2 or more, then decrement LC0 and jump
All sorts of stall issues if not properly aligned – TigerSHARC manual 8-23
CAN’T USE WHEN THERE IS A FUNCTION CALL IN THE LOOP?
WHY NOT? – WHAT HAPPENS – NEED TO EXPLORE MORE.
Using a software loop when there is a function call is okay, since calling a function is slow anyway – don’t need efficiency
Hardware – zero overhead loop.
BIG WARNING
LC0 = N;;                                // Load counter 0 with value N
LC0 uses UNSIGNED ARITHMETIC – MAKE SURE N is not negative, as a negative number has the same bit pattern as a VERY large unsigned number, and the processor will go around the loop for a week
We did a check for N <= 0 before entering the hardware loop as another part of our code – so we lucked in – otherwise we could have had big problems.
This issue is so important (and time wasting in the laboratories) that marks will be deducted for it in quizzes and exams
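A small C++ model of why a negative N is so dangerous. It treats the loop counter only in the way these slides describe it (the body runs, then the loop jumps back while the counter is 2 or more) and has not been checked against the hardware manual. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// What an unsigned 32-bit loop counter sees when a signed N is loaded into it.
std::uint32_t as_loop_counter(std::int32_t n) {
    return static_cast<std::uint32_t>(n);
}

// Passes made by the loop as described: the body always runs once, and the
// loop jumps back for as long as the counter is 2 or more.
std::uint64_t passes_for(std::uint32_t counter) {
    return counter < 2 ? 1u : counter;
}

// Step-by-step version of the same rule, for small counts only.
std::uint64_t simulate_passes(std::uint32_t counter) {
    assert(counter < 1000000u);   // keep the simulation short
    std::uint64_t passes = 0;
    for (;;) {
        ++passes;                 // loop body
        if (counter < 2) break;   // expired: fall out of the loop
        --counter;                // not expired: decrement and jump back
    }
    return passes;
}

// Guard to apply before loading the counter.
bool safe_to_enter_loop(std::int32_t n) {
    return n > 0;
}
```

In this model `as_loop_counter(-1)` is 4294967295, so a loop entered with N = -1 would make over four billion passes. Checking `safe_to_enter_loop(N)` first avoids that.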
What’s this XR1, YR1 and R1 stuff
• TigerSHARC is designed to do many things at once
• So you need appropriate syntax to control it
What’s this XR1, YR1 and R1 stuff
XYR1 = R2 + R3;;
does 2 adds
XR1 = XR2 + XR3
and
YR1 = YR2 + YR3;
You can add the X values and not the Y values with this syntax
XR1 = R2 + R3;;
And NOT with
XR1 = XR2 + XR3;;
Ugly – but they (ADI) will not change the syntax (DAMY)
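One way to picture the X / Y / XY prefixes is as a toy model with two register files, where the prefix on the destination chooses which block (or both) performs the add. This is a sketch for intuition only; the type names, the `add` helper and the register count are assumptions, not ADI definitions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Registers per compute block: an assumption made for this sketch.
constexpr int kRegisters = 32;

struct ComputeBlock {
    std::array<std::uint32_t, kRegisters> r{};
};

struct Core {
    ComputeBlock x;
    ComputeBlock y;
};

enum class Target { X, Y, XY };

// Models "<prefix>Rd = Ra + Rb;;". The prefix on the destination selects the
// block(s); the sources are read from the same block as the destination.
void add(Core& core, Target t, int d, int a, int b) {
    assert(d >= 0 && d < kRegisters);
    assert(a >= 0 && a < kRegisters);
    assert(b >= 0 && b < kRegisters);
    if (t == Target::X || t == Target::XY) {
        core.x.r[d] = core.x.r[a] + core.x.r[b];
    }
    if (t == Target::Y || t == Target::XY) {
        core.y.r[d] = core.y.r[a] + core.y.r[b];
    }
}
```

In this model `add(core, Target::XY, 1, 2, 3)` plays the role of XYR1 = R2 + R3;; and updates both blocks, while `add(core, Target::X, 1, 2, 3)` plays the role of XR1 = R2 + R3;; and leaves the Y block untouched.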
What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];;
Does a 32-bit fetch and puts the same value into XR1 and YR1. Same as doing
XR1 = [J0 += 0];; AND
YR1 = [J0 += 1];; at the same time

XYR1 = L[J0 += 0x2];;
Does a 64-bit fetch (two 32-bit words) and is the same as doing
XR1 = [J0 += 1];; AND
YR1 = [J0 += 1];; at the same time
What’s this XR1, YR1 and R1 stuff
XYR1 = [J0 += 0x1];;
means
XR1 = [J0 += 0];; AND
YR1 = [J0 += 1];;

XYR1 = L[J0 += 0x2];;
means
XR1 = [J0 += 1];; AND
YR1 = [J0 += 1];; at the same time

XR1:0 = L[J0 += 0x2];;
means
XR0 = [J0 += 1];; AND
XR1 = [J0 += 1];;

XYR1:0 = L[J0 += 0x2];;
means
XR0 = [J0 += 0];; AND
YR0 = [J0 += 1];; AND
XR1 = [J0 += 0];; AND
YR1 = [J0 += 1];;
What’s this XR1, YR1 and R1 stuff
XYR1:0 = L[J0 += 0x2];;
means
XR0 = [J0 += 0];; AND
YR0 = [J0 += 1];; AND
XR1 = [J0 += 0];; AND
YR1 = [J0 += 1];;

XR3:0 = Q[J0 += 0x4];;
means
XR0 = [J0 += 1];; AND
XR1 = [J0 += 1];; AND
XR2 = [J0 += 1];; AND
XR3 = [J0 += 1];;

XYR3:0 = Q[J0 += 0x4];;
means
XR0 = [J0 += 0];; AND
YR0 = [J0 += 1];; AND
XR1 = [J0 += 0];; AND
YR1 = [J0 += 1];; AND
XR2 = [J0 += 0];; AND
YR2 = [J0 += 1];; AND
XR3 = [J0 += 0];; AND
YR3 = [J0 += 1];;
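The fetch forms listed on these slides can be collected into one toy C++ model, which makes it easier to see which word lands in which register. This follows the slides' description only; the type names, helper names and register count are assumptions, `j0` is modelled as a word index into an array, and memory alignment is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Word = std::uint32_t;
constexpr int kRegs = 32;  // registers per block: an assumption for this sketch

struct Registers {
    std::array<Word, kRegs> x{};  // x[1] stands for XR1
    std::array<Word, kRegs> y{};  // y[1] stands for YR1
};

// Fetch `count` consecutive words into X registers s, s+1, ... and advance j0.
// When `also_y` is set, the same words are copied into the matching Y registers.
void load_words(Registers& r, int s, int count, bool also_y,
                const Word* mem, std::size_t& j0) {
    assert(s >= 0 && count > 0 && s + count <= kRegs);
    for (int i = 0; i < count; ++i) {
        r.x[s + i] = mem[j0 + i];
        if (also_y) r.y[s + i] = mem[j0 + i];
    }
    j0 += static_cast<std::size_t>(count);
}

// XYRs = [J0 += 1];;  one word, copied to both blocks
void load_word_to_both(Registers& r, int s, const Word* mem, std::size_t& j0) {
    load_words(r, s, 1, true, mem, j0);
}

// XYRs = L[J0 += 2];;  two words, first to X and second to Y
void load_long_split(Registers& r, int s, const Word* mem, std::size_t& j0) {
    assert(s >= 0 && s < kRegs);
    r.x[s] = mem[j0];
    r.y[s] = mem[j0 + 1];
    j0 += 2;
}

// XRs+1:s = L[J0 += 2];;  two words into a register pair in X
void load_long_to_x(Registers& r, int s, const Word* mem, std::size_t& j0) {
    load_words(r, s, 2, false, mem, j0);
}

// XYRs+1:s = L[J0 += 2];;  two words, the pair copied to both blocks
void load_long_to_both(Registers& r, int s, const Word* mem, std::size_t& j0) {
    load_words(r, s, 2, true, mem, j0);
}

// XRs+3:s = Q[J0 += 4];;  four words into four X registers
void load_quad_to_x(Registers& r, int s, const Word* mem, std::size_t& j0) {
    load_words(r, s, 4, false, mem, j0);
}

// XYRs+3:s = Q[J0 += 4];;  four words, copied to both blocks
void load_quad_to_both(Registers& r, int s, const Word* mem, std::size_t& j0) {
    load_words(r, s, 4, true, mem, j0);
}
```

The pattern is that the register range on the left sets how many words each block receives, and the L or Q prefix sets how many words are fetched in total.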
Float release generated by C++ compiler – identify new instructions
• I see 1 new instruction
Difference between integer and floating-point operations
XYR1 = R2 + R3;;
does 2 INTEGER adds
XR1 = XR2 + XR3
and
YR1 = YR2 + YR3;
Syntax is XR1 = R2 + R3;;
and NOT
XR1 = XR2 + XR3;;
Use F syntax to make it a float operation
XYFR1 = R2 + R3;;
does 2 FLOATING adds
XFR1 = R2 + R3
and
YFR1 = R2 + R3;
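The registers hold the same 32 bits either way; the F in the instruction decides how those bits are interpreted. A short C++ sketch of that distinction, assuming a 32-bit IEEE `float` on the machine running it (the helper names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Raw 32-bit register contents for a float value.
std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Interpret raw 32-bit register contents as a float.
float float_of(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Integer-style add: the bits are treated as integers.
std::uint32_t add_integer(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Float-style add: the same bits are treated as floating-point values.
std::uint32_t add_float(std::uint32_t a, std::uint32_t b) {
    return bits_of(float_of(a) + float_of(b));
}
```

Feeding the bit patterns of 1.5f and 2.25f to `add_float` gives the bit pattern of 3.75f, while `add_integer` on the same inputs gives a different pattern that is not a sensible float sum. Picking the wrong form of the instruction therefore quietly produces garbage.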
Exercise 1 – needed for Lab. 1
• FIR filter operation – data and filter coefficients are both integer arrays – write in C++
• New_value from Audio A/D, output sent to Audio D/A
for j = 1 to N - 1
    data[N - j + 1] = data[N - j];
data[0] = newvalue;
$output = \sum_{j=0}^{N-1} data[j] \times filter\_coeffs[j]$
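A possible starting point for the C++ version, offered as a sketch and not as the lab solution. The function name and signature are hypothetical, and the audio input and output calls are left out. The shift is written so that every index stays inside an N-element array; that index range is my reading of the exercise, not something the slide states.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical signature: `data` is the delay line (newest sample at index 0),
// `filter_coeffs` holds the taps, and both arrays have n elements.
int FIRFilter(int new_value, int* data, const int* filter_coeffs, std::size_t n) {
    assert(n > 0 && data != nullptr && filter_coeffs != nullptr);

    // Age every stored sample by one position, starting with the oldest so
    // that nothing is overwritten before it has been moved.
    for (std::size_t k = n - 1; k > 0; --k) {
        data[k] = data[k - 1];
    }
    data[0] = new_value;

    // Multiply-accumulate over the delay line.
    int output = 0;
    for (std::size_t j = 0; j < n; ++j) {
        output += data[j] * filter_coeffs[j];
    }
    return output;
}
```

A handy test is the impulse response: feed in a single 1 followed by zeros, and the outputs should step through the filter coefficients in order.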
Exercise – needed for Lab. 1
• FIR filter operation – data and filter coefficients are both integer arrays – ASM
ReadAudioSource(&newvalue);
for j = 1 to N - 1
    data[N - j + 1] = data[N - j];
data[0] = newvalue;
$output = \sum_{j=0}^{N-1} data[j] \times filter\_coeffs[j]$
WriteAudioSource(output);
Insert C++ code – for Lab. 1
Insert assembler code version (Lab. 2)
What we NOW KNOW – EVERYTHING FOR THE FINAL (REALLY – ALMOST)!
• Can we return from an assembly language routine without crashing the processor?
• Return a parameter from an assembly language routine
  (Is it the same for ints and floats?)
• Pass parameters into assembly language
  (Is it the same for ints and floats?)
• Do IF THEN ELSE statements
• Read and write values to memory
• Read and write values in a loop
• Do some mathematics on the values fetched from memory
All this stuff was demonstrated by coding HalfWaveRectifyASM( )