Transcript Part 2

Generating a software loop with memory accesses
TigerSHARC assembly syntax
Concepts
- Learning just enough TigerSHARC assembly code to make a software loop “work”
- Comparing the timings for rectification of integer and floating point arrays, using
    - Debug C++ code
    - Release C++ code
    - Our FIRST_ASM code
- Looking in “MIXED mode” at the code generated by the compiler
4/1/2016
TigerSHARC assemble code 2,
M. Smith, ECE, University of Calgary, Canada
2 / 38
Test Driven Development
Work with customer to check that the tests properly express what
the customer wants done. Iterative process with customer
“heavily involved” – “Agile” methodology.
[Diagram labels: CUSTOMER, DEVELOPER; Describe Requirements, Write Acceptance Tests, Write Unit Tests, Design Solution, Build Solution, Test Solution]
Note the special marker
Compiler optimization
FLOATS 927 -> 304 -- THREE FOLD
INTS 960 -> 150 -- SIX FOLD
Why the difference, and can we do better, and do we want to?
Note the failures – what are they?
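The "fold" figures on this slide are ratios of the two timings: 927 / 304 is roughly 3.05 and 960 / 150 is 6.4. A minimal sketch of that arithmetic follows. The function name `speedup` is hypothetical, and the transcript does not state the units of the timings.

```cpp
#include <cassert>

// Ratio of the timing before optimization to the timing after it.
// Units are whatever the slide's numbers use; the transcript does not say.
double speedup(double timeBefore, double timeAfter) {
    assert(timeAfter > 0.0);  // guard against division by zero
    return timeBefore / timeAfter;
}
```

Called as `speedup(927.0, 304.0)` and `speedup(960.0, 150.0)`, it gives the float and integer ratios quoted on the slide.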
Write tests about passing values back
from an assembly code routine
More detailed look at the code
- As with 68K and Blackfin, needs a .section, but the name and format are different
- As with 68K, need a .align statement. Is the “4” in bytes (8 bits) or words (32 bits)?
- As with 68K, need .global to tell other code that this function exists
- Single semi-colons
- Double semi-colons
- Start function label
- End function label (used for “profiling code”)
- Label format similar to 68K: needs leading underscore and final colon
Return registers
- There are many, depending on what you need to return
- Here we need to use J8 as the return register to pass back an “integer” pointer
- Many registers available – need the ability to control usage
    - J0 to J31 – registers (integers and pointers) (SISD mode)
    - XR0 to XR31 – registers (integers) (SISD mode)
    - XFR0 to XFR31 – registers (floats) (SISD mode)
- Did I also mention
    - K0 to K31 – registers (integers and pointers) (SISD mode)
    - YR0 to YR31, YFR0 to YFR31 (SIMD mode)
    - XYR, YXR and R registers (SIMD mode)
    - And also the MIMD modes
    - And the double registers and the quad registers …
#define return_pt_J8 J8    // J8 is a VOLATILE, NON-PRESERVED register
Parameter passing
- SPACES for the first four parameters ARE ALWAYS present on the stack (as with 68K)
- But the first four parameters are passed in registers (J4, J5, J6 and J7 most of the time) (as with MIPS and Blackfin)
- The parameters passed in registers are often stored into the spaces on the stack (like the MIPS) as the first step when assembly code functions call assembly code functions
- J4, J5, J6 and J7 are volatile, non-preserved registers
Can we pass back the start of the final array?
Still passing tests by accident, and this needs to be a conditional return value
What we need to know based on experiences from other processors
- Can we return from an assembly language routine without crashing the processor?
- Return a parameter from assembly language routine (Is it same for ints and floats?)
- Pass parameters into assembly language (Is it same for ints and floats?)
- Do IF THEN ELSE statements
- Read and write values to memory
- Read and write values in a loop
- Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )
Why is ELSE a keyword?
FOUR PART ELSE INSTRUCTION IS LEGAL
IF JLT; ELSE, J1 = J2 + J3;    // Conditional execution -- if false
        ELSE, XR1 = XR2 + XR3; // Conditional -- if false
        YFR1 = YFR2 + YFR3;;   // Unconditional -- always
IF JLT; DO, J1 = J2 + J3;      // Conditional execution -- if true
        DO, XR1 = XR2 + XR3;   // Conditional -- if true
        YFR1 = YFR2 + YFR3;;   // Unconditional -- always
Having this sort of format means that the instruction pipeline is not disrupted when we do IF statements
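As a rough C++ model of the DO form on this slide: the two instructions marked DO take effect only when the condition holds, while the unmarked instruction on the same line always executes. The `Registers` struct and `executeConditionalLine` are hypothetical names for illustration. The boolean `jlt` stands in for the processor's condition flag, and how that flag gets set is outside this sketch. The C++ `if` is only a model, since the slide's point is that the hardware does this without a jump.

```cpp
// Simplified register file holding only the registers named on the slide.
struct Registers {
    int j1, j2, j3;
    int xr1, xr2, xr3;
    float yfr1, yfr2, yfr3;
};

// Models: IF JLT; DO, J1 = J2 + J3; DO, XR1 = XR2 + XR3; YFR1 = YFR2 + YFR3;;
void executeConditionalLine(Registers &r, bool jlt) {
    if (jlt) {                  // the DO instructions: only if the condition is true
        r.j1 = r.j2 + r.j3;
        r.xr1 = r.xr2 + r.xr3;
    }
    r.yfr1 = r.yfr2 + r.yfr3;   // no DO prefix: always executed
}
```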
Label name is not the problem
NOTE: This is “C-like” syntax, but it is not “C”
Statement must end in ;; not ;
ONE semicolon = end of instruction
TWO semicolons = end of parallel instruction line
Add dual-semicolons everywhere
Worry about “multiple issues” later
This dual semi-colon is so important that you MUST code review for it all the time, or else you waste so much time in the lab. Key in exams / quizzes
At last an error I know how to fix
Well I thought I understood it !!!
- Speed issue – JUMP instructions can’t be too close together when stored in memory
    - Not normally a problem when “if” code is larger
Add a single instruction line of 4 NOPs
nop; nop; nop; nop;;    // TEMPORARY
- Fix the last error as part of Assignment 1
Fix the remaining error in handling the IF THEN ELSE as part of Assignment 1
Worry about code efficiency later (refactor) when all code is working
What we need to know based on experiences from other processors
- Can we return from an assembly language routine without crashing the processor?
- Return a parameter from assembly language routine (Is it same for ints and floats?)
- Pass parameters into assembly language (Is it same for ints and floats?)
- Do IF THEN ELSE statements
- Read and write values to memory
- Read and write values in a loop
- Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )
Target: changing this C++ code into assembly (to get “more” speed)
- Code we generated yesterday was similar to parts of this, but not equivalent.
- Re-factor the code to make the assembly code and C++ functionality equivalent
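The slide's C++ listing did not survive in this transcript. The sketch below is a hypothetical reconstruction of what a half-wave rectify routine with a conditional return value might look like, based only on the surrounding notes (pass back the start of the final array, return NULL (0) when the length test fails). The name `HalfWaveRectifyCPP`, the parameter names, and the exact guard `N <= 0` are assumptions, not the author's code.

```cpp
// Hypothetical reference version; NOT the listing from the slide.
// Copies initial[] to finalArray[], replacing negative values with zero.
// Returns the start of the final array, or nullptr when N is not positive.
int *HalfWaveRectifyCPP(const int initial[], int finalArray[], int N) {
    if (N <= 0) {
        return nullptr;              // conditional return value
    }
    for (int count = 0; count < N; count++) {
        if (initial[count] < 0) {
            finalArray[count] = 0;   // rectify: negative becomes zero
        } else {
            finalArray[count] = initial[count];
        }
    }
    return finalArray;
}
```

A test in the spirit of the deck would check both the returned pointer and the array contents, so that a routine cannot pass "by accident".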
The code was not exactly what we designed (C++ equivalent) – re-factor and retest after the re-factoring
NEXT STEP
Refactored C++ code
I THINK I UNDERSTAND ENOUGH TO CHANGE THE FORMAT OF THE IF-THEN-ELSE TO OPTIMIZE THIS PARTICULAR CODE BIT
USE: IF TRUE EXECUTE THIS STATEMENT – SINGLE LINE
Avoiding JUMPS in the main flow of the code will speed the flow of the code
Almost right. SYNTAX ERROR
Look in the manual to find the correct syntax
IF NJLE; DO, J8 = 0
No syntax errors (No CODE ERRORS).
Code does not work (CODE DEFECTS)
We don’t have enough code to pass all the tests, but we are failing tests we did not expect to fail
Run “forensic tests” to find out where the DEFECT is being introduced
Identify mistake by removing “code sections”
Without the IF
Add another line to the code
Can now spot the error
New format of IF-THEN-ELSE is doing exactly the opposite of what we want
IF NOT TRUE return NULL (0)
Need JLE not NJLE
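The defect on this slide can be mimicked in C++ by writing the guard both ways. This is only a sketch: it assumes the condition being tested is "N less than or equal to zero", and the function names `guardWithNJLE` and `guardWithJLE` are hypothetical labels for the two variants.

```cpp
// Defective variant: models "IF NJLE; DO, J8 = 0".
// The return value is cleared when N is NOT <= 0, i.e. for every valid length.
int *guardWithNJLE(int *finalArray, int N) {
    int *result = finalArray;
    if (!(N <= 0)) {
        result = nullptr;
    }
    return result;
}

// Corrected variant: models "IF JLE; DO, J8 = 0".
// The return value is cleared only when N <= 0.
int *guardWithJLE(int *finalArray, int N) {
    int *result = finalArray;
    if (N <= 0) {
        result = nullptr;
    }
    return result;
}
```

With a valid length the defective variant hands back NULL, which is why tests that were expected to pass start failing.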
Assignment 1 – code the following as a software loop – follow MIPS / Blackfin approach
DONE DURING TUTORIAL
int CalculateSum(void) {
    int sum = 0;
    for (int count = 0; count < 6; count++) {
        sum = sum + count;
    }
    return sum;
}
Reminder – software for-loop becomes “while loop” with initial test
int CalculateSum(void) {
    int sum = 0;
    int count = 0;
    while (count < 6) {
        sum = sum + count;
        count++;
    }
    return sum;
}
Do line by line translation into assembly code
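One possible intermediate step before writing the assembly is to rewrite the while loop with explicit labels and jumps, so that each C++ line maps onto roughly one assembly instruction line. The sketch below is an illustration under that assumption. The function name `CalculateSumGoto` and the labels `LOOP` and `END_LOOP` are hypothetical.

```cpp
// The while loop rewritten with explicit jumps, mirroring assembly structure.
int CalculateSumGoto(void) {
    int sum = 0;
    int count = 0;                    // loop counter must be initialized before the test
LOOP:
    if (!(count < 6)) goto END_LOOP;  // initial test: leave when count is NOT less than 6
    sum = sum + count;
    count++;
    goto LOOP;                        // jump back to repeat the test
END_LOOP:
    return sum;
}
```

The exit test is written as "not less than" on purpose, since it is the inverse of the loop condition `count < 6`.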
USE SOFTWARE LOOP HERE
Do loop control first
- Have some jumps too close together
NOTE: JGE is ILLEGAL, USE NJLT
Customize?
#define JGE NJLT
Run the tests with 4 nop padding to check that we get out of the loop as expected
Adding 4 nops -- lose 1 cycle, gain an hour not trying to solve the problem
If you need the 1 cycle, refactor the code later
Accessing memory
- Basic mode
    - Special register J31 – acts as zero when used in additions
    - Pt_J5 is a pointer register into an array
    - Value_J1 is being used as a data register
    - J registers are like MIPS registers (used as pointer and data), NOT like 68K or Blackfin registers – those can be used as either data or address registers but not both
    - NOTE: Later we will find that using TigerSHARC registers for data operations is a BAD idea
1. Value_J1 = [Pt_J5];; read value from memory location pointed to by J5 -- compare to Blackfin Value_R0 = [Pt_P0];
2. Value_J1 = [Pt_J5 + J31];; read value from memory location pointed to by J5 -- but read somewhere that this CAN be faster than just Value_J1 = [Pt_J5];; -- NEED TO CONFIRM
Accessing memory – step 2
- Basic mode
    - Pt_J5 is a pointer register into an array
    - Offset_J4 is used as an offset
    - Value_J1 is being used as a data register to receive the memory value – load / store architecture
1. Read_J1 = [Pt_J5 + Offset_J4];; read value from memory location pointed to by (J5 + J4)
   PRE-MODIFY – address used J5 + J4, no change in J5
2. Read_J1 = [Pt_J5 += Offset_J4];; read value from memory location pointed to by J5, and then perform add operation on the J5 register (points to NEXT location)
   POST-MODIFY – address used J5, then perform J5 = J5 + J4
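The two addressing forms map onto familiar C++ pointer idioms. The sketch below is an analogy only: it assumes the offset counts array elements, and the function names `readPreModify` and `readPostModify` are hypothetical.

```cpp
#include <cstddef>

// Analogy for: Read_J1 = [Pt_J5 + Offset_J4];;
// The address used is pointer + offset; the pointer itself is left unchanged.
int readPreModify(const int *&pt, std::ptrdiff_t offset) {
    return *(pt + offset);
}

// Analogy for: Read_J1 = [Pt_J5 += Offset_J4];;
// The address used is the current pointer; the pointer is advanced afterwards.
int readPostModify(const int *&pt, std::ptrdiff_t offset) {
    int value = *pt;
    pt += offset;
    return value;
}
```

With an offset of 1, the post-modify form is the C++ `*pt++` pattern for walking an array inside a loop.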
Add in the memory accesses
FORGOT: TigerSHARC = RISC PROCESSOR, LOAD/STORE ONLY
Like MIPS and Blackfin
Must place value into register, and then copy register to memory
NO:  [J5 + J0] = 0;;
NO:  J3 = 0; [J5 + J0] = J3;;
     Uses wrong J3 – remember TigerSHARC can handle parallel instructions
YES: J3 = 0;;
     [J5 + J0] = J3;;
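Why the single-line version "uses the wrong J3" can be shown with a small C++ model. It assumes, as a simplification consistent with the slide's note, that every instruction on one parallel line reads its source registers before any result of that line is written. The `Machine` struct and both function names are hypothetical.

```cpp
// Toy machine: one register and a few words of memory.
struct Machine {
    int j3;
    int memory[4];
};

// Models: J3 = 0; [J5 + J0] = J3;;   (both instructions on ONE line)
// The store sees the value J3 held before the line started.
void storeOnSameLine(Machine &m, int address) {
    const int oldJ3 = m.j3;    // sources are read first
    m.j3 = 0;
    m.memory[address] = oldJ3; // the stale value reaches memory
}

// Models: J3 = 0;;  [J5 + J0] = J3;;   (two separate lines)
// The store sees the freshly written zero.
void storeOnSeparateLines(Machine &m, int address) {
    m.j3 = 0;
    m.memory[address] = m.j3;
}
```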
Understand the error message
Too many J resource usage = missing ;;
Unintentionally doing the parallel instruction line
[J5 + J0] = J2; J0 = J0 + 1;;
Note: Missing label is not an assembler error, it’s a linker error
Fix warnings
DEFECT: may be days before you try to link, then hard to find
NOW that the assembler knows where “CONTINUE” is, it can tell you that you have two JUMP instructions too close together
- Fix with magic 4 nops, and lose one cycle / loop
Not getting expected Test results
Something is logically wrong (DEFECT)
Obvious question – are we even getting into the loop?
Add BREAKPOINT to TEST code flow.
(We don’t add BREAKPOINTS to follow the code in detail)
CODE NEVER GOT TO BREAKPOINT means code never entered loop
Forgot to do count = 0, so not even getting into the loop, as there is a garbage value already in Count_J0 from code we executed earlier -- DEFECT
Not bad for a first effort
Faster than compiler in debug mode
Where did the float ASM code suddenly appear from?
- Integer 0 has bit pattern 0x0000 0000
- Float 0.0 has bit pattern 0x0000 0000
- Integer +6 has format b 0??? ???? ???? ???? ???? ???? ???? ????
- Float +6.0 has format b 0??? ???? ???? ???? ???? ???? ???? ????
- Integer -6 has format b 1??? ???? ???? ???? ???? ???? ???? ????
- Float -6.0 has format b 1??? ???? ???? ???? ???? ???? ???? ????
- Formats are very different (the float pattern carries an EXPONENT field), but the sign bit is in the same place
- Float algorithm - if S == 1 (negative) set to zero, otherwise leave unchanged – same as integer algorithm
- Just re-use integer algorithm with a change of name
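The sign-bit argument can be checked in C++ by applying the integer rule to the raw bit pattern of a float. This sketch assumes 32-bit IEEE 754 floats and two's-complement 32-bit integers. The function names `rectifyBits` and `rectifyFloatViaIntegerRule` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::int32_t),
              "sketch assumes a 32-bit float");

// The integer algorithm: a set sign bit (negative value) becomes zero.
std::int32_t rectifyBits(std::int32_t bits) {
    return (bits < 0) ? 0 : bits;
}

// Re-use the integer algorithm on the float's bit pattern.
float rectifyFloatViaIntegerRule(float value) {
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof bits);     // view the float as raw bits
    bits = rectifyBits(bits);
    float result;
    std::memcpy(&result, &bits, sizeof result);  // all-zero bits read back as 0.0f
    return result;
}
```

Because an all-zero pattern is both integer 0 and float 0.0, no float arithmetic is needed; the routine only changes its name.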
Final code – Float rectify code just has a different name
What we NOW KNOW
- Can we return from an assembly language routine without crashing the processor?
- Return a parameter from assembly language routine (Is it same for ints and floats?)
- Pass parameters into assembly language (Is it same for ints and floats?)
- Do IF THEN ELSE statements
- Read and write values to memory
- Read and write values in a loop
- Do some mathematics on the values fetched from memory
All this stuff is demonstrated by coding HalfWaveRectifyASM( )