ELG6163_TMS320C6x


TMS320C6000 Architectural and Programming Overview

Overview
• Interface between assembly language and architecture
• Architecture of TMS320C60
• Linear assembly
• Code optimization
Chapter 2, Slide 2
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002
General DSP System Block Diagram
[Figure: block diagram showing the Central Processing Unit, Internal Memory, Internal Buses, External Memory and Peripherals]
Implementation of Sum of Products (SOP)
SOP (Sum of Products) is the key element for most DSP algorithms. So let’s write the code for this algorithm and at the same time discover the C6000 architecture.

$$Y = \sum_{n=1}^{N} a_n \times x_n = a_1 \times x_1 + a_2 \times x_2 + \dots + a_N \times x_N$$

Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition
Therefore two basic instructions are required.
So let’s implement the SOP algorithm! The implementation in this module will be done in assembly.
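Before the assembly version is built up, the SOP can be sketched in C as a reference. This is a minimal sketch and not code from the slides: the function name `sum_of_products`, the 16-bit operand type and the 32-bit accumulator are assumptions, and indexing starts at 0 rather than 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products: y = a[0]*x[0] + a[1]*x[1] + ... + a[n-1]*x[n-1].
 * Assumption: 16-bit operands and a 32-bit accumulator; the caller must
 * keep the running sum within the int32_t range. */
int32_t sum_of_products(const int16_t *a, const int16_t *x, size_t n)
{
    int32_t y = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t prod = (int32_t)a[i] * (int32_t)x[i]; /* operation (1): multiplication */
        y += prod;                                    /* operation (2): addition */
    }
    return y;
}
```

Each pass of the loop performs exactly the two basic operations named above, which is why the assembly version needs only a multiply and an add instruction in its core.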
Multiply (MPY)
The multiplication of a1 by x1 is done in assembly by the following instruction:
        MPY             a1, x1, Y
This instruction is performed by a multiplier unit that is called “.M”.

Multiply (.M unit)
$$Y = \sum_{n=1}^{40} a_n \times x_n$$
The .M unit performs multiplications in hardware:
        MPY     .M      a1, x1, Y
Note: A 16-bit by 16-bit multiply provides a 32-bit result. A 32-bit by 32-bit multiply provides a 64-bit result.
Addition (.?)
        MPY     .M      a1, x1, prod
        ADD     .?      Y, prod, Y

Add (.L unit)
        MPY     .M      a1, x1, prod
        ADD     .L      Y, prod, Y
RISC processors such as the C6000 use registers to hold the operands, so let’s change this code.
Register File - A
[Figure: Register File A (A0 to A15, 32 bits wide) with A0 = a1, A1 = x1, A3 = prod and A4 = Y, feeding the .M and .L units]
        MPY     .M      a1, x1, prod
        ADD     .L      Y, prod, Y
Let us correct this by replacing a, x, prod and Y by the registers as shown above.

Specifying Register Names
        MPY     .M      A0, A1, A3
        ADD     .L      A4, A3, A4
The registers A0, A1, A3 and A4 contain the values to be used by the instructions.
Register File A contains 16 registers (A0 - A15), which are 32 bits wide.
Data loading
Q: How do we load the operands into the registers?

Load Unit “.D”
A: The operands are loaded into the registers by loading them from the memory using the .D unit.
[Figure: Register File A connected to Data Memory through the .D unit]
It is worth noting at this stage that the only way to access memory is through the .D unit.

Load Instructions (LDB, LDH, LDW, LDDW)
Q: Which instruction(s) can be used for loading operands from the memory to the registers?
A: The load instructions.
Using the Load Instructions
Before using the load unit you have to be aware that this processor is byte addressable, which means that each byte is represented by a unique address. Also, the addresses are 32 bits wide.
[Figure: data memory drawn 16 bits wide, with addresses 00000000, 00000002, 00000004, 00000006, 00000008, ... FFFFFFFF, holding a1, x1, prod and Y]

The syntax for the load instruction is:
        LD      *Rn, Rm
where Rn is a register that contains the address of the operand to be loaded, and Rm is the destination register.

The question now is how many bytes are going to be loaded into the destination register? The answer is that it depends on the instruction you choose:
• LDB: loads one byte (8-bit)
• LDH: loads half word (16-bit)
• LDW: loads a word (32-bit)
• LDDW: loads a double word (64-bit)
Note: LD on its own does not exist.
Using the Load Instructions
Example: the data memory, drawn 16 bits wide, contains:

address         byte 1  byte 0
00000000        0xA     0xB
00000002        0xC     0xD
00000004        0x2     0x1
00000006        0x4     0x3
00000008        0x6     0x5
0000000A        0x8     0x7

If we assume that A5 = 0x4 then:
(1) LDB  *A5, A7        ; gives A7 = 0x00000001
(2) LDH  *A5, A7        ; gives A7 = 0x00000201
(3) LDW  *A5, A7        ; gives A7 = 0x04030201
(4) LDDW *A5, A7:A6     ; gives A7:A6 = 0x0807060504030201
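The four load results in the example can be reproduced with a short C model of a byte-addressable, little-endian memory. This is only a sketch under stated assumptions: `load_le` and `example_mem` are hypothetical names, sign extension is not modelled (the example values are all positive), and the contents of addresses 0x0 to 0x3 follow the figure layout as read here.

```c
#include <stdint.h>

/* Read nbytes (1, 2, 4 or 8) starting at byte address addr, least
 * significant byte first, mimicking LDB/LDH/LDW/LDDW on the example values.
 * Assumption: little-endian memory, and no sign extension is modelled. */
uint64_t load_le(const uint8_t *mem, uint32_t addr, unsigned nbytes)
{
    uint64_t value = 0;
    for (unsigned i = 0; i < nbytes; ++i) {
        value |= (uint64_t)mem[addr + i] << (8u * i);
    }
    return value;
}

/* Byte-addressed copy of the example memory; bytes 0 to 3 follow the
 * figure's layout as read here (byte 1 on the left, byte 0 on the right). */
static const uint8_t example_mem[12] = {
    0x0B, 0x0A, 0x0D, 0x0C,   /* addresses 0x0 to 0x3 */
    0x01, 0x02, 0x03, 0x04,   /* addresses 0x4 to 0x7 */
    0x05, 0x06, 0x07, 0x08    /* addresses 0x8 to 0xB */
};
```

With A5 = 0x4, `load_le(example_mem, 4, 2)` yields 0x0201 and `load_le(example_mem, 4, 4)` yields 0x04030201, matching cases (2) and (3).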
Using the Load Instructions
Question: If data can only be accessed by the load instruction and the .D unit, how can we load the register pointer Rn in the first place?
Loading the Pointer Rn
• The instruction MVKL will allow a move of a 16-bit constant into a register as shown below:
        MVKL    .?      a, A5           ; ‘a’ is a constant or label
• How many bits represent a full address? 32 bits.
• So why does the instruction not allow a 32-bit move? All instructions are 32 bits wide (see instruction opcode), so a 32-bit constant cannot fit in the opcode alongside the operation and register fields.
Loading the Pointer Rn
• To solve this problem another instruction is available, MVKH. It writes the upper 16 bits of the constant (ah) into the upper half of the register and leaves the lower half unchanged:
        MVKH    .?      a, A5           ; ‘a’ is a constant or label
• Finally, to move the 32-bit address to a register we can use:
        MVKL            a, A5
        MVKH            a, A5
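The effect of the MVKL/MVKH pair on a 32-bit register can be modelled in C. This is a sketch rather than a definition of the instructions: the helper names `mvkl`, `mvkh` and `load_address` are hypothetical, and the model assumes that MVKL sign-extends its 16-bit constant while MVKH replaces only the upper half of the register.

```c
#include <stdint.h>

/* Model of MVKL: place the low 16 bits of the constant in the register,
 * sign-extended to 32 bits. */
uint32_t mvkl(uint32_t cst)
{
    uint32_t low = cst & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* Model of MVKH: replace the upper 16 bits of the register with the upper
 * 16 bits of the constant and keep the lower 16 bits. */
uint32_t mvkh(uint32_t reg, uint32_t cst)
{
    return (cst & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* MVKL followed by MVKH leaves the full 32-bit constant in the register. */
uint32_t load_address(uint32_t a)
{
    uint32_t a5 = mvkl(a);   /* MVKL a, A5 */
    a5 = mvkh(a5, a);        /* MVKH a, A5 */
    return a5;
}
```

Under these assumptions the order matters: MVKL overwrites the whole register, so it has to come first and MVKH second.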
LDH, MVKL and MVKH
        MVKL            pt1, A5
        MVKH            pt1, A5
        MVKL            pt2, A6
        MVKH            pt2, A6
        LDH     .D      *A5, A0
        LDH     .D      *A6, A1
        MPY     .M      A0, A1, A3
        ADD     .L      A4, A3, A4
Creating a loop
So far we have only implemented the SOP for one tap only, i.e. Y = a1 * x1. So let’s create a loop so that we can implement the SOP for N taps.
With the C6000 processors there are no dedicated instructions such as block repeat. The loop is created using the B instruction.
What are the steps for creating a loop
1. Create a label to branch to.
2. Add a branch instruction, B.
3. Create a loop counter.
4. Add an instruction to decrement the loop counter.
5. Make the branch conditional based on the value in
the loop counter.
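The five steps can be mirrored in C with a label and `goto`, which keeps the structure close to the assembly that follows. This is an illustrative sketch only: `run_loop` is a hypothetical name, and it assumes the counter starts at 1 or more because the test sits at the bottom of the loop.

```c
/* Runs the loop body `count` times using a label, a decrementing counter
 * and a conditional branch. Assumption: count >= 1, because the test sits
 * at the bottom of the loop. Returns the number of iterations executed. */
unsigned run_loop(unsigned count)
{
    unsigned iterations = 0;
loop:                            /* step 1: label to branch to */
    iterations++;                /* loop body (one tap of the SOP would go here) */
    count = count - 1;           /* steps 3 and 4: counter, decremented each pass */
    if (count != 0) goto loop;   /* steps 2 and 5: branch, taken only while the counter is non-zero */
    return iterations;
}
```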
1. Create a label to branch to
2. Add a branch instruction, B.
                MVKL            pt1, A5
                MVKH            pt1, A5
                MVKL            pt2, A6
                MVKH            pt2, A6
loop            LDH     .D      *A5, A0
                LDH     .D      *A6, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                B       .?      loop
Which unit is used by the B instruction?
[Figure: Register File A with the .S, .M, .L and .D units and Data Memory]
The B instruction is executed by the .S unit, which also executes MVKL and MVKH:
                MVKL    .S      pt1, A5
                MVKH    .S      pt1, A5
                MVKL    .S      pt2, A6
                MVKH    .S      pt2, A6
loop            LDH     .D      *A5, A0
                LDH     .D      *A6, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                B       .S      loop
3. Create a loop counter.
                MVKL    .S      count, B0
B registers will be introduced later.

4. Decrement the loop counter
                MVKL    .S      pt1, A5
                MVKH    .S      pt1, A5
                MVKL    .S      pt2, A6
                MVKH    .S      pt2, A6
                MVKL    .S      count, B0
loop            LDH     .D      *A5, A0
                LDH     .D      *A6, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                SUB     .S      B0, 1, B0
                B       .S      loop
5. Make the branch conditional based on the value in the loop counter
• What is the syntax for making an instruction conditional?
        [condition]     Instruction     Label
  e.g.
        [B1]            B               loop
(1) The condition can be one of the following registers: A1, A2, B0, B1, B2.
(2) Any instruction can be conditional.
• The condition can be inverted by adding the exclamation symbol “!” as follows:
        [!condition]    Instruction     Label
  e.g.
        [!B0]           B               loop    ; branch if B0 = 0
        [B0]            B               loop    ; branch if B0 != 0

5. Make the branch conditional
                MVKL    .S1     pt1, A5
                MVKH    .S1     pt1, A5
                MVKL    .S1     pt2, A6
                MVKH    .S1     pt2, A6
                MVKL    .S2     count, B0
loop            LDH     .D      *A5, A0
                LDH     .D      *A6, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                SUB     .S      B0, 1, B0
        [B0]    B       .S      loop
Testing the code
This code performs the following operations:
        a0*x0 + a0*x0 + a0*x0 + … + a0*x0
However, we would like to perform:
        a0*x0 + a1*x1 + a2*x2 + … + aN*xN
The pointers A5 and A6 never change, so every iteration reloads the same two operands.

Modifying the pointers
The solution is to modify the pointers A5 and A6.
Indexing Pointers
Syntax          Description             Pointer Modified
*R              Pointer                 No
*+R[disp]       + Pre-offset            No
*-R[disp]       - Pre-offset            No
*++R[disp]      Pre-increment           Yes
*--R[disp]      Pre-decrement           Yes
*R++[disp]      Post-increment          Yes
*R--[disp]      Post-decrement          Yes

• *R: the pointer is used but not modified.
• *+R[disp] and *-R[disp]: the pointer is modified BEFORE being used and RESTORED to its previous value.
• *++R[disp] and *--R[disp]: the pointer is modified BEFORE being used and NOT RESTORED to its previous value.
• *R++[disp] and *R--[disp]: the pointer is modified AFTER being used and NOT RESTORED to its previous value.
• [disp] specifies the number of elements; the element size is DW (64-bit), W (32-bit), H (16-bit), or B (8-bit).
• disp = R or 5-bit constant.
• R can be any register.
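The difference between pre-offset, pre-increment and post-increment can be illustrated with a small C model for half-word loads. The helper names are hypothetical, the pointer register R is modelled as an element index `r`, and the decrementing forms are omitted because they mirror the incrementing ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Each helper models one addressing mode for half-word (16-bit) loads.
 * `r` plays the role of the pointer register R, counted in elements. */

/* *+R[disp]: use R + disp, leave R unchanged. */
int16_t load_pre_offset(const int16_t *base, const size_t *r, size_t disp)
{
    return base[*r + disp];
}

/* *++R[disp]: add disp to R first, then use the updated R. */
int16_t load_pre_increment(const int16_t *base, size_t *r, size_t disp)
{
    *r += disp;
    return base[*r];
}

/* *R++[disp]: use R as it is, then add disp to R. */
int16_t load_post_increment(const int16_t *base, size_t *r, size_t disp)
{
    int16_t value = base[*r];
    *r += disp;
    return value;
}
```

Because `r` counts elements rather than bytes, the model reflects the rule that [disp] is scaled by the element size of the load.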
Modify and testing the code
                MVKL    .S1     pt1, A5
                MVKH    .S1     pt1, A5
                MVKL    .S1     pt2, A6
                MVKH    .S1     pt2, A6
                MVKL    .S2     count, B0
loop            LDH     .D      *A5++, A0
                LDH     .D      *A6++, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                SUB     .S      B0, 1, B0
        [B0]    B       .S      loop
This code now performs the following operations:
                a0*x0 + a1*x1 + a2*x2 + ... + aN*xN

Store the final result
The result is stored after the loop with:
                STH     .D      A4, *A7
The pointer A7 has not been initialised, so it is loaded before the loop:
                MVKL    .S1     pt3, A7
                MVKH    .S1     pt3, A7
The pointer A7 is now initialised.

What is the initial value of A4?
A4 is used as an accumulator, so it needs to be reset to zero.
                MVKL    .S1     pt1, A5
                MVKH    .S1     pt1, A5
                MVKL    .S1     pt2, A6
                MVKH    .S1     pt2, A6
                MVKL    .S1     pt3, A7
                MVKH    .S1     pt3, A7
                MVKL    .S2     count, B0
                ZERO    .L      A4
loop            LDH     .D      *A5++, A0
                LDH     .D      *A6++, A1
                MPY     .M      A0, A1, A3
                ADD     .L      A4, A3, A4
                SUB     .S      B0, 1, B0
        [B0]    B       .S      loop
                STH     .D      A4, *A7
Increasing the processing power!
How can we add more processing power to this processor?
(1) Increase the clock frequency.
(2) Increase the number of processing units.

To increase the processing power, this processor has two sides (A and B, or 1 and 2).
[Figure: Register File A (A0 to A15) with units .S1, .M1, .L1, .D1 and Register File B (B0 to B15) with units .S2, .M2, .L2, .D2; all registers are 32 bits wide and both sides connect to Data Memory]
Can the two sides exchange operands in order to increase performance?
The answer is YES but there are limitations.
• To exchange operands between the two sides, some cross paths or links are required.
What is a cross path?
• A cross path links one side of the CPU to the other.
• There are two types of cross paths:
    – Data cross paths.
    – Address cross paths.

Data Cross Paths
• Data cross paths can also be referred to as register file cross paths.
• These cross paths allow operands from one side to be used by the other side.
• There are only two cross paths:
    – one path which conveys data from side B to side A, 1X.
    – one path which conveys data from side A to side B, 2X.
[Slides 60 to 72 are figures only. Surviving titles: TMS320C67x Data-Path, Architecture, Data path details, Functional Units, Instruction packing, Data types, DSP Instructions, Pipeline benefits, Pipeline Phases, Delay slots]
Linear Assembly
• Comparison of programming techniques.
• How to write Linear Assembly.
• Interfacing Linear Assembly with C.
• Assembly optimiser tool.

Introduction
• With the assembly optimiser, optimisation for loops can be made very simple.
• The assembly optimiser takes care of the pipeline structure and generates highly parallel assembly code from linear assembly automatically.
• The performance of the assembly optimiser can easily reach the performance of hand-written assembly code.

Comparison of Programming Techniques
Source          Optimisation            Efficiency*     Effort
ASM             Hand optimised          100%            High
Linear ASM      Assembly optimiser      95 - 100%       Med
C, C++          Optimising compiler     80 - 100%       Low
* Typical efficiency vs. hand optimized assembly.
Writing in Linear Assembly
• Linear assembly is similar to hand assembly, except:
    – It does not require NOPs to fill empty delay slots.
    – The functional units do not need to be specified.
    – Grouping of instructions in parallel is performed automatically.
    – It accepts symbolic variable names.
                ZERO    sum
loop            LDH     *p_to_a, a
                LDH     *p_to_b, b
                MPY     a, b, prod
                ADD     sum, prod, sum
                SUB     B0, 1, B0
                B       loop

How to Write Code in Linear Assembly
• File extension:
    – Use the “.sa” extension to specify the file is written in linear assembly.
• How to write code:
_sa_Function    .cproc                          ; .cproc defines the beginning of the code
                ZERO    sum
loop            LDH     *pm++, m
                LDH     *pn++, n
                MPY     m, n, prod
                ADD     sum, prod, sum
                SUB     count, 1, count
    [count]     B       loop
                .return sum                     ; .return specifies the return value
                .endproc                        ; .endproc defines the end of the linear assembly code
• NO NOPs required
• NO parallel instructions required
• NO functional units specified
• NO registers required
Passing and Returning Arguments
• “pm” and “pn” are two pointers declared in the C code that calls the linear assembly function.
• The linear assembly function is declared in C with the following prototype and called like any other function:
        int dotp(short *a, short *x, int count);
        y = dotp(a, x, count);
• The linear assembly function receives the arguments using .cproc:
_dotp           .cproc  pm, pn, count
                ...
                .return y
                .endproc

Declaring the Symbolic Variables
• All the symbolic registers except those used as arguments are declared as follows:
                .reg    m, n, prod, sum
• The assembly optimiser will attempt to assign all these values to registers.

Complete Linear Assembly Code
_dotp           .cproc  pm, pn, count
                .reg    m, n, prod, sum
                ZERO    sum
loop            LDH     *pm++, m
                LDH     *pn++, n
                MPY     m, n, prod
                ADD     sum, prod, sum
                SUB     count, 1, count
    [count]     B       loop
                .return sum
                .endproc
Note: Linear assembly performs automatic return to the calling function.
Code Optimization
• Introduction to optimisation and optimisation procedure.
• Optimisation of C code using the code generation tools.
• Optimisation of assembly code.

Introduction
• Software optimisation is the process of manipulating software code to achieve two main goals:
    – Faster execution time.
    – Smaller code size.
Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.
• To implement efficient software, the programmer must be familiar with:
    – Processor architecture.
    – Programming language (C, assembly or linear assembly).
    – The code generation tools (compiler, assembler and linker).

Code Optimisation Procedure
[Figure: optimising compiler flow. The C source file (.c) passes through the Parser (output .if), the Optimiser (output .opt) and the Code generator (output .asm)]

Optimising C Compiler Options
• The ‘C6x optimising C compiler takes ANSI C source code and can currently reach about 80% of the performance of hand-scheduled assembly.
• However, to achieve this level of optimisation, knowledge of the different levels of optimisation is essential. Optimisation is performed at different stages and levels.

Optimization levels
• -O0
    – Performs control-flow-graph simplification
    – Allocates variables to registers
    – Performs loop rotation
    – Eliminates unused code
    – Simplifies expressions and statements
    – Expands calls to functions declared inline
• -O1
  Performs all -O0 optimizations, plus:
    – Removes unused assignments
    – Eliminates local common expressions
• -O2
  Performs all -O1 optimizations, plus:
    – Performs software pipelining
    – Performs loop optimizations
    – Eliminates global common subexpressions
    – Eliminates global unused assignments
    – Converts array references in loops to incremented pointer form
    – Performs loop unrolling
• -O3
  Performs all -O2 optimizations, plus:
    – Removes all functions that are never called
    – Simplifies functions with return values that are never used
    – Inlines calls to small functions
    – Reorders function declarations; the called functions' attributes are known when the caller is optimized
    – Propagates arguments into function bodies when all calls pass the same value in the same argument position
    – Identifies file-level variable characteristics
Intrinsic C functions
• Intrinsics allow you to express the meaning of certain assembly statements that would otherwise be cumbersome or inexpressible in C/C++.
• Intrinsics are used like functions; you can use C/C++ variables with these intrinsics, just as you would with any normal function.
        int x1, x2, y;
        y = _sadd(x1, x2);
• int _sadd (int src1, int src2); adds src1 to src2 and saturates the result.
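To see what "saturates the result" means without the TI tools, the behaviour can be sketched in portable C. The function `sadd_portable` below is a hypothetical stand-in written for illustration; it is not the `_sadd` intrinsic itself, and it assumes 32-bit operands.

```c
#include <stdint.h>

/* Portable stand-in for a 32-bit saturating add: the result is clamped to
 * INT32_MIN..INT32_MAX instead of wrapping around. */
int32_t sadd_portable(int32_t src1, int32_t src2)
{
    int64_t sum = (int64_t)src1 + (int64_t)src2;  /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

Written this way the clamp costs two comparisons per addition, which is the kind of overhead an intrinsic avoids by mapping onto a single instruction.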
Assembly Optimisation
• To develop an appreciation of how to optimise code, let us optimise an FIR filter:
$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$$
• For simplicity we write:
$$y[n] = \sum_{i=0}^{N-1} h[i]\, x[i] \qquad [1]$$
• To implement Equation 1, we need to perform the following steps:
(1) Load the sample x[i].
(2) Load the coefficients h[i].
(3) Multiply x[i] and h[i].
(4) Add (x[i] * h[i]) to the content of an accumulator.
(5) Repeat steps 1 to 4 N-1 times.
(6) Store the value in the accumulator to y.
Assembly Optimisation
• Steps 1 to 6 can be translated into the following ‘C6x assembly code:
                MVK     .S2     0,B0            ; Initialise the loop counter
                MVK     .S1     0,A5            ; Initialise the accumulator
loop            LDH     .D1     *A8++,A2        ; Load the samples x[i]
                LDH     .D1     *A9++,A3        ; Load the coefficients h[i]
                NOP             4               ; Add “nop 4” because the LDH has a latency of 5.
                MPY     .M1     A2,A3,A4        ; Multiply x[i] and h[i]
                NOP                             ; Multiply has a latency of 2 cycles
                ADD     .L1     A4,A5,A5        ; Add “x[i] . h[i]” to the accumulator
        [B0]    SUB     .L2     B0,1,B0         ; loop overhead
        [B0]    B       .S1     loop            ; loop overhead
                NOP             5               ; The branch has a latency of 6 cycles
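As a worked check on the listing, the cycle cost of one pass through the loop can be added up from the instruction count and the NOP operands. This is an estimate under the assumption that every instruction issues in its own cycle and that the multi-cycle NOPs take as many cycles as their operand; $N$ denotes the number of iterations.

```latex
\begin{align*}
C_{\text{iteration}} &= \underbrace{1 + 1}_{\text{LDH, LDH}}
  + \underbrace{4}_{\text{NOP 4}}
  + \underbrace{1}_{\text{MPY}}
  + \underbrace{1}_{\text{NOP}}
  + \underbrace{1}_{\text{ADD}}
  + \underbrace{1 + 1}_{\text{SUB, B}}
  + \underbrace{5}_{\text{NOP 5}} = 16 \\
C_{\text{total}} &\approx 16\,N + 2 \qquad \text{(the two MVK instructions run once)}
\end{align*}
```

Ten of the sixteen cycles are NOPs, which is why removing them is one of the optimisation goals.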
Assembly Optimisation

In order to optimise the code, we need
to:
(1) Use instructions in parallel.
(2) Remove the NOPs.
(3) Remove the loop overhead (remove SUB
and B: loop unrolling).
(4) Use word access or double-word access
instead of byte or half-word access.
Step 1 - Using Parallel Instructions
Before (one instruction per cycle):
Cycle   .D1     .D2     .M1     .M2     .L1     .L2     .S1     .S2     NOP
1       ldh
2       ldh
3-6                                                                     nop
7                       mpy
8                                                                       nop
9                                       add
10                                              sub
11                                                      b
12-16                                                                   nop

After (the two loads issued in parallel):
Cycle   .D1     .D2     .M1     .M2     .L1     .L2     .S1     .S2     NOP
1       ldh     ldh
2-5                                                                     nop
6                       mpy
7                                                                       nop
8                                       add
9                                               sub
10                                                      b
11-15                                                                   nop

Note: Not all instructions can be put in parallel since the result of one unit is used as an input to the following unit.
Step 2 - Removing the NOPs
Cycle   .D1     .D2     .M1     .M2     .L1     .L2     .S1     .S2     NOP
1       ldh     ldh
2                                               sub
3                                                       b
4-5                                                                     nop
6                       mpy
7                                                                       nop
8                                       add

loop            LDH     .D1     *A8++,A2
        ||      LDH     .D2     *B9++,B3
        [B0]    SUB     .L2     B0,1,B0
        [B0]    B       .S1     loop
                NOP             2
                MPY     .M1X    A2,B3,A4
                NOP
                ADD     .L1     A4,A5,A5
Step 3 - Loop Unrolling
• The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead).
• Unrolling the loop removes them:
                LDH     .D1     *A8++,A2        ; Start of iteration 1
        ||      LDH     .D2     *B9++,B3
                NOP             4
                MPY     .M1X    A2,B3,A4        ; Use of cross path
                NOP
                ADD     .L1     A4,A5,A5
                LDH     .D1     *A8++,A2        ; Start of iteration 2
        ||      LDH     .D2     *B9++,B3
                NOP             4
                MPY     .M1X    A2,B3,A4
                NOP
                ADD     .L1     A4,A5,A5
                :
                :
                LDH     .D1     *A8++,A2        ; Start of iteration n
        ||      LDH     .D2     *B9++,B3
                NOP             4
                MPY     .M1X    A2,B3,A4
                NOP
                ADD     .L1     A4,A5,A5
Step 4 - Word or Double Word Access
• The ‘C6711 has two 64-bit data buses for data memory access and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2).
• In addition the ‘C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2).
Note: Store can only be up to 32-bit.
Step 4 - Word or Double Word Access
• Using word access, MPY and MPYH the previous code can be written as:
loop            LDW     .D1     *A9++,A3        ; 32-bit word is loaded in a single cycle
        ||      LDW     .D2     *B6++,B1
                NOP             4
        [B0]    SUB     .L2     B0,1,B0
        [B0]    B       .S1     loop
                NOP             2
                MPY     .M1X    A3,B1,A4
        ||      MPYH    .M2X    A3,B1,B3
                NOP
                ADD     .L1X    A4,B3,A5
Note: By loading words and using MPY and MPYH instructions the execution time has been halved, since in each iteration two 16x16-bit multiplications are performed.
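The idea behind word access with MPY and MPYH can be sketched in C: each 32-bit word carries two 16-bit samples, and one loop pass multiplies both pairs. This is an illustrative model, not the slide's code. The names `sign16` and `dotp_packed` are hypothetical, and the sketch assumes signed 16-bit samples with the first sample in the low half of each word.

```c
#include <stddef.h>
#include <stdint.h>

/* Interpret a 16-bit pattern as a signed two's-complement value. */
static int32_t sign16(uint32_t bits)
{
    bits &= 0xFFFFu;
    return (bits & 0x8000u) ? (int32_t)bits - 0x10000 : (int32_t)bits;
}

/* Dot product over packed data: every 32-bit word holds two 16-bit samples.
 * The low halves are multiplied as MPY would, the high halves as MPYH would. */
int32_t dotp_packed(const uint32_t *a, const uint32_t *x, size_t nwords)
{
    int32_t sum = 0;
    for (size_t i = 0; i < nwords; ++i) {
        int32_t lo = sign16(a[i]) * sign16(x[i]);               /* MPY: low 16 bits */
        int32_t hi = sign16(a[i] >> 16) * sign16(x[i] >> 16);   /* MPYH: high 16 bits */
        sum += lo + hi;
    }
    return sum;
}
```

For N samples the loop runs N/2 times, which is where the halved execution time comes from.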
Optimisation Summary
• It has been shown that there are four complementary methods for code optimisation:
    – Using instructions in parallel.
    – Filling the delay slots with useful code.
    – Using word or double word load.
    – Loop unrolling.
The first three increase performance and reduce code size. Loop unrolling increases performance but increases code size.
Software Optimisation
Part 2 - Software Pipelining

Objectives
• Why use software pipelining (SP)?
• Understand software pipelining concepts.
• Use the software pipelining procedure.
• Code the word-wide software pipelined dot-product routine.
• Determine if your pipelined code is more efficient with or without prolog and epilog.

Why use software pipelining (SP)?
• SP creates highly optimized loop code by:
    – Putting several instructions in parallel.
    – Filling delay slots with useful code.
    – Maximizing the use of the functional units.
• SP is implemented by simply using the tools:
    – Compiler options -o2 or -o3.
    – Assembly Optimizer if .sa file.
Software Pipeline concept
To explain the concept of software pipelining, we will assume that all instructions execute in one cycle.
                LDH
        ||      LDH
                MPY
                ADD
How many cycles would it take to perform this loop 5 times? (Disregard delay slots.)
5 x 3 = 15 cycles
Let’s examine hardware (functional units) usage ...
Non-Pipelined Code
Cycle   .D1     .D2     .M1     .M2     .L1     .L2     .S1     .S2
1       ldh     ldh
2                       mpy
3                                       add
4       ldh     ldh
5                       mpy
6                                       add
7       ldh     ldh
8                       mpy
9                                       add

Pipelining Code
Cycle   .D1     .D2     .M1     .M2     .L1     .L2     .S1     .S2
1       ldh     ldh
2       ldh     ldh     mpy
3       ldh     ldh     mpy             add
4       ldh     ldh     mpy             add
5       ldh     ldh     mpy             add
6                       mpy             add
7                                       add
Pipelining these instructions takes only 7 cycles, about 1/2 of the 15 cycles needed without pipelining.
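The two cycle counts follow from a simple relation. As a worked example, let $N$ be the number of loop iterations and $S$ the number of single-cycle stages per iteration (here load, multiply and add, so $S = 3$). Both symbols are introduced only for this illustration, and the relation assumes a new iteration can start every cycle.

```latex
\begin{align*}
C_{\text{non-pipelined}} &= N \times S = 5 \times 3 = 15 \\
C_{\text{pipelined}} &= N + S - 1 = 5 + 3 - 1 = 7
\end{align*}
```

The ratio $7/15 \approx 0.47$ is the "half the cycles" figure, and for large $N$ the pipelined count approaches one cycle per iteration.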
Pipelining Code
Cycle   .D1     .D2     .M1     .L1
Prolog (staging for loop):
1       ldh     ldh
2       ldh     ldh     mpy
Loop kernel (single-cycle “loop” iterated three times):
3       ldh     ldh     mpy     add
4       ldh     ldh     mpy     add
5       ldh     ldh     mpy     add
Epilog (completing final operations):
6                       mpy     add
7                               add
Pipelined Code
prolog:         LDH                     ; load 1
        ||      LDH

                MPY                     ; mpy 1
        ||      LDH                     ; load 2
        ||      LDH

loop:           ADD                     ; add 1
        ||      MPY                     ; mpy 2
        ||      LDH                     ; load 3
        ||      LDH

                ADD                     ; add 2
        ||      MPY                     ; mpy 3
        ||      LDH                     ; load 4
        ||      LDH
                .
                .
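The prolog, kernel and epilog structure can be emulated in C by treating each pass of a loop as one cycle in which the add, multiply and load stages work on three different iterations. This is a sketch for illustration only: `dotp_pipelined` is a hypothetical name, every stage is assumed to take one cycle as in the slides, and real delay slots are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Software-pipelined dot product, emulated one "cycle" per loop pass.
 * In each pass the three stages work on three different iterations:
 * add (iteration c-2), multiply (iteration c-1) and load (iteration c).
 * If cycles is not NULL it receives the number of passes executed. */
int32_t dotp_pipelined(const int16_t *a, const int16_t *x, size_t n, size_t *cycles)
{
    int16_t la = 0, lx = 0;   /* values produced by the load stage */
    int32_t prod = 0;         /* value produced by the multiply stage */
    int32_t sum = 0;
    size_t c = 0;

    if (n > 0) {
        for (c = 0; c < n + 2; ++c) {
            if (c >= 2)            sum += prod;                 /* add stage */
            if (c >= 1 && c <= n)  prod = (int32_t)la * lx;     /* multiply stage */
            if (c < n)             { la = a[c]; lx = x[c]; }    /* load stage */
        }
    }
    if (cycles != NULL) *cycles = c;
    return sum;
}
```

The stages are evaluated in reverse order inside each pass so that every stage consumes the value produced in the previous cycle. The first two passes play the role of the prolog, the last two the epilog, and for five taps the function reports 7 cycles.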
Software Pipelining Procedure
1. Write algorithm in C code & verify.
2. Write ‘C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to ‘C6x code.
Assignment
• Write C code for an FIR filter using 16-bit coefficients and inputs.
• Is this code optimal for implementation on the TMS320C6x?
• Rewrite the C code so that 2 multiplication units are used at each iteration.