
Class Presentation of
Custom DSP Implementation Course on:
TMS320C54x DSP processor
ECE Department – University of Tehran
Presented by: Shahab adin Rahmanian
May 2005
This is a class presentation. All material is the copyright of its
respective authors, as listed in the references, and has been
used here for educational purposes only.
Outline
• Introduction
• Architecture
• Applications
• Features
• Instruction Set and addressing
• FIR Filtering
• Accelerating Polynomial Evaluation
• Numerical Issues
• Write code in C
• Conclusion
Introduction
TMS320C54x [2]
• A fixed-point digital signal processor (DSP) in the TMS320 family
• Low-power DSP: 0.54 mW/MIPS
• Acceleration for FIR and LMS filtering, code book search, polynomial evaluation, Viterbi decoding, and fast Fourier transform
[4]
Some Typical Applications
• General-Purpose
– Adaptive filtering
– Digital filtering
– Fast Fourier transforms
• Control
– Disk drive control
– Laser printer control
– Robotics control
• Military
– Missile guidance
– Radar processing
– Secure communication
• Telecommunications
– 1200- to 19200-bps modems
– Adaptive equalizers
– Cellular telephones
– Echo cancellation
– Video conferencing
Software Applications
• Circular Buffers
• Single-Instruction Repeat (RPT) Loops
• Extended-Precision Arithmetic
– Addition and Subtraction
– Multiplication
– Division
– Square Root
• Floating-Point Arithmetic
• Application-Oriented Operations
– Symmetric FIR Filters
– Adaptive Filtering
– Viterbi Algorithm for Channel Decoding
• Fast Fourier Transforms
Some key features
• CPU
– Advanced multibus architecture with three separate
16-bit data buses and one program bus
– 40-bit arithmetic logic unit (ALU), including a 40-bit
barrel shifter and two independent 40-bit
accumulators
– 17-bit × 17-bit parallel multiplier coupled to a 40-bit
dedicated adder for non-pipelined single-cycle
multiply/accumulate (MAC) operation
• Memory
– 192K words × 16-bit maximum addressable memory
space (64K words program, 64K words data, and 64K
words I/O)
– 28K words × 16-bit single-access on-chip ROM with
8K words configurable as program or data memory
(’C541 only)
Some key features
• On-chip peripherals
– On-chip phase-locked loop (PLL) clock generator
with internal oscillator or external clock source
– Two full-duplex serial ports to support 8- and
16-bit transfers (’C541 only)
– Time-division multiplexed (TDM) serial port
(’C542/’C543 only)
– One 16-bit timer
• Speed: 25/20-ns execution time for a single-cycle
fixed-point instruction (40 MIPS/50 MIPS) with 5-V
power supply
C54x Addressing Modes
• Immediate
– Operand is part of the instruction
ADD #0FFh
• Absolute
– Address of operand is part of the instruction
LD *(LABEL), A
• Register
– Operand is specified in a register
READA DATA ; (data read from address in accumulator A)
C54x Addressing Modes
• Direct
– Address of operand is part of the instruction (added to implied memory page)
ADD 010h,A
• Indirect
– Address of operand is stored in a register: ADD *AR1
– Offset addressing: ADD *AR1(10)
– Register offset (ar1+ar0): ADD *AR1+0
– Autoincrement/decrement: ADD *AR1+
– Bit reversed addressing: ADD *AR1+0B
– Circular addressing: ADD *AR1+%
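The access orders produced by circular and bit-reversed addressing can be sketched in C. This is only a model of the resulting index sequence, not of the address-generation hardware; the helper names circ_inc and bit_reverse are hypothetical, and the circular buffer is assumed to start at index 0.

```c
#include <stdint.h>

/* Circular post-increment: advance idx by step and wrap inside a
   buffer of bk elements (bk plays the role of the BK register). */
unsigned circ_inc(unsigned idx, unsigned step, unsigned bk)
{
    return (idx + step) % bk;
}

/* Bit-reversed order: reverse the low `bits` bits of idx. Stepping
   idx = 0, 1, 2, ... through this function visits a 2^bits-point
   buffer in the order an in-place radix-2 FFT expects. */
unsigned bit_reverse(unsigned idx, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (idx & 1u);  /* move the lowest bit of idx into r */
        idx >>= 1;
    }
    return r;
}
```

For an 8-element buffer, circ_inc(7, 1, 8) wraps back to 0, and bit_reverse(1, 3) gives 4.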
C54X Instructions Set by Category
• Arithmetic: ADD, MAC, MAS, MPY, NEG, SUB, ZERO
• Data Management: LD, MAR, MV(D,K,M,P), ST
• Logical: AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR
• Program Control: B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC
• Application Specific: ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS
Notes
– CMPL: complement
– CMPM: compare memory
– MAR: modify address reg.
– MAS: multiply and subtract
Block FIR Filtering
• y[n] = h0 x[n] + h1 x[n-1] + ... + h(N-1) x[n-(N-1)]
– h stored as linear array of N elements (in prog. mem.)
– x stored as circular array of N elements (in data mem.)
; Addresses: a4 h, a5 N samples of x, a6 input buffer, a7 output buffer
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask:  ld    #firDP,dp            ; initialize data page pointer
          stm   #frameSize-1,brc     ; compute 256 outputs
          rptbd firloop-1
          stm   #N,bk                ; FIR circular buffer size
          ld    *ar6+,a              ; load input value to accumulator a
          stl   a,*ar4+%             ; replace oldest sample with newest
          rptz  a,#(N-1)             ; zero accumulator a, do N taps
          mac   *ar4+0%,*ar5+0%,a    ; one tap, accumulate in a
          sth   a,*ar7+              ; store y[n]
firloop:  ret
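As a cross-check on the listing above, the same filter can be sketched as a C reference model. This is an illustration, not TI code. The names fir_state and fir_step are hypothetical, samples and coefficients are assumed to be Q15, and a 64-bit integer stands in for the 40-bit accumulator. The final shift assumes the compiler implements >> on negative values as an arithmetic shift.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical state for a direct-form FIR with a circular delay line. */
typedef struct {
    const int16_t *h;   /* N coefficients, Q15, linear array */
    int16_t *x;         /* N past samples, circular array */
    size_t n;           /* number of taps N */
    size_t newest;      /* index of the most recent sample */
} fir_state;

/* Consume one input sample and return one output sample (Q15). */
int16_t fir_step(fir_state *s, int16_t in)
{
    /* The slot after the newest sample holds the oldest one: overwrite it. */
    s->newest = (s->newest + 1) % s->n;
    s->x[s->newest] = in;

    int64_t acc = 0;                 /* wide accumulator, cleared like rptz */
    size_t idx = s->newest;
    for (size_t k = 0; k < s->n; k++) {
        acc += (int32_t)s->h[k] * s->x[idx];       /* h[k] * x[n-k] */
        idx = (idx == 0) ? s->n - 1 : idx - 1;     /* step back with wrap */
    }
    return (int16_t)(acc >> 15);     /* Q30 sum truncated back to Q15 */
}
```

With coefficients {0.5, 0.25} in Q15 (16384 and 8192) and a zeroed delay line, feeding 1000 then 2000 yields 500 and then 1250.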
Accelerating Symmetric FIR Filtering
• Coefficients in linear phase filters are either
symmetric or anti-symmetric
• Symmetric coefficients need 2 multiplies and 3 adds instead of 4 multiplies and 3 adds
y[n] = h0 x[n] + h1 x[n-1] + h1 x[n-2] + h0 x[n-3]
y[n] = h0 (x[n] + x[n-3]) + h1 (x[n-1] + x[n-2])
• Accelerated by FIRS (FIR Symmetric) instruction
– x in two circular buffers
– h in program memory
Accelerating Symmetric FIR Filtering
; Addresses: a6 input buffer, a7 output buffer
; a4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; a5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask:  ld    #firDP,dp                ; initialize data page pointer
          stm   #frameSize-1,brc         ; compute 256 outputs
          rptbd firloop-1
          stm   #N/2,bk                  ; FIR circular buffer size
          ld    *ar6+,b                  ; load input value to accumulator b
          mvdd  *ar4,*ar5+0%             ; move old x[n-N/2] to new x[n-N/2-1]
          stl   b,*ar4%                  ; replace oldest sample with newest
          add   *ar4+0%,*ar5+0%,a        ; a = x[n] + x[n-N/2-1]
          rptz  b,#(N/2-1)               ; zero accumulator b, do N/2-1 taps
          firs  *ar4+0%,*ar5+0%,coeffs   ; b += a * h[i], do next a
          mar   *+ar4(2)%                ; to load the next newest sample
          mar   *ar5+%                   ; position for x[n-N/2] sample
          sth   b,*ar7+                  ; store y[n]
firloop:  ret
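The folding that FIRS exploits can be sketched in C: add the two samples that share a coefficient first, then do one multiply per pair. The function name fir_symmetric is hypothetical. For clarity the samples are a plain linear array with x[k] holding x[n-k], rather than two circular buffers, and N is assumed to be even.

```c
#include <stdint.h>
#include <stddef.h>

/* Symmetric FIR with h[k] == h[N-1-k]; only the first N/2 coefficients
   are passed in h_half. Returns the unscaled sum of products. */
int32_t fir_symmetric(const int16_t *h_half, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t k = 0; k < n / 2; k++) {
        /* Pre-add the mirrored pair; the sum needs 17 bits. */
        int32_t pair = (int32_t)x[k] + x[n - 1 - k];
        acc += (int64_t)pair * h_half[k];   /* one multiply per pair */
    }
    return (int32_t)acc;   /* assumes the result fits in 32 bits */
}
```

For h = {1, 2, 2, 1} and x = {1, 2, 3, 4}, passing h_half = {1, 2} gives 15, the same value as the four-multiply form.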
Architecture - FIRS
Accelerating Polynomial Evaluation
• Function approximation and spline interpolation
• Fast polynomial evaluation (N coefficients)
– Expanded form: y(x) = c0 + c1 x + c2 x^2 + c3 x^3
– Horner’s form: y(x) = c0 + x (c1 + x (c2 + x (c3)))
– POLY reduces 2 N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
; 1. a = c(3) + x * 0 = c(3); b = c2
; 2. a = c(2) + x * c(3); b = c1
          ld    *ar2+,16,b     ; b = c3 << 16
          ld    *ar3,t         ; t = x (ar3 contains addr of x)
          rptz  a,#3           ; a = 0, repeat next inst. 4 times
          poly  *ar2+          ; a = b + x*a || b = c(i-1) << 16
          sth   a,*ar4         ; store result (ar4 is addr of y)
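The recurrence in the poly comment (a = b + x*a) is Horner's rule. A minimal C sketch follows, using double instead of fixed point to keep the arithmetic easy to follow. The function name horner is hypothetical, and the coefficients are ordered highest degree first to match the [c3 c2 c1 c0] array above.

```c
#include <stddef.h>

/* Evaluate c[0]*x^(n-1) + ... + c[n-1] with n multiplies and n adds. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;                 /* starts at zero, like rptz a */
    for (size_t i = 0; i < n; i++) {
        acc = c[i] + x * acc;         /* one step per coefficient */
    }
    return acc;
}
```

With c = {1, 2, 3, 4} (so c3 = 1 and c0 = 4) and x = 2, the result is 4 + 3*2 + 2*4 + 1*8 = 26.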
Integer Multiplication
• Integer multiplication yields products larger than the inputs, as
can be seen with single-digit decimal values as inputs
(for example, 9 × 9 = 81).
• Does the user store the lower (1) or upper (8) result?
– Both must be kept, resulting in additional resources (two
cycles, words of code, and RAM locations) to complete the
store.
– Worse, how can the double-sized result be used recursively
as an input in later calculations, given that the multiplier
inputs are single-width?
Fractional Multiplication
• Multiplication of fractions yields products that never exceed the
range of a fraction, as can be seen with single-digit decimal
fractions as inputs (for example, .9 × .9 = .81).
• Don’t we still have a double-sized result to store?
– In this case, we can store just the upper result (.8)
– This allows storage of the result with fewer resources
– Results may be used recursively
• Has accuracy been lost by dropping the lower accumulator
value?
Accuracy vs. Precision
• Often the programmer wants to retain the fullest
accuracy of a calculation, thus dropping the 16 LSBs
of the result in the previous example seems a bad
choice.
• Note, though, the inputs: how much accuracy do they
offer?
• The product offers double precision, but its accuracy
is based on the single-width inputs.
• Thus, storing a single-precision result is not only an
efficient solution, but represents the limit of the
accuracy of the result.
• The accumulator is double-sized for two reasons:
– To allow for integer operations, which would
possibly require the LSBs for the result.
– So that sum-of-product operations will generate
accumulative noise at the 32nd vs. the 16th bit.
Redundant Sign Bit
• Multiplication of two signed
numbers yields a product with two
sign bits
• Extra sign bit causes problems if
stored to memory as result:
– Wastes space
– Creates off-size Q
• Solution: Fractional mode bit!
• When FRCT (mode bit in ST1)
is set, the multiplier output is left-shifted by one
• For 16-bit ‘C54x: Q15 × Q15 = Q15
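The effect of the one-bit left shift can be sketched in C. The function name q15_mul is hypothetical. Saturating the single overflow case, (-1.0) × (-1.0), is an assumption of this sketch and not a statement about the hardware, and the final shift assumes >> on negative values is arithmetic.

```c
#include <stdint.h>

/* Q15 x Q15 -> Q15, modelling the left shift applied when FRCT is set. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;     /* Q30 product: two sign bits */
    if (p == 0x40000000) {
        return INT16_MAX;           /* (-1.0)*(-1.0): assumed to saturate */
    }
    int32_t q31 = p * 2;            /* drop the redundant sign bit: Q31 */
    return (int16_t)(q31 >> 16);    /* keep the upper 16 bits: Q15 */
}
```

For example, q15_mul(16384, 16384) returns 8192, which is 0.5 × 0.5 = 0.25 in Q15.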
Accumulation
• With fractions, we were able to guarantee that no
multiplicative overflow could occur, i.e., F*F <= F.
• For addition, this rule does not apply, i.e., F+F can exceed F.
• Therefore, we need additional measures to manage
the possibility of overflow for accumulation. Two
general methods apply:
– Guard Bits: the ‘C54x offers an 8-bit extension
above the high accumulator to allow valid
representation of the result of up to 256
summations.
– Non-gain Systems: offer additional criteria that allow
a simple solution for unlimited length summations.
Guard Bits and saturation
• Guard Bits: the ‘C54x offers an 8-bit extension above
the high accumulator to allow valid representation of
the result of up to 256 summations.
• Saturation (SAT)
– SAT instruction saturates value exceeding 32-bit
range in the selected accumulator:
SAT A
SAT B
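A rough C model of guard bits followed by saturation is shown below. The names sat32 and sum_then_saturate are hypothetical, and an int64_t stands in for the 40-bit accumulator, so it has more headroom than the 8 guard bits described above.

```c
#include <stdint.h>

/* Clamp a wide accumulator value to the 32-bit range. */
int32_t sat32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}

/* Sum 32-bit terms with headroom, then saturate once at the end. */
int32_t sum_then_saturate(const int32_t *v, int count)
{
    int64_t acc = 0;                 /* extra upper bits act as guard bits */
    for (int i = 0; i < count; i++) {
        acc += v[i];                 /* may exceed 32 bits temporarily */
    }
    return sat32(acc);
}
```

Summing {INT32_MAX, INT32_MAX, INT32_MIN} passes through a value that does not fit in 32 bits, yet the final result INT32_MAX - 1 is exact because the overflow was only temporary.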
Non-gain Systems
• Many systems can be modeled to have no DC gain:
– Filters with low Q.
– Any system scaled by its maximum gain value.
• Input values from A/D converters are automatically
fractions, if the limits of the A/D are presumed to be +/-1.
• Coefficient values can similarly be bounded by making the
largest value the scaling factor for all other values.
• For these systems, it is known that the final value of the
process is less than or equal to the input values.
• The accumulator therefore can be allowed to temporarily
overflow, since the final result is known to be bounded by +/-1.
• Allows maximum usage of selected A/D and D/A
converters
– D/A bits for gain are more expensive than using analog
components
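Why temporary overflow is harmless can be shown with wraparound (modulo 2^16) arithmetic in C: if the final sum fits, the intermediate wraps cancel. The function name wrap_sum16 is hypothetical, and 16 bits are used only to keep the numbers small.

```c
#include <stdint.h>

/* Sum 16-bit values with wraparound and no saturation. */
int16_t wrap_sum16(const int16_t *v, int count)
{
    uint16_t acc = 0;   /* unsigned arithmetic wraps modulo 2^16 */
    for (int i = 0; i < count; i++) {
        acc = (uint16_t)(acc + (uint16_t)v[i]);
    }
    /* Reinterpret the 16-bit pattern as a two's-complement value. */
    if (acc & 0x8000u) {
        return (int16_t)((int32_t)acc - 65536);
    }
    return (int16_t)acc;
}
```

Summing {30000, 10000, -20000} overflows the signed 16-bit range after the second term (40000), but the final result is the correct 20000.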
Division
• The ‘C54x does not have a single-cycle 16-bit divide
instruction
– Divide is a rare function in DSP
– Division hardware is expensive
• The ‘C54x does have a single-cycle 1-bit divide
instruction: conditional subtract or SUBC
– Preceded by RPT #15, a 16-bit divide is performed
– Is much faster than without SUBC
• The SUBC process operates only on unsigned operands,
thus software must:
– Compare the signs of the input operands
• If they are alike, plan a positive quotient
• If they differ, plan to negate (NEG) the quotient
– Strip the signs of the inputs
– Perform the unsigned division
– Attach the proper sign based on the comparison of the
inputs
Division Routine
B = num*den (tells sign)
Strip sign of numerator
Strip sign of denominator
16 iterations
1-bit divide
If result needs to be negative
Invert sign
Store negative result
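The routine outlined above can be sketched in C as 16 conditional-subtract steps wrapped in sign handling. This is a model of the technique, not a cycle-accurate description of SUBC. The names div_u16 and div_s16 are hypothetical, and the denominator is assumed to be nonzero.

```c
#include <stdint.h>

/* Unsigned 16-bit divide: 16 one-bit conditional-subtract steps.
   The quotient ends in the low half of acc, the remainder in the high half. */
uint16_t div_u16(uint16_t num, uint16_t den, uint16_t *rem)
{
    uint32_t acc = num;                   /* numerator in the low half */
    uint32_t d = (uint32_t)den << 15;     /* divisor aligned 15 bits up */
    for (int i = 0; i < 16; i++) {
        if (acc >= d) {
            acc = ((acc - d) << 1) + 1;   /* subtract fits: shift in a 1 */
        } else {
            acc <<= 1;                    /* otherwise shift in a 0 */
        }
    }
    if (rem) {
        *rem = (uint16_t)(acc >> 16);
    }
    return (uint16_t)(acc & 0xFFFFu);
}

/* Signed wrapper: strip signs, divide unsigned, restore the sign. */
int16_t div_s16(int16_t num, int16_t den)
{
    int negative = (num < 0) != (den < 0);   /* same test as sign of num*den */
    uint16_t n = (uint16_t)(num < 0 ? -(int32_t)num : num);
    uint16_t d = (uint16_t)(den < 0 ? -(int32_t)den : den);
    uint16_t q = div_u16(n, d, 0);
    /* -32768 / -1 does not fit in 16 bits and is not handled here. */
    return (int16_t)(negative ? -(int32_t)q : (int32_t)q);
}
```

For example, div_s16(-7, 2) gives -3, and div_u16(7, 2, &r) gives 3 with r set to 1.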
Rounding
• The result of a multiplication can be rounded for MPY, MAC,
and MAS operations. This is specified by appending an “R”
suffix to the instruction.
• Example: MAC with rounding is MACR.
• Rounding consists of adding 2^15 to the result and then
clearing the low accumulator.
• In a long sum-of-products, only the last MAC operation should
specify rounding.
• Rounding can also be achieved with a load
operation.
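The add-2^15-then-clear step described in this section can be modelled in C as follows. The function name round_acc is hypothetical, an int64_t stands in for the accumulator, and a two's-complement representation is assumed for the bit mask.

```c
#include <stdint.h>

/* Round an accumulator value to its upper part: add half of the
   low 16-bit word, then clear the low 16 bits. */
int64_t round_acc(int64_t acc)
{
    acc += (int64_t)1 << 15;           /* add 2^15 */
    return acc & ~(int64_t)0xFFFF;     /* clear the low accumulator word */
}
```

For example, 0x18000 rounds up to 0x20000, while 0x17FFF rounds down to 0x10000.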
Sign Extension (SXM)
Write code in C
• Inline Assembly
– Allows direct access to assembly language from C
– Useful for operating on components not used by C, ex:
• Note: first column after leading quote is label field
• Long operations should be written in ASM and called from C
– main C file retains portability
– yields more easily maintained structures
– eliminates risk of interfering with registers in use by C
Accessing MMRs from C
• Using pointers to access Memory-Mapped
Registers:
– Create a pointer and set its value to the assigned memory address:
volatile unsigned int *SPC_REG = (volatile unsigned int *) 0x0022;
– Read and write to the register as any other pointer:
*SPC_REG = 0xC8;
• Accessing I/O Ports from C
– 1. create the port:
ioport unsigned port8000;
– 2. access the port:
x = port8000;
port8000 = y;
Summary and Conclusion
• C54x is a conventional digital signal processor
– Separate data/program busses (3 reads & 1
write/cycle)
– Extended precision accumulators
– Single-cycle multiply-accumulate
– Saturation and wraparound arithmetic
– Bit-reversed and circular addressing modes
• C54x has instructions to accelerate algorithms
– Communications: FIR & LMS filtering, Viterbi
decoding
– Speech coding: vector distances for code book
search
– Interpolation: polynomial evaluation
References
[1] Texas Instruments, TMS320C54x DSP Design Workshop, May 1997
[2] TMS320C54x User’s Guide
[3] www.ti.com
[4] Prof. Brian L. Evans, Signal and Image Processing on the TMS320C54x DSP
[5] TMS320C54x Assembly Language Tools