ECSE 436 Signal Processing Hardware

DSP architecture

• Review of basic computer architecture concepts
• C6000 architecture: VLIW
    • Principle and Scheduling
    • Addressing
    • Assembly and linear assembly
    • Pipelining

ECSE 436
Instruction Set Architecture (ISA)

• Computers run programs made of simple operations called “instructions”
• The list of instructions offered by the machine is the “instruction set”
• The instruction set is what is visible to the programmer (really the compiler, although humans can program directly in “assembly language”)
• Many different DSPs can share the same ISA but have different hardware (i.e. the implementation of the ISA is different)

Instructions

• Two kinds of information in a computer:
    • instructions
    • data
• Instructions are stored as numbers, just like data
• Instructions and data are stored in the memory

Basic Computer Organization

[Figure: a CPU containing registers, a program counter (PC), and an instruction register (IR) with OPCODE and OPERANDS fields, connected to memory by load and store paths]

• CPU: limited number of fast registers for temporary storage
• Memory: large amount of slow memory, arranged as an array of bytes
• Instructions are loaded into an instruction register (IR) from the address pointed to by the program counter (PC). The PC is incremented by the instruction size (in bytes) for each new instruction, e.g. PC ← PC + 4

Load/Store Architecture (Reg-Reg)

[Figure: CPU with registers, PC, and IR, connected to memory by load and store paths]

• The register numbers are specified in the operand fields of the instruction
• Instructions can ONLY get their data from, and store their data to, registers.
• Since data is stored in memory, we need special “load” and “store” instructions for transfers between registers and memory. These two instructions are the ONLY ones allowed to access memory

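The fetch cycle and the load/store rule can be sketched as a toy interpreter. This is only an illustration, not a real ISA: the tuple encoding, the opcode names, the fixed 4-byte instruction size, and the `run` function are all assumptions made for the example. Operands follow the source-then-destination order used elsewhere in these slides.

```python
# Toy load/store machine (hypothetical encoding, for illustration only).
# Every instruction is assumed to occupy 4 bytes, so the PC advances by 4.

def run(program, memory, num_regs=8):
    """Execute `program` (a list of tuples) and return the register file."""
    regs = [0] * num_regs
    pc = 0                                   # byte address of the next instruction
    while pc // 4 < len(program):
        ir = program[pc // 4]                # fetch into the instruction register
        pc += 4                              # PC <- PC + 4
        opcode, *operands = ir               # split OPCODE from OPERANDS
        if opcode == "load":                 # load addr, rd : memory -> register
            addr, rd = operands
            regs[rd] = memory[addr]
        elif opcode == "store":              # store rs, addr : register -> memory
            rs, addr = operands
            memory[addr] = regs[rs]
        elif opcode == "add":                # add ra, rb, rd : registers only
            ra, rb, rd = operands
            regs[rd] = regs[ra] + regs[rb]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return regs


memory = {0: 5, 4: 7, 8: 0}                  # address -> value
program = [
    ("load", 0, 0),                          # r0 <- mem[0]
    ("load", 4, 1),                          # r1 <- mem[4]
    ("add", 0, 1, 2),                        # r2 <- r0 + r1
    ("store", 2, 8),                         # mem[8] <- r2
]
regs = run(program, memory)
```

Only the `load` and `store` branches touch `memory`; the `add` branch reads and writes registers alone, which is the defining property of a load/store machine.
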
C6000 Architecture

• TMS320C62x/C64x
    • 16-bit fixed point DSP
• TMS320C67x
    • 32-bit floating point DSP
    • Instruction set is a superset of the C62x
• VLIW Architecture
    • Very Long Instruction Word

VLIW

• VLIW is an architecture that exploits instruction level parallelism (ILP) in the code
• What is ILP?
• An instruction is dependent on another if it uses (produces) a value produced (used) by the other instruction

Example

    add  c,d,e
    mult b,e,a

• The mult instruction must wait for the add instruction to finish before it can execute (sequential data flow): mult reads e, which add produces

Example

    add a,b,e
    add c,d,f
    add e,f,g

• The first two adds have no data dependency and could even be switched in the code with no effect on the correctness of the answer
• The first two adds could be executed in parallel if we had the hardware to do it (two adders)

[Figure: dataflow graph in which a + b produces e, c + d produces f, and a third adder computes e + f to produce g]

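The dependency rule can be checked mechanically. The sketch below is an assumption-laden illustration: the tuple form `(opcode, src1, src2, dst)` and the function name `depends_on` are invented here, and the write-after-write check is an addition beyond the two cases named on the earlier slide.

```python
# Each instruction is (opcode, src1, src2, dst), matching the slide notation
# "add a,b,e" in which the last operand is the destination.

def depends_on(later, earlier):
    """Return True if `later` must wait for `earlier`."""
    _, l_src1, l_src2, l_dst = later
    _, e_src1, e_src2, e_dst = earlier
    uses_produced = e_dst in (l_src1, l_src2)   # later uses a value earlier produces
    produces_used = l_dst in (e_src1, e_src2)   # later produces a value earlier uses
    same_dest = l_dst == e_dst                  # extra check: both write one name
    return uses_produced or produces_used or same_dest


i1 = ("add", "a", "b", "e")
i2 = ("add", "c", "d", "f")
i3 = ("add", "e", "f", "g")
```

Here `depends_on(i2, i1)` is false, so the first two adds may run in parallel or be swapped, while `i3` depends on both of them.
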
Scheduling

• Given a set of hardware resources (functional units), e.g. a number of adders, multipliers, etc., the process of determining which instructions can be executed in parallel and which functional units to use on any given clock cycle is called instruction scheduling

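One simple way to picture instruction scheduling is a greedy, cycle-by-cycle assignment. The sketch below is not the algorithm used by the TI tools; it assumes single-cycle instructions, that each name is written only once, and a hypothetical `units` table giving the number of functional units per opcode.

```python
def schedule(instrs, units):
    """Greedy list scheduling.

    instrs: list of (opcode, src1, src2, dst) in program order.
    units:  dict mapping opcode -> number of functional units of that kind.
    Returns a list of cycles; each cycle is a list of instruction indices.
    """
    done = set()                       # instructions issued in earlier cycles
    cycles = []
    while len(done) < len(instrs):
        free = dict(units)             # units still available this cycle
        issued = []
        for i, (op, s1, s2, _dst) in enumerate(instrs):
            if i in done:
                continue
            # Ready when every earlier producer of s1/s2 finished in a prior cycle.
            ready = all(j in done for j in range(i)
                        if instrs[j][3] in (s1, s2))
            if ready and free.get(op, 0) > 0:
                free[op] -= 1
                issued.append(i)
        if not issued:
            raise ValueError("no functional unit can run the remaining code")
        done.update(issued)
        cycles.append(issued)
    return cycles


adds = [("add", "a", "b", "e"),
        ("add", "c", "d", "f"),
        ("add", "e", "f", "g")]
two_adders = schedule(adds, {"add": 2})
one_adder = schedule(adds, {"add": 1})
```

With two adders the three adds finish in two cycles; with one adder they take three, which shows how the available hardware bounds the parallelism the scheduler can use.
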
VLIW

• VLIW is an architecture that depends on the user (compiler) to do the scheduling
• Instructions are packed into a very long instruction word (256 bits)
• There is no scheduling hardware on the chip, unlike a Pentium 4, which uses hardware (dynamic) scheduling
• Benefits
    • simple hardware
• Drawbacks
    • requires sophisticated compilers
    • code compatibility: need to recompile if you use a different DSP, even one with the same ISA

C6713 Architecture
Maximum Performance

• C6713
    • 8 functional units, two MACs per cycle
        • 225 MHz
        • 1800 MIPS
    • 6 of the 8 units floating point
        • 225 MHz
        • 1350 MFLOPS

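The peak figures follow from multiplying the clock rate by the number of units that can each complete one operation per cycle. A short check of that arithmetic, using only the numbers on the slide (the variable names are chosen here):

```python
CLOCK_MHZ = 225          # C6713 clock rate
FUNCTIONAL_UNITS = 8     # at best one instruction per unit per cycle
FLOAT_UNITS = 6          # units able to execute floating-point operations

peak_mips = FUNCTIONAL_UNITS * CLOCK_MHZ      # million instructions per second
peak_mflops = FLOAT_UNITS * CLOCK_MHZ         # million floating-point ops per second
```

These are upper bounds that assume every unit is busy on every cycle.
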
Addressing Modes

• Load/Store
    • must load registers from memory, process data, store back to memory
• Linear (indirect addressing)
    • 32 registers A0-A15, B0-B15 can act as pointers
    • *R: register R contains the address of the memory location where a data value is stored

Linear Addressing

*R++(d)   R contains the address. After R is used, it is postincremented by displacement d (default is d = 1); -- postdecrements
*++R(d)   preincrement (or predecrement with --): R is modified first, then used as the address
*+R(d)    pre-offset without modification: the address is R + d, and R itself is left unchanged

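The three pointer forms differ only in when, and whether, R changes. The sketch below models them on a Python list. It is a simplification: the `Pointer` class and its method names are invented for the example, and displacements count list elements rather than bytes.

```python
class Pointer:
    """A register R used as a pointer into `memory` (a Python list)."""

    def __init__(self, memory, address):
        self.memory = memory
        self.r = address

    def post_inc(self, d=1):       # *R++(d): use R, then R <- R + d
        value = self.memory[self.r]
        self.r += d
        return value

    def pre_inc(self, d=1):        # *++R(d): R <- R + d, then use R
        self.r += d
        return self.memory[self.r]

    def offset(self, d=1):         # *+R(d): use R + d, R is left unchanged
        return self.memory[self.r + d]


memory = [10, 20, 30, 40]
p = Pointer(memory, 0)
first = p.post_inc()     # reads index 0, then R becomes 1
second = p.pre_inc()     # R becomes 2, then reads index 2
third = p.offset()       # reads index 3, R stays 2
```
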
Circular Addressing

• Address Mode Register (AMR)

TMS320 Assembly Language

[label][:]   mnemonic   [operand list]   [; comment]

[x] means that x is optional

• label
    • symbolic name for the address of the program line
• mnemonic
    • instruction, assembler directive, macro
    • cannot start in column 1
• operands
    • constants: binary (e.g. 010101b), decimal, hexadecimal (e.g. 0x9f or 9fh)
    • register names
    • symbols defined by assembler directives

Assembler Directives

• The assembler produces COFF (common object file format) files
• COFF files are divided into sections that contain instructions or data
• Assembler directives are instructions to the assembler on how to manipulate these sections or to define constants
    • they are not machine instructions
    • see Section 4.1 in the text for more details

C6000 ISA

[Figure: instruction format with callouts for the functional unit, conditional execution, and parallel execution fields]

Instruction Packing

VelociTI: 1 to 8 execute packets in a fetch packet

       Instruction 1    ; instructions 1 and 2
       Instruction 2    ; are executed sequentially
       Instruction 3    ; instructions 3, 4, and 5
    || Instruction 4    ; are executed
    || Instruction 5    ; in parallel

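The parallel bars can be read as "join the previous instruction's execute packet". A minimal sketch of that grouping rule follows; the function name `execute_packets` and the placeholder instruction names are assumptions, and the sketch does not enforce the limit of 8 instructions per fetch packet.

```python
def execute_packets(lines):
    """Group assembly lines into execute packets.

    A line that starts with '||' runs in parallel with the line before it,
    so it is appended to the current packet instead of starting a new one.
    """
    packets = []
    for line in lines:
        text = line.split(";", 1)[0].strip()       # drop any comment
        if not text:
            continue
        parallel = text.startswith("||")
        instr = text[2:].strip() if parallel else text
        if parallel and packets:
            packets[-1].append(instr)
        else:
            packets.append([instr])
    return packets


source = [
    "   Instruction1   ; runs alone",
    "   Instruction2   ; runs alone",
    "   Instruction3",
    "|| Instruction4",
    "|| Instruction5",
]
packets = execute_packets(source)
```

The five lines collapse into three execute packets: two single-instruction packets followed by one packet of three parallel instructions.
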
Sample Instructions

       ADD  .L1  A3,A7,A7     ; add A3+A7 -> A7
       SUB  .S1  A1,1,A1      ; subtract 1 from A1

       MPY  .M2  A7,B7,B6     ; mult 16 LSBs of A7,B7 -> B6
    || MPYH .M1  A7,B7,A6     ; mult 16 MSBs of A7,B7 -> A6

       LDH  .D2  *B2++,B7     ; load (B2) -> B7, inc B2
    || LDH  .D1  *A2++,A7     ; load (A2) -> A7, inc A2

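MPY and MPYH differ only in which 16-bit half of each source register they multiply. The sketch below emulates that behaviour on 32-bit bit patterns held in Python integers; the helper names and the register values are made up for the illustration, and results are shown as signed Python integers rather than 32-bit register contents.

```python
def to_signed16(value):
    """Interpret the low 16 bits of `value` as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mpy(src1, src2):
    """MPY: multiply the signed 16 LSBs of both registers."""
    return to_signed16(src1) * to_signed16(src2)


def mpyh(src1, src2):
    """MPYH: multiply the signed 16 MSBs of both registers."""
    return to_signed16(src1 >> 16) * to_signed16(src2 >> 16)


a7 = 0x00030002      # high half = 3,  low half = 2
b7 = 0xFFFF0005      # high half = -1, low half = 5
low_product = mpy(a7, b7)
high_product = mpyh(a7, b7)
```

Issuing both in parallel, as on the slide, yields two 16-bit by 16-bit products from one pair of packed registers in a single cycle.
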
Sample Instructions

            MVKL .S1  x,A4       ; move 16 LSBs of x addr -> A4
            MVKH .S1  x,A4       ; move 16 MSBs of x addr -> A4

    Loop    SUB  .S1  A1,1,A1    ; decrement A1
      [A1]  B    .S2  Loop       ; branch to Loop if A1 != 0
            NOP       5          ; 5 NOP instructions
            STW  .D1  A3,*A7     ; store A3 into (A7)

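A 32-bit address does not fit in one instruction, which is why it is built in two halves. The sketch below emulates the pair under the usual description of these instructions: MVKL writes the sign-extended low half, and MVKH then overwrites only the high half. The function names and the address value are hypothetical.

```python
def mvkl(constant):
    """MVKL: load the low 16 bits of `constant`, sign-extended to 32 bits."""
    low = constant & 0xFFFF
    if low & 0x8000:
        low |= 0xFFFF0000
    return low


def mvkh(constant, reg):
    """MVKH: replace the high 16 bits of `reg` with those of `constant`,
    keeping the low 16 bits of `reg`."""
    return (constant & 0xFFFF0000) | (reg & 0x0000FFFF)


x_addr = 0x8000A010           # hypothetical address of x
a4 = mvkl(x_addr)             # low half in place, high half is sign extension
after_mvkl = a4
a4 = mvkh(x_addr, a4)         # high half corrected, low half preserved
```

The order matters: running MVKL after MVKH would overwrite the high half with sign-extension bits.
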
Linear Assembly

• To effectively program a DSP using assembly language, you need to do the scheduling by hand!
• Need to account for the number of clock cycles each functional unit takes, etc.
• Difficult, so TI has linear assembly
    • you don’t have to schedule it, the compiler does it for you
    • can use CPU resources without worrying about scheduling, register allocation, etc.

Pipelining

• Key technique to make fast CPUs
• Multiple instructions are overlapped in execution
• E.g. automotive assembly line

Pipelining: principle

• body (B): 1 hour
• paint (P): 1 hour
• wheels (W): 1 hour

Pipelining: principle (II)

    Hour     0-1   1-2   2-3   3-4   4-5   5-6
    Bob      B1    P1    W1    B2    P2    W2

2 cars / 6 hours = 1/3 car / hour

Pipelining: principle (III)

    Hour     0-1   1-2   2-3   3-4   4-5   5-6
    Bob      B1    B2    B3    B4    B5    B6
    Alice          P1    P2    P3    P4    P5
    Bill                 W1    W2    W3    W4

1 car / hour (3 x speedup)

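The assembly-line timing can be written as two small formulas: one worker needs stages × cars hours, while a full line needs stages + cars − 1 hours because a new car enters every hour once the line is full. A minimal sketch, with function names chosen here:

```python
def unpipelined_hours(cars, stages=3, hours_per_stage=1):
    """One worker finishes every stage of a car before starting the next."""
    return cars * stages * hours_per_stage


def pipelined_hours(cars, stages=3, hours_per_stage=1):
    """One worker per stage; a new car enters the line every stage time."""
    if cars == 0:
        return 0
    return (stages + cars - 1) * hours_per_stage


# Speedup for a long production run approaches the number of stages.
long_run_speedup = unpipelined_hours(1000) / pipelined_hours(1000)
```

For 2 cars the single worker needs 6 hours, and in the same 6 hours the three-person line finishes 4 cars; the first car still takes 3 hours, so pipelining improves throughput rather than the latency of one car.
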
Pipelining: principle (IV)

[Figure: combinational logic (COMB. LOGIC) and its cycle time, shown without and with pipelining]

Performance Gain

• Pipelining a datapath m times can result in up to m times improvement in cycle time
    • E.g. a 5-stage pipelined processor is potentially 5 times faster than an unpipelined processor
• In reality, the gain is less than m because of restrictions in overlapping instructions

5-Stage RISC Pipeline
16-Stage C6713 Pipeline

• Fetch (4 stages)
    • calc. address, send address, wait, receive
• Decode (2 stages)
    • separate fetch packets into execute packets
• Execute (10 stages)
    • Different instructions require different numbers of cycles to execute

Software and I/O

• Code efficiency and programming techniques
    • Loop unrolling
    • Software pipelining
• I/O considerations
    • Interrupts
    • DMA
    • Block processing

Code Efficiency

• Intrinsic functions
    • e.g. _add2, _mpy, _sadd
    • see TMS320C62x/C67x Programmer's Guide
• Packed data
    • use word access to operate on 16-bit data stored in the high and low parts of a 32-bit register

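Packed-data intrinsics such as _add2 treat one 32-bit register as two independent 16-bit lanes. The sketch below emulates that lane-wise addition in Python to show the idea; `add2` and `pack` are stand-in names written for this example, not the TI intrinsic itself.

```python
def pack(high, low):
    """Pack two 16-bit values into one 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def add2(src1, src2):
    """Add the high halves and the low halves independently.
    A carry out of the low half does not spill into the high half."""
    low = (src1 + src2) & 0xFFFF
    high = ((src1 >> 16) + (src2 >> 16)) & 0xFFFF
    return (high << 16) | low


simple = add2(pack(1, 2), pack(3, 4))            # lanes: 1+3 and 2+4
wrapped = add2(pack(0, 0xFFFF), pack(0, 1))      # low lane wraps to 0
```

One such operation does the work of two 16-bit additions, which is why packing 16-bit samples into words can halve the number of loads and adds in a loop.
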
Loop Unrolling

• A loop is a compact way of representing a repetitive sequence of instructions, but…
• The loop condition test is overhead
• To remove the loop overhead, unroll the loop (make copies of the loop code)
    • key way of exposing parallelism!
    • The compiler can now look across loop iterations to find parallel instructions
    • parallelism is increased, but so is code size

Example

    ; program A: code without unrolling
            MVK   4,B0
    loop:   LDH   *A5++,A0
    ||      LDH   *A6++,A1
            ADD   A0,A1,A2     ; add 4 times
            …
            SUB   B0,1,B0
      [B0]  B     loop

Example

    ; program B: code with unrolling once
            MVK   2,B0
    loop:   LDH   *A5++,A0
    ||      LDH   *A6++,A1
            ADD   A0,A1,A2     ; add first 2 numbers
            …
            LDH   *A5++,A0
    ||      LDH   *A6++,A1
            ADD   A0,A1,A2     ; add other 2 numbers
            …
            SUB   B0,1,B0
      [B0]  B     loop

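The same transformation can be shown in a high-level language. The sketch below mirrors the structure of programs A and B, not their exact register use: it keeps every sum in a list and counts loop tests so the saved overhead is visible. The function names are chosen for the example, and the unrolled version assumes an even number of elements.

```python
def add_rolled(x, y):
    """Program A style: one addition per trip, one loop test per addition."""
    out, tests = [], 0
    i = 0
    while i < len(x):
        out.append(x[i] + y[i])
        i += 1
        tests += 1                      # counter update and branch
    return out, tests


def add_unrolled(x, y):
    """Program B style: body copied once, two additions per loop test.
    Assumes len(x) is even."""
    out, tests = [], 0
    i = 0
    while i < len(x):
        out.append(x[i] + y[i])         # first copy of the body
        out.append(x[i + 1] + y[i + 1]) # second copy of the body
        i += 2
        tests += 1
    return out, tests


x, y = [1, 2, 3, 4], [10, 20, 30, 40]
rolled = add_rolled(x, y)
unrolled = add_unrolled(x, y)
```

Both versions produce the same sums, but the unrolled loop executes half as many tests, and its two body copies are independent and therefore candidates for parallel execution.
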
Software Pipelining

• Software pipelining
    • compiler technique (don’t confuse with h/w pipelining)
    • Schedule multiple iterations of a loop together to fill any empty cycles and maximize functional unit usage
    • compiler options: -O2, -O3

Software Pipelining

• The general idea of this optimization is to uncover long sequences of statements without branch statements
• Reorganize loops to interleave instructions from different iterations
    • Dependent instructions within a single loop iteration are then separated from one another by an entire loop body
    • Increases possibilities of scheduling

Software Pipelining

[Figure: iterations 0 to 4 overlapped in time; one software-pipelined iteration combines instructions taken from several of the original iterations]

Software Pipelining

• Advantage: yields shorter code than loop unrolling and uses fewer registers
• Software pipelining is crucial for VLIW processors
• Often, both software pipelining and loop unrolling are used

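A software-pipelined loop can be pictured as a loop whose body holds one stage from each of several iterations. The sketch below hand-pipelines a loop computing y[i] = gain × x[i], split into load, multiply, and store stages. It is an illustration written for these notes, not compiler output; the stage labels L, M, and S and the function name are assumptions.

```python
def scale_software_pipelined(x, gain):
    """Compute y[i] = gain * x[i] with the load, multiply, and store stages
    of three different iterations sharing each pass of the loop."""
    n = len(x)
    y = [None] * n
    loaded = None       # value loaded during the previous pass
    product = None      # product computed during the previous pass
    schedule = []       # which stages ran in each pass
    for t in range(n + 2):                  # 2 extra passes drain the pipeline
        stages = []
        if t >= 2:                          # store stage of iteration t - 2
            y[t - 2] = product
            stages.append(f"S{t - 2}")
        if 1 <= t <= n:                     # multiply stage of iteration t - 1
            product = gain * loaded
            stages.append(f"M{t - 1}")
        if t < n:                           # load stage of iteration t
            loaded = x[t]
            stages.append(f"L{t}")
        schedule.append(stages)
    return y, schedule


y, schedule = scale_software_pipelined([1, 2, 3, 4], gain=2)
```

The first two passes form the prologue, the last two the epilogue, and the passes in between form the kernel, where a store, a multiply, and a load from three different iterations have no dependency on each other within the pass and could be issued in parallel.
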
Interrupts

• A signal that causes the processor to suspend its current program and execute a special subroutine
    • interrupt service routine (ISR)
• Sources
    • On-chip peripherals
        • timers, serial ports
    • External
        • resets, external peripherals
    • Software interrupts
        • arithmetic exceptions (divide by zero, overflow)

Direct Memory Access

• Data transfer without intervention of the processor
    • memory and CPU
    • peripherals and CPU
• DMA channel:
    • source address
    • destination address
    • element count in a frame
    • number of frames in a block

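The four channel parameters are enough to describe a block copy. The sketch below models them as a small record and performs the transfer in software. It is only a model: the `DmaChannel` class, its field names, and the simple increment-by-one addressing are assumptions, and a real controller performs the copy in hardware while the CPU does other work.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    source: int           # index of the first source element
    destination: int      # index of the first destination element
    element_count: int    # elements per frame
    frame_count: int      # frames per block


def transfer_block(channel, src_mem, dst_mem):
    """Copy one block, frame by frame, and return the number of elements moved."""
    src, dst = channel.source, channel.destination
    moved = 0
    for _frame in range(channel.frame_count):
        for _element in range(channel.element_count):
            dst_mem[dst] = src_mem[src]
            src += 1
            dst += 1
            moved += 1
    return moved


src_mem = list(range(100, 112))
dst_mem = [0] * 8
channel = DmaChannel(source=2, destination=1, element_count=3, frame_count=2)
moved = transfer_block(channel, src_mem, dst_mem)
```

A block of 2 frames with 3 elements each moves 6 elements in total.
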
Block Processing
Ping-Pong Buffering

• Ping-pong buffer (double buffer)
    • DMA channel delivers N samples of data in and out of buffers while the DSP operates on data in the current buffer
    • Next block, roles of the buffers are changed

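The role swap can be sketched with two Python lists. In this model the fill step stands in for the DMA channel and the `process` callback stands in for the DSP; they run one after the other here, whereas in hardware they overlap. The function name and the choice to drop a trailing partial block are assumptions made for the example.

```python
def process_stream(samples, block_size, process):
    """Process `samples` block by block using two alternating buffers."""
    buffers = [[0] * block_size, [0] * block_size]
    results = []
    # Split the input into full blocks only.
    blocks = [samples[i:i + block_size]
              for i in range(0, len(samples) - block_size + 1, block_size)]
    if not blocks:
        return results
    filling = 0                                    # buffer the "DMA" writes next
    buffers[filling][:] = blocks[0]                # first fill, nothing to process yet
    for nxt in blocks[1:] + [None]:
        current = filling                          # buffer that was just filled
        filling ^= 1                               # swap the roles of the buffers
        if nxt is not None:
            buffers[filling][:] = nxt              # "DMA" fills the other buffer
        results.extend(process(buffers[current]))  # "DSP" works on the current one
    return results


doubled = process_stream([1, 2, 3, 4, 5, 6, 7, 8], 4,
                         lambda block: [2 * v for v in block])
```

Because the buffer being processed is never the one being filled, the processing code always sees a complete, stable block of N samples.
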