Transcript c6x

Intro to the “c6x” VLIW processor
● Texas Instruments TMS320C6000 series
● TMS320C6700 subseries – includes floating point
● VLIW = Very Long Instruction Word
Operations in Parallel
[Figure: registers feeding a row of function units, with bypassing]
Non-orthogonal
[Figure: register files A and B connected through bypass logic to function units L1, S1, M1, D1, L2, S2, M2, D2]
*** See TI's picture ***
Specialized Function Units
● L units: arithmetic, compare, and logical ops
● S units: arithmetic, logical, branches, constant generation
● M units: multiplies
● D units: address generation / memory accesses
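Since each unit is a separate piece of hardware, a unit can take at most one instruction per cycle. A minimal Python sketch of that check follows. The function name `unit_conflicts`, the "opcode.UNIT" string notation, and the sample instructions are illustrative assumptions, not taken from the slides or from TI's tools.

```python
from collections import Counter

# The eight function units named on the slides.
UNITS = {"L1", "S1", "M1", "D1", "L2", "S2", "M2", "D2"}

def unit_conflicts(packet):
    """Return the units claimed by more than one instruction in a packet."""
    used = Counter()
    for instruction in packet:
        # Assumed notation: "opcode.UNIT", e.g. "mvk.S1".
        _, unit = instruction.split(".")
        if unit not in UNITS:
            raise ValueError(f"unknown function unit: {unit}")
        used[unit] += 1
    return sorted(unit for unit, count in used.items() if count > 1)

print(unit_conflicts(["add.L1", "mpy.M1", "mvk.S1", "mvk.S2"]))  # no unit reused
print(unit_conflicts(["add.L1", "mvk.S1", "mvk.S1"]))            # S1 claimed twice
```

An empty result means the instructions could share a cycle as far as unit usage goes; the sketch ignores every other kind of resource limit.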
Complicated hardware
[Figure: two register files]
Explicit parallelism
[Figure: two register files]
Simple VLIW encoding
● Slots that cannot be utilized are filled with no-ops
● Bad for code density, cache utilization, energy, ...
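To put a number on the code-density cost, here is a minimal Python sketch. It assumes a hypothetical 8-slot machine where every cycle is encoded as a full 8-slot word; the name `nop_fraction` and the sample schedule are made up for illustration.

```python
SLOTS_PER_WORD = 8  # assumed issue width for this illustration

def nop_fraction(ops_per_cycle):
    """Fraction of encoded slots holding no-ops in a fixed-width VLIW encoding."""
    total_slots = SLOTS_PER_WORD * len(ops_per_cycle)
    useful = sum(ops_per_cycle)
    return (total_slots - useful) / total_slots

# Hypothetical schedule: number of useful operations issued in each cycle.
schedule = [2, 1, 3, 2]
print(nop_fraction(schedule))  # 24 of the 32 encoded slots are no-ops
```

With only 8 useful operations in 4 cycles, three quarters of the fetched bits carry no work, which is what hurts the cache and the energy budget.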
C6X: Packets
● One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”)
● Any slot can go to any function unit
[Figure: eight instruction slots with example parallel bits 0 1 0 1 1 1 1 1]
● Packet cannot cross an 8-word boundary
● Resources constrain which instructions can be combined in the same packet
● You can branch into the middle of a packet!
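The parallel-bit rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `split_execute_packets` and the sample bit pattern are hypothetical, and the sketch only groups instruction indices, it does not decode real C6x instruction words.

```python
FETCH_PACKET_SIZE = 8  # instructions per 8-word fetch packet

def split_execute_packets(p_bits):
    """Group the instruction indices of one fetch packet into parallel packets."""
    if len(p_bits) != FETCH_PACKET_SIZE:
        raise ValueError("expected one parallel bit per instruction slot")
    packets, current = [], []
    for index, p in enumerate(p_bits):
        current.append(index)
        if p == 0:  # 0 = end of packet (EOP)
            packets.append(current)
            current = []
    if current:
        # A trailing 1 would chain into the next fetch packet, which the
        # 8-word boundary rule forbids.
        raise ValueError("packet would cross the 8-word boundary")
    return packets

# Hypothetical bit pattern, one bit per instruction in fetch order.
print(split_execute_packets([1, 1, 0, 1, 0, 0, 1, 0]))
```

For this pattern the eight instructions issue over four cycles: three together, then two, then one, then two. No slot is spent on a no-op.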
Explicit scheduling
Delay slots must be respected – no HW interlocks or scoreboarding
Multiply – 1 delay slot
Load – 4 delay slots
Branch – 5 delay slots
Right:
    B5 := B3 * B2
    <nop>
    B7 := B5 + B1
Wrong:
    B5 := B3 * B2
    B7 := B5 + B1
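What goes wrong without interlocks can be shown with a toy simulator. The Python sketch below is an assumption-laden model, not TI's pipeline: it issues one instruction per cycle, reads operands at issue time, and makes a result visible only after the instruction's delay slots. The initial register values and the names `run`, `right`, and `wrong` are invented for the example.

```python
# Multiply delay slot count is from the slide; "add" and "nop" are assumed to have none.
DELAY_SLOTS = {"mul": 1, "add": 0, "nop": 0}

def run(program, initial_regs):
    """Issue one instruction per cycle; results land only after the delay slots."""
    regs = dict(initial_regs)
    pending = []  # (first cycle the result is visible, destination, value)
    for cycle, (op, dst, src1, src2) in enumerate(program):
        # Commit every result whose delay slots have elapsed.
        for ready, reg, value in [p for p in pending if p[0] <= cycle]:
            regs[reg] = value
        pending = [p for p in pending if p[0] > cycle]
        if op == "nop":
            continue
        # Operands are read at issue time: no interlock waits for a pending write.
        a, b = regs[src1], regs[src2]
        result = a * b if op == "mul" else a + b
        pending.append((cycle + DELAY_SLOTS[op] + 1, dst, result))
    for _, reg, value in sorted(pending):  # drain writes still in flight
        regs[reg] = value
    return regs

start = {"B1": 10, "B2": 3, "B3": 4, "B5": 100, "B7": 0}
right = [("mul", "B5", "B3", "B2"), ("nop", None, None, None), ("add", "B7", "B5", "B1")]
wrong = [("mul", "B5", "B3", "B2"), ("add", "B7", "B5", "B1")]
print(run(right, start)["B7"])  # 22: the add sees the product 12
print(run(wrong, start)["B7"])  # 110: the add still sees the stale B5 = 100
```

The wrong sequence does not stall or fault; it silently computes with the old value of B5, which is why the scheduler has to respect the delay slot.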
Predicated execution
Why? To get rid of branches (5 delay slots * 8 wide ....)
Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
[Condition registers: A1, A2, B0, B1, B2]
Example:
    if (B3 < B4)
        B3++
    else
        B4++
Predicated execution
With branches:
        cmp B3, B4
        bge L2
        <nop>
        B3 := B3+1
        b DONE
        <nop>
L2:
        B4 := B4+1
DONE:
With predicates:
        cmplt B3, B4 → B0
        [B0] B3 := B3+1
        [!B0] B4 := B4+1
...and the last two can be issued in parallel!
Control dependency has been converted to data dependency...
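The equivalence of the two versions can be checked with a small Python sketch. It models the predicate as an ordinary value that decides whether each update is kept; the function names `with_branches` and `with_predicates` are illustrative only.

```python
def with_branches(b3, b4):
    """Control flow picks which increment runs."""
    if b3 < b4:
        b3 += 1
    else:
        b4 += 1
    return b3, b4

def with_predicates(b3, b4):
    """Both increments are issued; the predicate decides which one takes effect."""
    b0 = 1 if b3 < b4 else 0            # cmplt B3, B4 -> B0
    new_b3 = b3 + 1 if b0 else b3       # [B0]  B3 := B3+1
    new_b4 = b4 + 1 if not b0 else b4   # [!B0] B4 := B4+1
    return new_b3, new_b4

for pair in [(1, 2), (2, 1), (3, 3)]:
    print(pair, with_branches(*pair), with_predicates(*pair))
```

The two predicated updates read only B0 and their own register, so neither depends on the other. That independence is what lets them share a cycle.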
Assembly details
        .text
        .align 32
        .global proc
proc:
        mvk     4, b3
||      mvk     5, b4
        cmpgt   b3, b4, b0
 [ b0]  mvk.S2  9, b5
 [!b0]  mvk.S1  8, a5
        stw     a5, *-a15[4]
        .....
Fetch/execute pipeline
PG generate program address
PS program address send
PW program memory access
PR fetch reaches CPU boundary
DP instruction dispatch
DC instruction decode
E1 execute 1
E2 execute 2
E3 execute 3
E4 execute 4
E5 execute 5
Addressing Modes
Addressing mode     C equivalent
*R                  (*R)
*+R[ucst5]          (R[ucst5])
*-R[ucst5]          (R[-ucst5])
*+R[offsetR]        (R[offsetR])
*-R[offsetR]        (R[-offsetR])
Special case: 15b offsets:
*+B15[ucst15]
*+B14[ucst15]
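As the C equivalents suggest, a bracketed offset counts elements, not bytes, so it is scaled by the access size. A minimal Python sketch of the effective-address arithmetic follows; the name `effective_address`, the size labels, and the example base address are assumptions made for illustration.

```python
# Bytes per element for each access width.
ELEMENT_SIZE = {"byte": 1, "halfword": 2, "word": 4}

def effective_address(base, offset, width="word", negative=False):
    """Address for *+R[offset] (or *-R[offset] when negative is True)."""
    scaled = offset * ELEMENT_SIZE[width]
    return base - scaled if negative else base + scaled

# A word access through *-R[4] with R = 0x1000 (hypothetical value).
print(hex(effective_address(0x1000, 4, "word", negative=True)))  # 0xff0
```

This mirrors C pointer arithmetic: with a word-sized element type, `R[-4]` sits 16 bytes below `R`.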
Addressing Modes
Pre/post increment/decrement
*++R , *R++
*++R[ucst5], *R++[ucst5]
*--R[ucst5], *R--[ucst5]
*++R[offsetR], *R++[offsetR]
*--R[offsetR], *R--[offsetR]
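The pre and post forms differ only in whether the base register is updated before or after the address is used. Below is a minimal Python sketch of that difference. It assumes word-sized accesses and treats a missing bracket as an offset of 1; the function name `access` and the register contents are hypothetical.

```python
def access(regs, base, offset=1, pre=True, increment=True, size=4):
    """Return the address used by the access and update the base register."""
    delta = offset * size if increment else -offset * size
    if pre:
        regs[base] += delta       # *++R[offset] / *--R[offset]: update, then use
        return regs[base]
    address = regs[base]          # *R++[offset] / *R--[offset]: use, then update
    regs[base] += delta
    return address

regs = {"A4": 0x100}
print(hex(access(regs, "A4", pre=True)), hex(regs["A4"]))   # *++A4
print(hex(access(regs, "A4", pre=False)), hex(regs["A4"]))  # *A4++
```

Either way the register ends up changed, unlike the plain `*+R[...]` and `*-R[...]` forms, which leave the base untouched.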
Resources
http://www.cs.cmu.edu/~tcal/15745/