TI C6701 VLIW MIMD
Presentation Outline
Introduction / Overview
Differentiating Features
Assembly Syntax
Instruction Flow
Pipelining and Optimization
Conclusion
Introduction
TI’s C6000 family
VLIW architectures
Flexibility from Software
Characteristics Chart
Architecture                              VLIW
FPU                                       Yes
MFLOPs (Peak)                             1000
16x16 MACs (MMAC/s)                       334
8x8 MACs (MMAC/s)                         334
MIPS (Peak)                               1336
MOPS (Peak)                               336
Memory Bus Bandwidth (MB/s)               332
1K FP cfft (µsec)                         108
1K 16-bit cfft (µsec)                     108
1K FP dot product (µsec)                  3.07
1K 16-bit dot product (µsec)              3.07
512² x FP Conv3x3 (msec)                  7.11
512² x 8-bit Conv3x3 (msec)               7.11
512² x 8-bit Erosion/Dilation (msec)      3.62
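The peak MIPS and MAC figures in the chart are consistent with simple per-cycle arithmetic. The sketch below assumes a 167 MHz clock, which the chart does not state, together with the eight functional units and two multipliers described in this presentation. The function name is made up for illustration.

```python
# Assumed clock rate in MHz; this value is NOT given in the chart.
CLOCK_MHZ = 167


def peak_rates(clock_mhz, issue_width=8, multipliers=2):
    """Return (peak MIPS, peak MMAC/s) for a machine that can issue
    `issue_width` instructions and `multipliers` multiplies per cycle."""
    return clock_mhz * issue_width, clock_mhz * multipliers


mips, mmacs = peak_rates(CLOCK_MHZ)
print(mips, mmacs)  # 1336 334
```

Under that assumption, eight instructions per cycle gives the chart's 1336 MIPS and two multiplies per cycle gives its 334 MMAC/s.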
Basic Overview
Eight 32-bit instructions fetched per clock cycle, called a fetch packet
Two multipliers and six ALUs for execution
Two general-purpose register files (A and B)
Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
Two load-from-memory data paths per register file (LD1a, LD1b, LD2a, LD2b)
Two data address paths (DA1 and DA2)
Two register file data cross paths (1X and 2X)
Architecture Overview
Differentiating Features
The features that differentiate the TI part from other VLIW architectures are:
1. Execute packets (the "long instructions") of varied length, from one to eight 32-bit instructions
2. Predication in all instructions
3. Pipelining of the branch functions
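Predication means an instruction names a condition register and only takes effect if the test on that register passes. A minimal Python model of that behavior follows. The function name, the dictionary-based register file, and the `"B0"` / `"!B0"` condition strings are illustrative choices, not TI tooling.

```python
def execute_predicated(regs, cond, op):
    """Run op(regs) only if the condition holds.

    cond is None (unconditional), "B0" (run if B0 != 0),
    or "!B0" (run if B0 == 0). Returns True if op ran.
    """
    if cond is not None:
        negate = cond.startswith("!")
        value = regs[cond.lstrip("!")]
        if (value != 0) == negate:
            return False  # condition failed: the instruction acts as a no-op
    op(regs)
    return True


def add_a1_a2(regs):
    # Stand-in for "ADD A1, A2, A3"
    regs["A3"] = regs["A1"] + regs["A2"]


regs = {"A1": 5, "A2": 7, "A3": 0, "B0": 1}
execute_predicated(regs, "B0", add_a1_a2)   # B0 is nonzero, so the add runs
execute_predicated(regs, "!B0", add_a1_a2)  # B0 is nonzero, so this one is skipped
```

Because both outcomes occupy the same slot in the schedule, short conditionals can be written without a branch.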
Assembly Syntax
Label
Parallel Bars
Conditions
Instruction
Functional Unit
Operands
Comments
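To make the field order concrete, here is a small Python sketch that splits one assembly line into the fields listed above. The regular expression is a simplification written for this illustration, not the assembler's real grammar, and the sample line and its label are made up.

```python
import re

# Simplified pattern: label, parallel bars, condition, instruction,
# functional unit, operands, comment (in that order, all but the
# instruction optional).
LINE_RE = re.compile(
    r"""^\s*
    (?:(?P<label>[A-Za-z_]\w*):)?\s*
    (?P<parallel>\|\|)?\s*
    (?:\[(?P<cond>!?\s*\w+)\])?\s*
    (?P<mnemonic>[A-Za-z]\w*)\s*
    (?P<unit>\.[LSMD][12]X?)?\s*
    (?P<operands>[^;]*?)\s*
    (?:;(?P<comment>.*))?$
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Split one assembly line into its syntactic fields."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"cannot parse: {line!r}")
    f = m.groupdict()
    return {
        "label": f["label"],
        "parallel": f["parallel"] is not None,
        "condition": f["cond"],
        "mnemonic": f["mnemonic"],
        "unit": f["unit"],
        "operands": [o.strip() for o in f["operands"].split(",") if o.strip()],
        "comment": f["comment"].strip() if f["comment"] is not None else None,
    }


# Hypothetical line: a conditional add on .L1, issued in parallel
# with the previous instruction.
fields = parse_line("LOOP: || [B0] ADD .L1 A1,A2,A3 ; accumulate")
print(fields["mnemonic"], fields["unit"], fields["operands"])
```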
Assembly Example
Instruction Flow
Eight functional units, in two separate groups of four
Each group has its own data path and its own half of the general-purpose registers
The paired units are named .L1 and .L2, .M1 and .M2, .S1 and .S2, and .D1 and .D2
The .L units are responsible for
1. Logical operations
2. Data packing and unpacking
3. Some arithmetic.
Instruction Flow
32 General Purpose Registers
64-bit loads using the LDDW instruction
LD1a carries the least-significant 32 bits and LD1b the most-significant 32 bits
The .D units are cross-connected, so data can be loaded into or stored from either register file, regardless of which side generated the data address
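The split of a 64-bit load into two 32-bit halves can be sketched in Python. This only models the arithmetic of the split; the function name is hypothetical, and the mapping of the halves onto a particular register pair is left out.

```python
def split_doubleword(value):
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    low = value & 0xFFFFFFFF   # half carried on the 'a' load path
    high = value >> 32         # half carried on the 'b' load path
    return low, high


low, high = split_doubleword(0x1122334455667788)
print(hex(low), hex(high))  # 0x55667788 0x11223344
```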
Instruction Flow
Fetch packets are aligned on 256-bit (eight-word) boundaries
Important! An execute packet can’t cross a fetch packet boundary
Execute packets are formed from the p-bit, the least-significant bit (bit 0) of each instruction: when it is set, the next instruction executes in parallel with this one
A maximum of eight instructions execute in parallel
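The grouping rule can be sketched in Python: walk the eight words of a fetch packet and close an execute packet whenever an instruction's p-bit (bit 0) is clear. The function name and the sample words are invented for illustration; only bit 0 of each word carries meaning here.

```python
def split_execute_packets(fetch_packet):
    """Group the eight 32-bit words of a fetch packet into execute packets.

    Bit 0 of each word is the p-bit: 1 means the NEXT instruction runs
    in parallel with this one, 0 means this instruction ends the packet.
    """
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds exactly eight instructions")
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:
            packets.append(current)
            current = []
    if current:
        # The last p-bit was set, so the packet would spill into the
        # next fetch packet, which is not allowed.
        raise ValueError("execute packet crosses the fetch packet boundary")
    return packets


# Made-up opcode words; only the p-bit pattern 1,1,0,0,1,0,0,0 matters.
p_bits = [1, 1, 0, 0, 1, 0, 0, 0]
words = [(i << 1) | p for i, p in enumerate(p_bits)]
print([len(ep) for ep in split_execute_packets(words)])  # [3, 1, 2, 1, 1]
```

With all p-bits clear the fetch packet runs as eight serial instructions; with the first seven set it runs as one fully parallel execute packet.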
Architecture Overview
Pipelining & Optimization
The C6701 has no hardware to look ahead and schedule instructions; the schedule is fixed in the code
The number of instructions in each execute packet is the key to optimizing the code
The number of extra clock cycles before an instruction’s result is available is called its number of delay slots
Multi-cycle instructions such as multiplies, loads, and branches have one or more delay slots, which must be accounted for when scheduling
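As a rough illustration, taking delay slots to mean the extra cycles after issue before a result can be used, the cycle in which a result becomes usable can be computed as below. The table lists the commonly documented values for a few C6000 instructions and should be checked against the CPU reference guide; the function name is hypothetical.

```python
# Delay slots for a few instructions (assumed values; verify against
# the TI CPU and instruction set reference before relying on them).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5}


def result_ready_cycle(issue_cycle, mnemonic):
    """First cycle in which a later instruction can use the result."""
    return issue_cycle + 1 + DELAY_SLOTS[mnemonic]


print(result_ready_cycle(0, "ADD"))  # 1
print(result_ready_cycle(0, "MPY"))  # 2
print(result_ready_cycle(0, "LDW"))  # 5
```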
Pipelining & Optimization
An execute packet may contain NOPs
Filling the delay slots of a multi-cycle instruction with NOPs ensures that the next execute packet can use that instruction’s result
If a cross path is used during a multi-cycle instruction, that cross path can’t be used again until the instruction has finished
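A toy scheduler shows how NOP padding follows from delay slots. It takes a strictly serial program (one instruction per execute packet) and inserts the NOPs needed before each instruction whose source is not ready yet. The delay-slot values are the commonly documented ones and are an assumption here, and the function and tuple layout are invented for this sketch. A real schedule would fill those slots with useful work where possible.

```python
# Assumed delay slots; verify against the TI CPU reference guide.
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4}


def pad_with_nops(program):
    """program: list of (mnemonic, dest, sources), one per execute packet.

    Returns the mnemonics in order, with "NOP n" inserted wherever an
    instruction would otherwise read a register before it is ready.
    """
    out = []
    ready = {}   # register name -> first cycle its new value is usable
    cycle = 0
    for mnemonic, dest, sources in program:
        earliest = max([ready.get(s, 0) for s in sources] + [cycle])
        wait = earliest - cycle
        if wait > 0:
            out.append(f"NOP {wait}")
            cycle += wait
        out.append(mnemonic)
        ready[dest] = cycle + 1 + DELAY_SLOTS[mnemonic]
        cycle += 1
    return out


program = [
    ("LDW", "A4", ["A3"]),        # load a word into A4
    ("MPY", "A5", ["A4", "A4"]),  # needs the loaded value
    ("ADD", "A6", ["A5", "A6"]),  # needs the product
]
print(pad_with_nops(program))  # ['LDW', 'NOP 4', 'MPY', 'NOP 1', 'ADD']
```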
Execution Pipeline
AD vs. TI vs. Motorola
Conclusion
The C6701 allows scheduling of
instructions in the assembly code
Unfortunately, a good understanding of
the hardware is still necessary to be
able to schedule instructions in an
optimized way
Thank You