TI C6701 VLIW MIMD

Presentation Outline
• Introduction / Overview
• Differentiating Features
• Assembly Syntax
• Instruction Flow
• Pipelining and Optimization
• Conclusion

Introduction
• TI’s C6000 family
• VLIW architectures
• Flexibility from software

Characteristics Chart
Characteristic                          Value
Architecture                            VLIW
FPU                                     Yes
MFLOPS (Peak)                           1000
16x16 MACs (MMAC/s)                     334
8x8 MACs (MMAC/s)                       334
MIPS (Peak)                             1336
MOPS (Peak)                             336
Memory Bus Bandwidth (MB/s)             332
1K FP cfft (µsec)                       108
1K 16 bit cfft (µsec)                   108
1K FP dot product (µsec)                3.07
1K 16 bit dot product (µsec)            3.07
512² x FP Conv3x3 (msec)                7.11
512² x 8 bit Conv3x3 (msec)             7.11
512² x 8 bit Erosion/Dilation (msec)    3.62
Basic Overview
• Eight 32-bit instructions are fetched per clock cycle; together they are called a fetch packet
• Two multipliers and six ALUs for execution
• Two general-purpose register files (A and B)
• Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
• Two load-from-memory data paths per register file (LD1a, LD1b, LD2a, LD2b)
• Two data address paths (DA1 and DA2)
• Two register file data cross paths (1X and 2X)
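As a rough cross-check, the peak figures in the characteristics chart can be related to the unit counts listed here. The slides do not state a clock rate, so the sketch below infers one from the peak MIPS figure and the eight-instruction fetch packet; the function names are made up for illustration.

```python
# Assumption: the clock rate is not given in the slides. It is inferred here
# from the chart's peak MIPS figure and one eight-instruction fetch packet
# per clock cycle.
PEAK_MIPS = 1336             # from the characteristics chart
INSTRUCTIONS_PER_CYCLE = 8   # one fetch packet per cycle
MULTIPLIERS = 2              # two multipliers


def implied_clock_mhz(peak_mips, per_cycle):
    """Clock rate implied by issuing `per_cycle` instructions every cycle."""
    return peak_mips / per_cycle


def peak_mmacs(clock_mhz, multipliers):
    """Peak multiply-accumulates per second (in millions), one per multiplier per cycle."""
    return clock_mhz * multipliers


clock = implied_clock_mhz(PEAK_MIPS, INSTRUCTIONS_PER_CYCLE)
print(clock)                            # 167.0
print(peak_mmacs(clock, MULTIPLIERS))   # 334.0
```

Under that assumption the implied clock reproduces the 334 MMAC/s entry in the chart.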
Architecture Overview
Differentiating Features

The features that differentiate the TI C6701 from other VLIW architectures are:
1. Execute packets of varied length (the instructions themselves are always 32 bits)
2. Predication on all instructions
3. Pipelining of branch instructions
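Predication means an instruction names a condition register and only takes effect when the condition holds. The following is a minimal sketch of that behaviour for a single add, assuming the usual convention that [B0] executes when B0 is nonzero and [!B0] executes when B0 is zero; the function and the dictionary-based register file are hypothetical, not TI code.

```python
def predicated_add(regs, cond, dst, src1, src2):
    """Perform dst = src1 + src2 only if the condition holds.

    cond is None (always execute), "B0" (execute if B0 != 0),
    or "!B0" (execute if B0 == 0). Returns True if the add executed.
    """
    if cond is not None:
        negate = cond.startswith("!")
        value = regs[cond.lstrip("!")]
        if (value != 0) == negate:
            return False  # condition failed: the instruction has no effect
    regs[dst] = (regs[src1] + regs[src2]) & 0xFFFFFFFF  # 32-bit wraparound
    return True


regs = {"A1": 2, "A2": 3, "A3": 0, "B0": 0}
print(predicated_add(regs, "B0", "A3", "A1", "A2"), regs["A3"])   # False 0
print(predicated_add(regs, "!B0", "A3", "A1", "A2"), regs["A3"])  # True 5
```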
Assembly Syntax
• Label
• Parallel bars
• Conditions
• Instruction
• Functional Unit
• Operands
• Comments
Assembly Example
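The fields listed under Assembly Syntax appear on a source line in that order. The sketch below splits one made-up line into those fields; the regular expression is a simplified illustration that handles only the form `label: || [cond] MNEMONIC .unit operands ; comment`, and is not the TI assembler's actual grammar.

```python
import re

# Hypothetical pattern for illustration only; every field except the
# instruction mnemonic is optional.
LINE = re.compile(
    r"^\s*(?:(?P<label>\w+):)?\s*"      # optional label
    r"(?P<parallel>\|\|)?\s*"           # optional parallel bars
    r"(?:\[(?P<cond>!?\w+)\])?\s*"      # optional condition, e.g. [B0] or [!B0]
    r"(?P<instr>\w+)\s*"                # instruction mnemonic
    r"(?P<unit>\.\w+)?\s*"              # optional functional unit, e.g. .L1
    r"(?P<operands>[^;]*?)\s*"          # operands, up to any comment
    r"(?:;\s*(?P<comment>.*))?$"        # optional comment
)


def parse_line(text):
    """Return the syntax fields of one assembly line as a dict."""
    match = LINE.match(text)
    if match is None:
        raise ValueError("unrecognised line: " + text)
    fields = match.groupdict()
    fields["parallel"] = fields["parallel"] is not None
    return fields


# Made-up line that exercises every field.
example = "loop: || [!B0] ADD .L1 A1,A2,A3 ; A3 = A1 + A2 when B0 is zero"
for name, value in parse_line(example).items():
    print(name, "=", value)
```

The parallel bars mark an instruction that executes in the same cycle as the one before it, and the condition in square brackets is the predicate.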
Instruction Flow



• Eight functional units, in two separate groups of four
• Each group has its own data path and its own half of the general-purpose registers
• The units in the two groups are named .L1 and .L2, .M1 and .M2, .S1 and .S2, and .D1 and .D2
• The .L units are responsible for:
1. Logical operations
2. Data packing and unpacking
3. Some arithmetic
Instruction Flow
• 32 general-purpose registers
• 64-bit operations using the LDDW instruction
• LD1a carries the least-significant 32 bits and LD1b carries the most-significant 32 bits
• The .D units are connected to both register files, so data can go to or come from either file regardless of which file supplied the data address
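To make the split between the two load paths concrete, the sketch below divides a 64-bit value into the halves that LD1a and LD1b would carry. It only models the arithmetic; the function name is hypothetical.

```python
def split_doubleword(value):
    """Split a 64-bit value into (low, high) 32-bit halves.

    low  corresponds to the least-significant 32 bits (the LD1a path),
    high corresponds to the most-significant 32 bits (the LD1b path).
    """
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    low = value & 0xFFFFFFFF
    high = value >> 32
    return low, high


low, high = split_doubleword(0x0123456789ABCDEF)
print(hex(low), hex(high))  # 0x89abcdef 0x1234567
```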

Instruction Flow
• Fetch packets are aligned on 256-bit boundaries
• Important! An execute packet can’t cross a fetch packet boundary
• The execute packet for parallel instructions is formed by examining the least-significant bit of each instruction (the p-bit)
• A maximum of eight instructions execute in parallel
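The grouping rule can be sketched in a few lines. The code below assumes the usual C6000 convention that a p-bit of 1 means "the next instruction executes in parallel with this one", so a p-bit of 0 closes the current execute packet; the instruction words in the example are arbitrary values, not real encodings.

```python
def execute_packets(fetch_packet):
    """Group the eight 32-bit words of a fetch packet into execute packets.

    Assumed convention: bit 0 of each word is the p-bit, and a p-bit of 1
    chains the following instruction into the same execute packet.
    """
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds eight instructions")
    if fetch_packet[-1] & 1:
        # An execute packet cannot cross the fetch packet boundary.
        raise ValueError("the last p-bit in a fetch packet must be 0")
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p-bit clear: this instruction ends the packet
            packets.append(current)
            current = []
    return packets


# Arbitrary words; only bit 0 (the p-bit) matters for the grouping.
words = [0x11, 0x21, 0x30, 0x40, 0x51, 0x60, 0x70, 0x80]
print([len(p) for p in execute_packets(words)])  # [3, 1, 2, 1, 1]
```

A fetch packet whose first seven p-bits are all 1 yields a single execute packet of eight instructions, which is the fully parallel case.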

Architecture Overview
Pipelining & Optimization




• The C6701 has no hardware to look ahead and schedule instructions
• The number of instructions in each execute packet is the key to optimizing the code
• The number of extra cycles an instruction needs before its result is available is called its number of delay slots
• Multi-cycle instructions have more delay slots, which significantly affects how the instructions that follow them must be scheduled
Pipelining & Optimization



• An execute packet may contain NOPs
• Issuing NOPs while a multi-cycle instruction completes ensures that the next execute packet can use that instruction’s result
• If a multi-cycle instruction uses a cross path, that cross path can’t be used again until the instruction has finished
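The NOP padding described here follows directly from the delay-slot count. The sketch below computes how many NOP cycles are still needed once some independent work has been scheduled after the producing instruction. The delay-slot values in the table are the ones commonly documented for the C6000 family and should be treated as assumptions here, as should the function name.

```python
# Assumed delay-slot counts (commonly documented C6000 values; not taken
# from the slides).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5}


def nops_needed(producer, independent_cycles=0):
    """NOP cycles required between `producer` and the first use of its result.

    independent_cycles is the number of execute packets of unrelated work
    already scheduled between the producer and its consumer.
    """
    return max(0, DELAY_SLOTS[producer] - independent_cycles)


print(nops_needed("LDW"))      # 4
print(nops_needed("LDW", 3))   # 1
print(nops_needed("MPY", 2))   # 0
```

Filling delay slots with useful instructions instead of NOPs is what drives the count toward zero.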
Execution Pipeline
AD vs. TI vs. Motorola
Conclusion
• The C6701 allows scheduling of instructions in the assembly code
• Unfortunately, a good understanding of the hardware is still necessary to schedule instructions in an optimized way
• Thank You
