PPT - ECE For You

MODULE 2
Syllabus
• Fixed and floating point formats
• Code improvement
• Constraints
• TMS320C64x CPU
• Simple programming examples using C/assembly
Fixed point numbers
• Fast and inexpensive implementation
• Limited in the range of numbers
• Susceptible to problems of overflow
• In a fixed-point processor, numbers are represented in
integer format.
• Fixed-point numbers and their data types are
characterized by their word size in bits, the position of
the binary point, and whether they are signed or unsigned.
• The dynamic range of an N-bit number in 2's-complement representation is from $-2^{N-1}$ to $2^{N-1}-1$, or from -32,768 to 32,767 for a 16-bit system.
• By normalizing the dynamic range to lie between -1 and 1, the
range is divided into $2^{N}$ sections, each of size $2^{-(N-1)}$,
starting at -1 and ending at $1 - 2^{-(N-1)}$.
• For a 4-bit system, there would be 16 sections, each of
size 1/8, from -1 to 7/8 .
• In unsigned integer format, the stored 16-bit number can take on any
integer value from 0 to 65,535.
• Signed integer format uses two's complement to allow negative
numbers; it ranges from -32,768 to 32,767.
• Unsigned fraction notation spreads the 65,536 levels uniformly
between 0 and 1.
• Signed fraction format allows negative numbers, equally spaced
between -1 and 1.
• The 4-bit unsigned numbers represent a modulo (mod)
16 system.
• If 1 is added to the largest number (15), the operation
wraps around to give 0 as the answer.
• A number wheel graphically demonstrates the addition
properties of a finite bit system.
• Addition procedure
– 1. Find the first number x on the wheel.
– 2. Step off y units in the clockwise direction, which brings you to
the answer.
• Examples on the 4-bit wheel: 15 + 1 = 0 (unsigned) and 6 + (-2) = 4 (signed).
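The wheel procedure can be sketched in C as masking the sum to 4 bits. This is a minimal illustration, not code from the slides; the function names `wheel_add_u4`, `wheel_to_signed` and `wheel_from_signed` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Step y places clockwise from x on a 4-bit wheel (addition modulo 16). */
uint8_t wheel_add_u4(uint8_t x, uint8_t y)
{
    assert(x < 16 && y < 16);
    return (uint8_t)((x + y) & 0x0F);
}

/* Read a 4-bit pattern as a two's-complement value in -8..7. */
int wheel_to_signed(uint8_t bits)
{
    assert(bits < 16);
    return (bits & 0x08) ? (int)bits - 16 : (int)bits;
}

/* Encode a value in -8..7 as its 4-bit two's-complement pattern. */
uint8_t wheel_from_signed(int v)
{
    assert(v >= -8 && v <= 7);
    /* Conversion to unsigned is modulo 2^32, so -2 becomes ...1110. */
    return (uint8_t)((unsigned)v & 0x0Fu);
}
```

With these helpers, `wheel_add_u4(15, 1)` wraps to 0, and adding the patterns for 6 and -2 gives the pattern for 4, matching the two wheel examples.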
Carry and Overflow
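• Carry applies to unsigned numbers: when an addition or
subtraction carries out of the most significant bit, the result is incorrect.
• Overflow applies to signed numbers: when an addition or
subtraction falls outside the representable range, the sign bit is wrong and the result is incorrect.
Examples:
Overflow (5-bit signed; the leftmost bit is the sign bit):
  01111 +
  00111
  -----
  10110     (15 + 7 should be 22, but the sign bit is set)
Carry (3-bit unsigned):
   100 +
   111
  ----
  1011      (4 + 7 = 11; the leading 1 is the carry out of the 3-bit result)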
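As a rough sketch of how the two flags can be computed in C, the helper below adds two n-bit patterns and reports carry and overflow. The type `add_flags` and the function `add_nbit` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sum;   /* result truncated to nbits */
    int carry;      /* 1 if a bit was carried out of the top position */
    int overflow;   /* 1 if the signed result has the wrong sign */
} add_flags;

/* Add two nbits-wide patterns (2 <= nbits <= 16) and report both flags. */
add_flags add_nbit(uint32_t a, uint32_t b, unsigned nbits)
{
    assert(nbits >= 2 && nbits <= 16);
    uint32_t mask = (1u << nbits) - 1u;
    uint32_t sign = 1u << (nbits - 1u);
    assert(a <= mask && b <= mask);

    uint32_t wide = a + b;              /* cannot wrap: both inputs < 2^16 */
    add_flags r;
    r.sum = wide & mask;
    r.carry = (int)((wide >> nbits) & 1u);
    /* Overflow: operands share a sign bit and the result's sign differs. */
    r.overflow = ((~(a ^ b) & (a ^ r.sum)) & sign) != 0u;
    return r;
}
```

Calling `add_nbit(0x0F, 0x07, 5)` reproduces the overflow example (sum 10110, overflow set, no carry), and `add_nbit(4, 7, 3)` reproduces the carry example (3-bit sum 011 with carry set).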
Fractional Fixed-Point Representation
• Rather than using the integer values just
discussed, a fractional fixed-point number
with values between -1 and +0.99... (just below +1)
can be used.
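A common 16-bit instance of this fractional format is often called Q15 (1 sign bit, 15 fractional bits). The sketch below assumes that format; the helper names are hypothetical, and the saturation and rounding choices are one reasonable option rather than the only one.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value to Q15, saturating outside [-1, 1). */
int16_t float_to_q15(float x)
{
    assert(x == x);                       /* reject NaN */
    float scaled = x * 32768.0f;          /* 2^15 steps per unit */
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    /* Round to nearest; the cast itself truncates toward zero. */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert a Q15 value back to a real value in [-1, 1). */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Multiply two Q15 values: the 32-bit product is Q30, so shift back by 15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    /* Assumes arithmetic right shift for negative values, which is
       implementation-defined in C but usual on mainstream compilers. */
    p >>= 15;
    if (p > INT16_MAX) p = INT16_MAX;     /* only -1 * -1 reaches +1 */
    return (int16_t)p;
}
```

For example, 0.5 maps to 16384, and `q15_mul(16384, 16384)` gives 8192, which is 0.25 in the same format.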
Data types
1. Short:
it is of size 16 bits, represented as 2's complement, with
a range from $-2^{15}$ to $2^{15}-1$
2. Int or signed int:
it is of size 32 bits, represented as 2's complement, with
a range from $-2^{31}$ to $2^{31}-1$
3. Float:
it is of size 32 bits, represented as IEEE 32 bit, with a
range from $2^{-126}$ ($1.175494 \times 10^{-38}$) to $2^{+128}$
($3.40282346 \times 10^{38}$)
4. Double:
it is of size 64 bits, represented as IEEE 64 bit, with a
range from $2^{-1022}$ ($2.22507385 \times 10^{-308}$) to
$2^{1024}$ ($1.79769313 \times 10^{308}$)
Floating-point representation
•The advantage over fixed-point representation is that
it can support a much wider range of values.
• The floating-point format needs slightly more storage
• The speed of floating-point operations is measured in
FLOPS.
General format of a floating-point number:
$X = M \cdot b^{e}$
where M is the value of the significand (mantissa),
b is the base, and
e is the exponent.
Mantissa determines the accuracy of the number
Exponent determines the range of numbers that can be
represented
Floating-point numbers can be represented as:
Single precision:
• called "float" in the C language family
• it is a binary format that occupies 32 bits
• its significand has a precision of 24 bits
Double precision:
• called "double" in the C language family
• it is a binary format that occupies 64 bits
• its significand has a precision of 53 bits
Single Precision (SP):
Bit:    31 | 30 ... 23 | 22 ... 0
Field:  s  |     e     |    f
Bit 31 represents the sign bit
Bits 23 to 30 represent the exponent bits
Bits 0 to 22 represent the fractional bits
Numbers as small as $10^{-38}$ and as large as $10^{38}$ can be
represented
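The field layout can be checked on a host machine with a few shifts and masks. This sketch assumes the compiler's `float` is IEEE 754 single precision; `sp_fields` and `sp_decode` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30..23, stored with a bias of 127 */
    uint32_t fraction;  /* bits 22..0 */
} sp_fields;

/* Split a single-precision value into its three bit fields. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */

    sp_fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f the fields are sign 0, exponent 127, fraction 0. For -0.75f (that is, -1.5 x 2^-1) they are sign 1, exponent 126, fraction 0x400000.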
Double precision (DP):
• with 64 bits, more exponent and fractional bits are available
• a pair of registers is used
Second register (upper 32 bits):  31 | 30 ... 20 | 19 ... 0
                                  s  |     e     |    f
First register (lower 32 bits):   31 ... 0
                                  f
Bits 0 to 31 of the first register represent fractional bits
Bits 0 to 19 of the second register also represent fractional bits
Bits 20 to 30 represent the exponent bits
Bit 31 is the sign bit
Numbers as small as $10^{-308}$ and as large as $10^{+308}$ can be
represented
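To illustrate the two-register view, the sketch below splits a `double` into upper and lower 32-bit words and reads the fields from the upper word. It assumes an IEEE 754 `double` on the host; `reg_pair` and the `dp_*` helpers are hypothetical names, not TI functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t hi;  /* second register: sign, exponent, top 20 fraction bits */
    uint32_t lo;  /* first register: low 32 fraction bits */
} reg_pair;

/* Split a double-precision value into two 32-bit words. */
reg_pair dp_split(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */

    reg_pair p;
    p.hi = (uint32_t)(bits >> 32);
    p.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    return p;
}

uint32_t dp_sign(reg_pair p)        { return p.hi >> 31; }
uint32_t dp_exponent(reg_pair p)    { return (p.hi >> 20) & 0x7FFu; }
uint32_t dp_fraction_hi(reg_pair p) { return p.hi & 0xFFFFFu; }
```

For 1.0 the upper word is 0x3FF00000 and the lower word is 0, so the exponent field reads 1023. For -2.5 the upper word is 0xC0040000.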
• Instructions ending in SP or DP operate on single-precision and
double-precision values, respectively
• Some floating-point instructions have longer latencies than
fixed-point instructions
Eg: MPY requires one delay slot
MPYSP requires three delay slots
MPYDP requires nine delay slots
• A single-precision floating-point value can be loaded into a single
register, whereas a double-precision value needs a pair of
registers:
A1:A0, A3:A2, ... B1:B0, B3:B2, ...
• The C6711 processor has a single-precision reciprocal instruction,
RCPSP, which can be used to perform division
Code Optimization
Code optimization is used to drastically reduce the
execution time of the code.
There are several techniques:
(i) Use instructions in parallel
(ii) Word-wide data
(iii) Intrinsic functions
(iv) Software pipelining
Optimized assembly (ASM) code runs faster than C and
requires less memory space.
Comparison of Programming Techniques
Source      | Tool                | Efficiency* | Effort
C, C++      | Optimising compiler | 80 - 100%   | Low
Linear ASM  | Assembly optimiser  | 95 - 100%   | Med
ASM         | Hand optimised      | 100%        | High
* Typical efficiency vs. hand optimized assembly.
Linear Assembly
• The assembly-coded program
produced by the assembly optimizer is
typically more efficient than one resulting
from the C compiler optimizer.
• Linear assembly code programming
provides a compromise between coding
effort and coding efficiency.
• Optimization Steps
1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as
the various optimization levels.
3. Use the profiler to identify the functions
that may need to be further optimized,
then convert these functions to linear ASM.
4. Optimize code in ASM.
Profiler
• The profiler analyzes program execution and
shows you where your program is spending its
time.
• A profile analysis can report how many cycles a
particular function takes to execute and how
often it is called.
• Profiling helps you to direct valuable
development time toward optimizing the sections
of code that most dramatically affect program
performance.
Compiler options:
A C-coded program is first passed through a parser
that performs preprocessing functions and generates
an intermediate file (.if), which becomes the input to an
optimizer.
C code -> Parser -> (.if) -> Optimizer -> (.opt) -> Code generator -> ASM
The optimizer generates an (.opt) file, which becomes
the input to a code generator for further optimization;
the code generator produces the ASM file.
The options for optimization levels:
1. -O0 optimizes the use of registers.
2. -O1 performs local optimization in addition to the
optimization done by -O0.
3. -O2 performs global optimization in addition to the
optimization done by -O0 and -O1.
4. -O3 performs file-level optimization in addition to the
optimizations done by -O0, -O1 and -O2.
-O2 and -O3 attempt software-level optimizations such as software pipelining.
Intrinsic C functions:
• Similar to run-time support library functions
• C intrinsic functions are used to increase the efficiency of
code.
1. int _mpy() has an equivalent ASM instruction MPY, which multiplies the 16 LSBs of a
number by the 16 LSBs of another number.
2. int _mpyh() has an equivalent ASM instruction MPYH, which multiplies the 16 MSBs of
a number by the 16 MSBs of another number.
3. int _mpylh() has an equivalent ASM instruction MPYLH, which multiplies the 16 LSBs
of a number by the 16 MSBs of another.
4. int _mpyhl() has an equivalent ASM instruction MPYHL, which multiplies the 16 MSBs
of a number by the 16 LSBs of another.
5. void _nassert(int) generates no code.
It tells the compiler that the expression passed to it is true, which gives the optimizer a hint.
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word.
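The intrinsics themselves are only available with the TI compiler, so the sketch below emulates the four 16 x 16 multiplies in portable C to show which halves each one selects. The `emu_*` names are hypothetical stand-ins, and the operands are treated as signed 16-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the signed lower and upper 16-bit halves of a 32-bit word.
   Narrowing to int16_t wraps on mainstream compilers (implementation-
   defined in older C standards). */
static int16_t lo16(int32_t x) { return (int16_t)(uint16_t)((uint32_t)x & 0xFFFFu); }
static int16_t hi16(int32_t x) { return (int16_t)(uint16_t)((uint32_t)x >> 16); }

/* MPY: 16 LSBs of a times 16 LSBs of b. */
int32_t emu_mpy(int32_t a, int32_t b)   { return (int32_t)lo16(a) * lo16(b); }

/* MPYH: 16 MSBs of a times 16 MSBs of b. */
int32_t emu_mpyh(int32_t a, int32_t b)  { return (int32_t)hi16(a) * hi16(b); }

/* MPYLH: 16 LSBs of a times 16 MSBs of b. */
int32_t emu_mpylh(int32_t a, int32_t b) { return (int32_t)lo16(a) * hi16(b); }

/* MPYHL: 16 MSBs of a times 16 LSBs of b. */
int32_t emu_mpyhl(int32_t a, int32_t b) { return (int32_t)hi16(a) * lo16(b); }
```

With a = 0x00030002 and b = 0x00050004, the four functions return 8, 15, 10 and 12 respectively.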
Trip directive for loop count:
The linear assembly directive (.trip) is used to specify the
number of times a loop iterates.
If the exact number is known and used, redundant
loops are not generated, which can improve both code
size and execution time.
Cross-Paths
• Data and address cross-path instructions
are used to increase code efficiency.
• MPY .M1x A2,B2,A4
• MPY .M2x A2,B2,B4
Software pipelining
• Software pipelining is a scheme which uses available
resources to obtain efficient pipelined code.
• The aim is to use all eight functional units within one
cycle.
There are three stages:
1. Prolog (warm-up): this stage contains the instructions
needed to build up the loop kernel cycle.
2. Loop kernel (cycle): within this loop, all instructions
are executed in parallel.
The entire loop kernel is executed in one cycle.
3. Epilog (cool-off): this stage contains the instructions
necessary to complete all iterations.
Procedure for software pipelining:
1. Draw the dependency graph
2. Set up a scheduling table
3. Obtain code from the scheduling table.
Dependency graph: (Procedure)
1. Draw the nodes and paths
2. Write the number of cycles to complete an instruction
3. Assign functional units associated with each code
4. Separate the data paths, so that the maximum
number of units are utilized.
dependency graph
• A node has one or more data paths going
in and/or out of the node.
• The numbers next to each node represent
the number of cycles required to complete
the associated instruction.
• A parent node contains an instruction that
writes to a variable; whereas a child node
contains an instruction that reads a
variable written by the parent.
• LDH -> parent of MPY
• MPY -> parent of ADD
• The ADD instruction is fed back as input
for the next iteration; similarly with the
SUB instruction.
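Dependency graph (e.g. two sums of products):
[Figure: dependency graph. The recoverable labels are listed below; the number after each unit is the cycle count shown on the outgoing path. Each LDW feeds both MPY and MPYH.]
Side A: LDW ai (.D1, 5) -> MPY, prod l (.M1x, 2) -> ADD, sum l (.L1, 1)
Side B: LDW bi (.D2, 5) -> MPYH, prod h (.M2x, 2) -> ADD, sum h (.L2, 1)
Loop control: SUB, count (.S1, 1) -> B, loop (.S2, 1)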
Scheduling table:
1. LDW starts in cycle 1
2. MPY and MPYH must start five cycles after LDW, due
to four delay slots.
Therefore MPY/MPYH starts at cycle 6.
3. ADD must start two cycles after MPY/MPYH due to
one delay slot of MPY/MPYH.
Therefore ADD starts in cycle 8.
4. B has 5 delay slots and starts in cycle 3, since
branching occurs in cycle 9, after ADD instructions.
5. SUB instruction must start one cycle before branch
instruction, since the loop count is decremented
before branching occurs.
Therefore SUB starts in cycle 2.
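The start cycles above follow from the delay slots by simple arithmetic, which can be sketched as below. The struct and function names are hypothetical; the delay-slot counts (4 for the load, 1 for the multiply, 5 for the branch) are the ones quoted in the steps above.

```c
#include <assert.h>

typedef struct {
    int ldw;  /* start cycle of the loads */
    int mpy;  /* start cycle of MPY/MPYH */
    int add;  /* start cycle of the ADDs */
    int b;    /* start cycle of the branch */
    int sub;  /* start cycle of the loop-count decrement */
} schedule;

/* Derive the first-iteration start cycles from the delay-slot counts. */
schedule first_iteration_schedule(int load_delay, int mpy_delay, int branch_delay)
{
    assert(load_delay >= 0 && mpy_delay >= 0 && branch_delay >= 0);
    schedule s;
    s.ldw = 1;
    s.mpy = s.ldw + load_delay + 1;           /* 1 + 4 + 1 = 6 */
    s.add = s.mpy + mpy_delay + 1;            /* 6 + 1 + 1 = 8 */
    int branch_taken = s.add + 1;             /* branching occurs in cycle 9 */
    s.b = branch_taken - branch_delay - 1;    /* 9 - 5 - 1 = 3 */
    s.sub = s.b - 1;                          /* one cycle before B: 2 */
    return s;
}
```

Calling `first_iteration_schedule(4, 1, 5)` gives LDW in cycle 1, MPY/MPYH in cycle 6, ADD in cycle 8, B in cycle 3 and SUB in cycle 2.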
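Schedule table before software pipelining:
Units | 1,9,17.. | 2,10,18.. | 3,11,.. | 4,12,.. | 5,13,.. | 6,14,.. | 7,15,.. | 8,16,..
.D1   | LDW      | -         | -       | -       | -       | -       | -       | -
.D2   | LDW      | -         | -       | -       | -       | -       | -       | -
.M1   | -        | -         | -       | -       | -       | MPY     | -       | -
.M2   | -        | -         | -       | -       | -       | MPYH    | -       | -
.L1   | -        | -         | -       | -       | -       | -       | -       | ADD
.L2   | -        | -         | -       | -       | -       | -       | -       | ADD
.S1   | -        | SUB       | -       | -       | -       | -       | -       | -
.S2   | -        | -         | B       | -       | -       | -       | -       | -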
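Schedule table after software pipelining:
Units | 1,9,17.. | 2,10,18.. | 3,11,.. | 4,12,.. | 5,13,.. | 6,14,.. | 7,15,.. | 8,16,..
.D1   | LDW      | LDW       | LDW     | LDW     | LDW     | LDW     | LDW     | LDW
.D2   | LDW      | LDW       | LDW     | LDW     | LDW     | LDW     | LDW     | LDW
.M1   | -        | -         | -       | -       | -       | MPY     | MPY     | MPY
.M2   | -        | -         | -       | -       | -       | MPYH    | MPYH    | MPYH
.L1   | -        | -         | -       | -       | -       | -       | -       | ADD
.L2   | -        | -         | -       | -       | -       | -       | -       | ADD
.S1   | -        | SUB       | SUB     | SUB     | SUB     | SUB     | SUB     | SUB
.S2   | -        | -         | B       | B       | B       | B       | B       | B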
• Instructions within prolog stage (cycles 1-7) are
repeated until and including loop kernel (cycle 8).
• Instructions in the epilog stage (cycles 9,10…) are to
complete the functionality of the code.
Loop Kernel
• Within loop cycle 8, multiple iterations of the loop execute in parallel, i.e. different iterations are
processed at the same time.
eg: ADDs add data for iteration 1
MPY/MPYH multiply data for iteration 3
LDW loads data for iteration 8
SUB decrements the counter for iteration 7
B branches for iteration 6
• i.e. the values being multiplied are loaded into registers 5
cycles before the cycle in which they are actually
multiplied. Before the first multiplication occurs, the fifth load
has just completed.
• This software pipelining is 8 iterations deep.
• If the loop count is 100 (200 numbers)
Cycle 1:
LDW, LDW (also initialization of count and
accumulators A7 and B7)
Cycle 2:
LDW, LDW, SUB
Cycle 3-5: LDW, LDW, SUB, B
Cycle 6-7: LDW, LDW, MPY, MPYH, SUB, B
Cycle 8-107: LDW, LDW, MPY, MPYH, ADD, ADD,
SUB, B
Cycle 108: LDW, LDW, MPY, MPYH, ADD, ADD,
SUB, B
• Prolog section is within cycle 1-7
• Loop kernel is in cycle 8
• Epilog section is in cycle 108.
Execution Cycles:
Number of cycles (with software pipelining):
Fixed point
= 7 + (N/2) + 1
eg: N = 200 ; 7 + 100 + 1 = 108
Floating point
= 9 + (N/2) + 15
Technique                  | Fixed Point            | Floating Point
No optimization            | 2 + (16 x 200) = 3202  | 2 + (18 x 200) = 3602
With parallel instructions | 1 + (8 x 200) = 1601   | 1 + (10 x 200) = 2001
Two sums per iteration     | 1 + (8 x 100) = 801    | 1 + (10 x 100) + 7 = 1008
With S/W pipelining        | 7 + (200/2) + 1 = 108  | 9 + (200/2) + 15 = 124
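The cycle counts in the table can be reproduced with a few one-line formulas. This is only a restatement of the figures above in C, with hypothetical function names; `fp` selects the floating-point column.

```c
#include <assert.h>

/* Cycle-count formulas taken from the comparison table; n is the number
   of input values and fp is nonzero for the floating-point column. */
long cycles_no_opt(long n, int fp)
{
    return 2 + (fp ? 18 : 16) * n;
}

long cycles_parallel(long n, int fp)
{
    return 1 + (fp ? 10 : 8) * n;
}

long cycles_two_sums(long n, int fp)
{
    assert(n % 2 == 0);                    /* two values per iteration */
    return 1 + (fp ? 10 : 8) * (n / 2) + (fp ? 7 : 0);
}

long cycles_pipelined(long n, int fp)
{
    assert(n % 2 == 0);
    return fp ? 9 + n / 2 + 15             /* floating point */
              : 7 + n / 2 + 1;             /* fixed point: prolog + kernel + epilog */
}
```

For n = 200 the fixed-point results are 3202, 1601, 801 and 108 cycles, so software pipelining is roughly 30 times faster than the unoptimized loop in this example.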
Memory Constraints:
• Internal memory is arranged through various banks of
memory so that loads and stores can occur
simultaneously.
• Since banks are single ported, only one access to
each bank is performed per cycle.
• Two memory access per cycle can be performed if
they do not access the same bank.
• If multiple accesses are made to the same bank in one cycle,
the pipeline will stall.
Cross Path Constraints:
• Since there is one cross path on each side of the two datapaths, at most
two instructions per cycle can use a cross path.
eg: Valid code segment (because both available cross paths are utilized)
      ADD .L1X A1, B1, A0
   || MPY .M2X A2, B2, B3
eg: Not valid (because one cross path is used for both instructions)
      ADD .L1X A1, B1, A0
   || MPY .M1X A2, B2, A3
Load/store constraints:
• The address register to be used must be on the same side as the .D unit.
eg: Valid code:
      LDW .D1 *A1, A2
   || LDW .D2 *B1, B2
eg: Invalid code:
      LDW .D1 *A1, A2
   || LDW .D2 *A3, B2
• Loading and storing in parallel cannot use the same register file.
eg: Valid code:
      LDW .D1 *A0, B1
   || STW .D2 A1, *B2
eg: Invalid code:
      LDW .D1 *A0, A1
   || STW .D2 A2, *B2
Pipelining Effects with More Than One Execute Packet (EP) within a Fetch Packet (FP)
• When the CPU detects that FP1 contains
more than one EP, it forces the pipeline to
stall so that EP2 and EP3, within FP1, can
each start its dispatching phase in cycles 6
and 7, respectively.
• Hence, with three EPs within one FP,
the pipeline stalls for two cycles.
TMS320C64x
• TMS320C64x is a family of 32-bit fixed-point Very Long
Instruction Word (VLIW) DSPs from Texas Instruments
• At clock rates of up to 1 GHz, C64x DSPs can process
information at rates up to 8000 MIPS
• C64x DSPs can do more work each cycle with built-in
extensions.
• They can process all C62x object code unmodified
(but not vice-versa)
Applications for the C64x
TMS320C64x can be used as a CPU in the following
devices:
• Wireless local base stations
• Remote access servers (RAS)
• Digital subscriber loop (DSL) systems
• Cable modems
• Multichannel telephony systems
• Pooled modems
New extensions
• Register file enhancements
• Data path extensions
• Packed data processing
• Additional functional unit hardware
• Increased orthogonality
Register file enhancements
• The ’C64x register file has twice as many
general-purpose registers as the ’C62x/’C67x cores
• There are 32 32-bit registers per data path
A0-A31 for file A and B0-B31 for file B
• A0 may also be used as a condition register bringing
the total to six condition registers.
• In all ’C6000 devices, registers A4-A7 and B4-B7 can
be used for circular addressing.
Packed data processing
• The ’C64x register file supports all the ’C62x data
types and extends this by additionally supporting
packed 8-bit types and 64-bit fixed-point data types.
• Packed data types store either four 8-bit values or
two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.
• Besides being able to perform all the ’C62x
instructions, the ’C64x also contains many 8–bit and
16–bit extensions to the instruction set.
Eg: MPYU4 instruction performs four 8x8 unsigned
multiplies with a single instruction on a .M unit.
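To make the packed layout concrete, the sketch below packs four bytes into one 32-bit word and emulates a four-way 8 x 8 unsigned multiply in portable C. The names `pack_u8x4` and `emu_mpyu4` are hypothetical, and the assumption that the product of byte lane i lands in 16-bit lane i of the 64-bit result should be checked against the TI instruction set reference.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned bytes into one 32-bit word; b0 is the lowest byte. */
uint32_t pack_u8x4(uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0)
{
    return ((uint32_t)b3 << 24) | ((uint32_t)b2 << 16) |
           ((uint32_t)b1 << 8)  |  (uint32_t)b0;
}

/* Emulate four parallel 8x8 unsigned multiplies. Each 16-bit product is
   stored in the matching 16-bit lane of the 64-bit result (assumed layout). */
uint64_t emu_mpyu4(uint32_t a, uint32_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        assert(x * y <= 0xFFFFu);          /* 255 * 255 fits in 16 bits */
        out |= (uint64_t)(x * y) << (16 * lane);
    }
    return out;
}
```

Multiplying the packed bytes (1, 2, 3, 4) by (10, 20, 30, 255) gives the four products 10, 40, 90 and 1020 in one 64-bit value, 0x000A0028005A03FC.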
Data path extensions
• On the ’C64x, all eight of the functional units have
access to the register file on the opposite side via a
cross path.
• On the ’C62x/’C67x, only six functional units have
access to the register file on the opposite side via a
cross path; the .D units do not have a data cross
path.
• The ’C64x pipelines data cross path accesses
allowing multiple units per side to read the same
cross path source simultaneously.
• In ’C62x/’C67x, only one functional unit per data path
per execute packet could get an operand from the
opposite register file.
Additional Functional Unit Hardware
• The .L units can perform byte shifts and the .M units
can perform bi-directional variable shifts in addition to
the .S unit’s ability to do shifts.
• Bit-count and rotate hardware on the .M unit extends
support for bit-level algorithms such as binary
morphology, image metric calculations and encryption
algorithms.
Increased Orthogonality
• The .D unit can now perform 32-bit logical
instructions in addition to the .S and .L units.
• Also, the .D unit now directly supports load and store
instructions for double-word data values
Block diagram
[Figure: C64x block diagram. The CPU core sits between the L1 program cache (direct-mapped, 16 KBytes total) and the L1 data cache (2-way set-associative, 16 KBytes total). Both connect to the L2 memory (1024 KBytes) and the enhanced DMA controller (64-channel). EMIF A and EMIF B connect to SDRAM, SBSRAM, ZBT RAM, FIFO, SRAM and I/O devices.]
C64x CPU Architecture Overview
• 2 (almost) identical fixed-point data paths that
each contain
– 1 ALU (The .L Unit)
– 1 Shifter (The .S Unit)
– 1 Multiplier (The .M Unit)
– 1 Adder/Subtractor used for address
generation (The .D Unit)
– 1 register file containing thirty-two 32-bit
registers
• The 8 execution units in the 2 data paths are
capable of executing up to 8 instructions in
parallel.
• Can operate on 8-, 16-, 32-, and 40-bit data
• Can perform double-word (64-bit) loads and
stores by using 2 registers for the one operation.
General-Purpose Register Files
• The C64x register file contains 32 32-bit registers (A0-A31
for file A and B0-B31 for file B), which can be used for
data, pointers or conditions.
• Values larger than 32 bits (40-bit long and 64-bit float
quantities) are stored in register pairs.
• Packed data types are: four 8-bit values or two 16-bit
values in a single 32-bit register, or four 16-bit values in a
64-bit register pair.
Storage of a 40-bit value in a register pair:
Odd register: holds bits 39-32; the remaining bits are zero filled
Even register: holds bits 31-0
Delay Slots
• Delay slots mean “how many CPU cycles come
between the current instruction and when the
results of the instruction can be used by another
instruction”
• Single Cycle Instructions: 0 delay slots
• 16x16 Single Multiply and .M Unit non-multiply
Instructions: 1 delay slot
• Store: 0 delay slots
– If a load occurs before a store (either in parallel or not),
then the old data is loaded from memory before the new
data is stored.
– If a load occurs after a store, (either in parallel or not), then
the new data is stored before the data is loaded.
• C64x Multiply Extensions: 3 delay slots
• Load: 4 delay slots
• Branch: 5 delay slots
– The branch target is in the PG slot when the branch
condition is determined in E1. There are 5 slots between
PG and E1 when the branch target begins executing useful
code again.
Memory
• The C64x has different spaces for program and data memory
• It uses a two-level cache memory scheme
Internal Memory
The C64x has a 32-bit byte-addressable memory with the
following features:
• Separate data and program address spaces
• Large on-chip RAM, up to 7MB
• 2-level cache
• Single internal program memory port with an
instruction-fetch bandwidth of 256 bits
• Two 64-bit internal data memory ports
Memory Map (Internal and External Memory)
• Level 1 Program Cache is 128 Kbit, direct
mapped
• Level 1 Data Cache is 128 Kbit, 2-way set-associative
• Shared Level 2 Program/Data
Memory/Cache of 4 Mbit
– Can be configured as mapped memory
– Cache (up to 256 Kbytes)
– Combination of the two
Memory Buses
• Instruction fetch using 32-bit address bus
and 256-bit data bus
• two 64-bit load buses (LD1 and LD2)
• two 64-bit store buses (ST1 and ST2)
Interrupts
• 16 prioritized interrupts: INT_00 to INT_15
• INT_00 has the highest priority and is dedicated
to RESET. This halts the CPU and returns it to
a known state
• The first four interrupts (INT_00 – INT_03) are
fixed and non-maskable
• INT_01 – INT_03 are generally used to alert the
CPU of an impending hardware problem, such
as an imminent power failure
• The remaining interrupts are maskable and can
be programmed
Interrupt Performance Considerations
• Overhead for all CPU interrupts is 7 cycles
• Interrupt latency is 11 cycles
• Interrupts can be recognized every 2
cycles
• 2 occurrences of a specific interrupt can
be recognized in 2 cycles
Peripheral Set
• 2 multichannel buffered audio serial ports
• 2 inter-integrated circuit bus modules (I2Cs)
• 3 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface
(HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with
programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA),
capable of interfacing to synchronous and asynchronous
memories and peripherals.
ZBT RAM
• Zero Bus Turnaround (ZBT) is a synchronous SRAM
architecture optimized for networking and
telecommunications applications.
• It can increase the internal bandwidth of a switch
fabric when compared to standard SyncBurst SRAM.
• The ZBT architecture is optimized for switching and
other applications with highly random READs and
WRITEs.
• ZBT SRAMs eliminate all idle cycles when turning the
data bus around from a WRITE operation to a READ
operation
Packaging - Top View
[Figure: package top view, with the A1 corner marked]
Packaging - Bottom View
[Figure: package bottom view]
Sum of products example
C code:
int DotP(short* m, short* n, int count) {
    int i, product, sum = 0;
    for (i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}
TI TMS C64x code:
LOOP:
      [A0]   SUB   .L1    A0, 1, A0
   || [!A0]  ADD   .S1    A6, A5, A5
   ||        MPY   .M1X   B4, A4, A6
   || [B0]   BDEC  .S2    LOOP, B0
   ||        LDH   .D1T1  *A3++, A4
   ||        LDH   .D2T2  *B5++, B4
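The routine above handles one pair of 16-bit samples per iteration. The word-wide data technique from the code optimization section (one LDW per array, then MPY on the lower halves and MPYH on the upper halves) can be imitated in portable C as follows. This is a host-side sketch under the assumption that `count` is even; the function names are hypothetical and this is not the TI-generated code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference version: one product per iteration, as in DotP above. */
int32_t dotp_scalar(const int16_t *m, const int16_t *n, int count)
{
    int32_t sum = 0;
    for (int i = 0; i < count; i++)
        sum += (int32_t)m[i] * n[i];
    return sum;
}

/* Two sums per iteration: each 32-bit load brings in two 16-bit samples. */
int32_t dotp_two_sums(const int16_t *m, const int16_t *n, int count)
{
    assert(count >= 0 && count % 2 == 0);
    int32_t sum_l = 0, sum_h = 0;              /* two separate accumulators */
    for (int i = 0; i < count; i += 2) {
        uint32_t wa, wb;
        memcpy(&wa, &m[i], sizeof wa);         /* plays the role of LDW */
        memcpy(&wb, &n[i], sizeof wb);
        /* Narrowing to int16_t wraps on mainstream compilers. */
        sum_l += (int16_t)(wa & 0xFFFFu) * (int16_t)(wb & 0xFFFFu);  /* like MPY  */
        sum_h += (int16_t)(wa >> 16) * (int16_t)(wb >> 16);          /* like MPYH */
    }
    return sum_l + sum_h;                      /* combine after the loop */
}
```

Both functions return the same result for the same inputs; for m = {1, 2, 3, 4} and n = {5, 6, 7, 8} the sum of products is 70. Keeping two accumulators halves the number of loop iterations, which is where the "two sums per iteration" row of the cycle table comes from.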
Another code example
MIPS:
loop: LW R1, 0(R11)
MUL R2, R1, R10
SW R2, 0(R12)
ADDI R12, R12, #-4
ADDI R11, R11, #-4
BGTZ R12, loop
TI TMS C64x:
         ADDK .S1 #-4,A11
      || LDW .D1 A1,0(A11)
      || MVK .S2 #-4,B1

         ADDK .S1 #-4,A11
      || LDW .D1 A1,0(A11)
      || MUL .M1 A1,A10,A2
      || ADDK .S2 #-12,B12

loop:    ADDK .S1 #-4,A11
      || ADD .L2 B12,B1,B12
      || LDW .D1 A1,0(A11)
      || MUL .M1 A1,A10,A2
      || STW .D2x A2,0(B12)
      || BGTZ .S2 B12, loop

         ADD .L2 B12,B1,B12
      || MUL .M1 A1,A10,A2
      || STW .D2x A2,0(B12)

         ADD .L2 B12,B1,B12
      || STW .D2x A2,0(B12)
Special purpose instructions
Instruction    | Description                        | Example Application
BITC4          | Bit counter                        | Machine vision
GMPY4          | Galois Field MPY                   | Reed Solomon support
SHFL           | Bit interleaving                   | Convolution encoder
DEAL           | Bit de-interleaving                | Cable modem
SWAP4          | Byte swap                          | Endian swap
XPNDx          | Bit expansion                      | Graphics
MPYHIx, MPYLIx | Extended precision 16x32 MPYs      | Audio
AVGx           | Quad 8-bit, Dual 16-bit average    | Motion compensation
SUBABS4        | Quad 8-bit absolute of differences | Motion estimation
SSHVL, SSHVR   | Signed variable shift              | GSM
THE END