Getting Ready for the Shift to 4G


Introducing the ConnX D2 DSP Engine
Introduced: August 24, 2009
Fastest Growing Processor / DSP IP Company
Customizable Dataplane Processor/DSP IP Licensing
– Leading provider of customizable Dataplane Processor Units (DPUs)
– Unique combination of processor & DSP IP cores + software design tools
– Customization enables improved power, cost, performance
– Standard DPU solutions for audio, video/imaging & baseband comms
– Dominant patent portfolio for configurable processor technology
Broad-Based Success
– 150+ Licensees, including 5 of the top 10 semiconductor companies
– Shipping in high volume today (>200M/yr rate)
– Fastest growing Semiconductor Processor IP company (per Gartner, Jan-09)
• 21% revenue growth in 2007, 25% in 2008
Copyright © 2009, Tensilica, Inc.
Focus: Dataplane Processing Units (DPUs)
DPUs: Customizable CPU+DSP delivering 10 to 100x higher performance than CPU or DSP and providing better flexibility & verification than RTL
[Figure: Main Applications CPU and Embedded Controller For Dataplane Processing]
Tensilica focus: Dataplane Processors
Communications DSP Trends / Challenges
Code Size Increases
• Communications standards growing in number & complexity
• DSP algorithm code heavily integrated with more (and more complex) control code
Development Teams Shrink
• SOC development schedules tightening
• Tightening resource constraints (do more with less)
Markets Changing Faster
• Market requirements in flux as economy wobbles
• Emerging standards evolve faster in the Internet age
Maintenance and flexibility push DSP algorithms towards C code
Trends Within Licensable DSP Architectures
1st Generation Licensable DSP Cores
• Modest/Medium performance (single/dual MAC)
• Simple architecture (single issue, compound Instructions)
• Limited or no compiler support (mostly hand coded)
2nd Generation Licensable DSP Cores
• Added RISC-like architecture features (register arrays)
• Improved compiler targets, but still assembly
• Some offer wide VLIW for performance
  – Large area; code bloat
• Some offer wide SIMD for performance
  – Good area/performance tradeoff
  – No performance gain when vectorization fails
Vectorization Benefits (SIMD)
• Loop counts can be reduced
• Data computation can be done in parallel
• Cheapest (hardware cost) method to get higher performance
Example: 2-way SIMD performance benefit
[Figure: Before vectorization, single execution processes Data0 through Data7 one element at a time. After vectorization, 2-way SIMD execution processes them in pairs: Data0/Data1, Data2/Data3, Data4/Data5, Data6/Data7.]
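To make the loop-count reduction concrete, here is a minimal C sketch. It is not Tensilica code: the function names `add_scalar` and `add_2way` are invented for illustration, and the two statements in the second loop body only stand in for what a single 2-way SIMD operation would do in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before vectorization: one element per iteration.
   Returns the number of loop iterations executed. */
size_t add_scalar(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    size_t iterations = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(x[i] + y[i]);
        iterations++;
    }
    return iterations;
}

/* After vectorization, modeled in plain C: two elements per iteration.
   Assumption for this sketch: n is even, so no remainder loop is needed. */
size_t add_2way(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    assert(n % 2 == 0);
    size_t iterations = 0;
    for (size_t i = 0; i < n; i += 2) {
        out[i]     = (int16_t)(x[i] + y[i]);         /* lane 0 */
        out[i + 1] = (int16_t)(x[i + 1] + y[i + 1]); /* lane 1 */
        iterations++;
    }
    return iterations;
}
```

For the eight elements in the figure, the scalar version runs eight iterations and the 2-way version runs four, with identical results.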
VLIW Technology
• Parallel execution of instructions
• Effective use of multiple ALUs/MACs
• Compiler allocates instructions to VLIW slots
• Orthogonal allocation yields more flexibility
[Figure: Instructions #1 to #4 issue one at a time on a single Execution ALU; with VLIW they issue two at a time across VLIW Execution ALU1 and VLIW Execution ALU2.]
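The slot allocation idea can be sketched as a toy packer. This is an illustrative model only, not the Tensilica compiler's algorithm: the `op_t` type, the register numbering, and the conservative dependency test are all assumptions made for the example. It pairs adjacent operations into a two-slot bundle whenever they do not depend on each other and counts the resulting bundles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-register operation: dst = f(src1, src2). */
typedef struct {
    int dst;
    int src1;
    int src2;
} op_t;

/* Conservative test: b may share a bundle with a only if b neither reads
   nor overwrites a's result, and does not overwrite a's inputs. */
static int independent(const op_t *a, const op_t *b)
{
    if (b->src1 == a->dst || b->src2 == a->dst) return 0; /* read after write  */
    if (b->dst == a->dst) return 0;                       /* write after write */
    if (b->dst == a->src1 || b->dst == a->src2) return 0; /* write after read  */
    return 1;
}

/* Greedy in-order packing into 2-slot bundles.
   Returns the bundle count, i.e. issue cycles in this simple model. */
size_t pack_two_slots(const op_t *ops, size_t n)
{
    size_t bundles = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n && independent(&ops[i], &ops[i + 1])) {
            i += 2; /* both slots filled */
        } else {
            i += 1; /* second slot left empty */
        }
        bundles++;
    }
    return bundles;
}
```

Four mutually independent operations pack into two bundles, matching the figure; a chain where each operation consumes the previous result stays at one operation per bundle.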
Ideal 3rd Generation Licensable DSP
Ideal Characteristics
• VLIW capability for good performance on general code
  – Parallelization of independent operations
• SIMD capability for good performance on loop code
  – Data parallel execution
• Good C compiler target
  – Reduce or eliminate the need for assembly programming
  – Productivity benefit
• Small, compact size
  – Keep costs down in brutally competitive markets
Tensilica - the Stealth DSP Company
[Figure: Tensilica DSP portfolio across Comms, Audio, Video and other Xtensa markets. DSP building blocks shown: MAC16, MUL32, DIV32, Single Precision Floating Point Unit, Double Precision Floating Point HW Acceleration, HiFi 2, ConnX D2, ConnX Vectra LX, 388VDO, ConnX 545CK DSP, ConnX BBE, and custom DSPs built with Xtensa TIE. MAC counts shown: Single MAC, Dual MAC, Quad MAC, 8 MAC, 16 MAC, and more.]
ConnX D2 DSP Engine - Overview
• Dual 16b MAC architecture with hybrid SIMD / VLIW
  – Optimum performance on a wide range of algorithms
  – SIMD offers high data computation rate for DSP algorithms
  – 2-way VLIW allows parallel instruction execution on SIMD and scalar code
• “Out of the Box” industry standard software compatibility
  – TI C6x fixed-point C intrinsics supported
  – Fully bit for bit equivalent with TI C6x
  – ITU reference code fixed point C intrinsics directly supported
• Goals: Ease of Use, Low Area/Cost
  – Click and go “Out of the Box” performance from standard C code
  – Standard C and fixed point data types: 16-bit, 32-bit and 40-bit
  – Advanced optimizing, vectorizing compiler
  – Less than 70K gates (under 0.2 mm² in 65 nm)
Target Applications: ConnX D2
General purpose 16-bit DSP for a wide range of applications
• Embedded control
• VoIP gateways, voice-over-networks (including VoIP codecs)
• Femto-cell and pico-cell base stations
• Next generation disk drives, data storage
• Mobile terminals and handsets
• Home entertainment devices
• Computer peripherals, printers
ConnX D2 DSP: An ingredient of an Xtensa DPU
Hardware Use Model
• Click-button configuration option within Xtensa LX core
• Part of the Tensilica configurable core deliverable package
• Two reference configurations
  – Typical DSP solution for high performance
  – Small size for cost and power sensitive applications
• Full tool support from Tensilica
  – High level simulators (SystemC), ISS and RTL
  – Debugger and Trace
  – Compiler, IDE and Operating Systems
ConnX D2 Processor Block Diagram (Typical)
ConnX D2 Engine Architecture
[Figure: ConnX D2 engine block diagram. The labels recovered from it are listed below.]
Registers and state
• AR Register Bank (32 bits)
• XDD Register File (8 x 40-bits)
• XDU Alignment Registers (4 x 32 bits)
• Overflow State, Carry State
• Hi / Lo 16-bit select
Memory interface
• Local Memory and/or Cache, connected through a 32-bit Load Store Unit
Data types
• 40-bit, 32-bit & 16-bit integer
• 40-bit, 32-bit & 16-bit fixed
• Four 8-bit values
• Two 16-bit vector elements
• 16-bit real and 16-bit imaginary
Datapath labels (two lanes)
• 16b operands, multiplier (X), 32b, adder (+), 40b, Q rounding, Shift / Saturation, Accumulator (up to 40-bits), DR Register
Addressing Modes
• Immediate
• Immediate updating
• Indexed
• Indexed updating
• Aligning updating
• Circular (instruction)
• Bit-reversed (instruction)
DSP specific instructions
• Add-Bit-Reverse-Base and Add-Subtract: useful for FFT implementation
• Add-Compare-Exchange: useful for Viterbi implementation
• Add-Modulo: circular buffer implementation, useful for FIR implementation
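The slide names the Add-Modulo and bit-reversed addressing operations without giving their exact semantics, so the following C sketch only illustrates the general techniques they accelerate. The function names `add_modulo` and `bit_reverse` and their argument conventions are assumptions for this example, not the ConnX D2 instruction definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Circular-buffer index update: advance idx by step and wrap into [0, len).
   Assumes 0 <= idx < len and step <= len, so one subtraction is enough. */
uint32_t add_modulo(uint32_t idx, uint32_t step, uint32_t len)
{
    assert(len > 0 && len <= 0x80000000u && idx < len && step <= len);
    uint32_t next = idx + step;
    return (next >= len) ? next - len : next;
}

/* Bit-reversed index as used to reorder FFT data:
   reverses the low `bits` bits of x (for an FFT of size 2^bits). */
uint32_t bit_reverse(uint32_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u); /* move the lowest bit of x to the bottom of r */
        x >>= 1;
    }
    return r;
}
```

In software each of these costs several instructions per access, which is why a FIR delay line or an FFT reordering pass benefits from having them folded into a single addressing operation.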
ConnX D2 : Instruction Allocation Options
• 16-bit Instructions: Base ISA
• 24-bit Instructions: Base ISA or ConnX D2
• VLIW Instructions (64-bits)
  – Slot 0: ConnX D2 or Base ISA
  – Slot 1: ConnX D2 or Base ISA (register moves & C ops on register data)
• Flexible allocation of instructions available to compiler
  – Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
  – Improved performance and no code bloat (reduced NOPs)
  – Reduce code size when algorithm is less performance intensive
  – Modeless switching between instruction formats
ConnX D2 : SIMD with VLIW – Extra Performance
Combining SIMD and VLIW can give 6 times the performance
Example : Energy Calculation (128-iteration C algorithm)
A = \sum_{n=0}^{127} X_n \cdot X_n
Base Xtensa Configuration: 416 cycles
loopgtz a3,.LBB52_energy    # [3]
l16si   a3,a2,2             # [0*II+0] id:16 a+0x0
l16si   a5,a2,4             # [0*II+1] id:16 a+0x0
l16si   a6,a2,6             # [0*II+2] id:16 a+0x0
l16si   a7,a2,8             # [0*II+3] id:16 a+0x0
mul16s  a3,a3,a3            # [0*II+4]
mul16s  a5,a5,a5            # [0*II+5]
mul16s  a6,a6,a6            # [0*II+6]
mul16s  a7,a7,a7            # [0*II+7]
addi.n  a2,a2,8             # [0*II+8]
add.n   a3,a4,a3            # [0*II+9]
add.n   a3,a3,a5            # [0*II+10]
add.n   a3,a3,a6            # [0*II+11]
add.n   a4,a3,a7            # [0*II+12]
ConnX D2: 64 cycles
loop {
    # format XD2_FLIX_FORMAT
    xd2_la.d16x2s.iu xdd0,xdu0,a4,4;
    xd2_mulaa40.d16s.ll.hh xdd1,xdd0,xdd0 }
One instruction (64-bit VLIW instruction) fills both Slot0 and Slot1
• Vectorization and SIMD give double the data computation performance
• VLIW gives 2 pipeline executions (one is SIMD) with auto-increment loads
• ConnX D2 architecture gives this combination and performance
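The slide refers to a 128-iteration C algorithm without showing it. Below is a minimal sketch of what such a loop could look like, together with a back-of-the-envelope cycle model. The names `energy128` and `loop_cycles` are invented here, the 64-bit accumulator is a stand-in for the 40-bit hardware accumulator, and the model assumes one cycle per instruction and ignores loop setup. Under those assumptions, the base listing (13 instructions per iteration, 4 elements per iteration) and the ConnX D2 listing (1 VLIW instruction per iteration, 2 elements per iteration) reproduce the quoted cycle counts.

```c
#include <stdint.h>

#define N 128

/* Sum of squares over N 16-bit samples. A 64-bit accumulator is used in
   this sketch so that the worst-case sum cannot overflow. */
int64_t energy128(const int16_t *x)
{
    int64_t sum = 0;
    for (int n = 0; n < N; n++) {
        sum += (int32_t)x[n] * x[n];
    }
    return sum;
}

/* Simplified cycle estimate: iterations times instructions per iteration,
   assuming one cycle per instruction and no loop overhead. */
int loop_cycles(int elements, int elements_per_iter, int instrs_per_iter)
{
    return (elements / elements_per_iter) * instrs_per_iter;
}
```

With these inputs, `loop_cycles(128, 4, 13)` gives 416 and `loop_cycles(128, 2, 1)` gives 64, a ratio of 6.5.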
When Vectorization is Not Possible
Performance for scalar code bases
int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i+col] * a[cols*i+col];
    }
    return sum;
}
• Energy computation of column ‘col’ in 2-D array
• Above code loop cannot be vectorized
• Non-contiguous memory accesses thwart vectorizers
• Regular compilers cannot map this code onto traditional SIMD DSPs
When Vectorization is Not Possible
Performance for scalar code bases
C-Code
int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i+col] * a[cols*i+col];
    }
    return sum;
}
• Confirmed that ConnX D2 and TI C6x compilers cannot vectorize this code
• ConnX D2 compiler can however use VLIW to increase performance
Generated Assembly Code
entry   a1,32
blti    a5,1,.Lt_0_2306
addx2   a2,a3,a2
slli    a3,a4,1
addi.n  a4,a5,-1
sub     a2,a2,a3
{   # format XD2_FLIX_FORMAT
    xd2_l.d16s.xu xdd0,a2,a3
    xd2_movi.d40 xdd1,0 }
loopgtz a4,.LBB43_energy
{   # format XD2_FLIX_FORMAT
    xd2_l.d16s.xu xdd0,a2,a3 ;
    xd2_mula32.d16s.ll_s1 xdd1,xdd0,xdd0
}
…………
ConnX D2 : One cycle within loop
• xd2_l.d16s.xu: Load scalar 16-bits. xdd0 is loaded with the memory contents addressed by the a2 register. The a2 register value is updated by the value in a3
• xd2_mula32.d16s.ll_s1: MAC operation on lower 16-bits. Multiplies xdd0 with xdd0. The accumulated result is stored in xdd1
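The generated code walks down the column with an address that advances by a fixed stride on every load, rather than recomputing `cols*i+col`. A rough C equivalent of that shape is sketched below. The function name `energy_strided` is made up for illustration, and this is a reading of the listing, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Column energy with an explicit strided walk: one scalar load and one
   multiply-accumulate per row, with the offset stepped by `cols`. */
int energy_strided(const short *a, int col, int cols, int rows)
{
    assert(col >= 0 && cols > 0 && col < cols);
    size_t offset = (size_t)col;    /* element offset of row 0, column col */
    int sum = 0;
    for (int i = 0; i < rows; i++) {
        int x = a[offset];          /* one scalar 16-bit load */
        sum += x * x;               /* one multiply-accumulate */
        offset += (size_t)cols;     /* step down one row */
    }
    return sum;
}
```

Because the load (with its address update) and the multiply-accumulate are independent within an iteration, they can share one two-slot VLIW instruction, which is how the loop body reaches one cycle even though nothing is vectorized.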
Optimization with ITU / TI Intrinsics
Performance for generic code bases
Energy calculation loop: 1000 iterations, using the L_mac ITU intrinsic
#define ASIZE 1000
extern int a[ASIZE];
extern int red;
void energy()
{
    int i;
    int red_0 = red;
    for (i = 0; i < ASIZE; i++) {
        red_0 = L_mac(red_0, a[i], a[i]);
    }
    red = red_0;
}
• L_mac maps to one ConnX D2 instruction
• Compiler further optimizes by using SIMD to accelerate the loop
• VLIW further accelerates the loop with parallel loads
• 1000-iteration C loop optimized to a 500-cycle loop
• Sustained 3 operations / cycle
Generated Assembly Code
entry   a1,32
l32r    a2,.LC1_40_18
l32r    a5,.LC0_40_17
xd2_l.d16x2s.iu xdd0,a2,4       # test_arr_1+0x0
l32i.n  a3,a5,0                 # test_global_red_0+0x0
{   # format XD2_ARUSEDEF_FORMAT
    xd2_mov.d32.a32s xdd1,a3
    movi a3,499 }
loopgtz a3,
{   # format XD2_FLIX_FORMAT
    xd2_l.d16x2s.iu xdd0,a2,4;
    xd2_mulaa.fs32.d16s.ll.hh xdd1,xdd0,xdd0
}
..........
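For readers unfamiliar with the ITU basic operators, L_mac is a saturating fractional multiply-accumulate: it doubles the 16x16 product, saturates it, and adds it to the accumulator with saturation. The portable sketch below follows that definition. The `_ref` names are chosen here to avoid clashing with a real header, and the Overflow flag that the ITU reference code also maintains is left out.

```c
#include <stdint.h>

/* Saturating fractional multiply: 2 * a * b, clipped to 32 bits.
   Only a == b == -32768 can exceed the positive limit. */
int32_t L_mult_ref(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Saturating 32-bit addition. */
int32_t L_add_ref(int32_t x, int32_t y)
{
    int64_t s = (int64_t)x + (int64_t)y;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Multiply-accumulate: acc + 2 * a * b, saturating at each step. */
int32_t L_mac_ref(int32_t acc, int16_t a, int16_t b)
{
    return L_add_ref(acc, L_mult_ref(a, b));
}
```

Written this way each call costs several operations on a general-purpose core, which is the overhead a one-to-one mapping of the intrinsic onto a single instruction removes.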
“Out of the Box” Performance - Results
Comparison to TI C55x
(TI C55x is an industry benchmark Dual-MAC, 2-way VLIW)
• 20% more performance (256 point complex FFT)
Cycle count (lower is better)
  ConnX D2, "Out of the Box" C code: 3740
  TI C55x, optimized assembly: 4786 #
Why better?
• FFT specific instructions
• Dual write to Register Files
• Advanced Compiler
• SIMD and VLIW performance
Comparison to other DSP IP vendors
• Almost twice the performance
Required MHz for AMR-NB (VAD2) Encode + Decode
  ConnX D2 (Out of the Box ITU reference code): 27.7 MHz
  CEVA - X1620 (Out of the Box ITU reference code): 48 MHz *
Why better?
• 1 to 1 mapping of ITU intrinsics
• SIMD and VLIW performance
• Flexibility in VLIW allocation
• VLIW Performance for scalar code
* - 2008, From CEVA published Whitepaper
# - Dec 2008, www.ti.com
Small, Low Power, & High Performance

Optimized for low area / low cost applications
• Less than 70,000 gates
• 0.18 mm² in 65 nm GP *

Low power
• 52 µW/MHz power consumption
• 65 nm GP, measured running AMR-NB algorithm

Very high performance
• 600 MHz in 65 nm GP **
* - After full Place and Route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option
** - After full Place and Route, when optimized for speed
Flexible and Customizable
Configure memory subsystems to exact requirements
• Up to 4 local memories
  – Instruction memory, data memory
  – RAM and ROM options
  – DMA path into these memories
• Instruction and data cache configurations
• MMU and memory region protection
• Memory port interface
• Option of dual load/store architecture
Full customization
• Instruction set extensions
• Custom I/O Interfaces
• TIE Ports, Queues and Lookup Memory interfaces
ConnX D2 DSP Engine: Summary
• Small size
• Low power
• Excellent performance on wide range of code
• Easy to use – C programming centric
  – “Out of the Box” performance
  – Reduce development time – reduced cost
  – ITU and TI C intrinsic support – large existing code base
  – Bit equivalent to TI C6x
  – Take current TI code, port and get same functionality on ConnX D2
• Flexible & customizable