Dain Rauscher Wessels


The Configurable Processor Company
Fundamental Change in MPSOC
A fifteen year outlook
Chris Rowen, President and CEO
Tensilica, Inc.
© 2003 TENSILICA INC.
Moore’s Law: Opportunity, Crisis and ROI
[Figure, panel 1: "Moore’s Law: standard cell density and speed". Density (Kgates/mm2) and ASIC clock (MHz), log scales.]
[Figure, panel 2: "Design Productivity Crisis (SRC 1997): Potential Design Complexity and Designer Productivity". Logic transistors per chip (M) versus productivity (K transistors per staff-month), log scales.]
ROI = Return / Investment = volume * (chip ASP - chip unit cost) / chip development cost
Source: ITRS 2001, Moore 1965, Tensilica
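The ROI definition on this slide can be exercised with a few lines of Python. This is a minimal sketch: the function name `roi` and every input value are hypothetical, chosen only to show how reusing one chip design across more systems raises the return without raising the investment.

```python
def roi(volume, chip_asp, chip_unit_cost, development_cost):
    """ROI = volume * (chip ASP - chip unit cost) / chip development cost."""
    if development_cost <= 0:
        raise ValueError("development_cost must be positive")
    return volume * (chip_asp - chip_unit_cost) / development_cost


# Hypothetical inputs (not from the slide), chosen only to exercise the formula.
single_design = roi(volume=1_000_000, chip_asp=20.0, chip_unit_cost=15.0,
                    development_cost=10_000_000.0)

# Same chip reused in three system designs: volume triples, development cost does not.
three_designs = roi(volume=3_000_000, chip_asp=20.0, chip_unit_cost=15.0,
                    development_cost=10_000_000.0)

print(single_design, three_designs)
```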
ROI Goal: One Design, Many Design-ins
SOC Flexibility = Cost Reduction
(Model: 100K and 1M system volumes)
[Figure: total per-unit cost (0 to 120) versus system designs per chip design (1 to 7), with one curve for system volumes of 100,000 and one for 1,000,000. Annotation: "one chip, many system designs". Example systems: low-end still camera, high-end still camera, video camcorder.]
Model assumptions: $10M design cost, $15 manufacturing cost, 5% premium for programmability
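One way to read the model behind this chart is sketched below. The three constants come from the slide; the amortization rule (design cost spread evenly over all units of all system designs sharing the chip, with the 5% premium applied to manufacturing cost) is an assumption, and `cost_per_unit` is a hypothetical name.

```python
DESIGN_COST = 10_000_000.0        # $10M design cost (from the slide)
MANUFACTURING_COST = 15.0         # $15 manufacturing cost per chip (from the slide)
PROGRAMMABILITY_PREMIUM = 0.05    # 5% premium for programmability (from the slide)


def cost_per_unit(system_designs, volume_per_system, programmable=True):
    """Total per-unit cost when one chip design serves several system designs.

    Assumption: the design cost is amortized evenly over every unit shipped,
    and the programmability premium scales the manufacturing cost.
    """
    if system_designs < 1 or volume_per_system < 1:
        raise ValueError("need at least one design and one unit")
    total_units = system_designs * volume_per_system
    unit_cost = MANUFACTURING_COST
    if programmable:
        unit_cost *= 1.0 + PROGRAMMABILITY_PREMIUM
    return DESIGN_COST / total_units + unit_cost


for designs in range(1, 8):
    low = cost_per_unit(designs, 100_000)
    high = cost_per_unit(designs, 1_000_000)
    print(f"{designs} designs: ${low:7.2f} at 100K, ${high:6.2f} at 1M")
```

Under these assumptions the amortized design cost dominates at 100K units, which is why the curve falls steeply as more system designs share the chip.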
Configurable Processor’s Role
[Figure: performance versus flexibility, positioning application-specific logic, configurable processors, and general-purpose processors.]
Configurable Processors Enable New Roles
Taking Performance to a New Level
[Figure: three bar charts of optimized EEMBC scores per MHz, annotated "~Energy efficiency".
Optimized ConsumerMarks/MHz: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000), MIPS32b (NEC VR4122).
Optimized TeleMarks/MHz: Xtensa optimized, TI C6203 optimized, Xtensa out-of-box, TI C6203 out-of-box, MIPS64 20Kc, MIPS64b (NEC VR5000), ARM1020E, MIPS32b (NEC VR4122).
Optimized NetMarks/MHz: Xtensa optimized, Xtensa out-of-box, MIPS64 20Kc, ARM1020E, MIPS64b (NEC VR5000), MIPS32b (NEC VR4122).]
Source: EEMBC
Automatic Generation of Processors
Achieves Required Performance Faster
[Figure: a designer-defined electronic specification (options shown: RISC, DSP, OCD, cache, timer, FPU) feeds the Processor Generator, which produces the hardware design and customized software. Build using any IC process.]
Design processor in one hour
Processors as the Basic Building Block
[Figure: SOC block diagram with memory and I/O, in which each function can be built as application-specific logic or as a configurable processor: CPU, signal processing (DSP), protocol processing, application accelerator, encryption, audio, and imaging.]
Flexibility is the Key to ROI
ROI = Return / Investment = volume * (chip ASP - chip unit cost) / chip development cost
• Volume: flexibility means more systems per design
• Chip ASP: programmability makes more “hot features” available
• Chip unit cost: little impact, pennies per processor
• Chip development cost: automatically generated configurable processors reduce design time, team size and re-spin risk
Example:
NEC TCP/IP Offload Engine (TOE) Platform
[Figure: NEC TOE block diagram with 8 parallel Xtensa processors, blocks labeled 200 MHz, and two MACs serving two Gigabit Ethernet ports.]
The NEC TOE achieves full wire speed using ten Tensilica cores (eight parallel processing cores plus two for management and dispatch), targeting high-performance IP-based network storage: NAS and IP-SAN.
Implications of Multiprocessor SOC
Designers will routinely “waste” processors to get other benefits
Greater speed to market and certainty of success
Higher abstraction in design
Tremendous creativity and diversity in on-chip communications
Topologies: buses, hierarchies of buses, cross-bars, systolic arrays, pipelines
New issues and methods: reliability, redundancy, asynchrony, QOS
Programming models for large numbers of tasks: finding parallelism
Software languages displace hardware languages
C/C++, not Verilog, VHDL, SystemVerilog, etc.
Changing demographics of complex SOC design
Broader population of engineers and programmers capable of SOC design
Unified hardware-software design user interface “cockpit”
New Types of Processors
Exploiting Latent Parallelism
• Very small processors
• Modest extensions
• High task-level parallelism

• High-performance processors
• VLIW, SIMD and application-specific extensions
• High data- and instruction-level parallelism
[Figure: "Multiple processors vs. multiple-issue in network processors". Number of processors (0 to 128) versus operations per cycle (0 to 20), with reference lines labeled 8, 16 and 64 instrs/cycle. Chips plotted: EZchip NP-1, Cisco PXF, IBM PowerNP, Lexra NetVortex, Motorola C-5, Cognigine RCU/RSF, Xelerated, Vitesse IQ2x00, Alchemy, Intel IXP1200, Mindspeed CX27470, AMCC np7120, Agere PayloadPlus, BRECIS, Broadcom BCM1250, Clearwater CNP810.]
Source: K. Keutzer, UCB
Projected Processor Speed and Density
                        2009       2016
Geometry                50nm       22nm
Clock                   1.8GHz     5.7GHz
Small proc area (mm2)   0.08       0.016
Small proc/chip         240        1400
High perf proc/chip     10         15
MIPS/chip               600,000    11,000,000
40mm2 die size for consumer SOC
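The projection can be sanity-checked with simple arithmetic. The sketch below assumes ideal area scaling with the square of the feature size (an assumption, not something the slide states) and uses only figures from the table; `scaled_area` and `die_fraction` are hypothetical helper names.

```python
DIE_MM2 = 40.0  # die size for a consumer SOC (from the slide)

# Figures copied from the projection table.
PROJECTION = {
    2009: {"geometry_nm": 50, "small_area_mm2": 0.08, "small_procs": 240},
    2016: {"geometry_nm": 22, "small_area_mm2": 0.016, "small_procs": 1400},
}


def scaled_area(area_mm2, from_nm, to_nm):
    """Ideal shrink: area scales with the square of the feature size."""
    return area_mm2 * (to_nm / from_nm) ** 2


def die_fraction(year):
    """Share of the die occupied by the small processors in a given year."""
    row = PROJECTION[year]
    return row["small_procs"] * row["small_area_mm2"] / DIE_MM2


predicted_2016_area = scaled_area(0.08, from_nm=50, to_nm=22)
print(f"predicted small-processor area at 22nm: {predicted_2016_area:.4f} mm2")
for year in sorted(PROJECTION):
    print(f"{year}: small processors use {die_fraction(year):.0%} of the die")
```

The ideal-shrink estimate lands close to the 0.016 mm2 in the table, and in both years the small processors occupy roughly half of the 40 mm2 die.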
The Law of SOC Processor Scaling
[Figure, two panels covering 2001 to 2015 on log scales.
Processors Per Chip: processors/chip grow up to 30% per year.
Aggregate SOC Performance (billions of operations/second): total MIPS grows 65% per year.]
Tensilica model based on ITRS 2001, 140 mm2 die size
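To get a feel for what these annual rates imply over the 2001 to 2015 span of the chart, a compound-growth sketch helps. The rates and the year range are from the slide; treating them as constant every year is an assumption, and `compound` is a hypothetical helper.

```python
def compound(start, annual_rate, years):
    """Value after `years` of constant compound growth at `annual_rate`."""
    return start * (1.0 + annual_rate) ** years


YEARS = 2015 - 2001  # span of the chart's x axis

processor_growth = compound(1.0, 0.30, YEARS)  # processors per chip, up to 30%/year
mips_growth = compound(1.0, 0.65, YEARS)       # total MIPS, 65%/year

print(f"processors/chip multiply by about {processor_growth:.0f}x")
print(f"total MIPS multiplies by about {mips_growth:.0f}x")
```

The gap between the two factors reflects per-processor performance growing alongside processor count.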
Enablers for Large Scale MPSOC
What is Tensilica Working On?
Performance
Throughput and efficiency mean more opportunities for application-specific processors over RTL
New processor interfaces enable greater parallelism
Insight
Unified hardware-software development environment
Performance and cost-oriented analysis
Automation
Automatic generation of compilers, RTOS, MP models
“Hands-free” instruction set optimization
Performance:
FLIX™
FLIX = Flexible Length Instruction Xtensions
FLIX freely intermixes 16-, 24-, and 64-bit instructions
No code-bloat
No modes
Full backwards code compatibility with current Xtensa ISA
Long instructions implement complex extensions
Fast and parallel code when needed, else very compact code
Arbitrary Instruction Field Specification
Multiple independent operations packed into a wide instruction word
Multiple Load / Store Units
Minimal Overhead
~5000 gates added control logic
[Figure: instruction packing in memory, with 16-, 24- and 64-bit instructions placed back to back (little endian shown).]
Performance:
Pushing to new levels of throughput
256pt FFT (Radix-4)
Configuration                                                              Cycles
Simple RISC task engine: minimal configuration Xtensa processor (18K gates) 155,389
Scalar performance: base Xtensa processor with MUL32 option                23,633
SIMD performance: Xtensa processor with 4-way SIMD Vectra DSP engine       3,055
FLIX performance: Conexant Testarossa DSP with 4-way SIMD and FLIX         1,063
FLIX: Average of 6% larger code on complex code sets
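The cycle counts translate into relative speedups as sketched below. The four counts are from the slide; pairing 155,389 cycles with the minimal 18K-gate configuration is an assumption drawn from the slide layout, and the short configuration labels are hypothetical.

```python
# Cycle counts for the 256-point radix-4 FFT, as listed on the slide.
CYCLES = {
    "minimal Xtensa (18K gates)": 155_389,  # assumed pairing, see lead-in
    "base Xtensa + MUL32": 23_633,
    "4-way SIMD Vectra": 3_055,
    "Testarossa SIMD + FLIX": 1_063,
}

baseline = CYCLES["minimal Xtensa (18K gates)"]
speedups = {name: baseline / cycles for name, cycles in CYCLES.items()}

for name, factor in speedups.items():
    print(f"{name:28s} {CYCLES[name]:>8,d} cycles  {factor:6.1f}x")

# Step from SIMD alone to SIMD plus FLIX.
flix_over_simd = CYCLES["4-way SIMD Vectra"] / CYCLES["Testarossa SIMD + FLIX"]
print(f"FLIX over SIMD alone: {flix_over_simd:.2f}x")
```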
Performance:
A Complex FLIX Example
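64-bit instruction word, fields listed from bit 63 down to bit 0:
Field     Bits     Width
MemA      63:59    5
InQ       58       1
WrtA      57:53    5
ExA/B     52:37    16
ExC/D     36:25    12
ExE       24:19    6
WrtB      18:14    5
OutQ      13:9     5
MemB      8:4      5
1110      3:0      4
• 9 independent operation fields
• Multiple load/store
• Input/output queues
[Figure: datapath with two 4K x 256 RAMs (each with an Addr input), an input queue (In-Q), an output queue (Out-Q), a 16 x 256 register file with write ports WrtA and WrtB, and execution units A through E.]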
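The field layout of this 64-bit word can be modeled with a small pack/unpack sketch. The field names and widths follow the slide's encoding diagram as read here (an assumption about how the extracted labels pair up); `pack` and `unpack` are hypothetical helpers, not part of any Tensilica tool.

```python
# (name, width) pairs from bit 63 down to bit 0, as read from the slide.
FIELDS = [
    ("MemA", 5), ("InQ", 1), ("WrtA", 5), ("ExA/B", 16), ("ExC/D", 12),
    ("ExE", 6), ("WrtB", 5), ("OutQ", 5), ("MemB", 5), ("format", 4),
]
assert sum(width for _, width in FIELDS) == 64


def pack(values):
    """Build a 64-bit word from a dict of field values (missing fields are 0)."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 64-bit word into its fields, most significant field first."""
    fields = {}
    shift = 64
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


word = pack({"MemA": 3, "ExA/B": 0xBEEF, "OutQ": 17, "format": 0b1110})
decoded = unpack(word)
print(f"{word:#018x}", decoded)
```

Keeping the layout in one table means the same list drives both directions, so a width change cannot leave `pack` and `unpack` out of step.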
Performance:
Writing TIE for FLIX
length l64 64 { InstBuf[3:0] == 14 }
format flix64 l64
slot slot0 flix64[*]
slot slot1 flix64[*]
slot slot1 flix64[*]
opcode L32I slot0
opcode S32I slot0
opcode ADD slot0
opcode NOP slot0
opcode ADD slot1
opcode ADDI slot1
opcode SUB slot1
opcode NOP slot1
All components of the processor solution are generated automatically from the TIE code in <2 hours:
• RTL & HW flow scripts
• Toolchain
• System models
• Operating System support
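The `length` rule in the TIE code says a 64-bit instruction is recognized when the low four bits of the instruction buffer equal 14. The sketch below shows how a fetch unit could use that to walk a mixed instruction stream. Only the value 14 is from the slide; the 16-bit and 24-bit ranges are an assumption about the base Xtensa encoding, the value 15 is rejected because the slide does not cover it, and both function names are hypothetical.

```python
def instruction_length(first_byte):
    """Instruction length in bytes, decided by InstBuf[3:0] (little endian)."""
    op0 = first_byte & 0xF
    if op0 == 14:
        return 8   # 64-bit FLIX format: the TIE rule InstBuf[3:0] == 14
    if 8 <= op0 <= 13:
        return 2   # assumption: 16-bit encodings of the base ISA
    if op0 <= 7:
        return 3   # assumption: 24-bit encodings of the base ISA
    raise ValueError("op0 == 15 is not covered by this sketch")


def split_instructions(code):
    """Split a byte string into instructions using only the first nibble."""
    instructions = []
    position = 0
    while position < len(code):
        length = instruction_length(code[position])
        if position + length > len(code):
            raise ValueError("truncated instruction at end of stream")
        instructions.append(code[position:position + length])
        position += length
    return instructions


# One 64-bit, one 24-bit and one 16-bit instruction, back to back.
stream = bytes([0x0E] + [0] * 7) + bytes([0x00, 0x00, 0x00]) + bytes([0x08, 0x00])
lengths = [len(instruction) for instruction in split_instructions(stream)]
print(lengths)
```

Because the length is known from the first nibble alone, no mode bit is needed to switch between compact and wide code.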
Performance:
Conexant DSP Architectural Requirements
VLIW-SIMD programming model
16- and 24-bit scalar instructions
64-bit instructions with multiple operations
2 or 4 16x16 MAC units
6R/3W Conexant-defined register file
At least two load/store units
7-stage pipe with 2 cycles for I/D memory access
Stall on memory bank conflicts
Backward compatibility with previous Conexant DSPs via Translation
Instruction Set (a sub-operation of the 64-bit instructions)
Performance:
Conexant Testarossa Encoding
64-bit instruction word:
Field         Bits     Width
ALU           63:46    18
MAC           45:28    18
Load/Store    27:4     24
1110          3:0      4
Operation groups shown beneath the slots: ALU, Complex Multiply, Shift, Real Multiply, 2nd Load/Store, Select, Testarossa Load/Store, Vector Load/Store, Scalar Load/Store, Unaligned Load/Store
Xtensa Core Instructions: Load/Store, Branch, ALU
Operation counts shown: 52 operations, 24 operations, 234 operations
Insight:
The Multiple Core SOC Design Problem
Three Skill Sets, Three Environments?
Software Development Environment
Environment:
• C code development
• Debugging
• C project management
• Code profiling, tuning
Tools:
• GNU-based Tensilica software development tools
• Xtensa C/C++ compiler
• Xtensa Instruction Set Simulator
• Command line interface
• Partner-provided software IDEs (WindRiver, ATI/Mentor, MontaVista)

Processor Optimization Environment
Environment:
• TIE code development for extensions
• Configuration option management
Tools:
• Web-based Xtensa Processor Generator
• TIE Compiler: single source TIE file for processor extension
• Command line interface
• Web browser interface

SOC System Architecture Exploration
Environment:
• System modeling and simulation
• Multiple core debug
Tools:
• Xtensa Modeling Protocol (XTMP)
• Bus functional models for co-simulation / co-verification EDA tools
• Command line interface
• EDA partners' system analysis / debug environments
Insight:
Xtensa Xplorer
Software
Development
Environment
• C code development
• Debugging
• C project management
• Code profiling, tuning
Processor
Optimization
Environment
• TIE code development
for extensions
• Configuration option
management
SOC System
Architecture
Exploration
• System modeling and
simulation
• Multiple core debug
Insight:
Develop and Manage Processor Configurations
Interactive TIE Editor
• language-sensitive editing and help
Manage complexity of growing variety of processor optimization choices
Software and processor optimization within same IDE
Interactive display of instruction…
• operands
• pipelining
• semantics
Gate count estimate:
• per instruction
• per register file
• per user state
Insight:
Create, Analyze & Tune ISA Extensions (TIE)
Profile and visualize performance impact of custom instructions
Pipeline Viewer shows instruction flow of disassembled code
Highlight instructions with variable latency (e.g. cache misses)
Static analysis of pipeline stalls pinpoints areas for fine tuning
Interlocks on deep TIE pipelines fully modeled and explained
Insight:
Analyze and Select Caches to Meet Speed/Area Goals
Automatically profile code across
range of cache configuration options
Performance charts visually
compare different configurations
Insight:
Chip-level Software and Simulation for MPSOC
Manage system memory maps & link/load for multiple-core SOCs
Develop, run and debug multiple-core simulations using Xtensa Modeling Protocol (XTMP)
Auto-generated XTMP model
based on memory maps
• Specify chip-level memory maps
for shared/private memories
• Place interrupt and reset vectors
• Assign code/data to distributed
memories
Automation:
The Next Generation
[Figure: application source code in C (the slide shows a fragment of main() with loops over a short array c[100]) feeds the NEW automation tool, which produces the electronic specification for the Xtensa Processor Generator. The generator outputs a complete hardware design (ALU, DSP, cache, timer, register file, OCD, FPU) for any fab, plus customized software tools.]
Automation:
Goals for Processor Extension
Flexibility
Application code might be written/modified after tape-out
Generated TIE must be sufficiently general purpose so that small changes
to application code do not degrade performance
Control
Full automation
C/C++ in -> TIE out
C/C++ + generated TIE in -> binary code out
Optional full control by user
Guide the tool and/or select instructions
Add to or change generated TIE
Tune application to better take advantage of TIE
Speed: minutes, not days
Automation:
Basic Operation - Fusion
Original C Code
int *a, *b, *c;
for (int i=0; i<n; i++)
    c[i] = (a[i] + b[i]) >> 2;
Complete TIE Code
operation add_shift (out AR c, in AR a, in AR b) {
    wire t[31:0] = a+b;
    assign c = {{2{t[31]}}, t[31:2]};
}
[Figure: dataflow graph in which + feeds >> 2.]
Combined add-shift operator automatically used wherever equivalent expression occurs in source
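The fused operation can be checked against the C expression with a bit-level model. This sketch assumes the intended semantics are a 32-bit wrapping add followed by an arithmetic right shift by 2, as the C source specifies for signed int on typical compilers; `add_shift` mirrors the TIE operation's name, while `to_signed32` and `reference` are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add_shift(a, b):
    """Bit-level model: t = a + b, c = {two copies of t[31], t[31:2]}."""
    t = (a + b) & MASK32                 # 32-bit wrapping add
    sign = (t >> 31) & 1                 # t[31]
    upper = 0b11 if sign else 0b00       # sign bit replicated twice
    c = (upper << 30) | (t >> 2)         # concatenate with t[31:2]
    return to_signed32(c)


def reference(a, b):
    """What the C loop body computes: (a + b) >> 2 on 32-bit signed ints."""
    return to_signed32(a + b) >> 2       # Python >> on ints is arithmetic


samples = [(5, 7), (-5, -7), (-1, 0), (2**31 - 1, 1), (-2**31, -1), (123456, -654321)]
results = [(a, b, add_shift(a, b)) for a, b in samples]
for a, b, c in results:
    print(f"add_shift({a}, {b}) = {c}")
```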
Automation:
Basic Operation - Multiple Ops in FLIX
64 Bit Instruction with 3 Slots (S0, S1, S2)
Original C Code
for (int i=0; i<n; i++)
    c[i] = (a[i] + b[i]) >> 2;
Complete TIE Code
length l 64 { InstBuf[3:0] == 14 }
format f l
slot slot0 f[*]
slot slot1 f[*]
slot slot2 f[*]
Operations per slot: slot0 holds ADDI, NOP; slot1 holds ADD, SRAI, NOP; slot2 holds L32I, S32I, NOP
Generated Assembly
loop:
{addi a9,a9,4;   add a12,a10,a8;  l32i a8,a9,0}
{addi a11,a11,4; srai a12,a12,2;  l32i a10,a11,0}
{addi a13,a13,4; nop;             s32i a12,a13,0}
• Original C compiled to 3 cycles/iteration
Automation:
Basic Operation - SIMD/Vector
Original C Code
short *a, *b, *c;
for (int i=0; i<n; i++)
    c[i] = a[i] + b[i];
Complete TIE Code
regfile vec 64 16 v;
operation add16x4(out vec c, in vec a, in vec b) {
    assign c = {a[63:48]+b[63:48],
                a[47:32]+b[47:32],
                a[31:16]+b[31:16],
                a[15:0]+b[15:0]};
}
[Figure: vectors a and b added element-wise into c.]
Four iterations in parallel
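The `add16x4` operation can be modeled on plain integers to show what "four iterations in parallel" means at the bit level. This is a sketch: each 64-bit value holds four independent 16-bit lanes, sums wrap within a lane, and `pack4` and `unpack4` are hypothetical helpers for building test vectors.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def add16x4(a, b):
    """Add four 16-bit lanes of two 64-bit values; no carry crosses a lane."""
    c = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        c |= (lane_sum & LANE_MASK) << shift   # wrap the sum inside the lane
    return c


def pack4(values):
    """Pack four 16-bit values, element 0 in bits 15:0."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack4(word):
    """Inverse of pack4."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


total = unpack4(add16x4(pack4([1, 2, 3, 4]), pack4([10, 20, 30, 40])))
wrapped = unpack4(add16x4(pack4([0xFFFF, 0, 0, 0]), pack4([1, 0, 0, 0])))
print(total, wrapped)
```

The second case shows the property that makes lane-wise addition differ from a single 64-bit add: the overflow in lane 0 does not spill into lane 1.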
Automation:
Processor Extension Step 1
Compile the C/C++ application code
Designer specifies compiler optimization flag
Compiler generates comments to help user tune code
Optimized code yields better results
Compiler generates information from application
Feedback optimization ranks code regions by frequency
Vectorizer determines which loops can be vectorized
Fuser generates dataflow graphs for important regions
Operation counts for each type of opcode for every region
Automation:
Processor Extension Step 2
Generated information used to select and generate TIE:
For each code region, generate many potential sets of TIE
extensions (configurations)
Vectorize by 1, 2, 4, 8
Add FLIX functional units
Add fusions
Generation guided by estimated performance
Evaluate all generated configurations across all regions
Find best set of merged configurations given budget
Automation:
Processor Extension Step 3
Use the TIE with a C/C++ or assembly application
Compiler reads TIE (automatically or manually generated) and
generates code
FLIX slot/format TIE specification mapped to resource tables
Generalized graph matcher generates dataflow graphs from TIE
Vectorizer vectorizes a loop and checks whether all required operations are available in TIE
User free to tune the code in ANSI C/C++ or assembly
Simulator, assembler, debugger, RTOS support generated
directly from TIE
Automation:
Example: “Sum-of-Absolute Differences” Search
Generated Configuration Parameters
Config  Speedup  Gates Added (K)  SIMD Factor  FLIX Width (Slots)  Load/Store Units  Fusion
1       8.7x     74               8            3                   2                 Yes
2       8.1x     57               8            2                   2                 Yes
3       7.6x     46               4            3                   2                 Yes
4       7.6x     37               8            2                   1                 Yes
5       6.8x     33               4            2                   2                 Yes
6       6.8x     26               8            1                   1                 Yes
7       6.1x     18               4            2                   1                 Yes
8       5.1x     12               4            1                   1                 Yes
9       4.3x     8                2            2                   1                 Yes
10      3.4x     5                2            1                   1                 Yes
11      1.4x     0.3              1            1                   1                 Yes
[Figure: chart of the numbered configurations.]
Wide range of choices of
performance increase versus hardware cost
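Choosing among these configurations is a small cost/benefit search. The sketch below uses the speedup and gate figures from the table and shows two selections a designer might make: the fastest configuration within a gate budget, and the set of configurations not dominated by a cheaper, equally fast one. The function names are hypothetical, and this is not a description of how the Tensilica tool itself selects.

```python
# (config number, speedup, gates added in K), copied from the slide's table.
CONFIGS = [
    (1, 8.7, 74.0), (2, 8.1, 57.0), (3, 7.6, 46.0), (4, 7.6, 37.0),
    (5, 6.8, 33.0), (6, 6.8, 26.0), (7, 6.1, 18.0), (8, 5.1, 12.0),
    (9, 4.3, 8.0), (10, 3.4, 5.0), (11, 1.4, 0.3),
]


def best_under_budget(budget_kgates):
    """Fastest configuration within the budget; ties go to fewer gates."""
    affordable = [c for c in CONFIGS if c[2] <= budget_kgates]
    if not affordable:
        return None
    return max(affordable, key=lambda c: (c[1], -c[2]))


def pareto_front():
    """Configurations for which no other is at least as fast and as small."""
    front = []
    for number, speedup, gates in CONFIGS:
        dominated = any(
            other_speedup >= speedup and other_gates <= gates
            and (other_speedup > speedup or other_gates < gates)
            for _, other_speedup, other_gates in CONFIGS
        )
        if not dominated:
            front.append(number)
    return front


choice_40k = best_under_budget(40.0)
choice_20k = best_under_budget(20.0)
front = pareto_front()
print(choice_40k, choice_20k, front)
```

Configurations 3 and 5 drop out of the front because configurations 4 and 6 reach the same speedup with fewer added gates.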
Automation:
Application Examples
Application                  Speedup  Original Code Size     Code Size After  Code Size on MIPS32  Configurations  Run Time to Generate
                                      (Before Acceleration)  Acceleration     (using gcc -O2)      Visited         Configurations
Radix-4 FFT                  10.6x    1.5 KB                 3.6 KB           4.4 KB               175,796         3 minutes
GSM Encoder                  3.9x     17 KB                  20 KB            38 KB                576,722         15 minutes
GSM Encoder (using FFT TIE)  1.8x     17 KB                  19 KB            38 KB                N/A             N/A
MPEG4 Encoder                3.3x     111 KB                 136 KB           356 KB               1,340,312       30 minutes
Conclusion
MPSOC represents a new medium of implementation:
Opportunity: Cost, power, bandwidth potential of semiconductors
Challenge: Return on investment for design of complex chips
The transition to MPSOC will drive…
…new parallel architectures (focus becomes interconnect not ISA)
…shift from hardwired design to programmable design
…new class of hardware/software environments for processor and
SOC generation, integration and use
…rapid growth in processor counts and aggregate performance
Important historical parallel between the integrated circuit (many small transistors per chip) and MPSOC (many small processors per chip)
Key Research Directions
Tool environments for identification/exploitation of latent parallelism
Unified programming model for MP
Technical and economic tools for optimizing efficiency vs. flexibility (spectrum of early vs. late binding)
Application-specific interconnect topologies and generators
Role for hardware-centric programmability (FPGA) vs. software-centric programmability (processor)
Vision: set of communicating tasks + chip interface specification + performance constraints -> set of program binaries + chip GDSII
Profile-based automation:
Generation of ISA
Assignment of tasks to processors [1 -> n, n -> 1; static vs. dynamic allocation]
Profile-based implementation of messaging mechanism and physical interconnect
Memory configuration, memory map, shared code and data section allocation
University Program: Free license to tools and models for MPSOC design using
extensible processors:
Steve Roddy: [email protected]