Survey of Digital Signal Processors


Survey of Digital Signal Processors
Michael Warner
ECD: VLSI Communication Systems
Agenda
 Industry Trends
 DSP Architecture
 DSP Micro-Architecture
 DSP Systems
Moore’s Law Drives
Processor Development
[Chart: transistors per die vs. year, 1960-2010, log scale from 10^0 to 10^10 — Moore's 1965 data plus microprocessors from the 4004, 8008, 8080, 8086, 80286, 386, 486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium, and Itanium 2. Source: Intel internal]
Doubling the number of transistors every 18-24 months at the same price point drives significant product opportunities… especially if you have little regard for power. But what if energy-delay had to be reduced by an order of magnitude every generation?
Gene’s Law Drives
DSP Development
[Chart: Gene's Law — DSP power (mW/MIPS) vs. year, 1982-2008, log scale from 0.00001 to 1,000]
Gene’s Law will have its challenges to hold the line!
What’s Driving Gene’s Law?
Digital Audio
 MP3
 Real Audio
Streaming Video
 MPEG 4
 H.263
Connectivity
 Internet
 Bluetooth
Modem Standards
 UMTS
 GSM
DSP Design Constraints

DEVICE CAPABILITIES     1982      1992      2002
Technology (µm)         3         0.8       0.1
Transistors             50K       500K      180M
MIPS                    5         40        5,000
RAM (bytes)             256       2K        3M
Power (mW/MIPS)         250       12.5      0.1
Price/MIPS              $30.00    $0.38     $0.02
Agenda
 Industry Trends
 DSP Architecture
 DSP Micro-Architecture
 DSP Systems
What Makes a DSP a DSP?

Hard Real-Time:
 Single-Cycle MAC
 Multiple Execution Units
 Custom Data Path
 High Bandwidth (Flat) Memory Sub-Systems
 Dual Access Memory
 No Speculation
 Efficient Zero-Overhead Looping
 Short Pipeline
 High Bandwidth I/O
 Specialized Instruction Sets
 Low Latency Interrupts
 Sophisticated DMA
 RTOS

Soft Real-Time (Application Processor):
 Single-Cycle MAC
 Multiple Execution Units
 Custom Data Path
 L1D$, L1I$, L2$ with MMU
 Speculative Fetching and Branching
 Virtual Memory
 Protected Memory
 Virtual Machines
 Semaphores
 Context Save and Restore
 Threading: SMT, IMT
 Efficient Zero-Overhead Looping
 Short Pipeline
 High Bandwidth I/O
 Specialized Instruction Sets
 Low Latency Interrupts
 Sophisticated DMA
 O/S
Single Cycle MAC
 MACs Typically Determine DSP Performance and Pipeline Length (EX)
 Most DSPs Have 2-8 MAC Units
 MACs Typically Operate in Both Scalar and Vector Modes
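The MAC's role is easiest to see in the FIR filter kernel, where each tap is exactly one multiply-accumulate. A minimal C sketch (function and variable names are illustrative, not from the slides); a DSP with a single-cycle MAC retires one tap per cycle per MAC unit:

```c
#include <stdint.h>

/* FIR inner loop: one multiply-accumulate per tap.
   The 32-bit accumulator gives headroom for 16x16-bit products. */
int32_t fir_mac(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc = 0;
    for (int i = 0; i < ntaps; i++)
        acc += (int32_t)x[i] * h[i];   /* maps to a single MAC instruction */
    return acc;
}
```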
Multiple Instruction Units
 VLIW Architectures Driving ILP
 Typical Instruction Units:
   M-Unit – MAC
   S-Unit – Shift
   L-Unit – ALU
   D-Unit – Load/Store
 Industry Has Converged on an ILP of ~8
[Diagram: C6x data path — register files A0-A15 and B0-B15, cross paths 1X/2X, functional units L1/S1/M1/D1 and L2/S2/M2/D2 with src1 (S1), src2 (S2), dst (D), and long (SL/DL) ports, and load-data buses DDATA_I1/DDATA_I2]
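One way a compiler exposes this ILP is by splitting a dot product's single dependence chain into independent accumulators, so multiplies can issue on both M-units in the same cycle. A hedged C sketch (names are illustrative, not from the slides):

```c
#include <stdint.h>

/* Dual-accumulator dot product: the two chains are independent,
   so a VLIW scheduler can pair the MACs across M1 and M2. */
int32_t dot_dual_acc(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += (int32_t)x[i]     * h[i];     /* chain 0 */
        acc1 += (int32_t)x[i + 1] * h[i + 1]; /* chain 1 */
    }
    if (i < n)
        acc0 += (int32_t)x[i] * h[i];         /* odd-length tail */
    return acc0 + acc1;
}
```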
High Bandwidth
Memory Sub-Systems
 Multiple Load-Store Units Required to Feed Data Path
 Tightly Coupled Memory is Typically Dual Ported
 Harvard Architecture is Heavily Banked
[Diagram: classic single-MAC DSP core — muxes select between internal and external memory, feeding a central arithmetic logic unit (MAC with A/B accumulators, ALU, shifter); a four-stage P-D-C-E pipeline with PC, auxiliary registers (ARs), and control logic]
Specialized Instruction Sets
 Base RISC ISA Plus CISC ISA Driven by End Application:
   MAC
   SAD
   LMS
   FIRS
   Viterbi
 Support For Both Scalar and Vector Instructions
 Support For 8, 16 and 32-Bit Instructions
 Instructions Are Highly Orthogonal
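As a concrete example, a SAD (sum of absolute differences) instruction collapses the following plain-C reference loop, used for motion estimation in video codecs, into one or a few cycles. This sketch is an illustrative reference, not TI's implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/* Reference sum-of-absolute-differences over two pixel blocks. */
uint32_t sad_ref(const uint8_t *a, const uint8_t *b, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}
```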
Scalar (55x) vs VLIW (64x)
 Scalar DSPs Tend to Be More CISC-Like:
   Hurts Compiler Performance
   Improves Energy-Delay
   Improves Code Density
   Limits Top-End Performance
 VLIW DSPs Tend to Be More RISC-Like:
   RISC + GP Regs + Orthogonality Makes For a Good C Compiler
   Assembler Code Is Challenging
   RISC ISA Allows For Higher Frequencies
   Load-Store Hurts Energy-Delay
TMS320C54x
TMS320C54x Protected Pipeline
CYCLES →   1   2   3   4   5   6   7   8   9  10  11
Instr 1:   P1  F1  D1  A1  R1  X1
Instr 2:       P2  F2  D2  A2  R2  X2
Instr 3:           P3  F3  D3  A3  R3  X3
Instr 4:               P4  F4  D4  A4  R4  X4
Instr 5:                   P5  F5  D5  A5  R5  X5
Instr 6:                       P6  F6  D6  A6  R6  X6
(fully loaded pipeline)

Prefetch: Calculate address of instruction
Fetch: Collect instruction
Decode: Interpret instruction
Access: Calculate address of operand
Read: Collect operand
Execute: Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
TMS320C6xx
’C6xx CPU Core
[Diagram: program fetch, instruction dispatch, and instruction decode feeding two data paths. Data Path 1 holds the A register file and units L1/S1/M1/D1; Data Path 2 holds the B register file and units D2/M2/S2/L2 (arithmetic logic unit, auxiliary logic unit, multiplier unit, load/store). Surrounding blocks: control registers, control logic, test, emulation, and interrupts]
TMS320C6xx Exposed Pipeline

Fetch: PG PS PW PR | Decode: DP DC | Execute: E1 E2 E3 E4 E5

 Fetch
   PG: Program Address Generate
   PS: Program Address Send
   PW: Program Access Ready Wait
   PR: Program Fetch Packet Receive
 Decode
   DP: Instruction Dispatch
   DC: Instruction Decode
 Execute
   E1-E5: Execute 1 through Execute 5

Execute Packet 1: PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2:    PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3:       PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4:          PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5:             PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6:                PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7:                   PG PS PW PR DP DC E1 E2 E3 E4 E5

Note: Exposed Pipeline Adds Risk to Programming Model
Agenda
 Industry Trends
 DSP Architecture
 DSP Micro-Architecture
 DSP Systems
Micro-Architectural Challenges
 Accessing (Flat) On-Chip Memory at Speed (Within 2-3 Cycles)
 Feeding Multiple Functional Units From a Single Register File
 Running 600 MHz+ With a 7-9 Stage Pipeline
 Linking Multiple Functional Units With Result Forwarding
 Implementing CISC Data Path to Meet Area and Performance Goals
 Achieving ARM-Like Code Density
Agenda
 Industry Trends
 DSP Architecture
 DSP Micro-Architecture
 DSP Systems
DSP Systems
[Slide: product map positioning example devices by performance and application space]
 Wireless Infrastructure — TMS320C6416: 600 MHz DSP CPU, Viterbi and Turbo hardware accelerators, 3MB integrated memory, 180M transistors
 Wired Infrastructure — TMS320C5561: 6 DSP CPUs @ 300 MHz, 24Mb integrated memory
 Wireless Client — OMAP5910: DSP+GPP, low power consumption; voice, data, video
 Digital Still Camera — TMS320DM310: DSP+GPP, imaging accelerators
 Audio — TMS320DA610: 225 MHz floating point
VIOP Platform
 TNETV3010 Features:
   6 C55x DSPs @ 300 MHz
   Shared Instruction Memory
   Broadcast DMA
   24 Mbits of On-Chip SRAM
OMAP Platform
 OMAP2420 Features:
   ARM 1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
   TMS320C55x DSP @ 220 MHz
   2D/3D graphics accelerator
   IVA (Imaging & Video Accelerator): supports still images to >4 Mpixels, 30 fps VGA video decode
   Output to TV for gaming and video playback
   Encryption hardware for DRM and security
[Diagram: OMAP2420 block diagram — ARM11 + VFP, TMS320C55x DSP, 2D/3D graphics accelerator, IVA, internal SRAM, L3 and L4 interconnects, memory controller, peripherals, security block, camera I/F, LCD I/F, video out]
IBM Cell Architecture
Design Features:
 Multi-Core Architecture
 Based on the Power Architecture (code compatibility)
 Coherent and Cooperative Off-Load Processing
 Enhanced SIMD Architecture
 Improved Power Efficiency
 “Absolute Timers” Allow “Hard” Real-Time Data Processing (good estimation of execution time is possible)
 Big-Endian Memory (supports Apple, but not Intel)
 Isolation Mechanism for Secure Code Execution
 FlexIO External Interface
DSP Architecture
SPE (Synergistic Processing Element):
 Dual-Issue, 128-Bit 4-Way SIMD
 Vector Processing
 4 Integer Units + 4 FP Units
 8-, 16-, 32-Bit Integer + 32-, 64-Bit FP
 128 x 128-Bit Registers
 256KB Local-Store Memory (specially designed)
 Caches Are Not Used: Data & Instructions Live in the LS
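The SPE's 128-bit 4-way SIMD can be pictured as operating on four 32-bit lanes at once. A portable C sketch of one such vector add (the struct and function names are illustrative stand-ins, not the SPE's actual intrinsics):

```c
/* Four 32-bit float lanes standing in for one 128-bit register. */
typedef struct { float v[4]; } vec4f;

/* One SIMD add: the SPE would do all four lanes in a single instruction. */
vec4f vadd4f(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}
```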