Survey of Digital Signal Processors
Michael Warner
ECD: VLSI Communication Systems
Agenda
Industry Trends
DSP Architecture
DSP Micro-Architecture
DSP Systems
Moore’s Law Drives Processor Development
[Figure: transistors per die (log scale, 10^0 to 10^10) versus year (’60 to ’10), plotting Moore’s 1965 data and the microprocessors 4004, 8008, 8080, 8086, 80286, 386™, 486™, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium® and Itanium® 2.]
Source: Intel internal
Doubling the number of transistors every 18-24 months at the same price point drives significant product opportunities
…especially if you have little regard for power
But what if energy-delay had to be reduced by an order of magnitude every generation?
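To make the doubling claim concrete, here is a minimal sketch of the growth it implies over a decade. It assumes the "18-24" on the slide refers to months; the function name `growth_factor` is illustrative and not from the slides.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Compare the two ends of the quoted 18-24 month range over ten years.
for months in (18, 24):
    factor = growth_factor(10, months)
    print(f"doubling every {months} months: about {factor:.0f}x in 10 years")
```

The gap between the two ends of the range compounds quickly, which is why the exact doubling period matters when projecting a decade ahead.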
Gene’s Law Drives DSP Development
[Figure: DSP power in mW/MIPS (log scale, 1,000 down to 0.00001) versus year (1982 to 2008), following the Gene’s Law trend line.]
Gene’s Law will have its challenges to hold the line!
What’s Driving Gene’s Law?
Digital Audio
MP3
Real Audio
Streaming Video
MPEG 4
H.263
Connectivity
Internet
Bluetooth
Modem Standards
UMTS
GSM
DSP Design Constraints
DEVICE CAPABILITIES    1982      1992      2002
Technology (µm)        3         0.8       0.1
Transistors            50K       500K      180M
MIPS                   5         40        5,000
RAM (bytes)            256       2K        3M
Power (mW/MIPS)        250       12.5      0.1
Price/MIPS             $30.00    $0.38     $0.02
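The power row of the table can be turned into an implied halving period for mW/MIPS. The sketch below uses only the three table values (250, 12.5 and 0.1 mW/MIPS) and assumes a smooth exponential trend between data points, which the slides do not state; the helper name is illustrative.

```python
import math

# mW/MIPS figures taken from the DSP Design Constraints table.
POWER_MW_PER_MIPS = {1982: 250.0, 1992: 12.5, 2002: 0.1}


def halving_time_years(start_year: int, end_year: int) -> float:
    """Years needed to halve mW/MIPS, assuming exponential decay between two table entries."""
    ratio = POWER_MW_PER_MIPS[start_year] / POWER_MW_PER_MIPS[end_year]
    return (end_year - start_year) * math.log(2) / math.log(ratio)


for span in ((1982, 1992), (1992, 2002), (1982, 2002)):
    print(span, f"{halving_time_years(*span):.2f} years per halving")
```

Under that assumption, the 1982 to 2002 span works out to a halving roughly every 1.8 years, and the second decade improves faster than the first.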
Agenda
Industry Trends
DSP Architecture
DSP Micro-Architecture
DSP Systems
What Makes a DSP a DSP?
Hard Real-Time:
Single-Cycle MAC
Multiple Execution Units
Custom Data Path
High Bandwidth (Flat) Memory Sub-Systems
Dual Access Memory
Efficient Zero-Overhead Looping
Short Pipeline
High Bandwidth I/O
Specialized Instruction Sets
Low Latency Interrupts
Sophisticated DMA
No Speculation
RTOS
Soft Real-Time (Application Processor):
Single-Cycle MAC
Multiple Execution Units
Custom Data Path
L1D$, L1I$, L2$ with MMU
Speculative Fetching and Branching
Virtual Memory
Protected Memory
Virtual Machines
Semaphores
Context Save and Restore
Threading: SMT, IMT
Efficient Zero-Overhead Looping
Short Pipeline
High Bandwidth I/O
Specialized Instruction Sets
Low Latency Interrupts
Sophisticated DMA
O/S
Single Cycle MAC
MACs Typically Determine DSP Performance and Pipeline Length (EX)
Most DSPs Have 2-8 MAC Units
MACs Typically Operate in Both Scalar and Vector Modes
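As an illustration of why the MAC dominates DSP performance, here is a minimal sketch of an FIR filter whose inner loop is nothing but multiply-accumulates. The function names and the idealized cycle estimate (one MAC per unit per cycle, no memory stalls) are assumptions for illustration, not figures from the slides.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: each output is a sum of coefficient * delayed-sample products."""
    history = [0] * len(coeffs)          # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample into the delay line
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s                 # one multiply-accumulate (MAC) per tap
        out.append(acc)
    return out


def mac_cycles_per_output(taps: int, mac_units: int) -> int:
    """Idealized inner-loop cycles per output sample when `mac_units` MACs run in parallel."""
    return -(-taps // mac_units)         # ceiling division


impulse_response = fir([1, 0, 0, 0], [1, 2, 3])
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is a quick sanity check on the MAC loop.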
Multiple Instruction Units
VLIW Architectures Driving ILP
Typical Instruction Units
M-Unit - MAC
S-Unit - Shift
L-Unit - ALU
D-Unit - Load/Store
Industry Has Converged on an ILP of ~8
[Figure: VLIW data path with two register files (Registers A0 - A15 and Registers B0 - B15), each feeding four functional units (L1, S1, M1, D1 and L2, S2, M2, D2) through source (S1, S2) and destination (D) ports, with cross paths 1X and 2X and load-data inputs DDATA_I1 and DDATA_I2.]
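The idea of filling up to eight issue slots per cycle can be sketched as a greedy packer that places operations into packets subject to unit availability and data dependences. This is a simplified illustration, not a description of any real compiler: it assumes two units of each type (M, S, L, D), in-order placement, and results available one packet after issue.

```python
from collections import Counter

UNITS_PER_TYPE = 2  # assumption: one M, S, L and D unit on each of two data-path sides


def pack(ops):
    """Greedily place ops into execute packets.

    ops: list of (name, unit_type, dependency_names) in program order.
    Returns a list of packets, each a list of op names issued together.
    """
    packet_of = {}   # op name -> index of the packet it was placed in
    packets = []     # op names per packet
    usage = []       # per-packet count of units already used, by type
    for name, unit, deps in ops:
        # An op may not issue before the packet after its latest producer.
        i = max((packet_of[d] + 1 for d in deps), default=0)
        while True:
            if i == len(packets):
                packets.append([])
                usage.append(Counter())
            if usage[i][unit] < UNITS_PER_TYPE:
                break
            i += 1
        packets[i].append(name)
        usage[i][unit] += 1
        packet_of[name] = i
    return packets


example = pack([
    ("ld_x", "D", []),
    ("ld_h", "D", []),
    ("mpy", "M", ["ld_x", "ld_h"]),
    ("add", "L", ["mpy"]),
    ("ld_x2", "D", []),
    ("shift", "S", ["add"]),
])
```

In this example the third load cannot share a packet with the first two because both D units are taken, so it slides into the packet that holds the multiply.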
High Bandwidth Memory Sub-Systems
Multiple Load-Store Units Required to Feed Data Path
Tightly Coupled Memory is Typically Dual Ported
Harvard Architecture is Heavily Banked
[Figure: DSP core block diagram: PC and ARs, multiplexers (MUXES, MUX) and paths labeled P, D, C and E connect INTERNAL MEMORY and EXTERNAL MEMORY to the Central Arithmetic Logic Unit (MAC, A, B, ALU, SHIFTER), under a CNTL block.]
Specialized Instruction Sets
Base RISC ISA Plus CISC ISA Driven by End
Application
MAC
SAD
LMS
FIRS
Viterbi
Support For Both Scalar and Vector Instructions
Support For 8, 16 and 32-Bit Instructions
Instructions are Highly Orthogonal
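Two of the application-driven operations named above, SAD and LMS, are shown below in plain reference form so it is clear what a single specialized instruction would be collapsing. These are textbook formulations written for illustration; the function names and the step-size parameter `mu` are assumptions, not instruction definitions from any particular DSP.

```python
def sad(block_a, block_b):
    """Sum of absolute differences, the core metric in block-matching motion estimation."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def lms_step(weights, window, desired, mu):
    """One least-mean-squares update: filter, compute the error, nudge each weight."""
    y = sum(w * x for w, x in zip(weights, window))        # filter output (a MAC loop)
    err = desired - y                                       # error against the reference
    new_weights = [w + mu * err * x for w, x in zip(weights, window)]
    return new_weights, err


distance = sad([1, 5, 3], [4, 2, 3])
updated, error = lms_step([0.0, 0.0], [1.0, 2.0], desired=1.0, mu=0.1)
```

Each of these loops combines a subtract or multiply with an accumulate per element, which is the pattern a dedicated instruction folds into one operation.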
Scalar (55x) vs VLIW (64x)
Scalar DSPs Tend to be More CISC-Like
Hurts Compiler Performance
Improves Energy-Delay
Improves Code Density
Limits Top End Performance
VLIW DSPs Tend to be More RISC-Like
RISC + GP Regs + Orthogonality Make for a Good C Compiler Target
Assembler Code Is Challenging
RISC ISA Allows for Higher Frequencies
Load-Store Hurts Energy-Delay
TMS320C54x
TMS320C54x Protected Pipeline
CYCLES  1   2   3   4   5   6   7   8   9   10  11
        P1  F1  D1  A1  R1  X1
            P2  F2  D2  A2  R2  X2
                P3  F3  D3  A3  R3  X3
                    P4  F4  D4  A4  R4  X4
                        P5  F5  D5  A5  R5  X5
                            P6  F6  D6  A6  R6  X6
Fully loaded pipeline: cycle 6 (P6 F5 D4 A3 R2 X1)
Prefetch: Calculate address of instruction
Fetch: Collect instruction
Decode: Interpret instruction
Access: Collect address of operand
Read: Collect operand
Execute: Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
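The staircase pattern of the six-stage pipeline can be generated mechanically, which also makes it easy to see when the pipeline first becomes fully loaded. The sketch below assumes one instruction enters per cycle with no stalls; the function names are illustrative.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # Prefetch, Fetch, Decode, Access, Read, Execute


def stage_at(instruction: int, cycle: int):
    """Stage occupied by 1-based `instruction` in 1-based `cycle`, or None if not in flight."""
    index = cycle - instruction
    return STAGES[index] if 0 <= index < len(STAGES) else None


def pipeline_rows(n_instructions: int):
    """One text row per instruction, indented by its issue cycle (aligned for n <= 9)."""
    rows = []
    for i in range(n_instructions):
        cells = ["  "] * i + [f"{s}{i + 1}" for s in STAGES]
        rows.append(" ".join(cells))
    return rows


def total_cycles(n_instructions: int) -> int:
    """Cycles to drain n back-to-back instructions through the pipeline."""
    return n_instructions + len(STAGES) - 1


for row in pipeline_rows(6):
    print(row)
```

With six instructions in flight, cycle 6 is the first in which every stage is busy, and the last instruction finishes in cycle 11.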
TMS320C6xx
[Figure: ’C6xx CPU Core block diagram: Program Fetch, Instruction Dispatch and Instruction Decode feed Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2), alongside Control Registers, Control Logic, Test, Emulation and Interrupts. Unit legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit.]
TMS320C6xx Exposed Pipeline
Fetch: PG PS PW PR
Decode: DP DC
Execute: E1 E2 E3 E4 E5
PG: Program Address Generate
PS: Program Address Send
PW: Program Access Ready Wait
PR: Program Fetch Packet Receive
DP: Instruction Dispatch
DC: Instruction Decode
E1 - E5: Execute 1 through Execute 5
Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5
Note: Exposed Pipeline Adds Risk to Programming Model
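The programming-model risk of an exposed pipeline can be illustrated with a toy simulator in which a load's result only becomes visible after its delay slots have elapsed, with no hardware interlock to stall a too-early reader. The four-slot load delay is an assumption chosen to match the five execute stages E1 - E5; the instruction format and function name are invented for this sketch.

```python
LOAD_DELAY_SLOTS = 4  # assumption: a load issued in cycle n is visible from cycle n + 5


def run(program, memory, initial_regs=None):
    """Execute one instruction per cycle with no interlocks; return the final registers.

    Instructions: ("load", dst, addr), ("add", dst, src_a, src_b), ("nop",).
    Loads still in flight when the program ends are not committed.
    """
    regs = dict(initial_regs or {})
    pending = []  # (first cycle in which the value is visible, register, value)
    for cycle, instr in enumerate(program):
        # Commit any loads whose delay slots have elapsed.
        still_pending = []
        for ready, reg, value in pending:
            if ready <= cycle:
                regs[reg] = value
            else:
                still_pending.append((ready, reg, value))
        pending = still_pending

        if instr[0] == "load":
            _, dst, addr = instr
            pending.append((cycle + LOAD_DELAY_SLOTS + 1, dst, memory[addr]))
        elif instr[0] == "add":
            _, dst, a, b = instr
            regs[dst] = regs.get(a, 0) + regs.get(b, 0)  # reads whatever is there now
    return regs


memory = {0: 7}
too_early = run([("load", "r1", 0), ("add", "r2", "r1", "r1")], memory, {"r1": 1})
scheduled = run([("load", "r1", 0)] + [("nop",)] * 4 + [("add", "r2", "r1", "r1")], memory, {"r1": 1})
```

In the first program the add silently consumes the stale value of `r1`; only the second, with the delay slots filled, sees the loaded data. On a protected pipeline the hardware would stall instead, which is the trade-off the two notes on these slides point at.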
Agenda
Industry Trends
DSP Architecture
DSP Micro-Architecture
DSP Systems
Micro-Architectural Challenges
Accessing (Flat) On Chip Memory At Speed
Within 2-3 cycles
Feeding Multiple Functional Units From a Single
Register File
Running 600 MHz+ with a 7-9 Stage Pipeline
Linking Multiple Functional Units with Result
Forwarding
Implementing CISC Data-path to Meet Area and
Performance Goals
Achieving ARM-Like Code Density
Agenda
Industry Trends
DSP Architecture
DSP Micro-Architecture
DSP Systems
DSP Systems
Wireless Infrastructure: TMS320C6416
600 MHz
Viterbi and Turbo hardware accelerators
Wired Infrastructure: TMS320C5561
6 DSP CPU @ 300 MHz
3 MB (24 Mb) integrated memory
180M transistors
Wireless Client: OMAP5910
DSP+GPP
Low power consumption
Voice, data, video
Digital Still Camera: TMS320DM310
DSP+GPP
Imaging accelerators
Performance Audio: TMS320DA610
225 MHz
Floating point
VoIP Platform
TNETV3010 Features
6 C55x DSPs @ 300 MHz
Shared Instruction Memory
Broadcast DMA
24 Mbits of On-Chip SRAM
OMAP Platform
OMAP2420 Features
ARM 1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
DSP @ 220 MHz
2D/3D graphics accelerator
IVA supports still images to >4 Mpixels, 30 fps VGA video decode
Output to TV for gaming and video playback
Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram: ARM11 + VFP, TMS320C55x DSP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), Internal SRAM, Memory Controller and Peripherals joined by the L3 Interconnect and L4 Interconnect, with Security, Camera I/F, LCD I/F and Video Out.]
IBM Cell Architecture
Design Features:
Multi-Core Architecture
Based on the Power Architecture
Code compatibility
Coherent and cooperative off-load processing
Enhanced SIMD architecture
Power efficiency improved
“Absolute timers” allow “hard” real-time data processing
Good estimation of execution time is possible
Big-endian memory
Support Apple, but not Intel
Isolation mechanism for secure code execution
FlexIO
DSP Architecture
SPE (Synergistic Processing Element)
Dual issue, 128-bit 4-way SIMD
Vector Processing
4 Integer Units + 4 FP Units
8-,16-,32-bit Integer + 32-,64-bit FP
128x128-bit Registers
256KB Local-Store Memory (specially designed)
Caches are not used
Data & Instruction in LS
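The SPE figures above fit together arithmetically: a 128-bit register holds four 32-bit elements, which is where "4-way SIMD" comes from, and narrower elements give more lanes. The sketch below works through that arithmetic and models one lane-wise multiply-add in plain Python; it is an illustration of the lane concept only, not SPE code, and the function names are assumptions.

```python
REGISTER_BITS = 128      # width of one SPE register, from the slide
REGISTER_COUNT = 128     # number of registers, from the slide


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def simd_multiply_add(a, b, c):
    """Lane-wise a * b + c on 4-element vectors, modelling one 4-way SIMD operation."""
    if not (len(a) == len(b) == len(c) == lanes(32)):
        raise ValueError("expected 4-element vectors")
    return [x * y + z for x, y, z in zip(a, b, c)]


register_file_bytes = REGISTER_COUNT * REGISTER_BITS // 8
result = simd_multiply_add([1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1])
```

The same register therefore carries 16 lanes of 8-bit or 8 lanes of 16-bit integer data, and the full register file comes to 2 KB.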