Lecture 2 rev2
Microprocessor and DSP
Technologies for the
Nanoscale Era
Seminar 2
Ram Kumar Krishnamurthy
Microprocessor Research Labs
Intel Corporation, Hillsboro, OR
Intel Labs
July 11, 2005
Outline
• General Technology and Circuit Challenges Beyond 65nm:
• Switching and active leakage energy
• Leakage tolerance and robustness
• On-chip interconnect scaling
• Process parameter variations and tolerance
• Execution core thermal/power density
• Emerging trends in wireless and embedded DSP industry
• Circuit solutions:
• Active and standby leakage power reduction strategies
• Multi-supply design: switching + leakage power benefits
• Energy-efficient arithmetic circuit technologies
• HW accelerators for specialized DSP applications
Technology Scaling 101
[Figure: MOSFET cross-sections (gate, source, drain, body) before and after one generation of scaling]
• Dimensions: Tox → 0.7 Tox, L → 0.7 L
• Area (0.7 × 0.7): 1 → 0.49
• Delay: 1 → 0.7
• Freq: 1 → 1/0.7 = 1.43
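The per-generation factors above compound across generations. The following is a minimal sketch of that arithmetic, assuming ideal scaling by 0.7 each generation and reading the 0.49 factor as area; the function name `scale` and the dictionary keys are illustrative, not from the slides.

```python
# Ideal scaling sketch: every linear dimension shrinks by S each generation.
S = 0.7  # per-generation scale factor taken from the slide

def scale(generations: int) -> dict:
    """Return normalized dimension, area, delay and frequency after n generations."""
    dim = S ** generations          # Tox and L
    return {
        "dimension": dim,
        "area": dim ** 2,           # 0.7 x 0.7 = 0.49 per generation (assumed to be area)
        "delay": dim,               # gate delay tracks the linear dimension
        "frequency": 1.0 / dim,     # 1 / 0.7 = 1.43 per generation
    }

for n in range(4):
    print(n, {name: round(value, 2) for name, value in scale(n).items()})
```

Real processes deviate from these ideal factors, which is the subject of the slides that follow.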
Leakage vs. Switching Power
[Figure: power (Watts, 0 to 250) vs. technology (250nm, 180nm, 130nm, 90nm, 65nm), showing active power and active leakage power]
• From a µP perspective, but true for DSPs too
• Ioff increases 3-5X per generation
• Active leakage power > 50% of total power
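To get a feel for how quickly a 3-5X per-generation increase compounds, here is a small sketch. It assumes the same 3-5X range applies uniformly at every step from 250nm to 65nm, which the slide does not state; `ioff_growth` is an illustrative name.

```python
def ioff_growth(generations: int, low: float = 3.0, high: float = 5.0):
    """Cumulative Ioff multiplier range after a number of generations."""
    return low ** generations, high ** generations

# Node list taken from the figure's x-axis; uniform growth per step is an assumption.
nodes = ["250nm", "180nm", "130nm", "90nm", "65nm"]
for n, node in enumerate(nodes):
    lo_mult, hi_mult = ioff_growth(n)
    print(f"{node}: Ioff x{lo_mult:.0f} to x{hi_mult:.0f} relative to {nodes[0]}")
```

Under that assumption, four generations give a cumulative increase of 81X to 625X.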
Interconnect Delay
[Figure: On-chip interconnect performance. Delay (ns, log scale, 0.001 to 1.0) vs. technology node (250 to 50 nm), with the typical gate delay curve and trends annotated at 30% per generation]
• RC/mm increases 40-60% per generation
• Local inter-gate wires dominate critical-path delays
• Global wire lengths not scaling by 0.7x
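A rough way to see why wires come to dominate is to compare wire RC per mm against gate delay generation by generation. The sketch below assumes gate delay falls to 0.7x per generation (one reading of the figure's "30% per generation" annotation) and uses the 40-60% RC/mm increase from the bullet above; the names are illustrative.

```python
GATE_DELAY_FACTOR = 0.7      # assumed: gate delay improves 30% per generation
RC_FACTORS = (1.4, 1.6)      # RC/mm grows 40-60% per generation (from the slide)

def wire_to_gate_ratio(generations: int, rc_factor: float) -> float:
    """Growth of (wire RC per mm) / (gate delay), normalized to 1 at generation 0."""
    return (rc_factor / GATE_DELAY_FACTOR) ** generations

for n in range(1, 4):
    low, high = (wire_to_gate_ratio(n, f) for f in RC_FACTORS)
    print(f"after {n} generation(s): {low:.1f}x to {high:.1f}x")
```

Under these assumptions a fixed-length wire gets at least 2x slower relative to a gate every generation, which is why global wires that do not shrink by 0.7x become a problem.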
Process Variation Tolerance
[Figures (180nm CMOS, 110°C): number of dies (0 to 200) vs. normalized IOFF (0 to 7); normalized frequency (0.9 to 1.4) vs. normalized leakage (0 to 20), showing a 30% frequency spread and a 20X leakage spread, with the fast corner marked]
• Significant variation in IOFF (hence Fmax spread)
• Worsening with process scaling
• Excess leakage dies: lack in robustness
• Low leakage dies: over-designed for robustness
Process parameter variation tolerant design techniques
DSP Application Demands
[Figure: capability vs. time, moving from voice to data and applications, with implementations labeled ASIC, DSP, hardware, and hardware assist]
• 2001: 2G (GSM, PDC, IS-95), < 50 MIPS
• 2003: 2.5G (GPRS, EDGE, IS-95B), > 100 MIPS
• 2005: 3G (WCDMA), > 200 MIPS
• Smart cell-phones: $2B in ’02 → $15B in ’06!
• Huge demand for high-performance DSPs
Multimedia, Graphics, Enterprise…
[Figure: capability vs. time, spanning multimedia, graphics, enterprise, and OS, services and apps]
• 2001: < 50 MIPS, 4+ MB Flash, 0.5+ MB RAM
– Simple user interface, calendar, notepad
• 2003: > 100 MIPS, 16+ MB Flash, 8+ MB RAM
– Color screen, audio, graphics, email, full OS, GUI, browser, suite of apps
• 2005: 200+ MIPS, 64+ MB Flash, 16+ MB RAM
– Speech recognition, multimedia, large files and applications, secure remote access, full OS and user interface, browser, suite of apps
• Market is hungry for DSP MIPS (if you deliver, they will use it!)
Typical Performance Requirements
[Figure: MHz per task (log scale, 10 to 1000) vs. total required memory (< 4MB, 8MB, 16MB, 32MB, 64MB), with Pocket PC marked]
• Tasks shown: MP3 encode, MPEG 4 playback, MP3 playback, robust handwriting recognition, voice 128-bit encryption and decryption, graphical browser (small screen), ASCII browser
• MHz bands shown: 8 - 12, 25 - 50, 40 - 80, 80 - 100, 150 - 200, 200 - 300 MHz
So, How Do We Meet This Surging Demand Within Given Power Envelope?
[Figure: energy efficiency in MOPS/mW (log scale, 0.1 to 1000) vs. flexibility (coverage)]
• Dedicated HW (ASIC)
• Configurable processor/logic: Berkeley’s Pleiades, 10-80 MOPS/mW
• Digital signal processors or other ASIPs: 1-2 MOPS/mW
• Embedded processors: SA110, 0.4 MOPS/mW
• Energy vs. Flexibility Trade-off
Courtesy: Prof. J. Rabaey, UC Berkeley
Energy and Area Efficiency
Courtesy: Prof. Teresa Meng, Stanford
MOPS/mW Distinction: General-purpose vs. Dedicated
Courtesy: Prof. B. Brodersen, UC Berkeley, ISSCC’02
DSP functions are more throughput-oriented
Amenable to parallelism and pipelining (better power-performance optimization)
Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log scale, 1 to 1000) vs. flexibility, placing specialized hardware, programmable DSP, and embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
Prof. L. Clark, CICC 2002 [2]
Emerging Trends in DSP Industry
[Figure: normalized power efficiency vs. flexibility, as on the previous slide]
• Microprocessors add specialized HW and coprocessors with DSP functionality
Example Case Study
IBM 32b PowerPC Processor, Nowka et al, ISSCC 2002 [3]
• 153-380MHz, 53-500mW in 180nm CMOS, 1.0-1.8V
• 5.84M transistors, 36mm²
• Dedicated DES encryption and speech processing accelerators
[Figure: encryption and speech processing specialized HW]
Emerging Trends in DSP Industry
[Figure: normalized power efficiency vs. flexibility, as on the earlier slide]
• DSPs add microcontroller functionality and specialized HW accelerators
Example Case Study
TI VLIW DSP with 1MB L2 cache, Agarwala et al, ISSCC 2002 [4]
• 600MHz, 4.8GOPS, 718mW in 130nm CMOS, 1.2V
• Dedicated Viterbi and Turbo decoding co-processor HW
• 64M transistors
• Integrated DMA controller, PCI, 1MB L2, 16K I$ & D$
[Figure: Viterbi and Turbo co-processors]
Specialized Hardware Accelerators
• Specialized (fixed function) hardware is 10-100x more
efficient than general purpose processors: Why?
– Trades hardware for power
– Allows very low clock rates
– Essential for some wireless functions
– Viterbi and Turbo decoding, speech recognition, encryption
• Allows custom algorithms and coefficients to limit power
– Use shifts instead of multiplies
• Cost is flexibility
– Fixed algorithms and coefficients
– As new applications and wireless standards emerge, is this
enough?
– How does this cover the application space?
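The "shifts instead of multiplies" point can be illustrated with a fixed coefficient: once the coefficient is known at design time, the multiplier reduces to a few shifted copies of the input added together. This is a minimal software sketch of the idea, assuming non-negative integer coefficients; the function names are illustrative, and real hardware would typically also use subtraction (signed-digit recoding) to cut the number of terms further.

```python
from typing import List

def shift_add_terms(coefficient: int) -> List[int]:
    """Bit positions set in a non-negative constant; one shifted add per position."""
    if coefficient < 0:
        raise ValueError("this sketch only handles non-negative coefficients")
    return [bit for bit in range(coefficient.bit_length()) if (coefficient >> bit) & 1]

def multiply_by_constant(x: int, coefficient: int) -> int:
    """Compute x * coefficient using only shifts and adds."""
    return sum(x << shift for shift in shift_add_terms(coefficient))

# Multiplying by 10 becomes (x << 3) + (x << 1): two shifts and one add.
print(shift_add_terms(10), multiply_by_constant(7, 10))
```

The cost is visible in the same example: the shift pattern is baked in, so a new coefficient means new hardware.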
Reconfigurable Processors
FPGA: Fine-Grain Reconfigurable Fabric
• Fine-grain gate-level functions
• Array of MUXes to implement any N-input boolean function
• Speed sacrificed for generality
Coarse-Grain Reconfigurable Fabric
• Moderate-grain function blocks
• Collections of Add, Mpy, Mux, …
• Interconnect overhead is moderate to low
• If functions and connectivity are known, can be highly optimized
Courtesy: Prof. F. Kurdahi, UCI
Generic Reconfigurable Architecture
[Figure: array of fine/coarse-grain datapath tiles and registers, with a configuration control unit]
How Do Reconfigurable Processors Work?
• Execute one algorithm/protocol at any given time
• Each algorithm is ‘configured’ from the building blocks
• Time between subsequent configurations: ~1-10ms
• Configuration Control unit decides which algorithm to execute when
[Figure: timeline of the fabric configured for Protocol 1, then Protocol 2, then Protocol 3]
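The time-multiplexing described above can be mimicked in software. The sketch below is a toy model only: the building blocks, class name, and protocol names are hypothetical stand-ins, and a real fabric configures hardware tiles and interconnect rather than Python functions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical building blocks standing in for the fabric's datapath tiles.
BLOCKS: Dict[str, Callable[[int], int]] = {
    "add1": lambda x: x + 1,
    "double": lambda x: x << 1,
    "negate": lambda x: -x,
}

class ReconfigurableFabric:
    """Holds exactly one configured algorithm at any given time."""

    def __init__(self) -> None:
        self.active = None
        self.pipeline: List[Callable[[int], int]] = []

    def configure(self, name: str, block_names: List[str]) -> None:
        # Reconfiguration replaces the previous algorithm entirely.
        self.active = name
        self.pipeline = [BLOCKS[b] for b in block_names]

    def execute(self, sample: int) -> int:
        for block in self.pipeline:
            sample = block(sample)
        return sample

def run_schedule(fabric: ReconfigurableFabric,
                 configurations: Dict[str, List[str]],
                 schedule: List[Tuple[str, List[int]]]):
    """Play the role of the configuration control unit: pick what runs when."""
    results = []
    for protocol, samples in schedule:
        fabric.configure(protocol, configurations[protocol])
        results.append((protocol, [fabric.execute(s) for s in samples]))
    return results

configs = {"protocol1": ["add1", "double"], "protocol2": ["double", "negate"]}
print(run_schedule(ReconfigurableFabric(), configs,
                   [("protocol1", [1, 2]), ("protocol2", [3])]))
```

The model leaves out the reconfiguration time itself, which the slide puts at roughly 1-10ms between configurations.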
Standby Leakage Reduction:
Sleep Transistor design
• Motivation: Cut off power supply in sleep-mode
• Insert “sleep” transistor between main supply and functional unit’s supply rails
• Latches tied to main supply rails: retain state
[Figure: functional unit between virtual Vcc and virtual Vss rails, each gated by a sleep transistor]

                           MTCMOS   Boosted Sleep   Non-Boosted Sleep
Sleep-TR size              5.1%     2.3%            3.2%
Leakage power reduction    1450X    3130X           11.5X
Virtual supply bounce      60 mV    59 mV           58 mV

Standby leakage benefit for 5% delay penalty
Switching + Leakage Reduction:
Forward Body Bias
[Figure: body bias applied to the PMOS body (Vbp) and NMOS body (Vbn)]
[Figure: normalized total power vs. frequency (0.6 to 1.4 GHz) at 110°C for zero body bias (ZBB) and 500mV forward body bias (FBB), Vcc: 1, 1.05, 1.1 … 1.5V, with the 1.1V and 1.2V points annotated]
[Figure: FBB/ZBB leakage ratio (0 to 30) vs. frequency (0.6 to 1.4 GHz) at 27°C]
• 20% power reduction at 1GHz
• 8% frequency at iso-power
• 20X idle-mode leakage
A. Keshavarzi et al, 2002 Symp. VLSI Circuits [6]
Multi-Vcc Usage Model
[Figure: VRM1 to VRM4 and core supplies VCCcore1 to VCCcore4]
• Optimize performance and power with parallelism and voltage
Switching + Leakage Reduction:
Multi Supply Design
• Active leakage benefit with lower supply voltage
• Exponential subthreshold and gate leakage reduction
[Figure: measured leakage in a 1.2V, 130nm process. Normalized leakage (0 to 100) vs. voltage (0 to 1.5V) for subthreshold leakage and gate leakage]
[Figure: 130nm L1 cache leakage. Normalized leakage energy (0 to 12) vs. VCC (0 to 1.4V) at the worst-case and nominal corners, with 79% annotated]
R. Krishnamurthy et al, 2002 Symp. VLSI Circuits [7]
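The switching-power half of the multi-supply benefit follows from the standard first-order CMOS relation P = α·C·V²·f. That formula is textbook background rather than something stated on the slide, and the sketch deliberately does not model the exponential leakage curves, since the slide gives no fitted parameters for them. Function and parameter names are illustrative.

```python
def switching_power(activity: float, capacitance_f: float,
                    vcc: float, freq_hz: float) -> float:
    """First-order dynamic power in watts: alpha * C * V^2 * f."""
    return activity * capacitance_f * vcc ** 2 * freq_hz

def relative_switching_power(vcc: float, vcc_ref: float = 1.2) -> float:
    """Switching power at vcc relative to vcc_ref, at equal activity, C and f."""
    return (vcc / vcc_ref) ** 2

for v in (1.2, 1.0, 0.8, 0.6):
    print(f"{v:.1f}V -> {relative_switching_power(v):.2f}x switching power")
```

Moving a non-critical block from 1.2V to 0.6V would cut its switching power to a quarter at the same frequency, before counting the leakage reduction shown in the figures.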
Adaptive Vcc: Variation-tolerant Circuits
• Motive: change Vcc adaptively to reduce impact of parameter variations
• Large Fmax vs. leakage spread (worsening with scaling)
• Lower Vdd on leakage-limited circuits (subject to stability limits)
• Higher Vdd on speed-limited circuits (subject to reliability limits)
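The per-die decision above can be sketched as a search over a Vdd grid. Everything about the model here is an assumption for illustration: frequency is taken to scale linearly with Vdd around nominal, and the 950 mV and 1150 mV bounds are hypothetical placeholders for the stability and reliability limits. The 1.05V nominal and 20mV step are the values shown in this slide's figure legend.

```python
from typing import Optional

def pick_vdd_mv(fmax_at_nominal: float, target_freq: float,
                nominal_mv: int = 1050, step_mv: int = 20,
                min_mv: int = 950, max_mv: int = 1150) -> Optional[int]:
    """Lowest grid Vdd (in mV) that meets target_freq, or None if none does.

    Assumes Fmax scales linearly with Vdd around nominal (hypothetical model).
    min_mv / max_mv stand in for stability and reliability limits.
    """
    for vdd_mv in range(min_mv, max_mv + 1, step_mv):
        if fmax_at_nominal * vdd_mv / nominal_mv >= target_freq:
            return vdd_mv      # lowest passing Vdd also minimizes leakage
    return None

# Normalized Fmax measured at nominal Vdd for three hypothetical dies.
for name, fmax in [("fast, leaky die", 1.10), ("typical die", 1.00), ("slow die", 0.95)]:
    print(name, pick_vdd_mv(fmax, target_freq=1.0))
```

Fast dies end up below nominal Vdd, which trims their leakage, and slow dies end up above it, which pulls them into the target frequency bin.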
[Figure: normalized frequency (0.9 to 1.4) vs. normalized leakage (0 to 20), showing a 30% frequency spread and a 20X leakage spread]
[Figure: die count (0% to 100%) vs. frequency bin (0.85 to 1.05), comparing fixed Vdd (1.05V) with adaptive Vdd (20mV resolution)]
[Figure: die count (0% to 100%) vs. frequency bin (0.85 to 1.05), comparing adaptive Vdd + body bias with adaptive Vdd + WID body bias]
[Figure: 5.3 mm x 4.5 mm die with 21 sub-sites within 1 die]
J. Tschanz et al, 2002 Symp. VLSI Circuits
Viterbi Decoder Organization
[Figure: encoded bits → Branch Metric Unit (BMU) → branch error → Path Metric Unit (PMU) → transitions → Traceback Unit (TBU) → decoded bits]
• BMU calculates errors for all branches
• PMU accumulates errors and outputs transitions with minimum error
• TBU traces minimum error path back to get best estimate of original input
One of the most performance and power critical algorithms in wireless baseband DSP
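The three units map directly onto the three phases of a software Viterbi decoder. The sketch below is a hard-decision decoder for a small 4-state, rate-1/2 code with generators (7, 5) octal, chosen only to keep the example short; the accelerator in these slides is a 64-state design, and none of the names below come from it.

```python
G = (0b111, 0b101)   # generator polynomials (7, 5) octal, constraint length 3
N_STATES = 4         # 2^(K-1) states for this toy code
INF = float("inf")

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def encode(bits):
    """Rate-1/2 convolutional encoder; state holds the two previous input bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state                      # newest bit in the MSB
        out.append(tuple(parity(reg & g) for g in G))
        state = reg >> 1
    return out

def viterbi_decode(symbols):
    metrics = [0] + [INF] * (N_STATES - 1)          # encoder starts in state 0
    decisions = []                                  # survivor predecessor per state, per step
    for received in symbols:
        new_metrics = [INF] * N_STATES
        prev_of = [0] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == INF:
                continue
            for b in (0, 1):
                reg = (b << 2) | state
                expected = tuple(parity(reg & g) for g in G)
                # BMU: branch error = Hamming distance to the received symbol
                branch = sum(e != r for e, r in zip(expected, received))
                nxt = reg >> 1
                # PMU: add-compare-select keeps the minimum-error transition
                candidate = metrics[state] + branch
                if candidate < new_metrics[nxt]:
                    new_metrics[nxt] = candidate
                    prev_of[nxt] = state
        metrics = new_metrics
        decisions.append(prev_of)
    # TBU: trace the minimum-error path back from the best final state
    state = min(range(N_STATES), key=metrics.__getitem__)
    decoded = []
    for prev_of in reversed(decisions):
        decoded.append(state >> 1)                  # input bit is the MSB of the state
        state = prev_of[state]
    return decoded[::-1]

message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
coded = encode(message)
coded[2] = (coded[2][0] ^ 1, coded[2][1])           # inject a single bit error
print(viterbi_decode(coded) == message)
```

The inner double loop is the add-compare-select work that the PMU performs for all states in parallel in hardware, which is why that unit dominates both speed and power.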
90nm CMOS Implementation
[Figure: layout with BMUs, PM memories, ACS units (one block labeled 8 ACS), TB control, TB memory, and path memory]
• 90nm dual-Vt 7-metal CMOS technology
• 64-state radix-2 design: 40mW at 500Mbps, 1.2V
• ACS: 230µm x 210µm
• Traceback: 260µm x 510µm
M. Anders et al, 2004 VLSI Circuits Symp. [10]
Summary
Technology               90nm dual-Vt CMOS
ACS area                 230µm x 210µm (0.048 mm2)
Traceback area           260µm x 510µm (0.133 mm2)
Viterbi states           64-state
ACS precision            10 bits
Radix-2 max. TB length   96 symbols
M. Anders et al, 2004 VLSI Circuits Symp. [10]
• Fastest reported 64-state Viterbi accelerator
– Total power at 2 GHz (500Mbps) is 40mW (1.2V)
• Lowest power 802.11a implementation
– Total power at 216 MHz (54Mbps) is 5mW (0.7V)
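Two derived figures make the operating points above easier to compare: clock cycles per decoded bit and energy per decoded bit. The inputs are the numbers on this slide; the derived values are plain arithmetic and not figures quoted by the authors. Function names are illustrative.

```python
def cycles_per_bit(clock_hz: float, throughput_bps: float) -> float:
    return clock_hz / throughput_bps

def energy_per_bit_pj(power_w: float, throughput_bps: float) -> float:
    return power_w / throughput_bps * 1e12

# (clock in Hz, throughput in bit/s, total power in W), taken from the slide
operating_points = {
    "1.2V, 2 GHz, 500 Mbps": (2e9, 500e6, 40e-3),
    "0.7V, 216 MHz, 54 Mbps": (216e6, 54e6, 5e-3),
}
for name, (clock, rate, power) in operating_points.items():
    print(name, cycles_per_bit(clock, rate), round(energy_per_bit_pj(power, rate), 1))
```

Both points work out to 4 clock cycles per decoded bit, with roughly 80 pJ and 93 pJ per bit respectively.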
Streaming Media Accelerators:
32-bit MAC [ISSCC’03]
• 5GHz 32-bit multiply-accumulate unit
• Targeted for special purpose streaming processors/graphics
[Figure: chip blocks labeled multiplier, aligner, accumulate, normalize, CLK, FIFOs & scan]

Die Area       1.32 x 1.57 mm2
Process        90nm CMOS
Interconnect   1 poly, 7 metal
Transistors    230K
Frequency      5GHz
Maximum Vcc    1.5V
Chip Power     1.2W @ 1.2V
Pad Count      75

S. Vangal et al, ISSCC’03 [11]
32-bit MAC Architecture Overview
[Figure: block diagram with scan-in and scan-out registers, FIFOs A, B, and C under FIFO control, 32-bit datapaths, and the MAC (multiply followed by add)]
• Single-cycle 5GHz 32-bit MAC loop
• New Multiplier and Accumulator ALU circuit techniques
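Behaviorally, a FIFO-fed MAC loop does one multiply and one accumulate per operand pair. The sketch below is a software stand-in only: it uses Python floats and two operand queues, and does not model the chip's 32-bit number format, its third FIFO, or its single-cycle timing. The function name is illustrative.

```python
from collections import deque

def mac_stream(fifo_a: deque, fifo_b: deque, acc: float = 0.0) -> float:
    """Pop one operand pair per iteration and accumulate acc += a * b."""
    while fifo_a and fifo_b:
        acc += fifo_a.popleft() * fifo_b.popleft()
    return acc

# A dot product is the simplest streaming workload for a MAC unit.
a = deque([1.0, 2.0, 3.0])
b = deque([4.0, 5.0, 6.0])
print(mac_stream(a, b))
```

Because each accumulate depends on the previous result, the loop cannot be pipelined across iterations without changing the arithmetic, which is why the slide stresses a single-cycle MAC loop.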
TCP/IP Off-load Accelerator
[ISSCC’03]
• 10GHz TCP/IP offload accelerator unit
• Targeted for 10Gbps Ethernet packet processing acceleration

Core Area      2.23 x 3.54mm2
Process        90nm dual-VT CMOS
Interconnect   1 poly, 7 metal
Transistors    260K
Frequency      10GHz
Max Vcc        1.5V
Chip power     1.9W @1.2V
Pad count      306

Y. Hoskote et al, ISSCC’03 [12]
10GHz TCP/IP Execution Core
[Figure: execution core block diagram with input, ROB, key, TCB index, TCB, CLB, working register, rcv buffer, scratch registers, PC with next, branch, and start addresses, instruction ROM, IR, decode, and a pipelined 10GHz sparse-tree ALU producing the ALU output; annotated with widths 6, 9, 32, 96, 112, and 264]
• At-speed packet processing execution core for 10Gbps
Conclusions
• Several Technology and Circuit Challenges Beyond 65nm
• Switching and active leakage energy
• Leakage tolerance and robustness
• On-chip interconnect scaling
• Process parameter variations and tolerance
• Execution core thermal/power density
• Emerging trends in DSP industry
• Specialized hardware accelerators and co-processors
• Reconfigurable engines
• Circuit solutions:
• Active and standby leakage power reduction strategies
• Multi-supply design: switching + leakage power benefits
• Energy-efficient arithmetic circuit technologies
• DSP HW accelerators for Viterbi, Streaming media, TCP/IP
References
[1] R. Krishnamurthy et al, “High-performance and low-power challenges for sub-70nm microprocessor circuits”, IEEE Custom Integrated Circuits Conference 2002, pp. 125-128.
[2] L. Clark et al, “Trends and challenges for wireless embedded DSPs”, IEEE Custom Integrated Circuits Conference 2003, pp. 171-176.
[3] K. Nowka et al, “A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor”, ISSCC 2002, pp. 340-341.
[4] S. Agarwala et al, “A 600 MHz VLIW DSP”, ISSCC 2002, pp. 56-57.
Reconfigurable processors:
1. http://brass.cs.berkeley.edu/
2. http://www.eng.uci.edu/comp.arch/
3. http://www.pactcorp.com/xneu/px_xpphw.html
[5] J. Tschanz et al, “Design optimizations of a high performance microprocessor using combinations of dual-Vt allocation and transistor sizing”, Symposium on VLSI Circuits 2002, pp. 218-219.
[6] A. Keshavarzi et al, “Forward body bias for microprocessors in 130nm technology generation and beyond”, Symposium on VLSI Circuits 2002, pp. 312-315.
[7] R. Krishnamurthy et al, “Dual supply voltage clocking for 5 GHz 130 nm integer execution core”, Symposium on VLSI Circuits 2002, pp. 128-129.
[8] S. Mathew et al, “A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core”, Symposium on VLSI Circuits 2002, pp. 126-127.
[9] S. Mathew et al, “A 4GHz 300mW 64b integer execution ALU with dual supply voltages in 90nm CMOS”, ISSCC 2004, pp. 162-163.
[10] M. Anders et al, “A 64-state 2GHz 500Mbps 40mW Viterbi accelerator in 90nm CMOS”, Symposium on VLSI Circuits 2004, pp. 174-175.
[11] S. Vangal et al, “A 5 GHz floating point multiply-accumulator in 90 nm dual Vt CMOS”, ISSCC 2003, pp. 334-335.
[12] Y. Hoskote et al, “A 10GHz TCP/IP offload accelerator for 10Gb/s Ethernet in 90nm CMOS”, ISSCC 2003, pp. 258-259.