MPSoC - VADA


Reconfigurable Computation and
Communication Architectures
조준동
Outline
• Why Reconfigurable System?
• The need for a S/W configurable platform
• Design Space of Reconfigurable Architectures
• New Taxonomy/Metric for RA
• Reconfigurable Radio and Multimedia Systems
• Network-centric Design: Clock and Power
• Reliable Design
Technology Evolution
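Why Reconfigurable System?
• Includes a GPP and reconfigurable H/W
• Goal: power reduction and flexibility
1. Provides Quality of Service adapted to a dynamic environment
2. A flexible structure that follows the evolution of algorithms
3. Fewer platforms to develop and maintain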
[Figure: the operations of Task 1 through Task N mapped onto Reconfigurable Hardware]
Energy Efficiency of Reconfigurability
• System architecture
• Communication protocol
• O/S and applications
• Partitioning of functions between the wireless device and services on the network
• The mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities, and adapt to various operating conditions in an (energy-)efficient way
The spectrum of solutions
[Figure: flexibility versus efficiency for a given application, spanning general-purpose processor (e.g. Pentium), reconfigurable architecture, and ASIC]
The need for a S/W configurable platform
– Doing More by Doing Less: the ability to handle a variety of standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television, and other data links).
– A fully software-reconfigurable multichannel broadband sampling receiver for standards in the 100 MHz band
Semiconductor Revolutions
“Mainstream Silicon Application
is switching every 10 Years”
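[Figure: timeline with marks at 1957, 1967, 1977, 1987, 1997 and 2007, alternating between standard and custom phases and between hardware and software. Labels: TTL; LSI, MSI; µproc., memory; ASICs, accel's; reconfigurable FPGAs; coarse grain]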
SoC Platform Adaptation
SoC Design Process:
Combined & Incremental synthesis
Reconfiguration granularity
Sébastien PILLEMENT - ENSSAT/LASTI
• System-level reconfiguration
  – Lx, C62 (decomposition into clusters)
• Function-level reconfiguration
  – Pleiades, RaPiD, DART (2001)
• Operator-level reconfiguration
  – Chameleon, PipeRench, MorphoSys (2000)
• Gate-level reconfiguration
  – Napa, GARP, FPGA
The grain size of operations in Reconfigurable System Architectures
• Fine-grained operations: multiply and addition
• Medium-grained operations: reconfigurable modules
• Coarse-grained operations: CPU, host
Design Space of Reconfigurable
Architectures
RECONFIGURABLE ARCHITECTURES
(R-SOC)
Lilian Bossuet
LESTER Lab
Université de Bretagne Sud
Lorient, France
MULTI GRANULARITY (Heterogeneous)
  Processor + Coprocessor
    Coarse Grain Coprocessor
      • Chameleon
      • REMARC
      • Morphosys
      • Pleiades
    Fine Grain Coprocessor
      • Garp
      • FIPSOC
      • Triscend E5
      • Triscend A7
      • Xilinx Virtex-II Pro
      • Altera Excalibur
      • Atmel FPSIC
  Tile-Based Architecture
      • aSoC
      • E-FPFA
FINE GRAIN (FPGA)
  Island Topology
      • Xilinx Virtex
      • Xilinx Spartan
      • Atmel AT40K
      • Lattice ispXPGA
  Hierarchical Topology
      • Altera Stratix
      • Altera Apex
      • Altera Cyclone
COARSE GRAIN (Systolic)
  Mesh Topology
      • RAW
      • CHESS
      • MATRIX
      • KressArray
      • Systolix Pulsedsp
  Linear Topology
      • Systolic Ring
      • RaPiD
      • PipeRench
  Hierarchical Topology
      • DART
      • FPFA
Fine-Grained RSOCs
Xilinx Virtex II-Pro
• Xilinx, Inc., San Jose, CA
• Up to 4 PowerPC 405 processor cores
• Up to 160k reconfigurable logic cells (4-input, 1-output lookup table)
• Up to 216 dedicated 18-bit x 18-bit multipliers
• Up to 216 18-kbit on-chip distributed memory blocks
• Up to 852 I/O pins
• www.xilinx.com
Fine-Grained RSOCs
Altera Stratix
- 1.5-V, 0.13-µm all-layer-copper SRAM process, with densities ranging from 10,570 to 114,140 LEs
- 28 digital signal processing (DSP) blocks with up to 224 embedded multipliers
Digital Signal Processing With FPGAs
Paul Ekas
Jean-Charles Bouzigues
Multiplier Options In FPGAs for DSP Processing
Option | Multiplier | Resource | Area Usage
1 | Logic Multipliers (Traditional) | Logic Elements | 500 LEs per 18x18 multiplier
2 | Hard Multipliers | DSP Blocks | 4 18x18 multipliers per DSP block
3 | Soft Multipliers | RAM | 1 to 2 embedded memory blocks
Logic Elements
• Smallest unit of logic
• Grouped into Logic Array Blocks (LABs) of ten LEs
• Features
  – Four-input look-up table (LUT)
  – Configurable register
  – Dynamic add/subtract control
  – Carry-select chain logic
[Figure: a Logic Array Block with logic elements LE1 to LE10, 4-bit inputs per LE, control signals, and local interconnect]
DSP Block: Optimized Hard MAC
[Figure: DSP block datapath with a 144-bit input register unit, 36-bit multiplier outputs, optional pipelining, add/subtract stages (37 and 38 bits), output register unit, and output MUX]
9 Bit x 9 Bit | 18 Bit x 18 Bit | 36 Bit x 36 Bit
8 multiplies | 4 multiplies | 1 multiply
2 multiplies with accumulate | 2 multiplies with accumulate |
2 sums of 2 multiplies (complex multiplies) | 1 sum of 2 multiplies (complex multiply) |
2 sums of 4 multiplies | 1 sum of 4 multiplies |
Soft Multipliers: Lookup-Based Multiplication
• Use embedded RAM blocks as look-up tables (LUTs) for generating partial products
• Coefficient or sum-of-coefficients values stored in RAM blocks
• MSB partial product shifted and added to LSB partial product
Example
• Multiplication of a 5-bit input with a 13-bit coefficient
• Multiplier table: all possible 18-bit results stored in a 32x18 look-up table (one M512 block)
ADDRESS | MULT_RESULT
00000 | 0
00001 | C
00010 | 2*C
00011 | 3*C
… | …
11111 | 31*C
C = Coefficient[12:0]
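The lookup scheme above can be sketched in software. This is a minimal illustration, not Altera's implementation: the coefficient value, the function names, and the choice to process wider inputs in 5-bit slices are all assumptions made for the example. It fills a 32-entry table with every multiple of a 13-bit coefficient, then forms a product by shifting and adding one table entry per 5-bit slice of the input.

```python
def build_multiplier_table(coefficient, addr_bits=5):
    """Precompute k * coefficient for every addr_bits-wide address k."""
    return [k * coefficient for k in range(1 << addr_bits)]


def soft_multiply(x, table, addr_bits=5):
    """Multiply unsigned x by the table's coefficient, addr_bits at a time."""
    mask = (1 << addr_bits) - 1
    result = 0
    shift = 0
    while x:
        # Look up one partial product and shift it to its bit position.
        result += table[x & mask] << shift
        x >>= addr_bits
        shift += addr_bits
    return result


C = 0x1ABC  # hypothetical 13-bit coefficient
table = build_multiplier_table(C)
product = soft_multiply(0b1101100101, table)  # 10-bit input: two lookups
```

Because the largest entry is 31 times a 13-bit value, every entry fits in 18 bits, which is why a 32x18 memory is enough for the table.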
Altera FPGA Memory Architectures
• Today's applications need more high-performance memory
• One size does not fit all
• Wide choice of modes and widths
M512 Blocks
• Rate changing
• Embedded shift register mode
• Operates up to 312 MHz
• Mixed clock mode
M4K Blocks
• True dual-port RAM
• Embedded shift register mode
• Operates up to 312 MHz
• Mixed clock mode
M-RAM
• True dual-port RAM
• Embedded shift register mode
• 512K bits
• Operates up to 300 MHz
• Mixed clock mode
External Memory Devices
• DDR SDRAM & SRAM
• SDR SDRAM
• QDR & QDRII SRAM
• ZBT SRAM
• DDR FCRAM
More Bits For Larger Memory Buffering
More Data Ports for Greater Memory Bandwidth
Soft Multiplier: Sum of Multiplications
Example: FIR filter (16-bit samples, 16-bit coefficients)
Memory: 2 M512 blocks (32x18 each)
[Figure: 16-bit serial shift registers feed 4-bit addresses into two M512 sum-of-multiplications tables; the two 18-bit table outputs are added (19 bits) and accumulated (35 bits) to form the output]
Sum of Multiplications Table
ADDRESS | MULT_RESULT
0000 | 0
0001 | C0
0010 | C1
0011 | C0+C1
… | …
1111 | C0+C1+C2+C3
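The sum-of-multiplications table can be exercised with a short bit-serial sketch. It is illustrative only and makes several assumptions: the samples are unsigned, the four coefficient values are hypothetical, and one table serves all four taps (the slide splits the work across two M512 blocks). On each step, one bit from every sample forms the table address, and the looked-up sum is shifted by that bit's weight and accumulated.

```python
def build_sum_table(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is set."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_dot_product(samples, table, sample_bits=16):
    """Bit-serial dot product of unsigned samples with the table's coefficients."""
    acc = 0
    for bit in range(sample_bits):
        # Gather bit number `bit` of every sample into one table address.
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << i
        acc += table[addr] << bit  # weight the partial sum by 2**bit
    return acc


coeffs = [3, 5, 7, 11]            # hypothetical C0..C3
samples = [1000, 2, 65535, 40]    # four unsigned 16-bit samples
sum_table = build_sum_table(coeffs)
result = da_dot_product(samples, sum_table)
```

The result equals the ordinary dot product of coefficients and samples, but no multiplier is used: only table lookups, shifts and additions.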
Example: Direct Sequence Spread Spectrum (DSSS) Modem
DSSS Modem
• Five independent data channels spread to 3.84 Mcps
• Three-stage FIR interpolation-by-32
• Root-raised-cosine pulse shaping with 22% excess bandwidth
• 112 dB SFDR, 15.36 MHz quadrature carriers
• 122.88 MSPS transmitter output with 5 MHz bandwidth and over 78 dB out-of-band rejection
• Automatic gain control (AGC) compensating for channel attenuation of up to 30 dB
• Costas loop carrier recovery
• 4x oversampling code synchronization
[Figure: DCH0 to DCH4 → DSSS Modulator → Channel Model → DSSS Demodulator → DCH0 to DCH4]
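As a rough illustration of how several data channels can share one chip stream, the sketch below spreads five channels with mutually orthogonal length-16 codes and recovers each one by correlation. It is a sketch under assumptions, not the modem's design: the recursive code construction, the code indices (0, 1, 2, 8, 9), the ±1 symbol values and all function names are chosen for the example, and scrambling, filtering and the channel are left out.

```python
def ovsf_codes(sf):
    """Return sf mutually orthogonal +/-1 codes of length sf (sf a power of two)."""
    codes = [[1]]
    while len(codes) < sf:
        nxt = []
        for c in codes:
            nxt.append(c + c)                   # child 2k: code repeated
            nxt.append(c + [-v for v in c])     # child 2k+1: code then its negation
        codes = nxt
    return codes


def spread(channels, codes):
    """Sum the chip sequences of all channels; channels[i] uses codes[i]."""
    sf = len(codes[0])
    chips = []
    for k in range(len(channels[0])):
        for j in range(sf):
            chips.append(sum(ch[k] * code[j] for ch, code in zip(channels, codes)))
    return chips


def despread(chips, code):
    """Correlate each symbol period with one code to recover that channel."""
    sf = len(code)
    return [
        sum(chips[k * sf + j] * code[j] for j in range(sf)) // sf
        for k in range(len(chips) // sf)
    ]


all_codes = ovsf_codes(16)
selected = [all_codes[i] for i in (0, 1, 2, 8, 9)]   # assumed code indices
data = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, 1], [1, -1, -1]]
chips = spread(data, selected)
recovered = [despread(chips, code) for code in selected]
```

Each symbol becomes 16 chips, and because the codes are orthogonal, correlating with one code cancels the other four channels exactly.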
DSSS Modulator
[Figure: block diagram of the modulator. Recoverable labels:]
• Channelization codes: DCH0: Cch,16,0; DCH1: Cch,16,1; DCH2: Cch,16,2; DCH3: Cch,16,8; DCH4: Cch,16,9; PCH: Cch,16,10
• SCH; length-256 Gold code spreader; gains gi and gq on the Re[] and Im[] branches
• FIR3 RRC: 25-tap FIR filter, interpolation x4, excess BW 22% (one per branch)
• FIR1 LPF: 2-channel 87-tap FIR filter, interpolation x2
• FIR2 LPF: 2-channel 47-tap FIR filter, interpolation x4
• NCO: carrier phase increment input, Sin(wn) and Cos(wn) outputs, frequency resolution 0.03 Hz, SFDR 112 dB
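The quoted NCO frequency resolution can be related to a phase accumulator with a little arithmetic. The sketch below assumes a 32-bit accumulator clocked at the 122.88 MSPS output rate; neither assumption is stated on the slide, and the function names are made up for the example. Under those assumptions the resolution is the clock divided by 2^32, and a 15.36 MHz carrier corresponds to an exact phase increment.

```python
import math


def nco_resolution(f_clk_hz, acc_bits):
    """Smallest frequency step of a phase-accumulator NCO."""
    return f_clk_hz / (1 << acc_bits)


def phase_increment(f_out_hz, f_clk_hz, acc_bits):
    """Accumulator increment per clock that yields f_out_hz."""
    return round(f_out_hz * (1 << acc_bits) / f_clk_hz)


def nco_samples(f_out_hz, f_clk_hz, acc_bits, n):
    """Generate n (cos, sin) pairs from a wrapping phase accumulator."""
    inc = phase_increment(f_out_hz, f_clk_hz, acc_bits)
    mask = (1 << acc_bits) - 1
    phase = 0
    out = []
    for _ in range(n):
        angle = 2 * math.pi * phase / (1 << acc_bits)
        out.append((math.cos(angle), math.sin(angle)))
        phase = (phase + inc) & mask  # wrap around at 2**acc_bits
    return out


F_CLK = 122.88e6   # assumed NCO clock (transmitter output rate)
ACC_BITS = 32      # assumed accumulator width
resolution = nco_resolution(F_CLK, ACC_BITS)
increment = phase_increment(15.36e6, F_CLK, ACC_BITS)
carrier = nco_samples(15.36e6, F_CLK, ACC_BITS, 8)
```

With these assumed values the step is a little under 0.03 Hz, and the carrier repeats every 8 samples because 122.88 MHz is exactly 8 times 15.36 MHz.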
DSSS Demodulator
[Figure: block diagram of the demodulator. Recoverable labels:]
• NCO: free-running phase increment, frequency resolution 0.03 Hz, SFDR 112 dB
• Altera RRC FIR (one per branch): 31-tap FIR filter, excess BW 22%, fixed rate
• AGC
• Gold code correlator with 4x oversampling, peak detector (max_index), pn_lock
• Carrier recovery loop, I-Q derotate, buffer
• Hadamard despreader: data channels output 1…5
• Pilot output, pilot monitor
DSSS Modem Resources
Resource Usage Summary
Design Entity | Logic Elements | M512 RAM | M4K RAM | Mega RAM | DSP Block Elements
Modulator | 9943 | 1 | 8 | 0 | 12
Demodulator | 12196 | 60 | 8 | 1 | 60
Power Usage Estimates
Power | mW
Total Standby Internal Power | 75
Total Logic Element Internal Power | 283
Total Clocktree Internal Power | 175
Total DSP Internal Power | 23
Other Internal Power | 92
Total Power | 505
FIR Filter Example* – 16X Cost/Performance Improvement
Device | Solution | FIR Performance (MHz) | Device Cost**** | Cost per FIR MHz
TI C6713-200 | 64 cycles** @ 200 MHz | 3.125 | $24.59 | $7.87
TI C6416-600 | 32 cycles** @ 600 MHz | 18.75 | $160 | $8.53
Altera 1C3-8 | 8 cycles*** @ 230 MHz | 28.75 | $14 | $0.49
Altera 1C12-8 | 1 cycle*** @ 170 MHz | 170 | $84 | $0.49
* FIR 128 Tap, 16 bit data, 14 bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03
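The last two columns follow from the first three by simple arithmetic, which the sketch below reproduces. The dictionary layout and function names are assumptions for the example; the clock rates, cycle counts and prices are the ones listed in the table.

```python
def fir_performance_mhz(clock_mhz, cycles_per_output):
    """Filter output rate in MHz: one output every cycles_per_output clocks."""
    return clock_mhz / cycles_per_output


def cost_per_fir_mhz(price_usd, clock_mhz, cycles_per_output):
    """Device price divided by the achievable FIR output rate."""
    return price_usd / fir_performance_mhz(clock_mhz, cycles_per_output)


# (price in USD, clock in MHz, cycles per FIR output), taken from the table
devices = {
    "TI C6713-200": (24.59, 200, 64),
    "TI C6416-600": (160.0, 600, 32),
    "Altera 1C3-8": (14.0, 230, 8),
    "Altera 1C12-8": (84.0, 170, 1),
}

costs = {name: cost_per_fir_mhz(*spec) for name, spec in devices.items()}
improvement = costs["TI C6713-200"] / costs["Altera 1C3-8"]
```

The ratio between the cheapest DSP figure and the FPGA figure comes out at roughly 16, which matches the "16X" in the slide title.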
DSP System Architecture Options
[Figure: performance (MMACs/sec) for four architecture options: stand-alone processor, processor array, processor + co-processor, dedicated hardware architecture]
Optional Coprocessor Mappings
Processor on FPGA
• Nios
• ARM (AHB)
Processor external to FPGA (with memory)
• TI c6x (EMIF)
• Mot PPC (MPX)
• Mot Starcore (MPX, AHB)
• Intel 2850 (PCI Express)
• ARM (AHB)
• …
Fine-Grained RSOCs: Triscend A7 CSOC
• A7 family
• 32-bit ARM7 with 8 kB cache
• 3200 logic cells max. (40K gates)
• Up to 3800 FFs
• Up to 300 programmable I/O pins
• www.triscend.com
Coarse-Grained RSOCs
Chameleon Structure (2000)
Design a battery-powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
- Do just enough and not too much for a given task (QoS)
• 32-bit ARC control processor
• Up to 84 32-bit datapath units (DPU = a 32-bit ALU + a 32-bit barrel shifter)
• Up to 24 16x24-bit multipliers
• Up to 48 128x32-bit local memory modules
• Up to 160 programmable I/O pins
• Targeted at 3rd-generation wireless base stations, wireless local loop, SW radio, etc.
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters, www.chameleonsystems.com
Field Programmable Function Array of
Chameleon Structure
Paul M. Heysters, Jaap Smit, Gerard J.M. Smit, Paul J.M. Havinga
• An FPFA consists of interconnected processor tiles
• Multiple processes can coexist in parallel on different tiles
• Within a tile, multiple data streams can be processed in parallel
• Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit
Field Programmable Function Array
The FPFA concept has a number of advantages:
• The FPFA has a highly regular organisation
• It uses general-purpose processor cores
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently
Field Programmable Function Array
• Processor tiles
  – Each consists of five identical blocks, which share a control unit and a communication unit
  – An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers
  – A crossbar switch provides flexible routing between the ALUs, registers and memories
  – This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response (FIR) filter
[Figure: processor tile with ten memories (M), a crossbar, registers, and five ALUs]
Mapping of DSP Algorithms on the FPFA
Fast Fourier Transform
• The FFT recursively divides a DFT into smaller DFTs
[Figure: recursion of a radix-2 FFT with 8 inputs, from an N=8 FFT down to N=2 DFTs]
[Figure: the radix-2 FFT butterfly, with inputs a and b, twiddle factor W, one adder and one subtractor]
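The recursion and the butterfly can be written out compactly. This is a textbook decimation-in-time sketch, not the FPFA mapping itself; the function names are chosen for the example and the input length is assumed to be a power of two.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 butterfly: one complex multiply, one add, one subtract."""
    t = w * b
    return a + t, a - t


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


signal = [1, 2, 3, 4, 0, -1, -2, -3]
spectrum = fft(signal)
```

For 8 inputs the recursion produces three stages of four butterflies each, and the butterfly's six real inputs (a, b and W, each complex) and four real outputs correspond to the 6-input, 4-output operation mentioned for the processor tile.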
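Mapping of DSP Algorithms on the FPFA
Fast Fourier Transform
[Figure: the butterfly mapped onto a processor tile. Inputs are, aim, bre, bim, Wre and Wim enter through the crossbar switch (at least 6 buses); ALU levels 2 and 3 compute the intermediates Zre and Zim and the outputs Are, Aim, Bre and Bim on ALUs 1 to 4]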
Mapping of DSP Algorithms on the FPFA
Five-tap finite-impulse response filter
[Figure: coefficients h0 to h4 routed through the crossbar to ALUs 1 to 5 (level 2), producing output O]
MorphoSys (1999)
Reconfigurable cell (RC) Array
• Array of reconfigurable cells
• 64 cells in a 2-D matrix
• SIMD model
• Cells in the same row (column) share configuration
• Each RC operates on different data
TinyRISC (Cont'd)
Implementation & Performance
• 0.35 micron technology
• 4 metal layers
• Operation at 100 MHz
• 170 mm2
Motion Estimation
• Block size: 16x16 pixels
• Image size: 352x288 pixels
Lx from STMicroelectronics
DART,
Raphael David, IRISA/ENSSAT
With STMicroelectronics, UBO univ.
• Multigrain reconfigurable = DPR + FPGA
• Dynamic reconfiguration
• Low power consumption
• Hierarchical distribution of resources
• SCMD (Single Configuration Multiple Data)
11 GOPS/cluster
1.6 GMACS/cluster
0.64 W @ 11 GOPS
16 MIPS/mW @ 11 GOPS
0.18 µm CMOS
DART Cluster
[Figure: cluster architecture with control, DPR1 to DPR6, FPGA, DMA ctrl, config mem., data mem, and a segmented network]
[Figure: DPR architecture with loop management, global bus, address generators AG1 to AG4, data mem1 to mem4, multibus network, reg1, reg2, MUL1, ALU1, MUL2, ALU2]
Reconfigurable architectures (Rabaey et al.)
• Reconfiguration: change of hardware structure in the field
• Approaches at the logic level (FPGA) or at the function block level
• Dynamic change of specialization
• Example: PLEIADES template for low-power systems
The Re-configurable Terminal
Satellite Processors
Elements of Energy-Efficiency
Communication Network
Distributed Data-Driven Control
Execution of a hardware module is triggered by the arrival of tokens.
When there are no tokens to be processed at a given module, no
switching activity occurs in that module.
Design Methodology
Multi-DSP Tree Structure
A. K. Salkintzis, N. Hong and P. T. Mathiopoulos
Multi-DSP Network Structure
Data traffic is reduced with each connection
[Figure: processing blocks: multiplexing & burst construction, modulation, encryption, interleaving, channel coding, CRC insertion, data processing, sequencer, spreading, rate matching, channelization, radio resource, equalization, segmentation]
Reconfigurable Radios
SDR Configuration
• Digital Down/Up Conversion (DDC)
– Channel center
– Decimation/interpolation rates
– Compensation filters
– Matched filter α = {0.25, 0.35, ...}
• FEC
– Convolutional
– Reed-Solomon
– Concatenated coding
– Turbo CC/PC
– (De-)interleave
• Beam Forming
• Security
• Modulation Format
– QPSK
– DQPSK
– π/4 DQPSK
– {16, 64, 256, 1024} QAM
– OFDM
– OFDM CDMA
• Channel Access
– CDMA
– TDMA
• DSSS
– Rake, track, acquire
– Multi-user detection (MUD)
– ICU
• Network Interface Definition
[Figure: Soft Radio Digital Signal Processing Engine]
Key Software Radio Components
[Figure: software radio signal chain and design concerns. Recoverable labels:]
• Signal chain: multibeam antenna array, multiband RF conversion, wideband A/D & D/A conversion, IF processing, WB digital modulator and demodulator, bitstream processing, advanced control; transmit and receive paths to the larger network
• Concerns: spectral purity, environment characterization, DSP & software design, service quality, time to market
• Service quality: isochronism, throughput, response time
• On-line adaptation: SNR/BER optimization, data rate adaptation, interference suppression, band/mode selection
• Off-line support: development, optimization, over-the-air delivery
SDR Architecture
Isao TESHIMA, Hitachi Kokusai Electric Inc.
[Figure: SDR architecture. RF unit (LNA, PA, Rx SYN, Tx SYN, RX, TX, receive/transmit) and signal processing/control unit (data converter, quadrature MODEM, baseband MODEM, control, EX. interface, input/output) connected over a C-PCI bus, with an HMI terminal]
Specification of Prototype
Signal processing | FPGA: quadrature MODEM; DSP: baseband MODEM
FPGA | XCV2000E x 3
DSP | TMS320C6701 x 4
CPU | Control module: Celeron; peripheral module
System bus | cPCI
Operating system | Linux
HMI | Operates from web browser
Interface | Audio I/O, serial I/O, Ethernet (100BASE-TX)
RF range | 2~500 MHz
Waveform | SSB, AM, FM, BPSK, QPSK, 8PSK, 16QAM
Number of channels | Four full-duplex
Radio relay | Repeat/Bridge
Frequency accuracy | <0.1 ppm
Rx IF frequency | 70 MHz
Tx IF frequency | 25 MHz
Dynamic range | 14 bits
Rx IF sampling frequency | 40 MHz
Tx IF sampling frequency | 100 MHz
PACT’s SDR XPP
Martin Vorbach
PACT XPP Technologies, Germany
U-P vs XPP
A SDR/Multimedia Solution
PACT’s SDR XPP
Reconfigurable video processor for
SDRAM access optimization
(Henriss, Ernst et al.)
Reconfigurable video platform
• SDRAM memory-centered design
• FPGA-based scheduler merges different streams and random accesses, exploiting the SDRAM bank structure
• Supports 2 HDTV streams at 1.48 Gbit/s each, plus DSP and filter unit access
• Reaches 700 MByte/s in practical application for a 4-byte SDRAM memory word
• Extremely cost-efficient design
• Used in a professional video product line
Nexperia™ DVP hardware architecture
(source: Th. Claasen, Philips, DAC 2000)
New Taxonomy/Metric
• Flynn: triple (d, i, c)
  – d: # of data streams
  – i: # of instruction streams
  – c: # of configuration states
  – SISD, SIMD, MIMD, MISD
• RA: (c, g, a)
  – c: configurability to various environments
  – g: size of granularity
  – a: adaptability to various components
  – SCSG, SCMG, SCLG
  – MCSG, MCMG, MCLG
Systolic Ring
• Based on a coarse-grained configurable PE (the Dnode)
• Circular datapaths
• C: # of layers (C = 4)
• N: # of Dnodes per layer (N = 2)
• S: # of rings (S = 1)
• Control units (sequencers)
  – Local Dnode unit
  – Local Ring unit
  – Global unit
[Figure: one ring of four layers (layer 1 to layer 4), two Dnodes per layer, with a local ring sequencer]

Remanence
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
• $N_{PE}$: # of processing elements (PEs)
• $N_c$: # of PEs configurable per cycle
• $F_e$: operating frequency
• $F_c$: configuration frequency
Characterizes the dynamism:
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations
[Figure: a configuration memory and sequencing unit (inst0, inst1, … instn) running at Fc feed the processing elements, which run at Fe and are linked by interconnection and routing]
Operative Density
$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$
• $N_{PE}$: # of PEs
• $A$: core area (relative unit²)
• Area can be expressed as a function of $N_{PE}$
[Figure: configuration memory, sequencing unit (inst0, inst1, … instn), processing elements, interconnection and routing]
Remanence formalisation
• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1
$R(N_{PE}) = k \cdot N_{PE}$, with $k = C/N$
[Figure: one ring of eight layers (layer 1 to layer 8)]
[Plot: remanence (0 to 40) versus # Dnodes (0 to 180)]
Architectural model
Characterization
• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)
Control Units
• Local Dnode unit
• Local Ring unit
• Global unit
• www.qstech.com
[Figure: a global sequencer driving four local ring sequencers]
Design Space
• Best OD and remanence
• Worst interconnect resources and processing power
[Plot: operative density (0 to 0.040) and remanence (0 to 20) versus # Dnodes (0 to 140), with curves for S = 1, 2, 4 and 8]
Design Space
• Worst OD and remanence
• Best interconnect resources and processing power
[Plot: operative density (0 to 0.040) and remanence (0 to 20) versus # Dnodes (0 to 140), with curves for S = 1, 2, 4 and 8]
Comparisons of RA
Pascal BENOIT
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
Name | Type | NPE | Nc | F (MHz) | R
ARDOISE | Fine Grain RA | 2304 | 0.14 | 33 | 16457
MorphoSys | Coarse Grain RA | 128 | 16 | 100 | 8
Systolic Ring | Coarse Grain RA | 24 | 4 | 200 | 6
DART | Coarse Grain RA | 24 | 4 | 130 | 6
TMS320C62 | DSP VLIW | 8 | 8 | 300 | 1
1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (8)
3. Many cycles to (re)configure fine grain RA
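The remanence column of the comparison can be reproduced from the NPE and Nc columns. The sketch assumes that the operating and configuration frequencies are equal (Fe = Fc), which is what the listed values imply but is not stated explicitly; the function and variable names are chosen for the example.

```python
def remanence(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * Fe) / (Nc * Fc); with Fe == Fc this is N_PE / Nc."""
    return (n_pe * f_e) / (n_c * f_c)


# (N_PE, Nc) pairs from the comparison; Fe = Fc is an assumption
architectures = {
    "ARDOISE": (2304, 0.14),
    "MorphoSys": (128, 16),
    "Systolic Ring": (24, 4),
    "DART": (24, 4),
    "TMS320C62": (8, 8),
}

r_values = {name: remanence(n_pe, n_c) for name, (n_pe, n_c) in architectures.items()}
```

Read this way, R is the number of cycles needed to reconfigure the whole architecture: 1 for the VLIW DSP, a handful for the coarse-grained arrays, and thousands for the fine-grained one.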
MPSoC Clock and Power
Olivier Franza, Intel
• Increased uncertainty with process scaling
  – Process, voltage, temperature variations, noise, coupling
  – Affects design margin: over-design, power and performance loss
• Increased power constraints
  – Increasing leakage, power (density, delivery) limitations
• More transistors mean:
  – Larger clock distribution networks
  – Higher capacitance (more load and parasitics)
• With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%
  – Cross-chip communication increases
  – Clock needs multiple cycles to cover die
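To see how quickly the two trends diverge, the sketch below compounds the quoted per-generation figures. It assumes that "decreases ~25%" means a factor of 0.75 and "increases ~100%" a factor of 2 per technology generation, and that both hold unchanged over several generations; the function name and the starting ratio of 1 are arbitrary choices for the example.

```python
def wire_to_gate_ratio(generations, gate_factor=0.75, wire_factor=2.0,
                       initial_ratio=1.0):
    """Ratio of wire delay to gate delay after a number of generations."""
    ratio = initial_ratio
    for _ in range(generations):
        # Each generation: wire delay doubles while gate delay shrinks by 25%.
        ratio *= wire_factor / gate_factor
    return ratio


after_one = wire_to_gate_ratio(1)
after_three = wire_to_gate_ratio(3)
```

Under these assumptions the wire-to-gate delay ratio grows by about 2.7 times per generation and by roughly 19 times over three, which is consistent with the clock needing multiple cycles to cross the die.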
Interconnect Delays & Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology
Multiple Clocks due to
Interconnect limitation
At reduced performance,
larger resource size
Noise in Mixed Signal Systems
Multiple clock domains
• Low skew and jitter ALWAYS a must
• Clock modeling requires more accuracy
  – Within-die variations, inductance, crosstalk, electromigration, self-heat, …
• Floor plan modularity
  – Think adding/removing cores seamlessly!
• Hierarchical clock partitioning
  – Reduce global clock and possibly relax its requirements
  – Generate "locally"-used clock "locally"
  – Implement clock domain deskewing techniques
  – Bound clock problem into simple, reliable, efficient domains
DEC/Compaq Alpha
More complex core to improve performance, more complex clocks (?)
Source: DEC/Compaq; Gronowski et al., JSSC 1998; Xanthopoulos et al., ISSCC 2001; Barroso et al., ISCA 2000
Clock and Power Convergence
Intel® Itanium® Montecito
• Each core split into 3 clock domains on variable power supply
• Each domain controlled by a Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by a central PLL and aligned through phase detectors
• Regional Voltage Detector (RVD): supply voltage monitor
• Second level clock buffer (SLCB): digitally controlled delay buffer for active deskewing
• Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs
• Clock Vernier Device (CVD): digitally controlled delay buffer
Clock generation and distribution are essential enablers of microprocessor performance
On-Chip Interconnects:
Circuits and Signaling,
Wayne Burleson
• Using Vdd programmability
• High Vdd to devices on critical path
• Low Vdd to devices on non-critical
paths
• VddOff for inactive paths
A – Baseline Fabric
B – Fabric with Vdd Configurable
Interconnect
This work builds on a similar idea for FPGAs described in:
Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International
Conference on Computer-Aided Design, Nov. 2004
From Spaghetti Wires to NoC
Marcello Coppola, STMicroelectronics
Benchmarks, EE Times, 7/2005
• Xpipes (Bologna and Stanford): compared with an AMBA AHB multilayer bus, 21% faster but worse latency
• Wehn, Univ. of Kaiserslautern: LDPC decoder: 500 MHz vs. 64 MHz (fixed bus), but 30 W vs. 700 mW and twice the die size
• Arteris: better die size, comparable power consumption, 740 MHz (250 MHz)
• SonicsMX: power-efficient mobile handset with power management
• STNoC, Spidergon: topology with degree 2-3
NoC Applications
http://www.eit.uni-kl.de/wehn
• Turbo decoder, UMTS compliant, 100 Mbit: large flexibility with 14 parallel units, area = 16.84 mm2 (14 mm2 PUs, 2.8 mm2 NoC)
• LDPC decoding, T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, Int. Conference on VLSI Design 2005
– 1024-bit block size, 1.2 Gb/s, R = 0.75
– NoC: 5x5 2D mesh, dimension-order routing, large flexibility
– 160 nm CMOS technology, 1.8 V, 500 MHz, 110 mm2, ~30 W
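Dimension-order routing on a 2D mesh can be sketched in a few lines. The version below is the common XY variant, chosen here as an assumption since the slide does not say which dimension is resolved first; the coordinate convention, the function name and the bounds check are likewise illustrative.

```python
def xy_route(src, dst, size=5):
    """Return the list of mesh nodes visited from src to dst, X first, then Y."""
    for x, y in (src, dst):
        if not (0 <= x < size and 0 <= y < size):
            raise ValueError("node outside the mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:            # first correct the X coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:            # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (4, 3))
hops = len(route) - 1
```

Every packet between the same pair of nodes takes the same path, and the hop count is the Manhattan distance between them, which keeps the router logic simple.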
Reliable design,
G. De Micheli
1. Manufacturing imperfections: more likely to happen as lithography scales down
2. Approximations during design: uncertainty about details of design
3. Aging: oxide breakdown, electromigration
4. Environment-induced soft errors (data corruption due to external radiation exposure), electromagnetic interference
5. Operating-mode induced: extremely low voltage supply
Dealing with variability
• Most variability problems induce timing errors:
1. Power supply variation
2. Wire length estimation
3. Crosstalk
4. Soft errors
Adaptive low-power
transmission scheme
Frédéric Worm, Patrick Thiran, Giovanni De Micheli, and Paolo Ienne. Self-calibrating Networks-on-Chip. In Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
Reduced Energy Consumption