- VLSI Computation Lab

Transcript - VLSI Computation Lab

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles
Zhiyi Yu, Bevan Baas
VLSI Computation Lab, ECE Department
University of California, Davis, USA
Outline
• Introduction
• Timing issues
• Scalability issues
• A design example
Tile-based Chip Multiprocessors
and GALS Clocking Style
• Chip multiprocessors
– High performance due to parallel computing
– Potentially high energy efficiency, since surplus performance
allows the clock frequency and supply voltage to be reduced
• Tile-based architecture
– Highly scalable
• Globally Asynchronous Locally Synchronous
– Simplified clock tree design
– High energy efficiency from adaptive clock/voltage
scaling for each module
Tile-based GALS
Chip Multiprocessors
[Figure: globally synchronous vs. GALS tile arrays; labels: clock, osc]
• Globally synchronous vs. GALS
– Tile-based GALS chip multiprocessors have nearly
perfect scalability
Hierarchical Physical Design Flow
• Three steps
– Oscillator
– Single processor
– Entire chip
• Chip array is assembled by a simple tiling of processors
[Figure: hierarchical design flow, applied in turn to the oscillator (OSC), a single processor, and the entire chip. Stages shown: HDL (Verilog or VHDL), synthesis, floorplan, placement, clock tree insertion, in-place optimization, route, layout edit, final layout (ready for fabrication)]
The Challenges
• Timing issues
– Boundaries between clock domains
[Figure: processors P1 and P2 with clocks clk1 and clk2]
• Scalability issues
– The most important global signal (clock) is avoided
– But the clock might not be the only global signal
[Figure: programming and configuration signals reaching P1, P2, P3]
Outline
• Introduction
• Timing issues
• Scalability issues
• A design example
Two Methods to
Cross the GALS Clock Domains
• Single transaction handshake
– Each data word is acknowledged before a subsequent transfer
[Figure signals: request (src), ack (des), data (src), receive (des)]
• Coarse grain flow control
– Data words are transmitted without an individual acknowledgement
[Figure signals: valid (src), full (des), data (src), clk_s (src)]
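The throughput difference between the two methods above can be illustrated with a toy cycle-count model. This is a sketch under stated assumptions, not the circuit from the slides: the function names and the `sync_latency` parameter (cycles for a signal to cross a clock boundary) are hypothetical, both clocks are treated as running at the same rate, and the destination is assumed never to assert "full".

```python
def handshake_cycles(words: int, sync_latency: int) -> int:
    """Single-transaction handshake (toy model).

    Every word waits for its request to cross to the destination and for
    the ack to cross back before the next word may start.
    """
    per_word = 2 * sync_latency  # request crossing + ack crossing
    return words * per_word


def coarse_grain_cycles(words: int, sync_latency: int) -> int:
    """Coarse-grain flow control (toy model).

    One word is sent per source clock cycle; the crossing latency is paid
    only once, as pipeline fill, because words are not acknowledged one by one.
    """
    if words == 0:
        return 0
    return sync_latency + words


if __name__ == "__main__":
    # Hypothetical numbers: 100 words, 2-cycle boundary crossing.
    print(handshake_cycles(100, 2), coarse_grain_cycles(100, 2))
```

Under these assumptions the handshake cost grows with twice the crossing latency per word, while the coarse-grain cost approaches one cycle per word for long transfers.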
Overview of Timing Issues
[Figure: Proc. A and Proc. B, each with an input FIFO, a clk tree, and its own clk domain (Clk A, Clk B). Labelled links: A to B clock, A to B signals, B to A signals, and the corresponding B to C clock and signals toward the next processor]
• Signal categories between processor A and B
– A to B clock, synchronizing the source signals
– A to B signals, data and other signals such as “valid”
– B to A signals, such as “ready” or “hold” signals
• Each processor contains two clock domains
Two Clock Domains
within One Processor
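[Figure: dual-clk FIFO in front of the processor. Write side: write data (data_in, valid_in), clk_upstrm, write logic. Read side: read logic, read data, clk_dnstrm. An SRAM sits between the two sides, with a configurable (cfg.) synchronizer (syn.) at the boundary]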
• Use a dual-clock FIFO to handle the unrelated
read and write clocks within one processor
• Multiple flip-flops are inserted at the clock
domain boundary as a configurable synchronizer
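A behavioral sketch of a configurable synchronizer is shown here for illustration. It assumes the synchronizer is a simple chain of flip-flops clocked by the receiving domain whose depth is chosen at configuration time; the class and parameter names are hypothetical, and the model only captures the latency of the chain, not metastability itself.

```python
class ConfigurableSynchronizer:
    """Chain of flip-flops clocked by the receiving clock domain.

    `stages` is the configurable depth (an assumed parameter, not taken
    from the slides). More stages give a metastable value more time to
    settle, at the cost of extra latency.
    """

    def __init__(self, stages: int = 2, reset_value: int = 0):
        if stages < 1:
            raise ValueError("need at least one flip-flop")
        self.regs = [reset_value] * stages

    def clock(self, async_in: int) -> int:
        """Apply one rising edge of the receiving clock.

        The input is shifted into the first flop and every other flop takes
        its predecessor's old value; the last flop's new value is returned.
        """
        self.regs = [async_in] + self.regs[:-1]
        return self.regs[-1]


if __name__ == "__main__":
    sync = ConfigurableSynchronizer(stages=3)
    print([sync.clock(1) for _ in range(4)])
```

With three stages, a change on the input is visible at the output on the third receiving clock edge.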
Inter-processor Timing Issues: Three Communication Methods
[Figure: waveforms of send data, send valid, and send clk for methods (a), (b), and (c); bar chart of relative comm. power (axis 0 to 1) for method (a), method (b), method (c)]
• (a): sends clock only when there is valid data
• (b): sends clock one cycle earlier and one cycle
later than the valid data
• (c): always sends clock
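The three clock-sending policies can be compared by counting how many cycles the forwarded clock is active for a given pattern of valid data. The sketch below is illustrative only: the function name and the example pattern are hypothetical, and the count of active clock cycles is just a rough proxy for clock-wire switching activity, not the relative power shown in the chart.

```python
def clock_enable(valid: list[int], method: str) -> list[int]:
    """Return, per cycle, whether the source forwards its clock (1) or not (0).

    valid  -- 1 in cycles where the source has valid data to send
    method -- "a", "b", or "c", following the labels on the slide
    """
    n = len(valid)
    if method == "a":
        # Clock only in cycles that carry valid data.
        return list(valid)
    if method == "b":
        # Clock also one cycle before and one cycle after valid data.
        return [1 if any(valid[max(0, i - 1): i + 2]) else 0 for i in range(n)]
    if method == "c":
        # Clock in every cycle.
        return [1] * n
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    pattern = [0, 0, 1, 1, 0, 0, 0, 0]  # hypothetical burst of two words
    for m in "abc":
        print(m, sum(clock_enable(pattern, m)))
```

For sparse traffic the active-cycle count of method (b) stays close to that of method (a), while method (c) is independent of traffic.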
Timing Waveform of the
Inter-processor Communication
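[Figure: waveforms of Send clk, Send data, Rec. clk, and Rec. data, with clock delay Dclk and data delay Ddata. Sampling Rec. data directly gives a timing violation; with the delayed data (DLY Rec. data, extra delay DDLY) the sample time falls where there is no timing violation]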
Circuitry for
Inter-processor Communication
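[Figure: Proc. A (output logic, DLY on out) sends the write signal (data, valid) and clk_upstrm to Proc. B (DLY on in, clk tree, FIFO SRAM, input logic on clk_dnstrm). Data path delays: Ddata_A, Ddata_w, Ddata_B. Clock path delays: Dclk_A, Dclk_w, Dclk_B. FIFO_full_out from Proc. B returns to Proc. A as FIFO_full_in]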
• Insert configurable DLY logic in the data path
– Compensates for the additional clock tree delay and avoids
the timing violation
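How a DLY setting might be chosen can be sketched with a simple timing check. The model below is an assumption, not the authors' procedure: `d_clk` and `d_data` stand for the total clock-path and data-path delays, all times are integers in picoseconds, each data word is taken to be stable for one clock period at the receiving register, and the setup and hold values and the list of available DLY settings are hypothetical.

```python
from typing import Iterable, Optional


def sample_phase(d_clk: int, d_data: int, dly: int, period: int) -> int:
    """Time from the start of a data word at the receiving register
    to the next arriving clock edge (all values in ps)."""
    return (d_clk - (d_data + dly)) % period


def pick_dly(d_clk: int, d_data: int, period: int,
             setup: int, hold: int,
             settings: Iterable[int]) -> Optional[int]:
    """Return the smallest DLY setting for which the clock edge falls inside
    the data-valid window with setup and hold margin, or None if none fits."""
    for dly in sorted(settings):
        phase = sample_phase(d_clk, d_data, dly, period)
        # The edge must come at least `setup` after the word starts and
        # at least `hold` before the word ends.
        if setup <= phase <= period - hold:
            return dly
    return None


if __name__ == "__main__":
    # Hypothetical numbers: clock path 900 ps, data path 850 ps, 2 ns period.
    print(pick_dly(d_clk=900, d_data=850, period=2000,
                   setup=200, hold=100,
                   settings=[0, 100, 200, 300, 400, 500]))
```

In this example the undelayed data changes only 50 ps before the clock edge, which violates setup; a 200 ps delay moves the transition safely past the edge.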
Inter-chip Communication
[Figure: data and clk passing from Chip 1 to Chip 2, with path delays DA1, DA2, D12, DB2, DB1]
• Inter-chip communication shares similar features
with inter-processor communication
• The path is longer and the timing is more complex
– The output processor might need a low-speed clock
– The destination processor can operate at full speed
Outline
• Introduction
• Timing issues
• Scalability issues
• A design example
Special Signals Besides Clock
• Eliminating the global clock enhances
scalability, but other signals must also be
addressed to maintain high scalability
– Global signals such as configuration and test
signals
– Power distribution
– Processor IO pins
• Key idea: avoid or isolate all global signals if
possible, so multiple processors can be directly
tiled without further changes
Clocking and Buffering of
Global Signals
• Some global signals, such as configuration and test, are unavoidable
• Three options:
– Pipeline these signals
– Asynchronous interconnect
– Use a low speed clock, and buffer signals in each processor
[Figure labels: slow clock, global signals]
Complete Power Distribution for
Each Processor
[Figure: power distribution for one tile. Labels: Metal1, Metal 2, Metal5, and Metal6 straps; OSC power ring and OSC power grid (OSC grid not connected to main grid); tile boundary; core boundary]
Position of IO Pins
• Position of IO pins is
important since they
must connect to other
processors
• Align IO pins with
each other so that
connecting wires are
very short
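The alignment requirement can be expressed as a simple check on pin coordinates. This is a hypothetical sketch: the pin names, offsets, and the idea of describing each tile edge as a name-to-offset table are assumptions made for illustration, not part of the design flow described in the slides.

```python
def misaligned_pins(east_pins: dict[str, int],
                    west_pins: dict[str, int],
                    links: dict[str, str]) -> list[str]:
    """Return the east-edge pins whose partner pin on the neighbouring
    tile's west edge sits at a different offset along the shared edge.

    east_pins / west_pins -- pin name -> offset along the tile edge
    links                 -- east-edge pin name -> west-edge pin it connects to
    """
    return [
        out_pin
        for out_pin, in_pin in links.items()
        if east_pins[out_pin] != west_pins[in_pin]
    ]


if __name__ == "__main__":
    # Hypothetical pin tables for two abutting copies of the same tile.
    east = {"data_out": 40, "valid_out": 55, "clk_out": 70}
    west = {"data_in": 40, "valid_in": 55, "clk_in": 72}
    links = {"data_out": "data_in", "valid_out": "valid_in", "clk_out": "clk_in"}
    print(misaligned_pins(east, west, links))
```

An empty result means every connection can be made with a straight, very short wire when two tiles are placed side by side.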
Outline
• Introduction
• Timing issues
• Scalability issues
• A design example
An Asynchronous Array of
simple Processors (AsAP)
• Single-chip tile-based 6 x 6 GALS multiprocessor
– Simple architecture & small memories for each processor
– Nearest neighbor interconnect between processors
• Targets computationally intensive DSP apps
[Figure: single processor block diagram with OSC, FIFO 0, FIFO 1, CPU, and Output]
Back-end Design Flow
• Standard cell based design flow was used
– RTL coding
– Synthesis
– Placement & Routing
• Intensive verification was used throughout the design process
– Gate level analysis
– Circuit level simulation
– DRC/LVS
– Formal verification
[Figure: design procedure (HDL (Verilog or VHDL), synthesis, synthesized Verilog, P&R, final Verilog, layout edit, final layout ready for fabrication) alongside verification procedure (formal verification, gate level dynamic & static timing analysis, DRC and LVS, Spice netlist, Spice level simulation)]
Chip Micrograph
Technology: TSMC 0.18 µm
Max speed: 475 MHz @ 1.8 V
Area:
– 1 Proc: 0.66 mm²
– Chip: 32.1 mm²
Power (1 Proc @ 1.8 V, 475 MHz):
– Typical application: 32 mW
– Typical 100% active: 84 mW
Power (1 Proc @ 0.9 V, 116 MHz):
– Typical application: 2.4 mW
[Figure: chip micrograph with a single processor outlined]
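A few derived figures follow from the numbers above by simple arithmetic. The sketch below uses only the values reported on this slide plus the 6 x 6 array size; the variable and function names are hypothetical, and the comparison with quadratic voltage scaling refers to the textbook ideal for dynamic energy, not to a claim made in the slides.

```python
PROCESSORS = 6 * 6            # 6 x 6 array
AREA_PER_PROC_MM2 = 0.66      # single processor area
CHIP_AREA_MM2 = 32.1          # whole chip area


def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """mW divided by MHz gives nJ per cycle; multiply by 1000 for pJ."""
    return power_mw / freq_mhz * 1000.0


array_area = PROCESSORS * AREA_PER_PROC_MM2   # area taken by the 36 tiles
array_share = array_area / CHIP_AREA_MM2      # fraction of the chip

e_hi = energy_per_cycle_pj(32, 475)    # typical application, 1.8 V
e_lo = energy_per_cycle_pj(2.4, 116)   # typical application, 0.9 V
measured_ratio = e_hi / e_lo
ideal_ratio = (1.8 / 0.9) ** 2         # ideal V^2 scaling of dynamic energy

if __name__ == "__main__":
    print(f"array area {array_area:.2f} mm2 ({array_share:.0%} of chip)")
    print(f"energy/cycle: {e_hi:.1f} pJ at 1.8 V, {e_lo:.1f} pJ at 0.9 V")
    print(f"reduction {measured_ratio:.2f}x vs ideal {ideal_ratio:.0f}x")
```

By this arithmetic the 36 processors occupy about 23.8 mm² of the 32.1 mm² chip, and energy per cycle for a typical application drops from about 67 pJ to about 21 pJ when the supply is halved.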
Summary
• GALS tile-based chip multiprocessor is an
attractive architecture
– High performance, energy efficient, and highly scalable
• Timing issues
– Multiple clock domains within a single processor
– Inter-processor communication
– Inter-chip communication
• Scalability issues
– “Global” signals
– Power distribution
– IO pins
Acknowledgments
• Funding
– Intel Corporation
– UC Micro
– NSF Grant No. 0430090
– UCD Faculty Research Grant
• Special Thanks
– E. Work, D. Truong, W. Cheng, T. Jacobson,
T. Mohsenin, R. Krishinamurthy, M. Anders, and
S. Mathew
– MOSIS
– Artisan
