Low-Latency Virtual-Channel Routers for On

Download Report

Transcript Low-Latency Virtual-Channel Routers for On

Low-Latency Virtual-Channel
Routers for On-Chip
Networks
Robert Mullins, Andrew West, Simon Moore
Presented by
Sailesh Kumar
Outline

Motivation
» Why Network-on-chip (NoC)

Comparison to Packet Networks
» Similarities
» Differences
» Design Constraints

Topology and Routing/Switching techniques for NoC
» Mesh, fat-tree, honey-comb
» Greedy, Deflection, Wormhole, Virtual-Channels

Start with the paper – Design of a Low-Latency
Virtual-Channel Router
‹#› - Sailesh Kumar - 4/11/2017
Why NoC

Billion transistor era has arrived
» Several such SoC are in pipeline, Inter-connection is critical

A generic inter-connection architecture ensures
» Reduced design time
» IP reuse
» Predictable backend (versus ad-hoc wiring)

Bus based inter-connects were sufficient until now

But not now
» Shared bus is slow (arbitrates between several requesters)
» More components increase loading => speed drops further
» Ad-hoc routing of wires results in backend complications,
lower performance and higher power consumption
‹#› - Sailesh Kumar - 4/11/2017
Why NoC (cont)

Recently Dally proposed an idea
» “Route packets not wires” as in data networks
– Point to point communication
» Point to point links are faster

Create a chip wide network (Like a regular IP WAN)
» A router at every node
» Links connecting all routers
» Messages encapsulated in packets, which are routed

Challenges
» Topologies, Routing protocol
» Network and router design with small footprint and low
latency
‹#› - Sailesh Kumar - 4/11/2017
Some more motivations

The need to put repeaters into long wires allows us to
add the switching needed to implement a network at
little additional cost

Makes efficient use of critical global wiring resources
by sharing them across different senders and
receivers

Simplifies overall design

Design a single router and do copy-paste in both
dimension
‹#› - Sailesh Kumar - 4/11/2017
A typical NoC node




Layered Design of
reconfigurable
micronetworks.
Exploits methods and
tools used for general
network.
Micronetworks based on
the ISO/OSI model.
NoC architecture
consists of Physical,
Data link, and Network
layers.
‹#› - Sailesh Kumar - 4/11/2017
A typical NoC




Layered Design of
reconfigurable
micronetworks.
Exploits methods and
Implemented in
tools used for general cores, enables end-toend reliable transport
network.
Micronetworks based on
the ISO/OSI model.
NoC architecture
consists of Physical,
Data link, and Network
layers.
‹#› - Sailesh Kumar - 4/11/2017
A typical NoC




Layered Design of
reconfigurable
micronetworks.
Exploits methods and
Implemented in
tools used for general cores, enables end-toend reliable transport
network.
Micronetworks based on
the ISO/OSI model.
Multi hop route
NoC architecture
setup, packet
addressing, etc
consists of Physical,
Data link, and Network
layers.
‹#› - Sailesh Kumar - 4/11/2017
A typical NoC




Layered Design of
reconfigurable
micronetworks.
Exploits methods and
Implemented in
tools used for general cores, enables end-toend reliable transport
network.
Micronetworks based on
the ISO/OSI model.
Multi hop route
NoC architecture
setup, packet
addressing, etc
consists of Physical,
Data link, and Network
Contention issues,
layers.
reliability issues,
‹#› - Sailesh Kumar - 4/11/2017
grouping of physical
layer bits, e.g. “flits”
A NoC topology




Cores Communicates
With Each Other Using
NoC
NoC Consists of
Routers (R) and
Network Interfaces
(NI)
A NI linked to Router
by Non-Pipelined Wires
One or More Cores
Connected to a NI
‹#› - Sailesh Kumar - 4/11/2017
Another NoC topologies
Fat tree
Mesh
‹#› - Sailesh Kumar - 4/11/2017
Multi hop route setup, packet
addressing, etc
Routing protocols

We will only consider mesh topology

Objective is to find a path from
a source to a destination

Greedy Algorithms (deterministic)
» Choose shortest path (e.g. X-Y)

Adaptive routing
» If congestion, choose alternative path
» Deflection routing
» Is adaptive better than greedy => NOT REALLY (when only
local information is used)

Adaptive routing can also result in livelock
‹#› - Sailesh Kumar - 4/11/2017
Switching techniques

Circuit Switching: A control message is sent from source to
destination and a path is reserved. Communication starts. The
path is released when communication is complete.


Store-and-forward policy (Packet Switching): each
switch waits for the full packet to arrive in switch
before sending to the next switch
Cut-through routing or worm hole routing: switch
examines the header, decides where to send the
message, and then starts forwarding it immediately
» In worm hole routing, when head of message is blocked,
message stays strung out over the network, potentially
blocking other messages (Needs only buffer the piece of the
packet that is sent between switches).
» Cut through routing lets the tail continue when head is
blocked, storing the whole message into an intermediate
switch. (Need buffer large enough to hold the largest packet).
‹#› - Sailesh Kumar - 4/11/2017
Wormhole Routing – Good fit for NoC

Wormhole routing is good for NoC
» Low latency
» Less buffering requirements

Suffers from deadlock
[example from Li and McKinley,
IEEE Computer v26n2, 1993]
‹#› - Sailesh Kumar - 4/11/2017
Adding Virtual Channels

With virtual channels, deadlock can be avoided

Move message and reply on different channels =>
Will never have loop on a single channel
‹#› - Sailesh Kumar - 4/11/2017
Packet switches
from lo to hi channel
Designing Virtual Channel Routers

Design Constraints in NoC
» Minimize Latency
» Minimize Buffering
» Minimal footprint

Can exploit far greater number of pins and wires

May use fat data and flow control wires

Objective: Design routers with minimal latency
» This will also result in smaller buffers

This paper presents design of a low latency router
» Cycle time of 12 FO4
» Single cycle routing/switching
‹#› - Sailesh Kumar - 4/11/2017
Designing Virtual Channel Routers

Design Constraints in NoC
» Minimize Latency
» Minimize Buffering
» Minimal footprint

Can exploit far greater number of pins and wires

May use fat data and flow control wires

Objective: Design routers with minimal latency
» This will also result in smaller buffers

This paper presents design of a low latency router
» Cycle time of 12 FO4
» Single cycle routing/switching
‹#› - Sailesh Kumar - 4/11/2017
A Virtual Channel Router
‹#› - Sailesh Kumar - 4/11/2017
Designing Virtual Channel Routers
Every VC of every
input port has buffers
to hold arriving flits
Arriving flits are
placed into the buffers
of corresponding VC
‹#› - Sailesh Kumar - 4/11/2017
Designing Virtual Channel Routers
Every VC of every
input port has buffers
to hold arriving flits
Routing logic assigns
set of outgoing VC on
which flit can go
Arbitrates between
competing input VC &
allocates output VC
Arriving flits are
placed into the buffers
of corresponding VC
‹#› - Sailesh Kumar - 4/11/2017
Designing Virtual Channel Routers
Every VC of every
input port has buffers
to hold arriving flits
Routing logic assigns
set of outgoing VC on
which flit can go
Arbitrates between
competing input VC &
allocates output VC
Arriving flits are
placed into the buffers
of corresponding VC
‹#› - Sailesh Kumar - 4/11/2017
Matches successful
input ports (allocated
VC) to output ports
Flits at input VCs
getting grants are
passed to output VCs
Routing Logic

Three possibilities
» Return a single VC
» Return set of VCs on a single port
» Return any VCs

Look ahead routing
» Routing performed at the previous router
» Good for X-Y deterministic (non adaptive) routing
» A SGI routing chip first implemented it
‹#› - Sailesh Kumar - 4/11/2017
VC Allocation

Complexity of VC allocation depends on routing range

Routing returns single VC
» Needs PxV input arbiter for every outgoing VC

Routing returns multiple VC at single port
» Additional V:1 arbiter at every input VC to reduce potential
outgoing VC to 1

Routing returns any set of VCs
» Needs two cascaded PxV input arbiters

We consider multiple VC at single port case
‹#› - Sailesh Kumar - 4/11/2017
VC Allocation Logic
» At every outgoing VC following logic is needed
‹#› - Sailesh Kumar - 4/11/2017
Switch Allocation

Individual flits at input VCs arbitrate for access to the
crossbar port

Arbitration can be performed in two stages
First stage

» A VC among V possible VCs at every input port is selected
» V:1 arbiter at every input port

Second stage
» Winning VC at every input port is matched to the output port
» P:1 arbiter at every output port


This scheme doesn’t guarantee a
maximal/maximum/good matching
But simple to implement
‹#› - Sailesh Kumar - 4/11/2017
‹#› - Sailesh Kumar - 4/11/2017
Switch Allocation
Issues

VC allocation and Switch allocation are serialized


A flit will either take 2 clocks to get through
Else clock speed will be low

Solution: Speculative switch allocation
‹#› - Sailesh Kumar - 4/11/2017
Speculative Switch Allocation

Dally proposed speculative switch allocation
» Perform switch and VC allocation in parallel
» Assume that participating VC in switch allocation will get the
output VC
» If not then wasted cycle

An even better idea is to perform speculative and
non-speculative switch allocation in parallel
» Non-speculative allocation has higher priority
» Note that non-speculative allocation is done for input VCs
which has already been allocated an output VC
Speculative
will work


Mostly one cycle delay under light load
Mostly one cycle delay under heavy load
‹#› - Sailesh Kumar - 4/11/2017
Non-speculative
will work

Further Enhancement

Is it possible to have zero cycle VC/switch allocation
YES, Most of the time, that’s what this paper is about!
‹#› - Sailesh Kumar - 4/11/2017
Idea 1: Free Virtual Channel Queue

Keep queue of free VC at
every outgoing port
» Also bit mask with one set bit

Thus First stage of VC
allocation where an output
VC is selected will be
removed
‹#› - Sailesh Kumar - 4/11/2017
Idea 1: Free Virtual Channel Queue

Keep queue of free VC at
every outgoing port
» Also bit mask with one set bit

Thus First stage of VC
allocation where an output
VC is selected will be
removed
‹#› - Sailesh Kumar - 4/11/2017
Idea 2: Pre-computing arbitration decisions

If somehow, you know the arbitration results before
flits actually arrive and fight for the VC and switch

I mean, every arbitration decision
» VC allocation
» Switch allocation
» Etc

Then the router can be made to run in zero cycle
» The arriving flit route/switch in the same clock they arrive

Also, clock speed may be pretty good
» Data path and control path are no more in series

That’s what the idea 2 is.
‹#› - Sailesh Kumar - 4/11/2017
Some preliminaries before going into detail

Tree Arbiters
» Implements large arbiters using tree of small arbiters

Matrix Arbiters
» Fair and Fast arbiter implementation
‹#› - Sailesh Kumar - 4/11/2017
VC allocation using a Tree Arbiter
‹#› - Sailesh Kumar - 4/11/2017
‹#› - Sailesh Kumar - 4/11/2017
A Matrix Arbiter
Pre-computing arbitration decisions

An alternative arbiter design
‹#› - Sailesh Kumar - 4/11/2017
Pre-computing arbitration decisions

An alternative arbiter design
Generate grant
enables one cycle
prior and latch them
‹#› - Sailesh Kumar - 4/11/2017
Grants are product
of latched enables
and the requests
Pre-computing arbitration decisions

An alternative arbiter design
Generate grant
enables one cycle
prior and latch them

Grants are product
of latched enables
and the requests
Grants are generated in same clock as request arrives
» If at least one request remains
‹#› - Sailesh Kumar - 4/11/2017
Pre-computing arbitration decisions

An alternative arbiter design
Generate grant
enables one cycle
prior and latch them

Grants are product
of latched enables
and the requests
However, when no request remains, it is difficult to
generate grant enables ???
‹#› - Sailesh Kumar - 4/11/2017
Generating grant enables

Safe Environment
» Only one request may arrive in a cycle
» Thus it is safe to assert all grant enables
» Thus grant can still be generated in same cycle

Unsafe Environment
» Multiple request may arrive in same cycle
» Can still assert all grants
» But need to abort when multiple requests arrive in same cycle


All first stage V:1 arbiters operate under safe
environment
However P:1 arbiters doesn’t
‹#› - Sailesh Kumar - 4/11/2017
Generating grant enables

Even in unsafe environments, assert all grants
» May need to abort when multiple requests arrive
» Note that after an abort, a correct arbitration is ensured in the
next cycle

Why will it work?

Because in lightly loaded network, multiple requests
for same VC/port will not arrive (few aborts)
In heavily loaded network flits will remain buffered
and Non-speculative arbitration (higher priority) will
happen most of the time

» Few aborts again
‹#› - Sailesh Kumar - 4/11/2017
I will skip the design details now

Since it is confusing and complex

Will jump to critical path analysis
‹#› - Sailesh Kumar - 4/11/2017
Analysis of critical path
Generates VC/switch
grants from precomputed grant enables
‹#› - Sailesh Kumar - 4/11/2017
Analysis of critical path
Generates VC/switch
grants from precomputed grant enables
‹#› - Sailesh Kumar - 4/11/2017
Crossbar traversal is
aborted once invalid
grants are detected
Analysis of critical path
Generates VC/switch
grants from precomputed grant enables
‹#› - Sailesh Kumar - 4/11/2017
In case of an abort, the
correct control signals are
ensured in the next cycle
Crossbar traversal is
aborted once invalid
grants are detected
Final design

Control path critical delay is 12 FO4
» Until now, the best design had 20 FO4 delays

They have sampled a NoC based ASIC last week using
this idea

Runs at several GHz speeds

Note that fast cycle time is possible by
» Running VC allocation and Switch allocation in parallel
» Must use speculation, else delay will be higher (1 more cycle)
‹#› - Sailesh Kumar - 4/11/2017
Simulation results
‹#› - Sailesh Kumar - 4/11/2017
If (doubts) Then
Ask;
Else
Thank you;
Goto Discussion;
End if;
‹#› - Sailesh Kumar - 4/11/2017