Networks-on-Chip: Motivations and Architecture
Federico Angiolini
[email protected]
DEIS Università di Bologna

Why NoCs
The System Interconnect

Chips tend to have more than one “core”:
- Control processors
- Accelerators
- Memories
- I/O

How do we get them to talk to each other?
This is called the “System Interconnect”.

Traditional Answer: with Buses

- Shared bus topology
- Aimed at simple, cost-effective integration of components

[Diagram: Master 0, Master 1 and Master 2 connected to Slave 0, Slave 1 and Slave 2 through a shared bus]

Typical example: ARM Ltd. AMBA AHB
- Arbitration among multiple masters
- Single outstanding transaction allowed (sketched below)
  - If wait states are needed, everybody waits

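As a rough illustration of the single-outstanding-transaction behaviour, the toy model below grants the bus to one requesting master at a time; while the granted transfer inserts wait states, every other master simply stalls. This is a hypothetical sketch, not a cycle-accurate AMBA AHB model.

# Toy shared-bus model: one master owns the bus at a time; the wait states of
# the current transfer stall everybody else (single outstanding transaction).
# Illustrative sketch only, not an AMBA AHB arbiter.

def simulate_shared_bus(requests):
    """requests: (master_id, wait_states) transfer descriptors, in grant order.
    Returns (master_id, start_cycle, end_cycle) tuples."""
    schedule = []
    cycle = 0
    for master, wait_states in requests:
        start = cycle
        cycle += 1 + wait_states   # address cycle + data cycle(s) with wait states
        schedule.append((master, start, cycle))
    return schedule

if __name__ == "__main__":
    # Master 1 hits a slow slave (3 wait states) and delays Master 2,
    # even though Master 2 targets a fast slave.
    for master, start, end in simulate_shared_bus([(0, 0), (1, 3), (2, 0)]):
        print(f"Master {master}: cycles {start}-{end}")
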
So, Are We All Set?

Well... not really. Let’s consider two trends:
- System/architectural: systems are becoming highly parallel
- Physical: wires are becoming slower (especially in relative terms)

System/Architectural Level

Parallel systems... OK, but how much?
- CPUs: currently four cores (not so many...)
- Playstation/Cell: currently nine engines (still OK)
- GPUs: currently 100+ shaders (hey!)
- Your next cellphone: 100+ cores (!!!)
- And the trend is: double every 18 months

Multicore Testimonials 1

“We believe that Intel’s Chip Level Multiprocessing (CMP) architectures represent the future of microprocessors because they deliver massive performance scaling while effectively managing power and heat.”
White paper “Platform 2015: Intel Processor and Platform Evolution for the Next Decade”

[Image: Intel IXP2800 with 16 micro-engines and one Intel XScale core]

Multicore Testimonials 2

“The next 25 years of digital signal processing technology will literally integrate hundreds of processors on a single chip to conceive applications beyond our imagination.”
Mike Hames, senior VP, Texas Instruments

Multicore Testimonials 3

- Intel: 80-core chip shown at ISSCC 2007
- Rapport: Kilocore (1024 cores), for gaming & media; expected mid 2007
- “Focus here is on parallelism and what's referred to as multi-core technology.”
  Phil Hester, CTO, AMD

What Does This Mean for the Interconnect?

A new set of requirements!
- High performance
  - Many cores will want to communicate, fast
- High parallelism (bandwidth)
  - Many cores will want to communicate, simultaneously
- High heterogeneity/flexibility
  - Cores will operate at different frequencies, data widths, maybe with different protocols

Physical Level

- Logic becomes faster and faster
- Global wires don’t

And If We Consider a Floorplan...

[Diagram: 2 cm x 1 cm chip floorplan]

If you assume a shared bus, the wires have to go all around the chip (i.e. are very long):
- Propagation delay
- Spaghetti wiring

What Does This Mean for the Interconnect?

A new set of requirements!
- Short wiring
  - Point-to-point and local is best
- Simple, structured wiring
  - Bundles of many wires are impractical to route

System Interconnect Evolution

[Diagram: topology evolution from a traditional shared bus to hierarchical buses (multi-layer AHB with a crossbar component, masters 0-3 and slaves 0-6 split across AHB Layer 0 and AHB Layer 1), alongside an evolution from traditional to advanced protocols]

- Hierarchical buses and advanced protocols help with the issues, but do not fully solve them
- More scalable solutions needed

An Answer: Networks-on-Chip (NoCs)

[Diagram: cores (CPU, DSP, DRAM, Accel, DMA, MPEG) each attached through a Network Interface (NI) to a NoC of interconnected switches]

- Packet-based communication
- NIs convert transactions by cores into packets
- Switches route transactions across the system

First Assessment of NoCs

- High performance / high parallelism (bandwidth)
  - Yes: just add links and switches as you add cores
- High heterogeneity/flexibility
  - Yes: just design appropriate NIs, then plug in
- Short wiring
  - Yes: point-to-point, then just place switches as close as needed
- Simple, structured wiring
  - Yes: links are point-to-point, width can be tuned

Problem Solved?

- Maybe, but... buses excel in simplicity, low power and low area
- When designing a NoC, remember that tradeoffs will be required to keep those under control
- Not all designs will require a NoC, only the “complex” ones

How to Design NoCs
How to Make NoCs Tick

- A NoC is a small network
  - Many of the same architectural degrees of freedom
- Some problems are less stringent
  - Static number of nodes
  - (Roughly) known traffic patterns and requirements
- Some problems are much tougher
  - MANY fewer resources to solve problems
  - Latencies of nanoseconds, not milliseconds
- But... what characterizes a network?

Key NoC Properties

- Topology
- Routing policy (where)
- Switching policy (how)
- Flow control policy (when)
- Syn-, asyn- or meso-chronicity
- ...and many others!

Huge design space

NoC Topologies

Must comply with demands of…
- performance (bandwidth & latency)
- area
- power
- routability

Can be split in…
- direct: a node connected to every switch
- indirect: nodes connected to a specific subset of switches

NoC Topology Examples: Hypercubes

[Diagram: hypercubes from 0-D through 4-D]

- Compositional design
- Example: hypercube topologies
  - Arrange N = 2^n nodes in an n-dimensional cube
  - At most n hops from source to destination (sketched below)
  - High bisection bandwidth
    - good for traffic (but can you use it?)
    - bad for cost [O(n^2)]
  - Exploit locality
  - Node size grows
    - as n [input/output cardinality]
    - as n^2 [internal crossbar]
  - Adaptive routing may be possible

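A hypercube route can be derived bit by bit: the hop count between two nodes is the Hamming distance of their binary addresses, which is at most n. The sketch below (my own illustration, not from the slides) fixes one differing address bit per hop, i.e. dimension-order (e-cube) routing.

# Hypercube routing sketch: nodes are numbered 0 .. 2^n - 1 and two nodes are
# linked iff their addresses differ in exactly one bit. Hop count equals the
# Hamming distance; fixing one differing bit per hop gives a minimal route.

def hypercube_route(src, dst):
    """Return the list of nodes visited from src to dst, correcting the
    differing address bits from least to most significant."""
    path, node = [src], src
    diff, bit = src ^ dst, 0
    while diff:
        if diff & 1:
            node ^= 1 << bit          # traverse the link along dimension 'bit'
            path.append(node)
        diff >>= 1
        bit += 1
    return path

if __name__ == "__main__":
    # 4-D hypercube (16 nodes): at most 4 hops between any pair of nodes.
    print(hypercube_route(0b0000, 0b1011))    # [0, 1, 3, 11] -> 3 hops
    print(bin(0b0000 ^ 0b1011).count("1"))    # Hamming distance = hop count = 3
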
NoC Topology Examples: Multistage Topologies

- Need to fix hypercube resource requirements
- Idea: unroll hypercube vertices
  - switch sizes are now bounded, but
  - loss of locality
  - more hops
  - can be blocking; non-blocking with even more stages

NoC Topology Examples: k-ary n-cubes (Mesh Topologies)

- Alternate reduction from the hypercube: restrict to a < log2(N)-dimensional structure
- e.g. mesh (2-cube), 3D mesh (3-cube)
- Matches the structure of the physical world and allows for locality
- Bounds the degree at each node
- Even more bottleneck potential

[Diagram: 2D mesh]

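For a 2D mesh the canonical static routing choice is dimension-order (XY) routing: travel along X until the destination column is reached, then along Y. The snippet below is a minimal sketch under that assumption, not something prescribed by the slides.

# Dimension-order (XY) routing on a 2D mesh: X dimension first, then Y.
# Being static and turn-restricted, it is simple to implement and deadlock-free.

def xy_route(src, dst):
    """src, dst: (x, y) coordinates; returns the list of hops taken."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                     # walk the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                     # then walk the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

if __name__ == "__main__":
    # On a 4x4 mesh, (0, 0) -> (3, 2) takes 3 + 2 = 5 hops.
    print(xy_route((0, 0), (3, 2)))
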
NoC Topology Examples: Torus

- Need to improve mesh performance
- Idea: wrap around the n-cube ends
  - 2-cube → cylinder
  - 3-cube → donut
- Halves the worst-case hop count (see the comparison below)
- Can be laid out reasonably efficiently
  - maybe 2x cost in channel width?

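The wrap-around links mean that, along each dimension, the distance is the shorter of the direct and the wrapped path, which roughly halves the worst-case hop count with respect to a mesh. A small comparison, as an illustrative sketch:

# Mesh vs. torus hop counts on a k-ary 2-cube: the torus wrap-around lets each
# dimension take the shorter of the direct and the wrapped direction.

def mesh_hops(src, dst):
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])

def torus_hops(src, dst, k):
    return sum(min(abs(d - s), k - abs(d - s)) for s, d in zip(src, dst))

if __name__ == "__main__":
    k = 8
    nodes = [(x, y) for x in range(k) for y in range(k)]
    print(max(mesh_hops((0, 0), n) for n in nodes))      # 14: 8x8 mesh worst case
    print(max(torus_hops((0, 0), n, k) for n in nodes))  # 8: 8x8 torus worst case
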
NoC Topology Examples: Fat-Tree Topologies

- Fatter links (actually: more of them) as you go to the root, so bisection bandwidth scales

NoC Routing Policies

- Static
  - e.g. source routing or coordinate-based
  - simpler to implement and validate
- Adaptive
  - e.g. congestion-based
  - potentially faster
  - much more expensive
  - allows for out-of-order packet delivery
    - possibly a bad idea for NoCs
- Huge issue: deadlocks

Deadlocks

[Diagram: three packets A, B, C holding channels in a cycle]

- A would like to talk to C, B to A, C to B
- Everybody is stuck!! (a tiny cycle check is sketched below)
- Showstopper problem
  - avoid by mapping: no route loops
  - avoid by architecture: e.g. virtual channels
  - provide deadlock recovery
- Critical for adaptive routing
  - livelocks also possible

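The three-node example is a classic cyclic wait: each packet holds a channel while requesting the next one on its route, and the requests form a loop. The sketch below is an illustrative check for such a cycle in a "holds -> waits-for" relation; it is not taken from the slides.

# Deadlock sketch: each in-flight packet holds one channel and waits for the
# next channel on its route. If the holds -> waits-for relation contains a
# cycle, nobody can ever advance.

def has_cycle(waits_for):
    """waits_for: dict mapping a held channel to the channel it waits for."""
    for start in waits_for:
        seen, node = set(), start
        while node in waits_for:
            if node in seen:
                return True
            seen.add(node)
            node = waits_for[node]
    return False

if __name__ == "__main__":
    # A holds channel A->B and waits for B->C; B waits for C->A; C waits for A->B.
    deadlocked = {"A->B": "B->C", "B->C": "C->A", "C->A": "A->B"}
    print(has_cycle(deadlocked))                          # True: everybody is stuck
    print(has_cycle({"A->B": "B->C", "B->C": "C->out"}))  # False: C can drain
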
NoC Switching Policies

- Packet switching (zero-load latencies compared in the sketch below)
  - maximizes global network usage dynamically
  - store-and-forward
    - minimum logic, but higher latency, needs more buffers
  - wormhole
    - minimum buffering, but deadlock-prone, induces congestion
- Circuit switching
  - optimizes specific transactions
    - no contention, no jitter
  - requires handshaking
    - may fail completely
    - setup overhead

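A back-of-the-envelope comparison of the two packet-switching options: under store-and-forward every hop waits for the whole packet, while wormhole lets the flits pipeline across hops. The numbers below are made up for illustration and ignore contention and router pipeline depth.

# Zero-load latency sketch (in cycles) for a packet of L flits crossing H hops,
# with one flit transferred per link per cycle and no contention.

def store_and_forward_latency(hops, packet_flits):
    # Each switch receives and buffers the entire packet before forwarding it.
    return hops * packet_flits

def wormhole_latency(hops, packet_flits):
    # The header flit pipelines through the switches; the body streams behind it.
    return hops + packet_flits - 1

if __name__ == "__main__":
    hops, flits = 5, 16
    print(store_and_forward_latency(hops, flits))   # 80 cycles
    print(wormhole_latency(hops, flits))            # 20 cycles
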
Virtual Channels

Performance improvement using virtual channels

[Diagram: nodes 1-5 carrying packets A and B; when A blocks, a second virtual channel lets B keep advancing towards the destination of B instead of waiting behind A]

NoC Flow Control Policies

- We need it because...
  - Sender may inject bursty traffic
  - Receiver buffers may fill up
  - Sender and receiver may operate at different frequencies
  - Arbitrations may be lost
- How?
  - TDMA: pre-defined time slots
  - Speculative: send first, then wait for confirmation (acknowledge - ACK)
  - Conservative: wait for token, then send (credit-based; sketched below)
- Remember... links may be pipelined

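The conservative (credit-based) option can be pictured as follows: the sender holds one credit per free slot in the receiver's buffer, spends a credit for every flit it sends, and earns it back when the receiver drains a slot. A minimal sketch, with a hypothetical CreditLink class:

# Credit-based flow control sketch: the sender never transmits unless it holds
# a credit, i.e. unless a receiver buffer slot is guaranteed to be free.

from collections import deque

class CreditLink:
    def __init__(self, buffer_slots):
        self.credits = buffer_slots      # one credit per free receiver slot
        self.rx_buffer = deque()

    def send(self, flit):
        if self.credits == 0:
            return False                 # sender must wait: no buffer space left
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def receiver_drain(self):
        if self.rx_buffer:
            self.rx_buffer.popleft()
            self.credits += 1            # credit (token) returned to the sender

if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    print([link.send(f) for f in "abc"])  # [True, True, False]: buffer is full
    link.receiver_drain()                 # one slot freed, credit returned
    print(link.send("c"))                 # True
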
Example: ACK/NACK Flow Control

[Diagram: flits travelling along a pipelined link, with ACK/NACK signals propagating back to the sender]

- Transmission and buffering
- ACK/NACK propagation
- Memory deallocation
- Retransmission
  - Go-back-N (sketched below)

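With ACK/NACK flow control the sender keeps every in-flight flit buffered until it is acknowledged; on a NACK it goes back and retransmits from the rejected flit onward. A rough sketch of that sender-side behaviour, with hypothetical names; it is not the actual xpipes logic:

# ACK/NACK (go-back-N) sender sketch: transmitted flits stay buffered until an
# ACK deallocates them; a NACK rewinds transmission to the rejected flit.

from collections import deque

class GoBackNSender:
    def __init__(self, flits):
        self.flits = list(flits)
        self.next_to_send = 0            # index of the next flit to transmit
        self.unacked = deque()           # indices in flight, awaiting ACK/NACK

    def transmit(self):
        if self.next_to_send < len(self.flits):
            self.unacked.append(self.next_to_send)
            self.next_to_send += 1

    def on_ack(self):
        self.unacked.popleft()           # memory deallocation

    def on_nack(self):
        # The oldest in-flight flit (and everything sent after it) was dropped:
        # go back and retransmit from that point.
        self.next_to_send = self.unacked[0]
        self.unacked.clear()

if __name__ == "__main__":
    s = GoBackNSender("ABCDE")
    for _ in range(3):
        s.transmit()                     # A, B, C in flight
    s.on_ack()                           # A accepted and deallocated
    s.on_nack()                          # B rejected: go back and resend B, C, ...
    s.transmit()
    print(s.flits[s.unacked[0]])         # 'B' is on the wire again
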
NoC Timing: Synchronous

- Flip-flops everywhere, clock tree
- Much more streamlined design
- Clock tree burns 40% of the power budget, plus the flip-flops themselves
- Not easy to integrate cores at different speeds
- Increasingly difficult to constrain skew and process variance
- Worst-case design

NoC Timing: Asynchronous

- Potentially allows for data to arrive at any time, solves process variance etc.
- Average-case behaviour
- Lower power consumption
- Maximum flexibility in IP integration
- More secure for encryption logic
- Less EMI
- Much larger area
- Can be much slower (if really robust)
  - Two-way handshake removes the “bet” of synchronous logic
  - Intermediate implementations exist
- Much tougher to design

NoC Timing: Mesochronous

- Attempts to optimize latency of long paths
- Everybody uses the same clock
- Senders embed their clock within packets
- Data is sent over long links and arrives out of sync with receiver clock
- Embedded clock is used to sample incoming packets
- Dual-clocked FIFO restores synchronization
- Tough to design
- Somewhat defeats the NoC principles

[Diagram: sender (CK) drives Data and a Strobe over the link; the receiver (CK) resynchronizes through a dual-clocked FIFO]

The xpipes NoC
The xpipes NoC

- xpipes is a library of NoC components
  - Network Interface (NI), Switch, Link
  - Configurability of parameters such as flit width, amount of buffering, flow control and arbitration policies…
- xpipes is designed to be scalable to future technology nodes, architecturally and physically
  - Leverages a cell synthesis flow, no hard macros
  - Pipelined links to tackle wire propagation delay
- A complete CAD flow is provided to move from the application task graph level to the chip floorplan

The xpipes NoC: the Network Interface

[Diagram: initiator NI and target NI connected through the NoC topology; on the request channel the initiator NI packets OCP transactions using a routing LUT and the target NI unpackets them, and vice versa on the response channel; each NI bridges the OCP clock and xpipes clock domains]

- Performs packeting/unpacketing
- OCP 2.0 protocol to connect to IP cores
- Source routing via routing Look-Up Tables (see the sketch below)
- Dual-clock operation

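The packeting step can be pictured roughly as follows: the initiator NI looks up the route for the target in its LUT and prepends it, together with the transaction fields, to the payload, so that switches only need to read the next output port. The header layout, LUT contents and function names below are hypothetical, for illustration only; the real xpipes packet format is not described in the slides.

# Rough sketch of initiator-NI packeting with source routing: the route is
# looked up in a per-NI LUT and carried in the header flit.

ROUTING_LUT = {                      # target -> list of switch output ports
    "DRAM":  [2, 0, 1],
    "Accel": [3, 1],
}

def packetize(target, command, address, payload_words, words_per_flit=1):
    header = {"route": ROUTING_LUT[target], "cmd": command, "addr": address}
    flits = [header]                 # header flit carries the route + OCP fields
    for i in range(0, len(payload_words), words_per_flit):
        flits.append(payload_words[i:i + words_per_flit])
    return flits

if __name__ == "__main__":
    # A 4-word OCP write burst to DRAM becomes 1 header flit + 4 payload flits.
    print(packetize("DRAM", "WR", 0x1000, [10, 11, 12, 13]))
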
Basic OCP Concepts

- Point-to-point, unidirectional, synchronous
  - Easy physical implementation
- Master/slave, request/response
  - Well-defined, simple roles
- Extensions
  - Added functionality to support cores with more complex interface requirements
- Configurability
  - Match a core’s requirements exactly
  - Tailor design to required features only

Reference: [SonicsInc]

Basic OCP Protocol

[Diagram: OCP Master and OCP Slave signals — MCmd [3], MAddr [32], MData [32] and MRespAccept driven by the master; SCmdAccept, SResp [2] and SData [32] driven by the slave — with request and response phases shown for a read transaction and a write transaction]

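As a sketch of how those signals interact: the request phase completes when the slave asserts SCmdAccept, and for a read the response phase completes when the slave returns SResp/SData and the master accepts them with MRespAccept. This is a simplified illustration, not a cycle-accurate model of the OCP 2.0 specification.

# Simplified OCP-style read transaction: request phase (MCmd/MAddr held until
# SCmdAccept) followed by a response phase (SResp/SData accepted via MRespAccept).

def ocp_read(slave_memory, addr, slave_wait_cycles=1):
    log = []
    # Request phase: the master drives MCmd=RD and MAddr, and holds them.
    for cycle in range(slave_wait_cycles):
        log.append(f"cycle {cycle}: MCmd=RD MAddr={addr:#x} SCmdAccept=0")
    log.append(f"cycle {slave_wait_cycles}: MCmd=RD MAddr={addr:#x} SCmdAccept=1")
    # Response phase: the slave drives SResp=DVA (data valid) and SData.
    data = slave_memory[addr]
    log.append(f"cycle {slave_wait_cycles + 1}: SResp=DVA SData={data} MRespAccept=1")
    return data, log

if __name__ == "__main__":
    mem = {0x80: 42}
    data, log = ocp_read(mem, 0x80, slave_wait_cycles=2)
    print("\n".join(log))
    print("read data:", data)
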
OCP Extensions

- Simple Extensions
  - Byte Enables
  - Bursts
  - Flow Control/Data Handshake
- Complex Extensions
  - Threads and Connections
- Sideband Signals
  - Interrupts, etc.
- Testing Signals

The xpipes NoC: the Switch

[Diagram: switch built from an allocator, an arbiter, a crossbar, and routing & flow control logic]

- Input and/or output buffering
- Wormhole switching
- Supports multiple flow control policies

The xpipes NoC: the ACK/NACK Link

[Diagram: sender S to receiver R over a pipelined link; FLIT and REQ travel forward through the repeaters, ACK/NACK travels back]

- Repeaters are pure registers
- Buffering and retransmission logic is in the sender

The xpipes NoC: the STALL/GO Link

[Diagram: sender S to receiver R over a pipelined link; FLIT and REQ travel forward, STALL travels back]

- Repeaters are two-entry FIFOs
- No retransmission allowed

Quality of Service and the Æthereal NoC

Speaking of Quality of Service...

- Signal processing: hard real time, very regular load, high quality, worst case; typically on DSPs
- Media processing: hard real time, irregular load, high quality, average case; SoC/media processors
- Multi-media: soft real time, irregular load, limited quality, average case; PC/desktop

Very challenging!

Multimedia Application Demands

- Increasing functionality and heterogeneity
- Higher semantic content/entropy
- More dynamism

[Chart: load over time for a VBR MPEG DVD stream, showing the instantaneous load, running average, structural load and worst-case load] [Gossens03]

Negotiating NoC Resources

[Chart: the same VBR MPEG DVD stream load profile; resources are (re)negotiated at the boundaries of steady states] [Gossens03]

A QoS Approach

- Essential to recover global predictability and improve performance
  - Applications require it!
  - It fits well with the protocol stack concept
- What is QoS?
  - Requester poses the service request (negotiation)
  - Provider either commits to or rejects the request
  - Renegotiate when requirements change
  - After negotiation, steady states that are predictable
  - Guaranteed versus best-effort service
- Types of commitment
  - correctness, e.g. uncorrupted data
  - completion, e.g. no packet loss
  - bounds, e.g. maximum latency

QoS + Best Effort

- Best-effort services have better average resource utilisation at the cost of unpredictable/unbounded worst-case behaviour
- The combination of best-effort & guaranteed services is useful!

QoS in the Æthereal NoC

- Conceptually, two disjoint networks
  - a network with throughput+latency guarantees (GT)
  - a network without those guarantees (best-effort, BE)
- Several types of commitment in the network
  - combine guaranteed worst-case behaviour with good average resource usage

[Diagram: best-effort router and guaranteed router combined in one node; the best-effort network programs the guaranteed router, with priority/arbitration between the two traffic classes]

Æthereal Router Architecture

- Best-effort router
  - Wormhole routing
  - Input queueing
  - Source routing
- Guaranteed throughput router
  - Contention-free routing
    - synchronous, using slot tables
    - time-division multiplexed circuits
  - Store-and-forward routing
  - Headerless packets
    - information is present in slot table
- A lot of hardware overhead!!!

Æthereal: Contention-Free Routing

- Latency guarantees are easy in circuit switching
- With packet switching, we need to “emulate” it
- Schedule packet injection into the network so that packets never contend for the same link at the same time (checked in the sketch below)
  - in space: disjoint paths
  - in time: time-division multiplexing
- Use best-effort packets to set up connections
  - Distributed, concurrent, pipelined, consistent
- Compute the slot assignment at build time, run time, or a combination
- Connection opening may be rejected

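The slot-table idea can be checked with a small sketch: each connection advances one link per time slot, and a slot assignment is admissible only if no two connections claim the same link in the same slot. This is an illustrative check, not the actual Æthereal allocation algorithm.

# Contention-free (TDM) routing sketch: time is divided into S slots; a
# connection injected in slot t uses link i of its path in slot (t + i) mod S.
# An assignment is valid only if no link is claimed twice in the same slot.

def slot_table_conflicts(connections, num_slots):
    """connections: list of (injection_slot, [link, link, ...]) tuples."""
    table = {}                                   # (slot, link) -> connection index
    conflicts = []
    for c, (start, path) in enumerate(connections):
        for i, link in enumerate(path):
            key = ((start + i) % num_slots, link)
            if key in table:
                conflicts.append((key, table[key], c))
            else:
                table[key] = c
    return conflicts

if __name__ == "__main__":
    conns = [
        (0, ["R0->R1", "R1->R2"]),   # guaranteed-throughput connection 0
        (0, ["R3->R1", "R1->R2"]),   # clashes with connection 0 on R1->R2 in slot 1
        (1, ["R3->R1", "R1->R2"]),   # shifting the injection slot removes the clash
    ]
    print(slot_table_conflicts(conns[:2], num_slots=4))   # one conflict reported
    print(slot_table_conflicts([conns[0], conns[2]], 4))  # no conflicts
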