CPU Performance

Download Report

Transcript CPU Performance

Network-on-Chip
Introduction
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapter 1, 2, 3, 4,5
Overview
Interconnect Network
Introduction
Deadlock, Livelock
Topology
(Regular, irregular)
Router Architecture
(pipelined, classic)
Routing
(algorithms, mechanics)
Network Interface
(message passing, shared memory)
Flow Control
(Circuit, packet switching (SAF,
wormhole, virtual channel)
Concepts
July 20, 2015
Implementation
Summary
SoC Architecture
Performance
Analysis and QoS
Evaluation
2
Network-on-Chip (NoC)


Today buses are the dominating technology for
systems-on-chips (SoCs)
However, buses have severe limitations that
become evident, if the number of components in a
system is large



The bus is a communication bottleneck, bandwidth is
limited
The bus uses broadcast for each transfer, thus is power
inefficient
Networks-on-Chip shall overcome the limitation of
buses, since they provide a much larger amount of
communication resources and are scalable
July 20, 2015
SoC Architecture
3
A Network-on-Chip
S
S
T
S
S
T
S
T
S
T
S
T
T
S
T
Switch
T
Channel
Terminal Node
S
S
T
S
T
S
T
July 20, 2015
S
S
T
S
T
T
S
T
SoC Architecture
T
4
Network-on-Chip
S
S
T
S
S
T
S
T
S
S
S
T
S
July 20, 2015
T
T
T
T


T
S
T
A terminal node can be
any kind of component
like

S
S
T
T
S
S
S
T
T
T

S
T
SoC Architecture

Processor
Memory
Hardware component
Bus-based system with
several components,
e.g. Processor and
Memory
5
Network-on-Chip
S
S
T
S
S
T
S
T
S
S
S
T
S
July 20, 2015
T
T
T
S
T
S
T
T
S
S
S
T
T
T

S
Information in the form
of packets (not wires) is
routed via channels
and switches from one
terminal node to
another
T
S
T
T
SoC Architecture
6
Network Interface
Network

Switch

Interface
Terminal
Node
(Resource)
July 20, 2015
SoC Architecture
Different terminals with
different interfaces shall
be connected to the
network
The network uses a
specific protocol and all
traffic on the network
has to comply to the
format of this protocol
7
Network Interface
Network

Switch
In order to allow for different
resources to connect to the
network, the network
interface can be divided into
A resource independent
part (Network Interface)
 A resource dependent part
(Resource Network
Interface)
This is also the solution for the
Nostrum NoC (developed at
KTH)

Network
Interface
Resource
Network
Interface
July 20, 2015
Terminal
Node
(Resource)

SoC Architecture
8
Network abstractions

International Standards Organization (ISO)
developed the Open Systems Interconnection
(OSI) model to describe networks:



7-layer model
Provides a standard way to classify network
components and operations
Networks-on-Chips use a similar protocol
stack corresponding to the 4 lowest layers of
the OSI protocol
July 20, 2015
SoC Architecture
9
OSI model
application
presentation
session
July 20, 2015
end-use interface
data format
transport
application dialog control
connections
network
end-to-end service
data link
reliable data transport
physical
mechanical, electrical
SoC Architecture
10
OSI layers




Physical: connectors, bit formats, electrical
properties
Data link: error detection and control across a
single link (single hop).
Network: end-to-end multi-hop data
communication
Transport: connection-oriented services over
multiple links, e.g. ordering of packets, errorfree connection
July 20, 2015
SoC Architecture
11
OSI layers, cont’d.



Session: services for end-user applications:
data grouping, checkpointing, etc
Presentation: data formats, transformation
services
Application: interface between network and
end-user programs
July 20, 2015
SoC Architecture
12
Internet Protocol
(not an on-chip protocol!)
application
presentation
session
transport
network
data link
physical
network
data link
physical
node A
router
July 20, 2015
IP
SoC Architecture
application
presentation
session
transport
network
data link
physical
node B
13
Units of Resource Allocation




A message is a continuous group of bits that is
delivered from source terminal to destination
terminal. A message consists of packets.
A packet is the basic unit for routing and
sequencing. Packets maybe divided into flits.
A flit (flow control digits) is the basic unit of
bandwidth and storage allocation. Body/tail flits do
not have any routing or sequence information and
have to follow the head flit and remain in order.
A phit (physical transfer digits) is the unit that is
transfered across a channel in a single clock cycle.
July 20, 2015
SoC Architecture
14
Units of Resource Allocation
Message
Packet
Header
Packet
RI
SN
Head Flit
Flit
Type
Body Flit
Body Flit
Tail Flit
VC
Phit
Messages, Packets, Flits and Phits are handled
in different layers of the network protocol
July 20, 2015
SoC Architecture
15
Performance Factors
S
S
T
S
T

S
T
T
Factors that influence the
performance of a network-onchip are

S
S
S
S

T
T
T
T

S
S
S
S

T
S
T
S
T
July 20, 2015
T
S
T
T

S
T
T
SoC Architecture

Topology (static arrangement
of channels and nodes)
Routing Techniques (selection
of a path through the network)
Switching Techniques (How a
route is traversed)
Flow Control (how are network
resources allocated, if packets
traverse the network)
Router Architecture (buffers
and switches)
Traffic Pattern
16
Network-on-Chip
Topologies
Dally: Ch 3, (4), 5
Network Topology



The network topology refers to the static
arrangement of channels and nodes in the
network
A good topology allows to fulfill the
requirements of the traffic at reasonable
costs
Network topology can be compared with a
network of roads
July 20, 2015
SoC Architecture
18
Topology Examples
Destination-tag Routing
0
0
0
1
2
3
00
0
10
20
1
4-node ring
00
01
02
1
1
2
2
03
01
11
21
3
10
11
12
3
How is it routed to destination 6, 110?
13
4
4
02
20
21
22
23
30
31
32
33
12
22
5
5
6
6
03
13
23
7
”4x4”-Torus
July 20, 2015
7
Butterfly with 8 nodes
SoC Architecture
19
Rings, Tori and Meshes
0
1
2
3
4-node ring
(4-ary 1-cube)
00
01
02
03
00
01
02
03
10
11
12
13
10
11
12
13
20
21
22
23
20
21
22
23
30
31
32
33
30
31
32
33
July 20, 2015
”4x4”-Torus
(4-ary 2-cube)
”4x4”-Mesh
(4-ary 2-mesh)
SoC Architecture
20
Combined Node consists of
Terminal and Switch Node
Switch Node
is equivalent to
Combined Node
Terminal Node
July 20, 2015
SoC Architecture
21
Nomenclature
Network-on-Chip

00
01
02
03
10
11
12
13

20
21
22
23
30
31
32
33


”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
The topology of an
interconnection network is
specified by a set of nodes
N* connected by a set of
channels C
Messages originate and
terminate in a set of terminal
nodes N, where N  N*
Here:

SoC Architecture
N = N* = 16
C = 16 ∙ 4 = 64
22
Nomenclature
Network-on-Chip

00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33


”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
SoC Architecture
Each channel c = (x, y) ∈ C
connects a source node x to
a destination node y, where
x, y ∈ N*
A channel is characterized
by its width wc or wxy , which
is the number of parallel
signals it contains
The source node of a
channel is denoted sc and
the destination node dc
23
Nomenclature
Network-on-Chip

00
01
02
03

10
11
12
13
20
21
22
23
30
31
32
33

”4x4”-Torus
(Channels are bidirectional)
July 20, 2015

SoC Architecture
Its frequency fc or fxy is the
rate at which bits are
transported on each signal
Its latency tc or txy is the time
required for a bit to travel
from x to y
Usually the latency is
directly related to the
physical length of the
channel, lc = vtc , by a
propagation velocity v
The bandwidth of the
channel is bc = wcfc
24
Nomenclature
Network-on-Chip

00
01
02
03
Each switch node x has
a channel set Cx = CIx ⋃
COx , where

10
11
12
13

20
21
22
23

30
31
32
33
The degree of x is δx =
|Cx |, which is the sum
of the in degree and out
degree
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
CIx = {c ∈ C | dc = x} is the
input channel set
COx = {c ∈ C | sc = x} is the
output channel set
SoC Architecture
25
Direct and Indirect Networks

00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
Direct Network

Every Node in the
network is both a
terminal and a switch
Direct Network

Indirect Network

Nodes are either a
switch or terminal
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
Indirect Network
July 20, 2015
SoC Architecture
26
Bisection of a network


A bisection of a network is a cut that partitions the entire network
nearly in half
The channel bisection of a network is the minimum channel count
over all bisections of the network
min C ( N , N )
Bc  bisections
1
2

The bisection bandwidth of a network is the minimum
bandwidth over all bisections of the network
min B( N , N )
BB  bisections
1
2
July 20, 2015
SoC Architecture
27
Bisection

Channel bisection


BC = 4 (2 bidirectional
channels go through the
bisection)
1
2
3
4-node ring
Bandwidth bisection

0
BB = 4b (b is the
bandwidth of each
channel)
July 20, 2015
SoC Architecture
28
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13


20
21
22
23

30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
Minimal
Path (|P| = 3)
SoC Architecture
A path is an ordered
set of channels P = { c1,
c2, ... ,cn }, where dci  sci1
for i = 1... (n - 1)
The length or hop count
of a path is |P |
A minimal path from
node x to node y is a
path with the smallest
hop-count
29
Paths

00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
The set of all minimal
paths between x and y
is denoted Rxy
Minimal
Paths (|P| = 3)
SoC Architecture
30
Paths
Largest minimal hop count
(Diameter Hmax = 4)
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33

The diameter Hmax is
the largest minimal hop
count over all pairs of
terminal nodes
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
SoC Architecture
31
Paths
Distance in hops
from node 00

00
0
01
1
02
2
03
1
10
1
11
2
12
3
13
2
2
20
3
21
4
22
3
23
30
1
31
2
32
3
July 20, 2015
H min
33
2
”4x4”-Torus
(Channels are bidirectional)
The average minimum
hop count Hmin is
defined as the average
hop count over all
sources and
destinations

SoC Architecture
1
 2
N
 H ( x, y )
x , yN
Here: Hmin = 2
32
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13


20
21
22
23
30
31
32
33
Havg  H min
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
A specific
implementation may
choose to incorporate
some non-minimal path
Then the actual
average hop count Havg
is defined over the path
used by the network
SoC Architecture
33
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13

D( P)  lc
cP

20
21
22
23
30
31
32
33
The physical distance
of the path is
The delay of the path is
t ( P) D( P) / v
”4x4”-Torus
(Channels are bidirectional)
July 20, 2015
SoC Architecture
34
Traffic Patterns



Traffic pattern is a very important factor for
the performance of a network
In uniform random traffic each source is
equally likely to send to each destination
Uniform random traffic is the most commonly
used traffic pattern for network evaluation.
However it implies balancing of the load,
which often does not cause a problem for the
network
July 20, 2015
SoC Architecture
35
Throughput



The throughput of a network is the data rate in bits
per second that the network accepts per input port
The topology of a network has a significant impact
on the throughput (besides flow control and routing)
The ideal throughput is defined as the throughput
assuming a perfect routing and flow control


Load is balanced over alternate paths
No idle cycles on bottleneck channels
July 20, 2015
SoC Architecture
36
Throughput


Maximum throughput occurs, if some channel of the
network becomes saturated
Why some channel
not all channels?
The channel load  of a channel is



the ratio of the bandwidth demanded from the channel to
the bandwidth of the input ports
(in other words) the amount of traffic that must cross the
channel, if each input unit injects one unit of traffic
according to the given traffic pattern
The channel that carries the largest fraction of the
traffic determines the maximum channel load max
July 20, 2015
SoC Architecture
37
Throughput

The ideal throughput ideal is the input
bandwidth that saturates the bottleneck channel
ideal = b / max


In general it is difficult to determine the
maximum channel load max, but in case for
uniform traffic, bounds can be found.
Use the ideal throughput of a network on uniform
traffic ideal (U) as the capacity of the network.
July 20, 2015
SoC Architecture
38
An example: Ideal throughput

Channel bandwidth, channel load
S1
D1
1 packet/cycle
ideal = b / max
S1
Assume b = 1 packet/cycle
D1
Mux
De-mux
S2
1. Where is the bottleneck
channel?
2. What is the load of the
bottleneck channel?
3. What is the throughput on
the bottleneck channel?
4. What is the network’s ideal
throughput?
D2
What if b = 2 packet/cycle ?
July 20, 2015
SoC Architecture
39
Ideal Throughput in a Torus

00
01
02
03

10
11
12
13
20
21
22
23

30
31
32
33

”4x4”-Torus
(Channels are bidirectional)
July 20, 2015

SoC Architecture
Assuming uniform traffic,
50% of the packets cross
the bisection channels
Best throughput, if packets
are evenly distributed over
the bisection channels
Load on these channels is
then B = N / 2BC
Thus max ≥  B = N / 2BC
And the ideal throughput is
ideal = b / max ≤ 2bBC /N
40
Ideal Throughput in a Torus

00
01
02
03


Thus max ≥  B = N / 2BC
ideal = b / max ≤ 2bBC /N
Example:
10
11
12
13

20
21
22
23

30
31
32
33

”4x4”-Torus
(Channels are bidirectional)
July 20, 2015

SoC Architecture
4 x 4 torus:  B = 16 / 2BC
= 8/16 = ½
4 x 4 mesh:  B = 16 / 2BC
= 8/8 = 1
n x n torus:  B = n2 / 2BC
= n2/(2 · 4n) = n/8
n x n mesh:  B = n2 / 2BC
= n2/4n = n/4
41
Another useful lower bound on
channel load




A packet needs Hmin hops to be delivered
There are C channels in the network
We have N nodes sending packets
With equal load, we get a lower bound for
 max   c , LB 
July 20, 2015
H min N
C
SoC Architecture
42
Lower Bounds on Bottlneck
Hop count bound:
 max   c , LB 
 max 
Bisection bound:
H min N
C
N
2 BC
ideal = b / max
July 20, 2015
SoC Architecture
43
Bisection Bandwidth
Mesh – Uniform Traffic
BT … total channel count
BC … bisection channel count
E … node emission probability
(Emitted packets per cycle)
Havg … average hop count
BT  4k  4k
2
Total Load: ENH avg
Balance:
BC  2k
N  k2
1. When k increases, what
happens for E? Why?
2. Are both bounds correct?
3. If so, what does this
mean?
4. Why is the B-bound
tighter?
July 20, 2015
2
2
 Ek
k  Ek 3
3
3
2
2 3
Ek  4k 2  4k
3
E
Bisection Load:
Balance:
3 ( 4k 2  4k )
k 1
6 2
3
2
k
k
1 2
Ek  2k
2
E
k
T-bound
B-bound
2
3/2=1.5
2
4
9/8=1.125
1
6
5/6=0.83
2/3=0.67
8
21/32=0.656
½=0.5
10
27/50=0.54
2/5=0.4
SoC Architecture
EN Ek 2

2
2
4
k
44
Latency



The latency of the network is the time
required for a message to traverse a network,
from the time the head arrives at the input
port to the time when the tail of the message
departs the output port
Latency depends not only on topology, but
also on routing, flow control and the design of
the router
Topology gives a lower bound on latency
July 20, 2015
SoC Architecture
45
Latency

There are two latency components:


Head latency Th : Time required for head of the
message to traverse the network
Serialization latency Ts= L/b : Time required for
the tail to catch up (time for a message of length
L to cross a channel with bandwidth b)
July 20, 2015
SoC Architecture
46
Head Latency

Head latency depends on two topology
factors

Router delay Tr (time spent in the routers) and
time of flight Tw (time spent on wires)


July 20, 2015
Tr = Hmin tr
Tw = Dmin / v (average distance Dmin, propagation
velocity v)
SoC Architecture
47
Latency

Together this gives ideal average latency:



T0 = Hmin tr + Dmin / v + L / b (no congestion)
Clearly Hmin, Dmin, and b are to a large extent
determined by the topology
If there is congestion in the network, there is
a forth term TC
July 20, 2015
SoC Architecture
48
Latency with time-space
diagram
Head
Tail
How the time-space diagram
looks like if the transmission
is not pipelined, say, in a
store-and-forward fashion?
x
Arrival at node x
tr
Leave x
txy
y
Arrival at node y
tr
Leave y
tyz
z
Arrival at switch z
L/b
July 20, 2015
SoC Architecture
49
Examples








T0 = Hmin tr + Dmin / v + L / b
A torus 8-by-8 network
The network has N = 64
nodes
Avg. min. hop count Hmin = 4
Channel width wc = 16
Channel frequency fc = 1
GHz
Channel latency tc = 5 ns
Router delay tr = 8 ns
Packet Length L = 64 bytes
July 20, 2015




SoC Architecture
Tr = Hmin tr = 4 ∙ 8 ns = 32 ns
Tw = Hmin tc = 4 ∙ 5 ns = 20 ns
Ts = L / b = L / (fc wc ) = 64 ∙ 8
/ (1 GHz ∙ 16) = 512 / 16 ns =
32 ns
T0 = 32 ns + 20 ns + 32 ns = 84
ns
50
To get a feeling about NoC
(Toy example)







The network has N = 64
nodes
Hmin = 2 ∙ 8 / 3 = 5.33
Channel width wc = 32
Channel frequency fc = 1
GHz
Channel latency tc = 1 ns
Router delay tr = 1 ns
Packet length L = 16 bytes
July 20, 2015
SoC Architecture
51
To get a feeling about NoC
(Toy example)







The network has N = 64
nodes
Hmin = 2 ∙ 8 / 3 = 5.33
Channel width wc = 32
Channel frequency fc = 1
GHz
Channel latency tc = 1 ns
Router delay tr = 1 ns
Packet length L = 16 bytes
July 20, 2015




SoC Architecture
Tr = Hmin tr = 5.33 ∙ 1ns = 5.33
ns
Tw = Hmin tc = 5.33 ns
Ts = L / b = L / (fc wc ) = 16 ∙ 8
/ (1 GHz ∙ 32) = 128 / 32 ns =
4 ns
T0 = 5.33 ns + 5.33 ns + 4 ns =
14.66 ns
52
Path Diversity

A network with multiple minimal paths
between most pairs of nodes is more robust
than a network that has only one single route
between the nodes
July 20, 2015
SoC Architecture
53
Path Diversity
0

Random Traffic



Each node is
equally likely to
send a message to
any other node
50% of the packets
pass the bisection
max = 4/4=1
0
00
10
20
1
1
2
2
01
11
21
3
3
4
4
02
12
22
5
5
6
6
03
13
23
7
7
Butterfly with 8 nodes
July 20, 2015
SoC Architecture
54
Traffic Patterns


The performance of a network is strongly depending on the
traffic pattern
The table below shows a number of different traffic patterns
that can be used to analyze the performance of the network
July 20, 2015
SoC Architecture
55
Path Diversity
0

Bit (left) Rotation Traffic
 The node with address {
b2, b1, b0 } sends to { b1, b0,
b2 }
 Thus for sources
{ 0, 1, 2, 3, 4, 5, 6, 7 }, we
get the shuffle permutation
{ 0, 2, 4, 6, 1, 3, 5, 7 }
 Thus packets from nodes
{0,1,4, 5} will all have to
pass switch node 10
 max,BR = 2 (since for
instance channel (00,10) is
used by two connections)
 Max capacity: 50%
July 20, 2015
0
00
10
20
1
1
2
2
01
11
21
3
3
4
4
02
12
22
5
5
6
6
03
13
23
7
SoC Architecture
7
Butterfly with 8 nodes
56
Torus and Mesh Networks


Torus and Mesh networks, k-ary n-cubes,
pack N = kn nodes.
Advantages




Regular structure allows efficient packaging
For local communication latency is low
Good path diversity
Disadvantage

Comparably larger hop count
July 20, 2015
SoC Architecture
57
Rings, Tori and Meshes
0
1
2
3
4-node ring
(4-ary 1-cube)
00
01
02
03
00
01
02
03
10
11
12
13
10
11
12
13
20
21
22
23
20
21
22
23
30
31
32
33
30
31
32
33
July 20, 2015
”4x4”-Torus
(4-ary 2-cube)
”4x4”-Mesh
(4-ary 2-mesh)
SoC Architecture
58
Properties of Tori and Meshes

Torus





Channel Bisection
BC,T = 4 N / k
Channel load under uniform
traffic (50% of traffic
crosses bisection)
T,U = k / 8
Channel load under worst
traffic (100% of traffic
crosses bisection)
T,W = k / 4
Average minimum hop
count (k even)
Hmin, T = nk / 4
July 20, 2015
Mesh




Channel Bisection
BC,M = 2 N / k
Channel load under uniform
traffic (50% of traffic
crosses bisection)
M,U = k / 4
Channel load under worst
traffic (100% of traffic
crosses bisection)
M,W = k / 2
Average minimum hop
count (k even)
Hmin, M = nk / 3
SoC Architecture
59
Physical implementation of
Mesh and Tori


In order to implement a network on a chip,
the abstract nodes of the network must be
mapped to real positions in physical space
A goal is to have the same latency for all
channels
July 20, 2015
SoC Architecture
60
Folding networks leads to
shorter largest channel length
Folded 4-ary 2 cube
July 20, 2015
SoC Architecture
61
Summary

Network is characterized by topology, routing,
switching and flow control.



Topology is the first concern of the network
Mesh and Tori are regular and symmetric, which
many on-chip networks of today propose.
Performance metrics: throughput and latency



Use channel load to determine ideal throughput
Latency includes time spent on routers, wires and
serialization
Performance is strongly sensitive to traffic pattern
July 20, 2015
SoC Architecture
62