CPU Performance
Download
Report
Transcript CPU Performance
Network-on-Chip
Introduction
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapter 1, 2, 3, 4,5
Overview
Interconnect Network
Introduction
Deadlock, Livelock
Topology
(Regular, irregular)
Router Architecture
(pipelined, classic)
Routing
(algorithms, mechanics)
Network Interface
(message passing, shared memory)
Flow Control
(Circuit, packet switching (SAF,
wormhole, virtual channel)
Concepts
July 21, 2015
Implementation
Summary
SoC Architecture
Performance
Analysis and QoS
Evaluation
2
Network-on-Chip (NoC)
Today buses are the dominating technology for
systems-on-chips (SoCs)
However, buses have severe limitations that
become evident, if the number of components in a
system is large
The bus is a communication bottleneck, bandwidth is
limited
The bus uses broadcast for each transfer, thus is power
inefficient
Networks-on-Chip shall overcome the limitation of
buses, since they provide a much larger amount of
communication resources and are scalable
July 21, 2015
SoC Architecture
3
A Network-on-Chip
S
S
T
S
S
T
S
T
S
T
S
T
T
S
T
Switch
T
Channel
Terminal Node
S
S
T
S
T
S
T
July 21, 2015
S
S
T
S
T
T
S
T
SoC Architecture
T
4
Network-on-Chip
S
S
T
S
S
T
S
T
S
S
S
T
S
July 21, 2015
T
T
T
T
T
S
T
A terminal node can be
any kind of component
like
S
S
T
T
S
S
S
T
T
T
S
T
SoC Architecture
Processor
Memory
Hardware component
Bus-based system with
several components,
e.g. Processor and
Memory
5
Network-on-Chip
S
S
T
S
S
T
S
T
S
S
S
T
S
July 21, 2015
T
T
T
S
T
S
T
T
S
S
S
T
T
T
S
Information in the form
of packets (not wires) is
routed via channels
and switches from one
terminal node to
another
T
S
T
T
SoC Architecture
6
Network Interface
Network
Switch
Interface
Terminal
Node
(Resource)
July 21, 2015
SoC Architecture
Different terminals with
different interfaces shall
be connected to the
network
The network uses a
specific protocol and all
traffic on the network
has to comply to the
format of this protocol
7
Network Interface
Network
Switch
In order to allow for different
resources to connect to the
network, the network
interface can be divided into
A resource independent
part (Network Interface)
A resource dependent part
(Resource Network
Interface)
This is also the solution for the
Nostrum NoC (developed at
KTH)
Network
Interface
Resource
Network
Interface
July 21, 2015
Terminal
Node
(Resource)
SoC Architecture
8
Network abstractions
International Standards Organization (ISO)
developed the Open Systems Interconnection
(OSI) model to describe networks:
7-layer model
Provides a standard way to classify network
components and operations
Networks-on-Chips use a similar protocol
stack corresponding to the 4 lowest layers of
the OSI protocol
July 21, 2015
SoC Architecture
9
OSI model
application
presentation
session
July 21, 2015
end-use interface
data format
transport
application dialog control
connections
network
end-to-end service
data link
reliable data transport
physical
mechanical, electrical
SoC Architecture
10
OSI layers
Physical: connectors, bit formats, electrical
properties
Data link: error detection and control across a
single link (single hop).
Network: end-to-end multi-hop data
communication
Transport: connection-oriented services over
multiple links, e.g. ordering of packets, errorfree connection
July 21, 2015
SoC Architecture
11
OSI layers, cont’d.
Session: services for end-user applications:
data grouping, checkpointing, etc
Presentation: data formats, transformation
services
Application: interface between network and
end-user programs
July 21, 2015
SoC Architecture
12
Internet Protocol
(not an on-chip protocol!)
application
presentation
session
transport
network
data link
physical
network
data link
physical
node A
router
July 21, 2015
IP
SoC Architecture
application
presentation
session
transport
network
data link
physical
node B
13
Units of Resource Allocation
A message is a continuous group of bits that is
delivered from source terminal to destination
terminal. A message consists of packets.
A packet is the basic unit for routing and
sequencing. Packets maybe divided into flits.
A flit (flow control digits) is the basic unit of
bandwidth and storage allocation. Body/tail flits do
not have any routing or sequence information and
have to follow the head flit and remain in order.
A phit (physical transfer digits) is the unit that is
transfered across a channel in a single clock cycle.
July 21, 2015
SoC Architecture
14
Units of Resource Allocation
Message
Packet
Header
Packet
RI
SN
Head Flit
Flit
Type
Body Flit
Body Flit
Tail Flit
VC
Phit
Messages, Packets, Flits and Phits are handled
in different layers of the network protocol
July 21, 2015
SoC Architecture
15
Performance Factors
S
S
T
S
T
S
T
T
Factors that influence the
performance of a network-onchip are
S
S
S
S
T
T
T
T
S
S
S
S
T
S
T
S
T
July 21, 2015
T
S
T
T
S
T
T
SoC Architecture
Topology (static arrangement
of channels and nodes)
Routing Techniques (selection
of a path through the network)
Switching Techniques (How a
route is traversed)
Flow Control (how are network
resources allocated, if packets
traverse the network)
Router Architecture (buffers
and switches)
Traffic Pattern
16
Network-on-Chip
Topologies
Dally: Ch 3, (4), 5
Network Topology
The network topology refers to the static
arrangement of channels and nodes in the
network
A good topology allows to fulfill the
requirements of the traffic at reasonable
costs
Network topology can be compared with a
network of roads
July 21, 2015
SoC Architecture
18
Topology Examples
Destination-tag Routing
0
0
0
1
2
3
00
0
10
20
1
4-node ring
00
01
02
1
1
2
2
03
01
11
21
3
10
11
12
3
How is it routed to destination 6, 110?
13
4
4
02
20
21
22
23
30
31
32
33
12
22
5
5
6
6
03
13
23
7
”4x4”-Torus
July 21, 2015
7
Butterfly with 8 nodes
SoC Architecture
19
Rings, Tori and Meshes
0
1
2
3
4-node ring
(4-ary 1-cube)
00
01
02
03
00
01
02
03
10
11
12
13
10
11
12
13
20
21
22
23
20
21
22
23
30
31
32
33
30
31
32
33
July 21, 2015
”4x4”-Torus
(4-ary 2-cube)
”4x4”-Mesh
(4-ary 2-mesh)
SoC Architecture
20
Combined Node consists of
Terminal and Switch Node
Switch Node
is equivalent to
Combined Node
Terminal Node
July 21, 2015
SoC Architecture
21
Nomenclature
Network-on-Chip
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
The topology of an
interconnection network is
specified by a set of nodes
N* connected by a set of
channels C
Messages originate and
terminate in a set of terminal
nodes N, where N N*
Here:
SoC Architecture
N = N* = 16
C = 16 ∙ 4 = 64
22
Nomenclature
Network-on-Chip
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
Each channel c = (x, y) ∈ C
connects a source node x to
a destination node y, where
x, y ∈ N*
A channel is characterized
by its width wc or wxy , which
is the number of parallel
signals it contains
The source node of a
channel is denoted sc and
the destination node dc
23
Nomenclature
Network-on-Chip
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
Its frequency fc or fxy is the
rate at which bits are
transported on each signal
Its latency tc or txy is the time
required for a bit to travel
from x to y
Usually the latency is
directly related to the
physical length of the
channel, lc = vtc , by a
propagation velocity v
The bandwidth of the
channel is bc = wcfc
24
Nomenclature
Network-on-Chip
00
01
02
03
Each switch node x has
a channel set Cx = CIx ⋃
COx , where
10
11
12
13
20
21
22
23
30
31
32
33
The degree of x is δx =
|Cx |, which is the sum
of the in degree and out
degree
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
CIx = {c ∈ C | dc = x} is the
input channel set
COx = {c ∈ C | sc = x} is the
output channel set
SoC Architecture
25
Direct and Indirect Networks
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
Direct Network
Every Node in the
network is both a
terminal and a switch
Direct Network
Indirect Network
Nodes are either a
switch or terminal
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
Indirect Network
July 21, 2015
SoC Architecture
26
Bisection of a network
A bisection of a network is a cut that partitions the entire network
nearly in half
The channel bisection of a network is the minimum channel count
over all bisections of the network
min C ( N , N )
Bc bisections
1
2
The bisection bandwidth of a network is the minimum
bandwidth over all bisections of the network
min B( N , N )
BB bisections
1
2
July 21, 2015
SoC Architecture
27
Bisection
Channel bisection
BC = 4 (2 bidirectional
channels go through the
bisection)
1
2
3
4-node ring
Bandwidth bisection
0
BB = 4b (b is the
bandwidth of each
channel)
July 21, 2015
SoC Architecture
28
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
Minimal
Path (|P| = 3)
SoC Architecture
A path is an ordered
set of channels P = { c1,
c2, ... ,cn }, where dci sci1
for i = 1... (n - 1)
The length or hop count
of a path is |P |
A minimal path from
node x to node y is a
path with the smallest
hop-count
29
Paths
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
The set of all minimal
paths between x and y
is denoted Rxy
Minimal
Paths (|P| = 3)
SoC Architecture
30
Paths
Largest minimal hop count
(Diameter Hmax = 4)
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
The diameter Hmax is
the largest minimal hop
count over all pairs of
terminal nodes
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
31
Paths
Distance in hops
from node 00
00
0
01
1
02
2
03
1
10
1
11
2
12
3
13
2
2
20
3
21
4
22
3
23
30
1
31
2
32
3
July 21, 2015
H min
33
2
”4x4”-Torus
(Channels are bidirectional)
The average minimum
hop count Hmin is
defined as the average
hop count over all
sources and
destinations
SoC Architecture
1
2
N
H ( x, y )
x , yN
Here: Hmin = 2
32
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
Havg H min
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
A specific
implementation may
choose to incorporate
some non-minimal path
Then the actual
average hop count Havg
is defined over the path
used by the network
SoC Architecture
33
Paths
Non-Minimal
Path (|P| = 5)
00
01
02
03
10
11
12
13
D( P) lc
cP
20
21
22
23
30
31
32
33
The physical distance
of the path is
The delay of the path is
t ( P) D( P) / v
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
34
Traffic Patterns
Traffic pattern is a very important factor for
the performance of a network
In uniform random traffic each source is
equally likely to send to each destination
Uniform random traffic is the most commonly
used traffic pattern for network evaluation.
However it implies balancing of the load,
which often does not cause a problem for the
network
July 21, 2015
SoC Architecture
35
Throughput
The throughput of a network is the data rate in bits
per second that the network accepts per input port
The topology of a network has a significant impact
on the throughput (besides flow control and routing)
The ideal throughput is defined as the throughput
assuming a perfect routing and flow control
Load is balanced over alternate paths
No idle cycles on bottleneck channels
July 21, 2015
SoC Architecture
36
Throughput
Maximum throughput occurs, if some channel of the
network becomes saturated
Why some channel
not all channels?
The channel load of a channel is
the ratio of the bandwidth demanded from the channel to
the bandwidth of the input ports
(in other words) the amount of traffic that must cross the
channel, if each input unit injects one unit of traffic
according to the given traffic pattern
The channel that carries the largest fraction of the
traffic determines the maximum channel load max
July 21, 2015
SoC Architecture
37
Throughput
The ideal throughput ideal is the input
bandwidth that saturates the bottleneck channel
ideal = b / max
In general it is difficult to determine the
maximum channel load max, but in case for
uniform traffic, bounds can be found.
Use the ideal throughput of a network on uniform
traffic ideal (U) as the capacity of the network.
July 21, 2015
SoC Architecture
38
An example: Ideal throughput
Channel bandwidth, channel load
S1
D1
1 packet/cycle
ideal = b / max
S1
Assume b = 1 packet/cycle
D1
Mux
De-mux
S2
D2
1. Where is the bottleneck
channel?
2. What is the load of the
bottleneck channel?
3. What is the network’s
throughput?
4. What is the network’s ideal
throughput?
5. What is the throughput on
the bottleneck channel?
What if b = 2 packet/cycle ?
July 21, 2015
SoC Architecture
39
Ideal Throughput in a Torus
00
01
02
03
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
Assuming uniform traffic,
50% of the packets cross
the bisection channels
Best throughput, if packets
are evenly distributed over
the bisection channels
Load on these channels is
then B = N / 2BC
Thus max ≥ B = N / 2BC
And the ideal throughput is
ideal = b / max ≤ 2bBC /N
40
Ideal Throughput in a Torus
00
01
02
03
Thus max ≥ B = N / 2BC
ideal = b / max ≤ 2bBC /N
Example:
10
11
12
13
20
21
22
23
30
31
32
33
”4x4”-Torus
(Channels are bidirectional)
July 21, 2015
SoC Architecture
4 x 4 torus: B = 16 / 2BC
= 8/16 = ½
4 x 4 mesh: B = 16 / 2BC
= 8/8 = 1
n x n torus: B = n2 / 2BC
= n2/(2 · 4n) = n/8
n x n mesh: B = n2 / 2BC
= n2/4n = n/4
41
Another useful lower bound on
channel load
A packet needs Hmin hops to be delivered
There are C channels in the network
We have N nodes sending packets
With equal load, we get a lower bound for
max c , LB
July 21, 2015
H min N
C
SoC Architecture
42
Lower Bounds on Bottlneck
Hop count bound:
max c , LB
max
Bisection bound:
ideal = b / max
H min N
C
N
2 BC
1. If the bandwidth of the bottleneck channel is
b, what is the maximum ideal throughput of
the network?
2. What is the best (meaningful) minimum
value for the maximum channel load, max?
July 21, 2015
SoC Architecture
43
Bisection Bandwidth
Mesh – Uniform Traffic
BT … total channel count
BC … bisection channel count
E … node emission probability
(Emitted packets per cycle)
Havg … average hop count
BT 4k 4k
2
Total Load: ENH avg
Balance:
BC 2k
N k2
1. When k increases, what
happens for E? Why?
2. Are both bounds correct?
3. If so, what does this
mean?
4. Why is the B-bound
tighter?
July 21, 2015
2
2
Ek
k Ek 3
3
3
2
2 3
Ek 4k 2 4k
3
E
Bisection Load:
Balance:
3 ( 4k 2 4k )
k 1
6 2
3
2
k
k
1 2
Ek 2k
2
E
k
T-bound
B-bound
2
3/2=1.5
2
4
9/8=1.125
1
6
5/6=0.83
2/3=0.67
8
21/32=0.656
½=0.5
10
27/50=0.54
2/5=0.4
SoC Architecture
EN Ek 2
2
2
4
k
44
Latency
The latency of the network is the time
required for a message to traverse a network,
from the time head arrives at the input port to
the time where the tail of the mesage departs
the output port
Latency depends not only on topology, but
also on routing, flow control and the design of
the router
Topology gives a lower bound on latency
July 21, 2015
SoC Architecture
45
Latency
There are two latency components:
Head latency Th : Time required for head of the
message to traverse the network
Serialization latency Ts= L/b : Time required for
the tail to catch up (time for a message of length
L to cross a channel with bandwidth b)
July 21, 2015
SoC Architecture
46
Head Latency
Head latency depends on two topology
factors
Router delay Tr (time spent in the routers) and
time of flight Tw (time spent on wires)
July 21, 2015
Tr = Hmin tr
Tw = Dmin / v (average distance Dmin, propagation
velocity v)
SoC Architecture
47
Latency
Together this gives ideal average latency:
T0 = Hmin tr + Dmin / v + L / b (no congestion)
Clearly Hmin, Dmin, and b are to a large extent
determined by the topology
If there is congestion in the network, there is
a forth term TC
July 21, 2015
SoC Architecture
48
Latency with time-space
diagram
Head
Tail
How the time-space diagram
looks like if the transmission
is not pipelined, say, in a
store-and-forward fashion?
x
Arrival at node x
tr
Leave x
txy
y
Arrival at node y
tr
Leave y
tyz
z
Arrival at switch z
L/b
July 21, 2015
SoC Architecture
49
Examples
T0 = Hmin tr + Dmin / v + L / b
A torus 8-by-8 network
The network has N = 64
nodes
Avg. min. hop count Hmin = 4
Channel width wc = 16
Channel frequency fc = 1
GHz
Channel latency tc = 5 ns
Router delay tr = 8 ns
Packet Length L = 64 bytes
July 21, 2015
SoC Architecture
Tr = Hmin tr = 4 ∙ 8 ns = 32 ns
Tw = Hmin tc = 4 ∙ 5 ns = 20 ns
Ts = L / b = L / (fc wc ) = 64 ∙ 8
/ (1 GHz ∙ 16) = 512 / 16 ns =
32 ns
T0 = 32 ns + 20 ns + 32 ns = 84
ns
50
To get a feeling about NoC
(Toy example)
The network has N = 64
nodes
Hmin = 2 ∙ 8 / 3 = 5.33
Channel width wc = 32
Channel frequency fc = 1
GHz
Channel latency tc = 1 ns
Router delay tr = 1 ns
Packet Length L = 16 bytes
July 21, 2015
SoC Architecture
51
To get a feeling about NoC
(Toy example)
The network has N = 64
nodes
Hmin = 2 ∙ 8 / 3 = 5.33
Channel width wc = 32
Channel frequency fc = 1
GHz
Channel latency tc = 1 ns
Router delay tr = 1 ns
Packet Length L = 16 bytes
July 21, 2015
SoC Architecture
Tr = Hmin tr = 5.33 ∙ 1ns = 5.33
ns
Tw = Hmin tc = 5.33 ns
Ts = L / b = L / (fc wc ) = 16 ∙ 8
/ (1 GHz ∙ 32) = 128 / 32 ns =
4 ns
T0 = 5.33 ns + 5.33 ns + 4 ns =
14.66 ns
52
Path Diversity
A network with multiple minimal paths
between most pairs of nodes is more robust
than a network that has only one single route
between the nodes
July 21, 2015
SoC Architecture
53
Path Diversity
0
Random Traffic
Each node is
equally likely to
send a message to
any other node
50% of the packets
pass the bisection
max = 4/4=1
0
00
10
20
1
1
2
2
01
11
21
3
3
4
4
02
12
22
5
5
6
6
03
13
23
7
7
Butterfly with 8 nodes
July 21, 2015
SoC Architecture
54
Traffic Patterns
The performance of a network is strongly depending on the
traffic pattern
The table below shows a number of different traffic patterns
that can be used to analyze the performance of the network
July 21, 2015
SoC Architecture
55
Path Diversity
0
Bit (left) Rotation Traffic
The node with address {
b2, b1, b0 } sends to { b1, b0,
b2 }
Thus for sources
{ 0, 1, 2, 3, 4, 5, 6, 7 }, we
get the shuffle permutation
{ 0, 2, 4, 6, 1, 3, 5, 7 }
Thus packets from nodes
{0,1,4, 5} will all have to
pass switch node 10
max,BR = 2 (since for
instance channel (00,10) is
used by two connections)
Max capacity: 50%
July 21, 2015
0
00
10
20
1
1
2
2
01
11
21
3
3
4
4
02
12
22
5
5
6
6
03
13
23
7
SoC Architecture
7
Butterfly with 8 nodes
56
Torus and Mesh Networks
Torus and Mesh networks, k-ary n-cubes,
pack N = kn nodes.
Advantages
Regular structure allows efficient packaging
For local communication latency is low
Good path diversity
Disadvantage
Comparably larger hop count
July 21, 2015
SoC Architecture
57
Rings, Tori and Meshes
0
1
2
3
4-node ring
(4-ary 1-cube)
00
01
02
03
00
01
02
03
10
11
12
13
10
11
12
13
20
21
22
23
20
21
22
23
30
31
32
33
30
31
32
33
July 21, 2015
”4x4”-Torus
(4-ary 2-cube)
”4x4”-Mesh
(4-ary 2-mesh)
SoC Architecture
58
Properties of Tori and Meshes
Torus
Channel Bisection
BC,T = 4 N / k
Channel load under uniform
traffic (50% of traffic
crosses bisection)
T,U = k / 8
Channel load under worst
traffic (100% of traffic
crosses bisection)
T,W = k / 4
Average minimum hop
count (k even)
Hmin, T = nk / 4
July 21, 2015
Mesh
Channel Bisection
BC,M = 2 N / k
Channel load under uniform
traffic (50% of traffic
crosses bisection)
M,U = k / 4
Channel load under worst
traffic (100% of traffic
crosses bisection)
M,W = k / 2
Average minimum hop
count (k even)
Hmin, M = nk / 3
SoC Architecture
59
Physical implementation of
Mesh and Tori
In order to implement a network on a chip,
the abstract nodes of the network must be
mapped to real positions in physical space
A goal is to have the same latency for all
channels
July 21, 2015
SoC Architecture
60
Folding networks leads to
shorter largest channel length
Folded 4-ary 2 cube
July 21, 2015
SoC Architecture
61
Summary
Network is characterized by topology, routing,
switching and flow control.
Topology is the first concern of the network
Mesh and Tori are regular and symmetric, which
many on-chip networks of today propose.
Performance metrics: throughput and latency
Use channel load to determine ideal throughput
Latency includes time spent on routers, wires and
serialization
Performance is strongly sensitive to traffic pattern
July 21, 2015
SoC Architecture
62