Presentation Slides

Download Report

Transcript Presentation Slides

GCA: Global Congestion Awareness
for Load Balance in Networks-onChip
Mukund Ramakrishna, Paul V. Gratz & Alex
Sprintson
Department of Electrical and Computer
Engineering
Texas A&M University
Networks-on-Chip
• Moore’s Law is putting more and more transistors on the chip.
• NoCs scale better than traditional interconnects.
• High interconnect latencies translate into idle processor core cycles
and wasted power
Intel Single-Chip Cloud
Computer
(ISSCC 2010)
Tilera
Tile64
Routing in NoCs
fft benchmark from SPLASH-2 under DOR
Darker arrows = Higher congestion
• Real workloads are
unbalanced in nature
• Oblivious routing (DOR)
– Tends to exacerbate
congestion
• Adaptive routing:
– Try to avoid congested spots
– Classified on the basis of
awareness:
• Local adaptive
• Regionally aware
• Globally aware
Outline
•
•
•
•
Introduction
Motivation for Global Awareness
Related Work
Global Congestion Awareness (GCA):
– Route Computation
– Information Propagation
• Implementation
• Evaluation
• Conclusion and Future Work
Local Congestion
• Local adaptive
– Measure local congestion metric (free VC, free buffers)
– Greedy local decisions due to poor visibility
D
Low congestion
Moderate congestion
High congestion
Optimal
S
Local adaptive
Regional Awareness
• Regionally aware(Gratz et al. HPCA’ 08, Ma et al. ISCA ’11 )
– Aggregated congestion of all nodes in a dimension
– Noisy information degrades performance.
D
Low congestion
Moderate congestion
High congestion
Optimal
RCA-1D
S
Ideally …
• On a per-destination basis:
– Evaluate end-to-end delay along all minimal paths to
destination
– Pick path with least delay
D
Low congestion
Moderate congestion
High congestion
Optimal
S
Global Awareness
• Earlier schemes utilize a separate
congestion monitoring network (Manevich et al. DSD`11,
Ramanujam et al. ANCS`10)
• Increased network complexity
• Slow route calculation mechanism in DAR
• Challenges:
– Low overhead dissemination technique
– Limited resource for storage and computation
Outline
• Introduction
• Motivation for Global Awareness
• Global Congestion Awareness (GCA):
– Route Computation
– Information Propagation
• Implementation
• Evaluation
• Conclusion and Future Work
GCA: Bird’s eye view
1. Congestion information is conveyed via
piggybacking onto the header flits
2. Every node builds a “map” representing
the congestion of the network
3. Optimal path is calculated using a
shortest path graph algorithm in each
router
Packet-level
• “Piggybacking” of congestion information in header flits (Zhang et al.
PrimeAsia`10, Chen et al. NOCS`12)
• Back-annotation appends congestion information for link in opposite
direction
– Direction of flit traversal: Black
– Congestion Information appended: Red
Router Micro-Architecture
Traffic Vector
VC-1
N
VC-n
Optimal
Output Port
Table
E
Header
Modification
W
Routing Unit
Local
Congestion
Values
Route
Compute
Hardware
VC Allocator
S
I
n
Congestion
Map
Destination
Node
VC-1
VC-n
XB Allocator
X
N
E
W
S
Ej
Router Micro-Architecture
Traffic Vector
VC-1
N
VC-n
Optimal
Output Port
Table
E
Header
Modification
W
Routing Unit
Local
Congestion
Values
Route
Compute
Hardware
VC Allocator
S
I
n
Congestion
Map
Destination
Node
VC-1
VC-n
XB Allocator
X
N
E
W
S
Ej
Router Micro-Architecture
P1
Pout
P2
d1
l1
+
<
d2
l2
+
Dout
4
Route Computation
• Node marked 0 is
source
• Number on link
denotes congestion
• Number in node
denotes shortest path
cost to that node
• Letter denotes
optimal output port.
Route Computation (contd.)
• At most two feeder
nodes for every node
• Pick the feeder node
with the least cost
path
• For nodes a hop
away:
– Cost = Congestion of
connecting link
Route Computation (contd.)
• Simple “add and
compare” step
• Example: Top-left
node
– From East port:
• 3+1=4
– From South port:
• 8+0=8
• Cost assigned = 4
Route Computation (contd.)
• Every iteration flows
outward
• Every quadrant
computed in parallel
• Re-evaluate only the
downstream subgraph every update
Caveats
• Infrequent updates of distant links
– Scale the weights so that distant information is less important
– For a link i with congestion ci, which is n hops away from the
local node, we calculate its weight as
𝑊𝑖 = 𝑐𝑖 ∗ (1 − 𝑤 ∗ 𝑛 )
• Staleness of congestion information
– Entries untouched for n cycles are faded
– Fade towards a nominal value in steps of x (empirically
determined)
– For large networks, fade one entry every n/L cycles where L is
the number of links
Limited GCA (LGCA)
• Constrain the visibility
to a smaller window
• Store information only
for nodes k hops or
less away
• Reduces storage
overhead at the cost
of slight performance
penalty vis-à-vis GCA
Implementation
• Storage overhead:
– Congestion Map:
• 3 bit congestion metric
• 6𝑁 ∗ 𝑁 − 1 𝑏𝑖𝑡𝑠
– Optimal output port table:
• 1 bit per node because of minimal routing
• No storage for nodes in same dimension
• 𝑁 2 − 2𝑁 + 1 𝑏𝑖𝑡𝑠
– Flag array for fading mechanism:
• One bit per entry in the congestion map
• 2𝑁 ∗ 𝑁 − 1 𝑏𝑖𝑡𝑠
• Synthesis results:
– 16 node computation and storage circuit
– 2GHz at 45nm and less than 1% area overhead
– No impact to router critical path
Simulation parameters
• Simulation carried out in a cycle accurate C++ simulator1
Characteristics of simulated design
Topology
Realistic Workloads
Synthetic traffic
7x7 2D Mesh
8x8 2D Mesh
Router uArch
Per hop latency
3 cycles: 2 cycles in router, 1 cycle to cross channel
Virtual Channels/Port
8
Flit buffers/VC
5
Traffic Workload
SPLASH-2 traces
Random, Transpose, Bitcomplement
Duration of simulation
10 million cycles or end of trace
10000 warm-up cycles followed by
100000 packets
Scaling Factor (w)
Fading (x,n)
1.
Two Stage Speculative
0.25
x=1 unit; n = 100 cycles
S. Prabhu, B. Grot, P. Gratz, and J. Hu, “Ocin_tsim - DVFS Aware Simulator for NoCs,” in Proc. SAW-1, Jan
2010.
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
ocean
barnes
geomean
(overall)
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
Improvement due to GCA (average):
- DOR: 45%
- Local: 26%
- RCA-1D: 15%
- DAR: 8%
ocean
barnes
geomean
(overall)
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
ocean
barnes
geomean
(overall)
Outliers:
• DAR better than GCA on inherently static workloads (fft,radix)
• Statistical traffic distribution enables better performance
• GCA better than DAR on all other workloads
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
ocean
barnes
geomean
(overall)
LGCA performance:
• Close to GCA on most workloads
• lu is an exception
• Overall average slightly worse than GCA but still better than other competing
algorithms
Conclusion
• Proposed a novel adaptive routing mechanism which uses
global congestion information to perform per-hop routing in
on-chip networks.
• Uses back-annotated piggybacking to propagate congestion
information which alleviates the issue of overheads
• Light-weight implementation of the shortest path computation
• GCA improves average packet latency
– By 26% against local adaptive
– By 15% against RCA -1D
– By 8% against DAR
On average for the SPLASH-2 suite of benchmarks.
Thank You
BACKUP
Minimal Adaptive Routing
• Model
– Adaptive routing along minimal paths.
D
S
Fading
• Entries untouched for n cycles are faded
• Fade towards the gray value in steps of x, an empirically
determined parameter
– Black = extremely congested
– White = uncongested
– Gray = middle value
𝑊𝑖 = 𝑊𝑖 ± 𝑥
• For large networks, stagger the fading mechanism
– Fade one entry every n/L cycles where L is the number of links
Synthetic Traffic
Transpose
Uniform Random
60
40
xydor
30
local
20
RCA
10
LGCA
0
GCA
10
20
30
Injection Rate (flits/node/cycle)
40
Average Packet Latency
50
50
40
xydor
30
local
20
RCA
LGCA
10
GCA
0
5
Bit-complement
Average Packet Latency
Average Packet Latency
60
15
25
35
Injection Rate (flits/node/cycle)
90
80
70
60
50
40
30
20
10
0
xydor
local
RCA
LGCA
GCA
3
8
13
18
23
Injection Rate (flits/node/cycle)
28
45
Network Sensitivity Experiments
• Variation of two parameters:
– Vary VC Count
• No variation in relative performance
– Vary the mesh size
• Performs better for larger meshes
• For both experiments, we simulate
Transpose traffic
Network Size
16 x 16 Mesh
8 x 8 Mesh
60
21 %
16 %
100
80
60
local
LGCA
40
GCA
Average Packet Latency
Average Packet Latency
50
40
local
30
LGCA
20
GCA
10
20
0
0
4
9
14
Injection Rate (% of flits/node/cycle)
19
5
15
25
35
Injection Rate (% of flits/node/cycle)
45
Congestion Information Scaling
• Scale the weights so that distant information is
less important in making routing decision
• Degree of scaling determined by an empirical
constant w (0< w < 1).
• For a link i with congestion ci, which is n hops
away from the local node, we calculate its weight
as
𝑊𝑖 = 𝑐𝑖 ∗ (1 − 𝑤 ∗ 𝑛 )
Number of steps
• For a 𝑁𝑥𝑁 2-D mesh, the number of steps
required lies in the range:
– The upper bound is always 2𝑁 − 3
– If N is even, the lower bound is ⌊𝑁/2⌋
– If N is odd, the lower bound is 𝑁/2 − 1
Empirical parameters
• Scaling Factor w:
– w = 0.25
• GCA: links beyond 4 hops are assigned a constant
scaling factor of 0.25
• LGCA: links beyond 4 hops are not stored as k=4
• Fading mechanism:
– n = 100 cycles
– x = 1 unit
Challenges in global awareness
• Dissemination of congestion information
– Low overhead
– Account for staleness
• Limited storage in on-chip routers
– Exponential number of paths to each
destination
• Limited hardware resources for
computations
Future Work
• Congestion prediction: proactive adaptive routing instead
of reactive adaptive routing
• Stability analysis: Does the algorithm thrash between
different paths for some traffic patterns?
• Effect of imperfect congestion state representation
Back-annotation
For each outgoing flit, the node appends the
congestion metric for the link in the same
direction
For each outgoing flit, the node appends the
congestion metric for the link in the opposite
direction.
D
S
D
S
Packet Traversal direction
Congestion information direction
Multi-region
• Network partitioned
into four quadrants
• Each quadrant runs a
benchmark as shown
• Isolated traffic regions
emulate virtual
machine-like scenario
Average Packet Latency (relative to
DOR)
Multi-region
local_uniregion
local_multiregion
RCA_uniregion
RCA_multiregion
GCA_uniregion
GCA_multiregion
1.4
1.2
1
0.8
0.6
0.4
0.2
0
lu
waternsq
ocean
raytrace
Geomean
• Local adaptive is unaffected due to its lack of visibility
• RCA’s performance suffers due to noise through aggregation
• GCA maintains fine-grained information
• Helps avoid noise and perform better than RCA
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
Improvement over local adaptive:
• GCA: 26% average, 86% best case
• LGCA: 23% average, 82% best case
ocean
barnes
geomean
(overall)
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
Improvement over RCA-1D:
• GCA: 15% average, 51% best case
• LGCA: 11% average, 38% best case
ocean
barnes
geomean
(overall)
SPLASH-2
Average Packet Latency (Relative to DOR)
1.2
1
0.8
local
RCA
0.6
DAR
GCA
0.4
LGCA
0.2
0
water
spatial
fft
lu
water
nsquared
radix
raytrace
Improvement over DAR:
• GCA: 8% average, 53% best case
• LGCA: 4% average, 41% best case
ocean
barnes
geomean
(overall)