RPM - Center for Wireless Communication :: UCSD

Download Report

Transcript RPM - Center for Wireless Communication :: UCSD

Near-Optimal Oblivious Routing
for 3D-Mesh Networks
ICCD 2008
Rohit Sunkam Ramanujam
Bill Lin
Electrical and Computer Engineering Department
University of California, San Diego
1
Motivation: Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
12.64mm
I/O Area
single tile
1.5mm
21.72mm
2.0mm
Technology
65nm, 1 poly, 8 metal (Cu)
Transistors
100 Million (full-chip)
1.2 Million (tile)
Die Area
275mm2 (full-chip)
3mm2 (tile)
C4 bumps # 8390
Tilera Tile64
I/O Area
Intel 80-core
2
Motivation: 3D Integrated Circuits
• 3D Benefits
– Reduced wire delays
– Enormous bandwidth
– Heterogeneous system
integration
• Natural progression
– 3D-mesh for 3D CMPs
2D to 3D
3
Routing Algorithm Objectives
• Maximize throughput
– How much load the network can handle
• Minimize hop count
– Minimize routing delay between source and destination
4
Challenges
• For 2D-case, a near-optimal throughput routing algorithm with
minimal hop count called O1TURN is known [Seo’05].
• Surprisingly, optimality of O1TURN does not extend to 3D case,
actual throughput performance degrades severely.
• Only known optimal throughput routing algorithm is Valiant
(VAL) load-balancing, but VAL performs poorly on hop count
(latency), twice that of minimal routing.
5
Main Contribution
• Developed a new oblivious routing algorithm called
“Randomized Partially Minimal” (RPM) routing.
• RPM provably guarantees near-optimal worst-case
throughput in 3D case.
– Optimal for even radix k (e.g. 8 x 8 x 8 mesh).
– Within factor of 1/k2 for odd radix (e.g. 7 x 7 x 7 mesh).
• Good latency performance.
– Only factor of 1.33 of minimal routing (much better than 2x cost of
VAL, only known routing algorithm with optimal throughput)
– In practice, 3D-meshes are asymmetric because number of device
layers less than number of tiles per edge.
– e.g., for 16 x 16 x 4 mesh (4 layers), RPM’s hop count just factor of 1.1
of minimal routing.
6
Outline
• Motivation for our work
 Existing 2D routing algorithms don’t extend well into 3D
• RPM routing algorithm
• Simulation results
• Extensions and future work
7
Existing Routing Algorithms
The 2D case
• Dimension-Ordered Routing (DOR)
– Route minimal XY
• Valiant load-balancing (VAL)
– Route source → randomly chosen
intermediate node → destination
– Route minimal XY in both phases
• ROMM
– Same as VAL, but intermediate node
restricted to minimal direction
• Orthogonal 1-TURN (O1TURN)
– Route minimal XY and YX with equal
probability
Extending to the 3D case …
• Dimension-Ordered Routing (DOR)
– Route minimal XYZ
• Valiant load-balancing (VAL)
– Route source → randomly chosen
intermediate node → destination
– Route minimal XYZ in both phases
• ROMM
– Same as VAL, but intermediate node
restricted to minimal direction
• Orthogonal 1-TURN (O1TURN)
– Route along one of 6 minimal
orthogonal paths (XYZ, XZY, YXZ,
YZX, ZXY, ZYX) with equal probability
8
Worst-Case Throughput
• Best theoretical normalized worst-case throughput known to
be 50% (well-known result).
• Worst-case throughput analysis can be reduced to a maximal
weighted matching problem [Towles’02].
• VAL achieves this optimal throughput, but has poor latency.
• As shown next, DOR, ROMM, and O1TURN are all far from
optimal in 3D.
9
Poor Worst-Case Throughput
VAL/Optimal
Only
6-15%
10
How do 2D mesh algorithms fare in 3D?
Normalized
Worst-Case
Throughput
Normalized
Average-Case
Throughput
Hop Count
(normalized to
minimal)
VAL
8 x 8 x 8 Network
DOR
ROMM
O1TURN
0.5
0.063
0.132
0.15
0.5
0.316
0.454
0.513
VAL
DOR
ROMM
O1TURN
2
1
1
1
• Worst case throughput of DOR, ROMM, O1TURN far from optimal
• Average hop count of VAL far from minimal
• Need a routing algorithm that can trade latency for worst-case
11
throughput
Why O1TURN performs poorly in 3D?
• O1TURN – Worst-Case throughput optimal for 2D but more
than 3 times worse than optimal for 3D
• The difference
– 2D traffic matrix is “admissible” for 2D mesh
– In 3D, projected traffic on each 2D plane is no longer admissible !!
• Can we transform the 3D routing problem to routing admissible
traffic on each 2D plane ?
12
Outline
• Motivation for our work
• Existing 2D algorithms don’t extend well into 3D
 RPM routing algorithm
• Simulation results
• Extensions and future work
13
Randomized Partially-Minimal Routing (RPM)
Phase-2 Z
Intermediate
layer to
destination
Random
intermediate layer
Phase-1 Z
Source to
intermediate
layer
Destination
Z
Y
Source
X
XY or YX routing on the intermediate layer
14
Main Idea
• Load-balance uniformly across the vertical layers
• Min XY/YX used on each layer
• Main Result: RPM has near-optimal worst-case throughput
– Achieves optimal worst-case throughput when network radix k is even
– Within a factor of 1/k2 optimal when k is odd.
15
RPM achieves Near-Optimal Worst Case
Throughput (optimal for even radix)
VAL/Optimal
RPM
16
Average-Case Throughput
• RPM outperforms VAL, DOR, ROMM and O1TURN in averagethroughput on randomly generated traffic.
0.8
0.7
0.6
VAL
DOR
ROMM
O1TURN
RPM
0.5
0.4
0.3
0.2
0.1
0
4x4x4
8x8x8
16x16x4
17
Average Hop Count
• Normalized hop count of RPM
– Symmetric Meshes - 1.33 times minimal compared to 2x
for VAL
– Asymmetric 16x16x4 Mesh – 1.1 times minimal
18
Outline
• Motivation for our work
• Existing 2D routing algorithms don’t extend well into 3D
• RPM routing algorithm
 Simulation results
• Extensions and future work
19
Flit-Level Simulation
• Ideal throughput evaluation assumes
– Ideal single-cycle router
– Infinite buffers
– No contention in switches, no flow control
• Flit-level simulation
– PopNet network simulator
– 4 stage router pipeline – Route computation, VC allocation, Switch
arbitration, Link traversal
– Credit-based flow control
– 8 virtual channels, each 5 flits deep
– Multi-flit packets injected into the network (5 flits/packet)
20
Flit-Level Simulation (cont’d)
• Network configurations simulated
– 4 x 4 x 4 Mesh
– 8 x 8 x 8 Mesh
– 16 x 16 x 4 Mesh
• Routing algorithms compared: DOR, VAL, ROMM, O1TURN,
DUATO, RPM
– DUATO is a minimal adaptive routing algorithm implemented for
comparison
• Four different traffic traces used
–
–
–
–
Transpose traffic – (x,y,z) → (y,z,x)
Complement traffic – (x,y,z) → (k-x-1, k-y-1, k-z-1)
Uniform traffic
Worst Case traffic pattern for DOR (DOR-WC) – (x,y,z) → (k-z-1, k-y-1, k-x-1)
21
Uniform Traffic
8x8x8 Mesh
16x16x4 Mesh
22
Transpose Traffic
8x8x8 Mesh
16x16x4 Mesh
23
Complement Traffic
8x8x8 Mesh
16x16x4 Mesh
24
DOR-WC Traffic
8x8x8 Mesh
16x16x4 Mesh
25
To sum it up …
• 3D IC technology is emerging.
• Stacking cores in 3 dimensions offers several advantages over
2D placement of cores.
• 2D minimal Mesh routing algorithms have poor worst-case
throughput in 3D, VAL has high latency penalty.
• RPM trades off latency (partially-minimal) for better worst
case performance (near-optimal).
26
Thank You
Questions?
27