Networks for Multi-core Chip
—A Controversial View
Shekhar Borkar
Intel Corp.
Outline
Multi-core system outlook
On-die network challenges
A simpler but controversial proposal
Benefits
Summary
A Sample Multi-core System
All generations: 10mm die, 1V, 3GHz
Core 50%, Cache 50%
65nm: Core Logic: 6MT, Cache: 44MT

Tech   Cores   Each core (mm)   Total transistors
65nm   4       5                200M
45nm   8       3.5              400M
32nm   16      2.5              800M
22nm   32      1.8              1.6B
16nm   64      1.3              3.2B
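The scaling on this slide can be reproduced with a short sketch. It assumes, as a reading of the numbers rather than something the slide states, that cores and transistors double each generation and that the fixed 10mm die is tiled with equal square cores. The names `project`, `BASE_CORES`, and `BASE_TRANSISTORS_M` are illustrative.

```python
import math

# Baseline values taken from the 65nm entry on the slide.
BASE_CORES = 4
BASE_TRANSISTORS_M = 200   # million transistors
DIE_MM = 10.0
NODES = ["65nm", "45nm", "32nm", "22nm", "16nm"]


def project(generation):
    """Return (cores, core_edge_mm, total_transistors_M) for a generation index.

    Assumption: cores and transistors double per generation on a fixed die.
    """
    cores = BASE_CORES * 2 ** generation
    # Equal square tiles on a fixed-area die: edge = die edge / sqrt(cores).
    core_edge_mm = DIE_MM / math.sqrt(cores)
    total_m = BASE_TRANSISTORS_M * 2 ** generation
    return cores, core_edge_mm, total_m


for i, node in enumerate(NODES):
    cores, edge, total = project(i)
    print(f"{node}: {cores} cores, {edge:.2f}mm each, {total}MT")
```

The computed core edges land within rounding of the 5, 3.5, 2.5, 1.8, and 1.3mm values on the slide.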
A Sample MC Network
Packet Switched Mesh
16B = 128 bit each direction
Port size: 0.4mm @ 1.5u pitch
192GB/s Bisection BW at 65nm (5mm cores)

Tech   Core (mm)   Port size (mm)   Bisection BW (GB/sec @ 3GHz)
65nm   5           0.4              192
45nm   3.5         0.4              272
32nm   2.5         0.4              384
22nm   1.8         0.4              543
16nm   1.3         0.4              768
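The port size and bisection bandwidth figures can be checked with a small sketch. It assumes that a cut through the middle of the mesh crosses sqrt(N) links for N cores and that both directions of each link are counted. For the non-square core counts (8 and 32) sqrt(N) is not an integer, so this is an interpolation that happens to match the table, not a layout the slide describes. The function names are illustrative.

```python
import math

LINK_BYTES_PER_DIR = 16   # 16B = 128 bit each direction (from the slide)
CLOCK_GHZ = 3.0           # mesh clock from the slide
WIRE_PITCH_UM = 1.5       # wire pitch from the slide


def port_width_mm():
    """Width of one port: 128 wires per direction, two directions."""
    wires = LINK_BYTES_PER_DIR * 8 * 2
    return wires * WIRE_PITCH_UM / 1000.0


def bisection_bw_gbs(cores):
    """Bisection bandwidth in GB/s, assuming sqrt(cores) links cross the cut."""
    links_cut = math.sqrt(cores)
    per_link_gbs = 2 * LINK_BYTES_PER_DIR * CLOCK_GHZ  # both directions
    return links_cut * per_link_gbs


print(f"port width: {port_width_mm():.3f} mm")
for cores in (4, 8, 16, 32, 64):
    print(f"{cores} cores: {bisection_bw_gbs(cores):.0f} GB/s")
```

Under these assumptions the port comes out at 0.384mm, which the slide rounds to 0.4mm, and the bandwidth column is reproduced after rounding.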
[Chart: Active Cap (nf/bdir bit), log scale from 1.E-04 to 1.E+00; x-axis: 0.5u, 0.18u, 65nm; labeled points: ASCI Red, TFLOP]
[Chart: Mesh Power @ 3GHz, 1V; Network Power (W), 0 to 100; series: 1X, 1.4X; x-axis: 65nm, 45nm, 32nm, 22nm, 16nm]
1. Power too high
2. Worse if link width scales up each generation
3. Most of the power dissipation is in router logic (not in the metal busses)
4. Cache coherency mechanism is complex
Why Mesh (or any other complex Network)?
Bus: Good at board level, does not extend well
• Transmission line issues: loss and signal integrity, limited frequency
• Width is limited by pins and board area
• Broadcast, simple to implement
Point to point busses: fast signaling over longer distance
• Board level, between boards, and racks
• High frequency, narrow links
• 1D Ring, 2D Mesh and Torus to reduce latency
• Higher complexity and latency in each node
Do you need point to point busses on a chip?
Bus for Multi-Core Chip?
Issues:
Slow, < 300MHz
Shared, limited scalability?
Solutions:
Repeaters to increase freq
Wide busses for bandwidth
Multiple busses for scalability
Benefits:
Power?
Simpler cache coherency
Move away from frequency, embrace parallelism
Repeated Bus
[Diagram: bus segments joined by repeaters (R)]
Assume:
• 10mm die
• 1.5u bus pitch
• 50ps repeater delay
Arbitration:
• Each cycle for the next cycle
• Decision visible to all nodes
Repeaters:
• Align repeater direction
• No driving contention
Tech   Core (mm)   Bus Seg Delay (ps)   Max Bus Freq (GHz)
65nm   5           195                  2.2
45nm   3.5         99                   2
32nm   2.5         51                   1.8
22nm   1.8         26                   1.5
16nm   1.3         13                   1.2
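One way to read the segment delays is to check how they scale with segment length. The sketch below assumes each bus segment is one core edge long, which the slide does not state explicitly, and divides each delay by the square of that length. A roughly constant ratio would be consistent with the usual quadratic growth of unrepeated wire delay with length. This is an observation about the tabulated numbers, not the model behind the slide.

```python
# (core edge in mm, bus segment delay in ps), copied from the Repeated Bus figures.
SEGMENTS = {
    "65nm": (5.0, 195),
    "45nm": (3.5, 99),
    "32nm": (2.5, 51),
    "22nm": (1.8, 26),
    "16nm": (1.3, 13),
}


def delay_per_mm2(length_mm, delay_ps):
    """Delay normalized by length squared, in ps per mm^2."""
    return delay_ps / length_mm ** 2


# Assumption: segment length equals the core edge for that node.
ratios = {node: delay_per_mm2(l, d) for node, (l, d) in SEGMENTS.items()}

for node, ratio in ratios.items():
    print(f"{node}: {ratio:.2f} ps/mm^2")
```

The ratios all fall close to 8 ps/mm^2, so halving the segment area per generation roughly halves the segment delay in this data.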
Example of a Bus Repeater
Other Bus Enhancements
Differential, low voltage swing
Twisted to reduce cross-talk
Optimal repeater placement
• Not necessarily at the core
• Higher bus frequency
Wide bus, 1024 bits or more, to transfer lots of data in one cycle
Multiple busses for concurrency
Employ interconnect engineering techniques
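As a rough illustration of what a wide bus buys, the sketch below multiplies bus width by the Max Bus Freq values from the Repeated Bus slide. It assumes one full-width transfer per cycle with no arbitration or turnaround loss, so the results are upper bounds for a hypothetical 1024-bit bus, not the figures plotted in the talk. `bus_bw_gbs` and `num_busses` are illustrative names.

```python
# Max Bus Freq (GHz) per node, from the Repeated Bus slide.
MAX_BUS_FREQ_GHZ = {"65nm": 2.2, "45nm": 2.0, "32nm": 1.8, "22nm": 1.5, "16nm": 1.2}


def bus_bw_gbs(width_bits, freq_ghz, num_busses=1):
    """Peak bandwidth in GB/s, assuming one full-width transfer per cycle."""
    return num_busses * (width_bits / 8) * freq_ghz


for node, freq in MAX_BUS_FREQ_GHZ.items():
    print(f"{node}: {bus_bw_gbs(1024, freq):.1f} GB/s peak for one 1024-bit bus")
```

Because the bus frequency falls each generation, holding bandwidth steady in this simple model means widening the bus or adding busses, which is the direction the slide proposes.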
Bus Power and Bandwidth
[Four charts, each plotted for 65nm, 45nm, 32nm, 22nm, 16nm with 1X and 1.4X series]
Full Swing Bus Power (W)
100mV Swing Bus Power (W), 0.1V Differential, axis 0 to 7
Mesh: Bisection BW (GB/s), axis 0 to 4000
Bus: Bus BW (GB/s), axis 0 to 600
Includes bus and repeater power
Factors Affecting Latency
Mesh                                                     Bus
Arbitration in each node, multiple arbitration cycles    Single arbitration for entire bus transaction
Multiple hops from source to destination                 One cycle operation
3-5 Clock latency in each node                           None
Fast clock (3 GHz)                                       Slow clock (1 GHz)
One source and destination                               Broadcast
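The comparison can be made concrete with a back-of-the-envelope sketch. The clock rates and the 3-5 cycles per node come from the slide. The hop count, the single arbitration cycle, and the single transfer cycle on the bus are assumptions chosen for illustration. Wire delay, contention, and serialization are ignored.

```python
def mesh_latency_ns(hops, cycles_per_node, clock_ghz=3.0):
    """Router latency only: each hop costs cycles_per_node cycles at clock_ghz."""
    return hops * cycles_per_node / clock_ghz


def bus_latency_ns(arbitration_cycles=1, transfer_cycles=1, clock_ghz=1.0):
    """Assumed bus cost: one arbitration cycle plus one transfer cycle."""
    return (arbitration_cycles + transfer_cycles) / clock_ghz


# Hypothetical 4-hop route through the mesh.
print(f"mesh, 4 hops, 3 cycles/node: {mesh_latency_ns(4, 3):.2f} ns")
print(f"mesh, 4 hops, 5 cycles/node: {mesh_latency_ns(4, 5):.2f} ns")
print(f"bus, 1 arbitration + 1 transfer cycle: {bus_latency_ns():.2f} ns")
```

With these inputs the slower bus clock is offset by the absence of per-hop router cycles. Short mesh routes of one or two hops would still be faster than the bus in this model.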
Summary
Point to point busses are not necessary for a multi-core chip
Rings and meshes were devised for point to point busses over long distances; overkill for an on-chip network?
Router power could be prohibitive
A wide bus, or several busses, may be adequate
• Simple to implement
• Simpler coherency
• Lower power
• Maybe lower latency
Go slower, wider, and simpler