
VLSI Architecture
Past, Present, and Future
William J. Dally
Computer Systems Laboratory
Stanford University
March 23, 1999
Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years
Technology Evolution
[Figure: wire pitch (µm), gate length (µm), and gate delay (ns) plotted on log scales versus year, 1960-2010]
Technology Evolution (2)
Parameter     1979         1999         2019          Units
Gate Length   5            0.2          0.008         µm
Gate Delay    3000         150          7.5           ps
Clock Cycle   200          2.5          0.08          ns
Gates/Clock   67           17           10
Wire Pitch    15           1            0.07          µm
Chip Edge     6            15           38            mm
Grids/Chip    1.6 × 10^5   2.3 × 10^8   3.0 × 10^11
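The Grids/Chip row follows from the Chip Edge and Wire Pitch rows. As a quick check (a minimal sketch, assuming one 'grid' is a square one wire pitch on a side, so grids/chip ≈ (chip edge / wire pitch)²):

```python
# Rough check of the Grids/Chip row: grids ~ (chip edge / wire pitch)^2.
# Values come from the table above; the squared-pitch model is an assumption.
params = {
    # year: (chip edge in mm, wire pitch in um)
    1979: (6, 15),
    1999: (15, 1),
    2019: (38, 0.07),
}

for year, (edge_mm, pitch_um) in params.items():
    edge_um = edge_mm * 1000              # convert mm to um
    grids = (edge_um / pitch_um) ** 2     # grids per chip
    print(f"{year}: {grids:.1e} grids/chip")

# Prints about 1.6e+05, 2.3e+08, and 2.9e+11 -- matching the table.
```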
Architecture Evolution
Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (× 4?)
2019   X: 10,000 MIPS, 10,000 MFLOPS         MP with 1000 Xs
Incremental Returns
[Figure: Performance versus Processor Cost (Die Area) for Pipelined RISC, Dual-issue in order, and Quad-issue out of order designs, showing diminishing incremental returns]
Efficiency and Granularity
[Figure: Peak Performance versus System Cost (Die Area) for P+M, 2P+M, and 2P+2M node configurations]
VLSI in 1979
VLSI Architecture in 1979
• 5 µm NMOS technology
• 6 mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
– 0.5 MIPS
1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
– pipeline the processor (10 CPI --> 1 CPI)
– shorten the clock cycle (67 gates/clock --> 30 gates/clock)
• In 1989, a 32-bit processor with floating point and caches fit on one chip
– e.g., i860: 40 MIPS, 40 MFLOPS
– 5,000,000 grids, 1M transistors (many of them memory)
1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
– projects delayed, performance below expectations
– 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
– Intel Pentium II (3-way o-o-o)
– Compaq 21264 (4-way o-o-o)
1979-1999: Why Implicit Parallelism?
• Opportunity
– large gap between micros and fastest processors
• Compatibility
– software pool ready to run on implicitly parallel
machines
• Technology
– not available for fine-grain explicitly parallel
machines
1999-2019: Explicit Parallelism Takes Over
• Opportunity
– no more processor gap
• Technology
– interconnection, interaction, and shared memory
technologies have been proven
Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine. (BLAGG)
– Bandwidth - large fraction (0.1) of local memory BW
– LAtency - small multiple (3) of local memory latency
– Global mechanisms - sync, fetch-and-op
– Granularity - of tasks (100 inst) and memory (8MB)
Technology for Parallel Machines
Three Components
• Networks
– 2 clocks/hop latency
– 8GB/s global bandwidth
• Interaction mechanisms
– single-cycle communication and synchronization
• Software
k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select the radix to trade off diameter, D, against B (see the sketch below).
[Figure: latency T versus dimension n for 4K nodes, L = 256, Bs = 16K; the accompanying model relates T to hop count D, message length L, and channel bandwidth B]
Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
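To make the radix/dimension trade-off concrete, here is a minimal sketch in the spirit of the cited 1990 analysis. It assumes wormhole routing with latency T ≈ D + L/W, an average of k/4 hops per dimension in a bidirectional torus, and channel width W set by dividing a fixed bisection budget among the channels crossing it; the exact model and normalization on the original slide may differ.

```python
# Latency vs. dimension for k-ary n-cubes under a fixed bisection width.
# Assumed model: T = D + L/W, with D = n*k/4 average hops (bidirectional torus)
# and W = Bs*k/(4*N) bits per channel under the bisection budget.
N  = 4096       # nodes (4K), as in the figure
L  = 256        # message length in bits
Bs = 16 * 1024  # bisection width in wires

def latency(n: int) -> float:
    k = round(N ** (1.0 / n))     # radix giving k^n = N
    D = n * k / 4                 # average hop count
    W = Bs * k / (4 * N)          # channel width in bits
    return D + L / W              # hops plus serialization of the message

for n in (1, 2, 3, 4, 6, 12):
    k = round(N ** (1.0 / n))
    print(f"n={n:2d}  k={k:4d}  T = {latency(n):7.1f} cycles")

# Low-dimensional networks (n = 2 or 3) minimize latency for these parameters.
```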
Delay of Express Channels
The Torus Routing Chip
• k-ary n-cube topology
– 2D Torus Network
– 8-bit x 20 MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture
Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
The Reliable Router
• Fault-tolerant
– Adaptive routing (adaptation of
Duato’s algorithm)
– Link-level retry
– Unique token protocol
• 32-bit x 200 MHz channels
– Simultaneous bidirectional signalling
– Low-latency plesiochronous synchronizers
• Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, “Architecture and Implementation of the Reliable Router”, Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, “Low-Latency Plesiochronous Data Retiming,” ARVLSI 1995
Dennison, Lee, and Dally, “High Performance Bidirectional Signalling in VLSI Systems,” SIS 1993
Equalized 4Gb/s Signaling
End-to-End Latency
• Software sees ~10 µs latency with a 500 ns network
• Heavy compute load associated with sending a message
– system call
– buffer allocation
– synchronization
• Solution: treat the network like memory, not like an I/O device
– hardware formatting, addressing, and buffer allocation
[Figure: a message passing from the sender's registers through Send on the Tx Node, across the Net, into a Buffer and Dispatch on the Rx Node]
Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24 clocks for a 512-node 3-cube; see the check below)
– networks faster than the main memory access of modern machines
– need end-to-end hardware support to see this, no ‘libraries’
• With high-speed signaling, bandwidth of 4 GB/s or more per channel (512 GB/s bisection) is easy to achieve
– nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection constraints to the packaging technology
– it’s hard to beat a 3-D mesh or torus
• This gives us the B and LA (of BLAGG)
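As a quick check on the 12-24 clock figure above (a sketch, assuming the 512 nodes form an 8-ary 3-cube with an average of k/4 hops per dimension):

```python
# Check of "12-24 clocks for a 512-node 3-cube".
nodes, n = 512, 3
k = round(nodes ** (1 / n))    # radix of the 3-cube: 8
avg_hops = n * k / 4           # average hops in a bidirectional torus: 6

for clocks_per_hop in (2, 4):
    print(f"{clocks_per_hop} clocks/hop -> {clocks_per_hop * avg_hops:.0f} clocks average")
# Prints 12 and 24 clocks, matching the summary.
```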
The Importance of Mechanisms
[Figure, built up across three slides: tasks A and B run serially, then in parallel with high communication/synchronization overhead (OVH + COM + Sync ≈ 0.5), then in parallel with low overhead (≈ 0.062); only the low-overhead case shows a substantial speedup]
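A small sketch of the point the figure makes, assuming two equal tasks split across two processors and a communication/synchronization cost expressed as a fraction of the serial run time (the 0.5 and 0.062 overhead figures come from the slide; the simple additive model is an assumption):

```python
# Speedup of two equal halves of a job run on two processors, where each
# processor also pays overhead (communication + synchronization) expressed
# as a fraction of the serial execution time. Simple additive model.
def speedup(overhead_fraction: float, serial_time: float = 1.0) -> float:
    parallel_time = serial_time / 2 + overhead_fraction * serial_time
    return serial_time / parallel_time

for ovh in (0.5, 0.062):
    print(f"overhead {ovh:5.3f} -> speedup {speedup(ovh):.2f}x on 2 processors")
# High overhead (0.5) gives ~1.0x (no gain); low overhead (0.062) gives ~1.8x.
```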
Granularity and Cost Effectiveness
• Parallel computers are built for
– Capability - run problems that are too big or take too long to solve any other way
• absolute performance at any cost
– Capacity - get throughput on lots of small problems
• performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
– sublinear speedup
– economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation
[Figure: a single workstation node (P, M, $) compared with a parallel machine of four nodes, each with a processor, cache, and smaller memory]
MIT J-Machine (1991)
Exploiting fine-grain threads
• Where will the parallelism come
from to keep all of these processors
busy?
– ILP - limited to about 5
– Outer-loop parallelism
• e.g., domain decomposition
• requires big problems to get lots of
parallelism
• Fine threads
– make communication and
synchronization very fast (1 cycle)
– break the problem into smaller
pieces
– more parallelism
Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms
enable fine-grain task decomposition
– simplifies programming
– exposes parallelism
– facilitates load balance
• Have demonstrated
– 1-cycle communication and synchronization locally
– 10-cycle communication, synchronization, and task dispatch
across a network
• Physically fine-grain machines have better
performance/cost than sequential machines
A 2009 Multicomputer
[Figure: a multicomputer built from processor + memory tiles]
Tile: processor (P) + 8 MB memory
Chip: 64 tiles
System: 16 chips
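The aggregate resources implied by these numbers (a quick derivation; the totals are computed here, not stated on the slide):

```python
# Aggregate resources of the proposed multicomputer:
# 16 chips x 64 tiles per chip, each tile holding 1 processor + 8 MB of DRAM.
chips, tiles_per_chip, mb_per_tile = 16, 64, 8

processors = chips * tiles_per_chip            # 1024 processors
memory_gb = processors * mb_per_tile / 1024    # 8192 MB = 8 GB of DRAM

print(f"{processors} processors, {memory_gb:.0f} GB total memory")
```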
Challenges for the Explicitly Parallel Era
• Compatibility
• Managing locality
• Parallel software
Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
– with good mechanisms
• Parallelizing sequential software is hard
– needs to be designed from the ground up
• An incremental migration path
– run sequential codes with acceptable performance
– parallelize selected applications for considerable
speedup
Performance Depends on Locality
• Applications have data- and time-dependent graph structure
– Sparse-matrix solution
• non-zero and fill-in structure
– Logic simulation
• circuit topology and activity
– PIC codes
• structure changes as particles move
– ‘Sort-middle’ polygon rendering
• structure changes as viewpoint
moves
Fine-Grain Data Migration
Drift and Diffusion
• Run-time relocation based on pointer use
– move data at both ends of pointer
– move control and data
• Each ‘relocation cycle’ (sketched below)
– compute drift vector based on pointer use
– compute diffusion vector based on
density potential (Taylor)
– need to avoid oscillations
• Should data be replicated?
– not just update vs. invalidate
– need to duplicate computation to avoid
communication
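A minimal sketch of one relocation cycle as described above, on a one-dimensional row of tiles. The weighting, damping factor, capacity test, and data structures are illustrative assumptions, not the actual policy:

```python
# Illustrative drift-and-diffusion relocation cycle on a 1-D row of tiles.
# Drift pulls an object toward the objects it exchanges pointers with;
# diffusion pushes objects off overfull tiles; damping limits movement
# per cycle to avoid oscillation. All constants here are assumptions.
from collections import defaultdict

def relocation_cycle(placement, pointer_uses, capacity, damping=0.5):
    """placement: {obj: tile index}; pointer_uses: [(src_obj, dst_obj), ...]."""
    pull = defaultdict(list)
    for src, dst in pointer_uses:          # drift: both ends of each pointer
        pull[src].append(placement[dst])
        pull[dst].append(placement[src])

    load = defaultdict(int)                # crude density potential per tile
    for tile in placement.values():
        load[tile] += 1

    new_placement = {}
    for obj, tile in placement.items():
        drift = sum(t - tile for t in pull[obj]) / max(len(pull[obj]), 1)
        diffuse = -1 if load[tile] > capacity else 0
        new_placement[obj] = tile + round(damping * (drift + diffuse))
    return new_placement

# Example: 'a' starts far from 'b' and 'c', which it communicates with;
# over a few cycles the three converge onto nearby tiles.
placement = {"a": 0, "b": 4, "c": 4}
uses = [("a", "b"), ("a", "c")]
for _ in range(3):
    placement = relocation_cycle(placement, uses, capacity=2)
print(placement)
```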
Migration and Locality
[Figure: average distance in tiles (0-6) versus migration period (1-37) for four policies: NoMigration, OneStep, Hierarchy, and Mixed]
Parallel Software:
Focus on the Real Problems
• Almost all demanding problems have ample
parallelism
• Need to focus on fundamental problems
– extracting parallelism
– load balance
– locality
• load balance and locality can be covered by excess parallelism
• Avoid incidental issues
– aggregating tasks to avoid overhead
– manually managing data movement and replication
– oversynchronization
Parallel Software:
Design Strategy
• A program must be designed for parallelism
from the ground up
– no bottlenecks in the data structures
• e.g., arrays instead of linked lists
• Data parallelism (see the sketch below)
– many for loops (over data, not time) can be forall
– break dependencies out of the loop
– synchronize on natural units (no barriers)
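A small sketch of the ‘for becomes forall’ point, assuming the loop body is independent across data elements; the scale-by-two body is a stand-in for real per-element work:

```python
# A loop over data (not time) with no loop-carried dependency can be a forall.
# The sequential loop and the parallel version compute the same result.
from concurrent.futures import ProcessPoolExecutor

def body(x: float) -> float:
    return 2.0 * x                    # independent per-element work

data = [float(i) for i in range(1000)]

# Sequential: a for loop over the data
out_seq = [body(x) for x in data]

# Parallel: the same loop expressed as a forall over the data
if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        out_par = list(pool.map(body, data))
    assert out_par == out_seq
```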
Conclusion: We are on the threshold of the
explicitly parallel era
• As in 1979, we expect a 1000-fold increase in ‘grids’
per chip in the next 20 years
• Unlike 1979, these ‘grids’ are best applied to explicitly parallel machines
– Diminishing returns from sequential processors (ILP) - no
alternative to explicit parallelism
– Enabling technologies have been proven
• interconnection networks, mechanisms, cache coherence
– Fine-grain machines are more efficient than sequential
machines
• Fine-grain machines will be constructed from multiprocessor/DRAM chips
• Incremental migration to parallel software