Echelon Component Architecture


Vanguard TTI, 12/6/11
Q: What do these have in common?

Tianhe-1A
  ~4 PFLOPS peak, 2.5 PFLOPS sustained
  ~7,000 NVIDIA GPUs
  About 5 MW

3G smart phone
  Baseband processing: 10 GOPS
  Applications processing: 1 GOPS and increasing
  Power limit of 300 mW
Both are Based on NVIDIA Chips

Fermi
  3 × 10⁹ transistors
  512 cores

Tegra 2 (T20)
  3 ARM cores
  GPU
  Audio, video, etc.
More Fundamentally

Both
  are power limited
  get performance from parallelism
  need a 100x performance increase in 10 years
100x performance in 10 years, Moore's Law will take care of that, right?

Wrong!
Moore's Law gives us transistors, which we used to turn into scalar performance.

Moore, Electronics 38(8), April 19, 1965
But ILP was 'mined out' in 2000

[Chart: single-thread performance (ps/instruction) against a linear ps/instruction trend, 1980-2020, log scale; the gap between the historical trend and actual performance is marked at 30:1, 1,000:1, and 30,000:1.]

Dally et al., "The Last Classical Computer", ISAT Study, 2001
And L³ energy scaling ended in 2005

Moore, ISSCC Keynote, 2003
Result: The End of Historic Scaling

C. Moore, "Data Processing in ExaScale-Class Computer Systems", Salishan, April 2011
Historic scaling is at an end!

Continuing performance scaling for computer systems of all sizes requires addressing two challenges: Power and Parallelism.

Much of the economy depends on this.
The Power Challenge
In the past we had constant-field scaling
  L' = L/2
  V' = V/2
  E' = CV² = E/8
  f' = 2f
  D' = 1/L² = 4D
  P' = P

Halve L and get 8x the capability for the same power.
Now voltage is held nearly constant
  L' = L/2
  V' = V
  E' = CV² = E/2
  f' = 2f*
  D' = 1/L² = 4D
  P' = 4P

Halve L and get 2x the capability for the same power in ¼ the area.

*f no longer scales as 1/L, but it doesn't matter: we couldn't power it if it did.
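Worked out per generation (restating the two slides above; writing switching energy E = C·V² and power per unit area P = E·f·D, with L' = L/2):

  Constant field (V' = V/2):   E' = (C/2)·(V/2)² = E/8    P' = (E/8)·(2f)·(4D) = P
  Constant voltage (V' = V):   E' = (C/2)·V²     = E/2    P' = (E/2)·(2f)·(4D) = 4P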
Performance = Efficiency
Efficiency = Locality
Locality
The High Cost of Data Movement

Fetching operands costs more than computing on them.

[Figure: approximate energies in a 28 nm process on a 20 mm die. A 64-bit DP operation: 20 pJ. A 256-bit access to an 8 kB SRAM: 50 pJ. Moving 256 bits over on-chip buses: 26 pJ to 256 pJ to 1 nJ, depending on distance. An efficient off-chip link: 500 pJ. A DRAM read/write: 16 nJ.]
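A back-of-the-envelope check using the figure's numbers (my arithmetic, not the talk's), amortizing each 256-bit access over four 64-bit operands:

#include <cstdio>

int main() {
    const double op_pj   = 20.0;     // 64-bit DP operation
    const double sram_pj = 50.0;     // 256-bit access to 8 kB SRAM
    const double dram_pj = 16000.0;  // 256-bit DRAM read/write (16 nJ)
    // Amortize each 256-bit access over four 64-bit operands.
    printf("SRAM operand vs. op: %.2fx\n", (sram_pj / 4.0) / op_pj);  // ~0.6x
    printf("DRAM operand vs. op: %.0fx\n", (dram_pj / 4.0) / op_pj);  // ~200x
    return 0;
}

Feeding an operand from local SRAM costs less than the operation itself; feeding it from DRAM costs roughly 200x the operation, which is the point of the figure.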
Scaling makes locality even more important.
It's not about the FLOPS
It's about data movement

Algorithms should be designed to perform more work per unit of data movement.
Programming systems should further optimize this data movement.
Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
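One standard illustration of doing more work per unit of data movement (a generic CUDA sketch, not code from the talk): shared-memory tiling of matrix multiply, where each tile of A and B is fetched from DRAM once and then reused TILE times.

#define TILE 16

// Illustration only. Assumes n is a multiple of TILE; launch with
// dim3 grid(n / TILE, n / TILE), dim3 block(TILE, TILE).
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];   // on-chip staging for a tile of A
    __shared__ float Bs[TILE][TILE];   // on-chip staging for a tile of B
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)        // each staged value is reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}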
System Sketch
[Figure: Echelon chip floorplan. A tiled array of SMs and NOC routers, with LOC (latency-optimized) cores distributed among them; L2 banks and a crossbar (XBAR) in the middle; DRAM I/O and network (NW) I/O around the perimeter. An inset expands one SM into its lanes. Roughly 17 mm on a side, 10 nm process, 290 mm².]
Overhead
An Out-of-Order Core

Spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add): the scheduling overhead is roughly 80x the cost of the multiply, and thousands of times the cost of the add.

(Milad Mohammadi)
SM Lane Architecture

[Figure: one SM lane. The control path holds the thread PCs, active PCs, a scheduler, and an L0 I$; the data path holds operand register files (ORFs) feeding two FP/Int units and an LS/BR unit, with local memory (LM) banks and network, L0/L1 address, and LD/ST ports.]

  64 threads, 4 active threads
  2 DFMAs (4 FLOPS/clock)
  ORF bank: 16 entries (128 bytes)
  L0 I$: 64 instructions (1 KB)
  LM bank: 8 KB (32 KB total)
Solving the Power Challenge – 1, 2, 3
Solving the ExaScale Power Problem

[Bar chart, scale 0-2500: energy per operation broken into local op, on-chip, off-chip, and overhead components. Successive bars show "Today" and the reductions from process scaling ("Scale"), cutting overhead ("Ovh"), and exploiting locality ("Local").]
More Fundamentally

Both
  are power limited
  get performance from parallelism
  need a 100x performance increase in 10 years
Parallelism
Parallel programming is not inherently any more difficult than serial programming.

However, we can make it a lot more difficult.
A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}
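For comparison, a rough CUDA sketch of the same structure (illustration only; this is not the programming model the talk assumes, and Molecule, force_eval, and the neighbor-list layout are hypothetical):

struct Molecule {
    float force;
    int   num_neighbors;
    int   neighbors[64];   // indices into the molecule set
};

// Stand-in for a real pairwise force term.
__device__ float force_eval(const Molecule& a, const Molecule& b, int term) {
    return 0.0f;
}

// One thread per molecule; the thread array is created by the kernel launch.
__global__ void compute_forces(Molecule* set, int n, int num_force_terms) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int j = 0; j < set[i].num_neighbors; ++j)      // nested
        for (int f = 0; f < num_force_terms; ++f)       // doubly nested
            sum += force_eval(set[i], set[set[i].neighbors[j]], f);
    set[i].force = sum;                                 // the reduction, made explicit
}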
Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

  No machine details
  All parallelism is expressed
  Synchronization is semantic (in the reduction)
We could make it hard

pid = fork() ;                  // explicitly managing threads
lock(struct.lock) ;             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;
code = send(pid, tag, &msg) ;   // partition across nodes
Programmers, tools, and architecture need to play their positions:
  Programmer
  Tools
  Architecture
Programmers, tools, and architecture need to play their positions:

  Programmer: the algorithm, all of the parallelism, abstract locality
  Tools: combinatorial optimization, mapping, selection of mechanisms
  Architecture: fast mechanisms, exposed costs
Programmers, tools, and architecture need to play their positions:

Programmer:
forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested
    forall force in forces {                // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space, map molecules across memories, stage data up/down the hierarchy, select mechanisms

Architecture: exposed storage hierarchy, fast comm/sync/thread mechanisms
Fundamental and Incidental Obstacles to Programmability

Fundamental
  Expressing 10⁹-way parallelism
  Expressing locality to deal with >100:1 global:local energy
  Balancing load across 10⁹ cores

Incidental
  Dealing with multiple address spaces
  Partitioning data across nodes
  Aggregating data to amortize message overhead
The fundamental problems are hard enough. We must eliminate the incidental ones.
Execution Model

[Figure: objects and threads (A, B) in a global address space, accessed by loads and stores through an abstract memory hierarchy; threads interact via active messages.]
Thread array creation, messages, block transfers, collective operations – at the "speed of light"
Fermi
  Hardware thread-array creation
  Fast __syncthreads()
  Shared memory
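A minimal CUDA fragment that exercises all three mechanisms (my example, not from the talk):

// Each block of 256 threads sums its slice of 'in' into one value of 'out'.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];                       // shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();                                 // fast barrier across the thread array
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Hardware thread-array creation: one line launches n/256 blocks of 256 threads
// (assuming n is a multiple of 256):
//   block_sum<<<n / 256, 256>>>(d_in, d_out);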
Scalar ISAs don't matter.

Parallel ISAs – the mechanisms for threads, communication, and synchronization – make a huge difference.
Abstract description of locality – not mapping

compute_forces::inner(molecules, forces) {
  tunable N ;
  set part_molecules[N] ;
  part_molecules = subdivide(molecules, N) ;
  forall (i in 0:N-1) {
    compute_forces(part_molecules[i]) ;
  }
}

The autotuner picks the number and size of partitions, recursively.
No need to worry about "ghost molecules": with a global address space, it just works.
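A toy sketch of the idea on today's hardware (mine, not the Echelon toolchain): sweep one tunable on the host and keep the fastest setting; here the tunable is just the thread-block size of a SAXPY kernel.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int best_block = 0;
    float best_ms = 1e30f;
    for (int block = 32; block <= 1024; block *= 2) {   // the "tunable"
        cudaEventRecord(start);
        saxpy<<<(n + block - 1) / block, block>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        // A real tuner would warm up and average several runs.
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

A real autotuner searches a much larger space (tile sizes, unroll factors, mappings), and does so recursively, but the structure is the same: the program declares the tunable and the tool binds it.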
Autotuning Search Spaces

[Figure: execution time of matrix multiplication across the space of unrolling and tiling factors.]

Architecture enables simple and effective autotuning.

T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, "Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation", IEEE PACT, pp. 237-248, 2000.
Performance of Auto-tuner

Measured raw performance of benchmarks, auto-tuner vs. hand-tuned, in GFLOPS:

                     Conv2D   SGEMM   FFT3D   SUmb
  Cell        Auto    96.4     129     57     10.5
              Hand    85       119     54     -
  Cluster     Auto    26.7     91.3    5.5    1.65
              Hand    24       90      5.5    -
  Cluster of  Auto    19.5     32.4    0.55   0.49
  PS3s        Hand    19       30      0.23   -

For FFT3D, performance is with fusion of leaf tasks. SUmb is too complicated to be hand-tuned.
More Fundamentally

Both
  are power limited
  get performance from parallelism
  need a 100x performance increase in 10 years
A Prescription
Research

Need a research vehicle (an experimental system)
  Co-design architecture, programming system, and applications

Productive parallel programming
  Express all the parallelism and locality
  Compiler and run-time map to the target machine
  Leverage an existing ecosystem

Mechanisms for threads, communication, and synchronization
  Eliminate "incidental" programming issues
  Enable fine-grain execution

Power
  Locality: exposed memory hierarchy and software to use it
  Overhead: move scheduling to the compiler

Others are investing; if we don't invest, we will be left behind.
Education

We need parallel programmers, but we are training serial programmers and serial thinkers.

Parallelism throughout the CS curriculum:
  Programming
  Algorithms
    Parallel algorithms
    Analysis focused on communication, not counting ops
  Systems
    Models need to include locality
A Bright Future from Supercomputers to Cellphones

Eliminate overhead and exploit locality to get 100x power efficiency.

Easy parallelism with a coordinated team: programmer, tools, architecture.
Granularity
Number of threads increasing faster than problem size
[Chart: number of threads and bytes of memory vs. year, 1995-2020, on a log scale from 1e3 to 1e16. Weak-scaling and strong-scaling regimes are marked; the thread count grows faster than the byte count.]
Smaller sub-problem per thread

More frequent comm, sync, and thread operations
This fine-grain parallelism is multilevel and irregular
To support this requires fast mechanisms for:

Thread arrays: create, terminate, suspend, resume
  Hardware allocation of resources to a thread array: threads, registers, shared memory
  With locality

Communication
  Data movement up and down the hierarchy
  Fast active messages (message-driven computing)

Synchronization (see the sketch below)
  Collective operations (e.g., barrier, reduce)
  Pairwise (producer-consumer)
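As a concrete example of a fast collective mechanism on current hardware (my example; not the Echelon mechanisms themselves), a CUDA warp-level reduction that sums a value across 32 threads register-to-register, with no shared-memory round trip and no explicit barrier:

// Sum a value across the 32 threads of a warp using shuffle instructions.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);  // register-to-register exchange
    return v;   // lane 0 ends up holding the warp's sum
}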
Execution Model

[Figure: objects and threads (A, B) in a global address space, accessed by loads and stores through an abstract memory hierarchy; threads interact via active messages.]
J-Machine Speedup with Strong Scaling

[Figure: J-Machine speedup under strong scaling; 2 characters per node.]

Noakes et al., "The J-Machine Multicomputer: An Architectural Evaluation", ISCA, 1993, pp. 224-235.