Parcels - IFIP Working Group 10.3 "Concurrent Systems"


Presentation to IFIP WG10.3 e-Seminar Series:
Critical Factors and Directions for
Petaflops-scale Supercomputers
Thomas Sterling
California Institute of Technology
and
NASA Jet Propulsion Laboratory
January 4, 2005
IBM BG/L: Fastest Computer in the World
Blue Gene/L
71 Teraflops Linpack Performance

IBM BlueGene/L DD2 beta-system:
- Peak performance: 91.75 Tflops
- Linpack performance: 70.72 Tflops
- Based on the IBM 0.7 GHz PowerPC 440
- 2.8 Gflops/processor (peak; 2 processors/ASIC)
- 32,768 processors
- 128 MB/processor DDR, 4 TB system memory
- 3D torus network + combining tree
- 100 Tbytes disk storage
- Power consumption of 500 Kwatts
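The quoted figures are mutually consistent; as a quick arithmetic check (a minimal sketch using only the numbers listed above), the peak rate and total memory follow directly from the per-processor values:

```cpp
#include <cstdio>

// Sanity check of the BG/L DD2 beta-system figures quoted above.
int main() {
    const double gflops_per_proc = 2.8;   // peak Gflops per PowerPC 440 processor
    const int    processors      = 32768;
    const int    mb_per_proc     = 128;   // DDR memory per processor

    double peak_tflops = gflops_per_proc * processors / 1000.0;              // ~91.75 Tflops
    double total_tb    = double(mb_per_proc) * processors / (1024.0 * 1024.0); // 4 TB

    std::printf("Peak: %.2f Tflops, Memory: %.1f TB\n", peak_tflops, total_tb);
    return 0;
}
```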
Where Does Performance Come From?
 Device Technology
  - Logic switching speed and device density
  - Memory capacity and access time
  - Communications bandwidth and latency
 Computer Architecture
  - Instruction issue rate
  - Execution pipelining
  - Reservation stations
  - Branch prediction
  - Cache management
  - Parallelism
 Parallelism – number of operations per cycle per processor
  - Instruction level parallelism (ILP)
  - Vector processing
 Parallelism – number of processors per node
 Parallelism – number of nodes in a system
A Growth-Factor of a Billion in Performance in a Single Lifetime

[Timeline figure: machine performance milestones from one OPS through KiloOPS, MegaOPS, GigaOPS, and TeraOPS toward PetaOPS (10^3, 10^6, 10^9, 10^12, 10^15 OPS): 1823 Babbage Difference Engine; 1943 Harvard Mark 1; 1949 Edsac; 1951 Univac 1; 1959 IBM 7094; 1964 CDC 6600; 1976 Cray 1; 1982 Cray XMP; 1988 Cray YMP; 1991 Intel Delta; 1996 T3E; 1997 ASCI Red; 2001 Earth Simulator; 2003 Cray X1]
Moore’s Law – an opportunity missed
Microprocessor Clock Speed
Classes of Architecture for High Performance Computers
 Parallel Vector Processors (PVP)
  - NEC Earth Simulator, SX-6
  - Cray-1, 2, XMP, YMP, C90, T90, X1
  - Fujitsu 5000 series
 Massively Parallel Processors (MPP)
  - Intel Touchstone Delta & Paragon
  - TMC CM-5
  - IBM SP-2 & 3, Blue Gene/Light
  - Cray T3D, T3E, Red Storm/Strider
 Distributed Shared Memory (DSM)
  - SGI Origin
  - HP Superdome
 Single Instruction stream Multiple Data stream (SIMD)
  - Goodyear MPP, MasPar 1 & 2, TMC CM-2
 Commodity Clusters
  - Beowulf-class PC/Linux clusters
  - Constellations
  - HP Compaq SC, Linux NetworX MCR
Beowulf Project
 Wiglaf (1994)
  - 16 Intel 80486, 100 MHz
  - VESA Local bus
  - 256 Mbytes memory
  - 6.4 Gbytes of disk
  - Dual 10 base-T Ethernet
  - 72 Mflops sustained
  - $40K
 Hrothgar (1995)
  - 16 Intel Pentium, 100 MHz
  - PCI
  - 1 Gbyte memory
  - 6.4 Gbytes of disk
  - 100 base-T Fast Ethernet (hub)
  - 240 Mflops sustained
  - $46K
 Hyglac (1996, Caltech)
  - 16 Pentium Pro, 200 MHz
  - PCI
  - 2 Gbytes memory
  - 49.6 Gbytes of disk
  - 100 base-T Fast Ethernet (switch)
  - 1.25 Gflops sustained
  - $50K
HPC Paths
Why Fast Machines Run Slow
 Latency
  - Waiting for access to memory or other parts of the system
 Overhead
  - Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
 Starvation
  - Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
 Contention
  - Delays due to fighting over which task gets to use a shared resource next; network bandwidth is a major constraint
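As a rough illustration (not from the talk, and with made-up parameter values), these four factors can be folded into a simple execution-time model: useful work divided by the parallelism actually kept busy, inflated by overhead, exposed latency, and contention delays:

```cpp
#include <cstdio>

// Toy model of the four degradation sources above (illustrative numbers only).
int main() {
    const double work_cycles      = 1.0e9;  // useful computation, in cycles
    const double peak_parallelism = 1024;   // hardware lanes/processors available
    const double usable_fraction  = 0.6;    // starvation: only 60% of lanes kept busy
    const double overhead_frac    = 0.15;   // extra cycles spent managing parallelism
    const double exposed_latency  = 4.0e5;  // memory/communication stalls not hidden
    const double contention_delay = 2.0e5;  // waiting for shared network/memory banks

    double busy  = work_cycles / (peak_parallelism * usable_fraction); // starvation
    double time  = busy * (1.0 + overhead_frac)                        // overhead
                 + exposed_latency + contention_delay;                 // latency + contention
    double ideal = work_cycles / peak_parallelism;

    std::printf("efficiency = %.1f%%\n", 100.0 * ideal / time);
    return 0;
}
```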
The SIA CMOS Roadmap

[Figure: SIA roadmap projections by year of technology availability (1997–2012): MB per DRAM chip, logic transistors per chip (millions), and microprocessor clock (MHz), plotted on a log scale from 1 to 100,000]
Latency in a Single System

[Figure: CPU clock period (ns), memory system access time (ns), and the memory-to-CPU ratio, 1997–2009; the ratio grows toward 500:1 – "THE WALL"]
Microprocessors no longer realize the full potential of VLSI technology

[Figure: performance (ps/instruction) versus its linear extrapolation, 1980–2020; the gap widens from 30:1 to 1,000:1 to a projected 30,000:1]
Opportunities for Future Custom MPP Architectures for Petaflops Computing
 ALU proliferation
  - Accepting lower per-ALU utilization improves performance and flops/$
  - Streaming (e.g. Bill Dally)
 Overhead mechanisms supported in hardware
  - ISA for atomic compound operations on complex data
  - Synchronization
  - Communications
  - Reconfigurable logic
 Processor in Memory (PIM)
  - 100X memory bandwidth
  - Supports low/no temporal locality execution
 Latency hiding
  - Multithreading
  - Parcel driven transaction processing
  - Percolation prestaging
High Productivity Computing Systems
Goal:
 Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2009–2010)
Impact:
 Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
 Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
 Portability (transparency): insulate research and operational application software from the system
 Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, & programming errors
HPCS Program Focus Areas
Applications:
 Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
Fill the Critical Technology and Capability Gap:
 Today (late 80's HPC technology) ... to ... Future (Quantum/Bio Computing)
Cray Cascade High Productivity Petaflops-scale Computer - 2010
 DARPA High Productivity Computing Systems Program
 Deliver sustained Petaflops performance by 2010
 Aggressively attacks causes of performance degradation
  - Reduces contention through a high bandwidth network
  - Latency hiding by vectors, multithreading, parcel driven computation, and processor in memory
  - Low overhead through efficient remote memory access and thread creation, PIM acquiring overhead tasks from main processors, and hardware support for communications
  - Starvation lowered by exposing fine grain data parallelism
 Greatly simplifies user programming
  - Distributed shared memory
  - Hierarchical multithreaded execution model
  - Low performance penalties for distributed execution
  - Hardware support for performance tuning and correctness debugging
Cascade Architecture (logical view)

[Block diagram: multiple locales, each containing DRAM with lightweight processors (LPC) and a heavyweight processor (HWP) with cache, connected through network routers to the interconnection network and to RAID / TCP/IP I/O nodes. Annotations:]
 Programming Environment: mixed UMA/NUMA programming model; high productivity programming language
 Operating System: highly robust; highly scalable; global file system
 HWP: clustered vectors; coarse-grained multithreading; compiler assisted cache
 LWP: highly concurrent scalar; fine-grained multithreading; remote thread creation
 Interconnection Network: high bandwidth, low latency; high radix routers
 System Technology: opto-electrical interconnect; cooling
Processor in Memory (PIM)
 PIM merges logic with memory
  - Wide ALUs next to the row buffer
  - Optimized for memory throughput, not ALU utilization
 PIM has the potential of riding Moore's law while
  - greatly increasing effective memory bandwidth,
  - providing many more concurrent execution threads,
  - reducing latency,
  - reducing power, and
  - increasing overall system efficiency
 It may also simplify programming and system design

[Diagram: node logic (decode plus wide ALUs) placed among memory stacks and their sense amps]
Why is PIM Inevitable?
 Separation between memory and logic is artificial
  - von Neumann bottleneck
  - Imposed by technology limitations
  - Not a desirable property of computer architecture
 Technology now brings down the barrier
  - We didn't do it because we couldn't do it
  - We can do it, so we will do it
 What to do with a billion transistors
  - Complexity cannot be extended indefinitely
  - Synthesis of simple elements through replication
  - Means to fault tolerance, lower power
 Normalize memory touch time through bandwidth that scales with capacity (see the sketch below)
  - Without it, it takes ever longer to touch each memory block
 Will be a mass-market commodity commercial market
  - Drivers outside of the HPC thrust
  - Cousin to embedded computing
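A back-of-the-envelope illustration of the memory-touch-time point (the numbers here are assumptions for illustration, not from the talk): if capacity grows while bandwidth stays fixed, the time to sweep all of memory grows without bound, whereas bandwidth that scales with capacity, as in PIM, keeps it constant:

```cpp
#include <cstdio>

// Time to touch (sweep) all of memory = capacity / bandwidth.
// Assumed, illustrative numbers only.
int main() {
    double capacity_gb   = 4096.0;  // e.g. 4 TB of DRAM
    double bandwidth_gbs = 100.0;   // fixed aggregate bandwidth of a conventional memory system

    std::printf("Conventional: %.0f s to touch all memory\n", capacity_gb / bandwidth_gbs);

    // PIM case: bandwidth scales with capacity (say ~1 GB/s of row-buffer
    // bandwidth per GB of memory), so the sweep time stays flat as capacity grows.
    double pim_bw_per_gb = 1.0;
    std::printf("PIM:          %.0f s regardless of capacity\n", 1.0 / pim_bw_per_gb);
    return 0;
}
```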
Roles for PIM
 Perform in-place operations on zero-reuse data
 Exploit high-degree data parallelism
 Rapid updates on contiguous data blocks
 Rapid associative searches through contiguous data blocks
 Gather-scatters (see the sketch after this list)
 Tree/graph walking
 Enables efficient and concurrent array transpose
 Permits fine grain manipulation of sparse and irregular data structures
 Parallel prefix operations
 In-memory data movement
 Memory management overhead work
 Engage in prestaging of data for HWT processors
 Fault monitoring, detection, and cleanup
 Manage 3/2 memory layer
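As an illustration of the first few roles (a minimal sketch, not PIM code; plain sequential C++ standing in for what wide ALUs at the row buffer would do in place), a gather through an index vector followed by a prefix sum over a contiguous block touches each element once and reuses nothing, which is exactly the zero-temporal-locality pattern described above:

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Zero-reuse, data-parallel kernels of the kind PIM would run in place:
// a gather through an index vector, then an inclusive prefix sum (scan).
int main() {
    std::vector<double> memory_block = {4, 8, 15, 16, 23, 42};
    std::vector<std::size_t> indices = {5, 0, 3};   // irregular access pattern

    // Gather: each source element is read exactly once, with no temporal reuse.
    std::vector<double> gathered;
    for (std::size_t i : indices) gathered.push_back(memory_block[i]);

    // Inclusive prefix sum over a contiguous block (a classic scan).
    std::vector<double> prefix(memory_block.size());
    std::partial_sum(memory_block.begin(), memory_block.end(), prefix.begin());

    std::printf("gathered: %g %g %g, last prefix: %g\n",
                gathered[0], gathered[1], gathered[2], prefix.back());
    return 0;
}
```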
Strategic Concepts of the MIND Architecture
 Virtual to physical address translation in memory
  - Global distributed shared memory through a distributed directory table
  - Dynamic page migration
  - Wide registers serve as a context sensitive TLB
 Multithreaded control
  - Unified dynamic mechanism for resource management
  - Latency hiding
  - Real time response
 Parcel active message driven computing
  - Decoupled split-transaction execution
  - System wide latency hiding
  - Move work to data instead of data to work
 Caching of external DRAM
MIND Node

[Block diagram: memory stacks with a memory address buffer and sense amps & row buffer feed a permutation network, wide multi-word ALU, and wide register bank; supporting units include multithreading execution control, memory controller, on-chip interface, parcel handler, and parcel interface]
Microprocessor with PIMs

[Diagram: a microprocessor (cache, ALU, control registers) attached to PIM nodes 1..N, each containing a PIM processor and memory, annotated with the timing parameters below]

Metrics:
 W = total work = WH + WL
 %WH = percent heavyweight work
 %WL = percent lightweight work
 THcycle = heavyweight cycle time
 TLcycle = lightweight cycle time
 TMH = heavyweight memory access time
 TCH = heavyweight cache access time
 TML = lightweight memory access time
 Pmiss = heavyweight cache miss rate
 mix_l/s = instruction mix for load and store ops
Threads Timeline

[Timeline figure: a heavyweight thread running on the heavyweight-thread processor spawns hardware threads over time, each of which fans out into swarms of lightweight threads on the lightweight-thread (PIM) processors]
Simulation of Performance Gain

[Plot: simulated performance gain (log scale, 1 to 1000) versus PIM workload fraction (0.0 to 1.0), for 1, 2, 4, 8, 16, 32, and 64 PIM nodes]
Simulation of PIM Execution Time

[Plot: time to execution (0 to 1.6e9) versus number of PIM nodes (1 to 64), for lightweight-thread (LWT) workload fractions from 0% ("No LWT Work") to 100% in 10% steps]
Analytical Expression for Relative Execution Time

Time_relative = 1 - %WL * (1 - (1/N) * (TLcycle + mix_l/s*(TML - TLcycle)) / (1 + mix_l/s*(TCH + (1 - Pmiss)*TMH)))

Let NB = (TLcycle + mix_l/s*(TML - TLcycle)) / (1 + mix_l/s*(TCH + (1 - Pmiss)*TMH));

then Time_relative = 1 - %WL * (1 - NB/N),

and the parameters %WL, N, and NB are independent.
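A small numerical sketch of the simplified form above (the parameter values are arbitrary assumptions, chosen only to show the shape of the curves in the surrounding plots): as N grows, Time_relative falls from 1 toward 1 - %WL, and offloading pays off only once N exceeds NB:

```cpp
#include <cstdio>
#include <initializer_list>

// Relative execution time versus the PIM node count N,
// using the simplified form Time_rel = 1 - %WL * (1 - NB / N).
double time_relative(double pct_wl, double nb, double n) {
    return 1.0 - pct_wl * (1.0 - nb / n);
}

int main() {
    const double nb = 4.0;                           // assumed break-even node count NB
    for (double pct_wl : {0.0, 0.5, 0.9}) {          // lightweight work fraction %WL
        for (int n : {1, 2, 4, 8, 16, 32, 64}) {     // number of PIM nodes
            std::printf("%%WL=%.1f N=%2d  Time_rel=%.2f\n",
                        pct_wl, n, time_relative(pct_wl, nb, n));
        }
    }
    return 0;
}
```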
Effect of PIM on Execution Time with Normalized Runtime

[Plot: relative time to execution (0 to 3.5) versus number of PIM nodes (1 to 100, log scale), for lightweight work fractions %WL from 0% to 100% in 10% steps]
Parcels
 Parcels
  - Enable lightweight communication between LWPs or between HWP and LWP
  - Contribute to system-wide latency management
  - Support split-transaction message-driven computing
  - Low overhead for efficient communication
  - Implementation of remote thread creation (rtc)
  - Implementation of remote memory references

Parcel format: Wrapper | Destination | Action | Payload | CRC
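A minimal sketch of what a parcel and its handler might look like (illustrative C++, not the MIND hardware format; the field widths, action encoding, and handler dispatch are assumptions):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative parcel layout mirroring the fields named above:
// wrapper | destination | action | payload | CRC.
struct Parcel {
    uint32_t wrapper;              // framing/routing envelope
    uint64_t destination;          // global address or locale ID
    uint16_t action;               // which method to run at the destination
    std::vector<uint8_t> payload;  // operands / continuation data
    uint32_t crc;                  // integrity check
};

// Parcel-driven execution: the arrival of a parcel invokes the named action
// against local memory, instead of a remote processor pulling the data to itself.
void handle_parcel(const Parcel& p) {
    switch (p.action) {
        case 0: /* remote memory reference: reply with a return parcel */ break;
        case 1: /* remote thread create (rtc): allocate a thread frame  */ break;
        default: std::printf("unknown action %u\n", unsigned(p.action)); break;
    }
}

int main() {
    Parcel p{0xBEEF, /*destination=*/42, /*action=*/1, {1, 2, 3}, /*crc=*/0};
    handle_parcel(p);   // at the destination node this would run in the memory
    return 0;
}
```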
Parcels for remote threads

[Diagram: a remote-thread-create parcel (destination, action, payload) travels to the destination locale, where the target action code is selected from the methods of the destination and a remote thread is instantiated against thread frames and the target operand data; a return parcel (destination, action, payload) carries results back to the source locale]
Parcel Simulation Latency Hiding Experiment

[Diagram: control experiment – process-driven nodes (ALU plus local memory) issue remote memory requests over a flat network; test experiment – parcel-driven nodes service input parcels against local memory and emit output parcels and remote memory requests]
Latency Hiding with Parcels
with respect to System Diameter in cycles

[Plot: sensitivity to remote latency and remote access fraction, 16 nodes – ratio of total transactional work done to total process work done (0.1 to 1000, log scale) versus remote memory latency in cycles (64 to 16384); curve families for remote access fractions of 1/4%, 1/2%, 1%, 2%, and 4%, with degree of parallelism (pending parcels per node at t=0) of 1, 2, 4, 16, 64, and 256 shown in red]
Latency Hiding with Parcels
Idle Time with respect to Degree of Parallelism

[Plot: idle time per node in cycles (0 to 8e5) versus parallelism level (parcels per node at time=0), for node counts from 1 to 256 shown in black, comparing process-driven and transaction (parcel-driven) execution]
Multithreading in PIMS
 MIND must respond asynchronously to service requests from multiple sources
 Parcel-driven computing requires rapid response to incident packets
 Hardware supports multitasking for multiple concurrent method instantiations
 High memory bandwidth utilization by overlapping computation with access ops
 Manages shared on-chip resources
 Provides fine-grain context switching
 Latency hiding
Parcels, Multithreading, and Multiport Memory

[Plot: latency hiding, process vs. transaction model (16 nodes, network latency = 1024 cycles, 1% of actual memory accesses remote) – ratio of total transactional work to total process work (0 to 10) versus parallelism level (parcels per node at time=0, 1 to 256); curves for no multithreading, multithreading in the transaction model, and dual-port memory plus multithreading in the transaction model]
MPI – The Failed Success
 A 10 year odyssey
 Community wide standard for parallel programming
 A proven "natural" model for distributed fragmented memory class systems
 User responsible for locality management
 User responsible for minimizing overhead
 User responsible for resource allocation
 User responsible for exposing parallelism
 Relies on ILP and OpenMP for more parallelism
 Mediocre scaling: demands problem size expansion for greater performance
 We are now constrained to legacy MPI codes
What is required
 Global name spaces, for both data and active tasks
 Rich parallelism semantics and granularity
  - Diversity of forms
  - Tremendous increase in amount
 Support for sparse data parallelism
 Latency hiding
 Low overhead mechanisms
  - Synchronization
  - Scheduling
 Affinity semantics
 Do not rely on:
  - Direct control of hardware mechanisms
  - Direct management and allocation of hardware resources
  - Direct choreographing of physical data and task locality
ParalleX: a Parallel Programming Model
 Exposes parallelism in diverse forms and granularities
  - Greatly increases available parallelism for speedup
  - Matches more algorithms
  - Exploits intrinsic parallelism of sparse data
 Exploits split transaction processing (see the sketch after this list)
  - Decouples computation and communication
  - Moves work to data, not just data to work
 Intrinsics for latency hiding
  - Multithreading
  - Message driven computation
 Efficient, lightweight synchronization with low overhead
  - Register synchronization
  - Futures hardware support
  - Lightweight objects
  - Fine grain mutual exclusion
 Provides for global data and task name spaces
  - Efficient remote memory accesses (e.g. shmem)
  - Lightweight atomic memory operations
 Affinity attribute specifiers
  - Automatic locality management
  - Rapid load balancing
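ParalleX is a model rather than an API, so the following is only a generic illustration (standard C++ with std::async/std::future, which are not ParalleX constructs) of the split-transaction, futures-based style the bullets above describe: the caller launches remote-style work, keeps computing, and synchronizes only when the value is actually needed:

```cpp
#include <cstdio>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for work performed "where the data lives" (e.g. a remote locale).
double sum_remote_block(const std::vector<double>& block) {
    return std::accumulate(block.begin(), block.end(), 0.0);
}

int main() {
    std::vector<double> block(1'000'000, 1.0);

    // Split transaction: issue the request and immediately continue.
    std::future<double> pending =
        std::async(std::launch::async, sum_remote_block, std::cref(block));

    // Useful local work overlaps the outstanding "remote" operation,
    // hiding its latency instead of blocking on it.
    double local = 0.0;
    for (int i = 0; i < 1000; ++i) local += i;

    // The future is the synchronization point, reached only when the value is needed.
    std::printf("remote sum = %.0f, local = %.0f\n", pending.get(), local);
    return 0;
}
```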
Agincourt: A Latency Tolerant Parallel Programming Language
 Bridges the cluster gap
  - Eliminates constraints of the message passing model
  - Reduces the need for inefficient global barrier synchronization
  - Mitigates the local to remote access time disparity
  - Removes the OS from the critical path
 Greatly simplifies programming
  - Global single system image
  - Manipulates sparse, irregular, time-varying meta data
  - Facilitates dynamic adaptive applications
 Dramatic performance advantage
  - Lower overhead
  - Latency hiding
  - Load balancing
This could be a very bad idea
 New languages almost always fail
 Fancy-assed languages usually do not match the needs of system hardware
 Compilers take forever to bring to maturity
 People, quite reasonably, like what they do; they don't want to change
 People feel threatened by others who want to impose silly, naive, expensive, impractical, unilateral ideas
 Acceptance is a big issue
 And then there's the legacy problem
Real-World Practical Petaflops Computer Systems
 Sustained Petaflops performance on a wide range of applications
 Full Peta-scale resources of a Petaflops computer routinely allocated to real world users, not just for demos before SCXY
 Many Petaflops computers available throughout the nation, not just a couple at National Laboratories
  - Size, power, cooling, and cost not prohibitive
 Programming is tractable, so that a scientist can use it and not change professions in the process
1 Petaflops is only the beginning
An extrapolation of the Linpack Top-500 list of supercomputers, where N=1 is the fastest machine and Sum is the aggregate performance of all 500.

[Plot: Top-500 extrapolation, 1993–2023 – Sum, N=1, and N=500 curves on a log scale from 100 Mflops to 10 Eflops. Courtesy of Thomas Sterling]