Parcels - IFIP Working Group 10.3 "Concurrent Systems"
Presentation to IFIP WG10.3 e-Seminar Series:
Critical Factors and Directions for
Petaflops-scale Supercomputers
Thomas Sterling
California Institute of Technology
and
NASA Jet Propulsion Laboratory
January 4, 2005
IBM BG/L: Fastest Computer in the World
Blue Gene/L
71 Teraflops Linpack Performance
IBM BlueGene/L DD2 beta-System
Peak Performance: 91.75 Tflops
Linpack Performance: 70.72 Tflops
Based on the IBM 0.7 GHz PowerPC 440
2.8 Gflops/processor (peak; 2 processors per ASIC)
32768 processors
128 MB/processor DDR, 4 TB system memory
3D Torus network + combining tree
100 Tbytes disk storage
Power consumption of 500 kW
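A quick consistency check of these figures (the 4 flops/cycle assumes the PowerPC 440's paired fused multiply-add floating-point pipelines):

  0.7 GHz × 4 flops/cycle = 2.8 Gflops per processor (peak)
  32,768 processors × 2.8 Gflops = 91.75 Tflops peak
  70.72 / 91.75 ≈ 77% Linpack efficiency
  32,768 × 128 MB = 4 TB aggregate memory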
Where Does Performance Come From?
Device Technology
  Logic switching speed and device density
  Memory capacity and access time
  Communications bandwidth and latency
Computer Architecture
  Instruction issue rate
  Execution pipelining
  Reservation stations
  Branch prediction
  Cache management
  Parallelism
Parallelism – number of operations per cycle per processor
  Instruction level parallelism (ILP)
  Vector processing
Parallelism – number of processors per node
Parallelism – number of nodes in a system
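These levels multiply together into peak performance. A minimal sketch of that decomposition in Python; the numbers are illustrative, roughly Blue Gene/L-like assumptions, not a statement about any particular machine:

```python
# Peak performance as the product of the parallelism levels listed above.
# All values here are illustrative assumptions.

clock_hz            = 0.7e9    # device technology: logic switching speed
ops_per_cycle       = 4        # per-processor parallelism (ILP / vector / FMA)
processors_per_node = 2        # parallelism within a node
nodes               = 16384    # parallelism across the system

peak_flops = clock_hz * ops_per_cycle * processors_per_node * nodes
print(f"peak = {peak_flops / 1e12:.2f} Tflops")   # 91.75 Tflops
```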
A Growth-Factor of a Billion in Performance in a Single Lifetime
[Timeline chart: a billionfold growth in sustained operations per second, from the Babbage Difference Engine (1823) and Harvard Mark 1 (1943) through Edsac (1949, one OPS), Univac 1 (1951), IBM 7094 (1959), CDC 6600 (1964), Cray 1 (1976), Cray XMP (1982), Cray YMP (1988), Intel Delta (1991), T3E (1996), ASCI Red (1997), Earth Simulator (2001), and Cray X1 (2003), spanning KiloOPS, MegaOPS, GigaOPS, and TeraOPS toward PetaOPS.]
Moore's Law – an opportunity missed
Microprocessor Clock Speed
Classes of Architecture for High Performance Computers
Parallel Vector Processors (PVP)
  Cray-1, 2, XMP, YMP, C90, T90, X1
  NEC Earth Simulator, SX-6
  Fujitsu 5000 series
Massively Parallel Processors (MPP)
  Intel Touchstone Delta & Paragon
  TMC CM-5
  IBM SP-2 & 3, Blue Gene/Light
  Cray T3D, T3E, Red Storm/Strider
Single Instruction stream Multiple Data stream (SIMD)
  Goodyear MPP, MasPar 1 & 2, TMC CM-2
Distributed Shared Memory (DSM)
  SGI Origin
  HP Superdome
Commodity Clusters
  Beowulf-class PC/Linux clusters
Constellations
  HP Compaq SC, Linux NetworX MCR
Beowulf Project
Wiglaf - 1994
  16 Intel 80486, 100 MHz
  VESA Local bus
  256 Mbytes memory
  6.4 Gbytes of disk
  Dual 10 base-T Ethernet
  72 Mflops sustained
  $40K

Hrothgar - 1995
  16 Intel Pentium, 100 MHz
  PCI
  1 Gbyte memory
  6.4 Gbytes of disk
  100 base-T Fast Ethernet (hub)
  240 Mflops sustained
  $46K

Hyglac - 1996 (Caltech)
  16 Pentium Pro, 200 MHz
  PCI
  2 Gbytes memory
  49.6 Gbytes of disk
  100 base-T Fast Ethernet (switch)
  1.25 Gflops sustained
  $50K
HPC Paths
Why Fast Machines Run Slow
Latency
  Waiting for access to memory or other parts of the system
Overhead
  Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
Starvation
  Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
Contention
  Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.
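A toy model of how these four factors erode peak into sustained performance; the multiplicative composition and the loss fractions below are illustrative assumptions of this sketch, not measured values:

```python
# Toy efficiency model for the four degradation sources above (illustrative only).

peak_flops = 91.75e12   # assumed machine peak
losses = {
    "latency":    0.30,  # fraction of time stalled waiting on memory or network
    "overhead":   0.15,  # fraction of time managing concurrency and resources
    "starvation": 0.20,  # fraction of resources idle for lack of parallel work
    "contention": 0.10,  # fraction of time lost arbitrating shared resources
}

sustained = peak_flops
for loss in losses.values():
    sustained *= 1.0 - loss

print(f"sustained ~ {sustained / 1e12:.1f} Tflops "
      f"({100 * sustained / peak_flops:.0f}% of peak)")
```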
The SIA CMOS Roadmap
[Chart: SIA roadmap projections of MB per DRAM chip, logic transistors per chip (millions), and microprocessor clock (MHz) versus year of technology availability, 1997 through 2012, on a logarithmic scale from 1 to 100,000.]
Latency in a Single System
[Chart: CPU clock period (ns) and memory system access time versus year, 1997 through 2009, together with their ratio climbing from roughly 100 toward 500: the widening processor-memory gap labeled THE WALL.]
Microprocessors no longer realize the full potential of VLSI technology
[Chart: delivered microprocessor performance (ps/instruction) versus the linear VLSI scaling trend, 1980 through 2020; the gap widens from 30:1 to 1,000:1, heading toward 30,000:1.]
Opportunities for Future Custom MPP Architectures for Petaflops Computing
ALU proliferation
  Lower ALU utilization improves performance & flops/$
  Streaming (e.g. Bill Dally)
Overhead mechanisms supported in hardware
  ISA for atomic compound operations on complex data
  Synchronization
  Communications
  Reconfigurable logic
Processor in Memory (PIM)
  100X memory bandwidth
  Supports low/no temporal locality execution
Latency hiding
  Multithreading
  Parcel driven transaction processing
  Percolation prestaging
High Productivity Computing Systems
Goal:
  Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2009 – 2010)
Impact:
  Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
  Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
  Portability (transparency): insulate research and operational application software from the system
  Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, & programming errors
HPCS Program Focus Areas
Applications:
  Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology
Fill the Critical Technology and Capability Gap
  Today (late 80's HPC technology) ... to ... Future (Quantum/Bio Computing)
Cray Cascade High Productivity Petaflops-scale Computer - 2010
DARPA High Productivity Computing Systems Program
Deliver sustained Petaflops performance by 2010
Aggressively attacks causes of performance degradation
  Reduces contention through high bandwidth network
  Latency hiding by vectors, multithreading, parcel driven computation, and processor in memory
  Low overhead with efficient remote memory access and thread creation, PIM acquiring overhead tasks from main processors, and hardware support for communications
  Starvation lowered by exposing fine grain data parallelism
Greatly simplifies user programming
  Distributed shared memory
  Hierarchical multithreaded execution model
  Low performance penalties for distributed execution
  Hardware support for performance tuning and correctness debugging
Cascade Architecture (logical view)
[Block diagram: locales, each containing DRAM banks with lightweight processors (LPC) and a heavyweight processor (HWP) with cache, connected through network routers to the interconnect and to I/O nodes providing RAID, TCP/IP, and graphics.]

Programming Environment
  • Mixed UMA/NUMA programming model
  • High productivity programming language
Operating System
  • Highly robust
  • Highly scalable
  • Global file system
HWP
  • Clustered vectors
  • Coarse-grained multithreading
  • Compiler assisted cache
LWP
  • Highly concurrent scalar
  • Fine-grained multithreading
  • Remote thread creation
Interconnection Network
  • High bandwidth, low latency
  • High radix routers
System Technology
  • Opto-electrical interconnect
  • Cooling
Processor in Memory (PIM)
PIM merges logic with memory
  Wide ALUs next to the row buffer
  Optimized for memory throughput, not ALU utilization
PIM has the potential of riding Moore's law while
  greatly increasing effective memory bandwidth,
  providing many more concurrent execution threads,
  reducing latency,
  reducing power, and
  increasing overall system efficiency
It may also simplify programming and system design

[Diagram: a PIM node with node logic and decode surrounded by memory stacks, each accessed through its sense amps.]
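A rough sketch of why wide ALUs beside the row buffer raise effective bandwidth: an in-place update touches an entire open row without shipping it across the processor-memory pins. The row width, bus width, and parcel size below are assumptions chosen only for illustration:

```python
# Illustrative comparison of off-chip traffic for updating one DRAM row.

ROW_BYTES = 2048    # assumed row held in the sense amps / row buffer
BUS_BYTES = 16      # assumed conventional processor-memory bus width per transfer

# Conventional processor: read the row across the pins, update it, write it back.
cpu_bytes_moved   = 2 * ROW_BYTES
cpu_bus_transfers = cpu_bytes_moved // BUS_BYTES

# PIM: the wide ALU updates the row where it sits; only a command parcel arrives.
pim_bytes_moved = 64    # assumed parcel size

print(cpu_bytes_moved, cpu_bus_transfers, pim_bytes_moved)   # 4096 256 64
```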
Why is PIM Inevitable?
Separation between memory and logic is artificial
  von Neumann bottleneck
  Imposed by technology limitations
  Not a desirable property of computer architecture
Technology now brings down the barrier
  We didn't do it because we couldn't do it
  We can do it, so we will do it
What to do with a billion transistors
  Complexity cannot be extended indefinitely
  Synthesis of simple elements through replication
  Means to fault tolerance, lower power
Normalize memory touch time by scaling bandwidth with capacity
  Without it, it takes ever longer to look at each memory block
Will be a mass market commercial commodity
  Drivers outside of HPC thrust
  Cousin to embedded computing
Roles for PIM
Perform in-place operations on zero-reuse data
Exploit high degree data parallelism
Rapid updates on contiguous data blocks
Rapid associative searches through contiguous data blocks
Gather-scatters
Tree/graph walking
Enables efficient and concurrent array transpose
Permits fine grain manipulation of sparse and irregular data structures
Parallel prefix operations
In-memory data movement
Memory management overhead work
Engage in prestaging of data for HWT processors
Fault monitoring, detection, and cleanup
Manage 3/2 memory layer
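As one concrete example of these roles, a parallel prefix (scan) over data spread across PIM nodes can run almost entirely in place: each node scans its own block, only the per-block totals travel, and small offsets come back. The block layout below is an invented stand-in for PIM-resident memory:

```python
# Sketch of a distributed prefix sum over blocks held "in" separate PIM nodes.

from itertools import accumulate

blocks = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]            # one block per PIM node

local_scans = [list(accumulate(b)) for b in blocks]   # in-place work at each node
totals      = [scan[-1] for scan in local_scans]      # one value per node travels
offsets     = [0] + list(accumulate(totals))[:-1]     # tiny global scan of totals

result = [[x + off for x in scan] for scan, off in zip(local_scans, offsets)]
print(result)   # [[3, 4, 8], [9, 14, 23], [25, 31, 36]]
```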
Strategic Concepts of the MIND Architecture
Virtual to physical address translation in memory
  Global distributed shared memory thru distributed directory table
  Dynamic page migration
  Wide registers serve as context sensitive TLB
Multithreaded control
  Unified dynamic mechanism for resource management
  Latency hiding
  Real time response
Parcel active message driven computing
  Decoupled split-transaction execution
  System wide latency hiding
  Move work to data instead of data to work
Caching of external DRAM
MIND Node

[Block diagram: a MIND node couples a memory stack (memory address buffer, sense amps & row buffer) through a permutation network to a wide multi-word ALU and wide register bank under multithreading execution control, with a memory controller, an on-chip interface, and a parcel handler attached to the parcel interface.]
Microprocessor with PIMs
[Diagram: a microprocessor (ALU, cache, control registers) attached to PIM nodes 1 through N, each pairing a PIM processor with its memory.]

Metrics
  W: total work (heavyweight work WH plus lightweight work WL)
  %WH: percent heavyweight work
  %WL: percent lightweight work
  THcycle: heavyweight cycle time
  TLcycle: lightweight cycle time
  TMH: heavyweight memory access time
  TCH: heavyweight cache access time
  TML: lightweight memory access time
  Pmiss: heavyweight cache miss rate
  mixl/s: instruction mix for load and store ops
Threads Timeline
[Timeline diagram: the heavyweight processor executes a sequence of heavyweight (HW) threads over time, while the lightweight thread processors concurrently execute many short lightweight (LW) threads spawned between them.]
Simulation of Performance Gain
[Chart: simulated performance gain (log scale, up to about 1000x) versus the fraction of the workload assigned to PIM (0.0 to 1.0), for 1, 2, 4, 8, 16, 32, and 64 PIM nodes.]
Simulation of PIM Execution Time
[Chart: simulated time to execution versus number of PIM nodes (1 to 64), for lightweight-thread (LWT) workload fractions from 0% to 100%.]
Analytical Expression for Relative Execution Time
\[
\mathit{Time}_{relative} \;=\; 1 \;-\; \%W_L \left( 1 \;-\; \frac{1}{N}\cdot \frac{T_{Lcycle} + \mathit{mix}_{l/s}\, T_{ML}}{T_{Hcycle} + \mathit{mix}_{l/s}\,( T_{CH} + P_{miss}\, T_{MH})} \right)
\]

let
\[
N_B \;=\; \frac{T_{Lcycle} + \mathit{mix}_{l/s}\, T_{ML}}{T_{Hcycle} + \mathit{mix}_{l/s}\,( T_{CH} + P_{miss}\, T_{MH})}
\]

then
\[
\mathit{Time}_{relative} \;=\; 1 \;-\; \%W_L \left( 1 - \frac{N_B}{N} \right)
\]

and parameters %WL, N, and NB are independent
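A direct transcription of the model as reconstructed above, handy for sanity-checking the simulation curves; the parameter values in the example call are placeholders, not measurements:

```python
# Relative execution time of a heavyweight processor assisted by N PIM nodes.

def relative_time(pct_wl, n_nodes, t_lcycle, t_hcycle, mix_ls,
                  t_ml, t_ch, t_mh, p_miss):
    """Time_relative = 1 - %WL * (1 - NB/N), with NB the LWP/HWP time ratio."""
    nb = (t_lcycle + mix_ls * t_ml) / (t_hcycle + mix_ls * (t_ch + p_miss * t_mh))
    return 1.0 - pct_wl * (1.0 - nb / n_nodes)

# Example: 40% lightweight work, 8 PIM nodes, 30% load/store mix.
print(relative_time(pct_wl=0.4, n_nodes=8, t_lcycle=2.0, t_hcycle=1.0,
                    mix_ls=0.3, t_ml=10.0, t_ch=3.0, t_mh=100.0, p_miss=0.05))
```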
Effect of PIM on Execution Time with Normalized Runtime
[Chart: relative time to execution (normalized) versus number of PIM nodes (1 to 100, log scale), for PIM workload fractions from 0% to 100%.]
Parcels
Enable lightweight communication between LWPs or between HWP and LWP.
Contribute to system-wide latency management
Support split-transaction message-driven computing
Low overhead for efficient communication
Implementation of remote thread creation (rtc).
Implementation of remote memory references.
Parcel wrapper format: | Destination | Action | Payload | CRC |
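A minimal software sketch of such a wrapper; the field names follow the slide, while the types and the use of zlib's CRC-32 are assumptions of this example, not a statement about the actual hardware format:

```python
# Minimal parcel container: destination, action, payload, plus an integrity check.

import zlib
from dataclasses import dataclass

@dataclass
class Parcel:
    destination: int    # target locale or memory address
    action: str         # method or operation to invoke at the destination
    payload: bytes      # operands, arguments, or returned values

    def crc(self) -> int:
        body = f"{self.destination}:{self.action}".encode() + self.payload
        return zlib.crc32(body)

p = Parcel(destination=0x2A, action="remote_thread_create", payload=b"\x01\x02")
print(hex(p.crc()))
```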
Parcels for remote threads
[Diagram: a remote thread create parcel (destination, action, payload) travels from the source locale to the destination locale, where the target action code runs against local data, methods, target operands, and thread frames as a remote thread; a return parcel (destination, action, payload) carries the result back to the source locale.]
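A toy, self-contained illustration of that round trip: a create parcel names a destination locale and an action, a thread runs there against purely local data, and a return parcel carries the value back. The locale and method structures are invented for the example:

```python
# Toy split-transaction remote thread creation driven by parcels (illustrative).

from collections import namedtuple

Parcel = namedtuple("Parcel", "destination action payload")

def handle(parcel, locales):
    """Run the requested action at the destination locale; emit a return parcel."""
    locale = locales[parcel.destination]
    result = locale["methods"][parcel.action](locale["data"])
    return Parcel(destination=parcel.payload["reply_to"],
                  action="return_value", payload={"value": result})

locales = {1: {"data": [5, 7, 11], "methods": {"sum_block": sum}}}

request = Parcel(destination=1, action="sum_block", payload={"reply_to": 0})
reply = handle(request, locales)
print(reply)   # Parcel(destination=0, action='return_value', payload={'value': 23})
```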
Parcel Simulation Latency Hiding Experiment
[Diagram: the control experiment models conventional process-driven nodes issuing remote memory requests across a flat network; the test experiment models parcel-driven nodes. A process-driven node couples an ALU and local memory that block on remote memory requests, while a parcel-driven node services input parcels and emits output parcels alongside its remote memory requests.]
Latency Hiding with Parcels
with respect to System Diameter in cycles
Sensitivity to Remote Latency and Remote Access Fraction
[Chart: ratio of total transactional work done to total process work done (0.1 to 1000, log scale) versus remote memory latency in cycles (64 to 16384), for 16 nodes, remote access fractions of 1/4% to 4%, and degrees of parallelism (pending parcels per node at t=0) from 1 to 256.]
Latency Hiding with Parcels
Idle Time with respect to Degree of Parallelism
[Chart: idle time per node in cycles for the process and transaction (parcel-driven) models, versus parallelism level (parcels per node at t=0, from 1 to 256), for systems of 1 to 256 nodes.]
Multithreading in PIMS
MIND must respond asynchronously to service requests from multiple sources
Parcel-driven computing requires rapid response to incident packets
Hardware supports multitasking for multiple concurrent method instantiations
High memory bandwidth utilization by overlapping computation with access ops
Manages shared on-chip resources
Provides fine-grain context switching
Latency hiding
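A minimal event-loop sketch of this behavior: parcels arriving from several sources are each serviced by a short-lived handler so that one request's memory wait overlaps the servicing of the others. Python threads stand in for the hardware thread contexts; everything here is illustrative:

```python
# Illustrative only: asynchronous, multithreaded servicing of incident parcels.

import queue
import threading
import time

parcels = queue.Queue()
memory = {addr: addr * 2 for addr in range(8)}    # stand-in for local DRAM rows

def handler(parcel):
    time.sleep(0.01)                              # simulated row access latency
    print(parcel["source"], "read", memory[parcel["addr"]])

def mind_node():
    while True:
        parcel = parcels.get()
        if parcel is None:                        # shutdown sentinel
            return
        threading.Thread(target=handler, args=(parcel,)).start()

for source in ("HWP", "neighbor LWP", "parcel network"):
    parcels.put({"source": source, "addr": 5})
parcels.put(None)
mind_node()
```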
Parcels, Multithreading, and Multiport Memory
[Chart: latency hiding, process versus transaction model, for 16 nodes with a network latency of 1024 cycles and 1% of actual memory accesses remote; ratio of total transactional work to total process work (0 to 10) versus parallelism level (parcels per node at t=0, from 1 to 256), comparing no multithreading, multithreading in the transaction model, and dual-port memory with multithreading in the transaction model.]
MPI – The Failed Success
A 10 year odyssey
Community wide standard for parallel programming
A proven "natural" model for distributed fragmented memory class systems
User responsible for locality management
User responsible for minimizing overhead
User responsible for resource allocation
User responsible for exposing parallelism
Relied on ILP and OpenMP for more parallelism
Mediocre scaling: demands problem size expansion for greater performance
We now are constrained to legacy MPI codes
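For contrast, roughly what the model asks of the user, shown with mpi4py standing in for the C bindings: the programmer picks the decomposition, names the ranks, and orchestrates every transfer by hand. The fragment is illustrative, not drawn from the talk:

```python
# Illustrative mpi4py fragment: decomposition, locality, and communication
# are all managed explicitly by the user. Run with: mpiexec -n 4 python demo.py

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = [rank * 10 + i for i in range(4)]   # user-chosen data decomposition

right = (rank + 1) % size                   # user-managed neighbor topology
left  = (rank - 1) % size
halo = comm.sendrecv(local[-1], dest=right, source=left)

print(f"rank {rank}: received halo {halo} from rank {left}")
```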
What is required
Global name spaces; both data and active tasks
Rich parallelism semantics and granularity
  Diversity of forms
  Tremendous increase in amount
  Support for sparse data parallelism
Latency hiding
Low overhead mechanisms
  Synchronization
  Scheduling
Affinity semantics
Do not rely on:
  Direct control of hardware mechanisms
  Direct management and allocation of hardware resources
  Direct choreographing of physical data and task locality
ParalleX: a Parallel Programming Model
Exposes parallelism in diverse forms and granularities
  Greatly increases available parallelism for speedup
  Matches more algorithms
  Exploits intrinsic parallelism of sparse data
Exploits split transaction processing
  Decouples computation and communication
  Moves work to data, not just data to work
Intrinsics for latency hiding
  Multithreading
  Message driven computation
Efficient lightweight synchronization, low overhead
  Register synchronization
  Futures hardware support
  Lightweight objects
  Fine grain mutual exclusion
Provides for global data and task name spaces
  Efficient remote memory accesses (e.g. shmem)
  Lightweight atomic memory operations
  Affinity attribute specifiers
  Automatic locality management
  Rapid load balancing
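ParalleX syntax is not shown in the talk, but the flavor of its futures-based, split-transaction style can be suggested with ordinary Python futures: work is launched, the caller keeps computing, and synchronization happens only where the value is used. This is an analogy, not ParalleX itself:

```python
# Futures-style decoupling of request from use, as an analogy for
# split-transaction, latency-hiding execution (illustrative only).

from concurrent.futures import ThreadPoolExecutor
import time

def remote_lookup(key):
    time.sleep(0.05)                 # stands in for remote-memory latency
    return key * key

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(remote_lookup, k) for k in range(4)]  # fire and continue
    local_work = sum(range(1000))    # useful work overlaps the outstanding requests
    results = [f.result() for f in futures]  # synchronize only at the point of use

print(local_work, results)
```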
Agincourt: A Latency Tolerant Parallel Programming Language
Bridges the cluster gap
  Eliminates constraints of message passing model
  Reduces need for inefficient global barrier synchronization
  Mitigates local to remote access time disparity
  Removes OS from critical path
Greatly simplifies programming
  Global single system image
  Manipulates sparse irregular time varying meta data
  Facilitates dynamic adaptive applications
Dramatic performance advantage
  Lower overhead
  Latency hiding
  Load balancing
This could be a very bad idea
New languages almost always fail
Fancy-assed languages usually do not match needs of system hardware
Compilers take forever to bring to maturity
People, quite reasonably, like what they do; they don't want to change
People feel threatened by others who want to impose silly, naive, expensive, impractical, unilateral ideas
Acceptance is a big issue
And then there's the legacy problem
Real-World Practical Petaflops Computer Systems
Sustained Petaflops performance on a wide range of applications.
Full Peta-scale system resources of a Petaflops computer routinely allocated to real world users, not just for demos before SCXY.
There are many Petaflops computers available throughout the nation, not just at a couple of National Laboratories.
Size, power, cooling, and cost not prohibitive.
Programming is tractable, so that a scientist can use it and not change professions in the process.
1 Petaflops is only the beginning
An extrapolation of the Linpack Top-500 List of supercomputers, where N=1 is the fastest machine and Sum is the aggregate performance of all.
[Chart: the Sum, N=1, and N=500 Top-500 trend lines extrapolated from 1993 to roughly 2023, spanning 100 Mflops through 10 Eflops on a logarithmic scale. Courtesy of Thomas Sterling.]