
Towards Petaflops
Summary
A D Kennedy, D C Heggie, S P Booth
The University of Edinburgh
Summary of “Towards Petaflops” Workshop, 24—28 May 1999
Programme

Monday 24th May: Computational Chemistry
Tuesday 25th May: Particle Physics & Astronomy
Wednesday 26th May: Biological Sciences
Thursday 27th May: Materials, Soft & Hard
Friday 28th May: Meteorology & Fluids

Sessions ran between 09:00 and 17:00 each day, with registration and a welcome from Prof A D Kennedy on the Monday morning, tea/coffee breaks, lunch, and a chaired discussion session each day; the detailed timetable is not reproduced here.

Speakers
Dr N H Christ, Columbia University, USA
Dr J M Levesque, IBM Research, USA
Dr G Ackland, University of Edinburgh
Dr P Burton, UKMO, Bracknell
Prof R Catlow, Royal Institution, London
Prof R d’Inverno, University of Southampton
Dr M F Guest, CLRC, Daresbury
Dr M Payne, Cambridge University
Dr T Arber, University of St Andrews
Prof E De Schutter, University of Antwerp, Belgium
Dr R Tripiccione, INFN-Pisa, Italy
Dr N Topham, University of Edinburgh
Dr D Roweth, QSW, Bristol
Dr A H Nelson, University of Cardiff
Prof M van Heel, Imperial College, London
Prof D C Heggie, University of Edinburgh
Mr M Woodacre, SGI/Cray
Dr M F O’Boyle, University of Edinburgh
Dr L G Pedersen, University of North Carolina, USA
Dr M Wilson, University of Durham
Dr M Ruffert, University of Edinburgh
Dr P Coveney, Queen Mary & Westfield College, London
Dr A R Jenkins, University of Durham
Dr C M Reeves, University of Edinburgh

Discussion session chairs: Prof R Catlow, Prof N Christ, Dr J Bard, Dr T Arber, Dr M Payne
Participants
Dr Graham Ackland
Prof Tony Arber
Dr Jonathan Bard
Dr Stephen Booth
Dr Ken Bowler
Dr Paul Burton
Prof Mike Cates
Prof Richard Catlow
Prof Norman Christ
Prof Peter Coveney
Prof Ray d’Inverno
Prof Erik De Schutter
Dr Paul Durham
Mr Dietland Gerloff
Mr Simon Glover
Dr Bruce Graham
Dr Martyn F Guest
Prof Douglas C Heggie
Dr Suhail A Islam
Dr Adrian R Jenkins
Mr Bruce Jones
Mr Balint Joo
Prof Anthony D Kennedy
Prof Richard Kenway
Dr Crispin Kneble
Mr John M Levesque
Dr Nick Maclaren
Mr Rod McAllister
Dr Avery Meiksin
Dr Alistair Nelson
Dr Mike O'Boyle
Dr John Parkinson
Dr Mike C Payne
Dr Lee G Pedersen
Dr Nilesh Raj
Dr Federico Rapuano
Dr Clive M Reeves
Dr Duncan Roweth
Dr Max Ruffert
Mr Vance Shaffer
Dr Doug Smith
Mr Philip Snowdon
Dr Nigel Topham
Dr Arthur Trew
Dr Raffaele Tripiccione
Prof Marin Van Heel
Mr Claudio Verdozzi
Dr Mark Wilson
Mr Stuart Wilson
Mr Michael Woodacre
Dr Andrea Zavanella
Introduction

Objectives of this Summary
 Summarise areas of general agreement
 Highlight areas of uncertainty or disagreement
 Concentrate on technology, architecture, & organisation
 For details of science which might be done see the slides of
the individual talks
 This summary expresses views & opinions of its authors
– It does not necessarily represent a consensus or majority view...
– … but it tries to do so as far as possible
Devices & Hardware

Silicon CMOS will continue to dominate
 GaAs is still tomorrow’s technology (and always will be?)

Moore’s Law
 Performance increases exponentially
 Doubling time of 18 months
 Will continue for at least 5 years, and probably more
 Trade-offs between density & speed
– Gb DRAM and GHz CPU in O(5 years) ...
– … but not both on the same chip
 Choice between speed & power
 10 transistors per device by 2005, 10 by 2012
 Most cost-effective technology is usually a generation behind
the latest technology
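
A rough worked example of the figures above (added for clarity, not taken from the slides): with an 18-month doubling time the performance ratio after t years is

    R(t) = 2^{t/1.5}, \qquad R(5) = 2^{5/1.5} \approx 10,

i.e. roughly an order of magnitude of improvement over the five-year horizon for which the trend is expected to hold.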
Devices & Hardware

Memory latency will increase
 More levels of cache hierarchy
– Implies a tree-like hierarchy of access speeds
– Not clear how scientific HPC applications map onto this
 Access to local memory is becoming, in relative terms, as slow as
access to a remote processor’s cache
 Understanding of memory architecture required to achieve
optimal performance
– analogous to use of virtual memory

In the fairly near future
 Arithmetic will be almost free
 Pay for memory & communications bandwidth
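
To make the memory-hierarchy point concrete, the following is a minimal C sketch (illustrative only, not from the workshop) of a pointer-chasing probe: every load depends on the previous one, so the measured time per load steps up as the working set spills out of each level of the cache hierarchy and eventually reflects main-memory latency.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        for (size_t n = 1024; n <= ((size_t)1 << 24); n *= 8) {
            size_t *next = malloc(n * sizeof *next);
            if (!next) return 1;

            /* Build a random single-cycle permutation (Sattolo's algorithm)
               so that hardware prefetching cannot hide the latency. */
            for (size_t i = 0; i < n; i++) next[i] = i;
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }

            /* Chase the pointers: each load must wait for the previous one. */
            clock_t t0 = clock();
            size_t p = 0;
            for (long k = 0; k < 10000000L; k++) p = next[p];
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("working set %8zu words: %.1f ns per load (check %zu)\n",
                   n, 1e9 * secs / 1e7, p);
            free(next);
        }
        return 0;
    }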
Devices & Hardware

Technology driven by the mass market
 Commodity parts
 “Intercepting technology”
– systems designed to use technology currently under development
– cost & risk of newest generation v. performance benefit
– “sweet point” on technology curve
 PCs
 Workstations
 DSP
 … not designed for HPC
Devices & Hardware

Level of integration
 HPC vendors will move from board to chip level design
 Cost effective to produce O(10³—10) chips
 Silicon compilers
 Time scale?
Devices & Hardware

Error rates will increase
 Fault tolerance required
 Implications for very large systems?
 Time scale?

Disks & I/O
 Increasing density
 Decreasing cost/bit
 Increasing relative latency
Architecture

Memory addressing
 Flat memory (implicit communications)
– Model that naïve users want
– Does not really exist in hardware
– Dynamic hardware coherency mechanisms seem unlikely to work
well enough in practice
 Distributed memory (explicit communications)
– NUMA
– Protocols
• MPI, OpenMP,…
• SHMEM
– Scientific problems usually have a simple static communications
structure, easily handled by get and put primitives
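
As an illustration of such a static communication structure, here is a minimal sketch in C using MPI's two-sided MPI_Sendrecv (an MPI one-sided put/get or SHMEM version would look much the same): a one-dimensional domain decomposition in which every node exchanges one halo value with each neighbour, with the same pattern repeated on every iteration.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000                       /* interior points per rank */

    int main(int argc, char **argv)
    {
        int rank, size;
        double u[N + 2];                 /* interior plus two halo cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < N + 2; i++) u[i] = rank;   /* dummy local data */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send the rightmost interior cell right while receiving the left
           halo from the left neighbour, and vice versa. */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  1,
                     &u[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d halo values: %g %g\n", rank, u[0], u[N + 1]);
        MPI_Finalize();
        return 0;
    }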
Architecture

Single node performance
 Fat nodes or thin nodes?
– Limited by communication network bandwidth?
– Limited by memory bandwidth (off-chip access)?
– “Sweet point” on technology curve

Single node architecture
 VLIW
 Vectors
 Superscalar
 Multiple CPUs on a chip
Architecture

Communications
 What network topology?
– 2d, 3d, or 4d grid
– Ω network, butterfly, hypercube, fat tree
– Crossbar switch
 Bandwidth
 Latency
– Major problem for coarse-grain machines
 Packet size
– A problem for very fine-grain machines
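
A standard way to quantify these trade-offs (a textbook model, not from the slides) is to write the time to transfer an n-byte message as

    T(n) \approx \lambda + n/\beta,

where \lambda is the latency and \beta the asymptotic bandwidth. Half of the peak bandwidth is reached only for messages of size n_{1/2} = \lambda \beta; for example \lambda = 10 µs and \beta = 100 MB/s give n_{1/2} of about 1 kB, which is why latency dominates when a coarse-grain machine exchanges small messages.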
Architecture

MPP
 Scalable for the right kind of problems
– up to technological limits
 Commercial interconnects
– e.g., from QSW (http://www.quadrics.com/)
 Flexibility v. price/performance
– Custom networks for a few well-understood problems which
require high-end performance (e.g., QCD)
– More general networks for large but more general purpose
machines
Architecture

SMP clusters
 Limited scalability?
 Appears to be what vendors want to sell to us
– IBM, SGI, Compaq,…
– Large market for general purpose SMP machines
– Adding cluster interconnect is cheap
 Unclear whether large-scale scientific problems map onto the
tree-like effective network topology well
 How do we program such machines?
Architecture

PC or workstation clusters
 Beowulf, Avalon, ...
 Cheap, but not tested for large machines
 Communication mechanisms unspecified
 “Farms” of PCs are a very cost-effective way to provide large capacity

Static v. dynamic scheduling
 Static (compiler) instruction scheduling more appropriate than
dynamic (hardware) scheduling for most large scientific
applications
Languages & Tools

Efficiency
 new languages will not be widely used for HPC unless they
can achieve performance comparable with low level languages
(assembler, C, Fortran)

Portability
 to different parallel architectures
 to next generation of machines
 to different vendors’ architectures

Reusability
 Object-oriented programming
 Current languages not designed for HPC (C++, JAVA, …)
Languages & Tools

Optimisation
 Compilers can handle local optimisation well
– register allocation
– instruction scheduling
 Global optimisation will not be automatic
– choice of algorithms
– data layout
– memory hierarchy management
– re-computation v. memory use
– could be helped by better languages & tools
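
As a minimal C sketch of the kind of global, layout-aware optimisation meant here (illustrative only; the tile size is an assumption): a cache-blocked matrix transpose. The choice of blocking depends on the memory hierarchy and is the sort of decision that is not made automatically.

    #include <stddef.h>

    #define BLK 32   /* tile edge, chosen so two tiles fit comfortably in cache */

    /* Naive version: the stores to b stride through memory by n doubles
       and miss the cache on almost every access for large n. */
    void transpose_naive(size_t n, const double *a, double *b)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                b[j * n + i] = a[i * n + j];
    }

    /* Tiled version: each BLK x BLK tile of a and b stays resident in
       cache while it is being copied. */
    void transpose_tiled(size_t n, const double *a, double *b)
    {
        for (size_t ii = 0; ii < n; ii += BLK)
            for (size_t jj = 0; jj < n; jj += BLK)
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t j = jj; j < jj + BLK && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }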
Languages & Tools

How to get scientists & engineers to use new
languages & tools?
 Performance must be good enough
 Compilers & tools must be widely available
 Compilers & tools must be reliable
 Documentation & training

New generation of scientists with interest and
expertise in both Computer Science and
Applications required
 Encouragement to work & publish in this area
 Usual problems for interdisciplinary work
– credit for software written is not on a par with credit for publications
Models & Algorithms

Disciplines with simple well-established
methods
 Models are “exact”
 Methods are well understood & stable
 Errors are under control
– at least as well as for experiments
 Leading-edge computation required for international
competitiveness
 Examples:
– Particle physics (QCD)
– Astronomy (N body)
Models & Algorithms

Disciplines with complex models
 Approximate models used for small-scale physics
 Is reliability limited by
– sophistication of underlying models?
– scale of computation?
– availability of data for initial or boundary conditions?
 Many different calculations for different systems
– capacity v. capability issues
 Examples:
– Meteorology
– Materials
Models & Algorithms

Reliance on packages
 Commercial packages not well-tuned for large parallel
machines
– Algorithms may need changing
 The community is reluctant to change to new packages or to write
its own systems
 Examples:
– Chemistry
– Engineering
Models & Algorithms

Exploration
 HPC not widely used
 Access to machine and expertise is a big hurdle
 Exciting prospects for future progress
 Algorithms and models need development
 Examples:
– Biology
Access & Organisation

Bespoke machines
 The best solution for a few special areas
 QCDSP, APE

Special-purpose machines
 GRAPE
Access & Organisation

Performance versus cost
 Slide courtesy of Norman
Christ (Columbia University)
 Diagonal lines are fixed cost
 Note dates of various
machines
Access & Organisation

Commercial machines
 SMP
– Convenient, easy to use, but not very powerful
– Good for capacity as opposed to capability
 SMP clusters
– Unclear how effective for large-scale problems
– Unclear how they will be programmed
 Commercial interconnects (QSW,…)
Access & Organisation

Capacity v. capability
 Large machine required to get “final point on graph”
 Cost-effectiveness
 International competitiveness
Access & Organisation

Shared v. dedicated resources
 Systems management costs
– advantages of shared resources
• central large-scale data store
• backups
– disadvantages
• users would make more reasonable requests if they had to pay for
their implementation
• tendency for centres to invent software projects which are not
the users’ highest priority
Access & Organisation

Dedicated machines for consortia
 Flexibility in scheduling
– Do not have to prioritise projects in totally different subjects
– Users know & can negotiate with each other
 Ease of access for experimental projects
– consortia can be more flexible at allocating resources for
promising new approaches
Sponsors

The workshop was supported by
 The University of Edinburgh Faculty of Science & Engineering
 Hitachi
 IBM
Glossary

API
 Application Program Interface
– documented interface to a software subsystem so that its facilities
can be used by application (user) programs

Butterfly Network
 Network topology which allows “perfect shuffle” required for
FFTs to be carried out in parallel
– equivalent to Ω network, Fat tree, and Hypercube
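
For concreteness (an illustrative sketch, not from the slides): the perfect shuffle on 2^d nodes sends node i to the node whose d-bit address is the address of i rotated left by one place; one stage of a butterfly/Ω network realises exactly this permutation, and log2(n) such stages implement an n-point radix-2 FFT.

    /* Perfect shuffle of a d-bit node address: rotate left by one bit.
       For d = 3 this maps node 5 (binary 101) to node 3 (binary 011). */
    static unsigned perfect_shuffle(unsigned i, unsigned d)
    {
        unsigned msb = (i >> (d - 1)) & 1u;          /* bit that wraps around */
        return ((i << 1) | msb) & ((1u << d) - 1u);  /* keep only d bits */
    }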
Glossary

Cache
 Fast near-processor memory (usually SRAM) which contains a
copy of the contents of parts of main memory
– often separate instruction & data caches
– commonly organised as a hierarchy of larger caches of increasing
latency
– data is automatically fetched from memory when needed if it is
not already in the cache
– an entire cache “line” is moved from/to memory even if only part
of it is required/modified
– data is written back to memory when cache “line” is needed for
data from some other memory address, or when someone else
needs the new value of the data
– one (direct map) or several (set associative) cache “lines” can be
associated with a given memory address
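
A minimal sketch of the addressing arithmetic implied above (the line size and set count are illustrative assumptions, not those of any particular machine): an address is split into an offset within the cache line, a set index, and a tag, and addresses sharing a set index compete for the line(s) of that set.

    #include <stdint.h>

    #define LINE_BYTES 64u     /* assumed cache line size */
    #define NUM_SETS   512u    /* assumed number of sets  */

    static uint64_t cache_offset(uint64_t addr) { return addr % LINE_BYTES; }
    static uint64_t cache_set(uint64_t addr)    { return (addr / LINE_BYTES) % NUM_SETS; }
    static uint64_t cache_tag(uint64_t addr)    { return addr / ((uint64_t)LINE_BYTES * NUM_SETS); }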
Glossary

Capability
 The ability to solve one big problem in a given time

Capacity
 The ability to solve many small problems in a given time

CISC
 Complex Instruction Set Computer
– instructions combine memory and arithmetic operations
– instructions are not of uniform size or duration
– often implemented using microcode
– power is dissipated only when switching states
Glossary

CMOS
 Complementary Metal Oxide Semiconductor
– by far the most widely used VLSI technology at present
– VLSI technology in which both p type and n type FETs are used
– design methodology is that each output is connected by a low
impedance path to either the supply or the ground rail, hence low
static power dissipation
– NMOS technology requires fewer fabrication steps, but draws more
current and is no longer in common use
– BiCMOS allows the construction of both FET and Bipolar
transistors on the same chip
• requires more fabrication steps and therefore larger minimum
feature size (and thus lower density) for an acceptable yield
• bipolar transistors can drive more current (for a given size
and delay) than FETs
Glossary

Coherency
 Means of ensuring that the copies of data in memory and
caches are consistent
– the illusion that a single value is associated with each address

Crossbar
 Network topology allowing an arbitrary permutation in a single
operation

Data Parallel
 Programming model in which all nodes carry out the same
operation on different data simultaneously
– may be implemented using SIMD or MIMD architectures
Glossary

Delayed branches
 several instructions following a branch are unconditionally
executed before the branch is taken, allowing the instruction
pipeline to remain filled

DRAM
 Dynamic RAM
– each bit is stored as charge on a tiny capacitor accessed through a
single FET
– only one transistor required to store each bit
– DRAM needs to be refreshed (read and rewritten) every few µs
before the charge leaks away
Glossary

DSP
 Digital Signal Processor
– low cost
– low power
– no cache
– used for embedded devices

Dynamic Scheduling
 Order in which instructions are issued is determined on the
basis of current activity
– dynamic branch prediction: instructions are prefetched along the
path taken the last few times through this branch
– scoreboarding: instructions are delayed until the resources
required (e.g., registers) are free
Glossary

ECC
 Error Correcting Codes
– mechanism for correcting bit errors in DRAM by using, e.g.,
Hamming codes

Fat Nodes
 Fast & large processor nodes in a multiprocessor machine
– allow relatively large sub-problem to live on each node
– permits coarse-grained communications (large packets, but fewer
of them)
– memory bandwidth is a potential problem

Fat Tree
 Network topology which allows “perfect shuffle” required for
FFTs to be carried out in parallel
– equivalent to Butterfly, Ω network, and Hypercube
Glossary

FET
 Field Effect Transistor
– transistor in which the channel (source to drain) impedance is
controlled by the charge applied to the gate
– no current flows from the gate to the channel
– c.f., a bipolar transistor, in which a current drawn through the base
controls the current flowing from the emitter to the collector

FFT
 Fast Fourier Transform
– O(n log n) algorithm for taking the Fourier transform of n data
values
Glossary

GaAs
 Gallium Arsenide
– a semiconductor whose band gap structure allows faster
switching times than those for Si (silicon)
– fabrication technology lags that for Si
– larger minimum feature size for acceptable yield
– VLSI speed is limited by path length, not by switching times

Generation
 a level of technology used for chip fabrication, usually
measured by the minimum feature size

GRAPE
 special purpose machine for solving the n-body problem
Glossary

Hypercube
 Network topology which allows “perfect shuffle” required for
FFTs to be carried out in parallel
– nodes live on the vertices of a d-dimensional hypercube
– communications links are the edges of the hypercube
– equivalent to Butterfly, Ω network, and Fat tree

Instruction prefetch
 The ability to fetch & decode an instruction while the previous
instruction is still executing
– allows “pipelining” of instruction execution stages
– requires special techniques to deal with conditional branches
where it is not yet known which instruction is “next”
Glossary

Latency
 The time between issuing a request for data and receiving it

Microcode
 sequence of microinstructions used to implement a single
machine instruction
– similar functionality to having simpler (RISC) instructions with an
instruction cache, but less flexible
– stored in ROM
– reduces complexity of processor logic
– some similarities
• “vertical” code ≈ RISC
• “horizontal” code ≈ VLIW
Glossary

MPI
 Message Passing Interface

MPP
 Massively Parallel Processor
– a collection of processor nodes each with their own local memory
– explicit distributed memory architecture
– nodes connected by a network of some regular topology

MIMD
 Multiple Instruction Multiple Data
– Architecture in which each processor has its own instruction and
data streams

NUMA
 Non Uniform Memory Access
Glossary

Object Oriented Programming
 Programming paradigm in which data and procedures are
encapsulated into objects
– objects are defined by their interfaces, i.e., what they do and not
how they do it
– the way the data is represented within an object, and the way the
methods which can manipulate it are implemented is hidden from
the rest of the program

Ω Network
 Network topology which allows “perfect shuffle” required for
FFTs to be carried out in parallel
– equivalent to Butterfly, Fat tree, and Hypercube
Glossary

OpenMP
 Shared-memory parallel programming API (compiler directives plus a runtime library)
– See http://www.openmp.org for details
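
A minimal sketch of the OpenMP directive style (illustrative, not from the slides): a single pragma distributes the loop iterations over the threads of a shared-memory node and combines the per-thread partial sums.

    #include <stdio.h>

    int main(void)
    {
        enum { N = 1000000 };
        static double x[N];
        double sum = 0.0;

        /* Iterations are shared among threads; the reduction clause merges
           the per-thread partial sums when the loop finishes. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            x[i] = (double)i;
            sum += x[i];
        }

        printf("sum = %.0f\n", sum);
        return 0;
    }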

Packet Size
 The amount of data that can (has to) be transferred as an
atomic unit
– overhead associated with each packet (framing, headers,…)
– packet size must be small for a fine-grained machine, otherwise
available bandwidth cannot be used to send useful data
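
The overhead argument can be made quantitative (standard reasoning, not from the slides): if each packet carries h bytes of framing and header on top of p bytes of payload, the usable fraction of the raw bandwidth B is

    B_{eff} = B \cdot p / (p + h),

so, for instance, 16-byte payloads with 16 bytes of per-packet overhead deliver only half of the nominal bandwidth.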

QCD
 Quantum ChromoDynamics
– theory of the strong interaction, by which strongly interacting
elementary particles are built from quarks and gluons
– non-perturbative QCD calculations use Monte Carlo methods on
a lattice discretisation of space-time
Glossary

RAM
 Random Access Memory

RISC
 Reduced Instruction Set Computer
– arithmetic operations only act on registers
– instructions are of uniform length and duration
• this rule is almost always violated by the inclusion of floating
point instructions
– memory access, instruction decoding, and arithmetic can be
carried out by separate units in the processor

ROM
 Read Only Memory
Glossary

SDRAM
 Synchronous DRAM
– a DRAM chip protocol allowing more overlap of memory access
operations

SECDED
 Single Error Correction Double Error Detection
– the most common form of ECC used for DRAM memories
– usually uses 7 syndrome bits for 32 bit words or 8 syndrome bits
for 64 bit words
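
The syndrome-bit counts quoted above follow from the usual Hamming-code argument (a standard result, stated here for clarity): correcting a single error in m data bits requires r check bits with

    2^r \ge m + r + 1,

which gives r = 6 for m = 32 and r = 7 for m = 64; one additional overall parity bit provides double-error detection, hence 7 and 8 syndrome bits respectively.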
Glossary

Silicon Compilers
 Translators from a Hardware Description Language (such as
VHDL) into a set of masks from which a (semi-)custom chip can
be made

SIMD
 Single Instruction Multiple Data
– architecture in which all processors execute the same instruction
at the same time on different data
– instructions are usually broadcast from a single copy of the
program
– processors run in lock-step
Glossary

SMP
 Symmetric MultiProcessor
– Set of processors connected to a shared memory by a common
bus or switch

SRAM
 Static RAM
– bit value stored as state of bistable “flip-flop”
– six transistors required per bit
– used for registers and on-chip caches
Glossary

Static Scheduling
 Scheduling of instructions (usually by a compiler) to optimise
performance
– does not rely on hardware to analyse dynamic behaviour of the
program
– may make use of knowledge of “average” behaviour of a program
obtained from profiling

Superscalar
 architecture having several functional units which can carry out
several operations simultaneously
– load/store
– integer arithmetic
– branch
– floating point pipelines
Glossary

Thin Nodes
 Relatively slow & small processors in a multiprocessor
machine
– require fine-grained parallelism
Glossary

Vector Instructions
 Instructions which carry out the same operation on a whole
stream of data values
– reduces memory bandwidth requirements by providing many data
words for one address word
– reduces memory bandwidth requirements by reducing number of
instructions (no gain if there is a reasonably large instruction
cache)
– requires more register space for temporaries
– some architectures use short vector instructions
• Intel Pentium MMX instructions
• Hitachi PA-RISC extensions
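
A minimal sketch (illustrative) of the kind of loop such instructions target: every iteration applies the same multiply-add to successive elements of the streams x and y, so it can be issued as a few vector (or short-vector/SIMD) operations rather than one scalar instruction sequence per element.

    /* y := a*x + y over n elements -- the classic vectorisable kernel. */
    void saxpy(long n, float a, const float *x, float *y)
    {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }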
Glossary

VLIW
 Very Long Instruction Word
– instructions which allow software to control many functional units
simultaneously
– hard to program by hand
– allows compilers opportunities for static scheduling

VLSI
 Very Large Scale Integration
– technology in which a very large number of logic circuits (transistors)
are fabricated on a single semiconductor chip