Tutorial slides


Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems
Kei Davis and Fabrizio Petrini
{kei,fabrizio}@lanl.gov
Performance and Architectures Lab (PAL), CCS-3
Computer and Computational Sciences Division
Los Alamos National Laboratory

Overview
• In this part of the tutorial we will discuss the characteristics of some of the most powerful supercomputers
• We classify these machines along three dimensions:
  - Node integration: how processors and the network interface are integrated in a computing node
  - Network integration: what primitive mechanisms the network provides to coordinate the processing nodes
  - System software integration: how the operating system instances are globally coordinated
Overview (continued)
• We argue that the level of integration along each of these three dimensions, more than other parameters (such as distributed vs. shared memory or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers
• In this part of the tutorial we will briefly characterize some existing and upcoming parallel computers
ASCI Q: Los Alamos National Laboratory
ASCI Q
• Total: 20.48 TF/s, #3 in the Top 500
• Systems: 2,048 AlphaServer ES45s
  - 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
• Memory: 22 Terabytes
• System interconnect: dual-rail Quadrics interconnect
  - 4,096 QSW PCI adapters
  - Four 1024-way QSW federated switches
• Operational in 2002
Node: HP (Compaq) AlphaServer ES45 21264 System Architecture
[Block diagram] Four EV68 1.25-GHz CPUs, each with 16 MB of cache and a 64-bit, 500-MHz (4.0 GB/s) connection to the quad C-chip crossbar controller; up to 32 GB of memory on four MMBs over 256-bit, 125-MHz (4.0 GB/s) buses; two PCI chips driving 64-bit PCI buses at 66 MHz (528 MB/s) and 33 MHz (266 MB/s) with ten PCI slots (several hot-swap); plus legacy 3.3V/5.0V I/O (USB, serial/parallel, keyboard/mouse, floppy).
QsNET: Quaternary Fat Tree
• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs
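To make latency and bandwidth figures like these concrete, the sketch below shows the usual kind of MPI ping-pong microbenchmark: ranks 0 and 1 bounce a message back and forth, and half the average round-trip time gives the one-way latency (or, for large messages, the bandwidth). This is an illustrative sketch, not the benchmark used on these machines; the repetition count and message size are arbitrary.

/* pingpong.c - minimal MPI ping-pong sketch (illustrative only).
   Half the round-trip time of a 0-byte message approximates latency;
   large messages approximate bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, reps = 1000, size = 0;   /* message size in bytes (assumption) */
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }
    if (argc > 1) size = atoi(argv[1]);
    buf = malloc(size > 0 ? size : 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * reps);
        printf("size %d B: one-way time %.2f us", size, one_way * 1e6);
        if (size > 0)
            printf(", bandwidth %.1f MB/s", size / one_way / 1e6);
        printf("\n");
    }
    free(buf);
    MPI_Finalize();
    return 0;
}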
Interconnection Network
[Figure] A quaternary fat tree with six switch levels: at the bottom level, sixteen 64U64D switches (the 1st connecting nodes 0-63, ..., the 16th connecting nodes 960-1023), with mid-level and top-level federated switch stages above them; 1,024 nodes per segment (x2 = 2,048 nodes).
System Software
• Operating system is Tru64
• Nodes organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
• Resource management executed through Ethernet (RMS)
ASCI Q: Overview
• Node integration: Low (multiple boards per node, network interface on the I/O bus)
• Network integration: High (HW support for atomic collective primitives)
• System software integration: Medium/Low (TruCluster)
ASCI Thunder, 1,024 Nodes, 23 TF/s Peak
ASCI Thunder: Lawrence Livermore National Laboratory
• 1,024 nodes, 4,096 processors, 23 TF/s, #2 in the Top 500
ASCI Thunder: Configuration
• 1,024 nodes, quad 1.4-GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total)
• 2.5 µs / 912 MB/s MPI latency and bandwidth over Quadrics Elan4
• Barrier synchronization 6 µs, allreduce 15 µs
• 75 TB of local disk (73 GB/node, UltraSCSI320)
• Lustre file system with 6.4 GB/s delivered parallel I/O performance
• Linux RH 3.0, SLURM, CHAOS
CHAOS: Clustered High Availability Operating System
• Derived from Red Hat, but differs in the following areas:
  - Modified kernel (Lustre and hardware specific)
  - New packages for cluster monitoring, system installation, power/console management
  - SLURM, an open-source resource manager
ASCI Thunder: Overview
• Node integration: Medium/Low (network interface on the I/O bus)
• Network integration: Very High (HW support for atomic collective primitives)
• System software integration: Medium (CHAOS)
System X: Virginia Tech
System X, 10.28 TF/s
• 1,100 nodes based on dual Apple G5 2-GHz CPUs
• 8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
• Each node has 4 GB of main memory and 160 GB of Serial ATA storage; 176 TB total secondary storage
• InfiniBand, 8 µs and 870 MB/s latency and bandwidth, partial support for collective communication
• System-level fault tolerance (Déjà vu)
System X: Overview
• Node integration: Medium/Low (network interface on the I/O bus)
• Network integration: Medium (limited support for atomic collective primitives)
• System software integration: Medium (system-level fault tolerance)
BlueGene/L System
[Figure] Packaging hierarchy:
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
BlueGene/L Compute ASIC
[Block diagram] Two PowerPC 440 CPUs (one acting as I/O processor), each with 32k/32k L1 caches and a "Double FPU", connected through L2 buffers and a multiported shared SRAM to a shared L3 directory and 4 MB of embedded DRAM (L3 cache or memory, with ECC), plus a 144-bit-wide DDR controller with ECC (256/512 MB of external DDR). Network interfaces on the chip: 3D torus (6 out and 6 in links, each at 1.4 Gbit/s), tree (3 out and 3 in links, each at 2.8 Gbit/s), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.
• IBM CU-11, 0.13 µm
• 11 x 11 mm die size, 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
[Photo] BlueGene/L node card: 16 compute cards, 2 I/O cards, DC-DC converters (40 V to 1.5 V and 2.5 V).
BlueGene/L Interconnection Networks
• 3D torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node)
  - 350/700 GBytes/s bisection bandwidth
  - Communications backbone for computations
• Global tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of tree traversal on the order of 5 µs
  - Interconnects all compute and I/O nodes (1,024)
• Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
• Low-latency global barrier
  - 8 single wires crossing the whole system, touching all nodes
• Control network (JTAG)
  - For booting, checkpointing, error logging
BlueGene/L System Software Organization
• Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK)
• I/O nodes run Linux and provide OS services: file access, process launch/termination, debugging
• Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software
Operating Systems
• Compute nodes: CNK
  - Specialized simple OS: 5,000 lines of code, 40 KBytes in core
  - No thread support, no virtual memory
  - Protection: protect the kernel from the application; some network devices in user space
  - File I/O offloaded ("function shipped") to the I/O nodes through kernel system calls
  - "Boot, start app and then stay out of the way"
• I/O nodes: Linux
  - 2.4.19 kernel (2.6 underway) with ramdisk
  - NFS/GPFS client
  - CIO daemon to start/stop jobs and execute file I/O
• Global OS (CMCS, on the service node)
  - Invisible to user programs
  - Global and collective decisions
  - Interfaces with external policy modules (e.g., job scheduler)
  - Commercial database technology (DB2) stores static and dynamic state
  - Partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
  - Scalability, robustness, security
• Execution mechanisms in the core; policy decisions in the service node
BlueGene/L: Overview
• Node integration: High (the processing node integrates processors and network interfaces, with the network interfaces directly connected to the processors)
• Network integration: High (separate tree network)
• System software integration: Medium/High (compute kernels are not globally coordinated)
• #2 and #4 in the Top 500
Cray XD1
Cray XD1 System Architecture
• Compute
  - 12 AMD Opteron 32/64-bit x86 processors
  - High-performance Linux
• RapidArray interconnect
  - 12 communications processors
  - 1 Tb/s switch fabric
• Active management
  - Dedicated processor
• Application acceleration
  - 6 co-processors
  - Processors directly connected to the interconnect
Cray XD1 Processing Node
[Figure] Chassis front: six 2-way SMP blades, 4 fans, six SATA hard drives, a 500 Gb/s crossbar switch and a 12-port inter-chassis connector. Chassis rear: four independent PCI-X slots and a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector.
Cray XD1 Compute Blade
[Figure] Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.
Fast Access to the Interconnect
[Figure] Comparison of a Xeon server and the Cray XD1 (processor GFLOPS, GigaBytes of memory, GigaBytes per second of I/O and interconnect bandwidth): memory bandwidth of 5.3 GB/s (DDR 333) vs. 6.4 GB/s (DDR 400); interconnect access through a 1 GB/s PCI-X bus and 0.25 GB/s GigE on the Xeon server vs. 8 GB/s on the XD1.
Communications Optimizations
RapidArray communications processor:
• HT/RA tunnelling with bonding
• Routing with route redundancy
• Reliable transport
• Short message latency optimization
• DMA operations
• System-wide clock synchronization
[Figure] The AMD Opteron 2XX processor connects to the RapidArray communications processor over a 3.2 GB/s HyperTransport link; the communications processor drives two 2 GB/s RapidArray links.
Active Manager System
• Usability: single system command and control
• Resiliency (Active Management software):
  - Dedicated management processors, real-time OS and communications fabric
  - Proactive background diagnostics with self-healing
  - Synchronized Linux kernels
Cray XD1: Overview
• Node integration: High (direct access from HyperTransport to RapidArray)
• Network integration: Medium/High (HW support for collective communication)
• System software integration: High (compute kernels are globally coordinated)
• Early stage
ASCI Red Storm
Red Storm Architecture
• Distributed memory MIMD parallel supercomputer
• Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network
• 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
• ~10 TB of DDR memory @ 333 MHz
• Red/Black switching: ~1/4, ~1/2, ~1/4
• 8 service and I/O cabinets on each end (256 processors for each color)
• 240 TB of disk storage (120 TB per color)
Red Storm Architecture (continued)
• Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
• Partitioned operating system (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
• Separate RAS and system management network (Ethernet)
• Router table-based routing in the interconnect
Red Storm Architecture
[Figure] Logical partitioning of the machine into service nodes (users, /home), compute nodes, file I/O and net I/O.
System Layout (27 x 16 x 24 mesh)
[Figure] Normally classified and normally unclassified sections at the two ends of the mesh, with switchable nodes and disconnect cabinets between them.
Red Storm System Software
• Run-time system
  - Logarithmic loader
  - Fast, efficient node allocator
  - Batch system: PBS
  - Libraries: MPI, I/O, math
• File systems being considered include
  - PVFS (interim file system)
  - Lustre (Pathforward support), Panasas, ...
• Operating systems
  - Linux on service and I/O nodes
  - Sandia's LWK (Catamount) on compute nodes
  - Linux on RAS nodes
ASCI Red Storm: Overview
• Node integration: High (direct access from HyperTransport to the network through a custom network interface chip)
• Network integration: Medium (no support for collective communication)
• System software integration: Medium/High (scalable resource manager, no global coordination between nodes)
• Expected to become the most powerful machine in the world (competition permitting)
Overview

                 Node Integration    Network Integration    Software Integration
ASCI Q           Low                 High                   Medium/Low
ASCI Thunder     Medium/Low          Very High              Medium
System X         Medium/Low          Medium                 Medium
BlueGene/L       High                High                   Medium/High
Cray XD1         High                Medium/High            High
Red Storm        High                Medium                 Medium/High
A Case Study: ASCI Q
• We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer
• Our hands-on experience with ASCI Q shows that the system software and its global coordination are fundamental in a large-scale parallel machine
ASCI Q
• 2,048 ES45 AlphaServers, with 4 processors per node
• 16 GB of memory per node
• 8,192 processors in total
• 2 independent network rails, Quadrics Elan3
• > 8,192 cables
• 20 Tflops peak, #2 in the Top 500 lists
• A complex human artifact
Dealing with the Complexity of a Real System
• In this section of the tutorial we provide insight into the methodology that we used to substantially improve the performance of ASCI Q
• This methodology is based on an arsenal of
  - analytical models
  - custom microbenchmarks
  - full applications
  - discrete event simulators
• Dealing with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code
Overview
• Our performance expectations for ASCI Q and the reality
• Identification of performance factors
  - Application performance and breakdown into components
  - Detailed examination of system effects
• A methodology to identify operating system effects
  - Effect of scaling: up to 2,000 nodes / 8,000 processors
  - Quantification of the impact
• Towards the elimination of overheads
  - Demonstrated over 2x performance improvement
• Generalization of our results: application resonance
• Bottom line: the importance of integrating the various system software activities across nodes
Performance of SAGE on 1,024 Nodes
• Performance consistent across QA and QB (the two segments of ASCI Q, with 1,024 nodes / 4,096 processors each)
• Measured time 2x greater than the model (at 4,096 PEs)
[Figure] SAGE performance (QA & QB): cycle time (s) vs. number of PEs (0 to 4,096) for the model and for measurements from Sep-21-02 and Nov-25-02. There is a difference: why? Lower is better.
Using Fewer PEs per Node
• Test performance using 1, 2, 3 and 4 PEs per node
[Figure] SAGE on QB (timing.input): cycle time (s) vs. number of PEs (1 to 10,000, log scale) for 1, 2, 3 and 4 PEs per node. Lower is better.
Using Fewer PEs per Node (2)
• Measurements match the model almost exactly for 1, 2 and 3 PEs per node!
[Figure] SAGE on QB (timing.input): error (measured minus model, in seconds) vs. number of PEs for 1, 2, 3 and 4 PEs per node.
• The performance issue only occurs when using 4 PEs per node
Mystery #1
SAGE performs significantly worse on ASCI Q than was predicted by our model
SAGE Performance Components
• Look at SAGE in terms of its main components:
  - Put/Get (point-to-point boundary exchange)
  - Collectives (allreduce, broadcast, reduction)
[Figure] SAGE on QB, breakdown (timing.input): time per cycle (s) vs. number of PEs for token_allreduce, token_bcast, token_get, token_put, token_reduction and the total cycle time.
• The performance issue seems to occur only on collective operations
Performance of the Collectives
• Measure collective performance separately
[Figure] Allreduce latency (ms) vs. number of nodes (0 to 1,000) for 1, 2, 3 and 4 processes per node.
• Collectives (e.g., allreduce and barrier) mirror the performance of the application
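Curves like the one above come from timing the collective in isolation. Below is a minimal sketch of such a measurement: it times a large number of small MPI_Allreduce operations and reports the average per-operation latency on rank 0. It is illustrative only; the benchmarks used on ASCI Q also varied the number of processes per node and the allreduce algorithm.

/* allreduce_lat.c - sketch of an allreduce latency microbenchmark (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, reps = 10000;
    double in = 1.0, out, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Warm up and synchronize before timing. */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d processes: average allreduce latency %.2f us\n",
               nprocs, (t1 - t0) / reps * 1e6);
    MPI_Finalize();
    return 0;
}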
Identifying the Problem within SAGE
[Figure] Simplify: from the full SAGE application down to its allreduce component.
Exposing the Problems with Simple Benchmarks
• Challenge: identify the simplest benchmark that exposes the problem
[Figure] Decreasing complexity: from allreduce down to progressively simpler benchmarks.
Interconnection Network and Communication Libraries
• The initial (obvious) suspects were the interconnection network and the MPI implementation
• We tested in depth the network, the low-level transmission protocols and several allreduce algorithms
• We also implemented allreduce in the network interface card
• By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
• But we only got small improvements in SAGE (5%)
Mystery #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to only a small performance improvement
Computational Noise
• After having ruled out the network and MPI, we focused our attention on the compute nodes
• Our hypothesis is that the computational noise is generated inside the processing nodes
• This noise "freezes" a running process for a certain amount of time and generates a "computational hole"
Computational Noise: Intuition
• Running 4 processes on all 4 processors of an AlphaServer ES45
• The computation of one process is interrupted by an external event (e.g., a system daemon or the kernel)
[Figure] Processes P0-P3, one per processor; the external activity interrupts one of them.
Computational Noise: 3 Processes on 3 Processors
• Running 3 processes on 3 processors of an AlphaServer ES45
• The "noise" can run on the 4th (idle) processor without interrupting the other 3 processes
[Figure] Processes P0-P2 on three processors, with the fourth processor idle.
Coarse-Grained Measurement
• We execute a computational loop for 1,000 seconds on all 4,096 processors of QB
[Figure] Timeline from START to END for processes P1-P4.
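A sketch of this kind of coarse-grained probe is shown below: every process executes a fixed amount of pure computation, and the difference between its wall-clock time and the time the same loop takes on an unloaded processor gives the per-process slowdown. The iteration count is a placeholder that would need to be calibrated to roughly 1,000 seconds of work.

/* coarse_noise.c - coarse-grained noise probe (sketch).
   Run one copy per processor; compare the elapsed wall-clock time with
   the time the same loop takes on an otherwise idle node. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 2000000000UL   /* placeholder: calibrate so the loop runs ~1000 s */

int main(int argc, char **argv)
{
    int rank;
    volatile double x = 0.0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);           /* common start time */
    t0 = MPI_Wtime();
    for (unsigned long i = 0; i < ITERS; i++)
        x += 1e-9;                          /* pure computation, no I/O, no messages */
    t1 = MPI_Wtime();

    printf("rank %d: elapsed %.3f s (x=%g)\n", rank, t1 - t0, x);
    MPI_Finalize();
    return 0;
}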
Coarse-Grained Computational Overhead per Process
• The slowdown per process is small, between 1% and 2.5%
[Figure] Per-process slowdown; lower is better.
Mystery #3
Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise
Fine-Grained Measurement
• We run the same benchmark for 1,000 seconds, but we measure the run time of every millisecond of computation
• This fine granularity is representative of many ASCI codes
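A sketch of this fine-grained variant: the same total work is split into nominally 1-ms chunks, each chunk is timed individually, and the per-chunk durations are collected into a histogram; any chunk that takes noticeably longer than 1 ms was interrupted by noise. The calibration constant WORK_PER_MS is a placeholder.

/* fine_noise.c - fine-grained noise probe (sketch).
   Time each ~1 ms chunk of computation and histogram the durations. */
#include <mpi.h>
#include <stdio.h>

#define CHUNKS      1000000      /* one million 1-ms chunks, ~1000 s in total */
#define WORK_PER_MS 500000UL     /* placeholder: calibrate to ~1 ms of work   */
#define NBINS       64           /* histogram bins, 1 ms wide                 */

int main(int argc, char **argv)
{
    int rank;
    long hist[NBINS] = {0};
    volatile double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    for (long c = 0; c < CHUNKS; c++) {
        double t0 = MPI_Wtime();
        for (unsigned long i = 0; i < WORK_PER_MS; i++)
            x += 1e-9;
        double ms = (MPI_Wtime() - t0) * 1e3;
        int bin = (int)ms;                  /* 0 -> [0,1) ms, 1 -> [1,2) ms, ... */
        if (bin >= NBINS) bin = NBINS - 1;
        hist[bin]++;
    }
    for (int b = 0; b < NBINS; b++)
        if (hist[b]) printf("rank %d: [%d,%d) ms: %ld chunks\n", rank, b, b + 1, hist[b]);
    printf("rank %d done (x=%g)\n", rank, x);
    MPI_Finalize();
    return 0;
}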
Fine-Grained Computational Overhead per Node
• We now compute the slowdown per node, rather than per process
• The noise has a clear, per-cluster structure
[Figure] Per-node slowdown; the optimum is 0 (lower is better).
Finding #1
Analyzing noise on a per-node basis reveals a regular structure across nodes
Noise in a 32-Node Cluster
• The Q machine is organized in 32-node clusters (TruCluster)
• In each cluster there is a cluster manager (node 0), a quorum node (node 1) and the RMS data collection point (node 31)
Per-Node Noise Distribution
• Plot the distribution of one million 1-ms computational chunks
• In an ideal, noiseless machine the distribution is a single bar at 1 ms with one million points per process (4 million per node)
• Every outlier identifies a computation that was delayed by external interference
• We show the distributions for a standard cluster node, and also for nodes 0, 1 and 31
Cluster Nodes (2-30)
• 10% of the time, the execution of the 1-ms chunk of computation is delayed
[Figure] Distribution of chunk durations on a standard cluster node.
Node 0, Cluster Manager
• We can identify 4 main sources of noise
[Figure] Distribution of chunk durations on node 0.
Node 1, Quorum Node
• One source of heavyweight noise (335 ms!)
[Figure] Distribution of chunk durations on node 1.
Node 31, RMS Data Collection
• Many fine-grained interruptions, between 6 and 8 milliseconds
[Figure] Distribution of chunk durations on node 31.
The Effect of the Noise
• An application is usually a sequence of a computation followed by a synchronization (collective)
• But if an event happens on a single node, it can affect all the other nodes
[Figure] Timelines of several nodes alternating computation and collectives; a noise event on one node delays the whole set at the next collective.
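This mechanism can be written down compactly. In a bulk-synchronous step every node computes for a granule of length $g$ and then waits at a collective, so the step takes as long as the slowest node. A simple model (our notation, not taken from the slides):

    T_{\mathrm{step}} = g + \max_{1 \le j \le N} \delta_j

where $\delta_j$ is the noise-induced delay suffered by node $j$ during that granule. Even if each individual $\mathbb{E}[\delta_j]$ is tiny (the 1-2.5% per-process slowdown measured above), the expected value of the maximum over thousands of nodes can be a substantial fraction of $g$.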
Effect of System Size
• The probability of a random event occurring during a given computational interval increases with the node count
[Figure] With more nodes, almost every interval between collectives is hit by noise on at least one node.
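A back-of-the-envelope version of this statement, under the simplifying assumption that each node is hit independently with probability $p$ during a granule:

    \Pr[\text{at least one of } N \text{ nodes is delayed}] = 1 - (1 - p)^N

With $p = 0.01$ this is 1% for a single node but about 99.997% for $N = 1024$, so at scale essentially every collective ends up waiting for some delayed node.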
Tolerating Noise: Buffered Coscheduling (BCS)
• We can tolerate the noise by coscheduling the activities of the system software on each node
[Figure] When the system activities on different nodes are scheduled at the same time, their delays overlap instead of hitting different intervals.
Discrete Event Simulator Used to Model Noise
• The DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise
• The noise model closely approximates the experimental data
• The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64)
[Figure] Simulated vs. measured performance; lower is better.
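The actual tool was a discrete event simulator driven by the measured noise harmonics. The toy Monte Carlo below illustrates the same idea under much simpler assumptions: a bulk-synchronous application with 1-ms granules, and noise sources described only by a per-granule hit probability and a duration. The parameter values are placeholders loosely inspired by the distributions shown earlier (10% of 1-ms chunks delayed on cluster nodes, a rare 335-ms event on one node per cluster); they are not the measured harmonics, and this is not the simulator used in the study.

/* noise_sim.c - toy Monte Carlo illustration of how noise affects a
   bulk-synchronous application; a simplified stand-in for the discrete
   event simulator described above. Noise parameters are placeholders. */
#include <stdio.h>
#include <stdlib.h>

#define NODES   1024
#define STEPS   20000
#define GRANULE 1.0e-3                 /* 1 ms of computation per step */

struct source {
    const char *name;
    int nodes;       /* how many nodes this noise runs on */
    double p;        /* probability of a hit per 1-ms granule on such a node */
    double d;        /* duration of one hit, in seconds */
};

static double slowdown(struct source s)
{
    double total = 0.0;
    for (int step = 0; step < STEPS; step++) {
        double worst = 0.0;            /* the collective waits for the slowest node */
        for (int n = 0; n < s.nodes; n++)
            if ((double)rand() / RAND_MAX < s.p && s.d > worst)
                worst = s.d;
        total += GRANULE + worst;
    }
    return 100.0 * (total / (STEPS * GRANULE) - 1.0);
}

int main(void)
{
    /* Short, frequent noise on every node vs. long, rare noise on one
       node per 32-node cluster (placeholder durations and frequencies). */
    struct source fine  = { "fine-grained, all nodes",         NODES,      0.10,   0.3e-3 };
    struct source heavy = { "heavyweight, 1 node per cluster", NODES / 32, 1.0e-5, 335e-3 };

    printf("%s: %.0f%% slowdown\n", fine.name,  slowdown(fine));
    printf("%s: %.1f%% slowdown\n", heavy.name, slowdown(heavy));
    return 0;
}

Even with these crude placeholder numbers, the sketch reproduces the qualitative behavior stated in the next finding: the short, frequent noise present on every node costs more than the rare, heavyweight noise confined to a few nodes.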
Finding #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes
Incremental Noise Reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, niff)
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s)
3. Moved several daemons from nodes 1 and 2 to node 0 in each cluster
Improvements in the Barrier Synchronization Latency
[Figure] Barrier synchronization latency before and after the noise-reduction steps.
Resulting SAGE Performance
[Figure] SAGE cycle time (s) vs. number of PEs (up to 8,192): the model compared with measurements from Sep-21-02, Nov-25-02, Jan-27-03 and May-01-03 (min).
• Nodes 0 and 31 were also configured out in the optimization
Finding #3
We were able to double SAGE's performance by selectively removing noise caused by several types of system activities
Generalizing Our Results: Application Resonance
• The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that hurts it
• Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application
• Rule of thumb: the computational granularity of the application "enters into resonance" with the noise of the same order of magnitude
• Performance can be enhanced by selectively removing sources of noise
• We can provide a reasonable estimate of the performance improvement knowing the computational granularity of a given application
Cumulative Noise Distribution, Sequence of Barriers with No Computation
• Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes
Conclusions
• Combination of measurement, simulation and modeling to identify and resolve performance issues on Q
• Used modeling to determine that a problem exists
• Developed computation kernels to quantify OS events:
  - The effect increases with the number of nodes
  - The impact is determined by the computational granularity of the application
• Application performance has significantly improved
• The method is also being applied to other large systems
About the Authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL) where he is
currently working on system software solutions for reliability and usability of large-scale parallel computers.
Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer
system simulation, and parallel functional language implementation. His research interests are centered on parallel
computing; more specifically, various aspects of operating systems, parallel programming, and programming
language design and implementation. Kei received his PhD in Computing Science from Glasgow University and
his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at
the Computing Research Laboratory at New Mexico State University.
Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory
(LANL). He received his PhD in Computer Science from the University of Pisa in 1997. Before his appointment at
LANL he was a research fellow at the Oxford University Computing Laboratory (UK), a postdoctoral
researcher at the University of California at Berkeley, and a member of the technical staff at Hewlett-Packard
Laboratories. His research interests include various aspects of supercomputers, including high-performance
interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating
systems and parallel programming languages. He has received numerous awards from the NNSA for contributions
to supercomputing projects, and from other organizations for scientific publications.