Transcript Section 2
PAL
CCS-3
STATE OF THE ART
Kei Davis and Fabrizio Petrini
{kei,fabrizio}@lanl.gov
Europar 2004, Pisa Italy
Overview

We are going to briefly describe some state-of-the-art supercomputers.
The goal is to evaluate the degree of integration of the three main components: processing nodes, interconnection network, and system software.
The analysis is limited to six supercomputers (ASCI Q, ASCI Thunder, System X, BlueGene/L, Cray XD1, and ASCI Red Storm) due to space and time limitations.
ASCI Q: Los Alamos National Laboratory
ASCI Q

Total
— 20.48 TF/s, #3 in the Top 500
Systems
— 2048 AlphaServer ES45s
— 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
System Memory
— 22 Terabytes
Interconnect
— Dual-rail Quadrics interconnect
— 4096 QSW PCI adapters
— Four 1024-way QSW federated switches
Operational in 2002
Node: HP (Compaq) AlphaServer ES45 — 21264 System Architecture

[Block diagram of the ES45 node: four 1.25-GHz EV68 CPUs (16 MB cache per CPU) attached to a quad C-chip controller, each at 64b/500 MHz (4.0 GB/s); up to 32 GB of memory on four MMBs over 256b/125 MHz (4.0 GB/s) buses; two PCI chips driving PCI buses 0-3 with ten PCI slots at 64b/66 MHz (528 MB/s) or 64b/33 MHz (266 MB/s), several hot-swappable (HS), plus PCI-USB and "junk I/O" (serial, parallel, keyboard/mouse, floppy) on 3.3V and 5.0V I/O.]
QsNET: Quaternary Fat Tree

• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs
Interconnection Network

[Diagram of the federated QsNet fat tree: sixteen 64U64D bottom-level switches (1st: nodes 0-63, ..., 16th: nodes 960-1023), with mid-level and super-top-level switch stages above, interconnecting 1024 nodes per rail (x2 rails = 2048 nodes).]
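To make the fat-tree arithmetic concrete, here is a minimal sketch (ours, not from the slides) of the depth calculation for a k-ary fat tree; the function name and the radix-4 assumption are ours.

```python
# Hypothetical sketch: size arithmetic for a quaternary (4-ary)
# fat tree like QsNet's.

def fat_tree_levels(nodes: int, arity: int = 4) -> int:
    """Number of switch levels needed to span `nodes` leaves."""
    levels = 0
    capacity = 1
    while capacity < nodes:
        capacity *= arity
        levels += 1
    return levels

if __name__ == "__main__":
    # One ASCI Q rail spans 1024 nodes: 4^5 = 1024, so 5 switch levels.
    print(fat_tree_levels(1024))   # -> 5
    # Dual rail doubles the adapters (4096 on 2048 nodes), not the depth.
```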
System Software

Operating system is Tru64
Nodes organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
Resource management executed through Ethernet (RMS)
ASCI Q: Overview

Node Integration: Low (multiple boards per node, network interface on I/O bus)
Network Integration: High (HW support for atomic collective primitives)
System Software Integration: Medium/Low (TruCluster)
ASCI Thunder, 1,024 Nodes, 23 TF/s peak
ASCI Thunder: Lawrence Livermore National Laboratory

• 1,024 nodes, 4,096 processors, 23 TF/s
• #2 in the Top 500
ASCI Thunder: Configuration

1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM (8 Terabytes total)
2.5 µs / 912 MB/s MPI latency and bandwidth over Quadrics Elan4
Barrier synchronization 6 µs, allreduce 15 µs
75 TB of local disk (one 73-GB UltraSCSI320 drive per node)
Lustre file system with 6.4 GB/s delivered parallel I/O performance
Linux RH 3.0, SLURM, Chaos
CHAOS: Clustered High Availability Operating System

Derived from Red Hat, but differs in the following areas:
Modified kernel (Lustre and hardware specific)
New packages for cluster monitoring, system installation, power/console management
SLURM, an open-source resource manager
ASCI Thunder: Overview

Node Integration: Medium/Low (network interface on I/O bus)
Network Integration: Very High (HW support for atomic collective primitives)
System Software Integration: Medium (Chaos)
System X: Virginia Tech
System X, 10.28 TF/s

1,100 dual Apple G5 2-GHz-CPU nodes
8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
Each node has 4 GB of main memory and 160 GB of Serial ATA storage
176 TB total secondary storage
Infiniband, 8 µs and 870 MB/s latency and bandwidth, partial support for collective communication
System-level fault tolerance (Déjà vu)
System X: Overview

Node Integration: Medium/Low (network interface on I/O bus)
Network Integration: Medium (limited support for atomic collective primitives)
System Software Integration: Medium (system-level fault-tolerance)
BlueGene/L System

[Packaging hierarchy diagram:]
Chip (2 processors): 2.8/5.6 GF/s, 4 MB
Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
Node Card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
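Each level of the hierarchy is just a multiple of the one below it; a small sketch (our arithmetic, assuming the 2.8 GF/s per chip and 0.5 GB per compute card shown in the diagram) reproduces the slide's numbers.

```python
# Hypothetical sketch: BlueGene/L packaging arithmetic. Each chip peaks
# at 2.8 GF/s (5.6 GF/s if both cores compute); every packaging level
# multiplies the one below it.

chip_gflops = 2.8                                     # 4 MB EDRAM on chip
card = {"gflops": 2 * chip_gflops, "mem_gb": 0.5}     # 2 chips
node_card = {k: 16 * v for k, v in card.items()}      # 16 compute cards
cabinet = {k: 32 * v for k, v in node_card.items()}   # 32 node boards
system = {k: 64 * v for k, v in cabinet.items()}      # 64 cabinets

print(f"cabinet: {cabinet['gflops']/1e3:.1f} TF/s, {cabinet['mem_gb']:.0f} GB")
print(f"system:  {system['gflops']/1e3:.1f} TF/s, {system['mem_gb']/1024:.0f} TB")
# -> cabinet: 2.9 TF/s, 256 GB
# -> system:  183.5 TF/s, 16 TB  (the slide rounds to 180/360 TF/s)
```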
BlueGene/L Compute ASIC

[Block diagram: two PowerPC 440 cores (one acting as I/O processor), each with a "double FPU" and 32k/32k L1 caches; L2 caches with a snoop path and a multiported shared SRAM buffer; a shared L3 directory for 4 MB of embedded DRAM (EDRAM), usable as L3 cache or memory, with ECC; a DDR controller with ECC driving 144-bit-wide external DDR (256/512 MB); and on-chip Gbit Ethernet, JTAG access, torus (6 out + 6 in, each at 1.4 Gb/s per link), tree (3 out + 3 in, each at 2.8 Gb/s per link), and global interrupt (4 global barriers or interrupts) interfaces.]

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
[BlueGene/L node card photo: 16 compute cards, 2 I/O cards, DC-DC converters (40 V to 1.5/2.5 V).]
BlueGene/L Interconnection Networks

3-Dimensional Torus
  Interconnects all compute nodes (65,536)
  Virtual cut-through hardware routing
  1.4 Gb/s on all 12 node links (2.1 GBytes/s per node)
  350/700 GBytes/s bisection bandwidth
  Communications backbone for computations
Global Tree
  Interconnects all compute and I/O nodes (1024)
  One-to-all broadcast functionality
  Reduction operations functionality
  2.8 Gb/s of bandwidth per link
  Latency of tree traversal on the order of 5 µs
Ethernet
  Incorporated into every node ASIC
  Active in the I/O nodes (1:64)
  All external comm. (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier
  8 single wires crossing whole system, touching all nodes
Control Network (JTAG)
  For booting, checkpointing, error logging
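The 350/700 GB/s bisection figure follows from the torus dimensions; a short sketch (our arithmetic, assuming the bisection cut runs across the 64-long dimension) reproduces it.

```python
# Hypothetical sketch: bisection bandwidth of the 64x32x32 BlueGene/L
# torus. Cutting the machine in half across the longest dimension
# severs a 32x32 plane of links, and the torus wrap-around doubles
# the number of severed links.

X, Y, Z = 64, 32, 32
link_gbps = 1.4                      # per direction, per link

links_cut = 2 * Y * Z                # Y*Z links, twice (wrap-around)
one_way_GBps = links_cut * link_gbps / 8
print(f"{one_way_GBps:.0f} GB/s one way, {2 * one_way_GBps:.0f} GB/s both ways")
# -> 358 GB/s one way, 717 GB/s both ways (the slide rounds to 350/700)
```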
BlueGene/L System Software Organization

Compute nodes dedicated to running user applications, and almost nothing else - simple compute node kernel (CNK)
I/O nodes run Linux and provide O/S services:
  file access
  process launch/termination
  debugging
Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring) - largely transparent to application/system software
Operating Systems

Compute nodes: CNK
  Specialized simple O/S: 5000 lines of code, 40 KBytes in core
  No thread support, no virtual memory
  Protection: protect kernel from application; some net devices in userspace
  File I/O offloaded ("function shipped") to I/O nodes through kernel system calls
  "Boot, start app and then stay out of the way"
I/O nodes: Linux
  2.4.19 kernel (2.6 underway) w/ ramdisk
  NFS/GPFS client
  CIO daemon to start/stop jobs and execute file I/O
Global O/S (CMCS, service node)
  Invisible to user programs
  Global and collective decisions
  Interfaces with external policy modules (e.g., job scheduler)
  Commercial database technology (DB2) stores static and dynamic state
  Partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
  Scalability, robustness, security
  Execution mechanisms in the core; policy decisions in the service node
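To illustrate the function-shipping idea, here is a minimal sketch: a compute node with no local file system forwards each file-I/O system call as a message to a daemon on its I/O node, which performs the real call and returns the result. The socket protocol, the IO_NODE address, and the handler table are hypothetical, not BlueGene/L's actual CIO protocol.

```python
# Hypothetical sketch of "function shipping" file I/O to an I/O node.
import os
import pickle
import socket

IO_NODE = ("io-node-0", 7000)        # hypothetical daemon address

def ship(call: str, *args) -> object:
    """Compute-node side: forward one syscall and await the result."""
    with socket.create_connection(IO_NODE) as s:
        s.sendall(pickle.dumps((call, args)))
        s.shutdown(socket.SHUT_WR)           # mark end of request
        return pickle.loads(s.makefile("rb").read())

def daemon_step(conn: socket.socket) -> None:
    """I/O-node side: execute one shipped call locally, reply with result."""
    call, args = pickle.loads(conn.makefile("rb").read())
    handler = {"open": os.open, "read": os.read,
               "write": os.write, "close": os.close}[call]
    conn.sendall(pickle.dumps(handler(*args)))

# e.g. a shipped write: fd = ship("open", "/out.dat", os.O_WRONLY)
```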
BlueGene/L: Overview

Node Integration: High (processing node integrates processors and network interfaces, network interfaces directly connected to the processors)
Network Integration: High (separate tree network)
System Software Integration: Medium/High (compute kernels are not globally coordinated)

#2 and #4 in the Top500
Cray XD1
Cray XD1 System Architecture

Compute
  12 AMD Opteron 32/64-bit x86 processors
  High-performance Linux
RapidArray Interconnect
  12 communications processors
  1 Tb/s switch fabric
Active Management
  Dedicated processor
Application Acceleration
  6 co-processors
Processors directly connected to the interconnect
Cray XD1 Processing Node

[Chassis diagram. Front: six 2-way SMP blades, 4 fans, six SATA hard drives. Rear: 500 Gb/s crossbar switch, 12-port inter-chassis connector, four independent PCI-X slots, and a connector to a 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector.]
Cray XD1 Compute Blade

[Blade diagram: two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.]
Fast Access to the Interconnect

[Comparison figure: processor (GFLOPS), memory (GigaBytes), interconnect I/O (GigaBytes per second).]

              Xeon server                     Cray XD1
Interconnect  1 GB/s PCI-X, 0.25 GB/s GigE    8 GB/s RapidArray
Memory        5.3 GB/s DDR 333                6.4 GB/s DDR 400
Communications Optimizations

RapidArray Communications Processor
  HT/RA tunnelling with bonding
  Routing with route redundancy
  Reliable transport
  Short message latency optimization
  DMA operations
  System-wide clock synchronization

[Diagram: AMD Opteron 2XX processor linked to the RapidArray communications processor at 3.2 GB/s, with two 2 GB/s RapidArray links into the fabric.]
Active Manager System

Usability: single system command and control
Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing
Active Management Software: synchronized Linux kernels
Cray XD1: Overview

Node Integration: High (direct access from HyperTransport to RapidArray)
Network Integration: Medium/High (HW support for collective communication)
System Software Integration: High (compute kernels are globally coordinated)

Early stage
ASCI Red Storm
Red Storm Architecture

Distributed memory MIMD parallel supercomputer
Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network
108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
~10 TB of DDR memory @ 333 MHz
Red/Black switching: ~1/4, ~1/2, ~1/4
8 service and I/O cabinets on each end (256 processors for each color)
240 TB of disk storage (120 TB per color)
Red Storm Architecture

Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
Partitioned operating system (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
Separate RAS and system management network (Ethernet)
Router-table-based routing in the interconnect
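Router-table-based routing on a mesh is typically dimension-ordered: resolve the X offset first, then Y, then Z. A minimal sketch of that scheme (ours, not Sandia's implementation) follows.

```python
# Hypothetical sketch: dimension-ordered routing on a 3D mesh, the
# classic deterministic scheme that route tables can encode.

def route(src: tuple, dst: tuple) -> list:
    """Hops from src to dst on a 3D mesh, one dimension at a time."""
    hops, cur = [], list(src)
    for dim in range(3):                      # X, then Y, then Z
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            hops.append(tuple(cur))
    return hops

# A route across the 27 x 16 x 24 Red Storm mesh:
print(route((0, 0, 0), (2, 1, 0)))
# -> [(1, 0, 0), (2, 0, 0), (2, 1, 0)]
```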
Red Storm Architecture

[Partition diagram: service nodes (users, /home), the compute partition, file I/O, and net I/O.]
System Layout (27 x 16 x 24 mesh)

[Layout diagram: a normally classified section and a normally unclassified section at the two ends, switchable nodes between them, and disconnect cabinets separating the halves.]
Red Storm System Software

Run-time system
  Logarithmic loader
  Fast, efficient node allocator
  Batch system - PBS
  Libraries - MPI, I/O, Math
File systems being considered include
  PVFS - interim file system
  Lustre - Pathforward support
  Panasas ...
Operating systems
  Linux on service and I/O nodes
  Sandia's LWK (Catamount) on compute nodes
  Linux on RAS nodes
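The "logarithmic loader" name suggests tree-structured distribution of the executable: each node that already holds the binary forwards it onward, so the whole machine is reached in logarithmically many rounds. A minimal sketch of that cost model (ours, not the actual loader) follows.

```python
# Hypothetical sketch: why tree-structured job launch is logarithmic.
# If every node that already has the executable forwards it to one new
# node per round, the informed set doubles each round.

import math

def broadcast_rounds(n_nodes: int) -> int:
    """Rounds needed when every informed node forwards once per round."""
    return math.ceil(math.log2(n_nodes)) if n_nodes > 1 else 0

for n in (64, 1024, 10368):
    print(f"{n:>6} nodes: {broadcast_rounds(n)} rounds")
# -> 10368 nodes need only 14 rounds, vs 10,367 sequential sends
#    from a single loading point.
```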
ASCI Red Storm: Overview

Node Integration: High (direct access from HyperTransport to network through custom network interface chip)
Network Integration: Medium (no support for collective communication)
System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)

Expected to become the most powerful machine in the world (competition permitting)
Overview

              Node         Network      Software
              Integration  Integration  Integration
ASCI Q        Low          High         Medium/Low
ASCI Thunder  Medium/Low   Very High    Medium
System X      Medium/Low   Medium       Medium
BlueGene/L    High         High         Medium/High
Cray XD1      High         Medium/High  High
Red Storm     High         Medium       Medium/High