The IBM Blue Gene Program


Parallel Scalable Operating Systems
Presented by Dr. Florin Isaila
Universidad Carlos III de Madrid
Visiting Scholar, Argonne National Lab

Contents

- Preliminaries
  - Top500
  - Scalability
- Blue Gene
  - History
  - BG/L, BG/C, BG/P, BG/Q
  - Scalable OS for BG/L
  - Scalable file systems
  - BG/P at Argonne National Lab
- Conclusions

Top500
Generalities

- Published since 1993, twice a year: June and November
- Ranking of the most powerful computing systems in the world
- Ranking criterion: performance on the LINPACK benchmark
- Driven by Jack Dongarra
- Web site: www.top500.org

HPL: High-Performance Linpack

- Solves a dense system of linear equations
  - Variant of LU factorization for matrices of size N
- Measures a computer's floating-point rate of execution
- Computation is done in 64-bit floating-point arithmetic
- Rpeak: theoretical system performance
  - Upper bound for the real performance (in MFLOPS)
  - Ex: Intel Itanium 2 at 1.5 GHz, 4 FLOP/cycle -> 6 GFLOPS (see the arithmetic sketch below)
- Nmax: obtained by varying N and choosing the size that gives the maximum performance
- Rmax: maximum real performance, achieved for Nmax
- N1/2: problem size needed to achieve ½ of Rmax

(Jack Dongarra's slide)

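The Rpeak arithmetic in the Itanium 2 example can be checked directly; the small C sketch below is added for illustration and is not part of the original slides (the variable names are mine):

    /* Rpeak = clock frequency x floating-point operations per cycle.
       Reproduces the example: 1.5 GHz x 4 FLOP/cycle = 6 GFLOPS. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz = 1.5;       /* clock frequency in GHz */
        double flop_per_cycle = 4.0;  /* FP operations completed per cycle */
        printf("Rpeak = %.1f GFLOPS per processor\n", clock_ghz * flop_per_cycle);
        return 0;
    }
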
Amdahl's law

- Suppose a fraction f of your application is not parallelizable
- The remaining fraction 1-f is parallelizable on p processors

Speedup(p) = T1/Tp ≤ T1 / (f·T1 + (1-f)·T1/p) = 1 / (f + (1-f)/p) ≤ 1/f

Amdahl's Law (for 1024 processors)

[Figure: speedup (0 to 1024) as a function of the serial fraction s (0 to 0.04)]

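A small C sketch (added for illustration, not from the slides) that evaluates the Amdahl bound for p = 1024 at the serial fractions shown on the plot's x-axis:

    /* Amdahl's law: speedup(p) = 1 / (f + (1 - f)/p), here for p = 1024. */
    #include <stdio.h>

    int main(void) {
        const double p = 1024.0;
        const double f[] = {0.0, 0.01, 0.02, 0.03, 0.04};  /* serial fractions */
        for (int i = 0; i < 5; i++)
            printf("f = %.2f -> speedup = %.1f\n",
                   f[i], 1.0 / (f[i] + (1.0 - f[i]) / p));
        return 0;
    }

Even a 1% serial fraction caps the speedup at roughly 91 on 1024 processors, which is the steep drop visible in the plot.
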
Load Balance

Speedup ≤ Sequential Work / Max Work on any Processor

- Work: data access, computation
- Not just equal work: processors must also be busy at the same time
- Ex: Speedup ≤ 1000/400 = 2.5 (worked through in the sketch below)

[Figure: work per processor (Sequential: 1000; Processor 1: 200; Processor 2: 100; Processor 3: 400; Processor 4: 300)]

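A minimal C sketch (not from the slides) of the load-balance bound, using the per-processor work values from the figure:

    /* Speedup <= sequential work / max work on any processor. */
    #include <stdio.h>

    int main(void) {
        double sequential = 1000.0;
        double work[4] = {200.0, 100.0, 400.0, 300.0};  /* processors 1..4 */
        double max_work = work[0];
        for (int i = 1; i < 4; i++)
            if (work[i] > max_work)
                max_work = work[i];
        printf("speedup bound = %.1f\n", sequential / max_work);  /* 1000/400 = 2.5 */
        return 0;
    }
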
Communication and synchronization

Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

- Communication is expensive!
  - Measure: communication-to-computation ratio
- Inherent communication
  - Determined by the assignment of tasks to processes
  - Actual communication may be larger (artifactual)
- One principle: assign tasks that access the same data to the same process

[Figure: execution timeline of Processes 1-3 showing work, communication, synchronization points and synchronization wait time]

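The load-balance bound above can be refined with the synchronization and communication terms from this formula; a C sketch with made-up per-process costs:

    /* Speedup < sequential work / max(work + sync wait + comm cost). */
    #include <stdio.h>

    int main(void) {
        double sequential = 1000.0;
        /* hypothetical costs for processes 1..3 */
        double work[3] = {350.0, 300.0, 250.0};
        double sync[3] = { 20.0,  60.0,  90.0};
        double comm[3] = { 30.0,  40.0,  20.0};
        double max_total = 0.0;
        for (int i = 0; i < 3; i++) {
            double total = work[i] + sync[i] + comm[i];
            if (total > max_total)
                max_total = total;
        }
        printf("speedup bound = %.2f\n", sequential / max_total);
        return 0;
    }
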
Blue Gene

Blue Gene partners

- IBM
  - "Blue": the corporate color of IBM
  - "Gene": the intended use of the Blue Gene clusters, computational biology (specifically, protein folding)
- Lawrence Livermore National Lab
- Department of Energy
- Academia

BG History

- December 1999: project created; a supercomputer for biomolecular phenomena
- 29 September 2004: the Blue Gene/L prototype overtook NEC's Earth Simulator (36.01 TFlops, 8 cabinets)
- November 2004: Blue Gene/L reaches 70.82 TFlops (16 cabinets)
- 24 March 2005: Blue Gene/L broke its own record, reaching 135.5 TFlops (32 cabinets)
- June 2005: Blue Gene systems at several sites worldwide took 5 of the top 10 positions (top500.org)
- 27 October 2006: 280.6 TFlops (65,536 compute nodes and an additional 1,024 I/O nodes in 64 air-cooled cabinets)
- November 2007: 478.2 TFlops

Family

- BG/L
- BG/C
- BG/P
- BG/Q

Blue Gene/L

- System (64 racks, 64x32x32): 180/360 TF/s, 32 TB
- Rack (32 node cards): 2.8/5.6 TF/s, 512 GB
- Node card (32 chips, 4x4x2; 16 compute, 0-2 I/O cards): 90/180 GF/s, 16 GB
- Compute card (2 chips, 1x2x1): 5.6/11.2 GF/s, 1.0 GB
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB

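The aggregate numbers follow from the per-chip figure; a C sketch (mine, not from the slides) rebuilding them for coprocessor mode (2.8 GF/s per chip):

    /* Aggregate peak performance at each BG/L packaging level. */
    #include <stdio.h>

    int main(void) {
        double chip = 2.8;                     /* GF/s per chip (coprocessor mode) */
        double compute_card = 2 * chip;        /* 2 chips       -> ~5.6 GF/s */
        double node_card = 16 * compute_card;  /* 16 cards      -> ~90 GF/s  */
        double rack = 32 * node_card;          /* 32 node cards -> ~2.8 TF/s */
        double system = 64 * rack;             /* 64 racks      -> ~180 TF/s */
        printf("node card %.0f GF/s, rack %.2f TF/s, system %.1f TF/s\n",
               node_card, rack / 1000.0, system / 1000.0);
        return 0;
    }

The exact products (2.87 TF/s per rack, 183.5 TF/s per system) are rounded to 2.8 and 180 on the slide.
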
Technical specifications

- 64 cabinets containing 65,536 high-performance compute nodes (chips)
- 1,024 I/O nodes
- 32-bit PowerPC processors
- 5 networks
- The main memory has a total size of 33 terabytes
- Maximum performance of 183.5 TFlops when using one processor for computation and the other for communication, and 367 TFlops when using both for computation

Blue Gene / L

- Networks:
  - 3D torus
  - Collective network
  - Global barrier/interrupt
  - Gigabit Ethernet (I/O & connectivity)
  - Control (system boot, debug, monitoring)

Networks

- Three-dimensional torus
  - Compute nodes
- Global tree
  - Collective communication
  - I/O
- Ethernet
- Control network

[Figure: three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh]

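A C sketch (mine, not from the slides) of the wraparound neighbor rule in such a torus; the 64x32x32 dimensions are those of the full BG/L system:

    /* Each node (x, y, z) has six neighbors: +/-1 in each dimension,
       taken modulo the torus size, so edge nodes wrap around. */
    #include <stdio.h>

    #define DX 64
    #define DY 32
    #define DZ 32

    static void neighbors(int x, int y, int z) {
        printf("neighbors of (%d,%d,%d):\n", x, y, z);
        printf("  x: (%d,%d,%d) and (%d,%d,%d)\n", (x + 1) % DX, y, z, (x - 1 + DX) % DX, y, z);
        printf("  y: (%d,%d,%d) and (%d,%d,%d)\n", x, (y + 1) % DY, z, x, (y - 1 + DY) % DY, z);
        printf("  z: (%d,%d,%d) and (%d,%d,%d)\n", x, y, (z + 1) % DZ, x, y, (z - 1 + DZ) % DZ);
    }

    int main(void) {
        neighbors(0, 0, 0);    /* corner node: all of its links wrap around */
        neighbors(10, 20, 30);
        return 0;
    }
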
Blue Gene / L

- Processor: PowerPC 440 at 700 MHz
  - Slow embedded core at a clock speed of 700 MHz
  - Low power allows dense packaging
  - 32 KB L1 cache
  - L2 is a small prefetch buffer
  - 4 MB embedded DRAM L3 cache
- External memory: 512 MB SDRAM per node / 1 GB

BG/L compute ASIC (PowerPC 440 core)

- Non-cache-coherent L1
- Prefetch buffer (L2)
- Shared 4 MB DRAM (L3)
- Interface to external DRAM
- 5 network interfaces
  - Torus, collective, global barrier, Ethernet, control

[Figure: block diagram of the BG/L compute ASIC]

Blue Gene / L

- Compute nodes:
  - Dual processor, 1,024 per rack
- I/O nodes:
  - Dual processor, 16-128 per rack

Blue Gene / L

- Compute nodes:
  - Proprietary kernel (tailored to the processor design)
- I/O nodes:
  - Embedded Linux
- Front-end and service nodes:
  - SUSE SLES 9 Linux (familiar to users)

Blue Gene / L

- Performance:
  - Peak performance per rack: 5.73 TFlops
  - Linpack performance per rack: 4.71 TFlops

Blue Gene / C

- a.k.a. Cyclops64
- Massively parallel (the first supercomputer on a chip)
- Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch
- Theoretical peak performance (per chip): 80 GFlops

Blue Gene / C

- Cellular architecture
- 64-bit Cyclops64 chip:
  - 500 MHz
  - 80 processors (each with 2 thread units and an FP unit)
- Software
  - Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.

Blue Gene / C

[Picture of BG/C]

- Performance:
  - Board: 320 GFlops
  - Rack: 15.76 TFlops
  - System: 1.1 PFlops

Blue Gene / P

- Similar architecture to BG/L, but:
  - Cache-coherent L1 cache
  - 4 cores per node
  - 10 Gbit Ethernet external I/O infrastructure
  - Scales up to 3 PFLOPS
  - More energy efficient
  - 167 TF/s by 2007, 1 PF by 2008

Blue Gene / Q

- Continuation of Blue Gene/L and /P
- Targeting 10 PF/s by 2010/2011
- Higher frequency at similar performance/watt
- Similar number of nodes
- Many more cores
- More generally useful
- Aggressive compiler
- New network: scalable and cheap

Motivation for a scalable OS

- Blue Gene/L is currently the world's fastest and most scalable supercomputer
- Several system components contribute to that scalability
- The operating systems for the different nodes of Blue Gene/L are among the components responsible for that scalability
- The OS overhead on one node affects the scalability of the whole system
- Goal: design a scalable solution for the OS

High level view of BG/L

- Principle: the structure of the software should reflect the structure of the hardware.

BG/L Partitioning

- Space-sharing
- Divided along natural boundaries into partitions
- Each partition can run only one job
- Each node can be in one of these modes:
  - Coprocessor: one processor assists the other
  - Virtual node: two separate processors, each with its own memory space

OS

- Compute nodes: dedicated OS
- I/O nodes: dedicated OS
- Service nodes: conventional off-the-shelf OS
- Front-end nodes: program compilation, debugging, job submission
- File servers: store data, not specific to BG/L

BG/L OS solution

- Components: I/O nodes, service nodes, CNK
- The compute and I/O nodes are organized into logical entities called processing sets, or psets: 1 I/O node + a collection of CNs
  - 8, 16, 64, or 128 CNs
  - Logical concept
  - Should reflect physical proximity => fast communication
- Job: a collection of N compute processes (on CNs); see the MPI sketch after this list
  - Own private address space
  - Message passing
  - MPI: ranks 0 .. N-1

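The MPI view of a job can be made concrete with a minimal, standard MPI program in C (added for illustration, not from the slides), in which each of the N compute processes learns its rank in 0..N-1:

    /* Each compute process runs in its own address space and
       communicates only by message passing. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process: 0 .. N-1 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* N, the number of compute processes */
        printf("compute process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }
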
High level view of BG/L

BG/L OS solution: CNK

- Compute nodes run only compute processes, and all the compute nodes of a particular partition execute in one of two modes:
  - Coprocessor mode
  - Virtual node mode
- Compute Node Kernel (CNK): a simple OS
  - Creates the address space(s)
  - Loads code and initializes data
  - Transfers processor control to the loaded executable

CNK

- Consumes 1 MB of memory
- Creates either
  - One address space of 511/1023 MB, or
  - 2 address spaces of 255/511 MB
- No virtual memory, no paging
  - The entire mapping fits into the TLB of the PowerPC
- Load in push mode: 1 CN reads the executable from the FS and sends it to all the others
- One image loaded, and then it stays out of the way!

CNK

- No OS scheduling (one thread)
- No memory management (no TLB overhead)
- No local file services
- User-level execution until:
  - The process requests a system call
  - A hardware interrupt occurs: timer (requested by the application), abnormal events
- System calls (see the sketch below):
  - Simple: handled locally (getting the time, setting an alarm)
  - Complex: forwarded to the I/O node
  - Unsupported (fork/mmap): error

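A hedged C sketch of this triage; it is not the actual CNK source, and the syscall names and the forwarding stub are hypothetical:

    #include <errno.h>
    #include <stdio.h>

    enum syscall_id { SYS_GETTIME, SYS_SETALARM, SYS_READ, SYS_WRITE, SYS_FORK, SYS_MMAP };

    /* hypothetical stub: ship the request to the CIOD on the pset's I/O node */
    static long forward_to_io_node(enum syscall_id id) {
        printf("forwarding syscall %d to the I/O node\n", id);
        return 0;
    }

    long handle_syscall(enum syscall_id id) {
        switch (id) {
        case SYS_GETTIME:
        case SYS_SETALARM:
            return 0;                       /* simple: handled locally */
        case SYS_READ:
        case SYS_WRITE:
            return forward_to_io_node(id);  /* complex: executed by the I/O node */
        default:
            return -ENOSYS;                 /* unsupported (fork/mmap): error */
        }
    }

    int main(void) {
        printf("fork -> %ld\n", handle_syscall(SYS_FORK));
        printf("read -> %ld\n", handle_syscall(SYS_READ));
        return 0;
    }
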
Benefits of the simple solution

- Robustness: simple design, implementation, test, debugging
- Scalability: no interference among compute nodes
  - Low system noise
  - Performance measurements

I/O node

- Two roles in Blue Gene/L:
  - Act as an effective master of its corresponding pset
  - Offer services requested by the compute nodes in its pset
    - Mainly I/O operations on locally mounted file systems
- Only one processor is used, due to the lack of memory coherency
- Executes an embedded version of the Linux operating system:
  - Does not use any swap space
  - Has an in-memory root file system
  - Uses little memory
  - Lacks the majority of Linux daemons

I/O node

- Complete TCP/IP stack
- Supported file systems: NFS, GPFS, Lustre, PVFS
- Main process: Control and I/O Daemon (CIOD)

Launch a job

- The job manager sends the request to the service node
- The service node contacts the CIOD
- The CIOD sends the executable to all processes in the pset

System calls

Service nodes

- Run the Blue Gene/L control system
- Tight integration with CNs and I/O nodes
- CNs and I/O nodes are stateless, with no persistent memory
- Responsible for operating and monitoring the CNs and I/O nodes:
  - Creates system partitions and isolates them
  - Computes the network routing for the torus, collective and global interrupt networks
  - Loads the OS code onto the CNs and I/O nodes

Problems

- Not fully POSIX compliant
- Many applications need:
  - Process/thread creation
  - Full server sockets
  - Shared memory segments
  - Memory-mapped files

File systems for BG systems

- Need for scalable file systems: NFS is not a solution
- Most supercomputers and clusters in the Top500 use one of these parallel file systems:
  - GPFS
  - Lustre
  - PVFS2
- GPFS/PVFS/Lustre mounted on the I/O nodes
- File system servers

File systems for BG systems

- All these parallel file systems stripe files round-robin over several file system servers (see the sketch after this list)
- They allow concurrent access to files
- FS calls are forwarded from the compute nodes to the I/O nodes
- The I/O nodes execute the calls on behalf of the compute nodes and forward back the result
- Problem: data travels over different networks for each FS call (caching becomes critical)

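A small C sketch (mine, with made-up parameters) of round-robin striping: mapping a file offset to a server and to an offset within that server's portion of the file:

    #include <stdio.h>

    int main(void) {
        const long stripe_size = 64 * 1024;  /* bytes per stripe unit (assumed) */
        const int num_servers = 4;           /* number of FS servers (assumed)  */
        long offsets[] = {0, 70000, 300000};

        for (int i = 0; i < 3; i++) {
            long off = offsets[i];
            long stripe = off / stripe_size;            /* which stripe unit  */
            int server = (int)(stripe % num_servers);   /* round-robin server */
            long local = (stripe / num_servers) * stripe_size + off % stripe_size;
            printf("file offset %ld -> server %d, local offset %ld\n", off, server, local);
        }
        return 0;
    }
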
Summary

- The OS solution for Blue Gene adopts a software architecture reflecting the hardware architecture of the system.
- The result of this approach is a lightweight kernel for the compute nodes (CNK) and a port of Linux that implements the file system and TCP/IP functionality for the I/O nodes.
- This separation of responsibilities leads to an OS solution that is:
  - Simple
  - Robust
  - High-performing
  - Scalable
  - Extensible
- Problem: limited applicability
- Scalable parallel file systems

Further reading

- "Designing a highly-scalable operating system: the Blue Gene/L story." Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.
- www.research.ibm.com/bluegene/