Deploying HPC Linux Clusters (the LLNL way!)
Robin Goldstone
Presented to ScicomP/SP-XXL
August 10, 2004
UCRL-PRES-206330
This work was performed under the auspices of the U.S. Department of Energy by
University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
UNCLASSIFIED
Background
HPC Linux Cluster Strategy

Purpose of HPC Linux clusters:
- Create low-cost alternatives to vendor-integrated solutions:
  - motivates vendors to cut costs
  - provides the labs with a level of independence from vendor solutions
- Provide affordable capacity solutions for the program today, so that capability systems can be used for capability work.
- Provide a path to next-generation capability, if the technology evolves appropriately for HPC.

LLNL strategy:
- Design s/w and h/w for ease of manageability.
- Leverage the open source software model: augment a base Linux distro with in-house development expertise and vendor partnerships.
- Develop best-of-breed software.
- Employ a multi-tiered software support model: system administrators, on-site developers, vendor partners, open source community.
- Build clusters from (mostly) commodity components; use a "self-maintenance" model for h/w repair.
- Provide users with a feature-rich environment, including:
  - robust, scalable cluster management tools
  - an efficient, scalable resource manager to achieve maximum utilization of resources
  - a parallel environment (MPI, OpenMP)
  - development tools
  - a parallel file system

Deliver world-class HPC systems to our users.
Production Computing Resources (projected to 6/30/04)

Unclassified Network (OCF), total peak 47,847 GFLOP/s:

System    Program  Manufacturer & Model   Operating System  Interconnect   Nodes  CPUs   Memory (GB)  Peak GFLOP/s
Thunder   M&IC     California Digital     CHAOS 2.0         Elan4          1,024  4,096  8,192        22,938
MCR       M&IC     Linux NetworX          CHAOS 1.2         Elan3          1,152  2,304  4,608        11,059
ALC       ASCI     IBM xSeries            CHAOS 1.2         Elan3            960  1,920  3,840         9,216
Frost     ASCI     IBM SP                 AIX 5.1           Colony DS         68  1,088  1,088         1,632
Blue      ASCI     IBM SP                 AIX 5.1           TB3              264  1,056    396           701
TC2K      M&IC     Compaq SC ES40         Tru64 5.1b        Elan3            128    512    280           683
iLX       M&IC     RAND Federal           CHAOS 1.2         N/A               67    134    268           678
PVC       ASCI     Acme Micro             CHAOS 1.2         Elan3             64    128    128           614
GPS       M&IC     Compaq GS320/ES45      Tru64 5.1b        N/A               49    160    344           277
Qbert     M&IC     Digital 8400           Tru64 5.1b        MC 1.5             2     20     24            25
Riptide   ASCI     SGI Onyx2              Irix 6.5.13f      8 IR2 Pipes        1     48     37            24

Classified Network (SCF), total peak 41,123 GFLOP/s:

System            Program  Manufacturer & Model  Operating System  Interconnect   Nodes  CPUs   Memory (GB)  Peak GFLOP/s
White             ASCI     IBM SP                AIX 5.1           Colony DS        512  8,192  8,192        12,288
Lilac (xEDTV)     ASCI     IBM xSeries           CHAOS 1.2         Elan3            768  1,536  3,072         9,186
Violet (pEDTV)    ASCI     IBM p655              AIX               Federation       128  1,024  2,048         6,144
Magenta (pEDTV)   ASCI     IBM p655              AIX               Federation       128  1,024  2,048         6,144
Adelie            ASCI     Pro Micro             CHAOS 1.2         Elan3            128    256    512         1,434
Emperor           ASCI     Pro Micro             CHAOS 1.2         Elan3            128    256    512         1,434
Ace               ASCI     Rackable Systems      CHAOS 1.2         N/A              128    256    512         1,434
S (Blue-Pacific)  ASCI     IBM SP                AIX 5.1           TB3              488  1,952  1,164         1,296
GViz              ASCI     Rackable Systems      CHAOS 1.2         Elan3             64    128    256           717
Ice               ASCI     IBM SP                AIX 5.1           Colony DS         28    448    448           672
SC Cluster        ASCI     Compaq ES40/ES45      Tru64 5.1b        N/A               40    160    384           235
Whitecap          ASCI     SGI Onyx3800          Irix 6.5.13f      4 IR3 Pipes        1     96     96            77
Tidalwave         ASCI     SGI Onyx2             Irix 6.5.13f      16 IR2 Pipes       1     64     24            38
Edgewater         ASCI     SGI Onyx2             Irix 6.5.13f      10 IR2 Pipes       1     40     18            24

Operating system mix: Unclassified: Linux 93%, UNIX 7%. Classified: Linux 35%, UNIX 65%.
Workload mix: Unclassified: capability 71%, capacity 26%, serial 2%, visualization 1%. Classified: capability 30%, capacity 64%, serial 4%, visualization 2%.
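The OS pie-chart percentages above can be checked directly against the peak-GFLOP/s column. A minimal sketch (Python, not from the slides): the system names and numbers are transcribed from the OCF rows of the table, and the grouping of CHAOS systems as "Linux" is an assumption made here for illustration.

```python
# Reproduce the "Unclassified: Linux 93%, UNIX 7%" summary from the table above.
# Data is transcribed from the OCF rows; grouping CHAOS as Linux is an assumption.

OCF_SYSTEMS = [
    # (system, operating system, peak GFLOP/s)
    ("Thunder", "CHAOS 2.0",    22938),
    ("MCR",     "CHAOS 1.2",    11059),
    ("ALC",     "CHAOS 1.2",     9216),
    ("Frost",   "AIX 5.1",       1632),
    ("Blue",    "AIX 5.1",        701),
    ("TC2K",    "Tru64 5.1b",     683),
    ("iLX",     "CHAOS 1.2",      678),
    ("PVC",     "CHAOS 1.2",      614),
    ("GPS",     "Tru64 5.1b",     277),
    ("Qbert",   "Tru64 5.1b",      25),
    ("Riptide", "Irix 6.5.13f",    24),
]

def linux_share(systems):
    """Fraction of peak FLOP/s delivered by Linux (CHAOS) systems."""
    total = sum(peak for _, _, peak in systems)
    linux = sum(peak for _, osname, peak in systems if osname.startswith("CHAOS"))
    return linux / total

print(f"OCF Linux share of peak: {linux_share(OCF_SYSTEMS):.0%}")  # prints 93%
```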
Livermore Model
Term coined during the vector-to-MPP transition (~1992, Meiko CS-2).
Assure code teams that porting effort
would not have to be repeated for
every MPP platform.
Common application programming
environment on all LC clusters.
Allows complex scientific apps to be
ported across generations of
hardware architectures and different
operating systems with minimal
disruption.
MPI for parallelism between nodes.
OpenMP for SMP parallelism.
Parallel filesystem
POSIX facilities
Compilers/debuggers – not
necessarily the same ones, but best of
breed wherever possible.
HPC Linux clusters must implement the Livermore Model! (See the sketch below.)
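The portability promise of the Livermore Model is easiest to see in a tiny MPI program. The sketch below is illustrative only and assumes the mpi4py package; the production codes of this era were C/C++/Fortran, but the point is the same: a program written to the MPI standard (with OpenMP-style threading inside each task) runs unchanged across LC clusters and hardware generations.

```python
# Illustrative only: a minimal MPI program in the spirit of the Livermore Model.
# Assumes mpi4py is installed; run under the resource manager, e.g. with srun.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # one MPI task per process, spread across the nodes
size = comm.Get_size()   # total number of tasks in the job

# Each task reports where it landed; SMP (OpenMP-style) threading would happen
# inside each task and is not shown here.
print(f"task {rank} of {size} running on {MPI.Get_processor_name()}")
```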
Decision-making process
Principal Investigator for HPC Platforms
“Visionary” who tracks HPC trends and innovations
Understands scientific apps and customer requirements
Very deep and broad understanding of computer architecture
IO Testbed
Evaluate new technologies: nodes, interconnects, storage systems
Explore integration with existing infrastructure
Develop required support in Linux OS stack
RFP process
In response to a specific programmatic need for a new computing resource:
- determine the best architecture
- define the requirements (performance characteristics, manageability aspects, MTBF, etc.)
- solicit bids
Best-value procurements, or sole source where required.
Bidders are typically required to execute a benchmark suite on the proposed hardware and submit results.
Integration -> acceptance -> deployment
Integration requires close coordination with LLNL staff since our OS stack is used.
Often the integrator is not familiar with all of the HPC components (interconnect, etc.), so LLNL must act as the "prime contractor" to coordinate the integration of cluster nodes, interconnect, and storage systems.
Acceptance testing is critical! Does the system function/perform as expected?
For LLNL’s large Linux clusters, time from contract award to production status has been ~8 months.
Software Environment
CHAOS Overview

What is CHAOS?
- Clustered High Availability Operating System
- Internal LC HPC Linux distro
- Currently derived from Red Hat Enterprise Linux
- Scalability target: 4K nodes

The CHAOS "Framework"
- Configuration management
- Release discipline
- Integration testing
- Bug tracking

What is added?
- Kernel modifications for HPC requirements
- Cluster system administration tools
- MPI environment
- Parallel filesystem (Lustre)
- Resource manager (SLURM)

Limitations
- Assumes LC services structure
- Assumes LC sys admin culture
- Tools supported by DEG are not included in the CHAOS distro
- Only supports LC production h/w
- Assumes cluster architecture
CHAOS Target Environment
CHAOS Content
Kernel
Based on Red Hat kernel
Add: QsNet modules plus coproc, ptrack,
misc core kernel changes
Update: drivers for e1000, qla2300,
nvidia, digi, etc.
Add: Lustre modules plus VFS intents,
read-only device support, zero-copy
sendpage/recvpackets, etc.
Add: increase kernel stack size, reliable
panic on overflow.
Update: netdump for cluster scalability
Update: IA-64 netdump, machine check
architecture fixes, trace unaligned access
traps, spinlock debugging, etc.
Update: ECC module support for i860,
E75XX
Add: implement totalview ptrace
semantics.
Add: p4therm
Update: various NFS fixes.
Cluster Admin/Monitoring Tools
pdsh
YACI
ConMan/PowerMan
HM
Genders
MUNGE/mrsh
Firmware update tools
lmsensors
CMOS config
FreeIPMI
Ganglia/whatsup
SLURM resource manager
Lustre parallel filesystem
QsNet MPI environment
Releases

CHAOS 2.0 (GA: June 2004)
- Red Hat Enterprise Linux 3
- 2.4.21 kernel with NPTL
- IA64
- QsNet II
- IPMI 1.5
- MCORE Netdump
- Lustre fully integrated

CHAOS 3.0 (GA: ~Feb 2005)
- Red Hat Enterprise Linux 4
- 2.6 kernel
- X86_64
- OpenIB
- IPMI 2.0
CHAOS Software Life Cycle

Target is a six-month release cycle.
Loosely synchronized with Red Hat releases, but also dependent on availability of key third-party tools (e.g. Intel compilers, TotalView debugger).
Also driven by the requirement to support new hardware:
- PCR (Fall 2001): ia32, i860 chipset, RDRAM memory
- MCR (Fall 2002): E7500 chipset, federated Quadrics switch, LinuxBIOS
- ALC (Late 2002): Serverworks GC-LE chipset, IBM Service Processor
- Thunder (Late 2003): IA64, Intel 8870, IPMI

Automated test framework (TET)
- Subset of SWL apps
- Regression tests

Staged rollout: small -> large, unclassified -> classified
- testbed systems: dev, mdev, adev, tdev
- small production systems: pengra, ilx, PVC, GViz
- large production systems: PCR, Lilac, MCR, ALC, Thunder
CHAOS Software Support Model
System administrators perform first level problem determination.
If a software defect is suspected, problem is reported to CHAOS
development team.
Bugs are entered and tracked in CHAOS GNATS database.
Fixes are pursued as appropriate:
in open source community
through RedHat
through vendor partners (Quadrics, CFS, etc.)
locally by CHAOS developers
Bug fix is incorporated into CHAOS CVS source tree and new
RPM(s) built and tested on testbed systems.
Changes are rolled out in next CHAOS release or sooner depending
on severity.
SLURM in a Nutshell

(Diagram) Users submit work; an external scheduler manages the queue (Job 1 through Job 4); SLURM allocates the nodes (Node 0 through Node 7), then starts and manages the jobs.
SLURM Architecture
SLURM Job Startup
(Diagram) Example job launch: /bin/hostname, 2 tasks per node. An equivalent srun invocation is sketched below.
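A hedged sketch of what this slide depicts: launching /bin/hostname with two tasks per node through srun. It is wrapped in Python only to keep the document's examples in one language; in practice this is a one-line shell command, and exact option spellings have varied across SLURM releases, so treat the flags as illustrative (-N for node count, -n for total task count).

```python
# Launch /bin/hostname with 2 tasks per node via SLURM's srun (illustrative).
import subprocess

nodes = 2
tasks = 2 * nodes  # "2 tasks per node" as in the slide
cmd = ["srun", "-N", str(nodes), "-n", str(tasks), "/bin/hostname"]

# srun allocates the nodes (or runs inside an existing allocation), starts the
# tasks, and forwards their stdout: one hostname line per task.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```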
SLURM Plans
Support more configurations
LLNL: Kerberos, AIX, IBM Federation switch, Blue Gene/L
Collaborators: Myrinet, InfiniBand, LAM/MPI, MPICH, Maui…
Job checkpoint/restart
Job preempt/resume
Manage consumable resources both system-wide (e.g. licenses) and
within shared nodes (memory, disk space, etc.)
LLNL Interests in Lustre
LustreLite
a cluster file system for:
a single Linux cluster:
multiple clusters that share a file system:
a “Lustre Farm” file system
DOE TriLabs PathForward Project:
MCR, ALC, LILAC
MCR+PVC
full-functionality Lustre, 3-year contract
scaling to 10,000 client nodes, bells, whistles, more “commercial” features
platform for a “enterprise wide” shared file system
Storage industry “buy in.”
Clusters Using Lustre

Cluster   Arch  Nodes  CPUs   Mem/node  OST nodes / GW nodes  Lustre disk space
MCR       ia32  1,152  2,304  4 GB      16 / 16 GW            84TB+ ~89TB
PVC       ia32     70    140  2 GB      4 GW                  shares MCR file systems
ALC       ia32    960  1,920  4 GB      32 GW                 67TB+ 7.9TB
THUNDER   ia64  1,024  4,096  8 GB      16 GW                 ~200TB
LILAC     ia32    768  1,536  4 GB      32                    50TB+ ~90TB
The InterGalactic Filesystem

(Diagram) A site-wide Lustre configuration spanning buildings B439, B451, and B113, tied together by federated Gb-Enet switches and the LLNL external backbone (MM fiber 1 GigE, SM fiber 1 GigE, and copper 1 GigE links):
- MCR (1,152-port QsNet Elan3, 1,116 dual-P4 compute nodes) and ALC (960-port QsNet Elan3, 924 dual-P4 compute nodes): each with 2 login nodes with 4 Gb-Enet and 32 gateway nodes delivering 190 MB/s of Lustre I/O apiece over 2x1GbE.
- PVC (128-port QsNet Elan3, 52 dual-P4 render nodes, 6 dual-P4 display nodes) and Thunder: connected through their own gateway and metadata server nodes.
- 2 metadata servers (fail-over pair) and aggregated OSTs provide a single Lustre file system; archive access is via PFTP.
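A quick back-of-the-envelope check (derived here, not quoted from the slides) of the gateway figures in the diagram: 190 MB/s of delivered Lustre I/O per gateway over 2x1GbE, and the aggregate bandwidth that 32 such gateways imply.

```python
# Sanity-check the gateway numbers in the diagram above (32 gateways at
# 190 MB/s of delivered Lustre I/O over 2x1GbE each). The efficiency and
# aggregate figures are computed here for illustration.

GATEWAYS = 32
DELIVERED_MB_S = 190      # per gateway, from the diagram
LINK_MB_S = 2 * 125       # 2 x 1 GbE, roughly 125 MB/s of payload per link

per_gw_efficiency = DELIVERED_MB_S / LINK_MB_S   # about 0.76
aggregate_mb_s = GATEWAYS * DELIVERED_MB_S       # about 6,080 MB/s

print(f"per-gateway efficiency ~{per_gw_efficiency:.0%}")
print(f"aggregate Lustre bandwidth per cluster ~{aggregate_mb_s / 1000:.1f} GB/s")
```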
Lustre Issues and Priorities
Reliability!!!!
Manageability of “Lustre Farm” configurations:
first and foremost
OST “pool” notion
management tools (config, control, monitor, fault-determination)
Security of “Lustre Farm” TCP-connected operations.
Better/quicker fault determination in multi-networked environment.
Better recovery/operational robustness in the face of failures.
Improved performance.
File Per Process IOR (SC=2, SS=64K, TS=128K)

(Chart) Average throughput (MB/s, 0 to 2,500) for write rate and read rate vs. number of clients (0 to 800).
Single File IOR (SC=64, SS=64K, Transfer Size=4M)

(Chart) Average throughput (MB/s, 0 to 3,000) for write rate and read rate vs. number of clients (0 to 900).
Development Environment

Compilers
- Intel compilers (ifort, icc) through v8
- PGI compilers (pgcc, pgCC, pgf90, pghpf) through 5.1
- gcc 2.96, 3.2.3, 3.3.2
- glibc 2.2.5, 2.3.2

Libraries
- Intel MKL, currently 6.1
- Other libs (ScaLAPACK, pact, etc.) are the responsibility of the apps teams

Interpreters
- Java: j2re1.4.1_03
- Python 2.2

Debuggers
- TotalView 6 (port with ASC enhancements)
- gdb, ddd, idb, pgdb
- ddt (under evaluation)

Memory debuggers
- Valgrind
- dmalloc, efence, Purify for gcc

Profiling
- mpiP
- gprof, pgprof
- Vampir/Guideview
- PAPI
- In-house HWmon tool, layered on perfctr
- TAU (experimental)
- VTune coming (not yet available for RHEL 3)
CSM vs. CHAOS

Function                   CSM method       CHAOS method
Cluster install/update     NIM              YACI
Cluster security           CtSec            munge
Topology services          RSCT (hats)      ganglia
Remote command execution   dsh              pdsh
Node groups                nodegrp          genders
Configuration management   CFM              genderized rdist
Node monitoring            SMC              SNMP
Cluster monitoring         WebSM            HM
Console management         HMC + conserver  conman
Remote power management    HMC + rpower     powerman
Resource management        LoadLeveler      SLURM
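As a concrete illustration of two of the CHAOS-side tools in the table above (pdsh for remote command execution, genders attributes for node grouping), here is a minimal sketch. It is wrapped in Python only for consistency with the document's other examples; interactively these are plain shell commands, and the genders group name "login" and the node list are hypothetical.

```python
# Fan commands out across cluster nodes with pdsh; nodes can be named
# explicitly (-w, hostlist syntax) or selected by genders attribute (-g,
# requires pdsh built with the genders module).
import subprocess

def pdsh(target_flag, target, command):
    """Run `command` on the selected nodes and return the combined output."""
    return subprocess.run(["pdsh", target_flag, target, command],
                          capture_output=True, text=True).stdout

# Explicit node list (hypothetical node names)...
print(pdsh("-w", "mcr[1-4]", "uname -r"))
# ...or every node carrying a genders attribute (hypothetical group "login").
print(pdsh("-g", "login", "uptime"))
```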
Linux Development Staffing
2 FTE kernel
3 FTE cluster tools + misc
2 FTE SLURM
4 FTE Lustre
1 FTE on-site Red Hat analyst
Hardware Environment
Typical Linux Cluster Architecture

(Diagram) 1,112 P4 compute nodes on a 1,152-port QsNet Elan3 interconnect; login and management nodes on 100BaseT control and management networks; 6 login nodes with 4 Gb-Enet; MDS and gateway (GW) nodes attached to OSTs through a federated GbEnet switch; serial port concentrators (SPC) for console access; remote power controllers (RPC) for power control.
Hardware component choices
Motherboard / chipset: Is chipset specification open?
Processors – Influenced by kernel/distro support, availability of compilers,
tools…
Memory – ECC mandatory, chipkill = $$
Hard drives – IDE or SCSI, RAID, mirroring, diskless?
Node form factor – influenced primarily by # PCI/AGP slots needed. What
about blades?
Remote power mgmt: external plug control vs. service processor
Remote console management: terminal servers vs. Serial Over LAN (IPMI)
Even slight variations in hardware configuration can create a
significant support burden!
Integration Considerations
Power and cooling
Racking, cabling and labeling
BIOS/CMOS settings
Burn-in
Acceptance testing
Software deliverables: can integrator contribute Open
Source tools for the hardware that they are providing?
Hardware support strategy

Self-maintenance vs. vendor maintenance
Overhead of vendor maintenance:
- turnaround time
- access to classified facilities
- convincing the vendor something is really broken when it passes diags
Overhead of self-maintenance:
- need appropriately skilled staff to perform repairs
- inventory management
- RMA coordination
Hot spare cluster: include it in the cluster purchase!
Spare parts cache
- How big?
- What happens when you run out of spare parts?
Conclusions
Lessons Learned (SW)
A “roll your own” software stack like CHAOS takes significant effort
and in-house expertise. Is this overkill for your needs?
Consider alternatives like OSCAR or NPACI Rocks for small (< 128
node) clusters with common interconnects (Myrinet or GigE).
Use LLNL cluster tools to augment or replace deficient components of
these stacks as needed.
Avoid proprietary vendor cluster stacks which will tie you to one
vendor’s hardware. A vendor-neutral Open Source software stack
enables a common system management toolset across multiple
platforms.
Even with a vendor-supported software stack, a local resource with kernel/systems expertise is a valuable asset for problem isolation and rapid deployment of bug fixes.
Lessons Learned (HW)
Buy hardware from vendors who understand Linux and can guarantee
compatibility.
Who will do integration? The devil is in the details.
BIOS pre-configured for serial console or LinuxBIOS!
Linux-based tools (preferably Open Source) for hardware management functions.
Linux device drivers for all hardware components.
A rack is NOT just a rack, a rail is not just a rail
Not all cat5 cables are created equal
Labeling is critical
Hardware self-maintenance can be a significant burden.
Node repair
Parts cache maintenance
RMA processing
PC hardware in a rack != HPC Linux cluster
LCRM/SLURM Delivers Excellent Capacity/Capability Mix

(Chart) Distribution of job sizes on MCR: % of resources used and % of jobs run (0 to 70%) per node-count bucket (1-64, 65-128, 129-256, 257-512, 513-768, 769-1024, >1024).

The LCRM/SLURM scheduling environment allows optimal use of resources.
Over 40% of the cycles go to jobs requiring over 50% of the machine.
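A minimal sketch (Python, not LCRM/SLURM code) of how the two series in the chart are computed: "% of jobs run" counts jobs per node-count bucket, while "% of resources used" weights each job by the node-hours it consumed. The job records in the example are made-up placeholders.

```python
# Compute per-bucket "% of jobs run" and "% of resources used" from job records.
from collections import defaultdict

BUCKETS = [(1, 64), (65, 128), (129, 256), (257, 512),
           (513, 768), (769, 1024), (1025, 10**9)]

def bucket(nodes):
    """Map a job's node count to its chart bucket label."""
    for lo, hi in BUCKETS:
        if lo <= nodes <= hi:
            return f"{lo}-{hi}" if hi < 10**9 else ">1024"

def distribution(jobs):
    """jobs: iterable of (nodes, hours). Returns {bucket: (% of jobs, % of node-hours)}."""
    count, usage = defaultdict(int), defaultdict(float)
    for nodes, hours in jobs:
        b = bucket(nodes)
        count[b] += 1
        usage[b] += nodes * hours            # resources = node-hours consumed
    njobs, nhours = sum(count.values()), sum(usage.values())
    return {b: (100 * count[b] / njobs, 100 * usage[b] / nhours) for b in count}

# Placeholder data: many small jobs, a few very large ones dominate the cycles.
print(distribution([(32, 4), (64, 2), (600, 12), (1100, 8)]))
```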
Do Our Users Like It? Yes They Do!

(Chart) MCR node utilization by day, 12-Aug through 8-Sep (0 to 100%). Annotations: 80% is best of class; 90% is beyond saturated; scheduled debug shots impact utilization; Limited Availability status reached on 8/13.

MCR is more heavily used than ASCI White!