BeowulfSysMgmt4 - Computer Science at SUNY Potsdam
IBM Systems and Technology Group
System Management Considerations
for Beowulf Clusters
Bruce Potter
Lead Architect, Cluster Systems Management
IBM Corporation, Poughkeepsie NY
[email protected]
© 2006 IBM Corporation
Beowulf Clusters
Beowulf is the earliest surviving epic poem written in Old
English. It is the story of a hero of great strength and
courage who defeated a monster called Grendel.
Beowulf Clusters are scalable performance clusters
based on commodity hardware, on a private system
network, with open source software (Linux) infrastructure.
The designer can improve performance proportionally with
added machines.
http://www.beowulf.org
Beowulf Clusters are Used in a Broad Spectrum of Markets
– Life Sciences: research, drug discovery, diagnostics, information-based medicine
– Business Intelligence: data warehousing and data mining, CRM
– Digital Media: digital content creation, management and distribution; game server farms
– Petroleum: oil and gas exploration and production, seismic and reservoir analysis
– Industrial/Product Lifecycle Management: CAE, EDA, CAD/PDM for electronics, automotive, and aerospace
– Financial Services: optimizing IT infrastructure, risk management and compliance, financial market modeling, insurance/actuary analysis
– Government & Higher Education: scientific research, classified/defense, weather/environmental sciences
IBM Deep Computing
What Drives HPC? --- “The Need for Speed…”
Computational Needs of Technical, Scientific, Digital Media and Business Applications
Approach or Exceed the Petaflop/s Range
– CFD wing simulation: 512x64x256 grid (8.3 x 10^6 mesh points), 5,000 FLOPs per mesh point, 5,000 time steps/cycles = 2.15 x 10^14 FLOPs
– CFD full plane simulation: 512x64x256 grid (3.5 x 10^17 mesh points), 5,000 FLOPs per mesh point, 5,000 time steps/cycles = 8.7 x 10^24 FLOPs
(Source: A. Jameson, et al.)
– Materials Science (Source: D. Bailey, NERSC)
• Magnetic materials – current: 2,000 atoms, 2.64 TF/s, 512 GB; future: HDD simulation, 30 TF/s, 2 TB
• Electronic structures – current: 300 atoms, 0.5 TF/s, 100 GB; future: 3,000 atoms, 50 TF/s, 2 TB
– Digital movies and special effects: ~1 x 10^14 FLOPs per frame, 50 frames/sec, 90-minute movie = 2.7 x 10^19 FLOPs (~150 days on 2,000 1-GFLOP/s CPUs)
(Source: Pixar)
– Spare parts inventory planning: modeling the optimized deployment of 10,000 part numbers across 100 parts depots requires 2 x 10^14 FLOPs (12 hours on ten 650 MHz CPUs); 2.4 PFlop/s sustained performance (1-hour turnaround time). The industry trend toward rapid, frequent modeling for timely business decision support drives higher sustained performance.
(Source: B. Dietrich, IBM)
The Impact of Machine-Generated Data
– Authored data: created by hand (salary, orders, etc.); historically in databases; low volume (but high value)
– Machine-generated data: from sensors; high volume; not amenable to traditional database architecture
[Chart: machine-generated versus authored data, in gigabytes per US capita per year (log scale, 0.001 to 1,000), 1995 to 2015. Series include machine-generated storage, all medical imaging, medical data stored, surveillance for urban areas, personal multimedia, data in databases, authored data, static Web data, and text data.]
GPFS on ASC Purple/C Supercomputer
1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory
2 PB GPFS file system (one mount point)
500 RAID controller pairs, 11,000 disk drives
126 GB/s parallel I/O measured to a single file (134 GB/s to multiple files)
IDE: Programming Model Specific Editor
Scheduling and Resource Management Concepts
– Resource manager: launches jobs (serial and parallel) on specific resources at specific times
– Batch scheduler: optimizes use of cluster resources to maximize throughput and comply with organizational policies (a toy sketch follows below)
– Additional capabilities: a) provide a runtime execution environment; b) system utilization and monitoring environment
[Diagram: job submission and control places jobs (Job1-Job4) on a job queue; the scheduler dispatches them onto the compute cluster.]
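To make the division of labor concrete, here is a toy FIFO scheduler in Python: jobs wait on a queue, and whenever enough nodes are free the "resource manager" part launches the job on specific nodes. This is an illustrative sketch only; it is not how LoadLeveler or any other product is implemented, and the node and job names are invented.

```python
#!/usr/bin/env python3
"""Toy FIFO batch scheduler: queue jobs, launch them when nodes are free (sketch)."""
from collections import deque

class ToyCluster:
    def __init__(self, nodes):
        self.free_nodes = set(nodes)     # resource manager's view of idle nodes
        self.queue = deque()             # job queue: (name, nodes_needed)

    def submit(self, name, nodes_needed):
        """Job submission and control: put a job on the queue."""
        self.queue.append((name, nodes_needed))

    def schedule(self):
        """Scheduler policy (FIFO): launch queued jobs while resources allow."""
        while self.queue and self.queue[0][1] <= len(self.free_nodes):
            name, needed = self.queue.popleft()
            alloc = [self.free_nodes.pop() for _ in range(needed)]
            # "Resource manager": launch the job on specific nodes.
            print(f"launch {name} on {sorted(alloc)}")

cluster = ToyCluster([f"node{i}" for i in range(1, 9)])
for job, size in [("Job1", 4), ("Job2", 2), ("Job3", 4), ("Job4", 1)]:
    cluster.submit(job, size)
cluster.schedule()   # Job1 and Job2 start; Job3 and Job4 wait for nodes to free up
```

Real schedulers replace the FIFO policy with backfill, priorities, and organizational policies, but the queue / scheduler / resource-manager split is the same.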
LL Vision: Enterprise Data Centers
Scheduling domain: the enterprise
[Diagram: Clusters A, B, C, and D (/NCSA over the WAN) sharing the SC03 SAN.]
– Customer data center environments are heterogeneous
– Customer data center environments are sometimes scattered over a WAN
– Customers want meta-scheduling capabilities with automatic load balancing
– Facility-wide file sharing is now becoming possible, making scheduling across the enterprise critical
Wide Area File Sharing: GPFS
Scientists and researchers use geographically distributed compute, data, and other scientific instruments and resources on the TeraGrid. Efficient use of these distributed resources is often hindered by the manual process of checking for available storage at various sites, copying files, and then managing multiple copies of datasets. If a scientist's files and datasets were automatically available at each geographically distributed resource, the scientist's ability to perform research efficiently would be increased. A scientist could schedule resources at different sites to process the globally available dataset more quickly.

With 64 dual-processor IA-64 IBM eServers and 500 TB of IBM FAStT100 storage, we created a General Parallel File System (GPFS) accessible over the dedicated 40 Gb/s wide area network (WAN) backbone of the TeraGrid. The WAN GPFS is mounted at the TeraGrid partner sites in addition to any local parallel file systems. To determine ownership of files between sites with different UID spaces, a user's Globus X.509 certificate identity is used.

Several applications, including the Bioinformatics Research Network (BIRN), the Southern California Earthquake Center (SCEC), the National Virtual Observatory (NVO), and ENZO, the cosmological simulation code, use WAN GPFS as part of a pilot project.

[Chart: GPFS-WAN read performance, MB/sec (0-7,000), versus node count (1-192), for SDSC, ANL, and NCSA at 1 and 2 processes per node.]

The upper graph shows measured performance on parallel I/O benchmarks from SDSC, NCSA, and ANL. The SDSC numbers represent the peak achievable bandwidth to the filesystem. The lower graph shows sustained usage of the TeraGrid network generated by GPFS-WAN usage over a 24-hour period.

Special thanks to: Roger Haskin, IBM; Yuri Volobuev, IBM; Puneet Chaudhury, IBM; Daniel McNabb, IBM; Jim Wyllie, IBM; Phil Andrews, SDSC; Patricia Kovatch, SDSC; Brian Battistuz, SDSC; Tim Cockerill, NCSA; Ruth Ayott, NCSA; Dan Lapine, NCSA; Anthony Tong, NCSA; Darrn Adams, NCSA; Stephen Simms, NJ; Chris Raymhauer, Purdue; Dave Carver, TACC.
Point of Contact: Chris Jordan, [email protected]
Apps Contact: Don Frederik, [email protected]
Mapping Applications to Architectures
Application performance is affected by . . .
– Node design (cache, memory bandwidth, memory bus architecture, clock, operations per clock cycle, registers)
– Network architecture (latency, bandwidth)
– I/O architecture (separate I/O network, NFS, SAN, NAS, local disk, diskless)
– System architecture (OS support, network drivers, scheduling software)
A few considerations
– Memory bandwidth, I/O, clock frequency, CPU architecture, cache
– SMP size (1-64 CPUs)
– SMP memory size (1-256 GB)
– Price/performance
The cost of switching to other microprocessor architectures or operating
systems is high, especially in production environments or organizations
with low to average skill levels
Deep Computing customers have application workloads that include
various properties that are both problem and implementation dependent
Platform Positioning – Application Characteristics
[Chart: application workloads positioned by memory bandwidth versus I/O bandwidth requirements, across industries (auto/aero, climate/ocean, media & entertainment, electronics, environment, HCLS, petroleum). Workloads plotted include data analysis/data mining, reservoir simulation, NVH, structural & thermal analysis, selected CFD, structure-based drug design, weather, gene sequencing & assembly, seismic migration & image processing, games, DCC, imaging, general seismic, PLM, bioinformatics, crash, and EDA.]
Power Affects System Size
[Diagram comparing footprint requirements for the same system:]
– Physical size = 14.5 sq ft
– Floor-loading size = 36 sq ft
– Service size = 25 sq ft
– Cooling size = 190 sq ft (@ 150 W/sq ft, 8% floor utilization)
"What matters most to the computer designers at Google is not speed, but power -- low
power, because data centers can consume as much electricity as a city."
Eric Schmidt, CEO Google (Quoted in NY Times, 9/29/02)
Top 500 List
http://www.top500.org/
The #1 position was again claimed by the Blue Gene/L
System, a joint development of IBM and DOE’s National
Nuclear Security Administration (NNSA) and installed at
DOE’s Lawrence Livermore National Laboratory in
Livermore, Calif. It has reached a Linpack benchmark
performance of 280.6 TFlop/s (“teraflops”, or trillions of
calculations per second).
“Even as processor frequencies seem to stall, the
performance improvements of full systems seen at the very
high end of scientific computing shows no sign of slowing
down… the growth of average performance remains stable
and ahead of Moore’s Law.”
IBM Supercomputing Leadership
TOP500 November 2005
[Chart: share of TOP500, TOP100, and TOP10 systems by vendor (IBM, HP, SGI, Cray, Dell, NEC, others); IBM holds 219 of the TOP500, 49 of the TOP100, and 5 of the TOP10.]
Semiannual independent ranking of the top 500 supercomputers in the world
Nov 2005 aggregate performance: IBM 1,214 of 2,300 TFlops
[Chart: share of Nov 2005 aggregate performance – IBM 53%, HP 19%, Other 7%, Cray 6%, SGI 6%, Dell 5%, NEC 2%, Linux Networx 2%, Sun 0%. Source: www.top500.org]
IBM is the clear leader:
• #1 system – DOE/LLNL BlueGene/L (280.6 TF)
• Most entries on the TOP500 list (219)
• Most installed aggregate throughput (over 1,214 TF)
• Most in the TOP10 (5), TOP20 (8), and TOP100 (49)
• Fastest system in Europe (Mare Nostrum)
• Most Linux commodity clusters, with 158 of 360
System Management Challenges for Beowulf Clusters
Scalability/simultaneous operations
– ping of one node: ~3 seconds
– done serially across a 1500-node cluster: 3 x 1500 / 60 = 75 minutes
– Do operations in parallel, but…
– Can run into limitations in:
• # of file descriptors per process
• # of port numbers
• ARP cache
• network bandwidth
– Need a “fan-out” limit on the # of simultaneous parallel operations (see the sketch below)
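As a concrete illustration of the fan-out idea, the sketch below runs a command against every node, but never more than a fixed number at a time. It is a minimal example, not taken from the talk; the node list, the fan-out value, and the choice of a single `ping -c 1` per node are assumptions.

```python
#!/usr/bin/env python3
"""Run a command on many nodes in parallel, with a fan-out limit (sketch)."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

FANOUT = 64                                        # max simultaneous operations (assumed value)
NODES = [f"node{i:04d}" for i in range(1, 1501)]   # hypothetical 1500-node cluster

def check_node(node):
    """Ping one node once; return (node, reachable?)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", node],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return node, result.returncode == 0

# The fan-out limit is simply the thread-pool size: at most FANOUT pings
# (file descriptors, ports, ARP entries) are in flight at any moment.
with ThreadPoolExecutor(max_workers=FANOUT) as pool:
    for node, alive in pool.map(check_node, NODES):
        if not alive:
            print(f"{node}: unreachable")
```

With a 3-second ping and a fan-out of 64, the 75-minute serial sweep drops to roughly 3 x 1500 / 64 ≈ 70 seconds, while keeping resource use per management server bounded.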
System Management Challenges…
Lights out machine room
– Location, security, ergonomics
– Out-of-band hardware control and console
– Automatic discovery of hardware
– IPMI – de facto standard protocol for x86 machines (see the sketch below)
• http://www.intel.com/design/servers/ipmi/
• http://openipmi.sourceforge.net/
– conserver - logs consoles and supports multiple
viewers of console
• http://www.conserver.com/
– vnc – remotes a whole desktop
• http://www.tightvnc.com/
– openslp – Service Location Protocol
• http://www.openslp.org/
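For out-of-band control in a lights-out room, a management script typically drives each node's service processor (BMC) over the LAN. The sketch below simply wraps the ipmitool command mentioned above; the BMC hostnames, credentials, and environment variable are placeholders, and your BMC firmware may require different options.

```python
#!/usr/bin/env python3
"""Out-of-band power control of a node's BMC via ipmitool (sketch)."""
import os
import subprocess

def ipmi_power(bmc_host, action="status"):
    """Run 'ipmitool ... chassis power <action>' against one BMC.

    action is one of: status, on, off, cycle, reset.
    """
    cmd = [
        "ipmitool", "-I", "lanplus",           # IPMI over LAN
        "-H", bmc_host,                        # BMC address, e.g. node001-bmc (placeholder)
        "-U", "admin",                         # placeholder credentials --
        "-P", os.environ.get("IPMI_PW", ""),   # password read from the environment
        "chassis", "power", action,
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    # Example: print the power state of a few hypothetical BMCs.
    for bmc in ["node001-bmc", "node002-bmc"]:
        print(bmc, "->", ipmi_power(bmc, "status"))
```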
System Management Challenges …
Node installation
[Diagram: a management server drives several install servers, each of which installs many nodes in parallel.]
– Parallel and hierarchical
– Unattended, over the network
– Repeatable (uniformity)
– Different models:
• direct (e.g. kickstart)
• cloning
• diskless
– Kickstart, Autoyast – direct installation of distro RPMs
– syslinux, dhcp, tftp – PXE network boot of nodes (see the sketch below)
– system imager – cloning of nodes
• http://www.systemimager.org/
– warewulf – diskless boot of nodes
• http://www.warewulf-cluster.org/
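To make the PXE path concrete, the sketch below writes one pxelinux configuration file per node MAC address, pointing each node at its own kickstart file for an unattended install. The directory layout, kernel and initrd names, and the kickstart URL are assumptions; adapt them to your tftp and http servers.

```python
#!/usr/bin/env python3
"""Generate per-node pxelinux.cfg entries for unattended kickstart installs (sketch)."""
from pathlib import Path

TFTP_DIR = Path("/tftpboot/pxelinux.cfg")          # assumed tftp layout
KS_URL   = "http://mgmt/ks/{node}.cfg"             # hypothetical kickstart server

# node name -> MAC address (normally pulled from a cluster database)
NODES = {
    "node001": "00:11:22:33:44:01",
    "node002": "00:11:22:33:44:02",
}

TEMPLATE = """default install
label install
  kernel vmlinuz
  append initrd=initrd.img ks={ks}
"""

for node, mac in NODES.items():
    # pxelinux looks for a per-client file named 01-<mac with dashes>
    fname = "01-" + mac.lower().replace(":", "-")
    (TFTP_DIR / fname).write_text(TEMPLATE.format(ks=KS_URL.format(node=node)))
    print("wrote", TFTP_DIR / fname)
```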
System Management Challenges …
Software/firmware maintenance
– Selection of combination of upgrades (BIOS, NIC firmware, RAID firmware, multiple OS patches, application updates)
– Distribution and installation of updates
– Inventory of current levels (see the sketch below)
– up2date - Red Hat Network
• https://rhn.redhat.com/
– you - YaST Online Update
– yum – Yellow Dog Updater, Modified (RPM update mgr)
• http://linux.duke.edu/projects/yum/
– autoupdate – RPM update manager
• http://www.mat.univie.ac.at/~gerald/ftp/autoupdate/index.html
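One way to keep an inventory of current levels is to pull the RPM list from every node and diff it against a reference node. The sketch below does this over ssh; the node names, the choice of reference node, and password-less ssh keys are assumptions.

```python
#!/usr/bin/env python3
"""Compare installed RPM levels on each node against a reference node (sketch)."""
import subprocess

REFERENCE = "node001"                       # assumed "golden" node
NODES = ["node002", "node003", "node004"]   # hypothetical node list

def rpm_set(node):
    """Return the set of name-version-release strings installed on a node."""
    out = subprocess.run(
        ["ssh", node, "rpm", "-qa"],        # requires password-less ssh keys
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

baseline = rpm_set(REFERENCE)
for node in NODES:
    pkgs = rpm_set(node)
    missing = baseline - pkgs
    extra = pkgs - baseline
    if missing or extra:
        print(f"{node}: {len(missing)} missing, {len(extra)} extra vs {REFERENCE}")
```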
System Management Challenges…
Configuration management
– Manage /etc files across the cluster
– Apply changes immediately (w/o reboot)
– Detect configuration inconsistencies between nodes (see the sketch below)
– Time synchronization
– User management
[Diagram: a management server distributes configuration to node 1 through node n.]
– rdist – distribute files in parallel to many nodes
• http://www.magnicomp.com/rdist/
– rsync – copy files (when changed) to another machine
– cfengine – manage machine configuration
• http://www.cfengine.org/
– NTP – network time protocol
• http://www.ntp.org/
– openldap – user information server
• http://www.openldap.org/
– NIS – user information server
• http://www.linux-nis.org/
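Detecting configuration inconsistencies can be as simple as checksumming a few /etc files on every node and comparing them against the management server's copies. The sketch below is a minimal version of that idea; the watched-file list, node names, and ssh access are assumptions, and a real deployment would use a tool such as cfengine or rdist instead.

```python
#!/usr/bin/env python3
"""Flag /etc files that differ from the management server's copy (sketch)."""
import hashlib
import subprocess

WATCHED = ["/etc/hosts", "/etc/ntp.conf", "/etc/passwd"]   # assumed file list
NODES = ["node001", "node002", "node003"]                  # hypothetical nodes

def local_sum(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def remote_sum(node, path):
    # md5sum prints "<hash>  <path>"; keep only the hash field
    out = subprocess.run(["ssh", node, "md5sum", path],
                         capture_output=True, text=True).stdout
    return out.split()[0] if out else None

reference = {p: local_sum(p) for p in WATCHED}
for node in NODES:
    for path, want in reference.items():
        if remote_sum(node, path) != want:
            print(f"{node}:{path} differs from the management server")
```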
System Management Challenges…
Monitoring for failures
– With the mean time to failure of some disks, in large clusters several nodes can fail each week
– Heartbeating (see the sketch below)
– Storage servers, networks, node temperatures, OS metrics, daemon status, etc.
– Notification and automated responses
– Use minimum resources on the nodes and network
– fping – ping many nodes in parallel
• http://www.fping.com/
– Event monitoring web browser interfaces
• ganglia - http://ganglia.sourceforge.net/
• nagios - http://www.nagios.org/
• big brother – http://bb4.com/
– snmp – Simple Network Management Protocol
• http://www.net-snmp.org/
– pegasus – CIM Object Manager
• http://www.openpegasus.org/
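A lightweight heartbeat can be built directly on fping, which probes many nodes in one invocation and can report the unreachable ones. The sketch below is one way to wire that into a notification hook; the node list, sweep interval, and the exact format of fping's `-u` output should all be treated as assumptions to verify on your system.

```python
#!/usr/bin/env python3
"""Minimal heartbeat loop built on fping (sketch)."""
import subprocess
import time

NODES = [f"node{i:03d}" for i in range(1, 101)]   # hypothetical node list
INTERVAL = 60                                      # seconds between sweeps (assumed)

def down_nodes():
    """Return the nodes fping reports as unreachable (-u)."""
    result = subprocess.run(["fping", "-u"] + NODES,
                            capture_output=True, text=True)
    return result.stdout.split()

while True:
    for node in down_nodes():
        # Notification hook: replace with email, syslog, or an automated response.
        print(f"ALERT: {node} did not answer heartbeat")
    time.sleep(INTERVAL)
```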
System Management Challenges…
Cross-geography
– Secure connections (e.g. SSL)
• WANs are often not secure
– Tolerance of routing
• Broadcast protocols (e.g. DHCP) usually are not forwarded through routers
– Tolerance of slow connections
• Move large data transfers (e.g. OS installation) close to target
– Firewalls
• Minimize the number of ports used (see the sketch below)
– openssh – Secure Shell
• http://www.openssh.com/
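Minimizing open ports and securing WAN links often comes down to tunneling management traffic over a single ssh connection. The sketch below opens a compressed local port forward to a remote management server's web interface through one firewall port (ssh); the host names and port numbers are placeholders.

```python
#!/usr/bin/env python3
"""Tunnel a remote cluster's management web UI over one ssh connection (sketch)."""
import subprocess

REMOTE_GW = "gateway.remote-site.example"   # hypothetical ssh gateway at the far site
LOCAL_PORT = 8443                           # local end of the tunnel
TARGET = "mgmtsvr:443"                      # management server behind the firewall

# -N: no remote command, -C: compress (helps on slow WAN links),
# -L: forward localhost:8443 to mgmtsvr:443 through the single ssh port.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-C", "-L", f"{LOCAL_PORT}:{TARGET}", REMOTE_GW]
)
print(f"Browse https://localhost:{LOCAL_PORT}/ ; Ctrl-C to close the tunnel.")
try:
    tunnel.wait()
except KeyboardInterrupt:
    tunnel.terminate()
```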
System Management Challenges…
Security/Auditability
– Install/confirm security patches
– Conform to company security policies
– Log all root activity (see the sketch below)
– sudo - give certain users the ability to run some commands as root while logging the commands
• http://www.courtesan.com/sudo/
Accounting
– Track usage and charge departments for use
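For auditability, the commands that sudo logs can be summarized per user from the system log. The sketch below scans a syslog file for sudo entries; the log path and message format vary by distribution and syslog configuration, so treat both as assumptions.

```python
#!/usr/bin/env python3
"""Summarize root commands run via sudo from the system log (sketch)."""
import re
from collections import Counter

LOGFILE = "/var/log/secure"        # Red Hat-style path; Debian uses /var/log/auth.log
PATTERN = re.compile(r"sudo.*?(\S+) : .*COMMAND=(.+)")

per_user = Counter()
with open(LOGFILE, errors="replace") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            user, command = match.groups()
            per_user[user] += 1
            print(f"{user}: {command.strip()}")

print("\nsudo invocations per user:", dict(per_user))
```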
System Management Challenges…
Flexibility
– Broad variety of hardware and Linux distros
– Node installation customization
– Monitoring extensions
– Spectrum of security policies
– Hierarchical clusters
– Scriptable commands
[Diagram: an executive management server controls several first-line management servers, each of which manages many nodes.]
Beowulf Cluster Management Suites
Open source
– OSCAR – collection of cluster management tools
• http://oscar.openclustergroup.org/
– ROCKS – stripped down RHEL with management software
• http://www.rocksclusters.org/
– webmin – web interface to linux administration tools
• http://www.webmin.com/
Products
– Scali Manage
• http://www.scali.com/
– Linux Networx Clusterworx
• http://linuxnetworx.com/
– Egenera
• http://www.egenera.com/
– Scyld
• http://www.penguincomputing.com/
– HP XC
• http://h20311.www2.hp.com/HPC/cache/275435-0-0-0-121.html
– Windows Compute Cluster Server
• http://www.microsoft.com/windowsserver2003/ccs/
Beowulf Cluster Management Suites
IBM Products
– IBM 1350 – Cluster hardware & software bundle
• http://www.ibm.com/systems/clusters/hardware/1350.html
– IBM BladeCenter
• http://www.ibm.com/systems/bladecenter/
– Blue Gene
• http://www.ibm.com/servers/deepcomputing/bluegene.html
– CSM – Cluster Systems Management
• http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html