Martinez_Presentation

Performance Monitoring of SLAC Blackbox Nodes
Using Perl, Nagios, and Ganglia
Roxanne Martinez
Mentor: Yemi Adesanya
United States Department of Energy
Stanford, CA 94305
SCCS
The Scientific Computing and Computing
Services at SLAC:
• Provides computing power, technical
support, communications capabilities.
• Core services include Unix systems,
Windows, networking, network
operations, telecommunications.
• Supplies dept. support, science
applications, network security.
• Houses thousands of servers.
The High Performance
Computing Group of SCCS
• To ensure optimal computing performance of all
of these servers, they must be monitored. This is
the responsibility of the HPC group.
• The group watches data storage, electrical
service to servers, and cooling system capacity.
• This is made possible through the use of
monitoring software: Nagios and Ganglia.
SCCS Task
• Until last year, all computing capacity at SLAC
was located within the SCCS computing
building.
• By then the datacenter had reached its
maximum electrical service and cooling system
capacities.
• New experiments meant the need for more
computing power.
• A new datacenter would take years and a lot of
funding to complete.
The Solution: Blackboxes
• This is a Sun Modular
Datacenter produced by Sun
Microsystems.
• It is a portable computing
center built into a standard 8
foot by 20 foot shipping
container.
• It is painted white for energy
efficiency and is tightly
sealed, insulated, and
cooled.
• Today, SLAC maintains 2
blackboxes.
Blackbox Contents
• Blackbox 1
– 252 bali machines
(Sun X2200 servers)
• Blackbox 2
– 156 yili machines (Sun
X4100 servers)
– 139 boer machines
(Sun X2200 servers)
The operating system on these
machines is RedHat Enterprise Linux
(RHEL) version 4.
Current Monitoring of the
Blackboxes
The High Performance Computing Group
currently uses Nagios and Ganglia to
monitor:
• Percentage of CPU in use,
• Amount of memory in use, and
• Input/output rates.
The software periodically calls on utilities
to extract monitoring data for the
machines, displaying the info in
graphs, storing the info in databases,
and – in the case of Nagios – alerting
administrators if machines reach
warning or critical states.
Nagios
• User specifies items to be monitored by
providing external plugins that return the
status of machines to Nagios.
• If a warning or critical status is returned,
Nagios can alert via email, IM, text, etc.
• Admins and users can view current
status and history using a web browser.
– MySQL runs as a server to provide
multi-user access to multiple
databases. Interface: PerfParse.
– Round robin database (RRD)
provides useful graphs of broad
historical data. Popular because the
database files do not increase in size
over time.
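The fixed-size property of a round robin database can be illustrated with a small Python sketch. This is not RRDtool's file format or API; the class name RoundRobinSeries and the capacity of 4 are hypothetical, chosen only to show why storage stays constant: once the buffer is full, each new sample overwrites the oldest one.

```python
from collections import deque


class RoundRobinSeries:
    """Hypothetical fixed-capacity series; illustrates the idea, not RRDtool itself."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item automatically when full.
        self.samples = deque(maxlen=capacity)

    def update(self, timestamp, value):
        self.samples.append((timestamp, value))

    def values(self):
        return [value for _, value in self.samples]


series = RoundRobinSeries(capacity=4)
# Six made-up temperature samples, one every 15 minutes.
for step, temp in enumerate([48.0, 49.0, 50.0, 51.0, 52.0, 53.0]):
    series.update(step * 15, temp)

print(len(series.samples))  # never exceeds the capacity
print(series.values())      # only the most recent samples remain
```

The trade-off is that old data is lost (real round robin databases typically keep coarser averages of it) in exchange for files that never grow.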
Ganglia
• Robust scalable distributed monitoring
system designed for clusters and grids.
• Based on a hierarchical design: uses a
tree of connections to representative
nodes for each cluster, reducing
overheads.
• Updates the RRD.
• Has a web frontend like Nagios but does
not have an alerting feature.
Additional Monitoring Needed
• Temperature
• Fan speed
• Power supply voltage
“Materials”
• Baseboard management controller (BMC)
– Service processor that monitors the physical state of the machine.
– Located on the motherboard.
– Performs monitoring through the machine's sensors.
– Part of the Intelligent Platform Management Interface (IPMI),
which provides a set of interfaces to manage and monitor a
system.
• IPMI tool
– Open source utility.
– Can be used to extract physical parameters and parameter
thresholds. These are important in determining the status.
• Lower Non-Recoverable, Lower Critical, Lower Non-Critical, Upper Non-Critical, Upper Critical, and Upper Non-Recoverable
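How the six thresholds can determine a status is sketched below in Python. The mapping is an assumption, not the project's confirmed logic: crossing a Non-Critical threshold is treated as WARNING, crossing a Critical or Non-Recoverable threshold as CRITICAL, and a missing reading as UNKNOWN. The short keys (lnr, lcr, lnc, unc, ucr, unr) and the limit values are hypothetical.

```python
def classify(value, thresholds):
    """Map a sensor reading to OK, WARNING, CRITICAL, or UNKNOWN.

    thresholds maps lnr/lcr/lnc/unc/ucr/unr to floats; entries that are
    missing or None are ignored, since a BMC may not define all six.
    """
    if value is None:
        return "UNKNOWN"

    # Assumption: a reading exactly equal to a limit counts as crossing it.
    def at_or_below(name):
        limit = thresholds.get(name)
        return limit is not None and value <= limit

    def at_or_above(name):
        limit = thresholds.get(name)
        return limit is not None and value >= limit

    if any(map(at_or_below, ("lnr", "lcr"))) or any(map(at_or_above, ("ucr", "unr"))):
        return "CRITICAL"
    if at_or_below("lnc") or at_or_above("unc"):
        return "WARNING"
    return "OK"


cpu_limits = {"unc": 90.0, "ucr": 95.0}      # hypothetical temperature limits
fan_limits = {"lcr": 1000.0, "lnc": 2000.0}  # hypothetical fan speed limits

print(classify(49.0, cpu_limits))
print(classify(92.0, cpu_limits))
print(classify(900.0, fan_limits))
```

Checking the critical limits first means a reading that crosses both a Non-Critical and a Critical threshold reports the more severe state.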
“Materials” continued
“sudo ipmitool -c sdr”
Output for both commands was
captured while connected to the
Sun X2200 server boer0113.
“sudo ipmitool sensor list”
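Turning the sensor list output into usable values can be sketched as follows, in Python. The sketch assumes the usual pipe-separated layout of ipmitool sensor list (name, reading, unit, status, then the six thresholds from Lower Non-Recoverable to Upper Non-Recoverable). The SAMPLE text is illustrative only, not real output from boer0113, and the function names are hypothetical.

```python
# Illustrative lines in the assumed layout; the values are made up.
SAMPLE = """\
CPU 0 Temp       | 49.000     | degrees C  | ok    | na        | na        | na        | 90.000    | 95.000    | na
CPU 1 Temp       | 51.000     | degrees C  | ok    | na        | na        | na        | 90.000    | 95.000    | na
Fan 1            | na         | RPM        | na    | na        | 1000.000  | 2000.000  | na        | na        | na
"""

THRESHOLD_KEYS = ("lnr", "lcr", "lnc", "unc", "ucr", "unr")


def to_float(text):
    """Return a float, or None for non-numeric fields such as 'na'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sensor_list(text):
    sensors = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 10:
            continue  # skip lines that do not match the assumed layout
        name, reading, unit, status = parts[:4]
        sensors[name] = {
            "value": to_float(reading),
            "unit": unit,
            "status": status,
            "thresholds": dict(zip(THRESHOLD_KEYS, map(to_float, parts[4:]))),
        }
    return sensors


sensors = parse_sensor_list(SAMPLE)
print(sensors["CPU 0 Temp"]["value"], sensors["CPU 0 Temp"]["thresholds"]["ucr"])
```

Keying on the sensor name rather than on line position is one way to stay independent of the specific BMC, since different boards list different sensors in different orders.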
“Materials” continued
• Cron (from the Greek chronos, “time”)
– Time-based scheduling service in Unix.
– Used for security reasons: collecting the data requires root
privileges, so cron (not the monitoring scripts) runs the collection.
• Perl
– Ideal Unix scripting language for the task.
– Interpreted language; no separate compile step.
– Efficient programming language that is powerful for
file input and output because of its text manipulation
capabilities and fast development cycle.
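The privilege split described under Cron can be sketched from the reader's side: a root-owned cron job writes sensor data to a file, and an unprivileged script only reads it. The Python sketch below is an illustration, not the project's Perl code. The function name load_sensor_cache, the file name, and the 20-minute freshness limit are assumptions.

```python
import os
import tempfile
import time

MAX_AGE_SECONDS = 20 * 60  # assumed: one collection interval plus some slack


def load_sensor_cache(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return the cached lines, or None if the file is missing or too old."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as handle:
            lines = handle.read().splitlines()
    except OSError:
        return None  # missing or unreadable cache
    if age > max_age:
        return None  # cron has probably stopped refreshing the file
    return lines


# Demonstration with a temporary file standing in for the cron-written cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "sensors.txt")  # hypothetical file name
    with open(cache, "w") as handle:
        handle.write("CPU 0 Temp | 49.000 | degrees C | ok\n")
    fresh = load_sensor_cache(cache)
    stale = load_sensor_cache(cache, now=time.time() + 3600)
    missing = load_sensor_cache(os.path.join(tmp, "absent.txt"))

print(fresh, stale, missing)
```

A caller would treat a None result as an UNKNOWN status rather than reporting old readings as current.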
Task
Create three Perl scripts (temperature, fan speed, voltage) that can be
used on any machine regardless of the specific BMC.
– Work first with yili0113, bali0113, and boer0113.
– Cron will run as the root user, call the IPMI tool, and store the
data in a readable file every 15 minutes.
– The scripts will read the data every 15 minutes from the file to
produce the current machine parameters and interpret the
current status of the machine (OK, WARNING, CRITICAL,
UNKNOWN).
– For Nagios, the scripts will return the current status and
parameters.
– For Ganglia, the scripts will call the Ganglia command, which
passes the parameters to Ganglia.
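The Ganglia step can be sketched as building one command per parameter. The slides do not name the command; this Python sketch assumes it is gmetric, the command-line tool distributed with Ganglia, and uses its --name, --value, --type, and --units options. The helper name and the metric values are hypothetical, and nothing is executed.

```python
def build_gmetric_command(name, value, units, value_type="float"):
    """Return the argument list for one gmetric call (assumed tool; not run here)."""
    return [
        "gmetric",
        "--name", name,
        "--value", f"{value:.3f}",
        "--type", value_type,
        "--units", units,
    ]


cmd = build_gmetric_command("CPU_0_Temp", 49.0, "Celsius")
print(" ".join(cmd))

# On a host running the Ganglia monitoring daemon, the metric could be sent with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a sensor name contains spaces.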
Results
• In a test of the check_cpu_temp.pl script
on the bali0113 machine, the following
results were produced using the Perl
interpreter:
“Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 |
CPU_0_Temp=49.000 CPU_1_Temp=51.000”
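The result line follows the usual Nagios plugin convention: readable text, then a "|" separator, then performance data, with the status also reported through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The Python sketch below reproduces the line above; the function name plugin_output is hypothetical, and the actual check_cpu_temp.pl script is written in Perl and may be structured differently.

```python
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def plugin_output(label, status, readings):
    """Format one plugin line: human-readable text, then '|', then perfdata."""
    human = ", ".join(f"{name}={value:.3f}" for name, value in readings)
    perfdata = " ".join(f"{name}={value:.3f}" for name, value in readings)
    return f"{label} {status} - {human} | {perfdata}"


line = plugin_output(
    "Temperature", "OK", [("CPU_0_Temp", 49.0), ("CPU_1_Temp", 51.0)]
)
print(line)

# A real plugin would end by exiting with the matching code, for example:
#   import sys
#   sys.exit(EXIT_CODES["OK"])
```

Nagios reads the text before the separator for display and alerts, while the part after it can be passed to graphing and database tools.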
The Scripts as Nagios Plugins
Ganglia work is still underway!
Conclusions
• Perl scripts, Nagios monitoring, and graphics
tools work successfully.
• All three test machines are running with
acceptable temperatures, fan speeds, and
power supply voltages. This suggests that
current cooling systems and electrical supplies
in blackboxes are effective. The monitoring must
be done on all servers, however, for a complete
evaluation to be possible.
• The HPC group is much closer to ensuring
optimal computing performance for the lab.
Future Work
• The scripts are portable.
– 3 test machines
– KIPAC machines
– All blackbox machines upon approval
– Possibly more to come
• The scripts can also be edited to monitor
different parameters.
Acknowledgements
Thank you to the U.S. Department of
Energy Office of Science and the Stanford
Linear Accelerator Center for the
opportunity to participate in the Science
Undergraduate Laboratory Internships
program. Thank you to Steve, Susan, and
Farah. Thank you to my mentor, Yemi
Adesanya, for his mentorship throughout
the project.