Software Tools for Beowulf Cluster

Putchong Uthayopas, Thara Angsakul,
Jullawadee Maneesilp
Parallel Research Group,
Computer and Network System Research Laboratory
Department of Computer Engineering, Faculty of Engineering
Kasetsart University, Bangkok, Thailand
Phone: (662) 942 8555 Ext. 1416
Fax: (662) 5614621
Email: [email protected]
Motivation
• Beowulf clusters have become one of the most widely used platforms for high-performance computing
• Very large and complex Beowulf clusters are starting to appear
• System management is still a challenging task. There is a need for
– An effective way to navigate and interact with cluster components
– Mechanisms and tools to perform collective commands
– Services such as monitoring, fault detection, and recovery
– Special software tools that recognize the special characteristics and needs of the cluster administration task
SCMS: An Extensible Cluster
Management Tool for Beowulf
Cluster
• A collection of system management
tools for Beowulf Cluster
• Package includes
– Portable real-time monitoring
– Parallel Unix command
– Alarm system
– Large collection of graphical user interface tools for users and system administrator
   • Checking user status
   • Remote software installation
   • System disk space and process space status
   • Boot up and shutdown nodes
   • Change node configuration remotely
– Web/VRML interface
• Current version 1.1 supports only RedHat Linux
Portable Real-time Monitoring
• Provides global access to node information
– Interfaces with the local OS to get node information
– Collects the information at a single point
– Provides heartbeat and node health diagnostics
– Provides an API for applications to access the information. The API is available in C, Java, and Tcl/Tk.
• System Architecture
– Client/Server
– Layered Architecture
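The heartbeat and node health idea above can be illustrated with a minimal sketch. This is not the SCMS implementation: the class name `HeartbeatTable`, the 10-second timeout, and the "up"/"down" labels are all assumptions made for the example. A node is treated as healthy if its last heartbeat arrived within the timeout.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatTable:
    """Tracks the last time each node reported in (hypothetical structure)."""
    timeout_s: float = 10.0                      # assumed timeout, not taken from SCMS
    last_seen: dict = field(default_factory=dict)

    def beat(self, node, now=None):
        """Record a heartbeat from `node` at time `now` (defaults to the wall clock)."""
        self.last_seen[node] = time.time() if now is None else now

    def status(self, now=None):
        """Return {node: 'up' | 'down'} depending on the age of the last heartbeat."""
        now = time.time() if now is None else now
        return {
            node: "up" if now - seen <= self.timeout_s else "down"
            for node, seen in self.last_seen.items()
        }


if __name__ == "__main__":
    table = HeartbeatTable(timeout_s=10.0)
    table.beat("node01", now=100.0)
    table.beat("node02", now=95.0)
    print(table.status(now=108.0))
```

Passing `now` explicitly keeps the sketch deterministic; a real monitor would rely on the clock and on heartbeats arriving over the network.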
System Architecture
[Figure: layered architecture. Labels, top to bottom: Configuration Management, Task Scheduling, Performance Monitoring, Parallel Unix command; Resource Management API (C, TCL, Java); System Information Repository; SMA; several CMAs; HAL API; HAL; LOCAL OS (LINUX)]
• CMA - Control and Monitoring Agent
– Get system information from local operating system on each node
– Portability is achieved using HAL (Hardware Abstraction Layer)
• SMA - System Management Agent
– Running on management node to collect information from CMA
• RMI - Resource Management Interface
– Library that provides interface to functionality of SMA
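The layering described above (CMA over HAL on each node, SMA collecting on the management node, a library on top) might be sketched as follows. All class and function names here are hypothetical and chosen only to mirror the slide's terms; the real SCMS agents communicate over the network, which this in-process sketch leaves out.

```python
import os
from abc import ABC, abstractmethod


class HAL(ABC):
    """Hardware Abstraction Layer: the only OS-specific piece in this sketch."""

    @abstractmethod
    def load_average(self) -> float:
        """Return the node's current load average."""


class LinuxHAL(HAL):
    """HAL backed by the local OS (os.getloadavg is available on Unix only)."""

    def load_average(self) -> float:
        return os.getloadavg()[0]  # 1-minute load average


class StaticHAL(HAL):
    """Stand-in HAL with a fixed reading, used only for demonstration."""

    def __init__(self, load: float):
        self._load = load

    def load_average(self) -> float:
        return self._load


class CMA:
    """Control and Monitoring Agent: reads node information through a HAL."""

    def __init__(self, node: str, hal: HAL):
        self.node = node
        self.hal = hal

    def report(self) -> dict:
        return {"node": self.node, "load": self.hal.load_average()}


class SMA:
    """System Management Agent: gathers CMA reports into one repository."""

    def __init__(self, agents):
        self.agents = agents
        self.repository = {}

    def collect(self) -> dict:
        for agent in self.agents:
            info = agent.report()
            self.repository[info["node"]] = info
        return self.repository


def node_load(sma: SMA, node: str) -> float:
    """RMI-style helper (hypothetical): query the repository for one node."""
    return sma.repository[node]["load"]


if __name__ == "__main__":
    sma = SMA([CMA("node01", StaticHAL(0.5)), CMA("node02", StaticHAL(1.5))])
    sma.collect()
    print(node_load(sma, "node02"))
```

Porting to another operating system would, under this design, mean writing one more `HAL` subclass while `CMA` and `SMA` stay unchanged.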
Parallel Unix Command
• Parallel versions of commonly used Unix commands such as pps, pls, prm
• Follows the scalable Unix tool model (Lusk and Gropp 1994)
• Graphical user interface for these commands
– Ease of use
– Filtering output data
[Figure: pps -aux sends the command ps -aux to each node and collects the returned data]
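A rough sketch of the idea behind a parallel command such as pps: the same command is dispatched to every node concurrently and the outputs are gathered, optionally filtered. The function names, the `keep` filter parameter, and the use of ssh as transport are assumptions for illustration; the slides do not say how SCMS reaches the nodes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_via_ssh(node, command):
    """Hypothetical transport: run `command` (a list of words) on `node` over ssh."""
    result = subprocess.run(
        ["ssh", node, *command], capture_output=True, text=True, check=False
    )
    return result.stdout


def parallel_run(nodes, command, run_on_node=run_via_ssh, keep=None):
    """Run `command` on every node concurrently and return {node: [output lines]}.

    `keep` is an optional predicate used to filter output lines.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map returns results in the same order as `nodes`
        outputs = list(pool.map(lambda n: run_on_node(n, command), nodes))
    results = {}
    for node, out in zip(nodes, outputs):
        lines = out.splitlines()
        if keep is not None:
            lines = [line for line in lines if keep(line)]
        results[node] = lines
    return results


if __name__ == "__main__":
    # Fake transport so the example runs without a cluster.
    def fake(node, command):
        return f"root 1 init\nalice 42 {node}-job\n"

    print(parallel_run(["node01", "node02"], ["ps", "-aux"],
                       run_on_node=fake,
                       keep=lambda line: line.startswith("alice")))
```

Making the transport a parameter lets the same collection logic be tried locally with a fake runner before pointing it at real nodes.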
Alarm System
• Set of daemons that monitor important system parameters
– Processor utilization, memory usage, main board temperature, and more
• Users can specify the alarm condition and the action to be taken
• Issues the alarm and shuts down part of the system if needed
• Notifications are sent by email. Future releases will include pager, ICQ, and speech synthesis
[Figure: Alarm Manager with Config, several Detectors, and Notification/action]
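The condition/action pairing described above could look roughly like this. It is a sketch under stated assumptions: the `AlarmRule` structure, the parameter names, the "greater than threshold" condition, and the `notify` action are invented for the example and are not the SCMS configuration format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlarmRule:
    """User-specified rule: fire `action` when `parameter` exceeds `threshold`."""
    parameter: str
    threshold: float
    action: Callable[[str, float], str]


def notify(parameter, value):
    """Hypothetical action: build the text of a notification message."""
    return f"ALARM: {parameter} = {value}"


def check_alarms(readings, rules):
    """Evaluate every rule against the detector readings; return fired actions."""
    fired = []
    for rule in rules:
        value = readings.get(rule.parameter)
        if value is not None and value > rule.threshold:
            fired.append(rule.action(rule.parameter, value))
    return fired


if __name__ == "__main__":
    rules = [
        AlarmRule("cpu_util", 90.0, notify),
        AlarmRule("board_temp_c", 70.0, notify),
    ]
    print(check_alarms({"cpu_util": 95.0, "board_temp_c": 55.0}, rules))
```

Because the action is just a callable, an email sender or a node shutdown routine could be plugged in place of `notify` without touching the rule evaluation.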
SCMS Utilities
SCMS comes with many GUI utilities
• Node status
• Control Panel
• Disk Space
• Process Status
• Shutdown/Reboot
• Remote login
• User status
• Package Installation
SCMS Screen Shot
KCAP Web and VRML based
Interface for SCMS
• Two versions of the web interface are available
– KCAP: Normal web interface
– KCAP-VR: VRML interface that allows you to walk through and interact with your cluster
• A Java applet is used to report real-time system information
[Figure: System Config feeds the Web Generator (Web Tree) and the VRML World Generator (VRML World); other labels: Web server, Real time Monitoring, External Network]
KCAP and KCAP-VR Screen shot
Future Work
• KSIX: A framework to support parallel tools and applications
• Offers features such as
– Process control, signal delivery
– Naming services
– Event-based communication
[Figure: layered stack. Applications and MPI on top of KSIX (Kasetsart System Interconnect eXecutive), which spans the Node OS and Node Hardware of four nodes connected by the Interconnection Network]
SQMS: SMILE Queuing
Management System
• Batch scheduler for sequential and parallel tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Auto docking between clusters
[Figure: Task Submitter, Task Queue, Remote Queue, Node Allocator, Scheduler, Cluster Nodes]
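One way to picture a batch scheduler with a reconfigurable policy is to pass the placement policy in as a function. The sketch below is an assumption-laden illustration, not SQMS itself: the task format `(name, cost)`, the two policies, and the `schedule` function are hypothetical, and the placement is decided once up front rather than rebalanced at run time.

```python
from collections import deque


def least_loaded(nodes, load):
    """Policy: pick the node with the smallest accumulated load."""
    return min(nodes, key=lambda n: load[n])


def round_robin_factory():
    """Policy factory: returns a policy that cycles through the nodes in order."""
    counter = {"i": 0}

    def pick(nodes, load):
        node = nodes[counter["i"] % len(nodes)]
        counter["i"] += 1
        return node

    return pick


def schedule(tasks, nodes, policy):
    """Drain the task queue, asking `policy` where to place each task.

    `tasks` is a list of (name, cost) pairs; returns {task name: node}.
    """
    queue = deque(tasks)
    load = {n: 0 for n in nodes}
    placement = {}
    while queue:
        name, cost = queue.popleft()
        node = policy(nodes, load)
        placement[name] = node
        load[node] += cost
    return placement


if __name__ == "__main__":
    tasks = [("a", 3), ("b", 1), ("c", 1)]
    nodes = ["node01", "node02"]
    print(schedule(tasks, nodes, least_loaded))
    print(schedule(tasks, nodes, round_robin_factory()))
```

Swapping the `policy` argument changes the scheduling behavior without changing the queue handling, which is the sense of "reconfigurable" this sketch tries to convey.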
Beowulf Computing Environment
at Kasetsart University, Thailand
SMILE Beowulf Cluster
• 16-node Pentium II/III cluster
• Test bed for cluster technology and support of HPC research activities
PIRUN Beowulf Cluster (Pile of Redundant Universal Nodes)
• 72-node Beowulf system
– PII 500 MHz, 128 MB RAM
• Largest computing system in Thailand
• Installation will be completed in December 1999