Cluster Administration Tool


Jiyeon Kim, Yongkwan Park, Sungjoo Kwon, Jaeyoung Choi
{heaven, psiver, lithmmon}@ss.ssu.ac.kr, [email protected]
School of Computing, Soongsil University
1-1, Sangdo-Dong, Dongjak-Ku
Seoul 156-743, Korea
2015-07-18
Motivation
Linux cluster systems are widely used for high-performance computing
They emphasize the use of commodity hardware and open source software
They deliver very high performance at extremely low cost
System management is a challenging task. It requires:
  Automatic and convenient installation of OS and application software packages
  An effective way to navigate and interact with cluster components
  Mechanisms and tools to perform collective commands
  Services such as monitoring, fault detection, and recovery
Soongsil university
What is CATS-i?
Cluster Administration ToolS on the Internet
A collection of system management tools
  Provides automatic and convenient installation of OS and application software packages
  Provides efficient monitoring and management of cluster nodes with simple operations over the Internet
  Provides an easy-to-use GUI for PBS
Easy-to-install CATS-i RPM package
CATS-i System Architecture
Client daemon: gets system information from the local OS on each node
Server daemon (with repository): runs on the server node to collect information from the client daemons
Setup tool: implemented with Java
Management tool: implemented with Java, supports the Internet
[Figure: the setup tool and the management tool connect to the server daemon and its repository; the server daemon communicates with a client daemon on each node]
Differences from CATS-i
NodeCloner
  CACR at Caltech
  Makes all nodes identical
  Uses BOOTP and NFS
  Does not provide a GUI
  Users must edit the setup files related to NodeCloner
Beoboot
  Rembo Technology SaRL, Switzerland
  Boot-ROM booting
  Uses DHCP
  Uses a batch file interpreter
  Drawbacks: the batch file must be written by hand, and the interface is difficult
Differences from CATS-i
LUI (Linux Utility for cluster Installation)
  IBM
  Supports the BOOTP protocol, DHCP, and PXE
  GUI interface
  Heterogeneous clusters
  Resource objects must be defined
  Uses TFTP
    As the number of nodes increases, the I/O load increases
Installation using IP Multicasting
It provides the same installation speed and reduces I/O load
The multicast client module is provided automatically through NFS
The server sends the slave node disk image to a class D IP address
To make up for the unreliability of UDP, it uses timeout and retransmission
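The timeout-and-retransmission idea can be sketched as follows. This is a minimal illustration, not the CATS-i protocol: the slides do not describe the packet format or how nodes acknowledge data, so the 4-byte sequence header, the per-chunk acknowledgements, the group address 239.1.1.1, and port 5007 are all assumptions. The transport is passed in as two callables so the retransmission logic can be read on its own.

```python
import socket
import struct

GROUP = "239.1.1.1"  # assumed class D address, chosen only for illustration
PORT = 5007          # assumed port

def make_multicast_socket(ttl=1, timeout=0.5):
    """Create a UDP socket for sending to a multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("b", ttl))
    sock.settimeout(timeout)  # receive calls give up after this many seconds
    return sock

def pack_chunk(seq, payload):
    """Prefix a payload with a 4-byte sequence number (assumed format)."""
    return struct.pack("!I", seq) + payload

def unpack_chunk(datagram):
    (seq,) = struct.unpack("!I", datagram[:4])
    return seq, datagram[4:]

def send_image(chunks, nodes, send, wait_for_acks, max_retries=5):
    """Send each chunk until every node has acknowledged it.

    send(datagram) multicasts one datagram.
    wait_for_acks(seq) returns the set of nodes that acknowledged chunk
    `seq` before the timeout expired.
    """
    for seq, payload in enumerate(chunks):
        pending = set(nodes)
        for _ in range(max_retries):
            send(pack_chunk(seq, payload))
            pending -= wait_for_acks(seq)
            if not pending:
                break  # everyone has this chunk, move on
        else:
            raise TimeoutError(
                f"chunk {seq} not acknowledged by {sorted(pending)}")
```

With a real socket, `send` could be `lambda d: sock.sendto(d, (GROUP, PORT))`. Waiting for every node before sending the next chunk keeps the sketch simple; a windowed scheme would be faster but is beyond what the slides state.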
Setup tools with IP multicasting
[Figure: on the master node, the GUI, node DB, network configuration info, and error/flow control feed the multicast server module, which sends over UDP to a class D IP address (224.0.0.0 ~ 239.255.255.255); Node 1, Node 2, Node 3, ..., Node N receive over UDP]
Setup tool in the CATS-i
Disk Cloning using NFS
  A slave node must be booted with a DHCP- and NFS-enabled kernel
  It boots in the same way as a diskless terminal, using DHCP
  It makes a disk image of a slave node, including hard disk information
  The slave node disk image is stored on the server disk
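A disk image that carries its hard disk information along with the raw data might look like the sketch below. The image layout (a length-prefixed JSON header followed by raw blocks) and the field names in `disk_info` are assumptions made for illustration; the slides do not specify the CATS-i image format. File-like objects stand in for the disk device and for the image file on the NFS-mounted server disk.

```python
import json
import struct

CHUNK = 64 * 1024  # copy in 64 KiB blocks

def save_image(device, image, disk_info):
    """Write disk_info as a header, then copy the device contents."""
    header = json.dumps(disk_info).encode("utf-8")
    image.write(struct.pack("!I", len(header)))  # 4-byte header length
    image.write(header)
    copied = 0
    while True:
        block = device.read(CHUNK)
        if not block:
            break
        image.write(block)
        copied += len(block)
    return copied

def restore_image(image, device):
    """Read the header back and copy the remaining bytes to the device."""
    (length,) = struct.unpack("!I", image.read(4))
    disk_info = json.loads(image.read(length).decode("utf-8"))
    while True:
        block = image.read(CHUNK)
        if not block:
            break
        device.write(block)
    return disk_info
```

In practice `device` would be a raw disk opened in binary mode, and the partition details in `disk_info` would be used to prepare the target disk before the data is written back.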
OS Setup tools Architecture - Disk cloning
[Figure: numbered message flow between the master node and the slave node. Components shown: Interface, Backup wizard, DHCP server, mode change, Boot kernel image, Init Program, Client program, Boot disk management Daemon, Lock management, NFS client, NFS server, Disk config, Low disk input, Hard disk, Image file]
Labeled steps: 1. Start, 2. Command, 3. Booting, 4. IP info, 5. Query, 7. Mode, 8. Operation, 11. Partition info, 15. Result
Disk cloning preparation: Step 1, 2, 3
Command operation: Step 4, 5, 6, 7, 8
Make disk image: Step 9, 10, 11, 12, 13
Save disk image: Step 14, 15
OS Setup tools Architecture - Installation
[Figure: numbered message flow between the master node and the slave node. Components shown: Interface, Restore wizard, DHCP server, mode change, Boot kernel image, Init Program, Client program, Boot disk management Daemon, Lock management, Multicast Client, Server Sender, Disk Config, format, Low disk output, Hard Disk, Image file]
Labeled steps: 1. Start, 2. Command, 3. Booting, 4. IP info, 5. Query, 7. Mode, 8. Operation, 8. Connect, 9. Start command, 15. Result
Installation preparation: Step 1, 2, 3
Command operation: Step 4, 5, 6, 7, 8, 9
Installation: Step 10, 11, 12, 13, 14
OS Setup tools
[Figure: Slave Node and Master Node]
Related works for CMS - VACM
VA Linux Systems
Cluster administration tool that runs on VA-Linux
Real-time hardware sensor data such as temperature, fan speed, and voltage are reported
Related works for CMS - MAT
Ryerson University, Canada
It is implemented with Tcl/Tk
  It causes a lot of overhead when displaying rapidly changing data
Each node is managed individually
It mainly monitors system files
Related works for CMS - SCMS
Kasetsart University
It consists of a real-time monitoring system, parallel Unix commands, and numerous system administration utilities
It supports a Java applet to report real-time system information
It supports a 3D interface using VRML
Related works for CMS – M3C
Oak Ridge National Lab
It is implemented with Java. Users can manage multiple cluster groups in one interface
It supports job scheduling and software installation
Management tools in the CATS-i
The management tool offers maintenance of cluster nodes.
Characteristics of the management tool
  Many nodes can be bound into one cluster group, and multiple cluster groups can be managed in one place.
  The same operation can be applied efficiently to all or selected nodes.
  It offers users real-time monitoring of resource information such as CPU and memory.
  The console, implemented with Java, is interactive and easy to use.
  Job scheduling using JPBS through the Internet
CATS-i offers many resource-related functions.
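Applying one operation to all or selected nodes across cluster groups can be sketched like this. The function name, the `groups` dictionary layout, and the thread-pool approach are illustrative assumptions rather than a description of how CATS-i is built; `operation` stands for whatever call would actually reach a node.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_nodes(groups, operation, group=None, selected=None, max_workers=8):
    """Apply operation(node) to the chosen nodes and collect the results.

    groups maps a cluster group name to a list of node names.
    group restricts the run to one cluster group (None means all groups).
    selected restricts the run to specific nodes (None means all nodes).
    """
    targets = []
    for name, nodes in groups.items():
        if group is not None and name != group:
            continue
        for node in nodes:
            if selected is None or node in selected:
                targets.append(node)
    # Run the operation on all targets concurrently; pool.map keeps order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(operation, targets))
    return dict(zip(targets, results))
```

For example, `run_on_nodes(groups, reboot, group="groupA")` would call a hypothetical `reboot` function for every node in one group. An exception raised by `operation` on any node propagates to the caller.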
CATS-i functions
Node status
  CPU, memory, process, user list, account
  Disk space
File management
Alarm
System log
Shutdown/Reboot
Package management
JPBS
Management tools – Node status
It shows node information for each group
Real-time information about CPU and memory
[Figure: total view]
Management tools – Node status
It enables users to monitor resource information of cluster nodes, such as CPU, memory, accounts, and users. It covers real-time CPU and memory monitoring, process monitoring, and management of basic information.
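On Linux, the CPU and memory figures that a client daemon reports could be derived from /proc/stat and /proc/meminfo, as in the sketch below. The slides do not say how CATS-i reads this data, so this is one plausible approach; the function names are made up for the example. The parsers take file contents as text so they can be tried without a live system.

```python
def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user nice system idle iowait irq softirq steal
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return idle, sum(values)
    raise ValueError("no aggregate cpu line found")

def cpu_utilization(before, after):
    """Percent of CPU time spent busy between two /proc/stat samples."""
    idle0, total0 = parse_cpu_times(before)
    idle1, total1 = parse_cpu_times(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / elapsed)

def memory_usage(meminfo_text):
    """Percent of memory in use, from /proc/meminfo contents."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    total = info["MemTotal"]
    # Prefer MemAvailable when the kernel provides it.
    free = info["MemAvailable"] if "MemAvailable" in info else info["MemFree"]
    return 100.0 * (total - free) / total
```

A daemon would read /proc/stat twice, a second or so apart, and pass both samples to `cpu_utilization`, since the counters are cumulative since boot.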
[Figure panels: Performance, Process, Disk, Account, User List]
Management tools – file management
It provides file management functions for a cluster group.
It is very easy to use
  To perform a file-related job, users just right-click to show a pop-up menu.
[Figure: File Management]
Management tools – alarm function
Monitors important system parameters
  Processor utilization, memory usage, etc.
Notifications are sent by e-mail using system functions.
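A threshold check followed by an e-mail notification could be sketched as below. The metric names, threshold values, and message wording are hypothetical; the slides only state that important parameters are monitored and that notification goes out by e-mail. The sketch builds the message with Python's standard `email` package and leaves delivery to the caller.

```python
from email.message import EmailMessage

def check_alarms(metrics, thresholds):
    """Return (name, value, limit) for every metric above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in sorted(metrics.items())
        if name in thresholds and value > thresholds[name]
    ]

def build_alert(node, alarms, sender, recipient):
    """Build the notification e-mail for one node (does not send it)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[alarm] {node}: {len(alarms)} parameter(s) over threshold"
    lines = [f"{name} = {value} (limit {limit})" for name, value, limit in alarms]
    msg.set_content("\n".join(lines))
    return msg
```

The resulting message could be handed to `smtplib.SMTP(...).send_message(msg)` on a host with a mail relay configured.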
Management tools – system log
Log information is very useful in various situations
The server daemon collects log information from each node
[Figure: Log Tree]
Management tools – RPM package
Users can install, remove, and upgrade application packages with the management tool, and query installed RPMs
Supports Red Hat Linux
It is implemented with a thread library
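The package operations map naturally onto the standard `rpm` command-line flags (`-i` install, `-U` upgrade, `-e` remove, `-q` query), and one thread per node is a simple way to run them in parallel. The sketch below is an illustration only: the slides do not show how CATS-i invokes RPM, and the `run` callable is a hypothetical stand-in for whatever mechanism executes a command on a node.

```python
import threading

# Standard rpm flags for the four operations named on the slide.
RPM_FLAGS = {"install": "-i", "upgrade": "-U", "remove": "-e", "query": "-q"}

def rpm_command(action, package):
    """Build the rpm argument list for one action.

    install and upgrade expect a package file; remove and query expect
    an installed package name.
    """
    if action not in RPM_FLAGS:
        raise ValueError(f"unknown action: {action}")
    return ["rpm", RPM_FLAGS[action], package]

def apply_to_nodes(nodes, action, package, run):
    """Run the same rpm command on every node, one thread per node.

    run(node, argv) is supplied by the caller (hypothetical transport).
    """
    argv = rpm_command(action, package)
    results = {}
    lock = threading.Lock()

    def worker(node):
        outcome = run(node, argv)
        with lock:  # protect the shared results dictionary
            results[node] = outcome

    threads = [threading.Thread(target=worker, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```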
[Figure: Option Dialog]
Management tools – PBS Interface
It enables users to use a general PBS through the same CATS-i interface.
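A job submission dialog ultimately has to produce something PBS can accept, typically a script with `#PBS` directives passed to `qsub`. The sketch below generates such a script from a few form fields. It is an assumption about what a front end like JPBS might emit, not taken from the slides; the parameter names are invented, and the `nodes=N:ppn=M` resource syntax follows OpenPBS/Torque conventions.

```python
def pbs_script(name, command, nodes=1, ppn=1, walltime="01:00:00", queue=None):
    """Return the text of a PBS job script for one command."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {name}",                  # job name
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        f"#PBS -l walltime={walltime}",      # maximum run time
    ]
    if queue:
        lines.append(f"#PBS -q {queue}")     # destination queue
    # PBS sets PBS_O_WORKDIR to the directory qsub was run from.
    lines += ["cd $PBS_O_WORKDIR", command]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and running `qsub` on it would submit the job on a host with PBS installed.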
[Figure: JPBS main screen and job submission dialog]
Conclusion & Future works
CATS-i will offer more functions such as
  Status of CPU temperature, voltage, and speed
  Extended aggregation of services
    Statistical memory and CPU information for each user
    Statistical information can be displayed graphically
  Network monitoring using SNMP and network analysis
    Detects network bottlenecks of clusters
  Enhanced alarm services
    The administrator can specify the alarm condition and the action to be taken
    In an emergency, CATS-i can shut down or reboot cluster nodes