
First Implementation of a Diskless Computer Farm for LHCb
Vincenzo Vagnoni
Bologna, June 13, 2001
Outline

- Hardware overview
  - Motherboards and rack mount boxes
  - Disk Storage
  - Remote Power Control
- Network boot
  - Preboot eXecution Environment (PXE)
  - Server side daemons
- System Configuration
  - Linux kernel preparation
  - The operating system
  - System administration
- Conclusions

Schematic Representation
[Diagram: login nodes and job execution nodes, each built around CPUs, an AGP chipset, DRAM, a PCI bus, a graphics card (GC) and a network interface (NI); login nodes additionally have IDE swap disks. All nodes connect through an Ethernet switch to the Network Attached Storage, whose host adapter drives a switched node backplane of nodes with disk interfaces on Ultra ATA channels.]

Motherboards I

- 9 bi-processor motherboards (GigaByte 6VXDC7)
  - Upgrade to 25 foreseen after summer
- 2 Pentium III 866 MHz
- 512 MB RAM (non-ECC)
  - 2×256 MB modules
- Only two peripherals: graphics card and network adapter
- Completely diskless

Motherboards II

- 100 Mb NIC equipped with a boot PROM
  - 3Com 3C905C-TX with Managed Boot Agent v4.30 (Lanworks) and PXE v2.20
- On-board hardware health monitoring chips (Inter-Integrated Circuit, I2C, compatible; Linux "lm_sensors" drivers exist) for temperature and fan speed readout, as sketched below
- Arranged in a 2U rack-mounted box also hosting a standard power supply and 3 fans
- Current absorption: 300 mA (idle), 600 mA (200% load)

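A minimal sketch of reading these I2C sensors with the lm_sensors package; the module names are assumptions for a VIA-chipset board like this one, and the sensors-detect script should be run to identify the right drivers:

    # Load the I2C bus and sensor chip drivers (names as suggested by
    # lm_sensors' sensors-detect; assumed here for a VIA-based board)
    modprobe i2c-viapro
    modprobe via686a
    # Print current temperatures, fan speeds and voltages
    sensors
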
What they look like
[Photograph of the 2U rack-mounted boxes]

Disk Storage

- Network Attached Storage solution: RaidZone OpenNAS RS15-R1200
- 14×80 GB EIDE hard disks (+ 1 automatic hot spare)
- Hardware-controlled RAID-5 (total usable disk space 1 TB)
- Dual Pentium III 800 MHz
- 256 MB ECC RAM
- Two network cards configured in port trunking (200 Mb/s); upgrade to Gigabit is possible
- Dual redundant power supply
- Operating system: RedHat Linux patched by RaidZone
- Suggested file system: ReiserFS (we use it)

What it looks like
[Photograph of the RaidZone NAS unit]

Considerations on the NAS I

- System reliable and pretty stable
  - With the latest available kernel (a 2.2.17 patched by RaidZone) no problems were observed in about one month of continuous intensive usage
- Very flexible
  - It looks to the administrator like a normal RedHat Linux system
  - It can host any kind of service (PXE, DHCP, Apache, ...), and it does in our case

Considerations on the NAS II

- Good performance
  - Almost the full network bandwidth is used for (not small) file transfers through NFS (about 160 Mb/s)
  - About 50 MB/s local reading, 35 MB/s local writing (with RAID-5 and ReiserFS); very close to the performance of real-life SCSI Ultra 160 RAID-5 arrays
  - Performance more than adequate for MC production; for analysis jobs more thinking is needed
- Very compact
  - 1 TB (could be 2 TB using recent 144 GB disks) in about 4U
- Fairly cheap
  - $20,000 in the US, somewhat more in Italy

Remote Power Control

- Even if a Linux-based system is usually rather stable, it can happen that a system hangs
  - In general this event can be not so rare in large installations
- Possible solution: remote control of the PCs' power input
- National Instruments distributed I/O modules, controlled via the network
  - FieldPoint Ethernet controller FP1600
  - It controls up to nine FieldPoint FP-RLY-420 modules
  - Each FP-RLY-420 is equipped with 8 independent relays, i.e. 8 channels
  - Total: at most 8×9 = 72 independent power channels handled by one Ethernet controller
  - A client GUI is provided for Windows (not for Linux so far)
- With this last instrument the system can be controlled almost completely from remote sites, except in case of serious failures
- It can help system administration a lot

What it looks like (in our arrangement)
[Drawings: FPBOX, network-controlled 32-channel power switch (Giulio Avoni, 23/03/2001)]
- Front view: PS-2 power supply, 20 V 0.8 A (NI code 777584-04); RJ45 TCP/IP socket; FP-ENCI windowed polycarbonate cover enclosure (NI code 777596-01); internal TCP/IP link; FP1600 10/100 Mbps network module (NI code 777792-00); FP-TB1 terminal base (NI code 777519-01); DIN rail; FP-RLY-420 with 8 SPST relay outputs (NI code 777518-420)
- Rear view: RJ45 TCP/IP socket; ON/OFF power switch; 32 independent outputs on IEC female sockets, 220 V 16 A; CEE standard plug

Hardware Summary

Installed:
- Racks: 1
- Commodity bi-processor motherboards: 9
  - 100Base-TX network IF with boot PROM
  - Intel Pentium III 866 MHz CPUs
  - 512 MB RAM
- Network Attached Storage RAID-5 RaidZone RS15-R1200: 1
  - Dual-processor Pentium III 800 MHz
  - 15 IDE 80 GB Ultra ATA/100 disks
  - Dual 100 Mbps network IF (configured in port trunking)
  - Dual redundant 300 W hot-swap power supply
- 56-port modular Fast Ethernet switch: 1
- Modular remote power control switch: 1 (16 channels)

Putting it all together
[Photograph of the complete installation]

The Network Boot I

- Each booting client must be equipped with network boot code installed either in the system BIOS or in a PROM on the network interface card
  - In our case we make use of 3Com 3C905C-TX NICs with an on-card boot PROM
- Several pre-boot procedures are available on today's NICs
  - Novell RPL, based on NetWare: requires a Novell server or emulator... forget it
  - TCP/IP, i.e. DHCP/TFTP based
  - Intel PXE (our choice), similar to the TCP/IP procedure but more flexible and probably going to become a standard

The Network Boot II

- Preboot eXecution Environment (PXE) protocol
  - Client preboot code available on most modern NICs (3Com or Intel suggested)
  - Intel defined the protocol and provides a set of APIs for writing server code
  - RedHat developed and distributes a package to serve boot images to PXE clients
- Three phases
  - Pre-boot phase: the client configures its network by means of (extended) DHCP requests and downloads the boot image(s)
  - Kernel-boot phase: the kernel boots and makes a new (standard) DHCP request, then mounts the root file system over NFS
  - Operating system boot phase: the operating system can now boot more or less as usual

PXE Boot Sequence

1. PXE client → DHCP/Proxy-DHCP server: DHCP Discover to port 67; contains "PXEClient" extension tags
2. DHCP/Proxy-DHCP server → PXE client: extended DHCP Offer to port 68; contains PXE server extension tags + [other DHCP option tags] + boot server list, client IP address, multicast discovery IP address
3. PXE client → DHCP/Proxy-DHCP server: DHCP Request to port 67; contains "PXEClient" extension tags + [other DHCP option tags]
4. DHCP/Proxy-DHCP server → PXE client: DHCP Acknowledge reply to port 68
5. PXE client → boot server: Boot Service Discover to port 67 or 4011; contains "PXEClient" extension tags + [other DHCP option tags]
6. Boot server → PXE client: Boot Service Ack reply to the client's source port; contains [PXE server extension tags] with the Network Bootstrap Program file name
7. PXE client → M/TFTP server: Network Bootstrap Program download request to TFTP port 69 or the MTFTP port (from the Boot Service Ack)
8. M/TFTP server → PXE client: Network Bootstrap Program download to the client's port
9. The PXE client executes the downloaded boot image

The server side daemons

- Three different servers are necessary
- DHCP (a minimal configuration sketch is given after this list)
  - Provides standard DHCP information
- PXE (RedHat provides an implementation)
  - Provides the non-standard DHCP extensions specified by the PXE protocol through a Proxy-DHCP service, e.g. information for multiple boot menu options that the user can choose interactively at boot time
  - This can be useful, for example, to boot different kernel images during tests or to boot diagnostic programs (e.g. memtest86 to test memory health)
- TFTP or MTFTP (Multicast TFTP, also provided by RedHat)
  - Downloads to the clients the Network Bootstrap Program (NBP), the Linux kernel image and optionally an initial ram disk image
  - The NBP is a small piece of code that takes control, downloads the Linux kernel and can pass configuration parameters to it
  - A multicast-based implementation of TFTP can be useful when several machines occasionally boot simultaneously, to avoid network overload

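A minimal sketch of the server-side DHCP configuration for one diskless client, in ISC dhcpd syntax; all addresses, host names and paths here are hypothetical, and the PXE Proxy-DHCP extensions are configured separately in RedHat's pxe package:

    # /etc/dhcpd.conf -- minimal entries for one PXE/diskless node
    subnet 192.168.1.0 netmask 255.255.255.0 {
        option routers 192.168.1.1;
    }

    host node01 {
        hardware ethernet 00:01:02:03:04:05;        # MAC of the 3C905C NIC
        fixed-address 192.168.1.11;                 # static IP for the node
        next-server 192.168.1.2;                    # TFTP server (the NAS)
        filename "linux.0";                         # Network Bootstrap Program
        option root-path "192.168.1.2:/exports/nodes/node01";  # NFS root
    }
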
Linux Kernel Preparation

- To perform a remote boot with a diskless configuration, the Linux kernel must be prepared accordingly (>2.2.17 is suggested)
  - No patches are necessary, just some changes in the configuration before compile time (see the sketch after this list)
- What happens after the kernel download
  - When the kernel completes its boot procedure it still doesn't have a mounted file system, and in order to reach the remote NFS file server it needs to configure its network parameters dynamically
    - It has to make a DHCP request
    - To do that, the kernel must be compiled with built-in DHCP client support, a built-in network adapter driver, and network auto-configuration enabled
  - After the network adapter has been auto-configured, the kernel can start to mount the root file system over NFS
    - The NFS server address has already been provided by means of DHCP
    - To proceed with this operation, NFS client support must be compiled into the kernel and the root-over-NFS option must be enabled

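A sketch of the relevant kernel configuration options and boot parameters; the option names follow the kernel's nfsroot documentation (CONFIG_VORTEX is the 3Com 3c59x/3c90x driver matching these NICs), and exact names may differ slightly between kernel versions:

    # .config options for a diskless NFS-root client
    CONFIG_IP_PNP=y          # kernel-level IP auto-configuration
    CONFIG_IP_PNP_DHCP=y     # ...obtained through a DHCP request at boot
    CONFIG_VORTEX=y          # 3Com NIC driver built in, not a module
    CONFIG_NFS_FS=y          # NFS client support built into the kernel
    CONFIG_ROOT_NFS=y        # allow the root file system to live on NFS

    # Boot parameters passed to the kernel by the NBP (hypothetical paths):
    #   root=/dev/nfs nfsroot=192.168.1.2:/exports/nodes/node01 ip=dhcp
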
Operating System Configuration

- Once the root file system (which resides on the NAS) has been mounted, the system proceeds more or less as usual
- We installed the CERN-certified RedHat 6.1 release, with slight changes to some startup and shutdown scripts (for example, to delay network and NFS shutdown until the rest of the shutdown procedure has terminated)
- Directory sharing is similar to that of a typical cluster configuration: some directories must be private and resident in the root tree of each node (/var, /tmp, /dev, /etc, /lib, /bin, /sbin), while others are shared among the nodes (/usr, /opt, /home, ...); a sketch of the corresponding NFS exports is given below

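A minimal sketch of how the private and shared trees could be exported from the NAS in /etc/exports; the paths and host names are hypothetical, and the real export list covers all nodes:

    # /etc/exports on the NAS -- one private root tree per node,
    # shared trees exported to every node
    /exports/nodes/node01   node01(rw,no_root_squash)
    /exports/nodes/node02   node02(rw,no_root_squash)
    /exports/usr            node01(ro) node02(ro)
    /exports/opt            node01(ro) node02(ro)
    /exports/home           node01(rw) node02(rw)
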
System Administration

- System administration becomes rather simpler with a centralized file system
  - The file systems of every node are all accessible under the NAS file system at the same time (e.g. no need to log into different machines to edit, delete or move files)
  - System backups can be centralized and performed in one single step on the NAS
  - No risk of damage if a system is hard rebooted (no fsck, because no local disk is used)
- To install the OS on a new machine, a simple script that duplicates some directories and performs a few simple operations on the NAS is sufficient (see the sketch after this list)
  - A new installation is performed in 30 seconds
  - No need to develop packages to automate installations on each different node
  - The installations are identical by default (the file systems are built by simply copying the directories from a central repository on the NAS)

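A minimal sketch of such an installation script, run on the NAS; the repository path, naming convention and node-specific steps are assumptions for illustration:

    #!/bin/sh
    # new-node.sh -- create the private root tree for a new diskless node
    # by copying the central repository on the NAS (hypothetical paths)
    NODE=$1
    TEMPLATE=/exports/template
    ROOT=/exports/nodes/$NODE

    mkdir -p $ROOT
    # Duplicate the private directories from the central repository
    for d in var tmp dev etc lib bin sbin; do
        cp -a $TEMPLATE/$d $ROOT/
    done
    # Node-specific touch-ups, e.g. set the host name (RedHat keeps it
    # in /etc/sysconfig/network; the template line is assumed absent)
    echo "HOSTNAME=$NODE" >> $ROOT/etc/sysconfig/network
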
Diskless configuration drawbacks

- There are some possible drawbacks to address
  - The absence of a disk swap area essentially means that the job's memory demand must fit strictly into the RAM, otherwise the job is abruptly terminated
  - Anyway, one can argue that it doesn't make much sense to do intensive computing on a system that is heavily swapping memory pages
    - Instead of buying a local disk, buy more RAM!
  - On the contrary, on machines dedicated to interactive sessions a local swap area can be necessary and should be added (a sketch is given below)

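A minimal sketch of enabling a local swap area on an interactive node, e.g. from a startup script; the device name is a hypothetical local disk partition:

    # Initialise and enable a local swap partition on an interactive node
    # (run mkswap only once, when the disk is first set up)
    mkswap /dev/hda2
    swapon /dev/hda2
    # Or make it permanent via /etc/fstab:
    #   /dev/hda2   swap   swap   defaults   0 0
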
Status Summary

- The farm prototype, in its diskless implementation, is installed at CNAF (Bologna) and is fully operational
  - 9 machines (18 processors) with PIII at 866 MHz available up to now; upgrade to 25 machines foreseen after summer
  - First tests of intensive Monte Carlo production of minimum bias events (jobs scheduled by PBS) have been in progress for a couple of weeks, and no problems have been observed
- First release of monitoring tools available (see Domenico's talk)
  - System health (temperatures, fan speeds)
  - Disk availability
  - CPU loads
  - Network load
  - Batch queue length

Conclusions

- An overview of the main concepts behind the hardware design and the system configuration of the installed farm prototype has been given
  - Rack mounted
  - Completely diskless nodes, except those dedicated to interactive sessions
  - Network boot through Intel PXE
  - Node file systems, disk data storage and basic services provided by a NAS with a 1 TB disk array in RAID-5
  - Remote power management
  - Linux kernel preparation and operating system configuration
  - Other details can be found in LHCb Computing note 2001-088
- We are on the way...