FarmVV - INFN Bologna
First Implementation of a
Diskless Computer Farm
for LHCb
Vincenzo Vagnoni
Bologna, June 13, 2001
Outline
Hardware overview
Motherboards and rack mount boxes
Disk Storage
Remote Power Control
Network boot
Preboot eXecution Environment (PXE)
Server side daemons
System Configuration
Linux kernel preparation
The operating system
System administration
Conclusions
First Implementation of a Diskless Computer Farm for LHCb
Vincenzo Vagnoni
Schematic Representation
[Figure: schematic of the farm. Login nodes (two CPUs, AGP chipset, DRAM, graphics card and network interface on the PCI bus, plus a local IDE disk used for swap) and diskless job execution nodes (same layout without the IDE disk) are connected through an Ethernet Switch to the Network Attached Storage (two CPUs, DRAM, a host adapter on the PCI bus, and a switched node backplane whose switched nodes each carry a disk interface to an Ultra ATA disk).]
Motherboards I
9 bi-processor motherboards (GigaByte 6VXDC7)
Upgrade to 25 foreseen after summer
2 Pentium III 866 MHz
512 MB RAM (non ECC)
2*256 MB modules
Only two peripherals: graphics card and network adapter
Completely Diskless
Motherboards II
100 Mb NIC equipped with a Boot PROM
3Com 3C905C-TX with Managed Boot Agent v4.30
(Lanworks) and PXE v2.20
Onboard hardware health monitoring chips (Inter-Integrated Circuit – I2C – compatible; Linux “lm_sensors” drivers exist) for temperature and fan speed readout
Arranged in a 2U rack mounted box hosting also a
standard power supply and 3 fans
Current absorption: 300 mA (idle), 600 mA (200%
load)
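As an illustrative sketch of the readout (not part of the installation), the textual output of the lm_sensors `sensors` command can be parsed along these lines; the sample output is a made-up example of the typical format, not a reading from our boards:

```python
import re

def parse_sensors(text: str) -> dict:
    """Extract '<label>: <value>' readings for temperature and fan lines."""
    readings = {}
    for line in text.splitlines():
        # Match lines such as "temp1:  +38.0 C ..." or "fan1:  5192 RPM ..."
        m = re.match(r"(temp\d*|fan\d*):\s*\+?([\d.]+)", line, re.IGNORECASE)
        if m:
            readings[m.group(1)] = float(m.group(2))
    return readings

# Made-up sample in the usual lm_sensors output format
sample = """\
temp1:     +38.0 C  (limit = +60 C)
temp2:     +41.5 C  (limit = +60 C)
fan1:      5192 RPM (min = 3000 RPM)
fan2:      5443 RPM (min = 3000 RPM)
"""
readings = parse_sensors(sample)
# readings == {'temp1': 38.0, 'temp2': 41.5, 'fan1': 5192.0, 'fan2': 5443.0}
```

A monitoring daemon could run such a parser periodically on each node and push the values to a central collector.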
What they look like
Disk Storage
Network Attached Storage solution
RaidZone OpenNAS RS15-R1200
14*80 GB EIDE Hard Disks (+ 1 auto hot spare)
Hardware controlled RAID-5 (total usable disk space 1 TB)
Dual Pentium III 800 MHz
256 MB ECC RAM
Two Network Cards configured in port trunking (200
Mb/sec) - upgrade to Gigabit is possible
Dual redundant power supply
Operating System: RedHat Linux patched by RaidZone
Suggested File System: ReiserFS (we use it)
What it looks like
Considerations on the NAS I
System reliable and pretty stable
With the latest available kernel (a 2.2.17 patched by RaidZone) no problems have been observed in about one month of continuous intensive usage
Very flexible
It looks to the administrator like a normal RedHat Linux
system
It can host any kind of services (PXE, DHCP, APACHE, …),
and it does in our case
Considerations on NAS II
Good performance
Almost full network bandwidth used for (not small) file transfers through NFS (about 160 Mb/sec)
Performance more than adequate for MC production; for analysis jobs more thought is needed
Very compact
About 50 MB/sec local reading, 35 MB/sec local writing (with RAID-5 and ReiserFS) – very close to real-life SCSI Ultra 160 RAID-5 array performance
1 TB (could be 2 TB by using recent 144 GB disks) in about 4U
Fairly cheap
$20,000 in the US, somewhat more in Italy
Remote Power Control
Even though a Linux based system is usually rather stable, it can happen that a system hangs up
In general this is not such a rare event in large installations
Possible solution: remote control of PCs power input
National Instruments Distributed I/O modules, controlled via network
FieldPoint Ethernet Controller FP1600
It controls up to nine FieldPoint FP-RLY-420 modules
Each FP-RLY-420 is equipped with 8 independent relays, i.e. 8 channels
Total: 8*9=72 independent power channels at maximum handled by one
ethernet controller
Client GUI provided for Windows (not yet for Linux)
With this last instrument the system can be controlled almost completely from remote sites, except in case of serious failures
This greatly eases system administration
What it looks like (in our arrangement)
[Figure: FPBOX, network-controlled 32-channel power switch (drawings by Giulio Avoni, 23/03/2001).
Front view: PS-2 power supply 20 V, 0.8 A (NI cod. 777584-04); RJ45 TCP/IP socket; FP-ENCI windowed polycarbonate cover enclosure (NI cod. 777596-01); internal TCP/IP link; FP1600 10/100 Mbps network module (NI cod. 777792-00); FP-TB1 terminal base (NI cod. 777519-01) mounted on a DIN rail; FP-RLY-420 modules with 8 SPST relay outputs each (NI cod. 777518-420).
Rear view: ON/OFF power switch; 32 independent outputs on IEC female sockets, 220 V, 16 A; CEE standard plug.]
Hardware Summary
Installed:
1 rack
9 commodity bi-processor motherboards:
  100Base-TX network IF with boot PROM
  Intel Pentium III 866 MHz CPUs
  512 MB RAM
1 Network Attached Storage RAID-5 RaidZone RS15-R1200:
  Dual Pentium III 800 MHz processors
  15 IDE 80 GB Ultra ATA/100 disks
  Dual 100 Mbps network IF (configured in port trunking)
  Dual redundant 300 W hot swap power supply
1 56-port modular Fast Ethernet switch
1 modular remote power control switch (16 channels)
Putting it all together
The Network Boot I
Each booting client must be equipped with a network boot code
installed either in the system BIOS or in a PROM on the
Network Interface Card
In our case we make use of 3COM 3C905C-TX NICs with on card
boot PROM
Several pre-boot procedures are available on today's NICs
Novell RPL, based on Netware: requires a Novell server or
emulator… forget it
TCP/IP, i.e. DHCP/TFTP based
Intel PXE (our choice), similar to TCP/IP procedure but more
flexible and probably going to become a standard
The Network Boot II
Preboot eXecution Environment (PXE) protocol
Client preboot code available on most modern NICs (suggested 3COM or
Intel)
Intel defined the protocol and provides a set of APIs to write server codes
RedHat developed and distributes a package to serve boot images to PXE
clients
Three phases
Pre-boot phase: the client configures its network by means of (extended) DHCP requests and downloads the boot image(s)
Kernel-boot phase: the kernel boots and makes a new (standard) DHCP
request, then mounts the root file system over NFS
Operating System boot phase: the operating system can now boot more or
less as usual
PXE Boot Sequence
1. PXE client → DHCP/Proxy-DHCP server: DHCP Discover to port 67, containing the “PXEClient” extension tags
2. DHCP/Proxy-DHCP server → client: extended DHCP Offer to port 68, containing the PXE server extension tags + [other DHCP option tags] + boot server list, client IP address and multicast discovery IP address
3. PXE client → DHCP/Proxy-DHCP server: DHCP Request to port 67, containing the “PXEClient” extension tags + [other DHCP option tags]
4. DHCP/Proxy-DHCP server → client: DHCP Acknowledge reply to port 68
5. PXE client → boot server: Boot Service Discover to port 67 or 4011, containing the “PXEClient” extension tags + [other DHCP option tags]
6. Boot server → client: Boot Service Ack reply to the client source port, containing the [PXE server extension tags] (which carry the Network Bootstrap Program file name)
7. PXE client → M/TFTP server: Network Bootstrap Program download request to TFTP port 69, or to the MTFTP port taken from the Boot Service Ack
8. M/TFTP server → client: Network Bootstrap Program download to the client's port
9. PXE client: executes the downloaded boot image
The server side daemons
Three different servers necessary
DHCP
PXE (RedHat provides an implementation)
Provides standard DHCP information
Provides non-standard DHCP extensions specified by the PXE protocol through a Proxy-DHCP service, e.g. the information for a multiple-option boot menu that the user can choose from interactively at boot time
This can be useful, for example, to boot different kernel images during tests or to
boot diagnostic programs (e.g. memtest86 to test memory health)
TFTP or MTFTP (Multicast TFTP, also provided by RedHat)
Downloads to the clients the Network Bootstrap Program (NBP), the Linux kernel
image and optionally an initial ram disk image
The NBP is a small piece of code that takes control, downloads the Linux kernel and can pass configuration parameters to it
The multicast based implementation of TFTP can be useful when several machines occasionally boot simultaneously, to avoid network overload
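As an illustration of the standard DHCP part, an ISC dhcpd host entry for one diskless node might look as follows; the host name and all addresses are made-up examples, and the PXE extension tags are served separately by the proxy-DHCP daemon:

```
host node01 {
  hardware ethernet 00:A0:C9:11:22:33;
  fixed-address 192.168.1.101;
  next-server 192.168.1.2;                              # TFTP server (the NAS in our setup)
  option root-path "192.168.1.2:/export/nodes/node01";  # NFS root for the kernel-boot phase
}
```

One such entry per node pins each motherboard, identified by its MAC address, to a fixed IP address and to its private root tree on the NAS.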
Linux Kernel Preparation
To perform a remote boot with a diskless configuration, the Linux kernel must be prepared accordingly (version 2.2.17 or later is suggested)
No patches are necessary, only some configuration changes before compile time
What happens after kernel download
When the kernel completes its boot procedure it still does not have a mounted file system, and in order to reach the remote NFS file server it needs to configure its network parameters dynamically
It has to make a DHCP request
To do that the kernel must be compiled with built-in DHCP client support, built-in
Network adapter driver and network auto-configuration enabled
After the network adapter has been auto-configured the kernel can start to
mount the root file system over NFS
the NFS server address has been already provided by means of DHCP
To proceed with this operation NFS client support must be compiled resident into
the kernel and the root over NFS option must be enabled
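As a concrete illustration of the options listed above, a kernel configuration fragment for an NFS-root boot might look as follows (option names as in the 2.4 kernel series; the 2.2 series uses a very similar set, and the NIC driver symbol depends on the card):

```
CONFIG_IP_PNP=y        # kernel-level IP autoconfiguration at boot
CONFIG_IP_PNP_DHCP=y   # built-in DHCP client support
CONFIG_VORTEX=y        # NIC driver built in (3c59x family, covers the 3C905C)
CONFIG_NFS_FS=y        # NFS client support compiled into the kernel
CONFIG_ROOT_NFS=y      # root file system over NFS
```

The NBP then passes arguments such as `root=/dev/nfs ip=dhcp`, plus an `nfsroot=` path, on the kernel command line.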
Operating System Configuration
Once the root file system (that is placed on the NAS)
has been mounted, the system proceeds more or less
as usual
We installed the CERN certified RedHat 6.1 release,
with slight changes in some startup and shutdown
scripts (for example to delay network and NFS
shutdown until the rest of the shutdown procedure is
terminated)
Directory sharing is similar to that of a typical
cluster configuration: some directories must be
private and resident in the root tree for each node
(/var, /tmp, /dev, /etc, /lib, /bin, /sbin), while some
others are shared among the nodes (/usr, /opt,
/home, …)
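As an illustration of this sharing scheme, a node's /etc/fstab might look as follows; the server name “nas” and the export paths are made-up examples, and the root line is shown only for completeness, since with root-over-NFS the kernel has already mounted it from the nfsroot parameter:

```
nas:/export/nodes/node01  /      nfs   rw,hard,intr  0 0
nas:/export/usr           /usr   nfs   ro,hard,intr  0 0
nas:/export/opt           /opt   nfs   ro,hard,intr  0 0
nas:/export/home          /home  nfs   rw,hard,intr  0 0
none                      /proc  proc  defaults      0 0
```

The private trees (/var, /tmp, /dev, /etc, /lib, /bin, /sbin) live inside each node's root export, while the shared trees are mounted from common exports, read-only where possible.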
System Administration
System administration becomes considerably simpler with a centralized filesystem
The file systems of every node are all accessible under the NAS filesystem
at the same time (e.g. no need to perform logins on different machines to
edit files, delete, move, etc.)
System backups can be centralized and performed in one single step on the
NAS
No risk of damage if a system is hard rebooted (no fsck because no local
disk is used)
To perform the OS installation of a new machine, a simple script that duplicates some directories and performs a few simple operations on the NAS is sufficient
A new installation is performed in 30 seconds
No need to develop packages to automate installations on each different node
The installations are by default identical (the filesystems are built by simply
copying the directories from a central repository on the NAS)
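The “new node in 30 seconds” idea can be sketched as follows; the paths, directory list and HOSTNAME customization are made-up illustrations, not the actual script:

```python
import shutil
from pathlib import Path

# Private directories that each node gets as its own copy (shared trees
# such as /usr, /opt and /home are NFS-mounted instead).
PRIVATE_DIRS = ["etc", "var", "tmp", "dev", "lib", "bin", "sbin"]

def install_node(template: Path, nodes_root: Path, hostname: str) -> Path:
    """Build a new node root tree by duplicating the central template."""
    node_root = nodes_root / hostname
    for d in PRIVATE_DIRS:
        # A real script would use cp -a or tar so that device nodes under
        # /dev survive the copy; copytree is enough for this sketch.
        shutil.copytree(template / d, node_root / d)
    # Per-node customization: essentially only the hostname differs.
    (node_root / "etc" / "HOSTNAME").write_text(hostname + "\n")
    return node_root
```

Since every node root is stamped out of the same template, the installations are identical by construction, which is exactly the property noted above.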
Diskless configuration drawbacks
There are some possible drawbacks to
address
The absence of a disk swap area essentially means
that the job memory demand must strictly fit into
the RAM, otherwise the job is abruptly terminated
In any case, one can argue that it does not make much sense to do intensive computing on a system that is heavily swapping memory pages
Instead of buying a local disk, buy more RAM!
By contrast, on machines dedicated to interactive sessions a local swap area can be necessary and should be added
Status Summary
The farm prototype in its diskless implementation is installed at
CNAF (Bologna) and is fully operational
9 machines (18 processors) with PIII at 866 MHz available up to
now - upgrade to 25 machines foreseen after summer
First tests on intensive Monte Carlo production of minimum bias events (jobs scheduled by PBS) have been in progress for a couple of weeks and no problems have been observed
First release of tools available for monitoring (see Domenico’s
talk)
system health (temperatures, fan speed)
disk availability
CPU loads
network load
batch queue length
Conclusions
An overview of the main concepts for the hardware design and
the system configuration of the installed farm prototype has
been given
Rack mounted
Completely diskless nodes, except those dedicated to interactive
sessions
Network Boot through Intel PXE
Node file systems, disk data storage and basic services provided by a NAS with a 1 TB disk array in RAID-5
Remote power management
Linux kernel preparation and operating system configuration
Other details can be found in LHCb Computing Note 2001-088
We are on the way…