CLRC-RAL site report - HEPiX Services at CASPUR

RAL Tier 1/A Status
HEPiX-HEPNT
NIKHEF, May 2003
Martin Bly
RAL CSF Tier 1/A
CPU Farm – Existing Hardware
• 108 dual processors (450, 600 and 1GHz)
– Up to 1GB RAM
– Desktop towers on warehouse shelves
• 156 dual processor 1400MHz PIII
– 133MHz FSB, 1GB RAM each
– 1U rackmount, remote power switching
– RedHat 7.2
New Hardware – Spring 2003 +
• 80 dual processor 1U rackmount units
– 2 x 2.66GHz P4 Xeons @ 533MHz FSB
– Hyper-threading
– 2048Mbyte memory
– 2x1Gb/s NICs (o/b)
– RedHat 7.3
– 3 racks, remote power switching
• Next delivery expected Summer 2003
Operating Systems
• Operating Systems:
– RedHat 6.2 service will close at the end of May
– RedHat 7.2 service has been in production for
BaBar for 6 months
– New RedHat 7.3 service now available for
LHC/other experiments
– Testing/benchmarking on new Xeon systems
• Increasing demands for security updates
are becoming problematic.
Disk Farm – Existing Hardware
• 2002 – 26 servers, each with 2 external RAID
arrays - 1.7TB disk per server, RAID 5:
– Excellent performance, well balanced system
– Problems with a bad batch of Maxtor drives –
many failures and high error rate – all 620
drives now replaced by Maxtor.
– Still outstanding problems with Accusys
controller failing to eject bad drives from RAID
set.
Disk Farm – Spring 2003 +
• Recent upgrade to disk farm:
– 11 dual P4 Xeon servers (2.4GHz, 1024MB RAM, PCI-X), each
with 2 Infortrend IFT-6300 arrays via Ultra160 SCSI
– 12 Maxtor 200GB DiamondMax Plus 9 drives per array, RAID 5.
• Not yet in production – but a few snags:
– Originally tendered Maxtor Maxline Plus II drive was found not to
exist!
– Infortrend array has 2TB limit per RAID set – pushing for a
firmware update.
– 11+1 spare better than 2 x 6 – 5Gb over 11 systems.
• Nick White ([email protected]) for more info.
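The 2TB-per-RAID-set snag can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that the "2TB limit" is the common 32-bit sector-address limit (2**32 sectors of 512 bytes), which the slides do not state, and that the 200GB drives are sized in decimal gigabytes. The function name and layout labels are made up for this example.

```python
# RAID 5 keeps one drive's worth of space for parity in each set.
DRIVE_GB = 200          # decimal gigabytes, as drives are usually marketed
SECTOR_BYTES = 512
# ASSUMPTION: "2TB limit" = 32-bit sector addressing (not stated in the slides).
LIMIT_BYTES = 2**32 * SECTOR_BYTES


def raid5_usable_bytes(drives_in_set, drive_gb=DRIVE_GB):
    """Usable bytes of one RAID 5 set built from equal-sized drives."""
    if drives_in_set < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives_in_set - 1) * drive_gb * 10**9


# Three ways of using the 12 drives in one array (labels are hypothetical).
layouts = {
    "one 12-drive set": [12],
    "one 11-drive set + 1 spare": [11],
    "two 6-drive sets": [6, 6],
}

for name, sets in layouts.items():
    sizes = [raid5_usable_bytes(n) for n in sets]
    fits = all(size <= LIMIT_BYTES for size in sizes)
    print(f"{name}: {sum(sizes) / 1e12:.1f} TB usable, within limit: {fits}")
```

Under these assumptions a single 12-drive set lands just over the limit, while an 11-drive set with one spare and a pair of 6-drive sets both give the same usable space and stay under it.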
New Projects
• Basic fabric performance monitoring
(ganglia)
• Resource CPU accounting (based on PBS
accounts/MySQL)
• New CA in production
• New batch scheduler (MAUI)
• Deploy new helpdesk (May)
Ganglia
• Urgently needed live performance and
utilisation monitoring:
– RAL Ganglia Monitoring
http://ganglia.gridpp.rl.ac.uk/
• Scalable solution based on multicast
• Very rapidly deployable - reasonable
support on all Tier1A Hardware
• See: http://ganglia.sourceforge.net/
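As a rough illustration of what can be done with the monitoring data: a Ganglia gmond daemon serves cluster state as XML over TCP (port 8649 by default), with one HOST element per node and METRIC elements carrying NAME and VAL attributes. The sketch below parses such a document and derives a per-CPU load figure. The sample XML, host names and the `load_per_cpu` helper are invented for this example and are not taken from the RAL setup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample of gmond XML output.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.3" SOURCE="gmond">
  <CLUSTER NAME="example-farm">
    <HOST NAME="node001" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="2.0" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
    <HOST NAME="node002" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="0.5" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def load_per_cpu(xml_text):
    """Return {host name: one-minute load divided by CPU count}."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        # Collect this host's metrics as a NAME -> VAL mapping.
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        load = float(metrics["load_one"])
        cpus = int(metrics["cpu_num"])
        result[host.get("NAME")] = load / cpus
    return result


for name, value in sorted(load_per_cpu(SAMPLE_XML).items()):
    print(f"{name}: {value:.2f} load per CPU")
```

On a live system the XML would be read from a gmond socket rather than a string constant; the parsing step stays the same.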
PBS Accounting Software
• Need to keep track of system CPU and disk
usage.
• Home grown PBS accounting package (Derek
Ross):
– Upload PBS and disk stats into MySQL
– Process with Perl DBI script
– Serve via Apache
• http://www.gridpp.rl.ac.uk/stats
• Contact Derek ([email protected]) for more info.
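The upload-and-process pipeline above can be sketched in a few lines. This is not the RAL package (which uses MySQL and a Perl DBI script); it is a minimal stand-in using Python's built-in sqlite3 module, and it assumes the usual PBS accounting log layout of `date;record type;job id;key=value ...` with `E` marking job-end records. The sample log lines, table layout and function names are all hypothetical.

```python
import sqlite3

# Hypothetical accounting records in the assumed PBS log layout.
SAMPLE_LOG = """\
04/28/2003 10:00:00;E;101.batch;user=alice group=babar resources_used.cput=01:30:00 resources_used.walltime=02:00:00
04/28/2003 11:00:00;E;102.batch;user=bob group=lhcb resources_used.cput=00:45:00 resources_used.walltime=01:00:00
04/28/2003 12:00:00;S;103.batch;user=carol group=babar
04/28/2003 13:00:00;E;104.batch;user=alice group=babar resources_used.cput=00:30:00 resources_used.walltime=03:00:00
"""


def hms_to_seconds(text):
    """Convert HH:MM:SS to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_records(log_text):
    """Yield (jobid, username, grp, cput_s, walltime_s) for job-end records."""
    for line in log_text.splitlines():
        parts = line.split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # skip start records and malformed lines
        fields = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
        yield (
            parts[2],
            fields["user"],
            fields["group"],
            hms_to_seconds(fields["resources_used.cput"]),
            hms_to_seconds(fields["resources_used.walltime"]),
        )


def cpu_seconds_by_group(log_text):
    """Load the records into an in-memory database and total CPU time per group."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (jobid TEXT, username TEXT, grp TEXT, "
        "cput INTEGER, walltime INTEGER)"
    )
    db.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?)", parse_end_records(log_text))
    rows = db.execute("SELECT grp, SUM(cput) FROM jobs GROUP BY grp ORDER BY grp").fetchall()
    db.close()
    return dict(rows)


print(cpu_seconds_by_group(SAMPLE_LOG))
```

Keeping the raw per-job rows in the database, rather than only the totals, leaves room for later reports by user, queue or time window.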
MAUI / PBS
• Maui scheduler has been in production for last 4
months.
• Allows extremely flexible scheduling with many
features. But ….
– Not all of it works – we have done much work
with developers for fixes.
– Major problem – MAUI schedules on wall
clock time – not CPU time. Had to bodge it!!
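To see why the wall-clock versus CPU-time distinction matters, the sketch below compares the usage share two groups would be credited with under each measure. It is only an illustration with invented job figures and group names. It does not reproduce the workaround used at RAL, which the slides do not describe, nor Maui's actual fair-share calculation.

```python
def usage_share(jobs, key):
    """Return each group's fraction of total usage, measured by `key`."""
    totals = {}
    for job in jobs:
        totals[job["group"]] = totals.get(job["group"], 0) + job[key]
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}


# Hypothetical jobs; times are in seconds.
JOBS = [
    # Mostly waiting on I/O: 3 hours elapsed, 1 hour of CPU.
    {"group": "io_bound", "walltime": 10800, "cput": 3600},
    # Fully CPU bound: 1 hour elapsed, 1 hour of CPU.
    {"group": "cpu_bound", "walltime": 3600, "cput": 3600},
]

print("by wall clock:", usage_share(JOBS, "walltime"))
print("by CPU time:  ", usage_share(JOBS, "cput"))
```

With these figures the I/O-bound group is credited with three quarters of the usage by wall clock but only half by CPU time, so the choice of measure changes which group looks like the heavier user.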
New Helpdesk Software
• Old helpdesk is email-based and unfriendly.
• With additional staff, urgently need to deploy
new solution.
• Expect new system to be based on free software
– probably Request Tracker
• Hope that deployed system will also meet needs
of Testbed and may also satisfy Tier 2 sites.
• Expect deployment by end of May.
• http://requestracker.gridpp.rl.ac.uk
Outstanding issues / worries
• We have to run many distinct services.
– Fermi Linux
– RH 6.2/7.2/7.3…
– EDG testbeds, LCG …
• Farm management is getting very
complex. We need better tools and
automation.
• Security is becoming a big concern again.