Fabric Services: LXBATCH, SLC3 and the GRID


Status and plans of central
CERN Linux facilities
Thorsten Kleinwort
IT/FIO-FS
For PH/SFT Group
10.06.2005
Introduction
• 2 years ago:
Post C5 on migration from RH6 to RH7
• Now:
Migration from RH7 to SLC3
• Achievements:
Scalability
Tools framework
Scope
• Conclusions & outlook
10 April 2016
Thorsten Kleinwort IT/FIO/FS
Operating System
Scalability
Tools framework
Scope
Operating System
• SLC3 new default platform:
• LXPLUS fully migrated, new h/w
• Small rest on RH7, O(5)
• LXBATCH 95% on SLC3
• Rest to be migrated soon (even old h/w)
• Other clusters are migrated now as well:
• LXGATE, LXBUILD, LXSERV, …
• Still some problems on special clusters with special hardware (disk, tape server)
Operating System
• Besides this ‘main’ OS, we have RHES:
• RH ES 2 as well as RH ES 3
• Needed for ORACLE
• Now also supporting other architectures:
• ia64, {x86_64}
• Needed for Service Challenge (CASTORGRID)
• No major problems, but:
• Additional work to provide/maintain those
• Minor differences, e.g. no AFS on ES, lilo on ia64
Scalability
• Already reached 1000 nodes with RH7
• Automated node installation
• Now at 2200 Quattor managed machines
• Machines arrive in bunches of O(100)
• Installed/stress-tested/moved
• Now, cluster management automated
• Kernel upgrade on LXBATCH
• Vault move/renumbering
• Cluster upgrade to new version of OS
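The intake steps named above (installed, stress-tested, moved) can be pictured as a small state sequence applied to a whole delivery of machines at once. The following is only an illustrative sketch: the stage names are taken from the slide, while the `ARRIVED` stage, the hostnames and both helper functions are hypothetical and do not describe the actual Quattor or LEAF tooling.

```python
from enum import Enum


class Stage(Enum):
    ARRIVED = 0        # hypothetical starting stage, not named in the slide
    INSTALLED = 1
    STRESS_TESTED = 2
    MOVED = 3          # "moved" as on the slide; its exact meaning is not given there


def advance(stage):
    """Return the next stage, or the same stage if it is already the last one."""
    order = list(Stage)
    idx = order.index(stage)
    return order[min(idx + 1, len(order) - 1)]


def process_batch(hostnames):
    """Walk every host of one delivery through all stages, one stage per pass."""
    state = {host: Stage.ARRIVED for host in hostnames}
    while any(stage is not Stage.MOVED for stage in state.values()):
        for host in state:
            state[host] = advance(state[host])
    return state


# One delivery of about 100 machines (hostnames are made up for illustration).
batch = [f"node{i:03d}" for i in range(100)]
final = process_batch(batch)
```

Treating the delivery as the unit of work, rather than the single machine, is what the slide means by automating at the cluster level.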
Cluster upgrade workflow
Scalability
• Batch System LSF:
• We are up to 50000 jobs in ~2500 slots
• So far o.k., except for AFS copy -> NFS
• Infrastructure has to scale as well:
• Power, cooling, space, network,…
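To put the LSF figures in proportion, the ratio of jobs to slots can be worked out directly. This is a back-of-the-envelope sketch using only the two numbers quoted on the slide; the slot count is approximate there, so the result is approximate too.

```python
jobs = 50_000    # "up to 50000 jobs" from the slide
slots = 2_500    # "~2500 slots" from the slide (approximate)

jobs_per_slot = jobs / slots
print(f"about {jobs_per_slot:.0f} jobs per slot")  # prints "about 20 jobs per slot"
```

So at peak the batch system holds roughly twenty jobs for every slot that can run one.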
CDB: Web access tool
Lemon Start Page
Lemon: E.g. LXBATCH
Tools framework
• We adapted EDG-WP4 tools for our needs
• With RH7 still hybrid with old tools (SUE, ASIS), now clean on SLC3
• Improved and strengthened them in ELFms:
• Quattor, with SPMA and NCM configuration framework
• CDB configuration database with SQL (r/o) interface
• Lemon Monitoring, including web interface
• LEAF, the SMS and HMS framework
LEAF: CCTracker & HMS
Tools framework
We rely on other tools/groups:
• All Linux versions come from Linux Support:
• The need for new versions increases their workload, too
• AIMS: our boot/installation service
• LANDB: now a SOAP interface instead of the web
• Good collaboration
• The scale increases the pressure for robust tools on their side as well
Scope
• Original scope for the framework was LXBATCH/LXPLUS
• Framework adapted to other clusters:
1. LXGATE, LXBUILD:
• Similar to LXPLUS/LXBATCH
2. Disk Server, Tape Server:
• Different h/w, larger variety, more special configuration
3. Non-FIO clusters:
• LXPARC
• GM (EGEE): several clusters, used for tests and prototyping
• GD (LCG test clusters)
Scope
• These new clusters:
• Increase the scale even further
• Enlarge the requirements for the tools, e.g.
• New NCM components
• New SMS/HMS states/workflows
• Additional local users,…
• Come with new OS requirements, e.g.
• RH ES for ORACLE servers
• ia64 support for new CASTORGRID machines
• Proper testing for new s/w, OS, kernel has to be done on the cluster level
Fabric Services as part of the GRID
• Additional LCG s/w was incorporated into our framework
• All SLC3 LXBATCH nodes (>800 MHz) are WN
• CERN-PROD is the biggest site, >1800 CPUs
• UI available on LXPLUS
• CE on LXGATE, 2 at the moment
• SE, cluster of 6 machines, running SRM and CASTORGRID
• All upgraded to LCG_2_4_0
GOC Entry for CERN-PROD
GRID Monitoring:
GRID Resource Infos:
Conclusions & outlook
• Not only migrated to one new OS
• Next one: SLC4 or SLC5?
• Tools are ready, no major problems foreseen
• We have overcome some scalability issues
• Prepared to go to LHC scale
Tools:
• Gone from machine automation to cluster automation
• Improve usability
• Increase robustness
• Decrease necessary expert level
Scope:
• From LXBATCH/LXPLUS to many different clusters
• How to manage non-FIO clusters?