CASTOR2
Robust Services - Middleware Developers' Techniques & Tips
Dennis Waldron
CERN IT-FIO-FD
WLCG Service Reliability Workshop, 29 November 2007
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Outline
• What is Robustness?
• Database Centric Architecture
– Lifecycle of a PUT request
– Service redundancy and recoverability
• Application Framework
– Reusability and Maintainability
– Code Generation
• Testing and Certification
What is Robustness?
(For Castor2)
• The ability to provide a service that can handle sustained request rates.
• But also a service that can handle high peaks in requests.
• When problems do occur, the ability to recover from them in a timely and graceful manner.
• To provide a high-availability solution, shielding the end user from the deployment architecture.
Database Centric Architecture
(PUT Request – Simplified!!)
[Diagram: simplified lifecycle of a PUT request]
– The client issues "stager_put -M /castor/cern.ch/..."; the request is sent to the Request Handler (RH) and stored in the Stager Catalogue (Oracle 10g).
– The Stager extracts the request and its decision is stored in the DB.
– The Job Manager picks up the job information and sends a job request to the Scheduler (LSF – Load Sharing Facility), which acknowledges the submission into LSF.
– A job start request and job start confirmation follow, together with a callback; a mover is forked to perform the transfer between diskserver and client, the type of mover depending on the requested transfer protocol.
– After the file transfer the job ends and the response is returned to the client.
Total database transactions: 9
Number of daemons involved: 4
0.5 second RTT (idle instance, excluding file transfer time)
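To make the stage progression above concrete, here is a tiny, self-contained C++ sketch. It is purely illustrative and not the CASTOR2 API: the PutRequest type and the stage names are assumptions made for the example, chosen to mirror the labels in the diagram.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical request record, loosely modelling a row in the stager catalogue.
struct PutRequest {
  std::string castorFileName;  // e.g. /castor/cern.ch/...
  std::string status;          // current stage of the request
};

int main() {
  PutRequest req{"/castor/cern.ch/...", "NEW"};

  // Stages roughly follow the simplified diagram above.
  const std::vector<std::string> stages = {
    "STORED_IN_DB",      // Request Handler stores the request in the catalogue
    "DECISION_STORED",   // Stager extracts the request and stores its decision
    "SUBMITTED_TO_LSF",  // Job Manager submits the job to the LSF scheduler
    "JOB_STARTED",       // job starts on a diskserver, callback issued
    "TRANSFERRING",      // a protocol-specific mover is forked for the transfer
    "JOB_ENDED"          // job end recorded, response returned to the client
  };

  for (const std::string& s : stages) {
    req.status = s;  // in CASTOR2 such status changes are stored in the DB
    std::cout << req.castorFileName << " -> " << req.status << std::endl;
  }
  return 0;
}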
Database Centric Architecture II
(Advantages & Disadvantages)
• Advantages:
  – All new C++ daemons are stateless.
  – Processing logic is provided by PL/SQL (~4.5k lines).
    • Hot fixes can be deployed on a running instance with no disruption or restart necessary.
  – Transaction-like behaviour.
  – Easy information exchange.
  – Parallelism – daemons perform operations on data in an atomic way.
    • Multiple daemons with the same functionality can operate in tandem on different requests at the same time (see the sketch after this list).
  – Minimal footprint on unprocessed requests.
    • Requests are stored in the DB while waiting for processing / resources, e.g. tape recalls.
    • Requests are not instantiated as processes until they run.
• Disadvantages:
  – A single database front-end = a single point of failure.
  – The scalability and speed of CASTOR2 can be limited by the DB performance.
  – Changes in SQL execution plans, deadlocks, and stale DB connections can heavily degrade the service.
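As a rough illustration of the stateless, parallel processing pattern described above (and not actual CASTOR2 code), the sketch below shows a worker loop that claims unprocessed requests from a shared catalogue one at a time. The Database and Row types and the claimNextRequest call are assumptions made for this example; in CASTOR2 the claiming logic itself sits in PL/SQL, which is what lets several identical daemons run in tandem without stepping on each other.

#include <iostream>
#include <optional>
#include <string>

// Hypothetical catalogue row describing a pending request.
struct Row { long id; std::string fileName; };

// Stand-in for the stager catalogue; the atomic "claim" would be a stored
// procedure that selects a NEW request, marks it PROCESSING and commits,
// so no two daemons can pick up the same request.
class Database {
public:
  std::optional<Row> claimNextRequest() {
    return std::nullopt;  // no pending requests in this toy example
  }
  void markDone(long id) { std::cout << "request " << id << " done\n"; }
};

// The daemon keeps no state of its own between requests, so any number of
// identical workers can run this loop concurrently against the same DB.
void workerLoop(Database& db) {
  while (true) {
    std::optional<Row> r = db.claimNextRequest();
    if (!r) break;  // a real daemon would sleep and poll again
    db.markDone(r->id);
  }
}

int main() {
  Database db;
  workerLoop(db);
  return 0;
}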
Database Centric Architecture III
(Service Redundancy and Recoverability – File Query)
[Diagram: public and private Request Handlers sit behind a DNS load-balanced alias; requests and responses travel over database connections to the Stager Catalogue (Oracle 10g RAC), which is shared with the Stagers and the disk cache.]
We rely on Oracle technologies to solve database redundancy and high availability. The only thing we need to worry about at the application level is reconnecting when a connection to the DB is dropped.
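A minimal sketch of that application-level reconnection logic is shown below, assuming a hypothetical DbConnection wrapper; the real daemons go through the framework's database converters and the Oracle client libraries, so the names here are illustrative only.

#include <chrono>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

// Illustrative stand-in for a database session handle.
class DbConnection {
public:
  void execute(const std::string& sql) {
    (void)sql;  // a real implementation would throw if the connection was dropped
  }
  void reconnect() { std::cout << "reconnecting via the DB alias...\n"; }
};

// Run a statement; if the connection was lost, reconnect (the Oracle RAC /
// DNS alias picks a live node) and retry with a growing back-off.
void executeWithRetry(DbConnection& db, const std::string& sql, int maxRetries = 3) {
  for (int attempt = 0; attempt <= maxRetries; ++attempt) {
    try {
      db.execute(sql);
      return;  // success
    } catch (const std::exception&) {
      if (attempt == maxRetries) throw;  // give up and let the caller log the failure
      db.reconnect();
      std::this_thread::sleep_for(std::chrono::seconds(1 << attempt));
    }
  }
}

int main() {
  DbConnection db;
  executeWithRetry(db, "SELECT 1 FROM dual");
  return 0;
}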
Daemon Overview
(Stateless vs. Non-Stateless)
Daemon Name                  | Stateless | C++ Framework | CASTOR Version
Stager                       | YES       | YES (2.1.6+)  | 2
Request Handler              | YES       | YES           | 2
Job Manager                  | YES       | YES (2.1.4+)  | 2
Scheduler Plugin (LSF)       | YES       | YES           | 2
Distributed Logging Facility | N/A *     | NO            | 2
Mig Hunter                   | YES       | NO            | 1
Name Server                  | N/A *     | NO            | 1
RTCPClientD                  | NO        | NO            | 1

Notes:
– The previous implementation was not DB-centric and required direct communication with the stager.
– Major rewrite in v2.1.3 to remove scalability problems: previously 1000 pending requests in LSF would cause LSF to stop. Some work remains to allow an LSF master and slave configuration to work with the CASTOR2 LSF plugin.
– Asynchronous logging as of v2.1.1; before this, a failure of the logging daemon stopped CASTOR2.
– Fails to reconnect to the database after a disconnection (likewise for CUPV and VMGR); a fix is scheduled for 2.1.6.
– Requires operations team intervention after daemon restarts to fix database inconsistencies and restart requests.
* DLF and Name Server are not required to be stateless
Daemons are evolving all the time to provide for high availability
Application Framework
(Reusability and Maintainability)
• A framework exists for creating new C++ daemons which relies heavily on abstract interfaces. Interfaces and implementations exist for:
  – Database access via converters
  – Object streaming between daemons via converters
  – Multithreading utilising the Cthread API
    • Cthread is an OS abstraction layer created in the days of CASTOR1; it provides the underlying threading capabilities of the C++ daemons
• Multithreaded daemon example:
srm::daemon::SrmDaemon daemon("SRM Daemon");
daemon.addThreadPool(
  new castor::server::BaseThreadPool("GC",
                                     new srm::daemon::GCThread()));
daemon.getThreadPool('G')->setNbThreads(1); // we need a single thread here
daemon.parseCommandLine(argc, argv);
daemon.start();
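The slide does not show GCThread itself. As a purely hypothetical sketch of the shape such a worker usually takes in an abstract-interface framework (IThread and its run() method are assumptions, not the real castor::server API), it might look like this:

#include <iostream>

// Assumed abstract worker interface; the pool drives objects through run().
class IThread {
public:
  virtual ~IThread() {}
  virtual void run() = 0;
};

// Garbage-collection style worker, as registered with the "GC" pool above.
class GCThread : public IThread {
public:
  void run() override {
    // scan for work, talk to the database, clean up, then return
    std::cout << "GC pass completed" << std::endl;
  }
};

int main() {
  GCThread gc;
  gc.run();  // in the framework the thread pool, not main(), calls run()
  return 0;
}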
Application Framework II
(Reusability and Maintainability)
• High levels of code reuse reduce the number of lines of maintainable code.
  – Less code means fewer potential bugs are introduced.
  – A bug fixed in the framework is a bug fixed in all daemons using that framework.
  – Simplifies design.
• Code generation
  – All C++ daemons are modelled in UML (Umbrello).
  – We generate classes, interfaces and DB schemas from the UML diagrams.
  – About 570k lines of code in CASTOR2 CVS; 187k lines (~32%) are auto-generated.
  – C++ only: about 248k lines, of which 164k (~66%) are auto-generated.
  – Eliminates copy-and-paste errors.
  – Introduces standards and commonality.
Testing and Certification
• From CVS to production:
  – CVS HEAD – tested by developers
  – c2test – integration tests, test suites
  – c2itdc – stress tests
  – c2pps – pre-production instance, certification for 3rd-party interfaces/protocols, e.g. SRM2, xrootd …
  – Production release announced.
  – Deployment in production at Tier 0/1.
• Test suites are evolving all the time: ~250 test cases (mainly functionality).
• The key aims are to:
  – Increase stability.
  – Eliminate bugs prior to production release.
  – Increase confidence amongst the operations team.
• Not all areas of CASTOR2 are tested; there is still progress and work to be made in some areas, e.g. unit testing and test suites for all transfer protocols.
• Certification is done not only at CERN but also at RAL (v2.1.6+).
Questions?