slides - Indico

1
Introduction
Topics:
•SUMS (STAR Unified Meta Scheduler) overview
–Usage
•Architecture
•Deprecated Configuration
•Current Configuration
–Configuration via Information services
•Future Configuration
2
Quick Overview of SUMS
•The first version was developed in 2002; the STAR physics
community has been using it for the past four years.
•Benefits:
–Resource management and knowledge of complex resources are
taken off the user's hands.
–The administrator has tighter control over jobs.
•Used for both user analyses and production
(see next slide for usage)
Developers:
•Jerome Lauret and Levente Hajdu – architecture,
coding, administration of SUMS at BNL
•Lidia Didenko – testing for grid readiness
•David Alexander, Paul Hamill, Chuang Li (Tech-X
Corp.) – private organization developing third-party
modules for SUMS (nuclear physics)
•Eric Hjort – File transfer solutions (SRM integration)
•Iwona Sakrejda, Doug Olson – administration of
SUMS at PDSF
•Efstratios Efstathiadis – Queue monitoring, research
•Valeri Fine – Grid testing
•Andrey Y. Shevel – administration of SUMS at Stony
Brook University and development of a PBS module
•Elisabeth Atems – administration of SUMS at Wayne
State University
•Michael DePhillips – statistics monitoring, database
administration
•Wayne Betts – Test bed administration and
deployment
[Figure: SUMS usage plot; the only recoverable annotation is “Holidays”.]
5
Architecture Overview
Dispatchers and Policies
•Format of the configuration file
•Configuration of the policy
•Configuration of the Dispatcher
–Nuances
•Configuration of the Queue
6
An overview of the configuration
• The configuration continues to evolve over time.
• The original format of the configuration is Sun Java object-serialized XML,
as implemented by java.beans.XMLDecoder in the Java JDK.
– For more information see: http://java.sun.com/products/jfc/tsc/articles/persistence3/
– The benefits include
• Automated parsing
• Easy to edit by hand
• The hierarchical structure is easily understood (XML)
• Ability to reference configuration blocks
– Example: if five policies use the same queue, it is declared only once and referenced by IDREF=“BNL_LSF_LongQueue”
• Ability to make function calls (powerful initialization tool)
• No need for a database engine
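The block-referencing benefit can be illustrated with a small sketch. The XML fragment imitates the layout that java.beans.XMLDecoder reads (object elements with id and idref attributes), but the class names, method names and policy names are hypothetical. Python's xml.etree stands in for the Java decoder only to show that one queue declaration is shared by several policies.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of java.beans.XMLDecoder input.
# Class, method and policy names are invented for illustration only.
CONFIG = """
<java version="1.0" class="java.beans.XMLDecoder">
  <object id="BNL_LSF_LongQueue" class="example.Queue">
    <void property="name"><string>long</string></void>
  </object>
  <object id="PolicyA" class="example.Policy">
    <void method="addQueue"><object idref="BNL_LSF_LongQueue"/></void>
  </object>
  <object id="PolicyB" class="example.Policy">
    <void method="addQueue"><object idref="BNL_LSF_LongQueue"/></void>
  </object>
</java>
"""

def shared_declarations(xml_text):
    """Map each declared id to the top-level objects that reference it."""
    root = ET.fromstring(xml_text)
    declared = {el.get("id") for el in root.iter("object") if el.get("id")}
    users = {}
    for top in root.findall("object"):
        for ref in top.iter("object"):
            target = ref.get("idref")
            if target in declared:
                users.setdefault(target, []).append(top.get("id"))
    return users

references = shared_declarations(CONFIG)
print(references)
```

The queue is written once and both policies point at it, so a change to the queue block reaches every policy that references it.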
7
Configuration of the Policy
What parameters are needed ?
•A list of Queues (sometimes with weights)
•The base algorithm to use
•A name which the user can call to invoke the policy
•Configuration for monitoring plug-ins (optional)
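As a rough sketch of how these four parameters might sit together, the structure below models a policy as a plain record and looks it up by the name the user supplies. The field names, algorithm labels, the second queue identifier and the monitoring settings are all assumptions made for illustration; they are not taken from the SUMS configuration itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    name: str                          # name the user calls to invoke the policy
    algorithm: str                     # base algorithm (labels here are hypothetical)
    queues: dict                       # queue id -> weight
    monitoring: Optional[dict] = None  # optional monitoring plug-in settings

def find_policy(policies, requested_name):
    """Return the policy whose name matches the one the user asked for."""
    for policy in policies:
        if policy.name == requested_name:
            return policy
    raise KeyError(f"no policy named {requested_name!r}")

POLICIES = [
    Policy("StaticLocal", "static-weights",
           {"BNL_LSF_LongQueue": 10, "BNL_LSF_ShortQueue": 3}),
    Policy("MonitoredGrid", "monitoring",
           {"BNL_LSF_LongQueue": 1},
           monitoring={"refresh_seconds": 300}),
]

chosen = find_policy(POLICIES, "StaticLocal")
print(chosen.algorithm, sorted(chosen.queues))
```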
8
Configuration of the Policy
What does it really look like ?
9
Configuration of the Dispatcher
•The base class for the given submission method
–LSF, CONDOR, SGE, …
•Timing information
–delay between submissions, timeout time, number of retries
•Gatekeeper names (for grid submission)
•Script generator
–Program location table.
•Site specific nuances.
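The timing parameters listed above (delay between submissions, timeout, number of retries) could drive a submission loop along the following lines. This is a minimal sketch, not the SUMS dispatcher: the Timing record, dispatch_all and the fake_submit stand-in are hypothetical names, and a real dispatcher would call the batch system instead.

```python
import time
from dataclasses import dataclass

@dataclass
class Timing:
    delay_s: float    # delay between submissions
    timeout_s: float  # how long one submission attempt may take
    retries: int      # extra attempts after the first failure

def dispatch_all(jobs, submit, timing, sleep=time.sleep):
    """Submit jobs in order; return the jobs that failed every attempt."""
    failed = []
    first_attempt = True
    for job in jobs:
        for _ in range(1 + timing.retries):
            if not first_attempt:
                sleep(timing.delay_s)  # pause before every attempt but the first
            first_attempt = False
            if submit(job, timing.timeout_s):
                break
        else:
            failed.append(job)  # all attempts were used up
    return failed

# Stand-in for a real batch submission: "flaky" succeeds on its second
# attempt and "bad" never succeeds.
attempts = {}
def fake_submit(job, timeout_s):
    attempts[job] = attempts.get(job, 0) + 1
    if job == "bad":
        return False
    if job == "flaky":
        return attempts[job] >= 2
    return True

pauses = []  # record the pauses instead of really sleeping
failed = dispatch_all(["ok", "flaky", "bad"], fake_submit,
                      Timing(delay_s=0.5, timeout_s=30.0, retries=2),
                      sleep=pauses.append)
print(failed, attempts, len(pauses))
```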
10
Configuration of the Dispatcher
11
Configuration of the Dispatcher
•Site specific nuances:
–Some Examples:
•When submitting via the Condor batch system, some sites require additional
keywords such as +Experiment=“star”; otherwise the job is held indefinitely.
•At PDSF it is necessary to use the “module load [name]” command
before being able to access certain software packages such as Java or
Globus, otherwise the user gets “program not found” errors.
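One way to keep such nuances out of the generic script generator is a per-site table of extra lines, sketched below. The site key "SiteWithExperimentTag", the table layout, and the module name "java" are assumptions; only the +Experiment keyword and the "module load" command come from the examples above.

```python
# Hypothetical per-site table of additions to generated files.
SITE_NUANCES = {
    "SiteWithExperimentTag": {"condor_keywords": ['+Experiment = "star"']},
    "PDSF": {"prologue": ["module load java"]},  # module name is assumed
}

def condor_submit_description(executable, site):
    """Build a Condor submit description, adding any site keywords."""
    lines = ["universe = vanilla", f"executable = {executable}"]
    lines += SITE_NUANCES.get(site, {}).get("condor_keywords", [])
    lines.append("queue")  # the queue statement comes last
    return "\n".join(lines)

def job_script(commands, site):
    """Build a shell job script, putting site prologue lines first."""
    prologue = SITE_NUANCES.get(site, {}).get("prologue", [])
    return "\n".join(["#!/bin/sh", *prologue, *commands])

submit_text = condor_submit_description("job.sh", "SiteWithExperimentTag")
script_text = job_script(["java -version"], "PDSF")
print(submit_text)
print(script_text)
```

A site with no entry in the table gets the plain description and script, so the nuance stays confined to the sites that need it.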
12
Configuration of the Queue
•Queue objects are virtual entities representing a
subset of nodes. Examples:
a Condor pool; a subset of a batch system queue where memory > 256 MB
•What parameters are needed?
–Queue weight
•Policies use queue weights for decision making.
Dynamic (monitoring) policies derive their own
weights; static policies have user-configured weights.
–Will the job “fit”?
•Time limits (cpu, wall)
•Memory
•Scratch Space
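The “fit” test and the use of weights can be sketched as follows. The record layout, the units, and the rule that the highest-weight fitting queue wins are assumptions made for this example; the source only states that policies use weights for decision making.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    cpu_minutes: int
    wall_minutes: int
    memory_mb: int
    scratch_mb: int

@dataclass
class Queue:
    name: str
    weight: float      # user-configured for static policies
    limits: Resources

def fits(job, queue):
    """A job fits when every requirement is within the queue's limits."""
    lim = queue.limits
    return (job.cpu_minutes <= lim.cpu_minutes
            and job.wall_minutes <= lim.wall_minutes
            and job.memory_mb <= lim.memory_mb
            and job.scratch_mb <= lim.scratch_mb)

def choose_queue(job, queues):
    """Assumed rule: among the queues the job fits, take the highest weight."""
    candidates = [q for q in queues if fits(job, q)]
    return max(candidates, key=lambda q: q.weight) if candidates else None

QUEUES = [
    Queue("short", 5.0, Resources(60, 60, 512, 1000)),
    Queue("long", 1.0, Resources(4320, 4320, 512, 1000)),
]

small_job = Resources(30, 30, 256, 100)
long_job = Resources(600, 600, 256, 100)
huge_job = Resources(30, 30, 4096, 100)
print(choose_queue(small_job, QUEUES).name,
      choose_queue(long_job, QUEUES).name,
      choose_queue(huge_job, QUEUES))
```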
13
Configuration of the Queue
Typical Configuration:
14
15
Note: the configuration file for site “A” is different from the one for site “B”.
16
*This approach works for a small number of sites; however, it does not scale well, because every
configuration is different and, when a new site is added, all configurations need to be updated.
17
Demand Drives the Need to Evolve
In order to reduce redundancy, the following steps were taken:
•Merge all files.
–All sites are merged into the same file.
–There is a higher-level “site block” to encapsulate (delimit) all sites in the configuration.
•Normalize the configuration (removal of duplicates).
–The duplication of queues inside policies was a major source of redundancy; as a result, queues
were pulled out and are only referenced in the policy.
–Batch system information was pulled out of the queue blocks to a higher level that encapsulates the
queue blocks. This level is referred to as the batch system.
–Gatekeeper information was pulled out of the dispatchers and put into the batch system block.
–The dispatcher blocks were moved into the batch system block. There is provision for only two
dispatchers per batch system block:
•One for submission to the batch block by local users
•One for submission to the batch block by remote users
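The merged, normalized layout described above can be pictured as a nested structure: sites contain batch systems, each batch system carries its gatekeeper, its two dispatchers and its queues, and policies only reference queues by identifier. The sketch below uses a plain dictionary with hypothetical site, queue, dispatcher and host names; the real configuration is XML.

```python
# Hypothetical merged configuration; every name below is illustrative.
CONFIG = {
    "sites": {
        "SiteA": {
            "domain": "site-a.example",
            "batch_systems": {
                "LSF": {
                    "gatekeeper": "gk.site-a.example",
                    "local_dispatcher": "LocalLSFDispatcher",
                    "grid_dispatcher": "GridDispatcher",
                    "queues": {"SiteA_LSF_Long": {"weight": 1}},
                },
            },
        },
        "SiteB": {
            "domain": "site-b.example",
            "batch_systems": {
                "Condor": {
                    "gatekeeper": "gk.site-b.example",
                    "local_dispatcher": "LocalCondorDispatcher",
                    "grid_dispatcher": "GridDispatcher",
                    "queues": {"SiteB_Condor_Pool": {"weight": 2}},
                },
            },
        },
    },
    # Policies hold references only, so one policy can span several sites.
    "policies": {"PolicyX": ["SiteA_LSF_Long", "SiteB_Condor_Pool"]},
}

def locate_queue(config, queue_id):
    """Return (site, batch system) for a queue id, or None if undeclared."""
    for site_name, site in config["sites"].items():
        for bs_name, batch_system in site["batch_systems"].items():
            if queue_id in batch_system["queues"]:
                return site_name, bs_name
    return None

located = {q: locate_queue(CONFIG, q) for q in CONFIG["policies"]["PolicyX"]}
print(located)
```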
18
19
Benefits
•Policies are reusable, one policy can be used by multiple sites.
•Changes are easily implemented and distributed.
•Redundancy is cut down.
20
How it works.
After a queue is assigned to a job by a policy, it has to be determined whether the local or the grid
dispatcher should be used. This is done by recovering the user's domain name. Multiple methods
are used to try to recover the domain name; the most common is “/bin/domainname”. This is
compared with the domain of the site on which the queue resides (from the config file). If they are
the same, the local dispatcher is used. If they are different, the grid dispatcher is used.
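The decision just described can be sketched in a few lines. The function names are hypothetical, and two details are assumptions rather than documented SUMS behaviour: domains are compared case-insensitively, and an unrecoverable domain falls back to the grid dispatcher.

```python
import subprocess

def domain_from_command(command="/bin/domainname"):
    """Ask the operating system for the domain name; None on failure."""
    try:
        result = subprocess.run([command], capture_output=True,
                                text=True, timeout=5)
    except (OSError, subprocess.SubprocessError):
        return None
    name = result.stdout.strip()
    # Linux prints "(none)" when no domain name is set.
    if result.returncode != 0 or not name or name == "(none)":
        return None
    return name

def recover_domain(methods):
    """Try each recovery method in order; return the first usable answer."""
    for method in methods:
        domain = method()
        if domain:
            return domain
    return None

def choose_dispatcher(user_domain, site_domain):
    """Same domain -> local dispatcher; otherwise -> grid dispatcher."""
    if user_domain is not None and user_domain.lower() == site_domain.lower():
        return "local"
    return "grid"

# Stand-in methods so the example does not depend on the host it runs on.
user_domain = recover_domain([lambda: None, lambda: "site-a.example"])
print(choose_dispatcher(user_domain, "site-a.example"),
      choose_dispatcher(user_domain, "site-b.example"))
```

In practice domain_from_command would be the first entry in the list of methods, followed by whatever fallbacks a site needs.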
21
How it works.
22
The Job Script
Building user sand boxes and data recovery
23
Adding New Sites
1. When SUMS is initialized, it tries to find its site in the
configuration.
2. If the site is there, SUMS will ask the administrator a
minimal number of questions to configure the site.
3. SUMS will write the site information in a special location.
–The administrator can decide if they want to feed this
information back to the master configuration.
24
Getting to the Dynamic Part
25
Fragment of Program Locations Table from nersc.gov
String, String pairs
String, method call returning a string
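The two kinds of table entry named above (a fixed string, or a method call that returns a string) can be modelled as a mapping whose values are either strings or callables. The program names and the path below are hypothetical and are not the contents of the nersc.gov table.

```python
import shutil

# Hypothetical program locations table: each value is either a fixed
# path (String, String) or a callable evaluated at lookup time
# (String, method call returning a string).
PROGRAM_LOCATIONS = {
    "java": "/usr/bin/java",
    "python": lambda: shutil.which("python3") or "python3",
}

def resolve(table, program):
    """Return the location of a program, calling the entry if needed."""
    entry = table[program]
    return entry() if callable(entry) else entry

java_path = resolve(PROGRAM_LOCATIONS, "java")
python_path = resolve(PROGRAM_LOCATIONS, "python")
print(java_path, python_path)
```

The callable form is what makes the table dynamic: the location is worked out on the node where the lookup runs instead of being fixed in the file.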
26
Information service cycle
1. A configuration with some parts obtained via configuration and some parts
obtained via an information service.
2. The information service absorbs (makes available) parameters previously
statically configured.
3. Site N requires a new configuration parameter, deemed necessary in order to
submit to the site. The parameter is statically added to the configuration.
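The cycle can be read as a lookup rule plus a migration step: a parameter is taken from the information service when it is published there and from the static configuration otherwise, and over time parameters move from the second source to the first. The sketch below assumes both sources are simple dictionaries; the parameter name and function names are hypothetical.

```python
def lookup(name, site, info_service, static_config):
    """Prefer the information service; fall back to static configuration."""
    published = info_service.get(site, {})
    if name in published:
        return published[name], "information service"
    return static_config[site][name], "static configuration"

def absorb(name, site, info_service, static_config):
    """The information service starts publishing a parameter that was
    previously configured statically, so the static entry is dropped."""
    info_service.setdefault(site, {})[name] = static_config[site].pop(name)

# A new parameter for Site N starts out statically configured.
static_config = {"SiteN": {"condor_keyword": '+Experiment = "star"'}}
info_service = {}

before = lookup("condor_keyword", "SiteN", info_service, static_config)
absorb("condor_keyword", "SiteN", info_service, static_config)
after = lookup("condor_keyword", "SiteN", info_service, static_config)
print(before, after)
```

The caller sees the same value before and after the migration; only its source changes.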
27
Improvements and Plans for Future Development
Where do we go from here?
28
Accuracy Counts
29
30
Conclusions
•SUMS produces jobs that take best advantage of resources on any
given site.
•SUMS provides seamless GRID and local integration.
•We have provided an easy-to-use method for adding new sites.
•We are moving to dynamic recovery of configuration parameters.
•We want to be able to provide one install for all sites in many
different communities.
31
The End
Questions?
32