Quill / Quill++ Tutorial
European Condor Week
June 2006
INFN
Milan, Italy
Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
What is Quill?
A non-invasive method of storing a
read only version of the job queue
and job historical data in a relational
database.
www.cs.wisc.edu/condor
Why Do We Need It?
› Presents the job queue information as
a set of tables in a relational
database (Big Win!)
› Fault tolerance
› Provides performance enhancements
in very large and busy pools
Job Queue Management
[Diagram: Without Quill, the schedd alone manages the Job Queue. With Quill, the schedd still manages the Job Queue, and the quilld sits between the Job Queue and the Database.]
Deployment
› One Quill daemon per schedd
› Quill daemons must be uniquely named
› Each Quill daemon uses a unique DB name
› Multiple Quill daemons may utilize one database server
› Currently uses PostgreSQL
▪ Recommend PostgreSQL 8.1 or later for automatic vacuuming of tables
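The naming rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming a hand-collected list of (schedd, Quill name, DB name) triples; the tuple layout, the helper name, and the sample names are all hypothetical and are not part of Condor.

```python
from collections import Counter

def find_conflicts(deployments):
    """Return a list of human-readable rule violations.

    deployments: iterable of (schedd_name, quill_name, db_name) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    problems = []
    rows = list(deployments)
    for label, index in (("schedd", 0), ("Quill name", 1), ("DB name", 2)):
        counts = Counter(row[index] for row in rows)
        for value, count in sorted(counts.items()):
            if count > 1:
                problems.append(f"{label} {value!r} is used {count} times")
    return problems

# Hypothetical pool: two schedds accidentally share one database name.
pool = [
    ("schedd_a", "quill_a", "db_a"),
    ("schedd_b", "quill_b", "db_a"),
]
print(find_conflicts(pool))
```

A schedd that shows up twice means two Quill daemons were configured for it, which breaks the first rule; a repeated Quill or DB name breaks the second or third.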
Condor’s Interface to Quill
› Modified two tools to utilize the DB
▪ condor_q
▪ condor_history
› Very minor modifications to schedd
› Multiple sources for Job Queue & History pose an interesting problem
Job Queue Discovery Sequence (Local Query)
[Diagram: condor_q tries each source in turn: (1) the Database, (2) the quilld, (3) the schedd and its Job Queue.]
Job Queue Discovery Sequence (Remote Query)
[Diagram: condor_q (0) contacts the collector, then tries (1) the Database, (2) the quilld, (3) the schedd and its Job Queue.]
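One way to read the numbered steps in the discovery sequence is as an ordered list of fallbacks. The sketch below assumes exactly that, and further assumes that an unreachable source shows up as a ConnectionError; the function names and the stand-in sources are hypothetical and do not come from condor_q itself.

```python
def discover(sources):
    """Try each (name, query) pair in order; return the first answer.

    A source signals "unavailable" by raising ConnectionError, which is
    an assumption of this sketch rather than documented behaviour.
    """
    for name, query in sources:
        try:
            return name, query()
        except ConnectionError:
            continue  # fall through to the next source in the sequence
    raise RuntimeError("no job queue source answered")

def unavailable():
    raise ConnectionError("source is down")

# Hypothetical stand-ins for the database, the quilld, and the schedd.
sequence = [
    ("rdbms", unavailable),
    ("quilld", unavailable),
    ("schedd", lambda: ["92.0 psilord I"]),
]
print(discover(sequence))
```

For a remote query the same loop would apply after a step 0 that asks the collector where the sources live.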
A User Perspective: condor_q
› condor_q changes
▪ -name takes a ScheddName or QuillName
▪ -avgqueuetime details average time in queue for all jobs
A User Perspective: condor_q
Example: condor_q -name
Linux merlin > condor_q -name [email protected]
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID    OWNER    SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
92.0  psilord  4/21 09:21   0+00:00:00  I   0    9.8   foo
1 jobs; 1 idle, 0 running, 0 held
A User Perspective
Example: condor_q -avgqueuetime
Linux merlin > condor_q -avgqueuetime
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
Average time in queue for uncompleted jobs (in hh:mm:ss)
00:40:47.011993
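The slides do not say how the average is computed. A minimal sketch, assuming "time in queue" means the current time minus each uncompleted job's submission time, could look like this; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def average_time_in_queue(submit_times, now):
    """Mean of (now - submit time) over the given jobs, as a timedelta."""
    if not submit_times:
        return timedelta(0)
    total = sum((now - t for t in submit_times), timedelta(0))
    return total / len(submit_times)

# Two hypothetical uncompleted jobs, measured at 10:00 on the same day.
now = datetime(2006, 4, 21, 10, 0, 0)
submitted = [datetime(2006, 4, 21, 9, 21, 0), datetime(2006, 4, 21, 9, 0, 0)]
print(average_time_in_queue(submitted, now))
```

With Quill the equivalent aggregate would presumably be evaluated inside the database rather than in client code.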
Job History Discovery Sequence (Local Query)
[Diagram: Database, quilld, Job Queue, History File. condor_history tries (1) the Database, then (2) the History File. The quilld is never queried directly!]
Job History Discovery (Remote Query) NEW!
[Diagram: Database, quilld, Job Queue, History File. condor_history (0) contacts the collector, then (1) queries the Database. The quilld is never queried directly!]
A User Perspective: condor_history
› condor_history changes
▪ -name takes a Quill Name to retrieve job histories from a remote quill’s database
▪ -completedsince returns all jobs completed since a PostgreSQL formatted date
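A "PostgreSQL formatted date" can be produced from a script instead of typed by hand. The sketch below assumes the plain 'YYYY-MM-DD HH:MM:SS' form, which PostgreSQL accepts for timestamps; the helper name is hypothetical, and the argument list is only built, never executed.

```python
from datetime import datetime

def completedsince_args(moment):
    """Build an argument list for condor_history -completedsince.

    The timestamp is rendered as 'YYYY-MM-DD HH:MM:SS'; the list is
    returned to the caller and nothing is run here.
    """
    stamp = moment.strftime("%Y-%m-%d %H:%M:%S")
    return ["condor_history", "-completedsince", stamp]

print(completedsince_args(datetime(2006, 1, 1, 0, 0, 1)))
```

Passing the list to a process launcher as separate arguments avoids shell quoting problems with the space inside the timestamp.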
A User Perspective: condor_history
Example: condor_history -name
Linux merlin > condor_history -name [email protected]
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID    OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED    CMD
91.0  psilord  4/20 14:23   0+00:00:00  X   ???          /scratch/psilor
92.0  psilord  4/21 09:21   0+00:00:00  X   ???          /scratch/psilor
93.0  psilord  4/21 10:12   0+00:00:01  C   4/21 10:12   /scratch/psilor
A User Perspective: condor_history
Example: condor_history -completedsince
Linux merlin > condor_history -completedsince "2006-01-01 00:00:01"
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID    OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED    CMD
93.0  psilord  4/21 10:12   0+00:00:01  C   4/21 10:12   /scratch/psilor
Short Circuiting the Discovery Sequence
› Use the -direct option!
› Examples
▪ condor_q -direct rdbms
▪ condor_q -direct quilld
▪ condor_q -direct schedd
› “rdbms”, “quilld”, and “schedd” are the actual parameters.
› Invaluable for debugging!
PostgreSQL 8.1 Installation
› ./configure
› gmake && gmake install
› mkdir /path/to/pgsql/data
› initdb -D /path/to/pgsql/data
› postmaster -D /path/to/pgsql/data
› Note: Default port binding is 5432.
PostgreSQL Configuration
› Add two special user accounts: quillreader and quillwriter
createuser quillreader --no-createdb --no-adduser --pwprompt
createuser quillwriter --createdb --no-adduser --pwprompt
PostgreSQL Configuration (cont)
› Allow TCP/IP connections
▪ Edit file postgresql.conf
• Set listen_addresses = '*'
› Allow connections from specific hosts
▪ Edit file pg_hba.conf
• host all quillreader 128.105.0.0 255.255.0.0 password
• host all quillwriter 128.105.0.0 255.255.0.0 password
› Note: only use ‘password’ authentication at this time.
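Before reloading PostgreSQL it can help to confirm which client addresses an address/mask pair in pg_hba.conf actually covers. This is a small sketch using Python's standard ipaddress module; the two client addresses tested are hypothetical, and the helper name is made up for illustration.

```python
import ipaddress

# Address and mask copied from the pg_hba.conf lines above.
allowed = ipaddress.ip_network("128.105.0.0/255.255.0.0")

def is_allowed(host_ip):
    """True if host_ip falls inside the address/mask pair."""
    return ipaddress.ip_address(host_ip) in allowed

print(is_allowed("128.105.14.7"))   # hypothetical client inside the range
print(is_allowed("128.104.1.1"))    # hypothetical client outside the range
```

The mask 255.255.0.0 is the same as a /16 prefix, so every host whose address starts with 128.105 matches.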
Quill Configuration
› User quillwriter needs a write
password.
› Store it in a file called
.quillwritepassword in the $(SPOOL)
directory.
› Ensure only the condor uid can read it
if Condor is running as root
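Creating the password file with restrictive permissions from the start avoids a window in which other users could read it. The following is a minimal sketch, assuming it runs as the account that should own the file and that a temporary directory stands in for the real $(SPOOL); the helper name and the sample password are hypothetical.

```python
import os
import tempfile

def write_password_file(spool_dir, password):
    """Create .quillwritepassword readable and writable by its owner only."""
    path = os.path.join(spool_dir, ".quillwritepassword")
    # O_EXCL refuses to reuse an existing file; 0o600 is set at creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)
    return path

# A temporary directory stands in for the real $(SPOOL) directory.
with tempfile.TemporaryDirectory() as spool:
    created = write_password_file(spool, "example-password")
    print(oct(os.stat(created).st_mode & 0o777))
```

If Condor runs as root, the file would additionally need to be owned by the condor uid, which this sketch does not change.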
Quill Configuration (cont)
› Condor system specific attributes in file condor_config.local
▪ QUILL              = $(SBIN)/condor_quill
▪ QUILL_LOG          = $(LOG)/QuillLog
▪ QUILL_ADDRESS_FILE = $(LOG)/.quill_address
▪ DAEMON_LIST        = …, QUILL
▪ VALID_SPOOL_FILES  = …, .quillwritepassword
▪ DC_DAEMON_LIST     = …, QUILL
Quill Configuration (cont)
› Quill specific attributes
QUILL_ENABLED        = TRUE
# The quill name must be unique across all
# quill daemons AND schedds
QUILL_NAME           = [email protected]
QUILL_DB_NAME        = psilord_db
QUILL_DB_IP_ADDR     = merlin.cs.wisc.edu:5432
QUILL_POLLING_PERIOD = 10 (seconds)
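The database settings above map naturally onto a PostgreSQL connection string. This is a sketch, assuming QUILL_DB_IP_ADDR has the form host:port and that a missing port falls back to 5432; both helper names are hypothetical, and the result uses the keyword/value format that libpq-based clients accept.

```python
def parse_db_addr(value, default_port=5432):
    """Split a 'host:port' setting into (host, port)."""
    host, sep, port = value.partition(":")
    return host, int(port) if sep else default_port

def conninfo(db_addr, db_name, user):
    """Build a libpq-style keyword/value connection string."""
    host, port = parse_db_addr(db_addr)
    return f"host={host} port={port} dbname={db_name} user={user}"

print(conninfo("merlin.cs.wisc.edu:5432", "psilord_db", "quillreader"))
```

The password is deliberately left out of the string; a read-only client would supply the quillreader password separately.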
Quill Configuration (cont)
› QUILL_HISTORY_CLEANING_INTERVAL = 24 (hours)
› QUILL_HISTORY_DURATION          = 30 (days)
› QUILL_MANAGE_VACUUM             = FALSE
› QUILL_IS_REMOTELY_QUERYABLE     = TRUE
› QUILL_DB_QUERY_PASSWD           = xxx
DB Storage Method
› Schema designed to store and query classads
▪ 4 tables to represent the job queue classads
▪ 2 for history data
▪ 1 for metadata
› Some queries are easier than others
› Ask more questions at the BOF!
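To illustrate why some queries are easier than others, here is a sketch of one plausible way to store classads: one row per (job, attribute) pair. The table and column names are hypothetical and are not Quill's actual schema, and SQLite stands in for PostgreSQL only so the example runs without a server.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical layout: one row per (job, attribute) pair.
db.execute("CREATE TABLE job_attrs (job_id TEXT, attr TEXT, val TEXT)")
db.executemany(
    "INSERT INTO job_attrs VALUES (?, ?, ?)",
    [
        ("92.0", "Owner", "psilord"),
        ("92.0", "JobStatus", "1"),
        ("93.0", "Owner", "psilord"),
        ("93.0", "JobStatus", "2"),
    ],
)

# Easy: a question about a single attribute.
owners = db.execute(
    "SELECT DISTINCT val FROM job_attrs WHERE attr = 'Owner'"
).fetchall()

# Harder: combining two attributes of the same job needs a self-join.
idle = db.execute(
    """
    SELECT o.job_id
    FROM job_attrs AS o
    JOIN job_attrs AS s ON s.job_id = o.job_id
    WHERE o.attr = 'Owner' AND o.val = 'psilord'
      AND s.attr = 'JobStatus' AND s.val = '1'
    """
).fetchall()
print(owners, idle)
```

Each extra attribute in a condition adds another join in this layout, which is the kind of cost a classad-oriented schema has to trade against flexibility.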
Quill++
› More comprehensive than Quill (data from all daemons, not just SchedD)
› Built on Quill code base
› Condor daemons write to SQL logs, Quill daemon reads and inserts in DBMS
› Central database serves entire pool
› Web-based query GUI
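The split between daemons that write SQL logs and a Quill daemon that loads them can be sketched as two small functions. Everything here is an assumption for illustration: the one-statement-per-line log format, the events table, and the function names are hypothetical, and SQLite stands in for the central PostgreSQL database.

```python
import os
import sqlite3
import tempfile

def append_event(log_path, statement):
    """What a daemon does in this sketch: append one SQL statement per line."""
    with open(log_path, "a") as log:
        log.write(statement + "\n")

def load_events(log_path, db):
    """What the loader does: replay each logged statement into the database."""
    count = 0
    with open(log_path) as log:
        for line in log:
            if line.strip():
                db.execute(line)
                count += 1
    db.commit()
    return count

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (daemon TEXT, message TEXT)")
with tempfile.TemporaryDirectory() as log_dir:
    path = os.path.join(log_dir, "sql.log")
    # Writing succeeds whether or not any database is reachable.
    append_event(path, "INSERT INTO events VALUES ('schedd', 'job 92.0 submitted')")
    append_event(path, "INSERT INTO events VALUES ('startd', 'claimed by psilord')")
    loaded = load_events(path, db)
print(loaded, db.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

Because the daemons only ever touch a local file, a database outage delays loading but does not block them.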
Data Capture in Quill++
› Condor daemons augmented to record important events in a database
› Database is in addition to standard daemon logs
› Pool will run unaffected even in the absence of a database
[Diagram: A Machine running the Schedd, Shadow, Startd, Starter, and Negotiator daemons]
Quill++ Architecture
[Diagram: the Master, Startd, …, and Schedd produce Event logs, and the Schedd also produces the Job Queue log. Quill++ reads these logs and loads Queue, History, Machine, Match etc. data into the RDBMS.]
Implementation Details
› Quill++: First-class Condor daemon
▪ Managed by Condor Master
▪ Native PostgreSQL API
▪ Can be ported to any platform for which PostgreSQL drivers are available (AIX, BSD, IRIX, HP-UX, Linux, Solaris, Windows etc.)
▪ Porting Quill++ to other databases involves implementing a database virtual class
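The idea of a database virtual class can be shown with an abstract base class and one concrete backend. This is an analogy in Python rather than the actual Quill++ class, which is not shown in the slides; the class names, method names, and the jobs table are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3
from abc import ABC, abstractmethod

class Database(ABC):
    """Hypothetical interface; the real Quill++ class is not shown in the slides."""

    @abstractmethod
    def connect(self): ...

    @abstractmethod
    def execute(self, sql, params=()): ...

class SQLiteDatabase(Database):
    """One concrete backend, chosen only because it ships with Python."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = None

    def connect(self):
        self.conn = sqlite3.connect(self.path)

    def execute(self, sql, params=()):
        return self.conn.execute(sql, params).fetchall()

def record_job(db, job_id):
    """Caller code depends only on the abstract interface."""
    db.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT)")
    db.execute("INSERT INTO jobs VALUES (?)", (job_id,))
    return db.execute("SELECT id FROM jobs")

backend = SQLiteDatabase()
backend.connect()
print(record_job(backend, "92.0"))
```

Supporting another database then means writing one more subclass, while callers such as record_job stay unchanged.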
Web Interface
› Useful for:
▪ User job monitoring
▪ Administrative monitoring over jobs and resources
▪ Debugging
Condordb Admin Screen
[Screenshot with panels: Jobs in queue, History jobs, Machine Status, Recency summary]
[Screenshot: Job history by owner]
[Screenshot: Machine Report]
[Screenshot: Status about a job, with panels: Classad Info, Run Info, Event Info, Match Info, Rejects Info]
[Screenshot: Recency info for exceptional data sources]
Quill++ Present Status
› Deployed in testbed
▪ dbc cluster (93 machines)
▪ Has successfully run almost 100,000 jobs.
▪ Planning distribution with early v6.9.x Condor release.
Quill++ Caveats
› Web interface to DB
▪ Basic prototype implemented
▪ Needs to be made more robust, user friendly (!)
› Gathers incomplete information in multiple pool scenarios (flocking, glide-in, Condor-C)
Thank you!