Quill Tutorial
Condor Week 2006
Peter Keller
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
What is Quill?
A non-invasive method of storing a read-only version of the job queue and job history data in a relational database.
www.cs.wisc.edu/condor
Why Do We Need It?
› Presents the job queue information as a set of tables in a relational database (Big Win!)
› Fault tolerance
› Provides performance enhancements in very large and busy pools
Job Queue Management
[Diagram, two panels. Without Quill: schedd, Job Queue. With Quill: schedd, Job Queue, quilld, Database.]
Deployment
› One Quill daemon per schedd
› Quill daemons must be uniquely named
› Each Quill daemon uses a unique DB name
› Multiple Quill daemons may utilize one database server
› Currently uses PostgreSQL
  - Recommend PostgreSQL 8.1 or later for automatic vacuuming of tables
Condor’s Interface to Quill
› Modified two tools to utilize the DB
  - condor_q
  - condor_history
› Very minor modifications to schedd
› Multiple sources for Job Queue & History pose an interesting problem
Job Queue Discovery Sequence (Local Query)
[Diagram: condor_q, Database, quilld, and schedd (Job Queue), with numbered steps: 1 Database, 2 quilld, 3 schedd.]
Job Queue Discovery Sequence (Remote Query)
[Diagram: condor_q, collector, Database, quilld, and schedd (Job Queue), with numbered steps: 0 collector, 1 Database, 2 quilld, 3 schedd.]
A User Perspective: condor_q
› condor_q changes
  - -name takes a ScheddName or QuillName
  - -avgqueuetime reports the average time in queue for all jobs
A User Perspective: condor_q
Example: condor_q -name
Linux merlin > condor_q -name [email protected]
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID     OWNER    SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
92.0   psilord  4/21 09:21   0+00:00:00  I   0    9.8   foo
1 jobs; 1 idle, 0 running, 0 held
A User Perspective
Example: condor_q -avgqueuetime
Linux merlin > condor_q -avgqueuetime
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
Average time in queue for uncompleted jobs (in hh:mm:ss)
00:40:47.011993
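A script that consumes the -avgqueuetime value will usually want it as a duration rather than a string. The following Python sketch assumes the value always has the plain hh:mm:ss[.fraction] form shown in the example output; `parse_queue_time` is a hypothetical helper, not part of Condor.

```python
from datetime import timedelta


def parse_queue_time(text: str) -> timedelta:
    """Convert an hh:mm:ss[.ffffff] string into a timedelta.

    Assumes exactly three colon-separated fields, as in the example output.
    """
    hours, minutes, seconds = text.strip().split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


if __name__ == "__main__":
    average = parse_queue_time("00:40:47.011993")
    print(average.total_seconds())
```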
Job History Discovery Sequence (Local Query)
[Diagram: condor_history, Database, quilld, Job Queue, and History File, with numbered steps 1 and 2. Callout on the quilld: "The quilld is never queried directly!"]
Job History Discovery (Remote Query) NEW!
[Diagram: condor_history, collector, Database, quilld, Job Queue, and History File, with numbered steps 0 and 1. Callout on the quilld: "The quilld is never queried directly!"]
A User Perspective: condor_history
› condor_history changes
  - -name takes a Quill Name to retrieve job histories from a remote quill's database
  - -completedsince returns all jobs completed since a PostgreSQL-formatted date
A User Perspective: condor_history
Example: condor_history -name
Linux merlin > condor_history -name [email protected]
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID     OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED   CMD
91.0   psilord  4/20 14:23   0+00:00:00  X   ???         /scratch/psilor
92.0   psilord  4/21 09:21   0+00:00:00  X   ???         /scratch/psilor
93.0   psilord  4/21 10:12   0+00:00:01  C   4/21 10:12  /scratch/psilor
A User Perspective: condor_history
Example: condor_history -completedsince
Linux merlin > condor_history -completedsince "2006-01-01 00:00:01"
-- DB: [email protected] : <merlin.cs.wisc.edu:42999> : psilord_db
ID     OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED   CMD
93.0   psilord  4/21 10:12   0+00:00:01  C   4/21 10:12  /scratch/psilor
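The date passed to -completedsince in the example uses the "YYYY-MM-DD HH:MM:SS" form. As a minimal sketch, the Python below builds such an argument for "jobs completed in the last N days"; `completed_since_arg` and `build_history_command` are hypothetical helper names, and only the condor_history -completedsince flag itself comes from this tutorial.

```python
from datetime import datetime, timedelta
from typing import List


def completed_since_arg(now: datetime, days_back: int) -> str:
    """Format the cutoff as 'YYYY-MM-DD HH:MM:SS', matching the slide's example."""
    cutoff = now - timedelta(days=days_back)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S")


def build_history_command(now: datetime, days_back: int) -> List[str]:
    """Return an argument list suitable for subprocess.run (not executed here)."""
    return ["condor_history", "-completedsince", completed_since_arg(now, days_back)]


if __name__ == "__main__":
    print(build_history_command(datetime.now(), 7))
```

Passing the timestamp as a single list element avoids shell quoting problems with the embedded space.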
Short Circuiting the Discovery Sequence
› Use the -direct option!
› Examples
  - condor_q -direct rdbms
  - condor_q -direct quilld
  - condor_q -direct schedd
› "rdbms", "quilld", and "schedd" are the actual parameters.
› Invaluable for debugging!
PostgreSQL 8.1 Installation
› ./configure
› gmake && gmake install
› mkdir /path/to/pgsql/data
› initdb -D /path/to/pgsql/data
› postmaster -D /path/to/pgsql/data
› Note: PostgreSQL listens on port 5432 by default.
PostgreSQL Configuration
› Add two special user accounts: quillreader and quillwriter
    createuser quillreader --no-createdb --no-adduser --pwprompt
    createuser quillwriter --createdb --no-adduser --pwprompt
PostgreSQL Configuration (cont)
› Allow TCP/IP connections
  - Edit file postgresql.conf
    • Add listen_addresses = '*'
› Allow connections from specific hosts
  - Edit file pg_hba.conf
    • host all quillreader 128.105.0.0 255.255.0.0 password
    • host all quillwriter 128.105.0.0 255.255.0.0 password
› Note: only use 'password' authentication at this time.
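Each pg_hba.conf entry above pairs a network address with a netmask, so 128.105.0.0 with 255.255.0.0 admits any client whose address begins with 128.105. The Python sketch below checks that arithmetic with the standard ipaddress module; `host_matches` is a hypothetical helper, and the client addresses in the demo are made up for illustration.

```python
import ipaddress


def host_matches(client_ip: str, address: str, netmask: str) -> bool:
    """Return True if client_ip falls inside the address/netmask range."""
    # ip_network accepts the dotted netmask form, e.g. "128.105.0.0/255.255.0.0".
    network = ipaddress.ip_network(f"{address}/{netmask}")
    return ipaddress.ip_address(client_ip) in network


if __name__ == "__main__":
    for client in ("128.105.14.7", "128.104.1.1"):  # illustrative addresses
        print(client, host_matches(client, "128.105.0.0", "255.255.0.0"))
```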
Quill Configuration
› The quillwriter user needs a password for writing to the database.
› Store it in a file called .quillwritepassword in the $(SPOOL) directory.
› If Condor is running as root, ensure that only the condor uid can read the file.
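One way to create the password file with restrictive permissions is sketched below in Python. This is an illustration under assumptions: `write_password_file` is a hypothetical helper, the spool directory is passed in by the caller, the script is assumed to run as the condor user so that the file is owned by that uid, and the file is assumed to contain only the password text (the tutorial does not specify the exact file format).

```python
import os
import stat


def write_password_file(spool_dir: str, password: str) -> str:
    """Create .quillwritepassword in spool_dir, readable and writable by the owner only."""
    path = os.path.join(spool_dir, ".quillwritepassword")
    # O_EXCL refuses to overwrite an existing file; 0o600 means owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)  # assumption: the file holds just the password
    os.chmod(path, 0o600)  # enforce the mode explicitly, independent of the umask
    return path


def is_owner_only(path: str) -> bool:
    """Check that neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Creating the file with the restrictive mode up front, rather than tightening it afterwards, avoids a window in which other users could read the password.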
Quill Configuration (cont)
› Condor system specific attributes in file condor_config.local
    QUILL              = $(SBIN)/condor_quill
    QUILL_LOG          = $(LOG)/QuillLog
    QUILL_ADDRESS_FILE = $(LOG)/.quill_address
    DAEMON_LIST        = …, QUILL
    VALID_SPOOL_FILES  = …, .quillwritepassword
    DC_DAEMON_LIST     = …, QUILL
Quill Configuration (cont)
› Quill specific attributes
    QUILL_ENABLED        = TRUE
    # The quill name must be unique across all
    # quill daemons AND schedds
    QUILL_NAME           = [email protected]
    QUILL_DB_NAME        = psilord_db
    QUILL_DB_IP_ADDR     = merlin.cs.wisc.edu:42999
    QUILL_POLLING_PERIOD = 10 (seconds)
Quill Configuration (cont)
› QUILL_HISTORY_CLEANING_INTERVAL = 24 (hours)
› QUILL_HISTORY_DURATION          = 30 (days)
› QUILL_MANAGE_VACUUM             = FALSE
› QUILL_IS_REMOTELY_QUERYABLE     = TRUE
› QUILL_DB_QUERY_PASSWD           = xxx
DB Storage Method
› Schema designed to store and query classads
  - 4 tables to represent the job queue classads
  - 2 for history data
  - 1 for metadata
› Some queries are easier than others
› Ask more questions at the BOF!
Thank you!
› Want more information?
› BOF "Databases in Condor: Now and in the Future"