
Implementing a Central Quill Database in
a Large Condor Installation
Preston Smith
[email protected]
Condor Week 2008 - April 30, 2008
Overview
• Background
– BoilerGrid
• Motivation
• What works well
• What has been challenging
• What just doesn’t work
• Future directions
BoilerGrid
• Purdue Condor Grid (BoilerGrid)
– Comprised of Linux HPC clusters, student labs,
machines from academic departments, and Purdue
regional campuses
• 8900 batch slots today..
• 14,000 batch slots in a few weeks
• 2007 - Delivered over 10 million CPU-hours to
high-throughput science at Purdue and in the national
community through the Open Science Grid and
TeraGrid
BoilerGrid - Growth
[Chart: BoilerGrid pool size (cores) by year, 2003-2009]
BoilerGrid - Results
[Charts: BoilerGrid jobs completed per year, hours delivered per year, and unique users per year, 2003-2008]
A Central Quill Database
• As of Condor 6.9.4:
– Quill can store information about all the
execute machines and daemons in a pool
– Quill is now able to store job history and queue
contents in a single, central database.
• Since December 2007, we’ve been
working to store the state of BoilerGrid in a
Quill installation
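As a rough sketch, pointing an entire pool at one central Quill database comes down to condor_config settings along these lines (macro names are from the Condor manual of this era; the host, port, and database name below are placeholders, not our real values):

# Hypothetical condor_config fragment shared by every machine in the pool
QUILL_ENABLED    = TRUE
QUILL_DB_TYPE    = PGSQL
QUILL_DB_NAME    = quill
QUILL_DB_IP_ADDR = quill-db.example.edu:5432
# Each Quill daemon advertises its own name
QUILL_NAME       = quill@$(FULL_HOSTNAME)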
Motivation
• Why would we want to do such a thing??
– Research into the state of a large distributed system
• Several at Purdue, collaborators at Notre Dame
– Failure analysis/prediction, smart scheduling,
interesting reporting for machine owners
– “events” table useful for user troubleshooting?
– And one of our familiar gripes - usage reporting
• Structural biologists (see earlier today) like to submit jobs
from their desks, too
• How can we access that job history to complete the picture of
BoilerGrid’s usage?
The Quill Server
• Dell 2850
– 2x 2.8GHz Xeons (hyperthreaded)
– Postgres on 4-disk Ultra320 SCSI RAID-0
– 5GB RAM
What works well
• Getting at usage data!
quill=> select distinct scheddname, owner, cluster_id, proc_id, remotewallclocktime
        from jobs_horizontal_history
        where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;

       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+---------------------
 epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
(10 rows)
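With everything in one database, per-user rollups are just SQL. A hypothetical follow-on query (same jobs_horizontal_history columns as above) that sums wall-clock time per owner on those schedds:

SELECT owner,
       count(*)                 AS jobs,
       sum(remotewallclocktime) AS total_wallclock_seconds
FROM jobs_horizontal_history
WHERE scheddname LIKE '%bio.purdue.edu%'
GROUP BY owner
ORDER BY total_wallclock_seconds DESC;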
What works, but is painful
• Thousands of hosts pounding a Postgres
database is non-trivial
– Be sure to turn down the polling frequency (raise
QUILL_POLLING_PERIOD - see the one-liner below)
• Default is 10s - we went to 1 hour on execute
machines
– At some level, this is an exercise in tuning your
Postgres server.
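For reference, the polling change above is a one-line condor_config setting on the execute machines (the macro takes seconds, so one hour is 3600):

# condor_config on execute machines: poll once an hour instead of every 10 seconds
QUILL_POLLING_PERIOD = 3600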
top - 13:45:30 up 23 days, 19:59, 2 users, load average: 563.79, 471.50, 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem:  5079368k total, 5042452k used,   36916k free,   10820k buffers
Swap: 4016200k total,   68292k used, 3947908k free, 2857076k cached
• Quick diversion into Postgres tuning 101..
Postgres
• Assuming that there’s enough disk
bandwidth….
– In order to support 2500 simultaneous
connections, one must turn up
max_connections
– If you turn up max_connections, you need
~400 bytes of shared memory per slot.
• Currently we have 2G of shared memory allocated
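In postgresql.conf terms, that is roughly the following (2500 matches our pool; size yours to the number of daemons that will connect):

# postgresql.conf (sketch) - one connection per polling daemon, plus headroom
max_connections = 2500
# the kernel's shared-memory limit (e.g. kernel.shmmax) must be raised to match;
# we currently allocate 2G of shared memory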
Postgres
• Then you’ll need to turn up shared_buffers
– 1G currently
WARNING: relation "public.machines_vertical_history" contains more than
"max_fsm_pages" pages with useful free space
HINT: Consider compacting this relation or increasing the configuration
parameter "max_fsm_pages".
– Don’t forget max_fsm_pages…
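The corresponding postgresql.conf lines look roughly like this (1GB is what we run; the max_fsm_pages value is only a placeholder - take the real number from VACUUM VERBOSE output or the warning itself):

shared_buffers = 1GB        # memory-unit syntax needs Postgres 8.2 or newer
max_fsm_pages  = 2000000    # placeholder - raise it until the warning above stops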
What works, but is painful
• So by now we can withstand the worker nodes
reasonably well
• Add schedds
– condor_history returns history from ALL schedds
• Bug fixed in 7.0.2
– The execute machines create enough load that
condor_q is sluggish
– Added a 2nd quill database server just for job
information
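One way to sketch that split, assuming it is done purely with per-role condor_config (host names here are placeholders): give submit and execute machines different QUILL_DB_IP_ADDR values, for example:

# On submit (schedd) machines: report job queue/history to the job-only database
QUILL_DB_IP_ADDR = quill-jobs.example.edu:5432

# On execute machines: keep reporting machine/daemon data to the original server
QUILL_DB_IP_ADDR = quill-machines.example.edu:5432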
What works, but is painful
• If your daemons log a lot to sql.log files
but aren’t writing to the database..
– Database is down, etc.
– Your database is in a world of hurt while it
tries to catch up..
What Hasn’t Worked
• Many Postgres tuning guides recommend
a connection pooler if you need scads of
connections
– pgpool-II
– PgBouncer
• Tried both; Quill doesn’t seem to like either
– It *did* reduce load….
– But it often locked up the database (connections stuck “idle in
transaction”) and didn’t get anywhere
What can we do about it?
• Throw hardware at the database!
– Spindle count seems ok
• Not I/O bound (any more)
– More memory = more connections
• 16GB? More?
– More, faster CPUs
• We appear to be CPU-bound now
• Get latest multi-cores
What can we do about it?
• Contact Wisconsin and call for rescue
“Hey guys.. This is really hard on the old database”
“Hmm. Let’s take a look.”
What can Wisconsin do about it?
• Todd, Greg, and probably others take a
look:
– Quill always hits the database, even for
unchanged ads
– Postgres backend does not prepare SQL
queries before submitting
• Being fixed, Todd is optimistic
– We’ll report the results as soon as we
have them
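For context on the second point: a prepared statement lets Postgres plan a query once and re-execute it cheaply with new parameters, instead of re-parsing and re-planning it for every operation. A generic illustration (standard Postgres syntax against the history table shown earlier, not Quill’s actual statements):

-- plan once
PREPARE job_walltime (text, integer) AS
  SELECT remotewallclocktime FROM jobs_horizontal_history
  WHERE scheddname = $1 AND cluster_id = $2;

-- then execute it repeatedly with different parameters
EXECUTE job_walltime('epsilon.bio.purdue.edu', 276189);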
Future Directions
• Reporting for users
– Easy access to statistics about who ran on
“my” machines.
• Mashups, web portals
– Diagnostic tools to help users
• Troubleshooting, etc.
The End
• Questions?
Backup slides
BoilerGrid - Results
Year   Pool Size   Jobs        Hours Delivered   Unique Users
2004   1500        43,551      346,000           14
2005   4000        210,717     1,695,000         26
2006   6100        4,251,981   5,527,000         72
2007   7700        9,611,813   9,524,000         117
2008   14000+      ?           ?                 63 so far..