Millions of Jobs or a few good solutions …
David Abramson
Monash University
MeSsAGE Lab
No shortage of applications
How many jobs do they want/need?
• Physics
• Chemistry
• Environmental Science
• Biological Systems
• Engineering
• Astronomy
The Nimrod Tool Family
• Nimrod workflows for robust design and search
– Vary parameters
– Execute programs
– Copy data in and out
• Sequential and parallel dependencies
• Computational economy drives scheduling
• Computation scheduled near data when appropriate
• Use distributed high performance platforms
• Upper middleware broker for resource discovery
• Wide community adoption

[Figure: experiment lifecycle: generate scenarios, execution, results gathered, analysis]
Nimrod Portal

[Figure: Nimrod architecture: Nimrod Portal, Nimrod/O, Nimrod/E, Nimrod/G, Actuators, Grid Middleware]

Plan File

parameter pressure float range from 5000 to 6000 points 4
parameter concent float range from 0.002 to 0.005 points 2
parameter material text select anyof "Fe" "Al"

task main
copy compModel node:compModel
copy inputFile.skel node:inputFile.skel
node:substitute inputFile.skel inputFile
node:execute ./compModel < inputFile > results
copy node:results results.$jobname
endtask
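As a rough illustration (not part of the Nimrod toolset), the Python sketch below expands a plan like the one above into its full cross product of jobs; the values are the four pressure points, two concentration points and two materials implied by the plan file.

from itertools import product

# Parameter values implied by the plan file above:
# 4 points across 5000..6000, 2 points across 0.002..0.005, two materials.
pressure = [5000 + i * (6000 - 5000) / 3 for i in range(4)]
concent = [0.002, 0.005]
material = ["Fe", "Al"]

# The full sweep is the cross product: 4 x 2 x 2 = 16 jobs.
jobs = [{"pressure": p, "concent": c, "material": m}
        for p, c, m in product(pressure, concent, material)]
print(len(jobs))  # 16

With more parameters, or more points per parameter, this product grows multiplicatively, which is how experiments reach millions of jobs.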
[Figure: portal workflow: prepare jobs using the portal, jobs sent to available machines, jobs scheduled and executed dynamically, results displayed and interpreted]
Parameter Sweeps and Searches
• A full parameter sweep is the cross product of all the parameters
  – Too easy to generate millions! (see the sketch below)
• An optimization run minimizes some output metric and returns the parameter combinations that do this
  – Limited concurrency (except for GAs)
• Design of Experiments limits the number of combinations further
  – An old idea …
[Figure: Nimrod/O search returning results]
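To make the contrast concrete, here is a hedged sketch (toy metric and toy parameter values, not Nimrod/O code) comparing how many model evaluations a full sweep needs against a simple coordinate search that optimises one parameter at a time; the search needs far fewer evaluations, but each pass depends on the previous one, so its concurrency is limited.

# A toy "model": the output metric to be minimised (assumed for illustration).
def metric(pressure, concent):
    return (pressure - 5600) ** 2 + ((concent - 0.004) * 1e5) ** 2

pressures = [5000 + 10 * i for i in range(101)]      # 101 candidate pressures
concents = [0.002 + 0.0001 * i for i in range(31)]   # 31 candidate concentrations

# Full sweep: every combination is an independent job (fully parallel).
sweep_evals = len(pressures) * len(concents)          # 3131 evaluations

# Coordinate search: optimise one parameter at a time over a few passes.
# Far fewer evaluations, but each pass is largely sequential.
p, c, search_evals = pressures[0], concents[0], 0
for _ in range(3):
    p = min(pressures, key=lambda x: metric(x, c)); search_evals += len(pressures)
    c = min(concents, key=lambda x: metric(p, x)); search_evals += len(concents)

print(sweep_evals, search_evals)                      # 3131 vs 396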
Issues for millions of jobs
• Generation issues
– Don’t necessarily need 1,000,000 jobs!
– Smarter ways of specifying problems
• Don’t want to see 1,000,000 jobs!
• Don’t necessarily generate all at once
• Performance issues
– Nimrod/G: Server load
• Hierarchical resource management
– Nimrod/K: Handling token load in matching store
• Need k-bounded loop ideas from the 1980s (sketch below)
• Fault tolerance
– Engaging the user
• Don’t want to see 1,000,000 jobs
– Distributed experiment management (p2p)?
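A minimal sketch of the generation idea, assuming nothing about Nimrod/K internals: produce jobs lazily from the cross product and keep only k of them in flight at once, so neither the server nor the user ever has to hold the full million.

from itertools import product, islice

def jobs():
    # Lazily generate the cross product instead of materialising every job up front.
    for p, c, m in product(range(1000), range(100), ("Fe", "Al")):
        yield {"pressure": p, "concent": c, "material": m}

def run_bounded(job_iter, k=64):
    # Keep at most k jobs in flight, in the spirit of k-bounded loops:
    # a new job is only injected when an earlier one completes.
    in_flight = list(islice(job_iter, k))
    completed = 0
    while in_flight:
        in_flight.pop(0)                   # stand-in for "a job finished"
        completed += 1
        nxt = next(job_iter, None)
        if nxt is not None:
            in_flight.append(nxt)
    return completed

print(run_bounded(jobs()))                 # 200000 jobs, never more than 64 queued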
Issues for millions of jobs
• Analysis issues
– Need smarter ways of interacting with results
• Scientific visualisation, data mining, mega-pixel displays
• Commercial realities
– License management
• Need parametric licenses, analogous to parallel licenses
• Appropriate infrastructure
– TeraGrid-class machines are not the most appropriate
– Parametric Clouds