AMUN
A Practical Application Using the
Nile Distributed Operating System
Authors:
R. Baker (Cornell University, Ithaca, NY USA)
L. Zhou (University of Florida, Gainesville, FL USA)
J. Duboscq (Ohio State University, Columbus, OH USA)
Presented by:
D. Mimnagh (University of Texas, Austin, TX USA)
2/8/00
CHEP2000
Overview
• What is Nile?
• What is AMUN?
• Results
• Conclusions
What is Nile?
• Nile: Distributed computing solution for CLEO
– fault-tolerant (recover from resource failure)
– self-managing (sophisticated resource scheduling)
– heterogeneous (will run anything anywhere)
• Designed for HEP
– track reconstruction
– data analysis
– simulation
• But very generic
Nile Architecture
What is AMUN?
• Advanced Monte Carlo Under Nile
• CLEO II.V signal Monte Carlo
– τ lepton pair events
• Testbed
– Nile control system using RMI (see E272)
– Borrowed workstation program
Managing Loaned Workstations
• Prototype
– csh scripts
– list of machine owners
• Must be easy and honest
– simple creation of configuration files
– monitor usage remotely and locally
– allow preemption for unexpected usage
– need local space for intermediate results
• Will be integrated with Nile in Java
[Figure: CPU Boss; Node 1, Node 2, Node 3, ...]
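The "allow preemption for unexpected usage" requirement can be illustrated with a minimal sketch. It is not the csh prototype described on this slide. The `LoanedNode` record, the `max_owner_load` threshold, and the function names are all hypothetical, and the load figures are assumed to come from whatever monitoring the site already has.

```python
from dataclasses import dataclass


@dataclass
class LoanedNode:
    name: str              # workstation host name
    owner: str             # person who lent the machine (from the owner list)
    max_owner_load: float  # hypothetical threshold: yield the CPU above this load


def should_preempt(node, owner_load):
    """Return True when the owner's own activity exceeds the agreed threshold."""
    return owner_load > node.max_owner_load


def usable_nodes(nodes, owner_loads):
    """List the names of nodes that may keep running borrowed work.

    owner_loads maps node name -> load currently caused by the owner.
    """
    return [n.name for n in nodes if not should_preempt(n, owner_loads[n.name])]
```

A per-node threshold keeps the policy in the configuration file rather than in the monitoring code, which is one way to read the "easy and honest" requirement.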
Nile Performance Results
• Very stable
– weeks of uninterrupted use
• Heterogeneity
– as many as 60 machines, Alpha Linux + Unix
– SpecInt ranging from 1 to 25
• Scaling
– linear
– network topology issues can break linearity
– 1-3 seconds to reschedule a CPU
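To show what linear scaling with total SpecInt implies for a pool whose ratings range from 1 to 25, the sketch below divides a fixed number of events among machines in proportion to their ratings. This is only an illustration of the arithmetic, not a description of how Nile's scheduler assigns work. The function name and the machine names in the usage note are hypothetical.

```python
def split_by_specint(ratings, total_events):
    """Divide total_events among machines in proportion to their SpecInt rating.

    ratings maps machine name -> SpecInt rating (must be positive).
    Returns machine name -> integer event count; the counts sum to total_events.
    """
    total = sum(ratings.values())
    if total <= 0:
        raise ValueError("ratings must sum to a positive value")
    exact = {name: total_events * r / total for name, r in ratings.items()}
    shares = {name: int(x) for name, x in exact.items()}
    # Rounding down can leave a few events unassigned; give them to the
    # machines with the largest fractional remainders.
    leftover = total_events - sum(shares.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

For example, with hypothetical ratings `{"slow": 1, "mid": 4, "fast": 25}` and 3000 events, the fast machine receives 2500 events and the slow one 100, so under ideal linear scaling all three finish at about the same time.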
Scaling with Total SpecInt
Events Generated
• Job construction requirements:
– choose subjob size
– collection script
• 25 million τ events generated
• as many as 1 million a day
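The two job construction requirements, choosing a subjob size and collecting the results, can be sketched as follows. The function names and the (first event, event count) representation are assumptions made for illustration. The real collection script gathers output files, which is reduced here to summing event counts.

```python
import math


def make_subjobs(total_events, subjob_size):
    """Split a job into (first_event, n_events) subjobs of at most subjob_size."""
    if subjob_size <= 0:
        raise ValueError("subjob_size must be positive")
    return [
        (start, min(subjob_size, total_events - start))
        for start in range(0, total_events, subjob_size)
    ]


def collect(subjob_event_counts):
    """Stand-in for the collection step: total events returned by all subjobs."""
    return sum(subjob_event_counts)


def min_days(total_events, peak_events_per_day):
    """Lower bound on running time if every day reached the peak rate."""
    return math.ceil(total_events / peak_events_per_day)
```

With the figures on this slide, `min_days(25_000_000, 1_000_000)` gives 25, so the 25 million events correspond to at least 25 days of running even at the peak daily rate.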
Conclusion
• Successful implementation of Nile in RMI
• CPU resources used efficiently
– loaned CPU
• To do:
– rewrite scripts in Java
– admin tools
– GUI tools