
Checkpoint and Migration in the Condor Distributed Processing System
Presentation by James Nugent
CS739, University of Wisconsin-Madison
Primary Sources
 J. Basney and M. Livny, "Deploying a High Throughput Computing Cluster", in High Performance Cluster Computing, Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999.
 Condor web page: http://www.cs.wisc.edu/condor/
 M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", Computer Sciences Technical Report #1346, University of Wisconsin-Madison, April 1997.
 Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June 1988.
Outline
 Goals of Condor
 Overview of Condor
 Scheduling
 Remote Execution
 Checkpointing
 Flocking
Goals
 Opportunistically use idle cycles for work
   Remote machines have intermittent availability
   Lower reliability, since cluster processing is not the primary function of the machines used
 Portability
   Only requires re-linking (applications that are not re-linked cannot be checkpointed)
 Remote machines are not a dedicated part of the system
Voluntary
 People should see a benefit from using it, or at least no penalty for letting others use their machine
 No security problems from using Condor, or from Condor using your machine
 Low impact on remote machines' performance
Implementation Challenges
 Run serial programs at remote locations without any source code changes to the program
   The program must be re-linked
 Programs may have to move to a different machine
   Checkpointing allows process migration and provides reliability
 Provide transparent access to remote resources
Basic Functioning--Condor Pool
[Diagram: a Condor pool of machines A-D and a Central Manager; the Central Manager selects available machines (A and B) to run jobs, another machine submits a job, and an unavailable machine (D) is skipped.]
Basic Execution Model
1. The source machine creates a shadow process.
2. The source machine sends the job to the remote machine.
3. The remote machine runs the job.
4. The job sends remote system calls to the shadow process on the source machine.
[Diagram: source machine (shadow process, Condor daemon) and remote machine (job, Condor daemon), with arrows marking steps 1-4.]
Scheduling
[Flowchart: starting from a machine that has jobs to run, the Central Manager checks whether the next job can find a remote machine to run on; if so, the CM selects a machine and the next job runs, otherwise the loop repeats.]
Central Manager--Select a Machine to Have a Job Run
 Priority(Machine) = Base Priority - (# Users) + (# Running Jobs)   (see the sketch below)
   (# Users) is the number of users who have jobs local to this machine, i.e. their jobs started on this machine
   Lower values mean higher priority
   Ties are broken randomly
 The CM periodically receives information from each machine:
   The number of remote jobs the machine is running
   The number of remote jobs the machine wants to run
   This is the only state about machines the CM stores, which reduces both the stored state and what machines must communicate to the CM
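A minimal sketch of this selection rule in C, assuming a simple machine record holding the three quantities named above; the field names, the pairwise coin-flip tie-break, and select_machine() are illustrative assumptions, not Condor's actual implementation.

#include <stdlib.h>

struct machine {
    int base_priority;   /* configured base priority                      */
    int users;           /* users whose jobs originated on this machine   */
    int running_jobs;    /* remote jobs this machine is currently running */
};

/* Priority(Machine) = Base Priority - (# Users) + (# Running Jobs); lower wins. */
int machine_priority(const struct machine *m)
{
    return m->base_priority - m->users + m->running_jobs;
}

/* Return the index of the machine the CM would give the next job to. */
int select_machine(const struct machine *ms, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++) {
        int pi = machine_priority(&ms[i]);
        int pb = machine_priority(&ms[best]);
        if (pi < pb || (pi == pb && rand() % 2))   /* ties broken randomly */
            best = i;
    }
    return best;
}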
Fair Scheduling
 If user A submits 4 jobs and user B submits 1, then:
   User A's jobs wait 1/5 of the time
   User B's job must wait 4/5 of the time
   This assumes there is insufficient capacity to schedule all jobs immediately
   User B contributes as much computing power to the system as user A
Up-Down Scheduling--Workstation Priorities
 Each machine has a scheduling index (SI)   (see the sketch below)
   Lower indexes have higher priority
   The index goes up when a machine is granted capacity (priority decreases)
   The index goes down when a machine is denied access to capacity (priority increases)
   The index also decreases slowly over time (priority increases over time)
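A small C sketch of the up-down index updates described above; the adjustment constants and function names are assumptions made for illustration.

/* Lower index = higher priority for the workstation's requests. */
struct workstation {
    long sched_index;
};

void granted_capacity(struct workstation *w) { w->sched_index += 10; }  /* priority decreases */
void denied_capacity(struct workstation *w)  { w->sched_index -= 10; }  /* priority increases */
void periodic_decay(struct workstation *w)   { w->sched_index -= 1;  }  /* priority slowly rises over time */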
Host Machines
 Priority(Job) = (UserPriority * 10) + Running(1|0) + (WaitTime / 1000000000)
   Running jobs consume disk space
 Jobs are ordered first by user priority, then by whether they are running, and finally by wait time   (see the sketch below)
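A sketch of this per-host ordering as a qsort() comparator over the composite value above; the struct layout and the ascending sort direction are assumptions (the slide does not say whether lower or higher values are served first).

#include <stdlib.h>

struct job {
    int    user_priority;   /* per-user priority               */
    int    running;         /* 1 if the job is already running */
    double wait_time;       /* time spent waiting, in seconds  */
};

/* Priority(Job) = UserPriority*10 + Running(1|0) + WaitTime/1e9 */
double job_priority(const struct job *j)
{
    return j->user_priority * 10.0 + j->running + j->wait_time / 1e9;
}

/* Comparator: the weighting sorts first on user priority, then on running
 * status, then on wait time, as the slide describes.
 * Usage: qsort(jobs, njobs, sizeof(struct job), compare_jobs);            */
int compare_jobs(const void *a, const void *b)
{
    double pa = job_priority(a), pb = job_priority(b);
    return (pa > pb) - (pa < pb);
}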
Classads--Finding Servers for Jobs
 Classads are tuples of attribute names and expressions, e.g. (Requirements = Other.Memory > 32)
 Two classads match if their Requirements attributes evaluate to true in the context of each other   (see the sketch below)
 Example:
   Job: (Requirements = Other.Memory > 32 && Other.Arch == "Alpha")
   Machine: (Memory = 64), (Arch = "Alpha")
   These two classads would match
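A minimal C sketch of classad-style matchmaking for the example above: each ad carries a Requirements predicate, and two ads match when each predicate is true in the context of the other. The fixed attribute set and match() logic are illustrative assumptions, not the real ClassAd language.

#include <stdio.h>
#include <string.h>

struct ad {
    int  memory;            /* Memory attribute (MB) */
    char arch[16];          /* Arch attribute        */
    int (*requirements)(const struct ad *self, const struct ad *other);
};

/* Job side: Requirements = Other.Memory > 32 && Other.Arch == "Alpha" */
static int job_req(const struct ad *self, const struct ad *other)
{
    (void)self;
    return other->memory > 32 && strcmp(other->arch, "Alpha") == 0;
}

/* Machine side: no constraints advertised, so accept any job. */
static int machine_req(const struct ad *self, const struct ad *other)
{
    (void)self; (void)other;
    return 1;
}

static int match(const struct ad *a, const struct ad *b)
{
    return a->requirements(a, b) && b->requirements(b, a);
}

int main(void)
{
    struct ad job     = { .memory = 0,  .arch = "",      .requirements = job_req };
    struct ad machine = { .memory = 64, .arch = "Alpha", .requirements = machine_req };

    printf("match: %s\n", match(&job, &machine) ? "yes" : "no");   /* prints "yes" */
    return 0;
}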
Remote Execution--Goals
 Execute a program on a remote machine as if it were executing on the local machine
 All the machines in a pool should not need to be identically configured in order to run remote jobs
 The CM should only be involved in starting and stopping jobs, not while they are running
Resource Access
 Remote resource access
   A shadow process runs on the local (submitting) machine
     The Condor library sends system calls to the local machine
     Thus there is a performance penalty on the submitting machine, plus extra network traffic
     The user does not need an account on the remote machine; jobs execute as 'nobody'
     This compensates for differences between systems
 AFS, NFS
   The shadow checks whether a networked file system can be used (i.e. is the file reachable? it looks up local mappings)
   If the file can be reached via the file system, that path is used instead of remote system calls, for performance reasons
Interposition--Sending Syscalls
Remote job:
  ...
  fd = open(args);
  ...
Condor library (linked into the job):
  int open(args) {
    send(open, args);
    receive(result);
    store(args, result);
    return result;
  }
Shadow process on the local machine:
  receive(open, args);
  result = syscall(open, args);
  send(result);
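To make the pattern above concrete, here is a hedged C sketch of what an interposed open() stub might look like: the job is linked against a replacement open() that ships the request to the shadow over a socket and returns the shadow's result. The message format, the RPC_OPEN tag, shadow_fd, and remember_file() are illustrative assumptions, not Condor's actual protocol.

#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>

enum { RPC_OPEN = 1 };                       /* assumed request tag */

struct rpc_request { int call; char path[256]; int flags; int mode; };
struct rpc_reply   { int result; };

int shadow_fd = -1;   /* socket to the shadow; assumed to be set up at job start */

static void remember_file(const struct rpc_request *req, int result)
{
    /* The real library records open files so their state (offsets, modes)
     * can be restored after a checkpoint; omitted in this sketch. */
    (void)req; (void)result;
}

/* Replacement for open(): corresponds to send(open, args) / receive(result)
 * on the slide.  The shadow executes the real open() on the submitting
 * machine and sends back the result. */
int open(const char *path, int flags, ...)
{
    struct rpc_request req = { .call = RPC_OPEN, .flags = flags, .mode = 0 };
    struct rpc_reply   rep = { -1 };

    if (flags & O_CREAT) {                   /* a mode is only passed with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        req.mode = va_arg(ap, int);
        va_end(ap);
    }
    strncpy(req.path, path, sizeof(req.path) - 1);

    write(shadow_fd, &req, sizeof(req));     /* send(open, args)    */
    read(shadow_fd, &rep, sizeof(rep));      /* receive(result)     */
    remember_file(&req, rep.result);         /* store(args, result) */
    return rep.result;
}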
Limitations
 File operations must be idempotent
   Example: checkpoint, open a file for append, write to the file, then roll back to the previous checkpoint; on re-execution the append is repeated, so the data ends up written twice
 Some functionality is limited:
   No socket communication
   Only single-process jobs
   Some signals are reserved for Condor's use (SIGUSR2, SIGTSTP)
   No interprocess communication (pipes, semaphores, or shared memory)
   Alarms, timers, and sleeping are not allowed
   Multiple kernel threads are not allowed (user-level threads are OK)
   Memory-mapped files are not allowed
   File locks are not retained between checkpoints
   Network communication must be brief
     Communication defers checkpoints
When Users Return
[Flowchart: when the owner returns, the job is suspended and Condor waits 5 minutes. If the user is no longer using the machine, the job resumes execution. If the user is still active, the job is vacated (checkpointing starts) and Condor waits another 5 minutes; if the checkpoint completes, it migrates with the job, otherwise the job is killed.]
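A C sketch of this policy as a straight-line function, assuming hypothetical helpers (owner_active, checkpoint_done, suspend/resume/vacate/kill_job) and the 5-minute waits from the flowchart; the stubs below only exist to make the sketch compile.

#include <stdbool.h>
#include <unistd.h>

#define GRACE_SECONDS (5 * 60)

/* Stubs standing in for the real daemon's checks and actions. */
static bool owner_active(void)    { return true;  }
static bool checkpoint_done(void) { return false; }
static void suspend_job(void) {}
static void resume_job(void)  {}
static void vacate_job(void)  {}   /* starts checkpointing */
static void kill_job(void)    {}

void handle_owner_return(void)
{
    suspend_job();
    sleep(GRACE_SECONDS);

    if (!owner_active()) {         /* owner left again: keep running here */
        resume_job();
        return;
    }

    vacate_job();                  /* start checkpointing so the job can migrate */
    sleep(GRACE_SECONDS);

    if (!checkpoint_done())
        kill_job();                /* job restarts elsewhere from its last checkpoint */
    /* otherwise the finished checkpoint migrates with the job */
}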
Goals of Checkpointing/Migration
 Packages up a process, including its dynamic state
 Does not require any special considerations on the programmer's part; it is done by a linked-in library
 Provides fault tolerance and reliability
Checkpoint Considerations
 Checkpoints must be scheduled to prevent excess checkpoint server/network traffic
   Example: many jobs submitted as a cluster with a fixed checkpoint interval would all checkpoint at the same time
   Solution: have the checkpoint server schedule checkpoints to avoid excessive network use
 Checkpoints use a transactional model   (see the sketch below)
   A checkpoint will either:
   1. Succeed, or
   2. Fail and roll back to the previous checkpoint
   Only the most recent successful checkpoint is stored
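A minimal sketch of the transactional model, assuming the checkpoint is written to a local file: the new checkpoint goes to a temporary file and is atomically rename()d over the previous one, so a failure at any point leaves the last successful checkpoint intact. The path handling and the write_segments() stub are illustrative assumptions.

#include <stdio.h>

/* Stub standing in for dumping the process segments to the checkpoint file. */
static int write_segments(FILE *f) { (void)f; return 0; }

int commit_checkpoint(const char *ckpt_path)
{
    char tmp_path[512];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", ckpt_path);

    FILE *f = fopen(tmp_path, "wb");
    if (!f)
        return -1;                          /* checkpoint failed; old one still valid */
    if (write_segments(f) != 0) {
        fclose(f);
        remove(tmp_path);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp_path);
        return -1;
    }
    return rename(tmp_path, ckpt_path);     /* atomic commit of the new checkpoint */
}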
Checkpointing--The Job
 Identify segments   (see the sketch below)
   Use the /proc interface to get segment addresses
   Compare segment addresses to known data to determine which segment is which:
     A global variable marks the data segment
     A static function marks the text segment
     The stack pointer and the stack ending address mark the stack segments
     All others are dynamic library text or data
 Most segments can be stored with no special actions needed
   The static text segment is the executable itself and does not need to be stored
 Stack
   Info saved by setjmp (includes registers and CPU state)
 Signal state must also be saved
   For instance, if a signal is pending
[Diagram: the process address space: stack, heap, shared libraries, data, text, Condor library.]
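A C sketch of the segment-identification step, using Linux's /proc/self/maps as a stand-in for the /proc interface the slide mentions; the marker variables, the labels, and the bare setjmp() at the end (which captures the registers a checkpoint would save) are illustrative assumptions.

#include <setjmp.h>
#include <stdio.h>

static int  data_marker;                  /* its address lies in the data segment */
static void text_marker(void) {}          /* its address lies in the text segment */

static jmp_buf ckpt_regs;                 /* registers / CPU state for the checkpoint */

void identify_segments(void)
{
    int stack_marker;                     /* its address lies in the stack segment */
    unsigned long start, end;
    char perms[8], line[256];
    FILE *maps = fopen("/proc/self/maps", "r");

    while (maps && fgets(line, sizeof(line), maps)) {
        if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
            continue;
        const char *label = "dynamic library text or data";
        if ((unsigned long)&data_marker  >= start && (unsigned long)&data_marker  < end)
            label = "data";
        if ((unsigned long)&text_marker  >= start && (unsigned long)&text_marker  < end)
            label = "text (the executable itself, not stored)";
        if ((unsigned long)&stack_marker >= start && (unsigned long)&stack_marker < end)
            label = "stack";
        printf("%lx-%lx %s\n", start, end, label);
    }
    if (maps)
        fclose(maps);

    if (setjmp(ckpt_regs) == 0) {
        /* First return: registers captured; the checkpoint would be written here. */
    }
    /* A later longjmp(ckpt_regs, 1) on restart resumes from this point. */
}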
Restart--I
• A shadow process is created on the machine by the Condor daemon
• Sets up the environment
  • Checks whether files must use remote syscalls or a networked FS
• The shadow creates the segments
• The shadow restores the segments
[Diagram: the address space being rebuilt: heap, shared libraries, data, text, Condor library.]
Restart--II
• File state must be restored
  • lseek sets the file pointers to the correct positions
• Blocked signals must be re-sent
• The stack is restored
• Return to the job
  • Appears like a return from a signal handler
[Diagram: the address space: stack, heap, shared libraries, data, temporary stack, kernel state, text, signals, Condor library.]
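A C sketch of the restart path, assuming a saved-file table recorded at checkpoint time: restore_files() reopens each file at its original descriptor and lseek()s to the saved offset, and resume_execution() longjmp()s back into the restored stack, which to the job looks like returning from a signal handler. The structure and names are assumptions, not the actual Condor library.

#include <fcntl.h>
#include <setjmp.h>
#include <unistd.h>

struct saved_file {
    char  path[256];   /* file name recorded at checkpoint time */
    int   fd;          /* descriptor number the job was using   */
    int   flags;       /* open() flags                          */
    off_t offset;      /* saved file pointer position           */
};

jmp_buf ckpt_regs;     /* register state captured by setjmp() at checkpoint time */

void restore_files(const struct saved_file *files, int n)
{
    for (int i = 0; i < n; i++) {
        int fd = open(files[i].path, files[i].flags);
        if (fd < 0)
            continue;                        /* a real restart would report an error */
        if (fd != files[i].fd) {             /* put it back on its original descriptor */
            dup2(fd, files[i].fd);
            close(fd);
        }
        lseek(files[i].fd, files[i].offset, SEEK_SET);   /* reset the file pointer */
    }
}

void resume_execution(void)
{
    longjmp(ckpt_regs, 1);                   /* jump back into the restored stack */
}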
Flocking--Goals
 Ties groups of Condor pools into, effectively, one large pool
 No change to the CM
 Users see no difference whether a job runs in the flock or in their local pool
 Fault tolerance: a failure in one pool should not affect the rest of the flock
 Allows each group to maintain authority over its own machines
 Having one large pool, instead of a flock, could overload the central manager
Flocking
[Diagram: several Condor pools, each with its own Central Manager (CM) and machines (M), connected to one another over the network through World Machines (WM); a possible job assignment crosses pool boundaries via the World Machines.]
Flocking--Limitations
• The World Machine uses the normal priority system
• Machines can run flock jobs when local capacity is available
• The World Machine cannot be used to actually run jobs
• A World Machine can only present itself as having one set of hardware
• The World Machine can only receive one job per scheduling interval, no matter how many machines are available
Bibliography
 M.L. Powell and B.P. Miller, "Process Migration in DEMOS/MP", 9th Symposium on Operating Systems Principles, Bretton Woods, NH, October 1983, pp. 110-119.
 E. Zayas, "Attacking the Process Migration Bottleneck", 11th Symposium on Operating Systems Principles, Austin, TX, November 1987, pp. 13-24.
 Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June 1988.
 Matt Mutka and Miron Livny, "Scheduling Remote Processing Capacity In A Workstation-Processing Bank Computing System", Proceedings of the 7th International Conference of Distributed Computing Systems, pp. 2-9, September 1987.
 Jim Basney and Miron Livny, "Deploying a High Throughput Computing Cluster", High Performance Cluster Computing, Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999.
 Miron Livny, Jim Basney, Rajesh Raman, and Todd Tannenbaum, "Mechanisms for High Throughput Computing", SPEEDUP Journal, Vol. 11, No. 1, June 1997.
Bibliography II
 Jim Basney, Miron Livny, and Todd Tannenbaum, "High Throughput Computing with Condor", HPCU news, Volume 1(2), June 1997.
 D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters", Journal on Future Generations of Computer Systems, Volume 12, 1996.
 Scott Fields, "Hunting for Wasted Computing Power", 1993 Research Sampler, University of Wisconsin-Madison.
 Jim Basney and Miron Livny, "Managing Network Resources in Condor", Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, Pennsylvania, August 2000, pp. 298-299.
 Jim Basney and Miron Livny, "Improving Goodput by Co-scheduling CPU and Network Capacity", International Journal of High Performance Computing Applications, Volume 13(3), Fall 1999.
 Rajesh Raman and Miron Livny, "High Throughput Resource Management", chapter 13 in The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, California, 1999.
 Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
Communication & Files
 File state is saved by overloaded library calls (remote syscalls)
 Signal state must also be saved
 Communication channels are flushed and stored
 If only one endpoint is checkpointable:
   Communication with only one checkpointed endpoint goes through a 'switchboard process'
     Example: a license server
     The switchboard buffers communication until the application is restarted; this may be a problem if the license server expects prompt replies
Basic Problems
 How are jobs scheduled on remote machines?
 How are jobs moved around?
 How are issues such as portability and user impact dealt with?
Basic Functioning--Condor Pool
[Diagram: the Central Manager selects machines A and B; the job, linked with the Condor library, is sent from the host machine over the network to the remote machines.]
Checkpointing--Process
 Identify segments
   /proc interface
   Compare to known data to determine which segment is which:
     A global variable marks the data segment
     A static function marks the text segment
     The stack pointer and the stack ending address mark the stack segments
     All others are dynamic library text or data
 Most segments can just be written
   Exception: the static text segment is the executable itself
 Stack
   Info saved by setjmp (includes registers and CPU state)
Additional Scheduling
 Scheduling is affected by parallel PVM applications
   (The master runs on the submit machine and is never preempted)
   Opportunistic workers
     Since workers contact the master, this avoids the problem of having two mobile endpoints (workers know where the master is)
 Jobs can be ordered in a directed acyclic graph