Text File | 1989-10-18 | 23KB | 754 lines

.nr si 3n
.he 'CONDOR TECHNICAL SUMMARY''%'
.fo 'Version 4.0.0'''
.+c
.(l C
.sz 14
CONDOR TECHNICAL SUMMARY
.)l
.(l C
Allan Bricker and Michael J. Litzkow
.)l
.sp .5i
.sh 1 "Introduction to the Problem"
.pp
A common computing environment consists of many workstations connected
together by a high speed local area network.
These workstations have grown in power over the past several years,
and if viewed as an aggregate they can represent a significant
computing resource.
However, in many cases, even though these workstations are owned by a
single organization, they are dedicated to the exclusive use of
individuals.
.pp
In examining the usage patterns of the workstations, we find it useful
to identify three
.q typical
types of users.
.q "Type 1"
users are individuals who mostly use their workstations for sending and
receiving mail or preparing papers.
Theoreticians and administrative people often fall into this category.
We identify many software development people as
.q "type 2"
users.
These people are frequently involved in the debugging cycle where they
edit software, compile, then run it, possibly using some kind of
debugger.
This cycle is repeated many times during a typical working day.
Type 2 users sometimes have too much computing capacity on their
workstations, such as when editing, but then during the compilation and
debugging phases they could often use more CPU power.
Finally there are
.q "type 3"
users.
These are people who frequently do large numbers of simulations or
combinatorial searches.
These people are almost never happy with just a workstation, because it
really isn't powerful enough to meet their needs.
Another point is that most type 1 and type 2 users leave their machines
completely idle when they are not working, while type 3 users may keep
their machines busy 24 hours a day.
.pp
.i Condor
is an attempt to make use of the idle cycles from type 1 and 2 users to
help satisfy the needs of the type 3 users.
The
.i condor
software monitors the activity on all the participating workstations in
the local network.
Those machines which are determined to be idle are placed into a
resource pool or
.q "processor bank" .
Machines are then allocated from the bank for the execution of jobs
belonging to the type 3 users.
The bank is a dynamic entity; workstations enter the bank when they
become idle, and leave again when they get busy.
.sh 1 "Design Features"
.np
No special programming is required to use condor.
Condor is able to run normal UNIX\**
.(f
\**UNIX is a trademark of AT&T.
.)f
programs, only requiring the user to relink, not recompile them or
change any code.
.np
The local execution environment is preserved for remotely executing
processes.
Users do not have to worry about moving data files to remote
workstations before executing programs there.
.np
The condor software is responsible for locating and allocating idle
workstations.
Condor users do not have to search for idle machines, nor are they
restricted to using machines only during a static portion of the day.
.np
.q Owners
of workstations have complete priority over their own machines.
Workstation owners are generally happy to let somebody else compute on
their machines while they are out, but they want their machines back
promptly upon returning, and they don't want to have to take special
action to regain control.
Condor handles this automatically.
.np
Users of condor may be assured that their jobs will eventually
complete.
If a user submits a job to condor which runs on somebody else's
workstation, but the job is not finished when the workstation owner
returns, the job will be checkpointed and restarted as soon as possible
on another machine.
.np
Measures have been taken to assure owners of workstations that their
filesystems will not be touched by remotely executing jobs.
.np
Condor does its work completely outside the kernel, and is compatible
with Berkeley 4.2 and 4.3 UNIX kernels and many of their derivatives.
You do not have to run a custom operating system to get the benefits of
condor.
.sh 1 "Limitations"
.np
Only single process jobs are supported, i.e. the fork(2), exec(2), and
similar calls are not implemented.
.np
Signals and signal handlers are not supported, i.e. the signal(3),
sigvec(2), and kill(2) calls are not implemented.
.np
Interprocess communication (IPC) calls are not supported, i.e. the
socket(2), send(2), recv(2), and similar calls are not implemented.
.np
All file operations must be idempotent \(em read-only and write-only
file accesses work correctly, but programs which both read and write
the same file may not.
.np
Each condor job has an associated
.q "checkpoint file"
which is approximately the size of the address space of the process.
Disk space
.b must
be available to store the checkpoint file
.b both
on the
.b submitting
and
.b remote
machines.
.np
Condor does a significant amount of work to prevent security hazards,
but some loopholes are known to exist.
One problem is that condor user jobs are supposed to do only remote
system calls, but this is impossible to guarantee.
User programs are restricted on the remote machine both by running only
as an ordinary user (condor), and by operating in a changeroot'd
directory.
Still, a sufficiently malicious and clever user could cause problems by
doing local system calls on the remote machine.
.np
A different security problem exists for owners of condor jobs who
necessarily give remotely running processes access to their own file
system.
The risk can be greatly reduced by requesting that access only be
granted to a changeroot'd directory in the local file system, but that
does reduce the flexibility of file access for the condor jobs.
See condor(1) for details on how to submit jobs with such a request.
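.pp
One way to see why file operations must be idempotent is to note that a
job restarted from a checkpoint repeats the work it did after that
checkpoint, while the files it wrote through remote system calls are
not rolled back.
The following Python sketch simulates this under stated assumptions: a
dictionary stands in for the file system, and the names
read_modify_write, write_only, uninterrupted, and interrupted are
hypothetical and are not part of condor.

```python
def read_modify_write(files, start, stop):
    """Each step reads a counter and writes it back to the same file."""
    for _ in range(start, stop):
        files["counter"] = files.get("counter", 0) + 1


def write_only(files, start, stop):
    """Each step writes record i at a fixed position and never reads."""
    out = files.setdefault("out", {})
    for i in range(start, stop):
        out[i] = i * i


def uninterrupted(job, steps=6):
    """Run all steps once on a single host."""
    files = {}
    job(files, 0, steps)
    return files


def interrupted(job, steps=6, checkpoint_at=2, vacate_at=4):
    """Run until vacated, then restart from an earlier checkpoint.

    The files dictionary survives the restart unchanged, so steps
    checkpoint_at..vacate_at-1 are applied to it twice.
    """
    files = {}
    job(files, 0, vacate_at)          # first host: steps 0..3, then vacated
    job(files, checkpoint_at, steps)  # second host: steps 2..5 from checkpoint
    return files


rmw_clean = uninterrupted(read_modify_write)   # {'counter': 6}
rmw_replayed = interrupted(read_modify_write)  # {'counter': 8}
wo_clean = uninterrupted(write_only)
wo_replayed = interrupted(write_only)          # same contents as wo_clean
```

.lp
In this simulation the write-only job produces the same file with or
without the interruption, while the job that reads and writes the same
file counts the repeated steps twice.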
.sh 1 "Overview of Condor Software"
.pp
Condor user programs do
.q "remote system calls"
back to the machine from which they were submitted.
Remote system calls provide user programs with the illusion that they
are operating in the local environment and give the user the
flexibility of running programs written for the normal UNIX environment
on condor.
Programs are converted to using remote system calls simply by relinking
with a special library.
The remote system call mechanism is described in Section 5.
.pp
Condor user programs are constructed in such a way that they can be
checkpointed and restarted at will.
This assures users that their jobs will complete, even if they are
interrupted during execution by the return of a hosting workstation's
owner.
Checkpointing is also implemented by linking with the special library.
The checkpointing mechanism is described more fully in Section 6.
.pp
Condor includes control software consisting of two daemons which run on
each member of the condor pool, and two other daemons which run on a
single machine called the
.b "central manager" .
This software automatically locates and releases
.q "target machines"
and manages the queue of jobs waiting for condor resources.
The control software is described in Section 7.
.sh 1 "Remote System Calls"
.pp
To better understand how the condor remote system calls work, it is
appropriate to quickly review how normal UNIX system calls work.
Figure 1 illustrates the normal UNIX system call mechanism.
The user program is linked with a standard library called the
.q "C library" .
This is true even for programs written in languages other than C.
The C library contains routines, often referred to as
.q "system call stubs" ,
which cause the actual system calls to happen.
What the stubs really do is push the system call number and system call
arguments onto the stack, then execute an instruction which causes a
trap to the kernel.
When the kernel trap handler is called, it reads the system call number
and arguments, and performs the system call on behalf of the user
program.
The trap handler will then place the system call return value in a well
known register or registers, and return control to the user program.
The system call stub then returns the result to the calling process,
completing the system call.
.(b
.GS C
file 01_syscall_review.grn
3 8
4 12
height 2.5
.GE
.)b
.pp
Figure 2 illustrates how this mechanism has been altered by condor to
implement remote system calls.
Whenever condor is executing a user program remotely, it also runs a
.q shadow
program on the initiating host.
The
.b shadow
acts as an agent for the remotely executing program in doing system
calls.
Condor user programs are linked with a special version of the C
library.
The special version contains all of the functions provided by the
normal C library, but the system call stubs have been changed to
accomplish remote system calls.
The remote system call stubs package up the system call number and
arguments and send them to the
.b shadow
using the network.
The
.b shadow ,
which is linked with the normal C library, then executes the system
call on behalf of the remotely running job in the normal way.
The
.b shadow
then packages up the results of the system call and sends them back to
the system call stub in the special C library on the remote machine.
The remote system call stub then returns its result to the calling
procedure, which is unaware that the call was done remotely rather than
locally.
Note that the
.b shadow
runs with its UID set to the owner of the remotely running job so that
it has the correct permissions into the local file system, and the
remotely running job runs with its UID set to
.q condor.
Condor is an ordinary user on the remote system, and thus has no
special privileges into that file system.
The remotely running user program runs in a
.q changeroot'd
environment to further protect the owner of the remote machine from
unwanted file system accesses by the foreign job it is hosting.
.(b
.GS C
file 02_syscall_remote.grn
3 8
4 12
height 3
.GE
.)b
.sh 1 Checkpointing
.pp
To checkpoint a UNIX process, several things must be preserved.
The text, data, stack, and register contents are needed, as well as
information about what files are open, where they are seek'd to, and
what mode they were opened in.
The data and stack are available in a core file, while the text is
available in the original executable.
Condor gathers the information about currently open files through the
special C library.
In condor's special C library the system call stubs for
.q open ,
.q close ,
and
.q dup
not only do those things remotely, but they also record which files are
opened in what mode, and which file descriptors correspond to which
files.
.pp
Condor causes a running job to checkpoint by sending it a signal.
When the program is linked, a special version of
.q crt0
is included which sets up CKPT() as that signal handler.
When CKPT() is called, it updates the table of open files by seeking
each one to the current location and recording the file position.
Next a setjmp(3) is executed to save key register contents in a global
data area, then the process sends itself a signal which results in a
core dump.
The condor software then combines the original executable file and the
core file to produce a
.q checkpoint
file (figure 3).
The checkpoint file is itself executable.
.pp
When the checkpoint file is restarted, it starts from the crt0 code
just like any UNIX executable, but again this code is special, and it
will set up the restart() routine as a signal handler with a special
signal stack, then send itself that signal.
When restart() is called, it will operate in the temporary stack area
and read the saved stack in from the checkpoint file, reopen and
reposition all files from the saved file state information, and execute
a longjmp(3) back to CKPT().
When the restart routine returns, it does so with respect to the
restored stack, and CKPT() returns to the routine which was active at
the time of the checkpoint signal, not crt0.
To the user code, checkpointing looks exactly as if a signal handler
had been called, and restarting from a checkpoint looks like a return
from that signal handler.
.(b
.GS C
file 03_checkpoint.grn
3 8
4 12
height 3
.GE
.)b
.sh 1 "Control Software"
.pp
Each machine in the condor pool runs two daemons, the
.b schedd
and the
.b startd .
In addition, one machine runs two other daemons called the
.b collector
and the
.b negotiator .
While the
.b collector
and the
.b negotiator
are separate processes, they work closely together, and for purposes of
this discussion can be considered one logical process called the
.b "central manager" .
The
.b "central manager"
has the job of keeping track of which machines are idle, and allocating
those machines to other machines which have condor jobs to run.
On each machine the
.b schedd
maintains a queue of condor jobs, and negotiates with the
.b "central manager"
to get permission to run those jobs on remote machines.
The
.b startd
determines whether its machine is idle, and also is responsible for
starting and managing foreign jobs which it may be hosting.
On machines running the X window system, an additional daemon, the
.b kbdd ,
will periodically inform the
.b startd
of the keyboard and mouse
.q "idle time" .
Periodically the
.b startd
will examine its machine, and update the
.b "central manager"
on its degree of "idleness".
Also periodically the
.b schedd
will examine its job queue and update the
.b "central manager"
on how many jobs it wants to run and how many jobs it is currently
running (figure 4).
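.pp
The periodic updates described above imply that the central manager
holds a small amount of state per machine.
The Python sketch below is one possible shape for that state, not the
actual condor data structures: the class and method names are
hypothetical, and the startd's degree of idleness is reduced to a
boolean for brevity.

```python
from dataclasses import dataclass


@dataclass
class MachineRecord:
    idle: bool = False      # last report from the machine's startd
    jobs_wanted: int = 0    # last report from the machine's schedd
    jobs_running: int = 0   # last report from the machine's schedd


class CentralManager:
    """Keeps the most recent report from each machine in the pool."""

    def __init__(self):
        self.pool = {}

    def _record(self, name):
        # Create the record on first contact from a machine.
        return self.pool.setdefault(name, MachineRecord())

    def startd_update(self, name, idle):
        self._record(name).idle = idle

    def schedd_update(self, name, jobs_wanted, jobs_running):
        rec = self._record(name)
        rec.jobs_wanted = jobs_wanted
        rec.jobs_running = jobs_running

    def idle_machines(self):
        return sorted(n for n, r in self.pool.items() if r.idle)

    def machines_with_work(self):
        return sorted(n for n, r in self.pool.items() if r.jobs_wanted > 0)


manager = CentralManager()
manager.startd_update("machine b", idle=True)
manager.startd_update("machine c", idle=False)
manager.schedd_update("machine c", jobs_wanted=3, jobs_running=0)
```

.lp
Only counts are stored for each machine; the job queue itself stays
with the schedd that owns it.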
.(b
.GS C
file 04_control_A.grn
3 8
4 12
height 3
.GE
.)b
.pp
At some point the
.b "central manager"
may learn that
.i "machine b"
is idle, and decide that
.i "machine c"
should execute one of its jobs remotely on
.i "machine b" .
The
.b "central manager"
will then contact the
.b schedd
on
.i "machine c"
and give it
.q permission
to run a job on
.i "machine b" .
The
.b schedd
on
.i "machine c"
will then select a job from its queue and spawn off a
.b shadow
process to run it.
The
.b shadow
will then contact the
.b startd
on
.i "machine b"
and tell it that it would like to run a job.
If the situation on
.i "machine b"
hasn't changed since the last update to the
.b "central manager" ,
.i "machine b"
will still be idle, and will respond with an OK.
The
.b startd
on
.i "machine b"
then spawns a process called the
.b starter .
It's the
.b starter's
job to start and manage the remotely running job (figure 5).
.(b
.GS C
file 05_control_B.grn
3 8
4 12
height 3
.GE
.)b
.pp
The
.b shadow
on
.i "machine c"
will transfer the checkpoint file to the
.b starter
on
.i "machine b" .
The
.b starter
then sets a timer and spawns off the remotely running job from
.i "machine c"
(figure 6).
The
.b shadow
on
.i "machine c"
will handle all system calls for the job.
When the
.b starter's
timer expires it will send the user job a checkpoint signal, causing it
to save its file state and stack, then dump core.
The
.b starter
then builds a new version of the checkpoint file which is stored
temporarily on
.i "machine b" .
The
.b starter
restarts the job from the new checkpoint file, and the cycle of execute
and checkpoint continues.
At some point, either the job will finish, or
.i "machine b's"
user will return.
If the job finishes, the job's owner is notified by mail, and the
.b starter
and
.b shadow
clean up.
If
.i "machine b"
becomes busy, the
.b startd
on
.i "machine b"
will detect that either by noting recent activity on one of the tty's
or pty's, or by the rising load average.
When the
.b startd
on
.i "machine b"
detects this activity, it will send a
.q suspend
signal to the
.b starter ,
and the
.b starter
will temporarily suspend the user job.
This is because frequently the owners of machines are active for only a
few seconds, then become idle again.
This would be the case if the owner were just checking to see if there
were new mail, for example.
If
.i "machine b"
remains busy for a period of about 5 minutes, the
.b startd
there will send a
.q vacate
signal to the
.b starter .
In this case, the
.b starter
will abort the user job and return the latest checkpoint file to the
.b shadow
on
.i "machine c" .
If the job had not run long enough on
.i "machine b"
to reach a checkpoint, the job is just aborted, and will be restarted
later from the most recent checkpoint on
.i "machine c" .
Notice that the
.b starter
checkpoints the condor user job periodically rather than waiting until
the remote workstation's owner wants it back.
Checkpointing, and in particular core dumping, is an I/O intensive
activity which we avoid doing when the hosting workstation's owner is
active.
.(b
.GS C
file 06_control_C.grn
3 8
4 12
height 3
.GE
.)b
.sh 1 "Control Expressions"
.pp
The condor control software is driven by a set of powerful
.q "control expressions" .
These expressions are read from the file
.q ~condor/condor_config
on each machine at run time.
It is often convenient for many machines of the same type to share
common control expressions, and this may be done through a fileserver.
To allow flexibility for control of individual machines, the file
.q ~condor/condor_config.local
is provided, and expressions defined there take precedence over those
defined in condor_config.
Following are examples of a few of the more important condor control
expressions with explanations.
See condor_config(5) for a detailed description of all the control
expressions.
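.pp
The precedence rule between the two files can be pictured as a layered
lookup.
The Python sketch below assumes each file has already been parsed into
pairs of expression name and expression text; the function name
effective_config is hypothetical, the expression names are borrowed
from the examples in the following section, and the local value 0.5 is
purely illustrative.

```python
def effective_config(shared, local):
    """Merge two parsed config files; entries in `local` win.

    Each result value is a pair (expression text, file it came from).
    """
    merged = {name: (text, "condor_config") for name, text in shared.items()}
    merged.update(
        {name: (text, "condor_config.local") for name, text in local.items()}
    )
    return merged


shared = {"BackgroundLoad": "0.3", "StartIdleTime": "15 * $(MINUTE)"}
local = {"BackgroundLoad": "0.5"}  # this machine tolerates more load
config = effective_config(shared, local)
```

.lp
Here BackgroundLoad is taken from condor_config.local, while
StartIdleTime, which the local file does not define, still comes from
the shared condor_config.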
.sh 2 "Starting Foreign Jobs"
.pp
This set of expressions is used by the
.b startd
to determine when to allow a foreign job to begin execution.
.ta 15n
.(l
BackgroundLoad = 0.3
StartIdleTime = 15 * $(MINUTE)
CPU_Idle = LoadAvg <= $(BackgroundLoad)
START : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)
.)l
.lp
This example of the START expression specifies that to begin execution
of a foreign job the load average must be no greater than 0.3, and
there must have been no keyboard activity for more than 15 minutes.
.lp
Other expressions are used to determine when to suspend, resume, and
abort foreign jobs.
.sh 2 "Prioritizing Jobs"
.pp
The
.b schedd
must prioritize its own jobs and negotiate with the
.b "central manager"
to get permission to run them.
It uses a control expression to assign priorities to its local jobs.
.(l
PRIO : (UserPrio * 10) + $(Expanded) - (QDate / 1000000000.0)
.)l
.lp
.q UserPrio
is a number defined by the job's owner in a similar spirit to the UNIX
.q nice
command.
.q Expanded
will be 1 if the job has already completed some execution, and 0
otherwise.
This is an issue because expanded jobs require more disk space than
unexpanded ones.
.q QDate
is the UNIX time when the job was submitted.
The constants are chosen so that
.q UserPrio
will be the major criterion,
.q Expanded
will be less important, and
.q QDate
will be the minor criterion in determining job priority.
.q UserPrio ,
.q Expanded ,
and
.q QDate
are variables known to the
.b schedd
which it determines for each job before applying the PRIO expression.
.sh 2 "Prioritizing Machines"
.pp
The
.b "central manager"
does not keep track of individual jobs on the member machines.
Instead it keeps track of how many jobs a machine wants to run, and how
many it is running at any particular time.
This keeps the information that must be transmitted between the
.b schedd
and the
.b "central manager"
to a minimum.
The
.b "central manager"
has the job of prioritizing the machines which want to run jobs; it can
then give permission to the
.b schedd
on high priority machines and let them make their own decision about
what jobs to run.
.(l
UPDATE_PRIO : Prio + Users - Running
.)l
.lp
Periodically the
.b "central manager"
will apply this expression to all of the machines in the pool.
The priority of each machine will be incremented by the number of
individual users on that machine who have jobs in the queue, and
decremented by the number of jobs that machine is already executing
remotely.
Machines which are running lots of jobs will tend to have low
priorities, and machines which have jobs to run, but can't run them,
will accumulate high priorities.
.sh 1 "Acknowledgements"
.pp
This project is based on the idea of a
.q "processor bank" ,
which was introduced by Maurice Wilkes in connection with his work on
the Cambridge Ring.\**
.(f
\**Wilkes, M. V., Invited Keynote Address, 10th Annual International
Symposium on Computer Architecture, June 1983.
.)f
.pp
We would like to thank Don Neuhengen and Tom Virgilio for their
pioneering work on the remote system call implementation; Matt Mutka
and Miron Livny for first convincing us that a general checkpointing
mechanism could be practical and for ideas on how to distribute control
and prioritize the jobs; and David DeWitt and Marvin Solomon for their
continued guidance and support throughout this project.
.pp
This research was supported by the National Science Foundation under
grants MCS81-05904 and DCR-8512862 and by a Digital Equipment
Corporation External Research Grant.
.sh 1 "Copyright Information"
.lp
Copyright 1986, 1987, 1988, 1989 University of Wisconsin
.lp
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of the University of
Wisconsin not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
The University of Wisconsin makes no representations about the
suitability of this software for any purpose.
It is provided "as is" without express or implied warranty.
.lp
THE UNIVERSITY OF WISCONSIN DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS.
IN NO EVENT SHALL THE UNIVERSITY OF WISCONSIN BE LIABLE FOR ANY
SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.lp
.ta 10n
Authors: Allan Bricker and Michael J. Litzkow,
.br
University of Wisconsin, Computer Sciences Dept.
.sh 1 "Bibliography"
.np
Mutka, M. and Livny, M.
.q "Profiling Workstations' Available Capacity For Remote Execution" .
.i
Proceedings of Performance-87, The 12th IFIP W.G. 7.3 International
Symposium on Computer Performance Modeling, Measurement and Evaluation.
.r
Brussels, Belgium, December 1987.
.np
Litzkow, M.
.q "Remote Unix \(em Turning Idle Workstations Into Cycle Servers" .
.i
Proceedings of the Summer 1987 Usenix Conference.
.r
Phoenix, Arizona.
June 1987.
.np
Mutka, M.
.i
Sharing in a Privately Owned Workstation Environment.
.r
Ph.D. Th., University of Wisconsin, May 1988.
.np
Litzkow, M., Livny, M. and Mutka, M.
.q "Condor \(em A Hunter of Idle Workstations" .
.i
Proceedings of the 8th International Conference on Distributed
Computing Systems.
.r
San Jose, Calif.
June 1988.
.np
Bricker, A. and Litzkow, M.
.q "Condor Installation Guide" .
May 1989.
.np
Bricker, A. and Litzkow, M.
Unix manual pages: condor_intro(1), condor(1), condor_q(1),
condor_rm(1), condor_status(1), condor_summary(1), condor_config(5),
condor_control(8), and condor_master(8).
May 1989.