- Submitted-by: breynolds@UCSD.EDU (Bill Reynolds)
-
- I originally posted this to comp.unix.questions. It was then
- recommended to me that I post here as well.
-
- >Greetings,
- > We are a computational physics group running a network of Sun
- >and SGI workstations. We often have long-running jobs on many of our
- >machines. This leads to problems when a machine that has a job in the
- >third day of a five-day run needs to be taken down. What we would like
- >is a routine to checkpoint a job to a disk file for later reloading
- >into memory. I've looked at undump, but it isn't adequate; we need to
- >restart the job where it was interrupted. I've also looked at Condor,
- >but it seems to be a fly-with-a-sledgehammer type of solution. I'm
- >wondering if there are any simple Unix/Sun/SGI utilities to do
- >checkpointing. (I know that such facilities exist for Crays.)
-
- I would also like to add that such a facility would have to support
- Fortran and would have to be simple enough that someone with only a
- background in scientific computing could use it (i.e., no system
- calls, no calls to C routines from Fortran, etc.). It has also been
- suggested that I modify the code to undump. I find this a daunting
- task (any takers?). (By the way, I have not actually gotten an undump
- working for the Sun or the SGI.)
-
- --
- _______________________________________________________________________
- | Bill Reynolds
- | bill@inls1.ucsd.edu
-
- [ First of all, there is Dan Bernstein's Poor Man's Checkpointing Package,
- posted to alt.sources (I think) a month or three ago. Also, one of
- the POSIX subgroups specifies checkpointing, that being the main reason
- I'm posting this. I will let others (who are likely to be more
- knowledgeable about it) comment further, if they wish. -- mod ]
-
- Volume-Number: Volume 23, Number 47
-
-