- Path: sparky!uunet!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!ames!agate!darkstar.UCSC.EDU!osr
- From: ymwang@crhc.uiuc.edu (Yi-Min Wang)
- Newsgroups: comp.os.research
- Subject: Summary: Checkpointing for Parallel and Distributed Systems
- Message-ID: <14pmr6INNniv@darkstar.UCSC.EDU>
- Date: 24 Jul 92 19:48:54 GMT
- Organization: Center for Reliable and High-Performance Computing, UIUC
- Lines: 99
- Approved: comp-os-research@ftp.cse.ucsc.edu
- NNTP-Posting-Host: ftp.cse.ucsc.edu
- Originator: osr@ftp
-
- I am making an effort to construct the big picture for the area
- of "checkpointing and rollback recovery for parallel and distributed
- systems". The following list contains 26 papers and in no way covers
- the whole area. Any additions, corrections and comments are welcome
- and greatly appreciated.
-
- My personal point of view starts from a general model with independent
- (uncoordinated) checkpointing for possibly non-deterministic execution,
- in which the checkpointing pattern is independent of the communication
- pattern. Such a model can suffer from the domino effect. Other
- approaches are then classified according to how they handle
- the domino effect: for example, checkpoint coordination,
- communication-induced checkpointing and deterministic execution.
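
To make the domino effect concrete, here is a minimal sketch of rollback propagation under independent checkpointing. It is an illustration only and is not taken from any of the papers listed below. The names `Message`, `recovery_line`, `latest` and `failed` are hypothetical. It assumes that checkpoint k of a process starts its interval k, that a surviving process's current state acts as a virtual checkpoint one past its latest real one, and that execution is non-deterministic, so a receiver must undo any message whose send has been undone. Messages in transit across the recovery line are ignored.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_interval: int   # sent after the sender's checkpoint with this index
    receiver: str
    recv_interval: int   # received after the receiver's checkpoint with this index


def recovery_line(latest: Dict[str, int],
                  messages: List[Message],
                  failed: str) -> Dict[str, int]:
    """Return, per process, the checkpoint index it must restart from."""
    # The failed process restarts from its latest checkpoint; the others
    # tentatively keep their current state (virtual checkpoint latest + 1).
    line = {p: (k if p == failed else k + 1) for p, k in latest.items()}
    changed = True
    while changed:  # terminates: indices only decrease and never go below 0
        changed = False
        for m in messages:
            send_undone = line[m.sender] <= m.send_interval
            recv_kept = line[m.receiver] > m.recv_interval
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back to before the receive.
                line[m.receiver] = m.recv_interval
                changed = True
    return line


# Two processes exchange messages so that no pair of checkpoints is consistent.
messages = [
    Message("P", 0, "Q", 0),
    Message("Q", 1, "P", 0),
    Message("P", 1, "Q", 1),
    Message("Q", 2, "P", 1),
    Message("P", 2, "Q", 2),
]
print(recovery_line({"P": 2, "Q": 2}, messages, failed="P"))  # {'P': 0, 'Q': 0}
```

In this example a single failure of P drives both processes back to their initial states, although each had taken two further checkpoints. The categories below can be read as different ways of keeping this fixed-point computation from cascading.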
-
- =====================================================================
- (0) >>> The general model described above <<<
-
- (0.1) Tsuruoka, Kaneko and Nishihara [SRDSDS 1981]
- (0.2) Bhargava and Lian [SRDS 1988]
- (0.3) Wang and Fuchs [SRDS 1992]
-
- (1) >>> Checkpoint coordination <<<
-
- (1.1) Tamir and Sequin [ICPP 1984]
- (1.2) Chandy and Lamport [ACM-TOCS 1985]
- (1.3) Lai and Yang [IPL 1987]
- (1.4) Koo and Toueg [IEEE-TSE 1987]
-
- (1.5) >>> Multicomputer <<<
- (1.5.1) Li, Naughton and Plank [SRDS 1991]
-
- (1.6) >>> Clock synchronization and/or bounded message delay <<<
- (1.6.1) Ramanathan and Shin [SRDS 1988]
- (1.6.2) Cristian and Jahanian [SRDS 1991]
- (1.6.3) Tong, Kain and Tsai [IEEE-TPDS 1992]
- (1.6.4) Long and Fuchs [Submitted 1992]
-
- (2) >>> Communication-induced (message-triggered) checkpointing <<<
-
- (2.1) >>> Extra checkpoints at the sender side <<<
- (2.1.1) Wu and Fuchs [IEEE-TC 1990]
-
- (2.2) >>> Extra checkpoints at the receiver side <<<
- (2.2.1) Briatico, Ciuffoletti and Simoncini [SRDSDS 1984]
- (2.2.2) Kim, You and Abouelnaga [FTCS 1986]
- (2.2.3) Venkatesh, Radhakrishnan and Li [IPL 1987]
-
- (3) >>> Piecewise deterministic execution (or the capability of
- detecting/recording/replaying internal non-deterministic events) <<<
-
- (3.1) >>> Receiver-based message logging <<<
-
- (3.1.1) >>> Synchronous (pessimistic) logging <<<
- (3.1.1.1) Borg, Baumbach and Glazer [SOSP 1983]
- (3.1.1.2) Powell and Presotto [SOSP 1983]
- (3.1.1.3) Borg et al [ACM-TOCS 1989]
-
- (3.1.2) >>> Asynchronous (optimistic) logging <<<
- (3.1.2.1) Strom and Yemini [ACM-TOCS 1985]
- (3.1.2.2) Sistla and Welch [SPDC 1989]
- (3.1.2.3) Johnson and Zwaenepoel [JA 1990]
- (3.1.2.4) Juang and Venkatesan [ICDCS 1991]
-
- (3.2) >>> Sender-based message logging <<<
- (3.2.1) Johnson and Zwaenepoel [FTCS 1987]
- (3.2.2) Strom, Bacon and Yemini [FTCS 1988]
- (3.2.3) Elnozahy and Zwaenepoel [IEEE-TC 1992]
-
-
- SRDSDS : IEEE Symp. on Reliability in Distributed Software
- and Database Systems
- SRDS : IEEE Symp. on Reliable Distributed Systems
- ICPP : Intl. Conf. on Parallel Processing
- ACM-TOCS : ACM Trans. on Computer Systems
- IPL : Information Processing Letters
- IEEE-TSE : IEEE Trans. on Software Engineering
- IEEE-TPDS: IEEE Trans. on Parallel and Distributed Systems
- IEEE-TC : IEEE Trans. on Computers
- FTCS : IEEE Fault-Tolerant Computing Symposium
- SOSP : ACM Symp. on Operating Systems Principles
- SPDC : ACM Symp. on Principles of Distributed Computing
- JA : Journal of Algorithms
- ICDCS : IEEE Intl. Conf. on Distributed Computing Systems
-
- =====================================================================
-
- Thanks,
-
- Yi-Min
-
- --------------------------------------------------
- Yi-Min Wang
- ymwang@crhc.uiuc.edu
-
- Center for Reliable and High-Performance Computing
- Coordinated Science Laboratory
- University of Illinois at Urbana-Champaign
-
-
-