- Newsgroups: comp.sys.isis
- Path: sparky!uunet!srvr1.engin.umich.edu!batcomputer!cornell!ken
- From: ken@cs.cornell.edu (Ken Birman)
- Subject: Lazy replication
- Message-ID: <1993Jan22.165709.28909@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- Date: Fri, 22 Jan 1993 16:57:09 GMT
- Lines: 99
-
- I've gotten a few inquiries about the comparison with Isis in a recent
- TOCS paper by Rivka Ladin and some others, titled Providing High Availability
- using Lazy Replication. The paper appeared in the Nov. 1992 issue of
- ACM TOCS, which was just sent out.
-
- This is a nice paper and describes some very interesting work, but the
- comparison with Isis is slightly inaccurate in some ways. Also, I think
- the paper uses unusually strong language in some of these comparisons,
- but that's their decision; they wrote it. (By the way, I had no role in
- handling or reviewing this paper at TOCS.)
-
- - The paper says that Isis sends more messages than the LR scheme
- when a client talks to a group. Actually, this seems to depend on
- the value of a parameter called K in their scheme, and the replication
- approach you use in our case. It's true that in Isis we normally
- replicate data among all the members of a group, so that each member
- has a valid local copy, and in the LR scheme you don't do this. But,
- if you did, the costs of the two methods would then be identical.
-
- So, the real point is not so much that one scheme is fundamentally
- more expensive than the other, but rather that we make a design
- choice in the way we replicate data for Isis, which they make differently
- in the LR approach.
-
- - Timestamp overhead: well, yes, if you want multi-group causality,
- which LR doesn't guarantee, we do add timestamps, or we delay when switching
- from group to group. But, we also compress timestamps (in Horus),
- which makes them quite small, and use other tricks to try to avoid
- putting timestamps in messages unless we need to. The evidence is
- that timestamp overhead can be kept to something very minor.
-
- So, again, the real point seems to be a design choice: Isis, by default,
- provides an ordering property that LR doesn't provide, and this costs
- something, although it also gives you something. If you don't want
- Isis to provide multigroup causality you can tell it so and the cost
- goes away -- in Isis V3.0 this is the "O" option, and in Horus, you
- will do it with something called a causal domain. In the case where
- Isis is asked to act like LR, I think the overhead is roughly the same.
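
To make the timestamp discussion concrete, here is a minimal sketch of the textbook vector-timestamp rule for causal delivery. It is an illustration only, not the Isis or Horus code: the names `can_deliver` and `Receiver` are hypothetical, and it leaves out the timestamp compression and multi-group handling mentioned above.

```python
from typing import List, Tuple


def can_deliver(local: List[int], msg_ts: List[int], sender: int) -> bool:
    """Textbook causal-delivery test (hypothetical helper, not an Isis API).

    local  : messages delivered so far from each process
    msg_ts : vector timestamp carried by the incoming message
    sender : index of the sending process
    """
    for k, (have, need) in enumerate(zip(local, msg_ts)):
        if k == sender:
            if need != have + 1:  # must be the next message from the sender
                return False
        elif need > have:  # depends on a message not yet delivered here
            return False
    return True


class Receiver:
    """Buffers messages until their causal predecessors have been delivered."""

    def __init__(self, n: int) -> None:
        self.clock: List[int] = [0] * n
        self.pending: List[Tuple[int, List[int], str]] = []
        self.delivered: List[str] = []

    def receive(self, sender: int, ts: List[int], body: str) -> None:
        self.pending.append((sender, ts, body))
        progress = True
        while progress:  # keep draining until nothing else becomes deliverable
            progress = False
            for item in list(self.pending):
                s, t, b = item
                if can_deliver(self.clock, t, s):
                    self.pending.remove(item)
                    self.clock[s] = t[s]
                    self.delivered.append(b)
                    progress = True
```

The per-message cost in this sketch is one integer per process, which is the overhead in question; dropping the causal guarantee removes the vector entirely.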
-
- - The observation about delays during view changes is correct; Isis V3.0
- does delay in this way. Talking to the Transis people (Yair Amir)
- it became clear to us that we don't need to delay at this stage of
- our flush protocol -- the correctness proof remains valid even if
- we just leave out the delay(!), and Robbert van Renesse coded the
- Horus implementation of the flush to use this trick. Good idea on
- their part, and imitation is the sincerest flattery.
-
- So: good point, we fixed our protocol to do the same thing when Yair
- Amir suggested this. I was not aware that LR was doing this too,
- but it makes a lot of sense. A shame that Pat and Andre and I didn't
- notice that we weren't making any real use of the delay in the proof
- of correctness...
-
- - Network partitions: this is a bit more complicated. The LR scheme
- logs updates and they have a notion of group member in which the
- same process may come back after a crash and you resume talking to
- it by replaying logged messages. Isis, on the other hand, makes
- such a process "rejoin" the group and basically forces it to restore
- its state from scratch.
-
- With the Isis spooling tool, you could get the same replay behavior
- if you like. We don't normally do this, though. If a partition is
- a real risk, we prefer to use the long-haul tool over the flakey link
- and run Isis separately on both sides. A design preference that
- makes life simpler for us.
-
- Anyhow, LR and Psync both have this spooling scheme, so yes, they
- both "tolerate" partitions, in a sense. But, none of the three systems
- actually can allow updates in an unrestricted way while the partition
- is present, and in fact, you can build exactly the same application
- in any of these schemes, Isis, LR, or Psync. I can't see why the
- performance would be expected to be different, although the issue
- here is one of engineering -- how well our logging scheme works, etc.
- I assume that any of these systems would be disk-IO limited in this
- mode.
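
The two recovery styles contrasted above can be sketched as follows. This is a toy model under loud assumptions: `Replica`, `replay`, and `rejoin` are hypothetical names, updates carry a simple sequence number, and nothing here reflects the actual LR gossip protocol, the Psync implementation, or the Isis spooler interface.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, str]  # (sequence number, key, value); assumed format


class Replica:
    """Toy replica holding a key/value state (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.applied = 0  # sequence number of the last update applied

    def apply(self, seq: int, key: str, value: str) -> None:
        if seq <= self.applied:
            return  # already seen; replay may resend old updates
        self.state[key] = value
        self.applied = seq


def replay(log: List[Update], replica: Replica) -> None:
    """Log-and-replay recovery: the same process catches up from logged updates."""
    for seq, key, value in log:
        replica.apply(seq, key, value)


def rejoin(current: Replica, replica: Replica) -> None:
    """Rejoin recovery: discard local state and copy it from a current member."""
    replica.state = dict(current.state)
    replica.applied = current.applied
```

Either path leaves the recovering member with the same state; the difference is whether the cost is paid in logging and replay or in a state transfer.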
-
- To summarize: the LR paper is a very nice paper and describes a good
- piece of work, and our work has been influenced in several ways by
- this work. But, I doubt that either system is particularly faster
- than the other. I think the contrast is more interesting at the level
- of design choices that were made, and decisions about what the default
- behaviors of the systems should be -- these are quite different, and
- they do have performance, functionality, and user-transparency implications.
- We have tended to err on the side of making Isis more costly but simpler
- to use. If people are routinely forced to turn off Isis ordering because
- of this cost, perhaps the LR design point would make more sense. However,
- I am not aware of any widespread trend along these lines.
-
- Frankly, I think the main lesson I have learned in 3 years of struggling
- with speed issues in Isis is that very little about performance has
- anything to do with which protocol you use. The real problems are flow
- control, where you run relative to UNIX and to the devices, memory
- management, when you send acks, when you retransmit dups, whether UNIX
- is losing packets... all that grungy stuff. At least, that's my insight.
- Not a very deep insight, unfortunately!
- --
- Kenneth P. Birman E-mail: ken@cs.cornell.edu
- 4105 Upson Hall, Dept. of Computer Science TEL: 607 255-9199 (office)
- Cornell University Ithaca, NY 14853 (USA) FAX: 607 255-4428
-