- Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!news-out.internetmci.com!newsfeed.internetmci.com!164.67.42.145!awabi.library.ucla.edu!128.32.155.1!agate!news.ucsc.edu!osr
- From: bos@serpentine.com (Bryan O'Sullivan)
- Newsgroups: comp.os.research,comp.answers,news.answers
- Subject: Comp.os.research: Frequently answered questions [3/3: l/m 13 Aug 1996]
- Followup-To: poster
- Date: 1 Nov 1997 10:00:34 GMT
- Organization: Polymorphous Thaumaturgy
- Lines: 552
- Approved: comp-os-research@cse.ucsc.edu, news-answers-request@mit.edu
- Message-ID: <63euk2$iv@darkstar.ucsc.edu>
- Reply-To: os-faq@cse.ucsc.edu
- NNTP-Posting-Host: ftp.cse.ucsc.edu
- Summary: frequent topics of discussion on the operating systems research group
- Originator: osr@cse.ucsc.edu
- Xref: senator-bedfellow.mit.edu comp.os.research:6840 comp.answers:28751 news.answers:115889
-
- Archive-name: os-research/part3
- Version: $Revision: 1.3 $
- Posting-Frequency: monthly
- Last-Modified: Tue Aug 13 21:03:20 1996
- URL: http://www.serpentine.com/~bos/os-faq/
-
- Answers to frequently asked questions
- for comp.os.research: part 3 of 3
-
- Copyright (C) 1994--1996
- Bryan O'Sullivan
-
-
-
- TABLE OF CONTENTS
-
-
- 1. Distributed systems
- 1.1. What is the current status of the (insert name) project?
- 1.2. How do approaches to load balancing differ?
- 1.3. Fault tolerance in distributed systems
- 1.4. Naming in distributed systems
- 1.5. Distributed shared memory
- 1.5.1. Data consistency
- 1.5.1.1. Strictly consistent systems
- 1.5.1.2. Relaxing consistency
- 1.5.1.3. Application-specific coherence
- 1.5.2. Access synchronisation
- 1.5.3. Transfer and caching granularity
- 1.5.4. Address space structure
- 1.5.5. Fault tolerance
- 1.5.6. A brief bibliography on distributed shared memory
- 1.6. What have we learned?
-
- 2. Needful things
-
-
-
- ------------------------------
- Subject: [1] Distributed systems
- From: Distributed systems
-
- A great deal of the high-profile research carried out in operating
- systems these days deals with distributed computing. Not
- surprisingly, discussions of distributed systems make up a large
- amount of the traffic on comp.os.research.
-
- ------------------------------
- Subject: [1.1] What is the current status of the (insert name) project?
- From: Distributed systems
-
- See the section on `available software' for information on
- distributions of some of the systems mentioned here.
-
- - The Amoeba project is still going. There are roughly 20 people
- working on it, but most of these are no longer kernel hackers. They
- are working on using it for parallel programming, wide-area
- distributed systems, and other things. Amoeba is used in over 100
- universities at the moment, and is also used at commercial
- institutions.
-
- - Brazil is the new research operating system being developed at AT&T
- Bell Labs. Research topics being addressed in Brazil center on
- higher-performance machines and, particularly, networks. A new
- in-house 300 megabit/s switched fiber network increases the
- potential bandwidth between machines by at least an order of
- magnitude; the aim is to realize and exploit that bandwidth. The
- overall design is to eliminate unnecessary overhead, particularly by
- restructuring and redesigning where necessary to avoid copying data
- from element to element along the communications path. Most of this
- software (except the operating system kernel) is written in a new
- concurrent systems programming language, Alef, which makes it easy
- to write multi-process servers and applications that can communicate
- using messages or shared memory, as appropriate. A paper on Alef is
- available from the Plan 9 ftp site; see part 2 of this FAQ for a
- pointer.
-
- - Cronus is still under development at BBN. The current public
- release is 3.0. The project currently has two thrusts---as the base
- for advanced distributed system R&D, and as a platform for
- constructing and deploying sophisticated distributed applications.
-
- Ongoing research topics include the integration of Cronus and Mach
- technology, the exploration of techniques for the construction of
- WAN-based and multi-organisational applications, investigation into
- the integration of distributed systems and network management
- systems, and work in high-performance distributed computing.
-
- - Horus is being developed by the same group that worked on Isis; the
- head of this group is Robbert van Renesse.
-
- - Isis is no longer being developed at Cornell; it is now managed as a
- commercial product.
-
- - Mach is no longer being developed at CMU. Current work on Mach is
- being carried out by the OSF Research Institute and at the
- University of Utah.
-
- - Plan 9 is no longer in development at AT&T Bell Labs. The
- operating systems research group at Bell Labs has moved on to a new
- project, called Brazil, which addresses portable computing and
- distributed applications programming.
-
- - QNX is a commercial POSIX-certified realtime OS with an installed
- base of over 250,000 systems. It is used extensively in process
- control, factory automation, medical instrumentation, communications
- and point-of-sale. A number of universities are also doing research
- with QNX.
-
- - The Sprite network operating system project has ended.
-
- ------------------------------
- Subject: [1.2] How do approaches to load balancing differ?
- From: Distributed systems
-
- Load-balancing policy falls into two broad groups: static and dynamic.
- Static policies use algorithms which operate without regard to
- run-time loads across a system, while dynamic policies use the
- run-time performance of various parts of a system in order to make
- more `informed' decisions about balancing.
-
- [92-11-06-12-53.57] A dynamic load-balancing policy is one which uses
- run-time state information in making scheduling decisions.
-
- There are two kinds of dynamic policies: adaptive and non-adaptive.
- The latter always use the same (fixed, load-dependent) policy; the
- former may adjust policy parameters in order to gradually improve
- their performance.
-
- The key point is that while non-adaptive policies use only
- information about the run-time state, adaptive policies also use
- information about their own current performance.
-
- In adaptive policies, the rules for adjusting policy parameters may be
- static or dynamic. An example of the former might be: `shift to a
- conservative migration rule when system-wide load patterns are varying
- too rapidly'. An example of the latter could be: `increase
- sender-side threshold when migrated jobs cause slowdown rather than
- speedup'. Some researchers refer to the performance-driven adaptation
- exhibited by the second policy as `learning'.
-
- Since both non-adaptive policies and adaptive policies with static
- rules really use only load information, the distinction between the
- two can become blurred. One way to avoid such confusion is to
- restrict the use of the word `adaptive' to policies that use
- performance feedback to drive their adjustment of policy parameters.
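-
- To make the distinction concrete, here is a small illustrative
- sketch in C. The function names, threshold values and feedback rule
- are all invented for the purposes of the example, and are not drawn
- from any particular system.
-
-   #include <stdio.h>
-   #include <stdbool.h>
-
-   /* Non-adaptive dynamic policy: always applies the same fixed,
-    * load-dependent rule. */
-   bool should_migrate_nonadaptive(double local_load, double remote_load)
-   {
-       const double threshold = 2.0; /* fixed for the policy's lifetime */
-       return local_load > remote_load + threshold;
-   }
-
-   /* Adaptive dynamic policy: the same rule, but the threshold is
-    * adjusted using performance feedback from past migrations. */
-   static double adaptive_threshold = 2.0;
-
-   bool should_migrate_adaptive(double local_load, double remote_load)
-   {
-       return local_load > remote_load + adaptive_threshold;
-   }
-
-   /* Performance feedback: if migrated jobs have been slowing down
-    * rather than speeding up, migrate less eagerly, and vice versa. */
-   void report_migration_outcome(double speedup)
-   {
-       if (speedup < 1.0)
-           adaptive_threshold += 0.5;  /* become more conservative */
-       else if (adaptive_threshold > 0.5)
-           adaptive_threshold -= 0.1;  /* become slightly more eager */
-   }
-
-   int main(void)
-   {
-       printf("before: %d\n", should_migrate_adaptive(4.3, 2.0));
-       report_migration_outcome(0.8);  /* a migration that hurt */
-       printf("after:  %d\n", should_migrate_adaptive(4.3, 2.0));
-       return 0;
-   }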
-
- ------------------------------
- Subject: [1.3] Fault tolerance in distributed systems
- From: Distributed systems
-
- One approach to providing fault tolerance in distributed systems
- involves the use of redundant services, such that standby facilities
- can become active in the event of the failure of, or loss of
- connection to, a primary service.
-
- Another approach is to provide multiple paths of connectivity between
- the computers that make up the distributed system. The QNX system,
- for example, supports multiple network drivers per node. The purpose
- of the network connection under QNX is to merge the microkernels on
- the LAN into a single logical kernel. Hence, if multiple LAN
- connections per node are present, the networking code can load balance
- the LAN traffic on the paths available. It can also route around
- failed links, providing both greater LAN bandwidth and better fault
- tolerance.
-
- See below for treatment of fault tolerance in systems which make use
- of distributed shared memory.
-
- ------------------------------
- Subject: [1.4] Naming in distributed systems
- From: Distributed systems
-
- [Material on naming and/or global naming sought.]
-
- ------------------------------
- Subject: [1.5] Distributed shared memory
- From: Distributed systems
-
- Distributed computer systems have evolved using message passing as
- their main method of communication. Other communication systems used
- in loosely coupled distributed systems, such as RPC, are usually
- implemented on top of an underlying message passing system. On the
- other hand, in tightly coupled systems, such as a multi-processor
- machine, the communication method used is usually shared memory.
-
- In distributed shared memory (DSM) systems [Nitzberg & Lo, 91],
- processes share data transparently across node boundaries; data
- faulting, location, and movement are handled by the underlying system.
- Among other things, this allows parallel programs designed to use
- shared memory to execute transparently on a loosely coupled
- distributed system. While the performance implications cannot be
- ignored, the advantages of the shared memory programming model are
- well known:
-
- - Shared memory programs are usually shorter and easier to understand
- than equivalent message passing programs.
-
- - Large or complex data structures may easily be communicated.
-
- - Shared memory gives transparent process-to-process communication.
-
- - Programming with shared memory is a well-understood problem.
-
- Shared-memory (or `procedure-oriented') and message-oriented operating
- systems are, in some sense, equivalent [Lauer & Needham, 78], though
- it has been claimed that the former are `more powerful' [Tam et al.,
- 90].
-
- ------------------------------
- Subject: [1.5.1] Data consistency
- From: Distributed systems
-
- Despite recent advances in both local and wide-area networking
- technologies, network latency is still a major factor in distributed
- systems and likely to remain so. All DSM systems provide some sort of
- caching in an attempt to improve the performance beyond that provided
- by doing a network access on every reference to a non-local data item.
- Each system must decide whether or not to attempt to keep the data
- coherent, and, if so, what coherence strategy to use. The coherence
- semantics which may be provided to the programmer include:
-
- - `strict' consistency, where a read always returns the value
- written by the most recent write;
-
- - `loose' consistency, where the system enforces some form of weak
- consistency guarantee and the application (or compiler or user) can
- indicate synchronisation points at which consistency must be
- enforced;
-
- - no automatic consistency mechanism, with the user instead given
- the facilities necessary to implement user-level synchronisation
- and consistency.
-
- ------------------------------
- Subject: [1.5.1.1] Strictly consistent systems
- From: Distributed systems
-
- Older, strictly consistent systems tend to enforce a single writer,
- multiple reader model, where at any time data will be held either at a
- single node (which may have write access) or at several nodes (none of
- which may have write access).
-
- Given this model, we must be able to locate a copy of our data when it
- is not resident. The method most frequently used is to assign an
- `owner' to each item of data, where the owner has either the only
- writeable copy of the data, or one of the read-only copies. Ownership
- may remain fixed throughout the life of a datum, or it may change
- dynamically. In the latter case, the problem arises of locating the
- owner. A database of locations may be maintained by centralised
- managers, or ownership information can be distributed among nodes of
- the system [Li and Hudak, 89].
-
- In a strictly consistent system, we must also be able to synchronise
- writes. The two major solutions to this problem are:
-
- - Write broadcast. The effects of every write are broadcast to every
- node that has a copy of the data being written; this effectively
- implements a replication algorithm. Write broadcast is usually
- considered too expensive to be used as a general solution.
-
- - Write invalidation. Each node in the system holding a read-only
- copy of the data being written is sent an invalidation message.
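-
- A minimal sketch of write invalidation with a centralised manager,
- in the spirit of the fixed-manager algorithms of [Li and Hudak, 89],
- appears below. All of the names are invented and the network
- operations are stubbed out with prints; a real implementation would
- live inside the page-fault handling path.
-
-   #include <stdbool.h>
-   #include <stdio.h>
-
-   #define MAX_NODES 8
-
-   /* Per-page state kept by the centralised manager. */
-   struct page_entry {
-       int  owner;               /* node holding the writeable copy */
-       bool copyset[MAX_NODES];  /* nodes holding read-only copies */
-   };
-
-   /* Stubs standing in for the real message layer. */
-   static void send_invalidate(int node, int page)
-   {
-       printf("invalidate page %d at node %d\n", page, node);
-   }
-
-   static void send_page(int node, int page)
-   {
-       printf("ship page %d to node %d\n", page, node);
-   }
-
-   /* A write fault forwarded to the manager: invalidate every
-    * read-only copy, then pass ownership to the faulting node,
-    * preserving the single-writer/multiple-reader invariant. */
-   void manager_write_fault(struct page_entry *e, int page, int requester)
-   {
-       for (int n = 0; n < MAX_NODES; n++) {
-           if (e->copyset[n] && n != requester) {
-               send_invalidate(n, page);
-               e->copyset[n] = false;
-           }
-       }
-       send_page(requester, page);
-       e->owner = requester;
-       e->copyset[requester] = true;
-   }
-
-   int main(void)
-   {
-       struct page_entry e = { 0, { true, true, false } };
-       manager_write_fault(&e, 42, 2);  /* node 2 wants to write */
-       return 0;
-   }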
-
- ------------------------------
- Subject: [1.5.1.2] Relaxing consistency
- From: Distributed systems
-
- Permitting temporary inconsistencies is a common method of increasing
- performance in distributed systems. Memory is said to be loosely
- coherent if the value returned by a read operation is the value
- written by an update operation to the same object that `could' have
- immediately preceded the read operation in some legal schedule of the
- threads in execution [Bennett et al., 90].
-
- Using loose coherence, more than one thread may have write access to
- the same object, provided that the programmer knows that the writes
- will not conflict.
-
- Another memory consistency model is `release consistency'
- [Gharachorloo et al., 90], in which memory accesses are divided into
- ordinary and synchronisation-related accesses. The latter are further
- divided into `acquire' and `release' operations. The `acquire'
- operation indicates that shared data is needed, and a processor's
- updates are not guaranteed to be performed at other nodes until a
- `release' is performed. The primary advantage of this form of
- consistency is that it allows consistency updates to be tied to
- synchronisation events, and therefore to be delayed until actually
- needed by applications. However, most release consistent systems
- require the programmer to make explicit use of `acquire' and `release'
- operations.
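-
- The following toy, with invented names throughout, models the
- essential behaviour: ordinary writes are merely queued locally, and
- are only guaranteed to have been propagated once the program
- performs a `release'.
-
-   #include <stdio.h>
-
-   #define QMAX 128
-   static struct { int addr, value; } queue[QMAX];
-   static int pending = 0;
-
-   void dsm_release(void);
-
-   static void propagate(int addr, int value)
-   {
-       printf("flush: [%d] = %d\n", addr, value);  /* network elided */
-   }
-
-   /* Ordinary access: buffered locally, not yet visible remotely. */
-   void dsm_write(int addr, int value)
-   {
-       if (pending == QMAX)
-           dsm_release();              /* queue full: flush early */
-       queue[pending].addr = addr;
-       queue[pending].value = value;
-       pending++;
-   }
-
-   /* Synchronisation accesses. */
-   void dsm_acquire(void)
-   {
-       /* Obtain the lock and fetch updates its previous holder
-        * released; elided in this sketch. */
-   }
-
-   void dsm_release(void)
-   {
-       /* Updates must be performed at other nodes by now. */
-       for (int i = 0; i < pending; i++)
-           propagate(queue[i].addr, queue[i].value);
-       pending = 0;
-   }
-
-   int main(void)
-   {
-       dsm_acquire();
-       dsm_write(0, 42);   /* not guaranteed visible remotely yet */
-       dsm_release();      /* ...but guaranteed performed by here */
-       return 0;
-   }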
-
- A DSM system called Midway introduces another new consistency model,
- `entry consistency' [Bershad et al., 93]. Entry consistency is weaker
- than many of the other models suggested, including release
- consistency; it requires explicit annotations to associate
- synchronisation objects and data. On an `acquire', only the data
- associated with the synchronisation object is guaranteed to be
- consistent. This extra weakness permits higher performance
- implementations of the underlying consistency protocols to be written.
- Midway also supports stronger consistency models, so that the
- application programmer can trade-off performance against the extra
- effort required to write entry consistent programs.
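-
- The flavour of those annotations is sketched below. The interface
- is invented for illustration and differs from Midway's actual one;
- the point is just that the binding of data to a synchronisation
- object is explicit, and an acquire promises consistency only for
- the bound data.
-
-   #include <stdio.h>
-
-   struct ec_lock {
-       void          *data;  /* region this lock guards */
-       unsigned long  len;
-   };
-
-   /* The explicit annotation: bind shared data to its
-    * synchronisation object. */
-   void ec_bind(struct ec_lock *l, void *data, unsigned long len)
-   {
-       l->data = data;
-       l->len = len;
-   }
-
-   void ec_acquire(struct ec_lock *l)
-   {
-       /* A real system would fetch remote updates for l->data only;
-        * data not bound to this lock may remain stale. */
-       printf("acquire: %lu bytes made consistent\n", l->len);
-   }
-
-   void ec_release(struct ec_lock *l)
-   {
-       printf("release: %lu bytes published\n", l->len);
-   }
-
-   int main(void)
-   {
-       int shared_queue[8];
-       struct ec_lock queue_lock;
-
-       ec_bind(&queue_lock, shared_queue, sizeof shared_queue);
-       ec_acquire(&queue_lock);  /* only shared_queue is guaranteed */
-       shared_queue[0] = 1;
-       ec_release(&queue_lock);
-       return 0;
-   }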
-
- ------------------------------
- Subject: [1.5.1.3] Application-specific coherence
- From: Distributed systems
-
- From [Cheriton, 86]:
- `Problem-oriented shared memory' is a shared memory that implements
- fetch and store operations specialised to the particular problem or
- application it is supporting. In particular, a problem-oriented
- shared memory commonly provides a specialised form of consistency
- and consistency maintenance that exploits application-specific
- semantics.
- Cheriton goes on to propose that consistency constraints be relaxed
- and more use be made of problem semantics. He suggests that, in some
- cases, stale data may be detected on use by the client, and the client
- may then recover. An example would be hint caching. In some
- applications, stale data may actually be sufficiently accurate,
- provided that the client can obtain up to date information when
- necessary. In other applications, some data may be optional in the
- sense that the client can continue without it. Other applications may
- tolerate the results of store operations being lost or undone; an
- example would be an application that regularly updates the entire
- data set.
-
- Another approach is presented by the designers of Munin, where the
- runtime system accepts hints from the compiler or user to determine
- the coherence mechanism to be used for each object. The default, in
- the absence of hints, is to use a general read-write consistency
- mechanism, much like that employed by IVY. Munin supports several
- different object types that are based on the results of a survey of
- shared memory access characteristics. The results of the survey
- showed that a very small percentage of all accesses to shared data
- fall under the general read-write type. The Munin designers also note
- that a program moves through various stages of execution, and the
- types associated with objects change as time progresses.
-
- ------------------------------
- Subject: [1.5.2] Access synchronisation
- From: Distributed systems
-
- Most parallel applications will use some sort of synchronisation
- system to order and control accesses to shared data before actually
- accessing the data. The most important thing to note in DSM systems
- is that blindly using standard test-and-set operations on bytes
- in shared pages will produce a high fault rate; faults are usually
- expensive, making this approach unacceptable.
-
- Clouds merges locking with the cache consistency protocol, so that the
- user may obtain both a lock and the data in one network transaction.
- This system has the advantage that no invalidation messages are
- required, since the granting of the lock guarantees that there are no
- conflicting copies; it has the disadvantage that an explicit
- unlock/discard operation is required to release access to the data.
- This is acceptable in Clouds, as the DSM system was designed
- specifically to support object invocation, so it is easy to discard on
- a return.
-
- Munin provides a distributed lock mechanism using `proxy objects' to
- reduce network load. Proxy objects are maintained by a lock server on
- each node; when a thread wants to obtain a lock on an object, it
- attempts to lock the proxy instead. The server obtains the global
- lock if it is not already held locally. Global locking is done by
- negotiating with all the other lock servers in the system. Each lock
- may be migrated from server to server, and part of the Munin system
- allows objects to be migrated along with their locks.
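-
- A rough sketch of the proxy idea follows. The names and the
- negotiation step are invented stand-ins; Munin's actual protocol
- involves more states than this.
-
-   #include <stdbool.h>
-   #include <stdio.h>
-
-   struct proxy_lock {
-       bool held_locally;  /* does this node hold the global lock? */
-   };
-
-   /* Stand-in for negotiating with the other lock servers. */
-   static void negotiate_global_lock(int lock_id)
-   {
-       printf("server: acquiring global lock %d from peers\n", lock_id);
-   }
-
-   /* A thread locks the local proxy; only if the global lock is not
-    * already held on this node does the server go to the network. */
-   void proxy_acquire(struct proxy_lock *p, int lock_id)
-   {
-       if (!p->held_locally) {
-           negotiate_global_lock(lock_id);
-           p->held_locally = true;
-       }
-       /* Hand-off between local threads needs no network traffic. */
-   }
-
-   void proxy_release(struct proxy_lock *p)
-   {
-       (void)p;
-       /* The global lock is retained by this node's server until a
-        * peer asks for it, so repeated local use stays cheap. */
-   }
-
-   int main(void)
-   {
-       struct proxy_lock p = { false };
-       proxy_acquire(&p, 7);   /* first acquire goes remote */
-       proxy_release(&p);
-       proxy_acquire(&p, 7);   /* second acquire is purely local */
-       proxy_release(&p);
-       return 0;
-   }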
-
- Other systems, such as IVY and Mermaid, use modified versions of classic
- multiprocessor synchronisation facilities.
-
- ------------------------------
- Subject: [1.5.3] Transfer and caching granularity
- From: Distributed systems
-
- When caching objects in local memory, it is necessary to decide what
- level of granularity to use. All current systems use a fixed block
- size in the cache, rather than varying the granularity based on object
- size. Usually this is due to constraints imposed by the system
- hardware and memory management.
-
- The choice of the block size in the cache depends on several issues.
-
- - Cost of communication: for example, on many local area networks
- there is little difference between the time required to send a
- one-byte message and that required to send a 1024-byte message.
- Transmitting bulk changes rather than single-byte modifications
- would therefore seem desirable.
-
- - The choice of granularity also depends on the locality of reference
- in the application, as thrashing may occur when two machines are
- both accessing the same block (this is also known as the `ping-pong
- effect'). This would seem to argue for a smaller block size. It
- should be noted that many object-oriented systems exhibit very poor
- locality of reference.
-
- In practice, a compromise must be achieved, as with conventional
- virtual memory systems. Most systems use a block size which is the
- same as that of the virtual memory management unit on the system, or a
- multiple thereof. Among other things, this allows the hardware to
- be used to help in the maintenance of consistency. The choice is
- complicated somewhat when heterogeneous machines are being used,
- but in these cases the lowest common multiple of the
- hardware-supported page sizes can usually be used.
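-
- For the heterogeneous case, the computation is just a least common
- multiple over the nodes' page sizes, along the following lines (the
- page-size values are examples only):
-
-   #include <stdio.h>
-
-   static unsigned long gcd(unsigned long a, unsigned long b)
-   {
-       while (b != 0) {
-           unsigned long t = a % b;
-           a = b;
-           b = t;
-       }
-       return a;
-   }
-
-   static unsigned long lcm(unsigned long a, unsigned long b)
-   {
-       return a / gcd(a, b) * b;
-   }
-
-   int main(void)
-   {
-       /* Example page sizes for three heterogeneous nodes. */
-       unsigned long page_sizes[] = { 4096, 8192, 16384 };
-       unsigned long block = 1;
-
-       for (int i = 0; i < 3; i++)
-           block = lcm(block, page_sizes[i]);
-
-       /* The block maps to a whole number of pages everywhere. */
-       printf("DSM block size: %lu bytes\n", block);  /* 16384 */
-       return 0;
-   }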
-
- The only major system that doesn't use a large block size is Memnet,
- in which a hardware based DSM system was implemented on a high speed
- token ring; a 32-byte block size was used instead [Delp & Farber].
- The choice of a small block size is appropriate, as the system is much
- closer to a shared memory multi-processor than it is to a software DSM
- system. This is because the entire processor is blocked on a cache
- miss; the processor is not actually aware of the distributed nature of
- its address space. Also, the ratio between remote and local memory
- access times is much lower than in the software based systems due to
- the dedicated token ring (200Mbps) and hardware assistance.
-
- ------------------------------
- Subject: [1.5.4] Address space structure
- From: Distributed systems
-
- In a single shared address space system, the system appears as a set
- of threads executing in a shared distributed address space. Objects
- always appear at the same addresses on all nodes. Single address
- space systems have had a resurgence in popularity with the arrival of
- 64-bit processors. A number of researchers believe that a 64-bit
- address space is large enough to act as a single global address space
- for all the memory (both primary and secondary) in a distributed
- system. Examples of such systems include Angel, Mungi, and Opal.
- Security and protection are a major problem in such systems, and
- current approaches either rely on hardware assistance or stochastic
- algorithms, or ignore the problem.
-
- Another approach is to divide each process's address space into
- different fixed regions, some of which are private and not shared, and
- some of which are shared with some other processes. Ra, the Clouds
- kernel, takes this approach using O, P, and K address regions, with
- the O region shared between all processes executing in a given object;
- the P and K regions are local to a process and kernel, respectively.
- Here objects always appear at the same address but may not be visible
- from every address space. By contrast, some systems, including Mirage
- and Mach, allow shared data to exist at differing addresses in
- different processes' address spaces. However, neither system performs
- transparent pointer translation, so the address changes are not
- entirely transparent to the application.
-
- As for the structuring of the shared region itself, some systems --
- for example, IVY and Mether -- use a single flat region: one
- contiguous range of virtual addresses represents the shared address
- space and is managed by the DSM system. This single address space is
- usually sub-divided into pages. Most systems use paged segmentation:
- the shared region consists of disjoint pieces, which are usually
- managed separately and are not all mapped in any one process.
- Frequently, the segments (sometimes called memory objects, or windows)
- are related to the backing store. For example, in Clouds, the object
- address space consists of windows onto larger segments; these segments
- are usually maintained on secondary storage.
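-
- On a conventional Unix, the `same address everywhere' property can
- be approximated by mapping the shared region at a fixed address in
- every process, along the following lines. This is only a sketch:
- the base address is an arbitrary example, and a real DSM system
- would back the region with its coherence machinery rather than with
- anonymous memory.
-
-   #include <stdio.h>
-   #include <sys/mman.h>
-
-   #define SHARED_BASE ((void *)0x40000000UL)  /* arbitrary example */
-   #define SHARED_LEN  (1UL << 20)             /* 1 MB region */
-
-   int main(void)
-   {
-       void *region = mmap(SHARED_BASE, SHARED_LEN,
-                           PROT_READ | PROT_WRITE,
-                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
-                           -1, 0);
-       if (region == MAP_FAILED) {
-           perror("mmap");
-           return 1;
-       }
-       /* Pointers into the region now have the same numeric value in
-        * every process that performs the same mapping, so they can
-        * be exchanged without translation. */
-       printf("shared region at %p\n", region);
-       return 0;
-   }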
-
- ------------------------------
- Subject: [1.5.5] Fault tolerance
- From: Distributed systems
-
- Most DSM systems ignore fault tolerance or maintain that it is an
- operating system issue and should be handled by the underlying
- system. However, it would appear that in practice a DSM system would
- strongly affect the fault tolerance of a system. For example, in a
- system where several nodes are sharing access to a set of data, the
- failure of any one of them could lead to the failure of all the
- connected sites (or, at least, some of the processes on each site).
- We are also presented with an unusual failure handling problem. It is
- fairly easy to see how to handle a failed message or RPC, but how do
- you handle a failed page fault?
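-
- Software DSM systems typically field such faults in user space, so
- that is where a failure would surface. A skeleton of the mechanism,
- with the page-fetch step stubbed out as an invented
- fetch_remote_page(), might look like this; note that when the fetch
- fails there is no graceful way to make the faulting instruction
- `return an error'.
-
-   #include <signal.h>
-   #include <stdbool.h>
-   #include <stdlib.h>
-
-   /* Invented stand-in: locate the page's owner, fetch its contents
-    * and map them in; returns false if the owner cannot be reached. */
-   static bool fetch_remote_page(void *addr)
-   {
-       (void)addr;
-       return true;
-   }
-
-   static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
-   {
-       (void)sig; (void)ctx;
-       if (!fetch_remote_page(si->si_addr))
-           abort();   /* a failed page fault: no graceful options */
-       /* On success, returning retries the faulting instruction,
-        * which now finds the page mapped and consistent. */
-   }
-
-   int main(void)
-   {
-       struct sigaction sa;
-       sa.sa_sigaction = dsm_fault_handler;
-       sa.sa_flags = SA_SIGINFO;
-       sigemptyset(&sa.sa_mask);
-       sigaction(SIGSEGV, &sa, NULL);
-       /* ... map the shared region inaccessible, then run the
-        * application; each first touch now traps to the handler. */
-       return 0;
-   }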
-
- The original Clouds system provided recoverability using shadowing of
- segments and a transactional system using commits. The recovery
- system was not really integrated with the DSM system and was merely
- implemented at the segment storage site. In order to maintain a
- consistent view of data when one transaction is active at multiple
- nodes, the Clouds designers have more recently been forced to
- integrate the transaction system with the DSM support system.
-
- ------------------------------
- Subject: [1.5.6] A brief bibliography on distributed shared memory
- From: Distributed systems
-
- [Nitzberg & Lo, 91]
- Nitzberg, W. & Lo, V., `Distributed shared memory: a survey of
- issues and algorithms', IEEE Computer, August 91, pp. 52-60
-
- [Lauer & Needham, 78]
- Lauer, H. C. & Needham, R. M., `On the duality of operating system
- structures', Proceedings of the 2nd International Symposium on
- Operating Systems, October 78; reprinted in ACM Operating Systems
- Review 13(2), April 79, pp. 3-19
-
- [Tam et al., 90]
- Tam, M.-C., Smith, J. M. & Farber, D. J., `A taxonomy-based
- comparison of several distributed shared memory systems', ACM
- Operating Systems Review 24(3), July 90, pp. 40-67
-
- [Li and Hudak, 89]
- Li, K. & Hudak, P., `Memory coherence in shared virtual memory
- systems', ACM Transactions on Computer Systems 7(4), November 89,
- pp. 321-359
-
- [Bennett et al., 90]
- Bennett, J. K., Carter, J. B. & Zwaenepoel, W., `Munin:
- distributed shared memory based on type-specific memory
- coherence', Proceedings of the 2nd ACM SIGPLAN Symposium on
- Principles and Practice of Parallel Programming, SIGPLAN Notices
- 25(3), March 90, pp. 168-176
-
- [Gharachorloo et al., 90]
- Gharachorloo, K., et al., `Memory consistency and event ordering in
- scalable shared-memory multiprocessors', ACM SIGARCH News 18(2),
- June 90
-
- [Bershad et al., 93]
- Bershad, B. N., et al., `The Midway distributed shared memory
- system', Technical Report CMU-CS-93-119, School of Computer
- Science, Carnegie Mellon University, 1993. Available via
- anonymous ftp from
- <URL:ftp://ftp.cs.cmu.edu/project/mach/public/doc/published/midway.ps>.
-
- [Cheriton, 86]
- Cheriton, D. R., `Problem-oriented shared memory: a decentralized
- approach to distributed system design', Proceedings of the 6th
- International Conference on Distributed Computing Systems, May 86,
- pp. 190-197
-
- [Delp & Farber]
- Delp, G. S. & Farber, D. J., `Memnet -- a different approach to a
- network', Technical Report, Department of Electrical Engineering,
- University of Delaware, ???
-
-
- ------------------------------
- Subject: [1.6] What have we learned?
- From: Distributed systems
-
- Andy Tanenbaum started a (very long) thread on this topic in
- comp.os.research in April of 1992 [92-04-03-17-10.05]. The interested
- reader is directed to the comp.os.research archives, since this thread
- proved rather divisive (i.e. nobody really agreed on any issue).
-
-
- ------------------------------
- Subject: [2] Needful things
- From: Needful things
-
- This FAQ is incomplete, and will probably remain in this state to a
- greater or lesser extent for ever and ever. Should you feel willing
- to contribute some material, the following is a list of topics which
- ``urgently'' require treatment (some of which I may get around to
- covering myself at some point):
-
- - naming in distributed systems
-