-
- The Increasing Irrelevance of IPC Performance for Microkernel-Based
- Operating Systems
-
-
-
-
- Brian N. Bershad
- School of Computer Science
- Carnegie Mellon University
- 5000 Forbes Avenue
- Pittsburgh, PA 15213
-
- Brian.Bershad@cs.cmu.edu
-
-
- March 10, 1992
-
-
-
-
-
- Abstract
-
-
- IPC is the glue with which traditional operating system
- services such as networking and filing are provided in
- microkernel-based operating systems. Because applications
- rely heavily on cross-address space communication, IPC
- performance is often viewed as the ``Achilles heel'' of
- a microkernel-based operating system. In this paper I discuss
- four reasons why IPC performance is becoming increasingly
- irrelevant to overall system performance.
-
- 1 Introduction
- Microkernels such as V [Cheriton 84], Chorus [Rozier et al.
- 88], and Mach [Accetta et al. 86] provide the infrastructure
- with which other operating systems such as Unix, MS-DOS and
- VMS can be implemented as user-level programs [Golub et al. 90,
- Rashid et al. 91, Cheriton et al. 90, Wiecek 92].
-
- ()This research was sponsored in part by The Defense Advanced
- Research Projects Agency, Information Science and Technology Office,
- under the title ``Research on Parallel Computing'', ARPA Order No.
- 7330, issued by DARPA/CMO under Contract MDA972-90-C-0035 and in part
- by the Open Software Foundation (OSF), and a National Science
- Foundation Presidential Young Investigator Award.
- ()
-
- Because applications use cross-address space IPC to interact
- with traditional operating system services, IPC performance
- has been thought to be the Achilles heel of microkernel-based
- systems. Even with the improvements in IPC performance that
- have occurred in the past 5 years [Bershad et al. 90, Draves
- et al. 91], microkernel-based systems are still believed to
- have inherently worse performance than their monolithic
- counterparts. This belief stems from the fact that the system
- call, which is the mechanism for interacting with monolithic
- operating systems, such as 4.3BSD [Leffler et al. 89] and
- Sprite [Ousterhout et al. 88], is faster than a cross-address
- space IPC, which is how applications interact with user-level
- operating system services.
- Although IPC latency in microkernels is higher than system
- call latency, the absolute difference between them has reached
- the point where it can be largely ignored. In other words,
- IPC performance is becoming increasingly irrelevant as a
- metric with which to assess microkernel viability.
- There are four reasons that IPC performance should no
- longer be a principal metric by which one judges the
- ``goodness'' of a particular operating system microkernel, or
- even a particular approach to building operating system
- microkernels. In brief, these reasons are:
- 1. IPC has gotten faster faster than the rest of the operating
- system.
-
- 2. Performance is dominated by caches, not by address spaces.
-
- 3. All data does not need to be marshalled through the kernel.
-
- 4. All services do not need a hardware firewall.
- The first two reasons stem from the ever-growing
- performance imbalances which exist in today's systems. Simply
- stated, IPC mechanisms are not the performance bottlenecks
- which they once were. Instead, there are other, more
- fundamental bottlenecks, such as memory transfer speed,
- network latencies, disk speed, and cache management overhead.
- The second two reasons are due to the maturity which
- microkernel-based systems have achieved in the last few years.
- Specifically, in their efforts to increase performance,
- systems builders have discovered a collection of techniques
- which can be used to bypass, or at least ``dance around,'' IPC
- facilities.
- In the rest of this paper I expand on the reasons listed
- above, and discuss why we should stop measuring microkernel
- systems by the speed of their round trip IPC times. I present
- examples and observations from the Mach 3.0 microkernel
- running on a collection of architectural platforms to motivate
- and substantiate the discussion.
-
-
- 2 IPC Has Gotten Faster Faster Than the Rest of The
- Operating System
- It has been observed that operating system performance has
- improved far less rapidly than would be expected given
- improvements in processor architecture and
- implementation [Ousterhout 90, Anderson et al. 91]. Although
- the time to add two registers together has decreased by almost
- two orders of magnitude in the last decade, the performance of
- key operating system services, such as filing, paging, and
- networking, has remained relatively flat. Disks continue to
- spin at about the same speed as they always have, buffer
- caches remain limited by memory bandwidth, and core network
- latency remains on the order of several hundred microseconds
- (although network bandwidth has improved somewhat more
- dramatically).
- In contrast to the performance of the services which are
- being provided by the operating system, the time to send a
- message between two address spaces on the same machine has
- dropped substantially. The reasons for this are two-fold.
- First, we've become more careful and more successful at
- building IPC mechanisms ``for speed,'' ruthlessly streamlining
- and optimizing the common cases. Using the Mach 3.0
- microkernel as an example, IPC performance on a Microvax-III
- (CVax processor) has gone from about 750 µsecs for a
- round-trip RPC in 1989 [Bershad 90] to 497 µsecs in 1992
- (measured recently using Mach 3.0 version MK68). The
- improvements were due to tightening the interface [Draves 90],
- and the implementation [Draves et al. 91].
- The second reason why IPC has gotten faster faster than the
- rest of the operating system is that measured IPC performance
- has, at least to this point, tracked processor performance
- reasonably well.(1) In Mach, this is largely because the
- improvements made in the last several years have moved IPC
- performance off of the memory curve and onto the processor
- curve by tightening the locality of the IPC paths. Again,
- using Mach 3.0 MK68, a round-trip RPC takes 57 µsecs(2) on a
- DecStation 5000/200. That system, which uses a MIPS R3000
- processor running at 25 MHz, is rated at roughly ten times the
- performance of the CVax. In keeping with this, cross-address
- space RPC is about nine times faster than the same code
- running on a CVax.
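
As a back-of-the-envelope check on these figures, the fragment below
simply recomputes the two ratios quoted above from the measured times
(the constants are the numbers in the text, not new data): the
software-only improvement on the CVax, and the CVax-to-DecStation
speedup that tracks the roughly tenfold difference in processor
performance.

    #include <stdio.h>

    /* Round-trip RPC times quoted above, in microseconds. */
    #define CVAX_RPC_1989    750.0  /* Microvax-III (CVax), 1989          */
    #define CVAX_RPC_1992    497.0  /* Microvax-III (CVax), Mach 3.0 MK68 */
    #define DS5000_RPC_1992   57.0  /* DecStation 5000/200, Mach 3.0 MK68 */

    int main(void)
    {
        /* Software-only improvement on the same machine. */
        printf("CVax, 1989 -> 1992: %.1fx faster\n",
               CVAX_RPC_1989 / CVAX_RPC_1992);

        /* Cross-machine improvement, to compare against the roughly
           tenfold difference in rated processor performance. */
        printf("CVax -> DS5000/200: %.1fx faster\n",
               CVAX_RPC_1992 / DS5000_RPC_1992);
        return 0;
    }
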
- With IPC performance improving more rapidly than general
- operating system service performance, the ``hit'' required to
- access such a service is becoming less important. For
- example, on the older CVax, the 497 µsecs required to do a
- cross-address space RPC was comparable to the time it took to
- seek one disk track, or to copy a 512-byte block from a system
- buffer cache into a user buffer, or to transmit a packet over
- an Ethernet.
-
- (1)Although IPC performance is ultimately limited by architectural
- features such as trap and context switch time, and although these
- times have not been improving at the same rate as processor
- speed [Anderson et al. 91], practical IPC performance has not yet
- reached this limit.
- (2)All times were measured with warm caches.
-
- On the DecStation 5000/200, however, the IPC penalty is
- substantially less. Since disk access, data transfer, and
- network latencies have not improved at anywhere near the same
- rate, the additional cost of the IPC required to indirect to
- these facilities has become less significant.
-
- 2.1 System Calls Are Not The Solution
- System calls used in monolithic kernels are the alternative
- to the IPC mechanisms used in microkernels. Monolithic
- operating system services reside in the kernel, and are
- accessed by applications with a single kernel boundary
- crossing. On older systems, such as the CVax, the time to
- execute a Mach system call(3) was about 60 µsecs,
- substantially less than the IPC overhead (497 µsecs) on that
- machine. In contrast, on the newer DecStation 5000/200,
- system call overhead is about 8 µsecs, or about 50 µsecs less
- than a round-trip RPC. While the relative cost of IPC and
- system calls on the two machines is roughly the same, the
- absolute difference, compared to the service access cost (disk,
- buffer, or network), is much smaller. As a result, there is
- diminishing incentive to use system calls, rather than
- generalized IPC facilities, to access system services.
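
For concreteness, the sketch below shows the general shape of such a
measurement: a long, tight loop of null system calls, timed with warm
caches, with the per-call latency reported at the end. It is an
illustrative harness only, with getpid() standing in for a null call
and gettimeofday() for the clock; it is not the instrument used to
collect the numbers above. A round-trip RPC is measured the same way,
with the call replaced by a null message to an echo server.

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERATIONS 100000L

    int main(void)
    {
        struct timeval start, end;
        double elapsed_usecs;
        long i;

        for (i = 0; i < 1000; i++)      /* warm the caches first */
            (void) getpid();

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++)
            (void) getpid();
        gettimeofday(&end, NULL);

        elapsed_usecs = (end.tv_sec - start.tv_sec) * 1.0e6 +
                        (end.tv_usec - start.tv_usec);
        printf("null system call: %.2f usecs\n",
               elapsed_usecs / ITERATIONS);
        return 0;
    }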
-
- 3 Performance is Dominated By Caches, Not by Address Spaces
- The invocation of an operating system service implies a
- change in locality. In monolithic systems, the new locality
- is in the kernel. In microkernel-based systems, the new
- locality is in another address space. In both cases, however,
- the change in locality can result in an increased cache miss
- rate [Agarwal et al. 88, Mogul & Borg 91]. On older,
- slower architectures (< 10 MIPS), cache miss penalties were
- only a few cycles. On today's architectures, however, cache
- miss penalties are tens, and soon to be hundreds, of cycles.
- These kinds of penalties can easily dwarf the kernel's IPC
- overhead. In effect, the cost of accessing an operating
- system service is going to be most influenced by whether or
- not that service is in the cache -- not whether or not it's in
- the kernel or in another address space.
- One could argue, however, that OS interaction via a
- microkernel actually involves two changes in locality -- one
- to the microkernel and another to the server. While true, the
- microkernel's locality (at least on the critical path through
- to the operating system server) is small. The common-case
- round trip IPC path in Mach 3.0 on the MIPS, for example,
- requires less than 4KB of instructions and references less
- than 2KB of data, most of which is on the kernel stack.
- Because the locality is small, and identifiable a priori, one
- could decrease the chance that misses occur on the IPC path by
- allocating memory with an eye towards the cache hashing
- function [Bershad et al. 92].
-
- (3)Mach, although a microkernel, does export a small number of
- ``true'' system calls.
-
- For virtually addressed
- caches, this means devoting pieces of the machine's virtual
- address space to the microkernel. For a physically addressed
- cache, physical memory must be devoted, although one can play
- tricks with split instruction and data caches to ensure that
- only data pages conflict with designated instruction pages,
- and vice versa.
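
The sketch below illustrates the allocation idea for the physically
addressed case. The cache geometry, the number of reserved colors,
and the two-pool scheme are assumptions made for the example; they
are not Mach's allocator or any particular machine's parameters. The
point is only that a frame's cache ``color'' can be computed when it
is freed, so frames that would conflict with the IPC path can be kept
out of general circulation.

    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define CACHE_SIZE  (256UL * 1024UL)          /* assume direct-mapped */
    #define NCOLORS     (CACHE_SIZE / PAGE_SIZE)  /* 64 page colors       */
    #define IPC_COLORS  4UL    /* colors 0..3 reserved for the IPC path   */

    /* Which region of the cache a physical page frame falls in. */
    static unsigned long page_color(unsigned long paddr)
    {
        return (paddr / PAGE_SIZE) % NCOLORS;
    }

    /* Frames with a reserved color go to the IPC pool; everything else
       goes to the general free list, so ordinary allocations can never
       evict the IPC path's code and data. */
    static int frame_is_for_ipc(unsigned long paddr)
    {
        return page_color(paddr) < IPC_COLORS;
    }

    int main(void)
    {
        unsigned long paddr, ipc = 0, general = 0;

        /* Classify the frames in 16MB of hypothetical physical memory. */
        for (paddr = 0; paddr < 16UL * 1024 * 1024; paddr += PAGE_SIZE) {
            if (frame_is_for_ipc(paddr))
                ipc++;
            else
                general++;
        }
        printf("%lu frames reserved for IPC, %lu general\n", ipc, general);
        return 0;
    }
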
- One place where cache performance becomes apparent is in
- the management of external devices which use DMA rather than
- programmed I/O. Before the processor reads memory into which
- DMA has been performed, it must purge that memory from the
- cache to ensure that stale data is not returned. Before a
- processor issues a write request to a DMA device, it must
- flush the memory to be written from the cache, again to ensure
- that stale data is not read by the device. These cache
- operations are expensive. For example, on the HP-700, a
- high-performance workstation based on the HPPA RISC processor
- with a cycle time of 20 nanoseconds, cache purge and flush
- operations take between 1 and 14 cycles per line (32 bytes).
- DMA operations tend to be page-oriented, so I/O operations
- require between 128 and 896 cycles simply to ensure memory
- coherency. In the case of device reads, performance degrades
- even further as the newly transferred data is faulted into the
- cache, at a cost of 16 cycles per line (2048 cycles per page).
- In contrast, a cross-address space RPC in Mach 3.0 on that
- machine takes about 70 µsecs (3600 cycles). Consequently, the
- cost of accessing an out-of-kernel device server (changing
- address spaces) represents only one component of the total
- CPU/device communication cost.
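
To make the comparison concrete, the cycle counts quoted above can be
combined directly. The fragment below restates that arithmetic and
nothing more; it shows that the RPC accounts for only slightly more
than half of the combined cost of reaching an out-of-kernel device
server and keeping the cache coherent for a single DMA'd page.

    #include <stdio.h>

    /* Per-page cycle counts quoted in the text for the HP-700. */
    #define PURGE_MIN   128   /* purge/flush before DMA, best case      */
    #define PURGE_MAX   896   /* purge/flush before DMA, worst case     */
    #define FAULT_IN   2048   /* re-reading DMA'd data into the cache   */
    #define RPC_CYCLES 3600   /* round-trip cross-address space RPC     */

    int main(void)
    {
        int best  = PURGE_MIN + FAULT_IN;   /* 2176 cycles */
        int worst = PURGE_MAX + FAULT_IN;   /* 2944 cycles */

        printf("coherency overhead per page read: %d to %d cycles\n",
               best, worst);
        printf("round-trip RPC:                   %d cycles\n", RPC_CYCLES);
        printf("RPC share of the combined cost:   %.0f%% to %.0f%%\n",
               100.0 * RPC_CYCLES / (RPC_CYCLES + worst),
               100.0 * RPC_CYCLES / (RPC_CYCLES + best));
        return 0;
    }
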
-
- 4 All Data Does Not Need to be Marshalled Through the Kernel
- Microkernel-based operating systems can preallocate buffers
- between client and server address spaces. This allows an
- operating system service to share address space with
- applications, just as it did when the service was resident in
- the kernel. Small to medium amounts of data can be transferred
- from one address space to another by depositing the data in the
- shared regions, rather than by sending it through the
- kernel in a message [Bershad et al. 91]. Large data segments
- (on the order of pages) can be passed using virtual memory
- primitives.
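
A minimal sketch of the co-mapped buffer technique appears below. It
uses POSIX mmap() over a shared file, and an invented "/tmp/ipc-region"
path, purely for illustration; Mach would establish the region with
the virtual memory primitives mentioned above, and the server side and
the accompanying notification message are not shown.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE 65536

    struct region_hdr {
        unsigned long len;              /* bytes valid after the header */
    };

    /* Both client and server map the same region once, at startup. */
    static void *map_region(const char *path)
    {
        void *p;
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return NULL;
        if (ftruncate(fd, REGION_SIZE) < 0) {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : p;
    }

    /* Client side: deposit a small request without pushing the bytes
       through the kernel; a short notification IPC would follow. */
    static int deposit(void *region, const void *buf, unsigned long len)
    {
        struct region_hdr *hdr = region;

        if (len > REGION_SIZE - sizeof *hdr)
            return -1;                  /* too large: remap pages instead */
        memcpy(hdr + 1, buf, len);
        hdr->len = len;
        return 0;
    }

    int main(void)
    {
        const char request[] = "small write request";
        void *region = map_region("/tmp/ipc-region");

        if (region == NULL)
            return 1;
        return deposit(region, request, sizeof request) == 0 ? 0 : 1;
    }
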
- Co-mapping of IPC data has been used in several places. At
- CMU, applications share memory with the Unix server to pass
- data in and out of the file system. We have recently applied
- this technique to the socket interface as well. On a
- DS5000/200, we have found that the mapped socket interface
- does not improve performance for small packets (fewer than
- 100 bytes), but for larger packets there is an improvement.
- For packets of 4KB, sends using the mapped socket interface
- are 15% faster than using the regular, non-mapped, interface.
- For packets of more than 4KB, the mapped interface avoids the
- cost of dynamically allocating and deallocating the region for
- out-of-line transmission, although for extremely large packets
- remapping, rather than copying, is more efficient.
- Co-mapping can be used effectively for non-Unix interfaces
- as well. For example, we have built a version of the X11
- window server which uses Mach IPC for communication between X
- clients and the server. The X library buffers call requests, but
- clients frequently flush the buffer, which makes the buffered
- data available to the server. When Unix stream sockets are
- used to implement the transport layer, the flush causes a
- socket write to occur. With sockets, multiple flushes may
- occur before the server collects any of the data, but all of
- the data can be collected in one receive, so transferring data
- through the kernel doesn't necessarily increase the number of
- context switches. When using Mach's IPC, which is
- message-oriented, every flush results in a message. Message
- boundaries are maintained, so every message must be collected
- separately by the X server, increasing the number of
- user-kernel boundary crossings and context switches relative
- to a socket-based implementation. We are modifying the X
- library to buffer requests in shared memory. Flushes transfer
- the data to the shared buffer, and only cause an IPC message
- to be sent if previous IPC messages have not yet been
- collected.
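
The sketch below illustrates the coalescing policy: flushed requests
are deposited in the co-mapped buffer, and a message is sent only if
the server has already collected the previous batch. The buffer
layout, the flag, and the printing send_notification() stub are
illustrative stand-ins, not the actual X library modification; real
code would also need atomic operations or a lock around the flag.

    #include <stdio.h>
    #include <string.h>

    struct shared_buf {
        volatile int  notification_pending; /* set by client, cleared
                                               by the server           */
        unsigned long used;                 /* bytes of buffered
                                               requests                */
        char          data[16384];
    };

    /* Stand-in for the short IPC (a Mach message send) that tells the
       server there is something in the buffer. */
    static void send_notification(void)
    {
        printf("IPC: notify server\n");
    }

    static void x_flush(struct shared_buf *b, const char *reqs,
                        unsigned long len)
    {
        if (len > sizeof b->data - b->used)    /* crude overflow handling */
            len = sizeof b->data - b->used;    /* for illustration only   */
        memcpy(b->data + b->used, reqs, len);
        b->used += len;

        if (!b->notification_pending) {        /* server has caught up,   */
            b->notification_pending = 1;       /* so one message suffices */
            send_notification();
        }
        /* Otherwise the pending notification already covers this data
           and no additional message (or context switch) is generated. */
    }

    int main(void)
    {
        static struct shared_buf buf;

        x_flush(&buf, "req1", 4);  /* first flush sends a notification   */
        x_flush(&buf, "req2", 4);  /* second is coalesced: no message    */
        return 0;
    }
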
-
- 5 All Services Do Not Need a Hardware Firewall
- The cost of accessing functions in another protection
- domain (whether in the kernel or in a server's address space)
- has provided motivation for microkernel-based systems to
- migrate what were once kernel functions into client address
- spaces. This is possible when unprotecting the functions has
- no security implications. For example, there's no reason to
- put the system clock in the kernel, let alone another address
- space -- on most architectures, the clock can be mapped
- directly into user address spaces. Similarly, network
- protocols can execute in the address spaces of the
- applications which are communicating rather than in the kernel
- or a special protocol server [Schroeder & Burrows 90, Maeda &
- Bershad 92]. Applications can send and receive packets
- directly through the network interface. If the network is
- assumed to be insecure (as is generally the case), then
- executing the protocols in a secure protection domain offers
- no additional integrity. Encryption, and not hardware
- protection modes, is necessary here.
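
As an illustration of the mapped-clock case, the sketch below maps a
hypothetical read-only page exported by the kernel and reads the time
with ordinary loads. The page layout and the "/dev/clockpage" name are
inventions for the example; the text describes the general technique,
not this interface.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct clock_page {
        volatile unsigned long seconds;
        volatile unsigned long microseconds;
    };

    static const volatile struct clock_page *clk;

    static int map_clock(void)
    {
        void *p;
        int fd = open("/dev/clockpage", O_RDONLY);  /* hypothetical */

        if (fd < 0)
            return -1;
        p = mmap(NULL, sizeof(struct clock_page), PROT_READ,
                 MAP_SHARED, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
            return -1;
        clk = p;
        return 0;
    }

    /* The user-level replacement for a gettimeofday()-style call.  A
       real implementation also needs a way (such as a sequence counter
       in the page) to read the two words consistently while the kernel
       updates them. */
    static unsigned long current_usecs(void)
    {
        return clk->seconds * 1000000UL + clk->microseconds;
    }

    int main(void)
    {
        if (map_clock() != 0)
            return 1;              /* no such device on this system */
        return (int)(current_usecs() & 0x7f);
    }
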
- Another technique for removing hardware firewalls is to
- migrate pieces of the operating system service into clients'
- address spaces. Mach's Unix emulation package uses this
- approach with its transparent emulation library, which is a
- shared library mapped into every Unix address space. Unix
- system calls are reflected out of the kernel into the caller's
- emulation library. There, the emulation library may simply
- forward the system call to the Unix server, or it may
- implement the call itself, if possible. For example, the
- emulation library uses a mapped file interface for
- communicating with the Unix server. Read and write system
- calls are intercepted by the emulation library and converted
- into loads and stores to the mapped memory which backs the
- file. In this way, binary compatibility with Unix is
- maintained, and cross-address space IPC is avoided. This
- approach has been generalized in the Mach multiserver
- project [Julin et al. 91], and has resulted in substantial
- IPC reductions. In benchmarks on that system, which are
- intended to be persuasive but not conclusive, client-side
- emulation permits two out of three system calls to be
- implemented without an RPC.
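
The sketch below gives a rough picture of that interception path. The
emul_file structure and the server_rpc_read() fallback are
illustrative, not the actual emulation library code; the point is that
when the file is co-mapped, a read() is satisfied entirely by loads
from the mapping, with no message to the Unix server and no kernel
crossing.

    #include <string.h>
    #include <sys/types.h>

    struct emul_file {
        char  *window;       /* file contents mapped by the Unix server */
        off_t  size;         /* length of the mapping                   */
        off_t  offset;       /* current file offset                     */
    };

    /* Stub standing in for the cross-address space RPC to the server. */
    static ssize_t server_rpc_read(struct emul_file *f, void *buf, size_t n)
    {
        (void) f; (void) buf; (void) n;
        return -1;
    }

    /* What the library does when it intercepts a read() system call. */
    static ssize_t emul_read(struct emul_file *f, void *buf, size_t n)
    {
        size_t avail;

        if (f->window == NULL)                  /* not mappable: do RPC */
            return server_rpc_read(f, buf, n);

        avail = (size_t)(f->size - f->offset);
        if (n > avail)
            n = avail;
        memcpy(buf, f->window + f->offset, n);  /* loads from the mapping */
        f->offset += (off_t) n;
        return (ssize_t) n;                     /* no kernel crossing */
    }

    int main(void)
    {
        char contents[] = "contents of a co-mapped file";
        struct emul_file f;
        char buf[8];

        f.window = contents;
        f.size = (off_t) (sizeof contents - 1);
        f.offset = 0;
        return emul_read(&f, buf, sizeof buf) == (ssize_t) sizeof buf
               ? 0 : 1;
    }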
-
- 6 Conclusions
- IPC performance has come a long way in the last ten years.
- We now understand how to build IPC mechanisms which are only a
- few tens of microseconds slower than system calls. While this
- may at first seem unacceptably high, an examination of the
- other issues in operating system performance reveals that the
- additional overhead is small compared to the services which
- are being accessed. Moreover, the growing mismatch between
- cache and memory speed is making the physical location of
- operating system code and data much more important than the
- software path by which it is accessed. Finally, as
- microkernel-based operating systems have matured, useful
- techniques which reduce the frequency of operating system
- interaction, and hence of IPC, have also been developed. For
- these reasons, the raw performance of IPC facilities is
- becoming a largely irrelevant metric by which to judge
- microkernel-based operating systems.
-
- Acknowledgements
- Jose Carlos Brustoloni, Rich Draves, Dan Julin, Mark
- Stevenson, and Bob Wheeler supplied most of the measurements
- presented in this paper. Conversations with these people, as
- well as with Bob Baron, Joe Barrera, Alessandro Forin, Michael
- Ginsberg, Chris Maeda and Dan Stodolsky contributed to this
- paper.
-
- References
- [Accetta et al. 86] Accetta, M. J., Baron, R. V., Bolosky, W.,
- Golub, D. B., Rashid, R. F., Tevanian, Jr., A., and
- Young, M. W. Mach: A New Kernel Foundation for UNIX
- Development. In Proceedings of the Summer 1986 USENIX
- Conference, pages 93--113, July 1986.
-
- [Agarwal et al. 88] Agarwal, A., Hennessy, J., and Horowitz,
- M. Cache Performance of Operating System and
- Multiprogramming Workloads. ACM Transactions on
- Computer Systems, 6(4):393--431, November 1988.
-
- [Anderson et al. 91] Anderson, T., Levy, H., Bershad, B., and
- Lazowska, E. The Interaction of Architecture and
- Operating System Design. In Proceedings of the Fourth
- Symposium on Architectural Support for Programming
- Languages and Operating Systems, pages 108--121,
- April 1991.
-
- [Bershad 90] Bershad, B. N. High Performance Cross-Address
- Space Communication. PhD dissertation, University of
- Washington, Department of Computer Science and
- Engineering, Seattle, WA 98195, June 1990.
-
- [Bershad et al. 90] Bershad, B. N., Anderson, T. E., Lazowska,
- E. D., and Levy, H. M. Lightweight Remote Procedure
- Call. ACM Transactions on Computer Systems,
- 8(1):37--55, February 1990. Also appeared in
- Proceedings of the 12th ACM Symposium on Operating
- Systems Principles, December 1989.
-
- [Bershad et al. 91] Bershad, B. N., Anderson, T. E., Lazowska,
- E. D., and Levy, H. M. User-Level Remote Procedure
- Call. ACM Transactions on Computer Systems,
- 9(2):175--198, May 1991.
-
- [Bershad et al. 92] Bershad, B. N., Forin, A., and Draves, R.
- Cache Effects for a Microkernel Operating System.
- Technical report, School of Computer Science,
- Carnegie Mellon University, 1992. In preparation.
-
- [Cheriton 84] Cheriton, D. R. The V Kernel: A Software Base
- for Distributed Systems. IEEE Software, 1(2):19--42,
- April 1984.
-
- [Cheriton et al. 90] Cheriton, D. R., Whitehead, G. R., and
- Sznyter, E. W. Binary Emulation of Unix using the V
- Kernel. In Summer 1990 USENIX Conference Proceedings,
- 1990.
-
- [Draves 90] Draves, R. P. A Revised IPC Interface. In
- Proceedings of the First Mach USENIX Workshop, pages
- 101--121, October 1990.
-
- [Draves et al. 91] Draves, R. P., Bershad, B. N., Rashid,
- R. F., and Dean, R. W. Using Continuations to
- Implement Thread Management and Communication in
- Operating Systems. In Proceedings of the 13th ACM
- Symposium on Operating Systems Principles, pages
- 122--136, October 1991.
-
- [Golub et al. 90] Golub, D., Dean, R., Forin, A., and Rashid,
- R. Unix as an Application Program. In Proceedings of
- the Summer 1990 USENIX Conference, pages 87--95, June
- 1990.
-
- [Julin et al. 91] Julin, D. P., Chew, J. J., Stevenson, J. M.,
- Guedes, P., Neves, P., and Roy, P. Generalized
- Emulation Services for Mach 3.0: Overview,
- Experiences and Current Status. In Proceedings of the
- 1991 USENIX Mach Workshop, November 1991.
-
- [Leffler et al. 89] Leffler, S., McKusick, M., Karels, M., and
- Quarterman, J. The Design and Implementation of the
- 4.3BSD UNIX Operating System. Addison-Wesley,
- Reading, MA, 1989.
-
- [Maeda & Bershad 92] Maeda, C. and Bershad, B. N. Networking
- Performance for Microkernels. In Proceedings of the
- Third Workshop on Workstation Operating Systems,
- April 1992.
-
- [Mogul & Borg 91] Mogul, J. and Borg, A. The Effect of Context
- Switches on Cache Performance. In Proceedings of the
- Fourth Symposium on Architectural Support for
- Programming Languages and Operating Systems, pages
- 75--84, April 1991.
-
- [Ousterhout 90] Ousterhout, J. K. Why Aren't Operating Systems
- Getting Faster As Fast As Hardware? In Proceedings of
- the Summer 1990 USENIX Conference, pages 247--256,
- June 1990.
-
- [Ousterhout et al. 88] Ousterhout, J., Cherenson, A., and
- Douglis, F. The Sprite Network Operating System.
- IEEE Computer Magazine, 21(2):23--36, February 1988.
-
- [Rashid et al. 91] Rashid, R. F., Malan, G., Golub, D., and
- Baron, R. DOS as a Mach 3.0 Application. In
- Proceedings of the 1991 USENIX Mach Workshop, pages
- 27--40, November 1991.
-
- [Rozier et al. 88] Rozier, M., Abrossimov, V., Armand, F.,
- Boule, I., Gien, M., Guillemont, M., Herrmann, F.,
- Leonard, P., Langlois, S., and Neuhauser, W. The
- Chorus Distributed Operating System. Computing
- Systems, 1(4), 1988.
-
- [Schroeder & Burrows 90] Schroeder, M. D. and Burrows, M.
- Performance of Firefly RPC. ACM Transactions on
- Computer Systems, 8(1):1--17, February 1990.
-
- [Wiecek 92] Wiecek, C. A Model and Prototype of VMS Using the
- Mach 3.0 Kernel. In Proceedings of the 1992 USENIX
- Microkernel Workshop, April 27--28 1992. This issue.
-