A 4.2BSD Interprocess Communication Primer

DRAFT of January 30, 1994

Samuel J. Leffler
Robert S. Fabry
William N. Joy

Computer Systems Research Group
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, California 94720
(415) 642-7780

ABSTRACT

This document provides an introduction to the interprocess communication facilities included in the 4.2BSD release of the VAX* UNIX** system. It discusses the overall model for interprocess communication and introduces the interprocess communication primitives which have been added to the system. The majority of the document considers the use of these primitives in developing applications. The reader is expected to be familiar with the C programming language, as all examples are written in C.

_________________________
* DEC and VAX are trademarks of Digital Equipment Corporation.
** UNIX is a Trademark of Bell Laboratories.

1. INTRODUCTION

One of the most important parts of 4.2BSD is the interprocess communication facilities. These facilities are the result of more than two years of discussion and research. The facilities provided in 4.2BSD incorporate many of the ideas from current research, while trying to maintain the UNIX philosophy of simplicity and conciseness. It is hoped that the interprocess communication facilities included in 4.2BSD will establish a standard for UNIX. From the response to the design, it appears many organizations carrying out work with UNIX are adopting it.

UNIX has previously been very weak in the area of interprocess communication. Prior to the 4.2BSD facilities, the only standard mechanism which allowed two processes to communicate was the pipe (the mpx files which were part of Version 7 were experimental). Unfortunately, pipes are very restrictive in that the two communicating processes must be related through a common ancestor.
Further, the semantics of pipes make them almost impossible to maintain in a distributed environment.

Earlier attempts at extending the ipc facilities of UNIX have met with mixed reaction. The majority of the problems have been related to the fact that these facilities have been tied to the UNIX file system, either through naming or implementation. Consequently, the ipc facilities provided in 4.2BSD have been designed as a totally independent subsystem. The 4.2BSD ipc allows processes to rendezvous in many ways. Processes may rendezvous through a UNIX file system-like name space (a space where all names are path names) as well as through a network name space. In fact, new name spaces may be added at a future time with only minor changes visible to users. Further, the communication facilities have been extended to include more than the simple byte stream provided by a pipe-like entity. These extensions have resulted in a completely new part of the system which users will need time to familiarize themselves with. It is likely that as more use is made of these facilities they will be refined; only time will tell.

The remainder of this document is organized in four sections. Section 2 introduces the new system calls and the basic model of communication. Section 3 describes some of the supporting library routines users may find useful in constructing distributed applications. Section 4 is concerned with the client/server model used in developing applications and includes examples of the two major types of servers. Section 5 delves into advanced topics which sophisticated users are likely to encounter when using the ipc facilities.

2. BASICS

The basic building block for communication is the socket. A socket is an endpoint of communication to which a name may be bound. Each socket in use has a type and one or more associated processes. Sockets exist within communication domains.
A communication domain is an abstraction introduced to bundle common properties of processes communicating through sockets. One such property is the scheme used to name sockets. For example, in the UNIX communication domain sockets are named with UNIX path names; e.g. a socket may be named ``/dev/foo''. Sockets normally exchange data only with sockets in the same domain (it may be possible to cross domain boundaries, but only if some translation process is performed). The 4.2BSD ipc supports two separate communication domains: the UNIX domain, and the Internet domain, which is used by processes that communicate using the DARPA standard communication protocols. The underlying communication facilities provided by these domains have a significant influence on the internal system implementation as well as the interface to socket facilities available to a user. An example of the latter is that a socket ``operating'' in the UNIX domain sees only a subset of the error conditions which are possible when operating in the Internet domain.

2.1. Socket types

Sockets are typed according to the communication properties visible to a user. Processes are presumed to communicate only between sockets of the same type, although there is nothing that prevents communication between sockets of different types should the underlying communication protocols support this.

Three types of sockets currently are available to a user. A stream socket provides for the bidirectional, reliable, sequenced, and unduplicated flow of data without record boundaries. Aside from the bidirectionality of data flow, a pair of connected stream sockets provides an interface nearly identical to that of pipes*.

A datagram socket supports bidirectional flow of data which is not promised to be sequenced, reliable, or unduplicated.
That is, a process receiving messages on a datagram socket may find messages duplicated, and, possibly, in an order different from the order in which they were sent. An important characteristic of a datagram socket is that record boundaries in data are preserved. Datagram sockets closely model the facilities found in many contemporary packet switched networks such as the Ethernet.

_________________________
* In the UNIX domain, in fact, the semantics are identical and, as one might expect, pipes have been implemented internally as simply a pair of connected stream sockets.

A raw socket provides users access to the underlying communication protocols which support socket abstractions. These sockets are normally datagram oriented, though their exact characteristics are dependent on the interface provided by the protocol. Raw sockets are not intended for the general user; they have been provided mainly for those interested in developing new communication protocols, or for gaining access to some of the more esoteric facilities of an existing protocol. The use of raw sockets is considered in section 5.

Two potential socket types which have interesting properties are the sequenced packet socket and the reliably delivered message socket. A sequenced packet socket is identical to a stream socket with the exception that record boundaries are preserved. This interface is very similar to that provided by the Xerox NS Sequenced Packet protocol. The reliably delivered message socket has similar properties to a datagram socket, but with reliable delivery. While these two socket types have been loosely defined, they are currently unimplemented in 4.2BSD. As such, in this document we will concern ourselves only with the three socket types for which support exists.

2.2.
Socket creation

To create a socket the socket system call is used:

     s = socket(domain, type, protocol);

This call requests that the system create a socket in the specified domain and of the specified type. A particular protocol may also be requested. If the protocol is left unspecified (a value of 0), the system will select an appropriate protocol from those protocols which comprise the communication domain and which may be used to support the requested socket type. The user is returned a descriptor (a small integer number) which may be used in later system calls which operate on sockets. The domain is specified as one of the manifest constants defined in the file <sys/socket.h>. For the UNIX domain the constant is AF_UNIX*; for the Internet domain AF_INET. The socket types are also defined in this file and one of SOCK_STREAM, SOCK_DGRAM, or SOCK_RAW must be specified. To create a stream socket in the Internet domain the following call might be used:

     s = socket(AF_INET, SOCK_STREAM, 0);

This call would result in a stream socket being created with the TCP protocol providing the underlying communication support. To create a datagram socket for on-machine use a sample call might be:

     s = socket(AF_UNIX, SOCK_DGRAM, 0);

_________________________
* The manifest constants are named AF_whatever as they indicate the ``address format'' to use in interpreting names.

To obtain a particular protocol one selects the protocol number, as defined within the communication domain. For the Internet domain the available protocols are defined in <netinet/in.h> or, better yet, one may use one of the library routines discussed in section 3, such as getprotobyname:

     #include <sys/types.h>
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netdb.h>
      ...
     pp = getprotobyname("tcp");
     s = socket(AF_INET, SOCK_STREAM, pp->p_proto);

There are several reasons a socket call may fail.
Aside from the rare occurrence of lack of memory (ENOBUFS), a socket request may fail due to a request for an unknown protocol (EPROTONOSUPPORT), or a request for a type of socket for which there is no supporting protocol (EPROTOTYPE).

2.3. Binding names

A socket is created without a name. Until a name is bound to a socket, processes have no way to reference it and, consequently, no messages may be received on it. The bind call is used to assign a name to a socket:

     bind(s, name, namelen);

The bound name is a variable length byte string which is interpreted by the supporting protocol(s). Its interpretation may vary from communication domain to communication domain (this is one of the properties which comprise the ``domain''). In the UNIX domain names are path names, while in the Internet domain names contain an Internet address and port number. If one wanted to bind the name ``/dev/foo'' to a UNIX domain socket, the following would be used:

     bind(s, "/dev/foo", sizeof ("/dev/foo") - 1);

(Note how the null byte in the name is not counted as part of the name.) In binding an Internet address things become more complicated. The actual call is simple,

     #include <sys/types.h>
     #include <netinet/in.h>
      ...
     struct sockaddr_in sin;
      ...
     bind(s, &sin, sizeof (sin));

but the selection of what to place in the address sin requires some discussion. We will come back to the problem of formulating Internet addresses in section 3 when the library routines used in name resolution are discussed.

2.4. Connection establishment

With a bound socket it is possible to rendezvous with an unrelated process. This operation is usually asymmetric, with one process a ``client'' and the other a ``server''. The client requests services from the server by initiating a ``connection'' to the server's socket. The server, when willing to offer its advertised services, passively ``listens'' on its socket.
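As a rough illustration of the socket and bind calls described in sections 2.2 and 2.3, the sketch below creates an Internet domain stream socket and binds a name to it. It is written against present-day POSIX headers rather than the 4.2BSD ones, and the helper name bound_stream_socket, the use of the loopback address, and the convention of passing port 0 to let the system choose a port are all assumptions made for the example, not part of the primer.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper (not from the primer): create a stream socket in
 * the Internet domain and bind it to the loopback address.  The port is
 * given in host byte order; 0 asks the system to pick one.
 * Returns the socket descriptor, or -1 on failure.
 */
int
bound_stream_socket(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);	/* 0: let the system pick TCP */
	if (s < 0)
		return (-1);

	memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* network byte order */
	sin.sin_port = htons(port);			/* network byte order */

	if (bind(s, (struct sockaddr *)&sin, sizeof (sin)) < 0) {
		close(s);
		return (-1);
	}
	return (s);
}
```

A caller might use s = bound_stream_socket(0) and then ask getsockname for the port the system selected.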
On the client side the connect call is used to initiate a connection. Using the UNIX domain, this might appear as,

     connect(s, "server-name", sizeof ("server-name"));

while in the Internet domain,

     struct sockaddr_in server;
     connect(s, &server, sizeof (server));

If the client process's socket is unbound at the time of the connect call, the system will automatically select and bind a name to the socket; cf. section 5.4. An error is returned when the connection was unsuccessful (any name automatically bound by the system, however, remains). Otherwise, the socket is associated with the server and data transfer may begin.

Many errors can be returned when a connection attempt fails. The most common are:

ETIMEDOUT
     After failing to establish a connection for a period of time, the system decided there was no point in retrying the connection attempt any more. This usually occurs because the destination host is down, or because problems in the network resulted in transmissions being lost.

ECONNREFUSED
     The host refused service for some reason. When connecting to a host running 4.2BSD this is usually due to a server process not being present at the requested name.

ENETDOWN or EHOSTDOWN
     These operational errors are returned based on status information delivered to the client host by the underlying communication services.

ENETUNREACH or EHOSTUNREACH
     These operational errors can occur either because the network or host is unknown (no route to the network or host is present), or because of status information returned by intermediate gateways or switching nodes. Many times the status returned is not sufficient to distinguish a network being down from a host being down. In these cases the system is conservative and indicates the entire network is unreachable.

For the server to receive a client's connection it must perform two steps after binding its socket.
The first is to indicate a willingness to listen for incoming connection requests:

     listen(s, 5);

The second parameter to the listen call specifies the maximum number of outstanding connections which may be queued awaiting acceptance by the server process. Should a connection be requested while the queue is full, the connection will not be refused, but rather the individual messages which comprise the request will be ignored. This gives a harried server time to make room in its pending connection queue while the client retries the connection request. Had the connection been returned with the ECONNREFUSED error, the client would be unable to tell if the server was up or not. As it is now it is still possible to get the ETIMEDOUT error back, though this is unlikely. The backlog figure supplied with the listen call is limited by the system to a maximum of 5 pending connections on any one queue. This avoids the problem of processes hogging system resources by setting an infinite backlog, then ignoring all connection requests.

With a socket marked as listening, a server may accept a connection:

     fromlen = sizeof (from);
     snew = accept(s, &from, &fromlen);

A new descriptor is returned on receipt of a connection (along with a new socket). If the server wishes to find out who its client is, it may supply a buffer for the client socket's name. The value-result parameter fromlen is initialized by the server to indicate how much space is associated with from, then modified on return to reflect the true size of the name. If the client's name is not of interest, the second parameter may be zero.

Accept normally blocks. That is, the call to accept will not return until a connection is available or the system call is interrupted by a signal to the process. Further, there is no way for a process to indicate it will accept connections from only a specific individual, or individuals.
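Because accept cannot filter callers itself, a server that cares who is calling has to inspect the name returned by accept and close connections it does not want. A minimal sketch of that idea, assuming present-day POSIX headers and an Internet domain socket, is given below; the helper name accept_only_from and its calling convention are hypothetical and chosen only for this example.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper (not from the primer): accept one connection on
 * the listening socket s, but keep it only if the peer's Internet
 * address equals `allowed' (given in network byte order).
 * Returns the new descriptor, or -1 if accept failed or the peer was
 * turned away.
 */
int
accept_only_from(int s, in_addr_t allowed)
{
	struct sockaddr_in from;
	socklen_t fromlen;
	int snew;

	fromlen = sizeof (from);	/* value-result: space available in from */
	snew = accept(s, (struct sockaddr *)&from, &fromlen);
	if (snew < 0)
		return (-1);

	/* The connection already exists; all we can do is hang up on it. */
	if (from.sin_family != AF_INET || from.sin_addr.s_addr != allowed) {
		close(snew);
		return (-1);
	}
	return (snew);
}
```

Note that the unwanted peer still sees its connect succeed before the server closes the connection; the check happens after the fact.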
It is up to the user process to consider who the connection is from and close down the connection if it does not wish to speak to the process. If the server process wants to accept connections on more than one socket, or not block on the accept call, there are alternatives; they will be considered in section 5.

2.5. Data transfer

With a connection established, data may begin to flow. To send and receive data there are a number of possible calls. With the peer entity at each end of a connection anchored, a user can send or receive a message without specifying the peer. As one might expect, in this case the normal read and write system calls are usable,

     write(s, buf, sizeof (buf));
     read(s, buf, sizeof (buf));

In addition to read and write, the new calls send and recv may be used:

     send(s, buf, sizeof (buf), flags);
     recv(s, buf, sizeof (buf), flags);

While send and recv are virtually identical to write and read, respectively, the extra flags argument is important. The flags may be specified as a non-zero value if one or more of the following is required:

     SOF_OOB          send/receive out of band data
     SOF_PREVIEW      look at data without reading
     SOF_DONTROUTE    send data without routing packets

Out of band data is a notion specific to stream sockets, and one which we will not immediately consider. The option to have data sent without routing applied to the outgoing packets is currently used only by the routing table management process, and is unlikely to be of interest to the casual user. The ability to preview data is, however, of interest. When SOF_PREVIEW is specified with a recv call, any data present is returned to the user, but treated as still ``unread''. That is, the next read or recv call applied to the socket will return the data previously previewed.

2.6.
Discarding sockets

Once a socket is no longer of interest, it may be discarded by applying a close to the descriptor,

     close(s);

If data is associated with a socket which promises reliable delivery (e.g. a stream socket) when a close takes place, the system will continue to attempt to transfer the data. However, after a fairly long period of time, if the data is still undelivered, it will be discarded. Should a user have no use for any pending data, it may perform a shutdown on the socket prior to closing it. This call is of the form:

     shutdown(s, how);

where how is 0 if the user is no longer interested in reading data, 1 if no more data will be sent, or 2 if no data is to be sent or received. Applying shutdown to a socket causes any data queued to be immediately discarded.

2.7. Connectionless sockets

To this point we have been concerned mostly with sockets which follow a connection oriented model. However, there is also support for connectionless interactions typical of the datagram facilities found in contemporary packet switched networks. A datagram socket provides a symmetric interface to data exchange. While processes are still likely to be client and server, there is no requirement for connection establishment. Instead, each message includes the destination address.

Datagram sockets are created as before, and each should have a name bound to it in order that the recipient of a message may identify the sender. To send data, the sendto primitive is used,

     sendto(s, buf, buflen, flags, &to, tolen);

The s, buf, buflen, and flags parameters are used as before. The to and tolen values are used to indicate the intended recipient of the message. When using an unreliable datagram interface, it is unlikely any errors will be reported to the sender.
Where information is present locally to recognize a message which may never be delivered (for instance when a network is unreachable), the call will return -1 and the global value errno will contain an error number.

To receive messages on an unconnected datagram socket, the recvfrom primitive is provided:

     recvfrom(s, buf, buflen, flags, &from, &fromlen);

Once again, the fromlen parameter is handled in a value-result fashion, initially containing the size of the from buffer.

In addition to the two calls mentioned above, datagram sockets may also use the connect call to associate a socket with a specific address. In this case, any data sent on the socket will automatically be addressed to the connected peer, and only data received from that peer will be delivered to the user. Only one connected address is permitted for each socket (i.e. no multi-casting). Connect requests on datagram sockets return immediately, as this simply results in the system recording the peer's address (as compared to a stream socket where a connect request initiates establishment of an end to end connection). Other, less important details of datagram sockets are described in section 5.

2.8. Input/Output multiplexing

One last facility often used in developing applications is the ability to multiplex i/o requests among multiple sockets and/or files. This is done using the select call:

     select(nfds, &readfds, &writefds, &exceptfds, &timeout);

Select takes as arguments three bit masks: one for the set of file descriptors from which the caller wishes to be able to read data, one for those descriptors to which data is to be written, and one for those on which exceptional conditions are pending. Bit masks are created by or-ing bits of the form ``1 << fd''. That is, a descriptor fd is selected if a 1 is present in the fd'th bit of the mask. The parameter nfds specifies the range of file descriptors (i.e.
one plus the value of the largest descriptor) specified in a mask.

A timeout value may be specified if the selection is not to last more than a predetermined period of time. If the time specified by timeout is 0, the selection takes the form of a poll, returning immediately. If the last parameter is a null pointer, the selection will block indefinitely*. Select normally returns the number of file descriptors selected. If the select call returns due to the timeout expiring, then a value of 0 is returned. If the call is interrupted by a signal, a value of -1 is returned along with the error number EINTR.

_________________________
* To be more specific, a return takes place only when a descriptor is selectable, or when a signal is received by the caller, interrupting the system call.

Select provides a synchronous multiplexing scheme. Asynchronous notification of output completion, input availability, and exceptional conditions is possible through use of the SIGIO and SIGURG signals described in section 5.

3. NETWORK LIBRARY ROUTINES

The discussion in section 2 indicated the possible need to locate and construct network addresses when using the interprocess communication facilities in a distributed environment. To aid in this task a number of routines have been added to the standard C run-time library. In this section we will consider the new routines provided to manipulate network addresses. While the 4.2BSD networking facilities support only the DARPA standard Internet protocols, these routines have been designed with flexibility in mind. As more communication protocols become available, we hope the same user interface will be maintained in accessing network-related address data bases. The only difference should be the values returned to the user.
Since these values are normally supplied by the system, users should not need to be directly aware of the communication protocol and/or naming conventions in use.

Locating a service on a remote host requires many levels of mapping before client and server may communicate. A service is assigned a name which is intended for human consumption; e.g. ``the login server on host monet''. This name, and the name of the peer host, must then be translated into network addresses which are not necessarily suitable for human consumption. Finally, the address must then be used in locating a physical location and route to the service. The specifics of these three mappings are likely to vary between network architectures. For instance, it is desirable for a network to not require hosts be named in such a way that their physical location is known by the client host. Instead, underlying services in the network may discover the actual location of the host at the time a client host wishes to communicate. This ability to have hosts named in a location independent manner may induce overhead in connection establishment, as a discovery process must take place, but allows a host to be physically mobile without requiring it to notify its clientele of its current location.

Standard routines are provided for: mapping host names to network addresses, network names to network numbers, protocol names to protocol numbers, and service names to port numbers and the appropriate protocol to use in communicating with the server process. The file <netdb.h> must be included when using any of these routines.

3.1.
Host names

A host name to address mapping is represented by the hostent structure:

     struct hostent {
          char *h_name;      /* official name of host */
          char **h_aliases;  /* alias list */
          int  h_addrtype;   /* host address type */
          int  h_length;     /* length of address */
          char *h_addr;      /* address */
     };

The official name of the host and its public aliases are returned, along with a variable length address and address type. The routine gethostbyname(3N) takes a host name and returns a hostent structure, while the routine gethostbyaddr(3N) maps host addresses into a hostent structure. It is possible for a host to have many addresses, all having the same name. Gethostbyname returns the first matching entry in the data base file /etc/hosts; if this is unsuitable, the lower level routine gethostent(3N) may be used. For example, to obtain a hostent structure for a host on a particular network the following routine might be used (for simplicity, only Internet addresses are considered):

     #include <sys/types.h>
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <netdb.h>
      ...
     struct hostent *
     gethostbynameandnet(name, net)
          char *name;
          int net;
     {
          register struct hostent *hp;
          register char **cp;

          sethostent(0);
          while ((hp = gethostent()) != NULL) {
               if (hp->h_addrtype != AF_INET)
                    continue;
               /* name must match the official name or one of the aliases */
               if (strcmp(name, hp->h_name)) {
                    for (cp = hp->h_aliases; cp && *cp != NULL; cp++)
                         if (strcmp(name, *cp) == 0)
                              goto found;
                    continue;
               }
          found:
               /* keep looking unless this address is on the wanted network */
               if (inet_netof(*(struct in_addr *)hp->h_addr) == net)
                    break;
          }
          endhostent();
          return (hp);
     }

(inet_netof(3N) is a standard routine which returns the network portion of an Internet address.)

3.2. Network names

As for host names, routines for mapping network names to numbers, and back, are provided. These routines return a netent structure:

     /*
      * Assumption here is that a network number
      * fits in 32 bits -- probably a poor one.
      */
     struct netent {
          char *n_name;      /* official name of net */
          char **n_aliases;  /* alias list */
          int  n_addrtype;   /* net address type */
          int  n_net;        /* network # */
     };

The routines getnetbyname(3N), getnetbyaddr(3N), and getnetent(3N) are the network counterparts to the host routines described above.

3.3. Protocol names

For protocols the protoent structure defines the protocol-name mapping used with the routines getprotobyname(3N), getprotobynumber(3N), and getprotoent(3N):

     struct protoent {
          char *p_name;      /* official protocol name */
          char **p_aliases;  /* alias list */
          int  p_proto;      /* protocol # */
     };

3.4. Service names

Information regarding services is a bit more complicated. A service is expected to reside at a specific ``port'' and employ a particular communication protocol. This view is consistent with the Internet domain, but inconsistent with other network architectures. Further, a service may reside on multiple ports or support multiple protocols. If either of these occurs, the higher level library routines will have to be bypassed in favor of homegrown routines similar in spirit to the ``gethostbynameandnet'' routine described above. A service mapping is described by the servent structure,

     struct servent {
          char *s_name;      /* official service name */
          char **s_aliases;  /* alias list */
          int  s_port;       /* port # */
          char *s_proto;     /* protocol to use */
     };

The routine getservbyname(3N) maps service names to a servent structure by specifying a service name and, optionally, a qualifying protocol. Thus the call

     sp = getservbyname("telnet", (char *)0);

returns the service specification for a telnet server using any protocol, while the call

     sp = getservbyname("telnet", "tcp");

returns only that telnet server which uses the TCP protocol. The routines getservbyport(3N) and getservent(3N) are also provided.
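A small sketch of how a program might wrap getservbyname is shown here, assuming present-day POSIX headers. The helper name service_port and its fallback argument are hypothetical, introduced only for this example; the point it illustrates is that s_port is stored in network byte order and must be converted before it is used as an ordinary integer.

```c
#include <stddef.h>
#include <netdb.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper (not from the primer): look up a service by name
 * and protocol and return its port number in host byte order.  If the
 * services data base has no such entry, return `fallback' instead.
 * proto may be a null pointer to accept any protocol.
 */
int
service_port(const char *name, const char *proto, int fallback)
{
	struct servent *sp;

	sp = getservbyname(name, proto);
	if (sp == NULL)
		return (fallback);
	/* s_port is kept in network byte order; convert for local use */
	return (ntohs((unsigned short)sp->s_port));
}
```

For instance, service_port("telnet", "tcp", 23) would use the data base entry when one is present and fall back to 23 otherwise.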
The getservbyport routine has an interface similar to that provided by getservbyname; an optional protocol name may be specified to qualify lookups.

3.5. Miscellaneous

With the support routines described above, an application program should rarely have to deal directly with addresses. This allows services to be developed as much as possible in a network independent fashion. It is clear, however, that purging all network dependencies is very difficult. So long as the user is required to supply network addresses when naming services and sockets there will always be some network dependency in a program. For example, the normal code included in client programs, such as the remote login program, is of the form shown in Figure 1. (This example will be considered in more detail in section 4.)

If we wanted to make the remote login program independent of the Internet protocols and addressing scheme we would be forced to add a layer of routines which masked the network dependent aspects from the mainstream login code. For the current facilities available in the system this does not appear to be worthwhile. Perhaps when the system is adapted to different network architectures the utilities will be reorganized more cleanly.

Aside from the address-related data base routines, there are several other routines available in the run-time library which are of interest to users. These are intended mostly to simplify manipulation of names and addresses. Table 1 summarizes the routines for manipulating variable length byte strings and handling byte swapping of network addresses and values.

     #include <sys/types.h>
     #include <sys/socket.h>
     #include <netinet/in.h>
     #include <stdio.h>
     #include <netdb.h>
      ...
     main(argc, argv)
          char *argv[];
     {
          struct sockaddr_in sin;
          struct servent *sp;
          struct hostent *hp;
          int s;
           ...
          sp = getservbyname("login", "tcp");
          if (sp == NULL) {
               fprintf(stderr, "rlogin: tcp/login: unknown service\n");
               exit(1);
          }
          hp = gethostbyname(argv[1]);
          if (hp == NULL) {
               fprintf(stderr, "rlogin: %s: unknown host\n", argv[1]);
               exit(2);
          }
          bzero((char *)&sin, sizeof (sin));
          bcopy(hp->h_addr, (char *)&sin.sin_addr, hp->h_length);
          sin.sin_family = hp->h_addrtype;
          sin.sin_port = sp->s_port;
          s = socket(AF_INET, SOCK_STREAM, 0);
          if (s < 0) {
               perror("rlogin: socket");
               exit(3);
          }
           ...
          if (connect(s, (char *)&sin, sizeof (sin)) < 0) {
               perror("rlogin: connect");
               exit(5);
          }
           ...
     }

               Figure 1. Remote login client code.

     Call               Synopsis
     ----------------   -------------------------------------------------------
     bcmp(s1, s2, n)    compare byte-strings; 0 if same, not 0 otherwise
     bcopy(s1, s2, n)   copy n bytes from s1 to s2
     bzero(base, n)     zero-fill n bytes starting at base
     htonl(val)         convert 32-bit quantity from host to network byte order
     htons(val)         convert 16-bit quantity from host to network byte order
     ntohl(val)         convert 32-bit quantity from network to host byte order
     ntohs(val)         convert 16-bit quantity from network to host byte order

               Table 1. C run-time routines.

The byte swapping routines are provided because the operating system expects addresses to be supplied in network order. On a VAX, or machine with similar architecture, this is usually reversed. Consequently, programs are sometimes required to byte swap quantities. The library routines which return network addresses provide them in network order so that they may simply be copied into the structures provided to the system. This implies users should encounter the byte swapping problem only when interpreting network addresses.
If an Internet port is to be printed out, for example, the following code would be required:

    printf("port number %d\n", ntohs(sp->s_port));

On machines whose byte order already matches network order (unlike the VAX), these routines are defined as null macros.

4. CLIENT/SERVER MODEL

The most commonly used paradigm in constructing distributed applications is the client/server model. In this scheme client applications request services from a server process. This implies an asymmetry in establishing communication between the client and server which has been examined in section 2. In this section we will look more closely at the interactions between client and server, and consider some of the problems in developing client and server applications.

Client and server require a well known set of conventions before service may be rendered (and accepted). This set of conventions comprises a protocol which must be implemented at both ends of a connection. Depending on the situation, the protocol may be symmetric or asymmetric. In a symmetric protocol, either side may play the master or slave roles. In an asymmetric protocol, one side is immutably recognized as the master, with the other the slave. An example of a symmetric protocol is the TELNET protocol used in the Internet for remote terminal emulation. An example of an asymmetric protocol is the Internet file transfer protocol, FTP. No matter whether the specific protocol used in obtaining a service is symmetric or asymmetric, when accessing a service there is a ``client process'' and a ``server process''. We will first consider the properties of server processes, then client processes.

A server process normally listens at a well known address for service requests. Alternative schemes which use a service server may be used to eliminate a flock of server processes clogging the system while remaining dormant most of the time.
The Xerox Courier protocol uses the latter scheme. When using Courier, a Courier client process contacts a Courier server at the remote host and identifies the service it requires. The Courier server process then creates the appropriate server process based on a data base and ``splices'' the client and server together, voiding its part in the transaction. This scheme is attractive in that the Courier server process may provide a single contact point for all services, as well as carrying out the initial steps in authentication. However, while this is an attractive possibility for standardizing access to services, it does introduce a certain amount of overhead due to the intermediate process involved. Implementations which provide this type of service within the system can minimize the cost of client/server rendezvous. The portal notion described in the ``4.2BSD System Manual'' embodies many of the ideas found in Courier, with the rendezvous mechanism implemented internal to the system.

4.1. Servers

In 4.2BSD most servers are accessed at well known Internet addresses or UNIX domain names. When a server is started at boot time it advertises its services by listening at a well known location. For example, the remote login server's main loop is of the form shown in Figure 2.

    main(argc, argv)
        int argc;
        char **argv;
    {
        int f;
        struct sockaddr_in sin, from;
        struct servent *sp;

        sp = getservbyname("login", "tcp");
        if (sp == NULL) {
            fprintf(stderr, "rlogind: tcp/login: unknown service\n");
            exit(1);
        }
        ...
    #ifndef DEBUG
        /* disassociate server from controlling terminal */
    #endif
        ...
        sin.sin_port = sp->s_port;
        ...
        f = socket(AF_INET, SOCK_STREAM, 0);
        ...
        if (bind(f, (caddr_t)&sin, sizeof (sin)) < 0) {
            ...
        }
        ...
        listen(f, 5);
        for (;;) {
            int g, len = sizeof (from);

            g = accept(f, &from, &len);
            if (g < 0) {
                if (errno != EINTR)
                    perror("rlogind: accept");
                continue;
            }
            if (fork() == 0) {
                close(f);
                doit(g, &from);
            }
            close(g);
        }
    }

Figure 2.
Remote login server.

The first step taken by the server is to look up its service definition:

    sp = getservbyname("login", "tcp");
    if (sp == NULL) {
        fprintf(stderr, "rlogind: tcp/login: unknown service\n");
        exit(1);
    }

This definition is used in later portions of the code to define the Internet port at which it listens for service requests (indicated by a connection).

Step two is to disassociate the server from the controlling terminal of its invoker. This is important as the server will likely not want to receive signals delivered to the process group of the controlling terminal.

Once a server has established a pristine environment, it creates a socket and begins accepting service requests. The bind call is required to ensure the server listens at its expected location. The main body of the loop is fairly simple:

    for (;;) {
        int g, len = sizeof (from);

        g = accept(f, &from, &len);
        if (g < 0) {
            if (errno != EINTR)
                perror("rlogind: accept");
            continue;
        }
        if (fork() == 0) {
            close(f);
            doit(g, &from);
        }
        close(g);
    }

An accept call blocks the server until a client requests service. This call could return a failure status if the call is interrupted by a signal such as SIGCHLD (to be discussed in section 5). Therefore, the return value from accept is checked to ensure a connection has actually been established. With a connection in hand, the server then forks a child process and invokes the main body of the remote login protocol processing. Note how the socket used by the parent for queueing connection requests is closed in the child, while the socket created as a result of the accept is closed in the parent. The address of the client is also handed to the doit routine because it is required in authenticating clients.

4.2.
Clients

The client side of the remote login service was shown earlier in Figure 1. One can see the separate, asymmetric roles of the client and server clearly in the code. The server is a passive entity, listening for client connections, while the client process is an active entity, initiating a connection when invoked.

Let us consider more closely the steps taken by the client remote login process. As in the server process, the first step is to locate the service definition for a remote login:

    sp = getservbyname("login", "tcp");
    if (sp == NULL) {
        fprintf(stderr, "rlogin: tcp/login: unknown service\n");
        exit(1);
    }

Next the destination host is looked up with a gethostbyname call:

    hp = gethostbyname(argv[1]);
    if (hp == NULL) {
        fprintf(stderr, "rlogin: %s: unknown host\n", argv[1]);
        exit(2);
    }

With this accomplished, all that is required is to establish a connection to the server at the requested host and start up the remote login protocol. The address buffer is cleared, then filled in with the Internet address of the foreign host and the port number at which the login process resides:

    bzero((char *)&sin, sizeof (sin));
    bcopy(hp->h_addr, (char *)&sin.sin_addr, hp->h_length);
    sin.sin_family = hp->h_addrtype;
    sin.sin_port = sp->s_port;

A socket is created, and a connection initiated.

    s = socket(hp->h_addrtype, SOCK_STREAM, 0);
    if (s < 0) {
        perror("rlogin: socket");
        exit(3);
    }
    ...
    if (connect(s, (char *)&sin, sizeof (sin)) < 0) {
        perror("rlogin: connect");
        exit(5);
    }

The details of the remote login protocol will not be considered here.

4.3. Connectionless servers

While connection-based services are the norm, some services are based on the use of datagram sockets. One, in particular, is the ``rwho'' service which provides users with status information for hosts connected to a local area network.
This service, while predicated on the ability to broadcast information to all hosts connected to a particular network, is of interest as an example usage of datagram sockets.

A user on any machine running the rwho server may find out the current status of a machine with the ruptime(1) program. The output generated is illustrated in Figure 3.

    arpa      up       9:45,   5 users, load 1.15, 1.39, 1.31
    cad       up    2+12:04,   8 users, load 4.67, 5.13, 4.59
    calder    up      10:10,   0 users, load 0.27, 0.15, 0.14
    dali      up    2+06:28,   9 users, load 1.04, 1.20, 1.65
    degas     up   25+09:48,   0 users, load 1.49, 1.43, 1.41
    ear       up    5+00:05,   0 users, load 1.51, 1.54, 1.56
    ernie     down     0:24
    esvax     down    17:04
    ingres    down     0:26
    kim       up    3+09:16,   8 users, load 2.03, 2.46, 3.11
    matisse   up    3+06:18,   0 users, load 0.03, 0.03, 0.05
    medea     up    3+09:39,   2 users, load 0.35, 0.37, 0.50
    merlin    down 19+15:37
    miro      up    1+07:20,   7 users, load 4.59, 3.28, 2.12
    monet     up    1+00:43,   2 users, load 0.22, 0.09, 0.07
    oz        down    16:09
    statvax   up    2+15:57,   3 users, load 1.52, 1.81, 1.86
    ucbvax    up       9:34,   2 users, load 6.08, 5.16, 3.28

Figure 3. ruptime output.

Status information for each host is periodically broadcast by rwho server processes on each machine. The same server process also receives the status information and uses it to update a database. This database is then interpreted to generate the status information for each host. Servers operate autonomously, coupled only by the local network and its broadcast capabilities.

The rwho server, in a simplified form, is pictured in Figure 4. There are two separate tasks performed by the server. The first task is to act as a receiver of status information broadcast by other hosts on the network. This job is carried out in the main loop of the program.
Packets received at the rwho port are interrogated to ensure they've been sent by another rwho server process, then are time stamped with their arrival time and used to update a file indicating the status of the host. When a host has not been heard from for an extended period of time, the database interpretation routines assume the host is down and indicate such on the status reports. This algorithm is prone to error as a server may be down while a host is actually up, but serves our current needs.

The second task performed by the server is to supply information regarding the status of its host. This involves periodically acquiring system status information, packaging it up in a message and broadcasting it on the local network for other rwho servers to hear. The supply function is triggered by a timer and runs off a signal. Locating the system status information is somewhat involved, but uninteresting. Deciding where to transmit the resultant packet does, however, indicate some problems with the current protocol.

Status information is broadcast on the local network. For networks which do not support the notion of broadcast another scheme must be used to simulate or replace broadcasting. One possibility is to enumerate the known neighbors (based on the status received). This, unfortunately, requires some bootstrapping information, as a server started up on a quiet network will have no known neighbors and thus never receive, or send, any status information. This is the identical problem faced by the routing table management process in propagating routing status information. The standard solution, unsatisfactory as it may be, is to inform one or more servers of known neighbors and request that they always communicate with these neighbors. If each server has at least one neighbor supplied to it, status information may then propagate through a neighbor to hosts which are (possibly) not direct neighbors.
If the server is able to support networks which provide a broadcast capability, as well as those which do not, then networks with an arbitrary topology may share status information*.

_________________________
* One must, however, be concerned about ``loops''. That is, if a host is connected to multiple networks, it will receive status information from itself. This can lead to an endless, wasteful, exchange of information.

    main()
    {
        ...
        sp = getservbyname("who", "udp");
        net = getnetbyname("localnet");
        sin.sin_addr = inet_makeaddr(INADDR_ANY, net);
        sin.sin_port = sp->s_port;
        ...
        s = socket(AF_INET, SOCK_DGRAM, 0);
        ...
        bind(s, &sin, sizeof (sin));
        ...
        sigset(SIGALRM, onalrm);
        onalrm();
        for (;;) {
            struct whod wd;
            int cc, whod, len = sizeof (from);

            cc = recvfrom(s, (char *)&wd, sizeof (struct whod), 0, &from, &len);
            if (cc <= 0) {
                if (cc < 0 && errno != EINTR)
                    perror("rwhod: recv");
                continue;
            }
            if (from.sin_port != sp->s_port) {
                fprintf(stderr, "rwhod: %d: bad from port\n", ntohs(from.sin_port));
                continue;
            }
            ...
            if (!verify(wd.wd_hostname)) {
                fprintf(stderr, "rwhod: malformed host name from %x\n", ntohl(from.sin_addr.s_addr));
                continue;
            }
            (void) sprintf(path, "%s/whod.%s", RWHODIR, wd.wd_hostname);
            whod = open(path, FWRONLY|FCREATE|FTRUNCATE, 0666);
            ...
            (void) time(&wd.wd_recvtime);
            (void) write(whod, (char *)&wd, cc);
            (void) close(whod);
        }
    }

Figure 4. rwho server.

The second problem with the current scheme is that the rwho process services only a single local network, and this network is found by reading a file. It is important that software operating in a distributed environment not have any site-dependent information compiled into it. This would require a separate copy of the server at each host and make maintenance a severe headache. 4.2BSD attempts to isolate host-specific information from applications by providing system calls which return the necessary information.
(An example of such a system call is the gethostname(2) call, which returns the host's ``official'' name.) Unfortunately, no straightforward mechanism currently exists for finding the collection of networks to which a host is directly connected. Thus the rwho server performs a lookup in a file to find its local network. A better, though still unsatisfactory, scheme used by the routing process is to interrogate the system data structures to locate those directly connected networks. A mechanism to acquire this information from the system would be a useful addition.

5. ADVANCED TOPICS

A number of facilities have yet to be discussed. For most users of the ipc the mechanisms already described will suffice in constructing distributed applications. However, others will find need to utilize some of the features which we consider in this section.

5.1. Out of band data

The stream socket abstraction includes the notion of ``out of band'' data. Out of band data is carried on a logically independent transmission channel associated with each pair of connected stream sockets. Out of band data is delivered to the user independently of normal data along with the SIGURG signal. In addition to the information passed, a logical mark is placed in the data stream to indicate the point at which the out of band data was sent. The remote login and remote shell applications use this facility to propagate signals between client and server processes. When a signal is expected to flush any pending output from the remote process(es), all data up to the mark in the data stream is discarded.

The stream abstraction defines that the out of band data facilities must support the reliable delivery of at least one out of band message at a time.
This message may contain at least one byte of data, and at least one message may be pending delivery to the user at any one time. For communications protocols which support only in-band signaling (i.e. the urgent data is delivered in sequence with the normal data) the system extracts the data from the normal data stream and stores it separately. This allows users to choose between receiving the urgent data in order and receiving it out of sequence without having to buffer all the intervening data.

To send an out of band message the SOF_OOB flag is supplied to a send or sendto call, while to receive out of band data SOF_OOB should be indicated when performing a recvfrom or recv call. To find out if the read pointer is currently pointing at the mark in the data stream, the SIOCATMARK ioctl is provided:

    ioctl(s, SIOCATMARK, &yes);

If yes is a 1 on return, the next read will return data after the mark. Otherwise (assuming out of band data has arrived), the next read will provide data sent by the client prior to transmission of the out of band signal. The routine used in the remote login process to flush output on receipt of an interrupt or quit signal is shown in Figure 5.

    oob()
    {
        int out = 1+1, mark;
        char waste[BUFSIZ];

        signal(SIGURG, oob);
        /* flush local terminal input and output */
        ioctl(1, TIOCFLUSH, (char *)&out);
        for (;;) {
            /* SIOCATMARK stores an int, so mark is declared as one */
            if (ioctl(rem, SIOCATMARK, &mark) < 0) {
                perror("ioctl");
                break;
            }
            if (mark)
                break;
            /* discard normal data up to the mark */
            (void) read(rem, waste, sizeof (waste));
        }
        /* pick up the single out of band byte */
        recv(rem, (char *)&mark, 1, SOF_OOB);
        ...
    }

Figure 5. Flushing terminal i/o on receipt of out of band data.

5.2. Signals and process groups

Due to the existence of the SIGURG and SIGIO signals each socket has an associated process group (just as is done for terminals).
This process group is initialized to the process group of its creator, but may be redefined at a later time with the SIOCSPGRP ioctl:

    ioctl(s, SIOCSPGRP, &pgrp);

A similar ioctl, SIOCGPGRP, is available for determining the current process group of a socket.

5.3. Pseudo terminals

Many programs will not function properly without a terminal for standard input and output. Since a socket is not a terminal, it is often necessary to have a process communicating over the network do so through a pseudo terminal. A pseudo terminal is actually a pair of devices, master and slave, which allow a process to serve as an active agent in communication between processes and users. Data written on the slave side of a pseudo terminal is supplied as input to a process reading from the master side. Data written on the master side is given to the slave as input. In this way, the process manipulating the master side of the pseudo terminal has control over the information read and written on the slave side.

The remote login server uses pseudo terminals for remote login sessions. A user logging in to a machine across the network is provided a shell with a slave pseudo terminal as standard input, output, and error. The server process then handles the communication between the programs invoked by the remote shell and the user's local client process. When a user sends an interrupt or quit signal to a process executing on a remote machine, the client login program traps the signal and sends an out of band message to the server process, which then uses the signal number, sent as the data value in the out of band message, to perform a killpg(2) on the appropriate process group.

5.4. Internet address binding

Binding addresses to sockets in the Internet domain can be fairly complex. Communicating processes are bound by an association.
An association is composed of local and foreign addresses, and local and foreign ports. Port numbers are allocated out of separate spaces, one for each Internet protocol. Associations are always unique. That is, there may never be duplicate <protocol, local address, local port, foreign address, foreign port> tuples.

The bind system call allows a process to specify half of an association, <local address, local port>, while the connect and accept primitives are used to complete a socket's association. Since the association is created in two steps the association uniqueness requirement indicated above could be violated unless care is taken. Further, it is unrealistic to expect user programs to always know proper values to use for the local address and local port since a host may reside on multiple networks and the set of allocated port numbers is not directly accessible to a user.

To simplify local address binding the notion of a ``wildcard'' address has been provided. When an address is specified as INADDR_ANY (a manifest constant defined in <netinet/in.h>), the system interprets the address as ``any valid address''. For example, to bind a specific port number to a socket, but leave the local address unspecified, the following code might be used:

    #include <sys/types.h>
    #include <netinet/in.h>
    ...
    struct sockaddr_in sin;
    ...
    s = socket(AF_INET, SOCK_STREAM, 0);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = INADDR_ANY;
    sin.sin_port = htons(MYPORT);    /* ports are supplied in network order */
    bind(s, (char *)&sin, sizeof (sin));

Sockets with wildcarded local addresses may receive messages directed to the specified port number, and addressed to any of the possible addresses assigned to a host. For example, if a host is on networks 46 and 10 and a socket is bound as above, and an accept call is then performed, the process will be able to accept connection requests which arrive either from network 46 or network 10.

In a similar fashion, a local port may be left unspecified (specified as zero), in which case the system will select an appropriate port number for it.
For example:

    sin.sin_addr.s_addr = MYADDRESS;
    sin.sin_port = 0;
    bind(s, (char *)&sin, sizeof (sin));

The system selects the port number based on two criteria. The first is that ports numbered 0 through 1023 are reserved for privileged users (i.e. the super user). The second is that the port number is not currently bound to some other socket. In order to find a free port number in the privileged range the following code is used by the remote shell server:

    struct sockaddr_in sin;
    ...
    lport = IPPORT_RESERVED - 1;
    sin.sin_addr.s_addr = INADDR_ANY;
    ...
    for (;;) {
        sin.sin_port = htons((u_short)lport);
        if (bind(s, (caddr_t)&sin, sizeof (sin)) >= 0)
            break;
        if (errno != EADDRINUSE && errno != EADDRNOTAVAIL) {
            perror("socket");
            break;
        }
        lport--;
        if (lport == IPPORT_RESERVED/2) {
            fprintf(stderr, "socket: All ports in use\n");
            break;
        }
    }

The restriction on allocating ports was imposed to allow processes executing in a ``secure'' environment to perform authentication based on the originating address and port number.

In certain cases the algorithm used by the system in selecting port numbers is unsuitable for an application. This is due to associations being created in a two step process. For example, the Internet file transfer protocol, FTP, specifies that data connections must always originate from the same local port. However, duplicate associations are avoided by connecting to different foreign ports. In this situation the system would disallow binding the same local address and port number to a socket if a previous data connection's socket were still around. To override the default port selection algorithm, an option call must be performed prior to address binding:

    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)0, 0);
    bind(s, (char *)&sin, sizeof (sin));

With the above call, local addresses may be bound which are already in use.
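On current systems the same option is normally enabled by passing a non-zero integer rather than a null pointer. The following is a minimal sketch under that assumption (the POSIX form of setsockopt, not the 4.2BSD draft interface shown above); bind_reusable and reuse_enabled are hypothetical helper names introduced only for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a stream socket, enable SO_REUSEADDR, and bind it to the
 * given port (host byte order; 0 lets the system choose) on any
 * local address.  Returns the descriptor, or -1 on failure.
 */
int bind_reusable(unsigned short port)
{
    struct sockaddr_in sin;
    int on = 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;
    /* The option must be set before the address is bound. */
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0) {
        close(s);
        return -1;
    }
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Report whether SO_REUSEADDR is set: 1 if so, 0 if not, -1 on error. */
int reuse_enabled(int s)
{
    int on = 0;
    socklen_t len = sizeof on;

    if (getsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

A caller would use the descriptor returned by bind_reusable exactly as it would any other bound socket, closing it when finished.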
Binding an address that is already in use does not violate the uniqueness requirement, as the system still checks at connect time to be sure any other sockets with the same local address and port do not have the same foreign address and port (if an association already exists, the error EADDRINUSE is returned).

Local address binding by the system is currently done somewhat haphazardly when a host is on multiple networks. Logically, one would expect the system to bind the local address associated with the network through which a peer was communicating. For instance, if the local host is connected to networks 46 and 10 and the foreign host is on network 32, and traffic from network 32 were arriving via network 10, the local address to be bound would be the host's address on network 10, not network 46. This, unfortunately, is not always the case. For reasons too complicated to discuss here, the local address bound may appear to be chosen at random. This property of local address binding will normally be invisible to users unless the foreign host does not understand how to reach the address selected*.

_________________________
* For example, if network 46 were unknown to the host on network 32, and the local address were bound to that located on network 46, then even though a route between the two hosts existed through network 10, a connection would fail.

5.5. Broadcasting and datagram sockets

By using a datagram socket it is possible to send broadcast packets on many networks supported by the system (the network itself must support the notion of broadcasting; the system provides no broadcast simulation in software). Broadcast messages can place a high load on a network since they force every host on the network to service them. Consequently, the ability to send broadcast packets has been limited to the super user.

To send a broadcast message, an Internet datagram socket should be created:

    s = socket(AF_INET, SOCK_DGRAM, 0);

and at least a port number should be bound to the socket:
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = INADDR_ANY;
    sin.sin_port = htons(MYPORT);    /* ports are supplied in network order */
    bind(s, (char *)&sin, sizeof (sin));

Then the message should be addressed as:

    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = INADDR_ANY;
    dst.sin_port = htons(DESTPORT);

and, finally, a sendto call may be used:

    sendto(s, buf, buflen, 0, &dst, sizeof (dst));

Received broadcast messages contain the sender's address and port (datagram sockets are anchored before a message is allowed to go out).

5.6. Signals

Two new signals have been added to the system which may be used in conjunction with the interprocess communication facilities. The SIGURG signal is associated with the existence of an ``urgent condition''. The SIGIO signal is used with ``interrupt driven i/o'' (not presently implemented). SIGURG is currently supplied to a process when out of band data is present at a socket. If multiple sockets have out of band data awaiting delivery, a select call may be used to determine those sockets with such data.

An old signal which is useful when constructing server processes is SIGCHLD. This signal is delivered to a process when any of its child processes have changed state. Normally servers use the signal to ``reap'' child processes after they exit. For example, the remote login server loop shown in Figure 2 may be augmented as follows:

    int reaper();
    ...
    sigset(SIGCHLD, reaper);
    listen(f, 10);
    for (;;) {
        int g, len = sizeof (from);

        g = accept(f, &from, &len);
        if (g < 0) {
            if (errno != EINTR)
                perror("rlogind: accept");
            continue;
        }
        ...
    }
    ...
    #include <sys/wait.h>

    reaper()
    {
        union wait status;

        /* collect every child that has already exited, without blocking */
        while (wait3(&status, WNOHANG, 0) > 0)
            ;
    }

If the parent server process fails to reap its children, a large number of ``zombie'' processes may be created.