4.2BSD Networking Implementation Notes
Revised July, 1983
Samuel J. Leffler, William N. Joy, Robert S. Fabry
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
(415) 642-7780
ABSTRACT
This report describes the internal structure
of the networking facilities developed for the
4.2BSD version of the UNIX* operating system for
the VAX-. These facilities are based on several
central abstractions which structure the external
(user) view of network communication as well as
the internal (system) implementation.
The report documents the internal structure
of the networking system. The ``4.2BSD System
Manual'' provides a description of the user inter-
face to the networking facilities.
_________________________
* UNIX is a trademark of Bell Laboratories.
- DEC, VAX, DECnet, and UNIBUS are trademarks of Digi-
tal Equipment Corporation.
Networking Implementation - i - Contents
TABLE OF CONTENTS
1. Introduction
2. Overview
3. Goals
4. Internal address representation
5. Memory management
6. Internal layering
.1. Socket layer
.1.1. Socket state
.1.2. Socket data queues
.1.3. Socket connection queueing
.2. Protocol layer(s)
.3. Network-interface layer
.3.1. UNIBUS interfaces
7. Socket/protocol interface
8. Protocol/protocol interface
.1. pr_output
.2. pr_input
.3. pr_ctlinput
.4. pr_ctloutput
9. Protocol/network-interface interface
.1. Packet transmission
.2. Packet reception
10. Gateways and routing issues
.1. Routing tables
.2. Routing table interface
.3. User level routing policies
11. Raw sockets
.1. Control blocks
.2. Input processing
.3. Output processing
12. Buffering and congestion control
.1. Memory management
.2. Protocol buffering policies
.3. Queue limiting
.4. Packet forwarding
Networking Implementation - ii - Contents
13. Out of band data
14. Trailer protocols
Acknowledgements
References
Networking Implementation - 1 - Introduction
1. Introduction
This report describes the internal structure of facili-
ties added to the 4.2BSD version of the UNIX operating sys-
tem for the VAX. The system facilities provide a uniform
user interface to networking within UNIX. In addition, the
implementation introduces a structure for network communica-
tions which may be used by system implementors in adding new
networking facilities. The internal structure is not visi-
ble to the user, rather it is intended to aid implementors
of communication protocols and network services by providing
a framework which promotes code sharing and minimizes imple-
mentation effort.
The reader is expected to be familiar with the C pro-
gramming language and system interface, as described in the
4.2BSD System Manual [Joy82a]. Basic understanding of net-
work communication concepts is assumed; where required any
additional ideas are introduced.
The remainder of this document provides a description
of the system internals, avoiding, when possible, those por-
tions which are utilized only by the interprocess communica-
tion facilities.
Networking Implementation - 2 - Overview
2. Overview
If we consider the International Standards
Organization's (ISO) Open System Interconnection (OSI) model
of network communication [ISO81] [Zimmermann80], the net-
working facilities described here correspond to a portion of
the session layer (layer 3) and all of the transport and
network layers (layers 2 and 1, respectively).
The network layer provides possibly imperfect data
transport services with minimal addressing structure.
Addressing at this level is normally host to host, with
implicit or explicit routing optionally supported by the
communicating agents.
At the transport layer the notions of reliable
transfer, data sequencing, flow control, and service
addressing are normally included. Reliability is usually
managed by explicit acknowledgement of data delivered.
Failure to acknowledge a transfer results in retransmission
of the data. Sequencing may be handled by tagging each mes-
sage handed to the network layer by a sequence number and
maintaining state at the endpoints of communication to util-
ize received sequence numbers in reordering data which
arrives out of order.
The session layer facilities may provide forms of
addressing which are mapped into formats required by the
transport layer, service authentication and client authenti-
cation, etc. Various systems also provide services such as
data encryption and address and protocol translation.
The following sections begin by describing some of the
common data structures and utility routines, then examine
the internal layering. The contents of each layer and its
interface are considered. Certain of the interfaces are
protocol implementation specific. For these cases examples
have been drawn from the Internet [Cerf78] protocol family.
Later sections cover routing issues, the design of the raw
socket interface and other miscellaneous topics.
Networking Implementation - 3 - Goals
3. Goals
The networking system was designed with the goal of
supporting multiple protocol families and addressing styles.
This required information to be ``hidden'' in common data
structures which could be manipulated by all the pieces of
the system, but which required interpretation only by the
protocols which ``controlled'' it. The system described
here attempts to minimize the use of shared data structures
to those kept by a suite of protocols (a protocol family),
and those used for rendezvous between ``synchronous'' and
``asynchronous'' portions of the system (e.g. queues of data
packets are filled at interrupt time and emptied based on
user requests).
A major goal of the system was to provide a framework
within which new protocols and hardware could be easily be
supported. To this end, a great deal of effort has been
extended to create utility routines which hide many of the
more complex and/or hardware dependent chores of networking.
Later sections describe the utility routines and the under-
lying data structures they manipulate.
Networking Implementation - 4 - Address representation
4. Internal address representation
Common to all portions of the system are two data
structures. These structures are used to represent
addresses and various data objects. Addresses, internally
are described by the sockaddr structure,
struct sockaddr {
short sa_family; /* data format identifier */
char sa_data[14]; /* address */
};
All addresses belong to one or more address families which
define their format and interpretation. The sa_family field
indicates which address family the address belongs to, the
sa_data field contains the actual data value. The size of
the data field, 14 bytes, was selected based on a study of
current address formats*.
_________________________
* Later versions of the system support variable length
addresses.
Networking Implementation - 5 - Memory management
5. Memory management
A single mechanism is used for data storage: memory
buffers, or mbuf's. An mbuf is a structure of the form:
struct mbuf {
struct mbuf *m_next; /* next buffer in chain */
u_long m_off; /* offset of data */
short m_len; /* amount of data in this mbuf */
short m_type; /* mbuf type (accounting) */
u_char m_dat[MLEN]; /* data storage */
struct mbuf *m_act; /* link in higher-level mbuf list */
};
The m_next field is used to chain mbufs together on linked
lists, while the m_act field allows lists of mbufs to be
accumulated. By convention, the mbufs common to a single
object (for example, a packet) are chained together with the
m_next field, while groups of objects are linked via the
m_act field (possibly when in a queue).
Each mbuf has a small data area for storing informa-
tion, m_dat. The m_len field indicates the amount of data,
while the m_off field is an offset to the beginning of the
data from the base of the mbuf. Thus, for example, the
macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form
#define mtod(x,t) ((t)((int)(x) + (x)->m_off))
(note the t parameter, a C type cast, is used to cast the
resultant pointer for proper assignment).
In addition to storing data directly in the mbuf's data
area, data of page size may be also be stored in a separate
area of memory. The mbuf utility routines maintain a pool
of pages for this purpose and manipulate a private page map
for such pages. The virtual addresses of these data pages
precede those of mbufs, so when pages of data are separated
from an mbuf, the mbuf data offset is a negative value. An
array of reference counts on pages is also maintained so
that copies of pages may be made without core to core copy-
ing (copies are created simply by duplicating the relevant
page table entries in the data page map and incrementing the
associated reference counts for the pages). Separate data
pages are currently used only when copying data from a user
process into the kernel, and when bringing data in at the
hardware level. Routines which manipulate mbufs are not
normally aware if data is stored directly in the mbuf data
array, or if it is kept in separate pages.
The following utility routines are available for mani-
pulating mbuf chains:
Networking Implementation - 6 - Memory management
m = m_copy(m0, off, len);
The m_copy routine create a copy of all, or part, of a
list of the mbufs in m0. Len bytes of data, starting
off bytes from the front of the chain, are copied.
Where possible, reference counts on pages are used
instead of core to core copies. The original mbuf
chain must have at least off + len bytes of data. If
len is specified as M_COPYALL, all the data present,
offset as before, is copied.
m_cat(m, n);
The mbuf chain, n, is appended to the end of m. Where
possible, compaction is performed.
m_adj(m, diff);
The mbuf chain, m is adjusted in size by diff bytes.
If diff is non-negative, diff bytes are shaved off the
front of the mbuf chain. If diff is negative, the
alteration is performed from back to front. No space
is reclaimed in this operation, alterations are accom-
plished by changing the m_len and m_off fields of
mbufs.
m = m_pullup(m0, size);
After a successful call to m_pullup, the mbuf at the
head of the returned list, m, is guaranteed to have at
least size bytes of data in contiguous memory (allowing
access via a pointer, obtained using the mtod macro).
If the original data was less than size bytes long, len
was greater than the size of an mbuf data area (112
bytes), or required resources were unavailable, m is 0
and the original mbuf chain is deallocated.
This routine is particularly useful when verifying
packet header lengths on reception. For example, if a
packet is received and only 8 of the necessary 16 bytes
required for a valid packet header are present at the
head of the list of mbufs representing the packet, the
remaining 8 bytes may be ``pulled up'' with a single
m_pullup call. If the call fails the invalid packet
will have been discarded.
By insuring mbufs always reside on 128 byte boundaries
it is possible to always locate the mbuf associated with a
data area by masking off the low bits of the virtual
address. This allows modules to store data structures in
mbufs and pass them around without concern for locating the
original mbuf when it comes time to free the structure. The
dtom macro is used to convert a pointer into an mbuf's data
area to a pointer to the mbuf,
#define dtom(x) ((struct mbuf *)((int)x & ~(MSIZE-1)))
Networking Implementation - 7 - Memory management
Mbufs are used for dynamically allocated data struc-
tures such as sockets, as well as memory allocated for pack-
ets. Statistics are maintained on mbuf usage and can be
viewed by users using the netstat(1) program.
Networking Implementation - 8 - Internal layering
6. Internal layering
The internal structure of the network system is divided
into three layers. These layers correspond to the services
provided by the socket abstraction, those provided by the
communication protocols, and those provided by the hardware
interfaces. The communication protocols are normally lay-
ered into two or more individual cooperating layers, though
they are collectively viewed in the system as one layer pro-
viding services supportive of the appropriate socket
abstraction.
The following sections describe the properties of each
layer in the system and the interfaces each must conform to.
6.1. Socket layer
The socket layer deals with the interprocess communica-
tions facilities provided by the system. A socket is a
bidirectional endpoint of communication which is ``typed''
by the semantics of communication it supports. The system
calls described in the 4.2BSD System Manual are used to
manipulate sockets.
A socket consists of the following data structure:
struct socket {
short so_type; /* generic type */
short so_options; /* from socket call */
short so_linger; /* time to linger while closing */
short so_state; /* internal state flags */
caddr_t so_pcb; /* protocol control block */
struct protosw *so_proto; /* protocol handle */
struct socket *so_head; /* back pointer to accept socket */
struct socket *so_q0; /* queue of partial connections */
short so_q0len; /* partials on so_q0 */
struct socket *so_q; /* queue of incoming connections */
short so_qlen; /* number of connections on so_q */
short so_qlimit; /* max number queued connections */
struct sockbuf so_snd; /* send queue */
struct sockbuf so_rcv; /* receive queue */
short so_timeo; /* connection timeout */
u_short so_error; /* error affecting connection */
short so_oobmark; /* chars to oob mark */
short so_pgrp; /* pgrp for signals */
};
Each socket contains two data queues, so_rcv and
so_snd, and a pointer to routines which provide supporting
services. The type of the socket, so_type is defined at
socket creation time and used in selecting those services
which are appropriate to support it. The supporting proto-
col is selected at socket creation time and recorded in the
Networking Implementation - 9 - Internal layering
socket data structure for later use. Protocols are defined
by a table of procedures, the protosw structure, which will
be described in detail later. A pointer to a protocol
specific data structure, the ``protocol control block'' is
also present in the socket structure. Protocols control
this data structure and it normally includes a back pointer
to the parent socket structure(s) to allow easy lookup when
returning information to a user (for example, placing an
error number in the so_error field). The other entries in
the socket structure are used in queueing connection
requests, validating user requests, storing socket charac-
teristics (e.g. options supplied at the time a socket is
created), and maintaining a socket's state.
Processes ``rendezvous at a socket'' in many instances.
For instance, when a process wishes to extract data from a
socket's receive queue and it is empty, or lacks sufficient
data to satisfy the request, the process blocks, supplying
the address of the receive queue as an ``wait channel' to be
used in notification. When data arrives for the process and
is placed in the socket's queue, the blocked process is
identified by the fact it is waiting ``on the queue''.
6.1.1. Socket state
A socket's state is defined from the following:
#define SS_NOFDREF 0x001 /* no file table ref any more */
#define SS_ISCONNECTED 0x002 /* socket connected to a peer */
#define SS_ISCONNECTING 0x004 /* in process of connecting to peer */
#define SS_ISDISCONNECTING 0x008/* in process of disconnecting */
#define SS_CANTSENDMORE 0x010 /* can't send more data to peer */
#define SS_CANTRCVMORE 0x020 /* can't receive more data from peer */
#define SS_CONNAWAITING 0x040 /* connections awaiting acceptance */
#define SS_RCVATMARK 0x080 /* at mark on input */
#define SS_PRIV 0x100 /* privileged */
#define SS_NBIO 0x200 /* non-blocking ops */
#define SS_ASYNC 0x400 /* async i/o notify */
The state of a socket is manipulated both by the proto-
cols and the user (through system calls). When a socket is
created the state is defined based on the type of
input/output the user wishes to perform. ``Non-blocking''
I/O implies a process should never be blocked to await
resources. Instead, any call which would block returns
prematurely with the error EWOULDBLOCK (the service request
may be partially fulfilled, e.g. a request for more data
than is present).
If a process requested ``asynchronous'' notification of
events related to the socket the SIGIO signal is posted to
the process. An event is a change in the socket's state,
Networking Implementation - 10 - Internal layering
examples of such occurances are: space becoming available in
the send queue, new data available in the receive queue,
connection establishment or disestablishment, etc.
A socket may be marked ``priviledged'' if it was
created by the super-user. Only priviledged sockets may
send broadcast packets, or bind addresses in priviledged
portions of an address space.
6.1.2. Socket data queues
A socket's data queue contains a pointer to the data
stored in the queue and other entries related to the manage-
ment of the data. The following structure defines a data
queue:
struct sockbuf {
short sb_cc; /* actual chars in buffer */
short sb_hiwat; /* max actual char count */
short sb_mbcnt; /* chars of mbufs used */
short sb_mbmax; /* max chars of mbufs to use */
short sb_lowat; /* low water mark */
short sb_timeo; /* timeout */
struct mbuf *sb_mb; /* the mbuf chain */
struct proc *sb_sel; /* process selecting read/write */
short sb_flags; /* flags, see below */
};
Data is stored in a queue as a chain of mbufs. The
actual count of characters as well as high and low water
marks are used by the protocols in controlling the flow of
data. The socket routines cooperate in implementing the
flow control policy by blocking a process when it requests
to send data and the high water mark has been reached, or
when it requests to receive data and less than the low water
mark is present (assuming non-blocking I/O has not been
specified).
When a socket is created, the supporting protocol
``reserves'' space for the send and receive queues of the
socket. The actual storage associated with a socket queue
may fluctuate during a socket's lifetime, but is assumed
this reservation will always allow a protocol to acquire
enough memory to satisfy the high water marks.
The timeout and select values are manipulated by the
socket routines in implementing various portions of the
interprocess communications facilities and will not be
described here.
A socket queue has a number of flags used in synchron-
izing access to the data and in acquiring resources;
Networking Implementation - 11 - Internal layering
#define SB_LOCK 0x01 /* lock on data queue (so_rcv only) */
#define SB_WANT 0x02 /* someone is waiting to lock */
#define SB_WAIT 0x04 /* someone is waiting for data/space */
#define SB_SEL 0x08 /* buffer is selected */
#define SB_COLL 0x10 /* collision selecting */
The last two flags are manipulated by the system in imple-
menting the select mechanism.
6.1.3. Socket connection queueing
In dealing with connection oriented sockets (e.g.
SOCK_STREAM) the two sides are considered distinct. One
side is termed active, and generates connection requests.
The other side is called passive and accepts connection
requests.
From the passive side, a socket is created with the
option SO_ACCEPTCONN specified, creating two queues of sock-
ets: so_q0 for connections in progress and so_q for connec-
tions already made and awaiting user acceptance. As a pro-
tocol is preparing incoming connections, it creates a socket
structure queued on so_q0 by calling the routine
sonewconn(). When the connection is established, the socket
structure is then transfered to so_q, making it available
for an accept.
If an SO_ACCEPTCONN socket is closed with sockets on
either so_q0 or so_q, these sockets are dropped.
6.2. Protocol layer(s)
Protocols are described by a set of entry points and
certain socket visible characteristics, some of which are
used in deciding which socket type(s) they may support.
An entry in the ``protocol switch'' table exists for
each protocol module configured into the system. It has the
following form:
Networking Implementation - 12 - Internal layering
struct protosw {
short pr_type; /* socket type used for */
short pr_family; /* protocol family */
short pr_protocol; /* protocol number */
short pr_flags; /* socket visible attributes */
/* protocol-protocol hooks */
int (*pr_input)(); /* input to protocol (from below) */
int (*pr_output)(); /* output to protocol (from above) */
int (*pr_ctlinput)(); /* control input (from below) */
int (*pr_ctloutput)(); /* control output (from above) */
/* user-protocol hook */
int (*pr_usrreq)(); /* user request */
/* utility hooks */
int (*pr_init)(); /* initialization routine */
int (*pr_fasttimo)(); /* fast timeout (200ms) */
int (*pr_slowtimo)(); /* slow timeout (500ms) */
int (*pr_drain)(); /* flush any excess space possible */
};
A protocol is called through the pr_init entry before
any other. Thereafter it is called every 200 milliseconds
through the pr_fasttimo entry and every 500 milliseconds
through the pr_slowtimo for timer based actions. The system
will call the pr_drain entry if it is low on space and this
should throw away any non-critical data.
Protocols pass data between themselves as chains of
mbufs using the pr_input and pr_output routines. Pr_input
passes data up (towards the user) and pr_output passes it
down (towards the network); control information passes up
and down on pr_ctlinput and pr_ctloutput. The protocol is
responsible for the space occupied by any the arguments to
these entries and must dispose of it.
The pr_userreq routine interfaces protocols to the
socket code and is described below.
The pr_flags field is constructed from the following
values:
#define PR_ATOMIC 0x01 /* exchange atomic messages only */
#define PR_ADDR 0x02 /* addresses given with messages */
#define PR_CONNREQUIRED 0x04 /* connection required by protocol */
#define PR_WANTRCVD 0x08 /* want PRU_RCVD calls */
#define PR_RIGHTS 0x10 /* passes capabilities */
Protocols which are connection-based specify the
PR_CONNREQUIRED flag so that the socket routines will never
attempt to send data before a connection has been esta-
blished. If the PR_WANTRCVD flag is set, the socket rou-
tines will notfiy the protocol when the user has removed
data from the socket's receive queue. This allows the
Networking Implementation - 13 - Internal layering
protocol to implement acknowledgement on user receipt, and
also update windowing information based on the amount of
space available in the receive queue. The PR_ADDR field
indicates any data placed in the socket's receive queue will
be preceded by the address of the sender. The PR_ATOMIC
flag specifies each user request to send data must be per-
formed in a single protocol send request; it is the
protocol's responsibility to maintain record boundaries on
data to be sent. The PR_RIGHTS flag indicates the protocol
supports the passing of capabilities; this is currently
used only the protocols in the UNIX protocol family.
When a socket is created, the socket routines scan the
protocol table looking for an appropriate protocol to sup-
port the type of socket being created. The pr_type field
contains one of the possible socket types (e.g.
SOCK_STREAM), while the pr_family field indicates which pro-
tocol family the protocol belongs to. The pr_protocol field
contains the protocol number of the protocol, normally a
well known value.
6.3. Network-interface layer
Each network-interface configured into a system defines
a path through which packets may be sent and received. Nor-
mally a hardware device is associated with this interface,
though there is no requirement for this (for example, all
systems have a software ``loopback'' interface used for
debugging and performance analysis). In addition to manipu-
lating the hardware device, an interface module is responsi-
ble for encapsulation and deencapsulation of any low level
header information required to deliver a message to it's
destination. The selection of which interface to use in
delivering packets is a routing decision carried out at a
higher level than the network-interface layer. Each inter-
face normally identifies itself at boot time to the routing
module so that it may be selected for packet delivery.
An interface is defined by the following structure,
Networking Implementation - 14 - Internal layering
struct ifnet {
char *if_name; /* name, e.g. ``en'' or ``lo'' */
short if_unit; /* sub-unit for lower level driver */
short if_mtu; /* maximum transmission unit */
int if_net; /* network number of interface */
short if_flags; /* up/down, broadcast, etc. */
short if_timer; /* time 'til if_watchdog called */
int if_host[2]; /* local net host number */
struct sockaddr if_addr; /* address of interface */
union {
struct sockaddr ifu_broadaddr;
struct sockaddr ifu_dstaddr;
} if_ifu;
struct ifqueue if_snd; /* output queue */
int (*if_init)(); /* init routine */
int (*if_output)(); /* output routine */
int (*if_ioctl)(); /* ioctl routine */
int (*if_reset)(); /* bus reset routine */
int (*if_watchdog)(); /* timer routine */
int if_ipackets; /* packets received on interface */
int if_ierrors; /* input errors on interface */
int if_opackets; /* packets sent on interface */
int if_oerrors; /* output errors on interface */
int if_collisions; /* collisions on csma interfaces */
struct ifnet *if_next;
};
Each interface has a send queue and routines used for
initialization, if_init, and output, if_output. If the
interface resides on a system bus, the routine if_reset will
be called after a bus reset has been performed. An interface
may also specify a timer routine, if_watchdog, which should
be called every if_timer seconds (if non-zero).
The state of an interface and certain characteristics
are stored in the if_flags field. The following values are
possible:
#define IFF_UP 0x1 /* interface is up */
#define IFF_BROADCAST 0x2 /* broadcast address valid */
#define IFF_DEBUG 0x4 /* turn on debugging */
#define IFF_ROUTE 0x8 /* routing entry installed */
#define IFF_POINTOPOINT 0x10 /* interface is point-to-point link */
#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */
#define IFF_RUNNING 0x40 /* resources allocated */
#define IFF_NOARP 0x80 /* no address resolution protocol */
If the interface is connected to a network which supports
transmission of broadcast packets, the IFF_BROADCAST flag
will be set and the if_broadaddr field will contain the
address to be used in sending or accepting a broadcast
packet. If the interface is associated with a point to
Networking Implementation - 15 - Internal layering
point hardware link (for example, a DEC DMR-11), the
IFF_POINTOPOINT flag will be set and if_dstaddr will contain
the address of the host on the other side of the connection.
These addresses and the local address of the interface,
if_addr, are used in filtering incoming packets. The inter-
face sets IFF_RUNNING after it has allocated system
resources and posted an initial read on the device it
manages. This state bit is used to avoid multiple alloca-
tion requests when an interface's address is changed. The
IFF_NOTRAILERS flag indicates the interface should refrain
from using a trailer encapsulation on outgoing packets;
trailer protocols are described in section 14. The
IFF_NOARP flag indicates the interface should not use an
``address resolution protocol'' in mapping internetwork
addresses to local network addresses.
The information stored in an ifnet structure for point
to point communication devices is not currently used by the
system internally. Rather, it is used by the user level
routing process in determining host network connections and
in initially devising routes (refer to chapter 10 for more
information).
Various statistics are also stored in the interface
structure. These may be viewed by users using the
netstat(1) program.
The interface address and flags may be set with the
SIOCSIFADDR and SIOCSIFFLAGS ioctls. SIOCSIFADDR is used to
initially define each interface's address; SIOGSIFFLAGS can
be used to mark an interface down and perform site-specific
configuration.
6.3.1. UNIBUS interfaces
All hardware related interfaces currently reside on the
UNIBUS. Consequently a common set of utility routines for
dealing with the UNIBUS has been developed. Each UNIBUS
interface utilizes a structure of the following form:
Networking Implementation - 16 - Internal layering
struct ifuba {
short ifu_uban; /* uba number */
short ifu_hlen; /* local net header length */
struct uba_regs *ifu_uba; /* uba regs, in vm */
struct ifrw {
caddr_t ifrw_addr; /* virt addr of header */
int ifrw_bdp; /* unibus bdp */
int ifrw_info; /* value from ubaalloc */
int ifrw_proto;/* map register prototype */
struct pte *ifrw_mr;/* base of map registers */
} ifu_r, ifu_w;
struct pte ifu_wmap[IF_MAXNUBAMR];/* base pages for output */
short ifu_xswapd; /* mask of clusters swapped */
short ifu_flags; /* used during uballoc's */
struct mbuf *ifu_xtofree; /* pages being dma'd out */
};
The if_uba structure describes UNIBUS resources held by
an interface. IF_NUBAMR map registers are held for datagram
data, starting at ifr_mr. UNIBUS map register ifr_mr[-1]
maps the local network header ending on a page boundary.
UNIBUS data paths are reserved for read and for write, given
by ifr_bdp. The prototype of the map registers for read and
for write is saved in ifr_proto.
When write transfers are not full pages on page boun-
daries the data is just copied into the pages mapped on the
UNIBUS and the transfer is started. If a write transfer is
of a (1024 byte) page size and on a page boundary, UNIBUS
page table entries are swapped to reference the pages, and
then the initial pages are remapped from ifu_wmap when the
transfer completes.
When read transfers give whole pages of data to be
input, page frames are allocated from a network page list
and traded with the pages already containing the data, map-
ping the allocated pages to replace the input pages for the
next UNIBUS data input.
The following utility routines are available for use in
writing network interface drivers, all use the ifuba struc-
ture described above.
if_ubainit(ifu, uban, hlen, nmr);
if_ubainit allocates resources on UNIBUS adaptor uban
and stores the resultant information in the ifuba
structure pointed to by ifu. It is called only at boot
time or after a UNIBUS reset. Two data paths (buffered
or unbuffered, depending on the ifu_flags field) are
allocated, one for reading and one for writing. The
nmr parameter indicates the number of UNIBUS mapping
registers required to map a maximal sized packet onto
Networking Implementation - 17 - Internal layering
the UNIBUS, while hlen specifies the size of a local
network header, if any, which should be mapped
separately from the data (see the description of
trailer protocols in chapter 14). Sufficient UNIBUS
mapping registers and pages of memory are allocated to
initialize the input data path for an initial read.
For the output data path, mapping registers and pages
of memory are also allocated and mapped onto the
UNIBUS. The pages associated with the output data path
are held in reserve in the event a write requires copy-
ing non-page-aligned data (see if_wubaput below). If
if_ubainit is called with resources already allocated,
they will be used instead of allocating new ones (this
normally occurs after a UNIBUS reset). A 1 is returned
when allocation and initialization is successful, 0
otherwise.
m = if_rubaget(ifu, totlen, off0);
if_rubaget pulls read data off an interface. totlen
specifies the length of data to be obtained, not count-
ing the local network header. If off0 is non-zero, it
indicates a byte offset to a trailing local network
header which should be copied into a separate mbuf and
prepended to the front of the resultant mbuf chain.
When page sized units of data are present and are
page-aligned, the previously mapped data pages are
remapped into the mbufs and swapped with fresh pages;
thus avoiding any copying. A 0 return value indicates
a failure to allocate resources.
if_wubaput(ifu, m);
if_wubaput maps a chain of mbufs onto a network inter-
face in preparation for output. The chain includes any
local network header, which is copied so that it
resides in the mapped and aligned I/O space. Any other
mbufs which contained non page sized data portions are
also copied to the I/O space. Pages mapped from a pre-
vious output operation (no longer needed) are unmapped
and returned to the network page pool.
Networking Implementation - 18 - Socket/protocol interface
7. Socket/protocol interface
The interface between the socket routines and the com-
munication protocols is through the pr_usrreq routine
defined in the protocol switch table. The following
requests to a protocol module are possible:
#define PRU_ATTACH 0 /* attach protocol */
#define PRU_DETACH 1 /* detach protocol */
#define PRU_BIND 2 /* bind socket to address */
#define PRU_LISTEN 3 /* listen for connection */
#define PRU_CONNECT 4 /* establish connection to peer */
#define PRU_ACCEPT 5 /* accept connection from peer */
#define PRU_DISCONNECT 6 /* disconnect from peer */
#define PRU_SHUTDOWN 7 /* won't send any more data */
#define PRU_RCVD 8 /* have taken data; more room now */
#define PRU_SEND 9 /* send this data */
#define PRU_ABORT 10 /* abort (fast DISCONNECT, DETATCH) */
#define PRU_CONTROL 11 /* control operations on protocol */
#define PRU_SENSE 12 /* return status into m */
#define PRU_RCVOOB 13 /* retrieve out of band data */
#define PRU_SENDOOB 14 /* send out of band data */
#define PRU_SOCKADDR 15 /* fetch socket's address */
#define PRU_PEERADDR 16 /* fetch peer's address */
#define PRU_CONNECT2 17 /* connect two sockets */
/* begin for protocols internal use */
#define PRU_FASTTIMO 18 /* 200ms timeout */
#define PRU_SLOWTIMO 19 /* 500ms timeout */
#define PRU_PROTORCV 20 /* receive from below */
#define PRU_PROTOSEND 21 /* send to below */
A call on the user request routine is of the form,
error = (*protosw[].pr_usrreq)(up, req, m, addr, rights);
int error; struct socket *up; int req; struct mbuf *m, *rights; caddr_t addr;
The mbuf chain, m, and the address are optional parameters.
The rights parameter is an optional pointer to an mbuf chain
containing user specified capabilities (see the sendmsg and
recvmsg system calls). The protocol is responsible for
disposal of both mbuf chains. A non-zero return value gives
a UNIX error number which should be passed to higher level
software. The following paragraphs describe each of the
requests possible.
PRU_ATTACH
When a protocol is bound to a socket (with the socket
system call) the protocol module is called with this
request. It is the responsibility of the protocol
module to allocate any resources necessary. The
``attach'' request will always precede any of the other
requests, and should not occur more than once.
PRU_DETACH
Networking Implementation - 19 - Socket/protocol interface
This is the antithesis of the attach request, and is
used at the time a socket is deleted. The protocol
module may deallocate any resources assigned to the
socket.
PRU_BIND
When a socket is initially created it has no address
bound to it. This request indicates an address should
be bound to an existing socket. The protocol module
must verify the requested address is valid and avail-
able for use.
PRU_LISTEN
The ``listen'' request indicates the user wishes to
listen for incoming connection requests on the associ-
ated socket. The protocol module should perform any
state changes needed to carry out this request (if pos-
sible). A ``listen'' request always precedes a request
to accept a connection.
PRU_CONNECT
The ``connect'' request indicates the user wants to a
establish an association. The addr parameter supplied
describes the peer to be connected to. The effect of a
connect request may vary depending on the protocol.
Virtual circuit protocols, such as TCP [Postel80b], use
this request to initiate establishment of a TCP connec-
tion. Datagram protocols, such as UDP [Postel79], sim-
ply record the peer's address in a private data struc-
ture and use it to tag all outgoing packets. There are
no restrictions on how many times a connect request may
be used after an attach. If a protocol supports the
notion of multi-casting, it is possible to use multiple
connects to establish a multi-cast group. Alterna-
tively, an association may be broken by a
PRU_DISCONNECT request, and a new association created
with a subsequent connect request; all without destroy-
ing and creating a new socket.
PRU_ACCEPT
Following a successful PRU_LISTEN request and the
arrival of one or more connections, this request is
made to indicate the user has accepted the first con-
nection on the queue of pending connections. The pro-
tocol module should fill in the supplied address buffer
with the address of the connected party.
PRU_DISCONNECT
Eliminate an association created with a PRU_CONNECT
request.
PRU_SHUTDOWN
This call is used to indicate no more data will be sent
and/or received (the addr parameter indicates the
Networking Implementation - 20 - Socket/protocol interface
direction of the shutdown, as encoded in the soshutdown
system call). The protocol may, at its discretion,
deallocate any data structures related to the shutdown.
PRU_RCVD
This request is made only if the protocol entry in the
protocol switch table includes the PR_WANTRCVD flag.
When a user removes data from the receive queue this
request will be sent to the protocol module. It may be
used to trigger acknowledgements, refresh windowing
information, initiate data transfer, etc.
PRU_SEND
Each user request to send data is translated into one
or more PRU_SEND requests (a protocol may indicate a
single user send request must be translated into a sin-
gle PRU_SEND request by specifying the PR_ATOMIC flag
in its protocol description). The data to be sent is
presented to the protocol as a list of mbufs and an
address is, optionally, supplied in the addr parameter.
The protocol is responsible for preserving the data in
the socket's send queue if it is not able to send it
immediately, or if it may need it at some later time
(e.g. for retransmission).
PRU_ABORT
This request indicates an abnormal termination of ser-
vice. The protocol should delete any existing
association(s).
PRU_CONTROL
The ``control'' request is generated when a user per-
forms a UNIX ioctl system call on a socket (and the
ioctl is not intercepted by the socket routines). It
allows protocol-specific operations to be provided out-
side the scope of the common socket interface. The
addr parameter contains a pointer to a static kernel
data area where relevant information may be obtained or
returned. The m parameter contains the actual ioctl
request code (note the non-standard calling conven-
tion).
PRU_SENSE
The ``sense'' request is generated when the user makes
an fstat system call on a socket; it requests status of
the associated socket. There currently is no common
format for the status returned. Information which might
be returned includes per-connection statistics, proto-
col state, resources currently in use by the connec-
tion, the optimal transfer size for the connection
(based on windowing information and maximum packet
size). The addr parameter contains a pointer to a
static kernel data area where the status buffer should
be placed.
Networking Implementation - 21 - Socket/protocol interface
PRU_RCVOOB
Any ``out-of-band'' data presently available is to be
returned. An mbuf is passed in to the protocol module
and the protocol should either place data in the mbuf
or attach new mbufs to the one supplied if there is
insufficient space in the single mbuf.
PRU_SENDOOB
Like PRU_SEND, but for out-of-band data.
PRU_SOCKADDR
The local address of the socket is returned, if any is
currently bound to the it. The address format (proto-
col specific) is returned in the addr parameter.
PRU_PEERADDR
The address of the peer to which the socket is con-
nected is returned. The socket must be in a
SS_ISCONNECTED state for this request to be made to the
protocol. The address format (protocol specific) is
returned in the addr parameter.
PRU_CONNECT2
The protocol module is supplied two sockets and
requested to establish a connection between the two
without binding any addresses, if possible. This call
is used in implementing the system call.
The following requests are used internally by the pro-
tocol modules and are never generated by the socket rou-
tines. In certain instances, they are handed to the
pr_usrreq routine solely for convenience in tracing a
protocol's operation (e.g. PRU_SLOWTIMO).
PRU_FASTTIMO
A ``fast timeout'' has occured. This request is made
when a timeout occurs in the protocol's pr_fastimo rou-
tine. The addr parameter indicates which timer
expired.
PRU_SLOWTIMO
A ``slow timeout'' has occured. This request is made
when a timeout occurs in the protocol's pr_slowtimo
routine. The addr parameter indicates which timer
expired.
PRU_PROTORCV
This request is used in the protocol-protocol inter-
face, not by the routines. It requests reception of
data destined for the protocol and not the user. No
protocols currently use this facility.
PRU_PROTOSEND
This request allows a protocol to send data destined
Networking Implementation - 22 - Socket/protocol interface
for another protocol module, not a user. The details
of how data is marked ``addressed to protocol'' instead
of ``addressed to user'' are left to the protocol
modules. No protocols currently use this facility.
Networking Implementation - 23 -Protocol/protocol interface
8. Protocol/protocol interface
The interface between protocol modules is through the
pr_usrreq, pr_input, pr_output, pr_ctlinput, and
pr_ctloutput routines. The calling conventions for all but
the pr_usrreq routine are expected to be specific to the
protocol modules and are not guaranteed to be consistent
across protocol families. We will examine the conventions
used for some of the Internet protocols in this section as
an example.
8.1. pr_output
The Internet protocol UDP uses the convention,
error = udp_output(inp, m);
int error; struct inpcb *inp; struct mbuf *m;
where the inp, ``internet protocol control block'', passed
between modules conveys per connection state information,
and the mbuf chain contains the data to be sent. UDP per-
forms consistency checks, appends its header, calculates a
checksum, etc. before passing the packet on to the IP
module:
error = ip_output(m, opt, ro, allowbroadcast);
int error; struct mbuf *m, *opt; struct route *ro; int allowbroadcast;
The call to IP's output routine is more complicated
than that for UDP, as befits the additional work the IP
module must do. The m parameter is the data to be sent, and
the opt parameter is an optional list of IP options which
should be placed in the IP packet header. The ro parameter
is is used in making routing decisions (and passing them
back to the caller). The final parameter, allowbroadcast is
a flag indicating if the user is allowed to transmit a
broadcast packet. This may be inconsequential if the under-
lying hardware does not support the notion of broadcasting.
All output routines return 0 on success and a UNIX
error number if a failure occured which could be immediately
detected (no buffer space available, no route to destina-
tion, etc.).
8.2. pr_input
Both UDP and TCP use the following calling convention,
(void) (*protosw[].pr_input)(m);
struct mbuf *m;
Each mbuf list passed is a single packet to be processed by
the protocol module.
Networking Implementation - 24 -Protocol/protocol interface
The IP input routine is a VAX software interrupt level
routine, and so is not called with any parameters. It
instead communicates with network interfaces through a
queue, ipintrq, which is identical in structure to the
queues used by the network interfaces for storing packets
awaiting transmission.
8.3. pr_ctlinput
This routine is used to convey ``control'' information
to a protocol module (i.e. information which might be passed
to the user, but is not data). This routine, and the
pr_ctloutput routine, have not been extensively developed,
and thus suffer from a ``clumsiness'' that can only be
improved as more demands are placed on it.
The common calling convention for this routine is,
(void) (*protosw[].pr_ctlinput)(req, info);
int req; caddr_t info;
The req parameter is one of the following,
#define PRC_IFDOWN 0 /* interface transition */
#define PRC_ROUTEDEAD 1 /* select new route if possible */
#define PRC_QUENCH 4 /* some said to slow down */
#define PRC_HOSTDEAD 6 /* normally from IMP */
#define PRC_HOSTUNREACH 7 /* ditto */
#define PRC_UNREACH_NET 8 /* no route to network */
#define PRC_UNREACH_HOST 9 /* no route to host */
#define PRC_UNREACH_PROTOCOL 10 /* dst says bad protocol */
#define PRC_UNREACH_PORT 11 /* bad port # */
#define PRC_MSGSIZE 12 /* message size forced drop */
#define PRC_REDIRECT_NET 13 /* net routing redirect */
#define PRC_REDIRECT_HOST 14 /* host routing redirect */
#define PRC_TIMXCEED_INTRANS 17 /* packet lifetime expired in transit */
#define PRC_TIMXCEED_REASS 18 /* lifetime expired on reass q */
#define PRC_PARAMPROB 19 /* header incorrect */
while the info parameter is a ``catchall'' value which is
request dependent. Many of the requests have obviously been
derived from ICMP (the Internet Control Message Protocol),
and from error messages defined in the 1822 host/IMP conven-
tion [BBN78]. Mapping tables exist to convert control
requests to UNIX error codes which are delivered to a user.
8.4. pr_ctloutput
This routine is not currently used by any protocol
modules.
Networking Implementation - 25 - Protocol/network-interface
9. Protocol/network-interface interface
The lowest layer in the set of protocols which comprise
a protocol family must interface itself to one or more net-
work interfaces in order to transmit and receive packets.
It is assumed that any routing decisions have been made
before handing a packet to a network interface, in fact this
is absolutely necessary in order to locate any interface at
all (unless, of course, one uses a single ``hardwired''
interface). There are two cases to be concerned with,
transmission of a packet, and receipt of a packet; each will
be considered separately.
9.1. Packet transmission
Assuming a protocol has a handle on an interface, ifp,
a (struct ifnet *), it transmits a fully formatted packet
with the following call,
error = (*ifp->if_output)(ifp, m, dst)
int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;
The output routine for the network interface transmits the
packet m to the dst address, or returns an error indication
(a UNIX error number). In reality transmission may not be
immediate, or successful; normally the output routine simply
queues the packet on its send queue and primes an interrupt
driven routine to actually transmit the packet. For unreli-
able mediums, such as the Ethernet, ``successful'' transmis-
sion simply means the packet has been placed on the cable
without a collision. On the other hand, an 1822 interface
guarantees proper delivery or an error indication for each
message transmitted. The model employed in the networking
system attaches no promises of delivery to the packets
handed to a network interface, and thus corresponds more
closely to the Ethernet. Errors returned by the output rou-
tine are normally trivial in nature (no buffer space,
address format not handled, etc.).
9.2. Packet reception
Each protocol family must have one or more ``lowest
level'' protocols. These protocols deal with internetwork
addressing and are responsible for the delivery of incoming
packets to the proper protocol processing modules. In the
PUP model [Boggs78] these protocols are termed Level 1 pro-
tocols, in the ISO model, network layer protocols. In our
system each such protocol module has an input packet queue
assigned to it. Incoming packets received by a network
interface are queued up for the protocol module and a VAX
software interrupt is posted to initiate processing.
Three macros are available for queueing and dequeueing
packets,
Networking Implementation - 26 - Protocol/network-interface
IF_ENQUEUE(ifq, m)
This places the packet m at the tail of the queue ifq.
IF_DEQUEUE(ifq, m)
This places a pointer to the packet at the head of
queue ifq in m. A zero value will be returned in m if
the queue is empty.
IF_PREPEND(ifq, m)
This places the packet m at the head of the queue ifq.
Each queue has a maximum length associated with it as a
simple form of congestion control. The macro IF_QFULL(ifq)
returns 1 if the queue is filled, in which case the macro
IF_DROP(ifq) should be used to bump a count of the number of
packets dropped and the offending packet dropped. For exam-
ple, the following code fragment is commonly found in a net-
work interface's input routine,
if (IF_QFULL(inq)) {
IF_DROP(inq);
m_freem(m);
} else
IF_ENQUEUE(inq, m);
Networking Implementation - 27 - Gateways and routing
10. Gateways and routing issues
The system has been designed with the expectation that
it will be used in an internetwork environment. The
``canonical'' environment was envisioned to be a collection
of local area networks connected at one or more points
through hosts with multiple network interfaces (one on each
local area network), and possibly a connection to a long
haul network (for example, the ARPANET). In such an
environment, issues of gatewaying and packet routing become
very important. Certain of these issues, such as congestion
control, have been handled in a simplistic manner or specif-
ically not addressed. Instead, where possible, the network
system attempts to provide simple mechanisms upon which more
involved policies may be implemented. As some of these
problems become better understood, the solutions developed
will be incorporated into the system.
This section will describe the facilities provided for
packet routing. The simplistic mechanisms provided for
congestion control are described in chapter 12.
10.1. Routing tables
The network system maintains a set of routing tables
for selecting a network interface to use in delivering a
packet to its destination. These tables are of the form:
struct rtentry {
u_long rt_hash; /* hash key for lookups */
struct sockaddr rt_dst; /* destination net or host */
struct sockaddr rt_gateway; /* forwarding agent */
short rt_flags; /* see below */
short rt_refcnt; /* no. of references to structure */
u_long rt_use; /* packets sent using route */
struct ifnet *rt_ifp; /* interface to give packet to */
};
The routing information is organized in two separate
tables, one for routes to a host and one for routes to a
network. The distinction between hosts and networks is
necessary so that a single mechanism may be used for both
broadcast and multi-drop type networks, and also for net-
works built from point-to-point links (e.g DECnet [DEC80]).
Each table is organized as a hashed set of linked
lists. Two 32-bit hash values are calculated by routines
defined for each address family; one based on the destina-
tion being a host, and one assuming the target is the net-
work portion of the address. Each hash value is used to
locate a hash chain to search (by taking the value modulo
the hash table size) and the entire 32-bit value is then
used as a key in scanning the list of routes. Lookups are
Networking Implementation - 28 - Gateways and routing
applied first to the routing table for hosts, then to the
routing table for networks. If both lookups fail, a final
lookup is made for a ``wildcard'' route (by convention, net-
work 0). By doing this, routes to a specific host on a net-
work may be present as well as routes to the network. This
also allows a ``fall back'' network route to be defined to
an ``smart'' gateway which may then perform more intelligent
routing.
Each routing table entry contains a destination (who's
at the other end of the route), a gateway to send the packet
to, and various flags which indicate the route's status and
type (host or network). A count of the number of packets
sent using the route is kept for use in deciding between
multiple routes to the same destination (see below), and a
count of ``held references'' to the dynamically allocated
structure is maintained to insure memory reclamation occurs
only when the route is not in use. Finally a pointer to the
a network interface is kept; packets sent using the route
should be handed to this interface.
Routes are typed in two ways: either as host or net-
work, and as ``direct'' or ``indirect''. The host/network
distinction determines how to compare the rt_dst field dur-
ing lookup. If the route is to a network, only a packet's
destination network is compared to the rt_dst entry stored
in the table. If the route is to a host, the addresses must
match bit for bit.
The distinction between ``direct'' and ``indirect''
routes indicates whether the destination is directly con-
nected to the source. This is needed when performing local
network encapsulation. If a packet is destined for a peer
at a host or network which is not directly connected to the
source, the internetwork packet header will indicate the
address of the eventual destination, while the local network
header will indicate the address of the intervening gateway.
Should the destination be directly connected, these
addresses are likely to be identical, or a mapping between
the two exists. The RTF_GATEWAY flag indicates the route is
to an ``indirect'' gateway agent and the local network
header should be filled in from the rt_gateway field instead
of rt_dst, or from the internetwork destination address.
It is assumed multiple routes to the same destination
will not be present unless they are deemed equal in cost
(the current routing policy process never installs multiple
routes to the same destination). However, should multiple
routes to the same destination exist, a request for a route
will return the ``least used'' route based on the total
number of packets sent along this route. This can result in
a ``ping-pong'' effect (alternate packets taking alternate
routes), unless protocols ``hold onto'' routes until they no
longer find them useful; either because the destination has
Networking Implementation - 29 - Gateways and routing
changed, or because the route is lossy.
Routing redirect control messages are used to dynami-
cally modify existing routing table entries as well as
dynamically create new routing table entries. On hosts
where exhaustive routing information is too expensive to
maintain (e.g. work stations), the combination of wildcard
routing entries and routing redirect messages can be used to
provide a simple routing management scheme without the use
of a higher level policy process. Statistics are kept by the
routing table routines on the use of routing redirect mes-
sages and their affect on the routing tables. These statis-
tics may be viewed using
Status information other than routing redirect control
messages may be used in the future, but at present they are
ignored. Likewise, more intelligent ``metrics'' may be used
to describe routes in the future, possibly based on
bandwidth and monetary costs.
10.2. Routing table interface
A protocol accesses the routing tables through three
routines, one to allocate a route, one to free a route, and
one to process a routing redirect control message. The rou-
tine rtalloc performs route allocation; it is called with a
pointer to the following structure,
struct route {
struct rtentry *ro_rt;
struct sockaddr ro_dst;
};
The route returned is assumed ``held'' by the caller until
disposed of with an rtfree call. Protocols which implement
virtual circuits, such as TCP, hold onto routes for the
duration of the circuit's lifetime, while connection-less
protocols, such as UDP, currently allocate and free routes
on each transmission.
The routine rtredirect is called to process a routing
redirect control message. It is called with a destination
address and the new gateway to that destination. If a non-
wildcard route exists to the destination, the gateway entry
in the route is modified to point at the new gateway sup-
plied. Otherwise, a new routing table entry is inserted
reflecting the information supplied. Routes to interfaces
and routes to gateways which are not directly accesible from
the host are ignored.
10.3. User level routing policies
Routing policies implemented in user processes manipu-
late the kernel routing tables through two ioctl calls. The
Networking Implementation - 30 - Gateways and routing
commands SIOCADDRT and SIOCDELRT add and delete routing
entries, respectively; the tables are read through the
/dev/kmem device. The decision to place policy decisions in
a user process implies routing table updates may lag a bit
behind the identification of new routes, or the failure of
existing routes, but this period of instability is normally
very small with proper implementation of the routing pro-
cess. Advisory information, such as ICMP error messages and
IMP diagnostic messages, may be read from raw sockets
(described in the next section).
One routing policy process has already been imple-
mented. The system standard ``routing daemon'' uses a vari-
ant of the Xerox NS Routing Information Protocol [Xerox82]
to maintain up to date routing tables in our local environ-
ment. Interaction with other existing routing protocols,
such as the Internet GGP (Gateway-Gateway Protocol), may be
accomplished using a similar process.
Networking Implementation - 31 - Raw sockets
11. Raw sockets
A raw socket is a mechanism which allows users direct
access to a lower level protocol. Raw sockets are intended
for knowledgeable processes which wish to take advantage of
some protocol feature not directly accessible through the
normal interface, or for the development of new protocols
built atop existing lower level protocols. For example, a
new version of TCP might be developed at the user level by
utilizing a raw IP socket for delivery of packets. The raw
IP socket interface attempts to provide an identical inter-
face to the one a protocol would have if it were resident in
the kernel.
The raw socket support is built around a generic raw
socket interface, and (possibly) augmented by protocol-
specific processing routines. This section will describe
the core of the raw socket interface.
11.1. Control blocks
Every raw socket has a protocol control block of the
following form,
struct rawcb {
struct rawcb *rcb_next; /* doubly linked list */
struct rawcb *rcb_prev;
struct socket *rcb_socket; /* back pointer to socket */
struct sockaddr rcb_faddr; /* destination address */
struct sockaddr rcb_laddr; /* socket's address */
caddr_t rcb_pcb; /* protocol specific stuff */
short rcb_flags;
};
All the control blocks are kept on a doubly linked list for
performing lookups during packet dispatch. Associations may
be recorded in the control block and used by the output rou-
tine in preparing packets for transmission. The addresses
are also used to filter packets on input; this will be
described in more detail shortly. If any protocol specific
information is required, it may be attached to the control
block using the rcb_pcb field.
A raw socket interface is datagram oriented. That is,
each send or receive on the socket requires a destination
address. This address may be supplied by the user or stored
in the control block and automatically installed in the out-
going packet by the output routine. Since it is not possi-
ble to determine whether an address is present or not in the
control block, two flags, RAW_LADDR and RAW_FADDR, indicate
if a local and foreign address are present. Another flag,
RAW_DONTROUTE, indicates if routing should be performed on
outgoing packets. If it is, a route is expected to be allo-
cated for each ``new'' destination address. That is, the
Networking Implementation - 32 - Raw sockets
first time a packet is transmitted a route is determined,
and thereafter each time the destination address stored in
rcb_route differs from rcb_faddr, or rcb_route.ro_rt is
zero, the old route is discarded and a new one allocated.
11.2. Input processing
Input packets are ``assigned'' to raw sockets based on
a simple pattern matching scheme. Each network interface or
protocol gives packets to the raw input routine with the
call:
raw_input(m, proto, src, dst)
struct mbuf *m; struct sockproto *proto, struct sockaddr *src, *dst;
The data packet then has a generic header prepended to it of
the form
struct raw_header {
struct sockproto raw_proto;
struct sockaddr raw_dst;
struct sockaddr raw_src;
};
and it is placed in a packet queue for the ``raw input pro-
tocol'' module. Packets taken from this queue are copied
into any raw sockets that match the header according to the
following rules,
1) The protocol family of the socket and header agree.
2) If the protocol number in the socket is non-zero, then
it agrees with that found in the packet header.
3) If a local address is defined for the socket, the
address format of the local address is the same as the
destination address's and the two addresses agree bit
for bit.
4) The rules of 3) are applied to the socket's foreign
address and the packet's source address.
A basic assumption is that addresses present in the control
block and packet header (as constructed by the network
interface and any raw input protocol module) are in a canon-
ical form which may be ``block compared''.
11.3. Output processing
On output the raw pr_usrreq routine passes the packet
and raw control block to the raw protocol output routine for
any processing required before it is delivered to the
appropriate network interface. The output routine is nor-
mally the only code required to implement a raw socket
Networking Implementation - 33 - Raw sockets
interface.
Networking Implementation -Buffering and congestion control
12. Buffering and congestion control
One of the major factors in the performance of a proto-
col is the buffering policy used. Lack of a proper buffer-
ing policy can force packets to be dropped, cause falsified
windowing information to be emitted by protocols, fragment
host memory, degrade the overall host performance, etc. Due
to problems such as these, most systems allocate a fixed
pool of memory to the networking system and impose a policy
optimized for ``normal'' network operation.
The networking system developed for UNIX is little dif-
ferent in this respect. At boot time a fixed amount of
memory is allocated by the networking system. At later
times more system memory may be requested as the need
arises, but at no time is memory ever returned to the sys-
tem. It is possible to garbage collect memory from the net-
work, but difficult. In order to perform this garbage col-
lection properly, some portion of the network will have to
be ``turned off'' as data structures are updated. The
interval over which this occurs must kept small compared to
the average inter-packet arrival time, or too much traffic
may be lost, impacting other hosts on the network, as well
as increasing load on the interconnecting mediums. In our
environment we have not experienced a need for such compac-
tion, and thus have left the problem unresolved.
The mbuf structure was introduced in chapter 5. In
this section a brief description will be given of the allo-
cation mechanisms, and policies used by the protocols in
performing connection level buffering.
12.1. Memory management
The basic memory allocation routines place no restric-
tions on the amount of space which may be allocated. Any
request made is filled until the system memory allocator
starts refusing to allocate additional memory. When the
current quota of memory is insufficient to satisfy an mbuf
allocation request, the allocator requests enough new pages
from the system to satisfy the current request only. All
memory owned by the network is described by a private page
table used in remapping pages to be logically contiguous as
the need arises. In addition, an array of reference counts
parallels the page table and is used when multiple copies of
a page are present.
Mbufs are 128 byte structures, 8 fitting in a 1Kbyte
page of memory. When data is placed in mbufs, if possible,
it is copied or remapped into logically contiguous pages of
memory from the network page pool. Data smaller than the
size of a page is copied into one or more 112 byte mbuf data
areas.
Networking Implementation -Buffering and congestion control
12.2. Protocol buffering policies
Protocols reserve fixed amounts of buffering for send
and receive queues at socket creation time. These amounts
define the high and low water marks used by the socket rou-
tines in deciding when to block and unblock a process. The
reservation of space does not currently result in any action
by the memory management routines, though it is clear if one
imposed an upper bound on the total amount of physical
memory allocated to the network, reserving memory would
become important.
Protocols which provide connection level flow control
do this based on the amount of space in the associated
socket queues. That is, send windows are calculated based
on the amount of free space in the socket's receive queue,
while receive windows are adjusted based on the amount of
data awaiting transmission in the send queue. Care has been
taken to avoid the ``silly window syndrome'' described in
[Clark82] at both the sending and receiving ends.
12.3. Queue limiting
Incoming packets from the network are always received
unless memory allocation fails. However, each Level 1 pro-
tocol input queue has an upper bound on the queue's length,
and any packets exceeding that bound are discarded. It is
possible for a host to be overwhelmed by excessive network
traffic (for instance a host acting as a gateway from a high
bandwidth network to a low bandwidth network). As a
``defensive'' mechanism the queue limits may be adjusted to
throttle network traffic load on a host. Consider a host
willing to devote some percentage of its machine to handling
network traffic. If the cost of handling an incoming packet
can be calculated so that an acceptable ``packet handling
rate'' can be determined, then input queue lengths may be
dynamically adjusted based on a host's network load and the
number of packets awaiting processing. Obviously, discard-
ing packets is not a satisfactory solution to a problem such
as this (simply dropping packets is likely to increase the
load on a network); the queue lengths were incorporated
mainly as a safeguard mechanism.
12.4. Packet forwarding
When packets can not be forwarded because of memory
limitations, the system generates a ``source quench'' mes-
sage. In addition, any other problems encountered during
packet forwarding are also reflected back to the sender in
the form of ICMP packets. This helps hosts avoid unneeded
retransmissions.
Broadcast packets are never forwarded due to possible
dire consequences. In an early stage of network
Networking Implementation -Buffering and congestion control
development, broadcast packets were forwarded and a ``rout-
ing loop'' resulted in network saturation and every host on
the network crashing.
Networking Implementation - 37 - Out of band data
13. Out of band data
Out of band data is a facility peculiar to the stream
socket abstraction defined. Little agreement appears to
exist as to what its semantics should be. TCP defines the
notion of ``urgent data'' as in-line, while the NBS proto-
cols [Burruss81] and numerous others provide a fully
independent logical transmission channel along which out of
band data is to be sent. In addition, the amount of the
data which may be sent as an out of band message varies from
protocol to protocol; everything from 1 bit to 16 bytes or
more.
A stream socket's notion of out of band data has been
defined as the lowest reasonable common denominator (at
least reasonable in our minds); clearly this is subject to
debate. Out of band data is expected to be transmitted out
of the normal sequencing and flow control constraints of the
data stream. A minimum of 1 byte of out of band data and
one outstanding out of band message are expected to be sup-
ported by the protocol supporting a stream socket. It is a
protocols perogative to support larger sized messages, or
more than one outstanding out of band message at a time.
Out of band data is maintained by the protocol and usu-
ally not stored in the socket's send queue. The PRU_SENDOOB
and PRU_RCVOOB requests to the pr_usrreq routine are used in
sending and receiving data.
Networking Implementation - 38 - Trailer protocols
14. Trailer protocols
Core to core copies can be expensive. Consequently, a
great deal of effort was spent in minimizing such opera-
tions. The VAX architecture provides virtual memory
hardware organized in page units. To cut down on copy
operations, data is kept in page sized units on page-aligned
boundaries whenever possible. This allows data to be moved
in memory simply by remapping the page instead of copying.
The mbuf and network interface routines perform page table
manipulations where needed, hiding the complexities of the
VAX virtual memory hardware from higher level code.
Data enters the system in two ways: from the user, or
from the network (hardware interface). When data is copied
from the user's address space into the system it is depo-
sited in pages (if sufficient data is present to fill an
entire page). This encourages the user to transmit informa-
tion in messages which are a multiple of the system page
size.
Unfortunately, performing a similar operation when tak-
ing data from the network is very difficult. Consider the
format of an incoming packet. A packet usually contains a
local network header followed by one or more headers used by
the high level protocols. Finally, the data, if any, follows
these headers. Since the header information may be variable
length, DMA'ing the eventual data for the user into a page
aligned area of memory is impossible without a priori
knowledge of the format (e.g. supporting only a single pro-
tocol header format).
To allow variable length header information to be
present and still ensure page alignment of data, a special
local network encapsulation may be used. This encapsula-
tion, termed a trailer protocol, places the variable length
header information after the data. A fixed size local net-
work header is then prepended to the resultant packet. The
local network header contains the size of the data portion,
and a new trailer protocol header, inserted before the vari-
able length information, contains the size of the variable
length header information. The following trailer protocol
header is used to store information regarding the variable
length protocol header:
struct {
short protocol; /* original protocol no. */
short length; /* length of trailer */
};
The processing of the trailer protocol is very simple.
On output, the local network header indicates a trailer
encapsulation is being used. The protocol identifier also
Networking Implementation - 39 - Trailer protocols
includes an indication of the number of data pages present
(before the trailer protocol header). The trailer protocol
header is initialized to contain the actual protocol and
variable length header size, and appended to the data along
with the variable length header information.
On input, the interface routines identify the trailer
encapsulation by the protocol type stored in the local net-
work header, then calculate the number of pages of data to
find the beginning of the trailer. The trailing information
is copied into a separate mbuf and linked to the front of
the resultant packet.
Clearly, trailer protocols require cooperation between
source and destination. In addition, they are normally cost
effective only when sizable packets are used. The current
scheme works because the local network encapsulation header
is a fixed size, allowing DMA operations to be performed at
a known offset from the first data page being received.
Should the local network header be variable length this
scheme fails.
Statistics collected indicate as much as 200Kb/s can be
gained by using a trailer protocol with 1Kbyte packets. The
average size of the variable length header was 40 bytes (the
size of a minimal TCP/IP packet header). If hardware sup-
ports larger sized packets, even greater gains may be real-
ized.
Networking Implementation - 40 - Acknowledgements
Acknowledgements
The internal structure of the system is patterned after
the Xerox PUP architecture [Boggs79], while in certain
places the Internet protocol family has had a great deal of
influence in the design. The use of software interrupts for
process invocation is based on similar facilities found in
the VMS operating system. Many of the ideas related to pro-
tocol modularity, memory management, and network interfaces
are based on Rob Gurwitz's TCP/IP implementation for the
4.1BSD version of UNIX on the VAX [Gurwitz81]. Greg Chesson
explained his use of trailer encapsulations in Datakit,
instigating their use in our system.
References
[Boggs79] Boggs, D. R., J. F. Shoch, E. A. Taft,
and R. M. Metcalfe; PUP: An Internetwork
Architecture. Report CSL-79-10. XEROX
Palo Alto Research Center, July 1979.
[BBN78] Bolt Beranek and Newman; Specification
for the Interconnection of Host and IMP.
BBN Technical Report 1822. May 1978.
[Cerf78] Cerf, V. G.; The Catenet Model for
Internetworking. Internet Working
Group, IEN 48. July 1978.
[Clark82] Clark, D. D.; Window and Acknowledge-
ment Strategy in TCP. Internet Working
Group, IEN Draft Clark-2. March 1982.
[DEC80] Digital Equipment Corporation; DECnet
DIGITAL Network Architecture - General
Description. Order No. AA-K179A-TK.
October 1980.
[Gurwitz81] Gurwitz, R. F.; VAX-UNIX Networking
Support Project - Implementation
Description. Internetwork Working
Group, IEN 168. January 1981.
[ISO81] International Organization for Standard-
ization. ISO Open Systems Interconnec-
tion - Basic Reference Model. ISO/TC
97/SC 16 N 719. August 1981.
[Joy82a] Joy, W.; Cooper, E.; Fabry, R.; Leffler,
S.; and McKusick, M.; 4.2BSD System
Manual. Computer Systems Research
Networking Implementation - 41 - References
Group, Technical Report 5. University
of California, Berkeley. Draft of Sep-
tember 1, 1982.
[Postel79] Postel, J., ed. DOD Standard User
Datagram Protocol. Internet Working
Group, IEN 88. May 1979.
[Postel80a] Postel, J., ed. DOD Standard Internet
Protocol. Internet Working Group, IEN
128. January 1980.
[Postel80b] Postel, J., ed. DOD Standard Transmis-
sion Control Protocol. Internet Working
Group, IEN 129. January 1980.
[Xerox81] Xerox Corporation. Internet Transport
Protocols. Xerox System Integration
Standard 028112. December 1981.
[Zimmermann80] Zimmermann, H. OSI Reference Model -
The ISO Model of Architecture for Open
Systems Interconnection. IEEE Transac-
tions on Communications. Com-28(4);
425-432. April 1980.