.NC "The Design of Unix IPC"
The ARGO implementation of
TP and CLNP was designed to fit into the AOS
All the standard protocol hooks are used.
To understand the design, it is useful to have
Leffler, Joy, and Fabry:
\*(lq4.2 BSD Networking Implementation Notes\*(rq July 1983.
This section describes the
design of the IPC support in the AOS kernel.
.sh 1 "Functional Unit Overview"
is a monolithic program of considerable size and complexity.
The code can be separated into parts of distinct function,
but there are no kernel processes per se.
The kernel code is either executed on behalf of a user
process, in which case the kernel was entered by a system call,
or it is executed on behalf of a hardware or software interrupt.
The following sections describe briefly the major functional units
.so figs/func_units.nr
shows the arrangement of these kernel units and
.sh 2 "The file system."
.sh 2 "Virtual memory support."
This includes protection, swapping, paging, and
.sh 2 "Blocked device drivers (disks, tapes)."
All these drivers share some minor functional units,
such as buffer management and bus support
for the various types of busses on the machine.
.sh 2 "Interprocess communication (IPC)."
support for various protocols,
buffer management, and a standard interface for inter-protocol
.sh 2 "Network interface drivers."
These drivers are closely tied to the IPC support.
They use the IPC's buffer management unit rather
than the buffers used by the blocked device drivers.
The interface between these drivers and the rest of the kernel
differs from the interface used by the blocked devices.
This is terminal support, including the user interface
and the device drivers.
.sh 2 "System call interface."
This handles signals, traps, and system calls.
The clock is used in various forms by many
.sh 2 "User process support (the rest)."
This includes support for accounting, process creation,
control, scheduling, and destruction.
The major functional unit that supports IPC
can be divided into the following smaller functional
.sh 3 "Buffer management."
All protocols share a pool of buffers called \fImbufs\fR:
+struct mbuf+*m_next;+/* next buffer in chain */
+u_long+m_off;+/* offset of data */
+short+m_len;+/* amount of data */
+short+m_type;+/* mbuf type (0 == free) */
+u_char+m_dat[MLEN];+/* data storage */
+struct mbuf+*m_act;+/* link in 2-d structure */
There are two forms of mbufs: small ones and large ones.
Small ones are 128 octets in
in the ARGO release. Small mbufs are copied by byte-to-byte
The data in these mbufs are kept in the character
array field \fIm_dat\fR in the mbuf structure
For this type of mbuf, the field \fIm_off\fR is positive,
and is the offset to the beginning of the data from
the beginning of the mbuf structure itself.
Large mbufs, called \fIclusters\fR, are page-sized
They may be \*(lqcopied\*(rq by multiply mapping the pages they occupy.
They consist of a page of memory plus a small mbuf structure
whose fields are used
to link clusters into chains, but whose \fIm_dat\fR array is
The \fIm_off\fR field of the structure
is the offset (positive or negative) from the
beginning of the mbuf structure to the beginning
of the data page part of the cluster.
In the case of clusters, the offset always falls outside the
bounds of the \fIm_dat\fR array, and so it is always possible
to tell from the \fIm_off\fR field whether an mbuf structure
is part of a cluster or is a small mbuf.
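The offset test described above can be sketched in a few lines of C.
This is a simplified user-space model, not the kernel's code:
the helper names \fImbuf_is_cluster()\fR and \fImbuf_data()\fR are hypothetical,
and \fIm_off\fR is declared as a signed \fIlong\fR here (an assumption made
so that a negative offset can be represented directly).

```c
#include <assert.h>
#include <stddef.h>

#define MLEN 112                 /* data octets in a small mbuf (AOS value from the text) */

/* Simplified model of the mbuf structure shown above. */
struct mbuf {
    struct mbuf  *m_next;        /* next buffer in chain */
    long          m_off;         /* offset of data (signed in this sketch) */
    short         m_len;         /* amount of data */
    short         m_type;        /* mbuf type (0 == free) */
    unsigned char m_dat[MLEN];   /* data storage */
    struct mbuf  *m_act;         /* link in 2-d structure */
};

/* Range of offsets that point into (or just past) the m_dat array. */
#define M_DAT_MIN ((long)offsetof(struct mbuf, m_dat))
#define M_DAT_MAX (M_DAT_MIN + MLEN)

/* Hypothetical helper: nonzero if the offset lies outside m_dat,
 * which by the rule in the text means the mbuf heads a cluster. */
static int mbuf_is_cluster(const struct mbuf *m)
{
    assert(m != NULL);
    return m->m_off < M_DAT_MIN || m->m_off > M_DAT_MAX;
}

/* Hypothetical helper: the data begins m_off octets from the start
 * of the mbuf structure, for small mbufs and clusters alike. */
static unsigned char *mbuf_data(struct mbuf *m)
{
    assert(m != NULL);
    return (unsigned char *)m + m->m_off;
}
```

Because the same base-plus-offset computation locates the data in both forms,
code that only reads or writes the data need not know which form it was handed.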
All mbufs permanently reside in memory.
The mbuf management unit manages its own page table.
The mbuf manager keeps limited statistics on the quantities and
types of buffers in use.
Mbufs are used for many purposes, and most of these purposes
have a type associated with them.
Some of the types that buffers may take are
MT_FREE (not allocated), MT_DATA,
MT_HEADER, MT_SOCKET (socket structure),
MT_PCB (protocol control block),
MT_RTABLE (routing tables),
MT_SOOPTS (arguments passed to \fIgetsockopt()\fR and
Data are passed among functional units by means
of queues, the contents of which are
either chains of mbufs or groups of chains of mbufs.
Mbufs are linked into chains with the \fIm_next\fR field.
Chains of mbufs are linked into groups with the \fIm_act\fR
The \fIm_act\fR field allows a protocol to retain packet
boundaries in a queue of mbufs.
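The two-dimensional linkage can be illustrated with a small traversal.
The sketch below uses a reduced mbuf with only the fields needed here;
the function names \fIchain_length()\fR and \fIqueue_packets()\fR are
hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced mbuf: only the link and length fields matter here. */
struct mbuf {
    struct mbuf *m_next;   /* next buffer in the same chain (packet) */
    short        m_len;    /* amount of data in this buffer */
    struct mbuf *m_act;    /* first buffer of the next chain in the queue */
};

/* Hypothetical helper: total octets in one chain, following m_next. */
static long chain_length(const struct mbuf *m)
{
    long total = 0;

    for (; m != NULL; m = m->m_next)
        total += m->m_len;
    return total;
}

/* Hypothetical helper: number of chains in a queue, following m_act
 * from the first mbuf of each chain. */
static int queue_packets(const struct mbuf *q)
{
    int n = 0;

    for (; q != NULL; q = q->m_act)
        n++;
    return n;
}
```

A queue holding a two-buffer packet followed by a one-buffer packet is walked
as two chains: \fIm_act\fR steps from packet to packet, and \fIm_next\fR steps
through the buffers of each packet.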
Routing decisions in the kernel are made by the procedure \fIrtalloc()\fR.
This procedure will scan the kernel routing tables (stored in mbufs)
looking for a route. A route is represented by
+u_long+rt_hash;+/* to speed lookups */
+struct sockaddr+rt_dst;+/* key */
+struct sockaddr+rt_gateway;+/* value */
+short+rt_flags;+/* up/down?, host/net */
+short+rt_refcnt;+/* # held references */
+u_long+rt_use;+/* raw # packets forwarded */
+struct ifnet+*rt_ifp;+/* interface to use */
When looking for a route, \fIrtalloc()\fR will first hash the entire destination
address and scan the routing tables looking for a complete route. If a route
is not found, then \fIrtalloc()\fR will rescan the table looking for a route
that matches the \fInetwork\fR portion of the address. If a route is still
not found, then a default route is used (if present).
If a route is found, the entity that called \fIrtalloc()\fR can use information
from the \fIrtentry\fR structure to dispatch the datagram. Specifically, the
datagram is queued on the interface identified by the interface
pointer \fIrt_ifp\fR.
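The three-step search order (complete route, then network route, then default)
can be sketched as follows.
This is an illustration only: it uses a hypothetical reduced entry with
32-bit addresses in place of \fIstruct sockaddr\fR, omits the hashing,
assumes that the network portion is the high-order 24 bits,
and assumes that a destination of zero marks the default route.
The name \fIroute_lookup()\fR is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define RTF_UP   0x1            /* route is usable */
#define RTF_HOST 0x4            /* entry is for a host, not a network */

/* Hypothetical reduced form of the rtentry structure shown above. */
struct route_entry {
    unsigned long dst;          /* key: host or network address */
    unsigned long gateway;      /* value: next hop */
    short         flags;        /* RTF_UP, RTF_HOST */
};

/* Assumption for this sketch: network part = high-order 24 bits. */
static unsigned long net_of(unsigned long addr)
{
    return addr & 0xffffff00UL;
}

static const struct route_entry *
route_lookup(const struct route_entry *tab, size_t n, unsigned long dst)
{
    size_t i;

    assert(tab != NULL || n == 0);

    /* Pass 1: a complete (host) route for the whole address. */
    for (i = 0; i < n; i++)
        if ((tab[i].flags & RTF_UP) && (tab[i].flags & RTF_HOST) &&
            tab[i].dst == dst)
            return &tab[i];

    /* Pass 2: a route to the network portion of the address. */
    for (i = 0; i < n; i++)
        if ((tab[i].flags & RTF_UP) && !(tab[i].flags & RTF_HOST) &&
            tab[i].dst != 0 && tab[i].dst == net_of(dst))
            return &tab[i];

    /* Pass 3: the default route, if one is present. */
    for (i = 0; i < n; i++)
        if ((tab[i].flags & RTF_UP) && !(tab[i].flags & RTF_HOST) &&
            tab[i].dst == 0)
            return &tab[i];

    return NULL;
}
```

A caller would then use the returned entry to pick the outgoing interface,
as the text describes for \fIrt_ifp\fR; that field is left out of the reduced entry.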
This is the protocol-independent part of the IPC support.
Each communication endpoint (which may or may not be associated
with a connection) is represented by the following structure:
+short+so_type;+/* type, e.g. SOCK_DGRAM */
+short+so_options;+/* from socket call */
+short+so_linger;+/* time to linger @ close */
+short+so_state;+/* internal state flags */
+caddr_t+so_pcb;+/* network layer pcb */
+struct protosw+*so_proto;+/* protocol handle */
+struct socket+*so_head;+/* ptr to accept socket */
+struct socket+*so_q0;+/* queue of partial connX */
+short+so_q0len;+/* # partials on so_q0 */
+struct socket+*so_q;+/* queue of incoming connX */
+short+so_qlen;+/* # connections on so_q */
+short+so_qlimit;+/* max # queued connX */
++short+sb_cc;+/* actual chars in buffer */
++short+sb_hiwat;+/* max actual char count */
++short+sb_mbcnt;+/* chars of mbufs used */
++short+sb_mbmax;+/* max chars of mbufs to use */
++short+sb_lowat;+/* low water mark (not used yet) */
++short+sb_timeo;+/* timeout (not used) */
++struct mbuf+*sb_mb;+/* the mbuf chain */
++struct proc+*sb_sel;+/* process selecting */
++short+sb_flags;+/* flags, see below */
+short+so_timeo;+/* connection timeout */
+u_short+so_error;+/* error affecting connX */
+short+so_oobmark;+/* oob mark (TCP only) */
+short+so_pgrp;+/* pgrp for signals */
The socket code maintains a pair of queues for each socket,
\fIso_rcv\fR and \fIso_snd\fR.
Each queue is associated with a count of the number of characters
in the queue, the maximum number of characters allowed to be put
in the queue, some status information (\fIsb_flags\fR), and
several unused fields.
For a send operation, data are copied from the user's address space
into chains of mbufs.
This is done by the socket module, which then calls the underlying
transport protocol module to place the data
This is generally done by
appending to the chain beginning at \fIsb_mb\fR.
The socket module copies data from the \fIso_rcv\fR queue
to the user's address space to effect a receive operation.
The underlying transport layer is expected to have put incoming
data into \fIso_rcv\fR by calling procedures in this module.
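The character-count accounting on a queue can be sketched as below.
This is a minimal model using only \fIsb_cc\fR, \fIsb_hiwat\fR, and \fIsb_mb\fR;
the helpers \fIsb_space()\fR and \fIsb_append()\fR are hypothetical names,
and folding the space check into the append is a choice made for the example,
not a statement about where the kernel performs it.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced mbuf and queue structures, using field names from the text. */
struct mbuf {
    struct mbuf *m_next;    /* next buffer in chain */
    short        m_len;     /* amount of data */
};

struct sockbuf {
    short        sb_cc;     /* actual chars in buffer */
    short        sb_hiwat;  /* max actual char count */
    struct mbuf *sb_mb;     /* the mbuf chain */
};

/* Hypothetical helper: characters that may still be queued. */
static int sb_space(const struct sockbuf *sb)
{
    assert(sb != NULL);
    return sb->sb_hiwat - sb->sb_cc;
}

/* Hypothetical helper: append chain m to the chain beginning at sb_mb.
 * Returns 1 on success, 0 if the data would exceed the high-water mark. */
static int sb_append(struct sockbuf *sb, struct mbuf *m)
{
    struct mbuf **tail;
    struct mbuf *p;
    int len = 0;

    assert(sb != NULL);
    for (p = m; p != NULL; p = p->m_next)
        len += p->m_len;
    if (len > sb_space(sb))
        return 0;

    /* Walk to the end of the existing chain and link the new data. */
    for (tail = &sb->sb_mb; *tail != NULL; tail = &(*tail)->m_next)
        ;
    *tail = m;
    sb->sb_cc += (short)len;
    return 1;
}
```

Note that the limit is expressed purely in characters, which is the property
that matters for the discussion of packet-oriented protocols later in this section.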
.sh 3 "Transport protocol management."
All protocols and address types must be \*(lqregistered\*(rq in a
common way in order to use the IPC user interface.
Each protocol must have an entry in a protocol switch table.
Each entry takes the form:
+short+pr_type;+/* socket type used for */
+short+pr_family;+/* protocol family */
+short+pr_protocol;+/* protocol # from the database */
+short+pr_flags;+/* status information */
+++/* protocol-protocol hooks */
+int+(*pr_input)();+/* input (from below) */
+int+(*pr_output)();+/* output (from above) */
+int+(*pr_ctlinput)();+/* control input */
+int+(*pr_ctloutput)();+/* control output */
+++/* user-protocol hook */
+int+(*pr_usrreq)();+/* user request: see list below */
+++/* utility hooks */
+int+(*pr_init)();+/* initialization hook */
+int+(*pr_fasttimo)();+/* fast timeout (200ms) */
+int+(*pr_slowtimo)();+/* slow timeout (500ms) */
+int+(*pr_drain)();+/* free some space (not used) */
Associated with each protocol are the types of socket
abstractions supported by the protocol (\fIpr_type\fR), the
format of the addresses used by the protocol (\fIpr_family\fR),
the routines to be called to perform
a standard set of protocol functions (\fIpr_input\fR,...,\fIpr_drain\fR),
and some status information (\fIpr_flags\fR).
Per-socket status is kept in the \fIso_state\fR field of the socket structure, which records such information as
SS_ISCONNECTED (this socket has a peer),
SS_ISCONNECTING (this socket is in the process of establishing
SS_ISDISCONNECTING (this socket is in the process of being disconnected),
SS_CANTSENDMORE (this socket is half-closed and cannot send),
SS_CANTRCVMORE (this socket is half-closed and cannot receive).
There are some flags that are specific to the TCP concept
A flag SS_OOBAVAIL was added for the ARGO implementation, to support
the TP concept of out-of-band data (expedited data).
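The way a protocol switch table ties a socket type and address family to a set
of routines can be sketched as a table search followed by an indirect call.
Everything named here is hypothetical: the family and type constants carry
made-up values, the table \fIdemo_sw\fR and its two request routines are
placeholders, and \fIproto_find()\fR is an invented helper.
The entry is reduced to the fields needed for the lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up constants for this sketch; not the system's values. */
enum { FAM_INET = 1, FAM_ISO = 2 };
enum { TYPE_STREAM = 1, TYPE_DGRAM = 2, TYPE_SEQPACKET = 3 };

/* Reduced protocol switch entry. */
struct protosw {
    short pr_type;              /* socket type used for */
    short pr_family;            /* protocol family */
    short pr_protocol;          /* protocol number */
    short pr_flags;             /* status information */
    int (*pr_usrreq)(int req);  /* user request hook (simplified signature) */
};

/* Placeholder request routines standing in for real protocol code. */
static int demo_stream_usrreq(int req)    { return 100 + req; }
static int demo_seqpacket_usrreq(int req) { return 200 + req; }

static const struct protosw demo_sw[] = {
    { TYPE_STREAM,    FAM_INET, 6, 0, demo_stream_usrreq },
    { TYPE_SEQPACKET, FAM_ISO,  0, 0, demo_seqpacket_usrreq },
};

/* Hypothetical helper: first entry matching family and type, or NULL. */
static const struct protosw *
proto_find(const struct protosw *tab, size_t n, int family, int type)
{
    size_t i;

    assert(tab != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (tab[i].pr_family == family && tab[i].pr_type == type)
            return &tab[i];
    return NULL;
}
```

Once an entry has been found, the protocol-independent code needs no further
knowledge of the protocol: a user request is passed through the
\fIpr_usrreq\fR pointer of whichever entry the socket refers to.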
.sh 3 "Network interface drivers."
The drivers for the devices attaching a Unix machine to a network
medium share a common interface to the protocol
There is a common data structure for managing queues:
not surprisingly, a chain of mbufs.
There is a set of macros that are used to enqueue and
dequeue mbuf chains at high priority.
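A bounded queue of mbuf chains, manipulated with interrupts held off,
can be modeled as follows.
This is a user-space sketch written as functions rather than macros;
the structure \fIpkt_queue\fR, the functions \fIpq_enqueue()\fR and
\fIpq_dequeue()\fR, and the stubs \fIraise_priority()\fR and
\fIrestore_priority()\fR are all hypothetical.
The stubs merely mark where a kernel would raise and restore the processor priority.

```c
#include <assert.h>
#include <stddef.h>

struct mbuf {
    struct mbuf *m_next;    /* next buffer in this packet */
    struct mbuf *m_act;     /* next packet on the queue */
};

/* Hypothetical queue header for chains linked through m_act. */
struct pkt_queue {
    struct mbuf *head;
    struct mbuf *tail;
    int          len;       /* packets currently queued */
    int          maxlen;    /* limit on queued packets */
    int          drops;     /* packets refused because the queue was full */
};

/* Stand-ins for raising and restoring the interrupt priority level. */
static int  raise_priority(void)    { return 0; }
static void restore_priority(int s) { (void)s; }

/* Add one packet at the tail. Returns 1 if queued, 0 if dropped. */
static int pq_enqueue(struct pkt_queue *q, struct mbuf *m)
{
    int s, ok = 0;

    assert(q != NULL && m != NULL);
    s = raise_priority();
    if (q->len >= q->maxlen) {
        q->drops++;
    } else {
        m->m_act = NULL;
        if (q->tail == NULL)
            q->head = m;
        else
            q->tail->m_act = m;
        q->tail = m;
        q->len++;
        ok = 1;
    }
    restore_priority(s);
    return ok;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
static struct mbuf *pq_dequeue(struct pkt_queue *q)
{
    struct mbuf *m;
    int s;

    assert(q != NULL);
    s = raise_priority();
    m = q->head;
    if (m != NULL) {
        q->head = m->m_act;
        if (q->head == NULL)
            q->tail = NULL;
        m->m_act = NULL;
        q->len--;
    }
    restore_priority(s);
    return m;
}
```

The priority is raised around both operations because the same queue is touched
from interrupt level by the driver and from a lower level by the protocol code.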
delivers an indication to a protocol entity when
an incoming packet has been placed on a queue by
.sh 3 "Support for individual protocols."
Each protocol is written as a separate functional unit.
Because all protocols share the clock and the mbuf pool, they
are not entirely insulated from each other.
The details of TP are described in a section that
.\"*****************************************************
shows the arrangement of the IPC support.
IPC was designed for DoD Internet protocols, all of
which run over DoD IP.
The assumptions that DoD Internet is the domain
and that DoD IP is the network layer
appear in the code and data structures in numerous places.
For example, it is assumed that addresses can be compared
by a bitwise comparison of 4 octets.
Another example is that the transport protocols all directly call
There are no hooks in the data structures through
which the transport layer can choose a network level protocol.
A third example is that the host's local addresses
are stored in the network interface drivers and the drivers
have only one address - an Internet address.
A fourth example is that headers are assumed to
fit in one small mbuf (112 bytes for data in AOS).
A fifth example is the assumption,
made in many places, that buffer space is managed
in units of characters or octets.
The user data are copied from user address space into the kernel mbufs
by the socket code, a protocol-independent part of the kernel.
This is fine for a stream protocol, but it means that a
packet protocol, in order to \*(lqpacketize\*(rq the data,
must perform a memory-to-memory copy
that might have been avoided had the protocol layer done the original
copy from user address space.
Furthermore, protocols that count credit in terms of packets or
buffers rather than characters do not work efficiently because
the computation of buffer space is not in the protocol module,
but rather in the socket code module.
This list of examples is not complete.
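The first assumption in the list above, that addresses can be compared as
4 octets, can be made concrete with a short sketch.
The structure \fIvar_addr\fR and both comparison functions are hypothetical;
the 20-octet capacity is an assumption chosen to accommodate a variable-length
address of the kind CLNP uses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variable-length address: a length plus up to 20 octets. */
struct var_addr {
    unsigned char len;
    unsigned char octets[20];
};

/* The comparison the existing code assumes: exactly 4 octets. */
static int inet_style_equal(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, 4) == 0;
}

/* A comparison that respects the address length. */
static int var_addr_equal(const struct var_addr *a, const struct var_addr *b)
{
    assert(a != NULL && b != NULL);
    return a->len == b->len && memcmp(a->octets, b->octets, a->len) == 0;
}
```

Two 6-octet addresses that differ only in the fifth octet compare equal under
the 4-octet test and unequal under the length-aware test, which is why code
that embeds the 4-octet comparison cannot simply be reused for longer addresses.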
To summarize, adding a new transport protocol to the kernel consists of
adding entries to the tables in the protocol management
modifying the network interface driver(s) to recognize
new network protocol identifiers,
new system calls to the kernel and to the user library,
adding code modules for each of the protocols,
and correcting deficiencies in the socket code,
where the assumptions made about the nature of
transport protocols do not apply.