1 .\" Copyright (C) Caldera International Inc. 2001-2002. All rights reserved.
3 .\" Redistribution and use in source and binary forms, with or without
4 .\" modification, are permitted provided that the following conditions are
7 .\" Redistributions of source code and documentation must retain the above
8 .\" copyright notice, this list of conditions and the following
11 .\" Redistributions in binary form must reproduce the above copyright
12 .\" notice, this list of conditions and the following disclaimer in the
13 .\" documentation and/or other materials provided with the distribution.
15 .\" All advertising materials mentioning features or use of this software
16 .\" must display the following acknowledgement:
18 .\" This product includes software developed or owned by Caldera
19 .\" International, Inc. Neither the name of Caldera International, Inc.
20 .\" nor the names of other contributors may be used to endorse or promote
21 .\" products derived from this software without specific prior written
24 .\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
25 .\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
26 .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
27 .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
28 .\" DISCLAIMED. IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
29 .\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
30 .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
31 .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
32 .\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
33 .\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
34 .\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
35 .\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
37 .\" @(#)iosys 8.1 (Berkeley) 6/8/93
.EH 'PSD:3-%''The UNIX I/O System'
.OH 'The UNIX I/O System''PSD:3-%'
45 AT&T Bell Laboratories
This paper gives an overview of the workings of the UNIX\(dg
.FS
\(dgUNIX is a Trademark of Bell Laboratories.
.FE
I/O system.
53 It was written with an eye toward providing
54 guidance to writers of device driver routines,
55 and is oriented more toward describing the environment
56 and nature of device drivers than the implementation
57 of that part of the file system which deals with
60 It is assumed that the reader has a good knowledge
61 of the overall structure of the file system as discussed
62 in the paper ``The UNIX Time-sharing System.''
A more detailed discussion
appears in
``UNIX Implementation;''
the current document restates parts of that one,
but is still more detailed.
It is best read in
conjunction with a copy of the system code,
since it is basically an exegesis of that code.
There are two classes of device:
.I block
and
.I character.
The block interface is suitable for devices
like disks, tapes, and DECtape
which work, or can work, with addressable 512-byte blocks.
Ordinary magnetic tape just barely fits in this category,
since by use of forward
and
backward spacing any block can be read, even though
blocks can be written only at the end of the tape.
Block devices can at least potentially contain a mounted
file system.
The interface to block devices is very highly structured;
the drivers for these devices share a great many routines
as well as a pool of buffers.
92 Character-type devices have a much
93 more straightforward interface, although
94 more work must be done by the driver itself.
Devices of both types are named by a
.I major
and a
.I minor
device number.
These numbers are generally stored as an integer
with the minor device number
in the low-order 8 bits and the major device number
in the next-higher 8 bits;
macros
.I major
and
.I minor
are available to access these numbers.
110 The major device number selects which driver will deal with
111 the device; the minor device number is not used
112 by the rest of the system but is passed to the
113 driver at appropriate times.
114 Typically the minor number
115 selects a subdevice attached to
116 a given controller, or one of
117 several similar hardware interfaces.
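The packing of device numbers described above can be sketched in C.
This is a minimal illustration and not the system's own header:
the names devnum, dev_pack, dev_major and dev_minor are hypothetical,
and only the bit layout (minor number in the low-order 8 bits,
major number in the next-higher 8 bits) is taken from the text.

```c
#include <assert.h>

/* Hypothetical helpers; only the bit layout comes from the text. */
typedef unsigned int devnum;

/* Combine a major and a minor number into one device number. */
static devnum dev_pack(unsigned int maj, unsigned int min)
{
    assert(maj <= 0377 && min <= 0377);   /* each field is 8 bits wide */
    return (maj << 8) | min;
}

/* Major number: the next-higher 8 bits; selects the driver. */
static unsigned int dev_major(devnum d)
{
    return (d >> 8) & 0377;
}

/* Minor number: the low-order 8 bits; handed on to the driver. */
static unsigned int dev_minor(devnum d)
{
    return d & 0377;
}
```

Under these assumptions, a driver for major number 3 that is handed
the value produced by dev_pack(3, 5) would recover subdevice 5 with dev_minor.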
119 The major device numbers for block and character devices
120 are used as indices in separate tables;
121 they both start at 0 and therefore overlap.
130 system calls is to set up entries in three separate
132 The first of these is the
135 which is stored in the system's per-process
138 This table is indexed by
139 the file descriptor returned by the
143 and is accessed during
147 or other operation on the open file.
148 An entry contains only
149 a pointer to the corresponding
153 which is a per-system data base.
154 There is one entry in the
161 This table is per-system because the same instance
162 of an open file must be shared among the several processes
163 which can result from
165 after the file is opened.
169 flags which indicate whether the file
170 was open for reading or writing or is a pipe, and
171 a count which is used to decide when all processes
172 using the entry have terminated or closed the file
173 (so the entry can be abandoned).
174 There is also a 32-bit file offset
175 which is used to indicate where in the file the next read
176 or write will take place.
177 Finally, there is a pointer to the
178 entry for the file in the
181 which contains a copy of the file's i-node.
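The relationship among these tables can be sketched roughly as follows.
This is an illustration under stated assumptions, not the system's declarations:
every name ending in _sk, the table size, and the helper functions
fd_install, fork_files and fd_close are hypothetical.
What is taken from the text is only the shape: a per-process table indexed by
file descriptor, each slot pointing to a shared entry that holds flags,
a count, a file offset, and a pointer to the i-node entry.

```c
#include <assert.h>
#include <stddef.h>

#define NOFILE_SK 8                 /* hypothetical descriptor limit */

struct inode_sk { int i_count; };   /* stand-in for an in-core i-node */

/* One entry per instance of an open; shared by related processes. */
struct file_sk {
    int   f_flag;                   /* open for reading, writing, or a pipe */
    int   f_count;                  /* references from descriptor tables */
    long  f_offset;                 /* where the next read or write occurs */
    struct inode_sk *f_inode;
};

/* Per-process table: a descriptor is just an index into this array. */
struct proc_sk { struct file_sk *ofile[NOFILE_SK]; };

/* Install fp in the lowest free slot; return the descriptor or -1. */
static int fd_install(struct proc_sk *p, struct file_sk *fp)
{
    for (int fd = 0; fd < NOFILE_SK; fd++) {
        if (p->ofile[fd] == NULL) {
            p->ofile[fd] = fp;
            fp->f_count++;
            return fd;
        }
    }
    return -1;
}

/* Give a child the parent's descriptors, as a fork would. */
static void fork_files(const struct proc_sk *parent, struct proc_sk *child)
{
    for (int fd = 0; fd < NOFILE_SK; fd++) {
        child->ofile[fd] = parent->ofile[fd];
        if (child->ofile[fd] != NULL)
            child->ofile[fd]->f_count++;
    }
}

/* Drop one reference; return 1 when the entry can be abandoned. */
static int fd_close(struct proc_sk *p, int fd)
{
    struct file_sk *fp = p->ofile[fd];
    assert(fp != NULL && fp->f_count > 0);
    p->ofile[fd] = NULL;
    return --fp->f_count == 0;
}
```

Because parent and child point at the same entry, a change of offset made
through one descriptor is seen through the other, which is the reason the
text gives for keeping this table per-system.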
183 Certain open files can be designated ``multiplexed''
184 files, and several other flags apply to such
186 In such a case, instead of an offset,
187 there is a pointer to an associated multiplex channel table.
188 Multiplex channels will not be discussed here.
192 table corresponds precisely to an instance of
196 if the same file is opened several times,
198 entries in this table.
200 there is at most one entry
203 table for a given file.
204 Also, a file may enter the
206 table not only because it is open,
207 but also because it is the current directory
208 of some process or because it
209 is a special file containing a currently-mounted
214 table differs somewhat from the
215 corresponding i-node as stored on the disk;
216 the modified and accessed times are not stored,
217 and the entry is augmented
218 by a flag word containing information about the entry,
219 a count used to determine when it may be
220 allowed to disappear,
221 and the device and i-number
222 whence the entry came.
223 Also, the several block numbers that give addressing
224 information for the file are expanded from
225 the 3-byte, compressed format used on the disk to full
229 During the processing of an
233 call for a special file,
234 the system always calls the device's
236 routine to allow for any special processing
237 required (rewinding a tape, turning on
238 the data-terminal-ready lead of a modem, etc.).
242 routine is called only when the last
243 process closes a file,
244 that is, when the i-node table entry
245 is being deallocated.
Thus it is not feasible
for a device to maintain, or depend on,
a count of its users, although it is quite
possible to
implement an exclusive-use device which cannot
be reopened until it has been closed.
261 table entry are used to set up the
267 which respectively contain the (user) address
268 of the I/O target area, the byte-count for the transfer,
269 and the current location in the file.
If the file referred to is
a character-type special file, the appropriate read
or write routine is called; it is responsible
for transferring data and updating the
count and current location appropriately
as discussed below.
276 Otherwise, the current location is used to calculate
277 a logical block number in the file.
278 If the file is an ordinary file the logical block
279 number must be mapped (possibly using indirect blocks)
280 to a physical block number; a block-type
281 special file need not be mapped.
282 This mapping is performed by the
285 In any event, the resulting physical block number
286 is used, as discussed below, to
287 read or write the appropriate device.
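The first step above, turning the current location into a logical block
number, is simple arithmetic on the 512-byte block size.
The sketch below is illustrative only: the function names are hypothetical,
and the further mapping of a logical block to a physical block
(through indirect blocks for ordinary files) is deliberately not shown.

```c
#include <assert.h>

#define BSIZE_SK 512L   /* block size named in the text */

/* Logical block containing byte position pos. */
static long logical_block(long pos)
{
    assert(pos >= 0);
    return pos / BSIZE_SK;
}

/* Offset of pos within its block. */
static int block_offset(long pos)
{
    assert(pos >= 0);
    return (int)(pos % BSIZE_SK);
}

/* Bytes that can be moved from pos without crossing a block boundary. */
static int bytes_in_block(long pos, int count)
{
    int room = (int)BSIZE_SK - block_offset(pos);
    return count < room ? count : room;
}
```

A transfer that starts at byte 1000, for instance, lies in logical block 1
at offset 488, so at most 24 bytes can be moved before the next block
must be located.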
289 Character Device Drivers
293 table specifies the interface routines present for
295 Each device provides five routines:
296 open, close, read, write, and special-function
300 Any of these may be missing.
301 If a call on the routine
305 on non-exclusive devices that require no setup)
308 entry can be given as
310 if it should be considered an error,
313 on read-only devices)
319 structure also contains a pointer to the
321 structure associated with the terminal.
325 routine is called each time the file
326 is opened with the full device number as argument.
327 The second argument is a flag which is
328 non-zero only if the device is to be written upon.
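The way an open reaches a driver can be sketched as a table of routine
pointers indexed by major device number.
The sketch is an assumption-laden illustration, not the system's table:
the structure and function names, the table size, the use of a null pointer
for a missing routine, and the choice to treat a missing open routine as a
harmless no-op are all hypothetical.
From the text it takes only the five kinds of routine, the selection of the
driver by major number, and the two arguments passed to open.

```c
#include <assert.h>
#include <stddef.h>

#define NCHRDEV_SK 4    /* hypothetical table size */

/* The five per-device routines; any entry may be missing (NULL here). */
struct chrdev_sk {
    int (*d_open)(int dev, int flag);
    int (*d_close)(int dev, int flag);
    int (*d_read)(int dev);
    int (*d_write)(int dev);
    int (*d_special)(int dev, int *vec);
};

static struct chrdev_sk chrdev_table[NCHRDEV_SK];

static int last_open_dev = -1;      /* recorded by the sample driver */
static int last_open_flag = -1;

/* Sample driver open routine: it only remembers its arguments. */
static int sample_open(int dev, int flag)
{
    last_open_dev = dev;
    last_open_flag = flag;
    return 0;
}

/* Call the open routine selected by the major number of dev.
 * Returns -1 for an unknown major number, 0 if no routine is supplied. */
static int chr_open(int dev, int writing)
{
    int maj;

    assert(dev >= 0);
    maj = (dev >> 8) & 0377;
    if (maj >= NCHRDEV_SK)
        return -1;
    if (chrdev_table[maj].d_open == NULL)
        return 0;                   /* missing routine treated as a no-op */
    /* Full device number first, then a flag that is non-zero for writing. */
    return chrdev_table[maj].d_open(dev, writing ? 1 : 0);
}
```

A driver would be installed by storing its routines in the slot for its
major number, for example by setting the d_open member of slot 1 to sample_open.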
332 routine is called only when the file
is closed for the last time,
that is, when the very last process in
which the file is open closes it.
This means it is not possible for the driver to
maintain its own count of its users.
338 The first argument is the device number;
339 the second is a flag which is non-zero
340 if the file was open for writing in the process which
346 is called, it is supplied the device
348 The per-user variable
351 the number of characters indicated by the user;
352 for character devices, this number may be 0
355 is the address supplied by the user from which to start
357 The system may call the
358 routine internally, so the
361 is supplied that indicates,
366 refers to the system address space instead of
374 characters from the user's buffer to the device,
377 for each character passed.
378 For most drivers, which work one character at a time,
381 is used to pick up characters
382 from the user's buffer.
383 Successive calls on it return
384 the characters to be written until
386 goes to 0 or an error occurs,
387 when it returns \(mi1.
389 takes care of interrogating
394 Write routines which want to transfer
395 a probably large number of characters into an internal
396 buffer may also use the routine
397 .I "iomove(buffer, offset, count, flag)"
398 which is faster when many characters must be moved.
406 bytes from the start of the buffer;
410 (which is 0) in the write case.
412 the caller is responsible for making sure
413 the count is not too large and is non-zero.
As an efficiency note,
.I iomove
is much slower if any of
417 .I "buffer+offset, count"
424 routine is called under conditions similar to
428 is guaranteed to be non-zero.
429 To return characters to the user, the routine
431 is available; it takes care of housekeeping
434 and returns \(mi1 as the last character
437 is returned to the user;
438 before that time, 0 is returned.
440 is also usable as with
444 but the same cautions apply.
446 The ``special-functions'' routine
451 system calls as follows:
455 is a pointer to the device's routine,
457 is the device number,
464 the device is supposed to place up to 3 words of status information
465 into the vector; this will be returned to the caller.
471 the device should take up to 3 words of
472 control information from
Finally, each device should have appropriate interrupt-time
routines.
When an interrupt occurs, it is turned into a C-compatible call
on the device's interrupt routine.
480 The interrupt-catching mechanism makes
481 the low-order four bits of the ``new PS'' word in the
482 trap vector for the interrupt available
483 to the interrupt handler.
484 This is conventionally used by drivers
485 which deal with multiple similar devices
486 to encode the minor device number.
487 After the interrupt has been processed,
488 a return from the interrupt handler will
489 return from the interrupt itself.
491 A number of subroutines are available which are useful
492 to character device drivers.
493 Most of these handlers, for example, need a place
494 to buffer characters in the internal interface
495 between their ``top half'' (read/write)
496 and ``bottom half'' (interrupt) routines.
497 For relatively low data-rate devices, the best mechanism
498 is the character queue maintained by the
503 A queue header has the structure
506 int c_cc; /* character count */
507 char *c_cf; /* first character */
508 char *c_cl; /* last character */
511 A character is placed on the end of a queue by
518 The routine returns \(mi1 if there is no space
519 to put the character, 0 otherwise.
520 The first character on the queue may be retrieved
523 which returns either the (non-negative) character
524 or \(mi1 if the queue is empty.
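The return conventions of the two queue operations can be illustrated with
a small stand-alone queue.
This sketch is a simplification and not the system's mechanism:
it uses a private fixed-size ring rather than storage shared among all
devices, and the names cqueue_sk, q_putc and q_getc are hypothetical.
It keeps only the behavior stated in the text: placing a character returns
\(mi1 when there is no space and 0 otherwise, and retrieval returns a
non-negative character or \(mi1 when the queue is empty.

```c
#include <assert.h>

#define QSIZE_SK 8      /* hypothetical capacity of this private queue */

struct cqueue_sk {
    int  c_cc;                  /* character count */
    int  head;                  /* index of the first character */
    char slot[QSIZE_SK];
};

/* Place c on the end of the queue; -1 if there is no space, else 0. */
static int q_putc(int c, struct cqueue_sk *q)
{
    assert(q->c_cc >= 0 && q->c_cc <= QSIZE_SK);
    if (q->c_cc == QSIZE_SK)
        return -1;
    q->slot[(q->head + q->c_cc) % QSIZE_SK] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
static int q_getc(struct cqueue_sk *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = (unsigned char)q->slot[q->head];    /* always non-negative */
    q->head = (q->head + 1) % QSIZE_SK;
    q->c_cc--;
    return c;
}
```

Returning the character as a non-negative value is what lets \(mi1 serve
unambiguously as the empty indication, even for a byte with all bits set.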
526 Notice that the space for characters in queues is
527 shared among all devices in the system
528 and in the standard system there are only some 600
529 character slots available.
530 Thus device handlers,
531 especially write routines, must take
532 care to avoid gobbling up excessive numbers of characters.
534 The other major help available
535 to device handlers is the sleep-wakeup mechanism.
537 .I "sleep(event, priority)"
538 causes the process to wait (allowing other processes to run)
542 at that time, the process is marked ready-to-run
543 and the call will return when there is no
551 has happened, that is, causes processes sleeping
552 on the event to be awakened.
555 is an arbitrary quantity agreed upon
556 by the sleeper and the waker-up.
557 By convention, it is the address of some data area used
558 by the driver, which guarantees that events
561 Processes sleeping on an event should not assume
562 that the event has really happened;
563 they should check that the conditions which
564 caused them to sleep no longer hold.
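The bookkeeping behind this mechanism can be imitated in a few lines.
What follows is a simulation under assumptions, not kernel code:
there is no real scheduling, and the process table, the state names and the
functions sk_sleep and sk_wakeup are hypothetical.
It shows only that an event is an address compared for equality and that a
wakeup readies every process sleeping on that address.

```c
#include <assert.h>
#include <stddef.h>

#define NPROC_SK 4

enum pstate_sk { P_RUN, P_SLEEP };

struct slp_proc_sk {
    enum pstate_sk state;
    const void *wchan;      /* event awaited; an address by convention */
    int pri;                /* priority given to the sleep call */
};

static struct slp_proc_sk proctab[NPROC_SK];

/* Mark process p as waiting for event at the given priority. */
static void sk_sleep(struct slp_proc_sk *p, const void *event, int pri)
{
    assert(event != NULL);
    p->state = P_SLEEP;
    p->wchan = event;
    p->pri = pri;
}

/* Make every process sleeping on event ready to run; return how many. */
static int sk_wakeup(const void *event)
{
    int n = 0;

    for (int i = 0; i < NPROC_SK; i++) {
        if (proctab[i].state == P_SLEEP && proctab[i].wchan == event) {
            proctab[i].state = P_RUN;
            proctab[i].wchan = NULL;
            n++;
        }
    }
    return n;
}
```

Since a wakeup readies all sleepers on the address, a driver would
typically call its sleep inside a loop that tests the condition again
each time the call returns.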
566 Priorities can range from 0 to 127;
567 a higher numerical value indicates a less-favored
568 scheduling situation.
A distinction is made between processes sleeping
at priority less than the parameter
.I PZERO
and those at numerically larger priorities.
The former cannot
be interrupted by signals, although it
is conceivable that it may be swapped out.
576 Thus it is a bad idea to sleep with
577 priority less than PZERO on an event which might never occur.
On the other hand, calls to
.I sleep
with larger priority
may never return if the process is terminated by
some signal in the meantime.
Incidentally, it is a gross error to call
.I sleep
in a routine called at interrupt time, since the process
which is running is almost certainly not the
process which should go to sleep.
588 Likewise, none of the variables in the user area
590 should be touched, let alone changed, by an interrupt routine.
If a device driver
wishes to wait for some event for which it is inconvenient
or impossible to supply a
.I wakeup
(for example, a device going on-line, which does not
generally cause an interrupt),
the call
.I "sleep(&lbolt, priority)"
may be given.
.I Lbolt
is an external cell whose address is awakened once every 4 seconds
by the clock interrupt routine.
The routines
.I "spl4( ), spl5( ), spl6( ), spl7( )"
are available to
set the processor priority level as indicated to avoid
inconvenient interrupts from the device.
If a device needs to know about real-time intervals,
then
.I "timeout(func, arg, interval)"
will be useful.
This routine arranges that after
.I interval
sixtieths of a second, the
.I func
will be called with
.I arg
as argument, in the style
.I "(*func)(arg)."
Timeouts are used, for example,
to provide real-time delays after function characters
like new-line and tab in typewriter output,
and to terminate an attempt to
read the 201 Dataphone
if there is no response within a specified number
of seconds.
631 Notice that the number of sixtieths of a second is limited to 32767,
632 since it must appear to be positive,
633 and that only a bounded number of timeouts
634 can be going on at once.
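A bounded table of pending timeouts might look roughly like the following.
This is a sketch under assumptions and not the system's implementation:
the table layout, its size, the names sk_timeout and sk_clock_tick, and the
choice to return \(mi1 when the table is full are all hypothetical.
The text supplies only the argument order, the unit of sixtieths of a second,
the limit of 32767, and the fact that the number of timeouts is bounded.

```c
#include <assert.h>
#include <stddef.h>

#define NCALL_SK 4      /* hypothetical bound on pending timeouts */

struct callout_sk {
    void (*func)(int);  /* NULL marks a free slot */
    int  arg;
    int  ticks;         /* sixtieths of a second still to wait */
};

static struct callout_sk callouts[NCALL_SK];

/* Arrange for func(arg) after interval ticks; -1 if the table is full. */
static int sk_timeout(void (*func)(int), int arg, int interval)
{
    assert(func != NULL && interval > 0 && interval <= 32767);
    for (int i = 0; i < NCALL_SK; i++) {
        if (callouts[i].func == NULL) {
            callouts[i].func = func;
            callouts[i].arg = arg;
            callouts[i].ticks = interval;
            return 0;
        }
    }
    return -1;
}

/* Called once per clock tick: run and retire any expired entries. */
static void sk_clock_tick(void)
{
    for (int i = 0; i < NCALL_SK; i++) {
        if (callouts[i].func != NULL && --callouts[i].ticks == 0) {
            void (*f)(int) = callouts[i].func;
            callouts[i].func = NULL;    /* free the slot before calling */
            f(callouts[i].arg);
        }
    }
}

static int fired_arg = -1;              /* set by the sample handler */

/* Sample handler: records the argument it was called with. */
static void sample_handler(int arg)
{
    fired_arg = arg;
}
```

With this table a call of sk_timeout(sample_handler, 7, 3) fires on the
third clock tick and then frees its slot for reuse.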
637 is called at clock-interrupt time, so it should
638 conform to the requirements of interrupt routines
641 The Block-device Interface
643 Handling of block devices is mediated by a collection
644 of routines that manage a set of buffers containing
645 the images of blocks of data on the various devices.
646 The most important purpose of these routines is to assure
647 that several processes that access the same block of the same
648 device in multiprogrammed fashion maintain a consistent
649 view of the data in the block.
A secondary but still important purpose is to increase
the efficiency of the system by
keeping in-core copies of blocks that are being
accessed frequently.
654 The main data base for this mechanism is the
657 Each buffer header contains a pair of pointers
658 .I "(b_forw, b_back)"
659 which maintain a doubly-linked list
660 of the buffers associated with a particular
663 .I "(av_forw, av_back)"
664 which generally maintain a doubly-linked list of blocks
665 which are ``free,'' that is,
666 eligible to be reallocated for another transaction.
667 Buffers that have I/O in progress
668 or are busy for other purposes do not appear in this list.
670 also contains the device and block number to which the
671 buffer refers, and a pointer to the actual storage associated with
673 There is a word count
674 which is the negative of the number of words
675 to be transferred to or from the buffer;
676 there is also an error byte and a residual word
677 count used to communicate information
678 from an I/O routine to its caller.
679 Finally, there is a flag word
680 with bits indicating the status of the buffer.
681 These flags will be discussed below.
683 Seven routines constitute
684 the most important part of the interface with the
686 Given a device and block number,
691 return a pointer to a buffer header for the block;
692 the difference is that
694 is guaranteed to return a buffer actually containing the
695 current data for the block,
698 returns a buffer which contains the data in the
699 block only if it is already in core (whether it is
700 or not is indicated by the
703 In either case the buffer, and the corresponding
704 device block, is made ``busy,''
705 so that other processes referring to it
706 are obliged to wait until it becomes free.
708 is used, for example,
when a block is about to be totally rewritten,
so that its previous contents are
not useful;
still, no other process can be allowed to refer to the block
until the new data is placed into it.
routine is used to implement read-ahead.
It is logically similar to
720 but takes as an additional argument the number of
721 a block (on the same device) to be read asynchronously
722 after the specifically requested block is available.
724 Given a pointer to a buffer,
728 makes the buffer again available to other processes.
729 It is called, for example, after
730 data has been extracted following a
732 There are three subtly-different write routines,
733 all of which take a buffer pointer as argument,
734 and all of which logically release the buffer for
735 use by others and place it on the free list.
738 buffer on the appropriate device queue,
739 waits for the write to be done,
740 and sets the user's error flag if required.
742 places the buffer on the device's queue, but does not wait
743 for completion, so that errors cannot be reflected directly to
746 does not start any I/O operation at all,
748 the buffer so that if it happens
749 to be grabbed from the free list to contain
750 data from some other block, the data in it will
755 is used when one wants to be sure that
756 I/O takes place correctly, and that
757 errors are reflected to the proper user;
758 it is used, for example, when updating i-nodes.
760 is useful when more overlap is desired
761 (because no wait is required for I/O to finish)
762 but when it is reasonably certain that the
763 write is really required.
765 is used when there is doubt that the write is
766 needed at the moment.
769 is called when the last byte of a
771 system call falls short of the end of a
772 block, on the assumption that
775 will be given soon which will re-use the same block.
777 as the end of a block is passed,
779 is called, since probably the block will
780 not be accessed again soon and one might as
781 well start the writing process as soon as possible.
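The choice described in this paragraph, between the write that starts no I/O
and the write that is queued without waiting, depends only on where the
system call's last byte falls within a block.
The following is a hedged sketch of that decision alone; the enumeration and
the function name choose_write are hypothetical, and the actual buffer
routines are not modeled.

```c
#include <assert.h>

#define BLOCK_SK 512L   /* block size named in the text */

enum wkind_sk { W_DELAYED, W_ASYNC };

/* end_pos is the file position just past the last byte written. */
static enum wkind_sk choose_write(long end_pos)
{
    assert(end_pos > 0);
    if (end_pos % BLOCK_SK != 0)
        return W_DELAYED;   /* block only partly filled; more may follow */
    return W_ASYNC;         /* end of block passed; start the write now */
}
```

A write ending at byte 100 of a file would thus be delayed, while one
ending exactly at byte 512 would be started at once.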
783 In any event, notice that the routines
787 dedicate the given block exclusively to the
788 use of the caller, and make others wait,
790 .I "brelse, bwrite, bawrite,"
793 must eventually be called to free the block for use by others.
795 As mentioned, each buffer header contains a flag
796 word which indicates the status of the buffer.
798 one important channel for information between the drivers and the
799 block I/O system, it is important to understand these flags.
800 The following names are manifest constants which
801 select the associated flag bits.
803 This bit is set when the buffer is handed to the device strategy routine
804 (see below) to indicate a read operation.
807 is defined as 0 and does not define a flag; it is provided
808 as a mnemonic convenience to callers of routines like
810 which have a separate argument
811 which indicates read or write.
to 0 when a block is handed to the device strategy
routine and is turned on when the operation completes,
whether normally or as the result of an error.
817 It is also used as part of the return argument of
to indicate, if 1, that the returned
buffer actually contains the data in the requested block.
822 This bit may be set to 1 when
824 is set to indicate that an I/O or other error occurred.
827 byte of the buffer header may contain an error code
831 is 0 the nature of the error is not specified.
832 Actually no driver at present sets
834 the latter is provided for a future improvement
835 whereby a more detailed error-reporting
836 scheme may be implemented.
838 This bit indicates that the buffer header is not on
839 the free list, i.e. is
840 dedicated to someone's exclusive use.
841 The buffer still remains attached to the list of
842 blocks associated with its device, however.
847 which calls it) searches the buffer list
848 for a given device and finds the requested
849 block with this bit on, it sleeps until the bit
852 This bit is set for raw I/O transactions that
853 need to allocate the Unibus map on an 11/70.
855 This bit is set on buffers that have the Unibus map allocated,
858 routine knows to deallocate the map.
860 This flag is used in conjunction with the
863 Before sleeping as described
867 Conversely, when the block is freed and the busy bit
872 is given for the block header whenever
This stratagem avoids the overhead
878 every time a buffer is freed on the chance that someone
881 This bit may be set on buffers just before releasing them; if it
883 the buffer is placed at the head of the free list, rather than at the
885 It is a performance heuristic
886 used when the caller judges that the same block will not soon be used again.
890 to indicate to the appropriate device driver
891 that the buffer should be released when the
892 write has been finished, usually at interrupt time.
893 The difference between
897 is that the former starts I/O, waits until it is done, and
899 The latter merely sets this bit and starts I/O.
900 The bit indicates that
902 should be called for the buffer on completion.
906 before releasing the buffer.
909 while searching for a free block,
910 discovers the bit is 1 in a buffer it would otherwise grab,
911 it causes the block to be written out before reusing it.
917 table contains the names of the interface routines
918 and that of a table for each block device.
920 Just as for character devices, block device drivers may supply
926 called respectively on each open and on the final close
928 Instead of separate read and write routines,
929 each block device driver has a
931 routine which is called with a pointer to a buffer
933 As discussed, the buffer header contains
934 a read/write flag, the core address,
935 the block number, a (negative) word count,
936 and the major and minor device number.
937 The role of the strategy routine
938 is to carry out the operation as requested by the
939 information in the buffer header.
940 When the transaction is complete the
952 In cases where the device
953 is capable, under error-free operation,
954 of transferring fewer words than requested,
955 the device's word-count register should be placed
956 in the residual count slot of
958 otherwise, the residual count should be set to 0.
959 This particular mechanism is really for the benefit
960 of the magtape driver;
961 when reading this device
962 records shorter than requested are quite normal,
963 and the user should be told the actual length of the record.
965 Although the most usual argument
966 to the strategy routines
967 is a genuine buffer header allocated as discussed above,
968 all that is actually required
969 is that the argument be a pointer to a place containing the
970 appropriate information.
973 routine, which manages movement
974 of core images to and from the swapping device,
975 uses the strategy routine
977 Care has to be taken that
978 no extraneous bits get turned on in the
981 The device's table specified by
984 byte to contain an active flag and an error count,
985 a pair of links which constitute the
986 head of the chain of buffers for the device
987 .I "(b_forw, b_back),"
988 and a first and last pointer for a device queue.
989 Of these things, all are used solely by the device driver
991 except for the buffer-chain pointers.
992 Typically the flag encodes the state of the
993 device, and is used at a minimum to
994 indicate that the device is currently engaged in
995 transferring information and no new command should be issued.
The error count is useful for counting retries
when errors occur.
The device queue is used to remember stacked requests;
in the simplest case it may be maintained as a first-in
first-out list.
1001 Since buffers which have been handed over to
1002 the strategy routines are never
1003 on the list of free buffers,
1004 the pointers in the buffer which maintain the free list
1005 .I "(av_forw, av_back)"
1006 are also used to contain the pointers
1007 which maintain the device queues.
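The simplest case mentioned above, a first-in first-out device queue
threaded through the free-list pointer, can be sketched as follows.
This is an illustration under assumptions: the structures are cut down to
the fields needed here, and the member names first and last and the
functions dq_append and dq_remove are hypothetical.
Only the reuse of av_forw as the queue link and the existence of a first
and a last pointer come from the text.

```c
#include <assert.h>
#include <stddef.h>

struct buf_sk {
    struct buf_sk *av_forw;     /* free-list link, reused as queue link */
    int blkno;
};

struct devtab_sk {
    int active;                 /* non-zero while a transfer is under way */
    struct buf_sk *first;       /* oldest queued request */
    struct buf_sk *last;        /* newest queued request */
};

/* Append bp to the device's queue, as a strategy routine might. */
static void dq_append(struct devtab_sk *dp, struct buf_sk *bp)
{
    assert(bp != NULL);
    bp->av_forw = NULL;
    if (dp->first == NULL)
        dp->first = bp;
    else
        dp->last->av_forw = bp;
    dp->last = bp;
}

/* Remove and return the oldest request, or NULL if none is queued. */
static struct buf_sk *dq_remove(struct devtab_sk *dp)
{
    struct buf_sk *bp = dp->first;

    if (bp != NULL) {
        dp->first = bp->av_forw;
        if (dp->first == NULL)
            dp->last = NULL;
        bp->av_forw = NULL;
    }
    return bp;
}
```

The link can be shared safely because, as the text notes, a buffer handed
to a strategy routine is never on the free list at the same time.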
1009 A couple of routines
1010 are provided which are useful to block device drivers.
1012 arranges that the buffer to which
1014 points be released or awakened,
1017 strategy module has finished with the buffer,
1018 either normally or after an error.
1019 (In the latter case the
1021 bit has presumably been set.)
1025 can be used to examine the error bit in a buffer header
1026 and arrange that any error indication found therein is
1027 reflected to the user.
1028 It may be called only in the non-interrupt
1029 part of a driver when I/O has completed
1033 Raw Block-device I/O
1035 A scheme has been set up whereby block device drivers may
1036 provide the ability to transfer information
1037 directly between the user's core image and the device
1038 without the use of buffers and in blocks as large as
1039 the caller requests.
1040 The method involves setting up a character-type special file
1041 corresponding to the raw device
1046 routines which set up what is usually a private,
1047 non-shared buffer header with the appropriate information
1048 and call the device's strategy routine.
1049 If desired, separate
1053 routines may be provided but this is usually unnecessary.
1054 A special-function routine might come in handy, especially for
1057 A great deal of work has to be done to generate the
1058 ``appropriate information''
1059 to put in the argument buffer for
1060 the strategy module;
1061 the worst part is to map relocated user addresses to physical addresses.
1062 Most of this work is done by
.I "physio(strat, bp, dev, rw)"
1064 whose arguments are the name of the
1071 and a read-write flag
1073 whose value is either
1078 makes sure that the user's base address and count are
1079 even (because most devices work in words)
1080 and that the core area affected is contiguous
1082 it delays until the buffer is not busy, and makes it
1083 busy while the operation is in progress;
1084 and it sets up user error return information.
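The first of these checks, that the base address and count are even,
is easy to illustrate.
The sketch below is not the routine itself: the function names are
hypothetical, a word is assumed to be two bytes,
and the contiguity check, the busy handling and the error return
are not modeled.

```c
#include <assert.h>

/* Return 1 if a raw transfer's base address and byte count are both even. */
static int raw_args_even(unsigned long base, unsigned long count)
{
    return ((base | count) & 1UL) == 0;
}

/* Number of words moved by a transfer of count bytes (2-byte words assumed). */
static unsigned long raw_word_count(unsigned long count)
{
    assert((count & 1UL) == 0);
    return count / 2;
}
```

Under the two-byte assumption a 512-byte raw transfer from an even address
passes the check and moves 256 words.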