share/doc/psd/02.implement/implement

   1 .\" Copyright (c) 1986, 1993
   2 .\"     The Regents of the University of California.  All rights reserved.
   3 .\"
   4 .\" Copyright (C) Caldera International Inc. 2001-2002.  All rights reserved.
   5 .\"
   6 .\" Redistribution and use in source and binary forms, with or without
   7 .\" modification, are permitted provided that the following conditions are
   8 .\" met:
   9 .\"
  10 .\" Redistributions of source code and documentation must retain the above
  11 .\" copyright notice, this list of conditions and the following
  12 .\" disclaimer.
  13 .\"
  14 .\" Redistributions in binary form must reproduce the above copyright
  15 .\" notice, this list of conditions and the following disclaimer in the
  16 .\" documentation and/or other materials provided with the distribution.
  17 .\"
  18 .\" All advertising materials mentioning features or use of this software
  19 .\" must display the following acknowledgement:
  20 .\"
  21 .\" This product includes software developed or owned by Caldera
  22 .\" International, Inc.  Neither the name of Caldera International, Inc.
  23 .\" nor the names of other contributors may be used to endorse or promote
  24 .\" products derived from this software without specific prior written
  25 .\" permission.
  26 .\"
  27 .\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
  28 .\" INTERNATIONAL, INC.  AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
  29 .\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  30 .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  31 .\" DISCLAIMED.  IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
  32 .\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
  33 .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  34 .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  35 .\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  36 .\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
  37 .\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
  38 .\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  39 .\"
  40 .\"     @(#)implement   8.1 (Berkeley) 6/8/93
  41 .\"
  42 .\" $FreeBSD$
  43 .EH 'PSD:2-%''UNIX Implementation'
  44 .OH 'UNIX Implementation''PSD:2-%'
  45 .de P1
  46 .DS
  47 ..
  48 .de P2
  49 .DE
  50 ..
  51 .de UL
  52 .lg 0
  53 .if n .ul
  54 \%\&\\$3\f3\\$1\fR\&\\$2
  55 .lg
  56 ..
  57 .de UC
  58 \&\\$3\s-1\\$1\\s0\&\\$2
  59 ..
  60 .de IT
  61 .lg 0
  62 .if n .ul
  63 \%\&\\$3\f2\\$1\fR\&\\$2
  64 .lg
  65 ..
  66 .de SP
  67 .sp \\$1
  68 ..
  69 .hw device
  70 .TL
  71 UNIX Implementation
  72 .AU "MH 2C-523" 2394
  73 K. Thompson
  74 .AI
  75 AT&T Bell Laboratories
  76 Murray Hill, NJ
  77 .AB
  78 This paper describes in high-level terms the
  79 implementation of the resident
  80 .UX
  81 kernel.
  82 This discussion is broken into three parts.
  83 The first part describes
  84 how the
  85 .UX
  86 system views processes, users, and programs.
  87 The second part describes the I/O system.
  88 The last part describes the
  89 .UX
  90 file system.
  91 .AE
  92 .NH
  93 INTRODUCTION
  94 .PP
  95 The
  96 .UX
  97 kernel consists of about 10,000
  98 lines of C code and about 1,000 lines of assembly code.
  99 The assembly code can be further broken down into
 100 200 lines included for
 101 the sake of efficiency
 102 (they could have been written in C)
 103 and 800 lines to perform hardware
 104 functions not possible in C.
 105 .PP
 106 This code represents 5 to 10 percent of what has
 107 been lumped into the broad expression
 108 ``the
 109 .UX
 110 operating system.''
 111 The kernel is the only
 112 .UX
 113 code that
 114 cannot be substituted by a user to his
 115 own liking.
 116 For this reason,
 117 the kernel should make as few real
 118 decisions as possible.
 119 This does not mean to allow the user
 120 a million options to do the same thing.
 121 Rather, it means to allow only one way to
 122 do one thing,
 123 but have that way be the least-common divisor
 124 of all the options that might have been provided.
 125 .PP
 126 What is or is not implemented in the kernel
 127 represents both a great responsibility and a great power.
 128 It is a soap-box platform on
 129 ``the way things should be done.''
 130 Even so, if
 131 ``the way'' is too radical,
 132 no one will follow it.
 133 Every important decision was weighed
 134 carefully.
 135 Throughout,
 136 simplicity has been substituted for efficiency.
 137 Complex algorithms are used only if
 138 their complexity can be localized.
 139 .NH
 140 PROCESS CONTROL
 141 .PP
 142 In the
 143 .UX
 144 system,
 145 a user executes programs in an
 146 environment called a user process.
 147 When a system function is required,
 148 the user process calls the system
 149 as a subroutine.
 150 At some point in this call,
 151 there is a distinct switch of environments.
 152 After this,
 153 the process is said to be a system process.
 154 In the normal definition of processes,
 155 the user and system processes are different
 156 phases of the same process
 157 (they never execute simultaneously).
 158 For protection,
 159 each system process has its own stack.
 160 .PP
 161 The user process may execute
 162 from a read-only text segment,
 163 which is shared by all processes
 164 executing the same code.
 165 There is no
 166 .IT functional
 167 benefit
 168 from shared-text segments.
 169 An
 170 .IT efficiency
 171 benefit comes from the fact
 172 that there is no need to swap read-only
 173 segments out because the original
 174 copy on secondary memory is still current.
 175 This is a great benefit to interactive
 176 programs that tend to be swapped while
 177 waiting for terminal input.
 178 Furthermore,
 179 if two processes are
 180 executing
 181 simultaneously
 182 from the same copy of a read-only segment,
 183 only one copy needs to reside in
 184 primary memory.
 185 This is a secondary effect,
 186 because
 187 simultaneous execution of a program
 188 is not common.
 189 It is ironic that this effect,
 190 which reduces the use of primary memory,
 191 only comes into play when there is
 192 an overabundance of primary memory,
 193 that is,
 194 when there is enough memory
 195 to keep waiting processes loaded.
 196 .PP
 197 All current read-only text segments in the
 198 system are maintained from the
 199 .IT "text table" .
 200 A text table entry holds the location of the
 201 text segment on secondary memory.
 202 If the segment is loaded,
 203 that table also holds the primary memory location
 204 and the count of the number of processes
 205 sharing this entry.
 206 When this count is reduced to zero,
 207 the entry is freed along with any
 208 primary and secondary memory holding the segment.
 209 When a process first executes a shared-text segment,
 210 a text table entry is allocated and the
 211 segment is loaded onto secondary memory.
 212 If a second process executes a text segment
 213 that is already allocated,
 214 the entry reference count is simply incremented.
 215 .PP
 216 A user process has some strictly private
 217 read-write data
 218 contained in its
 219 data segment.
 220 As far as possible,
 221 the system does not use the user's
 222 data segment to hold system data.
 223 In particular,
 224 there are no I/O buffers in the
 225 user address space.
 226 .PP
 227 The user data segment has two growing boundaries.
 228 One, increased automatically by the system
 229 as a result of memory faults,
 230 is used for a stack.
 231 The second boundary is only grown (or shrunk) by
 232 explicit requests.
 233 The contents of newly allocated primary memory
 234 is initialized to zero.
 235 .PP
 236 Also associated and swapped with
 237 a process is a small fixed-size
 238 system data segment.
 239 This segment contains all
 240 the data about the process
 241 that the system needs only when the
 242 process is active.
 243 Examples of the kind of data contained
 244 in the system data segment are:
 245 saved central processor registers,
 246 open file descriptors,
 247 accounting information,
 248 scratch data area,
 249 and the stack for the system phase
 250 of the process.
 251 The system data segment is not
 252 addressable from the user process
 253 and is therefore protected.
 254 .PP
 255 Last,
 256 there is a process table with
 257 one entry per process.
 258 This entry contains all the data
 259 needed by the system when the process
 260 is
 261 .IT not
 262 active.
 263 Examples are
 264 the process's name,
 265 the location of the other segments,
 266 and scheduling information.
 267 The process table entry is allocated
 268 when the process is created, and freed
 269 when the process terminates.
 270 This process entry is always directly
 271 addressable by the kernel.
 272 .PP
 273 Figure 1 shows the relationships
 274 between the various process control
 275 data.
 276 In a sense,
 277 the process table is the
 278 definition of all processes,
 279 because
 280 all the data associated with a process
 281 may be accessed
 282 starting from the process table entry.
 283 .KF
 284 .if t .in .375i
 285 .so fig1.pic
 286 .if t .in -.375i
 287 .sp 2v
 288 .ce
 289 Fig. 1\(emProcess control data structure.
 290 .KE
 291 .NH 2
 292 Process creation and program execution
 293 .PP
 294 Processes are created by the system primitive
 295 .UL fork .
 296 The newly created process (child) is a copy of the original process (parent).
 297 There is no detectable sharing of primary memory between the two processes.
 298 (Of course,
 299 if the parent process was executing from a read-only
 300 text segment,
 301 the child will share the text segment.)
 302 Copies of all writable data segments
 303 are made for the child process.
 304 Files that were open before the
 305 .UL fork
 306 are
 307 truly shared after the
 308 .UL fork .
 309 The processes are informed as to their part in the
 310 relationship to
 311 allow them to select their own
 312 (usually non-identical)
 313 destiny.
 314 The parent may
 315 .UL wait
 316 for the termination of
 317 any of its children.
 318 .PP
 319 A process may
 320 .UL exec
 321 a file.
 322 This consists of exchanging the current text and data
 323 segments of the process for new text and data
 324 segments specified in the file.
 325 The old segments are lost.
 326 Doing an
 327 .UL exec
 328 does
 329 .IT not
 330 change processes;
 331 the process that did the
 332 .UL exec
 333 persists,
 334 but
 335 after the
 336 .UL exec
 337 it is executing a different program.
 338 Files that were open
 339 before the
 340 .UL exec
 341 remain open after the
 342 .UL exec .
 343 .PP
 344 If a program,
 345 say the first pass of a compiler,
 346 wishes to overlay itself with another program,
 347 say the second pass,
 348 then it simply
 349 .UL exec s
 350 the second program.
 351 This is analogous
 352 to a ``goto.''
 353 If a program wishes to regain control
 354 after
 355 .UL exec ing
 356 a second program,
 357 it should
 358 .UL fork
 359 a child process,
 360 have the child
 361 .UL exec
 362 the second program, and
 363 have the parent
 364 .UL wait
 365 for the child.
 366 This is analogous to a ``call.''
 367 Breaking up the call into a binding followed by
 368 a transfer is similar to the subroutine linkage in
 369 SL-5.
 370 .[
 371 griswold hanson sl5 overview
 372 .]
 373 .NH 2
 374 Swapping
 375 .PP
 376 The major data associated with a process
 377 (the user data segment,
 378 the system data segment, and
 379 the text segment)
 380 are swapped to and from secondary
 381 memory, as needed.
 382 The user data segment and the system data segment
 383 are kept in contiguous primary memory to reduce
 384 swapping latency.
 385 (When low-latency devices, such as bubbles,
 386 .UC CCD s,
 387 or scatter/gather devices,
 388 are used,
 389 this decision will have to be reconsidered.)
 390 Allocation of both primary
 391 and secondary memory is performed
 392 by the same simple first-fit algorithm.
 393 When a process grows,
 394 a new piece of primary memory is allocated.
 395 The contents of the old memory is copied to the new memory.
 396 The old memory is freed
 397 and the tables are updated.
 398 If there is not enough primary memory,
 399 secondary memory is allocated instead.
 400 The process is swapped out onto the
 401 secondary memory,
 402 ready to be swapped in with
 403 its new size.
 404 .PP
 405 One separate process in the kernel,
 406 the swapping process,
 407 simply swaps the other
 408 processes in and out of primary memory.
 409 It examines the
 410 process table looking for a process
 411 that is swapped out and is
 412 ready to run.
 413 It allocates primary memory for that
 414 process and
 415 reads its segments into
 416 primary memory, where that process competes for the
 417 central processor with other loaded processes.
 418 If no primary memory is available,
 419 the swapping process makes memory available
 420 by examining the process table for processes
 421 that can be swapped out.
 422 It selects a process to swap out,
 423 writes it to secondary memory,
 424 frees the primary memory,
 425 and then goes back to look for a process
 426 to swap in.
 427 .PP
 428 Thus there are two specific algorithms
 429 to the swapping process.
 430 Which of the possibly many processes that
 431 are swapped out is to be swapped in?
 432 This is decided by secondary storage residence
 433 time.
 434 The one with the longest time out is swapped in first.
 435 There is a slight penalty for larger processes.
 436 Which of the possibly many processes that
 437 are loaded is to be swapped out?
 438 Processes that are waiting for slow events
 439 (i.e., not currently running or waiting for
 440 disk I/O)
 441 are picked first,
 442 by age in primary memory,
 443 again with size penalties.
 444 The other processes are examined
 445 by the same age algorithm,
 446 but are not taken out unless they are
 447 at least of some age.
 448 This adds
 449 hysteresis to the swapping and
 450 prevents total thrashing.
 451 .PP
 452 These swapping algorithms are the
 453 most suspect in the system.
 454 With limited primary memory,
 455 these algorithms cause total swapping.
 456 This is not bad in itself, because
 457 the swapping does not impact the
 458 execution of the resident processes.
 459 However, if the swapping device must
 460 also be used for file storage,
 461 the swapping traffic severely
 462 impacts the file system traffic.
 463 It is exactly these small systems
 464 that tend to double usage of limited disk
 465 resources.
 466 .NH 2
 467 Synchronization and scheduling
 468 .PP
 469 Process synchronization is accomplished by having processes
 470 wait for events.
 471 Events are represented by arbitrary integers.
 472 By convention,
 473 events are chosen to be addresses of
 474 tables associated with those events.
 475 For example, a process that is waiting for
 476 any of its children to terminate will wait
 477 for an event that is the address of
 478 its own process table entry.
 479 When a process terminates,
 480 it signals the event represented by
 481 its parent's process table entry.
 482 Signaling an event on which no process
 483 is waiting has no effect.
 484 Similarly,
 485 signaling an event on which many processes
 486 are waiting will wake all of them up.
 487 This differs considerably from
 488 Dijkstra's P and V
 489 synchronization operations,
 490 .[
 491 dijkstra sequential processes 1968
 492 .]
 493 in that
 494 no memory is associated with events.
 495 Thus there need be no allocation of events
 496 prior to their use.
 497 Events exist simply by being used.
 498 .PP
 499 On the negative side,
 500 because there is no memory associated with events,
 501 no notion of ``how much''
 502 can be signaled via the event mechanism.
 503 For example,
 504 processes that want memory might
 505 wait on an event associated with
 506 memory allocation.
 507 When any amount of memory becomes available,
 508 the event would be signaled.
 509 All the competing processes would then wake
 510 up to fight over the new memory.
 511 (In reality,
 512 the swapping process is the only process
 513 that waits for primary memory to become available.)
 514 .PP
 515 If an event occurs
 516 between the time a process decides
 517 to wait for that event and the
 518 time that process enters the wait state,
 519 then
 520 the process will wait on an event that has
 521 already happened (and may never happen again).
 522 This race condition happens because there is no memory associated with
 523 the event to indicate that the event has occurred;
 524 the only action of an event is to change a set of processes
 525 from wait state to run state.
 526 This problem is relieved largely
 527 by the fact that process switching can
 528 only occur in the kernel by explicit calls
 529 to the event-wait mechanism.
 530 If the event in question is signaled by another
 531 process,
 532 then there is no problem.
 533 But if the event is signaled by a hardware
 534 interrupt,
 535 then special care must be taken.
 536 These synchronization races pose the biggest
 537 problem when
 538 .UX
 539 is adapted to multiple-processor configurations.
 540 .[
 541 hawley meyer multiprocessing unix
 542 .]
 543 .PP
 544 The event-wait code in the kernel
 545 is like a co-routine linkage.
 546 At any time,
 547 all but one of the processes has called event-wait.
 548 The remaining process is the one currently executing.
 549 When it calls event-wait,
 550 a process whose event has been signaled
 551 is selected and that process
 552 returns from its call to event-wait.
 553 .PP
 554 Which of the runable processes is to run next?
 555 Associated with each process is a priority.
 556 The priority of a system process is assigned by the code
 557 issuing the wait on an event.
 558 This is roughly equivalent to the response
 559 that one would expect on such an event.
 560 Disk events have high priority,
 561 teletype events are low,
 562 and time-of-day events are very low.
 563 (From observation,
 564 the difference in system process priorities
 565 has little or no performance impact.)
 566 All user-process priorities are lower than the
 567 lowest system priority.
 568 User-process priorities are assigned
 569 by an algorithm based on the
 570 recent ratio of the amount of compute time to real time consumed
 571 by the process.
 572 A process that has used a lot of
 573 compute time in the last real-time
 574 unit is assigned a low user priority.
 575 Because interactive processes are characterized
 576 by low ratios of compute to real time,
 577 interactive response is maintained without any
 578 special arrangements.
 579 .PP
 580 The scheduling algorithm simply picks
 581 the process with the highest priority,
 582 thus
 583 picking all system processes first and
 584 user processes second.
 585 The compute-to-real-time ratio is updated
 586 every second.
 587 Thus,
 588 all other things being equal,
 589 looping user processes will be
 590 scheduled round-robin with a
 591 1-second quantum.
 592 A high-priority process waking up will
 593 preempt a running, low-priority process.
 594 The scheduling algorithm has a very desirable
 595 negative feedback character.
 596 If a process uses its high priority
 597 to hog the computer,
 598 its priority will drop.
 599 At the same time, if a low-priority
 600 process is ignored for a long time,
 601 its priority will rise.
 602 .NH
 603 I/O SYSTEM
 604 .PP
 605 The I/O system
 606 is broken into two completely separate systems:
 607 the block I/O system and the character I/O system.
 608 In retrospect,
 609 the names should have been ``structured I/O''
 610 and ``unstructured I/O,'' respectively;
 611 while the term ``block I/O'' has some meaning,
 612 ``character I/O'' is a complete misnomer.
 613 .PP
 614 Devices are characterized by a major device number,
 615 a minor device number, and
 616 a class (block or character).
 617 For each class,
 618 there is an array of entry points into the device drivers.
 619 The major device number is used to index the array
 620 when calling the code for a particular device driver.
 621 The minor device number is passed to the
 622 device driver as an argument.
 623 The minor number has no significance other
 624 than that attributed to it by the driver.
 625 Usually,
 626 the driver uses the minor number to access
 627 one of several identical physical devices.
 628 .PP
 629 The use of the array of entry points
 630 (configuration table)
 631 as the only connection between the
 632 system code and the device drivers is
 633 very important.
 634 Early versions of the system had a much
 635 less formal connection with the drivers,
 636 so that it was extremely hard to handcraft
 637 differently configured systems.
 638 Now it is possible to create new
 639 device drivers in an average of a few hours.
 640 The configuration table in most cases
 641 is created automatically by a program
 642 that reads the system's parts list.
 643 .NH 2
 644 Block I/O system
 645 .PP
 646 The model block I/O device consists
 647 of randomly addressed, secondary
 648 memory blocks of 512 bytes each.
 649 The blocks are uniformly addressed
 650 0, 1, .\|.\|. up to the size of the device.
 651 The block device driver has the job of
 652 emulating this model on a
 653 physical device.
 654 .PP
 655 The block I/O devices are accessed
 656 through a layer of buffering software.
 657 The system maintains a list of buffers
 658 (typically between 10 and 70)
 659 each assigned a device name and
 660 a device address.
 661 This buffer pool constitutes a data cache
 662 for the block devices.
 663 On a read request,
 664 the cache is searched for the desired block.
 665 If the block is found,
 666 the data are made available to the
 667 requester without any physical I/O.
 668 If the block is not in the cache,
 669 the least recently used block in the cache is renamed,
 670 the correct device driver is called to
 671 fill up the renamed buffer, and then the
 672 data are made available.
 673 Write requests are handled in an analogous manner.
 674 The correct buffer is found
 675 and relabeled if necessary.
 676 The write is performed simply by marking
 677 the buffer as ``dirty.''
 678 The physical I/O is then deferred until
 679 the buffer is renamed.
 680 .PP
 681 The benefits in reduction of physical I/O
 682 of this scheme are substantial,
 683 especially considering the file system implementation.
 684 There are,
 685 however,
 686 some drawbacks.
 687 The asynchronous nature of the
 688 algorithm makes error reporting
 689 and meaningful user error handling
 690 almost impossible.
 691 The cavalier approach to I/O error
 692 handling in the
 693 .UX
 694 system is partly due to the asynchronous
 695 nature of the block I/O system.
 696 A second problem is in the delayed writes.
 697 If the system stops unexpectedly,
 698 it is almost certain that there is a
 699 lot of logically complete,
 700 but physically incomplete,
 701 I/O in the buffers.
 702 There is a system primitive to
 703 flush all outstanding I/O activity
 704 from the buffers.
 705 Periodic use of this primitive helps,
 706 but does not solve, the problem.
 707 Finally,
 708 the associativity in the buffers
 709 can alter the physical I/O sequence
 710 from that of the logical I/O sequence.
 711 This means that there are times
 712 when data structures on disk are inconsistent,
 713 even though the software is careful
 714 to perform I/O in the correct order.
 715 On non-random devices,
 716 notably magnetic tape,
 717 the inversions of writes can be disastrous.
 718 The problem with magnetic tapes is ``cured'' by
 719 allowing only one outstanding write request
 720 per drive.
 721 .NH 2
 722 Character I/O system
 723 .PP
 724 The character I/O system consists of all
 725 devices that do not fall into the block I/O model.
 726 This includes the ``classical'' character devices
 727 such as communications lines, paper tape, and
 728 line printers.
 729 It also includes magnetic tape and disks when
 730 they are not used in a stereotyped way,
 731 for example, 80-byte physical records on tape
 732 and track-at-a-time disk copies.
 733 In short,
 734 the character I/O interface
 735 means ``everything other than block.''
 736 I/O requests from the user are sent to the
 737 device driver essentially unaltered.
 738 The implementation of these requests is, of course,
 739 up to the device driver.
 740 There are guidelines and conventions
 741 to help the implementation of
 742 certain types of device drivers.
 743 .NH 3
 744 Disk drivers
 745 .PP
 746 Disk drivers are implemented
 747 with a queue of transaction records.
 748 Each record holds a read/write flag,
 749 a primary memory address,
 750 a secondary memory address, and
 751 a transfer byte count.
 752 Swapping is accomplished by passing
 753 such a record to the swapping device driver.
 754 The block I/O interface is implemented by
 755 passing such records with requests to
 756 fill and empty system buffers.
 757 The character I/O interface to the disk
 758 drivers create a transaction record that
 759 points directly into the user area.
 760 The routine that creates this record also insures
 761 that the user is not swapped during this
 762 I/O transaction.
 763 Thus by implementing the general disk driver,
 764 it is possible to use the disk
 765 as a block device,
 766 a character device, and a swap device.
 767 The only really disk-specific code in normal
 768 disk drivers is the pre-sort of transactions to
 769 minimize latency for a particular device, and
 770 the actual issuing of the I/O request.
 771 .NH 3
 772 Character lists
 773 .PP
 774 Real character-oriented devices may
 775 be implemented using the common
 776 code to handle character lists.
 777 A character list is a queue of characters.
 778 One routine puts a character on a queue.
 779 Another gets a character from a queue.
 780 It is also possible to ask how many
 781 characters are currently on a queue.
 782 Storage for all queues in the system comes
 783 from a single common pool.
 784 Putting a character on a queue will allocate
 785 space from the common pool and link the
 786 character onto the data structure defining the queue.
 787 Getting a character from a queue returns
 788 the corresponding space to the pool.
 789 .PP
 790 A typical character-output device
 791 (paper tape punch, for example)
 792 is implemented by passing characters
 793 from the user onto a character queue until
 794 some maximum number of characters is on the queue.
 795 The I/O is prodded to start as
 796 soon as there is anything on the queue
 797 and, once started,
 798 it is sustained by hardware completion interrupts.
 799 Each time there is a completion interrupt,
 800 the driver gets the next character from the queue
 801 and sends it to the hardware.
 802 The number of characters on the queue is checked and,
 803 as the count falls through some intermediate level,
 804 an event (the queue address) is signaled.
 805 The process that is passing characters from
 806 the user to the queue can be waiting on the event, and
 807 refill the queue to its maximum
 808 when the event occurs.
 809 .PP
 810 A typical character input device
 811 (for example, a paper tape reader)
 812 is handled in a very similar manner.
 813 .PP
 814 Another class of character devices is the terminals.
 815 A terminal is represented by three
 816 character queues.
 817 There are two input queues (raw and canonical)
 818 and an output queue.
 819 Characters going to the output of a terminal
 820 are handled by common code exactly as described
 821 above.
 822 The main difference is that there is also code
 823 to interpret the output stream as
 824 .UC  ASCII
 825 characters and to perform some translations,
 826 e.g., escapes for deficient terminals.
 827 Another common aspect of terminals is code
 828 to insert real-time delay after certain control characters.
 829 .PP
 830 Input on terminals is a little different.
 831 Characters are collected from the terminal and
 832 placed on a raw input queue.
 833 Some device-dependent code conversion and
 834 escape interpretation is handled here.
 835 When a line is complete in the raw queue,
 836 an event is signaled.
 837 The code catching this signal then copies a
 838 line from the raw queue to a canonical queue
 839 performing the character erase and line kill editing.
 840 User read requests on terminals can be
 841 directed at either the raw or canonical queues.
 842 .NH 3
 843 Other character devices
 844 .PP
 845 Finally,
 846 there are devices that fit no general category.
 847 These devices are set up as character I/O drivers.
 848 An example is a driver that reads and writes
 849 unmapped primary memory as an I/O device.
 850 Some devices are too
 851 fast to be treated a character at time,
 852 but do not fit the disk I/O mold.
 853 Examples are fast communications lines and
 854 fast line printers.
 855 These devices either have their own buffers
 856 or ``borrow'' block I/O buffers for a while and
 857 then give them back.
 858 .NH
 859 THE FILE SYSTEM
 860 .PP
 861 In the
 862 .UX
 863 system,
 864 a file is a (one-dimensional) array of bytes.
 865 No other structure of files is implied by the
 866 system.
 867 Files are attached anywhere
 868 (and possibly multiply)
 869 onto a hierarchy of directories.
 870 Directories are simply files that
 871 users cannot write.
 872 For a further discussion
 873 of the external view of files and directories,
 874 see Ref.\0
 875 .[
 876 ritchie thompson unix bstj 1978
 877 %Q This issue
 878 .].
 879 .PP
 880 The
 881 .UX
 882 file system is a disk data structure
 883 accessed completely through
 884 the block I/O system.
 885 As stated before,
 886 the canonical view of a ``disk'' is
 887 a randomly addressable array of
 888 512-byte blocks.
 889 A file system breaks the disk into
 890 four self-identifying regions.
 891 The first block (address 0)
 892 is unused by the file system.
 893 It is left aside for booting procedures.
 894 The second block (address 1)
 895 contains the so-called ``super-block.''
 896 This block,
 897 among other things,
 898 contains the size of the disk and
 899 the boundaries of the other regions.
 900 Next comes the i-list,
 901 a list of file definitions.
 902 Each file definition is
 903 a 64-byte structure, called an i-node.
 904 The offset of a particular i-node
 905 within the i-list is called its i-number.
 906 The combination of device name
 907 (major and minor numbers) and i-number
 908 serves to uniquely name a particular file.
 909 After the i-list,
 910 and to the end of the disk,
 911 come free storage blocks that
 912 are available for the contents of files.
 913 .PP
 914 The free space on a disk is maintained
 915 by a linked list of available disk blocks.
 916 Every block in this chain contains a disk address
 917 of the next block in the chain.
 918 The remaining space contains the address of up to
 919 50 disk blocks that are also free.
 920 Thus with one I/O operation,
 921 the system obtains 50 free blocks and a
 922 pointer where to find more.
 923 The disk allocation algorithms are
 924 very straightforward.
 925 Since all allocation is in fixed-size
 926 blocks and there is strict accounting of
 927 space,
 928 there is no need to compact or garbage collect.
 929 However,
 930 as disk space becomes dispersed,
 931 latency gradually increases.
 932 Some installations choose to occasionally compact
 933 disk space to reduce latency.
 934 .PP
 935 An i-node contains 13 disk addresses.
 936 The first 10 of these addresses point directly at
 937 the first 10 blocks of a file.
 938 If a file is larger than 10 blocks (5,120 bytes),
 939 then the eleventh address points at a block
 940 that contains the addresses of the next 128 blocks of the file.
 941 If the file is still larger than this
 942 (70,656 bytes),
 943 then the twelfth block points at up to 128 blocks,
 944 each pointing to 128 blocks of the file.
 945 Files yet larger
 946 (8,459,264 bytes)
 947 use the thirteenth address for a ``triple indirect'' address.
 948 The algorithm ends here with the maximum file size
 949 of 1,082,201,087 bytes.
 950 .PP
 951 A logical directory hierarchy is added
 952 to this flat physical structure simply
 953 by adding a new type of file, the directory.
 954 A directory is accessed exactly as an ordinary file.
 955 It contains 16-byte entries consisting of
 956 a 14-byte name and an i-number.
 957 The root of the hierarchy is at a known i-number
 958 (\fIviz.,\fR 2).
 959 The file system structure allows an arbitrary, directed graph
 960 of directories with regular files linked in
 961 at arbitrary places in this graph.
 962 In fact,
 963 very early
 964 .UX
 965 systems used such a structure.
 966 Administration of such a structure became so
 967 chaotic that later systems were restricted
 968 to a directory tree.
 969 Even now,
 970 with regular files linked multiply
 971 into arbitrary places in the tree,
 972 accounting for space has become a problem.
 973 It may become necessary to restrict the entire
 974 structure to a tree,
 975 and allow a new form of linking that
 976 is subservient to the tree structure.
 977 .PP
 978 The file system allows
 979 easy creation,
 980 easy removal,
 981 easy random accessing,
 982 and very easy space allocation.
 983 With most physical addresses confined
 984 to a small contiguous section of disk,
 985 it is also easy to dump, restore, and
 986 check the consistency of the file system.
 987 Large files suffer from indirect addressing,
 988 but the cache prevents most of the implied physical I/O
 989 without adding much execution.
 990 The space overhead properties of this scheme are quite good.
 991 For example,
 992 on one particular file system,
 993 there are 25,000 files containing 130M bytes of data-file content.
 994 The overhead (i-node, indirect blocks, and last block breakage)
 995 is about 11.5M bytes.
 996 The directory structure to support these files
 997 has about 1,500 directories containing 0.6M bytes of directory content
 998 and about 0.5M bytes of overhead in accessing the directories.
 999 Added up any way,
1000 this comes out to less than a 10 percent overhead for actual
1001 stored data.
1002 Most systems have this much overhead in
1003 padded trailing blanks alone.
1004 .NH 2
1005 File system implementation
1006 .PP
1007 Because the i-node defines a file,
1008 the implementation of the file system centers
1009 around access to the i-node.
1010 The system maintains a table of all active
1011 i-nodes.
1012 As a new file is accessed,
1013 the system locates the corresponding i-node,
1014 allocates an i-node table entry, and reads
1015 the i-node into primary memory.
1016 As in the buffer cache,
1017 the table entry is considered to be the current
1018 version of the i-node.
1019 Modifications to the i-node are made to
1020 the table entry.
1021 When the last access to the i-node goes
1022 away,
1023 the table entry is copied back to the
1024 secondary store i-list and the table entry is freed.
1025 .KF
1026 .if t .in .25i
1027 .so fig2.pic
1028 .if t .in -.25i
1029 .sp 2v
1030 .ce
1031 Fig. 2\(emFile system data structure.
1032 .sp
1033 .KE
1034 .PP
1035 All I/O operations on files are carried out
1036 with the aid of the corresponding i-node table entry.
1037 The accessing of a file is a straightforward
1038 implementation of the algorithms mentioned previously.
1039 The user is not aware of i-nodes and i-numbers.
1040 References to the file system are made in terms of
1041 path names of the directory tree.
1042 Converting a path name into an i-node table entry
1043 is also straightforward.
1044 Starting at some known i-node
1045 (the root or the current directory of some process),
1046 the next component of the path name is
1047 searched by reading the directory.
1048 This gives an i-number and an implied device
1049 (that of the directory).
1050 Thus the next i-node table entry can be accessed.
1051 If that was the last component of the path name,
1052 then this i-node is the result.
1053 If not,
1054 this i-node is the directory needed to look up
1055 the next component of the path name, and the
1056 algorithm is repeated.
1057 .PP
1058 The user process accesses the file system with
1059 certain primitives.
1060 The most common of these are
1061 .UL open ,
1062 .UL create ,
1063 .UL read ,
1064 .UL write ,
1065 .UL seek ,
1066 and
1067 .UL close .
1068 The data structures maintained are shown in Fig. 2.
1069 In the system data segment associated with a user,
1070 there is room for some (usually between 10 and 50) open files.
1071 This open file table consists of pointers that can be used to access
1072 corresponding i-node table entries.
1073 Associated with each of these open files is
1074 a current I/O pointer.
1075 This is a byte offset of
1076 the next read/write operation on the file.
1077 The system treats each read/write request
1078 as random with an implied seek to the
1079 I/O pointer.
1080 The user usually thinks of the file as
1081 sequential with the I/O pointer
1082 automatically counting the number of bytes
1083 that have been read/written from the file.
1084 The user may,
1085 of course,
1086 perform random I/O by setting the I/O pointer
1087 before reads/writes.
1088 .PP
1089 With file sharing,
1090 it is necessary to allow related
1091 processes to share a common I/O pointer
1092 and yet have separate I/O pointers
1093 for independent processes
1094 that access the same file.
1095 With these two conditions,
1096 the I/O pointer cannot reside
1097 in the i-node table nor can
1098 it reside in the list of
1099 open files for the process.
1100 A new table
1101 (the open file table)
1102 was invented for the sole purpose
1103 of holding the I/O pointer.
1104 Processes that share the same open
1105 file
1106 (the result of
1107 .UL fork s)
1108 share a common open file table entry.
1109 A separate open of the same file will
1110 only share the i-node table entry,
1111 but will have distinct open file table entries.
1112 .PP
1113 The main file system primitives are implemented as follows.
1114 .UL \&open
1115 converts a file system path name into an i-node
1116 table entry.
1117 A pointer to the i-node table entry is placed in a
1118 newly created open file table entry.
1119 A pointer to the file table entry is placed in the
1120 system data segment for the process.
1121 .UL \&create
1122 first creates a new i-node entry,
1123 writes the i-number into a directory, and
1124 then builds the same structure as for an
1125 .UL open .
1126 .UL \&read
1127 and
1128 .UL write
1129 just access the i-node entry as described above.
1130 .UL \&seek
1131 simply manipulates the I/O pointer.
1132 No physical seeking is done.
1133 .UL \&close
1134 just frees the structures built by
1135 .UL open
1136 and
1137 .UL create .
1138 Reference counts are kept on the open file table entries and
1139 the i-node table entries to free these structures after
1140 the last reference goes away.
1141 .UL \&unlink
1142 simply decrements the count of the
1143 number of directories pointing at the given i-node.
1144 When the last reference to an i-node table entry
1145 goes away,
1146 if the i-node has no directories pointing to it,
1147 then the file is removed and the i-node is freed.
1148 This delayed removal of files prevents
1149 problems arising from removing active files.
1150 A file may be removed while still open.
1151 The resulting unnamed file vanishes
1152 when the file is closed.
1153 This is a method of obtaining temporary files.
1154 .PP
1155 There is a type of unnamed
1156 .UC  FIFO
1157 file called a
1158 .UL pipe.
1159 Implementation of
1160 .UL pipe s
1161 consists of implied
1162 .UL seek s
1163 before each
1164 .UL read
1165 or
1166 .UL write
1167 in order to implement
1168 first-in-first-out.
1169 There are also checks and synchronization
1170 to prevent the
1171 writer from grossly outproducing the
1172 reader and to prevent the reader from
1173 overtaking the writer.
1174 .NH 2
1175 Mounted file systems
1176 .PP
1177 The file system of a
1178 .UX
1179 system
1180 starts with some designated block device
1181 formatted as described above to contain
1182 a hierarchy.
1183 The root of this structure is the root of
1184 the
1185 .UX
1186 file system.
1187 A second formatted block device may be
1188 mounted
1189 at any leaf of
1190 the current hierarchy.
1191 This logically extends the current hierarchy.
1192 The implementation of
1193 mounting
1194 is trivial.
1195 A mount table is maintained containing
1196 pairs of designated leaf i-nodes and
1197 block devices.
1198 When converting a path name into an i-node,
1199 a check is made to see if the new i-node is a
1200 designated leaf.
1201 If it is,
1202 the i-node of the root
1203 of the block device replaces it.
1204 .PP
1205 Allocation of space for a file is taken
1206 from the free pool on the device on which the
1207 file lives.
1208 Thus a file system consisting of many
1209 mounted devices does not have a common pool of
1210 free secondary storage space.
1211 This separation of space on different
1212 devices is necessary to allow easy
1213 unmounting
1214 of a device.
1215 .NH 2
1216 Other system functions
1217 .PP
1218 There are some other things that the system
1219 does for the user\-a
1220 little accounting,
1221 a little tracing/debugging,
1222 and a little access protection.
1223 Most of these things are not very
1224 well developed
1225 because our use of the system in computing science research
1226 does not need them.
1227 There are some features that are missed in some
1228 applications, for example, better inter-process communication.
1229 .PP
1230 The
1231 .UX
1232 kernel is an I/O multiplexer more than
1233 a complete operating system.
1234 This is as it should be.
1235 Because of this outlook,
1236 many features are
1237 found in most
1238 other operating systems that are missing from the
1239 .UX
1240 kernel.
1241 For example,
1242 the
1243 .UX
1244 kernel does not support
1245 file access methods,
1246 file disposition,
1247 file formats,
1248 file maximum size,
1249 spooling,
1250 command language,
1251 logical records,
1252 physical records,
1253 assignment of logical file names,
1254 logical file names,
1255 more than one character set,
1256 an operator's console,
1257 an operator,
1258 log-in,
1259 or log-out.
1260 Many of these things are symptoms rather than features.
1261 Many of these things are implemented
1262 in user software
1263 using the kernel as a tool.
1264 A good example of this is the command language.
1265 .[
1266 bourne shell 1978 bstj
1267 %Q This issue
1268 .]
1269 Each user may have his own command language.
1270 Maintenance of such code is as easy as
1271 maintaining user code.
1272 The idea of implementing ``system'' code with general
1273 user primitives
1274 comes directly from
1275 .UC  MULTICS .
1276 .[
1277 organick multics 1972
1278 .]
1279 .LP
1280 .[
1281 $LIST$
1282 .]