   1 .\" format with ditroff -me
   2 .\" $FreeBSD$
   3 .\" format made to look as a paper for the proceedings is to look
   4 .\" (as specified in the text)
   5 .if n \{ .po 0
   6 .       ll 78n
   7 .       na
   8 .\}
   9 .if t \{ .po 1.0i
  10 .       ll 6.5i
  11 .       nr pp 10                \" text point size
  12 .       nr sp \n(pp+2           \" section heading point size
  13 .       nr ss 1.5v              \" spacing before section headings
  14 .\}
  15 .nr tm 1i
  16 .nr bm 1i
  17 .nr fm 2v
  18 .he ''''
  19 .de bu
  20 .ip \0\s-2\(bu\s+2
  21 ..
  22 .lp
  23 .rs
  24 .ce 5
  25 .sp
  26 .sz 14
  27 .b "Rethinking /dev and devices in the UNIX kernel"
  28 .sz 12
  29 .sp
  30 .i "Poul-Henning Kamp"
  31 .sp .1
  32 .i "<phk@FreeBSD.org>"
  33 .i "The FreeBSD Project"
  34 .i
  35 .sp 1.5
  36 .b Abstract
  37 .lp
  38 An outstanding novelty in UNIX at its introduction was the notion
  39 of ``a file is a file is a file and even a device is a file.''
  40 Going from ``hardware only changes when the DEC Field engineer is here''
  41 to ``my toaster has USB'' has put serious strain on the rather crude
  42 implementation of the ``devices as files'' concept, an implementation which
  43 has survived practically unchanged for 30 years in most UNIX variants.
  44 Starting from a high-level view of devices and the semantics that
  45 have grown around them over the years, this paper takes the audience on a
  46 grand tour of the redesigned FreeBSD device-I/O system,
  47 to convey an overview of how it all fits together, and to explain why
  48 things ended up as they did, how to use the new features and
  49 in particular how not to.
  50 .sp
  51 .if t \{
  52 .2c
  53 .\}
  54 .\" end boilerplate... paper starts here.
  55 .sh 1 "Introduction"
  56 .sp
  57 There are really only two fundamental ways to conceptualise
  58 I/O devices in an operating system:
  59 The usual way and the UNIX way.
  60 .lp
  61 The usual way is to treat I/O devices as their own class of things,
  62 possibly several classes of things, and provide APIs tailored
  63 to the semantics of the devices.
  64 In practice this means that a program must know what it is dealing
  65 with: it has to interact with disks one way, tapes another and
  66 rodents yet a third way, all of which are different from how it
  67 interacts with a plain disk file.
  68 .lp
  69 The UNIX way has never been described better than in the very first
  70 paper
  71 published on UNIX by Ritchie and Thompson [Ritchie74]:
  72 .(q
  73 Special files constitute the most unusual feature of the UNIX filesystem.
  74 Each supported I/O device is associated with at least one such file.
  75 Special files are read and written just like ordinary disk files,
  76 but requests to read or write result in activation of the associated device.
  77 An entry for each special file resides in directory /dev,
  78 although a link may be made to one of these files just as it may to an
  79 ordinary file.
  80 Thus, for example, to write on a magnetic tape one may write on the file /dev/mt.
  81
  82 Special files exist for each communication line, each disk, each tape drive,
  83 and for physical main memory.
  84 Of course, the active disks and the memory special files are protected from indiscriminate access.
  85
  86 There is a threefold advantage in treating I/O devices this way:
  87 file and device I/O are as similar as possible;
  88 file and device names have the same syntax and meaning,
  89 so that a program expecting a file name as a parameter can be passed a device name;
  90 finally, special files are subject to the same protection mechanism as regular files.
  91 .)q
  92 .lp
  93 .\" (Why was this so special at the time?)
  94 At the time, this was quite a strange concept; it was totally accepted
  95 for instance, that neither the system administrator nor the users were
  96 able to interact with a disk as a disk.
  97 Operating systems simply
  98 did not provide access to disk other than as a filesystem.
  99 Most vendors did not even release a program to initialise a
 100 disk-pack with a filesystem: selling pre-initialised and ``quality
 101 tested'' disk-packs was quite a profitable business.
 102 .lp
 103 In many cases some kind of API for reading and
 104 writing individual sectors on a disk pack
 105 did exist in the operating system,
 106 but more often than not
 107 it was not listed in the public documentation.
 108 .sh 2 "The traditional implementation"
 109 .lp
 110 .\" (Explain how opening /dev/lpt0 lands you in the right device driver)
 111 The initial implementation used hardcoded inode numbers [Ritchie98].
 112 The console
 113 device would be inode number 5, the paper-tape-punch number 6 and so on,
 114 even if those inodes were also actual regular files in the filesystem.
 115 .lp
 116 For reasons one can only too vividly imagine, this was changed and
 117 Thompson
 118 [Thompson78]
 119 describes how the implementation now used ``major and minor''
 120 device numbers to index through the devsw array to the correct device driver.
 121 .lp
 122 For all intents and purposes, this is the implementation which survives
 123 in most UNIX-like systems even to this day.
 124 Apart from the access control and timestamp information which is
 125 found in all inodes, the special inodes in the filesystem contain only
 126 one piece of information: the major and minor device numbers, often
 127 logically OR'ed to one field.
 128 .lp
 129 When a program opens a special file, the kernel uses the major number
 130 to find the entry points in the device driver, and passes the combined
 131 major and minor numbers as a parameter to the device driver.
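.lp
As a rough illustration of the dispatch just described, here is a minimal
user-space sketch in C.
The 8-bit major and minor fields, the sk_ names and the two toy drivers are
assumptions made for the example; they are not taken from any particular kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packing: major number in bits 8-15, minor in bits 0-7. */
typedef unsigned int sk_dev_t;
#define SK_MAKEDEV(ma, mi) ((((ma) & 0xffu) << 8) | ((mi) & 0xffu))
#define SK_MAJOR(d)        (((d) >> 8) & 0xffu)
#define SK_MINOR(d)        ((d) & 0xffu)

/* One row of the device switch: the entry points of one driver. */
struct sk_devsw {
    const char *d_name;
    int (*d_open)(sk_dev_t dev);
};

/* Toy drivers: each receives the combined number and picks out its unit. */
static int
sk_console_open(sk_dev_t dev)
{
    assert(SK_MAJOR(dev) == 0);      /* we were reached via slot 0 */
    return (100 + (int)SK_MINOR(dev));
}

static int
sk_tape_open(sk_dev_t dev)
{
    assert(SK_MAJOR(dev) == 1);      /* we were reached via slot 1 */
    return (200 + (int)SK_MINOR(dev));
}

static const struct sk_devsw sk_devsw_table[] = {
    { "console", sk_console_open },  /* major 0 */
    { "tape",    sk_tape_open },     /* major 1 */
};
#define SK_NDEVSW (sizeof(sk_devsw_table) / sizeof(sk_devsw_table[0]))

/* What opening a special file boils down to: index by major, pass both. */
int
sk_open_special(sk_dev_t dev)
{
    unsigned int ma = SK_MAJOR(dev);

    if (ma >= SK_NDEVSW || sk_devsw_table[ma].d_open == NULL)
        return (-1);                 /* no driver configured */
    return (sk_devsw_table[ma].d_open(dev));
}
```

.lp
With this table, sk_open_special(SK_MAKEDEV(1, 3)) reaches the toy tape driver,
which sees unit 3, while a major number without a table entry fails before
any driver is called.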
 132 .sh 1 "The challenge"
 133 .lp
 134 Now, we did not talk much about where the special inodes came from
 135 to begin with.
 136 They were created by hand, using the
 137 mknod(2) system call, usually through the mknod(8) program.
 138 .lp
 139 In those days a
 140 computer had a very static hardware configuration\**
 141 .(f
 142 \** Unless your assigned field engineer was present on site.
 143 .)f
 144 and it certainly did not
 145 change while the system was up and running, so creating device nodes
 146 by hand was certainly an acceptable solution.
 147 .lp
 148 The first sign that this would not hold up as a solution came with
 149 the advent of TCP/IP and the telnet(1) program, or more precisely
 150 with the telnetd(8) daemon.
 151 In order to support remote login a ``pseudo-tty'' device driver was implemented,
 152 basically a tty driver which, instead of hardware, had another device which
 153 would allow a process to ``act as hardware'' for the tty.
 154 The telnetd(8) daemon would read and write data on the ``master'' side of
 155 the pseudo-tty and the user would be running on the ``slave'' side,
 156 which would act just like any other tty: you could change the erase
 157 character if you wanted to and all the signals and all that stuff worked.
 158 .lp
 159 Obviously with a device requiring no hardware, you can compile as many
 160 instances into the kernel as you like, as long as you do not use
 161 too much memory.
 162 As system after system was connected
 163 to the ARPANet, ``increasing number of ptys'' became a regular task
 164 for system administrators, and part of this task was to create
 165 more special nodes in the filesystem.
 166 .lp
 167 Several UNIX vendors also noticed an issue when they sold minicomputers
 168 in many different configurations: explaining to system administrators
 169 just which special nodes they would need and how to create them was
 170 a significant documentation hassle.  Some opted for the simple solution
 171 and pre-populated /dev with every conceivable device node, resulting
 172 in a predictable slowdown on access to filenames in /dev.
 173 .lp
 174 System V UNIX provided a band-aid solution:
 175 a special boot sequence would take effect if the kernel or
 176 the hardware had changed since last reboot.
 177 This boot procedure would
 178 amongst other things create the necessary special files in the filesystem,
 179 based on an intricate system of per device driver configuration files.
 180 .lp
 181 In recent years, we have become used to hardware which changes
 182 configuration at any time: people plug USB, Firewire and PCCard
 183 devices into their computers.
 184 These devices can be anything from modems and disks to GPS receivers
 185 and fingerprint authentication hardware.
 186 Suddenly maintaining the
 187 correct set of special devices in ``/dev'' became a major headache.
 188 .lp
 189 Along the way, UNIX kernels had learned to deal with multiple filesystem
 190 types [Heidemann91a] and a ``device-pseudo-filesystem'' was a pretty
 191 obvious idea.
 192 The device drivers have a pretty good idea which
 193 devices they have found in the configuration, so all that is needed is
 194 to present this information as a filesystem filled with just the right
 195 special files.
 196 Experience has shown that this, like most other ``pseudo
 197 filesystems'', sounds a lot simpler in theory than in practice.
 198 .sh 1 "Truly understanding devices"
 199 .lp
 200 Before we continue, we need to fully understand the
 201 ``device special file'' in UNIX.
 202 .lp
 203 First we need to realize that a special file has the nature of
 204 a pointer from the filesystem into a different namespace;
 205 a little understood fact with far reaching consequences.
 206 .lp
 207 One implication of this is that several special files can
 208 exist in the filename namespace all pointing to the same device
 209 but each having their own access and timestamp attributes:
 210 .lp
 211 .(b M
 212 .vs -3
 213 \fC\s-3guest# ls -l /dev/fd0 /tmp/fd0
 214 crw-r----- 1 root operator 9, 0 Sep 27 19:21 /dev/fd0
 215 crw-rw-rw- 1 root wheel    9, 0 Sep 27 19:24 /tmp/fd0\fP\s+3
 216 .vs +3
 217 .)b
 218 Obviously, the administrator needs to be on top of this:
 219 one popular way to exploit an unguarded root prompt is
 220 to create a replica of the special file /dev/kmem
 221 in a location where it will not be noticed.
 222 Since /dev/kmem gives access to the kernel memory,
 223 gaining any particular
 224 privilege can be arranged by suitably modifying the kernel's
 225 data structures through the illicit special file.
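.lp
An administrator auditing for such replicas can compare what special files
point at rather than what they are called.
The following is a minimal sketch using stat(2); the function name
sk_same_device is hypothetical, and the comparison assumes a system where
st_rdev carries the identity of the device.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths are special files of the same kind naming the
 * same device, 0 if they are not, and -1 if either path cannot be examined.
 */
int
sk_same_device(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return (-1);
    if (S_ISCHR(sa.st_mode) && S_ISCHR(sb.st_mode))
        return (sa.st_rdev == sb.st_rdev);
    if (S_ISBLK(sa.st_mode) && S_ISBLK(sb.st_mode))
        return (sa.st_rdev == sb.st_rdev);
    return (0);                      /* at least one is not a device node */
}
```

.lp
For the listing above, sk_same_device("/dev/fd0", "/tmp/fd0") would be expected
to return 1 even though owner, group and mode differ.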
 226 .lp
 227 When NFS appeared it opened a new avenue for this attack:
 228 People may have root privilege on one machine but not another.
 229 Since device nodes are not interpreted on the NFS server
 230 but rather on the local computer,
 231 a user with root privilege on an NFS client
 232 computer can create a device node to his liking on a filesystem
 233 mounted from an NFS server.
 234 This device node can in turn be used to
 235 circumvent the security of other computers which mount that filesystem,
 236 including the server, unless they protect themselves by not
 237 trusting any device entries on untrusted filesystems by mounting such
 238 filesystems with the \fCnodev\fP mount-option.
 239 .lp
 240 The fact that the device itself does not actually exist inside the
 241 filesystem which holds the special file makes it possible
 242 to perform boot-strapping stunts in the spirit
 243 of Baron Von Münchausen [raspe1785],
 244 where a filesystem is (re)mounted using one of its own
 245 device vnodes:
 246 .(b M
 247 .vs -3
 248 \fC\s-2guest# mount -o ro /dev/fd0 /mnt
 249 guest# fsck /mnt/dev/fd0
 250 guest# mount -u -o rw /mnt/dev/fd0 /mnt\fP\s+2
 251 .vs +3
 252 .)b
 253 .lp
 254 Other interesting details are chroot(2) and jail(2) [Kamp2000] which
 255 provide filesystem isolation for process-trees.
 256 Whereas chroot(2) was not implemented as a security tool [Mckusick1999]
 257 (although it has been widely used as such), the jail(2) security
 258 facility in FreeBSD provides a pretty convincing ``virtual machine''
 259 where even the root privilege is isolated and restricted to the designated
 260 area of the machine.
 261 Obviously chroot(2) and jail(2) may require access to a well-defined
 262 subset of devices like /dev/null, /dev/zero and /dev/tty,
 263 whereas access to other devices such as /dev/kmem
 264 or any disks could be used to compromise the integrity of the jail(2)
 265 confinement.
 266 .lp
 267 For a long time FreeBSD, like almost all UNIX-like systems, had two kinds
 268 of devices, ``block'' and
 269 ``character'' special files, the difference being that ``block''
 270 devices would provide caching and alignment for disk device access.
 271 This was one of those minor architectural mistakes which took
 272 forever to correct.
 273 .lp
 274 The argument that block devices were a mistake is really very
 275 simple:  Many devices other than disks have multiple modes
 276 of access which you select by choosing which special file to use.
 277 .lp
 278 Pick any old timer and he will be able to recite painful
 279 sagas about the crucial difference between the /dev/rmt
 280 and /dev/nrmt devices for tape access.\**
 281 .(f
 282 \** Make absolutely sure you know the difference before you take
 283 important data on a multi-file 9-track tape to remote locations.
 284 .)f
 285 .lp
 286 Tapes, asynchronous ports, line printer ports and many other devices
 287 have implemented submodes, selectable by the user
 288 at a special filename level, but that has not earned them their
 289 own special file types.
 290 Only disks\**
 291 .(f
 292 \** Well, OK: and some 9-track tapes.
 293 .)f
 294 have enjoyed the privilege of getting an entire file type dedicated to a
 295 minor device mode.
 296 .lp
 297 Caching and alignment modes should have been enabled by setting
 298 some bit in the minor device number on the disk special file,
 299 not by polluting the filesystem code with another file type.
 300 .lp
 301 In FreeBSD block devices were not even implemented in a fashion
 302 which would be of any use, since any write errors would never be
 303 reported to the writing process.  For this reason, and since no
 304 applications
 305 were found to be in existence which relied on block devices
 306 and since historical usage was indeed historical [Mckusick2000],
 307 block devices were removed from the FreeBSD system.
 308 This greatly simplified the task of keeping track of open(2)
 309 reference counts for disks and
 310 removed much magic special-case code throughout.
 311 .lp
 312 .sh 1 "Files, sockets, pipes, SVID IPC and devices"
 313 .sp
 314 It is an instructive lesson in inconsistency to look at the
 315 various types of ``things'' a process can access in UNIX-like
 316 systems today.
 317 .lp
 318 First there are normal files, which are our reference yardstick here:
 319 they are accessed with open(2), read(2), write(2), mmap(2), close(2)
 320 and various other auxiliary system calls.
 321 .lp
 322 Sockets and pipes are also accessed via file handles but each has
 323 its own namespace.  That means you cannot open(2) a socket,\**
 324 .(f
 325 \** This is particularly bizarre in the case of UNIX domain sockets
 326 which use the filesystem as their namespace and appear in directory
 327 listings.
 328 .)f
 329 but you can read(2) and write(2) to it.
 330 Sockets and pipes vector off at the file descriptor level and do
 331 not get in touch with the vnode based part of the kernel at all.
 332 .lp
 333 Devices land somewhere in the middle between pipes and sockets on
 334 one side and normal files on the other.
 335 They use the filesystem
 336 namespace, are implemented with vnodes, and can be operated
 337 on like normal files, but don't actually live in the filesystem.
 338 .lp
 339 Devices are in fact special-cased all the way through the vnode system.
 340 For one thing devices break the ``one file-one vnode''
 341 rule, making it necessary to chain all vnodes for the same
 342 device together in
 343 order to be able to find ``the canonical vnode for this device node'',
 344 but more importantly, many operations have to be specifically denied
 345 on special file vnodes since they do not make any sense.
 346 .lp
 347 For true inconsistency, consider the SVID IPC mechanisms - not
 348 only do they not operate via file handles,
 349 but they also sport a singularly
 350 ill-conceived 32-bit numeric namespace and a dedicated set of
 351 system calls for access.
 352 .lp
 353 Several people have convincingly argued that this is an inconsistent
 354 mess, and have proposed and implemented more consistent operating systems
 355 like Plan 9 from Bell Labs [Pike90a] [Pike92a].
 356 Unfortunately reality is that people are not interested in learning a new
 357 operating system when the one they have is pretty darn good, and
 358 consequently research into better and more consistent ways is
 359 a pretty frustrating [Pike2000] but by no means irrelevant topic.
 360 .sh 1 "Solving the /dev maintenance problem"
 361 .lp
 362 There are a number of obvious, simple but wrong ways one could
 363 go about solving the ``/dev'' maintenance problem.
 364 .lp
 365 The very straightforward way is to hack the namei() kernel function
 366 responsible for filename translation and lookup.
 367 It is only a minor matter of programming to
 368 add code to special-case any lookup which ends up in ``/dev''.
 369 But this leads to problems:  in the case of chroot(2) or jail(2), the
 370 administrator will want to present only a subset of the available
 371 devices in ``/dev'', so some kind of state will have to be kept per
 372 chroot(2)/jail(2) about which devices are visible and
 373 which devices are hidden, but no obvious location for this information
 374 is available in the absence of a mount data structure.
 375 .lp
 376 It also leads to some unpleasant issues
 377 because of the fact that ``/dev/foo'' is a synthesised directory
 378 entry which may or may not actually be present on the filesystem
 379 which seems to provide ``/dev''.
 380 The vnodes either have to belong to a filesystem or they
 381 must be special-cased throughout the vnode layer of the kernel.
 382 .lp
 383 Finally there is the simple matter of generality:
 384 hardcoding the string ``/dev'' in the kernel is not very general.
 385 .lp
 386 A cruder solution is to leave it to a daemon: make a special
 387 device driver, have a daemon read messages from it and create and
 388 destroy nodes in ``/dev'' in response to these messages.
 389 .lp
 390 The main drawback to this idea is that now we have added IPC
 391 to the mix introducing new and interesting race conditions.
 392 .lp
 393 Otherwise this solution is surprisingly effective,
 394 but chroot(2)/jail(2) requirements prevent a simple implementation
 395 and running a daemon per jail would become an administrative
 396 nightmare.
 397 .lp
 398 Another pitfall of
 399 this approach is that we are not able to remount the root filesystem
 400 read-write at boot until we have a device node for the root device,
 401 but if this node is missing we cannot create it with a daemon since
 402 the root filesystem (and hence /dev) is read-only.
 403 Adding a read-write memory-filesystem mount on /dev to solve this problem
 404 does not improve
 405 the architectural qualities further and certainly the KISS principle has
 406 been violated by now.
 407 .lp
 408 The final and in the end only satisfactory solution is to write a ``DEVFS''
 409 which mounts on ``/dev''.
 410 .lp
 411 The good news is that it does solve the problem with chroot(2) and jail(2):
 412 just mount a DEVFS instance on the ``dev'' directory inside the filesystem
 413 subtree where the chroot or jail lives.  Having a mountpoint gives us
 414 a convenient place to keep track of the local state of this DEVFS mount.
 415 .lp
 416 The bad news is that it takes a lot of cleanup and care to implement
 417 a DEVFS into a UNIX kernel.
 418 .sh 1 "DEVFS architectural decisions"
 419 .lp
 420 Before implementing a DEVFS, it is necessary to decide on a range
 421 of corner cases in behaviour, and some of these choices have proved
 422 surprisingly hard to settle for the FreeBSD project.
 423 .sh 2 "The ``persistence'' issue"
 424 .lp
 425 When DEVFS in FreeBSD was initially presented at a BoF at the 1995
 426 USENIX Technical Conference in New Orleans,
 427 a group of people demanded that it provide ``persistence''
 428 for administrative changes.
 429 .lp
 430 When trying to get a definition of ``persistence'', people can generally
 431 agree that if the administrator changes the access control bits of
 432 a device node, they want that mode to survive across reboots.
 433 .lp
 434 Once more tricky examples of the sort of manipulations one can do
 435 on special files are proposed, people rapidly disagree about what
 436 should be supported and what should not.
 437 .lp
 438 For instance, imagine a
 439 system with one floppy drive which appears in DEVFS as ``/dev/fd0''.
 440 Now the administrator, in order to get some badly written software
 441 to run, links this to ``/dev/fd1'':
 442 .(b M
 443 \fC\s-2ln /dev/fd0 /dev/fd1\fP\s+2
 444 .)b
 445 This works as expected and with persistence in DEVFS, the link is
 446 still there after a reboot.
 447 But what if after a reboot another floppy drive has been connected
 448 to the system?
 449 This drive would naturally have the name ``/dev/fd1'',
 450 but this name is now occupied by the administrator's hard link.
 451 Should the link be broken?
 452 Should the new floppy drive be called
 453 ``/dev/fd2''?  Nobody can agree on anything but the ugliness of the
 454 situation.
 455 .lp
 456 Given that we are no longer dependent on DEC Field engineers to
 457 change all four wheels to see which one is flat, the basic assumption
 458 that the machine has a constant hardware configuration is simply no
 459 longer true.
 460 The new assumption one should start from when analysing this
 461 issue is that when the system boots, we cannot know what devices we
 462 will find, and we can not know if the devices we do find
 463 are the same ones we had when the system was last shut down.
 464 .lp
 465 And in fact, this is very much the case with laptops today:  if I attach
 466 my IOmega Zip drive to my laptop it appears like a SCSI disk named
 467 ``/dev/da0'', but so does the RAID-5 array attached to the PCI SCSI controller
 468 installed in my laptop's docking station.  If I change mode to ``a+rw''
 469 on the Zip drive, do I want that mode to apply to the RAID-5 as well?
 470 Unlikely.
 471 .lp
 472 And what if we have persistent information about the mode of
 473 device ``/dev/sio0'', but we boot and do not find any sio devices?
 474 Do we keep the information in our device-persistence registry?
 475 How long do we keep it?  If I borrow a modem card,
 476 set the permissions to some non-standard value like 0666,
 477 and then attach some other serial device a year from now - do I
 478 want some old permissions changes to come back and haunt me,
 479 just because they both happened to be ``/dev/sio0''?
 480 Unlikely.
 481 .lp
 482 The fact that more people have laptop computers today than
 483 five years ago, and the fact that nobody has been able to credibly
 484 propose where a persistent DEVFS would actually store the
 485 information about these things in the first place has settled the issue.
 486 .lp
 487 Persistence may be the right answer, but to the
 488 wrong question: persistence is not a desirable property for a DEVFS
 489 when the hardware configuration may change literally at any time.
 490 .sh 2 "Who decides on the names?"
 491 .lp
 492 In a DEVFS-enabled system, the responsibility for creating nodes in
 493 /dev shifts to the device drivers, and consequently the device
 494 drivers get to choose the names of the device files.
 495 In addition, initial values for owner, group and mode bits are
 496 provided by the device driver.
 497 .lp
 498 But should it be possible to rename ``/dev/lpt0'' to ``/dev/myprinter''?
 499 While the obvious affirmative answer is easy to arrive at, it leaves
 500 a lot to be desired once the implications are unmasked.
 501 .lp
 502 Most device drivers know their own name and use it purposefully in
 503 their debug and log messages to identify themselves.
 504 Furthermore, the ``NewBus'' [NewBus] infrastructure facility,
 505 which ties hardware to device drivers, identifies things by name
 506 and unit numbers.
 507 .lp
 508 In fact, a very common way to report errors is:
 509 .(b M
 510 .vs -3
 511 \fC\s-2#define LPT_NAME "lpt" /* our official name */
 512 [...]
 513 printf(LPT_NAME
 514     ": cannot alloc ppbus (%d)!", error);\fP\s+2
 515 .vs +3
 516 .)b
 517 .lp
 518 So despite the user renaming the device node pointing to the printer
 519 to ``myprinter'', this has absolutely no effect in the kernel and can
 520 be considered a userland aliasing operation.
 521 .lp
 522 The decision was therefore made that it should not be possible to rename
 523 device nodes since it would only lead to confusion and because the desired
 524 effect could be attained by giving the user the ability to create
 525 symlinks in DEVFS.
 526 .sh 2 "On-demand device creation"
 527 .lp
 528 Pseudo-devices like pty, tun and bpf,
 529 but also some real devices, may not pre-emptively create entries for all
 530 possible device nodes.  It would be a pointless waste of resources
 531 to always create 1000 ptys just in case they are needed,
 532 and in the worst case more than 1800 device nodes would be needed per
 533 physical disk to represent all possible slices and partitions.
 534 .lp
 535 For pseudo-devices the task at hand is to make a magic device node,
 536 ``/dev/pty'', which when opened will magically transmogrify into the
 537 first available pty subdevice, maybe ``/dev/pty123''.
 538 .lp
 539 Device submodes, on the other hand, work by having multiple
 540 entries in /dev, each with a different minor number, as a way to instruct
 541 the device driver in aspects of its operation.  The most widespread
 542 example is probably ``/dev/mt0'' and ``/dev/nmt0'', where the node
 543 with the extra ``n''
 544 instructs the tape device driver to not rewind on close.\**
 545 .(f
 546 \** This is the answer to the question in footnote number 2.
 547 .)f
 548 .lp
 549 Some UNIX systems have solved the problem for pseudo-devices by
 550 creating magic cloning devices like ``/dev/tcp''.
 551 When a cloning device is opened,
 552 it finds a free instance and through vnode and file descriptor mangling
 553 returns this new device to the opening process.
 554 .lp
 555 This scheme has two disadvantages: the complexity of switching vnodes
 556 in midstream is non-trivial, but even worse is the fact that it
 557 does not work for
 558 submodes for a device because it only reacts to one particular /dev entry.
 559 .lp
 560 The solution for both needs is a more flexible on-demand device
 561 creation, implemented in FreeBSD as a two-level lookup.
 562 When a
 563 filename is looked up in DEVFS, a match in the existing device nodes is
 564 sought first and if found, returned.
 565 If no match is found, device drivers are polled in turn to ask if
 566 they would be able to synthesise a device node of the given name.
 567 .lp
 568 The device driver gets a chance to modify the name
 569 and create a device with make_dev().
 570 If one of the drivers succeeds in this, the lookup is started over and
 571 the newly found device node is returned:
 572 .(b M
 573 .vs -3
 574 \fC\s-2pty_clone()
 575    if (name != "pty")
 576       return(NULL); /* no luck */
 577    n = find_next_unit();
 578    dev = make_dev(...,n,"pty%d",n);
 579    name = dev->name;
 580    return(dev);\fP\s+2
 581 .vs +3
 582 .)b
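.lp
The two-level lookup can also be sketched as a small self-contained C program
fragment.
This is only an illustration under stated assumptions: the node table, the
sk_ names and the handler signature are hypothetical, and the real interface
built around make_dev() is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_MAXNODES 8
#define SK_NAMELEN  16

struct sk_node {
    char name[SK_NAMELEN];
};

static struct sk_node sk_nodes[SK_MAXNODES];   /* existing device nodes */
static int sk_nnodes;
static int sk_next_pty;                        /* next free pty unit */

static struct sk_node *
sk_find(const char *name)
{
    int i;

    for (i = 0; i < sk_nnodes; i++)
        if (strcmp(sk_nodes[i].name, name) == 0)
            return (&sk_nodes[i]);
    return (NULL);
}

/* A clone handler may create a node and rewrite the name being looked up. */
typedef int (*sk_clone_fn)(char *name, size_t len);

static int
sk_pty_clone(char *name, size_t len)
{
    struct sk_node *n;

    if (strcmp(name, "pty") != 0 || sk_nnodes >= SK_MAXNODES)
        return (0);                            /* no luck */
    n = &sk_nodes[sk_nnodes++];
    snprintf(n->name, sizeof(n->name), "pty%d", sk_next_pty++);
    snprintf(name, len, "%s", n->name);        /* hand back the new name */
    return (1);
}

static const sk_clone_fn sk_cloners[] = { sk_pty_clone };

struct sk_node *
sk_lookup(const char *wanted)
{
    char name[SK_NAMELEN];
    size_t i;

    assert(wanted != NULL);
    if (strlen(wanted) >= sizeof(name))
        return (NULL);
    strcpy(name, wanted);
    if (sk_find(name) != NULL)                 /* level one: existing nodes */
        return (sk_find(name));
    for (i = 0; i < sizeof(sk_cloners) / sizeof(sk_cloners[0]); i++)
        if (sk_cloners[i](name, sizeof(name))) /* level two: poll drivers */
            return (sk_find(name));            /* start over with new name */
    return (NULL);
}
```

.lp
In this sketch each lookup of ``pty'' yields a fresh node (``pty0'', then
``pty1''), a later lookup of ``pty0'' is satisfied at the first level, and a
name no handler recognises simply fails.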
 583 .lp
 584 An interesting mixed use of this mechanism is with the sound device drivers.
 585 Modern sound devices have multiple channels, presumably to allow the
 586 user to listen to CNN, Napstered MP3 files and Quake sound effects at
 587 the same time.
 588 The only problem is that all applications attempt to open ``/dev/dsp''
 589 since they have no concept of multiple sound devices.
 590 The sound device drivers use the cloning facility to direct ``/dev/dsp''
 591 to the first available sound channel completely transparently to the
 592 process.
 593 .lp
 594 There are very few drawbacks to this mechanism, the major one being
 595 that ``ls /dev'' now errs on the sparse side instead of the rich when used
 596 as a system device inventory, a practice which has always been
 597 of dubious precision at best.
 598 .sh 2 "Deleting and recreating devices"
 599 .lp
 600 Deleting device nodes is no problem to implement, but as likely as not,
 601 some people will want a method to get them back.
 602 Since only the device driver knows how to create a given device,
 603 recreation cannot be performed solely on the basis of the parameters
 604 provided by a process in userland.
 605 .lp
 606 In order to not complicate the code which updates the directory
 607 structure for a mountpoint to reflect changes in the DEVFS inode list,
 608 a deleted entry is merely marked with DE_WHITEOUT instead of being
 609 removed entirely.
 610 Otherwise a separate list would be needed for inodes which we had
 611 deleted so that they would not be mistaken for new inodes.
 612 .lp
 613 The obvious way to recreate deleted devices is to let mknod(2) do it
 614 by matching the name and disregarding the major/minor arguments.
 615 Recreating the device with mknod(2) will simply remove the DE_WHITEOUT
 616 flag.
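.lp
A minimal sketch of the whiteout idea follows.
Only the flag name DE_WHITEOUT comes from the text; the flag value, the
directory table and the sk_ function names are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_DE_WHITEOUT 0x1      /* flag value is an assumption */

struct sk_dirent {
    const char *de_name;
    int         de_flags;
};

/* Entries the drivers have created for this mountpoint. */
static struct sk_dirent sk_dir[] = {
    { "null", 0 },
    { "fd0",  0 },
};
#define SK_NDIR (sizeof(sk_dir) / sizeof(sk_dir[0]))

static struct sk_dirent *
sk_dir_find(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < SK_NDIR; i++)
        if (strcmp(sk_dir[i].de_name, name) == 0)
            return (&sk_dir[i]);
    return (NULL);
}

/* A whited-out entry still exists but is hidden from lookups. */
int
sk_visible(const char *name)
{
    struct sk_dirent *de = sk_dir_find(name);

    return (de != NULL && (de->de_flags & SK_DE_WHITEOUT) == 0);
}

/* Deleting only marks the entry; nothing is unlinked from the list. */
int
sk_remove(const char *name)
{
    struct sk_dirent *de = sk_dir_find(name);

    if (de == NULL || (de->de_flags & SK_DE_WHITEOUT) != 0)
        return (-1);
    de->de_flags |= SK_DE_WHITEOUT;
    return (0);
}

/* Recreation matches on name alone; the numbers passed in are ignored. */
int
sk_mknod(const char *name, unsigned int maj, unsigned int min)
{
    struct sk_dirent *de = sk_dir_find(name);

    (void)maj;
    (void)min;
    if (de == NULL || (de->de_flags & SK_DE_WHITEOUT) == 0)
        return (-1);             /* unknown name, or nothing to restore */
    de->de_flags &= ~SK_DE_WHITEOUT;
    return (0);
}
```

.lp
Because the entry never leaves the table, a name which no driver ever created
cannot be conjured up from userland, whatever major and minor numbers are
supplied.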
 617 .sh 2 "Jail(2), chroot(2) and DEVFS"
 618 .lp
 619 The primary requirement from facilities like jail(2) and chroot(2)
 620 is that it must be possible to control the contents of a DEVFS mount
 621 point.
 622 .lp
 623 Obviously, it would not be desirable for dynamic devices to pop
 624 into existence in the carefully pruned /dev of jails so it must be
 625 possible to mark a DEVFS mountpoint as ``no new devices''.
 626 And in the same way, the jailed root should not be able to recreate
 627 device nodes which the real root has removed.
 628 .lp
 629 These behaviours will be controlled with mount options, but these have not
 630 yet been implemented because FreeBSD has run out of bitmap flags for
 631 mount options, and a new unlimited mount option implementation is
 632 still not in place at the time of writing.
 633 .lp
 634 One mount option, ``jaildevfs'', will restrict the contents of the
 635 DEVFS mountpoint to the ``normal set'' of devices for a jail and
 636 automatically hide all future devices and make it impossible
 637 for a jailed root to un-hide hidden entries while letting an un-jailed
 638 root do so.
 639 .lp
 640 Mounting or remounting read-only will prevent all future
 641 devices from appearing and will make it impossible to
 642 hide or un-hide entries in the mountpoint.
 643 This is probably only useful for chroots or jails where no tty
 644 access is intended since cloning will not work either.
 645 .lp
 646 More mount options may be needed as more experience is gained.
 647 .sh 2 "Default mode, owner & group"
 648 .lp
 649 When a device driver creates a device node, and a DEVFS mount adds it
 650 to its directory tree, it needs to have some values for the access
 651 control fields: mode, owner and group.
 652 .lp
 653 Currently, the device driver specifies the initial values in the
 654 make_dev() call, but this is far from optimal.
 655 For one thing, embedding magic UIDs and GIDs in the kernel is simply
 656 bad style unless they are numerically zero.
 657 More seriously, they represent compile-time defaults which in these
 658 enlightened days are rather old-fashioned.
 659 .lp
 660 .sh 1 "Cleaning up before we build: struct specinfo and dev_t"
 661 .lp
 662 Most of the rest of the paper will be about the various challenges
 663 and issues in the implementation of DEVFS in FreeBSD.
 664 All of this should be applicable to other systems derived from
 665 4.4BSD-Lite as well.
 666 .lp
 667 POSIX has defined a type called ``dev_t'' which is the identity of a device.
 668 This is mainly for use in the few system calls which know about devices:
 669 stat(2), fstat(2) and mknod(2).
 670 A dev_t is constructed by logically OR'ing
 671 the major# and minor# for the device.
 672 Since those have been defined
 673 as having no overlapping bits, the major# and minor#
 674 can be retrieved from the dev_t by a simple masking operation.
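.lp
As a minimal sketch of this packing, assume a 16-bit device number with
an 8-bit major# in the upper byte and an 8-bit minor# in the lower byte.
The exact field widths differ between systems, and the x_ names are
hypothetical, chosen only to avoid clashing with the system's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit device number: 8 bits of major#, 8 bits of minor#. */
typedef uint16_t x_dev_t;

#define X_MAJOR_SHIFT   8
#define X_MINOR_MASK    0x00ffu

/* Build a device number by OR'ing the two non-overlapping fields. */
static x_dev_t
x_makedev(unsigned major, unsigned minor)
{
        return ((x_dev_t)(((major & 0xffu) << X_MAJOR_SHIFT) |
            (minor & X_MINOR_MASK)));
}

/* Recover the major# by shifting the upper field down. */
static unsigned
x_major(x_dev_t dev)
{
        return ((unsigned)dev >> X_MAJOR_SHIFT);
}

/* Recover the minor# by masking off the upper field. */
static unsigned
x_minor(x_dev_t dev)
{
        return (dev & X_MINOR_MASK);
}
```

With this layout, x_makedev(13, 4) yields 0x0d04, from which x_major()
and x_minor() recover 13 and 4 again.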
 675 .lp
 676 Although the kernel had a well-defined concept of any particular
 677 device, it did not have a data structure to represent ``a device''.
 678 The device driver has such a structure, traditionally called ``softc'',
 679 but the upper layers of the kernel do not (and should not!) have access to the
 680 device driver's private data structures.
 681 .lp
 682 It is an interesting tale how things got to be this way,\**
 683 .(f
 684 \** Basically, devices should have been moved up with sockets and
 685 pipes at the file descriptor level when the VFS layering was introduced,
 686 rather than have all the special casing throughout the vnode system.
 687 .)f
 688 but for now let us simply record
 689 how the data structures actually related to each other
 690 in the 4.4BSD release (Fig. 1). [44BSDBook]
 691 .(z
 692 .PS 3
 693 F: box "file" "handle"
 694 arrow down from F.s
 695 V: box "vnode"
 696 arrow right from V.e
 697 S: box "specinfo"
 698 arrow down from V.s
 699 I: box "inode"
 700 arrow right from I.e
 701 C: box invis "devsw[]" "[major#]"
 702 arrow down from C.s
 703 D: box "device" "driver"
 704 line right from D.e
 705 box invis "softc[]" "[minor#]"
 706 F2: box "file" "handle" at F + (2.5,0)
 707 arrow down from F2.s
 708 V2: box "vnode"
 709 arrow right from V2.e
 710 S2: box "specinfo"
 711 arrow down from V2.s
 712 I2: box "inode"
 713 arrow left from I2.w
 714 .PE
 715 .ce 1
 716 Fig. 1 - Data structures in 4.4BSD
 717 .)z
 718 .lp
 719 As for all other files, a vnode references a filesystem inode, but
 720 in addition it points to a ``specinfo'' structure.  In the inode
 721 we find the dev_t which is used to reference the device driver.
 722 .lp
 723 Access to the device driver happens by extracting the major# from
 724 the dev_t, indexing through the global devsw[] array to locate
 725 the device driver's entry point.
 726 .lp
 727 The device driver will extract the minor# from the dev_t and use
 728 that as the index into the softc array of private data per device.
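.lp
The two lookups can be sketched in user-space C as follows.
This is an illustration only: the table sizes, the major# given to the
``foo'' driver and all x_ names are assumptions, and the 8+8 bit split
of the device number is the same hypothetical layout as before.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define X_NMAJOR        4       /* size of the hypothetical devsw[] table */
#define X_NFOO          2       /* compiled-in number of "foo" units */

struct x_devsw {
        int (*d_open)(unsigned dev);
};

struct x_foo_softc {
        int open_count;
} x_foo_softc[X_NFOO];

/* Driver side: the minor# selects the per-device softc. */
static int
x_foo_open(unsigned dev)
{
        unsigned unit = dev & 0xffu;

        if (unit >= X_NFOO)
                return (ENXIO);
        x_foo_softc[unit].open_count++;
        return (0);
}

/* Kernel side: one global table indexed by major#; "foo" is major# 1. */
static struct x_devsw x_devsw[X_NMAJOR] = {
        [1] = { x_foo_open },
};

static int
x_spec_open(unsigned dev)
{
        unsigned maj = (dev >> 8) & 0xffu;

        if (maj >= X_NMAJOR || x_devsw[maj].d_open == NULL)
                return (ENXIO);
        return (x_devsw[maj].d_open(dev));
}
```

Note that both sides have to check their index before using it, since
nothing ties a device number found in an inode to a configured driver
or unit.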
 729 .lp
 730 The ``specinfo'' structure is a little sidekick which vnodes grew along the way,
 731 and is used to find all vnodes which reference the same device (i.e.
 732 they have the same major# and minor#).
 733 This linkage is used to determine
 734 which vnode is the ``chosen one'' for this device, and to keep track of
 735 open(2)/close(2) against this device.
 736 The actual implementation was an inefficient hash table
 737 which, depending on the vnode reclamation rate and /dev directory lookup
 738 traffic, could become a measurable performance liability.
 739 .sh 2 "The new vnode/inode/dev_t layout"
 740 .lp
 741 In the new layout (Fig. 2) the specinfo structure takes a central
 742 role.  There is only one instance of struct specinfo per
 743 device (i.e. unique major#
 744 and minor# combination) and all vnodes referencing this device point
 745 to this structure directly.
 746 .(z
 747 .PS 2.25
 748 F: box "file" "handle"
 749 arrow down from F.s
 750 V: box "vnode"
 751 arrow right from V.e
 752 S: box "specinfo"
 753 arrow down from V.s
 754 I: box "inode"
 755 F2: box "file" "handle" at F + (2.5,0)
 756 arrow down from F2.s
 757 V2: box "vnode"
 758 arrow left from V2.w
 759 arrow down from V2.s
 760 I2: box "inode"
 761 arrow down from S.s
 762 D: box "device" "driver"
 763 .PE
 764 .ce 1
 765 Fig. 2 - The new FreeBSD data structures.
 766 .)z
 767 .lp
 768 In userland, a dev_t is still the logical OR of the major# and
 769 minor#, but this entity is now called a udev_t in the kernel.
 770 In the kernel a dev_t is now a pointer to a struct specinfo.
 771 .lp
 772 All vnodes referencing a device are linked to a list hanging
 773 directly off the specinfo structure, removing the need for the
 774 hash table and consequently simplifying and speeding up a lot
 775 of code dealing with vnode instantiation, retirement and
 776 name-caching.
 777 .lp
 778 The entry points to the device driver are stored in the specinfo
 779 structure, removing the need for the devsw[] array and allowing
 780 device drivers to use separate entrypoints for various minor numbers.
 781 .lp
 782 This is very convenient for devices which have a ``control''
 783 device for management and tuning.  The control device almost always
 784 has entirely separate open/close/ioctl implementations [MD.C].
 785 .lp
 786 In addition to this, two data elements are included in the specinfo
 787 structure but ``owned'' by the device driver.  Typically the
 788 device driver will store a pointer to the softc structure in
 789 one of these, and unit number or mode information in the other.
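.lp
A much reduced sketch of such a structure might look as follows.
Only si_drv1 is a name taken from the driver example later in this paper;
the other field names, the x_ prefixes and the two tiny ``foo'' entry
points are assumptions made for illustration, and the list of vnodes is
left out.

```c
#include <assert.h>
#include <stddef.h>

struct x_specinfo;
typedef struct x_specinfo *x_dev_t;     /* in-kernel dev_t: a pointer */

struct x_cdevsw {
        int (*d_open)(x_dev_t dev);
};

struct x_specinfo {
        unsigned         si_udev;       /* major# | minor# as userland sees it */
        struct x_cdevsw *si_devsw;      /* entry points, stored per device */
        void            *si_drv1;       /* owned by the driver, e.g. softc */
        void            *si_drv2;       /* owned by the driver, e.g. unit/mode */
};

struct x_foo_softc {
        int open_count;
};

/* Data device: the softc comes straight from the dev_t, no table lookup. */
static int
x_foo_open(x_dev_t dev)
{
        struct x_foo_softc *sc = dev->si_drv1;

        sc->open_count++;
        return (0);
}

/* Control device: a separate entry point which needs no softc at all. */
static int
x_foo_ctl_open(x_dev_t dev)
{
        (void)dev;
        return (0);
}

static struct x_cdevsw x_foo_cdevsw = { x_foo_open };
static struct x_cdevsw x_foo_ctl_cdevsw = { x_foo_ctl_open };

/* Kernel side: dispatch through the device itself, not through devsw[]. */
static int
x_dev_open(x_dev_t dev)
{
        return (dev->si_devsw->d_open(dev));
}
```

Because the entry points hang off each device, the data device and the
control device of the same driver can point at different x_cdevsw
structures without either function having to inspect the minor#.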
 790 .lp
 791 This removes the need for drivers to find the softc using array
 792 indexing based on the minor#, and at the same time has obviated
 793 the need for the compiled-in ``NFOO'' constants which traditionally
 794 determined how many softc structures and therefore devices
 795 the driver could support.\**
 796 .(f
 797 \** Not to mention all the drivers which implemented panic(2)
 798 because they forgot to perform bounds checking on the index before
 799 using it on their softc arrays.
 800 .)f
 801 .lp
 802 There are some trivial technical issues relating to allocating
 803 the storage for specinfo early in the boot sequence and how to
 804 find a specinfo from the udev_t/major#+minor#, but they will
 805 not be discussed here.
 806 .sh 2 "Creating and destroying devices"
 807 .lp
 808 Ideally, devices should only be created and
 809 destroyed by the device drivers which know what devices are present.
 810 This is accomplished with the make_dev() and destroy_dev()
 811 function calls.
 812 .lp
 813 Life is seldom quite that simple.  The operating system might be called
 814 on to act as an NFS server for a diskless workstation, possibly even
 815 of a different architecture, so we still need to be able to represent
 816 device nodes with no device driver backing in the filesystems.
 817 Consequently we need to be able to create a specinfo from
 818 the major#+minor# in these inodes when we encounter them.
 819 In practice this is quite trivial, but in a few places in the code
 820 one needs to be aware of the existence
 821 of both ``named'' and ``anonymous'' specinfo structures.
 822 .lp
 823 The make_dev() call creates a specinfo structure and populates
 824 it with driver entry points, major#, minor#, device node name
 825 (for instance ``lpt0''), UID, GID and access mode bits.  The return
 826 value is a dev_t (i.e., a pointer to struct specinfo).
 827 If the device driver determines that the device is no longer
 828 present, it calls destroy_dev(), giving the dev_t as argument,
 829 and the dev_t will be cleaned and converted to an anonymous dev_t.
 830 .lp
 831 Once created with make_dev(), a named dev_t exists until destroy_dev()
 832 is called by the driver.  The driver can rely on this and keep state
 833 in the fields in dev_t which are reserved for driver use.
 834 .sh 1 "DEVFS"
 835 .lp
 836 By now we have all the relevant information about each device node
 837 collected in struct specinfo but we still have one problem to
 838 solve before we can add the DEVFS filesystem on top of it.
 839 .sh 2 "The interrupt problem"
 840 .lp
 841 Some device drivers, notably the CAM/SCSI subsystem in FreeBSD,
 842 will discover changes in the device configuration inside an interrupt
 843 routine.
 844 .lp
 845 This imposes some limitations on what can and should be done:
 846 first, one should minimise the amount
 847 of work done in an interrupt routine for performance reasons;
 848 second, to avoid deadlocks, vnodes and mountpoints should not be
 849 accessed from an interrupt routine.
 850 .lp
 851 Also, in addition to the locking issue,
 852 a machine can have many instances of DEVFS mounted:
 853 for a jail(8) based virtual-machine web-server several hundred instances
 854 are not unheard of, making it far too expensive to update all of them
 855 in an interrupt routine.
 856 .lp
 857 The solution to this problem is to do all the filesystem work on
 858 the filesystem side of DEVFS and use atomically manipulated integer indices
 859 (``inode numbers'') as the barrier between the two sides.
 860 .lp
 861 The functions called from the device drivers, make_dev(), destroy_dev()
 862 &c. only manipulate the DEVFS inode number of the dev_t in
 863 question and do not even get near any mountpoints or vnodes.
 864 .lp
 865 For make_dev() the task is to assign a unique inode number to the
 866 dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array.
 867 .(b M
 868 .vs -3
 869 \fC\s-2make_dev(...)
 870     store argument values in dev_t
 871     assign unique inode number to dev_t
 872     atomically insert dev_t into inode_array\fP\s+2
 873 .vs +3
 874 .)b
 875 .lp
 876 For destroy_dev() the task is the opposite: clear the inode number
 877 in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t
 878 array.
 879 .(b M
 880 .vs -3
 881 \fC\s-2destroy_dev(...)
 882     clear fields in dev_t
 883     zero dev_t inode number
 884     atomically clear entry in inode_array\fP\s+2
 885 .vs +3
 886 .)b
 887 .lp
 888 Both functions conclude by atomically incrementing a global variable
 889 \fCdevfs_generation\fP to leave an indication to the filesystem
 890 side that something has changed.
 891 .lp
 892 By modifying the global state only with atomic instructions, locks
 893 have been entirely avoided in this part of the code which means that
 894 the make_dev() and destroy_dev() functions can be called from practically
 895 anywhere in the kernel at any time.
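.lp
The driver-side half can be sketched with C11 atomics.
This is not the kernel code: <stdatomic.h> is swapped in for the kernel's
own atomic primitives, the table size is arbitrary, and the x_ names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define X_NINODES       64      /* hypothetical fixed table size */

struct x_specinfo {
        int si_inode;           /* DEVFS inode number, 0 when not listed */
};

static _Atomic(struct x_specinfo *) x_inode_array[X_NINODES];
static atomic_uint x_devfs_generation;

/* make_dev() side: claim a free slot with compare-and-swap. */
static int
x_devfs_create(struct x_specinfo *dev)
{
        for (int i = 1; i < X_NINODES; i++) {   /* slot 0 is never used */
                struct x_specinfo *expected = NULL;

                dev->si_inode = i;      /* tentative until the swap succeeds */
                if (atomic_compare_exchange_strong(&x_inode_array[i],
                    &expected, dev)) {
                        atomic_fetch_add(&x_devfs_generation, 1);
                        return (i);
                }
        }
        dev->si_inode = 0;
        return (0);                     /* table full */
}

/* destroy_dev() side: clear the inode number, then the table entry. */
static void
x_devfs_destroy(struct x_specinfo *dev)
{
        int i = dev->si_inode;

        if (i == 0)
                return;
        dev->si_inode = 0;
        atomic_store(&x_inode_array[i], (struct x_specinfo *)NULL);
        atomic_fetch_add(&x_devfs_generation, 1);
}
```

Neither function sleeps or takes a lock, and neither touches a
mountpoint or a vnode; the only trace left for the filesystem side is
the changed table entry and the bumped generation counter.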
 896 .lp
 897 On the filesystem side of DEVFS, the only two vnode methods which examine
 898 or rely on the directory structure, VOP_LOOKUP and VOP_READDIR,
 899 call the function devfs_populate() to update their mountpoint's view
 900 of the device hierarchy to match current reality before doing any work.
 901 .(b M
 902 .vs -3
 903 \fC\s-2devfs_readdir(...)
 904     devfs_populate(...)
 905     ...\fP\s+2
 906 .vs +3
 907 .)b
 908 .lp
 909 The devfs_populate() function compares the current \fCdevfs_generation\fP
 910 to the value saved in the mountpoint last time devfs_populate() completed,
 911 and if (actually: while) they differ, a linear run is made through the
 912 devfs-global inode-array and the directory tree of the mountpoint is
 913 brought up to date.
 914 .lp
 915 The actual code is slightly more complicated than shown in the pseudo-code
 916 here because it has to deal with subdirectories and hidden entries.
 917 .(b M
 918 .vs -3
 919 \fC\s-2devfs_populate(...)
 920   while (mount->generation != devfs_generation)
 921     for i in all inodes
 922       if inode created
 923         create directory entry
 924       else if inode destroyed
 925         remove directory entry\fP\s+2
 926 .vs +3
 927 .)b
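.lp
A single-threaded C rendering of the same loop is given here as a sketch.
It assumes that the per-mount generation is saved before each scan, so
that a change arriving during the scan forces another pass; it uses plain
integers where the kernel would use atomic accesses, and it reduces a
directory entry to a flag.
All x_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define X_NINODES       8

/* Global side, as left behind by make_dev() and destroy_dev(). */
static const char *x_inode_name[X_NINODES];     /* NULL: no such device */
static unsigned x_devfs_generation;

/* Per-mount side. */
struct x_devfs_mount {
        unsigned generation;            /* value seen by the last populate */
        bool     present[X_NINODES];    /* directory entries in this mount */
};

static void
x_devfs_populate(struct x_devfs_mount *dm)
{
        while (dm->generation != x_devfs_generation) {
                /* Save first: a change during the scan forces a new pass. */
                dm->generation = x_devfs_generation;
                for (int i = 0; i < X_NINODES; i++) {
                        if (x_inode_name[i] != NULL && !dm->present[i])
                                dm->present[i] = true;  /* create entry */
                        else if (x_inode_name[i] == NULL && dm->present[i])
                                dm->present[i] = false; /* remove entry */
                }
        }
}

/* Stand-in for VOP_LOOKUP: catch up first, then answer. */
static bool
x_devfs_lookup(struct x_devfs_mount *dm, int inode)
{
        x_devfs_populate(dm);
        return (inode >= 0 && inode < X_NINODES && dm->present[inode]);
}
```

A mount which is never looked at is never scanned: its saved generation
simply falls behind until the next lookup or readdir arrives.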
 928 .lp
 929 Access to the global DEVFS inode table is again implemented
 930 with atomic instructions and failsafe retries to avoid the
 931 need for locking.
 932 .lp
 933 From a performance point of view this scheme also means that a particular
 934 DEVFS mountpoint is not updated until it needs to be, and then always by
 935 a process belonging to the jail in question, thus minimising and
 936 distributing the CPU load.
 937 .sh 1 "Device-driver impact"
 938 .lp
 939 All these changes have had a significant impact on how device drivers
 940 interact with the rest of the kernel regarding registration of
 941 devices.
 942 .lp
 943 Looking first at the ``before'' image in Fig. 3, we notice
 944 the NFOO define which imposes a firm upper limit on the number of
 945 devices the kernel can deal with.
 946 Also notice that the softc structure for all of them is allocated
 947 at compile time.
 948 This is because most device drivers (and texts on writing device
 949 drivers) are from before the general
 950 kernel malloc facility [Mckusick1988] was introduced into the BSD kernel.
 951 .lp
 952 .(b M
 953 .vs -3
 954 \fC\s-2
 955 #ifndef NFOO
 956 #       define NFOO     4
 957 #endif
 958
 959 struct foo_softc {
 960         ...
 961 } foo_softc[NFOO];
 962
 963 int nfoo = 0;
 964
 965 foo_open(dev, ...)
 966 {
 967         int unit = minor(dev);
 968         struct foo_softc *sc;
 969
 970         if (unit >= NFOO || unit >= nfoo)
 971                 return (ENXIO);
 972
 973         sc = &foo_softc[unit];
 974
 975         ...
 976 }
 977
 978 foo_attach(...)
 979 {
 980         struct foo_softc *sc;
 981         static int once;
 982
 983         ...
 984         if (nfoo >= NFOO) {
 985                 /* Have hardware, can't handle */
 986                 return (-1);
 987         }
 988         sc = &foo_softc[nfoo++];
 989         if (!once) {
 990                 cdevsw_add(&cdevsw);
 991                 once++;
 992         }
 993         ...
 994 }
 995 \fP\s+2
 996 Fig. 3 - Device-driver, old style.
 997 .vs +3
 998 .)b
 999 .lp
1000 Also notice how range checking is needed to make sure that the
1001 minor# is within range.  This code gets more complex if device-numbering
1002 is sparse.  Code equivalent to that shown in the foo_open() routine
1003 would also be needed in foo_read(), foo_write(), foo_ioctl() &c.
1004 .lp
1005 Finally notice how the attach routine needs to remember to register
1006 the cdevsw structure (not shown) when the first device is found.
1007 .lp
1008 Now, compare this to our ``after'' image in Fig. 4.
1009 NFOO is totally gone and so is the compile time allocation
1010 of space for softc structures.
1011 .lp
1012 The foo_open (and foo_close, foo_ioctl &c) functions can now
1013 derive the softc pointer directly from the dev_t they receive
1014 as an argument.
1015 .lp
1016 .(b M
1017 .vs -3
1018 \fC\s-2
1019 struct foo_softc {
1020         ....
1021 };
1022
1023 int nfoo;
1024
1025 foo_open(dev, ...)
1026 {
1027         struct foo_softc *sc = dev->si_drv1;
1028
1029         ...
1030 }
1031
1032 foo_attach(...)
1033 {
1034         struct foo_softc *sc;
1035
1036         ...
1037         sc = MALLOC(..., M_ZERO);
1038         if (sc == NULL) {
1039                 /* Have hardware, can't handle */
1040                 return (-1);
1041         }
1042         sc->dev = make_dev(&cdevsw, nfoo,
1043             UID_ROOT, GID_WHEEL, 0644,
1044             "foo%d", nfoo);
1045         nfoo++;
1046         sc->dev->si_drv1 = sc;
1047         ...
1048 }
1049 \fP\s+2
1050 Fig. 4 - Device-driver, new style.
1051 .vs +3
1052 .)b
1053 .lp
1054 In foo_attach() we can now attach to all the devices we can
1055 allocate memory for and we register the cdevsw structure per
1056 dev_t rather than globally.
1057 .lp
1058 This last trick is what allows us to discard all bounds checking
1059 in the foo_open() &c. routines, because they can only be
1060 called through the cdevsw, and the cdevsw is only attached to
1061 dev_t's which foo_attach() has created.
1062 There is no way to end
1063 up in foo_open() with a dev_t not created by foo_attach().
1064 .lp
1065 In the two examples here, the difference is only 10 lines of source
1066 code, primarily because only one of the worker functions of the
1067 device driver is shown.
1068 In real device drivers it is not uncommon to save 50 or more lines
1069 of source code, which typically is about a percent or two of the
1070 entire driver.
1071 .sh 1 "Future work"
1072 .lp
1073 Apart from some minor issues to be cleaned up, DEVFS is now a reality
1074 and future work is therefore likely to concentrate on applying the
1075 facilities and functionality of DEVFS to FreeBSD.
1076 .sh 2 "devd"
1077 .lp
1078 It would be logical to complement DEVFS with a ``device-daemon'' which
1079 could configure and de-configure devices as they come and go.
1080 When a disk appears, mount it.
1081 When a network interface appears, configure it.
1082 And in some configurable way allow the user to customise the action,
1083 so that for instance images will automatically be copied off the
1084 flash-based media from a camera, &c.
1085 .lp
1086 In this context it is good to question how we view dynamic devices.
1087 If for instance a printer is removed in the middle of a print job
1088 and another printer arrives a moment later, should the system
1089 automatically continue the print job on this new printer?
1090 When a disk-like device arrives, should we always mount it?  Should
1091 we have a database of known disk-like devices to tell us where to
1092 mount it and what permissions to give the mountpoint?
1093 Some computers come in multiple configurations, for instance laptops
1094 with and without their docking station.  How do we want to present
1095 this to the users and what behaviour do the users expect?
1096 .sh 2 "Pathname length limitations"
1097 .lp
1098 In order to simplify memory management in the early stages of boot,
1099 the pathname relative to the mountpoint is presently stored in a
1100 small fixed size buffer inside struct specinfo.
1101 It should be possible to use filenames as long as the system otherwise
1102 permits, so some kind of extension mechanism is called for.
1103 .lp
1104 Since it cannot be guaranteed that memory can be allocated in
1105 all the possible scenarios where make_dev() can be called, it may
1106 be necessary to mandate that the caller allocates the buffer if
1107 the content will not fit inside the default buffer size.
1108 .sh 2 "Initial access parameter selection"
1109 .lp
1110 As it is now, device drivers propose the initial mode, owner and group
1111 for the device nodes, but it would be more flexible if it were possible
1112 to give the kernel a set of rules, much like packet filtering rules,
1113 which allow the user to set the wanted policy for new devices.
1114 Such a mechanism could also be used to filter new devices for mount
1115 points in jails and to determine other behaviour.
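.lp
What such a rule set could look like is sketched below.
Nothing of the kind exists at the time of writing: the rule format,
the first-match-wins evaluation, the numeric IDs and the x_ names are all
assumptions, and the POSIX fnmatch() function merely stands in for
whatever matching a kernel implementation would use.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <sys/types.h>

struct x_dev_rule {
        const char *pattern;    /* shell-style pattern on the node name */
        uid_t       uid;
        gid_t       gid;
        mode_t      mode;
};

/* Hypothetical policy; the first matching rule wins. */
static const struct x_dev_rule x_rules[] = {
        { "lpt*", 0, 7, 0660 },
        { "da*",  0, 5, 0640 },
        { "*",    0, 0, 0600 },         /* default for everything else */
};

/* Return the rule to apply to a new device node, or NULL if none match. */
static const struct x_dev_rule *
x_rule_lookup(const char *name)
{
        for (size_t i = 0; i < sizeof(x_rules) / sizeof(x_rules[0]); i++)
                if (fnmatch(x_rules[i].pattern, name, 0) == 0)
                        return (&x_rules[i]);
        return (NULL);
}
```

With a table like this, the driver would only supply the name of the
node and the administrator's rules would supply owner, group and mode.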
1116 .lp
1117 Doing these things from userland results in some awkward race conditions
1118 and software bloat for embedded systems, so a kernel approach may be more
1119 suitable.
1120 .sh 2 "Applications of on-demand device creation"
1121 .lp
1122 The facility for on-demand creation of devices has some very interesting
1123 possibilities.
1124 .lp
1125 One planned use is to enable user-controlled labelling
1126 of disks.
1127 Today disks have names like /dev/da0 and /dev/ad4, but since
1128 this numbering is topological, any change in the hardware configuration
1129 may rename the disks, causing /etc/fstab and backup procedures
1130 to get out of sync with the hardware.
1131 .lp
1132 The current idea is to store on the media of the disk a user-chosen
1133 disk name and allow access through this name, so that for instance
1134 /dev/mydisk0
1135 would be a symlink to whatever topological name the disk might have
1136 at any given time.
1137 .lp
1138 To simplify this and to avoid a forest of symlinks, it will probably
1139 be decided to move all the sub-divisions of a disk into one subdirectory
1140 per disk so just a single symlink can do the job.
1141 In practice that means that the current /dev/ad0s2f will become
1142 something like /dev/ad0/s2f and so on.
1143 Obviously, in the same way, disks could also be accessed by their
1144 topological address, down to the specific path in a SAN environment.
1145 .lp
1146 Another potential use could be for automated offline data media libraries.
1147 It would be quite trivial to make it possible to access all the media
1148 in the library using /dev/lib/$LABEL which would be a remarkable
1149 simplification compared with most current automated retrieval facilities.
1150 .lp
1151 Another use could be to access devices by parameter rather than by
1152 name.  One could imagine sending a printjob to /dev/printer/color/A2
1153 and behind the scenes a search would be made for a device with the
1154 correct properties and paper-handling facilities.
1155 .sh 1 "Conclusion"
1156 .lp
1157 DEVFS has been successfully implemented in FreeBSD,
1158 including a powerful, simple and flexible solution supporting
1159 pseudo-devices and on-demand device node creation.
1160 .lp
1161 Contrary to the trend, the implementation added functionality
1162 with a net decrease in source lines,
1163 primarily because of the improved API as seen from the device drivers' point of view.
1164 .lp
1165 Even if DEVFS is not desired, other 4.4BSD derived UNIX variants
1166 would probably benefit from adopting the dev_t/specinfo related
1167 cleanup.
1168 .sh 1 "Acknowledgements"
1169 .lp
1170 I first got started on DEVFS in 1989 because the abysmal performance
1171 of the Olivetti M250 computer forced me to implement a network-disk-device
1172 for Minix in order to retain my sanity.
1173 That initial work led to a
1174 crude but working DEVFS for Minix, so obviously both Andrew Tanenbaum
1175 and Olivetti deserve credit for inspiration.
1176 .lp
1177 Julian Elischer implemented a DEVFS for FreeBSD around 1994 which never
1178 quite made it to maturity and subsequently was abandoned.
1179 .lp
1180 Bruce Evans deserves special credit not only for his keen eye for detail,
1181 and his competent criticism but also for his enthusiastic resistance to the
1182 very concept of DEVFS.
1183 .lp
1184 Many thanks to the people who took time to help me stamp out ``Danglish''
1185 through their reviews and comments:  Chris Demetriou, Paul Richards,
1186 Brian Somers, Nik Clayton, and Hanne Munkholm.
1187 Any remaining insults to proper use of the English language are my own fault.
1188 .\" (list & why)
1189 .sh 1 "References"
1190 .lp
1191 [44BSDBook]
1192 McKusick, Bostic, Karels & Quarterman:
1193 ``The Design and Implementation of the 4.4BSD Operating System.''
1194 Addison Wesley, 1996, ISBN 0-201-54979-4.
1195 .lp
1196 [Heidemann91a]
1197 John S. Heidemann:
1198 ``Stackable layers: an architecture for filesystem development.''
1199 Master's thesis, University of California, Los Angeles, July 1991.
1200 Available as UCLA technical report CSD-910056.
1201 .lp
1202 [Kamp2000]
1203 Poul-Henning Kamp and Robert N. M. Watson:
1204 ``Confining the Omnipotent root.''
1205 Proceedings of the SANE 2000 Conference.
1206 Available in FreeBSD distributions in \fC/usr/share/papers\fP.
1207 .lp
1208 [MD.C]
1209 Poul-Henning Kamp et al:
1210 FreeBSD memory disk driver:
1211 \fCsrc/sys/dev/md/md.c\fP
1212 .lp
1213 [Mckusick1988]
1214 Marshall Kirk McKusick, Mike J. Karels:
1215 ``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel.''
1216 Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988.
1217 .lp
1218 [Mckusick1999]
1219 Dr. Marshall Kirk McKusick:
1220 Private email communication.
1221 \fI``According to the SCCS logs, the chroot call was added by Bill Joy
1222 on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
1223 That was well before we had ftp servers of any sort (ftp did not
1224 show up in the source tree until January 1983).  My best guess as
1225 to its purpose was to allow Bill to chroot into the /4.2BSD build
1226 directory and build a system using only the files, include files,
1227 etc contained in that tree.  That was the only use of chroot that
1228 I remember from the early days.''\fP
1229 .lp
1230 [Mckusick2000]
1231 Dr. Marshall Kirk McKusick:
1232 Private communication at BSDcon2000 conference.
1233 \fI``I have not used block devices since I wrote FFS and that
1234 was \fPmany\fI years ago.''\fP
1235 .lp
1236 [NewBus]
1237 NewBus is a subsystem which provides most of the glue between
1238 hardware and device drivers.  Despite the importance of this,
1239 no good overview documentation has ever been published
1240 for it.
1241 The following article by Alexander Langer in ``Dæmonnews'' is
1242 the best reference I can come up with:
1243 \fC\s-2http://www.daemonnews.org/200007/newbus-intro.html\fP\s+2
1244 .lp
1245 [Pike2000]
1246 Rob Pike:
1247 ``Systems Software Research is Irrelevant.''
1248 \fC\s-2http://www.cs.bell\-labs.com/who/rob/utah2000.pdf\fP\s+2
1249 .lp
1250 [Pike90a]
1251 Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey:
1252 ``Plan 9 from Bell Labs.''
1253 Proceedings of the Summer 1990 UKUUG Conference.
1254 .lp
1255 [Pike92a]
1256 Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom:
1257 ``The Use of Name Spaces in Plan 9.''
1258 Proceedings of the 5th ACM SIGOPS Workshop.
1259 .lp
1260 [Raspe1785]
1261 Rudolf Erich Raspe:
1262 ``Baron Münchhausen's Narrative of his marvellous Travels and Campaigns in Russia.''
1263 Kearsley, 1785.
1264 .lp
1265 [Ritchie74]
1266 D.M. Ritchie and K. Thompson:
1267 ``The UNIX Time-Sharing System.''
1268 Communications of the ACM, Vol. 17, No. 7, July 1974.
1269 .lp
1270 [Ritchie98]
1271 Dennis Ritchie: private conversation at the USENIX Annual Technical Conference,
1272 New Orleans, 1998.
1273 .lp
1274 [Thompson78]
1275 Ken Thompson:
1276 ``UNIX Implementation.''
1277 The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.