share/doc/papers/newvm/1.t

   1 .\" Copyright (c) 1986 The Regents of the University of California.
   2 .\" All rights reserved.
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .\" 3. All advertising materials mentioning features or use of this software
  13 .\"    must display the following acknowledgement:
  14 .\"     This product includes software developed by the University of
  15 .\"     California, Berkeley and its contributors.
  16 .\" 4. Neither the name of the University nor the names of its contributors
  17 .\"    may be used to endorse or promote products derived from this software
  18 .\"    without specific prior written permission.
  19 .\"
  20 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  21 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  22 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  23 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  24 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  25 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  26 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  27 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  28 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  29 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  30 .\" SUCH DAMAGE.
  31 .\"
  32 .\"     @(#)1.t 5.1 (Berkeley) 4/16/91
  33 .\" $FreeBSD$
  34 .\"
  35 .NH
  36 Motivations for a New Virtual Memory System
  37 .PP
  38 The virtual memory system distributed with Berkeley UNIX has served
  39 its design goals admirably well over the ten years of its existence.
  40 However the relentless advance of technology has begun to render it
  41 obsolete.
  42 This section of the paper describes the current design,
  43 points out the current technological trends,
  44 and attempts to define the new design considerations that should
  45 be taken into account in a new virtual memory design.
  46 .NH 2
  47 Implementation of 4.3BSD virtual memory
  48 .PP
  49 All Berkeley Software Distributions through 4.3BSD
  50 have used the same virtual memory design.
  51 All processes, whether active or sleeping, have some amount of
  52 virtual address space associated with them.
  53 This virtual address space
  54 is the combination of the amount of address space with which they initially
  55 started plus any stack or heap expansions that they have made.
  56 All requests for address space are allocated from available swap space
  57 at the time that they are first made;
  58 if there is insufficient swap space left to honor the allocation,
  59 the system call requesting the address space fails synchronously.
  60 Thus, the limit to available virtual memory is established by the
  61 amount of swap space allocated to the system.
  62 .PP
  63 Memory pages are used in a sort of shell game to contain the
  64 contents of recently accessed locations.
  65 As a process first references a location
  66 a new page is allocated and filled either with initialized data or
  67 zeros (for new stack and break pages).
  68 As the supply of free pages begins to run out, dirty pages are
  69 pushed to the previously allocated swap space so that they can be reused
  70 to contain newly faulted pages.
  71 If a previously accessed page that has been pushed to swap is once
  72 again used, a free page is reallocated and filled from the swap area
  73 [Babaoglu79], [Someren84].
  74 .NH 2
  75 Design assumptions for 4.3BSD virtual memory
  76 .PP
  77 The design criteria for the current virtual memory implementation
  78 were made in 1979.
  79 At that time the cost of memory was about a thousand times greater per
  80 byte than magnetic disks.
  81 Most machines were used as centralized time sharing machines.
  82 These machines had far more disk storage than they had memory
  83 and given the cost tradeoff between memory and disk storage,
  84 wanted to make maximal use of the memory even at the cost of
  85 wasting some of the disk space or generating extra disk I/O.
  86 .PP
  87 The primary motivation for virtual memory was to allow the
  88 system to run individual programs whose address space exceeded
  89 the memory capacity of the machine.
  90 Thus the virtual memory capability allowed programs to be run that
  91 could not have been run on a swap based system.
  92 Equally important in the large central timesharing environment
  93 was the ability to allow the sum of the memory requirements of
  94 all active processes to exceed the amount of physical memory on
  95 the machine.
  96 The expected mode of operation for which the system was tuned
  97 was to have the sum of active virtual memory be one and a half
  98 to two times the physical memory on the machine.
  99 .PP
 100 At the time that the virtual memory system was designed,
 101 most machines ran with little or no networking.
 102 All the file systems were contained on disks that were
 103 directly connected to the machine.
 104 Similarly all the disk space devoted to swap space was also
 105 directly connected.
 106 Thus the speed and latency with which file systems could be accessed
 107 were roughly equivalent to the speed and latency with which swap
 108 space could be accessed.
 109 Given the high cost of memory there was little incentive to have
 110 the kernel keep track of the contents of the swap area once a process
 111 exited since it could almost as easily and quickly be reread from the
 112 file system.
 113 .NH 2
 114 New influences
 115 .PP
 116 In the ten years since the current virtual memory system was designed,
 117 many technological advances have occurred.
 118 One effect of the technological revolution is that the
 119 micro-processor has become powerful enough to allow users to have their
 120 own personal workstations.
 121 Thus the computing environment is moving away from a purely centralized
 122 time sharing model to an environment in which users have a
 123 computer on their desk.
 124 This workstation is linked through a network to a centralized
 125 pool of machines that provide filing, computing, and spooling services.
 126 The workstations tend to have a large quantity of memory,
 127 but little or no disk space.
 128 Because users do not want to be bothered with backing up their disks,
 129 and because of the difficulty of having a centralized administration
 130 backing up hundreds of small disks, these local disks are typically
 131 used only for temporary storage and as swap space.
 132 Long term storage is managed by the central file server.
 133 .PP
 134 Another major technical advance has been in all levels of storage capacity.
 135 In the last ten years we have experienced a factor of four decrease in the
 136 cost per byte of disk storage.
 137 In this same period of time the cost per byte of memory has dropped
 138 by a factor of a hundred!
 139 Thus the cost per byte of memory compared to the cost per byte of disk is
 140 approaching a difference of only about a factor of ten.
 141 The effect of this change is that the way in which a machine is used
 142 is beginning to change dramatically.
 143 As the amount of physical memory on machines increases and the number of
 144 users per machine decreases, the expected
 145 mode of operation is changing from that of supporting more active virtual
 146 memory than physical memory to that of having a surplus of memory that can
 147 be used for other purposes.
 148 .PP
 149 Because many machines will have more physical memory than they do swap
 150 space (with diskless workstations as an extreme example!),
 151 it is no longer reasonable to limit the maximum virtual memory
 152 to the amount of swap space as is done in the current design.
 153 Consequently, the new design will allow the maximum virtual memory
 154 to be the sum of physical memory plus swap space.
 155 For machines with no swap space, the maximum virtual memory will
 156 be governed by the amount of physical memory.
 157 .PP
 158 Another effect of the current technology is that the latency and overhead
 159 associated with accessing the file system is considerably higher
 160 since the access must be over the network
 161 rather than to a locally-attached disk.
 162 One use of the surplus memory would be to
 163 maintain a cache of recently used files;
 164 repeated uses of these files would require at most a verification from
 165 the file server that the data was up to date.
 166 Under the current design, file caching is done by the buffer pool,
 167 while the free memory is maintained in a separate pool.
 168 The new design should have only a single memory pool so that any
 169 free memory can be used to cache recently accessed files.
 170 .PP
 171 Another portion of the memory will be used to keep track of the contents
 172 of the blocks on any locally-attached swap space analogously
 173 to the way that memory pages are handled.
 174 Thus inactive swap blocks can also be used to cache less-recently-used
 175 file data.
 176 Since the swap disk is locally attached, it can be much more quickly
 177 accessed than a remotely located file system.
 178 This design allows the user to simply allocate their entire local disk
 179 to swap space, thus allowing the system to decide what files should
 180 be cached to maximize its usefulness.
 181 This design has two major benefits.
 182 It relieves the user of deciding what files
 183 should be kept in a small local file system.
 184 It also insures that all modified files are migrated back to the
 185 file server in a timely fashion, thus eliminating the need to dump
 186 the local disk or push the files manually.
 187 .NH
 188 User Interface
 189 .PP
 190 This section outlines our new virtual memory interface as it is
 191 currently envisioned.
 192 The details of the system call interface are contained in Appendix A.
 193 .NH 2
 194 Regions
 195 .PP
 196 The virtual memory interface is designed to support both large,
 197 sparse address spaces as well as small, densely-used address spaces.
 198 In this context, ``small'' is an address space roughly the
 199 size of the physical memory on the machine,
 200 while ``large'' may extend up to the maximum addressability of the machine.
 201 A process may divide its address space up into a number of regions.
 202 Initially a process begins with four regions;
 203 a shared read-only fill-on-demand region with its text,
 204 a private fill-on-demand region for its initialized data,
 205 a private zero-fill-on-demand region for its uninitialized data and heap,
 206 and a private zero-fill-on-demand region for its stack.
 207 In addition to these regions, a process may allocate new ones.
 208 The regions may not overlap and the system may impose an alignment
 209 constraint, but the size of the region should not be limited
 210 beyond the constraints of the size of the virtual address space.
 211 .PP
 212 Each new region may be mapped either as private or shared.
 213 When it is privately mapped, changes to the contents of the region
 214 are not reflected to any other process that map the same region.
 215 Regions may be mapped read-only or read-write.
 216 As an example, a shared library would be implemented as two regions;
 217 a shared read-only region for the text, and a private read-write
 218 region for the global variables associated with the library.
 219 .PP
 220 A region may be allocated with one of several allocation strategies.
 221 It may map some memory hardware on the machine such as a frame buffer.
 222 Since the hardware is responsible for storing the data,
 223 such regions must be exclusive use if they are privately mapped.
 224 .PP
 225 A region can map all or part of a file.
 226 As the pages are first accessed, the region is filled in with the
 227 appropriate part of the file.
 228 If the region is mapped read-write and shared, changes to the
 229 contents of the region are reflected back into the contents of the file.
 230 If the region is read-write but private,
 231 changes to the region are copied to a private page that is not
 232 visible to other processes mapping the file,
 233 and these modified pages are not reflected back to the file.
 234 .PP
 235 The final type of region is ``anonymous memory''.
 236 Uninitialed data uses such a region, privately mapped;
 237 it is zero-fill-on-demand and its contents are abandoned
 238 when the last reference is dropped.
 239 Unlike a region that is mapped from a file,
 240 the contents of an anonymous region will never be read from or
 241 written to a disk unless memory is short and part of the region
 242 must be paged to a swap area.
 243 If one of these regions is mapped shared,
 244 then all processes see the changes in the region.
 245 This difference has important performance considerations;
 246 the overhead of reading, flushing, and possibly allocating a file
 247 is much higher than simply zeroing memory.
 248 .PP
 249 If several processes wish to share a region,
 250 then they must have some way of rendezvousing.
 251 For a mapped file this is easy;
 252 the name of the file is used as the rendezvous point.
 253 However, processes may not need the semantics of mapped files
 254 nor be willing to pay the overhead associated with them.
 255 For anonymous memory they must use some other rendezvous point.
 256 Our current interface allows processes to associate a
 257 descriptor with a region, which it may then pass to other
 258 processes that wish to attach to the region.
 259 Such a descriptor may be bound into the UNIX file system
 260 name space so that other processes can find it just as
 261 they would with a mapped file.
 262 .NH 2
 263 Shared memory as high speed interprocess communication
 264 .PP
 265 The primary use envisioned for shared memory is to
 266 provide a high speed interprocess communication (IPC) mechanism
 267 between cooperating processes.
 268 Existing IPC mechanisms (\fIi.e.\fP pipes, sockets, or streams)
 269 require a system call to hand off a set
 270 of data destined for another process, and another system call
 271 by the recipient process to receive the data.
 272 Even if the data can be transferred by remapping the data pages
 273 to avoid a memory to memory copy, the overhead of doing the system
 274 calls limits the throughput of all but the largest transfers.
 275 Shared memory, by contrast, allows processes to share data at any
 276 level of granularity without system intervention.
 277 .PP
 278 However, to maintain all but the simplest of data structures,
 279 the processes must serialize their modifications to shared
 280 data structures if they are to avoid corrupting them.
 281 This serialization is typically done with semaphores.
 282 Unfortunately, most implementations of semaphores are
 283 done with system calls.
 284 Thus processes are once again limited by the need to do two
 285 system calls per transaction, one to lock the semaphore, the
 286 second to release it.
 287 The net effect is that the shared memory model provides little if
 288 any improvement in interprocess bandwidth.
 289 .PP
 290 To achieve a significant improvement in interprocess bandwidth
 291 requires a large decrease in the number of system calls needed to
 292 achieve the interaction.
 293 In profiling applications that use
 294 serialization locks such as the UNIX kernel,
 295 one typically finds that most locks are not contested.
 296 Thus if one can find a way to avoid doing a system call in the case
 297 in which a lock is not contested,
 298 one would expect to be able to dramatically reduce the number
 299 of system calls needed to achieve serialization.
 300 .PP
 301 In our design, cooperating processes manage their semaphores
 302 in their own address space.
 303 In the typical case, a process executes an atomic test-and-set instruction
 304 to acquire a lock, finds it free, and thus is able to get it.
 305 Only in the (rare) case where the lock is already set does the process
 306 need to do a system call to wait for the lock to clear.
 307 When a process is finished with a lock,
 308 it can clear the lock itself.
 309 Only if the ``WANT'' flag for the lock has been set is
 310 it necessary for the process to do a system call to cause the other
 311 process(es) to be awakened.
 312 .PP
 313 Another issue that must be considered is portability.
 314 Some computers require access to special hardware to implement
 315 atomic interprocessor test-and-set.
 316 For such machines the setting and clearing of locks would
 317 all have to be done with system calls;
 318 applications could still use the same interface without change,
 319 though they would tend to run slowly.
 320 .PP
 321 The other issue of compatibility is with System V's semaphore
 322 implementation.
 323 Since the System V interface has been in existence for several years,
 324 and applications have been built that depend on this interface,
 325 it is important that this interface also be available.
 326 Although the interface is based on system calls for both setting and
 327 clearing locks,
 328 the same interface can be obtained using our interface without
 329 system calls in most cases.
 330 .PP
 331 This implementation can be achieved as follows.
 332 System V allows entire sets of semaphores to be set concurrently.
 333 If any of the locks are unavailable, the process is put to sleep
 334 until they all become available.
 335 Under our paradigm, a single additional semaphore is defined
 336 that serializes access to the set of semaphores being simulated.
 337 Once obtained in the usual way, the set of semaphores can be
 338 inspected to see if the desired ones are available.
 339 If they are available, they are set, the guardian semaphore
 340 is released and the process proceeds.
 341 If one or more of the requested set is not available,
 342 the guardian semaphore is released and the process selects an
 343 unavailable semaphores for which to wait.
 344 On being reawakened, the whole selection process must be repeated.
 345 .PP
 346 In all the above examples, there appears to be a race condition.
 347 Between the time that the process finds that a semaphore is locked,
 348 and the time that it manages to call the system to sleep on the
 349 semaphore another process may unlock the semaphore and issue a wakeup call.
 350 Luckily the race can be avoided.
 351 The insight that is critical is that the process and the kernel agree
 352 on the physical byte of memory that is being used for the semaphore.
 353 The system call to put a process to sleep takes a pointer
 354 to the desired semaphore as its argument so that once inside
 355 the kernel, the kernel can repeat the test-and-set.
 356 If the lock has cleared
 357 (and possibly the wakeup issued) between the time that the process
 358 did the test-and-set and eventually got into the sleep request system call,
 359 then the kernel immediately resumes the process rather than putting
 360 it to sleep.
 361 Thus the only problem to solve is how the kernel interlocks between testing
 362 a semaphore and going to sleep;
 363 this problem has already been solved on existing systems.
 364 .NH
 365 References
 366 .sp
 367 .IP [Babaoglu79] 20
 368 Babaoglu, O., and Joy, W.,
 369 ``Data Structures Added in the Berkeley Virtual Memory Extensions
 370 to the UNIX Operating System''
 371 Computer Systems Research Group, Dept of EECS, University of California,
 372 Berkeley, CA 94720, USA, November 1979.
 373 .IP [Someren84] 20
 374 Someren, J. van,
 375 ``Paging in Berkeley UNIX'',
 376 Laboratorium voor schakeltechniek en techneik v.d.
 377 informatieverwerkende machines,
 378 Codenummer 051560-44(1984)01, February 1984.