contrib/subversion/subversion/libsvn_fs_base/notes/structure

   1 Subversion on Berkeley DB                                    -*- text -*-
   2
   3 There are many different ways to implement the Subversion filesystem
   4 interface.  You could implement it directly using ordinary POSIX
   5 filesystem operations; you could build it using an SQL server as a
   6 back end; you could build it on RCS; and so on.
   7
   8 This implementation of the Subversion filesystem interface is built on
   9 top of Berkeley DB (http://www.sleepycat.com).  Berkeley DB supports
  10 transactions and recoverability, making it well-suited for Subversion.
  11
  12
  13 \f
  14 Nodes and Node Revisions
  15
  16 In a Subversion filesystem, a `node' corresponds roughly to an
  17 `inode' in a Unix filesystem:
  18
  19    * A node is either a file or a directory.
  20
  21    * A node's contents change over time.
  22
  23    * When you change a node's contents, it's still the same node; it's
  24      just been changed.  So a node's identity isn't bound to a specific
  25      set of contents.
  26
  27    * If you rename a node, it's still the same node, just under a
  28      different name.  So a node's identity isn't bound to a particular
  29      filename.
  30
  31 A `node revision' refers to a node's contents at a specific point in
  32 time.  Changing a node's contents always creates a new revision of that
  33 node.  Once created, a node revision's contents never change.
  34
  35 When we create a node, its initial contents are the initial revision of
  36 the node.  As users make changes to the node over time, we create new
  37 revisions of that same node.  When a user commits a change that deletes
  38 a file from the filesystem, we don't delete the node, or any revision
  39 of it --- those stick around to allow us to recreate prior revisions of
  40 the filesystem.  Instead, we just remove the reference to the node
  41 from the directory.
  42
  43
  44 \f
  45 ID's
  46
  47 Within the database, we refer to nodes and node revisions using a
  48 string of three unique identifiers (the "node ID", the "copy ID", and
  49 the "txn ID"), separated by periods.
  50
  51     node_revision_id ::= node_id '.' copy_id '.' txn_id
  52
  53 The node ID is unique to a particular node in the filesystem across
  54 all of revision history.  That is, two node revisions who share
  55 revision history (perhaps because they are different revisions of the
  56 same node, or because one is a copy of the other, e.g.) have the same
  57 node ID, whereas two node revisions who have no common revision
  58 history will not have the same node ID.
  59
  60 The copy ID is a key into the `copies' table (see `Copies' below), and
  61 identifies that a given node revision, or one of its ancestors,
  62 resulted from a unique filesystem copy operation.
  63
  64 The txn ID is just an identifier that is unique to a single filesystem
  65 commit.  All node revisions created as part of a commit share this txn
  66 ID (which, incidentally, gets its name from the fact that this id is
  67 the same id used as the primary key of Subversion transactions; see
  68 `Transactions' below).
  69
  70 A directory entry identifies the file or subdirectory it refers to
  71 using a node revision ID --- not a node ID.  This means that a change
  72 to a file far down in a directory hierarchy requires the parent
  73 directory of the changed node to be updated, to hold the new node
  74 revision ID.  Now, since that parent directory has changed, its parent
  75 needs to be updated, and so on to the root.  We call this process
  76 "bubble-up".
  77
  78 If a particular subtree was unaffected by a given commit, the node
  79 revision ID that appears in its parent will be unchanged.  When
  80 doing an update, we can notice this, and ignore that entire
  81 subtree.  This makes it efficient to find localized changes in
  82 large trees.
  83
  84
  85 \f
  86 A Word About Keys
  87
  88 Some of the Subversion database tables use base-36 numbers as their
  89 keys.  Some debate exists about whether the use of base-36 (as opposed
  90 to, say, regular decimal values) is either necessary or good.  It is
  91 outside the scope of this document to make a claim for or against this
  92 usage.  As such, the reader will please note that for the majority of
  93 the document, the use of the term "number" when referring to keys of
  94 database tables should be interpreted to mean "a monotonically
  95 increasing unique key whose order with respect to other keys in the
  96 table is irrelevant".  :-)
  97
  98 To determine the actual type currently in use for the keys of a given
  99 table, you are invited to check out the "Appendix: Filesystem
 100 structure summary" section of this document.
 101
 102
 103 \f
 104 NODE-REVISION: how we represent a node revision
 105
 106 We represent a given revision of a file or directory node using a list
 107 skel (see include/private/svn_skel.h for an explanation of skels).
 108 A node revision skel has the form:
 109
 110     (HEADER PROP-KEY KIND-SPECIFIC ...)
 111
 112 where HEADER is a header skel, whose structure is common to all nodes,
 113 PROP-KEY is the key of the representation that contains this node's
 114 properties list, and the KIND-SPECIFIC elements carry data dependent
 115 on what kind of node this is --- file, directory, etc.
 116
 117 HEADER has the form:
 118
 119     (KIND CREATED-PATH [PRED-ID [PRED-COUNT [HAS-MERGEINFO MERGEINFO-COUNT]]])
 120
 121 where:
 122
 123    * KIND indicates what sort of node this is.  It must be one of the
 124      following:
 125        - "file", indicating that the node is a file (see FILE below).
 126        - "dir", indicating that the node is a directory (see DIR below).
 127
 128    * CREATED-PATH is the canonicalized absolute filesystem path at
 129      which this node was created.
 130
 131    * PRED-ID, if present, indicates the node revision which is the
 132      immediate ancestor of this node.
 133
 134    * PRED-COUNT, if present, indicates the number of predecessors the
 135      node revision has (recursively).
 136
 137    * HAS-MERGEINFO and MERGEINFO-COUNT, if present, indicate ...
 138      ### TODO
 139
 140 Note that a node cannot change its kind from one revision to the next.
 141 A directory node is always a directory; a file node is always a file;
 142 etc.  The fact that the node's kind is stored in each node revision,
 143 rather than in some revision-independent place, might suggest that
 144 it's possible for a node to change kinds from revision to revision, but
 145 Subversion does not allow this.
 146
 147 PROP-KEY is a key into the `representations' table (see REPRESENTATIONS
 148 below), whose value is a representation pointing to a string
 149 (see `strings' table) that is a PROPLIST skel.
 150
 151 The KIND-SPECIFIC portions are discussed below.
 152
 153
 154 \f
 155 PROPLIST: a property list is a list skel of the form:
 156
 157     (NAME1 VALUE1 NAME2 VALUE2 ...)
 158
 159 where each NAMEi is the name of a property, and VALUEi is the value of
 160 the property named NAMEi.  Every valid property list has an even
 161 number of elements.
 162
 163
 164 \f
 165 FILE: how files are represented.
 166
 167 If a NODE-REVISION's header's KIND is "file", then the node-revision
 168 skel represents a file, and has the form:
 169
 170     (HEADER PROP-KEY DATA-INFO [EDIT-DATA-KEY])
 171
 172 where
 173
 174     DATA-INFO ::= DATA-KEY | (DATA-KEY DATA-KEY-UNIQID)
 175
 176 and DATA-KEY identifies the representation for the file's current
 177 contents, and EDIT-DATA-KEY identifies the representation currently
 178 available for receiving new contents for the file.
 179
 180 DATA-KEY-UNIQID ...
 181 ### TODO
 182
 183 See discussion of representations later.
 184
 185
 186 \f
 187 DIR: how directories are represented.
 188
 189 If the header's KIND is "dir", then the node-revision skel
 190 represents a directory, and has the form:
 191
 192     (HEADER PROP-KEY ENTRIES-KEY)
 193
 194 where ENTRIES-KEY identifies the representation for the directory's
 195 entries list (see discussion of representations later).  An entries
 196 list has the form
 197
 198     (ENTRY ...)
 199
 200 where each entry is
 201
 202     (NAME ID)
 203
 204 where:
 205
 206    * NAME is the name of the directory entry, in UTF-8, and
 207
 208    * ID is the ID of the node revision to which this entry refers
 209
 210
 211 \f
 212 REPRESENTATIONS: where and how Subversion stores your data.
 213
 214 Some parts of a node revision are essentially constant-length: for
 215 example, the KIND field and the REV.  Other parts can have
 216 arbitrarily varying length: property lists, file contents, and
 217 directory entry lists.  This variable-length data is often similar
 218 from one revision to the next, so Subversion stores just the deltas
 219 between them, instead of successive fulltexts.
 220
 221 The HEADER portion of a node revision holds the constant-length stuff,
 222 which is never deltified.  The rest of a node revision just points to
 223 data stored outside the node revision proper.  This design makes the
 224 repository code easier to maintain, because deltification and
 225 undeltification are confined to a layer separate from node revisions,
 226 and makes the code more efficient, because Subversion can retrieve
 227 just the parts of a node it needs for a given operation.
 228
 229 Deltifiable data is stored in the `strings' table, as mediated by the
 230 `representations' table.  Here's how it works:
 231
 232 The `strings' table stores only raw bytes.  A given string could be
 233 any one of these:
 234
 235    - a file's contents
 236    - a delta that reconstructs file contents, or part of a file's contents
 237    - a directory entry list skel
 238    - a delta that reconstructs a dir entry list skel, or part of same
 239    - a property list skel
 240    - a delta that reconstructs a property list skel, or part of same
 241
 242 There is no way to tell, from looking at a string, what kind of data
 243 it is.  A directory entry list skel is indistinguishable from file
 244 contents that just happen to look exactly like the unparsed form of a
 245 directory entry list skel.  File contents that just happen to look
 246 like svndiff data are indistinguishable from delta data.
 247
 248 The code is able to interpret a given string because Subversion
 249
 250    a) knows whether to be looking for a property list or some
 251       kind-specific data,
 252
 253    b) knows the `kind' of the node revision in question,
 254
 255    c) always goes through the `representations' table to discover if
 256       any undeltification or other transformation is needed.
 257
 258 The `representations' table is an intermediary between node revisions
 259 and strings.  Node revisions never refer directly into the `strings'
 260 table; instead, they always refer into the `representations' table,
 261 which knows whether a given string is a fulltext or a delta, and if it
 262 is a delta, what it is a delta against.  That, combined with the
 263 knowledge in (a) and (b) above, allows Subversion to retrieve the data
 264 and parse it appropriately.  A representation has the form:
 265
 266    (HEADER KIND-SPECIFIC)
 267
 268 where HEADER is
 269
 270    (KIND TXN [MD5 [SHA1]])
 271
 272 The KIND is "fulltext" or "delta".  TXN is the txn ID for the txn in
 273 which this representation was created.  MD5 is a checksum of the
 274 representation's contents, that is, what the representation produces,
 275 regardless of whether it is stored deltified or as fulltext.  (For
 276 compatibility with older versions of Subversion, MD5 may be
 277 absent, in which case the filesystem behaves as though the checksum is
 278 there and is correct.) An additional kind of checksum, SHA1, is present
 279 in newer formats, starting with version ...
 280 ### TODO
 281
 282 The TXN also serves as a kind of mutability flag: if txn T tries to
 283 change a representation's contents, but the rep's TXN is not T, then
 284 something has gone horribly wrong and T should leave the rep alone
 285 (and probably error).  Of course, "change a representation" here means
 286 changing what the rep's consumer sees.  Switching a representation's
 287 storage strategy, for example from fulltext to deltified, wouldn't
 288 count as a change, since that wouldn't affect what the rep produces.
 289
 290 KIND-SPECIFIC varies considerably depending on the kind of
 291 representation.  Here are the two forms currently recognized:
 292
 293    (("fulltext" TXN [MD5 [SHA1]]) STRING-KEY)
 294        The data is at STRING-KEY in the `strings' table.
 295
 296    (("delta" TXN [MD5 [SHA1]]) (OFFSET WINDOW) ...)
 297        Each OFFSET indicates the point in the fulltext that this
 298        element reconstructs, and WINDOW says how to reconstruct it:
 299
 300        WINDOW ::= (DIFF SIZE REP-KEY [REP-OFFSET]) ;
 301        DIFF   ::= ("svndiff" VERSION STRING-KEY)
 302
 303        Notice that a WINDOW holds only metadata.  REP-KEY says what
 304        the window should be applied against, or none if this is a
 305        self-compressed delta; SIZE says how much data this window
 306        reconstructs; VERSION says what version of the svndiff format
 307        is being used (currently only version 0 is supported); and
 308        STRING-KEY says which string contains the actual svndiff data
 309        (there is no diff data held directly in the representations
 310        table, of course).
 311
 312        Note also that REP-KEY might refer to a representation that
 313        itself requires undeltification.  We use a delta combiner to
 314        combine all the deltas needed to reproduce the fulltext from
 315        some stored plaintext.
 316
 317        Branko says this is what REP-OFFSET is for:
 318        > The offsets embedded in the svndiff are stored in a string;
 319        > these offsets would be in the representation. The point is that
 320        > you get all the information you need to select the appropriate
 321        > windows from the rep skel -- without touching a single
 322        > string. This means a bit more space used in the repository, but
 323        > lots less memory used on the server.
 324
 325        We'll see if it turns out to be necessary.
 326
 327 In the future, there may be other representations, for example
 328 indicating that the text is stored elsewhere in the database, or
 329 perhaps in an ordinary Unix file.
 330
 331 Let's work through an example node revision:
 332
 333    (("file" REV COUNT) PROP-KEY "2345")
 334
 335 The entry for key "2345" in `representations' is:
 336
 337    (("delta" TXN CHECKSUM) (0 (("svndiff" 0 "1729") 65 "2343")))
 338
 339 and the entry for key "2343" in `representations' is:
 340
 341    (("fulltext" TXN CHECKSUM) "1001")
 342
 343 while the entry for key "1729" in `strings' is:
 344
 345    <some unprintable glob of svndiff data>
 346
 347 which, when applied to the fulltext at key "1001" in strings, results
 348 in this new fulltext:
 349
 350    "((some text) (that looks) (deceptively like) (directory entries))"
 351
 352 Et voila!  Subversion knew enough, via the `representations' and
 353 `strings' tables, to undeltify and get that fulltext; and knew enough,
 354 because of the node revision's "file" type, to interpret the result as
 355 file contents, not as a directory entry list.
 356
 357 (Note that the `strings' table stores multiple DB values per key.
 358 That is, although it's accurate to say there is one string per key,
 359 the string may be divided into multiple consecutive blocks, all
 360 sharing that key.  You use a Berkeley DB cursor to find the desired
 361 value[s], when retrieving a particular offset+len in a string.)
 362
 363 Representations know nothing about ancestry -- the `representations'
 364 table never refers to node revision id's, only to strings or to other
 365 representations.  In other words, while the `nodes' table allows
 366 recovery of ancestry information, the `representations' and `strings'
 367 tables together handle deltification and undeltification
 368 *independently* of ancestry.  At present, Subversion generally stores
 369 the youngest strings in "fulltext" form, and older strings as "delta"s
 370 against them (unless the delta would save no space compared to the
 371 fulltext).  However, there's nothing magic about that particular
 372 arrangement.  Other interesting alternatives:
 373
 374    * We could store the N most recently accessed strings as fulltexts,
 375      letting access patterns determine the most appropriate
 376      representation for each revision.
 377
 378    * We could occasionally store deltas against the N'th younger
 379      revision, storing larger jumps with a frequency inverse to the
 380      distance covered, yielding a tree-structured history.
 381
 382 Since the filesystem interface doesn't expose these details, we can
 383 change the representation pretty much as we please to optimize
 384 whatever parameter we care about --- storage size, speed, robustness,
 385 etc.
 386
 387 Representations never share strings - every string is referred to by
 388 exactly one representation.  This is so that when we change a
 389 representation to a different form (e.g. during deltification), we can
 390 delete the strings containing the old form, and know that we're not
 391 messing up any other reps by doing so.
 392
 393
 394 Further Notes On Deltifying:
 395 ----------------------------
 396
 397 When a representation is deltified, it is changed in place.
 398 New strings are created containing the new delta, the representation
 399 is changed to refer to the new strings, and the original (usually
 400 fulltext) string or strings are deleted.
 401
 402 The node revisions referring to that representation will not be
 403 changed; instead, the same rep key will now be associated with
 404 different value.  That way, we get reader locking for free: if someone
 405 is reading a file while Subversion is deltifying that file, one of the
 406 two sides will get a DB_DEADLOCK and svn_fs__retry_txn() will retry.
 407
 408 ### todo: add a note about cycle-checking here, too.
 409
 410
 411 \f
 412 The Berkeley DB "nodes" table
 413
 414 The database contains a table called "nodes", which is a btree indexed
 415 by node revision ID's, mapping them onto REPRESENTATION skels.  Node 0
 416 is always the root directory, and node revision ID 0.0.0 is always the
 417 empty directory.  We use the value of the key 'next-key' to indicate
 418 the next unused node ID.
 419
 420 Assuming that we store the most recent revision on every branch as
 421 fulltext, and all other revisions as deltas, we can retrieve any node
 422 revision by searching for the last revision of the node, and then
 423 walking backwards to specific revision we desire, applying deltas as
 424 we go.
 425
 426
 427 \f
 428 REVISION: filesystem revisions, and the Berkeley DB "revisions" table
 429
 430 We represent a filesystem revision using a skel of the form:
 431
 432     ("revision" TXN)
 433
 434 where TXN is the key into the `transactions' table (see 'Transactions' below)
 435 whose value is the transaction that was committed to create this revision.
 436
 437 The database contains a table called "revisions", which is a
 438 record-number table mapping revision numbers onto REVISION skels.
 439 Since Berkeley DB record numbers start with 1, whereas Subversion
 440 filesystem revision numbers start at zero, revision V is stored as
 441 record number V+1 in the `revisions' table.  Filesystem revision zero
 442 always has node revision 0.0.0 as its root directory; that node
 443 revision is guaranteed to be an empty directory.
 444
 445
 446 \f
 447 Transactions
 448
 449 Every transaction ends when it is either successfully committed, or
 450 aborted.  We call a transaction which has been either committed or
 451 aborted "finished", and one which hasn't "unfinished".
 452
 453 Transactions are identified by unique numbers, called transaction
 454 ID's.  Currently, transaction ID's are never reused, though this is
 455 not mandated by the schema.  In the database, we always represent a
 456 transaction ID in its shortest ASCII form.
 457
 458 The Berkeley DB `transactions' table records both unfinished and
 459 committed transactions.  Every key in this table is a transaction ID.
 460 Unfinished transactions have values that are skels of one of the
 461 following forms:
 462
 463    ("transaction" ROOT-ID BASE-ID PROPLIST COPIES)
 464    ("dead" ROOT-ID BASE-ID PROPLIST COPIES)
 465
 466 where:
 467
 468    * ROOT-ID is the node revision ID of the transaction's root
 469      directory.  This is of the form 0.0.THIS-TXN-ID.
 470
 471    * BASE-ID is the node revision ID of the root of the transaction's
 472      base revision.  This is of the form 0.0.BASE-TXN-ID - the base
 473      transaction is, of course, the transaction of the base revision.
 474
 475    * PROPLIST is a skel giving the revision properties for the
 476      transaction.
 477
 478    * COPIES contains a list of keys into the `copies' table,
 479      referencing all the filesystem copies created inside of this
 480      transaction.  If the transaction is aborted, these copies get
 481      removed from the `copies' table.
 482
 483    * A "dead" transaction is one that has been requested to be
 484      destroyed, and should never, ever, be committed.
 485
 486 Committed transaction, however, have values that are skels of the form:
 487
 488    ("committed" ROOT-ID REV PROPLIST COPIES)
 489
 490 where:
 491
 492    * ROOT-ID is the node revision ID of the committed transaction's (or
 493      revision's) root node.
 494
 495    * REV represents the revision that was created when the
 496      transaction was committed.
 497
 498    * PROPLIST is a skel giving the revision properties for the
 499      committed transaction.
 500
 501    * COPIES contains a list of keys into the `copies' table,
 502      referencing all the filesystem copies created by this committed
 503      transaction.  Nothing currently uses this information for
 504      committed transactions, but it could be useful in the future.
 505
 506 As the sole exception to the rule above, the `transactions' table
 507 always has one entry whose key is `next-key', and whose value is the
 508 lowest transaction ID that has never yet been used.  We use this entry
 509 to allocate ID's for new transactions.
 510
 511 The `transactions' table is a btree, with no particular sort order.
 512
 513
 514 \f
 515 Changes
 516
 517 As modifications are made (files and dirs added or removed, text and
 518 properties changed, etc.) on Subversion transaction trees, the
 519 filesystem tracks the basic change made in the Berkeley DB `changes'
 520 table.
 521
 522 The `changes' table is a btree with Berkeley's "duplicate keys"
 523 functionality (and with no particular sort order), and maps the
 524 one-to-many relationship of a transaction ID to a "change" item.
 525 Change items are skels of the form:
 526
 527    ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD)
 528
 529 where:
 530
 531    * PATH is the path that was operated on to enact this change.
 532
 533    * ID is the node revision ID of the node changed.  The precise
 534      meaning varies based on the kind of the change:
 535       - "add" or "modify": a new node revision created in the current
 536         txn.
 537       - "delete": a node revision from a previous txn.
 538       - "replace": a replace operation actually acts on two node
 539         revisions, one being deleted, one being added.  Only the added
 540         node-revision ID is recorded in the `changes' table - this is
 541         a design flaw.
 542       - "reset": no node revision applies.  A zero atom is used as a
 543         placeholder.
 544
 545    * CHANGE-KIND is one of the following:
 546
 547      - "add"     : PATH/ID was added to the filesystem.
 548      - "delete"  : PATH/ID was removed from the filesystem.
 549      - "replace" : PATH/ID was removed, then re-added to the filesystem.
 550      - "modify"  : PATH/ID was otherwise modified.
 551      - "reset"   : Ignore any previous changes for PATH/ID in this txn.
 552                    This kind is no longer created by Subversion 1.3.0
 553                    and later, and can probably be removed at the next
 554                    schema bump.
 555
 556    * TEXT-MOD is a bit specifying whether or not the contents of
 557      this node was modified.
 558
 559    * PROP-MOD is a bit specifying whether or not the properties of
 560      this node where modified.
 561
 562 In order to fully describe the changes made to any given path as part
 563 of a single transaction, one must read all the change items associated
 564 with the transaction's ID, and "collapse" multiple entries that refer
 565 to that path.
 566
 567
 568 \f
 569 Copies
 570
 571 Each time a filesystem copy operation is performed, Subversion records
 572 meta-data about that copy.
 573
 574 Copies are identified by unique numbers called copy ID's.  Currently,
 575 copy ID's are never reused, though this is not mandated by the schema.
 576 In the database, we always represent a copy ID in its shortest ASCII
 577 form.
 578
 579 The Berkeley DB `copies' table records all filesystem copies.  Every
 580 key in this table is copy ID, and every value is a skel of one of the
 581 following forms:
 582
 583    ("copy" SRC-PATH SRC-TXN DST-NODE-ID)
 584    ("soft-copy" SRC-PATH SRC-TXN DST-NODE-ID)
 585
 586 where:
 587
 588    * "copy" indicates an explicitly requested copy, and "soft-copy"
 589      indicates a node that was cloned internally as part of an
 590      explicitly requested copy of some parent directory. See the
 591      section "Copies and Copy IDs" in the file <fs-history> for
 592      details.
 593
 594    * SRC-PATH and SRC-TXN are the canonicalized absolute path and
 595      transaction ID, respectively, of the source of the copy.
 596
 597    * DST-NODE-ID represents the new node revision created as a result
 598      of the copy.
 599
 600 As the sole exception to the rule above, the `copies' table always has
 601 one entry whose key is `next-key', and whose value is the lowest copy ID
 602 that has never yet been used.  We use this entry to allocate new
 603 copy ID's.
 604
 605 The `copies' table is a btree, with no particular sort order.
 606
 607
 608 \f
 609 Locks
 610
 611 When a caller locks a file -- reserving an exclusive right to modify
 612 or delete it -- an lock object is created in a `locks' table.
 613
 614 The `locks' table is a btree whose key is a UUID string known as
 615 a "lock-token", and whose value is a skel representing a lock.  The
 616 fields in the skel mirror those of an svn_lock__t (see svn_types.h):
 617
 618     ("lock" PATH TOKEN OWNER COMMENT XML-P CREATION-DATE EXPIRATION-DATE)
 619
 620 where:
 621
 622    * PATH is the absolute filesystem path reserved by the lock.
 623
 624    * TOKEN is the universally unique identifier of the lock, known
 625      as the lock-token.  This is the same as the row's key.
 626
 627    * OWNER is the authenticated username that "owns" the lock.
 628
 629    * COMMENT is a string describing the lock.  It may be empty, or it
 630      might describe the rationale for locking.
 631
 632    * XML-P is a boolean (either 0 or 1) indicating whether the COMMENT
 633      field is wrapped in an XML tag.  (This is something only used by
 634      the DAV layer, for webdav interoperabliity.)
 635
 636    * CREATION-DATE is a string representation of the date/time when
 637      the lock was created.  (see svn_time_to_cstring())
 638
 639    * EXPIRATION-DATE is a string representation of the date/time when
 640      the lock will cease to be valid.  (see svn_time_to_cstring())
 641
 642 In addition to creating a lock in the `locks' table, a new row is
 643 created in a `lock-tokens' table.  The `lock-tokens' table is a btree
 644 whose key is an absolute path in the filesystem.  The value of each
 645 key is a lock-token (which is a key into the `locks' table.)
 646
 647 To test if a path is locked, simply check if the path is a key in the
 648 `lock-tokens' table.  To see if a certain directory has any locked
 649 children below, we ask BerkeleyDB to do a "greater or equal match" on
 650 the directory path, and see if any results come back.  If they do,
 651 then at least one of the directory's children is locked, and thus the
 652 directory cannot be deleted without further investigation.
 653
 654 Locks are ephemeral things, not historied in any way.  They are
 655 potentially created and deleted quite often.  When a lock is
 656 destroyed, the appropriate row is removed from the `locks' table.
 657 Additionally, the locked-path is removed from the `lock-tokens' table.
 658
 659
 660
 661
 662 \f
 663 Merge rules
 664
 665 The Subversion filesystem must provide the following characteristics:
 666
 667 - clients can submit arbitrary rearrangements of the tree, to be
 668   performed as atomic changes to the filesystem tree
 669 - multiple clients can submit non-overlapping changes at the same time,
 670   without blocking
 671 - readers must never block other readers or writers
 672 - writers must never block readers
 673 - writers may block writers
 674
 675 Merging rules:
 676
 677    The general principle: a series of changes can be merged iff the
 678    final outcome is independent of the order you apply them in.
 679
 680 Merging algorithm:
 681
 682    For each entry NAME in the directory ANCESTOR:
 683
 684      Let ANCESTOR-ENTRY, SOURCE-ENTRY, and TARGET-ENTRY be the IDs of
 685      the name within ANCESTOR, SOURCE, and TARGET respectively.
 686      (Possibly null if NAME does not exist in SOURCE or TARGET.)
 687
 688      If ANCESTOR-ENTRY == SOURCE-ENTRY, then:
 689        No changes were made to this entry while the transaction was in
 690        progress, so do nothing to the target.
 691
 692      Else if ANCESTOR-ENTRY == TARGET-ENTRY, then:
 693        A change was made to this entry while the transaction was in
 694        process, but the transaction did not touch this entry.  Replace
 695        TARGET-ENTRY with SOURCE-ENTRY.
 696
 697      Else:
 698        Changes were made to this entry both within the transaction and
 699        to the repository while the transaction was in progress.  They
 700        must be merged or declared to be in conflict.
 701
 702        If SOURCE-ENTRY and TARGET-ENTRY are both null, that's a
 703        double delete; if one of them is null, that's a delete versus
 704        a modification. In any of these cases, flag a conflict.
 705
 706        If any of the three entries is of type file, declare a conflict.
 707
 708        If either SOURCE-ENTRY or TARGET-ENTRY is not a direct
 709        modification of ANCESTOR-ENTRY (determine by comparing the
 710        node-id fields), declare a conflict.  A replacement is
 711        incompatible with a modification or other replacement--even
 712        an identical replacement.
 713
 714        Direct modifications were made to the directory ANCESTOR-ENTRY
 715        in both SOURCE and TARGET.  Recursively merge these
 716        modifications.
 717
 718    For each leftover entry NAME in the directory SOURCE:
 719
 720      If NAME exists in TARGET, declare a conflict.  Even if SOURCE and
 721      TARGET are adding exactly the same thing, two additions are not
 722      auto-mergeable with each other.
 723
 724      Add NAME to TARGET with the entry from SOURCE.
 725
 726    Now that we are done merging the changes from SOURCE into the
 727    directory TARGET, update TARGET's predecessor to be SOURCE.
 728
 729 The following algorithm was used when the Subversion filesystem was
 730 initially written, but has been replaced with the simpler and more
 731 performant algorithm above:
 732
 733    Merging two nodes, A and B, with respect to a common ancestor
 734    ANCESTOR:
 735
 736    - First, the merge fails unless A, B, and ANCESTOR are all the same
 737      kind of node.
 738    - If A and B are text files:
 739      - If A is an ancestor of B, then B is the merged result.
 740      - If A is identical to B, then B (arbitrarily) is the merged
 741        result.
 742      - Otherwise, the merge fails.
 743    - If A and B are both directories:
 744      - For every directory entry E in either A, B, or ANCESTOR, here
 745        are the cases:
 746          - E exists in neither ANCESTOR nor A.
 747          - E doesn't exist in ANCESTOR, and has been added to A.
 748          - E exists in ANCESTOR, but has been deleted from A.
 749          - E exists in both ANCESTOR and A ...
 750            - but refers to different nodes.
 751            - but refers to different revisions of the same node.
 752            - and refers to the same node revision.
 753
 754        The same set of possible relationships with ANCESTOR holds for B,
 755        so there are thirty-six combinations.  The matrix is symmetrical
 756        with A and B reversed, so we only have to describe one triangular
 757        half, including the diagonal --- 21 combinations.
 758
 759        - (6) E exists in neither ANCESTOR nor A:
 760          - (1) E exists in neither ANCESTOR nor B.  Can't occur, by
 761            assumption that E exists in either A, B, or ancestor.
 762          - (1) E has been added to B.  Add E in the merged result. ***
 763          - (1) E has been deleted from B.  Can't occur, by assumption
 764            that E doesn't exist in ANCESTOR.
 765          - (3) E exists in both ANCESTOR and B.  Can't occur, by
 766            assumption that E doesn't exist in ancestor.
 767        - (5) E doesn't exist in ANCESTOR, and has been added to A.
 768          - (1) E doesn't exist in ANCESTOR, and has been added to B.
 769            Conflict.
 770          - (1) E exists in ANCESTOR, but has been deleted from B.
 771            Can't occur, by assumption that E doesn't exist in
 772            ANCESTOR.
 773          - (3) E exists in both ANCESTOR and B.  Can't occur, by
 774            assumption that E doesn't exist in ANCESTOR.
 775        - (4) E exists in ANCESTOR, but has been deleted from A.
 776          - (1) E exists in ANCESTOR, but has been deleted from B.  If
 777            neither delete was a result of a rename, then omit E from
 778            the merged tree.  *** Otherwise, conflict.
 779          - E exists in both ANCESTOR and B ...
 780            - (1) but refers to different nodes.  Conflict.
 781            - (1) but refers to different revisions of the same node.
 782              Conflict.
 783            - (1) and refers to the same node revision.  Omit E from
 784              the merged tree. ***
 785        - (3) E exists in both ANCESTOR and A, but refers to different
 786          nodes.
 787          - (1) E exists in both ANCESTOR and B, but refers to
 788            different nodes.  Conflict.
 789          - (1) E exists in both ANCESTOR and B, but refers to
 790            different revisions of the same node.  Conflict.
 791          - (1) E exists in both ANCESTOR and B, and refers to the same
 792            node revision.  Replace E with A's node revision.  ***
 793        - (2) E exists in both ANCESTOR and A, but refers to different
 794          revisions of the same node.
 795          - (1) E exists in both ANCESTOR and B, but refers to
 796            different revisions of the same node.  Try to merge A/E and
 797            B/E, recursively.  ***
 798          - (1) E exists in both ANCESTOR and B, and refers to the same
 799            node revision.  Replace E with A's node revision.  ***
 800        - (1) E exists in both ANCESTOR and A, and refers to the same
 801          node revision.
 802          - (1) E exists in both ANCESTOR and B, and refers to the same
 803            node revision.  Nothing has happened to ANCESTOR/E, so no
 804            change is necessary.
 805
 806    *** == something actually happens
 807
 808 \f
 809 Non-Historical Properties
 810
 811 [[Yes, do tell.]]
 812
 813 \f
 814 UUIDs: Universally Unique Identifiers
 815
 816 Every filesystem has a UUID.  This is represented as record #1 in the
 817 `uuids' table.
 818
 819 \f
 820 Layers
 821
 822 In previous structurings of the code, I had trouble keeping track of
 823 exactly who has implemented which promises, based on which other
 824 promises from whom.
 825
 826 I hope the arrangement below will help me keep things straight, and
 827 make the code more reliable.  The files are arranged in order from
 828 low-level to high-level: each file depends only on services provided
 829 by the files before it.
 830
 831 skel.c, id.c, dbt.c, convert-size.c
 832
 833                 Low-level utility functions.
 834
 835 fs_skels.c      Routines for marshaling between skels and native FS types.
 836
 837 fs.c            Creating and destroying filesystem objects.
 838
 839 err.c           Error handling.
 840
 841 nodes-table.c, txn-table.c, rev-table.c, reps-table.c, strings-table.c
 842
 843                 Create and open particular database tables.
 844                 Responsible for intra-record consistency.
 845
 846 node-rev.c      Creating, reading, and writing node revisions.
 847                 Responsible for deciding what gets deltified when.
 848
 849 reps-strings.c
 850                 Retrieval and storage of represented strings.
 851                 This will handle delta-based storage,
 852
 853 dag.c           Operations on the DAG filesystem.  "DAG" because the
 854                 interface exposes the filesystem's sharing structure.
 855                 Enforce inter-record consistency.
 856
 857 tree.c          Operations on the tree filesystem.  This layer is
 858                 built on top of dag.c, but transparently distinguishes
 859                 virtual copies, making the underlying DAG look like a
 860                 real tree.  This makes incomplete transactions behave
 861                 like ordinary mutable filesystems.
 862
 863 delta.c         Computing deltas.
 864
 865
 866 \f
 867 Appendix: Filesystem structure summary
 868 ======================================
 869
 870 Berkeley DB tables
 871 ------------------
 872
 873                 "nodes" : btree(ID -> NODE-REVISION, "next-key" -> NODE-ID)
 874             "revisions" : recno(REVISION)
 875          "transactions" : btree(TXN -> TRANSACTION, "next-key" -> TXN)
 876               "changes" : btree(TXN -> CHANGE)
 877                "copies" : btree(CPY -> COPY, "next-key" -> CPY)
 878               "strings" : btree(STR -> STRING, "next-key" -> STR)
 879       "representations" : btree(REP -> REPRESENTATION, "next-key" -> REP)
 880                 "uuids" : recno(UUID)
 881                 "locks" : btree(TOKEN -> LOCK)
 882           "lock-tokens" : btree(PATH -> TOKEN)
 883          "node-origins" : btree(NODE-ID -> ID)
 884         "checksum-reps" : btree(SHA1SUM -> REP, "next-key" -> number-36)
 885         "miscellaneous" : btree(STRING -> STRING)
 886
 887 Syntactic elements
 888 ------------------
 889
 890 Table keys:
 891
 892                      ID ::= NODE-REV-ID ;
 893                     TXN ::= number-36 ;
 894                     CPY ::= number-36 ;
 895                     STR ::= number-36 ;
 896                     REP ::= number-36 ;
 897                   TOKEN ::= uuid ;
 898
 899 Property lists:
 900
 901                PROPLIST ::= (PROP ...) ;
 902                    PROP ::= atom atom ;
 903
 904
 905 Filesystem revisions:
 906
 907                REVISION ::= ("revision" TXN) ;
 908
 909
 910 Transactions:
 911
 912             TRANSACTION ::= UNFINISHED-TXN | COMMITTED-TXN | DEAD-TXN
 913          UNFINISHED-TXN ::= ("transaction" ROOT-ID BASE-ID PROPLIST COPIES) ;
 914           COMMITTED-TXN ::= ("committed" ROOT-ID REV PROPLIST COPIES) ;
 915                DEAD-TXN ::= ("dead" ROOT-ID BASE-ID PROPLIST COPIES) ;
 916                 ROOT-ID ::= NODE-REV-ID ;
 917                 BASE-ID ::= NODE-REV-ID ;
 918                  COPIES ::= (CPY ...) ;
 919                     REV ::= number ;
 920
 921
 922 Changes:
 923
 924                  CHANGE ::= ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD) ;
 925             CHANGE-KIND ::= "add" | "delete" | "replace" | "modify" | "reset";
 926                TEXT-MOD ::= atom ;
 927                PROP-MOD ::= atom ;
 928
 929
 930 Copies:
 931
 932                    COPY ::= REAL-COPY | SOFT-COPY
 933               REAL-COPY ::= ("copy" SRC-PATH SRC-TXN DST-NODE-ID)
 934               SOFT-COPY ::= ("soft-copy" SRC-PATH SRC-TXN DST-NODE-ID)
 935                SRC-PATH ::= atom ;
 936                 SRC-TXN ::= TXN ;
 937             DST-NODE-ID ::= NODE-REV-ID ;
 938
 939
 940 Entries lists:
 941
 942                 ENTRIES ::= (ENTRY ...) ;
 943                   ENTRY ::= (NAME ID) ;
 944                    NAME ::= atom ;
 945
 946
 947 Node revisions:
 948
 949           NODE-REVISION ::= FILE | DIR ;
 950                    FILE ::= (HEADER PROP-KEY DATA-INFO [EDIT-DATA-KEY]) ;
 951                     DIR ::= (HEADER PROP-KEY ENTRIES-KEY) ;
 952                  HEADER ::= (KIND CREATED-PATH
 953                              [PRED-ID [PRED-COUNT
 954                               [HAS-MERGEINFO MERGEINFO-COUNT]]]) ;
 955                    KIND ::= "file" | "dir" ;
 956                 PRED-ID ::= NODE-REV-ID | "";
 957              PRED-COUNT ::= number | "" ;
 958            CREATED-PATH ::= atom ;
 959                PROP-KEY ::= atom ;
 960               DATA-INFO ::= DATA-KEY | (DATA-KEY DATA-KEY-UNIQID)
 961                DATA-KEY ::= atom ;
 962         DATA-KEY-UNIQID ::= atom ;
 963           EDIT-DATA-KEY ::= atom ;
 964           HAS-MERGEINFO ::= "0" | "1" ;
 965         MERGEINFO-COUNT ::= number ;
 966
 967
 968 Representations:
 969
 970          REPRESENTATION ::= FULLTEXT | DELTA ;
 971                FULLTEXT ::= (HEADER STRING-KEY) ;
 972                   DELTA ::= (HEADER (OFFSET WINDOW) ...) ;
 973                  WINDOW ::= (DIFF SIZE REP-KEY [REP-OFFSET]) ;
 974                    DIFF ::= ("svndiff" VERSION STRING-KEY) ;
 975                 VERSION ::= number ;
 976                 REP-KEY ::= atom ;
 977              STRING-KEY ::= atom ;
 978                  OFFSET ::= number ;
 979              REP-OFFSET ::= number ;
 980
 981                  HEADER ::= (KIND TXN [MD5 [SHA1]]) ;
 982                    KIND ::= "fulltext" | "delta" ;
 983
 984                    SIZE ::= number ;
 985                     MD5 ::= ("md5" MD5SUM) ;
 986                    SHA1 ::= ("sha1" SHA1SUM) ;
 987                  MD5SUM ::= atom ;
 988                 SHA1SUM ::= atom ;
 989
 990
 991 Strings:
 992
 993                  STRING ::= RAWTEXT | LISTTEXT | DIFFTEXT
 994                 RAWTEXT ::= /{anything.class}*/ ;
 995                LISTTEXT ::= list ;
 996                DIFFTEXT ::= /{anything.class}*/ ;
 997
 998
 999 Node revision IDs:
1000
1001             NODE-REV-ID ::= NODE-ID '.' CPY '.' TXN ;
1002                 NODE-ID ::= number ;
1003
1004 UUIDs:
1005                    UUID ::= uuid ;
1006
1007
1008 Locks:
1009
1010                    LOCK ::= ("lock" PATH TOKEN OWNER
1011                                     COMMENT XML-P CR-DATE [X-DATE]);
1012                    PATH ::= atom ;
1013                   OWNER ::= atom ;
1014                 COMMENT ::= atom ;
1015                   XML-P ::= "0" | "1" ;
1016                 CR-DATE ::= atom ;
1017                  X-DATE ::= atom ;
1018
1019 Lock tokens:
1020
1021                (the value is just a lock-token, which is a uuid)
1022
1023
1024 Node origins:
1025
1026                 NODE-ID ::= NODE-REV-ID ;
1027
1028
1029 Lexical elements
1030 ----------------
1031
1032 UUIDs:
1033
1034                    uuid ::= hexits-32 '-' hexits-16 '-' hexits-16 '-'
1035                             hexits-16 '-' hexits-48 ;
1036
1037 Numbers:
1038
1039                  number ::= /{digit.class}+/ ;
1040               number-36 ::= /{base36.class}+/ ;
1041               hexits-32 ::= /{base16.class}{8}/ ;
1042               hexits-16 ::= /{base16.class}{4}/ ;
1043               hexits-48 ::= /{base16.class}{12}/ ;
1044
1045 (Note: the following are described in skel.h)
1046 Skels:
1047
1048                    skel ::= atom | list;
1049                    list ::= list.head list.body.opt list.tail ;
1050                    atom ::= atom.imp-len | atom.exp-len ;
1051
1052               list.head ::= '(' spaces.opt ;
1053               list.tail ::= spaces.opt ')' ;
1054           list.body.opt ::=  | list.body ;
1055               list.body ::= skel | list.body spaces.opt skel ;
1056
1057            atom.imp-len ::= /{name.class}[^\(\){ws.class}]*/ ;
1058            atom.exp-len ::= /({digit.class}+){ws.class}.{\1}/ ;
1059
1060              spaces.opt ::= /{ws.class}*/ ;
1061
1062
1063 Character classes:
1064
1065                ws.class ::= [\t\n\f\r\ ] ;
1066             digit.class ::= [0-9] ;
1067              name.class ::= [A-Za-z] ;
1068            base16.class ::= [0-9a-f]
1069            base36.class ::= [a-z0-9]
1070          anything.class ::= anything at all ;
1071
1072
1073 \f
1074 Appendix: 'miscellaneous' table contents
1075 ======================================
1076
1077 The 'miscellaneous' table contains string keys mapped to string
1078 values.  Here is a table of the supported keys, the descriptions of
1079 their values, and the filesystem format version in which they were
1080 introduced.
1081
1082    Fmt   Key                  Value
1083    ---   ------------------   ------------------------------------
1084     4    forward-delta-rev    Youngest revision in the repository as of
1085                               the moment when it was upgraded to support
1086                               forward deltas.