contrib/subversion/subversion/libsvn_fs_fs/structure-indexes

   1 This file describes the design, data model, and storage formats of FSFS
   2 index data.
   3
   4
   5 Design
   6 ======
   7
   8 Each pack and each rev file using logical addressing contains exactly
   9 two index sections.  One, the log-to-phys index, maps the (rev, item_index)
  10 pairs to absolute file offsets.  The other, phys-to-log, is a reverse
  11 index that gives basic information on any file location.  This is enough
  12 to read and cache any data without traversing DAGs.
  13
  14 Rev and pack files are immutable, so the same is true for index data.
  15 During a transaction or while packing a file, a proto index file gets
  16 written (actually, one log-to-phys and one phys-to-log).  They use a
  17 simpler, less compact format with fixed record lengths.  A proto index
  18 basically aggregates all the information that must later be transformed
  19 into the final index.
  20
  21
  22 General design concerns
  23 -----------------------
  24
  25 In Subversion, there is no limit to the size of a revision; even practical
  26 limits are in the order of millions of changes at least.  Index data for
  27 these would be multiple megabytes in size with pack file indexes possibly
  28 approaching 1 GB.  To ensure we still get roughly O(1) access time, we
  29 need a hierarchical data structure.
  30
  31 Therefore, the indexes will start with a header containing an array of
  32 references to sub-sections or pages.  The length of these pages varies
  33 but is limited to a size configurable in fsfs.conf.
  34
  35 Finally, it is assumed that whole pages can be cached efficiently and
  36 with a high cache hit rate.  So, although a page may have a thousand or
  37 more entries, the access time is still amortized O(1) in many scenarios.
  38
  39
  40 Items and item types
  41 --------------------
  42
  43 The index implementation treats item_index and item type as simple ints,
  44 except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
  45 Since they have been defined as 0, the code may test for "used" etc.
  46 by simply comparing with 0.
  47
  48 See section "addressing modes" in structure to a list of item types
  49 and pre-defined item_index values.
  50
  51
  52 Encoding
  53 --------
  54
  55 The final index data format is tuned for space and decoding efficiency.
  56 Indexes are stored as a sequence of variable integers.  The encoding is
  57 as follows:
  58
  59 * Unsigned integers are stored in little endian order with a variable
  60   length 7b/8b encoding.  If most significant bit a byte has been set,
  61   the next byte has also belongs to the same value.
  62
  63   0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
  64   0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
  65   0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
  66   0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)
  67
  68   Technically, we can represent integers of arbitrary lengths.  Currently,
  69   we only generate and parse up to 64 bits.
  70
  71 * Signed integers are mapped onto the unsigned value space as follows:
  72
  73   x >= 0 ->  2 * x
  74   x < 0  -> -2 * x - 1
  75
  76   Again, we can represent arbitrary length numbers that way but the code
  77   is currently restricted to 64 bits.
  78
  79 Most data is unsigned by nature but will be stored differentially using
  80 signed integers.
  81
  82
  83 Encoding in proto-index files
  84 -----------------------------
  85
  86 These have a much simpler encoding.  Throughout the files, all records have
  87 the same length (but different between L2P and P2L).  All records contain
  88 unsigned 64 bit integers only, stored in little endian byte order.
  89
  90
  91 Log-to-phys index
  92 =================
  93
  94 This index has to map (rev, item_index) -> offset.  It assumes that the
  95 item_index values per revision are dense and start at 0.  There may be
  96 unused item_index values, though; the data structure simply gets less
  97 space-efficient when the more sparse the value space gets.
  98
  99
 100 Index data model
 101 ----------------
 102
 103 hierarchy:
 104
 105   header -> per-revision info -> page -> offset
 106
 107   There is one entry per revision in the header.  Per revision there are
 108   one or more pages (exclusive to that revision) containing up to a known,
 109   fixed limit of entries (= page size).  The total access path is:
 110
 111   pages = header->pages[revision];
 112   offsets = page = pages[item_index / page_size];
 113   offset = offsets[item_index % page_size];
 114
 115   Different log-to-phys indexes in the same repository may have different
 116   page sizes but within any given index, the page size is the same and
 117   immutable.
 118
 119 header:
 120
 121   <first revision>   ... first revision covered by this index
 122   <revision count>   ... number of revision covered by this index
 123   <page size>        ... maximum number of entries per page
 124   <page table index> ... array, for each revision containing the index in
 125                          <page table> of the first page that belongs to
 126                          this revision.  This has <revision count>+1
 127                          entries to terminate the last revision.
 128   <page table>       ... array of page headers.  It has
 129                          <page table index>[<revision count>] entries.
 130
 131 page table:
 132
 133   <offset>           ... absolute position of the page contents within the
 134                          index
 135   <entry count>      ... number of offset entries in the page.
 136                          Must match <header>.<page size> unless this is
 137                          the last page for the respective revision.
 138   <size>             ... length in bytes of the on-disk page description.
 139                          Note that this is redundant with the <offset>.
 140
 141 page:
 142
 143   <entry count>      ... number of offset entries in the page.
 144                          Must match <header>.<page size> unless this is
 145                          the last page for the respective revision.
 146                          Redundant with <page table>.<entry count>
 147   <offsets>          ... array of absolute file positions within the rev /
 148                          pack file.  This has <entry count> entries.
 149
 150
 151 Index on-disk format
 152 --------------------
 153
 154   index := "L2P-INDEX\n" header revisions pages offsets
 155
 156   header := u(<header>.<first revision>) \
 157             u(<header>.<page size>) \
 158             u(<header>.<revision count>) \
 159             u(s(<header>.<page table>))
 160
 161   revisions := u(  <header>.<page table index>[k+1]
 162                  - <header>.<page table index>[k]),
 163                for k in 0 .. <header>.<revision count>-1
 164
 165   pages := u(<header>.<page table>[k].<size>) \
 166            u(<header>.<page table>[k].<entry count>),
 167            for k in 0 .. s(<header>.<page table>)-1
 168
 169   offsets := page(k),
 170              for k in 0 .. s(<header>.<page table>)-1
 171
 172   page(k) := i(<header>.<page table>[k].<offsets>[0]) \
 173              i(  <header>.<page table>[k].<offsets>[l] \
 174                - <header>.<page table>[k].<offsets>[l - 1]),
 175              for l in 1 .. s(<header>.<page table>[k].<entry count>)-1
 176
 177   u(x) ... unsigned int x in 7b/8b encoding
 178   i(x) ... signed int x in 7b/8b encoding
 179   s(x) ... number of entries in array x
 180
 181
 182 Proto index file format
 183 -----------------------
 184
 185 The index will be created from a "revision-less" proto index file
 186 containing (<offset><item_index>) pairs only.
 187
 188 All mappings belonging to the same revision must be written in one go but
 189 there is no restriction on the order of those entries.  To signify that
 190 a new revision begins, a (0, 0) mapping must be written.  A (0, 0) entry
 191 at the beginning of the file is optional and will be ignored.
 192
 193   <bof>         /* begin of proto index file for revision r and following */
 194   (0, 0)        /* mark start of revision r, optional for first rev */
 195   (off, item)*  /* zero or more mappings in random order */
 196   (0, 0)        /* mark start of revision r + 1 */
 197   (off, item)*  /* zero or more mappings in random order */
 198   (0, 0)        /* mark start of revision r + 2 */
 199   (off, item)*  /* zero or more mappings in random order */
 200   ...
 201   <eof>         /* end of file. */
 202
 203 All entries are pairs of 64 bit unsigned integers in little endian order.
 204
 205
 206 Phys-to-log index
 207 =================
 208
 209 This index has to map offset -> (rev, item_index, type, len, checksum).
 210
 211
 212 Index data model
 213 ----------------
 214
 215 hierarchy:
 216
 217   header -> page -> item info
 218
 219   Logically, the index splits up index rev / pack file into pages of a
 220   fixed size.  That page size may differ from the FS's block size.  The
 221   index will describe each rev / pack file page with one index page.
 222
 223   page = header->pages[offset % page_size];
 224   item info = binary_search(page.data, offset)
 225
 226   Note that the current implementation will not return any data if the
 227   offset is does not match any actual item start.  To simplify the lookup,
 228   the last index page will have an "unused item" entry for the section
 229   behind EOF.  Holes aren't allowed as well, i.e. every byte of the rev /
 230   pack is expected to be covered by the index.
 231
 232   Also, there may be items stretching across page borders or even over
 233   multiple pages.  The data model solves this issue by storing the item
 234   descriptions as a "primary array" and then representing the pages as
 235   ranges within that array.  Thus multiple pages may reference the same
 236   item description.
 237
 238 header:
 239
 240   <first revision>   ... first revision covered by this index
 241   <file size>        ... size of the rev / pack file in bytes
 242   <page size>        ... number of bytes in the rev / pack file covered by
 243                          each index page
 244   <page count>       ... number of pages
 245   <offsets>          ... array of page offsets, i.e. locations the page
 246                          data within the index.
 247                          This array has <page count> + 1 entries.
 248
 249 page:
 250
 251   <entries>          ... array of item descriptions, ordered by offset.
 252                          First and last entry may cross page boundaries.
 253
 254 entry:
 255
 256   <offset>           ... absolute position in the pack / rev file
 257   <size>             ... on-disk size of the item in the pack / rev file
 258   <type>             ... item type
 259   <FNV checksum>     ... modified 32 bit FNV-1a checksum of that section
 260                          of the pack / rev file (see below). 0 for empty
 261                          or zero-length items
 262   <revision>         ... revision that this item belongs to
 263   <item_index>       ... item_index within that revision
 264
 265
 266 Index on-disk format
 267 --------------------
 268
 269   index := "P2L-INDEX\n" header pages items
 270
 271   header := u(<header>.<first revision>) \
 272             u(<header>.<file size>) \
 273             u(<header>.<page size>) \
 274             u(<header>.<page count>)
 275
 276   pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
 277            for k in 0 .. <header>.<page count> -1
 278
 279   items := u(<items in page k>[0].<offset>) \
 280            u(<items in page k>[l].<size>) \
 281            i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
 282            i(  <items in page k>[l].<revision>
 283              - <items in page k>[l-1].<revision>), \
 284            u(FNV checksum)
 285            for l in 0 .. s(<items in page k>)-1,
 286            for k in 0 .. <header>.<page count>-1
 287
 288   u(x) ... unsigned int x in 7b/8b encoding
 289   i(x) ... signed int x in 7b/8b encoding
 290   s(x) ... number of entries in collection x
 291   c(x) ... compound function := x.<item_index> * 8 + x.<type>
 292
 293   Access to negative indexes gives a 0 value.
 294
 295   <Items in page k> are in strict ascending offset order.  Items that
 296   started after the begin of a given page and overlap with the next page
 297   will not be stored in the start page.  The runtime representation will
 298   duplicate items overlapping page boundaries; the on-disk representation
 299   will not.  Thus, pages inside large items will have zero entries on disk.
 300
 301
 302 Proto index file format
 303 -----------------------
 304
 305 The index will be created from a proto index file containing simple
 306 instances of svn_fs_fs__p2l_entry_t with the following element order:
 307
 308   item offset               as uint64
 309   item size                 as uint64
 310   item type                 as uint64
 311   modified FNV1a checksum   as uint64
 312   revision                  as uint64, with SVN_INVALID_REVNUM mapped to 0
 313                                        and revisions >= 0 stored as rev+1
 314   item index                as uint64
 315
 316 All values are stored in little endian order.
 317
 318 Page table and header information, except start revision and page size,
 319 can easily be derived from that information.
 320
 321 All entries must be written in strict offset order.  Overlapping entries
 322 are not allowed; zero-length items are.
 323
 324 In transactions, the final revision number may not be known when writing
 325 the proto index file (e.g. while still writing the proto rev file).  Items
 326 with revision set to SVN_INVALID_REVNUM will therefore be automatically
 327 updated when creating the final index.  This is possible in conjunction
 328 with rev files but not for pack files.
 329
 330
 331 FNV checksum
 332 ------------
 333
 334 FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
 335 For performance reasons we use a modified version:
 336
 337 * split the input byte stream [b0 .. bN] into 4 sub-streams of equal
 338   length and up to 3 remnants:
 339
 340   [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]
 341
 342 * calculate 32 bit FNV-1a checksums for the 4 substreams:
 343
 344   h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])
 345
 346 * concatenate the big endian representation of these checksums (4 bytes
 347   each) plus the remnant of the original stream into a 16 to 19 byte long
 348   intermediate:
 349
 350   [i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19
 351
 352 * fold the variable-length intermediate into a compact 32 bit checksum:
 353
 354   FNV checksum = fnv_1a([i0 .. iK])