1 This file describes the design, data model, and storage formats of FSFS
8 Each pack and each rev file using logical addressing contains exactly
9 two index sections. One, the log-to-phys index, maps the (rev, item_index)
10 pairs to absolute file offsets. The other, phys-to-log, is a reverse
11 index that gives basic information on any file location. This is enough
12 to read and cache any data without traversing DAGs.
14 Rev and pack files are immutable, so the same is true for index data.
15 During a transaction or while packing a file, a proto index file gets
16 written (actually, one log-to-phys and one phys-to-log). They use a
17 simpler, less compact format with fixed record lengths. A proto index
18 basically aggregates all the information that must later be transformed
22 General design concerns
23 -----------------------
25 In Subversion, there is no limit to the size of a revision; even practical
26 limits are in the order of millions of changes at least. Index data for
27 these would be multiple megabytes in size with pack file indexes possibly
28 approaching 1 GB. To ensure we still get roughly O(1) access time, we
29 need a hierarchical data structure.
31 Therefore, the indexes will start with a header containing an array of
32 references to sub-sections or pages. The length of these pages varies
33 but is limited to a size configurable in fsfs.conf.
35 Finally, it is assumed that whole pages can be cached efficiently and
36 with a high cache hit rate. So, although a page may have a thousand or
37 more entries, the access time is still amortized O(1) in many scenarios.
43 The index implementation treats item_index and item type as simple ints,
44 except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
45 Since they have been defined as 0, the code may test for "used" etc.
46 by simply comparing with 0.
48 See section "addressing modes" in structure to a list of item types
49 and pre-defined item_index values.
55 The final index data format is tuned for space and decoding efficiency.
56 Indexes are stored as a sequence of variable integers. The encoding is
59 * Unsigned integers are stored in little endian order with a variable
60 length 7b/8b encoding. If most significant bit a byte has been set,
61 the next byte has also belongs to the same value.
63 0x00 .. 0x7f -> 0x00 .. 0x7f ( 7 bits stored in 8 bits)
64 0x80 .. 0xff -> 0x80 0x01 .. 0xff 0x01 (14 bits stored in 16 bits)
65 0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f (14 bits stored in 16 bits)
66 0x100000000 -> 0x80 0x80 0x80 0x80 0x10 (35 bits stored in 40 bits)
68 Technically, we can represent integers of arbitrary lengths. Currently,
69 we only generate and parse up to 64 bits.
71 * Signed integers are mapped onto the unsigned value space as follows:
76 Again, we can represent arbitrary length numbers that way but the code
77 is currently restricted to 64 bits.
79 Most data is unsigned by nature but will be stored differentially using
83 Encoding in proto-index files
84 -----------------------------
86 These have a much simpler encoding. Throughout the files, all records have
87 the same length (but different between L2P and P2L). All records contain
88 unsigned 64 bit integers only, stored in little endian byte order.
94 This index has to map (rev, item_index) -> offset. It assumes that the
95 item_index values per revision are dense and start at 0. There may be
96 unused item_index values, though; the data structure simply gets less
97 space-efficient when the more sparse the value space gets.
105 header -> per-revision info -> page -> offset
107 There is one entry per revision in the header. Per revision there are
108 one or more pages (exclusive to that revision) containing up to a known,
109 fixed limit of entries (= page size). The total access path is:
111 pages = header->pages[revision];
112 offsets = page = pages[item_index / page_size];
113 offset = offsets[item_index % page_size];
115 Different log-to-phys indexes in the same repository may have different
116 page sizes but within any given index, the page size is the same and
121 <first revision> ... first revision covered by this index
122 <revision count> ... number of revision covered by this index
123 <page size> ... maximum number of entries per page
124 <page table index> ... array, for each revision containing the index in
125 <page table> of the first page that belongs to
126 this revision. This has <revision count>+1
127 entries to terminate the last revision.
128 <page table> ... array of page headers. It has
129 <page table index>[<revision count>] entries.
133 <offset> ... absolute position of the page contents within the
135 <entry count> ... number of offset entries in the page.
136 Must match <header>.<page size> unless this is
137 the last page for the respective revision.
138 <size> ... length in bytes of the on-disk page description.
139 Note that this is redundant with the <offset>.
143 <entry count> ... number of offset entries in the page.
144 Must match <header>.<page size> unless this is
145 the last page for the respective revision.
146 Redundant with <page table>.<entry count>
147 <offsets> ... array of absolute file positions within the rev /
148 pack file. This has <entry count> entries.
154 index := "L2P-INDEX\n" header revisions pages offsets
156 header := u(<header>.<first revision>) \
157 u(<header>.<page size>) \
158 u(<header>.<revision count>) \
159 u(s(<header>.<page table>))
161 revisions := u( <header>.<page table index>[k+1]
162 - <header>.<page table index>[k]),
163 for k in 0 .. <header>.<revision count>-1
165 pages := u(<header>.<page table>[k].<size>) \
166 u(<header>.<page table>[k].<entry count>),
167 for k in 0 .. s(<header>.<page table>)-1
170 for k in 0 .. s(<header>.<page table>)-1
172 page(k) := i(<header>.<page table>[k].<offsets>[0]) \
173 i( <header>.<page table>[k].<offsets>[l] \
174 - <header>.<page table>[k].<offsets>[l - 1]),
175 for l in 1 .. s(<header>.<page table>[k].<entry count>)-1
177 u(x) ... unsigned int x in 7b/8b encoding
178 i(x) ... signed int x in 7b/8b encoding
179 s(x) ... number of entries in array x
182 Proto index file format
183 -----------------------
185 The index will be created from a "revision-less" proto index file
186 containing (<offset><item_index>) pairs only.
188 All mappings belonging to the same revision must be written in one go but
189 there is no restriction on the order of those entries. To signify that
190 a new revision begins, a (0, 0) mapping must be written. A (0, 0) entry
191 at the beginning of the file is optional and will be ignored.
193 <bof> /* begin of proto index file for revision r and following */
194 (0, 0) /* mark start of revision r, optional for first rev */
195 (off, item)* /* zero or more mappings in random order */
196 (0, 0) /* mark start of revision r + 1 */
197 (off, item)* /* zero or more mappings in random order */
198 (0, 0) /* mark start of revision r + 2 */
199 (off, item)* /* zero or more mappings in random order */
201 <eof> /* end of file. */
203 All entries are pairs of 64 bit unsigned integers in little endian order.
209 This index has to map offset -> (rev, item_index, type, len, checksum).
217 header -> page -> item info
219 Logically, the index splits up index rev / pack file into pages of a
220 fixed size. That page size may differ from the FS's block size. The
221 index will describe each rev / pack file page with one index page.
223 page = header->pages[offset % page_size];
224 item info = binary_search(page.data, offset)
226 Note that the current implementation will not return any data if the
227 offset is does not match any actual item start. To simplify the lookup,
228 the last index page will have an "unused item" entry for the section
229 behind EOF. Holes aren't allowed as well, i.e. every byte of the rev /
230 pack is expected to be covered by the index.
232 Also, there may be items stretching across page borders or even over
233 multiple pages. The data model solves this issue by storing the item
234 descriptions as a "primary array" and then representing the pages as
235 ranges within that array. Thus multiple pages may reference the same
240 <first revision> ... first revision covered by this index
241 <file size> ... size of the rev / pack file in bytes
242 <page size> ... number of bytes in the rev / pack file covered by
244 <page count> ... number of pages
245 <offsets> ... array of page offsets, i.e. locations the page
246 data within the index.
247 This array has <page count> + 1 entries.
251 <entries> ... array of item descriptions, ordered by offset.
252 First and last entry may cross page boundaries.
256 <offset> ... absolute position in the pack / rev file
257 <size> ... on-disk size of the item in the pack / rev file
259 <FNV checksum> ... modified 32 bit FNV-1a checksum of that section
260 of the pack / rev file (see below). 0 for empty
262 <revision> ... revision that this item belongs to
263 <item_index> ... item_index within that revision
269 index := "P2L-INDEX\n" header pages items
271 header := u(<header>.<first revision>) \
272 u(<header>.<file size>) \
273 u(<header>.<page size>) \
274 u(<header>.<page count>)
276 pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
277 for k in 0 .. <header>.<page count> -1
279 items := u(<items in page k>[0].<offset>) \
280 u(<items in page k>[l].<size>) \
281 i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
282 i( <items in page k>[l].<revision>
283 - <items in page k>[l-1].<revision>), \
285 for l in 0 .. s(<items in page k>)-1,
286 for k in 0 .. <header>.<page count>-1
288 u(x) ... unsigned int x in 7b/8b encoding
289 i(x) ... signed int x in 7b/8b encoding
290 s(x) ... number of entries in collection x
291 c(x) ... compound function := x.<item_index> * 8 + x.<type>
293 Access to negative indexes gives a 0 value.
295 <Items in page k> are in strict ascending offset order. Items that
296 started after the begin of a given page and overlap with the next page
297 will not be stored in the start page. The runtime representation will
298 duplicate items overlapping page boundaries; the on-disk representation
299 will not. Thus, pages inside large items will have zero entries on disk.
302 Proto index file format
303 -----------------------
305 The index will be created from a proto index file containing simple
306 instances of svn_fs_fs__p2l_entry_t with the following element order:
308 item offset as uint64
311 modified FNV1a checksum as uint64
312 revision as uint64, with SVN_INVALID_REVNUM mapped to 0
313 and revisions >= 0 stored as rev+1
316 All values are stored in little endian order.
318 Page table and header information, except start revision and page size,
319 can easily be derived from that information.
321 All entries must be written in strict offset order. Overlapping entries
322 are not allowed; zero-length items are.
324 In transactions, the final revision number may not be known when writing
325 the proto index file (e.g. while still writing the proto rev file). Items
326 with revision set to SVN_INVALID_REVNUM will therefore be automatically
327 updated when creating the final index. This is possible in conjunction
328 with rev files but not for pack files.
334 FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
335 For performance reasons we use a modified version:
337 * split the input byte stream [b0 .. bN] into 4 sub-streams of equal
338 length and up to 3 remnants:
340 [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]
342 * calculate 32 bit FNV-1a checksums for the 4 substreams:
344 h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])
346 * concatenate the big endian representation of these checksums (4 bytes
347 each) plus the remnant of the original stream into a 16 to 19 byte long
350 [i0 .. iK] = [big-endian(h0) ... big-endian(h3) remnant ], 16 <= K+1 <= 19
352 * fold the variable-length intermediate into a compact 32 bit checksum:
354 FNV checksum = fnv_1a([i0 .. iK])