.\"
.\" Copyright (c) 2002 Poul-Henning Kamp
.\" Copyright (c) 2002 Networks Associates Technology, Inc.
.\" All rights reserved.
.\"
.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
.\" DARPA CHATS research program.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. The names of the authors may not be used to endorse or promote
.\"    products derived from this software without specific prior written
.\"    permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Sh NAME
.Nm GEOM
.Nd "modular disk I/O request transformation framework"
.Sh SYNOPSIS
.Cd options GEOM_CACHE
.Cd options GEOM_CONCAT
.Cd options GEOM_JOURNAL
.Cd options GEOM_LABEL
.Cd options GEOM_LINUX_LVM
.Cd options GEOM_MIRROR
.Cd options GEOM_MOUNTVER
.Cd options GEOM_MULTIPATH
.Cd options GEOM_PART_APM
.Cd options GEOM_PART_BSD
.Cd options GEOM_PART_BSD64
.Cd options GEOM_PART_EBR
.Cd options GEOM_PART_EBR_COMPAT
.Cd options GEOM_PART_GPT
.Cd options GEOM_PART_LDM
.Cd options GEOM_PART_MBR
.Cd options GEOM_PART_VTOC8
.Cd options GEOM_RAID3
.Cd options GEOM_SHSEC
.Cd options GEOM_STRIPE
.Cd options GEOM_VIRSTOR
.Sh DESCRIPTION
The
.Nm
framework provides an infrastructure in which
.Dq classes
can perform transformations on disk I/O requests on their path from
the upper kernel to the device drivers and back.
.Pp
Transformations in a
.Nm
context range from the simple geometric
displacement performed in typical disk partitioning modules through RAID
algorithms and device multipath resolution to full-blown cryptographic
protection of the stored data.
.Pp
Compared to traditional
.Dq "volume management" ,
.Nm
differs from most
and in some cases all previous implementations in the following ways:
.Bl -bullet
.It
.Nm
is extensible.
It is trivially simple to write a new class
of transformation and it will not be given stepchild treatment.
If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
.It
.Nm
is topologically agnostic.
Most volume management implementations
have very strict notions of how classes can fit together; very often
one fixed hierarchy is provided, for instance, subdisk - plex -
volume.
.El
.Pp
Being extensible means that new transformations are treated no differently
than existing transformations.
.Pp
Fixed hierarchies are bad because they make it impossible to express
the intent efficiently.
In the fixed hierarchy above, it is not possible to mirror two
physical disks and then partition the mirror into subdisks; instead
one is forced to make subdisks on the physical volumes and to mirror
these two and two, resulting in a much more complex configuration.
.Nm
on the other hand does not care in which order things are done;
the only restriction is that cycles in the graph will not be allowed.
.Sh "TERMINOLOGY AND TOPOLOGY"
.Nm
is quite object-oriented and consequently the terminology
borrows a lot of context and semantics from the OO vocabulary:
.Pp
A
.Dq class ,
represented by the data structure
.Vt g_class ,
implements one
particular kind of transformation.
Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
.Pp
An instance of a class is called a
.Dq geom
and represented by the data structure
.Vt g_geom .
In a typical i386
.Fx
system, there
will be one geom of class MBR for each disk.
.Pp
A
.Dq provider ,
represented by the data structure
.Vt g_provider ,
is the front gate at which a geom offers service.
A provider is
.Do
a disk-like thing which appears in
.Pa /dev
.Dc - a logical
disk in other words.
All providers have three main properties:
.Dq name ,
.Dq sectorsize
and
.Dq size .
.Pp
A
.Dq consumer
is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
.Pp
The topological relationships between these entities are as follows:
.Bl -bullet
.It
A class has zero or more geom instances.
.It
A geom has exactly one class it is derived from.
.It
A geom has zero or more consumers.
.It
A geom has zero or more providers.
.It
A consumer can be attached to zero or one providers.
.It
A provider can have zero or more consumers attached.
.El
.Pp
All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph.
This rank number is
assigned as follows:
.Bl -enum
.It
A geom with no attached consumers has rank=1.
.It
A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
.El
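.Pp
The two rank rules above can be sketched in C as follows.
This is a minimal illustration, not the kernel's implementation: the
structures and the
.Fn geom_rank
function are hypothetical stand-ins, and the sketch assumes that the
geoms underneath already carry valid ranks.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's own structures. */
struct geom;

struct provider {
    struct geom *owner;         /* geom that offers this provider */
};

struct consumer {
    struct provider *attached;  /* NULL while the consumer is detached */
};

struct geom {
    struct consumer *consumers; /* array of this geom's consumers */
    size_t nconsumers;
    int rank;
};

/*
 * Compute a geom's rank from its attached consumers: one higher than
 * the highest rank found among the geoms it is attached to.
 */
static int
geom_rank(const struct geom *gp)
{
    int highest = 0;
    size_t i;

    assert(gp != NULL);
    for (i = 0; i < gp->nconsumers; i++) {
        const struct provider *pp = gp->consumers[i].attached;

        if (pp != NULL && pp->owner->rank > highest)
            highest = pp->owner->rank;
    }
    /* With no attached consumers, highest stays 0 and the rank is 1. */
    return (highest + 1);
}
```
.Ed
A geom representing a bare disk therefore gets rank 1, and a
partitioning geom attached to that disk's provider gets rank 2.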
.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to
improve the overall flexibility.
.Bl -inset
.It Em TASTING
is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own.
A typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, will
instantiate a geom to multiplex according to the contents of the MBR.
.Pp
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.
.Pp
Exactly what a class does to recognize if it should accept the offered
provider is not defined by
.Nm ,
but the sensible set of options is:
.Bl -bullet
.It
Examine specific data structures on the disk.
.It
Examine properties like
.Dq sectorsize
or
.Dq mediasize
from the provider.
.It
Examine the rank number of the provider's geom.
.It
Examine the method name of the provider's geom.
.El
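.Pp
As an illustration of the first option, examining data structures on
the disk, the following sketch checks whether a first sector looks like
an MBR.
The function name
.Fn taste_mbr
is hypothetical and the checks are deliberately minimal (boot signature
and partition status bytes only); a real class would validate more.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

#define MBR_MIN_SIZE    512 /* a classic MBR fills the first 512 bytes */
#define MBR_PART_OFFSET 446 /* first of four 16-byte partition entries */
#define MBR_SIG_OFFSET  510 /* boot signature bytes 0x55 0xAA */

/*
 * Return 1 if the buffer plausibly holds an MBR, 0 otherwise.
 * The caller is assumed to have read the provider's first sector.
 */
static int
taste_mbr(const unsigned char *sector, size_t sectorsize)
{
    size_t i;

    assert(sector != NULL);
    if (sectorsize < MBR_MIN_SIZE)
        return (0);
    if (sector[MBR_SIG_OFFSET] != 0x55 ||
        sector[MBR_SIG_OFFSET + 1] != 0xAA)
        return (0);
    /* Each status byte must be 0x00 (inactive) or 0x80 (active). */
    for (i = 0; i < 4; i++) {
        unsigned char status = sector[MBR_PART_OFFSET + i * 16];

        if (status != 0x00 && status != 0x80)
            return (0);
    }
    return (1);
}
```
.Ed
A class that gets a positive answer would go on to instantiate a geom
on the offered provider; one that does not simply declines.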
.It Em ORPHANIZATION
is the process by which a provider is removed while
it potentially is still being used.
.Pp
When a geom orphans a provider, all future I/O requests will
.Dq bounce
on the provider with an error code set by the geom.
Any
consumers attached to the provider will receive notification about
the orphanization when the event loop gets around to it, and they
can take appropriate action at that time.
.Pp
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider.
Geoms like disk slicers should therefore self-destruct whereas
RAID5 or mirror geoms will be able to continue as long as they do
not lose quorum.
.Pp
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.
.Pp
The typical scenario is:
.Pp
.Bl -bullet -offset indent -compact
.It
A device driver detects a disk has departed and orphans the provider for it.
.It
The geoms on top of the disk receive the orphanization event and
orphan all their providers in turn.
Providers which are not attached to will typically self-destruct
right away.
This process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
.It
Eventually the buck stops when it reaches geom_dev at the top
of the stack.
.It
Geom_dev will call
.Xr destroy_dev 9
to stop any more requests from
coming in.
It will sleep until any and all outstanding I/O requests have
been returned.
It will explicitly close (i.e., zero the access counts), a change
which will propagate all the way down through the mesh.
It will then detach and destroy its geom.
.It
The geom whose provider is now detached will destroy the provider,
detach and destroy its consumer and destroy its geom.
.It
This process percolates all the way down through the mesh, until
the cleanup is complete.
.El
.Pp
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
.Pp
The one absolutely crucial detail to be aware of is that if the
device driver does not return all I/O requests, the tree will
not unravel.
.It Em SPOILING
is a special case of orphanization used to protect
against stale metadata.
It is probably easiest to understand spoiling by going through
an example.
.Pp
Imagine a disk,
.Pa da0 ,
on top of which an MBR geom provides
.Pa da0s1
and
.Pa da0s2 ,
and on top of
.Pa da0s1
a BSD geom provides
.Pa da0s1a
through
.Pa da0s1e ,
and that both the MBR and BSD geoms have
autoconfigured based on data structures on the disk media.
Now imagine the case where
.Pa da0
is opened for writing and those
data structures are modified or overwritten: now the geoms would
be operating on stale metadata unless some notification system
can inform them otherwise.
.Pp
To avoid this situation, when the open of
.Pa da0
for write happens,
all attached consumers are told about this and geoms like
MBR and BSD will self-destruct as a result.
When
.Pa da0
is closed, it will be offered for tasting again
and, if the data structures for MBR and BSD are still there, new
geoms will instantiate themselves anew.
.Pp
Now for the fine print:
.Pp
If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit thus rendering it
impossible to open
.Pa da0
for writing in that case.
Conversely,
the requested exclusive bit would render it impossible to open a
path through the MBR geom while
.Pa da0
is open for writing.
.Pp
From this it also follows that changing the size of open geoms can
only be done with their cooperation.
.Pp
Finally: the spoiling only happens when the write count goes from
zero to non-zero and the retasting happens only when the write count goes
from non-zero to zero.
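.Pp
The edge-triggered nature of that last rule can be sketched as follows.
The structure, the enum, and
.Fn adjust_write_count
are hypothetical names chosen for illustration; the sketch only models
when the two events fire, not how consumers are notified.
.Bd -literal -offset indent
```c
#include <assert.h>

/* Event a write-count change should trigger, if any. */
enum write_event { EV_NONE, EV_SPOIL, EV_RETASTE };

/* Hypothetical per-provider bookkeeping. */
struct provider_access {
    int write_count;    /* number of current opens for writing */
};

/*
 * Apply a change to the write count and report which event follows.
 * Only the zero/non-zero transitions matter; a second writer joining
 * or leaving triggers nothing.
 */
static enum write_event
adjust_write_count(struct provider_access *pa, int delta)
{
    int before = pa->write_count;
    int after = before + delta;

    assert(after >= 0);
    pa->write_count = after;
    if (before == 0 && after > 0)
        return (EV_SPOIL);      /* first writer: spoil consumers */
    if (before > 0 && after == 0)
        return (EV_RETASTE);    /* last writer gone: offer for tasting */
    return (EV_NONE);
}
```
.Ed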
.It Em CONFIGURE
is the process where the administrator issues instructions
for a particular class to instantiate itself.
There are multiple
ways to express intent in this case: a particular provider may be
specified with a level of override, forcing, for instance, a BSD
disklabel module to attach to a provider which was not found palatable
during the TASTE operation.
.Pp
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
.It Em "I/O REQUESTS" ,
represented by
.Vt "struct bio" ,
originate at a consumer,
are scheduled on its attached provider and, when processed, are returned
to the consumer.
It is important to realize that the
.Vt "struct bio"
which enters through the provider of a particular geom does not
.Do
come out on the other side
.Dc .
Even simple transformations like MBR and BSD will clone the
.Vt "struct bio" ,
modify the clone, and schedule the clone on their
own consumer.
Note that cloning the
.Vt "struct bio"
does not involve cloning the
actual data area specified in the I/O request.
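.Pp
A rough sketch of the clone-and-modify step for a partitioning geom
follows.
The
.Vt "struct io_request"
type and
.Fn slice_clone
are hypothetical and far simpler than the kernel's request structure;
the point is only that the descriptor is copied and adjusted while the
data pointer is shared.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request descriptor. */
struct io_request {
    long long offset;                /* byte offset on the target */
    size_t length;                   /* number of bytes to transfer */
    void *data;                      /* caller's buffer, never copied */
    const struct io_request *parent; /* request this was cloned from */
};

/*
 * Build the request a slice would send down through its consumer.
 * Returns 0 on success, -1 if the request falls outside the slice.
 */
static int
slice_clone(const struct io_request *in, long long slice_start,
    long long slice_size, struct io_request *clone)
{
    assert(in != NULL && clone != NULL);
    if (in->offset < 0 || in->offset > slice_size ||
        (long long)in->length > slice_size - in->offset)
        return (-1);
    *clone = *in;                   /* copies the descriptor only */
    clone->offset = in->offset + slice_start;
    clone->parent = in;             /* lets completion find the original */
    return (0);
}
```
.Ed
The original request is left untouched, so it can be completed towards
its consumer once the clone comes back from below.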
.Pp
In total, four different I/O requests exist in
.Nm :
read, write, delete, and
.Dq "get attribute" .
.Pp
Read and write are self-explanatory.
.Pp
Delete indicates that a certain range of data is no longer used
and that it can be erased or freed as the underlying technology
supports.
Technologies like flash adaptation layers can arrange to erase
the relevant blocks before they are reassigned, and
cryptographic devices may want to fill random bits into the
range to reduce the amount of data available for attack.
.Pp
It is important to recognize that a delete indication is not a
request and consequently there is no guarantee that the data actually
will be erased or made unavailable unless guaranteed by specific
geoms in the graph.
If
.Dq "secure delete"
semantics are required, a
geom should be pushed which converts delete indications into (a
sequence of) write requests.
.Pp
.Dq "Get attribute"
supports inspection and manipulation
of out-of-band attributes on a particular provider or path.
Attributes are named by
.Tn ASCII
strings and they will be discussed in
a separate section below.
.El
.Pp
(Stay tuned while the author rests his brain and fingers: more to come.)
.Sh DIAGNOSTICS
Several flags are provided for tracing
.Nm
operations and unlocking
protection mechanisms via the
.Va kern.geom.debugflags
sysctl.
All of these flags are off by default, and great care should be taken in
turning them on.
.Bl -tag -width indent
.It 0x01 Pq Dv G_T_TOPOLOGY
Provide tracing of topology change events.
.It 0x02 Pq Dv G_T_BIO
Provide tracing of buffer I/O requests.
.It 0x04 Pq Dv G_T_ACCESS
Provide tracing of access check controls.
.It 0x10 (allow foot shooting)
Allow writing to Rank 1 providers.
This would, for example, allow the super-user to overwrite the MBR on the root
disk or write random sectors elsewhere to a mounted disk.
The implications are obvious.
.It 0x40 Pq Dv G_F_DISKIOCTL
This is unused at this time.
.It 0x80 Pq Dv G_F_CTLDUMP
Dump contents of gctl requests.
.El
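.Pp
Because the flags are independent bits, several can be combined in one
value.
The sketch below shows how such a value could be picked apart.
The bit values come from the list above; the helper functions and the
.Dv DBG_FOOTSHOOT
and
.Dv DBG_KNOWN
names are hypothetical, since the list gives no symbol for 0x10.
.Bd -literal -offset indent
```c
#include <assert.h>

#define G_T_TOPOLOGY    0x01
#define G_T_BIO         0x02
#define G_T_ACCESS      0x04
#define DBG_FOOTSHOOT   0x10    /* hypothetical name for 0x10 */
#define G_F_DISKIOCTL   0x40
#define G_F_CTLDUMP     0x80

/* Every bit documented in the list above. */
#define DBG_KNOWN   (G_T_TOPOLOGY | G_T_BIO | G_T_ACCESS | \
    DBG_FOOTSHOOT | G_F_DISKIOCTL | G_F_CTLDUMP)

/* Return 1 if any of the three tracing bits is set. */
static int
debug_traces(unsigned int flags)
{
    return ((flags & (G_T_TOPOLOGY | G_T_BIO | G_T_ACCESS)) != 0);
}

/* Return 1 if writes to rank 1 providers would be allowed. */
static int
debug_allows_footshooting(unsigned int flags)
{
    return ((flags & DBG_FOOTSHOOT) != 0);
}

/* Return the bits that the list above does not document. */
static unsigned int
debug_unknown_bits(unsigned int flags)
{
    assert((flags & ~0xffu) == 0);  /* only the low byte is listed */
    return (flags & ~(unsigned int)DBG_KNOWN);
}
```
.Ed
For example, a value of 0x11 enables topology tracing and also permits
writing to rank 1 providers.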
The following options have been deprecated and will be removed in
.Cd GEOM_PART_VTOC8 ,
options, respectively, instead.
.Sh SEE ALSO
.Xr DECLARE_GEOM_CLASS 9 ,
.Xr g_provider_by_name 9
.Sh HISTORY
This software was developed for the
.Fx
Project by
.An Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.\&
under DARPA/SPAWAR contract N66001-01-C-8035
.Pq Dq CBOSS ,
as part of the
DARPA CHATS research program.
.Pp
The first precursor for
.Nm
was a gruesome hack to Minix 1.2 and was
never distributed.
An earlier attempt to implement a less general scheme
in
.Fx
never succeeded.
.Sh AUTHORS
.An Poul-Henning Kamp Aq Mt phk@FreeBSD.org