.\" Copyright (c) 2018 Rick Macklem
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.Nd NFS Version 4.1 Parallel NFS Protocol Server
A set of FreeBSD servers may be configured to provide a
One FreeBSD system needs to be configured as a MetaData Server (MDS) and
at least one additional FreeBSD system needs to be configured as one or
more Data Servers (DSs).
.Pp
These FreeBSD systems are configured to be NFSv4.1 servers, see
if you are not familiar with configuring an NFSv4.1 server.
.Sh DS server configuration
The DS(s) need to be configured as NFSv4.1 server(s), with a top level exported
directory used for storage of data files.
This directory must be owned by
and would normally have a mode of
Within this directory there need to be additional directories named
ds0,...,dsN (where N is 19 by default) also owned by
These are the directories where the data files are stored.
The following command can be run by root when in the top level exported
directory to create these subdirectories.
.Bd -literal -offset indent
jot -w ds 20 0 | xargs mkdir -m 700
.Ed
is the default and can be set to a larger value on the MDS as shown below.
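.Pp
When the number of subdirectories is raised above the default, the same
directories can be created with a small loop instead of a fixed jot count.
The following is only a sketch:
make_ds_dirs is a hypothetical helper name, not a FreeBSD command, and the
top level directory and count passed to it are assumptions to be replaced
with the values used at your site.

```shell
#!/bin/sh
# Sketch: create ds0 .. ds(COUNT-1) under a DS's top level exported directory.
# make_ds_dirs is a hypothetical helper; run it as root on the DS.
make_ds_dirs() {
    top=$1
    count=${2:-20}          # 20 matches the default described above
    i=0
    while [ "$i" -lt "$count" ]; do
        # -p makes a re-run harmless when only some directories exist.
        mkdir -p -m 700 "$top/ds$i" || return 1
        i=$((i + 1))
    done
}
```

Because existing directories are left untouched, the helper can be re-run
with a larger count to add only the missing subdirectories.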
.Pp
The top level exported directory used for storage of data files must be
exported to the MDS with the
.Dq maproot=root sec=sys
export options so that the MDS can create entries in these subdirectories.
It must also be exported to all pNFS aware clients, but these clients do
export option and this directory should be exported to them with the same
options as used by the MDS to export file system(s) to the clients.
.Pp
It is possible to have multiple DSs on the same FreeBSD system, but each
of these DSs must have a separate top level exported directory for storage
of data files and must be mountable via a separate IP address.
Alias addresses can be set on the DS server system for a network
to create these different IP addresses.
Multiple DSs on the same server may be useful when data for different file systems
on the MDS are being stored on different file system volumes on the FreeBSD
.Sh MDS server configuration
The MDS must be a separate FreeBSD system from the FreeBSD DS system(s) and
It is configured as an NFSv4.1 server with file system(s) exported to
command line argument for
is used to indicate that it is running as the MDS for a pNFS server.
.Pp
The DS(s) must all be mounted on the MDS using the following mount options:
.Bd -literal -offset indent
nfsv4,minorversion=1,soft,retrans=2
.Ed
so that they can be defined as DSs in the
Normally these mounts would be entered in the
For example, if there are four DSs named nfsv4-data[0-3], the
lines might look like:
.Bd -literal -offset indent
nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
.Ed
indicates that the NFS server is a pNFS MDS and specifies what
nfs_server_flags line in your
.Bd -literal -offset indent
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
.Ed
This example specifies that the data files should be distributed over the
four DSs and File layouts will be issued to pNFS enabled clients.
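.Pp
The mount lines and the DS list passed with -p name the same hosts and
mounted-on paths, so it can help to generate both from one list.
The following is a sketch only:
ds_fstab_lines and ds_p_list are hypothetical helper names, and the sketch
assumes that every DS exports its storage directory as "/", as in the
example above.

```shell
#!/bin/sh
# Sketch: derive the mount lines and the -p list from one list of DSs.
# ds_fstab_lines and ds_p_list are hypothetical helpers. Each argument is
# a "host:/mounted-on-path" pair such as nfsv4-data0:/data0.
ds_fstab_lines() {
    for ds in "$@"; do
        host=${ds%%:*}      # text before the first colon
        mnt=${ds#*:}        # text after the first colon
        # Assumes the DS exports "/" as in the example lines above.
        printf '%s:/ %s nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0\n' \
            "$host" "$mnt"
    done
}

ds_p_list() {
    # Join the arguments with commas, the form used after -p above.
    (IFS=,; printf '%s\n' "$*")
}

set -- nfsv4-data0:/data0 nfsv4-data1:/data1 \
    nfsv4-data2:/data2 nfsv4-data3:/data3
ds_fstab_lines "$@"
# The other flags are copied from the example and may differ per site.
printf 'nfs_server_flags="-u -t -n 128 -p %s"\n' "$(ds_p_list "$@")"
```

Keeping a single list avoids a mismatch between a mounted-on path and the
path named in the DS list.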
If issuing Flexible File layouts is desired for this case, setting the sysctl
.Dq vfs.nfsd.default_flexfile
Alternatively, this variant of
will specify that two way mirroring is to be done, via the
.Bd -literal -offset indent
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3 -m 2"
.Ed
With two way mirroring, the data file for each exported file on the MDS
will be stored on two of the DSs.
When mirroring is enabled, the server will always issue Flexible File layouts.
.Pp
It is also possible to specify which DSs are to be used to store data files for
specific exported file systems on the MDS.
For example, if the MDS has exported two file systems
to clients, the following variant of
will specify that data files for
will be stored on nfsv4-data0 and nfsv4-data1, whereas the data files for
will be stored on nfsv4-data2 and nfsv4-data3.
.Bd -literal -offset indent
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
.Ed
This can be used by system administrators to control where data files are
stored and might be useful for control of storage use.
For this case, it may be convenient to co-locate more than one of the DSs
on the same FreeBSD server, using separate file systems on the DS system
for storage of the respective DS's data files.
If mirroring is desired for this case, the
option also needs to be specified.
There must be enough DSs assigned to each exported file system on the MDS
to support the level of mirroring.
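.Pp
This requirement can be checked before restarting the server by counting
the entries assigned to each exported file system in the DS list.
The following is a sketch, not part of FreeBSD:
check_mirror_level is a hypothetical helper name, and entries without a
"#/path" suffix are simply grouped under the label "(any)", which is an
assumption that does not model how the server itself treats them.

```shell
#!/bin/sh
# Sketch: count the DSs assigned to each exported file system in a DS list
# and report any file system with fewer DSs than the mirror level.
# check_mirror_level is a hypothetical helper, not a FreeBSD command.
check_mirror_level() {
    ds_list=$1
    level=$2
    printf '%s\n' "$ds_list" | tr ',' '\n' | awk -F'#' -v level="$level" '
        # Entries without a "#/path" suffix are grouped under "(any)".
        { fs = (NF > 1) ? $2 : "(any)"; n[fs]++ }
        END {
            bad = 0
            for (fs in n)
                if (n[fs] < level) {
                    printf "%s: %d DS(s), need %d\n", fs, n[fs], level
                    bad = 1
                }
            exit bad
        }'
}
```

The helper returns a non-zero status and names each file system that has
too few DSs for the requested level.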
The above example would be fine for two way mirroring, but four way mirroring
would not work, since there are only two DSs assigned to each exported file
system.
.Pp
The number of subdirectories in each DS is defined by the
.Dq vfs.nfs.dsdirsize
This value can be increased from the default of 20, but only when the
is not running and after the additional ds20,... subdirectories have been
created on all the DSs.
For a service that will store a large number of files this sysctl should be
set much larger, to avoid the number of entries in a subdirectory from
Once operational, NFSv4.1 FreeBSD client mounts done with the
option should do I/O directly on the DSs.
The clients mounting the MDS must be running the
daemon for pNFS to work.
.Bd -literal -offset indent
.Ed
Non-pNFS aware clients or NFSv3 mounts will do all I/O RPCs on the MDS,
which acts as a proxy for the appropriate DS(s).
.Sh Backing up a pNFS service
Since the data is separated from the metadata, the simple way to back up
a pNFS service is to do so from an NFS client that has the service mounted
If you back up the MDS exported file system(s) on the MDS, you must do it
in such a way that the
namespace extended attributes get backed up.
.Sh Handling of failed mirrored DSs
When a mirrored DS fails, it can be disabled in one of three ways:
.Pp
1 - The MDS detects a problem when trying to do proxy
operations on the DS.
This can take a couple of minutes
after the DS failure or network partitioning occurs.
.Pp
2 - A pNFS client can report an I/O error that occurred for a DS to the MDS in
the arguments for a LayoutReturn operation.
.Pp
3 - The system administrator can run the pnfsdskill(8) command on the MDS
to disable it.
If the system administrator runs pnfsdskill(8) and it fails
with ENXIO (Device not configured), that normally means the DS was already
disabled via #1 or #2.
Since doing this is harmless, once a system
administrator knows that there is a problem with a mirrored DS, running the
command is recommended.
.Pp
Once a system administrator knows that a mirrored DS has malfunctioned
or has been network partitioned, they should do the following as root/su
.Bd -literal -offset indent
# pnfsdskill <mounted-on-path-of-DS>
# umount -N <mounted-on-path-of-DS>
.Ed
.Pp
Note that the <mounted-on-path-of-DS> must be the exact mounted-on path
string used when the DS was mounted on the MDS.
.Pp
Once the mirrored DS has been disabled, the pNFS service should continue to
function, but file updates will only happen on the DS(s)
that have not been disabled.
Assuming two way mirroring, that implies
the one DS of the pair stored in the
extended attribute for the file on the MDS, for files stored on the disabled DS.
.Pp
The next step is to clear the IP address in the
extended attribute on all files on the MDS for the failed DS.
This is done so that, when the disabled DS is repaired and brought back online,
the data files on this DS will not be used, since they may be out of date.
The command that clears the IP address is
.Bd -literal -offset indent
# pnfsdsfile -r nfsv4-data3 yyy.c
yyy.c: nfsv4-data2.home.rick ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000 0.0.0.0 ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
.Ed
replaces nfsv4-data3 with an IPv4 address of 0.0.0.0, so that nfsv4-data3
Normally this will be called within a
command for all regular
files in the exported directory tree and must be done on the MDS.
you will probably also want the
option so that it does not print the results for every file.
If the disabled/repaired DS is nfsv4-data3, the commands done on the MDS
.Bd -literal -offset indent
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} \;
.Ed
There is a problem with the above command if the file found by
is renamed or unlinked before the
command is done on it.
This should normally generate an error message.
A simple unlink is harmless,
but a link/unlink or rename might result in the file not having been processed
To check that all files have their IP addresses set to 0.0.0.0 these
commands can be used (assuming the
.Bd -literal -offset indent
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} \; | sed "/nfsv4-data3/!d"
.Ed
Any line(s) printed require the
Once this is done, the replaced/repaired DS can be brought back online.
It should have empty ds0,...,dsN directories under the top level exported
directory for storage of data files just like it did when first set up.
Mount it on the MDS exactly as you did before disabling it.
For the nfsv4-data3 example, the command would be:
.Bd -literal -offset indent
# mount -t nfs -o nfsv4,minorversion=1,soft,retrans=2 nfsv4-data3:/ /data3
.Ed
Then restart the nfsd to re-enable the DS.
.Bd -literal -offset indent
# /etc/rc.d/nfsd restart
.Ed
Now, new files can be stored on nfsv4-data3,
but files with the IP address zeroed out on the MDS will not yet use the
repaired DS (nfsv4-data3).
The next step is to go through the exported file tree on the MDS and,
files with an IPv4 address of 0.0.0.0 in its extended attribute, copy the file
data to the repaired DS and re-enable use of this mirror for it.
The command for copying the file data for one MDS file is
and it will also normally be used in a
For the example case, the commands on the MDS would be:
.Bd -literal -offset indent
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdscopymr -r /data3 {} \;
.Ed
.Pp
When this completes, the recovery should be complete or at least nearly so.
As noted above, if a link/unlink or rename occurs on a file name while the
is in progress, it may not get copied.
To check for any file(s) not yet copied, the commands are:
.Bd -literal -offset indent
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} \; | sed "/0\.0\.0\.0/!d"
.Ed
If this command prints out any file name(s), these files must
command done on them to complete the recovery.
.Bd -literal -offset indent
# pnfsdscopymr -r /data3 <file-path-reported>
.Ed
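.Pp
When many files are reported, the file names can be extracted from the
pnfsdsfile output and fed back to pnfsdscopymr.
The following is a sketch under stated assumptions:
files_not_copied is a hypothetical helper name, and it assumes that each
output line starts with the file name followed by a colon and whitespace,
as in the sample pnfsdsfile output shown earlier.
File names that themselves contain a colon followed by a space would be
truncated by this filter.

```shell
#!/bin/sh
# Sketch: list the files whose pnfsdsfile output still shows 0.0.0.0.
# files_not_copied is a hypothetical helper. It assumes each input line
# has the form "<file>: <ds> <path> ...", as in the sample output earlier.
files_not_copied() {
    # Whitespace around the address keeps a host such as 10.0.0.0 from
    # matching; the substitution keeps only the text before the colon.
    sed -n '/[[:space:]]0\.0\.0\.0[[:space:]]/s/:[[:space:]].*//p'
}

# Possible use on the MDS, with /data3 taken from the example above:
#   find . -type f -exec pnfsdsfile {} \; | files_not_copied |
#       while IFS= read -r f; do pnfsdscopymr -r /data3 "$f"; done
```

The loop in the comment runs the same pnfsdscopymr command as above, once
for each reported file.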
If this command fails with the error
.Dq pnfsdscopymr: Copymr failed for file <path>: Device not configured
repeatedly, this may be caused by a Read/Write layout that has not
The only way to get rid of such a layout is to restart the
All of these commands are designed to be
done while the pNFS service is running and can be re-run safely.
.Pp
For a more detailed discussion of the setup and management of a pNFS service
.Bd -literal -offset indent
http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
.Ed
command first appeared in
Since the MDS cannot be mirrored, it is a single point of failure just
For non-mirrored configurations, all FreeBSD systems used in the service
are single points of failure.